WTF? When will people finally learn to read the spec and implement things based on the spec and test things based on the spec instead of just making up themselves what a URL is or what HTML is or what an email address is or what a MIME body is or ...
There are supposed URIs in that list that aren't actually URIs, there are supposed non-URIs in that list that are actually URIs, and most of the candidate regexes obviously must have come from some creative minds and not from people who should be writing software. If you just make shit up instead of referring to what the spec says, you should urgently find yourself a new profession; this kind of crap has been hurting us long enough.
(Also, I do not just mean the numeric RFC1918 IPv4 URIs, which obviously are valid URIs but have been rejected intentionally nonetheless - even though that's idiotic as well, of course, given that (a) nothing prevents anyone from putting those addresses in the DNS and (b) those are actually perfectly fine URIs that people use, and I don't see why people should not want to shorten some class of the URIs that they use.)
By the way, the grammar in the RFC is machine readable, and it's regular. So you can just write a script that transforms that grammar into a regex that is guaranteed to reflect exactly what the spec says.
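To illustrate what such a bottom-up composition looks like, here is a hand-transcribed sketch of a few rules from the RFC 3986 ABNF (not a full grammar-to-regex compiler: userinfo, IP-address literals, and fragments are omitted, and the fragment names are mine):

```python
import re

# Each variable transcribes one ABNF rule from RFC 3986 into a regex
# fragment; the full URI pattern is then composed from the pieces,
# exactly the way a mechanical grammar-to-regex script would do it.
unreserved   = r"[A-Za-z0-9\-._~]"
pct_encoded  = r"%[0-9A-Fa-f]{2}"
sub_delims   = r"[!$&'()*+,;=]"
pchar        = rf"(?:{unreserved}|{pct_encoded}|{sub_delims}|[:@])"
scheme       = r"[A-Za-z][A-Za-z0-9+.\-]*"
reg_name     = rf"(?:{unreserved}|{pct_encoded}|{sub_delims})*"
port         = r"[0-9]*"
authority    = rf"{reg_name}(?::{port})?"   # userinfo and IP literals omitted
path_abempty = rf"(?:/{pchar}*)*"
query        = rf"(?:{pchar}|[/?])*"

URI_RE = re.compile(rf"^{scheme}://{authority}{path_abempty}(?:\?{query})?$")

print(bool(URI_RE.match("http://example.com/a%20b?x=1")))  # True
print(bool(URI_RE.match("http://exa mple.com/")))          # False
```

Because every fragment is copied from a grammar rule rather than invented, each accepted or rejected input can be justified by pointing at the corresponding production in the spec.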
This entire rant begins with the premise that the spec matches the real world implementation. Given that one of your examples is "what an email address is", I submit that expecting reality and the spec to match is a beautiful dream, from which a developer should awaken before trying to implement such a scheme.
Except there is no "the real world implementation". There are only lots of implementations that are incompatible with the spec as well as with each other. Inventing yet another variant of your own that also isn't going to be compatible with anything is not going to help anyone. Deviating from the formal spec because everyone practically agrees on how to do things, albeit differently than the formal spec says, is something quite different from making shit up, and actually tends to be even harder than building things to spec: there tends to be no easy reference to look things up in, so instead you might have to look into the guts of existing implementations and talk to the people who built them to figure out what to do. And you would normally start with an implementation according to spec anyhow, and only add special cases for non-normative conventions later on.
Also, what exactly is the problem with email addresses? There is an unambiguous grammar for them in the RFC, and there are lots of implementations of exactly what the spec specifies. Just because some web kiddies have made up some shit about email addresses and use that for validation doesn't mean that postfix, qmail, or exim are written by morons.
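For concreteness, the common dot-atom form of an address can be transcribed from the RFC 5322 `addr-spec` grammar in the same mechanical way (this sketch covers only the dot-atom local part and DNS-name domains; quoted local parts like `"a b"@example.com`, comments, and domain literals are intentionally left out):

```python
import re

# atext per RFC 5322: letters, digits, and these specials (no dot, no space)
atext     = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~\-]"
dot_atom  = rf"{atext}+(?:\.{atext}+)*"        # dot-separated runs of atext
let_dig   = r"[A-Za-z0-9]"
label     = rf"{let_dig}(?:[A-Za-z0-9\-]*{let_dig})?"  # DNS label shape
domain    = rf"{label}(?:\.{label})*"

addr_spec = re.compile(rf"^{dot_atom}@{domain}$")

print(bool(addr_spec.match("o'brien+tag@sub.example.com")))  # True
print(bool(addr_spec.match("no spaces@example.com")))        # False
```

Note how the transcription accepts addresses like `o'brien+tag@...` that many ad-hoc "validators" wrongly reject, which is exactly the point: the grammar, not intuition, decides.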
> Deviating from the formal spec because everyone practically agrees on how to do things, albeit differently than the formal spec says, is something quite different from making shit up, and actually tends to be even harder than building things to spec: there tends to be no easy reference to look things up in, so instead you might have to look into the guts of existing implementations and talk to the people who built them to figure out what to do. And you would normally start with an implementation according to spec anyhow, and only add special cases for non-normative conventions later on.
The goal was to come up with a good regular expression to validate URLs in user input, and not to match any URL that browsers can handle (as per the URL Standard). I am fully aware that this is not the same as what any spec says.
> By the way, the grammar in the RFC is machine readable, and it's regular.
The RFC does not reflect reality either (which, ironically, is what you seem to be complaining about). If you’re looking for a spec-compliant solution, the spec to follow is http://url.spec.whatwg.org/.
> If you just make shit up instead of referring to what the spec says, you urgently should find yourself a new profession, this kind of crap has been hurting us long enough.
I am aware of, and am a contributor to, the URL Standard: http://url.spec.whatwg.org/. That doesn’t mean there aren’t any situations in which I need/want to blacklist some technically valid URL constructs.
> The goal was to come up with a good regular expression to validate URLs in user input, and not to match any URL that browsers can handle (as per the URL Standard).
WTF? What is "validation" supposed to be good for if it doesn't actually validate what it claims to? Exactly this mentality of making up your own stuff instead of implementing standards is what causes all these interoperability nightmares! If you claim to accept URLs, then accept URLs, all URLs, and reject non-URLs, all non-URLs. There is no reason to do anything else, other than laziness maybe, and even then you are lying if you claim that you are validating URLs - you are not. If you say you accept a URL, and I paste a URL, your software is broken if it then rejects that URL as invalid.
This does not apply to intentionally selecting only a subset of URLs that are applicable in a given context, of course - if the URL is to be retrieved by an HTTP client, it's perfectly fine to reject non-HTTP URLs, but any kind of "nobody is going to use that anyhow" is not a good reason. In particular, that kind of rejection most certainly should not happen in the parser, as that is likely to give inconsistent results - the parser usually works at the wrong level of abstraction.
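The separation I mean looks roughly like this (a sketch using Python's `urllib.parse`; the function name and policy are mine, for illustration): parse generically first, then apply the context-specific policy to the parsed result, so the parser itself stays policy-free.

```python
from urllib.parse import urlsplit

def accept_for_http_client(url: str) -> bool:
    """Generic parsing first, context-specific policy second."""
    parts = urlsplit(url)          # the parser knows nothing about policy
    if parts.scheme not in ("http", "https"):
        return False               # context: an HTTP client cannot fetch it
    if not parts.netloc:
        return False               # http(s) URLs need an authority component
    return True

print(accept_for_http_client("https://example.com/x"))  # True
print(accept_for_http_client("mailto:me@example.com"))  # False
```

A `mailto:` address is rejected here not because it is an invalid URL (it isn't) but because it is out of scope for this context, and that decision is made on the parsed structure, not baked into the parsing step.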
> The RFC does not reflect reality either (which, ironically, is what you seem to be complaining about).
A spec for a formal language that doesn't contain a grammar? The world is getting crazier every day ...
> That doesn’t mean there aren’t any situations in which I need/want to blacklist some technically valid URL constructs.
Yeah, but blocking IPv4 literals of certain address ranges seems like a stupid idea nonetheless. Good software should accept any input that is meaningful to it and that is not a security problem. And as I said above, such rejection most certainly should not happen in the parser.
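If someone insists on such a policy anyway, the place for it is after parsing, as a check on the extracted host - not as a pattern wired into the URL regex. A sketch using Python's `ipaddress` module (function name mine; `is_private` covers RFC 1918 ranges among others):

```python
import ipaddress
from urllib.parse import urlsplit

def host_is_private_literal(url: str) -> bool:
    """Post-parse policy check: is the host an IP-address literal
    in a private range (e.g. the RFC 1918 IPv4 blocks)?"""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False               # a DNS name, not an address literal
    return addr.is_private

print(host_is_private_literal("http://192.168.0.1/admin"))  # True
print(host_is_private_literal("http://example.com/"))       # False
```

Note what this check cannot do, which is exactly point (a) above: a public DNS name can still resolve to 10.0.0.5, so filtering literals in the URL buys you nothing security-wise.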
> Doesn’t matter – if there’s a discrepancy between what a document says and what implementors do, that document is but a work of fiction.
Yes and no. When there is a de-facto standard that just doesn't happen to match the published standard, yeah, sure. Otherwise, bug compatibility is a terrible idea and should be avoided as much as possible, many security problems have resulted from that.
> This is not a parser.
Well, even worse then. Manually integrating semantics from higher layers into parsing machinery (which it is, never mind the fact that you don't capture any of the syntactic elements within that parsing automaton) is both extremely error-prone and terrible for maintainability.
edit:
For the fun of it, I just had a look at the "winning entry" (diegoperini). Unsurprisingly, it's broken. It was trivial to find cases that it will reject that you most certainly don't intend to reject. For exactly the reasons pointed out above.