On the penultimate paragraph, and somewhat of a tangent:
Wow, I'd completely forgotten that you could have Unicode in ___domain names, and I suspect a lot of people don't think about it very much either. In my limited experience, even Chinese-only websites rarely stray from normal alphanumeric domains, even though the people visiting those sites could easily type out URLs with Chinese glyphs.
Perhaps I'm missing something here, but it seems that with good alphanumeric domains becoming less available, cool/clever/classy Unicode domains could be a viable alternative, given an appropriate purpose -- Google would probably not want one -- and a techie enough audience. When [for which sites?] and how often do people actually type URLs?
Example: a friend of mine did a cheeky web branding project a while ago named "Heart Star Heart"... ♥★♥.com would have been perfect.
EDIT: I should probably do more research on this myself, but it looks like there's some mysterious isomorphism between Unicode domains and "normal" domains. Firefox renders U+272A in http://✪df.ws/e7m correctly but changes its text to http://xn--df-oiy.ws/e7m and when I access ♥★♥.com my ISP complains that xn--p3hxmb.com doesn't exist. Anybody know what the isomorphism actually is?
I believe it has something to do with security: when browsers first added Unicode URL support, there were issues with hackers and spammers using blank and lookalike Unicode characters to trick people into visiting shady domains.
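As for the actual mapping: it's IDNA/Punycode (RFC 3490/3492). Each non-ASCII label gets Punycode-encoded and prefixed with "xn--", which is why both the browser and the resolver show the xn-- form. A minimal sketch using Python's stdlib codecs (the expected outputs are the same xn-- forms quoted above; note the stdlib implements the older IDNA 2003 rules, while current browsers and registries follow IDNA 2008, which rejects symbol characters like ♥ outright):

    # Each non-ASCII ___domain label is Punycode-encoded and prefixed "xn--".
    print("♥★♥".encode("punycode"))          # b'p3hxmb'  (label xn--p3hxmb)
    print("✪df.ws".encode("idna"))           # b'xn--df-oiy.ws'
    print(b"xn--df-oiy.ws".decode("idna"))   # '✪df.ws', it round-trips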
That said, non-ASCII URLs suck because not everyone can type them. Imagine being a tourist in Tokyo trying to look up a restaurant on your laptop, or trying to find the product page for some gadget you bought in China…
Right. As I noted above, it's absolutely not acceptable for some situations, particularly those where you want lost or confused people to look you up. But I can still think of plenty of other situations, and was merely pointing out the disparity between the number of Unicode URLs I've encountered and the number I'd expect to have encountered, given all the possibilities.
It looks like line noise. It's funny, though: I can read regexps better than I can read formal semantics. Having tried once this evening to read http://matt.might.net/papers/might2007diss.pdf, regexps are refreshing :)
What is wrong with URI.parse in the stdlib? The article's regex goes beyond URL validation to pull URL-ish things out of free text (e.g. in "look at http://goo.com/bat, lovely" the trailing comma would actually be part of a valid URL, but the regex tries to detect and drop it). For plain URL validation, though, Ruby's standard library is enough, I believe.
Well, with URI.parse, you can send in an FTP URI, or just "something.com" (which is parsed as just a path). So it's not really what I need from URL validation. If there's something else I'm missing, I'm happy to know.
Edit: And btw, for an industrial-strength Ruby URI library (to replace the standard library), check out Addressable: http://addressable.rubyforge.org/
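For what it's worth, Ruby's stdlib isn't unusual here: Python's urllib.parse is exactly as lenient, which is a decent illustration of why "it parsed without an error" is not the same thing as "it's a valid http(s) URL". A quick sketch for comparison:

    from urllib.parse import urlparse

    # urlparse() never rejects anything: a bare hostname ends up in .path,
    # and any scheme at all is accepted, so parsing alone is not validation.
    print(urlparse("something.com").path)                      # 'something.com'
    print(urlparse("ftp://example.com/pub/file.txt").scheme)   # 'ftp'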
is the lexing part, and then there are other files in the same directory that do other little bits. The whole hyperlinks framework is under a BSD license.
An RFC-822-compliant regex is listed at http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html, but if anything, it's just a strong argument for using real parsing tools. Regexes don't handle recursion and balanced delimiters very well.
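The balanced-delimiter point is easy to demonstrate: RFC 822 allows nested comments in parentheses, and a plain regex can't track nesting depth. A quick Python sketch (the address is a made-up example):

    import re

    # A naive pattern for an RFC 822 comment "(...)": it cannot count
    # nesting, so the outer comment around a nested one never matches.
    comment = re.compile(r"\([^()]*\)")
    print(comment.findall("bob@example.com (work (primary))"))
    # ['(primary)']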
For validating emails, I've settled on /.@./, or if you really want to push for valid emails, /.@[^.]+\../. (Note the lack of anchoring to the beginning or end.) (That, and some limit on length.)
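Here are those two patterns in action, in Python for concreteness (note search rather than match, since they are deliberately unanchored):

    import re

    LOOSE  = re.compile(r".@.")          # something, an @, something
    STRICT = re.compile(r".@[^.]+\..")   # ...plus at least one dot after the @

    for addr in ["bob@example.com", "bob@localhost", "no-at-sign"]:
        print(addr, bool(LOOSE.search(addr)), bool(STRICT.search(addr)))
    # bob@example.com True True
    # bob@localhost True False
    # no-at-sign False False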
The rules are so flipping complicated and so easy to get wrong that you're better off just trying to send a mail and seeing what happens, and asking the recipient to validate reception if you care about the address. Is it really that important to exclude bad emails, at the cost of, say, blocking email addresses from the UK, as your regex seems to do? Even "validating" for sheer user error is only useful if you get it right.
I like soft validation for emails. "This doesn't look like an email address, verify you typed everything correctly, and resubmit". That way you handle legit typos, without hassling people who have weird emails (gmail plus signs and such).
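A sketch of that flow, with a hypothetical helper name: warn once on anything the loose pattern rejects, but accept the address if the user resubmits it unchanged.

    import re

    LOOSE = re.compile(r".@.")  # the permissive pattern from the comment above

    def soft_validate(addr, user_confirmed=False):
        """Hypothetical helper: return a warning string, or None to accept."""
        if user_confirmed or LOOSE.search(addr):
            return None
        return "This doesn't look like an email address; please check it and resubmit."

    print(soft_validate("bob.example.com"))           # warns: no @ at all
    print(soft_validate("weird+tag@example.com"))     # None: odd-looking but accepted
    print(soft_validate("bob.example.com", True))     # None: the user insisted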
What would be the equivalent in Python to the :punct: character class operator? I don't think the re module supports those. I guess they'd have to be spelled out pretty much?
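The stdlib re module indeed has no POSIX classes. The usual workarounds are to build the class out of string.punctuation (ASCII-only, which is what [:punct:] means in the C locale), or to use the third-party regex module, which does accept [[:punct:]]. A minimal sketch of the first option:

    import re
    import string

    # Escape string.punctuation so every character is literal inside [...]
    PUNCT = "[" + re.escape(string.punctuation) + "]"

    print(re.findall(PUNCT, "look at http://goo.com/bat, lovely"))
    # [':', '/', '/', '.', '/', ',']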
It doesn't work with standard permalinks that feature hyphens in the URL, and none of his examples show links with hyphens. Most blogs out there (WordPress, for instance) use hyphens in their permalinks.
There are problems I've seen with using regex strings and expecting them to work in all cases on all regex engines, which is why I tend to stick with PCRE ~ http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Express... ~ a point in favour of the Gruber example.
"... The pattern is also liberal about Unicode glyphs within the URL ..."
PCRE supports Unicode but it's not switched on by default ~ http://www.pcre.org/pcre.txt