On the penultimate paragraph, and somewhat of a tangent:
Wow, I'd completely forgotten that you could have Unicode in ___domain names, and I suspect a lot of people don't think about it very much either. In my limited experience, even Chinese-only websites rarely stray from normal alphanumeric domains, even though the people visiting those sites could easily type out URLs with Chinese glyphs.
Perhaps I'm missing something here, but it seems that with good alphanumeric domains becoming less available, cool/clever/classy Unicode domains could be a viable alternative, given an appropriate purpose -- Google would probably not want one -- and a techie enough audience. When [for which sites?] and how often do people actually type URLs?
Example: a friend of mine did a cheeky web branding project a while ago named "Heart Star Heart"... ♥★♥.com would have been perfect.
EDIT: I should probably do more research on this myself, but it looks like there's some mysterious isomorphism between Unicode domains and "normal" domains. Firefox renders U+272A in http://✪df.ws/e7m correctly but changes its text to http://xn--df-oiy.ws/e7m and when I access ♥★♥.com my ISP complains that xn--p3hxmb.com doesn't exist. Anybody know what the isomorphism actually is?
I believe it has something to do with security: when browsers first added Unicode URL support, there were issues with hackers and spammers using blank and lookalike Unicode characters to trick people into visiting shady domains.
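As for the actual mapping: it's IDNA/Punycode (RFC 3490/3492). Each non-ASCII label gets Punycode-encoded and prefixed with "xn--", which is why both the browser and the resolver show the xn-- form. A minimal sketch using Python's stdlib codecs (the expected outputs are the same xn-- forms quoted above; note the stdlib implements the older IDNA 2003 rules, while current browsers and registries follow IDNA 2008, which rejects symbol characters like ♥ outright):

    # Each non-ASCII ___domain label is Punycode-encoded and prefixed "xn--".
    print("♥★♥".encode("punycode"))          # b'p3hxmb'  (label xn--p3hxmb)
    print("✪df.ws".encode("idna"))           # b'xn--df-oiy.ws'
    print(b"xn--df-oiy.ws".decode("idna"))   # '✪df.ws', it round-trips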
That said, non-ASCII URLs suck because not everyone can type them. Imagine being a tourist in Tokyo trying to look up a restaurant on your laptop, or trying to find the product page for some gadget you bought in China…
Right. As I noted above, it's absolutely not acceptable for some situations, particularly those where you want lost or confused people to look you up. But I can still think of plenty of other situations, and was merely pointing out the disparity between the number of Unicode URLs I've encountered and the number I'd expect to have encountered, given all the possibilities.
It looks like line noise. It's funny, though: I can read regexps better than I can read formal semantics. Having tried once this evening to read http://matt.might.net/papers/might2007diss.pdf, regexps are refreshing :)
What is wrong with URI.parse in the stdlib? The article's regex goes beyond URL validation to pull URL-ish things out of free text (e.g. in "look at http://goo.com/bat, lovely" the trailing comma would actually be part of a valid URL, but the regex tries to detect and drop it). For plain URL validation, though, Ruby's standard library is enough, I believe.
Well, with URI.parse, you can send in an FTP URI, or just "something.com" (which is parsed as just a path). So it's not really what I need from URL validation. If there's something else I'm missing, I'm happy to know.
Edit: And btw, for an industrial-strength Ruby URI library (to replace the standard library), check out Addressable: http://addressable.rubyforge.org/
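For what it's worth, Ruby's stdlib isn't unusual here: Python's urllib.parse is exactly as lenient, which is a decent illustration of why "it parsed without an error" is not the same thing as "it's a valid http(s) URL". A quick sketch for comparison:

    from urllib.parse import urlparse

    # urlparse() never rejects anything: a bare hostname ends up in .path,
    # and any scheme at all is accepted, so parsing alone is not validation.
    print(urlparse("something.com").path)                      # 'something.com'
    print(urlparse("ftp://example.com/pub/file.txt").scheme)   # 'ftp'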
is the lexing part, and then there are other files in the same directory that do other little bits. The whole hyperlinks framework is under a BSD license.
An RFC-822-compliant regex is listed at http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html, but if anything, it's just a strong argument for using real parsing tools. Regexes don't handle recursion and balanced delimiters very well.
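The balanced-delimiter point is easy to demonstrate: RFC 822 allows nested comments in parentheses, and a plain regex can't track nesting depth. A quick Python sketch (the address is a made-up example):

    import re

    # A naive pattern for an RFC 822 comment "(...)": it cannot count
    # nesting, so the outer comment around a nested one never matches.
    comment = re.compile(r"\([^()]*\)")
    print(comment.findall("bob@example.com (work (primary))"))
    # ['(primary)']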
For validating emails, I've settled on /.@./, or if you really want to push for valid emails, /.@[^.]+\../. (Note the lack of anchoring to the beginning or end.) (That, and some limit on length.)
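Here are those two patterns in action, in Python for concreteness (note search rather than match, since they are deliberately unanchored):

    import re

    LOOSE  = re.compile(r".@.")          # something, an @, something
    STRICT = re.compile(r".@[^.]+\..")   # ...plus at least one dot after the @

    for addr in ["bob@example.com", "bob@localhost", "no-at-sign"]:
        print(addr, bool(LOOSE.search(addr)), bool(STRICT.search(addr)))
    # bob@example.com True True
    # bob@localhost True False
    # no-at-sign False False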
The rules are so flipping complicated and so easy to get wrong that you're better off just trying to send a mail and seeing what happens, and asking the recipient to validate reception if you care about the address. Is it really that important to exclude bad emails, at the cost of, say, blocking email addresses from the UK, as your regex seems to do? Even "validating" for sheer user error is only useful if you get it right.
I like soft validation for emails. "This doesn't look like an email address, verify you typed everything correctly, and resubmit". That way you handle legit typos, without hassling people who have weird emails (gmail plus signs and such).
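A sketch of that flow, with a hypothetical helper name: warn once on anything the loose pattern rejects, but accept the address if the user resubmits it unchanged.

    import re

    LOOSE = re.compile(r".@.")  # the permissive pattern from the comment above

    def soft_validate(addr, user_confirmed=False):
        """Hypothetical helper: return a warning string, or None to accept."""
        if user_confirmed or LOOSE.search(addr):
            return None
        return "This doesn't look like an email address; please check it and resubmit."

    print(soft_validate("bob.example.com"))           # warns: no @ at all
    print(soft_validate("weird+tag@example.com"))     # None: odd-looking but accepted
    print(soft_validate("bob.example.com", True))     # None: the user insisted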
What would be the equivalent in Python to the :punct: character class operator? I don't think the re module supports those. I guess they'd have to be spelled out pretty much?
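The stdlib re module indeed has no POSIX classes. The usual workarounds are to build the class out of string.punctuation (ASCII-only, which is what [:punct:] means in the C locale), or to use the third-party regex module, which does accept [[:punct:]]. A minimal sketch of the first option:

    import re
    import string

    # Escape string.punctuation so every character is literal inside [...]
    PUNCT = "[" + re.escape(string.punctuation) + "]"

    print(re.findall(PUNCT, "look at http://goo.com/bat, lovely"))
    # [':', '/', '/', '.', '/', ',']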
It doesn't work with standard permalinks that feature hyphens in the URL, and none of his examples show links with hyphens. Most blogs out there (WordPress, for instance) use hyphens in their permalinks.
There are problems I've seen with using regex strings and expecting them to work in all cases on all regex engines, which is why I tend to stick with PCRE ~ http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Express... ~ a point in favour of the Gruber example.
"... The pattern is also liberal about Unicode glyphs within the URL ..."
PCRE supports Unicode but it's not switched on by default ~ http://www.pcre.org/pcre.txt