This is why I always do: s/^\s+//; s/\s+$//; Instead of: s/^\s+|\s+$//; Weirdly,...

burntsushi · on July 20, 2016

You can see in my other post that this doesn't always work. For example, Python's regex engine chokes on `\s+$` if you use `re.search`, but works fine if you use `re.match`.

It's likely that Perl has an optimization for `\s+$` but not `^\s+|\s+$` (the former regex only ever matches at the end of the string, which is amenable to optimizations).

mwpmaybe · on July 20, 2016

I did see your other post, and upvoted it. This rule of thumb has served me well between different regex dialects and implementations, but it's not surprising that there are some specific cases that are "broken" for lack of a better word.

I haven't done much Python but the documentation for re.search() and re.match() is very clear: use search to find an expression anywhere in a string, use match to find an expression at the beginning of a string. It appears to ignore anchors in both cases? Left undetermined then is how to anchor an expression at the end of a string. You say re.match() works, but this is pretty confusingly described, and it's easy to see how the ambiguity can lead to problems for even the most experienced programmers.

burntsushi · on July 20, 2016

Anchors aren't ignored. For re.match, `^abc$` is equivalent to `abc`, so the anchors are just redundant. (N.B. `^^abc$$` will match the same set of strings as `^abc$`.) For re.search, `^abc$` is equivalent to `re.match(^abc$)` but `abc` is equivalent to `re.match(.STAR?abc.STAR?)`.

But yes, this is a very subtle semantic and different regex engines handle it differently. My favorite approach is to always use `.STAR?theregex.STAR?` and don't provide any additional methods. Instead, if the user wants to match from the beginning to the end, they just need to use the well known anchors. (Go's regexp package does this, and Rust borrowed that API choice as well.)

chrismorgan · on July 20, 2016

Correction: re.match anchors at the start, but not the end, so `^abc$` is equivalent to `abc$` but not `abc`.