Haven't tried it since I'd rather not sign up for an account. But regarding this example:
Minimum eight characters, at least one letter, one number and one special character
>> ^(?=.*[A-Za-z])(?=.*d)(?=.*[@$!%*#?&])[A-Za-zd@$!%*#?&]{8,}$
I assume the "d" characters here are supposed to be "\d" but the backslashes wandered off somewhere. Not sure if that's just in the example or the actual output from the AI. Some of those special characters like "*" and "$" need a backslash too.
Also, this pattern will not allow any character which is not a letter, number, or in that very finite list of special characters - so no spaces, no carets, no quote marks or apostrophes, etc. Not allowing certain printable characters in a password is a bad practice (I won't complain too much if you forbid the non-printable ones).
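For what it's worth, here is a quick Python sketch of both points, assuming the dropped backslashes were meant to be `\d`. The names `RESTORED`, `AS_SHOWN` and `accepts` are made up for illustration. It compares the pattern as displayed with the restored one, and shows that spaces and apostrophes get rejected.

```python
import re

# The pattern with "\d" restored (an assumption about what the tool meant).
RESTORED = re.compile(
    r"^(?=.*[A-Za-z])(?=.*\d)(?=.*[@$!%*#?&])[A-Za-z\d@$!%*#?&]{8,}$"
)

# The pattern exactly as displayed: here "d" is just the literal letter d.
AS_SHOWN = re.compile(
    r"^(?=.*[A-Za-z])(?=.*d)(?=.*[@$!%*#?&])[A-Za-zd@$!%*#?&]{8,}$"
)


def accepts(pattern, candidate):
    """Return True if the candidate password matches the pattern."""
    return pattern.match(candidate) is not None


# Restored pattern: works for the narrow alphabet it allows.
print(accepts(RESTORED, "Passw0rd!"))         # letter, digit, special
print(accepts(RESTORED, "correct horse 1!"))  # rejected: contains spaces
print(accepts(RESTORED, "it's-l0ng!"))        # rejected: apostrophe, hyphen

# Displayed pattern: digits are not allowed at all, a literal "d" is required.
print(accepts(AS_SHOWN, "Passw0rd!"))
print(accepts(AS_SHOWN, "Password!"))
```

As displayed, the pattern accepts `Password!` (no digit at all) and rejects `Passw0rd!`, which is the opposite of the stated requirement.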
All words starting with the letter "n" and ending with "g", case insensitive
>> /^nw*g$/i
Yeah, seems like the backslash got gobbled up by some sort of parsing before the result gets displayed on the site. Should be `\w` here.
Regarding your other point, characters like `*` won't need escaping within character class. Not sure why only that particular set of characters is deemed "special characters": why not `-`, `=`, parentheses, curly braces, etc?
Also, the site should specify which regex flavor is being generated, I'd guess JS.
All said, it is good to see tools to help with regex, as long as the solutions are tested to make sure they fit the user's particular problem.
> Regarding your other point, characters like `*` won't need escaping within character class.
Really? Huh. I've been escaping them anyway, and I think I'll continue to do so just to avoid ambiguity when reading patterns.
I was about to end this post with "an extra backslash never hurt anybody" but then I realized that this is HN and writing that would surely prompt someone to reply with some bizarre case where an extra backslash caused a cascade of failures which cost a company fifteen million dollars and/or killed 43 people.
Haha, you are right to be skeptical. For example, `GNU grep` doesn't treat backslash as special within character class. You need to place metacharacters like `-`, `^`, `]`, etc in particular spots to match them literally.
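A small Python sketch of the two styles, with the caveat that this uses Python's `re` module rather than GNU grep, so a backslash inside a class is still an escape here. It is only meant to show that the unescaped, escaped and positional spellings can all match the same literal characters. The constant names are made up.

```python
import re

# Inside a character class, "*" and "$" are already literal in Python's re.
UNESCAPED = re.compile(r"[*$]+")

# Escaping them anyway is accepted and means the same thing here.
ESCAPED = re.compile(r"[\*\$]+")

# Positional style with no backslashes: "]" first, "^" not first, "-" last.
POSITIONAL = re.compile(r"[]^-]+")

print(UNESCAPED.fullmatch("*$$*") is not None)
print(ESCAPED.fullmatch("*$$*") is not None)
print(POSITIONAL.fullmatch("^]-") is not None)
```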
Start of line “^”, then an “n”, then any word characters zero or more times (\w*), then a “g”, then the end of the line “$”. The surrounding slashes are convention for certain regex languages, and the last “i” turns on the case insensitive flag.
Personally I’d write this as “(?i)\bn[a-z]*g\b”, but there are many approaches for many different situations.
Because I struggled to understand what you meant, I'd like to rephrase it:
It matches only if the regex is applied to a singular word. It's not going to match if there is a sentence or any apostrophe etc, which is implied to be valid input because it supposedly matches "all words".
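To make the difference concrete, a rough Python sketch comparing the anchored pattern (with `\w` restored) against the word-boundary version. The sample sentence and the constant names are made up.

```python
import re

# Anchored version: the whole input has to be one word.
ANCHORED = re.compile(r"^n\w*g$", re.IGNORECASE)

# Word-boundary version: finds matching words inside longer text.
BOUNDED = re.compile(r"(?i)\bn[a-z]*g\b")

SENTENCE = "Nothing beats napping during a long meeting"

print(ANCHORED.findall("Nothing"))  # single word: matches
print(ANCHORED.findall(SENTENCE))   # whole sentence: no match
print(BOUNDED.findall(SENTENCE))    # picks out the n...g words
```

Note the two also disagree on what a "word" is: `\w` lets digits and underscores through, so `n0thing` matches the anchored pattern but not the `[a-z]` one.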
You can always use tools like Regex101[0] to verify if they actually work or not. I have tried a few generated by the AI, and it seems to do the job most of the time.
If you have an edge case you know you want, you could add its description to the input for the AI.
If you are afraid of unintended matches, that's a different problem, which you might also get writing the regex yourself!
The solution, I reckon, is to create (maybe even via the same AI?) a large list of matches and manually look through it for unintended ones.
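Something like the following would do as a starting point, sketched in Python. The helper name `check_regex` and the sample word lists are hypothetical; the idea is just to keep one list of strings that must match and one of strings that must not, and report anything on the wrong side.

```python
import re


def check_regex(pattern, should_match, should_not_match, flags=0):
    """Return (missed, unintended) for a pattern against two example lists."""
    compiled = re.compile(pattern, flags)
    # Strings we wanted to match but did not.
    missed = [s for s in should_match if compiled.fullmatch(s) is None]
    # Strings we wanted to reject but matched anyway.
    unintended = [s for s in should_not_match if compiled.fullmatch(s) is not None]
    return missed, unintended


missed, unintended = check_regex(
    r"n\w*g",
    should_match=["nothing", "Nag", "ng"],
    should_not_match=["going", "night", "n-g", "n_g"],
    flags=re.IGNORECASE,
)
print("missed:", missed)
print("unintended:", unintended)
```

Here the underscore case slips through because `\w` includes `_`, which is exactly the kind of edge case that is easy to miss by eye.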
Depends very heavily on the task and the string implementation involved, but my prior is that you should always bet on regex for anything more complex than "ForEach chr in str: doSomething(chr)".
Modern regex implementations are their own language interpreters/compilers: they implement their own string data structures, they compile your regex into specialized bytecode (different from the hosting language's), and they have tons of special-case checks for fast paths (e.g. capturing groups make regexes much more complicated than they need to be; a regex without capturing has the option to run significantly faster).
A random search just turned up this 2016 SO question[1], where a compiled regex was faster than string.contains (!). Regex is just insanely optimized. It reminds me of C++: ugly and badly designed (or undesigned), but it can go fast.
I think the worst case here is it writing a regex that mostly works but fails for some edge cases that you don't think to test but will encounter in production.
OK, but back to the regex. That's just pattern matching, and ML/AI has been shown to be amazing at pattern matching (albeit underwhelming at most other tasks). I would trust an AI/ML generated regex, but only because such structures are easily testable. This tool from the University of Trieste has probably been around for 10+ years--probably only 1e3 parameters, not 1e10.
The tool you linked uses a GA (not a NN) to find a short regex that gives correct answers to whatever testcases you provide. If your test cases are correct, it is guaranteed to produce correct output.
GPT-3 is known to be confidently wrong at things like basic arithmetic, so I really wouldn't trust it in scenarios like this. Maybe one day I'll be able to trust NN-generated code without testing it first, but we're not there yet.
It's a form of defense in depth. Ideally your application has specific test cases, property-based testing, linting, and whatever other forms of static and dynamic analysis you can think of. But if your code is obfuscated and/or you don't have a clear mental model of what it does, that adds a layer of uncertainty and could potentially hurt debuggability.
Linked lists, hash functions, etc are mostly solved problems with clearly defined interfaces, built and tested for edge cases by humans. Each regex is a special snowflake.
Totally! And this is one of the worst kinds of code to generate with AI given how often regexes are write-only code. Personally, for any important regex I'm either going to have good unit tests, an extended-mode regex with comments, or both. Which I'm sure this AI is not going to do.
So to me this mainly looks like a way for people who don't understand something to put that ignorance into the codebase, setting traps for colleagues down the road. That's not a new experience for me, but this does seem likely to make that easier and more fun, two things I don't think dangerous code needs.
Not an apples-to-apples comparison, but if you are trying to build regexes with the help of software, my daily driver is RegexBuddy [1]. It not only makes building and testing regular expressions easy, but also covers pretty much all the regex variants in the wild. (And it comes with an excellent help file.)
The same author also makes a variant called RegexMagic [2], which probably has a premise closer to AutoRegex's (minus the NN part, perhaps), as it is designed for making regexes without much knowledge of regex, but I don't know how well it works as I haven't used it much...
Hey! I’m the creator of www.autoregex.xyz (@gd3kr on twitter). I originally built it as a small side project in a couple of days, and I was absolutely not expecting a response this massive. I realise there are concerns about the email and password sign-up. I built it in an hour-ish with firebase auth as a temporary solution for capping the sudden surge in GPT3 requests to the server after the twitter post gained traction. I’m working on a better approach that doesn't require creating an account. Otherwise, all suggestions are absolutely welcome. Please tell me how I can make this a better experience for everyone.
So I was talking to a friend about this, and he thought this was a parody because of all the obviously incorrect results you were highlighting in the Twitter thread. Other people have mentioned the \ escapes going missing, but my friend called out https://twitter.com/gd3kr/status/1545495732265766913 as so hilariously wrong that it couldn't have been anything other than a parody.
Is this really a parody, or is it just another example of people not actually reading GPT-3 output carefully enough to notice it's nonsense?
Hey, no it’s not a parody lmao. I shipped it because I realised while building it that it could have at least some utility. To be fair, the wrong examples were an oversight on my part, but I thought of them more as a genuine benchmark of what GPT3 was capable of; sort of like an experimental feature. I’m working hard on refining the results by fine tuning DaVinci and getting the output as close to ideal as possible. I think there’s a lot of potential there. In the end, even boosting a user’s productivity marginally is a win in my book, and as a lot of people have pointed out, it’s already doing that to some extent.
That seems like a computationally challenging problem. To avoid .+ you would have to include non matching examples and then I don't know how similar to the matches / specific those would have to be.
One probably wants to provide a set of matching and a set of non-matching strings. Then the software would output a regex and some edge-case matching strings and non-matching strings.
This could be built using set operations on deterministic finite automata (dfa). Every regex is equivalent to a dfa. You can now construct automata for every positive and negative example input. Then calculate the union for all positive examples and the union for all negative examples. And finally calculate the difference between the two unions. Convert the resulting automaton back to regex.
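A rough Python sketch of what that construction works out to when each example automaton accepts exactly one literal string. It skips building the automata and operates on the finite languages directly (that shortcut and the function name `regex_from_examples` are my assumptions), since union and difference of such automata are just set union and set difference.

```python
import re


def regex_from_examples(positives, negatives):
    """Regex for (union of positives) minus (union of negatives)."""
    # Each literal example is a one-string language, so the automata
    # operations reduce to plain set operations on the strings.
    language = set(positives) - set(negatives)
    # Converting back to a regex gives an alternation of escaped literals.
    return "|".join(re.escape(s) for s in sorted(language))


pattern = regex_from_examples(["cat", "car", "a.b"], ["car"])
print(pattern)
print(re.fullmatch(pattern, "cat") is not None)
print(re.fullmatch(pattern, "cats") is not None)
```

So with literal examples only, the result matches the positive examples and nothing else. Getting it to generalize would need something extra on top.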
I was thinking of something that could categorize parts of these strings into a “language”, so there are no non-matching strings. It’s hard to specify in a formal way, but by looking at these strings you may see that e.g. […] is a static syntactic element, and a number follows it, and time precedes it. This would be nice to have to browse logs (which these strings are obviously a part of) but instead of scrolling through thousands of rows, see all of the patterns that occur among them at once, and then dig down into a pattern to inspect what happened and when, to improve the “health” of a complex system. Of course if you know all of them in advance, it’s easy to filter by each. But lots of software/apis do not document their output in such detail.
Technically .* is a valid regex for those strings, so the issue here is not only to reverse engineer them, but to do so in a way that's meaningful for the person who has to use it after.
It shouldn't be hard to start with .* and recursively split it in two parts that still match the input strings, but I believe you will end up with matching but useless regexes.
This is a special case of the general problem of program synthesis[1][2][3][4], where the search space of possible programs is all regex strings and the seed driving the synthesis is Input-Output examples.
There's research [5][6] as well as practical tools [7][8][9].
With enough inputs it should end up with something somewhat reasonable for the leading part, but it will never be smart enough to understand that the error message is "arbitrary" and should be matched with e.g. `(.+)`.
JS has String.prototype.replaceAll, which can take a regex with multiple capture groups and output them as separate params to a callback function. This can be used to create a functional DSL which generates the regexes and callbacks.
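The same shape exists outside JS. As a sketch in Python rather than JS (so this swaps in `re.sub` with a function argument in place of `replaceAll`, and the callback receives a match object instead of separate params), with a made-up date-rewriting example:

```python
import re

# Three capture groups: year, month, day.
DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def to_day_first(match):
    """Callback: receives the match and returns the replacement text."""
    year, month, day = match.groups()
    return f"{day}/{month}/{year}"


def rewrite_dates(text):
    """Replace every ISO-style date in text with a day-first form."""
    return DATE.sub(to_day_first, text)


print(rewrite_dates("Released 2022-07-08, patched 2022-07-11."))
```

A DSL along the lines described would generate both the pattern and the callback together, so the group order can't drift out of sync with the code that consumes it.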
I know this comment isn't helpful on its own, but yes this exists. I've seen it before. I just have no idea what it was called or how to find it again.
EDIT: Ah no, sorry. Was thinking of the other way around[0].
RegexBuddy has some limited ability to do this, and the author of that program has a whole separate program called RegexMagic that I believe specializes in exactly this.
It is surprising / not surprising to see the lengths people go to not learn regex. There's not much to it. If you're just starting out, find a good reference, and memorize it. Mastery is another thing entirely, but it always is.
In *nix land, a decent reference is built-in: `man awk` and jump to the "Regular Expressions" section.
This is all true, but I suspect that the issue is mostly self-correcting, since anything other than an extremely vanilla email address is likely to run into an infuriatingly high number of rejections precisely because the software at the other end has taken an overly simplistic view of what is allowed, forcing the owner to change it out of sheer frustration.
I've seen a handful of libraries and DSLs that are intended to replace regex now. It might be interesting to compile a list of them and attempt to compare them.
You’ll still have to fully comprehend the auto-generated regex to make sure that it really does what you want. So the tool may help with coming up with a suitable regex, but doesn’t remove the need for comprehension.
\"object\": \[(.*?)\]
----
The regular expression matches a string that contains "object": followed by a space and an open bracket, then any characters, then a close bracket. The characters between the open and close bracket are captured in a group.
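A quick way to sanity-check that explanation, sketched in Python. The sample text is made up, and the `\"` from the output is written as a plain `"` since the quote needs no escaping in a Python raw string. The greedy variant is included only to show why the `?` after `.*` matters.

```python
import re

# Lazy capture, as in the generated pattern: stop at the first "]".
LAZY = re.compile(r'"object": \[(.*?)\]')

# Greedy capture for comparison: runs on to the last "]" it can find.
GREEDY = re.compile(r'"object": \[(.*)\]')

SAMPLE = '{"object": [1, 2, 3], "other": [4]}'

print(LAZY.search(SAMPLE).group(1))
print(GREEDY.search(SAMPLE).group(1))
```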
Mmm, it produces an expression matching the input? You should add some examples. I'm a noob but AFAIK there are different regex engines out there, so which one is this output for? I remember trying other regex generators in KNIME, and they didn't work.
> This regular expression is used to match an email address. It starts by matching any character that is not a space or the "@" symbol, followed by the "@" symbol, then any characters that are not a space or the "." symbol, followed by a "." character, and finally any characters that are not a space.
"Something I can use to grep the word list on my Mac to cheat at Worldle. My remaining letters are ...., my green letters thus far are .... and my yellows are ..."
I created Melody (https://github.com/yoav-lavi/melody) for the same reasons you've mentioned in the website.
I knew AI would try to replace me one day, but I didn't think it'd be so soon.
This is the kind of use case I’d like to see more of! Less of the p-zombie copywriting neoplagiarism services with pot-luck output, more GPT-as-backend functional apps as productivity multipliers, if that makes sense.
All male English names is just a list. It doesn’t follow a grammar. Its regex would be akin to
(John|Jane|Alice|Bob)
and so on. It’s not a case for regex. In fact, I’ve found success in replacing regex with regular string operators (length, contains substring, doesn’t contain substring, starts with a capital letter, …) of the language at hand, then do final regex passes for whatever is left at the end. It’s infinitely more readable and debuggable. I’ve grown to avoid regex when possible.
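A minimal Python sketch of that style, with plain string operations first and a single small regex pass at the end. Every rule here is hypothetical (an 8-character code that starts with a capital, contains a hyphen, has no spaces and ends in three digits), as is the name `is_valid_code`.

```python
import re

# The one remaining regex pass: three digits at the very end.
TRAILING_DIGITS = re.compile(r"\d{3}\Z")


def is_valid_code(value):
    """Check a made-up code format, mostly with plain string operations."""
    if len(value) != 8:           # length
        return False
    if not value[0].isupper():    # starts with a capital letter
        return False
    if "-" not in value:          # contains substring
        return False
    if " " in value:              # doesn't contain substring
        return False
    # Final regex pass for whatever is left.
    return TRAILING_DIGITS.search(value) is not None


print(is_valid_code("Ab-xy123"))
print(is_valid_code("ab-xy123"))
```

Each early return names the rule that failed, which is where the readability and debuggability win comes from.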
which is similar, but has some interesting differences. That shows it's black-box association. Then I tried "my email address" and this came out, line break included
"An identifier" vs "a Javascript identifier" does work as expected, but "a number" and "a floating point number" don't. "A quoted string" doesn't escape the quotes inside, but if you add "with escaped quotes" it does.
So, it's cute, and might set you on the right track, as long as you study the output a bit.
Excellent. Worked for me. The output can be modified, but this provides a great start. I do regex so seldom that I can't remember the notation for lookahead, for example, which this provides.