AutoRegex (autoregex.xyz)
247 points by fbuilesv on July 9, 2022 | 109 comments



Haven't tried it since I'd rather not sign up for an account. But regarding this example:

   Minimum eight characters, at least one letter, one number and one special character
    >> ^(?=.*[A-Za-z])(?=.*d)(?=.*[@$!%*#?&])[A-Za-zd@$!%*#?&]{8,}$
I assume the "d" characters here are supposed to be "\d" but the backslashes wandered off somewhere. Not sure if that's just in the example or the actual output from the AI. Some of those special characters like "*" and "$" need a backslash too.

Also, this pattern will not allow any character which is not a letter, number, or in that very finite list of special characters - so no spaces, no carets, no quote marks or apostrophes, etc. Not allowing certain printable characters in a password is a bad practice (I won't complain too much if you forbid the non-printable ones).
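Both points above are easy to check. A minimal Python sketch, assuming the missing backslashes really were meant to be `\d` (the sample passwords are made up for illustration, and the site's real output may differ from the displayed example):

```python
import re

# Pattern exactly as displayed on the site (backslashes missing).
AS_SHOWN = re.compile(r"^(?=.*[A-Za-z])(?=.*d)(?=.*[@$!%*#?&])[A-Za-zd@$!%*#?&]{8,}$")
# Same pattern with the presumed "\d" restored.
RESTORED = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)(?=.*[@$!%*#?&])[A-Za-z\d@$!%*#?&]{8,}$")


def accepts(pattern, password):
    """True if the compiled pattern matches the password."""
    return pattern.search(password) is not None


# Hypothetical sample passwords, chosen only for illustration.
samples = ["Passw0rd!", "Password!", "Pass w0rd!", "Passw0rd^"]
for pw in samples:
    print(f"{pw!r}: as shown={accepts(AS_SHOWN, pw)}, restored={accepts(RESTORED, pw)}")
```

The displayed version accepts "Password!" (it only requires a literal "d") and rejects "Passw0rd!", while the restored version does the reverse. Neither accepts a space or a caret, which is the second complaint.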


    All words starting with the letter "n" and ending with "g", case insensitive
    >> /^nw*g$/i
Yeah, seems like backslash got gobbled up in some sort of parsing before what gets displayed on the site. Should be `\w` here.

Regarding your other point, characters like `*` won't need escaping within a character class. Not sure why only that particular set of characters is deemed "special characters" - why not `-`, `=`, parentheses, curly braces, etc.?

Also, the site should specify which regex flavor is being generated, I'd guess JS.

All said, it is good to see tools to help with regex, as long as the solutions will be tested to make sure it fits their particular problem.


> Regarding your other point, characters like `*` won't need escaping within character class.

Really? Huh. I've been escaping them anyway, and I think I'll continue to do so just to avoid ambiguity when reading patterns.

I was about to end this post with "an extra backslash never hurt anybody" but then I realized that this is HN and writing that would surely prompt someone to reply with some bizarre case where an extra backslash caused a cascade of failures which cost a company fifteen million dollars and/or killed 43 people.


Obligatory example where the extra backslash makes a difference:

    $ printf '%s\n' 'asterisk: *' 'backslash: \' | grep '[*]'
    asterisk: *
    $ printf '%s\n' 'asterisk: *' 'backslash: \' | grep '[\*]'
    asterisk: *
    backslash: \
It depends on the regex flavour though (doesn't happen with Python, JavaScript, or Perl for example).
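For comparison, here is the same experiment in Python, where the backslash inside a character class is still an escape character. This is only a sketch of the Python behaviour and says nothing about other flavours:

```python
import re

lines = ["asterisk: *", "backslash: \\"]

# In Python's re module a backslash inside a character class still escapes
# the next character, so both classes below contain only the asterisk.
plain = [line for line in lines if re.search(r"[*]", line)]
escaped = [line for line in lines if re.search(r"[\*]", line)]

print(plain)
print(escaped)
```

Both lists contain only the asterisk line, unlike the grep run above.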


Haha, you are right to be skeptical. For example, `GNU grep` doesn't treat backslash as special within a character class. You need to place metacharacters like `-`, `^`, `]`, etc. in particular spots to match them literally.


Serious answer: some people use character classes as a more readable form of escape, e.g. [.]

So escaping inside would be extra redundant.


That regex doesn't do what the description above it says, with or without a backslash.


How so?

Start of line “^”, then an “n”, then any word characters zero or more times (\w*), then a “g”, then the end of the line “$”. The surrounding slashes are convention for certain regex languages, and the last “i” turns on the case insensitive flag.

Personally I’d write this as “(?i)\bn[a-z]*g\b” but there’s many approaches for many different situations.


I assume mostly using start of line instead of \b, and expecting w only to be zero+ times in the middle.


Because I struggled to understand what you meant, I'd like to rephrase it:

It matches only if the regex is applied to a single word. It's not going to match if the input is a sentence or contains an apostrophe etc., which is implied to be valid input because it supposedly matches "all words".
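To make the difference concrete, a small Python sketch comparing the anchored form (with `\w` restored) against the word-boundary form suggested earlier. The sample sentence is made up:

```python
import re

text = "Nothing new: nagging, ng, sing"

# Anchored form from the site example (with \w restored): the whole input
# has to be a single word.
anchored = re.compile(r"^n\w*g$", re.IGNORECASE)
# Word-boundary form: finds such words inside longer text.
bounded = re.compile(r"\bn[a-z]*g\b", re.IGNORECASE)

print(anchored.search(text))        # None: the input is a sentence
print(anchored.search("Nothing"))   # a match object: the input is one word
print(bounded.findall(text))
```

The anchored pattern finds nothing in the sentence, while the bounded one picks out "Nothing", "nagging" and "ng".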


That’s easy to misunderstand. And if humans have trouble with it, then I won’t blame the poor AI.

Getting an AI to ask clarifying questions would be useful, of course…


I guess it doesn't work, for example, when words contain dash(es) or apostrophe(s), which might exist (for a single indivisible word) in some languages.


No, you don't need to escape these special characters if they're in [].

For JS (which is used here) or Python at least. Some other implementations may vary.


I don't trust regular expressions that I wrote, let alone some doped up parameter sniffing AI.


You can always use tools like Regex101[0] to verify if they actually work or not. I have tried a few generated by the AI, and it seems to do the job most of the time.

[0]: https://regex101.com/


You still could have edge cases you don't want or want


If you have an edge case you know you want, you could add its description to the input of the AI.

If you are afraid of unintended matches, that's a different problem, which you might also get writing the regex yourself!

The solution, I reckon, is to create (maybe even via the same AI?) a large list of matches and manually look through it for unintended ones.
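One way to do that check without eyeballing everything is to keep two lists, strings that should match and strings that should not, and report every disagreement. A rough Python sketch; the helper name, the candidate pattern and the examples are all hypothetical:

```python
import re


def find_mismatches(pattern, should_match, should_not_match):
    """Return (string, expected) pairs where the regex disagrees with the expectation."""
    compiled = re.compile(pattern)
    mismatches = []
    for s in should_match:
        if compiled.fullmatch(s) is None:
            mismatches.append((s, True))
    for s in should_not_match:
        if compiled.fullmatch(s) is not None:
            mismatches.append((s, False))
    return mismatches


# Hypothetical candidate and examples, for illustration only.
candidate = r"\d{2}:\d{2}"
report = find_mismatches(
    candidate,
    should_match=["14:51", "00:00"],
    should_not_match=["1451", "99:99", "14:5"],
)
print(report)
```

Here the report flags "99:99" as an unintended match, which is exactly the kind of case that is easy to miss by hand.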


or just if clause it - why spend 1 hour chasing down regex to save yourself from writing 3 lines of code.


Because application performance matters more than developer performance.


This varies a lot.

Reading a 100 MB file 10 times a day vs. 100 times a day is perfectly acceptable in many scenarios.


True, but in a fast-changing world, someone with slow data integrations might be leaving money on the table.


I thought Regex was typically less performant than string operations?


Depends. If you need to go through a 100MB file, you want to do it once, not twice.


Depends very heavily on the task and the string implementation involved, but my prior is that you should always bet on regex for anything more complex than "ForEach chr in str: doSomething(chr)".

Modern regex implementations are their own language interpreters/compilers, they implement their own string data structures, they compile your regex into specialized bytecode (different from the hosting language), they have tons of special-case checks for fast paths (e.g. capturing groups make regexes much much more complicated than they need to be, a regex without capturing has the option to run significantly faster).

A quick search turned up this 2016 SO question[1] where a compiled regex was faster than string.contains (!). Regex is just insanely optimized. It reminds me of C++: ugly and badly/un designed, but it can go fast.

[1] https://stackoverflow.com/questions/2962670/regex-ismatch-vs...


Assuming you're using a non-backtracking library, they should both be O(n).


That's true, but I think it's important to take the encryption approach: I understand vaguely how it works and hope I don't get burned.


I think the worst case here is it writing a regex that mostly works but fails for some edge cases that you don't think to test but will encounter in production.


It'd be cool if it spit out a bunch of test cases/examples so you can see what's happening in edge cases.


That's always an issue with regex, regardless of who wrote it.


I disagree. If I write a regexp, I have to think about what I am writing.

If I press a button and one is magically made for me, I can skip that step.

Skipping that step is bad.


Do you feel the same way about linked lists, hash functions, garbage collectors, etc?


Implementations autogenerated by an AI? Absolutely, there are so many edge cases to consider.

However, I'm much more likely to trust a popular library implementation.


OK, but back to the regex. That's just pattern matching, and ML/AI has been shown to be amazing at pattern matching (albeit underwhelming at most other tasks). I would trust an AI/ML generated regex, but only because such structures are easily testable. This tool from the University of Trieste has probably been around for 10+ years--probably only 1e3 parameters, not 1e10.

http://regex.inginf.units.it/


The tool you linked uses a GA (not a NN) to find a short regex that gives correct answers to whatever testcases you provide. If your test cases are correct, it is guaranteed to produce correct output.

GPT-3 is known to be confidently wrong at things like basic arithmetic, so I really wouldn't trust it in scenarios like this. Maybe one day I'll be able to trust NN-generated code without testing it first, but we're not there yet.


It's a form of defense in depth. Ideally your application has specific test cases, property-based testing, linting, and whatever other forms of static and dynamic analysis you can think of. But if your code is obfuscated and/or you don't have a clear mental model of what it does, that adds a layer of uncertainty and could potentially hurt debugability.


Linked lists, hash functions, etc are mostly solved problems with clearly defined interfaces, built and tested for edge cases by humans. Each regex is a special snowflake.


Totally! And this is one of the worst kinds of code to generate with AI given how often regexes are write-only code. Personally, for any important regex I'm either going to have good unit tests, an extended-mode regex with comments, or both. Which I'm sure this AI is not going to do.

So to me this mainly looks like a way for people who don't understand something to put that ignorance into the codebase, setting traps for colleagues down the road. That's not a new experience for me, but this does seem likely to make that easier and more fun, two things I don't think dangerous code needs.
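For anyone who hasn't seen the "extended-mode regex with comments plus unit tests" combination, a short Python sketch using `re.VERBOSE`. The pattern is the simple email regex quoted elsewhere in this thread, not a full RFC-compliant validator, and the test inputs are made up:

```python
import re

# The simple email pattern quoted elsewhere in this thread, rewritten in
# verbose mode. It is a simplification, not a complete email validator.
EMAIL = re.compile(
    r"""
    [a-zA-Z0-9._%+-]+   # local part: letters, digits and a few symbols
    @                   # literal at sign
    [a-zA-Z0-9.-]+      # domain labels, possibly dotted
    \.                  # dot before the top-level domain
    [a-zA-Z]{2,}        # top-level domain of two or more letters
    """,
    re.VERBOSE,
)


def test_email_pattern():
    assert EMAIL.fullmatch("user.name+tag@example.co.uk")
    assert EMAIL.fullmatch("no-at-sign.example.com") is None
    assert EMAIL.fullmatch("user@localhost") is None


test_email_pattern()
```

The comments document intent and the test pins down behaviour, so the next person doesn't have to reverse-engineer either.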


fair enough, but few things are easier to test than a regex.


Few things are easier to miss than a string that breaks your (nontrivial) regex.

I’d like to see a mathematical estimate of the number of test strings I should generate given some input regex.


If you cannot enumerate the test cases, then the problem is too complex for a single regex. It's sort of self limiting.


Not an apples-to-apples comparison, but if you are trying to build regexes with the help of software, my daily driver for that is RegexBuddy [1]. Not only does it make building and testing regular expressions easy, it also covers pretty much all the regex variants in the wild. (And it comes with an excellent help file.)

The same author also makes a variant called RegexMagic [2], which is probably closer in premise to AutoRegex (minus the NN part, perhaps), as it is designed for making regexes without much knowledge of regex. I don't know how well it works, though, as I haven't used it much...

[1]: https://www.regexbuddy.com/

[2]: https://www.regexmagic.com/


regex101 does almost all of this but in the browser!


Hey! I’m the creator of www.autoregex.xyz (@gd3kr on Twitter). I originally built it as a small side project in a couple of days, and I was absolutely not expecting a response this massive. I realise there are concerns about the email and password sign-up; I built it in an hour or so with Firebase Auth as a temporary way to cap the sudden surge in GPT-3 requests to the server after the Twitter post gained traction. I’m working on a better approach that doesn't require creating an account. Otherwise, all suggestions are absolutely welcome. Please tell me how I can make this a better experience for everyone.


You could probably include a ‘Why do I need to sign up?’ disclaimer.


Two of your three examples are incorrect as they're missing backslashes. /^nw*g$/i should be /^n\w*g$/i and (?=.*d) should be (?=.*\d).


So I was talking to a friend about this, and he thought this was a parody because of all the obviously incorrect results you were highlighting in the Twitter thread. Other people have mentioned the \ escapes going missing, but my friend called out https://twitter.com/gd3kr/status/1545495732265766913 as so hilariously wrong that it couldn't have been anything other than a parody.

Is this really a parody, or is it just another example of people not actually reading GPT-3 output carefully enough to notice it's nonsense?


Hey, no, it’s not a parody lmao. I shipped it because I realised while building it that it could have at least some utility. To be fair, the wrong examples are an oversight on my part, but I thought of them as more of a genuine benchmark of what GPT-3 is capable of, sort of like an experimental feature. I’m working hard on refining the results by fine-tuning DaVinci and getting the output as close to ideal as possible; I think there’s a lot of potential there. In the end, even boosting a user’s productivity marginally is a win in my book, and as a lot of people have pointed out, it’s already doing that to some extent.


Was curious to try it, but got turned off when asked to create an account, and didn't bother.


How do I delete my account?


I’d bet users would’ve preferred to provide examples of input and output to get the regexp they want, instead of describing it in plain English.


That seems like a computationally challenging problem. To avoid .+ you would have to include non matching examples and then I don't know how similar to the matches / specific those would have to be.



Plain English has the benefit that you can use the microphone/dictation on your cellphone...


Why would you need to generate a regex on your cell phone?


Btw, does anyone know a library/program which can reverse engineer a regex from multiple source strings? E.g.

  14:51 [info] 51 some message
  … more of 51 lines …
  15:22 [error] 24 error!
  … more of 24 lines …

  ^(\d\d:\d\d) \[(info|error)\] (\d+) (.+)$
Or maybe not a regex, but a structured pattern.
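For reference, the target pattern above does parse the two sample lines. A quick Python check (this only verifies the hand-written regex; it is not the inference step the question is asking about):

```python
import re

LOG_LINE = re.compile(r"^(\d\d:\d\d) \[(info|error)\] (\d+) (.+)$")

samples = [
    "14:51 [info] 51 some message",
    "15:22 [error] 24 error!",
]

# Each line yields four captured fields: time, level, count, message.
parsed = [LOG_LINE.match(line).groups() for line in samples]
for fields in parsed:
    print(fields)
```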


One probably wants to provide a set of matching and a set of non-matching strings. Then the software would output a regex and some edge-case matching strings and non-matching strings.

This could be built using set operations on deterministic finite automata (DFAs). Every regex (in the formal sense, without backreferences) is equivalent to a DFA. You can construct an automaton for every positive and negative example input, then calculate the union of all positive examples and the union of all negative examples, and finally calculate the difference between the two unions. Convert the resulting automaton back to a regex.

https://scanftree.com/automata/dfa-union-property


I was thinking of something that could categorize parts of these strings into a “language”, so there are no non-matching strings. It’s hard to specify in a formal way, but by looking at these strings you may see that e.g. […] is a static syntactic element, and a number follows it, and time precedes it. This would be nice to have to browse logs (which these strings are obviously a part of) but instead of scrolling through thousands of rows, see all of the patterns that occur among them at once, and then dig down into a pattern to inspect what happened and when to improve on “health” of a complex system. Of course if you know all of them in advance, it’s easy to filter by each. But lots of software/apis do not document their output in such detail.


Technically .* is a valid regex for those strings, so the issue here is not only to reverse engineer them, but to do so in a way that's meaningful for the person who has to use it after.

It shouldn't be hard to start with .* and recursively split it in two parts that still match the input strings, but I believe you will end up with matching but useless regexes.



The closest thing I know of to this is https://github.com/devongovett/regexgen (or my Ruby port https://github.com/amake/regexgen-ruby).

    % bundle exec bin/regexgen '14:51 [info] 51 some message' '15:22 [error] 24 error!'
    (?-mix:1(?:4:51\ \[info\]\ 51\ some\ message|5:22\ \[error\]\ 24\ error!))
With enough inputs it should end up with something somewhat reasonable for the leading part, but it will never be smart enough to understand that the error message is "arbitrary" and should be matched with e.g. `(.+)`.


JS has String.prototype.replaceAll, which can take a regex with multiple capture groups and output them as separate params to a callback function. This can be used to create a functional DSL which generates the regexes and callbacks.


I know this comment isn't helpful on its own, but yes this exists. I've seen it before. I just have no idea what it was called or how to find it again.

EDIT: Ah no, sorry. Was thinking of the other way around[0].

0: https://www.npmjs.com/package/regex-to-strings


RegexBuddy has some limited ability to do this, and the author of that program has a whole separate program called RegexMagic that I believe specializes in exactly this.



It is surprising / not surprising to see the lengths people go to not learn regex. There's not much to it. If you're just starting out, find a good reference, and memorize it. Mastery is another thing entirely, but it always is.

In *nix land, a decent reference is built-in: `man awk` and jump to the "Regular Expressions" section.


There are many awk implementations, so whether that reference is decent or not may vary. It looks like on BSDs, there's no such section at all.


Tested on Linux, OpenBSD, & Mac:

POSIX: man 7 re_format

PCRE: man pcrepattern

PCRE2: man pcre2pattern

PCRE may be a "works on my machine" thing, but POSIX should be there.


    All words starting with the letter "n" and ending with "g", case insensitive
    >> /^nw*g$/i
This is not right. That should be a "\w" instead of "w".


Same with the last one, it should be \d instead of d. Seems like an escape character issue.


It's otherwise wrong, too. It shouldn't have ^ or $, and \w also matches non-letter characters (digits and underscores). It should be /n[a-z]*g/gi
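Two quick Python checks on this, with made-up inputs. The first shows `\w` accepting a digit and an underscore; the second shows that the suggested pattern, having no anchors or word boundaries, can also match in the middle of a word:

```python
import re

# \w accepts more than letters: digits and underscores are included.
w_accepts_non_letters = re.fullmatch(r"n\w*g", "n0_g", re.IGNORECASE) is not None
print(w_accepts_non_letters)

# Without ^/$ or \b, the pattern matches inside "singing" as well.
inside_word = re.findall(r"n[a-z]*g", "singing", re.IGNORECASE)
print(inside_word)
```

So if whole words are the goal, something still has to mark the word edges.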


I have played with this for a while, and here are some prompts that might be interesting.

Prompt: "URL regex"

  ^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
Prompt: "Email regex"

  /^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))
This one's weirdly sophisticated for some reason.

Prompt: "An HTML tag with a close tag and attributes"

  <([a-zA-Z]+)[^>]*?(?<!\/)>.*?<\/\1>


Email addresses are notoriously hard to parse.[0]

I'd wager that one probably misses a bunch of corner cases.

[0] https://www.regular-expressions.info/email.html


This is all true, but I suspect that the issue is mostly self-correcting, since anything other than an extremely vanilla email address is likely to run into an infuriatingly high number of rejections precisely because the software at the other end has taken an overly simplistic view of what is allowed, forcing the owner to change it out of sheer frustration.


> Prompt: "An HTML tag with a close tag and attributes"

Obviously trained on a data set that didn't include <h1>

This whole thread is why I'm not scared of Copilot or similar taking away jobs anytime soon, since bug fixing is way harder than writing code
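The `<h1>` failure is reproducible with the generated pattern as quoted. A Python sketch with made-up HTML snippets:

```python
import re

# The generated pattern, as quoted in the parent comment.
TAG = re.compile(r"<([a-zA-Z]+)[^>]*?(?<!\/)>.*?<\/\1>")

matches_p = TAG.search('<p class="x">hi</p>') is not None
matches_h1 = TAG.search("<h1>Title</h1>") is not None

print(matches_p)
print(matches_h1)
```

The tag-name group only allows letters, so it captures "h" and the "1" gets swallowed as if it were an attribute. The backreference then looks for `</h>`, which never appears.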


Also regex-related: Pomsky: https://pomsky-lang.org/ (formerly Rulex)


I've seen a handful of libraries and DSLs that are intended to replace regex now. It might be interesting to compile a list of them and attempt to compare them.


> difficult to […] comprehend

You’ll still have to fully comprehend the auto-generated regex to make sure that it really does what you want. So the tool may help with coming up with a suitable regex, but doesn’t remove the need for comprehension.


Tried RegEx → English:

    \"object\": \[(.*?)\]
    ----
    The regular expression matches a string that contains "object": followed by a space and an open bracket, then any characters, then a close bracket. The characters between the open and close bracket are captured in a group.
Pretty cool.


As a web scraper, thank god for .*? or to be exact [\s\S]*?

Does 90% of what I need.
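The reason for `[\s\S]*?` over `.*?` is that the dot does not match a newline by default. A Python sketch with a made-up snippet; `re.DOTALL` is the Python-specific alternative:

```python
import re

html = "<p>first line\nsecond line</p>"

# By default "." stops at newlines, so the lazy dot version finds nothing.
dot_lazy = re.search(r"<p>(.*?)</p>", html)
# [\s\S] matches any character, newline included.
any_lazy = re.search(r"<p>([\s\S]*?)</p>", html)
# In Python the DOTALL flag gives "." the same behaviour.
dotall_lazy = re.search(r"<p>(.*?)</p>", html, re.DOTALL)

print(dot_lazy)
print(any_lazy.group(1))
print(dotall_lazy.group(1))
```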


Did a hobby project a while ago, just the reverse. Wrote a blog post - https://medium.com/codist-ai/generating-natural-language-des... (colab - https://colab.research.google.com/drive/1QibOifIJQB2tfLyy_mm...)

Also gave a pycon talk - https://www.youtube.com/watch?v=Zugbqg9HFHQ

It was fun, achieved good result but did not need a monster like GPT3!


Why do I have to sign up?


Mmm, it produces an expression matching the input? You should add some examples. I'm a noob, but AFAIK there are different regex engines out there; which one is this output for? I remember trying other regex generators in KNIME, and they didn't work.


I definitely need something like this. My brain seems to insta-delete all Regex knowledge after 1 week.

However, why do I need to sign up to test it? I'm guessing there is some paywall after. This feels like that GitHub Copilot rugpull.


Switch from "regex to english" and write "email".

Instead of saying it will match the string "email", it does the opposite, an English -> regex conversion:

> regex = /\A([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})\Z/i

> This regular expression is used to match an email address. It starts by matching any character that is not a space or the "@" symbol, followed by the "@" symbol, then any characters that are not a space or the "." symbol, followed by a "." character, and finally any characters that are not a space.


"Something I can use to grep the word list on my Mac to cheat at Worldle. My remaining letters are ...., my green letters thus far are .... and my yellows are ..."


I created Melody (https://github.com/yoav-lavi/melody) for the same reasons you've mentioned in the website. I knew AI would try to replace me one day, but I didn't think it'd be so soon.

Kidding of course, looks really cool!


Here's how you can do it without GPT-3 (using BLOOM, an open-source alternative): https://twitter.com/1littlecoder/status/1545818058153140224


I always imagined having a tool like this! Really psyched to have something to play with!


A tool like this really cuts the time!


This is the kind of use case I’d like to see more of! Less of the p-zombie copywriting neoplagiarism services with pot-luck output, more GPT-as-backend functional apps as productivity multipliers, if that makes sense.


I typed in

> Email

> English → RegEx

> 95

> GO

> \w+@\w+\.\w+

That's an interesting email regex.


That's not that wrong though.

What does it do for queries like "all male English names" or "comfortable temperature range"?


For "comfortable temperature range", it generates:

  (\d+\.?\d*)\s?-\s?(\d+\.?\d*)
And for "all male English names", it generates:

  [A-Z][a-z]+
The first one might be good, but the latter seems rather unsophisticated.


All male English names is just a list. It doesn’t follow a grammar. Its regex would be akin to

    (John|Jane|Alice|Bob)
and so on. It’s not a case for regex. In fact, I’ve found success in replacing regex with regular string operators (length, contains substring, doesn’t contain substring, starts with a capital letter, …) of the language at hand, then do final regex passes for whatever is left at the end. It’s infinitely more readable and debuggable. I’ve grown to avoid regex when possible.
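As a rough illustration of the "plain string operators" style, here is a Python sketch that does roughly what `[A-Z][a-z]+` does as a full-string check. The function name and the exact rules are hypothetical, and it is restricted to ASCII on purpose:

```python
def looks_like_name(s: str) -> bool:
    """Plain string checks standing in for a full match of [A-Z][a-z]+."""
    return (
        len(s) >= 2            # one capital plus at least one more letter
        and s.isascii()        # keep it comparable to the ASCII-only regex
        and s.isalpha()        # letters only: no digits, dashes or spaces
        and s[0].isupper()     # starts with a capital letter
        and s[1:].islower()    # the rest is lower case
    )


for candidate in ["John", "john", "J", "Mary-Ann", "J0hn"]:
    print(candidate, looks_like_name(candidate))
```

Each condition can be stepped through in a debugger or tested on its own, which is most of the readability argument.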


I typed "an email address" and it came up with

    [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
which is similar, but has some interesting differences. That shows it's black-box association. Then I tried "my email address" and this came out, line break included

   is [email protected]
   My email address is \w+\.\w+@\w+\.\w+
"An identifier" vs "a Javascript identifier" does work as expected, but "a number" and "a floating point number" don't. "A quoted string" doesn't escape the quotes inside, but if you add "with escaped quotes" it does.

So, it's cute, and might set you on the right track, as long as you study the output a bit.


    All words starting with the letter "n" and ending with "g", case insensitive
    >> /^nw*g$/i

The example regex is completely wrong, it should be /n[a-z]*g/ig


You also can use this for general GPT-3 queries, which is quite cool!


How did you do that?


Excellent. Worked for me. The output can be modified, but this provides a great start. I do regex so seldom that I can't remember the notation for lookahead, for example, which this provides.


Hmmm, it might work, but it needs a unit test to verify.


Different programming languages use different forms of RegEx; the site doesn't suggest it caters for this difference.


If we only could get safari to support lookbehind regex that would be nice.


Doesn't work for me.

> any word that has a penultimate character of a
> .*[a-z]n$


Just... Learn regex. Jeez.


Needs a way to delete your account


That's why disposable email addresses exist


My prayers have been answered.


the font is so tiny on the website preview I can't read anything


damn this was good



