A few years back we came into work one morning to find that some bot was scanning our site so hard that it seemed the lights nearly dimmed. Some detective work suggested it was a service run on behalf of a competitor to get our price list (bear in mind that our catalog has a few hundred thousand products).
We were really annoyed that rather than just ask us, they had launched what amounted to a DDOS attack. So we thought about how we might exact vengeance...
After a few hours we figured out a pattern to the rogue requests that allowed us to filter them, despite their efforts at stealth (for instance, they cycled through a list of user agent strings to make it look like there were multiple different users). We toyed with the idea of, rather than outright banning them, making our pages sensitive to their presence, so that when we detected them, we'd display a false price, defeating their whole operation.
We finally just decided to take the high road, temporarily banning any rogue IP addresses we detected (we couldn't make it permanent because many of the requests came from the Amazon cloud, from which we also receive some legitimate requests).
EDIT: you wouldn't think that requests for a few hundred thousand products would amount to a DDOS, but the bot was rather poorly written and grossly inefficient in the way it walked through the list.
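For anyone curious what a temporary ban like the one described above might look like, here is a minimal sketch. It is not the code we ran; `TempBanList`, the one-hour default, and the injectable clock are all assumptions made for illustration.

```python
import time


class TempBanList:
    """Temporary IP bans that lapse on their own (hypothetical helper)."""

    def __init__(self, ban_seconds=3600, clock=time.monotonic):
        self.ban_seconds = ban_seconds  # assumed duration, tune to taste
        self.clock = clock              # injectable so it can be tested
        self._expires = {}              # ip -> time at which the ban lifts

    def ban(self, ip):
        """Ban an IP for ban_seconds starting now."""
        self._expires[ip] = self.clock() + self.ban_seconds

    def is_banned(self, ip):
        """Return True while the ban is active, False once it has lapsed."""
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # The ban has lapsed: forget it so shared cloud IPs recover.
            del self._expires[ip]
            return False
        return True
```

The expiry is the whole point: an address in a shared cloud range gets locked out for a while, then goes back to being treated like any other visitor.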
I built a system called caltrops that did almost exactly that. As a given session's requests grew more and more suspicious, their data would skew from reality further and further. A real user on the line would notice immediately (and the more real-looking the user interactions, the more it would reduce suspicion), but competitors scraping our data would get pretty deliciously bunk data.
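A rough sketch of the idea, not the actual caltrops code: the function name, the 0 to 1 suspicion score, and the 30% ceiling are all assumptions, and how you compute suspicion is left out entirely.

```python
import hashlib


def skewed_price(true_price, suspicion, session_id, max_skew=0.30):
    """Return a price that drifts from reality as suspicion grows.

    suspicion is a score in [0, 1]; values outside that range are clamped.
    max_skew is the largest relative error applied at full suspicion.
    """
    suspicion = min(max(suspicion, 0.0), 1.0)
    # Derive a stable pseudo-random direction per session so one session
    # keeps seeing consistent (if wrong) numbers across requests.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    unit = int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)
    direction = 2.0 * unit - 1.0                      # in [-1, 1)
    return round(true_price * (1.0 + suspicion * max_skew * direction), 2)
```

At suspicion 0 the real price comes back unchanged, so a session that looks human sees accurate data.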
"Thousands of people use Mailinator everyday, so clearly, its a useful tool that many sites accept"
How many of you would have an outright revolt on your hands from your QA/QE folks if you banned mailinator? I think everyplace I worked would experience this same issue if we did this.
Could use + in the first part of the email, such as: [email protected], to create throwaways. Most sites consider those to be different email addresses than [email protected] for account purposes, but email services that respect the RFC will treat them as the same.
Many sites won't accept email addresses with + in them, because many devs have extremely wrongheaded ideas about validation.
I used to have a [email protected] address and that one was touch-and-go as well due to the fact that the mailbox had two .'s in it. I actually had to file a support request to get Amazon Student to accept it, even. Nobody from a university with that scheme ever registered before?
For the record, the gold standard for email validation is "send a confirmation link and see if they click it". Don't try and get fancy.
One other trick is that Gmail ignores .'s in addresses entirely. [email protected] is the same as firstlast or f.irstlast.
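If you want to see how those variants collapse, here is a small sketch. `canonical_gmail` is a made-up helper, it only knows about gmail.com, and the "+suffix" stripping reflects Gmail's plus addressing as discussed elsewhere in this thread rather than anything mandated by a standard.

```python
def canonical_gmail(address):
    """Collapse Gmail dot and "+suffix" variants to one mailbox name.

    Addresses on other domains are returned with only the domain lowercased,
    because other providers may treat dots and "+" as significant.
    """
    local, at, domain = address.rpartition("@")
    if not at:
        return address  # not an address at all; leave it alone
    domain = domain.lower()
    if domain != "gmail.com":
        return f"{local}@{domain}"
    # Drop everything from the first "+", then remove the dots.
    local = local.split("+", 1)[0].replace(".", "").lower()
    return f"{local}@{domain}"
```

Two sign-ups that canonicalize to the same string are, as far as Gmail is concerned, the same inbox.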
What's worse: some sites accept email addresses with + in them and then years later, they stop working, and you can't log in to fix your email address.
Hah, this happened to me once. Turns out the email validation was occurring client-side though... so a quick edit later and the server still gladly accepted my '+'-enabled email address. :-)
>For the record, the gold standard for email validation is "send a confirmation link and see if they click it". Don't try and get fancy.
But make sure you have some sort of rate limiting set up, so malicious users can't take advantage to spam someone's mailbox (and get your server blacklisted).
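Something along these lines would do as a starting point. It is a sketch only: `ConfirmationLimiter`, the limit of 3 sends per hour, and the in-memory storage are assumptions, and a real deployment would probably also limit per client IP.

```python
import time
from collections import defaultdict, deque


class ConfirmationLimiter:
    """Sliding-window cap on confirmation emails per recipient."""

    def __init__(self, max_sends=3, window_seconds=3600, clock=time.monotonic):
        self.max_sends = max_sends            # assumed limit
        self.window_seconds = window_seconds  # assumed window
        self.clock = clock                    # injectable for testing
        self._sent = defaultdict(deque)       # recipient -> send timestamps

    def allow(self, recipient):
        """Return True and record the send if the recipient is under the cap."""
        now = self.clock()
        sent = self._sent[recipient.lower()]
        # Discard timestamps that have fallen out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_sends:
            return False
        sent.append(now)
        return True
```

Call `allow(address)` before sending each confirmation mail and skip the send when it returns False.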
My favorite is e-mails with three dots in them. Which is actually not a valid address - the RFC specifies that you must have a valid textual character between dots[1]. However, because of poor decisions by Japanese telecoms, a substantial chunk of their users have 'e-mails' associated with their mobile phones with three dots, breaking goddamn every sensible validation script.
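Assuming "three dots" here means dots sitting next to each other, this is roughly what the unquoted (dot-atom) rule for the local part looks like in code. `is_dot_atom` is a made-up name, quoted local parts are ignored, and given the rest of this thread it is probably better used for logging than for rejecting sign-ups.

```python
import re

# "atext" characters allowed in an unquoted local part (RFC 5322, 3.2.3).
_ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# dot-atom: runs of atext separated by single dots, none leading or trailing.
_DOT_ATOM = re.compile(rf"{_ATEXT}+(?:\.{_ATEXT}+)*")


def is_dot_atom(local_part):
    """Return True if local_part is a valid unquoted dot-atom."""
    return _DOT_ATOM.fullmatch(local_part) is not None
```

So `first.last` passes, while `first..last` and a local part ending in a dot do not, which is exactly the shape of address those mobile carriers handed out.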
Sadly, if you're sending e-mail sanely, your mail provider likely validates recipients, and will be annoyed at you if you send them recipients they think are bogus.
The relevant RFC (on mobile; don't remember which) specifically states that intermediate servers must not validate mailboxes (local parts). And honestly the domain should be "validated" by the server doing an MX lookup; let DNS handle it.
So wait, what now? So you can have an email address like [email protected]? Can you give me a generic example?
Yeah, I'm with the other guy, regardless of whether or not it's a good idea to do validation (it's not), that's not an address that should pass validation because it's not a valid domain or hostname.
I could see it being less of a big deal in the mailbox portion given that it's now kinda kosher to ignore dots there.
This is why we allow subdomain magic at FastMail, so that if you wanted to use [email protected] you can use [email protected] as well, and it works fine. Everyone accepts that form.
I had a [email protected] (though I have a really common name so it was actually [email protected]), but fortunately they also gave us 8 character usernames (which were also our login to our shared hosting on the Sun E6500 machine), but I never used the long form since it was rejected nearly everywhere.
> Could use + in the first part of the email, such as: [email protected], to create throwaways. Most sites consider those to be different email addresses than [email protected] for account purposes, but email services that respect the RFC will treat them as the same.
There is no RFC that requires this behavior. Subaddressing within the local part is recognized as a common practice (e.g., in RFC 5233), but nothing requires a system to support subaddressing, or requires a system that does to support a particular separator character or character sequence (e.g., "+") for subaddressing. Email systems are free to implement or not implement subaddressing, and to use any character sequence they want as the separator.
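To make the "any separator" point concrete, here is a minimal sketch of how a receiving system might split a local part. `split_subaddress` is a hypothetical helper, not taken from any real mail server, and the separator is whatever that one system decided to use.

```python
def split_subaddress(address, separator="+"):
    """Split an address into (user, detail, domain).

    detail is None when the local part does not contain the separator.
    The separator is a per-system choice; "+" is only a common default.
    """
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError(f"no '@' in address: {address!r}")
    user, sep, detail = local.partition(separator)
    return user, (detail if sep else None), domain
```

A system configured with "+" would deliver `user+news@example.com` (a made-up address) to the mailbox `user`, while a system that does no subaddressing would treat the whole local part as one mailbox name.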
I've always found this to be such an odd option. If the concern is people spamming you or selling your email address, how does this really help? Anyone intentionally doing nefarious things can just add a simple regex to strip the garbage.
The parent mentioned QA, it would work for that. For antispam you could add a filter for emails to the direct address without suffix. Personally I prefer aliases, but I think they are very rare on freemailers.
> Could use + in the first part of the email [...] email services that respect the RFC will treat them as the same.
I'm not aware of any RFC that says that mail sent to [email protected] should go to the same mailbox as mail sent to [email protected] (nor am I aware of any RFC that forbids this). I thought that GMail made up that feature and other vendors followed suit since users find it handy.
> Subaddressing is the practice of augmenting the local-part of an
> [RFC2822] address with some 'detail' information in order to give
> some extra meaning to that address. One common way of encoding
> 'detail' information into the local-part is to add a 'separator
> character sequence', such as "+", to form a boundary between the
^^^^^^^^^^^
> 'user' (original local-part) and 'detail' sub-parts of the address,
> much like the "@" character forms the boundary between the local-part
> and domain.
(Highlighting by me)
The RFC even gives an example using the hash:
> o A message addressed to "5551212#[email protected]" is delivered to
> the voice mailbox number "123" at phone number "5551212".
One way to get around a domain blacklist is to point your own domain to Mailinator. Heck, since last year you can even get your own private Mailinator...
It took me a bit to get my head around the use cases. It's sometimes amazing how many different ways you can twist a simple (complex really) thing like email into a product/idea.
However, tricking site scrapers may not work perfectly if the scrapers maintain a whitelist of websites. Say I am scraping mailinator.com for domain names: if I see gmail.com or yahoo.com, I might just not put them in my database because they are in my whitelist.
Mailinator seems to have added some other anti-scraping detection.
Unfortunately it does not work very well as I was not scraping mailinator, but still somehow got IP banned. Fortunately my ip has changed. But they definitely have some strange and overzealous method now.
I would go one step further and look for {spam_words} in "username+{text}@{googledomain}.com", where spam_words can be "junk", "spam", etc. This is like a very narrow edge case, but still might catch something. Again, if you're into that kind of thing; I'm quite skeptical that it brings any value.
That's not at all what I meant. Gmail redirects emails to "[email protected]" to "[email protected]". Some people filter emails this way, hence my suggestion to check for this edge case.
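For what it's worth, the check being suggested is only a few lines. This is a sketch with made-up names; the word list and the gmail.com-only default are assumptions, and as noted above it is doubtful that it catches much.

```python
SPAM_WORDS = ("junk", "spam")  # illustrative list, not from any real system


def has_spammy_suffix(address, spam_words=SPAM_WORDS, domains=("gmail.com",)):
    """Return True if the "+suffix" of a listed-domain address has a spam word."""
    local, at, domain = address.rpartition("@")
    if not at or domain.lower() not in domains:
        return False
    if "+" not in local:
        return False
    # Everything after the first "+" is the user-chosen suffix.
    suffix = local.split("+", 1)[1].lower()
    return any(word in suffix for word in spam_words)
```

It flags an address like `username+junk@gmail.com` (a made-up example) and ignores plain addresses or suffixes without a listed word.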
FTA: "Could I make it harder to scrape? Well, I could, but wouldn't really slow anyone down much."
I think that's the basic idea. He could spend his time making it harder to scrape, like the bar across the steering wheel. Some people would be deterred, others wouldn't, and time would be wasted all around.
At least at the time of writing, if you had enough foresight and engineering time to set something like that up, you had enough foresight and engineering time to not make your system treat email addresses as meaningful identities.
Perhaps I'm missing something, but an extremely high percentage of the sites I have accounts on use my email address for authentication. Those that don't often suffer from username squatting. Maybe most sites are just doing it wrong, but what's the prevailing alternative?
Your email address isn't your identity. It's a name associated with your identity, but the identity itself is your account. Or put another way, not all valid email addresses are valid identities for these websites.
If the website is doing things right, they have other means (like a CAPTCHA at the least, or phone verification, or you buying an item from them) before deciding that an email address really is an identity.
I guess I still fail to see the distinction. CAPTCHAs really only keep out bots . . . they do nothing for keeping out Mailinator abuse. Throwaway phone numbers are easily obtainable. They might not be as cheap as Mailinator, but the point is Mailinator made it faster and cheaper for people. Buying an item doesn't really work out when the expectation is you offer a free trial and that's where the bulk of abuse occurs.
I realize this was a non-comprehensive list and I'm not trying to just attack it. I think I agree with the core assessment around what constitutes an identity. But short of some really draconian methods, I think you're basically trading off one insufficient method for another. And at that point, you may as well focus on making things easy for people, which typically means just working with email verification.
FWIW, when faced with Mailinator abuse I resorted to requiring a credit card number to sign up for a trial of my SaaS product. The abuse stopped immediately. But there were other impacts to the business as a result. I still debate the wisdom of it and how much of this should have been foresight. As a bootstrapped company, dealing with abuse was just a resource drain and forced me to focus my efforts on dealing with a segment of the population that was never going to give me money. Suffice to say, it was all very disheartening.
Anyway, thanks for sharing your thoughts on the matter.
I hadn't read that in many years, and what fun to do a re-read.
Thanks Internet - don't stop being you.