It's worth noting that many journals don't control the platform that hosts their scholarly content. It looks like ACS uses [Atypon](http://www.atypon.com). That's the likely source of this spider trap, not ACS.
Scientific publishing is not just weird, it's sinister.
That said, ACS is pretty sinister, too. They opposed PubChem (http://en.wikipedia.org/wiki/PubChem#ACS.27s_concerns) and in general don't behave like the nonprofit scientist trade organization that they present themselves as.
In HighWire's case, they typically have a robots.txt blocking everyone but Google ... and the reason is not malice, it's inefficient software. Fetching a page once every few seconds is enough to overload their system.
A problem with robots is that they scan all of a website. Normal user traffic is somewhat focused and easy to cache. When a robot comes along and scans everything at once, it pulls a bunch of unpopular pages into the cache, possibly evicting more popular ones in the process. The more popular pages then need to be re-cached as requests for them come back.
If you avoid caching requests from robots, you instead end up going through all the layers of your app, possibly hitting the database.
In most situations, I don't think the above matters much. But I can see how it could be a worst case for some stacks.
That might have been true in 1997, but in 2014 a page every few seconds is so far down in the noise that it's irrelevant. If you are getting problems from an IP, it's normally because it is making thousands of requests per second, and that isn't any halfway sensible spider, it's a DoS attack. robots.txt only stops you if you read and abide by it, and if you are fetching a page every few seconds it's almost certain that the site will not even notice.
Those client lists are a bit unfair to compare: Atypon lists large publishers (Elsevier, IEEE, Oxford University Press, Taylor & Francis, ACS) while HighWire's list has a lot of individual journals (Journal of Early Childhood Research, Monthly Notices of the Royal Astronomical Society: Letters, etc.)
Is it bad that I'm just as insulted by the so-called "spider trap"? It's so technologically simple as to be useless against anyone who could deploy a web scraper in the first place.
I mean, it's marked by comment tags that say "spider trap" right on them! It's the worst type of disambiguation system: likely to generate false positives, unlikely to catch real violators.
Yet the off-the-shelf bots that are just let loose on the web in general will likely fall for it, as long as the "spider trap" is not itself off-the-shelf; and the bots actually targeted specifically at you, you likely can't defeat anyway.
Note how this means that anyone who is tricked into clicking that link has just blacked out their entire institution. This has massive potential for abuse.
Spam every academic email address with a link to "I thought you might find this paper interesting", and lock out every university in the world? It only takes one click by one unsuspecting victim per institution. I wonder what would happen if someone included that DOI in a publication's reference list. Watch which institutions go dark, and you know who is doing your blind review.
Undergrads won't open that link, and grad students and faculty are too busy.
To really do damage, you need to create something that'll go viral, and piggyback the link on it. Undergrads will click anything that looks fun. Make it look fun and you're golden.
If you're up for a little fraud, get a list of faculty and undergrad e-mails and send e-mails to undergrads purporting to be from faculty saying, "Please read this and be ready to discuss in class on Monday." The hit rate will be low, but you're bound to get a few students who really do have that professor's class on Monday and will click it.
Things like signatures on forums visited by university students are the first to come to mind. Embed it through one of the many referer-stripping redirectors out there, and no one would be the wiser...
Yep - the naughtiness you could get up to with this is almost endless. I get the feeling that this will not be with us for much longer as every university in the world is about to get locked out.
You're assuming the creator of the spider trap also put in CSRF protections. However, if you read the article, you'll see that it is only a link, with no CSRF protection at all. Therefore, it is quite true that spamming research mailing lists could lock a lot of people out. In fact, since this spider trap is triggered by a GET request, you would just need to embed the link as an image, script, or stylesheet and get the target to visit a page.
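To make that concrete, here is a minimal hypothetical sketch (not something anyone should deploy): because the trap fires on a bare GET, a page only has to reference the trap URL as a resource and the visitor's browser will issue the request for them.

<!-- hypothetical example: the visitor clicks nothing; the browser fetches the "image" automatically on page load -->
<img src="/doi/pdf/10.1046/9999-9999.99999" width="1" height="1" alt="">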
This sort of Slashdot-esque misdirection is not exactly appropriate on HN. It starts with black-holing universities with a link mislabeled as a response, but could quickly devolve into misdirections to the other black holes of Slashdot that (I assume) we really don't want here.
Not to mention the fact that if someone at a university does click the link, they could ruin any number of people doing research there. That includes undergrads adding a last-minute reference to their thesis and tenured professors about to cure cancer.
arXiv.org, back when it was still xxx.lanl.gov, had a similar trap. Yes, I clicked on it. It gave a warning of the sort "don't do this again, here's what's happening, if we see many more requests from your site then we'll shut off access."
I still remember that page. As a middle schooler who didn't know anything from anything, I found it perplexing. The site's got an xxx at the front, but looks like a legit government site from... wait, Los Alamos? Like from "Surely You're Joking, Mr. Feynman"? Oh jeez, I'm gonna get in trouble with the school...
Funny, we used to do this when I was working at arXiv.org. We had incessant problems with robots that didn't obey robots.txt so we needed spider traps to keep the site from going down.
That's some level of incompetence - the trappers, I mean. A half-arsed solution because they couldn't think of a better one. Off the top of my head, a registration system with abstracts and unlock-this-article links would be better.
I'm willing to bet that they provide site licenses, where everyone in an entire university's subnet range might have access. In an open access journal, it shouldn't matter, but many journals are hosted on the same few platforms, and the spider trap is a feature of the platform.
Tl;dr: a researcher is browsing the source code of a research paper's web page and finds a strange link (on the same ___domain). She clicks it and is informed that her IP is banned for automated spidering.
Apparently, this research site is meant to be open-access...
-------
Pandora is a researcher (won’t say where, won’t say when). I don’t know her field – she may be a scientist or a librarian. She has been scanning the spreadsheet of the Open Access publications paid for by the Wellcome Trust. It’s got 2200 papers that Wellcome has paid 3 million GBP for, for the sole purpose of making them available to everyone in the world.
She found a paper in the journal Biochemistry (that’s an American Chemical Society publication) and looked at http://pubs.acs.org/doi/abs/10.1021/bi300674e . She got that OK – then looked to see if she could get the PDF - http://pubs.acs.org/doi/pdf/10.1021/bi300674e - yes, that worked OK.
What else can we download? After all, this is Open Access, isn’t it? And Wellcome have paid 666 GBP for this “hybrid” version (i.e. the publisher gets subscription income as well). So we aren’t going to break any laws…
The text contains various other links and our researcher follows some of them. Remember she’s a scientist and scientists are curious. It’s their job. She finds:
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999">
<!-- Spider trap link --></a></span>
Since it's a bioscience paper she assumes it's about spiders and how to trap them.
She clicks it. Pandora opens the box...
Wham!
The whole university got cut off immediately from the whole of ACS publications. "Thank you", ACS.
The ACS is stopping people spidering their site. EVEN FOR OPEN ACCESS. It wasn't a biological spider.
It was a web trap based on the assumption that readers are, in some way, basically evil.
Now I have seen this message before. About 7 years ago one of my graduate students was browsing 20 publications from ACS to create a vocabulary.
Suddenly we were cut off with this awful message. Dead. The whole of Cambridge University. I felt really awful.
I had committed a crime.
And we hadn't done anything wrong. Nor has my correspondent.
If you create Open Access publications you expect - even hope - that people will dig into them.
So, ACS, remove your spider traps. We really are in Orwellian territory where the point of Publishers is to stop people reading science.
I think we are close to the tipping point where publishers have no value except to their shareholders and a sick, broken vision of what academia is about.
UPDATE:
See comment from Ross Mounce:
The society (closed access) journal ‘Copeia’ also has these spider trap links in its HTML, e.g. on this contents page: http://www.asihcopeiaonline.org/toc/cope/2013/4
you can find
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999">
<!-- Spider trap link --></a></span>
I may have accidentally cut off access for everyone at the Natural History Museum, London once when I innocently tried this link, out of curiosity.
Why do publishers ‘booby-trap’ their websites? Don’t they know us researchers are an inquisitive bunch? I’d be very interested to read a PDF that has a 9999-9999.9999 DOI string, if only to see what it contained – they can’t rationally justify cutting off access to everyone just because ONE person clicked an interesting link?
PMR: Note - it's the SAME link as the ACS uses. So I surmise that both societies outsource their web pages to some third-party hackshop. Maybe 10.1046 is a universal anti-publisher.
PMR: It's incredibly irresponsible to leave spider traps in HTML. It's a human reaction to explore.
Seems like an easy way for a university-based "conscientious objector" to have this issue addressed would be to intentionally click on the spider trap link once a day?
I work for a (non-profit) journal publisher, and we do indeed cut off robot downloading, but not after one click of a link. We analyze traffic to determine robot downloads. I suspect, though, that the entire university did not get cut off in this incident. Usually it is on a per-IP basis, and unless the university proxies all of its journal traffic through a single IP, which is not common, I think saying the whole university was blocked may be an exaggeration. I personally wish we had no robot monitor, but then we would get heavy spidering of large files.
We do have a CAPTCHA, too, before the block. Basically, to get blocked you have to really work at it. We also do not mind limited robot use for cases like downloading all papers matching a search term or author, but we do not want people downloading our entire corpus either. So throttling is not an option.
I think the approach mentioned in the article is definitely heavy-handed. When it comes down to it, at my place we are just trying to block the wget -r's of the world.
We had an internal wiki where the "delete article" link was a GET. Then someone wrote a crawler for it and deleted the entire wiki in 15 minutes. It was changed to a POST after that.
Heh, this reminds me of a story from many years ago at Google, where we got angry messages from some guy complaining that Google kept deleting all the photos from his online album.
We eventually figured out that his online album had an unprotected "delete this photo" endpoint reachable via GET, and no robots restriction! We had to fix the crawler to detect things like this...
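The underlying issue in both stories is the same: a crawler follows every link it can see, so any state-changing action reachable by a plain GET will eventually get triggered. A rough sketch of the difference (hypothetical markup, not the actual wiki or album code):

<!-- unsafe: a link-following crawler will "click" this and delete the page -->
<a href="/wiki/delete?page=SomePage">Delete this page</a>

<!-- safer: crawlers don't submit forms, and POST (ideally with a CSRF token) marks the action as state-changing -->
<form method="post" action="/wiki/delete">
  <input type="hidden" name="page" value="SomePage">
  <input type="hidden" name="csrf_token" value="...">
  <button type="submit">Delete this page</button>
</form>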
Plus it adds load to websites, so admins don't like it. A number of websites blocked users of the Fasterfox extension, which did prefetching.
Oddly, the Google cache version won't load for me either. The Google cache header is there, but the content area is blank, with the Chrome status bar saying "Waiting for blogs.ch.cam.ac.uk".
Looking at the source... there's something weird going on. I think maybe the _original_ page loaded its content with JavaScript, and the Google cached version is just the JS skeleton, waiting to load JS from the original (overloaded) site, which would actually load the content?
Ugh. The trend toward JS-dependent sites for simple content breaks the web, people.
The warning message returned by the spider-trap says that it banned a particular IP address. How does this cut off the entire university? Is everyone behind a NAT?
For licensing purposes, they'd need to be able to associate ranges of IP addresses with a specific institution. So if they want to, it's easy to block that whole license for one violation.
This is an important topic, but that blog entry was not very well written. If I hadn't already heard about this, I would have been very confused about what they were actually trying to say with this convoluted story.
1. Get a university with good ties to the ACLU and other such movements.
2. Subscribe.
3. Click the link.
4. Sue them for breach of contract and damages (they didn't deliver the content you paid for, and it damaged your main source of income: providing knowledge to paying students).
Sigh, did no one notice that the link is in a <span id="hide">? Look at the style sheet and note that the 'hide' style sets the link to be the same color as the background (making it invisible to humans), and yet it got clicked on anyway.
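Presumably the rule amounts to something like this (a reconstruction, not the exact CSS, assuming the 'hide' selector simply blends the link into the page background):

<style>
  /* hypothetical rule: draw the trap link in the background colour so humans never notice it */
  #hide a { color: #ffffff; background-color: #ffffff; text-decoration: none; }
</style>
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999"><!-- Spider trap link --></a></span>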
There are bad actors out there, they exploit services, and one of the ways services detect them is to create situations that a script would follow but a human would not. When they do something bad you've got a couple of choices: cut them off or lie to them (some of the Bing Markov-generated search pages for robots are pretty fun).
So she sends an email to the address provided, they talk to her, she gets educated, and they re-enable access. If it happens again, the issue gets escalated. It's the circle of fraud.
Some people also override site CSS with their own, which could well make a link that was intended to be hidden become visible. Most browsers I've used have that option.
True, I've done this for sites that had really slow loading webfonts and/or remote css pages. Although we really can't say what it looked like without the rest of the page contents. I don't suppose anyone has a page source dump.
I sometimes wonder if this sort of activity (honeypots for auto-banning) angers people because they feel they have a right to script the site, or because they feel bad about "falling for" the honeypot. Clearly it generates some emotion, though.
A link the same colour as the background can still be seen (e.g. if it's selected by accident, or by Select All), can still be clicked on whether it's seen or not, etc.
True enough, but we might expect those people to have some idea of what a spider trap is.
I definitely noticed that the tag was <span id="hide">, though... that raises all kinds of interesting questions, like "what if I want to hide more than one thing on the page?" and "if I'm going to do this, why not display: none instead of just adjusting the text color on the link?"
I sometimes get a similar message from Google (maybe it's due to the search queries I use...), but they provide a CAPTCHA so you can (reasonably) show that you're a human.
It's odd that at the top of the article the author claims Pandora might be a scientist or a librarian (but they won't reveal such things), and then later claims she looked at the hidden link because she was curious (because scientists are curious). Maybe someone should have re-read their text for consistency.
Trying to stop spidering or web scraping, or making it criminal, is asinine. If you don't want that, don't publish it online. Even if you put content up as a Flash or Java applet, someone will find a way to crawl/scrape it.
This goes against the nature of the internet and information; it is bound to be free.
You cannot simultaneously publish something and stop people from knowing what it contains. Expecting that you can is absolutely insane--literally insane--as in believing that P and not P can simultaneously have the same logical value.
This makes me furious. It isn't because the intent is malicious. That only makes me just a tiny bit angry. I am furious because the malice was implemented in the stupidest, most useless, laziest manner possible.
It's like keeping the neighborhood kids off your lawn by burying a pressure plate switch out there for the armed nuclear bomb in your garage. And then not telling anyone about it. And then inviting all the neighbors over for a croquet tournament.
Atypon has [a relatively small client list](http://www.atypon.com/our-clients/featured-clients.php). Compare it to [Highwire](http://highwire.stanford.edu/lists/allsites.dtl). I'd be willing to bet that all journals hosted with Atypon share this spider trap—even journals that are supposed to be open access where spidering should be OK.
Scientific publishing is weird. Source: I work in scientific publishing.