It's worth noting that many journals don't control the platform that hosts their scholarly content. It looks like ACS uses [Atypon](http://www.atypon.com). That's the likely source of this spider trap, not ACS.
Scientific publishing is not just weird, it's sinister.
That said, ACS is pretty sinister, too. They opposed PubChem (http://en.wikipedia.org/wiki/PubChem#ACS.27s_concerns) and in general don't behave like the nonprofit scientist trade organization that they present themselves as.
In HighWire's case, they typically have a robots.txt blocking everyone but Google ... and the reason is not malice, it's inefficient software. Fetching a page once every few seconds is enough to overload their system.
A problem with robots is that they scan all of a website. Normal user traffic is somewhat focused and easy to cache. When a robot comes along and scans everything at once, it pulls a bunch of unpopular pages into the cache, possibly evicting more popular ones in the process. The more popular pages then need to be re-cached as requests for them come back.
If you avoid caching requests from robots, you instead end up going through all the layers of your app, possibly hitting the database.
In most situations, I don't think the above matters much. But I can see how it could be a worst case for some stacks.
That might have been true in 1997, but in 2014 a page every few seconds is so far down in the noise that it's irrelevant. If you are getting problems from an IP, it's normally because it is making thousands of requests per second, and that isn't any halfway sensible spider, it's a DoS attack. robots.txt only stops you if you read and abide by it, and if you are fetching a page every few seconds it's almost certain that the site will not even notice.
Those client lists are a bit unfair to compare: Atypon lists large publishers (Elsevier, IEEE, Oxford University Press, Taylor & Francis, ACS) while HighWire's list has a lot of individual journals (Journal of Early Childhood Research, Monthly Notices of the Royal Astronomical Society: Letters, etc.)
Is it bad that I'm just as insulted by the so-called "spider trap"? It's so technologically simple as to be useless against anyone who could deploy a web scraper in the first place.
I mean, it's marked by comment tags that say "spider trap" right on them! It's the worst type of disambiguation system: likely to generate false positives, unlikely to catch real violators.
Yet the off-the-shelf bots that are just let loose on the web in general will likely fall for it, as long as the "spider trap" is not itself off-the-shelf; and the bots actually targeted specifically at you, you likely can't defeat anyway.
Note how this means that anyone who is tricked into clicking that link has just blacked out their entire institution. This has massive potential for abuse.
Spam every academic email address with a link to "I thought you might find this paper interesting", and lock out every university in the world? It only takes one click by one unsuspecting victim per institution. I wonder what would happen if someone included that DOI in a publication's reference list. Watch which institutions go dark, and you know who is doing your blind review.
Undergrads won't open that link, and grad students and faculty are too busy.
To really do damage, you need to create something that'll go viral, and piggyback the link on it. Undergrads will click anything that looks fun. Make it look fun and you're golden.
If you're up for a little fraud, get a list of faculty and undergrad e-mails and send e-mails to undergrads purporting to be from faculty saying, "Please read this and be ready to discuss in class on Monday." The hit rate will be low, but you're bound to get a few students who really do have that professor's class on Monday and will click it.
Things like signatures on forums visited by university students are the first to come to mind. Embed it through one of the many referer-stripping redirectors out there, and no one would be the wiser...
Yep - the naughtiness you could get up to with this is almost endless. I get the feeling that this will not be with us for much longer as every university in the world is about to get locked out.
You're assuming the creator of the spider trap also put in CSRF protections. However, if you read the article, you'll see that it is only a link, with no CSRF protection at all. Therefore, it is quite true that spamming research mailing lists could lock a lot of people out. In fact, since this spider trap is triggered by a GET request, you would just need to embed the link as an image, script, or stylesheet and get the target to visit a page.
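To make that concrete, here is a minimal hypothetical sketch (not something anyone should deploy): because the trap fires on a bare GET, a page only has to reference the trap URL as a resource and the visitor's browser will issue the request for them.

<!-- hypothetical example: the visitor clicks nothing; the browser fetches the "image" automatically on page load -->
<img src="/doi/pdf/10.1046/9999-9999.99999" width="1" height="1" alt="">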
This sort of Slashdot-esque misdirection is not exactly appropriate on HN. It starts with black-holing universities with a link mislabeled as a response, but could quickly devolve into misdirections to the other black holes of Slashdot that (I assume) we really don't want here.
Not to mention the fact that if someone at a university does click the link, they could ruin any number of people doing research there. That includes undergrads adding a last-minute reference to their thesis and tenured professors about to cure cancer.
arXiv.org, back when it was still xxx.lanl.gov, had a similar trap. Yes, I clicked on it. It gave a warning of the sort "don't do this again, here's what's happening, if we see many more requests from your site then we'll shut off access."
I still remember that page. As a middle schooler who didn't know anything from anything, I found it perplexing. The site's got an xxx at the front, but looks like a legit government site from... wait, Los Alamos? Like from "Surely You're Joking, Mr. Feynman"? Oh jeez, I'm gonna get in trouble with the school...
Funny, we used to do this when I was working at arXiv.org. We had incessant problems with robots that didn't obey robots.txt so we needed spider traps to keep the site from going down.
That's some level of incompetence - the trappers, I mean. A half-arsed solution because they couldn't think of a better one. Off the top of my head, a registration system with abstracts and unlock-this-article links would be better.
I'm willing to bet that they provide site licenses, where everyone in an entire university's subnet range might have access. In an open access journal, it shouldn't matter, but many journals are hosted on the same few platforms, and the spider trap is a feature of the platform.
Tl;dr: a researcher is browsing the source code of a research paper's web page and finds a strange link (on the same ___domain). She clicks it and is informed that her IP is banned for automated spidering.
Apparently, this research site is meant to be open-access...
-------
Pandora is a researcher (won’t say where, won’t say when). I don’t know her field – she may be a scientist or a librarian. She has been scanning the spreadsheet of the Open Access publications paid for by the Wellcome Trust. It’s got 2200 papers that Wellcome has paid 3 million GBP for, for the sole purpose of making them available to everyone in the world.
She found a paper in the journal Biochemistry (that’s an American Chemical Society publication) and looked at http://pubs.acs.org/doi/abs/10.1021/bi300674e . She got that OK – then looked to see if she could get the PDF - http://pubs.acs.org/doi/pdf/10.1021/bi300674e - yes, that worked OK.
What else can we download? After all, this is Open Access, isn’t it? And Wellcome have paid 666 GBP for this “hybrid” version (i.e. the publisher gets subscription income as well). So we aren’t going to break any laws…
The text contains various other links and our researcher follows some of them. Remember she’s a scientist and scientists are curious. It’s their job. She finds:
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999">
<!-- Spider trap link --></a></span>
Since it's a bioscience paper she assumes it's about spiders and how to trap them.
She clicks it. Pandora opens the box...
Wham!
The whole university got cut off immediately from the whole of ACS publications. "Thank you", ACS.
The ACS is stopping people spidering their site. EVEN FOR OPEN ACCESS. It wasn't a biological spider.
It was a web trap based on the assumption that readers are, in some way, basically evil.
Now I have seen this message before. About 7 years ago one of my graduate students was browsing 20 publications from ACS to create a vocabulary.
Suddenly we were cut off with this awful message. Dead. The whole of Cambridge University. I felt really awful.
I had committed a crime.
And we hadn't done anything wrong. Nor has my correspondent.
If you create Open Access publications you expect - even hope - that people will dig into them.
So, ACS, remove your spider traps. We really are in Orwellian territory where the point of Publishers is to stop people reading science.
I think we are close to the tipping point where publishers have no value except to their shareholders and a sick, broken vision of what academia is about.
UPDATE:
See comment from Ross Mounce:
The society (closed access) journal ‘Copeia’ also has these spider trap links in its HTML, e.g. on this contents page: http://www.asihcopeiaonline.org/toc/cope/2013/4
you can find
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999">
<!-- Spider trap link --></a></span>
I may have accidentally cut off access for everyone at the Natural History Museum, London once when I innocently tried this link, out of curiosity.
Why do publishers ‘booby-trap’ their websites? Don’t they know us researchers are an inquisitive bunch? I’d be very interested to read a PDF that has a 9999-9999.9999 DOI string, if only to see what it contained – they can’t rationally justify cutting off access to everyone just because ONE person clicked an interesting link?
PMR: Note - it's the SAME link as the ACS uses. So I surmise that both societies outsource their web pages to some third-party hackshop. Maybe 10.1046 is a universal anti-publisher.
PMR: It's incredibly irresponsible to leave spider traps in HTML. It's a human reaction to explore.
Seems like an easy way for a university-based "conscientious objector" to have this issue addressed would be to intentionally click on the spider trap link once a day?
I work for a (non-profit) journal publisher, and we do indeed cut off robot downloading, but not after one click of a link. We analyze traffic to determine robot downloads. I suspect, though, that the entire university did not get cut off in this incident. Usually it is on a per-IP basis, and unless the university proxies all of its journal traffic through a single IP, which is not common, I think saying the whole university was blocked may be an exaggeration. I personally wish we had no robot monitor, but then we would get heavy spidering of large files.
We do have a CAPTCHA, too, before the block. Basically, to get blocked you have to really work at it. We also do not mind limited robot use for cases like downloading all papers matching a search term or author, but we do not want people downloading our entire corpus either. So throttling is not an option.
I think the approach mentioned in the article is definitely heavy-handed. When it comes down to it, at my place we are just trying to block the wget -r's of the world.
We had an internal wiki where the "delete article" link was a GET. Then someone wrote a crawler for it and deleted the entire wiki in 15 minutes. It was changed to a POST after that.
Heh, this reminds me of a story from many years ago at Google, where we got angry messages from some guy complaining that Google kept deleting all the photos from his online album.
We eventually figured out that his online album had an unprotected "delete this photo" endpoint reachable via GET, and no robots restriction! We had to fix the crawler to detect things like this...
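The underlying issue in both stories is the same: a crawler follows every link it can see, so any state-changing action reachable by a plain GET will eventually get triggered. A rough sketch of the difference (hypothetical markup, not the actual wiki or album code):

<!-- unsafe: a link-following crawler will "click" this and delete the page -->
<a href="/wiki/delete?page=SomePage">Delete this page</a>

<!-- safer: crawlers don't submit forms, and POST (ideally with a CSRF token) marks the action as state-changing -->
<form method="post" action="/wiki/delete">
  <input type="hidden" name="page" value="SomePage">
  <input type="hidden" name="csrf_token" value="...">
  <button type="submit">Delete this page</button>
</form>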
Plus it adds load to websites, so admins don't like it. A number of websites blocked users of the Fasterfox extension, which did prefetching.
Oddly, the Google cache version won't load for me either. The Google cache header is there, but the content area is blank, with the Chrome status bar saying "Waiting for blogs.ch.cam.ac.uk".
Looking at the source... there's something weird going on. I think maybe the _original_ page loaded its content with JavaScript, and the Google cached version is just the JS skeleton, waiting to load JS from the original (overloaded) site, which would actually load the content?
Ugh. The trend toward JS-dependent sites for simple content breaks the web, people.
The warning message returned by the spider-trap says that it banned a particular IP address. How does this cut off the entire university? Is everyone behind a NAT?
For licensing purposes, they'd need to be able to associate ranges of IP addresses with a specific institution. So if they want to, it's easy to block that whole license for one violation.
This is an important topic, but that blog entry was not very well written. If I hadn't already heard about this, I would have been very confused about what they were actually trying to say with this convoluted story.
1. Get a university with good ties to the ACLU and other such movements.
2. Subscribe.
3. Click the link.
4. Sue them for breach of contract and damages (they didn't deliver the content you paid for, and it damaged your main source of income: providing knowledge to paying students).
Sigh, did no one notice that the link is in a <span id="hide">? Look at the style sheet and note that the 'hide' style sets the link to be the same color as the background (making it invisible to humans), and yet it got clicked on anyway.
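Presumably the rule amounts to something like this (a reconstruction, not the exact CSS, assuming the 'hide' selector simply blends the link into the page background):

<style>
  /* hypothetical rule: draw the trap link in the background colour so humans never notice it */
  #hide a { color: #ffffff; background-color: #ffffff; text-decoration: none; }
</style>
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999"><!-- Spider trap link --></a></span>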
There are bad actors out there, they exploit services, and one of the ways services detect them is to create situations that a script would follow but a human would not. When they do something bad you've got a couple of choices: cut them off or lie to them (some of the Bing Markov-generated search pages for robots are pretty fun).
So she sends an email to the address provided, they talk to her, she gets educated, and they re-enable access. If it happens again, the issue gets escalated. It's the circle of fraud.
Some people also override site CSS with their own, which could well make a link that was intended to be hidden become visible. Most browsers I've used have that option.
True, I've done this for sites that had really slow loading webfonts and/or remote css pages. Although we really can't say what it looked like without the rest of the page contents. I don't suppose anyone has a page source dump.
I sometimes wonder if this sort of activity (honeypots for auto-banning) angers people because they feel they have a right to script the site, or because they feel bad about "falling for" the honeypot. Clearly it generates some emotion, though.
A link the same colour as the background can still be seen (e.g. if it's selected by accident, or by Select All), can still be clicked on whether it's seen or not, etc.
True enough, but we might expect those people to have some idea of what a spider trap is.
I definitely noticed that the tag was <span id="hide">, though... that raises all kinds of interesting questions, like "what if I want to hide more than one thing on the page?" and "if I'm going to do this, why not display: none instead of just adjusting the text color on the link?"
I sometimes get a similar message from Google (maybe it's due to the search queries I use...), but they provide a CAPTCHA so you can (reasonably) show that you're a human.
It's odd that at the top of the article the author claims Pandora might be a scientist or a librarian (but they won't reveal such things), and then later claims she looked at the hidden link because she was curious (because scientists are curious). Maybe someone should have re-read their text for consistency.
Trying to stop spidering or web scraping, or making it criminal, is asinine. If you don't want that, don't publish it online. Even if you put content up as a Flash or Java applet, someone will find a way to crawl/scrape it.
This goes against the nature of the internet and information; it is bound to be free.
You cannot simultaneously publish something and stop people from knowing what it contains. Expecting that you can is absolutely insane--literally insane--as in believing that P and not P can simultaneously have the same logical value.
This makes me furious. It isn't because the intent is malicious. That only makes me just a tiny bit angry. I am furious because the malice was implemented in the stupidest, most useless, laziest manner possible.
It's like keeping the neighborhood kids off your lawn by burying a pressure plate switch out there for the armed nuclear bomb in your garage. And then not telling anyone about it. And then inviting all the neighbors over for a croquet tournament.
Atypon has [a relatively small client list](http://www.atypon.com/our-clients/featured-clients.php). Compare it to [Highwire](http://highwire.stanford.edu/lists/allsites.dtl). I'd be willing to bet that all journals hosted with Atypon share this spider trap—even journals that are supposed to be open access where spidering should be OK.
Scientific publishing is weird. Source: I work in scientific publishing.