The US Library of Congress has put 25M items free online

xvilka · on May 29, 2017

Meanwhile Elsevier, which is widely known for inhibiting scientific progress by setting incredibly high prices even for government-funded research papers, is making a move against SciHub [1] and LibGen [2] again [3].

[1] https://sci-hub.cc/

[2] http://libgen.io/

[3] https://torrentfreak.com/elsevier-wants-15-million-piracy-da...

agumonkey · on May 29, 2017

I actually find scihub/libgen a very important project. Mirrors should be set up (not an order, just a plan).

petepete · on May 29, 2017

I've never heard of libgen but it's blocked by my ISP (in the UK)

tim333 · on May 29, 2017

Oh yeah so it is. https://www.hotspotshield.com/ (free) works well for that kind of thing.

user5994461 · on May 29, 2017

Indeed.

    Access to the websites listed on this page has been blocked pursuant to orders of the high court.

xvilka · on May 29, 2017

SciHub has an onion site, scihub22266oqcxt.onion; I don't know about LibGen though.

vog · on May 29, 2017

Here in Germany libgen is not blocked.

lgierth · on May 28, 2017

I added the raw .gz files to IPFS when Library of Congress announced this last week: https://github.com/ipfs/archives/issues/152

It's slightly more than 100GB and here it is: https://ipfs.io/ipfs/QmWSzgkftVrkh2859bGT44ahzoqcGhFkjsrQUtH...

(Note that the filesizes in the directory listing are all wrong -- that's the original index.html from loc.gov/cds/downloads/MDSConnect/)

This makes it a lot easier to use this dataset at e.g. hackathons, where a lot of people would simultaneously pester that LoC server, which already seemed pretty bandwidth-limited on its own when I downloaded the files.

lgierth · on May 28, 2017

You can pin it: `ipfs pin add QmWSzgkftVrkh2859bGT44ahzoqcGhFkjsrQUtHen9hVw9`

Or list it: `ipfs ls QmWSzgkftVrkh2859bGT44ahzoqcGhFkjsrQUtHen9hVw9`

Or copy it into your local filesystem: `ipfs get QmWSzgkftVrkh2859bGT44ahzoqcGhFkjsrQUtHen9hVw9`
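If you'd rather drive those three commands from a script, here is a minimal sketch in Python. It assumes an `ipfs` binary is installed and on your PATH; the helper names `ipfs_command` and `run_ipfs` are made up for illustration and are not part of any IPFS library. The hash is passed in as a parameter, so it can be swapped for a different dump.

```python
import shutil
import subprocess

# Hash quoted in the comment above; pass a different one if the dump changes.
CID = "QmWSzgkftVrkh2859bGT44ahzoqcGhFkjsrQUtHen9hVw9"

# Hypothetical action names mapped to the ipfs subcommands shown above.
ACTIONS = {"pin": ["pin", "add"], "list": ["ls"], "copy": ["get"]}


def ipfs_command(action, cid=CID):
    """Build the argument list for one of the three commands above."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return ["ipfs", *ACTIONS[action], cid]


def run_ipfs(action, cid=CID):
    """Run the command and return its stdout (assumes ipfs is installed)."""
    if shutil.which("ipfs") is None:
        raise RuntimeError("ipfs binary not found on PATH")
    result = subprocess.run(
        ipfs_command(action, cid), check=True, capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the commands here; call run_ipfs() to actually execute them.
    for action in ACTIONS:
        print(" ".join(ipfs_command(action)))
```

Building the argument list separately from running it makes it easy to check what would be executed before pulling a large dataset.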

toomuchtodo · on May 28, 2017

Did you push them into the Internet Archive yet? If not, going to grab a beer and start iterating through your IPFS objects.

lgierth · on May 28, 2017

Go for it! :) I've never pushed anything to IA so far to be honest.

toomuchtodo · on May 28, 2017

For future reference then! https://github.com/jjjake/internetarchive

voltagex_ · on May 29, 2017

If someone wants to pay for / donate 100GB worth of bandwidth on a VPS somewhere, I'll do it.

tty7 · on May 29, 2017

Scaleway: €2.99/month, 200 Mbit/s, unmetered bandwidth

voltagex_ · on May 29, 2017

Okay, so who's going to spot me 3 EUR? ;)

toomuchtodo · on May 29, 2017

Could TransferWise you the $3.35 USD. That work?

voltagex_ · on May 29, 2017

Having never done this before I've got to have a look at the internetarchive tool first, but yes, that would work (I'd hate to take your money then not be able to deliver).

voltagex_ · on May 29, 2017

This thread is getting too long; how do I contact you?

tombrossman · on May 29, 2017

This amount of usage falls well within most VPS companies' free trial / promo code offerings and should cost you nothing. Use a throwaway email account and drop it after a month.

AWS will give you a whole year if you haven't tried them yet and the other popular VPS companies (DO, Linode, etc.) all will give you at least $10 startup credit. This is probably simpler and faster than figuring out how to receive <$4 from some random internet commenter.

voltagex_ · on May 29, 2017

Yes. This isn't the first time I've tried to contact toomuchtodo.

Anywho, the amount of data is actually ~19GB which is well within what I can upload with my home connection. Unfortunately the ia tool is failing for me: https://github.com/jjjake/internetarchive/issues/176

Also, it's not really about the $3, more that a tonne of "$3" projects really add up over a year or so.

toomuchtodo · on May 29, 2017

Email sent.

voltagex_ · on May 31, 2017

I have been advised that the data already exists on archive.org.

toomuchtodo · on June 2, 2017

Email me back. Would still like to buy you a beer for your troubles.

lgierth · on May 29, 2017

Oops, it turns out I somehow ended up with decompressed files, that's why the filesizes are so far off. I'm creating a new fixed dump.

lgierth · on May 30, 2017

The correct dataset (with gzip compression) is here: https://ipfs.io/ipfs/QmcvfB6pAqUfTnuAK8zFKVxbdhopnBPveJrDcy1...

themodelplumber · on May 29, 2017

Just in case anyone else is wondering: This is, as I understand it, 25M pieces of metadata, not 25M books, songs, movies, and treasures from the past.

gt_ · on May 29, 2017

It's something along those lines. It looks like the music is all "cover songs" :( one of the heavier metadata types.

wordupmaking · on May 28, 2017

What's the copyright? Would it be legal to unzip those and serve them directly, so archive.org or anyone else can make them more inviting for access?

I know you shouldn't look a gift horse in the mouth but there's not even an index or a rough idea what something like "Name Authorities" might mean. That's not what I call wide open doors; that seems more like doing some legally required minimum.

wonderous · on May 28, 2017

The files are 25 million bibliographic index files which were produced by US federal employees, so yes, they're likely in the public domain as a result.

https://en.m.wikipedia.org/wiki/Copyright_status_of_work_by_...

wordupmaking · on May 28, 2017

Awesome, thanks!

an27 · on May 28, 2017

I suppose you could do that and then wait for the C&D notice?

I can't imagine you'd get sued for putting online a copy of this if you comply with C&D notices, then again I'm very narrow-minded.

Regarding Name Authorities, this article should clarify it somewhat: https://en.wikipedia.org/wiki/Authority_control

It's basically a registry maintained by the Library of Congress, which serves to define canonical identifiers for library-catalogued entities, like books and public figures.

The Library of Congress uses the MARC standard (developed internally) and that is the format of the Name Authorities files: https://en.wikipedia.org/wiki/MARC_standards

pmoriarty · on May 28, 2017

I wonder how much more extensive the release could have been were copyright laws not in the way.

Then there's the old question of whether the works under copyright today will ever go into the public domain, or if their copyright will be extended forever by future changes in copyright law.

wonderous · on May 28, 2017

The release is 25 million bibliographic index files and has nothing to do with copyright, since none of the data was ever covered by copyright protection.

pmoriarty · on May 28, 2017

They didn't have to limit their release just to bibliographic index files. If they wanted to, they could have released manuscripts, letters, newsletters, videos, or any other media they have. But they may have felt inhibited by copyright laws.

So my question is, had copyright laws not been an issue, how much more would they have released?

There is also the larger question of whether the value of copyright law outweighs the value of not having it, so that everyone can benefit from this treasure trove of knowledge.

AndrewUnmuted · on May 29, 2017

I don't think these records meet the standard definition of 'media' anyway. This is really just data that can be used for cataloguing purposes and other media custodian/librarian applications.

Given that the LoC has made it their goal to archive at least one copy of everything, I think they are not quite the right people to fall into your anti-copyright crosshairs. However, I do strongly agree with your overall premise.

gjjrfcbugxbhf · on May 29, 2017

I don't think the OP has anything against the LoC. More like they are lamenting that the LoC has its hands tied.

brudgers · on May 29, 2017

https://loc.gov/collections

The Library of Congress has put materials online for a number of years. American Memory was the first I became aware of:

https://memory.loc.gov/ammem/index.html

According to Wikipedia, it began in 1994.

alphonsegaston · on May 28, 2017

For those who haven't ever worked with MARC record data before, there's a Python library called pymarc that provides a pretty easy interface:

https://github.com/edsu/pymarc
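To give a rough idea of what a library like pymarc takes care of, here is a stdlib-only sketch of the binary MARC 21 record layout: a 24-byte leader, a directory of 12-byte entries, then the field data. This is not pymarc's API; `split_records` and `parse_record` are hypothetical names, and the UTF-8 default is an assumption (some dumps use MARC-8, which this sketch does not handle).

```python
SUBFIELD = b"\x1f"  # delimiter that precedes each subfield code


def split_records(blob):
    """Yield raw records from a file's bytes, using the length in leader[0:5]."""
    pos = 0
    while pos + 24 <= len(blob):
        length = int(blob[pos:pos + 5])
        if length < 24:  # malformed length; stop rather than loop forever
            break
        yield blob[pos:pos + length]
        pos += length


def parse_record(raw, encoding="utf-8"):
    """Return a list of (tag, indicators, value) tuples for one record.

    Control fields (00X) carry a plain string as value; data fields carry
    a list of (subfield_code, text) pairs.
    """
    base = int(raw[12:17])        # leader[12:17]: offset where field data starts
    directory = raw[24:base - 1]  # 12-byte entries, minus the trailing terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3].decode("ascii")
        length, start = int(entry[3:7]), int(entry[7:12])
        # Slice the field out of the data area, dropping its 1-byte terminator.
        data = raw[base + start:base + start + length - 1]
        if tag.startswith("00"):
            fields.append((tag, "", data.decode(encoding, errors="replace")))
        else:
            indicators, *chunks = data.split(SUBFIELD)
            subfields = [
                (c[:1].decode("ascii", errors="replace"),
                 c[1:].decode(encoding, errors="replace"))
                for c in chunks if c
            ]
            fields.append((tag, indicators.decode("ascii", errors="replace"), subfields))
    return fields
```

For a downloaded dump you would decompress first, e.g. read the bytes via `gzip.open(path, "rb")`, and then feed each item from `split_records` to `parse_record`. For real work pymarc is the better choice, since it also deals with MARC-8 and malformed records.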

Mathnerd314 · on May 28, 2017

Can someone change the title to match the article? s/items/records/

dogruck · on May 28, 2017

Another nit -- paid for via taxes.