The US Library of Congress has put 25M items free online (sciencealert.com)
210 points by leephillips on May 28, 2017 | 40 comments



Meanwhile Elsevier, which is widely known for inhibiting scientific progress by setting incredibly high prices even for government-funded research papers, makes a move against SciHub [1] and LibGen [2] again [3].

[1] https://sci-hub.cc/

[2] http://libgen.io/

[3] https://torrentfreak.com/elsevier-wants-15-million-piracy-da...


I find scihub/libgen a very important project, actually. Mirrors should be set up (not an order, just a suggestion).


I've never heard of libgen but it's blocked by my ISP (in the UK)


Oh yeah so it is. https://www.hotspotshield.com/ (free) works well for that kind of thing.


Indeed.

    Access to the websites listed on this page has been blocked pursuant to orders of the high court.


SciHub has an onion site, scihub22266oqcxt.onion; I don't know about LibGen, though.


Here in Germany libgen is not blocked.


I added the raw .gz files to IPFS when Library of Congress announced this last week: https://github.com/ipfs/archives/issues/152

It's slightly more than 100GB and here it is: https://ipfs.io/ipfs/QmWSzgkftVrkh2859bGT44ahzoqcGhFkjsrQUtH...

(Note that the filesizes in the directory listing are all wrong -- that's the original index.html from loc.gov/cds/downloads/MDSConnect/)

This makes it a lot easier to use this dataset at e.g. hackathons, where a lot of people would simultaneously pester that LoC server, which already seemed pretty bandwidth-limited on its own when I downloaded the files.


You can pin it: `ipfs pin add QmWSzgkftVrkh2859bGT44ahzoqcGhFkjsrQUtHen9hVw9`

Or list it: `ipfs ls QmWSzgkftVrkh2859bGT44ahzoqcGhFkjsrQUtHen9hVw9`

Or copy it into your local filesystem: `ipfs get QmWSzgkftVrkh2859bGT44ahzoqcGhFkjsrQUtHen9hVw9`
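For anyone without a local IPFS node, here is a rough Python sketch of turning that hash into a public gateway URL, and of generating the three CLI commands above for any other hash. The gateway prefix is the ipfs.io one used in the links in this thread; the function names `gateway_url` and `cli_commands` are made up for illustration and are not part of any IPFS tooling.

```python
from urllib.parse import quote

# Public gateway prefix, taken from the ipfs.io links in this thread.
GATEWAY = "https://ipfs.io/ipfs/"

# Hash copied from the pin/ls/get commands above.
LOC_DUMP = "QmWSzgkftVrkh2859bGT44ahzoqcGhFkjsrQUtHen9hVw9"


def gateway_url(cid, path=""):
    """Build an HTTP gateway URL for an IPFS hash and an optional sub-path.

    Hypothetical helper: it only formats a string and contacts nothing.
    """
    url = GATEWAY + cid
    if path:
        # Percent-escape the path but keep the slashes between segments.
        url += "/" + quote(path.lstrip("/"))
    return url


def cli_commands(cid):
    """Return the three ipfs CLI invocations mentioned above for a given hash."""
    return ["ipfs {} {}".format(verb, cid) for verb in ("pin add", "ls", "get")]


if __name__ == "__main__":
    print(gateway_url(LOC_DUMP))
    for command in cli_commands(LOC_DUMP):
        print(command)
```

The gateway route is handy for a quick look at one file; pinning is the better choice if you want to help serve the data to others.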


Did you push them into the Internet Archive yet? If not, going to grab a beer and start iterating through your IPFS objects.


Go for it! :) I've never pushed anything to IA so far to be honest.



If someone wants to pay for / donate 100GB worth of bandwidth on a VPS somewhere, I'll do it.


scaleway: €2.99/m 200Mbit/s Unmetered bandwidth


Okay, so who's going to spot me 3 EUR? ;)


Could TransferWise you the $3.35 USD. That work?


Having never done this before I've got to have a look at the internetarchive tool first, but yes, that would work (I'd hate to take your money then not be able to deliver).


This thread is getting too long; how do I contact you?


This amount of usage falls well within most VPS companies' free trial / promo codes offerings and should cost you nothing. Use a throwaway email account and drop it after a month.

AWS will give you a whole year if you haven't tried them yet and the other popular VPS companies (DO, Linode, etc.) all will give you at least $10 startup credit. This is probably simpler and faster than figuring out how to receive <$4 from some random internet commenter.


Yes. This isn't the first time I've tried to contact toomuchtodo.

Anywho, the amount of data is actually ~19GB which is well within what I can upload with my home connection. Unfortunately the ia tool is failing for me: https://github.com/jjjake/internetarchive/issues/176

Also, it's not really about the $3, more that a tonne of "$3" projects really add up over a year or so.


Email sent.


I have been advised that the data already exists on archive.org.


Email me back. Would still like to buy you a beer for your troubles.


Oops, it turns out I somehow ended up with decompressed files, that's why the filesizes are so far off. I'm creating a new fixed dump.


The correct dataset (with gzip compression) is here: https://ipfs.io/ipfs/QmcvfB6pAqUfTnuAK8zFKVxbdhopnBPveJrDcy1...


Just in case anyone else is wondering: This is, as I understand it, 25M pieces of metadata, not 25M books, songs, movies, and treasures from the past.


It's something along those lines. It looks like the music is all "cover songs" :( one of the heavier metadata types.


What's the copyright? Would it be legal to unzip those and serve them directly, so archive.org or anyone else can make them more inviting for access?

I know you shouldn't look a gift horse in the mouth, but there's not even an index or a rough idea of what something like "Name Authorities" might mean. That's not what I call wide open doors; that seems more like doing some legally required minimum.


Files are 25 million bibliographic index files which were produced by US Federal employees, so yes, they're likely in the public domain as a result.

https://en.m.wikipedia.org/wiki/Copyright_status_of_work_by_...


Awesome, thanks!


I suppose you could do that and then wait for the C&D notice?

I can't imagine you'd get sued for putting online a copy of this if you comply with C&D notices, then again I'm very narrow-minded.

Regarding Name Authorities, this article should clarify it somewhat: https://en.wikipedia.org/wiki/Authority_control

It's basically an authority file maintained by the Library of Congress, which serves to define canonical identifiers for library-catalogued entities, like books and public figures.

The Library of Congress uses the MARC standard (developed internally) and that is the format of the Name Authorities files: https://en.wikipedia.org/wiki/MARC_standards


I wonder how much more extensive the release could have been were copyright laws not in the way.

Then there's the old question of whether the works under copyright today will ever enter the public domain, or if their copyright will be extended forever by future changes in copyright law.


Release is 25 million bibliographic index files and has nothing to do with copyright, since none of the data was ever covered by copyright protection.


They didn't have to limit their release just to bibliographic index files. If they wanted to, they could have released manuscripts, letters, newsletters, videos, or any other media they have. But they may have felt inhibited by copyright laws.

So my question is, had copyright laws not been an issue, how much more would they have released?

There is also the larger question of whether the value of copyright law outweighs the value of not having it, so that everyone can benefit from this treasure trove of knowledge.


I don't think these records meet the standard definition of 'media' anyway. This is really just data that can be used for cataloguing purposes and other media custodian/librarian applications.

Given that the LoC has made it their goal to archive at least one copy of everything, I think they are not quite the right people to fall into your anti-copyright cross hairs. However, I do strongly agree with your overall premises.


I don't think the op has anything against the LoC. More like they are lamenting that the LoC has its hands tied.


https://loc.gov/collections

The Library of Congress has put materials online for a number of years. American Memory was the first I became aware of:

https://memory.loc.gov/ammem/index.html

According to Wikipedia, it began in 1994.


For those who haven't ever worked with MARC record data before, there's a Python library called pymarc that provides a pretty easy interface:

https://github.com/edsu/pymarc
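To give a feel for what such a library handles under the hood, here is a minimal standard-library sketch of the binary MARC (ISO 2709) layout: a 24-byte leader, a directory of 12-byte entries, then the field data. It assumes UTF-8 encoded records and does not handle MARC-8 or MARCXML. The names `parse_record`, `split_records`, and `build_record` are hypothetical helpers written for this example and are not pymarc's API; the sample record is invented.

```python
FIELD_END = b"\x1e"    # field terminator
RECORD_END = b"\x1d"   # record terminator
SUBFIELD = b"\x1f"     # subfield delimiter


def split_records(blob):
    """Yield each record from a concatenation of binary MARC records."""
    pos = 0
    while pos + 5 <= len(blob):
        length = int(blob[pos:pos + 5])   # leader bytes 0-4: record length
        if length < 24:
            raise ValueError("implausible record length at offset %d" % pos)
        yield blob[pos:pos + length]
        pos += length


def parse_record(raw):
    """Parse one record into {tag: [value, ...]}.

    Control fields (tags 001-009) become plain strings; data fields become
    (indicators, [(subfield_code, text), ...]) tuples.
    """
    base = int(raw[12:17])                # leader bytes 12-16: start of field data
    directory = raw[24:base - 1]          # drop the directory's own terminator
    fields = {}
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        data = raw[base + start:base + start + length - 1]  # strip FIELD_END
        if tag < "010":
            value = data.decode("utf-8")
        else:
            indicators = data[:2].decode("ascii")
            subfields = [
                (chunk[:1].decode("ascii"), chunk[1:].decode("utf-8"))
                for chunk in data[2:].split(SUBFIELD)[1:]
            ]
            value = (indicators, subfields)
        fields.setdefault(tag, []).append(value)
    return fields


def build_record(fields):
    """Assemble a tiny record from (tag, payload_bytes) pairs, for demos only."""
    directory, body = b"", b""
    for tag, payload in fields:
        payload += FIELD_END
        directory += "{}{:04d}{:05d}".format(tag, len(payload), len(body)).encode("ascii")
        body += payload
    base = 24 + len(directory) + 1
    total = base + len(body) + 1
    leader = "{:05d}nam a22{:05d} a 4500".format(total, base).encode("ascii")
    return leader + directory + FIELD_END + body + RECORD_END


if __name__ == "__main__":
    # Invented sample data, not a real authority record.
    sample = build_record([
        ("001", b"n00000001"),
        ("100", b"1 " + SUBFIELD + b"aExample, Name," + SUBFIELD + b"d1900-1999"),
    ])
    for record in split_records(sample + sample):
        print(parse_record(record))
```

For the gzipped dumps, something like `gzip.open(path, "rb").read()` would produce the bytes to feed into `split_records`, though for files this size a streaming reader (or pymarc itself) is the saner choice.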


Can someone change the title to match the article? s/items/records/


Another nit -- paid for via taxes.



