Full-history English Wikipedia dump produced: 5.6TB uncompressed, 32GB 7z'd (infodisiac.com)
58 points by chl on April 14, 2010 | 32 comments



Doesn't include deleted articles, so no hope if you want to recover one of them. This is a pity since Wikipedia deletes too many articles.


That's talked about on and off, but one issue is that they'd have to filter deleted articles by deletion reason, at least broadly into "deleted for legal reasons" and "deleted for non-legal reasons" bins. There'd be no problem distributing a dump of articles deleted due to non-notability, but a dump of articles axed for copyright violation, libel, or other legal issues would be a problem.

For specific articles deleted for non-legal reasons (most commonly notability), you can get a copy from a WP admin. Some have volunteered themselves as willing to answer requests: http://en.wikipedia.org/wiki/Category:Wikipedia_administrato...


Surely articles have the reason for deletion posted on them, or on their talk pages, to allow for responses? Also, there should be some way for people to find out that an article has been deleted, so that they don't recreate it and repeat the error. Indeed, rather than deleting, couldn't a placeholder be implemented?

Wrong forum for these suggestions but I've never had the time or inclination to attempt to reach the Wikipedia inner sanctum.


Yeah, the deleted articles could probably be automatically filtered, at least on a fail-safe basis by only dumping the ones that have a known "not a legal issue" deletion reason, like "notability" (there are even semi-formalized deletion codes, itself a mild absurdity). There are probably non-technical / non-legal reasons people don't want to dump them, but there's also some of just a "not a priority" aspect. The dump reported in this story is actually the first successful full-history dump in quite some time, because the dump scripts were perennially broken / bogging down due to the size of the data / crashing due to MySQL weirdness. So most of the dump effort has been on just getting the official stuff out. Next up on the priority list will probably be some way of doing image dumps.

You do get a bit of a warning if you recreate a deleted page. When you go to the editing screen at the title of an article that was previously deleted, it'll show you the summary from the deletion log at the top, and ask you if you're sure you want to recreate it. There's also a "nothing can go here" protected placeholder used for articles that are persistently being recreated, which'll make it impossible to edit at that ___location.

Yeah, I can sympathize on the Wikipedia-inner-sanctum thing. I was actually pretty deeply into it (I've been an admin since '04, was formerly on the Arbitration Committee, formerly active on the mailing lists, etc.), but as the Policy And Process kept accumulating, I lost interest in navigating it, so am more on the periphery these days. It's probably inevitable that things would go that direction, because in the early days there were probably <100 Wikipedians active enough to form the Wikipedia Cabal, all of whom at least recognized each others' names, so stuff could be pretty informal. But it's hard to scale that up to a site with 1700 admins and 15k+ editors. A lot of things are kind of lame about how things are organized these days, but honestly I have no idea how I'd do it better; despite its flaws it's often still amazing to me that Wikipedia works at all.


Deleted articles are stored in the "archive" table. This may have changed, but the last time I checked, toolserver (http://toolserver.org/) users had access to that table.


> 5.6 Tb uncompressed, 280 Gb in bz2 compression format, 32 Gb in 7z compression format

Wow, I didn't know 7z was this much better than bz2. Is this the expected result, or is there something special with Wikipedia that plays to the strengths of 7z?


I'd guess it has to do mainly with 7z being able to use a larger block size, while bzip2's is 900 KB; and possibly being able to do something better with large runs of repeated text. There are large articles with hundreds of revisions in a row that leave most of the content unchanged; [[George W. Bush]], for example, is around 180 KB per revision, and is edited a lot, mostly with minor changes. Given bzip2's block size, it can only squeeze about 5 revisions into each block, so in the degenerate case where 100 edits in a row changed only one character each, bz2 would still be storing 20 or so basically identical copies of the article.

IIRC from some tests a year or so ago, Wikipedia hasn't found any significant improvement from 7z over bz2 on the current-revisions-only dump, which looks more like just normal English text; that's why it doesn't bother to provide a separate 7z version of that. It seems to only be this pattern of [200 KB article][almost the same 200 KB article][almost the same 200 KB article again] that 7z kills bz2 on.
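
If you're curious, you can reproduce the effect in miniature. Here's a rough Python sketch (synthetic text rather than real dump data, so the exact numbers are only illustrative) that builds 100 near-identical ~200 KB "revisions" and compares bz2 against LZMA, the algorithm 7z uses by default:

  import bz2, lzma, random, string

  # Fake a page history: ~200 KB of text, with one character changed per "revision".
  random.seed(0)
  text = ''.join(random.choices(string.ascii_lowercase + ' ', k=200_000))
  revisions = []
  for _ in range(100):
      i = random.randrange(len(text))
      text = text[:i] + random.choice(string.ascii_lowercase) + text[i + 1:]
      revisions.append(text)
  history = '\n'.join(revisions).encode()

  print('uncompressed:', len(history))
  print('bz2 (level 9, 900 KB blocks):', len(bz2.compress(history, 9)))
  print('lzma (preset 9):', len(lzma.compress(history, preset=9)))

bz2 can only exploit redundancy within each 900 KB block, so its output grows roughly with the number of blocks, while LZMA's dictionary spans the whole history and the near-duplicate revisions cost almost nothing extra.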


It would be cool to see how the 7z archive compares to rzip, which (I think) has the largest input window of them all (up to 900 MB). That software was written by Andrew Tridgell of rsync fame. Its major drawbacks are that it can't work on stdin/stdout and that it uses tons of RAM.


Does Wikipedia really store every single revision of every single file? As in, not deltas? Why is it done that way?


In the dump, I think for robustness and ease of extracting subsets.

Robustness: Having to essentially play back a log to recover any particular revision increases a chance of something eventually getting corrupted, and so it's somewhat safer to avoid it in something intended to be archival.

Ease of extracting subsets: For researchers, having the revisions be independent allows you to filter the XML dump through a SAX parser (or similar) to grab only revisions meeting particular criteria. If deltas were stored, you'd have to reconstruct those revisions from the deltas, which would make it really expensive to do things like, "I want to look at every article as it appeared at noon on April 1, 2007".
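
As a rough sketch of that kind of filter, using Python's built-in SAX parser (element names follow the schema posted further down the thread; the real dump adds an XML namespace, and the filename here is hypothetical), something like this prints every revision made before noon on April 1, 2007; picking the latest such revision per page is a small extension:

  import xml.sax

  class RevisionFilter(xml.sax.ContentHandler):
      def __init__(self):
          self.path = []        # stack of open element names
          self.title = ''
          self.timestamp = ''

      def startElement(self, name, attrs):
          self.path.append(name)
          if name == 'page':
              self.title = ''
          elif name == 'revision':
              self.timestamp = ''

      def characters(self, content):
          if self.path[-2:] == ['page', 'title']:
              self.title += content
          elif self.path[-2:] == ['revision', 'timestamp']:
              self.timestamp += content

      def endElement(self, name):
          self.path.pop()
          # ISO 8601 timestamps compare correctly as plain strings.
          if name == 'revision' and self.timestamp and self.timestamp < '2007-04-01T12:00:00Z':
              print(self.title, self.timestamp)

  xml.sax.parse('enwiki-pages-meta-history.xml', RevisionFilter())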

In the live DB, I think just because it's cheaper to get a ton of storage, esp. for rarely-retrieved old revisions, than to add the overhead of computing deltas and applying them to reconstruct revisions. In particular, you'd have to compute a diff for every edit in that situation, whereas currently MediaWiki only computes diffs when a user requests to view one from the "history" tab, which is a tiny proportion of all edits.


I understand it's simpler to store everything, and simplicity _is_ a virtue; but one could store the current revision plus deltas (and perhaps a few intermediate revisions for oft-edited articles) and get performance at least as good as the current approach. It would also save lots of space.


The article says Tb not TB, but in reality it appears to be TB. That's quite a difference. Still seems heavy for text, but I assume the full text of every revision is in it, not just diffs.


Yeah, every revision is standalone, which is why it compresses so well (obviously there are a lot of edits that make relatively small changes). One reason is to make it easier for researchers to grab specific revisions, e.g. run the dump through a filter returning only revisions as of June 1, 2006, without having to apply a ton of diffs to reconstruct those revisions.

The dump schema is something like:

  <mediawiki blah blah ...>
    <siteinfo>
      some metadata
    </siteinfo>
    <page>
      <title>Article Title</title>
      <id>15580374</id>
      <revision>
        <id>139992</id>
        <timestamp>2002-01-26T15:28:12Z</timestamp>
        <contributor>
          <username>_delirium</username>
          <id>82</id>
        </contributor>
        <comment>vandalized this page</comment>
        <text xml:space="preserve">Complete text of this revision of the article goes here.
        </text>
      </revision>
      <revision>
        ...next revision of this page...
      </revision>
    </page>
    <page>
      ...revisions of the next page...
    </page>
  </mediawiki>


Impressive...I wonder how big a content snapshot is, ie no article histories and no meta-material like talk pages or WP:xxx pages, just the user-facing content.

I was also sort of hoping to see from the stats what proportion of content was public-facing vs devoted to arguments between wikipedians...if you look at the stats for 'most edited articles' (accessible from the top link) it's interesting that of the top 50 most edited articles, only one, 'George W. Bush' is user-facing - and I suspect that only made it in because of persistent vandalism.

Still, with history and all included, there is some fabulous data-mining potential here, with which there's the potential to do some really innovative work. I'd hazard a guess that the size of Wikipedia already exceeds that of existing language corpuses like the US code...

/retreats into corner muttering about semantic engines and link free concepts of total hypertext as necessary AI boot conditions


> I wonder how big a content snapshot is, ie no article histories and no meta-material like talk pages or WP:xxx pages, just the user-facing content

I don't know how big it is uncompressed, but they do have a dump of just that part:

  2010-03-16 08:44:40 done Articles, templates, image descriptions, and primary meta-pages.
  2010-03-16 08:44:40: enwiki 9654328 pages (255.402/sec), 9654328 revs (255.402/sec), 82.9% prefetched, ETA 2010-03-17 03:08:26 [max 26568677]
  This contains current versions of article content, and is the archive most mirror sites will probably want.
  pages-articles.xml.bz2 5.7 GB


Well spotted. This has great possibilities for education in the 3rd world.


Perhaps this is a good time to point to this?

http://thewikireader.com/index.html


One wonders if this will be the first file fed into something approximating machine consciousness. I'm not sure where else you can easily get such a high quantity of fairly consistent human-interest data.

Quick question: what do "bot-edited" entries refer to?


Bots are used for very common editing operations, such as various kinds of cleanup.


I like the quick fix the site designer used to switch from a static layout to a fluid one.


Interesting, but I somehow doubt that many people have the setup to handle this amount of data.


That's because, despite what the 14 year olds on digg and reddit think, this isn't for you to download on your computer at your house. This is for archival or data-mining purposes.

I apologize for the minor insult at digg/reddit, I just remember a few years ago a link to the archive was posted on digg and everyone started downloading it...unnecessarily wasting wikipedia's limited and donated resources.


If that's a problem, they could have put it up with bittorrent and throttled the bandwidth.


Yeah, this is exactly the sort of problem bittorrent was designed to handle.


40 + 15 days to compress? How long would it take to decompress this thing?


7zip, like many other compression schemes, is optimized and designed so that decompression is typically (much) faster than compression.

The web page (http://www.7-zip.org/7z.html) states that the default "native" LZMA format decompresses at between 10 and 20 times the speed that it compresses.

So, 15 days / 15 is about one day to decompress, then.


LZMA is well-known for its decompression speed. This is one of the reasons it's a popular choice for filesystem compression. It's quite easy for LZMA to keep a pretty fair pace with a disk, so you get a pretty noticeable performance boost by adding LZMA at the filesystem layer, especially for read-heavy workloads.

Gzip usually gets a slightly better compression ratio, but at the expense of decompression speed, particularly on less compressible data (LZMA somehow seems to know better when to give up trying to compress). Bzip2 has the best compression ratio of the three, but is far too slow to compress and decompress, so you end up losing more time decompressing than you gained by doing less actual I/O.

EDIT: source, for those curious folk out there: http://portal.acm.org/citation.cfm?id=1534536 (caveat: the experiments were run by taking a large file on disk and compressing it to another large file on disk, so seek thrashing may have been an issue and I'm not quick to take the numbers for all they should be worth)


I'm pretty sure LZMA (7z) compresses better than bz2.

At least it does when I test it. But it's slower - sometimes much slower (depends on settings).


That might be true, though I don't know whether that depends on the entropy in the file. It is likely the case that one of them compresses text better but takes a larger performance hit with binary data or somesuch.

Turns out the paper I was recalling and referencing dealt with LZO, not LZMA, so maybe I have less to say about LZMA than I thought. Shows how much you jerks read before upvoting. ;-)


They also parallelize the compression (at least with bzip2) and run more than one dump process at the same time.


I imagine that the compression is only for saving time during transfer.

That, and hopefully you could decompress just what you needed, e.g. a tarball of compressed articles.


A lot of researchers just stream the decompressed stream directly from 7z into their analysis scripts. If you were to actually decompress to a giant XML file first, you'd both: 1. need a 6-TB drive; and 2. start getting disk I/O as a big bottleneck. In a lot of cases, the analysis scripts, rather than decompression, are the bottleneck anyway: 7z can feed you data faster than your XML parser and scripts can consume it.
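
For what it's worth, that setup looks roughly like this (a Python sketch; it assumes the 7z command-line tool is installed, and the dump filename is hypothetical):

  import subprocess

  # 7z's -so switch writes the extracted data to stdout, so the ~6 TB of XML
  # never has to hit disk; a real script would feed each chunk to a streaming
  # XML parser instead of just counting bytes.
  proc = subprocess.Popen(
      ['7z', 'x', '-so', 'enwiki-pages-meta-history.xml.7z'],
      stdout=subprocess.PIPE,
  )

  total = 0
  for chunk in iter(lambda: proc.stdout.read(1 << 20), b''):
      total += len(chunk)
  proc.wait()
  print(total, 'bytes streamed out of the archive')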



