
I have an offline copy of the entire (en)wiki on my disk; it's <100GB for images and 12GB for compressed articles (~30GB with the entire edit history). All other languages might double that, still nowhere close to even one TB.
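As a back-of-envelope check, here is a short Python sketch that just adds up the figures claimed above (these are the commenter's estimates, not official Wikimedia numbers, and the variable names are made up for illustration):

    # Rough arithmetic using the sizes claimed in the comment above.
    images_gb = 100          # "<100GB for images"
    articles_with_history_gb = 30  # "~30GB with entire edit history"

    english_total = images_gb + articles_with_history_gb
    all_languages_estimate = english_total * 2  # "all other languages might double that"

    print(f"English estimate: ~{english_total} GB")
    print(f"All-languages estimate: ~{all_languages_estimate} GB (well under 1 TB)")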




Wikipedia EN is 100GB[0] as an XML dump. You linked to the dump of the entire primary DB, which probably (and this is total speculation) includes all users, editors, edits, languages, usage statistics, and internal metrics.

I don't know if the quoted stat above includes pics, but I believe it is a text-only dump. You would have to read further to confirm. If they made their usage stats available, you could pull just the 10% of that data (10GB) that corresponds to the most frequented articles.

0. https://en.m.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

Edit: maybe you meant what you said and are in fact correct. From a user standpoint, the article text, supplementary material, and some edits are likely the most important parts. However, a massive dump of the entire site and infra would be about that large.
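If you want to check the current size of the compressed English-Wikipedia article dump yourself, a minimal sketch like the one below does the job. It assumes the standard Wikimedia dumps URL layout (dumps.wikimedia.org) and the usual pages-articles filename; treat it as illustrative, not authoritative.

    # Minimal sketch: HEAD request against the latest enwiki pages-articles
    # dump to read its Content-Length. URL follows the standard Wikimedia
    # dumps layout; exact filenames can change between dump runs.
    import urllib.request

    URL = ("https://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-pages-articles.xml.bz2")

    req = urllib.request.Request(URL, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        size_bytes = int(resp.headers["Content-Length"])

    print(f"Compressed pages-articles dump: ~{size_bytes / 1e9:.1f} GB")
    # Note: the ~100GB figure in the comment refers to the uncompressed XML;
    # the bz2 download itself is roughly 20GB.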



