A 14-year-old could build 1998's Google using her Dad's credit card (paulbohm.com)
122 points by enki on Jan 16, 2012 | hide | past | favorite | 71 comments



'And if you'd wanted to use a hash table, if you even knew what a hash table was, you'd have to write your own.'

BSD's hash table code has been around for probably longer than the author has been alive.

Here is the FreeBSD version; it's very compact and works quite well: http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/db/hash/h...


You missed the temporal context of what I wrote:

"A few decades ago [...] if you'd wanted to use a hash table, if you even knew what a hash table was, you'd have to write your own."

I agree with you on the quality of the BSD code, and I'm glad such great code is readily available. But a) I definitely had been programming before 1990 (the copyright on that BSD code), b) back then hash tables were far less tightly integrated with programming languages than they are today and fewer people knew about them, and c) if you want to be pedantic, hash tables have been around since 1953, so well before most programming languages that are still in use today: http://en.wikipedia.org/wiki/Hash_table#History - they are, however, much more commonly understood, and in ubiquitous use, today!


'And if you'd wanted to use a hash table, if you even knew what a hash table was, you'd have to write your own.'

And just for the record, Common Lisp has had hash tables since 1984 (and I guess Maclisp had them before that), but earlier Lisp dialects had things like plists and alists.


Unless you were working in academia, you weren't using Lisp. (Probably.) I know I wasn't.


This post seems to miss the point that the major hurdle faced by a 14-year old trying to learn how to program and find the right libraries etc. to use is solved by Google itself.


good point! google definitely is part of what makes us so much more productive programmers today! (despite google codesearch going away)


I'm going to go off topic here and mention http://searchco.de/ which does try to replace the outgoing Google Codesearch.


Obligatory StackOverflow Reference


?


Redundant explanation that Google has itself been surpassed by $Site as a resource for programmers.


I'd wager SO's main source of traffic is Google


Since he has already written about it (88%), I don't think many people will take that wager :D

http://www.codinghorror.com/blog/2011/01/trouble-in-the-hous...



I thought Google's real innovation was their technique of using the interconnectedness of the web to determine the true value of content. So rather than only looking at the content of a page, they also look at the content from incoming links to that page. What package out there implements the algorithms for this, and is well-documented and trivial enough to use that a 14-year-old can understand them?

As far as I can tell, this article says 1) Shucks, hardware sure is cheap these days! and 2) There sure is a lot of software out there that you can mash together! Those things make it easier to start a company, but they don't provide the essential insights that make that company truly revolutionary.


I don't think the point is that the breakthrough idea of today is within the means of some real fourteen-year-old. The breakthrough idea of today is something that today's concepts, economics, and best practices are NOT well-suited to handle; otherwise it wouldn't be much of a breakthrough. The amazing thing is how quickly something has gone from the realm of obsessed genius to the realm of the mundane. It goes back to Whitehead's observation that, "Civilization advances by extending the number of important operations which we can perform without thinking of them."

"Without thinking" is an exaggeration for some of the items in the post, but consider the problem of storing 200GB of data. "Um... on a hard drive?" "And how will you finance that?" "Gee, maybe with the money in my wallet right now? When do these questions get hard?" Shucks, hardware sure is cheap these days! Problems simply disappear from being challenges to not requiring any thought at all. The exponential increase in the power of affordable hardware may not be surprising, but to me it seems worth thinking about even though it's been normal and predictable my whole life.


I've said this before, I'll try to sum it up as succinctly as possible:

Google's innovation was threefold: better search algorithms (PageRank), which did use the implicit data from the interconnectedness of the web to judge the relevancy and rank of search results; revolutionary data center ops (commodity hardware with heavy reliance on automation); and state-of-the-art software engineering (sharding, MapReduce, etc.). The last two enabled the first to run efficiently on a rather small set of hardware and to scale up speed just by adding more machines. The end result was better results, delivered faster, and at lower cost to Google.
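For readers who haven't seen it spelled out: the map/reduce pattern mentioned above is just "map records to (key, value) pairs, group by key, reduce each group," which is why it parallelizes so well. A toy single-machine sketch in Python (not Google's implementation, obviously):

```python
from collections import defaultdict
from itertools import chain

def map_reduce(inputs, mapper, reducer):
    # Shuffle: group every mapped (key, value) pair by key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(x) for x in inputs):
        groups[key].append(value)
    # Reduce each key's values independently - this step is trivially parallel.
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word-count example over two tiny "documents".
docs = ["the cat sat", "the cat ran"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
# counts == {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

The real system's contribution wasn't this logic; it was running it fault-tolerantly across thousands of flaky commodity machines.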

This led to a much better product for the end users (better/faster) and allowed them to acquire a huge portion of search marketshare quickly. But the low cost of operations meant that they could better take advantage of advertising (lower cost per search means that even lower revenue per search can be profitable).


What package out there implements the algorithms for this, and is well-documented and trivial enough to use that a 14-year-old can understand them?

Nutch[1].

Nutch doesn't deal with modern web spam particularly well, but I'd say it matched early Google pretty well. Specifically, it implements Page Rank, has a reliable web crawler and a web-scale data store.

[1] http://nutch.apache.org/about.html


Wow yeah, that actually looks like it would do the job. There's a part of me now that wants to implement a spam classifier on top of Nutch to see how good of a web crawler I can create… thanks for the link!


even if you had had the same brilliant insights into the graph structure of the web when they did, you most likely would have failed because it was prohibitively expensive (the cost in the article is probably underestimated by orders of magnitude). it's simply a fact that:

1) getting the data, 2) computing the eigenvector of a large matrix, 3) and serving that data to users, wasn't cheap in 1998. it's comparatively dirt cheap today.

not to diss larry and sergey's impressive achievement - they were brilliant and they pulled it off - but i think back then the game was so costly that a lot of brilliant people never made it to the starting line. it's cool to see that it's become a much more level playing field now. i'm curious what cool stuff we missed out on because of the people who didn't make it to the starting line!


Agreed, the notion of pagerank and doing search properly in a time when it wasn't even on the radar is completely missing from this article.

The real message is that servers are cheap, albeit brought forward in a long vague buildup, and hardly novel information.


By the end of 1998, Google had an index of about 60 million pages

Sounds like a marvelous challenge. Anyone have other similar "technological frontier then, high-school science fair project now" type challenges? OPer notes BioCurious as one. A major factor in education is walking kids thru a subject from basic principles to state-of-the-art, recreating historical milestones along the way.


AOL in the late 1990s, minus the dialup itself.

Content publishing: Weekend project. Rails, memcached and CloudFront and you're done.

IM and Buddy Lists: 1.5 million simultaneous users doing n^2 pub/sub-type distributed transactions.

Mail: 4,000 emails per second with live unsend and recipient read/unread status. I think PostgreSQL tops out in the millions of rows per second nowadays.

Web caching/acceleration: pick your favorite proxy solution and configure it.

Single sign-on: Form strategic partn-- Hey, you said technical challenge, not political.


Building mobile, handheld computer games (a 14-year-old did build an app game that got very high in the App Store).

Opening a web shop.

Building robots (at today's kids' levels).

Designing really complex and fast digital circuits (using FPGAs and IP blocks).

Building a global, scalable, and complex database application (using something like MS LightSwitch).


For extremely contrived definitions of "1998's Google," yes. But if all it took was a pile of servers and hard drives for 1998's Google to succeed, then a lot of other companies would have succeeded as well. It takes more than that to build a company.


(author here)

I was writing this more in the sense that kids at BioCurious (and the DIY Bio Movement in general) are doing electrophoresis to transfer DNA from glowing jellyfish to bacteria. This is just a few (two?) years after someone got a Nobel prize for that.

That's progress. If stuff that used to be hard falls into kids' hands, you're gonna see impressive stuff happening.

However, I fully agree that it takes more than that to build a company. (Also, I wouldn't try to compete with 2012 Google using 1998 technology.)


Just pointing out... Nobel prizes aren't given for cutting edge work, they're given many years later. People have been doing transfection of genes for decades.

The Nobel prize you're referring to was probably the one for GFP. Interestingly, a huge challenge in using GFP now is patent issues and thus money issues, rather than technical issues.


Fair enough. The title seems a bit link-baity, I think something along the lines of "the infrastructure of 1998's Google" would have been better.


I half disagree. If you're blogging then the point is to get that blog some eyeballs on it. Otherwise you write in a journal or don't make it publicly accessible or at the very least don't help it get indexed and never link to it.

I think there's link-bait and then there's LINK BAIT! (TM). It's a fine line between the two. You have to have a catchy, preferably keyword splattered, title or you become yet another blog no one cares about. I also think there's too much focus on the title when it comes to real link-bait. The really awful kind of link-bait is the kind that links to an article with very little to no content having anything to do with the title. In this case I think the article corresponded with the title enough for it not to be link-bait-style misleading. But that's me and there is no real answer. Just interpretations.


I wholeheartedly disagree. If you are blogging ideally you are doing so because you are injecting valuable insights or information into the world at large. The value is not to you that eyeballs are on your blog but to the eyeballs themselves.


I think the article was more about "Google the search technology" rather than "Google the company". It wasn't about startups or entrepreneurship but rather about technological progress.


This also means that search is now commoditized.

Google's value doesn't come so much from search any more (it's good at it, though there are now grumblings from the Googluminati), but from its advertising network (and the concomitant connections and contracts associated with it), and the value-added services built on top of Google's underlying search technology, to the extent that those leverage Google's base tools and/or expertise.

The chinks in Google's armor are starting to show though:

- Cheap and/or federated search is now available.

- OpenStreetMap is providing mapping data (and APIs) to rival Google Maps.

- There's a lot of grumbling going on over privacy, especially in the social and mobile spaces. Neither has quite fully coalesced, but if you look at the volatility in both spaces (consider what the largest social network and most popular smartphones were 5 years ago vs. today), things could again change quickly.

- Most tellingly, trust in Google to "not be evil" is eroding, rapidly in some quarters.

Google is valuable -- because it dominates advertising, and has the users to monetize that. Chip away at the user base and it could find its hegemony starting to fail.

The fact that it's very, very cheap to replicate Google's underlying tech helps with this. DuckDuckGo is essentially a one-man shop. Yes, it has a very small fraction of Google's traffic, but it compares favorably with everyone else who's tackling Google, including Microsoft's Bing, with ... more than one man-equivalent last I checked.


A pile of servers and a special algorithm. Now that the algorithm is published, rather than yet-to-be-invented, it would be very possible. So "Dad's credit card and a few late nights reading papers".


I think the heart of Google (at least at the get-go) was PageRank. Sure you had to write a web crawler, but that wasn't the magic sauce that made Google's search so good. I don't think most 14 year olds could understand the math behind PageRank, much less derive it from scratch.
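To be fair, the math behind PageRank boils down to a power iteration on the link graph. Here's a toy sketch (hypothetical four-page graph; damping factor 0.85 as in the original paper) that fits on one screen:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a link graph given as {page: [pages it links to]}."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page keeps a baseline share, plus damped contributions
        # from each page that links to it.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Hypothetical four-page web; an entry means "links to".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
# "c" collects the most inbound weight, so it ranks highest;
# "d", with no inbound links, ranks lowest.
```

The hard part in 1998 wasn't this iteration, it was running it (and the crawl feeding it) at web scale; this toy version also ignores dangling pages with no outbound links.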


I'm not sure whether this is applicable but my main objection with this article is that the numbers don't add up. How many Ph.D. candidates do you know who are granted a budget of $10k+ to do their research? Surely something else must have been going on to shrink the expenses to a more acceptable amount.

Then again, according to the Wikipedia page, the original BackRub was conceived when the web was only 10 million pages large; $2,000 is a considerably more acceptable figure for a Ph.D. project.


"The SDLP is notable in the history of Google as a primary sources of funding for Lawrence Page's and Sergey Brin (Brin was also supported by a NSF Graduate Research Fellowship) during the period they developed the precursors and initial versions of the Google search engine prior to the incorporation of Google as a private entity"

This included a $4,516,573 NSF grant (that didn't go to Larry & Sergey in full, but probably helped their project's infrastructure quite a bit).

http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=9411... http://en.wikipedia.org/wiki/Stanford_Digital_Library_Projec...

On the expense side I've probably actually underestimated the expenses by orders of magnitude. Bandwidth wasn't cheap back then and the storage requirements probably were significantly higher.


tl;dr version: Computers and disks are a lot cheaper now.

Basically the article boils down to this: what counted as a 'cluster' in 1998 is a single system in 2008, and what used to take hundreds of disk drives to store, you can store on one today.

Not particularly deep, but useful to think about from time to time. There is a quote, perhaps apocryphal, which says

"There are two ways to solve a problem that would take 1,000 computers 10 years to solve. One is to buy 1,000 computers and start crunching the numbers; the other is to party for 9 years, use as much of the money as you need to buy the best computer you can at the end of the 9th year, and compute the answer in one day."

The idea is that computers get more powerful every year, and that in 10 years they will be more than 1000x as powerful as the ones you would have started with, so a single machine can solve the same problem.
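The arithmetic behind the quote is plain exponential growth; a back-of-envelope sketch (the quote's "1000x in 10 years" implies performance doubling roughly every year, which is the assumption below):

```python
def speedup_after(years, doubling_period_years=1.0):
    """Hardware speedup after waiting, assuming a fixed doubling period."""
    return 2 ** (years / doubling_period_years)

# A job sized at "1000 computers for 10 years", revisited after 9 years of partying:
speedup = speedup_after(9)          # 2^9 = 512x faster hardware
remaining_years = 10 / speedup      # the old 10-year job now takes about a week
```

Stretch the doubling period to 18 months and the wait-and-buy strategy looks much worse, which is why the quote is sensitive to its Moore's-law assumption.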

Of course they haven't been getting as powerful as quickly as they once were, but the amount of data you can store per disk has continued to outperform.

The point is that if you are designing for the long haul (say 10 yrs from now) you can probably assume a much more powerful compute base and a lot more data storage.


That's not even close to what he's saying - I thought that was actually a rhetorical weakness, to tell the truth.

What he's saying is that the existence of the cloud and library advances such as MapReduce and APIs mean that the bar is lowered, when writing new software, to an extent it's hard even to comprehend.

Every time I get a module from CPAN I still get a shiver down my spine, remembering trying to do new and interesting things in the 80's and early 90's and every single time ending up trying to build a lathe to build a grinder to grind a chisel to hack out my reinvented wheel.


A bit off, but CPAN really hasn't changed all that much. I tried installing a module the other day, something simple like a word stemmer, and got so disgusted that I quit Perl.


Try writing it from scratch, whippersnapper. In C.

I guarantee you'll end up having to write a damn string library and garbage collection - and you'll get it wrong.


I'm 45.


Which explains your dismay at using new stuff. I'm 45, too. Fight it.

My point - which, as a 45-year-old programmer, you should have understood - was that modern languages and library repositories make a whole lot of basic work go away, so that we're working at a higher level than was possible in 1985.


Your last sentence is really lovely. Thanks.


> what used to take hundreds of disk drives to store, you can store on 1 today

Though, given that hard drives very much do not obey Moore's Law, a well-designed 1998 solution with hundreds of disks may well have far faster IO than the 2012 one-disk solution.


What really changed is that you don't need the 1998 approach to solve the same problem. A single SSD can beat 50 1998-era HDDs in terms of IOPS, storage, latency, etc. Your PC probably has more RAM now than you had HDD space in 1998, and CPUs have almost as much cache as a 1998 computer had RAM.

PS: A traditional HDD is hard pressed to break 200 IOPS; cheap SSDs easily manage 100x that, and you can break 100,000 IOPS for well under a grand. http://en.wikipedia.org/wiki/IOPS


Oh, IOPs, absolutely. Throughput, though, not so much.


This is an excellent example of link bait


Where does the 200GB figure come from? I was quite busy building a web crawler at the time too, and I can distinctly remember that our crawlers had about 17TB of storage. So let's say we had crawled something like 15TB of data to get a meaningful sample of the web.

I agree with the gist of the blog posting though.


In http://www.salon.com/1998/12/21/straight_44/ it says "Page says the current version of Google, which has indexed about 60 million pages, will continue to be improved as the company expands," and http://en.wikipedia.org/wiki/History_of_Google#cite_note-sal... gives: total indexable HTML URLs: 75.2306 million; total content downloaded: 207.022 gigabytes.


If you think a 14 year old could build something as complicated as 1998's google.com, think of what an adult with training could do at the same time with the same resources. As technology advances, so do our expectations.


Comparing 1998's problem set with today's tools is not a good comparison. The tools are cheaper but problem sets are also much bigger.


The problem sets are bigger only because our tools allow them to be.


There is also simply more information in the world now to index, because the internet has been around for longer.


This would require a 1998 Internet!


The author makes a great point about technology advancing so quickly that the bleeding edge of just yesterday looks cute compared to what we have now, and about how cheap a commodity server hardware has become.

Unfortunately he had to use the 14-year-old girl analogy and exaggerate the ease with which we could build Google circa '98 today. Now his whole point is lost to the click-clacking of a thousand pedants' keyboards. Guys, this isn't about 14-year-old girls, nor is it about Google per se, as much as it is about the fast pace of tech innovation, the ease and cost of acquiring infrastructure, and, to a lesser extent, a tiny bit about how we're totally spoiled compared to what we had to work with 14 years ago.

The stuff about Google and 14-year-old girls is just a literary tool (along with some mild hyperbole) to help illustrate his point, which so far is getting completely missed. Come on guys, is this Hacker News or Pedantic Literary Scholar News? Focus on the point, not little Google girls. PLSN does have a nice ring to it, but no, we're not on PLSN. At least not yet.


A 14-year-old could probably do it using her mom's credit card too.


I don't even understand the point of this post. I could have started Amazon.com at 22, but I didn't.


"Google" + "bleeding edge hard drives"

hehehe


So, just a gripe about your startup plug at the end of the article.

Look, I don't care whether your product cures cancer, dispenses oral sexual favors, and mints pure gold doubloons-- I will not give you my email address without a damned good reason.

Every single goddamn link on your page brings me to a "Enter your email here" prompt, except for the company tab, which brings me instead to a pile of vapid marketing bullshit.

Flotype Inc. is a venture-backed company building a suite of enterprise technology for real-time messaging. Flotype takes a unique approach by building developer-friendly technologies focused on ease-of-use and simplicity, while still exceeding enterprise-grade performance expectations.

Flotype licenses enterprise-grade middleware, Bridge, to customers ranging from social web and software enterprises to financial and fleet management groups.

What does that even mean? You using carrier pigeons? Dwarves? Cyborgs? UDP? ZeroMQ? Smoke signals?

You don't even tell me how my email is going to be used.

Fix your shit.


^ This kind of post just drags HN down, and is the kind of thing that jacquesm was talking about. Seeing something this rude at the top of HN for a post this guy worked hard on is probably not what he expected, and made his lunch taste a little worse today.

There's a time and place for profanity/verbal hostility. Feedback to a stranger on website UX isn't it, the perceived intensity level and level of anger is just dialed wrong. I wish pg would implement a filter for this kind of comment.


TL;DR: angersock seems jaded, he expected a blog post and got an MVP plug / blatant marketing post.

Honestly, I think it's a reaction to "Minimum Viable Product" overkill on HN.

The first 10 times it's OK. The next 50 times it gets less interesting. Once you're into three figures it really starts to grate. So you start to skip the MVP-style posts. Which means those making MVP posts have to turn to a different strategy: the "interesting headline" blog post, to drive traffic to their site.

Oh, and I think that the older people here (and at 31 I'm probably one of them) are turned off by really blatant marketing.


Yeah, the filter is voting up or down.


I don't see any hostility. The comment was certainly not polite, but I wouldn't say it was unfriendly.


To be fair, his name is "angersock." (I jest, I jest - thanks for calling him out).


^This kind of post clutters up discussions with metabullshit already accounted for by the karma system.

More seriously, this is not a mere UX problem. This isn't a problem with colors not matching, with poor navigation, or with anything else.

Absent any other information, this site appears to be a way of fishing for email addresses. That's the long and the short of it.

I am not just a string to send messages to. I am not just a networking opportunity. I am not just an entry in your preferred database.

I am a developer, and I don't like it when sites treat me otherwise.

I thanked the author for his (very fast) response.

I'm sorry about the tone of the post, but frankly we can't let this dehumanization and arrogance towards users (and worse in this case, fellow developers) slide.

EDIT: Note also that, had he simply posted a good article (which it was!) without the shameless plug, I would've said nothing. If the plug had linked to a page that had anything other than email scraping, I wouldn't have complained. But the linked page was so offensive that it deserved calling out. Let this be a lesson for you startupy folks: don't cheapen a good thing with a bad plug.


Then don't give him your e-mail address.


Which is an example of a "dehumanizing and arrogant behavior towards a fellow developer": cursing them out personally in a comments thread or collecting an email address like every other site in the world?


Hey man, at this point whether you're right or wrong doesn't matter. You may very well be the rightest man on earth about his site, but this is neither the time nor the place for that kind of talk. You're so off topic you might as well be on another planet right now, for starters. Then not only do you post something totally unrelated to the topic of the post or the discussion at hand, but you proceed to insult the author; your language was out of line, and your tone was as if the guy had just broken into your home and strangled your dog.

Everyone here has something we could all criticize really harshly. I've got crappy UX design in some of my sites, I have personality flaws, you probably have some site with something very annoying to a user, etc., etc. but if I noticed it while reading something of yours posted here I wouldn't barge into the comments and call you out on it when it has nothing to do with the content of the article just like I wouldn't want you to do that either. If you absolutely must say something then let the author know privately and nicely. Good manners go a long way.

My intention isn't to sit here and say "hey, look at this asshole, let's all pile on!". Not at all. There was just a discussion about this sort of thing last night (see the comments on the 16yo Indian girl who passed away for the reason that discussion was necessary), so it's up to everyone to try to stop it. Anyway, I'm sure you're a nice guy. I'm a nice guy. We're all nice guys, so let's all chill and be friends, okay? And I know I've probably said some asinine and/or dickish things in the past, so I hope someone tells me when I've gone too far too.

Okay. Good. We're all friends again. Let's move on.


sorry, we didn't get to making a better site yet, but we'll definitely address your concerns when we get to it!


Thank you.

:)

If possible, at least have a dev write up some use case or sample code or something we can see to get an idea of what Bridge does.

Thanks!

(My team's website is rather bad right now, but at least it has direct download links without asking for emails. I feel your pain on the web stuff, though, when you've got code to hack.)


I'm guessing it means the site isn't done, but they'll contact you when it is.


Thank you for articulating what was exactly on my mind.



