There are a few differences. We don't use SQL in the routing chain; we use a regex to pick out the site name and then serve from a directory of the same name (this is NOT as bad as it sounds: most filesystems handle this quite well now, and it would take MUCH more than half a million sites to hit a bottleneck).
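For the curious, here's a minimal sketch of what that kind of regex routing looks like in nginx (hypothetical directory layout and ___domain, not our actual config):

    server {
        listen 80;
        # capture the subdomain into $sitename, e.g. foo.example.org -> "foo"
        server_name ~^(?<sitename>[a-z0-9-]+)\.example\.org$;

        # serve straight out of a directory named after the site
        root /var/www/sites/$sitename;
        index index.html;

        location / {
            try_files $uri $uri/ =404;
        }
    }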
DRBD is also a little hardcore for my tastes. Nothing wrong with it, I just don't know it well, and I don't like being dependent on things I don't know how to debug.
An alternative I wanted to show uses inotify, rsync and ssh combined into a simple replication daemon. It's obviously not as fast, but if you enable persistent SSH connections, it's not too bad. If it screws up, you can just run rsync.
Rumor has it the Internet Archive uses an approach not too far away from this for Petabox. Check it out if you're looking for something a little more lightweight for real-time replication:
https://code.google.com/p/lsyncd/
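If you want a rough idea of what lsyncd looks like in practice, here's a minimal sketch (hypothetical paths and hostname, not a tested production config) that watches a local docroot with inotify and pushes changes to a standby over rsync+ssh:

    settings {
        logfile    = "/var/log/lsyncd.log",
        statusFile = "/var/log/lsyncd-status.log",
    }

    sync {
        default.rsyncssh,
        source    = "/var/www/sites",        -- local directory watched via inotify
        host      = "standby.example.org",   -- replica reachable over ssh
        targetdir = "/var/www/sites",        -- path on the replica
        delay     = 1,                       -- batch filesystem events for ~1 second
    }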
We're still working on open sourcing (!) our serving infrastructure, so eventually you will be able to see all of the code we use for this (sans the secrets and keys, of course). I've just been having trouble coming up with a good solution for doing this. For now, enjoy the source of our web app: https://github.com/neocities/neocities
> Nothing wrong with it, I just don't know it well, and I don't like being dependant on things I don't know how to debug.
Yep, this is basically our approach as well.
We've been using DRBD for quite a long time now on our Git fileservers (which also run in active/standby pairs - in fact, they look a lot like our Pages fileservers), so we have quite a lot of in-house experience with it and it's a technology we're pretty comfortable with. Given this, using it for the new Pages infrastructure was a pretty straightforward decision.
This exchange is wonderful, and absolutely what I'd expect out of you two, and maybe I'm just in a bad mood, but it stands in such contrast to the way I often see technologies discussed online.
This kind of thing is the way engineering should be. Kudos.
I make use of Lua and Redis for handling a few million redirects and have been happy with it so far. I never considered MySQL due to performance concerns.
3ms for connection setup + auth + query seems reasonable. Are you using persistent DB connections? Any other mods? What sort of timeouts have you configured for DB connections?
Here's our current nginx config on the proxy server. I've got the DDoS pseudo-protection (there's another layer upstream) and caching turned off right now because we're working on something, but this is basically it:
Critique away. As you can see, we've just barely avoided pulling out the lua scripting.
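For anyone who hasn't played with this, the rate-limiting and caching pieces look roughly like the sketch below (generic names and values for illustration, not our actual config):

    # rough sketch of per-IP request limiting + proxy caching
    limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=sites:50m max_size=10g inactive=60m;

    server {
        listen 80;
        server_name ~^(?<sitename>[a-z0-9-]+)\.example\.org$;

        location / {
            limit_req zone=per_ip burst=20 nodelay;   # crude DDoS pseudo-protection
            proxy_cache sites;
            proxy_cache_key $host$uri;
            proxy_cache_valid 200 5m;
            proxy_pass http://fileservers;            # upstream pool defined elsewhere
        }
    }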
The next step for me would probably be to write something in node.js or Go. There's probably a lot of people cringing at that thought right now, but it's actually pretty good for this sort of work, and I'd really like to be able to do things like on-demand SSL registration and sending logs via a distributed message queue. Hacking nginx into doing this sort of thing has diminishing returns; we're kind of at the wall as it is.
> We're still working on open sourcing (!) our serving infrastructure, so eventually you will be able to see all of the code we use for this (sans the secrets and keys, of course).
I just want to applaud what you've been doing with Neocities -- when the project started, I thought "Oh, nice." -- but not much more -- but I love the fact that you've kept at it, and your approach to openness is great (pending infrastructure code notwithstanding). I especially like your status page:
(Which I found from your excellent update blog-post[1] -- but I think it could be even more discoverable. It's not linked from the donate/about pages?)
I hope your financial situation improves -- and still, I wonder how (almost) half a cent of revenue/month compares to most ad-funded startup sites? While you'll need... a "few" more to reach your goal -- actually, you'd "just" need 43x as many users at the same revenue per head to get there :-)
We used a master/master DRBD setup at a previous company, and it was kind of a pain to work with. We had a fairly extensive document for solving split-brain problems.
I imagine the problems with DRBD mostly disappear if you're using it properly though, master/slave setups probably work really well.
This factored in for me. Neocities is two people. We don't have the budget yet to hire an ops team, so we need to use parts that we can understand without a lot of mental investment. DRBD is definitely something you need to invest in. Github obviously doesn't have our budget constraints and can hire the people needed to really own problems like this.
I also am pretty conservative on engineering choices generally, and the "superfilesystems" (DRBD, Gluster) feel a little monolithic (read: not very unix) to me. It's not that they're bad, it's that they're solving a lot of hard problems, and there's a lot that can go wrong when you have to do that, and if something happens, you're the one that has to fix it.
I'm not religious about "do one thing and do it well", but SSH handles the transfers, rsync does efficient copying, and inotify fires events on file changes. Put them together and you've got a very "unix" solution. It's more or less an event-driven script that sits on the stable solutions to hard problems. If something goes wrong, you just run rsync.
I can't say enough how awesome OpenSSH is. I want to use it for pretty much everything. It's a work horse that really hauls.
lsyncd seems like a really cool project. However, in practice I used it to replicate a docroot across 3 servers, and it actually got out of sync pretty often.
I have a scheduled rsync job that periodically checks for any inconsistency. It plays nicely with any updates that come in while it's doing its work, so that's been our fallback in case things get out of whack.
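The fallback itself is nothing fancy; something like an hourly crontab entry (hypothetical paths and host) is enough to reconcile any drift:

    # hourly reconciliation pass from the primary to the standby
    # (--delete removes anything on the standby that no longer exists on the primary)
    0 * * * *  rsync -a --delete /var/www/sites/ standby.example.org:/var/www/sites/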
All of GitHub Pages was run off of _one_ server (with one failover standby)?
That's pretty amazing. If all you're serving is static assets, apparently you have to grow to pretty huge scale before one server will not be sufficient.
I'm curious if there was at least a caching layer, so every request didn't hit the SSD. They didn't mention it.
We do have Fastly in front of *.github.io, but there's still a significant amount of traffic (on the order of thousands of requests per second throughout the day) that make it through to our own infrastructure.
We don't do any other caching on our own, although the other replies are correct in that the Linux kernel has its own filesystem cache which means not all requests end up hitting the SSD.
Invalidating quickly is the main problem with caching on a CDN. We use an async worker via Redis pubsub to call for expiration of individual files locally on the proxy servers we run. I'm looking at using NSQ for this in the future. One interesting solution is to just use an HTTP hit to expire a cache, which you can see a flavor of in our nginx config file. Nginx in effect becomes its own cache SoA. We needed a special nginx module to make that work; default nginx only lets you expire the entire cache.
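A rough sketch of that kind of worker in Python (hypothetical channel name and purge endpoint; it assumes a cache-purge ___location like the nginx module mentioned above provides):

    # sketch of an async cache-expiry worker: listen on Redis pubsub and
    # translate each message into an HTTP PURGE against the local proxy
    import redis
    import requests

    r = redis.Redis(host="localhost", port=6379)
    pubsub = r.pubsub()
    pubsub.subscribe("cache-expire")          # hypothetical channel name

    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        path = message["data"].decode()       # e.g. "/sitename/index.html"
        # PURGE endpoint exposed by the cache-purge module on the local nginx
        requests.request("PURGE", "http://127.0.0.1" + path, timeout=5)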
Fastly probably has something similar. You've really got to do this within 5 seconds or your user is going to get pissed off waiting for it every time they save/reload the page.
The big problem with handing caching off to third-party CDNs is that they need to be able to handle your SSL certs inline for static files whose URLs you can't change (because changing the URL means changing published content you don't control), and they need to support wildcards.
In effect you're doing what Cloudflare does. I'd use Cloudflare in a minute for this, but wildcards require their commercial plan and we can't afford it ($6k/mo). Also, we would need fine-grained cache expiration.
If Fastly does this and is priced in our range, maybe we should talk to them. :)
FWIW, Fastly invalidates objects, services, and batches of objects globally in ~150ms via an API. They have a different approach to wildcard/batch invalidation: it groups objects based on a header set ahead of time, which can be programmatically derived from URL paths (or a bunch of other things).
That would work for Neocities. Wildcard pricing is unfortunately a little out of our budget, but if you were interested in coming to an arrangement with us (we could say powered by Fastly on the site, for example) in exchange for a reduction on that, let me know. :) [email protected]
Are the assets extracted from git and dumped on the filesystem, or are they fetched from the Git objects? I.e., if I generate a git tree object with millions of files, but only pointing to a handful of blobs (think a maze), would this explode on the filesystem?
I believe that only CNAMEs use the CDN for custom domains (naked domains do not) [0]. So that rules out a chunk but there will be lots of cache misses too. Maybe someone from GH can confirm.
I believe this is automatically done by the operating system. Nginx usually performs some kind of mmap of the requested file, and the virtual memory management will automatically take care of caching.
BTW, I'm often surprised how people are afraid of opening files from their scripts, as if they think this will always lead to disk access. Then they start implementing a hand-written caching layer on top of that, which usually performs worse than what the OS already offers.
If it's a static file with no dynamic compression (nginx allows the use of pre-compressed assets) or SSL, nginx doesn't even have to open the file, it just uses the sendfile system call.
Correct. Static file serving is amazingly efficient. Even the filesystem caches for you automagically with whatever RAM you're not using. There's even a kernel function for speeding up file transfers, that's how optimized this stuff is: http://man7.org/linux/man-pages/man2/sendfile.2.html
My rough estimates are that Neocities can handle 20-50 million sites with only two fileservers and a sharding strategy. So double it for each shard, and you've got a solution that actually scales pretty well.
It's not that surprising. Serving static files is pretty easy and lightweight (part of why I like the movement towards SPAs), and there's been 20+ years of work done to make it fast. We've been conditioned for years by dynamic application servers that perform multiple DB queries, external service calls, etc per request. It doesn't have to be slow or inefficient.
I'm amazed that GitHub Pages ran on just two servers (well, aside from the MySQL clusters). That is absolutely incredible given the sheer number of projects and people who rely on it for their sites (me included!). I love the philosophy behind Pages of abstaining from over-engineering and sticking to simple, proven solutions. It's a great service and I'm a massive fan.
It's cool that they were able to do this with two machines, and I don't want to detract from that, but it's probably worth pointing out that a "machine" is not a very useful unit of capacity. These two machines could be dual core i5s or they could be 20 core xeon boxes with hugely varying amounts of memory and storage. Too bad they don't clarify, I'm curious.
> The fileserver tier consists of pairs of Dell R720s running in active/standby configuration. Each pair is largely similar to the single pair of machines that the old Pages infrastructure ran on.
As an ops guy, I'm going to remember this article and use it as a talking point for the benefits of static pages when devs want to put everything into their dynamic framework of choice.
It's a bit of a trade-off... static generation of dynamic content isn't so different from front-side caching. If the bulk of your requests can be cached for 30+ minutes, you can set up some pretty decent caching rules on a couple of servers fronting a dynamic backend that doesn't need to be too big to support many requests.
For the most part, the load tends to come down to less-than-optimized database storage and queries on the backend, often duplicated for every page load for content that could easily be offloaded to the client browser, constructed client-side, and requested on demand.
It's a matter of striking a balance... In this case the content changes very infrequently, and having a publish step to static storage makes sense. It really depends on one's needs.
If you have a million pages you'd need to regenerate daily for a single site, and thousands every hour, then you might think differently.
3. Use the distro's default Apache connection limit of 20, despite the server being able to handle thousands of concurrent connections. The limited connections are then quickly used by users with slow connections.
Well done GitHub. Also a special mention to the invisible workers making nginx such a cornerstone of the modern infrastructure, it's a project that I don't hear about often, probably due to the fact that it's not the sexiest piece of technology, but it really seems solid and battle-tested. Kudos.
I've been using GitHub pages for a while now and I always wondered why they had the "your site may not be available for another 30 minutes" message on creating a new GitHub pages site while pushes to an already-existing gh-pages branch were displayed instantly. Neat to see that explained here.
> We also have Fastly sitting in front of GitHub Pages caching all 200 responses. This helps minimise the availability impact of a total Pages router outage. Even in this worst case scenario, cached Pages sites are still online and unaffected.
Only tangentially related, but I sometimes wonder if GitHub's management regret making GitHub Pages available for free, now that it's being used so heavily for personal and even business blogs, rather than just companion sites for open-source projects. They could be charging for static websites, as Amazon S3 does.
I never heard anyone gripe about it... not even once. The cost is pretty negligible, and there's a lot of halo benefit (i.e., you just get more people involved on GitHub the platform itself).
The fact that a lot of non-technical employees in marketing and other fields are using it for corporate blogs is actually a nice bit of pressure on the organization to make Pages and web editing even simpler for those users. It becomes harder to lean on "oh it's a developer site so they'll figure it out".
Mostly, though, I think it's just a matter that we wanted it for ourselves. It's pretty awesome from an industry bystander's perspective to have something free, simple, and static, so we can all benefit from more stable docs, blogs, and so on. Maybe that'll change in the future and something Totally Different will change the industry, but for right now I think it's pretty rad, and totally worth the investment.
If it was able to run off a single server for that long, I suspect the goodwill it engendered (as well as the familiarity with Git/Github.com it built in a lot of people) was well worth the minimal resources and cost it entailed.
I doubt it. After seeing how simple the architecture is to run it, I'm sure it's a drop in the proverbial bucket. Pages drives traffic to the site and in order to serve a site from a private repo you have to be a paying customer anyways.
Nice! Does HN still run off of a single server and CDN too?
The CDN is key here, which you get if you use a CNAME (or ALIAS) instead of an A record for your custom ___domain on GH pages. I've found pairing pages with CloudFlare works great if you want to use a naked ___domain and you get HTTPS too. You can set up a page rule on CF to redirect all HTTP to HTTPS as well.
It's time for GitHub to start offering some basic hosting infrastructure for small projects, a light Heroku, at least for JavaScript (which kind of already works).
I'd pay extra for that. I (we all) have a bunch of personal sites, landing pages, marketing sites and tiny side projects that I'd love to not have to deal with hosting for – I think they'd make a killing, but I also think it must be in the works.
Also the mental jump. Just because something is easy for them to do doesn't mean it is worth the distraction cost.
Github builds tools for developers. Atom, chat (abandoned), Pages, Gists, and github.com all fit within this. They tie into how teams operate. Serving JS is tangentially related — certainly something a web developer does — but not really core to their mission.
I'm attuned to the demand level via our research. IMHO it's not worth it for their business size and growth plans. Their other initiatives are far more lucrative. There's also a lot of competition at that level.
I think of this more as a convenience feature for their existing business that adds value. They use this instead of the "project page" design that the other code sites used, and it gives their users more control over the presentation. Which is awesome, and feeds into my conspiracy to get everybody to know and use HTML for presentation. :)
Great summary of your architecture. Thanks for sharing!
A few questions:
- Is everything in the same datacenter or in different datacenters? What happens if the datacenter is unavailable for some reason? Are data replicated somewhere?
- You moved from 2 machines to at least 10 (at least 2 load balancers, 2 front ends, 1 MySQL master, 1 MySQL slave and 2 pairs of fileservers). That's a lot more. Do you need more machines because you need more capacity (to serve the growing traffic) or just because the new architecture is more distributed and requires more machines by "definition"?
- I understand the standby fileservers are idle most of the time: reads go to the active fileserver, and only writes are replicated to the standby. Am I understanding correctly? If yes, it looks like "wasted" capacity?
Something I would really like is to be able to set the custom MIME type for an app cache manifest file. That way you could easily host offline web apps from GH pages. Anyone know a way to do this without using S3 or similar?
You shouldn't need to specify a custom one. GitHub Pages will automatically serve the file with the appropriate mime type given its file extension. Here [1] is the list.
Seems odd to me that the router hits a MySQL database on every single request rather than just hashing the hostname as the key for the filesystem node.
Hash-based partitioning has a big problem: when you change the number of hash buckets, all of the data moves around. Eventually you'll need a lookup-based partitioning scheme. You also probably want control over where some users live, since you don't want two super-hot users on the same server.
My thought too. If they have half a million sites, and only 1% are hot enough to need deliberate placement, then they only need to store five thousand special cases, which is a few megabytes of memory, easily stored on each load balancer and loaded at boot.
Use hashing for the rest. Ideally consistent hashing, or rendezvous hashing, which I just read about on Wikipedia, so it must be good.
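Rendezvous (highest-random-weight) hashing is genuinely simple. A toy sketch in Python, just to show the idea (illustrative only, not anything GitHub or Neocities actually runs):

    # rendezvous (HRW) hashing: each site goes to whichever server scores highest
    import hashlib

    def pick_server(site, servers):
        def score(server):
            return hashlib.md5(f"{site}:{server}".encode()).hexdigest()
        return max(servers, key=score)

    servers = ["fs1", "fs2", "fs3"]
    print(pick_server("somesite", servers))
    # Adding "fs4" only moves the sites that now score highest on fs4,
    # instead of reshuffling nearly everything the way hash-mod-N does.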
I threw out that prototype soon after the talk. At the time, there weren't a lot of other engineers at the company doing Erlang, so maintenance was considered to be a long-term problem. I'm glad we made that call.
There are so many presentations of the form "We're using Erlang/Scala/whatever, it's so awesome!", but so few followups when they give up on the idea for production..
It's hard to sustain some alternate technology in the face of common knowledge. Rarely does technical advantage outweigh hard-won operational experience.
As much of a fan of Erlang and Riak as I am, you did make the right call. If you only have one or two people on the team who want to know that technology, then it isn't a smart move to base a core piece of technology on it (Erlang) just because it might be the best answer. Sometimes an okay answer that everyone is familiar with is much better.
It seems to me they could have gone a step further with something like Cassandra. With a Cassandra cluster, they could have used a partition key of the ___domain name plus the route in question, then done a lookup against that entry; the resource path (excluding querystring params) could be used to find a single resource in Cassandra and return it directly.
A preliminary hit against a ___domain forwarder would be a good idea as well, but for those CNAME domains, dual-publishing might be a better idea... where the GitHub name would be a pointer for said redirect.
While Cassandra itself might not be quite as comfortable as, say, MySQL, in my mind this would have been a much better fit... Replacing the fileservers and the database servers with a Cassandra cluster, any server would be able to talk to the cluster and resolve a response, with a reduced number of round trips and requests... though the gossip in Cassandra would probably balance/reduce some of that benefit.
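Just to make the data model concrete, something along the lines of this (purely illustrative CQL, not anything GitHub has said they use):

    -- one row per (___domain, path); the composite partition key spreads data across the cluster
    CREATE TABLE pages (
        ___domain  text,
        path    text,
        content blob,
        PRIMARY KEY ((___domain, path))
    );

    -- a request then resolves with a single lookup:
    SELECT content FROM pages WHERE ___domain = 'example.github.io' AND path = '/index.html';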
I remember the time I mistakenly drove huge amounts of traffic to Github Pages, believing they had the infrastructure to handle it. I apologise for last year's downtime :)
Glad to hear it's being improved. I'm impressed that it was able to run on such simple infrastructure for so long.
My heuristic is that MySQL is faster for inserts and single-table reads by primary key, but PostgreSQL is faster for more complex queries, particularly if they involve joins or subqueries.
Basically, MySQL is a key-value store in an RDBMS's clothing.
On the flip side, PostgreSQL's high availability options aren't in the box, or are at the very least varied, problematic and/or cost more than other options for support contracts.
Every time I've worked with MySQL I've seen some irksome behavior... just the same, its failover options are miles ahead of PostgreSQL's. And when you already have in-house talent, the choice becomes even more obvious.
My only thought was that using a clustered database (such as Cassandra) as the store for the data itself might have been better. Domain/URL (minus querystring) would hash/distribute fairly well, and even a relatively small cluster with 2 replica nodes per shard would be pretty effective. Also, it would be easier to manage a replicated database, in my mind, than to track which sites live on which pairs of static servers. GoDaddy is/was moving to something similar with new development on one of their applications when I worked there, and it was able to serve a huge number of static requests (hundreds of thousands per second) off a relative handful of servers with sub-10ms response times, for content not backed by a CDN.
In the end it just goes to show that serving static content on modern hardware can scale really well, with a number of options for technology. Which is why I'm somewhat surprised that something hasn't turned the tide of poorly configured WordPress blogs.
From OP:
> we made sure to stick with the same ideas that made our previous architecture work so well: using simple components that we understand and avoiding prematurely solving problems that aren't yet problems
So, if their team have lots of experience with MySQL but not so much with PostgreSQL, that could be a good reason to prefer one over another.
I looked into GlusterFS at one point. GlusterFS is a no-go for static file serving in hostile environments. It asks every node to look for a file, even if it's not there. You can imagine the DDoS attacks you could build here using a bunch of 404 requests for files that don't exist.
One story I heard from a PHP dev is that it would take 30 seconds to load a page while it looked for all the files needed to run it.
While a potential for concern... I would guess that CloudFlare's own routing would take them to an exit node close to Github's... that does make quite a few assumptions though. It would still be more complicated to mitm between CloudFlare and Github than ISP-X (or China) and Github...
And it doesn't work on custom domains (GitHub can't present a cert for your ___domain); you can make it "work" via Cloudflare, but that only works if Cloudflare doesn't validate GitHub's cert, which introduces yet another insecure link. [https://github.com/isaacs/github/issues/156]
Somewhat related, what's the SLA for Github pages? Let's say someone wants to move a highly successful blog, are there some quotas? Maybe I missed them but I don't remember seeing any...
I know users are using Travis-ci to automatically build when there are new commits in master and push back the results to the gh-pages branch with a lot of success.
GitHub Pages is so cheap and easy that it's the disruptive technology eating the lunch of standalone hosting. Users don't think about servers, deployment, or other things; they simply push a branch and poof, it's on the web.
Do you have a wishlist of features to add to GitHub pages? Maybe allowing minimal sandboxed server side computation with a max runtime of say 1ms, setting headers, redirects or other stuff? I am guessing every little addition would eat away at the alternatives.
> Do you have a wishlist of features to add to GitHub pages? Maybe allowing minimal sandboxed server side computation with a max runtime of say 1ms, setting headers, redirects or other stuff? I am guessing every little addition would eat away at the alternatives.
Moving from WordPress, the fact that you couldn't constantly tweak a thousand little things was extremely liberating. That's the zen-like simplicity of GitHub Pages that to me, makes it an attractive option over heavyweight alternatives. Just push and your site is live. Fewer things to break and fewer things to worry about means more time to focus on what matters: your content.
I think the only thing that I wish were supported would be simple rewrite rules....
/article/foo => /pub/article/foo.html
Or something very similar... That, or a custom 404 map, to take old URLs and send them to new ones (in the case of a blog, for example)... That said, gh-pages works well, and it's pretty awesome that it's offered to so many FLOSS projects; in my mind it reduces the chances of a custom ___domain/website going away because someone no longer supports something they put out there 7 years ago.
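For reference, the kind of thing I mean is a one-liner in nginx terms (illustrative only; Pages doesn't expose anything like this):

    # map pretty URLs onto the generated .html files
    rewrite ^/article/(.+)$ /pub/article/$1.html last;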
What do you mean by commercial offering? A paid version of GitHub Pages? What would you want to see in a paid offering that couldn't be baked into the current version?
Paid version, something that would rival web page hosting. Regarding features, I'm mostly thinking about the possibility of basing business on top of Github, for example uptime guarantees and notifications before termination in case of abuse. Maybe even stuff like throttling instead of pulling the plug in case of DDoS.
If these can be added to the free version, cool! :)
I never knew about this way to extend nginx -- I've been looking for a way to make nginx "smarter" in our Mesos config, as DNS-based configuration had some holes.
I find that for such tasks, instead of trying to hack around nginx config and its Lua scripting, it makes sense to throw it away and write a small app in, say, Haskell+Warp that does the job. It's as fast as nginx (and probably faster than nginx+Lua), has much stronger static guarantees, and expresses the logic more clearly.
Unfortunately this isn't an option for us. We need to be able to move sites between fileservers periodically to keep disk usage and load balanced. Adding new fileservers when using a hash modulus based routing scheme is also quite complex as it would require copying quite a lot of site data between fileservers.
Pages also supports custom domains for both user sites and per-project sites, so we'd still need a way to resolve domains to users.
Think about moving around hash buckets, not sites or files. You can decide what you want to keep in memory: data or metadata. Right now you need both site and file metadata; neither of those needs to be stored when serving static files that you yourself rename and place every time they are published.
> Consistent hashing is a very simple solution to a common problem: how can you find a server in a distributed system to store or retrieve a value identified by a key, while at the same time being able to cope with server failures and network partitions? [2]
Awesome write-up. I thought I was the only crazy guy using Lua to route subdomains to internal servers.
A little surprised by the MySQL part instead of a Redis-like fast lookup key.
Off-topic, but since we're talking about GitHub Pages... I own github.id, and I'm thinking of making a "LinkedIn for GitHub users". Do you guys think there's a market for that?
That's a clear violation of GitHub's trademark so you'd almost certainly receive a C&D (GitHub legally has to protect their trademark or they could lose it).
I agree with brandonwamboldt, mellett68, and tsm; the ___domain name is useless.
If you had a different business name...
I think a site that summarizes someone's github contributions from a recruiter / interviewer's perspective would be very helpful.
Poking around GitHub to research a candidate is time-consuming. Perhaps a more useful one-page snapshot could be created. X profile views are free, and you sell recruiters a per-company subscription.