There are a few differences. We don't use SQL in the routing chain; we use a regex to pick out the site name and then serve from a directory of the same name (this is NOT as bad as it sounds: most filesystems handle this quite well now, and it would take MUCH more than half a million sites to hit a bottleneck).
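For the curious, here's a minimal sketch of what that kind of regex routing looks like in nginx (hypothetical directory layout and ___domain, not our actual config):

    server {
        listen 80;
        # capture the subdomain into $sitename, e.g. foo.example.org -> "foo"
        server_name ~^(?<sitename>[a-z0-9-]+)\.example\.org$;

        # serve straight out of a directory named after the site
        root /var/www/sites/$sitename;
        index index.html;

        location / {
            try_files $uri $uri/ =404;
        }
    }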
DRBD is also a little hardcore for my tastes. Nothing wrong with it, I just don't know it well, and I don't like being dependent on things I don't know how to debug.
An alternative I wanted to show uses inotify, rsync and ssh combined into a simple replication daemon. It's obviously not as fast, but if you enable persistent SSH connections, it's not too bad. If it screws up, you can just run rsync.
Rumor has it the Internet Archive uses an approach not too far away from this for Petabox. Check it out if you're looking for something a little more lightweight for real-time replication:
https://code.google.com/p/lsyncd/
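If you want a rough idea of what lsyncd looks like in practice, here's a minimal sketch (hypothetical paths and hostname, not a tested production config) that watches a local docroot with inotify and pushes changes to a standby over rsync+ssh:

    settings {
        logfile    = "/var/log/lsyncd.log",
        statusFile = "/var/log/lsyncd-status.log",
    }

    sync {
        default.rsyncssh,
        source    = "/var/www/sites",        -- local directory watched via inotify
        host      = "standby.example.org",   -- replica reachable over ssh
        targetdir = "/var/www/sites",        -- path on the replica
        delay     = 1,                       -- batch filesystem events for ~1 second
    }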
We're still working on open sourcing (!) our serving infrastructure, so eventually you will be able to see all of the code we use for this (sans the secrets and keys, of course). I've just been having trouble coming up with a good solution for doing this. For now, enjoy the source of our web app: https://github.com/neocities/neocities
> Nothing wrong with it, I just don't know it well, and I don't like being dependant on things I don't know how to debug.
Yep, this is basically our approach as well.
We've been using DRBD for quite a long time now on our Git fileservers (which also run in active/standby pairs - in fact, they look a lot like our Pages fileservers), so we have quite a lot of in-house experience with it and it's a technology we're pretty comfortable with. Given this, using it for the new Pages infrastructure was a pretty straightforward decision.
This exchange is wonderful, and absolutely what I'd expect out of you two, and maybe I'm just in a bad mood, but it stands in such contrast to the way I often see technologies discussed online.
This kind of thing is the way engineering should be. Kudos.
I make use of Lua and Redis for handling a few million redirects and have been happy with it so far. I never considered MySQL due to performance concerns.
3ms for connection setup + auth + query seems reasonable. Are you using persistent DB connections? Any other mods? What sort of timeouts have you configured for DB connections?
Here's our current nginx config on the proxy server. I've got the DDoS pseudo-protection (there's another layer upstream) and caching turned off right now because we're working on something, but this is basically it:
Critique away. As you can see, we've just barely avoided pulling out the lua scripting.
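For anyone who hasn't played with this, the rate-limiting and caching pieces look roughly like the sketch below (generic names and values for illustration, not our actual config):

    # rough sketch of per-IP request limiting + proxy caching
    limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=sites:50m max_size=10g inactive=60m;

    server {
        listen 80;
        server_name ~^(?<sitename>[a-z0-9-]+)\.example\.org$;

        location / {
            limit_req zone=per_ip burst=20 nodelay;   # crude DDoS pseudo-protection
            proxy_cache sites;
            proxy_cache_key $host$uri;
            proxy_cache_valid 200 5m;
            proxy_pass http://fileservers;            # upstream pool defined elsewhere
        }
    }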
The next step for me would probably be to write something in node.js or Go. There's probably a lot of people cringing at that thought right now, but it's actually pretty good for this sort of work, and I'd really like to be able to do things like on-demand SSL registration and sending logs via a distributed message queue. Hacking nginx into doing this sort of thing has diminishing returns; we're kind of at the wall as it is.
> We're still working on open sourcing (!) our serving infrastructure, so eventually you will be able to see all of the code we use for this (sans the secrets and keys, of course).
I just want to applaud what you've been doing with Neocities -- when the project started, I thought "Oh, nice." -- but not much more -- but I love the fact that you've kept at it, and your approach to openness is great (pending infrastructure code notwithstanding). I especially like your status page:
(Which I found from your excellent update blog-post[1] -- but I think it could be even more discoverable. It's not linked from the donate/about pages?)
I hope your financial situation improves -- and still, I wonder how (almost) half a cent of revenue/month compares to most ad-funded startup sites? While you'll need... a "few" more to reach your goal -- actually, you'd "just" need 43x as many users at the same revenue per head to get there :-)
We used a master/master DRBD setup at a previous company, and it was kind of a pain to work with. We had a fairly extensive document for solving split-brain problems.
I imagine the problems with DRBD mostly disappear if you're using it properly though, master/slave setups probably work really well.
This factored in for me. Neocities is two people. We don't have the budget yet to hire an ops team, so we need to use parts that we can understand without a lot of mental investment. DRBD is definitely something you need to invest in. Github obviously doesn't have our budget constraints and can hire the people needed to really own problems like this.
I also am pretty conservative on engineering choices generally, and the "superfilesystems" (DRBD, Gluster) feel a little monolithic (read: not very unix) to me. It's not that they're bad, it's that they're solving a lot of hard problems, and there's a lot that can go wrong when you have to do that, and if something happens, you're the one that has to fix it.
I'm not religious about "do one thing and do it well", but SSH handles the transfers, rsync does efficient copying, and inotify fires events on file changes. Put them together and you've got a very "unix" solution. It's more or less an event-driven script that sits on the stable solutions to hard problems. If something goes wrong, you just run rsync.
I can't say enough how awesome OpenSSH is. I want to use it for pretty much everything. It's a work horse that really hauls.
lsyncd seems like a really cool project. However, in practice I used it to replicate a docroot across 3 servers, and it actually got out of sync pretty often.
I have a scheduled rsync job that periodically checks for any inconsistency. It plays nicely with any updates that come in while it's doing its work, so that's been our fallback in case things get out of whack.
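The fallback itself is nothing fancy; something like an hourly crontab entry (hypothetical paths and host) is enough to reconcile any drift:

    # hourly reconciliation pass from the primary to the standby
    # (--delete removes anything on the standby that no longer exists on the primary)
    0 * * * *  rsync -a --delete /var/www/sites/ standby.example.org:/var/www/sites/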
All of GitHub Pages was run off of _one_ server (with one failover standby)?
That's pretty amazing. If all you're serving is static assets, apparently you have to grow to pretty huge scale before one server will not be sufficient.
I'm curious if there was at least a caching layer, so every request didn't hit the SSD. They didn't mention it.
We do have Fastly in front of *.github.io, but there's still a significant amount of traffic (on the order of thousands of requests per second throughout the day) that make it through to our own infrastructure.
We don't do any other caching on our own, although the other replies are correct in that the Linux kernel has its own filesystem cache which means not all requests end up hitting the SSD.
Invalidating quickly is the main problem with caching on a CDN. We use an async worker via Redis pubsub to call for expiration of individual files locally on the proxy servers we run. I'm looking at using NSQ for this in the future. One interesting solution is to just use an HTTP hit to expire a cache, which you can see a flavor of in our nginx config file. Nginx in effect becomes its own cache SoA. We needed a special nginx module to make that work; default nginx only lets you expire the entire cache.
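A rough sketch of that kind of worker in Python (hypothetical channel name and purge endpoint; it assumes a cache-purge ___location like the nginx module mentioned above provides):

    # sketch of an async cache-expiry worker: listen on Redis pubsub and
    # translate each message into an HTTP PURGE against the local proxy
    import redis
    import requests

    r = redis.Redis(host="localhost", port=6379)
    pubsub = r.pubsub()
    pubsub.subscribe("cache-expire")          # hypothetical channel name

    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        path = message["data"].decode()       # e.g. "/sitename/index.html"
        # PURGE endpoint exposed by the cache-purge module on the local nginx
        requests.request("PURGE", "http://127.0.0.1" + path, timeout=5)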
Fastly probably has something similar. You've really got to do this within 5 seconds or your user is going to get pissed off waiting for it every time they save/reload the page.
The big problem with handing caching off to third-party CDNs is that they need to be able to handle your SSL certs inline for static files whose URLs you can't change (because changing the URL means changing published content you don't control), and they need to support wildcards.
In effect you're doing what Cloudflare does. I'd use Cloudflare in a minute for this, but wildcards require their commercial plan and we can't afford it ($6k/mo). Also, we would need fine-grained cache expiration.
If Fastly does this and is priced in our range, maybe we should talk to them. :)
FWIW, Fastly invalidates objects, services, and batches of objects globally in ~150ms via an API. They have a different approach to wildcard/batch invalidation: it groups objects based on a header set ahead of time, which can be programmatically derived from URL paths (or a bunch of other things).
That would work for Neocities. Wildcard pricing is unfortunately a little out of our budget, but if you were interested in coming to an arrangement with us (we could say powered by Fastly on the site, for example) in exchange for a reduction on that, let me know. :) [email protected]
Are the assets extracted from git and dumped on the filesystem, or are they fetched from the Git objects? I.e., if I generate a git tree object with millions of files, but only pointing to a handful of blobs (think a maze), would this explode on the filesystem?
I believe that only CNAMEs use the CDN for custom domains (naked domains do not) [0]. So that rules out a chunk but there will be lots of cache misses too. Maybe someone from GH can confirm.
I believe this is automatically done by the operating system. Nginx usually performs some kind of mmap of the requested file, and the virtual memory management will automatically take care of caching.
BTW, I'm often surprised how people are afraid of opening files from their scripts, as if they think this will always lead to disk access. Then they start implementing a hand-written caching layer on top of that, which usually performs worse than what the OS already offers.
If it's a static file with no dynamic compression (nginx allows the use of pre-compressed assets) or SSL, nginx doesn't even have to open the file, it just uses the sendfile system call.
Correct. Static file serving is amazingly efficient. Even the filesystem caches for you automagically with whatever RAM you're not using. There's even a kernel function for speeding up file transfers, that's how optimized this stuff is: http://man7.org/linux/man-pages/man2/sendfile.2.html
My rough estimates are that Neocities can handle 20-50 million sites with only two fileservers and a sharding strategy. So double it for each shard, and you've got a solution that actually scales pretty well.
It's not that surprising. Serving static files is pretty easy and lightweight (part of why I like the movement towards SPAs), and there's been 20+ years of work done to make it fast. We've been conditioned for years by dynamic application servers that perform multiple DB queries, external service calls, etc per request. It doesn't have to be slow or inefficient.
I'm amazed that GitHub Pages ran on just two servers (well, aside from the MySQL clusters). That is absolutely incredible given the sheer number of projects and people who rely on it for their sites (me included!). I love the philosophy behind Pages of abstaining from over-engineering and sticking to simple, proven solutions. It's a great service and I'm a massive fan.
It's cool that they were able to do this with two machines, and I don't want to detract from that, but it's probably worth pointing out that a "machine" is not a very useful unit of capacity. These two machines could be dual core i5s or they could be 20 core xeon boxes with hugely varying amounts of memory and storage. Too bad they don't clarify, I'm curious.
> The fileserver tier consists of pairs of Dell R720s running in active/standby configuration. Each pair is largely similar to the single pair of machines that the old Pages infrastructure ran on.
As an ops guy, I'm going to remember this article and use it as a talking point for the benefits of static pages when devs want to put everything into their dynamic framework of choice.
It's a bit of a trade-off... static generation of dynamic content isn't so different from front-side caching. If the bulk of your requests can be cached for 30+ minutes, you can set up some pretty decent caching rules on a couple of servers fronting a dynamic backend that doesn't need to be too big to support many requests.
For the most part, the load tends to come down to less-than-optimized database storage and queries on the backend, often duplicated for every page load for content that could easily be offloaded to the client browser, constructed client-side, and requested on demand.
It's a matter of striking a balance... In this case the content changes very infrequently, and having a publish step to static storage makes sense. It really depends on one's needs.
If you have a million pages you'd need to regenerate daily for a single site, and thousands every hour, then you might think differently.
3. Use the distro's default Apache connection limit of 20, despite the server being able to handle thousands of concurrent connections. The limited connections are then quickly used by users with slow connections.
Well done GitHub. Also a special mention to the invisible workers making nginx such a cornerstone of the modern infrastructure, it's a project that I don't hear about often, probably due to the fact that it's not the sexiest piece of technology, but it really seems solid and battle-tested. Kudos.
I've been using GitHub pages for a while now and I always wondered why they had the "your site may not be available for another 30 minutes" message on creating a new GitHub pages site while pushes to an already-existing gh-pages branch were displayed instantly. Neat to see that explained here.
> We also have Fastly sitting in front of GitHub Pages caching all 200 responses. This helps minimise the availability impact of a total Pages router outage. Even in this worst case scenario, cached Pages sites are still online and unaffected.
Only tangentially related, but I sometimes wonder if GitHub's management regret making GitHub Pages available for free, now that it's being used so heavily for personal and even business blogs, rather than just companion sites for open-source projects. They could be charging for static websites, as Amazon S3 does.
I never heard anyone gripe about it... not even once. The cost is pretty negligible, and there's a lot of halo benefit (i.e., you just get more people involved on GitHub the platform itself).
The fact that a lot of non-technical employees in marketing and other fields are using it for corporate blogs is actually a nice bit of pressure on the organization to make Pages and web editing even simpler for those users. It becomes harder to lean on "oh it's a developer site so they'll figure it out".
Mostly, though, I think it's just a matter that we wanted it for ourselves. It's pretty awesome from an industry bystander's perspective to have something free, simple, and static, so we can all benefit from more stable docs, blogs, and so on. Maybe that'll change in the future and something Totally Different will change the industry, but for right now I think it's pretty rad, and totally worth the investment.
If it was able to run off a single server for that long, I suspect the goodwill it engendered (as well as the familiarity with Git/Github.com it built in a lot of people) was well worth the minimal resources and cost it entailed.
I doubt it. After seeing how simple the architecture is to run it, I'm sure it's a drop in the proverbial bucket. Pages drives traffic to the site and in order to serve a site from a private repo you have to be a paying customer anyways.
Nice! Does HN still run off of a single server and CDN too?
The CDN is key here, which you get if you use a CNAME (or ALIAS) instead of an A record for your custom ___domain on GH pages. I've found pairing pages with CloudFlare works great if you want to use a naked ___domain and you get HTTPS too. You can set up a page rule on CF to redirect all HTTP to HTTPS as well.
It's time for GitHub to start offering some basic hosting infrastructure for small projects, a light Heroku, at least for JavaScript (which kind of already works).
I'd pay extra for that. I (we all) have a bunch of personal sites, landing pages, marketing sites and tiny side projects that I'd love to not have to deal with hosting for – I think they'd make a killing, but I also think it must be in the works.
Also the mental jump. Just because something is easy for them to do doesn't mean it is worth the distraction cost.
Github builds tools for developers. Atom, chat (abandoned), Pages, Gists, and github.com all fit within this. They tie into how teams operate. Serving JS is tangentially related — certainly something a web developer does — but not really core to their mission.
I'm attuned to the demand level via our research. IMHO it's not worth it for their business size and growth plans. Their other initiatives are far more lucrative. There's also a lot of competition at that level.
I think of this more as a convenience feature for their existing business that adds value. They use this instead of the "project page" design that the other code sites used, and it gives their users more control over the presentation. Which is awesome, and feeds into my conspiracy to get everybody to know and use HTML for presentation. :)
Great summary of your architecture. Thanks for sharing!
A few questions:
- Is everything in the same datacenter or in different datacenters? What happens if the datacenter is unavailable for some reason? Are data replicated somewhere?
- You moved from 2 machines to at least 10 (at least 2 load balancers, 2 front ends, 1 MySQL master, 1 MySQL slave and 2 pairs of fileservers). That's a lot more. Do you need more machines because you need more capacity (to serve the growing traffic) or just because the new architecture is more distributed and requires more machines by "definition"?
- I understand the standby fileservers are idle most of the time: reads go to the active fileserver, and only writes are replicated to the standby. Am I understanding correctly? If yes, it looks like "wasted" capacity?
Something I would really like is to be able to set the custom MIME type for an app cache manifest file. That way you could easily host offline web apps from GH pages. Anyone know a way to do this without using S3 or similar?
You shouldn't need to specify a custom one. GitHub Pages will automatically serve the file with the appropriate mime type given its file extension. Here [1] is the list.
Seems odd to me that the router hits a MySQL database on every single request rather than just hashing the hostname as the key for the filesystem node.
Hash-based partitioning has a big problem: when you change the number of hash buckets, all of the data moves around. Eventually you'll need a lookup-based partitioning scheme. You also probably want control over where some users live, since you don't want two super-hot users on the same server.
My thought too. If they have half a million sites, and only 1% are hot enough to need deliberate placement, then they only need to store five thousand special cases, which is a few megabytes of memory, easily stored on each load balancer and loaded at boot.
Use hashing for the rest. Ideally consistent hashing, or rendezvous hashing, which I just read about on Wikipedia, so it must be good.
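Rendezvous (highest-random-weight) hashing is genuinely simple. A toy sketch in Python, just to show the idea (illustrative only, not anything GitHub or Neocities actually runs):

    # rendezvous (HRW) hashing: each site goes to whichever server scores highest
    import hashlib

    def pick_server(site, servers):
        def score(server):
            return hashlib.md5(f"{site}:{server}".encode()).hexdigest()
        return max(servers, key=score)

    servers = ["fs1", "fs2", "fs3"]
    print(pick_server("somesite", servers))
    # Adding "fs4" only moves the sites that now score highest on fs4,
    # instead of reshuffling nearly everything the way hash-mod-N does.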
I threw out that prototype soon after the talk. At the time, there weren't a lot of other engineers at the company doing Erlang, so maintenance was considered to be a long-term problem. I'm glad we made that call.
There are so many presentations of the form "We're using Erlang/Scala/whatever, it's so awesome!", but so few followups when they give up on the idea for production..
It's hard to sustain some alternate technology in the face of common knowledge. Rarely does technical advantage outweigh hard-won operational experience.
As much of a fan of Erlang and Riak as I am, you did make the right call. If you only have one or two people on the team who want to know that technology, then it isn't a smart move to base a core piece of technology on it (Erlang) just because it might be the best answer. Sometimes an okay answer that everyone is familiar with is much better.
It seems to me they could have gone a step further with something like Cassandra. With a Cassandra cluster, they could have used a partition key of the ___domain name plus the route in question, then done a lookup against that entry; the resource path (excluding querystring params) could be used to find a single resource in Cassandra and return it directly.
A preliminary hit against a ___domain forwarder would be a good idea as well, but for those CNAME domains, dual-publishing might be a better idea... where the GitHub name would be a pointer for said redirect.
While Cassandra itself might not be quite as comfortable as, say, MySQL, in my mind this would have been a much better fit... Replacing the fileservers and the database servers with a Cassandra cluster, any server would be able to talk to the cluster and resolve a response, with a reduced number of round trips and requests... though the gossip in Cassandra would probably balance/reduce some of that benefit.
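Just to make the data model concrete, something along the lines of this (purely illustrative CQL, not anything GitHub has said they use):

    -- one row per (___domain, path); the composite partition key spreads data across the cluster
    CREATE TABLE pages (
        ___domain  text,
        path    text,
        content blob,
        PRIMARY KEY ((___domain, path))
    );

    -- a request then resolves with a single lookup:
    SELECT content FROM pages WHERE ___domain = 'example.github.io' AND path = '/index.html';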
I remember the time I mistakenly drove huge amounts of traffic to Github Pages, believing they had the infrastructure to handle it. I apologise for last year's downtime :)
Glad to hear it's being improved. I'm impressed that it was able to run on such simple infrastructure for so long.
My heuristic is that MySQL is faster for inserts and single-table reads by primary key, but PostgreSQL is faster for more complex queries, particularly if they involve joins or subqueries.
Basically, MySQL is a key-value store in an RDBMS's clothing.
On the flip side, PostgreSQL's high availability options aren't in the box, or are at the very least varied, problematic and/or cost more than other options for support contracts.
Every time I've worked with MySQL I've seen some irksome behavior... just the same, its failover options are miles ahead of PostgreSQL's. And when you already have in-house talent, the choice becomes even more obvious.
My only thought was that using a clustered database (such as Cassandra) as the store for the data itself might have been better. Domain/URL (minus querystring) would hash/distribute fairly well, and even a relatively small cluster with 2 replica nodes per shard would be pretty effective. Also, it would be easier to manage a replicated database, in my mind, than to track which sites live on which pairs of static servers. GoDaddy is/was moving to something similar with new development on one of their applications when I worked there, and it was able to serve a huge number of static requests (hundreds of thousands per second) off a relative handful of servers with sub-10ms response times, for content not backed by a CDN.
In the end it just goes to show that serving static content on modern hardware can scale really well, with a number of options for technology. Which is why I'm somewhat surprised that something hasn't turned the tide of poorly configured WordPress blogs.
From OP:
> we made sure to stick with the same ideas that made our previous architecture work so well: using simple components that we understand and avoiding prematurely solving problems that aren't yet problems
So, if their team have lots of experience with MySQL but not so much with PostgreSQL, that could be a good reason to prefer one over another.
I looked into GlusterFS at one point. GlusterFS is a no-go for static file serving in hostile environments. It asks every node to look for a file, even if it's not there. You can imagine the DDoS attacks you could build here using a bunch of 404 requests for files that don't exist.
One story I heard from a PHP dev is that it would take 30 seconds to load a page while it looked for all the files needed to run it.
While a potential for concern... I would guess that CloudFlare's own routing would take them to an exit node close to Github's... that does make quite a few assumptions though. It would still be more complicated to mitm between CloudFlare and Github than ISP-X (or China) and Github...
And it doesn't work on custom domains (GitHub can't present a cert for your ___domain); you can make it "work" via Cloudflare, but that only works if Cloudflare doesn't validate GitHub's cert, which introduces yet another insecure link. [https://github.com/isaacs/github/issues/156]
Somewhat related, what's the SLA for Github pages? Let's say someone wants to move a highly successful blog, are there some quotas? Maybe I missed them but I don't remember seeing any...
I know users are using Travis-ci to automatically build when there are new commits in master and push back the results to the gh-pages branch with a lot of success.
GitHub Pages is so cheap and easy that it's the disruptive technology eating the lunch of standalone hosting. Users don't think about servers, deployment, or other things; they simply push a branch and poof, it's on the web.
Do you have a wishlist of features to add to GitHub pages? Maybe allowing minimal sandboxed server side computation with a max runtime of say 1ms, setting headers, redirects or other stuff? I am guessing every little addition would eat away at the alternatives.
> Do you have a wishlist of features to add to GitHub pages? Maybe allowing minimal sandboxed server side computation with a max runtime of say 1ms, setting headers, redirects or other stuff? I am guessing every little addition would eat away at the alternatives.
Moving from WordPress, the fact that you couldn't constantly tweak a thousand little things was extremely liberating. That's the zen-like simplicity of GitHub Pages that to me, makes it an attractive option over heavyweight alternatives. Just push and your site is live. Fewer things to break and fewer things to worry about means more time to focus on what matters: your content.
I think the only thing that I wish were supported would be simple rewrite rules....
/article/foo => /pub/article/foo.html
Or something very similar... That, or a custom 404 map, to take old URLs and send them to new ones (in the case of a blog, for example)... That said, gh-pages works well, and it's pretty awesome that it's offered to so many FLOSS projects; in my mind it reduces the chances of a custom ___domain/website going away because someone no longer supports something they put out there 7 years ago.
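For reference, the kind of thing I mean is a one-liner in nginx terms (illustrative only; Pages doesn't expose anything like this):

    # map pretty URLs onto the generated .html files
    rewrite ^/article/(.+)$ /pub/article/$1.html last;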
What do you mean by commercial offering? A paid version of GitHub Pages? What would you want to see in a paid offering that couldn't be baked into the current version?
Paid version, something that would rival web page hosting. Regarding features, I'm mostly thinking about the possibility of basing business on top of Github, for example uptime guarantees and notifications before termination in case of abuse. Maybe even stuff like throttling instead of pulling the plug in case of DDoS.
If these can be added to the free version, cool! :)
I never knew about this way to extend nginx -- I've been looking for a way to make nginx "smarter" in our Mesos config, as DNS-based configuration had some holes.
I find that for such tasks, instead of trying to hack around nginx config and its Lua scripting, it makes sense to throw it away and write a small app in, say, Haskell+Warp that does the job. It's as fast as nginx (and probably faster than nginx+Lua), has much stronger static guarantees, and expresses the logic more clearly.
Unfortunately this isn't an option for us. We need to be able to move sites between fileservers periodically to keep disk usage and load balanced. Adding new fileservers when using a hash modulus based routing scheme is also quite complex as it would require copying quite a lot of site data between fileservers.
Pages also supports custom domains for both user sites and per-project sites, so we'd still need a way to resolve domains to users.
Think about moving around hash buckets, not sites or files. You can decide what you want to keep in memory: data or metadata. Right now you need both site and file metadata; neither of those needs to be stored when serving static files that you yourself rename and place every time they are published.
> Consistent hashing is a very simple solution to a common problem: how can you find a server in a distributed system to store or retrieve a value identified by a key, while at the same time being able to cope with server failures and network partitions? [2]
Awesome write-up. I thought I was the only crazy guy using Lua to route subdomains to internal servers.
A little surprised by the MySQL part instead of a Redis-like fast lookup key.
Off-topic, but since we're talking about GitHub Pages... I own github.id, and I'm thinking of making a "LinkedIn for GitHub users". Do you guys think there's a market for that?
That's a clear violation of GitHub's trademark so you'd almost certainly receive a C&D (GitHub legally has to protect their trademark or they could lose it).
I agree with brandonwamboldt, mellett68, and tsm; the ___domain name is useless.
If you had a different business name...
I think a site that summarizes someone's github contributions from a recruiter / interviewer's perspective would be very helpful.
Poking around GitHub to research a candidate is time-consuming. Perhaps a more useful one-page snapshot could be created. X profile views are free, and you sell recruiters a per-company subscription.