
I rent dedicated servers at Hetzner.

No cloud machines, no hosted cloud services for production beyond DNS.

* 3 machines in separate data centers (equivalent of AWS AZs) for >= 30 EUR/month each. ECC RAM.

* These machines are /very/ reliable. Uptimes of > 300 days are common; reboots happen only for relevant kernel updates.

* Triple-redundancy Postgres synchronous replication with automatic failover (using Stolon), CephFS as distributed file system. I claim this is the only state you need for most businesses at the beginning. Anything that's not state is easy to make redundant.

* Failure of 1 node can be tolerated, failure of 2 nodes means I go read-only.

* Almost all server code is in Haskell. 0 crash bugs in 4 years.

* DNS based failover using multi-A-response Route53 health checks. If a machine stops serving HTTP, it gets removed from DNS within 10 seconds.

* External monitoring: StatusCake triggers Slack (which vibrates my phone) and, after a short delay, PagerDuty if something is down from the perspective of site visitors.

* Internal monitoring: Consul health checks with consul-alerts monitor every internal service (each of the 3 Postgres, CephFS, web servers) and ping on Slack if one is down. This is to notice when the system falls into 2-redundancy, which is not visible to site visitors.

* I regularly test that both forms of monitoring work and send alerts.

* Everything is configured declaratively with NixOS and deployed with NixOps. Config changes and rollbacks deploy within 5 seconds.

* In case of total disaster at Hetzner, the entire production infrastructure can be deployed to AWS within 15 minutes, using the same NixOps setup but with a different backend. All state is backed up regularly into 2 other countries.

* DB, CephFS and web servers are plain processes supervised by systemd. No Docker or other containers, which allows for easier debugging using strace etc. All systemd services are overridden to restart without systemd's default restart limit, to come back reliably after network failures or out-of-memory situations.

* No proprietary software or hosted services that I cannot debug.

* I set up PagerDuty on Android to override any phone silencing. If it triggers at night, I have to wake up. This motivated me to bring the system to zero alerts very quickly. In the beginning it was tough, but I think it paid off, given that I now get alerts only every couple of months at worst.

* I investigate any downtime or surprising behaviour until a reason is found. "Tire kicking" restarts that magically fix things are not accepted. In the beginning that takes time but after a while you end up with very reliable systems without surprises.
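The failure-tolerance rule from the list above (1 node down is tolerated, 2 nodes down means read-only) can be sketched roughly as follows. This is only an illustration under the assumption of a simple majority rule; `Mode` and `cluster_mode` are hypothetical names and are not taken from Stolon or from the actual setup, and the "down" state for zero healthy nodes is an assumption as well.

```python
from enum import Enum


class Mode(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"
    DOWN = "down"


def cluster_mode(healthy_nodes: int, total_nodes: int = 3) -> Mode:
    """Map the number of healthy nodes to a service mode (illustrative only)."""
    if not 0 <= healthy_nodes <= total_nodes:
        raise ValueError("healthy_nodes must be between 0 and total_nodes")
    # A strict majority of nodes is still available, so writes stay enabled.
    if healthy_nodes > total_nodes // 2:
        return Mode.READ_WRITE
    # Assumption: a surviving minority keeps serving reads only.
    if healthy_nodes > 0:
        return Mode.READ_ONLY
    # Assumption: with no healthy node there is nothing left to serve.
    return Mode.DOWN


# Print the mode for every possible number of healthy nodes in a 3-node setup.
for healthy in (3, 2, 1, 0):
    print(healthy, cluster_mode(healthy).value)
```

With 3 nodes, the majority threshold is 2, which is why losing a single machine changes nothing for visitors while losing two drops the service to read-only.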

Result: Zero observable downtimes in the last few years that were not caused by me deploying wrong configurations.

The total cost of this can be around 100 EUR/month, or 400 EUR/month if you want really beefy servers that have all of fast SSDs, large HDDs, and GPUs.

There are a few ways I'd like to improve this setup in the future, but it's enough for the current needs.

I still take my laptop everywhere to be safe, but I haven't had to use it for a while.




Very well-thought-out infra and nice metrics. What kind of application are you running, if I may ask?


Computer vision, specifically reconstruction of 3D models from 2D photos, as a service.



