Ask HN: How do solo SaaS founders handle monitoring/PagerDuty?
149 points by exctaticraz on Feb 20, 2021 | 113 comments
Can you ever take a break? What if you go on vacation — or simply out for dinner with your friends — and the server goes down?

I guess for less complex apps this can be mitigated with something like Heroku, but still... do they hire freelancers to “watch the shop” when they want a break or are they chained to PagerDuty 24/7?




You can't get out of it completely, but you can reduce the risk of it actually occurring and, maybe more importantly, reduce the constant paranoia about whether the system is OK or not.

What has helped me as the only technical founder, as a freelancer, or in very small teams in general:

- Choose boring technology. Especially when alone, I prefer reliability and a large body of established knowledge on how to operate it over shiny features

- Choose technology and infrastructure that you know. It is a whole lot easier to maintain a stable system with something that you have ample experience with.

- Keep system complexity roughly aligned with team size. E.g. when alone, it might not be the best idea to maintain 5 very different database systems, although on paper each is "the best tool for the job"

- I don't think you need any super advanced, well-thought-out architecture, but if you are constantly firefighting while at work, it might not even be good enough

- Set up basic automation so the system can recover itself from the unavoidable but benign hiccup every now and then.

- Don't deploy before going for lunch, coffee break, dinner, weekends, etc.

- While working, observe your system's behaviour over time, and especially the impact of changes on it. If you see a degradation, fix it or at least put it in the backlog. Otherwise it will bite you eventually out of nowhere.

- Have nice error pages and messaging that are shown to users when the system fails. In my experience in early stage companies, crashes suck, but aren't actually that bad after all, and users are quite lenient as long as they can see that the system is down, rather than having the worse experience of it just not working correctly.


This is great advice for any SaaS company. We scaled to ~80 engineers and >1,000 customers on an almost entirely monolithic app running JARs on EC2 instances with a single Postgres database. Keep it simple & focus on delivering product features that create customer value.


I think this last point is super important! Focus on making the experience of your app crashing as painless as possible.

Try to minimize the user's loss of effort if your server crashes (if you have a long form a user needs to fill out, consider persisting the data to local storage and restoring it from there, so that if the server is down and the user has to come back later to submit, their data isn't lost).

If you can have a not awful experience when your small SaaS crashes, then it's probably okay to aim for 2-nines* of reliability instead of 5-nines. You're not Amazon as a small SaaS, it's okay to have a little bit of downtime now and again.

Your product is important, but it's also important to keep in mind your quality of life. Spending the time to polish the failure scenario means you can be a little more tolerant of failure scenarios.

* maybe slightly more than 2-nines, but that's the general idea.


Yes, unless your SaaS deals with life and death or a 99.999% uptime contract, you're pretty safe running something like Django+Redis+Postgres+NGINX on a single box for years and sleeping well... once you hit some really big number of users you might want to change it a bit and hire some people.


Twitter's Fail Whale is an example of that last point. I was surprised how long the good will lasted. Twitter started in 2006, the SXSW where it gained a lot of traction was in 2007, the Fail Whale looks like it was publicly "named" in 2007/2008, and they discontinued it in 2013. The image itself was a stock image they bought.


Monit for automated restarts.

Hardware raid cards.

Plus an architecture that is robust.

In my experience, good dedicated servers practically never crash. You might lose a HDD every few years, but that is not urgent to fix if you have a good raid.

Avoid most cloud services. Heroku, Rackspace, and AWS have all had many more outages than Hetzner. Plus they'll sometimes force reboot or force migrate ( =pause) your instances.

So if you go cloud, you'll need failover, distributed database, all that messy and complicated stuff. If you go dedicated, it's much easier and you only need to keep that one box running.

Plus, honestly, would your customers really mind if you're offline for 5 minutes? My dedicated hoster also has a service where they will monitor standard services like Apache, Postgresql, Rails for you and restart as needed. They have 5-10 minutes response time in my experience and I believe it's good enough :)

Also, going dedicated makes it affordable to overprovision 10x the hardware you need, so you will practically never have a traffic spike high enough to cause issues.

With Heroku / AWS on the other hand, everyone else will also be scaling up when their cloud has hiccups, so your on-demand instances might not start when you need them.

Anyway, Hetzner dedicated + raid + monit is how I've been running my SaaS company for 10+ years. And I don't even remember which year I last had an issue that was both urgent and required my attention. The Hetzner ppl can exchange HDDs just fine without me. C++ core, Ruby website, Postgresql and RabbitMQ. 100GB database, 5TB customer data.


Ignoring the fact that this sounds like an ad for Hetzner, I'm not sure this is good _generic_ advice. It may be good for _some_, but the vast majority of single SaaS founders have access to platforms now (mostly via big IaaS providers) that allow them to build, develop, and deploy without ever worrying about RAID, Apache servers or Postgresql restarts.

> Plus they'll sometimes force reboot or force migrate ( =pause) your instances.

Extremely rare, but probably happens at a similar rate as your "single box" dedicated provider losing an HDD or having a datacenter blip.

My point here is that what you described sounds like the kinds of things SaaS developers needed to worry about ~10 years ago. The platforms of today aren't perfect, but they abstract away 90% of that and allow you to focus on business logic, which is exactly what a single SaaS developer should be doing.


In theory, I agree with you that cloud providers should be more comfortable and more resilient. It's just that my practical experience has been the opposite.

BTW, I don't get commission, payments or anything from Hetzner. I'm just super enthusiastic about them because their affordable pricing is making me rich.

I agree that it's a tradeoff, but the question was about small companies. And there I'd say 5 minutes of occasional downtime is absolutely fine if it saves you $100k annually. And for a single founder, those 100k in profit will be kind of a big deal ;)


As someone else who has been happily running a small business on dedicated hardware with a managed hosting service for years, I share some of the GP's scepticism about modern cloud hosting.

Modern platforms should remove most of the complexity of operating routine infrastructure and allow you to focus on your business logic, but it doesn't always work out that way.

Just the fact that your VMs have a significant chance of being forcibly shut down with little notice is a significant downside, for example. As a solo operator, you now have to arrange all the automatic scaling and failover configuration on your cloud host as well (possibly at considerable extra cost for capacity you might not be using 99% of the time) and you have all the 24/7/365 monitoring problems that OP was asking about.

Cloud services are also notorious for obfuscating their pricing so it's hard to work out the TCO. In my experience, arguments that cloud hosting works out much cheaper overall tend to be based on rather optimistic assumptions. It might be true if you lease some VMs at carefully chosen sizes and then set everything up yourself including scaling things down again any time you don't need them. However, once you start using the automatic services that actually do something for you beyond supplying a machine on demand, the prices might jump 3x or more (sometimes much more, like orders of magnitude) compared to ordering the equivalent basic resources and setting the same functionality up manually.

Then there are the security and compatibility updates. The basic cloud services tend to be provided as-is and it's up to you to ensure everything gets updated when it needs to be. Or again, you might be able to get a more automated service that does some of this for you, but it will come with a pricing premium.

Meanwhile, a solo operator using a more traditional managed service probably doesn't have to worry much about any of this, because those services will often be happy to take care of things like setting up your redundant database servers or monitoring the security mailing lists and applying emergency patches very quickly so you don't have to. That's the level of individual service and advice they tend to offer to distinguish themselves from the generic cloud hosting services. Obviously you do pay extra for that management service compared to just a basic hosting arrangement, but whether you pay more than you'd have paid trying to do all of it yourself on AWS or Azure or even DO is another question.


The topic is reliability, not ease of use. What was described is not a big deal to pull off.

If Hetzner fits your use case (now and future case) then it's a great way to go.


I agree with most of the points here, but it's interesting that you mention:

> In my experience, good dedicated servers practically never crash

One of my toy servers (ECC RAM/Xeon CPU, but bought "second hand" via Hetzner's auction) disappeared the other day. I thought maybe a disk had failed, but I couldn't bring it up in their network-booted rescue mode, so I requested a "hands on" power cycle, and after a few minutes the server was up again:

> Dear Client.

> A fault in your neighbor servers PSU tripped the fuse of the small rack segment which your server is located in too. We have fixed the issue and now your OS is back online.

Now, I think that box had a 700-900 days up-time before - I didn't really have to do anything (or pay) to get it back up.

But it was kind of surprising.

I guess all I'm saying is that I do like cheap, dedicated servers from hetzner - but if you need to guarantee five nines uptime, the architecture part is important.


> I guess all I'm saying is that I do like cheap, dedicated servers from hetzner - but if you need to guarantee five nines uptime, the architecture part is important.

Five-nines is less than 10 minutes of downtime per year. I doubt anyone is really guaranteeing that without 24/7 active monitoring and maintaining extensive automated failover systems, which is already several full-time jobs. No solo operator is credibly providing that level of service.


I’m running five-nines with my setup and I’m the only operator. Monitoring and automatic failover is not difficult but I think it requires a solid architecture from the ground up. When I first started in 2011 I was running DRBD in VMs and zebra to unicast my presence. Future upgrades were incremental steps to more resilience to where I am today with a fully redundant architecture in 2 data centers. In fact the only thing that made me miss my uptime target one year was failed generator maintenance by my provider.


> I’m running five-nines with my setup and I’m the only operator. Monitoring and automatic failover is not difficult but I think it requires a solid architecture from the ground up.

OK, I concede that it is not completely inconceivable to do that, but unless the service you're operating is relatively light in its demands on the tech stack, I think it's a very impressive achievement to maintain infrastructure that can consistently and reliably deliver that performance on your own if you're also the person doing the development work and your infrastructure costs aren't getting silly.

We have a simple, fully redundant architecture at one of my businesses as well, and I suppose we probably do achieve five-9s most years, but I wouldn't be willing to guarantee that to customers with serious money on the line if we missed it. We're still only D disk failures at similar times away from degraded performance while we spin up new machines from scratch, or N network failures away from degraded performance until we can bring up more capacity where it's still available.


Agreed. I doubt most people/services should build to 'guarantee' even .9999.

.9990 or .9995 is much cheaper, much easier, and probably closer to what your end user's network connectivity is anyway. (Yes, they're multiplicative, but if your user is connecting from a single-path residential connection and a $50 router, your 5th nine isn't needed to demolish their local 5-10 hours of downtime per year.)


That argument works pretty well in b2c, but in b2b your customers often insist on high uptimes, even if they benefit little from them.

Though the promised uptime might not matter that much in practice, since the penalties for a couple of hours of downtime are often affordable.


That sounds like an enterprise feature with "call us" pricing to me. :-)


I remember significant outages of S3, Gmail, and Heroku last year. So to me, it looks like they also don't reach five nines uptime...

Also, I'm not convinced that it's needed for a normal product. When my work Gmail was offline, I just had lunch early and then later it worked again.

A single founder offering higher availability than Gmail sounds like a masochist to me.


> If you go dedicated, it's much easier and you only need to keep that one box running.

Can/should you really run everything on just 1 box, with rather huge projects? Why not gain redundancy/uptime/peace of mind by having multiple (redundant) dedicated boxes?


If I were doing the small/medium SaaS thing, I would vastly prefer to scale vertically rather than horizontally.

Maintaining a single machine is always going to be much easier than a cluster with k8s. Not to mention you can often toss most of your data set in RAM.

Not having to worry about sharding, affinity issues, DNS/addressing/networking, extra security is a godsend. Everything is easier on one machine.

Having a redundant machine for failover and release staging might be a good idea. But you'll need to figure out how to replicate your database and possibly your in memory cache layer (redis/memcached/etc.) and test it all. Not to mention database migrations can get tricky. Really, most people can probably get away with the typical maintenance window and notification, and shut everything down for 4 hours on a Saturday night or whatever. I mean... major banks and utilities do this. You'll be fine.


Two servers with manual failover is probably the sweet spot, especially if you exercise the failover often. Confirming after every release is best, but once a month is probably fine. Then your incident response can be: verify the lead server is dead (or ensure it's dead) and switch to the alternate.

But one server is way more convenient, until it isn't.


To avoid configuring every service for high-availability on two servers, you can do surprisingly well with a replicated VM and DRBD or equivalent.

In the event of unplanned failover, it looks to the VM like an unplanned abrupt reboot took place. In reality it reboots, usually very fast, on the other host.

All services running inside can recover in the usual way (journalled filesystems, databases, programs restart), and don't need any high-availability configuration or replicas configured.

You do need to ensure I/O is committed durably across the network, including I/O barriers. This is a combination of VM host, filesystem and DRBD config.

(It is actually possible to do this with the VM not even seeming to reboot, so network connections and processes are unaffected by the fail-triggered instant migration. This is done by running VMs in synchronised tandem and is a rather more advanced technique. I've never used it.)


Everything else being equal, having more boxes leads to more failures. If each server fails on average every 10 years, then with ten servers you expect on average one failure every year.
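
A back-of-the-envelope sketch of that arithmetic (in Python, assuming independent failures and the example numbers above):

  # Each server fails on average once every 10 years; add up ten of them.
  n_servers = 10
  failures_per_server_per_year = 1 / 10

  expected_failures_per_year = n_servers * failures_per_server_per_year         # 1.0
  p_at_least_one_failure = 1 - (1 - failures_per_server_per_year) ** n_servers  # ~0.65

  print(expected_failures_per_year, round(p_at_least_one_failure, 2))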


Even day-to-day operations/maintenance becomes that much more painful. Think of all those security patches, zero-downtime deployments, failovers, log aggregation, monitoring setup, and so on and so forth.


True. But a failure of a redundant server (say, 1 out of 3 application servers) would then not force you to cancel your night/weekend/vacation.


I had a dedicated server at Hetzner that would completely lock up every few weeks to months randomly, so they aren't always reliable. I ran it for about a year as I always had other things to deal with, but the abrupt loss of my services was kind of embarrassing and I should have reported it sooner than I did.

To be fair to Hetzner, eventually when I reported it they immediately took the initiative to replace it, no questions, and gave me options for when I'd like that to happen. Never had any problems since.


> Plus, honestly, would your customers really mind if you're offline for 5 minutes?

Idk what kind of software you work on but in my slice of the B2B SaaS world, 5 minutes of downtime during business hours would generate 100s of support tickets with very angry users letting us know they couldn’t do their job.


> 5 minutes of downtime during business hours would generate 100s of support tickets

Perhaps part of the answer is that if you're a solo founder who is staying solo, try not to create a business where you have these kinds of dynamics.

OR, if you are creating one that does, scale up past being solo ASAP.


Fair point and agree 100%.


I would argue there are some services that need 5 nines (or more) of availability.

Your support ticketing system and company web presence are two I can think of. I provide the ticketing service myself but I outsource my web presence (the landing page not the app) to a third party service. My thought is the web site should never go down. If the ticketing system is down it means things are beyond hosed :)


If you are a solo SAAS operator and have enough customers to generate hundreds of support tickets in 5 minutes, maybe hiring more help several years ago would have been a good idea?

I doubt you're trying to solve the same problems as the GP at that scale, or probably even the same types of problems.


That’s fair. I forgot all this was in context of solo founder. Back when I was solo we definitely didn’t have that many users.


Out of curiosity how many customers do you have as a solo founder?


I definitely had too many users before we hired more developers and support staff.

Maybe 400 businesses each with 10+ employees and 10,000+ customers (who also had access to the platform). It was too much and I regret not hiring sooner.

Another thread mentioned here that hiring someone you trust enough to own problems when you’re unavailable is daunting. I concur with that.


> Maybe 400 businesses each with 10+ employees and 10,000+ customers (who also had access to the platform). It was too much and I regret not hiring sooner.

FWIW, that sounds like an interesting story/case study, if you're willing to share some time. I'm not surprised by your conclusion, but I'm quite impressed if you managed to scale even tolerably well to that level before bringing in some extra help.


OpenRent[1] founder here. I was the only technical person at our company until we hit 1m users (certainly only person who could restart/switch servers).

I guess the question is, what happens if the server goes down whilst you're at work? The answer is that if you're constantly fighting fires 9-6, your software is probably severely broken. I'd suggest this is pretty unusual, or at least, I've never heard of software being held together like that at a company that still exists.

You wouldn't want the servers to go down whilst you're at work, in a meeting, or out to dinner with friends. So you design things to be as redundant as reasonably possible.

Then when you make a mistake, you fix it so it never happens again.

Server fear should be the least of your worries. As a founder, lots of things can go wrong that will interrupt a holiday or downtime. In my experience, it's rarely, if ever, software or hardware issues.

[1] - https://www.openrent.co.uk


> Server fear should be the least of your worries. As a founder, lots of things can go wrong that will interrupt a holiday or downtime. In my experience, it's rarely, if ever, software or hardware issues.

I agree. My own product pretty much did not go down at all in 3 years. But I've seen problems all over the place while working FT at tech companies. Usually, these were created by devs during software releases that would take systems down or corrupt data.

So, don't release silly things before you go on a vacation.


So you are saying that if you are a solo founder and you have designed in redundancy then it's all good to take a holiday and not monitor anything?


I see you got some down votes but I see your question more as a helpful inquiry.

The idea is that you don't have to monitor anything actively. You still have your phone and email in case something goes wrong. But you are not frantically checking dashboards to see whether response time is 0.05 seconds slower at any given moment.

What the parent is saying is that if you have normal operations, most of the time everything will be OK. Despite all of the "spooky stories", software and servers are mostly reliable, and you probably don't need as much redundancy as the people who sell magic solutions would like you to believe.

If you look at statistics, even for Google, most of the downtime is when someone is changing something on the servers. When you are a single founder and you are having dinner, not changing your server config or deploying new software, 95% of the reasons for the server going down are off the table.


No, you design a system that doesn't go down while you are on holiday, but with monitoring for the unlikely case that it does.


Solo, technical founder here. I started a very niche EdTech company 5 years ago, non-venture funded, grew it (code+users) while having reasonably demanding full-time jobs, and now operate it FT.

In short: it's tough; you're never off. Our errors surface either by way of user emails or monitoring (shoutout to BugSnag), and to this day I still have anxiety going places without my laptop for fear of a critical error coming up and not being able to fix it. I can recall running out of conference talks, being at a shopping mall with my wife, and SO many other incidents where I'd hop onto the floor of a hallway, pull out my laptop, and frantically try to figure out what's wrong (and fix it).

On the support side, we have a small number of large clients. In this regard, there's no such thing as completely disconnecting. I have a shortlist where if I get an email from _____, it doesn't matter what I'm doing, I'm responding within an hour. Outsourcing to "watch the shop" is quite difficult; I find that some businesses can do this more easily than others. For something highly niche, it's more challenging.

On the tech side, I use managed services wherever possible. Heroku is wonderful (IMO), BugSnag is fantastic, we recently switched to Postmark which helped with deliverability of emails.

I've loved building this business. Control over my time each day is a reasonable trade for having to occasionally (rarely now) drop everything. At the same time, I miss big tech and the community of being at a larger company.

Hope that helps :)


Full disclosure: I'm the CEO of BetterUptime.com.

One thing you can do is to properly configure your monitoring software.

1. Pick the right alert sensitivity + notification channel: If your app is well-built and never goes down, 30-second checks and getting alerted after the very first failed request work well. However, if another legacy app is unreliable and often goes down for ~5 minutes when making DB backups, configure your monitoring so that you only get alerted when the legacy service has been down for at least 10 minutes (see the sketch after this list).

2. Get phone calls for high-urgency alerts (e.g. homepage is down)

3. Push notification/Slack message for low-urgency alerts (e.g. background processing queue has too many tasks enqueued). If you're at a dinner with friends and you get a low-urgency alert, you can just ignore it.

4. Don't take it too seriously! Odds are it's not a life/death situation when your app goes down. Downtime happens to everyone!

5. Pick a reliable uptime monitoring provider so that you never get a false alert at 4 a.m. (shameless plug! :)
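
To make point 1 concrete, here is a rough Python sketch of "only alert after a sustained failure". The URL, thresholds, and alert() body are placeholders, not how any particular monitoring product works internally:

  import time
  import urllib.request

  URL = "https://example.com/health"   # hypothetical endpoint
  CHECK_INTERVAL = 30                  # seconds between checks
  FAILURES_BEFORE_ALERT = 20           # 20 checks x 30s = alert after ~10 minutes of downtime

  def is_up(url):
      try:
          with urllib.request.urlopen(url, timeout=10) as resp:
              return resp.status == 200
      except Exception:
          return False

  def alert(message):
      print(message)  # swap in a phone call / push notification integration

  consecutive_failures = 0
  while True:
      if is_up(URL):
          consecutive_failures = 0
      else:
          consecutive_failures += 1
          if consecutive_failures == FAILURES_BEFORE_ALERT:
              alert(URL + " has been failing for about 10 minutes")
      time.sleep(CHECK_INTERVAL)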


Many people here are advocating against cloud, but I'm a huge convert to serverless.

Google Firestore + Cloud Run + Cloud Storage really work well together. There aren’t any servers to maintain, it auto scales to zero.

Compared to some droplet VMs in digital ocean which got restarted every now and then, cloud run has given me 4 nines of reliability according to updown.io monitor.

It’s fast, it’s cheap, it’s low effort once you get the continuous deploy bits setup.


Serverless is possibly the most interesting of the cloud offerings recently. For some types of work, it does seem to make a lot of sense economically as long as you can stand a short delay spinning up from a cold start. It's a little concerning that there isn't more consolidation on industry standards for this yet, though.

Basically everyone providing cloud hosting has something that is a VM and something that is a managed database. If you build your system with any standard tech like Linux or popular programming languages or major databases, it's going to run on any cloud platform you like with relatively little change.

However, to do most useful things with serverless, you're going to need to tie into a specific cloud provider's ecosystem to a much greater extent. That means a lot of platform-specific code talking to proprietary APIs, which feels like it could become a significant drag if you were using serverless for core aspects of your software rather than just the occasional bonus.


> It's a little concerning that there isn't more consolidation on industry standards for this yet, though.

There is, you can run serverless Docker on GCP, AWS and Scaleway at the very least.


But as soon as you have Docker involved, you have reintroduced much of the complexity you had running vanilla VMs anyway. Serverless is then little more than an economic decision, something that might be a bit cheaper than maintaining a full-time VM to do whatever it is you're doing, as long as you can tolerate the cold start penalty (which is typically greater when using Docker).

The main value in serverless, to me at least, is in the ability to write only the actual functionality you need and not care about setting up any run-time environment or setting up a substantial build process to make artefacts to support the run-time environment. With something like AWS Lambda, you can literally just copy and paste some function in any of several programming languages into a box on the dashboard, set a couple of details for security etc, and make it live.


If you can share it, what are your costs like? What kind of capability do you get for those costs?


Design your architecture for the acceptable downtime. We all want 0 downtime, but downtime happens. You really need to understand what you are building for. Calculate the amount of downtime you are fine with per month: for 99.5% it is over 3.5 hours, for 99.9% it is about 43 minutes, and for 99.95% it goes down to about 21 minutes. So the less downtime you allow, the less time you have to react to a problem. How long will it take you to turn on your PC on a weekend and figure out the source of the downtime? If you plan to go above 99.95%, things will really get tough, and you will have to do major restructuring to reach it, as you can't allow failures.

So if you need 98% or 99.99% availability, your designs will be very different. Once you start designing for >99.97%, stuff gets complicated.
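
If you want to see those budgets quickly, here's a tiny Python sketch (per 30-day month):

  # Downtime allowed per 30-day month for a few availability targets.
  minutes_per_month = 30 * 24 * 60  # 43,200

  for target in (0.98, 0.995, 0.999, 0.9995, 0.9999):
      budget = (1 - target) * minutes_per_month
      print(f"{target:.2%} -> {budget:.0f} minutes/month")

  # 98.00% -> 864, 99.50% -> 216 (~3.6 h), 99.90% -> 43, 99.95% -> 22, 99.99% -> 4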


Step one, as others have said, is make sure things are rock-solid enough that you're not nervous about being offline. When nothing critical has gone down in 6 months, you can start relaxing. If things are going down once a month, you need to work on your infrastructure and processes.

I do make sure I'm always available to fix things within a reasonable time. Practically, I try not to do anything where I would be physically unable to get to a computer with Internet access within 30 minutes, though pre-pandemic I would set my phone to silent when I went out to see a movie or was at the gym. Sometimes this also means bringing the laptop in the car when going places you don't plan to actually work, just in case.

One side effect of needing to watch for incoming notifications: I have East Coast relatives who insist on texting pre-7am Pacific (sometimes 20-message text chains), and wouldn't lay off when I told them it was too early and I couldn't just turn off my phone because I need to check for work notifications. Texts and calls from them are now muted 24/7, at least until I eventually get a work-specific cell phone.


I was the only technical employee at a SaaS company for years. I made the mistake of building on Rackspace’s OpenStack Cloud. Their managed MySQL database would crash, seemingly randomly, about 4x/yr for 1+ hour. Pingdom alerts ruled my life. I actually bought a satellite phone that could receive the alerts for when I was on vacation (I tend to vacation in places with no cell service). It really wore me down after a while. To the point where we decided to migrate to AWS. We’ve been using Aurora ever since and have had exactly 1 instance of DB related downtime, and it was because of DNS (it’s always DNS). My life has considerably improved and I no longer have PTSD for the Pingdom sound at 4 AM. My advice? Choose your infra wisely.


As a solo-founder/developer of [1], this is what I have been doing for the past year or so.

1. I searched for basic monitoring solutions for actively monitoring the backend and settled on New Relic. They provide a free plan that is good enough for most startups. I have added a bunch of graphs for system, infrastructure and application monitoring. It keeps me sane and well-informed before things go wrong.

2. On my Digital ocean droplets and database, I have set up Slack alerts that page me in case there is a spike. I have created a free slack workspace just for this and added a different alert ringtone so as to not get confused with other workspaces.

3. I use Freshping to monitor uptime and again, if things go down, I get email and Slack alerts within a couple of minutes.

4. I have the Rollbar agent running for log monitoring. I get an email alert when there is an exception or error.

5. If I am out for more than half a day, I take my laptop with me.

6. I keep my phone on. Always.

In the last year, things have rarely gone down. I mean maybe a couple of times.

Things I do so I can sleep properly,

1. I do not deploy before heading out, or on Fridays or at bedtime.

2. My infrastructure has a lot of redundancy, meaning a larger instance than required, to handle a spike in case I am unavailable.

3. Databases usually break down, so I have recently migrated to DigitalOcean's managed database.

Things I am planning to do,

1. Try out Monit to automate some of the tasks.

2. Write down a list of steps or a runbook in case things go wrong. It is easy to forget steps when the production system is down.

[1] https://blanq.io


Just to let you know, Bitdefender internet security software on my PC is blocking your pricing page, it is labelled as a phishing page


Write good issue templates for features, bugs, and incidents. Do after-incident reports, fix underlying issues by working to automate recovery or, at the very least, document the root cause and the recovery so you know how to do it manually really fast in case you don't know how to automate it yet.

Having clearly written incident reports tends to surface patterns that help you solve for a more general problem family or type, as opposed to playing whack-a-mole solving individual issues. The culprits tend to become clear as the "usual suspects": some module, part of the code base, or piece of functionality that's causing more crashes or outages, which will nudge you to write better tests for it, find a better implementation, or add better exception handling or validation, etc.

Doing this will either prevent future incidents, automatically recover from incidents, or speed up manual recovery while you try to figure out ways to automate this. All these amortize the pain, as you extract every bit of knowledge from these incidents and "institutionalize" it. You're a "solo founder", but there's no need for future team members or "future you" to go through all that: they'll have a knowledge base at their disposal when they join.

Apologize and explain things to your users.

Consistent, systematic effort.


- Monitor (with Bugsnag) and fix all errors/exceptions like it's your religion. No new feature dev if there's an open bug. Write tests.

- Use Heroku. Monitor metrics to ensure you don't have major performance issues

- Use Datadog. Datadog can monitor and fix many things (web request queue too big -> trigger lambda function to scale up Heroku dynos, Worker queue latency too high -> same thing, scale up worker dynos, memory swapping -> restart dyno).

- Spend a lot of time fine tuning your logging, and custom metrics in Datadog. Makes investigating much more pleasurable.

- Any issues or exception notifications route to a #devops channel in my slack. Other slack channels include signups, business metrics, daily revenue reports, etc

- If something ever happens where you had to intervene to fix it, do a real post-mortem with yourself and try to come up with a way for that to never be a problem again.

I also do a lot of remote camping & off-roading without internet. I'm working on a simple little app where I can get paged on my satellite messenger (Garmin Inreach) if something is wrong, and key clients can also ping me. Only trusted contacts can SMS the Garmin Inreach, so I would use Twilio as the communication pipe. And I've pre-ordered Starlink. My off road truck has an elaborate electrical system (Lithium battery, solar, etc) and I plan to find a way to run the Starlink dish off 12v.

Currently working on my home backup plan, which includes a hot-standby Mac mini, Time Machine and cloud backups, home battery backup generator (Ecoflow Delta), Starlink, portable generator, etc.


I've never operated a SaaS app at large scale (i.e. millions of customers with 50-100+ machines, etc.) but for smaller deploys I must say that things haven't ever gotten that bad.

In some of my own projects I've only gotten bitten by little things a few times over the last 5 years. Like an SSL cert not getting recreated successfully, but this could have been prevented at the time if I had registered the LE account with an email address to get notified it wasn't getting renewed in time.

If you put in your due diligence with writing tests, run them automatically as part of your CI pipeline, stick with stable software/tools, and keep things as simple as possible until that no longer works, then you'll set yourself up with a strong base to work from. Then, as you encounter issues, you automate fixing them as soon as possible.

Having monitoring in place to prevent disasters helps too. Like getting notified of unusual CPU / memory / disk usage and getting warned before it becomes a real problem. Sure this requires being messaged but it also means you probably have at least a day's notice before you need to take action. That means you don't need to be glued to a pager and respond in 5 minutes because your site is down. Big difference.

This sort of applies to customer support too. I currently do personal customer support for 30,000+ folks who take one of my programming related courses. From the outside you would think I'd be slammed with requests to the point where every day involves answering questions for 2 hours but really it's nothing like that. With a strong base (a working course that stays updated) it's a handful of emails most days and quite often times nothing.


I wasn't solo, but I was the sole technical founder of a startup; I was there for two years before I transitioned out (the startup is still going strong).

My take: Lean on managed services as much as you can. This will help ensure that you have other experts to reach out to if you have issues with a component of your system. We were on Heroku + AWS RDS (the latter because at the time the MySQL offerings in Heroku were problematic, and we were using MySQL). Even if you don't pay for Heroku support, they were pretty good.

Make sure you set your SLA to something reasonable. For the startup, I am not sure we even committed to an SLA, but we were handling people's money and a crucial part of their operations. So I tried to be responsive within a few hours, especially if the app was down.

As far as actually taking vacations, I did that a few times. If I was close to internet service, I took my laptop and made sure I had cell coverage. I remember freaking out a bit because a camping area I was at had spotty coverage.

One time I was going to take a trip to the Canadian wilds. I had a friend who was running a larger company and who had oncall set up for his product. I documented the heck out of the system and asked them to be oncall for the 10ish days I would be out of touch. I don't recall if we paid them (might have been a 'friend deal' where we would pay them if there were any incidents), but I do recall nothing happened.

To answer your question:

> do they hire freelancers to “watch the shop” when they want a break or are they chained to PagerDuty 24/7?

If I had to pick the category I was in, it was "chained to PagerDuty 24/7".


Be nice to your customers, be open about the difficulties of running a tech business on your own, and they won't abandon you if there's a bit of downtime.


I worked at a ~smallish startup. While we had around 20 devs employed we shared oncall between 3 people.

We invested a lot into availability - especially DBs. Most of our issues were internal DNS related, which we at one point worked around by generating hosts files that updated every hour.

Oncall was shared between 3 of us with all 3 paged at once and us getting on WhatsApp to 1. diagnose and 2. fix. Most of the time only 1 of us was close to a laptop but all 3 of us would assist as best we could.

One of us wasn't tethered at any point in time but for the most part we were able to get to a laptop within 30 minutes at most. I now work at FAAMG and find oncall especially stressful but it's once every ~6 weeks.


What is the on call like? Why is it stressful?

How many days/hours every 6 weeks? Is it 24 hours when you are at it, or only during the day time in your time zone?


There is no single answer, but the general idea is to make your infra resilient and self healing.

That means healthchecks with auto restarts at every level of abstraction, stateless services...

And yeah on top of all that we have monitoring setup with a few alerts.

With that said, we have only had one severe outage since we set up our infra as described above.


Could you please list some resources that could help a complete n00b like me start from somewhere wrt resilient and self healing infra?


The specific tools we use might not apply to you (the backend is a cluster), but happy to share a few ideas:

1- Use a scheduler that autorestarts: systemd, pm2, nomad, ... (we use nomad)

2- Set up healthchecks to detect when your app is not behaving correctly even if it's still running (for example when some exception has crippled the program). An HTTP healthcheck is an endpoint (for example /health) that returns a 200 status code when everything is fine; a minimal sketch follows after this list. If the endpoint is down or returns something else, the service is not considered healthy and it gets restarted (you can limit the number of restarts for errors that a restart cannot solve)

* Systemd supports socket based healthchecks

* pm2 doesn't have built-in support for healthchecks at all but there are some npm modules for that

* Nomad does HTTP healthchecks (through consul, not alone)

* GCP and AWS (and others) support healthchecks at the level of your server and can restart the entire server when the healthcheck goes wrong

3- Monitoring & alerts: I'll cut to the chase and tell you that honestly the best monitoring solution that worked for us is the built in one from our cloud provider (you still need to setup the agent in your server). 3rd party managed solutions are expensive, and I don't want to self deploy something so critical and add to the complexity of our infra.

The main idea in monitoring is not just to be alerted when your servers are down, but to detect issues before they become critical. Common issues like disk or CPU at 70%...

4- High availability: Here be dragons: put a load balancer in front of 3 (or, more generally, 2n+1) servers, all running the same copy of your app. Make sure your app is stateless! There are risks of race conditions, stale data ... so try to explore the other options first

I hope these pointers will help you sleep better at night! You can read more about these topics and look for the tools that match your stack :)
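
As promised under point 2, here is a minimal illustration of an HTTP healthcheck endpoint in Python using only the standard library. The disk-space check is just a placeholder for whatever "healthy" means for your app:

  from http.server import BaseHTTPRequestHandler, HTTPServer
  import shutil

  def checks_pass():
      # Placeholder check: report unhealthy if the disk is nearly full.
      usage = shutil.disk_usage("/")
      return usage.free / usage.total > 0.05

  class HealthHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          if self.path == "/health" and checks_pass():
              self.send_response(200)
              self.end_headers()
              self.wfile.write(b"ok")
          else:
              # Anything other than 200 tells the scheduler / load balancer to restart or route away.
              self.send_response(503)
              self.end_headers()

  HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()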


Avoid Java and clouds, use RAID and monit. Buy much more memory and storage than you think you need so that you'll have a safety buffer.


Solo founder for years, but eventually grew to the point where I hired a small team that can handle 95% of issues. I was at the point where I had to or I'd lose my mind with the control it had over my personal life. Hiring yourself out of that role is a journey in itself.

Anyway, yes, you're the one wearing all the hats, so it's on you. There is no real break, because even if you had someone watching the shop, many times the thing that breaks is the thing only you have deep insight into.

I've been on cross country drives, woken in the middle of the night, at family parties, hanging out with friends when I've gotten paged - and immediately stop what I'm doing to fix the issue, even if it takes a while and ruins said occasion. My platform is ad related, so every second of downtime is pissing off a lot of people because it's directly linked to their revenue. Thankfully that never happened while I was on a plane. I did have to buy a ridiculously expensive WiFi package on a cruise ship twice to monitor things.

I've mitigated most potential issues with better infrastructure, tests, and early warnings, but the occasional unexpected item slips in, maybe once or twice yearly. Luckily I have a staffer with deep knowledge of the platform to handle that now. It took a while to get to that point.


I'm in this place of hiring myself out. It's extremely hard to find someone you trust who is capable of doing at least a good job on something you did great yourself. Since they are not involved personally in the business (% stake), how do you find such employees? Should I always give a %, because otherwise I won't be able to find someone involved enough to manage critical stuff? It's like nobody cares enough... or at least that's what I've found so far.


I'm a solo SaaS guy, and my "shop" doesn't need "watching". By design.

Most of this was accomplished by simply picking a stack that doesn't ever fall down on me, and the rest was by watching for things that might flake out and either fixing the flake or replacing them with less flaky things.

As such, I get maybe one incident a year where I'll walk briskly across to the office to fix something that could do with addressing today rather than next week. But it's never anything as dramatic as the entire site being down. Most often it's the result of Google having shipped a new version of Chrome that breaks some 10 year old feature of their own browser.

The whole goal of the SaaS business stuff was to maximize my vacation time, so anything that got in the way of, say, taking an entire month off to crew a sailboat across the Darien Gap was a non-starter.

So I don't have a pager. Mostly because I've gone out of my way to ensure there will never be anything to page me about.

I've written at length about this all here:

https://www.expatsoftware.com/articles/happiness-is-a-boring...


I architect everything around queues and run at least two redundant, independent processors (written in different languages, even) for each queue.

I try to have as few services involved as possible, which basically means the web server.


Which languages do you use most often? Languages are not created equal, some are clearly superior to others for a given task. If you can develop effectively in a better programming language, then why wouldn't you invest more effort in making the implementation of your task in that language more foolproof? Write better tests, clean-up the interfaces, you can even try to use formal methods to ensure correctness etc.


Perl is the language I use most often for processing, with bash and PHP for redundancy and glue.

As others mentioned, I chose Perl and PHP because those are the languages I am most familiar with, and because they are "boring", meaning they've been stable enough that I could've written my scripts 20 years ago, and they would still work today. PHP to a lesser extent, but still true.

Also, as I mentioned, the "lowest common denominator" style of writing allows me to write Perl which I can copy and paste into PHP almost without changes, and vice-versa. To facilitate this, I ported several functions from one to the other, e.g. str_replace for Perl and index() for PHP.

I can't think of ways to make it more foolproof than writing two redundant systems, nor how much better tests I can have than full coverage of each process by a redundant one, except by introducing triple-redundancy, which is not out of the question. Is that what you had in mind, or something else?


OK, I get it :)


Going back to your comment, I feel like I missed this part, or at least did not address it:

>Write better tests, clean-up the interfaces, you can even try to use formal methods to ensure correctness etc.

I definitely put effort into all these things, but my experience shows that no matter how much work goes into that, things will find a way to fail in a way I could not predict.


So you wrote your app twice?


Yep, pretty much. Once you've written something once, it's easier to write it again. I try to use languages with similar syntax and only use a common subset of their syntax, so it's often almost entirely a copy-paste job.

There are many advantages to doing it this way. One is that I thoroughly review each side of the codebase while writing the other. Another is that I get a complete coverage test suite for free out of the deal. Another is that if something goes sideways, it's easier to figure out where that is. It's also easier to discover any faults, because the outputs don't match up.


Honestly, this sounds so crazy to me that I must be missing something. I might try it out one day! Would it be a fair assumption that these were fairly simple apps in terms of business logic? There are like 2-3 subsystems of mine that the thought of having 2 of each makes me quiver in fear.


In part I was inspired by what I learned about mainframes. I don't know that much about them, but I do know that every single component is hot-swappable and redundant, so that's how I tried to design my application.

As a bonus, I have to design it simply, and the process of rewriting it several times helps that end.


Yes, fairly simple.


All the advice in this thread is good. All I’ll add is, don’t let it get you down. Remember the upside of working solo. I have many not-fond memories of SSHing into servers while in bathrooms at bars. But the freedom of working for yourself is worth it.


I understand Google App Engine is a failed product as far as the HN crowd is concerned. But many of my projects (~10,000 requests a day) still run year after year with almost no maintenance. I.e. select a platform where you don't need to do pager duty.


Yeah, I heard that Google wants to get rid of App Engine internally. I think it's been like that for quite some time. Now they have Cloud Run, which is similar to App Engine, except it's only for running services (no queues, cache, etc.; you have to do that separately). This is what I picked for my own product so I don't have to think about uptime too much.


I had a solo SaaS I ran from '06 to '15, operating on a 17-server cluster at, of all places, the former Enron data center in Los Angeles. In addition to the "traditional" 3 tiers of Dev, Staging, and Production, we (the startup was 2, me and another) had production set up with redundancy. If some hardware failed, other portions of the cluster would re-route and/or assume the failed hardware's duties. The only single point of failure we had was a Federal Reserve-quality hardware firewall - that was the best investment I made, as it sustained massive DDOS attacks and more without breaking a sweat.


In my setup I have everything on a (large) server in a colo, with an exact copy on a second server, with the databases in master/slave. Every two years I buy a new server and swap the oldest one out.

When the master server fails, I can run a script to cut over with very minimal manual interaction. I have not had to use the script in 10 years, and only experienced one outage when the datacenter had a blip.

But... it's really hard to not worry about it occasionally, even after 10 years.


I set up Zabbix [0] on a dedicated Atom server and did all of the heavy legwork once (created templates, triggers, dashboards, auto-discovery IP ranges, etc.). Then I sit back and build my systems as usual and they all become monitored based on the tags applied to the VM. Notification is managed by Zabbix, which sends email alerts, has a tie-in to Twilio for SMS notification, and there are a few third-party mobile apps for remote monitoring.

This also means I am on call 24/7. I have Rundeck [1] (the real star of this automated show) running on another host to tackle most common tasks for me, like restarting services or backing up DBs. But sometimes I do have to phone a friend and ask for help or direct them through tasks to get things running (this has happened once in 12 years).

My buddy and I are finishing up touches on a service monitoring SaaS which is just an html5 front end to the above system. If there is interest I will make a note to have a release party here on HN

[0] - https://www.zabbix.com

[1] - https://www.rundeck.com


Why are you so worried?

IMO, the best way to answer your question is to ask yourself why you're so worried about downtime. Then ask yourself what you can do to fix it.

Also: A mistake I see is for businesses to be so feature focused that they never go back and fix their technical debt. Make sure that your SaaS product is resilient enough for your lifestyle before you add new features or grow.


How often does your server go down? If this is a common occurrence that you are stressed about, try to solve that first.

My advice will be a little controversial in this thread, but cloud providers are really perfect for building durable products. Any situation you can find where you can trade dollars for durability is well worth the ROI as a solo tech owner. Load balancers, auto scaling, aurora clusters, s3, these are all services that help me sleep like a baby even though my SaaS needs a perfect uptime. Expect instance outages, so keep your servers stateless and run at least 2 instances, as small as possible and go horizontal.

Another good idea is to learn how your product can die. Load test, try to break your app, and then fix those weak spots.

These are my opinions and experiences, and have continued to serve me well.


Yes, I'm a solo founder and I can take a break and sleep etc.

The key for me is to keep things simple and have them fail in a predictable way by not mixing server roles.

I run an email forwarding service, so I clearly separate roles: this is the incoming mail server, this is the outgoing server. For each of them we have a pair of two with automatic failover.

Keep the stack simple so that if something fails, only a portion of it fails. For example, it's ok if the landing page is down; people can still send and receive email.

If the mail service is down, we have a check on our homepage to say that our mail service is down and we're working on it.

In other words, try to design the system so that you have a clear boundary between its components, so when something fails you know exactly what failed and can do things like restarting it or scaling up the server (CPU/mem) to fix it.


I've been solo running/building a startup (csper.io) for over a year now, and it just hit profitability a few months ago.

It's easier said than done, but if you can prevent issues in the first place, things will be much more enjoyable.

Some things that worked well for me:

  * GKE on GCP is pretty smooth. When there's a spike in traffic everything autoscales up, so I don't have to do anything. Nice observability, things just work. Just make sure to set container cpu/mem limits.
  * Along that same note, I use MongoDB Atlas, which autoscales both up and down very well, saving money and making my infra resilient
  * GCP has a lot of monitoring/alerting/dashboards that I take advantage of. Health checks around the world, easy integration of logs/metrics. I find structured logging (json) makes setting up alerts pretty easy (see the sketch after this list)
  * Good consolidated logging for when there is an issue you know exactly what went wrong
  * GCP also support application tracing which can make timing issues easy to debug (although it requires a bit of work to setup) (for example if you are missing an index on some db)
  * Automatic deployments (thanks to k8s), there's no checklist for doing a deploy, I just run a single make command. I can't screw that up
  * A staging environment that's a match of production. Plenty of times I've crashed staging, it's worth every penny. It also makes life much less stressful
  * Lots of tests. The tests aren't important for when I'm writing the code, but for months later when I make changes and want to know I didn't mess something else up. I find a good test suite can really help you sleep at night, especially if the test suite covers the critical paths
  * An easy way for users to contact you if there is an issue. No one is perfect, but being able to respond quickly is usually forgiven.
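
As a small illustration of the structured (JSON) logging point above, this is one way to emit one JSON object per log line so a log-based alert can match on fields; the field names are made-up examples, not GCP's schema:

  import json
  import logging
  import sys

  class JsonFormatter(logging.Formatter):
      def format(self, record):
          payload = {
              "severity": record.levelname,
              "message": record.getMessage(),
              "logger": record.name,
          }
          # Extra fields passed via logger.error(..., extra={"fields": {...}})
          payload.update(getattr(record, "fields", {}))
          return json.dumps(payload)

  handler = logging.StreamHandler(sys.stdout)
  handler.setFormatter(JsonFormatter())
  logger = logging.getLogger("app")
  logger.addHandler(handler)
  logger.setLevel(logging.INFO)

  # An alert can then match on, say, severity == "ERROR" and path == "/checkout".
  logger.error("payment failed", extra={"fields": {"path": "/checkout", "status": 502}})
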
Also "stay-cations" are also pretty nice. I try to do one a quarter. I'm still at home if something does break, but I don't do any work for the week. Just load up a new video game and relax for a week. I call it my "monitoring" week.

Hope that helps!


Can you expand on the "Health checks around the world" ?


https://cloud.google.com/monitoring/uptime-checks

If I remember correctly you can specify a bunch of regions for the health checks to originate from. It was super simple to setup (point and click) and it's nice that it's decoupled from the rest of my infrastructure. When there's a failure I get a notification.


I always made sure I had monit to keep services alive and init.d scripts that boot all necessary services when the box starts up. Avoid single points of failure as much as possible. Minimize unbounded queries and always set a reasonable request timeout. Have a way to collect stats (statsd is nice).

The reality is, yeah, you probably are not gonna have many restful nights or peaceful dinners... 10 years later for me and I still avoid activities that don't allow me to quickly access a computer. I still always have multiple MiFis in my backpack so that if one cell network is no good, maybe the other one is good enough for me to fix a server... you have to kind of enjoy it


I'm a solo founder. I don't have much monitoring, except for http://status.simpleokr.com/ which gives me high-level insight via email into the api/app being unavailable. But I run everything on GCP Cloud Run, which ensures that my app is up. Database is also in HA mode. So everything is handled by the cloud provider. No outages in the past 3 years. I had one early on due to traffic and DB load, and had to scale the DB server. But my business/SaaS is pretty small so I might be an outlier.


I've set up tons of monitoring, and automated what I could – if an app server goes down, it gets removed from load balancer rotation. If a load balancer goes down, it gets removed from DNS. I haven't automated DB failover, because it's just too hard a problem for me, with too many edge cases.

For critical notifications I use Pushover with an emergency setting – a repeating full volume alert on phone, regardless of volume settings or Do Not Disturb mode.

I do have a "go bag" with a dedicated, prepared laptop that I take with me on longer trips (not that there have been many in the past year).


Being on the pager 24/7 is a sure way to end up in a mental facility. You get a partner (or employees) and set up shifts, or make the services redundant enough that an outage isn't a big deal.


It sounds like you're pre-idea, so one suggestion is to avoid building a "mission critical" type of idea if your desire is to stay solo and not be chained to your laptop...

I run a small SaaS where I had a decision early on to pursue a live chat-based approach to the UX versus an asynchronous approach. A big reason I chose the latter was to avoid the need for 24/7 "real-time" support in favor of a better lifestyle, even though the live version likely would have garnered more customers.


Thanks for the response. I’m currently toying with an idea, and it’s nothing mission critical, so I’m good on that front.

How “chained to your desk” would you say you are? Are there ever any times where you truly clock off?


I don't feel weighed down all that much. I'm not at the "go on an extended backpacking trip with no internet" stage, but I can go for a solid day or so before I get the urge to do a quick check in on the app.

I picked the most "boring" setup I could for a backend and host everything on Heroku, which costs me more but provides a little more peace of mind relative to other setups I've seen.


Sounds like a good setup to me. I think I’ll follow your lead. Congrats on the success so far!


Shameless plug..

This is exactly what we do at MNX Solutions. We are a team of Linux engineers, and provide 24x7 monitoring and response to outages for your cloud based infrastructure.

https://www.mnxsolutions.com/it-services/managed-aws-cloud

Even if we're not a good fit, I'd be happy to chat with anyone about ways to improve their site reliability. It's something we're good at, and love to talk about!


I put a lot of effort into making sure things don’t explode. I write tests. I think about the failure modes.

I use a simple tech stack. Golang monolith, Postgres database.

I pay a little extra for good managed services that auto-recover. I run my database on Cloud SQL and my web servers on Cloud Run on GCP.

As a last line of defence, I have a remote development environment I can access from my phone. I can make fixes and deploy from there. I also have a Garmin InReach satellite communicator that I can be contacted on if I’m out of phone range.


Your question seems a little dismissive of Heroku but I work in that space and it is managed and reliable in a way beyond what you would get piecing your own infrastructure together.


I use Heroku for my own sanity as a solo founder, and I use Papertrail with email notifications to monitor logs. Nothing fancy. I bring my laptop on vacation, but I don't really ever have to use it. I freeze releases a week before vacation, to try and ensure that I don't break anything so that I have room to relax. I agree with others here: use boring tech that you are confident in. I have good integration test coverage, so that also helps. :)


If you use Google Cloud, use Cloud Monitoring (formerly Stackdriver) to set up policies and alerts. There is a Google Cloud Console mobile app that pushes the alerts to you. If you don't use Google Cloud, you can still use Cloud Monitoring (Stackdriver supported AWS, and probably still does).

In addition to that, use managed services as much as possible. On Google Cloud I use a lot of Cloud Run and Cloud SQL, and infrastructure work is kept to a minimum.


I am a solo founder working on my Shopify apps. There was a memory issue and my app was going down for a while. Tbh it's quite hard to reboot the service when I'm out; I only restarted my app when I went home.

I don't hire any freelancers to watch my app. But I am using a monitoring service and Django notification emails for when there's an outage.

For solo founders, it is better to pick a product that can tolerate a small amount of downtime.


I run a small SaaS making just ~$100 MRR. I set up automated pings from UptimeRobot and use Sentry to log exceptions, and I'm currently trying to set up Monit to restart the service if it goes down or overuses memory.

The SaaS provides just a small feature, so if it goes down, users probably won't see much impact. Most issues are solvable by just restarting the web server; I have an SSH app installed on my phone for fixes on the go lol


1) Database backups and PITR (from a managed service like RDS) so you don't lose customers' data, 2) basic monitoring so you are alerted to downtime, even if you're not responding right away, and 3) infra as code so you can redeploy things pretty quickly will get you very close to what companies with dedicated teams do.


People nowadays put so much funny crap into their infrastructure, no wonder it's brittle.

The service suddenly going down shouldn't be a serious risk for a vast majority of online businesses (unless you are doing something exceptional or at an exceptional scale or an amateur).


Automation and high quality all around.

Also look for projects that don't need to be 100% available all the time. I use Zenfolio; they send out emails that the site will be down for a few hours on a random night, and I don't mind it. It is just a portfolio site.


It's interesting reading the comments and seeing a fairly clear divide between developers and operators. Tooling exists to make horizontal scaling much easier these days without going all-out with k8s.


Mostly good test coverage, uptimerobot, sentry.io and nodered to continuously run various scenarios :) also, get infra from well known cloud providers


I have a master reboot script that I can access from Google Console (iOS app). I open it, run the script, and things are ok.


I carry my laptop with me everywhere I go.


Try one of the below

1. NewRelic.com

2. Datadoghq.com

3. Atatus.com


Try Cloud66


they get a co-founder



