We are a very small team at https://codeinterview.io, and we recently reached a respectable level of reliability. Some things you should do:
- Have a pool of at least 2 instances (ideally per service) running under an auto-scaler or a managed K8s (GKE is best) with a load balancer in front. You may also want to explore AWS Elastic Beanstalk and Google Cloud Run. If you can use them, use them!
- Uptime alerts: Pingdom (or New Relic alerts) hooked up to PagerDuty.
- Health checks! The trick is to recover the failed container/pod/service before you get that PagerDuty call. Ideally, if you have 2 of each service running, #2 will handle the requests until #1 is recreated.
- Sentry + New Relic APM + infra: You should monitor all error stack traces, request throughput, and average response time. For infra, you mainly need to watch memory and CPU usage. After each downtime, you should have good visibility into what caused it. Set alerts on higher-than-normal memory usage so you can catch the problem before the crash.
- Logs: your server logs should be stored somewhere (Stackdriver on GCP or CloudWatch on AWS).
These might sound overwhelming for a single person, but they are one-time efforts, after which things are mostly automatic.
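To make the health check point concrete, here is a minimal sketch of an endpoint that a load balancer or K8s probe could poll. It uses only the Python standard library. The `/healthz` path, port 8080, and the `check_database` probe are my own assumptions for illustration, not something from our setup; swap in whatever dependencies your service actually has.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def check_database() -> bool:
    # Hypothetical probe: replace with a real ping to your database.
    return True


# Hypothetical registry of dependency checks, keyed by a display name.
CHECKS: Dict[str, Callable[[], bool]] = {"database": check_database}


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check; return 200 only if all pass, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failure.
            results[name] = False
    code = 200 if all(results.values()) else 503
    status = "ok" if code == 200 else "unhealthy"
    return code, {"status": status, "checks": results}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Only the assumed /healthz path is served; everything else is a 404.
        if self.path != "/healthz":
            self.send_error(404)
            return
        code, body = health_status(CHECKS)
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Returning 503 when any dependency fails is what lets the orchestrator pull the instance out of rotation and recreate it while the other instance keeps serving.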
One thing that has helped me a lot with monitoring is custom application-level metrics.
If you have a good idea of the usage patterns of your service, create metrics backed by those patterns. This can help you find problems that CPU and memory graphs will hide.
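As a rough sketch of what such a metric could look like: count a business-level event in a sliding window and flag when it drops below what you normally see. Everything here is hypothetical (the `WindowedCounter` class, the "sessions started" event, the 5 minute window, the threshold of 3). In a real setup you would probably push the value to your monitoring tool as a custom metric and alert from there instead of checking in-process.

```python
import time
from collections import deque
from typing import Deque, Optional


class WindowedCounter:
    """Counts events that happened within the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[float] = deque()

    def record(self, now: Optional[float] = None) -> None:
        # `now` can be passed explicitly, which makes the class easy to test.
        self._events.append(time.monotonic() if now is None else now)

    def count(self, now: Optional[float] = None) -> int:
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        # Drop events that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events)


def below_expected(counter: WindowedCounter, expected_min: int,
                   now: Optional[float] = None) -> bool:
    """True when the event count in the window is under the usual baseline."""
    return counter.count(now) < expected_min


# Hypothetical usage: "sessions started" over a 5 minute window.
sessions_started = WindowedCounter(window_seconds=300)
for t in (0.0, 10.0, 20.0):
    sessions_started.record(now=t)

recent = sessions_started.count(now=30.0)                        # all 3 events in window
alert = below_expected(sessions_started, expected_min=3, now=305.0)  # event at t=0 aged out
```

A drop like this can happen while CPU and memory look perfectly healthy, for example when a signup flow is broken and nobody reaches the expensive part of the app.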