Google SRE Handbook: Monitoring Distributed Systems
sre-handbook
Funnily enough, this chapter is about how to monitor distributed systems: basically, systems where multiple parts work in concert to provide a service. It references Google’s internal monitoring tool, Borgmon, but I’m most familiar with Prometheus and Grafana so, where required, I’ll use those for examples.
In this chapter there were a few things that really stood out to me that I want to dig into and think about how they can be done well:
- A system health dashboard
- Actionable alerts that require intelligence
- Monitoring the seams
- Paging on symptoms
- Removing unused metrics
System health dashboard
I think it would be pretty common to have a dashboard like this; we certainly have one at work. The idea is that it contains a summary of what is going on across services. It should be your first port of call when you get paged, to figure out where the problem may lie. I think the key thing here is that it’s for figuring out where the problem may be, not what the problem is. This is an important difference because it means that the dashboard can be much higher level.

The dashboard should play to one of our strengths as humans - pattern matching. As an operator, this is a dashboard you should be familiar with, and by looking at its panels it should be easy to see what is different from normal service operation. Are there more database connections than usual? Are HTTP requests taking longer than usual? It doesn’t have to be something that any developer can look at and understand what is going on; it’s intended to be used by the Oncall team. It is high level, so it isn’t meant to be read by squinting at the graphs for a while. This dashboard tells you where the problem is, then the dashboards and logs for that specific part of the service can tell you what the problem is.

I think the key property to strive for is high information density. There shouldn’t be heaps of panels on this dashboard because it’s one where you should be familiar with the shape of every panel on it. High information density is a good thing here - we want to be able to see patterns at a glance. Debugging the problem is for other dashboards and observability tools.
This dashboard will always be pretty specific to the service that you’re running but some generally good starting points are:
- Resource usage, e.g.
- Database connections
- CPU usage
- Memory usage
- Thread count
- HTTP connection count
- Recent changes
- Pull requests in the most recent deploy
- Time of last deploy
I think the important question for resource usage is: what are the resources most under contention? Whatever those are for you, they are the ones that should be on this dashboard, because exhaustion of those resources will often point you to the seam at which the issue lies - it could either be that the client is trying to use too much of the resource, or that the provider of that resource has slowed down such that, whilst the client is running normally, it is consuming more resources than it usually does.
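CPU and memory usually come for free from node_exporter or cAdvisor, but app-level resources like database connections have to be exposed by the service itself. Here’s a minimal sketch using the Python prometheus_client - the metric names and the `pool.checkedout()` call are placeholders for whatever your connection pool actually provides (SQLAlchemy’s QueuePool happens to have a method of that name):

```python
from prometheus_client import Gauge, start_http_server
import threading

# Gauges for the resources most under contention; the names are illustrative.
db_connections = Gauge("app_db_connections_in_use",
                       "Database connections currently checked out")
thread_count = Gauge("app_thread_count", "Live threads in the process")

def instrument(pool, port: int = 9100):
    # `pool.checkedout()` is a stand-in for whatever your connection pool exposes.
    db_connections.set_function(pool.checkedout)
    thread_count.set_function(threading.active_count)
    start_http_server(port)  # serve /metrics for Prometheus to scrape
```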
Recent changes are important to know about because, more often than not, the issue is caused by the change from the most recent deploy. Having those changes right there with you whilst you’re trying to figure out where the breakage is will really help you narrow down what is going on.
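One way to get the deploy information onto the dashboard is to push an annotation to Grafana from your deploy pipeline. A rough sketch using Grafana’s annotations HTTP API - the URL, token, and tag are assumptions about your setup:

```python
import time
import requests

def annotate_deploy(grafana_url: str, token: str, text: str) -> None:
    # Record the deploy (and e.g. the PRs it contained) as a Grafana annotation
    # so it shows up on the health dashboard's time axis.
    resp = requests.post(
        f"{grafana_url}/api/annotations",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "time": int(time.time() * 1000),  # epoch milliseconds
            "tags": ["deploy"],
            "text": text,
        },
        timeout=5,
    )
    resp.raise_for_status()
```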
Actionable alerts that require intelligence
This one seems like a bit of a no-brainer, but I think it can be easy to slip into over-alerting accidentally if you aren’t fastidious about this. In the SRE handbook they say:
Unless you’re performing security auditing on very narrowly scoped components of a system, you should never trigger an alert simply because “something seems a bit weird.”
When thinking about this I like to try and take it to the extreme and think about how I’d feel if this woke me up in the middle of the night. Be really ruthless about how much you care about an alert. Often the answer is “not enough to be woken up in the middle of the night”.
You should be able to do something when you get paged; otherwise it’s worse than useless because it has interrupted you unnecessarily. The follow-up to this is the importance of the “requires intelligence” part. The response shouldn’t just be “run this script”, because that’s an automated remediation that could be performed without needing to page you at all.
I think there is an interesting tension here though, because the system should be stable. It shouldn’t constantly have to take these automated reliability measures. If boxes are constantly crashing but your auto-scaling is getting you through, or you’re automatically load shedding traffic, that is indicative of a problem that will show up later: if you are constantly having transient faults then you actually have a persistent fault. However, to me, the difference is that it shouldn’t page you in the middle of the night. That should be reserved for things that are impacting your customers. I think this is where it is good to differentiate between what the SRE Handbook calls whitebox and blackbox monitoring:
- Whitebox monitoring: monitoring with internal knowledge of the system
- Blackbox monitoring: monitoring without internal knowledge. Most likely it interacts with your system in the same way that customers do.
The most urgent level of paging that wakes people up at night should be reserved for blackbox monitoring discovering failures and a small subset of whitebox monitoring that is a leading indicator for catastrophic failure (e.g. the database running out of disk space). The blackbox monitoring tells you there is an issue that will impact customers. You can then dig into the exact cause of that failure knowing that this is really a problem that is either visible or will soon be visible.
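Prometheus’s blackbox_exporter is the usual tool for this, but the idea is simple enough to sketch in Python - probe the same public endpoint a customer would hit and expose success/latency for alerting (the URL and port here are assumptions):

```python
import time
import requests
from prometheus_client import Gauge, start_http_server

probe_success = Gauge("probe_success", "1 if the last probe succeeded")
probe_duration = Gauge("probe_duration_seconds", "Duration of the last probe")

def probe(url: str) -> None:
    start = time.monotonic()
    try:
        ok = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        ok = False
    probe_duration.set(time.monotonic() - start)
    probe_success.set(1 if ok else 0)

if __name__ == "__main__":
    start_http_server(9115)  # expose the probe results for Prometheus to scrape
    while True:
        probe("https://example.com/")  # the user-facing endpoint, not an internal one
        time.sleep(30)
```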
Monitor the seams
This comes from the example in this chapter about how you need to monitor not only how long a database query takes on the application side but also on the database side, to allow you to determine whether the issue is the network or the database. I think this extends much further though - measure on either side of a seam to help you track down where the failure lies. If you do this at all the natural interaction points then it will help you pinpoint the failure to a particular part of the application. This automates something I’ve thought about for a while in relation to debugging: find the natural separation points, test for the issue on either side of them, and that narrows your search space. These separation points are most clear where the service communicates over a network - e.g. the frontend sending a request to the backend, or the backend sending a query to the database - however, they can also lie within one block of code itself. What about functions? Or the point at which your request handling hands off to the application logic? These can all be useful places to monitor to automatically narrow down the search space.
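Here’s a sketch of instrumenting the application side of the application-database seam with the Python prometheus_client; the database side would come from the database’s own metrics (e.g. a Postgres or MySQL exporter), and comparing the two tells you whether the time is going to the network or the database. `run_query` is a placeholder for your real database call:

```python
from prometheus_client import Histogram

# Query latency as seen from the application's side of the seam.
db_query_seconds = Histogram(
    "app_db_query_duration_seconds",
    "Database query latency as observed by the application",
    ["query_name"],
)

def timed_query(query_name, run_query, *args, **kwargs):
    # Wrap the actual database call so each query is timed and labelled.
    with db_query_seconds.labels(query_name=query_name).time():
        return run_query(*args, **kwargs)
```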
Paging on symptoms
I mostly ended up covering this in actionable alerts, but I think the key difference that they’re talking about in the handbook here is that it allows the alerting to remain simple. In actionable alerts, the thing I was focussing on was making sure that the alert fits into a healthy Oncall rotation, but this is more about the maintainability of all the alerts. Not trying to determine the cause in your monitoring means that alerts can be much simpler and thus less brittle. To quote the handbook:
As simple as possible, but no simpler.
There are exceptions to this though. For example, running out of disk space is a cause, but it’s also really obvious that it will lead to problems (unless you’re able to automatically remediate it, and being able to do so is highly application dependent). Whilst these sorts of things are causes, they’re very simple to detect and will undeniably cause problems, so they’re good to alert on.
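For what it’s worth, a check like this is trivial to write. A sketch of the disk-space case, with the mount point and threshold as assumptions (in a Prometheus setup you’d more likely alert on node_exporter’s filesystem metrics):

```python
import shutil

def disk_nearly_full(path: str = "/", threshold: float = 0.10) -> bool:
    # A "cause" alert that is still worth paging on: a leading indicator of
    # undeniable future breakage.
    usage = shutil.disk_usage(path)
    return (usage.free / usage.total) < threshold
```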
Removing unused metrics
This is a very pragmatic one and related to the health of your monitoring system, as opposed to the others, which have been about using your observability tools to discover problems. Basically, metrics cost money and use resources to calculate and manage. If you’re not using them, get rid of them! The SRE Handbook gives the rule of thumb that if a metric hasn’t been used for a quarter it’s up for removal, and if it’s not in a dashboard or alert it’s up for removal. The bit about it not being in a dashboard seems relatively easy to automate - just list out the metrics used in dashboards and subtract that from the total list of metrics. Then it’s a matter of running through those metrics and seeing what can actually be removed. If this is performed by a member of the Oncall team they probably have a pretty good intuition for what can actually be deleted. If this has never been done before then the list could be quite long, but hopefully over time it can be reduced. It’s also a good forcing function for turning useful metrics into dashboards (not necessarily on the system health dashboard though!) because if you’re regularly querying a metric directly in Prometheus, you should probably put it in a dashboard in Grafana.
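A sketch of that automation: pull every metric name from Prometheus, pull every dashboard from Grafana, and print the names that never appear in a panel. The URLs and token are assumptions about your setup, and matching by name is only a heuristic - it won’t catch metrics that are referenced via recording rules or alert rules:

```python
import re
import requests

PROM_URL = "http://prometheus:9090"      # assumption: your Prometheus address
GRAFANA_URL = "http://grafana:3000"      # assumption: your Grafana address
GRAFANA_TOKEN = "..."                    # a Grafana API token with read access

def all_metric_names() -> set[str]:
    # Every metric name Prometheus currently knows about.
    resp = requests.get(f"{PROM_URL}/api/v1/label/__name__/values", timeout=10)
    return set(resp.json()["data"])

def dashboard_bodies() -> str:
    # Concatenate the JSON of every dashboard so we can search it for metric names.
    headers = {"Authorization": f"Bearer {GRAFANA_TOKEN}"}
    search = requests.get(f"{GRAFANA_URL}/api/search?type=dash-db",
                          headers=headers, timeout=10).json()
    bodies = []
    for dash in search:
        full = requests.get(f"{GRAFANA_URL}/api/dashboards/uid/{dash['uid']}",
                            headers=headers, timeout=10).json()
        bodies.append(str(full["dashboard"]))
    return "\n".join(bodies)

def unused_metrics() -> set[str]:
    bodies = dashboard_bodies()
    return {name for name in all_metric_names()
            if not re.search(rf"\b{re.escape(name)}\b", bodies)}

if __name__ == "__main__":
    for name in sorted(unused_metrics()):
        print(name)
```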