Don't really agree that this list could have come about through discussions with engineers at Google, Facebook, etc. The more computers you have the less important it becomes to monitor junk like CPU and memory utilization of individual machines. Host-level CPU usage alerting can't possibly be a "must-have" if there are extremely large distributed systems operating without it.
If you've designed software where the whole service can degrade based on the CPU consumption of a single machine, that right there is your problem and no amount of alerting can help you.
I work at a FAANG and host-level CPU is most definitely an alert we page on. A single host hitting 100% CPU isn't really a problem in and of itself (our SOP is just to replace the host), but it's an important early sign that other hosts may be becoming unhealthy. It might be overkill, but hey, there's mission-critical stuff at hand.
For example: if you have a fleet of hosts handling jobs with retries, a bad job could end up being passed host to host, killing or locking up each host as it gets passed along. And that can happen in minutes, while replacing, deploying, and bootstrapping a new host takes longer. So by the time your automated system detects the problem, removes the host, and spins up a replacement, everything is on fire.
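The retry storm described above can be sketched as a toy simulation (all names and numbers here are made up for illustration, not anyone's production setup): a single "poison" job gets requeued on failure and takes out host after host faster than replacements can land.

```python
import collections

MAX_RETRIES = 5  # hypothetical retry budget

def run_fleet(hosts, jobs):
    """Simulate a fleet pulling jobs from a shared queue.

    A poison job kills/locks up whichever host runs it, then is
    retried on another host. Returns hosts lost, in order.
    """
    queue = collections.deque(jobs)  # entries: (job_id, is_poison, retries)
    healthy = set(hosts)
    dead_order = []
    while queue and healthy:
        job_id, is_poison, retries = queue.popleft()
        host = next(iter(healthy))        # any healthy host picks up the job
        if is_poison:
            healthy.discard(host)         # the job takes the host down
            dead_order.append(host)
            if retries < MAX_RETRIES:
                # requeued before replacement capacity can bootstrap
                queue.append((job_id, True, retries + 1))
    return dead_order

# One bad job wipes out the whole (tiny) fleet as it gets retried:
downed = run_fleet(["h1", "h2", "h3", "h4"], [("job-42", True, 0)])
print(len(downed))  # all four hosts are lost
```

The point of the sketch is the timing mismatch: retries propagate in loop iterations, while host replacement (not modeled here) takes minutes.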
Could you mention which FAANG so I can avoid applying for a job there? Large-scale software systems _must_ be designed to serve through local resource exhaustion. If you're paging on resource exhaustion of a single host, you're just paying the interest on your technical debt by drawing down your SREs' quality of life.
I stand by my beef with this article. The statement that "I've talked with engineers at Google [and concluded that a thing Google wouldn't tolerate is a must-have]" doesn't make sense. What I get from this article is that you can talk with engineers at Google without learning anything.
I'm not at liberty to name my employer right now, but our systems are definitely designed to serve through local resource exhaustion. We aren't talking about cheap hosts here, though. We generally run compute-optimized or memory-optimized hosts depending on the use case, and if these generally powerful hosts hit 100% CPU or full memory utilization, there's usually more going on than something random or simple, so it's important to have someone check it out.
A single host stuck at 100% CPU also has a nasty effect on your tail latency, in a system with wide fanout. If a request hits 100 backend systems, and 1 of them is slow, your 99th percentile latency is going to go in the toilet.
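A rough back-of-the-envelope for the fanout effect (numbers are illustrative, not from the thread): if a fraction p of backends are slow and each request must wait on f of them, the chance a given request touches a slow backend is 1 - (1 - p)^f.

```python
def frac_slow_requests(p: float, fanout: int) -> float:
    """Fraction of requests that hit at least one slow backend,
    assuming slow backends are hit independently with probability p."""
    return 1 - (1 - p) ** fanout

# One slow host in a hundred, with each request fanning out to 100 backends:
print(f"{frac_slow_requests(0.01, 100):.0%}")  # ~63% of requests are slowed
```

So a "1% problem" at the host level becomes a majority-of-requests problem at the service level, which is exactly why the tail blows up.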
This very much depends on the kind of software system. If there is parallel orchestration going on, such as join operators in a scale-out database, the performance of a single machine in the cluster can impact the performance of the entire cluster. In fact, the software will often monitor this itself so that it knows when and where to automatically shed load.
I'd say you should always have CPU monitored, but I get that you might not care to alert on it aggressively. It can be invaluable for hunting down root causes after the fact: nothing's perfect from the first deployment. A single bad host is best if it crashes outright; it's a lot more dangerous if it's just wonky.
Things like CPU hopefully shouldn't be your key/golden service-up metric, but paradoxically, the more mature your system, the more CPU can tell you: you can catch problems before they happen. It can even help you notice hardware issues like bad CPUs.
Memory stays pretty important in my experience, even more so than CPU.
And in addition to all the other responses there are also different levels of pages: Some are page me at 5am, some can wait till morning, and some can wait till Monday. FAANG is more likely to have their own hardware so you actually get deeper/more diverse monitoring needs than a shop on AWS or something.
Outliers are where the interesting stuff happens, and outliers happen to individual instances. Aggregates are useful but can be very misleading: you can have millisecond 99th-percentile latency while ~1% of requests are timing out.
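A toy illustration of that last point (latency numbers are made up): if just under 1% of requests time out, the 99th percentile can sit entirely below the failures and look perfectly healthy.

```python
import statistics

# 1000 requests: 991 fast at 5 ms, 9 (~0.9%) timing out at 30 s.
latencies_ms = [5.0] * 991 + [30_000.0] * 9

# statistics.quantiles with n=100 yields 99 cut points; index 98 is p99.
p99 = statistics.quantiles(latencies_ms, n=100)[98]
print(p99)  # 5.0 — the timeouts all hide above the 99th percentile
```

An aggregate like p99 simply never sees a sub-1% failure mode; you need error-rate or per-instance signals to catch it.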
I wouldn’t alert on a single machine having CPU issues, but I’m definitely interested in a small collection of individual machines all having CPU issues at the same time.
This is an incorrect statement. CPU and memory utilization matter because they limit how many other containers you can pack onto the same host, which means it becomes more and more expensive to run that particular service.
> If you've designed software where the whole service can degrade based on the CPU consumption of a single machine, that right there is your problem and no amount of alerting can help you.