Don't really agree that this list could have come about through discussions with engineers at Google, Facebook, etc. The more computers you have the less important it becomes to monitor junk like CPU and memory utilization of individual machines. Host-level CPU usage alerting can't possibly be a "must-have" if there are extremely large distributed systems operating without it.
If you've designed software where the whole service can degrade based on the CPU consumption of a single machine, that right there is your problem and no amount of alerting can help you.
I work at a FAANG and host-level CPU is most definitely an alert we page on. A single host hitting 100% CPU isn't really a problem in and of itself (our SOP is just to replace the host), but it's an important early sign that other hosts may be becoming unhealthy. It might be overkill, but hey, there's mission-critical stuff at hand.
For example: if you have a fleet of hosts handling jobs with retries, a bad job could end up being passed host to host, killing or locking up each host as it gets passed along. And that can happen in minutes, while replacing, deploying, and bootstrapping a new host takes longer. So by the time your automated system detects the problem, removes the host, and spins up a replacement, everything is on fire.
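The retry storm described above can be sketched as a toy simulation (all names and numbers here are made up for illustration, not anyone's production setup): a single "poison" job gets requeued on failure and takes out host after host faster than replacements can land.

```python
import collections

MAX_RETRIES = 5  # hypothetical retry budget

def run_fleet(hosts, jobs):
    """Simulate a fleet pulling jobs from a shared queue.

    A poison job kills/locks up whichever host runs it, then is
    retried on another host. Returns hosts lost, in order.
    """
    queue = collections.deque(jobs)  # entries: (job_id, is_poison, retries)
    healthy = set(hosts)
    dead_order = []
    while queue and healthy:
        job_id, is_poison, retries = queue.popleft()
        host = next(iter(healthy))        # any healthy host picks up the job
        if is_poison:
            healthy.discard(host)         # the job takes the host down
            dead_order.append(host)
            if retries < MAX_RETRIES:
                # requeued before replacement capacity can bootstrap
                queue.append((job_id, True, retries + 1))
    return dead_order

# One bad job wipes out the whole (tiny) fleet as it gets retried:
downed = run_fleet(["h1", "h2", "h3", "h4"], [("job-42", True, 0)])
print(len(downed))  # all four hosts are lost
```

The point of the sketch is the timing mismatch: retries propagate in loop iterations, while host replacement (not modeled here) takes minutes.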
Could you mention which FAANG so I can avoid applying for a job there? Large-scale software systems _must_ be designed to serve through local resource exhaustion. If you're paging on resource exhaustion of a single host, you're just paying the interest on your technical debt by drawing down your SREs' quality of life.
I stand by my beef with this article. The statement that "I've talked with engineers at Google [and concluded that a thing Google wouldn't tolerate is a must-have]" doesn't make sense. What I get from this article is that you can talk with engineers at Google without learning anything.
I'm not at liberty to name my employer right now, but our systems are definitely designed to serve through local resource exhaustion. We aren't talking about cheap hosts here, though. We generally run compute-optimized or memory-optimized hosts depending on the use case, and if these generally powerful hosts hit 100% CPU or full memory utilization, there's usually more going on than something random or simple, so it's important to have someone check it out.
A single host stuck at 100% CPU also has a nasty effect on your tail latency, in a system with wide fanout. If a request hits 100 backend systems, and 1 of them is slow, your 99th percentile latency is going to go in the toilet.
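A rough back-of-the-envelope for the fanout effect (numbers are illustrative, not from the thread): if a fraction p of backends are slow and each request must wait on f of them, the chance a given request touches a slow backend is 1 - (1 - p)^f.

```python
def frac_slow_requests(p: float, fanout: int) -> float:
    """Fraction of requests that hit at least one slow backend,
    assuming slow backends are hit independently with probability p."""
    return 1 - (1 - p) ** fanout

# One slow host in a hundred, with each request fanning out to 100 backends:
print(f"{frac_slow_requests(0.01, 100):.0%}")  # ~63% of requests are slowed
```

So a "1% problem" at the host level becomes a majority-of-requests problem at the service level, which is exactly why the tail blows up.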
This very much depends on the kind of software system. If there is parallel orchestration going on, such as join operators in a scale-out database, the performance of a single machine in the cluster can impact the performance of the entire cluster. In fact, the software will often monitor this itself so that it knows when and where to automatically shed load.
I'd say you should always have CPU monitored, but I get that you might not care to alert on it aggressively. It can be invaluable for hunting down root causes after the fact: nothing's perfect from the first deployment. A single bad host is best if it crashes outright; it's a lot more dangerous if it's just wonky.
Things like CPU hopefully shouldn't be your key/golden service-up metric, but paradoxically, the more mature your system, the more CPU can tell you: you can catch problems before they happen. It can even help you notice hardware issues like bad CPUs.
Memory stays pretty important in my experience, even more so than CPU.
And in addition to all the other responses there are also different levels of pages: Some are page me at 5am, some can wait till morning, and some can wait till Monday. FAANG is more likely to have their own hardware so you actually get deeper/more diverse monitoring needs than a shop on AWS or something.
Outliers are where the interesting stuff happens, and outliers happen to individual instances. Aggregates are useful but can be very misleading: you can have millisecond 99th-percentile latency while ~1% of requests are timing out.
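A toy illustration of that last point (latency numbers are made up): if just under 1% of requests time out, the 99th percentile can sit entirely below the failures and look perfectly healthy.

```python
import statistics

# 1000 requests: 991 fast at 5 ms, 9 (~0.9%) timing out at 30 s.
latencies_ms = [5.0] * 991 + [30_000.0] * 9

# statistics.quantiles with n=100 yields 99 cut points; index 98 is p99.
p99 = statistics.quantiles(latencies_ms, n=100)[98]
print(p99)  # 5.0 — the timeouts all hide above the 99th percentile
```

An aggregate like p99 simply never sees a sub-1% failure mode; you need error-rate or per-instance signals to catch it.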
I wouldn’t alert on a single machine having CPU issues, but I’m definitely interested in a small collection of individual machines all having CPU issues at the same time.
This is an incorrect statement. CPU and memory utilization matter because they limit how many other containers you can pack onto the same host, which means it becomes more and more expensive to run that particular service.
> If you've designed software where the whole service can degrade based on the CPU consumption of a single machine, that right there is your problem and no amount of alerting can help you.