I'm not saying you should not do that; I'm saying your focus should not be there. When things go south, your CEO is rarely interested in CPU utilization or error rate. The questions they usually ask:
How bad is it? What is the user impact? Is it all hands on deck, or is it just a glitch? When it is cascading, which system do we fix first? What is the root cause?
Component failure rate just doesn't have enough context.
And yes, distributed systems are hard, because it is inherently hard to reason about what will happen when something changes/fails.
I'm not from Uber, but I recently worked (was responsible for a huge chunk of infra) at a company that is bigger and has more products running, and I saw some hilarious failures. What was different: it was rarely a bad code push, and when it was, due to the nature of the business, it was sometimes really hard to roll back.