Work In Progress
Work on this post is underway... It will be done "soon..."
We need to be able to report business metrics to the business. The business doesn’t care about our CPU utilization; it cares about how much money we are making or losing. Collecting data is cheap, and not having it when you need it can be expensive, so instrument everything.
Fundamentals of monitoring
Four qualities of a good metric
- Well understood. Everyone is clear on, and agrees on, what each metric means.
- Sufficient granularity. We need to zoom in far enough to see the changes that matter, and to know where and when to look when an event occurs.
  - An average taken over a shorter window is more representative, because short spikes are not smoothed away.
  - It is easier to find the relevant logs and events when we know the problem happened within a 1-millisecond window rather than a 1-minute one.
- Tagging and filtering. We need to be able to ask questions of our metrics and filter them. If our predefined metrics do not answer a question, we can drill deeper.
  - Metadata
    - Where the container is running
    - Which release we are looking at
- Long-lived. The current state is not enough; we need history. We need to be able to look at events over longer periods of time so that we can see correlations, for example a peak every weekend. A longer history also lets us run better analyses and algorithms over our data points.
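To make the granularity and tagging points concrete, here is a minimal sketch in Python. The `Point` record, the tag names (`host`, `release`), and the sample values are all hypothetical and chosen only for illustration; no particular monitoring system is assumed.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Point:
    """One latency sample (hypothetical schema, for illustration only)."""
    timestamp: int              # seconds since the start of the sample
    value: float                # e.g. request latency in milliseconds
    tags: dict = field(default_factory=dict)


def filter_by_tags(points, **wanted):
    """Keep only the points whose tags match every requested key/value."""
    return [p for p in points
            if all(p.tags.get(k) == v for k, v in wanted.items())]


def windowed_average(points, window_seconds):
    """Average the values in fixed-size time windows, keyed by window start."""
    buckets = {}
    for p in points:
        start = (p.timestamp // window_seconds) * window_seconds
        buckets.setdefault(start, []).append(p.value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Sixty seconds of made-up data: a steady 10 ms everywhere, except for a
# 5-second spike to 500 ms that only affects release "v2".
points = []
for t in range(60):
    points.append(Point(t, 10.0, {"host": "a", "release": "v1"}))
    latency = 500.0 if 30 <= t < 35 else 10.0
    points.append(Point(t, latency, {"host": "b", "release": "v2"}))

v2 = filter_by_tags(points, release="v2")   # drill down by tag
per_minute = windowed_average(v2, 60)       # coarse granularity
per_5s = windowed_average(v2, 5)            # fine granularity
```

With one-minute windows the spike is diluted to roughly 51 ms. With five-second windows it shows up as a single 500 ms bucket starting at second 30, which tells us which release to look at and where in the logs to start.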
Types of Metrics
- Work Metrics
- Resource Metrics
- Events (not metrics, but very useful)
  - Code changes
  - Scaling events
Putting Metrics to work
The most important metric is MTTR (Mean Time to Recovery). To improve it, work on:
- Time to detection
- Time to resolution
- Investigation (send the right alert)
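As a rough illustration of how these pieces add up, the sketch below computes MTTR from a list of incidents and splits it into a detection part and a resolution part. The `Incident` record, its field names, and the sample timestamps are assumptions made for this example. Investigation time is counted as part of the resolution phase here, which is one possible convention rather than a standard definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    started: datetime    # when the problem began
    detected: datetime   # when an alert fired or someone noticed
    resolved: datetime   # when service was restored


def mean_duration(durations):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)


def summarize(incidents):
    """Split recovery time into a detection part and a resolution part."""
    return {
        "time_to_detection": mean_duration(
            [i.detected - i.started for i in incidents]),
        "time_to_resolution": mean_duration(
            [i.resolved - i.detected for i in incidents]),
        "mttr": mean_duration(
            [i.resolved - i.started for i in incidents]),
    }


# Made-up incidents for illustration.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=35)),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=25)),
]
summary = summarize(incidents)
```

For this sample data the mean time to detection is 10 minutes, the mean time to resolution is 20 minutes, and the MTTR is 30 minutes. Seeing the two parts separately shows whether faster detection or faster resolution would help more.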
Alerting and Paging
Don’t page on symptoms; page on work metrics. Alert people only on what they care about and can change: the business on business metrics, developers on service metrics, and infrastructure teams on infrastructure metrics.
Make alerts actionable and useful. Each alert should cover:
- What happened?
- Why does it matter? (business?)
- Investigation and remediation
  - Confirm there’s an issue
  - Easy or temporary fixes
  - How to investigate
  - Help and resources
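One way to keep alerts consistent is to treat the checklist above as a template. The sketch below is a minimal Python version: the `Alert` class, its field names, the routing table, and the example alert text are all hypothetical and not tied to any alerting tool.

```python
from dataclasses import dataclass, field

# Hypothetical routing table: each audience is alerted only on the kind of
# metric it cares about and can change.
AUDIENCE_BY_CATEGORY = {
    "business": "business",
    "service": "developers",
    "infrastructure": "infrastructure",
}


@dataclass
class Alert:
    """An alert that carries its own context (illustrative structure)."""
    category: str                 # "business", "service" or "infrastructure"
    what_happened: str
    why_it_matters: str
    confirm_steps: list = field(default_factory=list)
    quick_fixes: list = field(default_factory=list)
    investigation_steps: list = field(default_factory=list)
    help_links: list = field(default_factory=list)

    def audience(self):
        """Return who should receive this alert, based on its category."""
        return AUDIENCE_BY_CATEGORY[self.category]

    def render(self):
        """Format the alert as plain text, one section per question."""
        sections = [
            ("What happened", [self.what_happened]),
            ("Why it matters", [self.why_it_matters]),
            ("Confirm there is an issue", self.confirm_steps),
            ("Easy or temporary fixes", self.quick_fixes),
            ("How to investigate", self.investigation_steps),
            ("Help and resources", self.help_links),
        ]
        lines = [f"To: {self.audience()}"]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


# Made-up example alert.
alert = Alert(
    category="service",
    what_happened="Checkout error rate has been above 5% for 10 minutes",
    why_it_matters="Customers cannot complete purchases, so revenue is lost",
    confirm_steps=["Check the checkout error-rate dashboard"],
    quick_fixes=["Roll back the latest release"],
    investigation_steps=["Compare error logs before and after the last deploy"],
    help_links=["Runbook: checkout errors (placeholder title)"],
)
text = alert.render()
```

Calling `print(text)` shows the recipient on the first line followed by one section per checklist item, so whoever is paged does not have to reconstruct the context first.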