Metrics and Monitoring

We need to be able to report business metrics to the business. The business doesn’t care about our CPU utilization; it cares about how much money we are making or losing. Collecting data is cheap, and not having it when you need it can be expensive, so instrument everything.

Fundamentals of monitoring

Four qualities of a good metric

Well-understood: Everyone agrees on what each metric means.
Sufficient granularity: We need to zoom in far enough to see the changes that matter, and to know where and when to look when something happens.
- An average over a shorter period is more accurate, because brief spikes are not smoothed out
- It is easier to search logs/events when we know the problem happened within a 1-millisecond window rather than a 1-minute window.
Tagging and filtering: We need to be able to ask questions of our metrics and filter them. If our predefined metrics do not answer a question, tags let us drill deeper.
- Metadata
- Where the container is running
- Which release we are looking at
Long-lived: The current state is not enough; we need history. Looking at events over longer periods of time lets us see patterns, for example a peak every weekend. It also lets us run better analysis and algorithms over our data points.
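To make granularity and tagging concrete, here is a minimal Python sketch. The `MetricPoint` type, the helper functions, and the sample values (hosts, releases, latencies) are all hypothetical and chosen only for illustration; they are not taken from any particular monitoring product.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class MetricPoint:
    """One data point: a value with a timestamp and descriptive tags."""
    name: str
    value: float
    timestamp: float  # seconds since the start of the sample
    tags: dict = field(default_factory=dict)


def filter_by_tags(points, **wanted):
    """Keep only the points whose tags match every requested key/value."""
    return [
        p for p in points
        if all(p.tags.get(key) == value for key, value in wanted.items())
    ]


def window_averages(points, window_seconds):
    """Average the values per time window, keyed by window start."""
    buckets = {}
    for p in points:
        start = int(p.timestamp // window_seconds) * window_seconds
        buckets.setdefault(start, []).append(p.value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical sample: one latency reading every 10 seconds, with a single
# spike at t=30 shortly after release "v2" starts serving traffic.
SAMPLE_POINTS = [
    MetricPoint("latency_ms", 100.0, 0, {"host": "node-a", "release": "v1"}),
    MetricPoint("latency_ms", 100.0, 10, {"host": "node-b", "release": "v1"}),
    MetricPoint("latency_ms", 100.0, 20, {"host": "node-a", "release": "v1"}),
    MetricPoint("latency_ms", 700.0, 30, {"host": "node-b", "release": "v2"}),
    MetricPoint("latency_ms", 100.0, 40, {"host": "node-a", "release": "v2"}),
    MetricPoint("latency_ms", 100.0, 50, {"host": "node-b", "release": "v2"}),
]

coarse = window_averages(SAMPLE_POINTS, 60)  # one 60-second window
fine = window_averages(SAMPLE_POINTS, 10)    # six 10-second windows
v2_on_node_b = filter_by_tags(SAMPLE_POINTS, release="v2", host="node-b")
```

With this sample, the single 60-second window averages to 200 ms and hides the spike, while the 10-second windows show 700 ms at t=30. Filtering by the release and host tags narrows the search to the two points recorded on node-b after the release changed.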

Types of Metrics

Work Metrics
- Throughput
- Success
- Error
- Performance
Resource Metrics
- Utilization
- Saturation
- Error
- Availability
Events (not metrics but very useful)
- Code changes
- Alerts
- Scaling Events
- Etc.
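As an illustration of the four work metrics, the sketch below derives throughput, success rate, error rate, and a performance figure from a list of request records. The `Request` type, the choice of a nearest-rank 95th percentile as the performance measure, and the sample data are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def work_metrics(requests, window_seconds):
    """Summarize throughput, success, error, and performance for one window."""
    total = len(requests)
    if total == 0:
        return {"throughput_rps": 0.0, "success_rate": None,
                "error_rate": None, "p95_ms": None}
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: the smallest value with at least 95% of
    # samples at or below it. Integer arithmetic avoids rounding surprises.
    rank = max(1, (95 * total + 99) // 100)
    return {
        "throughput_rps": total / window_seconds,
        "success_rate": (total - errors) / total,
        "error_rate": errors / total,
        "p95_ms": durations[rank - 1],
    }


# Hypothetical window: 20 requests in 10 seconds, two of which failed.
REQUESTS = [Request(duration_ms=10.0 * i, ok=(i not in (7, 13)))
            for i in range(1, 21)]
summary = work_metrics(REQUESTS, window_seconds=10)
```

For this sample the summary reports 2 requests per second, a 10% error rate, and a 95th-percentile duration of 190 ms. Resource metrics such as utilization and saturation would be collected separately from the hosts or containers doing the work.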

Putting Metrics to work

The most important metric is MTTR (Mean Time to Recovery). To improve it, reduce both of its components:

Time to detection
- Monitoring
- Alerting
Time to resolution
- Investigation (Send the right alert)
- Remediation
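The breakdown above can be sketched as a small calculation: for each incident, time to detection runs from the start of the problem to the moment it was caught, and time to resolution runs from detection to recovery. The `Incident` record and the two sample incidents are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the problem began
    detected: datetime   # when monitoring or alerting caught it
    resolved: datetime   # when service was restored


def mean_timedelta(deltas):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mttr_breakdown(incidents):
    """Split mean time to recovery into detection and resolution parts."""
    detection = mean_timedelta([i.detected - i.started for i in incidents])
    resolution = mean_timedelta([i.resolved - i.detected for i in incidents])
    return {
        "time_to_detection": detection,
        "time_to_resolution": resolution,
        "mttr": detection + resolution,
    }


# Hypothetical incident history.
INCIDENTS = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5),
             datetime(2024, 1, 1, 10, 35)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 15),
             datetime(2024, 1, 2, 14, 25)),
]
breakdown = mttr_breakdown(INCIDENTS)
```

Here the mean time to detection is 10 minutes and the mean time to resolution is 20 minutes, giving an MTTR of 30 minutes. Seeing the two parts separately shows whether better monitoring and alerting or faster investigation and remediation would help more.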

Alerting and Paging

Page on symptoms as seen in the work metrics, rather than on underlying causes. Alert people only about what they care about and can change: the business for business metrics, developers for service metrics, and infrastructure teams for infrastructure metrics.
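One way to encode this routing is a small lookup from metric category to audience. The sketch below is only an assumption about how such a rule might look: the category names, the audience names, and the choice to send non-work metrics as a lower-urgency "notify" instead of a page are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping from metric category to the people who can act on it.
AUDIENCE = {
    "business": "business-owners",
    "service": "developers",
    "infrastructure": "infrastructure-team",
}


@dataclass
class Alert:
    metric: str
    category: str         # "business", "service", or "infrastructure"
    is_work_metric: bool  # True for throughput, success, error, performance


def route(alert):
    """Return (audience, channel) for an alert."""
    audience = AUDIENCE.get(alert.category)
    if audience is None:
        raise ValueError(f"unknown metric category: {alert.category!r}")
    # Assumption: only work metrics interrupt someone with a page; anything
    # else is delivered as a lower-urgency notification.
    channel = "page" if alert.is_work_metric else "notify"
    return audience, channel


checkout_errors = Alert("checkout.error_rate", "service", is_work_metric=True)
disk_usage = Alert("disk.utilization", "infrastructure", is_work_metric=False)
```

With these definitions, `route(checkout_errors)` pages the developers, while `route(disk_usage)` sends a notification to the infrastructure team.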

Make alerts actionable and useful. Each alert should cover:

What happened?
Why does it matter (to the business)?
Investigation and remediation
- Confirm there’s an issue
- Easy or temporary fixes
- How to investigate
- Help and resources
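The checklist above can be turned into an alert template so that every alert carries the same sections. This is a minimal sketch; the `ActionableAlert` class, its field names, and the example contents are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ActionableAlert:
    what_happened: str
    why_it_matters: str
    confirm_steps: list
    quick_fixes: list
    investigation_steps: list
    help_resources: list

    def render(self):
        """Format the alert as plain text, one section per checklist item."""
        lines = [
            f"What happened: {self.what_happened}",
            f"Why it matters: {self.why_it_matters}",
            "Investigation and remediation:",
        ]
        sections = (
            ("Confirm there is an issue", self.confirm_steps),
            ("Easy or temporary fixes", self.quick_fixes),
            ("How to investigate", self.investigation_steps),
            ("Help and resources", self.help_resources),
        )
        for title, items in sections:
            lines.append(f"  {title}:")
            lines.extend(f"    - {item}" for item in items)
        return "\n".join(lines)


# Hypothetical example; none of these details come from a real system.
EXAMPLE_ALERT = ActionableAlert(
    what_happened="Checkout error rate above 5% for 10 minutes",
    why_it_matters="Customers cannot complete purchases, so revenue is lost",
    confirm_steps=["Open the checkout dashboard and check the error graph"],
    quick_fixes=["Roll back the most recent release"],
    investigation_steps=["Filter errors by release tag", "Check recent events"],
    help_resources=["Team runbook", "On-call escalation contact"],
)
rendered = EXAMPLE_ALERT.render()
```

Printing `rendered` gives a message the recipient can act on without first having to work out what the alert means or where to start looking.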
