Work In Progress

Work on this post is underway... It will be done "soon..."


We need to be able to report business metrics to the business. The business doesn’t care about our CPU utilization; it cares about how much money we are making or losing. Collecting data is cheap, and not having it when you need it can be expensive, so instrument everything.

Fundamentals of monitoring

Four qualities of a good metric

  1. Well-understood. Everyone agrees on what each metric means.
  2. Sufficient granularity. We need to zoom in far enough to see the changes that matter, and to know where and when to look when an event occurs.
    • An average taken over a shorter period is more representative, because brief spikes are not smoothed away.
    • It is easier to search logs and events when we know the problem happened within a 1-millisecond window rather than a 1-minute window.
  3. Tagging and filtering. We need to be able to ask questions of our metrics and filter events. If our predefined metrics do not answer a question, we can drill deeper using tags such as:
    • Metadata
    • Where the container is running
    • Which release we are looking at
  4. Long-lived. The current state is not enough; we need history. We need to be able to look at events over longer periods of time so that we can see patterns, for example a peak every weekend. A longer history also lets us run better analysis and algorithms over our data points.
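To make granularity and tagging concrete, here is a minimal Python sketch. The `Point` class, the tag strings, and the sample latency values are all made up for illustration; a real metrics system stores and queries this differently. It shows how a one-minute average can hide a two-second spike that a ten-second rollup exposes, and how tags let us filter down to one release.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass(frozen=True)
class Point:
    """One metric sample: a name, a timestamp, a value, and free-form tags."""
    name: str
    timestamp: float  # seconds since an arbitrary start
    value: float
    tags: frozenset = field(default_factory=frozenset)  # e.g. {"release:v2"}


def filter_by_tags(points, *required_tags):
    """Keep only the points that carry every requested tag."""
    wanted = set(required_tags)
    return [p for p in points if wanted <= p.tags]


def rollup_mean(points, window_seconds):
    """Average values per time window; smaller windows mean finer granularity."""
    buckets = defaultdict(list)
    for p in points:
        bucket_start = int(p.timestamp // window_seconds) * window_seconds
        buckets[bucket_start].append(p.value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical data: one latency sample (ms) per second for a minute,
# with a two-second spike at t=30 and t=31.
points = [
    Point("request.latency_ms", float(t), 500.0 if t in (30, 31) else 20.0,
          frozenset({"host:node-1", "release:v2"}))
    for t in range(60)
]
# One sample from a different host and release, to show filtering.
points.append(Point("request.latency_ms", 5.0, 25.0,
                    frozenset({"host:node-2", "release:v1"})))

v2_points = filter_by_tags(points, "release:v2")
per_minute = rollup_mean(v2_points, 60)  # a single bucket: the spike is smoothed away
per_10s = rollup_mean(v2_points, 10)     # six buckets: the spike stands out at t=30

print(per_minute)
print(per_10s)
```

With this data, the one-minute average comes out to 36 ms, which looks unremarkable, while the ten-second bucket starting at t=30 averages 116 ms against 20 ms everywhere else.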

Types of Metrics

  1. Work Metrics
    • Throughput
    • Success
    • Error
    • Performance
  2. Resource Metrics
    • Utilization
    • Saturation
    • Error
    • Availability
  3. Events (not metrics but very useful)
    • Code changes
    • Alerts
    • Scaling Events
    • Etc.
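As a rough illustration of the work metrics above, the following Python sketch derives throughput, success rate, error rate, and a latency percentile from a batch of request records. The `Request` class, the `work_metrics` function, and the choice of a nearest-rank 95th percentile for "performance" are assumptions made for this example, not a standard API.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """A hypothetical record of one handled request."""
    duration_ms: float
    ok: bool


def work_metrics(requests, window_seconds, percentile=95):
    """Summarize the requests seen during one time window."""
    total = len(requests)
    if total == 0:
        return {"throughput_rps": 0.0, "success_rate": None,
                "error_rate": None, "latency_ms": None}
    successes = sum(1 for r in requests if r.ok)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: the smallest value with at least
    # `percentile` percent of the samples at or below it.
    rank = max(1, math.ceil(percentile * total / 100))
    return {
        "throughput_rps": total / window_seconds,    # throughput
        "success_rate": successes / total,           # success
        "error_rate": (total - successes) / total,   # error
        "latency_ms": durations[rank - 1],           # performance
    }


# Hypothetical window: 20 requests in 10 seconds, the two slowest ones failed.
requests = [Request(duration_ms=10.0 * i, ok=(i <= 18)) for i in range(1, 21)]
summary = work_metrics(requests, window_seconds=10)
print(summary)
```

Resource metrics can be summarized in the same way, for example utilization as busy time divided by total time in the window.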

Putting Metrics to work

The most important metric is MTTR (Mean Time to Recovery). To improve it, reduce:

  1. Time to detection
    • Monitoring
    • Alerting
  2. Time to resolution
    • Investigation (Send the right alert)
    • Remediation
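The breakdown above can be sketched in Python as follows. The `Incident` record and its three timestamps are hypothetical; the assumption here is that each incident log records when the problem started, when it was detected, and when it was resolved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record with the three timestamps we need."""
    started: datetime
    detected: datetime
    resolved: datetime


def mean_timedelta(deltas):
    """Arithmetic mean of a non-empty sequence of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def recovery_breakdown(incidents):
    """Split MTTR into its detection and resolution components."""
    return {
        "time_to_detection": mean_timedelta(i.detected - i.started for i in incidents),
        "time_to_resolution": mean_timedelta(i.resolved - i.detected for i in incidents),
        "mttr": mean_timedelta(i.resolved - i.started for i in incidents),
    }


# Two made-up incidents.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10),
             datetime(2024, 1, 1, 10, 40)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 2),
             datetime(2024, 1, 2, 14, 22)),
]
breakdown = recovery_breakdown(incidents)
print(breakdown)
```

For these two incidents the mean time to detection is 6 minutes and the mean time to resolution is 25 minutes, which add up to an MTTR of 31 minutes. Seeing the two parts separately shows whether better monitoring and alerting or faster investigation and remediation would help more.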

Alerting and Paging

Page on symptoms, which show up in the work metrics, rather than on potential causes such as resource metrics. Alert people only on what they care about and what they can change: the business for business metrics, developers for service metrics, and infrastructure for infrastructure metrics.
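One simple way to express this routing is a lookup table from metric category to audience. The category names and audience labels below are placeholders picked for this sketch; a real setup would map them to whatever teams or on-call rotations exist.

```python
# Hypothetical mapping from metric category to the audience that can act on it.
AUDIENCE_BY_CATEGORY = {
    "business": "business",
    "service": "developers",
    "infrastructure": "infrastructure",
}


def audience_for(category):
    """Return who should receive alerts for a given metric category."""
    try:
        return AUDIENCE_BY_CATEGORY[category]
    except KeyError:
        # Fail loudly rather than send an alert to people who cannot act on it.
        raise ValueError(f"unknown metric category: {category!r}") from None


print(audience_for("service"))
```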

Make alerts actionable and useful. Each alert should answer:

  1. What happened?
  2. Why does it matter? (business?)
  3. Investigation and remediation
    • Confirm there’s an issue
    • Easy or temporary fixes
    • How to investigate
    • Help and resources
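The checklist above could be turned into an alert template along these lines. This is only a sketch: the `Alert` class, its field names, and the example incident text are invented for illustration and do not correspond to any particular alerting tool.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """A hypothetical alert payload that answers the three questions above."""
    what_happened: str
    why_it_matters: str
    confirm_steps: list = field(default_factory=list)
    quick_fixes: list = field(default_factory=list)
    investigation_steps: list = field(default_factory=list)
    help_and_resources: list = field(default_factory=list)

    def render(self) -> str:
        """Format the alert as plain text for a pager or chat message."""
        lines = [
            f"What happened: {self.what_happened}",
            f"Why it matters: {self.why_it_matters}",
            "Investigation and remediation:",
        ]
        sections = [
            ("Confirm there is an issue", self.confirm_steps),
            ("Easy or temporary fixes", self.quick_fixes),
            ("How to investigate", self.investigation_steps),
            ("Help and resources", self.help_and_resources),
        ]
        for title, items in sections:
            if not items:
                continue  # skip empty sections to keep the message short
            lines.append(f"  {title}:")
            lines.extend(f"    - {item}" for item in items)
        return "\n".join(lines)


# Made-up example values.
alert = Alert(
    what_happened="Checkout error rate is above 5% for the last 5 minutes",
    why_it_matters="Customers cannot complete purchases, so we are losing revenue",
    confirm_steps=["Place a test order in production"],
    quick_fixes=["Roll back the latest release"],
    investigation_steps=["Filter the error metric by release and host tags"],
    help_and_resources=["Ask in the on-call channel"],
)
message = alert.render()
print(message)
```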

Resources

  1. Monitoring containers: Follow the data - Jason Yee (Datadog)
  2. Monitoring 101: Finding signal in the noise - Ilan Rabinovitch (Datadog)
  3. Monitoring 101: Collecting the right data
  4. StatsD, what it is and how it can help you