Observability Explained

Nov 3, 2022

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

- Wikipedia

Suppose you have a question about the system. You look at the data you collected earlier, and by exploring that data you should be able to find the answer. That is observability.
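As a minimal sketch of that idea, the Python snippet below answers two questions from data collected earlier. The records, field names, and helper functions are hypothetical and only for illustration; a real system would query a metrics or logging backend instead of an in-memory list.

```python
from collections import Counter

# Hypothetical request records collected earlier; field names are illustrative.
collected = [
    {"endpoint": "/checkout", "status": 500, "latency_ms": 1200},
    {"endpoint": "/checkout", "status": 200, "latency_ms": 180},
    {"endpoint": "/search", "status": 200, "latency_ms": 95},
    {"endpoint": "/checkout", "status": 503, "latency_ms": 2100},
    {"endpoint": "/search", "status": 500, "latency_ms": 400},
]


def errors_by_endpoint(records):
    """Question: which endpoints are failing? Count 5xx responses per endpoint."""
    return Counter(r["endpoint"] for r in records if r["status"] >= 500)


def slowest_endpoint(records):
    """Question: where is the time going? Return the endpoint with the highest mean latency."""
    totals = {}
    for r in records:
        total, count = totals.get(r["endpoint"], (0, 0))
        totals[r["endpoint"]] = (total + r["latency_ms"], count + 1)
    return max(totals, key=lambda e: totals[e][0] / totals[e][1])


print(errors_by_endpoint(collected).most_common(1))
print(slowest_endpoint(collected))
```

Neither question had to be known when the data was collected. The records were simply detailed enough to answer both afterwards.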

Failure of a computer you didn’t even know existed can render your own computer unusable.
- Leslie Lamport

A typical distributed system is complex, and it is impossible to anticipate every failure in advance. Instead, we collect data that gives us insight into how the system behaves. When a failure occurs, we ask "What went wrong?" and the collected data helps us uncover the answer.

Observability gives us greater control over failures in complex systems.

Observability is usually described in terms of three distinct data types, known as the "three pillars":

  • Metrics
  • Logs
  • Traces
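To make the three data types concrete, here is a rough sketch in which one request handler emits all three. It uses only the Python standard library; the names `request_count`, `spans`, and `handle_request` are made up for this example, and a production system would normally rely on dedicated metrics and tracing libraries instead.

```python
import logging
import time
import uuid
from collections import Counter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shop")

request_count = Counter()  # metric: a number aggregated over many requests
spans = []                 # trace: timed steps tied to a single request


def handle_request(endpoint):
    """Handle one (simulated) request and emit a metric, a log line, and a span."""
    trace_id = uuid.uuid4().hex          # ties the log line and span to this request
    start = time.perf_counter()

    request_count[endpoint] += 1                              # metric
    log.info("handled %s trace_id=%s", endpoint, trace_id)    # log

    duration = time.perf_counter() - start
    spans.append({"trace_id": trace_id, "name": endpoint, "duration_s": duration})  # trace
    return trace_id


handle_request("/checkout")
```

The metric tells you how often something happens, the log line records what happened in one instance, and the span records how long that instance took and which request it belonged to.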

These data types feed:

  • Alerts
  • Dashboards

In addition, modern SRE practice has added more tools to the observability suite, such as:

  • Error budgets
  • Status pages

Let’s look at them one by one.