Observability Explained

5 min read · Nov 3, 2022

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
- Wikipedia

Suppose you have a question about the system. You look at the data you collected earlier, and you should be able to answer the question by exploring that data. That is observability.

Failure of a computer you didn’t even know existed can render your own computer unusable.
- Leslie Lamport

A typical distributed system is complex, and it is impossible to anticipate every failure. Instead, we collect data that gives insight into how the system behaves. When a failure occurs, we ask, "What went wrong?" and the collected data helps us uncover the answer.

Observability gives us greater control over failures in complex systems.

Observability is usually described in terms of three distinct data types, known as the "three pillars":

Metrics
Logs
Traces
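To make the three data types concrete, here is a minimal sketch that records all three for a single request using only the Python standard library. The names `handle_request`, `request_count`, `spans`, and the `/cart` path are hypothetical and chosen for illustration; a real system would normally hand this data to a dedicated metrics, logging, or tracing backend rather than keep it in memory.

```python
import json
import logging
import time
import uuid
from collections import Counter
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("demo-service")  # hypothetical service name

request_count = Counter()  # metric: a number aggregated over many requests
spans = []                 # trace: timed steps that share one trace id


@contextmanager
def span(name, trace_id):
    """Record how long a named step took as part of a trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(path):
    """Handle one request and emit a metric, a log line, and a span."""
    trace_id = uuid.uuid4().hex
    with span("handle_request", trace_id):
        request_count[path] += 1  # metric
        # log: one discrete, structured event
        log.info(json.dumps({
            "event": "request_handled",
            "path": path,
            "trace_id": trace_id,
        }))
    return trace_id


trace_id = handle_request("/cart")
```

The metric answers "how many?", the log answers "what happened?", and the span answers "where did the time go?". The shared `trace_id` is what lets you connect them when asking a question later.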

These data types are consumed by:

Alerts
Dashboards

In addition, modern SRE practice has added more tools to the observability suite, such as:

SLI, SLO, SLA
Error budgets
Status pages
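As a quick preview of how an SLO and an error budget relate, here is a small arithmetic sketch. It assumes the common convention that the error budget is the share of requests an SLO allows to fail, that is, 1 minus the SLO target. The function name `error_budget` and the numbers are hypothetical. With a 99.9% target and 1,000,000 requests in the window, about 1,000 requests may fail before the budget is spent.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize the error budget for one SLO window.

    slo_target is a fraction such as 0.999 (99.9% of requests succeed).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    # Assumed convention: the budget is the fraction of requests allowed to fail.
    allowed = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed,
        "failed": failed_requests,
        "remaining": allowed - failed_requests,
        "exhausted": failed_requests > allowed,
    }


# Illustrative numbers only, not measurements from a real system.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(budget)
```

The results are floating-point values, so treat them as approximate when comparing against a threshold.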

Let’s look at them one by one.

Written by Ju