Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
- Wikipedia
Put simply: when you have a question about the system, you should be able to answer it by exploring the data you have already collected. That is observability.
Failure of a computer you didn’t even know existed can render your own computer unusable.
- Leslie Lamport
A typical distributed system is complex, and it is impossible to anticipate every failure. We therefore collect data that gives insight into the system's behavior. When a failure occurs, we ask "What went wrong?", and the collected data helps us uncover the answer.
Observability gives us greater control over failures in complex systems.
Observability is usually described in terms of three distinct data types, known as the "three pillars":
- Metrics
- Logs
- Traces
These data types are consumed by:
- Alerts
- Dashboards
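To make the three data types concrete, here is a minimal sketch using only the Python standard library. Everything in it is an illustrative assumption rather than a real observability API: the `handle_request` handler, the metric names `requests_total` and `errors_total`, the in-memory `logs` and `spans` lists, and the 5% alert threshold are all hypothetical. A real system would send this data to dedicated backends instead of keeping it in process memory.

```python
import json
import time
import uuid
from collections import Counter

# In-memory stand-ins for real backends (assumption for illustration only).
metrics = Counter()  # metrics: numeric aggregates that are cheap to store
logs = []            # logs: timestamped records of discrete events
spans = []           # traces: timed spans that share a trace_id per request


def handle_request(path, fail=False):
    """Hypothetical request handler that emits all three data types."""
    trace_id = uuid.uuid4().hex
    start = time.monotonic()
    status = 500 if fail else 200

    # Metric: count requests and errors.
    metrics["requests_total"] += 1
    if fail:
        metrics["errors_total"] += 1

    # Log: one structured record describing this event.
    logs.append(json.dumps({
        "ts": time.time(),
        "trace_id": trace_id,
        "path": path,
        "status": status,
    }))

    # Trace: a single span recording how long the handler took.
    spans.append({
        "trace_id": trace_id,
        "name": "handle_request",
        "duration_s": time.monotonic() - start,
    })
    return status


def error_rate():
    """Fraction of requests that failed, derived from the metrics."""
    total = metrics["requests_total"]
    return metrics["errors_total"] / total if total else 0.0


def should_alert(threshold=0.05):
    """Toy alert rule: fire when the error rate exceeds the threshold."""
    return error_rate() > threshold
```

In this sketch the alert rule reads only the metrics, while the shared `trace_id` lets you move from a failing request's log record to its span when you ask "What went wrong?".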
In addition, modern SRE practice has added more tools to the observability suite, such as:
- SLI, SLO, SLA
- Error budgets
- Status pages
Let’s look at them one by one.