Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
Suppose you have a question about the system. You look at the data you collected earlier, and you should be able to find the answer by exploring that data. That is observability.
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
- Leslie Lamport
A typical distributed system is complex, and it is impossible to anticipate every failure. We therefore collect data that gives insight into the system's behavior. When a failure occurs, we ask, "What went wrong?", and the collected data helps us uncover the answer.
Observability gives us greater control over failures in complex systems.
Observability is usually described in terms of three distinct data types known as the "three pillars": metrics, logs, and traces.
These data types are used by
In addition, modern SRE practice has added more tools to the observability suite:
- SLI, SLO, SLA
- Error budgets
- Status pages
Let’s look at them one by one.
A metric is a numeric representation of data measured over intervals of time. Examples include the current system load and the number of requests served since the application started.
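To make the two examples concrete, here is a minimal sketch of a counter (requests since startup) and a gauge (system load). The `Counter` and `Gauge` classes, the `sample` helper, and the metric names are hypothetical, written for illustration only; a real system would more likely use an existing metrics library.

```python
import time


class Counter:
    """A value that only increases, e.g. requests served since startup."""

    def __init__(self, name):
        self.name = name
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A point-in-time value that can go up or down, e.g. system load."""

    def __init__(self, name):
        self.name = name
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


def sample(metric, now=None):
    """Return one (timestamp, name, value) data point for a metric."""
    ts = time.time() if now is None else now
    return (ts, metric.name, metric.value)


# Hypothetical metric names, chosen for this example.
requests_total = Counter("requests_total")
system_load = Gauge("system_load")

for _ in range(3):  # pretend three requests arrived
    requests_total.inc()
system_load.set(0.75)

# A fixed timestamp keeps the example deterministic.
points = [sample(requests_total, now=100.0), sample(system_load, now=100.0)]
```

Sampling these values at regular intervals produces the time series that dashboards and alerts are built on.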
Logs are records of discrete events that occur in a system, and logging is the act of keeping them.
In a typical implementation, the application appends a line of text to a log file. The record can be an unstructured plain-text line or a JSON-serialized payload. Example events include a load balancer receiving a request, a leader election producing a new leader, or a job completing.
Traces are a specialized use of logging that captures the series of distributed events caused by a single request as it moves through a system. A unique ID shared by those events ties them together into one trace for that request.
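The role of the unique ID can be illustrated with a small in-process simulation. The service names, the `record` and `handle_request` helpers, and the in-memory `events` list are hypothetical stand-ins; in a real distributed system the ID would travel between services, for example in a request header, and the events would go to a tracing backend.

```python
import uuid
from collections import defaultdict

events = []  # stands in for a shared trace store


def record(trace_id, service, operation):
    """Record one event, tagged with the ID of the request that caused it."""
    events.append(
        {"trace_id": trace_id, "service": service, "operation": operation}
    )


def handle_request(trace_id=None):
    """Simulate one request passing through three services."""
    # The ID is generated once at the edge and passed along unchanged.
    trace_id = trace_id or uuid.uuid4().hex
    record(trace_id, "load_balancer", "route")
    record(trace_id, "api", "handle")
    record(trace_id, "database", "query")
    return trace_id


first = handle_request()
second = handle_request()

# Grouping by trace ID reconstructs the path each request took.
by_trace = defaultdict(list)
for e in events:
    by_trace[e["trace_id"]].append(e["service"])
```

Looking up `by_trace[first]` returns the services the first request touched, in order, even though events from both requests share the same store.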
Note that collecting metrics, logs, and traces does not automatically give you an observable system. If you collect garbage data, then you get garbage…