How Asserts Processes Data
As an Asserts customer, all you need to do is connect your observability data stores, such as Prometheus and CloudWatch, to Asserts. Once Asserts has access to the metrics and metadata, a few things happen.
First, we inspect the metric labels to discover entities and populate their properties. We also deduce the relationships between entities, either by matching their properties or by matching against specific metrics that directly establish relations. As a result, we can determine which pod is hosted on which node, which pods form a Service, and how services call each other.
All these entities, properties, and relationships form a knowledge graph that describes our understanding of the system. They are also indexed so that they are easily searchable. The discovery process continuously updates the graph while preserving its history.
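As a rough illustration of label-based discovery, the sketch below derives entities and relationships from a handful of label sets. It is not Asserts's implementation: the label names (pod, node, service), the relation names (HOSTS, HAS_POD), and the sample data are all assumptions made for this example.

```python
# Hypothetical label sets, as they might appear on scraped time series.
SERIES_LABELS = [
    {"pod": "cart-7d9f", "node": "node-a", "service": "cart"},
    {"pod": "cart-x2k1", "node": "node-b", "service": "cart"},
    {"pod": "pay-55c8", "node": "node-a", "service": "payment"},
]

# Hypothetical mapping from label name to entity type.
ENTITY_LABELS = {"pod": "Pod", "node": "Node", "service": "Service"}


def discover(series_labels):
    """Build a set of entities and a set of relationships from label sets."""
    entities = set()
    edges = set()
    for labels in series_labels:
        # Every recognized label yields one (type, name) entity.
        found = {
            etype: labels[name]
            for name, etype in ENTITY_LABELS.items()
            if name in labels
        }
        for etype, name in found.items():
            entities.add((etype, name))
        # Labels that co-occur on one series imply a relationship.
        if "Pod" in found and "Node" in found:
            edges.add((("Node", found["Node"]), "HOSTS", ("Pod", found["Pod"])))
        if "Pod" in found and "Service" in found:
            edges.add((("Service", found["Service"]), "HAS_POD", ("Pod", found["Pod"])))
    return entities, edges


entities, edges = discover(SERIES_LABELS)
```

Because entities and edges are sets, running discovery repeatedly over new scrapes adds only what is new, which loosely mirrors a graph that is updated continuously.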
Second, Asserts has curated a collection of rules that normalize the incoming heterogeneous time series into a set of essential metrics, such as RED metrics (Request, Error, Duration) for application components and utilization metrics for infrastructure components.
For example, the RED metrics from Spring Boot are recorded as Prometheus counters:
- record: asserts:request:total
- record: asserts:latency:total
- record: asserts:error:total
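The idea of normalization can be sketched in a few lines of Python. This is only an illustration: Asserts's actual rules are Prometheus recording rules that are not shown here, and the source metric names (taken from Spring Boot's Micrometer conventions), the status label, and the 5xx error criterion are assumptions.

```python
# Hypothetical rule table: source metric name -> normalized record name.
NORMALIZATION_RULES = {
    "http_server_requests_seconds_count": "asserts:request:total",
    "http_server_requests_seconds_sum": "asserts:latency:total",
}


def normalize(sample):
    """Map one raw sample (name, labels, value) to zero or more normalized samples."""
    name, labels, value = sample
    normalized = []
    target = NORMALIZATION_RULES.get(name)
    if target is not None:
        normalized.append((target, labels, value))
    # Assumption: a request is counted as an error when its status label is 5xx.
    is_request_count = name == "http_server_requests_seconds_count"
    if is_request_count and labels.get("status", "").startswith("5"):
        normalized.append(("asserts:error:total", labels, value))
    return normalized


failed = normalize(("http_server_requests_seconds_count", {"status": "500"}, 3.0))
ignored = normalize(("jvm_memory_used_bytes", {}, 1.0))
```

Metrics that match no rule are dropped from the normalized set, so downstream processing only ever sees the small vocabulary of essential metrics.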
We add labels such as asserts_error_type to indicate the level of granularity for further processing during instrumentation. More dynamic context information, such as HTTP paths, is recorded in asserts_request_context by a Prometheus relabelling rule at ingestion time.
We understand that a customer may have different environments for dev, stage, and prod, and that each of them might have one or more sites. For data separation, the customer can use either external labels or relabelling rules to add asserts_site labels that scope the metrics, and thus the entities discovered from them. Asserts provides corresponding env and site
We then apply our extensive domain knowledge to instrument these normalized metrics. Out of the box, we automatically instrument application frameworks such as Spring Boot, Flask, and LoopBack; infrastructure components such as Kubernetes resources; third-party services such as Redis servers and Kafka clusters; and many more.
With instrumentation in place, we use the SAAFE model to capture events as what we call “assertions”:
- Saturation indicates whether a resource (CPU, memory, etc.) is saturated
- Amend captures changes in the system, like deployment, scaling, config map change, etc.
- Anomaly captures abnormal shifts in request rate, latency, or resource consumption
- Failure records failure state in the system, like primary-standby sync failure, pod crash looping, etc.
- Error records problematic requests, e.g., 5xx or 4xx responses, or breaches of latency thresholds.
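To make the categories concrete, here is a minimal sketch of how an assertion could be represented and how a Saturation assertion might be derived from utilization samples. The Assertion fields, the cpu_saturated name, and the 0.9 threshold are hypothetical and are not taken from Asserts.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The five SAAFE categories, in the order they are listed above."""
    SATURATION = "saturation"
    AMEND = "amend"
    ANOMALY = "anomaly"
    FAILURE = "failure"
    ERROR = "error"


@dataclass(frozen=True)
class Assertion:
    category: Category
    entity: str      # hypothetical entity identifier, e.g. "pod/cart-7d9f"
    name: str        # hypothetical assertion name
    timestamp: int   # seconds


def saturation_assertions(entity, samples, threshold=0.9):
    """Emit one assertion per (timestamp, utilization) sample above the threshold."""
    return [
        Assertion(Category.SATURATION, entity, "cpu_saturated", ts)
        for ts, utilization in samples
        if utilization > threshold
    ]


# Hypothetical CPU utilization samples: only two of the four exceed the threshold.
cpu = [(0, 0.42), (60, 0.95), (120, 0.97), (180, 0.50)]
events = saturation_assertions("pod/cart-7d9f", cpu)
```

Only the samples that cross the threshold produce output, which is the sense in which assertions are a condensed form of the underlying time series.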
Assertions thus become condensed time-series data that capture only non-trivial events in the system. They embody Asserts’s deep knowledge of how to observe the various building blocks of a modern application.
Assertions are different from traditional alerts, as they are not meant to notify on-call personnel. They are more like vital signs that Asserts surfaces automatically, ready to be used in troubleshooting. Of course, customers can still subscribe to selected assertions and use them as traditional alerts.
Asserts’s story doesn’t stop with automatic instrumentation. Once assertions arise, we do a few more things:
- We attach them back to the graph and index them for search. This way, a single graph search phrase can become a powerful way to navigate both entities and their health status.
- We enrich the assertions with context information from the graph. For example, an assertion that happened on a pod can be tagged back to the node and service the pod belongs to. This way, assertions that happened on ephemeral entities (Pods) can bubble up to long-lived entities (Nodes, Services), thus forming an aggregated view with a continuous timeline.
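The enrichment step can be sketched as follows. The pod-to-owner lookup table, the assertion tuples, and the function names are assumptions made for illustration; they are not Asserts's data model.

```python
# Hypothetical graph context: each pod mapped to its owning node and service.
POD_CONTEXT = {
    "cart-7d9f": {"node": "node-a", "service": "cart"},
    "cart-x2k1": {"node": "node-b", "service": "cart"},
}

# Hypothetical assertions raised on short-lived pods: (timestamp, pod, name).
POD_ASSERTIONS = [
    (400, "cart-x2k1", "crash_loop"),
    (100, "cart-7d9f", "cpu_saturated"),
]


def enrich(assertions, context):
    """Tag each pod-level assertion with the node and service the pod belongs to."""
    return [
        {"timestamp": ts, "pod": pod, "name": name, **context.get(pod, {})}
        for ts, pod, name in assertions
    ]


def timeline_for(enriched, key, value):
    """Collect assertions for one long-lived entity, ordered by time."""
    return sorted(
        (a["timestamp"], a["name"]) for a in enriched if a.get(key) == value
    )


enriched = enrich(POD_ASSERTIONS, POD_CONTEXT)
service_timeline = timeline_for(enriched, "service", "cart")
node_timeline = timeline_for(enriched, "node", "node-a")
```

Even though the two pods are different short-lived entities, their assertions land on a single timeline for the cart service, which is the bubbling-up effect described above.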
Because assertions are condensed and contextualized, they are much faster to query and aggregate and much easier to correlate or rank, which enables quick and precise root cause analysis.