Monitoring as Code

Manage your SLOs and thresholds as code, and publish them to Asserts cloud.

SLO

Here is an example SLO that we use to monitor Asserts itself. The service level indicator defined here measures how long a recurring task that updates the Asserts graph takes to run, and the objective is to keep that run time under 15 seconds 99% of the time.
apiVersion: asserts/v1
kind: SLO
name: graph-freshness
indicator:
  kind: Occurrence
  measurement: asserts:latency:p99{job="model-builder", asserts_request_type="method", asserts_request_context="ai.asserts.model.builder.tasks.ModelBuildingTimerTask#run"}
entitySearch: "show service model-builder"
objectives:
  - value: 15
    ratio: 0.99
    name: "Graph refreshed in time"
    window:
      kind: Rolling
      days: 7
Here’s another SLO example. This one checks that the Asserts API server responds successfully to 99.5% of the requests it receives:
apiVersion: asserts/v1
kind: SLO
name: api-server-availability
indicator:
  kind: Request
  badEventCount: asserts:error:total{job="api-server", asserts_error_type="server_errors"}
  totalEventCount: asserts:request:total{job="api-server"}
entitySearch: "show service api-server"
objectives:
  - ratio: 0.995
    name: "Weekly Availability"
    window:
      kind: Rolling
      days: 7
These examples demonstrate the two kinds of SLOs that Asserts supports:
  • Occurrence SLOs are based on time and evaluated each minute. Based on the application’s performance during a minute, that minute is deemed either good or bad. Bad minutes are counted against the SLO’s error budget. Typical use cases for occurrence SLOs are latency and throughput goals.
  • Request SLOs are based on events that are either good or bad. Bad events count against the SLO’s error budget. Web application availability is a common use case for a request SLO, where each request received counts as an event, and requests that fail due to server errors count as bad events.
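To make the error budget idea concrete, here is a minimal Python sketch of the arithmetic behind the two example SLOs above. It is not how Asserts evaluates SLOs internally. It assumes the common definition of an error budget as (1 - ratio) times the total number of minutes or events in the window, and it assumes a minute counts as bad when its measurement is not under the objective value. The names Budget, occurrence_budget, and request_budget, and all sample numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    total: float        # minutes (occurrence) or events (request) in the window
    allowed_bad: float  # error budget: how many of those may be bad
    bad: float          # how many were actually bad

    @property
    def remaining(self) -> float:
        """Error budget left; negative means the objective was missed."""
        return self.allowed_bad - self.bad


def occurrence_budget(measurement_by_minute, value, ratio, window_days):
    """Occurrence SLO: every minute in the window is either good or bad."""
    window_minutes = window_days * 24 * 60
    # Assumption: a minute is bad when its measurement is not under `value`.
    bad_minutes = sum(1 for m in measurement_by_minute if m >= value)
    return Budget(window_minutes, window_minutes * (1 - ratio), bad_minutes)


def request_budget(bad_events, total_events, ratio):
    """Request SLO: every event (for example, a request) is either good or bad."""
    return Budget(total_events, total_events * (1 - ratio), bad_events)


# graph-freshness: p99 run time under 15 for 99% of the minutes in 7 days.
# The five samples are made up; minutes without a sample are treated as good.
graph = occurrence_budget([3.2, 4.1, 17.5, 2.9, 21.0],
                          value=15, ratio=0.99, window_days=7)

# api-server-availability: 99.5% of requests succeed (made-up request counts).
api = request_budget(bad_events=1200, total_events=1_000_000, ratio=0.995)

print(f"graph-freshness: {graph.bad} bad of {graph.allowed_bad:.1f} allowed minutes")
print(f"api-server-availability: {api.bad} bad of {api.allowed_bad:.0f} allowed requests")
```

Under these assumptions, a 7-day rolling window holds 10,080 minutes, so a 0.99 ratio leaves a budget of roughly 100 bad minutes, while a 0.995 ratio on a request SLO leaves a budget of 0.5% of whatever traffic the window contains.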

Threshold

You can control how assertions are generated by tuning thresholds. This rule sets the latency threshold for login requests for a specific customer:
- record: asserts:latency:p99:request_context_threshold
  expr: 1
  labels:
    namespace: webapps
    job: auth
    asserts_request_type: inbound
    asserts_request_context: /login
    asserts_customer: acme
This rule raises a warning-level assertion when a Redis node uses more than 70% of its CPU:
- record: asserts:resource:warning
  expr: 0.7
  labels:
    asserts_resource_type: cpu:usage
    asserts_component: redis
    asserts_severity: warning