Monitoring as Code
Manage your SLOs and Thresholds as code and publish them to the Asserts cloud.

SLO

Here is an example SLO that we use to monitor Asserts itself. The service level indicator defined here measures how long a recurring task that updates the Asserts graph takes to run, and the objective is to keep that run time under 15 seconds 99% of the time.
apiVersion: asserts/v1
kind: SLO
name: graph-freshness
indicator:
  kind: Occurrence
  measurement: asserts:latency:p99{job="model-builder", asserts_request_type="method", asserts_request_context="ai.asserts.model.builder.tasks.ModelBuildingTimerTask#run"}
  entitySearch: "show service model-builder"
objectives:
  - value: 15
    ratio: 0.99
    name: "Graph refreshed in time"
    window:
      kind: Rolling
      days: 7

Here’s another SLO example. This one checks that the Asserts API server responds successfully to 99.5% of the requests it receives:
apiVersion: asserts/v1
kind: SLO
name: api-server-availability
indicator:
  kind: Request
  badEventCount: asserts:error:total{job="api-server", asserts_error_type="server_errors"}
  totalEventCount: asserts:request:total{job="api-server"}
  entitySearch: "show service api-server"
objectives:
  - ratio: 0.995
    name: "Weekly Availability"
    window:
      kind: Rolling
      days: 7

These examples demonstrate the two kinds of SLOs that Asserts supports:
  • Occurrence SLOs are time-based and evaluated each minute. Depending on how the application performs during a given minute, that minute is deemed either good or bad, and bad minutes are counted against the SLO’s error budget. Typical use cases for occurrence SLOs are latency and throughput goals.
  • Request SLOs are event-based. Each event is either good or bad, and bad events count against the SLO’s error budget. Web application availability is a common use case for a request SLO: each request received counts as an event, and requests that fail due to server errors count as bad events.
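To make the error budgets concrete: a 7-day rolling window contains 10,080 minutes, so the occurrence SLO above (ratio 0.99) budgets for roughly 100 bad minutes per week, while the request SLO (ratio 0.995) tolerates 5 bad requests out of every 1,000 received.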

Threshold

You can control how assertions are generated by tuning thresholds. Thresholds are expressed as Prometheus recording rules, where expr holds the threshold value and the labels select the scope it applies to. This rule sets the latency threshold for a specific customer's login requests:
- record: asserts:latency:p99:request_context_threshold
  expr: 1
  labels:
    namespace: webapps
    job: auth
    asserts_request_type: inbound
    asserts_request_context: /login
    asserts_customer: acme

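The same recording-rule pattern can be scoped more broadly or narrowly through the labels. As a sketch (the endpoint and threshold value below are hypothetical, not taken from the Asserts docs), omitting the asserts_customer label would presumably apply the threshold to that request context for all customers:

- record: asserts:latency:p99:request_context_threshold
  expr: 2                              # hypothetical threshold value
  labels:
    namespace: webapps
    job: auth
    asserts_request_type: inbound
    asserts_request_context: /checkout # hypothetical endpoint
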
This rule raises a warning-level assertion when a Redis node uses more than 70% of its CPU:
- record: asserts:resource:warning
  expr: 0.7
  labels:
    asserts_resource_type: cpu:usage
    asserts_component: redis
    asserts_severity: warning

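A stricter rule can be layered on top of this one. The sketch below assumes a parallel critical-severity record exists alongside asserts:resource:warning; the record name and the 90% value are illustrative, not taken from the Asserts docs:

- record: asserts:resource:critical  # assumed counterpart to asserts:resource:warning
  expr: 0.9                          # illustrative 90% CPU threshold
  labels:
    asserts_resource_type: cpu:usage
    asserts_component: redis
    asserts_severity: critical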