Monitoring as Code

Manage your SLOs and thresholds as code and publish them to Asserts cloud.

SLO

Here is an example SLO that we use to monitor Asserts itself. The service level indicator defined here measures how long a recurring task that updates the Asserts graph takes to run, and the objective is to keep that run time under 15 seconds 99% of the time.

apiVersion: asserts/v1
kind: SLO
name: graph-freshness
indicator:
  kind: Occurrence
  measurement: asserts:latency:p99{job="model-builder", asserts_request_type="method", asserts_request_context="ai.asserts.model.builder.tasks.ModelBuildingTimerTask#run"}
entitySearch: "show service model-builder"
objectives:
  - value: 15
    ratio: 0.99
    name: "Graph refreshed in time"
    window:
      kind: Rolling
      days: 7

Here’s another SLO example. This one checks that the Asserts API server responds successfully to 99.5% of the requests it receives:

apiVersion: asserts/v1
kind: SLO
name: api-server-availability
indicator:
  kind: Request
  badEventCount: asserts:error:total{job="api-server", asserts_error_type="server_errors"}
  totalEventCount: asserts:request:total{job="api-server"}
entitySearch: "show service api-server"
objectives:
  - ratio: 0.995
    name: "Weekly Availability"
    window:
      kind: Rolling
      days: 7

These examples demonstrate the two kinds of SLOs that Asserts supports:

  • Occurrence SLOs are based on time and evaluated each minute. Based on the application’s performance during a minute, that minute is deemed either good or bad. Bad minutes are counted against the SLO’s error budget. Typical use cases for occurrence SLOs are latency and throughput goals.

  • Request SLOs are based on events that are either good or bad. Bad events count against the SLO’s error budget. Web application availability is a common use case for a request SLO, where each request received counts as an event, and requests that fail due to server errors count as bad events.
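
To make the error budget concrete, here is a minimal sketch of the arithmetic behind the two example SLOs above. It uses the conventional definition of an error budget, (1 - ratio) multiplied by the total number of minutes or events in the window. Whether Asserts computes budgets in exactly this way is an assumption, and the function names and the request volume of 1,000,000 are hypothetical values chosen for illustration.

```python
# Illustrative arithmetic only. These helpers are hypothetical and are not
# part of any Asserts API.

MINUTES_PER_DAY = 24 * 60


def occurrence_budget_minutes(ratio: float, window_days: int) -> float:
    """Bad minutes an occurrence SLO tolerates over its window."""
    total_minutes = window_days * MINUTES_PER_DAY
    return (1 - ratio) * total_minutes


def request_budget_events(ratio: float, total_events: int) -> float:
    """Bad events a request SLO tolerates for a given request volume."""
    return (1 - ratio) * total_events


def budget_remaining(budget: float, bad: float) -> float:
    """Fraction of the error budget still unspent (negative once exhausted)."""
    return 1 - bad / budget


if __name__ == "__main__":
    # graph-freshness: ratio 0.99 over a 7-day rolling window
    print(round(occurrence_budget_minutes(0.99, 7), 1))  # 100.8 bad minutes

    # api-server-availability: ratio 0.995; the weekly volume is assumed
    print(round(request_budget_events(0.995, 1_000_000)))  # 5000 bad requests
```

Under these assumptions, the graph-freshness SLO tolerates roughly 100 bad minutes per week, while the availability SLO tolerates 5,000 failed requests per million received.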

Threshold

You can control how assertions are generated by tuning thresholds. This rule sets the latency threshold for login requests for a specific customer:

- record: asserts:latency:p99:request_context_threshold
  expr: 1
  labels:
   namespace: webapps
   job: auth
   asserts_request_type: inbound
   asserts_request_context: /login
   asserts_customer: acme

This rule raises a warning-level assertion when a Redis node uses more than 70% of its CPU:

- record: asserts:resource:warning
  expr: 0.7
  labels:
   asserts_resource_type: cpu:usage
   asserts_component: redis
   asserts_severity: warning
