Spring Boot

Basic Setup

Spring Boot is a popular Java framework for building modern applications. With the help of the actuator module and the micrometer library, we can configure a Spring Boot application to expose performance metrics in the Prometheus format.

First, you will need the actuator module to enable the management endpoints, and the Micrometer Prometheus registry to provide the endpoint that exposes metrics in the Prometheus format.

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-actuator'
    implementation 'io.micrometer:micrometer-registry-prometheus'
}

Second, make sure the Prometheus endpoint is enabled in your application.properties or application.yml:

management.endpoint.prometheus.enabled=true
management.endpoints.web.exposure.include=prometheus

The first line enables the endpoint that provides metrics in Prometheus format, and the second line tells Spring Boot to expose this endpoint as a web API. You may already have other endpoints listed here, such as info and health. See the official Spring Boot documentation on actuator endpoints for details.

Now, if you hit the /actuator/prometheus endpoint of your web application, you will see a list of metrics like these. They usually cover JVM, inbound HTTP requests, and outbound calls, among other things.

jvm_memory_used_bytes{area="heap",id="PS Survivor Space",} 2.012508E7

http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/actuator/metrics",} 1.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/actuator/metrics",} 0.084955379

http_client_requests_seconds_count{clientName="chief.tsdb.dev.asserts.ai",method="POST",outcome="SUCCESS",status="200",uri="/select/0/prometheus/api/v1/query",} 4785.0
http_client_requests_seconds_sum{clientName="chief.tsdb.dev.asserts.ai",method="POST",outcome="SUCCESS",status="200",uri="/select/0/prometheus/api/v1/query",} 238.762194814

Optional Setup - Histogram

By default, you only get summary metrics. If you want histogram metrics so that you can calculate quantiles, enable them with additional properties:

management.metrics.distribution.percentiles-histogram.http.server.requests=true
management.metrics.distribution.percentiles-histogram.http.client.requests=true

The first property enables histograms for inbound requests, and the second one is for outbound HTTP calls. Be aware that histogram metrics can quickly increase the number of time series you publish, so enable them with caution.
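If you want finer-grained control than these two global switches, Micrometer also lets you turn histograms on per meter with a MeterFilter. Below is a minimal sketch, assuming a Spring configuration class; the class and bean names are illustrative, not part of the framework:

import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class HistogramConfig {  // illustrative name

    // Enable percentile histograms only for outbound HTTP client timers,
    // instead of turning them on globally via properties.
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> clientHistograms() {
        return registry -> registry.config().meterFilter(new MeterFilter() {
            @Override
            public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
                if (id.getName().startsWith("http.client.requests")) {
                    return DistributionStatisticConfig.builder()
                            .percentilesHistogram(true)
                            .build()
                            .merge(config);  // keep all other settings from the existing config
                }
                return config;
            }
        });
    }
}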

Optional Setup - Custom Instrumentation

Micrometer also provides annotations like @Timed and @Counted for you to monitor individual methods:

package ai.asserts.tasks;

import io.micrometer.core.annotation.Timed;
import org.springframework.stereotype.Component;

@Component
public class TimerTask {
    @Timed(description = "Time spent processing all tenants", histogram = true)
    public void run() {
        processAllTenants();
    }
    ...
}

These annotations produce method_timed_seconds metrics as Prometheus summary metrics and optionally histogram metrics if we set histogram = true.

method_timed_seconds_count{class="ai.asserts.tasks.TimerTask",exception="none",method="run",} 11.0
method_timed_seconds_sum{class="ai.asserts.tasks.TimerTask",exception="none",method="run",} 524.279248451
method_timed_seconds_bucket{class="ai.asserts.tasks.TimerTask",exception="none",method="run",le="0.001",} 0.0

For these annotations to work, the object has to be managed by Spring as a bean. If you create the above TimerTask object with the new operator, the @Timed annotation will not work.
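In addition, @Timed and @Counted on arbitrary (non-controller) methods rely on Micrometer's AOP aspects, which Spring Boot does not register on its own. A minimal sketch of wiring them up, assuming AOP support (for example spring-boot-starter-aop) is on the classpath; the configuration class name is illustrative:

import io.micrometer.core.aop.CountedAspect;
import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsAspectConfig {  // illustrative name

    // Makes @Timed work on methods of Spring-managed beans.
    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }

    // Makes @Counted work on methods of Spring-managed beans.
    @Bean
    public CountedAspect countedAspect(MeterRegistry registry) {
        return new CountedAspect(registry);
    }
}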

Cardinality Consideration

JVM metrics are usually small in number, so focus on the HTTP request metrics when estimating how many series will be published. There are a few things to consider here.

First, for summary metrics, one <URI, status code> combination produces only three latency series: count, sum, and max. Histogram metrics add many more: one combination may have 50~100 buckets, so with 20 different URIs you can easily end up with 1,000+ series per service instance. When the application reports status codes other than 200, such as 4xx or 5xx, additional series are created.

For inbound calls, the number of URIs usually matches the number of API endpoints your service provides. For outbound calls, however, depending on how you build your RestTemplate, the uri tag could contain query strings or expanded path parameters and cause a cardinality explosion. If you enable histograms for outbound calls, check the URIs and fix the RestTemplate if necessary, as sketched below.
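Spring Boot's client metrics record the uri tag from the URI template, so passing a template with placeholders (instead of concatenating values into the URL) keeps the tag low-cardinality. A minimal sketch, assuming the RestTemplate is created through RestTemplateBuilder so the metrics instrumentation is applied; the client class and host below are made up for illustration:

import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class UserClient {  // hypothetical client, for illustration only

    private final RestTemplate restTemplate;

    public UserClient(RestTemplateBuilder builder) {
        // Building through RestTemplateBuilder applies Spring Boot's metrics customizers.
        this.restTemplate = builder.rootUri("https://users.example.com").build();
    }

    public String fetchUser(long id) {
        // Recorded with uri="/users/{id}" - a single series regardless of the id value.
        return restTemplate.getForObject("/users/{id}", String.class, id);
    }
}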

RED Metric KPIs

Asserts will automatically track the following key performance indicators for your Request, Error, and Duration (RED) metrics.

Request Rate

  • Inbound: rate(http_server_requests_seconds_count[5m])

  • Outbound: rate(http_client_requests_seconds_count[5m]), rate(gateway_requests_seconds_count[5m])

  • Method: rate(method_timed_seconds_count[5m])

  • Executor: rate(executor_execution_seconds_count[5m])

  • Logger: rate(logback_events_total[5m])

Error Ratio - Inbound, Outbound and Method et al.

  • Server (5xx) errors: rate(http_server_requests_seconds_count{status=~"5.."}[5m]) / rate(http_server_requests_seconds_count[5m])

  • Client (4xx) errors: rate(http_server_requests_seconds_count{status=~"4.."}[5m]) / rate(http_server_requests_seconds_count[5m])

  • Method Errors: rate(method_timed_seconds_count{exception!="none"}[5m]) / rate(method_timed_seconds_count[5m])

  • Logger Errors: rate(logback_events_total{level="error"}[5m])

Latency - Inbound, Outbound, Method, Executor et al.

  • Average: rate(http_server_requests_seconds_sum[5m]) / rate(http_server_requests_seconds_count[5m])

  • P99: histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri, job, ...))

RED Metrics Alerts

Asserts automatically tracks short-term and long-term trends in request rate and latency for anomaly detection by URI and method name. Similarly, thresholds can be set on latency averages and P99 to record breaches. Error ratios are tracked against availability goals (default 99.9%) and breach thresholds (default 10%).

The alerts for each KPI are:

  • Request Rate: RequestRateAnomaly (Inbound, Outbound, Method, Executor, Logger)

  • Error Ratio: ErrorRatioBreach; ErrorBuildup (availability goal 99.9%)

  • Latency Average: LatencyAverageBreach, LatencyAverageAnomaly

  • Latency P99: LatencyP99ErrorBuildup

  • Error Log: ErrorLogRateBreach

JVM GC Alerts

Asserts tracks the JVM GC count and time from Micrometer metrics; the thresholds are tunable.

JVM GC

  • JvmMajorGCTimeHigh: rate(jvm_gc_pause_seconds_sum{action="end of major GC"}[2m]) * 120 > 0.10 * 120 unless on (job, service, instance, asserts_env, asserts_site) process_uptime_seconds < 300

  • JvmMajorGCCountHigh: rate(jvm_gc_pause_seconds_count{action="end of major GC"}[2m]) * 120 > 10 unless on (job, service, instance, asserts_env, asserts_site) process_uptime_seconds < 300

Dashboards

Service KPI Dashboard

Asserts aggregates data from Micrometer, cAdvisor, kubelet, and node-exporter to present a dashboard with the following KPIs:

  • Request Rate

    • Type: inbound, outbound, method, executor, logger, custom

    • Context: URI, method name et al.

  • Latency Average

  • Latency P99

  • Error Ratio and Error Rate

  • CPU %

  • CPU Cores Used

  • CPU Throttle

  • Memory %

  • Memory Bytes

  • Disk Usage

  • Network Usage

JVM Micrometer KPI Dashboard

This dashboard has the following KPIs:

  • JVM memory

  • CPU Usage, Load, Threads, Thread States, File Descriptors, Log Events

  • JVM Memory Pools (Heap, Non-Heap)

  • Garbage Collection

  • Classloading

  • Direct/Mapped Buffers
