echo metrics reference¶
Glossary for every field exposed on SampleInfo and returned from
Server.sample().
Two build modes¶
- Default build: health-only. Counters and gauges always on. No hdrhistogram dep, no per-sample timestamps.
--features detailed-metrics: adds CAS counters and three timing histograms (memcpy,drain_round,queue_dwell). Use for perf work.
When the feature is off, the detailed fields are still present on
SampleInfo but report 0 / empty HistSnapshot. Python code checking
if info.memcpy.count > 0: works uniformly across both builds.
Metric categories¶
- Counters (
_total/_count): monotonic, always go up. Divide by run duration to get a rate. - Gauges (e.g.
store_size,queue_depth_sum): snapshot values, can go up or down. - Histograms (
memcpy,drain_round,queue_dwell): reportcount,min_ns,max_ns,mean_ns,p50_ns,p90_ns,p99_nsof a duration distribution.
Ingestion path: what the drainer did¶
| Metric | Feature? | What it measures |
|---|---|---|
samples_inserted_total |
always | Samples the drainer has moved from SPSC to store. Drainer path only; Server.submit() samples are not counted. |
cas_success_total |
detailed | Successful CAS on the store's write_cursor; one per reservation (a reservation can cover multiple samples). |
cas_failure_total |
detailed | Failed CAS: two drainers raced for the same slot. The loser retries. |
reservation_retries_total |
detailed | Retries inside try_reserve_slots. Usually equal to cas_failure_total. |
cas_failure / cas_success ratio |
detailed | CAS contention rate. >10 % is bad; 0.2 % is fine. |
Ingress (SPSC) side: what the transport is doing¶
| Metric | What it measures |
|---|---|
active_connections |
Currently-open transport connections. |
push_blocked_count |
Every time SampleSender::push hit a full SPSC queue. One park = one increment. |
queue_depth_sum |
Sum of all this-drainer's SPSC queue depths, sampled at the start of each drain round. |
queue_depth_max |
Max single-queue depth at start of round, taken across all drainers. Equal to producer_queue_size = at least one queue was full. |
Consumer side¶
| Metric | What it measures |
|---|---|
store_size |
Samples committed to the ring but not yet sampled by the consumer. In FIFO, hovers between 0 and batch_size under balanced rates. |
notify_count |
Times the last-committer drainer called notify_one on the consumer — one per full committed batch. Should roughly equal samples_inserted_total / batch_size. |
Timing histograms (detailed build only)¶
All three are populated only when built with --features detailed-metrics.
In the default build they are zero-valued HistSnapshots.
drain_round¶
Wall-clock for one complete execution of drain_round, recorded only
when the round did work. p50 = how long a typical round takes;
p99/tail = stalls (e.g. insert_batch blocked on a full ring).
memcpy¶
Time to memcpy all samples for one reservation (not per-sample).
memcpy.count ≈ cas_success_total because each successful reservation
produces one memcpy record.
p99/p50 ratio on this metric is the scheduler-preemption detector: memcpy is near-constant-time for a given byte count, so a large p99/p50 means the OS parked the thread partway through.
queue_dwell¶
For each sample, the time from push into SPSC to pop out by the
drainer. Per-sample.
- High p50: drainer cycle is slow; samples wait long for a visit.
- High p99: drainer occasionally stalled; samples pushed during the stall sat through it.
- Unlike
memcpy, this is driven by scheduling, not by per-sample work.
Reading them together¶
Always-on signals first:
| Signal | Interpretation |
|---|---|
push_blocked_count rising quickly |
Producers hitting full SPSC queues; drainer can't keep up or SPSC is too small |
queue_depth_max = producer_queue_size persistently |
At least one connection's queue is full; pair with push_blocked_count |
store_size ≈ 0 |
Drainer and consumer in lock-step, or consumer outpacing drainer |
store_size ≈ num_buffers × batch_size |
Drainer outrunning consumer; downstream (usually GPU) is the ceiling |
active_connections dropping unexpectedly |
Actor disconnects; look at the actor side |
Detailed-build signals:
| Signal | Interpretation |
|---|---|
cas_failure / cas_success > 5% |
CAS contention: too many drainers fighting for write_cursor |
memcpy.p99_ns / memcpy.p50_ns > 10× |
Scheduler preemption during memcpy |
queue_dwell.p50_ns high and drain_round.p50_ns low |
Drainer is idle most of the time, wakes rarely |
queue_dwell.p99_ns and drain_round.p99_ns both high |
Drainer occasionally stalls (ring-full backpressure) |