How to capture a contention diagnostic

A contention diagnostic turns on the Go runtime's block and mutex profiling at configurable rates, captures the resulting profiles after a short window, and turns profiling off again. Block and mutex profiling cost too much to run continuously, so enabling them for a deliberate window while a contention symptom is actually occurring is the right fit. For the rationale, see about the watchdog. For every option, see the watchdog API reference.

Trigger a one-shot diagnostic from the CLI

With the watchdog enabled, run:

piko watchdog contention-diagnostic

The call blocks for the configured window (default 60 s) plus capture overhead, then prints Contention diagnostic completed. The captured profiles land in the watchdog profile directory and show up in piko watchdog list:

piko watchdog list --type block
piko watchdog list --type mutex
piko watchdog download --latest --type block --output ./pprof
piko watchdog download --latest --type mutex --output ./pprof
go tool pprof -http=:8081 ./pprof/<block-file>
go tool pprof -http=:8081 ./pprof/<mutex-file>

Use this when an operator notices contention symptoms (rising scheduler latency, long-tail request latency, goroutines piling up) and wants to find the source.

Configure the diagnostic window

Shorter windows reduce the cost. Longer windows collect more samples. The valid range is 1 second to 5 minutes:

piko.WithMonitoringWatchdog(
    piko.WithWatchdogContentionDiagnosticWindow(30 * time.Second),
)

For most workloads, 30 to 60 seconds is enough to capture a representative contention pattern.

Tune the block and mutex rates

The runtime's profile rates are global. The diagnostic sets them only for the window's duration, then restores them. Defaults are aggressive enough to catch contention without overwhelming the runtime:

piko.WithMonitoringWatchdog(
    piko.WithWatchdogContentionDiagnosticBlockProfileRate(1_000_000),    // 1 sample per 1 ms blocking
    piko.WithWatchdogContentionDiagnosticMutexProfileFraction(100),      // 1 in 100 mutex events
)

Lower the block-profile rate to sample more aggressively (for example 100_000 for 1 sample per 100 µs of blocking). Lower the mutex fraction to sample more events (for example 10 for 1 in 10 mutex events). Both changes add overhead during the window. On a contended workload the captured profiles are larger but the signal is cleaner.

Auto-fire on repeated scheduler-latency events

The diagnostic can fire automatically when scheduler-latency events repeat within a short window. This is useful in production, where nobody may be watching the latency dashboard when the problem occurs:

piko.WithMonitoringWatchdog(
    piko.WithWatchdogSchedulerLatencyP99Threshold(5 * time.Millisecond),
    piko.WithWatchdogContentionDiagnosticAutoFire(),
)

When the scheduler-latency threshold trips repeatedly, the watchdog runs a contention diagnostic instead of (or alongside) the threshold's normal warning event. The captured block and mutex profiles point at the source of the contention without operator involvement.

Two configuration fields govern the firing thresholds. ContentionDiagnosticConsecutiveTrigger (default 3) is the number of scheduler-latency events that must accrue before a diagnostic auto-fires. ContentionDiagnosticTriggerWindow (default 15 minutes) is the rolling window over which the runner counts those consecutive events. Raising the trigger count or shortening the window makes auto-fire less sensitive. Lowering the count or lengthening the window makes it more sensitive. The defaults fire on a sustained pattern, not a single transient spike.

Without WithWatchdogContentionDiagnosticAutoFire, the diagnostic only runs when explicitly invoked via the CLI.

Why the diagnostic is not part of the regular tick loop

Block and mutex profiling impose a cost on blocking and lock-contention events that scales with the workload. The watchdog's tick loop (see about the watchdog) keeps its overhead negligible by only reading runtime metrics. Leaving block and mutex profiling on continuously would itself become a source of contention. The contention diagnostic is the exception. It opens a short, deliberate window in which that cost is acceptable because the goal is to observe the contention itself.