Watchdog API

The watchdog is a runtime supervisor that monitors heap, RSS, goroutines, file descriptors, and scheduler latency, and captures diagnostic profiles when a threshold is breached. The application configures it under WithMonitoring. It exposes its state to piko tui and the piko watchdog CLI through the gRPC monitoring transport. For the design rationale, see "About the watchdog". For task recipes, see "How to configure the watchdog". Source: options.go, cmd_watchdog.go.

Bootstrap entry point

func WithMonitoringWatchdog(opts ...WatchdogOption) MonitoringOption

Enables the watchdog inside the monitoring service. Without this option the watchdog never starts and piko watchdog calls fail with a "service not registered" gRPC error.

Note: This option must sit inside WithMonitoring(...), not alongside it. The signature returns MonitoringOption (not Option); placing it at the top level fails to compile.

Prerequisite: Pair this with WithMonitoringProfiling(). The watchdog's capture paths (continuous, threshold-triggered, pre-death, contention) all dispatch through the profiling controller that option constructs. Without it, every capture silently no-ops and the profile directory only ever contains startup_history.json. Piko emits a startup WARN when this dependency is missing.

piko.WithMonitoring(
    piko.WithMonitoringTransport(monitoring_grpc.Transport()),
    piko.WithMonitoringProfiling(),
    piko.WithMonitoringWatchdog(
        piko.WithWatchdogProfileDirectory("/var/lib/piko/profiles"),
        piko.WithWatchdogContinuousProfiling(),
    ),
)

Threshold options

| Option | Default | Purpose |
| --- | --- | --- |
| WithWatchdogHeapThresholdPercent(p) | 0.85 | Heap fraction of GOMEMLIMIT that triggers a heap profile. |
| WithWatchdogHeapThresholdBytes(b) | 512 MiB | Absolute heap threshold when GOMEMLIMIT is unset. |
| WithWatchdogRSSThresholdPercent(p) | 0.85 | RSS fraction of the cgroup memory limit. |
| WithWatchdogGoroutineThreshold(n) | 10000 | Goroutine count that triggers a goroutine profile. |
| WithWatchdogFDPressureThresholdPercent(p) | 0.80 | File-descriptor fraction of soft RLIMIT_NOFILE. Pass 0 to disable. |
| WithWatchdogSchedulerLatencyP99Threshold(d) | 10ms | p99 scheduler latency. Pass zero to disable. |

Floating-point thresholds use the 0.0-1.0 range. Counts and durations use their native types.
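
As an illustration of how the percent and byte thresholds relate, the sketch below computes an effective heap threshold from the Go soft memory limit and falls back to an absolute byte value when GOMEMLIMIT is unset. The function effectiveHeapThreshold is hypothetical and is not part of the Piko API; the watchdog's actual evaluation logic is not shown in this reference and may differ.

```go
package main

import (
	"fmt"
	"math"
	"runtime/debug"
)

// effectiveHeapThreshold is a hypothetical helper. It returns the heap size,
// in bytes, at which a capture would fire. memLimit is the Go soft memory
// limit; math.MaxInt64 means GOMEMLIMIT was never set, in which case the
// absolute fallback applies.
func effectiveHeapThreshold(memLimit int64, percent float64, fallbackBytes uint64) uint64 {
	if memLimit <= 0 || memLimit == math.MaxInt64 {
		return fallbackBytes
	}
	return uint64(float64(memLimit) * percent)
}

func main() {
	// A negative argument reads the current limit without changing it.
	limit := debug.SetMemoryLimit(-1)

	// 0.85 and 512 MiB mirror the documented defaults.
	fmt.Println(effectiveHeapThreshold(limit, 0.85, 512<<20))
}
```

With GOMEMLIMIT unset the program prints the 512 MiB fallback in bytes; with a limit set it prints 85% of that limit.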

Loop, capture, and budget options

| Option | Default | Purpose |
| --- | --- | --- |
| WithWatchdogCheckInterval(d) | 500ms | Tick frequency for threshold evaluation. |
| WithWatchdogCooldown(d) | 2m | Minimum gap between captures for the same metric type. |
| WithWatchdogMaxProfilesPerType(n) | 5 | Files retained per profile type. Oldest rotates out. |
| WithWatchdogMaxWarningsPerWindow(n) | 10 | Warning-only events permitted per capture window. |
| WithWatchdogProfileDirectory(dir) | os.TempDir()/piko-watchdog | Local directory for profile, sidecar, and history files. |
| WithWatchdogDeltaProfiling() | off | Stores a baseline heap snapshot beside each capture for pprof -diff_base. |
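
The cooldown applies per metric type, so a recent heap capture does not suppress a goroutine capture. A minimal sketch of that behaviour, assuming a simple last-capture map, follows. The cooldownGate type and the cooldownDemo helper are hypothetical names for illustration, not the watchdog's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// cooldownGate (hypothetical) remembers the last capture time per metric type.
type cooldownGate struct {
	cooldown time.Duration
	last     map[string]time.Time
}

func newCooldownGate(cooldown time.Duration) *cooldownGate {
	return &cooldownGate{cooldown: cooldown, last: make(map[string]time.Time)}
}

// allow reports whether a capture for metric may proceed at now.
// It records the capture time only when the capture is allowed.
func (g *cooldownGate) allow(metric string, now time.Time) bool {
	if prev, ok := g.last[metric]; ok && now.Sub(prev) < g.cooldown {
		return false
	}
	g.last[metric] = now
	return true
}

// cooldownDemo replays four breaches against a 2m cooldown (the documented default).
func cooldownDemo() []bool {
	gate := newCooldownGate(2 * time.Minute)
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	return []bool{
		gate.allow("heap", start),                           // first heap breach
		gate.allow("heap", start.Add(30*time.Second)),       // heap again, inside cooldown
		gate.allow("goroutine", start.Add(30*time.Second)),  // different metric type
		gate.allow("heap", start.Add(2*time.Minute)),        // heap after the cooldown elapsed
	}
}

func main() {
	fmt.Println(cooldownDemo()) // [true false true true]
}
```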

When the application calls WithDiagnosticDirectory on the container, profile files land at <dir>/profiles/ unless the application has also called WithWatchdogProfileDirectory, which takes precedence. The diagnostic-directory override applies only when the watchdog profile directory is left unset.
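
The precedence between the two directory options can be pictured as a three-way fallback. The sketch below is an assumption-labelled illustration: resolveProfileDir is a hypothetical function, and it assumes an unset option is represented by an empty string.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveProfileDir (hypothetical) picks the directory profiles are written to.
// An empty argument means the corresponding option was not set.
func resolveProfileDir(watchdogDir, diagnosticDir string) string {
	switch {
	case watchdogDir != "":
		// WithWatchdogProfileDirectory takes precedence.
		return watchdogDir
	case diagnosticDir != "":
		// WithDiagnosticDirectory only applies when no watchdog directory is set.
		return filepath.Join(diagnosticDir, "profiles")
	default:
		// Documented default: os.TempDir()/piko-watchdog.
		return filepath.Join(os.TempDir(), "piko-watchdog")
	}
}

func main() {
	fmt.Println(resolveProfileDir("/var/lib/piko/profiles", "/var/lib/piko/diag"))
	fmt.Println(resolveProfileDir("", "/var/lib/piko/diag"))
	fmt.Println(resolveProfileDir("", ""))
}
```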

Continuous profiling

A separate routine loop captures profiles at a fixed interval, independent of any threshold breach.

| Option | Default | Purpose |
| --- | --- | --- |
| WithWatchdogContinuousProfiling() | off | Enables the routine loop. |
| WithWatchdogContinuousProfilingInterval(d) | 10m | Period between routine captures. Minimum 1m. |
| WithWatchdogContinuousProfilingTypes(t...) | ["heap"] | Profile types per interval. Allowed: heap, goroutine, allocs. |
| WithWatchdogContinuousProfilingRetention(n) | 6 | Files retained per type. |
| WithWatchdogContinuousProfilingNotify() | off | Emits informational notifications for each routine capture. |
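
To make the interval and retention settings concrete, here is a minimal standard-library sketch of a routine capture loop: it writes one heap profile per tick with runtime/pprof and deletes the oldest files beyond the retention count. The function names and the sequence-numbered file naming are assumptions for illustration; the watchdog dispatches captures through the profiling controller, which this sketch does not model.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/pprof"
	"sort"
	"time"
)

// captureHeap writes one heap profile into dir, then prunes to retain files.
func captureHeap(dir string, seq, retain int) error {
	name := filepath.Join(dir, fmt.Sprintf("heap-%06d.pprof", seq))
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	if err := pprof.Lookup("heap").WriteTo(f, 0); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return pruneOldest(dir, retain)
}

// pruneOldest deletes the oldest heap profiles until at most retain remain.
func pruneOldest(dir string, retain int) error {
	files, err := filepath.Glob(filepath.Join(dir, "heap-*.pprof"))
	if err != nil {
		return err
	}
	sort.Strings(files) // zero-padded sequence numbers sort oldest first
	for len(files) > retain {
		if err := os.Remove(files[0]); err != nil {
			return err
		}
		files = files[1:]
	}
	return nil
}

// runRoutineLoop captures one profile per tick into a temporary directory and
// returns how many files are left on disk after the given number of captures.
func runRoutineLoop(interval time.Duration, retain, captures int) (int, error) {
	dir, err := os.MkdirTemp("", "routine-profiles-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for seq := 1; seq <= captures; seq++ {
		<-ticker.C
		if err := captureHeap(dir, seq, retain); err != nil {
			return 0, err
		}
	}
	left, err := filepath.Glob(filepath.Join(dir, "heap-*.pprof"))
	return len(left), err
}

func main() {
	// 10ms keeps the demo fast; the documented default is 10m with a 1m minimum.
	n, err := runRoutineLoop(10*time.Millisecond, 6, 8)
	if err != nil {
		fmt.Println("capture failed:", err)
		return
	}
	fmt.Println("profiles retained:", n) // profiles retained: 6
}
```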

Contention diagnostic

A short-window diagnostic that flips block and mutex profiling on at configurable rates, captures the resulting profiles, and turns them off again.

| Option | Default | Purpose |
| --- | --- | --- |
| WithWatchdogContentionDiagnosticWindow(d) | 60s | Time block + mutex profiling stays active. Range 1s-5m. |
| WithWatchdogContentionDiagnosticAutoFire() | off | Fires automatically on repeated scheduler-latency events. |
| WithWatchdogContentionDiagnosticBlockProfileRate(rate) | 1e6 | Runtime block profile rate (one sample per rate ns of blocking). |
| WithWatchdogContentionDiagnosticMutexProfileFraction(frac) | 100 | Runtime mutex profile fraction (1 in frac events sampled). |

The diagnostic is a one-shot, blocking call when triggered manually:

piko watchdog contention-diagnostic
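
The enable, wait, capture, disable sequence maps onto the Go runtime's own profiling switches. The sketch below shows that sequence with runtime.SetBlockProfileRate and runtime.SetMutexProfileFraction. It is a standalone illustration under the assumption that both profilers start off; contentionDiagnostic and currentMutexFraction are hypothetical names, not the watchdog's code.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// contentionDiagnostic enables block and mutex profiling for the window,
// serialises both profiles, and switches profiling off again before returning.
// The call blocks for the whole window.
func contentionDiagnostic(window time.Duration, blockRate, mutexFraction int) (block, mutex []byte, err error) {
	runtime.SetBlockProfileRate(blockRate)
	runtime.SetMutexProfileFraction(mutexFraction)
	defer func() {
		// A rate of 0 turns each profiler off.
		runtime.SetBlockProfileRate(0)
		runtime.SetMutexProfileFraction(0)
	}()

	time.Sleep(window)

	var b, m bytes.Buffer
	if err = pprof.Lookup("block").WriteTo(&b, 0); err != nil {
		return nil, nil, err
	}
	if err = pprof.Lookup("mutex").WriteTo(&m, 0); err != nil {
		return nil, nil, err
	}
	return b.Bytes(), m.Bytes(), nil
}

// currentMutexFraction reads the mutex profile fraction without changing it.
func currentMutexFraction() int {
	return runtime.SetMutexProfileFraction(-1)
}

func main() {
	// 50ms keeps the demo short; the documented default window is 60s.
	// 1_000_000 and 100 mirror the documented default rate and fraction.
	block, mutex, err := contentionDiagnostic(50*time.Millisecond, 1_000_000, 100)
	if err != nil {
		fmt.Println("diagnostic failed:", err)
		return
	}
	fmt.Printf("block profile: %d bytes, mutex profile: %d bytes\n", len(block), len(mutex))
	fmt.Println("mutex fraction after run:", currentMutexFraction())
}
```

Keeping the profilers off outside the window is the point of the design: block and mutex profiling add overhead, so they are only paid for while a diagnostic is running.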

Notifier and uploader

These two ports plug in at the monitoring level, not inside WithMonitoringWatchdog, so every watchdog notification flows through the same notifier.

type WatchdogNotifier        = monitoring_domain.WatchdogNotifier
type WatchdogProfileUploader = monitoring_domain.WatchdogProfileUploader

| Option | Purpose |
| --- | --- |
| WithWatchdogNotifier(notifier) | Delivers WatchdogEvents to an external system (Slack, PagerDuty, email). |
| WithWatchdogProfileUploader(uploader) | Uploads each captured profile to remote storage after the local write. |

The notifier receives every event the watchdog emits, including the typed event categories below.

Event types

type WatchdogEvent         = monitoring_domain.WatchdogEvent
type WatchdogEventType     = monitoring_domain.WatchdogEventType
type WatchdogEventPriority = monitoring_domain.WatchdogEventPriority

Priorities:

| Constant | Meaning |
| --- | --- |
| WatchdogPriorityNormal | Informational. Safe to ignore in alerting. |
| WatchdogPriorityHigh | Warrants prompt investigation. |
| WatchdogPriorityCritical | Imminent system instability. |
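
Because the notifier receives every event, an alerting integration will usually want to drop normal-priority events before paging anyone. The sketch below shows one way to do that with a wrapping notifier. Everything in it is a stand-in: this reference does not show the method set of WatchdogNotifier or the fields of WatchdogEvent, so the Notifier interface, the Notify signature, the Event struct, the priority ordering, and the priority assigned to each sample event are all assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// Priority is a stand-in for WatchdogEventPriority.
// Assumption: priorities are ordered Normal < High < Critical.
type Priority int

const (
	PriorityNormal Priority = iota
	PriorityHigh
	PriorityCritical
)

// Event is a stand-in for WatchdogEvent with only the fields this sketch needs.
type Event struct {
	Type     string
	Priority Priority
}

// Notifier is a hypothetical shape for the notifier port.
type Notifier interface {
	Notify(ctx context.Context, e Event) error
}

// minPriority forwards only events at or above floor to next.
type minPriority struct {
	floor Priority
	next  Notifier
}

func (m minPriority) Notify(ctx context.Context, e Event) error {
	if e.Priority < m.floor {
		return nil // dropped: below the alerting floor
	}
	return m.next.Notify(ctx, e)
}

// recorder collects delivered event types; a real sink would call Slack or PagerDuty.
type recorder struct{ delivered []string }

func (r *recorder) Notify(_ context.Context, e Event) error {
	r.delivered = append(r.delivered, e.Type)
	return nil
}

// deliverSample pushes three events through the filter and returns what got through.
// The priorities attached to each event type here are illustrative only.
func deliverSample() []string {
	rec := &recorder{}
	n := minPriority{floor: PriorityHigh, next: rec}
	events := []Event{
		{Type: "routine_profile_captured", Priority: PriorityNormal},
		{Type: "heap_threshold_exceeded", Priority: PriorityHigh},
		{Type: "crash_loop_detected", Priority: PriorityCritical},
	}
	for _, e := range events {
		_ = n.Notify(context.Background(), e)
	}
	return rec.delivered
}

func main() {
	fmt.Println(deliverSample()) // [heap_threshold_exceeded crash_loop_detected]
}
```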

The CLI's piko watchdog events --type <type> flag filters by WatchdogEventType. The full set of constants in internal/monitoring/monitoring_domain/watchdog_notifier.go is:

| Constant | String value |
| --- | --- |
| WatchdogEventHeapThresholdExceeded | heap_threshold_exceeded |
| WatchdogEventRSSThresholdExceeded | rss_threshold_exceeded |
| WatchdogEventGoroutineThresholdExceeded | goroutine_threshold_exceeded |
| WatchdogEventGoroutineSafetyCeiling | goroutine_safety_ceiling |
| WatchdogEventGCPressureWarning | gc_pressure_warning |
| WatchdogEventCaptureError | capture_error |
| WatchdogEventGomemlimitNotConfigured | gomemlimit_not_configured |
| WatchdogEventMemProfileRateDisabled | memprofilerate_disabled |
| WatchdogEventHeapTrendWarning | heap_trend_warning |
| WatchdogEventGoroutineLeakDetected | goroutine_leak_detected |
| WatchdogEventPreDeathSnapshot | pre_death_snapshot |
| WatchdogEventLoopPanicked | loop_panicked |
| WatchdogEventFDPressureExceeded | fd_pressure_exceeded |
| WatchdogEventSchedulerLatencyHigh | scheduler_latency_high |
| WatchdogEventCrashLoopDetected | crash_loop_detected |
| WatchdogEventPreviousCrashClassified | previous_crash_classified |
| WatchdogEventRoutineProfileCaptured | routine_profile_captured |
| WatchdogEventContentionDiagnostic | contention_diagnostic |

CLI: piko watchdog

Connects to the monitoring transport (default 127.0.0.1:9091) and operates on the running process's watchdog state.

| Subcommand | Purpose |
| --- | --- |
| piko watchdog status | Prints lifecycle, thresholds, crash-loop, continuous-profiling, and contention-diagnostic configuration. |
| piko watchdog list [--type <type>] | Lists stored profiles. Type, timestamp, size, filename. |
| piko watchdog download [<filename> \| --latest --type <type>] [--output <dir>] [--skip-sidecar] | Downloads a profile file (and its JSON sidecar by default) to the local directory. |
| piko watchdog prune [--type <type>] | Removes stored profiles. Without --type removes everything. |
| piko watchdog history | Prints the startup-history ring (process ID, started, stopped, reason, host, version). |
| piko watchdog events [--since <duration>] [--type <type>] [--limit <n>] [--tail] | Lists or streams events from the in-memory ring. --tail subscribes to new events as they fire. |
| piko watchdog contention-diagnostic | Runs a one-shot contention diagnostic. Blocks for the configured window. |

The global flags from the CLI reference (-e/--endpoint, -o/--output, --no-colour, etc.) also apply.
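
The --since, --type, and --limit flags of piko watchdog events can be thought of as three filters applied to the in-memory ring. The sketch below models that with a hypothetical event struct and filterEvents function. It assumes that --limit keeps the most recent entries and that a zero or empty value disables a filter; the CLI's real semantics are not specified in this reference.

```go
package main

import (
	"fmt"
	"time"
)

// event is a stand-in for a ring entry with only the fields this sketch needs.
type event struct {
	Type string
	At   time.Time
}

// filterEvents applies the filters in order: age, type, then count.
// A zero since, empty typ, or zero limit disables that filter.
func filterEvents(ring []event, now time.Time, since time.Duration, typ string, limit int) []event {
	var out []event
	for _, e := range ring {
		if since > 0 && now.Sub(e.At) > since {
			continue // older than the --since window
		}
		if typ != "" && e.Type != typ {
			continue // does not match --type
		}
		out = append(out, e)
	}
	if limit > 0 && len(out) > limit {
		out = out[len(out)-limit:] // assumption: keep the most recent entries
	}
	return out
}

// sampleTypes runs the filters over a fixed four-event ring and returns the matching types.
func sampleTypes(since time.Duration, typ string, limit int) []string {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	ring := []event{
		{"heap_threshold_exceeded", now.Add(-3 * time.Hour)},
		{"routine_profile_captured", now.Add(-50 * time.Minute)},
		{"heap_threshold_exceeded", now.Add(-20 * time.Minute)},
		{"scheduler_latency_high", now.Add(-5 * time.Minute)},
	}
	var types []string
	for _, e := range filterEvents(ring, now, since, typ, limit) {
		types = append(types, e.Type)
	}
	return types
}

func main() {
	fmt.Println(sampleTypes(time.Hour, "", 0))                 // like --since 1h
	fmt.Println(sampleTypes(0, "heap_threshold_exceeded", 0))  // like --type heap_threshold_exceeded
	fmt.Println(sampleTypes(time.Hour, "", 2))                 // like --since 1h --limit 2
}
```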

Examples

# Status and recent activity
piko watchdog status
piko watchdog events --since 1h
piko watchdog events --type heap_threshold_exceeded
piko watchdog list

# Download the most recent heap profile and inspect it
piko watchdog download --latest --type heap --output ./pprof
go tool pprof ./pprof/heap-<timestamp>.pprof

# Watch for new events live
piko watchdog events --tail

# Trigger a contention diagnostic on demand
piko watchdog contention-diagnostic

# Clean up stored heap profiles
piko watchdog prune --type heap

# Detect crash loops
piko watchdog history

See also