Interpreter methodology

Piko's cross-language benchmark suite measures one specific thing: interpreter dispatch throughput on workloads that look like real code. It does not measure the language itself, the standard library, or the compiler. The numbers exist to answer "how much CPU does piko spend on bytecode dispatch?", not "is Go better than Python?".

What gets compared

The headline comparison highlights four runtimes:

Piko interp. The Go-bytecode tree-walking-ish interpreter that powers hot-reload dev mode.
Native Go. A go run of the same algorithm. The reference upper bound. Piko is never going to match it in inner-loop mode and should not try to.
CPython 3.13. The reference Python implementation. Pinned via container image tag.
PyPy 3.10. Python with a tracing JIT. Pinned via container image tag.

By default the suite also runs each workload through four additional Go-embeddable interpreters (yaegi, scriggo, tengo, mvm), so a default go test exercises up to eight runtimes per workload. A handful of specs skip the ones that cannot express a given workload.

Every page prints the full host disclosure (OS, arch, CPU count, Go version, image tags) at the top.

Fairness rules (non-negotiable)

Same algorithm in all runners. Idiomatic per language. The suite measures interpreter throughput on real code, not one author's preferred clever trick.
Verify outputs match before reporting times. Each runner normalises its stdout (CRLF to LF, trailing whitespace stripped, single final newline dropped) and SHA-256 hashes it. The suite flags runs whose hash differs from the canonical hash as status: mismatch. It shows them with a warning chip but excludes them from the numeric aggregates: only status: ok runs feed the median, mean, stddev, min and p95. Mismatch runs still appear individually in results/latest.json.
Median of N >= 7 runs, not minimum. Minimums invite cherry-picking. Median plus stddev is honest. The suite discards warmup runs.
Two timing modes, both reported. End-to-end includes process startup. Inner-loop runs the body K times and reports the inner-loop wall time alone. Neither alone tells the whole story.
Pinned interpreter versions. CPython and PyPy run inside Docker containers managed by testcontainers-go. Image tags are explicit, never :latest.
No standard-library hot-path shortcuts. Each benchmark forbids the C-backed stdlib calls that would short-circuit the comparison. Per-benchmark READMEs (linked from each workload page) enumerate the bans. Built-in language features (map / dict, slice / list indexing, len, string concatenation, integer arithmetic) are always allowed.
Workloads big enough to amortise startup. The smallest benchmark in the canonical set takes at least 50 ms on piko on a modern desktop. Anything smaller would let startup noise dominate.
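
The output-verification rule above can be sketched in Go as follows. This is a minimal illustration, not the suite's actual code: the function names normalise and hashOutput are hypothetical, and it assumes that "trailing whitespace stripped" means per line (spaces and tabs), which the rule does not spell out.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// normalise applies the three steps named in the fairness rule:
// CRLF to LF, trailing whitespace stripped, single final newline dropped.
// Assumption: trailing whitespace is stripped per line.
func normalise(stdout string) string {
	s := strings.ReplaceAll(stdout, "\r\n", "\n")
	lines := strings.Split(s, "\n")
	for i, line := range lines {
		lines[i] = strings.TrimRight(line, " \t")
	}
	s = strings.Join(lines, "\n")
	// TrimSuffix removes at most one trailing newline.
	return strings.TrimSuffix(s, "\n")
}

// hashOutput returns the hex-encoded SHA-256 of the normalised output.
func hashOutput(stdout string) string {
	sum := sha256.Sum256([]byte(normalise(stdout)))
	return hex.EncodeToString(sum[:])
}

func main() {
	windows := hashOutput("42 \r\nok\r\n")
	unix := hashOutput("42\nok\n")
	// Line endings and trailing spaces no longer affect the hash.
	fmt.Println(windows == unix) // prints "true"
}
```

Hashing the normalised bytes rather than the raw stdout means a runner is only flagged when its answer differs, not when its platform's line endings do.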

How the suite draws the bars

Bar width is runner_median / max_median_in_view * 100%, with a 1.5% floor so a far-faster runner still shows a visible sliver. The "ratio vs piko" chip is the ratio of medians, not means. Outliers pull the mean around; the median is far less sensitive to them.
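
As a rough sketch, the bar-width rule might look like this in Go. The function names are hypothetical, the medians in main are invented for illustration, and the direction of the ratio (runner median divided by piko median) is an assumption, since the text only says it is a ratio of medians.

```go
package main

import "fmt"

// barWidthPercent returns a runner's bar width as a percentage of the
// slowest runner in view, clamped to the 1.5% floor.
func barWidthPercent(runnerMedian, maxMedianInView float64) float64 {
	const floor = 1.5
	if maxMedianInView <= 0 {
		return floor // nothing sensible to scale against
	}
	w := runnerMedian / maxMedianInView * 100
	if w < floor {
		return floor
	}
	return w
}

// ratioOfMedians compares two medians. Assumption: runner over piko,
// so values above 1 mean the runner is slower than piko.
func ratioOfMedians(runnerMedian, pikoMedian float64) float64 {
	return runnerMedian / pikoMedian
}

func main() {
	// Illustrative medians in milliseconds, not measured results.
	native, piko, cpython := 4.0, 500.0, 1000.0
	maxInView := cpython

	fmt.Printf("%.1f\n", barWidthPercent(native, maxInView))  // 1.5 (floored from 0.4)
	fmt.Printf("%.1f\n", barWidthPercent(piko, maxInView))    // 50.0
	fmt.Printf("%.1f\n", barWidthPercent(cpython, maxInView)) // 100.0
	fmt.Printf("%.1f\n", ratioOfMedians(cpython, piko))       // 2.0
}
```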

What the suite does not claim

The suite does not publish a geomean across benchmarks ("piko is faster overall"). Workload selection biases any such aggregate. Pick the workload that matches the code you write and read that bar.

The suite makes no claim that piko interp beats native Go on hot-path code. It cannot, by design. Production piko compiles to Go rather than running in interpreted form. The interpreter exists to keep dev-mode hot-reload sub-millisecond. Comparing it to go run in inner-loop mode pits "tree-walked bytecode" against "fully optimised native code". Read those numbers as "this is the speed you trade for hot-reload".

Reproduce locally

git clone https://github.com/piko-sh/piko
cd piko/tests/benchmarks/cross_language
RUN_CROSS_LANG_BENCH=1 go test -tags=crosslang -timeout 30m ./...

By default the suite uses Docker via testcontainers to pin CPython and PyPy. Set CROSS_LANG_USE_HOST_PYTHON=1 to use the host's python3 and pypy3 instead (fast local iteration, but the results are no longer machine-portable).

The full results/latest.json is machine-readable and includes every individual run, not just aggregates, so downstream tooling can recompute any percentile or filter you like.
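
Recomputing an aggregate from individual runs could look roughly like the Go sketch below. The Run struct is a hypothetical shape, not the real results/latest.json schema, which this page does not describe; decode the file into whatever structure it actually uses. The nearest-rank percentile is also an assumption, since the suite's own p95 definition is not stated here.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Run is a hypothetical record; field names are illustrative only.
type Run struct {
	Status string  // "ok" or "mismatch"
	Millis float64 // wall time of one run
}

// okTimes returns the sorted durations of runs whose status is "ok".
func okTimes(runs []Run) []float64 {
	var ts []float64
	for _, r := range runs {
		if r.Status == "ok" {
			ts = append(ts, r.Millis)
		}
	}
	sort.Float64s(ts)
	return ts
}

// median expects a sorted, non-empty slice.
func median(sorted []float64) float64 {
	n := len(sorted)
	if n%2 == 1 {
		return sorted[n/2]
	}
	return (sorted[n/2-1] + sorted[n/2]) / 2
}

// percentile uses the nearest-rank method on a sorted, non-empty slice.
func percentile(sorted []float64, p float64) float64 {
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Invented sample data: seven ok runs and one mismatch.
	runs := []Run{
		{"ok", 12}, {"ok", 10}, {"ok", 11}, {"ok", 13},
		{"ok", 14}, {"ok", 15}, {"ok", 16}, {"mismatch", 100},
	}
	ts := okTimes(runs)
	fmt.Println(median(ts))         // 13
	fmt.Println(percentile(ts, 95)) // 16
}
```

Filtering on status before aggregating mirrors the rule that only status: ok runs feed the published numbers.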

Edge cases on this page

Mismatch. A runner's stdout hash differs from the canonical. Row gets a dashed outline and a warning chip. Its individual run times are still shown, but they are excluded from the aggregates (median, mean, stddev, min, p95).
No data. A runner failed every attempt or the suite has not backfilled the workload. Renders as a "-" cell.
Single runner. Only one runtime has results yet. The page skips the chart and shows just the absolute time with a note.

Known limitations

Single machine. Ratios differ on your hardware. Run the suite yourself if your conclusion depends on the exact numbers.
Linux/amd64 primary. Linux/arm64 and macOS work but are noisier. Windows is best-effort.
Concurrency coverage is limited. Benchmark 16 (parallel word count) is the only multi-core workload; it spawns 16 workers via goroutines + WaitGroup on the Go family and multiprocessing.Pool on Python. Goroutines vs multiprocessing vs asyncio is not exhaustively explored.
Floating-point outputs are folded to integers. Cross-language FP rounding can change a raw stdout hash, so FP workloads (mandelbrot_fp_200x200, nbody_simulation, dense_layer) never print raw floats. Each folds its result to an integer or byte form (escape-iteration sum, energy scaled by 1e9 and rounded, summarised layer output) so the canonical hash is FP-rounding-safe.
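
The integer folding described for FP workloads can be illustrated with a short Go sketch. The function name foldEnergy and the sample values are hypothetical; only the "scale by 1e9 and round" step comes from the text above.

```go
package main

import (
	"fmt"
	"math"
)

// foldEnergy scales a floating-point energy by 1e9 and rounds it to the
// nearest integer, so only the integer is ever printed and hashed.
func foldEnergy(energy float64) int64 {
	return int64(math.Round(energy * 1e9))
}

func main() {
	// Illustrative values: b simulates a runtime whose result differs
	// from a only in the low-order digits.
	a := -0.25
	b := a + 1e-13

	fmt.Println(a == b)                         // false: raw floats differ
	fmt.Println(foldEnergy(a) == foldEnergy(b)) // true: folded values agree
}
```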