Host
Intel Core Ultra 9 285K · 24 cores
Platform
linux/amd64
Go
go1.26.0
CPython
python:3.13-slim
PyPy
pypy:3.10-slim
Runs / combo
10 + 2 warmup

Invert binary tree

65k-node balanced binary tree: allocator pressure, recursive descent, and pointer mutation in one inner loop. The only workload in the suite that stresses all three at once.

Compile time · median (cold)

median of 10 runs

Native Gocompiled
181 ms162×
Piko interpbytecode VM
1.12 msbaseline
CPython 3.13bytecode VM
312 µs0.28×
PyPy 7.3tracing JIT
265 µs0.24×
Ttengobytecode VM
244 µs0.22×
Sscriggobytecode VM
n/a
Mmvmbytecode VM
358 µs0.32×
YyaegiAST walker
485 µs0.43×

Full statistics

RunnerNCompileRuntimeP95StddevRSSvs pikoStatus
Native Gocompiled10181 ms6.11 ms6.21 ms52.5 µs68 MiB162×OK
Piko interpbytecode VM101.12 ms116 ms118 ms927 µs194 MiB1.00×OK
CPython 3.13bytecode VM10312 µs91.2 ms95.2 ms1.92 msn/a0.28×OK
PyPy 7.3tracing JIT10265 µs39.8 ms40.4 ms416 µsn/a0.24×OK
tengobytecode VM10244 µs189 ms195 ms4.08 ms1.06 GiB0.22×OK
scriggobytecode VM0n/an/an/an/an/an/aunsupported
mvmbytecode VM10358 µs182 ms204 ms9.79 ms67 MiB0.32×OK
yaegiAST walker10485 µs744 ms792 ms20.3 ms75 MiB0.43×OK
Workload & symmetry rules

Workload

  1. Recursively construct a balanced binary tree of depth 15 (2¹⁶ − 1 = 65535 nodes). Each node's value is derived deterministically from its parent (value*2 for left, value*2+1 for right, 32-bit masked).
  2. Count nodes via recursive traversal (sanity check).
  3. Compute an in-order hash fold: state = state*31 + value mod 2³².
  4. Invert the tree: at every node, swap left and right pointers recursively.
  5. Re-count and re-hash. Emit the four observables pre_count,pre_fold,post_count,post_fold.

Symmetry rules

  • Real recursive *Node struct (Go) / Node class with __slots__ (Python).
  • No conversion to arrays for the inversion step, just a pointer swap at every node.

What makes this workload unique

Every other benchmark in the suite isolates one performance dimension. Bench 14 stresses pure float arithmetic in a tight loop. Bench 15 stresses interpreter throughput with no allocation. Bench 13 stresses interface dispatch with no mutation. This one combines three pressures into a single inner loop: 65k heap allocations during construction, four full recursive walks across the tree, and in-place pointer-field mutation during the inversion. The runner that wins this benchmark is the runner whose allocator, call-frame management, and pointer-write path are all simultaneously fast. The runner that loses it has a bottleneck in at least one of the three.

Source code