Host: Intel Core Ultra 9 285K · 24 cores
Platform: linux/amd64
Go: go1.26.0
CPython: python:3.13-slim
PyPy: pypy:3.10-slim
Runs / combo: 10 + 2 warmup

Dense neural-network layer

Forward pass of a 256x256 dense (fully-connected) layer with ReLU activation. The inner dot-product loop hits piko's SIMD recogniser; every other runner executes it as a scalar loop.

Runtime · median per inner-loop window

median of 10 runs

Native Go (compiled): 146 µs, 0.13×
Piko interp (bytecode VM): 1.10 ms, baseline
CPython 3.13 (bytecode VM): 22.6 ms, 20.5×
PyPy 7.3 (tracing JIT): 5.72 ms, 5.19×
tengo (bytecode VM): 34.4 ms, 31.2×
scriggo (bytecode VM): 13.2 ms, 12.0×
mvm (bytecode VM): 24.2 ms, 22.0×
yaegi (AST walker): 16.8 ms, 15.3×

Full statistics

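| Runner | Type | N | Compile | Runtime | P95 | Stddev | RSS | vs piko | Status |
|---|---|---|---|---|---|---|---|---|---|
| Native Go | compiled | 10 | 182 ms | 146 µs | 149 µs | 1.31 µs | 68 MiB | 0.13× | OK |
| Piko interp | bytecode VM | 10 | 1.05 ms | 1.10 ms | 1.13 ms | 8.50 µs | 101 MiB | 1.00× | OK |
| CPython 3.13 | bytecode VM | 10 | 356 µs | 22.6 ms | 23.6 ms | 493 µs | n/a | 20.5× | OK |
| PyPy 7.3 | tracing JIT | 10 | 301 µs | 5.72 ms | 5.99 ms | 135 µs | n/a | 5.19× | OK |
| tengo | bytecode VM | 10 | 291 µs | 34.4 ms | 38.3 ms | 3.73 ms | 361 MiB | 31.2× | OK |
| scriggo | bytecode VM | 10 | 325 µs | 13.2 ms | 14.3 ms | 400 µs | 82 MiB | 12.0× | OK |
| mvm | bytecode VM | 10 | 319 µs | 24.2 ms | 36.4 ms | 5.06 ms | 66 MiB | 22.0× | OK |
| yaegi | AST walker | 10 | 424 µs | 16.8 ms | 17.0 ms | 90.0 µs | 62 MiB | 15.3× | OK |
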
Workload & symmetry rules

Workload

For each of 256 output neurons: compute output[i] = relu(sum_j(W[i][j] * input[j]) + bias[i]) over a deterministically-seeded 256x256 weight matrix and 256-element input vector. Sum the activations, multiply by 1000, emit as a single integer so canonical hashing is FP-safe.
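The workload described above can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the benchmark's actual source: the `lcg` generator, its seed, the value range, and truncation (rather than rounding) when converting to an integer are all hypothetical stand-ins for details the page does not give.

```go
package main

import "fmt"

const n = 256 // layer width and input length, as stated in the workload

// lcg is a hypothetical deterministic generator standing in for the
// benchmark's seeding scheme, which is not shown on the page.
type lcg struct{ state uint64 }

// next advances the generator and maps the top 53 bits to [-1, 1).
func (g *lcg) next() float64 {
	g.state = g.state*6364136223846793005 + 1442695040888963407
	return float64(g.state>>11)/float64(1<<53)*2 - 1
}

func relu(x float64) float64 {
	if x > 0 {
		return x
	}
	return 0
}

// denseForward computes output[i] = relu(sum_j(W[i][j]*input[j]) + bias[i])
// with a hand-rolled scalar dot product, one row of W per output neuron.
func denseForward(w [][]float64, input, bias []float64) []float64 {
	out := make([]float64, len(w))
	for i, row := range w {
		sum := 0.0
		for j := range row {
			sum += row[j] * input[j]
		}
		out[i] = relu(sum + bias[i])
	}
	return out
}

// checksum sums the activations, scales by 1000, and converts to an integer.
// Truncation is an assumption; the page only says "a single integer".
func checksum(out []float64) int64 {
	total := 0.0
	for _, v := range out {
		total += v
	}
	return int64(total * 1000)
}

func main() {
	g := &lcg{state: 42} // hypothetical seed
	w := make([][]float64, n)
	for i := range w {
		w[i] = make([]float64, n)
		for j := range w[i] {
			w[i][j] = g.next()
		}
	}
	input := make([]float64, n)
	bias := make([]float64, n)
	for i := 0; i < n; i++ {
		input[i] = g.next()
		bias[i] = g.next()
	}
	fmt.Println(checksum(denseForward(w, input, bias)))
}
```

Emitting one scaled integer instead of the raw float vector means runners only have to agree on the integer part of the scaled sum, which is presumably what the page means by FP-safe hashing.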

Symmetry rules

  • Hand-rolled scalar dot-product loop in every runner. No matrix libraries, no numpy, no SIMD intrinsics in the source.
  • The Go source is structured so piko's pattern recogniser sees the inner loop as a dot-product and emits one subOpSimdDotProductFloat64 per row. The same source runs as a scalar loop on every other Go-family interpreter.
  • The Python version uses a plain for loop over scalar floats; no numpy.dot.

Why this benchmark exists

It is the headline number for piko's SIMD recogniser. Bench 14 (Mandelbrot) shows pure FP throughput; bench 22 (n-body) shows struct-in-FP-loop; this one specifically shows what the SIMD path is worth in a workload that maps to it perfectly. If the recogniser fires, piko closes much of the gap to native Go: in this run its median is 1.10 ms against 146 µs for native Go, while every other interpreter is between 5.19× and 31.2× slower than piko.

Source code