Host: Intel Core Ultra 9 285K · 24 cores
Platform: linux/amd64
Go: go1.26.0
CPython: python:3.13-slim
PyPy: pypy:3.10-slim
Runs / combo: 10 + 2 warmup

Dense neural-network layer

Forward pass of a 256x256 dense (fully connected) layer with ReLU activation. The inner dot-product loop triggers piko's SIMD recogniser; every other runner executes it as scalar code.

Compile time · median (cold)

median of 10 runs

Runner         Type          Compile   vs piko
Native Go      compiled      182 ms    173×
Piko interp    bytecode VM   1.05 ms   baseline
CPython 3.13   bytecode VM   356 µs    0.34×
PyPy 7.3       tracing JIT   301 µs    0.29×
tengo          bytecode VM   291 µs    0.28×
scriggo        bytecode VM   325 µs    0.31×
mvm            bytecode VM   319 µs    0.30×
yaegi          AST walker    424 µs    0.40×

Full statistics

Runner         Type          N    Compile   Runtime   P95       Stddev    RSS       vs piko   Status
Native Go      compiled      10   182 ms    146 µs    149 µs    1.31 µs   68 MiB    173×      OK
Piko interp    bytecode VM   10   1.05 ms   1.10 ms   1.13 ms   8.50 µs   101 MiB   1.00×     OK
CPython 3.13   bytecode VM   10   356 µs    22.6 ms   23.6 ms   493 µs    n/a       0.34×     OK
PyPy 7.3       tracing JIT   10   301 µs    5.72 ms   5.99 ms   135 µs    n/a       0.29×     OK
tengo          bytecode VM   10   291 µs    34.4 ms   38.3 ms   3.73 ms   361 MiB   0.28×     OK
scriggo        bytecode VM   10   325 µs    13.2 ms   14.3 ms   400 µs    82 MiB    0.31×     OK
mvm            bytecode VM   10   319 µs    24.2 ms   36.4 ms   5.06 ms   66 MiB    0.30×     OK
yaegi          AST walker    10   424 µs    16.8 ms   17.0 ms   90.0 µs   62 MiB    0.40×     OK
Workload & symmetry rules

Workload

For each of 256 output neurons: compute output[i] = relu(sum_j(W[i][j] * input[j]) + bias[i]) over a deterministically-seeded 256x256 weight matrix and 256-element input vector. Sum the activations, multiply by 1000, emit as a single integer so canonical hashing is FP-safe.
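
The workload described above can be sketched in Go roughly as follows. This is an illustrative reconstruction, not the benchmark's actual source: the lcg generator, its constants, the seed values and the function names are assumptions, since the page only states that the weights and input are deterministically seeded.

```go
package main

import "fmt"

const size = 256

// lcg is a hypothetical deterministic generator standing in for whatever
// seeding scheme the real benchmark uses.
type lcg struct{ state uint64 }

// next returns a pseudo-random float64 in [-1, 1).
func (g *lcg) next() float64 {
	g.state = g.state*6364136223846793005 + 1442695040888963407
	return float64(g.state>>11)/float64(1<<53)*2 - 1
}

// dot is the hand-rolled scalar dot product: no libraries, no intrinsics.
func dot(w, x []float64) float64 {
	sum := 0.0
	for j := range w {
		sum += w[j] * x[j]
	}
	return sum
}

func relu(v float64) float64 {
	if v > 0 {
		return v
	}
	return 0
}

// forward computes output[i] = relu(sum_j(W[i][j] * input[j]) + bias[i]).
func forward(weights [][]float64, input, bias []float64) []float64 {
	out := make([]float64, len(weights))
	for i, row := range weights {
		out[i] = relu(dot(row, input) + bias[i])
	}
	return out
}

// checksum sums the activations, scales by 1000 and truncates to an integer,
// so the emitted value can be hashed without comparing raw floats.
func checksum(out []float64) int64 {
	total := 0.0
	for _, v := range out {
		total += v
	}
	return int64(total * 1000)
}

func main() {
	g := &lcg{state: 42} // the seed value is an assumption
	weights := make([][]float64, size)
	for i := range weights {
		weights[i] = make([]float64, size)
		for j := range weights[i] {
			weights[i][j] = g.next()
		}
	}
	input := make([]float64, size)
	bias := make([]float64, size)
	for j := range input {
		input[j] = g.next()
		bias[j] = g.next()
	}
	fmt.Println(checksum(forward(weights, input, bias)))
}
```

Keeping the dot product in its own tight loop over two slices is the shape a pattern recogniser would most plausibly match; whether piko requires exactly this form is not stated on the page.
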

Symmetry rules

  • Hand-rolled scalar dot-product loop in every runner. No matrix libraries, no numpy, no SIMD intrinsics in the source.
  • The Go source is structured so that piko's pattern recogniser identifies the inner loop as a dot product and emits one subOpSimdDotProductFloat64 per row. The same source runs as scalar code on every other Go-family interpreter.
  • The Python version uses a plain for loop over scalar floats; no numpy.dot.

Why this benchmark exists

It is the headline number for piko's SIMD recogniser. Bench 14 (Mandelbrot) shows pure FP throughput; bench 22 (n-body) shows struct-in-FP-loop; this one specifically shows what the SIMD path is worth in a workload that maps to it perfectly. If the recogniser fires, piko closes much of the gap to native Go: its median runtime here is 1.10 ms against 146 µs for native Go, while the other interpreters range from 5.72 ms to 34.4 ms.

Source code