Dense neural-network layer

Forward pass of a 256x256 dense (fully connected) layer with ReLU activation. piko's SIMD recogniser matches the inner dot-product loop; every other runner executes it as scalar code.

Chart: compile time, median (cold) of 10 runs per runner, shown relative to piko; the same figures appear in the full statistics table below.
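Every figure here summarises 10 runs. The page does not say how the harness computes its percentile or standard deviation, so the Go sketch below makes two assumptions: P95 uses the nearest-rank method, and the standard deviation uses the sample (n-1) denominator. The `summarize` helper is hypothetical, and the timings in `main` are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// summarize returns the median, a nearest-rank 95th percentile and the
// sample standard deviation of a set of timings (any unit).
// Assumption: nearest-rank and the n-1 denominator are illustrative
// choices; the benchmark harness may define these differently.
func summarize(samples []float64) (median, p95, stddev float64) {
	s := append([]float64(nil), samples...) // copy so the caller's slice is untouched
	sort.Float64s(s)
	n := len(s)
	if n == 0 {
		return 0, 0, 0
	}
	if n%2 == 1 {
		median = s[n/2]
	} else {
		median = (s[n/2-1] + s[n/2]) / 2
	}

	// Nearest-rank: the smallest sample with at least 95% of samples at or below it.
	rank := int(math.Ceil(0.95 * float64(n)))
	p95 = s[rank-1]

	mean := 0.0
	for _, v := range s {
		mean += v
	}
	mean /= float64(n)
	ss := 0.0
	for _, v := range s {
		ss += (v - mean) * (v - mean)
	}
	if n > 1 {
		stddev = math.Sqrt(ss / float64(n-1))
	}
	return median, p95, stddev
}

func main() {
	// Hypothetical runtimes in microseconds for 10 runs (not real data).
	runs := []float64{1100, 1095, 1102, 1098, 1130, 1101, 1099, 1104, 1097, 1103}
	med, p95, sd := summarize(runs)
	fmt.Printf("median=%.1f µs p95=%.1f µs stddev=%.2f µs\n", med, p95, sd)
}
```

Under the nearest-rank assumption, a 95th percentile over only 10 samples is simply the slowest run, so a P95 column computed this way reads as a worst-of-10 figure.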

Full statistics

Runner	Runs	Compile (median, cold)	Runtime	Runtime P95	Runtime stddev	RSS	Compile vs piko	Status
Native Go (compiled)	10	182 ms	146 µs	149 µs	1.31 µs	68 MiB	173×	OK
Piko interp (bytecode VM)	10	1.05 ms	1.10 ms	1.13 ms	8.50 µs	101 MiB	1.00×	OK
CPython 3.13 (bytecode VM)	10	356 µs	22.6 ms	23.6 ms	493 µs	n/a	0.34×	OK
PyPy 7.3 (tracing JIT)	10	301 µs	5.72 ms	5.99 ms	135 µs	n/a	0.29×	OK
tengo (bytecode VM)	10	291 µs	34.4 ms	38.3 ms	3.73 ms	361 MiB	0.28×	OK
scriggo (bytecode VM)	10	325 µs	13.2 ms	14.3 ms	400 µs	82 MiB	0.31×	OK
mvm (bytecode VM)	10	319 µs	24.2 ms	36.4 ms	5.06 ms	66 MiB	0.30×	OK
yaegi (AST walker)	10	424 µs	16.8 ms	17.0 ms	90.0 µs	62 MiB	0.40×	OK

Compile vs piko is each runner's median compile time divided by piko's (1.05 ms).

Workload & symmetry rules

Workload

For each of the 256 output neurons, compute output[i] = relu(sum_j(W[i][j] * input[j]) + bias[i]) over a deterministically seeded 256x256 weight matrix W and a 256-element input vector. Sum the activations, multiply the total by 1000 and emit it as a single integer, so that canonical hashing of the output is FP-safe.
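A minimal Go sketch of this workload follows. It is not the benchmark's actual `piko_source.go`. The page does not specify the seeding scheme or value ranges, so several details are assumptions: the `lcg` generator (a 64-bit linear congruential generator using Knuth's MMIX constants), its seed of 42, the [-1, 1) value range, filling the bias vector from the same generator, and truncating the scaled sum to an integer.

```go
package main

import "fmt"

const size = 256

// lcg is a hypothetical deterministic generator standing in for the
// benchmark's (unspecified) seeding scheme.
type lcg struct{ state uint64 }

// next advances the generator and returns a float64 in [-1, 1).
func (g *lcg) next() float64 {
	g.state = g.state*6364136223846793005 + 1442695040888963407
	return float64(g.state>>11)/float64(1<<53)*2 - 1
}

func relu(x float64) float64 {
	if x > 0 {
		return x
	}
	return 0
}

// denseForward computes output[i] = relu(sum_j W[i][j]*input[j] + bias[i])
// for every row and returns the sum of the activations.
func denseForward(w [][]float64, input, bias []float64) float64 {
	total := 0.0
	for i, row := range w {
		sum := 0.0
		// Plain scalar dot product, as the symmetry rules require.
		for j := 0; j < len(input); j++ {
			sum += row[j] * input[j]
		}
		total += relu(sum + bias[i])
	}
	return total
}

func main() {
	g := &lcg{state: 42} // assumed seed
	w := make([][]float64, size)
	for i := range w {
		w[i] = make([]float64, size)
		for j := range w[i] {
			w[i][j] = g.next()
		}
	}
	input := make([]float64, size)
	for j := range input {
		input[j] = g.next()
	}
	bias := make([]float64, size)
	for i := range bias {
		bias[i] = g.next()
	}
	// Scale by 1000 and truncate to an integer (truncation is an assumption).
	fmt.Println(int64(denseForward(w, input, bias) * 1000))
}
```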

Symmetry rules

Hand-rolled scalar dot-product loop in every runner. No matrix libraries, no numpy, no SIMD intrinsics in the source.
The Go source is structured so that piko's pattern recogniser sees the inner loop as a dot product and emits one subOpSimdDotProductFloat64 per row. The same source runs as scalar code on every other Go-family interpreter.
The Python version is a plain for loop over scalar floats; no numpy.dot.
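The page does not describe how subOpSimdDotProductFloat64 accumulates. However, SIMD dot products typically keep several partial sums in separate lanes and combine them at the end. That adds the terms in a different order from the scalar loop, and because floating-point addition is not associative, the two can disagree. The sketch below assumes four lanes and a pairwise final reduction. `dotScalar` and `dotLanes4` are hypothetical helpers.

```go
package main

import "fmt"

// dotScalar accumulates strictly left to right, like the hand-rolled source loop.
func dotScalar(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// dotLanes4 models a 4-lane SIMD dot product: element i goes to lane i%4,
// and the lanes are combined pairwise at the end. The lane count and the
// reduction order are assumptions, not piko's documented behaviour.
func dotLanes4(a, b []float64) float64 {
	var lanes [4]float64
	for i := range a {
		lanes[i%4] += a[i] * b[i]
	}
	return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3])
}

func main() {
	a := []float64{1e16, 1, -1e16, 1}
	b := []float64{1, 1, 1, 1}
	fmt.Println(dotScalar(a, b), dotLanes4(a, b)) // prints "1 0"
}
```

The scalar loop absorbs the first 1 into 1e16 but then adds the last 1 after the large terms have cancelled. The lane version rounds away both 1s before the final reduction. With well-scaled inputs, such differences usually stay in the last few bits, which is presumably why the workload emits a scaled integer rather than a raw float.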

Why this benchmark exists

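This is the headline number for piko's SIMD recogniser. Bench 14 (Mandelbrot) measures pure FP throughput, and bench 22 (n-body) measures struct access inside an FP loop. This one shows what the SIMD path is worth on a workload that maps onto it perfectly. When the recogniser fires, piko closes much of the gap to native Go. In the run above, piko's runtime is 1.10 ms against native Go's 146 µs (about 7.5×), while the fastest other runner, PyPy's tracing JIT, takes 5.72 ms.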

Source code

piko / Go: piko_source.go
native Go: native_main.go
CPython / PyPy: cpython.py
tengo: script.tengo