Benchmarks

TPC-H SF=1, 22 queries, Apple M3 Pro — median ms ± σ vs DuckDB, Polars, PySpark.

Same-machine TPC-H benchmark (Apple M3 Pro, single-node) over all 22 queries against SF=1 Parquet data. ematix-flow / DuckDB / Polars run in-process; PySpark runs in local[*] mode against the same files.

Scope: every number on this page is single-node. ematix-flow also has an auto-detected distributed mode (Arrow Flight peer mesh — see Advantages §5). Cluster-scale TPC-H runs at SF≥100 will land in a later release; the bench harness (tpch_distributed) is already in the repo.

  • ematix-flow / DuckDB / Polars: 10 timed trials after 3 warmups (v0.4.0–v0.5.0; unchanged, refreshed 2026-05-20).
  • PySpark: 3 trials after 1 warmup, Spark 4.1.1 on JDK 23 (refreshed on the same machine, same data, same day).
  • Data: examples/tpch/data/sf1.

Each ematix-flow / DuckDB / Polars cell is median ms ± σ across 10 trials; PySpark cells are median ms across 3 trials. “—” means the engine couldn’t parse / execute the query (dialect gap).
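The trial protocol above (discard the warmups, then report median and σ over the timed runs) can be sketched in a few lines. This is a minimal illustration, not the harness used for this page: `bench` and its parameters are hypothetical names, and treating σ as the sample standard deviation is an assumption, since the page does not say which estimator is used.

```python
import statistics
import time


def bench(run_query, warmups=3, trials=10):
    """Time a zero-argument callable; return (median_ms, sigma_ms).

    Hypothetical helper: warmup runs are executed but not recorded,
    and sigma is the sample standard deviation (an assumption).
    """
    for _ in range(warmups):
        run_query()  # warm caches / lazy initialisation, result discarded

    samples_ms = []
    for _ in range(trials):
        start = time.perf_counter()
        run_query()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return statistics.median(samples_ms), statistics.stdev(samples_ms)


# Stand-in workload; a real run would execute one TPC-H query here.
median_ms, sigma_ms = bench(lambda: sum(range(200_000)))
print(f"{median_ms:.2f} ± {sigma_ms:.2f} ms")
```

The median is less sensitive than the mean to a single slow trial, which is why a cell can have a large σ yet a stable headline number.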

Headline

  • Geomean speedup of ematix-flow (v0.4.0–v0.5.0; unchanged, 10-trial refresh 2026-05-20):
    • 1.75× vs DuckDB (was 1.69× at v0.3.0)
    • 2.77× vs Polars (was 2.71×)
    • 13.4× vs PySpark local[*] (was 12.9×)
  • Win counts (lowest median per query): ematix-flow 19, DuckDB 1, Polars 2, PySpark 0.

The geomean improvement over v0.3.0 comes from the ematix-parquet v0.13.0 bump (full SIMD specialisation for bw=1..=32; Q06 scan kernel -18.7%, Q17 -9.5%, both measured in isolation) plus the Σ.F.1 shape-catalog substrate that auto-loads previously hand-wired optimizer rules. The features that closed out the “What’s not shipped” list (warehouse backends, Web UI, secrets, distributed peer auto-detection) are orthogonal to the scan/aggregate hot path.

v0.5.0 ships the same query-execution surface as v0.4.0. The kernel work lives in the sibling ematix-parquet codec; v0.5.0 itself is operational — CLIs, Web UI, alerters, observability — so per-query times match v0.4.0.

Full table

Query  ematix-flow     DuckDB          Polars              PySpark  Best
Q01    28.63 ± 0.61    45.24 ± 0.20    38.52 ± 0.84        196.5    ematix-flow
Q02    9.85 ± 0.21     19.07 ± 0.62    46.07 ± 0.65        290.7    ematix-flow
Q03    13.96 ± 1.40    32.70 ± 0.65    46.00 ± 0.86        288.2    ematix-flow
Q04    13.21 ± 0.43    23.07 ± 2.21    25.28 ± 1.51        226.1    ematix-flow
Q05    21.59 ± 0.93    31.48 ± 0.70    11150.72 ± 689.69   364.2    ematix-flow
Q06    11.04 ± 1.41    11.94 ± 0.20    10.16 ± 0.27        68.3     Polars
Q07    28.79 ± 1.15    32.65 ± 0.93    115.31 ± 3.89       286.8    ematix-flow
Q08    20.41 ± 0.67    38.26 ± 0.41    93.62 ± 7.78        209.8    ematix-flow
Q09    26.30 ± 1.36    60.67 ± 1.63    47.96 ± 1.36        461.3    ematix-flow
Q10    28.83 ± 10.44   68.29 ± 2.23    111.80 ± 8.15       421.9    ematix-flow
Q11    8.65 ± 0.31     11.62 ± 0.62    9.35 ± 5.04         139.1    ematix-flow
Q12    14.85 ± 0.37    24.37 ± 0.68    19.06 ± 0.86        288.4    ematix-flow
Q13    41.68 ± 0.73    147.33 ± 2.06   117.00 ± 4.13       694.2    ematix-flow
Q14    12.13 ± 1.00    24.22 ± 1.04    13.01 ± 0.78        138.3    ematix-flow
Q15    16.25 ± 0.92    15.69 ± 1.87    11.48 ± 0.22        166.4    Polars
Q16    8.76 ± 1.48     26.00 ± 4.35    21.29 ± 0.71        211.5    ematix-flow
Q17    36.85 ± 2.24    28.48 ± 1.62    42.04 ± 2.96        239.4    DuckDB
Q18    51.21 ± 3.06    52.37 ± 1.31    59.19 ± 2.32        569.1    ematix-flow
Q19    17.79 ± 1.89    36.82 ± 3.48    106.55 ± 9.04       111.4    ematix-flow
Q20    16.34 ± 0.85    39.11 ± 3.04    23.30 ± 2.39        148.8    ematix-flow
Q21    41.08 ± 1.67    87.04 ± 2.18    730.68 ± 39.43      648.5    ematix-flow
Q22    8.62 ± 0.52     22.40 ± 0.65    12.97 ± 1.67        280.2    ematix-flow
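
The headline geomean speedups and win counts can be recomputed from the medians in the table above. The sketch below assumes the geomean is the geometric mean of the per-query ratio (other engine's median / ematix-flow median) over all 22 queries; the page does not spell out the formula, and `MEDIANS`, `geomean_speedup` and `win_counts` are names made up for this illustration.

```python
import math

ENGINES = ("ematix-flow", "DuckDB", "Polars", "PySpark")

# Median ms per query, copied from the table above (σ omitted),
# in ENGINES order.
MEDIANS = {
    "Q01": (28.63, 45.24, 38.52, 196.5),
    "Q02": (9.85, 19.07, 46.07, 290.7),
    "Q03": (13.96, 32.70, 46.00, 288.2),
    "Q04": (13.21, 23.07, 25.28, 226.1),
    "Q05": (21.59, 31.48, 11150.72, 364.2),
    "Q06": (11.04, 11.94, 10.16, 68.3),
    "Q07": (28.79, 32.65, 115.31, 286.8),
    "Q08": (20.41, 38.26, 93.62, 209.8),
    "Q09": (26.30, 60.67, 47.96, 461.3),
    "Q10": (28.83, 68.29, 111.80, 421.9),
    "Q11": (8.65, 11.62, 9.35, 139.1),
    "Q12": (14.85, 24.37, 19.06, 288.4),
    "Q13": (41.68, 147.33, 117.00, 694.2),
    "Q14": (12.13, 24.22, 13.01, 138.3),
    "Q15": (16.25, 15.69, 11.48, 166.4),
    "Q16": (8.76, 26.00, 21.29, 211.5),
    "Q17": (36.85, 28.48, 42.04, 239.4),
    "Q18": (51.21, 52.37, 59.19, 569.1),
    "Q19": (17.79, 36.82, 106.55, 111.4),
    "Q20": (16.34, 39.11, 23.30, 148.8),
    "Q21": (41.08, 87.04, 730.68, 648.5),
    "Q22": (8.62, 22.40, 12.97, 280.2),
}


def geomean_speedup(engine):
    """Geometric mean of (engine median / ematix-flow median) per query."""
    idx = ENGINES.index(engine)
    logs = [math.log(row[idx] / row[0]) for row in MEDIANS.values()]
    return math.exp(sum(logs) / len(logs))


def win_counts():
    """Count, per engine, the queries where it has the lowest median."""
    wins = dict.fromkeys(ENGINES, 0)
    for row in MEDIANS.values():
        wins[ENGINES[row.index(min(row))]] += 1
    return wins


for engine in ENGINES[1:]:
    print(f"vs {engine}: {geomean_speedup(engine):.2f}x")
print(win_counts())
```

Because a geometric mean averages ratios in log space, one extreme cell (the Polars Q05 median, for instance) shifts the result noticeably, so it is worth reading the geomean alongside the per-query rows.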

Release-over-release perf history

v0.4.0 vs v0.3.0

v0.4.0 is the alpha milestone — warehouse backends, Web UI, pluggable secrets, distributed peer auto-detection. All four are orthogonal to the scan / aggregate hot path. The geomean still moved:

Engine   v0.3.0  v0.4.0  Δ geomean
DuckDB   1.69×   1.75×   +3.6%
Polars   2.71×   2.77×   +2.2%
PySpark  12.9×   13.4×   +4.0%

Win count rose 18 → 19 / 22 (Q18 flipped to ematix-flow as σ tightened under 10-trial medians). Per-query times shifted by about ±10%, which is noise-band movement rather than a directional change, with two non-headline sources of lift:

  • ematix-parquet v0.13.0 — full SIMD specialisation bw=1..=32 landed; Q06 scan kernel -18.7%, Q17 -9.5% measured in isolation (kernel-only, not end-to-end Q06 wall time).
  • Σ.F.1 shape-catalog substrate — bit-identical perf vs the hand-wired Inject*Rule set it replaced, but stable enough to allow the 10-trial / 3-warmup bench config that surfaced the gain.

v0.3.0 vs v0.2.1

Historical record of the big jump — when ematix-parquet replaced the parquet-rs scan path:

Query  v0.2.1  v0.3.0  Δ
Q01    78.19   28.11   -64%
Q03    20.38   15.11   -26%
Q05    34.09   20.93   -39%
Q07    75.56   28.96   -62%
Q08    35.66   20.76   -42%
Q09    50.16   28.13   -44%
Q10    39.73   28.16   -29%
Q13    44.73   41.36   -8%
Q14    19.45   11.28   -42%
Q16    18.29   8.60    -53%
Q18    157.55  52.02   -67%
Q19    99.76   18.81   -81%
Q21    75.48   38.08   -50%

With v0.3.0 the win count rose from 15 to 18 of 22 queries.

Caveats

  • ematix-flow’s late-materialization path (read_column_*_masked_into) is enabled for lineitem. Late-mat helps queries with a selective filter on a dict/PLAIN-decodable scalar column; on aggregate-heavy queries with low filter selectivity (Q01) it’s effectively a no-op.
  • Polars’s SQL frontend rejects several TPC-H canonical shapes; hand-translated q??.polars.sql variants ship under examples/tpch/queries/. Q05 specifically still blows up Polars’s planner.
  • DuckDB runs at default settings (in-memory read_parquet views). ematix-flow runs with target_partitions=14 and the InjectFilterMultiAggRule + InjectFilterSumRule + EnableDictGroupCountRule physical-optimizer rules registered.
  • PySpark uses local[*], spark.sql.shuffle.partitions=8, spark.sql.adaptive.enabled=true. JVM warmup costs exceed what discarding a single warmup trial can amortize, so treat the PySpark column as order-of-magnitude.
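
For readers who want to mirror the PySpark settings from the last caveat, a session could be configured roughly as follows. This is a sketch under assumptions: the page does not show how scripts/bench-tpch-pyspark.py builds its session, and `SPARK_CONF`, `build_session` and the app name are hypothetical. Only the master URL and the two configuration keys come from the text above.

```python
# Settings quoted in the PySpark caveat above.
SPARK_MASTER = "local[*]"
SPARK_CONF = {
    "spark.sql.shuffle.partitions": "8",
    "spark.sql.adaptive.enabled": "true",
}


def build_session(app_name="tpch-bench"):
    """Create a local SparkSession with the settings listed above.

    Hypothetical helper; pyspark is imported lazily so the settings can
    be inspected on a machine without Spark or a JDK installed.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(SPARK_MASTER).appName(app_name)
    for key, value in SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session()` requires a working JDK and the pyspark package, as in the reproduction steps.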

Reproducing

# ematix-flow vs DuckDB vs Polars
cargo run --release -p ematix-flow-core \
    --example tpch_triangulation_bench --features triangulation

# PySpark (needs Java 17+; install with `brew install openjdk@23`):
JAVA_HOME=$(/usr/libexec/java_home) python scripts/bench-tpch-pyspark.py \
    --data-dir examples/tpch/data/sf1 --trials 3