Advantages

1. Fast.

TPC-H SF=1, 22 queries, single Apple M3 Pro:

1.69× faster than DuckDB
2.71× faster than Polars
12.9× faster than single-node PySpark

(All geomeans. 18 / 22 wins outright.) Full table + reproducer in Benchmarks.
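
For readers unfamiliar with how a geomean summary is built, here is a minimal sketch. It assumes each ratio is a per-query speedup (baseline time divided by engine time); the sample values are made up for illustration and are not the published results.

```python
import math

def geomean(ratios):
    """Geometric mean of positive ratios, computed in log space for stability."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("geomean needs a non-empty list of positive ratios")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Illustrative per-query speedups only (NOT the numbers from the Benchmarks page).
speedups = [2.0, 0.5, 4.0, 1.0]

overall = geomean(speedups)                    # (2 * 0.5 * 4 * 1) ** (1/4)
wins = sum(1 for s in speedups if s > 1.0)     # queries where the engine is strictly faster
```

A geomean keeps one very fast or very slow query from dominating the summary the way an arithmetic mean of ratios would.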

2. Auto-tunes per query — no knobs to set.

With Spark you tune shuffle.partitions, autoBroadcastJoinThreshold, adaptive.enabled, executor memory, and add /*+ BROADCAST(...) */ hints per query to land on a good plan. With ematix-flow you just write the SQL — the engine detects the shape of the plan and swaps in the right operator itself.

What that means concretely:

Shape detection → operator injection. Physical-optimizer rules (InjectFilterMultiAggRule, InjectFilterSumRule, EnableDictGroupCountRule, …) pattern-match aggregate-and-filter shapes and substitute purpose-built fused executors. A WHERE … GROUP BY lands on a fused filter-then-aggregate kernel without the SQL knowing the rule exists.
Per-chunk selectivity dispatch. Inside the scan, ematix-parquet probes the first few pages of every chunk, measures observed selectivity, and decides per chunk whether to emit a bitmap (wins at low selectivity) or a values vector (wins at high). Same plan, two execution strategies, chosen on the fly.
Page-streaming vs. eager decode. A small-partition rule flips to page-streaming below ~900k rows and back to eager multi-row-group decode above it; an auto-inline rule kicks in at ≥2M rows on multi-row-group scans. No flag to set.
Specialised per-column code paths. Bit-unpack kernels are hand-tuned for each bit width 1–32 on both NEON and AVX2; the dispatcher picks the right kernel from the chunk header. Adding a column type doesn’t change the SQL.
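
The shape-detection idea in the first item above can be sketched as a plan rewrite. This is a toy model in Python, not the engine's Rust rules: the node classes and the function name inject_fused_filter_agg are hypothetical, and the only behaviour shown is "an aggregate sitting directly on a filter is replaced by one fused node".

```python
from dataclasses import dataclass, replace

# Hypothetical plan nodes for illustration; the real optimizer works on its own plan types.
@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    predicate: str
    child: object

@dataclass(frozen=True)
class Aggregate:
    group_by: tuple
    aggs: tuple
    child: object

@dataclass(frozen=True)
class FusedFilterAggregate:
    predicate: str
    group_by: tuple
    aggs: tuple
    child: object

def inject_fused_filter_agg(plan):
    """Rewrite Aggregate(Filter(x)) into a single fused node, bottom-up."""
    if isinstance(plan, Scan):
        return plan
    child = inject_fused_filter_agg(plan.child)
    if isinstance(plan, Aggregate) and isinstance(child, Filter):
        # Shape matched: drop both nodes and substitute the fused operator.
        return FusedFilterAggregate(child.predicate, plan.group_by, plan.aggs, child.child)
    # No match at this node: keep it, but with the (possibly rewritten) child.
    return replace(plan, child=child)

plan = Aggregate(
    group_by=("l_returnflag",),
    aggs=("sum(l_quantity)",),
    child=Filter("l_shipdate <= '1998-09-02'", Scan("lineitem")),
)
rewritten = inject_fused_filter_agg(plan)
```

The SQL author never sees the fused node; the rewrite happens on the physical plan after parsing and planning.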

That’s why the TPC-H table on Benchmarks doesn’t need a per-query tuning column. The one-time setup is registering the rule set in the session config; from there, queries auto-tune.
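
As a rough illustration of the per-chunk selectivity dispatch described earlier in this section, the sketch below probes a prefix of a chunk and picks an output shape. It is an assumption-heavy stand-in: the real scan probes pages rather than rows, and the probe size and the 0.25 crossover used here are invented defaults, not the engine's actual values.

```python
PROBE_ROWS = 1024               # assumed probe size (the real probe works on pages)
BITMAP_MAX_SELECTIVITY = 0.25   # assumed crossover between the two strategies

def scan_chunk(values, predicate, probe_rows=PROBE_ROWS, threshold=BITMAP_MAX_SELECTIVITY):
    """Return ("bitmap", flags) or ("values", survivors) based on observed selectivity."""
    probe = values[:probe_rows]
    if not probe:
        return ("values", [])
    selectivity = sum(1 for v in probe if predicate(v)) / len(probe)
    if selectivity <= threshold:
        # Few rows pass: mark survivors in a bitmap and copy nothing.
        return ("bitmap", [predicate(v) for v in values])
    # Many rows pass: emitting the surviving values directly is cheaper.
    return ("values", [v for v in values if predicate(v)])

chunk = list(range(1000))
kind_low, out_low = scan_chunk(chunk, lambda v: v < 10, probe_rows=100)
kind_high, out_high = scan_chunk(chunk, lambda v: v % 2 == 0, probe_rows=100)
```

The same query can therefore take different paths on different chunks of the same column, which is the point of deciding per chunk instead of per plan.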

3. Scheduling + DAG, no service to operate.

Pipelines carry their own cron schedule and depends_on= edges (with cycle detection and exponential-backoff retries). Run flow run-due from cron, systemd, a Kubernetes CronJob, GitHub Actions, or the bundled long-running scheduler — same code, same topological order, same retry semantics.
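
To make the ordering and retry semantics concrete, here is a minimal sketch using only the Python standard library. It is not the ematix-flow scheduler: the function names, the pipeline names, and the backoff parameters (base, factor, cap) are all assumptions chosen for illustration.

```python
import time
from graphlib import TopologicalSorter, CycleError

def run_order(depends_on):
    """Map of pipeline -> upstream pipelines; returns a valid execution order.

    Raises graphlib.CycleError if the dependency edges form a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())

def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0):
    """Exponential backoff schedule: base, base*factor, ... capped at `cap` seconds."""
    return [min(cap, base * factor ** i) for i in range(retries)]

def run_with_retries(task, delays, sleep=time.sleep):
    """Try `task`; after each failure wait the next delay. The last attempt may raise."""
    for delay in delays:
        try:
            return task()
        except Exception:
            sleep(delay)
    return task()

# Hypothetical pipelines and edges.
order = run_order({
    "orders_clean": {"orders_raw"},
    "daily_report": {"orders_clean", "customers"},
})

try:
    run_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Because the order is derived from the edges alone, it does not matter which trigger (cron, a CronJob, CI) starts the run.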

Already on Airflow / Dagster / Prefect? Call .sync() directly.

4. Batteries included.

Out-of-the-box backends:

Databases. Postgres, MySQL, SQLite, DuckDB.
Lakehouse. Delta Lake (local + S3).
Object stores. Parquet, CSV, ORC, JSONL — local and S3.
Streaming. Kafka, RabbitMQ, GCP Pub/Sub, AWS Kinesis.
Schema Registry. Confluent SR and Apicurio for Avro / Protobuf.
CDC. Source mode dispatches per-op transactionally to your existing target backend.
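
One way to read "dispatches per-op transactionally" is that each change event is routed by its operation type and a batch either commits as a whole or not at all. The sketch below shows that reading against SQLite from the standard library; the event shape ({"op", "id", "name"}) and the customers table are hypothetical and say nothing about ematix-flow's actual CDC format.

```python
import sqlite3

def apply_cdc_batch(conn, events):
    """Apply change events in order; the whole batch commits or rolls back together."""
    with conn:  # sqlite3: commit on success, roll back if an exception escapes
        for ev in events:
            op = ev["op"]
            if op == "insert":
                conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)",
                             (ev["id"], ev["name"]))
            elif op == "update":
                conn.execute("UPDATE customers SET name = ? WHERE id = ?",
                             (ev["name"], ev["id"]))
            elif op == "delete":
                conn.execute("DELETE FROM customers WHERE id = ?", (ev["id"],))
            else:
                raise ValueError(f"unknown op: {op!r}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

apply_cdc_batch(conn, [
    {"op": "insert", "id": 1, "name": "Ada"},
    {"op": "insert", "id": 2, "name": "Bob"},
    {"op": "update", "id": 1, "name": "Ada L."},
    {"op": "delete", "id": 2},
])
rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

If any event in a batch is malformed, the earlier operations in that batch are rolled back, so the target never reflects half of a source transaction.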

5. Scales out — opt-in distributed mode, no cluster service.

Most fast single-node engines (DuckDB, Polars) stop at one machine. ematix-flow doesn’t.

Set engine = "distributed" + a list of peers in the session config and the same SQL fans out across a peer-to-peer mesh of flow-worker processes via Apache Arrow Flight. mTLS for the mesh, cross-pod lookup broadcast for small dimension tables, no separate cluster service to run.

Library, not a service. ematix-flow-distributed is a Rust crate any ematix-flow process can link. Any process that links it can play coordinator or worker — symmetric mesh, no master node, no JVM, no driver/executor split.
Same @ematix.pipeline decorator. Going single-node → distributed is a configuration change, not a code change. Pipelines, modes, watermarks, run history — all identical.
Spark / DuckDB SQL portable in place. A dialect translator targets DataFusion’s parser under the hood, so existing Spark-flavored or DuckDB-flavored SQL ports across without a rewrite.

Distributed benchmark numbers at SF≥100 are on the roadmap; the numbers on /specs/02-benchmarks are all single-node. The distributed code path itself is shipped, tested, and has a bench harness (tpch_distributed); we just haven’t published cluster-scale runs yet.

6. Hand-tuned Parquet scan path.

Most analytical engines lean on parquet-rs. ematix-flow ships with ematix-parquet — a hand-rolled Rust Parquet codec built for analytical workloads:

SIMD bit-unpackers on NEON + AVX2. Every bit width 1–32 has a hand-tuned raw-indices kernel; bw=1–21 (the practical range) also has fused unpack + dict-gather kernels. Output ceiling hits 76–96 GB/s on every specialised width.
Predicate-fused decode. unpack + filter + bitmap-pack in one SIMD pass — 3.7–6.3× faster than materialize-then-filter at low selectivity. Rows that fail the predicate never materialise.
Adaptive dispatch. Per-chunk selectivity probe decides whether to emit a bitmap (wins at low selectivity) or a values vector (wins at high).
Decode-into-caller-buffer. The late-materialization (*_masked_into) and Arrow-style (bytes, offsets) output shapes skip the alloc + zero-fill that dominates scan profiles.
Full spec coverage. Every physical type, every encoding (PLAIN, dict, DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, BYTE_STREAM_SPLIT), every mainstream codec (Snappy, Zstd, Gzip, Brotli, LZ4_RAW), V1 + V2 pages, page indexes, bloom filters, Parquet Modular Encryption.
Light footprint. Sync read/write stack has zero third-party deps beyond the chosen compression codec. Async, encryption, and parallel decode are opt-in features.
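
The predicate-fused decode item above is easier to picture with a scalar model. The sketch below is plain Python, not SIMD and not the ematix-parquet API: pack, unpack, and fused_unpack_filter are hypothetical helpers that assume values are bit-packed LSB-first. The fused version tests each value as it is unpacked and writes only a packed bitmap, so no intermediate values list is built.

```python
def pack(values, bit_width):
    """Bit-pack unsigned ints LSB-first into bytes (helper for the demo)."""
    mask = (1 << bit_width) - 1
    acc = 0
    for i, v in enumerate(values):
        acc |= (v & mask) << (i * bit_width)
    return acc.to_bytes((len(values) * bit_width + 7) // 8, "little")

def unpack(data, bit_width, count):
    """Materialize all `count` values (the 'decode first, filter later' path)."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]

def fused_unpack_filter(data, bit_width, count, predicate):
    """Unpack, test, and bitmap-pack in one loop; failing rows are never stored."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    bitmap = bytearray((count + 7) // 8)
    for i in range(count):
        value = (acc >> (i * bit_width)) & mask
        if predicate(value):
            bitmap[i >> 3] |= 1 << (i & 7)   # set bit i, LSB-first within each byte
    return bytes(bitmap)

values = [5, 1, 7, 0, 3, 6, 2, 4, 7, 1]
data = pack(values, bit_width=3)
bitmap = fused_unpack_filter(data, 3, len(values), lambda v: v >= 6)
```

In this example rows 2, 5, and 8 pass the predicate, so only those three bits are set. The real kernels do the same work on many values per instruction, which is where the reported speedups would come from.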

That’s where most of the TPC-H wins come from. The codec also ships standalone on crates.io as ematix-parquet-codec / ematix-parquet-io if you want it without the whole pipeline framework.

7. Data-quality and load probes.

ematix-probe is a sibling framework for declarative data-quality assertions and load testing. The ManagedTable you declared for the pipeline becomes a probe contract: declare the schema once, get DDL and data-quality checks.

Data probes. @probe.data declares a target (Postgres, DuckDB, Parquet — local or S3) plus assertions: not_null, unique, between, regex, is_in, row_count, freshness, percentile_between, cardinality_between, schema_match. The adapter chooses pushdown SQL vs. Arrow scan internally.
Auto-derived from ematix-flow tables. probe_from_table(CustomerDim, source=...) reads your ManagedTable and derives not_null on every non-nullable column + unique on every primary key. extend= lets you layer extras (regex, ranges, freshness windows) via the same fluent API.
Loosely coupled. ematix-probe has zero hard dependency on ematix-flow — duck-typed protocol means any class with __tablename__ and an iterable columns attribute participates.
pytest plugin. Auto-loaded via pytest11 entry point — no pytest_plugins wiring required. Each assertion becomes one test node, so failures pinpoint the exact rule that fired.
Load probes. HTTP and Postgres SQL targets under constant-rate (open-model) or virtual-user (closed-model) schedulers. Shared verdict and run-history surface with data probes. (Python decorators land in v0.2; today the load engine is driven from the Rust API or pseudocode-ish Python.)
Run history. Opt-in SQLite log keeps verdict trends queryable across runs (ematix-probe runs ...). Designed as the substrate for v0.2 drift detection.
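
The auto-derivation and duck-typing items above can be sketched in a few lines. This is not the ematix-probe implementation: derive_assertions and check are hypothetical names, and the Column attributes (name, nullable, primary_key) are assumptions about what a column object might expose. The only rules taken from the text are not_null on non-nullable columns and unique on primary keys.

```python
from dataclasses import dataclass

@dataclass
class Column:                 # hypothetical column descriptor
    name: str
    nullable: bool = True
    primary_key: bool = False

class CustomerDim:            # any class with __tablename__ and iterable columns qualifies
    __tablename__ = "customer_dim"
    columns = [
        Column("customer_id", nullable=False, primary_key=True),
        Column("email", nullable=False),
        Column("nickname"),
    ]

def derive_assertions(table):
    """Derive (rule, column) pairs from a duck-typed table definition."""
    if not hasattr(table, "__tablename__") or not hasattr(table, "columns"):
        raise TypeError("table must expose __tablename__ and columns")
    rules = []
    for col in table.columns:
        if not col.nullable:
            rules.append(("not_null", col.name))
        if col.primary_key:
            rules.append(("unique", col.name))
    return rules

def check(rows, rules):
    """Evaluate rules against a list of dict rows; return the rules that failed."""
    failures = []
    for rule, col in rules:
        vals = [row.get(col) for row in rows]
        if rule == "not_null" and any(v is None for v in vals):
            failures.append((rule, col))
        elif rule == "unique" and len(set(vals)) != len(vals):
            failures.append((rule, col))
    return failures

rules = derive_assertions(CustomerDim)
failures = check(
    [{"customer_id": 1, "email": "a@example.com"},
     {"customer_id": 1, "email": None}],
    rules,
)
```

Reporting one failure per (rule, column) pair mirrors the "one test node per assertion" behaviour described for the pytest plugin.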

Ships on PyPI as ematix-probe. Rust + tokio core.

8. Operationally honest.

Restart-safe. Watermarks, run-history store, and offset commit ordering mean restarts don’t lose or duplicate data.
At-least-once. End-to-end, by default. Exactly-once available for Kafka → Kafka.
Credentials redacted. Typed connections strip secrets in repr().
Observable. Structured run history, Prometheus + OpenTelemetry metrics, Slack alerts.
DLQ. App-level routing + broker-level (RabbitMQ DLX, Pub/Sub dead-letter policy).
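
As an illustration of the credential-redaction item above, here is a minimal sketch of a typed connection whose repr() omits the secret. The PostgresConnection class and its fields are hypothetical; the technique shown is the standard dataclasses field(repr=False) option, which may or may not be how ematix-flow implements it.

```python
from dataclasses import dataclass, field

@dataclass
class PostgresConnection:     # hypothetical typed connection
    host: str
    user: str
    password: str = field(repr=False)   # excluded from the generated __repr__
    port: int = 5432

conn_cfg = PostgresConnection(host="db.internal", user="etl", password="hunter2")
shown = repr(conn_cfg)        # what would end up in logs or tracebacks
```

The secret stays available as conn_cfg.password for the driver, but it does not appear when the object is logged or printed.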

Status

ematix-flow is currently PRE-ALPHA. Beta release coming soon.

Today on PyPI as ematix-flow. All four surfaces — declarative pipelines, multi-backend, streaming, stream processing — are functional end-to-end and benchmark-validated, but the public surface (decorator names, config keys, CLI flags) may shift between now and the beta tag. If you’re trying it out, pin the exact version in your requirements:

pip install "ematix-flow==0.3.0"

Bug reports, feedback, and design pushback during the pre-alpha window are exactly what we want — open issues on GitHub.