Why ematix-flow
Eight lines on the back of the box — why ematix-flow exists.
1. Fast.
TPC-H SF=1, 22 queries, single Apple M3 Pro:
- 1.75× faster than DuckDB
- 2.77× faster than Polars
- 13.4× faster than single-node PySpark
(All geomeans. 18 / 22 wins outright.) Full table + reproducer in Benchmarks.
2. Auto-tunes per query — no knobs to set.
With Spark you tune shuffle.partitions, autoBroadcastJoinThreshold,
adaptive.enabled, executor memory, and add /*+ BROADCAST(...) */
hints per query to land on a good plan. With ematix-flow you just write
the SQL.
A physical-optimizer shape catalog pattern-matches the plan and substitutes purpose-built fused operators automatically — filter + aggregate, filter + group-by, dictionary-keyed group-counts, and the other common analytical shapes hit a fused executor instead of the generic Arrow stream. Each catalog entry is a small declarative shape with a rewrite, so adding a new optimization is one entry — no plumbing in the SQL frontend or session API.
That’s why the TPC-H table on Benchmarks doesn’t need a per-query tuning column. The catalog is on by default — you write the SQL, the engine handles the rest.
For the scan-side adaptive decisions — per-chunk bitmap-vs-values selectivity dispatch, page-streaming vs. eager decode, per-bit-width SIMD kernels — see §6.
3. Scheduling + DAG, no service to operate.
Pipelines carry their own cron schedule and depends_on= edges (with
cycle detection and exponential-backoff retries). Run flow run-due from
cron, systemd, a Kubernetes CronJob, GitHub Actions, or the bundled
long-running scheduler — same code, same topological order, same
retry semantics.
Already on Airflow / Dagster / Prefect? Call .sync() directly.
Want a visual operator view? flow web ships a local SPA with the
live task DAG, run history, and one-click Restart from failed
step / Resume from watermark / Pause / Resume on
in-flight runs — see Web UI.
4. Batteries included.
Out-of-the-box backends:
- Databases. Postgres, MySQL, SQLite, DuckDB.
- Cloud warehouses. Snowflake, BigQuery, Redshift.
- Lakehouse. Delta Lake (local + S3).
- Object stores. Parquet, CSV, ORC, JSONL — local and S3.
- Streaming. Kafka, RabbitMQ, GCP Pub/Sub, AWS Kinesis.
- Schema Registry. Confluent SR and Apicurio for Avro / Protobuf.
- CDC. Source mode dispatches per-op transactionally to your existing target backend.
5. Scales out — auto-detected distributed mode, no cluster service.
Most fast single-node engines (DuckDB, Polars) stop at one machine. ematix-flow doesn’t.
engine = "auto" is the default. Drop in a peers = [...] list and the
same SQL fans out across a peer-to-peer mesh of flow-worker processes
via Apache Arrow Flight; with no peers it stays in-process. mTLS for
the mesh, cross-pod lookup broadcast for small dimension tables,
no separate cluster service to run.
- Library, not a service.
ematix-flow-distributedis a Rust crate any ematix-flow process can link. Any process that links it can play coordinator or worker — symmetric mesh, no master node, no JVM, no driver/executor split. - Peer auto-detection.
peers = [...]accepts three schemes:http://host:port(static),dns://host:port(resolves the A-record at startup and expands to every IP behind the name — good for headless services), andk8s://service.namespace:port(sugar for*.svc.cluster.local). Defaultengine = "auto"picks distributed when peers expand to ≥1 URL, otherwise falls back to in-process. No restart for new pods — just re-resolve. - Same
@ematix.pipelinedecorator. Going single-node → distributed is a configuration change, not a code change. Pipelines, modes, watermarks, run history — all identical. - Spark / DuckDB SQL portable in place. A dialect translator targets DataFusion’s parser under the hood, so existing Spark-flavored or DuckDB-flavored SQL ports across without a rewrite.
Distributed benchmark numbers at SF≥100 are roadmap; the numbers on
/reference/benchmarks are all single-node. The
distributed code path itself is shipped, tested, and has a bench
harness (tpch_distributed) — we just haven’t run cluster-scale runs
to publish yet.
6. Hand-tuned Parquet scan path.
Most analytical engines lean on parquet-rs. ematix-flow ships with
ematix-parquet — a
hand-rolled Rust Parquet codec built for analytical workloads:
- SIMD bit-unpackers on NEON + AVX2. Every bit width 1–32 has a hand-tuned raw-indices kernel; bw=1–21 (the practical range) also has fused unpack + dict-gather kernels. Output ceiling hits 76–96 GB/s on every specialised width.
- Predicate-fused decode.
unpack + filter + bitmap-packin one SIMD pass — 3.7–6.3× faster than materialize-then-filter at low selectivity. Rows that fail the predicate never materialise. - Adaptive dispatch. Per-chunk selectivity probe decides whether to emit a bitmap (wins at low selectivity) or a values vector (wins at high).
- Decode-into-caller-buffer. Late-materialization (
*_masked_into) and Arrow-style(bytes, offsets)shape skip the alloc + zero-fill that dominates scan profiles. - Full spec coverage. Every physical type, every encoding (PLAIN, dict, DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, BYTE_STREAM_SPLIT), every mainstream codec (Snappy, Zstd, Gzip, Brotli, LZ4_RAW), V1 + V2 pages, page indexes, bloom filters, Parquet Modular Encryption.
- Light footprint. Sync read/write stack has zero third-party deps beyond the chosen compression codec. Async, encryption, and parallel decode are opt-in features.
That’s where most of the TPC-H wins come from. The codec also ships
standalone on crates.io as ematix-parquet-codec /
ematix-parquet-io if you want it without the whole pipeline framework.
7. Quality + load tests share the surface.
ematix-probe is a
sibling framework for declarative data-quality assertions and load
testing. The ManagedTable you declared for the pipeline becomes
a probe contract — declare the schema once, get DDL and data-quality
checks.
- Data probes.
@probe.datadeclares a target (Postgres, DuckDB, Parquet — local or S3) plus assertions:not_null,unique,between,regex,is_in,row_count,freshness,percentile_between,cardinality_between,schema_match. The adapter chooses pushdown SQL vs. Arrow scan internally. - Auto-derived from ematix-flow tables.
probe_from_table(CustomerDim, source=...)reads yourManagedTableand derivesnot_nullon every non-nullable column +uniqueon every primary key.extend=lets you layer extras (regex, ranges, freshness windows) via the same fluent API. - Loosely coupled. ematix-probe has zero hard dependency on
ematix-flow — duck-typed protocol means any class with
__tablename__and an iterablecolumnsattribute participates. - pytest plugin. Auto-loaded via
pytest11entry point — nopytest_pluginswiring required. Each assertion becomes one test node, so failures pinpoint the exact rule that fired. - Load probes. HTTP and Postgres SQL targets under constant-rate
(open-model) or virtual-user (closed-model) schedulers. Shared verdict
- run-history surface with data probes. (Python decorators land in v0.2; today the load engine is driven from the Rust API or pseudocode- ish Python.)
- Run history. Opt-in SQLite log keeps verdict trends queryable
across runs (
ematix-probe runs ...). Designed as the substrate for v0.2 drift detection.
Ships on PyPI as ematix-probe. Rust + tokio core.
8. Operationally honest.
- Restart-safe. Watermarks, run-history store, and offset commit ordering mean restarts don’t lose or duplicate data.
- At-least-once. End-to-end, by default. Exactly-once available for Kafka → Kafka.
- Credentials redacted. Typed connections strip secrets in
repr(). - Observable. Structured run history, Prometheus metrics + OpenTelemetry trace spans, alerters for Slack, generic webhook, email, and PagerDuty.
- DLQ. App-level routing + broker-level (RabbitMQ DLX, Pub/Sub dead- letter policy).
Status
ematix-flow is currently ALPHA. The v0.5.0 release adds the
operational surface on top of v0.4.0’s backend matrix:
@ematix.warehouse_pipeline, AWS Glue Schema Registry end-to-end,
cron timezones, four new CLI subcommands
(flow doctor / init / logs / secrets test), bearer-token Web
UI auth + cross-pipeline DAG view, email + PagerDuty alerters,
OpenTelemetry trace spans + a starter Grafana dashboard, streaming
live throughput in the Web UI, and Arrow-native warehouse write
adapters — see
/reference/whats-shipped.
Today on PyPI as ematix-flow. All four surfaces — declarative
pipelines, multi-backend, streaming, stream processing — are
functional end-to-end and benchmark-validated. APIs are
stabilizing; minor surfaces may still shift before beta. If you’re
trying it out, pin the exact version in your requirements:
pip install "ematix-flow==0.5.0"
Bug reports, feedback, and design pushback during the alpha window are exactly what we want — open issues on GitHub.