ematix-flow is a Rust core wrapped in a Python surface. Three concentric layers:
┌───────────────────────────────────────────────────────────┐
│ PYTHON SURFACE │
│ @ematix.pipeline / @ematix.streaming_pipeline │
│ @ematix.connection / @ematix.table │
│ @ematix_flow.udf / .udaf │
│ `flow` CLI │
├───────────────────────────────────────────────────────────┤
│ RUST CORE — ematix-flow-core │
│ • DataFusion execution plan │
│ • Custom physical optimizer rules │
│ • Arrow record-batch streaming │
│ • Backend trait (Postgres, MySQL, Kafka, …) │
│ • Run-history store + watermarks │
├───────────────────────────────────────────────────────────┤
│ ARROW DATA PLANE │
│ Every byte crossing a backend boundary is Arrow. │
│ No row-by-row serialization. No intermediate files. │
└───────────────────────────────────────────────────────────┘
Why Rust + Arrow?
- Zero-copy across backends. Postgres → Delta Lake doesn’t pay serialization cost twice. The result set of a Postgres `COPY OUT BINARY` is decoded directly into Arrow record batches that the Delta writer consumes without re-encoding.
- Tight scan + group-by. TPC-H wins are real because the scan path uses bit-unpack SIMD (NEON / AVX2), late-materialization for selective filters, and dictionary-aware group-by. See Benchmarks.
- One binary. `flow` is a single ~25 MB native binary. No JVM warmup, no Python interpreter overhead in the hot path.
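As an illustration of what "dictionary-aware group-by" means, here is a minimal pure-Python sketch. It is not the ematix-flow implementation (which is Rust); the function names `dictionary_encode` and `group_sum_on_codes` are hypothetical and chosen only for this example. The idea it shows: once a column is dictionary-encoded, the aggregation can index accumulators by small integer codes instead of hashing every string value.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): the distinct values in first-seen order,
    plus one small integer code per row pointing into that dictionary."""
    dictionary = []
    index = {}
    codes = []
    for value in values:
        code = index.get(value)
        if code is None:
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return dictionary, codes


def group_sum_on_codes(dictionary, codes, amounts):
    """Sum amounts per group using the codes as direct array indices.
    The group keys are only looked up once, when building the result."""
    sums = [0] * len(dictionary)
    for code, amount in zip(codes, amounts):
        sums[code] += amount
    return dict(zip(dictionary, sums))


# Hypothetical sample data for the sketch.
regions = ["eu", "us", "eu", "apac", "us", "eu"]
amounts = [10, 20, 5, 7, 1, 3]

dictionary, codes = dictionary_encode(regions)
totals = group_sum_on_codes(dictionary, codes, amounts)
print(totals)
```

In a columnar format the dictionary and codes usually already exist on disk, so the encoding step above would be skipped and only the code-indexed aggregation would run.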
Why Python on top?
- That’s where pipelines live in the wild. SQLAlchemy, dbt, Airflow, Dagster, Prefect — all Python. ematix-flow drops in next to them.
- The decorator surface is the only surface. There’s no second configuration file format to learn, no DSL — pipelines are Python functions.
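To make "pipelines are Python functions" concrete, the sketch below shows how a decorator-based surface can register a plain function as a pipeline without any separate config file. This is not ematix-flow's actual API: `pipeline`, `PipelineSpec`, `REGISTRY`, and the `source` / `target` parameters are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class PipelineSpec:
    """Everything a runner would need to know about one pipeline (hypothetical)."""
    name: str
    source: str
    target: str
    fn: Callable[[], str]


# Hypothetical in-process registry; a real engine would hand this to its core.
REGISTRY: Dict[str, PipelineSpec] = {}


def pipeline(*, source: str, target: str):
    """Hypothetical decorator: records the function and its metadata, returns it unchanged."""
    def decorate(fn: Callable[[], str]) -> Callable[[], str]:
        REGISTRY[fn.__name__] = PipelineSpec(fn.__name__, source, target, fn)
        return fn
    return decorate


@pipeline(source="pg.orders", target="lake.orders")
def daily_orders() -> str:
    # The function body is the pipeline definition; here it returns a SQL string.
    return "SELECT * FROM orders WHERE status = 'paid'"


spec = REGISTRY["daily_orders"]
print(spec.source, "->", spec.target)
```

The design point is that the decorator call carries the configuration, so the Python module is the single source of truth for both logic and metadata.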
Sibling projects
ematix-parquet — the Parquet codec
ematix-parquet is the
hand-rolled Rust Parquet codec that powers the fast scan path. Hand-tuned
SIMD on NEON + AVX2, predicate-fused decode, adaptive dispatch on
selectivity, full read + write coverage of the Parquet spec, and a
dependency-light footprint. Ships independently on crates.io as
ematix-parquet-codec / ematix-parquet-io — use it without
ematix-flow if you just want the codec.
See Advantages — hand-tuned Parquet scan path for the perf details that show up in the TPC-H benchmark.
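The combination of late materialization and adaptive dispatch on selectivity can be sketched in a few lines of Python. This is an illustration of the general technique, not the ematix-parquet code (which is SIMD Rust); the `scan` function, the strategy labels, and the `threshold` value of 0.1 are assumptions made for the example.

```python
def scan(filter_col, payload_col, predicate, threshold=0.1):
    """Return (strategy, payload values for rows whose filter value passes).

    The predicate is evaluated on the filter column first; the payload column
    is only touched afterwards, and how it is touched depends on selectivity.
    """
    # Step 1: evaluate the predicate on the filter column only.
    selected = [i for i, value in enumerate(filter_col) if predicate(value)]
    selectivity = len(selected) / len(filter_col) if filter_col else 0.0

    if selectivity <= threshold:
        # Few rows survive: gather just those payload rows (late materialization).
        return "gather", [payload_col[i] for i in selected]

    # Many rows survive: one sequential pass over the payload with a mask.
    mask = [False] * len(filter_col)
    for i in selected:
        mask[i] = True
    return "mask", [value for value, keep in zip(payload_col, mask) if keep]


# Hypothetical sample columns for the sketch.
ids = list(range(100))
names = [f"row-{i}" for i in range(100)]

rare_strategy, rare_rows = scan(ids, names, lambda v: v >= 95)
common_strategy, common_rows = scan(ids, names, lambda v: v % 2 == 0)
print(rare_strategy, len(rare_rows))
print(common_strategy, len(common_rows))
```

Both branches return the same rows for a given predicate; the dispatch only changes how much of the payload column has to be decoded to get there.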
ematix-probe — data quality + load testing
ematix-probe is a separate (but related) framework for data-quality and load testing. You declare a target (a Postgres table, a Parquet file, a SQL query, an HTTP endpoint) and the assertions it must satisfy in Python; the framework runs the checks and returns a structured verdict.
It pairs naturally with ematix-flow — a @probe can fire as a
pre_load_transform or post_load_transform step in a pipeline, gating
the load on data shape. There’s also a pytest plugin so the same probes
run from CI.
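The gating behavior described above can be sketched as follows. This is a self-contained illustration and does not use the ematix-probe API: `Verdict`, `run_checks`, and `gated_load` are hypothetical names, and the checks are plain Python callables standing in for declared assertions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]
Check = Tuple[str, Callable[[List[Row]], bool]]


@dataclass
class Verdict:
    """Structured result of a probe run (hypothetical shape)."""
    passed: bool
    failures: List[str] = field(default_factory=list)


def run_checks(rows: List[Row], checks: List[Check]) -> Verdict:
    """Run every named check and collect the names of those that fail."""
    failures = [name for name, check in checks if not check(rows)]
    return Verdict(passed=not failures, failures=failures)


def gated_load(rows: List[Row], checks: List[Check],
               load: Callable[[List[Row]], None]) -> Verdict:
    """Call load(rows) only if all checks pass; always return the verdict."""
    verdict = run_checks(rows, checks)
    if verdict.passed:
        load(rows)
    return verdict


# Hypothetical batch with one bad row, and two assertions on its shape.
rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -3}]
checks = [
    ("non_empty", lambda r: len(r) > 0),
    ("amount_non_negative", lambda r: all(x["amount"] >= 0 for x in r)),
]

loaded: List[Row] = []
verdict = gated_load(rows, checks, loaded.extend)
print(verdict)
```

Because the negative amount fails one check, the load callback is never invoked and the verdict names the failing assertion.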
ematix-probe is CLI-driven; no web UI ships with either project today.
Run history is queryable via flow runs ... (ematix-flow) and
ematix-probe runs ... (ematix-probe).