Architecture

What's actually inside — Rust core, Arrow plane, Python surface.

ematix-flow is a Rust core wrapped in a Python surface, with Arrow as the data plane. It is built from three layers:

┌───────────────────────────────────────────────────────────┐
│  PYTHON SURFACE                                           │
│   @ematix.job (alias: .pipeline)  · ematix.workflow(…)    │
│   @ematix.streaming_pipeline                              │
│   @ematix.connection / @ematix.table                      │
│   @ematix_flow.udf / .udaf                                │
│   `flow` CLI                                              │
├───────────────────────────────────────────────────────────┤
│  RUST CORE — ematix-flow-core                             │
│   • DataFusion execution plan                             │
│   • Custom physical optimizer rules                       │
│   • Arrow record-batch streaming                          │
│   • Backend trait (Postgres, MySQL, Kafka, …)             │
│   • Run-history store + watermarks                        │
├───────────────────────────────────────────────────────────┤
│  ARROW DATA PLANE                                         │
│   Every byte crossing a backend boundary is Arrow.        │
│   No row-by-row serialization. No intermediate files.     │
└───────────────────────────────────────────────────────────┘

Why Rust + Arrow?

  • Zero-copy across backends. A Postgres → Delta Lake transfer doesn’t pay the serialization cost twice: the result set of a Postgres binary COPY (COPY ... TO STDOUT in binary format) is decoded directly into Arrow record batches, which the Delta writer consumes without re-encoding.
  • Tight scan + group-by. The TPC-H results come from a scan path that uses SIMD bit-unpacking (NEON / AVX2), late materialization for selective filters, and dictionary-aware group-by. See Benchmarks.
  • One binary. flow is a single ~25 MB native binary. No JVM warmup, no Python interpreter overhead in the hot path.
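
To make the dictionary-aware group-by point concrete, here is a minimal pure-Python sketch. It is not ematix-flow's implementation (which lives in the Rust core); it only illustrates the general technique, assuming the key column is already dictionary-encoded as small integer codes plus a list of distinct values. The function name group_sum_dict and the sample data are hypothetical.

```python
from typing import Dict, List


def group_sum_dict(codes: List[int], dictionary: List[str],
                   values: List[float]) -> Dict[str, float]:
    """Sum `values` grouped by a dictionary-encoded key column.

    Hypothetical illustration: aggregate on integer codes and decode
    each group key only once, at the end.
    """
    # One accumulator slot per dictionary entry; no per-row string hashing.
    sums = [0.0] * len(dictionary)
    seen = [False] * len(dictionary)
    for code, value in zip(codes, values):
        sums[code] += value
        seen[code] = True
    # Decode each key exactly once, skipping entries that never occurred.
    return {dictionary[i]: sums[i] for i in range(len(dictionary)) if seen[i]}


# Hypothetical sample column: three distinct keys, six rows.
dictionary = ["AIR", "RAIL", "SHIP"]
codes = [0, 2, 0, 1, 2, 0]
values = [10.0, 5.0, 2.5, 7.0, 1.0, 0.5]
result = group_sum_dict(codes, dictionary, values)
```

The accumulator is indexed directly by code, so the per-row work is an integer index and an add. That is the property a dictionary-aware group-by exploits, whatever the underlying implementation.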

Why Python on top?

  • That’s where pipelines live in the wild. SQLAlchemy, dbt, Airflow, Dagster, Prefect — all Python. ematix-flow drops in next to them.
  • The decorator surface is the only surface. There’s no second configuration file format to learn, no DSL — pipelines are Python functions.
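
As a rough illustration of what "pipelines are Python functions" can look like, the sketch below shows the generic decorator-registration pattern in plain Python. It does not use or reproduce the ematix-flow API: the names job, REGISTRY, run and daily_orders are all hypothetical stand-ins, and the real decorators may take different arguments.

```python
from typing import Callable, Dict, List

# Hypothetical registry; ematix-flow's real internals are not shown here.
REGISTRY: Dict[str, Callable[[], List[dict]]] = {}


def job(name: str):
    """Hypothetical stand-in for a decorator such as @ematix.job."""
    def register(fn: Callable[[], List[dict]]):
        REGISTRY[name] = fn
        # Return the function unchanged so it stays directly callable.
        return fn
    return register


@job("daily_orders")
def daily_orders() -> List[dict]:
    # A pipeline body is ordinary Python; no separate config file or DSL.
    return [{"order_id": 1, "total": 42.0}]


def run(name: str) -> List[dict]:
    """Look up a registered pipeline by name and execute it."""
    return REGISTRY[name]()
```

Because the decorator returns the original function, the pipeline can still be imported and called directly in a unit test, while a runner discovers it by name through the registry.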

Sibling projects

ematix-parquet — the Parquet codec

ematix-parquet is the hand-rolled Rust Parquet codec that powers the fast scan path. It combines hand-tuned SIMD on NEON + AVX2, predicate-fused decode, adaptive dispatch on selectivity, and full read + write coverage of the Parquet spec, with a dependency-light footprint. It ships independently on crates.io as ematix-parquet-codec / ematix-parquet-io, so you can use it without ematix-flow if you just want the codec.

See Advantages — hand-tuned Parquet scan path for the perf details that show up in the TPC-H benchmark.
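
The idea behind adaptive dispatch on selectivity can be sketched in a few lines of Python. This is an illustration of the general technique under stated assumptions, not the ematix-parquet code: the function filtered_scan, the strategy labels, and the 0.1 threshold are hypothetical, and plain lists stand in for encoded column chunks.

```python
from typing import Callable, List, Tuple


def filtered_scan(keys: List[int], payload: List[str],
                  predicate: Callable[[int], bool],
                  threshold: float = 0.1) -> Tuple[str, List[str]]:
    """Return (strategy, payload values whose key passes the predicate)."""
    # Step 1: evaluate the predicate on the filter column only.
    selected = [i for i, k in enumerate(keys) if predicate(k)]
    selectivity = len(selected) / len(keys) if keys else 0.0
    if selectivity <= threshold:
        # Few survivors: touch only the matching payload positions
        # (late materialization).
        return "gather", [payload[i] for i in selected]
    # Many survivors: walk the whole payload column and mask it.
    mask = [False] * len(keys)
    for i in selected:
        mask[i] = True
    return "mask", [v for v, keep in zip(payload, mask) if keep]


# Hypothetical data: 100 rows, one highly selective and one broad filter.
keys = list(range(100))
payload = [f"row{i}" for i in keys]
narrow = filtered_scan(keys, payload, lambda k: k == 7)
broad = filtered_scan(keys, payload, lambda k: k % 2 == 0)
```

Both branches return the same rows; only the access pattern differs. A real codec would pick between decode kernels rather than list comprehensions, but the decision has the same shape: measure how many rows survive the filter, then choose the cheaper path.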

ematix-probe — data quality + load testing

ematix-probe is a separate (but related) framework for data-quality and load testing. You declare a target (a Postgres table, a Parquet file, a SQL query, an HTTP endpoint) and the assertions it must satisfy in Python; the framework runs the checks and returns a structured verdict.

It pairs naturally with ematix-flow — a @probe can fire as a pre_load_transform or post_load_transform step in a pipeline, gating the load on data shape. There’s also a pytest plugin so the same probes run from CI.
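
The gating pattern described above can be sketched without either library. The code below is a plain-Python illustration, not the ematix-probe API: Verdict, run_checks, gated_load and the two checks are hypothetical names, and an in-memory list stands in for the load target.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


@dataclass
class Verdict:
    """Hypothetical structured result: overall status plus failed check names."""
    passed: bool
    failures: List[str] = field(default_factory=list)


def run_checks(rows: List[Row], checks: Dict[str, Check]) -> Verdict:
    # Run every check so the verdict lists all failures, not just the first.
    failures = [name for name, check in checks.items() if not check(rows)]
    return Verdict(passed=not failures, failures=failures)


def gated_load(rows: List[Row], checks: Dict[str, Check],
               sink: List[Row]) -> Verdict:
    """Write `rows` to `sink` only if every check passes."""
    verdict = run_checks(rows, checks)
    if verdict.passed:
        sink.extend(rows)
    return verdict


checks: Dict[str, Check] = {
    "non_empty": lambda rows: len(rows) > 0,
    "no_null_ids": lambda rows: all(r.get("id") is not None for r in rows),
}
good_sink: List[Row] = []
bad_sink: List[Row] = []
good = gated_load([{"id": 1}, {"id": 2}], checks, good_sink)
bad = gated_load([{"id": 1}, {"id": None}], checks, bad_sink)
```

Because the verdict names the failed checks instead of raising on the first failure, the same checks can back both a pipeline gate and a CI assertion.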

ematix-probe is CLI-driven; no web UI ships with either project today. Run history is queryable via flow runs ... (ematix-flow) and ematix-probe runs ... (ematix-probe).