EMATIX(R) DATA TERMINAL — ROBCO INDUSTRIES UNIFIED OPERATING SYSTEM
COPYRIGHT 2026 EMATIX SYSTEMS — ALL RIGHTS RESERVED
USER: GUEST   SESSION: 2026-05-20 21:38:22Z   HOST: ematix.dev/specs
// TECHNICAL SPECS

Advantages

Eight lines on the back of the box — why ematix-flow exists.


1. Fast.

TPC-H SF=1, 22 queries, single Apple M3 Pro:

(All figures are geometric means; ematix-flow wins 18 of 22 queries outright.) The full table and reproducer are in Benchmarks.

2. Auto-tunes per query — no knobs to set.

With Spark you tune shuffle.partitions, autoBroadcastJoinThreshold, adaptive.enabled, executor memory, and add /*+ BROADCAST(...) */ hints per query to land on a good plan. With ematix-flow you just write the SQL — the engine detects the shape of the plan and swaps in the right operator itself.

What that means concretely:

That’s why the TPC-H table on Benchmarks doesn’t need a per-query tuning column. The one-time setup is registering the rule set in the session config; from there, queries auto-tune.
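To make "detects the shape of the plan and swaps in the right operator" concrete, here is a minimal sketch of one rewrite rule of the kind a rule set might contain. Everything in it is an assumption for illustration: the `Scan` and `Join` types, the `prefer_broadcast` rule, the `BROADCAST_ROW_LIMIT` cutoff, and the strategy names are hypothetical and are not ematix-flow's actual internals or API. The row counts are approximate TPC-H SF=1 table sizes.

```python
from dataclasses import dataclass


# Hypothetical plan nodes for illustration; not ematix-flow's internal types.
@dataclass(frozen=True)
class Scan:
    table: str
    est_rows: int  # estimated row count from table statistics


@dataclass(frozen=True)
class Join:
    left: Scan
    right: Scan
    strategy: str = "shuffle_hash"  # default operator before any rule fires


# Assumed cutoff: at or below this many rows, a side is cheap to copy everywhere.
BROADCAST_ROW_LIMIT = 100_000


def prefer_broadcast(join: Join) -> Join:
    """Rewrite rule: if either side is small, switch to a broadcast join."""
    smaller = min(join.left.est_rows, join.right.est_rows)
    if smaller <= BROADCAST_ROW_LIMIT:
        return Join(join.left, join.right, strategy="broadcast_hash")
    return join


RULES = [prefer_broadcast]  # stands in for a rule set registered once


def optimize(join: Join) -> Join:
    """Apply every registered rule in order and return the rewritten plan."""
    for rule in RULES:
        join = rule(join)
    return join


# Approximate TPC-H SF=1 sizes: a fact-to-dimension and a fact-to-fact join.
fact_to_dim = Join(Scan("lineitem", 6_000_000), Scan("nation", 25))
fact_to_fact = Join(Scan("lineitem", 6_000_000), Scan("orders", 1_500_000))

print(optimize(fact_to_dim).strategy)   # broadcast_hash
print(optimize(fact_to_fact).strategy)  # shuffle_hash
```

The point of the sketch is the division of labor: the decision that Spark exposes as a threshold setting or a per-query hint lives in a rule that inspects the plan, so the SQL itself carries no tuning.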

3. Scheduling + DAG, no service to operate.

Pipelines carry their own cron schedule and depends_on= edges (with cycle detection and exponential-backoff retries). Run flow run-due from cron, systemd, a Kubernetes CronJob, GitHub Actions, or the bundled long-running scheduler — same code, same topological order, same retry semantics.
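The three mechanics named above (topological order, cycle detection, exponential backoff) can be sketched with the Python standard library alone. This is not the ematix-flow scheduler or its API: the pipeline names, `run_order`, and `backoff_delays` are hypothetical, and the base delay and factor are assumed values chosen for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return pipelines in an order where every dependency runs first.

    Raises graphlib.CycleError if the depends_on edges form a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


def backoff_delays(retries: int, base_seconds: float = 1.0, factor: float = 2.0) -> list[float]:
    """Delay before each retry: base, base*factor, base*factor**2, ..."""
    return [base_seconds * factor ** attempt for attempt in range(retries)]


# Hypothetical pipeline names; each key maps to the pipelines it depends on.
edges = {
    "load_orders": set(),
    "load_customers": set(),
    "daily_revenue": {"load_orders", "load_customers"},
    "revenue_report": {"daily_revenue"},
}

order = run_order(edges)
print(order)              # both loads first, revenue_report last
print(backoff_delays(4))  # [1.0, 2.0, 4.0, 8.0]

# A dependency loop is rejected before anything runs.
try:
    run_order({"a": {"b"}, "b": {"a"}})
except CycleError as exc:
    print("cycle detected:", exc.args[1])
```

Because the order is derived from the edges alone, any runner that evaluates the same graph (cron, a CronJob, or a long-running loop) arrives at an equivalent ordering.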

Already on Airflow / Dagster / Prefect? Call .sync() directly.

4. Batteries included.

Out-of-the-box backends:

5. Scales out — opt-in distributed mode, no cluster service.

Most fast single-node engines (DuckDB, Polars) stop at one machine. ematix-flow doesn’t.

Set engine = "distributed" plus a list of peers in the session config, and the same SQL fans out across a peer-to-peer mesh of flow-worker processes over Apache Arrow Flight. The mesh is secured with mTLS, small dimension tables use a cross-pod lookup broadcast, and there is no separate cluster service to run.

Distributed benchmark numbers at SF≥100 are on the roadmap; the numbers on /specs/02-benchmarks are all single-node. The distributed code path itself is shipped, tested, and has a bench harness (tpch_distributed); we just haven’t published cluster-scale runs yet.

6. Hand-tuned Parquet scan path.

Most analytical engines lean on parquet-rs. ematix-flow ships with ematix-parquet — a hand-rolled Rust Parquet codec built for analytical workloads:

That’s where most of the TPC-H wins come from. The codec also ships standalone on crates.io as ematix-parquet-codec / ematix-parquet-io if you want it without the whole pipeline framework.

7. Quality + load tests share the surface.

ematix-probe is a sibling framework for declarative data-quality assertions and load testing. The ManagedTable you declared for the pipeline becomes a probe contract — declare the schema once, get DDL and data-quality checks.

Ships on PyPI as ematix-probe. Rust + tokio core.
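The "declare the schema once" idea can be illustrated with a small sketch in which one declaration yields both a DDL statement and a data-quality check. The `Column` and `TableSpec` classes below are hypothetical stand-ins written for this example; the real ManagedTable and ematix-probe interfaces are not shown here and may look quite different.

```python
from dataclasses import dataclass


# Hypothetical stand-ins; the real ManagedTable and ematix-probe APIs may differ.
@dataclass(frozen=True)
class Column:
    name: str
    sql_type: str
    nullable: bool = True


@dataclass(frozen=True)
class TableSpec:
    name: str
    columns: tuple[Column, ...]

    def ddl(self) -> str:
        """Render a CREATE TABLE statement from the declared columns."""
        cols = ", ".join(
            f"{c.name} {c.sql_type}" + ("" if c.nullable else " NOT NULL")
            for c in self.columns
        )
        return f"CREATE TABLE {self.name} ({cols})"

    def null_violations(self, rows: list[dict]) -> list[tuple[int, str]]:
        """Return (row_index, column) for each NULL in a non-nullable column."""
        required = [c.name for c in self.columns if not c.nullable]
        return [
            (i, col)
            for i, row in enumerate(rows)
            for col in required
            if row.get(col) is None
        ]


orders = TableSpec("orders", (
    Column("order_id", "BIGINT", nullable=False),
    Column("customer_id", "BIGINT", nullable=False),
    Column("note", "TEXT"),
))

print(orders.ddl())
# CREATE TABLE orders (order_id BIGINT NOT NULL, customer_id BIGINT NOT NULL, note TEXT)

sample = [
    {"order_id": 1, "customer_id": 7, "note": None},
    {"order_id": 2, "customer_id": None, "note": "rush"},
]
print(orders.null_violations(sample))  # [(1, 'customer_id')]
```

Both outputs are derived from the same `nullable` flags, so the table definition and the check cannot drift apart.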

8. Operationally honest.

Status

ematix-flow is currently PRE-ALPHA. Beta release coming soon.

Available today on PyPI as ematix-flow. All four surfaces (declarative pipelines, multi-backend, streaming, stream processing) are functional end-to-end and benchmark-validated, but the public surface (decorator names, config keys, CLI flags) may shift between now and the beta tag. If you’re trying it out, pin the exact version in your requirements:

pip install "ematix-flow==0.3.0"

Bug reports, feedback, and design pushback during the pre-alpha window are exactly what we want — open issues on GitHub.

