What's shipped — ematix.dev

Backends

Backend	Batch source	Streaming source	Target	DDL planning	Strategy executors (append / merge / scd2 / truncate)	CDC target
Postgres	✅	—	✅	✅	✅ (native + COPY BINARY)	✅
MySQL	✅	—	✅	✅	✅ (`ON DUPLICATE KEY`)	✅
SQLite	✅	—	✅	✅	✅	✅
DuckDB	✅	—	✅	✅	✅	✅
Delta Lake (local + S3)	✅	✅	✅	n/a	✅ (DataFusion-backed `MERGE`)	✅
Object stores (Parquet / CSV / ORC / JSONL, local + S3)	✅	✅	✅	n/a	append + truncate	—
Kafka	—	✅	✅	n/a	append (cross-backend)	source role only
RabbitMQ	—	✅	✅	n/a	append (cross-backend)	—
GCP Pub/Sub	—	✅	✅	n/a	append (cross-backend)	—
AWS Kinesis	—	✅	✅	n/a	append (cross-backend)	—

Batch source = readable by @ematix.pipeline (the function returns a SQL string; the framework executes it against the source connection). Streaming source = tailable by flow consume / @ematix.streaming_pipeline (long-running consumer with manual offset commit / ack). Target = writable by either pipeline shape. Cross-backend moves stream Apache Arrow batches end-to-end — same-DB pairs take the INSERT … SELECT fast path automatically.

Pipeline surface

@ematix.pipeline — batch / scheduled.
@ematix.streaming_pipeline — long-running consumer.
@ematix.connection — typed connection with ${VAR} interpolation.
@ematix.table / ManagedTable — declarative schema + DDL.
@ematix_flow.udf / @ematix_flow.udaf — Python user-defined scalar + aggregate functions.

Modes

append — default.
truncate — full refresh.
merge (a.k.a. scd1) — upsert on merge keys.
scd2 — slowly-changing-dimension Type 2.

Execution

Single-node (default). The whole engine runs in-process; the flow binary is a single ~25 MB native executable.
Distributed (opt-in). Set engine = "distributed" + peers = [...] and SQL fans out across a peer-to-peer mesh of flow-worker processes via Apache Arrow Flight. mTLS-secured mesh, cross-pod lookup broadcast, no separate cluster service. Symmetric — any process linking ematix-flow-distributed can play coordinator or worker. Spark / DuckDB dialect translator means existing SQL ports across without rewrites.

Operational

Cron + DAG + cycle detection + retries.
Watermarks + restart-safe state.
Run-history store (queryable via flow runs ...).
Prometheus + OpenTelemetry metrics.
Slack alerting.
DLQ at app + broker level.
Schema Registry (Confluent + Apicurio, Avro + Protobuf).
Exactly-once Kafka → Kafka (transactions + consumer coordination).

Stream processing

SQL transforms over Arrow batches (in-process, no JVM).
Tumbling / hopping / session windows.
Scalar + aggregate Python UDFs.

What’s not in v0.3.0

Published distributed benchmarks. The distributed code path is shipped (see Execution above) and has a bench harness (tpch_distributed), but cluster-scale TPC-H runs at SF≥100 aren’t published yet — every number on /specs/02-benchmarks is single-machine.
Snowflake / BigQuery / Redshift backends.
Pluggable secrets stores (Vault, AWS Secrets Manager, GCP Secret Manager). Today ${VAR} interpolates from env only.
Web UI. Run history is queryable via the flow runs ... CLI subcommand only — there’s no graphical UI today.

Roadmap items are tracked openly in the GitHub repo issues.