What's shipped

v0.7.0 surface matrix — what's stable, what's still in motion.

Backends

Backend	Batch source	Streaming source	Target	DDL planning	Strategy executors (append / merge / scd2 / truncate)	CDC target
Postgres	✅	—	✅	✅	✅ (native + COPY BINARY)	✅
MySQL	✅	—	✅	✅	✅ (`ON DUPLICATE KEY`)	✅
SQLite	✅	—	✅	✅	✅	✅
DuckDB	✅	—	✅	✅	✅	✅
Snowflake	✅	—	✅	n/a	append (Arrow → Parquet → `PUT` + `COPY INTO`) + truncate + merge (staged `MERGE`)	—
BigQuery	✅	—	✅	n/a	append (Arrow → Parquet → GCS + `load_table_from_uri`) + truncate + merge (`MERGE INTO` from staging)	—
Redshift	✅	—	✅	n/a	append (S3 → `COPY`) + truncate + merge (`MERGE INTO` from staging)	—
Delta Lake (local + S3)	✅	✅	✅	n/a	✅ (DataFusion-backed `MERGE`)	✅
Object stores (Parquet / CSV / ORC / JSONL, local + S3)	✅	✅	✅	n/a	append + truncate	—
Kafka	—	✅	✅	n/a	append (cross-backend)	source role only
RabbitMQ	—	✅	✅	n/a	append (cross-backend)	—
GCP Pub/Sub	—	✅	✅	n/a	append (cross-backend)	—
AWS Kinesis	—	✅	✅	n/a	append (cross-backend)	—

Batch source = readable by @ematix.pipeline (the function returns a SQL string; the framework executes it against the source connection). Streaming source = tailable by flow consume / @ematix.streaming_pipeline (long-running consumer with manual offset commit / ack). Target = writable by either pipeline shape. Cross-backend moves stream Apache Arrow batches end-to-end — same-DB pairs take the INSERT … SELECT fast path automatically.

Workflows + Jobs surface

@ematix.job — batch / scheduled. Atomic unit of work; one function, one target, one schedule. (Historical name @ematix.pipeline remains as an alias.)
ematix.workflow(name, jobs, depends_on) — names a group of jobs + declares the DAG between them. Workflows are the user-facing organizing concept on the Web UI; the DAG lives here, not on individual jobs.
@ematix.warehouse_pipeline — scheduled warehouse-to-warehouse read → DuckDB transform → bulk write, registered with the same cron / DAG / retry / flow run-due machinery.
@ematix.streaming_pipeline — long-running consumer.
@ematix.connection — typed connection with ${VAR} interpolation.
@ematix.table / ManagedTable — declarative schema + DDL.
@ematix_flow.udf / @ematix_flow.udaf — Python user-defined scalar + aggregate functions.

Modes

append — default.
truncate — full refresh.
merge (a.k.a. scd1) — upsert on merge keys.
scd2 — slowly-changing-dimension Type 2.

Execution

Single-node (default when no peers). The whole engine runs in-process; the flow binary is a single ~25 MB native executable.
Distributed (auto-detected). engine = "auto" is the default — set peers = [...] and SQL fans out across a peer-to-peer mesh of flow-worker processes via Apache Arrow Flight; otherwise falls back to in-process. mTLS-secured mesh, cross-pod lookup broadcast, no separate cluster service. Symmetric — any process linking ematix-flow-distributed can play coordinator or worker. Spark / DuckDB dialect translator means existing SQL ports across without rewrites.

Operational

Cron + DAG + cycle detection + retries. DAG edges live on the workflow declaration (or, legacy, on per-job depends_on=).
Cron schedule timezones — timezone="America/New_York" on the job; Web UI renders “Next: …” in the job’s tz.
Watermarks + restart-safe state.
Run-history store (queryable via flow runs ...).
Prometheus metrics + OpenTelemetry trace spans for every pipeline run (stdout, OTLP gRPC, or OTLP HTTP collectors).
Alerters: Slack, generic webhook, stdout, email (email://user:pass@host:port?from=&to=) and PagerDuty (pagerduty://routing_key?service=&severity= — auto-resolves on recovery).
Web UI bearer-token auth (flow web --token …) for off-host access; cross-pipeline DAG view of every depends_on edge; streaming pipeline live throughput + batch-cycle stats.
Grafana starter dashboard JSON (examples/grafana/ematix-flow-dashboard.json).
DLQ at app + broker level.
Schema Registry: Confluent + Apicurio (Avro + Protobuf) and AWS Glue Schema Registry (Avro, end-to-end Kafka dispatch + LocalStack integration tests).
Exactly-once Kafka → Kafka (transactions + consumer coordination).

Stream processing

SQL transforms over Arrow batches (in-process, no JVM).
Tumbling / hopping / session windows.
Scalar + aggregate Python UDFs.

Recently closed (v0.7.0)

v0.7.0 reorganises the workflow trigger model. The previous v0.6.0 shape — workflow with a centralised depends_on={dict} — is replaced by a richer trigger surface on the workflow + per-job DAG declaration on each member job. Hard break, no backwards compat — alpha project, nothing shipped externally depended on the v0.6.0 shape.

Workflow trigger kwargs (all AND-conjoined since last successful self-run):
- triggered_by=[name, ...] — workflow / job names that must have succeeded.
- schedule="cron" + optional timezone="IANA" — cron tick must reach.
- on_message=<source> — per-message firing (exclusive with the above).
- At least one required, unless the workflow contains a streaming pipeline (implicit streaming).
Composite triggers — declaring more than one kwarg ANDs the conditions together. Example: triggered_by=["wf_A", "wf_B"] + schedule="0 21 * * *" fires when both workflows have completed since this workflow last succeeded AND the 21:00 tick has reached. If conditions are satisfied at different times, the workflow fires immediately on the last satisfied condition rather than waiting for the next cron tick.
Within-workflow DAG declared on each job via @ematix.job(name=..., depends_on=[upstream_job, ...]). The workflow itself no longer takes a depends_on= dict — passing one raises ValueError with a pointer to the new model.
Validation at registration time for trigger conflicts (e.g. on_message + schedule together), workflow-level cycles, missing triggers on non-streaming workflows.
Web UI Workflows card renders a trigger summary line under the header: “Trigger: After: a, b · Schedule: 0 21 * * * America/New_York”. Streaming workflows keep the LIVE STREAMING pill in place of a trigger line.
/api/workflows carries triggered_by, schedule, timezone, on_message per workflow; edges derived by walking each member job’s depends_on.

Recently closed (v0.6.1)

Streaming workflows visible on the Workflows tab. /api/workflows now includes streaming pipelines as kind: "streaming" workflow-of-one cards. They previously only appeared on the Jobs and Runs tabs.
Web UI renders streaming workflows with the pulsing amber ▶ LIVE STREAMING pill and a live throughput / batch-cycle footer driven by the same streaming_stats snapshot the Jobs page uses.

Recently closed (v0.6.0)

v0.6.0 reorganised the user-facing model around Workflows and Jobs — same scheduler and runtime as v0.5.0, but the Web UI and decorator names now reflect the natural hierarchy.

ematix.workflow(name=..., jobs=[...], depends_on={...}) — new declaration grouping jobs and declaring the DAG between them. (Superseded in v0.7.0 by the trigger-kwargs surface above.)
@ematix.job decorator as the primary name; @ematix.pipeline remains a non-breaking alias.
/api/workflows endpoint — returns declared workflows + their member jobs and DAG edges, with synthetic single-job workflows for any job not in a declared workflow.
Web UI restructured into four tabs: Workflows (default, with inline SVG flowcharts), Jobs (flat list with filter + sort), Runs (renamed from Jobs, sortable columns), and DAG (cubic-Bézier arrows, #/dag/<job> to focus a subgraph).
flow web --module <name> — pre-imports a pipelines module so the UI can render schedules, next-run times, and the DAG without a separate scheduler tick having populated the rich-history first.
Loopback bind no longer requires a bearer token by default — set --token explicitly when binding to a non-loopback address.

Earlier (v0.5.0 highlights)

v0.5.0 was the operational-surface release — CLIs, alerters, Web UI, and observability on top of the v0.4.0 backend matrix.

@ematix.warehouse_pipeline decorator + Rust executor (invoke_warehouse_pipeline) via the PyO3 callback bridge — no subprocess fork per run.
AWS Glue Schema Registry — end-to-end Kafka dispatch (typed glue_schema_registry connection, Rust dispatch on both consumer and producer paths, zlib codec, LocalStack integration suite).
Four new CLI subcommands: flow doctor, flow init, flow logs, flow secrets test.
Web UI bearer-token auth (flow web --token …) + cross-pipeline DAG view at #/dag.
Email + PagerDuty alerters — email://… (stdlib smtplib) and pagerduty://… (Events API v2, auto-resolves on recovery).
OpenTelemetry trace spans for every pipeline run (stdout, OTLP gRPC, OTLP HTTP) + 6-panel Grafana starter dashboard.
Streaming-pipeline live stats in the Web UI — rolling 1 m / 5 m throughput + batch-cycle windows in place of the old “Median duration: —” footer.
Cron schedule timezones — timezone="America/New_York" on the job; is_due() evaluates in that tz, Web UI renders “Next: …” in the same.
Arrow-native warehouse adapters — Snowflake PUT + COPY INTO, Redshift S3 + COPY, BigQuery GCS + load_table_from_uri — pandas no longer on the warehouse write path.

Full changelog: see CHANGELOG.md in the repo for every prior release.

What’s still on the roadmap

Published distributed benchmarks. The distributed code path is shipped (see Execution above) and has a bench harness (tpch_distributed), but cluster-scale TPC-H runs at SF≥100 aren’t published yet — every number on /reference/benchmarks is single-machine.

Other roadmap items are tracked openly in the GitHub repo issues.