What's shipped
v0.7.0 surface matrix — what's stable, what's still in motion.
Backends
| Backend | Batch source | Streaming source | Target | DDL planning | Strategy executors (append / merge / scd2 / truncate) | CDC target |
|---|---|---|---|---|---|---|
| Postgres | ✅ | — | ✅ | ✅ | ✅ (native + COPY BINARY) | ✅ |
| MySQL | ✅ | — | ✅ | ✅ | ✅ (ON DUPLICATE KEY) | ✅ |
| SQLite | ✅ | — | ✅ | ✅ | ✅ | ✅ |
| DuckDB | ✅ | — | ✅ | ✅ | ✅ | ✅ |
| Snowflake | ✅ | — | ✅ | n/a | append (Arrow → Parquet → PUT + COPY INTO) + truncate + merge (staged MERGE) | — |
| BigQuery | ✅ | — | ✅ | n/a | append (Arrow → Parquet → GCS + load_table_from_uri) + truncate + merge (MERGE INTO from staging) | — |
| Redshift | ✅ | — | ✅ | n/a | append (S3 → COPY) + truncate + merge (MERGE INTO from staging) | — |
| Delta Lake (local + S3) | ✅ | ✅ | ✅ | n/a | ✅ (DataFusion-backed MERGE) | ✅ |
| Object stores (Parquet / CSV / ORC / JSONL, local + S3) | ✅ | ✅ | ✅ | n/a | append + truncate | — |
| Kafka | — | ✅ | ✅ | n/a | append (cross-backend) | source role only |
| RabbitMQ | — | ✅ | ✅ | n/a | append (cross-backend) | — |
| GCP Pub/Sub | — | ✅ | ✅ | n/a | append (cross-backend) | — |
| AWS Kinesis | — | ✅ | ✅ | n/a | append (cross-backend) | — |
Batch source = readable by @ematix.pipeline (the function returns a
SQL string; the framework executes it against the source connection).
Streaming source = tailable by flow consume /
@ematix.streaming_pipeline (long-running consumer with manual offset
commit / ack).
Target = writable by either pipeline shape. Cross-backend moves
stream Apache Arrow batches end-to-end — same-DB pairs take the
INSERT … SELECT fast path automatically.
Workflows + Jobs surface
@ematix.job— batch / scheduled. Atomic unit of work; one function, one target, one schedule. (Historical name@ematix.pipelineremains as an alias.)ematix.workflow(name, jobs, depends_on)— names a group of jobs + declares the DAG between them. Workflows are the user-facing organizing concept on the Web UI; the DAG lives here, not on individual jobs.@ematix.warehouse_pipeline— scheduled warehouse-to-warehouse read → DuckDB transform → bulk write, registered with the same cron / DAG / retry /flow run-duemachinery.@ematix.streaming_pipeline— long-running consumer.@ematix.connection— typed connection with${VAR}interpolation.@ematix.table/ManagedTable— declarative schema + DDL.@ematix_flow.udf/@ematix_flow.udaf— Python user-defined scalar + aggregate functions.
Modes
append— default.truncate— full refresh.merge(a.k.a.scd1) — upsert on merge keys.scd2— slowly-changing-dimension Type 2.
Execution
- Single-node (default when no peers). The whole engine runs
in-process; the
flowbinary is a single ~25 MB native executable. - Distributed (auto-detected).
engine = "auto"is the default — setpeers = [...]and SQL fans out across a peer-to-peer mesh offlow-workerprocesses via Apache Arrow Flight; otherwise falls back to in-process. mTLS-secured mesh, cross-pod lookup broadcast, no separate cluster service. Symmetric — any process linkingematix-flow-distributedcan play coordinator or worker. Spark / DuckDB dialect translator means existing SQL ports across without rewrites.
Operational
- Cron + DAG + cycle detection + retries. DAG edges live on the
workflow declaration (or, legacy, on per-job
depends_on=). - Cron schedule timezones —
timezone="America/New_York"on the job; Web UI renders “Next: …” in the job’s tz. - Watermarks + restart-safe state.
- Run-history store (queryable via
flow runs ...). - Prometheus metrics + OpenTelemetry trace spans for every pipeline run (stdout, OTLP gRPC, or OTLP HTTP collectors).
- Alerters: Slack, generic webhook, stdout, email
(
email://user:pass@host:port?from=&to=) and PagerDuty (pagerduty://routing_key?service=&severity=— auto-resolves on recovery). - Web UI bearer-token auth (
flow web --token …) for off-host access; cross-pipeline DAG view of everydepends_onedge; streaming pipeline live throughput + batch-cycle stats. - Grafana starter dashboard JSON
(
examples/grafana/ematix-flow-dashboard.json). - DLQ at app + broker level.
- Schema Registry: Confluent + Apicurio (Avro + Protobuf) and AWS Glue Schema Registry (Avro, end-to-end Kafka dispatch + LocalStack integration tests).
- Exactly-once Kafka → Kafka (transactions + consumer coordination).
Stream processing
- SQL transforms over Arrow batches (in-process, no JVM).
- Tumbling / hopping / session windows.
- Scalar + aggregate Python UDFs.
Recently closed (v0.7.0)
v0.7.0 reorganises the workflow trigger model. The previous v0.6.0
shape — workflow with a centralised depends_on={dict} — is replaced
by a richer trigger surface on the workflow + per-job DAG declaration
on each member job. Hard break, no backwards compat — alpha
project, nothing shipped externally depended on the v0.6.0 shape.
- Workflow trigger kwargs (all AND-conjoined since last successful
self-run):
triggered_by=[name, ...]— workflow / job names that must have succeeded.schedule="cron"+ optionaltimezone="IANA"— cron tick must reach.on_message=<source>— per-message firing (exclusive with the above).- At least one required, unless the workflow contains a streaming pipeline (implicit streaming).
- Composite triggers — declaring more than one kwarg ANDs the
conditions together. Example:
triggered_by=["wf_A", "wf_B"]+schedule="0 21 * * *"fires when both workflows have completed since this workflow last succeeded AND the 21:00 tick has reached. If conditions are satisfied at different times, the workflow fires immediately on the last satisfied condition rather than waiting for the next cron tick. - Within-workflow DAG declared on each job via
@ematix.job(name=..., depends_on=[upstream_job, ...]). The workflow itself no longer takes adepends_on=dict — passing one raisesValueErrorwith a pointer to the new model. - Validation at registration time for trigger conflicts (e.g.
on_message+scheduletogether), workflow-level cycles, missing triggers on non-streaming workflows. - Web UI Workflows card renders a trigger summary line under the header: “Trigger: After: a, b · Schedule: 0 21 * * * America/New_York”. Streaming workflows keep the LIVE STREAMING pill in place of a trigger line.
/api/workflowscarriestriggered_by,schedule,timezone,on_messageper workflow; edges derived by walking each member job’sdepends_on.
Recently closed (v0.6.1)
- Streaming workflows visible on the Workflows tab.
/api/workflowsnow includes streaming pipelines askind: "streaming"workflow-of-one cards. They previously only appeared on the Jobs and Runs tabs. - Web UI renders streaming workflows with the pulsing amber
▶ LIVE STREAMINGpill and a live throughput / batch-cycle footer driven by the samestreaming_statssnapshot the Jobs page uses.
Recently closed (v0.6.0)
v0.6.0 reorganised the user-facing model around Workflows and Jobs — same scheduler and runtime as v0.5.0, but the Web UI and decorator names now reflect the natural hierarchy.
ematix.workflow(name=..., jobs=[...], depends_on={...})— new declaration grouping jobs and declaring the DAG between them. (Superseded in v0.7.0 by the trigger-kwargs surface above.)@ematix.jobdecorator as the primary name;@ematix.pipelineremains a non-breaking alias./api/workflowsendpoint — returns declared workflows + their member jobs and DAG edges, with synthetic single-job workflows for any job not in a declared workflow.- Web UI restructured into four tabs: Workflows (default, with
inline SVG flowcharts), Jobs (flat list with filter + sort), Runs
(renamed from Jobs, sortable columns), and DAG (cubic-Bézier arrows,
#/dag/<job>to focus a subgraph). flow web --module <name>— pre-imports a pipelines module so the UI can render schedules, next-run times, and the DAG without a separate scheduler tick having populated the rich-history first.- Loopback bind no longer requires a bearer token by default —
set
--tokenexplicitly when binding to a non-loopback address.
Earlier (v0.5.0 highlights)
v0.5.0 was the operational-surface release — CLIs, alerters, Web UI, and observability on top of the v0.4.0 backend matrix.
@ematix.warehouse_pipelinedecorator + Rust executor (invoke_warehouse_pipeline) via the PyO3 callback bridge — no subprocess fork per run.- AWS Glue Schema Registry — end-to-end Kafka dispatch (typed
glue_schema_registryconnection, Rust dispatch on both consumer and producer paths, zlib codec, LocalStack integration suite). - Four new CLI subcommands:
flow doctor,flow init,flow logs,flow secrets test. - Web UI bearer-token auth (
flow web --token …) + cross-pipeline DAG view at#/dag. - Email + PagerDuty alerters —
email://…(stdlib smtplib) andpagerduty://…(Events API v2, auto-resolves on recovery). - OpenTelemetry trace spans for every pipeline run (stdout, OTLP gRPC, OTLP HTTP) + 6-panel Grafana starter dashboard.
- Streaming-pipeline live stats in the Web UI — rolling 1 m / 5 m throughput + batch-cycle windows in place of the old “Median duration: —” footer.
- Cron schedule timezones —
timezone="America/New_York"on the job;is_due()evaluates in that tz, Web UI renders “Next: …” in the same. - Arrow-native warehouse adapters — Snowflake
PUT+COPY INTO, Redshift S3 +COPY, BigQuery GCS +load_table_from_uri— pandas no longer on the warehouse write path.
Full changelog: see
CHANGELOG.md
in the repo for every prior release.
What’s still on the roadmap
- Published distributed benchmarks. The distributed code path
is shipped (see Execution above) and has a bench
harness (
tpch_distributed), but cluster-scale TPC-H runs at SF≥100 aren’t published yet — every number on /reference/benchmarks is single-machine.
Other roadmap items are tracked openly in the GitHub repo issues.