What's shipped

v0.7.0 surface matrix — what's stable, what's still in motion.

Backends

BackendBatch sourceStreaming sourceTargetDDL planningStrategy executors (append / merge / scd2 / truncate)CDC target
Postgres✅ (native + COPY BINARY)
MySQL✅ (ON DUPLICATE KEY)
SQLite
DuckDB
Snowflaken/aappend (Arrow → Parquet → PUT + COPY INTO) + truncate + merge (staged MERGE)
BigQueryn/aappend (Arrow → Parquet → GCS + load_table_from_uri) + truncate + merge (MERGE INTO from staging)
Redshiftn/aappend (S3 → COPY) + truncate + merge (MERGE INTO from staging)
Delta Lake (local + S3)n/a✅ (DataFusion-backed MERGE)
Object stores (Parquet / CSV / ORC / JSONL, local + S3)n/aappend + truncate
Kafkan/aappend (cross-backend)source role only
RabbitMQn/aappend (cross-backend)
GCP Pub/Subn/aappend (cross-backend)
AWS Kinesisn/aappend (cross-backend)

Batch source = readable by @ematix.pipeline (the function returns a SQL string; the framework executes it against the source connection). Streaming source = tailable by flow consume / @ematix.streaming_pipeline (long-running consumer with manual offset commit / ack). Target = writable by either pipeline shape. Cross-backend moves stream Apache Arrow batches end-to-end — same-DB pairs take the INSERT … SELECT fast path automatically.

Workflows + Jobs surface

  • @ematix.job — batch / scheduled. Atomic unit of work; one function, one target, one schedule. (Historical name @ematix.pipeline remains as an alias.)
  • ematix.workflow(name, jobs, depends_on) — names a group of jobs + declares the DAG between them. Workflows are the user-facing organizing concept on the Web UI; the DAG lives here, not on individual jobs.
  • @ematix.warehouse_pipeline — scheduled warehouse-to-warehouse read → DuckDB transform → bulk write, registered with the same cron / DAG / retry / flow run-due machinery.
  • @ematix.streaming_pipeline — long-running consumer.
  • @ematix.connection — typed connection with ${VAR} interpolation.
  • @ematix.table / ManagedTable — declarative schema + DDL.
  • @ematix_flow.udf / @ematix_flow.udaf — Python user-defined scalar + aggregate functions.

Modes

  • append — default.
  • truncate — full refresh.
  • merge (a.k.a. scd1) — upsert on merge keys.
  • scd2 — slowly-changing-dimension Type 2.

Execution

  • Single-node (default when no peers). The whole engine runs in-process; the flow binary is a single ~25 MB native executable.
  • Distributed (auto-detected). engine = "auto" is the default — set peers = [...] and SQL fans out across a peer-to-peer mesh of flow-worker processes via Apache Arrow Flight; otherwise falls back to in-process. mTLS-secured mesh, cross-pod lookup broadcast, no separate cluster service. Symmetric — any process linking ematix-flow-distributed can play coordinator or worker. Spark / DuckDB dialect translator means existing SQL ports across without rewrites.

Operational

  • Cron + DAG + cycle detection + retries. DAG edges live on the workflow declaration (or, legacy, on per-job depends_on=).
  • Cron schedule timezones — timezone="America/New_York" on the job; Web UI renders “Next: …” in the job’s tz.
  • Watermarks + restart-safe state.
  • Run-history store (queryable via flow runs ...).
  • Prometheus metrics + OpenTelemetry trace spans for every pipeline run (stdout, OTLP gRPC, or OTLP HTTP collectors).
  • Alerters: Slack, generic webhook, stdout, email (email://user:pass@host:port?from=&to=) and PagerDuty (pagerduty://routing_key?service=&severity= — auto-resolves on recovery).
  • Web UI bearer-token auth (flow web --token …) for off-host access; cross-pipeline DAG view of every depends_on edge; streaming pipeline live throughput + batch-cycle stats.
  • Grafana starter dashboard JSON (examples/grafana/ematix-flow-dashboard.json).
  • DLQ at app + broker level.
  • Schema Registry: Confluent + Apicurio (Avro + Protobuf) and AWS Glue Schema Registry (Avro, end-to-end Kafka dispatch + LocalStack integration tests).
  • Exactly-once Kafka → Kafka (transactions + consumer coordination).

Stream processing

  • SQL transforms over Arrow batches (in-process, no JVM).
  • Tumbling / hopping / session windows.
  • Scalar + aggregate Python UDFs.

Recently closed (v0.7.0)

v0.7.0 reorganises the workflow trigger model. The previous v0.6.0 shape — workflow with a centralised depends_on={dict} — is replaced by a richer trigger surface on the workflow + per-job DAG declaration on each member job. Hard break, no backwards compat — alpha project, nothing shipped externally depended on the v0.6.0 shape.

  • Workflow trigger kwargs (all AND-conjoined since last successful self-run):
    • triggered_by=[name, ...] — workflow / job names that must have succeeded.
    • schedule="cron" + optional timezone="IANA" — cron tick must reach.
    • on_message=<source> — per-message firing (exclusive with the above).
    • At least one required, unless the workflow contains a streaming pipeline (implicit streaming).
  • Composite triggers — declaring more than one kwarg ANDs the conditions together. Example: triggered_by=["wf_A", "wf_B"] + schedule="0 21 * * *" fires when both workflows have completed since this workflow last succeeded AND the 21:00 tick has reached. If conditions are satisfied at different times, the workflow fires immediately on the last satisfied condition rather than waiting for the next cron tick.
  • Within-workflow DAG declared on each job via @ematix.job(name=..., depends_on=[upstream_job, ...]). The workflow itself no longer takes a depends_on= dict — passing one raises ValueError with a pointer to the new model.
  • Validation at registration time for trigger conflicts (e.g. on_message + schedule together), workflow-level cycles, missing triggers on non-streaming workflows.
  • Web UI Workflows card renders a trigger summary line under the header: “Trigger: After: a, b · Schedule: 0 21 * * * America/New_York”. Streaming workflows keep the LIVE STREAMING pill in place of a trigger line.
  • /api/workflows carries triggered_by, schedule, timezone, on_message per workflow; edges derived by walking each member job’s depends_on.

Recently closed (v0.6.1)

  • Streaming workflows visible on the Workflows tab. /api/workflows now includes streaming pipelines as kind: "streaming" workflow-of-one cards. They previously only appeared on the Jobs and Runs tabs.
  • Web UI renders streaming workflows with the pulsing amber ▶ LIVE STREAMING pill and a live throughput / batch-cycle footer driven by the same streaming_stats snapshot the Jobs page uses.

Recently closed (v0.6.0)

v0.6.0 reorganised the user-facing model around Workflows and Jobs — same scheduler and runtime as v0.5.0, but the Web UI and decorator names now reflect the natural hierarchy.

  • ematix.workflow(name=..., jobs=[...], depends_on={...}) — new declaration grouping jobs and declaring the DAG between them. (Superseded in v0.7.0 by the trigger-kwargs surface above.)
  • @ematix.job decorator as the primary name; @ematix.pipeline remains a non-breaking alias.
  • /api/workflows endpoint — returns declared workflows + their member jobs and DAG edges, with synthetic single-job workflows for any job not in a declared workflow.
  • Web UI restructured into four tabs: Workflows (default, with inline SVG flowcharts), Jobs (flat list with filter + sort), Runs (renamed from Jobs, sortable columns), and DAG (cubic-Bézier arrows, #/dag/<job> to focus a subgraph).
  • flow web --module <name> — pre-imports a pipelines module so the UI can render schedules, next-run times, and the DAG without a separate scheduler tick having populated the rich-history first.
  • Loopback bind no longer requires a bearer token by default — set --token explicitly when binding to a non-loopback address.

Earlier (v0.5.0 highlights)

v0.5.0 was the operational-surface release — CLIs, alerters, Web UI, and observability on top of the v0.4.0 backend matrix.

  • @ematix.warehouse_pipeline decorator + Rust executor (invoke_warehouse_pipeline) via the PyO3 callback bridge — no subprocess fork per run.
  • AWS Glue Schema Registry — end-to-end Kafka dispatch (typed glue_schema_registry connection, Rust dispatch on both consumer and producer paths, zlib codec, LocalStack integration suite).
  • Four new CLI subcommands: flow doctor, flow init, flow logs, flow secrets test.
  • Web UI bearer-token auth (flow web --token …) + cross-pipeline DAG view at #/dag.
  • Email + PagerDuty alertersemail://… (stdlib smtplib) and pagerduty://… (Events API v2, auto-resolves on recovery).
  • OpenTelemetry trace spans for every pipeline run (stdout, OTLP gRPC, OTLP HTTP) + 6-panel Grafana starter dashboard.
  • Streaming-pipeline live stats in the Web UI — rolling 1 m / 5 m throughput + batch-cycle windows in place of the old “Median duration: —” footer.
  • Cron schedule timezonestimezone="America/New_York" on the job; is_due() evaluates in that tz, Web UI renders “Next: …” in the same.
  • Arrow-native warehouse adapters — Snowflake PUT + COPY INTO, Redshift S3 + COPY, BigQuery GCS + load_table_from_uri — pandas no longer on the warehouse write path.

Full changelog: see CHANGELOG.md in the repo for every prior release.

What’s still on the roadmap

  • Published distributed benchmarks. The distributed code path is shipped (see Execution above) and has a bench harness (tpch_distributed), but cluster-scale TPC-H runs at SF≥100 aren’t published yet — every number on /reference/benchmarks is single-machine.

Other roadmap items are tracked openly in the GitHub repo issues.