Backends
| Backend | Batch source | Streaming source | Target | DDL planning | Strategy executors (append / merge / scd2 / truncate) | CDC target |
|---|---|---|---|---|---|---|
| Postgres | ✅ | — | ✅ | ✅ | ✅ (native + COPY BINARY) | ✅ |
| MySQL | ✅ | — | ✅ | ✅ | ✅ (ON DUPLICATE KEY) | ✅ |
| SQLite | ✅ | — | ✅ | ✅ | ✅ | ✅ |
| DuckDB | ✅ | — | ✅ | ✅ | ✅ | ✅ |
| Delta Lake (local + S3) | ✅ | ✅ | ✅ | n/a | ✅ (DataFusion-backed MERGE) | ✅ |
| Object stores (Parquet / CSV / ORC / JSONL, local + S3) | ✅ | ✅ | ✅ | n/a | append + truncate | — |
| Kafka | — | ✅ | ✅ | n/a | append (cross-backend) | source role only |
| RabbitMQ | — | ✅ | ✅ | n/a | append (cross-backend) | — |
| GCP Pub/Sub | — | ✅ | ✅ | n/a | append (cross-backend) | — |
| AWS Kinesis | — | ✅ | ✅ | n/a | append (cross-backend) | — |

- Batch source = readable by `@ematix.pipeline` (the function returns a SQL string; the framework executes it against the source connection).
- Streaming source = tailable by `flow consume` / `@ematix.streaming_pipeline` (long-running consumer with manual offset commit / ack).
- Target = writable by either pipeline shape. Cross-backend moves stream Apache Arrow batches end-to-end; same-DB pairs take the `INSERT … SELECT` fast path automatically.

Pipeline surface
- `@ematix.pipeline`: batch / scheduled.
- `@ematix.streaming_pipeline`: long-running consumer.
- `@ematix.connection`: typed connection with `${VAR}` interpolation.
- `@ematix.table` / `ManagedTable`: declarative schema + DDL.
- `@ematix_flow.udf` / `@ematix_flow.udaf`: Python user-defined scalar + aggregate functions.
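
To make the `${VAR}` interpolation mentioned above concrete, here is a minimal sketch of env-based substitution in plain Python. It is not the library's implementation: the `interpolate` helper, the fail-on-missing behaviour, and the example URL and variable names are all assumptions made for illustration.

```python
import os
import re

# Matches ${NAME} where NAME looks like a shell-style variable name.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(template, env=None):
    """Replace each ${NAME} with env[NAME]; raise KeyError if NAME is unset.

    Hypothetical helper: failing loudly on a missing variable is an
    assumption, not documented behaviour of the framework.
    """
    env = os.environ if env is None else env

    def _sub(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name!r} is not set")
        return env[name]

    return _VAR.sub(_sub, template)


# Illustrative values only; a real run would read from os.environ.
url = interpolate(
    "postgres://${PG_USER}:${PG_PASSWORD}@db.internal:5432/app",
    env={"PG_USER": "etl", "PG_PASSWORD": "s3cret"},
)
print(url)  # postgres://etl:s3cret@db.internal:5432/app
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.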
Modes
- `append`: default.
- `truncate`: full refresh.
- `merge` (a.k.a. `scd1`): upsert on merge keys.
- `scd2`: slowly-changing-dimension Type 2.
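
The difference between `merge` and `scd2` is easiest to see on a few rows. The sketch below models both on in-memory lists of dicts. It illustrates the general technique only, not how the strategy executors are written: the function names and the `valid_from` / `valid_to` column names are assumptions.

```python
from datetime import date


def merge_scd1(target, incoming, key):
    """Upsert: overwrite the row with the same key, otherwise insert it."""
    by_key = {row[key]: dict(row) for row in target}
    for row in incoming:
        by_key[row[key]] = dict(row)
    return list(by_key.values())


def merge_scd2(target, incoming, key, today):
    """Type 2: close the current version of a changed row, then add a new one.

    Assumes (hypothetically) that history rows carry valid_from / valid_to
    and that the current version has valid_to = None.
    """
    result = [dict(row) for row in target]  # copy so the input is untouched
    current = {row[key]: row for row in result if row["valid_to"] is None}
    for row in incoming:
        old = current.get(row[key])
        if old is not None:
            if all(old.get(col) == val for col, val in row.items()):
                continue  # nothing changed, keep the open version
            old["valid_to"] = today  # close the superseded version
        result.append({**row, "valid_from": today, "valid_to": None})
    return result


incoming = [{"id": 1, "city": "Bergen"}, {"id": 2, "city": "Paris"}]

# merge / scd1 keeps one row per key and loses the old value.
latest = merge_scd1([{"id": 1, "city": "Oslo"}], incoming, "id")

# scd2 keeps the old row, closed, next to the new open one.
history = merge_scd2(
    [{"id": 1, "city": "Oslo", "valid_from": date(2024, 1, 1), "valid_to": None}],
    incoming,
    "id",
    today=date(2024, 6, 1),
)
```

Here `latest` holds two rows (Bergen, Paris), while `history` holds three: the closed Oslo row plus open rows for Bergen and Paris.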
Execution
- Single-node (default). The whole engine runs in-process; the `flow` binary is a single ~25 MB native executable.
- Distributed (opt-in). Set `engine = "distributed"` + `peers = [...]` and SQL fans out across a peer-to-peer mesh of `flow-worker` processes via Apache Arrow Flight. mTLS-secured mesh, cross-pod lookup broadcast, no separate cluster service. Symmetric: any process linking `ematix-flow-distributed` can play coordinator or worker. Spark / DuckDB dialect translator means existing SQL ports across without rewrites.
Operational
- Cron + DAG + cycle detection + retries.
- Watermarks + restart-safe state.
- Run-history store (queryable via `flow runs ...`).
- Prometheus + OpenTelemetry metrics.
- Slack alerting.
- DLQ at app + broker level.
- Schema Registry (Confluent + Apicurio, Avro + Protobuf).
- Exactly-once Kafka → Kafka (transactions + consumer coordination).
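
As an illustration of the "DAG + cycle detection" item above, the following sketch orders jobs with Kahn's algorithm and reports a cycle when some jobs can never become ready. It is a generic technique, not the scheduler's actual code; `topo_order` and the job names are hypothetical.

```python
from collections import deque


def topo_order(deps):
    """deps maps job -> set of jobs it depends on.

    Returns a valid run order, or raises ValueError if the graph has a cycle.
    """
    jobs = set(deps) | {d for ds in deps.values() for d in ds}
    remaining = {job: set(deps.get(job, ())) for job in jobs}

    # Reverse edges: for each job, which jobs are waiting on it.
    dependents = {job: [] for job in jobs}
    for job, ds in remaining.items():
        for d in ds:
            dependents[d].append(job)

    # Start with jobs that have no dependencies (sorted for a stable order).
    ready = deque(sorted(j for j, ds in remaining.items() if not ds))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in sorted(dependents[job]):
            remaining[nxt].discard(job)
            if not remaining[nxt]:
                ready.append(nxt)

    # Any job never scheduled sits on, or downstream of, a cycle.
    if len(order) != len(jobs):
        stuck = sorted(jobs - set(order))
        raise ValueError(f"dependency cycle among: {stuck}")
    return order


order = topo_order({"extract": set(), "load": {"extract"}, "report": {"load"}})
print(order)  # ['extract', 'load', 'report']
```

Python's standard library offers the same check through `graphlib.TopologicalSorter`, which raises `graphlib.CycleError`; the hand-written version is shown only to make the mechanism visible.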
Stream processing
- SQL transforms over Arrow batches (in-process, no JVM).
- Tumbling / hopping / session windows.
- Scalar + aggregate Python UDFs.
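
The three window types differ in how an event timestamp maps to window bounds. The sketch below shows one common definition of each over integer timestamps. It does not reflect the engine's SQL syntax or internals: the function names are hypothetical, and conventions such as half-open `[start, end)` bounds and session ends at the last event are assumptions.

```python
def tumbling(ts, size):
    """Fixed, non-overlapping windows: each event falls in exactly one."""
    start = ts - ts % size
    return (start, start + size)


def hopping(ts, size, hop):
    """Overlapping windows that start every `hop` units and last `size` units.

    Returns every window [start, start + size) that contains ts.
    """
    last_start = ts - ts % hop
    # Walk backwards while the window still covers ts (start > ts - size).
    starts = range(last_start, ts - size, -hop)
    return [(s, s + size) for s in sorted(starts)]


def sessions(timestamps, gap):
    """Group events into sessions; a pause longer than `gap` starts a new one."""
    windows = []
    for ts in sorted(timestamps):
        if windows and ts - windows[-1][1] <= gap:
            windows[-1][1] = ts  # extend the open session
        else:
            windows.append([ts, ts])  # start a new session
    return [(start, end) for start, end in windows]


print(tumbling(12, size=10))                # (10, 20)
print(hopping(12, size=10, hop=5))          # [(5, 15), (10, 20)]
print(sessions([1, 3, 4, 20, 22], gap=5))   # [(1, 4), (20, 22)]
```

With `hop` equal to `size`, `hopping` returns the single tumbling window, which is a quick sanity check on the definitions.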
What’s not in v0.3.0
- Published distributed benchmarks. The distributed code path is shipped (see Execution above) and has a bench harness (`tpch_distributed`), but cluster-scale TPC-H runs at SF≥100 aren’t published yet; every number on /specs/02-benchmarks is single-machine.
- Snowflake / BigQuery / Redshift backends.
- Pluggable secrets stores (Vault, AWS Secrets Manager, GCP Secret Manager). Today `${VAR}` interpolates from env only.
- Web UI. Run history is queryable via the `flow runs ...` CLI subcommand only; there’s no graphical UI today.

Roadmap items are tracked openly in the GitHub repo issues.