Data Pipeline · Production

python-etl

Modular ETL pipeline with Airflow orchestration, dbt transforms, and S3 staging.

0 stars · 0 installs · by community · updated Jan 15, 2026
Stack: Python, Airflow, dbt, PostgreSQL, S3
Tags: #etl #airflow #dbt #s3 #pipeline #python
Install (terminal)
npx specdriven add spec python-etl

What's included

A production ETL framework organised around three distinct stages: extract, load, and transform. Raw data lands in S3 via idempotent extractor tasks, is loaded into a PostgreSQL staging schema, and then transformed into analytics-ready tables by dbt models. Airflow schedules and monitors the full pipeline.

The spec ships with pre-built extractors for common sources - REST APIs, SFTP, and relational databases - and a base class that makes adding new sources straightforward. All extractors are idempotent: re-running a pipeline window produces no duplicates.
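
One way to picture the idempotency guarantee is an extractor whose output key depends only on the source and the time window, so a re-run overwrites the earlier object instead of adding a second one. The sketch below is illustrative only: BaseExtractor, StaticExtractor, key_for, and the dict-like store are hypothetical names, not the classes the spec ships, and the store stands in for an S3 bucket.

```python
import json
from abc import ABC, abstractmethod
from datetime import datetime, timezone


class BaseExtractor(ABC):
    """Hypothetical base class; the spec's real base class may look different."""

    source_name = "base"

    def __init__(self, store):
        # `store` is any dict-like mapping of key -> bytes. In a real pipeline
        # this would be a thin wrapper around an S3 bucket (assumption).
        self.store = store

    @abstractmethod
    def fetch(self, start, end):
        """Return the source records for the window [start, end) as a list of dicts."""

    def key_for(self, start, end):
        # The key is derived from the source and the window only, never from
        # the wall-clock time of the run, so re-runs target the same object.
        return (
            f"raw/{self.source_name}/"
            f"{start:%Y%m%dT%H%M%S}_{end:%Y%m%dT%H%M%S}.json"
        )

    def run(self, start, end):
        records = self.fetch(start, end)
        key = self.key_for(start, end)
        # Overwrite rather than append: the same window always maps to one object.
        self.store[key] = json.dumps(records, sort_keys=True).encode("utf-8")
        return key


class StaticExtractor(BaseExtractor):
    """Toy source that returns fixed records, used only to exercise the base class."""

    source_name = "static_demo"

    def fetch(self, start, end):
        return [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]


if __name__ == "__main__":
    store = {}
    extractor = StaticExtractor(store)
    start = datetime(2026, 1, 1, tzinfo=timezone.utc)
    end = datetime(2026, 1, 2, tzinfo=timezone.utc)
    extractor.run(start, end)
    extractor.run(start, end)  # re-running the same window
    print(len(store))
```

Adding a new source under this design would mean subclassing and implementing fetch only; the window-keyed write stays in the base class.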

Architecture

Airflow DAGs live in dags/ and are intentionally thin: they declare dependencies and pass parameters but contain no business logic. Extractor classes in etl/extractors/ handle source-specific concerns. The etl/loaders/ package writes raw payloads to S3 and then bulk-loads them into Postgres staging tables using COPY.
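
The COPY step of a loader could look roughly like the sketch below. It assumes a psycopg2-style cursor, whose copy_expert(sql, file) method streams a file-like object into a COPY ... FROM STDIN statement. The function names, the staging schema name, and the orders table are hypothetical and are not taken from the spec's etl/loaders/ package.

```python
import csv
import io
import re

# Conservative whitelist for SQL identifiers, since they are interpolated
# into the COPY statement rather than passed as bind parameters.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check_ident(name):
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def build_copy_sql(schema, table, columns):
    """Build a COPY ... FROM STDIN statement for CSV input."""
    cols = ", ".join(_check_ident(c) for c in columns)
    target = f"{_check_ident(schema)}.{_check_ident(table)}"
    return f"COPY {target} ({cols}) FROM STDIN WITH (FORMAT csv)"


def rows_to_csv(rows, columns):
    """Serialise dict rows into an in-memory CSV buffer, in column order."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([row[c] for c in columns])
    buf.seek(0)
    return buf


def copy_into_staging(cursor, schema, table, columns, rows):
    """Bulk-load rows through a psycopg2-style cursor (assumed interface)."""
    sql = build_copy_sql(schema, table, columns)
    cursor.copy_expert(sql, rows_to_csv(rows, columns))
    return sql
```

With a real connection the call would be along the lines of copy_into_staging(conn.cursor(), "staging", "orders", ["id", "value"], rows), followed by a commit. COPY sends the whole batch in one round trip, which is usually the reason to prefer it over row-by-row INSERTs for staging loads.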

dbt models in transforms/ layer analytical views on top of the staging schema. Source freshness tests and schema tests run after every pipeline execution. A data quality DAG runs separately on a slower cadence and pages on failures.

Getting started

Run npx specdriven add @specs/python-etl to scaffold. Install Python dependencies with pip install -r requirements.txt. Copy .env.example to .env and set AIRFLOW__DATABASE__SQL_ALCHEMY_CONN, AWS_* credentials, and your target Postgres URL.
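
Before starting Airflow it can help to confirm that the settings from .env are actually present in the environment. The helper below is a minimal sketch, not part of the spec: missing_settings and has_prefixed_setting are hypothetical names, and the only variable name taken from the text above is AIRFLOW__DATABASE__SQL_ALCHEMY_CONN. The exact AWS and Postgres variable names are whatever .env.example defines.

```python
import os


def missing_settings(required, env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


def has_prefixed_setting(prefix, env=None):
    """Return True if at least one non-blank variable starts with `prefix`."""
    env = os.environ if env is None else env
    return any(k.startswith(prefix) and v.strip() for k, v in env.items())


if __name__ == "__main__":
    # Extend this list with the names found in .env.example.
    required = ["AIRFLOW__DATABASE__SQL_ALCHEMY_CONN"]
    missing = missing_settings(required)
    if missing:
        print("Missing settings:", ", ".join(missing))
    if not has_prefixed_setting("AWS_"):
        print("No AWS_* variables are set")
```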

Initialise Airflow with airflow db migrate and create an admin user. Start the scheduler with airflow scheduler and the UI with airflow webserver. Trigger the sample DAG from the Airflow UI or via airflow dags trigger etl_sample to verify the pipeline end-to-end.