
Empowering Analysts: Building Data Pipelines with YAML, dlt, dbt, and Trino – A Step-by-Step Guide

Last updated: 2026-05-04 02:55:31 · Intermediate

Overview

Data pipelines have traditionally been the domain of software engineers wielding PySpark or Python scripts. However, a new stack — dlt (data load tool), dbt (data build tool), and Trino — allows analysts to build and maintain pipelines using nothing more than YAML configuration files. This guide walks you through replacing complex PySpark pipelines with four YAML files, cutting delivery time from weeks to a single day. By the end, you’ll understand how to set up a pipeline that extracts, loads, transforms, and queries data without writing a single line of Python or Spark code.

Source: towardsdatascience.com

Prerequisites

Before diving in, ensure you have:

  • Basic familiarity with SQL – dbt relies on SQL for transformations.
  • Access to a data warehouse (e.g., Snowflake, BigQuery, Postgres) – Trino will serve as the query engine.
  • Python 3.8+ installed (only for installing dlt and dbt; no coding required beyond setup).
  • YAML editor – any text editor works.
  • A source of data – API, database, or flat files you want to ingest.

This guide assumes you are comfortable running terminal commands and editing configuration files.

Step-by-Step Instructions

1. Setting Up the Tools

Install dlt, dbt (with its Trino adapter), and the Trino Python client using pip (or conda):

pip install dlt dbt-core dbt-trino trino

Verify installations:

dlt --version
dbt --version

Note that the trino package installed via pip is the Python client library, not a command-line tool; the Trino server (and its optional CLI) is a separate Java application that must already be running and reachable.

Create a project directory:

mkdir my_pipeline
cd my_pipeline

2. Configuring the Source – dlt YAML

dlt extracts data from sources and loads it into a destination. Create a file sources.yml:

# sources.yml
sources:
  my_api:
    type: rest_api
    config:
      base_url: "https://api.example.com/v1"
      endpoint: /data
      pagination: true
    # Add authentication if needed
    auth:
      api_key: "${API_KEY}"

This YAML tells dlt to fetch data from an API endpoint with pagination. Replace the URL and API key with your own. dlt supports many source types (databases, cloud storage, etc.).
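Because the key is referenced as ${API_KEY}, it must be exported in your shell before running dlt. A minimal sketch with a placeholder value (the key shown is purely illustrative, not a real credential):

```shell
# Export the secret that sources.yml references as ${API_KEY}.
# "demo-key-123" is a hypothetical placeholder value.
export API_KEY="demo-key-123"

# Verify the variable is set in the environment dlt will inherit.
echo "API_KEY is ${#API_KEY} characters long"   # prints: API_KEY is 12 characters long
```

The same pattern applies to TRINO_PASSWORD in the destination config below.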

3. Loading Data – dlt Destination YAML

Create destinations.yml to specify where data goes:

# destinations.yml
destinations:
  my_trino:
    type: trino
    config:
      host: localhost
      port: 8080
      database: my_db
      user: analyst
      password: "${TRINO_PASSWORD}"

Now define a pipeline in pipeline.yml that links the source and destination:

# pipeline.yml
pipeline:
  name: my_first_pipeline
  source: my_api
  destination: my_trino
  tables:
    - name: raw_data
      primary_key: id
      incremental: true
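The combination of incremental: true and primary_key implies merge semantics: each new batch upserts on id rather than rewriting the table. A hedged SQL sketch of that behavior (illustrative only, not dlt's internal implementation; note that Trino supports MERGE only on connectors with write support, such as Iceberg):

```sql
-- Illustrative merge semantics for incremental loading on primary_key: id.
-- new_batch is a hypothetical staging table holding the latest extract.
MERGE INTO raw_data AS t
USING new_batch AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET revenue = s.revenue   -- assumed column; update all non-key columns
WHEN NOT MATCHED THEN
  INSERT (id, revenue) VALUES (s.id, s.revenue);
```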

Run the pipeline with a single command:

dlt pipeline run pipeline.yml

Data is now loaded into Trino under the raw_data table.
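Before moving on to transformations, a quick sanity check from any Trino client confirms the load. The query below is an assumption-laden example; qualify the table as catalog.schema.raw_data to match where dlt actually wrote it:

```sql
-- Hypothetical sanity check: did rows land, and is the primary key populated?
SELECT count(*) AS row_count,
       max(id)  AS max_id
FROM raw_data;
```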

4. Transforming with dbt

dbt allows analysts to write SQL models. Initialize a dbt project inside your directory:

dbt init my_dbt_project

Edit profiles.yml to point to your Trino instance:

# profiles.yml
my_dbt_project:
  outputs:
    dev:
      type: trino
      method: none
      host: localhost
      port: 8080
      database: my_db
      schema: analytics
      user: analyst
      password: "${TRINO_PASSWORD}"
  target: dev

Create a transformation model in models/ – for example, aggregated_data.sql:

-- models/aggregated_data.sql
SELECT
    EXTRACT(YEAR FROM event_date) AS year,
    EXTRACT(MONTH FROM event_date) AS month,
    category,
    SUM(revenue) AS total_revenue
FROM {{ source('raw_data', 'raw_data') }}
GROUP BY 1,2,3
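For the {{ source('raw_data', 'raw_data') }} reference to resolve, dbt also needs the source declared in a schema file under models/. A minimal sketch, assuming the table landed in a schema named raw_data (adjust to wherever dlt actually loaded it):

```yaml
# models/sources.yml -- assumed names; align schema with dlt's output location
version: 2

sources:
  - name: raw_data        # first argument to source()
    database: my_db
    schema: raw_data      # assumption: the schema created by the dlt pipeline
    tables:
      - name: raw_data    # second argument to source()
```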

Run dbt to apply transformations:

dbt run

This creates a table or view in Trino’s analytics schema.

5. Querying with Trino

Now you can query the transformed data using any SQL client connected to Trino. For example:

-- Query from Trino CLI or your BI tool
SELECT * FROM my_db.analytics.aggregated_data
WHERE total_revenue > 100000
ORDER BY year, month;

That’s it – a complete pipeline defined in four YAML files (sources.yml, destinations.yml, pipeline.yml, and dbt’s profiles.yml), plus one SQL model.

Common Mistakes

  • Incorrect indentation in YAML – YAML is space-sensitive. Use 2 spaces per level, not tabs.
  • Missing environment variables – Never hardcode secrets; use ${VAR} and export them before running.
  • Pagination not enabled – dlt defaults to single-page fetches. If your API returns many records, enable pagination: true or specify a cursor.
  • Database schema issues – Ensure the schema (raw_data) exists in Trino before running the dlt pipeline. dlt may create it automatically, but not always.
  • Trino user permissions – The user must have write access to the destination schema and read access to any sources.
  • dbt model referencing wrong source – Verify the source name in source() matches the table from dlt. Use dbt docs generate to check lineage.
  • Ignoring incremental loading – Without incremental: true in pipeline.yml, dlt will reload and overwrite the entire table on every run.
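For the indentation pitfall in particular, a quick portable check for stray tabs can save a confusing parse error. The file below is a deliberately broken throwaway example, not part of the pipeline:

```shell
# Create a hypothetical, deliberately tab-indented YAML file.
printf 'sources:\n\tmy_api:\n' > /tmp/bad.yml

# grep for a literal tab; printf '\t' keeps this portable across shells.
if grep -q "$(printf '\t')" /tmp/bad.yml; then
  echo "Tabs found - replace with spaces"
else
  echo "No tabs found"
fi
```

Run the same grep over your real sources.yml, destinations.yml, and pipeline.yml before invoking dlt.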

Summary

By replacing PySpark with a stack of dlt, dbt, and Trino, organizations empower analysts to build and maintain data pipelines using YAML and SQL alone. The process reduces delivery time from weeks to one day, eliminates the need for dedicated engineering support, and keeps pipelines version-controlled and auditable. This guide demonstrated a complete end-to-end pipeline with four configuration files, covering extraction, loading, transformation, and querying. Start with a single use case, and scale from there.