Data Pipelines for Dashboards
Dashboards visualize data. That data comes from somewhere: your database, external services, logs. The process of extracting, transforming, and loading data is a data pipeline. Pipeline design determines dashboard performance and freshness.
What Is a Data Pipeline?
Extract: pull data from sources (application database, API, logs). Transform: clean, aggregate, calculate metrics. Load: store in a location the dashboard queries.
Example: extract raw transaction logs, transform to calculate daily revenue, store in a metrics table, dashboard queries the metrics table.
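The extract-transform-load example above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the raw rows are hardcoded stand-ins for what the extract step would pull from the application database, and `transform` is a hypothetical name.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transaction rows; a real extract step would pull
# these from the application database, an API, or log files.
raw_transactions = [
    {"created_at": date(2024, 1, 1), "amount": 100},
    {"created_at": date(2024, 1, 1), "amount": 50},
    {"created_at": date(2024, 1, 2), "amount": 75},
]

def transform(transactions):
    """Aggregate raw transactions into revenue per day."""
    daily = defaultdict(int)
    for tx in transactions:
        daily[tx["created_at"]] += tx["amount"]
    return dict(daily)

# Load step: in practice this would INSERT into a metrics table that
# the dashboard queries; here we just build the result in memory.
daily_revenue = transform(raw_transactions)
print(daily_revenue[date(2024, 1, 1)])  # 150
```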
ETL vs ELT
ETL (Extract, Transform, Load): traditional. Extract data, transform it, then load into the warehouse. Transformation happens before storage.
ELT (Extract, Load, Transform): modern. Load raw data first, transform inside the warehouse. More flexible: the raw data is retained, so new transformations can be run later without re-extracting.
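The ELT pattern can be sketched with `sqlite3` standing in for the warehouse (an assumption for illustration; a real warehouse would be Snowflake, BigQuery, or similar). Raw rows land first, then the transformation runs as SQL inside the store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Load: raw data lands in the warehouse untransformed.
conn.execute("CREATE TABLE raw_orders (created_at TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("2024-01-01", 100), ("2024-01-01", 50), ("2024-01-02", 75)],
)

# Transform: happens inside the warehouse, after loading.
conn.execute("""
    CREATE TABLE daily_revenue AS
    SELECT created_at AS day, SUM(amount) AS revenue
    FROM raw_orders GROUP BY created_at
""")

rows = conn.execute("SELECT day, revenue FROM daily_revenue ORDER BY day").fetchall()
print(rows)  # [('2024-01-01', 150), ('2024-01-02', 75)]
```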
OLTP vs OLAP Databases
OLTP (Online Transaction Processing): optimized for fast reads and writes of small amounts of data. Your application database. INSERT new orders, UPDATE inventory, SELECT an order.
OLAP (Online Analytical Processing): optimized for complex queries over large amounts of data. Designed for analytics. Column-oriented storage. SELECT SUM(revenue) GROUP BY product.
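The two query shapes can be shown side by side. SQLite is row-oriented, so this only illustrates the shape of the queries, not column-store performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "widget", 10), (2, "gadget", 20), (3, "widget", 5)])

# OLTP-shaped query: fetch one row by primary key.
one_order = conn.execute("SELECT product, revenue FROM orders WHERE id = 2").fetchone()

# OLAP-shaped query: scan and aggregate many rows.
by_product = conn.execute(
    "SELECT product, SUM(revenue) FROM orders GROUP BY product ORDER BY product"
).fetchall()

print(one_order)   # ('gadget', 20)
print(by_product)  # [('gadget', 20), ('widget', 15)]
```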
Running complex analytics queries against an OLTP database (your app's database) is risky. Heavy scans compete with the application for CPU, memory, and I/O, can hold locks or long-running snapshots, slow the application down, and may time out on large datasets.
The Separate Reporting Database
For dashboards, create a separate database optimized for analytics. Not your application database. Extract data from the application database, transform it, load into the reporting database.
This architecture: application queries application database (OLTP, fast), dashboards query reporting database (OLAP, analytics-friendly). No contention. Each is optimized for its purpose.
Pre-Aggregation
Raw data is large. A year of transaction logs is millions of rows. Computing SUM(revenue) on millions of rows is slow.
Pre-aggregate: compute daily revenue once and store it. Dashboard queries the aggregated data (365 rows instead of millions).
Tradeoff: slightly stale data (aggregations updated daily) but dramatically faster queries.
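A sketch of the row-count difference, using `sqlite3` and simulated data (100 transactions per day for 365 days, an assumption chosen only to make the numbers concrete):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (day TEXT, amount INTEGER)")

# Simulated year of raw transactions: 100 rows per day.
start = date(2024, 1, 1)
rows = [((start + timedelta(days=d)).isoformat(), 1)
        for d in range(365) for _ in range(100)]
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

# Pre-aggregate once (in practice, a nightly job).
conn.execute("""
    CREATE TABLE daily_revenue AS
    SELECT day, SUM(amount) AS revenue FROM transactions GROUP BY day
""")

raw_rows = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
agg_rows = conn.execute("SELECT COUNT(*) FROM daily_revenue").fetchone()[0]
print(raw_rows, agg_rows)  # 36500 365
```

The dashboard now scans 365 pre-computed rows instead of tens of thousands of raw ones.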
Data Warehouses
Snowflake, BigQuery, and Redshift are data warehouse services; DuckDB is an in-process analytical database. Massive scale analytics, powerful query engines, built for complex queries.
Use a data warehouse when: analytics queries are complex, data volume is huge (billions of rows), query speed matters. Most custom applications don't need a warehouse.
Materialized Views in PostgreSQL
PostgreSQL has materialized views: a saved query result stored like a table. CREATE MATERIALIZED VIEW daily_revenue AS SELECT DATE(created_at) AS day, SUM(amount) AS revenue FROM orders GROUP BY day.
Refresh the view on a schedule: REFRESH MATERIALIZED VIEW daily_revenue. Use REFRESH MATERIALIZED VIEW CONCURRENTLY to avoid blocking reads during the refresh (this requires a unique index on the view). Queries on the view read pre-computed results and are fast.
This provides warehouse-like performance without a separate warehouse. Good for most applications.
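The refresh cycle can be sketched in Python. SQLite has no materialized views, so this simulates one by rebuilding a results table; with PostgreSQL you would run the REFRESH MATERIALIZED VIEW statement from a scheduled job instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (created_at TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("2024-01-01", 100), ("2024-01-01", 50)])

def refresh_daily_revenue(conn):
    """Simulated REFRESH MATERIALIZED VIEW: rebuild the pre-computed table."""
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute("""
        CREATE TABLE daily_revenue AS
        SELECT created_at AS day, SUM(amount) AS revenue
        FROM orders GROUP BY created_at
    """)

refresh_daily_revenue(conn)
before = conn.execute("SELECT revenue FROM daily_revenue WHERE day = '2024-01-01'").fetchone()[0]

# New orders arrive; the "view" is stale until the next scheduled refresh.
conn.execute("INSERT INTO orders VALUES ('2024-01-01', 25)")
stale = conn.execute("SELECT revenue FROM daily_revenue WHERE day = '2024-01-01'").fetchone()[0]

refresh_daily_revenue(conn)
after = conn.execute("SELECT revenue FROM daily_revenue WHERE day = '2024-01-01'").fetchone()[0]
print(before, stale, after)  # 150 150 175
```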
Complexity Spectrum
Simple dashboards: query the production database directly with optimized indexes. Works if queries are simple and fast.
Medium dashboards: materialized views or a separate reporting database. Complex queries, but still PostgreSQL-scale.
Complex dashboards: data warehouse (Snowflake, BigQuery). Massive data, complex analytics, need warehouse performance.
Latency vs Staleness
Real-time dashboards query live data but pay the full query cost on every load. Pre-aggregated dashboards are fast but stale.
Real-time: good for operational dashboards (system health, current traffic). Staleness: acceptable for strategic dashboards (monthly revenue).
Schedule and Incremental Loading
Full reloads: extract all data from source, replace target. Simple but slow for large volumes.
Incremental: extract only new or changed data since the last run. Much faster. Requires tracking what changed.
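A sketch of incremental loading with a watermark, using two `sqlite3` connections as stand-ins for the source and target databases. An auto-incrementing id works as the watermark for append-only data; if rows can be updated, an updated_at timestamp column is needed instead.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20)])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")

def incremental_load(source, target):
    """Copy only rows newer than the highest id already loaded (the watermark)."""
    watermark = target.execute("SELECT COALESCE(MAX(id), 0) FROM orders").fetchone()[0]
    new_rows = source.execute(
        "SELECT id, amount FROM orders WHERE id > ?", (watermark,)
    ).fetchall()
    target.executemany("INSERT INTO orders VALUES (?, ?)", new_rows)
    return len(new_rows)

first = incremental_load(source, target)   # loads rows 1 and 2
source.execute("INSERT INTO orders VALUES (3, 30)")
second = incremental_load(source, target)  # loads only row 3
print(first, second)  # 2 1
```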