Data Pipelines for Dashboards
Dashboards visualize data. That data comes from somewhere: your database, external services, logs. The process of extracting, transforming, and loading data is a data pipeline. Pipeline design determines dashboard performance and freshness.
What Is a Data Pipeline?
Extract: pull data from sources (application database, API, logs). Transform: clean, aggregate, calculate metrics. Load: store in a location the dashboard queries.
Example: extract raw transaction logs, transform to calculate daily revenue, store in a metrics table, dashboard queries the metrics table.
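The extract-transform-load example above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the raw rows are hardcoded stand-ins for what the extract step would pull from the application database, and `transform` is a hypothetical name.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transaction rows; a real extract step would pull
# these from the application database, an API, or log files.
raw_transactions = [
    {"created_at": date(2024, 1, 1), "amount": 100},
    {"created_at": date(2024, 1, 1), "amount": 50},
    {"created_at": date(2024, 1, 2), "amount": 75},
]

def transform(transactions):
    """Aggregate raw transactions into revenue per day."""
    daily = defaultdict(int)
    for tx in transactions:
        daily[tx["created_at"]] += tx["amount"]
    return dict(daily)

# Load step: in practice this would INSERT into a metrics table that
# the dashboard queries; here we just build the result in memory.
daily_revenue = transform(raw_transactions)
print(daily_revenue[date(2024, 1, 1)])  # 150
```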
ETL vs ELT
ETL (Extract, Transform, Load): traditional. Extract data, transform it, then load into the warehouse. Transformation happens before storage.
ELT (Extract, Load, Transform): modern. Load raw data first, transform inside the warehouse. More flexible: the raw data is retained, so new transformations can be run later without re-extracting.
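The ELT pattern can be sketched with `sqlite3` standing in for the warehouse (an assumption for illustration; a real warehouse would be Snowflake, BigQuery, or similar). Raw rows land first, then the transformation runs as SQL inside the store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Load: raw data lands in the warehouse untransformed.
conn.execute("CREATE TABLE raw_orders (created_at TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("2024-01-01", 100), ("2024-01-01", 50), ("2024-01-02", 75)],
)

# Transform: happens inside the warehouse, after loading.
conn.execute("""
    CREATE TABLE daily_revenue AS
    SELECT created_at AS day, SUM(amount) AS revenue
    FROM raw_orders GROUP BY created_at
""")

rows = conn.execute("SELECT day, revenue FROM daily_revenue ORDER BY day").fetchall()
print(rows)  # [('2024-01-01', 150), ('2024-01-02', 75)]
```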
OLTP vs OLAP Databases
OLTP (Online Transaction Processing): optimized for fast reads and writes of small amounts of data. Your application database. INSERT new orders, UPDATE inventory, SELECT an order.
OLAP (Online Analytical Processing): optimized for complex queries over large amounts of data. Designed for analytics. Column-oriented storage. SELECT SUM(revenue) GROUP BY product.
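The two query shapes can be shown side by side. SQLite is row-oriented, so this only illustrates the shape of the queries, not column-store performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "widget", 10), (2, "gadget", 20), (3, "widget", 5)])

# OLTP-shaped query: fetch one row by primary key.
one_order = conn.execute("SELECT product, revenue FROM orders WHERE id = 2").fetchone()

# OLAP-shaped query: scan and aggregate many rows.
by_product = conn.execute(
    "SELECT product, SUM(revenue) FROM orders GROUP BY product ORDER BY product"
).fetchall()

print(one_order)   # ('gadget', 20)
print(by_product)  # [('gadget', 20), ('widget', 15)]
```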
Running complex analytics queries against an OLTP database (your app's database) is risky. Heavy scans compete with the application for CPU, memory, and I/O, can hold locks or long-running snapshots, slow the application down, and may time out on large datasets.
The Separate Reporting Database
For dashboards, create a separate database optimized for analytics. Not your application database. Extract data from the application database, transform it, load into the reporting database.
This architecture: application queries application database (OLTP, fast), dashboards query reporting database (OLAP, analytics-friendly). No contention. Each is optimized for its purpose.
Pre-Aggregation
Raw data is large. A year of transaction logs is millions of rows. Computing SUM(revenue) on millions of rows is slow.
Pre-aggregate: compute daily revenue once and store it. Dashboard queries the aggregated data (365 rows instead of millions).
Tradeoff: slightly stale data (aggregations updated daily) but dramatically faster queries.
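A sketch of the row-count difference, using `sqlite3` and simulated data (100 transactions per day for 365 days, an assumption chosen only to make the numbers concrete):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (day TEXT, amount INTEGER)")

# Simulated year of raw transactions: 100 rows per day.
start = date(2024, 1, 1)
rows = [((start + timedelta(days=d)).isoformat(), 1)
        for d in range(365) for _ in range(100)]
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

# Pre-aggregate once (in practice, a nightly job).
conn.execute("""
    CREATE TABLE daily_revenue AS
    SELECT day, SUM(amount) AS revenue FROM transactions GROUP BY day
""")

raw_rows = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
agg_rows = conn.execute("SELECT COUNT(*) FROM daily_revenue").fetchone()[0]
print(raw_rows, agg_rows)  # 36500 365
```

The dashboard now scans 365 pre-computed rows instead of tens of thousands of raw ones.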
Data Warehouses
Snowflake, BigQuery, and Redshift are data warehouse services; DuckDB is an in-process analytical database. Massive scale analytics, powerful query engines, built for complex queries.
Use a data warehouse when: analytics queries are complex, data volume is huge (billions of rows), query speed matters. Most custom applications don't need a warehouse.
Materialized Views in PostgreSQL
PostgreSQL has materialized views: a saved query result stored like a table. CREATE MATERIALIZED VIEW daily_revenue AS SELECT DATE(created_at) AS day, SUM(amount) AS revenue FROM orders GROUP BY day.
Refresh the view on a schedule: REFRESH MATERIALIZED VIEW daily_revenue. Use REFRESH MATERIALIZED VIEW CONCURRENTLY to avoid blocking reads during the refresh (this requires a unique index on the view). Queries on the view read pre-computed results and are fast.
This provides warehouse-like performance without a separate warehouse. Good for most applications.
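The refresh cycle can be sketched in Python. SQLite has no materialized views, so this simulates one by rebuilding a results table; with PostgreSQL you would run the REFRESH MATERIALIZED VIEW statement from a scheduled job instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (created_at TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("2024-01-01", 100), ("2024-01-01", 50)])

def refresh_daily_revenue(conn):
    """Simulated REFRESH MATERIALIZED VIEW: rebuild the pre-computed table."""
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute("""
        CREATE TABLE daily_revenue AS
        SELECT created_at AS day, SUM(amount) AS revenue
        FROM orders GROUP BY created_at
    """)

refresh_daily_revenue(conn)
before = conn.execute("SELECT revenue FROM daily_revenue WHERE day = '2024-01-01'").fetchone()[0]

# New orders arrive; the "view" is stale until the next scheduled refresh.
conn.execute("INSERT INTO orders VALUES ('2024-01-01', 25)")
stale = conn.execute("SELECT revenue FROM daily_revenue WHERE day = '2024-01-01'").fetchone()[0]

refresh_daily_revenue(conn)
after = conn.execute("SELECT revenue FROM daily_revenue WHERE day = '2024-01-01'").fetchone()[0]
print(before, stale, after)  # 150 150 175
```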
Complexity Spectrum
Simple dashboards: query the production database directly with optimized indexes. Works if queries are simple and fast.
Medium dashboards: materialized views or a separate reporting database. Complex queries, but still PostgreSQL-scale.
Complex dashboards: data warehouse (Snowflake, BigQuery). Massive data, complex analytics, need warehouse performance.
Latency vs Staleness
Real-time dashboards query live data but pay the full query cost on every load. Pre-aggregated dashboards are fast but stale.
Real-time: good for operational dashboards (system health, current traffic). Staleness: acceptable for strategic dashboards (monthly revenue).
Schedule and Incremental Loading
Full reloads: extract all data from source, replace target. Simple but slow for large volumes.
Incremental: extract only new or changed data since the last run. Much faster. Requires tracking what changed.
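A sketch of incremental loading with a watermark, using two `sqlite3` connections as stand-ins for the source and target databases. An auto-incrementing id works as the watermark for append-only data; if rows can be updated, an updated_at timestamp column is needed instead.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20)])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")

def incremental_load(source, target):
    """Copy only rows newer than the highest id already loaded (the watermark)."""
    watermark = target.execute("SELECT COALESCE(MAX(id), 0) FROM orders").fetchone()[0]
    new_rows = source.execute(
        "SELECT id, amount FROM orders WHERE id > ?", (watermark,)
    ).fetchall()
    target.executemany("INSERT INTO orders VALUES (?, ?)", new_rows)
    return len(new_rows)

first = incremental_load(source, target)   # loads rows 1 and 2
source.execute("INSERT INTO orders VALUES (3, 30)")
second = incremental_load(source, target)  # loads only row 3
print(first, second)  # 2 1
```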