Building a Modern Data Stack: Architecture, Components, and Decision Framework
A comprehensive guide to designing and implementing a modern data stack, covering component selection, cost analysis, architectural patterns, and real-world decision frameworks for cloud-native data platforms.
The modern data stack represents a paradigm shift from monolithic ETL systems to composable, cloud-native architectures. This guide provides a rigorous examination of each layer—from ingestion to analytics—including cost modeling, vendor comparisons, and a decision framework to help organizations select the right tools for their specific requirements and constraints.
Prerequisites
- Basic understanding of SQL and data modeling concepts
- Familiarity with cloud computing fundamentals (AWS, GCP, or Azure)
- Experience with data pipelines or traditional ETL/BI tools

Introduction
The modern data stack has emerged as the dominant architecture for organizations seeking to become data-driven. Unlike the monolithic enterprise data warehouses of the past—characterized by proprietary ETL tools, expensive licensing, and months-long implementations—today's data stack is composable, cloud-native, and designed for agility.
According to the 2024 State of Data Engineering Report, organizations adopting modern data stacks report:
- 67% faster time-to-insight compared to legacy systems
- 40% reduction in data engineering operational overhead
- 3x improvement in self-service analytics adoption
This guide provides a comprehensive framework for designing, implementing, and optimizing a modern data stack tailored to your organization's specific requirements.
What Defines a "Modern" Data Stack?
The modern data stack is characterized by several architectural principles:
1. Cloud-Native by Default
All components are deployed on cloud infrastructure, offering:
- Elastic scaling: Scale compute independently of storage
- Pay-per-use pricing: No upfront capital expenditure
- Managed operations: Reduced infrastructure maintenance burden
2. ELT Over ETL
The paradigm shift from ETL (Extract-Transform-Load) to ELT (Extract-Load-Transform) is fundamental:
Traditional ETL:
Source → Transform (external) → Load → Warehouse
↑ Expensive ETL servers
Modern ELT:
Source → Extract → Load → Transform (in-warehouse)
↑ Leverage warehouse compute
By pushing transformations into the data warehouse, organizations leverage the massive compute power of platforms like Snowflake and BigQuery, eliminating the need for separate transformation infrastructure.
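The difference is easiest to see in code. Below is a minimal ELT flow sketched in Python, using an in-memory SQLite database as a stand-in for the warehouse; the table and column names are illustrative, not from any particular stack:

```python
import sqlite3

# "Warehouse" stand-in: in a real stack this would be Snowflake/BigQuery.
warehouse = sqlite3.connect(":memory:")

# Extract: pull raw records from a source (here, a hardcoded sample).
raw_payments = [
    ("p1", "c1", 2500, "succeeded"),
    ("p2", "c2", 1000, "failed"),
    ("p3", "c1", 4000, "succeeded"),
]

# Load: land the data untransformed in a raw table.
warehouse.execute(
    "CREATE TABLE raw_payments (payment_id TEXT, customer_id TEXT, "
    "amount_cents INTEGER, status TEXT)"
)
warehouse.executemany("INSERT INTO raw_payments VALUES (?, ?, ?, ?)", raw_payments)

# Transform: run SQL *inside* the warehouse, using its compute.
warehouse.execute(
    "CREATE TABLE fct_orders AS "
    "SELECT payment_id AS order_id, customer_id, amount_cents / 100.0 AS amount_usd "
    "FROM raw_payments WHERE status = 'succeeded'"
)

rows = warehouse.execute(
    "SELECT order_id, amount_usd FROM fct_orders ORDER BY order_id"
).fetchall()
print(rows)  # [('p1', 25.0), ('p3', 40.0)]
```

The key point is the last step: transformation is expressed as SQL executed by the storage engine itself, so no separate transformation servers are needed.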
3. Composable Architecture
Rather than a single monolithic platform, the modern stack comprises best-of-breed tools that integrate via standard interfaces (APIs, SQL, object storage):
┌────────────────────────────────────────────────────────────────────┐
│ MODERN DATA STACK LAYERS │
├────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ SOURCES │ │ INGESTION │ │ STORAGE │ │ TRANSFORM │ │
│ │ │ │ │ │ │ │ │ │
│ │ Databases │ │ Fivetran │ │ Snowflake │ │ dbt │ │
│ │ APIs │→ │ Airbyte │→ │ BigQuery │→ │ Spark │ │
│ │ SaaS Apps │ │ Stitch │ │ Redshift │ │ SQLMesh │ │
│ │ Events │ │ Custom │ │ Databricks │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └───────────┘ │
│ │ │
│ ┌───────────────┘ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ ANALYTICS │ │ ORCHESTRATE │ │ OBSERVE │ │
│ │ │ │ │ │ │ │
│ │ Looker │ │ Airflow │ │ Monte Carlo │ │
│ │ Tableau │← │ Dagster │ │ Great Exp. │ │
│ │ Metabase │ │ Prefect │ │ Elementary │ │
│ │ Preset │ │ dbt Cloud │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────┘
Layer 1: Data Ingestion
Data ingestion is the foundation of any data stack. The choice between managed and self-hosted solutions significantly impacts operational overhead and cost.
Vendor Comparison Matrix
| Tool | Type | Connectors | Pricing Model | Best For |
|---|---|---|---|---|
| Fivetran | Managed | 300+ | Per MAR (Monthly Active Rows) | Enterprise, compliance-heavy |
| Airbyte | Open-source/Cloud | 350+ | Self-hosted free, Cloud per credit | Cost-conscious, custom sources |
| Stitch | Managed | 130+ | Per row volume | Mid-market, Talend ecosystem |
| Meltano | Open-source | Singer ecosystem | Self-hosted free | Engineering-heavy teams |
| Hevo | Managed | 150+ | Per event | Indian market, real-time focus |
Cost Analysis: Fivetran vs. Airbyte
For an organization syncing 100 million rows/month across 20 connectors:
Fivetran (Business Tier):
- Base platform: ~$1,000/month
- MAR cost: 100M rows × ~$500 per million MAR = $50,000/month
- Total: ~$51,000/month
Airbyte Cloud:
- Credits: 100M rows ≈ 500 credits × $12 = $6,000/month
- Total: ~$6,000/month
Airbyte Self-Hosted (on AWS):
- EC2 (m5.xlarge): ~$150/month
- RDS (db.t3.medium): ~$50/month
- Engineering time: 4 hrs/month × $150/hr = $600/month
- Total: ~$800/month
The cost differential is substantial, but managed solutions provide:
- 99.9% SLA guarantees
- Automatic schema drift handling
- SOC 2 compliance out-of-box
- 24/7 support
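The break-even arithmetic above generalizes into a simple cost model. The per-unit rates below are illustrative figures derived from the scenario above, not vendor quotes:

```python
def fivetran_monthly(rows_millions: float) -> float:
    """Base platform fee plus ~$500 per million Monthly Active Rows (illustrative)."""
    return 1_000 + rows_millions * 500

def airbyte_cloud_monthly(rows_millions: float) -> float:
    """~5 credits per million rows at $12/credit (illustrative)."""
    return rows_millions * 5 * 12

def airbyte_self_hosted_monthly(eng_hours: float = 4, hourly_rate: float = 150) -> float:
    """EC2 + RDS plus the engineering time that managed vendors absorb."""
    return 150 + 50 + eng_hours * hourly_rate

print(fivetran_monthly(100))          # 51000
print(airbyte_cloud_monthly(100))     # 6000
print(airbyte_self_hosted_monthly())  # 800.0
```

Plugging in your own row volumes makes the crossover point explicit: managed ingestion premiums scale with data volume, while self-hosting scales mostly with engineering time.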
Production Ingestion Configuration
# Airbyte connection configuration (YAML)
sourceDefinitionId: "postgres"
name: "Production PostgreSQL"
connectionConfiguration:
  host: "prod-db.internal.company.com"
  port: 5432
  database: "transactions"
  username: "${POSTGRES_USER}"
  password: "${POSTGRES_PASSWORD}"
  ssl_mode:
    mode: "verify-ca"
    ca_certificate: "${SSL_CA_CERT}"
  replication_method:
    method: "CDC"
    publication: "airbyte_publication"
    replication_slot: "airbyte_slot"
    initial_waiting_seconds: 300
  tunnel_method:
    tunnel_method: "SSH_KEY_AUTH"
    tunnel_host: "bastion.company.com"
    tunnel_port: 22
    tunnel_user: "airbyte"
    ssh_key: "${SSH_PRIVATE_KEY}"
sync:
  schedule:
    scheduleType: "cron"
    cronExpression: "0 */2 * * *"  # Every 2 hours
  namespaceDefinition: "source"
  prefix: "raw_"
  nonBreakingSchemaUpdatesBehavior: "propagate_columns"
CDC vs. Full Refresh Trade-offs
| Approach | Latency | Source Load | Complexity | Data Volume Impact |
|---|---|---|---|---|
| Full Refresh | High (hours) | High | Low | Linear with size |
| Incremental (timestamp) | Medium (minutes) | Medium | Medium | Linear with changes |
| CDC (Debezium/native) | Low (seconds) | Minimal | High | Constant overhead |
Recommendation: Start with incremental loads for tables >1GB, implement CDC only for latency-critical tables (e.g., inventory, orders).
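The incremental pattern recommended above reduces to a watermark query: each sync pulls only rows modified since the last successful run, then persists the new high-water mark. A minimal sketch, with SQLite standing in for the source database (table and column names are illustrative):

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-05"), (3, "2024-01-09")],
)

def incremental_extract(conn, watermark: str):
    """Pull only rows changed after the stored high-water mark."""
    return conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY id",
        (watermark,),
    ).fetchall()

# Suppose the previous sync ran through 2024-01-04; this run picks up only the delta.
delta = incremental_extract(source, "2024-01-04")
print(delta)  # [(2, '2024-01-05'), (3, '2024-01-09')]

# New watermark = max updated_at seen, persisted for the next run.
new_watermark = max(ts for _, ts in delta)
print(new_watermark)  # 2024-01-09
```

The main caveat, which CDC avoids, is that timestamp-based extraction misses hard deletes and any rows whose `updated_at` is not reliably maintained.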
Layer 2: Data Storage (Warehouse/Lakehouse)
The storage layer is the most critical architectural decision, influencing cost, performance, and capabilities for years to come.
Platform Comparison
| Platform | Architecture | Best For | Pricing Model |
|---|---|---|---|
| Snowflake | Shared storage, separate compute | Multi-cloud, governance-heavy | Compute + Storage |
| BigQuery | Serverless, slot-based | GCP-native, ad-hoc queries | On-demand or flat-rate |
| Redshift | Provisioned clusters | AWS-native, predictable workloads | Instance-based |
| Databricks | Lakehouse (Delta Lake) | ML workloads, streaming | DBUs (Databricks Units) |
Snowflake Architecture Deep-Dive
Snowflake's multi-cluster shared data architecture separates compute from storage:
Snowflake Architecture
┌──────────────────────────────────────────────────────────────┐
│ CLOUD SERVICES │
│ (Authentication, Metadata, Query Parsing, Optimization) │
└──────────────────────────────────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ WAREHOUSE XS │ │ WAREHOUSE M │ │ WAREHOUSE XL │
│ (Analytics) │ │ (Transform) │ │ (Data Science) │
│ 1 credit/hr │ │ 4 credits/hr │ │ 16 credits/hr │
└──────────────────┘ └──────────────────┘ └──────────────────┘
│ │ │
└───────────────────┼───────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ CENTRALIZED STORAGE │
│ (S3/Azure Blob/GCS - Immutable) │
│ $23/TB/month │
└──────────────────────────────────────────────────────────────┘
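Warehouse sizes double in credits per hour at each step (XS = 1 up to XL = 16, as in the diagram). A back-of-envelope monthly cost estimate, assuming the ~$3/credit Standard-tier rate used later in this guide:

```python
# Credits per hour double with each warehouse size step (per the diagram above).
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}
CREDIT_PRICE_USD = 3.00  # Standard tier, illustrative

def monthly_compute_cost(size: str, hours_per_day: float, days: int = 30) -> float:
    """Estimated monthly spend for one warehouse on a fixed daily duty cycle."""
    return CREDITS_PER_HOUR[size] * hours_per_day * days * CREDIT_PRICE_USD

# A MEDIUM analytics warehouse active 8 hours/day:
print(monthly_compute_cost("M", 8))   # 2880.0
# The same duty cycle on an XL costs 4x as much:
print(monthly_compute_cost("XL", 8))  # 11520.0
```

This is why aggressive `AUTO_SUSPEND` settings matter: cost scales with active hours times size, so an idle-but-running oversized warehouse dominates the bill.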
Production Warehouse Setup
-- Snowflake multi-environment setup
-- Create separate warehouses for different workloads

-- ETL/ELT processing (runs during off-hours)
CREATE WAREHOUSE IF NOT EXISTS TRANSFORM_WH
  WITH
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY = 'ECONOMY'      -- Favor cost over speed
    WAREHOUSE_TYPE = 'STANDARD'
    INITIALLY_SUSPENDED = TRUE
    COMMENT = 'ELT transformation workloads - scheduled jobs';

-- Analytics queries (interactive, SLA-bound)
CREATE WAREHOUSE IF NOT EXISTS ANALYTICS_WH
  WITH
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 300
    AUTO_RESUME = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 2
    SCALING_POLICY = 'STANDARD'     -- Balance cost and speed
    WAREHOUSE_TYPE = 'STANDARD'
    INITIALLY_SUSPENDED = FALSE
    STATEMENT_TIMEOUT_IN_SECONDS = 900  -- 15 min query timeout
    COMMENT = 'Interactive analytics - BI tools and ad-hoc queries';

-- Data science workloads (memory-intensive)
CREATE WAREHOUSE IF NOT EXISTS DATASCIENCE_WH
  WITH
    WAREHOUSE_SIZE = 'X-LARGE'
    AUTO_SUSPEND = 120
    AUTO_RESUME = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 1
    WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
    INITIALLY_SUSPENDED = TRUE
    COMMENT = 'ML training and feature engineering';

-- Create medallion architecture schemas
CREATE DATABASE IF NOT EXISTS ANALYTICS;
CREATE SCHEMA IF NOT EXISTS ANALYTICS.BRONZE;    -- Raw data
CREATE SCHEMA IF NOT EXISTS ANALYTICS.SILVER;    -- Cleaned, conformed
CREATE SCHEMA IF NOT EXISTS ANALYTICS.GOLD;      -- Business-ready aggregates
CREATE SCHEMA IF NOT EXISTS ANALYTICS.SEMANTIC;  -- Metrics layer
Cost Optimization Strategies
-- Monitor credit consumption by warehouse
SELECT
    warehouse_name,
    DATE_TRUNC('week', start_time) AS week,
    SUM(credits_used) AS total_credits,
    SUM(credits_used) * 3.00 AS estimated_cost_usd  -- Standard tier
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD('month', -3, CURRENT_DATE())
GROUP BY 1, 2
ORDER BY 1, 2;

-- Identify expensive queries for optimization
SELECT
    query_id,
    query_text,
    warehouse_name,
    total_elapsed_time / 1000 AS duration_seconds,
    credits_used_cloud_services,
    bytes_scanned / 1e9 AS gb_scanned,
    partitions_scanned,
    partitions_total
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_DATE())
  AND credits_used_cloud_services > 0.1
ORDER BY credits_used_cloud_services DESC
LIMIT 50;

-- Improve partition pruning on large fact tables with a clustering key
ALTER TABLE ANALYTICS.GOLD.FACT_ORDERS
    CLUSTER BY (order_date, customer_segment);

-- Result caching is a session/account-level parameter (on by default, but verify)
ALTER ACCOUNT SET USE_CACHED_RESULT = TRUE;
Layer 3: Data Transformation
dbt (data build tool) has become the industry standard for data transformation, introducing software engineering practices to analytics workflows.
dbt Project Architecture
dbt_project/
├── dbt_project.yml
├── profiles.yml                  # Connection configuration
├── packages.yml                  # Dependencies
├── models/
│   ├── staging/                  # 1:1 with sources, light cleaning
│   │   ├── stg_salesforce/
│   │   │   ├── _stg_salesforce__models.yml
│   │   │   ├── stg_salesforce__accounts.sql
│   │   │   └── stg_salesforce__opportunities.sql
│   │   └── stg_stripe/
│   │       ├── _stg_stripe__models.yml
│   │       └── stg_stripe__payments.sql
│   ├── intermediate/             # Business logic, joins
│   │   ├── int_revenue_by_account.sql
│   │   └── int_customer_journey.sql
│   └── marts/                    # Business-ready, consumption layer
│       ├── core/
│       │   ├── _core__models.yml
│       │   ├── dim_customers.sql
│       │   ├── dim_products.sql
│       │   └── fct_orders.sql
│       └── marketing/
│           ├── _marketing__models.yml
│           └── rpt_campaign_performance.sql
├── macros/
│   ├── generate_schema_name.sql
│   └── cents_to_dollars.sql
├── tests/
│   └── assert_positive_revenue.sql
├── seeds/
│   └── country_codes.csv
└── analyses/
    └── ad_hoc_revenue_analysis.sql
Production-Grade dbt Models
-- models/marts/core/fct_orders.sql
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    cluster_by=['order_date', 'customer_segment'],
    tags=['core', 'daily'],
    post_hook=[
      "GRANT SELECT ON {{ this }} TO ROLE ANALYTICS_READER"
    ]
  )
}}

WITH orders AS (
    SELECT * FROM {{ ref('stg_stripe__payments') }}
    {% if is_incremental() %}
    WHERE payment_date >= (
        SELECT COALESCE(MAX(order_date), '1900-01-01')
        FROM {{ this }}
    )
    {% endif %}
),

customers AS (
    SELECT * FROM {{ ref('dim_customers') }}
),

products AS (
    SELECT * FROM {{ ref('dim_products') }}
),

final AS (
    SELECT
        -- Keys
        o.payment_id AS order_id,
        o.customer_id,
        o.product_id,

        -- Dimensions
        c.customer_segment,
        c.acquisition_channel,
        c.customer_region,
        p.product_category,
        p.product_line,

        -- Date dimensions
        o.payment_date AS order_date,
        {{ dbt_date.week_start('o.payment_date') }} AS order_week,
        DATE_TRUNC('month', o.payment_date) AS order_month,

        -- Measures
        {{ cents_to_dollars('o.amount_cents') }} AS order_amount_usd,
        o.quantity,
        o.discount_percent,
        {{ cents_to_dollars('o.amount_cents') }} *
            (1 - COALESCE(o.discount_percent, 0) / 100) AS net_revenue_usd,

        -- Metadata
        o.created_at AS _loaded_at,
        CURRENT_TIMESTAMP() AS _transformed_at

    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    LEFT JOIN products p ON o.product_id = p.product_id
    WHERE o.payment_status = 'succeeded'
)

SELECT * FROM final
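Under the hood, `materialized='incremental'` with a `unique_key` behaves like a merge: new rows are inserted and rows whose key already exists are replaced. A minimal sketch of that semantics in Python (an illustration of the concept, not dbt's actual implementation):

```python
def incremental_merge(target: dict, new_rows: list[dict], unique_key: str) -> dict:
    """Upsert new_rows into target keyed by unique_key, mimicking the
    merge semantics of dbt's incremental materialization."""
    merged = dict(target)
    for row in new_rows:
        merged[row[unique_key]] = row  # insert, or overwrite the stale version
    return merged

existing = {
    "o1": {"order_id": "o1", "net_revenue_usd": 25.0},
    "o2": {"order_id": "o2", "net_revenue_usd": 40.0},
}
batch = [
    {"order_id": "o2", "net_revenue_usd": 45.0},  # late-arriving correction
    {"order_id": "o3", "net_revenue_usd": 10.0},  # brand-new order
]

result = incremental_merge(existing, batch, "order_id")
print(sorted(result))                   # ['o1', 'o2', 'o3']
print(result["o2"]["net_revenue_usd"])  # 45.0
```

This is why the `unique_key` matters: without it, re-processed source rows would be appended as duplicates instead of replacing their earlier versions.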
dbt Testing Framework
# models/marts/core/_core__models.yml
version: 2

models:
  - name: fct_orders
    description: |
      Fact table containing all successful orders with denormalized
      customer and product dimensions. Grain: one row per order.

      **SLA**: Updated by 6:00 AM UTC daily
      **Owner**: [email protected]
    meta:
      owner: "Data Platform Team"
      tier: "gold"
      pii: false
    columns:
      - name: order_id
        description: Primary key - unique identifier for each order
        tests:
          - unique
          - not_null
      - name: customer_id
        description: Foreign key to dim_customers
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_amount_usd
        description: Order amount in USD (converted from cents)
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
              inclusive: true
      - name: order_date
        description: Date the order was placed
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "<= CURRENT_DATE()"
              config:
                severity: error
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - order_date
      # Sanity check: net revenue should never be negative
      - dbt_expectations.expect_column_values_to_be_between:
          column_name: net_revenue_usd
          min_value: 0
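Each of the tests above compiles to a SQL query that returns failing rows; zero rows returned means the test passes. The same three checks expressed in plain Python, for intuition (illustrative only, not the SQL dbt generates):

```python
from collections import Counter

rows = [
    {"order_id": "o1", "order_amount_usd": 25.0},
    {"order_id": "o2", "order_amount_usd": 40.0},
    {"order_id": "o3", "order_amount_usd": 10.0},
]

def failing_unique(rows, col):
    """Rows whose value for col appears more than once."""
    counts = Counter(r[col] for r in rows)
    return [r for r in rows if counts[r[col]] > 1]

def failing_not_null(rows, col):
    return [r for r in rows if r[col] is None]

def failing_range(rows, col, lo, hi):
    return [r for r in rows if not (lo <= r[col] <= hi)]

# All three tests pass on this sample: each returns an empty failure set.
print(failing_unique(rows, "order_id"))                    # []
print(failing_not_null(rows, "order_amount_usd"))          # []
print(failing_range(rows, "order_amount_usd", 0, 100000))  # []
```

Framing tests as "queries for bad rows" is also why `severity` works the way it does: the failing-row count is compared against a threshold to decide between warn and error.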
Semantic Layer with dbt Metrics
# models/marts/core/_core__metrics.yml
version: 2

metrics:
  - name: total_revenue
    label: Total Revenue
    description: Sum of net revenue from all orders
    type: simple
    type_params:
      measure: net_revenue_usd
    meta:
      default_granularity: day
      team: finance

  - name: average_order_value
    label: Average Order Value (AOV)
    description: Average revenue per order
    type: derived
    type_params:
      expr: total_revenue / order_count
      metrics:
        - name: total_revenue
        - name: order_count
    meta:
      unit: USD
      precision: 2

  - name: revenue_per_customer
    label: Revenue per Customer
    description: Average lifetime revenue per unique customer
    type: derived
    type_params:
      expr: total_revenue / unique_customers
      metrics:
        - name: total_revenue
        - name: unique_customers
    filter: |
      {{ Dimension('customer_segment') }} != 'test'
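A derived metric is just an expression evaluated over other metrics at a given grain. The AOV definition above, computed directly (metric names are from the YAML; the order data is illustrative):

```python
orders = [
    {"order_id": "o1", "net_revenue_usd": 25.0},
    {"order_id": "o2", "net_revenue_usd": 45.0},
    {"order_id": "o3", "net_revenue_usd": 10.0},
]

# Base metrics
total_revenue = sum(o["net_revenue_usd"] for o in orders)
order_count = len(orders)

# Derived metric: average_order_value = total_revenue / order_count
average_order_value = round(total_revenue / order_count, 2)  # precision: 2, per meta

print(total_revenue)        # 80.0
print(average_order_value)  # 26.67
```

The value of defining this once in the semantic layer is consistency: every BI tool querying `average_order_value` gets the same ratio of the same base metrics, rather than each dashboard re-deriving it.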
Layer 4: Orchestration
Workflow orchestration ensures pipelines run reliably on schedule with proper dependency management.
Orchestrator Comparison
| Tool | Architecture | Strengths | Limitations |
|---|---|---|---|
| Airflow | DAG-based, Python | Mature, extensive ecosystem | Complex deployment, resource-heavy |
| Dagster | Asset-based, Python | Modern UX, data lineage | Smaller ecosystem |
| Prefect | Flow-based, Python | Hybrid deployment, easy start | Cloud-heavy pricing |
| dbt Cloud | dbt-native | Seamless dbt integration | dbt-only, expensive at scale |
Production Airflow DAG
# dags/daily_analytics_pipeline.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator
from airflow.utils.task_group import TaskGroup

default_args = {
    'owner': 'data-platform',
    'depends_on_past': False,
    'email': ['[email protected]'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'retry_exponential_backoff': True,
    'max_retry_delay': timedelta(minutes=30),
}

with DAG(
    dag_id='daily_analytics_pipeline',
    default_args=default_args,
    description='Daily ELT pipeline: Airbyte sync → dbt transform → data quality',
    schedule_interval='0 5 * * *',  # 5 AM UTC daily
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_runs=1,
    tags=['production', 'analytics', 'daily'],
) as dag:
    start = EmptyOperator(task_id='start')

    # Ingestion phase - parallel syncs
    with TaskGroup('ingestion') as ingestion_group:
        sync_salesforce = AirbyteTriggerSyncOperator(
            task_id='sync_salesforce',
            airbyte_conn_id='airbyte_default',
            connection_id='{{ var.value.airbyte_salesforce_conn_id }}',
            asynchronous=False,
            timeout=3600,
        )
        sync_stripe = AirbyteTriggerSyncOperator(
            task_id='sync_stripe',
            airbyte_conn_id='airbyte_default',
            connection_id='{{ var.value.airbyte_stripe_conn_id }}',
            asynchronous=False,
            timeout=3600,
        )
        sync_hubspot = AirbyteTriggerSyncOperator(
            task_id='sync_hubspot',
            airbyte_conn_id='airbyte_default',
            connection_id='{{ var.value.airbyte_hubspot_conn_id }}',
            asynchronous=False,
            timeout=3600,
        )

    # Transformation phase
    with TaskGroup('transformation') as transform_group:
        dbt_staging = DbtCloudRunJobOperator(
            task_id='dbt_run_staging',
            job_id='{{ var.value.dbt_staging_job_id }}',
            dbt_cloud_conn_id='dbt_cloud_default',
            check_interval=30,
            timeout=3600,
        )
        dbt_marts = DbtCloudRunJobOperator(
            task_id='dbt_run_marts',
            job_id='{{ var.value.dbt_marts_job_id }}',
            dbt_cloud_conn_id='dbt_cloud_default',
            check_interval=30,
            timeout=7200,
        )
        dbt_staging >> dbt_marts

    # Notification
    notify_success = SlackWebhookOperator(
        task_id='notify_success',
        slack_webhook_conn_id='slack_analytics',
        message=':white_check_mark: Daily analytics pipeline completed successfully',
        trigger_rule='all_success',
    )
    notify_failure = SlackWebhookOperator(
        task_id='notify_failure',
        slack_webhook_conn_id='slack_analytics',
        message=':x: Daily analytics pipeline FAILED - please investigate',
        trigger_rule='one_failed',
    )

    end = EmptyOperator(
        task_id='end',
        trigger_rule='none_failed_min_one_success',
    )

    # Define dependencies
    start >> ingestion_group >> transform_group >> [notify_success, notify_failure] >> end
Layer 5: Analytics and BI
The analytics layer makes data accessible to business stakeholders through self-service tools.
BI Platform Selection Framework
| Criteria | Looker | Tableau | Power BI | Metabase |
|---|---|---|---|---|
| Pricing | $$$ | $$ | $ | Free/$ |
| Semantic Layer | LookML (strong) | Data Model | Power Query | Basic |
| Self-Service | High | High | High | Medium |
| Governance | Excellent | Good | Good | Basic |
| Embedded Analytics | Native | Native | Native | Native |
| Best For | Enterprise, governance | Visual analysis | Microsoft shops | SMB, developers |
Looker LookML Model
# models/core/orders.model.lkml
connection: "snowflake_analytics"
include: "/views/core/*.view.lkml"

explore: orders {
  label: "Orders Analysis"
  description: "Primary explore for order and revenue analysis"

  join: customers {
    type: left_outer
    relationship: many_to_one
    sql_on: ${orders.customer_id} = ${customers.customer_id} ;;
  }

  join: products {
    type: left_outer
    relationship: many_to_one
    sql_on: ${orders.product_id} = ${products.product_id} ;;
  }

  # Access control
  access_filter: {
    field: customers.region
    user_attribute: allowed_regions
  }

  # Aggregate awareness for performance
  aggregate_table: orders_daily_summary {
    query: {
      dimensions: [order_date, customers.segment, products.category]
      measures: [total_revenue, order_count, average_order_value]
    }
    materialization: {
      datagroup_trigger: daily_etl_complete
    }
  }
}
Decision Framework: Selecting Your Stack
By Organization Size and Maturity
┌─────────────────────────────────────────────────────────────────────┐
│ STACK SELECTION BY STAGE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ STARTUP (< $10M ARR) SCALE-UP ($10-100M) │
│ ├─ Ingestion: Airbyte Cloud ├─ Ingestion: Fivetran │
│ ├─ Warehouse: BigQuery ├─ Warehouse: Snowflake │
│ ├─ Transform: dbt Cloud ├─ Transform: dbt Cloud │
│ ├─ Orchestrate: dbt Cloud ├─ Orchestrate: Dagster Cloud │
│ └─ BI: Metabase/Preset └─ BI: Looker/Tableau │
│ │
│ Est. Cost: $2-5K/month Est. Cost: $20-50K/month │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ ENTERPRISE (> $100M) DATA-INTENSIVE │
│ ├─ Ingestion: Fivetran + CDC ├─ Ingestion: Kafka + Flink │
│ ├─ Warehouse: Snowflake ├─ Warehouse: Databricks │
│ ├─ Transform: dbt + Spark ├─ Transform: Spark + dbt │
│ ├─ Orchestrate: Airflow ├─ Orchestrate: Airflow/Dagster │
│ └─ BI: Looker + Embedded └─ BI: Custom + Looker │
│ │
│ Est. Cost: $100K+/month Est. Cost: $150K+/month │
│ │
└─────────────────────────────────────────────────────────────────────┘
Decision Tree
1. Data Volume
   - < 100GB: BigQuery (serverless simplicity)
   - 100GB - 10TB: Snowflake (balance of features/cost)
   - > 10TB with ML: Databricks (unified lakehouse)
2. Team Composition
   - Analytics-heavy: Looker + dbt Cloud (managed, SQL-first)
   - Engineering-heavy: Self-hosted Airbyte + Airflow + Metabase
   - Mixed: Fivetran + Snowflake + dbt Cloud + Tableau
3. Compliance Requirements
   - SOC 2/HIPAA: Snowflake + Fivetran (enterprise compliance)
   - GDPR (EU data): EU-region deployment, consider Databricks
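The data-volume branch of this tree is mechanical enough to encode directly. A small helper using the thresholds above (a heuristic starting point, not a prescription):

```python
def pick_warehouse(volume_gb: float, ml_heavy: bool = False) -> str:
    """Map data volume (and ML needs) to a warehouse platform,
    following the decision tree's thresholds."""
    if volume_gb > 10_000 and ml_heavy:  # > 10TB with heavy ML workloads
        return "Databricks"
    if volume_gb < 100:                  # small and ad-hoc: serverless wins
        return "BigQuery"
    return "Snowflake"                   # the 100GB - 10TB middle ground

print(pick_warehouse(50))                     # BigQuery
print(pick_warehouse(2_000))                  # Snowflake
print(pick_warehouse(50_000, ml_heavy=True))  # Databricks
```

In practice the team-composition and compliance branches override this: a Microsoft-centric enterprise with 500GB may still be better served by an Azure-native stack than by the volume heuristic alone.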
Total Cost of Ownership Model
A realistic TCO calculation for a mid-size company (500 employees, 50 data users):
Monthly Cost Breakdown:
INGESTION
├─ Fivetran (50 connectors, 500M MAR) $15,000
└─ Subtotal $15,000
STORAGE & COMPUTE
├─ Snowflake Compute (Enterprise) $25,000
├─ Snowflake Storage (10 TB) $2,300
└─ Subtotal $27,300
TRANSFORMATION
├─ dbt Cloud (Team, 5 seats) $5,000
└─ Subtotal $5,000
ORCHESTRATION
├─ Dagster Cloud $2,500
└─ Subtotal $2,500
ANALYTICS
├─ Looker (50 users) $15,000
└─ Subtotal $15,000
OBSERVABILITY
├─ Monte Carlo (data observability) $5,000
└─ Subtotal $5,000
───────────────────────────────────────────────────
TOTAL MONTHLY $69,800
TOTAL ANNUAL $837,600
+ Engineering time (2 FTE × $180K) $360,000
───────────────────────────────────────────────────
TRUE ANNUAL TCO $1,197,600
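The TCO table above reduces to a short calculation; writing it out makes the headline point explicit, namely that engineering headcount is nearly a third of the true annual cost:

```python
# Monthly SaaS line items from the TCO table above.
monthly_saas = {
    "Fivetran": 15_000,
    "Snowflake compute": 25_000,
    "Snowflake storage (10 TB)": 2_300,
    "dbt Cloud": 5_000,
    "Dagster Cloud": 2_500,
    "Looker": 15_000,
    "Monte Carlo": 5_000,
}
annual_engineering = 2 * 180_000  # 2 FTE data engineers

monthly_total = sum(monthly_saas.values())
annual_tco = monthly_total * 12 + annual_engineering

print(monthly_total)  # 69800
print(annual_tco)     # 1197600
```

Swap in your own line items to stress-test vendor decisions: for example, halving the Fivetran bill by self-hosting ingestion is only a win if it adds less than $90,000/year of engineering time.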
Conclusion
Building a modern data stack is not about selecting the "best" tools—it's about selecting the right tools for your organization's specific constraints: budget, team skills, data volume, compliance requirements, and strategic priorities.
Key principles to remember:
- Start managed, optimize later: Begin with fully managed services to reduce time-to-value; self-host only when cost or control demands it
- ELT is the new standard: Push compute to the warehouse; eliminate separate ETL infrastructure
- dbt is non-negotiable: The transformation layer should use dbt or dbt-compatible tooling for maintainability
- Measure TCO holistically: Include engineering time, not just licensing fees
- Plan for governance early: Data governance retrofits are expensive; build access control and lineage from day one
The modern data stack continues to evolve rapidly—what matters is building a foundation that can adapt as your organization's needs change.
References
- The Modern Data Stack: Past, Present, and Future - Benn Stancil
- dbt Documentation
- Snowflake Architecture Whitepaper
- Airbyte Connector Catalog
- State of Data Engineering 2024 - Data Engineering Weekly
- The Data Warehouse Toolkit - Ralph Kimball
Key Takeaways
- ✓ The modern data stack follows ELT (Extract-Load-Transform) rather than traditional ETL, pushing transformation compute to the warehouse
- ✓ Total cost of ownership (TCO) must include compute, storage, egress, and operational overhead, not just licensing fees
- ✓ dbt has become the de facto standard for transformation, creating a shared semantic layer across organizations
- ✓ The 'best' stack depends on data volume, team size, existing infrastructure, and compliance requirements
- ✓ Start with managed services to reduce operational burden, then optimize with self-hosted alternatives as scale demands



