
Building a Modern Data Stack: Architecture, Components, and Decision Framework

A comprehensive guide to designing and implementing a modern data stack, covering component selection, cost analysis, architectural patterns, and real-world decision frameworks for cloud-native data platforms.

Data Stack · Cloud Architecture · Data Engineering · ELT · Data Warehouse · dbt · Snowflake
TL;DR

The modern data stack represents a paradigm shift from monolithic ETL systems to composable, cloud-native architectures. This guide provides a rigorous examination of each layer—from ingestion to analytics—including cost modeling, vendor comparisons, and a decision framework to help organizations select the right tools for their specific requirements and constraints.

Prerequisites
  • Basic understanding of SQL and data modeling concepts
  • Familiarity with cloud computing fundamentals (AWS, GCP, or Azure)
  • Experience with data pipelines or traditional ETL/BI tools

Introduction

The modern data stack has emerged as the dominant architecture for organizations seeking to become data-driven. Unlike the monolithic enterprise data warehouses of the past—characterized by proprietary ETL tools, expensive licensing, and months-long implementations—today's data stack is composable, cloud-native, and designed for agility.

According to the 2024 State of Data Engineering Report, organizations adopting modern data stacks report:

  • 67% faster time-to-insight compared to legacy systems
  • 40% reduction in data engineering operational overhead
  • 3x improvement in self-service analytics adoption

This guide provides a comprehensive framework for designing, implementing, and optimizing a modern data stack tailored to your organization's specific requirements.

What Defines a "Modern" Data Stack?

The modern data stack is characterized by several architectural principles:

1. Cloud-Native by Default

All components are deployed on cloud infrastructure, offering:

  • Elastic scaling: Scale compute independently of storage
  • Pay-per-use pricing: No upfront capital expenditure
  • Managed operations: Reduced infrastructure maintenance burden

2. ELT Over ETL

The paradigm shift from ETL (Extract-Transform-Load) to ELT (Extract-Load-Transform) is fundamental:

Traditional ETL:
Source → Transform (external) → Load → Warehouse
         ↑ Expensive ETL servers

Modern ELT:
Source → Extract → Load → Transform (in-warehouse)
                          ↑ Leverage warehouse compute

By pushing transformations into the data warehouse, organizations leverage the massive compute power of platforms like Snowflake and BigQuery, eliminating the need for separate transformation infrastructure.

3. Composable Architecture

Rather than a single monolithic platform, the modern stack comprises best-of-breed tools that integrate via standard interfaces (APIs, SQL, object storage):

┌────────────────────────────────────────────────────────────────────┐
│                       MODERN DATA STACK LAYERS                      │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌───────────┐ │
│  │  SOURCES    │  │  INGESTION  │  │  STORAGE    │  │ TRANSFORM │ │
│  │             │  │             │  │             │  │           │ │
│  │ Databases   │  │ Fivetran    │  │ Snowflake   │  │ dbt       │ │
│  │ APIs        │→ │ Airbyte     │→ │ BigQuery    │→ │ Spark     │ │
│  │ SaaS Apps   │  │ Stitch      │  │ Redshift    │  │ SQLMesh   │ │
│  │ Events      │  │ Custom      │  │ Databricks  │  │           │ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └───────────┘ │
│                                                           │        │
│                                           ┌───────────────┘        │
│                                           ▼                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                │
│  │  ANALYTICS  │  │ ORCHESTRATE │  │  OBSERVE    │                │
│  │             │  │             │  │             │                │
│  │ Looker      │  │ Airflow     │  │ Monte Carlo │                │
│  │ Tableau     │← │ Dagster     │  │ Great Exp.  │                │
│  │ Metabase    │  │ Prefect     │  │ Elementary  │                │
│  │ Preset      │  │ dbt Cloud   │  │             │                │
│  └─────────────┘  └─────────────┘  └─────────────┘                │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

Layer 1: Data Ingestion

Data ingestion is the foundation of any data stack. The choice between managed and self-hosted solutions significantly impacts operational overhead and cost.

Vendor Comparison Matrix

| Tool     | Type              | Connectors       | Pricing Model                      | Best For                       |
|----------|-------------------|------------------|------------------------------------|--------------------------------|
| Fivetran | Managed           | 300+             | Per MAR (Monthly Active Rows)      | Enterprise, compliance-heavy   |
| Airbyte  | Open-source/Cloud | 350+             | Self-hosted free, Cloud per credit | Cost-conscious, custom sources |
| Stitch   | Managed           | 130+             | Per row volume                     | Mid-market, Talend ecosystem   |
| Meltano  | Open-source       | Singer ecosystem | Self-hosted free                   | Engineering-heavy teams        |
| Hevo     | Managed           | 150+             | Per event                          | Indian market, real-time focus |

Cost Analysis: Fivetran vs. Airbyte

For an organization syncing 100 million rows/month across 20 connectors:

Fivetran (Business Tier):
- Base platform: ~$1,000/month
- MAR cost: 100M rows × $0.50/M = $50,000/month
- Total: ~$51,000/month

Airbyte Cloud:
- Credits: 100M rows ≈ 500 credits × $12 = $6,000/month
- Total: ~$6,000/month

Airbyte Self-Hosted (on AWS):
- EC2 (m5.xlarge): ~$150/month
- RDS (db.t3.medium): ~$50/month
- Engineering time: 4 hrs/month × $150/hr = $600/month
- Total: ~$800/month

The cost differential is substantial, but managed solutions provide:

  • 99.9% SLA guarantees
  • Automatic schema drift handling
  • SOC 2 compliance out-of-box
  • 24/7 support
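
The figures above reduce to a simple break-even model. A sketch in Python, using the illustrative prices from this section as assumptions (not quoted vendor rates):

```python
# Break-even sketch: managed vs. self-hosted ingestion cost per month.
# All rates are the illustrative figures from the text, not vendor quotes.

def fivetran_monthly(rows_millions, base=1_000, per_million_rows=500):
    # Base platform fee plus per-MAR cost ($0.50 per thousand rows = $500/M rows)
    return base + rows_millions * per_million_rows

def airbyte_self_hosted_monthly(infra=200, eng_hours=4, hourly_rate=150):
    # EC2 + RDS infrastructure (~$200) plus engineering maintenance time
    return infra + eng_hours * hourly_rate

print(fivetran_monthly(100))            # 100M rows/month on Fivetran
print(airbyte_self_hosted_monthly())    # same volume, self-hosted
```

The model makes the trade-off explicit: the managed premium buys SLAs, schema-drift handling, and compliance, so the right question is whether those are worth the gap at your row volume.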

Production Ingestion Configuration

# Airbyte connection configuration (YAML)
sourceDefinitionId: "postgres"
name: "Production PostgreSQL"
connectionConfiguration:
  host: "prod-db.internal.company.com"
  port: 5432
  database: "transactions"
  username: "${POSTGRES_USER}"
  password: "${POSTGRES_PASSWORD}"
  ssl_mode:
    mode: "verify-ca"
    ca_certificate: "${SSL_CA_CERT}"
  replication_method:
    method: "CDC"
    publication: "airbyte_publication"
    replication_slot: "airbyte_slot"
    initial_waiting_seconds: 300
  tunnel_method:
    tunnel_method: "SSH_KEY_AUTH"
    tunnel_host: "bastion.company.com"
    tunnel_port: 22
    tunnel_user: "airbyte"
    ssh_key: "${SSH_PRIVATE_KEY}"

sync:
  schedule:
    scheduleType: "cron"
    cronExpression: "0 */2 * * *"  # Every 2 hours
  namespaceDefinition: "source"
  prefix: "raw_"
  nonBreakingSchemaUpdatesBehavior: "propagate_columns"

CDC vs. Full Refresh Trade-offs

| Approach                | Latency          | Source Load | Complexity | Data Volume Impact  |
|-------------------------|------------------|-------------|------------|---------------------|
| Full Refresh            | High (hours)     | High        | Low        | Linear with size    |
| Incremental (timestamp) | Medium (minutes) | Medium      | Medium     | Linear with changes |
| CDC (Debezium/native)   | Low (seconds)    | Minimal     | High       | Constant overhead   |

Recommendation: Start with incremental loads for tables >1GB, implement CDC only for latency-critical tables (e.g., inventory, orders).
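
The incremental approach in the middle row hinges on a watermark: each sync pulls only rows changed since the last successful run. A minimal sketch using an in-memory SQLite source for illustration (table and column names are hypothetical):

```python
# Timestamp-based incremental extraction with a watermark.
# In production the watermark is persisted in state storage between runs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-05"), (3, "2024-01-09")],
)

def extract_incremental(conn, watermark):
    # Pull only rows modified after the last sync's high-water mark
    rows = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ?", (watermark,)
    ).fetchall()
    # Advance the watermark to the newest timestamp seen (ISO dates sort lexically)
    new_watermark = max((r[1] for r in rows), default=watermark)
    return rows, new_watermark

rows, wm = extract_incremental(conn, "2024-01-03")
print(rows, wm)   # only rows 2 and 3; watermark advances to 2024-01-09
```

This is the same pattern CDC generalizes: CDC replaces the timestamp column with the database's transaction log, which is why it catches deletes that a timestamp watermark misses.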

Layer 2: Data Storage (Warehouse/Lakehouse)

The storage layer is the most critical architectural decision, influencing cost, performance, and capabilities for years to come.

Platform Comparison

| Platform   | Architecture                     | Best For                          | Pricing Model           |
|------------|----------------------------------|-----------------------------------|-------------------------|
| Snowflake  | Shared storage, separate compute | Multi-cloud, governance-heavy     | Compute + Storage       |
| BigQuery   | Serverless, slot-based           | GCP-native, ad-hoc queries        | On-demand or flat-rate  |
| Redshift   | Provisioned clusters             | AWS-native, predictable workloads | Instance-based          |
| Databricks | Lakehouse (Delta Lake)           | ML workloads, streaming           | DBUs (Databricks Units) |

Snowflake Architecture Deep-Dive

Snowflake's multi-cluster shared data architecture separates compute from storage:

                    Snowflake Architecture

┌──────────────────────────────────────────────────────────────┐
│                      CLOUD SERVICES                          │
│  (Authentication, Metadata, Query Parsing, Optimization)     │
└──────────────────────────────────────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│   WAREHOUSE XS   │ │   WAREHOUSE M    │ │   WAREHOUSE XL   │
│   (Analytics)    │ │   (Transform)    │ │   (Data Science) │
│   1 credit/hr    │ │   4 credits/hr   │ │   16 credits/hr  │
└──────────────────┘ └──────────────────┘ └──────────────────┘
          │                   │                   │
          └───────────────────┼───────────────────┘
                              ▼
┌──────────────────────────────────────────────────────────────┐
│                    CENTRALIZED STORAGE                       │
│              (S3/Azure Blob/GCS - Immutable)                │
│                     $23/TB/month                             │
└──────────────────────────────────────────────────────────────┘
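
Because compute is billed per credit-hour and auto-suspend stops the meter, a warehouse's monthly cost is a function of size and active hours. A rough estimator, assuming ~$3/credit (a common Standard-tier figure, used here as an assumption, not official pricing):

```python
# Rough monthly compute cost for the warehouses in the diagram.
# Credits/hour double with each T-shirt size; $3/credit is an assumption.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

def monthly_compute_cost(size, hours_per_day, price_per_credit=3.0, days=30):
    # Auto-suspend means you pay for active hours, not wall-clock time
    return CREDITS_PER_HOUR[size] * hours_per_day * days * price_per_credit

# A Medium warehouse active 6 hours/day costs the same as an XS running 24/7
print(monthly_compute_cost("M", 6))
print(monthly_compute_cost("XS", 24))
```

This symmetry is why aggressive AUTO_SUSPEND settings (as in the warehouse DDL below) matter more than picking a smaller size.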

Production Warehouse Setup

-- Snowflake multi-environment setup
-- Create separate warehouses for different workloads

-- ETL/ELT processing (runs during off-hours)
CREATE WAREHOUSE IF NOT EXISTS TRANSFORM_WH
WITH
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY = 'ECONOMY'  -- Favor cost over speed
    WAREHOUSE_TYPE = 'STANDARD'
    INITIALLY_SUSPENDED = TRUE
    COMMENT = 'ELT transformation workloads - scheduled jobs';

-- Analytics queries (interactive, SLA-bound)
CREATE WAREHOUSE IF NOT EXISTS ANALYTICS_WH
WITH
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 300
    AUTO_RESUME = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 2
    SCALING_POLICY = 'STANDARD'  -- Balance cost and speed
    WAREHOUSE_TYPE = 'STANDARD'
    INITIALLY_SUSPENDED = FALSE
    STATEMENT_TIMEOUT_IN_SECONDS = 900  -- 15 min query timeout
    COMMENT = 'Interactive analytics - BI tools and ad-hoc queries';

-- Data science workloads (memory-intensive)
CREATE WAREHOUSE IF NOT EXISTS DATASCIENCE_WH
WITH
    WAREHOUSE_SIZE = 'X-LARGE'
    AUTO_SUSPEND = 120
    AUTO_RESUME = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 1
    WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
    INITIALLY_SUSPENDED = TRUE
    COMMENT = 'ML training and feature engineering';

-- Create medallion architecture schemas
CREATE DATABASE IF NOT EXISTS ANALYTICS;
CREATE SCHEMA IF NOT EXISTS ANALYTICS.BRONZE;   -- Raw data
CREATE SCHEMA IF NOT EXISTS ANALYTICS.SILVER;   -- Cleaned, conformed
CREATE SCHEMA IF NOT EXISTS ANALYTICS.GOLD;     -- Business-ready aggregates
CREATE SCHEMA IF NOT EXISTS ANALYTICS.SEMANTIC; -- Metrics layer

Cost Optimization Strategies

-- Monitor credit consumption by warehouse
SELECT
    warehouse_name,
    DATE_TRUNC('week', start_time) AS week,
    SUM(credits_used) AS total_credits,
    SUM(credits_used) * 3.00 AS estimated_cost_usd  -- Standard tier
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD('month', -3, CURRENT_DATE())
GROUP BY 1, 2
ORDER BY 1, 2;

-- Identify expensive queries for optimization
SELECT
    query_id,
    query_text,
    warehouse_name,
    total_elapsed_time / 1000 AS duration_seconds,
    credits_used_cloud_services,
    bytes_scanned / 1e9 AS gb_scanned,
    partitions_scanned,
    partitions_total
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_DATE())
  AND credits_used_cloud_services > 0.1
ORDER BY credits_used_cloud_services DESC
LIMIT 50;

-- Implement result caching and clustering
ALTER TABLE ANALYTICS.GOLD.FACT_ORDERS
    CLUSTER BY (order_date, customer_segment);

-- Enable result caching (on by default, but verify)
ALTER WAREHOUSE ANALYTICS_WH SET
    USE_CACHED_RESULT = TRUE;

Layer 3: Data Transformation

dbt (data build tool) has become the industry standard for data transformation, introducing software engineering practices to analytics workflows.

dbt Project Architecture

dbt_project/
├── dbt_project.yml
├── profiles.yml              # Connection configuration
├── packages.yml              # Dependencies
├── models/
│   ├── staging/              # 1:1 with sources, light cleaning
│   │   ├── stg_salesforce/
│   │   │   ├── _stg_salesforce__models.yml
│   │   │   ├── stg_salesforce__accounts.sql
│   │   │   └── stg_salesforce__opportunities.sql
│   │   └── stg_stripe/
│   │       ├── _stg_stripe__models.yml
│   │       └── stg_stripe__payments.sql
│   ├── intermediate/         # Business logic, joins
│   │   ├── int_revenue_by_account.sql
│   │   └── int_customer_journey.sql
│   └── marts/                # Business-ready, consumption layer
│       ├── core/
│       │   ├── _core__models.yml
│       │   ├── dim_customers.sql
│       │   ├── dim_products.sql
│       │   └── fct_orders.sql
│       └── marketing/
│           ├── _marketing__models.yml
│           └── rpt_campaign_performance.sql
├── macros/
│   ├── generate_schema_name.sql
│   └── cents_to_dollars.sql
├── tests/
│   └── assert_positive_revenue.sql
├── seeds/
│   └── country_codes.csv
└── analyses/
    └── ad_hoc_revenue_analysis.sql

Production-Grade dbt Models

-- models/marts/core/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        cluster_by=['order_date', 'customer_segment'],
        tags=['core', 'daily'],
        post_hook=[
            "GRANT SELECT ON {{ this }} TO ROLE ANALYTICS_READER"
        ]
    )
}}

WITH orders AS (
    SELECT * FROM {{ ref('stg_stripe__payments') }}
    {% if is_incremental() %}
        WHERE payment_date >= (
            SELECT COALESCE(MAX(order_date), '1900-01-01')
            FROM {{ this }}
        )
    {% endif %}
),

customers AS (
    SELECT * FROM {{ ref('dim_customers') }}
),

products AS (
    SELECT * FROM {{ ref('dim_products') }}
),

final AS (
    SELECT
        -- Keys
        o.payment_id AS order_id,
        o.customer_id,
        o.product_id,

        -- Dimensions
        c.customer_segment,
        c.acquisition_channel,
        c.customer_region,
        p.product_category,
        p.product_line,

        -- Date dimensions
        o.payment_date AS order_date,
        {{ dbt_date.week_start('o.payment_date') }} AS order_week,
        DATE_TRUNC('month', o.payment_date) AS order_month,

        -- Measures
        {{ cents_to_dollars('o.amount_cents') }} AS order_amount_usd,
        o.quantity,
        o.discount_percent,
        {{ cents_to_dollars('o.amount_cents') }} *
            (1 - COALESCE(o.discount_percent, 0) / 100) AS net_revenue_usd,

        -- Metadata
        o.created_at AS _loaded_at,
        CURRENT_TIMESTAMP() AS _transformed_at

    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    LEFT JOIN products p ON o.product_id = p.product_id
    WHERE o.payment_status = 'succeeded'
)

SELECT * FROM final
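
The measure arithmetic in the model above can be checked in isolation. A stand-alone Python rendering of the `cents_to_dollars` macro and the net-revenue expression (the macro's exact SQL is assumed to be a simple division by 100):

```python
# Stand-alone check of the net-revenue formula in fct_orders:
# dollars = cents / 100, then apply the percentage discount.

def cents_to_dollars(amount_cents):
    return round(amount_cents / 100, 2)

def net_revenue_usd(amount_cents, discount_percent=None):
    gross = cents_to_dollars(amount_cents)
    # COALESCE(discount_percent, 0) in the SQL becomes a default of 0 here
    return round(gross * (1 - (discount_percent or 0) / 100), 2)

print(net_revenue_usd(12_500, 10))   # $125.00 gross, 10% off
print(net_revenue_usd(12_500))       # no discount recorded
```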

dbt Testing Framework

# models/marts/core/_core__models.yml
version: 2

models:
  - name: fct_orders
    description: |
      Fact table containing all successful orders with denormalized
      customer and product dimensions. Grain: one row per order.

      **SLA**: Updated by 6:00 AM UTC daily
      **Owner**: [email protected]

    meta:
      owner: "Data Platform Team"
      tier: "gold"
      pii: false

    columns:
      - name: order_id
        description: Primary key - unique identifier for each order
        tests:
          - unique
          - not_null

      - name: customer_id
        description: Foreign key to dim_customers
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id

      - name: order_amount_usd
        description: Order amount in USD (converted from cents)
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
              inclusive: true

      - name: order_date
        description: Date the order was placed
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "<= CURRENT_DATE()"
              config:
                severity: error

    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - order_date

      # Net revenue should never be negative
      - dbt_expectations.expect_column_values_to_be_between:
          column_name: net_revenue_usd
          min_value: 0

Semantic Layer with dbt Metrics

# models/marts/core/_core__metrics.yml
version: 2

metrics:
  - name: total_revenue
    label: Total Revenue
    description: Sum of net revenue from all orders
    type: simple
    type_params:
      measure: net_revenue_usd
    meta:
      default_granularity: day
      team: finance

  - name: average_order_value
    label: Average Order Value (AOV)
    description: Average revenue per order
    type: derived
    type_params:
      expr: total_revenue / order_count
      metrics:
        - name: total_revenue
        - name: order_count
    meta:
      unit: USD
      precision: 2

  - name: revenue_per_customer
    label: Revenue per Customer
    description: Average lifetime revenue per unique customer
    type: derived
    type_params:
      expr: total_revenue / unique_customers
      metrics:
        - name: total_revenue
        - name: unique_customers
    filter: |
      {{ Dimension('customer_segment') }} != 'test'
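
The derived metrics above are just ratios over base measures. A sketch of how a semantic layer evaluates them, with hypothetical input values standing in for query results:

```python
# Evaluating the derived metrics from base measures.
# Inputs are hypothetical aggregates a semantic layer would compute first.

def evaluate_metrics(total_revenue, order_count, unique_customers):
    return {
        "total_revenue": total_revenue,
        "average_order_value": round(total_revenue / order_count, 2),
        "revenue_per_customer": round(total_revenue / unique_customers, 2),
    }

metrics = evaluate_metrics(total_revenue=120_000.0, order_count=800,
                           unique_customers=300)
print(metrics)
```

Defining these ratios once in the metrics layer, rather than in each BI dashboard, is what keeps "AOV" meaning the same thing in finance and marketing reports.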

Layer 4: Orchestration

Workflow orchestration ensures pipelines run reliably on schedule with proper dependency management.

Orchestrator Comparison

| Tool      | Architecture        | Strengths                     | Limitations                        |
|-----------|---------------------|-------------------------------|------------------------------------|
| Airflow   | DAG-based, Python   | Mature, extensive ecosystem   | Complex deployment, resource-heavy |
| Dagster   | Asset-based, Python | Modern UX, data lineage       | Smaller ecosystem                  |
| Prefect   | Flow-based, Python  | Hybrid deployment, easy start | Cloud-heavy pricing                |
| dbt Cloud | dbt-native          | Seamless dbt integration      | dbt-only, expensive at scale       |

Production Airflow DAG

# dags/daily_analytics_pipeline.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator
from airflow.utils.task_group import TaskGroup

default_args = {
    'owner': 'data-platform',
    'depends_on_past': False,
    'email': ['[email protected]'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'retry_exponential_backoff': True,
    'max_retry_delay': timedelta(minutes=30),
}

with DAG(
    dag_id='daily_analytics_pipeline',
    default_args=default_args,
    description='Daily ELT pipeline: Airbyte sync → dbt transform → data quality',
    schedule_interval='0 5 * * *',  # 5 AM UTC daily
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_runs=1,
    tags=['production', 'analytics', 'daily'],
) as dag:

    start = EmptyOperator(task_id='start')

    # Ingestion phase - parallel syncs
    with TaskGroup('ingestion') as ingestion_group:
        sync_salesforce = AirbyteTriggerSyncOperator(
            task_id='sync_salesforce',
            airbyte_conn_id='airbyte_default',
            connection_id='{{ var.value.airbyte_salesforce_conn_id }}',
            asynchronous=False,
            timeout=3600,
        )

        sync_stripe = AirbyteTriggerSyncOperator(
            task_id='sync_stripe',
            airbyte_conn_id='airbyte_default',
            connection_id='{{ var.value.airbyte_stripe_conn_id }}',
            asynchronous=False,
            timeout=3600,
        )

        sync_hubspot = AirbyteTriggerSyncOperator(
            task_id='sync_hubspot',
            airbyte_conn_id='airbyte_default',
            connection_id='{{ var.value.airbyte_hubspot_conn_id }}',
            asynchronous=False,
            timeout=3600,
        )

    # Transformation phase
    with TaskGroup('transformation') as transform_group:
        dbt_staging = DbtCloudRunJobOperator(
            task_id='dbt_run_staging',
            job_id='{{ var.value.dbt_staging_job_id }}',
            dbt_cloud_conn_id='dbt_cloud_default',
            check_interval=30,
            timeout=3600,
        )

        dbt_marts = DbtCloudRunJobOperator(
            task_id='dbt_run_marts',
            job_id='{{ var.value.dbt_marts_job_id }}',
            dbt_cloud_conn_id='dbt_cloud_default',
            check_interval=30,
            timeout=7200,
        )

        dbt_staging >> dbt_marts

    # Notification
    notify_success = SlackWebhookOperator(
        task_id='notify_success',
        slack_webhook_conn_id='slack_analytics',
        message=':white_check_mark: Daily analytics pipeline completed successfully',
        trigger_rule='all_success',
    )

    notify_failure = SlackWebhookOperator(
        task_id='notify_failure',
        slack_webhook_conn_id='slack_analytics',
        message=':x: Daily analytics pipeline FAILED - please investigate',
        trigger_rule='one_failed',
    )

    end = EmptyOperator(
        task_id='end',
        trigger_rule='none_failed_min_one_success',
    )

    # Define dependencies
    start >> ingestion_group >> transform_group >> [notify_success, notify_failure] >> end
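
Underneath the `>>` operator, Airflow is resolving a topological ordering over the task graph. The DAG's dependency chain, reduced to the standard-library equivalent (node names simplified):

```python
# The DAG's dependency chain as a plain topological sort.
# Airflow does this internally; graphlib makes the ordering explicit.
from graphlib import TopologicalSorter

deps = {
    "ingestion": {"start"},
    "transformation": {"ingestion"},
    "notify": {"transformation"},
    "end": {"notify"},
}
order = list(TopologicalSorter(deps).static_order())
print(order)   # tasks in a valid execution order
```

In the real DAG the `ingestion` node fans out into three parallel Airbyte syncs; the topological guarantee is only that every sync finishes before `transformation` starts.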

Layer 5: Analytics and BI

The analytics layer makes data accessible to business stakeholders through self-service tools.

BI Platform Selection Framework

| Criteria           | Looker                 | Tableau         | Power BI        | Metabase        |
|--------------------|------------------------|-----------------|-----------------|-----------------|
| Pricing            | $$$                    | $$              | $               | Free/$          |
| Semantic Layer     | LookML (strong)        | Data Model      | Power Query     | Basic           |
| Self-Service       | High                   | High            | High            | Medium          |
| Governance         | Excellent              | Good            | Good            | Basic           |
| Embedded Analytics | Native                 | Native          | Native          | Native          |
| Best For           | Enterprise, governance | Visual analysis | Microsoft shops | SMB, developers |

Looker LookML Model

# models/core/orders.model.lkml
connection: "snowflake_analytics"

include: "/views/core/*.view.lkml"

explore: orders {
  label: "Orders Analysis"
  description: "Primary explore for order and revenue analysis"

  join: customers {
    type: left_outer
    relationship: many_to_one
    sql_on: ${orders.customer_id} = ${customers.customer_id} ;;
  }

  join: products {
    type: left_outer
    relationship: many_to_one
    sql_on: ${orders.product_id} = ${products.product_id} ;;
  }

  # Access control
  access_filter: {
    field: customers.region
    user_attribute: allowed_regions
  }

  # Aggregate awareness for performance
  aggregate_table: orders_daily_summary {
    query: {
      dimensions: [order_date, customers.segment, products.category]
      measures: [total_revenue, order_count, average_order_value]
    }
    materialization: {
      datagroup_trigger: daily_etl_complete
    }
  }
}

Decision Framework: Selecting Your Stack

By Organization Size and Maturity

┌─────────────────────────────────────────────────────────────────────┐
│                   STACK SELECTION BY STAGE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  STARTUP (< $10M ARR)          SCALE-UP ($10-100M)                 │
│  ├─ Ingestion: Airbyte Cloud   ├─ Ingestion: Fivetran              │
│  ├─ Warehouse: BigQuery        ├─ Warehouse: Snowflake             │
│  ├─ Transform: dbt Cloud       ├─ Transform: dbt Cloud             │
│  ├─ Orchestrate: dbt Cloud     ├─ Orchestrate: Dagster Cloud       │
│  └─ BI: Metabase/Preset        └─ BI: Looker/Tableau               │
│                                                                     │
│  Est. Cost: $2-5K/month        Est. Cost: $20-50K/month            │
│                                                                     │
│  ─────────────────────────────────────────────────────────────────  │
│                                                                     │
│  ENTERPRISE (> $100M)          DATA-INTENSIVE                      │
│  ├─ Ingestion: Fivetran + CDC  ├─ Ingestion: Kafka + Flink         │
│  ├─ Warehouse: Snowflake       ├─ Warehouse: Databricks            │
│  ├─ Transform: dbt + Spark     ├─ Transform: Spark + dbt           │
│  ├─ Orchestrate: Airflow       ├─ Orchestrate: Airflow/Dagster     │
│  └─ BI: Looker + Embedded      └─ BI: Custom + Looker              │
│                                                                     │
│  Est. Cost: $100K+/month       Est. Cost: $150K+/month             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Decision Tree

  1. Data Volume

    • < 100GB: BigQuery (serverless simplicity)
    • 100GB - 10TB: Snowflake (balance of features/cost)
    • > 10TB or ML-heavy: Databricks (unified lakehouse)

  2. Team Composition

    • Analytics-heavy: Looker + dbt Cloud (managed, SQL-first)
    • Engineering-heavy: Self-hosted Airbyte + Airflow + Metabase
    • Mixed: Fivetran + Snowflake + dbt Cloud + Tableau
  3. Compliance Requirements

    • SOC 2/HIPAA: Snowflake + Fivetran (enterprise compliance)
    • GDPR (EU data): EU-region deployment, consider Databricks
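
The volume branch of the tree is mechanical enough to express as a function. A sketch using the thresholds from the text as rules of thumb, not hard cutoffs:

```python
# The data-volume branch of the decision tree, as a function.
# Thresholds come from the text and are rules of thumb, not hard limits.

def recommend_warehouse(data_volume_gb, ml_heavy=False):
    if ml_heavy and data_volume_gb > 10_000:
        return "Databricks"   # > 10 TB with ML workloads: unified lakehouse
    if data_volume_gb < 100:
        return "BigQuery"     # serverless simplicity at small scale
    return "Snowflake"        # 100 GB - 10 TB: balance of features and cost

print(recommend_warehouse(50))
print(recommend_warehouse(2_000))
print(recommend_warehouse(50_000, ml_heavy=True))
```

The team-composition and compliance branches resist this kind of encoding, which is exactly why they deserve more deliberation than the volume question.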

Total Cost of Ownership Model

A realistic TCO calculation for a mid-size company (500 employees, 50 data users):

Monthly Cost Breakdown:

INGESTION
├─ Fivetran (50 connectors, 500M MAR)     $15,000
└─ Subtotal                                $15,000

STORAGE & COMPUTE
├─ Snowflake Compute (Enterprise)          $25,000
├─ Snowflake Storage (10 TB)                $2,300
└─ Subtotal                                $27,300

TRANSFORMATION
├─ dbt Cloud (Team, 5 seats)                $5,000
└─ Subtotal                                 $5,000

ORCHESTRATION
├─ Dagster Cloud                            $2,500
└─ Subtotal                                 $2,500

ANALYTICS
├─ Looker (50 users)                       $15,000
└─ Subtotal                                $15,000

OBSERVABILITY
├─ Monte Carlo (data observability)         $5,000
└─ Subtotal                                 $5,000

───────────────────────────────────────────────────
TOTAL MONTHLY                              $69,800
TOTAL ANNUAL                              $837,600

+ Engineering time (2 FTE × $180K)        $360,000
───────────────────────────────────────────────────
TRUE ANNUAL TCO                         $1,197,600
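
The totals above can be re-derived from their line items, which also makes the model easy to adapt to your own quotes (the 2 FTE × $180K figure is the assumption stated in the breakdown):

```python
# Re-deriving the TCO figures from their line items.
# Replace these illustrative numbers with your own vendor quotes.
monthly = {
    "ingestion": 15_000,
    "storage_and_compute": 25_000 + 2_300,   # Snowflake compute + 10 TB storage
    "transformation": 5_000,
    "orchestration": 2_500,
    "analytics": 15_000,
    "observability": 5_000,
}

total_monthly = sum(monthly.values())
total_annual = total_monthly * 12
true_tco = total_annual + 2 * 180_000        # add 2 FTE of engineering time

print(total_monthly, total_annual, true_tco)
```

Note that headcount is roughly 30% of the true total: a "cheaper" self-hosted stack that adds one FTE of maintenance can easily cost more than the managed alternative.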

Conclusion

Building a modern data stack is not about selecting the "best" tools—it's about selecting the right tools for your organization's specific constraints: budget, team skills, data volume, compliance requirements, and strategic priorities.

Key principles to remember:

  1. Start managed, optimize later: Begin with fully managed services to reduce time-to-value; self-host only when cost or control demands it
  2. ELT is the new standard: Push compute to the warehouse; eliminate separate ETL infrastructure
  3. dbt is non-negotiable: The transformation layer should use dbt or dbt-compatible tooling for maintainability
  4. Measure TCO holistically: Include engineering time, not just licensing fees
  5. Plan for governance early: Data governance retrofits are expensive; build access control and lineage from day one

The modern data stack continues to evolve rapidly—what matters is building a foundation that can adapt as your organization's needs change.

Key Takeaways

  • The modern data stack follows ELT (Extract-Load-Transform) rather than traditional ETL, pushing transformation compute to the warehouse
  • Total cost of ownership (TCO) must include compute, storage, egress, and operational overhead—not just licensing fees
  • dbt has become the de facto standard for transformation, creating a shared semantic layer across organizations
  • The 'best' stack depends on data volume, team size, existing infrastructure, and compliance requirements
  • Start with managed services to reduce operational burden, then optimize with self-hosted alternatives as scale demands
Gemut Analytics Team
Data Engineering Experts