
Cloud Data Platform Architecture Patterns

A comprehensive guide to designing cloud-native data platforms—covering lakehouse architecture, multi-cloud considerations, and platform comparisons across AWS, GCP, and Azure

Cloud Architecture · Data Platform · AWS · GCP · Azure · Lakehouse · Data Lake
TL;DR

Cloud data platforms have evolved from simple storage-compute architectures to sophisticated ecosystems of integrated services. This guide examines the dominant architectural patterns—data lakes, data warehouses, and the emerging lakehouse paradigm—while providing practical guidance on platform selection and multi-cloud strategies.

Prerequisites
  • Familiarity with cloud computing concepts (compute, storage, networking)
  • Basic understanding of data warehousing and data lakes
  • Experience with at least one major cloud provider

The Evolution of Cloud Data Architecture

Cloud data platforms have evolved through distinct generations:

┌──────────────────────────────────────────────────────────────────────────────┐
│                    EVOLUTION OF CLOUD DATA PLATFORMS                         │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Gen 1: Lift and Shift (2010-2015)                                          │
│  ─────────────────────────────────                                           │
│  • Traditional data warehouses migrated to cloud VMs                         │
│  • Same architecture, just different hosting                                 │
│  • High operational burden, limited scalability                              │
│  Examples: Oracle on EC2, Teradata on Azure VMs                              │
│                                                                              │
│  Gen 2: Cloud-Native Warehouses (2015-2019)                                 │
│  ─────────────────────────────────────────────                               │
│  • Purpose-built for cloud (separation of storage and compute)              │
│  • Elastic scaling, pay-per-use pricing                                      │
│  • Still required ETL into proprietary formats                              │
│  Examples: Snowflake, BigQuery, Redshift                                     │
│                                                                              │
│  Gen 3: Data Lakes (2016-2020)                                              │
│  ─────────────────────────────                                               │
│  • Store everything in object storage (S3, GCS, ADLS)                       │
│  • Schema-on-read flexibility                                                │
│  • Decoupled storage from processing engines                                 │
│  • Data quality and governance challenges                                    │
│  Examples: Hadoop on EMR, Spark on Databricks                               │
│                                                                              │
│  Gen 4: Lakehouse (2020-Present)                                            │
│  ────────────────────────────────                                            │
│  • Open table formats on object storage (Delta, Iceberg, Hudi)              │
│  • ACID transactions, schema enforcement, time travel                        │
│  • Best of lakes (flexibility) and warehouses (reliability)                  │
│  • Unified platform for all data workloads                                   │
│  Examples: Databricks Lakehouse, Snowflake with Iceberg, Dremio             │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

The Lakehouse Architecture

The lakehouse combines data lake flexibility with data warehouse reliability.

Core Principles

┌──────────────────────────────────────────────────────────────────────────────┐
│                       LAKEHOUSE ARCHITECTURE                                  │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                     CONSUMPTION LAYER                                │    │
│  │  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐       │    │
│  │  │     BI     │ │    SQL     │ │    ML      │ │  Streaming │       │    │
│  │  │ (Tableau)  │ │ (Redash)   │ │ (MLflow)   │ │ (Flink)    │       │    │
│  │  └────────────┘ └────────────┘ └────────────┘ └────────────┘       │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                         │
│                                    ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      QUERY ENGINES                                   │    │
│  │  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐       │    │
│  │  │ Spark SQL  │ │   Trino    │ │  Photon    │ │   Presto   │       │    │
│  │  │            │ │            │ │            │ │            │       │    │
│  │  └────────────┘ └────────────┘ └────────────┘ └────────────┘       │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                         │
│                                    ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                 OPEN TABLE FORMAT LAYER                              │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Delta Lake / Apache Iceberg / Apache Hudi                    │  │    │
│  │  │                                                               │  │    │
│  │  │  • ACID Transactions    • Time Travel                         │  │    │
│  │  │  • Schema Evolution     • Partition Evolution                 │  │    │
│  │  │  • Unified Batch/Stream • Z-Ordering / Compaction            │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                         │
│                                    ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                     FILE FORMAT LAYER                                │    │
│  │  ┌────────────┐ ┌────────────┐ ┌────────────┐                       │    │
│  │  │  Parquet   │ │    ORC     │ │   Avro     │                       │    │
│  │  │ (columnar) │ │ (columnar) │ │ (row-wise) │                       │    │
│  │  └────────────┘ └────────────┘ └────────────┘                       │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                         │
│                                    ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                   OBJECT STORAGE LAYER                               │    │
│  │  ┌────────────┐ ┌────────────┐ ┌────────────┐                       │    │
│  │  │ Amazon S3  │ │    GCS     │ │    ADLS    │                       │    │
│  │  │            │ │            │ │            │                       │    │
│  │  └────────────┘ └────────────┘ └────────────┘                       │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Open Table Format Comparison

| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Ecosystem | Databricks, Spark | Vendor-neutral, broad support | Uber origin, AWS focus |
| ACID Transactions | Yes | Yes | Yes |
| Time Travel | Yes | Yes | Yes |
| Schema Evolution | Yes | Yes (best support) | Yes |
| Partition Evolution | Limited | Full support | Limited |
| Streaming Support | Excellent | Good | Excellent |
| Small File Handling | Auto-compaction | Compaction needed | Auto-compaction |
| Query Engines | Spark, Trino, Flink | Spark, Trino, Flink, Dremio | Spark, Trino, Flink |
| Cloud Integration | AWS, Azure, GCP | AWS, Azure, GCP | AWS focus |
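
Every format in the table implements ACID and time travel the same basic way: a log of immutable snapshots, each describing the exact set of data files that make up the table at one version. A toy sketch of that mechanism (illustrative only, not any real format's on-disk layout):

```python
from dataclasses import dataclass, field

@dataclass
class TableLog:
    """Toy transaction log: each commit is an immutable snapshot of file paths."""
    snapshots: list = field(default_factory=list)

    def commit(self, added, removed=()):
        """Atomically replace the current file set; returns the new version number."""
        current = set(self.snapshots[-1]) if self.snapshots else set()
        self.snapshots.append(sorted((current - set(removed)) | set(added)))
        return len(self.snapshots) - 1

    def files_at(self, version=-1):
        """Time travel: the file set as of any committed version."""
        return self.snapshots[version]

log = TableLog()
v0 = log.commit(["part-000.parquet"])
log.commit(["part-001.parquet"])
log.commit(["part-002.parquet"], removed=["part-000.parquet"])  # e.g. compaction

print(log.files_at(v0))  # ['part-000.parquet']
print(log.files_at())    # ['part-001.parquet', 'part-002.parquet']
```

Delta Lake's `_delta_log`, Iceberg's metadata and manifest files, and Hudi's timeline are all production-grade versions of this idea, with concurrency control and file statistics layered on top.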

Platform Comparison

AWS Data Platform

┌──────────────────────────────────────────────────────────────────────────────┐
│                          AWS DATA PLATFORM                                    │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  INGESTION                                                                   │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │   Kinesis    │ │     MSK      │ │     DMS      │ │   AppFlow    │        │
│  │   (Stream)   │ │   (Kafka)    │ │  (Database)  │ │    (SaaS)    │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
│  STORAGE                                                                     │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                          Amazon S3                                     │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │  S3 Standard │ S3 Intelligent │ S3 Glacier │ S3 Glacier Deep   │  │  │
│  │  │  (Hot data)  │    Tiering     │  (Archive) │    Archive        │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  PROCESSING                                                                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │     EMR      │ │    Glue      │ │   Athena     │ │   Lambda     │        │
│  │  (Spark)     │ │  (Serverless)│ │   (Query)    │ │ (Functions)  │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
│  WAREHOUSE                                                                   │
│  ┌──────────────┐ ┌──────────────────────────────────────────────────────┐  │
│  │   Redshift   │ │  Redshift Serverless / Redshift Spectrum             │  │
│  │  (Cluster)   │ │  (Query S3 directly)                                 │  │
│  └──────────────┘ └──────────────────────────────────────────────────────┘  │
│                                                                              │
│  ML & ANALYTICS                                                              │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │  SageMaker   │ │  QuickSight  │ │   Bedrock    │ │  OpenSearch  │        │
│  │    (ML)      │ │     (BI)     │ │   (Gen AI)   │ │  (Search)    │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
│  GOVERNANCE                                                                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │  Lake Form-  │ │    Glue      │ │   IAM /      │ │  CloudTrail  │        │
│  │  ation       │ │   Catalog    │ │    KMS       │ │   (Audit)    │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
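
The S3 tiers shown above are typically enforced with a bucket lifecycle configuration rather than by hand. A sketch of building one programmatically; the rule shape below matches what `boto3`'s `put_bucket_lifecycle_configuration` expects, while the prefix and day thresholds are illustrative:

```python
def lifecycle_rule(rule_id, prefix, transitions):
    """Build one S3 lifecycle rule that steps objects through storage classes."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in transitions
        ],
    }

config = {
    "Rules": [
        lifecycle_rule(
            "tier-raw-data",
            "raw/",
            [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")],
        )
    ]
}
# Applied with:
# s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)
```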

GCP Data Platform

┌──────────────────────────────────────────────────────────────────────────────┐
│                          GCP DATA PLATFORM                                    │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  INGESTION                                                                   │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │   Pub/Sub    │ │  Datastream  │ │  Transfer    │ │   Dataflow   │        │
│  │   (Stream)   │ │    (CDC)     │ │   Service    │ │  (Streaming) │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
│  STORAGE                                                                     │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                    Google Cloud Storage                                │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │  Standard  │  Nearline   │  Coldline  │  Archive              │  │  │
│  │  │  (Hot)     │  (Monthly)  │ (Quarterly)│  (Rare access)        │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  PROCESSING                                                                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │  Dataproc    │ │  Dataflow    │ │  Cloud       │ │  Cloud       │        │
│  │  (Spark)     │ │  (Beam)      │ │  Functions   │ │  Run         │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
│  WAREHOUSE                                                                   │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                           BigQuery                                      │ │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                    │ │
│  │  │   Native     │ │   BigLake    │ │  Omni        │                    │ │
│  │  │   Tables     │ │  (External)  │ │ (Multi-cloud)│                    │ │
│  │  └──────────────┘ └──────────────┘ └──────────────┘                    │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  ML & ANALYTICS                                                              │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │  Vertex AI   │ │   Looker     │ │   BQML       │ │  AutoML      │        │
│  │    (ML)      │ │    (BI)      │ │   (ML in BQ) │ │  (No-code)   │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
│  GOVERNANCE                                                                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │   Dataplex   │ │    Data      │ │   IAM /      │ │   Cloud      │        │
│  │              │ │   Catalog    │ │    VPC-SC    │ │   Logging    │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Azure Data Platform

┌──────────────────────────────────────────────────────────────────────────────┐
│                         AZURE DATA PLATFORM                                   │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  INGESTION                                                                   │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │  Event Hubs  │ │    Kafka     │ │   Data       │ │    ADF       │        │
│  │   (Stream)   │ │  (HDInsight) │ │   Factory    │ │  (Pipelines) │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
│  STORAGE                                                                     │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │              Azure Data Lake Storage Gen2 (ADLS)                       │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │   Hot    │   Cool    │   Cold    │   Archive                   │  │  │
│  │  │(Frequent)│  (Monthly)│ (90+ days)│  (Rare access)              │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  PROCESSING                                                                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │  HDInsight   │ │  Databricks  │ │   Azure      │ │   Azure      │        │
│  │  (Spark)     │ │              │ │   Synapse    │ │   Functions  │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
│  WAREHOUSE                                                                   │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                       Azure Synapse Analytics                           │ │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                    │ │
│  │  │  Dedicated   │ │  Serverless  │ │    Spark     │                    │ │
│  │  │   Pools      │ │    SQL       │ │    Pools     │                    │ │
│  │  └──────────────┘ └──────────────┘ └──────────────┘                    │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  ML & ANALYTICS                                                              │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │  Azure ML    │ │  Power BI    │ │   Cognitive  │ │  OpenAI      │        │
│  │              │ │              │ │   Services   │ │   Service    │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
│  GOVERNANCE                                                                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐        │
│  │   Purview    │ │    Unity     │ │   Azure      │ │   Monitor    │        │
│  │              │ │   Catalog    │ │   AD / RBAC  │ │              │        │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘        │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Platform Selection Matrix

| Factor | AWS | GCP | Azure |
|---|---|---|---|
| Best For | Broadest ecosystem, enterprise | Analytics/ML workloads, simplicity | Microsoft shops, hybrid |
| Warehouse | Redshift (good) | BigQuery (excellent) | Synapse (good) |
| Lakehouse | EMR + S3 + Lake Formation | BigLake + GCS | Databricks + ADLS |
| Streaming | Kinesis (good), MSK (Kafka) | Pub/Sub + Dataflow (excellent) | Event Hubs (good) |
| Serverless | Excellent (Lambda, Glue) | Excellent (Dataflow, Functions) | Good (Functions) |
| ML Platform | SageMaker (comprehensive) | Vertex AI (excellent) | Azure ML (good) |
| Governance | Lake Formation (maturing) | Dataplex (good) | Purview (good) |
| Cost Model | Complex but flexible | Simpler, slot-based | Complex |
| Enterprise Features | Excellent | Good | Excellent |
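
A matrix like this is qualitative; for a real selection it helps to weight the factors by what your organization actually values. A minimal scoring sketch (the 1-5 scores and weights below are placeholders for your own assessment, not recommendations):

```python
def score_platform(scores, weights):
    """Weighted sum of per-factor scores (1-5 scale); higher is better."""
    return sum(scores[factor] * weight for factor, weight in weights.items())

# Hypothetical priorities: warehouse and ML matter most to this organization.
weights = {"warehouse": 0.3, "ml": 0.3, "streaming": 0.2, "governance": 0.2}
candidates = {
    "gcp": {"warehouse": 5, "ml": 5, "streaming": 5, "governance": 4},
    "aws": {"warehouse": 4, "ml": 5, "streaming": 4, "governance": 3},
}

best = max(candidates, key=lambda p: score_platform(candidates[p], weights))
print(best)  # gcp
```

The point is not the arithmetic but the discipline: writing weights down forces the trade-offs in the matrix to be argued explicitly.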

Reference Architecture: Enterprise Lakehouse

# terraform/main.tf (pseudo-code for architecture reference)
#
# Enterprise Lakehouse reference architecture on AWS.
# Demonstrates production patterns for cloud data platforms.

# Network Foundation
module "network" {
  source = "./modules/network"

  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # Isolated subnets for data workloads
  private_subnets = {
    data_processing = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
    analytics       = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
  }

  # VPC endpoints for AWS services
  endpoints = ["s3", "glue", "athena", "kinesis", "secretsmanager"]
}

# Data Lake Storage
module "data_lake" {
  source = "./modules/data-lake"

  bucket_name = "company-data-lake-prod"

  # Medallion architecture zones
  zones = {
    raw = {
      prefix       = "raw/"
      lifecycle    = "INTELLIGENT_TIERING"
      retention    = "7_YEARS"
      versioning   = true
    }
    bronze = {
      prefix       = "bronze/"
      lifecycle    = "INTELLIGENT_TIERING"
      retention    = "3_YEARS"
      versioning   = true
    }
    silver = {
      prefix       = "silver/"
      lifecycle    = "STANDARD"
      retention    = "2_YEARS"
      versioning   = true
    }
    gold = {
      prefix       = "gold/"
      lifecycle    = "STANDARD"
      retention    = "1_YEAR"
      versioning   = true
    }
  }

  # Encryption
  kms_key_id = module.security.data_encryption_key_id

  # Replication for disaster recovery
  replication = {
    enabled     = true
    destination = "us-west-2"
  }
}

# Data Catalog and Governance
module "governance" {
  source = "./modules/governance"

  # AWS Lake Formation for fine-grained access
  lake_formation = {
    data_lake_locations = [module.data_lake.bucket_arn]

    # Tag-based access control
    lf_tags = {
      pii        = ["true", "false"]
      department = ["engineering", "finance", "marketing", "sales"]
      sensitivity = ["public", "internal", "confidential", "restricted"]
    }
  }

  # Glue Data Catalog
  glue_catalog = {
    databases = ["raw", "bronze", "silver", "gold"]

    crawlers = {
      bronze_crawler = {
        path     = "s3://company-data-lake-prod/bronze/"
        schedule = "cron(0 */4 * * ? *)"  # Every 4 hours
      }
    }
  }
}

# Processing Clusters
module "processing" {
  source = "./modules/processing"

  # EMR for batch processing
  emr_clusters = {
    batch_processing = {
      release         = "emr-6.10.0"
      applications    = ["Spark", "Hive", "Delta"]
      instance_type   = "r5.4xlarge"
      core_nodes      = 10
      task_nodes_min  = 0
      task_nodes_max  = 50
      spot_percentage = 80
    }
  }

  # Glue for serverless ETL
  glue_jobs = {
    bronze_ingestion = {
      worker_type   = "G.1X"
      workers       = 10
      glue_version  = "4.0"
    }
  }
}

# Query and Analytics
module "analytics" {
  source = "./modules/analytics"

  # Athena for ad-hoc queries
  athena = {
    workgroups = {
      analyst = {
        output_location    = "s3://company-athena-results/analyst/"
        bytes_scanned_cutoff = 10737418240  # 10 GB
        enforce_workgroup_config = true
      }
      data_scientist = {
        output_location    = "s3://company-athena-results/data-science/"
        bytes_scanned_cutoff = 107374182400  # 100 GB
      }
    }
  }

  # Redshift for warehouse workloads
  redshift = {
    cluster_type     = "ra3.xlplus"
    number_of_nodes  = 4

    # Serverless for variable workloads
    serverless = {
      enabled        = true
      base_capacity  = 32  # RPUs
      max_capacity   = 256
    }

    # Spectrum for querying S3
    spectrum_role = module.governance.redshift_spectrum_role_arn
  }
}

# Streaming Layer
module "streaming" {
  source = "./modules/streaming"

  # MSK for Kafka
  msk = {
    kafka_version    = "3.4.0"
    broker_nodes     = 6
    instance_type    = "kafka.m5.2xlarge"
    ebs_volume_size  = 1000

    # Configuration
    auto_create_topics = false
    retention_hours    = 168  # 7 days
  }

  # Kinesis for serverless streaming
  kinesis = {
    streams = {
      clickstream = {
        shard_count    = 10
        retention_hours = 24
      }
    }
  }

  # Flink for stream processing
  flink = {
    applications = {
      realtime_aggregation = {
        parallelism = 10
      }
    }
  }
}

# Observability
module "observability" {
  source = "./modules/observability"

  # CloudWatch dashboards
  dashboards = ["data-platform-overview", "processing-jobs", "query-performance"]

  # Alarms
  alarms = {
    processing_failure = {
      metric    = "glue_job_failures"
      threshold = 1
      period    = 300
    }
    query_performance = {
      metric    = "athena_query_duration"
      threshold = 300000  # 5 minutes
      statistic = "p95"
    }
  }

  # Data quality monitoring
  data_quality = {
    enabled = true
    engine  = "great_expectations"
  }
}
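
The medallion zones declared in the data-lake module above imply a fixed S3 key layout (`<zone>/<dataset>/`). A small helper keeps jobs from hand-assembling those paths inconsistently; a sketch reusing the bucket name from the module, with a hypothetical dataset name:

```python
ZONES = ("raw", "bronze", "silver", "gold")

def lake_path(zone: str, dataset: str,
              bucket: str = "company-data-lake-prod") -> str:
    """Build the canonical s3:// path for a dataset in a medallion zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"s3://{bucket}/{zone}/{dataset}/"

print(lake_path("silver", "orders"))
# s3://company-data-lake-prod/silver/orders/
```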

Multi-Cloud Strategies

Design for Portability

# multicloud/abstractions.py
"""
Abstraction layer for multi-cloud data operations.
Enables portability across AWS, GCP, and Azure.
"""

from abc import ABC, abstractmethod
from typing import List, Dict, Any
from enum import Enum

class CloudProvider(Enum):
    AWS = "aws"
    GCP = "gcp"
    AZURE = "azure"


class ObjectStorage(ABC):
    """Abstract interface for object storage operations.

    'DataFrame' annotations refer to pandas.DataFrame; the string form
    avoids importing pandas at module load time.
    """

    @abstractmethod
    def read_parquet(self, path: str) -> 'DataFrame':
        pass

    @abstractmethod
    def write_parquet(self, df: 'DataFrame', path: str, partition_by: List[str] = None):
        pass

    @abstractmethod
    def list_objects(self, prefix: str) -> List[str]:
        pass


class S3Storage(ObjectStorage):
    """AWS S3 implementation."""

    def __init__(self, bucket: str, region: str = "us-east-1"):
        import boto3  # imported lazily so non-AWS deployments don't need the SDK
        self.bucket = bucket
        self.region = region
        self._client = boto3.client('s3', region_name=region)

    def read_parquet(self, path: str) -> 'DataFrame':
        import pandas as pd
        full_path = f"s3://{self.bucket}/{path}"
        return pd.read_parquet(full_path)

    def write_parquet(self, df: 'DataFrame', path: str, partition_by: List[str] = None):
        # pandas delegates s3:// writes to s3fs/pyarrow under the hood
        full_path = f"s3://{self.bucket}/{path}"
        df.to_parquet(full_path, partition_cols=partition_by)

    # list_objects omitted for brevity


class GCSStorage(ObjectStorage):
    """Google Cloud Storage implementation."""

    def __init__(self, bucket: str, project: str):
        from google.cloud import storage  # lazy import of the GCP SDK
        self.bucket = bucket
        self.project = project
        self._client = storage.Client(project=project)

    def read_parquet(self, path: str) -> 'DataFrame':
        import pandas as pd
        full_path = f"gs://{self.bucket}/{path}"
        return pd.read_parquet(full_path)

    # write_parquet and list_objects omitted for brevity


class ADLSStorage(ObjectStorage):
    """Azure Data Lake Storage Gen2 implementation."""

    def __init__(self, account: str, container: str):
        self.account = account
        self.container = container

    def read_parquet(self, path: str) -> 'DataFrame':
        import pandas as pd
        full_path = f"abfss://{self.container}@{self.account}.dfs.core.windows.net/{path}"
        return pd.read_parquet(full_path)

    # write_parquet and list_objects omitted for brevity


def get_storage(provider: CloudProvider, config: Dict[str, Any]) -> ObjectStorage:
    """Factory for cloud storage implementations."""
    if provider == CloudProvider.AWS:
        return S3Storage(config['bucket'], config.get('region', 'us-east-1'))
    elif provider == CloudProvider.GCP:
        return GCSStorage(config['bucket'], config['project'])
    elif provider == CloudProvider.AZURE:
        return ADLSStorage(config['account'], config['container'])
    else:
        raise ValueError(f"Unknown provider: {provider}")
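
Path construction is one of the first places portability leaks, since each provider has its own URI scheme (`s3://`, `gs://`, `abfss://`). Centralizing it alongside the storage abstraction keeps pipeline code provider-agnostic; a sketch with illustrative account and container names:

```python
def object_uri(provider: str, path: str, **cfg) -> str:
    """Return the provider-native URI for an object path."""
    if provider == "aws":
        return f"s3://{cfg['bucket']}/{path}"
    if provider == "gcp":
        return f"gs://{cfg['bucket']}/{path}"
    if provider == "azure":
        # ADLS Gen2 addressing: container@account, dfs endpoint
        return (f"abfss://{cfg['container']}@{cfg['account']}"
                f".dfs.core.windows.net/{path}")
    raise ValueError(f"unknown provider: {provider}")

print(object_uri("azure", "silver/orders", account="lakeacct", container="data"))
# abfss://data@lakeacct.dfs.core.windows.net/silver/orders
```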

Multi-Cloud Data Mesh

┌──────────────────────────────────────────────────────────────────────────────┐
│                       MULTI-CLOUD DATA MESH                                   │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                    ┌─────────────────────────────────┐                       │
│                    │    FEDERATED CATALOG            │                       │
│                    │    (Unity Catalog / Dataplex)   │                       │
│                    └──────────────┬──────────────────┘                       │
│                                   │                                          │
│          ┌────────────────────────┼────────────────────────┐                 │
│          │                        │                        │                 │
│          ▼                        ▼                        ▼                 │
│  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐         │
│  │     AWS DOMAIN   │   │    GCP DOMAIN    │   │   AZURE DOMAIN   │         │
│  │                  │   │                  │   │                  │         │
│  │  ┌────────────┐  │   │  ┌────────────┐  │   │  ┌────────────┐  │         │
│  │  │  Customer  │  │   │  │  Analytics │  │   │  │  Finance   │  │         │
│  │  │   Data     │  │   │  │   Data     │  │   │  │   Data     │  │         │
│  │  │  Products  │  │   │  │  Products  │  │   │  │  Products  │  │         │
│  │  └────────────┘  │   │  └────────────┘  │   │  └────────────┘  │         │
│  │                  │   │                  │   │                  │         │
│  │  ┌────────────┐  │   │  ┌────────────┐  │   │  ┌────────────┐  │         │
│  │  │   S3 +     │  │   │  │   GCS +    │  │   │  │  ADLS +    │  │         │
│  │  │   Delta    │  │   │  │  Iceberg   │  │   │  │   Delta    │  │         │
│  │  └────────────┘  │   │  └────────────┘  │   │  └────────────┘  │         │
│  └──────────────────┘   └──────────────────┘   └──────────────────┘         │
│          │                        │                        │                 │
│          └────────────────────────┼────────────────────────┘                 │
│                                   ▼                                          │
│                    ┌─────────────────────────────────┐                       │
│                    │    CROSS-CLOUD QUERY ENGINE     │                       │
│                    │    (Starburst / Dremio)         │                       │
│                    └─────────────────────────────────┘                       │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
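
In a mesh like this, the federated catalog's day-to-day job reduces to resolving a data product name to the domain that owns it and the storage location behind it. A toy resolver (the product names and locations are invented for illustration):

```python
# Registry entries: product name -> (owning cloud, storage location)
CATALOG = {
    "customer.profiles":  ("aws",   "s3://customer-domain/gold/profiles/"),
    "analytics.sessions": ("gcp",   "gs://analytics-domain/gold/sessions/"),
    "finance.ledger":     ("azure", "abfss://gold@financedomain.dfs.core.windows.net/ledger/"),
}

def resolve(product: str):
    """Look up which cloud a data product lives in and where."""
    try:
        return CATALOG[product]
    except KeyError:
        raise LookupError(f"data product not registered: {product}") from None

cloud, location = resolve("finance.ledger")
print(cloud)  # azure
```

Unity Catalog and Dataplex play this role for real, adding the pieces a dict cannot: access policies, lineage, and schema contracts attached to each product.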

Cost Optimization

Storage Tiering Strategy

# cost_optimization/storage_tiering.py
"""
Automated storage tiering for cost optimization.
Moves data between tiers based on access patterns.
"""

from datetime import datetime, timedelta
from typing import Dict, List
from dataclasses import dataclass

@dataclass
class TieringPolicy:
    """Storage tiering policy configuration."""
    hot_to_warm_days: int = 30
    warm_to_cold_days: int = 90
    cold_to_archive_days: int = 365

    hot_tier: str = "STANDARD"
    warm_tier: str = "STANDARD_IA"
    cold_tier: str = "GLACIER_IR"
    archive_tier: str = "DEEP_ARCHIVE"


class StorageTieringManager:
    """Manages automatic storage tiering based on access patterns."""

    def __init__(self, s3_client, policy: TieringPolicy = None):
        self.s3 = s3_client
        self.policy = policy or TieringPolicy()

    def analyze_access_patterns(self, bucket: str, prefix: str) -> Dict:
        """Analyze object access patterns for tiering decisions."""

        # Mock numbers for illustration only. In production this would
        # query S3 Storage Lens, CloudWatch metrics, or access logs.

        return {
            'total_objects': 1000000,
            'total_size_bytes': 10 * 1024**4,  # 10 TB
            'by_last_access': {
                'hot': {'count': 50000, 'size': 500 * 1024**3},      # < 30 days
                'warm': {'count': 200000, 'size': 2 * 1024**4},       # 30-90 days
                'cold': {'count': 400000, 'size': 4 * 1024**4},       # 90-365 days
                'archive': {'count': 350000, 'size': 3.5 * 1024**4},  # > 365 days
            }
        }

    def calculate_cost_savings(self, analysis: Dict) -> Dict:
        """Calculate potential cost savings from optimal tiering."""

        # AWS S3 pricing (us-east-1, approximate)
        pricing = {
            'STANDARD': 0.023,       # per GB/month
            'STANDARD_IA': 0.0125,
            'GLACIER_IR': 0.004,
            'DEEP_ARCHIVE': 0.00099,
        }

        by_access = analysis['by_last_access']

        # Current cost (all in STANDARD)
        total_size_gb = analysis['total_size_bytes'] / 1024**3
        current_cost = total_size_gb * pricing['STANDARD']

        # Optimized cost
        optimized_cost = (
            by_access['hot']['size'] / 1024**3 * pricing['STANDARD'] +
            by_access['warm']['size'] / 1024**3 * pricing['STANDARD_IA'] +
            by_access['cold']['size'] / 1024**3 * pricing['GLACIER_IR'] +
            by_access['archive']['size'] / 1024**3 * pricing['DEEP_ARCHIVE']
        )

        return {
            'current_monthly_cost': current_cost,
            'optimized_monthly_cost': optimized_cost,
            'monthly_savings': current_cost - optimized_cost,
            'annual_savings': (current_cost - optimized_cost) * 12,
            'savings_percentage': (1 - optimized_cost / current_cost) * 100
        }

    def apply_lifecycle_rules(self, bucket: str) -> None:
        """Apply lifecycle rules based on tiering policy."""

        # A single rule with ordered transitions avoids overlapping-rule
        # conflicts between multiple rules on the same prefix.
        lifecycle_config = {
            'Rules': [
                {
                    'ID': 'AutomatedTiering',
                    'Filter': {'Prefix': ''},
                    'Status': 'Enabled',
                    'Transitions': [
                        {
                            'Days': self.policy.hot_to_warm_days,
                            'StorageClass': self.policy.warm_tier
                        },
                        {
                            'Days': self.policy.warm_to_cold_days,
                            'StorageClass': self.policy.cold_tier
                        },
                        {
                            'Days': self.policy.cold_to_archive_days,
                            'StorageClass': self.policy.archive_tier
                        }
                    ]
                }
            ]
        }

        self.s3.put_bucket_lifecycle_configuration(
            Bucket=bucket,
            LifecycleConfiguration=lifecycle_config
        )
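As a sanity check on the arithmetic in `calculate_cost_savings`, the mock access-pattern numbers work out as follows. The prices are the same approximate us-east-1 per-GB-month rates used above — treat them as illustrative, not authoritative:

```python
# Reproduce the tiering savings estimate from the mock analysis:
# 500 GiB hot, 2 TiB warm, 4 TiB cold, 3.5 TiB archive.
pricing = {
    'STANDARD': 0.023,       # per GB/month (approximate us-east-1)
    'STANDARD_IA': 0.0125,
    'GLACIER_IR': 0.004,
    'DEEP_ARCHIVE': 0.00099,
}

sizes_gb = {'hot': 500, 'warm': 2048, 'cold': 4096, 'archive': 3584}
target_tier = {
    'hot': 'STANDARD',
    'warm': 'STANDARD_IA',
    'cold': 'GLACIER_IR',
    'archive': 'DEEP_ARCHIVE',
}

# Baseline: everything sitting in STANDARD.
current = sum(sizes_gb.values()) * pricing['STANDARD']
# Optimized: each bucket of data moved to its target tier.
optimized = sum(sizes_gb[t] * pricing[target_tier[t]] for t in sizes_gb)

print(f"current:   ${current:,.2f}/month")    # ~$235/month
print(f"optimized: ${optimized:,.2f}/month")  # ~$57/month
print(f"savings:   {1 - optimized / current:.0%}")  # ~76%
```

Roughly three quarters of the storage bill disappears on this access distribution, which is typical when most of a lake's data is rarely read.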

Key Takeaways

  1. Lakehouse is the future: The combination of open table formats (Delta, Iceberg) with object storage provides the best of data lakes and warehouses.

  2. Choose platforms based on fit, not hype: AWS offers the broadest ecosystem, GCP excels at analytics/ML, Azure integrates best with Microsoft shops.

  3. Design for portability: Use open formats and abstraction layers. Multi-cloud is increasingly the enterprise reality.

  4. Governance is foundational: Don't bolt on security and governance after the fact. Build them into the architecture from day one.

  5. Optimize costs continuously: Storage tiering, reserved capacity, and serverless options can reduce costs by 50%+ when properly managed.

  6. Serverless isn't always cheaper: Understand the cost model. Serverless excels at variable workloads but can be expensive at scale.
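Takeaway 6 becomes concrete with a simple break-even calculation between per-scan (serverless) and flat-rate capacity pricing. The figures below are hypothetical placeholders, not current vendor rates:

```python
# Break-even between serverless per-scan billing and reserved capacity.
# Both prices are hypothetical placeholders, not real vendor rates.
on_demand_per_tb = 6.25     # $ per TB scanned
flat_rate_monthly = 2000.0  # $ per month for reserved capacity

break_even_tb = flat_rate_monthly / on_demand_per_tb
print(f"Break-even: {break_even_tb:.0f} TB scanned per month")  # 320 TB
```

Below the break-even volume the serverless model wins; above it, reserved capacity does — which is why steady, high-volume workloads tend to migrate off pure on-demand pricing.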

References

  1. Armbrust, M. et al. (2021). "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics." CIDR.
  2. Apache Iceberg Documentation: Table Format Specification
  3. AWS Well-Architected Framework: Data Analytics Lens
  4. Google Cloud Architecture Center: Data Analytics Reference Architectures
  5. Azure Architecture Center: Analytics End-to-End with Azure Synapse

Gemut Analytics Team
Data Engineering Experts