Cloud Data Platform Architecture Patterns
A comprehensive guide to designing cloud-native data platforms—covering lakehouse architecture, multi-cloud considerations, and platform comparisons across AWS, GCP, and Azure
Cloud data platforms have evolved from simple storage-compute architectures to sophisticated ecosystems of integrated services. This guide examines the dominant architectural patterns—data lakes, data warehouses, and the emerging lakehouse paradigm—while providing practical guidance on platform selection and multi-cloud strategies.
Prerequisites
- Familiarity with cloud computing concepts (compute, storage, networking)
- Basic understanding of data warehousing and data lakes
- Experience with at least one major cloud provider

The Evolution of Cloud Data Architecture
Cloud data platforms have evolved through distinct generations:
┌──────────────────────────────────────────────────────────────────────────────┐
│ EVOLUTION OF CLOUD DATA PLATFORMS │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ Gen 1: Lift and Shift (2010-2015) │
│ ───────────────────────────────── │
│ • Traditional data warehouses migrated to cloud VMs │
│ • Same architecture, just different hosting │
│ • High operational burden, limited scalability │
│ Examples: Oracle on EC2, Teradata on Azure VMs │
│ │
│ Gen 2: Cloud-Native Warehouses (2015-2019) │
│ ───────────────────────────────────────────── │
│ • Purpose-built for cloud (separation of storage and compute) │
│ • Elastic scaling, pay-per-use pricing │
│ • Still required ETL into proprietary formats │
│ Examples: Snowflake, BigQuery, Redshift │
│ │
│ Gen 3: Data Lakes (2016-2020) │
│ ───────────────────────────── │
│ • Store everything in object storage (S3, GCS, ADLS) │
│ • Schema-on-read flexibility │
│ • Decoupled storage from processing engines │
│ • Data quality and governance challenges │
│ Examples: Hadoop on EMR, Spark on Databricks │
│ │
│ Gen 4: Lakehouse (2020-Present) │
│ ──────────────────────────────── │
│ • Open table formats on object storage (Delta, Iceberg, Hudi) │
│ • ACID transactions, schema enforcement, time travel │
│ • Best of lakes (flexibility) and warehouses (reliability) │
│ • Unified platform for all data workloads │
│ Examples: Databricks Lakehouse, Snowflake with Iceberg, Dremio │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
The Lakehouse Architecture
The lakehouse combines data lake flexibility with data warehouse reliability.
Core Principles
┌──────────────────────────────────────────────────────────────────────────────┐
│ LAKEHOUSE ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CONSUMPTION LAYER │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ BI │ │ SQL │ │ ML │ │ Streaming │ │ │
│ │ │ (Tableau) │ │ (Redash) │ │ (MLflow) │ │ (Flink) │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ QUERY ENGINES │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Spark SQL │ │ Trino │ │ Photon │ │ Presto │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ OPEN TABLE FORMAT LAYER │ │
│ │ ┌───────────────────────────────────────────────────────────────┐ │ │
│ │ │ Delta Lake / Apache Iceberg / Apache Hudi │ │ │
│ │ │ │ │ │
│ │ │ • ACID Transactions • Time Travel │ │ │
│ │ │ • Schema Evolution • Partition Evolution │ │ │
│ │ │ • Unified Batch/Stream • Z-Ordering / Compaction │ │ │
│ │ └───────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ FILE FORMAT LAYER │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Parquet │ │ ORC │ │ Avro │ │ │
│ │ │ (columnar) │ │ (columnar) │ │ (row-wise) │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ OBJECT STORAGE LAYER │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Amazon S3 │ │ GCS │ │ ADLS │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
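The table-format layer's headline guarantees (ACID commits, time travel) all derive from one idea: an append-only log of immutable table snapshots. The toy sketch below is not Delta Lake's or Iceberg's actual metadata layout, just an illustration of the mechanism they build on: each commit atomically appends a new log entry listing the table's current data files, and reading "as of" a version simply replays the log to that point.

```python
"""Toy transaction log showing how open table formats get atomic
commits and time travel. Illustrative only -- not the real Delta
Lake or Iceberg metadata format."""
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Snapshot:
    version: int
    files: Tuple[str, ...]  # immutable set of data files in this version


class ToyTable:
    def __init__(self):
        self._log: List[Snapshot] = []

    def commit(self, added=(), removed=()) -> int:
        """Atomically publish a new version (one log append = one commit)."""
        current = set(self._log[-1].files) if self._log else set()
        new_files = tuple(sorted((current - set(removed)) | set(added)))
        self._log.append(Snapshot(version=len(self._log), files=new_files))
        return self._log[-1].version

    def files(self, as_of: Optional[int] = None) -> Tuple[str, ...]:
        """Read the latest version, or 'time travel' to an older one."""
        if not self._log:
            return ()
        version = len(self._log) - 1 if as_of is None else as_of
        return self._log[version].files


t = ToyTable()
t.commit(added=["part-0.parquet"])    # version 0
t.commit(added=["part-1.parquet"])    # version 1
t.commit(removed=["part-0.parquet"])  # version 2: a delete/compaction
print(t.files())                      # ('part-1.parquet',)
print(t.files(as_of=0))               # time travel: ('part-0.parquet',)
```

Because old snapshots are never mutated, readers always see a consistent version even while a writer is committing; the real formats add conflict detection and metadata files on object storage on top of this core.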
Open Table Format Comparison
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Ecosystem | Databricks, Spark | Vendor-neutral, broad support | Uber origin, AWS focus |
| ACID Transactions | Yes | Yes | Yes |
| Time Travel | Yes | Yes | Yes |
| Schema Evolution | Yes | Yes (best support) | Yes |
| Partition Evolution | Limited | Full support | Limited |
| Streaming Support | Excellent | Good | Excellent |
| Small File Handling | Auto-compaction | Compaction needed | Auto-compaction |
| Query Engines | Spark, Trino, Flink | Spark, Trino, Flink, Dremio | Spark, Trino, Flink |
| Cloud Integration | AWS, Azure, GCP | AWS, Azure, GCP | AWS focus |
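The "Small File Handling" row matters because object stores and columnar readers pay a fixed per-file overhead, so table formats periodically rewrite many small files into a few large ones. A minimal sketch of the planning step, as greedy first-fit-decreasing bin-packing into roughly target-sized groups (the general idea behind `OPTIMIZE`-style rewrite jobs, not any engine's actual planner):

```python
"""Greedy compaction planner: group small files into bins of about
target_bytes each. Illustrative sketch, not Delta/Iceberg/Hudi's
actual compaction logic."""
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], target_bytes: int,
                    small_threshold: int) -> List[List[str]]:
    # Only files below the threshold are worth rewriting.
    small = sorted((f for f, s in file_sizes.items() if s < small_threshold),
                   key=file_sizes.get, reverse=True)
    bins: List[List[str]] = []
    bin_sizes: List[int] = []
    for f in small:
        # First-fit-decreasing: place in the first bin with room.
        for i, used in enumerate(bin_sizes):
            if used + file_sizes[f] <= target_bytes:
                bins[i].append(f)
                bin_sizes[i] += file_sizes[f]
                break
        else:
            bins.append([f])
            bin_sizes.append(file_sizes[f])
    # A bin with a single file needs no rewrite.
    return [b for b in bins if len(b) > 1]


files = {"a": 30, "b": 70, "c": 40, "d": 60, "e": 500}
print(plan_compaction(files, target_bytes=100, small_threshold=128))
# [['b', 'a'], ['d', 'c']] -- 'e' is already large enough to leave alone
```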
Platform Comparison
AWS Data Platform
┌──────────────────────────────────────────────────────────────────────────────┐
│ AWS DATA PLATFORM │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ INGESTION │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Kinesis │ │ MSK │ │ DMS │ │ AppFlow │ │
│ │ (Stream) │ │ (Kafka) │ │ (Database) │ │ (SaaS) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ STORAGE │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Amazon S3 │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ S3 Standard │ S3 Intelligent │ S3 Glacier │ S3 Glacier Deep │ │ │
│ │ │ (Hot data) │ Tiering │ (Archive) │ Archive │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ PROCESSING │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ EMR │ │ Glue │ │ Athena │ │ Lambda │ │
│ │ (Spark) │ │ (Serverless)│ │ (Query) │ │ (Functions) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ WAREHOUSE │
│ ┌──────────────┐ ┌──────────────────────────────────────────────────────┐ │
│ │ Redshift │ │ Redshift Serverless / Redshift Spectrum │ │
│ │ (Cluster) │ │ (Query S3 directly) │ │
│ └──────────────┘ └──────────────────────────────────────────────────────┘ │
│ │
│ ML & ANALYTICS │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ SageMaker │ │ QuickSight │ │ Bedrock │ │ OpenSearch │ │
│ │ (ML) │ │ (BI) │ │ (Gen AI) │ │ (Search) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ GOVERNANCE │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Lake Form- │ │ Glue │ │ IAM / │ │ CloudTrail │ │
│ │ ation │ │ Catalog │ │ KMS │ │ (Audit) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
GCP Data Platform
┌──────────────────────────────────────────────────────────────────────────────┐
│ GCP DATA PLATFORM │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ INGESTION │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Pub/Sub │ │ Datastream │ │ Transfer │ │ Dataflow │ │
│ │ (Stream) │ │ (CDC) │ │ Service │ │ (Streaming) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ STORAGE │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Google Cloud Storage │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Standard │ Nearline │ Coldline │ Archive │ │ │
│ │ │ (Hot) │ (Monthly) │ (Quarterly) │ (Rare access) │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ PROCESSING │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Dataproc │ │ Dataflow │ │ Cloud │ │ Cloud │ │
│ │ (Spark) │ │ (Beam) │ │ Functions │ │ Run │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ WAREHOUSE │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ BigQuery │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Native │ │ BigLake │ │ Omni │ │ │
│ │ │ Tables │ │ (External) │ │ (Multi-cloud)│ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ML & ANALYTICS │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Vertex AI │ │ Looker │ │ BQML │ │ AutoML │ │
│ │ (ML) │ │ (BI) │ │ (ML in BQ) │ │ (No-code) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ GOVERNANCE │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Dataplex │ │ Data │ │ IAM / │ │ Cloud │ │
│ │ │ │ Catalog │ │ VPC-SC │ │ Logging │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Azure Data Platform
┌──────────────────────────────────────────────────────────────────────────────┐
│ AZURE DATA PLATFORM │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ INGESTION │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Event Hubs │ │ Kafka │ │ Data │ │ ADF │ │
│ │ (Stream) │ │ (HDInsight) │ │ Factory │ │ (Pipelines) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ STORAGE │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Azure Data Lake Storage Gen2 (ADLS) │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Hot │ Cool │ Cold │ Archive │ │ │
│ │ │ (Frequent) │ (Monthly) │ (Quarterly)│ (Rare access) │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ PROCESSING │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ HDInsight │ │ Databricks │ │ Azure │ │ Azure │ │
│ │ (Spark) │ │ │ │ Synapse │ │ Functions │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ WAREHOUSE │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ Azure Synapse Analytics │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Dedicated │ │ Serverless │ │ Spark │ │ │
│ │ │ Pools │ │ SQL │ │ Pools │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ML & ANALYTICS │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Azure ML │ │ Power BI │ │ Cognitive │ │ OpenAI │ │
│ │ │ │ │ │ Services │ │ Service │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ GOVERNANCE │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Purview │ │ Unity │ │ Azure │ │ Monitor │ │
│ │ │ │ Catalog │ │ AD / RBAC │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Platform Selection Matrix
| Factor | AWS | GCP | Azure |
|---|---|---|---|
| Best For | Broadest ecosystem, enterprise | Analytics/ML workloads, simplicity | Microsoft shops, hybrid |
| Warehouse | Redshift (good) | BigQuery (excellent) | Synapse (good) |
| Lakehouse | EMR + S3 + Lake Formation | BigLake + GCS | Databricks + ADLS |
| Streaming | Kinesis (good), MSK (Kafka) | Pub/Sub + Dataflow (excellent) | Event Hubs (good) |
| Serverless | Excellent (Lambda, Glue) | Excellent (Dataflow, Functions) | Good (Functions) |
| ML Platform | SageMaker (comprehensive) | Vertex AI (excellent) | Azure ML (good) |
| Governance | Lake Formation (maturing) | Dataplex (good) | Purview (good) |
| Cost Model | Complex but flexible | Simpler, slot-based | Complex |
| Enterprise Features | Excellent | Good | Excellent |
Reference Architecture: Enterprise Lakehouse
```hcl
# terraform/main.tf (pseudo-code for architecture reference)
#
# Enterprise Lakehouse reference architecture on AWS.
# Demonstrates production patterns for cloud data platforms.

# Network foundation
module "network" {
  source             = "./modules/network"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # Isolated subnets for data workloads
  private_subnets = {
    data_processing = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
    analytics       = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
  }

  # VPC endpoints for AWS services
  endpoints = ["s3", "glue", "athena", "kinesis", "secretsmanager"]
}

# Data lake storage
module "data_lake" {
  source      = "./modules/data-lake"
  bucket_name = "company-data-lake-prod"

  # Medallion architecture zones
  zones = {
    raw = {
      prefix     = "raw/"
      lifecycle  = "INTELLIGENT_TIERING"
      retention  = "7_YEARS"
      versioning = true
    }
    bronze = {
      prefix     = "bronze/"
      lifecycle  = "INTELLIGENT_TIERING"
      retention  = "3_YEARS"
      versioning = true
    }
    silver = {
      prefix     = "silver/"
      lifecycle  = "STANDARD"
      retention  = "2_YEARS"
      versioning = true
    }
    gold = {
      prefix     = "gold/"
      lifecycle  = "STANDARD"
      retention  = "1_YEAR"
      versioning = true
    }
  }

  # Encryption
  kms_key_id = module.security.data_encryption_key_id

  # Replication for disaster recovery
  replication = {
    enabled     = true
    destination = "us-west-2"
  }
}

# Data catalog and governance
module "governance" {
  source = "./modules/governance"

  # AWS Lake Formation for fine-grained access
  lake_formation = {
    data_lake_locations = [module.data_lake.bucket_arn]

    # Tag-based access control
    lf_tags = {
      pii         = ["true", "false"]
      department  = ["engineering", "finance", "marketing", "sales"]
      sensitivity = ["public", "internal", "confidential", "restricted"]
    }
  }

  # Glue Data Catalog
  glue_catalog = {
    databases = ["raw", "bronze", "silver", "gold"]
    crawlers = {
      bronze_crawler = {
        path     = "s3://company-data-lake-prod/bronze/"
        schedule = "cron(0 */4 * * ? *)" # Every 4 hours
      }
    }
  }
}

# Processing clusters
module "processing" {
  source = "./modules/processing"

  # EMR for batch processing
  emr_clusters = {
    batch_processing = {
      release         = "emr-6.10.0"
      applications    = ["Spark", "Hive", "Delta"]
      instance_type   = "r5.4xlarge"
      core_nodes      = 10
      task_nodes_min  = 0
      task_nodes_max  = 50
      spot_percentage = 80
    }
  }

  # Glue for serverless ETL
  glue_jobs = {
    bronze_ingestion = {
      worker_type  = "G.1X"
      workers      = 10
      glue_version = "4.0"
    }
  }
}

# Query and analytics
module "analytics" {
  source = "./modules/analytics"

  # Athena for ad-hoc queries
  athena = {
    workgroups = {
      analyst = {
        output_location          = "s3://company-athena-results/analyst/"
        bytes_scanned_cutoff     = 10737418240 # 10 GB
        enforce_workgroup_config = true
      }
      data_scientist = {
        output_location      = "s3://company-athena-results/data-science/"
        bytes_scanned_cutoff = 107374182400 # 100 GB
      }
    }
  }

  # Redshift for warehouse workloads
  redshift = {
    cluster_type    = "ra3.xlplus"
    number_of_nodes = 4

    # Serverless for variable workloads
    serverless = {
      enabled       = true
      base_capacity = 32 # RPUs
      max_capacity  = 256
    }

    # Spectrum for querying S3
    spectrum_role = module.governance.redshift_spectrum_role_arn
  }
}

# Streaming layer
module "streaming" {
  source = "./modules/streaming"

  # MSK for Kafka
  msk = {
    kafka_version   = "3.4.0"
    broker_nodes    = 6
    instance_type   = "kafka.m5.2xlarge"
    ebs_volume_size = 1000

    # Configuration
    auto_create_topics = false
    retention_hours    = 168 # 7 days
  }

  # Kinesis for serverless streaming
  kinesis = {
    streams = {
      clickstream = {
        shard_count     = 10
        retention_hours = 24
      }
    }
  }

  # Flink for stream processing
  flink = {
    applications = {
      realtime_aggregation = {
        parallelism = 10
      }
    }
  }
}

# Observability
module "observability" {
  source = "./modules/observability"

  # CloudWatch dashboards
  dashboards = ["data-platform-overview", "processing-jobs", "query-performance"]

  # Alarms
  alarms = {
    processing_failure = {
      metric    = "glue_job_failures"
      threshold = 1
      period    = 300
    }
    query_performance = {
      metric    = "athena_query_duration"
      threshold = 300000 # 5 minutes
      statistic = "p95"
    }
  }

  # Data quality monitoring
  data_quality = {
    enabled = true
    engine  = "great_expectations"
  }
}
```
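The Athena workgroup `bytes_scanned_cutoff` values above are easier to reason about in dollars. Athena bills per data scanned (roughly $5 per TB in us-east-1 at the time of writing; verify against current pricing), so a cutoff is effectively a per-query spend cap. A small helper to sanity-check the limits (the $5/TB figure is an assumption baked into the constant):

```python
"""Translate an Athena bytes-scanned cutoff into an approximate
worst-case per-query cost. Assumes ~$5.00 per TB scanned (us-east-1
list price at time of writing -- check current pricing)."""

PRICE_PER_TB_USD = 5.00
TB = 1024 ** 4


def max_query_cost(bytes_scanned_cutoff: int) -> float:
    """Worst-case cost of one query under this workgroup cutoff."""
    return bytes_scanned_cutoff / TB * PRICE_PER_TB_USD


# The two workgroups from the reference architecture:
print(round(max_query_cost(10_737_418_240), 4))   # analyst (10 GiB): 0.0488
print(round(max_query_cost(107_374_182_400), 4))  # data_scientist (100 GiB): 0.4883
```

In other words, the analyst cutoff caps a runaway query at about five cents; the real protection it buys is against full-table scans of multi-TB datasets.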
Multi-Cloud Strategies
Design for Portability
```python
# multicloud/abstractions.py
"""
Abstraction layer for multi-cloud data operations.
Enables portability across AWS, GCP, and Azure.
"""
from abc import ABC, abstractmethod
from enum import Enum
from typing import Any, Dict, List, Optional

import pandas as pd


class CloudProvider(Enum):
    AWS = "aws"
    GCP = "gcp"
    AZURE = "azure"


class ObjectStorage(ABC):
    """Abstract interface for object storage operations."""

    @abstractmethod
    def read_parquet(self, path: str) -> pd.DataFrame:
        ...

    @abstractmethod
    def write_parquet(self, df: pd.DataFrame, path: str,
                      partition_by: Optional[List[str]] = None) -> None:
        ...

    @abstractmethod
    def list_objects(self, prefix: str) -> List[str]:
        ...


class S3Storage(ObjectStorage):
    """AWS S3 implementation (requires boto3 plus s3fs/pyarrow)."""

    def __init__(self, bucket: str, region: str = "us-east-1"):
        import boto3
        self.bucket = bucket
        self.region = region
        self._client = boto3.client("s3", region_name=region)

    def read_parquet(self, path: str) -> pd.DataFrame:
        return pd.read_parquet(f"s3://{self.bucket}/{path}")

    def write_parquet(self, df, path, partition_by=None) -> None:
        df.to_parquet(f"s3://{self.bucket}/{path}", partition_cols=partition_by)

    def list_objects(self, prefix: str) -> List[str]:
        paginator = self._client.get_paginator("list_objects_v2")
        keys: List[str] = []
        for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        return keys


class GCSStorage(ObjectStorage):
    """Google Cloud Storage implementation (requires google-cloud-storage, gcsfs)."""

    def __init__(self, bucket: str, project: str):
        from google.cloud import storage
        self.bucket = bucket
        self.project = project
        self._client = storage.Client(project=project)

    def read_parquet(self, path: str) -> pd.DataFrame:
        return pd.read_parquet(f"gs://{self.bucket}/{path}")

    def write_parquet(self, df, path, partition_by=None) -> None:
        df.to_parquet(f"gs://{self.bucket}/{path}", partition_cols=partition_by)

    def list_objects(self, prefix: str) -> List[str]:
        blobs = self._client.list_blobs(self.bucket, prefix=prefix)
        return [blob.name for blob in blobs]


class ADLSStorage(ObjectStorage):
    """Azure Data Lake Storage Gen2 implementation (requires adlfs)."""

    def __init__(self, account: str, container: str):
        self.account = account
        self.container = container

    def _full_path(self, path: str) -> str:
        return (f"abfss://{self.container}@{self.account}"
                f".dfs.core.windows.net/{path}")

    def read_parquet(self, path: str) -> pd.DataFrame:
        return pd.read_parquet(self._full_path(path))

    def write_parquet(self, df, path, partition_by=None) -> None:
        df.to_parquet(self._full_path(path), partition_cols=partition_by)

    def list_objects(self, prefix: str) -> List[str]:
        from adlfs import AzureBlobFileSystem
        fs = AzureBlobFileSystem(account_name=self.account)
        return fs.ls(f"{self.container}/{prefix}")


def get_storage(provider: CloudProvider, config: Dict[str, Any]) -> ObjectStorage:
    """Factory for cloud storage implementations."""
    if provider == CloudProvider.AWS:
        return S3Storage(config["bucket"], config.get("region", "us-east-1"))
    elif provider == CloudProvider.GCP:
        return GCSStorage(config["bucket"], config["project"])
    elif provider == CloudProvider.AZURE:
        return ADLSStorage(config["account"], config["container"])
    raise ValueError(f"Unknown provider: {provider}")
```
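One practical payoff of the abstraction is testability: pipeline code written only against the storage interface can run in unit tests against an in-memory fake, with a real cloud implementation swapped in via the factory in production. A self-contained sketch (the `InMemoryStorage` class and `promote_to_silver` step are hypothetical test scaffolding, not part of the layer above):

```python
"""In-memory stand-in for an object storage interface, for testing
pipeline logic without cloud credentials. Hypothetical scaffolding
to illustrate the portability pattern; stores objects in a dict."""
from typing import Dict, List


class InMemoryStorage:
    """Duck-typed fake exposing write/read/list_objects."""

    def __init__(self):
        self._objects: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._objects[path] = data

    def read(self, path: str) -> bytes:
        return self._objects[path]

    def list_objects(self, prefix: str) -> List[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def promote_to_silver(storage, bronze_prefix: str) -> List[str]:
    """Example pipeline step written only against the storage interface."""
    promoted = []
    for path in storage.list_objects(bronze_prefix):
        data = storage.read(path)
        new_path = path.replace("bronze/", "silver/", 1)
        storage.write(new_path, data)
        promoted.append(new_path)
    return promoted


fake = InMemoryStorage()
fake.write("bronze/orders/part-0", b"...")
fake.write("bronze/orders/part-1", b"...")
print(promote_to_silver(fake, "bronze/orders/"))
# ['silver/orders/part-0', 'silver/orders/part-1']
```

The same `promote_to_silver` runs unchanged against S3, GCS, or ADLS once a real implementation with the same method shape is passed in.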
Multi-Cloud Data Mesh
┌──────────────────────────────────────────────────────────────────────────────┐
│ MULTI-CLOUD DATA MESH │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────┐ │
│ │ FEDERATED CATALOG │ │
│ │ (Unity Catalog / Dataplex) │ │
│ └──────────────┬──────────────────┘ │
│ │ │
│ ┌────────────────────────┼────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ AWS DOMAIN │ │ GCP DOMAIN │ │ AZURE DOMAIN │ │
│ │ │ │ │ │ │ │
│ │ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │ │
│ │ │ Customer │ │ │ │ Analytics │ │ │ │ Finance │ │ │
│ │ │ Data │ │ │ │ Data │ │ │ │ Data │ │ │
│ │ │ Products │ │ │ │ Products │ │ │ │ Products │ │ │
│ │ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │ │
│ │ │ │ │ │ │ │
│ │ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │ │
│ │ │ S3 + │ │ │ │ GCS + │ │ │ │ ADLS + │ │ │
│ │ │ Delta │ │ │ │ Iceberg │ │ │ │ Delta │ │ │
│ │ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │ │
│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │
│ │ │ │ │
│ └────────────────────────┼────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ CROSS-CLOUD QUERY ENGINE │ │
│ │ (Starburst / Dremio) │ │
│ └─────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Cost Optimization
Storage Tiering Strategy
```python
# cost_optimization/storage_tiering.py
"""
Automated storage tiering for cost optimization.
Moves data between tiers based on access patterns.
"""
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TieringPolicy:
    """Storage tiering policy configuration."""
    hot_to_warm_days: int = 30
    warm_to_cold_days: int = 90
    cold_to_archive_days: int = 365
    hot_tier: str = "STANDARD"
    warm_tier: str = "STANDARD_IA"
    cold_tier: str = "GLACIER_IR"
    archive_tier: str = "DEEP_ARCHIVE"


class StorageTieringManager:
    """Manages automatic storage tiering based on access patterns."""

    def __init__(self, s3_client, policy: Optional[TieringPolicy] = None):
        self.s3 = s3_client
        self.policy = policy or TieringPolicy()

    def analyze_access_patterns(self, bucket: str, prefix: str) -> Dict:
        """Analyze object access patterns for tiering decisions.

        In production this would query S3 Storage Lens or access logs;
        the hard-coded result below stands in for that analysis.
        """
        return {
            'total_objects': 1_000_000,
            'total_size_bytes': 10 * 1024**4,  # 10 TiB
            'by_last_access': {
                'hot': {'count': 50_000, 'size': 500 * 1024**3},       # < 30 days
                'warm': {'count': 200_000, 'size': 2 * 1024**4},       # 30-90 days
                'cold': {'count': 400_000, 'size': 4 * 1024**4},       # 90-365 days
                'archive': {'count': 350_000, 'size': 3.5 * 1024**4},  # > 365 days
            }
        }

    def calculate_cost_savings(self, analysis: Dict) -> Dict:
        """Calculate potential cost savings from optimal tiering."""
        # AWS S3 pricing (us-east-1, approximate, per GB-month)
        pricing = {
            'STANDARD': 0.023,
            'STANDARD_IA': 0.0125,
            'GLACIER_IR': 0.004,
            'DEEP_ARCHIVE': 0.00099,
        }
        by_access = analysis['by_last_access']

        # Current cost (everything in STANDARD)
        total_size_gb = analysis['total_size_bytes'] / 1024**3
        current_cost = total_size_gb * pricing['STANDARD']

        # Optimized cost with each access band in its appropriate tier
        optimized_cost = (
            by_access['hot']['size'] / 1024**3 * pricing['STANDARD'] +
            by_access['warm']['size'] / 1024**3 * pricing['STANDARD_IA'] +
            by_access['cold']['size'] / 1024**3 * pricing['GLACIER_IR'] +
            by_access['archive']['size'] / 1024**3 * pricing['DEEP_ARCHIVE']
        )
        return {
            'current_monthly_cost': current_cost,
            'optimized_monthly_cost': optimized_cost,
            'monthly_savings': current_cost - optimized_cost,
            'annual_savings': (current_cost - optimized_cost) * 12,
            'savings_percentage': (1 - optimized_cost / current_cost) * 100,
        }

    def apply_lifecycle_rules(self, bucket: str) -> None:
        """Apply S3 lifecycle rules implementing the tiering policy."""
        lifecycle_config = {
            'Rules': [
                {
                    'ID': 'TierToWarm',
                    'Filter': {'Prefix': ''},
                    'Status': 'Enabled',
                    'Transitions': [{
                        'Days': self.policy.hot_to_warm_days,
                        'StorageClass': self.policy.warm_tier,
                    }],
                },
                {
                    'ID': 'TierToCold',
                    'Filter': {'Prefix': ''},
                    'Status': 'Enabled',
                    'Transitions': [{
                        'Days': self.policy.warm_to_cold_days,
                        'StorageClass': self.policy.cold_tier,
                    }],
                },
                {
                    'ID': 'TierToArchive',
                    'Filter': {'Prefix': ''},
                    'Status': 'Enabled',
                    'Transitions': [{
                        'Days': self.policy.cold_to_archive_days,
                        'StorageClass': self.policy.archive_tier,
                    }],
                },
            ]
        }
        self.s3.put_bucket_lifecycle_configuration(
            Bucket=bucket,
            LifecycleConfiguration=lifecycle_config,
        )
```
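Running the per-band sizes from `analyze_access_patterns` through the pricing table makes the payoff concrete. The arithmetic below uses the same approximate us-east-1 list prices as the class above and, like it, ignores retrieval and transition charges, which reduce real-world savings:

```python
"""Worked example of the tiering math: ~10 TiB all in S3 Standard
versus tiered by last access. Approximate us-east-1 prices; ignores
retrieval/transition fees."""

pricing = {"STANDARD": 0.023, "STANDARD_IA": 0.0125,
           "GLACIER_IR": 0.004, "DEEP_ARCHIVE": 0.00099}

sizes_gib = {"STANDARD": 500,        # hot: accessed < 30 days ago
             "STANDARD_IA": 2048,    # warm: 30-90 days
             "GLACIER_IR": 4096,     # cold: 90-365 days
             "DEEP_ARCHIVE": 3584}   # archive: > 365 days

current = sum(sizes_gib.values()) * pricing["STANDARD"]
optimized = sum(gib * pricing[tier] for tier, gib in sizes_gib.items())

print(round(current, 2))                          # 235.24 USD/month, all-Standard
print(round(optimized, 2))                        # 57.03 USD/month, tiered
print(round((1 - optimized / current) * 100, 1))  # 75.8 (% saved)
```

This roughly three-quarters reduction is where the "50%+" savings claim in the takeaways below comes from; actual numbers depend on access patterns and retrieval fees.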
Key Takeaways
- Lakehouse is the future: the combination of open table formats (Delta, Iceberg) with object storage provides the best of data lakes and warehouses.
- Choose platforms based on fit, not hype: AWS offers the broadest ecosystem, GCP excels at analytics/ML, and Azure integrates best with Microsoft shops.
- Design for portability: use open formats and abstraction layers. Multi-cloud is increasingly the enterprise reality.
- Governance is foundational: don't bolt on security and governance after the fact; build them into the architecture from day one.
- Optimize costs continuously: storage tiering, reserved capacity, and serverless options can reduce costs by 50%+ when properly managed.
- Serverless isn't always cheaper: understand the cost model. Serverless excels at variable workloads but can be expensive at scale.

