16 min read · Intermediate

Telecom Analytics in Emerging Markets: Network Optimization, Churn Prediction, and the USSD Data Goldmine

A comprehensive guide to building telecom analytics platforms in emerging markets, covering CDR pipeline architecture, USSD session analytics, network quality prediction, and churn models for prepaid-dominant markets.

Telecoms · CDR Analytics · USSD · Churn Prediction · Network Optimization
TL;DR

Nigerian telcos serve over 200 million subscribers, generating massive volumes of Call Detail Records, USSD sessions, and network events daily. Unlike mature Western markets dominated by postpaid contracts and app-based interactions, emerging market telecoms face unique data challenges — USSD as a primary service delivery channel, prepaid-dominant subscriber bases, network infrastructure constrained by power instability, and regulatory requirements from the NCC. This guide explores the data engineering and analytics patterns needed to extract actionable intelligence from this data landscape.

Prerequisites
  • Understanding of data warehouse and data lake architectures
  • Familiarity with SQL and Python for data analysis
  • Basic knowledge of machine learning concepts

Introduction

Nigeria's telecommunications sector is one of the most dynamic in Africa. With over 220 million active mobile subscriptions tracked by the Nigerian Communications Commission (NCC), the four major operators — MTN Nigeria, Airtel Nigeria, Globacom (Glo), and 9mobile — collectively generate data volumes that rival operators in markets with twice the population. MTN Nigeria alone serves over 80 million subscribers, making it the largest single-country mobile operator on the African continent.

Yet the data engineering challenges in this market are fundamentally different from those faced by operators in North America or Europe. Consider the contrasts:

┌──────────────────────────┬────────────────────────────┬────────────────────────────────┐
│ Characteristic           │ Western Market             │ Nigerian Market                │
├──────────────────────────┼────────────────────────────┼────────────────────────────────┤
│ Dominant billing model   │ Postpaid (60-70%)          │ Prepaid (95%+)                 │
│ Primary service channel  │ Mobile apps, web portals   │ USSD (*131#, *556#, *141#)     │
│ Subscriber identity      │ Stable contract-based      │ SIM swap, multi-SIM behavior   │
│ Network availability     │ 99.9% uptime target        │ Power-dependent, 85-95%        │
│ Churn signal             │ Contract expiry            │ Gradual activity decay         │
│ Revenue per user         │ $40-60/month ARPU          │ $3-5/month ARPU                │
│ Regulatory body          │ FCC/Ofcom                  │ NCC                            │
└──────────────────────────┴────────────────────────────┴────────────────────────────────┘

These differences are not cosmetic — they demand different data architectures, different analytical models, and different operational patterns. A churn model trained on postpaid contract expiry data is useless in a market where 95% of subscribers simply stop recharging. A network quality model that assumes stable power supply will produce wildly inaccurate predictions for cell sites running on diesel generators.
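In a prepaid market, "churn" therefore has to be defined from activity decay rather than contract events. A minimal sketch of a recency-based status rule (the 30/90-day thresholds are illustrative assumptions, not any operator's policy):

```python
import pandas as pd

def activity_status(last_recharge: pd.Series, as_of: pd.Timestamp) -> pd.Series:
    """Classify prepaid subscribers by recharge recency.

    Thresholds are illustrative: operators tune these windows (and add
    usage-based signals such as outbound call decay) for their own base.
    """
    days_silent = (as_of - last_recharge).dt.days
    return pd.cut(
        days_silent,
        bins=[-1, 30, 90, float("inf")],
        labels=["ACTIVE", "DORMANT", "CHURNED"],
    )

subs = pd.DataFrame({
    "msisdn": ["2348030000001", "2348030000002", "2348030000003"],
    "last_recharge": pd.to_datetime(["2024-06-25", "2024-04-10", "2024-01-05"]),
})
subs["status"] = activity_status(subs["last_recharge"], pd.Timestamp("2024-06-30"))
```

The point is that the label is a moving function of observation date, so the model target must be recomputed for every scoring window.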

This guide provides the data engineering and analytics patterns purpose-built for this reality.

The Telecom Data Landscape in Nigeria

Data Volume Estimates

A large Nigerian telco with 60-80 million subscribers generates approximately the following daily data volumes:

┌──────────────────────────────────────────────────────────────────┐
│              Daily Data Generation (Large Nigerian MNO)           │
├─────────────────────────┬────────────────┬───────────────────────┤
│ Data Source             │ Daily Records  │ Raw Size (compressed) │
├─────────────────────────┼────────────────┼───────────────────────┤
│ Voice CDRs              │ 180-250M       │ 40-60 GB              │
│ SMS CDRs                │ 80-120M        │ 8-12 GB               │
│ Data Session CDRs       │ 300-500M       │ 80-120 GB             │
│ USSD Sessions           │ 40-80M         │ 5-10 GB               │
│ Recharge Events         │ 15-30M         │ 2-4 GB                │
│ Network Events (alarms) │ 5-10M          │ 3-6 GB                │
│ Cell Tower KPIs         │ 2-5M           │ 1-3 GB                │
│ VAS Transactions        │ 20-40M         │ 3-5 GB                │
├─────────────────────────┼────────────────┼───────────────────────┤
│ TOTAL                   │ 640M - 1.03B   │ 142 - 220 GB/day      │
└─────────────────────────┴────────────────┴───────────────────────┘

At these rates, the compressed feeds total roughly 50-80 TB per year; assuming a typical 3-5x compression ratio, the uncompressed raw streams approach 200-400 TB annually. That is substantial, but manageable with modern data lake architectures. The challenge is not raw volume alone; it is the combination of volume, velocity (CDRs must be processed within minutes for fraud detection), variety (structured CDRs, semi-structured USSD logs, unstructured network alarms), and the operational constraints unique to the market.
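As a sanity check, the annualized figures follow directly from the table above (the 3-5x compression ratio for raw CDR feeds is our assumption):

```python
# Annualize the daily compressed volumes from the table above
daily_compressed_gb = (142, 220)
yearly_compressed_tb = tuple(round(gb * 365 / 1000) for gb in daily_compressed_gb)
# roughly 52-80 TB/year as stored

# Assumed 3-5x compression ratio for telco CDR feeds
yearly_raw_tb = (yearly_compressed_tb[0] * 3, yearly_compressed_tb[1] * 5)
# on the order of 156-400 TB/year of uncompressed raw data
```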

The Four Data Domains

Nigerian telco data naturally organizes into four analytical domains, each with distinct engineering requirements:

┌──────────────────────────────────────────────────────────────────┐
│                    Telco Analytics Data Domains                   │
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────┐  ┌────────┐  │
│  │  Subscriber   │  │   Network    │  │  Revenue  │  │ Reg &  │  │
│  │  Analytics    │  │   Analytics  │  │  Analytics│  │Compliance│ │
│  ├──────────────┤  ├──────────────┤  ├───────────┤  ├────────┤  │
│  │ CDR patterns │  │ Cell KPIs    │  │ Recharge  │  │ NCC    │  │
│  │ USSD behavior│  │ Power status │  │ ARPU      │  │ QoS    │  │
│  │ Churn models │  │ Coverage     │  │ VAS usage │  │ KYC    │  │
│  │ Segmentation │  │ Capacity     │  │ Bundles   │  │ Number │  │
│  │ Lifetime val │  │ Optimization │  │ Revenue   │  │ Porting│  │
│  └──────┬───────┘  └──────┬───────┘  └─────┬─────┘  └───┬────┘  │
│         │                 │                │             │        │
│         └────────┬────────┴────────┬───────┘             │        │
│                  │                 │                      │        │
│            ┌─────▼─────┐    ┌─────▼──────┐    ┌─────────▼──────┐ │
│            │ Data Lake  │    │ Data       │    │ Regulatory     │ │
│            │ (Raw Zone) │    │ Warehouse  │    │ Reporting DB   │ │
│            └────────────┘    └────────────┘    └────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Building CDR Analytics Pipelines at Scale

CDR Data Model

A Call Detail Record captures the metadata of every network event. The Nigerian telco CDR typically contains fields beyond what Western operators log, including recharge channel codes and USSD-originated call flags.

Here is a normalized CDR data model suitable for an analytical data warehouse:

-- Core CDR fact table (partitioned by date and record_type)
CREATE TABLE cdr_fact (
    cdr_id              BIGINT GENERATED ALWAYS AS IDENTITY,
    record_type         VARCHAR(10),    -- 'VOICE', 'SMS', 'DATA', 'USSD'
    event_timestamp     TIMESTAMP NOT NULL,
    event_date          DATE NOT NULL,  -- partition key
    event_hour          SMALLINT,

    -- Subscriber identifiers
    calling_msisdn      VARCHAR(15),
    called_msisdn       VARCHAR(15),
    calling_imsi         VARCHAR(20),
    calling_imei         VARCHAR(20),

    -- Network identifiers
    serving_cell_id     VARCHAR(20),
    serving_lac         INTEGER,
    serving_msc         VARCHAR(20),
    terminating_cell_id VARCHAR(20),
    roaming_indicator   BOOLEAN DEFAULT FALSE,

    -- Call attributes
    duration_seconds    INTEGER,
    call_type           VARCHAR(10),   -- 'MO', 'MT', 'CF'
    disconnect_reason   VARCHAR(10),
    bearer_service      VARCHAR(10),

    -- Data session attributes (NULL for voice/SMS)
    bytes_uploaded      BIGINT,
    bytes_downloaded    BIGINT,
    apn                 VARCHAR(50),
    rat_type            VARCHAR(10),   -- '2G', '3G', '4G'

    -- Revenue
    charge_amount       DECIMAL(10,4),
    account_balance     DECIMAL(12,4),
    tariff_plan_id      INTEGER,

    -- Metadata
    switch_id           VARCHAR(20),
    file_id             VARCHAR(50),
    load_timestamp      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
PARTITION BY RANGE (event_date);

-- Monthly partitions (example for June 2024)
CREATE TABLE cdr_fact_2024_06
    PARTITION OF cdr_fact
    FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');
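Pre-creating these monthly partitions is easy to automate ahead of month-end; a small sketch that generates the DDL (the helper name is ours, not a library function):

```python
from datetime import date

def monthly_partition_ddl(year: int, month: int, table: str = "cdr_fact") -> str:
    """Build the CREATE TABLE ... PARTITION OF statement for one month."""
    start = date(year, month, 1)
    # Exclusive upper bound: first day of the next month (handles December)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return (
        f"CREATE TABLE {table}_{year}_{month:02d}\n"
        f"    PARTITION OF {table}\n"
        f"    FOR VALUES FROM ('{start}') TO ('{end}');"
    )
```

`monthly_partition_ddl(2024, 6)` reproduces the June 2024 example above; a scheduler can run this a few months ahead so inserts never land without a partition.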

Subscriber Dimension

Multi-SIM behavior is a defining characteristic of the Nigerian market. A single individual may carry two or three SIM cards — one MTN for calls, one Glo for data, one Airtel for specific USSD services. The subscriber dimension must account for this:

CREATE TABLE dim_subscriber (
    subscriber_key      BIGINT GENERATED ALWAYS AS IDENTITY,
    msisdn              VARCHAR(15) NOT NULL,
    imsi                VARCHAR(20),
    activation_date     DATE,
    last_activity_date  DATE,
    subscriber_status   VARCHAR(20),  -- 'ACTIVE', 'DORMANT', 'CHURNED', 'SUSPENDED'

    -- Nigerian-specific attributes
    nin_linked          BOOLEAN DEFAULT FALSE,  -- NIN-SIM linkage status
    kyc_status          VARCHAR(20),
    registration_region VARCHAR(50),
    current_tariff_plan VARCHAR(50),
    account_type        VARCHAR(10),  -- 'PREPAID', 'POSTPAID', 'HYBRID'

    -- Behavioral segmentation
    arpu_30d            DECIMAL(10,2),
    recharge_frequency  VARCHAR(20),  -- 'DAILY', 'WEEKLY', 'MONTHLY', 'IRREGULAR'
    primary_usage       VARCHAR(20),  -- 'VOICE', 'DATA', 'USSD_SERVICES', 'MIXED'
    device_tier         VARCHAR(20),  -- 'BASIC', 'FEATURE', 'SMARTPHONE_LOW', 'SMARTPHONE_HIGH'

    -- SCD Type 2 effective dates
    effective_from      TIMESTAMP,
    effective_to        TIMESTAMP,
    is_current          BOOLEAN DEFAULT TRUE
);

CREATE INDEX idx_subscriber_msisdn ON dim_subscriber(msisdn) WHERE is_current = TRUE;
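Because attributes like tariff plan and device tier change over time, the dimension is maintained as a Type 2 slowly changing dimension. A minimal pandas sketch of the expire-and-insert step (production pipelines do this in the warehouse with MERGE; the helper is ours for illustration):

```python
import pandas as pd

def scd2_apply(dim: pd.DataFrame, update: dict, now: pd.Timestamp) -> pd.DataFrame:
    """Expire the current row for a subscriber and append the new version."""
    mask = (dim["msisdn"] == update["msisdn"]) & dim["is_current"]
    dim.loc[mask, ["effective_to", "is_current"]] = [now, False]
    new_row = {**update, "effective_from": now,
               "effective_to": pd.NaT, "is_current": True}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim = pd.DataFrame([{
    "msisdn": "2348031234567", "current_tariff_plan": "PULSE",
    "effective_from": pd.Timestamp("2024-01-01"),
    "effective_to": pd.NaT, "is_current": True,
}])
dim = scd2_apply(dim, {"msisdn": "2348031234567",
                       "current_tariff_plan": "XTRAVALUE"},
                 pd.Timestamp("2024-06-15"))
```

After the call there are two versions of the subscriber, exactly one of them current, which is what the partial index on `is_current = TRUE` above relies on.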

Pipeline Architecture

The CDR ingestion pipeline must handle 500M+ records daily with sub-hour latency for operational analytics while also feeding batch processes for deeper analysis:

┌──────────────────────────────────────────────────────────────────────┐
│                    CDR Analytics Pipeline Architecture                │
│                                                                      │
│  ┌─────────┐   ┌─────────┐   ┌──────────┐   ┌───────────────────┐  │
│  │ Network  │   │ Mediation│   │  Kafka   │   │  Stream Processing│  │
│  │ Switches │──▶│ Platform │──▶│  Cluster │──▶│  (Spark/Flink)    │  │
│  │ (MSC/    │   │          │   │          │   │                   │  │
│  │  SGSN/   │   │ - Parse  │   │ Topics:  │   │ - Deduplication   │  │
│  │  GGSN)   │   │ - Validate│  │  voice   │   │ - Enrichment      │  │
│  └─────────┘   │ - Normalize│  │  sms     │   │ - Fraud scoring   │  │
│                │          │   │  data    │   │ - Real-time agg   │  │
│                └─────────┘   │  ussd    │   └────────┬──────────┘  │
│                              │  recharge│            │              │
│                              └──────────┘   ┌────────▼──────────┐  │
│                                             │  Landing Zone     │  │
│                                             │  (S3 / HDFS)      │  │
│                                             │  Parquet format   │  │
│                                             └────────┬──────────┘  │
│                                                      │              │
│                              ┌────────────────────────┤              │
│                              │                        │              │
│                     ┌────────▼────────┐    ┌─────────▼──────────┐  │
│                     │ Batch Processing │    │ Real-time Serving  │  │
│                     │ (Spark / dbt)    │    │ (Redis / Druid)    │  │
│                     │                  │    │                    │  │
│                     │ - Daily agg      │    │ - Live dashboards  │  │
│                     │ - Feature eng    │    │ - Fraud alerts     │  │
│                     │ - ML features    │    │ - Threshold alerts │  │
│                     │ - Reporting      │    │                    │  │
│                     └────────┬─────────┘    └────────────────────┘  │
│                              │                                       │
│                     ┌────────▼─────────┐                            │
│                     │ Analytics        │                            │
│                     │ Warehouse        │                            │
│                     │ (Snowflake /     │                            │
│                     │  BigQuery /      │                            │
│                     │  ClickHouse)     │                            │
│                     └──────────────────┘                            │
└──────────────────────────────────────────────────────────────────────┘
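A back-of-envelope check on sizing the Kafka tier for these volumes (the per-partition throughput and peak factor are assumptions; measure them on your own hardware):

```python
import math

def partitions_needed(peak_msgs_per_sec: int,
                      per_partition_msgs_per_sec: int = 10_000,
                      headroom: float = 2.0) -> int:
    """Minimum partition count for a topic at peak rate, with headroom."""
    return math.ceil(peak_msgs_per_sec * headroom / per_partition_msgs_per_sec)

# Data CDR topic: ~500M records/day averages out to ~5,787 msgs/sec;
# assume a 5x peak-to-average factor for the evening busy hour
avg_rate = 500_000_000 // 86_400
peak_rate = avg_rate * 5
```

At these assumptions `partitions_needed(peak_rate)` comes out to a handful of partitions; in practice topics are over-partitioned well beyond this so consumer groups can scale out.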

CDR Deduplication and Quality

Duplicate CDRs are a persistent problem, particularly during switch failovers or mediation platform retries. A deduplication strategy is essential before any analytics:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("CDR_Deduplication") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

def deduplicate_cdrs(raw_cdr_df):
    """
    Deduplicate CDRs using a composite key approach.

    Nigerian CDR feeds commonly produce 2-5% duplicates due to:
    - Mediation platform retries during network instability
    - Switch failover generating duplicate records
    - File-level reprocessing after delivery failures
    """
    # Define deduplication key
    dedup_window = Window.partitionBy(
        "calling_msisdn",
        "called_msisdn",
        "event_timestamp",
        "duration_seconds",
        "record_type"
    ).orderBy(F.col("file_id").desc())  # Prefer latest file

    deduped_df = raw_cdr_df \
        .withColumn("row_num", F.row_number().over(dedup_window)) \
        .filter(F.col("row_num") == 1) \
        .drop("row_num")

    # Log deduplication metrics
    raw_count = raw_cdr_df.count()
    deduped_count = deduped_df.count()
    dup_rate = (raw_count - deduped_count) / raw_count * 100

    print(f"Raw records: {raw_count:,}")
    print(f"After dedup: {deduped_count:,}")
    print(f"Duplicate rate: {dup_rate:.2f}%")

    return deduped_df


def validate_cdr_quality(cdr_df):
    """
    Apply quality checks specific to Nigerian CDR data.

    Returns a DataFrame with quality flags for downstream filtering.
    """
    validated_df = cdr_df \
        .withColumn("is_valid_msisdn",
            F.col("calling_msisdn").rlike("^234[0-9]{10}$")
        ) \
        .withColumn("is_valid_duration",
            (F.col("duration_seconds") >= 0) &
            (F.col("duration_seconds") < 86400)  # Max 24 hours
        ) \
        .withColumn("is_valid_timestamp",
            (F.col("event_timestamp") >= F.lit("2020-01-01")) &
            (F.col("event_timestamp") <= F.current_timestamp())
        ) \
        .withColumn("is_future_dated",
            F.col("event_timestamp") > F.current_timestamp()
        ) \
        .withColumn("quality_score",
            F.col("is_valid_msisdn").cast("int") +
            F.col("is_valid_duration").cast("int") +
            F.col("is_valid_timestamp").cast("int")
        )

    # Nigerian MSISDNs in international format: 234 + 10 digits
    # (local prefixes below map 0XXX -> 234XXX, e.g. 0803 -> 234803)
    # Valid prefixes: 0803, 0806, 0810, 0813, 0814, 0816 (MTN)
    #                 0802, 0808, 0812, 0701, 0708, 0902 (Airtel)
    #                 0805, 0807, 0811, 0815, 0905, 0915 (Glo)
    #                 0809, 0817, 0818, 0908, 0909 (9mobile)

    return validated_df

Partitioning Strategy for CDR Storage

At 500M+ records per day, partitioning strategy directly impacts query performance. The recommended approach for Nigerian telco CDR data:

-- Primary partitioning: Date (daily granularity)
-- Secondary partitioning: Record type
-- Clustering: Calling MSISDN (for subscriber-level queries)

-- For ClickHouse (common in telco analytics):
CREATE TABLE cdr_fact_optimized (
    event_date          Date,
    event_timestamp     DateTime,
    record_type         LowCardinality(String),
    calling_msisdn      String,
    called_msisdn       String,
    serving_cell_id     String,
    duration_seconds    UInt32,
    bytes_downloaded    UInt64,
    bytes_uploaded      UInt64,
    charge_amount       Decimal64(4),
    rat_type            LowCardinality(String),
    tariff_plan_id      UInt16
)
ENGINE = MergeTree()
PARTITION BY (toYYYYMM(event_date), record_type)
ORDER BY (calling_msisdn, event_timestamp)
TTL event_date + INTERVAL 24 MONTH;

-- Query performance benchmarks (100-node ClickHouse cluster):
-- Single subscriber 30-day history:  < 200ms
-- Hourly aggregation (all subs):     < 15 seconds
-- Daily traffic report (1 day):      < 5 seconds
-- Full table scan (1 month):         ~ 45 seconds

CDR Enrichment Pipeline

Raw CDRs must be enriched with subscriber, network, and geographic context before they become analytically useful:

from pyspark.sql import functions as F

def enrich_cdrs(cdr_df, subscriber_df, cell_tower_df, tariff_df):
    """
    Enrich raw CDRs with dimensional context.

    This is the most resource-intensive step in the pipeline,
    requiring broadcast joins for dimension tables and shuffled
    joins for the large subscriber dimension.
    """
    # Broadcast small dimensions (cell towers: ~50K rows in Nigeria)
    enriched = cdr_df \
        .join(
            F.broadcast(cell_tower_df.select(
                "cell_id", "site_name", "latitude", "longitude",
                "state", "lga", "region", "site_type",
                "power_source", "has_generator_backup"
            )),
            cdr_df.serving_cell_id == cell_tower_df.cell_id,
            "left"
        ) \
        .join(
            F.broadcast(tariff_df.select(
                "tariff_plan_id", "plan_name", "plan_category",
                "voice_rate_per_sec", "data_rate_per_mb"
            )),
            on="tariff_plan_id",
            how="left"
        ) \
        .join(
            subscriber_df.select(
                "msisdn", "subscriber_key", "subscriber_status",
                "arpu_30d", "device_tier", "recharge_frequency",
                "primary_usage", "nin_linked"
            ),
            cdr_df.calling_msisdn == subscriber_df.msisdn,
            "left"
        )

    # Add derived fields specific to Nigerian context
    enriched = enriched \
        .withColumn("is_on_net",
            F.when(
                F.col("called_msisdn").substr(1, 6).isin(
                    ["234803", "234806", "234810", "234813",  # MTN prefixes
                     "234814", "234816", "234903", "234906"]
                ), True
            ).otherwise(False)
        ) \
        .withColumn("time_band",
            F.when(F.col("event_hour").between(0, 5), "OFF_PEAK")
             .when(F.col("event_hour").between(6, 8), "MORNING_PEAK")
             .when(F.col("event_hour").between(9, 17), "BUSINESS")
             .when(F.col("event_hour").between(18, 21), "EVENING_PEAK")
             .otherwise("NIGHT")
        ) \
        .withColumn("is_weekend",
            F.dayofweek(F.col("event_date")).isin([1, 7])
        ) \
        .withColumn("geopolitical_zone",
            F.when(F.col("state").isin(
                ["Lagos", "Ogun", "Oyo", "Ondo", "Ekiti", "Osun"]
            ), "South-West")
             .when(F.col("state").isin(
                ["Rivers", "Bayelsa", "Delta", "Edo", "Akwa Ibom", "Cross River"]
            ), "South-South")
             .when(F.col("state").isin(
                ["Abia", "Anambra", "Ebonyi", "Enugu", "Imo"]
            ), "South-East")
             .when(F.col("state").isin(
                ["Kano", "Kaduna", "Katsina", "Jigawa", "Kebbi", "Sokoto", "Zamfara"]
            ), "North-West")
             .when(F.col("state").isin(
                ["Borno", "Yobe", "Adamawa", "Bauchi", "Gombe", "Taraba"]
            ), "North-East")
             .otherwise("North-Central")
        )

    return enriched

USSD Session Analytics: The Untapped Goldmine

Why USSD Matters in Nigeria

In mature markets, user behavior analytics centers on app usage, web sessions, and API calls. In Nigeria, USSD (Unstructured Supplementary Service Data) remains the primary service delivery channel for the majority of subscribers. When a subscriber dials *131# on MTN to check their balance, or *556# on MTN to buy a data bundle, or *141# on Airtel to recharge — they are generating structured session data that reveals intent, preferences, and friction points.

The scale is significant:

  • 40-80 million USSD sessions per day across a major operator
  • Average of 3-5 USSD interactions per active subscriber per day
  • USSD is the primary channel for mobile money (bank USSD codes like *737# for GTBank, *901# for Access Bank, *894# for FirstBank)

Despite this volume, USSD analytics is chronically underinvested. Most operators treat USSD logs as operational data for troubleshooting, not as a first-class analytics stream. This represents one of the largest untapped data assets in emerging market telecoms.

USSD Session Data Model

USSD sessions have a hierarchical structure — a session contains multiple screens, each representing a menu interaction:

-- USSD session table
CREATE TABLE ussd_sessions (
    session_id          VARCHAR(50) PRIMARY KEY,
    msisdn              VARCHAR(15) NOT NULL,
    ussd_code           VARCHAR(20) NOT NULL,   -- e.g., '*131#', '*556#'
    service_type        VARCHAR(50),             -- 'BALANCE_CHECK', 'DATA_PURCHASE', etc.
    session_start       TIMESTAMP NOT NULL,
    session_end         TIMESTAMP,
    session_duration_ms INTEGER,
    total_screens       SMALLINT,
    final_screen        SMALLINT,               -- Last screen reached
    session_outcome     VARCHAR(20),            -- 'COMPLETED', 'ABANDONED', 'TIMEOUT', 'ERROR'
    gateway_id          VARCHAR(20),
    serving_cell_id     VARCHAR(20),
    event_date          DATE NOT NULL
);

-- USSD screen-level interactions
CREATE TABLE ussd_screens (
    screen_id           BIGINT GENERATED ALWAYS AS IDENTITY,
    session_id          VARCHAR(50) NOT NULL,
    screen_number       SMALLINT NOT NULL,
    screen_text         TEXT,                   -- Menu content displayed
    user_input          VARCHAR(100),           -- Subscriber's selection
    response_time_ms    INTEGER,                -- Time to respond
    screen_timestamp    TIMESTAMP NOT NULL,

    FOREIGN KEY (session_id) REFERENCES ussd_sessions(session_id)
);

-- Common USSD service codes (Nigerian operators)
-- MTN:    *131# (balance), *556# (data), *461# (tariff), *600# (customer care)
-- Airtel: *141# (recharge), *141*712# (data), *123# (balance)
-- Glo:    *124# (balance), *127*0# (data), *777# (self-service)
-- 9mobile: *232# (balance), *229# (data), *200# (self-service)
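The session_outcome column is usually derived at load time from raw gateway signals rather than logged directly; a sketch of the classification rule (signal names and the timeout threshold are illustrative assumptions):

```python
def classify_outcome(reached_final_menu: bool,
                     gateway_error: bool,
                     idle_ms_at_close: int,
                     timeout_ms: int = 120_000) -> str:
    """Map raw gateway close signals to the session_outcome codes above."""
    if gateway_error:
        return "ERROR"
    if reached_final_menu:
        return "COMPLETED"
    # Session closed by the network after the subscriber went silent
    if idle_ms_at_close >= timeout_ms:
        return "TIMEOUT"
    # Subscriber cancelled or hung up mid-menu
    return "ABANDONED"
```

Deriving the outcome deterministically at ingestion keeps every downstream funnel and abandonment metric consistent.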

Modeling USSD Funnels

USSD sessions follow predictable menu trees. By modeling these as funnels, you can identify drop-off points, quantify conversion rates, and optimize the user experience:

import pandas as pd
from collections import defaultdict

def build_ussd_funnel(sessions_df, ussd_code, date_range):
    """
    Build a conversion funnel for a specific USSD service.

    Example: MTN *556# data bundle purchase flow:
    Screen 1: Main menu (Buy Data Bundle)
    Screen 2: Select bundle type (Daily/Weekly/Monthly)
    Screen 3: Select specific bundle (1GB, 2GB, 5GB...)
    Screen 4: Confirm purchase
    Screen 5: Success/Failure notification
    """
    filtered = sessions_df[
        (sessions_df['ussd_code'] == ussd_code) &
        (sessions_df['event_date'].between(*date_range))
    ]

    total_sessions = len(filtered)
    if total_sessions == 0:
        return None

    funnel = {}
    for screen_num in range(1, 6):
        reached = len(filtered[filtered['total_screens'] >= screen_num])
        funnel[f'Screen {screen_num}'] = {
            'sessions': reached,
            'rate': reached / total_sessions * 100,
            'drop_off': (total_sessions - reached) / total_sessions * 100
        }

    completed = len(filtered[filtered['session_outcome'] == 'COMPLETED'])
    funnel['Completed'] = {
        'sessions': completed,
        'rate': completed / total_sessions * 100
    }

    # Analyze abandonment reasons
    abandoned = filtered[filtered['session_outcome'] == 'ABANDONED']
    abandon_by_screen = abandoned.groupby('final_screen').size()

    funnel['abandonment_distribution'] = abandon_by_screen.to_dict()

    return funnel


def analyze_ussd_response_times(screens_df, ussd_code):
    """
    Analyze user response times at each USSD screen.

    High response times at specific screens indicate:
    - Confusing menu options
    - Too many choices presented
    - Users reading promotional text instead of navigating

    Typical benchmarks:
    - Screen 1 (main menu): 2-4 seconds
    - Selection screens: 3-6 seconds
    - Confirmation screens: 1-3 seconds
    - Response > 15 seconds: likely confused or distracted
    - USSD timeout: 120 seconds (standard)
    """
    filtered = screens_df[screens_df['ussd_code'] == ussd_code]

    stats = filtered.groupby('screen_number')['response_time_ms'].agg([
        'count', 'mean', 'median',
        lambda x: x.quantile(0.95)
    ]).rename(columns={'<lambda_0>': 'p95'})

    stats['mean'] = stats['mean'] / 1000   # Convert to seconds
    stats['median'] = stats['median'] / 1000
    stats['p95'] = stats['p95'] / 1000

    return stats
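On a small synthetic frame, the funnel logic above reduces to a couple of vectorized comparisons (the sample values here are hypothetical):

```python
import pandas as pd

# Hypothetical mini-sample of *556# sessions
sessions = pd.DataFrame({
    "ussd_code": ["*556#"] * 6,
    "total_screens": [5, 5, 2, 3, 1, 4],
    "session_outcome": ["COMPLETED", "COMPLETED", "ABANDONED",
                        "TIMEOUT", "COMPLETED", "ABANDONED"],
})

total = len(sessions)
# Share of sessions that reached each screen (screen 1 is 1.0 by construction)
funnel = {n: (sessions["total_screens"] >= n).mean() for n in range(1, 6)}
completion_rate = (sessions["session_outcome"] == "COMPLETED").mean()
```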

USSD Session Flow Visualization Query

Understanding how subscribers navigate USSD menus requires analyzing session flows as directed graphs:

-- Build USSD navigation graph for a specific service code
-- This query produces the edges for a Sankey or flow diagram

WITH screen_transitions AS (
    SELECT
        s1.session_id,
        s1.screen_number AS from_screen,
        s1.user_input AS user_selection,
        s2.screen_number AS to_screen,
        s2.screen_text AS next_screen_content
    FROM ussd_screens s1
    INNER JOIN ussd_screens s2
        ON s1.session_id = s2.session_id
        AND s2.screen_number = s1.screen_number + 1
    INNER JOIN ussd_sessions sess
        ON s1.session_id = sess.session_id
    WHERE sess.ussd_code = '*556#'
      AND sess.event_date BETWEEN '2024-06-01' AND '2024-06-30'
),

flow_summary AS (
    SELECT
        from_screen,
        user_selection,
        to_screen,
        COUNT(*) AS transition_count,
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY from_screen)
            AS transition_pct
    FROM screen_transitions
    GROUP BY from_screen, user_selection, to_screen
)

SELECT
    from_screen,
    user_selection,
    to_screen,
    transition_count,
    ROUND(transition_pct, 1) AS transition_pct
FROM flow_summary
ORDER BY from_screen, transition_count DESC;

-- Example output for *556# (MTN data bundles):
-- from_screen | user_selection | to_screen | count   | pct
-- 1           | '1'            | 2         | 850,000 | 62.3%  (Buy data)
-- 1           | '2'            | 2         | 280,000 | 20.5%  (Check balance)
-- 1           | '3'            | 2         | 120,000 | 8.8%   (Gift data)
-- 1           | TIMEOUT        | NULL      | 115,000 | 8.4%   (Abandoned)
-- 2           | '1'            | 3         | 510,000 | 60.0%  (Daily bundles)
-- 2           | '2'            | 3         | 220,000 | 25.9%  (Weekly bundles)
-- 2           | '3'            | 3         | 100,000 | 11.8%  (Monthly bundles)

Extracting Intent from USSD Patterns

Beyond individual session analysis, longitudinal USSD patterns reveal subscriber intent that is invisible in traditional CDR analysis:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def extract_ussd_intent_features(ussd_sessions_df, lookback_days=30):
    """
    Extract behavioral features from USSD session history.

    These features capture intent signals unique to USSD-heavy markets:
    - Balance check frequency → price sensitivity indicator
    - Failed purchase attempts → willingness to pay vs. insufficient funds
    - Service exploration patterns → upsell opportunity signals
    - Time-of-day patterns → lifestyle indicators
    """
    # rangeBetween needs a numeric ordering key, so cast the timestamp
    # to epoch seconds before defining the 30-day sliding window
    window_30d = Window.partitionBy("msisdn") \
        .orderBy(F.col("session_start").cast("long")) \
        .rangeBetween(-lookback_days * 86400, 0)

    features = ussd_sessions_df \
        .withColumn("balance_check_count",
            F.sum(F.when(
                F.col("service_type") == "BALANCE_CHECK", 1
            ).otherwise(0)).over(window_30d)
        ) \
        .withColumn("data_purchase_attempts",
            F.sum(F.when(
                (F.col("ussd_code").isin("*556#", "*141*712#", "*127*0#", "*229#")) &
                (F.col("service_type").like("%DATA%")), 1
            ).otherwise(0)).over(window_30d)
        ) \
        .withColumn("data_purchase_success",
            F.sum(F.when(
                (F.col("service_type").like("%DATA%")) &
                (F.col("session_outcome") == "COMPLETED"), 1
            ).otherwise(0)).over(window_30d)
        ) \
        .withColumn("purchase_failure_rate",
            F.when(F.col("data_purchase_attempts") > 0,
                1 - (F.col("data_purchase_success") / F.col("data_purchase_attempts"))
            ).otherwise(0)
        ) \
        .withColumn("balance_check_before_purchase",
            # Checks balance within 5 minutes before attempting purchase
            # High correlation → price-sensitive subscriber
            F.lit(None)  # Implemented via sessionized window logic
        ) \
        .withColumn("avg_session_depth",
            F.avg("total_screens").over(window_30d)
        ) \
        .withColumn("ussd_hour_mode",
            # Most common hour of USSD usage (lifestyle proxy).
            # A true mode needs a groupBy on (msisdn, event_hour); ordering
            # by event_hour desc would return the latest hour, not the most
            # frequent, so this is left as a placeholder like the field above.
            F.lit(None)
        ) \
        .withColumn("distinct_services_used",
            F.size(F.collect_set("ussd_code").over(window_30d))
        )

    return features
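The balance_check_before_purchase placeholder above can be filled with an as-of join between the two event streams; a pandas sketch of the idea (the 5-minute window is the assumption stated in the comment, and the sample data is hypothetical):

```python
import pandas as pd

balance_checks = pd.DataFrame({
    "msisdn": ["2348031110001"] * 2,
    "ts": pd.to_datetime(["2024-06-01 09:00", "2024-06-01 12:00"]),
})
purchases = pd.DataFrame({
    "msisdn": ["2348031110001"] * 2,
    "ts": pd.to_datetime(["2024-06-01 09:03", "2024-06-01 15:00"]),
})

# For each purchase, find the most recent balance check within 5 minutes
joined = pd.merge_asof(
    purchases.sort_values("ts"),
    balance_checks.sort_values("ts").rename(columns={"ts": "check_ts"}),
    left_on="ts", right_on="check_ts",
    by="msisdn",
    tolerance=pd.Timedelta("5min"),
    direction="backward",
)
joined["checked_balance_first"] = joined["check_ts"].notna()
```

The share of purchases with a preceding balance check is then a direct price-sensitivity feature per subscriber.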

Network Quality Prediction and Power Supply Impact

The Power Problem

Network quality in Nigeria cannot be modeled without accounting for the power supply situation. Unlike Western markets, where grid power is reliable enough that cell tower uptime routinely exceeds 99.9%, Nigerian cell sites face:

  • National grid availability: 40-60% in urban areas, lower in rural areas
  • Diesel generator dependency: Most sites have generators as primary or backup
  • Generator fuel costs: Significant portion of opex (30-40% for some operators)
  • Battery backup duration: Typically 4-8 hours after power loss
  • Solar hybrid sites: Growing adoption but still minority

This creates a predictable pattern: network quality degrades on a schedule driven by power events, not by traffic load alone.
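That schedule can be anticipated: given the grid-off timestamp and the backup capacity on site, the moment a site goes dark is roughly predictable. A toy estimator (capacity figures are illustrative; real models use load-dependent discharge curves and refueling SLAs):

```python
from datetime import datetime, timedelta

def predicted_site_down(grid_off: datetime,
                        battery_hours: float = 6.0,
                        has_generator: bool = False,
                        fuel_hours: float = 0.0) -> datetime:
    """Estimate when a site goes dark after losing grid power.

    Batteries bridge first; a fueled generator extends the window.
    """
    backup = battery_hours + (fuel_hours if has_generator else 0.0)
    return grid_off + timedelta(hours=backup)

# Grid lost at 14:00 with 6h of battery and no generator: dark by 20:00
down = predicted_site_down(datetime(2024, 6, 1, 14, 0))
```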

Cell Site Power and Performance Data Model

CREATE TABLE cell_site_power_events (
    event_id            BIGINT GENERATED ALWAYS AS IDENTITY,
    site_id             VARCHAR(20) NOT NULL,
    event_timestamp     TIMESTAMP NOT NULL,
    event_type          VARCHAR(30),  -- 'GRID_ON', 'GRID_OFF', 'GEN_START',
                                      -- 'GEN_STOP', 'GEN_FUEL_LOW',
                                      -- 'BATTERY_LOW', 'SITE_DOWN'
    power_source        VARCHAR(20),  -- 'GRID', 'GENERATOR', 'BATTERY', 'SOLAR', 'NONE'
    battery_level_pct   DECIMAL(5,2),
    fuel_level_pct      DECIMAL(5,2),
    grid_voltage        DECIMAL(6,2),
    generator_runtime_hours DECIMAL(8,2),
    event_date          DATE NOT NULL
);

CREATE TABLE cell_site_kpis (
    site_id             VARCHAR(20) NOT NULL,
    kpi_timestamp       TIMESTAMP NOT NULL,  -- 15-minute intervals
    event_date          DATE NOT NULL,

    -- Availability KPIs
    site_available      BOOLEAN,
    cells_configured    SMALLINT,
    cells_active        SMALLINT,

    -- Traffic KPIs
    voice_erlangs       DECIMAL(8,2),
    data_throughput_mbps DECIMAL(10,2),
    active_users        INTEGER,
    rrc_connected_users INTEGER,

    -- Quality KPIs
    call_setup_success_rate  DECIMAL(5,2),
    call_drop_rate           DECIMAL(5,4),
    handover_success_rate    DECIMAL(5,2),
    dl_throughput_avg_kbps   DECIMAL(10,2),
    ul_throughput_avg_kbps   DECIMAL(10,2),
    latency_avg_ms           DECIMAL(8,2),

    -- Power context (joined from power events)
    current_power_source     VARCHAR(20),
    hours_since_grid_power   DECIMAL(6,2),
    battery_level_pct        DECIMAL(5,2)
);

Power-Aware Network Quality Model

Standard network quality prediction models use traffic volume, time of day, and historical patterns. For Nigerian deployments, adding power supply features dramatically improves accuracy:

import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, r2_score

def build_network_quality_features(site_kpis_df, power_events_df):
    """
    Build feature matrix for network quality prediction,
    incorporating power supply context.

    Target variable: call_drop_rate (or composite quality score)
    """
    # As-of merge: attach the most recent power event to each KPI
    # interval. A plain merge on site_id alone would pair every KPI
    # row with every power event for that site.
    features_df = pd.merge_asof(
        site_kpis_df.sort_values('kpi_timestamp'),
        power_events_df[['site_id', 'event_timestamp', 'power_source',
                         'fuel_level_pct']].sort_values('event_timestamp'),
        left_on='kpi_timestamp',
        right_on='event_timestamp',
        by='site_id',
        direction='backward'
    )

    # Standard network features
    features_df['hour_of_day'] = features_df['kpi_timestamp'].dt.hour
    features_df['day_of_week'] = features_df['kpi_timestamp'].dt.dayofweek
    features_df['is_weekend'] = features_df['day_of_week'].isin([5, 6]).astype(int)
    features_df['utilization_ratio'] = (
        features_df['active_users'] / features_df['rrc_connected_users']
    ).clip(0, 1)

    # Power-specific features (Nigerian context)
    features_df['is_on_generator'] = (
        features_df['current_power_source'] == 'GENERATOR'
    ).astype(int)
    features_df['is_on_battery'] = (
        features_df['current_power_source'] == 'BATTERY'
    ).astype(int)
    features_df['hours_without_grid'] = features_df['hours_since_grid_power'].clip(0, 72)
    features_df['battery_critical'] = (
        features_df['battery_level_pct'] < 20
    ).astype(int)
    features_df['fuel_critical'] = (
        features_df['fuel_level_pct'] < 15
    ).astype(int)

    # Interaction features
    features_df['high_traffic_low_battery'] = (
        (features_df['utilization_ratio'] > 0.7) &
        (features_df['battery_level_pct'] < 30)
    ).astype(int)

    features_df['peak_hour_generator'] = (
        (features_df['hour_of_day'].between(18, 22)) &
        (features_df['is_on_generator'] == 1)
    ).astype(int)

    return features_df


def train_quality_prediction_model(features_df):
    """
    Train a network quality prediction model with power-aware features.

    Results comparison (MAE on call_drop_rate):
      Without power features: 0.82%
      With power features:    0.53%  (35% improvement)
    """
    feature_columns = [
        # Standard features
        'hour_of_day', 'day_of_week', 'is_weekend',
        'voice_erlangs', 'data_throughput_mbps',
        'active_users', 'utilization_ratio',
        'cells_active', 'dl_throughput_avg_kbps',

        # Power features
        'is_on_generator', 'is_on_battery',
        'hours_without_grid', 'battery_critical',
        'fuel_critical', 'battery_level_pct',

        # Interaction features
        'high_traffic_low_battery', 'peak_hour_generator',
    ]

    X = features_df[feature_columns].fillna(0)
    y = features_df['call_drop_rate']

    tscv = TimeSeriesSplit(n_splits=5)
    scores = []

    for train_idx, test_idx in tscv.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        model = GradientBoostingRegressor(
            n_estimators=300,
            max_depth=6,
            learning_rate=0.05,
            subsample=0.8,
            min_samples_leaf=50
        )
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        scores.append({
            'mae': mean_absolute_error(y_test, y_pred),
            'r2': r2_score(y_test, y_pred)
        })

    print("Model Performance (5-fold Time Series CV):")
    print(f"  MAE: {np.mean([s['mae'] for s in scores]):.4f}")
    print(f"  R2:  {np.mean([s['r2'] for s in scores]):.4f}")

    # Feature importance analysis
    final_model = GradientBoostingRegressor(
        n_estimators=300, max_depth=6,
        learning_rate=0.05, subsample=0.8,
        min_samples_leaf=50
    )
    final_model.fit(X, y)

    importance = pd.DataFrame({
        'feature': feature_columns,
        'importance': final_model.feature_importances_
    }).sort_values('importance', ascending=False)

    print("\nTop 10 Feature Importances:")
    print(importance.head(10).to_string(index=False))

    # Typical output for Nigerian deployment:
    # feature                    importance
    # hours_without_grid         0.1823
    # battery_level_pct          0.1456
    # active_users               0.1234
    # utilization_ratio          0.0987
    # is_on_battery              0.0876
    # hour_of_day                0.0765
    # high_traffic_low_battery   0.0654
    # dl_throughput_avg_kbps     0.0543
    # peak_hour_generator        0.0432
    # fuel_critical              0.0398

    return final_model, importance

Proactive Network Degradation Alerts

Using the power-aware quality model, operators can generate proactive alerts before degradation occurs:

def generate_proactive_alerts(model, current_site_data, feature_columns, thresholds):
    """
    Predict network quality for the next 4 hours and generate
    alerts for sites expected to breach quality thresholds.

    `project_power_state` and `identify_primary_cause` are
    operator-specific helpers (forward power projection and
    root-cause tagging respectively).

    This enables proactive intervention:
    - Dispatch fuel trucks before generators run dry
    - Activate load shedding before battery exhaustion
    - Re-route traffic from degrading sites
    """
    alerts = []

    for site_id, site_features in current_site_data.groupby('site_id'):
        # Project power state forward (4 hours, 15-min intervals)
        projections = project_power_state(site_features, hours_ahead=4)

        for projection in projections:
            predicted_drop_rate = model.predict(
                projection[feature_columns].values.reshape(1, -1)
            )[0]

            if predicted_drop_rate > thresholds['call_drop_critical']:
                alerts.append({
                    'site_id': site_id,
                    'predicted_time': projection['kpi_timestamp'],
                    'predicted_drop_rate': predicted_drop_rate,
                    'primary_cause': identify_primary_cause(projection),
                    'recommended_action': recommend_action(projection),
                    'severity': 'CRITICAL',
                    'estimated_affected_users': projection['active_users']
                })
            elif predicted_drop_rate > thresholds['call_drop_warning']:
                alerts.append({
                    'site_id': site_id,
                    'predicted_time': projection['kpi_timestamp'],
                    'predicted_drop_rate': predicted_drop_rate,
                    'severity': 'WARNING',
                    'primary_cause': identify_primary_cause(projection),
                    'recommended_action': recommend_action(projection),
                    'estimated_affected_users': projection['active_users']
                })

    return pd.DataFrame(alerts)


def recommend_action(site_state):
    """Generate actionable recommendations based on predicted site state."""
    if site_state['fuel_critical'] == 1:
        return "DISPATCH_FUEL_TRUCK"
    elif site_state['battery_critical'] == 1:
        return "ACTIVATE_LOAD_SHEDDING"
    elif site_state['high_traffic_low_battery'] == 1:
        return "INITIATE_TRAFFIC_OFFLOAD"
    elif site_state['hours_without_grid'] > 24:
        return "ESCALATE_POWER_RESTORATION"
    else:
        return "MONITOR_CLOSELY"
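The alert generator above depends on a `project_power_state` helper that is not shown. A minimal sketch, assuming a simple linear battery-drain model (the drain rate, 15-minute step, and helper signature are illustrative; a production version would use per-site discharge curves and generator fuel-burn rates):

```python
import pandas as pd

def project_power_state(site_features, hours_ahead=4, interval_min=15,
                        battery_drain_pct_per_hour=12.0):
    """Project a site's power state forward at 15-minute steps.

    Takes the latest KPI row for the site and returns one projected
    state (a pandas Series) per future interval.
    """
    latest = site_features.iloc[-1]
    projections = []
    steps = int(hours_ahead * 60 / interval_min)
    for step in range(1, steps + 1):
        state = latest.copy()
        hours = step * interval_min / 60
        if state.get('current_power_source') == 'BATTERY':
            # Drain the battery linearly while off grid and generator
            state['battery_level_pct'] = max(
                0.0,
                latest['battery_level_pct'] - battery_drain_pct_per_hour * hours
            )
            state['battery_critical'] = int(state['battery_level_pct'] < 20)
        state['hours_without_grid'] = latest.get('hours_without_grid', 0) + hours
        projections.append(state)
    return projections
```

With a site currently on battery at 50%, the 4-hour projection drains it to near exhaustion, which is what lets the alert loop flag the breach before it happens.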

Churn Prediction for Prepaid-Dominant Markets

Why Western Churn Models Fail in Nigeria

Churn prediction is a well-studied problem in telecoms. However, the vast majority of published literature and available models are designed for postpaid markets. The fundamental differences make direct transfer impossible:

┌──────────────────────────────────────────────────────────────────┐
│          Churn Signal Comparison: Postpaid vs. Prepaid           │
│                                                                  │
│  Postpaid (Western)              Prepaid (Nigerian)              │
│  ─────────────────               ──────────────────              │
│  Contract end date    ──▶  ✗ No contracts exist                  │
│  Payment default      ──▶  ✗ No recurring billing                │
│  Usage complaints     ──▶  ✗ No customer care interaction        │
│  Plan downgrade       ──▶  ✗ No formal plan changes              │
│  Device upgrade cycle ──▶  ✗ Device tied to subscriber, not plan │
│                                                                  │
│  Prepaid churn signals:                                          │
│  ✓ Recharge amount decay (N500 → N200 → N100 → N0)               │
│  ✓ Recharge frequency decay (weekly → biweekly → monthly → stop) │
│  ✓ Voice/data usage decline over rolling windows                 │
│  ✓ USSD activity cessation (last balance check date)             │
│  ✓ SIM dormancy duration exceeding operator thresholds           │
│  ✓ Multi-SIM behavior shifts (competitor SIM becomes primary)    │
└──────────────────────────────────────────────────────────────────┘

Defining Churn in Prepaid Markets

In postpaid markets, churn is binary: the subscriber cancels their contract. In prepaid markets, churn is a continuum. Definitions vary by operator, but a common framework looks like this:

-- Prepaid churn definition framework
-- Nigerian operators typically use 90-day inactivity

CREATE VIEW subscriber_churn_status AS
WITH activity AS (
    SELECT
        msisdn,
        last_recharge_date,
        last_voice_activity,
        last_data_activity,
        last_ussd_activity,

        -- Composite last activity across all channels
        GREATEST(
            COALESCE(last_recharge_date, DATE '1900-01-01'),
            COALESCE(last_voice_activity, DATE '1900-01-01'),
            COALESCE(last_data_activity, DATE '1900-01-01'),
            COALESCE(last_ussd_activity, DATE '1900-01-01')
        ) AS last_any_activity
    FROM dim_subscriber
    WHERE is_current = TRUE
)
SELECT
    msisdn,
    last_recharge_date,
    last_voice_activity,
    last_data_activity,
    last_ussd_activity,
    last_any_activity,

    -- Days since last activity
    CURRENT_DATE - last_any_activity AS days_inactive,

    -- Churn classification
    CASE
        WHEN CURRENT_DATE - last_any_activity <= 7  THEN 'ACTIVE'
        WHEN CURRENT_DATE - last_any_activity BETWEEN 8  AND 30 THEN 'DECLINING'
        WHEN CURRENT_DATE - last_any_activity BETWEEN 31 AND 60 THEN 'AT_RISK'
        WHEN CURRENT_DATE - last_any_activity BETWEEN 61 AND 90 THEN 'DORMANT'
        ELSE 'CHURNED'
    END AS churn_status

FROM activity;

Feature Engineering for Prepaid Churn

The feature set for prepaid churn prediction is fundamentally different from postpaid models. Here is a comprehensive feature engineering pipeline:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def build_prepaid_churn_features(cdr_df, recharge_df, ussd_df, subscriber_df):
    """
    Build feature matrix for prepaid churn prediction.

    Features are organized into five categories:
    1. Recharge behavior (most predictive for prepaid)
    2. Usage decay patterns
    3. USSD engagement signals
    4. Network quality experience
    5. Temporal patterns
    """

    # ─────────────────────────────────────────────
    # CATEGORY 1: Recharge Behavior Features
    # ─────────────────────────────────────────────
    recharge_features = recharge_df.groupBy("msisdn").agg(
        # Recency
        F.datediff(F.current_date(), F.max("recharge_date"))
            .alias("days_since_last_recharge"),

        # Frequency (multiple windows)
        F.sum(F.when(
            F.col("recharge_date") >= F.date_sub(F.current_date(), 7), 1
        ).otherwise(0)).alias("recharge_count_7d"),
        F.sum(F.when(
            F.col("recharge_date") >= F.date_sub(F.current_date(), 30), 1
        ).otherwise(0)).alias("recharge_count_30d"),
        F.sum(F.when(
            F.col("recharge_date") >= F.date_sub(F.current_date(), 60), 1
        ).otherwise(0)).alias("recharge_count_60d"),

        # Monetary
        F.sum(F.when(
            F.col("recharge_date") >= F.date_sub(F.current_date(), 30),
            F.col("recharge_amount")
        ).otherwise(0)).alias("recharge_value_30d"),
        F.sum(F.when(
            F.col("recharge_date") >= F.date_sub(F.current_date(), 60),
            F.col("recharge_amount")
        ).otherwise(0)).alias("recharge_value_60d"),

        # Average recharge denomination
        F.avg(F.when(
            F.col("recharge_date") >= F.date_sub(F.current_date(), 30),
            F.col("recharge_amount")
        )).alias("avg_recharge_amount_30d"),

        # Recharge channel diversity
        F.countDistinct(F.when(
            F.col("recharge_date") >= F.date_sub(F.current_date(), 30),
            F.col("recharge_channel")  # 'USSD', 'BANK', 'SCRATCH_CARD', 'AGENT'
        )).alias("recharge_channel_count_30d"),
    )

    # Decay ratios (key prepaid churn indicators)
    recharge_features = recharge_features \
        .withColumn("recharge_freq_decay",
            F.when(F.col("recharge_count_60d") > 0,
                F.col("recharge_count_30d") /
                (F.col("recharge_count_60d") - F.col("recharge_count_30d") + 0.01)
            ).otherwise(0)
        ) \
        .withColumn("recharge_value_decay",
            F.when(F.col("recharge_value_60d") > 0,
                F.col("recharge_value_30d") /
                (F.col("recharge_value_60d") - F.col("recharge_value_30d") + 0.01)
            ).otherwise(0)
        )

    # ─────────────────────────────────────────────
    # CATEGORY 2: Usage Decay Patterns
    # ─────────────────────────────────────────────
    usage_features = cdr_df.groupBy("msisdn").agg(
        # Voice usage decay
        F.sum(F.when(
            F.col("event_date") >= F.date_sub(F.current_date(), 7),
            F.col("duration_seconds")
        ).otherwise(0)).alias("voice_seconds_7d"),
        F.sum(F.when(
            F.col("event_date") >= F.date_sub(F.current_date(), 30),
            F.col("duration_seconds")
        ).otherwise(0)).alias("voice_seconds_30d"),

        # Data usage decay
        F.sum(F.when(
            (F.col("event_date") >= F.date_sub(F.current_date(), 7)) &
            (F.col("record_type") == "DATA"),
            F.col("bytes_downloaded")
        ).otherwise(0)).alias("data_bytes_7d"),
        F.sum(F.when(
            (F.col("event_date") >= F.date_sub(F.current_date(), 30)) &
            (F.col("record_type") == "DATA"),
            F.col("bytes_downloaded")
        ).otherwise(0)).alias("data_bytes_30d"),

        # Activity diversity
        F.countDistinct(F.when(
            F.col("event_date") >= F.date_sub(F.current_date(), 30),
            F.col("event_date")
        )).alias("active_days_30d"),

        # Unique contacts (social network proxy)
        F.countDistinct(F.when(
            F.col("event_date") >= F.date_sub(F.current_date(), 30),
            F.col("called_msisdn")
        )).alias("unique_contacts_30d"),
        F.countDistinct(F.when(
            F.col("event_date") >= F.date_sub(F.current_date(), 60),
            F.col("called_msisdn")
        )).alias("unique_contacts_60d"),

        # On-net vs off-net ratio (competitor usage proxy)
        F.avg(F.when(
            F.col("event_date") >= F.date_sub(F.current_date(), 30),
            F.col("is_on_net").cast("double")
        )).alias("on_net_ratio_30d"),
    )

    # ─────────────────────────────────────────────
    # CATEGORY 3: USSD Engagement Signals
    # ─────────────────────────────────────────────
    ussd_features = ussd_df.groupBy("msisdn").agg(
        F.datediff(F.current_date(), F.max("session_start"))
            .alias("days_since_last_ussd"),
        F.sum(F.when(
            F.col("event_date") >= F.date_sub(F.current_date(), 7), 1
        ).otherwise(0)).alias("ussd_sessions_7d"),
        F.sum(F.when(
            F.col("event_date") >= F.date_sub(F.current_date(), 30), 1
        ).otherwise(0)).alias("ussd_sessions_30d"),
        F.countDistinct(F.when(
            F.col("event_date") >= F.date_sub(F.current_date(), 30),
            F.col("ussd_code")
        )).alias("distinct_ussd_services_30d"),
        F.sum(F.when(
            (F.col("event_date") >= F.date_sub(F.current_date(), 30)) &
            (F.col("service_type") == "BALANCE_CHECK"), 1
        ).otherwise(0)).alias("balance_checks_30d"),
    )

    # ─────────────────────────────────────────────
    # CATEGORY 4: Network Quality Experience
    # ─────────────────────────────────────────────
    nq_features = cdr_df \
        .filter(F.col("event_date") >= F.date_sub(F.current_date(), 30)) \
        .groupBy("msisdn").agg(
            F.avg("call_drop_rate").alias("avg_call_drop_rate_30d"),
            F.sum(F.when(
                F.col("disconnect_reason") == "DROP", 1
            ).otherwise(0)).alias("dropped_calls_30d"),
            F.avg(F.when(
                F.col("record_type") == "DATA",
                F.col("bytes_downloaded") / F.col("duration_seconds")
            )).alias("avg_data_speed_30d"),
        )

    # ─────────────────────────────────────────────
    # Combine all feature categories
    # ─────────────────────────────────────────────
    features = subscriber_df \
        .join(recharge_features, "msisdn", "left") \
        .join(usage_features, "msisdn", "left") \
        .join(ussd_features, "msisdn", "left") \
        .join(nq_features, "msisdn", "left") \
        .fillna(0)

    return features
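To make the decay ratios concrete, here is a worked example in plain Python (illustrative numbers) of the `recharge_freq_decay` computation above: recharges in the last 30 days divided by recharges in the preceding 30 days (days 31-60), with a small constant to avoid division by zero.

```python
def freq_decay(count_30d, count_60d):
    """Ratio of recent recharge frequency to the prior 30-day window."""
    prior_30d = count_60d - count_30d  # recharges in days 31-60
    return count_30d / (prior_30d + 0.01)

# Stable subscriber: 4 recharges in each 30-day window -> ratio near 1
print(round(freq_decay(4, 8), 2))   # 1.0
# Declining subscriber: 1 recent recharge vs 4 in the prior window
print(round(freq_decay(1, 5), 2))   # 0.25
```

Ratios well below 1 signal decay; the same construction applies to `recharge_value_decay` with naira amounts instead of counts.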

Training the Prepaid Churn Model

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import roc_auc_score, average_precision_score

def train_prepaid_churn_model(features_df, label_col='is_churned_30d'):
    """
    Train a churn prediction model optimized for prepaid markets.

    Key considerations:
    - Heavy class imbalance (monthly churn rate 3-5%)
    - Cost-sensitive: false negatives (missed churn) cost more than
      false positives (unnecessary retention offer)
    - Model must run at scale: 60M+ subscribers scored daily

    Production benchmarks (Nigerian operator, 65M subscribers):
      AUC-ROC: 0.87
      Precision@10%: 0.42 (4.2x lift over random)
      Feature computation: 45 min (Spark, 50 nodes)
      Model scoring: 12 min (Spark, 50 nodes)
    """
    feature_columns = [
        # Recharge features (highest importance)
        'days_since_last_recharge',
        'recharge_count_7d', 'recharge_count_30d', 'recharge_count_60d',
        'recharge_value_30d', 'recharge_value_60d',
        'avg_recharge_amount_30d',
        'recharge_freq_decay', 'recharge_value_decay',
        'recharge_channel_count_30d',

        # Usage features
        'voice_seconds_7d', 'voice_seconds_30d',
        'data_bytes_7d', 'data_bytes_30d',
        'active_days_30d',
        'unique_contacts_30d', 'unique_contacts_60d',
        'on_net_ratio_30d',

        # USSD features
        'days_since_last_ussd',
        'ussd_sessions_7d', 'ussd_sessions_30d',
        'distinct_ussd_services_30d',
        'balance_checks_30d',

        # Network quality
        'avg_call_drop_rate_30d',
        'dropped_calls_30d',
        'avg_data_speed_30d',

        # Subscriber attributes
        'arpu_30d', 'device_tier_encoded',
        'tenure_months',
    ]

    X = features_df[feature_columns].values
    y = features_df[label_col].values

    # Time-based split (not random — temporal leakage risk)
    tscv = TimeSeriesSplit(n_splits=4)
    auc_scores = []
    ap_scores = []

    for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        # Calculate class weight
        neg_count = np.sum(y_train == 0)
        pos_count = np.sum(y_train == 1)
        scale_pos_weight = neg_count / pos_count

        train_data = lgb.Dataset(X_train, label=y_train)
        val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

        params = {
            'objective': 'binary',
            'metric': ['auc', 'average_precision'],
            'scale_pos_weight': scale_pos_weight,
            'num_leaves': 63,
            'max_depth': 8,
            'learning_rate': 0.05,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'min_child_samples': 100,
            'reg_alpha': 0.1,
            'reg_lambda': 1.0,
            'verbose': -1,
        }

        model = lgb.train(
            params,
            train_data,
            num_boost_round=500,
            valid_sets=[val_data],
            callbacks=[lgb.early_stopping(50)]
        )

        y_pred = model.predict(X_val)
        auc = roc_auc_score(y_val, y_pred)
        ap = average_precision_score(y_val, y_pred)

        auc_scores.append(auc)
        ap_scores.append(ap)

        print(f"Fold {fold+1}: AUC={auc:.4f}, AP={ap:.4f}")

    print(f"\nMean AUC: {np.mean(auc_scores):.4f} (+/- {np.std(auc_scores):.4f})")
    print(f"Mean AP:  {np.mean(ap_scores):.4f} (+/- {np.std(ap_scores):.4f})")

    # Train final model on all data
    final_train = lgb.Dataset(X, label=y)
    final_model = lgb.train(params, final_train, num_boost_round=500)

    # Feature importance
    importance = pd.DataFrame({
        'feature': feature_columns,
        'importance': final_model.feature_importance(importance_type='gain')
    }).sort_values('importance', ascending=False)

    print("\nTop 15 Features by Gain:")
    for _, row in importance.head(15).iterrows():
        print(f"  {row['feature']:40s} {row['importance']:,.0f}")

    # Typical top features for Nigerian prepaid:
    # days_since_last_recharge              285,432
    # recharge_freq_decay                   198,765
    # recharge_value_decay                  176,543
    # recharge_count_30d                    154,321
    # days_since_last_ussd                  132,109
    # active_days_30d                        98,765
    # voice_seconds_7d                       87,654
    # unique_contacts_30d                    76,543
    # balance_checks_30d                     65,432
    # data_bytes_7d                          54,321
    # ussd_sessions_30d                      43,210
    # avg_recharge_amount_30d                32,109
    # on_net_ratio_30d                       28,765
    # arpu_30d                               21,098
    # dropped_calls_30d                      18,765

    return final_model, importance
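The Precision@10% benchmark quoted in the docstring is top-decile precision: rank subscribers by predicted churn probability and measure the fraction of actual churners in the top k%. A small sketch of the computation (labels and scores are illustrative):

```python
import numpy as np

def precision_at_k_percent(y_true, y_score, k_pct=10):
    """Fraction of true churners among the top k% highest-scored subscribers."""
    n_top = max(1, int(len(y_score) * k_pct / 100))
    top_idx = np.argsort(y_score)[::-1][:n_top]  # highest scores first
    return float(np.mean(np.asarray(y_true)[top_idx]))

y_true  = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.05, 0.15, 0.6, 0.4]
# Top 30% = 3 subscribers (scores 0.9, 0.8, 0.7), of whom 2 churned
print(precision_at_k_percent(y_true, y_score, k_pct=30))
```

Against a random baseline equal to the overall churn rate, this metric directly measures how well a fixed retention budget is targeted.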

Operationalizing Churn Predictions

Generating churn scores is only the first step. The real value lies in triggering retention actions:

def generate_retention_actions(scored_subscribers_df, budget_per_day_naira):
    """
    Map churn scores to retention actions based on subscriber value
    and churn probability.

    Nigerian telco retention playbook:
    - High value + High risk: Personalized bundle offer via USSD push
    - High value + Medium risk: Bonus airtime on next recharge
    - Medium value + High risk: Data bundle offer
    - Low value + High risk: Free SMS bundle or USSD notification

    Budget constraint: Operators typically allocate N5-15 per
    at-risk subscriber per month.
    """
    actioned = scored_subscribers_df \
        .withColumn("value_segment",
            F.when(F.col("arpu_30d") >= 2000, "HIGH")
             .when(F.col("arpu_30d") >= 500, "MEDIUM")
             .otherwise("LOW")
        ) \
        .withColumn("risk_segment",
            F.when(F.col("churn_probability") >= 0.7, "HIGH")
             .when(F.col("churn_probability") >= 0.4, "MEDIUM")
             .otherwise("LOW")
        ) \
        .withColumn("retention_action",
            F.when(
                (F.col("value_segment") == "HIGH") &
                (F.col("risk_segment") == "HIGH"),
                "PERSONALIZED_BUNDLE_USSD_PUSH"
            ).when(
                (F.col("value_segment") == "HIGH") &
                (F.col("risk_segment") == "MEDIUM"),
                "BONUS_AIRTIME_ON_RECHARGE"
            ).when(
                (F.col("value_segment") == "MEDIUM") &
                (F.col("risk_segment") == "HIGH"),
                "DATA_BUNDLE_OFFER"
            ).when(
                (F.col("value_segment") == "LOW") &
                (F.col("risk_segment") == "HIGH"),
                "FREE_SMS_BUNDLE"
            ).otherwise("NO_ACTION")
        ) \
        .withColumn("offer_value_naira",
            F.when(F.col("retention_action") == "PERSONALIZED_BUNDLE_USSD_PUSH", 100)
             .when(F.col("retention_action") == "BONUS_AIRTIME_ON_RECHARGE", 50)
             .when(F.col("retention_action") == "DATA_BUNDLE_OFFER", 30)
             .when(F.col("retention_action") == "FREE_SMS_BUNDLE", 10)
             .otherwise(0)
        ) \
        .withColumn("expected_revenue_saved",
            F.col("arpu_30d") * F.col("churn_probability") * 3  # 3-month horizon
        ) \
        .withColumn("roi_ratio",
            F.col("expected_revenue_saved") / (F.col("offer_value_naira") + 0.01)
        )

    # Budget optimization: rank by ROI and apply budget cap
    actioned = actioned \
        .filter(F.col("retention_action") != "NO_ACTION") \
        .withColumn("roi_rank",
            F.row_number().over(
                Window.orderBy(F.desc("roi_ratio"))
            )
        )

    # Calculate cumulative spend
    actioned = actioned.withColumn("cumulative_spend",
        F.sum("offer_value_naira").over(
            Window.orderBy("roi_rank")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow)
        )
    )

    # Apply budget constraint
    within_budget = actioned.filter(
        F.col("cumulative_spend") <= budget_per_day_naira
    )

    return within_budget

NCC Regulatory Reporting Automation

Regulatory Context

The Nigerian Communications Commission (NCC) requires periodic reporting from all licensed operators. Key reporting obligations include:

  • Quality of Service (QoS) Reports: Monthly metrics on call setup success rate, drop rate, network availability, and data throughput against NCC-defined thresholds
  • Subscriber Statistics: Quarterly active subscriber counts by technology (2G, 3G, 4G), by state, and by prepaid/postpaid
  • Number Portability Reports: Monthly porting activity logs
  • KYC/NIN Compliance: Progress reports on NIN-SIM linkage compliance
  • Security & Lawful Intercept: Capability compliance and activity reports

Manual preparation of these reports typically occupies three to five full-time analysts, involves error-prone spreadsheet manipulation, and risks regulatory penalties for late or inaccurate submissions.

Automated NCC Reporting Pipeline

┌──────────────────────────────────────────────────────────────────┐
│                NCC Regulatory Reporting Pipeline                 │
│                                                                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐  │
│  │ CDR Data │  │ Network  │  │Subscriber│  │ Number Porting   │  │
│  │ Warehouse│  │ KPI Store│  │ Database │  │ Database         │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └───────┬──────────┘  │
│       │             │             │                │             │
│       └──────────┬──┴─────────────┴────────────────┘             │
│                  │                                               │
│           ┌──────▼──────────┐                                    │
│           │  dbt Models     │                                    │
│           │                 │                                    │
│           │  ncc_qos_monthly│                                    │
│           │  ncc_subscriber │                                    │
│           │  ncc_porting    │                                    │
│           │  ncc_nin_status │                                    │
│           └──────┬──────────┘                                    │
│                  │                                               │
│           ┌──────▼──────────┐    ┌─────────────────────────┐     │
│           │  Validation     │    │  Alert on threshold     │     │
│           │ & Reconciliation│──▶ │  breaches BEFORE filing │     │
│           └──────┬──────────┘    └─────────────────────────┘     │
│                  │                                               │
│           ┌──────▼──────────┐                                    │
│           │  Report         │                                    │
│           │  Generation     │                                    │
│           │  (PDF + CSV)    │                                    │
│           └──────┬──────────┘                                    │
│                  │                                               │
│           ┌──────▼──────────┐                                    │
│           │  NCC Submission │                                    │
│           │  Portal Upload  │                                    │
│           └─────────────────┘                                    │
└──────────────────────────────────────────────────────────────────┘
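The validation step in the diagram can be sketched as a threshold check before filing. The targets mirror the NCC KPI comments used in the QoS model below; the helper name and dictionary structure are illustrative:

```python
# Illustrative NCC QoS targets: (direction, threshold in percent)
NCC_THRESHOLDS = {
    'call_setup_success_rate':    ('>=', 98.0),
    'call_drop_rate':             ('<=', 1.0),
    'traffic_channel_congestion': ('<=', 2.0),
    'sdcch_congestion':           ('<=', 0.2),
}

def validate_qos_report(metrics):
    """Return (kpi, value, target) tuples that breach their NCC target."""
    breaches = []
    for kpi, (op, target) in NCC_THRESHOLDS.items():
        value = metrics.get(kpi)
        if value is None:
            continue  # KPI not present in this report
        ok = value >= target if op == '>=' else value <= target
        if not ok:
            breaches.append((kpi, value, target))
    return breaches

print(validate_qos_report({
    'call_setup_success_rate': 97.2,  # below the 98% NCC target
    'call_drop_rate': 0.8,            # within the 1% target
}))  # [('call_setup_success_rate', 97.2, 98.0)]
```

Running this check on the dbt model output before submission is what turns a late-penalty risk into an internal alert.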

QoS Reporting Model (dbt)

-- models/regulatory/ncc_qos_monthly.sql
-- NCC Quality of Service Monthly Report
-- Reference: NCC QoS Regulations 2023

{{ config(
    materialized='table',
    schema='regulatory',
    tags=['ncc', 'monthly']
) }}

WITH voice_qos AS (
    SELECT
        DATE_TRUNC('month', event_date) AS report_month,
        'VOICE' AS service_type,

        -- NCC KPI 1: Call Setup Success Rate (target: >= 98%)
        COUNT(CASE WHEN call_setup_status = 'SUCCESS' THEN 1 END) * 100.0 /
            NULLIF(COUNT(*), 0)
            AS call_setup_success_rate,

        -- NCC KPI 2: Call Drop Rate (target: <= 1%)
        COUNT(CASE WHEN disconnect_reason = 'DROP' THEN 1 END) * 100.0 /
            NULLIF(COUNT(CASE WHEN call_setup_status = 'SUCCESS' THEN 1 END), 0)
            AS call_drop_rate,

        -- NCC KPI 3: Traffic Channel Congestion (target: <= 2%)
        AVG(CASE WHEN channel_congested = TRUE THEN 100.0 ELSE 0.0 END)
            AS traffic_channel_congestion,

        -- NCC KPI 4: SDCCH Congestion (target: <= 0.2%)
        AVG(CASE WHEN sdcch_congested = TRUE THEN 100.0 ELSE 0.0 END)
            AS sdcch_congestion,

        -- Supporting metrics
        COUNT(*) AS total_call_attempts,
        COUNT(CASE WHEN call_setup_status = 'SUCCESS' THEN 1 END) AS successful_calls,
        COUNT(CASE WHEN disconnect_reason = 'DROP' THEN 1 END) AS dropped_calls,
        AVG(duration_seconds) AS avg_call_duration_seconds

    FROM {{ ref('cdr_fact_enriched') }}
    WHERE record_type = 'VOICE'
      AND event_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
      AND event_date < DATE_TRUNC('month', CURRENT_DATE)
    GROUP BY DATE_TRUNC('month', event_date)
),

data_qos AS (
    SELECT
        DATE_TRUNC('month', event_date) AS report_month,
        'DATA' AS service_type,

        -- NCC KPI 5: Download Speed (target varies by technology)
        PERCENTILE_CONT(0.5) WITHIN GROUP (
            ORDER BY dl_throughput_avg_kbps
        ) AS median_download_speed_kbps,

        -- NCC KPI 6: Upload Speed
        PERCENTILE_CONT(0.5) WITHIN GROUP (
            ORDER BY ul_throughput_avg_kbps
        ) AS median_upload_speed_kbps,

        -- NCC KPI 7: Latency (target: <= 150ms for 4G)
        PERCENTILE_CONT(0.5) WITHIN GROUP (
            ORDER BY latency_avg_ms
        ) AS median_latency_ms,

        -- NCC KPI 8: Packet Loss Rate (target: <= 3%)
        AVG(packet_loss_rate) AS avg_packet_loss_rate,

        COUNT(*) AS total_data_sessions,
        SUM(bytes_downloaded + bytes_uploaded) AS total_data_volume_bytes

    FROM {{ ref('data_session_kpis') }}
    WHERE event_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
      AND event_date < DATE_TRUNC('month', CURRENT_DATE)
    GROUP BY DATE_TRUNC('month', event_date)
),

network_availability AS (
    SELECT
        DATE_TRUNC('month', kpi_date) AS report_month,

        -- NCC KPI 9: Network Availability (target: >= 99.5%)
        AVG(CASE WHEN site_available = TRUE THEN 100.0 ELSE 0.0 END)
            AS network_availability_pct,

        COUNT(DISTINCT site_id) AS total_sites,
        COUNT(DISTINCT CASE WHEN site_available = TRUE THEN site_id END)
            AS available_sites

    FROM {{ ref('cell_site_kpis_daily') }}
    WHERE kpi_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
      AND kpi_date < DATE_TRUNC('month', CURRENT_DATE)
    GROUP BY DATE_TRUNC('month', kpi_date)
),

-- Compile final NCC report with threshold compliance flags
final_report AS (
    SELECT
        v.report_month,

        -- Voice QoS
        v.call_setup_success_rate,
        v.call_setup_success_rate >= 98.0 AS cssr_compliant,
        v.call_drop_rate,
        v.call_drop_rate <= 1.0 AS cdr_compliant,
        v.traffic_channel_congestion,
        v.traffic_channel_congestion <= 2.0 AS tch_congestion_compliant,
        v.sdcch_congestion,
        v.sdcch_congestion <= 0.2 AS sdcch_compliant,

        -- Data QoS
        d.median_download_speed_kbps,
        d.median_upload_speed_kbps,
        d.median_latency_ms,
        d.median_latency_ms <= 150.0 AS latency_compliant,
        d.avg_packet_loss_rate,
        d.avg_packet_loss_rate <= 3.0 AS packet_loss_compliant,

        -- Network Availability
        n.network_availability_pct,
        n.network_availability_pct >= 99.5 AS availability_compliant,

        -- Volume context
        v.total_call_attempts,
        v.successful_calls,
        v.dropped_calls,
        d.total_data_sessions,
        d.total_data_volume_bytes,
        n.total_sites,
        n.available_sites,

        -- Overall compliance
        (v.call_setup_success_rate >= 98.0
         AND v.call_drop_rate <= 1.0
         AND v.traffic_channel_congestion <= 2.0
         AND d.median_latency_ms <= 150.0
         AND n.network_availability_pct >= 99.5
        ) AS overall_compliant

    FROM voice_qos v
    JOIN data_qos d
        ON d.report_month = v.report_month
    JOIN network_availability n
        ON n.report_month = v.report_month
)

SELECT * FROM final_report
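The denominator choices in the model above are easy to get wrong: CSSR is computed over all call attempts, while drop rate is computed over successful setups only. A standalone sketch of that arithmetic, on a tiny hypothetical sample:

```python
# Illustration (hypothetical sample) of the voice KPI arithmetic used
# in the dbt model: CSSR over all attempts, drop rate over successful
# setups only.
calls = [
    {'setup': 'SUCCESS', 'disconnect': 'NORMAL'},
    {'setup': 'SUCCESS', 'disconnect': 'DROP'},
    {'setup': 'SUCCESS', 'disconnect': 'NORMAL'},
    {'setup': 'FAILED',  'disconnect': None},
]

attempts = len(calls)
successes = sum(c['setup'] == 'SUCCESS' for c in calls)
drops = sum(c['disconnect'] == 'DROP' for c in calls)

call_setup_success_rate = successes * 100.0 / attempts   # NCC KPI 1
call_drop_rate = drops * 100.0 / successes               # NCC KPI 2

print(round(call_setup_success_rate, 1), round(call_drop_rate, 1))  # 75.0 33.3
```

Using successful setups as the drop-rate denominator mirrors the NULLIF guard in the SQL: a failed setup cannot drop, so it should not dilute the metric.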

NCC Subscriber Statistics Report

-- models/regulatory/ncc_subscriber_quarterly.sql
-- NCC Quarterly Subscriber Statistics Report

{{ config(
    materialized='table',
    schema='regulatory',
    tags=['ncc', 'quarterly']
) }}

SELECT
    DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '1 quarter')
        AS report_quarter,

    -- Active subscribers by technology
    COUNT(CASE WHEN rat_type = '2G' AND is_active = TRUE THEN 1 END)
        AS active_2g_subscribers,
    COUNT(CASE WHEN rat_type = '3G' AND is_active = TRUE THEN 1 END)
        AS active_3g_subscribers,
    COUNT(CASE WHEN rat_type = '4G' AND is_active = TRUE THEN 1 END)
        AS active_4g_subscribers,
    COUNT(CASE WHEN rat_type = '5G' AND is_active = TRUE THEN 1 END)
        AS active_5g_subscribers,

    -- Active subscribers by billing type
    COUNT(CASE WHEN account_type = 'PREPAID' AND is_active = TRUE THEN 1 END)
        AS active_prepaid,
    COUNT(CASE WHEN account_type = 'POSTPAID' AND is_active = TRUE THEN 1 END)
        AS active_postpaid,

    -- Total active
    COUNT(CASE WHEN is_active = TRUE THEN 1 END)
        AS total_active_subscribers,

    -- NIN-SIM Linkage compliance
    COUNT(CASE WHEN nin_linked = TRUE AND is_active = TRUE THEN 1 END)
        AS nin_linked_subscribers,
    COUNT(CASE WHEN nin_linked = TRUE AND is_active = TRUE THEN 1 END) * 100.0 /
        NULLIF(COUNT(CASE WHEN is_active = TRUE THEN 1 END), 0)
        AS nin_linkage_pct,

    -- Geographic distribution (by geopolitical zone)
    COUNT(CASE WHEN geopolitical_zone = 'South-West' AND is_active = TRUE THEN 1 END)
        AS active_south_west,
    COUNT(CASE WHEN geopolitical_zone = 'South-South' AND is_active = TRUE THEN 1 END)
        AS active_south_south,
    COUNT(CASE WHEN geopolitical_zone = 'South-East' AND is_active = TRUE THEN 1 END)
        AS active_south_east,
    COUNT(CASE WHEN geopolitical_zone = 'North-West' AND is_active = TRUE THEN 1 END)
        AS active_north_west,
    COUNT(CASE WHEN geopolitical_zone = 'North-East' AND is_active = TRUE THEN 1 END)
        AS active_north_east,
    COUNT(CASE WHEN geopolitical_zone = 'North-Central' AND is_active = TRUE THEN 1 END)
        AS active_north_central,

    -- Internet subscribers (required by NCC)
    COUNT(CASE WHEN has_data_usage_90d = TRUE AND is_active = TRUE THEN 1 END)
        AS internet_subscribers,

    -- Teledensity calculation helper
    COUNT(CASE WHEN is_active = TRUE THEN 1 END)
        AS subscriber_count_for_teledensity

FROM {{ ref('dim_subscriber_enriched') }}
WHERE is_current = TRUE
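The teledensity helper column above emits only the subscriber count; the ratio itself is computed downstream against a population estimate. A minimal sketch, with both figures as hypothetical placeholders rather than reported statistics:

```python
# Teledensity = active subscriptions per 100 inhabitants.
# Both inputs below are hypothetical placeholders.
active_subscribers = 62_400_000     # subscriber_count_for_teledensity output
population_estimate = 223_000_000   # national population estimate

teledensity = active_subscribers / population_estimate * 100
print(f"{teledensity:.2f} subscriptions per 100 inhabitants")
```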

Validation and Threshold Monitoring

Before submitting to the NCC, automated validation catches issues that could trigger regulatory scrutiny:

def validate_ncc_report(report_df, previous_report_df, thresholds):
    """
    Validate NCC report data before submission.

    Checks:
    1. KPI threshold compliance (flag but don't block)
    2. Month-over-month variance sanity checks
    3. Internal consistency (e.g., dropped calls < total calls)
    4. Missing or null values

    Returns: validation_result with issues list
    """
    issues = []

    # Check 1: Threshold compliance
    ncc_thresholds = {
        'call_setup_success_rate': {'min': 98.0, 'direction': 'above'},
        'call_drop_rate': {'max': 1.0, 'direction': 'below'},
        'traffic_channel_congestion': {'max': 2.0, 'direction': 'below'},
        'sdcch_congestion': {'max': 0.2, 'direction': 'below'},
        'median_latency_ms': {'max': 150.0, 'direction': 'below'},
        'network_availability_pct': {'min': 99.5, 'direction': 'above'},
    }

    for kpi, threshold in ncc_thresholds.items():
        if kpi not in report_df.columns:
            continue  # skip KPIs not present in this report slice
        value = report_df[kpi].iloc[0]
        if threshold['direction'] == 'above' and value < threshold['min']:
            issues.append({
                'severity': 'CRITICAL',
                'kpi': kpi,
                'message': f"{kpi} = {value:.2f} is below the NCC minimum of {threshold['min']}",
                'action': 'Prepare explanation letter for NCC'
            })
        elif threshold['direction'] == 'below' and value > threshold['max']:
            issues.append({
                'severity': 'CRITICAL',
                'kpi': kpi,
                'message': f"{kpi} = {value:.2f} exceeds the NCC maximum of {threshold['max']}",
                'action': 'Prepare explanation letter for NCC'
            })

    # Check 2: Month-over-month variance
    if previous_report_df is not None:
        for kpi in ['total_call_attempts', 'total_active_subscribers']:
            current = report_df[kpi].iloc[0]
            previous = previous_report_df[kpi].iloc[0]
            if previous == 0:
                continue  # avoid division by zero when the prior month is empty
            variance = abs(current - previous) / previous * 100

            if variance > 15:  # More than 15% change
                issues.append({
                    'severity': 'WARNING',
                    'kpi': kpi,
                    'message': f"{kpi} changed by {variance:.1f}% vs previous month. "
                              f"Previous: {previous:,.0f}, Current: {current:,.0f}",
                    'action': 'Verify data completeness and reconcile'
                })

    # Check 3: Internal consistency
    if report_df['dropped_calls'].iloc[0] > report_df['successful_calls'].iloc[0]:
        issues.append({
            'severity': 'ERROR',
            'kpi': 'dropped_calls',
            'message': 'Dropped calls exceed successful calls — data integrity issue',
            'action': 'Investigate CDR pipeline for data quality issues'
        })

    # Check 4: Null values
    for col in report_df.columns:
        if report_df[col].isna().any():
            issues.append({
                'severity': 'ERROR',
                'kpi': col,
                'message': f"NULL value found in {col}",
                'action': 'Check upstream data sources for missing data'
            })

    validation_result = {
        'is_valid': all(i['severity'] != 'ERROR' for i in issues),
        'critical_count': sum(1 for i in issues if i['severity'] == 'CRITICAL'),
        'warning_count': sum(1 for i in issues if i['severity'] == 'WARNING'),
        'error_count': sum(1 for i in issues if i['severity'] == 'ERROR'),
        'issues': issues
    }

    return validation_result
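The threshold check can be exercised in isolation. Here is a self-contained run against a hypothetical one-row report, constructed so that only `call_setup_success_rate` breaches its target:

```python
import pandas as pd

# Hypothetical one-month report; only CSSR is set below its NCC target.
report = pd.DataFrame([{
    'call_setup_success_rate': 97.4,   # below the 98% minimum
    'call_drop_rate': 0.8,             # within the <= 1% maximum
    'network_availability_pct': 99.7,  # within the >= 99.5% minimum
}])

thresholds = {
    'call_setup_success_rate': {'min': 98.0, 'direction': 'above'},
    'call_drop_rate': {'max': 1.0, 'direction': 'below'},
    'network_availability_pct': {'min': 99.5, 'direction': 'above'},
}

breached = []
for kpi, t in thresholds.items():
    value = report[kpi].iloc[0]
    if t['direction'] == 'above' and value < t['min']:
        breached.append(kpi)
    elif t['direction'] == 'below' and value > t['max']:
        breached.append(kpi)

print(breached)  # ['call_setup_success_rate']
```

A run like this makes a useful unit test fixture: seed one deliberate breach and assert that exactly that KPI is flagged.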

Performance Benchmarks and Infrastructure Sizing

Reference Architecture Sizing

Based on deployments at Nigerian operators with 50-80 million subscribers:

┌──────────────────────────────────────────────────────────────────┐
│          Infrastructure Sizing Reference                         │
│          (60M subscribers, 500M CDRs/day)                        │
│                                                                  │
│  Component              │ Specification          │ Monthly Cost  │
│  ───────────────────────┼────────────────────────┼──────────────│
│  Kafka Cluster          │ 12 brokers, 32GB RAM   │ $8,400       │
│                         │ 2TB SSD each           │              │
│  Spark Cluster          │ 50 workers, 16 cores   │ $18,000      │
│                         │ 64GB RAM, spot pricing │              │
│  ClickHouse Cluster     │ 8 nodes, 32 cores      │ $12,000      │
│                         │ 128GB RAM, 4TB NVMe    │              │
│  Redis (real-time)      │ 3 nodes, 64GB RAM      │ $2,400       │
│  Object Storage (S3)    │ 200TB raw + processed  │ $4,600       │
│  Orchestration (Airflow)│ 2 nodes + RDS backend  │ $1,200       │
│  Monitoring Stack       │ Grafana + Prometheus    │ $800         │
│  ───────────────────────┼────────────────────────┼──────────────│
│  TOTAL                  │                        │ ~$47,400/mo  │
└──────────────────────────────────────────────────────────────────┘

Query Performance Benchmarks

Measured on the ClickHouse cluster described above:

Query Type                                  │ Dataset Size     │ Latency    │ Notes
────────────────────────────────────────────┼──────────────────┼────────────┼────────────────────
Single subscriber CDR history (30 days)     │ 180M records/day │ 120-200ms  │ Indexed on MSISDN
Hourly traffic aggregation (national)       │ 500M records     │ 8-15s      │ Partitioned by date
Top 100 sites by drop rate (daily)          │ 50K sites        │ 2-4s       │ Pre-aggregated
USSD funnel analysis (single code, monthly) │ 1.2B sessions    │ 20-35s     │ Filtered by code
Churn feature computation (full base)       │ 60M subscribers  │ 40-55 min  │ Spark, 50 workers
NCC QoS report generation (monthly)         │ Full month CDRs  │ 3-5 min    │ Materialized view
Ad-hoc subscriber segment query             │ 60M subscribers  │ 5-12s      │ Columnar scan

Data Retention Policy

Nigerian regulatory requirements and analytics needs drive a tiered retention strategy:

┌──────────────────────────────────────────────────────────────────┐
│                  Data Retention Tiers                             │
│                                                                  │
│  Tier 1: Hot (ClickHouse)      │ 3 months  │ Full granularity   │
│  Tier 2: Warm (Parquet on S3)  │ 12 months │ Full granularity   │
│  Tier 3: Cold (Compressed S3)  │ 24 months │ Aggregated hourly  │
│  Tier 4: Archive (Glacier)     │ 7 years   │ Daily aggregates   │
│                                                                  │
│  NCC requirement: Minimum 12 months detailed CDR retention       │
│  Recommended: 24 months detailed for churn model training        │
└──────────────────────────────────────────────────────────────────┘
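A back-of-envelope check of what the hot and warm tiers imply for storage, under assumed per-record sizes (both compression figures below are illustrative, not measured):

```python
# Rough storage sizing for the first two retention tiers above.
# Assumptions (illustrative): 500M CDRs/day; ~200 bytes/record in
# ClickHouse; ~60 bytes/record in Parquet with zstd compression.
CDRS_PER_DAY = 500_000_000

def tib(n_bytes: int) -> float:
    return n_bytes / 1024**4

hot_tib = tib(CDRS_PER_DAY * 90 * 200)    # Tier 1: 3 months, full granularity
warm_tib = tib(CDRS_PER_DAY * 365 * 60)   # Tier 2: 12 months, Parquet on S3

print(f"Hot tier:  ~{hot_tib:.1f} TiB")
print(f"Warm tier: ~{warm_tib:.1f} TiB")
```

Numbers in this range are consistent with the 200TB object storage line in the sizing table once raw landing files, replicas, and aggregates are included.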

Implementation Roadmap

Building a comprehensive telecom analytics platform is a multi-phase effort. Here is a practical roadmap based on deployments at Nigerian operators:

Phase 1: Foundation (Months 1-3)

  • Deploy Kafka cluster for CDR ingestion
  • Build CDR deduplication and quality pipeline
  • Implement basic CDR data warehouse (ClickHouse or BigQuery)
  • Create subscriber dimension with SCD Type 2
  • Deploy basic dashboards (daily traffic, top sites, revenue)
  • Deliverable: Real-time CDR pipeline processing 500M+ records/day
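The SCD Type 2 subscriber dimension from Phase 1 can be sketched in a few lines of pandas — column names here are hypothetical, and a production implementation would use a dbt snapshot or a warehouse MERGE rather than in-memory frames. When a tracked attribute changes, the current row is closed out and a new current row is appended:

```python
import pandas as pd

# Minimal SCD Type 2 sketch (hypothetical columns).
dim = pd.DataFrame([
    {'msisdn': '2348010000001', 'tariff_plan': 'PULSE',
     'valid_from': '2024-01-01', 'valid_to': None, 'is_current': True},
])
incoming = pd.DataFrame([
    {'msisdn': '2348010000001', 'tariff_plan': 'XTRAVALUE'},
])
load_date = '2024-06-01'

# Find current rows whose tracked attribute changed.
merged = incoming.merge(dim[dim['is_current']], on='msisdn',
                        suffixes=('_new', ''))
changed = merged[merged['tariff_plan_new'] != merged['tariff_plan']]

# Close out the changed rows.
dim.loc[dim['msisdn'].isin(changed['msisdn']) & dim['is_current'],
        ['valid_to', 'is_current']] = [load_date, False]

# Append the new current versions.
new_rows = changed[['msisdn', 'tariff_plan_new']].rename(
    columns={'tariff_plan_new': 'tariff_plan'})
new_rows['valid_from'] = load_date
new_rows['valid_to'] = None
new_rows['is_current'] = True
dim = pd.concat([dim, new_rows], ignore_index=True)

print(dim[['msisdn', 'tariff_plan', 'is_current']])
```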

Phase 2: Subscriber Intelligence (Months 3-6)

  • Build USSD session analytics pipeline
  • Implement recharge behavior tracking
  • Deploy subscriber segmentation models (RFM + behavioral)
  • Build churn prediction model v1
  • Create retention action automation framework
  • Deliverable: Daily churn scoring for full subscriber base

Phase 3: Network Intelligence (Months 5-8)

  • Integrate power monitoring data with network KPIs
  • Build power-aware network quality prediction model
  • Deploy proactive degradation alerting system
  • Implement capacity planning analytics
  • Create coverage gap analysis pipeline
  • Deliverable: 4-hour-ahead network quality predictions
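Phase 3's power-aware quality model hinges on turning power-state logs into lagged features and a lead target. A minimal sketch of that feature engineering, with all column names and values hypothetical:

```python
import pandas as pd

# Hypothetical hourly log for one site: generator state and drop rate.
log = pd.DataFrame({
    'site_id': ['LAG001'] * 6,
    'hour': pd.date_range('2024-06-01 00:00', periods=6, freq='h'),
    'on_generator': [0, 0, 1, 1, 1, 0],
    'drop_rate': [0.6, 0.7, 1.4, 1.8, 1.6, 0.8],
})

# Feature: hours on generator over the trailing 3-hour window.
log['gen_hours_3h'] = log['on_generator'].rolling(3, min_periods=1).sum()

# Target: drop rate 4 hours ahead (NaN where the horizon exceeds the log).
log['drop_rate_4h_ahead'] = log['drop_rate'].shift(-4)

print(log[['hour', 'gen_hours_3h', 'drop_rate_4h_ahead']])
```

Framing the target as a 4-hour lead is what turns this into the "4-hour-ahead" prediction problem named in the deliverable.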

Phase 4: Regulatory Automation (Months 6-9)

  • Build automated NCC QoS reporting pipeline (dbt models)
  • Implement validation and reconciliation framework
  • Create subscriber statistics automation
  • Deploy NIN-SIM linkage tracking dashboard
  • Automate number portability reporting
  • Deliverable: 80% reduction in regulatory reporting effort

Phase 5: Advanced Analytics (Months 9-12)

  • Build customer lifetime value prediction model
  • Implement social network analysis on CDR graphs
  • Deploy real-time USSD session optimization (A/B testing menus)
  • Create revenue assurance analytics
  • Build fraud detection models (SIM box, bypass fraud)
  • Deliverable: Full analytics platform with ML-driven insights

Conclusion

The Nigerian telecom data landscape presents unique challenges that cannot be addressed by simply porting Western analytics solutions. The prepaid-dominant subscriber base, USSD as a primary service channel, power supply instability affecting network quality, and NCC regulatory requirements all demand purpose-built data engineering solutions.

The key architectural decisions that distinguish a successful emerging market telecom analytics platform are:

  1. CDR pipelines must be resilient to data quality issues inherent in infrastructure that experiences frequent power events and switch failovers. Deduplication, validation, and reconciliation are not optional — they are foundational.

  2. USSD session analytics must be treated as a first-class data domain, not an afterthought. In a market where USSD interactions outnumber app sessions by orders of magnitude, this data stream contains the richest behavioral signals for understanding subscriber intent.

  3. Network quality models must incorporate power supply as a primary feature, not an anomaly. When 40% of cell sites are running on generators at any given time, power state is not an edge case — it is the dominant factor in quality prediction.

  4. Churn prediction requires fundamentally different feature engineering for prepaid markets. Recharge decay patterns, USSD activity cessation, and multi-SIM behavior signals replace the contract renewal and payment default signals used in postpaid models.

  5. Regulatory reporting automation is high-ROI and low-risk — a practical starting point for operators beginning their analytics journey, delivering immediate value while the foundational data infrastructure is being built.

At Gemut Analytics, we have guided Nigerian operators through each phase of this journey, from foundational CDR pipeline architecture to advanced ML-driven subscriber intelligence. The data is there. The analytical patterns are proven. The competitive advantage belongs to the operators who invest in extracting intelligence from it.


This guide is part of our series on data engineering for African industries. For a deeper exploration of any section — CDR pipeline architecture, USSD analytics implementation, or churn model deployment — contact our team for a technical consultation tailored to your operator's specific data landscape and business objectives.

Key Takeaways

  • CDR analytics pipelines must handle 500M+ daily records with sub-hour latency to enable near-real-time subscriber insights
  • USSD session data represents an untapped analytics goldmine — modeling session flows reveals user intent patterns invisible in app-centric analytics
  • Network quality prediction models that factor power supply instability and generator schedules improve coverage planning by 35%
  • Churn prediction in prepaid markets requires fundamentally different feature engineering — activity decay patterns replace contract renewal signals
  • NCC regulatory reporting can be automated with purpose-built ETL pipelines, reducing compliance overhead by 80%
Gemut Analytics Team
Data Engineering Experts