Telecom Analytics in Emerging Markets: Network Optimization, Churn Prediction, and the USSD Data Goldmine
A comprehensive guide to building telecom analytics platforms in emerging markets, covering CDR pipeline architecture, USSD session analytics, network quality prediction, and churn models for prepaid-dominant markets.
Nigerian telcos serve over 200 million subscribers, generating massive volumes of Call Detail Records, USSD sessions, and network events daily. Unlike mature Western markets dominated by postpaid contracts and app-based interactions, emerging market telecoms face unique data challenges — USSD as a primary service delivery channel, prepaid-dominant subscriber bases, network infrastructure constrained by power instability, and regulatory requirements from the NCC. This guide explores the data engineering and analytics patterns needed to extract actionable intelligence from this data landscape.
Prerequisites
- Understanding of data warehouse and data lake architectures
- Familiarity with SQL and Python for data analysis
- Basic knowledge of machine learning concepts

Introduction
Nigeria's telecommunications sector is one of the most dynamic in Africa. With over 220 million active mobile subscriptions tracked by the Nigerian Communications Commission (NCC), the four major operators — MTN Nigeria, Airtel Nigeria, Globacom (Glo), and 9mobile — collectively generate data volumes that rival operators in markets with twice the population. MTN Nigeria alone serves over 80 million subscribers, making it the largest single-country mobile operator on the African continent.
Yet the data engineering challenges in this market are fundamentally different from those faced by operators in North America or Europe. Consider the contrasts:
| Characteristic | Western Market | Nigerian Market |
|---|---|---|
| Dominant billing model | Postpaid (60-70%) | Prepaid (95%+) |
| Primary service channel | Mobile apps, web portals | USSD (*131#, *556#, *141#) |
| Subscriber identity | Stable contract-based | SIM swap, multi-SIM behavior |
| Network availability | 99.9% uptime target | Power-dependent, 85-95% |
| Churn signal | Contract expiry | Gradual activity decay |
| Revenue per user | $40-60/month ARPU | $3-5/month ARPU |
| Regulatory body | FCC/Ofcom | NCC |
These differences are not cosmetic — they demand different data architectures, different analytical models, and different operational patterns. A churn model trained on postpaid contract expiry data is useless in a market where 95% of subscribers simply stop recharging. A network quality model that assumes stable power supply will produce wildly inaccurate predictions for cell sites running on diesel generators.
This guide provides the data engineering and analytics patterns purpose-built for this reality.
The Telecom Data Landscape in Nigeria
Data Volume Estimates
A large Nigerian telco with 60-80 million subscribers generates approximately the following daily data volumes:
┌──────────────────────────────────────────────────────────────────┐
│ Daily Data Generation (Large Nigerian MNO) │
├─────────────────────────┬────────────────┬───────────────────────┤
│ Data Source │ Daily Records │ Raw Size (compressed) │
├─────────────────────────┼────────────────┼───────────────────────┤
│ Voice CDRs │ 180-250M │ 40-60 GB │
│ SMS CDRs │ 80-120M │ 8-12 GB │
│ Data Session CDRs │ 300-500M │ 80-120 GB │
│ USSD Sessions │ 40-80M │ 5-10 GB │
│ Recharge Events │ 15-30M │ 2-4 GB │
│ Network Events (alarms) │ 5-10M │ 3-6 GB │
│ Cell Tower KPIs │ 2-5M │ 1-3 GB │
│ VAS Transactions │ 20-40M │ 3-5 GB │
├─────────────────────────┼────────────────┼───────────────────────┤
│ TOTAL │ 640M - 1.03B │ 142 - 220 GB/day │
└─────────────────────────┴────────────────┴───────────────────────┘
These numbers translate to roughly 50-80 TB of compressed data per year — on the order of 200-400 TB raw, assuming typical 4-5x CDR compression ratios — substantial, but manageable with modern data lake architectures. The challenge is not raw volume alone; it is the combination of volume, velocity (CDRs must be processed within minutes for fraud detection), variety (structured CDRs, semi-structured USSD logs, unstructured network alarms), and the operational constraints unique to the market.
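The sizing arithmetic can be sketched directly from the table above. The daily ranges come from the table; the raw-to-compressed ratio is an illustrative assumption:

```python
# Quick sizing sketch: annual storage from the daily volumes in the table above.
# The compression ratio is an assumption (CDR text typically compresses 4-5x
# in Parquet/gzip); daily GB figures are the table's compressed range.
DAILY_COMPRESSED_GB = (142, 220)   # low/high from the table
COMPRESSION_RATIO = (4, 5)         # assumed raw : compressed

def annual_storage_tb(daily_gb: float, days: int = 365) -> float:
    """Annual storage in TB for a given daily volume in GB."""
    return daily_gb * days / 1000

compressed_tb = tuple(round(annual_storage_tb(g)) for g in DAILY_COMPRESSED_GB)
raw_tb = (
    round(annual_storage_tb(DAILY_COMPRESSED_GB[0] * COMPRESSION_RATIO[0])),
    round(annual_storage_tb(DAILY_COMPRESSED_GB[1] * COMPRESSION_RATIO[1])),
)
print(f"Compressed: {compressed_tb[0]}-{compressed_tb[1]} TB/year")
print(f"Raw:        {raw_tb[0]}-{raw_tb[1]} TB/year")
```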
The Four Data Domains
Nigerian telco data naturally organizes into four analytical domains, each with distinct engineering requirements:
┌──────────────────────────────────────────────────────────────────┐
│ Telco Analytics Data Domains │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ ┌────────┐ │
│ │ Subscriber │ │ Network │ │ Revenue │ │ Reg & │ │
│ │ Analytics │ │ Analytics │ │ Analytics│ │Compliance│ │
│ ├──────────────┤ ├──────────────┤ ├───────────┤ ├────────┤ │
│ │ CDR patterns │ │ Cell KPIs │ │ Recharge │ │ NCC │ │
│ │ USSD behavior│ │ Power status │ │ ARPU │ │ QoS │ │
│ │ Churn models │ │ Coverage │ │ VAS usage │ │ KYC │ │
│ │ Segmentation │ │ Capacity │ │ Bundles │ │ Number │ │
│ │ Lifetime val │ │ Optimization │ │ Revenue │ │ Porting│ │
│ └──────┬───────┘ └──────┬───────┘ └─────┬─────┘ └───┬────┘ │
│ │ │ │ │ │
│ └────────┬────────┴────────┬───────┘ │ │
│ │ │ │ │
│ ┌─────▼─────┐ ┌─────▼──────┐ ┌─────────▼──────┐ │
│ │ Data Lake │ │ Data │ │ Regulatory │ │
│ │ (Raw Zone) │ │ Warehouse │ │ Reporting DB │ │
│ └────────────┘ └────────────┘ └────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
Building CDR Analytics Pipelines at Scale
CDR Data Model
A Call Detail Record captures the metadata of every network event. The Nigerian telco CDR typically contains fields beyond what Western operators log, including recharge channel codes and USSD-originated call flags.
Here is a normalized CDR data model suitable for an analytical data warehouse:
-- Core CDR fact table (partitioned by date and record_type)
CREATE TABLE cdr_fact (
    cdr_id                BIGINT GENERATED ALWAYS AS IDENTITY,
    record_type           VARCHAR(10),      -- 'VOICE', 'SMS', 'DATA', 'USSD'
    event_timestamp       TIMESTAMP NOT NULL,
    event_date            DATE NOT NULL,    -- partition key
    event_hour            SMALLINT,
    -- Subscriber identifiers
    calling_msisdn        VARCHAR(15),
    called_msisdn         VARCHAR(15),
    calling_imsi          VARCHAR(20),
    calling_imei          VARCHAR(20),
    -- Network identifiers
    serving_cell_id       VARCHAR(20),
    serving_lac           INTEGER,
    serving_msc           VARCHAR(20),
    terminating_cell_id   VARCHAR(20),
    roaming_indicator     BOOLEAN DEFAULT FALSE,
    -- Call attributes
    duration_seconds      INTEGER,
    call_type             VARCHAR(10),      -- 'MO', 'MT', 'CF'
    disconnect_reason     VARCHAR(10),
    bearer_service        VARCHAR(10),
    -- Data session attributes (NULL for voice/SMS)
    bytes_uploaded        BIGINT,
    bytes_downloaded      BIGINT,
    apn                   VARCHAR(50),
    rat_type              VARCHAR(10),      -- '2G', '3G', '4G'
    -- Revenue
    charge_amount         DECIMAL(10,4),
    account_balance       DECIMAL(12,4),
    tariff_plan_id        INTEGER,
    -- Metadata
    switch_id             VARCHAR(20),
    file_id               VARCHAR(50),
    load_timestamp        TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
PARTITION BY RANGE (event_date);

-- Monthly partitions (example for June 2024)
CREATE TABLE cdr_fact_2024_06
    PARTITION OF cdr_fact
    FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');
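Creating monthly partitions by hand does not scale to years of retention, so partition maintenance is usually scripted. A minimal sketch that generates the DDL above for any month (the function name and automation hook are ours, not from the source):

```python
from datetime import date

def monthly_partition_ddl(table: str, year: int, month: int) -> str:
    """Generate a CREATE TABLE ... PARTITION OF statement for one month,
    matching the cdr_fact partitioning convention above."""
    start = date(year, month, 1)
    # Exclusive upper bound: first day of the following month
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return (
        f"CREATE TABLE {table}_{start:%Y_%m}\n"
        f"    PARTITION OF {table}\n"
        f"    FOR VALUES FROM ('{start}') TO ('{end}');"
    )

print(monthly_partition_ddl("cdr_fact", 2024, 6))
```

A scheduler (Airflow, cron) would typically run this a month ahead so partitions exist before the first CDR for that month arrives.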
Subscriber Dimension
Multi-SIM behavior is a defining characteristic of the Nigerian market. A single individual may carry two or three SIM cards — one MTN for calls, one Glo for data, one Airtel for specific USSD services. The subscriber dimension must account for this:
CREATE TABLE dim_subscriber (
    subscriber_key        BIGINT GENERATED ALWAYS AS IDENTITY,
    msisdn                VARCHAR(15) NOT NULL,
    imsi                  VARCHAR(20),
    activation_date       DATE,
    last_activity_date    DATE,
    subscriber_status     VARCHAR(20),  -- 'ACTIVE', 'DORMANT', 'CHURNED', 'SUSPENDED'
    -- Nigerian-specific attributes
    nin_linked            BOOLEAN DEFAULT FALSE,  -- NIN-SIM linkage status
    kyc_status            VARCHAR(20),
    registration_region   VARCHAR(50),
    current_tariff_plan   VARCHAR(50),
    account_type          VARCHAR(10),  -- 'PREPAID', 'POSTPAID', 'HYBRID'
    -- Behavioral segmentation
    arpu_30d              DECIMAL(10,2),
    recharge_frequency    VARCHAR(20),  -- 'DAILY', 'WEEKLY', 'MONTHLY', 'IRREGULAR'
    primary_usage         VARCHAR(20),  -- 'VOICE', 'DATA', 'USSD_SERVICES', 'MIXED'
    device_tier           VARCHAR(20),  -- 'BASIC', 'FEATURE', 'SMARTPHONE_LOW', 'SMARTPHONE_HIGH'
    -- SCD Type 2 effective dates
    effective_from        TIMESTAMP,
    effective_to          TIMESTAMP,
    is_current            BOOLEAN DEFAULT TRUE
);

CREATE INDEX idx_subscriber_msisdn ON dim_subscriber(msisdn) WHERE is_current = TRUE;
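The effective-date columns implement Slowly Changing Dimension Type 2 semantics: when a tracked attribute changes (say, the tariff plan), the current row is closed out and a new current row is opened. A minimal in-memory sketch of that upsert using plain dicts — the real pipeline would express this in SQL MERGE or Spark, and the function name is ours:

```python
from datetime import datetime

def scd2_upsert(rows: list, msisdn: str, changes: dict, now: datetime) -> list:
    """Close the current dim_subscriber row for msisdn and append a new
    current row carrying the changed attributes (SCD Type 2)."""
    out = []
    current = None
    for row in rows:
        if row["msisdn"] == msisdn and row["is_current"]:
            # Close out the existing current row
            out.append({**row, "effective_to": now, "is_current": False})
            current = row
        else:
            out.append(row)
    if current is not None:
        # Open a new current row with the updated attributes
        out.append({**current, **changes,
                    "effective_from": now, "effective_to": None,
                    "is_current": True})
    return out

t0, t1 = datetime(2024, 1, 1), datetime(2024, 6, 1)
rows = [{"msisdn": "2348031234567", "current_tariff_plan": "PlanA",
         "effective_from": t0, "effective_to": None, "is_current": True}]
rows = scd2_upsert(rows, "2348031234567", {"current_tariff_plan": "PlanB"}, t1)
```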
Pipeline Architecture
The CDR ingestion pipeline must handle 500M+ records daily with sub-hour latency for operational analytics while also feeding batch processes for deeper analysis:
┌──────────────────────────────────────────────────────────────────────┐
│ CDR Analytics Pipeline Architecture │
│ │
│ ┌─────────┐ ┌─────────┐ ┌──────────┐ ┌───────────────────┐ │
│ │ Network │ │ Mediation│ │ Kafka │ │ Stream Processing│ │
│ │ Switches │──▶│ Platform │──▶│ Cluster │──▶│ (Spark/Flink) │ │
│ │ (MSC/ │ │ │ │ │ │ │ │
│ │ SGSN/ │ │ - Parse │ │ Topics: │ │ - Deduplication │ │
│ │ GGSN) │ │ - Validate│ │ voice │ │ - Enrichment │ │
│ └─────────┘ │ - Normalize│ │ sms │ │ - Fraud scoring │ │
│ │ │ │ data │ │ - Real-time agg │ │
│ └─────────┘ │ ussd │ └────────┬──────────┘ │
│ │ recharge│ │ │
│ └──────────┘ ┌────────▼──────────┐ │
│ │ Landing Zone │ │
│ │ (S3 / HDFS) │ │
│ │ Parquet format │ │
│ └────────┬──────────┘ │
│ │ │
│ ┌────────────────────────┤ │
│ │ │ │
│ ┌────────▼────────┐ ┌─────────▼──────────┐ │
│ │ Batch Processing │ │ Real-time Serving │ │
│ │ (Spark / dbt) │ │ (Redis / Druid) │ │
│ │ │ │ │ │
│ │ - Daily agg │ │ - Live dashboards │ │
│ │ - Feature eng │ │ - Fraud alerts │ │
│ │ - ML features │ │ - Threshold alerts │ │
│ │ - Reporting │ │ │ │
│ └────────┬─────────┘ └────────────────────┘ │
│ │ │
│ ┌────────▼─────────┐ │
│ │ Analytics │ │
│ │ Warehouse │ │
│ │ (Snowflake / │ │
│ │ BigQuery / │ │
│ │ ClickHouse) │ │
│ └──────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
CDR Deduplication and Quality
Duplicate CDRs are a persistent problem, particularly during switch failovers or mediation platform retries. A deduplication strategy is essential before any analytics:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("CDR_Deduplication") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

def deduplicate_cdrs(raw_cdr_df):
    """
    Deduplicate CDRs using a composite key approach.

    Nigerian CDR feeds commonly produce 2-5% duplicates due to:
    - Mediation platform retries during network instability
    - Switch failover generating duplicate records
    - File-level reprocessing after delivery failures
    """
    # Define deduplication key
    dedup_window = Window.partitionBy(
        "calling_msisdn",
        "called_msisdn",
        "event_timestamp",
        "duration_seconds",
        "record_type"
    ).orderBy(F.col("file_id").desc())  # Prefer latest file

    deduped_df = raw_cdr_df \
        .withColumn("row_num", F.row_number().over(dedup_window)) \
        .filter(F.col("row_num") == 1) \
        .drop("row_num")

    # Log deduplication metrics
    raw_count = raw_cdr_df.count()
    deduped_count = deduped_df.count()
    dup_rate = (raw_count - deduped_count) / raw_count * 100
    print(f"Raw records: {raw_count:,}")
    print(f"After dedup: {deduped_count:,}")
    print(f"Duplicate rate: {dup_rate:.2f}%")

    return deduped_df
def validate_cdr_quality(cdr_df):
    """
    Apply quality checks specific to Nigerian CDR data.
    Returns a DataFrame with quality flags for downstream filtering.
    """
    # Nigerian MSISDNs in international format: 234 prefix + 10 digits
    # Local prefixes: 0803, 0806, 0810, 0813, 0814, 0816 (MTN)
    #                 0802, 0808, 0812, 0701, 0708, 0902 (Airtel)
    #                 0805, 0807, 0811, 0815, 0905, 0915 (Glo)
    #                 0809, 0817, 0818, 0908, 0909 (9mobile)
    validated_df = cdr_df \
        .withColumn("is_valid_msisdn",
            F.col("calling_msisdn").rlike("^234[0-9]{10}$")
        ) \
        .withColumn("is_valid_duration",
            (F.col("duration_seconds") >= 0) &
            (F.col("duration_seconds") < 86400)  # Max 24 hours
        ) \
        .withColumn("is_valid_timestamp",
            (F.col("event_timestamp") >= F.lit("2020-01-01")) &
            (F.col("event_timestamp") <= F.current_timestamp())
        ) \
        .withColumn("is_future_dated",
            F.col("event_timestamp") > F.current_timestamp()
        ) \
        .withColumn("quality_score",
            F.col("is_valid_msisdn").cast("int") +
            F.col("is_valid_duration").cast("int") +
            F.col("is_valid_timestamp").cast("int")
        )
    return validated_df
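The operator-prefix comments above can be turned into a reusable lookup. A plain-Python sketch (prefix lists copied from the comments; under NCC number portability a prefix only indicates the original issuing operator, so treat this as a heuristic — the helper name is ours):

```python
import re
from typing import Optional

# Local (0-prefixed) number ranges from the comments above. Number
# portability means prefix -> operator is a heuristic, not ground truth.
OPERATOR_PREFIXES = {
    "MTN":     {"0803", "0806", "0810", "0813", "0814", "0816"},
    "Airtel":  {"0802", "0808", "0812", "0701", "0708", "0902"},
    "Glo":     {"0805", "0807", "0811", "0815", "0905", "0915"},
    "9mobile": {"0809", "0817", "0818", "0908", "0909"},
}

# Same pattern as the rlike() check in validate_cdr_quality
MSISDN_RE = re.compile(r"^234[0-9]{10}$")

def classify_msisdn(msisdn: str) -> Optional[str]:
    """Return the likely issuing operator for an international-format MSISDN,
    None if the number is malformed, 'UNKNOWN' if the prefix is unlisted."""
    if not MSISDN_RE.match(msisdn):
        return None
    local_prefix = "0" + msisdn[3:6]   # '2348031234567' -> '0803'
    for operator, prefixes in OPERATOR_PREFIXES.items():
        if local_prefix in prefixes:
            return operator
    return "UNKNOWN"
```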
Partitioning Strategy for CDR Storage
At 500M+ records per day, partitioning strategy directly impacts query performance. The recommended approach for Nigerian telco CDR data:
-- Primary partitioning: event_date (monthly partitions)
-- Secondary partitioning: Record type
-- Sort/clustering key: Calling MSISDN (for subscriber-level queries)

-- For ClickHouse (common in telco analytics):
CREATE TABLE cdr_fact_optimized (
    event_date           Date,
    event_timestamp      DateTime,
    record_type          LowCardinality(String),
    calling_msisdn       String,
    called_msisdn        String,
    serving_cell_id      String,
    duration_seconds     UInt32,
    bytes_downloaded     UInt64,
    bytes_uploaded       UInt64,
    charge_amount        Decimal64(4),
    rat_type             LowCardinality(String),
    tariff_plan_id       UInt16
)
ENGINE = MergeTree()
PARTITION BY (toYYYYMM(event_date), record_type)
ORDER BY (calling_msisdn, event_timestamp)
TTL event_date + INTERVAL 24 MONTH;

-- Query performance benchmarks (100-node ClickHouse cluster):
--   Single subscriber 30-day history:  < 200ms
--   Hourly aggregation (all subs):     < 15 seconds
--   Daily traffic report (1 day):      < 5 seconds
--   Full table scan (1 month):         ~ 45 seconds
CDR Enrichment Pipeline
Raw CDRs must be enriched with subscriber, network, and geographic context before they become analytically useful:
from pyspark.sql import functions as F

def enrich_cdrs(cdr_df, subscriber_df, cell_tower_df, tariff_df):
    """
    Enrich raw CDRs with dimensional context.

    This is the most resource-intensive step in the pipeline,
    requiring broadcast joins for small dimension tables and a shuffled
    join for the large subscriber dimension.
    """
    # Materialize the projected dimensions first so the join conditions
    # below reference the same DataFrame objects that are being joined.
    cell_dim = cell_tower_df.select(
        "cell_id", "site_name", "latitude", "longitude",
        "state", "lga", "region", "site_type",
        "power_source", "has_generator_backup"
    )
    tariff_dim = tariff_df.select(
        "tariff_plan_id", "plan_name", "plan_category",
        "voice_rate_per_sec", "data_rate_per_mb"
    )
    sub_dim = subscriber_df.select(
        "msisdn", "subscriber_key", "subscriber_status",
        "arpu_30d", "device_tier", "recharge_frequency",
        "primary_usage", "nin_linked"
    )

    # Broadcast small dimensions (cell towers: ~50K rows in Nigeria);
    # the subscriber dimension is too large to broadcast and shuffles.
    enriched = cdr_df \
        .join(F.broadcast(cell_dim),
              cdr_df.serving_cell_id == cell_dim.cell_id, "left") \
        .join(F.broadcast(tariff_dim), on="tariff_plan_id", how="left") \
        .join(sub_dim, cdr_df.calling_msisdn == sub_dim.msisdn, "left")

    # Add derived fields specific to Nigerian context
    enriched = enriched \
        .withColumn("is_on_net",
            F.when(
                F.col("called_msisdn").substr(1, 6).isin(
                    ["234803", "234806", "234810", "234813",  # MTN prefixes
                     "234814", "234816", "234903", "234906"]
                ), True
            ).otherwise(False)
        ) \
        .withColumn("time_band",
            F.when(F.col("event_hour").between(0, 5), "OFF_PEAK")
             .when(F.col("event_hour").between(6, 8), "MORNING_PEAK")
             .when(F.col("event_hour").between(9, 17), "BUSINESS")
             .when(F.col("event_hour").between(18, 21), "EVENING_PEAK")
             .otherwise("NIGHT")
        ) \
        .withColumn("is_weekend",
            F.dayofweek(F.col("event_date")).isin([1, 7])
        ) \
        .withColumn("geopolitical_zone",
            F.when(F.col("state").isin(
                ["Lagos", "Ogun", "Oyo", "Ondo", "Ekiti", "Osun"]
            ), "South-West")
             .when(F.col("state").isin(
                ["Rivers", "Bayelsa", "Delta", "Edo", "Akwa Ibom", "Cross River"]
             ), "South-South")
             .when(F.col("state").isin(
                ["Abia", "Anambra", "Ebonyi", "Enugu", "Imo"]
             ), "South-East")
             .when(F.col("state").isin(
                ["Kano", "Kaduna", "Katsina", "Jigawa", "Kebbi", "Sokoto", "Zamfara"]
             ), "North-West")
             .when(F.col("state").isin(
                ["Borno", "Yobe", "Adamawa", "Bauchi", "Gombe", "Taraba"]
             ), "North-East")
             .otherwise("North-Central")
        )

    return enriched
USSD Session Analytics: The Untapped Goldmine
Why USSD Matters in Nigeria
In mature markets, user behavior analytics centers on app usage, web sessions, and API calls. In Nigeria, USSD (Unstructured Supplementary Service Data) remains the primary service delivery channel for the majority of subscribers. When a subscriber dials *131# on MTN to check their balance, or *556# on MTN to buy a data bundle, or *141# on Airtel to recharge — they are generating structured session data that reveals intent, preferences, and friction points.
The scale is significant:
- 40-80 million USSD sessions per day across a major operator
- Average of 3-5 USSD interactions per active subscriber per day
- USSD is the primary channel for mobile money (bank USSD codes like *737# for GTBank, *901# for Access Bank, *894# for FirstBank)
Despite this volume, USSD analytics is chronically underinvested. Most operators treat USSD logs as operational data for troubleshooting, not as a first-class analytics stream. This represents one of the largest untapped data assets in emerging market telecoms.
USSD Session Data Model
USSD sessions have a hierarchical structure — a session contains multiple screens, each representing a menu interaction:
-- USSD session table
CREATE TABLE ussd_sessions (
    session_id           VARCHAR(50) PRIMARY KEY,
    msisdn               VARCHAR(15) NOT NULL,
    ussd_code            VARCHAR(20) NOT NULL,  -- e.g., '*131#', '*556#'
    service_type         VARCHAR(50),           -- 'BALANCE_CHECK', 'DATA_PURCHASE', etc.
    session_start        TIMESTAMP NOT NULL,
    session_end          TIMESTAMP,
    session_duration_ms  INTEGER,
    total_screens        SMALLINT,
    final_screen         SMALLINT,              -- Last screen reached
    session_outcome      VARCHAR(20),           -- 'COMPLETED', 'ABANDONED', 'TIMEOUT', 'ERROR'
    gateway_id           VARCHAR(20),
    serving_cell_id      VARCHAR(20),
    event_date           DATE NOT NULL
);

-- USSD screen-level interactions
CREATE TABLE ussd_screens (
    screen_id         BIGINT GENERATED ALWAYS AS IDENTITY,
    session_id        VARCHAR(50) NOT NULL,
    screen_number     SMALLINT NOT NULL,
    screen_text       TEXT,           -- Menu content displayed
    user_input        VARCHAR(100),   -- Subscriber's selection
    response_time_ms  INTEGER,        -- Time to respond
    screen_timestamp  TIMESTAMP NOT NULL,
    FOREIGN KEY (session_id) REFERENCES ussd_sessions(session_id)
);

-- Common USSD service codes (Nigerian operators)
--   MTN:     *131# (balance), *556# (data), *461# (tariff), *600# (customer care)
--   Airtel:  *141# (recharge), *141*712# (data), *123# (balance)
--   Glo:     *124# (balance), *127*0# (data), *777# (self-service)
--   9mobile: *232# (balance), *229# (data), *200# (self-service)
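The codes themselves carry structure: `*141*712#` is a root service (`*141#`) plus a pre-navigated sub-menu selection (`712`). Normalizing codes into root and path makes per-service aggregation easier. A small parser for this convention (the segment semantics are an assumption; gateway behavior varies by operator):

```python
import re

def parse_ussd_code(code: str) -> dict:
    """Split a USSD code like '*141*712#' into its root service code
    and the pre-navigated sub-menu path."""
    m = re.match(r"^\*(\d+(?:\*\d+)*)#$", code)
    if not m:
        raise ValueError(f"not a valid USSD code: {code!r}")
    segments = m.group(1).split("*")
    return {
        "root": f"*{segments[0]}#",           # top-level service entry point
        "path": segments[1:],                 # pre-selected menu options
        "is_direct_dial": len(segments) > 1,  # user skipped the menu tree
    }
```

Grouping sessions by `root` rather than the raw dialed string prevents `*141#` and `*141*712#` traffic from being counted as unrelated services.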
Modeling USSD Funnels
USSD sessions follow predictable menu trees. By modeling these as funnels, you can identify drop-off points, quantify conversion rates, and optimize the user experience:
import pandas as pd

def build_ussd_funnel(sessions_df, ussd_code, date_range):
    """
    Build a conversion funnel for a specific USSD service.

    Example: MTN *556# data bundle purchase flow:
      Screen 1: Main menu (Buy Data Bundle)
      Screen 2: Select bundle type (Daily/Weekly/Monthly)
      Screen 3: Select specific bundle (1GB, 2GB, 5GB...)
      Screen 4: Confirm purchase
      Screen 5: Success/Failure notification
    """
    filtered = sessions_df[
        (sessions_df['ussd_code'] == ussd_code) &
        (sessions_df['event_date'].between(*date_range))
    ]
    total_sessions = len(filtered)
    if total_sessions == 0:
        return None

    funnel = {}
    for screen_num in range(1, 6):
        reached = len(filtered[filtered['total_screens'] >= screen_num])
        funnel[f'Screen {screen_num}'] = {
            'sessions': reached,
            'rate': reached / total_sessions * 100,
            'drop_off': (total_sessions - reached) / total_sessions * 100
        }

    completed = len(filtered[filtered['session_outcome'] == 'COMPLETED'])
    funnel['Completed'] = {
        'sessions': completed,
        'rate': completed / total_sessions * 100
    }

    # Analyze abandonment reasons
    abandoned = filtered[filtered['session_outcome'] == 'ABANDONED']
    abandon_by_screen = abandoned.groupby('final_screen').size()
    funnel['abandonment_distribution'] = abandon_by_screen.to_dict()

    return funnel
def analyze_ussd_response_times(screens_df, ussd_code):
    """
    Analyze user response times at each USSD screen.

    Expects screens_df to have been joined with ussd_sessions so that
    each screen row carries the session's ussd_code.

    High response times at specific screens indicate:
    - Confusing menu options
    - Too many choices presented
    - Users reading promotional text instead of navigating

    Typical benchmarks:
    - Screen 1 (main menu): 2-4 seconds
    - Selection screens: 3-6 seconds
    - Confirmation screens: 1-3 seconds
    - Response > 15 seconds: likely confused or distracted
    - USSD timeout: 120 seconds (standard)
    """
    filtered = screens_df[screens_df['ussd_code'] == ussd_code]
    stats = filtered.groupby('screen_number')['response_time_ms'].agg([
        'count', 'mean', 'median',
        lambda x: x.quantile(0.95)
    ]).rename(columns={'<lambda_0>': 'p95'})

    # Convert milliseconds to seconds for reporting
    stats['mean'] = stats['mean'] / 1000
    stats['median'] = stats['median'] / 1000
    stats['p95'] = stats['p95'] / 1000
    return stats
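The same percentile analysis is easy to run outside the warehouse with the standard library alone, which is handy for ad-hoc checks on exported samples. A sketch over synthetic (screen_number, response_ms) pairs:

```python
from collections import defaultdict
from statistics import median, quantiles

def response_time_stats(screens):
    """Per-screen count, median, and p95 response times (ms) from an
    iterable of (screen_number, response_ms) pairs."""
    by_screen = defaultdict(list)
    for screen_number, response_ms in screens:
        by_screen[screen_number].append(response_ms)
    stats = {}
    for screen_number, times in sorted(by_screen.items()):
        stats[screen_number] = {
            "count": len(times),
            "median_ms": median(times),
            # quantiles(n=20)[18] is the 95th percentile cut point
            "p95_ms": quantiles(times, n=20)[18] if len(times) > 1 else times[0],
        }
    return stats

# Synthetic sample: screen 1 responses cluster around the 2-4s benchmark
sample = [(1, t) for t in [2000, 3000, 4000, 2500]] + [(2, t) for t in [5000, 6000]]
print(response_time_stats(sample))
```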
USSD Session Flow Visualization Query
Understanding how subscribers navigate USSD menus requires analyzing session flows as directed graphs:
-- Build USSD navigation graph for a specific service code
-- This query produces the edges for a Sankey or flow diagram
WITH screen_transitions AS (
SELECT
s1.session_id,
s1.screen_number AS from_screen,
s1.user_input AS user_selection,
s2.screen_number AS to_screen,
s2.screen_text AS next_screen_content
FROM ussd_screens s1
INNER JOIN ussd_screens s2
ON s1.session_id = s2.session_id
AND s2.screen_number = s1.screen_number + 1
INNER JOIN ussd_sessions sess
ON s1.session_id = sess.session_id
WHERE sess.ussd_code = '*556#'
AND sess.event_date BETWEEN '2024-06-01' AND '2024-06-30'
),
flow_summary AS (
SELECT
from_screen,
user_selection,
to_screen,
COUNT(*) AS transition_count,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY from_screen)
AS transition_pct
FROM screen_transitions
GROUP BY from_screen, user_selection, to_screen
)
SELECT
from_screen,
user_selection,
to_screen,
transition_count,
ROUND(transition_pct, 1) AS transition_pct
FROM flow_summary
ORDER BY from_screen, transition_count DESC;
-- Example output for *556# (MTN data bundles):
-- from_screen | user_selection | to_screen | count | pct
-- 1 | '1' | 2 | 850,000 | 62.3% (Buy data)
-- 1 | '2' | 2 | 280,000 | 20.5% (Check balance)
-- 1 | '3' | 2 | 120,000 | 8.8% (Gift data)
-- 1 | TIMEOUT | NULL | 115,000 | 8.4% (Abandoned)
-- 2 | '1' | 3 | 510,000 | 60.0% (Daily bundles)
-- 2 | '2' | 3 | 220,000 | 25.9% (Weekly bundles)
-- 2 | '3' | 3 | 100,000 | 11.8% (Monthly bundles)
Extracting Intent from USSD Patterns
Beyond individual session analysis, longitudinal USSD patterns reveal subscriber intent that is invisible in traditional CDR analysis:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def extract_ussd_intent_features(ussd_sessions_df, lookback_days=30):
    """
    Extract behavioral features from USSD session history.

    These features capture intent signals unique to USSD-heavy markets:
    - Balance check frequency → price sensitivity indicator
    - Failed purchase attempts → willingness to pay vs. insufficient funds
    - Service exploration patterns → upsell opportunity signals
    - Time-of-day patterns → lifestyle indicators
    """
    # rangeBetween requires a numeric ordering expression, so order by
    # the session start cast to epoch seconds
    window_30d = Window.partitionBy("msisdn") \
        .orderBy(F.col("session_start").cast("long")) \
        .rangeBetween(-lookback_days * 86400, 0)

    features = ussd_sessions_df \
        .withColumn("event_hour", F.hour("session_start")) \
        .withColumn("balance_check_count",
            F.sum(F.when(
                F.col("service_type") == "BALANCE_CHECK", 1
            ).otherwise(0)).over(window_30d)
        ) \
        .withColumn("data_purchase_attempts",
            F.sum(F.when(
                (F.col("ussd_code").isin("*556#", "*141*712#", "*127*0#", "*229#")) &
                (F.col("service_type").like("%DATA%")), 1
            ).otherwise(0)).over(window_30d)
        ) \
        .withColumn("data_purchase_success",
            F.sum(F.when(
                (F.col("service_type").like("%DATA%")) &
                (F.col("session_outcome") == "COMPLETED"), 1
            ).otherwise(0)).over(window_30d)
        ) \
        .withColumn("purchase_failure_rate",
            F.when(F.col("data_purchase_attempts") > 0,
                1 - (F.col("data_purchase_success") / F.col("data_purchase_attempts"))
            ).otherwise(0)
        ) \
        .withColumn("balance_check_before_purchase",
            # Checks balance within 5 minutes before attempting purchase
            # High correlation → price-sensitive subscriber
            F.lit(None)  # Implemented via sessionized window logic
        ) \
        .withColumn("avg_session_depth",
            F.avg("total_screens").over(window_30d)
        ) \
        .withColumn("hour_session_count",
            F.count(F.lit(1)).over(Window.partitionBy("msisdn", "event_hour"))
        ) \
        .withColumn("ussd_hour_mode",
            # Most common hour of USSD usage — lifestyle proxy;
            # take the hour with the highest per-subscriber session count
            F.first("event_hour").over(
                Window.partitionBy("msisdn")
                      .orderBy(F.desc("hour_session_count"))
            )
        ) \
        .withColumn("distinct_services_used",
            F.size(F.collect_set("ussd_code").over(window_30d))
        )

    return features
Network Quality Prediction and Power Supply Impact
The Power Problem
Network quality in Nigeria cannot be modeled without accounting for the power supply situation. Unlike Western markets where cell tower uptime is essentially 100% due to reliable grid power, Nigerian cell sites face:
- National grid availability: 40-60% in urban areas, lower in rural areas
- Diesel generator dependency: Most sites have generators as primary or backup
- Generator fuel costs: Significant portion of opex (30-40% for some operators)
- Battery backup duration: Typically 4-8 hours after power loss
- Solar hybrid sites: Growing adoption but still minority
This creates a predictable pattern: network quality degrades on a schedule driven by power events, not by traffic load alone.
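That predictability can be made concrete with a simple depletion model: once the grid drops, remaining uptime is generator fuel hours plus battery hours at the current load. A sketch under assumed capacity and load figures (all numbers illustrative, not operator data):

```python
def hours_until_site_down(battery_kwh: float, battery_level_pct: float,
                          site_load_kw: float, generator_available: bool,
                          fuel_hours_remaining: float = 0.0) -> float:
    """Estimate hours until a site loses power after a grid outage.

    Assumes the generator (if available and fueled) runs first, then the
    battery bank carries the load until exhaustion. Ignores load shedding
    and battery derating, so this is an optimistic upper bound.
    """
    battery_hours = (battery_kwh * battery_level_pct / 100) / site_load_kw
    generator_hours = fuel_hours_remaining if generator_available else 0.0
    return generator_hours + battery_hours

# Example: 20 kWh bank at 80%, 2 kW site load, generator with 6h of fuel
print(hours_until_site_down(20, 80, 2, True, 6))   # generator 6h + battery 8h
```

Note that a site with no generator and a half-charged bank lands squarely in the 4-8 hour battery window cited above, which is why outage duration, not outage occurrence, drives the quality models that follow.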
Cell Site Power and Performance Data Model
CREATE TABLE cell_site_power_events (
    event_id                 BIGINT GENERATED ALWAYS AS IDENTITY,
    site_id                  VARCHAR(20) NOT NULL,
    event_timestamp          TIMESTAMP NOT NULL,
    event_type               VARCHAR(30),  -- 'GRID_ON', 'GRID_OFF', 'GEN_START',
                                           -- 'GEN_STOP', 'GEN_FUEL_LOW',
                                           -- 'BATTERY_LOW', 'SITE_DOWN'
    power_source             VARCHAR(20),  -- 'GRID', 'GENERATOR', 'BATTERY', 'SOLAR', 'NONE'
    battery_level_pct        DECIMAL(5,2),
    fuel_level_pct           DECIMAL(5,2),
    grid_voltage             DECIMAL(6,2),
    generator_runtime_hours  DECIMAL(8,2),
    event_date               DATE NOT NULL
);

CREATE TABLE cell_site_kpis (
    site_id                  VARCHAR(20) NOT NULL,
    kpi_timestamp            TIMESTAMP NOT NULL,  -- 15-minute intervals
    event_date               DATE NOT NULL,
    -- Availability KPIs
    site_available           BOOLEAN,
    cells_configured         SMALLINT,
    cells_active             SMALLINT,
    -- Traffic KPIs
    voice_erlangs            DECIMAL(8,2),
    data_throughput_mbps     DECIMAL(10,2),
    active_users             INTEGER,
    rrc_connected_users      INTEGER,
    -- Quality KPIs
    call_setup_success_rate  DECIMAL(5,2),
    call_drop_rate           DECIMAL(5,4),
    handover_success_rate    DECIMAL(5,2),
    dl_throughput_avg_kbps   DECIMAL(10,2),
    ul_throughput_avg_kbps   DECIMAL(10,2),
    latency_avg_ms           DECIMAL(8,2),
    -- Power context (joined from power events)
    current_power_source     VARCHAR(20),
    hours_since_grid_power   DECIMAL(6,2),
    battery_level_pct        DECIMAL(5,2)
);
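Downstream models often need a single target rather than a vector of KPIs. One common approach is a weighted composite over the quality KPIs above; the weights and throughput target below are illustrative assumptions, not an industry standard, and operators would tune them against NCC QoS targets:

```python
def composite_quality_score(call_setup_success_rate: float,
                            call_drop_rate: float,
                            handover_success_rate: float,
                            dl_throughput_avg_kbps: float,
                            target_throughput_kbps: float = 5000.0) -> float:
    """Collapse the quality KPIs from cell_site_kpis into a 0-100 score.

    Weights are illustrative: accessibility (setup success) weighted
    highest, then retainability (drops), mobility, and throughput.
    """
    throughput_score = min(dl_throughput_avg_kbps / target_throughput_kbps, 1.0) * 100
    drop_score = max(0.0, 100 - call_drop_rate * 20)  # a 5% drop rate scores 0
    return round(
        0.35 * call_setup_success_rate
        + 0.25 * drop_score
        + 0.20 * handover_success_rate
        + 0.20 * throughput_score,
        1,
    )
```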
Power-Aware Network Quality Model
Standard network quality prediction models use traffic volume, time of day, and historical patterns. For Nigerian deployments, adding power supply features dramatically improves accuracy:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, r2_score

def build_network_quality_features(site_kpis_df, power_events_df):
    """
    Build feature matrix for network quality prediction,
    incorporating power supply context.

    Target variable: call_drop_rate (or composite quality score)
    """
    # Attach the most recent power event at or before each KPI timestamp.
    # cell_site_kpis already carries current_power_source, battery_level_pct,
    # and hours_since_grid_power; the power-event feed contributes fuel level.
    features_df = pd.merge_asof(
        site_kpis_df.sort_values('kpi_timestamp'),
        power_events_df[['site_id', 'event_timestamp', 'fuel_level_pct']]
            .sort_values('event_timestamp'),
        left_on='kpi_timestamp',
        right_on='event_timestamp',
        by='site_id',
        direction='backward'
    )

    # Standard network features
    features_df['hour_of_day'] = features_df['kpi_timestamp'].dt.hour
    features_df['day_of_week'] = features_df['kpi_timestamp'].dt.dayofweek
    features_df['is_weekend'] = features_df['day_of_week'].isin([5, 6]).astype(int)
    features_df['utilization_ratio'] = (
        features_df['active_users'] / features_df['rrc_connected_users']
    ).clip(0, 1)

    # Power-specific features (Nigerian context)
    features_df['is_on_generator'] = (
        features_df['current_power_source'] == 'GENERATOR'
    ).astype(int)
    features_df['is_on_battery'] = (
        features_df['current_power_source'] == 'BATTERY'
    ).astype(int)
    features_df['hours_without_grid'] = features_df['hours_since_grid_power'].clip(0, 72)
    features_df['battery_critical'] = (
        features_df['battery_level_pct'] < 20
    ).astype(int)
    features_df['fuel_critical'] = (
        features_df['fuel_level_pct'] < 15
    ).astype(int)

    # Interaction features
    features_df['high_traffic_low_battery'] = (
        (features_df['utilization_ratio'] > 0.7) &
        (features_df['battery_level_pct'] < 30)
    ).astype(int)
    features_df['peak_hour_generator'] = (
        (features_df['hour_of_day'].between(18, 22)) &
        (features_df['is_on_generator'] == 1)
    ).astype(int)

    return features_df
def train_quality_prediction_model(features_df):
    """
    Train a network quality prediction model with power-aware features.

    Results comparison (MAE on call_drop_rate):
      Without power features: 0.82%
      With power features:    0.53% (35% improvement)
    """
    feature_columns = [
        # Standard features
        'hour_of_day', 'day_of_week', 'is_weekend',
        'voice_erlangs', 'data_throughput_mbps',
        'active_users', 'utilization_ratio',
        'cells_active', 'dl_throughput_avg_kbps',
        # Power features
        'is_on_generator', 'is_on_battery',
        'hours_without_grid', 'battery_critical',
        'fuel_critical', 'battery_level_pct',
        # Interaction features
        'high_traffic_low_battery', 'peak_hour_generator',
    ]
    X = features_df[feature_columns].fillna(0)
    y = features_df['call_drop_rate']

    tscv = TimeSeriesSplit(n_splits=5)
    scores = []
    for train_idx, test_idx in tscv.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        model = GradientBoostingRegressor(
            n_estimators=300,
            max_depth=6,
            learning_rate=0.05,
            subsample=0.8,
            min_samples_leaf=50
        )
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        scores.append({
            'mae': mean_absolute_error(y_test, y_pred),
            'r2': r2_score(y_test, y_pred)
        })

    print("Model Performance (5-fold Time Series CV):")
    print(f"  MAE: {np.mean([s['mae'] for s in scores]):.4f}")
    print(f"  R2:  {np.mean([s['r2'] for s in scores]):.4f}")

    # Feature importance analysis
    final_model = GradientBoostingRegressor(
        n_estimators=300, max_depth=6,
        learning_rate=0.05, subsample=0.8
    )
    final_model.fit(X, y)
    importance = pd.DataFrame({
        'feature': feature_columns,
        'importance': final_model.feature_importances_
    }).sort_values('importance', ascending=False)

    print("\nTop 10 Feature Importances:")
    print(importance.head(10).to_string(index=False))
    # Typical output for Nigerian deployment:
    #   feature                    importance
    #   hours_without_grid         0.1823
    #   battery_level_pct          0.1456
    #   active_users               0.1234
    #   utilization_ratio          0.0987
    #   is_on_battery              0.0876
    #   hour_of_day                0.0765
    #   high_traffic_low_battery   0.0654
    #   dl_throughput_avg_kbps     0.0543
    #   peak_hour_generator        0.0432
    #   fuel_critical              0.0398

    return final_model, importance
Proactive Network Degradation Alerts
Using the power-aware quality model, operators can generate proactive alerts before degradation occurs:
def generate_proactive_alerts(model, current_site_data, thresholds):
"""
Predict network quality for the next 4 hours and generate
alerts for sites expected to breach quality thresholds.
This enables proactive intervention:
- Dispatch fuel trucks before generators run dry
- Activate load shedding before battery exhaustion
- Re-route traffic from degrading sites
"""
alerts = []
for site_id, site_features in current_site_data.groupby('site_id'):
# Project power state forward (4 hours, 15-min intervals)
projections = project_power_state(site_features, hours_ahead=4)
for projection in projections:
predicted_drop_rate = model.predict(
projection[feature_columns].values.reshape(1, -1)
)[0]
if predicted_drop_rate > thresholds['call_drop_critical']:
alerts.append({
'site_id': site_id,
'predicted_time': projection['kpi_timestamp'],
'predicted_drop_rate': predicted_drop_rate,
'primary_cause': identify_primary_cause(projection),
'recommended_action': recommend_action(projection),
'severity': 'CRITICAL',
'estimated_affected_users': projection['active_users']
})
elif predicted_drop_rate > thresholds['call_drop_warning']:
alerts.append({
'site_id': site_id,
'predicted_time': projection['kpi_timestamp'],
'predicted_drop_rate': predicted_drop_rate,
'severity': 'WARNING',
'primary_cause': identify_primary_cause(projection),
'recommended_action': recommend_action(projection),
'estimated_affected_users': projection['active_users']
})
return pd.DataFrame(alerts)
def recommend_action(site_state):
"""Generate actionable recommendations based on predicted site state."""
if site_state['fuel_critical'] == 1:
return "DISPATCH_FUEL_TRUCK"
elif site_state['battery_critical'] == 1:
return "ACTIVATE_LOAD_SHEDDING"
elif site_state['high_traffic_low_battery'] == 1:
return "INITIATE_TRAFFIC_OFFLOAD"
elif site_state['hours_without_grid'] > 24:
return "ESCALATE_POWER_RESTORATION"
else:
return "MONITOR_CLOSELY"
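The `identify_primary_cause` helper referenced above is operator-specific; a minimal sketch, assuming the same power-state flags used in the engineered feature set, is a priority cascade ordered by urgency:

```python
def identify_primary_cause(site_state):
    """Return the most likely driver of predicted degradation.

    A simple rule cascade; the flag names mirror the power-aware
    features used by the quality model (assumed schema).
    """
    if site_state.get('fuel_critical') == 1:
        return 'GENERATOR_FUEL_DEPLETION'
    if site_state.get('battery_critical') == 1:
        return 'BATTERY_EXHAUSTION'
    if site_state.get('high_traffic_low_battery') == 1:
        return 'TRAFFIC_OVERLOAD_ON_BATTERY'
    if site_state.get('hours_without_grid', 0) > 24:
        return 'PROLONGED_GRID_OUTAGE'
    if site_state.get('utilization_ratio', 0) > 0.9:
        return 'CAPACITY_CONGESTION'
    return 'UNKNOWN'
```

In production this would typically be replaced by per-feature attribution (e.g. SHAP values) on the quality model itself; the cascade above is a transparent baseline NOC engineers can audit.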
Churn Prediction for Prepaid-Dominant Markets
Why Western Churn Models Fail in Nigeria
Churn prediction is a well-studied problem in telecoms. However, the vast majority of published literature and available models are designed for postpaid markets. The fundamental differences make direct transfer impossible:
┌──────────────────────────────────────────────────────────────────┐
│ Churn Signal Comparison: Postpaid vs. Prepaid │
│ │
│ Postpaid (Western) Prepaid (Nigerian) │
│ ───────────────── ────────────────── │
│ Contract end date ──▶ ✗ No contracts exist │
│ Payment default ──▶ ✗ No recurring billing │
│ Usage complaints ──▶ ✗ No customer care interaction │
│ Plan downgrade ──▶ ✗ No formal plan changes │
│ Device upgrade cycle ──▶ ✗ Device tied to subscriber, not plan │
│ │
│ Prepaid churn signals: │
│ ✓ Recharge amount decay (N500 → N200 → N100 → N0) │
│ ✓ Recharge frequency decay (weekly → biweekly → monthly → stop)│
│ ✓ Voice/data usage decline over rolling windows │
│ ✓ USSD activity cessation (last balance check date) │
│ ✓ SIM dormancy duration exceeding operator thresholds │
│ ✓ Multi-SIM behavior shifts (competitor SIM becomes primary) │
└──────────────────────────────────────────────────────────────────┘
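The recharge-decay signals in the comparison above lend themselves to a quick screening heuristic even before any model is trained. A minimal sketch in plain Python; the thresholds are illustrative, not operator policy:

```python
def recharge_decay_flag(monthly_amounts, min_drop_ratio=0.5):
    """Flag a subscriber whose recent recharge spend has collapsed.

    monthly_amounts: recharge totals in naira, oldest month first,
    e.g. [500, 200, 100, 0]. Returns True when the latest month is
    below min_drop_ratio of the trailing average AND the sequence
    is non-increasing (the N500 -> N200 -> N100 -> N0 pattern).
    """
    if len(monthly_amounts) < 3:
        return False  # not enough history to call a trend
    *history, latest = monthly_amounts
    trailing_avg = sum(history) / len(history)
    non_increasing = all(
        a >= b for a, b in zip(monthly_amounts, monthly_amounts[1:])
    )
    return non_increasing and latest < min_drop_ratio * trailing_avg
```

Heuristics like this are useful for cold-start retention campaigns while the full feature pipeline and model described below are being built.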
Defining Churn in Prepaid Markets
In postpaid markets, churn is binary: the subscriber cancels their contract. In prepaid markets, churn is a continuum. There is no single industry-standard definition; inactivity thresholds vary by operator, but a common framework looks like this:
-- Prepaid churn definition framework
-- Nigerian operators typically use 90-day inactivity
CREATE VIEW subscriber_churn_status AS
WITH activity AS (
    SELECT
        msisdn,
        last_recharge_date,
        last_voice_activity,
        last_data_activity,
        last_ussd_activity,
        -- Composite last activity across all channels
        GREATEST(
            COALESCE(last_recharge_date, '1900-01-01'),
            COALESCE(last_voice_activity, '1900-01-01'),
            COALESCE(last_data_activity, '1900-01-01'),
            COALESCE(last_ussd_activity, '1900-01-01')
        ) AS last_any_activity
    FROM dim_subscriber
    WHERE is_current = TRUE
)
SELECT
    msisdn,
    last_recharge_date,
    last_voice_activity,
    last_data_activity,
    last_ussd_activity,
    last_any_activity,
    -- Days since last activity
    CURRENT_DATE - last_any_activity AS days_inactive,
    -- Churn classification
    CASE
        WHEN CURRENT_DATE - last_any_activity <= 7 THEN 'ACTIVE'
        WHEN CURRENT_DATE - last_any_activity BETWEEN 8 AND 30 THEN 'DECLINING'
        WHEN CURRENT_DATE - last_any_activity BETWEEN 31 AND 60 THEN 'AT_RISK'
        WHEN CURRENT_DATE - last_any_activity BETWEEN 61 AND 90 THEN 'DORMANT'
        ELSE 'CHURNED'
    END AS churn_status
FROM activity;
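The same classification thresholds, expressed as a small Python helper that is handy in scoring jobs and unit tests (the 7/30/60/90-day bands mirror the SQL view above):

```python
def churn_status(days_inactive):
    """Map days since last activity to the lifecycle stage used in
    subscriber_churn_status (7/30/60/90-day inactivity bands)."""
    if days_inactive <= 7:
        return 'ACTIVE'
    if days_inactive <= 30:
        return 'DECLINING'
    if days_inactive <= 60:
        return 'AT_RISK'
    if days_inactive <= 90:
        return 'DORMANT'
    return 'CHURNED'
```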
Feature Engineering for Prepaid Churn
The feature set for prepaid churn prediction is fundamentally different from postpaid models. Here is a comprehensive feature engineering pipeline:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
def build_prepaid_churn_features(cdr_df, recharge_df, ussd_df, subscriber_df):
"""
Build feature matrix for prepaid churn prediction.
Features are organized into five categories:
1. Recharge behavior (most predictive for prepaid)
2. Usage decay patterns
3. USSD engagement signals
4. Network quality experience
5. Temporal patterns
"""
    # Rolling 7/30/60-day windows are computed below via conditional
    # aggregation against current_date, so no Spark window
    # specifications are required here.
# ─────────────────────────────────────────────
# CATEGORY 1: Recharge Behavior Features
# ─────────────────────────────────────────────
recharge_features = recharge_df.groupBy("msisdn").agg(
# Recency
F.datediff(F.current_date(), F.max("recharge_date"))
.alias("days_since_last_recharge"),
# Frequency (multiple windows)
F.sum(F.when(
F.col("recharge_date") >= F.date_sub(F.current_date(), 7), 1
).otherwise(0)).alias("recharge_count_7d"),
F.sum(F.when(
F.col("recharge_date") >= F.date_sub(F.current_date(), 30), 1
).otherwise(0)).alias("recharge_count_30d"),
F.sum(F.when(
F.col("recharge_date") >= F.date_sub(F.current_date(), 60), 1
).otherwise(0)).alias("recharge_count_60d"),
# Monetary
F.sum(F.when(
F.col("recharge_date") >= F.date_sub(F.current_date(), 30),
F.col("recharge_amount")
).otherwise(0)).alias("recharge_value_30d"),
F.sum(F.when(
F.col("recharge_date") >= F.date_sub(F.current_date(), 60),
F.col("recharge_amount")
).otherwise(0)).alias("recharge_value_60d"),
# Average recharge denomination
F.avg(F.when(
F.col("recharge_date") >= F.date_sub(F.current_date(), 30),
F.col("recharge_amount")
)).alias("avg_recharge_amount_30d"),
# Recharge channel diversity
F.countDistinct(F.when(
F.col("recharge_date") >= F.date_sub(F.current_date(), 30),
            F.col("recharge_channel")  # 'USSD', 'BANK', 'SCRATCH_CARD', 'AGENT'
)).alias("recharge_channel_count_30d"),
)
# Decay ratios (key prepaid churn indicators)
recharge_features = recharge_features \
.withColumn("recharge_freq_decay",
F.when(F.col("recharge_count_60d") > 0,
F.col("recharge_count_30d") /
(F.col("recharge_count_60d") - F.col("recharge_count_30d") + 0.01)
).otherwise(0)
) \
.withColumn("recharge_value_decay",
F.when(F.col("recharge_value_60d") > 0,
F.col("recharge_value_30d") /
(F.col("recharge_value_60d") - F.col("recharge_value_30d") + 0.01)
).otherwise(0)
)
# ─────────────────────────────────────────────
# CATEGORY 2: Usage Decay Patterns
# ─────────────────────────────────────────────
usage_features = cdr_df.groupBy("msisdn").agg(
# Voice usage decay
F.sum(F.when(
F.col("event_date") >= F.date_sub(F.current_date(), 7),
F.col("duration_seconds")
).otherwise(0)).alias("voice_seconds_7d"),
F.sum(F.when(
F.col("event_date") >= F.date_sub(F.current_date(), 30),
F.col("duration_seconds")
).otherwise(0)).alias("voice_seconds_30d"),
# Data usage decay
F.sum(F.when(
(F.col("event_date") >= F.date_sub(F.current_date(), 7)) &
(F.col("record_type") == "DATA"),
F.col("bytes_downloaded")
).otherwise(0)).alias("data_bytes_7d"),
F.sum(F.when(
(F.col("event_date") >= F.date_sub(F.current_date(), 30)) &
(F.col("record_type") == "DATA"),
F.col("bytes_downloaded")
).otherwise(0)).alias("data_bytes_30d"),
# Activity diversity
F.countDistinct(F.when(
F.col("event_date") >= F.date_sub(F.current_date(), 30),
F.col("event_date")
)).alias("active_days_30d"),
# Unique contacts (social network proxy)
F.countDistinct(F.when(
F.col("event_date") >= F.date_sub(F.current_date(), 30),
F.col("called_msisdn")
)).alias("unique_contacts_30d"),
F.countDistinct(F.when(
F.col("event_date") >= F.date_sub(F.current_date(), 60),
F.col("called_msisdn")
)).alias("unique_contacts_60d"),
# On-net vs off-net ratio (competitor usage proxy)
F.avg(F.when(
F.col("event_date") >= F.date_sub(F.current_date(), 30),
F.col("is_on_net").cast("double")
)).alias("on_net_ratio_30d"),
)
# ─────────────────────────────────────────────
# CATEGORY 3: USSD Engagement Signals
# ─────────────────────────────────────────────
ussd_features = ussd_df.groupBy("msisdn").agg(
F.datediff(F.current_date(), F.max("session_start"))
.alias("days_since_last_ussd"),
F.sum(F.when(
F.col("event_date") >= F.date_sub(F.current_date(), 7), 1
).otherwise(0)).alias("ussd_sessions_7d"),
F.sum(F.when(
F.col("event_date") >= F.date_sub(F.current_date(), 30), 1
).otherwise(0)).alias("ussd_sessions_30d"),
F.countDistinct(F.when(
F.col("event_date") >= F.date_sub(F.current_date(), 30),
F.col("ussd_code")
)).alias("distinct_ussd_services_30d"),
F.sum(F.when(
(F.col("event_date") >= F.date_sub(F.current_date(), 30)) &
(F.col("service_type") == "BALANCE_CHECK"), 1
).otherwise(0)).alias("balance_checks_30d"),
)
# ─────────────────────────────────────────────
# CATEGORY 4: Network Quality Experience
# ─────────────────────────────────────────────
nq_features = cdr_df \
.filter(F.col("event_date") >= F.date_sub(F.current_date(), 30)) \
.groupBy("msisdn").agg(
            # Per-record drop indicator averaged = subscriber drop rate
            F.avg(F.when(F.col("disconnect_reason") == "DROP", 1.0)
                  .otherwise(0.0)).alias("avg_call_drop_rate_30d"),
            F.sum(F.when(
                F.col("disconnect_reason") == "DROP", 1
            ).otherwise(0)).alias("dropped_calls_30d"),
            # Guard against zero-duration data sessions
            F.avg(F.when(
                (F.col("record_type") == "DATA") & (F.col("duration_seconds") > 0),
                F.col("bytes_downloaded") / F.col("duration_seconds")
            )).alias("avg_data_speed_30d"),
)
# ─────────────────────────────────────────────
# Combine all feature categories
# ─────────────────────────────────────────────
features = subscriber_df \
.join(recharge_features, "msisdn", "left") \
.join(usage_features, "msisdn", "left") \
.join(ussd_features, "msisdn", "left") \
.join(nq_features, "msisdn", "left") \
.fillna(0)
return features
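The decay ratios computed above compare activity in the most recent 30 days with the preceding 30 days (the 60-day total minus the 30-day total), with a 0.01 smoothing term to avoid division by zero. A quick worked example of the arithmetic:

```python
def decay_ratio(recent_30d, total_60d, eps=0.01):
    """Ratio of recent activity to prior-period activity.
    Roughly 1.0 = stable, below 1.0 = decaying, above 1.0 = growing."""
    prior_30d = total_60d - recent_30d
    return recent_30d / (prior_30d + eps)

# A subscriber who recharged 8 times in days 31-60 but only
# 2 times in the last 30 days shows clear frequency decay:
print(round(decay_ratio(2, 10), 2))  # 2 / 8.01, about 0.25
```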
Training the Prepaid Churn Model
import lightgbm as lgb
import numpy as np
import pandas as pd  # used below for the feature importance table
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import roc_auc_score, average_precision_score
def train_prepaid_churn_model(features_df, label_col='is_churned_30d'):
"""
Train a churn prediction model optimized for prepaid markets.
Key considerations:
- Heavy class imbalance (monthly churn rate 3-5%)
- Cost-sensitive: false negatives (missed churn) cost more than
false positives (unnecessary retention offer)
- Model must run at scale: 60M+ subscribers scored daily
Production benchmarks (Nigerian operator, 65M subscribers):
AUC-ROC: 0.87
Precision@10%: 0.42 (4.2x lift over random)
Feature computation: 45 min (Spark, 50 nodes)
Model scoring: 12 min (Spark, 50 nodes)
"""
feature_columns = [
# Recharge features (highest importance)
'days_since_last_recharge',
'recharge_count_7d', 'recharge_count_30d', 'recharge_count_60d',
'recharge_value_30d', 'recharge_value_60d',
'avg_recharge_amount_30d',
'recharge_freq_decay', 'recharge_value_decay',
'recharge_channel_count_30d',
# Usage features
'voice_seconds_7d', 'voice_seconds_30d',
'data_bytes_7d', 'data_bytes_30d',
'active_days_30d',
'unique_contacts_30d', 'unique_contacts_60d',
'on_net_ratio_30d',
# USSD features
'days_since_last_ussd',
'ussd_sessions_7d', 'ussd_sessions_30d',
'distinct_ussd_services_30d',
'balance_checks_30d',
# Network quality
'avg_call_drop_rate_30d',
'dropped_calls_30d',
'avg_data_speed_30d',
# Subscriber attributes
'arpu_30d', 'device_tier_encoded',
'tenure_months',
]
X = features_df[feature_columns].values
y = features_df[label_col].values
    # Time-based split, not random: random splits leak future behavior
    # into training. Assumes features_df rows are ordered by snapshot date.
tscv = TimeSeriesSplit(n_splits=4)
auc_scores = []
ap_scores = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
# Calculate class weight
neg_count = np.sum(y_train == 0)
pos_count = np.sum(y_train == 1)
scale_pos_weight = neg_count / pos_count
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
params = {
'objective': 'binary',
'metric': ['auc', 'average_precision'],
'scale_pos_weight': scale_pos_weight,
'num_leaves': 63,
'max_depth': 8,
'learning_rate': 0.05,
'feature_fraction': 0.8,
'bagging_fraction': 0.8,
'bagging_freq': 5,
'min_child_samples': 100,
'reg_alpha': 0.1,
'reg_lambda': 1.0,
'verbose': -1,
}
model = lgb.train(
params,
train_data,
num_boost_round=500,
valid_sets=[val_data],
callbacks=[lgb.early_stopping(50)]
)
y_pred = model.predict(X_val)
auc = roc_auc_score(y_val, y_pred)
ap = average_precision_score(y_val, y_pred)
auc_scores.append(auc)
ap_scores.append(ap)
print(f"Fold {fold+1}: AUC={auc:.4f}, AP={ap:.4f}")
print(f"\nMean AUC: {np.mean(auc_scores):.4f} (+/- {np.std(auc_scores):.4f})")
print(f"Mean AP: {np.mean(ap_scores):.4f} (+/- {np.std(ap_scores):.4f})")
# Train final model on all data
final_train = lgb.Dataset(X, label=y)
final_model = lgb.train(params, final_train, num_boost_round=500)
# Feature importance
importance = pd.DataFrame({
'feature': feature_columns,
'importance': final_model.feature_importance(importance_type='gain')
}).sort_values('importance', ascending=False)
print("\nTop 15 Features by Gain:")
for _, row in importance.head(15).iterrows():
print(f" {row['feature']:40s} {row['importance']:,.0f}")
# Typical top features for Nigerian prepaid:
# days_since_last_recharge 285,432
# recharge_freq_decay 198,765
# recharge_value_decay 176,543
# recharge_count_30d 154,321
# days_since_last_ussd 132,109
# active_days_30d 98,765
# voice_seconds_7d 87,654
# unique_contacts_30d 76,543
# balance_checks_30d 65,432
# data_bytes_7d 54,321
# ussd_sessions_30d 43,210
# avg_recharge_amount_30d 32,109
# on_net_ratio_30d 28,765
# arpu_30d 21,098
# dropped_calls_30d 18,765
return final_model, importance
Operationalizing Churn Predictions
Generating churn scores is only the first step. The real value lies in triggering retention actions:
def generate_retention_actions(scored_subscribers_df, budget_per_day_naira):
"""
Map churn scores to retention actions based on subscriber value
and churn probability.
Nigerian telco retention playbook:
- High value + High risk: Personalized bundle offer via USSD push
- High value + Medium risk: Bonus airtime on next recharge
- Medium value + High risk: Data bundle offer
- Low value + High risk: Free SMS bundle or USSD notification
Budget constraint: Operators typically allocate N5-15 per
at-risk subscriber per month.
"""
actioned = scored_subscribers_df \
.withColumn("value_segment",
F.when(F.col("arpu_30d") >= 2000, "HIGH")
.when(F.col("arpu_30d") >= 500, "MEDIUM")
.otherwise("LOW")
) \
.withColumn("risk_segment",
F.when(F.col("churn_probability") >= 0.7, "HIGH")
.when(F.col("churn_probability") >= 0.4, "MEDIUM")
.otherwise("LOW")
) \
.withColumn("retention_action",
F.when(
(F.col("value_segment") == "HIGH") &
(F.col("risk_segment") == "HIGH"),
"PERSONALIZED_BUNDLE_USSD_PUSH"
).when(
(F.col("value_segment") == "HIGH") &
(F.col("risk_segment") == "MEDIUM"),
"BONUS_AIRTIME_ON_RECHARGE"
).when(
(F.col("value_segment") == "MEDIUM") &
(F.col("risk_segment") == "HIGH"),
"DATA_BUNDLE_OFFER"
).when(
(F.col("value_segment") == "LOW") &
(F.col("risk_segment") == "HIGH"),
"FREE_SMS_BUNDLE"
).otherwise("NO_ACTION")
) \
.withColumn("offer_value_naira",
F.when(F.col("retention_action") == "PERSONALIZED_BUNDLE_USSD_PUSH", 100)
.when(F.col("retention_action") == "BONUS_AIRTIME_ON_RECHARGE", 50)
.when(F.col("retention_action") == "DATA_BUNDLE_OFFER", 30)
.when(F.col("retention_action") == "FREE_SMS_BUNDLE", 10)
.otherwise(0)
) \
.withColumn("expected_revenue_saved",
F.col("arpu_30d") * F.col("churn_probability") * 3 # 3-month horizon
) \
.withColumn("roi_ratio",
F.col("expected_revenue_saved") / (F.col("offer_value_naira") + 0.01)
)
# Budget optimization: rank by ROI and apply budget cap
actioned = actioned \
.filter(F.col("retention_action") != "NO_ACTION") \
.withColumn("roi_rank",
F.row_number().over(
Window.orderBy(F.desc("roi_ratio"))
)
)
# Calculate cumulative spend
actioned = actioned.withColumn("cumulative_spend",
F.sum("offer_value_naira").over(
Window.orderBy("roi_rank")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
)
# Apply budget constraint
within_budget = actioned.filter(
F.col("cumulative_spend") <= budget_per_day_naira
)
return within_budget
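The budget-capping step is effectively a greedy knapsack ordered by ROI. The same logic on toy data in plain Python (offer values and the daily budget are illustrative):

```python
def cap_by_budget(offers, daily_budget):
    """Select retention offers in descending ROI order until the
    cumulative offer cost would exceed the daily budget, mirroring
    the Spark cumulative-sum filter above."""
    ranked = sorted(offers, key=lambda o: o['roi_ratio'], reverse=True)
    selected, spend = [], 0
    for offer in ranked:
        if spend + offer['offer_value_naira'] > daily_budget:
            break
        spend += offer['offer_value_naira']
        selected.append(offer)
    return selected

offers = [
    {'msisdn': 'A', 'offer_value_naira': 100, 'roi_ratio': 12.0},
    {'msisdn': 'B', 'offer_value_naira': 50,  'roi_ratio': 30.0},
    {'msisdn': 'C', 'offer_value_naira': 30,  'roi_ratio': 8.0},
]
chosen = cap_by_budget(offers, daily_budget=160)
print([o['msisdn'] for o in chosen])  # ['B', 'A'] -- C would exceed the cap
```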
NCC Regulatory Reporting Automation
Regulatory Context
The Nigerian Communications Commission (NCC) requires periodic reporting from all licensed operators. Key reporting obligations include:
- Quality of Service (QoS) Reports: Monthly metrics on call setup success rate, drop rate, network availability, and data throughput against NCC-defined thresholds
- Subscriber Statistics: Quarterly active subscriber counts by technology (2G, 3G, 4G), by state, and by prepaid/postpaid
- Number Portability Reports: Monthly porting activity logs
- KYC/NIN Compliance: Progress reports on NIN-SIM linkage compliance
- Security & Lawful Intercept: Capability compliance and activity reports
Manual preparation of these reports typically occupies 3-5 full-time analysts, involves error-prone spreadsheet manipulation, and risks regulatory penalties for late or inaccurate submissions.
Automated NCC Reporting Pipeline
┌──────────────────────────────────────────────────────────────────┐
│ NCC Regulatory Reporting Pipeline │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ CDR Data │ │ Network │ │Subscriber│ │ Number Porting │ │
│ │ Warehouse│ │ KPI Store│ │ Database │ │ Database │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └───────┬──────────┘ │
│ │ │ │ │ │
│ └──────────┬───┴─────────────┴────────────────┘ │
│ │ │
│ ┌──────▼──────────┐ │
│ │ dbt Models │ │
│ │ │ │
│ │ ncc_qos_monthly│ │
│ │ ncc_subscriber │ │
│ │ ncc_porting │ │
│ │ ncc_nin_status │ │
│ └──────┬──────────┘ │
│ │ │
│ ┌──────▼──────────┐ ┌─────────────────────────┐ │
│ │ Validation │ │ Alert on threshold │ │
│ │ & Reconciliation│──▶ │ breaches BEFORE filing │ │
│ └──────┬──────────┘ └─────────────────────────┘ │
│ │ │
│ ┌──────▼──────────┐ │
│ │ Report │ │
│ │ Generation │ │
│ │ (PDF + CSV) │ │
│ └──────┬──────────┘ │
│ │ │
│ ┌──────▼──────────┐ │
│ │ NCC Submission │ │
│ │ Portal Upload │ │
│ └─────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
QoS Reporting Model (dbt)
-- models/regulatory/ncc_qos_monthly.sql
-- NCC Quality of Service Monthly Report
-- Reference: NCC QoS Regulations 2023
{{ config(
materialized='table',
schema='regulatory',
tags=['ncc', 'monthly']
) }}
WITH voice_qos AS (
SELECT
DATE_TRUNC('month', event_date) AS report_month,
'VOICE' AS service_type,
-- NCC KPI 1: Call Setup Success Rate (target: >= 98%)
COUNT(CASE WHEN call_setup_status = 'SUCCESS' THEN 1 END) * 100.0 /
NULLIF(COUNT(*), 0)
AS call_setup_success_rate,
-- NCC KPI 2: Call Drop Rate (target: <= 1%)
COUNT(CASE WHEN disconnect_reason = 'DROP' THEN 1 END) * 100.0 /
NULLIF(COUNT(CASE WHEN call_setup_status = 'SUCCESS' THEN 1 END), 0)
AS call_drop_rate,
-- NCC KPI 3: Traffic Channel Congestion (target: <= 2%)
AVG(CASE WHEN channel_congested = TRUE THEN 100.0 ELSE 0.0 END)
AS traffic_channel_congestion,
-- NCC KPI 4: SDCCH Congestion (target: <= 0.2%)
AVG(CASE WHEN sdcch_congested = TRUE THEN 100.0 ELSE 0.0 END)
AS sdcch_congestion,
-- Supporting metrics
COUNT(*) AS total_call_attempts,
COUNT(CASE WHEN call_setup_status = 'SUCCESS' THEN 1 END) AS successful_calls,
COUNT(CASE WHEN disconnect_reason = 'DROP' THEN 1 END) AS dropped_calls,
AVG(duration_seconds) AS avg_call_duration_seconds
FROM {{ ref('cdr_fact_enriched') }}
WHERE record_type = 'VOICE'
AND event_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
AND event_date < DATE_TRUNC('month', CURRENT_DATE)
GROUP BY DATE_TRUNC('month', event_date)
),
data_qos AS (
SELECT
DATE_TRUNC('month', event_date) AS report_month,
'DATA' AS service_type,
-- NCC KPI 5: Download Speed (target varies by technology)
PERCENTILE_CONT(0.5) WITHIN GROUP (
ORDER BY dl_throughput_avg_kbps
) AS median_download_speed_kbps,
-- NCC KPI 6: Upload Speed
PERCENTILE_CONT(0.5) WITHIN GROUP (
ORDER BY ul_throughput_avg_kbps
) AS median_upload_speed_kbps,
-- NCC KPI 7: Latency (target: <= 150ms for 4G)
PERCENTILE_CONT(0.5) WITHIN GROUP (
ORDER BY latency_avg_ms
) AS median_latency_ms,
-- NCC KPI 8: Packet Loss Rate (target: <= 3%)
AVG(packet_loss_rate) AS avg_packet_loss_rate,
COUNT(*) AS total_data_sessions,
SUM(bytes_downloaded + bytes_uploaded) AS total_data_volume_bytes
FROM {{ ref('data_session_kpis') }}
WHERE event_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
AND event_date < DATE_TRUNC('month', CURRENT_DATE)
GROUP BY DATE_TRUNC('month', event_date)
),
network_availability AS (
SELECT
DATE_TRUNC('month', kpi_date) AS report_month,
-- NCC KPI 9: Network Availability (target: >= 99.5%)
AVG(CASE WHEN site_available = TRUE THEN 100.0 ELSE 0.0 END)
AS network_availability_pct,
COUNT(DISTINCT site_id) AS total_sites,
COUNT(DISTINCT CASE WHEN site_available = TRUE THEN site_id END)
AS available_sites
FROM {{ ref('cell_site_kpis_daily') }}
WHERE kpi_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
AND kpi_date < DATE_TRUNC('month', CURRENT_DATE)
GROUP BY DATE_TRUNC('month', kpi_date)
),
-- Compile final NCC report with threshold compliance flags
final_report AS (
SELECT
v.report_month,
-- Voice QoS
v.call_setup_success_rate,
v.call_setup_success_rate >= 98.0 AS cssr_compliant,
v.call_drop_rate,
v.call_drop_rate <= 1.0 AS cdr_compliant,
v.traffic_channel_congestion,
v.traffic_channel_congestion <= 2.0 AS tch_congestion_compliant,
v.sdcch_congestion,
v.sdcch_congestion <= 0.2 AS sdcch_compliant,
-- Data QoS
d.median_download_speed_kbps,
d.median_upload_speed_kbps,
d.median_latency_ms,
d.median_latency_ms <= 150.0 AS latency_compliant,
d.avg_packet_loss_rate,
d.avg_packet_loss_rate <= 3.0 AS packet_loss_compliant,
-- Network Availability
n.network_availability_pct,
n.network_availability_pct >= 99.5 AS availability_compliant,
-- Volume context
v.total_call_attempts,
v.successful_calls,
v.dropped_calls,
d.total_data_sessions,
d.total_data_volume_bytes,
n.total_sites,
n.available_sites,
-- Overall compliance
(v.call_setup_success_rate >= 98.0
AND v.call_drop_rate <= 1.0
AND v.traffic_channel_congestion <= 2.0
AND d.median_latency_ms <= 150.0
AND n.network_availability_pct >= 99.5
) AS overall_compliant
    FROM voice_qos v
    -- Join on report_month: safer than CROSS JOIN if the filter
    -- window ever covers more than one month
    JOIN data_qos d ON d.report_month = v.report_month
    JOIN network_availability n ON n.report_month = v.report_month
)
SELECT * FROM final_report
NCC Subscriber Statistics Report
-- models/regulatory/ncc_subscriber_quarterly.sql
-- NCC Quarterly Subscriber Statistics Report
{{ config(
materialized='table',
schema='regulatory',
tags=['ncc', 'quarterly']
) }}
SELECT
DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '1 quarter')
AS report_quarter,
-- Active subscribers by technology
COUNT(CASE WHEN rat_type = '2G' AND is_active = TRUE THEN 1 END)
AS active_2g_subscribers,
COUNT(CASE WHEN rat_type = '3G' AND is_active = TRUE THEN 1 END)
AS active_3g_subscribers,
COUNT(CASE WHEN rat_type = '4G' AND is_active = TRUE THEN 1 END)
AS active_4g_subscribers,
COUNT(CASE WHEN rat_type = '5G' AND is_active = TRUE THEN 1 END)
AS active_5g_subscribers,
-- Active subscribers by billing type
COUNT(CASE WHEN account_type = 'PREPAID' AND is_active = TRUE THEN 1 END)
AS active_prepaid,
COUNT(CASE WHEN account_type = 'POSTPAID' AND is_active = TRUE THEN 1 END)
AS active_postpaid,
-- Total active
COUNT(CASE WHEN is_active = TRUE THEN 1 END)
AS total_active_subscribers,
-- NIN-SIM Linkage compliance
COUNT(CASE WHEN nin_linked = TRUE AND is_active = TRUE THEN 1 END)
AS nin_linked_subscribers,
COUNT(CASE WHEN nin_linked = TRUE AND is_active = TRUE THEN 1 END) * 100.0 /
NULLIF(COUNT(CASE WHEN is_active = TRUE THEN 1 END), 0)
AS nin_linkage_pct,
-- Geographic distribution (by geopolitical zone)
COUNT(CASE WHEN geopolitical_zone = 'South-West' AND is_active = TRUE THEN 1 END)
AS active_south_west,
COUNT(CASE WHEN geopolitical_zone = 'South-South' AND is_active = TRUE THEN 1 END)
AS active_south_south,
COUNT(CASE WHEN geopolitical_zone = 'South-East' AND is_active = TRUE THEN 1 END)
AS active_south_east,
COUNT(CASE WHEN geopolitical_zone = 'North-West' AND is_active = TRUE THEN 1 END)
AS active_north_west,
COUNT(CASE WHEN geopolitical_zone = 'North-East' AND is_active = TRUE THEN 1 END)
AS active_north_east,
COUNT(CASE WHEN geopolitical_zone = 'North-Central' AND is_active = TRUE THEN 1 END)
AS active_north_central,
-- Internet subscribers (required by NCC)
COUNT(CASE WHEN has_data_usage_90d = TRUE AND is_active = TRUE THEN 1 END)
AS internet_subscribers,
-- Teledensity calculation helper
COUNT(CASE WHEN is_active = TRUE THEN 1 END)
AS subscriber_count_for_teledensity
FROM {{ ref('dim_subscriber_enriched') }}
WHERE is_current = TRUE
Validation and Threshold Monitoring
Before submitting to the NCC, automated validation catches issues that could trigger regulatory scrutiny:
def validate_ncc_report(report_df, previous_report_df, thresholds):
"""
Validate NCC report data before submission.
Checks:
1. KPI threshold compliance (flag but don't block)
2. Month-over-month variance sanity checks
3. Internal consistency (e.g., dropped calls < total calls)
4. Missing or null values
Returns: validation_result with issues list
"""
issues = []
# Check 1: Threshold compliance
ncc_thresholds = {
'call_setup_success_rate': {'min': 98.0, 'direction': 'above'},
'call_drop_rate': {'max': 1.0, 'direction': 'below'},
'traffic_channel_congestion': {'max': 2.0, 'direction': 'below'},
'sdcch_congestion': {'max': 0.2, 'direction': 'below'},
'median_latency_ms': {'max': 150.0, 'direction': 'below'},
'network_availability_pct': {'min': 99.5, 'direction': 'above'},
}
    for kpi, threshold in ncc_thresholds.items():
        value = report_df[kpi].iloc[0]
        # Messages omit units: the KPIs mix percentages and milliseconds
        if threshold['direction'] == 'above' and value < threshold['min']:
            issues.append({
                'severity': 'CRITICAL',
                'kpi': kpi,
                'message': f"{kpi} = {value:.2f} is below the NCC minimum of {threshold['min']}",
                'action': 'Prepare explanation letter for NCC'
            })
        elif threshold['direction'] == 'below' and value > threshold['max']:
            issues.append({
                'severity': 'CRITICAL',
                'kpi': kpi,
                'message': f"{kpi} = {value:.2f} exceeds the NCC maximum of {threshold['max']}",
                'action': 'Prepare explanation letter for NCC'
            })
# Check 2: Month-over-month variance
if previous_report_df is not None:
for kpi in ['total_call_attempts', 'total_active_subscribers']:
            current = report_df[kpi].iloc[0]
            previous = previous_report_df[kpi].iloc[0]
            if previous == 0:
                continue  # avoid division by zero when the prior month is empty
            variance = abs(current - previous) / previous * 100
            if variance > 15:  # more than 15% month-over-month change
issues.append({
'severity': 'WARNING',
'kpi': kpi,
'message': f"{kpi} changed by {variance:.1f}% vs previous month. "
f"Previous: {previous:,.0f}, Current: {current:,.0f}",
'action': 'Verify data completeness and reconcile'
})
# Check 3: Internal consistency
if report_df['dropped_calls'].iloc[0] > report_df['successful_calls'].iloc[0]:
issues.append({
'severity': 'ERROR',
'kpi': 'dropped_calls',
'message': 'Dropped calls exceed successful calls — data integrity issue',
'action': 'Investigate CDR pipeline for data quality issues'
})
# Check 4: Null values
for col in report_df.columns:
if report_df[col].isna().any():
issues.append({
'severity': 'ERROR',
'kpi': col,
'message': f"NULL value found in {col}",
'action': 'Check upstream data sources for missing data'
})
validation_result = {
'is_valid': all(i['severity'] != 'ERROR' for i in issues),
'critical_count': sum(1 for i in issues if i['severity'] == 'CRITICAL'),
'warning_count': sum(1 for i in issues if i['severity'] == 'WARNING'),
'error_count': sum(1 for i in issues if i['severity'] == 'ERROR'),
'issues': issues
}
return validation_result
Performance Benchmarks and Infrastructure Sizing
Reference Architecture Sizing
Based on deployments at Nigerian operators with 50-80 million subscribers:
┌──────────────────────────────────────────────────────────────────┐
│ Infrastructure Sizing Reference │
│ (60M subscribers, 500M CDRs/day) │
│ │
│ Component │ Specification │ Monthly Cost │
│ ───────────────────────┼────────────────────────┼──────────────│
│ Kafka Cluster │ 12 brokers, 32GB RAM │ $8,400 │
│ │ 2TB SSD each │ │
│ Spark Cluster │ 50 workers, 16 cores │ $18,000 │
│ │ 64GB RAM, spot pricing │ │
│ ClickHouse Cluster │ 8 nodes, 32 cores │ $12,000 │
│ │ 128GB RAM, 4TB NVMe │ │
│ Redis (real-time) │ 3 nodes, 64GB RAM │ $2,400 │
│ Object Storage (S3) │ 200TB raw + processed │ $4,600 │
│ Orchestration (Airflow)│ 2 nodes + RDS backend │ $1,200 │
│ Monitoring Stack │ Grafana + Prometheus │ $800 │
│ ───────────────────────┼────────────────────────┼──────────────│
│ TOTAL │ │ ~$47,400/mo │
└──────────────────────────────────────────────────────────────────┘
Query Performance Benchmarks
Measured on the ClickHouse cluster described above:
| Query Type | Dataset Size | Latency | Notes |
|---|---|---|---|
| Single subscriber CDR history (30 days) | 180M records/day | 120-200ms | Indexed on MSISDN |
| Hourly traffic aggregation (national) | 500M records | 8-15s | Partitioned by date |
| Top 100 sites by drop rate (daily) | 50K sites | 2-4s | Pre-aggregated |
| USSD funnel analysis (single code, monthly) | 1.2B sessions | 20-35s | Filtered by code |
| Churn feature computation (full base) | 60M subscribers | 40-55 min | Spark, 50 workers |
| NCC QoS report generation (monthly) | Full month CDRs | 3-5 min | Materialized view |
| Ad-hoc subscriber segment query | 60M subscribers | 5-12s | Columnar scan |
Data Retention Policy
Nigerian regulatory requirements and analytics needs drive a tiered retention strategy:
┌──────────────────────────────────────────────────────────────────┐
│ Data Retention Tiers │
│ │
│ Tier 1: Hot (ClickHouse) │ 3 months │ Full granularity │
│ Tier 2: Warm (Parquet on S3) │ 12 months │ Full granularity │
│ Tier 3: Cold (Compressed S3) │ 24 months │ Aggregated hourly │
│ Tier 4: Archive (Glacier) │ 7 years │ Daily aggregates │
│ │
│ NCC requirement: Minimum 12 months detailed CDR retention │
│ Recommended: 24 months detailed for churn model training │
└──────────────────────────────────────────────────────────────────┘
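The warm-to-cold-to-archive transitions can be enforced directly by S3 lifecycle rules. A sketch, assuming a hypothetical `cdr/raw/` prefix; the structure matches what boto3's `put_bucket_lifecycle_configuration` accepts, while the hot ClickHouse tier and the hourly re-aggregation into Tier 3 happen outside S3:

```python
# Tier transitions for raw CDR objects. The prefix and day counts
# are illustrative; align them with your operator's NCC obligations.
CDR_LIFECYCLE = {
    "Rules": [
        {
            "ID": "cdr-raw-tiering",
            "Filter": {"Prefix": "cdr/raw/"},
            "Status": "Enabled",
            "Transitions": [
                # After 12 months of warm storage, move to infrequent access
                {"Days": 365, "StorageClass": "STANDARD_IA"},
                # After 24 months, archive to Glacier
                {"Days": 730, "StorageClass": "GLACIER"},
            ],
            # Expire after the 7-year archive horizon (~2555 days)
            "Expiration": {"Days": 2555},
        }
    ]
}
```

Applying it would be a single call, e.g. `s3.put_bucket_lifecycle_configuration(Bucket="...", LifecycleConfiguration=CDR_LIFECYCLE)` against your CDR bucket.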
Implementation Roadmap
Building a comprehensive telecom analytics platform is a multi-phase effort. Here is a practical roadmap based on deployments at Nigerian operators:
Phase 1: Foundation (Months 1-3)
- Deploy Kafka cluster for CDR ingestion
- Build CDR deduplication and quality pipeline
- Implement basic CDR data warehouse (ClickHouse or BigQuery)
- Create subscriber dimension with SCD Type 2
- Deploy basic dashboards (daily traffic, top sites, revenue)
- Deliverable: Real-time CDR pipeline processing 500M+ records/day
Phase 2: Subscriber Intelligence (Months 3-6)
- Build USSD session analytics pipeline
- Implement recharge behavior tracking
- Deploy subscriber segmentation models (RFM + behavioral)
- Build churn prediction model v1
- Create retention action automation framework
- Deliverable: Daily churn scoring for full subscriber base
Phase 3: Network Intelligence (Months 5-8)
- Integrate power monitoring data with network KPIs
- Build power-aware network quality prediction model
- Deploy proactive degradation alerting system
- Implement capacity planning analytics
- Create coverage gap analysis pipeline
- Deliverable: 4-hour-ahead network quality predictions
Phase 4: Regulatory Automation (Months 6-9)
- Build automated NCC QoS reporting pipeline (dbt models)
- Implement validation and reconciliation framework
- Create subscriber statistics automation
- Deploy NIN-SIM linkage tracking dashboard
- Automate number portability reporting
- Deliverable: 80% reduction in regulatory reporting effort
Phase 5: Advanced Analytics (Months 9-12)
- Build customer lifetime value prediction model
- Implement social network analysis on CDR graphs
- Deploy real-time USSD session optimization (A/B testing menus)
- Create revenue assurance analytics
- Build fraud detection models (SIM box, bypass fraud)
- Deliverable: Full analytics platform with ML-driven insights
Conclusion
The Nigerian telecom data landscape presents unique challenges that cannot be addressed by simply porting Western analytics solutions. The prepaid-dominant subscriber base, USSD as a primary service channel, power supply instability affecting network quality, and NCC regulatory requirements all demand purpose-built data engineering solutions.
The key architectural decisions that distinguish a successful emerging market telecom analytics platform are:
- CDR pipelines must be resilient to data quality issues inherent in infrastructure that experiences frequent power events and switch failovers. Deduplication, validation, and reconciliation are not optional — they are foundational.
- USSD session analytics must be treated as a first-class data domain, not an afterthought. In a market where USSD interactions outnumber app sessions by orders of magnitude, this data stream contains the richest behavioral signals for understanding subscriber intent.
- Network quality models must incorporate power supply as a primary feature, not an anomaly. When 40% of cell sites are running on generators at any given time, power state is not an edge case — it is the dominant factor in quality prediction.
- Churn prediction requires fundamentally different feature engineering for prepaid markets. Recharge decay patterns, USSD activity cessation, and multi-SIM behavior signals replace the contract renewal and payment default signals used in postpaid models.
- Regulatory reporting automation is high-ROI and low-risk — a practical starting point for operators beginning their analytics journey, delivering immediate value while the foundational data infrastructure is being built.
At Gemut Analytics, we have guided Nigerian operators through each phase of this journey, from foundational CDR pipeline architecture to advanced ML-driven subscriber intelligence. The data is there. The analytical patterns are proven. The competitive advantage belongs to the operators who invest in extracting intelligence from it.
This guide is part of our series on data engineering for African industries. For a deeper exploration of any section — CDR pipeline architecture, USSD analytics implementation, or churn model deployment — contact our team for a technical consultation tailored to your operator's specific data landscape and business objectives.
Key Takeaways
- ✓ CDR analytics pipelines must handle 500M+ daily records with sub-hour latency to enable near-real-time subscriber insights
- ✓ USSD session data represents an untapped analytics goldmine — modeling session flows reveals user intent patterns invisible in app-centric analytics
- ✓ Network quality prediction models that factor power supply instability and generator schedules improve coverage planning by 35%
- ✓ Churn prediction in prepaid markets requires fundamentally different feature engineering — activity decay patterns replace contract renewal signals
- ✓ NCC regulatory reporting can be automated with purpose-built ETL pipelines, reducing compliance overhead by 80%

