End-to-End ELT Pipeline for a Video Streaming Platform

System Architecture Link to heading

Project Architecture Diagram

You can find the code for this project in GitHub. Link to heading

Project Overview Link to heading

This project implements an end-to-end ELT pipeline designed to analyze the performance of a video streaming platform using synthetic data. The pipeline simulates real-world platform behavior through datasets representing users, content, and viewing sessions.

The primary objective is to demonstrate the design and implementation of a modern data engineering workflow, including:

Data ingestion from heterogeneous sources
Integration of relational and NoSQL data models
Centralized storage in a data warehouse
Transformations and dimensional modeling for analytics

The final output enables analytical reporting and dashboarding to support data-driven insights into user behavior and content performance.

Data Sources Link to heading

Three synthetic datasets were created to emulate a video streaming platform:

users.csv
Contains user demographics and subscription details.
viewing_sessions.csv
Captures detailed viewing session information, including device type, session duration, and streaming quality.
content.json
Stores structured metadata for movies and TV series.

Data Modeling and Storage Strategy Link to heading

To reflect a realistic multi-source environment:

MongoDB (NoSQL source)
- content.json
PostgreSQL (Relational source)
- viewing_sessions.csv
- users.csv

The ELT pipeline extracts data from both MongoDB and PostgreSQL and loads it into PostgreSQL as a Data Warehouse. The warehouse is structured using multiple schemas:

raw schema
Stores ingested data in a normalized, source-aligned format.
trusted schema
Contains transformed, analytics-ready data modeled using dimensional design principles and created with dbt.

ELT Pipeline and Orchestration Link to heading

The pipeline follows an ELT approach:

Extract
Data is extracted from MongoDB and PostgreSQL source systems.
Load
Raw data is loaded into the PostgreSQL Data Warehouse under the raw schema.
Transform
dbt is used to apply transformations, enforce data quality rules, and create a dimensional model in the trusted schema.

The entire workflow is orchestrated using Apache Airflow, ensuring:

Task dependency management
Logging and observability
Error handling and retries
Idempotent executions

All components run in a Dockerized environment to ensure consistency and reproducibility.

DAG Overview Link to heading

Dag Diagram

The Airflow DAG video_streaming_elt_pipeline consists of the following key tasks:

Create MongoDB Collections: Initializes the MongoDB source database.
Create PostgreSQL Tables: Initializes the PostgreSQL source database.
Load JSON to MongoDB: Loads content.json into MongoDB.
Load CSVs to PostgreSQL: Loads users.csv and viewing_sessions.csv into PostgreSQL.
Extract to Data Warehouse: Extracts data from source systems and loads it into the raw schema of the PostgreSQL Data Warehouse.
Transform with dbt: Executes dbt models to transform raw data into a dimensional model in the trusted schema.

Analytics Dashboard Link to heading

First section:

Power BI Dashboard Screenshot

Second section: