Urban Mobility Data Warehouse for NYC Taxi Trips

Table of Contents Link to heading

Project Overview Link to heading

This project presents the design and implementation of a Data Warehouse (DW) in Google Cloud Platform (GCP) using BigQuery aimed at analyzing urban mobility patterns using New York City taxi trip data. The primary objective is to support efficient, data-driven decision-making by integrating, transforming, and organizing raw transactional data into a well-structured dimensional model optimized for analytical queries and dashboarding.

Dataset Description Link to heading

The dataset used in this project is the New York City Taxi Trips dataset available in BigQuery. The dataset contains information about taxi trips in New York City, including pickup and drop-off locations, trip distance, fare amount, total amount, passenger count, etc. For more information, please visit the NYC website.

Dimensional Modeling Link to heading

The Data Warehouse was designed following the star schema approach to ensure simplicity, high query performance, and analytical flexibility.

Model Components Link to heading

  • Fact Table

    • fact_trips: Stores key performance metrics, including trip distance, fare amount, total amount, and passenger count.
  • Dimension Tables

    • dim_time: Captures temporal hierarchies such as day, month, year, and weekday.
    • dim_location: Describes pickup and drop-off zones.
    • dim_payment: Contains payment methods and their descriptions.
    • dim_vendor: Identifies taxi service providers.

This model enables multi-dimensional analysis by time, location, payment type, and vendor, providing valuable insights into urban transportation behavior.

Star schema representation showing fact and dimension relationships.

Figure 1. Star schema representation showing fact and dimension relationships.

ELT Process Link to heading

For this project the Extract, Load, and Transform (ELT) pattern was used. It consists of the following stages:

  1. Extraction
    Data is retrieved directly from BigQuery’s public NYC taxi trips dataset.

  2. Loading
    Data is loaded into BigQuery tables that are first used as staging tables under a raw schema to perform data cleaning and transformation.

  3. Transformation
    Data is then moved to a processed schema where it is cleaned, data types are normalized, and derived attributes are created. Subsequently, data is loaded into a trusted schema in the fact and dimension tables. Finally, materialized views are created to provide fast access to the data as it contains millions of records.

Analysis and Results Link to heading

Using Looker Studio connected to BigQuery, an interactive dashboard was developed to analyze operational and mobility patterns of NYC taxi services. The visualizations that it contains support insights into trip volumes, revenue distribution, temporal trends, and geographic demand, demonstrating the analytical value of the Data Warehouse design. You can access the dashboard here.

Below is a screenshot of the dashboard, so you can get an idea of the data and visualizations that it contains.

Dashboard screenshot Figure 2. Dashboard screenshot.

Technologies Used Link to heading

  • SQL
  • Google Cloud Platform (GCP)
  • BigQuery
  • Data Warehousing
  • Looker