Course: CSCI E-1090A (Introduction to Data Science), Harvard Extension School, Fall 2024

Group: Niraj Thapaliya, Jussan Da Silva Bahia Nascimento, Tim Niko

Code: github.com/nthapaliya/boston-rideshare-analysis

Notebook: View on Github


The question

What are the most important factors determining Uber and Lyft ride prices in Boston: distance, weather, time of day, neighborhood, or service tier?

The pipeline

693,071 rides from late November to mid-December 2018, joined with hourly weather data. After dropping Uber Taxi rides (all had missing prices — a structurally different product), duplicates, and scrambled geo columns, the final dataset is 637,036 rows and 13 predictors.

Key cleaning decisions:

  • Reduced 37 weather variables to 3 (temperature, wind speed, precipitation probability) after finding 26 of 37 are highly collinear and all have near-zero correlation with price
  • Log-transformed the response variable for regression
  • One-hot encoded 12 Boston neighborhoods (source and destination) and 13 product types

Six models compared by 5-fold cross-validation R², then the final model evaluated on a held-out test set with permutation importance and bootstrap coefficient analysis to rank features.

Results

ModelMean CV R²
Baseline (distance + surge only)0.1439
Multilinear Regression0.9385
Polynomial (degree 2)0.9410
Interaction term0.9386
Decision Tree0.9354
Random Forest0.9373
Gradient Boosting0.9155

Final model (Multilinear Regression, chosen for interpretability): test R² = 0.9381.

Feature importance ranking (permutation): product type, distance, cab type (Uber vs. Lyft), surge multiplier. Weather contributes essentially nothing — the dataset spans three uniformly cold winter weeks with almost no weather variation.

Uber rides are cheaper than Lyft on average. Theatre District and North End rides are priced higher than Back Bay; Financial District and Boston University origins are cheaper.

Dataset

Uber & Lyft Dataset Boston, MA on Kaggle — 693,071 rides across 12 neighborhoods near downtown Boston.

Stack

Python, pandas, scikit-learn (LinearRegression, Lasso, DecisionTreeRegressor, RandomForestRegressor, GradientBoostingRegressor, GridSearchCV), statsmodels (VIF), folium, seaborn, uv