Reproducible ML Workflow

ML Pipeline for Short-Term Rental Prices in NYC

A fully reproducible machine learning pipeline for predicting short-term rental prices in New York City, designed to retrain weekly as new data arrives.

100% Reproducible

Weekly Retraining

Full Versioning

Auto Validation

The Challenge

Property management companies need reliable price estimates for short-term rentals based on comparable properties. The rental market changes rapidly, so models must be retrained frequently with new data. A one-off model is not enough: what's needed is a reproducible pipeline that can be run end-to-end or step-by-step, with full experiment tracking and artifact versioning.

Our Approach

The pipeline is orchestrated with MLflow, with each step (data download, preprocessing, EDA, data validation, train/test split, model training, evaluation) as an independent component. Weights & Biases handles artifact storage and experiment tracking. Hydra manages configuration via YAML files, enabling multi-run hyperparameter sweeps without code changes.
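
The "run everything or just some steps" behaviour described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the step names, `select_steps`, and `run_pipeline` are hypothetical, and the step runners are stand-in callables where a real pipeline would launch each component (for example through `mlflow.run`) with parameters taken from the Hydra configuration.

```python
from typing import Callable, Dict, List

# Hypothetical step names, listed in the order the text describes.
PIPELINE_STEPS: List[str] = [
    "download",
    "preprocessing",
    "eda",
    "data_validation",
    "train_test_split",
    "training",
    "evaluation",
]


def select_steps(requested: str, available: List[str] = PIPELINE_STEPS) -> List[str]:
    """Return the requested steps in pipeline order; 'all' selects every step."""
    if requested.strip() == "all":
        return list(available)
    wanted = {name.strip() for name in requested.split(",") if name.strip()}
    unknown = wanted - set(available)
    if unknown:
        raise ValueError(f"Unknown step(s): {sorted(unknown)}")
    # Keep pipeline order regardless of the order the caller typed.
    return [name for name in available if name in wanted]


def run_pipeline(requested: str, runners: Dict[str, Callable[[], None]]) -> List[str]:
    """Run the selected steps in order and return the names that were executed."""
    executed = []
    for name in select_steps(requested):
        runners[name]()  # stand-in; a real runner might call mlflow.run for this step
        executed.append(name)
    return executed


if __name__ == "__main__":
    log: List[str] = []
    demo_runners = {name: (lambda n=name: log.append(n)) for name in PIPELINE_STEPS}
    print(run_pipeline("training,download", demo_runners))
```

Accepting a comma-separated string keeps the selection easy to pass as a single configuration override, and sorting the selection back into pipeline order prevents a step from running before the artifacts it depends on exist.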

Data Quality & Validation

Automated data cleaning removes price outliers (keeping the $10–$350 per night range), handles missing values, and converts temporal data to proper formats. Threshold-based tests then validate dataset integrity by checking row counts and price distributions against a reference dataset, catching data quality issues before they reach the model.
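
A rough sketch of these cleaning and validation checks with pandas follows. The column names (`price`, `last_review`), the helper names, and the use of KL divergence to compare price distributions are assumptions made for illustration; the source only states that row counts and price distributions are checked against a reference dataset.

```python
import numpy as np
import pandas as pd

MIN_PRICE, MAX_PRICE = 10, 350  # nightly price range kept by the cleaning step


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Drop price outliers and parse the review date (column names are assumed)."""
    in_range = df["price"].between(MIN_PRICE, MAX_PRICE)  # inclusive on both ends
    cleaned = df.loc[in_range].copy()
    # Unparseable or missing dates become NaT instead of raising.
    cleaned["last_review"] = pd.to_datetime(cleaned["last_review"], errors="coerce")
    return cleaned


def row_count_ok(df: pd.DataFrame, min_rows: int, max_rows: int) -> bool:
    """Check that the dataset size falls inside the expected bounds."""
    return min_rows <= len(df) <= max_rows


def price_kl_divergence(sample, reference, bins: int = 10) -> float:
    """KL divergence between binned price distributions (one possible comparison)."""
    edges = np.linspace(MIN_PRICE, MAX_PRICE, bins + 1)
    p, _ = np.histogram(sample, bins=edges)
    q, _ = np.histogram(reference, bins=edges)
    # Smooth the counts slightly so empty bins do not produce log(0).
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    return float(np.sum(p * np.log(p / q)))


if __name__ == "__main__":
    raw = pd.DataFrame(
        {
            "price": [5, 10, 120, 350, 900],
            "last_review": ["2019-05-21", None, "2019-07-05", "2018-11-19", "2019-06-23"],
        }
    )
    cleaned = clean_listings(raw)
    print(cleaned)
    print(row_count_ok(cleaned, min_rows=1, max_rows=10))
```

In a validation step, a check such as `price_kl_divergence(new_prices, reference_prices) < threshold` would be one assertion among several, with the threshold supplied through configuration rather than hard-coded.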

Results & Impact

The pipeline trains a Random Forest model with hyperparameter optimization and produces versioned artifacts at every stage. It can run end to end or one step at a time, which suits both weekly automated retraining and ad-hoc experimentation. Reproducibility comes from MLflow orchestration combined with W&B artifact tracking.
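
As an illustration of the training step, the sketch below tunes a Random Forest regressor with scikit-learn. It is a stand-in under stated assumptions: the project runs its sweeps through Hydra multi-run, whereas this example substitutes `GridSearchCV`; the data is synthetic, and the parameter grid, the MAE metric, and the function name `tune_random_forest` are hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned listings: 4 numeric features, price-like target.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 4))
y = 50 + 200 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 5, size=300)

# Hypothetical search space; a real sweep would read this from configuration.
PARAM_GRID = {"n_estimators": [20, 50], "max_depth": [3, 8]}


def tune_random_forest(X, y, param_grid, seed: int = 42):
    """Grid-search a Random Forest and return (best_params, test-set MAE)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    search = GridSearchCV(
        RandomForestRegressor(random_state=seed),  # fixed seed keeps runs repeatable
        param_grid,
        scoring="neg_mean_absolute_error",
        cv=3,
    )
    search.fit(X_train, y_train)
    # GridSearchCV refits the best configuration on the full training split.
    mae = mean_absolute_error(y_test, search.predict(X_test))
    return search.best_params_, mae


if __name__ == "__main__":
    best_params, mae = tune_random_forest(X, y, PARAM_GRID)
    print(best_params, round(mae, 2))
```

Fixing the random seed for both the split and the model means repeated runs on the same data select the same configuration, which is the property a weekly retraining job relies on when comparing one week's model with the next.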

Technologies Used

Python, MLflow, Weights & Biases, Hydra, scikit-learn, pandas, Conda
