Rental Classifier

Training ML models to predict rental interest levels.

Project Documentation

This is the source for the ML training project conducted in CMPT 459 at Simon Fraser University. Over the semester, we trained machine learning models of varying complexity to predict the interest level of online rental listings.

We trained several machine learning models, implemented in Python, to predict the interest levels of online rental listings, reaching 73% accuracy. Please torrent the image data before running anything; there are too many images to store on GitHub.


0: Exploratory Data Analysis

The following commands perform EDA on the dataset. Run 0.1 and 0.2 before anything else: they download the image data and prepare the data directory that every later step relies on.

python 0.1-download_images.py              # Download, prepare all image data
python 0.2-setup_data_directory.py         # Prepares the data directory before EDA
python 1.1-price_location_histogram.py     # Generate histogram for price and location
python 1.2-hourwise_trend.py               # Generate graphs for the hourwise trend of postings
python 1.3-visualization_target_values.py  # Visualize target values using graphs
python 2.1-missing_values.py               # Outputs number of missing values in dataset
python 3.1-image_features.py               # Outputs all image features from dataset
python 3.2-document_features.py            # Outputs all document features from dataset

The visualizations pull from the data imported in 0.1, so be sure to run that first. All results are saved to the results folder of the 0-exploratory_data_analysis directory.
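To illustrate the kind of summary 1.1 produces, here is a minimal sketch of binning listing prices into a histogram. The function name and the example prices are made up for illustration; the real script works on the downloaded dataset and renders actual plots.

```python
import numpy as np

def price_histogram(prices, n_bins=10):
    """Bin listing prices into n_bins equal-width buckets.

    Returns (counts, bin_edges), the raw data a price histogram
    like the one in 1.1-price_location_histogram.py would plot.
    """
    prices = np.asarray(prices, dtype=float)
    counts, edges = np.histogram(prices, bins=n_bins)
    return counts, edges

# Toy example with made-up rental prices
counts, edges = price_histogram([1200, 1500, 1800, 2400, 3100, 950], n_bins=3)
```

With 3 bins, `edges` has 4 boundary values and `counts` sums to the number of listings.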

1: Trivial Model Training

The trivial models used are (1) a decision tree and (2) an SVM. Both are single, standalone classifiers with limited accuracy. Our training results were poor, but these two models served as the baseline for the more complex, non-trivial models later on.

python 0-create_classification_dataset.py  # Prepares all data before training
python 1-decision_tree.py                  # Trains and tests a Decision Tree classifier
python 2-svm.py                            # Trains and tests an SVM classifier

Cross-validation results are printed once training concludes. As mentioned before, these models are not very accurate (about 60%, only modestly better than chance).
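The decision tree and SVM scripts follow a standard train-and-cross-validate pattern, which can be sketched with scikit-learn as below. The synthetic dataset here is a stand-in for illustration only; the real scripts load the prepared classification dataset from step 0.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the listing features (3 interest classes)
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, n_classes=3,
                           random_state=0)

# 5-fold cross-validation, printing mean accuracy per model
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
svm_scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)

print(f"decision tree: {tree_scores.mean():.2f}")
print(f"svm:           {svm_scores.mean():.2f}")
```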

2: Non-trivial Model Training

These final two models are (1) gradient boosting and (2) random forests. They are far more accurate: as ensembles of many decision trees, gradient boosting primarily reduces bias while random forests primarily reduce variance.

python 0-create_classification_dataset.py  # Prepares all data before training
python 1-gradient_boosting_classifier.py   # Trains and tests a Gradient Boosting classifier
python 2-random_forest.py                  # Trains and tests a Random Forest classifier

The accuracy of these non-trivial models is roughly 75%, significantly better than the trivial methods used earlier. Both combine many small decision trees into an ensemble, producing predictions that are more stable, less biased, and lower in variance than any single tree.
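The ensemble scripts mirror the trivial pipeline, swapping in ensemble classifiers. A minimal sketch with scikit-learn, again on a synthetic stand-in dataset rather than the real listing features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the listing features (3 interest classes)
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, n_classes=3,
                           random_state=0)

# Both ensembles are built from many small decision trees:
# boosting fits trees sequentially on residual errors (reducing bias),
# while the forest averages independently grown trees (reducing variance).
gb_scores = cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=5)
rf_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)

print(f"gradient boosting: {gb_scores.mean():.2f}")
print(f"random forest:     {rf_scores.mean():.2f}")
```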


This project was a collaboration with @DistrictFine. You should definitely check out his GitHub profile; he works on some really cool stuff!