This is the source for the ML training project conducted in CMPT 459 at Simon Fraser University. In this class, we spent the semester training ML models of varying complexities to predict the interest level of rental listings online.
We trained several machine learning models, implemented in Python, to predict the interest levels of rental listings online, reaching roughly 73% accuracy. This project was a collaboration with @DistrictFine. Please torrent the image data before execution; there are too many images to store on GitHub.
The following commands perform EDA on the dataset. Run 0.2-setup_data_directory.py before any of the EDA scripts, since the data directory must be prepared beforehand.
```
python 0.1-download_images.py             # Download and prepare all image data
python 0.2-setup_data_directory.py        # Prepare the data directory before EDA
python 1.1-price_location_histogram.py    # Generate histograms for price and location
python 1.2-hourwise_trend.py              # Generate graphs for the hourwise trend of postings
python 1.3-visualization_target_values.py # Visualize target values using graphs
python 2.1-missing_values.py              # Output the number of missing values in the dataset
python 3.1-image_features.py              # Output all image features from the dataset
python 3.2-document_features.py           # Output all document features from the dataset
```
The visualizations pull from the data downloaded in 0.1-download_images.py, so be sure to run that script first.
All results are saved into the results folder.
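As a rough sketch of the kind of check 2.1-missing_values.py performs (the real script and the dataset's column names may differ), counting missing values per column with pandas looks like this:

```python
import pandas as pd

# Toy stand-in for the rental listings dataset (the real columns differ)
listings = pd.DataFrame({
    "price": [2400, 3100, None, 1800],
    "bedrooms": [1, None, 2, 1],
    "description": ["Bright 1BR", "Spacious", "Cozy studio", None],
})

# Count the missing values in each column
missing_counts = listings.isna().sum()
print(missing_counts)
```

Columns with many missing entries can then be dropped or imputed before training.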
The trivial models used are (1) a decision tree and (2) an SVM. These are single, simple classifiers, and their training results were poor, but they served as the foundation for the more complex, non-trivial models later on.
```
python 0-create_classification_dataset.py # Prepare all data before training
python 1-decision_tree.py                 # Train and test a decision tree classifier
python 2-svm.py                           # Train and test an SVM classifier
```
Test results from the cross-validation are printed once training has concluded. As mentioned above, these models are not very good: about 60% accuracy, only modestly better than chance.
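A minimal sketch of this training-and-cross-validation loop, using synthetic data in place of the listing features (the real scripts load the prepared dataset and their exact settings may differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the listing features, with 3 interest levels
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, n_classes=3,
                           random_state=459)

for name, model in [("Decision tree", DecisionTreeClassifier(random_state=459)),
                    ("SVM", SVC())]:
    # 5-fold cross-validation; scores are per-fold accuracies
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```

Reporting the mean of the fold accuracies, rather than a single train/test split, gives a more stable estimate of each model's performance.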
The final two models are (1) gradient boosting and (2) random forests. They are far more accurate, as these ensemble methods reduce both bias and variance.
```
python 0-create_classification_dataset.py # Prepare all data before training
python 1-gradient_boosting_classifier.py  # Train and test a gradient boosting classifier
python 2-random_forest.py                 # Train and test a random forest classifier
```
The accuracy of these non-trivial models is about 75% or higher, significantly better than the trivial methods used earlier. Both are ensembles that combine many small decision trees: random forests average independently trained trees to reduce variance, while gradient boosting adds trees sequentially to reduce bias.
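The ensemble scripts follow the same pattern as the trivial ones; here is a sketch on synthetic data (the real scripts use the prepared listing dataset, and their hyperparameters may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the listing features, with 3 interest levels
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, n_classes=3,
                           random_state=459)

ensembles = [
    # Bagging of many trees: averaging reduces variance
    ("Random forest", RandomForestClassifier(n_estimators=100, random_state=459)),
    # Sequentially fitted trees: each corrects the last, reducing bias
    ("Gradient boosting", GradientBoostingClassifier(random_state=459)),
]

for name, model in ensembles:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy {scores.mean():.2f}")
```

On the toy data the ensembles typically beat a single decision tree, mirroring the gap between the trivial and non-trivial models reported above.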
This project was a collaboration with @DistrictFine for a school project. You should definitely check out his GitHub profile as he works on some really cool stuff!