This project is an attempt to use deep learning instead of classic ALS for recommender systems. It uses various embeddings to achieve an average RMSE of 0.97445 against an ALS baseline of 0.98201.
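The reported scores are root-mean-square errors over the observed ratings. As a minimal sketch of the metric (not taken from the repository), assuming the true and predicted ratings are available as NumPy arrays:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error over the observed ratings."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Example: rmse([4, 3, 5], [3.8, 3.1, 4.5]) ~= 0.32
```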
The dataset (unzip data.zip) represents a 100'000 by 10'000 matrix (10M entries) filled with roughly 1M values (sparse). It contains user ratings for items with no additional data.
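The exact file layout inside data.zip is not described here; assuming the ratings come as one `user,item,rating` triple per line (a hypothetical format, adjust the parsing to the real files), a sketch of loading them into a SciPy sparse matrix:

```python
import numpy as np
from scipy.sparse import coo_matrix

def load_ratings(path, n_users, n_items):
    """Build a sparse user-item matrix from (user, item, rating) triples."""
    rows, cols, vals = [], [], []
    with open(path) as f:
        for line in f:
            u, i, r = line.strip().split(',')
            rows.append(int(u))
            cols.append(int(i))
            vals.append(float(r))
    # COO format stores only the observed entries, which suits a sparse rating matrix.
    return coo_matrix((vals, (rows, cols)), shape=(n_users, n_items))
```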
Libraries used:
- Scipy
- Numpy
- Keras v1.1.1
- Tensorflow v0.11.0
- h5py
- Sklearn
- mca
- tqdm
Execute the following command to install the libraries:
```
pip3 install scipy keras==1.1.1 sklearn numpy h5py tqdm git+https://github.com/esafak/mca
```
To install Tensorflow, follow the guide at this link: https://www.tensorflow.org/get_started/os_setup
In the "export" command, change the version number from "12" to "11".
This script implements Alternating Least Squares.
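The script itself is not reproduced here; as a rough, dense-matrix illustration of the technique (the regularization, factor size, and update details of the actual script may differ):

```python
import numpy as np

def als(ratings, mask, k=20, lam=0.1, n_iters=20):
    """Factorize `ratings` (n_users x n_items) as U @ V.T with ALS.

    `mask` is a boolean array marking the observed entries.
    """
    n_users, n_items = ratings.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    reg = lam * np.eye(k)
    for _ in range(n_iters):
        # Fix V, solve a ridge regression per user over the items they rated.
        for u in range(n_users):
            idx = mask[u]
            if idx.any():
                Vu = V[idx]
                U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ ratings[u, idx])
        # Fix U, solve a ridge regression per item over the users who rated it.
        for i in range(n_items):
            idx = mask[:, i]
            if idx.any():
                Ui = U[idx]
                V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ ratings[idx, i])
    return U, V
```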
This Jupyter notebook creates the neural network and trains it.
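The notebook is not shown here; as an illustration of the general idea (user and item embeddings fed to a small dense network predicting the rating), a sketch in current tf.keras syntax. Note that the repository targets Keras 1.1.1, whose API differs, and the actual architecture may not match this one:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_users, n_items, emb_dim=32):
    user_in = layers.Input(shape=(1,), name='user')
    item_in = layers.Input(shape=(1,), name='item')
    # Learn one embedding vector per user and per item.
    u = layers.Flatten()(layers.Embedding(n_users, emb_dim)(user_in))
    v = layers.Flatten()(layers.Embedding(n_items, emb_dim)(item_in))
    x = layers.Concatenate()([u, v])
    x = layers.Dense(64, activation='relu')(x)
    out = layers.Dense(1)(x)  # predicted rating
    model = keras.Model([user_in, item_in], out)
    model.compile(optimizer='adam', loss='mse')
    return model

# Usage: model.fit([user_ids, item_ids], ratings, epochs=10, batch_size=1024)
```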
This Jupyter notebook splits and prepares the data to be used by the neural network.
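The exact preprocessing lives in the notebook; a minimal sketch of the kind of split it performs, assuming the observed ratings are available as (user, item, rating) triples and using scikit-learn:

```python
from sklearn.model_selection import train_test_split

def split_triples(users, items, ratings, test_size=0.1, seed=42):
    """Hold out a fraction of the observed ratings for validation."""
    return train_test_split(users, items, ratings,
                            test_size=test_size, random_state=seed)

# Returns: users_tr, users_te, items_tr, items_te, ratings_tr, ratings_te
```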
To check that no user or item behaves in an unexpected way (for the net), one can sort users by the number of ratings they gave and items by the number of times they were rated. This allows splitting the data into even cuts and computing the score on those small subsets. The following figures show specific subsets and fragmentations, giving an insight into their impact on the score; a code sketch of this per-bucket evaluation follows the figures. Surprisingly, the least-rated items did not penalize the score, while moderately rated items had a worse impact. On the user side, there is some variance, but it remains negligible.
The number of ratings given by a user does not have much impact on the global score (red line).
The score varies more across items: moderately rated items penalize the overall score the most.
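As an illustration of the analysis above (not the notebook's code), a sketch that buckets ratings by how active their user is and computes the RMSE inside each bucket; the same idea applies to items:

```python
import numpy as np

def rmse_by_user_activity(user_ids, y_true, y_pred, n_buckets=10):
    """RMSE per bucket of ratings, ordered by the activity of their user."""
    user_ids = np.asarray(user_ids)
    counts = np.bincount(user_ids)                        # ratings given per user
    order = np.argsort(counts[user_ids], kind='stable')   # sort ratings by user activity
    buckets = np.array_split(order, n_buckets)            # even cuts, least to most active
    errors = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    return [float(np.sqrt(errors[b].mean())) for b in buckets]
```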