CS-433 Project 2

Damian Dudzicz, Guillaume Michel, Adrien Vandenbroucque

This repository contains the code for Project 2 of EPFL's Machine Learning course. The goal of this project is to perform sentiment analysis on a dataset of tweets.

Packages required

In order to run the code properly, you will need the following packages:

- NLTK (the Natural Language Toolkit). From this library, you will specifically need:
  - PorterStemmer
  - TweetTokenizer
  - Corpus of stop-words (the only item that requires a separate download, e.g. with `nltk.download('stopwords')`)
- NumPy
- Scikit-Learn
- Matplotlib
- Gensim

The files are organized as follows:

- `helpers.py` contains the function useful for the CSV submission,
- `preprocessing.py` contains functions useful to load and preprocess tweets,
- `w2v.py` contains functions useful for creating and training a Word2Vec model,
- `pipeline.py` contains the Scikit-Learn pipeline we used for the final version,
- `cross_validation.py` contains functions for performing the cross-validation method we used,
- `plot.py` contains functions to plot the results of the cross-validation,
- `Project2-ML.ipynb` is a notebook used when choosing which method was the best, and
- `run.py` contains the code to load the data and run one of the models in order to compute predictions.

The train and test data can be found in the `data/` folder.

How to run the program

Running `python3 run.py` computes the predictions of the best model on the test data and writes them to a file named `submission.csv`.

More about the files

`helpers.py`

This file contains the function used to create the CSV submission for CrowdAI.
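As a rough illustration of what such a function might look like, here is a minimal sketch using only the standard library. The function name `write_submission` and the column names `Id` and `Prediction` are assumptions made for this example; check the competition page and `helpers.py` for the actual format.

```python
import csv


def write_submission(ids, predictions, path):
    """Write one (Id, Prediction) row per test tweet to a CSV file.

    The column names are an assumption for this sketch, not taken
    from the repository.
    """
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["Id", "Prediction"])
        writer.writeheader()
        for tweet_id, label in zip(ids, predictions):
            # Cast to int so NumPy scalars are written as plain integers.
            writer.writerow({"Id": int(tweet_id), "Prediction": int(label)})
```

Calling `write_submission([1, 2], [1, -1], "submission.csv")` would produce a header line followed by one line per tweet.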

`preprocessing.py`

This file contains the functions to load and prepare the train and test data. It also contains the function used to tokenize the tweets.
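The tokenization step could look roughly like the sketch below, which combines the two NLTK classes listed in the requirements. The function name `tokenize_tweet`, the tokenizer options, and the tiny inline stop-word set are assumptions for illustration; the actual code presumably uses NLTK's full stop-word corpus and may apply these steps differently.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# Small stand-in for NLTK's stop-word corpus, so the sketch runs without
# downloading any data. This list is hypothetical, not the project's.
STOP_WORDS = {"a", "an", "and", "are", "is", "the", "to"}

# preserve_case=False lowercases words; reduce_len=True shortens runs of
# repeated characters (e.g. "sooooo") to at most three.
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
stemmer = PorterStemmer()


def tokenize_tweet(text):
    """Split a tweet into tokens, drop stop words, and stem what is left."""
    tokens = tokenizer.tokenize(text)
    return [stemmer.stem(token) for token in tokens if token not in STOP_WORDS]


print(tokenize_tweet("The cats are running"))
```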

`w2v.py`

This file contains the function to create and train a Word2Vec model. Here we used a variant of Word2Vec called FastText. The file also contains a function used to convert tweets to vectors.
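One common way to turn a tweet into a single vector is to average the vectors of its words. The sketch below shows that idea with a plain dictionary standing in for a trained model; the averaging strategy, the function name `tweet_to_vector`, and the toy vectors are all assumptions, since the conversion actually used in `w2v.py` is not described here.

```python
import numpy as np


def tweet_to_vector(tokens, word_vectors, dim):
    """Average the vectors of the known tokens of a tweet.

    `word_vectors` is any mapping from token to a 1-D array of length
    `dim` (hypothetical stand-in for a trained embedding model).
    """
    vectors = [word_vectors[token] for token in tokens if token in word_vectors]
    if not vectors:
        # No known token: fall back to the zero vector.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy two-dimensional embeddings, made up for the example.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}

print(tweet_to_vector(["good", "movie", "unknown"], toy_vectors, dim=2))
```

Note that a trained FastText model can also build vectors for unseen words from their character n-grams, so the membership check above is a simplification.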

`pipeline.py`

This file contains the function that creates a Scikit-Learn pipeline that builds a Bag of Words representation of the tweets with TF-IDF weighting. To classify the resulting vectors, we then use the LinearSVC classifier.
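A pipeline of this kind can be sketched as follows. The function name `build_pipeline`, the step names, the default hyperparameters, the toy tweets, and the `1` / `-1` label encoding are assumptions for illustration; the settings used in `pipeline.py` may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline():
    """TF-IDF weighted bag of words followed by a linear SVM."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),  # raw text -> sparse TF-IDF vectors
        ("clf", LinearSVC()),          # linear classifier on those vectors
    ])


# Toy data, made up for the example (1 = positive, -1 = negative).
tweets = [
    "what a great day",
    "i love this song",
    "this is awful",
    "i hate mondays",
]
labels = [1, 1, -1, -1]

model = build_pipeline().fit(tweets, labels)
predictions = model.predict(["such a great song"])
print(predictions)
```

Wrapping both steps in a single `Pipeline` means the vectorizer is refit on the training portion only whenever the model is cross-validated, which avoids leaking test vocabulary into training.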

`cross_validation.py`

This file contains the function that performs repeated stratified k-fold cross-validation, so that each fold keeps the class proportions of the full training set.
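Scikit-Learn provides this scheme directly as `RepeatedStratifiedKFold`. The sketch below is one way to use it; the number of splits and repeats, the accuracy metric, and the toy data are assumptions chosen to keep the example small, not the values used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, made up for the example (1 = positive, -1 = negative).
tweets = [
    "what a great day", "i love this song", "best movie ever", "so happy today",
    "this is awful", "i hate mondays", "worst movie ever", "so sad today",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

# 2 folds repeated 3 times with different shuffles gives 6 scores.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=3, random_state=0)
scores = cross_val_score(
    make_pipeline(TfidfVectorizer(), LinearSVC()),
    tweets,
    labels,
    cv=cv,
    scoring="accuracy",
)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Repeating the split with different shuffles reduces the variance of the estimate compared with a single k-fold run.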

`plot.py`

This file contains the function used to plot the results of the cross-validation.
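The kind of plot produced is not specified here; one plausible choice is a box plot with one box per model, as in the sketch below. The function name `plot_cv_scores`, the box-plot layout, and the score values are assumptions (the numbers are placeholders, not results from this project).

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is required
import matplotlib.pyplot as plt


def plot_cv_scores(scores_by_model):
    """Draw one box per model from a {model name: list of scores} mapping."""
    names = list(scores_by_model)
    fig, ax = plt.subplots()
    ax.boxplot([scores_by_model[name] for name in names])
    # boxplot places the boxes at positions 1..n; label them by model name.
    ax.set_xticks(list(range(1, len(names) + 1)))
    ax.set_xticklabels(names)
    ax.set_ylabel("Accuracy")
    ax.set_title("Cross-validation scores per model")
    return fig


# Placeholder scores, invented purely to exercise the function.
fig = plot_cv_scores({
    "model A": [0.80, 0.82, 0.79],
    "model B": [0.75, 0.77, 0.74],
})
```

The returned figure can then be written to disk with `fig.savefig("cv_scores.png")`.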

`Project2-ML.ipynb`

This notebook contains the code that evaluates the different models we tried in order to find the best one.

`run.py`

This file contains the code that loads the training and test data.

We then compute the representation of the tweets in a vector space and classify them, using the model that gave us the best results (`TfidfVectorizer` + `LinearSVC`). Finally, we compute the predictions and write them in the format accepted by the CrowdAI competition.
