Machine Learning for Biologists, a hands-on introduction
Instructors: Pietro Franceschi, Filippo Biscarini
Synopsis
Modern quantitative technologies are now the standard way to characterize complex phenomena in almost every research domain. Biology is no exception: multi-omics techniques (metabolomics, transcriptomics, genomics and proteomics) are pervasive in every facet of the life sciences. The resulting multivariate datasets are highly complex, and advanced data analysis approaches are needed to make the most of the information they contain. For relatively large-scale studies, machine learning is a valid tool to complement classical multivariate statistical methods. The objective of this course is to highlight the advantages and limitations of these data analysis approaches in the context of biological research, providing a broad hands-on introduction to the use of multivariate methods and machine learning for the analysis of -omics datasets.
Below is the structure of the course; a detailed timetable is linked at the end of the document. Code and data will be available at the beginning of each day, and slides at the end of each day.
Day 1
- General Introduction
- Data mining, -omics and machine learning
- slides 0: Hands-off introduction to ML (Filippo)
- Omics meet ML (Pietro)
- Introduction to advanced R data libraries Rmd
- Introduction to tidymodels Rmd
Day 2
- Recap of Day_1: DIY!
- Multivariate data: things to always remember
- Model and variable selection: the machine learning paradigm
- slides 1.variable_selection (Filippo)
- Supervised learning: regression and classification
- [script 1.introduction_to_ml] (.Rmd) (ipynb)
- slides 2.Supervised learning (Filippo)
- Machine learning for regression problems
- [data_reg](data/DNA methylation data.xlsm)
- [script 2.linear_regression] (.Rmd) (ipynb)
- slides 3.Regression (Filippo)
Day 3
- Overfitting and resampling techniques
- [script 3.training_testing] (.Rmd) (ipynb)
- slides 4.overfitting (Filippo)
- slides 5.resampling (Filippo)
- Classification problems
- data_class
- [script 4.classification] (.Rmd) (ipynb)
- slides 6.classification (Filippo)
- Regression and classification with tidymodels
- Lasso-penalised linear and logistic regression
- data_lasso
- [script 6.lasso] (.Rmd) (ipynb)
- slides 7.lasso_regularization (Filippo)
- Lasso and model tuning
- data lasso 2
- [script 7.lasso_with_tidymodels] (.Rmd) (ipynb)
- KNN imputation
Day 4
- Random Forest for regression and classification
- [script 7.random_forest] (.Rmd) (ipynb)
- [script 8.multiclass_random_forest] (.Rmd) (ipynb)
- slides 7.random_forest (Filippo)
- Slow learning: the boosting approach
- [script 9.boosting] (.Rmd) (ipynb)
- slides 8.boosting (Filippo)
- Unsupervised learning: PCA, UMAP, self-organizing maps (Pietro)
Day 5
- Advanced data visualization: master ggplot! (Pietro)
- Final interactive exercise
- Kahoot quiz: let’s test our machine learning skills!
- Q&A
R Libraries
- Complete list here