
Machine Learning for Bioinformatics
S 1133 Fall 2016
Chloé-Agathe Azencott

Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr

Learning objectives

- Determine whether a problem can be solved by machine learning.
- Formulate a problem in machine learning terms.
- Give and motivate examples of applications of machine learning in bioinformatics.
- Pick an appropriate way to evaluate a machine learning algorithm for classification/regression.
- Know how to solve specific machine learning problems:
  - Dimensionality reduction with PCA
  - (Classification and) regression with linear regression
  - Feature selection with: filter approaches; regularized linear regression
  - Classification (and regression) with decision trees and random forests [22 sept]
  - Classification (and regression) with SVMs [22 sept]
  - Clustering with: hierarchical clustering; k-means [22 sept]
  - Recovering a series of states from a series of observations with HMMs [17 oct, Thomas Walter]

http://cazencott.info/index.php/pages/Teaching

Some bioinformatics questions

- What is the function of a gene/protein?
- What is the binding strength between a protein and a ligand?
- Which genes are involved in specific diseases?
- Which proteins interact with which other proteins?
- What is the yield of a crop?

What is (Machine) Learning?

Why Learn?

- Learning: Modifying a behavior based on experience. [F. Benureau]
- Machine learning: Programming computers to optimize a performance criterion using example data.
- There is no need to learn to apply existing algorithms.
- Machine learning is used (among other things) when human expertise does not exist.

What we talk about when we talk about learning

- Learning general models from particular examples (data).
- Data is (mostly) cheap and abundant; knowledge is expensive and scarce.
- Example in retail: from customer transactions to consumer behavior. "People who bought Game of Thrones also bought Lord of the Rings" [amazon.com]
- Goal: Build a model that is a good and useful approximation to the data.

What is machine learning?

- Optimizing a performance criterion using example data or past experience.
- Role of Statistics: Inference from a sample.
- Role of Computer Science: Efficient algorithms to
  - solve the optimization problem;
  - represent and evaluate the model for inference.

Classes of machine learning problems

- Supervised learning: Predict outcome from features
  - Classification
  - Regression
  - Ranking
- Unsupervised learning: Find patterns in the data
  - Dimensionality reduction
  - Clustering
- Semi-supervised learning: Predict outcome for unlabeled but known instances.
- Reinforcement learning: Maximize cumulative reward.
- Outlier detection: Detect anomalies.

Classification

Cancer diagnosis
- Given the expression levels of 20k genes, is the leukemia acute lymphocytic (ALL) or myeloid (AML)?

Cancer prognosis
- Given the expression levels of 20k genes in a tumor after surgery, is the cancer likely to relapse?

Pharmacogenomics
- Given the genome of a patient, which treatment should we give?

Protein annotation
- Given the sequence of a protein, is it secreted or not?

Drug discovery
- Is a new candidate molecule likely to be active?

Regression

[Image from Yunyoungmok on wikimedia.commons]

Binding affinity
- Predict the binding affinity between a ligand and a protein. [Image: Surflex-QMOD]

Age of onset
- Predict the age of onset of a disease from clinical and genetic variables. [Image: schizophrenia.com]

Solubility
- Predict the solubility in octanol / water of a chemical compound. [Image: kreatis.eu]

Crop yield
- Predict the yield of a crop from its genome. [Image: Dehaan / wikimedia.commons]

Ranking

- Find a function that orders a given list of items of X.
- As a classification problem: pairwise approach. Is {x1, x2} correctly ordered?
- As a regression problem: pointwise approach.
- Listwise approach: directly optimize for the best list.

Ranking drug candidates
- Rank chemical compounds according to how likely they are to cure a disease.

Ranking conformers
- Select the best 3D structure(s) for a chemical among those generated by various methods.

Knowledge extraction

Supervised learning models can be used to
- make predictions about new instances;
- extract knowledge: interpretation.

Feature selection

- Which of the predictive variables are incorporated into the model? How important are they?
  - Filter approaches
  - Wrapper approaches
  - Embedded approaches

Dimensionality reduction

Reduce the number of input variables:
- Feature selection
- Feature extraction: transform the data into a space of lower dimension.

Goals of dimensionality reduction:
- Reduce storage space & computational time
- Remove collinearities
- Visualization (in 2 or 3 dimensions) and interpretation.

Clustering

- Goal: Group objects into clusters, i.e. classes that were unknown beforehand. Objects in the same cluster are more similar to each other than to objects in a different cluster.
- Motivation:
  - Understanding general characteristics of the data;
  - Visualization;
  - Inferring some properties of an object based on how it relates to other objects.
- 22/09/2016: Clustering gene expression data.

Applications of ML to bioinformatics

- Comparative genomics: build phylogenetic trees; compare & infer gene functions
- Motif identification: TF binding sites; promoter binding sites; operons
- Gene finding: identify coding regions; splice site prediction
- Sequence assembly
- Function prediction
- Structure prediction of DNA, RNA, proteins...
- Biomedical image analysis
- Proteomics: protein location prediction; protein-protein interactions
- Text mining: gene/protein annotation; disambiguation
- Systems biology: networks & pathways
- Microarray data analysis
- NGS data analysis

Dimensionality reduction:
Principal Component Analysis

Feature extraction

- Create new features, but fewer than before.
- Typically: project p features onto an m-dimensional space.
- Principal Component Analysis does so
  - linearly,
  - in an unsupervised fashion.

Principal Components Analysis (PCA)

- Goal: Find a low-dimensional space such that information loss is minimized when the data is projected on that space.
- Unsupervised: We're only looking at the data, not at any labels.
- In PCA, we want the variance of the projected data to be maximized.
- Warning! This requires normalizing the data.

[Figure: the same point cloud projected on x1 vs. projected on x2, showing that different directions preserve different amounts of variance.]

Principal Components Analysis (PCA)

- Goal: Find a low-dimensional space such that variance is maximized when the data is projected on that space.
- Assumption: the data is centered, i.e. it has mean 0. If not: subtract the mean, $x \leftarrow x - \bar{x}$.
- What formula gives us the projection z of x on the direction of w (assuming w to be a unit vector)? $z = w^\top x$.

First principal component (PC1)

Goal: find $w_1$ such that $\mathrm{Var}(w_1^\top x)$ is maximal, with $\|w_1\| = 1$.

With the data centered, $\Sigma = \mathrm{Cov}(x) = \mathbb{E}[x x^\top]$, and
$$\mathrm{Var}(w_1^\top x) = \mathbb{E}[(w_1^\top x)^2] = w_1^\top \Sigma w_1.$$

Optimization problem:
$$\max_{w_1} \; w_1^\top \Sigma w_1 \quad \text{s.t.} \quad w_1^\top w_1 = 1.$$

Using Lagrange multipliers: maximize
$$w_1^\top \Sigma w_1 - \alpha\,(w_1^\top w_1 - 1).$$

Take the derivative with respect to $w_1$, set it to 0:
$$2\,\Sigma w_1 - 2\,\alpha\,w_1 = 0, \quad \text{hence} \quad \Sigma w_1 = \alpha\,w_1.$$

What does this tell us about $\alpha$ and $w_1$ w.r.t. the matrix $\Sigma$?
- $\alpha$ is an eigenvalue of $\Sigma$;
- $w_1$ is an eigenvector of $\Sigma$.

Plugging back in: $w_1^\top \Sigma w_1 = \alpha\,w_1^\top w_1 = \alpha$.

$w_1$ is the eigenvector of $\Sigma$ with the largest eigenvalue.

Second principal component (PC2)

- The second principal component is given by the eigenvector of $\Sigma$ with the second largest eigenvalue.
- It is orthogonal to the first PC.
- And so on and so forth for all other PCs.

Singular value decomposition

SVD of the $n \times p$ data matrix $X$:
$$X = U D V^\top,$$
with $U$ ($n \times n$) orthogonal, $D$ ($n \times p$) diagonal with non-negative entries, and $V$ ($p \times p$) orthogonal.

Eigendecomposition of $X^\top X$ ($p \times p$):
$$X^\top X = V \Lambda V^\top,$$
with $\Lambda$ ($p \times p$) diagonal (eigenvalues) and $V$ ($p \times p$) whose $j$-th column is the $j$-th eigenvector.

- The singular values of $X$ are the square roots of the eigenvalues of $X^\top X$.
- The right singular vectors of $X$ are the eigenvectors of $X^\top X$.
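The two relations above can be checked numerically. The following sketch (not from the slides; random data, NumPy only) compares the SVD of a centered matrix with the eigendecomposition of X^T X:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                      # center the data

# SVD of X: X = U @ diag(s) @ Vt, singular values s in decreasing order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of X^T X (eigh returns ascending eigenvalues)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]        # sort in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Singular values of X = square roots of the eigenvalues of X^T X
assert np.allclose(s, np.sqrt(eigvals))

# Right singular vectors of X = eigenvectors of X^T X (up to sign)
assert np.allclose(np.abs(Vt), np.abs(eigvecs.T))
```

The `abs` in the last check accounts for the sign indeterminacy of eigenvectors.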

What PCA does

- $W$: $p \times m$ matrix of the $m$ leading eigenvectors of $\Sigma$.
- $m$ first PCs: $z = W^\top x$, i.e. $Z = X W$ for the whole data matrix.
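As a minimal sketch of this procedure (assuming NumPy; `pca` is a hypothetical helper name, not from the slides):

```python
import numpy as np

def pca(X, m):
    """Project centered data onto the m leading eigenvectors of the covariance."""
    Xc = X - X.mean(axis=0)                  # PCA assumes centered data
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    W = eigvecs[:, ::-1][:, :m]              # m leading eigenvectors (p x m)
    return Xc @ W                            # n x m matrix of projected data

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Z = pca(X, m=2)                              # Z has shape (50, 2)
```

Up to the sign of each component, this matches `sklearn.decomposition.PCA(n_components=m).fit_transform(X)`.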

How to choose m

Percentage of variance explained:
- Total variance in the data $= \mathrm{Tr}(\Sigma) = \sum_{j=1}^{p} \lambda_j$.
- The first $m$ PCs account for $\dfrac{\sum_{j=1}^{m} \lambda_j}{\sum_{j=1}^{p} \lambda_j}$ of the total variance.

[Scree graph: variance explained vs. number of PCs.]

How to choose m

- Pick enough components to explain a fixed percentage of the variance.
- Find the elbow (where adding more components doesn't really help much).

[Scree graph: variance explained vs. number of PCs.]

Scree graph (example)

Optdigits dataset of the UCI repository [Ethem Alpaydin]

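Picking m from the explained-variance ratio can be sketched as follows (a NumPy-only illustration on synthetic data; `explained_variance_ratio` is a hypothetical helper, mirroring scikit-learn's `explained_variance_ratio_` attribute):

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(Xc))[::-1]  # decreasing
    return eigvals / eigvals.sum()

rng = np.random.default_rng(2)
# 6 features with very different scales, so few PCs dominate
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
# smallest m such that the first m PCs explain at least 90% of the variance
m = int(np.searchsorted(cumulative, 0.9)) + 1
```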

PCA example: Population genetics

Genetic data of 1387 Europeans. [Novembre et al, 2008]

Evaluating a supervised machine learning model

Classification

- Training set: $\mathcal{D} = \{(x^i, y^i)\}_{i=1,\dots,n}$, with $y^i \in \{-1, +1\}$.
- Goal: Find $f$ such that $f(x^i) \approx y^i$.
- Empirical error of $f$ on the training set:
$$E(f; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[f(x^i) \neq y^i\right].$$
- Choose $f$ so as to fit the training set well: small empirical error.

Overfitting (classification)

- Which of the black and purple classifiers has the largest empirical error?
- Which model seems more likely to be correct?

Regression

- Training set: $\mathcal{D} = \{(x^i, y^i)\}_{i=1,\dots,n}$, with $y^i \in \mathbb{R}$.
- Goal: Find $f$ such that $f(x^i) \approx y^i$.
- Empirical error of $f$ on the training set (mean squared error):
$$E(f; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \left(y^i - f(x^i)\right)^2.$$
- Choose $f$ so as to fit the training set well.

Overfitting (regression)

Overfitting and model complexity

[Figure: prediction error vs. model complexity; the error on training data keeps decreasing while the error on new data eventually increases.]

Validation sets

- Choose the model that performs best on a validation set separate from the training set.

[Figure: the data split into a Training part and a Validation part.]

Cross-validation

- Cut the training set in k separate folds.
- For each fold, train on the (k-1) remaining folds and validate on the held-out fold.

[Figure: k splits of the data, each using a different fold for validation and the rest for training.]
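The fold construction can be sketched in a few lines of NumPy (`kfold_indices` is a hypothetical helper, not from the slides):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# Each sample is used for validation exactly once across the k folds.
for train, val in kfold_indices(n=10, k=5):
    assert len(train) + len(val) == 10
    assert set(train).isdisjoint(val)
```

In practice, `sklearn.model_selection.KFold` (or `StratifiedKFold` for classification) provides the same splits.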

Classification model evaluation

Confusion matrix:

                           True class
                        -1               +1
Predicted   -1   True Negatives    False Negatives
class       +1   False Positives   True Positives

- False positives (false alarms) are also called type I errors.
- False negatives (misses) are also called type II errors.

- Sensitivity = Recall = True positive rate (TPR) = TP / (TP + FN)
- Specificity = True negative rate (TNR) = TN / (TN + FP)
- Precision = Positive predictive value (PPV) = TP / (TP + FP)
- False discovery rate (FDR) = FP / (TP + FP) = 1 - PPV

- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- F1-score = harmonic mean of precision and sensitivity = 2 · precision · recall / (precision + recall)

Example: Pap smear

- 4,000 apparently healthy women of age 40+
- Tested for cervical cancer through pap smear and histology (gold standard)

                 Cancer   No cancer   Total
Positive test      190         210      400
Negative test       10        3590     3600
Total              200        3800     4000

What are the sensitivity, specificity, and PPV of the test?

In this population:
- Sensitivity = 190 / 200 = 95.0 %
- Specificity = 3590 / 3800 = 94.5 %
- PPV = 190 / 400 = 47.5 %
- Prevalence of the disease = 200 / 4000 = 0.05
- P(cancer | positive test) = PPV = 47.5 %
- P(no cancer | negative test) = 3590 / 3600 = 99.7 %

Poor diagnosis tool; good screening tool.
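These numbers follow directly from the confusion-matrix formulas above; a quick check in Python using the pap-smear counts:

```python
# Counts from the pap-smear example table
tp, fp, fn, tn = 190, 210, 10, 3590

sensitivity = tp / (tp + fn)   # recall / TPR
specificity = tn / (tn + fp)   # TNR
ppv         = tp / (tp + fp)   # precision
npv         = tn / (tn + fn)   # P(no cancer | negative test)

print(round(sensitivity, 3))   # 0.95
print(round(specificity, 3))   # 0.945
print(round(ppv, 3))           # 0.475
print(round(npv, 3))           # 0.997
```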

ROC curves

Receiver-Operator Characteristic: plot the true positive rate against the false positive rate as the decision threshold varies. Summarized by the area under the curve (AUROC).

- Perfect classifier: AUROC = 1.0
- Random classifier: AUROC = 0.5 (the diagonal)
- Our classifier: 0.5 < AUROC < 1.0
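The AUROC can also be read as the probability that a randomly drawn positive is scored above a randomly drawn negative, which gives a short NumPy-only sketch (`auroc` is a hypothetical helper, not from the slides):

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC = probability a random positive is scored above a random negative."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score; ties count 1/2.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8])
print(auroc(y, s))   # 0.75
```

This agrees with `sklearn.metrics.roc_auc_score(y, s)`.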

Predicting breast cancer risk based on mammography images, SNPs, or both.
Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings. 876-885.

[Figure: ROC curves for the three methods; the specificity axis satisfies specificity = 1 - FPR.]

- Which method outperforms the others?
- Is a low FPR or high TPR preferable in a clinical setting?
- High recall = fewer chances to miss a case.
- High specificity / low FPR = fewer false alarms.

Precision-Recall curves

Plot precision (= positive predictive value, PPV) against recall (= sensitivity = TPR) as the decision threshold varies.

[Figure: precision vs. recall; the top-right corner (precision = recall = 1) is the good corner, the bottom-left the bad one.]

Predicting breast cancer risk based on mammography images, SNPs, or both.
Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings. 876-885.

[Figure: precision-recall curves for the three methods.]

- High recall = fewer chances to miss a case.
- High precision = substantially more true diagnoses than false alarms.

Logistic & Linear Regression

Linear regression

Least-squares fit

Minimize the sum of squared residuals:
$$\min_{\beta} \; \|y - X\beta\|_2^2.$$

Assuming $X$ has full rank (and hence $X^\top X$ positive definite):
$$\hat{\beta} = (X^\top X)^{-1} X^\top y.$$
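A minimal numerical sketch of the least-squares solution (synthetic data, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 3))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=80)   # small Gaussian noise

# Normal equations (valid when X^T X is invertible)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq is the numerically preferred route
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_hat, beta_lstsq)
assert np.allclose(beta_hat, beta_true, atol=0.05)
```

Explicitly inverting $X^\top X$ is avoided in practice; `solve` or `lstsq` are more stable.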

Correlated variables

If the variables are decorrelated:
- Each coefficient can be estimated separately;
- Interpretation is easy: a change of 1 in $x_j$ is associated with a change of $\beta_j$ in $y$, while everything else stays the same.

Correlations between variables cause problems:
- The variance of all coefficients tends to increase;
- Interpretation is much harder: when $x_j$ changes, so does everything else.

What about classification?

- Idea: Model Pr(Y=1|X) as a linear function and use a linear regression.
- Why might this be a problem?
  - Pr(Y=1|X) must be between 0 and 1.
  - Non-linearity: if y is close to +1 or -1, x must change a lot for y to change; if y is close to 0, that's not the case.
- Hence: use a logit transformation:
$$\log \frac{\Pr(Y=1 \mid X=x)}{1 - \Pr(Y=1 \mid X=x)} = \beta^\top x.$$
- This is logistic regression.
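Under the logit model, Pr(Y=1|X=x) = sigmoid(βᵀx), and β can be fit by gradient descent on the negative log-likelihood. A small sketch on toy data (`fit_logistic` is a hypothetical helper, not from the slides):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Logistic regression by gradient descent on the negative log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                  # Pr(Y=1|X) under current beta
        beta -= lr * X.T @ (p - y) / len(y)    # gradient step
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)      # linearly separable toy labels
beta = fit_logistic(X, y)
acc = np.mean((sigmoid(X @ beta) > 0.5) == y)  # training accuracy
```

In practice one would use `sklearn.linear_model.LogisticRegression`, which also adds regularization by default.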

Example: Endometrium vs Uterus tumor classification

- 61 endometrium tumor samples, 124 uterus tumor samples
- 54,675 genes
- 10-fold cross-validation

Feature selection

Large p, small n

Genetics and genomics:
- thousands of genes, millions of SNPs
- usually, at best thousands of patients

Filter approach

Assign a score to each feature:
- Correlation to the label
- Statistical test of association
- Mutual information criterion

Only keep features whose score is above a chosen threshold.
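A minimal filter using absolute Pearson correlation to the label, keeping the top-k features (a sketch on synthetic data; `filter_select` is a hypothetical helper):

```python
import numpy as np

def filter_select(X, y, k):
    """Return the indices of the k features most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(scores)[::-1][:k]      # indices of the k highest scores

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 20))
y = X[:, 3] + 2 * X[:, 7] + 0.1 * rng.normal(size=500)  # only features 3 and 7 matter
selected = filter_select(X, y, k=2)
```

`sklearn.feature_selection.SelectKBest` implements the same idea with a choice of scoring functions (F-test, mutual information, ...).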

Linear regression

Least-squares fit (equivalent to MLE under the assumption of Gaussian noise):
$$\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|_2^2.$$

The solution is uniquely defined when $n > p$ and $X^\top X$ is invertible.

When $X^\top X$ is not invertible

- Pseudo-inverse
- Linear system of p equations: $X^\top X \beta = X^\top y$
- Numerical methods:
  - Gaussian elimination
  - LU decomposition
  - Gradient descent on the cost $J(\beta)$

Linear regression when p >> n

Simulated data: p=1000, n=100, 10 causal features.

[Figure: true coefficients vs. predicted coefficients.]

Example: Endometrium vs Uterus tumor classification

- 61 endometrium tumor samples, 124 uterus tumor samples
- 54,675 genes
- 10-fold cross-validation

Regularization

Minimize: SSE + $\lambda \times$ penalty on model complexity.

- $\lambda$ can be set by cross-validation.
- Simpler model: fewer parameters.
- Shrinkage: drive the coefficients of the parameters towards 0.
- Sparsity: many coefficients get a weight of 0 and can be eliminated from the model.

Lasso

L1 penalization:
$$\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$$

Equivalent to the constrained problem:
$$\min_{\beta} \; \|y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \leq t.$$

The L1 penalty shrinks the coefficients to zero.

[Figure: iso-contours of the least-squares error around the unconstrained minimum (the least-squares solution), and the diamond-shaped feasible region $|\beta_1| + |\beta_2| \leq t$; the constrained optimum tends to land on a corner of the diamond, where some coefficients are exactly 0.]

[Figure: Lasso solution path — feature coefficients as a function of decreasing $\lambda$.]
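The lasso has no closed-form solution, but cyclic coordinate descent with soft-thresholding solves it; a compact sketch on synthetic data (`lasso_cd` is a hypothetical helper, not the slides' implementation):

```python
import numpy as np

def soft_threshold(a, b):
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent on (1/2)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # residual with feature j's current contribution removed
            resid = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ resid, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10); beta_true[[0, 4]] = [3.0, -2.0]
y = X @ beta_true + 0.1 * rng.normal(size=100)
beta = lasso_cd(X, y, lam=20.0)
# With a large enough lambda, the non-causal coefficients are driven exactly to 0.
```

`sklearn.linear_model.Lasso` does the same with a 1/(2n) scaling of the squared error, so its `alpha` differs from `lam` by a factor of n.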

Example: Endometrium vs Uterus tumor classification

- 61 endometrium tumor samples, 124 uterus tumor samples
- 54,675 genes
- 10-fold cross-validation
- Parameter $\lambda$ set by inner cross-validation to maximize accuracy.
- 63 genes with non-zero weights.

Toolboxes

- Python: scikit-learn
  http://scikit-learn.org
- R: Machine Learning Task View
  http://cran.r-project.org/web/views/MachineLearning.html
- Matlab: Machine Learning with MATLAB
  http://fr.mathworks.com/machine-learning/index.html
  - Statistics and Machine Learning Toolbox
  - Neural Network Toolbox

Summary

- Machine learning plays an important role in bioinformatics:
  - predictive models (classification / regression);
  - knowledge extraction, in particular feature selection.
- One must be wary of overfitting and exert caution when evaluating a machine learning model.
- Example of unsupervised learning: PCA.
- Examples of classification & regression algorithms:
  - linear regression and logistic regression;
  - lasso.

References

- Textbook: Hastie and Tibshirani (2009). Elements of Statistical Learning.
  http://statweb.stanford.edu/~tibs/ElemStatLearn
- Textbook: Schölkopf, Tsuda and Vert (2005). Kernel Methods in Computational Biology.
- Review article: M. W. Libbrecht and W. S. Noble (2015). Machine learning applications in genetics and genomics. Nature Reviews Genetics.
- Tutorial on PCA (J. Shlens):
  https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf

To go further...

- Apprentissage artificiel (machine learning), S1324 (semester 4)
- Optimisation, S1734 (semester 4)
