
Machine Learning for Bioinformatics
S 1133 Fall 2016
Chloé-Agathe Azencott

Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr

Learning objectives

- Determine whether a problem can be solved by machine learning.
- Formulate a problem in machine learning terms.
- Give and motivate examples of applications of machine learning in bioinformatics.
- Pick an appropriate way to evaluate a machine learning algorithm for classification/regression.
- Know how to solve specific machine learning problems:
  - Dimensionality reduction with PCA
  - (Classification and) regression with linear regression
  - Feature selection with: filter approaches; regularized linear regression
  - Classification (and regression) with decision trees and random forests [22 sept]
  - Classification (and regression) with SVMs [22 sept]
  - Clustering with: hierarchical clustering; k-means [22 sept]
  - Recovering a series of states from a series of observations with HMMs [17 oct, Thomas Walter]

http://cazencott.info/index.php/pages/Teaching

Some bioinformatics questions

- What is the function of a gene/protein?
- What is the binding strength between a protein and a ligand?
- Which genes are involved in specific diseases?
- Which proteins interact with which other proteins?
- What is the yield of a crop?

What is (Machine) Learning?

Why Learn?

- Learning: Modifying a behavior based on experience. [F. Benureau]
- Machine learning: Programming computers to optimize a performance criterion using example data.
- There is no need to learn to apply existing algorithms.
- Machine learning is used (among other things) when human expertise does not exist.

What we talk about when we talk about learning

- Learning general models from particular examples (data).
- Data is (mostly) cheap and abundant; knowledge is expensive and scarce.
- Example in retail: from customer transactions to consumer behavior. "People who bought Game of Thrones also bought Lord of the Rings" [amazon.com]
- Goal: Build a model that is a good and useful approximation to the data.

What is machine learning?

- Optimizing a performance criterion using example data or past experience.
- Role of Statistics: Inference from a sample.
- Role of Computer Science: Efficient algorithms to
  - solve the optimization problem;
  - represent and evaluate the model for inference.

Classes of machine learning problems

- Supervised learning: Predict outcome from features
  - Classification
  - Regression
  - Ranking
- Unsupervised learning: Find patterns in the data
  - Dimensionality reduction
  - Clustering
- Semi-supervised learning: Predict outcome for unlabeled but known instances.
- Reinforcement learning: Maximize cumulative reward.
- Outlier detection: Detect anomalies.

Classification

Cancer diagnosis
- Given the expression levels of 20k genes, is the leukemia acute lymphocytic (ALL) or myeloid (AML)?

Cancer prognosis
- Given the expression levels of 20k genes in a tumor after surgery, is the cancer likely to relapse?

Pharmacogenomics
- Given the genome of a patient, which treatment should we give?

Protein annotation
- Given the sequence of a protein, is it secreted or not?

Drug discovery
- Is a new candidate molecule likely to be active?

Regression

[Image from Yunyoungmok on wikimedia.commons]

Binding affinity
- Predict the binding affinity between a ligand and a protein. [Image: Surflex-QMOD]

Age of onset
- Predict the age of onset of a disease from clinical and genetic variables. [Image: schizophrenia.com]

Solubility
- Predict the solubility in octanol / water of a chemical compound. [Image: kreatis.eu]

Crop yield
- Predict the yield of a crop from its genome. [Image: Dehaan / wikimedia.commons]

Ranking

- Find a function that orders a given list of items of X.
- As a classification problem: pairwise approach. Is {x1, x2} correctly ordered?
- As a regression problem: pointwise approach.
- Listwise approach: directly optimize for the best list.

Ranking drug candidates
- Rank chemical compounds according to how likely they are to cure a disease.

Ranking conformers
- Select the best 3D structure(s) for a chemical among those generated by various methods.

Knowledge extraction

Supervised learning models can be used to
- make predictions about new instances;
- extract knowledge: interpretation.

Feature selection

- Which of the predictive variables are incorporated into the model? How important are they?
  - Filter approaches
  - Wrapper approaches
  - Embedded approaches

Dimensionality reduction

Reduce the number of input variables:
- Feature selection
- Feature extraction: transform the data into a space of lower dimension.

Goals of dimensionality reduction:
- Reduce storage space & computational time
- Remove collinearities
- Visualization (in 2 or 3 dimensions) and interpretation.

Clustering

- Goal: Group objects into clusters, i.e. classes that were unknown beforehand. Objects in the same cluster are more similar to each other than to objects in a different cluster.
- Motivation:
  - Understanding general characteristics of the data;
  - Visualization;
  - Inferring some properties of an object based on how it relates to other objects.
- 22/09/2016: Clustering gene expression data.

Applications of ML to bioinformatics

- Comparative genomics: build phylogenetic trees; compare & infer gene functions
- Motif identification: TF binding sites; promoter binding sites; operons
- Gene finding: identify coding regions; splice site prediction
- Sequence assembly
- Function prediction
- Structure prediction of DNA, RNA, proteins...
- Biomedical image analysis
- Proteomics: protein location prediction; protein-protein interactions
- Text mining: gene/protein annotation; disambiguation
- Systems biology: networks & pathways
- Microarray data analysis
- NGS data analysis

Dimensionality reduction:
Principal Component Analysis

Feature extraction

- Create new features, but fewer than before.
- Typically: project p features onto an m-dimensional space.
- Principal Component Analysis does so
  - linearly,
  - in an unsupervised fashion.

Principal Components Analysis (PCA)

- Goal: Find a low-dimensional space such that information loss is minimized when the data is projected on that space.
- Unsupervised: We're only looking at the data, not at any labels.
- In PCA, we want the variance of the projected data to be maximized.
- Warning! This requires normalizing the data.

[Figure: the same point cloud projected on x1 vs. projected on x2, showing that different directions preserve different amounts of variance.]

Principal Components Analysis (PCA)

- Goal: Find a low-dimensional space such that variance is maximized when the data is projected on that space.
- Assumption: the data is centered, i.e. it has mean 0. If not: subtract the mean, $x \leftarrow x - \bar{x}$.
- What formula gives us the projection z of x on the direction of w (assuming w to be a unit vector)? $z = w^\top x$.

First principal component (PC1)

Goal: find $w_1$ such that $\mathrm{Var}(w_1^\top x)$ is maximal, with $\|w_1\| = 1$.

With the data centered, $\Sigma = \mathrm{Cov}(x) = \mathbb{E}[x x^\top]$, and
$$\mathrm{Var}(w_1^\top x) = \mathbb{E}[(w_1^\top x)^2] = w_1^\top \Sigma w_1.$$

Optimization problem:
$$\max_{w_1} \; w_1^\top \Sigma w_1 \quad \text{s.t.} \quad w_1^\top w_1 = 1.$$

Using Lagrange multipliers: maximize
$$w_1^\top \Sigma w_1 - \alpha\,(w_1^\top w_1 - 1).$$

Take the derivative with respect to $w_1$, set it to 0:
$$2\,\Sigma w_1 - 2\,\alpha\,w_1 = 0, \quad \text{hence} \quad \Sigma w_1 = \alpha\,w_1.$$

What does this tell us about $\alpha$ and $w_1$ w.r.t. the matrix $\Sigma$?
- $\alpha$ is an eigenvalue of $\Sigma$;
- $w_1$ is an eigenvector of $\Sigma$.

Plugging back in: $w_1^\top \Sigma w_1 = \alpha\,w_1^\top w_1 = \alpha$.

$w_1$ is the eigenvector of $\Sigma$ with the largest eigenvalue.

Second principal component (PC2)

- The second principal component is given by the eigenvector of $\Sigma$ with the second largest eigenvalue.
- It is orthogonal to the first PC.
- And so on and so forth for all other PCs.

Singular value decomposition

SVD of the $n \times p$ data matrix $X$:
$$X = U D V^\top,$$
with $U$ ($n \times n$) orthogonal, $D$ ($n \times p$) diagonal with non-negative entries, and $V$ ($p \times p$) orthogonal.

Eigendecomposition of $X^\top X$ ($p \times p$):
$$X^\top X = V \Lambda V^\top,$$
with $\Lambda$ ($p \times p$) diagonal (eigenvalues) and $V$ ($p \times p$) whose $j$-th column is the $j$-th eigenvector.

- The singular values of $X$ are the square roots of the eigenvalues of $X^\top X$.
- The right singular vectors of $X$ are the eigenvectors of $X^\top X$.
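The two relations above can be checked numerically. The following sketch (not from the slides; random data, NumPy only) compares the SVD of a centered matrix with the eigendecomposition of X^T X:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                      # center the data

# SVD of X: X = U @ diag(s) @ Vt, singular values s in decreasing order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of X^T X (eigh returns ascending eigenvalues)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]        # sort in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Singular values of X = square roots of the eigenvalues of X^T X
assert np.allclose(s, np.sqrt(eigvals))

# Right singular vectors of X = eigenvectors of X^T X (up to sign)
assert np.allclose(np.abs(Vt), np.abs(eigvecs.T))
```

The `abs` in the last check accounts for the sign indeterminacy of eigenvectors.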

What PCA does

- $W$: $p \times m$ matrix of the $m$ leading eigenvectors of $\Sigma$.
- $m$ first PCs: $z = W^\top x$, i.e. $Z = X W$ for the whole data matrix.
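As a minimal sketch of this procedure (assuming NumPy; `pca` is a hypothetical helper name, not from the slides):

```python
import numpy as np

def pca(X, m):
    """Project centered data onto the m leading eigenvectors of the covariance."""
    Xc = X - X.mean(axis=0)                  # PCA assumes centered data
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    W = eigvecs[:, ::-1][:, :m]              # m leading eigenvectors (p x m)
    return Xc @ W                            # n x m matrix of projected data

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Z = pca(X, m=2)                              # Z has shape (50, 2)
```

Up to the sign of each component, this matches `sklearn.decomposition.PCA(n_components=m).fit_transform(X)`.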

How to choose m

Percentage of variance explained:
- Total variance in the data $= \mathrm{Tr}(\Sigma) = \sum_{j=1}^{p} \lambda_j$.
- The first $m$ PCs account for $\dfrac{\sum_{j=1}^{m} \lambda_j}{\sum_{j=1}^{p} \lambda_j}$ of the total variance.

[Scree graph: variance explained vs. number of PCs.]

How to choose m

- Pick enough components to explain a fixed percentage of the variance.
- Find the elbow (where adding more components doesn't really help much).

[Scree graph: variance explained vs. number of PCs.]

Scree graph (example)

Optdigits dataset of the UCI repository [Ethem Alpaydin]

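Picking m from the explained-variance ratio can be sketched as follows (a NumPy-only illustration on synthetic data; `explained_variance_ratio` is a hypothetical helper, mirroring scikit-learn's `explained_variance_ratio_` attribute):

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(Xc))[::-1]  # decreasing
    return eigvals / eigvals.sum()

rng = np.random.default_rng(2)
# 6 features with very different scales, so few PCs dominate
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
# smallest m such that the first m PCs explain at least 90% of the variance
m = int(np.searchsorted(cumulative, 0.9)) + 1
```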

PCA example: Population genetics

Genetic data of 1387 Europeans. [Novembre et al, 2008]

Evaluating a supervised machine learning model

Classification

- Training set: $\mathcal{D} = \{(x^i, y^i)\}_{i=1,\dots,n}$, with $y^i \in \{-1, +1\}$.
- Goal: Find $f$ such that $f(x^i) \approx y^i$.
- Empirical error of $f$ on the training set:
$$E(f; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[f(x^i) \neq y^i\right].$$
- Choose $f$ so as to fit the training set well: small empirical error.

Overfitting (classification)

- Which of the black and purple classifiers has the largest empirical error?
- Which model seems more likely to be correct?

Regression

- Training set: $\mathcal{D} = \{(x^i, y^i)\}_{i=1,\dots,n}$, with $y^i \in \mathbb{R}$.
- Goal: Find $f$ such that $f(x^i) \approx y^i$.
- Empirical error of $f$ on the training set (mean squared error):
$$E(f; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \left(y^i - f(x^i)\right)^2.$$
- Choose $f$ so as to fit the training set well.

Overfitting (regression)

Overfitting and model complexity

[Figure: prediction error vs. model complexity; the error on training data keeps decreasing while the error on new data eventually increases.]

Validation sets

- Choose the model that performs best on a validation set separate from the training set.

[Figure: the data split into a Training part and a Validation part.]

Cross-validation

- Cut the training set in k separate folds.
- For each fold, train on the (k-1) remaining folds and validate on the held-out fold.

[Figure: k splits of the data, each using a different fold for validation and the rest for training.]
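The fold construction can be sketched in a few lines of NumPy (`kfold_indices` is a hypothetical helper, not from the slides):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# Each sample is used for validation exactly once across the k folds.
for train, val in kfold_indices(n=10, k=5):
    assert len(train) + len(val) == 10
    assert set(train).isdisjoint(val)
```

In practice, `sklearn.model_selection.KFold` (or `StratifiedKFold` for classification) provides the same splits.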

Classification model evaluation

Confusion matrix:

                           True class
                        -1               +1
Predicted   -1   True Negatives    False Negatives
class       +1   False Positives   True Positives

- False positives (false alarms) are also called type I errors.
- False negatives (misses) are also called type II errors.

- Sensitivity = Recall = True positive rate (TPR) = TP / (TP + FN)
- Specificity = True negative rate (TNR) = TN / (TN + FP)
- Precision = Positive predictive value (PPV) = TP / (TP + FP)
- False discovery rate (FDR) = FP / (TP + FP) = 1 - PPV

- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- F1-score = harmonic mean of precision and sensitivity = 2 · precision · recall / (precision + recall)

Example: Pap smear

- 4,000 apparently healthy women of age 40+
- Tested for cervical cancer through pap smear and histology (gold standard)

                 Cancer   No cancer   Total
Positive test      190         210      400
Negative test       10        3590     3600
Total              200        3800     4000

What are the sensitivity, specificity, and PPV of the test?

In this population:
- Sensitivity = 190 / 200 = 95.0 %
- Specificity = 3590 / 3800 = 94.5 %
- PPV = 190 / 400 = 47.5 %
- Prevalence of the disease = 200 / 4000 = 0.05
- P(cancer | positive test) = PPV = 47.5 %
- P(no cancer | negative test) = 3590 / 3600 = 99.7 %

Poor diagnosis tool; good screening tool.
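These numbers follow directly from the confusion-matrix formulas above; a quick check in Python using the pap-smear counts:

```python
# Counts from the pap-smear example table
tp, fp, fn, tn = 190, 210, 10, 3590

sensitivity = tp / (tp + fn)   # recall / TPR
specificity = tn / (tn + fp)   # TNR
ppv         = tp / (tp + fp)   # precision
npv         = tn / (tn + fn)   # P(no cancer | negative test)

print(round(sensitivity, 3))   # 0.95
print(round(specificity, 3))   # 0.945
print(round(ppv, 3))           # 0.475
print(round(npv, 3))           # 0.997
```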

ROC curves

Receiver-Operator Characteristic: plot the true positive rate against the false positive rate as the decision threshold varies. Summarized by the area under the curve (AUROC).

- Perfect classifier: AUROC = 1.0
- Random classifier: AUROC = 0.5 (the diagonal)
- Our classifier: 0.5 < AUROC < 1.0
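The AUROC can also be read as the probability that a randomly drawn positive is scored above a randomly drawn negative, which gives a short NumPy-only sketch (`auroc` is a hypothetical helper, not from the slides):

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC = probability a random positive is scored above a random negative."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score; ties count 1/2.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8])
print(auroc(y, s))   # 0.75
```

This agrees with `sklearn.metrics.roc_auc_score(y, s)`.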

Predicting breast cancer risk based on mammography images, SNPs, or both.
Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings. 876-885.

[Figure: ROC curves for the three methods; the specificity axis satisfies specificity = 1 - FPR.]

- Which method outperforms the others?
- Is a low FPR or high TPR preferable in a clinical setting?
- High recall = fewer chances to miss a case.
- High specificity / low FPR = fewer false alarms.

Precision-Recall curves

Plot precision (= positive predictive value, PPV) against recall (= sensitivity = TPR) as the decision threshold varies.

[Figure: precision vs. recall; the top-right corner (precision = recall = 1) is the good corner, the bottom-left the bad one.]

Predicting breast cancer risk based on mammography images, SNPs, or both.
Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings. 876-885.

[Figure: precision-recall curves for the three methods.]

- High recall = fewer chances to miss a case.
- High precision = substantially more true diagnoses than false alarms.

Logistic & Linear Regression

Linear regression

Least-squares fit

Minimize the sum of squared residuals:
$$\min_{\beta} \; \|y - X\beta\|_2^2.$$

Assuming $X$ has full rank (and hence $X^\top X$ positive definite):
$$\hat{\beta} = (X^\top X)^{-1} X^\top y.$$
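A minimal numerical sketch of the least-squares solution (synthetic data, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 3))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=80)   # small Gaussian noise

# Normal equations (valid when X^T X is invertible)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq is the numerically preferred route
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_hat, beta_lstsq)
assert np.allclose(beta_hat, beta_true, atol=0.05)
```

Explicitly inverting $X^\top X$ is avoided in practice; `solve` or `lstsq` are more stable.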

Correlated variables

If the variables are decorrelated:
- Each coefficient can be estimated separately;
- Interpretation is easy: a change of 1 in $x_j$ is associated with a change of $\beta_j$ in $y$, while everything else stays the same.

Correlations between variables cause problems:
- The variance of all coefficients tends to increase;
- Interpretation is much harder: when $x_j$ changes, so does everything else.

What about classification?

- Idea: Model Pr(Y=1|X) as a linear function and use a linear regression.
- Why might this be a problem?
  - Pr(Y=1|X) must be between 0 and 1.
  - Non-linearity: if y is close to +1 or -1, x must change a lot for y to change; if y is close to 0, that's not the case.
- Hence: use a logit transformation:
$$\log \frac{\Pr(Y=1 \mid X=x)}{1 - \Pr(Y=1 \mid X=x)} = \beta^\top x.$$
- This is logistic regression.
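Under the logit model, Pr(Y=1|X=x) = sigmoid(βᵀx), and β can be fit by gradient descent on the negative log-likelihood. A small sketch on toy data (`fit_logistic` is a hypothetical helper, not from the slides):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Logistic regression by gradient descent on the negative log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                  # Pr(Y=1|X) under current beta
        beta -= lr * X.T @ (p - y) / len(y)    # gradient step
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)      # linearly separable toy labels
beta = fit_logistic(X, y)
acc = np.mean((sigmoid(X @ beta) > 0.5) == y)  # training accuracy
```

In practice one would use `sklearn.linear_model.LogisticRegression`, which also adds regularization by default.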

Example: Endometrium vs Uterus tumor classification

- 61 endometrium tumor samples, 124 uterus tumor samples
- 54,675 genes
- 10-fold cross-validation

Feature selection

Large p, small n

Genetics and genomics:
- thousands of genes, millions of SNPs
- usually, at best thousands of patients

Filter approach

Assign a score to each feature:
- Correlation to the label
- Statistical test of association
- Mutual information criterion

Only keep features whose score is above a chosen threshold.
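A minimal filter using absolute Pearson correlation to the label, keeping the top-k features (a sketch on synthetic data; `filter_select` is a hypothetical helper):

```python
import numpy as np

def filter_select(X, y, k):
    """Return the indices of the k features most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(scores)[::-1][:k]      # indices of the k highest scores

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 20))
y = X[:, 3] + 2 * X[:, 7] + 0.1 * rng.normal(size=500)  # only features 3 and 7 matter
selected = filter_select(X, y, k=2)
```

`sklearn.feature_selection.SelectKBest` implements the same idea with a choice of scoring functions (F-test, mutual information, ...).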

Linear regression

Least-squares fit (equivalent to MLE under the assumption of Gaussian noise):
$$\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|_2^2.$$

The solution is uniquely defined when $n > p$ and $X^\top X$ is invertible.

When $X^\top X$ is not invertible

- Pseudo-inverse
- Linear system of p equations: $X^\top X \beta = X^\top y$
- Numerical methods:
  - Gaussian elimination
  - LU decomposition
  - Gradient descent on the cost $J(\beta)$

Linear regression when p >> n

Simulated data: p=1000, n=100, 10 causal features.

[Figure: true coefficients vs. predicted coefficients.]

Example: Endometrium vs Uterus tumor classification

- 61 endometrium tumor samples, 124 uterus tumor samples
- 54,675 genes
- 10-fold cross-validation

Regularization

Minimize: SSE + $\lambda \times$ penalty on model complexity.

- $\lambda$ can be set by cross-validation.
- Simpler model: fewer parameters.
- Shrinkage: drive the coefficients of the parameters towards 0.
- Sparsity: many coefficients get a weight of 0 and can be eliminated from the model.

Lasso

L1 penalization:
$$\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$$

Equivalent to the constrained problem:
$$\min_{\beta} \; \|y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \leq t.$$

The L1 penalty shrinks the coefficients to zero.

[Figure: iso-contours of the least-squares error around the unconstrained minimum (the least-squares solution), and the diamond-shaped feasible region $|\beta_1| + |\beta_2| \leq t$; the constrained optimum tends to land on a corner of the diamond, where some coefficients are exactly 0.]

[Figure: Lasso solution path — feature coefficients as a function of decreasing $\lambda$.]
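The lasso has no closed-form solution, but cyclic coordinate descent with soft-thresholding solves it; a compact sketch on synthetic data (`lasso_cd` is a hypothetical helper, not the slides' implementation):

```python
import numpy as np

def soft_threshold(a, b):
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent on (1/2)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # residual with feature j's current contribution removed
            resid = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ resid, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10); beta_true[[0, 4]] = [3.0, -2.0]
y = X @ beta_true + 0.1 * rng.normal(size=100)
beta = lasso_cd(X, y, lam=20.0)
# With a large enough lambda, the non-causal coefficients are driven exactly to 0.
```

`sklearn.linear_model.Lasso` does the same with a 1/(2n) scaling of the squared error, so its `alpha` differs from `lam` by a factor of n.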

Example: Endometrium vs Uterus tumor classification

- 61 endometrium tumor samples, 124 uterus tumor samples
- 54,675 genes
- 10-fold cross-validation
- Parameter $\lambda$ set by inner cross-validation to maximize accuracy.
- 63 genes with non-zero weights.

Toolboxes

- Python: scikit-learn
  http://scikit-learn.org
- R: Machine Learning Task View
  http://cran.r-project.org/web/views/MachineLearning.html
- Matlab: Machine Learning with MATLAB
  http://fr.mathworks.com/machine-learning/index.html
  - Statistics and Machine Learning Toolbox
  - Neural Network Toolbox

Summary

- Machine learning plays an important role in bioinformatics:
  - predictive models (classification / regression);
  - knowledge extraction, in particular feature selection.
- One must be wary of overfitting and exert caution when evaluating a machine learning model.
- Example of unsupervised learning: PCA.
- Examples of classification & regression algorithms:
  - linear regression and logistic regression;
  - lasso.

References

- Textbook: Hastie and Tibshirani (2009). Elements of Statistical Learning.
  http://statweb.stanford.edu/~tibs/ElemStatLearn
- Textbook: Schölkopf, Tsuda and Vert (2005). Kernel Methods in Computational Biology.
- Review article: M. W. Libbrecht and W. S. Noble (2015). Machine learning applications in genetics and genomics. Nature Reviews Genetics.
- Tutorial on PCA (J. Shlens):
  https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf

To go further...

- Apprentissage artificiel (machine learning), S1324 (semester 4)
- Optimisation, S1734 (semester 4)
