Bioinformatics
S 1133 Fall 2016
Chloé-Agathe Azencott
Learning objectives
Filter approaches
Regularized linear regression
Hierarchical clustering
k-means
http://cazencott.info/index.php/pages/Teaching
Why Learn?
Example in retail:
From customer transactions to consumer behavior
People who bought Game of Thrones also bought
Lord of the Rings [amazon.com]
Classification
Regression
Ranking
Dimensionality reduction
Clustering
Classification
Cancer diagnosis
Cancer prognosis
Pharmacogenomics
Protein annotation
Drug discovery
Regression
Binding affinity
Age of onset
Solubility
Crop yield
Ranking
Ranking conformers
Knowledge extraction
Feature selection
Dimensionality reduction
Reduce the number of input variables
Feature selection
Feature extraction:
Transform the data into a space of lower dimension.
Clustering
Applications of ML to bioinformatics
Comparative genomics
TF binding sites
promoter binding sites
operons
Sequence assembly
Function prediction
Biomedical image analysis
Proteomics
Motif identification
Structure prediction
of DNA, RNA, proteins...
Gene finding
Text mining
Systems biology
networks & pathways
Dimensionality reduction:
Principal Component Analysis
Feature extraction
Linearly
In an unsupervised fashion.
Projection on x1
Goal: find $w_1 \in \mathbb{R}^p$, with $w_1^\top w_1 = 1$,
such that the variance of the projected data $X w_1$ is maximal.
Optimization problem: $\max_{w_1} \; w_1^\top \Sigma\, w_1$ subject to $w_1^\top w_1 = 1$, where $\Sigma = \mathrm{Cov}(X)$.
Hence $\lambda$ is an eigenvalue of $\Sigma$,
and $w_1$ is an eigenvector of $\Sigma$; since the projected variance equals $\lambda$,
$w_1$ is the eigenvector associated with the largest eigenvalue of $\Sigma$.
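Restated compactly (a sketch of the standard Lagrangian derivation; $\Sigma$ denotes the covariance matrix of the centered data, as on the next slide):

```latex
\begin{align*}
\max_{w_1}\;& w_1^\top \Sigma\, w_1 \quad \text{s.t. } w_1^\top w_1 = 1
  && \text{(maximize the projected variance)}\\
\mathcal{L}(w_1, \lambda) &= w_1^\top \Sigma\, w_1 - \lambda\, (w_1^\top w_1 - 1)
  && \text{(Lagrangian)}\\
\nabla_{w_1} \mathcal{L} = 0 \;&\Longrightarrow\; \Sigma\, w_1 = \lambda\, w_1
  && \text{(eigenvalue equation)}\\
w_1^\top \Sigma\, w_1 &= \lambda
  && \text{(so pick the largest eigenvalue)}
\end{align*}
```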
SVD of $X$ ($n \times p$):
$X = U D V^\top$, with $U$ ($n \times n$) orthogonal, $D$ ($n \times p$) diagonal with non-negative entries, and $V$ ($p \times p$) orthogonal.
Covariance matrix: $\Sigma = \mathrm{Cov}(X) = \frac{1}{n} X^\top X$ ($p \times p$).
Eigendecomposition of $\Sigma$:
$\Sigma = W \Lambda W^\top$, with $\Lambda$ ($p \times p$) diagonal containing the eigenvalues, and $W$ ($p \times p$) whose $j$-th column is the $j$-th eigenvector.
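A minimal numpy sketch of this relationship (the data matrix and its dimensions are placeholders; the covariance is taken as (1/n) XᵀX on centered columns, as above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # placeholder n x p data matrix
X = X - X.mean(axis=0)             # center the columns
n = X.shape[0]

# Thin SVD of X: X = U @ np.diag(d) @ Vt
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of Cov(X) = (1/n) X^T X
cov = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(cov)              # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

# Eigenvalues of Cov(X) are the squared singular values divided by n,
# and the right singular vectors are its eigenvectors (up to sign).
print(np.allclose(d ** 2 / n, eigvals))
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))
```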
How to choose m
Percentage of variance explained by the first $m$ components: $\sum_{j=1}^{m} \lambda_j \,/\, \sum_{j=1}^{p} \lambda_j$ of the total variance.
Scree graph: plot of the variance explained against the number of PCs.
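A scikit-learn sketch of this criterion (the 90 % threshold is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))     # placeholder data

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)   # cumulative share of variance
m = int(np.searchsorted(cum_var, 0.90) + 1)           # smallest m reaching 90 %
print(m, cum_var[m - 1])
```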
Evaluating a supervised machine learning model
Classification
Training set: $\{(x^i, y^i)\}_{i=1,\dots,n}$ with $x^i \in \mathbb{R}^p$ and $y^i \in \{-1, +1\}$.
Overfitting (classification)
Regression
Training set: $\{(x^i, y^i)\}_{i=1,\dots,n}$ with $x^i \in \mathbb{R}^p$ and $y^i \in \mathbb{R}$.
Overfitting (regression)
Prediction error as a function of model complexity: the error on training data keeps decreasing, while the error on new data eventually increases (overfitting).
Validation sets
Cross-validation: split the data into folds; each fold serves once as the validation set while the remaining folds are used for training.
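A scikit-learn sketch of k-fold cross-validation (the data, the logistic-regression model and the choice of 5 folds are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder classification data
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

# 5-fold cross-validation: each fold is the validation set once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```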
Confusion matrix

                      True class
Predicted class         -1                  +1
      -1            True Negatives      False Negatives
      +1            False Positives     True Positives
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
                 Cancer    No cancer    Total
Positive test    190       210          400
Negative test    10        3590         3600
Total            200       3800         4000
In this population:
Sensitivity = 95.0 %
Specificity = 94.5 %
PPV = 47.5 %
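These figures can be recomputed directly from the table; a small Python sketch:

```python
# Counts taken from the cancer-screening table above
TP, FP = 190, 210      # positive test: cancer / no cancer
FN, TN = 10, 3590      # negative test: cancer / no cancer

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.945
sensitivity = TP / (TP + FN)                    # 0.95  (recall, TPR)
specificity = TN / (TN + FP)                    # ~0.945
ppv         = TP / (TP + FP)                    # 0.475 (precision)
print(accuracy, sensitivity, specificity, ppv)
```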
ROC curves
ROC plot: the perfect classifier sits in the top-left corner; a random classifier lies along the diagonal.
Receiver Operating Characteristic.
Summarized by the area under the curve (AUROC).
Perfect classifier:
AUROC = 1.0
Random classifier:
AUROC = 0.5
Our classifier:
0.5 < AUROC < 1.0
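A scikit-learn sketch of computing a ROC curve and its AUROC (data and classifier are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data and classifier
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]          # score for the positive class
fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_te, scores))              # area under the ROC curve
```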
Specificity = 1 - FPR
Which method
outperforms the others?
Is a low FPR or high TPR
preferable in a clinical
setting?
Precision-Recall curves
Sensitivity = Recall = True positive rate (TPR) = TP / (TP + FN)
Precision = Positive predictive value (PPV) = TP / (TP + FP)
Precision-recall plot: the good corner is the top-right (recall = 1, precision = 1); the bad corner is the bottom-left.
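The analogous scikit-learn sketch for a precision-recall curve (the class imbalance is an illustrative choice, since PR curves are most informative on imbalanced data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

# Imbalanced placeholder data, closer to a screening setting
X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, scores)
print(average_precision_score(y_te, scores))    # one-number summary of the curve
```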
Linear regression
Least-squares fit: minimize the sum of squared errors $\sum_{i=1}^{n} (y^i - \beta^\top x^i)^2$.
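A minimal least-squares sketch with scikit-learn (the data and the true coefficients are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # placeholder design matrix
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

model = LinearRegression().fit(X, y)                 # least-squares fit
print(model.coef_, model.intercept_)                 # estimated coefficients
```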
Correlated variables
Logistic regression.
54 675 genes
10-fold cross-validation
Feature selection
Large p, small n
Genetics and genomics
Filter approach
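A scikit-learn sketch of a filter approach: each feature is scored independently of any predictive model, and only the top-scoring ones are kept (the univariate F-test and k=20 are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# "Large p, small n" placeholder setting
X, y = make_classification(n_samples=60, n_features=1000, n_informative=10,
                           random_state=0)

# Score each feature independently (ANOVA F-test) and keep the 20 best
selector = SelectKBest(score_func=f_classif, k=20).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)     # (60, 20)
```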
Linear regression
Pseudo-inverse: $\hat{\beta} = (X^\top X)^{-1} X^\top y$.
Linear system of $p$ equations (the normal equations): $X^\top X \beta = X^\top y$.
Numerical methods
Gaussian elimination
LU decomposition
Gradient descent on the cost $J(\beta) = \sum_{i=1}^{n} (y^i - \beta^\top x^i)^2$
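A numpy sketch of the closed-form solution, once via the pseudo-inverse and once by solving the normal equations numerically (placeholder data; assumes XᵀX is invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                        # placeholder data, n > p
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

beta_pinv = np.linalg.pinv(X) @ y                    # via the pseudo-inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)       # solving the normal equations
print(np.allclose(beta_pinv, beta_solve))
```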
Predicted coefficients
54 675 genes
10-fold cross-validation
Regularization
Minimize: SSE + penalty on model complexity
Lasso
L1 penalization: minimize $\sum_{i=1}^{n} (y^i - \beta^\top x^i)^2 + \lambda \|\beta\|_1$.
Equivalent to: minimize the sum of squared errors subject to $\|\beta\|_1 \le t$.
The L1 penalty shrinks the coefficients towards zero (some become exactly zero).
Geometric picture in the $(\beta_1, \beta_2)$ plane: iso-contours of the least-squares error around the unconstrained minimum (the least-squares solution), and the feasible region $|\beta_1| + |\beta_2| \le t$; the constrained solution lies where a contour first touches this diamond, often at a corner where one coefficient is exactly zero.
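A scikit-learn sketch of fitting the lasso in a p > n setting (the data and the regularization strength alpha=0.1 are illustrative; in practice alpha would be chosen by cross-validation):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))                 # more features than samples
beta_true = np.zeros(200)
beta_true[:5] = [2.0, -1.5, 1.0, -0.5, 3.0]    # only 5 truly relevant features
y = X @ beta_true + 0.1 * rng.normal(size=60)

lasso = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)
print(np.sum(lasso.coef_ != 0))                # most coefficients are exactly zero
```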
Regularization path: feature coefficients as a function of decreasing values of $\lambda$.
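Such a path can be traced with scikit-learn's lasso_path; a sketch on the same kind of toy data as above:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))
beta_true = np.zeros(200)
beta_true[:5] = [2.0, -1.5, 1.0, -0.5, 3.0]
y = X @ beta_true + 0.1 * rng.normal(size=60)

# coefs has shape (n_features, n_alphas): one coefficient profile per feature
alphas, coefs, _ = lasso_path(X, y)
print(alphas.shape, coefs.shape)
```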
54 675 genes
10-fold cross-validation
54 675 genes
10-fold cross-validation
Toolboxes
Python: scikit-learn
http://scikit-learn.org
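A small end-to-end sketch combining several pieces from this lecture with scikit-learn (all data and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder high-dimensional data
X, y = make_classification(n_samples=200, n_features=500, n_informative=15,
                           random_state=0)

# Scaling + PCA (dimensionality reduction) + linear classifier,
# evaluated by 10-fold cross-validation with AUROC
model = make_pipeline(StandardScaler(), PCA(n_components=20),
                      LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print(scores.mean())
```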
Summary
References