Detecting Novel Associations in Large Data Sets

CS-GN-TEAM: internal presentation
detecting novel associations in large data sets
Michele Filannino + You
Presented paper: D. N. Reshef et al., Detecting Novel Associations in Large Data Sets, Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
Manchester, 05/03/2012
presentation my research taster project
where we are
05/03/2012, Michele Filannino
Introduction
novel association
two variables, X and Y, are associated if there is a
relationship between them
functional
non functional
novel: unknown
example
Data set 10x6

        s0      s1      s2      s3      s4      s5      s6      s7      s8      s9
f0    4.00    9.00    3.00   10.00    5.00    2.00    7.00    8.00    1.00    6.00
f1   -0.76    0.41    0.14   -0.54   -0.96    0.91    0.66    0.99    0.84   -0.28
f2    5.00   10.00    4.00   11.00    6.00    3.00    8.00    9.00    2.00    7.00
f3   12.00   23.00    0.00  100.00   45.00  123.00    4.00   -2.00   36.00    0.00
f4    8.22   27.12    0.56   94.02   39.25  125.73    9.26    6.90   37.68   -1.96
f5    1.83    4.30   -0.43    6.24    3.56    2.97    2.56    2.37    1.58    0.71
scatter plot: f0 vs. f2
f2(x) = f0(x) + 1
scatter plot: f0 vs. f1
no relation
correlation coefficients

                 f0-f5   f0-f1   f0-f2   f2-f3   f0-f3
Pearson           0.63   -0.17    1.00   -0.08   -0.08
Mutual Inform.    2.45    1.57    3.32    3.12    3.12
MI norm.          0.74    0.47    1.00    0.94    0.94
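
As a rough check of the Pearson values, the coefficients can be recomputed from the 10x6 data set. The sketch below is plain Python; the helper name `pearson` and the dictionary layout are illustrative choices, not part of the presentation. f0 is taken to be the value row for which f2 = f0 + 1 holds.

```python
import math

# Feature rows copied from the 10x6 data set (samples s0..s9).
features = {
    "f0": [4, 9, 3, 10, 5, 2, 7, 8, 1, 6],
    "f1": [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28],
    "f2": [5, 10, 4, 11, 6, 3, 8, 9, 2, 7],
    "f3": [12, 23, 0, 100, 45, 123, 4, -2, 36, 0],
    "f5": [1.83, 4.30, -0.43, 6.24, 3.56, 2.97, 2.56, 2.37, 1.58, 0.71],
}

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Same feature pairs as in the table.
pairs = [("f0", "f5"), ("f0", "f1"), ("f0", "f2"), ("f2", "f3"), ("f0", "f3")]
for a, b in pairs:
    print(a, b, round(pearson(features[a], features[b]), 2))
```

Rounded to two decimals, these reproduce the Pearson values of the table (0.63, -0.17, 1.00, -0.08, -0.08).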
pros. & cons.

Pearson's coeff.
  closed interval result
  only linear relations
  feature independency

Mutual Information
  non linear relations
  only categorical data
  biased towards higher arity features
the new measure
motivations
generality:
capture a wide range of interesting associations, not limited to specific function types
equitability:
give similar scores to equally noisy relationships of different types
definition of MIC
Given a finite set D of ordered pairs, we can
partition the X-values of D into x bins and the Y-values of D into y bins
We obtain a pair of partitions called an x-by-y grid

D = (F0, F1)
F0 = (1.00, 2.00, 3.00 | 4.00, 5.00, 6.00 | 7.00, 8.00, 9.00, 10.00)
F1 = (-0.96, -0.76 | -0.54, -0.28 | 0.14 | 0.41 | 0.66, 0.84, 0.91, 0.99)
x-by-y grid
2-by-4 grid
definition of MIC
given the grid we could calculate D|G, the frequency
distribution induced by the points in D on the cells of G
different grids G result in different distributions D|G
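
A minimal sketch of how a grid induces the distribution D|G, using f0 and f1 from the example data set. The cut points and the helper name `induced_distribution` are hypothetical, chosen only for illustration; they give a 2-by-4 grid (2 columns on X, 4 rows on Y).

```python
from bisect import bisect_right
from collections import Counter

# f0 and f1 from the example data set, paired sample by sample.
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]

def induced_distribution(xs, ys, x_cuts, y_cuts):
    """Return D|G: the fraction of points in each (column, row) cell.

    x_cuts and y_cuts are sorted cut points; k cuts define k + 1 bins.
    """
    n = len(xs)
    counts = Counter(
        (bisect_right(x_cuts, x), bisect_right(y_cuts, y))
        for x, y in zip(xs, ys)
    )
    return {cell: count / n for cell, count in sorted(counts.items())}

# Illustrative cut points (an assumption, not taken from the slides).
dist = induced_distribution(f0, f1, x_cuts=[5.5], y_cuts=[-0.5, 0.0, 0.5])
for cell, p in dist.items():
    print(cell, p)
```

Moving any cut point changes which cell some points fall into, and therefore changes D|G.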
maximal MI over all grids
number of columns
number of rows
characteristic matrix
Infinite matrix!
normalisation factor
(derived by MI definition)
Maximal Information Coefficient
max grid size
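
The definitions annotated on the last three slides (number of columns and rows, normalisation factor, max grid size) can be written out in the notation of the presented paper. This is a reconstruction from those annotations, so the symbols are assumed to match the ones used in the talk; n is the number of points in D.

```latex
% maximal MI over all grids G with x columns and y rows
I^{*}(D, x, y) = \max_{G} \, I(D|_{G})

% characteristic matrix, normalised by \log \min\{x, y\}
M(D)_{x,y} = \frac{I^{*}(D, x, y)}{\log \min\{x, y\}}

% Maximal Information Coefficient, limited by the max grid size B(n)
\mathrm{MIC}(D) = \max_{xy < B(n)} M(D)_{x,y}
```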
matrix computation
space of grids grows exponentially
$B(n) \le O(n^{1-\varepsilon})$ for $0 < \varepsilon < 1$
approximation of MIC
heuristic dynamic programming
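
To make the search concrete, here is a deliberately naive sketch that is not the paper's heuristic dynamic programming: it tries equal-frequency grids only. It assumes distinct values (no ties) and a grid budget of x*y <= max(n^0.6, 4), so that a 2-by-2 grid is always tried. All function names are illustrative.

```python
import math
from collections import Counter

def equal_frequency_bins(values, k):
    """Label each value with one of k bins of (almost) equal size, by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels

def grid_mutual_information(cols, rows):
    """Mutual information (bits) of the cell distribution given cell labels."""
    n = len(cols)
    joint = Counter(zip(cols, rows))
    p_col, p_row = Counter(cols), Counter(rows)
    return sum(c / n * math.log2(c * n / (p_col[a] * p_row[b]))
               for (a, b), c in joint.items())

def naive_mic(xs, ys, alpha=0.6):
    """Crude MIC estimate: best normalised MI over equal-frequency grids."""
    n = len(xs)
    budget = max(n ** alpha, 4)  # assumption: always allow a 2-by-2 grid
    best = 0.0
    for x_bins in range(2, int(budget // 2) + 1):
        col_labels = equal_frequency_bins(xs, x_bins)
        for y_bins in range(2, int(budget // x_bins) + 1):
            row_labels = equal_frequency_bins(ys, y_bins)
            mi = grid_mutual_information(col_labels, row_labels)
            best = max(best, mi / math.log2(min(x_bins, y_bins)))
    return best

xs = [n * 0.01 for n in range(1, 2000)]
print("linear:", naive_mic(xs, [x + 1 for x in xs]))
print("sine:  ", naive_mic(xs, [math.sin(x) for x in xs]))
```

Because only equal-frequency grids are tried, the score is a lower bound on what a full search over the same grid sizes would find, and it is not expected to reproduce the output of the MINE implementation.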
MIC summary

closed interval result
non linear relations
all types of data
B(n) is crucial
  too high: non-zero scores even for random data
  too low: we are searching only for simple patterns
still univariate
B(n) behaviour
how to use it
python
import xstats.MINE as MINE

x = [40, 50, None, 70, 80, 90, 100, 110, 120, 130, 140, 150,
     160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260]
y = [-0.07, -0.23, -0.1, 0.03, -0.04, None, -0.28, -0.44,
     -0.09, 0.12, 0.06, -0.04, 0.31, 0.59, 0.34, -0.28, -0.09,
     -0.44, 0.31, 0.03, 0.57, 0, 0.01]

# analyze_pair returns a dict of MINE statistics (see the result slide)
print("x y %s" % MINE.analyze_pair(x, y))
https://github.com/ajmazurie/xstats.MINE
python: result
{'MCN': 2.5849625999999999,
 'MAS': 0.040419996,
 'pearson': 0.31553724,
 'MIC': 0.38196000000000002,
 'MEV': 0.27117000000000002,
 'non_linearity': 0.28239626000000001}
correlation coefficients

                 f0-f5   f0-f1   f0-f2   f2-f3   f0-f3
Pearson           0.63   -0.17    1.00   -0.08   -0.08
Mutual Inform.    2.45    1.57    3.32    3.12    3.12
MI norm.          0.74    0.47    1.00    0.94    0.94
MIC               0.24    0.24    1.00    0.24    0.24
MIC summary

closed interval result
non linear relations
all types of data
B(n) is crucial
  n is too low!
still univariate
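python
import math
import xstats.MINE as MINE

# 1999 samples of sin(x), x = 0.01 ... 19.99
x = [n * 0.01 for n in range(1, 2000)]
y = [math.sin(n) for n in x]
result = MINE.analyze_pair(x, y)

# the keys are strings, as in the result dictionary shown earlier
print("MIC: %s" % result['MIC'])
print("Pearson: %s" % result['pearson'])

>>> MIC: 0.99999
>>> Pearson: -0.16366038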

conclusion
relationship types
Source: paper
real application
Source: paper
suggestions
use MIC only when you have lots of samples
samples > 2000
use $B(n) = n^{0.6}$
don't use it for all the possible pairs of features
it is slower than Pearson's correlation coefficient or Mutual Information
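
A small sketch of what the grid budget B(n) = n^0.6 allows for different sample sizes, assuming the strict limit x*y < B(n) and at least 2 bins per axis. The function names are illustrative.

```python
def grid_budget(n, alpha=0.6):
    """B(n) = n**alpha: upper limit on the number of grid cells x*y."""
    return n ** alpha

def admissible_grids(n, alpha=0.6):
    """All grid shapes (x, y) with x, y >= 2 and x*y < B(n)."""
    b = grid_budget(n, alpha)
    return [(x, y)
            for x in range(2, int(b) + 1)
            for y in range(2, int(b) + 1)
            if x * y < b]

# sample size, B(n), number of admissible grid shapes
for n in (10, 100, 2000):
    print(n, round(grid_budget(n), 2), len(admissible_grids(n)))
```

With 10 samples the budget is just under 4, so under this strict reading not even a 2-by-2 grid is admissible (implementations may handle this case differently). With 2000 samples the budget is about 95 cells.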
Thank you.
references
D. N. Reshef et al., Detecting Novel Associations in Large Data Sets, Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
D. N. Reshef et al., Supporting Online Material for Detecting Novel Associations in Large Data Sets.
