Manchester, 05/03/2012
where we are
Introduction
novel association
two variables, X and Y, are associated if there is a
relationship between them
functional
non functional
novel: unknown
05/03/2012, Michele Filannino
example
Data set 10x6 (10 samples s0-s9, 6 features f0-f5)
        s0      s1      s2      s3      s4      s5      s6      s7      s8      s9
f0    4.00    9.00    3.00   10.00    5.00    2.00    7.00    8.00    1.00    6.00
f1   -0.76    0.41    0.14   -0.54   -0.96    0.91    0.66    0.99    0.84   -0.28
f2    5.00   10.00    4.00   11.00    6.00    3.00    8.00    9.00    2.00    7.00
f3   12.00   23.00    0.00  100.00   45.00  123.00    4.00   -2.00   36.00    0.00
f4    8.22   27.12    0.56   94.02   39.25  125.73    9.26    6.90   37.68   -1.96
f5    1.83    4.30   -0.43    6.24    3.56    2.97    2.56    2.37    1.58    0.71
f2(x) = f0(x) + 1
no relation
correlation coefficients
pair     Pearson   Mutual Information   MI norm.
f0-f5       0.63                 2.45       0.74
f0-f1      -0.17                 1.57       0.47
f0-f2       1.00                 3.32       1.00
f2-f3      -0.08                 3.12       0.94
f0-f3      -0.08                 3.12       0.94
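The Pearson entries for f0-f2 and f0-f3 can be checked with a few lines of plain Python. This is a minimal sketch, assuming that the unlabeled row of the example data set is f0 (which is consistent with f2(x) = f0(x) + 1); the function name `pearson` is illustrative.

```python
import math

# Rows of the example data set; f0 is assumed to be the unlabeled row.
f0 = [4.00, 9.00, 3.00, 10.00, 5.00, 2.00, 7.00, 8.00, 1.00, 6.00]
f2 = [5.00, 10.00, 4.00, 11.00, 6.00, 3.00, 8.00, 9.00, 2.00, 7.00]
f3 = [12.00, 23.00, 0.00, 100.00, 45.00, 123.00, 4.00, -2.00, 36.00, 0.00]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_f0_f2 = pearson(f0, f2)  # f2 is f0 shifted by 1, so this is a perfect linear relation
r_f0_f3 = pearson(f0, f3)

print(round(r_f0_f2, 2), round(r_f0_f3, 2))
```

A constant shift does not change the correlation, which is why f0-f3 and f2-f3 share the same Pearson value in the table.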
Mutual Information
non-linear relations
only categorical data
biased towards higher arity features
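As a rough illustration of these points, mutual information can be computed by treating every distinct value as a category. The sketch below assumes that this is how the table values were obtained (it reproduces the f0-f2 and f2-f3 entries), and that "MI norm." divides by log2 of the number of samples; both are assumptions, and the function names are illustrative.

```python
import math
from collections import Counter

# Rows of the example data set; f0 is assumed to be the unlabeled row.
f0 = [4.00, 9.00, 3.00, 10.00, 5.00, 2.00, 7.00, 8.00, 1.00, 6.00]
f2 = [5.00, 10.00, 4.00, 11.00, 6.00, 3.00, 8.00, 9.00, 2.00, 7.00]
f3 = [12.00, 23.00, 0.00, 100.00, 45.00, 123.00, 4.00, -2.00, 36.00, 0.00]

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), every distinct value being a category."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

mi_f0_f2 = mutual_information(f0, f2)
mi_f2_f3 = mutual_information(f2, f3)
# Assumed normalisation: divide by log2(number of samples).
mi_norm_f2_f3 = mi_f2_f3 / math.log2(len(f2))

print(round(mi_f0_f2, 2), round(mi_f2_f3, 2), round(mi_norm_f2_f3, 2))
```

With ten samples and almost all values distinct, nearly every pair of features reaches close to the maximum score, which is the bias towards higher arity features.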
motivations
generality:
capture a wide range of interesting associations, not limited to specific function types
equitability:
give similar scores to equally noisy relationships of different types
definition of MIC
Given a finite set D of ordered pairs, we can
partition the X-values of D into x bins and the Y-values of D into y bins
x-by-y grid
2-by-4 grid
definition of MIC
given the grid G, we can compute D|G, the frequency
distribution induced by the points in D on the cells of G
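The frequency distribution D|G can be sketched as follows. The function name `grid_distribution` and the representation of a grid by its interior cut points are assumptions made for this illustration, not notation from the slides.

```python
from bisect import bisect_right

def grid_distribution(points, x_cuts, y_cuts):
    """Fraction of the points of D falling in each cell of the grid G.

    x_cuts and y_cuts are the sorted interior cut points of the grid, so the
    grid has len(x_cuts) + 1 columns and len(y_cuts) + 1 rows.
    Returns a list of rows, each row a list of cell frequencies.
    """
    n_cols = len(x_cuts) + 1
    n_rows = len(y_cuts) + 1
    counts = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        col = bisect_right(x_cuts, x)  # index of the column containing x
        row = bisect_right(y_cuts, y)  # index of the row containing y
        counts[row][col] += 1
    total = len(points)
    return [[c / total for c in row] for row in counts]

# Four points on a 2-by-4 grid (2 columns, 4 rows).
points = [(1, 1), (2, 2), (3, 3), (4, 4)]
dist = grid_distribution(points, x_cuts=[2.5], y_cuts=[1.5, 2.5, 3.5])
for row in dist:
    print(row)
```

The mutual information of this distribution is the quantity that is then maximised over all grids of the same size.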
x: number of columns
y: number of rows
characteristic matrix
Infinite matrix!
normalisation factor
(derived from the MI definition)
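For reference, the definitions can be written out as follows. The notation is taken from Reshef et al. (2011) rather than from these slides: I* is the largest mutual information achievable by any x-by-y grid, M(D) is the characteristic matrix, and B(n) is the bound on the grid size for n samples.

```latex
I^{*}(D, x, y) = \max_{G \,:\, G \text{ is an } x\text{-by-}y \text{ grid}} I(D|_{G})

M(D)_{x,y} = \frac{I^{*}(D, x, y)}{\log \min\{x, y\}}

\mathrm{MIC}(D) = \max_{xy < B(n)} M(D)_{x,y}
```

The denominator is the normalisation factor: the mutual information of an x-by-y grid cannot exceed log min{x, y}, so every entry of the matrix lies in [0, 1].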
matrix computation
space of grids grows exponentially
approximation of MIC
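To make the idea of an approximation concrete, here is a deliberately naive sketch. It is not the heuristic used in the paper: it only tries equal-frequency grids, so it yields a lower bound on each entry of the characteristic matrix. The function names and the `exponent` parameter are assumptions for this illustration, and the grid-size condition xy < B(n) with B(n) = n^0.6 follows the paper.

```python
import math
from collections import Counter

def equal_frequency_bins(values, k):
    """Bin index (0..k-1) for each value, bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def naive_mic(xs, ys, exponent=0.6):
    """Best normalised MI over equal-frequency x-by-y grids with x*y < B(n)."""
    budget = len(xs) ** exponent  # B(n)
    best = 0.0
    for x_bins in range(2, int(budget) + 1):
        for y_bins in range(2, int(budget) + 1):
            if x_bins * y_bins >= budget:
                break  # larger y_bins would only exceed the budget further
            bx = equal_frequency_bins(xs, x_bins)
            by = equal_frequency_bins(ys, y_bins)
            score = mutual_information(bx, by) / math.log2(min(x_bins, y_bins))
            best = max(best, score)
    return best

xs = list(range(100))
linear = [2 * x + 1 for x in xs]
print(naive_mic(xs, linear))
```

Even this restricted search has to evaluate every admissible grid shape; searching over all possible cut points instead is what makes the exact computation infeasible.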
MIC summary
closed interval result
non-linear relations
all types of data
B(n) is crucial
too high: non-zero scores even for random data
too low: we are searching only for simple patterns
still univariate
B(n) behaviour
how to use it
python
import xstats.MINE as MINE
x = [40,50,None,70,80,90,100,110,120,130,140,150,
     160,170,180,190,200,210,220,230,240,250,260]
https://github.com/ajmazurie/xstats.MINE
python: result
{'MCN': 2.5849625999999999,
 'MAS': 0.040419996,
 'pearson': 0.31553724,
 'MIC': 0.38196000000000002,
 'MEV': 0.27117000000000002,
 'non_linearity': 0.28239626000000001}
correlation coefficients
pair     Pearson   Mutual Information   MI norm.    MIC
f0-f5       0.63                 2.45       0.74   0.24
f0-f1      -0.17                 1.57       0.47   0.24
f0-f2       1.00                 3.32       1.00   1.00
f2-f3      -0.08                 3.12       0.94   0.24
f0-f3      -0.08                 3.12       0.94   0.24
MIC summary
closed interval result
non-linear relations
all types of data
B(n) is crucial
n is too low!
still univariate
python
import xstats.MINE as MINE
import math
x = [n*0.01 for n in range(1,2000)]
y = [math.sin(n) for n in x]
result = MINE.analyze_pair(x, y)
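For comparison, the Pearson coefficient on the same sine data can be computed without the library. This is a standalone sketch that does not call xstats.MINE; the helper name `pearson` is illustrative.

```python
import math

# Same data as in the xstats.MINE example: a sine wave sampled on (0, 20).
x = [n * 0.01 for n in range(1, 2000)]
y = [math.sin(v) for v in x]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson(x, y)
print(r)
```

Although y is an exact function of x, the linear correlation stays weak, which is the kind of relationship a non-linear measure is meant to pick up.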
conclusion
relationship types
Source: paper
real application
Source: paper
suggestions
use MIC only when you have lots of samples
use B(n) = n^0.6
don't use it for all the possible pairs of features
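Two quick calculations show why these suggestions matter. The sketch below evaluates B(n) = n^0.6 for a few sample sizes and counts the feature pairs that an all-pairs analysis would have to score; the function names are illustrative.

```python
def grid_budget(n, exponent=0.6):
    """B(n): upper bound on the number of grid cells for n samples."""
    return n ** exponent

def feature_pairs(m):
    """Number of unordered pairs among m features."""
    return m * (m - 1) // 2

budget_10 = grid_budget(10)       # the 10-sample example data set
pairs_6 = feature_pairs(6)        # its 6 features
pairs_1000 = feature_pairs(1000)  # a larger, hypothetical feature count

for n in (10, 100, 1000, 10000):
    print(n, round(grid_budget(n), 2))
print(pairs_6, pairs_1000)
```

With 10 samples the bound is about 3.98, so fewer than four cells are allowed, while the number of pairs grows quadratically with the number of features.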
Thank you.
references
D. N. Reshef et al., Detecting Novel Associations in
Large Data Sets, Science, vol. 334, no. 6062, pp. 1518-1524, 2011.