
CS-GN-TEAM: internal presentation

detecting novel associations in large data sets
Michele Filannino + You
Presented paper: D. N. Reshef et al., Detecting Novel Associations in Large Data Sets, Science, vol. 334, no. 6062, pp. 1518-1524, 2011.

Manchester, 05/03/2012

presentation my research taster project

where we are

05/03/2012, Michele Filannino


Introduction


novel association
two variables, X and Y, are associated if there is a
relationship between them

functional

non-functional

novel: unknown

example
Data set 10x6 (samples s0..s9 as columns, features f0..f5 as rows)

        s0      s1      s2      s3      s4      s5      s6      s7      s8      s9
f0    4.00    9.00    3.00   10.00    5.00    2.00    7.00    8.00    1.00    6.00
f1   -0.76    0.41    0.14   -0.54   -0.96    0.91    0.66    0.99    0.84   -0.28
f2    5.00   10.00    4.00   11.00    6.00    3.00    8.00    9.00    2.00    7.00
f3   12.00   23.00    0.00  100.00   45.00  123.00    4.00   -2.00   36.00    0.00
f4    8.22   27.12    0.56   94.02   39.25  125.73    9.26    6.90   37.68   -1.96
f5    1.83    4.30   -0.43    6.24    3.56    2.97    2.56    2.37    1.58    0.71


scatter plot: f0 vs. f2

f2(x) = f0(x) + 1


scatter plot: f0 vs. f1

no apparent relation (the values are f1 = sin(f0), rounded to two decimals)


correlation coefficients

pair     Pearson   Mutual Information   MI norm.
f0-f5       0.63                 2.45       0.74
f0-f1      -0.17                 1.57       0.47
f0-f2       1.00                 3.32       1.00
f2-f3      -0.08                 3.12       0.94
f0-f3      -0.08                 3.12       0.94
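
The Pearson column can be checked directly against the example data set. The sketch below is a plain-Python implementation of the standard Pearson formula, taking f0 = (4, 9, 3, ...) as the integer-valued feature of the example; the function name `pearson` and the `features` dictionary are illustrative choices, not part of the original slides.

```python
import math

# Example data set from the slides (10 samples, 6 features).
features = {
    "f0": [4, 9, 3, 10, 5, 2, 7, 8, 1, 6],
    "f1": [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28],
    "f2": [5, 10, 4, 11, 6, 3, 8, 9, 2, 7],
    "f3": [12, 23, 0, 100, 45, 123, 4, -2, 36, 0],
    "f5": [1.83, 4.30, -0.43, 6.24, 3.56, 2.97, 2.56, 2.37, 1.58, 0.71],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Same pairs, in the same order, as the table.
pairs = [("f0", "f5"), ("f0", "f1"), ("f0", "f2"), ("f2", "f3"), ("f0", "f3")]
for a, b in pairs:
    print(a, b, round(pearson(features[a], features[b]), 2))
```

Because f2 is f0 shifted by a constant, the pairs f2-f3 and f0-f3 necessarily get the same coefficient.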


pros. & cons.


Pearson's coefficient:
closed interval result
only linear relations
feature independency

Mutual Information:
non-linear relations
only categorical data
biased towards higher-arity features
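
The arity bias can be seen on the integer-valued features of the example data set. The sketch below treats every distinct value as its own category and computes mutual information from entropies; the helper names are illustrative. The real-valued features are left out because they would need discretising first, and the binning used for the table is not stated.

```python
import math
from collections import Counter

# Integer-valued features from the example data set.
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f2 = [5, 10, 4, 11, 6, 3, 8, 9, 2, 7]
f3 = [12, 23, 0, 100, 45, 123, 4, -2, 36, 0]


def entropy(values):
    """Shannon entropy in bits, treating each distinct value as a category."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B), in bits."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


print("f0-f2", round(mutual_information(f0, f2), 2))
print("f0-f3", round(mutual_information(f0, f3), 2))
```

With ten samples and almost all values distinct, the score is close to log2(10), about 3.32 bits, whatever the shape of the relationship: f0-f3 scores 3.12 even though its Pearson coefficient is only -0.08.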


the new measure


motivations
generality:

capture a wide range of interesting associations, not limited to specific function types

equitability:

give similar scores to equally noisy relationships of different types


definition of MIC
Given a finite set D of ordered pairs, we can partition the X-values of D into x bins and the Y-values of D into y bins

We obtain a pair of partitions called an x-by-y grid

D = (F0, F1)
F0 = (1.00, 2.00, 3.00 | 4.00, 5.00, 6.00 | 7.00, 8.00, 9.00, 10.00)
F1 = (-0.96, -0.76 | -0.54, -0.28 | 0.14 | 0.41 | 0.66, 0.84, 0.91, 0.99)


x-by-y grid

2-by-4 grid


definition of MIC
given a grid G, we can compute D|G, the frequency distribution induced by the points in D on the cells of G

different grids G result in different distributions D|G
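
As a rough illustration of D|G, the sketch below applies the partition of F0 and F1 shown earlier (3 bins on X, 5 bins on Y) to the example data and computes the mutual information of the induced distribution. The numeric cut points are assumptions: they are simply midpoints between the neighbouring values on either side of each bin boundary, and the function names are illustrative.

```python
import math
from bisect import bisect_right
from collections import Counter

# f0 and f1 from the example data set.
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]

# Assumed cut points (midpoints between neighbouring values at each boundary).
x_cuts = [3.5, 6.5]                    # 3 bins on X
y_cuts = [-0.65, -0.07, 0.275, 0.535]  # 5 bins on Y


def grid_distribution(xs, ys, x_cuts, y_cuts):
    """Return D|G as {(column, row): probability} for the grid given by the cuts."""
    cells = Counter(
        (bisect_right(x_cuts, x), bisect_right(y_cuts, y)) for x, y in zip(xs, ys)
    )
    n = len(xs)
    return {cell: count / n for cell, count in cells.items()}


def mutual_information(dist):
    """Mutual information (bits) between column and row index under dist."""
    p_col, p_row = Counter(), Counter()
    for (col, row), p in dist.items():
        p_col[col] += p
        p_row[row] += p
    return sum(
        p * math.log2(p / (p_col[col] * p_row[row])) for (col, row), p in dist.items()
    )


d_g = grid_distribution(f0, f1, x_cuts, y_cuts)
print(d_g)
print(mutual_information(d_g))
```

Moving any cut point changes the cell counts, and therefore both D|G and its mutual information.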


maximal MI over all grids with x columns and y rows


characteristic matrix

Infinite matrix!

normalisation factor (derived from the MI definition)


Maximal Information Coefficient

max grid size
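
For reference, the three definitions used in this part of the talk can be written as follows, in the notation of the presented paper (Reshef et al., 2011). Here G ranges over grids with x columns and y rows, I(D|_G) is the mutual information of the distribution D|G, n is the number of points in D, and B(n) is the maximal grid size.

```latex
\begin{align}
  I^{*}(D, x, y) &= \max_{G} \, I(D|_{G}) \\
  M(D)_{x,y}     &= \frac{I^{*}(D, x, y)}{\log \min\{x, y\}} \\
  \mathrm{MIC}(D) &= \max_{xy < B(n)} M(D)_{x,y}
\end{align}
```

The division by log min{x, y} is the normalisation factor: it keeps every entry of the characteristic matrix, and hence MIC, between 0 and 1.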


matrix computation
space of grids grows exponentially

B(n) ≤ O(n^(1-ε)) for 0 < ε < 1

approximation of MIC

heuristic dynamic programming
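
To give a feel for the search, here is a deliberately crude approximation in plain Python. It is not the paper's dynamic-programming heuristic: it only tries equal-frequency grids, so it can only underestimate the exact MIC. The function names, the tie handling (ties are split by rank) and the default alpha = 0.6 are assumptions made for this sketch.

```python
import math
import random
from collections import Counter


def equal_frequency_bins(values, k):
    """Label each value with one of k bins of (nearly) equal size, by rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * k // n
    return labels


def mutual_information(a, b):
    """Mutual information in bits between two label sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log2(c * n / (pa[i] * pb[j])) for (i, j), c in pab.items())


def naive_mic(xs, ys, alpha=0.6):
    """Best normalised MI over equal-frequency x-by-y grids with x*y < n**alpha."""
    budget = len(xs) ** alpha  # B(n) = n**alpha
    best = 0.0
    x = 2
    while 2 * x < budget:
        cols = equal_frequency_bins(xs, x)
        y = 2
        while x * y < budget:
            rows = equal_frequency_bins(ys, y)
            score = mutual_information(cols, rows) / math.log2(min(x, y))
            best = max(best, score)
            y += 1
        x += 1
    return best


rng = random.Random(0)
xs = [i / 500 for i in range(500)]
linear = [2 * x + 1 for x in xs]    # noiseless functional relationship
noise = [rng.random() for _ in xs]  # independent of xs
print("linear:", naive_mic(xs, linear))
print("noise: ", naive_mic(xs, noise))
```

Even this restricted search separates a noiseless linear relationship from independent noise; the real algorithm additionally optimises where the grid lines are placed.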


MIC summary

closed interval result
non-linear relations
all types of data

B(n) is crucial:
too high: non-zero scores even for random data
too low: we search only for simple patterns

still univariate

B(n) behaviour


how to use it


python
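import xstats.MINE as MINE

# None entries stand for missing values
x = [40, 50, None, 70, 80, 90, 100, 110, 120, 130, 140, 150,
     160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260]
y = [-0.07, -0.23, -0.1, 0.03, -0.04, None, -0.28, -0.44,
     -0.09, 0.12, 0.06, -0.04, 0.31, 0.59, 0.34, -0.28, -0.09,
     -0.44, 0.31, 0.03, 0.57, 0, 0.01]

# analyze_pair returns a dictionary of statistics for the pair
print("x y %s" % MINE.analyze_pair(x, y))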

https://github.com/ajmazurie/xstats.MINE


python: result
{'MCN': 2.5849625999999999, 'MAS': 0.040419996, 'pearson': 0.31553724, 'MIC': 0.38196000000000002, 'MEV': 0.27117000000000002, 'non_linearity': 0.28239626000000001}


correlation coefficients

pair     Pearson   Mutual Information   MI norm.    MIC
f0-f5       0.63                 2.45       0.74   0.24
f0-f1      -0.17                 1.57       0.47   0.24
f0-f2       1.00                 3.32       1.00   1.00
f2-f3      -0.08                 3.12       0.94   0.24
f0-f3      -0.08                 3.12       0.94   0.24


MIC summary

closed interval result
non-linear relations
all types of data
B(n) is crucial

n is too low!

still univariate


python
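import math
import xstats.MINE as MINE

# 1999 samples of a sine wave on (0, 20)
x = [n * 0.01 for n in range(1, 2000)]
y = [math.sin(n) for n in x]
result = MINE.analyze_pair(x, y)

# the keys are strings, as in the result dictionary shown earlier
print("MIC: %s" % result["MIC"])
print("Pearson: %s" % result["pearson"])

>>> MIC: 0.99999
>>> Pearson: -0.16366038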



conclusion


relationship types

Source: paper


real application

Source: paper


suggestions
use MIC only when you have lots of samples

samples > 2000

use B(n) = n^0.6

don't use it for all the possible pairs of features

it is slower than Pearson's correlation coefficient or Mutual Information
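
To see why the sample count matters, it helps to evaluate the suggested bound B(n) = n^0.6 for a few sample sizes. The sketch below does only that; the function name `grid_budget` is illustrative.

```python
def grid_budget(n, alpha=0.6):
    """Maximal grid size B(n) = n**alpha, the bound on the number of cells x*y."""
    return n ** alpha


for n in (10, 100, 2000):
    print(n, round(grid_budget(n), 2))
```

With 10 samples the bound is just under 4 cells, whereas with 2000 samples it is roughly 96, which leaves room for much finer grids.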


Thank you.


references
D. N. Reshef et al., Detecting Novel Associations in Large Data Sets, Science, vol. 334, no. 6062, pp. 1518-1524, 2011.

D. N. Reshef et al., Supporting Online Material for Detecting Novel Associations in Large Data Sets.
