
Similarity and Dissimilarity Measures

http://inside.mines.edu/~ckarlsso/mining_portfolio/similarity.html

Data Mining Portfolio


Similarity and Dissimilarity Measures
Euclidean distance (#euclid)
Pearson correlation coefficient (#pearson)
Cosine Similarity Score (#cosine)
Extended Jaccard Coefficient (#jaccard)
Complete Code Example with Notes (similarity-code.html)
Information retrieval, and with it finding and
implementing the correct similarity or dissimilarity
measure, is at the heart of data mining. The oldest
approach to this problem was to have people work with
people using metadata (libraries), and it functioned for
millennia. Roughly one century ago Boolean searching
machines arrived, but with one large problem: people do
not think in Boolean terms, and Boolean search requires
structured data. Data mining slowly emerged as a way to
manage priorities and unstructured data.
Are two objects alike (similarity)? Are they different
(dissimilarity)? To what degree are they similar or
dissimilar (a numerical measure)? In what respects are they
alike or different, and how is this expressed (attributes)?
Measuring similarities and dissimilarities is fundamental
to data mining; almost everything else is based on
measuring distance.
Euclidean Distance:
is the distance between two points (p, q) in
any dimension of space and is the most
commonly used distance measure. When data is
dense or continuous, this is the best proximity
measure.
from math import sqrt

def sim_distance(prefs, person1, person2):
    # Get the list of shared_items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    # If they have no ratings in common, return 0
    if len(si) == 0: return 0
    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                          for item in si])
    # The Euclidean distance itself would be sqrt(sum_of_squares), which is
    # smaller the closer the points are together. To give a higher value
    # when two points are close, add one and take the inverse.
    return 1 / (1 + sqrt(sum_of_squares))
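As a quick sanity check, the function can be tried on a small ratings dictionary. The names and ratings below are invented for illustration; identical ratings give the maximum score of 1, and larger distances give scores closer to 0.

```python
from math import sqrt

def sim_distance(prefs, person1, person2):
    # Collect the items both people have rated
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    if len(si) == 0: return 0
    # Squared differences over the shared items
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                          for item in si])
    # Inverted so that closer points score higher
    return 1 / (1 + sqrt(sum_of_squares))

# Hypothetical ratings data (names and items are invented)
prefs = {
    'Alice': {'Movie A': 3.0, 'Movie B': 4.0},
    'Bob':   {'Movie A': 3.0, 'Movie B': 4.0},
    'Carol': {'Movie A': 0.0, 'Movie B': 0.0},
}

print(sim_distance(prefs, 'Alice', 'Bob'))    # identical ratings -> 1.0
print(sim_distance(prefs, 'Alice', 'Carol'))  # distance sqrt(9+16)=5 -> 1/6
```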

Pearson Correlation Coefficient:
measures the strength and the direction of
the linear relationship between two variables.
The value always lies in [-1, 1], where 1 is a
strong positive relation, 0 is no relation and -1
is a strong negative correlation. It is the most
widely used correlation coefficient and works
very well when the relationship is linear, but
not when it is curvilinear.
# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs, p1, p2):
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item] = 1
    # I'm making a change here: it makes no sense to first do all the sum
    # calculations and then check whether anything was shared; better to
    # check first and then add. CK
    # If they have no ratings in common, return 0
    if len(si) == 0: return 0
    n = len(si)
    # Add up all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    # Sums of the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    # Sum of the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    # Calculate r (Pearson score)
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0: return 0
    r = num / den
    return r
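A useful property of the Pearson score is that it corrects for "grade inflation": a critic who consistently rates everything lower than another, but whose preferences move in step, still gets a perfect score. The sketch below demonstrates this on invented data (the critic names and ratings are hypothetical).

```python
from math import sqrt

def sim_pearson(prefs, p1, p2):
    # Mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item] = 1
    if len(si) == 0: return 0
    n = len(si)
    # Sums, sums of squares and sum of products over shared items
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    # Pearson score r
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0: return 0
    return num / den

# 'Harsh' rates every film exactly one point lower than 'Generous',
# so their tastes agree perfectly and r should be 1.
prefs = {
    'Generous': {'A': 3.0, 'B': 4.0, 'C': 5.0},
    'Harsh':    {'A': 2.0, 'B': 3.0, 'C': 4.0},
}
print(sim_pearson(prefs, 'Generous', 'Harsh'))
```

Euclidean distance, by contrast, would rank these two critics as fairly far apart even though they agree on the ordering and spacing of every film.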

Cosine Similarity:
is often used when comparing two
documents against each other. It measures
the angle between the two vectors. If the
value is 0, the angle between the two
vectors is 90 degrees and they share no
terms. If the value is 1, the two vectors are the
same except for magnitude. Cosine is used
when data is sparse and asymmetric, where a
shared lack of characteristics should not
count toward similarity.
# Returns the Cosine Similarity Score for p1 and p2
def sim_cosine(prefs, p1, p2):
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item] = 1
    # If they have no ratings in common, return 0
    if len(si) == 0: return 0
    # Dot product and vector norms over the shared items
    num_p = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    norm_p1 = sqrt(sum([pow(prefs[p1][it], 2) for it in si]))
    norm_p2 = sqrt(sum([pow(prefs[p2][it], 2) for it in si]))
    # The dot product divided by the product of the norms already IS the
    # cosine of the angle, so it must not be wrapped in cos().
    s_cos = num_p / (norm_p1 * norm_p2)
    return s_cos
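The magnitude-invariance mentioned above is easy to verify: scaling one vector changes its length but not its direction, so the cosine score is unchanged. The document names and term weights below are invented for illustration.

```python
from math import sqrt

def sim_cosine(prefs, p1, p2):
    # Shared terms only; terms missing from either vector add 0 to the dot product
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item] = 1
    if len(si) == 0: return 0
    num_p = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    norm_p1 = sqrt(sum([pow(prefs[p1][it], 2) for it in si]))
    norm_p2 = sqrt(sum([pow(prefs[p2][it], 2) for it in si]))
    # Dot product over product of norms = cosine of the angle
    return num_p / (norm_p1 * norm_p2)

# Hypothetical term-weight vectors: doc2 is doc1 scaled by 2, so the
# angle between them is 0 and the similarity should be 1 (up to rounding).
docs = {
    'doc1': {'data': 1.0, 'mining': 2.0},
    'doc2': {'data': 2.0, 'mining': 4.0},
}
print(sim_cosine(docs, 'doc1', 'doc2'))
```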

Extended Jaccard Coefficient
is used to compare documents. It measures
the similarity of two sets by comparing the
size of their overlap against the size of their
union. Should the two sets have only binary
attributes, it reduces to the plain Jaccard
Coefficient. As with cosine, it is useful
under the same data conditions and is well
suited for market-basket data.
# Returns the Extended Jaccard (Tanimoto) Coefficient for p1 and p2
def sim_ext_jaccard(prefs, p1, p2):
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item] = 1
    # If they have no ratings in common, return 0
    if len(si) == 0: return 0
    # Dot product over the shared items (unrated items count as zero)
    cross_p = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    # The squared norms must run over ALL of each person's ratings, not just
    # the shared ones; otherwise binary data would always score 1 instead of
    # reducing to the plain Jaccard coefficient
    norm_p1 = sum([pow(v, 2) for v in prefs[p1].values()])
    norm_p2 = sum([pow(v, 2) for v in prefs[p2].values()])
    # Calculate the Extended Jaccard Coefficient
    EJ = cross_p / (norm_p1 + norm_p2 - cross_p)
    return EJ
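The reduction to the plain Jaccard coefficient can be checked directly: for binary 0/1 vectors the dot product is the size of the intersection and each squared norm is the size of a set, so the extended formula collapses to intersection over union. The basket data below is invented, and missing items are treated as zero.

```python
# Hypothetical binary market-basket data: 1 means the item was bought
basket1 = {'milk': 1, 'bread': 1, 'eggs': 1}
basket2 = {'bread': 1, 'eggs': 1, 'jam': 1}

# Build full 0/1 vectors over the union of items
items = set(basket1) | set(basket2)
v1 = [basket1.get(i, 0) for i in items]
v2 = [basket2.get(i, 0) for i in items]

# Extended Jaccard: dot product over (|v1|^2 + |v2|^2 - dot product)
cross = sum(a * b for a, b in zip(v1, v2))
ej = cross / (sum(a * a for a in v1) + sum(b * b for b in v2) - cross)

# Plain Jaccard on the underlying sets, for comparison
jaccard = len(set(basket1) & set(basket2)) / len(set(basket1) | set(basket2))

print(ej, jaccard)  # both 0.5: overlap of 2 items out of a union of 4
```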
* All code examples are implementations of code from 'Programming
Collective Intelligence' by Toby Segaran, O'Reilly Media, 2007.
Christer Karlsson
