Typical Applications
- Partitioning methods
- Hierarchical methods
- Model-based methods: hypothesize a model for each cluster and find the best fit of the data to the given model
- Grid-based methods
- Density-based methods
- Subspace clustering
- Constraint-based methods
K-Means Method
Given k, the number of clusters:
1. Choose k objects at random (3 in the illustration) as the initial cluster centroids
2. Assign each object to the closest centroid to form clusters
3. Update the cluster centroids (the mean of each cluster)
4. Recompute the clusters with the updated centroids
5. If the centroids are stable, stop; otherwise repeat from step 3
K-Means Algorithm
- Input
  - k: the number of clusters
  - D: a data set containing n objects
- Output: a set of k clusters
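The steps above can be sketched in Python. This is a minimal illustration on 2-D points (plain tuples, squared Euclidean distance), not an optimized implementation.

```python
import random

def kmeans(points, k, max_iter=100):
    """Minimal k-means on a list of (x, y) tuples."""
    # 1. Choose k objects at random as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # 2. Assign each object to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        # 3. Update each centroid as the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # 4. If the centroids are stable, stop.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Real implementations add tricks (better initialization, distance caching), but the assign/update loop is exactly the one shown in the flow above.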
K-Means Properties
- The algorithm tries to minimize the squared-error criterion:

  E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

  - E: the sum of the squared error for all objects in the data set
  - p: the data point in the space representing an object
  - m_i: the mean of cluster C_i
- It works well when the clusters are compact clouds that are rather well separated from one another
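As a concrete check of the criterion, a short computation on made-up points (the coordinates are illustrative only):

```python
def squared_error(clusters):
    """E: sum of squared Euclidean distances of each point to its cluster mean."""
    E = 0.0
    for cluster in clusters:
        # m_i: the mean of cluster C_i
        mx = sum(p[0] for p in cluster) / len(cluster)
        my = sum(p[1] for p in cluster) / len(cluster)
        E += sum((p[0] - mx) ** 2 + (p[1] - my) ** 2 for p in cluster)
    return E

# Two hypothetical clusters; each point sits 1 unit from its cluster mean.
print(squared_error([[(0, 0), (2, 0)], [(10, 10), (10, 12)]]))  # 4.0
```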
K-Means Properties
Advantages
- K-means is relatively scalable and efficient in processing large data sets
- The computational complexity of the algorithm is O(nkt)
  - n: the total number of objects
  - k: the number of clusters
  - t: the number of iterations
  - Normally, k << n and t << n
Disadvantages
- Can be applied only when the mean of a cluster is defined
- Users need to specify k
- K-means is not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes
- It is sensitive to noise and outlier data points (they can influence the mean value)
K-Medoids Method
- Each cluster is represented by one of its actual objects (a medoid) instead of the mean, which reduces sensitivity to outliers
- Dissimilarity calculations are based on the distance between each object and the representative object of its cluster
- The algorithm minimizes the sum of absolute error:

  E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i|

  - E: the sum of absolute error for all objects in the data set
  - p: the data point in the space representing an object
  - o_i: the representative object of cluster C_i
[Figure: ten data objects o1-o10 plotted on attributes A1 and A2, first unclustered ("Data Objects"), then partitioned into cluster1 = {o1, o2, o3, o4} and cluster2 = {o5, ..., o10}]
E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i| = |o_1 - o_2| + |o_3 - o_2| + |o_4 - o_2| + |o_5 - o_8| + |o_6 - o_8| + |o_7 - o_8| + |o_9 - o_8| + |o_{10} - o_8|
[Figure: the clustering with representative objects o2 and o8]

E = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
[Figure: the clustering after replacing o8 with o7 as a representative object]

E = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
S = 22 - 20 = 2
Since S > 0, it is a bad idea to replace o8 by o7
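The swap test can be written out directly. The coordinates below are hypothetical stand-ins (the figure's actual coordinates are not reproduced here), so the costs differ from the 20 and 22 above, but the decision rule is the same.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def absolute_error(points, medoids):
    """E: each point contributes its distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

# Hypothetical 2-D objects forming two small groups.
objects = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]

E_before = absolute_error(objects, [(0, 0), (5, 5)])  # current medoids
E_after = absolute_error(objects, [(0, 0), (6, 5)])   # after a candidate swap
S = E_after - E_before
print(S > 0)  # True here: a positive S means the swap would worsen the clustering
```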
K-Medoids Method
[Figure: the ten data objects o1-o10 on attributes A1 and A2]
K-Medoids Method
To decide whether a random (non-representative) object should replace a current representative object, each remaining object p is examined; four cases arise:
- First case: p is currently assigned to the representative object being replaced; after the swap p is closest to another representative object, so p is reassigned to that object
- Second case: p is currently assigned to the representative object being replaced; after the swap p is closest to the random object, so p is reassigned to the random object
- Third case: p is currently assigned to a representative object that is not being replaced, and p remains closest to it, so its assignment does not change
- Fourth case: p is currently assigned to a representative object that is not being replaced, but p is now closer to the random object, so p is reassigned to the random object
K-Medoids Algorithm (PAM)
PAM: Partitioning Around Medoids
- Input
  - k: the number of clusters
  - D: a data set containing n objects
- Output: a set of k clusters
Advantages
- More robust than k-means in the presence of noise and outliers, because a medoid is less influenced by extreme values than a mean
Disadvantages
- Each iteration is costly: O(k(n - k)^2)
- Does not scale well to large data sets
- Users need to specify k
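A compact sketch of PAM's swap loop, assuming 2-D tuples and Euclidean distance; a real implementation would cache distances rather than recompute the full cost for every candidate swap.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def pam(points, k, seed=0):
    """PAM sketch: start from random medoids, then greedily apply every
    improving swap of a medoid with a non-medoid until none remains."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)  # arbitrary initial representative objects

    def cost(ms):
        # Sum of distances from each object to its nearest medoid.
        return sum(min(dist(p, m) for m in ms) for p in points)

    current = cost(medoids)
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid by each non-medoid object.
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                c = cost(candidate)
                if c - current < 0:  # S < 0: the swap improves the clustering
                    medoids, current, improved = candidate, c, True
    return medoids, current
```

Every improving swap strictly lowers E, so the loop terminates at a local optimum; this is what makes each iteration O(k(n - k)^2) distance work.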
3.2.4 CLARA
- CLARA (Clustering LARge Applications) applies PAM to a random sample of the data instead of the whole data set
- A random sample should closely represent the original data
- CLARA draws multiple samples (sample1, sample2, ..., samplem), applies PAM to each one, and returns the best of the resulting clusterings
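A sketch of the CLARA idea, with assumptions flagged: for brevity the "best k medoids of a sample" step is done by exhaustive search over the sample (in real CLARA this step is PAM), and the data are 2-D tuples.

```python
import random
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

def total_cost(points, medoids):
    """Sum of distances from each object to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def clara(points, k, num_samples=5, sample_size=10, seed=0):
    """CLARA sketch: find the best k medoids within each random sample,
    then keep the sample result that scores best on the WHOLE data set."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        # Best k medoids among the sample only (PAM would be used here).
        medoids = min(combinations(sample, k),
                      key=lambda ms: total_cost(sample, ms))
        # Evaluate the candidate medoids against the original data.
        cost = total_cost(points, medoids)
        if cost < best_cost:
            best_medoids, best_cost = list(medoids), cost
    return best_medoids, best_cost
```

Note that the medoids are chosen only from the sample: if a sample misses a region of the data, no clustering built from it can represent that region, which is exactly the bias problem discussed below.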
CLARA Properties
- The computational complexity of each iteration is O(ks^2 + k(n - k))
  - s: the size of the sample
  - n: the number of objects
  - k: the number of clusters
- PAM finds the best k medoids among a given data set; CLARA finds the best k medoids among the selected samples
Problems
- The effectiveness depends on the sample size
- A good clustering of the samples does not necessarily represent a good clustering of the whole data set if the samples are biased
3.2.5 CLARANS
- CLARANS (Clustering Large Applications based upon RANdomized Search) combines the sampling technique with PAM
- The clustering process can be viewed as a search through a graph in which each node is a set of k medoids; neighboring nodes differ by a single medoid
- Unlike CLARA, which confines the search to a fixed sample, CLARANS draws the neighbors of the current node at random from the original data; the number of neighbors sampled from the original data is specified by the user
- If a neighbor with a lower cost is found, the search moves to it; otherwise the current node is a local optimum and the search restarts from a new node

[Figure: a search step among neighboring sets of medoids with example costs (-10, 5, 2, -15, 3, 20, 1); in the first step of CLARA's search the neighbors come from the chosen sample, while CLARANS samples neighbors from the original data]
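The search described above can be sketched as follows, assuming 2-D tuples and Euclidean distance; `maxneighbor` and `numlocal` play the roles of the user-specified neighbor count and the number of local optima to collect.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def clarans(points, k, numlocal=2, maxneighbor=20, seed=0):
    """CLARANS sketch: randomized search over sets of medoids.  From the
    current node (a set of k medoids), examine up to maxneighbor random
    neighbors drawn from the original data; move whenever a cheaper
    neighbor is found, otherwise record a local optimum and restart."""
    rng = random.Random(seed)

    def cost(ms):
        return sum(min(dist(p, m) for m in ms) for p in points)

    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = rng.sample(points, k)
        current_cost = cost(current)
        tried = 0
        while tried < maxneighbor:
            # A neighbor replaces one random medoid by one random non-medoid.
            i = rng.randrange(k)
            o = rng.choice([p for p in points if p not in current])
            neighbor = current[:i] + [o] + current[i + 1:]
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:
                current, current_cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1
        if current_cost < best_cost:  # keep the best local optimum found
            best, best_cost = current, current_cost
    return best, best_cost
```

Because neighbors come from the whole data set rather than a fixed sample, the search is not trapped inside one sample's medoids, which is the key difference from CLARA.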
CLARANS Properties
Advantages
- More effective than both PAM and CLARA
- Handles outliers
Disadvantages
- The computational complexity is about O(n^2)
- The clustering quality depends on the sampling method