
Information Sciences 311 (2015) 149–162


A similarity assessment technique for effective grouping of documents

Tanmay Basu, C.A. Murthy
Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India

Article info

Article history:
Received 20 February 2014
Received in revised form 25 December 2014
Accepted 15 March 2015
Available online 21 March 2015

Keywords:
Document clustering
Text mining
Applied data mining

Abstract
Document clustering refers to the task of grouping similar documents and segregating dissimilar documents. It is very useful for finding meaningful categories in a large corpus. In practice, the task of categorizing a corpus is not easy, since a corpus generally contains a huge number of documents and the document vectors are high dimensional. This paper introduces a hybrid document clustering technique that combines a new hierarchical clustering technique with the traditional k-means clustering technique. A distance function is proposed to find the distance between the hierarchical clusters. Initially the algorithm constructs some clusters by the hierarchical clustering technique using the new distance function. Then the k-means algorithm is performed, using the centroids of the hierarchical clusters, to group the documents that are not included in the hierarchical clusters. The major advantage of the proposed distance function is that it is able to find the nature of the corpora by varying a similarity threshold. Thus the proposed clustering technique does not require the number of clusters prior to executing the algorithm. In this way the initial random selection of k centroids for the k-means algorithm is not needed by the proposed method. The experimental evaluation using Reuters, Ohsumed and various TREC data sets shows that the proposed method performs significantly better than several other document clustering techniques. F-measure and normalized mutual information are used to show that the proposed method effectively groups the text data sets.
© 2015 Elsevier Inc. All rights reserved.

1. Introduction
Clustering algorithms partition a data set into several groups such that the data points in the same group are close to each other and the points across groups are far from each other [9]. Document clustering algorithms try to identify the inherent grouping of the documents in order to produce good quality clusters for text data sets. In recent years it has been recognized that partitional clustering algorithms, e.g., k-means and buckshot, are advantageous due to their low computational complexity. On the other hand, these algorithms need knowledge of the number of clusters. Generally document corpora are huge in size with high dimensionality, hence it is not easy to estimate the number of clusters for any real life document corpus. Hierarchical clustering techniques do not need knowledge of the number of clusters, but a stopping criterion is needed to terminate the algorithms, and finding a specific stopping criterion is difficult for large data sets.
The main difficulty of most document clustering techniques is to determine the (content) similarity of a pair of documents for putting them into the same cluster [3]. Generally cosine similarity is used to determine the content similarity between two documents [24]. Cosine similarity essentially checks the number of common terms present in the documents. If


two documents contain many common terms then they are very likely to be similar. The difficulty is that there is no clear explanation as to how many common terms identify two documents as similar. Text data sets are high dimensional and most of the terms do not occur in every document. Hence the issue is to find the content similarity in such a way that it can restrict the low similarity values. The actual content similarity between two documents may not be found properly by checking only the individual terms of the documents. A new distance function is proposed to find the distance between two clusters based on a similarity measure, the extensive similarity between documents. Intuitively, the extensive similarity restricts the low (content) similarity values by a predefined threshold and then determines the similarity between two documents by finding their distance with every other document in the corpus. It assigns a score to each pair of documents to measure the degree of content similarity. A threshold is set on the content similarity value of the document vectors to restrict the low similarity values. A histogram thresholding based method is used to estimate the value of this threshold from the similarity matrix of a corpus.
A new hybrid document clustering algorithm is proposed, which is a combination of a hierarchical and the k-means clustering technique. The hierarchical clustering technique produces some baseline clusters by using the proposed cluster distance function. The hierarchical clusters are named baseline clusters. These clusters are created in such a way that the documents inside a cluster are very similar to each other; in fact, the extensive similarity of every pair of documents of a baseline cluster is very high. The documents of two different baseline clusters are very dissimilar to each other. Thus the baseline clusters intuitively determine the actual categories of the document collection. Generally some singleton clusters remain after constructing the hierarchical clusters, and the distance between a singleton cluster and each baseline cluster is not so small. Hence the k-means clustering algorithm is performed to group each such document into the particular baseline cluster with which it has the highest content similarity. If, over several iterations of the k-means algorithm, each of these singleton clusters is grouped into the same baseline cluster, then they are likely to be assigned correctly. The significant property of the proposed technique is that it can automatically identify the number of clusters. It has become clear from the experiments that the number of clusters found for each corpus is very close to the actual number of categories. The experimental analysis using several well known TREC and Reuters data sets has shown that the proposed method performs significantly better than several existing document clustering algorithms.
The paper is organized as follows. Section 2 describes some related works. The document representation technique is presented in Section 3. The proposed document clustering technique is explained in Section 4. The evaluation criteria for the clusters generated by a particular method are described in Section 5. Section 6 presents the experimental results and a detailed analysis of the results. Finally we conclude and discuss the further scope of this work in Section 7.
2. Related works
There are two basic types of document clustering techniques available in the literature: hierarchical and partitional clustering techniques [8,11].
Hierarchical clustering produces a hierarchical tree of clusters where each individual level can be viewed as a combination of clusters in the next lower level. This hierarchical structure of clusters is also known as a dendrogram. Hierarchical clustering techniques can be divided into two categories: agglomerative and divisive. In an Agglomerative Hierarchical Clustering (AHC) method [30], starting with each document as an individual cluster, at each step the most similar clusters are merged until a given termination condition is satisfied. In a divisive method, starting with the whole set of documents as a single cluster, the method splits a cluster into smaller clusters at each step until a given termination condition is satisfied. Several halting criteria for AHC algorithms have been proposed, but no widely acceptable halting criterion is available for these algorithms. As a result some good clusters may be merged, which will eventually be meaningless to the user. There are mainly three variations of AHC techniques for document clustering: the single-link, complete-link and group-average hierarchical methods [6].
In the single-link method the similarity between a pair of clusters is calculated as the similarity between their two most similar documents, one from each cluster. The complete-link method measures the similarity between a pair of clusters as the similarity between their two least similar documents, one from each cluster. The group-average method merges the two clusters with the highest average similarity among all pairs of clusters, where average similarity means the average of the similarities between the documents of the two clusters. In a divisive hierarchical clustering technique, initially the method takes the whole data set to be a single cluster. Then at each step the method chooses one of the existing clusters and splits it into two. The process continues till only singleton clusters remain or a given halting criterion is reached. Generally the cluster with the least overall similarity is chosen for splitting [30].
In a recent study, Lai et al. proposed an agglomerative hierarchical clustering algorithm that uses a dynamic k-nearest neighbor list for each cluster. The clustering technique is named the Dynamic k-Nearest Neighbor Algorithm (DKNNA) [16]. The method maintains a dynamic list of the k nearest neighbors of each cluster. Initially the method treats each document as a cluster and finds the k nearest neighbors of each cluster. The two clusters at minimum distance are merged and their nearest neighbors are updated accordingly; then the minimum distant clusters are found and merged again, and so on. The algorithm continues until the desired number of clusters is obtained. In the merging and updating process of each iteration, the k nearest neighbors of the clusters that are affected by the merging are updated. If the set of k nearest neighbors becomes empty for some of the clusters being updated, their nearest neighbors are determined by searching all the clusters. Thus the approach can guarantee the exactness of the nearest neighbors of a cluster and can obtain good quality clusters [16]. Although the algorithm has shown good results for some artificial and image data sets, it has two


limitations when applied to text data sets: the method needs knowledge of the desired number of clusters, which is very difficult to predict, and it is problematic to determine a valid k for text data sets.
In contrast to hierarchical clustering techniques, partitional clustering techniques allocate data into a previously known, fixed number of clusters. The most commonly used partitional clustering technique is the k-means method, where k is the desired number of clusters [13]. Initially k documents are chosen randomly from the data set; they are called seed points. Each document is assigned to its nearest seed point, thereby creating k clusters. Then the centroids of the clusters are computed, and each document is assigned to its nearest centroid. The same process continues until the clustering does not change, i.e., the centroids in two consecutive iterations remain the same. Generally, the number of iterations is fixed by the user: the procedure stops if it converges to a solution, i.e., the centroids are the same for two consecutive iterations, or it terminates after a fixed number of iterations. The k-means algorithm is advantageous for its low computational complexity [23]; it takes linear time to build the clusters. The main disadvantage is that the number of clusters is fixed and it is very difficult to select a valid k for an unknown text data set. Also, there is no universally acceptable way of choosing the initial seed points. Recently Chiang et al. proposed a time efficient k-means algorithm that compresses and removes, at each iteration, the patterns that are unlikely to change their membership thereafter [22], but the limitations of the k-means clustering technique have not been addressed.
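As a concrete illustration of the k-means procedure described above, the sketch below clusters tf-idf document vectors with scikit-learn. The toy documents, the choice of k and the number of random restarts are assumptions made only for this example; the sketch also exhibits the limitation discussed in the text, namely that k must be supplied in advance.

```python
# A minimal k-means sketch on tf-idf document vectors (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["stock markets fell today", "the team won the match",
        "shares and bonds recovered", "a thrilling game for the fans"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # documents as tf-idf vectors
km = KMeans(n_clusters=2, n_init=10, max_iter=100, random_state=0)  # k assumed known; 10 restarts
labels = km.fit_predict(X)

print(labels)                        # cluster id of each document
print(km.cluster_centers_.shape)     # k centroids in term space
```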
The bisecting k-means method [30] is a variation of the basic k-means algorithm that tries to improve the quality of the clusters produced by k-means. In each iteration it selects the largest existing cluster (the whole data set in the first iteration) and divides it into two subsets using the k-means (k = 2) algorithm. This process is continued till k clusters are formed. The bisecting k-means algorithm generally produces clusters of almost uniform size. Thus it can perform better than the k-means algorithm when the actual groups of a data set are of similar size, i.e., when the numbers of documents in the categories of a corpus are close to each other. On the contrary, the method produces poor clusters for corpora where the numbers of documents in the categories differ greatly. This method also faces the same difficulties as the k-means algorithm in choosing the initial seed points and a proper value of the parameter k.
The buckshot algorithm is a combination of the basic k-means and hierarchical clustering methods. It tries to improve the performance of the k-means algorithm by choosing better initial centroids [26]. It applies a hierarchical clustering algorithm to a sample of documents from the corpus in order to find robust initial centroids, and then the k-means algorithm is performed to find the clusters using these robust centroids as the initial centroids [3]. However, repeated calls to this algorithm may produce different partitions. If the initial random sample does not represent the whole data set properly, the resulting clusters may be of poor quality. Note that an appropriate value of k is necessary for this method too.
Spectral clustering is a very popular clustering method which works on the similarity matrix rather than the original term-document matrix, using the idea of graph cuts. It uses the top eigenvectors of the similarity matrix derived from the similarity between documents [25]. The basic idea is to construct a weighted graph from the corpus, where each node represents a document and each weighted edge represents the similarity between two documents. In this technique the clustering problem is formulated as a graph cut problem. The core of this theory is the eigenvalue decomposition of the Laplacian matrix of the weighted graph obtained from the data [10]. Let X = {d_1, d_2, ..., d_N} be the set of N documents to cluster, and let S be the N × N similarity matrix where S_ij represents the similarity between the documents d_i and d_j. Ng et al. [25] proposed a spectral clustering algorithm which partitions the Laplacian matrix L into k subsets using the k largest eigenvectors, and they used a Gaussian kernel
\[
S_{ij} = \exp\!\left(-\frac{\rho(d_i, d_j)}{2\sigma^2}\right)
\]
on the similarity matrix. Here ρ(d_i, d_j) denotes the similarity between d_i and d_j and σ is the scaling parameter. The Gaussian kernel is used to get rid of the curse of dimensionality. The main difficulty of using a Gaussian kernel is that it is sensitive to the parameter σ [21]. A wrong value of σ may highly degrade the quality of the clusters. It is extremely difficult to select a proper value of σ for a document collection, since text data sets are generally sparse with high dimension. It should be noted that the method also suffers from the limitations of the k-means method discussed above.
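The Gaussian-kernel step can be sketched as follows. Note that, following the usual convention, the kernel here is applied to a pairwise distance rather than to the similarity ρ itself; the toy data, σ and the number of clusters are assumptions made for illustration, not the setup of Ng et al.

```python
# Spectral clustering on a precomputed Gaussian-kernel affinity matrix (illustrative sketch).
import numpy as np
from sklearn.metrics.pairwise import cosine_distances
from sklearn.cluster import SpectralClustering

X = np.random.rand(60, 300)                 # toy "document vectors" (assumed data)
D = cosine_distances(X)                     # pairwise distances between documents
sigma = 0.2                                 # scaling parameter; the result is sensitive to it
S = np.exp(-D ** 2 / (2 * sigma ** 2))      # Gaussian kernel affinity matrix

labels = SpectralClustering(n_clusters=4, affinity="precomputed",
                            random_state=0).fit_predict(S)
print(np.bincount(labels))                  # cluster sizes
```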
Non-negative Matrix Factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. It finds a non-negative factorization of a given non-negative matrix [19]. Xu et al. [33] have demonstrated that NMF performs well for text clustering compared to related methods such as singular value decomposition and latent semantic indexing. The technique factorizes the original term-document matrix D approximately as D ≈ UV^T, where U is a non-negative matrix of size n × m and V is an N × m non-negative matrix. The basis vectors in U can be interpreted as a set of terms in the vocabulary of the corpus, while V describes the contribution of the documents to these terms. The matrices U and V are randomly initialized, and their contents are iteratively estimated [1]. The Non-negative Matrix Factorization method attempts to determine U and V which minimize the following objective function:
\[
J = \frac{1}{2}\,\|D - UV^{T}\|^{2} \qquad (1)
\]
where the squared norm denotes the sum of squares of all the elements of the matrix. This is an optimization problem with respect to the matrices U = [u_ik] and V = [v_jk], ∀i = 1, 2, ..., n, ∀j = 1, 2, ..., N and k = 1, 2, ..., m, and as the matrices U and V are non-negative we have u_ik ≥ 0 and v_jk ≥ 0. This is a typical constrained non-linear optimization problem and can be solved using the Lagrange method [3]. An interesting property of the NMF technique is that it can also be used to find word clusters instead of document clusters: the columns of U can be used to discover a basis which corresponds to word clusters. The NMF algorithm has its disadvantages too. The objective function of Eq. (1) is convex in either U or V, but not in both U and V, which means


that the algorithm can only guarantee convergence to a local minimum. In practice, NMF users often compare the local minima obtained from several different starting points and use the results of the best local minimum found; on large corpora this may be problematic [17]. Another problem with the NMF algorithm is that it relies on random initialization and, as a result, the same data might produce different results across runs [1].
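A minimal sketch of NMF-based document clustering in the spirit of Xu et al., using scikit-learn; the toy corpus, the number of factors m and the use of the dominant factor of V as the cluster label are assumptions made for illustration.

```python
# NMF on a tf-idf matrix; each document is assigned to its dominant factor.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["oil prices rose sharply", "the striker scored twice",
        "crude futures and energy stocks", "the league title race tightened"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # documents x terms (non-negative)
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
V = model.fit_transform(X)       # documents x factors (the V matrix of the text)
U = model.components_.T          # terms x factors (the U matrix of the text)

labels = np.argmax(V, axis=1)    # dominant factor taken as the cluster label
print(labels)
```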
Xu et al. [34] proposed a concept factorization (CF) based document clustering technique, which models each cluster as a linear combination of the documents, and each document as a linear combination of the cluster centers. The document clustering is then accomplished by computing the two sets of linear coefficients, which is carried out by finding the non-negative solution that minimizes the reconstruction error of the documents. The major advantage of CF over NMF is that it can be applied to data containing negative values and the method can be implemented in the kernel space. The method has to select k concepts (cluster centers) initially, and it is very difficult to predict a value of k in practice. Dasgupta et al. [7] proposed a simple active clustering algorithm which is capable of producing multiple clusterings of the same data according to user interest. The advantage of this algorithm is that the user feedback it requires is minimal compared to other existing feedback-oriented clustering techniques, but the algorithm may suffer from human feedback if the topics are sensitive or when the perception varies. Carpineto et al. have presented a good survey of search results clustering techniques; they have elaborately explained and discussed various issues related to web clustering engines [5]. Wang et al. [32] proposed an efficient soft-constraint algorithm that obtains a clustering result in which the constraints are respected as far as possible. The algorithm is basically an optimization problem and it starts by randomly assuming some initial cluster centroids. The method can produce insignificant clusters if the initial centroids are not properly selected. Zhu et al. [35] proposed a semi-supervised Non-negative Matrix Factorization method based on the pairwise constraints must-link and cannot-link. In this method must-link constraints are used to control the distance of the data in the compressed form, and cannot-link constraints are used to control the encoding factor, to obtain a very good performance. The method has shown very good performance on some real life text corpora. The algorithm is a new variety of the NMF method, which again relies on random initialization and may produce different clusters over several runs on a corpus where the sizes of the categories vary greatly from each other.
3. Vector space model for document representation
The number of documents in the corpus is denoted throughout this article by N. The number of terms in the corpus is denoted by n. The i-th term is represented by t_i. The number of times the term t_i occurs in the j-th document is denoted by tf_ij, i = 1, 2, ..., n, j = 1, 2, ..., N. The document frequency df_i is the number of documents in which the term t_i occurs. The inverse document frequency idf_i = log(N / df_i) determines how frequently a word occurs in the document collection. The weight of the i-th term in the j-th document, denoted by w_ij, is determined by combining the term frequency with the inverse document frequency [29] as follows:
\[
w_{ij} = tf_{ij} \times idf_i = tf_{ij} \times \log\!\left(\frac{N}{df_i}\right), \quad \forall i = 1, 2, \ldots, n \text{ and } \forall j = 1, 2, \ldots, N
\]
The documents are represented using the vector space model in most clustering algorithms [29]. In this model each document d_j is considered to be a vector \vec{d}_j = (w_{1j}, w_{2j}, ..., w_{nj}), whose i-th component is w_ij.
The key factor in the success of any clustering algorithm is the selection of a good similarity measure. The similarity between two documents is obtained through some distance function. Given two document vectors \vec{d}_i and \vec{d}_j, it is required to find the degree of similarity (or dissimilarity) between them. Various similarity measures are available in the literature, but the commonly used measure is the cosine similarity between two document vectors [30], which is given by
\[
\cos(\vec{d}_i, \vec{d}_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{|\vec{d}_i|\,|\vec{d}_j|} = \frac{\sum_{k=1}^{n} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^2}\,\sqrt{\sum_{k=1}^{n} w_{jk}^2}}, \quad \forall i, j
\]

The weight of each term in a document is non-negative. As a result the cosine similarity is non-negative and bounded between 0 and 1. cos(\vec{d}_i, \vec{d}_j) = 1 means the documents are exactly similar, and the similarity decreases as the value decreases towards 0. An important property of the cosine similarity is its independence of document length. Thus cosine similarity has become popular as a similarity measure in the vector space model [14]. Let D = {d_1, d_2, ..., d_r} be a set of r documents, where each document has n terms. The centroid of D, denoted D_cn, can be calculated as
\[
D_{cn} = \frac{1}{r} \sum_{j=1}^{r} \vec{d}_j,
\]
where \vec{d}_j is the corresponding vector of document d_j.
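The following sketch mirrors the tf-idf weighting, cosine similarity and centroid formulas above with NumPy. The tiny term-frequency matrix is an invented example, and the small epsilon guarding against zero-length vectors is an implementation convenience, not part of the paper.

```python
# tf-idf weighting, cosine similarity and centroid, following the formulas above.
import numpy as np

tf = np.array([[2, 0, 1, 0],      # term frequencies: rows = terms (n = 3), columns = documents (N = 4)
               [0, 3, 0, 1],
               [1, 1, 1, 1]], dtype=float)

n, N = tf.shape
df = np.count_nonzero(tf, axis=1)            # df_i: number of documents containing term t_i
idf = np.log(N / df)                         # idf_i = log(N / df_i)
W = tf * idf[:, None]                        # w_ij = tf_ij * idf_i

def cosine(di, dj, eps=1e-12):
    """cos(d_i, d_j) = d_i . d_j / (|d_i| |d_j|)."""
    return float(di @ dj / (np.linalg.norm(di) * np.linalg.norm(dj) + eps))

d0, d1 = W[:, 0], W[:, 1]
print(cosine(d0, d1))

centroid = W.mean(axis=1)                    # D_cn = (1/r) * sum of the document vectors
print(centroid)
```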
4. Proposed clustering technique for effective grouping of documents
A combination of the hierarchical clustering and k-means clustering methods has been introduced, based on a similarity assessment technique, to effectively group the documents. The existing document clustering algorithms discussed so far determine the (content) similarity of a pair of documents for putting them into the same cluster. Generally the content similarity is determined by the cosine of the angle between two document vectors. The cosine similarity essentially checks the number of common terms present in the documents. If two documents contain many common terms then the documents


are very likely to be similar, but the difficulty is that there is no clear explanation as to how many common terms identify two documents as similar. Text data sets are high dimensional and most of the terms do not occur in every document. Hence the issue is to find the content similarity in such a way that it can restrict the low similarity values. The actual content similarity between two documents may not be found properly by checking only the individual terms of the documents. Intuitively, if two documents are similar content wise then they should have a similar type of relation with most of the other documents, i.e., if two documents x and y have similar content and x is similar to any other document z, then y must be similar or somehow related to z. This important characteristic is not captured by the cosine similarity measure.
4.1. A similarity assessment technique
A similarity measure, Extensive Similarity (ES), is used to find the similarity between two documents in the proposed work. The measure extensively checks all the documents in the corpus to determine the similarity: the extensive similarity between two documents is determined depending on their distances with every other document in the corpus. Intuitively, two documents are exactly similar if they have sufficient content similarity and they have almost the same distance with every other document in the corpus (i.e., both are either similar or dissimilar to all the other documents) [18]. The content similarity is turned into a binary valued distance function. The distance between two documents is minimum, i.e., 0, when they have sufficient content similarity; otherwise the distance is 1, i.e., they have very low content similarity. The distance between two documents d_i and d_j, ∀i, j, is determined by putting a threshold θ ∈ [0, 1] on their content similarity as follows:
\[
\mathrm{dis}(d_i, d_j) =
\begin{cases}
1 & \text{if } \rho(d_i, d_j) \le \theta\\
0 & \text{otherwise}
\end{cases}
\]
where ρ is the similarity measure used to find the content similarity between d_i and d_j. Here θ is a threshold on the content similarity and it is used to restrict the low similarity values. A data dependent method for estimating the value of θ is discussed later. In the context of document clustering, ρ is taken to be the cosine similarity, i.e., ρ(d_i, d_j) = cos(\vec{d}_i, \vec{d}_j), where \vec{d}_i and \vec{d}_j are the corresponding vectors of documents d_i and d_j respectively. If dis(d_i, d_j) = 1 then we can strictly say that the documents are dissimilar. On the other hand, if the distance is 0, i.e., cos(\vec{d}_i, \vec{d}_j) > θ, then they have sufficient content similarity and the documents are somehow related to each other. Let us assume that d_i and d_j have cosine similarity 0.52, that d_j and d_0 (another document) have cosine similarity 0.44, and that θ = 0.1. Hence both dis(d_i, d_j) = 0 and dis(d_j, d_0) = 0, and the task is to distinguish these two distances of the same value.
The extensive similarity is thus designed to find the grade of similarity of a pair of documents which are similar content wise [18]. If dis(d_i, d_j) = 0 then the extensive similarity finds the individual content similarities of d_i and d_j with every other document, and assigns a score μ to denote the extensive similarity between the documents as below:
\[
\mu_{i,j} = \sum_{k=1}^{N} \left|\,\mathrm{dis}(d_i, d_k) - \mathrm{dis}(d_j, d_k)\,\right|
\]

Thus the extensive similarity between documents d_i and d_j, ∀i, j, is defined as
\[
\mathrm{ES}(d_i, d_j) =
\begin{cases}
N - \mu_{i,j} & \text{if } \mathrm{dis}(d_i, d_j) = 0\\
-1 & \text{otherwise}
\end{cases}
\]

Two documents d_i, d_j have the maximum extensive similarity N if the distance between them is zero and the distance between d_i and d_k is the same as the distance between d_j and d_k for every k. In general, if the above said distances differ μ_{i,j} times then the extensive similarity is N − μ_{i,j}. Unlike other similarity measures, ES takes into account the distances of the two documents d_i, d_j with respect to all the other documents in the corpus when measuring the distance between them [18]. μ_{i,j} indicates the number of documents with which the similarity of d_i is not the same as the similarity of d_j. As the value of μ_{i,j} increases, the similarity between the documents d_i and d_j decreases. If μ_{i,j} = 0 then d_i and d_j are exactly similar. In effect μ_{i,j} denotes a grade of dissimilarity: it indicates that d_i and d_j have different distances with μ_{i,j} documents. The extensive similarity is used to define the distance between two clusters in the first stage of the proposed document clustering method.
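A minimal NumPy sketch of the binary distance dis and the extensive similarity ES defined above; the random similarity matrix, the value of θ and the treatment of the diagonal are placeholders chosen only for illustration.

```python
# Extensive Similarity (ES) computed from a precomputed cosine-similarity matrix.
import numpy as np

def extensive_similarity(sim, theta):
    """sim: N x N cosine-similarity matrix; returns the N x N ES matrix."""
    N = sim.shape[0]
    dis = (sim <= theta).astype(int)                  # dis(d_i, d_j) = 1 if similarity <= theta, else 0
    np.fill_diagonal(dis, 0)                          # a document has distance 0 to itself (assumption)
    ES = np.full((N, N), -1, dtype=float)             # ES = -1 wherever dis = 1
    for i in range(N):
        for j in range(N):
            if dis[i, j] == 0:
                mu = np.sum(np.abs(dis[i] - dis[j]))  # mu_{i,j}: number of disagreements
                ES[i, j] = N - mu
    return ES

rng = np.random.default_rng(0)
S = rng.random((6, 6)); S = (S + S.T) / 2; np.fill_diagonal(S, 1.0)   # toy symmetric similarities
print(extensive_similarity(S, theta=0.4))
```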
A distance function is proposed to create the baseline clusters. It finds the distance between two clusters, say C_x and C_y. Let T_xy be a multiset consisting of the extensive similarities between each pair of documents, one from C_x and the other from C_y, defined as
\[
T_{xy} = \{\, \mathrm{ES}(d_i, d_j) : \mathrm{ES}(d_i, d_j) \ge 0,\ \forall d_i \in C_x \text{ and } d_j \in C_y \,\}
\]
Note that T_xy contains all the occurrences of the same extensive similarity value (if any) for different pairs of documents. The proposed distance between two clusters C_x and C_y is defined as
\[
\mathrm{dist\_cluster}(C_x, C_y) =
\begin{cases}
\infty & \text{if } T_{xy} = \emptyset\\
N - \mathrm{avg}(T_{xy}) & \text{otherwise}
\end{cases}
\]


The function dist_cluster finds the distance between two clusters C_x and C_y as N minus the average of the multiset of non-negative ES values. The distance between C_x and C_y is infinite if no two documents, one from each cluster, have a non-negative ES value, i.e., no similar documents are present in C_x and C_y. Intuitively, an infinite distance between clusters denotes that every pair of documents, one from C_x and the other from C_y, either shares very few terms or has no term in common, i.e., they have a very low content similarity. Later we shall observe that any two clusters with infinite distance between them remain segregated from each other. Thus, a significant characteristic of the function dist_cluster is that it never merges two clusters with infinite distance between them.
The proposed document clustering algorithm initially treats each document as a singleton cluster. It then repeatedly merges the two clusters at minimum distance, provided that distance is within a previously fixed limit α. The merging continues as long as there exist two clusters whose distance is less than or equal to α. The clusters which are not singletons are named Baseline Clusters (BC). The selection of the value of α is discussed in Section 6.2 of this article.
4.2. Properties of dist_cluster
The important properties of the function dist_cluster are described below.
• The minimum distance between any two clusters C_x and C_y is 0, which occurs when avg(T_xy) = N, i.e., when the extensive similarity between every pair of documents, one from C_x and the other from C_y, is N. In practice this minimum value is rarely observed between two different document clusters. The maximum value of dist_cluster is infinite.
• If C_x = C_y then dist_cluster(C_x, C_y) = N − avg(T_xx), which is not necessarily 0. Conversely,
\[
\mathrm{dist\_cluster}(C_x, C_y) = 0 \;\Rightarrow\; \mathrm{avg}(T_{xy}) = N \;\Rightarrow\; \mathrm{ES}(d_i, d_j) = N, \quad \forall d_i \in C_x \text{ and } \forall d_j \in C_y.
\]
Now ES(d_i, d_j) = N implies that the two documents d_i and d_j are exactly similar. Note that ES(d_i, d_j) = N implies dis(d_i, d_j) = 0 and μ_{i,j} = 0. Here dis(d_i, d_j) = 0 implies that d_i and d_j are similar in terms of content, but they are not necessarily the same, i.e., we cannot say d_i = d_j if dis(d_i, d_j) = 0. Thus dist_cluster(C_x, C_y) = 0 does not imply C_x = C_y, and hence dist_cluster is not a metric.
• It is symmetric: for every pair of clusters C_x and C_y, dist_cluster(C_x, C_y) = dist_cluster(C_y, C_x).
• dist_cluster(C_x, C_y) ≥ 0 for any pair of clusters C_x and C_y.
• For any three clusters C_x, C_y and C_0, we may have
\[
\mathrm{dist\_cluster}(C_x, C_y) + \mathrm{dist\_cluster}(C_y, C_0) - \mathrm{dist\_cluster}(C_x, C_0) < 0
\]
when 0 ≤ dist_cluster(C_x, C_y) < N, 0 ≤ dist_cluster(C_y, C_0) < N and dist_cluster(C_x, C_0) = ∞. Thus it does not satisfy the triangle inequality.
4.3. A method for the estimation of θ
There are several types of document collections in real life. The similarities or dissimilarities between documents present in one corpus may not be the same as the similarities or dissimilarities in other corpora, since the characteristics of the corpora are different [18]. Additionally, one may view the clusters present in a corpus (or in different corpora) under different scales, and different scales produce different partitions. Similarities corresponding to one scale in one corpus may not be the same as the similarities corresponding to the same scale in a different corpus. This is the reason for making the threshold on similarities data dependent [18]. In fact, we feel that a fixed threshold on similarities will not give satisfactory results on several data sets.
There are several methods available in the literature for finding a threshold for a two-class classification problem (one class corresponding to similar points, the other to dissimilar points). A popular method for such classification is histogram thresholding [12].
Let, for a given corpus, the number of distinct similarity values be p, and let the similarity values be s_0, s_1, ..., s_{p−1}. Without loss of generality, let us assume that (a) s_i < s_j if i < j, and (b) s_{i+1} − s_i = s_1 − s_0, ∀i = 1, 2, ..., p − 2. Let g(s_i) denote the number of occurrences of s_i, ∀i = 0, 1, ..., p − 1. Our aim is to find a threshold θ on the similarity values such that a similarity value s < θ implies that the corresponding documents are practically dissimilar, and otherwise they are similar. The aim is to make the choice of the threshold data dependent. The basic steps of the histogram thresholding technique are as follows:
• Obtain the histogram corresponding to the given problem.
• Reduce the ambiguity in the histogram. Usually this step is carried out using a window. One of the earliest such techniques is the moving average technique in time series analysis [2], which is used to reduce the local variations in a histogram. It is convolved with the histogram, resulting in a less ambiguous histogram. We have used weighted moving averages of the g(s_i) values with a window of length 5:
\[
f(s_i) = \frac{g(s_{i-2}) + g(s_{i-1}) + g(s_i) + g(s_{i+1}) + g(s_{i+2})}{5\sum_{j=0}^{p-1} g(s_j)}, \quad \forall i = 2, 3, \ldots, p-3
\]


• Find the valley points in the modified histogram. A point s_i corresponding to the weight function f(s_i) is said to be a valley point if f(s_{i−1}) > f(s_i) and f(s_i) < f(s_{i+1}).
• The first valley point of the modified histogram is taken as the required threshold on the similarity values.
In the modified histogram corresponding to f, there are three possibilities regarding the valley points, which are stated below.
(i) There is no valley point in the histogram. If there is no valley point then the histogram is either a constant function, or an increasing or decreasing function of the similarity values. These three types of histograms impose strong conditions on the similarity values which are unnatural for a document collection. Another possible histogram with no valley point is a unimodal histogram: there is a single mode, the number of occurrences of a similarity value increases as the similarity values increase towards the mode, and decreases as the similarity values move away from the mode. This is also an unnatural setup, since there is no reason for such a strong property to be satisfied by a histogram of similarity values.
(ii) Another option is that there exists exactly one valley point in the histogram. The number of occurrences of the valley point is smaller than the number of occurrences of the other similarity values in a neighborhood of the valley point. In practice this type of example is also rare.
(iii) The third and most usual possibility is that the number of valley points is more than one, i.e., there exist several variations in the number of occurrences of the similarity values. Here the task is to find a threshold from a particular valley. In the proposed technique the threshold is selected from the first valley point. The threshold could instead be selected from the second, third or a higher valley, but then some genuinely similar documents, whose similarity lies between the first valley point and the higher one, would be treated as dissimilar. In practice text data sets are sparse and high dimensional, hence high similarities between documents are observed in very few cases. It is true that, for a high θ value, the extensive similarity between every two documents in a cluster will be high, but the number of documents in each cluster will be too small due to the sparsity of the data. Hence θ is selected from the first valley point, as the similarity values in the other valleys are higher than the similarity values in the first valley point.
Generally similarity values do not satisfy the property that s_{i+1} − s_i = s_1 − s_0, ∀i = 1, 2, ..., p − 2. In reality there are p + 1 distinct class intervals of similarity values, where the i-th class interval is the semi-closed interval [v_{i−1}, v_i), for i = 1, 2, ..., p. The (p + 1)-th class interval corresponds to the set where each similarity value is greater than or equal to v_p. Here g(s_i) corresponds to the number of similarity values falling in the i-th class interval. The v_i's are taken in such a way that v_i − v_{i−1} = v_1 − v_0, ∀i = 2, 3, ..., p. Note that v_0 = 0 and the value of v_p is decided on the basis of the observations. The last interval, i.e., the (p + 1)-th interval, is not considered for the valley point selection, since we assume that if any similarity value is greater than or equal to v_p then the corresponding documents are actually similar. Under this setup we have taken s_i = (v_i + v_{i+1})/2, ∀i = 0, 1, ..., p − 1. Note that the s_i's so defined satisfy the properties (a) s_i < s_j if i < j and (b) s_{i+1} − s_i = s_1 − s_0, ∀i = 1, 2, ..., p − 2. The proposed method finds the valley point and its corresponding class interval. The minimum value of that particular class interval is taken as the threshold.
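A rough NumPy sketch of the θ-estimation procedure above: bin the off-diagonal similarity values, smooth the counts with the length-5 moving average, and return the lower edge of the bin at the first valley. The bin width, the cut-off v_p = 0.5 and the toy similarity matrix are assumptions made for illustration.

```python
# Estimate theta by histogram thresholding of the pairwise similarity values.
import numpy as np

def estimate_theta(sim, bin_width=0.005, v_p=0.5):
    iu = np.triu_indices_from(sim, k=1)
    vals = sim[iu]
    vals = vals[vals < v_p]                      # similarities >= v_p are taken as "similar" anyway
    edges = np.arange(0.0, v_p + bin_width, bin_width)
    g, _ = np.histogram(vals, bins=edges)        # g(s_i): counts per class interval
    total = g.sum()
    # length-5 weighted moving average f(s_i), defined for i = 2, ..., p-3
    f = np.array([(g[i-2] + g[i-1] + g[i] + g[i+1] + g[i+2]) / (5.0 * total)
                  for i in range(2, len(g) - 2)])
    for i in range(1, len(f) - 1):
        if f[i - 1] > f[i] and f[i] < f[i + 1]:  # first valley point
            bin_index = i + 2                    # shift back to the index of g
            return edges[bin_index]              # minimum value of that class interval
    return edges[0]                              # fallback: no valley found

rng = np.random.default_rng(1)
S = rng.beta(1, 8, size=(100, 100)); S = (S + S.T) / 2
print(estimate_theta(S))
```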
Example. Let us consider an example of histogram thresholding for the selection of θ for a corpus. The similarity values and the values of g and f are shown in Table 1. Initially we have divided the similarity values into class intervals of length 0.001. Let us assume that there are 80 such intervals of equal length and that s_i represents the middle point of the i-th class interval for i = 0, 1, ..., 79. The values of the g(s_i)'s and the corresponding f(s_i)'s are then found. Note that the moving averages have been used to remove the ambiguities in the g(s_i) values. Valleys in the similarity values corresponding to the 76 f(s_i)'s are then found. Let s_40, which is equal to 0.0405, be the first valley point, i.e., f(s_39) > f(s_40) and f(s_40) < f(s_41). The minimum similarity value of the class interval 0.040–0.041 is taken as the threshold θ.
Table 1
An example of θ estimation by the histogram thresholding technique.

Class intervals (v_i's)   s_i's    No. of elements of the intervals   Moving averages
0.000–0.001               0.0005   g(s_0)                             –
0.001–0.002               0.0015   g(s_1)                             –
0.002–0.003               0.0025   g(s_2)                             f(s_2)
...                       ...      ...                                ...
0.040–0.041               0.0405   g(s_40)                            f(s_40)
...                       ...      ...                                ...
0.077–0.078               0.0775   g(s_77)                            f(s_77)
0.078–0.079               0.0785   g(s_78)                            –
0.079–0.080               0.0795   g(s_79)                            –
≥ 0.080                   –        g(s_80)                            –


4.4. Procedure of the proposed document clustering technique


The proposed document clustering technique is described in Algorithm 1. Initially each document is taken as a cluster; therefore Algorithm 1 starts with N individual clusters. In the first stage of Algorithm 1, a distance matrix is developed whose ij-th entry is the value dist_cluster(C_i, C_j), where C_i and C_j are the i-th and j-th clusters respectively. It is a square matrix with N rows and N columns for the N documents in the corpus. Each row or column of the distance matrix is treated as a cluster. Then the Baseline Clusters (BC) are generated by merging the clusters whose distance is within a fixed threshold α. The value of α is constant throughout Algorithm 1. The merging process stated in step 3 of Algorithm 1 merges two rows, say i and j, and the corresponding columns of the distance matrix by following a convention regarding numbering: it merges the two rows into one, numbers the resultant row as the minimum of i and j, and removes the other row. The same numbering is followed for the columns. Then the index structure of the distance matrix is updated accordingly.
Algorithm 1. Iterative document clustering by baseline clusters

Input: (a) A set of clusters C = {C_1, C_2, ..., C_N}, where N is the number of documents and C_i = {d_i}, i = 1, 2, ..., N, d_i being the i-th document of the corpus.
(b) A distance matrix DM_ij = dist_cluster(C_i, C_j), ∀i, j ∈ N.
(c) α, the desired threshold on dist_cluster, and iter, the number of iterations.

Steps of the algorithm:
1:  for each pair of clusters C_i, C_j ∈ C where C_i ≠ C_j and N > 1 do
2:    if dist_cluster(C_i, C_j) ≤ α then
3:      DM ← merge(DM, i, j)
4:      C_i ← C_i ∪ C_j
5:      N ← N − 1
6:    end if
7:  end for
8:  nbc ← 0; BC ← ∅                               // Baseline clusters are initialized to the empty set
9:  nsc ← 0; SC ← ∅                               // Singleton clusters are initialized to the empty set
10: for i = 1 to N do
11:   if |C_i| > 1 then
12:     nbc ← nbc + 1                             // Number of baseline clusters
13:     BC_nbc ← C_i                              // Baseline clusters
14:   else
15:     nsc ← nsc + 1                             // Number of singleton clusters
16:     SC_nsc ← C_i                              // Singleton clusters
17:   end if
18: end for
19: if nsc = 0 or nbc = 0 then
20:   return BC                                   // No singleton cluster exists, or no baseline cluster is generated
21: else
22:   EBC_k ← BC_k, ∀k = 1, 2, ..., nbc           // Initialization of extended baseline clusters
23:   ebct_k ← centroid of BC_k, ∀k = 1, 2, ..., nbc   // Extended base centroids
24:   nct_k ← 0, ∀k = 1, 2, ..., nbc; it ← 0
25:   while ebct_k ≠ nct_k, ∀k = 1, 2, ..., nbc and it ≤ iter do
26:     ebct_k ← centroid of EBC_k, ∀k = 1, 2, ..., nbc
27:     NCL_k ← BC_k, ∀k = 1, 2, ..., nbc         // New set of clusters at each iteration
28:     for j = 1 to nsc do
29:       if ebct_k is the nearest centroid of SC_j, ∀k = 1, 2, ..., nbc then
30:         NCL_k ← NCL_k ∪ SC_j                  // Merger of singleton clusters into baseline clusters
31:       end if
32:     end for
33:     nct_k ← centroid of NCL_k, ∀k = 1, 2, ..., nbc
34:     EBC_k ← NCL_k, ∀k = 1, 2, ..., nbc
35:     it ← it + 1
36:   end while
37:   return EBC
38: end if

Output: A set of extended baseline clusters EBC = {EBC_1, EBC_2, ..., EBC_nbc}


After constructing the baseline clusters, some clusters may remain as singleton clusters. Every such singleton cluster (i.e., a single document) is merged with one of the baseline clusters using the k-means algorithm in the second stage. In the second stage the centroids of the baseline clusters (i.e., the non-singleton clusters) are calculated and they are named base centroids. The value of k for the k-means algorithm is taken as the number of baseline clusters. The remaining documents, which are not included in the baseline clusters, are grouped by the iterative steps of the k-means algorithm using these base centroids as the initial seed points. Note that only the documents which are not included in the baseline clusters are considered for clustering in this stage; however, for the calculation of a cluster centroid, every document in the cluster, including the documents in the baseline clusters, is considered. A document is put into the cluster for which the content similarity between the document and the base centroid is maximum. The newly formed clusters are named Extended Baseline Clusters (EBC).
It may be noted that the processing in the second stage is not needed if no singleton cluster is produced in the first stage. We believe that such a possibility is remote in real life, and none of our experiments yielded such an outcome. However, such a clustering is desirable as it produces compact clusters.
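The second stage can be sketched as below, reusing the document-vector matrix W (terms x documents) and the baseline and singleton index lists from the earlier sketches. The convergence test on the cluster contents and the fixed iteration cap are simplifications of Algorithm 1, given here only as an assumption-laden illustration.

```python
# Second stage: assign the singleton documents to their nearest base centroid (illustrative).
import numpy as np

def extend_baseline_clusters(W, baselines, singletons, iters=100):
    """W: terms x documents matrix; baselines: list of lists of doc indices; singletons: flat list."""
    clusters = [list(b) for b in baselines]
    for _ in range(iters):
        centroids = [W[:, c].mean(axis=1) for c in clusters]          # current base centroids
        new_clusters = [list(b) for b in baselines]                   # restart from the baseline clusters
        for j in singletons:
            sims = [float(W[:, j] @ c / (np.linalg.norm(W[:, j]) * np.linalg.norm(c) + 1e-12))
                    for c in centroids]                               # cosine with each centroid
            new_clusters[int(np.argmax(sims))].append(j)              # most similar centroid wins
        if new_clusters == clusters:                                  # assignments no longer change
            break
        clusters = new_clusters
    return clusters

# usage: EBC = extend_baseline_clusters(W, baselines, [s[0] for s in singletons])
```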
4.5. Impact of extensive similarity on the document clustering technique
The extensive similarity plays a significant role in constructing the baseline clusters. The documents in the baseline clusters are very similar to each other, as their extensive similarity is very high (above a threshold θ on the content similarity). It may be observed that whenever two baseline clusters are merged in the first stage, the similarity between any two documents of the baseline clusters is at least θ. Note that the distance between two different baseline clusters is greater than or equal to α, whereas the distance between a baseline cluster and a singleton cluster (or between two singleton clusters) may be infinite, in which case they would never merge to construct a new baseline cluster. An infinite distance between two such clusters indicates that the extensive similarity between every document of the baseline cluster and the document of the singleton cluster (or between the documents of the two singleton clusters) is −1. Thus the baseline clusters intuitively determine the categories of the document collection by measuring the extensive similarity between documents.
4.6. Discussion
The proposed clustering method is a combination of the baseline clustering and k-means clustering methods. Initially it creates some baseline clusters. The documents which do not have much similarity with any of the baseline clusters remain as singleton clusters; therefore the k-means method is applied to group these documents into the corresponding baseline clusters. The k-means algorithm has been used due to its low computational complexity; it is also useful as it can be easily implemented. However, the performance of the k-means algorithm suffers from the selection of the initial seed points, and there is no method for selecting a valid k. It is very difficult to select a proper k for a sparse text data set with high dimensionality. In various other clustering techniques, e.g., spectral clustering and the buckshot algorithm, k-means has been used as an intermediary stage; these algorithms also suffer from the said limitations of the k-means method. Note that the proposed clustering method overcomes these two major limitations of the k-means clustering algorithm and utilizes the effectiveness of the k-means method by introducing the idea of baseline clusters. The effectiveness of the proposed technique in terms of clustering quality may be observed in the experimental results section later.
The proposed technique is designed like the buckshot clustering algorithm. The main difference between buckshot and the proposed method lies in the design of the hierarchical clusters in the first stage of the two methods. Buckshot uses the traditional single-link clustering technique to develop the hierarchical clusters that provide the initial centroids for the k-means clustering in the second stage. Thus buckshot may suffer from the limitations of both the single-link clustering technique (e.g., the chaining effect [8]) and the k-means clustering technique. In practice text data sets contain many categories of uneven sizes. In those data sets an initial random selection of √(kn) documents may not be appropriate, i.e., no documents may be selected from an original cluster if its size is small. Note that no random sampling is required for the proposed clustering technique. In the proposed method the hierarchical clusters are created using the extensive similarity between documents, and these hierarchical baseline clusters are not re-clustered in the second stage. In the second stage, the k-means algorithm is performed only to group those documents that have not been included in the baseline clusters, and the initial centroids are generated from these baseline clusters. In the buckshot algorithm, all the documents are taken into consideration for clustering by k-means; the single-link clustering technique is used only to create the initial seed points of the k-means algorithm. Later it can be seen from the experiments that the proposed method performs significantly better than the buckshot clustering technique.
The process of creating baseline clusters in the first stage of the proposed technique is quite similar to the group-average hierarchical document clustering technique [30]. Both techniques find the average of the similarities of the documents of two individual clusters in order to merge them into one. The proposed method finds the distance between two clusters using the extensive similarity, whereas the group-average hierarchical document clustering technique generally uses the cosine similarity. The group-average hierarchical clustering technique cannot distinguish two dissimilar clusters explicitly, unlike the proposed method. This is the main difference between the two techniques.


5. Evaluation criteria
If the documents within a cluster are similar to each other and dissimilar to the documents in the other clusters, then the clustering algorithm is considered to perform well. The data sets under consideration have labeled documents, hence quality measures based on labeled data are used here for comparison.
Normalized mutual information and f-measure are very popular and are used by a number of researchers [30,31] to measure the quality of a cluster using the information of the actual categories of the document collection. Let us assume that R is the set of categories and S is the set of clusters. Consider that there are I categories in R and J clusters in S. There are a total of N documents in the corpus, i.e., both R and S individually contain the N documents. Let n_i be the number of documents belonging to category i, m_j the number of documents belonging to cluster j, and n_ij the number of documents belonging to both category i and cluster j, for all i = 1, 2, ..., I and j = 1, 2, ..., J.
Mutual information is a symmetric measure that quantifies the statistical information shared between two distributions, and it provides an indication of the shared information between a set of categories and a set of clusters. Let I(R, S) denote the mutual information between R and S, and let E(R) and E(S) be the entropies of R and S respectively. I(R, S) and E(R) can be defined as
\[
I(R, S) = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{n_{ij}}{N} \log\!\left(\frac{N\,n_{ij}}{n_i\,m_j}\right), \qquad
E(R) = -\sum_{i=1}^{I} \frac{n_i}{N} \log\!\left(\frac{n_i}{N}\right)
\]
There is no upper bound for I(R, S), so for easier interpretation and comparison a normalized mutual information that ranges from 0 to 1 is desirable. The normalized mutual information (NMI) is defined by Strehl et al. [31] as follows:
\[
\mathrm{NMI}(R, S) = \frac{I(R, S)}{\sqrt{E(R)\,E(S)}}
\]
The f-measure determines the recall and precision of each cluster with respect to a corresponding category. Let, for a query, the set of relevant documents be from category i and the set of retrieved documents be from cluster j. Then recall, precision and f-measure are given as follows:
\[
\mathrm{Recall}_{ij} = \frac{n_{ij}}{n_i}, \qquad \mathrm{Precision}_{ij} = \frac{n_{ij}}{m_j}, \qquad
F_{ij} = \frac{2 \times \mathrm{Recall}_{ij} \times \mathrm{Precision}_{ij}}{\mathrm{Recall}_{ij} + \mathrm{Precision}_{ij}}, \quad \forall i, j
\]
If there is no common instance between a category and a cluster (i.e., n_ij = 0) then we shall assume F_ij = 0. The value of F_ij is maximum when Precision_ij = Recall_ij = 1 for a category i and a cluster j. Thus the value of F_ij lies between 0 and 1. The best f-measure among all the clusters is selected as the f-measure for the query of a particular category, i.e., F_i = max_{1 ≤ j ≤ J} F_ij, ∀i. The overall f-measure is the weighted average of the f-measures of the categories, F = Σ_{i=1}^{I} (n_i / N) F_i. We would like to maximize the f-measure and the normalized mutual information in order to achieve good quality clusters.
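A small sketch of both evaluation measures computed from a category-by-cluster contingency table, following the formulas above; the example labels are invented. Note that scikit-learn's normalized_mutual_info_score uses a related but not necessarily identical normalization, so the NMI here is written out explicitly.

```python
# NMI and F-measure from true category labels and predicted cluster labels.
import numpy as np

def contingency(true, pred):
    cats, clus = np.unique(true), np.unique(pred)
    return np.array([[np.sum((true == c) & (pred == k)) for k in clus] for c in cats], dtype=float)

def nmi(true, pred):
    n = contingency(true, pred); N = n.sum()
    ni, mj = n.sum(axis=1), n.sum(axis=0)
    ratio = np.where(n > 0, N * n / np.outer(ni, mj), 1.0)   # log(1) = 0 where n_ij = 0
    I = np.sum(n / N * np.log(ratio))
    E_R = -np.sum(ni / N * np.log(ni / N))
    E_S = -np.sum(mj / N * np.log(mj / N))
    return I / np.sqrt(E_R * E_S)

def f_measure(true, pred):
    n = contingency(true, pred); N = n.sum()
    ni, mj = n.sum(axis=1), n.sum(axis=0)
    R = n / ni[:, None]; P = n / mj[None, :]
    F = np.where(n > 0, 2 * R * P / (R + P + 1e-12), 0.0)    # F_ij = 0 when n_ij = 0
    return float(np.sum(ni / N * F.max(axis=1)))             # weighted average of the best F per category

true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
pred = np.array([1, 1, 0, 0, 0, 2, 2, 2])
print(round(nmi(true, pred), 3), round(f_measure(true, pred), 3))
```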
6. Experimental evaluation
6.1. Document collections
Reuters-21578 is a collection of documents that appeared on the Reuters newswire in 1987. The documents were originally assembled and indexed with categories by Carnegie Group, Inc. and Reuters, Ltd. The corpus contains 21,578 documents in 135 categories. Here we considered the ModApte version used in [4], in which there are 30 categories and 8067 documents. We have divided this corpus into four groups named rcv1, rcv2, rcv3 and rcv4.
The 20-Newsgroups corpus is a collection of news articles collected from 20 different news sources. Each news source constitutes a different category. In this data set, articles with multiple topics are cross-posted to multiple newsgroups, i.e., there are overlaps between several categories. The data set is named 20ns here.
The rest of the corpora were developed in the Karypis lab [15]. The corpora tr31, tr41 and tr45 are derived from the TREC-5, TREC-6 and TREC-7 collections (http://trec.nist.gov). The categories of tr31, tr41 and tr45 were generated from the relevance judgments provided in these collections. The corpus fbis was collected from the Foreign Broadcast Information Service data of TREC-5. The corpora la1 and la2 come from the Los Angeles Times data of TREC-5. The category labels of la1 and la2 were generated according to the names of the newspaper sections in which the articles appeared, such as Entertainment, Financial, Foreign, Metro, National and Sports. The documents that have a single label were selected for the la1 and la2 data sets. The corpora oh10 and oh15 were created from the OHSUMED collection, a subset of the MEDLINE database, which contains 233,445 documents indexed using 14,321 unique categories [15]. Different subsets of categories have been taken to construct these data sets.



Table 2
Data sets overview.

Data set   No. of documents   No. of terms   No. of categories
20ns       18,000             35,218         20
fbis       2463               2000           17
la1        3204               31,472         6
la2        3075               31,472         6
oh10       1050               3238           10
oh15       913                3100           10
rcv1       2017               12,906         30
rcv2       2017               12,912         30
rcv3       2017               12,820         30
rcv4       2016               13,181         30
tr31       927                10,128         7
tr41       878                7454           10
tr45       690                8261           10

The number of documents, the number of terms and the number of categories of these corpora can be found in Table 2. For each of the above corpora, the stop words have been removed using a standard English stop word list (http://www.textfixer.com/resources/common-english-words.txt). Then, by applying the standard Porter stemmer algorithm [27] for stemming, the inverted index is developed.
6.2. Experimental setup
Single-Link Hierarchical Clustering (SLHC) [30], Average-Link Hierarchical Clustering (ALHC) [30], the Dynamic k-Nearest Neighbor Algorithm (DKNNA) [16], k-means clustering [13], bisecting k-means clustering [30], buckshot clustering [6], spectral clustering [25] and clustering by Non-negative Matrix Factorization (NMF) [33] are selected for comparison with the proposed clustering technique. The k-means and bisecting k-means algorithms have been executed 10 times to reduce the effect of the random initialization of the seed points, and in each execution they have been iterated 100 times to reach a solution (if they do not converge earlier). The buckshot algorithm has also been executed 10 times to reduce the effect of the random selection of the initial √(kN) documents. The f-measure and NMI values of the k-means, bisecting k-means and buckshot clustering techniques shown here are the averages of the 10 different results. Note that the proposed method finds the number of clusters automatically from the data sets. The proposed clustering algorithm has been executed first, and then all the other algorithms have been executed to produce the same number of clusters as the proposed one. Tables 3 and 4 show the f-measure and NMI values respectively for all the data sets. The Number of Clusters (NCL) produced by the proposed method is also shown. The f-measure and NMI are calculated using these NCL values. The value of α is chosen as α = √N for a corpus of N documents. The NMF based clustering algorithm has been executed 10 times to reduce the effect of random initialization, and each time it has been iterated 100 times to reach a solution. The value of k for DKNNA is taken as k = 10. The value of σ for the spectral clustering technique is set by a search over values from 10 to 20 percent of the total range of the similarity values, and the one that gives the tightest clusters is picked, as suggested by Ng et al. [25].
The proposed histogram thresholding based technique for estimating the value of θ has been followed in the experiments. We have considered class intervals of length 0.005 for the similarity values. We have also assumed that a content similarity (here cosine similarity) value greater than 0.5 means that the corresponding documents are similar. Thus the issue here is to find a θ, 0 < θ < 0.5, such that a similarity value greater than θ denotes that the corresponding documents are similar. In the experiments we have used the method of moving averages with a window of length 5 for the convolution. The text data sets are generally sparse, the number of high similarity values is practically very low, and there are fluctuations in the heights of the histogram for two successive similarity values. Hence it is not desirable to take a window of length 3, as the method would then consider the heights of only the previous and the next value when calculating the f(s_i)'s. We have tried windows of length 7 and 9 on some of the corpora in the experiments, but the values of θ remain more or less the same as those selected with a window of length 5. On the other hand, windows of length 7 or 9 need more calculation than a window of length 5. These are the reasons for the choice of a window of length 5. It has been found that several local peaks and local valleys are removed by this method. The number of valley regions after smoothing the histogram by the method of moving averages is always found to be greater than three.
6.3. Analysis of results
Tables 3 and 4 show the comparison of the proposed document clustering method with the other methods using f-measure and NMI respectively, for all data sets. There are 104 comparisons for the proposed method using f-measure in Table 3. The proposed method performs better than the other methods in 91 cases, and in the remaining 13 cases other methods (e.g., the buckshot and spectral clustering algorithms) have an edge over the proposed method. A few of the exceptions where the other methods perform better than the proposed one are, e.g., SLHC and NMF for rcv3, where the f-measures of SLHC and NMF are respectively 0.408 and 0.511, while the f-measure of the proposed method is 0.294.

Table 3
Comparison of various clustering methods using f-measure.

Data sets  NCT  NCL  F-measure
                     BKM    KM     BS     SLHC   ALHC   DKNNA  SC     NMF    Proposed
20ns       20   23   0.357  0.449  0.436  0.367  0.385  0.408  0.428  0.445  0.474
fbis       17   19   0.423  0.534  0.516  0.192  0.192  0.288  0.535  0.435  0.584
la1        6    8    0.506  0.531  0.504  0.327  0.325  0.393  0.536  0.544  0.570
la2        6    6    0.484  0.550  0.553  0.330  0.328  0.405  0.541  0.542  0.563
oh10       10   12   0.304  0.465  0.461  0.205  0.206  0.381  0.527  0.481  0.500
oh15       10   10   0.363  0.485  0.482  0.206  0.202  0.366  0.516  0.478  0.532
rcv1       30   31   0.231  0.247  0.307  0.411  0.360  0.431  0.298  0.516  0.553
rcv2       30   30   0.233  0.281  0.324  0.404  0.353  0.438  0.312  0.489  0.517
rcv3       30   32   0.188  0.271  0.351  0.408  0.376  0.436  0.338  0.511  0.294
rcv4       30   32   0.247  0.322  0.289  0.405  0.381  0.440  0.401  0.509  0.289
tr31       7    7    0.558  0.665  0.646  0.388  0.387  0.457  0.589  0.545  0.678
tr41       10   10   0.564  0.607  0.593  0.286  0.280  0.416  0.557  0.537  0.698
tr45       10   11   0.556  0.673  0.681  0.243  0.248  0.444  0.605  0.596  0.750

NCT stands for number of categories; NCL stands for number of clusters. BKM, KM, BS, SLHC, ALHC, DKNNA, SC and NMF stand for Bisecting k-Means, k-Means, BuckShot, Single-Link Hierarchical Clustering, Average-Link Hierarchical Clustering, Dynamic k-Nearest Neighbor Algorithm, Spectral Clustering and Non-negative Matrix Factorization respectively.

Table 4
Comparison of various clustering methods using normalized mutual information.

Data sets  NCT  NCL  Normalized mutual information
                     BKM    KM     BS     SLHC    ALHC   DKNNA  SC     NMF    Proposed
20ns       20   23   0.417  0.428  0.437  0.270   0.286  0.325  0.451  0.432  0.433
fbis       17   19   0.443  0.525  0.524  0.051   0.362  0.405  0.520  0.446  0.544
la1        6    8    0.266  0.299  0.295  0.021   0.218  0.241  0.285  0.296  0.308
la2        6    6    0.249  0.312  0.323  0.021   0.215  0.252  0.335  0.360  0.386
oh10       10   12   0.226  0.352  0.333  0.050   0.157  0.239  0.417  0.410  0.406
oh15       10   10   0.213  0.352  0.357  0.067   0.155  0.236  0.358  0.357  0.380
rcv1       30   31   0.302  0.409  0.407  0.0871  0.108  0.213  0.429  0.434  0.495
rcv2       30   30   0.296  0.411  0.399  0.053   0.150  0.218  0.426  0.420  0.465
rcv3       30   32   0.316  0.416  0.408  0.049   0.162  0.215  0.404  0.476  0.448
rcv4       30   32   0.317  0.414  0.416  0.048   0.175  0.220  0.414  0.507  0.452
tr31       7    7    0.478  0.463  0.471  0.065   0.212  0.414  0.436  0.197  0.509
tr41       10   10   0.470  0.550  0.553  0.054   0.237  0.456  0.479  0.506  0.619
tr45       10   11   0.492  0.599  0.591  0.084   0.354  0.512  0.503  0.488  0.694

All the symbols in this table are the same as those used in Table 3.

Similarly, Table 4 shows that the proposed method performs better than the other methods using NMI in 98 out of 104 cases.
A statistical significance test has been performed to check whether these differences are significant when the other clustering algorithms beat the proposed algorithm, in both Tables 3 and 4. The same statistical significance test has been performed when the proposed algorithm performs better than the other clustering algorithms.
A generalized version of the paired t-test is suitable for testing the equality of means when the variances are unknown. This problem is the classical Behrens–Fisher problem in hypothesis testing, and a suitable test statistic3 is described and tabulated in [20] and [28], respectively. It has been found that, out of the 91 cases in Table 3 where the proposed algorithm performed better than the other algorithms, the differences are statistically significant in 86 cases at the 0.05 level of significance. For all of the remaining 13 cases in Table 3 the differences are statistically significant at the same level of significance. Hence the performance of the proposed method is found to be significantly better than that of the other methods in 86.86% (86/99) of the cases using f-measure. Similarly, in Table 4 the results are significant in 89 out of the 98 cases where the proposed method performed better than the other methods, and the results of all of the remaining 6 cases are significant. Thus in 93.68% (89/95) of the cases the proposed method performs significantly better than the other methods using NMI. Clearly, these results show the effectiveness of the proposed document clustering technique.
Remark. It is to be noted that the number of clusters produced by the proposed method for each corpus is close to the actual number of categories of that corpus. It may be observed from Tables 3 and 4 that the number of clusters is equal to the actual number of categories for the la2, oh15, rcv2, tr31 and tr41 corpora. The difference between the number of clusters and the actual number of categories is at most 3 for the rest of the corpora. Since the text data sets considered here are very sparse and high dimensional, it may be implied that the method proposed here for estimating the value of θ is able to detect the actual grouping of the corpus.
3 The test statistic is of the form t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2), where x̄1, x̄2 are the means, s1, s2 are the standard deviations and n1, n2 are the numbers of observations.
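As a concrete illustration, the statistic in the footnote can be computed from the repeated scores of two methods (for instance, the 10 f-measure or NMI values obtained on one corpus) as in the sketch below; the decision is then taken by comparing the value against the tabled critical points of the Behrens–Fisher test [28] at the 0.05 level of significance.

import math

def generalized_t_statistic(scores1, scores2):
    # scores1, scores2: repeated performance scores of two methods on the same corpus
    n1, n2 = len(scores1), len(scores2)
    mean1, mean2 = sum(scores1) / n1, sum(scores2) / n2
    var1 = sum((x - mean1) ** 2 for x in scores1) / (n1 - 1)      # unbiased sample variances
    var2 = sum((x - mean2) ** 2 for x in scores2) / (n2 - 1)
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)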
Table 5
Processing time (in seconds) of different clustering methods.

Data sets  BKM      KM       BS       SLHC     ALHC     DKNNA    SC       NMF      Proposed
20ns       1582.25  1594.54  1578.12  1618.50  1664.31  1601.23  1583.62  1587.36  1595.23
fbis       94.17    91.52    92.46    112.15   129.58   104.32   100.90   93.19    90.05
la1        159.23   153.22   142.36   160.12   179.62   162.25   146.68   153.50   140.31
la2        149.41   144.12   139.34   163.47   182.50   164.33   142.33   144.46   140.29
oh10       18.57    18.32    17.32    26.31    33.51    23.26    24.82    22.48    18.06
oh15       18.12    20.02    18.26    24.15    31.46    23.18    20.94    17.61    16.22
rcv1       89.41    91.15    86.37    87.47    103.37   88.37    94.62    87.80    86.18
rcv2       98.32    106.08   103.16   104.17   120.45   99.27    97.81    93.51    92.53
rcv3       97.11    106.29   98.35    98.47    124.72   98.47    108.24   94.96    93.54
rcv4       100.17   95.28    96.49    109.32   126.30   99.49    113.81   98.70    93.32
tr31       29.41    30.23    30.34    33.15    40.13    32.35    37.98    29.33    29.38
tr41       27.29    28.16    27.46    33.96    39.58    30.52    25.85    27.54    26.49
tr45       25.45    25.01    26.06    31.17    38.23    28.50    29.72    26.51    24.65

All the symbols in this table are the same as those used in Table 3.

6.4. Processing time
The similarity matrix requires N × N memory locations, and to store the N initial clusters, N memory locations are needed for the proposed method. Thus the space complexity of the proposed document clustering algorithm is O(N²). O(N²) time is required to build the extensive similarity matrix, and to construct (say) m ≪ N baseline clusters the proposed method takes O(mN²) time. In the final stage of the proposed technique, the k-means algorithm takes O((N − b)mt) time to merge (say) b singleton clusters into the baseline clusters, where t is the number of iterations of the k-means algorithm. Thus the time complexity of the proposed algorithm is O(N²), as m is very small compared to N.
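A minimal sketch of this final stage is given below, assuming that the documents are available as a tf-idf matrix and that the baseline clusters are given as lists of document indices. It reflects only the iterative nearest-centroid assignment of the singleton documents described above; the exact update and stopping rules of the proposed method may differ.

import numpy as np

def merge_singletons(doc_vectors, baseline_clusters, singleton_ids, max_iter=100):
    # doc_vectors: (N, d) array of document vectors (e.g. tf-idf)
    # baseline_clusters: list of lists of document indices forming the m baseline clusters
    # singleton_ids: indices of the b documents that remained as singleton clusters
    clusters = [list(c) for c in baseline_clusters]
    for _ in range(max_iter):
        # centroids of the current clusters (baseline documents plus assigned singletons)
        centroids = np.vstack([doc_vectors[c].mean(axis=0) for c in clusters])
        new_clusters = [list(c) for c in baseline_clusters]
        for i in singleton_ids:
            nearest = int(np.argmin(np.linalg.norm(centroids - doc_vectors[i], axis=1)))
            new_clusters[nearest].append(i)               # assign to the closest baseline centroid
        if new_clusters == clusters:                      # assignments are stable: stop
            break
        clusters = new_clusters
    return clusters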
The processing time of each algorithm used in the experiments has been measured on a quad core Linux workstation. The time (in seconds) taken by the different clustering algorithms to cluster every text data set is reported in Table 5. The time shown for the proposed algorithm is the sum of the times taken to estimate the value of θ, to build the baseline clusters, and to perform the k-means clustering algorithm that merges the remaining singleton clusters into the baseline clusters. The times shown for the bisecting k-means, buckshot, k-means and NMF clustering techniques are the averages of the processing times of the 10 executions. It is to be mentioned that the codes for all the algorithms are written in C++ and the data structures for all the algorithms were developed by the authors. Hence the processing time can be reduced by incorporating more efficient data structures for the proposed algorithm as well as for the other algorithms. Note that the processing time of the proposed algorithm is less than that of KM, SLHC, ALHC and DKNNA for each data set. The execution time of BKM is less than that of the proposed algorithm for 20ns, the execution time of SC is less than that of the proposed algorithm for tr41, and the execution time of NMF is less than that of the proposed algorithm for tr31. The execution time of the proposed algorithm is less than that of BKM, SC and NMF for each of the other data sets. The processing time of the proposed algorithm is comparable with that of the buckshot algorithm (although for most of the data sets the processing time of the proposed algorithm is less than that of buckshot). The dimensionality of the data sets used in the experiments varies from 2000 (fbis) to 35,218 (20ns). Hence the proposed clustering algorithm may be useful, in terms of processing time, for real life high dimensional data sets.
7. Conclusions
A hybrid document clustering algorithm is introduced by combining a new hierarchical clustering technique with the traditional k-means clustering technique. The baseline clusters produced by the new hierarchical technique are clusters in which the documents possess high similarity among themselves. The extensive similarity between documents ensures this quality of the baseline clusters. It is developed on the basis of the similarity between two documents and their distances to every other document in the document collection. Thus the documents with high extensive similarity are grouped in the same cluster. Most of the singleton clusters are nothing but documents that have low content similarity with every other document. In practice the number of such singleton clusters is sufficiently large and they cannot be ignored as outliers. Therefore the k-means algorithm is performed iteratively to assign these singleton clusters to one of the baseline clusters. Thus the proposed method reduces the error of the k-means algorithm due to random seed selection. Moreover, the method is not as expensive as the hierarchical clustering algorithms, which can be observed from Table 5. The significant characteristic of the proposed clustering technique is that the algorithm automatically decides the number of clusters in the data. The automatic detection of the number of clusters for such sparse and high dimensional text data is very important.
The proposed method is able to determine the number of clusters prior to executing the algorithm by applying a threshold θ on the similarity values between documents. An estimation technique is introduced to determine a value of θ from a corpus. The experimental results show the value and validity of the proposed estimation of θ. In the experiments the threshold on the distance between two clusters is taken as α = √N. It indicates that the distance between two different clusters must be greater than α. It is very difficult to fix a lower bound on the distance between two clusters in practice, as the corpora are sparse in nature and have high dimensionality. If we select α greater than √N, then some really different clusters may be merged into one, which is surely not desirable. On the other hand, if we select α less than √N, say α = N^(1/5), then we may get some very compact clusters, but a large number of small sized clusters would be created, which is also not expected in practice. It may be observed from the experiments that the number of clusters produced by the proposed technique is very close to the actual number of categories for each corpus and that the proposed method outperforms the other methods. Hence we may claim that the selection of α = √N is proper, though it has been selected heuristically. The proposed hybrid clustering technique tries to solve some issues of several well known partitional and hierarchical clustering techniques. Hence it may be useful in many real life unsupervised applications. Note that any similarity measure can be used instead of cosine similarity to design the extensive similarity for types of data sets other than text. It is to be mentioned that the value of α should be chosen carefully whenever the method is applied to other types of applications. In future work we shall apply the proposed method to social network data to find different types of communities or topics. In that case we may have to incorporate ideas from graph theory into the proposed distance function to find relations between different sets of nodes of a social network.
Acknowledgment
The authors would like to thank the reviewers and the editor for their valuable comments and suggestions.
References
[1] N.O. Andrews, E.A. Fox, Recent Developments in Document Clustering, Technical Report, Virginia Tech, USA, 2007.
[2] R.G. Brown, Smoothing, Forecasting and Prediction of Discrete Time Series, Prentice-Hall, Englewood Cliffs, NJ, 1962.
[3] C. Aggarwal, C. Zhai, A survey of text clustering algorithms, Mining Text Data (2012) 77–128.
[4] D. Cai, X. He, J. Han, Document clustering using locality preserving indexing, IEEE Trans. Knowl. Data Eng. 17 (12) (2005) 1624–1637.
[5] C. Carpineto, S. Osinski, G. Romano, D. Weiss, A survey of web clustering engines, ACM Comput. Surveys 41 (3) (2009).
[6] D.R. Cutting, D.R. Karger, J.O. Pedersen, J.W. Tukey, Scatter/gather: a cluster-based approach to browsing large document collections, in: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR'93, 1993, pp. 126–135.
[7] S. Dasgupta, V. Ng, Towards subjectifying text clustering, in: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR'10, NY, USA, 2010, pp. 483–490.
[8] R.C. Dubes, A.K. Jain, Algorithms for Clustering Data, Prentice Hall, 1988.
[9] R. Duda, P. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.
[10] M. Filippone, F. Camastra, F. Masulli, S. Rovetta, A survey of kernel and spectral methods for clustering, Pattern Recognit. 41 (1) (2008) 176–190.
[11] R. Forsati, M. Mahdavi, M. Shamsfard, M.R. Meybodi, Efficient stochastic algorithms for document clustering, Inform. Sci. 220 (2013) 269–291.
[12] C.A. Glasbey, An analysis of histogram-based thresholding algorithms, Graph. Models Image Process. 55 (6) (1993) 532–537.
[13] J.A. Hartigan, M.A. Wong, A k-means clustering algorithm, J. Roy. Statist. Soc. (Appl. Statist.) 28 (1) (1979) 100–108.
[14] A. Huang, Similarity measures for text document clustering, in: Proceedings of the New Zealand Computer Science Research Student Conference, Christchurch, New Zealand, 2008, pp. 49–56.
[15] G. Karypis, E.H. Han, Centroid-based document classification: analysis and experimental results, in: Proceedings of the Fourth European Conference on the Principles of Data Mining and Knowledge Discovery, PKDD'00, Lyon, France, 2000, pp. 424–431.
[16] J.Z.C. Lai, T.J. Huang, An agglomerative clustering algorithm using a dynamic k-nearest neighbor list, Inform. Sci. 217 (2012) 31–38.
[17] A.N. Langville, C.D. Meyer, R. Albright, Initializations for the non-negative matrix factorization, in: Proceedings of the Conference on Knowledge Discovery from Data, KDD'06, 2006.
[18] T. Basu, C.A. Murthy, CUES: a new hierarchical approach for document clustering, J. Pattern Recognit. Res. 8 (1) (2013) 66–84.
[19] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, in: Advances in Neural Information Processing Systems, vol. 13, 2001, pp. 556–562.
[20] E.L. Lehmann, Testing of Statistical Hypotheses, John Wiley, New York, 1976.
[21] X. Liu, X. Yong, H. Lin, An improved spectral clustering algorithm based on local neighbors in kernel space, Comput. Sci. Inform. Syst. 8 (4) (2011) 1143–1157.
[22] C.S. Yang, M.C. Chiang, C.W. Tsai, A time efficient pattern reduction algorithm for k-means clustering, Inform. Sci. 181 (2011) 716–731.
[23] M.I. Malinen, P. Franti, Clustering by analytic functions, Inform. Sci. 217 (2012) 31–38.
[24] C.D. Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, New York, 2008.
[25] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: Proceedings of Neural Information Processing Systems (NIPS'01), 2001, pp. 849–856.
[26] P. Pantel, D. Lin, Document clustering with committees, in: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR'02, 2002, pp. 199–206.
[27] M.F. Porter, An algorithm for suffix stripping, Program 14 (3) (1980) 130–137.
[28] C.R. Rao, S.K. Mitra, A. Matthai, K.G. Ramamurthy (Eds.), Formulae and Tables for Statistical Work, Statistical Publishing Society, Calcutta, 1966.
[29] G. Salton, M.J. McGill, Introduction to Modern Information Retrieval, McGraw Hill, 1983.
[30] M. Steinbach, G. Karypis, V. Kumar, A comparison of document clustering techniques, in: Proceedings of the Text Mining Workshop, ACM International Conference on Knowledge Discovery and Data Mining (KDD'00), 2000.
[31] A. Strehl, J. Ghosh, Cluster ensembles – a knowledge reuse framework for combining multiple partitions, J. Machine Learn. Res. 3 (2003) 583–617.
[32] J. Wang, S. Wu, H.Q. Vu, G. Li, Text document clustering with metric learning, in: Proceedings of the 33rd International Conference on Research and Development in Information Retrieval, SIGIR'10, 2010, pp. 783–784.
[33] W. Xu, X. Liu, Y. Gong, Document clustering based on non-negative matrix factorization, in: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR'03, Toronto, Canada, 2003, pp. 267–273.
[34] W. Xu, Y. Gong, Document clustering by concept factorization, in: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR'04, 2004, pp. 483–490.
[35] Y. Zhu, L. Jing, J. Yu, Text clustering via constrained nonnegative matrix factorization, in: Proceedings of the IEEE International Conference on Data Mining (ICDM 2011), 2011, pp. 1278–1283.