
JOURNAL OF COMPUTING, VOLUME 5, ISSUE 1, JANUARY 2013, ISSN (Online) 2151-9617, https://sites.google.com/site/journalofcomputing, WWW.JOURNALOFCOMPUTING.ORG

K-Means Clustering and Affinity Clustering based on Heterogeneous Transfer Learning


Shailendra Kumar Shrivastava, Dr. J. L. Rana, and Dr. R.C. Jain
Abstract - Heterogeneous Transfer Learning (HTL) aims to extract knowledge from one or more tasks in one feature space and apply it to a target task in another feature space. In this paper two clustering algorithms, K-Means clustering and Affinity clustering, both based on Heterogeneous Transfer Learning, are proposed. Both algorithms use annotated image data sets. K-Means based on HTL first finds the cluster centroids of the text (annotations) by K-Means; these text centroids are then used to initialize the centroids for image clustering by K-Means. The second algorithm, Affinity clustering based on HTL, first finds the exemplars of the annotations, and these exemplars are then used to initialize the similarity matrix of the image data set before clustering. In both algorithms the F-Measure and Purity scores increase and the Entropy scores decrease. The clustering accuracy of Affinity clustering based on HTL is better than that of K-Means based on HTL.

Key words - Heterogeneous transfer learning, clustering, affinity propagation, K-Means, feature space.

1 INTRODUCTION

In the literature [1], Machine Learning is defined as follows: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. However, many machine learning methods work well only under the assumption that the training data and testing data are drawn from the same feature space. If the feature space differs between training and testing data, most statistical models will not work. In that case one needs to re-collect training and testing data in the same feature space and rebuild the model, which is expensive and difficult. In such cases transfer learning [3] between task domains is desirable. Transfer learning allows the domains, tasks, and distributions used in training and testing to be different. In heterogeneous transfer learning, the knowledge is transferred across domains or tasks that have different feature spaces, e.g. classifying web pages in Chinese using training documents in English [4]. Probabilistic latent semantic analysis (PLSA) [5] has been used to cluster images by using their annotations (text). Transfer learning in machine learning [2] has already achieved significant success in many knowledge engineering areas, including classification, regression and clustering. Clustering is a fundamental task in computerized data analysis. It is concerned with the problem of partitioning a collection of data points into groups/categories using unsupervised learning techniques.

Shailendra Kumar Shrivastava is with the Department of Information Technology, Samrat Ashok Technological Institute, Vidisha, M.P. 464001, India. Dr. J. L. Rana, Ex-Head of the Department of Computer Science & Engineering, was with M.A.N.I.T., Bhopal, India. Dr. R. C. Jain, Director, is with the Samrat Ashok Technological Institute, Vidisha, M.P. 464001, India.

Data points within a group are similar; such groups are called clusters [6][7][8]. In this paper two algorithms, K-Means [8][9] based on Heterogeneous Transfer Learning and Affinity clustering based on Heterogeneous Transfer Learning, are proposed. Affinity propagation (AP) [6] is a clustering algorithm which, for a given set of similarities (also called affinities) between pairs of data points, partitions the data by passing messages among the data points. Each partition is associated with a prototypical point that best describes that cluster. AP associates each data point with one such prototype; thus, the objective of AP is to maximize the overall sum of similarities between data points and their representatives. K-Means starts with a random initial partition and keeps reassigning the patterns to clusters, based on the similarity between each pattern and the centroids, until a convergence criterion is met. An annotated image data set has two feature spaces: the first is the text feature space and the second is the image feature space. In K-Means based on HTL, the text data (annotations) are first clustered by K-Means to find the text centroids. To transfer the knowledge of the text feature space into the image feature space, the image centroids corresponding to the text (annotation) centroids are then made available. Next, the complete image data set is assigned to these centroids on the basis of minimum Euclidean distance, and finally K-Means is applied to generate the image clusters. In Affinity clustering based on HTL, the text (annotations) of the images is used to find exemplars by affinity propagation clustering. To transfer the knowledge from the text feature space to the image feature space, the diagonal of the image similarity matrix is initialized from the exemplars of the text clustering, and the image clusters are then generated from the image similarity matrix by affinity propagation clustering. The remainder of this paper is organized as follows. Section 2 gives a brief overview of transfer learning,




the original Affinity Propagation algorithm, and the vector space model. Section 3 describes the main idea and details of our proposed algorithms. Section 4 discusses the experimental results and evaluations. Section 5 provides the concluding remarks and future directions.


2 RELATED WORKS
Before going into the details of our proposed K-Means based on Heterogeneous Transfer Learning and Affinity Clustering based on Heterogeneous Transfer Learning algorithms, some works closely related to this paper are briefly reviewed: transfer learning, the K-Means clustering algorithm, the affinity propagation algorithm and the vector space model.

2.1 Transfer Learning

Machine learning methods work well only under the common assumption that the training and test data are drawn from the same feature space and the same distribution. When the distribution changes, most statistical models need to be rebuilt from scratch using newly collected training data. In many real-world applications, it is expensive or impossible to re-collect the needed training data and rebuild the model, so it would be desirable to reduce the need and effort to re-collect training data. In such cases knowledge transfer, or transfer learning [3], between task domains is desirable. Transfer learning has three main research issues: (1) what to transfer, (2) how to transfer, and (3) when to transfer. In the inductive transfer learning setting, the target task is different from the source task, no matter whether the source and target domains are the same or not. In the transductive transfer learning setting, the source and target tasks are the same, while the source and target domains are different. In the unsupervised transfer learning setting, similar to the inductive setting, the target task is different from but related to the source task. In heterogeneous transfer learning, the knowledge is transferred across domains or tasks that have different feature spaces.

2.2 K-Means Clustering Algorithm

The K-Means algorithm [8][9] is one of the best known and most popular clustering algorithms. K-Means seeks an optimal partition of the data by minimizing the sum-of-squared-error criterion with an iterative optimization procedure. The K-Means clustering procedure is as follows.
1. Initialize a K-partition randomly or based on some prior knowledge, and calculate the cluster prototype matrix M = [m_1, ..., m_K].
2. Assign each object in the data set to the nearest cluster C_k.
3. Recalculate the cluster prototype matrix based on the current partition:
   m_k = (1/N_k) Σ_{x ∈ C_k} x   (1)
   where N_k is the number of objects in cluster C_k.
4. Repeat steps 2 and 3 until there is no change for any cluster.
A major problem with this algorithm is that it is sensitive to the selection of the initial partition.
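As an illustration, the following is a minimal Python sketch of this procedure using NumPy. The function name kmeans, the data matrix X and the random initialization are illustrative assumptions; ties and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-Means sketch: X is an (n, d) data matrix, K the cluster count."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the prototype matrix M with K randomly chosen points.
    M = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest prototype (squared error).
        d = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recalculate each prototype as the mean of its cluster (eq. 1).
        new_M = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 4: stop when no prototype changes.
        if np.allclose(new_M, M):
            break
        M = new_M
    return labels, M
```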

2.3 Affinity Clustering Algorithm

The affinity clustering algorithm [10][11][12] is based on message passing among data points. Each data point receives availability messages from other data points (from candidate exemplars) and sends responsibility messages to other data points (to candidate exemplars). The sum of responsibilities and availabilities for the data points identifies the exemplars. After the exemplars have been identified, the data points are assigned to exemplars to form the clusters. The steps of the affinity clustering algorithm are as follows.
1. Initialize the availabilities to zero: a(i,k) = 0.
2. Update the responsibilities by the following equation, where s(i,k) is the similarity of data point i and candidate exemplar k:
   r(i,k) ← s(i,k) − max_{k'≠k} { a(i,k') + s(i,k') }   (2)
3. Update the availabilities by the following equation:
   a(i,k) ← min{ 0, r(k,k) + Σ_{i'∉{i,k}} max(0, r(i',k)) }   (3)
   Update the self-availabilities by the following equation:
   a(k,k) ← Σ_{i'≠k} max(0, r(i',k))   (4)
4. Compute sum = a(i,k) + r(i,k) for each data point i and find the value of k that maximizes the sum, to identify the exemplars.
5. If the exemplars do not change for a fixed number of iterations go to step (6), else go to step (2).
6. Assign the data points to exemplars on the basis of maximum similarity to find the clusters.

2.4 Vector Space Model

The vector space model (VSD) [13] is used to represent text documents. In the VSD model each document d is considered as a vector in the M-dimensional term (word) space. The tf-idf weighting scheme is used, and each document is represented by the following equation, where N is the number of terms (words) in the document:
   d = [w(1,d), w(2,d), ..., w(N,d)]   (5)
The weights are given by
   w(i,d) = (1 + log tf(i,d)) · log(1 + N/df(i))   (6)
where tf(i,d) is the frequency of the i-th term in document d and df(i) is the number of documents containing the i-th term. The inverse document frequency (idf) is defined as the logarithm of the ratio of the number of documents (N) to the number of documents containing the given word (df).
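A compact Python sketch of the message updates (equations 2-4) is given below. It assumes a precomputed similarity matrix S whose diagonal holds the preferences; the damping factor is not part of the steps above but is the standard way to avoid oscillations, and the function name affinity_propagation is an illustrative assumption.

```python
import numpy as np

def affinity_propagation(S, max_iter=200, conv_iter=15, damping=0.5):
    """Sketch of AP: S is an (n, n) similarity matrix, S[k, k] the preferences."""
    n = S.shape[0]
    A = np.zeros((n, n))  # availabilities a(i,k), initialized to zero (step 1)
    R = np.zeros((n, n))  # responsibilities r(i,k)
    last = None
    stable = 0
    for _ in range(max_iter):
        # Eq. (2): r(i,k) = s(i,k) - max_{k'!=k} [a(i,k') + s(i,k')].
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * Rnew
        # Eq. (3) and eq. (4): availabilities and self-availabilities.
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())       # keep r(k,k) itself in the sum
        Anew = Rp.sum(axis=0)[None, :] - Rp      # column sums minus own term
        dA = Anew.diagonal().copy()              # a(k,k) = sum of positive r(i',k)
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, dA)
        A = damping * A + (1 - damping) * Anew
        # Step 4: exemplar of point i is the k maximizing a(i,k) + r(i,k).
        exemplars = (A + R).argmax(axis=1)
        # Step 5: stop when exemplars are unchanged for conv_iter iterations.
        if last is not None and np.array_equal(exemplars, last):
            stable += 1
            if stable >= conv_iter:
                break
        else:
            stable = 0
        last = exemplars
    return exemplars
```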

3 CLUSTERING BASED ON HETEROGENEOUS TRANSFER LEARNING


In this section, two clustering algorithms based on heterogeneous transfer learning are proposed. The first is




K-Means Clustering based on Heterogeneous Transfer Learning and the second is Affinity Propagation Clustering based on Heterogeneous Transfer Learning.

3.1 K-Means Clustering based on Heterogeneous Transfer Learning

K-Means clustering based on heterogeneous transfer learning extends K-Means clustering. An annotated image data set has been used in the simulation studies, from which the annotations (text feature space) and the images (image feature space) are computed. K-Means clustering is applied to the text (annotations) of the images to find the text centroids. To transfer knowledge from one task to the other, the first step is to initialize the centroids for image clustering with the centroids obtained from text clustering. For text clustering the phrase-based VSD model [13] is used: in the vector space model w(d,i), the term frequency and document frequency are normally calculated on the basis of terms, where a term is a word, but here phrases are used instead of words. This can be called the vector space model based on phrases. The phrase (term) frequency and document frequency can be calculated with a suffix tree; here the document frequency of a phrase is the number of documents that contain the phrase. The centroids of the annotations are generated by the K-Means algorithm with the VSD model as input, and K-Means clustering is then applied to the image data set, with the centroids initialized to the centroids obtained from text clustering. The proposed K-Means clustering algorithm based on heterogeneous transfer learning can be written as follows.
1. Input the annotations (text) for clustering.
2. Text preprocessing: remove all stop words and perform word stemming.
3. Find the words and assign a unique number to each word.
4. Convert the text into a sequence of numbers.
5. Construct a suffix tree using Ukkonen's algorithm.
6. Calculate the phrase (term) frequencies from the suffix tree.
7. Calculate the document frequency of each phrase from the suffix tree.
8. Construct the vector space model of the text using phrases.
9. Apply K-Means to the VSD model.
10. Initialize the centroids in the image domain with the centroids obtained from text clustering.
11. Apply K-Means to the image data set to find the clusters.
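A minimal sketch of the transfer step (steps 9-11) is shown below, reusing the kmeans sketch from Section 2.2. It assumes text_vectors and image_vectors are row-aligned (the i-th annotation describes the i-th image); mapping each text cluster to an initial image centroid by averaging the image features of its members is our interpretation of how the image centroids "become available" from the text centroids.

```python
import numpy as np

def htl_kmeans(text_vectors, image_vectors, K):
    """text_vectors: (n, t) phrase-based VSD matrix; image_vectors: (n, d).
    Rows are aligned: annotation i describes image i."""
    # Step 9: cluster the annotations with plain K-Means.
    text_labels, _ = kmeans(text_vectors, K)
    # Step 10: assumed mapping - each text cluster yields an initial image
    # centroid, the mean image feature vector of the images in that cluster.
    M = np.array([image_vectors[text_labels == k].mean(axis=0)
                  for k in range(K)])
    # Step 11: run K-Means on the image features from these centroids,
    # assigning images by minimum Euclidean distance.
    labels = None
    for _ in range(100):
        d = ((image_vectors[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_M = np.array([image_vectors[labels == k].mean(axis=0)
                          for k in range(K)])
        if np.allclose(new_M, M):
            break
        M = new_M
    return labels
```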

3.2 Affinity Clustering based on Heterogeneous Transfer Learning

Affinity clustering based on heterogeneous transfer learning extends affinity propagation clustering. An annotated image data set is used, in which the annotations (text feature space) and the images (image feature space) form the starting point. Affinity clustering is applied to the annotations (text) of the images to find the exemplars. To transfer knowledge from one task to the other, the diagonal values of the similarity matrix of the image data set are assigned on the basis of the exemplars of the text clustering. For text clustering the phrase-based VSD model is used: in the vector space model w(d,i), the term frequency and document frequency are calculated on the basis of phrases instead of words, and the phrase (term) frequency and document frequency can be calculated with a suffix tree, the document frequency of a phrase being the number of documents that contain it. The phrase-based VSD model is used to compute the cosine similarity [15]. The similarity of two documents d_i and d_j is calculated by equation (7), with each document represented as in equation (5):

   sim(d_i, d_j) = (d_i · d_j) / (|d_i| |d_j|)   (7)

The self-similarity/preference [9] is found from equation (8), which sets each preference to the median of the document similarities:

   s(k,k) = median_{i≠k} { sim(d_i, d_k) },  1 ≤ k ≤ N   (8)

The affinity propagation algorithm for clustering is applied to generate the exemplars. The features of the image data set are then extracted to form the feature vectors of the images, and the similarity matrix is computed from these image vectors. The diagonal values of the similarity matrix of the image domain are assigned on the basis of the exemplars of the text clustering, which transfers the knowledge from one domain to the other. Finally the exemplars/clusters are generated by the affinity propagation clustering algorithm. The proposed algorithm can be written as follows.
1. Input the annotations (text) for clustering.
2. Text preprocessing: remove all stop words and perform word stemming.
3. Find the words and assign a unique number to each word.
4. Convert the text into a sequence of numbers.
5. Construct a suffix tree using Ukkonen's algorithm.
6. Calculate the phrase (term) frequencies from the suffix tree.
7. Calculate the document frequency of each phrase from the suffix tree.
8. Construct the vector space model of the text using phrases.
9. Find the phrase-based similarity matrix of the documents from the vector space model by equation (7).
10. Assign the preferences in the similarity matrix by equation (8).
11. Initialize the availabilities to zero: a(i,k) = 0.
12. Update the responsibilities by equation (2).
13. Update the availabilities by equation (3).
14. Update the self-availabilities by equation (4).
15. Compute sum = a(i,k) + r(i,k) for each data point i and find the value of k that maximizes the sum, to identify the exemplars.




16. If the exemplars do not change for a fixed number of iterations go to step (17), else go to step (12).
17. Extract the feature vectors from the image data set.
18. Find the similarity matrix from the image feature vectors.
19. Transfer the knowledge from the text feature space to the image feature space by assigning the diagonal values of the image similarity matrix on the basis of the exemplars of the text clustering.
20. Initialize the availabilities to zero: a(i,k) = 0.
21. Update the responsibilities by equation (2).
22. Update the availabilities by equation (3).
23. Update the self-availabilities by equation (4).
24. Compute sum = a(i,k) + r(i,k) for each data point i and find the value of k that maximizes the sum, to identify the exemplars.
25. If the exemplars do not change for a fixed number of iterations go to step (26), else go to step (21).
26. Assign the data points to exemplars on the basis of maximum similarity to find the clusters.
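A sketch of steps 17-20 in Python follows, reusing the affinity_propagation sketch from Section 2.3. How the text exemplars set the image preferences is only partially specified above, so giving a high preference to images whose annotations were chosen as exemplars (and the median similarity to all others) is an assumption of this sketch.

```python
import numpy as np

def htl_affinity(text_sim, image_vectors):
    """text_sim: (n, n) phrase-based similarity matrix of the annotations,
    with preferences (eq. 8) on its diagonal; image_vectors: (n, d)."""
    # Steps 11-16: cluster the annotations to obtain the text exemplars.
    text_exemplars = affinity_propagation(text_sim)
    # Steps 17-18: build the image similarity matrix (negative squared
    # Euclidean distance, as is standard for affinity propagation).
    diff = image_vectors[:, None, :] - image_vectors[None, :, :]
    S = -(diff ** 2).sum(axis=2)
    # Steps 19-20: transfer the knowledge via the diagonal (preferences).
    # Assumption: images whose annotations are exemplars get a high
    # preference; all other images get the median off-diagonal similarity.
    off = S[~np.eye(len(S), dtype=bool)]
    prefs = np.full(len(S), np.median(off))
    prefs[np.unique(text_exemplars)] = off.max()
    np.fill_diagonal(S, prefs)
    # Steps 21-26: run affinity propagation on the image similarities.
    return affinity_propagation(S)
```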

4 EXPERIMENTAL RESULTS AND EVALUATION


In this section, the results and evaluation of a set of experiments are presented to verify the effectiveness and efficiency of our proposed clustering algorithms. The evaluation parameters are F-Measure, Purity and Entropy. The experiments have been performed on data sets constructed from the Caltech-256 corpus [14]. We discuss the evaluation parameters, the data sets and the results below.

4.1 Evaluation Parameters [15]

For ready reference, the definitions and formulas of F-Measure, Purity and Entropy are given below.

4.1.2 F-measure

The F-Measure combines Precision and Recall. Let C = {C_1, ..., C_|C|} be the clusters of data set D of N documents, and let C* = {c*_1, ..., c*_q} represent the correct classes of D. Then the Recall and Precision of cluster j with respect to class i are defined as

   Recall(i,j) = n_ij / |c*_i|,   Precision(i,j) = n_ij / |C_j|

where n_ij is the number of members of class i in cluster j. The F-Measure of cluster j and class i combines Precision and Recall in the following manner:

   F(i,j) = 2 · Precision(i,j) · Recall(i,j) / (Precision(i,j) + Recall(i,j))

The F-Measure for the overall quality of the cluster set C is defined by the following equation:

   F = Σ_i (|c*_i| / N) · max_j F(i,j)

4.1.3 Purity

Purity indicates the percentage of the dominant class members in a given cluster. For measuring the overall clustering purity, the weighted average purity is used. Purity is given by the following equation:

   Purity = Σ_j (|C_j| / N) · max_i p_ij

4.1.4 Entropy

Entropy tells us about the homogeneity of a cluster: the higher the homogeneity of a cluster, the lower its entropy should be, and vice versa. Like the weighted F-Measure and weighted Purity, the weighted entropy is used, which is given by the following equation:

   Entropy = Σ_j (|C_j| / N) · ( −(1/log q) Σ_i p_ij log p_ij )

where p_ij is the probability that a member of cluster j belongs to class i and q is the number of classes. To sum up, we would like to maximize the F-Measure and Purity scores and minimize the Entropy score of a clustering to achieve high-quality clustering.
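For reference, a minimal Python sketch computing these weighted scores from predicted cluster labels and true class labels (both integer arrays) is given below; the function name cluster_scores is an illustrative assumption, and at least two classes are assumed.

```python
import numpy as np

def cluster_scores(clusters, classes):
    """Weighted F-Measure, Purity and Entropy of a clustering."""
    N = len(clusters)
    ks = np.unique(clusters)  # cluster ids
    cs = np.unique(classes)   # class ids
    # Contingency counts n[i, j] = members of class i in cluster j.
    n = np.array([[np.sum((classes == c) & (clusters == k)) for k in ks]
                  for c in cs], dtype=float)
    class_sz = n.sum(axis=1)
    clus_sz = n.sum(axis=0)
    # F-Measure: best F over clusters per class, weighted by class size.
    R = n / class_sz[:, None]
    P = n / clus_sz[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        F = np.where(P + R > 0, 2 * P * R / (P + R), 0.0)
    f_measure = (class_sz / N * F.max(axis=1)).sum()
    # Purity: fraction of the dominant class per cluster, weighted.
    purity = (clus_sz / N * (n.max(axis=0) / clus_sz)).sum()
    # Entropy: normalized class entropy per cluster, weighted.
    p = n / clus_sz[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = (clus_sz / N * (-plogp.sum(axis=0) / np.log(len(cs)))).sum()
    return f_measure, purity, entropy
```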

4.2 Data Set Preparation

Image data sets of 100, 300, 500 and 800 images have been constructed, with the images randomly chosen from Caltech-256. Manually annotated text files were created for each data set.

4.3 Experimental Results Discussion

Extensive experiments were carried out to show the effectiveness of the proposed algorithms. Annotations and images were combined, and the experiments were performed on the following combinations of annotations and images: without annotations; 100 annotations and 100 images; 100 annotations and 300 images; 100 annotations and 500 images; 100 annotations and 800 images; 300 annotations and 300 images; 300 annotations and 500 images; 300 annotations and 800 images; 500 annotations and 500 images; and 500 annotations and 800 images. The results of the experiments are given in Table 1, Table 2 and Table 3. It can be observed from fig 1, fig 2, fig 3, fig 4, fig 5 and fig 6 that in both algorithms the F-Measure, Purity and Entropy scores vary with the number of annotations and the number of images, and that the F-Measure and Purity scores are maximum and the Entropy scores minimum at the optimum number of annotations. For a comparison of K-Means clustering based on HTL and Affinity clustering based on HTL, the scores are plotted in fig 7, fig 8 and fig 9, from which it is observed that the F-Measure and Purity scores are larger and the Entropy scores smaller for Affinity clustering based on HTL.




Fig 1: Variation of F-Measure Scores with Annotations (Text) in K-Means clustering based on heterogeneous transfer learning

Fig 4: Variation of F-Measure Scores with Annotations (Text) in Affinity clustering based on heterogeneous transfer learning

Fig 2: Variation of Purity Scores with Annotations (Text) in K-Means clustering based on heterogeneous transfer learning

Fig 5: Variation of Purity Scores with Annotations (Text) in Affinity clustering based on heterogeneous transfer learning

Fig 3: Variation of Entropy Scores with Annotations (Text) in K-Means clustering based on heterogeneous transfer learning

Fig 6: Variation of Entropy Scores with Annotations (Text) in Affinity clustering based on heterogeneous transfer learning




Number of   No. of Images   F-Measure,        F-Measure,
Annotations in Data Set     AP based on HTL   K-Means based on HTL
0           100             0.30711           0.26873
100         100             0.43563           0.35208
0           300             0.25227           0.24254
100         300             0.42308           0.35364
300         300             0.24565           0.10109
0           500             0.18273           0.18823
100         500             0.41944           0.30234
300         500             0.28443           0.19764
500         500             0.19175           0.18912
0           800             0.18928           0.18064
100         800             0.40586           0.32492
300         800             0.35365           0.26969
500         800             0.16184           0.12764
Table 1: Comparison of F-Measure Scores

Number of   No. of Images   Purity,           Purity,
Annotations in Data Set     AP based on HTL   K-Means based on HTL
0           100             0.3700            0.2900
100         100             0.4800            0.3600
0           300             0.2800            0.2000
100         300             0.3907            0.2966
300         300             0.2700            0.2015
0           500             0.1980            0.1680
100         500             0.3362            0.2480
300         500             0.2000            0.1175
500         500             0.1287            0.1060
0           800             0.1900            0.1062
100         800             0.2875            0.2537
300         800             0.2025            0.1200
500         800             0.1912            0.1175
Table 2: Comparison of Purity Scores

Number of   No. of Images   Entropy,          Entropy,
Annotations in Data Set     AP based on HTL   K-Means based on HTL
0           100             0.75162           0.85679
100         100             0.60888           0.70140
0           300             0.80327           0.89764
100         300             0.68969           0.79095
300         300             0.78225           0.80882
0           500             0.80658           0.93903
100         500             0.69742           0.83917
300         500             0.77842           0.88506
500         500             0.79886           0.93907
0           800             0.87716           0.95362
100         800             0.74227           0.78226
300         800             0.78725           0.88942
500         800             0.86091           0.97506
Table 3: Comparison of Entropy Scores

Fig 7: Comparison of F-Measure Scores with Annotations (Text) in K-Means clustering based on HTL and AP based on HTL (number of images in data set: 800)


Fig 8: Comparison of Purity Scores with Annotations (Text) in K-Means clustering based on HTL and AP based on HTL (number of images in data set: 800)


Fig 9: Comparison of Entropy Scores with Annotations (Text) in K-Means clustering based on HTL and AP based on HTL (number of images in data set: 800)




5 CONCLUDING REMARKS AND FUTURE DIRECTIONS


In this paper two algorithms for clustering, K-Means Clustering based on HTL and Affinity Clustering based on HTL, have been proposed. The clustering accuracy of K-Means based on HTL is better than that of plain K-Means, whereas Affinity Clustering based on HTL gives far better clustering accuracy than simple Affinity Propagation Clustering. It is also concluded that the clustering accuracy of Affinity clustering based on HTL is much better than that of K-Means based on HTL. Extensive experiments on many data sets show that the proposed Affinity clustering based on HTL produces better clustering accuracy with less computational complexity. There are a number of interesting potential avenues for future research: Affinity Clustering based on HTL can be made hierarchical, the results of FAPML can be improved by redesigning it on the basis of HTL, and both algorithms can be applied to information retrieval.

Shailendra Kumar Shrivastava, B.E. (C.T.), M.E. (CSE), is Associate Professor in the Department of Information Technology, Samrat Ashok Technological Institute, Vidisha. He has more than 23 years of teaching experience and has published more than 50 research papers in national/international conferences and journals. His areas of interest are machine learning and data mining. He is a Ph.D. scholar at R.G.P.V. Bhopal.

Dr. J. L. Rana, B.E., M.E. (CSE), Ph.D. (CSE), was formerly Head of the Department of Computer Science and Engineering, M.A.N.I.T., Bhopal, M.P., India. He has more than 40 years of teaching experience. His areas of interest include data mining, image processing and ad-hoc networks. He has many publications in international journals and conferences.

Dr. R. C. Jain, Ph.D., is the Director of Samrat Ashok Technological Institute, Vidisha, M.P., India. He has more than 35 years of teaching experience. His research interests include data mining, computer graphics and image processing. He has published more than 250 research papers in international journals and conferences.

REFERENCES
[1] Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997, pp. 1-414.
[2] Ethem Alpaydin, Introduction to Machine Learning, Prentice Hall of India Private Limited, New Delhi, 2006, pp. 133-150.
[3] Sinno Jialin Pan and Qiang Yang, "A Survey on Transfer Learning", IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), vol. 22, no. 10, October 2010, pp. 1345-1359.
[4] X. Ling, G.-R. Xue, W. Dai, Y. Jiang, Q. Yang, and Y. Yu, "Can Chinese web pages be classified with English data source?", Proceedings of the 17th International Conference on World Wide Web, Beijing, China, ACM, April 2008, pp. 969-978.
[5] Qiang Yang, Yuqiang Chen, Gui-Rong Xue, and Wenyuan Dai, "Heterogeneous Transfer Learning for Image Clustering via the Social Web", ACL-IJCNLP 2009, pp. 1-9.
[6] Rui Xu and Donald C. Wunsch, Clustering, IEEE Press, 2009.
[7] A. Jain and R. Dubes, Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice Hall, 1988.
[8] A.K. Jain, M.N. Murthy and P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, vol. 31, no. 3, September 1999, pp. 264-322.
[9] Rui Xu and Donald Wunsch, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, vol. 16, no. 3, 2005, pp. 645-678.
[10] B.J. Frey and D. Dueck, "Clustering by Passing Messages Between Data Points", Science, 2007, pp. 972-976.
[11] Kaijun Wang, Junying Zhang, Dan Li, Xinna Zhang and Tao Guo, "Adaptive Affinity Propagation Clustering", Acta Automatica Sinica, 2007, pp. 1242-1246.
[12] Inmar E. Givoni and Brendan J. Frey, "A Binary Variable Model for Affinity Propagation", Neural Computation, vol. 21, no. 6, June 2009, pp. 1589-1600.
[13] G. Salton, A. Wong, and C.S. Yang, "A Vector Space Model for Automatic Indexing", Comm. ACM, vol. 18, no. 11, 1975, pp. 613-620.
[14] http://www.vision.caltech.edu/Image_Datasets/Caltech256/
[15] H. Chim and X. Deng, "Efficient Phrase Based Document Similarity for Clustering", IEEE Trans. Knowledge and Data Engineering, vol. 20, no. 9, 2008.
