
Proceedings of the 5th National Conference; INDIACom-2011, Computing For Nation Development, March 10-11, 2011, Bharati Vidyapeeth's Institute of Computer Applications and Management, New Delhi

A Model for Analyzing Relevance of Web Results against Semantic Queries


Kumar Sourabh1 and Vibhakar Mansotra2
1, 2 Department of Computer Science and IT, University of Jammu, Jammu 180001, J&K, India
1 kumar9211.sourabh@gmail.com (Phone: +91 9469163570) and 2 vibhakar20@yahoo.co.in (Phone: +91 9419103488)

ABSTRACT
The World Wide Web has had a tremendous effect on the socio-economic development of people, largely because of the huge amount of information available on it. The web is undoubtedly a huge repository of information, but retrieving relevant information remains a cause for concern. Many search engines provide efficient means of searching for the desired information on the web; prominent tools include Google, Yahoo and AltaVista. However, in spite of the large number of search tools, users are on a number of occasions not satisfied with the results obtained. This happens mostly with semantic queries, when the user looks for the results of a meaningful query submitted by him/her. Considering the variation in the nature of queries and results, the authors have attempted to develop a model that reduces the effort the user needs to find useful information. The authors have designed, developed and implemented a tool that the user can use to analyze the results obtained from the Google search engine in order to find useful information.

KEYWORDS
Semantic, Search engines, query, information retrieval, Web mining

1. INTRODUCTION
Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. We could simply view Web mining as an extension of knowledge discovery applied to Web data. There is a close relationship between data mining, machine learning and advanced data analysis. However, web mining, or information discovery on the web, is not the same as Information Retrieval (IR). Information Retrieval is the automatic retrieval of documents relevant to a query.
IR has the primary goal of indexing text and searching for useful documents in a collection; nowadays research in IR also includes modeling, document classification and categorization, user interfaces, data visualization, filtering, etc. A task that can be considered an instance of web mining is web document classification or categorization, which could be used for indexing. Viewed in this respect, Web mining is a part of the IR process.

The World Wide Web (WWW) is a popular and interactive medium for disseminating information today. People either browse or use search engine services to find specific information on the Web. When a user uses a search engine, he/she usually inputs a keyword query, and the query response is a list of pages ranked by their similarity to the query. Nowadays search engines clearly dominate the web, so users rely on them for their information needs. Other approaches, such as web directories, social bookmarking or question answering services, play only a minor part [1].

According to the findings of a survey on the behavior of search engine users by iProspect in January 2006, a fairly high level of confidence exists on the part of users across search engines in general, as a vast majority seem to trust their search engine of choice to return the correct information more than they trust themselves to enter the appropriate keywords. In cases where an initial search is unsuccessful, 82% of users re-launch their query on the same search engine but add more keywords to refine the subsequent search. This shows how heavily users depend on search engines for information retrieval [2].

Although users rely on search engines as their primary source of information, the major problem in web searching is still the same as it was a decade ago: relevance. Search engines have certainly improved a great deal over the years, but results are still far from perfect.
There have been some recent changes at Google, Yahoo and others that affect the guidance given to the user while submitting a query [3]. It has been observed that this guiding feature works only for popular keywords, whereas for less popular keywords or queries it does not help. Research in information retrieval is addressing the lack of effectiveness of current search engine interfaces by attempting to provide redundant visual cues for interpreting retrieval results and to extend user control over their presentation and selection [4][5]. Instead of presenting the user with a different list of summaries, or with a list of summaries with a different document description, we aim at giving the user more control over the set of summaries that can be selected for perusal.

Copy Right INDIACom-2011 ISSN 0973-7529 ISBN 978-93-80544-00-7


2. EXACT AND BEST MATCH RETRIEVAL
It has been widely observed that when a user submits a query to a search engine, the results obtained hardly match the exact requirements. Most often the user submits a query as a line of text, and the results rarely match exactly what the user wants to find. The question arises of how a user would know whether that exact line of text is present on the web. The general approach of almost all commercial search engines is to look for an exact match and, if none is found, to find a best or partial match based on the keywords present in the query. Such a process sometimes leads to irrelevant results, because different keywords have different meanings and their combination leads to results that the user never expected. Current search engine interfaces attempt to provide redundant visual cues for interpreting results, yet the user is still left with no choice but to click almost every result returned in order to find the useful information.

A semi-automatic solution to the partial match problem is to present the user with visual information that highlights the distribution of the various possible meanings (arising from partial matches between the query and the documents) in the documents themselves, and then let the user select the documents containing the meanings of interest. This approach has been used by Hearst, Veerasamy and Heikes, and Byrd, with the main goal of clarifying the role played by the query terms in the result of ranked output systems [6]. The approach used in this paper shares a similar concern but employs a different visualization and interaction scheme.
Based on the user's query, we analyze the results returned by the search engine (full-page analysis) and present the user with the most relevant of the results returned by the Web search engine. It has been observed that the result description returned by search engines, though precise, is not enough for the user to interpret the nature of the document. The main advantage of the approach is that users do not need to go through several screens of retrieved results and need not figure out for themselves which of the displayed documents contain a relevant combination of terms.

The goal of our research is to facilitate inspection and utilization of Web retrieval results. Our approach is based on the notion of a view, where a view is simply defined as the subset of retrieved documents that contain a specified subset of query terms. As in other recent exact-matching retrieval systems, the main rationale of our approach is that the selection of documents of interest can be facilitated by decomposing a query into its constituents and checking for their inclusion in a document individually. To accomplish this we work with semantic queries, i.e. queries that have a meaning. Although various semantic search engines exist, such as Swoogle, Hakia, SenseBot and Powerset, they are seldom used because they are not popular among a wide section of international society. Also, most semantic search engines are domain based, which limits the scope of a worldwide search.

In this paper we propose a new method to fetch relevant results for large semantic queries from commercial web search engines. Our aim is to provide the user with the most appropriate result returned by the web search engine for small as well as large semantic queries. For the experiments, the very popular and widely used search engine Google has been used as the web search tool. This kind of work has rarely been reported in the literature on user-oriented visualization and manipulation of retrieval results. We feel that this gap should be filled in order to assess the utility of these tools in a more realistic manner.

3. DESCRIPTION OF THE TOOL
The authors propose a tool, a semi-automated prototype implemented in C++, which analyses the top pages returned by a search engine for a specific semantic query and generates an order of the relevant pages (by pages we mean the web pages at the URLs returned by the search engine). As mentioned above, we take into consideration only large semantic queries. First, the semantic query supplied by the user is divided into its constituents. By constituents we mean the smaller semantic queries and the combinations of keywords that are present in the original query. As an example, consider the semantic query "role of education in the development of nation". The best-match query divisions are as follows.

Relative semantic queries, i.e. different semantic breakups of the query:
    Role of education in the development
    Role of education
    Development of nation

Relative keyword combinations, i.e. combinations of the keywords of the query:
    Role education development nation
    Role education development
    Role education nation
    Role education
    Education development
    Development nation
    Role
    Education
    Development
    Nation

The experiment is done in four phases, of which the first is manual and the rest are automated.
First phase: This phase is manual and consists of breaking a large semantic query into its constituents.


Second phase: This phase searches the web page for the frequencies of the semantic queries as well as the keywords.
Third phase: This phase generates the rank.
Fourth phase: Expected output.
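The first phase is manual in the paper. As a rough illustration of how the keyword-combination part of it might be automated, the following Python sketch enumerates every order-preserving subset of the content words of a query. The stop-word list and the function names `keywords` and `keyword_combinations` are assumptions made for this example, and the output is a superset of the hand-picked combinations listed above; the semantic breakups (such as "role of education") still need human judgment.

```python
from itertools import combinations

# Assumed stop-word list; the paper only says that articles, punctuation
# "etc." are left out of the keyword combinations.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and"}


def keywords(query):
    """Return the content words of a query in their original order."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


def keyword_combinations(query):
    """List every order-preserving keyword subset, longest first."""
    words = keywords(query)
    result = []
    for size in range(len(words), 0, -1):
        for combo in combinations(words, size):
            result.append(" ".join(combo))
    return result


# Four content words give 2**4 - 1 = 15 candidate constituents.
candidates = keyword_combinations("role of education in the development of nation")
for candidate in candidates:
    print(candidate)
```

A person would then prune this candidate list to the combinations that are worth searching for, which is the judgment the manual phase currently supplies.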

3.3 EXPERIMENT DESIGN
The experiment is designed and conducted on large semantic queries. The steps of the experiment design are:
a. Designing a large semantic query to fetch the best results from the search engine.
b. Storing and analyzing the top ten pages.
c. Searching for the various query constituents page by page and recording the frequencies of exact and best hits in a table.
d. Using the table generated above for rank evaluation.
e. The rank so obtained gives the order in which the pages may be perused.
The working of the model is shown graphically in Fig. 1.
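The page-wise frequency search in the experiment design above can be pictured with a short Python sketch. The paper does not state how a keyword combination is matched inside a page, so this example makes the simplest assumption: every constituent is counted as a case-insensitive phrase whose words are separated only by whitespace. The names `count_hits` and `hit_table` and the sample page texts are hypothetical.

```python
import re


def count_hits(page_text, constituent):
    """Count case-insensitive phrase occurrences of a constituent in a page."""
    words = constituent.lower().split()
    # Words must appear in order, separated by whitespace (an assumption).
    pattern = r"\b" + r"\s+".join(re.escape(word) for word in words) + r"\b"
    return len(re.findall(pattern, page_text.lower()))


def hit_table(pages, constituents):
    """Map each constituent to its list of frequencies, one per page."""
    return {c: [count_hits(page, c) for page in pages] for c in constituents}


# Made-up page texts standing in for the stored full-text pages.
sample_pages = [
    "The Bible and scientific accuracy. Scientific accuracy of Bible texts.",
    "Nothing relevant on this page.",
]
table = hit_table(sample_pages, ["scientific accuracy of bible",
                                 "scientific accuracy",
                                 "bible"])
```

Each row of `table` has the same shape as a row of the page analyzer tables: one constituent followed by its frequency in each stored page.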

Fig 1 Graphical design of the tool.

4. EXPERIMENTAL RESULTS
For conducting the experiments we selected large semantic queries. Closely matching and similar results were found in almost all cases. However, because of the large volume of data involved, the authors present and analyze only three semantic queries:
a. Scientific accuracy of bible predictions.
b. Effect of nuclear waste on the environment.
c. Importance of education for the people in rural areas.
As discussed above, we divided each query into its constituents, shown in Tables 1 to 3, but first we discuss the organization of the tables.

Table 1 describes the constituents and the hits found for the first query, "Scientific accuracy of bible predictions". Each table has eleven columns, but the number of rows varies with the number of constituents of the query. The first column, entitled Query, contains the constituents of the original query issued by the user. The remaining columns are the top ten results of the search engine: column 1 is the top result, column 2 the second result, and so on. The number in each cell is the frequency of that constituent in the full-text version of the web page. Table 2 describes the constituents and the hits found for the second query, "Effect of nuclear waste on the environment". Table 3 describes the constituents and the hits found for the third query, "Importance of education for the people in rural areas". After the table has been generated, the next step is to compute the rank of the columns so that the best column, i.e. the best result, can be selected. The rank is calculated as follows.

5. RANK CALCULATIONS
For the calculation of rank, the query constituents are divided into four categories:
Category-1: The original query supplied by the user.
Category-2: Those constituents of the query which have semantic significance, i.e. make sense.
Category-3: Multiple keywords, excluding punctuation, articles, etc.
Category-4: Single keywords.
As an example, the first query is categorized using the above rule as follows:
Category-1: scientific accuracy of bible predictions.
Category-2: scientific accuracy of bible.
Category-3: scientific accuracy bible, scientific accuracy predictions, scientific predictions.
Category-4: scientific, accuracy, bible, predictions.
As discussed above, the tables contain frequencies, i.e. hits, against the query constituents. These hits are taken as a weight factor for the generation of rank; in addition, the presence of a hit is recorded as logical one and its absence as zero. The rank table for the first query, "scientific accuracy of Bible predictions", contains nine columns and ten rows, excluding the heading row. The first column is entitled Pages, where Page 1 means the full-text version of the web page at the first URL returned by the Google search engine, and so on. A column headed Cat followed by a number stands for a category, e.g. Cat 1 stands for Category-1. Similarly, a column headed Wt followed by a number stands for the weight of that category, e.g. Wt 1 is the weight for Category-1. Table 4 is the rank file generated by the rank generator for the first experiment.
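The Cat and Wt columns described under RANK CALCULATIONS can be sketched in Python as follows. The paper says only that hits are "taken as a weight factor"; summing the hits of all constituents in a category is an inference from the published tables (for the first query, the Wt 4 values equal the column sums of the four single-keyword rows of Table 1). The function name `category_weights` and the dictionary layout are assumptions made for this example, and the Category 3 rows are left out of the sample data for brevity.

```python
def category_weights(hits, categories):
    """Return per-category presence flags and weights for every page.

    hits:       constituent -> list of per-page frequencies
    categories: constituent -> category number (1 to 4)
    """
    n_pages = len(next(iter(hits.values())))
    weights = {cat: [0] * n_pages for cat in (1, 2, 3, 4)}
    for constituent, row in hits.items():
        cat = categories[constituent]
        for page, freq in enumerate(row):
            weights[cat][page] += freq  # assumed rule: weight = sum of hits
    # Presence of any hit in a category is recorded as logical 1, else 0.
    flags = {cat: [1 if w > 0 else 0 for w in ws] for cat, ws in weights.items()}
    return flags, weights


# Rows taken from Table 1 (first query); Category 3 rows omitted for brevity,
# so weights[3] stays at zero in this sample.
hits = {
    "scientific accuracy of bible predictions": [0] * 10,
    "scientific accuracy of bible": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "accuracy": [3, 3, 4, 6, 1, 1, 4, 1, 1, 6],
    "predictions": [0, 5, 0, 2, 0, 0, 4, 5, 5, 1],
    "scientific": [2, 1, 44, 11, 13, 1, 0, 5, 9, 5],
    "bible": [33, 19, 31, 95, 2, 19, 7, 7, 46, 38],
}
categories = {
    "scientific accuracy of bible predictions": 1,
    "scientific accuracy of bible": 2,
    "accuracy": 4,
    "predictions": 4,
    "scientific": 4,
    "bible": 4,
}
flags, weights = category_weights(hits, categories)
print(weights[4])  # [38, 28, 79, 114, 16, 21, 15, 18, 61, 50]
```

The printed list matches the Wt 4 column of the rank file for the first query.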


The rank calculation proceeds as follows. A hit in Category 1 is of the highest importance; Category 2 has the next priority, and so on. If several pages fall in the same priority, their weights are compared to resolve the conflict. If the weights are also equal, the next category in priority order is taken into consideration. On this basis the relevant pages for the first query are ordered as P3, P4, P10, P9, P1, P2, P8, P5, P7 and P6.

In Table 4, Category 1 has no hit and thus no weight, so it is left out of consideration. In Category 2 there is a hit for result three (the best semantic match) and no hit for the other results, so P3 wins in Category 2. In Category 3 all pages have a hit (logical 1, indicating the presence of a hit), so we take their weights into consideration. P3 has the highest weight (21) but has already been selected as the best candidate, so we consider the next highest weight (14), which belongs to P4. The next winner is therefore P4. P1, P2, P5, P6, P7, P8, P9 and P10 are still in competition; of these, P10, P9, P1 and P2 emerge next, with weights 9, 8, 5 and 4 respectively. P5, P7 and P8 share the same weight, so we consider their weights in Category 4: P8 comes first with the maximum weight of 18, then P5, then P7. Finally P6, whose Category 3 weight is the lowest, is placed last. Thus the sequence of relevant results becomes P3, P4, P10, P9, P1, P2, P8, P5, P7, P6. The most relevant page for the semantic query "scientific accuracy of bible predictions" according to the experiment is therefore page number three of the top ten pages returned by the search engine.

Similarly, Tables 5 and 6 are the rank files generated by the rank generator for the second and third experiments.
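One way to read the priority rule described for the first query is as a lexicographic comparison of the four category weights: a page with any hit in a higher category beats every page without one, and ties fall through to the next category. The Python sketch below follows that reading; it is an interpretation of the prose, not the authors' C++ rank generator, and the name `rank_pages` is hypothetical. The weights are those of the rank file for the first query.

```python
def rank_pages(weights):
    """Order pages by (Wt 1, Wt 2, Wt 3, Wt 4), highest first.

    weights: one (wt1, wt2, wt3, wt4) tuple per page, in search-engine order.
    Returns 1-based page numbers, most relevant first.
    """
    # Tuples compare element by element, so Wt 1 dominates, then Wt 2, etc.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [i + 1 for i in order]


# Wt 1 to Wt 4 columns of the rank file for the first query.
rank_file_1 = list(zip(
    [0] * 10,
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [5, 4, 21, 14, 2, 1, 2, 2, 8, 9],
    [38, 28, 79, 114, 16, 21, 15, 18, 61, 50],
))
print(rank_pages(rank_file_1))  # [3, 4, 10, 9, 1, 2, 8, 5, 7, 6]
```

Because a zero weight means no hit, sorting on the weights alone also respects the Cat flags. The same function reproduces the sequence reported for the third query. For the second query, the strict reading moves Page 4 (one of the two pages with a Category 1 hit) to second place, whereas the paper lists it fourth, so the authors' tool may treat Category 1 somewhat differently from this sketch.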
The sequence of relevant results for the query "Effect of nuclear waste on the environment" (Table 5) becomes P8, P2, P1, P4, P3, P9, P10, P5, P6 and P7. The sequence of relevant results for the query "Importance of education for people in rural areas" (Table 6) becomes P1, P7, P2, P8, P3, P5, P6, P4, P9 and P10.

6. CONCLUSIONS
The model presented in this paper shifts the user effort from inspection to evaluation of retrieval results. Although it increases the number and length of the submitted queries, the level of user satisfaction also increases. One important design parameter of the tool is the amount of textual information extracted from the retrieved results and used to compute the views. If a centralized repository with all accessible documents is available, one can use a relatively high number of retrieved documents along with their full-text descriptions without sacrificing the response time of the interface. However, for ubiquitous searches on the Web, it may be necessary for efficiency reasons to use only a very limited number of the documents retrieved by the engine and to use the short document summaries provided by the engine, without downloading the full-text descriptions. The current version of the tool keeps the computational overhead small by using only the first ten retrieved results. Of course, using only a small fraction of the textual information theoretically available may affect the retrieval effectiveness of the view mechanism.

7. FUTURE SCOPE
As mentioned above, our tool is semi-automatic: the first phase, i.e. the decomposition of the semantic query into its constituents, is completely manual. This is still a challenging task, and this phase also needs to be automated in order to make the tool fully automated. After the development of a fully automated tool, a few other parameters will be worked out for testing the efficiency, throughput and response time of the tool. After the tool is fully developed and tested, it may be launched as a complete application for public use.

REFERENCES
[1] Raymond Kosala, Heverlee, Belgium: Web Mining Research: A Survey. ACM SIGKDD, Volume 2, Issue 2, pp. 1-15.
[2] www.iprospect.com
[3] Dirk Lewandowski, Hamburg: How can users be guided to quality content? Journal of Information Services & Use 28 (2008) 261-268, DOI 10.3233/ISU-2008-0583, IOS Press.
[4] Hearst, M.: User interfaces and visualization. In: Baeza-Yates, R., Ribeiro-Neto, B. (eds.): Modern Information Retrieval, pp. 257-322, ACM, New York, 1999.
[5] Savage-Knepshield, P.A., Belkin, N.J.: Interaction in information retrieval. J. Am. Soc. Inf. Sci. 50(12): 1067-1082, 1999.
[6] Ezio Berenci, Claudio Carpineto, Vittorio Giannini, Stefano Mizzaro: Effectiveness of keyword-based display and selection of retrieval results for interactive searches. Int J Digit Libr (2000) 3: 249-260, DOI 10.1007/s007990000035.


Columns 1 to 10: top 10 results by the Google search engine

Decomposed query                             1   2   3   4   5   6   7   8   9  10
scientific accuracy of bible predictions     0   0   0   0   0   0   0   0   0   0
scientific accuracy of bible                 0   0   1   0   0   0   0   0   0   0
scientific accuracy bible                    1   0   3   3   0   0   0   0   0   1
scientific accuracy predictions              0   0   0   0   0   0   0   0   0   0
scientific bible predictions                 0   0   0   0   0   0   0   0   1   0
scientific predictions                       0   0   0   0   0   0   0   1   1   0
scientific accuracy                          1   0   3   1   1   0   0   1   0   2
scientific bible                             1   0  11   4   1   0   0   0   3   3
accuracy predictions                         0   0   0   0   0   0   0   0   0   1
accuracy bible                               2   2   3   5   0   1   2   0   1   1
bible predictions                            0   2   0   1   0   0   0   0   2   1
accuracy                                     3   3   4   6   1   1   4   1   1   6
predictions                                  0   5   0   2   0   0   4   5   5   1
scientific                                   2   1  44  11  13   1   0   5   9   5
bible                                       33  19  31  95   2  19   7   7  46  38

Table 1 Result generated by page analyzer

Columns 1 to 10: top 10 results by the Google search engine

Decomposed query                               1   2   3   4   5   6   7   8   9  10
Effect of nuclear waste on the environment     0   0   0   1   0   0   0   3   0   0
Effect of nuclear waste                        1   2   1   1   0   0   0   3   0   0
effect on the environment                      1   0   0   1   0   0   0   3   1   0
effect nuclear waste environment               1   1   1   4   0   0   0   3   0   0
effect nuclear waste                           3   3   2   4   0   0   0   3   0   0
effect nuclear environment                     4   1   1   4   0   0   0   3   2   0
nuclear waste environment                      2   7   5   4   1   1   2   4   1   8
effect environment                             7   1   1   3   0   0   0   3   2   2
nuclear waste                                 10  56  14   5  11   3   3  13   1  23
effect nuclear                                 8   4   2   3   0   0   0   3   1   1
nuclear environment                           11   6   7   3   1   1   2   4   5  21
effect waste                                   3   3   2   3   0   0   0   3   1   0
waste environment                              2  10   5   3   3   5   2   4   1   9
effect                                        11   9   2   5   2   0   0   3  15   3
nuclear                                       96 128  30  19  18   5   3  19  23 116
waste                                         39 228  17  24  69  26   9  19  11  50
environment                                   19  18  24   9   3   8   6   4  14  33

Table 2 Result generated by page analyzer


Columns 1 to 10: top 10 results by the Google search engine

Decomposed query                                      1   2   3   4   5   6   7   8   9  10
importance of education for people in rural areas     0   0   0   0   0   0   0   0   0   0
importance of education for rural people              1   0   0   0   0   0   0   0   0   0
importance of education for people                    1   0   0   0   0   0   0   0   0   0
importance of education in rural areas                0   0   0   0   0   0   0   0   0   0
importance of education                               1   0   1   0   1   3   1   3   0   0
education for people in rural areas                   0   0   0   0   0   0   0   0   0   0
education for rural people                           13   5   0   0   0   0   9   0   0   0
education for people                                  0   0   0   0   0   0   0   0   0   0
education (in)(for) rural areas                       1   0   3   1   2   0   5   2   0   0
importance education people rural areas               1   0   0   0   0   0   0   0   0   0
importance education people rural                     3   0   0   0   0   0   0   0   0   0
importance education people                           3   0   0   0   0   0   0   0   0   0
importance people rural                               4   0   0   0   0   0   0   0   0   0
importance education rural areas                      1   0   0   0   0   0   0   0   0   0
importance education rural                            3   0   0   0   0   0   2   0   0   0
education people rural areas                          3   0   2   0   0   0   2   0   0   1
education people rural                               15  10   3   0   1   0  13   0   2   1
education rural areas                                 5   1  10   1   6   1  11   2   3   1
people rural areas                                    4   0   3   0   0   0   2   0   1   1
importance education                                  3   0   2   0   1   2   2   3   0   0
importance people                                     4   0   0   0   0   0   0   0   0   0
importance rural                                      5   0   0   0   0   0   3   0   1   0
importance areas                                      2   0   0   0   0   0   0   0   0   0
education people                                     15  11  24   4   1   1  13   1   2   2
education rural                                      23  12  20  12   7   1  49   3  17   3
education areas                                       5   2  22   2   6   1  11   3   2   3
people rural                                         18  11   4   0   1   0  17   0   4   1
people areas                                          4   0   5   2   0   0   2   0   1   1
rural areas                                           8   3  17   2   8   3  18   6  12   6
importance                                            5   0   2   2   1   3   3   4   1   1
education                                            59  26 458 215  41  18 141  89  26  63
people                                               25  12  45   6   1   3  21   2   6   3
rural                                                38  25  34  22  13   5 112  11  88  15
areas                                                 9   5  42  10   8   5  23  12  15  12

Table 3: Result generated by page analyzer

Pages      Cat 1   Wt 1  Cat 2   Wt 2  Cat 3   Wt 3  Cat 4   Wt 4
Page 1         0      0      0      0      1      5      1     38
Page 2         0      0      0      0      1      4      1     28
Page 3         0      0      1      1      1     21      1     79
Page 4         0      0      0      0      1     14      1    114
Page 5         0      0      0      0      1      2      1     16
Page 6         0      0      0      0      1      1      1     21
Page 7         0      0      0      0      1      2      1     15
Page 8         0      0      0      0      1      2      1     18
Page 9         0      0      0      0      1      8      1     61
Page 10        0      0      0      0      1      9      1     50

Table 4: Rank file number 1 against first query, scientific accuracy of Bible predictions

Pages      Cat 1   Wt 1  Cat 2   Wt 2  Cat 3   Wt 3  Cat 4   Wt 4
Page 1         0      0      1      2      1     51      1    165
Page 2         0      0      1      2      1     92      1    383
Page 3         0      0      1      1      1     40      1     73
Page 4         1      1      1      2      1     36      1     57
Page 5         0      0      0      0      1     16      1     92
Page 6         0      0      0      0      1     10      1     39
Page 7         0      0      0      0      1      9      1     18
Page 8         1      3      1      6      1     43      1     45
Page 9         0      0      1      1      1     14      1     63
Page 10        0      0      0      0      1     64      1    202

Table 5: Rank file number 2 against second query, Effect of nuclear waste on the environment

Pages      Cat 1   Wt 1  Cat 2   Wt 2  Cat 3   Wt 3  Cat 4   Wt 4
Page 1         0      0      1     17      1    129      1    136
Page 2         0      0      1      5      1     50      1     68
Page 3         0      0      1      4      1    112      1    581
Page 4         0      0      1      1      1     23      1    255
Page 5         0      0      1      3      1     31      1     64
Page 6         0      0      1      3      1      9      1     34
Page 7         0      0      1     15      1    145      1    300
Page 8         0      0      1      5      1     18      1    118
Page 9         0      0      0      0      1     45      1    136
Page 10        0      0      0      0      1     20      1     82

Table 6: Rank file number 3 against third query, Importance of education for people in rural areas
