
Information Retrieval: From Document Clustering to Document Ranking
By Hani Mahmoud

This is the age of the Internet.

Few people deny that. Meanwhile, some say that this is also the beginning of the age of biology, and they say so because of technology like the Internet. The word beginning is used deliberately: those who believe that technology will usher in major advancements in biology also believe that the technology is almost, but not quite, ready. They see the grand potential in the Internet, and so do I. Countless projects are therefore continuously being undertaken to take fuller advantage of the Internet. This project in particular focuses on information retrieval.

In the beginning I intended to gather 10 abstracts from PubMed and attempt to execute useful searches on them, retrieving relevant medical information by means of document clustering. My initial goal was simply to write a program and discover a similarity metric by which to cluster documents. The ultimate objective was to use such a program as a way to begin to take advantage of the (seemingly) infinitely vast resource that is the Internet. I hoped, and still hope, to have a program that returns relevant and useful documents in response to a user's search. Such a program, if developed professionally, could contribute to point-of-care medicine, allowing the sick to find appropriate treatments quickly and easily from the comfort of their own homes. PubMed is indubitably a great and reliable resource that could be enhanced if it were modified slightly to be more useful to the general public, and this project could be a good starting point for a technology-based age of biology.

First, I needed to narrow things down. There are many ways one could medically benefit from improving information retrieval techniques on the Internet, so I decided to focus on self-diagnosis of melanoma. Early detection of cancer in general is often crucial to successful treatment.
Thus, I see potential in the Internet to aid possibly sick people in discovering signs of their disease before it is too late. I envision a time when a potential patient can avoid a costly doctor's visit because he or she is fully able to rely on information found online. In this case, an individual with suspicions of early-stage melanoma could instantly visit a reliable site such as PubMed and easily access information indicating whether a visit to a doctor is warranted. I realize that this seems far-fetched given the present state of the Internet, but if one looks back at the sheer growth the Internet has undergone in the past decade, it seems that anything is possible in the near future. We just need to start somewhere, so it might as well be with this project.

Just as I recognize the ambition of this project's long-term goals, I also recognize that many similar projects have come before it. However, I believe that the methods of this specific project shed light on how to pursue the topic in the future. I started by executing a PubMed title search with the words "melanoma self detection." PubMed returned 10 entries that included all three search words in the title. One would think these 10 items would all be closely related to self-detecting melanoma. However, upon reading through the abstracts, I found that a few of them had little to do with the others and, more surprisingly, little to do with how to self-detect

melanoma. Focusing solely on the abstracts, I proceeded to the next step: filtering them. This process could have been done in code, but as a rookie programmer I decided to filter the abstracts manually. I did this by using Microsoft Word's find-and-replace feature to remove all punctuation and stop words (e.g., "the," "and," "or"). In retrospect, this manual process actually made the final code much simpler, and simple code was one of my main goals throughout the process. After this, I had to find a similarity metric by which to cluster the abstracts. I started by running Ira Kalet's [1, page 131] function, item-count-hashed, on each filtered abstract using Allegro Common Lisp on Windows 7.
(defun item-count-hashed (stream)
  (let ((results (make-hash-table))
        (tmp (read stream nil :eof)))
    ;; until is not standard Common Lisp; it is a looping macro
    ;; defined in Kalet [1].
    (until (eql tmp :eof)
      (if (gethash tmp results)
          (incf (gethash tmp results))
          (setf (gethash tmp results) 1))
      (setq tmp (read stream nil :eof)))
    results))

This code essentially constructs a hash table whose keys are the distinct words in the abstract and whose values count each word's occurrences, as seen here.
Key: BASED        Value: 1
Key: CANCER       Value: 2
Key: PARTICIPANT  Value: 4
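For comparison, the manual filtering step and the word counting performed by item-count-hashed can be sketched in a few lines of Python. This is my own illustrative sketch, not part of the original Lisp project; the stop-word list and the sample sentence are small hypothetical stand-ins for the real data.

```python
import re
from collections import Counter

# Illustrative stop-word list; the real project removed a longer set by hand.
STOP_WORDS = {"the", "and", "or", "of", "a", "an", "in", "to", "is", "was"}

def filter_abstract(text):
    """Strip punctuation and stop words, as was done manually in Word."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def item_count(words):
    """Python analogue of item-count-hashed: word -> occurrence count."""
    return Counter(words)

counts = item_count(filter_abstract(
    "The participant was asked to report cancer. Cancer detection matters."))
```

Combining the two functions this way also shows why the manual pre-filtering kept the Lisp code simple: the counting step never has to know about punctuation or stop words.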

Note that modifying this code to also filter the abstract would have required inserting more code and made it more complex, especially for a novice programmer. Similarly, writing code to derive a meaningful similarity metric from the above counts would have been extremely difficult for me. Thus, I had to return to the drawing board and start again with the filtered abstracts. After some thought, I recalled Lisp's intersection function. As its name implies, this function takes two lists as input and returns their intersection (i.e., the elements they have in common). So, I proceeded by setting each of the abstracts (named 1a, 2a, 3a, etc.) as a list:
(setq 1a '(abstract))  ; '(abstract) stands for the quoted list of the abstract's words

Next, I had to take the length of the intersection of every possible pair of abstracts in order to determine how many words one abstract had in common with another.
(length (intersection 1a 2a))

However, after reviewing some calculations that disagreed with my own reading of the abstracts, I realized that something was missing. A couple of abstracts were much shorter than the rest, so they ended up with considerably lower similarity ratings than I felt they deserved. Thus, I had to find a way to account for abstract length. The answer was simple: divide the length of the intersection by the combined length of the two abstracts.
(/ (length (intersection 1a 2a)) (float (+ (length 1a) (length 2a))))
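The same metric can be written as a short Python sketch. One caveat: the set intersection below ignores duplicate words, whereas Common Lisp's intersection on lists may or may not keep duplicates, so treat this as an approximation of the Lisp expression above rather than an exact translation.

```python
def similarity(a, b):
    """Distinct words the two abstracts share, divided by their combined
    length -- a sketch of the Lisp expression above."""
    shared = set(a) & set(b)
    return len(shared) / float(len(a) + len(b))

# Two shared words out of 3 + 4 = 7 total words.
s = similarity(["melanoma", "skin", "cancer"],
               ["melanoma", "cancer", "screening", "trial"])
```

Dividing by the combined length is what keeps a short abstract from being unfairly penalized, exactly the correction described above.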

The result was a collection of raw decimal values that required several further calculations to become meaningful. As I organized the numbers returned by the code above, I began to search for ways to actually cluster the abstracts based on the data at hand. I started by simply entering the numbers into a table, expecting some trend to reveal itself.
        1     2     3     4     5     6     7     8     9    10
  1     -
  2   .233    -
  3   .175  .156    -
  4   .118  .062  .085    -
  5   .112  .054  .101  .204    -
  6   .102  .057  .092  .145  .139    -
  7   .200  .138  .257  .107  .096  .124    -
  8   .093  .067  .128  .102  .093  .106  .114    -
  9   .086  .068  .138  .119  .099  .139  .092  .101    -
 10   .090  .066  .140  .209  .112  .110  .105  .069  .156    -
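A table like this can be generated rather than filled in by hand. The mini-corpus below is hypothetical, standing in for the ten filtered abstracts, and the similarity function simply repeats the metric introduced above.

```python
from itertools import combinations

def similarity(a, b):
    """Distinct shared words divided by combined length, as above."""
    shared = set(a) & set(b)
    return len(shared) / float(len(a) + len(b))

# Hypothetical stand-ins for the ten filtered abstracts.
abstracts = {
    1: ["melanoma", "self", "detection", "skin"],
    2: ["melanoma", "survey", "patients"],
    3: ["skin", "cancer", "self", "examination"],
}

# Lower-triangular table of pairwise scores, as in the table above.
table = {(i, j): round(similarity(abstracts[i], abstracts[j]), 3)
         for i, j in combinations(sorted(abstracts), 2)}
```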

However, upon studying the values above, I found no striking trends. Therefore, I proceeded by determining (manually again) the quartile boundaries of the values. Next, I sorted each pair of abstracts into its appropriate quartile:
0.054 – 0.092   I     1-9   1-10  2-4   2-5   2-6   2-8   2-9   2-10  3-4   3-6   7-9   8-10
0.093 – 0.107   II    1-6   1-8   3-5   4-7   4-8   5-7   5-8   5-9   6-8   7-10  8-9
0.108 – 0.139   III   1-4   1-5   2-7   3-8   3-9   4-9   5-6   5-10  6-7   6-9   6-10  7-8
0.140 – 0.257   IV    1-2   1-3   1-7   2-3   3-7   3-10  4-5   4-6   4-10  9-10
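Sorting the pairwise scores and splitting them into four equal bins could replace this manual quartile step. A sketch, using a small illustrative subset of scores rather than the full table:

```python
# Pairwise similarity scores -- an illustrative subset of the full table.
scores = {(2, 5): 0.054, (2, 4): 0.062, (1, 9): 0.086, (5, 7): 0.096,
          (6, 7): 0.124, (4, 5): 0.204, (1, 2): 0.233, (3, 7): 0.257}

ordered = sorted(scores, key=scores.get)  # pairs from least to most similar
n = len(ordered)
# Four equal-sized bins standing in for the manually computed quartiles.
quartiles = [ordered[q * n // 4:(q + 1) * n // 4] for q in range(4)]
```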

At this point I was closer to something meaningful, but still not quite there. So, I created a simple rating system: one point every time an abstract appeared in a first-quartile pair, up to four points when it appeared in the fourth. The result was the following table:
Abstract:   1    2    3    4    5    6    7    8    9   10
Score:     24   17   26   24   22   22   24   18   20   23
Delta:     -3    3    7   -1   -7   -3    0    6   -2   -2

Average |delta| = 3.4
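The point system itself can also be automated. This sketch assumes quartile assignments like those above (again a small hypothetical subset, not the full table); each abstract earns q points whenever one of its pairs falls in quartile q.

```python
# Hypothetical quartile assignments: quartile number -> pairs in that quartile.
quartiles = {1: [(2, 5), (2, 4)], 2: [(1, 9), (5, 7)],
             3: [(6, 7), (4, 5)], 4: [(1, 2), (3, 7)]}

# Each abstract earns q points whenever one of its pairs is in quartile q.
scores = {}
for q, pairs in quartiles.items():
    for i, j in pairs:
        scores[i] = scores.get(i, 0) + q
        scores[j] = scores.get(j, 0) + q
```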

Finally I had something useful. Each abstract had a score indicating how similar it is to the other abstracts in the group. One can see, for example, that abstracts 2 and 8 are much less similar to the rest, while abstract 3 is very similar to many of the others. In fact, these scores were quite congruent with the conclusions I drew from reading the abstracts. The delta row records how each score changed once abstract length was taken into account. The scores of abstracts 3 and 5 changed significantly, because one is very lengthy while the other is quite short. The average change in score, 3.4, indicates the importance of document length in projects like this one.

After examining the results, one notices that the initial goal of clustering the abstracts was essentially abandoned somewhere between the code and the calculations. In retrospect, keeping the code simple enough for my level of expertise meant abandoning my specific goals and essentially seeing where the code took me. In a professional situation this would not be practical, but I sincerely believe that this approach made for a good learning process in this particular case. Furthermore, abandoning the initial goal did not affect my progress toward the ultimate goal of improving information retrieval in order to place the health field at patients' fingertips. Although the results above may not have clustered the abstracts, they surely ranked them in a useful manner. When refined, this concept could improve information retrieval in various ways. For example, one might search for "melanoma self-detection" and locate a very useful or informative article; the ranking system developed here could then be used as a tool to find similar articles that are just as helpful as the original one.

However, there is still a way to go. I realize that to expand this project to a professional level, much more code would have to be written. It would be extremely impractical, for example, to manually filter all punctuation and stop words, and all manual calculations would have to be automated. Thus, balance is key. I was able to get away with a very minimal amount of coding here: the whole project revolves around only two lines of code. Yet I was still able to get the job done.
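As a pointer toward that automation, the whole manual pipeline of filter, compare, and rank could be collapsed into one short script. Everything here (the corpus, the stop-word list, the function name) is illustrative, and for brevity it sums the raw similarities as a simpler proxy for the quartile point system.

```python
import re
from itertools import combinations

STOP_WORDS = {"the", "and", "or", "of", "a", "an", "in", "to", "is"}

def rank(docs):
    """Filter each document, compare all pairs, and rank by total similarity."""
    words = {k: [w for w in re.findall(r"[a-z]+", t.lower())
                 if w not in STOP_WORDS]
             for k, t in docs.items()}
    totals = dict.fromkeys(docs, 0.0)
    for i, j in combinations(docs, 2):
        s = len(set(words[i]) & set(words[j])) / float(len(words[i]) + len(words[j]))
        totals[i] += s
        totals[j] += s
    return sorted(docs, key=totals.get, reverse=True)

# Hypothetical three-document corpus.
ranking = rank({"a": "melanoma skin self detection",
                "b": "melanoma skin survey",
                "c": "melanoma weather report"})
```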
On the other hand, Microsoft Excel, which is indubitably a far more complex program, reportedly contains eleven million lines of source code. Clearly Excel is a large program that requires a lot of code, but there is no reason it needs eleven million lines. Had its programmers been more economical in their coding, they surely could have reduced the code to a mere fraction of that. It is important for source code to remain concise, so that it can be understood by a single human being whenever possible. While this project would need more code to be commercially useful, its basis revolves around a minimal amount of code. Thus, even if the ranking system developed in this project were useless, the conservative coding techniques that made it can and should be applied elsewhere. Programs may require more than two lines of source code, but two lines can go a long way.

Reference

1. Kalet, I. 2009. Principles of Biomedical Informatics. London, UK: Academic Press.
