Professional Documents
Culture Documents
Style Similarities
TANGUY URVOY, EMMANUEL CHAUVEAU, and PASCAL FILOCHE
Orange Labs (France Telecom R&D)
and
THOMAS LAVERGNE
Orange Labs and ENST Paris
Automatically generated content is ubiquitous on the Web: dynamic sites built using the three-tier
paradigm are good examples (e.g., commercial sites, blogs, and other sites edited using web authoring
software), as well as less legitimate spamdexing attempts (e.g., link farms, faked directories).
Pages built using the same generating method (template or script) share a common “look
and feel” that is not easily detected by common text classification methods, but is more related to
stylometry.
In this work we study and compare several HTML style similarity measures based on both
textual and extra-textual features in HTML source code. We also propose a flexible algorithm to
cluster a large collection of documents according to these measures. Since the proposed algorithm
is based on locality sensitive hashing (LSH), we first review this technique.
We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and
enhance the quality of spam classifiers. We present an evaluation of our algorithm on the WEBSPAM-
UK2006 dataset.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Anal-
ysis and Indexing
General Terms: Algorithms, Experimentation
Additional Key Words and Phrases: Clustering, document similarity, search engine spam, stylometry, template identification
ACM Reference Format:
Urvoy, T., Chauveau, E., Filoche, P., and Lavergne, T. 2008. Tracking Web spam with
HTML style similarities. ACM Trans. Web 2, 1, Article 3 (February 2008), 28 pages. DOI =
10.1145/1326561.1326564 http://doi.acm.org/10.1145/1326561.1326564
Authors’ address: Orange Labs, 2 Avenue Pierre Marzin, 22307 Lannion cedex, France; e-mail:
{Tanguy.urvoy,emmanuel.chauveau,pascal.filoche,tomas.lavergne}@orange-ftgroup.com.
ACM Transactions on the Web, Vol. 2, No. 1, Article 3, Publication date: February 2008.
3:2 • T. Urvoy et al.
1. INTRODUCTION
Automatically generated content is nowadays ubiquitous on the Web, especially
with the advent of professional web sites and popular three-tier architectures
such as “LAMP” (Linux, Apache, MySQL, PHP). Generating these pages with such an
architecture involves:
—a scripting component;
—a page template (“skeleton” of the site pages);
—content (e.g., product catalog or articles repository), usually stored in
databases.
When summoned, the scripting component combines the page template with
information from the database to generate an HTML page that is, from a crawler's
point of view, indistinguishable from a static HTML page.
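As a minimal sketch (the template and data are invented for illustration, not taken from any real site), the scripting step amounts to:

```python
# Minimal sketch of three-tier page generation: a script merges a fixed
# HTML template ("skeleton") with content from a database. All names
# here are illustrative.
TEMPLATE = """<html><head><title>{title}</title></head>
<body><h1>{title}</h1><ul>{items}</ul></body></html>"""

def render_page(title, products):
    # In a real LAMP stack, the rows would come from a SQL query.
    items = "".join(f"<li>{p}</li>" for p in products)
    return TEMPLATE.format(title=title, items=items)

page = render_page("Catalog", ["shoes", "hats"])
print(page)
```

Every page rendered this way shares the template's markup, whitespace, and tag layout, even though the visible content differs.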
1.2.2 HTML, Stylometry, and Templates. What best describes the relationship
between pages generated using the same template or method is style rather
than topic. This relates our problem to the field of stylometry. Up to now,
stylometry has mostly been associated with authorship identification, dealing
with problems such as attributing Shakespeare's plays to the right author or
detecting computer software plagiarism [Gray et al. 1997]. Usual metrics in
stylometry are mainly based on word counts [McEnery and Oakes 2000], but
sometimes also on nonalphabetic features such as punctuation. In the area of
web spam detection, Westbrook and Greene [2002], Ntoulas et al. [2006], and
Lavergne [2006] propose using lexicometric features to classify the part of web
spam that does not follow regular language metrics.
Fig. 1. The style-based spam detection framework (from HTML files, through preprocessing and clustering, to enhanced tagging): all computing steps are described in Section 3.
The specific problem of template detection has been well studied. For instance,
Bar-Yossef and Rajagopalan [2002] and, more recently, Chakrabarti et al. [2007]
proposed Document Object Model (DOM-tree) based algorithms to remove templates
from HTML files. To circumvent the heavy computational cost of DOM-tree
extraction, Chen et al. [2006] proposed a “flat” algorithm to detect and remove
template blocks from HTML. All these papers consider template structure as
noise.
A family F of hash functions is an LSH scheme for a similarity function sim if,
for any couple d1, d2 ∈ D, we have

    Pr_{h ∈ F} [h(d1) = h(d2)] = sim(d1, d2).

We can build an estimator for the similarity by gluing together independent
LSH functions. Let simH(x, y) be the number of equal dimensions between two
vectors x, y, that is, simH(x, y) = |{i | xi = yi}|. Let h1, ..., hm be m
independent LSH functions, and let H be the mapping defined by
H(d) = (h1(d), ..., hm(d)). Then

    sim(d1, d2) ≈ simH(H(d1), H(d2)) / m.
To avoid confusion, we call the reduced vector H(d) the LSH fingerprint, and the
hash values hi(d) for 1 ≤ i ≤ m the LSH keys. We present here two standard
algorithms for building LSH fingerprints and their application to large-scale
similarity clustering. Other known similarity estimation algorithms are
mod-sampling, proposed by Manber [1994], and winnowing, by Schleimer et al. [2003].
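As an illustration of a Broder-style scheme (a sketch, not the authors' implementation), MinHash keys can be built and compared as follows:

```python
import random

def minhash_fingerprint(tokens, m=64, seed=0):
    # One independent hash function per key, simulated by salting Python's
    # built-in hash: Pr[the min-hashes of two sets agree] equals their
    # Jaccard similarity (Broder 1997).
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(m)]
    return [min(hash((s, t)) for t in tokens) for s in salts]

def estimated_similarity(fp1, fp2):
    # simH(H(d1), H(d2)) / m: the fraction of equal dimensions.
    return sum(a == b for a, b in zip(fp1, fp2)) / len(fp1)

a = minhash_fingerprint({"the", "quick", "brown", "fox"})
b = minhash_fingerprint({"the", "quick", "brown", "dog"})
print(estimated_similarity(a, b))  # should be near the Jaccard similarity 3/5
```

With m = 64 keys the estimator's standard deviation is about sqrt(p(1-p)/64), so a few dozen keys already give a usable estimate.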
Fig. 2. A rough comparison of real similarity and its 256-bit LSH estimation for 400 similarity pairs picked in a set of 1000 HTML files (Jaccard vs. MinHashing and cosine vs. Charikar, plotted against full-text similarity).
where θ(d1, d2) is the angle between d1 and d2. This value is mapped to the
cosine similarity by:

    cosine(d1, d2) = cos(π · (1 − Pr)).
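A sketch of Charikar's random-hyperplane scheme illustrates this mapping (here the Gaussian components are drawn directly rather than with the prime-shuffle trick described later; the vectors, dimension, and seed are arbitrary):

```python
import math
import random

def simhash_bits(vec, n_bits=256, dim=8, seed=1):
    # One random hyperplane per bit: bit i is the sign of <r_i, vec>.
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    return [sum(r * v for r, v in zip(plane, vec)) >= 0 for plane in planes]

def cosine_estimate(bits1, bits2):
    # Pr[bits agree] = 1 - theta/pi, hence cosine(d1, d2) = cos(pi * (1 - Pr)).
    pr = sum(x == y for x, y in zip(bits1, bits2)) / len(bits1)
    return math.cos(math.pi * (1 - pr))

u = [1, 2, 0, 0, 1, 0, 3, 1]
v = [1, 2, 0, 1, 1, 0, 3, 0]
est = cosine_estimate(simhash_bits(u), simhash_bits(v))
print(est)  # near the true cosine 15/16 = 0.9375
```

The same seed is used for both fingerprints so that they share the same hyperplanes, which is what makes the agreement probability meaningful.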
Fig. 3. To detect that two pages were generated by the same script, one must consider both visual
features and hidden features of HTML. For example, some scripts always jump one line after writing
height=“2”><font.
3.1 Preprocessing
3.1.1 Splitting Content and Noise. The usual procedure to compare two
documents is to first remove everything that does not reflect their content. In
the case of HTML documents, this preprocessing step may include removing
tags, extra spaces and stop words. It may also include normalization of words
by capitalization or stemming.
In order to reflect similarity based on form (and more specifically tem-
plates) rather than content, we propose to apply the opposite strategy: keep
only the “noisy” parts of HTML documents by removing any usual informa-
tional content, and then compute document similarity based on the filtered
version.
—html alpha preprocessor (denoted hss as in Urvoy et al. [2006]) filters all
alphanumeric characters out of the original document and keeps everything
else. We expect to be able to model the “style” of HTML documents through
these neglected features of HTML texts, such as extra spaces, line feeds, or
tags.
—html alpha var spaces (denoted hss-varsp) is a variant of the former, where
repeated blank spaces are squeezed into a single one. This is intended to
smooth differences that may arise in the output of documents using the same
template but a varying number of words in their content parts. For a very
large text, a large part of the html alpha output would otherwise consist of
the spaces used to separate words, reducing the impact of the true template
part in the preprocessed document.
—html tags (denoted tags) applies a straightforward method to extract the for-
matting skeleton from the page: it filters anything but HTML tags and at-
tributes from the original document. Content between tags is ignored, as well
as comments and javascript.
—html tags and alpha (denoted tags-alpha) mixes the approaches of html
alpha and html tags. HTML tags and attributes are kept in the output,
as well as any nonalphanumeric characters between tags and in javascript
chunks.
Two more preprocessors are used in order to compare those strategies with
more standard filters:
—html to words (denoted words) outputs every alphanumeric character outside of tags;
—full (denoted full) outputs the initial document unmodified.
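These character-level filters can be approximated with a few regular expressions (a sketch of the preprocessors, not the exact implementation):

```python
import re

def hss(html):
    # html alpha: drop every alphanumeric character, keep everything else
    # (spaces, punctuation, line feeds, tag delimiters).
    return re.sub(r"[A-Za-z0-9]", "", html)

def hss_varsp(html):
    # Variant: squeeze runs of blank spaces into a single one.
    return re.sub(r" +", " ", hss(html))

def words(html):
    # Baseline: keep only the words outside of tags.
    no_tags = re.sub(r"<[^>]*>", " ", html)
    return " ".join(re.findall(r"[A-Za-z0-9]+", no_tags))

sample = '<p class="x">Hello,   world!</p>'
print(hss(sample))        # < ="">,   !</>
print(hss_varsp(sample))  # < ="">, !</>
print(words(sample))      # Hello world
```

The tags and tags-alpha preprocessors would additionally need a small HTML tokenizer to keep tag and attribute names, which is omitted here.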
WORDS (words)
Mozilla org Home of the Mozilla Project Skip to main content Mozilla About ...
FULL (full)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<title>Mozilla.org - Home of the Mozilla Project</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="keywords" content="web browser, mozilla, firefox, camino, thunderbird, ...">
HSS (hss)
<>
<>. - </>
< -="-" ="/; =-">
< ="" =" , , , , , , , ">
HSS-VARSP (hss-varsp)
<>
<>.- </>
<-="-" ="/; =-">
<="" =", , , , , , , ">
TAGS (tags)
<DOCTYPE HTML PUBLIC W3C DTD HTML 4 01 EN http www w3 org TR html4 strict dtd ><html lang >...
...<head><title></title><meta http equiv content ><meta name content><link rel type href media
>...
TAGS-ALPHA (tags-alpha)
<head>
<title>. - </title>
<meta http equiv content >
<meta name content >
3.2 Fingerprinting
As mentioned earlier, using a pairwise similarity measure does not fit large-scale
clustering. Using LSH fingerprints on the preprocessing output is required
to address this scalability issue. Some technical details may significantly affect
the quality of the similarity estimation.
mask. To reduce the fingerprint size to 256 bits, we keep only four bits per
key.
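The 4-bit masking can be sketched as follows (illustrative; it assumes 64 MinHash keys per document):

```python
def pack_fingerprint(keys):
    # Keep only the 4 low-order bits of each of the 64 hash keys,
    # concatenated into a single 256-bit fingerprint.
    assert len(keys) == 64
    fp = 0
    for k in keys:
        fp = (fp << 4) | (k & 0xF)
    return fp

fp = pack_fingerprint(list(range(64)))
print(fp.bit_length())  # at most 256
```

Keeping 4 bits per key makes spurious key collisions more likely (1/16 instead of nearly 0), but this bias is small and constant, and the fingerprint fits in four machine words.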
For Charikar's algorithm, instead of precomputing the random normal distribution
of the hyperplane characteristic vector r, we use a linear shuffle and a
precomputed inverse normal function: let P be a large prime, and let
inorm : [n] → Z be a discretization of the inverse normal distribution; then

    ri := inorm(i P mod n).

From 256 random primes we get a 256-bit fingerprint. This implementation
slightly differs from the one used by Henzinger [2006], where the ri values are
restricted to {−1, +1}.
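The ri construction can be sketched as follows (a sketch only; the table size n and the prime are illustrative):

```python
from statistics import NormalDist

def hyperplane(prime, n=1009):
    # r_i := inorm(i * P mod n): a precomputed discretized inverse normal
    # table, permuted by multiplying the index by a prime modulo n.
    inv = NormalDist().inv_cdf
    # Discretized inverse normal law on (0, 1), avoiding the endpoints.
    inorm = [inv((k + 0.5) / n) for k in range(n)]
    return [inorm[(i * prime) % n] for i in range(n)]

r = hyperplane(257)
print(len(r))  # 1009
```

Since the prime is coprime to n, the map i ↦ iP mod n is a permutation, so each hyperplane is a reshuffling of the same normal-looking table: cheap to generate and close enough to i.i.d. Gaussian components for sign hashing.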
Fig. 4. Correlogram between the similarity measures (B-words, C-words, B-full, C-full, B-hss, C-hss, B-tags, C-tags, B-tags-alpha, C-tags-alpha, B-hss-varsp, C-hss-varsp) on a sample of 10^5 pairs of HTML documents. Most of these sampled pairs were dissimilar. Darker colors imply higher correlation.
Fig. 5. The quasi-transitivity of the estimated similarity relation is well illustrated by this complete
similarity graph (realized from 3000 HTML files). With a transitive relation, each connected component
would be a clique. For “bunch of grapes” components like (1) and (2), there is a low probability
for connecting edges to be sampled. An edge density measure may be employed to detect “worm”
components like (3).
Fig. 6. Edge recall according to relative window aperture (% of keys number) for 10^5 keys and a similarity threshold of 40/64: even a small window allows us to retrieve most similarity edges in one pass.
[Figure: from LSH fingerprints to similarity edges to URL clusters]
Fig. 8. Spam/normal labels in an HTML style similarity graph. This graph was built with the B-hss-varsp
similarity on a sample of 10^5 HTML files from WEBSPAM-UK2006. Each node represents a host and
each edge represents a high similarity between pages in different hosts. Black nodes indicate spam
labels and white nodes indicate normal labels; the others are unknown. The consistency of clusters is
better than in Figure 9. The highly connected clusters are symptomatic of link farms or mirrors.
We observe in Figure 8 that most host clusters contain only one kind of label.
In such a case, a high consistency means that the major label really prevails
over the other: the label can be propagated to the whole cluster. On the other
hand, a low consistency indicates a balanced amount of each label, and we
cannot rely on this cluster to propagate one label or the other. The big central
cluster of Figure 9 is a perfect example of such indecision.
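This consistency rule can be sketched as follows (the 0.9 threshold is illustrative):

```python
from collections import Counter

def propagate(labels, threshold=0.9):
    # labels: the known "spam"/"normal" labels inside one cluster (may be
    # partial). Returns the label to propagate to the whole cluster, or None
    # when the cluster is too balanced to be trusted.
    counts = Counter(labels)
    if not counts:
        return None  # no label at all: unreachable by tag spreading
    label, n = counts.most_common(1)[0]
    consistency = n / sum(counts.values())
    return label if consistency >= threshold else None

print(propagate(["spam"] * 9 + ["normal"]))  # spam (consistency 0.9)
```

A balanced cluster such as `["spam", "normal"]` yields `None`, which is exactly the indecision case discussed above.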
A straightforward approach is to propagate a reliable human evaluation. As
discussed later in the WEBSPAM-UK2006 experiments section, this approach is
efficient for big clusters (most of the time these are link farms), but since
the majority of URLs are in small or medium clusters, the recall remains too
low.
A more efficient way to use the small and medium clusters is to train a clas-
sifier on features available for all hosts or web pages. Such content-based and
Fig. 9. Spam/normal labels in a words similarity graph. This graph was built with the B-words
similarity on a sample of 10^5 HTML files from WEBSPAM-UK2006. Each node represents a host and
each edge represents a high similarity between pages in different hosts. Black nodes indicate spam
labels and white nodes indicate normal labels; the others are unknown. The small pure clusters
indicate “spam-full” and “spam-free” topics. Most hosts are in the biggest cluster: spam-independent
topics.
Fig. 10. By sorting all HTML documents by decreasing similarity with one reference page (here
from franao.com) we get a curve with different gaps (x axis: row × 1000, by decreasing HSS-similarity;
y axis: HSS-similarity to franao.com). The third gap (around 180,000) marks the end of
the franao web site.
4.1 Dataset
We used a corpus of five million HTML pages crawled from the Web. This corpus
was built by combining three crawl strategies:
At the end of the process, we roughly estimated that 2/5 of the collected documents
were spam.
After fingerprinting, the initial volume of 130 GB of data to analyze was
reduced to 390 MB, so the next step could be performed in main memory.
—the first 20,000 HTML pages are long lists of sponsored links, all built using
the same template (template 1);
—around 20,000 there is a smooth gap (1) between long-list and short-list pages,
but up to 95,000, the template is the same;
—around 95,000, there is a strong gap (2) which marks a new template: a list of
80 random internal links spanning four columns (template 2); up to 180,000,
all pages are internal franao links built according to template 2;
—around 180,000 there is a strong gap (3) between franao pages and other
web sites' pages.
(b) Copy/Paste clusters contain pages that are not part of mirrors but
do share the same content: either a text (e.g., license, porn site legal
warning), a frameset scheme, or the same javascript code (often with
little actual content).
The style similarity cluster classes are the most interesting benefit of
B-hss clustering: they allow an easy classification of web pages by categorizing
only a few of them.
5. UK-2006 EXPERIMENTS
5.1 Dataset
The WEBSPAM-UK2006 reference dataset is a collection based on a crawl of the .uk
domain done by the University of Roma La Sapienza, with a large number of
hosts labeled by a team of volunteers. The whole labeling process is described
by Castillo et al. [2006] in their article “A reference collection for Web spam.”
The full dataset contains 77 million Web pages spread over 11,400 hosts.
There are about 5,400 labeled hosts. Because the assessors spent on average
five minutes per host, it makes sense to summarize the dataset by taking only
the first 400 reachable pages of each host. We worked on the summarized
dataset, which contains 3.3 million web pages stored in 8 volumes of 1.7 GB
each.
Fig. 11. Distribution of cluster size (number of hosts, bucketed as 1, 2–3, 4–7, 8–15, and 16+) and number of clusters for each preprocessing.
Table II. Consistency of the Clusters Depending on the Algorithm, Preprocessing, and Cluster Size

                            Alg. B                     Alg. C
Preproc.    Size    Mean Consist.  Stddev     Mean Consist.  Stddev
words       2–3        0.9939      0.077001      0.9764      0.14974
words       4–7        0.98411     0.11016       0.97556     0.13342
words       8–15       0.96433     0.15218       0.98735     0.084978
words       16+        0.98403     0.10645       0.98996     0.1316
full        2–3        0.9956      0.065797      0.99459     0.073127
full        4–7        0.99418     0.071929      0.99455     0.069851
full        8–15       0.99        0.084263      0.98612     0.10964
full        16+        0.9945      0.06451       0.99597     0.056474
hss         2–3        0.99819     0.041969      0.99691     0.055119
hss         4–7        0.99748     0.046309      0.99669     0.053076
hss         8–15       0.99567     0.059022      0.99556     0.058094
hss         16+        0.99591     0.057085      0.99312     0.071062
hss-varsp   2–3        0.99762     0.048259      0.99797     0.044552
hss-varsp   4–7        0.9979      0.041595      0.9973      0.047808
hss-varsp   8–15       0.98827     0.093353      0.99586     0.056173
hss-varsp   16+        0.99501     0.062869      0.99318     0.070702
tags        2–3        0.99807     0.043069      0.99838     0.03961
tags-alpha
recall and, on the other hand, by the clustering itself, which leaves some URLs
(and hosts) unreachable. In other words, some clusters may not contain any
spam or normal label. URLs in such clusters are impossible to label by tag
spreading.
Methods based on words (i.e., words and full) have a recall that is significantly
lower than style-based methods, whatever the rate of known labels in
input. Figure 14 combines precision and recall and shows the efficiency gap
between words-based methods and the other methods. This gap is also clear
with the F1 measure:

    F1 = (2 · precision · recall) / (precision + recall)

The F1 measure is a way to assess the overall quality of the various
methods. Figure 15 shows the F1 measure according to the percentage of tagged
hosts.
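As a quick worked example, a method with precision 0.8 but recall 0.4 only reaches F1 ≈ 0.53:

```python
def f1(precision, recall):
    # Harmonic mean: dominated by the weaker of the two scores.
    return 2 * precision * recall / (precision + recall)

print(f1(0.8, 0.4))  # ~0.533
```

This is why the low recall of words-based methods drags their F1 down even when their precision is comparable to the style-based methods.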
For HTML style-based methods, we note that Charikar-based methods are
generally slightly better for the highest percentages of known input, while the
corresponding Broder-based methods are better for smaller amounts of known
input.
We also note, according to Figure 15, a global performance increase with
the rate of known labels. Hence, using an external imperfect spam classifier is
useful to increase the spam/normal base information before the tag spreading
described in Section 3.4.

[Figure: precision according to the percentage of already tagged hosts, one curve per method: B-words, C-words, B-full, C-full, B-hss, C-hss, B-hss-varsp, C-hss-varsp, B-tag-alpha, C-tag-alpha, B-tags, C-tags]
According to these results, we note that the tags-alpha preprocessor gives the
best results for lower rates of tagged hosts when combined with Broder fingerprints,
and for higher rates when combined with Charikar fingerprints. The
B-hss-varsp method is the best on average.
5.2.4 A Side Note About Mirrors. The full preprocessing also allows us to
detect mirrors by looking at web sites with very high similarities.
This is done in two steps. First, we cluster pages using the B-full similarity
with a high similarity threshold of 60 out of 64. Second, we cluster the hosts
using the Jaccard similarity on the sets of page clusters they span.
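The second step can be sketched as follows (host names and cluster identifiers are invented for illustration):

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over the sets of page-level clusters a host spans.
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical output of the first step: for each host, the set of
# near-duplicate page clusters its pages fall into.
host_clusters = {
    "www.example.org":    {1, 2, 3, 4},
    "mirror.example.net": {1, 2, 3, 4},  # spans the same clusters: a mirror
    "other.example.com":  {5, 6},
}

hosts = sorted(host_clusters.items())
mirrors = [
    (h1, h2)
    for i, (h1, c1) in enumerate(hosts)
    for h2, c2 in hosts[i + 1:]
    if jaccard(c1, c2) > 0.9
]
print(mirrors)  # [('mirror.example.net', 'www.example.org')]
```

Two hosts whose pages fall into (almost) the same set of near-duplicate clusters are, by construction, serving (almost) the same pages.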
[Figure: recall according to the percentage of already tagged hosts, one curve per method: B-words, C-words, B-full, C-full, B-hss, C-hss, B-hss-varsp, C-hss-varsp, B-tag-alpha, C-tag-alpha, B-tags, C-tags]

Fig. 14. Precision versus recall for clustering methods according to the prelabeled hosts rate.
[Fig. 15: F1 measure according to the percentage of already tagged hosts, one curve per method]
Fig. 16. Evolution of precision when injecting errors, for the tag-alpha preprocessed clusterings (B-tag-alpha and C-tag-alpha).
REFERENCES
BAR-YOSSEF, Z. AND RAJAGOPALAN, S. 2002. Template detection via data mining and its applications.
In Proceedings of the 11th International Conference on World Wide Web (WWW’02). ACM Press,
580–591.
BAWA, M., CONDIE, T., AND GANESAN, P. 2005. LSH forest: Self-tuning indexes for similarity search.
In Proceedings of the 14th International Conference on World Wide Web (WWW’05). ACM Press,
651–660.
BENCZÚR, A., CSALOGÁNY, K., AND SARLÓS, T. 2006. Link-based similarity search to fight web spam.
In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the
Web (AIRWeb’06). Seattle, WA.
BOULLÉ, M. 2006. MODL: A Bayes optimal discretization method for continuous attributes. Ma-
chine Learn. 65, 1, 131–165.
BRODER, A. 1997. On the resemblance and containment of documents. In Proceedings of the
Compression and Complexity of Sequences (SEQUENCES’97). IEEE Computer Society. 21.
BRODER, A. Z., GLASSMAN, S. C., MANASSE, M. S., AND ZWEIG, G. 1997. Syntactic clustering of the web.
In Selected Papers from the 6th International Conference on World Wide Web. Elsevier Science
Publishers, 1157–1166.
CASTILLO, C., DONATO, D., BECCHETTI, L., BOLDI, P., LEONARDI, S., SANTINI, M., AND VIGNA, S. 2006. A
reference collection for web spam. SIGIR Forum 40, 2, 11–24.
CASTILLO, C., DONATO, D., GIONIS, A., MURDOCK, V., AND SILVESTRI, F. 2007. Know your neigh-
bors: Web spam detection using the web topology. In Proceedings of SIGIR. ACM Press, 423–
430.
CHAKRABARTI, D., KUMAR, R., AND PUNERA, K. 2007. Page-level template detection via isotonic
smoothing. In Proceedings of the 16th International Conference on World Wide Web (WWW’07).
ACM Press, 61–70.
CHARIKAR, M. S. 2002. Similarity estimation techniques from rounding algorithms. In Proceed-
ings of the 34th Annual ACM Symposium on Theory of Computing (STOC’02). ACM Press, 380–
388.
CHEN, L., YE, S., AND LI, X. 2006. Template detection for large scale search engines. In Proceedings
of the ACM Symposium on Applied Computing (SAC’06). ACM Press, 1094–1098.
FETTERLY, D., MANASSE, M., AND NAJORK, M. 2004. Spam, damn spam, and statistics: Using statis-
tical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the
Web and Databases (WebDB’04). ACM Press, 1–6.
FETTERLY, D., MANASSE, M., AND NAJORK, M. 2005. Detecting phrase-level duplication on the world
wide web. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval (SIGIR’05). ACM Press, 170–177.
FILOCHE, P., URVOY, T., CHAUVEAU, E., AND LAVERGNE, T. 2007. France Telecom R&D entry. Web
Spam Challenge 2007 (Track I).
GRAY, A., SALLIS, P., AND MACDONELL, S. 1997. Software forensics: Extending authorship analysis
techniques to computer programs. In Proceedings of the 3rd Biannual Conference of International
Association of Forensic Linguists (IAFL’97). 1–8.
GYÖNGYI, Z. AND GARCIA-MOLINA, H. 2005. Web spam taxonomy. In Proceedings of the 1st Interna-
tional Workshop on Adversarial Information Retrieval on the Web (AIRWeb’05). Chiba, Japan.
GYÖNGYI, Z., GARCIA-MOLINA, H., AND PEDERSEN, J. 2004. Combating Web spam with TrustRank.
In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB). Morgan
Kaufmann, 576–587.
HEINTZE, N. 1996. Scalable document fingerprinting. In Proceedings of the USENIX Workshop
on Electronic Commerce.
HENZINGER, M. 2006. Finding near-duplicate web pages: A large-scale evaluation of algorithms.
In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Devel-
opment in Information Retrieval (SIGIR ’06). ACM Press, 284–291.
INDYK, P. AND MOTWANI, R. 1998. Approximate nearest neighbors: towards removing the curse
of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing
(STOC’98). ACM Press, 604–613.
JENKINS, B. 1997. A hash function for hash table lookup. Dr. Dobb's Journal.
LAVERGNE, T. 2006. Unnatural language detection. In Proceedings of Young Scientists’ Conference
on Information Retrieval (RJCRI’06).
MANBER, U. 1994. Finding similar files in a large file system. In USENIX Winter. 1–10.
MCENERY, T. AND OAKES, M. 2000. Authorship identification and computational stylometry. In
Handbook of Natural Language Processing. Marcel Dekker Inc.
MEYER ZU EISSEN, S. AND STEIN, B. 2004. Genre classification of web pages. In Proceedings of 27th
German Conference on Artificial Intelligence (KI-04), S. Biundo, T. Frühwirth, and G. Palm, Eds.
Lecture Notes in Computer Science, vol. 3238.
NTOULAS, A., NAJORK, M., MANASSE, M., AND FETTERLY, D. 2006. Detecting spam web pages through
content analysis. In Proceedings of the 15th International Conference on World Wide Web
(WWW’06). ACM Press, 83–92.
SCHLEIMER, S., WILKERSON, D. S., AND AIKEN, A. 2003. Winnowing: Local algorithms for document
fingerprinting. In Proceedings of the SIGMOD Conference, A. Y. Halevy, Z. G. Ives, and A. Doan,
Eds. ACM Press, 76–85.
URVOY, T., LAVERGNE, T., AND FILOCHE, P. 2006. Tracking web spam with hidden style similarity. In
Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web
(AIRWeb’06).
VAN RIJSBERGEN, C. J. 1979. Information Retrieval, 2nd ed. University of Glasgow, Glasgow, Scotland, UK.
WESTBROOK, A. AND GREENE, R. 2002. Using semantic analysis to classify search engine spam.
Tech. rep., Stanford University.
ZOBEL, J. AND MOFFAT, A. 1998. Exploring the similarity space. SIGIR Forum 32, 1, 18–34.