
Tracking Web Spam with HTML Style Similarities
TANGUY URVOY, EMMANUEL CHAUVEAU, and PASCAL FILOCHE
Orange Labs (France Telecom R&D)
and
THOMAS LAVERGNE
Orange Labs and ENST Paris

Automatically generated content is ubiquitous in the web: dynamic sites built using the three-tier
paradigm are good examples (e.g., commercial sites, blogs and other sites edited using web author-
ing software), as well as less legitimate spamdexing attempts (e.g., link farms, faked directories).
Those pages built using the same generating method (template or script) share a common “look
and feel” that is not easily detected by common text classification methods, but is more related to
stylometry.
In this work we study and compare several HTML style similarity measures based on both
textual and extra-textual features in HTML source code. We also propose a flexible algorithm to
cluster a large collection of documents according to these measures. Since the proposed algorithm
is based on locality sensitive hashing (LSH), we first review this technique.
We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and
enhance the quality of spam classifiers. We present an evaluation of our algorithm on the WEBSPAM-
UK2006 dataset.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Anal-
ysis and Indexing
General Terms: Algorithms, Experimentation
Additional Key Words and Phrases: Clustering, document similarity, search engine spam, stylometry, template identification
ACM Reference Format:
Urvoy, T., Chauveau, E., Filoche, P., and Lavergne, T. 2008. Tracking Web spam with
HTML style similarities. ACM Trans. Web 2, 1, Article 3 (February 2008), 28 pages. DOI =
10.1145/1326561.1326564 http://doi.acm.org/10.1145/1326561.1326564

Authors’ address: Orange Labs, 2 Avenue Pierre Marzin, 22307 Lannion cedex, France; e-mail:
{Tanguy.urvoy,emmanuel.chauveau,pascal.filoche,tomas.lavergne}@orange-ftgroup.com.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is
granted without fee provided that copies are not made or distributed for profit or direct commercial
advantage and that copies show this notice on the first page or initial screen of a display along
with the full citation. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,
to redistribute to lists, or to use any component of this work in other works requires prior specific
permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn
Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permission@acm.org.

© 2008 ACM 1559-1131/2008/02-ART3 $5.00 DOI 10.1145/1326561.1326564 http://doi.acm.org/10.1145/1326561.1326564

ACM Transactions on the Web, Vol. 2, No. 1, Article 3, Publication date: February 2008.
3:2 • T. Urvoy et al.

1. INTRODUCTION
Automatically generated content is nowadays ubiquitous on the Web, especially
with the advent of professional web sites and popular three-tier architectures
such as “LAMP” (Linux Apache Mysql Php). Generation of these pages using
such architecture involves:
—a scripting component;
—a page template (“skeleton” of the site pages);
—content (e.g., product catalog or articles repository), usually stored in
databases.
When summoned, the scripting component combines the page template with
information from the database to generate an HTML page, indistinguishable
from a static HTML page from a crawler's point of view.

1.1 Spamdexing and Generated Content


While many genuine Web sites generate pages dynamically, this ability to au-
tomatically generate a large number of pages is also appealing for spamdexing.
By analogy with email spam, the word spamdexing designates the techniques
used to mislead search engines' ranking algorithms and raise a Web site to a
higher-than-deserved rank in search engines response lists. Indeed, Fetterly
et al. [2004] point out that “the only way to effectively create a very large num-
ber of spam pages is to generate them automatically.”
This maze of fake Web pages is called a link farm. When those pages are all
hosted under a few domains, the detection of those domains can be a sufficient
countermeasure for a search engine, but this is not an option when the link
farm spans hundreds or thousands of different hosts—for instance, using word
stuffed new domain names, or buying expired ones [Gyöngyi and Garcia-Molina
2005].
One would like to be able to detect all pages generated using the same method
once a spam page is detected in a particular search engine response list. A direct
application of such a process is to enhance the efficiency of search engines
blacklists by “spreading” detected spam information to find affiliate domains,
as suggested in Gyöngyi et al. [2004].

1.2 Detecting Spam Web Pages


We see the problem of spam detection in a search engine back office process as
twofold:
—pinpoint sets of dubious pages in a large uncategorized corpus;
—detect new instances of already encountered spam (through editorial review
or automatic classification methods).
These facets concern, respectively, the recall and the precision of spam
detection. The first especially requires finding useful features to model
spam, while the other also leans on a good definition of similarity between
web pages. This similarity can be based on hyperlinks [Benczúr et al. 2006;

Castillo et al. 2007] or content. In this paper we focus on nontextual content similarity.

1.2.1 Detecting Spam by Similarity. Semantic text similarity detection
usually involves word-based features, such as in email Bayesian filtering. From
these features a statistical model tries to detect texts that are similar to known
examples of spam. Such word-based features tend to model the text topic: for
example “exiled presidents” and “energizing sex drugs” are recurrent topics in
email spam. In the case of the Web, even if many pages about “free ringtones” or
“naked girls” come from deceiving or overly optimized sites, the topic cannot be
the only criterion to decide if a Web site is using spamming techniques [Castillo
et al. 2006].
Some spam techniques like honey pots are based on text plagiarism: the honey
pot technique consists of mirroring a reputed Web site to introduce sneaky
links in its HTML code. This strong syntactic similarity is detected by semi-
duplicate detection algorithms [Broder et al. 1997; Heintze 1996; Charikar
2002]. Another form of plagiarism consists of building fake content by stitching
together parts of text collected from several other Web sites. To counter this
form of spam, Fetterly et al. [2005] proposed to use a phrase-level syntactic
clustering.
The use of word-based features is not always relevant because even if the
spam pages often share the same generation method, they rarely share the
same vocabulary [Westbrook and Greene 2002]. Furthermore, automatically
generated link farm pages tend to use large dictionaries in order to span a
large number of possible requests [Gyöngyi and Garcia-Molina 2005]—hence,
using common text filtering methods with this kind of Web spam fails to catch
a significant proportion of positive instances.
To detect similarity based on page generation method, one needs to use fea-
tures more closely related to the internal structure of HTML documents. Meyer
Zu Eissen and Stein [2004] propose to use HTML-specific features like average
number of <P> tags or average number of anchor links along with text and word
statistics to classify web documents between predefined genres like FAQ, shop,
or news. To detect generating schemes rather than genres, a less supervised
approach is required.

1.2.2 HTML, Stylometry, and Templates. What best describes the relationship
between pages generated using the same template or method is a matter of
style rather than topic. This relates our problem to the field of stylometry.
Up to now, stylometry has mostly been associated with
authorship identification, to deal with problems such as attributing Shake-
speare’s plays to the right author, or to detect computer software plagiarism
[Gray et al. 1997]. Usual metrics in stylometry are mainly based on word counts
[McEnery and Oakes 2000], but also sometimes on nonalphabetic features such
as punctuation. In the area of web spam detection, Westbrook and Greene
[2002], Ntoulas et al. [2006], and Lavergne [2006] propose to use lexicometric
features to classify the part of web spam that does not follow regular language
metrics.

[Figure 1: pipeline diagram: html files → 3.1 Preprocessing → 3.2 LSH Fingerprinting → 3.3 Clustering → clusters + spam tags → 3.4 tag spreading → enhanced tagging.]

Fig. 1. The style-based spam detection framework: all computing steps are described in Section 3.

The specific problem of template detection has been well studied. For instance
Bar-Yossef and Rajagopalan [2002] and more recently Chakrabarti et al. [2007]
proposed Document Object Model (DOM-tree) based algorithms to remove tem-
plates from HTML files. To circumvent the heavy computational cost of DOM-tree
extraction, Chen et al. [2006] proposed a “flat” algorithm to detect and remove
template blocks from HTML. All these papers consider template structure as
noise.

1.3 Overview of This Article


This article is an extended version of our AIRWeb 2006 work [Urvoy et al. 2006].
We use a more flexible clustering algorithm and a more efficient spam detection
strategy (tag-spreading).
The main principle remains to cluster Web pages generated using similar
tools, scripts or HTML-templates to enhance the quality of web spam detection.
We first review syntactic similarity measures and large scale, fingerprint-based
clustering in Section 2. We then detail the specificities of our spam-detection
framework in Section 3 (an overview of this framework is given in Figure 1).
The experimental results of Urvoy et al. [2006] were based on our own (mostly
French) dataset. These are described in Section 4. The WEBSPAM-UK2006 tagged
dataset [Castillo et al. 2006] allowed us to perform more advanced experiments
that are described in Section 5.

2. PRELIMINARIES ON SIMILARITY AND CLUSTERING


Since our approach to spam detection is based on similarity, we first review
the standard text similarity measures.

2.1 Similarity Measures


The first step before comparing documents is to extract their (interesting) con-
tent: this is the preprocessing step.
The second step is to transform this content into a suitable model for compar-
ison (except for string-edit based distances like Levenshtein and its derivatives
where this intermediate model is not mandatory). Usually the documents are
split up into parts. Depending on the expected granularity, these parts may be
sequences of letters (n-grams), words, sequences of words, sentences or para-
graphs. Parts may overlap or not.
The most important ingredients for the quality of comparison are the
preprocessing step and the parts granularity, but for a given preprocessing
and a given granularity there are many flavors of similarity measure [Van Ri-
jsbergen 1979; Zobel and Moffat 1998]. We present here the most common
ones.
When frequency is not important, as in plagiarism detection, the documents
are represented by sets of parts. The most commonly used set-based similarity
measure is the Jaccard index: for two sets of parts D1, D2:

    Jaccard(D1, D2) = |D1 ∩ D2| / |D1 ∪ D2|.
Variants may be used for the normalizing factor, such as in the inclusion
coefficient:

    Inc(D1, D2) = |D1 ∩ D2| / min{|D1|, |D2|}.
When frequencies or weights of parts in documents are important, it is more
convenient to represent documents as high-dimension sparse vectors. The most
common vector similarity is the cosine index:

    cosine(d1, d2) = (d1 · d2) / (||d1|| · ||d2||).
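To make these measures concrete, here is a minimal Python sketch (not from the paper; the part granularity n = 8 is an arbitrary choice for illustration):

```python
from collections import Counter
from math import sqrt

def ngrams(text, n=8):
    """Split a document into overlapping character n-gram parts."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def jaccard(d1, d2):
    """Set-based Jaccard index over two lists of parts."""
    s1, s2 = set(d1), set(d2)
    return len(s1 & s2) / len(s1 | s2)

def inclusion(d1, d2):
    """Inclusion coefficient: intersection over the smaller set."""
    s1, s2 = set(d1), set(d2)
    return len(s1 & s2) / min(len(s1), len(s2))

def cosine(d1, d2):
    """Cosine index over part-frequency vectors."""
    v1, v2 = Counter(d1), Counter(d2)
    dot = sum(v1[p] * v2[p] for p in v1.keys() & v2.keys())
    return dot / (sqrt(sum(c * c for c in v1.values())) *
                  sqrt(sum(c * c for c in v2.values())))
```

Whether parts overlap, and whether frequency is kept (cosine) or discarded (Jaccard, inclusion), are the main design choices discussed above.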
The pairwise calculation of similarities is interesting for fine comparison
inside a small set of documents, but the quadratic explosion induced by such
a brute-force approach is unacceptable at the scale of a Web search engine. To
circumvent this explosion, more powerful methods are required. These methods
are described in the next sections.

2.2 LSH Fingerprints


A locality sensitive hash (LSH) function [Indyk and Motwani 1998] is a hashing
function where the probability of collision is high for similar documents and
low for nonsimilar ones. More formally, if D is a set of documents and sim :
D × D → [0, 1] is a given similarity measure, a set of functions F ⊆ ND is an

LSH scheme for a similarity function sim if, for any pair d1, d2 ∈ D, we have
the probability

    P_{h∈F}[h(d1) = h(d2)] = sim(d1, d2).

We can build an estimator for the similarity by gluing together independent
LSH functions. Let sim_H(x, y) be the number of equal coordinates between two
vectors x, y, that is, sim_H(x, y) = |{i | x_i = y_i}|. Let h1, ..., hm be m
independent LSH functions and let H be the mapping defined by
H(d) = (h1(d), ..., hm(d)); then

    Sim(d1, d2) ≈ sim_H(H(d1), H(d2)) / m.
To avoid confusion we call LSH fingerprint the reduced vector H(d), and LSH
keys the hash values hi(d) for 1 ≤ i ≤ m. We present here two standard
algorithms for building LSH fingerprints and their application to large-scale
similarity clustering. Other known similarity estimation algorithms are
modsampling, proposed by Manber [1994], and winnowing, by Schleimer et al. [2003].

2.2.1 Broder MinHashing. To our knowledge, the oldest references about
minhashing (also called minsampling) are Heintze [1996], Broder [1997], and
Broder et al. [1997].
As described in Section 2.1, each preprocessed document is split up into
parts. Let us call P the set of all possible parts. The main principle of
minsampling over P is to fix at random a linear ordering on P (call it ≺) and
represent each document D ⊆ P by its lowest element according to ≺:
h≺(D) = min≺(D).
If ≺ is chosen at random, then for any pair of documents D1, D2 ⊆ P, we have
the LSH property:

    P≺[h≺(D1) = h≺(D2)] = Jaccard(D1, D2).
If we consider m independent linear orderings ≺i, and define H by
H(D) = (min≺1(D), ..., min≺m(D)), then

    Jaccard(D1, D2) ≈ sim_H(H(D1), H(D2)) / m.
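A minimal Python sketch of minhash fingerprinting, under simplifying assumptions: blake2b stands in for the part hash, and a simple XOR mask plays the role of each random ordering (the paper's actual permutation family, a bit shuffle composed with an XOR mask, is described in Section 3.2.1):

```python
import hashlib
import random

def part_hash(part):
    """64-bit hash of one part (a stand-in, not the paper's Jenkins hash)."""
    return int.from_bytes(hashlib.blake2b(part.encode(), digest_size=8).digest(), "big")

def minhash_fingerprint(parts, m=64, seed=0):
    """For each of m random orderings of the part space, keep the minimal
    permuted hash value. XOR with a random mask is a bijection of [2^64]
    used here as a simplistic random ordering."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(m)]
    hashes = {part_hash(p) for p in parts}
    return [min(h ^ mask for h in hashes) for mask in masks]

def estimated_jaccard(f1, f2):
    """Fraction of equal keys estimates the Jaccard index of the part sets."""
    return sum(a == b for a, b in zip(f1, f2)) / len(f1)
```

Identical part sets always yield identical fingerprints; disjoint part sets agree on a key only when two part hashes collide.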
2.2.2 Charikar Fingerprints. Charikar [2002] introduces a new approach
to locality sensitive hashing. This method partitions the vector space according
to random hyperplanes and estimates the cosine distance.
Let r be the characteristic vector of a random hyperplane (each coordinate
following a centered normal distribution). Given a weighted vector di
representing a document, we define the hashing function hr by:

    hr(di) = 1 if r · di > 0,  and 0 if r · di ≤ 0.
This implies the following LSH property:

    Pr[hr(d1) = hr(d2)] = 1 − θ(d1, d2)/π,

[Figure 2: scatter plot of estimated similarity vs. full-text similarity for the series "Jaccard vs MinHashing" and "Cosine vs Charikar".]

Fig. 2. A rough comparison of real similarity and its 256-bit LSH estimation for 400 similarity
pairs picked in a set of 1000 HTML files.

where θ(d1, d2) is the angle between d1 and d2. This value is mapped to the
cosine similarity by:

    cosine(d1, d2) = cos(π · (1 − Pr)).
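The random-hyperplane scheme can be sketched as follows (an illustration, not the paper's implementation: each coordinate of r is drawn from a Gaussian seeded per hyperplane and dimension, and m = 64 is an arbitrary fingerprint size):

```python
import math
import random

def charikar_fingerprint(weights, m=64):
    """One bit per random hyperplane: the sign of the projection of the
    document vector onto a random Gaussian direction r."""
    bits = []
    for i in range(m):
        proj = 0.0
        for part, w in weights.items():
            # Deterministic Gaussian coordinate of r for (hyperplane i, dimension part).
            proj += w * random.Random(f"{i}:{part}").gauss(0.0, 1.0)
        bits.append(1 if proj > 0 else 0)
    return bits

def estimated_cosine(f1, f2):
    """The agreement rate estimates 1 - theta/pi; map it back to the cosine."""
    agree = sum(a == b for a, b in zip(f1, f2)) / len(f1)
    return math.cos(math.pi * (1.0 - agree))
```

Note that scaling a document vector does not change any projection sign, so two proportional vectors get identical fingerprints, as the cosine similarity requires.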

2.3 Clustering with LSH Fingerprints


When working on a large volume of data, one would like to group together
the documents which are similar enough according to the chosen similarity:
we want to compute a mapping C associating to each document d its class
representative C(d), with C(d) = C(d′) if and only if sim(d, d′) is higher than
a given threshold.
The first benefit of using fingerprints for this task is to reduce the size of
document representatives, allowing all computation to be performed in memory.
As illustrated by Figure 2, this reduction by sampling comes at the cost of a
small loss of quality in the similarity estimation.
Another important benefit provided by fingerprints is the low dimensional
representation of documents. It becomes possible to compare only the docu-
ments that match at least on some dimensions. This is a way to build the sparse
similarity matrix with some control over the quadratic explosion induced by the
biggest clusters. Our clustering algorithm, detailed in Section 3.3, exploits this
property.

3. THE HSS FRAMEWORK


We propose to use specific document preprocessors to exclude as much as possi-
ble the textual content and keep the HTML “noise” instead. These preprocessors
are described in Section 3.1. As illustrated by Figure 3, by keeping usually ne-
glected features of HTML like extra-spaces, line feeds, or tags, we are able to
model the “style” of HTML generating tools.

Fig. 3. To detect that two pages were generated by the same script, one must consider both visual
and hidden features of HTML. For example, some scripts always insert a line break after writing
height=“2”><font.

The HTML footprint is split up into wide overlapping sequences of characters
from which LSH fingerprints are computed. In Section 3.3 we propose a clustering
algorithm which allows a flexible trade-off between brute force one-to-one com-
parison and fast approximate clustering. In Section 3.4 we propose different
uses of this clustering to detect Web spam.

3.1 Preprocessing
3.1.1 Splitting Content and Noise. The usual procedure to compare two
documents is to first remove everything that does not reflect their content. In
the case of HTML documents, this preprocessing step may include removing
tags, extra spaces and stop words. It may also include normalization of words
by capitalization or stemming.
In order to reflect similarity based on form (and more specifically tem-
plates) rather than content, we propose to apply the opposite strategy: keep
only the “noisy” parts of HTML documents by removing any usual informa-
tional content, and then compute document similarity based on the filtered
version.

3.1.2 A Collection of Parsers. We propose below several document
preprocessors to achieve this goal of removing content from HTML documents.

—html alpha preprocessor (denoted hss as in Urvoy et al. [2006]) filters out all
alphanumeric characters from the original document, and keeps everything
else. We expect to be able to model the “style” of HTML documents through
the use of those neglected features of HTML texts like extra spaces, line
feeds, or tags.
—html alpha var spaces (denoted hss-varsp) is a variant of the former one,
where repeated blank spaces are squeezed into a single one. This intends
to smooth differences that may arise in output of documents using the same
templates but a varying number of words in content parts. Should the text be
very large, a large part of the html alpha output would otherwise be composed
of the spaces used to separate words, reducing the impact of the true template
part in the preprocessed document.
—html tags (denoted tags) applies a straightforward method to extract the for-
matting skeleton from the page: it filters anything but HTML tags and at-
tributes from the original document. Content between tags is ignored, as well
as comments and javascript.
—html tags and alpha (denoted tags-alpha) mixes the approaches of html
alpha and html tags. HTML tags and attributes are kept in the output,
as well as any nonalphanumeric characters between tags and in javascript
chunks.

Two more preprocessors are used in order to compare those strategies with
more standard filters:
—html to words (denoted words) outputs every alphanumeric character outside
of tags;
—full (denoted full) outputs the initial document unmodified.
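As a rough illustration of these filters, the following regex-based sketches approximate four of the preprocessors (hypothetical simplifications: the real parsers also handle comments, javascript, and attribute values more carefully):

```python
import re

def hss(html):
    """html alpha: drop alphanumeric characters, keep everything else."""
    return re.sub(r"[A-Za-z0-9]+", "", html)

def hss_varsp(html):
    """html alpha var spaces: like hss, but squeeze runs of blanks."""
    return re.sub(r"[ \t]+", " ", hss(html))

def tags(html):
    """html tags: keep tag and attribute names only (very simplified)."""
    out = []
    for tag in re.findall(r"<[^>]*>", html):
        tag = re.sub(r'="[^"]*"', " ", tag)        # drop attribute values
        out.append(re.sub(r"[^A-Za-z0-9<>/ ]", " ", tag))
    return "".join(out)

def words(html):
    """html to words: keep only the text outside of tags."""
    return " ".join(re.sub(r"<[^>]*>", " ", html).split())
```

For example, on `<a href="x.html">Hello 42</a>`, hss keeps `< ="."> </>`, tags keeps `<a href ></a>`, and words keeps `Hello 42`.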

A comparison of the output of these preprocessing procedures with respect to
the task of categorizing web spam pages is presented in Section 5.
The following example shows the head output of the different parsers for the
http://www.mozilla.org/ URL:

WORDS (words)
Mozilla org Home of the Mozilla Project Skip to main content Mozilla About ...
FULL (full)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">

<head>
<title>Mozilla.org - Home of the Mozilla Project</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="keywords" content="web browser, mozilla, firefox, camino, thunderbird, ...">

<link rel="stylesheet" type="text/css" href="css/print.css" media="print">


<link rel="stylesheet" type="text/css" href="css/base/content.css" media="all">


HSS ALPHA REMOVAL (hss)


<! "-//// .//" "://..///.">
< ="">

<>
<>. - </>
< -="-" ="/; =-">
< ="" =" , , , , , , , ">

< ="" ="/" ="/." ="">

HSS ALPHA REMOVAL VAR SPACE (hss-varsp)


<!"-////.//" "://..///.">
<="">

<>
<>.- </>
<-="-" ="/; =-">
<="" =", , , , , , , ">

<="" ="/" ="/." ="">

TAGS (tags)
<DOCTYPE HTML PUBLIC W3C DTD HTML 4 01 EN http www w3 org TR html4 strict dtd ><html lang >...
...<head><title></title><meta http equiv content ><meta name content><link rel type href media
>...

TAGS AND ALPHA REMOVAL (tag-alpha)


<DOCTYPE HTML PUBLIC W3C DTD HTML 4 01 EN http www w3 org TR html4 strict dtd >
<html lang >

<head>
<title>. - </title>
<meta http equiv content >
<meta name content >

<link rel type href media >

3.2 Fingerprinting
As mentioned earlier, using a pairwise similarity measure does not fit large
scale clustering. Using LSH fingerprints on the preprocessing output is required
to address the scalability issue. Some technical details may significantly affect
the quality of the similarity estimation.

3.2.1 Fingerprinting Implementation Details. The parts of text we consider
are 32-character-wide overlapping sequences that we hash into 64-bit integers.
Hence, a text of length n + 31 is modeled by a set of n hash values. This large
size avoids false positives and provides a sentence-level description of
documents, as in Fetterly et al. [2005].
To hash subsequences, we use a variant of Jenkins’s hash function [Jenkins
1997] which achieves a better collision rate than Rabin fingerprints.
For minhashing we precompute m = 64 permutations σi : [2^64] → [2^64] and
compare the permuted values:

    x ≺i y ⇔ σi(x) < σi(y).
To compute these permutations, we use a sub-family of permutations of the
form σi = σi1 ◦ σi2 where σi1 is a bit shuffle and σi2 (x) is an exclusive OR
mask. To reduce the fingerprint size to 256 bits, we keep only four bits per
key.
For Charikar’s algorithm, instead of precomputing the random normal dis-
tribution of hyperplane characteristic vector r, we use a linear shuffle and a pre-
computed inverse normal function: let P be a large prime and let inorm : [n] → Z
be a discretization of the inverse normal law; then

    ri := inorm(iP mod n).
From 256 random primes we get a 256 bit fingerprint. This implementation
slightly differs from the one used by Henzinger [2006] where the ri values are
restricted to {−1, +1}.
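The shingling and permutation machinery described in this section might look as follows in Python (an illustration under stated assumptions: blake2b replaces the Jenkins hash variant, and the permutation is a bit shuffle composed with an XOR mask acting on [2^64]):

```python
import hashlib
import random

def shingle_hashes(text, width=32):
    """Hash each overlapping width-character window into a 64-bit integer
    (blake2b is a stand-in for the Jenkins hash variant used in the paper)."""
    return {int.from_bytes(hashlib.blake2b(text[i:i + width].encode(),
                                           digest_size=8).digest(), "big")
            for i in range(max(1, len(text) - width + 1))}

def make_permutation(seed):
    """sigma = bit shuffle composed with an XOR mask: a permutation of [2^64]."""
    rng = random.Random(seed)
    positions = list(range(64))
    rng.shuffle(positions)
    mask = rng.getrandbits(64)
    def sigma(x):
        y = x ^ mask                        # exclusive OR mask (bijective)
        return sum(((y >> src) & 1) << dst  # bit shuffle (bijective)
                   for dst, src in enumerate(positions))
    return sigma
```

Both components are bijections of [2^64], so their composition is a valid permutation; a minhash key is then min(sigma(h)) over the shingle hashes of a document.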

3.2.2 Combination of Preprocessing and Fingerprinting. As explained in
Section 2.2, Broder minhashing provides an estimation of the Jaccard index
[Broder et al. 1997] while Charikar fingerprinting provides an estimation of
the cosine similarity [Charikar 2002]. The former similarity measure is based
only on intersection while the latter also includes frequency information.
In order to evaluate the most relevant strategy for HTML “style” similarity
we combine the six preprocessors defined before with these two algorithms to
get twelve kinds of LSH fingerprints. We write “B-words” for Broder fingerprints
using the words preprocessor, “C-full” for Charikar fingerprints combined with
the full preprocessor, and so on for the other preprocessors.
A rough estimation of correlation between these twelve similarities is given
by Figure 4. As expected, minhashing similarities and Charikar similarities
are strongly dependent for a given preprocessing. As expected also, all variants
of HTML preprocessing except words are dependent. A more interesting result
is the correlation between full and style similarities. This can be explained by
the low visible text ratio in HTML (around 18% for the WEBSPAM-UK2006 dataset).
We show in Section 5 that style similarities provide a significantly better recall
than full HTML for Web spam detection.

3.3 Multi-Sort Sliding Window Clustering


We use our own algorithm for clustering. It does not build the entire similarity
graph; instead, it uses permuted lexical sorts to find potentially similar pairs. As
in Broder et al. [1997], the similarity clusters are the connected components of
the sampled graph.
By thresholding the full similarity matrix, we obtain a nonoriented similarity
graph:

    G = {(d1, d2) ∈ D × D | sim(d1, d2) > threshold}.
This similarity graph is characterized by its quasi-transitivity property: if
xGy and yGz then there is a high probability that xGz. In other words, the
connected components of the graph are almost cliques. This quasi-transitivity
property is helpful: by sampling the most similar edges, we both accelerate the
clustering process and reduce the probability of a false positive instance in
LSH similarity estimation (cf. Figure 5).

[Figure 4: correlogram over the twelve similarities B-words, C-words, B-full, C-full, B-hss, C-hss, B-tags, C-tags, B-tags-alpha, C-tags-alpha, B-hss-varsp, C-hss-varsp.]

Fig. 4. Correlogram between the similarity measures on a sample of 10^5 pairs of HTML documents.
Most of these sampled pairs were dissimilar. Darker colors imply higher correlation.

To sample the edges, we sort fingerprints lexically according to a given bit
permutation and check locally, in a sliding window, for similarity pairs.
Figure 6 describes the impact of the sliding window size on clustering quality: a
large window retrieves more similarity edges at the cost of a slower algorithm.
This operation may be repeated several times with random permutations to
avoid “unlucky” permutations (similar pairs with short common prefixes). The
same heuristic is used by Bawa et al. [2005] to avoid “unlucky” B-trees to search
the nearest neighbors.
As described in Figure 7, in practice we use two parallel sorts (prefix and suf-
fix). With these two sorts we get a good compromise to cluster 3M fingerprints
with a relatively narrow (4 entries) sliding window.
Depending on the architecture, the sliding window edge detection processes
may also be serialized. In the serialized version, redundant similar fingerprints
may be removed from the stream to reduce the cost of the later sorts.
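The whole pipeline (one permuted sort, a sliding window that samples candidate edges, then connected components) can be sketched as follows; the window and threshold values are illustrative, while the paper uses 64 keys, a 40/64 threshold, and a 4-entry window:

```python
def sliding_window_edges(fingerprints, sort_key, window=4, threshold=40):
    """Sort documents by a permuted view of their fingerprint, then compare
    only documents falling inside the same sliding window."""
    order = sorted(fingerprints, key=lambda doc: sort_key(fingerprints[doc]))
    edges = set()
    for i, d1 in enumerate(order):
        for d2 in order[i + 1:i + window]:
            agree = sum(a == b for a, b in zip(fingerprints[d1], fingerprints[d2]))
            if agree >= threshold:
                edges.add((d1, d2))
    return edges

def connected_components(nodes, edges):
    """Union-find: similarity clusters are the connected components."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n
    for a, b in edges:
        parent[find(a)] = find(b)
    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())
```

The prefix sort corresponds to `sort_key=lambda f: f` and the suffix sort to `sort_key=lambda f: f[::-1]`; the union of the two edge sets feeds `connected_components`.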

3.4 Tracking Web Spam with Similarity Clusters


The different clusterings obtained are interesting both to detect new spam and
to enhance the quality of a spam detection obtained from other sources (through
editorial review or automatic classification methods). We call the former
technique features extraction, and the latter tag spreading. The experiments
described in Section 4 are mostly oriented toward features extraction, while the
WEBSPAM-UK2006-based experiments of Section 5 also include tag spreading.

[Figure 5: complete similarity graph with components labeled (1), (2), (3).]

Fig. 5. The quasi-transitivity of the estimated similarity relation is well illustrated by this complete
similarity graph (realized from 3000 HTML files). With a transitive relation, each connected component
would be a clique. For “bunch of grapes” components like (1) and (2), there is a low probability
for connecting edges to be sampled. An edge density measure may be employed to detect “worm”
components like (3).

3.4.1 Features Extraction. From a given clustering we extract several simple
features. These features are only derived from clusters and straightforward
URL properties:

—number of URLs, hosts, and domains by cluster;
—number of clusters by host or by domain.

If we also consider the underlying graph or fingerprints, more features can
be extracted:

—edge density by cluster;
—inter-host edge density by cluster;
—mean similarity and standard deviation.

For example, in Section 4 we combine mean similarity and domain counts
in B-hss clusters to detect potential spam pages. All these features are simple
and relevant for Web spam detection, but the most efficient use of similarity
clusterings is to smooth external features.
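A sketch of the URL-derived cluster features listed above (the two-label domain extraction is a naive assumption for illustration; real pipelines use a public-suffix list):

```python
from urllib.parse import urlparse

def cluster_features(cluster_urls):
    """Per-cluster counts of URLs, hosts, and domains, from URL strings alone."""
    hosts = {urlparse(u).hostname or "" for u in cluster_urls}
    # Naive domain: last two dot-separated labels of the host (an assumption).
    domains = {".".join(h.split(".")[-2:]) for h in hosts}
    return {
        "n_urls": len(cluster_urls),
        "n_hosts": len(hosts),
        "n_domains": len(domains),
    }
```

A cluster with many URLs spread over many hosts but few domains is a typical signal worth inspecting.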

[Figure 6: edge recall (% retrieved similarity pairs) vs. window aperture (% of keys number).]

Fig. 6. Edge recall according to relative window aperture for 10^5 keys and a similarity threshold
of 40/64: even a small window allows us to retrieve most similarity edges in one pass.

[Figure 7: LSH fingerprints → prefix sort and suffix sort → sliding window edge detection → similarity edges → connected components computation → url clusters.]

Fig. 7. Our clustering process.

3.4.2 Tag Spreading. As described in Figure 1, we consolidate a set of tags
by spreading them into consistent clusters. To evaluate the consistency of
clusters regarding the tags, we measure the prevalence of a label (spam or
normal) in a given cluster. This consistency measure is defined by:
    consistency = ((spam − normal) / (spam + normal))^2.
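The consistency measure and the resulting propagation rule can be sketched as follows (the 0.8 consistency threshold is an illustrative value, not the paper's):

```python
def consistency(spam, normal):
    """((spam - normal) / (spam + normal))**2: 1.0 when one label fully
    dominates the cluster, 0.0 when the cluster is perfectly balanced."""
    if spam + normal == 0:
        return 0.0
    return ((spam - normal) / (spam + normal)) ** 2

def spread_tags(cluster_labels, min_consistency=0.8):
    """Propagate the majority label to the whole cluster when it is
    consistent enough; otherwise propagate nothing."""
    spam = sum(1 for l in cluster_labels if l == "spam")
    normal = sum(1 for l in cluster_labels if l == "normal")
    if consistency(spam, normal) >= min_consistency:
        return "spam" if spam > normal else "normal"
    return None  # balanced cluster: unreliable, do not propagate
```

Untagged pages simply do not count toward either label, so a cluster with one spam tag and no normal tags is maximally consistent.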

Fig. 8. Spam/normal labels in an HTML style similarity graph. This graph was built with the
B-hss-varsp similarity on a sample of 10^5 HTML files from WEBSPAM-UK2006. Each node represents
a host and each edge represents a high similarity between pages in different hosts. Black nodes
indicate spam labels and white nodes indicate normal labels; others are unknown. The consistency
of clusters is better than in Figure 9. The highly connected clusters are symptomatic of link farms
or mirrors.

We observe in Figure 8 that most host-clusters contain only one kind of label.
In such a case, a high consistency means that the majority label is really
prevalent over the other: the label can be propagated to the whole cluster. On
the other hand, a low consistency indicates a balanced amount of each label,
and we cannot rely on such a cluster to propagate one label or the other. The
big central cluster of Figure 9 is a perfect example of such indecision.
A straightforward approach is to propagate a reliable human evaluation. As
discussed in the WEBSPAM-UK2006-based experiments of Section 5, this approach
is efficient for big clusters (most of the time these are link farms), but since
the majority of URLs are in small or medium clusters, the recall remains too
low.
A more efficient way to use the small and medium clusters is to train a clas-
sifier on features available for all hosts or web pages. Such content-based and

Fig. 9. spam/normal labels in a words similarity graph. This graph was built with the B-words
similarity on a sample of 10^5 HTML files from WEBSPAM-UK2006. Each node represents a host and
each edge represents a high similarity between pages in different hosts. Black nodes indicate spam
labels and white nodes indicate normal labels; others are unknown. The small pure clusters indicate
"spam-full" and "spam-free" topics. Most hosts are in the biggest cluster: spam-independent
topics.

link-based features relevant to the prediction of the spam status of a URL
or a host were proposed for the Web Spam Challenge 2007. These features
are well described in Castillo et al. [2007]. Even if the classifier precision is
low, the resulting prediction is enhanced by the clusters' consistency. A sim-
ilar smoothing method is employed for link-based clusters in Castillo et al.
[2007].
In our first experiment, described in the next section, we mostly consider cluster-
based feature extraction. We experiment with tag spreading in Section 5.

4. EXPERIMENTS ON A FRENCH DATASET


In this section we discuss the results of an experiment done in 2006 on our own
dataset. These results were presented at the AIRWeb 2006 workshop [Urvoy
et al. 2006]. This experiment is based on Broder et al. [1997] minhashing using
the HTML alpha preprocessor (called B-hss in Section 3.1).

[Figure 10: plot of the 128-key HSS-similarity to a franao.com reference page,
for all rows sorted by decreasing similarity (x-axis: row x 1000). Labeled
regions: franao.com template 1 (large), template 1 (small), template 2, other
sites; gaps 1-3 marked.]

Fig. 10. By sorting all HTML documents by decreasing similarity with one reference page (here
from franao.com) we get a curve with different gaps. The third gap (around 180,000) marks the
end of the franao web site.

4.1 Dataset
We used a corpus of five million HTML pages crawled from the Web. This corpus
was built by combining three crawl strategies:

—a deep crawl of 3 million documents from 1300 hosts of the DMOZ directory;
—a flat crawl of 1 million documents from the Orange French search engine
blacklist (with a strong proportion of adult content);
—a deep breadth-first crawl from 10 nonadult spam URLs (chosen in the black-
list) and 10 trustworthy URLs (mostly from French universities).

At the end of the process, we estimated roughly that 2/5 of the collected docu-
ments were spam.
After fingerprinting, the initial volume of 130 GB of data to analyze was
reduced to 390 MB, so the next step could be performed in main memory.

4.2 One-to-All Similarity


The comparison between one HTML page and all other pages of our test base
is a way to estimate the quality of the B-hss similarity. If the reference page comes
from a known Web site, we are able to judge the quality of B-hss similarity as a
Web site detector. In the example considered here, franao.com (a web directory
with many links and many internal pages), a threshold of 20/128 gives a franao
Web site detector which is not based on URLs. On our corpus, this detector is
100% correct according to URL prefixes.
By sorting all HTML documents by decreasing similarity to one randomly
chosen page of the franao web site, we get a decreasing curve with different gaps
(Figure 10). These gaps are interesting to consider in detail:

—the first 20,000 HTML pages are long lists of sponsored links, all built using
the same template (template 1);
—around 20,000 there is a smooth gap (1) between long-list and short-list pages,
but up to 95,000, the template is the same;
—around 95,000, there is a strong gap (2) which marks a new template: a list of
80 random internal links spanning four columns (template 2); up to 180,000,
all pages are internal franao pages built according to template 2;
—around 180,000 there is a strong gap (3) between franao pages and other
web sites' pages.
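The gap locations above could be found automatically by scanning the sorted similarity curve for sharp drops. A toy sketch, where the `min_drop` value is illustrative and not a parameter from the paper:

```python
def find_gaps(scores, min_drop=10):
    """Indices where a decreasing similarity curve drops sharply (Fig. 10 style).

    `scores` must be sorted in decreasing order; a gap is any step whose
    drop is at least `min_drop` similarity keys."""
    return [i for i in range(1, len(scores))
            if scores[i - 1] - scores[i] >= min_drop]

# Toy curve: a plateau, one sharp drop, another plateau -> one gap at index 3.
curve = [120, 118, 117, 60, 58, 57]
gaps = find_gaps(curve)
```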

4.3 Global Clustering


Not all websites have a site-level footprint as clear as that of the franao
example. The link farm "4you" (get4you, get4me, goto4me, . . . ) from WEBSPAM-
UK2006, for instance, is made of two completely distinct styles of pages spread
across many hosts. To ensure a low level of false positives and to cluster the
whole corpus by templates with a reasonable recall, we progressively raised the
threshold.
The following parameters were chosen at the end:
—size of text parts: 32,
—fingerprint size (in bytes): m = 128,
—similarity threshold: t = 35/128.
With a similarity score of at least 35/128, the number of misclassified URLs
seems negligible, but some clusters are split into smaller ones.
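The Broder-style score underlying this threshold counts agreeing min-hash keys out of m = 128. A minimal sketch, assuming md5-based keyed hashing over the extracted text parts (the real system uses its own hash functions):

```python
import hashlib

M = 128  # number of min-hash keys (fingerprint size)

def minhash_fingerprint(parts):
    """One min-hash per seed over the document's preprocessed text parts."""
    return [min(int(hashlib.md5(f"{seed}:{p}".encode()).hexdigest(), 16)
                for p in parts)
            for seed in range(M)]

def similarity_score(fp_a, fp_b):
    """Number of agreeing keys; two pages are clustered together when this
    reaches the threshold t = 35 (out of 128)."""
    return sum(a == b for a, b in zip(fp_a, fp_b))
```

Identical documents score 128, while unrelated documents score near 0, well below the 35/128 threshold.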
We obtained 43,000 clusters with at least 2 elements. Table I shows the
first 20 clusters sorted by highest mean similarity × domain count (for the sake
of readability, some mirror clusters have been removed from the list).
In order to evaluate the quality of the clustering, the first 50 clusters, as
well as 50 other randomly chosen ones, were manually checked, showing no
misclassified URLs.
Most of the resulting clusters belong to one of these classes:
(1) Style similarity
(a) Normal template: These clusters group HTML pages from dynamic web
sites using the same skeleton. Cluster #2 of Table I is a perfect example:
it groups all forum pages generated using the PhpBB open source project.
Cluster #3 is also interesting: it is populated by Apache default
directory listings;
(b) Link farm and Made for Ads: These contain numerous computer-generated
pages, based on the same template and containing several hyperlinks;
(2) Semi-duplicate
(a) Mirror clusters contain sets of near-duplicate pages hosted under
different servers. Generally only minor changes were applied to the copies,
like adding a link back to the server hosting the mirror;

Table I. Clusters with Highest Mean Similarity and Domain Count

  URLs   Dom.  Mean sim.  Prototypical Member (centroid)               Class
   268    231      1.00   www.9eleven.com/index.html                   Copy/Paste
 93148    313      0.58   www.les7laux.com/hiver/forum/phpBB2/...      Template (Forums)
  3495    255      0.33   www.orpha.net/static/index.html              Template (Apache)
   966    174      0.40   www.asliguruney.com/result.php?Keywo...      Link farm
   122     91      0.74   anus.fistingfisting.com/index.htm            Copy/Paste
  1148    173      0.38   www.basketmag.com/result.php?Keywords...     Link farm
 19834    164      0.40   www.series-tele.fr/index.html?mo=s t...      Template
   122     55      0.91   www.ie.gnu.org/philosophy/index.html         Mirror
   139    101      0.44   www.reha-care.net/home buying.htm?r=p        Link farm
   218    195      0.21   chat.porno-star.it/index.html                Copy/Paste
   177     60      0.67   www.ie.gnu.org/home.html                     Mirror
  2288     44      0.90   www.cash4you.com/insuranceproviders/...      Link farm
626900     70      0.52   animalworld.petparty.com/automotive...       Link farm
   168     96      0.32   www.google.ca/intl/en/index.html             Mirror
   214     61      0.50   shortcuts.00go.com/shorcuts.html             Link farm
 42314    112      0.26   forums.cosplay.com/index.html                Template
   121     63      0.41   collection.galerie-yemaya.com/index.html     Copy/Paste
   555     34      0.68   allmacintosh.digsys.bg/audiomac ra...        Template
   114     77      0.29   www.gfx-revolution.com/search/webarc...      Link farm
   286     60      0.35   gnu.typhon.net/home.sv.html                  Mirror

(b) Copy/Paste clusters contain pages that are not part of mirrors but
do share the same content: either a text (e.g., a license or a porn site legal
warning), a frameset scheme, or the same JavaScript code (often with
little actual content).
The style similarity cluster classes are the most interesting benefit of
B-hss clustering: they allow an easy classification of web pages by manually
categorizing only a few of them.

4.4 Conclusion of This Experiment


This experiment confirms the usefulness of stylometry for detecting similarity
between hosts, and especially for spam detection. The main problem we had here
was the lack of validation for the recall of the method. The WEBSPAM-UK2006
tagged dataset allows us to go further in this direction and to validate our results,
as explained in the next section.

5. UK-2006 EXPERIMENTS

5.1 Dataset
The WEBSPAM-UK2006 reference dataset is a collection based on a crawl of the .uk
domain done by the University of Roma La Sapienza with a large number of
hosts labeled by a team of volunteers. The whole labeling process is described
by Castillo et al. [2006] in their article “A reference collection for Web spam.”
The full dataset contains 77 million Web pages spread over 11,400 hosts.
There are about 5,400 labeled hosts. Because the assessors spent on average
five minutes per host, it makes sense to summarize the dataset by taking only

the first 400 reachable pages of each host. We worked on the summarized
dataset, which contains 3.3 million web pages stored in 8 volumes of 1.7 GB
each.

5.2 Evaluation of Similarity Clustering


To evaluate the ability of our smoothing framework to enhance the efficiency
of a classifier, we want to measure both its ability to:
(1) spread spam labels (recall improvement);
(2) consolidate spam classification (precision improvement).
The ability to spread information is related to the size distribution of clusters,
while the ability to consolidate is related to the consistency of clusters with
regard to labels. To avoid the side effect of choosing a specific classifier, our
experiments are only based on WEBSPAM-UK2006 human labels.
In a first step, we compute statistics (size distribution and consistency) on the
twelve clusterings defined in Section 3.2.2. In a second step, we study by
cross-validation the propagation of spam information through the clusters. We
then evaluate the precision improvement ability of our method, and we conclude
our experimentation with a side remark about mirrors.

5.2.1 Clustering Evaluation. To be efficient for information spreading, a
clustering should contain big and consistent clusters. Big clusters diffuse spam
information to more URLs and increase the spam recall. Consistent clusters
avoid false positives.
The distribution of cluster sizes, as well as the total number of built clusters for
the different clusterings, are summarized in Figure 11. The first observation is
that standard filters like full or words tend to build smaller clusters on average,
and consequently to produce a low improvement of spam recall.
We also note that, for a given preprocessing step, the Charikar fingerprint
produces slightly more clusters than the Broder one.
Cluster size variation for the various clusterings is represented in Figure 11.
In addition, Table II displays consistency measures for different classes of clus-
ter size (per host count). All clusterings build very consistent clusters with
respect to the spamicity of the hosts, with a little loss for the words preprocessor.
These results on cluster consistency suggest that similarity clustering meth-
ods have good precision when tagging new spam.

5.2.2 Evaluation of Tag Spreading. In order to evaluate the diffusion of
labels in the various clusterings, we use cross-validation on the WEBSPAM-UK2006
corpus restricted to tagged hosts. In each experiment, a fraction of the corpus is
used to train the system and predict a class for each of the remaining hosts. The
accuracy of these predictions can be computed using the known "true" labels of
the hosts.
In the search engine context, the cost of false positives (genuine sites erro-
neously predicted as spam and removed from the index) is expected to outweigh
the cost of false negatives (undetected spam sites still in the index); therefore
a high level of precision is mandatory.

[Figure 11: bar chart over the twelve clusterings (Broder and Charikar variants
of each preprocessor) on the x-axis ("Algorithm and preprocessing"), showing
the percent of clusters in each size class (1, 2-3, 4-7, 8-15, 16+) on the left
y-axis and the total number of clusters on the right y-axis.]

Fig. 11. Distribution of cluster size (number of hosts) and number of clusters for each
preprocessing.

To estimate the precision gain, we use each clustering to propagate spam
information. If a cluster has more spam URLs than normal ones and has a
consistency greater than or equal to 0.9, we label as spam each untagged host
of this cluster. Once this step is done for each cluster, we label the remaining
untagged hosts as normal.
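This propagation rule can be sketched as follows; the cluster and label representations are our own illustration:

```python
def propagate_labels(clusters, labels, threshold=0.9):
    """Spread the 'spam' label to untagged hosts of consistent spam-majority
    clusters; everything left untagged afterwards is labeled 'normal'."""
    predicted = dict(labels)
    for hosts in clusters:
        spam = sum(labels.get(h) == "spam" for h in hosts)
        normal = sum(labels.get(h) == "normal" for h in hosts)
        if spam + normal == 0:
            continue  # unreachable cluster: no label to spread
        consistency = ((spam - normal) / (spam + normal)) ** 2
        if spam > normal and consistency >= threshold:
            for h in hosts:
                predicted.setdefault(h, "spam")
    # Remaining untagged hosts default to 'normal'.
    for h in (h for hosts in clusters for h in hosts):
        predicted.setdefault(h, "normal")
    return predicted

# Hypothetical example: a fully spam cluster spreads its label to host "c".
clusters = [{"a", "b", "c"}, {"d", "e"}]
labels = {"a": "spam", "b": "spam", "d": "normal"}
predicted = propagate_labels(clusters, labels)
```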
This evaluation is a bit pessimistic, since an operational classifier would com-
pute better results on the hosts that are not covered by clusters.
Table II shows that, for the different clusterings, the mean consistency of
the clusters is really high. This high consistency should imply a good precision
for tag spreading. This hypothesis is confirmed by Figure 12, which shows the
precision according to the percentage of tagged labels.
Precision reaches very high levels even when the proportion of known labels is
very low. However, there is a major exception with the words preprocessing. This
preprocessing strips all of the text structure to leave only content, and thus badly
distinguishes a trustworthy site from a spam site that plagiarizes it.
The highest precision levels are reached by preprocessors dealing with HTML
style.
We also note that, even if their level stands high, Charikar-based variants
are slightly lower than the corresponding Broder-based ones.
Figure 13 shows the recall according to the known-tags rate. The results
are relatively low. This low score can be explained on the one hand by the
high similarity threshold used, which maximizes precision to the detriment of

Table II. Consistency of the Clusters Depending on the Algorithm, Preprocessing, and Cluster
Size

                           Alg. B                    Alg. C
Preproc.     Size    Mean cons.  Stddev        Mean cons.  Stddev
words        2–3     0.9939      0.077001      0.9764      0.14974
             4–7     0.98411     0.11016       0.97556     0.13342
             8–15    0.96433     0.15218       0.98735     0.084978
             16+     0.98403     0.10645       0.98996     0.1316
full         2–3     0.9956      0.065797      0.99459     0.073127
             4–7     0.99418     0.071929      0.99455     0.069851
             8–15    0.99        0.084263      0.98612     0.10964
             16+     0.9945      0.06451       0.99597     0.056474
hss          2–3     0.99819     0.041969      0.99691     0.055119
             4–7     0.99748     0.046309      0.99669     0.053076
             8–15    0.99567     0.059022      0.99556     0.058094
             16+     0.99591     0.057085      0.99312     0.071062
hss-varsp    2–3     0.99762     0.048259      0.99797     0.044552
             4–7     0.9979      0.041595      0.9973      0.047808
             8–15    0.98827     0.093353      0.99586     0.056173
             16+     0.99501     0.062869      0.99318     0.070702
tags         2–3     0.99807     0.043069      0.99838     0.03961
             4–7     0.99817     0.039301      0.99797     0.042088
             8–15    0.99792     0.039127      0.99358     0.065956
             16+     0.99314     0.097974      0.98936     0.087702
tags-alpha   2–3     0.9988      0.033991      0.99749     0.049578
             4–7     0.998       0.042234      0.99695     0.051297
             8–15    0.99797     0.038512      0.99653     0.051339
             16+     0.9954      0.061504      0.99335     0.070447

recall, and on the other hand by the clustering itself, which leaves some URLs
(and hosts) unreachable. In other words, some clusters may not contain any
spam or normal labels. URLs in such clusters are impossible to label by tag
spreading.
Methods based on words (i.e., words and full) have a recall that is signifi-
cantly lower than that of style-based methods, whatever the rate of known labels
in input. Figure 14 combines precision and recall and shows the efficiency gap
between words-based methods and the other methods. This gap is also clear with
the F1 measure:

    F1 = (2 · precision · recall) / (precision + recall).

The F1 measure is a way to assess the global quality of the various
methods. Figure 15 shows the F1 measure according to the percentage of tagged
hosts.
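Spelled out as a small helper (the zero-division guard is our addition):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```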
For HTML style-based methods, we note that Charikar-based methods are
generally slightly better for the highest percentages of known input, and the
corresponding Broder-based methods for smaller known input.
We also note, according to Figure 15, a global performance increase with
the rate of known labels. Hence, using an external imperfect spam classifier is

[Figure 12: precision (y-axis, 0.1-0.9) versus % of already tagged hosts
(x-axis, 10-90), one curve per method: B/C variants of words, full, hss,
hss-varsp, tag-alpha, and tags.]

Fig. 12. Precision of clustering methods according to pre-labeled hosts rate.

useful to increase the spam/normal base information before the tag spreading
described in Section 3.4.
According to these results, we note that the tag-alpha preprocessor gives the
best results for lower rates of tagged hosts when combined with Broder finger-
prints, and for higher rates with Charikar ones. The B-hss-varsp preprocessor
is the best on average.

5.2.3 Precision Improvement Ability. Collections of labels are subject to
errors (human misjudgment, classifier imprecision, etc.). To use these labels
efficiently, tag spreading should be fault-tolerant.
The high consistency threshold mandatory to label a whole cluster offers a
rather good tolerance to errors. In order to estimate the tolerance level, we intro-
duced a known proportion of errors. Figure 16 shows the evolution of precision
according to the rate of injected errors for the tag-alpha preprocessor.
If the rate of injected errors is less than five percent, the final precision is
only slightly lowered. This "precision boosting" property is really interesting when
combining different spam-detection methods.
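The error-injection protocol can be sketched as follows (the label representation and seed handling are ours):

```python
import random

def inject_errors(labels, rate, seed=0):
    """Flip a fraction `rate` of spam/normal labels to test fault tolerance."""
    rng = random.Random(seed)  # seeded for reproducibility
    return {host: ("normal" if lab == "spam" else "spam")
                  if rng.random() < rate else lab
            for host, lab in labels.items()}
```

With rate 0.0 the labels come back unchanged; with rate 1.0 every label is flipped, and intermediate rates flip roughly the requested fraction.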

5.2.4 A Side Note About Mirrors. The full preprocessing also allows us to
detect mirrors by looking at Web sites with very high similarities.
This is done in two steps. In a first step, we cluster pages using the B-full
similarity with a high similarity threshold of 60 out of 64. In a second step, we
cluster the hosts using the Jaccard similarity on the sets of page clusters over
which they span.
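The second step's Jaccard comparison, sketched with hypothetical host data:

```python
def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two cluster-id sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical: ids of the page clusters spanned by each host after step one.
host_clusters = {
    "www.example.co.uk": {1, 2, 3, 4},
    "example.co.uk":     {1, 2, 3, 4},  # alias spans the same clusters
    "other.co.uk":       {5, 6},
}

sim = jaccard(host_clusters["www.example.co.uk"],
              host_clusters["example.co.uk"])  # 1.0 -> mirror candidate
```

Hosts whose cluster sets are nearly identical (Jaccard close to 1) are flagged as mirror candidates.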

[Figure 13: recall (y-axis, 0.05-0.45) versus % of already tagged hosts
(x-axis, 10-90), one curve per clustering method.]

Fig. 13. Recall of clustering methods according to prelabeled hosts rate.

[Figure 14: recall (y-axis, 0.05-0.45) versus precision (x-axis, 0.1-1.0),
one curve per clustering method.]

Fig. 14. Precision versus recall for clustering methods according to prelabeled hosts rate.


[Figure 15: F1 measure (y-axis, 0.1-0.6) versus % of already tagged hosts
(x-axis, 10-90), one curve per clustering method.]

Fig. 15. F1 measure of clustering methods according to prelabeled hosts rate.

[Figure 16: precision (y-axis, 0.3-0.9) versus % of false tags (x-axis, 0-20),
curves for B-tag-alpha and C-tag-alpha.]

Fig. 16. Evolution of precision when injecting errors for tag-alpha preprocessed clusterings.


Applying this procedure on the WEBSPAM-UK2006 corpus, we detect that around
7% of hosts are mirror Web sites, which fall into three classes:
—a host name and its alias with the www prefix, like www.cwrightandco.co.uk
and cwrightandco.co.uk;
—different variants of the same name, like library.cardiff.ac.uk and
library.cf.ac.uk;
—fully different host names, like www.yorkshire-evening-post.co.uk and
thisisleeds.co.uk.
While detection of the first type of mirrors is trivial and reliable, handling
the two others requires semi-duplicate detection techniques.
5.3 Web Spam Challenge 2007
The WEBSPAM-UK2006 reference collection is also the reference data for the Web
Spam Challenge 2007: http://webspam.lip6.fr. The goal of this challenge is to
evaluate different spam detection algorithms based on partial labeling. For this
challenge, we used the framework described in Section 3 combined with a se-
lective Bayesian classifier [Boullé 2006]. A short description of our submission
is given in Filoche et al. [2007].
6. CONCLUSION
We proposed and studied several similarity measures to compare web pages
according to their HTML "style." We implemented a computationally efficient
algorithm to cluster HTML documents based on these similarities. We pro-
posed a spam detection framework that we evaluated through a detailed
experiment on the WEBSPAM-UK2006 tagged dataset.
This experiment showed the efficiency of our framework in enhancing the
quality of spam classifiers by spreading and consolidating their predictions
according to cluster consistencies. It also showed that style-based similarities
are significantly more efficient than text-content and full-content based simi-
larities at diffusing spam information. Among the style-based similarities, it
showed slightly better results for the B-hss-varsp and tag-alpha similarities.
The HTML style similarities find several uses in a search engine back-
office process: a direct application is to enhance the efficiency of search engines'
blacklist management by "spreading" detected spam information; a second one
is to help the detection of web site boundaries by detecting HTML templates;
a third one is to point out large clusters of similar pages spanning several
domains, which is often a good hint of either site mirroring or automatic spam
page generation.
For practical use in a search engine, the most interesting way to fingerprint
HTML documents is probably to combine a word-level content-based Charikar
fingerprint for topic detection, and a sentence-level style-based Broder finger-
print for template detection.
REFERENCES

BAR-YOSSEF, Z. AND RAJAGOPALAN, S. 2002. Template detection via data mining and its applications.
In Proceedings of the 11th International Conference on World Wide Web (WWW’02). ACM Press,
580–591.


BAWA, M., CONDIE, T., AND GANESAN, P. 2005. LSH forest: Self-tuning indexes for similarity search.
In Proceedings of the 14th International Conference on World Wide Web (WWW’05). ACM Press,
651–660.
BENCZÚR, A., CSALOGÁNY, K., AND SARLÓS, T. 2006. Link-based similarity search to fight web spam.
In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the
Web (AIRWeb’06). Seattle, WA.
BOULLÉ, M. 2006. MODL: A Bayes optimal discretization method for continuous attributes. Ma-
chine Learn. 65, 1, 131–165.
BRODER, A. 1997. On the resemblance and containment of documents. In Proceedings of the
Compression and Complexity of Sequences (SEQUENCES’97). IEEE Computer Society. 21.
BRODER, A. Z., GLASSMAN, S. C., MANASSE, M. S., AND ZWEIG, G. 1997. Syntactic clustering of the web.
In Selected Papers from the 6th International Conference on World Wide Web. Elsevier Science
Publishers, 1157–1166.
CASTILLO, C., DONATO, D., BECCHETTI, L., BOLDI, P., LEONARDI, S., SANTINI, M., AND VIGNA, S. 2006. A
reference collection for web spam. SIGIR Forum 40, 2, 11–24.
CASTILLO, C., DONATO, D., GIONIS, A., MURDOCK, V., AND SILVESTRI, F. 2007. Know your neigh-
bors: Web spam detection using the web topology. In Proceedings of SIGIR. ACM Press, 423–
430.
CHAKRABARTI, D., KUMAR, R., AND PUNERA, K. 2007. Page-level template detection via isotonic
smoothing. In Proceedings of the 16th International Conference on World Wide Web (WWW’07).
ACM Press, 61–70.
CHARIKAR, M. S. 2002. Similarity estimation techniques from rounding algorithms. In Proceed-
ings of the 34th Annual ACM Symposium on Theory of Computing (STOC’02). ACM Press, 380–
388.
CHEN, L., YE, S., AND LI, X. 2006. Template detection for large scale search engines. In Proceedings
of the ACM Symposium on Applied Computing (SAC’06). ACM Press, 1094–1098.
FETTERLY, D., MANASSE, M., AND NAJORK, M. 2004. Spam, damn spam, and statistics: Using statis-
tical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the
Web and Databases (WebDB’04). ACM Press, 1–6.
FETTERLY, D., MANASSE, M., AND NAJORK, M. 2005. Detecting phrase-level duplication on the world
wide web. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval (SIGIR’05). ACM Press, 170–177.
FILOCHE, P., URVOY, T., CHAUVEAU, E., AND LAVERGNE, T. 2007. France Telecom R&D entry. Web
Spam Challenge 2007 (Track I).
GRAY, A., SALLIS, P., AND MACDONELL, S. 1997. Software forensics: Extending authorship analysis
techniques to computer programs. In Proceedings of the 3rd Biannual Conference of International
Association of Forensic Linguists (IAFL’97). 1–8.
GYÖNGYI, Z. AND GARCIA-MOLINA, H. 2005. Web spam taxonomy. In Proceedings of the 1st Interna-
tional Workshop on Adversarial Information Retrieval on the Web (AIRWeb’05). Chiba, Japan.
GYÖNGYI, Z., GARCIA-MOLINA, H., AND PEDERSEN, J. 2004. Combating Web spam with TrustRank.
In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB). Morgan
Kaufmann, 576–587.
HEINTZE, N. 1996. Scalable document fingerprinting. In Proceedings of the USENIX Workshop
on Electronic Commerce.
HENZINGER, M. 2006. Finding near-duplicate web pages: A large-scale evaluation of algorithms.
In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Devel-
opment in Information Retrieval (SIGIR ’06). ACM Press, 284–291.
INDYK, P. AND MOTWANI, R. 1998. Approximate nearest neighbors: towards removing the curse
of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing
(STOC’98). ACM Press, 604–613.
JENKINS, B. 1997. A hash function for hash table lookup. Dr. Dobb's Journal.
LAVERGNE, T. 2006. Unnatural language detection. In Proceedings of Young Scientists’ Conference
on Information Retrieval (RJCRI’06).
MANBER, U. 1994. Finding similar files in a large file system. In USENIX Winter. 1–10.
MCENERY, T. AND OAKES, M. 2000. Authorship identification and computational stylometry. In
Handbook of Natural Language Processing. Marcel Dekker Inc.


MEYER ZU EISSEN, S. AND STEIN, B. 2004. Genre classification of web pages. In Proceedings of 27th
German Conference on Artificial Intelligence (KI-04), S. Biundo, T. Frühwirth, and G. Palm, Eds.
Lecture Notes in Computer Science, vol. 3238.
NTOULAS, A., NAJORK, M., MANASSE, M., AND FETTERLY, D. 2006. Detecting spam web pages through
content analysis. In Proceedings of the 15th International Conference on World Wide Web
(WWW’06). ACM Press, 83–92.
SCHLEIMER, S., WILKERSON, D. S., AND AIKEN, A. 2003. Winnowing: Local algorithms for document
fingerprinting. In Proceedings of the SIGMOD Conference, A. Y. Halevy, Z. G. Ives, and A. Doan,
Eds. ACM Press, 76–85.
URVOY, T., LAVERGNE, T., AND FILOCHE, P. 2006. Tracking web spam with hidden style similarity. In
Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web
(AIRWeb’06).
VAN RIJSBERGEN, C. J. 1979. Information Retrieval 2nd ed. University of Glasgow, Glasgow, Scot-
land, UK.
WESTBROOK, A. AND GREENE, R. 2002. Using semantic analysis to classify search engine spam.
Tech. rep., Stanford University.
ZOBEL, J. AND MOFFAT, A. 1998. Exploring the similarity space. SIGIR Forum 32, 1, 18–34.

Received April 2007; revised September 2007; accepted October 2007

