Professional Documents
Culture Documents
or
The Wisdom of Crowds
Ricardo Baeza-Yates
Yahoo! Research
Barcelona, Spain & Santiago, Chile 1
Agenda
• Examples
• Concluding Remarks
4
The IR Problem
SEARCH
ENGINE
Polysemy
Synonimy
Query Results
Refinement Corpus
7
Classic IR Goal
– Classic relevance
• For each query Q and stored document D in a given
corpus assume there exists relevance Score(Q, D)
–Score is average over users U and contexts C
• Optimize Score(Q, D) as opposed to Score(Q, D, U,
C)
• That is, usually:
–Context ignored
Bad assumptions
–Individuals ignored in the web context
–Corpus predetermined
8
Log
13
14
The Structure of the Web
15
Dynamic
Context Interaction
17
Interaction
• Inexperienced users
• Dynamic information needs
• Varying task: querying, browsing,…
• No content overview
• Poor query language, no help
Web Retrieval
• Problems:
– volume
– fast rate of change and growth
– dynamic content
– redundancy
– organization and data quality
– diversity
– …..
• Deal with data overload
Web Retrieval Architecture
• Centralized parallel architecture
Web
Crawlers
Algorithmic Challenges
• Crawling:
– Quantity
Conflict
– Freshness
– Quality
– Politeness vs. Usage of Resources
Adversarial IR
• Ranking
– Words, links, usage logs, … , metadata
– Spamming of all kinds of data
– Good precision, unknown recall
Fight Spam
28
29
Web Mining
– Web characterization
– Particular applications
30
What for?
• Social Mining
• .....
31
The Mining Process
32
Data Recollection
• Usage: Logs
33
Crawling
• Different goals
• Many Restrictions
• No standard benchmark
Crawling Goals
Quality
Focused and
Personal
Crawlers General Research and
General
Search Archive
Search
Engine Crawlers
Engine
Crawlers
Crawlers
Freshness
Mirroring
Systems Quantity
B*
P1 = T* x B1
Bandwidth [bytes/second]
P2 = T* x B2
P3 = T* x B3
T*
P4 = T* x B4
P5 = T* x B5
Time [seconds]
w
B*
Bandwidth [bytes/second]
P1
P2
B3MAX
P3
P4 P5
w T T**
Time [seconds] *
Software Architecture
World Wide
Web
Single
threaded
Scheduler
Multi
threaded
Crawler
Database or Spider
of URLS Collection
of Text
Manager
Long term
Pages scheduling Tasks
Harvester
Seeder Short-term
Resolve sched.
links Network
transfers
Gatherer Documents
URLs Parse pages
and
extract links
Crawling Heuristics
• Breadth-first
• Ranking-ordering
–PageRank
• Largest Site-first
• Use of:
–Partial information
–Historical information
• No Benchmark for Evaluation
1
Fraction of Pagerank
Very
good
collected
m
o
d Very
n
a bad
R
Fraction of 1
pages
downloaded
No Historical Information
Historical Information
Validation in the Greek domain
Data Cleaning
• Problem Dependent
• Usage mining:
– Anonymize if needed
– Define sessions
48
24 languages, 20 countries
55
A Few Examples
• Link Analysis
• Log Analysis
• Web Dynamics
• Social Mining
56
57
58
User Modeling
59
Size Evolution
60
Structure Macro Dynamics
61
62
Mirror of the Society
63
64
The Wisdom of Crowds
66
The Power of Social Media
Anchor Text
www.ibm.com
68
The Wisdom of Crowds
71
Web Search Queries
User Needs
• Need (Broder 2002)
Low hemoglobin
– Gray areas
Car rental Goa
• Find a good hub
• Exploratory search “see what’s there” 73
74
75
76
85
How far do people look for results?
Typical Session
• Two queries of
MP3
• .. two words, looking at… games
• .. two answer pages, doing cars
famous person
• .. two clicks per page
pictures
90
Context
91
Context in Web Queries
Session: ( q, (URL, t)* )+
Home page
Hub page
Page with
resources
Rose & Levinson 2004
94
User
Goals
Liu, Lee & Cho,
WWW 2005
Top 50 CS queries
Manual Query
Classification: 28
people
Informational goal
i(q)
Remove software &
person-names
30 queries left 96
Features
97
Prediction power:
Single features: 80%
Mixed features: 90%
Drawbacks: Small evaluation,
a posteriori feature
98
User Intention
100
101
Results: Topic
• Volume wise the
results are different
102
103
Clustering Queries
Goals
105
Our Approach
Clusters Examples
107
Using the Clusters
108
Query Recommendation
109
Relating Queries (Baeza-Yates, 2007)
common session
q1 q2 q3 q4 queries
common
clicks
words
pages
common
links
clicks
w w
common terms
114
Qualitative Analysis
Click Graph
Formal Definition
• There is an edge between two queries q and q' if:
– en.wikipedia.org/wiki/Complex_network (1 click)
0 0 0 0 1/4 3/4 0 0 0 0
“Complex networks”
Node Degree Distribution
Connected Components
Implicit Folksonomy?
130
132
Experimental Evaluation
Open Issues
Yahoo! Experience
139
Bibliography – General