Rajeev Rastogi
Yahoo! Labs Bangalore
The most visited site on the
internet
• 600 million+ users
per month
• Super popular
properties
– News, finance,
sports
– Answers, flickr,
del.icio.us
– Mail, messaging
– Search
Unparalleled scale
• University relations
– Faculty research grants
– Summer internships
– Sharing data/computing
infrastructure
– Conference sponsorships
– PhD co-op program
Web Search
What does search look like
today?
Search results of the future:
Structured abstracts
yelp.com Gawker
epicurious LinkedIn
answers.com webmd
Search results of the future:
Query refinement
Search results of the future:
Rich media
Technologies that are enabling
search transformation
• Information extraction (structured
abstracts)
• Web page classification (query
refinement)
• Multimedia search (rich media)
Information extraction (IE)
• Goal: Extract structured records from Web pages
Name
Category
Address Map
Phone
Price
Reviews
Multiple verticals
• Business, social networking, video, ….
One schema per vertical
Price
Category
Address
Name
Phone Title Price Posted by
Date
Title
Education
Connections
Rating Views
IE on the Web is a hard problem
• Web pages are noisy
• Pages belonging to different Web sites have different
layouts
Noise
Web page types
Template-based
Hand-crafted
Template-based pages
Extract
Website pages
Extract Records
Example
Generalize
XPath: /html/body/div/div/div/div/div/div/span → /html/body//div//span
Filters
• Apply filters to prune the multiple candidates that match the
XPath expression
XPath: /html/body//div//span
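As a rough illustration of why the generalized expression helps, the sketch below evaluates a site-specific path and the generalized path against two hypothetical pages with different nesting depths, then applies a filter to prune the candidates. The page markup, the phone-number pattern, and the helper names are assumptions made for this example. Python's xml.etree.ElementTree supports only a subset of XPath and may return the same node more than once for nested `//` steps, so the helper de-duplicates.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pages from one site: same kind of record, different nesting depth.
PAGE_A = ("<html><body><div><div>"
          "<span>Chez Example</span><span>408-555-0100</span>"
          "</div></div></body></html>")
PAGE_B = ("<html><body><div><div><div>"
          "<span>Cafe Sample</span><span>650-555-0199</span>"
          "</div></div></div></body></html>")

SPECIFIC = "./body/div/div/span"  # path learned from PAGE_A only
GENERAL = "./body//div//span"     # generalized form, relative to <html>

def candidates(page, xpath):
    """Return the text of every node matching xpath, without duplicates."""
    root = ET.fromstring(page)
    seen, texts = set(), []
    for node in root.findall(xpath):
        if id(node) not in seen:
            seen.add(id(node))
            texts.append(node.text)
    return texts

# Hypothetical filter: keep only candidates that look like a phone number.
PHONE = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def phone_filter(texts):
    return [t for t in texts if PHONE.match(t)]

if __name__ == "__main__":
    print(candidates(PAGE_B, SPECIFIC))
    print(candidates(PAGE_B, GENERAL))
    print(phone_filter(candidates(PAGE_B, GENERAL)))
```

The specific path misses the deeper page entirely, while the generalized path matches both spans and leaves the filter to pick the right one.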
$$P(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{|x|} \exp\Big( \sum_k \lambda_k f_k(y_t, y_{t-1}, x, t) \Big)$$
– f_k: features, λ_k: weights
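To make the formula above concrete, here is a minimal sketch that evaluates P(y | x) for a two-token input by brute-force enumeration of all label sequences. The labels, the three feature functions, their weights, and the use of None as the label before the first position are all assumptions chosen for illustration; a practical implementation would compute Z(x) with dynamic programming rather than enumeration.

```python
import itertools
import math

LABELS = ["Name", "Phone"]    # hypothetical label set
WEIGHTS = [2.0, 1.5, 0.5]     # hypothetical lambda_k values

def features(y_t, y_prev, x, t):
    """Hypothetical f_k(y_t, y_{t-1}, x, t) for k = 0, 1, 2."""
    token = x[t]
    return [
        1.0 if y_t == "Phone" and token[0].isdigit() else 0.0,
        1.0 if y_t == "Name" and token[0].isalpha() else 0.0,
        1.0 if y_prev == "Name" and y_t == "Phone" else 0.0,
    ]

def unnormalized(y, x):
    """Product over t of exp(sum_k lambda_k * f_k), computed as exp of the total."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else None  # assumed start marker
        total += sum(w * f for w, f in zip(WEIGHTS, features(y[t], y_prev, x, t)))
    return math.exp(total)

def prob(y, x):
    """P(y | x): unnormalized score divided by Z(x), the sum over all labelings."""
    z = sum(unnormalized(c, x) for c in itertools.product(LABELS, repeat=len(x)))
    return unnormalized(tuple(y), x) / z

if __name__ == "__main__":
    x = ["Joe's", "408-555-0100"]
    for y in itertools.product(LABELS, repeat=len(x)):
        print(y, round(prob(y, x), 3))
```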
Category
Address
Phone
Similar-structured
records
IE big picture/taxonomy
• Things to extract from
– Template-based, browse, hand-crafted pages, text
• Things to extract
– Records, tables, lists, named entities
• Techniques used
– Structure-based (HTML tags, DOM tree paths) – e.g.
Wrappers
– Content-based (attribute values/models) – e.g. dictionaries
– Structure + Content (sequential/hierarchical relationships
among attribute values) – e.g. hierarchical CRFs
• Level of automation
– Manual, supervised, unsupervised
Web Page Classification:
Requirements
• Quality
– High Precision and Recall
– Leverage structured input (links, co-citations) and
output (taxonomy)
• Scalability
– Large numbers of training Examples, Features and
Classes
– Complex Structured input and output
• Cost
– Small human effort (for labeling of pages)
– Compact classifier model
– Low prediction time
Structured Output Learning
• Structured Output Examples
– Multi-class
– Taxonomy
• Naïve approach
– Separate binary classifier per class
– Separate classifier for each taxonomy level
• Better approach – single (SVM) classifier
[Figure: example taxonomy. Health → Fitness, Medicine; Sport → Cricket, Soccer; Cricket → One-day, Test]
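One common way to realize a single classifier over a taxonomy is to score each leaf by summing the scores of every node on its root-to-leaf path, so that sibling classes share their ancestors' weights. The sketch below shows only the prediction step under that assumption. The taxonomy is read off the slide's figure labels, while the hand-set weights, the bag-of-words input, and the function names are hypothetical; the slides do not state which formulation was used, and a real model would learn the weights (for example with a structured SVM).

```python
# Parent links for the example taxonomy (None marks a top-level class).
PARENT = {
    "Health": None, "Sport": None,
    "Fitness": "Health", "Medicine": "Health",
    "Cricket": "Sport", "Soccer": "Sport",
    "One-day": "Cricket", "Test": "Cricket",
}
LEAVES = [n for n in PARENT if n not in PARENT.values()]

# Hypothetical per-node word weights; a trained model would learn these.
WEIGHTS = {
    "Sport": {"match": 1.0, "team": 1.0},
    "Health": {"doctor": 1.0, "exercise": 1.0},
    "Cricket": {"wicket": 1.5, "innings": 1.0},
    "Soccer": {"goal": 1.5},
    "Test": {"five-day": 2.0},
    "One-day": {"overs": 1.0},
    "Fitness": {"gym": 1.5},
    "Medicine": {"doctor": 1.0, "drug": 1.5},
}

def path(node):
    """Nodes from the top-level class down to the given node."""
    nodes = []
    while node is not None:
        nodes.append(node)
        node = PARENT[node]
    return nodes[::-1]

def score(words, leaf):
    """Sum the weights of all page words at every node on the leaf's path."""
    return sum(WEIGHTS.get(n, {}).get(w, 0.0) for n in path(leaf) for w in words)

def classify(words):
    """Single argmax over all leaves instead of one classifier per class or level."""
    return max(LEAVES, key=lambda leaf: score(words, leaf))

if __name__ == "__main__":
    print(classify(["match", "wicket", "five-day"]))
    print(classify(["doctor", "drug"]))
```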
Link
Multimedia Search
• Availability & consumption of multimedia
content on the Internet is increasing
– 500 billion images will be captured in 2010
• Leveraging content and metadata is
important for MM search
• Some big technical challenges are:
– Results diversity
– Relevance
– Image Classification, e.g., pornography
Near-Duplicate Detection
• Multiple near-similar versions of an
image exist on the internet
– scaled, cropped, captioned, small
scene change, etc.
• Near-duplicates adversely impact
user experience
• Can we use a compact description
and dedup in constant time?
• Fourier-Mellin Transform (FMT):
translation, rotation, and scale
invariant
• Signature generation using a small
number of low-frequency coefficients
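The sketch below illustrates only part of this idea: it takes the magnitudes of a few low-frequency 2-D DFT coefficients, which are unchanged by circular translation, quantizes them into a short bit signature, and de-duplicates with a hash lookup. It is not a Fourier-Mellin transform: the log-polar resampling that provides rotation and scale invariance is omitted. The one-bit quantization rule, the function names, and the tiny synthetic image are assumptions for illustration, and a magnitude that lies very close to the threshold could flip a bit.

```python
import cmath
import math

def low_freq_magnitudes(img, k=3):
    """Magnitudes of the k x k lowest-frequency 2-D DFT coefficients (DC skipped)."""
    h, w = len(img), len(img[0])
    mags = []
    for u in range(k):
        for v in range(k):
            if u == 0 and v == 0:
                continue
            coeff = sum(
                img[r][c] * cmath.exp(-2j * math.pi * (u * r / h + v * c / w))
                for r in range(h) for c in range(w)
            )
            mags.append(abs(coeff))
    return mags

def signature(img, k=3):
    """Hypothetical compact signature: one bit per coefficient (above the mean or not)."""
    mags = low_freq_magnitudes(img, k)
    mean = sum(mags) / len(mags)
    return tuple(m > mean for m in mags)

def circular_shift(img, dr, dc):
    """Translate the image by (dr, dc) pixels with wrap-around."""
    h, w = len(img), len(img[0])
    return [[img[(r - dr) % h][(c - dc) % w] for c in range(w)] for r in range(h)]

def dedup(named_images):
    """Keep the first image per signature; each lookup is an average O(1) hash probe."""
    index, kept = {}, []
    for name, img in named_images:
        sig = signature(img)
        if sig not in index:
            index[sig] = name
            kept.append(name)
    return kept

if __name__ == "__main__":
    original = [[(3 * r + c * c) % 7 for c in range(8)] for r in range(8)]
    shifted = circular_shift(original, 2, 5)
    print(dedup([("original", original), ("shifted", shifted)]))
```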
Filtering noisy tags to improve
relevance
• Measures such as IDF may assign high weights to noisy tags
– Treat Tag-Sets as Bag-of-words, random collection of terms
• Boosting weights of tags based on their co-occurrence with other tags can
filter out noise
idf                   co-occur
10.2765  hinduism     8.8989  child
8.6589   hindu        8.8033  smile
8.6259   finger       8.338   happy
7.8524   kerala       7.982   mother
7.3432   mother       6.0989  women
6.7895   smile        4.8763  family
6.6507   child        4.208   india
6.576    women        2.9307  hinduism
6.5535   point        2.8871  hindu
6.4512   happy        2.8318  orange
6.0312   orange       1.4355  kerala
5.2129   india        0.2292  point
4.312    family       0       finger
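The slides do not give the exact co-occurrence formula, so the following is a minimal sketch under one simple assumption: a tag's weight within a photo's tag set is the number of times it co-occurs, across a reference corpus, with the other tags in that set. The toy corpus and the photo's tag set are hypothetical. The sketch reproduces the qualitative effect in the table: a rare but unrelated tag such as "finger" gets the highest IDF yet a co-occurrence weight of zero.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical reference corpus of tag sets.
CORPUS = [
    {"child", "smile", "happy", "mother"},
    {"child", "smile", "family"},
    {"mother", "child", "happy", "family"},
    {"smile", "happy", "india"},
    {"finger", "point"},
]

doc_freq = Counter(tag for tags in CORPUS for tag in tags)
pair_freq = Counter(
    frozenset(pair) for tags in CORPUS for pair in combinations(sorted(tags), 2)
)

def idf(tag):
    """Inverse document frequency: rare tags score high, noisy or not."""
    return math.log(len(CORPUS) / doc_freq[tag])

def cooccur_weight(tag, tag_set):
    """Assumed weighting: corpus co-occurrence count with the other tags in the set."""
    return sum(pair_freq[frozenset((tag, other))] for other in tag_set if other != tag)

PHOTO = {"child", "smile", "happy", "finger"}  # hypothetical tag set for one photo

if __name__ == "__main__":
    for tag in sorted(PHOTO):
        print(tag, round(idf(tag), 3), cooccur_weight(tag, PHOTO))
```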
Online Advertising
Sponsored search ads
Search query
Ad
How it works
[Figure: advertisers submit bids ("I want to bid $5 on canon camera", "I want to bid $2 on cannon camera") to the Ad Index; the sponsored search engine uses the Ad Index to return Ads]
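The bidding flow in this slide can be sketched as an index from bid phrase to bids, which the search engine consults at query time. This is a simplified illustration only: the exact-match lookup, the ranking by bid alone, the function names, and the third advertiser are assumptions, not a description of the production system.

```python
from collections import defaultdict

# Ad index: bid phrase -> list of (bid in dollars, advertiser).
ad_index = defaultdict(list)

def place_bid(advertiser, phrase, bid):
    """Record an advertiser's bid on a phrase."""
    ad_index[phrase].append((bid, advertiser))

def sponsored_ads(query):
    """Return advertisers bidding on exactly this query, highest bid first."""
    return [adv for bid, adv in sorted(ad_index.get(query, []), reverse=True)]

# The two bids from the slide, plus one hypothetical competing bid.
place_bid("advertiser_a", "canon camera", 5.0)
place_bid("advertiser_b", "cannon camera", 2.0)
place_bid("advertiser_c", "canon camera", 3.0)

if __name__ == "__main__":
    print(sponsored_ads("canon camera"))
    print(sponsored_ads("cannon camera"))
```

Note that the misspelled phrase "cannon camera" is a separate entry, so an advertiser can bid on it more cheaply than on the correct spelling.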
Contextual ads
Banner ads
Ad
Creates Brand
Awareness
How it works
[Figure: an advertiser submits a request to the Ad Index, which feeds the Banner Ad Engine: "I want 1M impressions on finance.yahoo.com, gender = male, age = 20-30 during the month of April 2009"]
[Figure: inventory grid with axes Gender (Male) and Age (20-30, > 30); cells labeled (10M, $20) and (10M, $10); one allocation marked Suboptimal]
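[Figure: bipartite graph with edges from node i (labeled d_i) to the nodes in R_i; each node j is labeled s_j and p_j; edge (i, j) is labeled x_ij]
$$\sum_{j \in R_i} x_{ij} \ge d_i$$
$$\sum_i x_{ij} \le s_j$$
Objective: maximize $\sum_j p_j \big( s_j - \sum_i x_{ij} \big)$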
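The constraints and objective above can be checked on a small instance. The sketch below assumes that d_i is the demand of contract i, R_i its set of eligible inventory nodes, s_j and p_j the supply and price of node j, and x_ij the amount allocated from j to i; that reading, the single contract, and the node names are assumptions loosely based on the figure labels (10M, $20) and (10M, $10). It only evaluates candidate allocations and does not solve the optimization.

```python
# Supply s_j (millions of impressions) and price p_j for each inventory node j.
SUPPLY = {"male_20_30": 10, "male_over_30": 10}
PRICE = {"male_20_30": 20, "male_over_30": 10}
# One hypothetical contract i with demand d_i and eligible set R_i.
DEMAND = {"contract_1": 10}
ELIGIBLE = {"contract_1": {"male_20_30", "male_over_30"}}

def used(x, node):
    """Total allocated out of one inventory node: sum_i x_ij."""
    return sum(a for (i, j), a in x.items() if j == node)

def served(x, contract):
    """Total allocated to one contract: sum over j in R_i of x_ij."""
    return sum(a for (i, j), a in x.items() if i == contract)

def is_feasible(x):
    """Check eligibility, non-negativity, demand, and supply constraints."""
    if any(a < 0 or j not in ELIGIBLE[i] for (i, j), a in x.items()):
        return False
    demand_met = all(served(x, i) >= d for i, d in DEMAND.items())
    supply_ok = all(used(x, j) <= s for j, s in SUPPLY.items())
    return demand_met and supply_ok

def objective(x):
    """Value of the inventory left unallocated: sum_j p_j * (s_j - sum_i x_ij)."""
    return sum(PRICE[j] * (SUPPLY[j] - used(x, j)) for j in SUPPLY)

if __name__ == "__main__":
    from_expensive = {("contract_1", "male_20_30"): 10}
    from_cheap = {("contract_1", "male_over_30"): 10}
    print(objective(from_expensive), objective(from_cheap))
```

Both allocations satisfy the contract, but serving it from the cheaper node leaves the more valuable inventory unsold, which gives the larger objective.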
Ads taxonomy
Online Ads
Guarantees: NG NG G NG
• Today
– Separate systems for contextual & display
– Contextual: CPC; Display: CPM
• Tomorrow
– Unified Ads marketplace (Y! Ad Exchange; CPC, CPM)
– Unify contextual & Display
– Increase supply & demand
– Enable better matching
– CPC, CPM ads compete
Estimated eCPM
• For CPM ads: eCPM = bid
• For CPC ads: eCPM = bid * Pr(click)
– Select ad with max eCPM to maximize
revenue
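A minimal sketch of the selection rule above. The ad records, bids, and click probabilities are hypothetical, and the code follows the slide's formulas literally, which assumes both kinds of bid are expressed per impression; if CPM bids are quoted per thousand impressions, the CPC side would need a factor of 1000.

```python
def ecpm(ad):
    """Estimated eCPM: the bid itself for CPM ads, bid * Pr(click) for CPC ads."""
    if ad["pricing"] == "CPM":
        return ad["bid"]
    return ad["bid"] * ad["p_click"]

def select_ad(ads):
    """Pick the ad with the highest estimated eCPM."""
    return max(ads, key=ecpm)

# Hypothetical candidates competing for one impression.
ADS = [
    {"name": "display_ad", "pricing": "CPM", "bid": 0.004},
    {"name": "contextual_ad", "pricing": "CPC", "bid": 0.50, "p_click": 0.01},
    {"name": "low_ctr_ad", "pricing": "CPC", "bid": 1.00, "p_click": 0.002},
]

if __name__ == "__main__":
    for ad in ADS:
        print(ad["name"], ecpm(ad))
    print("selected:", select_ad(ADS)["name"])
```

The CPC ad with the highest bid does not win here, because its low click probability drags its estimated eCPM below the others.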
• Advertising
– Construct ML models to predict click probability
• Data mining
– Analyze TBs of Web logs to compute correlations
between (billions of) user profiles and page views
Solution: Cloud computing
• A cloud consists of
– 1000s of commodity machines (e.g., Linux
PCs)
– Software layer for
• Distributing data across machines
• Parallelizing application execution across cluster
• Detecting and recovering from failures
– Yahoo!’s software layer based on Hadoop
Open Source
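Hadoop's programming model is MapReduce. The sketch below simulates its map, shuffle, and reduce phases in a single Python process to build a small inverted index; it does not use the Hadoop API, and in a real cluster the software layer would run these phases in parallel across machines. The document texts and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents keyed by document ID.
DOCS = {2: "animals in the wild", 12: "farm animals", 23: "bees and honey"}

def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in the document."""
    for word in set(text.split()):
        yield word, doc_id

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    """Produce the sorted posting list for one word."""
    return word, sorted(doc_ids)

def build_index(docs):
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())

if __name__ == "__main__":
    index = build_index(DOCS)
    print(index["animals"], index["bees"])
```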
Cloud computing benefits
[Figure: machines (Machine3, Machine4) grouped into racks (Rack 2, Rack i, Rack n); sample data "Animals: 2,12" and "Bees: 23"]
Challenges:
• Optimize distribution to provide maximum locality
• Optimize replication to provide best fault tolerance
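One simple way to trade locality against fault tolerance is rack-aware replica placement: keep one copy on the writing machine and put the remaining copies on a different rack, so that losing a whole rack never loses the data. The sketch below is a simplified, deterministic version of that idea, similar in spirit to HDFS's default three-replica policy. The cluster layout and function name are hypothetical, and the slides do not say which policy was used.

```python
def place_replicas(cluster, writer):
    """Choose three machines: the writer, plus two machines on one other rack.

    cluster maps rack name -> list of machine names.
    """
    local_rack = next(rack for rack, machines in cluster.items() if writer in machines)
    # First other rack with at least two machines (raises StopIteration if none).
    remote_rack = next(
        rack for rack, machines in cluster.items()
        if rack != local_rack and len(machines) >= 2
    )
    second, third = cluster[remote_rack][:2]
    return [writer, second, third]

# Hypothetical cluster layout.
CLUSTER = {
    "rack1": ["machine1", "machine2"],
    "rack2": ["machine3", "machine4"],
}

if __name__ == "__main__":
    print(place_replicas(CLUSTER, "machine1"))
```

The first replica gives the writer local access, and the two remote replicas survive a failure of the writer's rack.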
Job Scheduling
Job Queues based on priorities and SLAs
Challenges:
• Schedule jobs to maximize resource utilization while preserving SLAs
• Schedule jobs to maximize data locality
• Performance modeling
[Figure: job queues. SDS: Q1, 40%, L1; YST: Q2, 35%, L2; ATG: Qm, 25%, Lm]
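As a rough sketch of capacity-based scheduling, the code below picks the next queue to serve as the one with pending jobs that is furthest below its share of the cluster. It assumes the percentages in the figure (40%, 35%, 25%) are guaranteed capacity shares for the SDS, YST, and ATG queues; that reading, the slot counts, and the function name are assumptions, and the sketch ignores SLAs and data locality.

```python
# Assumed guaranteed capacity share per queue.
CAPACITY = {"SDS": 0.40, "YST": 0.35, "ATG": 0.25}

def next_queue(running, pending, total_slots):
    """Return the queue with pending jobs that is most under-served, or None.

    running: queue -> slots currently in use
    pending: queue -> number of waiting jobs
    """
    waiting = [q for q in CAPACITY if pending.get(q, 0) > 0]
    if not waiting:
        return None
    # Ratio of slots in use to guaranteed slots; lowest ratio goes first.
    return min(waiting, key=lambda q: running.get(q, 0) / (CAPACITY[q] * total_slots))

if __name__ == "__main__":
    running = {"SDS": 50, "YST": 20, "ATG": 20}
    pending = {"SDS": 3, "YST": 1, "ATG": 2}
    print(next_queue(running, pending, total_slots=100))
```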
Summary