Retail Big Data Analytics Solution Blueprint

Getting Started with Big Data
Analytics in Retail
Learn how Intel and Living Naturally* used big data to help a
health store increase sales and reduce inventory carrying costs.
SOLUTION BLUEPRINT
Big Data Analytics in Retail Data
Introduction
The volume, variety, and velocity of data1 being produced in all
areas of the retail industry is growing exponentially, creating both
challenges and opportunities for those diligently analyzing this data
to gain a competitive advantage. Although retailers have been using
data analytics to generate business intelligence for years, the extreme
composition of todays data necessitates new approaches and tools.
This is because the retail industry has entered the big data era, having
access to more information that can be used to create amazing shopping
experiences and forge tighter connections between customers, brands,
and retailers.
A trail of data follows products as they are manufactured, shipped,
stocked, advertised, purchased, consumed, and talked about by
consumers all of which can help forward-thinking retailers increase
sales and operations performance. This requires an end-to-end retail
analytics solution capable of analyzing large datasets populated
by retail systems and sensors, enterprise resource planning (ERP),
inventory control, social media, and other sources.
How does one start a big data project? In an attempt to demystify retail
data analytics, this paper chronicles a real-world implementation that is
producing tangible benefits, such as allowing retailers to:
Increase sales per visit with a deeper understanding of customers
purchase patterns.
Learn about new sales opportunities by identifying unexpected
trends from social media.
Improve inventory management with greater visibility into the
product pipeline.
A set of simple analytics experiments was performed to create
capabilities and a framework for conducting large-scale, distributed
data analytics. These experiments facilitated an understanding of the
edge-to-cloud business analytics value proposition and at the same
time, provided insight into the technical architecture and integration
needed for implementation.
Table of Contents
Introduction1
Competitive Advantage
Big Data Basics2
So what if retailers just ignore big data; after all, is it worth

all the effort? It turns out that it is. According to McKinsey*
Global Institute, big data has the potential to increase net
retailer margins by 60 percent. 3 Likewise, companies in the
top third of their industry in the use of data-driven-decision
making were, on average, five percent more productive and
six percent more profitable than their competitors, wrote
Andrew McAfee and Erik Brynjolfsson in a Harvard Business
Review article.4
Unstructured Data2
Competitive Advantage2
Big Data Technologies 2
Real-World Use Cases2

Product Pipeline Tracking3
Social Media Analysis4
Market Basket Analysis.4
Problem Definition4

Identify the Problem4

Identify All Data Sources5
Conduct Cause and Effect Analysis5
Determine Metrics That Need Improvement6
Verify the Solution Is Workable 6
Consider Business Process Re-engineering6
Calculate an ROI6
Data Preparation6
Algorithm Definition Process 6
Big Data Analytics Solution Implementation8

Capabilities and Software Components 9

Social Media Analysis Details 10
Product Pipeline Tracking 11

Market Basket Analysis Implementation 12
Market Basket Analysis 12
Basket Analysis Using Transposition and Encoding 13
Physical Infrastructure 14
Accelerate Big Data Implementations 14
Big Data Basics

Data has been getting bigger for a while now. The volume of
data generated or processed in 2014 alone is expected to
exceed six zettabytes;2 that is 1,200 times more than all the
data ever generated prior to 2003.
Unstructured Data
One of the reasons data is getting bigger is it is continuously
being generated from more sources and more devices.
Making matters more difficult, much of the data is
unstructured, coming from videos, photos, comments on
social media forums, reviews on web sites, and so on. As a
result, this data is often made up of volumes of text, dates,
numbers, and facts that are typically free form by nature and
cannot be stored in structured, predefined tables.
Certain data sources are arriving so fast, there may not be
enough time to store them before applying analytics to
them. And that is why conventional data analytics and tools
alone do not enable retail IT to store, manage, process, and
analyze all the data they may need to utilize.
2
In order to generate the insights needed to reap substantial

business benefits, new innovative approaches and
technologies are required. This is because big data in retail
is like a mountain, and retailers must uncover those tiny, but
game-changing golden nuggets of insights and knowledge
that can be used to create a competitive advantage.
Big Data Technologies
In order for retailers to realize the full potential of big data,
they must find a new approach for handling large amounts
of data. Traditional tools and infrastructure are struggling
to keep up with larger and more varied data sets coming in
at high velocity. New technologies are emerging to make
big data analytics scalable and cost-effective, such as a
distributed grid of computing resources. Processing is
pushed out to the nodes where the data resides, which is in
contrast to long-established approaches that retrieve data
for processing from a central point.
Hadoop* is a popular, open-source software framework that
enables distributed processing of large data sets on clusters
of computers. The framework easily scales on servers based
on Intel Xeon processors, as demonstrated by Cloudera*,
a provider of Apache* Hadoop-based software, support,
services, and training. Although Hadoop has captured a lot of
attention, other tools and technologies are also available for
working on different types of big data analytics problems.
Real-World Use Cases

Intel teamed up with Living Naturally*, a trusted name in
retail technology, in order to demonstrate big data use
cases for 3,000 natural food stores in the United States.
Living Naturally develops and markets a suite of online and
mobile programs designed to enhance the productivity
and marketing capabilities of retailers and suppliers. Since
1999, Living Naturally has worked with thousands of retail
customers and 20,000 major industry brands.
In a period of three months, a team of four Intel analysts
worked with Living Naturally on a retail analytics project
spanning four key phases: problem definition, data
preparation, algorithm definition, and big data solution
implementation. During the project, the following use cases
were investigated in detail:
30
55
30
25
20
18
120
120
CASHEWS XLG
WHL
0000000003827
NUTRIENT DEPOT
14
ORDER# 1294853
18
21
30
38 units
Mangos Market
Sarasota, FL 99352
Associated Buyers
Houston, TX 77002
SALES
38
120
Recommendation
Papayas are commonly purchased with Cashews
ADD
Figure 1. User Interface for Retail Buyers
Product Pipeline Tracking: When inventory levels are

out of sync with demand, make recommendations to
retail buyers to remedy the situation and maximize
return for the store.
Market Basket Analysis: When an item goes on sale,
let retailers know about adjacent products that benefit
from a sales increase as well.
Social Media Analysis: Prior to products going viral on
social media, suggest retail buyers increase order size to
be more responsive to shifting consumer demand and
avoid out-of-stocks.
A key objective of the team was to ensure the big data
solution complemented how people do their jobs. This
meant the data had to be presented to retail buyers in an
actionable and easy-to-use manner. This is best illustrated
by the solutions intuitive user interface, which is shown in
Figure 1 and described in the following sections:
Product Pipeline Tracking

Figure 2 defines the fields of the retail buyers user interface
(UI) used to order products. In this case, the product is
cashews, and the upper left corner shows the orders (30, 55,
30, 25) the retail buyer made over the last four weeks. Next
to the graph is a shopping cart, indicating the buyer plans to
order 20 more. Moving right along the UI, the next shipment
will contain a partial order of 18 more products arriving from
Houston, Texas, as seen in the lower left corner. Currently,
there are 120 units in inventory.
The retail buyer has a good track record ordering this
product, getting an A grade and a thumbs up from the tool.
The tool makes a suggestion to increase the reorder from
18 currently in the shopping cart to 38 units. The buyer can
override the suggestion by moving the smiley face left or
right to decrease or increase the amount of the next order.
Ordering
Current
Next
Order amount in
shopping cart shipment inventory track record
Previous orders
30
55
30
Inventory
status
Product
25
20
18
120
ORDER# 1294853
Associated Buyers
Houston, TX 77002
Next shipment
Recommended reorder
quantity
CASHEWS XLG
WHL
38 units
Mangos Market
Sarasota, FL 99352
120
Current inventory
38
Order quantity
selector
Figure 2. Product Ordering
Social Media Analysis

In the previous example, why is the tool suggesting a greater
order size when 120 units are already in inventory, and the
prior orders have kept pace of demand and then some?
Because the health benefits of cashews were discussed in a
recent social media post, which can be read by clicking the
Twitter bird in the UI shot in Figure 3. The tool estimated
the demand increase created by this tweet and other related
social media postings (also reflected in the retailers own
increased sales numbers), and recommended a suitable
reorder quantity for cashews.
Recommendation
Papayas are commonly purchased with Cashews

ADD
0000000003827
NUTRIENT DEPOT
14
18
18
21
SALES
INVENTORY
SHRINK
CHANGE
30
SALES
30
ORDERS
21
Significant
Twitter* chatter
0000000003827
NUTRIENT DEPOT
14
Click to see
associated
products
Click for
tweet details
May 10: Dr. Smith

promoted the
health benefits
from cashews...
Figure 3. Social Media Posting
Figure 4. Associated Products
When the team developed the retailer buyer UI, these

featured use cases were pursued separately. But over time,
the team saw the use cases were interconnected and built on
each other, resulting in a solution that was greater than the
sum of its parts. For example, providing retail buyers a single
view for social media postings and associated products (i.e.,
market basket analysis) enables them to take advantage of
multiple profit-maximizing opportunities at the same time.
The rest of this paper describes a process for carrying out a
big data project based on the teams learnings.
Problem Definition
Before starting a big data project, it is important to have
clarity of purpose, which can be accomplished with the
following steps:
Market Basket Analysis
Identify the Problem
The tool also lets the retail buyer know about associated
products for cashews, recommending the buyer increase
orders of papayas, which customers are more likely to
purchase along with cashews due to another social mediadriven microtrend (Figure 4). In other words, if cashew sales
are about to spike due to social media postings, papaya sales
are likely to increase as well.
Retailers need to have a clear idea of the problems they

want to solve and understand the value of solving them. For
instance, a clothing store may find that four of ten customers
looking for blue jeans cannot find their size on the shelf,
which results in lost sales. Using big data, the retailers
goal could be to reduce the frequency of out-of-stocks to
less than one in ten customers, thus increasing the sales
potential for jeans by over 50 percent.
Big data can be used to solve a lot of problems, such as

reduce spoilage, increase margin, increase transaction size
or sales volume, improve new product launch success, or
increase customer dwell time in stores.
Supply Chain: Orders, shipments, invoices, inventory, sales

receipts, and payments
Working with a retail solution provider to 3000 natural food

stores, the Intel team identified an out-of-stocks problem
associated with products going viral thanks to social media
and Internet postings of a celebrity or expert endorsing
them. For example, a popular medical doctor with a healthoriented television show recommended raspberry ketone
pills as a weight reducer to his very large following on
Twitter*. This caused a run on the product at health stores
and ultimately led to empty shelves, which took a long time
to restock since there was little inventory in the supply chain.
Environment: Weather, community events, seasons, and

holidays
Identify All Data sources

Retailers collect a variety of data from point-of-sale (POS)
systems, including the typical sales by product and sales
by customer via loyalty cards. However, there could be less
obvious sources of useful data that retailers can find by
walking the store, a process intended to provide a better
understanding of a products end-to-end life cycle. This
exercise allows retailers to think about problems with a fresh
set of eyes, avoid thinking of their data in silod terms, and
consider new opportunities that present themselves when
big data is used to find correlations between existing and
new data sources, such as:
Video: Surveillance cameras and anonymous video analytics
on digital signs
Social Media: Twitter, Facebook*, blogs, and product reviews
Advertising: Coupons in flyers and newspapers, and

advertisements on TV and in-store signs
Product Movement: RFID tags and GPS

The Intel team focused on the data sources and business
solutions listed in Table 1. Using social media chatter,
the team sent useful product information and desired
promotions to customers based on their interests. In
addition, social media and supply chain data were analyzed
together to minimize out-of-stocks and overstocks, which
will be described in more detail in a following section.
Conduct Cause and Effect Analysis
After listing all the possible data sources, the next step is to
explore the data through queries that help determine which
combinations of data could be used to solve specific retail
problems. For the Intel team, this meant using social media
data to inform health stores up to three weeks before a
product is likely to go viral, providing enough time for retail
buyers to increase their orders. The team queried social
media feeds and compared them to historic store inventory
levels and found it was possible to give retail buyers an early
warning of potentially rising product demand. In this case, it
was important to find a reliable leading indicator that could
be used to predict how much product should be ordered
and when.
Data Type
Social Media
Supply Chain
Data Sources
Twitter*
Facebook*
Orders
Invoices
Shipments
Receipts
Sales
Transactions
Business Solutions
Implement 1:1 digital marketing

Send personalized promotions
Influence customer sentiment
Increase new product sales

Maximize profit by understanding retail buyer behavior
Make product recommendations
Measure marketing campaign effectiveness
Table 1. Data Sources and Business Solutions
Determine Metrics That Need Improvement
Data Preparation
After clearly defining a problem, it is critical to determine

metrics that can accurately measure the effectiveness of
the future big data solution. A recommendation is to make
sure the metrics can be translated into a monetary value
(e.g., income from increased sales, savings from reduced
spoilage) and incorporated into a return on investment (ROI)
calculation.
Retail data used for big data analysis typically comes from
multiple sources and different operational systems, and
may have inconsistencies. Particularly for the Intel team,
transaction data was supplied by several systems, which
truncated UPC codes or stored them differently due to
special characters in the data field. More often than not,
one of the first steps is to cleanse, enhance, and normalize
the data.
The Intel team focused on metrics related to reducing

shrinkage and spoilage, and increasing profitability
by suggesting which products a store should pick for
promotion.
Verify the Solution Is Workable
A workable big data solution hinges on presenting the
findings in a clear and acceptable way to employees. As
described previously, the Intel team made sure the solution
provided retail buyers with timely product ordering
recommendations displayed with an intuitive UI running on
devices used to enter orders.
Consider Business Process Re-engineering
By definition, the outcomes from a big data project can
change the way people do their jobs. Will the data simplify
or complicate things for employees? Will employees need
to be trained on a new device? Will the POS system need to
be modified to deliver a different data set? It is important
to consider the costs and time of the business process reengineering that may be needed to fully implement the big
data solution.
Calculate an ROI
Taking all the inputs previously discussed, retailers should
calculate an ROI to ensure their big data project makes
financial sense. At this point, the ROI is likely to have some
assumptions and unknowns, but hopefully these will have
only a second order effect on the financials. If the return
is deemed too low, it may make sense to repeat the prior
steps and look for other problems and opportunities to
pursue. Moving forward, the following steps will add clarity
to what can be accomplished and raise the confidence level
of the ROI.
The Intel team performed a number of data clean tasks:

Populate missing data elements
Scrub data for discrepancies
Make data categories uniform
Reorganize data fields
Normalize UPC data
Algorithm Definition Process

After identifying the retail data sources, a set of algorithms
can be selected for data exploration. This is a trialand-error process because it is very difficult to assess
the effectiveness of an algorithm before applying it to a
particular data set. Algorithms are like atomic pieces, and
figuring out which ones are needed and work together is like
assembling a puzzle.
Table 2 lists commonly used algorithms and the types of
business problems they help to solve.
Analysis
Technique
Classification
Algorithms
Business Problems
Decision trees, neural network, and Nave Bayesian

Networks.
Churn Analysis: Learn which customers are most likely to

switch to a competitor.
A Bayesian network is a probabilistic graphical

model that represents a set of variables and their
conditional independencies.
Risk Management: Decide whether a loan be approved for

a particular customer.
Business Solutions
Similar to classification, logistic and linear

regression techniques look for patterns that
determine a numerical value.
Predict coupon redemption rate based on face value,

distribution method, distribution volume, and season.
Clustering/
Segmentation
These algorithms assign of a set of observations

into subsets (clusters), which are similar according
to pre-designated criteria.
Maximize profit by creating a profile of perfect buyer

behaviors.
They allow for different assumptions on the

data structure and evaluations based on internal
compactness within a cluster and/or separation
between different clusters.
Association Rules
Forecasting/
Prediction
Sequence Analysis
Customer Segmentation: Determine behavioral and

descriptive profiles of customers.
Association rule learning is a method for

discovering interesting relations between variables
in large databases.
Identify product adjacency for cross-selling purposes.
The method helps identify items that frequently

appear together and determine rules about the
associations.
Perform market basket analysis.
This approach estimates the value of a variable of

interest at some specified future date.
This approach estimates the value of a variable of interest

at some specified future date.
It uses formal statistical methods, employing time

series, cross-sectional or longitudinal data, and
techniques that deal with seasonality, trending,
and noisiness of the data.
It uses formal statistical methods, employing time series,

cross-sectional or longitudinal data, and techniques that
deal with seasonality, trending, and noisiness of the data.
This method finds patterns in a series of events.
Model customer purchases as a sequence, (e.g., a

customer first buys a computer, and then buys speakers,
and finally buys a webcam).
Both sequence and time-series data are similar;

they contain adjacent observations that are orderdependent. The difference is that where a time
series contains numerical data, a sequence series
contains discrete states.
Anomaly Detection
Directed Advertising: Personalize advertisements (instore, online) to better attract customer attention.
These are distance based techniques: k-nearest

neighbor, cluster analysis-based outlier detection,
pointing at records that deviate from learned
association rules.
Determine frequently shopped products.
Detect credit card fraud and network intrusion.

Conduct manufacturing error analysis.
They detect patterns (i.e., anomalies) in a given

data set that do not conform to an established
normal behavior and can be used to validate data
entry and identify outliers from expected norm.
Computer Vision
This type of algorithm translates images/highdimensional data from the real world to provide
numerical or symbolic information for decisionmaking.
Detect events (surveillance), model objects interaction,

reconstruct scenes, recognize objects, etc.
Use models are constructed with the aid of

geometry, physics, statistics, and learning theory
for vision perception.
Automated image analysis is performed on video
sequences to multiple camera views with many
applications.
Table 2. Algorithm Examples
Monitor social
influencers for
health food related
endorsements &
recommendations
Select natural
foods influencer
Monitor
endorsements &
recommendations
for viral
conditions
Monitor Living
Naturally*
products to
message
Review sales prior,

during, and after
message becomes
viral
Figure 5. Data Exploration Approach for Social Media Monitoring
Big Data Analytics Solution Implementation
The goal of this data exploration was to determine if social

media could be used to identify changes to customer
demand for products. The initial premise was the social
media analysis could identify emerging social topics and
viral events, and correlate them with products and sales.
This capability would enable retailers to detect micro trends
and events early enough to make informed ordering, pricing,
and product promotion decisions. The Intel team focused on
known and emerging influencers in the social media world.
The Intel team implemented an analytics solution

architecture based on a Cloudera Distribution of the Apache
Hadoop platform, which could handle the large volume,
velocity, and variety of data that has to be processed. Supply
chain and sales data were analyzed along with social media
data relevant to the natural foods domain; however, the
framework was designed to scale for additional data like
weather, sensor data from the Internet of Things (IoT), or
other sources that are of value to retailers.
Figure 5 shows the data exploration approach, which

basically looked for correlations between social media
events and changes in the sales of related products. The
input data included Google* Trends, tweets from a popular
doctor with a TV show, product attributes, and business-toconsumer (B2C) point-of-sale data.
Figure 7 shows a conceptual solution stack of the big data

analytics solution, containing layers of features required to
analyze large volumes of data and gain actionable insights
from the chosen algorithms. The left side of the figure shows
the data inputs, and the right side lists elements of the
framework spanning the computing infrastructure through
to the retailer buyer UI.
An example of the output is presented in Figure 6. The blue

line shows the trend for social media activity related to
raspberry ketone, a potential weight-reducing compound
found in red raspberries. The trend line peaks twice (weeks
two and four), which is followed by an increase in sales about
a week later, as shown by the green line.
120
100
80
Trend 60
40
20
0
1
400
300
200 Sales,
100 Trend
0 Rate
-100
-120
9
Big Data Retail Analytics Framework

User Interface and Visualization
Twitter* Data
Stream Filtered by
Keywords and
Usernames
Trend Rate
Query, Machine Learning, and

Statistical Algorithms
Analytics
Parallel Computing, Text Processing,
Natural Language Processing,
and Sentiment Analysis
Month
Trend
Visualization
Compute
Sales
d(Trend)/d(Months)
Figure 6. Social Media Impacts on the Sales of Raspberry Ketone
Supply Chain &

Transactional Data
Big Data Distributed File System,

Data Connectors, and Databases
Storage
Server, Storage, and Network
Hardware
Figure 7. Big Data Analytics Conceptual Solution Stack
Twitter* Data
Twitter
Big Data Retail Analytics Framework

Built on Cloudera* Distribution of Apache* Hadoop* (CDH5)
Twitter Firehouse API

Connector (Java
Application)
Mongo Database
Staging Data Store
(Twitter live stream in

JSON format filtered
by keywords, user
names)
Mongo
Export
Analytics
Visualization
in D3.js
Visualization
Store Manager
User Interface
Analytics
Hive
Pig
Mahout
R-Hadoop
Compute
Map Reduce
Programs
NLP Libraries
Tagging &
Tokenization
Sentiment
Scoring
HDFS
(Transaction & Raw
Twitter Data)
HBase
(Processed Data)
Sqoop
(Data Integration)
Living Naturally* Retail Data

in SQL Server
Retailer Buyer Orders

Web Transactions
POS Transactions
Inventory Data
Cleaned Up and
Aggregated Data
Campaigns &
Promotions Data
Storage
Sqoop
Product Catalog
Store & Retailer
Details
Hardware
4 Node Cluster. 2 x Intel Xeon Processor @ 2.7 GHz

24 TB Disk Space, 126 GB Memory per Node,10 Gigabit Ethernet
Loyalty Management
System
Figure 8. Solution Components
Capabilities and Software Components
Various solution components and associated technologies

were used, which are highlighted in Figure 8. At a high
level, Twitter data is streamed live using the fire hose API
exposed by Twitter, and supply-chain data is normalized,
processed, prepared, and stored in relational databases.
These databases are loaded in Hadoop Distributed File
System (HDFS) and HBase, which is a NoSQL database. The
data is processed using MapReduce programs, machine
learning algorithms, and statistical tools, and the output
is recommendations for retail buyers as well as actionable
interactive visualizations.
To enable the social media analysis capabilities, a base

framework was created to access and store live Twitter
streams, depicted by the process flow in Figure 9. The
known social media influencers were identified as previously
discussed. Public chat trends were tracked by extracting
content from tweets, messages were tokenized and
tagged using natural language processing techniques, and
sentiment scores calculated using open source libraries. The
social media analysis was tied to product sales trends to
make recommendations.
Facebook*
Twitter*
Extract
Features
Other
Social
Media
Tokenize &
Tag the Text
Public
chat
Identify known and emerging influencers

Track public chat trend and sentiment on
the relevant topics in influencer tweets
Connect Connect with sales trend of the products

with Sales
Filter by
User
Number of
Followers
Influencer
Keywords
Sentiment
Score
Topics
Recommen Combine the data and make recommendations

-dations
Visualize
Interactive visualization for analysis as well as

integration with store manager UI
Figure 9. Social Media Analysis Process Flow
Product Pipeline Tracking

pipeline that provides stakeholders with information
transparency for data types such as customer demand,
available capacity, and inventory levels. Efficiency should
increase with better information flow, assuming incentive
structures and the necessary technologies are also in place.
A retailers profitability is closely tied to supply chain

management, a critical function that helps ensure the right
amount of inventory is on hand. Without proper product
management, even small variations in customer demand
can lead to major inventory imbalances as business partners
in the supply chain try to make adjustments with imperfect
information. Inventory can swing wildly, creating an unstable
situation known as the bullwhip effect, which results in
stock-outs or excess inventory.
The Intel team developed a glass pipeline analytics

capability by first evaluating the availability and integrity of
product-oriented data structures shown in Figure 10. The
data was bucketed into four categories: unavailable, limited
and suspect, available but dirty, and mostly clean data.
After merging the various product-oriented data structures,
the team was able to generate useful reports such as the
inventory-turns graph in Figure 11. This was the first step
in creating product-based actionable insights to ensure the
right inventory is in the right place at the right time. The
glass pipeline also allows the retail buyer to select individual
product volumes, pricing, and promotions while ensuring
maximum store returns.
It is possible to avoid these inventory issues by providing

effective communication and coordination of the supply
chain after connecting the dots between orders, inventory,
and sales. For example, retailers can improve information
sharing by providing sales insight to the supply chain
so everyones demand predictions are based on the
same information, thereby reducing the bullwhip effect.
Essentially, the supply chain should be viewed as a glass
Inventory
PO & Forecast
Ship and Inv.

Raw
Material
Quantity
PO/Gross
Retailer
Ship and Inv.
Inventory
Inventory
Twitter*
Manufacturer
Ship & Inv.
Raw
Material
PO/Gross
PO & Forecast
Customer
10
LG Promo
POS & Web
Customers
Price
Inventory
Not data available

Limited data / highly suspect
Moderate / larger amount of data requiring scrubbing
Very strong data requiring limited to no scrubbing
Figure 10. Glass Pipeline Data Structures
Environment
Ship and Inv.
Web Search
Demand
Signals
Disty
Raw
Material
MFG Coupon
Price Drop
Shelf
Non-loyalty
Transactions
Demand
Generation
Monthly Dollars
Spent/Sold
Monthly
Inventory Turns %
$120,000
$100,000
$80,000
$60,000
$40.000
$20,000
$0
60%
50%
40%
30%
20%
10%
0%
3/31/11
7/31/11
11/30/11
3/31/12
7/31/12
1/30/11
5/31/11
9/30/11
1/31/12
5/31/12
2/28/11
4/30/11
6/30/11
8/31/11
10/31/11 12/31/11 2/29/12
4/30/12
6/30/12
8/31/12
Monthly Total $ Spent
Monthly Total $ Sold
Monthly Inventory Turn
Figure 11. Inventory Turns
The details with respect to this specific analytics solution can

be found in a companion solution blueprint. This solution
blueprint provides a deep dive discussion on the business
and technical aspects of social media analytics.
Market Basket Analysis Implementation

Several techniques were applied in order to understand
consumer buying patterns and determine product mixes
that best suited consumers and maximized overall return.
The basis of the analytics begins with association rule
learning; and in particular, the machine learning technique
was applied in on-line buying and extended to address
the likelihood of shopping baskets of much larger size and
diversity.
To determine product associations, the Intel team used
association rules to mine data and discover relationships
between products in large-scale transaction data
recorded by point-of-sale systems. Brick and mortar
retailers, especially grocery, have much more complicated
associations between products than on-line retailers due
to the greater average number of items in a consumers
market basket. There are also other types of associations
and relationships that may occur in the store (based on
To test {milk, bread} => {butter}:

Support({milk, bread, butter}) = 1/5 = 0.2
Confidence({milk, bread} => {butter}) =
Support({milk, bread, butter}) / Support({milk, bread})
= 0.2 / 0.4 = 0.5
Lift({milk, bread} => {butter}) =
Support({milk, bread, butter}) /
(Support({milk, bread}) * Support({butter}))
= 0.2 / (0.4 * 0.4) = 1.25
co-location, recipe ingredients, etc.). This called for a fresh

approach to providing prioritized market basket association
signals to the buyer.
Market Basket Association Rules analysis employs a welldefined algorithmic method. The following provides a
specific example of simple grocery association between
milk, bread, and butter to explain how the method works.
In this case, the following rule was tested to determine the
likelihood that customers who buy either milk or bread will
also buy butter:
{milk, bread} => {butter} [support, confidence]
Three measures were used to place constraints on the
significance of rules:
Support: The proportion of transactions in the data set
which contain items: milk or bread.
Confidence: The probability of finding butter under the
condition that these transactions also contain milk or bread.
Lift: A measure of the improvement in the occurrence of the
butter given the presence of milk or bread.
The conclusion is that the purchase of milk or bread is
accompanied by the purchase of butter 25 percent of the
time based on this data set.
Transaction ID
Milk
Bread
Butter
Beer
Source: http://en.wikipedia.org/wiki/Association_rules
Figure 12. Association Rules Learning Example
11
With association rule techniques as the basis for creating

the patterns of consumer buying and products with natural
affinities, the solution took into account the diversity and size
of shopping baskets and provided a multi-tiered analysis
approach. The idea is to identify emergent basket rules
and patterns from transaction datasets to detect product
affinities. These product affinities were then combined with
social media analysis, pricing strategy, promotions, and
customer product recommendations. Figure 13 shows a high
level process flow of the solution.
Basket Analysis Using Transposition and Encoding
For every product set, the solution merged the sets for
every possible pair of products and counted the results. The
merge need not be restricted to a pair, so multiple products
can be considered at a time. The result was then merged
with other products to find basket rules with the cost of
only counting the transactions. The solution calculated the
number of times the products occurred together, divided by
the total number of transactions. The solution processed the
following steps:
Transaction Dataset
Transaction ID
Vegetables
100012346
Canned Goods
100012347
Vegetables
100012348
Bread
100012349
Chips & Pretzels
100012350
Canned Goods
-------
-------
-------
-------
Merge the transaction lists together using AND (only

keep a transaction if in both lists).
If the number of transactions in common is greater than
the support value, put the new combination in the result
at a level equal to the number of keys in the product list,
and keep the list of transactions in common.
When a level is finished, and there are results, repeat the
process, but use the result from that level as the input.
The details with respect to this specific analytics solution
can be found in a companion solution blueprint, whereas
this solution blueprint focuses on the business and technical
aspects of social media analytics.
Association
Rules Mining
Product Category
100012345
For each product, run through the rest of the list.
VEGETABLES
FISH
CHEESE
MEAT
BREAD
CRACKERS
Support =
HERBS
MILK
SEAFOOD
Rule X
Confidence =
POULTRY
Lift =
98% pf people who purchased items
A and B also purchased item C
Figure 13. Market Basket Analysis High-Level Process Flow
12
(X U Y).count
n
(X U Y).count
X.count
Support
Supp(X). Supp(Y)
Association
Rules Library
Basket Rules
items
1 {DRY GROCERY\SHELF STABLE FOODS\CANNED GOODS,
FRESH\PRODUCE\VEGETABLES}
2 {DESERTS,
DRY GROCERY\SHELF STABLE FOODS\COOKIES CRACKERS AND CRISPBREADS}
3 {DRY GROCERY\SHELF STABLE FOODS\COOKIES CRACKERS AND CRISPBREADS,
HERBS BULK}
4 {DRY GROCERY\SHELF STABLE FOODS\CANNED GOODS,
DRY GROCERY\SNACKS\CHIPS AND PRETZELS}
5 {DRY GROCERY\SHELF STABLE FOODS\SOUP AND BROTH,
6 {DRY GROCERY\SNACKS\CHIPS AND PRETZELS,
FRESH BAKERY\BREAD AND BAKED GOODS\BREAD}
7 {DRY GROCERY\SHELF STABLE MEAL SOLUTIONS\MEAT POULTRY SEAFOOD,
FRESH BAKERY\BREAD AND BAKED GOODS\BREAD}
8 {DRY GROCERY\SHELF STABLE MEAL SOLUTIONS\MEAT POULTRY SEAFOOD,
support
0 .005387144
0 .003611720
0.007000243
0 .003363160
0.0024992898
0 .003525485
0.004159565
0 .003276926
Physical Infrastructure
The physical implementation architecture of the analytics
framework is shown in Figure 14, including the hardware
configuration, operating systems, and important software
components used in the solution.
Accelerate Big Data Implementations

Big data presents retailers with game-changing
opportunities to solve age-old business problems as well as
create a competitive advantage through unique insights. This
paper provided an overview of an actual big data analytics
implementation, thereby providing a primer for those just
getting started in this field.
Twitter* Data Streaming Server

2 x Intel Xeon processor @ 2.7 GHz,
24 TB disk space, and 126 GB Memory
OS: Microsoft* Server 2008 R2
Mongo* database
Working to make implementing big data easier, Intel has a

team of experts who create data analytics applications and
analytics toolkits for retailers so they can focus on solving
business problems instead of on system development.
Intel has been in the big data business for over 30 years,
managing state-of-the-art manufacturing sites and complex
supply chain networks. Applying this knowledge to in-store
analytics, Intel data analytics consulting services yield
actionable data that allows retailers and brands to respond
to customers desires in a more responsive and predictive
manner.
Hadoop* Cluster
2 x Intel Xeon processor @ 2.7 GHz, 24 TB disk space,
and 126 GB Memory per node
Cloudera* Distribution of Hadoop (CDH5)
Hadoop Node
Hadoop Node
Hadoop Node
Hadoop Node
10 Gigabit Ethernet
10 Gigabit Ethernet Switch

Transaction Data - Staging Server
7 TB disk space and 24 GB Memory
OS: Microsoft Server 2008 R2
Web Server
2 TB disk space, and 24 GB Memory
Hosting UI and Analytics Visualization
Figure 14. Physical Infrastructure
For more information about Intel solutions for the retail

industry, visit www.intel.com/retail.
Visit http://www.cloudera.com for information on Cloudera
Big Data solutions.
Gartner analyst Doug Laney introduced the 3Vs concept in a 2001 MetaGroup research publication, 3D data management: Controlling data volume, variety and velocity. http://blogs.gartner.com/
doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.
Ernst and Young, Ready for takeoff?, 2014, http://www.ey.com/Publication/vwLUAssets/EY_-_Ready_for_takeoff_executive_summary/%24FILE/EY-Ready-for-takeoff-Executive-summary.pdf.
McKinsey & Company, Consumer Marketing Analytics Center, Creating Competitive Advantage from Big Data in Retail, June 2012. http://www.mckinsey.com/client_service/retail/expertise/~/media/
mckinsey/dotcom/client_service/retail/articles/cmac_creating_competitive_advantage_from_big_data.
Andrew McAfee and Erik Brynjolfsson, Big Data: The Management Revolution, Harvard Business Review, October 2012.
Copyright 2014 Intel Corporation. All rights reserved. Intel, the Intel logo and Xeon are trademarks of Intel Corporation in the United States and/or other countries.
*Other names and brands may be claimed as the property of others.
Printed in USA
0614/MB/TM/PDF
Please Recycle
330716-001US
13

Retail Big Data Analytics Solution Blueprint

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Retail Big Data Analytics Solution Blueprint

Uploaded by

Copyright:

Available Formats

Getting Started with Big Data

Big Data Basics2

So what if retailers just ignore big data; after all, is it worth

Real-World Use Cases2

Identify the Problem4

Big Data Analytics Solution Implementation8

Social Media Analysis9

Product Pipeline Tracking 11

Big Data Basics

In order to generate the insights needed to reap substantial

Real-World Use Cases

Papayas are commonly purchased with Cashews

Product Pipeline Tracking: When inventory levels are

Product Pipeline Tracking

Figure 2. Product Ordering

Social Media Analysis

Papayas are commonly purchased with Cashews

May 10: Dr. Smith

Figure 3. Social Media Posting

Figure 4. Associated Products

When the team developed the retailer buyer UI, these

Market Basket Analysis

Identify the Problem

Retailers need to have a clear idea of the problems they

Big data can be used to solve a lot of problems, such as

Supply Chain: Orders, shipments, invoices, inventory, sales

Working with a retail solution provider to 3000 natural food

Environment: Weather, community events, seasons, and

Identify All Data sources

Advertising: Coupons in flyers and newspapers, and

Product Movement: RFID tags and GPS

Implement 1:1 digital marketing

Increase new product sales

Table 1. Data Sources and Business Solutions

Determine Metrics That Need Improvement

After clearly defining a problem, it is critical to determine

The Intel team focused on metrics related to reducing

The Intel team performed a number of data clean tasks:

Algorithm Definition Process

Decision trees, neural network, and Nave Bayesian

Churn Analysis: Learn which customers are most likely to

A Bayesian network is a probabilistic graphical

Risk Management: Decide whether a loan be approved for

Similar to classification, logistic and linear

Predict coupon redemption rate based on face value,

These algorithms assign of a set of observations

Maximize profit by creating a profile of perfect buyer

They allow for different assumptions on the

Customer Segmentation: Determine behavioral and

Association rule learning is a method for

Identify product adjacency for cross-selling purposes.

The method helps identify items that frequently

Perform market basket analysis.

This approach estimates the value of a variable of

This approach estimates the value of a variable of interest

It uses formal statistical methods, employing time

It uses formal statistical methods, employing time series,

This method finds patterns in a series of events.

Model customer purchases as a sequence, (e.g., a

Both sequence and time-series data are similar;

These are distance based techniques: k-nearest

Determine frequently shopped products.

Detect credit card fraud and network intrusion.

They detect patterns (i.e., anomalies) in a given

Big Data Basics2

Real-World Use Cases2

Identify the Problem4

Big Data Analytics Solution Implementation8

Social Media Analysis9

Product Pipeline Tracking 11