You are on page 1of 13

Getting Started with Big Data

Analytics in Retail
Learn how Intel and Living Naturally* used big data to help a
health store increase sales and reduce inventory carrying costs.

SOLUTION BLUEPRINT
Big Data Analytics in Retail Data

Introduction
The volume, variety, and velocity of data1 being produced in all
areas of the retail industry is growing exponentially, creating both
challenges and opportunities for those diligently analyzing this data
to gain a competitive advantage. Although retailers have been using
data analytics to generate business intelligence for years, the extreme
composition of todays data necessitates new approaches and tools.
This is because the retail industry has entered the big data era, having
access to more information that can be used to create amazing shopping
experiences and forge tighter connections between customers, brands,
and retailers.
A trail of data follows products as they are manufactured, shipped,
stocked, advertised, purchased, consumed, and talked about by
consumers all of which can help forward-thinking retailers increase
sales and operations performance. This requires an end-to-end retail
analytics solution capable of analyzing large datasets populated
by retail systems and sensors, enterprise resource planning (ERP),
inventory control, social media, and other sources.
How does one start a big data project? In an attempt to demystify retail
data analytics, this paper chronicles a real-world implementation that is
producing tangible benefits, such as allowing retailers to:
Increase sales per visit with a deeper understanding of customers
purchase patterns.
Learn about new sales opportunities by identifying unexpected
trends from social media.
Improve inventory management with greater visibility into the
product pipeline.
A set of simple analytics experiments was performed to create
capabilities and a framework for conducting large-scale, distributed
data analytics. These experiments facilitated an understanding of the
edge-to-cloud business analytics value proposition and at the same
time, provided insight into the technical architecture and integration
needed for implementation.

Table of Contents
Introduction1

Competitive Advantage

Big Data Basics2

So what if retailers just ignore big data; after all, is it worth


all the effort? It turns out that it is. According to McKinsey*
Global Institute, big data has the potential to increase net
retailer margins by 60 percent. 3 Likewise, companies in the
top third of their industry in the use of data-driven-decision
making were, on average, five percent more productive and
six percent more profitable than their competitors, wrote
Andrew McAfee and Erik Brynjolfsson in a Harvard Business
Review article.4

Unstructured Data2
Competitive Advantage2
Big Data Technologies 2

Real-World Use Cases2


Product Pipeline Tracking3
Social Media Analysis4
Market Basket Analysis.4

Problem Definition4






Identify the Problem4


Identify All Data Sources5
Conduct Cause and Effect Analysis5
Determine Metrics That Need Improvement6
Verify the Solution Is Workable 6
Consider Business Process Re-engineering6
Calculate an ROI6

Data Preparation6
Algorithm Definition Process 6
Social Media Analysis8

Big Data Analytics Solution Implementation8


Capabilities and Software Components 9

Social Media Analysis9


Social Media Analysis Details 10

Product Pipeline Tracking 11


Market Basket Analysis Implementation  12
Market Basket Analysis  12
Basket Analysis Using Transposition and Encoding  13

Physical Infrastructure 14
Accelerate Big Data Implementations  14

Big Data Basics


Data has been getting bigger for a while now. The volume of
data generated or processed in 2014 alone is expected to
exceed six zettabytes;2 that is 1,200 times more than all the
data ever generated prior to 2003.
Unstructured Data
One of the reasons data is getting bigger is it is continuously
being generated from more sources and more devices.
Making matters more difficult, much of the data is
unstructured, coming from videos, photos, comments on
social media forums, reviews on web sites, and so on. As a
result, this data is often made up of volumes of text, dates,
numbers, and facts that are typically free form by nature and
cannot be stored in structured, predefined tables.
Certain data sources are arriving so fast, there may not be
enough time to store them before applying analytics to
them. And that is why conventional data analytics and tools
alone do not enable retail IT to store, manage, process, and
analyze all the data they may need to utilize.
2

In order to generate the insights needed to reap substantial


business benefits, new innovative approaches and
technologies are required. This is because big data in retail
is like a mountain, and retailers must uncover those tiny, but
game-changing golden nuggets of insights and knowledge
that can be used to create a competitive advantage.
Big Data Technologies
In order for retailers to realize the full potential of big data,
they must find a new approach for handling large amounts
of data. Traditional tools and infrastructure are struggling
to keep up with larger and more varied data sets coming in
at high velocity. New technologies are emerging to make
big data analytics scalable and cost-effective, such as a
distributed grid of computing resources. Processing is
pushed out to the nodes where the data resides, which is in
contrast to long-established approaches that retrieve data
for processing from a central point.
Hadoop* is a popular, open-source software framework that
enables distributed processing of large data sets on clusters
of computers. The framework easily scales on servers based
on Intel Xeon processors, as demonstrated by Cloudera*,
a provider of Apache* Hadoop-based software, support,
services, and training. Although Hadoop has captured a lot of
attention, other tools and technologies are also available for
working on different types of big data analytics problems.

Real-World Use Cases


Intel teamed up with Living Naturally*, a trusted name in
retail technology, in order to demonstrate big data use
cases for 3,000 natural food stores in the United States.
Living Naturally develops and markets a suite of online and
mobile programs designed to enhance the productivity
and marketing capabilities of retailers and suppliers. Since
1999, Living Naturally has worked with thousands of retail
customers and 20,000 major industry brands.
In a period of three months, a team of four Intel analysts
worked with Living Naturally on a retail analytics project
spanning four key phases: problem definition, data
preparation, algorithm definition, and big data solution
implementation. During the project, the following use cases
were investigated in detail:

30

55

30

25

20

18

120
120

CASHEWS XLG
WHL

0000000003827
NUTRIENT DEPOT
14

ORDER# 1294853

18

21

30

38 units
Mangos Market
Sarasota, FL 99352

Associated Buyers
Houston, TX 77002

SALES

38

120

Recommendation

Papayas are commonly purchased with Cashews

ADD
Figure 1. User Interface for Retail Buyers

Product Pipeline Tracking: When inventory levels are


out of sync with demand, make recommendations to
retail buyers to remedy the situation and maximize
return for the store.
Market Basket Analysis: When an item goes on sale,
let retailers know about adjacent products that benefit
from a sales increase as well.
Social Media Analysis: Prior to products going viral on
social media, suggest retail buyers increase order size to
be more responsive to shifting consumer demand and
avoid out-of-stocks.
A key objective of the team was to ensure the big data
solution complemented how people do their jobs. This
meant the data had to be presented to retail buyers in an
actionable and easy-to-use manner. This is best illustrated
by the solutions intuitive user interface, which is shown in
Figure 1 and described in the following sections:

Product Pipeline Tracking


Figure 2 defines the fields of the retail buyers user interface
(UI) used to order products. In this case, the product is
cashews, and the upper left corner shows the orders (30, 55,
30, 25) the retail buyer made over the last four weeks. Next
to the graph is a shopping cart, indicating the buyer plans to
order 20 more. Moving right along the UI, the next shipment
will contain a partial order of 18 more products arriving from
Houston, Texas, as seen in the lower left corner. Currently,
there are 120 units in inventory.
The retail buyer has a good track record ordering this
product, getting an A grade and a thumbs up from the tool.
The tool makes a suggestion to increase the reorder from
18 currently in the shopping cart to 38 units. The buyer can
override the suggestion by moving the smiley face left or
right to decrease or increase the amount of the next order.

Ordering
Current
Next
Order amount in
shopping cart shipment inventory track record
Previous orders
30

55

30

Inventory
status
Product

25

20

18

120

ORDER# 1294853
Associated Buyers
Houston, TX 77002

Next shipment

Recommended reorder
quantity

CASHEWS XLG
WHL

38 units
Mangos Market
Sarasota, FL 99352

120
Current inventory

38
Order quantity
selector

Figure 2. Product Ordering

Social Media Analysis


In the previous example, why is the tool suggesting a greater
order size when 120 units are already in inventory, and the
prior orders have kept pace of demand and then some?
Because the health benefits of cashews were discussed in a
recent social media post, which can be read by clicking the
Twitter bird in the UI shot in Figure 3. The tool estimated
the demand increase created by this tweet and other related
social media postings (also reflected in the retailers own
increased sales numbers), and recommended a suitable
reorder quantity for cashews.

Recommendation

Papayas are commonly purchased with Cashews


ADD

0000000003827
NUTRIENT DEPOT
14

18

18

21

SALES

INVENTORY
SHRINK
CHANGE

30

SALES

30

ORDERS

21

Significant
Twitter* chatter

0000000003827
NUTRIENT DEPOT
14

Click to see
associated
products

Click for
tweet details

May 10: Dr. Smith


promoted the
health benefits
from cashews...

Figure 3. Social Media Posting

Figure 4. Associated Products

When the team developed the retailer buyer UI, these


featured use cases were pursued separately. But over time,
the team saw the use cases were interconnected and built on
each other, resulting in a solution that was greater than the
sum of its parts. For example, providing retail buyers a single
view for social media postings and associated products (i.e.,
market basket analysis) enables them to take advantage of
multiple profit-maximizing opportunities at the same time.
The rest of this paper describes a process for carrying out a
big data project based on the teams learnings.

Problem Definition
Before starting a big data project, it is important to have
clarity of purpose, which can be accomplished with the
following steps:

Market Basket Analysis

Identify the Problem

The tool also lets the retail buyer know about associated
products for cashews, recommending the buyer increase
orders of papayas, which customers are more likely to
purchase along with cashews due to another social mediadriven microtrend (Figure 4). In other words, if cashew sales
are about to spike due to social media postings, papaya sales
are likely to increase as well.

Retailers need to have a clear idea of the problems they


want to solve and understand the value of solving them. For
instance, a clothing store may find that four of ten customers
looking for blue jeans cannot find their size on the shelf,
which results in lost sales. Using big data, the retailers
goal could be to reduce the frequency of out-of-stocks to
less than one in ten customers, thus increasing the sales
potential for jeans by over 50 percent.

Big data can be used to solve a lot of problems, such as


reduce spoilage, increase margin, increase transaction size
or sales volume, improve new product launch success, or
increase customer dwell time in stores.

Supply Chain: Orders, shipments, invoices, inventory, sales


receipts, and payments

Working with a retail solution provider to 3000 natural food


stores, the Intel team identified an out-of-stocks problem
associated with products going viral thanks to social media
and Internet postings of a celebrity or expert endorsing
them. For example, a popular medical doctor with a healthoriented television show recommended raspberry ketone
pills as a weight reducer to his very large following on
Twitter*. This caused a run on the product at health stores
and ultimately led to empty shelves, which took a long time
to restock since there was little inventory in the supply chain.

Environment: Weather, community events, seasons, and


holidays

Identify All Data sources


Retailers collect a variety of data from point-of-sale (POS)
systems, including the typical sales by product and sales
by customer via loyalty cards. However, there could be less
obvious sources of useful data that retailers can find by
walking the store, a process intended to provide a better
understanding of a products end-to-end life cycle. This
exercise allows retailers to think about problems with a fresh
set of eyes, avoid thinking of their data in silod terms, and
consider new opportunities that present themselves when
big data is used to find correlations between existing and
new data sources, such as:
Video: Surveillance cameras and anonymous video analytics
on digital signs
Social Media: Twitter, Facebook*, blogs, and product reviews

Advertising: Coupons in flyers and newspapers, and


advertisements on TV and in-store signs

Product Movement: RFID tags and GPS


The Intel team focused on the data sources and business
solutions listed in Table 1. Using social media chatter,
the team sent useful product information and desired
promotions to customers based on their interests. In
addition, social media and supply chain data were analyzed
together to minimize out-of-stocks and overstocks, which
will be described in more detail in a following section.
Conduct Cause and Effect Analysis
After listing all the possible data sources, the next step is to
explore the data through queries that help determine which
combinations of data could be used to solve specific retail
problems. For the Intel team, this meant using social media
data to inform health stores up to three weeks before a
product is likely to go viral, providing enough time for retail
buyers to increase their orders. The team queried social
media feeds and compared them to historic store inventory
levels and found it was possible to give retail buyers an early
warning of potentially rising product demand. In this case, it
was important to find a reliable leading indicator that could
be used to predict how much product should be ordered
and when.

Data Type

Social Media

Supply Chain

Data Sources

Twitter*
Facebook*

Orders
Invoices
Shipments
Receipts
Sales
Transactions

Business Solutions

Implement 1:1 digital marketing


Send personalized promotions
Influence customer sentiment

Increase new product sales


Maximize profit by understanding retail buyer behavior
Make product recommendations
Measure marketing campaign effectiveness

Table 1. Data Sources and Business Solutions

Determine Metrics That Need Improvement

Data Preparation

After clearly defining a problem, it is critical to determine


metrics that can accurately measure the effectiveness of
the future big data solution. A recommendation is to make
sure the metrics can be translated into a monetary value
(e.g., income from increased sales, savings from reduced
spoilage) and incorporated into a return on investment (ROI)
calculation.

Retail data used for big data analysis typically comes from
multiple sources and different operational systems, and
may have inconsistencies. Particularly for the Intel team,
transaction data was supplied by several systems, which
truncated UPC codes or stored them differently due to
special characters in the data field. More often than not,
one of the first steps is to cleanse, enhance, and normalize
the data.

The Intel team focused on metrics related to reducing


shrinkage and spoilage, and increasing profitability
by suggesting which products a store should pick for
promotion.
Verify the Solution Is Workable
A workable big data solution hinges on presenting the
findings in a clear and acceptable way to employees. As
described previously, the Intel team made sure the solution
provided retail buyers with timely product ordering
recommendations displayed with an intuitive UI running on
devices used to enter orders.
Consider Business Process Re-engineering
By definition, the outcomes from a big data project can
change the way people do their jobs. Will the data simplify
or complicate things for employees? Will employees need
to be trained on a new device? Will the POS system need to
be modified to deliver a different data set? It is important
to consider the costs and time of the business process reengineering that may be needed to fully implement the big
data solution.
Calculate an ROI
Taking all the inputs previously discussed, retailers should
calculate an ROI to ensure their big data project makes
financial sense. At this point, the ROI is likely to have some
assumptions and unknowns, but hopefully these will have
only a second order effect on the financials. If the return
is deemed too low, it may make sense to repeat the prior
steps and look for other problems and opportunities to
pursue. Moving forward, the following steps will add clarity
to what can be accomplished and raise the confidence level
of the ROI.

The Intel team performed a number of data clean tasks:


Populate missing data elements
Scrub data for discrepancies
Make data categories uniform
Reorganize data fields
Normalize UPC data

Algorithm Definition Process


After identifying the retail data sources, a set of algorithms
can be selected for data exploration. This is a trialand-error process because it is very difficult to assess
the effectiveness of an algorithm before applying it to a
particular data set. Algorithms are like atomic pieces, and
figuring out which ones are needed and work together is like
assembling a puzzle.
Table 2 lists commonly used algorithms and the types of
business problems they help to solve.

Analysis
Technique
Classification

Algorithms

Business Problems

Decision trees, neural network, and Nave Bayesian


Networks.

Churn Analysis: Learn which customers are most likely to


switch to a competitor.

A Bayesian network is a probabilistic graphical


model that represents a set of variables and their
conditional independencies.

Risk Management: Decide whether a loan be approved for


a particular customer.

Business Solutions

Similar to classification, logistic and linear


regression techniques look for patterns that
determine a numerical value.

Predict coupon redemption rate based on face value,


distribution method, distribution volume, and season.

Clustering/
Segmentation

These algorithms assign of a set of observations


into subsets (clusters), which are similar according
to pre-designated criteria.

Maximize profit by creating a profile of perfect buyer


behaviors.

They allow for different assumptions on the


data structure and evaluations based on internal
compactness within a cluster and/or separation
between different clusters.
Association Rules

Forecasting/
Prediction

Sequence Analysis

Customer Segmentation: Determine behavioral and


descriptive profiles of customers.

Association rule learning is a method for


discovering interesting relations between variables
in large databases.

Identify product adjacency for cross-selling purposes.

The method helps identify items that frequently


appear together and determine rules about the
associations.

Perform market basket analysis.

This approach estimates the value of a variable of


interest at some specified future date.

This approach estimates the value of a variable of interest


at some specified future date.

It uses formal statistical methods, employing time


series, cross-sectional or longitudinal data, and
techniques that deal with seasonality, trending,
and noisiness of the data.

It uses formal statistical methods, employing time series,


cross-sectional or longitudinal data, and techniques that
deal with seasonality, trending, and noisiness of the data.

This method finds patterns in a series of events.

Model customer purchases as a sequence, (e.g., a


customer first buys a computer, and then buys speakers,
and finally buys a webcam).

Both sequence and time-series data are similar;


they contain adjacent observations that are orderdependent. The difference is that where a time
series contains numerical data, a sequence series
contains discrete states.
Anomaly Detection

Directed Advertising: Personalize advertisements (instore, online) to better attract customer attention.

These are distance based techniques: k-nearest


neighbor, cluster analysis-based outlier detection,
pointing at records that deviate from learned
association rules.

Determine frequently shopped products.

Detect credit card fraud and network intrusion.


Conduct manufacturing error analysis.

They detect patterns (i.e., anomalies) in a given


data set that do not conform to an established
normal behavior and can be used to validate data
entry and identify outliers from expected norm.
Computer Vision

This type of algorithm translates images/highdimensional data from the real world to provide
numerical or symbolic information for decisionmaking.

Detect events (surveillance), model objects interaction,


reconstruct scenes, recognize objects, etc.

Use models are constructed with the aid of


geometry, physics, statistics, and learning theory
for vision perception.
Automated image analysis is performed on video
sequences to multiple camera views with many
applications.
Table 2. Algorithm Examples

Monitor social
influencers for
health food related
endorsements &
recommendations

Select natural
foods influencer

Monitor
endorsements &
recommendations
for viral
conditions

Monitor Living
Naturally*
products to
message

Review sales prior,


during, and after
message becomes
viral

Figure 5. Data Exploration Approach for Social Media Monitoring

Social Media Analysis

Big Data Analytics Solution Implementation

The goal of this data exploration was to determine if social


media could be used to identify changes to customer
demand for products. The initial premise was the social
media analysis could identify emerging social topics and
viral events, and correlate them with products and sales.
This capability would enable retailers to detect micro trends
and events early enough to make informed ordering, pricing,
and product promotion decisions. The Intel team focused on
known and emerging influencers in the social media world.

The Intel team implemented an analytics solution


architecture based on a Cloudera Distribution of the Apache
Hadoop platform, which could handle the large volume,
velocity, and variety of data that has to be processed. Supply
chain and sales data were analyzed along with social media
data relevant to the natural foods domain; however, the
framework was designed to scale for additional data like
weather, sensor data from the Internet of Things (IoT), or
other sources that are of value to retailers.

Figure 5 shows the data exploration approach, which


basically looked for correlations between social media
events and changes in the sales of related products. The
input data included Google* Trends, tweets from a popular
doctor with a TV show, product attributes, and business-toconsumer (B2C) point-of-sale data.

Figure 7 shows a conceptual solution stack of the big data


analytics solution, containing layers of features required to
analyze large volumes of data and gain actionable insights
from the chosen algorithms. The left side of the figure shows
the data inputs, and the right side lists elements of the
framework spanning the computing infrastructure through
to the retailer buyer UI.

An example of the output is presented in Figure 6. The blue


line shows the trend for social media activity related to
raspberry ketone, a potential weight-reducing compound
found in red raspberries. The trend line peaks twice (weeks
two and four), which is followed by an increase in sales about
a week later, as shown by the green line.

120
100
80
Trend 60
40
20
0
1

400
300
200 Sales,
100 Trend
0 Rate
-100
-120
9

Big Data Retail Analytics Framework


User Interface and Visualization
Twitter* Data
Stream Filtered by
Keywords and
Usernames

Trend Rate

Query, Machine Learning, and


Statistical Algorithms
Analytics
Parallel Computing, Text Processing,
Natural Language Processing,
and Sentiment Analysis

Month
Trend

Visualization

Compute

Sales

d(Trend)/d(Months)

Figure 6. Social Media Impacts on the Sales of Raspberry Ketone

Supply Chain &


Transactional Data

Big Data Distributed File System,


Data Connectors, and Databases
Storage
Server, Storage, and Network
Hardware

Figure 7. Big Data Analytics Conceptual Solution Stack

Twitter* Data

Twitter

Big Data Retail Analytics Framework


Built on Cloudera* Distribution of Apache* Hadoop* (CDH5)

Twitter Firehouse API


Connector (Java
Application)
Mongo Database
Staging Data Store

(Twitter live stream in


JSON format filtered
by keywords, user
names)

Mongo
Export

Analytics
Visualization
in D3.js

Visualization

Store Manager
User Interface

Analytics

Hive
Pig

Mahout

R-Hadoop

Compute

Map Reduce
Programs

NLP Libraries
Tagging &
Tokenization

Sentiment
Scoring

HDFS
(Transaction & Raw
Twitter Data)

HBase
(Processed Data)

Sqoop
(Data Integration)

Living Naturally* Retail Data


in SQL Server

Retailer Buyer Orders


Web Transactions
POS Transactions
Inventory Data

Cleaned Up and
Aggregated Data

Campaigns &
Promotions Data

Storage
Sqoop

Product Catalog
Store & Retailer
Details

Hardware

4 Node Cluster. 2 x Intel Xeon Processor @ 2.7 GHz


24 TB Disk Space, 126 GB Memory per Node,10 Gigabit Ethernet

Loyalty Management
System

Figure 8. Solution Components

Capabilities and Software Components

Social Media Analysis

Various solution components and associated technologies


were used, which are highlighted in Figure 8. At a high
level, Twitter data is streamed live using the fire hose API
exposed by Twitter, and supply-chain data is normalized,
processed, prepared, and stored in relational databases.
These databases are loaded in Hadoop Distributed File
System (HDFS) and HBase, which is a NoSQL database. The
data is processed using MapReduce programs, machine
learning algorithms, and statistical tools, and the output
is recommendations for retail buyers as well as actionable
interactive visualizations.

To enable the social media analysis capabilities, a base


framework was created to access and store live Twitter
streams, depicted by the process flow in Figure 9. The
known social media influencers were identified as previously
discussed. Public chat trends were tracked by extracting
content from tweets, messages were tokenized and
tagged using natural language processing techniques, and
sentiment scores calculated using open source libraries. The
social media analysis was tied to product sales trends to
make recommendations.

Facebook*

Twitter*

Extract
Features

Other
Social
Media

Tokenize &
Tag the Text

Public
chat

Identify known and emerging influencers


Track public chat trend and sentiment on
the relevant topics in influencer tweets

Connect Connect with sales trend of the products


with Sales

Filter by
User
Number of
Followers

Influencer

Keywords

Sentiment
Score

Topics

Recommen Combine the data and make recommendations


-dations
Visualize

Interactive visualization for analysis as well as


integration with store manager UI

Figure 9. Social Media Analysis Process Flow

Product Pipeline Tracking


pipeline that provides stakeholders with information
transparency for data types such as customer demand,
available capacity, and inventory levels. Efficiency should
increase with better information flow, assuming incentive
structures and the necessary technologies are also in place.

A retailers profitability is closely tied to supply chain


management, a critical function that helps ensure the right
amount of inventory is on hand. Without proper product
management, even small variations in customer demand
can lead to major inventory imbalances as business partners
in the supply chain try to make adjustments with imperfect
information. Inventory can swing wildly, creating an unstable
situation known as the bullwhip effect, which results in
stock-outs or excess inventory.

The Intel team developed a glass pipeline analytics


capability by first evaluating the availability and integrity of
product-oriented data structures shown in Figure 10. The
data was bucketed into four categories: unavailable, limited
and suspect, available but dirty, and mostly clean data.
After merging the various product-oriented data structures,
the team was able to generate useful reports such as the
inventory-turns graph in Figure 11. This was the first step
in creating product-based actionable insights to ensure the
right inventory is in the right place at the right time. The
glass pipeline also allows the retail buyer to select individual
product volumes, pricing, and promotions while ensuring
maximum store returns.

It is possible to avoid these inventory issues by providing


effective communication and coordination of the supply
chain after connecting the dots between orders, inventory,
and sales. For example, retailers can improve information
sharing by providing sales insight to the supply chain
so everyones demand predictions are based on the
same information, thereby reducing the bullwhip effect.
Essentially, the supply chain should be viewed as a glass

Inventory

PO & Forecast

Ship and Inv.


Raw
Material

Quantity

PO/Gross
Retailer
Ship and Inv.

Inventory

Inventory

Twitter*

Manufacturer

Ship & Inv.

Raw
Material

PO/Gross

PO & Forecast

Customer

10

LG Promo
POS & Web
Customers

Price
Inventory

Not data available


Limited data / highly suspect
Moderate / larger amount of data requiring scrubbing
Very strong data requiring limited to no scrubbing
Figure 10. Glass Pipeline Data Structures

Environment

Ship and Inv.

Web Search

Demand
Signals

Disty
Raw
Material

MFG Coupon
Price Drop

Shelf
Non-loyalty
Transactions

Demand
Generation

Monthly Dollars
Spent/Sold

Monthly
Inventory Turns %

$120,000
$100,000
$80,000
$60,000
$40.000
$20,000
$0

60%
50%
40%
30%
20%
10%
0%

3/31/11
7/31/11
11/30/11
3/31/12
7/31/12
1/30/11
5/31/11
9/30/11
1/31/12
5/31/12
2/28/11
4/30/11
6/30/11
8/31/11
10/31/11 12/31/11 2/29/12
4/30/12
6/30/12
8/31/12

Monthly Total $ Spent

Monthly Total $ Sold

Monthly Inventory Turn

Figure 11. Inventory Turns

The details with respect to this specific analytics solution can


be found in a companion solution blueprint. This solution
blueprint provides a deep dive discussion on the business
and technical aspects of social media analytics.

Market Basket Analysis Implementation


Several techniques were applied in order to understand
consumer buying patterns and determine product mixes
that best suited consumers and maximized overall return.
The basis of the analytics begins with association rule
learning; and in particular, the machine learning technique
was applied in on-line buying and extended to address
the likelihood of shopping baskets of much larger size and
diversity.
Market Basket Analysis
To determine product associations, the Intel team used
association rules to mine data and discover relationships
between products in large-scale transaction data
recorded by point-of-sale systems. Brick and mortar
retailers, especially grocery, have much more complicated
associations between products than on-line retailers due
to the greater average number of items in a consumers
market basket. There are also other types of associations
and relationships that may occur in the store (based on

To test {milk, bread} => {butter}:


Support({milk, bread, butter}) = 1/5 = 0.2
Confidence({milk, bread} => {butter}) =
Support({milk, bread, butter}) / Support({milk, bread})
= 0.2 / 0.4 = 0.5
Lift({milk, bread} => {butter}) =
Support({milk, bread, butter}) /
(Support({milk, bread}) * Support({butter}))
= 0.2 / (0.4 * 0.4) = 1.25

co-location, recipe ingredients, etc.). This called for a fresh


approach to providing prioritized market basket association
signals to the buyer.
Market Basket Association Rules analysis employs a welldefined algorithmic method. The following provides a
specific example of simple grocery association between
milk, bread, and butter to explain how the method works.
In this case, the following rule was tested to determine the
likelihood that customers who buy either milk or bread will
also buy butter:
{milk, bread} => {butter} [support, confidence]
Three measures were used to place constraints on the
significance of rules:
Support: The proportion of transactions in the data set
which contain items: milk or bread.
Confidence: The probability of finding butter under the
condition that these transactions also contain milk or bread.
Lift: A measure of the improvement in the occurrence of the
butter given the presence of milk or bread.
The conclusion is that the purchase of milk or bread is
accompanied by the purchase of butter 25 percent of the
time based on this data set.

Transaction ID

Milk

Bread

Butter

Beer

Source: http://en.wikipedia.org/wiki/Association_rules
Figure 12. Association Rules Learning Example

11

With association rule techniques as the basis for creating


the patterns of consumer buying and products with natural
affinities, the solution took into account the diversity and size
of shopping baskets and provided a multi-tiered analysis
approach. The idea is to identify emergent basket rules
and patterns from transaction datasets to detect product
affinities. These product affinities were then combined with
social media analysis, pricing strategy, promotions, and
customer product recommendations. Figure 13 shows a high
level process flow of the solution.
Basket Analysis Using Transposition and Encoding
For every product set, the solution merged the sets for
every possible pair of products and counted the results. The
merge need not be restricted to a pair, so multiple products
can be considered at a time. The result was then merged
with other products to find basket rules with the cost of
only counting the transactions. The solution calculated the
number of times the products occurred together, divided by
the total number of transactions. The solution processed the
following steps:

Transaction Dataset
Transaction ID

Vegetables

100012346

Canned Goods

100012347

Vegetables

100012348

Bread

100012349

Chips & Pretzels

100012350

Canned Goods

-------

-------

-------

-------

Merge the transaction lists together using AND (only


keep a transaction if in both lists).
If the number of transactions in common is greater than
the support value, put the new combination in the result
at a level equal to the number of keys in the product list,
and keep the list of transactions in common.
When a level is finished, and there are results, repeat the
process, but use the result from that level as the input.
The details with respect to this specific analytics solution
can be found in a companion solution blueprint, whereas
this solution blueprint focuses on the business and technical
aspects of social media analytics.

Association
Rules Mining

Market Basket Analysis

Product Category

100012345

For each product, run through the rest of the list.

VEGETABLES
FISH
CHEESE
MEAT
BREAD

CRACKERS

Support =

HERBS
MILK
SEAFOOD

Rule X

Confidence =

POULTRY

Lift =
98% pf people who purchased items
A and B also purchased item C

Figure 13. Market Basket Analysis High-Level Process Flow

12

(X U Y).count
n
(X U Y).count
X.count

Support
Supp(X). Supp(Y)

Association
Rules Library
Basket Rules
items
1 {DRY GROCERY\SHELF STABLE FOODS\CANNED GOODS,
FRESH\PRODUCE\VEGETABLES}
2 {DESERTS,
DRY GROCERY\SHELF STABLE FOODS\COOKIES CRACKERS AND CRISPBREADS}
3 {DRY GROCERY\SHELF STABLE FOODS\COOKIES CRACKERS AND CRISPBREADS,
HERBS BULK}
4 {DRY GROCERY\SHELF STABLE FOODS\CANNED GOODS,
DRY GROCERY\SNACKS\CHIPS AND PRETZELS}
5 {DRY GROCERY\SHELF STABLE FOODS\SOUP AND BROTH,
DRY GROCERY\SNACKS\CHIPS AND PRETZELS}
6 {DRY GROCERY\SNACKS\CHIPS AND PRETZELS,
FRESH BAKERY\BREAD AND BAKED GOODS\BREAD}
7 {DRY GROCERY\SHELF STABLE MEAL SOLUTIONS\MEAT POULTRY SEAFOOD,
FRESH BAKERY\BREAD AND BAKED GOODS\BREAD}
8 {DRY GROCERY\SHELF STABLE MEAL SOLUTIONS\MEAT POULTRY SEAFOOD,
DRY GROCERY\SNACKS\CHIPS AND PRETZELS}

support
0 .005387144
0 .003611720
0.007000243
0 .003363160
0.0024992898
0 .003525485
0.004159565
0 .003276926

Physical Infrastructure
The physical implementation architecture of the analytics
framework is shown in Figure 14, including the hardware
configuration, operating systems, and important software
components used in the solution.

Accelerate Big Data Implementations


Big data presents retailers with game-changing
opportunities to solve age-old business problems as well as
create a competitive advantage through unique insights. This
paper provided an overview of an actual big data analytics
implementation, thereby providing a primer for those just
getting started in this field.

Twitter* Data Streaming Server


2 x Intel Xeon processor @ 2.7 GHz,
24 TB disk space, and 126 GB Memory
OS: Microsoft* Server 2008 R2
Mongo* database

Working to make implementing big data easier, Intel has a


team of experts who create data analytics applications and
analytics toolkits for retailers so they can focus on solving
business problems instead of on system development.
Intel has been in the big data business for over 30 years,
managing state-of-the-art manufacturing sites and complex
supply chain networks. Applying this knowledge to in-store
analytics, Intel data analytics consulting services yield
actionable data that allows retailers and brands to respond
to customers desires in a more responsive and predictive
manner.

Hadoop* Cluster
2 x Intel Xeon processor @ 2.7 GHz, 24 TB disk space,
and 126 GB Memory per node
Cloudera* Distribution of Hadoop (CDH5)
Hadoop Node
Hadoop Node
Hadoop Node

Hadoop Node

10 Gigabit Ethernet

10 Gigabit Ethernet Switch


Transaction Data - Staging Server
2 x Intel Xeon processor @ 2.4 GHz,
7 TB disk space and 24 GB Memory
OS: Microsoft Server 2008 R2

Web Server
2 x Intel Xeon processor @ 2.4 GHz,
2 TB disk space, and 24 GB Memory
Hosting UI and Analytics Visualization

Figure 14. Physical Infrastructure

For more information about Intel solutions for the retail


industry, visit www.intel.com/retail.
Visit http://www.cloudera.com for information on Cloudera
Big Data solutions.

Gartner analyst Doug Laney introduced the 3Vs concept in a 2001 MetaGroup research publication, 3D data management: Controlling data volume, variety and velocity. http://blogs.gartner.com/
doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.

Ernst and Young, Ready for takeoff?, 2014, http://www.ey.com/Publication/vwLUAssets/EY_-_Ready_for_takeoff_executive_summary/%24FILE/EY-Ready-for-takeoff-Executive-summary.pdf.

McKinsey & Company, Consumer Marketing Analytics Center, Creating Competitive Advantage from Big Data in Retail, June 2012. http://www.mckinsey.com/client_service/retail/expertise/~/media/
mckinsey/dotcom/client_service/retail/articles/cmac_creating_competitive_advantage_from_big_data.

Andrew McAfee and Erik Brynjolfsson, Big Data: The Management Revolution, Harvard Business Review, October 2012.

Copyright 2014 Intel Corporation. All rights reserved. Intel, the Intel logo and Xeon are trademarks of Intel Corporation in the United States and/or other countries.
*Other names and brands may be claimed as the property of others.

Printed in USA

0614/MB/TM/PDF

Please Recycle

330716-001US

13

You might also like