3/21/2017 MarketbasketanalysisidentifyingproductsandcontentthatgowelltogetherSnowplow

Market basket analysis: identifying products and content that go well together
Affinity analysis (http://en.wikipedia.org/wiki/Affinity_analysis) and association rule learning
(http://en.wikipedia.org/wiki/Association_rule_learning) encompass a broad set of analytics techniques aimed at uncovering the associations
and connections between specific objects: these might be visitors to your website (customers or audience), products in your store, or content
items on your media site. Of these, market basket analysis is perhaps the most famous example. In a market basket analysis, you look to see if
there are combinations of products that frequently co-occur in transactions. For example, maybe people who buy flour and caster sugar also
tend to buy eggs (because a high proportion of them are planning on baking a cake). A retailer can use this information to inform:

- Store layout (put products that co-occur together close to one another, to improve the customer shopping experience)
- Marketing (e.g. target customers who buy flour with offers on eggs, to encourage them to spend more on their shopping basket)

Online retailers and publishers can use this type of analysis to:

- Inform the placement of content items on their media sites, or products in their catalogue
- Drive recommendation engines (like Amazon's "customers who bought this product also bought these products")
- Deliver targeted marketing (e.g. emailing customers who bought specific products with other products, and offers on those
  products, that are likely to be interesting to them)

There is a wide range of algorithms, available on a wide variety of platforms, for performing market basket analysis. In this introductory recipe,
we will cover:

1. Market basket analysis: the basics
2. Performing market basket analysis using the apriori algorithm, with R and the arules package
3. Managing large result sets: visualizing rules using the arulesViz package
4. Interpreting the results: using the analysis to drive business decision-making
5. Expanding on the analysis: zooming out from the basket to look at customer behavior over longer periods and different events

1. Market basket analysis: the basics
Terminology
Items are the objects that we are identifying associations between. For an online retailer, each item is a product in the shop. For a publisher,
each item might be an article, a blog post, a video etc. A group of items is an itemset.

$$I = \{i_1, i_2, \ldots, i_n\}$$

Transactions are instances of groups of items co-occurring together. For an online retailer, a transaction is generally just that: a purchase. For a
publisher, a transaction might be the group of articles read in a single visit to the website. (It is up to the analyst to define over what period to
measure a transaction.) For each transaction, then, we have an itemset.

$$t_n = \{i_i, i_j, \ldots, i_k\}$$

Rules are statements of the form

$$\{i_1, i_2, \ldots\} \Rightarrow \{i_k\}$$

I.e. if you have the items in the itemset on the left-hand side (LHS) of the rule, i.e. $\{i_1, i_2, \ldots\}$, then it is likely that a visitor will
also be interested in the item on the right-hand side (RHS), i.e. $\{i_k\}$. In our example above, our rule would be:

$$\{\text{flour}, \text{sugar}\} \Rightarrow \{\text{eggs}\}$$

The output of a market basket analysis is generally a set of rules that we can then exploit to make business decisions (related to marketing or
product placement, for example).
http://snowplowanalytics.com/guides/recipes/cataloganalytics/marketbasketanalysisidentifyingproductsthatsellwelltogether.html 1/9

The support of an item or itemset is the fraction of transactions in our data set that contain that item or itemset. In general, it is nice to identify
rules that have a high support, as these will be applicable to a large number of transactions. For supermarket retailers, this is likely to involve
basic products that are popular across an entire user base (e.g. bread, milk). A printer cartridge retailer, for example, may not have products with
a high support, because each customer only buys cartridges that are specific to his/her own printer.

The confidence of a rule is the likelihood that it is true for a new transaction that contains the items on the LHS of the rule. (I.e. it is the
probability that the transaction also contains the item(s) on the RHS.) Formally:

$$\text{confidence}(i_m \Rightarrow i_n) = \frac{\text{support}(i_m \cup i_n)}{\text{support}(i_m)}$$

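The lift of a rule is the ratio of the support of the items on the LHS of the rule co-occurring with the items on the RHS, divided by the probability
that the LHS and RHS would co-occur if the two were independent.

$$\text{lift}(i_m \Rightarrow i_n) = \frac{\text{support}(i_m \cup i_n)}{\text{support}(i_m) \times \text{support}(i_n)}$$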

If the lift is greater than 1, it suggests that the presence of the items on the LHS has increased the probability that the items on the right-hand side
will occur in this transaction. If the lift is below 1, it suggests that the presence of the items on the LHS makes it less likely that the items on
the RHS will be part of the transaction. If the lift is 1, it suggests that the items on the LHS and RHS really are independent:
knowing that the items on the LHS are present makes no difference to the probability that the items on the RHS will occur.
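
To make the three measures concrete, here is a minimal sketch in base R that computes them for the flour, sugar and eggs rule. The eight transactions and the helper function `support` are invented for illustration; they are not taken from any real data set.

```r
# Toy transactions (hypothetical data, for illustration only)
transactions <- list(
  c("flour", "sugar", "eggs"),
  c("flour", "sugar", "eggs", "milk"),
  c("flour", "sugar"),
  c("eggs", "milk"),
  c("bread", "milk"),
  c("bread"),
  c("milk"),
  c("flour", "sugar", "eggs", "bread")
)

# Fraction of transactions that contain every item in `items`
support <- function(items, txns) {
  mean(vapply(txns, function(txn) all(items %in% txn), logical(1)))
}

lhs <- c("flour", "sugar")
rhs <- "eggs"

# Support of the whole rule: LHS and RHS appearing together
supp_rule <- support(c(lhs, rhs), transactions)

# Confidence: share of LHS transactions that also contain the RHS
confidence <- supp_rule / support(lhs, transactions)

# Lift: observed co-occurrence relative to what independence would predict
lift <- supp_rule / (support(lhs, transactions) * support(rhs, transactions))
```

With this toy data the rule has a support of 0.375, a confidence of 0.75 and a lift of 1.5, so buying flour and sugar makes eggs one and a half times as likely as they are across all baskets.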

When we perform market basket analysis, then, we are looking for rules with a lift of more than one. Rules with higher confidence are ones
where the probability of an item appearing on the RHS is high given the presence of the items on the LHS. It is also preferable (higher value) to
action rules that have a high support, as these will be applicable to a larger number of transactions. However, in the case of long-tail retailers,
this may not be possible.

2. Performing market basket analysis using the apriori algorithm, with R and the arules package
Just to recap: the purpose of this analysis is to generate a set of rules that link two or more products together. Each of these rules should have a
lift greater than one. In addition, we are interested in the support and confidence of those rules: higher-confidence rules are ones where there is
a higher probability of items on the RHS being part of the transaction given the presence of items on the LHS. We'd expect recommendations
based on these rules to drive a higher response rate, for example. We're also better off actioning rules with higher support first, as these will be
applicable to a wider range of instances.

In this example, we're going to perform the analysis for an online retailer running Snowplow. We're going to do the classic market basket
analysis: by that I mean we are going to look for rules based on actual transactions. (Later on in this recipe, we'll consider the pros and cons of
defining the scope of our basket differently.)

We're going to use R (http://www.r-project.org/) to perform the market basket analysis. R is a great statistical and graphical analysis tool, well
suited to more advanced analysis. We're going to use the arules package (http://cran.r-project.org/web/packages/arules/index.html), which
implements the Apriori (http://en.wikipedia.org/wiki/Apriori_algorithm) algorithm, one of the most commonly used algorithms for identifying
associations between items.

To start with, we need to fetch transaction data from Snowplow which identifies groups of items by transaction. The following SQL query fetches
these directly: it returns a line of data for every line item of each transaction, with the transaction id and the item name:

/* PostgreSQL / Redshift */
SELECT
  "ti_orderid" AS "transaction_id",
  "ti_name" AS "sku"
FROM
  "events"
WHERE
  "event" = 'transaction_item'

We can pull this data directly into R. (For assistance setting up R to use with Snowplow, see the setup guide
(https://github.com/snowplow/snowplow/wiki/Setting-up-R-to-perform-more-sophisticated-analysis-on-your-Snowplow-data).) First, we load
up R, and connect R to our Snowplow table in Redshift by entering the following at the R prompt:

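library("RPostgreSQL")
# Load the PostgreSQL driver that dbConnect() expects as its first argument
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, host="<<REDSHIFTENDPOINT>>", port="<<PORTNUMBER>>", dbname="<<DBNAME>>", user="<<USERNAME>>", password="<<PASSWORD>>")

(Be sure to substitute appropriate values for <<REDSHIFTENDPOINT>>, <<PORTNUMBER>>, <<DBNAME>>, <<USERNAME>> and <<PASSWORD>>.)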

Then we execute our SQL query above, fetching the data as a data frame in R:

t <- dbGetQuery(con, "
SELECT
  \"ti_orderid\" AS \"transaction_id\",
  \"ti_name\" AS \"sku\"
FROM
  \"events\"
WHERE
  \"event\" = 'transaction_item'
")

We can take a peek at the first few records in our data frame by executing

head(t)

Note how each line of data represents a single line item, so that the first transaction (which includes two items) spans two lines.

Now we need to group lines by transaction id, so that the individual products that belong to each transaction are aggregated across records
into a single record as an array of products. This is done by executing the following at the R prompt:

i <- split(t$sku, t$transaction_id)

Again, we can peek at our data by executing head(i) at the prompt.
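
To show what the grouping step does, here is a small sketch in base R on made-up line items (the data frame `line_items` and its values are hypothetical). It also removes repeated SKUs within a basket: our understanding is that the later coercion to a transactions object refuses baskets containing duplicated items, so deduplicating first is a cheap safeguard.

```r
# Hypothetical line-item data in the same shape as the query result
line_items <- data.frame(
  transaction_id = c("1001", "1001", "1002", "1003", "1003", "1003"),
  sku            = c("flour", "eggs", "milk", "flour", "sugar", "flour"),
  stringsAsFactors = FALSE
)

# One list element per transaction id, holding that transaction's SKUs
baskets <- split(line_items$sku, line_items$transaction_id)

# Drop SKUs that appear more than once within the same basket
baskets <- lapply(baskets, unique)
```

Here transaction 1003 contained flour twice, and ends up as the basket flour, sugar.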

Now we convert the data into a transactions object optimized for running the arules algorithm:

library("arules")
txn <- as(i, "transactions")

Finally, we can run our algorithm:

basket_rules <- apriori(txn, parameter = list(sup = 0.005, conf = 0.01, target = "rules"))

When running the algorithm, we set minimum support and confidence thresholds, below which R ignores any rules. These are used to optimize the
running of the algorithm: figuring out association rules can be computationally expensive, because for a company with a large catalogue of
items, the number of combinations of items is enormous (it increases exponentially with the number of items). Hence, anything we give the
algorithm to minimize the computational burden is welcome.
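
A quick back-of-the-envelope calculation in base R shows how fast the search space grows. The catalogue sizes below are arbitrary illustrative figures.

```r
# Number of distinct item pairs in a hypothetical catalogue of 10,000 items
n_items <- 10000
n_pairs <- choose(n_items, 2)

# Number of non-empty itemsets that can be formed from just 20 items
n_itemsets_20 <- 2^20 - 1
```

Pairs alone come to roughly 50 million for 10,000 items, and a catalogue of only 20 items already allows over a million possible itemsets. A minimum support threshold lets the algorithm discard most of these candidates early.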

In our case, we've given low figures for support and confidence. This is because our test example is based on a long-tail retailer, who offers more
than 10k SKUs, against which c. 90k purchases have been made. The maximum support any one of the products has is very low: this can be
confirmed by plotting the relative frequency of each of the top 25 items (i.e. the fraction of transactions that each item appears in).
This can be done by running:

itemFrequencyPlot(txn, topN = 25)

In our case, this produced the following plot:

Note how the most frequent item appears in less than 2% of transactions recorded.

In your case the distribution of items by transaction might look very different, and so very different support and confidence parameters may be
applicable. To determine what works best, you need to experiment with different parameters: you'll see that as you reduce them, the number of
rules generated will increase, which will give you more to work with. However, you'll need to sift through the rules more carefully to identify
those that will be more impactful for your business. We return to this theme in the next section.

Lastly, let's inspect the actual rules generated by the algorithm:

inspect(basket_rules)

In our case, the algorithm has identified 9 rules. The first 7 are not helpful: there are no items on the LHS. (For these seven rules, note how,
because there are no items on the LHS, the support equals the confidence and the lift is 1.)
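
The reason follows directly from the definitions above. Every transaction contains the empty itemset, so its support is 1, and substituting that into the confidence and lift formulas gives:

```latex
\text{confidence}(\emptyset \Rightarrow i_n) = \frac{\text{support}(i_n)}{\text{support}(\emptyset)} = \text{support}(i_n),
\qquad
\text{lift}(\emptyset \Rightarrow i_n) = \frac{\text{support}(i_n)}{1 \times \text{support}(i_n)} = 1
```

As far as we are aware, rules like these can be excluded up front by adding minlen = 2 to the parameter list passed to apriori, which requires at least two items per rule; treat that as something to verify against the arules documentation.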

The last two rules are interesting though: they suggest that people who buy the Memo Block Apple are more likely to buy the Memo Block
Pear, and vice versa. Not just that, but they are much more likely to do so: the confidence is 66, suggesting they are very strongly associated.

3. Managing large result sets: visualizing rules using the arulesViz package
In the previous example we set the parameters for support and confidence so that only a small set of rules was returned. As mentioned,
however, it is often better to return a larger set, to increase the chances that we generate more relevant rules for our business.

Let's rerun the algorithm, but this time reduce our parameters for support and confidence, and save the result set into a different object:

basket_rules_broad <- apriori(txn, parameter = list(sup = 0.001, conf = 0.001, target = "rules"))

In our case, 3.2M rules were returned. This is far too many to inspect visually; however, we can sort the rules by lift and look at the top 20.
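
One way to pull out the top rules by lift is sketched below. It is self-contained, so it mines the Groceries data set that ships with arules as a stand-in for our own transactions, with arbitrary thresholds; with real data you would pass basket_rules_broad to sort() instead.

```r
library("arules")

# Stand-in data: the Groceries transactions bundled with arules
data("Groceries")
rules <- apriori(
  Groceries,
  parameter = list(supp = 0.01, conf = 0.1, target = "rules"),
  control = list(verbose = FALSE)
)

# Sort by lift, highest first, and keep at most the first 20 rules
top_rules <- head(sort(rules, by = "lift"), n = 20)

# Print the selected rules with their support, confidence and lift
inspect(top_rules)
```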

We can plot our rules by confidence, support and lift, using the arulesViz package:

library("arulesViz")
plot(basket_rules_broad)

Our plot looks as follows:

The plot shows that rules with high lift typically have low support. (This is not surprising, given the maths.) We can use a plot like the one above
to identify rules with both high support and confidence: the arulesViz package lets us plot the graphs in an interactive mode, so that we can
click on individual points and explore the associated data. For more details, see [the full package
instructions](http://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf).
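
The maths behind the trade-off between lift and support can be spelled out in a couple of lines. An itemset can never be more frequent than any of its subsets, so the support of the full rule is at most the smaller of the two individual supports. Plugging that into the definition of lift gives an upper bound:

```latex
\text{lift}(i_m \Rightarrow i_n)
  = \frac{\text{support}(i_m \cup i_n)}{\text{support}(i_m) \times \text{support}(i_n)}
  \le \frac{\min\{\text{support}(i_m), \text{support}(i_n)\}}{\text{support}(i_m) \times \text{support}(i_n)}
  = \frac{1}{\max\{\text{support}(i_m), \text{support}(i_n)\}}
  \le \frac{1}{\text{support}(i_m \cup i_n)}
```

So a rule with a support of 0.5 can have a lift of at most 2, while a rule with a support of 0.001 can in principle reach a lift of 1000.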

How many rules we generate, and how we prioritise which rules we action, depend on which business questions we plan to answer with our
analysis. This is discussed further in the next section.

4. Using the analysis to drive business decision-making
Before we use the data to make any kind of business decision, it is important that we take a step back and remember something important:

The output of the analysis reflects how frequently items co-occur in transactions. This is a function both of the strength of association
between the items, and the way the site owner has presented them.

To say that in a different way: items might co-occur not because they are naturally connected, but because we, the people in charge of the site,
have presented them together.

This is an example of a more general problem in web analytics: our data reflects the way users behave, and the way we have encouraged them to
behave, by the website design decisions we have made. We need to be conscious of this because if, as suggested earlier in the recipe, we use the
results to inform where items are placed relative to one another, we need to control for how close they are situated on the website today, so that
we don't end up confirming what we have assumed. So, for example, if items k and l show a strong association, and are already presented next to
one another on our site, that is not that interesting. If they are far apart on our site, that is interesting: maybe we should put them closer
together. If those items are close together, but the analysis shows there is not a strong association, we should probably separate them: our
previous assumption that they should be placed together may have been wrong.

Using the data to drive website organization
There are a number of ways we can use the data to drive site organisation:

1. Large clusters of co-occurring items should probably be placed in their own category / theme
2. Item pairs that commonly co-occur should be placed close together within broader categories on the website. This is especially important
   where one item in a pair is very popular, and the other item is very high margin.
3. Long lists of rules (including ones with low support and confidence) can be used to put recommendations at the bottom of product pages
   and on product cart pages. The only thing that matters for these rules is that the lift is greater than one. (And that, for each product, we
   pick those applicable rules that have a high lift and where the product recommended has a high margin.)
4. In the event that doing the above (3) drives a significant uplift in profit, it would strengthen the case to invest in a recommendation system
   that uses a similar algorithm in an operational context to power an automatic recommendation engine on your website.

Using the data for targeted marketing
The same results can be used to drive targeted marketing campaigns. For each user, we pick a handful of products, based on products they have
bought to date, which have both a high lift and a high margin, and send them, for example, a personalized email or display ads.
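
The selection step could be sketched in base R as follows. Everything here is hypothetical: the rules table, the margins, the function name `recommend`, and in particular the scoring choice of multiplying lift by margin, which is just one simple way to balance the two.

```r
# Hypothetical rules table: a single-item LHS and RHS per rule, with its lift
rules <- data.frame(
  lhs  = c("flour", "flour", "milk", "sugar", "bread"),
  rhs  = c("eggs", "sugar", "bread", "eggs", "jam"),
  lift = c(2.5, 4.0, 0.8, 3.0, 1.6),
  stringsAsFactors = FALSE
)

# Hypothetical margin per product, in currency units
margin <- c(eggs = 0.40, sugar = 0.10, bread = 0.30,
            jam = 1.20, flour = 0.20, milk = 0.15)

recommend <- function(purchased, rules, margin, n = 3) {
  # Keep rules triggered by a past purchase, with lift above 1,
  # that recommend something the user has not bought yet
  keep <- rules$lhs %in% purchased &
    rules$lift > 1 &
    !(rules$rhs %in% purchased)
  candidates <- rules[keep, ]

  # Assumed scoring: lift weighted by the margin of the recommended product
  candidates$score <- candidates$lift * margin[candidates$rhs]

  # Best score first, then keep one row per recommended product
  candidates <- candidates[order(-candidates$score), ]
  candidates <- candidates[!duplicated(candidates$rhs), ]
  head(candidates$rhs, n)
}

# Products to promote to a user who has bought flour and bread
picks <- recommend(c("flour", "bread"), rules, margin)
```

For this toy input the picks are jam, eggs and sugar, in that order: jam has a modest lift but by far the highest margin.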

How we use the analysis has significant implications for the analysis itself: if we are feeding the analysis into a machine-driven process for
delivering recommendations, we are much more interested in generating an expansive set of rules. If, however, we are experimenting with
targeted marketing for the first time, it makes much more sense to pick a handful of particularly high-value rules, and action just them, before
working out whether to invest in the effort of building out that capability to manage a much wider and more complicated rule set.

5. Expanding on the analysis: zooming out from the basket to look at customer behavior over longer periods and different events
In the above example, we used actual transaction events to identify associations between products for an online retailer.

Sticking with our retail example, however, we could have expanded the scope of our definition of transactions. Instead of just looking at the
basket for successful transactions, we could have looked at users' complete baskets (whether or not they went on to buy). The analysis steps
would have been almost exactly the same; however, instead of pulling transaction data out of Snowplow, we'd have pulled add-to-basket data
out, using a query like the following:

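/* PostgreSQL / Redshift */
SELECT
  /* || concatenates strings in both databases; the '-' separator keeps user id and session index from running together */
  "domain_userid" || '-' || CAST("domain_sessionidx" AS VARCHAR) AS "transaction_id",
  "ev_property" AS "sku"
FROM
  "events"
WHERE
  "ev_action" = 'addtobasket'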

We could increase the scope further, so instead of looking at add-to-basket events, we look at every product that each visitor has viewed, and
associate groups of products that individual users have looked at within a single session:

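/* PostgreSQL / Redshift */
SELECT
  /* || concatenates strings in both databases; the '-' separator keeps user id and session index from running together */
  "domain_userid" || '-' || CAST("domain_sessionidx" AS VARCHAR) AS "transaction_id",
  "page_urlpath"
FROM
  "events"
WHERE
  "event" = 'page_view'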

Note how this time each product is identified by URL rather than by SKU. It may be appropriate to filter out URLs that do not correspond with
product pages.
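
Such a filter can be applied in R once the data has been fetched. The sketch below uses base R on made-up page views; the "/products/" prefix is an assumed URL scheme, so substitute whatever pattern identifies product pages on your own site.

```r
# Hypothetical page-view rows; the "/products/" prefix is an assumed URL scheme
page_views <- data.frame(
  transaction_id = c("u1-1", "u1-1", "u1-1", "u2-1", "u2-1"),
  page_urlpath   = c("/", "/products/memo-block-apple", "/checkout",
                     "/products/memo-block-pear", "/about"),
  stringsAsFactors = FALSE
)

# Keep only rows whose path starts with the product prefix
is_product <- grepl("^/products/", page_views$page_urlpath)
product_views <- page_views[is_product, ]

# Group the remaining URLs by transaction id, as with the SKU data
baskets <- split(product_views$page_urlpath, product_views$transaction_id)
```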

Finally, we could expand our window further, so instead of confining ourselves to a single session, we look at the same user over multiple
sessions, i.e.:

/* PostgreSQL / Redshift */
SELECT
  "domain_userid" AS "transaction_id",
  "page_urlpath"
FROM
  "events"
WHERE
  "event" = 'page_view'

Note how this is almost exactly the same query as when our scope was per-session: we've just removed the domain_sessionidx, so that when we
aggregate by transaction_id, we aggregate by user over their entire lifetime, rather than over each session.

These final, wider-scope examples are likely to be more appropriate for publishers and media site owners, who want to identify associations
between articles, writers / authors / producers and categories of content, rather than products in a shop.

Copyright 2012-2017 Snowplow Analytics, Ltd.
