
Data Mining and Soft Computing
Francisco Herrera

Research Group on Soft Computing and Intelligent Information Systems (SCI2S)
Dept. of Computer Science and A.I., University of Granada, Spain
Email: herrera@decsai.ugr.es
http://sci2s.ugr.es http://decsai.ugr.es/~herrera

Data Mining and Soft Computing

Summary
1. Introduction to Data Mining and Knowledge Discovery
2. Data Preparation
3. Introduction to Prediction, Classification, Clustering and Association
4. Data Mining - From the Top 10 Algorithms to the New Challenges
5. Introduction to Soft Computing. Focusing our attention on Fuzzy Logic and Evolutionary Computation
6. Soft Computing Techniques in Data Mining: Fuzzy Data Mining and Knowledge Extraction based on Evolutionary Learning
7. Genetic Fuzzy Systems: State of the Art and New Trends
8. Some Advanced Topics I: Classification with Imbalanced Data Sets
9. Some Advanced Topics II: Subgroup Discovery
10. Some Advanced Topics III: Data Complexity
11. Final talk: How must I Do my Experimental Study? Design of Experiments in Data Mining/Computational Intelligence. Using Non-parametric Tests. Some Cases of Study.

Slides used for preparing this talk:

Top 10 Algorithms in Data Mining Research, prepared for ICDM 2006

10 Challenging Problems in Data Mining Research, prepared for ICDM 2005

CS490D: Introduction to Data Mining, Prof. Chris Clifton

Association Analysis: Basic Concepts and Algorithms, Lecture Notes for Chapter 6, Introduction to Data Mining, by Tan, Steinbach, Kumar

DATA MINING: Introductory and Advanced Topics, Margaret H. Dunham
3

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


4

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


5

From the Top 10 Algorithms to the New Challenges in Data Mining

Discussion Panels at ICDM 2005 and 2006

Top 10 Algorithms in Data Mining Research


prepared for ICDM 2006

10 Challenging Problems in Data Mining Research


prepared for ICDM 2005

Top 10 Algorithms in Data Mining Research

prepared for ICDM 2006
http://www.cs.uvm.edu/~icdm/algorithms/index.shtml

Coordinators:
Xindong Wu, University of Vermont, http://www.cs.uvm.edu/~xwu/home.html
Vipin Kumar, University of Minnesota, http://www-users.cs.umn.edu/~kumar/
8

ICDM '06 Panel on Top 10 Algorithms in Data Mining

ICDM '06 Panel on Top 10 Algorithms in Data Mining

10

ICDM '06 Panel on Top 10 Algorithms in Data Mining

11

ICDM '06 Panel on Top 10 Algorithms in Data Mining

12

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


13

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Classification

14

Classification Using Decision Trees

- Partitioning based: divide the search space into rectangular regions.
- A tuple is placed into a class based on the region within which it falls.
- DT approaches differ in how the tree is built: DT Induction.
- Internal nodes are associated with an attribute, and arcs with values for that attribute.
- Algorithms: ID3, C4.5, CART
15

Decision Tree

Given: D = {t1, ..., tn} where ti = <ti1, ..., tih>; the database schema contains {A1, A2, ..., Ah}; classes C = {C1, ..., Cm}.
A Decision (or Classification) Tree is a tree associated with D such that:
- Each internal node is labeled with an attribute, Ai
- Each arc is labeled with a predicate which can be applied to the attribute at its parent
- Each leaf node is labeled with a class, Cj
16

Training Dataset

This follows an example from Quinlan's ID3.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
17

Output: A Decision Tree for buys_computer

(Figure: the induced tree. Root: age? If age <=30, test student? (no -> no, yes -> yes). If age 30..40, predict yes. If age >40, test credit_rating? (excellent -> no, fair -> yes).)
18

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
- There are no samples left
19

DT Induction

20

DT Splits Area

(Figure: example split of the search space on Gender (M/F) and Height.)

21

Comparing DTs

(Figure: two example trees, one balanced and one deep.)
22

DT Issues

- Choosing splitting attributes
- Ordering of splitting attributes
- Splits
- Tree structure
- Stopping criteria
- Training data
- Pruning
23

Decision Tree Induction is often based on Information Theory

So...

24

Information

25

Information/Entropy

Given probabilities p1, p2, ..., ps whose sum is 1, Entropy is defined as:

  H(p1, ..., ps) = - sum_{i=1..s} p_i log2(p_i)

Entropy measures the amount of randomness or surprise or uncertainty.

Goal in classification: no surprise, entropy = 0
26

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.
S contains s_i tuples of class C_i for i = {1, ..., m}.

Information measure (expected information required to classify any arbitrary tuple):

  I(s_1, s_2, ..., s_m) = - sum_{i=1..m} (s_i / s) log2(s_i / s)

Entropy of attribute A with values {a_1, a_2, ..., a_v}:

  E(A) = sum_{j=1..v} ((s_1j + ... + s_mj) / s) * I(s_1j, ..., s_mj)

Information gained by branching on attribute A:

  Gain(A) = I(s_1, s_2, ..., s_m) - E(A)
27

Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes"; Class N: buys_computer = "no"
I(p, n) = I(9, 5) = 0.940

Compute the entropy for age (using the training dataset above):

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31..40  4    0    0
>40     3    2    0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means age <=30 has 5 out of 14 samples, with 2 yes's and 3 no's. Hence

Gain(age) = I(p, n) - E(age) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048


28
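The gain computation above can be checked with a short script. The following is a minimal sketch (not part of the original slides) that recomputes I(9,5) and the four gains on the buys_computer data; the helper names info and gain are illustrative choices.

```python
from collections import Counter
from math import log2

# Training dataset from the slides: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    """I(s1,...,sm) = -sum (si/s) log2(si/s) over the class counts in `labels`."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(attr):
    """Gain(A) = I(class distribution) - E(A)."""
    idx = ATTRS[attr]
    labels = [row[-1] for row in data]
    e_a = 0.0
    for value in set(row[idx] for row in data):
        subset = [row[-1] for row in data if row[idx] == value]
        e_a += len(subset) / len(data) * info(subset)
    return info(labels) - e_a

for a in ATTRS:
    # age 0.247, income 0.029, student 0.152, credit_rating 0.048
    # (the slide truncates 0.247 -> 0.246 and 0.152 -> 0.151)
    print(a, round(gain(a), 3))
```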

Other Attribute Selection Measures

Gini index (CART, IBM IntelligentMiner)
- All attributes are assumed continuous-valued
- Assume there exist several possible split values for each attribute
- May need other tools, such as clustering, to get the possible split values
- Can be modified for categorical attributes

29

Gini Index (IBM IntelligentMiner)

If a data set T contains examples from n classes, the gini index gini(T) is defined as

  gini(T) = 1 - sum_{j=1..n} p_j^2

where p_j is the relative frequency of class j in T.

If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
30
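As a small illustration of the two formulas, the sketch below (not from the slides; the example split is made up for this sketch) computes gini(T) and gini_split(T) for a binary split of 14 class labels.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted gini of a binary split: (N1/N) gini(T1) + (N2/N) gini(T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy split of 14 labels (9 "yes", 5 "no"): age <= 30 on the left, the rest on the right.
left = ["no", "no", "no", "yes", "yes"]
right = ["yes"] * 7 + ["no"] * 2
print(round(gini(left + right), 3), round(gini_split(left, right), 3))  # 0.459 0.394
```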

Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand

Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31..40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
31

Avoid Overfitting in Classification

Overfitting: an induced tree may overfit the training data
- Too many branches, some may reflect anomalies due to noise or outliers
- Poor accuracy for unseen samples

Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. Difficult to choose an appropriate threshold.
- Postpruning: remove branches from a "fully grown" tree; get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".

32

Approaches to Determine the Final Tree Size

- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized

33

Enhancements to basic decision tree induction

Allow for continuous-valued attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals

Handle missing attribute values
- Assign the most common value of the attribute
- Assign a probability to each of the possible values

Attribute construction
- Create new attributes based on existing ones that are sparsely represented
- This reduces fragmentation, repetition, and replication
34

Decision Tree vs. Rules

Tree:
- Has an implied order in which splitting is performed
- Created based on looking at all classes

Rules:
- Have no ordering of predicates
- Only need to look at one class to generate its rules

35

Scalable Decision Tree Induction Methods in Data Mining Studies

Classification: a classical problem extensively studied by statisticians and machine learning researchers.
Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
Why decision tree induction in data mining?
- Relatively faster learning speed (than other classification methods)
- Convertible to simple and easy-to-understand classification rules
- Can use SQL queries for accessing databases
- Comparable classification accuracy with other methods

36

Scalable Decision Tree Induction Methods in Data Mining Studies

SLIQ (EDBT'96, Mehta et al.)
- Builds an index for each attribute; only the class list and the current attribute list reside in memory

SPRINT (VLDB'96, J. Shafer et al.)
- Constructs an attribute-list data structure

PUBLIC (VLDB'98, Rastogi & Shim)
- Integrates tree splitting and tree pruning: stop growing the tree earlier

RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
- Separates the scalability aspects from the criteria that determine the quality of the tree
- Builds an AVC-list (attribute, value, class label)

37

Instance-Based Methods

Instance-based learning:
- Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified

Typical approaches:
- k-nearest neighbor approach: instances represented as points in a Euclidean space
- Locally weighted regression: constructs a local approximation
- Case-based reasoning: uses symbolic representations and knowledge-based inference

38

Classification Using Distance

- Place items in the class to which they are closest.
- Must determine the distance between an item and a class.
- Classes represented by:
  Centroid: central value
  Medoid: representative point
  Individual points
- Algorithm: KNN
39

The k-Nearest Neighbor Algorithm

- All instances correspond to points in the n-D space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function could be discrete- or real-valued.
- For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to x_q.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

(Figure: labeled +/- training points around a query point x_q, and the corresponding Voronoi cells.)
40

K-Nearest Neighbor (KNN):

- Training set includes classes.
- Examine the K items nearest to the item to be classified.
- The new item is placed in the class with the most number of close items.
- O(q) for each tuple to be classified. (Here q is the size of the training set.)

41
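A minimal sketch of the k-NN rule just described (not from the slides; the tiny training set and the helper name knn_predict are made up for illustration):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (point, label) pairs, e.g. ((1.0, 2.0), "A")."""
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "-"), ((1, 2), "-"), ((2, 1), "-"),
         ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
print(knn_predict(train, (1.5, 1.5), k=3))  # "-"
print(knn_predict(train, (5.5, 5.5), k=3))  # "+"
```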

KNN

42

KNN Algorithm

43

Bayesian Classification: Why?

- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

44

Bayesian Theorem: Basics

- Let X be a data sample whose class label is unknown
- Let H be a hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; reflects the background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing the sample X, given that the hypothesis holds

45

Bayes Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as

  posterior = likelihood x prior / evidence

MAP (maximum a posteriori) hypothesis:

  h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)

Practical difficulty: requires initial knowledge of many probabilities; significant computational cost
46

Naive Bayes Classifier

A simplified assumption: attributes are conditionally independent:

  P(X|Ci) = prod_{k=1..n} P(x_k|Ci)

- The probability of occurrence of, say, two elements x1 and x2, given the current class C, is the product of the probabilities of each element taken separately, given the same class: P([y1,y2],C) = P(y1,C) * P(y2,C)
- No dependence relation between attributes
- Greatly reduces the computation cost: only count the class distribution
- Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci)
47

Training dataset

Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"
Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
(The training tuples are the buys_computer dataset shown earlier.)
48

Naive Bayesian Classifier: Example

Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")

P(X|Ci): P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
         P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) * P(Ci): P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
                 P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007

X belongs to class buys_computer = "yes"

49
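The worked example can be reproduced with a short sketch (not from the slides; the function name naive_bayes is illustrative). It uses the same buys_computer table and returns the class maximizing P(X|Ci) * P(Ci):

```python
from collections import Counter

data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes(x):
    """Return the class maximizing P(X|Ci) * P(Ci), with P(X|Ci) = prod_k P(x_k|Ci)."""
    classes = Counter(row[-1] for row in data)
    n = len(data)
    scores = {}
    for c, count in classes.items():
        rows = [row for row in data if row[-1] == c]
        likelihood = 1.0
        for k, value in enumerate(x):
            likelihood *= sum(1 for row in rows if row[k] == value) / count
        scores[c] = likelihood * count / n  # P(X|Ci) * P(Ci)
    return max(scores, key=scores.get), scores

label, scores = naive_bayes(("<=30", "medium", "yes", "fair"))
print(label, {c: round(s, 3) for c, s in scores.items()})  # yes {'yes': 0.028, 'no': 0.007}
```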

Naive Bayesian Classifier: Comments

Advantages:
- Easy to implement
- Good results obtained in most of the cases

Disadvantages:
- Assumption: class conditional independence, therefore loss of accuracy
- Practically, dependencies exist among variables. E.g., hospitals: patients: Profile (age, family history, etc.), Symptoms (fever, cough, etc.), Disease (lung cancer, diabetes, etc.). Dependencies among these cannot be modeled by the Naive Bayesian Classifier.

How to deal with these dependencies? Bayesian Belief Networks
50

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Statistical Learning

51

Data Mining - From the Top 10 Algorithms to the N New Ch Challenges ll

O tli Outline

Top 10 Algorithms in Data Mining Research


Introduction Classification Statistical Learning Bagging and Boosting Clustering Association Analysis Link Mining Text Mining Top 10 Algorithms

10 Challenging g g Problems in Data Mining g Research Concluding Remarks


52

Support vector machine (SVM)

- Classification is essentially finding the best boundary between classes.
- A support vector machine finds the best boundary points, called support vectors, and builds a classifier on top of them.
- Linear and non-linear support vector machines.

53

Example of general SVM

(Figure.) The dots with a shadow around them are support vectors. Clearly they are the best data points to represent the boundary. The curve is the separating boundary.

54

SVM - Support Vector Machines

(Figure: two linear separators, one with a small margin and one with a large margin; the support vectors lie on the margin.)

Optimal hyperplane, separable case

- In this case, class 1 and class 2 are separable.
- The representing points are selected such that the margin between the two classes is maximized.
- Crossed points are support vectors.
- The separating hyperplane is x^T beta + beta_0 = 0.

56

SVM - Cont.

Linear Support Vector Machine

Given a set of points x_i in R^n with labels y_i in {-1, 1}, the SVM finds a hyperplane defined by the pair (w, b), where w is the normal to the plane and b is the distance from the origin, such that

  y_i (x_i . w + b) >= +1,  i = 1, ..., N

x: feature vector, b: bias, y: class label, 2/||w||: margin

57

Analysis of Separable case

1. Throughout our presentation, the training data consists of N pairs: (x1, y1), (x2, y2), ..., (xN, yN).
2. Define a hyperplane:

  {x : f(x) = x^T beta + beta_0 = 0}

where beta is a unit vector. The classification rule is:

  G(x) = sign[x^T beta + beta_0]

58

Analysis - Cont.

3. So the problem of finding the optimal hyperplane turns into: maximizing C over (beta, beta_0) with ||beta|| = 1, subject to the constraint:

  y_i (x_i^T beta + beta_0) >= C,  i = 1, ..., N.

4. It is the same as: minimizing ||beta|| subject to

  y_i (x_i^T beta + beta_0) >= 1,  i = 1, ..., N.
59

General SVM

(Figure.) This classification problem clearly does not have a good optimal linear classifier. Can we do better? A non-linear boundary as shown will do fine.
60

Non-separable case

When the data set is non-separable, as shown in the figure, we will assign a weight to each support vector, which will be shown in the constraint.

(Figure: hyperplane x^T beta + beta_0 = 0, margin C, with slack for the crossed points on the wrong side.)

61

Non-Linear SVM

Classification using SVM (w, b):

  x_i . w + b > 0

In the non-linear case we can see this as

  K(x_i, w) + b > 0

Kernel: can be thought of as doing a dot product in some high-dimensional space
62

General SVM - Cont.

Similar to the linear case, the solution can be written as:

  f(x) = h(x)^T beta + beta_0 = sum_{i=1..N} alpha_i y_i <h(x), h(x_i)> + beta_0

But the function h is of very high dimension, sometimes infinite. Does it mean SVM is impractical?


63

Resulting Surfaces

64

Reproducing Kernel

Look at the dual problem: the solution only depends on <h(x_i), h(x_i')>. Traditional functional analysis tells us we need to only look at their kernel representation: K(x, x') = <h(x), h(x')>, which lies in a much smaller-dimensional space than h.
65

Restrictions and typical kernels

- A kernel representation does not exist all the time; Mercer's condition (Courant and Hilbert, 1953) tells us the condition for this kind of existence.
- There is a set of kernels proven to be effective, such as polynomial kernels and radial basis kernels.

66

Example of polynomial kernel

d-degree polynomial: K(x, x') = (1 + <x, x'>)^d.

For a feature space with two inputs, x1 and x2, and a polynomial kernel of degree 2: K(x, x') = (1 + <x, x'>)^2.

Let h1(x) = 1, h2(x) = sqrt(2) x1, h3(x) = sqrt(2) x2, h4(x) = x1^2, h5(x) = x2^2, and h6(x) = sqrt(2) x1 x2. Then K(x, x') = <h(x), h(x')>.
67
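The identity K(x, x') = <h(x), h(x')> can be checked numerically. The sketch below (not from the slides; the random test points are arbitrary) compares the degree-2 kernel with the explicit feature map just listed:

```python
import math
import random

def poly_kernel(x, xp, d=2):
    """K(x, x') = (1 + <x, x'>)^d."""
    return (1.0 + sum(a * b for a, b in zip(x, xp))) ** d

def h(x):
    """Explicit degree-2 feature map for two inputs, as on the slide."""
    x1, x2 = x
    return (1.0, math.sqrt(2) * x1, math.sqrt(2) * x2, x1 ** 2, x2 ** 2, math.sqrt(2) * x1 * x2)

random.seed(0)
for _ in range(5):
    x = (random.uniform(-2, 2), random.uniform(-2, 2))
    xp = (random.uniform(-2, 2), random.uniform(-2, 2))
    explicit = sum(a * b for a, b in zip(h(x), h(xp)))   # <h(x), h(x')>
    assert math.isclose(poly_kernel(x, xp), explicit)
print("kernel trick verified on random points")
```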

SVM vs. Neural Network

SVM:
- Relatively new concept
- Nice generalization properties
- Hard to learn - learned in batch mode using quadratic programming techniques
- Using kernels can learn very complex functions

Neural Network:
- Quite old
- Generalizes well but doesn't have a strong mathematical foundation
- Can easily be learned in incremental fashion
- To learn complex functions, use multilayer perceptron (not that trivial)
68

Open problems of SVM

- How do we choose the kernel function for a specific set of problems? Different kernels will have different results, although generally the results are better than using hyperplanes.
- Comparisons with Bayesian risk for the classification problem. The minimum Bayesian risk is proven to be the best. When can SVM achieve that risk?

69

Open problems of SVM

- For a very large training set, the set of support vectors might be of large size. Speed thus becomes a bottleneck.
- An optimal design for multi-class SVM classifiers.

70

SVM Related Links

- http://svm.dcs.rhbnc.ac.uk/
- http://www.kernel-machines.org/
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
- SVMlight - software (in C): http://ais.gmd.de/~thorsten/svm_light
- BOOK: An Introduction to Support Vector Machines, N. Cristianini and J. Shawe-Taylor, Cambridge University Press, 2000

71

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


72

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Bagging and Boosting

Combining classifiers

- Examples: classification trees and neural networks, several neural networks, several classification trees, etc.
- Average results from different models

Why?
- Better classification performance than individual classifiers
- More resilience to noise

Why not?
- Time consuming
- Overfitting

Combining classifiers

Ensemble methods for classification:
- Manipulation with the model (Model = M(...))
- Manipulation with the data set
75

Bagging and Boosting

Bagging = manipulation with the data set

Boosting = manipulation with the model

76

Bagging and Boosting

General idea (figure):
Training data --(classification method CM)--> Classifier C
Altered training data --(CM)--> Classifier C1
Altered training data --(CM)--> Classifier C2
...
Aggregation --> Classifier C*
77

Bagging

- Breiman, 1996
- Derived from bootstrap (Efron, 1993)
- Create classifiers using training sets that are bootstrapped (drawn with replacement)
- Average results for each case

Bagging

- Given a set S of s samples
- Generate a bootstrap sample T from S. Cases in S may not appear in T or may appear more than once.
- Repeat this sampling procedure, getting a sequence of k independent training sets
- A corresponding sequence of classifiers C1, C2, ..., Ck is constructed for each of these training sets, by using the same classification algorithm
- To classify an unknown sample X, let each classifier predict or vote
- The Bagged Classifier C* counts the votes and assigns X to the class with the most votes
79
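A minimal sketch of this procedure (not from the slides). The 1-nearest-neighbour base learner, the toy 1-D data, and the helper names are assumptions made for illustration only:

```python
import random
from collections import Counter

def bagging_train(data, build_classifier, k=10, seed=0):
    """Train k classifiers on bootstrap samples of `data` (drawn with replacement)."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in range(len(data))]  # bootstrap sample T
        classifiers.append(build_classifier(sample))
    return classifiers

def bagging_predict(classifiers, x):
    """The bagged classifier C*: majority vote over the individual predictions."""
    votes = Counter(c(x) for c in classifiers)
    return votes.most_common(1)[0][0]

# Illustrative base learner: 1-nearest neighbour over 1-D (value, label) pairs.
def build_1nn(sample):
    return lambda x: min(sample, key=lambda vl: abs(vl[0] - x))[1]

data = [(1, "A"), (2, "A"), (3, "A"), (10, "B"), (11, "B"), (12, "B")]
ensemble = bagging_train(data, build_1nn, k=15)
print(bagging_predict(ensemble, 2.5), bagging_predict(ensemble, 10.5))  # A B
```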

Bagging Example (Opitz, 1999)

Original        1  2  3  4  5  6  7  8
Training set 1  2  7  8  3  7  7  6  1
Training set 2  7  8  5  6  4  6  4  1
Training set 3  3  6  2  7  5  3  2  2
Training set 4  4  5  1  4  6  2  3  8

Boosting

- A family of methods
- Sequential production of classifiers
- Each classifier is dependent on the previous one, and focuses on the previous one's errors
- Examples that are incorrectly predicted in previous classifiers are chosen more often or weighted more heavily

Boosting Technique - Algorithm

- Assign every example an equal weight 1/N
- For t = 1, 2, ..., T do:
  - Obtain a hypothesis (classifier) h(t) under w(t)
  - Calculate the error of h(t) and reweight the examples based on the error. Each classifier is dependent on the previous ones. Samples that are incorrectly predicted are weighted more heavily.
  - Normalize w(t+1) to sum to 1 (weights assigned to different classifiers sum to 1)
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set


82

Boosting

The idea (figure)

83

AdaBoost

Freund and Schapire, 1996

Two approaches:
- Select examples according to error in the previous classifier (more representatives of misclassified cases are selected) - more common
- Weigh errors of the misclassified cases higher (all cases are incorporated, but weights are different) - does not work for some algorithms

AdaBoost

- Define epsilon_k as the sum of the probabilities for the misclassified instances for the current classifier Ck
- Multiply the probability of selecting misclassified cases by beta_k = (1 - epsilon_k) / epsilon_k
- Renormalize the probabilities (i.e., rescale so that they sum to 1)
- Combine classifiers C1...Ck using weighted voting, where Ck has weight log(beta_k)
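One reweighting step under the slide's update rule can be sketched as follows (not from the slides; the toy predictions and the helper name adaboost_reweight are made up for illustration):

```python
from math import log

def adaboost_reweight(weights, predictions, labels):
    """One reweighting step following the slide's rule: epsilon = total weight of the
    misclassified instances, misclassified weights are multiplied by
    beta = (1 - epsilon) / epsilon, then all weights are renormalized;
    the classifier receives voting weight log(beta)."""
    misclassified = [p != y for p, y in zip(predictions, labels)]
    epsilon = sum(w for w, m in zip(weights, misclassified) if m)
    beta = (1.0 - epsilon) / epsilon
    new_weights = [w * beta if m else w for w, m in zip(weights, misclassified)]
    total = sum(new_weights)
    return [w / total for w in new_weights], log(beta)

# Toy step: 5 equally weighted examples, one classifier that gets example 3 wrong.
weights = [0.2] * 5
predictions = ["A", "A", "B", "A", "B"]
labels = ["A", "A", "A", "A", "B"]
weights, alpha = adaboost_reweight(weights, predictions, labels)
print([round(w, 3) for w in weights], round(alpha, 3))  # [0.125, 0.125, 0.5, 0.125, 0.125] 1.386
```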

Boosting Example (Opitz, 1999)

Original        1  2  3  4  5  6  7  8
Training set 1  2  7  8  3  7  7  6  1
Training set 2  1  4  5  4  1  6  8  4
Training set 3  7  1  5  8  1  3  3  4
Training set 4  1  1  6  1  1  5  1  5

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


87

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Clustering

88

Clustering Problem

- Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D -> {1, ..., k} where each ti is assigned to one cluster Kj, 1 <= j <= k.
- A cluster, Kj, contains precisely those tuples mapped to it.
- Unlike the classification problem, clusters are not known a priori.

89

Clustering Examples

- Segment a customer database based on similar buying patterns.
- Group houses in a town into neighborhoods based on similar features.
- Identify new plant species.
- Identify similar Web usage patterns.

90

Clustering Example

91

Clustering Levels

(Figure: clusterings at different levels of granularity; size based.)

92

Clustering vs. Classification

- No prior knowledge
  - Number of clusters
  - Meaning of clusters
- Unsupervised learning

93

Clustering Issues

- Outlier handling
- Dynamic data
- Interpreting results
- Evaluating results
- Number of clusters
- Data to be used
- Scalability
94

Impact of Outliers on Clustering

95

Types of Clustering

- Hierarchical: nested set of clusters created.
- Partitional: one set of clusters created.
- Incremental: each element handled one at a time.
- Simultaneous: all elements handled together.
- Overlapping / Non-overlapping

96

Clustering Approaches

Clustering
- Hierarchical
  - Agglomerative
  - Divisive
- Partitional
- Categorical
- Large DB
  - Sampling
  - Compression

97

Cluster Parameters

98

Distance Between Clusters

- Single Link: smallest distance between points
- Complete Link: largest distance between points
- Average Link: average distance between points
- Centroid: distance between centroids

99

Hierarchical Clustering

Clusters are created in levels, actually creating sets of clusters at each level.

Agglomerative:
- Initially each item is in its own cluster
- Iteratively clusters are merged together
- Bottom Up

Divisive:
- Initially all items are in one cluster
- Large clusters are successively divided
- Top Down
100

Hierarchical Algorithms

- Single Link
- MST Single Link
- Complete Link
- Average Link

101

Dendrogram

- Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
- Each level shows clusters for that level.
  - Leaf: individual clusters
  - Root: one cluster
- A cluster at level i is the union of its children clusters at level i+1.


102

Levels of Clustering

103

Partitional Clustering

- Nonhierarchical
- Creates clusters in one step as opposed to several steps.
- Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
- Usually deals with static sets.

104

Partitional Algorithms

- MST
- Squared Error
- K-Means
- Nearest Neighbor
- PAM
- BEA
- GA
105

K-Means

- Initial set of clusters randomly chosen.
- Iteratively, items are moved among sets of clusters until the desired set is reached.
- A high degree of similarity among elements in a cluster is obtained.
- Given a cluster Ki = {ti1, ti2, ..., tim}, the cluster mean is mi = (1/m)(ti1 + ... + tim)

106

K-Means Example

Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
- Randomly assign means: m1 = 3, m2 = 4
- K1 = {2,3}, K2 = {4,10,12,20,30,11,25}, m1 = 2.5, m2 = 16
- K1 = {2,3,4}, K2 = {10,12,20,30,11,25}, m1 = 3, m2 = 18
- K1 = {2,3,4,10}, K2 = {12,20,30,11,25}, m1 = 4.75, m2 = 19.6
- K1 = {2,3,4,10,11,12}, K2 = {20,30,25}, m1 = 7, m2 = 25
- Stop, as the clusters with these means stay the same.
107
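The trace above can be reproduced with a minimal 1-D k-means sketch (not from the slides; the function name kmeans_1d is an illustrative choice, and it assumes no cluster ever becomes empty, which holds for this example):

```python
def kmeans_1d(points, means, max_iter=100):
    """Plain k-means on 1-D points, starting from the given initial means."""
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for p in points:                      # assign each point to the nearest mean
            idx = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[idx].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:                # stop when the means no longer change
            return clusters, means
        means = new_means
    return clusters, means

points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(points, [3.0, 4.0])
print(clusters, means)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]] [7.0, 25.0]
```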

K-Means Algorithm

108

Clustering Large Databases

- Most clustering algorithms assume a large data structure which is memory resident.
- Clustering may be performed first on a sample of the database, then applied to the entire database.
- Algorithms: BIRCH, DBSCAN, CURE

109

Desired Features for Large Databases

- One scan (or less) of the DB
- Online
- Suspendable, stoppable, resumable
- Incremental
- Work with limited main memory
- Different techniques to scan (e.g. sampling)
- Process each tuple once

110

BIRCH

- Balanced Iterative Reducing and Clustering using Hierarchies
- Incremental, hierarchical, one scan
- Save clustering information in a tree
- Each entry in the tree contains information about one cluster
- New nodes inserted in the closest entry in the tree

111

Clustering Feature

CF Triple: (N, LS, SS)
- N: number of points in the cluster
- LS: sum of the points in the cluster
- SS: sum of squares of the points in the cluster

CF Tree
- Balanced search tree
- A node has a CF triple for each child
- A leaf node represents a cluster and has a CF value for each subcluster in it
- A subcluster has a maximum diameter

112

BIRCH Algorithm

113

Comparison of Clustering Techniques

114

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


115

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Association Analysis

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Sequential Patterns

Graph Mining

117

Association Rule Problem

Given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn} where ti = {Ii1, Ii2, ..., Iik} and Iij in I, the Association Rule Problem is to identify all association rules X => Y with a minimum support and confidence.

NOTE: the support of X => Y is the same as the support of X union Y.
118

Association Rule Definitions

- Set of items: I = {I1, I2, ..., Im}
- Transactions: D = {t1, t2, ..., tn}, tj subset of I
- Itemset: {Ii1, Ii2, ..., Iik} subset of I
- Support of an itemset: percentage of transactions which contain that itemset.
- Large (frequent) itemset: itemset whose number of occurrences is above a threshold.

119

Association Rule Definitions

- Association Rule (AR): implication X => Y where X, Y subset of I and X intersect Y = empty set
- Support of AR (s) X => Y: percentage of transactions that contain X union Y
- Confidence of AR (alpha) X => Y: ratio of the number of transactions that contain X union Y to the number that contain X

120

Example: Market Basket Data

Items frequently purchased together:
  Bread => PeanutButter

Uses:
- Placement
- Advertising
- Sales
- Coupons

Objective: increase sales and reduce costs

121

Association Rules Example

I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%


122

Association Rules Example (cont'd)

123

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
- support >= minsup threshold
- confidence >= minconf threshold

Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
- Computationally prohibitive!

Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} => {Beer}   (s=0.4, c=0.67)
{Milk, Beer} => {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} => {Milk}   (s=0.4, c=0.67)
{Beer} => {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} => {Milk, Beer}   (s=0.4, c=0.5)
{Milk} => {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
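The support and confidence values listed above can be recomputed directly from the five-transaction table. The following is a minimal sketch (not from the slides; the helper names support and confidence are illustrative):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """confidence(X => Y) = support(X union Y) / support(X)."""
    return support(lhs | rhs) / support(lhs)

print(round(support({"Milk", "Diaper", "Beer"}), 2))        # 0.4
print(round(confidence({"Milk", "Diaper"}, {"Beer"}), 2))   # 0.67
print(round(confidence({"Milk", "Beer"}, {"Diaper"}), 2))   # 1.0
```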

Association Rule Techniques

1. Find large itemsets.
2. Generate rules from frequent itemsets.

126

Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   - Generate all itemsets whose support >= minsup
2. Rule Generation
   - Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

(Figure: the itemset lattice over items {A, B, C, D, E}, from the null set through the 1-itemsets A..E, the 2-itemsets AB..DE, and so on down to ABCDE.)

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation

Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database (the five-transaction table above)
- Match each transaction against every candidate
- Complexity ~ O(N M w) => expensive since M = 2^d !!!

Computational Complexity

Given d unique items:
- Total number of itemsets = 2^d
- Total number of possible association rules:

  R = sum_{k=1..d-1} [ C(d,k) * sum_{j=1..d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1

If d = 6, R = 602 rules

Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
- Complete search: M = 2^d
- Use pruning techniques to reduce M

Reduce the number of transactions (N)
- Reduce the size of N as the size of the itemset increases
- Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
- Use efficient data structures to store the candidates or transactions
- No need to match every candidate against every transaction

Reducing Number of Candidates

Apriori principle:
- If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

  For all X, Y: (X subset of Y) => s(X) >= s(Y)

- The support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support

Illustrating Apriori Principle

(Figure: the same itemset lattice; once an itemset (here {A,B}) is found to be infrequent, all of its supersets are pruned.)

Apriori

- Large Itemset Property: any subset of a large itemset is large.
- Contrapositive: if an itemset is not large, none of its supersets is large.

134

Apriori Example (cont'd)

s = 30%, alpha = 50%

135

Apriori Algorithm

Tables:
- Lk = set of k-itemsets which are frequent
- Ck = set of k-itemsets which could be frequent

Method:
- Init. Let k = 1
- Generate L1 (frequent itemsets of length 1)
- Repeat until no new frequent itemsets are identified:
  a) Generate C(k+1) candidate itemsets from Lk frequent itemsets
  b) Count the support of each candidate by scanning the DB
  c) Eliminate candidates that are infrequent, leaving only those that are frequent

Illustrating Apriori Principle

Minimum support (count) = 3

L1 (after counting the 1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1
a) No need to generate candidates involving Coke (or Eggs)

C2 (candidate 2-itemsets): {Bread,Milk}, {Bread,Beer}, {Bread,Diaper}, {Milk,Beer}, {Milk,Diaper}, {Beer,Diaper}
b) Counting:
Itemset          Count
{Bread,Milk}     3
{Bread,Beer}     2
{Bread,Diaper}   3
{Milk,Beer}      2
{Milk,Diaper}    3
{Beer,Diaper}    3
c) Filter the non-frequent ones.

L2:
Itemset          Count
{Bread,Milk}     3
{Bread,Diaper}   3
{Milk,Diaper}    3
{Beer,Diaper}    3
a) No need to generate C3 candidates involving {Bread,Beer}
b) Counting; c) Filter

L3:
Itemset               Count
{Bread,Milk,Diaper}   3

Apriori-Gen Example

138

Apriori-Gen Example (cont'd)

139

Apriori Adv/Disadv

Advantages:
- Uses the large itemset property.
- Easily parallelized
- Easy to implement.

Disadvantages:
- Assumes the transaction database is memory resident.
- Requires up to m database scans.
140

Sampling

- Large databases
- Sample the database and apply Apriori to the sample.
- Potentially Large Itemsets (PL): large itemsets from the sample
- Negative Border (BD-): generalization of Apriori-Gen applied to itemsets of varying sizes. The minimal set of itemsets which are not in PL, but whose subsets are all in PL.

141

Negative Border Example

(Figure: a set PL of itemsets, and the enlarged set PL united with BD-(PL).)
142

Sampling Algorithm

1. Ds = sample of database D;
2. PL = large itemsets in Ds using smalls;
3. C = PL union BD-(PL);
4. Count C in the database using s;
5. ML = large itemsets in BD-(PL);
6. If ML = empty set, then done;
7. else C = repeated application of BD-;
8. Count C in the database;
143

Sampling Example

- Find ARs assuming s = 20%
- Ds = {t1, t2}
- Smalls = 10%
- PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}
- BD-(PL) = {{Beer}, {Milk}}
- ML = {{Beer}, {Milk}}
- Repeated application of BD- generates all remaining itemsets
144

Sampling Adv/Disadv

Advantages:
- Reduces the number of database scans to one in the best case and two in the worst.
- Scales better.

Disadvantages:
- Potentially large number of candidates in the second pass

145

Partitioning

- Divide the database into partitions D1, D2, ..., Dp
- Apply Apriori to each partition
- Any large itemset must be large in at least one partition.

146

Partitioning Algorithm

1. Divide D into partitions D1, D2, ..., Dp;
2. For i = 1 to p do
3.   Li = Apriori(Di);
4. C = L1 union ... union Lp;
5. Count C on D to generate L;

147

Partitioning Example

D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}

D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}}

S = 10%

148

Partitioning Adv/Disadv

Advantages:
- Adapts to available main memory
- Easily parallelized
- Maximum number of database scans is two.

Disadvantages:
- May have many candidates during the second scan.

149

Parallelizing AR Algorithms

Based on Apriori. Techniques differ in:
- What is counted at each site
- How data (transactions) are distributed

Data Parallelism
- Data partitioned
- Count Distribution Algorithm

Task Parallelism
- Data and candidates partitioned
- Data Distribution Algorithm
150

Count Distribution Algorithm (CDA)

1. Place the data partition at each site.
2. In parallel, at each site do
3.   C1 = itemsets of size one in I;
4.   Count C1;
5.   Broadcast counts to all sites;
6.   Determine global large itemsets of size 1, L1;
7.   i = 1;
8.   Repeat
9.     i = i + 1;
10.    Ci = Apriori-Gen(Li-1);
11.    Count Ci;
12.    Broadcast counts to all sites;
13.    Determine global large itemsets of size i, Li;
14.  until no more large itemsets found;
151

CDA Example

152

Data Distribution Algorithm (DDA)

1. Place the data partition at each site.
2. In parallel, at each site do
3.   Determine local candidates of size 1 to count;
4.   Broadcast local transactions to other sites;
5.   Count local candidates of size 1 on all data;
6.   Determine large itemsets of size 1 for local candidates;
7.   Broadcast large itemsets to all sites;
8.   Determine L1;
9.   i = 1;
10.  Repeat
11.    i = i + 1;
12.    Ci = Apriori-Gen(Li-1);
13.    Determine local candidates of size i to count;
14.    Count, broadcast, and find Li;
15.  until no more large itemsets found;
153

DDA Example

154

Comparing AR Techniques

- Target
- Type
- Data type
- Data source
- Technique
- Itemset strategy and data structure
- Transaction strategy and data structure
- Optimization
- Architecture
- Parallelism strategy
155

Comparison of AR Techniques

156

Incremental Association Rules

- Generate ARs in a dynamic database.
- Problem: algorithms assume a static database
- Objective:
  - Know large itemsets for D
  - Find large itemsets for D union {delta D}
- Must be large in either D or delta D
- Save Li and counts


157

Note on ARs

Many applications outside market basket data analysis
- Prediction (telecom switch failure)
- Web usage mining

Many different types of association rules
- Temporal
- Spatial
- Causal
158

Association rules: Evaluation

Association rule algorithms tend to produce too many rules
- many of them are uninteresting or redundant
- redundant if {A,B,C} => {D} and {A,B} => {D} have the same support & confidence

In the original formulation of association rules, support & confidence are the only measures used.
Interestingness measures can be used to prune/rank the derived patterns.

Measuring Quality of Rules

- Support
- Confidence
- Interest
- Conviction
- Chi-Squared Test

160

Advanced AR Techniques

- Generalized Association Rules
- Multiple minimum supports
- Multiple Level Association Rules
- Quantitative Association Rules
- Using multiple minimum supports
- Correlation Rules
- Sequential pattern mining
- Graph mining
- Mining association rules in stream data
- Fuzzy association rules
- Anomalous association rules
161

Extensions: Handling Continuous Attributes

Different kinds of rules:
- Age in [21,35) and Salary in [70k,120k) => Buy
- Salary in [70k,120k) and Buy => Age: mean = 28, sigma = 4

Different methods:
- Discretization-based
- Statistics-based
- Non-discretization based (minApriori)

Extensions: Sequential pattern mining

Sequence Database   Sequence                                        Element (Transaction)                                                        Event (Item)
Customer            Purchase history of a given customer            A set of items bought by a customer at time t                                Books, dairy products, CDs, etc.
Web Data            Browsing activity of a particular Web visitor   A collection of files viewed by a Web visitor after a single mouse click     Home page, index page, contact info, etc.
Event data          History of events generated by a given sensor   Events triggered by a sensor at time t                                       Types of alarms generated by sensors
Genome sequences    DNA sequence of a particular species            An element of the DNA sequence                                               Bases A, T, G, C

(Figure: a sequence is an ordered list of elements (transactions), e.g. <{E1,E2} {E1,E3} {E2} {E2} {E3,E4}>, where each element is a set of events (items).)

Extensions: Sequential pattern mining

Sequence Database:
Object  Timestamp  Events
A       10         2, 3, 5
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       17         2
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 7, 8

Extensions: Sequential pattern mining

A sequence <a1 a2 ... an> is contained in another sequence <b1 b2 ... bm> (m >= n) if there exist integers i1 < i2 < ... < in such that a1 is a subset of b_i1, a2 is a subset of b_i2, ..., an is a subset of b_in.

Data sequence              Subsequence     Contain?
< {2,4} {3,5,6} {8} >      < {2} {3,5} >   Yes
< {1,2} {3,4} >            < {1} {2} >     No
< {2,4} {2,4} {2,5} >      < {2} {4} >     Yes

The support of a subsequence w is defined as the fraction of data sequences that contain w. A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is >= minsup).
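The containment test in the table can be written as a short greedy scan. The sketch below (not from the slides; the function name contains is an illustrative choice) reproduces the three Yes/No/Yes answers:

```python
def contains(data_seq, sub_seq):
    """True if sub_seq = <a1 ... an> is contained in data_seq = <b1 ... bm>,
    i.e. there are indices i1 < i2 < ... < in with each a_j a subset of b_{i_j}."""
    i = 0
    for element in data_seq:                          # scan the data sequence left to right
        if i < len(sub_seq) and sub_seq[i] <= element:
            i += 1                                    # greedily match the next element of sub_seq
    return i == len(sub_seq)

print(contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}]))  # True
print(contains([{1, 2}, {3, 4}], [{1}, {2}]))             # False
print(contains([{2, 4}, {2, 4}, {2, 5}], [{2}, {4}]))     # True
```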

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


166

Extensions: Graph mining

- Extend association rule mining to finding frequent subgraphs
- Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.

(Figure: example web graph with nodes Homepage, Research, Artificial Intelligence, Databases, Data Mining.)

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


168

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Link Mining

169

Link Mining

Traditional machine learning and data mining approaches assume:
- A random sample of homogeneous objects from a single relation

Real-world data sets:
- Multi-relational, heterogeneous and semi-structured

Link Mining
- A newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining.

Linked Data

Heterogeneous, multi-relational data represented as a graph or network

Nodes are objects
- May have different kinds of objects
- Objects have attributes
- Objects may have labels or classes

Edges are links
- May have different kinds of links
- Links may have attributes
- Links may be directed, and are not required to be binary

Sample Domains

- web data (web)
- bibliographic data (cite)
- epidemiological data (epi)

Example: Linked Bibliographic Data

(Figure: papers P1-P4, author A1, institution I1.)
Objects: Papers, Authors, Institutions. Attributes: Categories.
Links: Citation, Co-Citation, Author-of, Author-affiliation.

Link Mining Tasks

- Link-based Object Classification
- Link Type Prediction
- Predicting Link Existence
- Link Cardinality Estimation
- Object Identification
- Subgraph Discovery

Web Mining Outline

175

Web Data

- Web pages
- Intra-page structures
- Inter-page structures
- Usage data
- Supplemental data
  - Profiles
  - Registration information
  - Cookies
176

Web Content Mining

Extends the work of basic search engines.

Search Engines:
- IR application
- Keyword-based
- Similarity between query and document
- Crawlers
- Indexing
- Profiles
- Link analysis
177

Web Structure Mining

- Mine the structure (links, graph) of the Web
- Techniques: PageRank, CLEVER
- Create a model of the Web organization.
- May be combined with content mining to more effectively retrieve important pages.

178

PageRank (Larry Page and Sergey Brin)

- Used by Google
- Prioritize pages returned from search by looking at Web structure.
- Importance of a page is calculated based on the number of pages which point to it - backlinks.
- Weighting is used to provide more importance to backlinks coming from important pages.

179

PageRank (cont'd)

  PR(p) = c (PR(1)/N1 + ... + PR(n)/Nn)

- PR(i): PageRank for a page i which points to the target page p.
- Ni: number of links coming out of page i

180
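A minimal sketch of the iterative computation (not from the slides). It uses the common damped variant (1 - c) + c * sum(PR(i)/N_i), which is an assumption of this sketch; the slide writes only the constant factor c. The three-page graph is made up for illustration:

```python
def pagerank(links, c=0.85, iterations=50):
    """Iteratively apply PR(p) = (1 - c) + c * sum(PR(i)/N_i over pages i linking to p).
    `links` maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    out_degree = {p: len(targets) for p, targets in links.items()}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[q] / out_degree[q] for q in pages if p in links[q])
            new_pr[p] = (1 - c) + c * incoming
        pr = new_pr
    return pr

# Tiny made-up graph: A and C link to B, B links to A and C.
links = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print({p: round(v, 3) for p, v in pagerank(links).items()})  # B ends up with the highest rank
```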

PageRank (cont'd)

General Principle

(Figure: small example graph with pages A, B, C, D.)

Every page has some number of outbound links (forward links) and inbound links (backlinks). A page X has a high rank if:
- It has many inbound links
- It has highly ranked inbound links
- The pages linking to it have few outbound links.
181

HITS

- Hyperlink-Induced Topic Search
- Based on a set of keywords, find a set of relevant pages, R.
- Identify hub and authority pages for these:
  - Expand R to a base set, B, of pages linked to or from R.
  - Calculate weights for authorities and hubs.
- Pages with the highest ranks in R are returned.

Authoritative Pages:
- Highly important pages.
- Best source for requested information.

Hub Pages:
- Contain links to highly important pages.
182

HITS Algorithm

183

Web Usage Mining

Extends the work of basic search engines.

Search Engines:
- IR application
- Keyword-based
- Similarity between query and document
- Crawlers
- Indexing
- Profiles
- Link analysis
184

Web Usage Mining Applications

- Personalization
- Improve structure of a site's Web pages
- Aid in caching and prediction of future page references
- Improve design of individual pages
- Improve effectiveness of e-commerce (sales and advertising)

185

Web Usage Mining Activities

Preprocessing the Web log
- Cleanse
- Remove extraneous information
- Sessionize
  - Session: sequence of pages referenced by one user at a sitting.

Pattern Discovery
- Count patterns that occur in sessions
- A pattern is a sequence of page references in a session.
- Similar to association rules
  - Transaction: session
  - Itemset: pattern (or subset)
  - Order is important

Pattern Analysis
186

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Integrated Mining. Rough Sets

Integrated Mining

- On the use of association rule algorithms for classification: CBA
- Subgroup Discovery: characterization of classes

  Given a population of individuals and a property of those individuals we are interested in, find population subgroups that are statistically "most interesting", e.g., are as large as possible and have the most unusual statistical characteristics with respect to the property of interest.
188

Rough Set Approach

- Rough sets are used to approximately or "roughly" define equivalence classes
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
- Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computation intensity

189

ICDM '06 Panel on Top 10 Algorithms in Data Mining

190

ICDM '06 Panel on Top 10 Algorithms in Data Mining

191

ICDM '06 Panel on Top 10 Algorithms in Data Mining

192

ICDM '06 Panel on Top 10 Algorithms in Data Mining

A survey paper has been generated:

Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems (2008) 14:1-37.

193

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


194

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


195
