
Data Mining and Soft Computing
Francisco Herrera

Research Group on Soft Computing and Intelligent Information Systems (SCI2S)
Dept. of Computer Science and A.I., University of Granada, Spain
Email: herrera@decsai.ugr.es
http://sci2s.ugr.es http://decsai.ugr.es/~herrera

Data Mining and Soft Computing

Summary
1. Introduction to Data Mining and Knowledge Discovery
2. Data Preparation
3. Introduction to Prediction, Classification, Clustering and Association
4. Data Mining - From the Top 10 Algorithms to the New Challenges
5. Introduction to Soft Computing. Focusing our attention on Fuzzy Logic and Evolutionary Computation
6. Soft Computing Techniques in Data Mining: Fuzzy Data Mining and Knowledge Extraction based on Evolutionary Learning
7. Genetic Fuzzy Systems: State of the Art and New Trends
8. Some Advanced Topics I: Classification with Imbalanced Data Sets
9. Some Advanced Topics II: Subgroup Discovery
10. Some Advanced Topics III: Data Complexity
11. Final talk: How must I Do my Experimental Study? Design of Experiments in Data Mining/Computational Intelligence. Using Non-parametric Tests. Some Cases of Study.

Slides used for preparing this talk:

Top 10 Algorithms in Data Mining Research, prepared for ICDM 2006

10 Challenging Problems in Data Mining Research, prepared for ICDM 2005

CS490D: Introduction to Data Mining, Prof. Chris Clifton

Association Analysis: Basic Concepts and Algorithms, Lecture Notes for Chapter 6, Introduction to Data Mining, by Tan, Steinbach, Kumar

DATA MINING: Introductory and Advanced Topics, Margaret H. Dunham
3

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


4

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


5

From the Top 10 Algorithms to the New Challenges in Data Mining

Discussion Panels at ICDM 2005 and 2006

Top 10 Algorithms in Data Mining Research


prepared for ICDM 2006

10 Challenging Problems in Data Mining Research


prepared for ICDM 2005

Top 10 Algorithms in Data Mining Research

prepared for ICDM 2006
http://www.cs.uvm.edu/~icdm/algorithms/index.shtml

Coordinators:
Xindong Wu, University of Vermont, http://www.cs.uvm.edu/~xwu/home.html
Vipin Kumar, University of Minnesota, http://www-users.cs.umn.edu/~kumar/
8

ICDM '06 Panel on Top 10 Algorithms in Data Mining

ICDM '06 Panel on Top 10 Algorithms in Data Mining

10

ICDM '06 Panel on Top 10 Algorithms in Data Mining

11

ICDM '06 Panel on Top 10 Algorithms in Data Mining

12

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


13

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Classification

14

Classification Using Decision Trees

- Partitioning based: divide the search space into rectangular regions.
- A tuple is placed into a class based on the region within which it falls.
- DT approaches differ in how the tree is built: DT Induction.
- Internal nodes are associated with an attribute, and arcs with values for that attribute.
- Algorithms: ID3, C4.5, CART
15

Decision Tree

Given: D = {t1, ..., tn} where ti = <ti1, ..., tih>; the database schema contains {A1, A2, ..., Ah}; classes C = {C1, ..., Cm}.
A Decision (or Classification) Tree is a tree associated with D such that:
- Each internal node is labeled with an attribute, Ai
- Each arc is labeled with a predicate which can be applied to the attribute at its parent
- Each leaf node is labeled with a class, Cj
16

Training Dataset

This follows an example from Quinlan's ID3.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
17

Output: A Decision Tree for buys_computer

(Figure: the induced tree. Root: age? If age <=30, test student? (no -> no, yes -> yes). If age 30..40, predict yes. If age >40, test credit_rating? (excellent -> no, fair -> yes).)
18

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
- There are no samples left
19

DT Induction

20

DT Splits Area

(Figure: example split of the search space on Gender (M/F) and Height.)

21

Comparing DTs

(Figure: two example trees, one balanced and one deep.)
22

DT Issues

- Choosing splitting attributes
- Ordering of splitting attributes
- Splits
- Tree structure
- Stopping criteria
- Training data
- Pruning
23

Decision Tree Induction is often based on Information Theory

So...

24

Information

25

Information/Entropy

Given probabilities p1, p2, ..., ps whose sum is 1, Entropy is defined as:

  H(p1, ..., ps) = - sum_{i=1..s} p_i log2(p_i)

Entropy measures the amount of randomness or surprise or uncertainty.

Goal in classification: no surprise, entropy = 0
26

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.
S contains s_i tuples of class C_i for i = {1, ..., m}.

Information measure (expected information required to classify any arbitrary tuple):

  I(s_1, s_2, ..., s_m) = - sum_{i=1..m} (s_i / s) log2(s_i / s)

Entropy of attribute A with values {a_1, a_2, ..., a_v}:

  E(A) = sum_{j=1..v} ((s_1j + ... + s_mj) / s) * I(s_1j, ..., s_mj)

Information gained by branching on attribute A:

  Gain(A) = I(s_1, s_2, ..., s_m) - E(A)
27

Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes"; Class N: buys_computer = "no"
I(p, n) = I(9, 5) = 0.940

Compute the entropy for age (using the training dataset above):

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31..40  4    0    0
>40     3    2    0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means age <=30 has 5 out of 14 samples, with 2 yes's and 3 no's. Hence

Gain(age) = I(p, n) - E(age) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048


28
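The gain computation above can be checked with a short script. The following is a minimal sketch (not part of the original slides) that recomputes I(9,5) and the four gains on the buys_computer data; the helper names info and gain are illustrative choices.

```python
from collections import Counter
from math import log2

# Training dataset from the slides: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    """I(s1,...,sm) = -sum (si/s) log2(si/s) over the class counts in `labels`."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(attr):
    """Gain(A) = I(class distribution) - E(A)."""
    idx = ATTRS[attr]
    labels = [row[-1] for row in data]
    e_a = 0.0
    for value in set(row[idx] for row in data):
        subset = [row[-1] for row in data if row[idx] == value]
        e_a += len(subset) / len(data) * info(subset)
    return info(labels) - e_a

for a in ATTRS:
    # age 0.247, income 0.029, student 0.152, credit_rating 0.048
    # (the slide truncates 0.247 -> 0.246 and 0.152 -> 0.151)
    print(a, round(gain(a), 3))
```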

Other Attribute Selection Measures

Gini index (CART, IBM IntelligentMiner)
- All attributes are assumed continuous-valued
- Assume there exist several possible split values for each attribute
- May need other tools, such as clustering, to get the possible split values
- Can be modified for categorical attributes

29

Gini Index (IBM IntelligentMiner)

If a data set T contains examples from n classes, the gini index gini(T) is defined as

  gini(T) = 1 - sum_{j=1..n} p_j^2

where p_j is the relative frequency of class j in T.

If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
30
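As a small illustration of the two formulas, the sketch below (not from the slides; the example split is made up for this sketch) computes gini(T) and gini_split(T) for a binary split of 14 class labels.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted gini of a binary split: (N1/N) gini(T1) + (N2/N) gini(T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy split of 14 labels (9 "yes", 5 "no"): age <= 30 on the left, the rest on the right.
left = ["no", "no", "no", "yes", "yes"]
right = ["yes"] * 7 + ["no"] * 2
print(round(gini(left + right), 3), round(gini_split(left, right), 3))  # 0.459 0.394
```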

Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand

Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31..40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
31

Avoid Overfitting in Classification

Overfitting: an induced tree may overfit the training data
- Too many branches, some may reflect anomalies due to noise or outliers
- Poor accuracy for unseen samples

Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. Difficult to choose an appropriate threshold.
- Postpruning: remove branches from a "fully grown" tree; get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".

32

Approaches to Determine the Final Tree Size

- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized

33

Enhancements to basic decision tree induction

Allow for continuous-valued attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals

Handle missing attribute values
- Assign the most common value of the attribute
- Assign a probability to each of the possible values

Attribute construction
- Create new attributes based on existing ones that are sparsely represented
- This reduces fragmentation, repetition, and replication
34

Decision Tree vs. Rules

Tree:
- Has an implied order in which splitting is performed
- Created based on looking at all classes

Rules:
- Have no ordering of predicates
- Only need to look at one class to generate its rules

35

Scalable Decision Tree Induction Methods in Data Mining Studies

Classification: a classical problem extensively studied by statisticians and machine learning researchers.
Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
Why decision tree induction in data mining?
- Relatively faster learning speed (than other classification methods)
- Convertible to simple and easy-to-understand classification rules
- Can use SQL queries for accessing databases
- Comparable classification accuracy with other methods

36

Scalable Decision Tree Induction Methods in Data Mining Studies

SLIQ (EDBT'96, Mehta et al.)
- Builds an index for each attribute; only the class list and the current attribute list reside in memory

SPRINT (VLDB'96, J. Shafer et al.)
- Constructs an attribute-list data structure

PUBLIC (VLDB'98, Rastogi & Shim)
- Integrates tree splitting and tree pruning: stop growing the tree earlier

RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
- Separates the scalability aspects from the criteria that determine the quality of the tree
- Builds an AVC-list (attribute, value, class label)

37

Instance-Based Methods

Instance-based learning:
- Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified

Typical approaches:
- k-nearest neighbor approach: instances represented as points in a Euclidean space
- Locally weighted regression: constructs a local approximation
- Case-based reasoning: uses symbolic representations and knowledge-based inference

38

Classification Using Distance

- Place items in the class to which they are closest.
- Must determine the distance between an item and a class.
- Classes represented by:
  Centroid: central value
  Medoid: representative point
  Individual points
- Algorithm: KNN
39

The k-Nearest Neighbor Algorithm

- All instances correspond to points in the n-D space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function could be discrete- or real-valued.
- For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to x_q.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

(Figure: labeled +/- training points around a query point x_q, and the corresponding Voronoi cells.)
40

K-Nearest Neighbor (KNN):

- Training set includes classes.
- Examine the K items nearest to the item to be classified.
- The new item is placed in the class with the most number of close items.
- O(q) for each tuple to be classified. (Here q is the size of the training set.)

41
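A minimal sketch of the k-NN rule just described (not from the slides; the tiny training set and the helper name knn_predict are made up for illustration):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (point, label) pairs, e.g. ((1.0, 2.0), "A")."""
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "-"), ((1, 2), "-"), ((2, 1), "-"),
         ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
print(knn_predict(train, (1.5, 1.5), k=3))  # "-"
print(knn_predict(train, (5.5, 5.5), k=3))  # "+"
```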

KNN

42

KNN Algorithm

43

Bayesian Classification: Why?

- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

44

Bayesian Theorem: Basics

- Let X be a data sample whose class label is unknown
- Let H be a hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; reflects the background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing the sample X, given that the hypothesis holds

45

Bayes Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as

  posterior = likelihood x prior / evidence

MAP (maximum a posteriori) hypothesis:

  h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)

Practical difficulty: requires initial knowledge of many probabilities; significant computational cost
46

Naive Bayes Classifier

A simplified assumption: attributes are conditionally independent:

  P(X|Ci) = prod_{k=1..n} P(x_k|Ci)

- The probability of occurrence of, say, two elements x1 and x2, given the current class C, is the product of the probabilities of each element taken separately, given the same class: P([y1,y2],C) = P(y1,C) * P(y2,C)
- No dependence relation between attributes
- Greatly reduces the computation cost: only count the class distribution
- Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci)
47

Training dataset

Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"
Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
(The training tuples are the buys_computer dataset shown earlier.)
48

Naive Bayesian Classifier: Example

Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")

P(X|Ci): P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
         P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) * P(Ci): P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
                 P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007

X belongs to class buys_computer = "yes"

49
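The worked example can be reproduced with a short sketch (not from the slides; the function name naive_bayes is illustrative). It uses the same buys_computer table and returns the class maximizing P(X|Ci) * P(Ci):

```python
from collections import Counter

data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes(x):
    """Return the class maximizing P(X|Ci) * P(Ci), with P(X|Ci) = prod_k P(x_k|Ci)."""
    classes = Counter(row[-1] for row in data)
    n = len(data)
    scores = {}
    for c, count in classes.items():
        rows = [row for row in data if row[-1] == c]
        likelihood = 1.0
        for k, value in enumerate(x):
            likelihood *= sum(1 for row in rows if row[k] == value) / count
        scores[c] = likelihood * count / n  # P(X|Ci) * P(Ci)
    return max(scores, key=scores.get), scores

label, scores = naive_bayes(("<=30", "medium", "yes", "fair"))
print(label, {c: round(s, 3) for c, s in scores.items()})  # yes {'yes': 0.028, 'no': 0.007}
```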

Naive Bayesian Classifier: Comments

Advantages:
- Easy to implement
- Good results obtained in most of the cases

Disadvantages:
- Assumption: class conditional independence, therefore loss of accuracy
- Practically, dependencies exist among variables. E.g., hospitals: patients: Profile (age, family history, etc.), Symptoms (fever, cough, etc.), Disease (lung cancer, diabetes, etc.). Dependencies among these cannot be modeled by the Naive Bayesian Classifier.

How to deal with these dependencies? Bayesian Belief Networks
50

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Statistical Learning

51

Data Mining - From the Top 10 Algorithms to the N New Ch Challenges ll

O tli Outline

Top 10 Algorithms in Data Mining Research


Introduction Classification Statistical Learning Bagging and Boosting Clustering Association Analysis Link Mining Text Mining Top 10 Algorithms

10 Challenging g g Problems in Data Mining g Research Concluding Remarks


52

Support vector machine (SVM)

- Classification is essentially finding the best boundary between classes.
- A support vector machine finds the best boundary points, called support vectors, and builds a classifier on top of them.
- Linear and non-linear support vector machines.

53

Example of general SVM

(Figure.) The dots with a shadow around them are support vectors. Clearly they are the best data points to represent the boundary. The curve is the separating boundary.

54

SVM - Support Vector Machines

(Figure: two linear separators, one with a small margin and one with a large margin; the support vectors lie on the margin.)

Optimal hyperplane, separable case

- In this case, class 1 and class 2 are separable.
- The representing points are selected such that the margin between the two classes is maximized.
- Crossed points are support vectors.
- The separating hyperplane is x^T beta + beta_0 = 0.

56

SVM - Cont.

Linear Support Vector Machine

Given a set of points x_i in R^n with labels y_i in {-1, 1}, the SVM finds a hyperplane defined by the pair (w, b), where w is the normal to the plane and b is the distance from the origin, such that

  y_i (x_i . w + b) >= +1,  i = 1, ..., N

x: feature vector, b: bias, y: class label, 2/||w||: margin

57

Analysis of Separable case

1. Throughout our presentation, the training data consists of N pairs: (x1, y1), (x2, y2), ..., (xN, yN).
2. Define a hyperplane:

  {x : f(x) = x^T beta + beta_0 = 0}

where beta is a unit vector. The classification rule is:

  G(x) = sign[x^T beta + beta_0]

58

Analysis - Cont.

3. So the problem of finding the optimal hyperplane turns into: maximizing C over (beta, beta_0) with ||beta|| = 1, subject to the constraint:

  y_i (x_i^T beta + beta_0) >= C,  i = 1, ..., N.

4. It is the same as: minimizing ||beta|| subject to

  y_i (x_i^T beta + beta_0) >= 1,  i = 1, ..., N.
59

General SVM

(Figure.) This classification problem clearly does not have a good optimal linear classifier. Can we do better? A non-linear boundary as shown will do fine.
60

Non-separable case

When the data set is non-separable, as shown in the figure, we will assign a weight to each support vector, which will be shown in the constraint.

(Figure: hyperplane x^T beta + beta_0 = 0, margin C, with slack for the crossed points on the wrong side.)

61

Non-Linear SVM

Classification using SVM (w, b):

  x_i . w + b > 0

In the non-linear case we can see this as

  K(x_i, w) + b > 0

Kernel: can be thought of as doing a dot product in some high-dimensional space
62

General SVM - Cont.

Similar to the linear case, the solution can be written as:

  f(x) = h(x)^T beta + beta_0 = sum_{i=1..N} alpha_i y_i <h(x), h(x_i)> + beta_0

But the function h is of very high dimension, sometimes infinite. Does it mean SVM is impractical?


63

Resulting Surfaces

64

Reproducing Kernel

Look at the dual problem: the solution only depends on <h(x_i), h(x_i')>. Traditional functional analysis tells us we need to only look at their kernel representation: K(x, x') = <h(x), h(x')>, which lies in a much smaller-dimensional space than h.
65

Restrictions and typical kernels

- A kernel representation does not exist all the time; Mercer's condition (Courant and Hilbert, 1953) tells us the condition for this kind of existence.
- There is a set of kernels proven to be effective, such as polynomial kernels and radial basis kernels.

66

Example of polynomial kernel

d-degree polynomial: K(x, x') = (1 + <x, x'>)^d.

For a feature space with two inputs, x1 and x2, and a polynomial kernel of degree 2: K(x, x') = (1 + <x, x'>)^2.

Let h1(x) = 1, h2(x) = sqrt(2) x1, h3(x) = sqrt(2) x2, h4(x) = x1^2, h5(x) = x2^2, and h6(x) = sqrt(2) x1 x2. Then K(x, x') = <h(x), h(x')>.
67
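The identity K(x, x') = <h(x), h(x')> can be checked numerically. The sketch below (not from the slides; the random test points are arbitrary) compares the degree-2 kernel with the explicit feature map just listed:

```python
import math
import random

def poly_kernel(x, xp, d=2):
    """K(x, x') = (1 + <x, x'>)^d."""
    return (1.0 + sum(a * b for a, b in zip(x, xp))) ** d

def h(x):
    """Explicit degree-2 feature map for two inputs, as on the slide."""
    x1, x2 = x
    return (1.0, math.sqrt(2) * x1, math.sqrt(2) * x2, x1 ** 2, x2 ** 2, math.sqrt(2) * x1 * x2)

random.seed(0)
for _ in range(5):
    x = (random.uniform(-2, 2), random.uniform(-2, 2))
    xp = (random.uniform(-2, 2), random.uniform(-2, 2))
    explicit = sum(a * b for a, b in zip(h(x), h(xp)))   # <h(x), h(x')>
    assert math.isclose(poly_kernel(x, xp), explicit)
print("kernel trick verified on random points")
```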

SVM vs. Neural Network

SVM:
- Relatively new concept
- Nice generalization properties
- Hard to learn - learned in batch mode using quadratic programming techniques
- Using kernels can learn very complex functions

Neural Network:
- Quite old
- Generalizes well but doesn't have a strong mathematical foundation
- Can easily be learned in incremental fashion
- To learn complex functions, use multilayer perceptron (not that trivial)
68

Open problems of SVM

- How do we choose the kernel function for a specific set of problems? Different kernels will have different results, although generally the results are better than using hyperplanes.
- Comparisons with Bayesian risk for the classification problem. The minimum Bayesian risk is proven to be the best. When can SVM achieve that risk?

69

Open problems of SVM

- For a very large training set, the set of support vectors might be of large size. Speed thus becomes a bottleneck.
- An optimal design for multi-class SVM classifiers.

70

SVM Related Links

- http://svm.dcs.rhbnc.ac.uk/
- http://www.kernel-machines.org/
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
- SVMlight - software (in C): http://ais.gmd.de/~thorsten/svm_light
- BOOK: An Introduction to Support Vector Machines, N. Cristianini and J. Shawe-Taylor, Cambridge University Press, 2000

71

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


72

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Bagging and Boosting

Combining classifiers

- Examples: classification trees and neural networks, several neural networks, several classification trees, etc.
- Average results from different models

Why?
- Better classification performance than individual classifiers
- More resilience to noise

Why not?
- Time consuming
- Overfitting

Combining classifiers

Ensemble methods for classification:
- Manipulation with the model (Model = M(...))
- Manipulation with the data set
75

Bagging and Boosting

Bagging = manipulation with the data set

Boosting = manipulation with the model

76

Bagging and Boosting

General idea (figure):
Training data --(classification method CM)--> Classifier C
Altered training data --(CM)--> Classifier C1
Altered training data --(CM)--> Classifier C2
...
Aggregation --> Classifier C*
77

Bagging

- Breiman, 1996
- Derived from bootstrap (Efron, 1993)
- Create classifiers using training sets that are bootstrapped (drawn with replacement)
- Average results for each case

Bagging

- Given a set S of s samples
- Generate a bootstrap sample T from S. Cases in S may not appear in T or may appear more than once.
- Repeat this sampling procedure, getting a sequence of k independent training sets
- A corresponding sequence of classifiers C1, C2, ..., Ck is constructed for each of these training sets, by using the same classification algorithm
- To classify an unknown sample X, let each classifier predict or vote
- The Bagged Classifier C* counts the votes and assigns X to the class with the most votes
79
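A minimal sketch of this procedure (not from the slides). The 1-nearest-neighbour base learner, the toy 1-D data, and the helper names are assumptions made for illustration only:

```python
import random
from collections import Counter

def bagging_train(data, build_classifier, k=10, seed=0):
    """Train k classifiers on bootstrap samples of `data` (drawn with replacement)."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in range(len(data))]  # bootstrap sample T
        classifiers.append(build_classifier(sample))
    return classifiers

def bagging_predict(classifiers, x):
    """The bagged classifier C*: majority vote over the individual predictions."""
    votes = Counter(c(x) for c in classifiers)
    return votes.most_common(1)[0][0]

# Illustrative base learner: 1-nearest neighbour over 1-D (value, label) pairs.
def build_1nn(sample):
    return lambda x: min(sample, key=lambda vl: abs(vl[0] - x))[1]

data = [(1, "A"), (2, "A"), (3, "A"), (10, "B"), (11, "B"), (12, "B")]
ensemble = bagging_train(data, build_1nn, k=15)
print(bagging_predict(ensemble, 2.5), bagging_predict(ensemble, 10.5))  # A B
```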

Bagging Example (Opitz, 1999)

Original        1  2  3  4  5  6  7  8
Training set 1  2  7  8  3  7  7  6  1
Training set 2  7  8  5  6  4  6  4  1
Training set 3  3  6  2  7  5  3  2  2
Training set 4  4  5  1  4  6  2  3  8

Boosting

- A family of methods
- Sequential production of classifiers
- Each classifier is dependent on the previous one, and focuses on the previous one's errors
- Examples that are incorrectly predicted in previous classifiers are chosen more often or weighted more heavily

Boosting Technique - Algorithm

- Assign every example an equal weight 1/N
- For t = 1, 2, ..., T do:
  - Obtain a hypothesis (classifier) h(t) under w(t)
  - Calculate the error of h(t) and reweight the examples based on the error. Each classifier is dependent on the previous ones. Samples that are incorrectly predicted are weighted more heavily.
  - Normalize w(t+1) to sum to 1 (weights assigned to different classifiers sum to 1)
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set


82

Boosting

The idea (figure)

83

AdaBoost

Freund and Schapire, 1996

Two approaches:
- Select examples according to error in the previous classifier (more representatives of misclassified cases are selected) - more common
- Weigh errors of the misclassified cases higher (all cases are incorporated, but weights are different) - does not work for some algorithms

AdaBoost

- Define epsilon_k as the sum of the probabilities for the misclassified instances for the current classifier Ck
- Multiply the probability of selecting misclassified cases by beta_k = (1 - epsilon_k) / epsilon_k
- Renormalize the probabilities (i.e., rescale so that they sum to 1)
- Combine classifiers C1...Ck using weighted voting, where Ck has weight log(beta_k)
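One reweighting step under the slide's update rule can be sketched as follows (not from the slides; the toy predictions and the helper name adaboost_reweight are made up for illustration):

```python
from math import log

def adaboost_reweight(weights, predictions, labels):
    """One reweighting step following the slide's rule: epsilon = total weight of the
    misclassified instances, misclassified weights are multiplied by
    beta = (1 - epsilon) / epsilon, then all weights are renormalized;
    the classifier receives voting weight log(beta)."""
    misclassified = [p != y for p, y in zip(predictions, labels)]
    epsilon = sum(w for w, m in zip(weights, misclassified) if m)
    beta = (1.0 - epsilon) / epsilon
    new_weights = [w * beta if m else w for w, m in zip(weights, misclassified)]
    total = sum(new_weights)
    return [w / total for w in new_weights], log(beta)

# Toy step: 5 equally weighted examples, one classifier that gets example 3 wrong.
weights = [0.2] * 5
predictions = ["A", "A", "B", "A", "B"]
labels = ["A", "A", "A", "A", "B"]
weights, alpha = adaboost_reweight(weights, predictions, labels)
print([round(w, 3) for w in weights], round(alpha, 3))  # [0.125, 0.125, 0.5, 0.125, 0.125] 1.386
```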

Boosting Example (Opitz, 1999)

Original        1  2  3  4  5  6  7  8
Training set 1  2  7  8  3  7  7  6  1
Training set 2  1  4  5  4  1  6  8  4
Training set 3  7  1  5  8  1  3  3  4
Training set 4  1  1  6  1  1  5  1  5

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


87

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Clustering

88

Clustering Problem

- Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D -> {1, ..., k} where each ti is assigned to one cluster Kj, 1 <= j <= k.
- A cluster, Kj, contains precisely those tuples mapped to it.
- Unlike the classification problem, clusters are not known a priori.

89

Clustering Examples

- Segment a customer database based on similar buying patterns.
- Group houses in a town into neighborhoods based on similar features.
- Identify new plant species.
- Identify similar Web usage patterns.

90

Clustering Example

91

Clustering Levels

(Figure: clusterings at different levels of granularity; size based.)

92

Clustering vs. Classification

- No prior knowledge
  - Number of clusters
  - Meaning of clusters
- Unsupervised learning

93

Clustering Issues

- Outlier handling
- Dynamic data
- Interpreting results
- Evaluating results
- Number of clusters
- Data to be used
- Scalability
94

Impact of Outliers on Clustering

95

Types of Clustering

- Hierarchical: nested set of clusters created.
- Partitional: one set of clusters created.
- Incremental: each element handled one at a time.
- Simultaneous: all elements handled together.
- Overlapping / Non-overlapping

96

Clustering Approaches

Clustering
- Hierarchical
  - Agglomerative
  - Divisive
- Partitional
- Categorical
- Large DB
  - Sampling
  - Compression

97

Cluster Parameters

98

Distance Between Clusters

- Single Link: smallest distance between points
- Complete Link: largest distance between points
- Average Link: average distance between points
- Centroid: distance between centroids

99

Hierarchical Clustering

Clusters are created in levels, actually creating sets of clusters at each level.

Agglomerative:
- Initially each item is in its own cluster
- Iteratively clusters are merged together
- Bottom Up

Divisive:
- Initially all items are in one cluster
- Large clusters are successively divided
- Top Down
100

Hierarchical Algorithms

- Single Link
- MST Single Link
- Complete Link
- Average Link

101

Dendrogram

- Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
- Each level shows clusters for that level.
  - Leaf: individual clusters
  - Root: one cluster
- A cluster at level i is the union of its children clusters at level i+1.


102

Levels of Clustering

103

Partitional Clustering

- Nonhierarchical
- Creates clusters in one step as opposed to several steps.
- Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
- Usually deals with static sets.

104

Partitional Algorithms

- MST
- Squared Error
- K-Means
- Nearest Neighbor
- PAM
- BEA
- GA
105

K-Means

- Initial set of clusters randomly chosen.
- Iteratively, items are moved among sets of clusters until the desired set is reached.
- A high degree of similarity among elements in a cluster is obtained.
- Given a cluster Ki = {ti1, ti2, ..., tim}, the cluster mean is mi = (1/m)(ti1 + ... + tim)

106

K-Means Example

Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
- Randomly assign means: m1 = 3, m2 = 4
- K1 = {2,3}, K2 = {4,10,12,20,30,11,25}, m1 = 2.5, m2 = 16
- K1 = {2,3,4}, K2 = {10,12,20,30,11,25}, m1 = 3, m2 = 18
- K1 = {2,3,4,10}, K2 = {12,20,30,11,25}, m1 = 4.75, m2 = 19.6
- K1 = {2,3,4,10,11,12}, K2 = {20,30,25}, m1 = 7, m2 = 25
- Stop, as the clusters with these means stay the same.
107
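The trace above can be reproduced with a minimal 1-D k-means sketch (not from the slides; the function name kmeans_1d is an illustrative choice, and it assumes no cluster ever becomes empty, which holds for this example):

```python
def kmeans_1d(points, means, max_iter=100):
    """Plain k-means on 1-D points, starting from the given initial means."""
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for p in points:                      # assign each point to the nearest mean
            idx = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[idx].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:                # stop when the means no longer change
            return clusters, means
        means = new_means
    return clusters, means

points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(points, [3.0, 4.0])
print(clusters, means)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]] [7.0, 25.0]
```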

K-Means Algorithm

108

Clustering Large Databases

- Most clustering algorithms assume a large data structure which is memory resident.
- Clustering may be performed first on a sample of the database, then applied to the entire database.
- Algorithms: BIRCH, DBSCAN, CURE

109

Desired Features for Large Databases

- One scan (or less) of the DB
- Online
- Suspendable, stoppable, resumable
- Incremental
- Work with limited main memory
- Different techniques to scan (e.g. sampling)
- Process each tuple once

110

BIRCH

- Balanced Iterative Reducing and Clustering using Hierarchies
- Incremental, hierarchical, one scan
- Save clustering information in a tree
- Each entry in the tree contains information about one cluster
- New nodes inserted in the closest entry in the tree

111

Clustering Feature

CF Triple: (N, LS, SS)
- N: number of points in the cluster
- LS: sum of the points in the cluster
- SS: sum of squares of the points in the cluster

CF Tree
- Balanced search tree
- A node has a CF triple for each child
- A leaf node represents a cluster and has a CF value for each subcluster in it
- A subcluster has a maximum diameter

112

BIRCH Algorithm

113

Comparison of Clustering Techniques

114

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


115

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Association Analysis

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Sequential Patterns

Graph Mining

117

Association Rule Problem

Given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn} where ti = {Ii1, Ii2, ..., Iik} and Iij in I, the Association Rule Problem is to identify all association rules X => Y with a minimum support and confidence.

NOTE: the support of X => Y is the same as the support of X union Y.
118

Association Rule Definitions

- Set of items: I = {I1, I2, ..., Im}
- Transactions: D = {t1, t2, ..., tn}, tj subset of I
- Itemset: {Ii1, Ii2, ..., Iik} subset of I
- Support of an itemset: percentage of transactions which contain that itemset.
- Large (frequent) itemset: itemset whose number of occurrences is above a threshold.

119

Association Rule Definitions

- Association Rule (AR): implication X => Y where X, Y subset of I and X intersect Y = empty set
- Support of AR (s) X => Y: percentage of transactions that contain X union Y
- Confidence of AR (alpha) X => Y: ratio of the number of transactions that contain X union Y to the number that contain X

120

Example: Market Basket Data

Items frequently purchased together:
  Bread => PeanutButter

Uses:
- Placement
- Advertising
- Sales
- Coupons

Objective: increase sales and reduce costs

121

Association Rules Example

I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%


122

Association Rules Example (cont'd)

123

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
- support >= minsup threshold
- confidence >= minconf threshold

Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
- Computationally prohibitive!

Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} => {Beer}   (s=0.4, c=0.67)
{Milk, Beer} => {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} => {Milk}   (s=0.4, c=0.67)
{Beer} => {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} => {Milk, Beer}   (s=0.4, c=0.5)
{Milk} => {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
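The support and confidence values listed above can be recomputed directly from the five-transaction table. The following is a minimal sketch (not from the slides; the helper names support and confidence are illustrative):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """confidence(X => Y) = support(X union Y) / support(X)."""
    return support(lhs | rhs) / support(lhs)

print(round(support({"Milk", "Diaper", "Beer"}), 2))        # 0.4
print(round(confidence({"Milk", "Diaper"}, {"Beer"}), 2))   # 0.67
print(round(confidence({"Milk", "Beer"}, {"Diaper"}), 2))   # 1.0
```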

Association Rule Techniques

1. Find large itemsets.
2. Generate rules from frequent itemsets.

126

Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   - Generate all itemsets whose support >= minsup
2. Rule Generation
   - Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

(Figure: the itemset lattice over items {A, B, C, D, E}, from the null set through the 1-itemsets A..E, the 2-itemsets AB..DE, and so on down to ABCDE.)

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation

Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database (the five-transaction table above)
- Match each transaction against every candidate
- Complexity ~ O(N M w) => expensive since M = 2^d !!!

Computational Complexity

Given d unique items:
- Total number of itemsets = 2^d
- Total number of possible association rules:

  R = sum_{k=1..d-1} [ C(d,k) * sum_{j=1..d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1

If d = 6, R = 602 rules

Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
- Complete search: M = 2^d
- Use pruning techniques to reduce M

Reduce the number of transactions (N)
- Reduce the size of N as the size of the itemset increases
- Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
- Use efficient data structures to store the candidates or transactions
- No need to match every candidate against every transaction

Reducing Number of Candidates

Apriori principle:
- If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

  For all X, Y: (X subset of Y) => s(X) >= s(Y)

- The support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support

Illustrating Apriori Principle

(Figure: the same itemset lattice; once an itemset (here {A,B}) is found to be infrequent, all of its supersets are pruned.)

Apriori

- Large Itemset Property: any subset of a large itemset is large.
- Contrapositive: if an itemset is not large, none of its supersets is large.

134

Apriori Example (cont'd)

s = 30%, alpha = 50%

135

Apriori Algorithm

Tables:
- Lk = set of k-itemsets which are frequent
- Ck = set of k-itemsets which could be frequent

Method:
- Init. Let k = 1
- Generate L1 (frequent itemsets of length 1)
- Repeat until no new frequent itemsets are identified:
  a) Generate C(k+1) candidate itemsets from Lk frequent itemsets
  b) Count the support of each candidate by scanning the DB
  c) Eliminate candidates that are infrequent, leaving only those that are frequent

Illustrating Apriori Principle

Minimum support (count) = 3

L1 (after counting the 1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1
a) No need to generate candidates involving Coke (or Eggs)

C2 (candidate 2-itemsets): {Bread,Milk}, {Bread,Beer}, {Bread,Diaper}, {Milk,Beer}, {Milk,Diaper}, {Beer,Diaper}
b) Counting:
Itemset          Count
{Bread,Milk}     3
{Bread,Beer}     2
{Bread,Diaper}   3
{Milk,Beer}      2
{Milk,Diaper}    3
{Beer,Diaper}    3
c) Filter the non-frequent ones.

L2:
Itemset          Count
{Bread,Milk}     3
{Bread,Diaper}   3
{Milk,Diaper}    3
{Beer,Diaper}    3
a) No need to generate C3 candidates involving {Bread,Beer}
b) Counting; c) Filter

L3:
Itemset               Count
{Bread,Milk,Diaper}   3

Apriori-Gen Example

138

Apriori-Gen Example (cont'd)

139

Apriori Adv/Disadv

Advantages:
- Uses the large itemset property.
- Easily parallelized
- Easy to implement.

Disadvantages:
- Assumes the transaction database is memory resident.
- Requires up to m database scans.
140

Sampling

- Large databases
- Sample the database and apply Apriori to the sample.
- Potentially Large Itemsets (PL): large itemsets from the sample
- Negative Border (BD-): generalization of Apriori-Gen applied to itemsets of varying sizes. The minimal set of itemsets which are not in PL, but whose subsets are all in PL.

141

Negative Border Example

(Figure: a set PL of itemsets, and the enlarged set PL united with BD-(PL).)
142

Sampling Algorithm

1. Ds = sample of database D;
2. PL = large itemsets in Ds using smalls;
3. C = PL union BD-(PL);
4. Count C in the database using s;
5. ML = large itemsets in BD-(PL);
6. If ML = empty set, then done;
7. else C = repeated application of BD-;
8. Count C in the database;
143

Sampling Example

- Find ARs assuming s = 20%
- Ds = {t1, t2}
- Smalls = 10%
- PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}
- BD-(PL) = {{Beer}, {Milk}}
- ML = {{Beer}, {Milk}}
- Repeated application of BD- generates all remaining itemsets
144

Sampling Adv/Disadv

Advantages:
- Reduces the number of database scans to one in the best case and two in the worst.
- Scales better.

Disadvantages:
- Potentially large number of candidates in the second pass

145

Partitioning

- Divide the database into partitions D1, D2, ..., Dp
- Apply Apriori to each partition
- Any large itemset must be large in at least one partition.

146

Partitioning Algorithm

1. Divide D into partitions D1, D2, ..., Dp;
2. For i = 1 to p do
3.   Li = Apriori(Di);
4. C = L1 union ... union Lp;
5. Count C on D to generate L;

147

Partitioning Example

D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}

D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}}

S = 10%

148

Partitioning Adv/Disadv

Advantages:
- Adapts to available main memory
- Easily parallelized
- Maximum number of database scans is two.

Disadvantages:
- May have many candidates during the second scan.

149

Parallelizing AR Algorithms

Based on Apriori. Techniques differ in:
- What is counted at each site
- How data (transactions) are distributed

Data Parallelism
- Data partitioned
- Count Distribution Algorithm

Task Parallelism
- Data and candidates partitioned
- Data Distribution Algorithm
150

Count Distribution Algorithm (CDA)

1. Place the data partition at each site.
2. In parallel, at each site do
3.   C1 = itemsets of size one in I;
4.   Count C1;
5.   Broadcast counts to all sites;
6.   Determine global large itemsets of size 1, L1;
7.   i = 1;
8.   Repeat
9.     i = i + 1;
10.    Ci = Apriori-Gen(Li-1);
11.    Count Ci;
12.    Broadcast counts to all sites;
13.    Determine global large itemsets of size i, Li;
14.  until no more large itemsets found;
151

CDA Example

152

Data Distribution Algorithm (DDA)

1. Place the data partition at each site.
2. In parallel, at each site do
3.   Determine local candidates of size 1 to count;
4.   Broadcast local transactions to other sites;
5.   Count local candidates of size 1 on all data;
6.   Determine large itemsets of size 1 for local candidates;
7.   Broadcast large itemsets to all sites;
8.   Determine L1;
9.   i = 1;
10.  Repeat
11.    i = i + 1;
12.    Ci = Apriori-Gen(Li-1);
13.    Determine local candidates of size i to count;
14.    Count, broadcast, and find Li;
15.  until no more large itemsets found;
153

DDA Example

154

Comparing AR Techniques

- Target
- Type
- Data type
- Data source
- Technique
- Itemset strategy and data structure
- Transaction strategy and data structure
- Optimization
- Architecture
- Parallelism strategy
155

Comparison of AR Techniques

156

Incremental Association Rules

- Generate ARs in a dynamic database.
- Problem: algorithms assume a static database
- Objective:
  - Know large itemsets for D
  - Find large itemsets for D union {delta D}
- Must be large in either D or delta D
- Save Li and counts


157

Note on ARs

Many applications outside market basket data analysis
- Prediction (telecom switch failure)
- Web usage mining

Many different types of association rules
- Temporal
- Spatial
- Causal
158

Association rules: Evaluation

Association rule algorithms tend to produce too many rules
- many of them are uninteresting or redundant
- redundant if {A,B,C} => {D} and {A,B} => {D} have the same support & confidence

In the original formulation of association rules, support & confidence are the only measures used.
Interestingness measures can be used to prune/rank the derived patterns.

Measuring Quality of Rules

- Support
- Confidence
- Interest
- Conviction
- Chi-Squared Test

160

Advanced AR Techniques

- Generalized Association Rules
- Multiple minimum supports
- Multiple Level Association Rules
- Quantitative Association Rules
- Using multiple minimum supports
- Correlation Rules
- Sequential pattern mining
- Graph mining
- Mining association rules in stream data
- Fuzzy association rules
- Anomalous association rules
161

Extensions: Handling Continuous Attributes

Different kinds of rules:
- Age in [21,35) and Salary in [70k,120k) => Buy
- Salary in [70k,120k) and Buy => Age: mean = 28, sigma = 4

Different methods:
- Discretization-based
- Statistics-based
- Non-discretization based (minApriori)

Extensions: Sequential pattern mining

Sequence Database   Sequence                                        Element (Transaction)                                                        Event (Item)
Customer            Purchase history of a given customer            A set of items bought by a customer at time t                                Books, dairy products, CDs, etc.
Web Data            Browsing activity of a particular Web visitor   A collection of files viewed by a Web visitor after a single mouse click     Home page, index page, contact info, etc.
Event data          History of events generated by a given sensor   Events triggered by a sensor at time t                                       Types of alarms generated by sensors
Genome sequences    DNA sequence of a particular species            An element of the DNA sequence                                               Bases A, T, G, C

(Figure: a sequence is an ordered list of elements (transactions), e.g. <{E1,E2} {E1,E3} {E2} {E2} {E3,E4}>, where each element is a set of events (items).)

Extensions: Sequential pattern mining

Sequence Database:
Object  Timestamp  Events
A       10         2, 3, 5
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       17         2
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 7, 8

Extensions: Sequential pattern mining

A sequence <a1 a2 ... an> is contained in another sequence <b1 b2 ... bm> (m >= n) if there exist integers i1 < i2 < ... < in such that a1 is a subset of b_i1, a2 is a subset of b_i2, ..., an is a subset of b_in.

Data sequence              Subsequence     Contain?
< {2,4} {3,5,6} {8} >      < {2} {3,5} >   Yes
< {1,2} {3,4} >            < {1} {2} >     No
< {2,4} {2,4} {2,5} >      < {2} {4} >     Yes

The support of a subsequence w is defined as the fraction of data sequences that contain w. A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is >= minsup).
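The containment test in the table can be written as a short greedy scan. The sketch below (not from the slides; the function name contains is an illustrative choice) reproduces the three Yes/No/Yes answers:

```python
def contains(data_seq, sub_seq):
    """True if sub_seq = <a1 ... an> is contained in data_seq = <b1 ... bm>,
    i.e. there are indices i1 < i2 < ... < in with each a_j a subset of b_{i_j}."""
    i = 0
    for element in data_seq:                          # scan the data sequence left to right
        if i < len(sub_seq) and sub_seq[i] <= element:
            i += 1                                    # greedily match the next element of sub_seq
    return i == len(sub_seq)

print(contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}]))  # True
print(contains([{1, 2}, {3, 4}], [{1}, {2}]))             # False
print(contains([{2, 4}, {2, 4}, {2, 5}], [{2}, {4}]))     # True
```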

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


166

Extensions: Graph mining

- Extend association rule mining to finding frequent subgraphs
- Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.

(Figure: example web graph with nodes Homepage, Research, Artificial Intelligence, Databases, Data Mining.)

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


168

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Link Mining

169

Link Mining

Traditional machine learning and data mining approaches assume:
- A random sample of homogeneous objects from a single relation

Real-world data sets:
- Multi-relational, heterogeneous and semi-structured

Link Mining
- A newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining.

Linked Data

Heterogeneous, multi-relational data represented as a graph or network

Nodes are objects
- May have different kinds of objects
- Objects have attributes
- Objects may have labels or classes

Edges are links
- May have different kinds of links
- Links may have attributes
- Links may be directed, and are not required to be binary

Sample Domains

- web data (web)
- bibliographic data (cite)
- epidemiological data (epi)

Example: Linked Bibliographic Data

(Figure: papers P1-P4, author A1, institution I1.)
Objects: Papers, Authors, Institutions. Attributes: Categories.
Links: Citation, Co-Citation, Author-of, Author-affiliation.

Link Mining Tasks

- Link-based Object Classification
- Link Type Prediction
- Predicting Link Existence
- Link Cardinality Estimation
- Object Identification
- Subgraph Discovery

Web Mining Outline

175

Web Data

- Web pages
- Intra-page structures
- Inter-page structures
- Usage data
- Supplemental data
  - Profiles
  - Registration information
  - Cookies
176

Web Content Mining

Extends the work of basic search engines.

Search Engines:
- IR application
- Keyword-based
- Similarity between query and document
- Crawlers
- Indexing
- Profiles
- Link analysis
177

Web Structure Mining

- Mine the structure (links, graph) of the Web
- Techniques: PageRank, CLEVER
- Create a model of the Web organization.
- May be combined with content mining to more effectively retrieve important pages.

178

PageRank (Larry Page and Sergey Brin)

- Used by Google
- Prioritize pages returned from search by looking at Web structure.
- Importance of a page is calculated based on the number of pages which point to it - backlinks.
- Weighting is used to provide more importance to backlinks coming from important pages.

179

PageRank (cont'd)

  PR(p) = c (PR(1)/N1 + ... + PR(n)/Nn)

- PR(i): PageRank for a page i which points to the target page p.
- Ni: number of links coming out of page i

180
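A minimal sketch of the iterative computation (not from the slides). It uses the common damped variant (1 - c) + c * sum(PR(i)/N_i), which is an assumption of this sketch; the slide writes only the constant factor c. The three-page graph is made up for illustration:

```python
def pagerank(links, c=0.85, iterations=50):
    """Iteratively apply PR(p) = (1 - c) + c * sum(PR(i)/N_i over pages i linking to p).
    `links` maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    out_degree = {p: len(targets) for p, targets in links.items()}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[q] / out_degree[q] for q in pages if p in links[q])
            new_pr[p] = (1 - c) + c * incoming
        pr = new_pr
    return pr

# Tiny made-up graph: A and C link to B, B links to A and C.
links = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print({p: round(v, 3) for p, v in pagerank(links).items()})  # B ends up with the highest rank
```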

PageRank (cont'd)

General Principle

(Figure: small example graph with pages A, B, C, D.)

Every page has some number of outbound links (forward links) and inbound links (backlinks). A page X has a high rank if:
- It has many inbound links
- It has highly ranked inbound links
- The pages linking to it have few outbound links.
181

HITS

- Hyperlink-Induced Topic Search
- Based on a set of keywords, find a set of relevant pages, R.
- Identify hub and authority pages for these:
  - Expand R to a base set, B, of pages linked to or from R.
  - Calculate weights for authorities and hubs.
- Pages with the highest ranks in R are returned.

Authoritative Pages:
- Highly important pages.
- Best source for requested information.

Hub Pages:
- Contain links to highly important pages.
182

HITS Algorithm

183

Web Usage Mining

Extends the work of basic search engines.

Search Engines:
- IR application
- Keyword-based
- Similarity between query and document
- Crawlers
- Indexing
- Profiles
- Link analysis
184

Web Usage Mining Applications

- Personalization
- Improve structure of a site's Web pages
- Aid in caching and prediction of future page references
- Improve design of individual pages
- Improve effectiveness of e-commerce (sales and advertising)

185

Web Usage Mining Activities

Preprocessing the Web log
- Cleanse
- Remove extraneous information
- Sessionize
  - Session: sequence of pages referenced by one user at a sitting.

Pattern Discovery
- Count patterns that occur in sessions
- A pattern is a sequence of page references in a session.
- Similar to association rules
  - Transaction: session
  - Itemset: pattern (or subset)
  - Order is important

Pattern Analysis
186

ICDM '06 Panel on Top 10 Algorithms in Data Mining

Integrated Mining. Rough Sets

Integrated Mining

- On the use of association rule algorithms for classification: CBA
- Subgroup Discovery: characterization of classes

  Given a population of individuals and a property of those individuals we are interested in, find population subgroups that are statistically "most interesting", e.g., are as large as possible and have the most unusual statistical characteristics with respect to the property of interest.
188

Rough Set Approach

- Rough sets are used to approximately or "roughly" define equivalence classes
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
- Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computation intensity

189

ICDM '06 Panel on Top 10 Algorithms in Data Mining

190

ICDM '06 Panel on Top 10 Algorithms in Data Mining

191

ICDM '06 Panel on Top 10 Algorithms in Data Mining

192

ICDM '06 Panel on Top 10 Algorithms in Data Mining

A survey paper has been generated:

Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems (2008) 14:1-37.

193

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


194

Data Mining - From the Top 10 Algorithms to the New Challenges

Outline

Top 10 Algorithms in Data Mining Research
- Introduction
- Classification
- Statistical Learning
- Bagging and Boosting
- Clustering
- Association Analysis
- Link Mining
- Text Mining
- Top 10 Algorithms

10 Challenging Problems in Data Mining Research

Concluding Remarks


195
