
Supervised and Unsupervised Learning
Ciro Donalek
Ay/Bi 199, April 2011

Summary
- KDD and Data Mining Tasks
- Finding the optimal approach
- Supervised Models
  - Neural Networks
  - Multi-Layer Perceptron
  - Decision Trees
- Unsupervised Models
  - Different Types of Clustering
  - Distances and Normalization
  - K-means
  - Self-Organizing Maps
- Combining different models
  - Committee Machines
  - Introducing a Priori Knowledge
  - Sleeping Expert Framework

Knowledge Discovery in Databases
KDD may be defined as: "The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data".
KDD is an interactive and iterative process involving several steps.

You got your data: what's next?
What kind of analysis do you need? Which model is more appropriate for it?

Clean your data!
Data preprocessing transforms the raw data into a format that will be more easily and effectively processed for the purpose of the user.
Some tasks:
- sampling: selects a representative subset from a large population of data;
- noise treatment;
- strategies to handle missing data: sometimes your rows will be incomplete, as not all parameters are measured for all samples;
- normalization;
- feature extraction: pulls out specified data that is significant in some particular context.
Use standard formats!

Missing Data
Missing data are a part of almost all research, and we all have to decide how to deal with them.
- Complete Case Analysis: use only rows with all the values.
- Available Case Analysis.
- Substitution:
  - Mean Value: replace the missing value with the mean value for that particular attribute;
  - Regression Substitution: replace the missing value with a historical value from similar cases;
  - Matching Imputation: for each unit with a missing y, find a unit with similar values of x in the observed data and take its y value;
  - Maximum Likelihood, EM, etc.
Some DM models can deal with missing data better than others. Which technique to adopt really depends on your data.
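Mean-value substitution, the simplest of the techniques above, can be sketched as follows (a minimal illustration using NumPy, with missing entries encoded as NaN):

```python
import numpy as np

def mean_impute(data):
    """Replace NaNs in each column with that column's observed mean."""
    data = data.astype(float).copy()
    for col in range(data.shape[1]):
        column = data[:, col]
        observed = column[~np.isnan(column)]
        column[np.isnan(column)] = observed.mean()
    return data

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
print(mean_impute(X))  # NaNs become the column means: 2.0 and 3.0
```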

Data Mining
Data Mining is a crucial task within the KDD process: it is about automating the process of searching for patterns in the data.
In more detail, the most relevant DM tasks are:
- association
- sequence or path analysis
- clustering
- classification
- regression
- visualization

Finding Solutions via Purposes
You have your data: what kind of analysis do you need?
Regression
- predict new values based on the past, inference;
- compute the new values for a dependent variable based on the values of one or more measured attributes.
Classification
- divide samples into classes;
- use a training set of previously labeled data.
Clustering
- partition a data set into subsets (clusters), so that the data in each subset ideally share some common characteristics.
Classification is in some ways similar to clustering, but it requires that the analyst know ahead of time how the classes are defined.

Cluster Analysis
How many clusters do you expect?
Search for outliers.

Classification
A data mining technique used to predict group membership for data instances. There are two ways to assign a new value to a given class.
Crisp classification
- given an input, the classifier returns its label.
Probabilistic classification
- given an input, the classifier returns its probabilities of belonging to each class;
- useful when some mistakes can be more costly than others (e.g., give me only data with probability > 90%);
- winner-takes-all and other rules:
  - assign the object to the class with the highest probability (WTA);
  - but only if its probability is greater than 40% (WTA with thresholds).
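The WTA-with-threshold rule above can be sketched in a few lines (the class names and probabilities are illustrative; the 40% threshold is the one quoted in the slide):

```python
def wta_with_threshold(probs, threshold=0.40):
    """Return the most probable class, or None if below the threshold."""
    best_class = max(probs, key=probs.get)
    if probs[best_class] > threshold:
        return best_class
    return None  # too uncertain: leave the object unclassified

print(wta_with_threshold({"star": 0.55, "galaxy": 0.30, "artifact": 0.15}))  # star
print(wta_with_threshold({"star": 0.35, "galaxy": 0.33, "artifact": 0.32}))  # None
```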

Regression / Forecasting
Data table statistical correlation:
- mapping without any prior assumption on the functional form of the data distribution;
- machine learning algorithms are well suited for this.
Curve fitting:
- find a well defined and known function underlying your data;
- theory / expertise can help.

Machine Learning
To learn: to get knowledge of by study, experience, or being taught.
Types of Learning:
- Supervised
- Unsupervised

Unsupervised Learning
- The model is not provided with the correct results during the training.
- It can be used to cluster the input data into classes on the basis of their statistical properties only.
- Cluster significance and labeling.
- The labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes.

Supervised Learning
- Training data include both the input and the desired results.
- For some examples the correct results (targets) are known and are given in input to the model during the learning process.
- The construction of a proper training, validation and test set (Bok) is crucial.
- These methods are usually fast and accurate.
- They have to be able to generalize: give the correct results when new data are given in input without knowing a priori the target.

Generalization
Refers to the ability to produce reasonable outputs for inputs not encountered during the training.
In other words: NO PANIC when "never seen before" data are given in input!

A common problem: OVERFITTING
- Learning the data and not the underlying function.
- Performs well on the data used during the training and poorly with new data.
How to avoid it: use proper subsets, early stopping.

Datasets
- Training set: a set of examples used for learning, where the target value is known.
- Validation set: a set of examples used to tune the architecture of a classifier and estimate the error.
- Test set: used only to assess the performance of a classifier. It is never used during the training process, so that the error on the test set provides an unbiased estimate of the generalization error.
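A minimal sketch of carving a dataset into the three subsets above, using NumPy (the 60/20/20 proportions and the shuffling seed are illustrative choices, not from the slides):

```python
import numpy as np

def split_dataset(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the data, then carve off validation and test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

X = np.arange(100).reshape(50, 2)
y = np.arange(50)
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 30 10 10
```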

IRIS dataset
IRIS:
- consists of 3 classes, 50 instances each;
- 4 numerical attributes (sepal and petal length and width, in cm);
- each class refers to a type of Iris plant (Setosa, Versicolor, Virginica);
- the first class is linearly separable from the other two, while the 2nd and the 3rd are not linearly separable.

Artifacts Dataset
PQ Artifacts:
- 2 main classes and 4 numerical attributes;
- the classes are: true objects, artifacts.

Data Selection
- Garbage in, garbage out: training, validation and test data must be representative of the underlying model.
- All eventualities must be covered.
- Unbalanced datasets:
  - since the network minimizes the overall error, the proportion of types of data in the set is critical;
  - inclusion of a loss matrix (Bishop, 1995);
  - often, the best approach is to ensure even representation of different cases, then to interpret the network's decisions accordingly.

Artificial Neural Network
An Artificial Neural Network is an information processing paradigm that is inspired by the way biological nervous systems process information: a large number of highly interconnected simple processing elements (neurons) working together to solve specific problems.

A simple artificial neuron
The basic computational element is often called a node or unit. It receives input from some other units, or from an external source. Each input has an associated weight w, which can be modified so as to model synaptic learning.
The unit computes some function of the weighted sum of its inputs:

    y = f( sum_i w_i x_i )
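The neuron just described can be sketched directly: a weighted sum of the inputs passed through an activation function (a sigmoid is assumed here for illustration; the slides do not fix a particular choice):

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of inputs plus bias, squashed by a sigmoid."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-s))

print(neuron([1.0, 0.5], [0.4, -0.2]))  # sigmoid(0.3) ≈ 0.574
```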

Neural Networks
A Neural Network is usually structured into an input layer of neurons, one or more hidden layers and one output layer.
Neurons belonging to adjacent layers are usually fully connected, and the various types and architectures are identified both by the different topologies adopted for the connections as well as by the choice of the activation function.
The values of the functions associated with the connections are called weights.
The whole game of using NNs is in the fact that, in order for the network to yield appropriate outputs for given inputs, the weights must be set to suitable values.
The way this is obtained allows a further distinction among modes of operation.

Neural Networks: types
- Feedforward: Single Layer Perceptron, MLP, ADALINE (Adaptive Linear Neuron), RBF.
- Self-Organized: SOM (Kohonen Maps).
- Recurrent: Simple Recurrent Network, Hopfield Network.
- Stochastic: Boltzmann machines, RBM.
- Modular: Committee of Machines, ASNN (Associative Neural Networks), Ensembles.
- Others: Instantaneously Trained, Spiking (SNN), Dynamic, Cascades, Neuro-Fuzzy, PPS, GTM.

Multi-Layer Perceptron
The MLP is one of the most used supervised models: it consists of multiple layers of computational units, usually interconnected in a feedforward way.
Each neuron in one layer has direct connections to all the neurons of the subsequent layer.

Learning Process
Back-propagation:
- the output values are compared with the target to compute the value of some predefined error function;
- the error is then fed back through the network;
- using this information, the algorithm adjusts the weights of each connection in order to reduce the value of the error function.
After repeating this process for a sufficiently large number of training cycles, the network will usually converge.
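The three back-propagation steps above can be sketched for a tiny MLP trained on the XOR problem (NumPy; the network size, learning rate, epoch count and random seed are all illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

W1 = rng.normal(0, 1, (2, 4))   # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1))   # hidden -> output weights
b2 = np.zeros(1)
lr = 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1 + b1)          # hidden layer activations
    return h, sigmoid(h @ W2 + b2)    # output layer

for epoch in range(5000):
    h, y = forward(X)
    err = y - t                              # compare output with target
    d2 = err * y * (1 - y)                   # output-layer delta
    d1 = (d2 @ W2.T) * h * (1 - h)           # error fed back to hidden layer
    W2 -= lr * h.T @ d2; b2 -= lr * d2.sum(0)  # adjust each connection
    W1 -= lr * X.T @ d1; b1 -= lr * d1.sum(0)

_, y = forward(X)
print(np.round(y.ravel(), 2))  # should approach [0, 1, 1, 0]
```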

Hidden Units
The best number of hidden units depends on:
- the number of inputs and outputs;
- the number of training cases;
- the amount of noise in the targets;
- the complexity of the function to be learned;
- the activation function.
Too few hidden units => high training and generalization error, due to underfitting and high statistical bias.
Too many hidden units => low training error but high generalization error, due to overfitting and high variance.
Rules of thumb don't usually work.

Activation and Error Functions

Activation Functions

Results: confusion matrix

Results: completeness and contamination
Exercise: compute completeness and contamination for the previous confusion matrix (test set).
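The two quantities in the exercise can be computed from a confusion matrix as sketched below (rows are assumed to be true classes and columns predicted classes; the 2x2 example matrix is illustrative, not the one from the slides):

```python
import numpy as np

def completeness(cm, k):
    """Fraction of objects of true class k that were classified as k."""
    return cm[k, k] / cm[k, :].sum()

def contamination(cm, k):
    """Fraction of objects classified as k that belong to other classes."""
    return 1.0 - cm[k, k] / cm[:, k].sum()

cm = np.array([[45, 5],    # true class 0: 45 right, 5 misclassified
               [10, 40]])  # true class 1: 40 right, 10 misclassified
print(completeness(cm, 0))   # 45 / 50 = 0.9
print(contamination(cm, 0))  # 10 / 55 ≈ 0.18
```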

Decision Trees
A decision tree is another classification method: it is a set of simple rules, such as "if the sepal length is less than 5.45, classify the specimen as setosa."
Decision trees are also nonparametric, because they do not require any assumptions about the distribution of the variables in each class.
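The kind of rule set a decision tree encodes can be sketched directly; the sepal-length threshold is the one quoted above, while the second rule (a petal-width threshold) is an illustrative assumption, not from the slides:

```python
def classify_iris(sepal_length, petal_width):
    """Toy two-rule decision tree for the Iris classes."""
    if sepal_length < 5.45:
        return "setosa"       # rule from the slide
    elif petal_width < 1.75:  # hypothetical second split
        return "versicolor"
    else:
        return "virginica"

print(classify_iris(5.0, 0.2))  # setosa
print(classify_iris(6.0, 1.4))  # versicolor
```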


Unsupervised Learning
- The model is not provided with the correct results during the training.
- It can be used to cluster the input data into classes on the basis of their statistical properties only.
- Cluster significance and labeling.
- The labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes.

Types of Clustering
- HIERARCHICAL: finds successive clusters using previously established clusters;
  - agglomerative (bottom-up): start with each element in a separate cluster and merge them according to a given property;
  - divisive (top-down).
- PARTITIONAL: usually determines all clusters at once.

Distances
Distances determine the similarity between two clusters and the shape of the clusters.
In the case of strings:
- The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. It measures the minimum number of substitutions required to change one string into the other.
- The Levenshtein (edit) distance is a metric for measuring the amount of difference between two sequences. It is defined as the minimum number of edits needed to transform one string into the other.
Examples:
  1001001
  1000100
  HD = 3

  LD(BIOLOGY, BIOLOGIA) = 2
  BIOLOGY -> BIOLOGI (substitution)
  BIOLOGI -> BIOLOGIA (insertion)
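Both distances can be sketched in a few lines; the Levenshtein version below is the standard dynamic-programming formulation:

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

print(hamming("1001001", "1000100"))       # 3, as in the slide
print(levenshtein("BIOLOGY", "BIOLOGIA"))  # 2, as in the slide
```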

Normalization
VAR: the mean of each attribute of the transformed set of data points is reduced to zero by subtracting the mean of each attribute from the values of the attributes and dividing the result by the standard deviation of the attribute.
RANGE (Min-Max Normalization): subtracts the minimum value of an attribute from each value of the attribute and then divides the difference by the range of the attribute. It has the advantage of preserving exactly all relationships in the data, without adding any bias.
SOFTMAX: a way of reducing the influence of extreme values or outliers in the data without removing them from the dataset. It is useful when you have outlier data that you wish to include in the dataset while still preserving the significance of data within a standard deviation of the mean.
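Minimal sketches of the three normalizations above (NumPy assumed; the softmax variant follows the common "softmax scaling" recipe of squashing the z-score through a logistic function):

```python
import numpy as np

def var_norm(x):
    """VAR: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()

def range_norm(x):
    """RANGE (min-max): rescale to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def softmax_norm(x):
    """SOFTMAX: logistic squashing of the z-score; outliers are tamed."""
    return 1.0 / (1.0 + np.exp(-var_norm(x)))

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(range_norm(x))    # the outlier crushes the rest toward 0
print(softmax_norm(x))  # the outlier stays in, but its influence shrinks
```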

K-Means

K-Means: how it works
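The standard k-means loop can be sketched in NumPy: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignments stop changing (k = 2 and the toy data are illustrative choices):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assignment step: nearest centroid by Euclidean distance
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: centroid = mean of the points assigned to it
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, _ = kmeans(X, 2)
print(labels)  # the two tight pairs end up in the same cluster
```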

K-Means: pros and cons

Learning K
Find a balance between two variables: the number of clusters (K) and the average variance of the clusters; minimize both values.
As the number of clusters increases, the average variance decreases (up to the trivial case of k = n and variance = 0).
Some criteria:
- BIC (Bayesian Information Criterion)
- AIC (Akaike Information Criterion)
- Davies-Bouldin Index
- Confusion Matrix
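The balance described above can be sketched with a BIC-style criterion: the fit term rewards lower average variance, while the penalty grows with K, so the trivial "more clusters" solution loses. The variance values below are illustrative numbers, not from the slides:

```python
import math

def bic_score(n, k, variance):
    """Least-squares BIC approximation: fit term plus complexity penalty."""
    return n * math.log(variance) + k * math.log(n)

n = 100  # hypothetical number of data points
avg_variance = {1: 9.0, 2: 2.5, 3: 1.0, 4: 0.98, 5: 0.97}  # shrinks with k
scores = {k: bic_score(n, k, v) for k, v in avg_variance.items()}
best_k = min(scores, key=scores.get)
print(best_k)  # 3: beyond that, the penalty outweighs the variance gain
```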

Self-Organizing Maps

SOM topology

SOM Prototypes

SOM Training

Competitive and Cooperative Learning

SOM Update Rule

Parameters

DM with SOM

SOM Labeling

Localizing Data

Cluster Structure

Cluster Structure 2

Component Planes

Relative Importance

How accurate is your clustering?

Trajectories

Combining Models

Committee Machines

A priori knowledge

Sleeping Experts
