Chapter 20: Data Analysis

Outline
- Decision Support Systems
- Data Warehousing
- Data Mining
  - Classification
  - Association Rules
  - Clustering
Decision Support Systems
- Decision-support systems are used to make business decisions, often based on data collected by online transaction-processing systems.
- Examples of business decisions:
  - What items to stock?
  - What insurance premium to charge?
  - To whom to send advertisements?
- Examples of data used for making decisions:
  - Retail sales transaction details
  - Customer profiles (income, age, gender, etc.)
Decision-Support Systems: Overview
- Data analysis tasks are simplified by specialized tools and SQL extensions
- Example tasks:
  - For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year?
  - As above, for each product category and each customer category
- Statistical analysis packages (e.g., S++) can be interfaced with databases
  - Statistical analysis is a large field, but not covered here
- Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.
- A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
  - Important for large businesses that generate data from multiple divisions, possibly at multiple sites
  - Data may also be purchased externally
Data Warehousing
- Data sources often store only current data, not historical data
- Corporate decision making requires a unified view of all organizational data, including historical data
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
  - Greatly simplifies querying, permits study of historical trends
  - Shifts decision support query load away from transaction processing systems
Data Warehousing (architecture figure)
Design Issues
- When and how to gather data
  - Source-driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g., at night)
  - Destination-driven architecture: warehouse periodically requests new information from data sources
  - Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive
    - Usually OK to have slightly out-of-date data at warehouse
    - Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
- What schema to use
  - Schema integration
More Warehouse Design Issues
- Data cleansing
  - E.g., correct mistakes in addresses (misspellings, zip code errors)
  - Merge address lists from different sources and purge duplicates
- How to propagate updates
  - Warehouse schema may be a (materialized) view of schema from data sources
- What data to summarize
  - Raw data may be too large to store online
  - Aggregate values (totals/subtotals) often suffice
  - Queries on raw data can often be transformed by query optimizer to use aggregate values
Warehouse Schemas
- Dimension values are usually encoded using small integers and mapped to full values via dimension tables
- Resultant schema is called a star schema
- More complicated schema structures
  - Snowflake schema: multiple levels of dimension tables
  - Constellation: multiple fact tables
Data Warehouse Schema (star schema figure)
Data Mining
- Data mining is the process of semi-automatically analyzing large databases to find useful patterns
- Prediction based on past history
  - Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history
  - Predict if a pattern of phone calling card usage is likely to be fraudulent
- Some examples of prediction mechanisms:
  - Classification: given a new item whose class is unknown, predict to which class it belongs
  - Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value
Data Mining (Cont.)
- Descriptive patterns
  - Associations
    - Find books that are often bought by "similar" customers. If a new such customer buys one such book, suggest the others too.
    - Associations may be used as a first step in detecting causation
      - E.g., association between exposure to chemical X and cancer
  - Clusters
    - E.g., typhoid cases were clustered in an area surrounding a contaminated well
    - Detection of clusters remains important in detecting epidemics
Classification Rules
- Classification rules help assign new objects to classes.
  - E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
- Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.
  - ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications
- Classification rules can be shown compactly as a decision tree.
Decision Tree (figure)
Construction of Decision Trees
- Training set: a data sample in which the classification is already known.
- Greedy top-down generation of decision trees.
  - Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition for the node
  - Leaf node: all (or most) of the items at the node belong to the same class, or all attributes have been considered, and no further partitioning is possible.
Best Splits
- Pick best attributes and conditions on which to partition
- The purity of a set S of training instances can be measured quantitatively in several ways.
  - Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = p_i.
- The Gini measure of purity is defined as
      Gini(S) = 1 − Σ_{i=1}^{k} p_i²
  - When all instances are in a single class, the Gini value is 0
  - It reaches its maximum (of 1 − 1/k) if each class has the same number of instances.
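The Gini formula above can be sketched in a few lines of Python. This is a minimal illustration, not code from the text; the class labels are invented.

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum over classes of p_i^2, where p_i is the
    fraction of instances of S in class i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# All instances in one class: maximally pure, Gini = 0.
print(gini(["good", "good", "good"]))        # 0.0
# k = 2 classes of equal size: maximum value 1 - 1/k = 0.5.
print(gini(["good", "bad", "good", "bad"]))  # 0.5
```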
Best Splits (Cont.)
- Another measure of purity is the entropy measure, which is defined as
      entropy(S) = − Σ_{i=1}^{k} p_i log₂ p_i
- When a set S is split into multiple sets S_i, i = 1, 2, ..., r, we can measure the purity of the resultant set of sets as:
      purity(S1, S2, ..., Sr) = Σ_{i=1}^{r} (|S_i| / |S|) purity(S_i)
- The information gain due to a particular split of S into S_i, i = 1, 2, ..., r:
      Information-gain(S, {S1, S2, ..., Sr}) = purity(S) − purity(S1, S2, ..., Sr)
Best Splits (Cont.)
- Measure of "cost" of a split:
      Information-content(S, {S1, S2, ..., Sr}) = − Σ_{i=1}^{r} (|S_i| / |S|) log₂ (|S_i| / |S|)
- Information-gain ratio = Information-gain(S, {S1, S2, ..., Sr}) / Information-content(S, {S1, S2, ..., Sr})
- The best split is the one that gives the maximum information gain ratio
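The entropy, information-gain, and gain-ratio formulas on the last two slides can be sketched as follows. Purity is measured here via entropy (lower entropy = higher purity), and the perfect two-way split is an invented illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """entropy(S) = -sum p_i log2 p_i over the classes present in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(subsets):
    """Weighted entropy of S1..Sr: sum (|Si|/|S|) * entropy(Si)."""
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * entropy(s) for s in subsets)

def information_gain(labels, subsets):
    return entropy(labels) - split_entropy(subsets)

def information_content(subsets):
    """Cost of the split: -sum (|Si|/|S|) log2 (|Si|/|S|)."""
    n = sum(len(s) for s in subsets)
    return -sum(len(s) / n * math.log2(len(s) / n) for s in subsets)

def gain_ratio(labels, subsets):
    return information_gain(labels, subsets) / information_content(subsets)

S = ["excellent", "excellent", "good", "good"]
split = [["excellent", "excellent"], ["good", "good"]]  # a perfect split
print(information_gain(S, split))  # 1.0: all of entropy(S) is gained
print(gain_ratio(S, split))        # 1.0
```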
Finding Best Splits
- Categorical attributes (with no meaningful order):
  - Multi-way split, one child for each value
  - Binary split: try all possible breakups of values into two sets, and pick the best
- Continuous-valued attributes (can be sorted in a meaningful order):
  - Binary split: sort values, try each as a split point
    - E.g., if values are 1, 10, 15, 25, split at ≤ 1, ≤ 10, ≤ 15
    - Pick the value that gives best split
  - Multi-way split: a series of binary splits on the same attribute has roughly equivalent effect
Decision-Tree Construction Algorithm

Procedure GrowTree(S)
    Partition(S);

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then
        return;
    for each attribute A
        evaluate splits on attribute A;
    Use best split found (across all attributes) to partition S into S1, S2, ..., Sr;
    for i = 1, 2, ..., r
        Partition(Si);
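A runnable Python sketch of the greedy Partition procedure, using the Gini measure for purity and multi-way splits on categorical attributes. The thresholds `purity_cutoff` and `size_cutoff` play the role of δp and δs, and the toy credit data is invented for illustration.

```python
from collections import Counter

def gini(rows):
    n = len(rows)
    return 1.0 - sum((c / n) ** 2
                     for c in Counter(r["class"] for r in rows).values())

def partition(rows, attrs, purity_cutoff=0.1, size_cutoff=2):
    # Stop when the node is pure enough, too small, or out of
    # attributes: return a leaf labelled with the majority class.
    if gini(rows) <= purity_cutoff or len(rows) < size_cutoff or not attrs:
        return Counter(r["class"] for r in rows).most_common(1)[0][0]
    # Evaluate a multi-way split on each attribute; pick the one with
    # the lowest weighted Gini over the resulting groups.
    def split_cost(a):
        groups = {}
        for r in rows:
            groups.setdefault(r[a], []).append(r)
        return sum(len(g) / len(rows) * gini(g) for g in groups.values()), groups
    best_attr = min(attrs, key=lambda a: split_cost(a)[0])
    _, groups = split_cost(best_attr)
    children = {v: partition(g, [a for a in attrs if a != best_attr],
                             purity_cutoff, size_cutoff)
                for v, g in groups.items()}
    return (best_attr, children)

def classify(tree, row):
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[row[attr]]
    return tree

data = [
    {"degree": "masters",   "income": "high", "class": "excellent"},
    {"degree": "masters",   "income": "high", "class": "excellent"},
    {"degree": "bachelors", "income": "mid",  "class": "good"},
    {"degree": "bachelors", "income": "low",  "class": "good"},
]
tree = partition(data, ["degree", "income"])
print(classify(tree, {"degree": "masters", "income": "high"}))  # excellent
```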
Other Types of Classifiers
- Neural net classifiers are studied in artificial intelligence and are not covered here
- Bayesian classifiers use Bayes' theorem, which says
      p(cj | d) = p(d | cj) p(cj) / p(d)
  where
  - p(cj | d) = probability of instance d being in class cj,
  - p(d | cj) = probability of generating instance d given class cj,
  - p(cj) = probability of occurrence of class cj, and
  - p(d) = probability of instance d occurring
Naïve Bayesian Classifiers
- Bayesian classifiers require
  - computation of p(d | cj)
  - precomputation of p(cj)
  - p(d) can be ignored since it is the same for all classes
- To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
      p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj)
- Each of the p(di | cj) can be estimated from a histogram on di values for each class cj
  - the histogram is computed from the training instances
- Histograms on multiple attributes are more expensive to compute and store
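A minimal naïve Bayesian classifier following the slide: p(d | cj) is estimated as the product of per-attribute probabilities p(di | cj), each read off a histogram built from the training instances. The tiny training set is invented for illustration, and no smoothing of zero counts is applied.

```python
from collections import Counter, defaultdict

def train(instances):
    """instances: list of (attribute_tuple, class) pairs."""
    class_counts = Counter(c for _, c in instances)
    # One histogram per (class, attribute position): counts of each value.
    hist = defaultdict(Counter)
    for attrs, c in instances:
        for i, v in enumerate(attrs):
            hist[(c, i)][v] += 1
    return class_counts, hist, len(instances)

def predict(model, attrs):
    class_counts, hist, n = model
    def score(c):
        # p(cj) * product of p(di | cj); p(d) is the same for every
        # class and can be ignored, as the slide notes.
        p = class_counts[c] / n
        for i, v in enumerate(attrs):
            p *= hist[(c, i)][v] / class_counts[c]
        return p
    return max(class_counts, key=score)

training = [
    (("masters", "high"), "excellent"),
    (("masters", "high"), "excellent"),
    (("bachelors", "mid"), "good"),
    (("bachelors", "low"), "good"),
]
model = train(training)
print(predict(model, ("masters", "high")))  # excellent
```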
Regression
- Regression deals with the prediction of a value, rather than a class.
  - Given values for a set of variables, X1, X2, ..., Xn, we wish to predict the value of a variable Y.
- One way is to infer coefficients a0, a1, a2, ..., an such that
      Y = a0 + a1 * X1 + a2 * X2 + ... + an * Xn
- Finding such a linear polynomial is called linear regression.
  - In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate
  - because of noise in the data, or
  - because the relationship is not exactly a polynomial
- Regression aims to find coefficients that give the best possible fit.
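For the simplest case (n = 1), the coefficients a0 and a1 of Y = a0 + a1 * X1 have a closed-form least-squares solution, sketched below. The data points are invented; their noise makes the fit approximate, as the slide notes.

```python
def linear_regression(xs, ys):
    """Fit y = a0 + a1 * x by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y over variance of x.
    a1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    a0 = mean_y - a1 * mean_x
    return a0, a1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.0, 8.0]   # roughly y = 2x, with noise
a0, a1 = linear_regression(xs, ys)
print(round(a1, 2))  # slope close to 2
```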
Association Rules
- Retail shops are often interested in associations between different items that people buy.
  - Someone who buys bread is quite likely also to buy milk
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
- Associations information can be used in several ways.
  - E.g., when a customer buys a particular book, an online shop may suggest associated books.
- Association rules:
      bread ⇒ milk
      DB-Concepts, OS-Concepts ⇒ Networks
  - Left hand side: antecedent; right hand side: consequent
- An association rule must have an associated population; the population consists of a set of instances
  - E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population
Association Rules (Cont.)
- Rules have an associated support, as well as an associated confidence.
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  - E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
- Confidence is a measure of how often the consequent is true when the antecedent is true.
  - E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
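The two measures can be computed directly over a population of transactions, as sketched below. Each transaction is a set of purchased items, and the data is invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent holds when the antecedent holds."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "cereal"},
    {"bread", "screwdriver"},
    {"milk"},
    {"bread", "milk", "butter"},
]
# 4 of 5 transactions contain bread; 3 of those also contain milk.
print(support(transactions, {"bread", "milk"}))                 # 0.6
print(round(confidence(transactions, {"bread"}, {"milk"}), 2))  # 0.75
```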
Finding Association Rules
- We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater)
- Naïve algorithm:
  1. Consider all possible sets of relevant items.
  2. For each set find its support (i.e., count how many transactions purchase all items in the set).
     - Large itemsets: sets with sufficiently high support
  3. Use large itemsets to generate association rules.
     - From itemset A generate the rule A − {b} ⇒ b for each b ∈ A.
       - Support of rule = support(A).
       - Confidence of rule = support(A) / support(A − {b})
Finding Support
- Determine support of itemsets via a single pass on set of transactions
  - Large itemsets: sets with a high count at the end of the pass
- If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
- Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.
- The a priori technique to find large itemsets:
  - Pass 1: count support of all sets with just 1 item. Eliminate those items with low support
  - Pass i: candidates: every set of i items such that all its (i−1)-item subsets are large
    - Count support of all candidates
    - Stop if there are no candidates
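The a priori passes above can be sketched compactly: pass 1 counts single items, and pass i builds candidate i-item sets whose (i−1)-item subsets are all large, then counts their support. The threshold and transaction data are invented for illustration; a real implementation would count supports in streaming passes rather than rescanning per itemset.

```python
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    def keep_large(itemsets):
        return {s for s in itemsets
                if sum(s <= t for t in transactions) / n >= min_support}
    # Pass 1: all 1-item sets with enough support.
    items = {x for t in transactions for x in t}
    current = keep_large({frozenset([x]) for x in items})
    large = set(current)
    i = 2
    while current:
        # Candidates: i-item unions of large (i-1)-item sets, kept only
        # if every (i-1)-subset is itself large (the pruning rule above).
        candidates = {a | b for a in current for b in current
                      if len(a | b) == i}
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, i - 1))}
        current = keep_large(candidates)
        large |= current
        i += 1
    return large

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "cereal"},
    {"bread", "screwdriver"},
    {"milk", "cereal"},
    {"bread", "milk", "butter"},
]
print(frozenset({"bread", "milk"}) in apriori(transactions, 0.5))  # True
```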
Other Types of Associations
- Basic association rules have several limitations
- Deviations from the expected probability are more interesting
  - E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both
  - We are interested in positive as well as negative correlations between sets of items
    - Positive correlation: co-occurrence is higher than predicted
    - Negative correlation: co-occurrence is lower than predicted
- Sequence associations/correlations
  - E.g., whenever bonds go up, stock prices go down in 2 days
- Deviations from temporal patterns
  - E.g., deviation from a steady growth
  - E.g., sales of winter wear go down in summer
    - Not surprising, part of a known pattern; look for deviation from value predicted using past patterns
Clustering
- Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster
- Can be formalized using distance metrics in several ways
  - Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized
    - Centroid: point defined by taking average of coordinates in each dimension
  - Another metric: minimize average distance between every pair of points in a cluster
- Has been studied extensively in statistics, but on small data sets
  - Data mining systems aim at clustering techniques that can handle very large data sets
  - E.g., the Birch clustering algorithm (more shortly)
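The first metric above is what the classic k-means algorithm optimizes, sketched below: assign each point to the nearest centroid, recompute centroids as coordinate-wise means, and repeat. The 2-D points and k = 2 are invented; this is the small-data version, not a large-scale algorithm like Birch.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    """Average of coordinates in each dimension, as the slide defines."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def kmeans(points, k, iters=20):
    centroids = points[:k]  # naive initialization: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster goes empty.
        centroids = [centroid(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```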
Hierarchical Clustering
- Example from biological classification (the word classification here does not mean a prediction mechanism):
      chordata
      ├─ mammalia: leopards, humans
      └─ reptilia: snakes, crocodiles
- Other examples: Internet directory systems (e.g., Yahoo, more on this later)
- Agglomerative clustering algorithms
  - Build small clusters, then cluster small clusters into bigger clusters, and so on
- Divisive clustering algorithms
  - Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones
Clustering Algorithms
- Clustering algorithms have been designed to handle very large data sets
- E.g., the Birch algorithm
  - Main idea: use an in-memory R-tree to store points that are being clustered
  - Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some δ distance away
  - If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other
  - At the end of first pass we get a large number of clusters at the leaves of the R-tree
    - Merge clusters to reduce the number of clusters
Collaborative Filtering
- Goal: predict what movies/books/... a person may be interested in, on the basis of
  - Past preferences of the person
  - Other people with similar past preferences
  - The preferences of such people for a new movie/book/...
- One approach based on repeated clustering
  - Cluster people on the basis of preferences for movies
  - Then cluster movies on the basis of being liked by the same clusters of people
  - Again cluster people based on their preferences for (the newly created clusters of) movies
  - Repeat above till equilibrium
- Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
Other Types of Mining
- Text mining: application of data mining to textual documents
  - cluster Web pages to find related pages
  - cluster pages a user has visited to organize their visit history
  - classify Web pages automatically into a Web directory
- Data visualization systems help users examine large volumes of data and detect patterns visually
  - Can visually encode large amounts of information on a single screen
  - Humans are very good at detecting visual patterns
End of Chapter
Figure 20.01
Figure 20.02
Figure 20.03
Figure 20.05