• 543.50 KB
  • 2022-04-29 14:33:39 发布

数据库系统概念全套配套课件PPT ch21.ppt

  • 25页
  • 当前文档由用户上传发布,收益归属用户
  1. 1、本文档共5页,可阅读全部内容。
  2. 2、本文档内容版权归属内容提供方,所产生的收益全部归内容提供方所有。如果您对本文有版权争议,可选择认领,认领后既往收益都归您。
  3. 3、本文档由用户上传,本站不保证质量和数量令人满意,可能有诸多瑕疵,付费之前,请仔细先通过免费阅读内容等途径辨别内容交易风险。如存在严重挂羊头卖狗肉之情形,可联系本站下载客服投诉处理。
  4. 文档侵权举报电话:19940600175。
'Chapter21:InformationRetrieval Chapter21:InformationRetrievalRelevanceRankingUsingTermsRelevanceUsingHyperlinksSynonyms.,Homonyms,andOntologiesIndexingofDocumentsMeasuringRetrievalEffectivenessWebSearchEnginesInformationRetrievalandStructuredDataDirectories InformationRetrievalSystemsInformationretrieval(IR)systemsuseasimplerdatamodelthandatabasesystemsInformationorganizedasacollectionofdocumentsDocumentsareunstructured,noschemaInformationretrievallocatesrelevantdocuments,onthebasisofuserinputsuchaskeywordsorexampledocumentse.g.,finddocumentscontainingthewords“databasesystems”Canbeusedevenontextualdescriptionsprovidedwithnon-textualdatasuchasimagesWebsearchenginesarethemostfamiliarexampleofIRsystems InformationRetrievalSystems(Cont.)DifferencesfromdatabasesystemsIRsystemsdon’tdealwithtransactionalupdates(includingconcurrencycontrolandrecovery)Databasesystemsdealwithstructureddata,withschemasthatdefinethedataorganizationIRsystemsdealwithsomequeryingissuesnotgenerallyaddressedbydatabasesystemsApproximatesearchingbykeywordsRankingofretrievedanswersbyestimateddegreeofrelevance KeywordSearchInfulltextretrieval,allthewordsineachdocumentareconsideredtobekeywords.WeusethewordtermtorefertothewordsinadocumentInformation-retrievalsystemstypicallyallowqueryexpressionsformedusingkeywordsandthelogicalconnectivesand,or,andnotAndsareimplicit,evenifnotexplicitlyspecifiedRankingofdocumentsonthebasisofestimatedrelevancetoaqueryiscriticalRelevancerankingisbasedonfactorssuchasTermfrequencyFrequencyofoccurrenceofquerykeywordindocumentInversedocumentfrequencyHowmanydocumentsthequerykeywordoccursinFewergivemoreimportancetokeywordHyperlinkstodocumentsMorelinkstoadocumentdocumentismoreimportant RelevanceRankingUsingTermsTF-IDF(Termfrequency/InverseDocumentfrequency)ranking:Letn(d)=numberoftermsinthedocumentdn(d,t)=numberofoccurrencesoftermtinthedocumentd.RelevanceofadocumentdtoatermtThelogfactoristoavoidexcessiveweighttofrequenttermsRelevanceofdocumenttoqueryQn(d)n(d,t)1+TF(d,t)=logr(d,Q)=TF(d,t)n(t)tQ RelevanceRankingUsingTerms(Cont.)MostsystemsaddtotheabovemodelWordsthatoccurintitle,authorlist,sectionheadings,etc.aregivengreaterimportanceWordswhosefirstoccurrenceislateinthedocumentaregivenlowerimportanceVerycommonwordssuchas“a”,“an”,“the”,“it”etc.areeliminatedCalledstopwordsProximity:ifkeywordsinqueryoccurclosetogetherinthedocument,thedocumenthashigherimportancethaniftheyoccurfarapartDocumentsarereturnedindecreasingorderofrelevancescoreUsuallyonlytopfewdocumentsarereturned,notall SimilarityBasedRetrievalSimilaritybasedretrieval-retrievedocumentssimilartoagivendocumentSimilaritymaybedefinedonthebasisofcommonwordsE.g.,findktermsinAwithhighestTF(d,t)/n(t)andusethesetermstofindrelevanceofotherdocuments.Relevancefeedback:SimilaritycanbeusedtorefineanswersettokeywordqueryUserselectsafewrelevantdocumentsfromthoseretrievedbykeywordquery,andsystemfindsotherdocumentssimilartotheseVectorspacemodel:defineann-dimensionalspace,wherenisthenumberofwordsinthedocumentset.VectorfordocumentdgoesfromorigintoapointwhoseithcoordinateisTF(d,t)/n(t)Thecosineoftheanglebetweenthevectorsoftwodocumentsisusedasameasureoftheirsimilarity. RelevanceUsingHyperlinksNumberofdocumentsrelevanttoaquerycanbeenormousifonlytermfrequenciesaretakenintoaccountUsingtermfrequenciesmakes“spamming”easyE.g.,atravelagencycanaddmanyoccurrencesofthewords“travel”toitspagetomakeitsrankveryhighMostofthetimepeoplearelookingforpagesfrompopularsitesIdea:usepopularityofWebsite(e.g.,howmanypeoplevisitit)toranksitepagesthatmatchgivenkeywordsProblem:hardtofindactualpopularityofsiteSolution:nextslide RelevanceUsingHyperlinks(Cont.)Solution:usenumberofhyperlinkstoasiteasameasureofthepopularityorprestigeofthesiteCountonlyonehyperlinkfromeachsite(why?-seepreviousslide)Popularitymeasureisforsite,notforindividualpageBut,mosthyperlinksaretorootofsiteAlso,conceptof“site”difficulttodefinesinceaURLprefixlikecs.yale.educontainsmanyunrelatedpagesofvaryingpopularityRefinementsWhencomputingprestigebasedonlinkstoasite,givemoreweighttolinksfromsitesthatthemselveshavehigherprestigeDefinitioniscircularSetupandsolvesystemofsimultaneouslinearequationsAboveideaisbasisoftheGooglePageRankrankingmechanism RelevanceUsingHyperlinks(Cont.)ConnectionstosocialnetworkingtheoriesthatrankedprestigeofpeopleE.g.,thepresidentoftheU.S.AhasahighprestigesincemanypeopleknowhimSomeoneknownbymultipleprestigiouspeoplehashighprestigeHubandauthoritybasedrankingAhubisapagethatstoreslinkstomanypages(onatopic)AnauthorityisapagethatcontainsactualinformationonatopicEachpagegetsahubprestigebasedonprestigeofauthoritiesthatitpointstoEachpagegetsanauthorityprestigebasedonprestigeofhubsthatpointtoitAgain,prestigedefinitionsarecyclic,andcanbegotby solvinglinearequationsUseauthorityprestigewhenrankinganswerstoaquery SynonymsandHomonymsSynonymsE.g.,document:“motorcyclerepair”,query:“motorcyclemaintenance”Needtorealizethat“maintenance”and“repair”aresynonymsSystemcanextendqueryas“motorcycleand(repairormaintenance)”HomonymsE.g.,“object”hasdifferentmeaningsasnoun/verbCandisambiguatemeanings(tosomeextent)fromthecontextExtendingqueriesautomaticallyusingsynonymscanbeproblematicNeedtounderstandintendedmeaninginordertoinfersynonymsOrverifysynonymswithuserSynonymsmayhaveothermeaningsaswell Concept-BasedQueryingApproachForeachword,determinetheconceptitrepresentsfromcontextUseoneormoreontologies:HierarchicalstructureshowingrelationshipbetweenconceptsE.g.,theISArelationshipthatwesawintheE-RmodelThisapproachcanbeusedtostandardizeterminologyinaspecificfieldOntologiescanlinkmultiplelanguagesFoundationoftheSemanticWeb(notcoveredhere) IndexingofDocumentsAninvertedindexmapseachkeywordKitoasetofdocumentsSithatcontainthekeywordDocumentsidentifiedbyidentifiersInvertedindexmayrecordKeywordlocationswithindocumenttoallowproximitybasedrankingCountsofnumberofoccurrencesofkeywordtocomputeTFandoperation:FindsdocumentsthatcontainallofK1,K2,...,Kn.IntersectionS1S2.....Snoroperation:documentsthatcontainatleastoneofK1,K2,…,Knunion,S1S2.....Sn,.EachSiiskeptsortedtoallowefficientintersection/unionbymerging“not”canalsobeefficientlyimplementedbymergingofsortedlists MeasuringRetrievalEffectivenessInformation-retrievalsystemssavespacebyusingindexstructuresthatsupportonlyapproximateretrieval.Mayresultin:falsenegative(falsedrop)-somerelevantdocumentsmaynotberetrieved.falsepositive-someirrelevantdocumentsmayberetrieved.Formanyapplicationsagoodindexshouldnotpermitanyfalsedrops,butmaypermitafewfalsepositives.Relevantperformancemetrics:precision-whatpercentageoftheretrieveddocumentsarerelevanttothequery.recall-whatpercentageofthedocumentsrelevanttothequerywereretrieved. MeasuringRetrievalEffectiveness(Cont.)Recallvs.precisiontradeoff:Canincreaserecallbyretrievingmanydocuments(downtoalowlevelofrelevanceranking),butmanyirrelevantdocumentswouldbefetched,reducingprecisionMeasuresofretrievaleffectiveness:Recallasafunctionofnumberofdocumentsfetched,orPrecisionasafunctionofrecallEquivalently,asafunctionofnumberofdocumentsfetchedE.g.,“precisionof75%atrecallof50%,and60%atarecallof75%”Problem:whichdocumentsareactuallyrelevant,andwhicharenot WebSearchEnginesWebcrawlersareprogramsthatlocateandgatherinformationontheWebRecursivelyfollowhyperlinkspresentinknowndocuments,tofindotherdocumentsStartingfromaseedsetofdocumentsFetcheddocumentsHandedovertoanindexingsystemCanbediscardedafterindexing,orstoreasacachedcopyCrawlingtheentireWebwouldtakeaverylargeamountoftimeSearchenginestypicallycoveronlyapartoftheWeb,notallofitTakemonthstoperformasinglecrawl WebCrawling(Cont.)Crawlingisdonebymultipleprocessesonmultiplemachines,runninginparallelSetoflinkstobecrawledstoredinadatabaseNewlinksfoundincrawledpagesaddedtothisset,tobecrawledlaterIndexingprocessalsorunsonmultiplemachinesCreatesanewcopyofindexinsteadofmodifyingoldindexOldindexisusedtoanswerqueriesAfteracrawlis“completed”newindexbecomes“old”indexMultiplemachinesusedtoanswerqueriesIndicesmaybekeptinmemoryQueriesmayberoutedtodifferentmachinesforloadbalancing InformationRetrievalandStructuredDataInformationretrievalsystemsoriginallytreateddocumentsasacollectionofwordsInformationextractionsystemsinferstructurefromdocuments,e.g.:Extractionofhouseattributes(size,address,numberofbedrooms,etc.)fromatextadvertisementExtractionoftopicandpeoplenamedfromanewarticleRelationsorXMLstructuresusedtostoreextracteddataSystemseeksconnectionsamongdatatoanswerqueriesQuestionansweringsystems DirectoriesStoringrelateddocumentstogetherinalibraryfacilitatesbrowsingUserscanseenotonlyrequesteddocumentbutalsorelatedones.Browsingisfacilitatedbyclassificationsystemthatorganizeslogicallyrelateddocumentstogether.Organizationishierarchical:classificationhierarchy AClassificationHierarchyForALibrarySystem ClassificationDAGDocumentscanresideinmultipleplacesinahierarchyinaninformationretrievalsystem,sincephysicallocationisnotimportant.ClassificationhierarchyisthusDirectedAcyclicGraph(DAG) AClassificationDAGForALibraryInformationRetrievalSystem WebDirectoriesAWebdirectoryisjustaclassificationdirectoryonWebpagesE.g.,Yahoo!Directory,OpenDirectoryprojectIssues:Whatshouldthedirectoryhierarchybe?Givenadocument,whichnodesofthedirectoryarecategoriesrelevanttothedocumentOftendonemanuallyClassificationofdocumentsintoahierarchymaybedonebasedontermsimilarity EndofChapter'