- 543.50 KB
- 2022-04-29 14:33:39 发布
- 1、本文档共5页,可阅读全部内容。
- 2、本文档内容版权归属内容提供方,所产生的收益全部归内容提供方所有。如果您对本文有版权争议,可选择认领,认领后既往收益都归您。
- 3、本文档由用户上传,本站不保证质量和数量令人满意,可能有诸多瑕疵,付费之前,请仔细先通过免费阅读内容等途径辨别内容交易风险。如存在严重挂羊头卖狗肉之情形,可联系本站下载客服投诉处理。
- 文档侵权举报电话:19940600175。
'Chapter21:InformationRetrieval
Chapter21:InformationRetrievalRelevanceRankingUsingTermsRelevanceUsingHyperlinksSynonyms.,Homonyms,andOntologiesIndexingofDocumentsMeasuringRetrievalEffectivenessWebSearchEnginesInformationRetrievalandStructuredDataDirectories
InformationRetrievalSystemsInformationretrieval(IR)systemsuseasimplerdatamodelthandatabasesystemsInformationorganizedasacollectionofdocumentsDocumentsareunstructured,noschemaInformationretrievallocatesrelevantdocuments,onthebasisofuserinputsuchaskeywordsorexampledocumentse.g.,finddocumentscontainingthewords“databasesystems”Canbeusedevenontextualdescriptionsprovidedwithnon-textualdatasuchasimagesWebsearchenginesarethemostfamiliarexampleofIRsystems
InformationRetrievalSystems(Cont.)DifferencesfromdatabasesystemsIRsystemsdon’tdealwithtransactionalupdates(includingconcurrencycontrolandrecovery)Databasesystemsdealwithstructureddata,withschemasthatdefinethedataorganizationIRsystemsdealwithsomequeryingissuesnotgenerallyaddressedbydatabasesystemsApproximatesearchingbykeywordsRankingofretrievedanswersbyestimateddegreeofrelevance
KeywordSearchInfulltextretrieval,allthewordsineachdocumentareconsideredtobekeywords.WeusethewordtermtorefertothewordsinadocumentInformation-retrievalsystemstypicallyallowqueryexpressionsformedusingkeywordsandthelogicalconnectivesand,or,andnotAndsareimplicit,evenifnotexplicitlyspecifiedRankingofdocumentsonthebasisofestimatedrelevancetoaqueryiscriticalRelevancerankingisbasedonfactorssuchasTermfrequencyFrequencyofoccurrenceofquerykeywordindocumentInversedocumentfrequencyHowmanydocumentsthequerykeywordoccursinFewergivemoreimportancetokeywordHyperlinkstodocumentsMorelinkstoadocumentdocumentismoreimportant
RelevanceRankingUsingTermsTF-IDF(Termfrequency/InverseDocumentfrequency)ranking:Letn(d)=numberoftermsinthedocumentdn(d,t)=numberofoccurrencesoftermtinthedocumentd.RelevanceofadocumentdtoatermtThelogfactoristoavoidexcessiveweighttofrequenttermsRelevanceofdocumenttoqueryQn(d)n(d,t)1+TF(d,t)=logr(d,Q)=TF(d,t)n(t)tQ
RelevanceRankingUsingTerms(Cont.)MostsystemsaddtotheabovemodelWordsthatoccurintitle,authorlist,sectionheadings,etc.aregivengreaterimportanceWordswhosefirstoccurrenceislateinthedocumentaregivenlowerimportanceVerycommonwordssuchas“a”,“an”,“the”,“it”etc.areeliminatedCalledstopwordsProximity:ifkeywordsinqueryoccurclosetogetherinthedocument,thedocumenthashigherimportancethaniftheyoccurfarapartDocumentsarereturnedindecreasingorderofrelevancescoreUsuallyonlytopfewdocumentsarereturned,notall
SimilarityBasedRetrievalSimilaritybasedretrieval-retrievedocumentssimilartoagivendocumentSimilaritymaybedefinedonthebasisofcommonwordsE.g.,findktermsinAwithhighestTF(d,t)/n(t)andusethesetermstofindrelevanceofotherdocuments.Relevancefeedback:SimilaritycanbeusedtorefineanswersettokeywordqueryUserselectsafewrelevantdocumentsfromthoseretrievedbykeywordquery,andsystemfindsotherdocumentssimilartotheseVectorspacemodel:defineann-dimensionalspace,wherenisthenumberofwordsinthedocumentset.VectorfordocumentdgoesfromorigintoapointwhoseithcoordinateisTF(d,t)/n(t)Thecosineoftheanglebetweenthevectorsoftwodocumentsisusedasameasureoftheirsimilarity.
RelevanceUsingHyperlinksNumberofdocumentsrelevanttoaquerycanbeenormousifonlytermfrequenciesaretakenintoaccountUsingtermfrequenciesmakes“spamming”easyE.g.,atravelagencycanaddmanyoccurrencesofthewords“travel”toitspagetomakeitsrankveryhighMostofthetimepeoplearelookingforpagesfrompopularsitesIdea:usepopularityofWebsite(e.g.,howmanypeoplevisitit)toranksitepagesthatmatchgivenkeywordsProblem:hardtofindactualpopularityofsiteSolution:nextslide
RelevanceUsingHyperlinks(Cont.)Solution:usenumberofhyperlinkstoasiteasameasureofthepopularityorprestigeofthesiteCountonlyonehyperlinkfromeachsite(why?-seepreviousslide)Popularitymeasureisforsite,notforindividualpageBut,mosthyperlinksaretorootofsiteAlso,conceptof“site”difficulttodefinesinceaURLprefixlikecs.yale.educontainsmanyunrelatedpagesofvaryingpopularityRefinementsWhencomputingprestigebasedonlinkstoasite,givemoreweighttolinksfromsitesthatthemselveshavehigherprestigeDefinitioniscircularSetupandsolvesystemofsimultaneouslinearequationsAboveideaisbasisoftheGooglePageRankrankingmechanism
RelevanceUsingHyperlinks(Cont.)ConnectionstosocialnetworkingtheoriesthatrankedprestigeofpeopleE.g.,thepresidentoftheU.S.AhasahighprestigesincemanypeopleknowhimSomeoneknownbymultipleprestigiouspeoplehashighprestigeHubandauthoritybasedrankingAhubisapagethatstoreslinkstomanypages(onatopic)AnauthorityisapagethatcontainsactualinformationonatopicEachpagegetsahubprestigebasedonprestigeofauthoritiesthatitpointstoEachpagegetsanauthorityprestigebasedonprestigeofhubsthatpointtoitAgain,prestigedefinitionsarecyclic,andcanbegotbysolvinglinearequationsUseauthorityprestigewhenrankinganswerstoaquery
SynonymsandHomonymsSynonymsE.g.,document:“motorcyclerepair”,query:“motorcyclemaintenance”Needtorealizethat“maintenance”and“repair”aresynonymsSystemcanextendqueryas“motorcycleand(repairormaintenance)”HomonymsE.g.,“object”hasdifferentmeaningsasnoun/verbCandisambiguatemeanings(tosomeextent)fromthecontextExtendingqueriesautomaticallyusingsynonymscanbeproblematicNeedtounderstandintendedmeaninginordertoinfersynonymsOrverifysynonymswithuserSynonymsmayhaveothermeaningsaswell
Concept-BasedQueryingApproachForeachword,determinetheconceptitrepresentsfromcontextUseoneormoreontologies:HierarchicalstructureshowingrelationshipbetweenconceptsE.g.,theISArelationshipthatwesawintheE-RmodelThisapproachcanbeusedtostandardizeterminologyinaspecificfieldOntologiescanlinkmultiplelanguagesFoundationoftheSemanticWeb(notcoveredhere)
IndexingofDocumentsAninvertedindexmapseachkeywordKitoasetofdocumentsSithatcontainthekeywordDocumentsidentifiedbyidentifiersInvertedindexmayrecordKeywordlocationswithindocumenttoallowproximitybasedrankingCountsofnumberofoccurrencesofkeywordtocomputeTFandoperation:FindsdocumentsthatcontainallofK1,K2,...,Kn.IntersectionS1S2.....Snoroperation:documentsthatcontainatleastoneofK1,K2,…,Knunion,S1S2.....Sn,.EachSiiskeptsortedtoallowefficientintersection/unionbymerging“not”canalsobeefficientlyimplementedbymergingofsortedlists
MeasuringRetrievalEffectivenessInformation-retrievalsystemssavespacebyusingindexstructuresthatsupportonlyapproximateretrieval.Mayresultin:falsenegative(falsedrop)-somerelevantdocumentsmaynotberetrieved.falsepositive-someirrelevantdocumentsmayberetrieved.Formanyapplicationsagoodindexshouldnotpermitanyfalsedrops,butmaypermitafewfalsepositives.Relevantperformancemetrics:precision-whatpercentageoftheretrieveddocumentsarerelevanttothequery.recall-whatpercentageofthedocumentsrelevanttothequerywereretrieved.
MeasuringRetrievalEffectiveness(Cont.)Recallvs.precisiontradeoff:Canincreaserecallbyretrievingmanydocuments(downtoalowlevelofrelevanceranking),butmanyirrelevantdocumentswouldbefetched,reducingprecisionMeasuresofretrievaleffectiveness:Recallasafunctionofnumberofdocumentsfetched,orPrecisionasafunctionofrecallEquivalently,asafunctionofnumberofdocumentsfetchedE.g.,“precisionof75%atrecallof50%,and60%atarecallof75%”Problem:whichdocumentsareactuallyrelevant,andwhicharenot
WebSearchEnginesWebcrawlersareprogramsthatlocateandgatherinformationontheWebRecursivelyfollowhyperlinkspresentinknowndocuments,tofindotherdocumentsStartingfromaseedsetofdocumentsFetcheddocumentsHandedovertoanindexingsystemCanbediscardedafterindexing,orstoreasacachedcopyCrawlingtheentireWebwouldtakeaverylargeamountoftimeSearchenginestypicallycoveronlyapartoftheWeb,notallofitTakemonthstoperformasinglecrawl
WebCrawling(Cont.)Crawlingisdonebymultipleprocessesonmultiplemachines,runninginparallelSetoflinkstobecrawledstoredinadatabaseNewlinksfoundincrawledpagesaddedtothisset,tobecrawledlaterIndexingprocessalsorunsonmultiplemachinesCreatesanewcopyofindexinsteadofmodifyingoldindexOldindexisusedtoanswerqueriesAfteracrawlis“completed”newindexbecomes“old”indexMultiplemachinesusedtoanswerqueriesIndicesmaybekeptinmemoryQueriesmayberoutedtodifferentmachinesforloadbalancing
InformationRetrievalandStructuredDataInformationretrievalsystemsoriginallytreateddocumentsasacollectionofwordsInformationextractionsystemsinferstructurefromdocuments,e.g.:Extractionofhouseattributes(size,address,numberofbedrooms,etc.)fromatextadvertisementExtractionoftopicandpeoplenamedfromanewarticleRelationsorXMLstructuresusedtostoreextracteddataSystemseeksconnectionsamongdatatoanswerqueriesQuestionansweringsystems
DirectoriesStoringrelateddocumentstogetherinalibraryfacilitatesbrowsingUserscanseenotonlyrequesteddocumentbutalsorelatedones.Browsingisfacilitatedbyclassificationsystemthatorganizeslogicallyrelateddocumentstogether.Organizationishierarchical:classificationhierarchy
AClassificationHierarchyForALibrarySystem
ClassificationDAGDocumentscanresideinmultipleplacesinahierarchyinaninformationretrievalsystem,sincephysicallocationisnotimportant.ClassificationhierarchyisthusDirectedAcyclicGraph(DAG)
AClassificationDAGForALibraryInformationRetrievalSystem
WebDirectoriesAWebdirectoryisjustaclassificationdirectoryonWebpagesE.g.,Yahoo!Directory,OpenDirectoryprojectIssues:Whatshouldthedirectoryhierarchybe?Givenadocument,whichnodesofthedirectoryarecategoriesrelevanttothedocumentOftendonemanuallyClassificationofdocumentsintoahierarchymaybedonebasedontermsimilarity
EndofChapter'
您可能关注的文档
- 数据库系统概念全套配套课件PPT ch12.ppt
- 数据库系统概念全套配套课件PPT ch6.ppt
- 数据库系统概念全套配套课件PPT ch2.ppt
- 数据库系统概念全套配套课件PPT appB.ppt
- 数据库系统概念全套配套课件PPT ch16.ppt
- 数据库系统概念全套配套课件PPT ch7.ppt
- 数据库系统概念全套配套课件PPT ch26.ppt
- 数据库系统概念全套配套课件PPT ch22.ppt
- 数据库系统概念全套配套课件PPT ch23.ppt
- 数据库系统概念全套配套课件PPT ch15.ppt
- 土木工程材料第2版柯国军配套教学课件PPT 06土木工程材料.ppt
- 土木工程材料第2版柯国军配套教学课件PPT 01土木工程材料.ppt
- 土木工程材料第2版柯国军配套教学课件PPT 07土木工程材料.ppt
- 土木工程材料第2版柯国军配套教学课件PPT 14土木工程材料.ppt
- 土木工程材料第2版柯国军配套教学课件PPT 10土木工程材料.ppt
- 土木工程材料第2版柯国军配套教学课件PPT 09土木工程材料.ppt
- 新能源汽车汽车认知与应用 吴荣辉 全套配套课件PPT 模块五 新能源汽车使用与充电 单元一 新能源汽车的使用.ppt
- 幼儿文学创作与欣赏全套配套课件PPT 6章.ppt