Introduction to Cross-Language Information Retrieval (跨語言資訊檢索導論)

Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering
National Taiwan University

Outline
- Multilingual Environments
- What is Cross-Language Information Retrieval?
- Major Problems in CLIR
- Major Approaches in CLIR
- Case Study: CLIR in NPDM
- Summary

Multilingual Collections
- There are 6,703 languages listed in the Ethnologue.
- Digital libraries: OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.
- World Wide Web: around 40% of Internet users do not speak English; however, 80% of Web sites are still in English.

Language Speakers in the Real World
(http:/)

(Statistics from Euro-Marketing Associates, 1998)
[Chart] Languages shown: Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, Korean

http:/ (Statistics from Euro-Marketing Associates, 1999)
- Share of Chinese-speaking population: 6.1%
- Share of French-speaking population: 8.8% (1998)

Language Users on the Internet

Internet Content
(Network Wizards Jan 99 Internet Domain Survey)
[Chart] Languages shown: English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish
[Chart] Values shown: 33,878; 1,687; 1,684; 654; 546; 473; 458; 432; 546
- 40% of Internet users do not understand English, yet 80% of Internet content is in English.

(Source: http:/)

What is Cross-Language Information Retrieval?
- Definition: select information in one language based on queries in another.
- Terminologies
  - Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval)
  - Translingual Information Retrieval (Defense Advanced Research Projects Agency, DARPA)

Generalization: Multi- & Cross-Lingual Information Access

MLIR Applications
- Multilingual information access in a multilingual country, organization, enterprise, etc.
- Cross-language information retrieval for users who read a second language (large passive vocabulary) but are not able to formulate good queries (small active vocabulary).
- Monolingual users may retrieve images by taking advantage of multilingual captions.
- Monolingual users may retrieve documents and have them translated (automatically or manually) into their language.

Why is Cross-Language Information Retrieval Important?
- More information workers with less time require fast access to global resources
  - global B2B interactions (virtual enterprises)
  - global B2C interactions (online trading, travelling)
  - time-critical information (translation comes too late)

History
- 1970: Salton runs retrieval experiments with a small English/German dictionary
- 1972: Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation
- 1978: ISO Standard 5964 for developing multilingual thesauri (revised in 1985)
- 1990: Latent Semantic Indexing (LSI) applied to CLIR

History (Continued)
- 1994: 1st PhD thesis on CLIR by Khaled Radwan
- 1996: Similarity thesaurus applied to CLIR (ETH Zurich)
- 1996: Dictionary-based retrieval applied to CLIR (UMass & Xerox Grenoble)
- 1997: Generalized Vector Space Model (GVSM) applied to CLIR (CMU)

History (Continued)
- 1997: CLIR (Cross-Language Information Retrieval) track starts within TREC
- 1998: NTCIR starts in Japan
- 1999: TIDES (Translingual Information Detection, Extraction, and Summarization) starts in the U.S.
- 2000: CLEF starts in Europe

An Architecture of Multilingual Information Access

Major Problems of CLIR
- Queries and documents are in different languages: translation
- Words in a query may be ambiguous: disambiguation
- Queries are usually short: expansion

Major Problems of CLIR (Continued)
- Queries may have to be segmented: segmentation
- A document may be written in several languages: language identification

Enhancing Traditional Information Retrieval Systems
Which part(s) should be modified for CLIR?
[Diagram] Components: Documents, Queries, Document Representation, Query Representation, Comparison; the candidate modification points are labeled (1) to (4).

Enhancing Traditional Information Retrieval Systems (Continued)
- (1): text translation
- (2): vector translation
- (3): query translation
- (4): term vector translation
- (1) and (2), (3) and (4): interlingual form
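
The difference between modification points (3) and (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the English/German word pairs, the two tiny documents, and the helper names `translate`, `score`, and `rank` are all hypothetical, translation is plain word-by-word lookup, and the comparison step is a simple term-overlap count rather than any particular retrieval model.

```python
from collections import Counter

# Hypothetical toy English -> German term dictionary (illustration only).
EN_TO_DE = {"language": "sprache", "retrieval": "suche", "library": "bibliothek"}
DE_TO_EN = {de: en for en, de in EN_TO_DE.items()}

# Hypothetical German "documents", already tokenized and lowercased.
DOCS_DE = {
    "d1": ["sprache", "suche", "system"],
    "d2": ["bibliothek", "katalog"],
}


def translate(tokens, dictionary):
    """Word-by-word translation; unknown tokens pass through unchanged."""
    return [dictionary.get(tok, tok) for tok in tokens]


def score(query_tokens, doc_tokens):
    """Comparison step: count occurrences of query terms in the document."""
    counts = Counter(doc_tokens)
    return sum(counts[tok] for tok in query_tokens)


def rank(query_tokens, docs):
    """Return document ids by descending score (ties broken by id)."""
    scored = {doc_id: score(query_tokens, toks) for doc_id, toks in docs.items()}
    return sorted(scored, key=lambda doc_id: (-scored[doc_id], doc_id))


query_en = ["language", "retrieval"]

# Point (3): translate the query, then search the original German documents.
ranking_query_translation = rank(translate(query_en, EN_TO_DE), DOCS_DE)

# Point (1): translate the document text, then search with the English query.
docs_en = {doc_id: translate(toks, DE_TO_EN) for doc_id, toks in DOCS_DE.items()}
ranking_text_translation = rank(query_en, docs_en)

print(ranking_query_translation)
print(ranking_text_translation)
```

On this toy data both routes rank the same document first. In practice they differ mainly in cost: query translation touches a handful of terms per search, while text translation has to process the whole collection in advance.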

What are the Problems?
- Ambiguous terms (e.g., performance)
- Multiword phrases may correspond to single-word phrases (e.g., South Africa = 南非, Südafrika)
- Coverage of the vocabulary
- There is no one-to-one mapping between two languages
- Translating queries automatically (lack of syntax)
- Translating documents automatically (performance, ...)
- Computing mixed result lists

Cross-Language Information Retrieval

Query Translation Based CLIR
English Query -> Translation Device -> Chinese Query -> Monolingual Chinese Retrieval System -> Retrieved Chinese Documents

Translating the 400 Million non-English Pages of the WWW
- ... would take 100,000 days (300 years) on one fast PC.
- Or, 1 month on 3,600 PCs.

Knowledge-Based
Examples
- Subject Thesaurus: hierarchical and associative relations; a unique term is assigned to each node.
- Concept List: term space partitioned into concept spaces.
- Term List: list of cross-language synonyms.
- Lexicon: machine-readable syntax and/or semantics.

Ontology-Based Approaches
- Exploit complex knowledge representations, e.g., EuroWordNet
- A Proposal for Conceptual Indexing using EuroWordNet

Dictionary-Based Approaches
- Exploit machine-readable dictionaries.
- Problems
  - translation ambiguity + target polysemy
  - coverage (unknown words, abbreviations, ...)

Dictionary-Based Approaches (Continued)
- Issue 1: selection strategy
  - Select all.
  - Select N randomly.
  - Select best N.
- Issue 2: which level
  - word
  - phrase

Selection Strategy: Select All
Hull and Grefenstette 1996
- Take the concatenation of all term translations.
- E: politically motivated civil disturbances
- F: troubles civils a caractere politique
  - trouble: turmoil, discord, trouble, unrest, disturbance, disorder
  - civil: civil, civilian, courteous
  - caractere: character, nature
  - politique: political, diplomatic, politician, policy
- Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8%
- Errors: multi-word expressions and ambiguity

Selection Strategy: Select All (Continued)
Davis 1997 (TREC5)
- Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
- Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12%

Evaluation Method
- Average Precision (5-, 9-, 11-points)
- Model [Diagram]: three runs over the TREC Spanish Corpus
  - Spanish Query -> Mono IR Engine -> TREC Spanish Corpus
  - English Query -> Bilingual Dictionary -> Spanish Equivalents -> Mono IR Engine -> TREC Spanish Corpus
  - English Query -> POS Bilingual Dictionary -> Spanish Equivalents by POS -> Mono IR Engine -> TREC Spanish Corpus

Selection Strategy: Select N
Simple word-by-word translation
- Each query term is replaced by the word or group of words given for the first sense of the term's definition.
- 50-60% drop in performance (average precision)

Selection Strategy: Select N (Continued)
word/phrase translation
- Take at most three translations of each word, one from each of the firs...
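
The selection strategies can be made concrete with the dictionary entries from the Hull and Grefenstette example. The sketch below is an illustration only: the function names are hypothetical, the French query is assumed to be already lemmatized with the function word "a" removed, and taking the first n listed entries merely stands in for "first sense" or "best N", since the slides do not define how the best candidates are chosen. The `relative_performance` helper simply recomputes the percentages quoted above from the reported average precision values.

```python
# Translation entries copied from the Hull and Grefenstette example above.
FR_TO_EN = {
    "trouble": ["turmoil", "discord", "trouble", "unrest", "disturbance", "disorder"],
    "civil": ["civil", "civilian", "courteous"],
    "caractere": ["character", "nature"],
    "politique": ["political", "diplomatic", "politician", "policy"],
}


def select_all(terms, dictionary):
    """Select all: concatenate every listed translation of every query term."""
    translated = []
    for term in terms:
        # Terms missing from the dictionary are kept untranslated.
        translated.extend(dictionary.get(term, [term]))
    return translated


def select_first_n(terms, dictionary, n):
    """Select N (assumption): keep only the first n listed translations per term."""
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term, [term])[:n])
    return translated


def relative_performance(cross_language_ap, monolingual_ap):
    """Cross-language average precision as a percentage of the monolingual run."""
    return 100.0 * cross_language_ap / monolingual_ap


# Assumed preprocessing: lemmatized query terms, function word "a" dropped.
query_fr = ["trouble", "civil", "caractere", "politique"]

print(select_all(query_fr, FR_TO_EN))
print(select_first_n(query_fr, FR_TO_EN, 1))
print(round(relative_performance(0.235, 0.393), 1))
print(round(relative_performance(0.1422, 0.2895), 2))
```

Select all turns the four-term query into fifteen English terms, most of them unrelated to the intended sense, which is consistent with the ambiguity errors noted for that strategy.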