This document discusses cross-language information retrieval (CLIR). It presents the goals of allowing users to query for domain-specific information in their native language and of presenting relevant search results in the target language. It describes the key components of CLIR, including bilingual corpus extraction from multiple sources, corpus indexing, querying, and string matching. Preliminary evaluation results for sample queries are provided, along with the conclusions that machine-translation-based CLIR is often more useful than the proposed method and that future work could focus on automated evaluation and fuzzy matching.
3. Background
• Corpus - a collection of written text; a single word, multiple words, or even phrases and sentences
• Comparable corpus - a collection of texts from a pair of languages referring to the same domain [1]; a (source text, target text) pair
• N-gram - an n-character or n-word slice of a longer string [2]. We refer to n-character slices as n-grams and use 4-grams (four-grams or quad-grams)
• Source language - the language of the original phrases
• Target language - the language into which CLIR translates the original phrases
[1]: Picchi, Eugenio, and Carol Peters. Cross-Language Information Retrieval: A System for Comparable Corpus Querying. Vol. 2. Springer US, 1998. Print. ISSN 1387-5264.
[2]: Cavnar, William B., and John M. Trenkle. "N-Gram-Based Text Categorization." 1994. Print.
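The n-gram definition above can be sketched in a few lines; this is a minimal illustration (the function name is ours, not from the original system):

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return every n-character slice (n-gram) of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# 4-grams (quad-grams) of "global"
print(char_ngrams("global"))  # ['glob', 'loba', 'obal']
```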
4. Motivation
• People desire to acquire information even when it is not sufficiently available in their native language
• A survey has shown that people have a higher foreign-language proficiency level in reading than in writing
• CLIR may bridge the gap between their desire to obtain information and the unavailability or under-availability of such information in their native language
5. Goals
• Allow users to query for domain-specific (i.e., computer science and software engineering) information in their native language
• Present relevant search results in the target language, i.e., the language in which the largest amount of information is available
10. Multiple Candidates
• Longest match first
• Confidence: how many times does this comparable corpus pair appear in a set of documents?
• Outcome of matching depends on the domain of the documents stored in the database
[Figure: candidate matches for the query "global variable". N-gram index entries appear as position: n-gram (count), e.g., 3: bal_ (14870), 8: aria (14269), 0: loba (25848), 1: aria (14269). Candidate translations: "global variable" → 전역 변수 (global variable), "global" → 세계적인 (worldwide), "variable" → 변수 (variable) or 가변적인 (changeable).]
11. Indexing and Querying Recap
Query: 자바 전역 변수 예제 (Java global variable example)
• 자바 : Java
• 전역 : transfer
• 전역 : all parts (of)
• 전역 변수 : global variable
• 변수 : variable
• 예제 : example
Result: Java global variable example
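The recap above can be sketched end to end; the lookup table below is a toy stand-in holding only the mappings shown on this slide:

```python
# Toy lookup table built from the mappings on this slide.
TABLE = {
    "자바": "Java",
    "전역": "transfer",              # ambiguous on its own
    "전역 변수": "global variable",  # the two-word phrase disambiguates
    "변수": "variable",
    "예제": "example",
}

def translate_query(query: str) -> str:
    """Translate a query by preferring two-token phrase matches."""
    tokens, i, out = query.split(), 0, []
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if pair in TABLE:                  # prefer the longer match
            out.append(TABLE[pair]); i += 2
        elif tokens[i] in TABLE:
            out.append(TABLE[tokens[i]]); i += 1
        else:
            out.append(tokens[i]); i += 1  # pass unknown tokens through
    return " ".join(out)

print(translate_query("자바 전역 변수 예제"))  # Java global variable example
```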
12. Relationship with Content Addressability
Query: 자바 전역 변수 예제
• 자바 → Java
• 전역 변수 → global variable
• 예제 → example
The translated terms are then located within a stored document:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque id Java tristique nunc. Vestibulum sit amet tortor ullamcorper, pretium augue ac, facilisis quam. Ut convallis suscipit mauris, at porta erat vulputate in. Nulla vitae consectetur risus. global variable Aenean justo risus, mollis sed condimentum sed, sagittis eget nisl. Phasellus sem leo, commodo at dignissim vitae, ullamcorper nec metus. Proin pretium porta lectus nec example pulvinar. Nulla non elementum nisi, vel hendrerit quam. Curabitur bibendum lobortis tincidunt. Proin vel velit porta, tempus ligula a, interdum leo. Aenean lorem nibh, facilisis ut porta sit amet, ornare quis ligula.
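One way to read this slide: a stored document is addressable by the query when every translated term occurs in its text. A minimal sketch of that check (the function name is ours):

```python
def addressable(document: str, terms: list[str]) -> bool:
    """True if every translated query term occurs in the document."""
    text = document.lower()
    return all(term.lower() in text for term in terms)

doc = ("Quisque id Java tristique nunc. Nulla vitae consectetur risus. "
       "global variable Aenean justo risus. Proin pretium porta lectus "
       "nec example pulvinar.")
print(addressable(doc, ["Java", "global variable", "example"]))  # True
```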
13. Evaluation
• Matching
  • Did it translate all the search terms to the target language properly?
  • Did it preserve domain-specific information?
• Searching
  • Hit ratio: # of relevant web pages / # of results on the first page
  • Total number of search results
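The hit-ratio metric above is straightforward to compute; a minimal sketch:

```python
def hit_ratio(relevant: int, first_page_results: int) -> float:
    """Hit ratio = # of relevant web pages / # of results on the first page."""
    return relevant / first_page_results

# e.g., 6 of the 10 results on the first page are relevant
print(hit_ratio(6, 10))  # 0.6
```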
14. Evaluation
• 재귀 열거 집합 - recursively enumerable sets
  • (3/3, 1/1)
• 배낭 문제 시간 복잡도 - 배낭 issue the time complexity
  • (3/4, 1/2)
• 가상화를 통한 데이터센터 에너지 효율 극대화 - through virtualization datacenter energy efficiency maximization
  • (7/7, 4/4)
15. Evaluation
• Query in source language "재귀 열거 집합"
  • (6/10, 15,300)
• Query in target language "recursively enumerable sets"
  • (10/10, 105,000)
• Google Translate result "Set of recursive enumeration"
  • (10/10, 1,990,000)
16. Evaluation
• Query in source language "배낭 문제 시간 복잡도"
  • (10/10, 31,200)
• Query in target language "배낭 issue time complexity"
  • (2/6, 2,270)
• Google Translate result "Knapsack problem, the time complexity"
  • (10/10, 206,000)
17. Evaluation
• Query in source language "가상화를 통한 데이터센터 에너지 효율 극대화"
  • (5/10, 36,100)
• Query in target language "through virtualization datacenter energy efficiency maximization"
  • (8/10, 264,000)
• Google Translate result "Maximize energy efficiency through data center virtualization"
  • (10/10, 284,000)
18. Conclusion & Future Work
• Preliminary results look satisfactory
• Machine-translation-based CLIR appears to be more useful in many cases
• The evaluation factors may not reflect the actual quality of the system
• The evaluation process is labor-intensive; an automated evaluation is needed
• Fuzzy matching based on lexical information (e.g., call, calls)
• Fuzzy matching based on semantic information (e.g., maximize, maximizing, maximization, maximum)
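The lexical fuzzy-matching idea can be illustrated with a deliberately crude suffix-stripping stemmer; this is not the authors' method (a real system would use a proper stemmer such as Porter's), only a sketch of how inflected forms could share one index key:

```python
def crude_stem(word: str) -> str:
    """Very crude suffix stripping so inflected forms share an index key.

    Illustration only; a real system would use a proper stemming algorithm.
    """
    for suffix in ("ization", "izing", "ized", "ize", "ation", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("calls"))         # call
print(crude_stem("maximization"))  # maxim
print(crude_stem("maximizing"))    # maxim
```

With such a normalizer, "call" and "calls" (and likewise "maximization" and "maximizing") would hit the same entry at query time, which is the lexical case listed above; the semantic case (e.g., relating "maximum" to "maximize") needs more than suffix stripping.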