This document summarizes and compares concept models for cross-language information retrieval, covering both explicit and latent models. It describes explicit models such as Explicit Semantic Analysis (ESA), which use concepts defined in external resources, and latent models such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), which derive concepts from unlabeled text. Evaluating these models on two datasets, it finds that LSA outperforms LDA and that CL-ESA achieves results comparable to LSA. It concludes that explicit models such as CL-ESA can perform as well as or better than latent models for cross-language retrieval.
1. Digital Enterprise Research Institute www.deri.ie
Explicit vs. Latent Concept Models for Cross-Language
Information Retrieval
Nitish Aggarwal
DERI, NUI Galway
firstname.lastname@deri.org
Tuesday, 26th June, 2012
Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
DERI, Reading Group
Enabling Networked Knowledge
2. Based On:
Title:
“Explicit vs. Latent Concept Models for Cross-Language
Information Retrieval”
Authors:
Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg,
Steffen Staab
Published:
International Joint Conference on Artificial Intelligence, 2009
3. Overview
Introduction
Cross-lingual information retrieval (CLIR)
Concept Model
Explicit Semantics
Latent Semantics
Evaluation
Conclusion
4. Introduction: CLIR
Cross Lingual Information Retrieval
Many documents and web sites are written in different languages
Retrieve all information without a language barrier
Query and documents are in different languages
5. Introduction: CLIR
CLIR based on Machine Translation
Translate queries or documents
Reduces the problem to monolingual retrieval
– Issues:
– MT is not available for all language pairs
– Increases vocabulary mismatch
6. Introduction: CLIR
Interlingua or concept-based CLIR
Use a language-independent representation
– Map all queries and documents, in any language, into a concept space
– Define a concept space and a relevance function
7. Concept Model
Document in concept space
Di = {t1, t2, t3, …, tn}
Each token ti is mapped into the concept space
– Association with every concept
Composite semantics of all tokens
– Σ ti , Π ti
Types of concept model
Explicit
Latent / implicit
8. Concept Model: Explicit
Intuition: define concepts from external resources
Definition of concepts
– Wikipedia articles, tagged web pages
Cover a broad range of vocabulary and languages
Example
Wikipedia-based Explicit Semantic Analysis (ESA)
9. Concept Model: ESA
Explicit concept space
Di = {t1, t2, t3, …, tn}
ti = {w1a1 + w2a2 + … + wnan} (a weighted vector over Wikipedia articles ai)
Composite semantics of all tokens
– Σ ti
[Figure: a query and documents mapped to concepts such as "University", "Student", "Education"]
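The mapping above can be sketched in a few lines. This is a minimal, illustrative sketch of the ESA idea: the inverted index below is a toy stand-in (real ESA uses TF-IDF weights of terms over Wikipedia articles), and the concept names and weights are assumptions, not data from the paper.

```python
from collections import defaultdict

# Hypothetical inverted index: term -> {Wikipedia concept: weight}.
# Real ESA derives these weights via TF-IDF over Wikipedia article text.
INDEX = {
    "university": {"Education": 2.0, "Student": 1.5},
    "lecture":    {"Education": 1.0, "Student": 0.5},
}

def esa_vector(tokens):
    """Composite semantics of all tokens: the sum of their concept vectors (Σ ti)."""
    vec = defaultdict(float)
    for t in tokens:
        for concept, weight in INDEX.get(t, {}).items():
            vec[concept] += weight
    return dict(vec)

print(esa_vector(["university", "lecture"]))  # {'Education': 3.0, 'Student': 2.0}
```

A document and a query are each mapped this way, and their relatedness is then measured in the shared concept space rather than in the word space.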
10. Cross-lingual ESA
Extension of ESA
Use Wikipedia cross-language links
Linked articles define same concepts in different languages
Per-language inverted index over shared concept URIs:
EN: Word1 → w1·URI1 + w2·URI2 + … + wn·URIn … Wordn → w1·URI1 + … + wn·URIn
DE: Word1 → w1·URI1 + w2·URI2 + … + wn·URIn … Wordn → w1·URI1 + … + wn·URIn
ES: Word1 → w1·URI1 + w2·URI2 + … + wn·URIn … Wordn → w1·URI1 + … + wn·URIn
Semantic vectors in the shared space:
Term@en → w11·URI1 + w12·URI2 + … + w1n·URIn
Term@de → w11·URI1 + w12·URI2 + … + w1n·URIn
Vectors are compared with cosine relatedness
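The cosine relatedness step can be sketched as follows. This assumes both texts have already been mapped to vectors over the same concept URIs (which is exactly what the cross-language links provide); the URIs and weights here are illustrative, not taken from the paper.

```python
import math

# Hypothetical concept vectors for an English and a German text, both
# expressed over shared Wikipedia concept URIs via cross-language links.
en_doc = {"URI:University": 1.0, "URI:Student": 0.5}
de_doc = {"URI:University": 0.75, "URI:Student": 1.0}

def cosine(u, v):
    """Cosine relatedness of two sparse concept vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Because the vectors live in the same URI space, the same function scores any language pair without translation.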
11. Concept Model: Latent
Intuition: semantic space of latent concepts
Definition of latent concepts
– Clusters of similar things define a latent concept
Latent Concept1 (Food): 30% broccoli, 15% bananas, 10% breakfast, 10% munching
Latent Concept2 (Animals): 20% chinchillas, 20% kittens, 20% cute, 15% hamster
"Look at this cute hamster munching on a piece of broccoli"
(40% Latent Concept1 and 60%Latent Concept2)
12. Concept Model: Latent
[Figure: latent concepts (LC1, LC2, LC3) are derived from a training corpus; queries and documents are then mapped into this latent concept space]
13. Latent Semantic Analysis (LSA)
Definition
Dimensionality reduction to find latent concepts
Approach
Build a term-document matrix M
Perform singular value decomposition (SVD) on M
Approximate M by keeping the top N singular values
– The N singular values reflect N different latent concepts
– U defines the term-concept correlations
– V defines the document-concept correlations
Cross-lingual LSA
Use a parallel corpus
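The steps above can be sketched with NumPy. This is a toy monolingual example with an illustrative 3×3 matrix; for cross-lingual LSA the paper trains on a parallel corpus instead (conceptually, stacking the term rows of both languages over aligned document columns).

```python
import numpy as np

# Toy term-document matrix M (rows: terms, columns: documents).
M = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
])

# Singular value decomposition: M = U · diag(s) · Vt.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

N = 2  # number of latent concepts to keep
U_N, s_N, Vt_N = U[:, :N], s[:N], Vt[:N, :]
# U_N: term-concept correlations; Vt_N.T: document-concept correlations.

# Rank-N approximation of M; queries and documents are compared
# in this N-dimensional latent concept space.
M_approx = U_N @ np.diag(s_N) @ Vt_N
```

Keeping only the top N singular values discards the low-variance directions, which is what collapses near-synonymous terms onto the same latent concept.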
14. Latent Dirichlet Allocation (LDA)
Definition
Generative model
– Each document is a mixture of latent concepts (topics)
– Each topic generates words; the parameters are learned from the corpus
Approach
Topic distributions are given a Dirichlet prior
Fit corpus- and document-level parameters using a variational
Expectation Maximization (EM) procedure
Cross-lingual LDA
Use a parallel corpus
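The fitting step can be sketched with scikit-learn, whose LDA implementation uses variational inference as described above. The document-term counts below are an illustrative toy corpus, not data from the paper.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix (rows: documents, columns: terms).
X = np.array([
    [4, 3, 0, 0],  # mostly "food" terms
    [0, 0, 5, 2],  # mostly "animal" terms
    [2, 1, 2, 2],  # a mixed document
])

# Fit a 2-topic model; fit_transform returns each document's topic mixture.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # rows are distributions that sum to 1
```

Each row of `doc_topics` is the per-document topic distribution, i.e. the "40% Concept1 / 60% Concept2" kind of mixture from the earlier example.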
15. Evaluation
Parallel corpora
All documents are translated into many languages
Relevance assessment
Use documents in one language as queries to retrieve documents
in another language
Translated document = relevant document
– No manual relevance assessment is needed
Measures used
Mean reciprocal rank (MRR)
Average score over all language pairs
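The MRR measure used above is simple to state in code: for each query, take the reciprocal of the rank at which the relevant (i.e. translated) document appears, and average over queries. A minimal sketch:

```python
def mean_reciprocal_rank(ranks):
    """Mean reciprocal rank.

    `ranks` holds, per query, the 1-based rank at which the relevant
    (translated) document was retrieved, or None if it was not retrieved.
    """
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)

print(mean_reciprocal_rank([1, 2, 4, None]))  # (1 + 0.5 + 0.25 + 0) / 4 = 0.4375
```

In the paper's setup this score is then averaged over all language pairs to compare the models.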
16. Evaluation: Datasets
Multilingual corpora
Multext corpus
– 3,066 Q/A pairs from the Official Journal of the European Community
JRC-Acquis corpus
– 21,000 legislative documents of the European Union
– 3,000 randomly selected documents were used as queries
Set up
English, German and French documents were used
Split dataset for latent topic extraction
– 60% learning, 40% testing
17. Evaluation: Datasets
Wikipedia
Snapshot
– 03/12/2008 (English), 06/25/2008 (French), 06/29/2008 (German)
– Collection of 166,484 articles
CL-ESA: Use cross-language links to align concepts across
languages
LSA/LDA: Wikipedia as parallel corpus
– Use it as training corpus for latent concepts extraction
18. Evaluation: Parameter
Cross-lingual ESA
Problem
– Too many concepts
Solution
– Use only the m highest concept weights
LSI/LDA
Problem
– Computational costs increase with number of topics
Solution
– Use fixed number of latent topics
21. Conclusion
Parameter tuning
ESA performs well for m = 10,000
A maximum of 500 topics was tested for LSI
– Not maximal performance, but results appear to converge
Results
LSA performs better than LDA
CL-ESA and LSA achieve comparable results
– Explicit vs. latent
Explicit models can perform as well as or better than latent models