Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, And Future Trends

A Survey of Arabic
Question Answering
Challenges, Tasks, Approaches,
Tools, and Future Trends

Ahmed Magdy & Dr. Mohamed Shaheen
ACIT 2012

Outline
● Motivation
● Question Answering Tasks
- Question Analysis, Passage Retrieval, and Answer
Extraction
● Arabic Language Challenges
● Approaches
- Stemming, Named Entity Recognition, Language
Resources
● Tools
● Future Trends And Open Issues

Motivation
● Arabic is the 6th most important language
● More than 300 million speakers
● Increasing amounts of Arabic content on the
Internet
● Increasing demand for Information
● There is no survey that covers Arabic
Question Answering

Question Answering Tasks

● Question Analysis
● Passage Retrieval
● Answer Extraction

Question Analysis
● Tokenization & Normalization
● Remove stop words
● Named Entity Recognition (gazetteer, maxent model)
● Stemming all words except Named Entities
● Question Focus determination by extracting the main NE
● Keywords Extraction & Expansion
● Answer type extraction by question words (Name, Place,
Date, Quantity)
● Query generation of keywords into a Boolean formula
● Experiments with cross-language Arabic/English QA
– Results were not promising because of translation ambiguity
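The question analysis steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline of any surveyed system: the stop-word list, the question-word table, and the function names are all assumptions made for the example, and the NER and stemming steps are left out.

```python
import re

# Hypothetical, deliberately tiny resources; a real system would use full lists.
RAW_STOP_WORDS = ["في", "من", "على", "إلى", "عن", "هو", "هي"]
RAW_QUESTION_WORDS = {
    "من": "Name",      # who
    "أين": "Place",    # where
    "متى": "Date",     # when
    "كم": "Quantity",  # how many / how much
}

DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # short vowels and tatweel


def normalize(text):
    """Strip diacritics and unify common letter variants."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[أإآ]", "ا", text)
    return text.replace("ى", "ي").replace("ة", "ه")


def tokenize(text):
    """Keep runs of Arabic letters; punctuation such as '؟' is dropped."""
    return re.findall(r"[\u0621-\u064A]+", text)


# Normalize the resources so they match normalized question tokens.
STOP_WORDS = {normalize(w) for w in RAW_STOP_WORDS}
QUESTION_WORDS = {normalize(w): t for w, t in RAW_QUESTION_WORDS.items()}


def analyze_question(question):
    """Return (expected answer type, keywords, Boolean query)."""
    tokens = tokenize(normalize(question))
    # Assumption: the question word, if any, is the first token.
    answer_type = QUESTION_WORDS.get(tokens[0]) if tokens else None
    rest = tokens[1:] if answer_type else tokens
    keywords = [t for t in rest if t not in STOP_WORDS]
    return answer_type, keywords, " AND ".join(keywords)


answer_type, keywords, query = analyze_question("أين تقع القاهرة؟")
print(answer_type, keywords, query)
```

Checking the first token before stop-word removal matters because "من" can be either a question word ("who") or a preposition ("from").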

Passage Retrieval
● Systems used:
– Salton’s vector space model based systems
– JIRS passage retrieval system
● Ranking retrieved passages according to:
– Answer and Question words Count
– Answer and Question words Association
– Query words weight
– Cosine similarity between documents words and
question words
– Distance Density N-gram Model
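As an illustration of the cosine similarity criterion above, the following sketch ranks passages against a question using raw term counts. It assumes passages are already tokenized; the function names are hypothetical, and systems such as JIRS use more elaborate n-gram based scoring.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_passages(question_tokens, passages):
    """Return (score, passage) pairs, best match first."""
    scored = [(cosine_similarity(question_tokens, p), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


question = ["capital", "egypt"]
passages = [
    ["the", "nile", "is", "a", "river"],
    ["cairo", "is", "the", "capital", "of", "egypt"],
]
for score, passage in rank_passages(question, passages):
    print(round(score, 3), " ".join(passage))
```

English tokens are used only to keep the example readable; the same code applies to normalized, stemmed Arabic tokens.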

Answer Extraction
● Ranking candidate answers according to:
– Manual lexical patterns
– Answer Snippet position
– Question Word frequencies in Answer
– Matching using N-grams
– Select answers with NEs of the same expected
answer type
– Semantic similarity between the question’s focus and
the answer
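Two of the ranking criteria above (keeping only candidates whose named-entity type matches the expected answer type, then ordering by question word frequency in the answer snippet) can be combined as in this sketch. The candidate dictionary layout and the function name are assumptions made for the example.

```python
def rank_candidates(candidates, expected_type, question_keywords):
    """Filter candidates by NE type, then rank by keyword hits in the snippet.

    Each candidate is a dict with hypothetical keys:
    'text' (the answer string), 'ne_type', and 'snippet' (list of tokens).
    """
    keywords = set(question_keywords)
    typed = [c for c in candidates if c["ne_type"] == expected_type]

    def keyword_hits(candidate):
        return sum(1 for token in candidate["snippet"] if token in keywords)

    return sorted(typed, key=keyword_hits, reverse=True)


candidates = [
    {"text": "Giza", "ne_type": "Place", "snippet": ["giza", "is", "near", "cairo"]},
    {"text": "1952", "ne_type": "Date", "snippet": ["egypt", "1952"]},
    {"text": "Cairo", "ne_type": "Place",
     "snippet": ["cairo", "is", "the", "capital", "of", "egypt"]},
]
best = rank_candidates(candidates, "Place", ["capital", "egypt"])
print([c["text"] for c in best])
```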

Challenges
● Arabic Morphology is highly inflectional
– Many affixes (articles, prepositions, pronouns etc.)

● Arabic Morphology is highly derivational
– 10,000 roots and 120 patterns for derivation

● No Capital Letters in Named Entities
– Unlike Latin based languages

● Scarcity of Arabic Language Resources
– corpora, lexicons, and machine-readable dictionaries

Approaches
● Stemming
– Removing prefixes, suffixes and infixes from words
– Match root with patterns
– Language dependent rules
– Identifying the most frequently used affixes statistically
● Named Entity Recognition
– Maxent model or CRF
– ANERcorp and ANERgazet
● Language Resources
– Arabic WordNet
– Penn Arabic Treebank
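The affix-removal idea listed under Stemming can be illustrated with a small light stemmer. This is a sketch only: the prefix and suffix lists are a hand-picked assumption, it does not remove infixes or match roots against patterns, and it is not the Khoja or ISRI algorithm.

```python
# Hypothetical affix lists, longest prefixes first so they win over "ال".
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه", "ي")


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


for w in ["المعلمون", "والكتاب", "من"]:
    print(w, "->", light_stem(w))
```

The minimum-length guard is a simple language-dependent rule that keeps short words such as "من" from being stripped down to nothing.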

Tools
● NOOJ for Arabic NLP
– C# .NET Freeware linguistic engineering development environment
– Supports Regular Expressions and Context Free Grammars
– Has Arabic Language resources (Sample Text and Dictionary)
● Amine Platform
– Java platform for intelligent systems and multi-agents
– Used for semantic analysis of questions and answers
– Uses Conceptual Graphs, Knowledge bases, and Ontologies
● JIRS, a Java passage retrieval system
– Search based on question n-grams
– Based on the vector space model
– Simple N-gram Model (SNM)
– Term-weight N-gram Model (TNM)
– Distance N-gram Model

Tools [continued]
● Arabic Stemmers
– Khoja Arabic stemmer (With roots dictionary)
– AraMorph (uses Transliteration to English Letters)
– Information Science Research Institute’s (ISRI) stemmer
(without a root dictionary)
● GATE (General Architecture for Text Engineering)
– Java-based platform composed of a tokenizer, a
gazetteer, a sentence splitter, a part of speech tagger, a
named entities transducer and a coreference tagger.
– Plugins for machine learning with Weka, RASP,
MAXENT, SVM Light
– Managing ontologies like WordNet

Tools [continued]
● OpenNLP
– NLP tasks like tokenization, sentence segmentation, part-of-
speech tagging, named entity extraction, chunking, parsing,
maximum entropy, perceptron based machine learning, and
coreference resolution
● Stanford NLP
– Java Framework with many NLP modules for:
– Dependency parsers, and a lexicalized PCFG parser
– Part-of-speech (POS) tagger
– CRF-based Named Entity Recognizer
– CRF-based Word Segmenter
– Maxent Text Classifier
– Tokens Regex: regular expressions over tokens

Future Trends and Open Issues
● More research on Arabic restricted domain QA
– Makes semantic tasks like word sense disambiguation easier
– Domain rules affect how the question is posed and how the answer
is formulated
– A restricted domain should be circumscribed, practical, and complex
– E.g. Agriculture, Architectural Engineering or any field of science
– But not news and current events as they have no constraints

● Use of deep application dependent approaches
– use application dependent constraints and rules to guide the question
analysis and answer extraction and validation
– Depending on the available resources

Future Trends and Open Issues [continued]
● Intensive usage of semantics
– Arabic QA has focused on morpho-syntactic approaches
– Very few systems have used Arabic WordNet
– Still a lot to be done in the field of word sense
disambiguation, coreference resolution and ontology
based reasoning
● Use of theorem proving & deep
reasoning
● Use of logic-based and inference-based
approaches

Summary
● Motivation
● Question Answering Tasks
- Question Analysis, Passage Retrieval, and Answer
Extraction
● Arabic Language Challenges
● Approaches
- Stemming, Named Entity Recognition, Language
Resources
● Tools
● Future Trends And Open Issues

Thank You

You can view the Full Paper on ACIT 2012 Proceedings
