A brief survey presentation on Arabic Question Answering (AQA). It covers the Natural Language Processing and Information Retrieval approaches to question analysis, passage retrieval, and answer extraction, lists the NLP tools used in AQA, and outlines the challenges and future trends in this area.
To cite this paper, you can download it here:
http://www.acit2k.org/ACIT/2012Proceedings/13106.pdf
1. A Survey of Arabic
Question Answering
Challenges, Tasks, Approaches,
Tools, and Future Trends
Ahmed Magdy & Dr. Mohamed Shaheen
ACIT 2012
2. Outline
● Motivation
● Question Answering Tasks
- Question Analysis, Passage Retrieval, and Answer
Extraction
● Arabic Language Challenges
● Approaches
- Stemming, Named Entity Recognition, Language
Resources
● Tools
● Future Trends And Open Issues
3. Motivation
● Arabic is the 6th most important language
● More than 300 million speakers
● Increasing amounts of Arabic content on the
Internet
● Increasing demand for Information
● There is no survey that covers Arabic
Question Answering
5. Question Analysis
● Tokenization & Normalization
● Remove stop words
● Named Entity Recognition (gazetteer, maxent model)
● Stemming all words except Named Entities
● Question Focus determination by extracting the main NE
● Keywords Extraction & Expansion
● Answer type extraction by question words (Name, Place,
Date, Quantity)
● Query generation of keywords into a Boolean formula
● Experiments with cross-language Arabic/English QA
– Not promising because of translation ambiguity
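The first steps of this pipeline can be sketched in a few lines. This is a minimal illustration rather than the method of any surveyed system: the names `normalize`, `analyze`, `QUESTION_WORDS`, and `STOP_WORDS` are hypothetical, the word lists are tiny placeholders, and Named Entity Recognition, stemming, and keyword expansion are left out.

```python
import re
import unicodedata

# Short vowels, tanween, shadda, sukun (U+064B-U+0652) and tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")

def normalize(text):
    """Apply common orthographic normalizations to Arabic text."""
    text = unicodedata.normalize("NFC", text)
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")  # ta marbuta -> ha
    return text

# Illustrative placeholder lists (assumptions), stored in normalized form.
QUESTION_WORDS = {normalize(w): t for w, t in {
    "من": "Name", "أين": "Place", "متى": "Date", "كم": "Quantity"}.items()}
STOP_WORDS = {normalize(w) for w in ["في", "على", "إلى", "عن", "هو", "هي"]}

def analyze(question):
    """Return the expected answer type, the keywords, and a Boolean query."""
    tokens = re.findall(r"\w+", normalize(question))
    # Only the first token is checked, since a word such as "من" can also
    # be a preposition elsewhere in the sentence.
    answer_type = QUESTION_WORDS.get(tokens[0]) if tokens else None
    if answer_type:
        tokens = tokens[1:]
    keywords = [t for t in tokens if t not in STOP_WORDS]
    return {"answer_type": answer_type,
            "keywords": keywords,
            "query": " AND ".join(keywords)}
```

For the question "أين تقع مدينة القاهرة؟" ("Where is the city of Cairo located?"), `analyze` reports the answer type `Place` and joins the three remaining normalized keywords with `AND`.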
6. Passage Retrieval
● Systems used:
– Salton’s vector space model based systems
– JIRS passage retrieval system
● Ranking retrieved passages according to:
– Answer and Question words Count
– Answer and Question words Association
– Query words weight
– Cosine similarity between document words and
question words
– Distance Density N-gram Model
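As an illustration of the cosine similarity criterion above, the sketch below ranks passages against a question using raw term-frequency vectors. The function names are hypothetical, the tokens are assumed to be already normalized and stemmed, and no term weighting (such as tf-idf) is applied; English tokens are used only to keep the example readable.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words (raw term frequencies)."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def rank_passages(question_tokens, passages):
    """Sort tokenized passages by decreasing similarity to the question."""
    return sorted(passages,
                  key=lambda p: cosine(question_tokens, p),
                  reverse=True)

question = ["capital", "egypt"]
passages = [
    ["the", "nile", "is", "a", "river"],
    ["cairo", "is", "the", "capital", "of", "egypt"],
]
ranked = rank_passages(question, passages)
```

Here the passage mentioning both question terms is ranked first, and the passage sharing no terms with the question scores 0.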
7. Answer Extraction
● Ranking candidate answers according to:
– Manual lexical patterns
– Answer Snippet position
– Question Word frequencies in Answer
– Matching using N-grams
– Select answers with NEs of the same expected
answer type
– Semantic similarity between the question’s focus and
the answer
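Two of the criteria above, keeping only candidates whose named-entity type matches the expected answer type and counting question words in the answer snippet, can be combined as in the following sketch. The `Candidate` class and `rank_candidates` function are hypothetical names, and the entity labels are assumed to come from an upstream NER step.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    text: str                  # candidate answer string
    ne_type: str               # label from an upstream NER step (assumed)
    snippet_tokens: List[str]  # tokens of the snippet around the candidate

def rank_candidates(candidates, expected_type, question_tokens):
    """Keep candidates of the expected type, ranked by question-word count."""
    question_set = set(question_tokens)
    typed = [c for c in candidates if c.ne_type == expected_type]

    def score(c):
        # Number of snippet tokens that also occur in the question.
        return sum(1 for t in c.snippet_tokens if t in question_set)

    return sorted(typed, key=score, reverse=True)

candidates = [
    Candidate("1952", "Date", ["revolution", "in", "egypt", "1952"]),
    Candidate("Alexandria", "Place", ["alexandria", "is", "a", "port"]),
    Candidate("Cairo", "Place", ["cairo", "is", "the", "capital", "of", "egypt"]),
]
best = rank_candidates(candidates, "Place", ["capital", "egypt"])
```

With these inputs the date candidate is discarded, and "Cairo" outranks "Alexandria" because its snippet contains both question words.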
8. Challenges
● Arabic Morphology is highly inflectional
– Many affixes (articles, prepositions, pronouns etc.)
● Arabic Morphology is highly derivational
– 10,000 roots and 120 patterns for derivation
● No Capital Letters in Named Entities
– Unlike Latin-script languages
● Scarcity of Arabic Language Resources
– corpora, lexicons, and machine-readable dictionaries
9. Approaches
● Stemming
– Removing prefixes, suffixes and infixes from words
– Match root with patterns
– Language dependent rules
– Identifying the most frequently used affixes statistically
● Named Entity Recognition
– Maxent model or CRF
– ANERcorp and ANERgazet
● Language Resources
– Arabic WordNet
– Penn Arabic Treebank
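The affix-removal idea listed under Stemming can be illustrated with a simple light stemmer. This sketch is an assumption-laden simplification: the affix lists are a small illustrative sample, the `light_stem` name and the minimum stem length of 3 are hypothetical choices, and neither infix removal nor root and pattern matching is attempted.

```python
# Illustrative affix lists (assumptions), ordered longest first.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]
MIN_STEM_LENGTH = 3  # never shorten a word below this many letters

def light_stem(word):
    """Strip at most one prefix and one suffix from an Arabic word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
            break
    return word
```

For example, `light_stem("والكتاب")` removes the conjunction and the definite article and returns "كتاب", while a short word such as "في" is left unchanged because stripping would leave fewer than three letters.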
10. Tools
● NOOJ for Arabic NLP
– C# .NET Freeware linguistic engineering development environment
– Supports Regular Expressions and Context Free Grammars
– Has Arabic Language resources (Sample Text and Dictionary)
● Amine Platform
– Java platform for intelligent systems and multi-agents
– Used for semantic analysis of questions and answers
– Uses Conceptual Graphs, Knowledge bases, and Ontologies
● JIRS, a Java passage retrieval system
– Search based on question n-grams
– Based on the vector space model
– Simple N-gram Model (SNM)
– Term-weight N-gram Model (TNM)
– Distance N-gram Model
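The general idea behind searching on question n-grams can be sketched as follows. This is not the scoring formula of JIRS or of any of its models: it is a simplified, unweighted overlap measure, and the names `ngrams`, `ngram_overlap`, and `max_n` are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(question_tokens, passage_tokens, max_n=3):
    """Fraction of the question's n-grams (n = 1..max_n) found in the passage."""
    found = total = 0
    for n in range(1, max_n + 1):
        question_grams = ngrams(question_tokens, n)
        total += len(question_grams)
        found += len(question_grams & ngrams(passage_tokens, n))
    return found / total if total else 0.0

question = ["capital", "of", "egypt"]
close = ["cairo", "is", "the", "capital", "of", "egypt"]
loose = ["egypt", "has", "a", "capital"]
```

A passage that preserves the question's word order (`close`) matches the longer n-grams as well as the single words, so it scores higher than one that merely contains some of the same words (`loose`).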
11. Tools [continued]
● Arabic Stemmers
– Khoja Arabic stemmer (with a root dictionary)
– AraMorph (uses Transliteration to English Letters)
– Information Science Research Institute’s (ISRI) stemmer
(without a root dictionary)
● GATE (General Architecture for Text Engineering)
– Java-based platform comprising a tokenizer, a
gazetteer, a sentence splitter, a part-of-speech tagger, a
named entity transducer, and a coreference tagger
– Plugins for machine learning with Weka, RASP,
MAXENT, SVM Light
– Managing ontologies like WordNet
12. Tools [continued]
● OpenNLP
– Supports NLP tasks such as tokenization, sentence segmentation,
part-of-speech tagging, named entity extraction, chunking, parsing,
and coreference resolution
– Includes maximum entropy and perceptron-based machine learning
● Stanford NLP
– Java Framework with many NLP modules for:
– Dependency parsers, and a lexicalized PCFG parser
– Part-of-speech (POS) tagger
– CRF-based Named Entity Recognizer
– CRF-based Word Segmenter
– Maxent Text Classifier
– TokensRegex: regular expressions over tokens
13. Future Trends and Open Issues
● More research on Arabic restricted domain QA
– Makes semantic tasks like word sense disambiguation easier
– Domain rules affect how the question is posed and how the answer
is formulated
– A restricted domain should be circumscribed, practical, and complex
– E.g. Agriculture, Architectural Engineering or any field of science
– But not news and current events as they have no constraints
● Use of deep application-dependent approaches
– Use application-dependent constraints and rules to guide question
analysis and answer extraction and validation
– Depending on the available resources
14. Future Trends and Open Issues [continued]
● Intensive usage of semantics
– Arabic QA has focused on morpho-syntactic approaches
– Very few systems have used the Arabic WordNet
– Still a lot to be done in the field of word sense
disambiguation, coreference resolution, and ontology-based
reasoning
● Use of theorem proving & deep reasoning
● Use of logic-based and inference-based approaches
15. Summary
● Motivation
● Question Answering Tasks
- Question Analysis, Passage Retrieval, and Answer
Extraction
● Arabic Language Challenges
● Approaches
- Stemming, Named Entity Recognition, Language
Resources
● Tools
● Future Trends And Open Issues