Natural Language Processing (NLP) is an interdisciplinary field that aims to give computers the ability to communicate as human beings do. NLP for the Amharic language has improved considerably over time thanks to researchers at PhD and MSc level at AAU. Here I have studied the problem and come up with a limited-scope solution that performs syntax parsing for Amharic and draws syntax parse trees using Python.
Natural language processing with python and amharic syntax parse tree by daniel adenew msc
1. Amharic Language
Syntax Parsing and
Parse Tree
By: Daniel Adenew MSC (AAU)
source code:
http://www.sourcepod.com/gzvjuw15-20791
2. Abstract
Natural Language Processing (NLP) is a major field of study in computer science. Computers become far more
capable when they are equipped with processing logic that increases their ability to understand, interpret and
communicate using human language. A lot of work has been done, and is still being done, to bring these
features of communication to computers. As a result, there are techniques, tools and scientific approaches,
generally referred to as NLP, for giving computers this ability. For example, computers must understand
characters, words, sentences, paragraphs, sounds and speech more or less as a human being does. In
this report I look at how to enable computers to understand the structure of human-constructed
sentences. This is well known in NLP as syntax parsing: identifying how the
words in a given sentence are related to each other. This report focuses only on syntax parsing of Amharic
sentences, for example አበበ በሶ በላ፡፡ (some examples are omitted for lack of space).
Keywords: NLP, Python, Syntax Parser, CFG, PCFG, Grammar, Amharic Language Sentence, NLP
Tools.
3. Background
Amharic is the official language of Ethiopia. It is a morphologically rich language with characteristics similar to those of
other members of the Semitic language family such as Arabic and Hebrew. Amharic is the second largest Semitic
language: the speakers of Arabic count in the hundreds of millions, of Amharic in the tens of millions, and of Hebrew and Tigrinya in
the millions. [5] Amharic is quite different in its spoken and written forms, because
it has a complex morphology in which nouns (and adjectives) are inflected for gender, number, definiteness and
case. Definite markers and conjunctions are suffixed to nouns, while prepositions are prefixed. As in other Semitic languages,
the verbal morphology is rich and based on triconsonantal roots. There are several obstacles to
incorporating Amharic effectively into NLP. One obstacle to the development of NLP tools
was the lack of standardization: an international standard for Ethiopic script was agreed on only in 1998 and incorporated into Unicode
in 2000. [5] Another major obstacle has been the lack of large-scale resources such
as corpora, and of tools that can handle the characters of the Amharic script, called 'Fidel', because of differences between ASCII and Unicode
representations. I ran into this problem first-hand while developing this syntax parser.
4. Introduction
Humans are naturally gifted with communication, whether spoken, signed or written.
Communication plays a vital role in our day-to-day activities. Computers, on the other hand, have
a limited capability for communicating with humans. Because computers have become central to
simplifying our daily lives, the need to improve their ability to communicate
with humans effectively and efficiently keeps growing. Natural Language Processing, as a field of scientific
inquiry, plays an important role in increasing computers' capability to understand natural languages, the languages
in which most human knowledge is recorded. NLP is concerned with the design and implementation of tools,
techniques and frameworks that enable computers to communicate effectively like, and with, humans.
5. ..continued
As a matter of fact, the tools mentioned above, and many other NLP tools, have been developed for English to
a higher degree of acceptance, efficiency and correctness than for Amharic. For
Amharic, a considerable amount of research has been done and is under way to narrow this gap
in different areas of NLP. Syntax parsing is one of the steps in designing a functional NLP
application, and its output can serve as input to many other NLP applications such as grammar checkers,
spell checkers and spelling correctors. Syntax parsing centers on breaking a sentence
down into manageable components and understanding their context and their relations with
each other in order to judge whether the sentence is well formed. Sentences are the starting point when analyzing
written material or documents. Syntax refers to the way words are related to each other in a sentence.
6. ..continued
Today, parsers of different kinds (e.g. probabilistic, rule-based) have been developed for languages that are
relatively widely used nationally and/or internationally (e.g. English, German, Chinese). [1]
Example 1: The sentence አበበ የሰዉ አጥር አፈረሰ ::
can be parsed as
(S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር)) (V አፈረሰ)))
Syntax parse trees produced by the developed
syntax parser application.
8. Statement of the problem
The problem is that we really need a syntax parser that can automatically
parse a given sentence regardless of its length, that can resolve ambiguities,
for example by using probabilistic approaches, and that can be trained to learn from example
sentences how to parse. One drawback of existing NLP tools for Amharic can be
seen in the Google online translation tool, which supports translation to and from
many languages, including morphologically complex languages like Hebrew and
Arabic, but not Amharic.
9. Statement of the problem
The major aim of this report is to contribute a little to research on NLP for Amharic by developing a
syntactic analyzer (i.e. a sentence parser) using rule-based and probabilistic grammar parsing.
The approach I have followed in this study is to explore current and previous progress on syntax parsers, including the
mechanisms, techniques, tools, theories and algorithms they use, because syntax parsing is the second
level of analysis in NLP and a very important component of many NLP applications built, or yet to be built, for
Amharic.
The approach followed in the design and development of the parser combines rule-based and
statistical techniques. Statistical NLP applications of this sort require a large volume of data, such as a hand-tagged
and hand-parsed corpus. Such corpora are currently available for many natural languages (for instance,
English), but no such corpus is available for Amharic. Studies of this kind are believed to
contribute to initiating the compilation of such a corpus.
10. Purpose of the Study
The purpose of this report is to make a researcher like me familiar with the challenges of NLP for
Amharic and with the tools and techniques for development, and to help fill the gap left by the lack of a syntax parser for
Amharic. As far as I could explore in the time available for writing this report, there is possibly no other
syntax parser to date, built with current technologies, that can be used as a component in another NLP application.
This report is believed to provide current information, experimental outputs and challenges for future researchers, and to
clear the road a little toward syntax parsing for Amharic. It gives a general overview of the
available grammar (syntax) parsing methods, algorithms and tools that can produce the desired output (a syntax
parse tree for a given Amharic sentence), and provides a sample that can strengthen work on Amharic syntax parsing, a problem that,
in my opinion, is coming closer to being resolved in the near future. If God allows, I would like to extend this work
into my master's thesis and to continue it in a PhD program.
11. Limitation of the study
●
This study uses a very small sample prepared for the purpose of the work, due to lack of time and the
difficulty of finding a well-organized corpus, a machine-editable dictionary or POS-tagged words. In particular, I was unable to find
a POS tagger application for Amharic, so I used a manual dictionary to POS tag the
words of a sentence and later to construct the parsed sentence and parse tree using my application.
●
The prototype developed in this study is assumed to support Amharic sentences composed of 10 or more
words, but its real capability could not be established, mainly due
to time constraints, my limited linguistic ability to determine the grammar rules and probabilistic
rules (which I believe should be used as a hybrid) and the unavailability of the processed data needed. The
prototype can support increasingly complex sentences if the above
limitations are properly addressed.
12. Limitation of the study
●
This report does not cover more advanced topics such as ambiguity resolution, but it shows sample
parsing using probabilistic approaches.
●
This study shows a statistical way of parsing a sentence, but the initial probability values assigned
to words or sentence components were chosen by the syntax parser developer (me). In the future, words
and their probability values should be supplied automatically from a
grammar read from a file (corpus) or a similar dynamic input mechanism.
13. Literature Review
Sentences and Parsing
A natural language system must have a considerable knowledge about the structure of the
language itself, including what the words are, how words are combined to form sentences, what the words mean,
how word meanings contribute to sentence meanings and so on (Allen, 95). The major purpose of parsing in
general and sentence parsing in particular is extracting structural and semantic information from the input text
(Abiyot, 2000).
Example
'I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas'.
A grammar permits the sentence to be analyzed in two ways, depending on whether the prepositional phrase in my pajamas describes the
elephant or the shooting event.
14. Literature Review
Grammar for the above sentence, which admits multiple structures:
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
15. Literature Review
The two parsed structures for the sentence are:
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
16. Literature Review
Syntax parse trees as follows:
A single sentence can have multiple parse trees; this is referred to as
ambiguity.
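The ambiguity can be reproduced with a few lines of NLTK. This is a minimal sketch, assuming NLTK 3 is installed; the variable names are my own, and a chart parser is used here because the grammar contains the left-recursive rule VP -> VP PP, which a plain recursive descent parser cannot handle.

```python
import nltk

# The grammar from the Literature Review slides, written for NLTK 3.
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']

# A chart parser returns every tree the grammar licenses for the sentence.
parser = nltk.ChartParser(groucho_grammar)
trees = list(parser.parse(sent))

print(len(trees), "parse trees found")
for tree in trees:
    print(tree)
```

Each printed tree corresponds to one reading: the prepositional phrase attached to the elephant, or attached to the shooting event.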
17. Literature Review
Context Free Grammar
A context-free grammar (CFG) is a formal system that describes a language by specifying how any legal text can
be derived from a distinguished symbol called the axiom, or sentence symbol. [5]
An example of a CFG is given below.
A sentence like "አበበ የ ሰዉ አጥር ላይ ሆኖ አየ" can be represented using the following grammar.
S -> NP VP
VP -> V NP | V NP PP | NP V
PP -> P NP | P P
V -> "አየ" | "በላ" | "ተራመዳ"
NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | Det N N PP
Det -> "የ" | "ለ"
N -> "ሰዉ" | "ውሻ" | "አጥር" | "ድመት" | "ቲልሳኦፕ" | "መናፈሻ"
P -> "በ" | "ላይ" | "በኩል" | "ሆኖ" | "ከ"
18. Literature Review
The parse structure for the above example, and its parse tree drawn using the developed application,
look like the following, respectively:
(S (NP አበበ) (VP (NP (Det የ)
(N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ)))
(V አየ)))
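As a rough check, the grammar on the previous slide can be loaded into NLTK and asked for the parse of the example sentence. This is a sketch under the assumption that NLTK 3 is available; the names `grammar`, `tokens` and `flat` are illustrative, and the sentence here leaves out ትንሽ because this grammar has no Adj rule.

```python
import nltk

# CFG from the "Context Free Grammar" slide.
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP | NP V
PP -> P NP | P P
V -> "አየ" | "በላ" | "ተራመዳ"
NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | Det N N PP
Det -> "የ" | "ለ"
N -> "ሰዉ" | "ውሻ" | "አጥር" | "ድመት" | "ቲልሳኦፕ" | "መናፈሻ"
P -> "በ" | "ላይ" | "በኩል" | "ሆኖ" | "ከ"
""")

tokens = "አበበ የ ሰዉ አጥር ላይ ሆኖ አየ".split()

# Collect every parse and flatten each tree onto a single line.
trees = list(nltk.ChartParser(grammar).parse(tokens))
flat = [" ".join(str(tree).split()) for tree in trees]
for line in flat:
    print(line)
```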
19. Literature Review
Recursive Descent Parsing
The simplest kind of parser interprets a grammar as a specification of how to break a
high-level goal into several lower-level subgoals. The top-level goal is to find an S.
The S → NP VP production permits the parser to replace this goal with two subgoals:
find an NP, then find a VP. Each of these subgoals can be replaced in turn by sub-subgoals, using productions that have NP and VP on their left-hand side.
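To make the goal and subgoal idea concrete, here is a small recursive descent parser written in plain Python. It is only a sketch, not the code of the developed application: the dictionary layout and the function names (`expand`, `expand_seq`, `parse`, `bracket`) are my own, and it relies on the grammar having no left-recursive rules.

```python
# Grammar as a dictionary: each nonterminal maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"], ["NP", "V"]],
    "PP": [["P", "NP"]],
    "V": [["አየ"], ["በላ"], ["ተራመዳ"]],
    "NP": [["አበበ"], ["ከበደ"], ["ጫላ"], ["Det", "N"], ["Det", "N", "N"],
           ["Det", "N", "PP"], ["N", "N"], ["N"]],
    "Det": [["የ"], ["ለ"]],
    "N": [["ሰዉ"], ["ውሻ"], ["ድመት"], ["ቲልሳኦፕ"], ["መናፈሻ"]],
    "P": [["በ"], ["ላይ"], ["በኩል"], ["ከ"]],
}

def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can match tokens[pos:]."""
    if symbol not in GRAMMAR:  # a terminal: it must equal the next word
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:  # a goal: try each production in turn
        for children, end in expand_seq(rhs, tokens, pos):
            yield (symbol, children), end

def expand_seq(symbols, tokens, pos):
    """Satisfy a sequence of subgoals from left to right."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in expand(symbols[0], tokens, pos):
        for rest, end in expand_seq(symbols[1:], tokens, mid):
            yield [tree] + rest, end

def parse(tokens):
    """Return all trees rooted in S that consume every token."""
    return [tree for tree, end in expand("S", tokens, 0) if end == len(tokens)]

def bracket(tree):
    """Render a tree in the bracketed notation used on the slides."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    return "(" + label + " " + " ".join(bracket(c) for c in children) + ")"

trees = parse("አበበ የ ሰዉ ውሻ አየ".split())
for tree in trees:
    print(bracket(tree))
```

Generators are used so that the parser backtracks automatically: when a subgoal fails, the next alternative production is tried.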
20. Literature Review
Sample code taken from Python Language Processing
import nltk

# CFG.fromstring is the NLTK 3 name for the older nltk.parse_cfg
grammarx = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP | NP V
PP -> P NP
V -> "አየ" | "በላ" | "ተራመዳ"
NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | N
Det -> "የ" | "ለ"
N -> "ሰዉ" | "ውሻ" | "ድመት" | "ቲልሳኦፕ" | "መናፈሻ"
P -> "በ" | "ላይ" | "በኩል" | "ከ"
""")
sent = "አበበ የ ሰዉ ውሻ አየ".split()
print(sent)
rd_parser = nltk.RecursiveDescentParser(grammarx)
# parse() yields every tree the grammar allows (nbest_parse in older NLTK)
for tree in rd_parser.parse(sent):
    print(tree)
# Tree.fromstring is the NLTK 3 name for the older nltk.Tree.parse
parse_tree = nltk.Tree.fromstring('(S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N ውሻ)) (V አየ)))', remove_empty_top_bracketing=True)
parse_tree.draw()  # opens a window showing the parse tree
21. ..continued
Parsed structure output: (S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N ውሻ)) (V አየ))).
Syntax parse tree for the above sentence, parsed using the recursive descent parser (top-down).
22. ..continued
Shift-Reduce Parsing
A simple kind of bottom-up parser is the shift-reduce parser. In common with all
bottom-up parsers, a shift-reduce parser tries to find sequences of words and phrases
that correspond to the right hand side of a grammar production, and replace them with
the left-hand side, until the whole sentence is reduced to an S.[5]
23. ..continued
For the sentence አበበ የ ሰዉ አጥር ላይ ሆኖ ትንሽ አየ, the parse structure and parse tree are given,
using the following CFG grammar.
S -> NP VP
VP -> V NP | V NP PP | NP V | NP Adj V
PP -> P NP | P P
V -> "አየ" | "በላ" | "ተራመዳ"
NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N| Det N N | Det N PP | N N | Det N N PP
Det -> "የ" | "ለ"
N -> "ሰዉ" | "ውሻ" |"አጥር"| "ድመት" | "ቲልሳኦፕ" | "መናፈሻ"
P -> "በ" | "ላይ" | "በኩል"|"ሆኖ"| "ከ"
Adj ->"ትንሽ"
24. ..continued
Parser Structure, parsed using the above grammar.
(S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ))) (Adj ትንሽ) (V አየ)))
Figure 1.8 Parser Tree
In a similar manner, keeping the source
code of code example 1.0 above,
we can use a shift-reduce parser.
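The shift and reduce steps themselves can be sketched in plain Python. This is my own illustration rather than the application's code: the names `RULES`, `LEXICON`, `shift_reduce` and `bracket` are hypothetical, and the sketch backtracks over its choices, whereas a purely greedy shift-reduce parser may reduce too early and miss a parse with this grammar.

```python
# Phrase rules of the grammar above, as (left-hand side, right-hand side) pairs.
RULES = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")), ("VP", ("V", "NP", "PP")),
    ("VP", ("NP", "V")), ("VP", ("NP", "Adj", "V")),
    ("PP", ("P", "NP")), ("PP", ("P", "P")),
    ("NP", ("Det", "N")), ("NP", ("Det", "N", "N")), ("NP", ("Det", "N", "PP")),
    ("NP", ("N", "N")), ("NP", ("Det", "N", "N", "PP")),
]
# Word categories of the grammar above; an unknown word raises KeyError.
LEXICON = {
    "አበበ": "NP", "ከበደ": "NP", "ጫላ": "NP", "የ": "Det", "ለ": "Det",
    "ሰዉ": "N", "ውሻ": "N", "አጥር": "N", "ድመት": "N", "ቲልሳኦፕ": "N", "መናፈሻ": "N",
    "በ": "P", "ላይ": "P", "በኩል": "P", "ሆኖ": "P", "ከ": "P",
    "አየ": "V", "በላ": "V", "ተራመዳ": "V", "ትንሽ": "Adj",
}

def shift_reduce(stack, remaining):
    """Return a tree for the whole input, or None if no parse exists."""
    if not remaining and len(stack) == 1 and stack[0][0] == "S":
        return stack[0]
    # Reduce: replace a matching right-hand side on top of the stack by its left-hand side.
    for lhs, rhs in RULES:
        n = len(rhs)
        if len(stack) >= n and tuple(node[0] for node in stack[-n:]) == rhs:
            result = shift_reduce(stack[:-n] + [(lhs, stack[-n:])], remaining)
            if result is not None:
                return result
    # Shift: move the next word, tagged with its category, onto the stack.
    if remaining:
        word = remaining[0]
        return shift_reduce(stack + [(LEXICON[word], [word])], remaining[1:])
    return None

def bracket(node):
    """Render a tree in the bracketed notation used on the slides."""
    if isinstance(node, str):
        return node
    label, children = node
    return "(" + label + " " + " ".join(bracket(c) for c in children) + ")"

tokens = "አበበ የ ሰዉ አጥር ላይ ሆኖ ትንሽ አየ".split()
tree = shift_reduce([], tokens)
print(bracket(tree))
```

Every phrase rule has at least two symbols on its right-hand side, so each reduction shortens the stack and the search always terminates.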
25. Dependency Grammar
Phrase structure grammar is concerned with how words and sequences of words combine to form constituents. A
distinct and complementary approach, dependency grammar, focuses instead on how words relate to other words.
Dependency is a binary asymmetric relation that holds between a head and its dependents. The head of a sentence
is usually taken to be the tensed verb, and every other word is either dependent on the sentence head, or connects
to it through a path of dependencies.
Sample code taken from the Python syntax parser application
import nltk

# DependencyGrammar.fromstring is the NLTK 3 name for the older nltk.parse_dependency_grammar
dep_grammar = nltk.DependencyGrammar.fromstring("""
'አየ' -> 'አበበ' | 'አጥር' | 'ላይ' | 'ሰዉ'
'አጥር' -> 'ላይ' | 'ሰዉ' | 'ሆኖ'
'ሰዉ' -> 'ኧሱ' | 'የ'
""")
print(dep_grammar)
26. ..continued
The Generated Output showing dependency of each word :
Dependency grammar with 9 productions
'አየ' -> 'አበበ'
'አየ' -> 'አጥር'
'አየ' -> 'ላይ'
'አየ' -> 'ሰዉ'
'አጥር' -> 'ላይ'
'አጥር' -> 'ሰዉ'
'አጥር' -> 'ሆኖ'
'ሰዉ' -> 'ኧሱ'
'ሰዉ' -> 'የ'
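The production count in the output can be verified without NLTK by expanding each rule into (head, dependent) pairs. This is a small illustrative sketch; the function name `read_dependencies` is my own and is not part of NLTK or of the developed application.

```python
# The dependency rules from the sample code, one head per line.
RULES = """
'አየ' -> 'አበበ' | 'አጥር' | 'ላይ' | 'ሰዉ'
'አጥር' -> 'ላይ' | 'ሰዉ' | 'ሆኖ'
'ሰዉ' -> 'ኧሱ' | 'የ'
"""

def read_dependencies(text):
    """Expand each rule into one (head, dependent) pair per alternative."""
    pairs = []
    for line in text.strip().splitlines():
        head, dependents = line.split("->")
        head = head.strip().strip("'")
        for dependent in dependents.split("|"):
            pairs.append((head, dependent.strip().strip("'")))
    return pairs

pairs = read_dependencies(RULES)
print("Dependency grammar with", len(pairs), "productions")
for head, dependent in pairs:
    print("'%s' -> '%s'" % (head, dependent))
```

Each alternative after the arrow becomes its own production, which is why three rule lines yield nine head and dependent pairs.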
27. Statistical Approaches
In statistical parsing, grammar rules specify the structures allowable in the language,
while probabilities specify the distributional regularities of sentence structures in the
language. That is, probabilistic reasoning by way of statistical probabilities is
introduced to assist reasoning.
It means that linguistic specifications and statistical regularities of syntax are combined
to be used for better syntax analysis. The probabilistic reasoning has become much
more popular in recent years (Yao and Lua, 1998).[1]
28. Probabilistic CFG parsing
Probabilistic Context-Free Grammar (or PCFG) is a context free grammar that associates a probability with each of
its productions. It generates the same set of parses for a text that the corresponding context free grammar does, and
assigns a probability to each parse. The probability of a parse generated by a PCFG is simply the product of the
probabilities of the productions used to generate it.[1]
PCFGs tend to be robust (Manning and Schütze, 1999). [1] They produce a model of a language based on real data,
and therefore do not have to worry about things like grammatical mistakes, which occur in real-life situations.
Although PCFGs have many advantages, a critical disadvantage is that context is not taken into account at all (Cahill,
2000).[8]
In fact, a tri-gram model of a language (here, a model over sequences of three words) would probably achieve better results
(Charniak, 1993), even though it takes no account of the internal structure of the language; this may be more applicable to a language
like Amharic.
29. Probabilistic CFG parsing
An example of a PCFG grammar is shown below; the approach is explained after the grammar.
S -> NP VP [1.0]
VP -> V NP [0.2]
VP -> V NP PP [0.3]
VP -> NP V [0.1]
VP -> NP Adj V [0.4]
PP -> P NP [0.2]
PP -> P P [0.8]
V -> "አየ" [0.8]
V -> "በላ" [0.1]
V -> "ተራመዳ" [0.1]
NP -> "አበበ" [0.2]
NP -> "ከበደ" [0.1]
NP -> "ጫላ" [0.1]
NP -> Det N [0.1]
NP -> Det N N [0.1]
NP -> Det N PP [0.1]
NP -> N N [0.1]
NP -> Det N N PP [0.2]
Det -> "የ" [0.9]
Det -> "ለ" [0.1]
N -> "ሰዉ" [0.4]
N -> "ውሻ" [0.1]
N -> "አጥር" [0.2]
N -> "ድመት" [0.1]
N -> "ቲልሳኦፕ" [0.1]
N -> "መናፈሻ" [0.1]
P -> "በ" [0.1]
P -> "ላይ" [0.4]
P -> "በኩል" [0.1]
P -> "ሆኖ" [0.3]
P -> "ከ" [0.1]
Adj -> "ትንሽ" [1.0]
30. Probabilistic CFG parsing
The parsed structural output produced by the Viterbi algorithm using the above grammar
is shown below, together with the final probability of the parse.
Code example using Python
import nltk
# grammar: the PCFG above, loaded with nltk.PCFG.fromstring
viterbi_parser = nltk.ViterbiParser(grammar)
sent = "አበበ የ ሰዉ አጥር ላይ ሆኖ ትንሽ አየ".split()
for tree in viterbi_parser.parse(sent):  # parse() yields the most probable tree
    print(tree)
Output of the above grammar and ViterbiParser in my application using Python
(S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ))) (Adj ትንሽ) (V አየ)))
(p=8.84736e-05)
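The reported probability can be checked by hand, since the probability of a parse is the product of the probabilities of the productions it uses. The sketch below lists the twelve productions in the parse above; the probability values are as I read them from the PCFG slide, so treat them as an assumption if the slide was transcribed differently.

```python
# Productions used in the parse above, with their probabilities from the PCFG slide.
used_productions = [
    ("S -> NP VP", 1.0),
    ("NP -> አበበ", 0.2),
    ("VP -> NP Adj V", 0.4),
    ("NP -> Det N N PP", 0.2),
    ("Det -> የ", 0.9),
    ("N -> ሰዉ", 0.4),
    ("N -> አጥር", 0.2),
    ("PP -> P P", 0.8),
    ("P -> ላይ", 0.4),
    ("P -> ሆኖ", 0.3),
    ("Adj -> ትንሽ", 1.0),
    ("V -> አየ", 0.8),
]

# Multiply the probabilities together to get the probability of the whole parse.
parse_probability = 1.0
for production, probability in used_productions:
    parse_probability *= probability

print("p=%.5e" % parse_probability)
```

Because the values are multiplied rather than added, longer sentences quickly produce very small probabilities, which is why the output is printed in scientific notation.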
31. Probabilistic CFG parsing
The example of a PCFG with its associated sentence probability is taken from the developed syntax parser
application. Note that the probabilities of the productions for each grammar symbol category, say NP, must sum to 1.0, so that
the grammar can be parsed using the Viterbi algorithm, which selects the most probable parse (this algorithm is also used in POS
taggers, as in the case of Mesfin 2001 [2]). In this grammar we can see that two productions
within the same category can have the same probability, for example:
V -> "አየ" [0.8]
V -> "በላ" [0.1]
V -> "ተራመዳ" [0.1]
Assume we have the following sentence:
አበበ የ ሰዉ አጥር ላይ ሆኖ አየ ::
How, then, is it resolved whether the production ends in በላ ("bela")? This is the advantage of a PCFG: based on
the probabilities along the path taken so far, the best match can be selected. This case is demonstrated in my application; see
the source code at the end of this document.
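The requirement that the probabilities in each category sum to 1.0 is easy to check mechanically before handing a grammar to a parser. The following is a minimal sketch with a hypothetical helper, `check_normalised`, applied to the V rules above plus the Det and PP rules as I read them from the PCFG slide.

```python
from collections import defaultdict

# (left-hand side, right-hand side, probability) triples.
rules = [
    ("V", "አየ", 0.8), ("V", "በላ", 0.1), ("V", "ተራመዳ", 0.1),
    ("Det", "የ", 0.9), ("Det", "ለ", 0.1),
    ("PP", "P NP", 0.2), ("PP", "P P", 0.8),
]

def check_normalised(rules, tolerance=1e-9):
    """Map each left-hand side to True if its probabilities sum to 1.0."""
    totals = defaultdict(float)
    for lhs, _rhs, probability in rules:
        totals[lhs] += probability
    return {lhs: abs(total - 1.0) <= tolerance for lhs, total in totals.items()}

result = check_normalised(rules)
for lhs, ok in result.items():
    print(lhs, "sums to 1.0" if ok else "does NOT sum to 1.0")
```

A small tolerance is used because sums of decimal fractions such as 0.8 + 0.1 + 0.1 are not always exactly 1.0 in floating-point arithmetic.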
32. Methodology
The methodology I used to develop this sample application is to take a set of 4 sample grammars,
ranging from simple to complex production rules, assign them probabilities for
probabilistic parsing, draw their parse trees and give their parse structures based
on the grammar.
To develop the application, speaking in terms of source code, I used a collection of tools that
support the main application for different purposes. They are listed below.
●
Python 3.2
●
NLTK 3.0, a Python-based Natural
Language Processing Toolkit (www.nltk.org)
●
KeyMan keyboard for Unicode
keyboard input (Amharic)
●
PyScripter 3.2 as an
interactive IDE for Python.
33. Methodology
To set up my application in a local environment, first Python 3.2 must be
installed; then download NLTK 3.0 and install it under the Python directory,
because it is used as a library inside Python code. Then you need to download the NLTK
data using Python itself.
Example using the command line in Windows (go to CMD):
Type python in the Windows `CMD`
At the Python prompt, type import nltk and then nltk.download() to download the data
But first you need to install NLTK, following the installation instructions on www.nltk.org
35. Significance of study
This study is significant because, as a matter of fact, for the Amharic
language we do not really have this kind of parser so far. The study seems to
open up many possibilities for easing the parsing of Amharic sentences and takes our
Amharic syntax parsing approaches one step ahead. It also shows that there is a
very easy and more accurate way of parsing Amharic syntax. Compared to the
previous attempts of other researchers, I am not saying this study is above them all, but I think it
addresses some of the approaches and problems they mention in their studies [Alebachew,
Abiyot, Mesfin], such as probabilistic approaches, automatic parsing and the need to write a
grammar parser, along with more on the programming side.
36. Significance of study
By taking this study further as a more advanced research project, with more time and effort, I
believe a real syntax parser for the Amharic language can be developed.
This study has tried hard to show how to handle Amharic sentences using rule-based and
probabilistic approaches, and the code and application output of the study are
available at the end of this document. It can also motivate researchers, students and
stakeholders to move forward from where I left off in this limited amount of time;
by seeing the source code and the method I have suggested, they can benefit a great deal,
I believe. Above all, the one thing I want to emphasize is the growth of Amharic NLP
capabilities; that is what this study is dedicated to.
39. Thank you!
Comment and contact me
@ mr.prog60@gmail.com
linkedin: daniel adenew
academia: daniel adenew
google : daniel adenew
slideshare : daniel adenew ,dannymanone