International Journal of Computer Science and Engineering Survey (IJCSES), Vol.13, No.2, April 2022
DOI: 10.5121/ijcses.2022.13201 1
A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIES
Onyenwe Ikechukwu1, Onyedikachukwu Ikechukwu-Onyenwe1, Onyedinma Ebele1, Chidinma A. Nwafor2
1Nnamdi Azikiwe University Awka, Anambra State, Nigeria
2Nigerian Army College of Environmental Science and Technology, Makurdi, Benue State, Nigeria
ABSTRACT
Developing an automatic parts-of-speech (POS) tagger for any new language is considered a necessary step before further computational linguistics methods beyond tagging, such as chunking and parsing, can be fully applied to that language. Many POS disambiguation technologies have been developed for this type of research, and several factors influence the choice among them. These technologies are either corpus-based or non-corpus-based. In this paper, we present a review of POS tagging technologies.
KEYWORDS
Languages, Parts-of-Speech technology, Corpus, Natural Language Processing.
1. INTRODUCTION
Parts-of-speech (henceforth POS) tagging is the process of assigning a POS or other lexical class marker to each word in a text according to its definition and context [4] [14]. It is an important enabling task for NLP applications, serving as a pre-processing step for syntactic parsing, information extraction and retrieval, statistical machine translation, corpus linguistics, etc.
Automating POS tagging involves three main tasks: segmenting text into tokens, assigning the most probable tags to tokens (which creates room for ambiguity), and determining the appropriateness of each candidate tag in order to resolve that ambiguity. All POS tagging technologies exploit the same contextual information in their analysis for disambiguation, namely the preceding (or following) words and their POS tags. [2] justified the effectiveness of this approach in the following comment:
“Corpus-based methods are often able to succeed while ignoring the true complexities of
language, banking on the fact that complex linguistic phenomena can often be indirectly observed
through simple epiphenomena” [2]
Brill illustrated the above statement using “race” in examples 1–3, showing that one can assign POS tags to the words in a sentence by considering the contextual information.
1. He will race/verb the car.
2. He will not race/verb the car.
3. When will the race/noun end?
The POS tags attached to the word “race” were derived by considering the preceding (following) words and their associated tags, based on the fact that a word seen immediately to the right of a modal, in this case “will”, is almost always a verb. This linguistic generalization fails only when a determiner appears to the right of the modal. The assumption is that the determiner “a/the” precedes either a noun or an adjective in English; in this case, it precedes a noun. Therefore, restricting POS tagging to the use of this minimal information, while ignoring the complexities found in most languages, can be highly effective.
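The generalization behind examples 1–3 can be written down as a small rule. The following is a rough illustrative sketch, not Brill's actual implementation; the tag names and the modal/determiner word lists are assumptions made for the example:

```python
def tag_after_modal(words):
    """Toy illustration of the 'word after a modal is a verb' rule.

    Tags the word immediately after a modal such as 'will' as a verb,
    unless a determiner intervenes, in which case the word after the
    determiner is taken to be a noun (or adjective).
    """
    MODALS = {"will", "can", "may", "shall"}
    DETERMINERS = {"a", "an", "the"}
    tags = {}
    for i, w in enumerate(words):
        if w.lower() in MODALS and i + 1 < len(words):
            nxt = words[i + 1]
            if nxt.lower() in DETERMINERS and i + 2 < len(words):
                tags[i + 2] = "noun"   # e.g. "When will the race end?"
            else:
                tags[i + 1] = "verb"   # e.g. "He will race the car."
    return [tags.get(i, "?") for i in range(len(words))]

print(tag_after_modal(["He", "will", "race", "the", "car"]))
print(tag_after_modal(["When", "will", "the", "race", "end"]))
```

On the two example sentences, the rule tags “race” as a verb in the first and a noun in the second, exactly as in examples 1 and 3.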
2. PARTS-OF-SPEECH TECHNOLOGY
The various POS technologies developed are either corpus-based or non-corpus-based. The above illustrations are examples of the corpus-based method. [1] and [14] identified the types of technologies developed and classified them on the following basis:
1. Source of knowledge: the linguistic generalizations used in the model are derived either from the grammatical knowledge of a linguist or from a text corpus.
2. Expressed as: the derived linguistic generalizations are encoded either as rules or as probabilities (linguistic information represented as numerical parameters).
Figure 1. POS tagging Techniques
Logically, there are four possible POS tagging techniques in Figure 1, namely the rule-based method (A), the probabilistic-based method (B), the corpus-based method (C), and an unknown mode of operation (D). The illustration in Figure 1 corresponds closely to the four possible disambiguation methodologies in [1].
2.1. Rule-Based Method (A)
Rule-based (A) is the approach in which linguistic information is expressed as rules. It is the earliest POS tagging technique, developed between the 1960s and 1980s, and combines a lexicon with hand-crafted rules. An example model is EngCG (English Constraint Grammar). In this approach, linguistic generalizations form the information called constraints. Each rule contains a set of instructions that determine how operations are to be performed, together with contextual information that describes where to apply the rule. The operations alter the tags associated with ambiguously tagged terms so that one or more potential tags are eliminated from consideration, thereby reducing ambiguity.
The rule-based approach is more than using grammar as traditionally formulated; it involves syntactic analysis. The work of [15], “string analysis of sentence structure”, is an example. Harris describes it as “a process of syntactic analysis which is intermediate between the traditional constituent analysis and transformational analysis.” Another example is constraint grammar (CG), a method implemented in the work of [10]. Karlsson defined CG as “a language-independent formalism for surface-oriented, morphology-based parsing of unrestricted text.”
Figure 2. EngCG Architecture
CG has been implemented for English, Swedish, and Finnish. It is based on the disambiguation of morphological readings and syntactic labels. Tags are assigned directly, from the lexicon through to the syntax. The constraints discard as many of the suggested alternative tags as possible, the aim being a fully disambiguated sentence with one tag per word. Figure 2 shows the architecture of CG. The CG rule-based approach has been successfully applied to several European languages, such as English and French. The EngCG tagger, using 1,100 constraints, achieved a recall of at least 99.7% with 93-97% of words fully disambiguated [1]. Greene and Rubin also used a rule-based approach in the TAGGIT system [8], which aided in POS tagging the Brown corpus [7]. 77% of the corpus was disambiguated by TAGGIT; the remaining ambiguous words were corrected manually over a period of several years.
2.2. Probabilistic-Based Method (B)
This approach enables linguistic generalizations derived from corpus data to be represented as numerical parameters. The frequencies at which different sequences of tags occur are derived from the corpus data. The derived information is then used to verify “which of these possible tags is the most likely for w (the observed word) in the vicinity X (X could be a sentence under observation) given a sequence of tags (ti...tn)”. The statistical information gathered helps in choosing the right tag for an ambiguously tagged word using any applicable machine learning design. To illustrate, a tagger, after training on a gold corpus, might learn through statistics that the tag for a subject pronoun is followed by a verb tag 70% of the time, an adverb tag 29% of the time, and a noun tag 1% of the time. If the tagger later encounters a new instance, say a word following a subject pronoun that was ambiguously tagged as either a noun or a verb, it can use this statistical knowledge to decide that the verb tag is most likely to be correct [1]. In practice, however, this process alone would be incapable of handling long sequences of ambiguous words. Hence the modern probabilistic approach known as the Markov model. This model works by estimating the probability of a sequence of tags, given the transition probability of the different tags in the sequence, P(ti|ti−1). A Markov model uses transition probabilities together with observation likelihoods P(wi|ti) to calculate the possible tag for ambiguously tagged words. These probabilities correspond to the prior and the likelihood, respectively.
Given an annotated corpus, to calculate the maximum likelihood estimate of P(ti|ti−1), we count the occurrences of ti following ti−1 and divide by the count of ti−1, that is:
P(ti|ti−1) = C(ti−1, ti) ÷ C(ti−1) (1)
For example, determiners are very likely to be followed by adjectives and nouns in English (e.g. that/DT flight/NN and the/DT yellow/JJ hat/NN). Thus, using equation 1, we would expect P(NN|DT) and P(JJ|DT) to have higher scores compared to P(DT|JJ) and P(DT|NN). In the same way, we can compute the maximum likelihood estimate for a word w under observation: we count how many times w occurs in the annotated corpus with tag t, and divide the outcome by the number of occurrences of t in the corpus, that is:
P(wi|ti) = C(ti, wi) ÷ C(ti) (2)
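Both estimates are simple ratios of corpus counts, so they can be computed in a few lines. A minimal sketch in Python, using a tiny hypothetical tagged corpus (the sentences and the resulting numbers are purely illustrative, not drawn from a real treebank):

```python
from collections import Counter

# Tiny hypothetical annotated corpus: (word, tag) pairs per sentence.
corpus = [
    [("the", "DT"), ("flight", "NN"), ("leaves", "VBZ")],
    [("the", "DT"), ("yellow", "JJ"), ("hat", "NN")],
    [("that", "DT"), ("flight", "NN"), ("left", "VBD")],
]

tag_bigrams = Counter()   # C(t_{i-1}, t_i)
tag_counts = Counter()    # C(t_i)
word_tag = Counter()      # C(t_i, w_i)

for sent in corpus:
    tags = [t for _, t in sent]
    for w, t in sent:
        tag_counts[t] += 1
        word_tag[(t, w)] += 1
    for prev, cur in zip(tags, tags[1:]):
        tag_bigrams[(prev, cur)] += 1

def p_transition(cur, prev):
    """Equation 1: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})."""
    return tag_bigrams[(prev, cur)] / tag_counts[prev]

def p_emission(word, tag):
    """Equation 2: P(w_i | t_i) = C(t_i, w_i) / C(t_i)."""
    return word_tag[(tag, word)] / tag_counts[tag]

print(p_transition("NN", "DT"))    # 2 of the 3 DT tokens are followed by NN
print(p_emission("flight", "NN"))  # 2 of the 3 NN tokens are "flight"
```

In this toy corpus both estimates come out to 2/3, reflecting the counts noted in the comments.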
[9] stated “... in POS tagging, while we recognize the words in the input, we do not recognize the POS tags. Thus, we cannot condition any probabilities on, say, a previous POS tag, because we cannot be completely sure exactly which tag applied to the previous word.” Hence the term hidden in Hidden Markov model: the transitions are hidden from view. A Hidden Markov Model (HMM) handles both observed and hidden events. The observed events are the words that serve as input, while the hidden events are the POS tags. The Markov model parameters useful in POS tagging are the states (the tags), the symbols the states output (their associated words), and the transitions from one state to another (which make up the sequence of tags allocated to a sentence). Tag transition probabilities answer the question “how likely are we to see a tag given the previous tag?”, while the lexical likelihood measures how likely the observed word is given a POS tag. The final estimate is computed by calculating the maximum likelihood for all the probabilities, which can be obtained from the corpus counts. The most likely tag sequence in a Markov model can be found either by exhaustive computation or by selecting the most probable path without calculating the probability of every tag sequence.
Using the example in [9], the former works by focusing on the observed word and the tag sequence. For instance, the word “race” can be a noun or a verb in the following:
1. Secretariat/NNP is/BEZ expected/VBN to/TO race/VB tomorrow/NR.
2. People/NNS continue/VB to/TO inquire/VB the/AT reason/NN for/IN the/AT race/NN for/IN outer/JJ space/NN.
A graph view of sentences 1 and 2 above is given in Figure 3.
Figure 3. Graph showing three probabilities (arcs 1, 2 and 3) where sentences 1 and 2 above differ
(Source: [9])
Arcs 1, 2 and 3 in Figure 3 are the probabilities that an HMM uses to resolve ambiguously POS-tagged words, as in the above case (race/NN vs. race/VB). Arcs 1 and 2 are the transition probability P(ti|ti−1) and the lexical likelihood P(wi|ti), while arc 3 is the probability of the following tag, here NR. For the transition probabilities P(VB|TO) and P(NN|TO), assume that verbs are about 500 times as likely as nouns to occur after TO; equation 1 then gives P(VB|TO) = 0.83 and P(NN|TO) = 0.00047. The lexical likelihood of “race” given a tag (i.e. the probability of tags VB and NN producing the word race), using equation 2, is P(race|NN) = 0.00057 and P(race|VB) = 0.00012. Finally, the probability of the following tag (arc 3 in Figure 3) is P(NR|VB) = 0.0027 and P(NR|NN) = 0.0012. Multiplying the probabilities along each path gives
 P(VB|TO) P(NR|VB) P(race|VB) = 0.00000027
 P(NN|TO) P(NR|NN) P(race|NN) = 0.00000000032
With these statistics, the HMM tagger will correctly tag race as a VB. The Markov model has been widely used under the assumption that a token depends probabilistically on its parts-of-speech category, which in turn depends solely on the categories of the following and preceding one or two words [5].
Markov model taggers are also called n-gram taggers because of the transition probabilities, which range over a window of n words. There are unigram taggers where n = 1, bigram taggers where n = 2, and trigram taggers where n = 3. These taggers undergo a training process to learn the statistical behaviour of the tags associated with words through an examination of the training data (a manually tagged corpus). The learned statistics are then used in the disambiguation step. In the Markov model, the Viterbi algorithm allows an appropriate tag to be selected by choosing the most probable path, without calculating the probability of every path through the ambiguously tagged sequence of words. It is the most widely used decoding algorithm for HMMs, and is particularly relevant in speech and language processing. According to [12], there are three reasons for choosing Viterbi over maximum likelihood tagging. Firstly, it is easier to implement. Secondly, it cannot produce in its output sequences of tags which are not possible in the model, which is theoretically possible in maximum likelihood tagging. Thirdly, it gives the best analysis of the sentence as a whole.
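A minimal Viterbi decoder over such transition and emission tables can be sketched as follows. The tables below reuse the probabilities quoted earlier from [9] as toy values; the start probabilities and the restriction to three tags are assumptions made for the example:

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words`.

    start_p[t]: P(t at position 0); trans_p[(prev, t)]: P(t | prev);
    emit_p[(t, w)]: P(w | t). Missing entries default to 0.
    """
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (start_p.get(t, 0.0) * emit_p.get((t, words[0]), 0.0), [t])
            for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            # Extend the best predecessor path rather than enumerating all paths.
            prob, path = max(
                (best[prev][0] * trans_p.get((prev, t), 0.0), best[prev][1])
                for prev in tags
            )
            new_best[t] = (prob * emit_p.get((t, w), 0.0), path + [t])
        best = new_best
    prob, path = max(best.values())
    return path

# Toy tables: after TO, a verb is far more likely than a noun.
tags = ["TO", "VB", "NN"]
start_p = {"TO": 1.0}
trans_p = {("TO", "VB"): 0.83, ("TO", "NN"): 0.00047}
emit_p = {("TO", "to"): 1.0, ("VB", "race"): 0.00012, ("NN", "race"): 0.00057}

print(viterbi(["to", "race"], tags, start_p, trans_p, emit_p))
```

Because only the best path into each state is kept at every step, the cost grows linearly with sentence length instead of exponentially, which is the point of the algorithm.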
The advantage of the stochastic approach to POS tagging over the rule-based one is that the linguist does not have to write an effective set of rules to produce an effective system. According to [2], the advantage of stochastic techniques over the traditional rule-based design comes from the fact that very little hand-crafted knowledge need be built into the system. The stochastic approach is also generally applicable. Probabilistic or stochastic approaches became popular in the 1980s. Statistical methods have been used in the works of [6] (VOLSUNGA), [12], [5], and [3] (TnT), among others.
2.3. Corpus-Based Rule Method (C)
In this approach, linguistic generalizations (rules) are generated from the corpus, hence the name corpus-based rule. The corpus-based rule method ignores the true complexities of language, relying on the fact that challenging linguistic phenomena can be indirectly recognized through simple epiphenomena [2]. This does not mean that the rule-based approach discussed earlier did not use corpus data to derive some of its grammatical rules; [11] pointed out that some of the linguistic generalizations used in the CGC system were based on empirical observation of corpus data. However, the two approaches are distinguished by their implementations. The corpus-based rule method allows the computer to formulate and evaluate the rules. That is, the rules created are automatic and linguistically readable. The linguist's knowledge of grammar is excluded from this process. An example of this approach is the widely used Brill POS tagger, based on transformation-based error-driven learning.
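The flavour of such automatically learned rules can be illustrated in a few lines. The sketch below is a simplification, not Brill's actual tagger: it starts from each word's most frequent tag (a toy lexicon assumed for the example) and applies one contextual transformation of the form "change tag A to B when the previous tag is Z":

```python
# Toy lexicon of most-frequent tags (the initial-state annotator).
most_frequent = {"to": "TO", "race": "NN", "the": "DT", "will": "MD"}

# A learned transformation: (from_tag, to_tag, required_previous_tag).
# Here: retag NN as VB whenever the previous tag is TO.
rules = [("NN", "VB", "TO")]

def brill_style_tag(words):
    # Initial annotation: most frequent tag, defaulting to NN for unknowns.
    tags = [most_frequent.get(w.lower(), "NN") for w in words]
    # Apply each transformation left to right over the sequence.
    for frm, to, prev in rules:
        for i in range(1, len(tags)):
            if tags[i] == frm and tags[i - 1] == prev:
                tags[i] = to
    return list(zip(words, tags))

print(brill_style_tag(["to", "race"]))   # "race" retagged NN -> VB after TO
print(brill_style_tag(["the", "race"]))  # rule does not fire; "race" stays NN
```

In the real tagger, such transformations are not written by hand: they are proposed and ranked automatically by how many errors they fix against the training corpus, which is what makes the rules both automatic and linguistically readable.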
2.4. Unknown Mode of Operation (D)
The dashed line in Figure 1 above indicates an unknown mode of operation. According to [1], humans are unskilled at estimating the frequencies of words and other linguistic phenomena. Modelling a linguist's knowledge of the corpus data in probabilistic form would therefore create a perverse POS tagging technique, since humans will usually get word frequencies wrong, in contrast to rules, which they are more likely to get right.
3. CONCLUSION
There are different approaches to the problem of assigning a parts-of-speech (POS) tag to each word of a natural language sentence. These approaches fall into two distinct groups: corpus-based and non-corpus-based. Here, we have presented an elaborate review of the existing technologies in the area of POS tagging with a focus on these two groups.
ACKNOWLEDGEMENTS
We thank all who contributed to the success of this paper. Your feedback was very insightful and helpful in the development of this paper.
REFERENCES
[1] Andrew Hardie. The Computational Analysis of Morphosyntactic Categories in Urdu. PhD thesis, University of Lancaster, 2003.
[2] Brill E. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565, 1995.
[3] Brants Thorsten. Tnt: a statistical part-of-speech tagger. In ANLC ’00 Proceedings of the sixth
conference on Applied natural language processing, pages 224–231. Association for Computational
Linguistics Stroudsburg, PA, USA, 2000.
[4] Chungku C., Rabgay J., Faaß G., “Building NLP resources for Dzongkha: a tag set and a tagged corpus”, Proceedings of the Eighth Workshop on Asian Language Resources, p. 103-110, 2010.
[5] Cutting D., Kupiec J., Pedersen J., and Sibun P. A practical part-of-speech tagger. In ANLC '92 Proceedings of the third conference on Applied natural language processing, pages 133–140. Association for Computational Linguistics, Stroudsburg, PA, USA, 1992.
[6] DeRose S. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14:31–39, 1988.
[7] Francis W. N. and Kučera H. Frequency Analysis of English Usage. Houghton Mifflin, Boston, 1982.
[8] Greene B. B. and Rubin G. M. Automatic grammatical tagging of English. Technical report, Department of Linguistics, Brown University, Providence, Rhode Island, 1971.
[9] Jurafsky D. and Martin J. H. Speech and language processing: An introduction to natural language
processing, computational linguistics, and speech recognition. Prentice Hall, 2000.
[10] Karlsson F. Designing a parser for unrestricted text, 1995.
[11] Klein S. and Simmons R. F. A computational approach to grammatical coding of English words.
Journal of the Association for Computing Machinery, 10:334–347, 1963.
[12] Merialdo B. Tagging English text with a probabilistic model. Computational Linguistics, 20:155–171, 1994.
[14] Onyenwe, Ikechukwu Ekene. Developing methods and resources for automated processing of the African language Igbo. PhD thesis, University of Sheffield, 2017.
[15] Zellig S. Harris. String analysis of sentence structure. In Linguistic Society of America, 1962.

(PARI) Viman Nagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune ...ranjana rawat
 

Recently uploaded (20)

VIP Call Girl Gorakhpur Aashi 8250192130 Independent Escort Service Gorakhpur
VIP Call Girl Gorakhpur Aashi 8250192130 Independent Escort Service GorakhpurVIP Call Girl Gorakhpur Aashi 8250192130 Independent Escort Service Gorakhpur
VIP Call Girl Gorakhpur Aashi 8250192130 Independent Escort Service Gorakhpur
 
BOOK Call Girls in (Dwarka) CALL | 8377087607 Delhi Escorts Services
BOOK Call Girls in (Dwarka) CALL | 8377087607 Delhi Escorts ServicesBOOK Call Girls in (Dwarka) CALL | 8377087607 Delhi Escorts Services
BOOK Call Girls in (Dwarka) CALL | 8377087607 Delhi Escorts Services
 
webinaire-green-mirror-episode-2-Smart contracts and virtual purchase agreeme...
webinaire-green-mirror-episode-2-Smart contracts and virtual purchase agreeme...webinaire-green-mirror-episode-2-Smart contracts and virtual purchase agreeme...
webinaire-green-mirror-episode-2-Smart contracts and virtual purchase agreeme...
 
NO1 Verified kala jadu karne wale ka contact number kala jadu karne wale baba...
NO1 Verified kala jadu karne wale ka contact number kala jadu karne wale baba...NO1 Verified kala jadu karne wale ka contact number kala jadu karne wale baba...
NO1 Verified kala jadu karne wale ka contact number kala jadu karne wale baba...
 
Call On 6297143586 Pimpri Chinchwad Call Girls In All Pune 24/7 Provide Call...
Call On 6297143586  Pimpri Chinchwad Call Girls In All Pune 24/7 Provide Call...Call On 6297143586  Pimpri Chinchwad Call Girls In All Pune 24/7 Provide Call...
Call On 6297143586 Pimpri Chinchwad Call Girls In All Pune 24/7 Provide Call...
 
Proposed Amendments to Chapter 15, Article X: Wetland Conservation Areas
Proposed Amendments to Chapter 15, Article X: Wetland Conservation AreasProposed Amendments to Chapter 15, Article X: Wetland Conservation Areas
Proposed Amendments to Chapter 15, Article X: Wetland Conservation Areas
 
VVIP Pune Call Girls Koregaon Park (7001035870) Pune Escorts Nearby with Comp...
VVIP Pune Call Girls Koregaon Park (7001035870) Pune Escorts Nearby with Comp...VVIP Pune Call Girls Koregaon Park (7001035870) Pune Escorts Nearby with Comp...
VVIP Pune Call Girls Koregaon Park (7001035870) Pune Escorts Nearby with Comp...
 
Call Girls In Faridabad(Ballabgarh) Book ☎ 8168257667, @4999
Call Girls In Faridabad(Ballabgarh) Book ☎ 8168257667, @4999Call Girls In Faridabad(Ballabgarh) Book ☎ 8168257667, @4999
Call Girls In Faridabad(Ballabgarh) Book ☎ 8168257667, @4999
 
Hot Call Girls |Delhi |Preet Vihar ☎ 9711199171 Book Your One night Stand
Hot Call Girls |Delhi |Preet Vihar ☎ 9711199171 Book Your One night StandHot Call Girls |Delhi |Preet Vihar ☎ 9711199171 Book Your One night Stand
Hot Call Girls |Delhi |Preet Vihar ☎ 9711199171 Book Your One night Stand
 
Booking open Available Pune Call Girls Parvati Darshan 6297143586 Call Hot I...
Booking open Available Pune Call Girls Parvati Darshan  6297143586 Call Hot I...Booking open Available Pune Call Girls Parvati Darshan  6297143586 Call Hot I...
Booking open Available Pune Call Girls Parvati Darshan 6297143586 Call Hot I...
 
young Whatsapp Call Girls in Delhi Cantt🔝 9953056974 🔝 escort service
young Whatsapp Call Girls in Delhi Cantt🔝 9953056974 🔝 escort serviceyoung Whatsapp Call Girls in Delhi Cantt🔝 9953056974 🔝 escort service
young Whatsapp Call Girls in Delhi Cantt🔝 9953056974 🔝 escort service
 
(AISHA) Wagholi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(AISHA) Wagholi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(AISHA) Wagholi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(AISHA) Wagholi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
VIP Kolkata Call Girl Kalighat 👉 8250192130 Available With Room
VIP Kolkata Call Girl Kalighat 👉 8250192130  Available With RoomVIP Kolkata Call Girl Kalighat 👉 8250192130  Available With Room
VIP Kolkata Call Girl Kalighat 👉 8250192130 Available With Room
 
(DIYA) Call Girls Sinhagad Road ( 7001035870 ) HI-Fi Pune Escorts Service
(DIYA) Call Girls Sinhagad Road ( 7001035870 ) HI-Fi Pune Escorts Service(DIYA) Call Girls Sinhagad Road ( 7001035870 ) HI-Fi Pune Escorts Service
(DIYA) Call Girls Sinhagad Road ( 7001035870 ) HI-Fi Pune Escorts Service
 
DENR EPR Law Compliance Updates April 2024
DENR EPR Law Compliance Updates April 2024DENR EPR Law Compliance Updates April 2024
DENR EPR Law Compliance Updates April 2024
 
VIP Call Girls Moti Ganpur ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...
VIP Call Girls Moti Ganpur ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...VIP Call Girls Moti Ganpur ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...
VIP Call Girls Moti Ganpur ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...
 
(ANIKA) Call Girls Wagholi ( 7001035870 ) HI-Fi Pune Escorts Service
(ANIKA) Call Girls Wagholi ( 7001035870 ) HI-Fi Pune Escorts Service(ANIKA) Call Girls Wagholi ( 7001035870 ) HI-Fi Pune Escorts Service
(ANIKA) Call Girls Wagholi ( 7001035870 ) HI-Fi Pune Escorts Service
 
Contact Number Call Girls Service In Goa 9316020077 Goa Call Girls Service
Contact Number Call Girls Service In Goa  9316020077 Goa  Call Girls ServiceContact Number Call Girls Service In Goa  9316020077 Goa  Call Girls Service
Contact Number Call Girls Service In Goa 9316020077 Goa Call Girls Service
 
Booking open Available Pune Call Girls Budhwar Peth 6297143586 Call Hot Indi...
Booking open Available Pune Call Girls Budhwar Peth  6297143586 Call Hot Indi...Booking open Available Pune Call Girls Budhwar Peth  6297143586 Call Hot Indi...
Booking open Available Pune Call Girls Budhwar Peth 6297143586 Call Hot Indi...
 
(PARI) Viman Nagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune ...
(PARI) Viman Nagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune ...(PARI) Viman Nagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune ...
(PARI) Viman Nagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune ...
 

A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIES

ambiguity.
[2] justified the effectiveness of this approach in the following comment: “Corpus-based methods are often able to succeed while ignoring the true complexities of language, banking on the fact that complex linguistic phenomena can often be indirectly observed through simple epiphenomena” [2]. Brill illustrated this statement using “race” in examples 1–3, showing that one can assign a POS tag to words in a sentence by considering the contextual information.
1. He will race/verb the car.
2. He will not race/verb the car.
3. When will the race/noun end?

The POS tags attached to the word “race” were derived by considering the preceding (following) words and their associated tags, based on the observation that a word to the right of a modal, in this case “will”, is almost always a verb. This linguistic generalization is exempted only when there is a determiner to the right of the modal, on the assumption that the determiner “a/the” precedes either a noun or an adjective in English; in this case it is a noun. Therefore, restricting POS tagging to the use of this minimal information, while ignoring the complexities found in most languages, can be highly effective.

2. PARTS-OF-SPEECH TECHNOLOGY

The various POS technologies developed are either corpus-based or non-corpus-based. The above illustrations are examples of the corpus-based method. [1] and [14] identified the types of technologies developed and classified them on the following basis:

1. Source of knowledge: the linguistic generalizations used in the model are derived either from the grammatical knowledge of a linguist or from a text corpus.
2. Expressed as: the derived linguistic generalizations are encoded either as rules or as probabilities (linguistic information represented in numerical parameters).

Figure 1. POS tagging techniques

Logically, there are four possible POS tagging techniques in Figure 1, namely the rule-based method (A), the probabilistic-based method (B), the corpus-based method (C), and the unknown mode of operation (D). The illustration in Figure 1 corresponds closely to the four possible disambiguation methodologies in [1].

2.1. Rule-Based Method (A)

In the rule-based method (A), linguistic information is expressed as rules. This is the earliest POS tagging technique, developed between the 1960s and 1980s. It combines a lexicon with hand-crafted rules.
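To make this concrete, a toy hand-crafted rule in the spirit of the earlier “race” examples might be sketched as follows. The tag labels, word lists, and the helper name are invented for illustration; they are not taken from any actual tagger:

```python
# A toy hand-crafted disambiguation rule: the word to the right of a modal
# is a verb, unless a determiner intervenes, in which case the word after
# the determiner is taken to be a noun (adjectives are ignored for brevity).

MODALS = {"will", "can", "may", "must"}      # illustrative, not exhaustive
DETERMINERS = {"a", "an", "the"}

def tag_after_modal(tokens):
    """Assign a coarse tag to each token using only local context."""
    tags = []
    for i, word in enumerate(tokens):
        w = word.lower()
        prev = tokens[i - 1].lower() if i > 0 else None
        if w in MODALS:
            tags.append("modal")
        elif w in DETERMINERS:
            tags.append("det")
        elif prev in MODALS:
            tags.append("verb")    # right of a modal: a verb
        elif prev in DETERMINERS:
            tags.append("noun")    # right of a determiner: a noun (simplified)
        else:
            tags.append("unknown")
    return tags

print(tag_after_modal("He will race the car".split()))
# ['unknown', 'modal', 'verb', 'det', 'noun']
print(tag_after_modal("When will the race end".split()))
# ['unknown', 'modal', 'det', 'noun', 'unknown']
```

Even this crude sketch tags “race” as a verb after the modal and as a noun after the determiner, which is exactly the minimal contextual information the examples above rely on.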
A representative model is EngCG (English Constraint Grammar). In this approach, linguistic generalizations form the information called constraints. Each rule contains a set of instructions that determine how operations are to be performed, and contextual information that describes where to apply the rule. The operations alter the tags associated with ambiguously tagged terms so that one or more potential tags are eliminated from consideration, thereby reducing ambiguity.

The rule-based method is more than using grammar as traditionally formulated; it involves syntactic analysis. The work of [15], “string analysis of sentence structure”, is an example. It is described by Harris as “a process of syntactic analysis which is intermediate between the traditional constituent analysis and transformational analysis.” Another example is constraint grammar (CG), a method implemented in the work of [10]. Karlsson defined CG as “a language-independent formalism for surface-oriented, morphology-based parsing of unrestricted text.”

Figure 2. EngCG architecture

CG has been implemented for English, Swedish, and Finnish. It is based on the disambiguation of morphological readings and syntactic labels. Tags are assigned directly, from the lexicon through to the syntax. The constraints discard as many of the suggested alternative tags as possible, the outcome being aimed at a fully disambiguated sentence with one tag for each word. Figure 2 shows the architecture of CG.

The CG rule-based approach has been successfully applied to most European languages, such as English and French. The EngCG tagger, using 1,100 constraints, achieved a recall of at least 99.7% with 93–97% of words disambiguated [1]. Also, Greene and Rubin used a rule-based approach in the TAGGIT system [8], which aided in POS tagging the Brown corpus [7]. 77% of the corpus was disambiguated through the use of TAGGIT; the remaining ambiguous words in the corpus were manually corrected over a period of several years.

2.2. Probabilistic-Based Method (B)

This approach enables linguistic generalizations derived from corpus data to be represented in numerical parameters. The frequencies at which different sequences of tags occur are derived from the corpus data.
The derived information is then used to verify “which of the possible tags is the most likely for w (the observed word) in the vicinity X (X could be a sentence under observation), given a sequence of tags (ti...tn)”. The statistical information gathered helps in choosing the right tag for an ambiguously tagged word using any applicable machine learning design.

Illustrating the above statement: after training on a gold corpus, a tagger might learn through statistics that the tag for a subject pronoun is followed by a verb tag 70% of the time, an adverb tag 29% of the time, and a noun tag 1% of the time. If the tagger later encounters a new instance, say, a word following a subject pronoun that was ambiguously tagged as either a noun or a verb, it can use its statistical knowledge to decide that the verb tag is most likely to be correct [1]. In practice, though, this process alone would be incapable of handling long sequences of ambiguous words. Hence the modern probabilistic approach, the Markov model. This model works by estimating the probability of a sequence of tags, given the transition probability of the different tags in the sequence, P(ti|ti−1). The Markov model uses transition probabilities and observation likelihoods P(wi|ti) in calculating the possible tag for ambiguously tagged words; the two probabilities correspond to the prior and the likelihood respectively.

Given an annotated corpus, to calculate the maximum likelihood estimate of P(ti|ti−1), we compute the count of ti given ti−1 and divide the outcome by the count of ti−1, that is:

P(ti|ti−1) = C(ti−1, ti) ÷ C(ti−1) (1)

For example, a determiner is very likely to precede adjectives and nouns in English (e.g. that/DT flight/NN and the/DT yellow/JJ hat/NN). Thus, we would expect P(NN|DT) and P(JJ|DT) to have higher scores compared to P(DT|JJ) and P(DT|NN) using equation 1 [9]. In the same way, we can compute the maximum likelihood estimate of a word w under observation: we count how many times w occurs in the annotated corpus given a tag t, and divide the outcome by the number of t in the corpus, that is:

P(wi|ti) = C(ti, wi) ÷ C(ti) (2)

[9] stated “... in POS tagging, while we recognize the words in the input, we do not recognize the POS tags. Thus, we cannot condition any probabilities on, say, a previous POS tag, because we can not be completely sure exactly which tag applied to the previous word.” Hence the term hidden in Markov model, because the transitions are hidden from view. The Hidden Markov Model (HMM) handles both seen and hidden events: the observed events are the words that serve as input, while the hidden events are the POS tags.
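Equations 1 and 2 amount to simple corpus counting. As a sketch, the maximum likelihood estimates can be computed from a hand-tagged corpus as follows; the three-sentence corpus below is invented for illustration:

```python
from collections import Counter

# A tiny hand-tagged corpus (invented for illustration).
corpus = [
    [("the", "DT"), ("race", "NN"), ("ended", "VBD")],
    [("he", "PRP"), ("will", "MD"), ("race", "VB")],
    [("the", "DT"), ("car", "NN"), ("won", "VBD")],
]

tag_count = Counter()          # C(t)
transition_count = Counter()   # C(t_{i-1}, t_i)
emission_count = Counter()     # C(t, w)

for sentence in corpus:
    prev = None
    for word, tag in sentence:
        tag_count[tag] += 1
        emission_count[(tag, word)] += 1
        if prev is not None:
            transition_count[(prev, tag)] += 1
        prev = tag

def p_transition(prev_tag, tag):
    """Equation 1: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})."""
    return transition_count[(prev_tag, tag)] / tag_count[prev_tag]

def p_emission(word, tag):
    """Equation 2: P(w_i | t_i) = C(t_i, w_i) / C(t_i)."""
    return emission_count[(tag, word)] / tag_count[tag]

print(p_transition("DT", "NN"))   # 2/2 = 1.0: every DT here is followed by NN
print(p_emission("race", "NN"))   # 1/2 = 0.5: one of the two NN tokens is "race"
```

On real data these counts would be smoothed to avoid zero probabilities for unseen transitions and words, but the counting itself is exactly what the two equations describe.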
The Markov model parameters useful in POS tagging are the states, which are the tags; the symbols that they output (their associated words); and the transitions from one state to another (which make up the sequence of tags allocated to a sentence). Tag transition probabilities answer the question “how likely are we to expect a tag given the previous tag?”, while the lexical likelihood finds how likely “the observed word” is given a POS tag. The final estimation is computed by calculating the maximum likelihood for all the probabilities, which can be obtained from the corpus counts.

The most likely tag in a Markov model can be found either through direct computation or by selecting the most appropriate path without calculating the probability of each tag. Using the example in [9], the former works by focusing on the observed word and the tag sequence. For instance, the word “race” can be a noun or a verb in the following:

1. Secretariat/NNP is/BEZ expected/VBN to/TO race/VB tomorrow/NR.
2. People/NNS continue/VB to/TO inquire/VB the/AT reason/NN for/IN the/AT race/NN for/IN outer/JJ space/NN.
A graph view of 1 and 2 above is given in Figure 3.

Figure 3. Graph showing three probabilities (arcs 1, 2 and 3) where sentences 1 and 2 above differ (Source: [9])

Arcs 1, 2 and 3 in Figure 3 are the probabilities that the HMM will use to resolve ambiguously POS-tagged words, as in the case above (race/NN, race/VB). Arcs 1 and 2 are the transition probability P(ti|ti−1) and the lexical likelihood P(wi|ti), while arc 3 is the probability of the tag sequence, here involving NR. For the transition probabilities P(VB|TO) and P(NN|TO), assuming that verbs are about 500 times as likely as nouns to occur after TO, equation 1 gives P(VB|TO) = 0.83 and P(NN|TO) = 0.00047. The lexical likelihood of “race” given a tag (i.e. the probability of tags VB and NN producing the word race), using equation 2, is P(race|NN) = 0.00057 and P(race|VB) = 0.00012. Finally, the probability of the tag sequence (arc 3 in Figure 3) is P(NR|VB) = 0.0027 and P(NR|NN) = 0.0012. To get the final probability, multiply all the probabilities:

P(VB|TO) × P(NR|VB) × P(race|VB) = 0.00000027
P(NN|TO) × P(NR|NN) × P(race|NN) = 0.00000000032

With these statistics, the HMM tagger will correctly tag race as a VB.

The Markov model has been widely used under the assumption that a token depends probabilistically on its parts-of-speech category, which in turn depends solely on the categories of the preceding and following one or two words [5]. Markov models are also called n-gram taggers because of the transition probabilities on which the taggers are based, which may span a window of n words. There are unigram taggers where n = 1, bigram taggers where n = 2, and trigram taggers where n = 3. These taggers undergo a training process to learn the statistical behaviour of the tags associated with words in the corpus data through an examination of the training data (a manually tagged corpus).
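Multiplying out the probabilities quoted above reproduces the comparison numerically:

```python
# Probabilities for "to race tomorrow" as quoted in the text (from [9]).
p_vb = 0.83 * 0.00012 * 0.0027      # P(VB|TO) * P(race|VB) * P(NR|VB)
p_nn = 0.00047 * 0.00057 * 0.0012   # P(NN|TO) * P(race|NN) * P(NR|NN)

print(f"{p_vb:.2e}")   # about 2.7e-07
print(f"{p_nn:.2e}")   # about 3.2e-10
assert p_vb > p_nn     # so the HMM tagger prefers race/VB
```

The verb reading wins by roughly three orders of magnitude, which is why the tiny lexical preference for race/NN over race/VB is overwhelmed by the transition probability after TO.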
This statistically learned behaviour of the tagger is used in Markov-model disambiguation techniques. The Viterbi algorithm allows an appropriate tag to be selected by choosing the most probable path without calculating the probability of every path through the ambiguously tagged sequence of words. It is the most widely used decoding algorithm for HMMs, and is highly relevant in speech and language processing. According to [12], there are three reasons for choosing Viterbi over maximum likelihood. Firstly, it is easier to implement. Secondly, it cannot produce in its output sequences of tags which are not possible in the model, which is theoretically possible in maximum likelihood tagging. Thirdly, it gives the best analysis of the sentence as a whole.
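The path-selection idea behind Viterbi can be sketched as a short dynamic program over the two probability tables from equations 1 and 2. This is an illustrative sketch, not the implementation of any tagger discussed here; the toy tables reuse the probabilities quoted earlier, and smoothing and sentence-boundary handling are omitted:

```python
def viterbi(words, tags, p_trans, p_emit, p_start):
    """Return the most probable tag sequence for `words`.

    p_trans[(t_prev, t)] -- transition probability P(t | t_prev), equation 1
    p_emit[(t, w)]       -- lexical likelihood P(w | t), equation 2
    p_start[t]           -- probability of tag t starting the sentence
    """
    # best[i][t] = (probability of the best path ending in t at word i, backpointer)
    best = [{t: (p_start.get(t, 0.0) * p_emit.get((t, words[0]), 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prob, prev = max(
                (best[i - 1][tp][0]
                 * p_trans.get((tp, t), 0.0)
                 * p_emit.get((t, words[i]), 0.0), tp)
                for tp in tags)
            column[t] = (prob, prev)
        best.append(column)
    # Trace the best final tag back to the start of the sentence.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

tags = ["TO", "VB", "NN"]
p_start = {"TO": 1.0}
p_trans = {("TO", "VB"): 0.83, ("TO", "NN"): 0.00047}
p_emit = {("TO", "to"): 1.0, ("VB", "race"): 0.00012, ("NN", "race"): 0.00057}

print(viterbi(["to", "race"], tags, p_trans, p_emit, p_start))
# ['TO', 'VB']
```

Because only the best path into each state is kept at each position, the cost is linear in sentence length rather than exponential, and the output can never contain a transition with zero probability in the model.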
The advantage of the stochastic approach to POS tagging over the rule-based approach is that the linguist does not have to write an effective set of rules to produce an effective system. According to [2], the advantage of the stochastic techniques over the traditional rule-based design comes from the fact that very little hand-crafted knowledge need be built into the system. Also, the stochastic approach is generally applicable. Probabilistic or stochastic methods became popular in the 1980s. Statistical methods have been used in the works of [6] (VOLSUNGA), [12], [5], [3] (TnT), etc.

2.3. Corpus-Based Rule Method (C)

In this approach, linguistic generalizations (rules) are generated from the corpus, hence the name corpus-based rule. The corpus-based rule method ignores the true complexities of language, relying on the fact that challenging linguistic phenomena can be indirectly recognized through simple epiphenomena [2]. This does not mean that the rule-based approach discussed earlier made no use of corpus data to derive grammatical rules; [11] pointed out that some of the linguistic generalizations used in the CGC system were based on empirical observation of corpus data. However, the two approaches differ in their implementations. The corpus-based rule method allows the computer to formulate and evaluate the rules; that is, the rules created are automatic and linguistically readable, and the linguist's knowledge of grammar is excluded from the process. An example of this approach is the widely used Brill POS tagger, based on transformation-based error-driven learning.

2.4. Unknown Mode of Operation (D)

The dashed line in Figure 1 above indicates the unknown mode of operation. According to [1], humans are unskilled at estimating the frequencies of words and other linguistic phenomena. Therefore, modelling a linguist's knowledge of the corpus data in probabilistic form would create a perverse POS tagging technique.
Humans will usually get word frequencies wrong, whereas rules derived from their grammatical knowledge are more likely to be right.

3. CONCLUSION

There are different approaches to the problem of assigning a parts-of-speech (POS) tag to each word of a natural language sentence. These approaches fall into two distinctive groups: corpus-based and non-corpus-based. Here, we have presented an elaborate review of the existing technologies in the area of POS tagging with a focus on these distinctive groups.

ACKNOWLEDGEMENTS

We thank all who contributed to the success of this paper. Your feedback was very insightful and helpful in the development of this paper.

REFERENCES

[1] Andrew Hardie. The Computational Analysis of Morphosyntactic Categories in Urdu. PhD thesis, University of Lancaster, 2003.
[2] Brill E. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565, 1995.
[3] Brants Thorsten. TnT: a statistical part-of-speech tagger. In ANLC '00: Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 224–231. Association for Computational Linguistics, Stroudsburg, PA, USA, 2000.
[4] Chungku C., Rabgay J., and Faaß G. Building NLP resources for Dzongkha: a tag set and a tagged corpus. In Proceedings of the Eighth Workshop on Asian Language Resources, pages 103–110, 2010.
[5] Cutting D., Kupiec J., Pedersen J., and Sibun P. A practical part-of-speech tagger. In ANLC '92: Proceedings of the Third Conference on Applied Natural Language Processing, pages 133–140. Association for Computational Linguistics, Stroudsburg, PA, USA, 1992.
[6] DeRose S. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14(1):31–39, 1988.
[7] Francis W. N. and Kučera H. Frequency Analysis of English Usage. Houghton Mifflin, Boston, 1982.
[8] Greene B. B. and Rubin G. M. Automatic grammatical tagging of English. Technical report, Department of Linguistics, Brown University, Providence, Rhode Island, 1971.
[9] Jurafsky D. and Martin J. H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2000.
[10] Karlsson F. Designing a parser for unrestricted text, 1995.
[11] Klein S. and Simmons R. F. A computational approach to grammatical coding of English words. Journal of the Association for Computing Machinery, 10:334–347, 1963.
[12] Merialdo B. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171, 1994.
[14] Onyenwe, Ikechukwu Ekene. Developing Methods and Resources for Automated Processing of the African Language Igbo. PhD thesis, University of Sheffield, 2017.
[15] Zellig S. Harris. String analysis of sentence structure. Linguistic Society of America, 1962.