Umm-e-Rooman Yaqoob
Corpus Analysis
Corpus linguistics
Corpus linguistics is the study of language as expressed in corpora (samples) of "real world"
text. The text-corpus method derives a set of abstract rules from texts that govern a natural
language, and shows how that language relates to other languages; originally compiled
manually, corpora are now usually derived automatically from the source texts. Corpus
linguistics proposes that reliable language analysis is more feasible with corpora collected in
the field, in their natural contexts, and with minimal experimental interference.
Corpus
John Sinclair defined it as follows: "A corpus is a collection of naturally occurring text, chosen
to characterize a state or variety of a language."
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts
(nowadays usually electronically stored and processed). They are used to do statistical analysis
and hypothesis testing, checking occurrences or validating linguistic rules within a specific
language territory. A corpus may contain texts in a single language (monolingual corpus) or text
data in multiple languages (multilingual corpus).
Corpora are the main knowledge base in corpus linguistics. Corpora can be considered as a
type of foreign language writing aid as the contextualized grammatical knowledge acquired by
non-native language users through exposure to authentic texts in corpora allows learners to
grasp the manner of sentence formation in the target language, enabling effective writing.
The great advantage of the corpus-linguistic method is that language researchers do not have
to rely on their own or other native speakers’ intuition or even on made-up examples. Rather,
they can draw on a large amount of authentic, naturally occurring language data produced by a
variety of speakers or writers in order to confirm or refute their own hypotheses about specific
language features on the basis of an empirical foundation.
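As a small illustration of this empirical approach, word frequencies across a corpus can be computed directly; the toy corpus below is invented purely for the example:

```python
from collections import Counter
import re

def word_frequencies(text):
    """Tokenize a text into lowercase word tokens and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

corpus = (
    "The cat sat on the mat. "
    "The dog sat on the log."
)
freqs = word_frequencies(corpus)
print(freqs.most_common(3))  # 'the' is the most frequent token
```

Counts like these, computed over millions of words of authentic text rather than a toy sample, are the empirical foundation the passage above describes.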
Corpus analysis and linguistic theory
When the first computer corpus, the Brown Corpus, was being created in the early 1960s,
generative grammar dominated linguistics, and there was little tolerance for approaches to
linguistic study that did not adhere to what generative grammarians deemed acceptable
linguistic practice. As a consequence, even though the creators of the Brown Corpus, W.
Nelson Francis and Henry Kučera, are now regarded as pioneers and visionaries in the corpus
linguistics community, in the 1960s their efforts to create a machine-readable corpus of English
were not warmly received by many members of the linguistic community.
Linguistic theory and description
Chomsky has stated in a number of sources that there are three levels of “adequacy” upon
which grammatical descriptions and linguistic theories can be evaluated: observational
adequacy, descriptive adequacy, and explanatory adequacy. If a theory or description achieves
observational adequacy, it is able to describe which sentences in a language are grammatically
well formed. A theory or description achieves descriptive adequacy when it additionally
accounts for native speakers' intuitions about the structure of those sentences.
The highest level of adequacy is explanatory adequacy, which is achieved when the description
or theory not only reaches descriptive adequacy but does so using abstract principles which can
be applied beyond the language being considered and become a part of "Universal Grammar".
For example, within Chomsky's theory of principles and parameters, pro-drop (the omission of
an overt subject) is a consequence of the "null-subject parameter".
Because generative grammar has placed so much emphasis on universal grammar, explanatory
adequacy has always been a high priority in generative grammar, often at the expense of
descriptive adequacy.
Unlike generative grammarians, corpus linguists see complexity and variation as inherent in
language, and in their discussions of language, they place a very high priority on descriptive
adequacy, not explanatory adequacy. Consequently, corpus linguists are very skeptical of the
highly abstract and decontextualized discussions of language promoted by generative
grammarians, largely because such discussions are too far removed from actual language
usage.
Preparation and Analysis of Linguistic Corpora
The corpus is a fundamental tool for any type of research on language. The availability of
computers in the 1950s soon led to the creation of corpora in electronic form that could
be searched automatically for a variety of language features and used to compute frequencies,
distributional characteristics, and other descriptive statistics. Corpora of literary works were
compiled to enable stylistic analyses and authorship studies, and corpora representing general
language use became widely used in the field of lexicography. Two notable early general-purpose
corpora are the Brown Corpus of American English (Francis and Kučera, 1967) and the
London/Oslo/Bergen (LOB) corpus of British English (Johansson et al., 1978); both of these
corpora, each containing one million words of data tagged for part of speech, were compiled in
the 1960s using a representative sample of texts produced in the year 1961. In the 1980s, the
speed and capacity
of computers increased dramatically, and, with more and more texts being produced in
computerized form, it became possible to create corpora much larger than the Brown and LOB,
containing millions of words. Parallel corpora, which contain the same text in two or more
languages, also began to appear; the best known of these is the Canadian Hansard corpus of
Parliamentary debates in English and French.
The “golden era” of linguistic corpora began in 1990 and continues to this day. Enormous
corpora of both text and speech have been and continue to be compiled, many by government-
funded projects in Europe, the U.S., and Japan. In addition to mono-lingual corpora, several
multi-lingual parallel corpora covering multiple languages have also been created.
Methods
Corpus linguistics has generated a number of research methods, attempting to trace a path
from data to theory. Wallis and Nelson (2001) first introduced what they called the 3A
perspective: Annotation, Abstraction and Analysis.
a. Annotation consists of the application of a scheme to texts. Annotations may include
structural markup, part-of-speech tagging, parsing, and numerous other representations.
b. Abstraction consists of the translation (mapping) of terms in the scheme to terms in a
theoretically motivated model or dataset. Abstraction typically includes linguist-directed
search but may include e.g., rule-learning for parsers.
c. Analysis consists of statistically probing, manipulating and generalising from the dataset.
Analysis might include statistical evaluations, optimisation of rule-bases or knowledge
discovery methods.
Most lexical corpora today are part-of-speech-tagged (POS-tagged). However, even corpus
linguists who work with unannotated plain text inevitably apply some method to isolate salient
terms. In such situations, annotation and abstraction are combined in a lexical search.
The advantage of publishing an annotated corpus is that other users can then perform
experiments on the corpus (through corpus managers). Linguists with other interests and
differing perspectives than the originators' can exploit this work. By sharing data, corpus
linguists are able to treat the corpus as a locus of linguistic debate, rather than as an exhaustive
fount of knowledge.
Corpus annotation
For computational linguistics research, which has driven the bulk of corpus creation efforts over
the past decade, corpora are typically annotated with various kinds of linguistic information. The
following sections outline the major annotation types.
o Morpho-syntactic annotation
By far the most common corpus annotation is morpho-syntactic annotation (part-of-speech
tagging), primarily because several highly accurate automatic taggers have been developed
over the past 15 years. Part of speech tagging is a disambiguation task: for words that have
more than one possible part of speech, it is necessary to determine which one, given the
context, is correct.
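A toy rule-based tagger can illustrate this disambiguation task; the lexicon and the context rules below are invented for the example, whereas real taggers learn such preferences from annotated corpora:

```python
# A toy POS tagger illustrating disambiguation in context.
LEXICON = {
    "the": ["DET"],
    "i": ["PRON"],
    "record": ["NOUN", "VERB"],   # ambiguous word
    "plays": ["VERB", "NOUN"],    # ambiguous word
    "song": ["NOUN"],
}

def tag(tokens):
    """Assign one POS tag per token, using the previous tag
    to resolve ambiguous lexicon entries."""
    tagged, prev = [], None
    for tok in tokens:
        options = LEXICON.get(tok.lower(), ["NOUN"])
        if len(options) == 1:
            choice = options[0]
        elif prev == "DET":
            # after a determiner, a noun reading is far more likely
            choice = "NOUN" if "NOUN" in options else options[0]
        elif prev in ("PRON", "NOUN"):
            # after a subject, prefer a verb reading
            choice = "VERB" if "VERB" in options else options[0]
        else:
            choice = options[0]
        tagged.append((tok, choice))
        prev = choice
    return tagged

print(tag("I record the song".split()))   # "record" tagged VERB
print(tag("The record plays".split()))    # "record" tagged NOUN
```

The same word receives different tags depending on its context, which is exactly the decision an automatic tagger must make for every ambiguous token.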
o Syntactic annotation
There are two main types of syntactic annotation in linguistic corpora: noun phrase (NP)
bracketing or “chunking”, and the creation of “treebanks” that include fuller syntactic analysis.
Syntactically annotated corpora serve various statistics-based applications, most notably, by
providing probabilities to drive syntactic parsers, and have been also used to derive context-free
and unification-based grammars (Charniak, 1996; van Genabith et al., 1999). Syntactically
annotated corpora also provide theoretical linguists with data to support studies of language use.
o Semantic annotation
Semantic annotation can be taken to mean any kind of annotation that adds information about
the meaning of elements in a text. At present, the most common type of semantic annotation is
"sense tagging": the association of lexical items in a text with a particular sense or definition,
usually drawn from an existing sense inventory provided in a dictionary or on-line lexicon such
as WordNet (Miller, et al., 1990).
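A minimal sketch of sense tagging, using a Lesk-style overlap between a sense's gloss and the surrounding context; the two-sense inventory for "bank" is invented here, whereas a real system would draw its senses from a lexicon such as WordNet:

```python
# Invented two-sense inventory; real sense tagging uses a
# dictionary or on-line lexicon such as WordNet.
SENSES = {
    "bank": {
        "bank#1": "sloping land beside a river or lake",
        "bank#2": "financial institution that accepts deposits",
    }
}

def sense_tag(word, context_words):
    """Pick the sense whose gloss shares the most words
    with the surrounding context (a Lesk-style heuristic)."""
    best, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(set(gloss.split()) & set(context_words))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

print(sense_tag("bank", "we sat on the bank of the river".split()))
# picks bank#1, because "river" appears in that sense's gloss
```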
o Discourse-level annotation
There are three main types of annotation at the level of discourse: topic identification,
coreference annotation, and discourse structure. Topic identification (also called “topic
detection”) annotates texts with information about the events or activities described in the text.
Co-reference annotation links referring objects (e.g., pronouns, definite noun phrases) to prior
elements in a discourse to which they refer. This type of annotation is invariably performed
manually, since reliable software to identify co-referents is not available. Discourse structure
annotation identifies multi-level hierarchies of discourse segments and the relations between
them, building upon low-level analysis of clauses, phrases, or sentences. To date, annotation of
discourse structure is almost always accomplished by hand, although some software to perform
discourse segmentation has been developed (e.g., Marcu, 1996).
Abstraction
Abstraction consists of the definition of an experiment (or series of experiments), and the
construction of an abstract model or sample which may then be analysed.
The simplest type of model is just a table of totals called a contingency table. To construct a
contingency table, one must carry out a set of queries over the corpus, where these queries are
organised into dependent (predicted) and independent (predictive) variables. The scope of the
model is defined by two aspects:
a) the (sub-) corpus and how it is sampled, and
b) what precisely is under investigation - the case definition.
A contingency table is not the only, or necessarily the optimum, type of abstract model. If we
step back a bit, a contingency table is perhaps best thought of as one kind of summary of the
experimental dataset.
An experimental dataset consists of a set of cases rather than simply a total number of each
type of case. This notion is implicit in our discussion of experimentation on our website.
Contingency tables are very simple and effective for small two-variable experiments. However, if
one wants to consider more than one hypothesis at a time, or look for other predictive variables,
it is more useful to explicitly abstract (define) and then explore this dataset.
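The construction of a contingency table from a set of cases can be sketched as follows; the toy corpus, the genre variable, and the choice of "shall" as the dependent variable are all invented for illustration:

```python
# Each case is (genre, sentence). The independent (predictive)
# variable is the genre; the dependent (predicted) variable is
# whether the sentence contains the word "shall".
cases = [
    ("legal",   "The tenant shall pay rent monthly"),
    ("legal",   "The parties shall sign the agreement"),
    ("legal",   "This clause is binding"),
    ("fiction", "She walked to the shore"),
    ("fiction", "We shall see about that"),
    ("fiction", "The rain kept falling"),
]

table = {}  # (genre, has_shall) -> count
for genre, sentence in cases:
    has_shall = "shall" in sentence.lower().split()
    key = (genre, has_shall)
    table[key] = table.get(key, 0) + 1

for key in sorted(table):
    print(key, table[key])
```

The resulting table of totals is the simple abstract model described above, ready for a statistical test of whether genre predicts the use of "shall".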
Analysis
Corpus analysis rests on three main techniques: concordances, collocation, and key word in
context (KWIC).
o Concordance
A concordance is an alphabetical list of the principal words used in a book or body of work,
listing every instance of each word with its immediate context. Only works of special importance
have had concordances prepared for them, such as the Vedas, the Bible, or the works
of Shakespeare or classical Latin and Greek authors, because of the time, difficulty, and
expense involved in creating a concordance in the pre-computer era.
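With a computer, producing a concordance becomes trivial. A minimal sketch (the example text is, of course, only illustrative):

```python
def concordance(text, keyword, width=3):
    """Return every occurrence of keyword with `width` words of
    context on each side (a minimal plain-text concordance)."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

text = "To be or not to be, that is the question."
for line in concordance(text, "be"):
    print(line)
```

Each output line shows one instance of the keyword in its immediate context, which is exactly the format described above.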
A bilingual concordance is a concordance based on aligned parallel text.
A topical concordance is a list of subjects that a book (usually the Bible) covers, with the
immediate context of the coverage of those subjects. Unlike a traditional concordance, the
indexed word does not have to appear in the verse.
Use of concordances in linguistics
Concordances are frequently used in linguistics, when studying a text. For example:
a. comparing different usages of the same word
b. analyzing keywords
c. analyzing word frequencies
d. finding and analyzing phrases and idioms
e. finding translations of substantial elements, e.g. terminology, in translation memories
f. creating indexes and word lists (also useful for publishing)
Concordancing techniques are widely used in national text corpora such as the American
National Corpus, the British National Corpus, and the Corpus of Contemporary American
English, all available online. Stand-alone applications that employ concordancing techniques
are known as concordancers or, when more advanced, corpus managers. Some of them have
integrated part-of-speech taggers and enable users to create their own POS-annotated corpora
to conduct the various types of searches adopted in corpus linguistics.
o Collocation
In corpus linguistics, a collocation is a sequence of words or terms that co-occur more often
than would be expected by chance. In phraseology, collocation is a sub-type of phraseme. An
example of a phraseological collocation, as propounded by Michael Halliday, is the
expression strong tea. While the same meaning could be conveyed by the roughly
equivalent powerful tea, this expression is considered excessive and awkward by English
speakers. Conversely, the corresponding expression in technology, powerful computer, is
preferred over strong computer. Phraseological collocations should not be confused with idioms:
an idiom's meaning derives from its conventional use as a stand-in for something else, whereas
a collocation is simply a habitual pairing of words.
Types of Collocation
There are six main types of collocations:
a. adjective+noun
b. noun+noun
c. verb+noun
d. adverb+adjective
e. verb+prepositional phrase
f. verb+adverb
Collocation extraction is a computational technique that finds collocations in a document or
corpus, using methods from computational linguistics that resemble data mining.
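One common scoring method in collocation extraction is pointwise mutual information (PMI), which compares a pair's observed co-occurrence frequency with what the individual word frequencies would predict by chance. A minimal sketch, over an invented toy token sequence:

```python
import math
from collections import Counter

def pmi_collocations(tokens):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    Pairs co-occurring more often than chance get positive scores."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n, n_bi = len(tokens), len(tokens) - 1
    scores = {}
    for (x, y), c in bigrams.items():
        p_xy = c / n_bi
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = ("strong tea please strong tea again "
          "strong coffee please more tea").split()
scores = pmi_collocations(tokens)
# ("strong", "tea") gets a positive score: the pair co-occurs more
# often than the words' individual frequencies would predict
```

Real collocation extraction runs such measures over millions of words and typically combines PMI with frequency thresholds, since raw PMI tends to overrate rare pairs.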
o Key Word In Context
KWIC is an acronym for Key Word In Context, the most common format for concordance lines.
The term KWIC was coined by Hans Peter Luhn. A KWIC index is formed by sorting and
aligning the words within an article title to allow each word (except the stop words) in titles to be
searchable alphabetically in the index. It was a useful indexing method for technical manuals
before computerized full text search became common.
Parts in KWIC
There are three parts in KWIC Index:
1. Keywords: subject-denoting words which serve as approach terms.
2. Context: the rest of the words in the title, which specify the particular context of the
document.
3. Identification or location code: an address code used to provide the full bibliographic
description of the document.
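The three parts above can be sketched as a small KWIC index builder; the stop-word list and the example titles are invented for illustration:

```python
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "for"}

def kwic_index(titles):
    """Build a KWIC index: every non-stop-word in every title
    becomes an alphabetically sorted entry of
    (keyword, full title as context, document id as location)."""
    entries = []
    for doc_id, title in enumerate(titles, start=1):
        for word in title.split():
            key = word.lower().strip(".,")
            if key not in STOP_WORDS:
                entries.append((key, title, doc_id))
    return sorted(entries)

titles = [
    "The Analysis of Linguistic Corpora",
    "Corpora in Language Teaching",
]
for entry in kwic_index(titles):
    print(entry)
```

Each entry pairs a keyword with its context (the full title) and a location code (here simply the document number), mirroring the three parts listed above.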
Conclusion
Corpus linguistics is now seen as the study of linguistic phenomena through large collections of
machine-readable texts: corpora. These are used in a number of research areas, ranging from
the descriptive study of the syntax of a language to prosody and language learning, to mention
but a few.
The use of real examples of texts in the study of language is not a new issue in the history of
linguistics. However, Corpus Linguistics has developed considerably in the last decades due to
the great possibilities offered by the processing of natural language with computers. The
availability of computers and machine-readable text has made it possible to get data quickly and
easily and also to have this data presented in a format suitable for analysis.
Corpus linguistics is, however, not merely a matter of obtaining language data through the use
of computers. Corpus linguistics is the study and analysis of data obtained from a corpus. The
main task of the corpus linguist is not to find the data but to analyse it. Computers are useful,
and sometimes indispensable, tools in this process.