Umm-e-Rooman Yaqoob
Corpus Analysis
Corpus linguistics
Corpus linguistics is the study of language as expressed in corpora (samples) of "real world"
text. The text-corpus method is an inductive approach: from a body of text it derives a set of
abstract rules that govern a natural language and describe how that language relates to others.
Originally compiled manually, corpora are now usually derived automatically from source texts.
Corpus linguistics proposes that reliable language analysis is more feasible with corpora
collected in the field, in their natural contexts, and with minimal experimental interference.
Corpus
John Sinclair defined it thus: 'a corpus is a collection of naturally occurring text, chosen to
characterize a state or variety of a language'.
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts
(nowadays usually electronically stored and processed). They are used to do statistical analysis
and hypothesis testing, checking occurrences or validating linguistic rules within a specific
language territory. A corpus may contain texts in a single language (monolingual corpus) or text
data in multiple languages (multilingual corpus).
Corpora are the main knowledge base in corpus linguistics. Corpora can also serve as a type
of foreign-language writing aid: the contextualized grammatical knowledge that non-native
users acquire through exposure to authentic texts in corpora allows learners to grasp the
manner of sentence formation in the target language, enabling effective writing.
The great advantage of the corpus-linguistic method is that language researchers do not have
to rely on their own or other native speakers’ intuition or even on made-up examples. Rather,
they can draw on a large amount of authentic, naturally occurring language data produced by a
variety of speakers or writers in order to confirm or refute their own hypotheses about specific
language features on the basis of an empirical foundation.
Corpus analysis and linguistic theory
When the first computer corpus, the Brown Corpus, was being created in the early 1960s,
generative grammar dominated linguistics, and there was little tolerance for approaches to
linguistic study that did not adhere to what generative grammarians deemed acceptable
linguistic practice. As a consequence, even though the creators of the Brown Corpus, W.
Nelson Francis and Henry Kučera, are now regarded as pioneers and visionaries in the corpus
linguistics community, in the 1960s their efforts to create a machine-readable corpus of English
were not warmly accepted by many members of the linguistic community.
Linguistic theory and description
Chomsky has stated in a number of sources that there are three levels of “adequacy” upon
which grammatical descriptions and linguistic theories can be evaluated: observational
adequacy, descriptive adequacy, and explanatory adequacy. If a theory or description achieves
observational adequacy, it is able to state which sentences in a language are grammatically
well formed. A descriptively adequate theory goes further, also accounting for native speakers'
intuitions about the structure of those sentences.
The highest level of adequacy is explanatory adequacy, which is achieved when the description
or theory not only reaches descriptive adequacy but does so using abstract principles that can
be applied beyond the language being considered and become part of "Universal Grammar".
For example, within Chomsky's theory of principles and parameters, pro-drop (the omission of
subject pronouns, as in Spanish or Italian) is explained as a consequence of a single
"null-subject parameter".
Because generative grammar has placed so much emphasis on universal grammar, explanatory
adequacy has always been a high priority in generative grammar, often at the expense of
descriptive adequacy.
Unlike generative grammarians, corpus linguists see complexity and variation as inherent in
language, and in their discussions of language, they place a very high priority on descriptive
adequacy, not explanatory adequacy. Consequently, corpus linguists are very skeptical of the
highly abstract and decontextualized discussions of language promoted by generative
grammarians, largely because such discussions are too far removed from actual language
usage.
Preparation and Analysis of Linguistic Corpora
The corpus is a fundamental tool for any type of research on language. The availability of
computers in the 1950s immediately led to the creation of corpora in electronic form that could
be searched automatically for a variety of language features and used to compute frequencies,
distributional characteristics, and other descriptive statistics. Corpora of literary works were
compiled to enable stylistic analyses and authorship studies, and corpora representing general
language use became widely used in the field of lexicography. Two notable early corpora are
the Brown Corpus of American English (Francis and Kucera, 1967) and the London/Oslo/Bergen
(LOB) corpus of British English (Johanssen et al., 1978); both contain one million words of data
tagged for part of speech, drawn from a representative sample of texts produced in the year
1961. In the 1980s, the speed and capacity
of computers increased dramatically, and, with more and more texts being produced in
computerized form, it became possible to create corpora much larger than the Brown and LOB,
containing millions of words. Parallel corpora, which contain the same text in two or more
languages, also began to appear; the best known of these is the Canadian Hansard corpus of
Parliamentary debates in English and French.
The “golden era” of linguistic corpora began in 1990 and continues to this day. Enormous
corpora of both text and speech have been and continue to be compiled, many by government-
funded projects in Europe, the U.S., and Japan. In addition to mono-lingual corpora, several
multi-lingual parallel corpora covering multiple languages have also been created.
Methods
Corpus linguistics has generated a number of research methods, attempting to trace a path
from data to theory. Wallis and Nelson (2001) first introduced what they called the 3A
perspective: Annotation, Abstraction and Analysis.
a. Annotation consists of the application of a scheme to texts. Annotations may include
structural markup, part-of-speech tagging, parsing, and numerous other representations.
b. Abstraction consists of the translation (mapping) of terms in the scheme to terms in a
theoretically motivated model or dataset. Abstraction typically includes linguist-directed
search but may include e.g., rule-learning for parsers.
c. Analysis consists of statistically probing, manipulating and generalising from the dataset.
Analysis might include statistical evaluations, optimisation of rule-bases or knowledge
discovery methods.
Most lexical corpora today are part-of-speech-tagged (POS-tagged). However, even corpus
linguists who work with 'unannotated plain text' inevitably apply some method to isolate salient
terms. In such situations annotation and abstraction are combined in a lexical search.
The advantage of publishing an annotated corpus is that other users can then perform
experiments on the corpus (through corpus managers). Linguists with other interests and
differing perspectives than the originators' can exploit this work. By sharing data, corpus
linguists are able to treat the corpus as a locus of linguistic debate, rather than as an exhaustive
fount of knowledge.
 Corpus annotation
For computational linguistics research, which has driven the bulk of corpus creation efforts over
the past decade, corpora are typically annotated with various kinds of linguistic information. The
following sections outline the major annotation types.
o Morpho-syntactic annotation
By far the most common corpus annotation is morpho-syntactic annotation (part-of-speech
tagging), primarily because several highly accurate automatic taggers have been developed
over the past 15 years. Part-of-speech tagging is a disambiguation task: for words that have
more than one possible part of speech, it is necessary to determine which one, given the
context, is correct.
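This disambiguation task can be made concrete with a most-frequent-tag baseline, a common starting point before context is taken into account. The tiny tagged sample and tag labels below are invented for illustration, not drawn from any real corpus:

```python
from collections import Counter, defaultdict

# Toy hand-tagged training sample: (word, part-of-speech) pairs.
# "record" is ambiguous between NOUN and VERB.
tagged_sample = [
    ("the", "DET"), ("record", "NOUN"), ("shows", "VERB"),
    ("they", "PRON"), ("record", "VERB"), ("songs", "NOUN"),
    ("a", "DET"), ("record", "NOUN"), ("deal", "NOUN"),
]

# Count how often each word carries each tag.
tag_counts = defaultdict(Counter)
for word, tag in tagged_sample:
    tag_counts[word][tag] += 1

def most_frequent_tag(word):
    """Baseline disambiguation: choose the tag seen most often for this word."""
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else "NOUN"  # crude back-off guess

print(most_frequent_tag("record"))  # "record" was tagged NOUN twice, VERB once -> NOUN
```

Real taggers improve on this baseline by also conditioning on the surrounding context (for example, the tags of neighbouring words), which is what makes them so much more accurate.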
o Syntactic annotation
There are two main types of syntactic annotation in linguistic corpora: noun phrase (NP)
bracketing or “chunking”, and the creation of “treebanks” that include fuller syntactic analysis.
Syntactically annotated corpora serve various statistics-based applications, most notably, by
providing probabilities to drive syntactic parsers, and have been also used to derive context-free
and unification-based grammars (Charniak, 1996; van Genabith et al., 1999). Syntactically
annotated corpora also provide theoretical linguists with data to support studies of language use.
o Semantic annotation
Semantic annotation can be taken to mean any kind of annotation that adds information about
the meaning of elements in a text. At present, the most common type of semantic annotation is
"sense tagging": the association of lexical items in a text with a particular sense or definition,
usually drawn from an existing sense inventory provided in a dictionary or on-line lexicon such
as WordNet (Miller, et al., 1990).
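A minimal sketch of sense tagging is given below, using an invented two-sense inventory in place of a real lexicon such as WordNet, and the simplified Lesk heuristic of choosing the sense whose definition shares the most words with the surrounding context:

```python
# Hypothetical two-sense inventory for "bank"; a real system would draw
# senses from a dictionary or an on-line lexicon such as WordNet.
SENSES = {
    "bank": {
        "bank.money": "financial institution that accepts deposits and lends money",
        "bank.river": "sloping land beside a body of water such as a river",
    }
}

def sense_tag(word, context):
    """Simplified Lesk heuristic: pick the sense whose definition shares
    the most words with the sentence the target word appears in."""
    context_words = set(context.lower().split())
    best = max(
        SENSES[word].items(),
        key=lambda item: len(context_words & set(item[1].split())),
    )
    return best[0]

print(sense_tag("bank", "she sat on the bank of the river watching the water"))
# overlaps "river" and "water" -> bank.river
```

Realistic sense taggers use richer evidence than raw word overlap, but the shape of the task is the same: one lexical item, several candidate senses, and a decision driven by context.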
o Discourse-level annotation
There are three main types of annotation at the level of discourse: topic identification,
coreference annotation, and discourse structure. Topic identification (also called “topic
detection”) annotates texts with information about the events or activities described in the text.
Co-reference annotation links referring objects (e.g., pronouns, definite noun phrases) to prior
elements in a discourse to which they refer. This type of annotation is invariably performed
manually, since reliable software to identify co-referents is not available. Discourse structure
annotation identifies multi-level hierarchies of discourse segments and the relations between
them, building upon low-level analysis of clauses, phrases, or sentences. To date, annotation of
discourse structure is almost always accomplished by hand, although some software to perform
discourse segmentation has been developed (e.g., Marcu, 1996).
 Abstraction
Abstraction consists of the definition of an experiment (or series of experiments), and the
construction of an abstract model or sample which may then be analysed.
The simplest type of model is just a table of totals called a contingency table. To construct a
contingency table, one must carry out a set of queries over the corpus, where these queries are
organised into dependent (predicted) and independent (predictive) variables. The scope of the
model is defined by two aspects:
a) the (sub-) corpus and how it is sampled, and
b) what precisely is under investigation - the case definition.
A contingency table is not the only, or necessarily the optimum, type of abstract model. If we
step back a bit, a contingency table is perhaps best thought of as one kind of summary of the
experimental dataset.
An experimental dataset consists of a set of cases rather than simply a total number of each
type of case.
Contingency tables are very simple and effective for small two-variable experiments. However, if
one wants to consider more than one hypothesis at a time, or look for other predictive variables,
it is more useful to explicitly abstract (define) and then explore this dataset.
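As a minimal sketch of this abstraction step, the code below builds a contingency table (a table of totals) from a toy dataset of cases; the variables (text genre as the independent variable, relative-pronoun choice as the dependent one) and the counts are hypothetical:

```python
from collections import Counter

# Toy experimental dataset: one record per case, each pairing an
# independent (predictive) variable with a dependent (predicted) variable.
cases = [
    ("fiction", "which"), ("fiction", "that"), ("fiction", "that"),
    ("academic", "which"), ("academic", "which"), ("academic", "that"),
]

# The contingency table is just a table of totals over the cases.
table = Counter(cases)
genres = sorted({g for g, _ in cases})
pronouns = sorted({p for _, p in cases})

# Print the table with genres as rows and pronoun choices as columns.
print("genre     " + "  ".join(f"{p:>6}" for p in pronouns))
for g in genres:
    print(f"{g:<10}" + "  ".join(f"{table[(g, p)]:>6}" for p in pronouns))
```

Keeping the case-level dataset around (rather than only the totals) is exactly what makes it possible to re-slice the data by a different predictive variable later.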
 Analysis
Corpus analysis is done on three grounds: concordances, collocations, and key words in
context (KWIC).
o Concordance
A concordance is an alphabetical list of the principal words used in a book or body of work,
listing every instance of each word with its immediate context. Only works of special importance
have had concordances prepared for them, such as the Vedas, Bible or the works
of Shakespeare or classical Latin and Greek authors, because of the time, difficulty, and
expense involved in creating a concordance in the pre-computer era.
A bilingual concordance is a concordance based on aligned parallel text.
A topical concordance is a list of subjects that a book covers (usually the Bible), with the
immediate context of the coverage of those subjects. Unlike a traditional concordance, the
indexed word does not have to appear in the verse.
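In the computer era, a concordance of this kind is straightforward to compute. The sketch below lists every instance of a word together with a few words of context on either side; the sample sentence and window size are illustrative:

```python
import re

def concordance(text, word, window=3):
    """List every occurrence of `word` with `window` words of context on each side."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == word.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

sample = "The corpus is searched and the corpus is tagged before analysis."
for left, key, right in concordance(sample, "corpus"):
    print(f"{left:>25} | {key} | {right}")
```

Aligning the key word in a fixed column, as the print statement does, is what gives concordance lines their characteristic readable layout.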
 Use of concordances in linguistics
Concordances are frequently used in linguistics, when studying a text. For example:
a. comparing different usages of the same word
b. analyzing keywords
c. analyzing word frequencies
d. finding and analyzing phrases and idioms
e. finding translations of substantial elements, e.g. terminology, in translation memories
f. creating indexes and word lists (also useful for publishing)
Concordancing techniques are widely used in national text corpora such as the American
National Corpus, the British National Corpus, and the Corpus of Contemporary American
English, all available online. Stand-alone applications that employ concordancing techniques
are known as concordancers or, when more advanced, corpus managers. Some of them have
integrated part-of-speech taggers and enable the user to create his or her own POS-annotated
corpora to conduct the various types of searches adopted in corpus linguistics.
o Collocation
In corpus linguistics, a collocation is a sequence of words or terms that co-occur more often
than would be expected by chance. In phraseology, collocation is a sub-type of phraseme. An
example of a phraseological collocation, as propounded by Michael Halliday, is the
expression strong tea. While the same meaning could be conveyed by the roughly
equivalent powerful tea, that expression sounds unnatural and awkward to English
speakers. Conversely, in the domain of technology, powerful computer is
preferred over strong computer. Phraseological collocations should not be confused with
idioms, whose meanings are conventional stand-ins not derivable from their parts; a
collocation is simply a habitual combination of words.
 Types of Collocation
There are six main types of collocation:
a. adjective+noun
b. noun+noun
c. verb+noun
d. adverb+adjective
e. verbs+prepositional phrase
f. verb+adverb
Collocation extraction is a computational technique that finds collocations in a document or
corpus, using various computational linguistics elements resembling data mining.
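One common way to find pairs that co-occur more often than chance is pointwise mutual information (PMI), which compares a pair's observed frequency with what the two words' individual frequencies would predict. The sketch below scores adjacent word pairs this way; the toy text and the minimum-count threshold are invented for the example:

```python
import math
import re
from collections import Counter

def collocations(text, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI):
    how much more often they co-occur than chance would predict."""
    tokens = re.findall(r"\w+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count >= min_count:  # ignore rare pairs, which make PMI unstable
            # PMI = log( P(w1, w2) / (P(w1) * P(w2)) )
            scores[(w1, w2)] = math.log(
                (count / n) / ((unigrams[w1] / n) * (unigrams[w2] / n))
            )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text = "strong tea and strong tea but powerful computer and powerful computer here"
print(collocations(text))
```

A positive score means the pair appears together more often than chance; real collocation-extraction tools use the same idea over corpora of millions of words, often alongside other association measures.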
o Key Word In Context
KWIC is an acronym for Key Word In Context, the most common format for concordance lines.
The term KWIC was coined by Hans Peter Luhn. A KWIC index is formed by sorting and
aligning the words within an article title to allow each word (except the stop words) in titles to be
searchable alphabetically in the index. It was a useful indexing method for technical manuals
before computerized full text search became common.
 Parts in KWIC
There are three parts in a KWIC index:
1. Keywords: subject-denoting words that serve as approach terms.
2. Context: the rest of the words in the title, which specify the particular context of the
keyword.
3. Identification or location code: an address code that provides the full bibliographic
description of the document.
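The three parts above can be sketched as follows; the titles, the stop-word list, and the use of a title's list position as a stand-in location code are all illustrative assumptions:

```python
def kwic_index(titles, stop_words=frozenset({"a", "an", "the", "of", "in", "for", "and", "to"})):
    """Build a KWIC index: one entry per non-stop word of each title,
    giving (keyword, context, location code). The keyword is marked with
    '*' inside its context, and entries sort alphabetically by keyword."""
    entries = []
    for loc, title in enumerate(titles):  # list position as a toy location code
        words = title.lower().split()
        for i, w in enumerate(words):
            if w not in stop_words:
                context = " ".join(words[:i] + ["*"] + words[i + 1:])
                entries.append((w, context, loc))
    return sorted(entries)

titles = ["Preparation of Linguistic Corpora", "Analysis of Corpora"]
for keyword, context, loc in kwic_index(titles):
    print(f"{keyword:<12} {context}  [{loc}]")
```

Each title thus appears in the index once per significant word, which is what made KWIC indexes searchable alphabetically under every keyword rather than only under the first word of the title.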
Conclusion
Corpus linguistics is now seen as the study of linguistic phenomena through large collections of
machine-readable texts: corpora. These are used in a number of research areas, from the
descriptive study of the syntax of a language to prosody and language learning, to mention but
a few.
The use of real examples of texts in the study of language is not a new issue in the history of
linguistics. However, Corpus Linguistics has developed considerably in the last decades due to
the great possibilities offered by the processing of natural language with computers. The
availability of computers and machine-readable text has made it possible to get data quickly and
easily and also to have this data presented in a format suitable for analysis.
Corpus linguistics is, however, not merely the use of computers to obtain language data.
Corpus linguistics is the study and analysis of data obtained from a corpus. The main task of
the corpus linguist is not to find the data but to analyse it. Computers are useful, and
sometimes indispensable, tools in this process.

More Related Content

What's hot

Language standardization: How and why
Language standardization: How and whyLanguage standardization: How and why
Language standardization: How and why
adm-2012
 

What's hot (20)

Types of syllabus design
Types of syllabus designTypes of syllabus design
Types of syllabus design
 
Corpus Linguistics
Corpus LinguisticsCorpus Linguistics
Corpus Linguistics
 
Product oriented syllabus1
Product oriented syllabus1Product oriented syllabus1
Product oriented syllabus1
 
Syllabus Designing
Syllabus DesigningSyllabus Designing
Syllabus Designing
 
Corpus Linguistics
Corpus LinguisticsCorpus Linguistics
Corpus Linguistics
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Introducing Critical Discourse Analysis
Introducing Critical Discourse AnalysisIntroducing Critical Discourse Analysis
Introducing Critical Discourse Analysis
 
Applied linguistics
Applied linguisticsApplied linguistics
Applied linguistics
 
6) discourse grammar
6) discourse grammar6) discourse grammar
6) discourse grammar
 
Michael halliday
Michael hallidayMichael halliday
Michael halliday
 
Language standardization: How and why
Language standardization: How and whyLanguage standardization: How and why
Language standardization: How and why
 
Introduction to corpus linguistics 1
Introduction to corpus linguistics 1Introduction to corpus linguistics 1
Introduction to corpus linguistics 1
 
Corpus annotation for corpus linguistics (nov2009)
Corpus annotation for corpus linguistics (nov2009)Corpus annotation for corpus linguistics (nov2009)
Corpus annotation for corpus linguistics (nov2009)
 
Applied linguistics: overview
Applied linguistics: overviewApplied linguistics: overview
Applied linguistics: overview
 
Discourse analysis
Discourse analysisDiscourse analysis
Discourse analysis
 
Definition and Scopo of Psycholinguistics
Definition and Scopo of PsycholinguisticsDefinition and Scopo of Psycholinguistics
Definition and Scopo of Psycholinguistics
 
Corpus Linguistics for Language Teaching and Learning
Corpus Linguistics for Language Teaching and LearningCorpus Linguistics for Language Teaching and Learning
Corpus Linguistics for Language Teaching and Learning
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
04. Mentalism.pptx
04. Mentalism.pptx04. Mentalism.pptx
04. Mentalism.pptx
 
Language and Gender (Sociolinguistic)
Language and Gender (Sociolinguistic)Language and Gender (Sociolinguistic)
Language and Gender (Sociolinguistic)
 

Viewers also liked

Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
Raul Vargas
 
Approach to a child with monoarthritis
Approach to a child with monoarthritisApproach to a child with monoarthritis
Approach to a child with monoarthritis
Raghavendra Babu
 
Discourse analysis
Discourse analysisDiscourse analysis
Discourse analysis
Alicia Ruiz
 
Dr. Faustus as Christian Tragedy - seven deadly sins
Dr. Faustus as Christian Tragedy - seven deadly sinsDr. Faustus as Christian Tragedy - seven deadly sins
Dr. Faustus as Christian Tragedy - seven deadly sins
zalakrutika
 
Corpus Tools for Language Teaching
Corpus Tools for Language TeachingCorpus Tools for Language Teaching
Corpus Tools for Language Teaching
CALPER
 
Overview -productivity management
Overview -productivity managementOverview -productivity management
Overview -productivity management
Bharat Parmar
 
Approach To A Patient With Polyarthritis
Approach To A Patient With PolyarthritisApproach To A Patient With Polyarthritis
Approach To A Patient With Polyarthritis
Pramod Mahender
 

Viewers also liked (20)

Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
A passage to India Chapter 7 to 12
A passage to India  Chapter 7 to 12A passage to India  Chapter 7 to 12
A passage to India Chapter 7 to 12
 
Representation of female in advertisements
Representation of female in advertisements Representation of female in advertisements
Representation of female in advertisements
 
Approach to a child with monoarthritis
Approach to a child with monoarthritisApproach to a child with monoarthritis
Approach to a child with monoarthritis
 
critical evaluation seven deadly sins.
critical evaluation seven deadly sins.critical evaluation seven deadly sins.
critical evaluation seven deadly sins.
 
Discourse analysis
Discourse analysisDiscourse analysis
Discourse analysis
 
discourse analysis
discourse analysis discourse analysis
discourse analysis
 
Dr. Faustus as Christian Tragedy - seven deadly sins
Dr. Faustus as Christian Tragedy - seven deadly sinsDr. Faustus as Christian Tragedy - seven deadly sins
Dr. Faustus as Christian Tragedy - seven deadly sins
 
Corpus linguistics the basics
Corpus linguistics the basicsCorpus linguistics the basics
Corpus linguistics the basics
 
Corpus Tools for Language Teaching
Corpus Tools for Language TeachingCorpus Tools for Language Teaching
Corpus Tools for Language Teaching
 
Overview -productivity management
Overview -productivity managementOverview -productivity management
Overview -productivity management
 
Corpus approaches to discourse analysis
Corpus approaches to discourse analysisCorpus approaches to discourse analysis
Corpus approaches to discourse analysis
 
Doctor faustus
Doctor faustusDoctor faustus
Doctor faustus
 
Seven deadly sin in "Dr Faustus"
Seven deadly sin in "Dr Faustus"Seven deadly sin in "Dr Faustus"
Seven deadly sin in "Dr Faustus"
 
Doctor Faustus by Christopher Marlowe
Doctor Faustus by Christopher MarloweDoctor Faustus by Christopher Marlowe
Doctor Faustus by Christopher Marlowe
 
Approach To A Patient With Polyarthritis
Approach To A Patient With PolyarthritisApproach To A Patient With Polyarthritis
Approach To A Patient With Polyarthritis
 
Clinical approach to Arthritis
Clinical approach to ArthritisClinical approach to Arthritis
Clinical approach to Arthritis
 
Approach to case of arthritis
Approach to case of arthritisApproach to case of arthritis
Approach to case of arthritis
 
what is Productivity
what is  Productivitywhat is  Productivity
what is Productivity
 
Improving Productivity
Improving ProductivityImproving Productivity
Improving Productivity
 

Similar to Corpus Analysis in Corpus linguistics

corpus linguistics and lexicography
corpus linguistics and lexicographycorpus linguistics and lexicography
corpus linguistics and lexicography
ayfa
 
Corpus and semantics final
Corpus and semantics finalCorpus and semantics final
Corpus and semantics final
Filipe Santos
 
Linguistic approach by sheena bernal
Linguistic approach by sheena bernalLinguistic approach by sheena bernal
Linguistic approach by sheena bernal
Edi sa puso mo :">
 
lexicography
lexicographylexicography
lexicography
ayfa
 
Corpora in cognitive linguistics
Corpora in cognitive linguisticsCorpora in cognitive linguistics
Corpora in cognitive linguistics
白兰 钦
 
Sinopsis
SinopsisSinopsis
Sinopsis
ayfa
 

Similar to Corpus Analysis in Corpus linguistics (20)

2001052491
20010524912001052491
2001052491
 
Computer assisted text and corpus analysis
Computer assisted text and corpus analysisComputer assisted text and corpus analysis
Computer assisted text and corpus analysis
 
What can a corpus tell us about discourse
What can a corpus tell us about discourseWhat can a corpus tell us about discourse
What can a corpus tell us about discourse
 
corpus linguistics and lexicography
corpus linguistics and lexicographycorpus linguistics and lexicography
corpus linguistics and lexicography
 
Corpus and semantics final
Corpus and semantics finalCorpus and semantics final
Corpus and semantics final
 
11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)
 
Corpus-Based Studies of Legal Language for Translation Purposes:
Corpus-Based Studies of Legal Language for Translation Purposes:Corpus-Based Studies of Legal Language for Translation Purposes:
Corpus-Based Studies of Legal Language for Translation Purposes:
 
Linguistic approach by sheena bernal
Linguistic approach by sheena bernalLinguistic approach by sheena bernal
Linguistic approach by sheena bernal
 
lexicography
lexicographylexicography
lexicography
 
Corpora in cognitive linguistics
Corpora in cognitive linguisticsCorpora in cognitive linguistics
Corpora in cognitive linguistics
 
History of applied linguistic
History of applied linguisticHistory of applied linguistic
History of applied linguistic
 
Transformational grammar
Transformational grammarTransformational grammar
Transformational grammar
 
Corpus linguistics intro
Corpus linguistics introCorpus linguistics intro
Corpus linguistics intro
 
Macrolinguistics & Contrastive Analysis
Macrolinguistics & Contrastive AnalysisMacrolinguistics & Contrastive Analysis
Macrolinguistics & Contrastive Analysis
 
Corpus linguistics and multi-word units
Corpus linguistics and multi-word unitsCorpus linguistics and multi-word units
Corpus linguistics and multi-word units
 
IMPORTANCE OF LINGUISTICS IN ENGLISH LANGUAGE
IMPORTANCE OF LINGUISTICS IN ENGLISH LANGUAGEIMPORTANCE OF LINGUISTICS IN ENGLISH LANGUAGE
IMPORTANCE OF LINGUISTICS IN ENGLISH LANGUAGE
 
A Phrase-Frame List For Social Science Research Article Introductions
A Phrase-Frame List For Social Science Research Article IntroductionsA Phrase-Frame List For Social Science Research Article Introductions
A Phrase-Frame List For Social Science Research Article Introductions
 
An Outline Of Type-Theoretical Approaches To Lexical Semantics
An Outline Of Type-Theoretical Approaches To Lexical SemanticsAn Outline Of Type-Theoretical Approaches To Lexical Semantics
An Outline Of Type-Theoretical Approaches To Lexical Semantics
 
Sinopsis
SinopsisSinopsis
Sinopsis
 
I01066062
I01066062I01066062
I01066062
 

More from Umm-e-Rooman Yaqoob

More from Umm-e-Rooman Yaqoob (20)

English Phonetics and Phonology By Peter Roach
English Phonetics and Phonology By Peter RoachEnglish Phonetics and Phonology By Peter Roach
English Phonetics and Phonology By Peter Roach
 
Airstream Mechanism Phonetics And Phonology
Airstream Mechanism Phonetics And PhonologyAirstream Mechanism Phonetics And Phonology
Airstream Mechanism Phonetics And Phonology
 
Airstream mechanism Phonetics and Phonology
Airstream mechanism Phonetics and PhonologyAirstream mechanism Phonetics and Phonology
Airstream mechanism Phonetics and Phonology
 
Beloved By Toni Morrison
Beloved By Toni MorrisonBeloved By Toni Morrison
Beloved By Toni Morrison
 
East coker
East cokerEast coker
East coker
 
Daddy
DaddyDaddy
Daddy
 
Moth Smoke
Moth SmokeMoth Smoke
Moth Smoke
 
Point of view
Point of viewPoint of view
Point of view
 
Foregrounding
Foregrounding Foregrounding
Foregrounding
 
Themes of moth smoke
Themes of moth smokeThemes of moth smoke
Themes of moth smoke
 
Systemic functional linguistics
Systemic functional linguisticsSystemic functional linguistics
Systemic functional linguistics
 
Modern peotry
Modern peotryModern peotry
Modern peotry
 
Michael Alexander Kirkwood Halliday
Michael Alexander Kirkwood HallidayMichael Alexander Kirkwood Halliday
Michael Alexander Kirkwood Halliday
 
Drama
DramaDrama
Drama
 
I know why the caged bird sings
I know why the caged bird singsI know why the caged bird sings
I know why the caged bird sings
 
Stylistics
Stylistics Stylistics
Stylistics
 
To the light house
To the light houseTo the light house
To the light house
 
Dr. Faustus
Dr. FaustusDr. Faustus
Dr. Faustus
 
The Return of The Native
The Return of The NativeThe Return of The Native
The Return of The Native
 
The Return Of The Native
The Return Of The NativeThe Return Of The Native
The Return Of The Native
 

Recently uploaded

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 

Recently uploaded (20)

Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...

Corpus Analysis in Corpus Linguistics

Umm-e-Rooman Yaqoob

Corpus Analysis

Corpus linguistics
Corpus linguistics is the study of language as expressed in corpora (samples) of "real world" text. The text-corpus method derives, from a body of text, a set of abstract rules that govern a natural language, and explores how that language relates to other languages; originally compiled manually, corpora are now usually derived automatically from source texts. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field, in their natural contexts, and with minimal experimental interference.

Corpus
John Sinclair defined a corpus as 'a collection of naturally-occurring text, chosen to characterize a state or variety of a language'.
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific language territory. A corpus may contain texts in a single language (a monolingual corpus) or text data in multiple languages (a multilingual corpus).
Corpora are the main knowledge base in corpus linguistics. A corpus can also serve as a foreign-language writing aid: the contextualized grammatical knowledge that non-native users acquire through exposure to authentic texts allows learners to grasp how sentences are formed in the target language, enabling effective writing.
The great advantage of the corpus-linguistic method is that language researchers do not have to rely on their own or other native speakers' intuition, or on made-up examples.
Rather, they can draw on a large amount of authentic, naturally occurring language data produced by a variety of speakers or writers in order to confirm or refute their hypotheses about specific language features on an empirical foundation.

Corpus analysis and linguistic theory
When the first computer corpus, the Brown Corpus, was created in the early 1960s, generative grammar dominated linguistics, and there was little tolerance for approaches to linguistic study that did not adhere to what generative grammarians deemed acceptable linguistic practice. As a consequence, even though the creators of the Brown Corpus, W. Nelson Francis and Henry Kučera, are now regarded as pioneers and visionaries in the corpus-linguistics community, in the 1960s their efforts to create a machine-readable corpus of English were not warmly received by many members of the linguistic community.

Linguistic theory and description
Chomsky has stated in a number of sources that there are three levels of "adequacy" on which grammatical descriptions and linguistic theories can be evaluated: observational adequacy, descriptive adequacy, and explanatory adequacy. A theory or description that achieves observational adequacy is able to describe which sentences in a language are grammatically well formed. One that also achieves descriptive adequacy additionally accounts for native speakers' intuitions about the structure of those sentences.
The highest level of adequacy is explanatory adequacy, which is achieved when the description or theory not only reaches descriptive adequacy but does so using abstract principles that can be applied beyond the language being considered and become part of "Universal Grammar". Within Chomsky's theory of principles and parameters, for example, pro-drop is a consequence of the "null-subject parameter". Because generative grammar has placed so much emphasis on universal grammar, explanatory adequacy has always been a high priority in generative grammar, often at the expense of descriptive adequacy. Unlike generative grammarians, corpus linguists see complexity and variation as inherent in language, and in their discussions of language they place a very high priority on descriptive adequacy rather than explanatory adequacy. Consequently, corpus linguists are very skeptical of the highly abstract and decontextualized discussions of language promoted by generative grammarians, largely because such discussions are too far removed from actual language usage.

Preparation and analysis of linguistic corpora
The corpus is a fundamental tool for any type of research on language. The availability of computers in the 1950s immediately led to the creation of corpora in electronic form that could be searched automatically for a variety of language features and used to compute frequencies, distributional characteristics, and other descriptive statistics. Corpora of literary works were compiled to enable stylistic analyses and authorship studies, and corpora representing general language use became widely used in the field of lexicography. Two notable early corpora are the Brown Corpus of American English (Francis and Kučera, 1967) and the London/Oslo/Bergen (LOB) Corpus of British English (Johansson et al., 1978); both corpora, each containing one million words of data tagged for part of speech, were compiled in the 1960s using a representative sample of texts produced in the year 1961.
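The frequency counts and descriptive statistics mentioned above can be computed in a few lines of standard-library Python. This is a minimal sketch with an invented toy text, not a real corpus tool:

```python
from collections import Counter
import re

def word_frequencies(text):
    """Tokenise a text crudely and count word-form frequencies."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

sample = ("The cat sat on the mat. The mat was flat. "
          "A cat likes a warm mat.")
freq = word_frequencies(sample)

print(freq.most_common(3))   # most frequent word forms
print(sum(freq.values()))    # corpus size in tokens
print(len(freq))             # vocabulary size (distinct types)
```

The token/type distinction shown in the last two lines underlies most of the descriptive statistics (e.g. type-token ratios) used in corpus studies.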
In the 1980s, the speed and capacity of computers increased dramatically, and, with more and more texts being produced in computerized form, it became possible to create corpora much larger than Brown and LOB, containing millions of words. Parallel corpora, which contain the same text in two or more languages, also began to appear; the best known of these is the Canadian Hansard corpus of parliamentary debates in English and French. The "golden era" of linguistic corpora began in 1990 and continues to this day. Enormous corpora of both text and speech have been, and continue to be, compiled, many by government-funded projects in Europe, the U.S., and Japan. In addition to monolingual corpora, several multilingual parallel corpora covering multiple languages have also been created.

Methods
Corpus linguistics has generated a number of research methods that attempt to trace a path from data to theory. Wallis and Nelson (2001) introduced what they called the 3A perspective: Annotation, Abstraction and Analysis.
a. Annotation consists of the application of a scheme to texts. Annotations may include structural markup, part-of-speech tagging, parsing, and numerous other representations.
b. Abstraction consists of the translation (mapping) of terms in the scheme to terms in a theoretically motivated model or dataset. Abstraction typically includes linguist-directed search but may also include, e.g., rule-learning for parsers.
c. Analysis consists of statistically probing, manipulating and generalising from the dataset. Analysis might include statistical evaluations, optimisation of rule-bases, or knowledge-discovery methods.
Most lexical corpora today are part-of-speech-tagged (POS-tagged). However, even corpus linguists who work with 'unannotated plain text' inevitably apply some method to isolate salient terms; in such situations annotation and abstraction are combined in a lexical search. The advantage of publishing an annotated corpus is that other users can then perform experiments on it (through corpus managers). Linguists with interests and perspectives that differ from the originators' can exploit this work. By sharing data, corpus linguists are able to treat the corpus as a locus of linguistic debate rather than as an exhaustive fount of knowledge.

Corpus annotation
For computational-linguistics research, which has driven the bulk of corpus-creation efforts over the past decade, corpora are typically annotated with various kinds of linguistic information. The following sections outline the major annotation types.

Morpho-syntactic annotation
By far the most common corpus annotation is morpho-syntactic annotation (part-of-speech tagging), primarily because several highly accurate automatic taggers have been developed over the past 15 years. Part-of-speech tagging is a disambiguation task: for words that have more than one possible part of speech, it is necessary to determine which one, given the context, is correct.

Syntactic annotation
There are two main types of syntactic annotation in linguistic corpora: noun-phrase (NP) bracketing or "chunking", and the creation of "treebanks" that include a fuller syntactic analysis.
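The disambiguation task behind POS tagging can be illustrated with a most-frequent-tag baseline: each word is assigned the tag it bears most often in a hand-tagged sample. This is a deliberately simplified sketch with an invented mini training set, not one of the accurate taggers referred to above:

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training sample (invented for illustration).
tagged = [("the", "DET"), ("run", "NOUN"), ("was", "VERB"),
          ("long", "ADJ"), ("they", "PRON"), ("run", "VERB"),
          ("dogs", "NOUN"), ("run", "VERB"), ("fast", "ADV")]

# Count how often each word form carries each tag.
tag_counts = defaultdict(Counter)
for word, tag in tagged:
    tag_counts[word][tag] += 1

def most_frequent_tag(word):
    """Resolve ambiguity by picking the word's most frequent tag."""
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else "UNK"

print(most_frequent_tag("run"))   # NOUN once, VERB twice → VERB
print(most_frequent_tag("the"))   # → DET
```

Real taggers improve on this baseline by also conditioning on the surrounding context, which is what makes tagging a genuine disambiguation problem rather than a lookup.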
Syntactically annotated corpora serve various statistics-based applications, most notably by providing probabilities to drive syntactic parsers, and have also been used to derive context-free and unification-based grammars (Charniak, 1996; van Genabith et al., 1999). Syntactically annotated corpora also provide theoretical linguists with data to support studies of language use.

Semantic annotation
Semantic annotation can be taken to mean any kind of annotation that adds information about the meaning of elements in a text. At present, the most common type of semantic annotation is "sense tagging": the association of lexical items in a text with a particular sense or definition, usually drawn from an existing sense inventory provided in a dictionary or an on-line lexicon such as WordNet (Miller et al., 1990).

Discourse-level annotation
There are three main types of annotation at the level of discourse: topic identification, co-reference annotation, and discourse structure. Topic identification (also called "topic detection") annotates texts with information about the events or activities described in the text. Co-reference annotation links referring expressions (e.g., pronouns, definite noun phrases) to the prior elements in a discourse to which they refer; this type of annotation is invariably performed manually, since reliable software to identify co-referents is not available. Discourse structure
annotation identifies multi-level hierarchies of discourse segments and the relations between them, building on a low-level analysis of clauses, phrases, or sentences. To date, annotation of discourse structure is almost always accomplished by hand, although some software to perform discourse segmentation has been developed (e.g., Marcu, 1996).

Abstraction
Abstraction consists of the definition of an experiment (or series of experiments) and the construction of an abstract model or sample which may then be analysed. The simplest type of model is just a table of totals called a contingency table. To construct a contingency table, one carries out a set of queries over the corpus, where these queries are organised into dependent (predicted) and independent (predictive) variables. The scope of the model is defined by two aspects: a) the (sub-)corpus and how it is sampled, and b) what precisely is under investigation, i.e. the case definition.
A contingency table is not the only, or necessarily the optimum, type of abstract model. Stepping back a little, a contingency table is perhaps best thought of as one kind of summary of the experimental dataset: an experimental dataset consists of a set of cases rather than simply a total number of each type of case. Contingency tables are very simple and effective for small two-variable experiments. However, if one wants to consider more than one hypothesis at a time, or look for other predictive variables, it is more useful to explicitly abstract (define) and then explore this dataset.

Analysis
Corpus analysis rests on three main techniques: concordances, collocation, and key word in context (KWIC).

Concordance
A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context.
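The definition above — every instance of a word with its immediate context — can be sketched as a short function. This is a minimal standard-library illustration with an invented sample text, not a full concordancer:

```python
import re

def concordance(text, word, width=3):
    """List every instance of `word` with `width` words of context each side."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

sample = ("Corpus linguistics studies language through corpora. "
          "A corpus is a structured set of texts, and every corpus "
          "is sampled to represent a language variety.")
for line in concordance(sample, "corpus"):
    print(line)
```

Corpus managers apply the same idea at scale, adding indexing, sorting of the context columns, and regular-expression queries.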
In the pre-computer era, only works of special importance had concordances prepared for them, such as the Vedas, the Bible, or the works of Shakespeare or of classical Latin and Greek authors, because of the time, difficulty, and expense involved in creating one. A bilingual concordance is a concordance based on aligned parallel text. A topical concordance is a list of the subjects that a book (usually the Bible) covers, with the immediate context of the coverage of those subjects; unlike in a traditional concordance, the indexed word does not have to appear in the verse.

Use of concordances in linguistics
Concordances are frequently used in linguistics when studying a text, for example for:
a. comparing different usages of the same word
b. analyzing keywords
c. analyzing word frequencies
d. finding and analyzing phrases and idioms
e. finding translations of substantial elements, e.g. terminology, in translation memories
f. creating indexes and word lists (also useful for publishing)
Concordancing techniques are widely used with national text corpora such as the American National Corpus, the British National Corpus, and the Corpus of Contemporary American English, all available on-line. Stand-alone applications that employ concordancing techniques are known as concordancers or, when more advanced, corpus managers. Some of them have integrated part-of-speech taggers and enable users to create their own POS-annotated corpora in order to conduct the various types of searches adopted in corpus linguistics.

Collocation
In corpus linguistics, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance. In phraseology, collocation is a sub-type of phraseme. An example of a phraseological collocation, as propounded by Michael Halliday, is the expression strong tea. While the same meaning could be conveyed by the roughly equivalent powerful tea, this expression is considered excessive and awkward by English speakers; conversely, the corresponding expression in technology, powerful computer, is preferred over strong computer. Phraseological collocations should not be confused with idioms: an idiom's meaning is derived from its conventional use as a stand-in for something else, whereas a collocation is merely a popular composition.

Types of collocation
There are about six main types of collocation:
a. adjective + noun
b. noun + noun
c. verb + noun
d. adverb + adjective
e. verb + prepositional phrase
f. verb + adverb
Collocation extraction is a computational technique that finds collocations in a document or corpus, using various computational-linguistics methods resembling data mining.

Key Word In Context
KWIC is an acronym for Key Word In Context, the most common format for concordance lines.
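The KWIC format can be sketched by aligning each occurrence of the keyword in a fixed centre column, with context to either side. A minimal illustration with an invented sample sentence and arbitrary column widths:

```python
import re

def kwic(text, keyword, width=20):
    """Return concordance lines with the keyword aligned in a centre column."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[:i])[-width:]          # trailing context chars
            right = " ".join(tokens[i + 1:])[:width]      # leading context chars
            lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines

sample = ("Strong tea is a collocation while powerful tea sounds odd "
          "although powerful computer is preferred over strong computer")
for line in kwic(sample, "powerful"):
    print(line)
```

Because the keyword column is aligned, sorting the lines by left or right context makes recurring patterns (such as the collocates of the keyword) visually apparent.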
The term KWIC was coined by Hans Peter Luhn. A KWIC index is formed by sorting and aligning the words within an article title so that each word (except the stop words) is searchable alphabetically in the index. It was a useful indexing method for technical manuals before computerized full-text search became common.

Parts of a KWIC index
There are three parts in a KWIC index:
1. Keywords: subject-denoting words which serve as approach terms.
2. Context: the keywords selected also specify the particular context of the document (the rest of the words in the title).
3. Identification or location code: an address code used to provide a full bibliographic description of the document.

Conclusion
Corpus linguistics is now seen as the study of linguistic phenomena through large collections of machine-readable texts: corpora. These are used within a number of research areas, from the descriptive study of the syntax of a language to prosody or language learning, to mention but a few. The use of real examples of texts in the study of language is not a new issue in the history of linguistics. However, corpus linguistics has developed considerably in recent decades thanks to the great possibilities offered by processing natural language with computers. The availability of computers and machine-readable text has made it possible to obtain data quickly and easily and to have that data presented in a format suitable for analysis. Corpus linguistics is, however, not simply a matter of obtaining language data through the use of computers: it is the study and analysis of data obtained from a corpus. The main task of the corpus linguist is not to find the data but to analyse it. Computers are useful, and sometimes indispensable, tools in this process.