Corpus - Based methods in linguistic research

Corpus-based methods in linguistic research was initiated by Paul
Baker in 2006 through the publication ‘Using Corpora in Discourse Analysis’
(London: Continuum). Since then, the method has been applied by many linguists
in great variety of research, especially in making dictionaries, teaching languages,
and comparing translations. With a number of advantages, the method enables
researchers to quantify linguistic patterns, and to come to solid conclusions. Thanks
to the technology development, user-friendly software has motivated the method to
move rapidly, promising to bring many useful linguistic applications to our life.
Download
Trang 1
Trang 2
Trang 3
Trang 4
Trang 5
Trang 6
Trang 7
7 trang minhkhanh 22960 Free
Download
Bạn đang xem tài liệu "Corpus - Based methods in linguistic research", để tải tài liệu gốc về máy hãy click vào nút Download ở trên
Tóm tắt nội dung tài liệu: Corpus - Based methods in linguistic research

VẤN ĐỀ HÔM NAY
23Tạp chí 
Kinh doanh và Công nghệ
No 05E/2019
* Dean of English Faculty B,
 Hanoi University of Business and Technology
1. Introduction 
As far as we know, quantitative research 
concentrate on how much or how many 
there is/are of a particular characteristic or 
item of a linguistic aspect. The strong point 
of this method is that it enables researchers 
to compare ralatively large numbers of 
things, people by applying a comparatively 
easy index. Quantative data use statistical 
methods for analysis, that is, particular 
methametics tools allow researchers to 
conduct on numeric data. Corpus-based 
methods are regarded as effective research 
methods in linguistics, and they ‘should 
not be considered as only quantitative, but 
rather than an approach which can combine 
both qualitative and quantitative processes’ 
(Paul Baker in Litosseliti, 2010: 93). 
This article mentions corpus-based 
methods in details, from basic concepts, 
their applications, to building and 
annotating corpora in linguistic research. 
2. Theoretical Concepts of Corpus 
Linguistics
 1.1. Definition 
Corpus linguistics is a popular field of 
linguistics involving ‘the analysis of very 
large collections of electronically stored 
texts aided by computer software’ (Paul 
Baker in Litosseliti, 2010: 93). The word 
‘corpus’ in Latin means body, so a corpus can 
be understood a ‘body’ of texts. According 
to Mc Enery and Wilson (1996:1), corpus 
linguistics features a ‘methodology’ rather 
than a traditional branch of linguistics like 
semantics, grmammar, or phonetics. 
Corpus linguistics with characteristics 
of empirical, inductive forms of analysis 
relies on instances of language in real-life 
use. The aims of corpus linguistics are to 
find out patterns, rules or explore trends 
about the ways people actually use their 
language in everyday life. 
2.2. Advantages of Corpus Linguistics 
There are several great advantages 
of corpus-based methods in linguistics 
as follow: 
(i) Corpus-based methods enable 
researchers to restate or refect hypotheses 
about language use; 
CORPUS-BASED METHODS
IN LINGUISTIC RESEARCH
Nguyen Thi Hong Ha MA *
Abstract: Corpus-based methods in linguistic research was initiated by Paul 
Baker in 2006 through the publication ‘Using Corpora in Discourse Analysis’ 
(London: Continuum). Since then, the method has been applied by many linguists 
in great variety of research, especially in making dictionaries, teaching languages, 
and comparing translations. With a number of advantages, the method enables 
researchers to quantify linguistic patterns, and to come to solid conclusions. Thanks 
to the technology development, user-friendly software has motivated the method to 
move rapidly, promising to bring many useful linguistic applications to our life. 
Key words: corpus, method, research, linguistics, application.
VẤN ĐỀ HÔM NAY
24Tạp chí 
Kinh doanh và Công nghệ
No 05E/2019
(ii) Corpus-based methods allow 
researchers to bring out new questions 
and theories about language; 
(iii) Corpus-based methods help 
researchers to quantify linguistic patterns, 
reaching more solid conclusions on 
language; 
(iv) Corpus-based methods with 
large corpora lead researchers to obvious 
evidence of rare or unusual instances 
of language, and confirmation on very 
common phenomena of language. 
2.3. Research questions of corpus-
based methods
The overarching questions of corpus-
based methods is ‘how do people really use 
language’, around which many research 
questions related to different fields in 
linguistics are raised. For example, in 
the field of language teaching it can be 
seen: ‘Is the language used in textbooks 
actually reflected the language that people 
encounter in everyday life?’ (Mindt, 1996); 
or in study on language genres: ‘Has 
writeen language become more informal 
over recent years?’ (Kennedy, 1998). 
Moreover, the comparative trend also 
appears in research questions within studies 
on corpus-based linguistics, such as ‘How 
does the use of linguistic feature X differ in 
usage between language varieties A and B 
in terms of frequency and/ or typical usage?’ 
or ‘What associations are triggered by the 
use of linguistic item X, based on its typical 
uses?’ Not only to discover similarities 
between the features of languages, but also 
do corpus-based methods help find out 
differences, in spite of small difference, 
or even no difference which is worth 
researching. In addition, corpus linguistic 
research approaches language patterns that 
people are unaware of, but they may still 
strongly influence users. 
3. Types of corpora
Corpora in existence are divided 
into three main pairs (Baker, P. 2006 
in Litosseliti, 2010), based on their 
characteristics and aims: 
3.1. General corpora and specialized 
corpora 
A general corpus aims to represent 
a particular language like the British 
National Corpus BNC or The Bank 
of English BoE., which is extremely 
large and takes a long time to build and 
annotate. More importantly, they are 
very useful resources for a wide range 
of research purposes. They play the role 
of ‘benchmark’ about a typical language 
in comparison with a specilized corpus. 
Meanwile, a specialized corpus is much 
smaller and has more limited sets of 
texts in restrictions on time, genre or 
place/language variety. For instance, a 
specialized corpus of just newspapers 
published in October, 2019 in Vietnam. 
Speacilized corpora are generally easier 
to collect and for specific research 
questions. 
3.2. Written corpora and spoken corpora
Written corpora contain computer-
mediated texts such as e-mails, text 
messages or websites or mixture of all 
three while spoken corpora are usually 
smaller due to complexities surrounding, 
gathering and transcribing data. Written 
corpora with the access to the  ...  grammar of learners in 
comparison with an equivalent corpus of 
native speakers. The Longman Learner 
Corpus and the International Corpus of 
Learner English both receive contributions 
from a great variety of learners all over 
the world, helping researchers find out the 
extent to which a student’s first language 
is likely to influence the way they acquire 
English. 
4. Applications of corpus-based 
methods in linguistics
4.1. Application in linguistic description 
Corpus-based methods can aid 
researchers in making dictionary with 
real-life examples of words in use. 
Hunston (2002) researches the senses of 
the verb ‘KNOW’ shown in examples in 
three dictionaries, one of which did not 
use a corpus and the others used a corpus. 
These are the findings of the study 
(Hunston, 2002: 97): 
- The 1987 Longman Dictionary of 
Contemporary English (without a corpus): 
20 senses;
- The 1995 Longman Dictionary 
(with a corpus): over 40 senses;
- The 1995 COBUILD Dictionary 
(with a corpus): over 30 senses. 
Obviously, corpus-base methods 
show their advantages of enhancing senses 
of meaning of words in dictionaries, 
reflecting the reality of diversely using 
language in everyday life. 
4.2. Aplication in translation studies 
Corpus-base methods, especially 
with the types of multilingual and 
parallel corpora show their strengths in 
comparative transaltion and interpreting 
studies. When conducting research on 
punctuation in Hans Christian Andersen’s 
stories and in their translations into 
English, Malmkjer (1997) concludes that 
in translations, punctuation tends to be 
strenghtened, with commas aften being 
replaced with semicolons or full stops, 
and semicolons being transferred to full 
stops, too. This leads to long, complex 
sentences being divided into shorter and 
simpler clauses in translations to reduce 
the complexity of sentence structures in 
the original. 
In another study, Mauranen (2000) 
reveals that translators usually make 
optional cohesive markers explicit in the 
translations, even though they are not 
available in the original text. This results 
in a tendency to spell things out rather 
than make them implicit. 
4.3. Application in forensic linguistics 
Coulthard (1993) carries out his study 
on witness statements used as evidence in 
VẤN ĐỀ HÔM NAY
26Tạp chí 
Kinh doanh và Công nghệ
No 05E/2019
the trial of Derk Bentley (who was excuted 
in Britian in 1953 for his involvement 
in a policeman’s death). He compares 
the frequencies of words in Bentley’s 
statements with that in general written and 
spoken English, and that of other policemen 
and witnesses. He notes that Bentley had 
a higher frequency of using word ‘then’ 
than others. However, this word is a very 
typical feature of the police. This, together 
with other corpus-based eveidence, the 
researcher argues that Bentley aged 11 
had not made his own statement, but it had 
been written for him. 
4.4. Application in Critical Discourse 
Analysis CDA
In the area of CDA, Baker (2006) 
demonstrates how corpus-based methods 
can be used to express the ‘incremental 
effect of discourse’. He states that an 
association between two words, occuring 
many times in naturally occurring 
language is much better evidence for an 
underlying hegemonic discourse made 
explicit through the word combination 
than a single case. Furthermore, Mautner 
(2007) examines a corpus from a wide 
range of language sources to see how 
the elderly construct their discourse. The 
researcher recognizes that their discourse 
is expressed as ill-health victims who 
are in need of care more often than as 
empowered or independent citizens. 
4.5. Application in language teaching
Corpora can also help language 
teaching be more effective. Mindt (1996) 
studies a corpus of spoken English and 
realized that native speakers use the model 
verb ‘will’ most frequently for referring to 
future time. Yet, in German textbooks used 
for teaching English, ‘will’ was introduced 
to students in the middle of the second year 
when they had already learnt other modal 
verbs, which were less frequent in use in 
the corpus. Such studies are very useful for 
writing textbooks and designing teaching 
syllabus in language teaching. 
4.6. Application in stylistics
Corpus methods of analysis have 
been used in stylistics in order to enhance 
systematicity and decrease subjectivity. 
For example, Malhberg (2009) examines 
stylistics in writing literary works by 
Charles Dickens and states that the writer 
often mentions the ways characters use 
household objects as a way of drawing 
readers’ attention to the characters’ 
emotions. Indicating a number of 
examples in the works related to objects 
like a watering-pot or a knife and fork, 
the researcher through her corpus-based 
analysis concludes that in Dickens’ novels 
these objects are consistently used to 
emphasize characters’ emotional states. 
5. Corpus-based research tools and 
building corpora
Difference across space/genres 
(variation) and over time (change) is the 
most commonly applied in corpus- based 
research. According to Baker, among them 
frequency data are ‘indicator of markedness’ 
(Baker 2010: p.125); wordlists are the most 
basic ‘points of entry’ (Baker 2010: p.133) 
to analyse; keywords are ‘somewhat 
more sophisticated’ (Baker 2010: p.134) 
means of research and concordances with 
associated information are involved in 
collocates and clusters. 
The most frequently used data include: 
(i) Corpus, text and sentence (average 
word length);
(ii) Standardized type/ token ration 
(STTR) and standard deviation (SD);
(iii) Significance (p-value).
Theoretically, any text and collection 
of texts are considered to be a corpus and the 
VẤN ĐỀ HÔM NAY
27Tạp chí 
Kinh doanh và Công nghệ
No 05E/2019
analysis can be conducted on the corpus of 
very short texts. According to McEnery and 
Wilson (1996), a corpus usually contains 
a sample ‘maximally representative of 
the variety under examination’ is ‘of a 
finte size’, exists in ‘machine readable’ 
and ‘constitutes a standard reference for 
the language variety which it represents’ 
(McEnery and Wilson, 1996: 22-23). It is 
clear that a corpus must be large enough 
to show some feature about frequencies of 
linguistic phenomena, helping researchers 
to discover the common as well as unusual 
things in language. 
Baker in Litosseliti (2010: p.95) 
suggests three criteria for identifying the 
size of a corpus: aspects of language, 
type of language, and pratical reasons. 
Kennedy (1998: 68) states that ‘a corpus 
of 100,000 words will usually big enough 
for the sudy of prosody’, and an analysis 
of verb-form morphology will need half 
a million words. Meanwhile, according 
to Biber (1993) confirms that a million 
words will be enough for a study on 
grammar. Also, the more various the 
language is, the larger the corpus requires. 
The British National Corpus involving a 
very wide range of written and spoken 
language genres and as a standard 
reference for British English, has the size 
of 100 million words while a corpus of 
weather forecast with restricted language 
just needs a much more smaller size. The 
practical conditions affecting the building 
a corpus could be the avalability of texts, 
the amount of money and time spent on 
the study and permission from authors. 
The key theoretical concepts in corpus 
linguistic include sampling, balcance and 
representativeness. A corpus must be 
representative of a particular language 
variety, so the texts need choosing carefully 
to make sure that the corpus must be as a 
whole. Perhaps, it is not necessary to take 
the whole text, but parts of it if they are 
too long novels. Equal-sized samples from 
texts also need to be considered carefully to 
assure the balance. For instance, different 
parts (beginnings, middles, ends) of texts 
are equally sampled for the corpus. If texts 
are quite short and from one writer, the 
whole texts might be included rather than 
different sections. 
Corpora are usually applied with 
the assistance of analysis software doing 
counting, sorting and presenting language 
characteristics. The current software such 
as WordSmith Tools, Xaira, Wmatrix, and 
AntConc can be employed in conjuction 
with a range of corpora. First, each text 
within a corpus is usually saved on a 
separate file containing a ‘header’ which 
has information about its author, date of 
publication, genre, etc. It allows researchers 
to concentrate on specific types of texts 
or to compare different types. Second, a 
corpus is annotated or tagged with further 
information for more complex calculations 
carried out on them. Sometimes, standard 
generalized mark-up language (SGML) 
where tags are presented in codes 
(elements) inside matching angle brackets 
() is applied. Finally, certain features 
of the language variety are represented 
with other codes (entities) starting with an 
ampersand character (&) and ending in a 
semi-colon. 
Hand-checking is often required 
although tagging can be automatically 
done by computer as tagging software 
is not always 100% accurate. Computer 
programs usually work best on texts having 
grammatically predictable sentences and 
relatively familiar words. Apart from 
checking, only human beings can interpret 
VẤN ĐỀ HÔM NAY
28Tạp chí 
Kinh doanh và Công nghệ
No 05E/2019
the results of calculations from computer, 
no computer software can cover this work.
6. Conclusion 
In summary, corpus-based methods 
are regared as potential to produce 
interesting findings about language, 
but as with many other methods, it is 
researchers’ task to provide explantion 
for the findings. Corpus-based methods 
cannot cover all fields in linguistics, so it 
could be combined with other methods to 
maximize strengths and avoid limitations. 
Although it contains its weak points such 
as being time and money-consuming 
to build a corpus, a continuing need 
to update balanced reference corpora, 
researcher’s ability to use computer 
fluently, identifying certain type of 
language patterns, it is worth applying 
to linguistic studies where comparisons 
must be done. The advantage of the 
corpus-based methods lies in employing 
fast and accurate techniques to discover 
patterns that human reserarchers would 
not recognize. Corpus analysis uses large 
amount of natural data, so a high degree 
of reliability and validity of linguistic 
research is usually achieved.
References
1. Baker, P. (2006). Using Corpora in Discourse Analysis. London: Continuum. 
2. Baker, P. (2010). Sociolinguistics and Corpus Linguistics. Edinburgh 
Sociolinguistics. Edinburgh: Edinburgh University Press. 
3. Biber, D. (1993). ‘Representativess in corpus design’ Literary and linguistics 
Computing 8,4: 243-57. 
4. Coulthard, M. (1993). ‘On beginning the study of forensic texts: corpus concordance 
collocation’. In M.Hoey (ed.). Data, Description, Discourse. London: Harper Collins. 
5. Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge 
University Press. 
6. Kennedy, G. (1998), An Introduction to Corpus Linguistics. London: Longman. 
7. Litosseliti, L. (2010). Research Methods in Linguistics. Continuum. 
8. Malhberg, M. (2009). ‘Corpus stylistics and the Pickwickian watering-pot’ in 
Baker (ed.). Contemporary Approaches to Corpus Linguistics. London: Continuum.
9. Malmkjer, K. (1997). ‘Punctuation in Hans Christian Andersen’s stories and 
in their translations into English’, in F.Poyatos (ed.). Nonverbal Communication and 
Translation. New Perspectives and Challenges in Literature, Interpretation and the 
Media. Amsterdam and Philadelphia: Benjamins. 
10. Mauranen, A. (2000). ‘Strange strings in translated language: a study on 
corpora’, in M. Olohan (ed.). Intercultural Faultlines. Research Models in Translation 
Studies 1: Textual and Cognitive Aspects. Manchester: St. Jerome Publishing. 
11. Mautner, G. (2007). ‘Mining large corpora for social information: the case of 
elderly’, Language in Society, 36 (1), 51-72. 
12. Mindt, D. (1996). ‘English corpus linguistics and the foreign language teaching 
syllabus’, in J. Thomas and M. Shorty (eds), Using Corpora for Language Research. 
London: Longman, 232-247. 
13. Mc Enery and Wilson (1996). Corpus Linguistics. Edinburgh: Edinburgh 
University Press.
VẤN ĐỀ HÔM NAY
29Tạp chí 
Kinh doanh và Công nghệ
No 05E/2019
PHƯƠNG PHÁP DỰA VÀO KHỐI LIỆU TRONG NGHIÊN CỨU NGÔN NGỮ
Phương pháp dựa vào khối liệu trong 
nghiên cứu ngôn ngữ được học giả Paul 
Baker khởi xướng vào năm 2006 qua ấn 
bản ‘Sử dụng Khối liệu trong Phân tích 
Diễn ngôn’ (Nxb London: Comtinuum). 
Kể từ đó, phương pháp này được nhiều 
nhà ngôn ngữ học áp dụng trong các 
nghiên cứu của mình, đặc biệt trong việc 
biên soạn từ điển, dạy ngoại ngữ, và so 
sánh dịch thuật. 
Phương pháp này có nhiều lợi thế, 
giúp nhà nghiên cứu định lượng hóa các 
mẫu ngôn ngữ, từ đó đi đến những kết luận 
đầy thuyết phục. Ứng dụng của phương 
pháp này trong nghiên cứu ngôn ngữ rất 
phong phú: từ việc mô tả ngôn ngữ trong 
làm từ điển, đến hỗ trợ dạy tiếng, ngôn 
ngữ pháp lý, phong cách học, đến nghiên 
cứu so sánh trong dịch thuật,
Đây là một phương pháp mới trong 
nghiên cứu ngôn ngữ ở Việt Nam, nên tác 
giả muốn giới thiệu nó đến bạn đọc. Cùng 
với sự phát triển của công nghệ, các phần 
mềm hỗ trợ tính toán hiện đại đã làm cho 
phương pháp này ngày càng phát triển 
mạnh, mang lại nhiều ứng dụng thiết thực 
của ngôn ngữ cho đời sống đương đại. 
Từ khóa: Phương pháp, nghiên cứu, 
khối liệu, ngôn ngữ học, ứng dụng.
ThS. Nguyễn Thị Hồng Hà *
* Chủ nhiệm Khoa Tiếng Anh B,
 Trường ĐH KD&CN Hà Nội
Ngày nhận bài: 15/10/2019
File đính kèm:
corpus_based_methods_in_linguistic_research.pdf