1 Introduction to Corpus Linguistics


Chapter 1: The Aims and Methods of Corpus Linguistics


1.1 What is a corpus?

In the language sciences a corpus is a body of written text or transcribed speech which can serve as a basis for linguistic analysis and description. In many respects it is the use to which the body of textual material is put, rather than its design features, which defines what a corpus is.

A corpus constitutes an empirical basis not only for identifying the elements and structural patterns which make up the systems we use in a language, but also for mapping out our use of these systems. A corpus can be analyzed and compared with other corpora or parts of corpora to study variation. Most importantly, it can be analyzed distributionally to show how often particular phonological, lexical, grammatical, discoursal or pragmatic features occur, and also where they occur.

By the 1990s there were many corpus-making projects in various parts of the world. Lancashire (1991) shows the huge range of corpora, archives and other electronic databases available or being compiled for a wide variety of purposes. Some of the largest corpus projects have been undertaken for commercial purposes, by dictionary publishers. Other projects in corpus compilation or analysis are on a smaller scale, and do not necessarily become well known. Undertaken as part of graduate theses or undergraduate projects, they have enabled students to gain original insights into the structure and use of language.

1.2 Categorization of corpora

Computerized corpora consist of:

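Raw corpora: collections of authentic spoken and written language gathered in written form and then classified and compiled according to certain principles (register, genre, diachronic or synchronic coverage, etc.).

Annotated corpora: corpora in which the raw data have been tagged for part of speech, grammar, phonology, semantics, discourse or even pragmatics.

Parallel corpora: corpora in which two or more languages are aligned as translations of one another at the level of the sentence, or even of the word and phrase. Examples are CRATER, a parallel corpus covering English, French, German, Spanish and other languages (McEnery & Oakes 1996), and the English-Chinese bilingual parallel corpus (China Foreign Language Teaching Research Centre 2000).

Learner corpora: corpora of the spoken and written language of non-native learners, including corpora tagged for learners' spelling and grammatical errors and supplied with correction hints. Examples are ICLE (an international corpus of written learner English), LINDSEI (an international corpus of spoken learner English) (Granger 2000) and CLEC (a corpus of written English by Chinese learners) (Gui Shichun 2001).

Lattice corpora: corpora generated by applying automatic speech recognition and handwriting recognition to natural language, both spoken and written (Atwell 1996).

Broadly speaking, corpora are divided into raw corpora and annotated corpora.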

1.3 What a corpus can do

Strictly speaking, a corpus by itself can do nothing at all, being nothing other than a store of used language. Corpus access software, however, can rearrange that store so that observations of various kinds can be made. If a corpus represents, very roughly and partially, a speaker’s experience of language, the access software re-orders that experience so that it can be reexamined in ways that are usually impossible. A corpus does not contain new information about language, but the software packages process data from a corpus in three ways: showing frequency, phraseology and collocation.
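The three kinds of processing named above can be illustrated with a minimal Python sketch. It is not the method of any particular corpus package: the function names, the crude tokenizer and the sample sentence are all assumptions made for illustration, and phraseology is approximated here by simple key-word-in-context (concordance) lines.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Frequency: return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def concordance(tokens, node, span=3):
    """Phraseology: return a key-word-in-context line for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left} [{tok}] {right}")
    return lines

def collocates(tokens, node, span=3):
    """Collocation: count the words that occur within `span` tokens of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text standing in for a real corpus.
sample = ("She made a decision quickly. He made a mistake, and then "
          "he made a decision to leave.")
tokens = tokenize(sample)

print(frequency_list(tokens)[:3])
print(concordance(tokens, "decision", span=2))
print(collocates(tokens, "made", span=1).most_common(3))
```

With a real corpus, raw co-occurrence counts like these would normally be followed by a statistical measure of association, since very frequent words such as "a" co-occur with almost everything.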

2. What is corpus linguistics?

2.1 The definition of corpus linguistics

Over the last three decades the compilation and analysis of corpora stored in computerized databases have led to a new scholarly enterprise known as corpus linguistics. It brings together some of the findings of corpus-based studies of English, the language which has so far received the most attention from corpus linguists, and shows how quantitative analysis can contribute to linguistic description.

2.2 The history of corpus linguistics

The use of corpora for linguistic study dates back to the end of the nineteenth century, when index cards and manual retrieval were the only research tools available.

As we have seen, corpus linguistics goes beyond the use of corpora as a source of evidence in linguistic description. It also revives and carries on a concern of some linguists with the statistical distribution of linguistic items in the context of use. From the 1920s there was, especially in the United States and the United Kingdom, a tradition of word counting in texts in order to discover the most frequent, and arguably therefore the most pedagogically useful, words and grammatical structures for language teaching purposes. From the 1930s, Prague School linguists undertook quantitative studies (mainly of Czech, English and Russian) of different parts of speech, the location and distribution of information in the sentence, and the statistical distribution of syllable types and structures. Different varieties of English have been studied.

The earliest computerized corpora compiled for linguistic research from the 1960s required the use of mainframe computers, and researchers frequently had to design their own software for analysis. Initial interest was often in lexis, including word counts, but it was quickly apparent that a computer facilitated the study of permissible or likely word sequences or collocations (are we more likely to write different from, different to or different than?) and of the grammatical and stylistic characteristics of particular authors and genres. There was a particular interest in what characterized ‘scientific style’, ‘newspaper style’ and ‘literary or imaginative style’.

The renowned British scholars R. Quirk and S. Greenbaum collaborated to establish the Survey of English Usage (SEU) corpus in the 1950s and 1960s, first on paper and then, at the beginning of the 1980s, in computerized form, which marks the transition from the traditional corpus to the computerized corpus. The Brown University Standard Corpus of Present-day American English (BROWN) was established in the 1960s and 1970s. The London-Lund Corpus of Spoken English (LLC), completed in the 1980s, was the first corpus of its kind and includes formal and informal speeches, commentaries, dialogues, discussions, interviews and so on. These three classic corpora laid a solid foundation for present-day corpus linguistics: they are systematic, comprehensive, authentic and reliable, and they are easy to store and retrieve.
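The question raised above about different from, different to and different than is the kind of question a computer answers by counting. The following is a minimal sketch, assuming plain-text input; the function name `following_word_counts` and the sample sentences are invented for illustration and do not come from any of the corpora mentioned.

```python
import re
from collections import Counter

def following_word_counts(text, node, candidates):
    """Count how often each candidate word directly follows `node` in `text`."""
    alternatives = "|".join(re.escape(c) for c in candidates)
    pattern = re.compile(
        r"\b" + re.escape(node) + r"\s+(" + alternatives + r")\b",
        re.IGNORECASE,
    )
    return Counter(match.group(1).lower() for match in pattern.finditer(text))

# Invented sample text; a real study would read in a full corpus.
sample = ("This is different from that. Mine is different to yours. "
          "It is different from theirs, not different than ours.")

counts = following_word_counts(sample, "different", ["from", "to", "than"])
for word, n in counts.most_common():
    print(f"different {word}: {n}")
```

Running the same count separately over British and American texts, or over different genres, would show whether the preference varies between varieties.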

2.3 The scope of corpus linguistics

Corpus linguistics is based on bodies of text as the domain of study and as the source of evidence for linguistic description and argumentation. It also has come to embody methodologies for linguistic description in which quantification of the linguistic items is part of the research activity. As Leech (1992:107) has noted, the focus of study is on performance rather than on competence, and on observation of language in use leading to theory rather than vice versa.

Corpus linguists are concerned typically not only with what words, structures or uses are possible in a language but also with what is probable, that is, what is likely to occur in language use. The use of a corpus as a source of evidence, however, is not necessarily incompatible with any linguistic theory, and progress in the language sciences as a whole is likely to benefit from a judicious use of evidence from various sources: texts, introspection, elicitation or other types of experimentation as appropriate. Any scientific enterprise must be empirical in the sense that it has to be supported or falsified by evidence and, in the final analysis, statements made about language have to stand up to the evidence of language use. The evidence can be based on the introspective judgment of speakers of the language or on a corpus of text. The difference lies in the richness of the evidence and the confidence we can have in the generalizability of that evidence, and in its validity and reliability.

2.4 Applications of corpus linguistics

Corpus linguistics can be widely exploited in a variety of domains—most centrally in the design of syllabi and materials for language teaching, but also in dictionary work, the study of ideology and culture, translation, stylistics, forensic linguistics, and the provision of on-line assistance for writers in well-defined technical domains.

3. Types of corpus researchers

Work in corpus linguistics is currently associated with several quite different activities. Scholars working in the field tend to be identified with one or more of them. The first group of researchers consists of corpus makers or compilers. These scholars are concerned with the design and compilation of corpora, the collection of texts and their preparation and storage for later analysis.

A second group of researchers has been concerned with developing tools for the analysis of corpora. This is the main task of researchers in computational linguistics.

A third group of researchers consists of descriptive linguists whose main concern has been to make use of computerized corpora to describe reliably the lexicon and grammar of languages, both of the linguistic systems we use and our likely use of those systems. It is the probabilistic aspect of corpus-based descriptive linguistic studies which especially distinguishes them from conventional descriptive fieldwork in linguistics or lexicography.

A fourth area of activity, which has been among the most innovative outcomes of the corpus revolution, has been the exploitation of corpus-based linguistic description for use in a variety of applications such as language learning and teaching, and natural language processing by machine, including speech recognition and translation.

4. The objective of offering this course

It is my hope that this course will whet the appetites of the growing body of teachers and students with access to corpora to discover more for themselves about how language works in all its variety.

There is no doubt that corpus linguistics is not an end in itself but is one source of evidence for improving descriptions of the structure and use of languages, and for various applications, including the processing of natural language by machine and understanding how to learn and teach a language.

It should be made clear that corpus linguistics is not a mindless process of automatic language description. Linguists use corpora to answer questions and solve problems. Some of the most revealing insights on language and language use have come from a blend of manual and computer analysis. It is now possible for researchers with access to a personal computer and off-the-shelf software to do linguistic analysis using a corpus, and to discover facts about a language which have never been noticed or written about previously. The most important skill is not to be able to program a computer or even to manipulate available software (which, in any case, is increasingly user-friendly). Rather, it is to be able to ask insightful questions which address real issues and problems in theoretical, descriptive and applied language studies. Many of the key problems and challenges in corpus linguistics are associated with the following questions:

① How can we exploit the opportunities which arise from having texts stored in machine-retrievable form?

② What linguistic theories will best structure corpus-based research?

③ What linguistic phenomena should we look for?

④ What applications can make use of the insights and improved descriptions of languages which come out of this research?

In answering these and other questions corpus linguistics has the potential to provide solutions and new directions to some of the major issues and problems in the study of human communication.

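Corpora are a powerful tool for linguistic research. According to Yang Huizhong (2005), corpus-based research has the following characteristics:

i. Authenticity: all the material in a learner corpus consists of students' actual compositions collected by random sampling, and so represents real language use. Conclusions drawn from the analysis of learners' interlanguage are therefore well grounded and start from the facts. English teaching in China has achieved a great deal, but real problems remain. Only through scientific analysis and in-depth study of the current state of teaching can we propose targeted, practical improvements that produce real results, and avoid aimless effort and pointless debate;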

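ii. Quantitative analysis: a salient feature of corpus-based research is that it is data-driven. The storage and language-processing capacity of computers provides a powerful and previously unimaginable means of observing language, including learner interlanguage. Quantitative analysis makes description objective, and statistical inference helps to avoid subjective judgment. Data-driven quantitative analysis lets us see problems that intuition alone could not previously reveal. We can, for example, examine the characteristics of Chinese students' vocabulary learning from a quantitative perspective. Quantitative analysis must of course be complemented by qualitative analysis before practical solutions for learning English well can be proposed;

iii. Group analysis: data-driven quantitative analysis also shows that some errors are problems in an individual student's language use, that is, individual behaviour, whereas others are widespread among Chinese learners of English. The latter compel us to investigate their causes seriously: are they due to mother-tongue transfer, to overgeneralization, or to gaps in textbooks or teaching? Only when the cause has been found can we find ways to improve teaching and raise its quality;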

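iv. Longitudinal analysis: a learner corpus collects the language output of students at all levels. They are at different stages of learning: some are beginners, while others have reached a relatively advanced stage. This makes it possible to analyse the development of English learning longitudinally, to see which errors beginners are prone to, which errors rarely occur at the advanced stage, and which kinds of error actually become more frequent, and so to discover regularities in the development of learner interlanguage;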

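v. Contrastive analysis: errors are those points in the interlanguage which do not conform to the norms of the target language. With a learner corpus we can use quantitative methods to compare the learner corpus with a native-speaker corpus. This identifies not only what fails to conform to the norms but also which linguistic features are overused and which are underused, things that are hard to detect by experience and intuition alone.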
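The comparison of a learner corpus with a native-speaker corpus for overuse and underuse can be sketched as follows. This is only one possible approach, not the procedure used for CLEC: it assumes the log-likelihood statistic that is widely used for comparing word frequencies across two corpora, with 3.84 as the critical value for p < 0.05 at one degree of freedom. The function names and all the figures in the example are invented for illustration.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood statistic for one item's frequency in two corpora."""
    total = freq_a + freq_b
    # Expected frequencies if the item were equally common in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

def compare_usage(freq_learner, size_learner, freq_native, size_native,
                  critical=3.84):
    """Label an item as overused or underused by learners, or as not significantly different."""
    ll = log_likelihood(freq_learner, size_learner, freq_native, size_native)
    if ll < critical:
        return "no significant difference", ll
    rate_learner = freq_learner / size_learner
    rate_native = freq_native / size_native
    return ("overuse" if rate_learner > rate_native else "underuse"), ll

# Hypothetical counts: an item occurring 120 times in 100,000 learner tokens
# and 60 times in 100,000 native-speaker tokens.
label, score = compare_usage(120, 100_000, 60, 100_000)
print(label, round(score, 2))
```

Relative frequency decides the direction of the difference, while the statistic decides whether the difference is large enough to be worth interpreting; the qualitative question of why learners overuse or underuse an item remains to be answered separately.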

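Our research has identified the following points which deserve attention in teaching:

1. Analysis of learner interlanguage readily shows that beginners make many errors of grammatical form. Grammatical structures, however, can be learned relatively quickly, and such errors decrease rapidly as learners' English proficiency improves. A large number of errors instead involve word usage and collocation, which suggests that vocabulary teaching is the key to mastering idiomatic, natural English;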

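2. Learning vocabulary means mastering not only the sound, form and meaning of a word but also its usage (colligation) and collocation. Vocabulary must therefore be learned in concrete contexts and through use, in combination with training in listening, speaking, reading and writing; memorizing words in isolation cannot be effective. It is the lexical chunk that organically combines grammar and vocabulary. Chunks are stored in the brain as wholes and are an important component of native speakers' linguistic competence, so giving weight to chunks in teaching can effectively improve learners' ability to express themselves;

3. English teaching in China takes place in a non-target-language environment, so beginners' English is inevitably parasitic on the structures of the mother tongue. To learn and master idiomatic, natural English and to reduce the influence of the mother tongue, language input must be increased through multiple channels. Because words have their own conventional collocations in the mother tongue and in the target language, only extensive reading and extensive input can develop sensitivity to cultural differences, free learners from their dependence on mother-tongue structures, and build up a semantic network for English;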

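4. Errors in learners' English should be treated properly: they should be regarded as an active strategy of language use on the students' part, and remedial teaching should be adopted to provide students with correct native-speaker usage. Here learner corpora and native-speaker English corpora can play a direct role in teaching;

5. In addition, the CLEC corpus collects the written language of Chinese students and does not include speech. Their writing nevertheless has one notable characteristic: it is "written-down speech", that is, they write down what they would say. In other words, their productive language has no very distinct register features. How to develop all of students' language abilities in English teaching in China, and how to make them aware of the register features of language use, is an urgent question.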
