Corpora in Translation Studies: An Overview and Some Suggestions
for Future Research
Mona Baker
UMIST & Middlesex University
Abstract: Corpus-based research has become widely accepted as a factor in improving the performance of machine translation systems, and corpus-based terminology compilation is now the norm rather than the exception. Within translation studies proper, Lindquist (1984) has advocated the use of corpora for training translators, and Baker (1993a) has argued that theoretical research into the nature of translation will receive a powerful impetus from corpus-based studies. It is becoming increasingly important to take stock of what is happening on this front and to start working towards the development of an explicit and coherent methodology for corpus-based research in the discipline. This paper discusses the current and potential use of corpora in translation studies, with particular reference to theoretical issues.
Résumé: Corpus-based research is widely seen as a factor likely to improve machine translation systems, and corpus-based terminology is becoming the rule rather than the exception. Within translation research, Lindquist (1984) has advocated the use of corpora in translator training; according to Baker (1993a), the theoretical study of translation will benefit from corpus-based research. It is now important to take stock of what has been achieved in this area in order to develop an explicit and coherent methodology. The article that follows analyses the present and possible use of corpora in translation research, paying particular attention to theoretical questions.
Target 7:2 (1995), 223-243. DOI 10.1075/target.7.2.03bak
ISSN 0924-1884 / E-ISSN 1569-9986 © John Benjamins Publishing Company
224 MONA BAKER
1. Introduction
The potential for using corpora is beginning to take shape in translation studies. Computerised corpora are becoming increasingly popular in those areas of the discipline which have close links with the hard sciences. This is particularly true of terminology and machine translation, where the emphasis is primarily, if not exclusively, on scientific and technical texts.
Terminology compilation is now firmly corpus-based. The desire to construct abstract and neat conceptual systems has given way here to the practical need of addressing what happens in real life. Terms are therefore no longer extracted from previous lists but are rather drawn from a representative corpus of authentic texts held in electronic form (Sager 1990: 130).1 A similar development has taken place in machine translation, where it is now widely accepted that access to computerised corpora may well hold the key to future success in the field. Again, this reflects a move away from conceptual and formal representations of language, which have not proved very helpful in the past, to addressing natural language in all its variety and infiniteness. The repeated failure of pre-formulated rules and neat semantic analyses to improve the performance of machine translation systems has led to the gradual realisation that the knowledge required to improve the systems must come from natural language in use (Schubert 1992: 87; Laffling 1991, 1992). Corpora are not only used by linguists to write better rules for the machines to operate on but also as a direct knowledge source for the machines themselves (ibid). Modern machine translation systems use the principle of analogy to extrapolate from the typical examples held in the corpus to texts that have not been encountered before.
The development of corpus-based techniques in terminology and machine translation is encouraging. It goes some way towards fulfilling the growing need for a rigorous descriptive methodology in an attempt to increase the inter-subjectivity of the applied areas of translation studies, such as translator training and translation criticism, and of course in the pursuit of a more satisfying theoretical account of the phenomenon of translation itself. It is the potential use of corpora in the theoretical and pedagogical areas that I would now like to address. But before I do so, it is perhaps important to look in some detail at the way in which the words corpus and corpora have been used in the literature in order to avoid possible misunderstanding in the discussion which follows. It might also be useful to give a brief overview of the kind of information that can be obtained from corpora.
CORPORA IN TRANSLATION STUDIES 225
2. Corpora: Definition, Types and Overview of Basic Operations
2.1. What Is a Corpus?
The word corpus originally meant any collection of writings, in a processed or unprocessed form, usually by a specific author.2 In recent years, and with the growth of corpus linguistics,3 this definition has changed in three important ways: (i) corpus now means primarily a collection of texts held in machine-readable form and capable of being analysed automatically or semi-automatically in a variety of ways; (ii) a corpus is no longer restricted to 'writings' but includes spoken as well as written text, and (iii) a corpus may include a large number of texts from a variety of sources, by many writers and speakers and on a multitude of topics. What is important is that it is put together for a particular purpose and according to explicit design criteria in order to ensure that it is representative of the given area or sample of language it aims to account for. Some of the criteria are discussed under 2.3 below.4
One important feature that remains variable in modern corpora is the nature and extent of the texts held. In linguistics, corpora usually consist of running texts, but the texts are not always held in full. For example, the Brown and LOB Corpora consist of fragments of texts, each fragment being approximately 2000 words in length (Hofland and Johansson 1982), selected on a more or less random basis within specified genres (Sinclair 1991a: 23). The British National Corpus consists of text samples, generally no longer than 40,000 words each. The samples are taken randomly from the beginning, middle or end of longer texts, but care is taken to choose a convenient breakpoint, such as the end of a section or chapter, to begin and end the sample in order not to fragment high-level discourse units (British National Corpus 1991). Other corpora, for example the Cobuild/Bank of English corpus, consist of whole texts, irrespective of the size of any individual text held in the collection.
In machine translation, by contrast, a corpus does not necessarily consist of running texts; it may be no more than a set of examples (Schubert 1992: 87). One of the definitions of corpus in this field is therefore "the finite collection of grammatical sentences that is used as a basis for the descriptive analysis of a language" (definition given in the 'Glossary of Terms' in Newton 1992: 223). It is also important to bear in mind that the word corpus has often been used in translation studies proper to refer to fairly small collections of text which are not held in electronic form and which are therefore searched manually (Baker 1993a: 241).
In what follows, I intend to use corpus to mean any collection of running texts (as opposed to examples/sentences), held in electronic form and analysable automatically or semi-automatically (rather than manually).5
2.2. Basic Text Processing Operations
A great deal of experience in corpus work has been acquired in the past few decades and a stock of very powerful routines for processing text held in machine-readable form has now been developed. Some of the routines have not only become standard operations which any corpus holder will have access to, but they are also now included in software packages which are readily available to the public at very modest prices. The most popular and versatile of the packages is Microconcord, marketed by Oxford University Press. OUP have so far also released two corpus collections of one million words each and are planning to release more corpora as part of the British National Corpus initiative. Working with corpora is therefore becoming a perfectly viable proposition even at the level of individual researchers.
The corpus analyst's stock-in-trade is the KWIC concordance, KWIC being an acronym for Key Word In Context. This is a list of all the occurrences of a specified keyword or expression in the corpus, set in the middle of one line of context each. The following is a KWIC concordance of Greek from the OUP Corpus Collection A (British newspaper texts from The Independent and The Independent on Sunday):
 THOMPSON in Athens </bl> <st> <p> A GREEK air force warrant officer, Michalis P  MCA_IND4.FOR
ere four sendings-off, three of them Greek, as Bulgaria beat the visitors 4-0 in  MCA_IND3.SPO
 There are two more sales today. <p> Greek bidders descended on the sale yesterd  MCA_IND4.HOM
ed over oil spill </hl> <st> <p> The Greek captain of a ship responsible for Sun  MCA_IND1.HOM
ere added to other Christian groups (Greek Catholics, Greek Orthodox, Armenians  MCA_IND1.FOR
er </hl> <st> <p> ATHENS (UPI) _ The Greek Chief Justice, Yannis Grivas, was swo  MCA_IND3.FOR
s, it fails to persuade us that with Greek drama we are not secular theatre-goer  MCA_IND2.ART
1012 </dt> <hl> Storms threaten mild Greek election climate </hl> <bl> From PETE  MCA_IND2.FOR
1011 </dt> <hl> THEATRE / A spark of Greek fire: The Trojan Women - Liverpool Ev  MCA_IND2.ART
well as agriculture, she has studied Greek, History, Philosophy, Astonomy, Mathe  MCA_IND2.ART
eration by sunshine and retsina on a Greek island holiday. (They don't have wint  MCA_IND2.ART
r _ alleged to be a rare work of the Greek master Skopas. Archaeologists and art  MCA_IND4.FOR
t this place?" asks the disapproving Greek matriarch when her grandson returns f  MCA_IND1.ART
of the Titans (1981), his foray into Greek mythology. <p> There must be easier w  MCA_IND3.ART
ter Lebanon in 1926, they proposed a Greek Orthodox as president, because he wou  MCA_IND1.FOR
r Christian groups (Greek Catholics, Greek Orthodox, Armenians and others), the  MCA_IND1.FOR
sarouchis, the doyen of contemporary Greek painters, who died early this year, a  MCA_IND4.HOM
depicted wearing bull's horns in the Greek playwright Euripides' Bacchae. <p> Th  MCA_IND4.HOM
er this constructive interlude, that Greek politics will avoid reverting to thei  MCA_IND2.FOR
sion _ it is the first time a former Greek prime minister has been indicted by p  MCA_IND2.FOR
pline by backing bouzouki players in Greek restaurants, or a plethora of cabaret  MCA_IND2.ART
ted by a Wren facade, another by the Greek revival. <p> At best the buildings  MCA_IND1.ART
04.) Where can you go to church in a Greek sarcophagus? (St Jude's Church, Blyth  MCA_IND1.ART
three sciences and perhaps Latin and Greek. <ct> Home News Page 3 </ct> </st  MCA_IND4.HOM
residing over the Everyman's current Greek sequence caused the jolt which sent a  MCA_IND4.ART
e sculpture _ the helmeted head of a Greek soldier _ alleged to be a rare work o  MCA_IND4.FOR
election campaign, though low-key by Greek standards, is proving rancorous. The  MCA_IND4.FOR
venged unnaturally. This may be more Greek than Asian, but it's also compelling  MCA_IND2.ART
r of riddles. <p> Why did Alexander 'Greek" Thompson design Eygptian halls in Un  MCA_IND1.ART
 like Templeton's carpet factory or 'Greek" Thompson's Vincent Street church (is  MCA_IND1.ART
eeds to remember is that the 'almost Greek tragedy" line should be restricted to  MCA_IND3.ART
sequel. Aspiring to the condition of Greek tragedy, this version makes the dread  MCA_IND2.ART
The codes at the end of each line indicate the source of the concordance line (whether it is from the arts [ART], sports [SPO], home [HOM] or foreign news [FOR] sections, for instance). The codes in angle brackets indicate typographic conventions (e.g. <b> for bold and <u> for underline) and other information (e.g. <hl> for headline and <ct> for caption).
KWIC concordances can be sorted in a variety of ways (for instance to the left or right of the keyword) and can be expanded online to reveal more of the context. Some programmes also allow the user access to sentence and paragraph length concordances. Others, such as Microconcord, offer a collocational profile of the keyword by listing the most frequent collocates within a given span, for instance three words to the left or right of the keyword.
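A collocational profile of this kind can be sketched as follows. This is a simplified illustration in Python, not a description of any particular package: the function name `collocates`, the crude tokenisation rule and the sample text are assumptions made here.

```python
import re
from collections import Counter

def collocates(text, keyword, span=3):
    """Count the word-forms occurring within `span` words to the left or
    right of each occurrence of `keyword` (hypothetical helper)."""
    # Crude tokenisation: lower-cased runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    target = keyword.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text, not taken from any of the corpora discussed.
sample = ("The Greek island was quiet. A Greek island holiday is cheap. "
          "The Greek drama ended.")

profile = collocates(sample, "Greek", span=3)
for word, freq in profile.most_common(5):
    print(freq, word)
```

A real profile would normally be computed over a large corpus and would often separate left-hand from right-hand collocates; both refinements are omitted here for brevity.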
Apart from KWIC concordances, most software packages offer facilities for listing all the word-forms in a corpus, or in a specific text or group of texts in a corpus, in frequency or alphabetical order. Here is an extract from a frequency list for a short text quoted in Sinclair (1991a: 141-142):
11 is               2 although      2 to
10 of               2 are           2 very
 8 and              2 but             active
 8 the              2 have            an
 6 activity         2 if              animals
 5 a                2 kinds           anything
 5 communication    2 language        armchair
 5 in               2 library         aspects
 4 it               2 like            attempts
 3 his              2 look            authors
 3 human            2 many            become
 3 only             2 there           boast
 3 we               2 through         brain
can