Language Learning & Technology
llt.msu.edu/vol14num2/chenbaker.pdf June 2010, Volume 14, Number 2
pp. 30–49
LEXICAL BUNDLES IN L1 AND L2 ACADEMIC WRITING
Yu-Hua Chen and Paul Baker
Lancaster University
This paper adopts an automated frequency-driven approach to identify frequently-used
word combinations (i.e., lexical bundles) in academic writing. Lexical bundles retrieved
from one corpus of published academic texts and two corpora of student academic writing
(one L1, the other L2) were investigated both quantitatively and qualitatively. Published
academic writing was found to exhibit the widest range of lexical bundles, whereas L2
student writing showed the smallest range. Furthermore, some high-frequency expressions
in published texts, such as in the context of, were underused in both student corpora, while
the L2 student writers overused certain expressions (e.g., all over the world) which native
academics rarely used. The findings drawn from structural and functional analyses of
lexical bundles also have some pedagogical implications.
INTRODUCTION
“Phraseology” (Granger & Meunier, 2008; Meunier & Granger, 2007) and “formulaic
sequences/language” (Schmitt, 2004; Wray, 2002, 2008) are two umbrella terms often used to refer to various types of multi-word units. In recent years, an increasing number of studies have made use of corpus data to add weight to the importance of multi-word units in language. For instance, Altenberg (1998), in his exploration of the London-Lund Corpus, estimated that 80% of the words in the corpus formed part of recurrent word combinations. As Wray (2002, p. 9) observes, however, there is a “problem of terminology” when describing word co-occurrence. On the one hand, the same
term might be used in different ways by different scholars; on the other hand, various terms are used to refer to similar or even the same notion of word co-occurrence. Some examples of such terms include clusters (Hyland, 2008a; Schmitt, Grandage, & Adolphs, 2004; also used in the corpus tool WordSmith), recurrent word combinations (Altenberg, 1998; De Cock, 1998), phrasicon (De Cock, Granger, Leech, & McEnery, 1998), n-grams (Stubbs, 2007a, 2007b) and lexical bundles (e.g., Biber & Barbieri, 2007; Cortes, 2002). These terms—clusters, phrasicon, n-grams, recurrent word combinations, lexical bundles—actually refer to continuous word sequences retrieved by taking a corpus-driven approach with specified frequency and distribution criteria. The retrieved recurrent sequences are fixed multi-word units that have customary pragmatic and/or discourse functions, used and recognized by the speakers of a language within certain contexts. This methodology is considered to be a frequency-based approach for determining phraseology (see Granger & Paquot, 2008).
From a psycholinguistic viewpoint, formulaic language has been found to have “a processing advantage over creatively generated language” for non-native as well as native speakers (Conklin & Schmitt, 2008, p. 72), although different psycholinguistic studies have used various types of formulaic language, such as idioms (e.g., take the bull by the horns) or non-idiomatic phrases (e.g., as soon as), as the target forms. A particularly inspirational study was conducted by Jiang and Nekrasova (2007), in which they utilized corpus-derived recurrent word combinations as materials in two online grammaticality-judgment experiments. Their findings provide “prevailing evidence in support of the holistic nature of formula representation and processing in second language speakers” (Jiang & Nekrasova, 2007, p. 433). Schmitt et al. (2004) also investigated the psycholinguistic validity of corpus-derived recurrent clusters, and their findings share some similarities with those of Jiang and Nekrasova (2007).
In a series of lexical bundle studies conducted by Biber and colleagues (Biber & Barbieri, 2007; Biber & Conrad, 1999; Biber, Conrad, & Cortes, 2003, 2004; Biber, Johansson, Leech, Conrad, & Finegan, 1999), it was found that conversation and academic prose present distinctive distribution patterns of lexical bundles. For example, most bundles in conversation are clausal, whereas most bundles in academic prose are phrasal. Other studies of bundles have focused primarily on comparisons between expert and non-expert writing. Cortes (2002) investigated bundles in native freshman compositions and found that the bundles used by the novice writers were functionally different from those in published academic prose. In another study, Cortes (2004) compared native student writing with that in academic journals, concluding that students rarely used the lexical bundles identified in the corpus of published writing. Even when they did, the students used the bundles in a different manner. Working with academic writing only, Hyland (2008b) indicated that there was disciplinary variation in the use of lexical bundles. He also investigated the role of lexical bundles in published academic prose and in postgraduate writing and found that postgraduate students tended to employ more formulaic expressions than native academics in order to display their competence (Hyland, 2008a).
To date, only a few studies of L2 written data have performed structural and functional categorization of lexical bundles. Although Hyland, in his two studies (2008a, 2008b), included master’s theses and doctoral dissertations produced by L2 English students in Hong Kong, he did not begin from a perspective of second-language learning. Instead, he treated L2 postgraduate writing as “highly proficient,” on the grounds that all the texts in his corpus had been awarded high passes. Drawing on this previous research, the present study aims to compare the use of recurrent word combinations in native-speaker and non-native-speaker academic writing in order to reveal potential problems in second language learning. Quantitative and qualitative analyses were carried out on three corpora in order to identify similarities and differences in recurrent word combinations at different levels of writing proficiency. One corpus (the L2 or learner corpus) contained writing from L1
Chinese learners of L2 English, while the two others comprised L1 writing: one from academics (whom we term “expert” writers) and the other from university students (who are similar in background to the L1 Chinese learners, aside from their first language). Lexical bundles is adopted as the primary term throughout this study, as it is used by Biber in the series of studies upon which the theoretical and analytical framework of the current study is based. Another term, recurrent word combination, is also used interchangeably, given its transparent literal meaning.
DATA AND METHODOLOGY
Data
Two existing corpora are used in the present study: the Freiburg-Lancaster-Oslo/Bergen (FLOB) corpus and the British Academic Written English (BAWE) corpus. To ensure comparability, only part of each corpus was selected for investigation. The FLOB corpus is a one-million-word corpus of written British English from the early 1990s, comprising fifteen genre categories. For the current study, only the category of academic prose, FLOB-J, was used to represent native expert writing. FLOB-J contains eighty 2,000-word excerpts from published academic texts, retrieved from journals or book sections. With regard to L1 and L2 student academic writing, parts of the BAWE corpus were utilized. The BAWE corpus, released in 2008, contains approximately 3,000 pieces (approx. 6.5 million words) of proficient assessed student writing from British universities. Two subcorpora were selected from the BAWE corpus: BAWE-CH contains essays produced by L1 Chinese students of L2 English, and BAWE-EN is a comparable dataset contributed by peer L1 English students. FLOB-J, BAWE-EN and BAWE-CH cover a wide range of disciplines, including arts and humanities, life sciences, physical sciences and social sciences (for BAWE, see Alsop & Nesi, 2009; for FLOB, see Hundt, Sand, & Siemund, 1998). The size of each finalized corpus for investigation is around 150,000 words (see Table 1).
Table 1. Constituents of the Three Academic Corpora

Representation          Corpus    Word count  Average length of text  No. of texts
Native expert writing   FLOB-J    164,742     2,059                   80
Native peer writing     BAWE-EN   155,781     2,596                   60
Learner writing         BAWE-CH   146,872     2,771                   53

Operationalization
Several key criteria have been pinpointed in the literature regarding how to generate a list of lexical bundles using automated corpus tools. The first criterion is the cut-off frequency, which determines the number of lexical bundles to be included in the analysis. The normalized frequency threshold for
large written corpora generally ranges between 20 and 40 occurrences per million words (e.g., Biber et al., 2004; Hyland, 2008b), while for relatively small spoken corpora, a raw cut-off frequency is often used, ranging from 2 to 10 (e.g., Altenberg, 1998; De Cock, 1998). The second criterion is the requirement that combinations occur in different texts, usually in at least 3-5 texts (e.g., Biber & Barbieri, 2007; Cortes, 2004) or in 10% of texts (e.g., Hyland, 2008a), which helps to avoid idiosyncrasies of individual writers/speakers. The last issue concerns the length of word combinations, usually 2-, 3-, 4-, 5-, or 6-word units. Four-word sequences are the most frequently researched length in writing studies, probably because the number of 4-word bundles is often of a manageable size (around 100) for manual categorization and concordance checks. The frequency and dispersion thresholds adopted vary from study to study, and even the sizes of corpora and subcorpora differ drastically, ranging from around 40,000 to over 5 million words. After repeated experiments with the corpus data under investigation, the thresholds for determining 4-word lexical bundles were set to a frequency of 4 or more (approximately 25 times per million words on average), occurring in at least three texts. This resulted in an “optimum” number of bundles, which was considered sufficiently representative of the corpora being examined. One might argue that an identical standardized threshold, such as 20 or 40 times per million words, should be applied to each of the corpora investigated, as generally reported in the literature. However, when a normalized rate is converted to
raw frequencies, it substantially affects the number of generated word combinations when comparing corpora of various sizes. For instance, if we compare an 80,000-word corpus with a 40,000-word corpus, with a standardized cut-off frequency set at 40 times per million words, the converted raw-frequency threshold for the larger corpus is 3.2, whereas the converted raw-frequency threshold for the smaller corpus is much lower, at 1.6. Any decimals have to be rounded up or down in order to function as an operational cut-off frequency. Yet rounding down 3.2 to 3 results in a normalized rate of 37.5, whereas rounding up 1.6 to 2 generates a normalized rate of 50, both of which differ from the originally reported frequency threshold of 40 times per million words. Reporting only the standardized frequency criterion could therefore be misleading, because a standardized cut-off frequency inevitably loses its expected impartiality after being converted into raw frequencies corresponding to different corpus sizes. We would argue that both the raw cut-off frequency and the corresponding normalized frequency should be reported in order to reflect transparently the threshold adopted. For the sake of comparison, if the frequency threshold is set at 25 times per million words for the present study, the converted raw frequencies for the three corpora are 4.1, 3.9 and 3.7 respectively, which are all rounded to 4 (cf. Table 2 and Table 3).
Table 2. Raw and Corresponding Normalized Frequency Thresholds Adopted

Corpus    Set raw frequency threshold  Corresponding normalized frequency (per million words)
FLOB-J    4                            24.3
BAWE-EN   4                            25.7
BAWE-CH   4                            27.2
Table 3. Normalized and Corresponding Raw Frequency Thresholds for Comparison

Corpus    Set normalized frequency threshold (per million words)  Corresponding raw frequency
FLOB-J    25                                                      4.1
BAWE-EN   25                                                      3.9
BAWE-CH   25                                                      3.7
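The conversion between raw and normalized thresholds described above is straightforward arithmetic; a minimal sketch, using the corpus sizes from Table 1, might look like this (the function names are ours, for illustration only):

```python
def raw_to_normalized(raw, corpus_size):
    """Convert a raw cut-off frequency to occurrences per million words."""
    return raw / corpus_size * 1_000_000

def normalized_to_raw(per_million, corpus_size):
    """Convert a per-million-words threshold to a raw frequency."""
    return per_million * corpus_size / 1_000_000

# Corpus sizes as reported in Table 1
corpora = {"FLOB-J": 164_742, "BAWE-EN": 155_781, "BAWE-CH": 146_872}

for name, size in corpora.items():
    # A raw threshold of 4 expressed per million words (cf. Table 2),
    # and a 25-per-million threshold expressed as a raw frequency
    print(name,
          round(raw_to_normalized(4, size), 1),
          round(normalized_to_raw(25, size), 1))
```

This makes the rounding problem discussed above concrete: the three raw values all round to the same integer, 4, even though the exact figures differ across corpora.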
After automatic retrieval of 4-word clusters using the corpus tool WordSmith 4.0 (Scott, 2007), word sequences containing content words that were present in the essay questions (e.g., financial and non-financial), or any other context-dependent bundles, usually incorporating proper nouns (e.g., in the UK and, the Second World War), were manually excluded from the extracted bundle lists. It was
also found that overlapping word sequences could inflate the results of the quantitative analysis. Overlaps were thus checked manually via concordance analyses. Two major types of overlap are discussed here. One is “complete overlap,” referring to two 4-word bundles which are actually derived from a single 5-word combination. For example, it has been suggested and has been suggested that both occur six times, coming from the longer expression it has been suggested that. The other type of overlap is “complete subsumption,” referring to a situation where two or more 4-word bundles overlap and the occurrences of one of the bundles subsume those of the other overlapping bundle(s). For example, as a result of occurs
17 times, while a result of the occurs five times, both of which occur as a subset of the 5-word bundle as
a result of the. Each case of the above overlapping word sequences (12 cases in total) was combined into one longer unit so as to guard against inflated results.
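The “complete overlap” case can also be detected automatically. The study itself did this manually via concordance analyses; the following is only an illustrative sketch, merging two 4-grams whose occurrences all derive from a single 5-gram:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bundles_with_overlap_merge(tokens, n=4, min_freq=2):
    """Extract n-word bundles above a frequency threshold, then merge
    'complete overlaps': two n-grams whose occurrences all come from
    one and the same (n+1)-gram."""
    counts = Counter(ngrams(tokens, n))
    counts = Counter({g: f for g, f in counts.items() if f >= min_freq})
    for longer, f_long in Counter(ngrams(tokens, n + 1)).items():
        a, b = longer[:n], longer[1:]
        # Complete overlap: both 4-grams occur exactly as often as the 5-gram
        if counts.get(a) == f_long and counts.get(b) == f_long:
            counts.pop(a)
            counts.pop(b)
            counts[longer] = f_long
    return counts

tokens = "it has been suggested that x y it has been suggested that".split()
print(bundles_with_overlap_merge(tokens))
```

Here the two 4-word bundles it has been suggested and has been suggested that are collapsed into the single 5-word unit it has been suggested that, mirroring the merging step described above. “Complete subsumption,” where frequencies differ, would require a further rule.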
A further potential problem when comparing bundles across corpora involves what is actually counted (i.e., the type/token distinction). Should we count the number of types of bundles (e.g., counting as a result of and it is possible to each as one type of bundle), or should we count the total occurrence of bundles (e.g., as a result of might occur 20 times in one corpus and 50 times in another)? One corpus could exhibit a very narrow range of bundles but have very high frequencies of them, while another might show the opposite pattern. We therefore distinguished between different types of bundles (types) and frequencies of bundles (tokens).1 The numbers of bundle types and tokens, before and after data refinement (i.e., removing context-dependent bundles and overlapping ones), are shown in Table 4 below.
Table 4. Number of Bundles Before and After the Removal of Context-Dependent Bundles and Overlaps

           Before refinement           After refinement
Corpus     No. of      No. of          No. of      No. of
           bundles     bundles         bundles     bundles
           (types)     (tokens)        (types)     (tokens)
FLOB-J     118         749             108         704
BAWE-EN    120         757             104         667
BAWE-CH    90          554             80          507
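The type/token distinction, together with the frequency and dispersion thresholds described earlier, can be sketched as follows. This is an assumption-laden toy (whitespace tokenization, tiny invented texts); the study itself used WordSmith 4.0:

```python
from collections import Counter

def ngrams(tokens, n=4):
    """All contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bundle_stats(texts, n=4, min_freq=4, min_texts=3):
    """Count bundle types and tokens across a corpus of tokenized texts,
    applying a total-frequency threshold and a dispersion threshold
    (minimum number of different texts) as in the study's criteria."""
    freq = Counter()
    dispersion = Counter()  # number of texts each n-gram appears in
    for tokens in texts:
        grams = Counter(ngrams(tokens, n))
        freq.update(grams)
        dispersion.update(grams.keys())
    bundles = {g: f for g, f in freq.items()
               if f >= min_freq and dispersion[g] >= min_texts}
    return len(bundles), sum(bundles.values())  # (types, tokens)

# Toy corpus: three texts, each containing "as a result of" twice
texts = ["as a result of x as a result of".split()] * 3
print(bundle_stats(texts))
```

Only as a result of clears both thresholds (6 occurrences across 3 texts), giving one type and six tokens; all other 4-grams fall below the frequency cut-off.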
ANALYSIS AND RESULTS
Our analyses in the following sections are based on the recurrent word combinations retrieved and refined as described above (for the full list, see the Appendix). In this section, structural and functional comparisons are made between the three groups of different writing proficiency levels. At the beginning of each sub-section, Structures or Discourse Functions, we illustrate how the lexical bundles are categorized, structurally or functionally. We then turn to examples and discuss how usage of the word combinations differs and/or converges across the three groups of writers, in terms of both structure and discourse function. For the functional analysis, we look further at quantitative comparisons supported by some statistical analysis.

Structures
The structural classification of lexical bundles in the Longman Grammar of Spoken and Written English (Biber et al., 1999) has been widely used in other studies on recurrent word combinations (Cortes, 2002, 2004; Hyland, 2008a, 2008b). In the Longman Spoken and Written English (LSWE) corpus, fourteen categories of lexical bundles are identified in conversation and twelve in academic prose, with some overlap between them. Here, a structural classification following the LSWE taxonomy was carried out on the lexical bundles retrieved from FLOB-J, BAWE-EN and BAWE-CH. The results were then compared with the proportions of structural categories in the LSWE corpus. As shown in Table 5, despite the drastic difference in corpus size2 and the different frequency thresholds (ten times per million words for LSWE, and a raw cut-off frequency of four for the current study), there is a surprisingly close match between the academic prose component of LSWE and FLOB-J, while the proportions for the two groups of student writing fluctuate to some extent when compared with the academic prose in LSWE. Not only does this comparison lend a good deal of credence to the use of smaller corpora with different frequency cut-offs in the current project, but it also indicates a gap between native expert academic prose and immature student academic writing. This gap might be a result of genre differences between published academic essays and university assignments, but it more likely hinges on writing proficiency. Three broad structural categories were distinguished: “NP-based,” “PP-based,” and “VP-based.” NP-based bundles include any noun phrases with post-modifier fragments, such as the role of the or the way in which (i.e., Category (1) in Table 5). PP-based bundles are those starting with a preposition plus a noun-phrase fragment, such as at the end of or in relation to the (i.e., Category (2) in Table 5). With regard to VP-based bundles, any word combination with a verb component, such as in order to make or was one of the, is assigned to this category (i.e., Categories (3) to (8) in Table 5).
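As a rough illustration of these three broad categories, the assignment logic can be sketched as a toy heuristic. The word lists below are our own illustrative assumptions, not the study's coding scheme (which followed Biber et al., 1999, with manual checks):

```python
# Illustrative, deliberately incomplete word lists -- assumptions of this sketch.
PREPOSITIONS = {"in", "at", "on", "of", "for", "with", "as", "to", "by", "from"}
VERB_FORMS = {"is", "are", "was", "were", "be", "been", "being",
              "has", "have", "had", "can", "may", "should", "make"}

def classify_bundle(bundle):
    """Assign a 4-word bundle to one of the three broad structural
    categories. VP-based is checked first because bundles such as
    'in order to make' contain both a preposition and a verb."""
    words = bundle.lower().split()
    if any(w in VERB_FORMS for w in words):
        return "VP-based"
    if words[0] in PREPOSITIONS:
        return "PP-based"
    return "NP-based"

for b in ["the role of the", "at the end of", "in order to make", "was one of the"]:
    print(b, "->", classify_bundle(b))
```

A real classifier would need POS tagging rather than word lists, but the sketch captures the category boundaries the taxonomy draws.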
In Table 5, it can be seen that the use of NP-based bundles differs the most amongst the three groups of writing. We thus grouped the NP-based combinations further into two structural subcategories to see more precisely how the three corpora are distinguished from each other. The two subcategories are noun phrase fragments with of (NP + of) (e.g., in the context of) and any other noun phrase fragments without of (NPf) (e.g., the way in which). In addition to the relatively low proportion of NP-based bundles when compared with FLOB-J, the Chinese student writing represented in BAWE-CH is notably different from the two groups of native writing in the subcategory NPf, because there are no NPf bundles in BAWE-CH. In contrast, the NPf bundles present in FLOB-J are mostly also used by the British students in BAWE-EN, although with some slight variations (see Table 6). The NPf combinations found in this investigation are all part of relative clauses, such as the extent to which, the fact that this, or the way(s) in which. It is evident that the L2 students did not use these types of relative clauses as frequently as the native speakers did.