Information-on-the-BNC_COCA-word-family-lists

The BNC/COCA word family lists
(17 September 2012)
The BNC/COCA word family lists consist of 29 word family lists. Twenty-five of the lists contain word families based on frequency and range data. The four additional lists are (1) an ever-growing list of proper names, (2) a list of marginal words including swear words, exclamations, and letters of the alphabet, (3) a list of transparent compounds, and (4) a list of abbreviations. In the lists for AntWordProfiler, each list has a name which describes its content. In the lists for Range, because of the requirements of the Range program, each list has a fixed name of the form basewrdx, where x is a number. Basewrd26-30 each contain just one nonsense word. They were made to provide space for additional lists and to avoid having to keep changing the names of the lists of proper nouns and so on. Basewrd31 contains proper nouns, basewrd32 marginal words, basewrd33 transparent compounds and basewrd34 abbreviations. More detail on the additional lists can be found in Nation and Webb (2011: Chapter 8).
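The numbering scheme for the Range lists can be summarised in a few lines of Python. This is only an illustrative sketch: the function name `describe_list` and the constant `ADDITIONAL_LISTS` are hypothetical and are not part of Range or AntWordProfiler. The mapping from numbers to content is taken from the paragraph above.

```python
# Hypothetical helper: maps a Range list number to the kind of content it holds.
ADDITIONAL_LISTS = {
    31: "proper nouns",
    32: "marginal words",
    33: "transparent compounds",
    34: "abbreviations",
}


def describe_list(number: int) -> str:
    """Return a short description of the Range list with the given number."""
    if 1 <= number <= 25:
        return "frequency-based word families"
    if 26 <= number <= 30:
        return "placeholder (one nonsense word)"
    if number in ADDITIONAL_LISTS:
        return ADDITIONAL_LISTS[number]
    raise ValueError(f"no list numbered {number}")


for n in (1, 26, 31, 34):
    print(n, describe_list(n))
```

Keeping lists 26 to 30 as placeholders means the four additional lists keep the same numbers even if further frequency-based lists are added later.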
The lists are saved in UTF-8 without a BOM (choose this under Encoding in Notepad++).
The making of the lists
The 1st 1000 and 2nd 1000 word family lists
The first two 1000 word family lists were made using a specially designed 10 million token corpus. Six million tokens of this corpus were spoken English from both British and American English (see Corpus/PN corpus for 2000) as well as movies and TV programs. The written sections included texts for young children and fiction (see Table 1).
Table 1: The corpus used for the first two 1000 word family lists
This unusual step of creating a special corpus for the first 2000 word families was followed because the previous lists made from the British National Corpus were so strongly influenced by the written formal nature of the corpus that they were not suitable lists for creating language courses or graded reader lists (see Nation, 2004). Very common words in spoken English like alright, pardon, hello, dad, bye could then be included in the high frequency words. Other arbitrary adjustments included putting all the word forms of numbers (one, two, hundred) and weekdays in the 1st 1000, and the months of the year in the 2nd 1000, even though their frequency did not always justify this. The goal was to have a set of high frequency word lists that were suitable for teaching and course design.
The 3rd 1000 onwards
The remaining 1000 word family lists were made by using COCA/BNC rankings in data kindly provided by Mark Davies (Davies COCA BNC.xls) after removing my specially created first 2000 word families.
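The procedure described here (take a frequency-ranked sequence of families, drop those already placed in the first 2000, and cut the rest into consecutive blocks of 1000) can be sketched as follows. The function name, its parameters and the toy data are assumptions made for illustration; the layout of the spreadsheet and any manual adjustments made to the real lists are not described in this document.

```python
from typing import Iterable, List, Set


def make_later_lists(ranked_families: Iterable[str],
                     first_2000: Set[str],
                     size: int = 1000) -> List[List[str]]:
    """Split ranked families into consecutive lists, skipping those already placed."""
    # Keep the original ranking order, leaving out families in the first two lists.
    remaining = [f for f in ranked_families if f not in first_2000]
    # Cut the remaining families into blocks of `size`.
    return [remaining[i:i + size] for i in range(0, len(remaining), size)]


# Invented toy data, with a block size of 2 instead of 1000.
ranked = ["the", "nation", "abandon", "smith", "zeal", "yacht"]
already_placed = {"the", "nation"}
lists = make_later_lists(ranked, already_placed, size=2)
print(lists)  # [['abandon', 'smith'], ['zeal', 'yacht']]
```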
Word families
The criteria used to make word families were based on Bauer and Nation's (1993) level 6, which includes all the affixes from levels 2 to 6 (see Table 2).
Table 2: Word family levels
The word families were developed over several years and low frequency family members continue to be added to the existing families.
The nature of the families
The word lists were made to be used with the AntWordProfiler and Range computer programs, and these programs cannot distinguish between homonyms like Smith (the family name) and smith (blacksmith), or March (the month) and march (as soldiers do). Thus when a program runs, the uses are not distinguished and are counted in the same family and as the same type. There was an attempt to deal with this wherever possible. Marched, marching, marches, marcher, marchers etc, for example, were put in one family and March into another. This does not completely distinguish the homonyms, but it is a step towards doing so.
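A minimal sketch of why this happens, assuming that a program looks each token up by its lower-cased form (an assumption made for illustration; the exact matching rules of Range and AntWordProfiler are not given here). The lookup table and the family labels `MARCH` and `MARCHED` are invented for the example.

```python
from typing import Optional

# Toy lookup table: lower-cased word form -> family label (labels are hypothetical).
FORM_TO_FAMILY = {
    "march": "MARCH",        # the month and the bare verb form share this entry
    "marched": "MARCHED",    # inflected and derived forms sit in a separate family
    "marching": "MARCHED",
    "marches": "MARCHED",
    "marcher": "MARCHED",
    "marchers": "MARCHED",
}


def family_of(token: str) -> Optional[str]:
    """Return the family label for a token, ignoring case; None if not listed."""
    return FORM_TO_FAMILY.get(token.lower())


# The month and the verb cannot be told apart once case is ignored.
print(family_of("March") == family_of("march"))     # True
# Unambiguous verb forms, however, are counted separately from "March".
print(family_of("marching") == family_of("March"))  # False
```

The bare form march is still ambiguous, which is why the split only partly separates the homonyms.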
The high frequency word families tend to be quite large as it appears that higher frequency stems generally can take a greater range of affixes than lower frequency words. For example, the high frequency word family nation has the following members: nations, national, nationally, nationwide, nationalism, nationalisms, internationalism, internationalisms, nationalisations, internationalisation, nationalist, nationalists, nationalistic, nationalistically, internationalist, internationalists, nationalise, nationalised, nationalising, nationalisation, nationalize, nationalized, nationalizing, nationalization, nationhood, nationhoods.
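To show how a family groups several word types under one headword, the sketch below builds a reverse index from family members to headword and counts tokens by family. It uses only a handful of the members of nation listed above; the function names and the sample sentence are invented for illustration and do not reproduce how Range or AntWordProfiler are implemented.

```python
from collections import Counter
from typing import Dict, Iterable, List

# A small subset of the nation family, taken from the list above.
FAMILIES: Dict[str, List[str]] = {
    "nation": ["nations", "national", "nationally", "nationwide",
               "nationalism", "nationhood"],
}


def build_index(families: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every word type (headword and members) to its family headword."""
    index: Dict[str, str] = {}
    for headword, members in families.items():
        index[headword] = headword
        for member in members:
            index[member] = headword
    return index


def count_families(tokens: Iterable[str], index: Dict[str, str]) -> Counter:
    """Count tokens per family, ignoring case and skipping unlisted words."""
    lowered = (t.lower() for t in tokens)
    return Counter(index[t] for t in lowered if t in index)


tokens = "The nation and other nations adopted national and nationwide rules".split()
counts = count_families(tokens, build_index(FAMILIES))
print(counts)  # Counter({'nation': 4})
```

Four different word types in the sentence are counted as four tokens of a single family.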
The word family lists group items together that would be perceived as the same words for the receptive skills of listening and reading. If word lists were made for productive purposes, for speaking and writing, the lemma would be the largest sensible unit to use. Some researchers argue for the word type.
The word lists contain compound words but they do not contain phrases. According to or au fait, for example, might be best counted as a unit, but in the lists the unit is the single word.
The validity of the BNC word family lists
There are ways of checking whether the word family lists are properly ordered. From the 1st 1000 to the 25th 1000, the number of tokens, types, and families found in an independent corpus should decrease. That is, when the lists are run over a different corpus from the BNC or COCA, the 1st 1000 word family list should account for more tokens, types and families than the 2nd 1000 family list does. Similarly, the 2nd 1000 word family list should account for more tokens, types and families than the 3rd 1000 family list does, and so on. While this does not show that each word family is in the right list, it does show that the lists are properly ordered. Table 3 presents such data using the Range output from the Wellington Written Corpus.
Table 3: Tokens, types and families in the Wellington Written Corpus
WORD LIST            TOKENS/%         TYPES/%        FAMILIES
one                  772697/75.22     4762/11.74     999
two                  91545/8.91       4299/10.60     999
three                53591/5.22       3903/9.62      999
four                 17967/1.75       2853/7.03      995
five                 10899/1.06       2336/5.76      981
six                  7267/0.71        1986/4.90      950
seven                4513/0.44        1564/3.86      904
eight                4313/0.42        1336/3.29      853
nine                 2592/0.25        1089/2.68      760
ten                  2005/0.20        920/2.27       700
11                   1533/0.15        721/1.78       585
12                   1063/0.10        589/1.45       489
13                   832/0.08         438/1.08       391
14                   737/0.07         346/0.85       304
15                   531/0.05         276/0.68       246
16                   443/0.04         220/0.54       198
17                   628/0.06         194/0.48       173
18                   250/0.02         127/0.31       117
19                   247/0.02         104/0.26       101
20                   269/0.03         104/0.26       89
21                   132/0.01         79/0.19        74
22                   130/0.01         63/0.16        59
23                   80/0.01          43/0.11        40
24                   296/0.03         52/0.13        48
25                   134/0.01         31/0.08        29
26                   0/0.00           0/0.00         0
27                   0/0.00           0/0.00         0
28                   0/0.00           0/0.00         0
29                   0/0.00           0/0.00         0
30                   0/0.00           0/0.00         0
31                   30991/3.02       3844/9.48      3691
32                   3111/0.30        90/0.22        33
33                   4203/0.41        1200/2.96      926
34                   1380/0.13        191/0.47       188
not in the lists     12819/1.25       6803/16.77     ?????
Total                1027198          40563          16921
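The ordering check can be run mechanically on the figures in Table 3. The sketch below copies the token, type and family counts for lists 1 to 25 from the table and reports any list whose count is higher than that of the list before it. The function name `rises` is an assumption made for illustration; the check is a simple neighbour comparison, not a statistical test.

```python
from typing import List

# Counts for lists 1 to 25, copied from Table 3 (Wellington Written Corpus).
TOKENS = [772697, 91545, 53591, 17967, 10899, 7267, 4513, 4313, 2592, 2005,
          1533, 1063, 832, 737, 531, 443, 628, 250, 247, 269,
          132, 130, 80, 296, 134]
TYPES = [4762, 4299, 3903, 2853, 2336, 1986, 1564, 1336, 1089, 920,
         721, 589, 438, 346, 276, 220, 194, 127, 104, 104,
         79, 63, 43, 52, 31]
FAMILIES = [999, 999, 999, 995, 981, 950, 904, 853, 760, 700,
            585, 489, 391, 304, 246, 198, 173, 117, 101, 89,
            74, 59, 40, 48, 29]


def rises(counts: List[int]) -> List[int]:
    """Return the list numbers (1-based) whose count exceeds the previous list's."""
    return [i + 1 for i in range(1, len(counts)) if counts[i] > counts[i - 1]]


print("tokens:  ", rises(TOKENS))    # [17, 20, 24]
print("types:   ", rises(TYPES))     # [24]
print("families:", rises(FAMILIES))  # [24]
```

The counts fall steadily through the higher frequency lists; the few local rises occur only among the lower frequency lists, where the counts are small.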
A second way of checking the validity of the lists is to look at the total number of types in each list. Low frequency words tend to have fewer family members than high frequency words, so even though the number of families in each list is the same, one thousand, the number of types should decrease from list to list. Table 4 contains this data.
Table 4: The number of types (family members) in each of the twenty-five 1000 word family lists
The 1st 1000 word families contain 6,857 word types, an average of 6.857 per family, as each list contains exactly 1000 word families. There is a decrease in word types from one list to the next. The families in the newly created 25th 1000, which was made from a dictionary, may have been more diligently made than the preceding lists.
A third way of checking the validity of the lists is to make sure that no wide range, high or mid-frequency words are missing from the lists. To check this, the lists were run over a wide range of different corpora, existing lists, and texts. No frequent, wide range word families were missing.
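A check of this kind can be sketched as below: count the word forms in a text that are not in any list and report those that recur. The function name, the `min_count` threshold and the toy data are assumptions made for illustration. The sketch looks at frequency within a single text only; checking range would mean repeating it over several corpora and comparing the results.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def frequent_missing(tokens: Iterable[str],
                     known_forms: Set[str],
                     min_count: int = 2) -> List[Tuple[str, int]]:
    """Return unlisted word forms occurring at least min_count times, most frequent first."""
    lowered = (t.lower() for t in tokens)
    counts = Counter(t for t in lowered if t not in known_forms)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Invented toy data: "zorb" stands in for a word missing from the lists.
known = {"the", "cat", "sat", "on", "mat"}
text = "the cat sat on the zorb and the zorb sat on the mat".split()
print(frequent_missing(text, known))  # [('zorb', 2)]
```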
Words not in the lists
Table 5: The percentage amounts of different kinds of word types in the British National Corpus and not in the twenty 1000 word family British National Corpus word lists and additional lists
There are 272,782 word types in the British National Corpus that are not in the first 20 word lists used with the Range program, plus a list of proper nouns, a list of transparent compounds, and a list of exclamations, hesitations and other spoken marginal words. Note in Table 5 that almost half of the different words are proper nouns. Four percent are foreign words, and 6% are low frequency members of word families already in the 20 one thousand word lists. Ideally, the family members should be added to the families in the existing lists.
The main point of the table is to show that the new words (49,101) plus the 20,000 in the word lists total around 70,000 word families, which is a figure not too far from Nagy and Anderson's (1984) estimates, and the number of words in most reasonably sized non-historical dictionaries. The reason for distinguishing recurring words (those occurring 2 times or more in the British National Corpus) from those occurring only once in the
