
Word Segmentation Lab Guide

Updated: 2023-06-30 05:12:02

Experiment 2: Chinese Word Segmentation

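import codecs

MAXLEN = 4  # longest dictionary word, in characters, that the matcher tries

# Corpus to segment (the file name is missing from the path in the source)
corpus = codecs.open('F:\\2018课件\\自然语言处理\\lkx_自然语言处理实验\\分词实验\\', 'r', 'UTF-8')
corpusReader = corpus.read()
corpus.close()

# Dictionary (the file name is truncated in the source; the experiment data
# lists word_freq_list.txt as the segmentation dictionary)
dic = codecs.open('F:\\2018课件\\自然语言处理\\lkx_自然语言处理实验\\分词实验\\word_freq_', 'r', 'UTF-8')
diclines = dic.readlines()
dic.close()

# Store four-, three- and two-character words separately
char_4 = []
char_3 = []
char_2 = []
for i in diclines:
    # codecs.open already returns decoded text, so only the line ending is stripped
    word = i.rstrip('\r\n')
    if len(word) == 4:
        char_4.append(word)
    elif len(word) == 3:
        char_3.append(word)
    else:
        # Entries of any other length also land here; only the two-character
        # ones can ever match a two-character candidate
        char_2.append(word)

# Sets give constant-time membership tests during matching
char_4 = set(char_4)
char_3 = set(char_3)
char_2 = set(char_2)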

# Split the corpus into sentences, one per line
sentences = corpusReader.splitlines()

print('Please wait a moment...')

temp = ''
# Output file for the segmentation result (the path is empty in the source)
gResult = codecs.open('', 'w', 'utf-8')

k = 0
while k != len(sentences):
    i = 0
    while i < len(sentences[k]):
        # Try the longest candidate first; <= lets a word end exactly at the
        # end of the sentence
        if i + MAXLEN <= len(sentences[k]):
            possible_word = sentences[k][i:i+MAXLEN]
            if possible_word in char_4:
                temp += possible_word + ' '
                i += MAXLEN
                continue
        if i + 3 <= len(sentences[k]):
            possible_word = sentences[k][i:i+3]
            if possible_word in char_3:
                temp += possible_word + ' '
                i += 3
                continue
        if i + 2 <= len(sentences[k]):
            possible_word = sentences[k][i:i+2]
            if possible_word in char_2:
                temp += possible_word + ' '
                i += 2
                continue
        # No dictionary word starts here: emit a single character
        possible_word = sentences[k][i]
        temp += possible_word + ' '
        i += 1
    k += 1

temp = temp.strip()
gResult.write(temp)
gResult.close()

print('Segmentation ends, calculating precision rate, recall rate and f-score.')

# Read back the segmentation written above (the path is empty in the source)
gResult = codecs.open('', 'r', 'utf-8')
my = gResult.read()
gResult.close()

gold_corpus = codecs.open('<path to the gold-standard segmentation>', 'r', 'utf-8')
gold = gold_corpus.read()
gold_corpus.close()

# split() with no argument treats line breaks and repeated spaces as separators,
# so words on adjacent lines are not glued together
gold_list = gold.split()
my_list = my.split()

gold_len = len(gold_list)
my_len = len(my_list)

correct = 0
gold_before = ''  # text covered so far by the gold words
my_before = ''    # text covered so far by the program's words
i = 1
j = 1
gold_before += gold_list[0]
my_before += my_list[0]
if gold_before == my_before and gold_list[0] == my_list[0]:
    correct += 1

# Walk both word lists in step; a word is correct when both segmentations
# have reached the same position and produce the same next word
while i < len(gold_list) and j < len(my_list):
    if gold_before == my_before and gold_list[i] == my_list[j]:
        correct += 1
        gold_before += gold_list[i]
        my_before += my_list[j]
        i += 1
        j += 1
    elif len(gold_before) < len(my_before):
        gold_before += gold_list[i]
        i += 1
    elif len(gold_before) > len(my_before):
        my_before += my_list[j]
        j += 1
    elif gold_before == my_before and gold_list[i] != my_list[j]:
        gold_before += gold_list[i]
        my_before += my_list[j]
        i += 1
        j += 1

precision = correct / my_len
recall = correct / gold_len
f_score = 2 * precision * recall / (precision + recall)

print('precision rate:', precision)
print('recall rate:', recall)
print('f-score:', f_score)
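
The scoring step at the end of the script can also be expressed with character offsets, which avoids rebuilding ever-longer prefix strings. The following is only a sketch under the assumption that both word lists cover exactly the same text; the helper names to_spans and evaluate and the sample word lists are illustrative and not part of the original script.

```python
def to_spans(words):
    """Map a word list to the set of (start, end) character offsets it covers."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def evaluate(gold_words, my_words):
    """Return (precision, recall, f_score) for two segmentations of one text."""
    gold_spans = to_spans(gold_words)
    my_spans = to_spans(my_words)
    # A word counts as correct when it has the same boundaries in both lists.
    correct = len(gold_spans & my_spans)
    precision = correct / len(my_spans) if my_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Illustrative data: only "课程" has the same boundaries in both lists.
gold_words = ["计算语言学", "课程", "有", "意思"]
my_words = ["计算", "语言学", "课程", "有意思"]
print(evaluate(gold_words, my_words))
```

Because the spans live in sets, the comparison is a single intersection, and the guards return 0.0 instead of dividing by zero when a list is empty or nothing matches.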

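1. Experiment content

a. Design a word segmentation program based on the maximum matching algorithm, use it to segment a document, and compute the recall of the program's segmentation.

b. Accept an arbitrary input sentence and display its segmentation result.

Experiment data:

(1) word_freq_list.txt: the segmentation dictionary

(2) An unsegmented document file

(3) pku_: a segmented document file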
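
For task b, the matching step can be wrapped in a function that takes one sentence and returns its words. This is a minimal sketch of forward maximum matching, not the handout's reference solution; the function name forward_max_match and the three-word demo dictionary are assumptions made for illustration, and a real run would load the entries from the dictionary file instead.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Segment `sentence` left to right, preferring the longest dictionary word."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest window that fits, then shrink it one character at a time.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical dictionary entries, for demonstration only.
demo_dictionary = {"中文", "分词", "实验"}
print(" ".join(forward_max_match("做中文分词实验", demo_dictionary)))
```

To make the program interactive, the fixed sentence can be replaced by a call to input() and the result printed in the same way.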

2. Development platform and language tools

Development platform: any

Language tools: any

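3. Core idea and algorithm description

Core idea: the maximum matching algorithm

Algorithm description: forward maximum matching scans the text from left to right and, at each position, takes the longest candidate substring found in the dictionary. Backward maximum matching follows the same idea but segments from right to left. Here is an example:

Input sentence: S1="计算语言学课程有意思" ("The computational linguistics course is interesting");

Definitions: maximum word length MaxLen = 5; S2=""; delimiter = "/";

Assume the dictionary contains: ..., 计算语言学, 课程, 意思, ...;

Backward maximum matching then proceeds as follows:

(1) S2=""; S1 is not empty, so take the candidate substring W="课程有意思" from the right end of S1;

(2) Look up the dictionary: W is not in it, so remove the leftmost character of W, giving W="程有意思";

(3) Look up the dictionary: W is not in it, so remove the leftmost character of W, giving W="有意思";

(4) Look up the dictionary: W is not in it, so remove the leftmost character of W, giving W="意思";

(5) Look up the dictionary: "意思" is in it, so add W to S2, giving S2="意思/", and remove W from S1, leaving S1="计算语言学课程有";

(6) S1 is not empty, so take the candidate substring W="言学课程有" from the right end of S1;

(7) Look up the dictionary: W is not in it, so remove the leftmost character of W, giving W="学课程有";

(8) Look up the dictionary: W is not in it, so remove the leftmost character of W, giving W="课程有";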
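
The backward procedure traced in the steps above can be sketched in a few lines of Python. This is an illustrative implementation under the example's own definitions (MaxLen = 5 and the three listed dictionary words); the function name backward_max_match is an assumption, and a candidate that shrinks to a single character is accepted as a word, which is the usual convention.

```python
def backward_max_match(sentence, dictionary, max_len=5):
    """Segment `sentence` right to left, preferring the longest dictionary word."""
    words = []
    end = len(sentence)
    while end > 0:
        # Start with the longest candidate that fits, then drop characters
        # from its left edge until it is a dictionary word or one character.
        start = max(0, end - max_len)
        while start < end - 1 and sentence[start:end] not in dictionary:
            start += 1
        words.append(sentence[start:end])
        end = start
    words.reverse()  # words were collected from right to left
    return words


dictionary = {"计算语言学", "课程", "意思"}
print("/".join(backward_max_match("计算语言学课程有意思", dictionary)))
# prints: 计算语言学/课程/有/意思
```

The first word the function emits is "意思", matching step (5), and the second candidate it tries is "言学课程有", matching step (6).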

Published: 2023-06-30 05:12:02

Article link: https://www.wtabcd.cn/fanwen/fan/78/1069598.html
