
NLP Chinese Data Preprocessing

Updated: 2023-05-14 01:44:00


This post walks through the steps of preprocessing Chinese text data, with code examples for each step.

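Loading the data (CSV format by default)

import pandas as pd

datas = pd.read_csv("./test.csv", header=0, index_col=0)  # DataFrame
n_datas = datas.to_numpy()  # ndarray of shape (rows, columns); NumPy arrays are easier to work with (personal preference)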
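The later steps iterate over individual strings, while converting a whole DataFrame to NumPy yields a 2-D array even when there is only one text column. The sketch below shows one way to get a flat list of sentences instead. The in-memory CSV and the column name `text` are assumptions made for illustration; the original post does not say how `test.csv` is laid out.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for ./test.csv (the layout is an assumption)
csv_text = "id,text\n0,今天天气很好。\n1,\n2,我们去公园吧！\n"
datas = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

print(datas.to_numpy().shape)  # (3, 1): one row per record, one column

# Select the text column, replace missing values with empty strings, flatten to a list
sentences = datas["text"].fillna("").tolist()
print(sentences)  # ['今天天气很好。', '', '我们去公园吧！']
```

Replacing missing values with empty strings keeps every entry a `str`, so string methods such as `split()` can be called on each one.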

Removing blank lines

def delete_blank_lines(sentences):
    # s.split() returns an empty list for empty or whitespace-only strings
    return [s for s in sentences if s.split()]

no_line_datas = delete_blank_lines(n_datas)
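The filter above assumes every entry is a string. Rows read from a CSV may also hold missing values (`None` or a float NaN), and calling `split()` on those raises an error. A minimal sketch that skips such entries as well; the name `delete_blank_entries` is hypothetical and not from the original post.

```python
def delete_blank_entries(sentences):
    """Keep only entries that are strings with at least one non-whitespace character."""
    return [s for s in sentences if isinstance(s, str) and s.strip()]


sample = ["今天天气很好。", "", "   ", None, float("nan"), "\n", "我们去公园吧！"]
kept = delete_blank_entries(sample)
print(kept)  # ['今天天气很好。', '我们去公园吧！']
```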


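Removing digits

import re

DIGIT_RE = re.compile(r'\d+')

def delete_digit(sentences):
    # Substitute per sentence; calling DIGIT_RE.sub on the list itself raises a TypeError
    return [DIGIT_RE.sub('', s) for s in sentences]

no_digit_datas = delete_digit(no_line_datas)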
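It helps to know what `\d` actually covers. On Python 3 `str` patterns it matches any Unicode decimal digit, so full-width digits such as `３` are removed too, while Chinese numerals such as `三` are ordinary characters and stay. A small demonstration; the sample strings are made up for illustration.

```python
import re

DIGIT_RE = re.compile(r'\d+')  # on str patterns, \d matches any Unicode decimal digit

examples = ["2023年5月14日", "第３章 共12页", "一共三个苹果"]
cleaned = [DIGIT_RE.sub('', s) for s in examples]
print(cleaned)  # ['年月日', '第章 共页', '一共三个苹果']
```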

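Determining the sentence type (simple or complex)

STOPS = ['。', '.', '?', '？', '!', '！']  # Chinese and English sentence-ending characters

def is_simple_sentence(sentence):
    # A sentence counts as simple if it contains at most one sentence-ending character
    count = 0
    for word in sentence:
        if word in STOPS:
            count += 1
        if count > 1:
            return False
    return True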
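The check above boils down to counting sentence-ending characters. The sketch below is a compact equivalent, named `is_simple_sentence` here, together with a few sample inputs. The last one shows a limitation worth knowing: the ASCII `.` in a decimal number also counts as a sentence end.

```python
STOPS = ['。', '.', '?', '？', '!', '！']  # sentence-ending characters


def is_simple_sentence(sentence):
    """Return True if the text contains at most one sentence-ending character."""
    return sum(ch in STOPS for ch in sentence) <= 1


print(is_simple_sentence("今天天气很好。"))              # True
print(is_simple_sentence("今天天气很好。我们去公园吧！"))  # False
print(is_simple_sentence("圆周率约为3.14。"))            # False: the decimal point is counted too
```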

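Removing Chinese and English punctuation

import re
from string import punctuation

# ASCII punctuation plus common Chinese (full-width) punctuation
punc = punctuation + '.,;《》？！“”‘’@#￥%…&×（）——+【】{};；●，。&～、|:：'
# re.escape keeps characters such as ']' and '\' literal inside the character class; \s also strips whitespace
PUNC_RE = re.compile(r"[{}\s]+".format(re.escape(punc)))

def delete_punc(sentences):
    return [PUNC_RE.sub('', s) for s in sentences]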
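A hand-written punctuation list will always miss some characters. As an alternative (not the original author's method), punctuation and symbols can be identified by Unicode category using the standard `unicodedata` module. The function name `delete_punc_by_category` is hypothetical. Unlike the regex above, this sketch leaves whitespace in place.

```python
import unicodedata


def delete_punc_by_category(sentences):
    """Drop every character whose Unicode category is punctuation (P*) or symbol (S*)."""
    return [
        ''.join(ch for ch in s if unicodedata.category(ch)[0] not in ('P', 'S'))
        for s in sentences
    ]


samples = ["你好，世界！", "《红楼梦》——价格：￥30+税", "Hello, world!"]
stripped = delete_punc_by_category(samples)
print(stripped)  # ['你好世界', '红楼梦价格30税', 'Hello world']
```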

Removing English (keeping only Chinese characters)

ENGLISH_RE = re.compile(r'[a-zA-Z]+')

def delete_e_word(sentences):
    return [ENGLISH_RE.sub('', s) for s in sentences]
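The range `[a-zA-Z]` covers ASCII letters only. Chinese text often contains full-width Latin letters such as `ＡＢＣ`, which that pattern leaves untouched. The sketch below contrasts the two cases; `LATIN_RE` is a hypothetical extension that adds the full-width ranges U+FF21 to U+FF3A and U+FF41 to U+FF5A.

```python
import re

ENGLISH_RE = re.compile(r'[a-zA-Z]+')
# Hypothetical variant that also covers full-width Latin letters
LATIN_RE = re.compile(r'[a-zA-Z\uff21-\uff3a\uff41-\uff5a]+')

samples = ["我喜欢Python和NLP", "iPhone 14发布了", "ＡＢＣ公司"]
ascii_only = [ENGLISH_RE.sub('', s) for s in samples]
with_fullwidth = [LATIN_RE.sub('', s) for s in samples]

print(ascii_only)      # ['我喜欢和', ' 14发布了', 'ＡＢＣ公司']
print(with_fullwidth)  # ['我喜欢和', ' 14发布了', '公司']
```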


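Removing garbled text and special symbols

Use a regular expression to strip useless symbols and garbled characters.

# This pattern removes every character that is not a word character, whitespace, or a Chinese character,
# i.e. symbols and punctuation. Because \w also matches Latin letters and digits, those are left in place.
# Punctuation may still be needed earlier to decide whether a sentence is simple, so apply this step last.
SPECIAL_SYMBOL_RE = re.compile(r'[^\w\s\u4e00-\u9fa5]+')

def delete_special_symbol(sentences):
    return [SPECIAL_SYMBOL_RE.sub('', s) for s in sentences]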
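To see what this pattern keeps and removes, the sketch below runs it on one made-up string and compares it with a stricter variant. `NON_CHINESE_RE` is hypothetical, not from the original post; it drops everything outside the U+4E00 to U+9FA5 range, including letters, digits and whitespace.

```python
import re

SPECIAL_SYMBOL_RE = re.compile(r'[^\w\s\u4e00-\u9fa5]+')
# Hypothetical stricter variant: keep Chinese characters only
NON_CHINESE_RE = re.compile(r'[^\u4e00-\u9fa5]+')

text = "NLP很有趣☆★ 2023年，加油!!"
symbols_removed = SPECIAL_SYMBOL_RE.sub('', text)
chinese_only = NON_CHINESE_RE.sub('', text)

print(symbols_removed)  # NLP很有趣 2023年加油
print(chinese_only)     # 很有趣年加油
```

In Python 3, `\w` on `str` patterns already matches Chinese characters, so the first pattern mainly removes symbols and punctuation.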

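Chinese word segmentation

# Using jieba
import jieba

def seg_sentences(sentences):
    # jieba.cut returns a generator of tokens; convert each one to a list
    cut_words = map(lambda s: list(jieba.cut(s)), sentences)
    return list(cut_words)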

# Using pyltp
from pyltp import Segmentor

def seg_sentences(sentences):
    segmentor = Segmentor()
    segmentor.load('./del')  # load the segmentation model parameters
    seg_sents = [list(segmentor.segment(sent)) for sent in sentences]
    return seg_sents

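Removing stop words

# The stop-word list has to be downloaded separately
stopwords = []

def delete_stop_word(sentences):
    # Each sentence is a list of tokens produced by the segmentation step
    return [[word for word in s if word not in stopwords] for s in sentences]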
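The stop-word list above is left empty. One common way to fill it is to read a text file with one word per line, as sketched below. The helper `load_stopwords`, the one-word-per-line layout and the tiny temporary file are assumptions for illustration. A set is used instead of a list so that each membership test takes constant time.

```python
import os
import tempfile


def load_stopwords(path, encoding="utf-8"):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}


def delete_stop_word(sentences, stopwords):
    """Filter stop words out of already segmented sentences (lists of tokens)."""
    return [[word for word in s if word not in stopwords] for s in sentences]


# Demo with a tiny temporary stop-word file; real lists are much longer
with tempfile.NamedTemporaryFile("w", suffix=".txt", encoding="utf-8", delete=False) as tmp:
    tmp.write("的\n了\n是\n")
stopwords = load_stopwords(tmp.name)
os.remove(tmp.name)

segmented = [["今天", "的", "天气", "是", "晴天"], ["我", "吃", "了", "饭"]]
filtered = delete_stop_word(segmented, stopwords)
print(filtered)  # [['今天', '天气', '晴天'], ['我', '吃', '饭']]
```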

Published: 2023-05-14 01:44:00

Article link: https://www.wtabcd.cn/fanwen/fan/82/621067.html
