Weibo Sentiment Analysis with RoBERTa

Updated: 2023-05-15 17:41:49

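Overview:

Sentiment analysis is a major branch of NLP. This article uses a pretrained model (RoBERTa-wwm-ext) to classify the emotion of general-domain Weibo posts into six categories: positive, angry, sad, fearful, surprised, and neutral (no emotion). Data source:

The evaluation task provides both a general dataset and an epidemic-related dataset; this article uses only the general dataset.

The focus is on demonstrating a sentiment-analysis pipeline built on a pretrained model in PyTorch, rather than on improving the accuracy of the model itself.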

Data:

Training set: 27,768 samples; test set: 5,000 samples

Extraction code: q2f8

Data format: id is the record number, content is the text, and label is the emotion.
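The loading code later in the article reads the file with json.load and accesses each['content'] and each['label'], which suggests a JSON list of objects. As a minimal sketch under that assumption, the snippet below parses a few made-up records and derives a label-to-index mapping; the sample texts and label strings are hypothetical and may not match the real dataset.

```python
import json
from collections import Counter

# Made-up records in the id/content/label layout; the label strings are hypothetical
raw = """[
  {"id": 1, "content": "今天天气真好", "label": "happy"},
  {"id": 2, "content": "又堵车了", "label": "angry"},
  {"id": 3, "content": "明天开会", "label": "neutral"},
  {"id": 4, "content": "终于放假啦", "label": "happy"}
]"""

records = json.loads(raw)
label_counts = Counter(r["label"] for r in records)
# Map each label string to a class index, in order of first appearance
label_to_index = {label: i for i, label in enumerate(label_counts)}

print(label_counts.most_common(1))  # [('happy', 2)]
print(label_to_index)               # {'happy': 0, 'angry': 1, 'neutral': 2}
```

A mapping of this kind plays the role of LABEL_DICT in the training code, which needs integer class indices as targets.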

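Model training:

The implementation is based on HuggingFace's open-source Transformers library (PyTorch version).

Main library versions: Transformers == 2.2.2, torch == 1.5.0

(1) Load the pretrained model (model: RoBERTa-wwm-ext, extraction code: 369y)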

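import torch.nn as nn
from transformers import BertModel

class Model(nn.Module):
    def __init__(self, num_class):
        super(Model, self).__init__()
        # Directory holding the RoBERTa-wwm-ext pretrained weights
        self.bert = BertModel.from_pretrained('chinese_wwm_ext_pytorch')
        for param in self.bert.parameters():
            # Assumed loop body (lost in the original listing): fine-tune all BERT weights
            param.requires_grad = True
        self.fc = nn.Linear(768, num_class)  # 768 -> 6

    def forward(self, x, token_type_ids, attention_mask):
        context = x  # token ids of the input sentence
        types = token_type_ids
        # Same size as the input; padding positions are 0, e.g. [1, 1, 1, 1, 0, 0]
        mask = attention_mask
        _, pooled = self.bert(context, token_type_ids=types, attention_mask=mask)
        out = self.fc(pooled)  # raw scores (logits) for the 6 classes
        return out

# Load the model
MODEL1 = Model(num_class=6)  # number of classes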

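Note: an extra Model class is needed to post-process the output of the BertModel base class. The base class returns the [CLS] hidden vector after one dense layer and an activation (768 dimensions in this article), so a fully connected layer is still required to map it to one output per class (num_class = 6 here).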
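The fully connected layer produces one raw score (logit) per class rather than a probability. As a rough illustration of how such scores become a prediction, the pure-Python sketch below applies softmax and picks the highest-scoring class. The English label names and the logit values are made up for this example.

```python
import math

# Hypothetical English label names; the dataset's own label strings may differ.
LABELS = ["positive", "angry", "sad", "fearful", "surprised", "neutral"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Made-up logits standing in for the output of the linear layer
label, prob = predict([2.0, 0.1, -1.0, 0.3, 0.0, 1.2])
print(label, round(prob, 3))
```

Softmax does not change which class scores highest, so taking the arg-max of the logits directly gives the same label. F.cross_entropy also applies log-softmax internally, which is why the model can return logits during training.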

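(2) Build the training and test data

RoBERTa-wwm-ext takes three input vectors. The first holds the token ids of the text to classify. The second is the token type vector: in BERT it normally marks which segment a token belongs to, and this article uses it to flag whether a position was produced by padding (0 for non-PAD, 1 for PAD). The third is the attention mask, which is 0 at PAD positions and 1 elsewhere. The concepts behind these vectors are not discussed further here; please look them up if needed.

Building the three vectors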

from transformers import BertTokenizer

# Tokenize one sentence; seq_length is the maximum sequence length accepted
def convert_text_to_token(tokenizer, sentence, seq_length):
    tokens = tokenizer.tokenize(sentence)  # split the sentence into tokens
    tokens = ["[CLS]"] + tokens + ["[SEP]"]  # wrap the tokens with [CLS] and [SEP]
    # Build input_id, seg_id, att_mask
    ids1 = tokenizer.convert_tokens_to_ids(tokens)
    types = [0] * len(ids1)
    masks = [1] * len(ids1)
    # Normalize the length: pad or truncate to seq_length
    if len(ids1) < seq_length:  # pad
        ids = ids1 + [0] * (seq_length - len(ids1))  # 0 is the index of [PAD] in the vocabulary
        types = types + [1] * (seq_length - len(ids1))  # 1 marks this part as PAD
        masks = masks + [0] * (seq_length - len(ids1))  # attention mask is 0 on PAD
    else:  # truncate
        ids = ids1[:seq_length]
        types = types[:seq_length]
        masks = masks[:seq_length]
    assert len(ids) == len(types) == len(masks)
    return ids, types, masks

# Name of the directory that holds the roberta-wwm-ext model
TOKENIZER = BertTokenizer.from_pretrained("chinese_wwm_ext_pytorch")

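Note: (1) use BertTokenizer here, not RobertaTokenizer, because the RoBERTa-wwm-ext checkpoint uses a BERT-style vocabulary; (2) the sequence-length argument sets the length of every constructed sample: longer inputs are truncated and shorter ones are padded. This article uses a length of 128.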
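To make the padding and truncation rule concrete, here is a small worked example. It is only a sketch: the toy vocabulary and the character-level split are assumptions that stand in for the real BertTokenizer, and the function name encode is hypothetical.

```python
# Toy vocabulary; index 0 is reserved for [PAD], as in the BERT vocabulary.
VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "今": 3, "天": 4, "很": 5, "开": 6, "心": 7}

def encode(text, seq_length):
    """Return (ids, types, masks), each exactly seq_length long."""
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]  # character-level split
    ids = [VOCAB[t] for t in tokens][:seq_length]  # truncate if too long
    pad = seq_length - len(ids)  # number of padding slots (0 if truncated)
    types = [0] * len(ids) + [1] * pad  # the article's convention: 1 marks PAD
    masks = [1] * len(ids) + [0] * pad  # attention ignores PAD positions
    ids = ids + [VOCAB["[PAD]"]] * pad
    return ids, types, masks

ids, types, masks = encode("今天很开心", seq_length=10)
print(ids)    # [1, 3, 4, 5, 6, 7, 2, 0, 0, 0]
print(types)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(masks)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

short_ids, _, short_masks = encode("今天很开心", seq_length=4)
print(short_ids)  # [1, 3, 4, 5]
```

The second call shows a side effect of plain slicing: a truncated sequence loses its trailing [SEP] token. The article's function behaves the same way.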

Building the DataLoaders for the training and test sets

import json
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

# Build the DataLoader for the training set or the test set
def genDataLoader(is_train):
    if is_train:  # training set
        path = TRAIN_DATA_PATH
    else:  # test set
        path = TEST_DATA_PATH
    with open(path, encoding='utf8') as f:
        data = json.load(f)
    ids_pool = []
    types_pool = []
    masks_pool = []
    target_pool = []
    count = 0
    # Convert each record in turn
    for each in data:
        cur_ids, cur_type, cur_mask = convert_text_to_token(TOKENIZER, each['content'], SEQ_LENGTH)
        ids_pool.append(cur_ids)
        types_pool.append(cur_type)
        masks_pool.append(cur_mask)
        cur_target = LABEL_DICT[each['label']]  # label string -> class index
        target_pool.append([cur_target])
        count += 1
        if count % 1000 == 0:
            print('Processed {} records'.format(count))
    # Build the loader
    data_gen = TensorDataset(torch.LongTensor(np.array(ids_pool)),
                             torch.LongTensor(np.array(types_pool)),
                             torch.LongTensor(np.array(masks_pool)),
                             torch.LongTensor(np.array(target_pool)))
    sampler = RandomSampler(data_gen)
    loader = DataLoader(data_gen, sampler=sampler, batch_size=BATCH_SIZE)
    return loader

The DataLoader makes mini-batch training possible: each step feeds only batch_size samples to the model.
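Conceptually, a random sampler combined with a batch size amounts to shuffling the sample indices and cutting them into fixed-size chunks. The pure-Python sketch below illustrates that idea only; it is not how PyTorch implements DataLoader, and the helper name iter_batches is hypothetical.

```python
import random

def iter_batches(samples, batch_size, seed=None):
    """Yield shuffled mini-batches; the last one may be smaller."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)  # random order, similar in spirit to RandomSampler
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]

samples = list(range(20))  # stand-ins for 20 encoded records
batches = list(iter_batches(samples, batch_size=8, seed=0))
print([len(b) for b in batches])  # [8, 8, 4]
```

With the 27,768 training samples and a batch size of 8 used in this article, one epoch works out to 27,768 / 8 = 3,471 steps.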

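(3) Training

Hardware: two V100 GPUs

batch_size: 8

One epoch takes about 6 minutes; I trained for only 3 epochs.

Note: validate after every epoch during training and save the best model. The validation code is as follows: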

import torch
import torch.nn.functional as F
from tqdm import tqdm

def test(model, device, test_loader):  # evaluate the model on the test set
    model.eval()
    test_loss = 0.0
    acc = 0
    for (x1, x2, x3, y) in tqdm(test_loader):
        x1, x2, x3, y = x1.to(device), x2.to(device), x3.to(device), y.to(device)
        with torch.no_grad():
            y_ = model(x1, token_type_ids=x2, attention_mask=x3)
        # squeeze(-1) keeps the batch dimension even when a batch holds one sample
        test_loss += F.cross_entropy(y_, y.squeeze(-1)).item()
        pred = y_.max(-1, keepdim=True)[1]  # .max() returns (values, indices); keep the indices
        acc += pred.eq(y.view_as(pred)).sum().item()  # .item() converts the tensor to a Python number
    test_loss /= len(test_loader)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)'.format(
        test_loss, acc, len(test_loader.dataset),
        100. * acc / len(test_loader.dataset)))
    return acc / len(test_loader.dataset)
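The article does not show the training loop itself, so the following is only a sketch of the "validate after each epoch and keep the best model" pattern. The function fit and its three callables are hypothetical names; stand-in lambdas and made-up accuracies are used so that the example runs without a model or GPU.

```python
def fit(num_epochs, train_one_epoch, evaluate, save_checkpoint):
    """Train for num_epochs and checkpoint whenever validation accuracy improves."""
    best_acc = 0.0
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(epoch)
        acc = evaluate()  # e.g. the test() function above
        if acc > best_acc:  # new best: overwrite the saved weights
            best_acc = acc
            save_checkpoint(epoch, acc)
    return best_acc

# Stand-in callables so the sketch runs without a model or GPU
fake_accs = iter([0.71, 0.78, 0.76])  # made-up validation accuracies
saved = []
best = fit(
    num_epochs=3,
    train_one_epoch=lambda epoch: None,
    evaluate=lambda: next(fake_accs),
    save_checkpoint=lambda epoch, acc: saved.append(epoch),
)
print(best, saved)  # 0.78 [1, 2]
```

In a real run, save_checkpoint would typically call torch.save(model.state_dict(), path), so the file on disk always holds the weights of the best epoch seen so far.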

(4) Testing

After training, evaluate the model on the test set. The test code is as follows:

def test(model):
    model.eval()  # disable dropout for inference
    with open('data/usual_', encoding='utf8') as f:
        data = json.load(f)
    res = []
    correct = 0
    count = 0
    for each in data:
        cur_sentence = each['content']
        cur_label = each['label']
        ids = []
        types = []
        masks = []
        cur_ids, cur_type, cur_mask = convert_text_to_token(TOKENIZER, each['content'], SEQ_LENGTH)
        ids.append(cur_ids)
        types.append(cur_type)
        masks.append(cur_mask)
        # Wrap the single sample as a batch of size 1
        cur_ids = torch.LongTensor(np.array(ids)).to(DEVICE)
        cur_type = torch.LongTensor(np.array(types)).to(DEVICE)
        cur_mask = torch.LongTensor(np.array(masks)).to(DEVICE)
        with torch.no_grad():
            y_ = model(cur_ids, token_type_ids=cur_type, attention_mask=cur_mask)
        pred = y_.max(-1, keepdim=True)[1]  # index of the largest logit
        # Predicted emotion; in this script LABEL_DICT maps class index -> label
        cur_pre = LABEL_DICT[int(pred[0][0].item())]
        if cur_label == cur_pre:
            correct += 1
        cur_res = cur_sentence + '\t' + cur_label + '\t' + cur_pre
        res.append(cur_res)
        count += 1
        if count % 1000 == 0:
            print('Processed {} records'.format(count))
    accu = correct / len(data)
    print('Accuracy: {}'.format(accu))
    with open('', 'w', encoding='utf8') as f:
        for each in res:
            f.write(each + '\n')

Note: for testing, first load the original roberta-wwm-ext model, then use the load_state_dict method to read in the parameters saved after 3 epochs of training.
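The save-then-restore round trip can be sketched with a tiny stand-in module. TinyClassifier is a hypothetical replacement for the article's Model class (it contains no BERT), and an in-memory buffer is assumed in place of the checkpoint file so that the example is self-contained.

```python
import io

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in for the article's Model class (no BERT inside)."""
    def __init__(self, hidden=768, num_class=6):
        super().__init__()
        self.fc = nn.Linear(hidden, num_class)

    def forward(self, x):
        return self.fc(x)

trained = TinyClassifier()  # pretend this instance has been fine-tuned

# Save only the parameters; a file path works the same way as this buffer.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Rebuild the architecture first, then copy the saved parameters into it.
restored = TinyClassifier()
restored.load_state_dict(torch.load(buffer, map_location="cpu"))
restored.eval()  # switch off dropout and similar layers before inference

x = torch.randn(2, 768)
with torch.no_grad():
    same = torch.allclose(trained(x), restored(x))
print(same)
```

The order matters because a state dict stores only tensors keyed by parameter name. The module that receives them has to be constructed first, which for this article means building Model from the pretrained roberta-wwm-ext directory before calling load_state_dict.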

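Test results

The overall accuracy is 78.02%. Twenty test results are shown here: the first column is the text being tested, the second is the true emotion label, and the third is the label predicted by the model. As a first impression, this can serve as a baseline model.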
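The test script writes one tab-separated line per sample (text, true label, predicted label), so the saved file can be re-scored offline. Below is a minimal sketch that computes overall and per-label accuracy from lines in that layout. The rows and label strings are made up for illustration and are not real model output.

```python
from collections import Counter

def score(lines):
    """Return overall accuracy and per-label accuracy from result lines."""
    total, correct = Counter(), Counter()
    for line in lines:
        # rsplit keeps any tabs inside the text itself in the first field
        _text, truth, pred = line.rstrip("\n").rsplit("\t", 2)
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_label = {label: correct[label] / n for label, n in total.items()}
    return overall, per_label

# Made-up rows in the same tab-separated layout; not real model output
rows = [
    "text a\thappy\thappy",
    "text b\tangry\tsad",
    "text c\thappy\thappy",
    "text d\tsad\tsad",
]
overall, per_label = score(rows)
print(overall)    # 0.75
print(per_label)  # {'happy': 1.0, 'angry': 0.0, 'sad': 1.0}
```

A per-label breakdown like this helps show whether a single overall figure hides weak classes, which is common when some emotions are rare in the data.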

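Afterword:

(1) In my view, HuggingFace's open-source Transformers library is quite successful; as a startup, the company is worth learning from for its peers in China.

(2) Extraction code: lncz

Usage:

Training: python train.py

Testing: python testCase.py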
