
Building a Question Classifier for a TREC-Style Question Answering System

Richard May & Ari Steinberg

Topic: Question Classification
We define Question Classification (QC) here to be the task that, given a question, maps it to one of k classes, which provide a semantic constraint on the sought-after answer [Li02]. The topic of Question Classification arises in the area of automated question-answering systems, such as those created for the TREC question answering competition. Automated question-answering systems differ from other information retrieval systems (i.e., search engines) in that they do not return a list of documents with possible relevance to the topic, but rather return a short phrase containing the answer. Moreover, question answering systems take in as input queries expressed in natural language rather than the keywords traditional search engines use.
In order to respond correctly to a free-form factual question given a large collection of texts, any system needs to understand the question to a level that allows determining some of the constraints the question imposes on a possible answer. The constraints may include a semantic classification of the sought-after answer and may even suggest using different strategies when looking for and verifying a candidate answer. More specifically, knowing the class (or possible classes) of the sought-after answer narrows down the number of possible phrases/paragraphs a question-answering system has to consider, and thus greatly improves performance of the overall system. [Harabagiu] divides their QA system into three distinct pieces. At the core of the first module in the system lies a question classification task. Thus, it seems that question classification is an important subtask of automated question answering. An error in question classification will almost undoubtedly throw off the entire QA pipeline, so it is crucial to be able to classify questions correctly.
In our paper, we build on the hierarchical classification discussed in [Li02] and experiment with some features of our own design. We expect that by tweaking both the classification algorithms and the choice of features, we can achieve improvements in this crucial QA subsystem.
Classification
Classification Methodology
Results from [Li02] demonstrated that using a flat classifier performs just as well as a two-layer hierarchical classifier that used a coarse classifier to dispatch the classification task to a second classifier. We, too, plan on using a hierarchical classifier; however, ours will differ in that we will also attempt to learn which classes are often confused, whereas [Li02] applied domain knowledge to the task and created six hand-crafted superclasses that then contained distinct subsets of the true classes.
Classification System
For our code base, we leverage an existing machine learning library, MALLET, discussed in [McCallum]. We designed a hierarchical classifier trainer, which takes in a training set and partitions it into a base training set and an advanced training set. The trainer then uses the base training set to train the coarse classifier over all the possible question types. The trainer then tests the coarse classifier on the advanced training data to build a confusion matrix. Using a set of threshold parameters, the trainer decides if certain predicted classes have too high a confusion rate, and then trains a secondary classifier on the advanced training instances that were predicted to be part of the high-confusion-rate classes.
Figure 1: Diagram of how to train a hierarchical classifier
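The training procedure above can be sketched as follows. This is a minimal illustration of the control flow only, not our MALLET-based implementation: the `train_fn` argument, the 75/25 split default, and the 0.3 confusion threshold are hypothetical stand-ins for the actual classifiers and threshold parameters.

```python
from collections import Counter, defaultdict

def confusion_rates(classifier, data):
    """For each predicted class, the fraction of instances whose true label differs."""
    by_pred = defaultdict(Counter)
    for features, label in data:
        by_pred[classifier(features)][label] += 1
    return {pred: 1 - counts[pred] / sum(counts.values())
            for pred, counts in by_pred.items()}

def train_hierarchical(train_fn, data, split=0.75, threshold=0.3):
    """Train a coarse classifier on the base set, find its confused classes on the
    advanced set, and train a secondary classifier for instances routed to them."""
    cut = int(len(data) * split)
    base, advanced = data[:cut], data[cut:]
    coarse = train_fn(base)
    confused = {pred for pred, rate in confusion_rates(coarse, advanced).items()
                if rate > threshold}
    routed = [(x, y) for x, y in advanced if coarse(x) in confused]
    fine = train_fn(routed) if routed else None

    def classify(features):
        pred = coarse(features)
        if fine is not None and pred in confused:
            return fine(features)
        return pred
    return classify
```

At classification time, only questions whose coarse prediction lands in a high-confusion class are passed to the secondary classifier; everything else keeps the coarse label.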
Features
The majority of our creative energies were focused on feature engineering. It is a process that requires a lot of trial and error. We intend to leverage resources in WordNet to improve on semantic understanding of the questions.
As input to our machine learning algorithm, the computer examines each question and derives a representation consisting of numerous features. In the end, a typical question can have as many as 60 features, and our set of 6000 total questions can result in anywhere from 30,000 to 120,000 unique features. While it would be easy to generate far more than this, due to memory constraints we must be careful in selecting only the most useful types of features to include in this set. In addition, adding extraneous features can add noise to the data and result in weaker performance. On the other hand, large performance improvements can be gained by adding useful features to the set.
Basic features
The most basic representation of a question is simply the individual words of that question (ignoring contextual information such as the ordering of the words). While simple, this is also by far the most important part of our program: the best indicators of certain question types are single words, and in particular question words clearly reveal a lot about the type of question being asked. On the other hand, a large amount of information would be lost by stopping here, since words can often mean many different things depending on their context.
We can regain some of this contextual information by examining part-of-speech tags. We run a parser over the question and take the preterminal nodes of the parse as the parts of speech for each word. The parts of speech alone wouldn't help much, so we add word and part-of-speech pairs as features, thus helping to disambiguate words which have different senses depending on their part of speech.
Another problem that part-of-speech tags fail to address is feature sparsity (in fact, part-of-speech and word pairs suffer even further from this problem). Some words can reveal a lot of information, but do not show up enough times in the training set to allow our classifier to pick up on this. To address this problem, we use a stemmer to create a more generalized form of the word. For example, whether a verb is in present or past tense might not impact how it affects the question being asked.
Another basic feature that we add is bigrams: pairs of words that occur sequentially. However, bigrams inherently suffer even worse from sparsity problems than individual words, so we use the stemmed form of the words to create the bigrams.
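The basic feature set described so far (words, word/POS pairs, stems, and stemmed bigrams) can be sketched like this. The crude suffix-stripping stemmer and the feature-name prefixes (`STM:`, `BI:`) are illustrative stand-ins; the real system would use a proper stemmer and the POS tags produced by the parser.

```python
def crude_stem(word):
    """Toy stand-in for a real stemmer (e.g. Porter): strip a few common suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def basic_features(tokens, tags):
    """tokens: the words of the question; tags: their POS tags (parser preterminals)."""
    stems = [crude_stem(w) for w in tokens]
    feats = list(tokens)                                         # bag of words
    feats += [f"{w}/{t}" for w, t in zip(tokens, tags)]          # word/POS pairs
    feats += [f"STM:{s}" for s in stems]                         # stemmed forms
    feats += [f"BI:{a}_{b}" for a, b in zip(stems, stems[1:])]   # stemmed bigrams
    return feats
```

For "What causes pneumonia" this yields features such as `causes/VBZ` and the stemmed bigram `BI:caus_pneumonia`, which also fires for "What caused pneumonia".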
One final basic feature that we experimented with is to take conjunctions of any other two basic features. Bigrams do not capture long-distance dependencies (two words that may affect each other's meanings but are not adjacent in the sentence), so we attempted to capture those with the conjunctions. The conjunctions did help to improve accuracy, but they resulted in unmanageably large feature set sizes, so we could not run them with the larger training corpus or with the help of our full feature set.
Parse Signatures
A more advanced feature that we added is something we call a "parse signature", though the term "sentence reduction" may be a more accurate descriptor. Our goal with the signatures was to create a representation of the entire question structure, instead of just small fragments of it. Obviously, though, adding the full sentence would only help to later classify that exact sentence, so we needed a way of generalizing this. Another motivation for creating the signatures was to attempt to have some way to represent the rich grammatical information given to us by the question's parse tree, but again without suffering from the sparsity issues that would come with using the entire tree as a feature.
A parse signature can be thought of as a left-to-right readout of a parse tree. More formally, it is a set of nodes from a parse tree such that every leaf node has exactly one ancestor (or itself) in the set. The question itself would be one such signature, and the tree's root node would be another (though neither of those would be particularly useful for our purposes). It should be clear that the number of parse signatures that can be generated from a question is enormous, so we need a way of limiting this number. Our solution is to parameterize the signature generation with the desired length of the readouts. Choosing to include all signatures of lengths 1 to 5 captures the information that we desire while keeping the number of features at a reasonable level.
We can further reduce the noise of this data by choosing not to traverse below certain nodes. For example, our parse trees always end in a "." node which expands to a "?", but since nearly every question ends with a "?" this means we would always be generating two almost-identical versions of each signature, one ending in "." and one ending in "?". To avoid this, we tell the signature generator to never traverse below such nodes; we do the same for some other parts of speech, such as CC, whose exact words seem to have little impact.
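The definition above admits a simple recursive sketch: a node's signatures are either its own label, or any combination of one signature per child, with labels in a stop set treated as leaves. The tuple encoding of the tree and the contents of the stop set are assumptions for illustration, not our actual data structures.

```python
from itertools import product

def signatures(node, stop=frozenset(), max_len=5):
    """node = (label, children); children is a list of nodes, empty for leaves.
    Returns all parse signatures of length <= max_len, left to right."""
    label, children = node
    results = [[label]]  # the node itself covers every leaf beneath it
    if children and label not in stop:
        child_sigs = [signatures(c, stop, max_len) for c in children]
        for combo in product(*child_sigs):  # one signature per child
            sig = [tok for part in combo for tok in part]
            if len(sig) <= max_len:
                results.append(sig)
    return results

# A toy parse of "What causes pneumonia":
tree = ("SBARQ", [
    ("WHNP", [("WP", [("What", [])])]),
    ("SQ", [("VBZ", [("causes", [])]),
            ("NP", [("NN", [("pneumonia", [])])])]),
])
```

Here `signatures(tree)` includes readouts like `["What", "causes", "NN"]`; passing `stop=frozenset({"WHNP"})` keeps the generator from expanding below WHNP, shrinking the feature set.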
WordNet
Thus far, all of our features have focused on syntactic considerations, but there is clearly a lot of information to be gained by looking at the semantic information in the questions. In order to accomplish this, we add WordNet information to our features. For every noun, verb, adjective, and adverb, we also add the first synonym that appears in its "synset" (a group of words that WordNet tells us share a meaning). While we could add every synonym, as long as we are careful to choose the same synonym each time something in a particular synset occurs, there is no reason to add more than one. Perhaps more interesting than synonyms, though, are hypernyms. This is the "is a" relationship (going from more specific to more general), e.g. "animal" for "dog". While the idea of a more generalized form of a word is clearly a useful one, just adding a word's direct hypernym probably wouldn't be very helpful, since this may not be general enough (we might get that a dog is a canine and a cat is a feline instead of finding that they are both animals). Going multiple levels up also does not help, because two words can be at different levels in the tree, so while they may share some ancestor, it may be 2 levels up for one word and 3 for another and we would again not see the similarity. One solution would be to add all of a word's hypernyms, but this would result in too much data.
Our solution was instead to always take the hypernym a certain distance from the "root" node. For example, many nouns have "entity" as their root node, but knowing that a noun is an entity isn't much better than knowing that it's a noun. We start at the root (such as entity) and work down several nodes in the direction of the target word until we get to our fixed level, and choose that as the hypernym to add. This way we are more likely to have related words share a hypernym, even if they are at different levels of the WordNet tree. We add such a hypernym to our feature set for every word that occurs below the target level.
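The fixed-depth idea can be illustrated with a toy hand-coded taxonomy standing in for WordNet (the real system would walk WordNet's hypernym paths instead; the words and depth here are hypothetical). Counting down from the root rather than up from the word makes "dog" and "cat" land on the same hypernym even though they sit at different depths.

```python
# Toy taxonomy (child -> parent), a stand-in for WordNet's hypernym hierarchy.
PARENTS = {
    "dog": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "animal", "animal": "entity",
}

def hypernym_at_depth(word, parents, depth):
    """Return the ancestor `depth` steps below the root, or None if the word
    occurs at or above that level."""
    path = [word]
    while path[-1] in parents:          # climb to the root
        path.append(parents[path[-1]])
    path.reverse()                      # root first, word last
    return path[depth] if depth < len(path) - 1 else None
```

With `depth=1`, both "dog" (4 levels below the root) and "cat" share the hypernym "animal", while "animal" itself is too close to the root and gets no hypernym feature.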
Finally, this hypernym information fits nicely into our parse signature idea. The signature "What causes pneumonia", for example, is probably too specific, but having "What causes NN" may be too general. If we instead had "What causes HYP disease" (inserting the HYP prefix to distinguish the hypernym from the base word) as our signature, this may be more ideal. We insert the hypernyms into the tree between a word's part of speech and the word itself. For words that do not have hypernyms (because they are the wrong part of speech or occur too close to the root node in the WordNet hierarchy) but do have stemmed forms, we add the stemmed form instead. Again, this has the potential to dramatically increase the number of parse signatures, so we tell our signature generator to never traverse below a node
with an HYP or STM prefix.
Evaluation & Results
Dataset
In order to compare our results with previous work, we used the same dataset as [Li02]. The data can be obtained at the group's website (l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/). The training data consists of 5500 labelled questions. The test data is 500 questions taken from the TREC 10 set. In both the training and test data, there are a total of 50 different question classes. We count an answer as correct if the best classification label output by our classifier is the true label.
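The scoring rule is plain top-1 accuracy: a question counts as correct when the classifier's best label equals the gold label. A minimal sketch:

```python
def accuracy(predicted, gold):
    """Fraction of questions whose best predicted label matches the true label."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```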
Hierarchical Classification Results
After experimenting with a variety of classifiers (SVM, MaxEnt, Naive Bayes, Decision Tree) for primary and secondary classification, we decided that a mix of a Maximum Entropy coarse classifier with a Naive Bayes fine classifier was the best combination. Observations indicated that it was better to mix classifiers than to have the same type of classifier as both the primary and secondary.
As for why a Naive Bayes secondary classifier had the best results, we can offer some intuition. By the time we reached the secondary classification level, our training data's feature sets were extremely sparse relative to the size of the feature space. Thus we would want a classifier that could take in a lot of features while still being very general. The exponential family of classifiers is known to perform better than some of the other classifiers we experimented with given sparse training data.
Unfortunately, our best hierarchical classifiers still under-performed compared to the flat maximum entropy classifier we trained. Under the best conditions, we chose to split the training data with 75% used for the coarse classifier and 25% for the fine classifiers.
Training Set              1000   2000   3000   4000   5500
Flat Classifier           67.6   74.2   77.8   80.2   82.0
Hierarchical Classifier   64.6   71.0   75.8   77.8   81.0
Examining our performance data, we realize that the main loss of performance is due to the reduced amount of training data fed into the primary classifier. If we rescale the training set axis of the hierarchical classifier to reflect the use of only 75% of the training data for the coarse classifier, we see that
