
Building a Question Classifier for a TREC-Style Question Answering System

Richard May & Ari Steinberg

Topic: Question Classification
We define Question Classification (QC) here to be the task that, given a question, maps it to one of k classes, which provide a semantic constraint on the sought-after answer [Li02]. The topic of Question Classification arises in the area of automated question-answering systems, such as those created for the TREC question answering competition. Automated question-answering systems differ from other information retrieval systems (i.e., search engines) in that they do not return a list of documents with possible relevance to the topic, but rather return a short phrase containing the answer. Moreover, question answering systems take in as input queries expressed in natural language rather than the keywords traditional search engines use.
In order to respond correctly to a free-form factual question given a large collection of texts, any system needs to understand the question to a level that allows determining some of the constraints the question imposes on a possible answer. The constraints may include a semantic classification of the sought-after answer and may even suggest using different strategies when looking for and verifying a candidate answer. More specifically, knowing the class (or possible classes) of the sought-after answer narrows down the number of possible phrases/paragraphs a question-answering system has to consider, and thus greatly improves performance of the overall system. [Harabagiu] divides their QA system into three distinct pieces. At the core of the first module in the system lies a question classification task. Thus, it seems that question classification is an important subtask of automated question answering. An error in question classification will almost undoubtedly throw off the entire QA pipeline, so it is crucial to be able to classify questions correctly.
In our paper, we build on the hierarchical classification discussed in [Li02] and experiment with some features of our own design. We expect that by tweaking both the classification algorithms and the choice of features, we can achieve improvements in this crucial QA subsystem.
Classification
Classification Methodology
Results from [Li02] demonstrated that using a flat classifier performs just as well as a two-layer hierarchical classifier that used a coarse classifier to dispatch the classification task to a second classifier. We, too, plan on using a hierarchical classifier; however, ours will differ in that we will also attempt to learn which classes are often confused, whereas [Li02] applied domain knowledge to the task and created six hand-crafted superclasses that then contained distinct subsets of the true classes.
Classification System
For our code base, we leverage an existing machine learning library, MALLET, discussed in [McCallum]. We designed a hierarchical classifier trainer, which takes in a training set and partitions it into a base training set and an advanced training set. The trainer then uses the base training set to train the coarse classifier over all the possible question types. The trainer then tests the coarse classifier on the advanced training data to build a confusion matrix. Using a set of threshold parameters, the trainer decides if certain predicted classes have too high a confusion rate, and then trains a secondary classifier on the advanced training instances that were predicted to be part of the high-confusion-rate classes.
Figure 1: Diagram of how to train a hierarchical classifier
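The training procedure above can be sketched as follows. This is a minimal illustration of the control flow only, not our MALLET-based implementation: the `train_fn` argument, the 75/25 split default, and the 0.3 confusion threshold are hypothetical stand-ins for the actual classifiers and threshold parameters.

```python
from collections import Counter, defaultdict

def confusion_rates(classifier, data):
    """For each predicted class, the fraction of instances whose true label differs."""
    by_pred = defaultdict(Counter)
    for features, label in data:
        by_pred[classifier(features)][label] += 1
    return {pred: 1 - counts[pred] / sum(counts.values())
            for pred, counts in by_pred.items()}

def train_hierarchical(train_fn, data, split=0.75, threshold=0.3):
    """Train a coarse classifier on the base set, find its confused classes on the
    advanced set, and train a secondary classifier for instances routed to them."""
    cut = int(len(data) * split)
    base, advanced = data[:cut], data[cut:]
    coarse = train_fn(base)
    confused = {pred for pred, rate in confusion_rates(coarse, advanced).items()
                if rate > threshold}
    routed = [(x, y) for x, y in advanced if coarse(x) in confused]
    fine = train_fn(routed) if routed else None

    def classify(features):
        pred = coarse(features)
        if fine is not None and pred in confused:
            return fine(features)
        return pred
    return classify
```

At classification time, only questions whose coarse prediction lands in a high-confusion class are passed to the secondary classifier; everything else keeps the coarse label.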
Features
The majority of our creative energies were focused on feature engineering. It is a process that requires a lot of trial and error. We intend to leverage resources in WordNet to improve on semantic understanding of the questions.
As input to our machine learning algorithm, the computer examines each question and derives a representation consisting of numerous features. In the end, a typical question can have as many as 60 features, and our set of 6000 total questions can result in anywhere from 30,000 to 120,000 unique features. While it would be easy to generate far more than this, due to memory constraints we must be careful in selecting only the most useful types of features to include in this set. In addition, adding extraneous features can add noise to the data and result in weaker performance. On the other hand, large performance improvements can be gained by adding useful features to the set.
Basic features
The most basic representation of a question is simply the individual words of that question (ignoring contextual information such as the ordering of the words). While simple, this is also by far the most important part of our program: the best indicators of certain question types are single words, and in particular question words clearly reveal a lot about the type of question being asked. On the other hand, a large amount of information would be lost by stopping here, since words can often mean many different things depending on their context.
We can regain some of this contextual information by examining part-of-speech tags. We run a parser over the question and take the preterminal nodes of the parse as the parts of speech for each word. The parts of speech alone wouldn't help much, so we add word and part-of-speech pairs as features, thus helping to disambiguate words which have different senses depending on their part of speech.
Another problem that part-of-speech tags fail to address is feature sparsity (in fact, part-of-speech and word pairs suffer even further from this problem). Some words can reveal a lot of information, but do not show up enough times in the training set to allow our classifier to pick up on this. To address this problem, we use a stemmer to create a more generalized form of the word. For example, whether a verb is in present or past tense might not impact how it affects the question being asked.
Another basic feature that we add is bigrams: pairs of words that occur sequentially. However, bigrams inherently suffer even worse from sparsity problems than individual words, so we use the stemmed form of the words to create the bigrams.
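The basic feature set described so far (words, word/POS pairs, stems, and stemmed bigrams) can be sketched like this. The crude suffix-stripping stemmer and the feature-name prefixes (`STM:`, `BI:`) are illustrative stand-ins; the real system would use a proper stemmer and the POS tags produced by the parser.

```python
def crude_stem(word):
    """Toy stand-in for a real stemmer (e.g. Porter): strip a few common suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def basic_features(tokens, tags):
    """tokens: the words of the question; tags: their POS tags (parser preterminals)."""
    stems = [crude_stem(w) for w in tokens]
    feats = list(tokens)                                         # bag of words
    feats += [f"{w}/{t}" for w, t in zip(tokens, tags)]          # word/POS pairs
    feats += [f"STM:{s}" for s in stems]                         # stemmed forms
    feats += [f"BI:{a}_{b}" for a, b in zip(stems, stems[1:])]   # stemmed bigrams
    return feats
```

For "What causes pneumonia" this yields features such as `causes/VBZ` and the stemmed bigram `BI:caus_pneumonia`, which also fires for "What caused pneumonia".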
One final basic feature that we experimented with is to take conjunctions of any other two basic features. Bigrams do not capture long-distance dependencies (two words that may affect each other's meanings but are not adjacent in the sentence), so we attempted to capture those with the conjunctions. The conjunctions did help to improve accuracy, but they resulted in unmanageably large feature set sizes, so we could not run them with the larger training corpus or with the help of our full feature set.
Parse Signatures
A more advanced feature that we added is something we call a "parse signature", though the term "sentence reduction" may be a more accurate descriptor. Our goal with the signatures was to create a representation of the entire question structure, instead of just small fragments of it. Obviously, though, adding the full sentence would only help to later classify that exact sentence, so we needed a way of generalizing this. Another motivation for creating the signatures was to attempt to have some way to represent the rich grammatical information given to us by the question's parse tree, but again without suffering from the sparsity issues that would come with using the entire tree as a feature.
A parse signature can be thought of as a left-to-right readout of a parse tree. More formally, it is a set of nodes from a parse tree such that every leaf node has exactly one ancestor (or itself) in the set. The question itself would be one such signature, and the tree's root node would be another (though neither of those would be particularly useful for our purposes). It should be clear that the number of parse signatures that can be generated from a question is enormous, so we need a way of limiting this number. Our solution is to parameterize the signature generation with the desired length of the readouts. Choosing to include all signatures of lengths 1 to 5 captures the information that we desire while keeping the number of features at a reasonable level.
We can further reduce the noise of this data by choosing not to traverse below certain nodes. For example, our parse trees always end in a "." node which expands to a "?", but since nearly every question ends with a "?" this means we would always be generating two almost-identical versions of each signature, one ending in "." and one ending in "?". To avoid this, we tell the signature generator to never traverse below such nodes; we do the same for some other parts of speech, such as CC, whose exact words seem to have little impact.
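The definition above admits a simple recursive sketch: a node's signatures are either its own label, or any combination of one signature per child, with labels in a stop set treated as leaves. The tuple encoding of the tree and the contents of the stop set are assumptions for illustration, not our actual data structures.

```python
from itertools import product

def signatures(node, stop=frozenset(), max_len=5):
    """node = (label, children); children is a list of nodes, empty for leaves.
    Returns all parse signatures of length <= max_len, left to right."""
    label, children = node
    results = [[label]]  # the node itself covers every leaf beneath it
    if children and label not in stop:
        child_sigs = [signatures(c, stop, max_len) for c in children]
        for combo in product(*child_sigs):  # one signature per child
            sig = [tok for part in combo for tok in part]
            if len(sig) <= max_len:
                results.append(sig)
    return results

# A toy parse of "What causes pneumonia":
tree = ("SBARQ", [
    ("WHNP", [("WP", [("What", [])])]),
    ("SQ", [("VBZ", [("causes", [])]),
            ("NP", [("NN", [("pneumonia", [])])])]),
])
```

Here `signatures(tree)` includes readouts like `["What", "causes", "NN"]`; passing `stop=frozenset({"WHNP"})` keeps the generator from expanding below WHNP, shrinking the feature set.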
WordNet
Thus far, all of our features have focused on syntactic considerations, but there is clearly a lot of information to be gained by looking at the semantic information in the questions. In order to accomplish this, we add WordNet information to our features. For every noun, verb, adjective, and adverb, we also add the first synonym that appears in its "synset" (a group of words that WordNet tells us share a meaning). While we could add every synonym, as long as we are careful to choose the same synonym each time something in a particular synset occurs, there is no reason to add more than one. Perhaps more interesting than synonyms, though, are hypernyms. This is the "is a" relationship (going from more specific to more general), e.g. "animal" for "dog". While the idea of a more generalized form of a word is clearly a useful one, just adding a word's direct hypernym probably wouldn't be very helpful, since this may not be general enough (we might get that a dog is a canine and a cat is a feline instead of finding that they are both animals). Going multiple levels up also does not help, because two words can be at different levels in the tree, so while they may share some ancestor, it may be 2 levels up for one word and 3 for another and we would again not see the similarity. One solution would be to add all of a word's hypernyms, but this would result in too much data.
Our solution was instead to always take the hypernym a certain distance from the "root" node. For example, many nouns have "entity" as their root node, but knowing that a noun is an entity isn't much better than knowing that it's a noun. We start at the root (such as entity) and work down several nodes in the direction of the target word until we get to our fixed level, and choose that as the hypernym to add. This way we are more likely to have related words share a hypernym, even if they are at different levels of the WordNet tree. We add such a hypernym to our feature set for every word that occurs below the target level.
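The fixed-depth idea can be illustrated with a toy hand-coded taxonomy standing in for WordNet (the real system would walk WordNet's hypernym paths instead; the words and depth here are hypothetical). Counting down from the root rather than up from the word makes "dog" and "cat" land on the same hypernym even though they sit at different depths.

```python
# Toy taxonomy (child -> parent), a stand-in for WordNet's hypernym hierarchy.
PARENTS = {
    "dog": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "animal", "animal": "entity",
}

def hypernym_at_depth(word, parents, depth):
    """Return the ancestor `depth` steps below the root, or None if the word
    occurs at or above that level."""
    path = [word]
    while path[-1] in parents:          # climb to the root
        path.append(parents[path[-1]])
    path.reverse()                      # root first, word last
    return path[depth] if depth < len(path) - 1 else None
```

With `depth=1`, both "dog" (4 levels below the root) and "cat" share the hypernym "animal", while "animal" itself is too close to the root and gets no hypernym feature.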
Finally, this hypernym information fits nicely into our parse signature idea. The signature "What causes pneumonia", for example, is probably too specific, but having "What causes NN" may be too general. If we instead had "What causes HYP disease" (inserting the HYP prefix to distinguish the hypernym from the base word) as our signature, this may be more ideal. We insert the hypernyms into the tree between a word's part of speech and the word itself. For words that do not have hypernyms (because they are the wrong part of speech or occur too close to the root node in the WordNet hierarchy) but do have stemmed forms, we add the stemmed form instead. Again, this has the potential to dramatically increase the number of parse signatures, so we tell our signature generator to never traverse below a node
with an HYP or STM prefix.
Evaluation & Results
Dataset
In order to compare our results with previous work, we used the same dataset as [Li02]. The data can be obtained at the group's website (l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/). The training data consists of 5500 labelled questions. The test data is 500 questions taken from the TREC 10 set. In both the training and test data, there are a total of 50 different question classes. We count an answer as correct if the best classification label output by our classifier is the true label.
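The scoring rule is plain top-1 accuracy: a question counts as correct when the classifier's best label equals the gold label. A minimal sketch:

```python
def accuracy(predicted, gold):
    """Fraction of questions whose best predicted label matches the true label."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```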
Hierarchical Classification Results
After experimenting with a variety of classifiers (SVM, MaxEnt, Naive Bayes, Decision Tree) for primary and secondary classification, we decided that a mix of a Maximum Entropy coarse classifier with a Naive Bayes fine classifier was the best combination. Observations indicated that it was better to mix classifiers than to have the same type of classifier as both the primary and secondary.
As for why a Naive Bayes secondary classifier had the best results, we can offer some intuition. By the time we reached the secondary classification level, our training data's feature sets were extremely sparse relative to the size of the feature space. Thus we would want a classifier that could take in a lot of features while still being very general. The exponential family of classifiers is known to perform better than some of the other classifiers we experimented with given sparse training data.
Unfortunately, our best hierarchical classifiers still under-performed compared to the flat maximum entropy classifier we trained. Under the best conditions, we chose to split the training data with 75% used for the coarse classifier and 25% for the fine classifiers.
Training Set              1000   2000   3000   4000   5500
Flat Classifier           67.6   74.2   77.8   80.2   82.0
Hierarchical Classifier   64.6   71.0   75.8   77.8   81.0
Examining our performance data, we realize that the main loss of performance is due to the reduced amount of training data fed into the primary classifier. If we rescale the training set axis of the hierarchical classifier to reflect the use of only 75% of the training data for the coarse classifier, we see that
