Automatic Chatbot Knowledge Acquisition from Online Forum via Rough Set and Ensemble Learning

Yu Wu, Gongxiao Wang, Weisheng Li, Zhijun Li

Abstract—Existing chatbot knowledge bases are mostly hand-constructed, which is time consuming and difficult to adapt to new domains. This paper presents a method for acquiring chatbot knowledge automatically from online forums. The method builds a classification model based on rough sets and combines it with ensemble learning to make the final decision. Given a forum, multiple rough set classifiers are first constructed and trained. All replies are then classified by these classifiers, and the final recognition result is obtained by voting over their outputs. Finally, the related replies are selected as chatbot knowledge. Experiments on a child-care forum show that the rough-set-based method recognizes related replies efficiently and that adding ensemble learning further improves the results.
Index Terms—Chatbot, Ensemble Learning, Knowledge Acquisition, Rough Set
I. INTRODUCTION
A chatbot is a conversational agent that interacts with users in a certain domain or on a certain topic using natural language. Many chatbots have been developed on the Internet, including Eliza, Claude, SHAMpage, ALICE, and so on. Eliza [1], developed by Joseph Weizenbaum in 1966, was the first chatbot; it was originally designed to emulate a psychotherapist and had a knowledge base in this domain.
Claude, devised by Brian McLaughlin, was a chatbot that used standard pattern matching to find a suitable answer: it split the user input, found the matching patterns, and constructed the answer. SHAMpage, designed by Richard Waugh, chatted with people on BBSes and could answer some related questions. Hex was developed by Jason Hutchens of the intelligent information processing centre at the University of Western Australia and could deal with different circumstances in conversation. ALICE (Artificial Linguistic Internet Computer Entity) is an AI (Artificial Intelligence) chatbot based on experience, developed by Dr. Richard S. Wallace of Lehigh University in Pennsylvania. It uses AIML (Artificial Intelligence Markup Language) [2] as its knowledge description language and stores more than 40,000 knowledge records. Because of its perfect
This paper is supported by: the National Basic Research Program ("973 Program") of China (No. 2008CB317111), the National Natural Science Foundation of China (No. 60703010), and the Chongqing Natural Science Foundation of China.
Yu Wu, Gongxiao Wang, and Weisheng Li are with the Institute of Artificial Intelligence, Chongqing University of Posts and Telecommunications, China (e-mail: wuyu@cqupt.edu, wgx_, liws@cqupt.edu).
design and realistic effect, it won the famous Loebner Prize in 2000 and 2001. Many existing chatbots have been developed on the basis of the ALICE framework.
Most existing chatbots consist of a dialog management module and a chatbot knowledge base. The knowledge base is like the brain of the chatbot and stores the chatbot knowledge used to respond to user input. The dialog management module controls the dialogue process, including receiving and understanding user input, searching for the right response in the chat knowledge base, and giving feedback. Sometimes the dialog management module also needs to take the initiative to raise new issues. A typical chatbot knowledge base contains a set of templates that match user inputs and generate responses. The templates currently used in chatbots, however, are hand-coded. Therefore, the construction of a chatbot knowledge base is time consuming and difficult to adapt to new domains.
Existing work on automatic chatbot knowledge acquisition is mainly based on human-annotated datasets, such as the research by Shawar and Atwell [3] and Tarau and Figa [4]. These approaches are helpful for constructing commonsense knowledge for chatbots, but are not capable of extracting knowledge for specific domains. Jizhou Huang proposed a method of extracting chatbot knowledge from online discussion forums [5], which classified the replies in two steps using an SVM (Support Vector Machine) and extracted <thread-title, reply> pairs automatically as chatbot knowledge. In this way, domain-specific knowledge for a chatbot was constructed effectively. However, since SVM training solves a quadratic programming problem to obtain the support vectors, it is difficult to apply to large-scale sample sets.
In this paper, chat knowledge is extracted from online discussion forums automatically. Through analyzing replies with rough set techniques, a reply classification model is designed. Furthermore, ensemble learning based on Bagging is applied to improve the recognition results. Because of the preprocessing and reduction performed with rough sets, the method is advantageous in processing speed. Experiments show that the model is simple and classifies well.
The rest of this paper is organized as follows. The challenges of extracting high-quality chat knowledge are outlined in Section Ⅱ. Related theories are introduced in Section Ⅲ. Section Ⅳ proposes a classification model based on rough sets and ensemble learning. Section Ⅴ reports the experimental
Zhijun Li is with China Mobile Communication Corporation, Chongqing LTD (e-mail: lizhijun@).
results. Conclusions and future work are given in Section Ⅵ.
II. CHALLENGES
An online discussion forum is a type of online asynchronous communication system. It normally consists of several discussion sections. Each section focuses on a specific discussion theme and includes many threads. People can initiate new discussions by creating threads, or ask (answer) questions by posting questions (replies) to an existing section. An online discussion forum contains popular, rich, and live information. A root message and its follow-up replies can be viewed as <input, response> pairs, which have the same structure as the chat templates of a chatbot, so a forum is suitable for chatbot knowledge extraction. However, the nature of a forum makes it difficult to extract high-quality chat knowledge, for the following reasons [5]:
- Replies are often short, elliptical, and irregular, and are full of spelling, usage, and grammar mistakes, which results in noisy text.
- Not all replies are related to the root message.
- A reply may be separated in time or place from the reply to which it responds, leading to a fragmented conversational structure; thus, adjacent replies might be semantically unrelated.
- There is no evidence to reveal whether a reply is given to the root message or to another reply, unless the participants have quoted entire entries or parts of a previously posted reply to preserve context.
To overcome these difficulties, this paper presents a rough-set-based model for chatbot knowledge extraction. It extracts structural and content features of replies and uses an ensemble learning method in the decision-making stage. Finally, experiments are carried out to verify the effectiveness of the model.
III. RELATED THEORIES
A. Rough Set
Rough set theory [6] was proposed by Z. Pawlak in 1982 as a powerful mathematical tool for processing incomplete data and inaccurate knowledge. It is built on a classification mechanism and regards classification as indiscernibility (equivalence) relations, which constitute a division of the space. Rough set theory regards knowledge as a division of the data, and each divided set is called a concept. It is currently an active topic in the artificial intelligence field and has been widely used in knowledge acquisition, knowledge analysis, decision analysis, and so on [7]. Without any prior mathematical description of the attributes and features of the objects, it determines knowledge reductions and derives decision rules directly from the given classification via indiscernibility relations and classes, so its description of the uncertainty of a problem is more objective.
B. Ensemble Learning
Ensemble learning [8] constructs multiple learners and combines their classification results to obtain the final result, which improves the generalization ability of the learners. In recent years it has become a hot topic in machine learning, data mining, artificial intelligence, and other fields, and it has been ranked first among the four current directions of machine learning research [8]. Many ensemble learning algorithms exist; Bagging [9] and Boosting [10] are the most famous. In Bagging, bootstrap sampling is used, and each new training set has roughly the same scale as the original training set. In Boosting, by contrast, the selection of each learner's training set is influenced by the performance of the previous learners: samples that existing learners have misclassified are given a greater probability of appearing in the new training set. Boosting can achieve better classification results than Bagging, but unlike Bagging it may overfit on some practical problems and perform even worse than a single learner. In this paper, Bagging is adopted.
IV. CLASSIFICATION MODEL BASED ON ROUGH SET AND ENSEMBLE LEARNING
A. System Model
The system model of this paper is shown in Fig. 1. The forum data collecting module is in charge of gathering forum data, which serve as the raw data for rule generation. The data selecting module is in charge of analyzing the data and selecting target data from the raw data. Multiple training subsets are then selected and multiple rough set classifiers are trained. Finally, the system votes according to the principle of majority priority and obtains the final classification results.
FIG. 1. AUTOMATIC EXTRACTION SYSTEM FOR CHATBOT KNOWLEDGE BASE
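To make the flow of Fig. 1 concrete, the following is a minimal, self-contained Python sketch of the pipeline; all component names and the toy classification logic are hypothetical stand-ins, since the actual classifiers in this paper are rough set rule sets produced by RIDAS.

# Minimal sketch of the Fig. 1 pipeline; names and toy logic are hypothetical.
def collect_forum_data():
    # Forum data collecting module: returns (thread-title, replies) pairs.
    return [("thread title", ["up", "reply 1?", "a long and helpful reply ..."])]

def train_classifiers(raw_data, n=3):
    # Stand-ins for the multiple trained rough set classifiers.
    return [lambda reply, k=k: "RR" if len(reply) > 10 + k else "IR" for k in range(n)]

def vote(classifiers, reply):
    # Decision by the principle of majority priority.
    votes = [c(reply) for c in classifiers]
    return max(set(votes), key=votes.count)

knowledge = []
for title, replies in collect_forum_data():
    classifiers = train_classifiers(None)
    knowledge += [(title, r) for r in replies if vote(classifiers, r) == "RR"]
print(knowledge)  # the <thread-title, reply> pairs kept as chatbot knowledge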
B. Classification Model Based on Rough Set
An input online discussion forum $F$ contains discussion sections $s_1, s_2, \dots, s_k$. Each section consists of $u$ threads $t_1, t_2, \dots, t_u$. Each thread $t$ is a sequence of replies $t = \{r_0, r_1, r_2, \dots, r_n\}$, where $r_0$ is the root message posted by the thread starter and $r_i$ is the $i$-th ($i \ge 1$) reply.
In this paper, replies are classified into related replies (RR) and irrelevant replies (IR). An RR is defined as a direct reply to the root message that is not correlated with the other replies in the thread; all other replies are IR. The feature set is obtained by analyzing a certain number of replies. Rough set theory takes relational databases as its research object and abstracts a relational database into an information system.
The rough set model for the online forum reply classification system is defined as follows.
Definition 1: An online forum reply classification system is a 4-tuple $S = \langle U, R = C \cup D, V, f \rangle$, where $U$ is a set of replies; $R$ is a set of attributes, with $C$ the set of reply (condition) attributes and $D$ the decision attribute, whose value is the type of the reply; $V$ is the domain of the attributes; and $f$ is an information function that gives the value of each attribute for each $x$ in $U$.
Definition 2 (indiscernibility relations in $U$): Let $P$ be a set of attributes, $P \subseteq R$, and let $X, Y \in U$. If $f(X, a) = f(Y, a)$ for each $a \in P$, we say $X$ and $Y$ are indiscernible, that is, $ind(P) = \{(X, Y) \in U^2 \mid \forall a \in P,\ f(X, a) = f(Y, a)\}$. Here $ind(P)$ is an indiscernibility (equivalence) relation, and the attribute set $P$ can be regarded as the name of a piece of knowledge carrying this equivalence relation.
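As an illustration of Definition 2, the short Python sketch below partitions a toy set of replies into the equivalence classes of $ind(P)$; the attribute names follow Section IV-B, and the sample values are illustrative only.

from collections import defaultdict

# Toy universe U: each reply is described by its condition attribute values.
U = [
    {"A0": 1, "A1": 2, "A2": 0, "A3": 0, "A4": 89,  "A5": 0},
    {"A0": 1, "A1": 2, "A2": 0, "A3": 0, "A4": 150, "A5": 0},
    {"A0": 0, "A1": 0, "A2": 1, "A3": 0, "A4": 90,  "A5": 1},
]

def ind_classes(objects, P):
    # X and Y fall into the same class iff f(X, a) = f(Y, a) for every a in P.
    classes = defaultdict(list)
    for x in objects:
        classes[tuple(x[a] for a in P)].append(x)
    return list(classes.values())

# Under P = {A0, A2}, the first two replies are indiscernible.
print(ind_classes(U, ["A0", "A2"]))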
An online forum reply classification system can thus be regarded as a knowledge base system: when a reply needs to be analyzed, the corresponding knowledge base rule is selected according to $ind(P)$ and the classification result is obtained. According to Definition 1, the attribute set includes the reply attribute set and the decision attribute set. Research in [5] shows that the number of overlapping words between the thread title and a reply is not helpful for classification. Through analyzing a large number of replies, we define the following reply attribute set (attributes $A_3$, $A_4$, and $A_5$ are new content-related features):
$C = A_0 \cup A_1 \cup A_2 \cup A_3 \cup A_4 \cup A_5$
The meanings of the features are as follows.
$A_0$: whether the reply is posted by the thread starter; 1 means yes, 0 means no.
$A_1$: the interval, in replies, between the same author's previous reply and the current one.
$A_2$: quotation type of the reply; 0 for no quotation, 1 for quoting the root message, 2 for quoting another reply.
$A_3$: whether the reply contains feature words showing that it is not a relevant reply. We define 12 feature words: 顶 (bump), up, ding, 同问 (same question), 好贴 (good post), 收藏 (collecting), 帮顶 (help to bump), and so on.
$A_4$: the number of words in the reply.
$A_5$: the number of question marks in the reply.
The decision attribute $D$ of the decision table is 1 for RR and 0 for IR.
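A hedged sketch of how one reply could be mapped to a row of the decision table follows; the Reply fields, the helper name featurize, and the subset of feature words are illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass

FEATURE_WORDS = {"顶", "up", "ding", "同问", "好贴", "收藏", "帮顶"}  # subset of the 12
NO_PREVIOUS = 9999  # sentinel seen in Table 1 when the author has no earlier reply

@dataclass
class Reply:
    author: str
    text: str
    quotes: int           # 0 = none, 1 = root message, 2 = another reply
    index_in_thread: int

def featurize(reply, thread_starter, prev_index_of_author):
    a0 = 1 if reply.author == thread_starter else 0               # A0
    a1 = (reply.index_in_thread - prev_index_of_author            # A1
          if prev_index_of_author is not None else NO_PREVIOUS)
    a2 = reply.quotes                                             # A2
    a3 = 1 if any(w in reply.text for w in FEATURE_WORDS) else 0  # A3
    a4 = len(reply.text.split())  # A4; real Chinese text would need word segmentation
    a5 = reply.text.count("?") + reply.text.count("？")           # A5
    return [a0, a1, a2, a3, a4, a5]

print(featurize(Reply("u2", "同问 up", 0, 3), "u1", None))  # -> [0, 9999, 0, 1, 2, 0]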
The steps of finding decision rules from the decision table are as follows.
Data Preprocessing. It includes deleting repeated records, filling missing data and discretization.
Reduction of Attributes. This step deletes unnecessary or unimportant attributes, maintaining the classification capacity unchanged.
Reduction of Values. In this step, the decision table is further simplified on the basis of reduction of attributes.
Obtaining Logic Rules. Combined with the logic meanings of the attributes, the rules derived from the reduction of values are analyzed to form logic rules, which are validated later.
C. Classification Model Adding Ensemble Learning
Here, ensemble learning is combined with the aforementioned rough-set-based model. We use the Bagging method of ensemble learning.
The principle of the Bagging algorithm is as follows. Given a data set $L = \{(x_1, y_1), \dots, (x_m, y_m)\}$ and a basic learner $h(x, L)$, where $x$ is the input, we use $h(x, L)$ to predict $y$. Now suppose we have a sequence of data sets $\{L_k\}$, each composed of $m$ independent observations drawn from the same distribution as $L$. Our goal is to use $\{L_k\}$ to obtain a better learner than the one trained on a single data set, so we use a sequence of learners $\{h(x, L_k)\}$. The pseudo-code description of the Bagging algorithm is as follows.
Algorithm 1: Bagging algorithm
Input: training set $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, a weak learner WeakLearn, and the maximum number of training rounds $T$.
Output: an ensemble prediction model.
Step 1: Draw a training subset $S'$ of $m$ examples from the raw set $S$ by bootstrap sampling.
Step 2: Train the weak learner on $S'$ to obtain the round-$t$ hypothesis $h_t$.
Step 3: If $t < T$, set $t = t + 1$ and go to Step 1; else go to Step 4.
Step 4: Combine the hypotheses $h_1, h_2, \dots, h_T$ to get the final hypothesis $h_A(x) = \operatorname{sign}\left(\sum_t h_t(x)\right)$.
Step 5: End.
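A runnable sketch of Algorithm 1 follows, with scikit-learn's DecisionTreeClassifier standing in for the rough set weak learner (an assumption for illustration only; the paper's experiments use RIDAS). Labels are taken in {-1, +1} so that the sign of the vote sum implements the majority vote.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, T=30, sample_rate=1.0, seed=0):
    rng = np.random.default_rng(seed)
    m = int(len(X) * sample_rate)
    learners = []
    for _ in range(T):
        idx = rng.integers(0, len(X), size=m)  # bootstrap: sample with replacement
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    # h_A(x) = sign(sum_t h_t(x)): a majority vote for labels in {-1, +1}
    votes = np.sum([h.predict(X) for h in learners], axis=0)
    return np.sign(votes)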
V. EXPERIMENTAL RESULTS
We select the Early for Parents section of the Liba Forum (bbs./forum.php?forumId=123), the most popular Chinese childcare forum, as the experimental data. For research purposes, the discussion records were collected by crawling the forum over the period from April 26, 2007 to November 29, 2007. The downloaded collection contains 45,000 threads.
A. Experiment Based on Rough Set
Here, 137 threads are randomly selected from the collection. There are 5,557 replies in all, an average of 40.56 replies per thread. A human expert was hired to manually
identify the relevance of the replies to the thread-title in each thread. After the labeling process, 3,197 replies were RR and 2,360 replies were IR.
4,000 replies are randomly selected for training (72% of the total) and 1,557 replies for testing (28% of the total). In our experiments, the RIDAS system is adopted as the data mining platform; it was developed by the Institute of Computer Science and Technology, Chongqing University of Posts and Telecommunications, China [11], and integrates almost 30 classical rough set algorithms. Table 1 gives an example of a reply classification decision table.
TABLE 1. A REPLY CLASSIFICATION DECISION TABLE
A0 A1 A2 A3 A4 A5 D
1 2 0 0 89 0 0
1 24 0 0 150 0 1
0 2 0 0 226 0 1
1 1 0 0 18 1 0
0 0 0 0 143 0 1
1 3 0 1 140 0 0
1 2 2 0 54 0 0
0 9999 0 0 63 0 1
0 9999 1 0 90 0 1

Next, we use RIDAS to extract rules from the decision table as follows. Since every record has a value for every attribute, nothing is missing in the information table, and the filling of missing data is not needed. All data are integers and have a high degree
of dispersion. However, without discretization the attribute conditions of the generated rules would be isolated data points, which would lead to a high non-recognition rate, so discretization of the data is necessary. With the attribute-importance-based discretization algorithm [12], the broken sets listed in Table 2 are obtained. Intervals are easily read from each broken set; for example, the broken set '*, 0.5, *' denotes 2 intervals, [*, 0.5] and [0.5, *].
TABLE 2. VALUE INTERVALS OF ATTRIBUTES

Attribute  Count of Intervals  Broken Set
A0         2                   *, 0.5, *
A1         9                   *, 0.5, 1.5, 2.5, 3.5, 5.5, 6.5, 20.5, 5058.5, *
A2         3                   *, 0.5, 1.5, *
A3         2                   *, 0.5, *
A4         163                 *, 3.5, 4.5, …, 708.5, 724.5, 869.5, 928.5, 944, *
A5         3                   *, 0.5, 1.5, *

Then the MIBARK algorithm [12] is adopted for the reduction of attributes. All condition attributes are reserved, which means that the decisions deduced from them are consistent with those deduced from the full attribute set.
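The broken sets of Table 2 can be used directly as cut points. The sketch below maps a raw attribute value to its interval index; the attribute subset is chosen for brevity (A4's 163 intervals are elided), and the helper name discretize is illustrative.

import bisect

# Interior cut points from Table 2 ('*' endpoints omitted; A4 elided).
BROKEN_SETS = {
    "A0": [0.5],
    "A1": [0.5, 1.5, 2.5, 3.5, 5.5, 6.5, 20.5, 5058.5],
    "A2": [0.5, 1.5],
    "A3": [0.5],
    "A5": [0.5, 1.5],
}

def discretize(attr, value):
    # Returns the 0-based interval index; e.g. '*, 0.5, *' yields 2 intervals.
    return bisect.bisect_left(BROKEN_SETS[attr], value)

print(discretize("A0", 0))     # 0    -> interval [*, 0.5)
print(discretize("A1", 9999))  # 9999 -> last interval [5058.5, *]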
After the reduction of values, 959 rules in total are obtained from the reply pre-classification information table; here the heuristic value reduction algorithm [12] is adopted. Several of the generated rules are listed as follows:
rule1: $A_0 \in [*, 0.5) \wedge A_1 \in [5058.5, *] \wedge A_2 \in [*, 0.5) \wedge A_3 \in [*, 0.5) \wedge A_4 \in [29.5, 30.5) \wedge A_5 \in [*, 0.5) \rightarrow 1$ (RR)
rule2: $A_0 \in [*, 0.5] \wedge A_1 \in [0.5, 1.5] \wedge A_4 \in [151.5, 162.5) \rightarrow 0$ (IR)
rule3: $A_2 \in [0.5, 1.5) \wedge A_4 \in [39.5, 40.5) \rightarrow 1$ (RR)
rule4: $A_0 \in [*, 0.5) \wedge A_1 \in [1.5, 2.5) \wedge A_2 \in [1.5, *) \wedge A_3 \in [*, 0.5) \wedge A_5 \in [0.5, 1.5) \rightarrow 0$ (IR)
The rules can be analyzed with reference to Table 2.
Rule 1: if the reply is not posted by the thread starter, has no earlier reply by the same author, does not quote another reply, has no feature words, has a proper word length, and contains no question marks, then it is a related reply.
Rule 2: if the reply is not posted by the thread starter, has a reply by the same author immediately before it, and is rather long, then it is an irrelevant reply.
Rule 3: if the reply quotes the root message and has a proper word length, then it is a related reply.
Rule 4: if the reply is not posted by the thread starter, has no reply by the same author immediately before it, quotes another reply, has no feature words, and contains question marks, then it is an irrelevant reply.
From the analysis above, the decision rules produced by the rough set platform are consistent with the rules expected from the feature analysis. This confirms the validity of the algorithm for extracting RRs and the correctness of the experimental process. Analyzing the rules with large coverage further suggests that the solution is effective.
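As a concreteness check, the sketch below applies rule 3 (transcribed from above, with None standing for a '*' endpoint) to a reply's attribute vector; the helper names are illustrative, not part of the paper's system.

def in_interval(v, lo, hi):
    # None encodes the '*' endpoint of a broken-set interval.
    return (lo is None or v >= lo) and (hi is None or v < hi)

RULE3 = {"A2": (0.5, 1.5), "A4": (39.5, 40.5)}   # -> 1 (RR)

def fires(rule, attrs):
    return all(in_interval(attrs[a], lo, hi) for a, (lo, hi) in rule.items())

reply = {"A0": 0, "A1": 9999, "A2": 1, "A3": 0, "A4": 40, "A5": 0}
print(fires(RULE3, reply))  # True: quotes the root message and has proper length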
Experiments are carried out with the generated rules, and the results are shown in Table 3, from which we can conclude that applying rough sets to chat knowledge acquisition gives good results. Because it is the RRs that make up the chat knowledge base, we also report the recognition results for RR, which show good precision, recall, and F-score.
TABLE 3. EXPERIMENTAL RESULTS ON ROUGH SET

Rough Set
Total Recog. Rate         Right    Wrong    Unknown
  Recog. Number           1342     175      40
  Percentage              86.2%    11.2%    2.6%
RR Recog. Results         Precision 88.1%   Recall 93.3%   F-score 90.7%
B. Experiment Adding Ensemble Learning
On the basis of Section A, we combine the Bagging method of ensemble learning. The number of basic classifiers can be adjusted to the background of the practical application; here, 30 rough set classifiers are used as the basic classifiers of Bagging, combined as follows. First, sample the training set 30 times with a sampling rate of 0.8. Second, train a rough set classifier on each sampled subset. Finally, vote according to the principle of majority priority to obtain the final classification results, shown in Table 4.
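In terms of the Bagging sketch given after Algorithm 1, this experiment corresponds to the following hypothetical call; the synthetic arrays only stand in for the 4,000 discretized training replies.

import numpy as np

rng = np.random.default_rng(1)
X_train = rng.integers(0, 3, size=(4000, 6))   # stand-in for discretized A0..A5
y_train = np.where(X_train[:, 2] == 1, 1, -1)  # toy labels in {-1, +1}

# 30 base classifiers, each trained on a bootstrap sample of 80% of the data.
learners = bagging_train(X_train, y_train, T=30, sample_rate=0.8)
print(bagging_predict(learners, X_train[:5]))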
TABLE 4. EXPERIMENTAL RESULTS ADDING ENSEMBLE LEARNING

Rough Set + Bagging
Total Recog. Rate         Right    Wrong    Unknown
  Recog. Number           1396     160      1
  Percentage              89.7%    10.3%    ≈0
RR Recog. Results         Precision 89.5%   Recall 93.3%   F-score 91.4%
The experimental results above show that combining ensemble learning in the decision-making stage improves both the total recognition rate and the RR recognition results, which indicates that the combination of rough sets and ensemble learning is effective for chat knowledge acquisition.
Fig. 3 gives an example of an RR. Since the experimental data are all in Chinese, an English translation is also given here. Fig. 3 shows the chat knowledge obtained from a recognized thread; it will be stored as a record of the knowledge base and retrieved by the dialog management module.
标题:10天的宝宝便秘要怎么办呢？
描述:那么小，便秘怎么办呢？想给她吃金银花露可是我妈说要过一个月。看她这2天一直闹腾我也不好受了。已经尽量灌白开水了，还有什么办法吗？
1. 我们前天刚遇到过，用婴儿皂削成铅笔状头，一定要是圆的，在温水里润滑一下……
2. 买点合生元吃吃，我家宝宝吃了就很管用的
3. 让宝宝喝点蔬菜水看看
4. 给他吃煮熟的苹果水，挺有用的
5. 可能是奶粉的问题，可以换奶粉试试不要给宝宝吃什么金银花露，没有好处的……
Title: My 10-day-old baby is constipated. What can I do?
Description: She is so little; what can I do about the constipation? I wanted to give her some honeysuckle, but my mother said it is permitted only after one month. Watching her fuss these two days makes me sad too. I have given her as much plain boiled water as possible. What else can I do?
1. We just experienced this the day before yesterday. Shave baby soap into a pencil-like tip, make sure it is rounded, and moisten it in warm water...
2. Buy some BiosTime; it worked very well for my baby.
3. Let the baby try drinking some vegetable water.
4. Give him cooked apple water; it is quite helpful.
5. It may be a problem with the milk powder; try switching brands. Don't give the baby honeysuckle; it does no good...
FIG. 3. AN EXAMPLE OF RR
VI. CONCLUSIONS
Chatbots are now a hot topic in Internet applications. The manual construction of chatbot knowledge bases makes it difficult to adapt a chatbot to new domains, yet few current studies focus on the automatic construction of the knowledge base. This paper applies rough set and ensemble learning theory to system modeling and data analysis and obtains good results. However, since many attributes affect the quality of a reply, only some structural features and part of the content features are considered here.
To improve the classification results, efforts are needed in the following aspects. First, further mine the available content features. For example, reply A may be related to reply B, but because of the different habits of responders, A may not quote B; in such cases, new content features need to be mined. In addition, different forums have different styles and formats, so feature extraction should focus on characteristics shared by all forums to improve the portability of the algorithm. Finally, the extracted RRs are not suitable for being added directly to the knowledge base; how to design a reasonable storage structure that improves the retrieval efficiency of the dialog management module is a focus of future work.
REFERENCES
[1] J. Weizenbaum, "ELIZA—a computer program for the study of natural language communication between man and machine," Communications of the ACM, vol. 9, no. 1, pp. 36-45, 1966.
[2] R. S. Wallace. (2001) AIML overview [Online]. Available: /pandora/pics/wallaceaimltutoria1.html
[3] B. A. Shawar and E. Atwell, "Machine learning from dialogue corpora to generate chatbots," Expert Update Journal, vol. 6, no. 3, pp. 25-29, 2003.
[4] P. Tarau and E. Figa, "Knowledge-based conversational agents and virtual story-telling," in Proceedings of the 2004 ACM Symposium on Applied Computing, 2004, vol. 1, pp. 39-44.
[5] J. Z. Huang, M. Zhou, and D. Yang, "Extracting chatbot knowledge from online discussion forums," in Proceedings of the International Joint Conference on Artificial Intelligence, 2007, pp. 423-428.
[6] Z. Pawlak, "Rough sets," International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341-356, 1982.
[7] G. Y. Wang, J. Zhao, J. J. An, and Y. Wu, "Theoretical study on attribute reduction of rough set theory: in algebra view and information view," in Proceedings of the Third International Conference on Cognitive Informatics, 2004, pp. 148-155.
[8] T. G. Dietterich, "Machine learning research: four current directions," AI Magazine, vol. 18, no. 4, pp. 97-136, 1997.
[9] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[10] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197-227, 1990.
[11] G. Y. Wang, Z. Zheng, and Y. Zhang, "RIDAS—a rough set based intelligent data analysis system," in Proceedings of the First International Conference on Machine Learning and Cybernetics, 2002, pp. 646-649.
[12] G. Y. Wang, Rough Set Theory and Data Mining. Xi'an: Xi'an Jiaotong University Press, 2001.