Chinese Unknown Word Identification Based on Local Bigram Model
ZHUORAN WANG*, TING LIU+
Information Retrieval Laboratory, School of Computer Science and Technology,
Harbin Institute of Technology, P.O.Box 321, HIT, Harbin, P.R.China, 150001
* zrwang@ir., + tliu@ir.
Abstract
This paper presents a Chinese unknown word identification system based on a local bigram model. Our word segmentation system generally employs a statistics-based unigram model, but to identify unknown words we take advantage of their contextual information and apply a bigram model locally. By adjusting the interpolation coefficient derived from a smoothing method, we combine the two models of different dimensions. As a simplification of the bigram model, this method is both simple and feasible, since its algorithmic complexity is low and it does not require a large training corpus. Our experimental results show that the solution is effective.
Keywords: Unknown Word Identification; Chinese Word Segmentation; Local Bigram Model.
1.Introduction
Unlike Indo-European languages, Chinese does not use spaces to mark word boundaries, so word segmentation is the foundation of Chinese natural language processing (NLP). A lexicon-based system can perform quite well; however, unknown words that are not registered in the lexicon become the bottleneck of precision and recall. Simply enlarging the vocabulary will not work, because unknown words are too vast and varied to be collected exhaustively. Unknown word identification (UWI) therefore plays a significant role in word segmentation.
In this paper, we illustrate an approach to recognizing unknown words that is based on statistics and probabilities and uses minimal linguistic knowledge. First, the specification of an unknown word candidate is triggered under certain conditions (for example, a Chinese character that can be used as a surname triggers the specification of a person name candidate), and an initial probability of this word is calculated from its internal formation. Then its contextual information is taken into consideration by applying a bigram model locally, whose details are discussed in Section 3. Finally, the specified unknown word candidates, together with normal lexicon words, go through a searching procedure to find the solution with the maximum probability, or to generate the N best results for further processing [1][2].
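As an illustration of the searching procedure, the sketch below finds the single segmentation with the maximum product of word probabilities over a sentence. The word_prob interface, the maximum word length and the function name are our own assumptions rather than the system's actual decoder, and N-best generation as well as the local bigram scoring of Section 3 are omitted.

```python
import math

def best_segmentation(sentence, word_prob, max_len=8):
    """Illustrative 1-best search: pick the segmentation that maximizes the
    product of word probabilities. word_prob(w) is assumed to return P(w)
    for lexicon words and unknown-word candidates, and 0 otherwise."""
    n = len(sentence)
    # best[k] holds the best (log-score, segmentation) ending at position k
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        score_i, seg_i = best[i]
        if score_i == -math.inf:
            continue
        for j in range(i + 1, min(i + max_len, n) + 1):
            w = sentence[i:j]
            p = word_prob(w)
            if p > 0 and score_i + math.log(p) > best[j][0]:
                best[j] = (score_i + math.log(p), seg_i + [w])
    return best[n][1]
```

In the actual system the scores of unknown word candidates are adjusted by the local bigram model before this search, and the N best paths are kept instead of a single one.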
As our word segmentation system is the first level of a Chinese natural language understanding platform based on a multilayer search mechanism (the CUP Platform of IR-Lab [2]), we only involve information at the word level. The unknown words that we identify include Chinese person names (PER), location names (LOC), numerals (NUM) and time expressions (TIME). Although organization names are frequent unknown words, their recognition usually needs more linguistic knowledge, so they are processed by an upper Named-Entity Recognition (NER) system rather than the word segmentation system, which is beyond the scope of this paper. The evaluation is conducted on a large corpus segmented and tagged manually, and the experimental results show the efficiency of our method. The UWI precision for PER, LOC, NUM and TIME on the test set is 82.1%, 80.5%, 91.7% and 95.5%, respectively, and the recall is 87.0%, 71.3%, 90.6% and 93.7%, respectively. 75.1% of the results match the gold standard perfectly when the five best results for each input are kept.
2.Related Work
Recently, many techniques have been proposed for Chinese unknown word identification. Lv et al. [3] presented a rule-based method, which used unknown words' internal structure and contextual relations to develop its evaluation functions and resolved them by dynamic programming. Tan et al. [4] expounded a transformation-based machine learning approach. They focused on the problem of Chinese place name recognition, with an idea similar to the Brill algorithm [5], and eliminated many incorrectly recognized sequences by transformation rules. Zhang et al. [6] revealed a universal method based on role tagging. They tagged the roles of words' component tokens by applying a Viterbi algorithm in the fashion of a POS tagger, so their system could recognize several types of unknown words.
In the last few years, some class-based models have been introduced. For example, Sun et al. [7] integrated word segmentation and named entity identification into a unified framework and developed the Class-based Language Model [8] with their own features. Zhang et al. [9] improved their role-based model and added class-based segmentation. Fu and Luke [10] also combined a class-based model with their word juncture models and word-formation patterns for UWI.
We were enlightened by these approaches, and we explore a model that fits our own word segmentation system and NLP platform.
3.Local Bigram Model for UWI
3.1 The Unigram Model’s Dilemma
Research shows that ambiguities occur 1.2 times per 100 Chinese characters, and the ratio of intersection ambiguities (a word forms an ambiguity with the word before or after it) to combination ambiguities (the character sequence forming the word could have another combination in a different context) is 12:1 [11]. A unigram model can therefore work rather well, as it solves intersection ambiguities effectively, and these are the major problem for word segmentation. But for UWI it is not always competent. The formation of Chinese unknown words (especially Chinese person names) is so complex that it can generate several types of ambiguities, as Table 1 shows. It can be seen that information from the word formation pattern itself is not adequate. In Case 1 or Case 2 it may still work, but if the input sentence is like Case 3 or Case 4, the corresponding probabilities will probably be biased towards the lexicon words. Furthermore, in Case 5 we could not even judge whether the candidate is a person name or not. Considering that, contextual information is indispensable to UWI.

1. Intersection with the previous word: 䖭䞠/ / /⚜ /ⱘ/䘫⠽/DŽ (There are relics of Martyr Guan Tianpei here.)
2. Intersection with the posterior word: 䙧 /ㄝ/乚 (Deng Xiaoping etc.)
3. Surname combined with a lexicon word: ⥟䳲 (Wang Xiaguang)
4. Being a lexicon word simultaneously: 咢 (Li Ming, which also means dawn)
5. Combination ambiguity:
   a) 䆄㗙/ Ё/ 䘧DŽ (Reported by Quan Zhong.) vs. Ҫ/ / / /Ё/䵊 /DŽ (His ten shots all hit the bull's-eye.)
   b) 1000/ԭ/ϛ/ (more than 10 million yuan) vs. 䭓/ԭϛ / /ϔ/Ͼ/ /DŽ (Yu Wangyuan, the leader of the village, has a son.)

Table 1: Examples of Ambiguities Related to Person Names
A rule-based weighting is efficient, but for a statistics-based model that outputs the N best results for further refining, a weight added illogically may confuse the evaluation process, which originally obeys probability, when it scores and ranks the results. A bigram model was then considered, and indeed it performs better. Yet training the transition probabilities of every pair of words calls for an enormous corpus, which is not always at hand, and if it were used for rough segmentation, decoding the N best results would cost a great deal of time and memory. Hence we decided to find a compromise.
3.2 Local Bigram Model
Observing P(W) in a bigram model, the conditional probability is generally computed under the Maximum Likelihood paradigm as:
P(w_i | w_{i-1}) ≈ C(w_{i-1}, w_i) / C(w_{i-1})        (1)
where C(w_{i-1}, w_i) is the frequency of co-occurrences of w_{i-1} and w_i, and C(w_{i-1}) is the total frequency of w_{i-1} in the corpus. Here we prefer to write it as:
P(w_i | w_{i-1}) ≈ P(w_{i-1}, w_i) / P(w_{i-1})        (2)
by dividing the numerator and denominator by the same factor C(Total), the total number of words in the corpus, for the convenience of the later derivation. But if P(w_{i-1}, w_i) = 0 in the training corpus, the element whose P(w_i | w_{i-1}) is 0 will make the model perform terribly on test data. Considering this, many smoothing methods have been devised, one of which is interpolation. With this method the unigram and bigram are combined, and Eq. (2) is rewritten as:

P(w_i | w_{i-1}) ≈ λ · P(w_{i-1}, w_i) / P(w_{i-1}) + (1 − λ) · P(w_i)        (3)
in which λ (0 ≤ λ ≤ 1) is usually determined by optimizing on held-out data.
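As a minimal sketch of Eqs. (1)-(3), the following code assumes a pre-segmented training corpus given as lists of words; the function names are illustrative, not taken from the paper.

```python
from collections import Counter

def train_counts(sentences):
    """Collect unigram and bigram counts from a pre-segmented corpus,
    where each sentence is a list of words."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi

def interpolated_prob(w_prev, w, uni, bi, lam):
    """Eq. (3): lam * P(w_prev, w) / P(w_prev) + (1 - lam) * P(w)."""
    total = sum(uni.values())                                          # C(Total)
    p_bigram = bi[(w_prev, w)] / uni[w_prev] if uni[w_prev] else 0.0   # Eq. (1)/(2)
    p_unigram = uni[w] / total
    return lam * p_bigram + (1.0 - lam) * p_unigram
```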
According to this principle, we find a way to combine the unigram model, which is the integral model of our word segmentation system, with the bigram applied locally for UWI. First, we transform Eq. (3) by dividing both sides of the equation by (1 − λ) and defining P′ = P / (1 − λ):
P′(w_i | w_{i-1}) = λ/(1 − λ) · P(w_{i-1}, w_i) / P(w_{i-1}) + P(w_i)        (4)
Let:

Ŵ = argmax_W ∏_i P′(w_i | w_{i-1})        (5)
which is the recognized word sequence. Then we assume that, within a sentence, the co-occurrence probabilities of all the remaining parts are 0, so they keep only the factors P(w_i); the exceptions are the identified unknown word candidates and the local areas associated with them, whose probabilities depend on their previous words. Here we must also define the "local area": we consider an unknown word together with its one previous word or one posterior word (both are the longest) as a local area. After that, the general parts of the input sequence have P′(w_i | w_{i-1}) = P(w_i), which just equals the unigram probability. In contrast, take the sentence meaning "Professor Meng Xi-An" as an example: the unknown word candidate "Meng Xi-An" has

P′(Meng Xi-An | professor) = λ/(1 − λ) · P(professor, Meng Xi-An) / P(professor) + P(Meng Xi-An)

which is enlarged by its previous word "professor", since "professor" is an appellation word that often appears just before a person name. In this way, the "local bigram model" is created.
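A sketch of how Eq. (4) could be applied only inside local areas is given below; the is_local_area flag and the count tables are illustrative assumptions, and outside a local area the score falls back to the unigram probability P(w), as described above. The later class-based version (Eq. (6)) replaces the exact-word bigram term shown here.

```python
def local_bigram_score(p_w, w_prev, w, uni, bi, lam, is_local_area):
    """Eq. (4): P'(w | w_prev) = lam/(1-lam) * P(w_prev, w)/P(w_prev) + P(w).
    p_w is P(w): the unigram probability for a lexicon word, or the initial
    probability of an unknown-word candidate (Eq. (7) below). The bigram term
    is only added inside a local area; elsewhere P'(w | w_prev) = P(w)."""
    if not is_local_area or uni[w_prev] == 0:
        return p_w
    total = sum(uni.values())
    p_joint = bi[(w_prev, w)] / total
    p_prev = uni[w_prev] / total
    return (lam / (1.0 - lam)) * (p_joint / p_prev) + p_w
```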
On second thoughts, to avoid the data sparseness that remains serious when the probabilities are computed over exact words, we actually apply a clustering method. Besides unknown words, which take their types as their classes, we also define several word classes that have a close relation with them; Table 2 shows the details. We ignore the effects of all other words on the unknown word candidates, to reduce the resources used for counting and storage.

Unknown Word | Related Word Class | Expression | Example
PER          | Appellation        | NC         | 䆄㗙⥟ (Reporter Wang Nan)
LOC          | Preposition        | P          | ≜䰇 (in Shenyang)
             | Orientation        | F          | Ⓖ 䚼 (northern part of Harbin)
NUM          | Postfix of numeral | PX         | ϝ (more than three thousand)
             | Quantifier         | Q          | ϔⱒ (a hundred pieces of)
TIME         | None               |            |

Table 2: The Classes Having Dependencies with Unknown Words
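One possible encoding of Table 2 and of the class assignment is sketched below, purely as an illustration; the tag symbols come from the Expression column, while the helper function and its signature are our own.

```python
# Which context-word classes are counted as bigram neighbours for each
# unknown-word type; TIME has no related class (Table 2).
RELATED_CLASSES = {
    "PER":  {"NC"},        # appellations such as "reporter", "professor"
    "LOC":  {"P", "F"},    # prepositions and orientation words
    "NUM":  {"PX", "Q"},   # numeral postfixes and quantifiers
    "TIME": set(),
}

def word_class(pos_tag, uwi_type=None):
    """Unknown-word candidates take their type (PER/LOC/NUM/TIME) as their
    class; related context words take their tag from Table 2; every other
    word is ignored (None) so that only a small count table is needed."""
    if uwi_type is not None:
        return uwi_type
    related = {c for classes in RELATED_CLASSES.values() for c in classes}
    return pos_tag if pos_tag in related else None
```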
For the above classes, we can assume that, within the defined local area, if a class C_i contains a word w_i, then w_i belongs and only belongs to C_i. So we can suppose P(w_i) = P(w_i, C_i). Accordingly, for an unknown word candidate, Eq. (4) is transformed to:
P′(w_i | w_{i-1}) = λ/(1 − λ) · P(w_{i-1}, w_i) / P(w_{i-1}) + P(w_i)
                  = λ/(1 − λ) · P(w_{i-1}, C_{i-1}, w_i, C_i) / P(w_{i-1}) + P(w_i)

If the complete probability formula is used, then:

P′(w_i | w_{i-1}) = λ/(1 − λ) · P(C_{i-1}, C_i) · P(w_{i-1}, w_i | C_{i-1}, C_i) / [P(C_{i-1}) · P(w_{i-1} | C_{i-1})] + P(w_i)

Suppose w_{i-1} and w_i are independent given C_{i-1}, C_i:

P′(w_i | w_{i-1}) = λ/(1 − λ) · P(C_{i-1}, C_i) · P(w_{i-1} | C_{i-1}, C_i) · P(w_i | C_{i-1}, C_i) / [P(C_{i-1}) · P(w_{i-1} | C_{i-1})] + P(w_i)

Further, suppose w_i is independent of C_j (i ≠ j):

P′(w_i | w_{i-1}) = λ/(1 − λ) · P(C_{i-1}, C_i) · P(w_{i-1} | C_{i-1}) · P(w_i | C_i) / [P(C_{i-1}) · P(w_{i-1} | C_{i-1})] + P(w_i)
                  = λ/(1 − λ) · P(C_{i-1}, C_i) · P(w_i | C_i) / P(C_{i-1}) + P(w_i)

We can also express this as:

P′(w_i | w_{i-1}) = [λ/(1 − λ) · P(C_{i-1}, C_i) / (P(C_{i-1}) · P(C_i)) + 1] · P(w_i)        (6)
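Eq. (6) could be computed from class-level counts roughly as in the sketch below; class unigram and bigram counts collected over the local areas are assumed to be available, and when either class is unseen the score falls back to P(w_i).

```python
def class_bigram_score(p_w, c_prev, c, class_uni, class_bi, lam):
    """Eq. (6): P'(w_i | w_{i-1}) =
       [lam/(1-lam) * P(C_{i-1}, C_i) / (P(C_{i-1}) * P(C_i)) + 1] * P(w_i),
    where p_w is the candidate's P(w_i) and the class probabilities are
    estimated from counts over the defined local areas."""
    if not class_uni.get(c_prev) or not class_uni.get(c):
        return p_w
    total = sum(class_uni.values())
    p_joint = class_bi.get((c_prev, c), 0) / total
    p_prev = class_uni[c_prev] / total
    p_c = class_uni[c] / total
    return ((lam / (1.0 - lam)) * p_joint / (p_prev * p_c) + 1.0) * p_w
```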
The initial probability P(w_i) of the candidate is calculated as:

P(w_i) = P(w_i | C_i) · P(C_i)        (7)

The conditional probability P(w_i | C_i) is generated from its internal formation patterns:

P(w_i | C_i) = P_Beg(s_0 | C_i) · P_End(s_m | C_i) · ∏_{j=1}^{m-1} P_Mid(s_j | C_i)        (8)

where s_0, …, s_m are the component tokens of w_i, and Beg, Mid and End denote the positions they occupy within the word.
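A sketch of Eqs. (7) and (8) is given below; the probability tables for P(C), P_Beg, P_Mid and P_End are assumed to be trained elsewhere, and the candidate is assumed to have at least two component tokens.

```python
def initial_prob(tokens, c, p_class, p_beg, p_mid, p_end):
    """Eq. (7): P(w) = P(w | C) * P(C), with P(w | C) built from the
    position-dependent token probabilities of Eq. (8). The dictionaries
    p_beg / p_mid / p_end map (token, class) to a probability."""
    p_w_given_c = p_beg.get((tokens[0], c), 0.0) * p_end.get((tokens[-1], c), 0.0)
    for s in tokens[1:-1]:                      # middle tokens s_1 ... s_{m-1}
        p_w_given_c *= p_mid.get((s, c), 0.0)
    return p_w_given_c * p_class.get(c, 0.0)
```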