Learning New Compositions from Given Ones
Ji Donghong
Department of Computer Science
Tsinghua University
Beijing, 100084, P. R. China
Email: jdh@s1000e.cs.tsinghua.edu
He Jun
Department of Computer Science
Harbin Institute of Technology
Email: hj@pact518.hit.edu
Huang Changning
Department of Computer Science
Tsinghua University
Beijing, 100084, P. R. China
Email: hcn@mail.tsinghua.edu
Abstract: In this paper, we study the problem of learning new compositions of words from given ones with a specific syntactic structure, e.g., A-N or V-N structures. We first cluster words according to the given compositions, then construct a cluster-based compositional frame for each word cluster, which contains both new and given compositions relevant to the words in the cluster. In contrast to other methods, we don't pre-define the number of clusters, and we formalize the problem of clustering words as a non-linear optimization problem, in which we specify the environments of words based on the word clusters to be determined, rather than on their neighboring words. To solve the problem, we make use of a kind of cooperative evolution strategy to design an evolutionary algorithm.
Key words: word compositions, evolutionary learning, clustering, natural language processing.
1. Introduction
Word compositions have long been a concern in lexicography (Benson et al. 1986; Miller et al. 1995), and now, as a specific kind of lexical knowledge, they have been shown to play an important role in many areas of natural language processing, e.g., parsing, generation, lexicon building, word sense disambiguation, and information retrieval (e.g., Abney 1989, 1990; Benson et al. 1986; Yarowsky 1995; Church and Hanks 1989; Church, Gale, Hanks, and Hindle 1989). But due to the huge number of words, it is impossible to list all compositions between words by hand in dictionaries. So an urgent problem arises: how to automatically acquire word compositions?
In general, word compositions fall into two categories: free compositions and bound compositions, i.e., collocations. Free compositions refer to those in which words can be replaced by other similar ones, while in bound compositions, words cannot be replaced freely (Benson 1990). Free compositions are predictable, i.e., their reasonableness can be determined according to the syntactic and semantic properties of the words in them, while bound compositions are not predictable, i.e., their reasonableness cannot be derived from the syntactic and semantic properties of the words in them (Smadja 1993).
Now with the availability of large-scale corpora, automatic acquisition of word compositions, especially word collocations, from them has been studied extensively (e.g., Choueka et al. 1988; Church and Hanks 1989; Smadja 1993). The key of these methods is to make use of some statistical means, e.g., frequencies or mutual information, to quantify the compositional strength between words. The methods are more appropriate for retrieving bound compositions, and less appropriate for retrieving free ones. This is because in free compositions, words are related with each other in a looser way, which may result in the invalidity of mutual information and other statistical means in distinguishing reasonable compositions from unreasonable ones.
In this paper, we start from a different point to explore the problem of automatic acquisition of free compositions. Although we cannot list all free compositions, we can select some typical ones, such as those specified in some dictionaries (e.g., Benson 1986; Zhang et al. 1994). According to the properties held by free compositions, we can reasonably suppose that the selected compositions provide strong clues for others. Furthermore, we suppose that words can be classified into clusters, with the members of each cluster similar in their compositional ability, which can be characterized as the set of words able to combine with them to form meaningful phrases. Thus any given composition, although literally specifying the relation between two words, suggests a relation between two clusters. So for each word (or cluster), there exist some word clusters such that the word (or the words in the cluster) can, and only can, combine with the words in those clusters to form meaningful phrases. We call the set of these clusters the compositional frame of the word (or the cluster).
A seemingly plausible method to determine compositional frames is to make use of pre-defined semantic classes in some thesauri (e.g., Miller et al. 1993; Mei et al. 1996). The rationale behind the method is the assumption that if one word can be combined with another to form a meaningful phrase, then words similar to them in meaning can also be combined with each other. But it has been shown that similarity between words in meaning doesn't correspond to similarity in compositional ability (Zhu 1982). So adopting semantic classes to construct compositional frames will result in considerable redundancy.
An alternative to semantic classes is word clusters based on distributional environments (Brown et al., 1992), where the environment in general refers to the words surrounding a given word (e.g., Hatzivassiloglou et al., 1993; Pereira et al., 1993), or classes of them (Bensch et al., 1995), or more complex statistical means (Dagan et al., 1993). According to the properties of the clusters in compositional frames, the clusters should be based on the environment, which, however, is narrowed here to the given compositions. Because the given compositions are listed by hand, it is impossible to make use of statistical means to form the environment; the remaining choices are surrounding words or classes of them.
Pereira et al. (1993) put forward a method to cluster nouns in V-N compositions, taking the verbs which can combine with a noun as its environment. Although its goal is to deal with the problem of data sparseness, it suffers from the problem itself. A strategy to alleviate the effects of the problem is to cluster nouns and verbs simultaneously. But as a result, the problem of word clustering becomes a bootstrapping one, or a non-linear one: the environment is also to be determined. Bensch et al. (1995) proposed a definite method to deal with the generalized version of the non-linear problem, but it suffers from the problem of local optimization.
In this paper, we focus on A-N compositions in Chinese, and explore the problem of learning new compositions from given ones. In order to cope with the problem of sparseness, we take adjective clusters as nouns' environments, and noun clusters as adjectives' environments. In order to avoid locally optimal solutions, we propose a cooperative evolution strategy. The method uses no specific knowledge of the A-N structure, and can be applied to other structures.
The remainder of the paper is organized as follows: in Section 2, we give a formal description of the problem. In Section 3, we discuss a kind of cooperative evolution strategy to deal with the problem. In Section 4, we explore the problem of parameter estimation. In Section 5, we present our experiments and the results as well as their evaluation. In Section 6, we give some conclusions and discuss future work.
2. Problem Setting
Given an adjective set and a noun set, suppose for each noun some adjectives are listed as its compositional instances¹. Our goal is to learn new reasonable compositions from the instances. To do so, we cluster nouns and adjectives simultaneously and build a compositional frame for each noun.
Suppose A is the set of adjectives and N is the set of nouns. For any a ∈ A, let f(a) ⊆ N be the instance set of a, i.e., the set of nouns in N which can be combined with a; and for any n ∈ N, let g(n) ⊆ A be the instance set of n, i.e., the set of adjectives in A which can be combined with n. We first give some formal definitions in the following:
Definition 1 partition
Suppose U is a non-empty finite set. We call <U1, U2, ..., Uk> a partition of U if:

i) for any Ui and Uj, i ≠ j, Ui ∩ Uj = φ;

ii) U = ∪_{1≤i≤k} Ui.

We call Ui a cluster of U.
Suppose U = <A1, A2, ..., Ap> is a partition of A, V = <N1, N2, ..., Nq> is a partition of N, and f and g are defined as above. For any Ni, let g(Ni) = {Aj : ∃n ∈ Ni, Aj ∩ g(n) ≠ φ}, and for any n ∈ Nk, let δ<U,V>(n) = |{a : ∃Aj ∈ g(Nk), a ∈ Aj} − g(n)|. Intuitively, δ<U,V>(n) is the number of the new instances relevant to n. We define the general learning amount as the following:
Definition 2 learning amount δ<U,V>

δ<U,V> = Σ_{n∈N} δ<U,V>(n)

¹ The compositional instances of the adjectives can be inferred from those of the nouns.
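To make the definition concrete, the following minimal sketch (ours, not from the paper) computes the learning amount; the names U, V and g are assumed to stand for a list of adjective clusters, a list of noun clusters, and a dictionary mapping each noun to its given adjective instance set.

```python
# Minimal sketch (illustrative only) of the learning amount delta<U,V>.
# Assumptions: U is a list of adjective clusters (sets of adjectives),
# V is a list of noun clusters (sets of nouns), and g maps each noun to
# its given instance set of adjectives.

def learning_amount(U, V, g):
    total = 0
    for Nk in V:
        # g(Nk): the adjective clusters intersecting g(n) for some n in Nk
        frame = [Aj for Aj in U if any(Aj & g[n] for n in Nk)]
        # all adjectives suggested for the nouns in Nk by those clusters
        suggested = set().union(*frame) if frame else set()
        for n in Nk:
            # delta<U,V>(n): suggested instances not already given for n
            total += len(suggested - g[n])
    return total

# Example: with U = [{"big", "small"}, {"red"}], V = [{"apple", "ball"}] and
# g = {"apple": {"big", "red"}, "ball": {"small"}}, the frame suggests
# {"big", "small", "red"}, so learning_amount(U, V, g) == 1 + 2 == 3.
```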
Based on the partitions of both nouns and adjectives, we can define the distance between nouns and that between adjectives.
Definition 3 distance between words
For any a ∈ A, let f_V(a) = {Ni : 1 ≤ i ≤ q, Ni ∩ f(a) ≠ φ}; for any n ∈ N, let g_U(n) = {Ai : 1 ≤ i ≤ p, Ai ∩ g(n) ≠ φ}. For any two nouns n1 and n2 and any two adjectives a1 and a2, we define the distances between them respectively as the following:

i) dis_U(n1, n2) = 1 − |g_U(n1) ∩ g_U(n2)| / |g_U(n1) ∪ g_U(n2)|

ii) dis_V(a1, a2) = 1 − |f_V(a1) ∩ f_V(a2)| / |f_V(a1) ∪ f_V(a2)|

According to the distances between words, we can define the distances between word sets.
Definition 4 distance between word sets
Given any two adjective sets X1, X2 ⊆ A and any two noun sets Y1, Y2 ⊆ N, their distances are:

i) dis_V(X1, X2) = max_{a1∈X1, a2∈X2} {dis_V(a1, a2)}

ii) dis_U(Y1, Y2) = max_{n1∈Y1, n2∈Y2} {dis_U(n1, n2)}

Intuitively, the distance between word sets refers to the biggest distance between words respectively in the two sets.
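As an illustration (ours, not from the paper), the non-linear distances of Definitions 3 and 4 might be computed as follows; here inst stands for f or g (the given instance sets) and partition for the partition of the other word set (V or U), both assumed parameter names.

```python
# Minimal sketch (illustrative only) of the non-linear distances.
# inst: dict mapping a word to its given instance set (f for adjectives,
# g for nouns); partition: the current partition of the other word set
# (V for adjectives, U for nouns), given as a list of sets.

def projection(word, inst, partition):
    # indices of the clusters that intersect the word's instance set
    return frozenset(i for i, C in enumerate(partition) if C & inst[word])

def word_distance(w1, w2, inst, partition):
    # Definition 3: 1 - |p1 ∩ p2| / |p1 ∪ p2| over projected clusters
    p1 = projection(w1, inst, partition)
    p2 = projection(w2, inst, partition)
    union = p1 | p2
    return 1.0 if not union else 1.0 - len(p1 & p2) / len(union)

def set_distance(S1, S2, inst, partition):
    # Definition 4: the biggest pairwise word distance between the two sets
    return max(word_distance(w1, w2, inst, partition)
               for w1 in S1 for w2 in S2)
```

Here dis_U(n1, n2) corresponds to word_distance(n1, n2, g, U) and dis_V(X1, X2) to set_distance(X1, X2, f, V); returning 1.0 when both projections are empty is our own convention.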
We formalize the problem of clustering nouns and adjectives simultaneously as an optimization problem with some constraints:

(1) Determine a partition U = <A1, A2, ..., Ap> of A and a partition V = <N1, N2, ..., Nq> of N, where p, q > 0, which satisfy i) and ii) and minimize δ<U,V>.

i) for any a1, a2 ∈ Ai, 1 ≤ i ≤ p, dis_V(a1, a2) < t1; for Ai and Aj, 1 ≤ i ≠ j ≤ p, dis_V(Ai, Aj) ≥ t1;

ii) for any n1, n2 ∈ Ni, 1 ≤ i ≤ q, dis_U(n1, n2) < t2; for Ni and Nj, 1 ≤ i ≠ j ≤ q, dis_U(Ni, Nj) ≥ t2;

where 0 ≤ t1, t2 ≤ 1.
Intuitively, conditions i) and ii) make the distances between words within clusters smaller and those between different clusters bigger, and to minimize δ<U,V> means to minimize the distances between the words within clusters. In fact, (U, V) can be seen as an abstraction model over the given compositions, and t1, t2 can be seen as its abstraction degree. Consider the two special cases. One is t1 = t2 = 0, i.e., the abstraction degree is the lowest; the result is that each noun forms a cluster and each adjective forms a cluster, which means that no new compositions are learned. The other is t1 = t2 = 1, i.e., the abstraction degree is the highest; a possible result is that all nouns form one cluster and all adjectives form one cluster, which means that all possible compositions, reasonable or unreasonable, are learned. So we need to estimate appropriate values for the two parameters, in order to make an appropriate abstraction over the given compositions, i.e., to make the compositional frames contain as many reasonable compositions as possible, and as few unreasonable ones as possible.
3. Cooperative Evolution
Since the beginning of evolutionary algorithms, they have been applied in many areas of AI (Davis et al., 1991; Holland 1994). Recently, as a new and powerful learning strategy, cooperative evolution has gained much attention in solving complex non-linear problems. In this section, we discuss how to deal with problem (1) based on this strategy.
According to the interaction between adjective clusters and noun clusters, we adopt the following cooperative strategy: after establishing the preliminary solutions, for each preliminary solution, we optimize N's partition based on A's partition, then optimize A's partition based on N's partition, and so on, until the given conditions are satisfied.
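The loop below is a minimal sketch of this alternating scheme; the helpers optimize_noun_partition, optimize_adjective_partition and satisfied are hypothetical names standing in for the mutation steps of Section 3.2 and the conditions in (1).

```python
# Minimal sketch (illustrative only) of the cooperative strategy:
# alternately re-optimize one side's partition given the other side's,
# until the constraints in (1) hold.  All three helpers are assumed.

def cooperate(U, V, optimize_noun_partition, optimize_adjective_partition,
              satisfied, max_rounds=100):
    for _ in range(max_rounds):                 # guard against non-termination
        if satisfied(U, V):
            break
        V = optimize_noun_partition(V, U)       # optimize N's partition given A's
        U = optimize_adjective_partition(U, V)  # then A's partition given N's
    return U, V
```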
3.1 Preliminary Solutions
When determining the preliminary population, we also cluster nouns and adjectives respectively. However, here we see the environment of a noun as the set of all adjectives which occur with it in the given compositions, and that of an adjective as the set of all nouns which occur with it in the given compositions. Compared with (1), this problem is a linear clustering one.
Suppose a1, a2 ∈ A and f is defined as above; we define the linear distance between them as (2):

(2) dis(a1, a2) = 1 − |f(a1) ∩ f(a2)| / |f(a1) ∪ f(a2)|

Similarly, we can define the linear distance dis(n1, n2) between nouns based on g. In contrast, we call the distances in Definition 3 non-linear distances.
According to the linear distances between adjectives, we can determine a preliminary partition of A: randomly select an adjective and put it into an empty set X, then scan the other adjectives in A; for any adjective in A−X, if its distances from the adjectives in X are all smaller than t1, put it into X; finally X forms a preliminary cluster. Similarly, we can build another preliminary cluster in (A−X). Continuing in this way, we get a set of preliminary clusters, which is just a partition of A. According to the different orders in which we scan the adjectives, we can get different preliminary partitions of A.
Similarly, we can determine the preliminary partitions of N based on the linear distances between nouns. A partition of A and a partition of N form a preliminary solution of (1), and all possible preliminary solutions form the population of preliminary solutions, which we also call the population of 0th generation solutions.
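A minimal sketch (ours) of the greedy construction of one preliminary partition of A follows; f is the adjective instance function, t1 the adjective distance threshold, and the random scan order is what yields different members of the 0th generation population.

```python
# Minimal sketch (illustrative only) of building one preliminary partition.
# words: the adjective set A; inst: f, mapping an adjective to its noun
# instance set; threshold: t1.

import random

def linear_distance(w1, w2, inst):
    # formula (2): 1 - |f(w1) ∩ f(w2)| / |f(w1) ∪ f(w2)|
    union = inst[w1] | inst[w2]
    return 1.0 if not union else 1.0 - len(inst[w1] & inst[w2]) / len(union)

def preliminary_partition(words, inst, threshold):
    remaining = list(words)
    random.shuffle(remaining)          # one possible scan order
    partition = []
    while remaining:
        cluster = {remaining.pop(0)}   # seed a new cluster
        for w in list(remaining):
            # add w only if it is close to every word already in the cluster
            if all(linear_distance(w, c, inst) < threshold for c in cluster):
                cluster.add(w)
                remaining.remove(w)
        partition.append(cluster)
    return partition
```

The same routine applied to N with g and t2 yields the preliminary noun partitions.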
3.2 Evolution Operation
In general, the evolution operation consists of recombination, mutation and selection. Recombination makes two solutions in a generation combine with each other to form a solution belonging to the next generation. Suppose <U1^(i); V1^(i)> and <U2^(i); V2^(i)> are two ith-generation solutions, where U1^(i) and U2^(i) are two partitions of A, and V1^(i) and V2^(i) are two partitions of N; then <U1^(i); V2^(i)> and <U2^(i); V1^(i)> form two possible (i+1)th-generation solutions.
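As a small illustration (ours), the recombination operator simply exchanges the noun partitions of two parent solutions:

```python
# Minimal sketch (illustrative only) of recombination: two ith-generation
# solutions (U1, V1) and (U2, V2) exchange their noun partitions to form
# two candidate (i+1)th-generation solutions.

def recombine(solution1, solution2):
    (U1, V1), (U2, V2) = solution1, solution2
    return (U1, V2), (U2, V1)
```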
Mutation makes a solution in a generation improve its fitness and evolve into a new one belonging to the next generation. Suppose <U^(i); V^(i)> is an ith-generation solution, where U^(i) = <A1, A2, ..., Ap> and V^(i) = <N1, N2, ..., Nq> are partitions of A and N respectively. The mutation is aimed at optimizing V^(i) into V^(i+1) based on U^(i), making V^(i+1) satisfy condition ii) in (1), or at optimizing U^(i) into U^(i+1) based on V^(i), making U^(i+1) satisfy condition i) in (1), and then moving words across clusters to minimize δ<U,V>.

We design three steps for the mutation operation: splitting, merging and moving. The former two are intended to make the partitions satisfy the conditions in (1), and the third is intended to minimize δ<U,V>. In the following, we take the evolution of V^(i) into V^(i+1) as an example to demonstrate the three steps.
• Splitting Procedure. For any Nk, 1 ≤ k ≤ q, if there exist n1, n2 ∈ Nk such that dis_{U^(i)}(n1, n2) ≥ t2, then split Nk into two subsets X and Y. The procedure is given as the following:

i) Put n1 into X and n2 into Y,

ii) Select the noun in (Nk − (X ∪ Y)) whose distance from n1 is the smallest, and put it into X,

iii) Select the noun in (Nk − (X ∪ Y)) whose distance from n2 is the smallest, and put it into Y,

iv) Repeat ii) and iii) until X ∪ Y = Nk.

For X (or Y), if there exist n1, n2 ∈ X (or Y) with dis_{U^(i)}(n1, n2) ≥ t2, we can make use of the above procedure to split it into smaller sets. Obviously, by repeating the procedure we can split any Nk in V^(i) into several subsets which satisfy condition ii) in (1).
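A minimal sketch (ours) of the splitting step follows; dist is an assumed function computing the non-linear distance dis_{U^(i)} between two nouns.

```python
# Minimal sketch (illustrative only) of the splitting procedure: a noun
# cluster Nk containing a violating pair (n1, n2) is split into X and Y
# by alternately attaching the remaining noun closest to each seed.

def split_cluster(Nk, n1, n2, dist):
    X, Y = {n1}, {n2}                                    # step i)
    rest = [n for n in Nk if n not in (n1, n2)]
    while rest:                                          # step iv)
        nearest = min(rest, key=lambda n: dist(n, n1))   # step ii)
        X.add(nearest)
        rest.remove(nearest)
        if not rest:
            break
        nearest = min(rest, key=lambda n: dist(n, n2))   # step iii)
        Y.add(nearest)
        rest.remove(nearest)
    return X, Y
```

The routine can then be re-applied to X or Y whenever they still contain a pair of nouns whose distance is at least t2.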
• Merging procedure. If there exist Nj and Nk, where