Data Mining: Homework 2
Question 1:
1.
a) Compute the Information Gain for Gender, Car Type and Shirt Size.
b) Construct a decision tree with Information Gain.
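Answer:
a) The class attribute takes two values, C0 and C1, with 10 tuples each, so the expected information (entropy) of the class distribution is
Info(D) = -(10/20)log2(10/20) - (10/20)log2(10/20) = 1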
1. Split on Gender:
Info_Gender(D) = 0.971
Gain(Gender) = 1 - 0.971 = 0.029
2. Split on Car Type:
Info_CarType(D) = 0.314
Gain(Car Type) = 1 - 0.314 = 0.686
3. Split on Shirt Size:
Info_ShirtSize(D) = 0.988
Gain(Shirt Size) = 1 - 0.988 = 0.012
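The gain computation above can be sketched in a few lines of Python. This is only an illustration: the data table is not reproduced in this answer, so the per-branch (C0, C1) counts for Gender below are an assumption, chosen because they are consistent with the stated Info_Gender(D) = 0.971. The function names are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    """Info(D) minus the size-weighted entropy of the partitions."""
    total = sum(parent_counts)
    split_info = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - split_info

# ASSUMED (C0, C1) counts for the two Gender values; not taken from the
# original data table, only consistent with Info_Gender(D) = 0.971.
gender_partitions = [(6, 4), (4, 6)]
gain_gender = info_gain((10, 10), gender_partitions)
print(round(gain_gender, 3))
```

The same `info_gain` call works for Car Type and Shirt Size once the per-value class counts are read off the data table.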
b) The results in (a) show that splitting on Car Type gives the largest information gain, so the decision tree splits on Car Type at the root.
Question 2:
2. (a) Design a multilayer feed-forward neural network (one hidden layer) for the
data set in Q1. Label the nodes in the input and output layers.
(b) Using the neural network obtained above, show the weight values after one
iteration of the back propagation algorithm, given the training instance "(M, Family, Small)". Indicate your initial weight values and biases and the learning rate used.
a)
[Figure: feed-forward network with an input layer (nodes x11, x12, x21, x22, x23, x31, x32, x33, x34), a hidden layer, and an output layer]
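b) Based on (a), assign each input unit the attribute value it represents and its initial input value.
Because initial weights and biases are generated randomly, the initial values are defined here as:
Net inputs and outputs: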
Error at each node:
Node   Err
10     0.0089
11     0.0030
12     -0.12
Updated weights and biases:
w1,10 = 0.201     w1,11 = 0.198
w2,10 = -0.211    w2,11 = -0.099
w3,10 = 0.4       w3,11 = 0.308
w4,10 = -0.202    w4,11 = -0.098
w5,10 = 0.101     w5,11 = -0.100
w6,10 = 0.092     w6,11 = -0.211
w7,10 = -0.400    w7,11 = 0.198
w8,10 = 0.201     w8,11 = 0.190
w9,10 = -0.110    w9,11 = 0.300
w10,12 = -0.304   w11,12 = -0.099
θ10 = -0.287      θ11 = 0.179      θ12 = 0.344
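One backpropagation iteration for a network of this shape (9 inputs, hidden units 10 and 11, output unit 12) can be sketched as follows. This is a minimal sketch, not the computation that produced the values above: the initial weights, biases, learning rate, target label, and the one-hot positions chosen for (M, Family, Small) are all assumptions made for illustration, and `backprop_step` is a hypothetical helper. It uses the usual sigmoid rules: Err = O(1 - O)(T - O) at the output, Err_j = O_j(1 - O_j) Err_k w_jk at a hidden unit, and w += l × Err_j × O_i, θ += l × Err_j.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, w_hidden, b_hidden, w_out, b_out, target, lr):
    """One online backpropagation update for a network with one sigmoid
    hidden layer and a single sigmoid output unit; returns updated copies."""
    # Forward pass: net input and output of each hidden unit, then the output unit.
    hidden = [sigmoid(sum(xi * row[j] for xi, row in zip(x, w_hidden)) + b_hidden[j])
              for j in range(len(b_hidden))]
    out = sigmoid(sum(h * w for h, w in zip(hidden, w_out)) + b_out)
    # Backward pass: error at the output unit, then at each hidden unit.
    err_out = out * (1 - out) * (target - out)
    err_hidden = [h * (1 - h) * err_out * w for h, w in zip(hidden, w_out)]
    # Weight and bias updates.
    new_w_out = [w + lr * err_out * h for w, h in zip(w_out, hidden)]
    new_b_out = b_out + lr * err_out
    new_w_hidden = [[w + lr * err_hidden[j] * xi for j, w in enumerate(row)]
                    for xi, row in zip(x, w_hidden)]
    new_b_hidden = [b + lr * e for b, e in zip(b_hidden, err_hidden)]
    return new_w_hidden, new_b_hidden, new_w_out, new_b_out, out

# ASSUMED one-hot encoding of (M, Family, Small): x11 = x21 = x31 = 1.
x = [1, 0, 1, 0, 0, 1, 0, 0, 0]
# ASSUMED initial values, chosen only for illustration.
w_hidden = [[0.2, -0.1] for _ in range(9)]   # row i: weights from input i to units 10, 11
b_hidden = [0.1, -0.2]
w_out = [0.3, -0.3]                          # weights from units 10, 11 to unit 12
b_out = 0.1
new_w_hidden, new_b_hidden, new_w_out, new_b_out, out = backprop_step(
    x, w_hidden, b_hidden, w_out, b_out, target=0, lr=0.9)
```

Only the weights leaving active inputs (those with value 1) change in the hidden layer, since each update is proportional to the input value.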
Question 3:
3.
a) Suppose the fraction of undergraduate students who smoke is 15% and the
fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?
b) Given the information in part (a), is a randomly chosen college student more
likely to be a graduate or undergraduate student?
c) Suppose 30% of the graduate students live in a dorm but only 10% of the
undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.
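Answer:
a) Define A = {A1, A2}, where A1 denotes undergraduate students and A2 denotes graduate students, and let B denote the event that a student smokes.
From the problem statement:
P(B|A1) = 15%, P(B|A2) = 23%, P(A1) = 4/5, P(A2) = 1/5
The question asks for P(A2|B).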
By the law of total probability:
P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) = 0.166
Then, by Bayes' theorem:
P(A2|B) = P(B|A2)P(A2) / P(B) = (0.23 × 0.2) / 0.166 = 0.277
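The arithmetic for part (a) can be checked with a short Python sketch. The function name `posterior` and the dictionary keys are illustrative choices, not part of the problem.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a finite set of hypotheses.

    Returns the posterior of each hypothesis and the total evidence."""
    joint = {k: priors[k] * likelihoods[k] for k in priors}
    evidence = sum(joint.values())
    return {k: v / evidence for k, v in joint.items()}, evidence

priors = {"undergrad": 0.8, "grad": 0.2}          # P(A1), P(A2)
smokes = {"undergrad": 0.15, "grad": 0.23}        # P(B|A1), P(B|A2)
post, p_smoke = posterior(priors, smokes)
print(round(p_smoke, 3), round(post["grad"], 3))  # P(B), P(A2|B)
```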
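b) From (a), a randomly chosen college student who smokes is a graduate student with probability 0.277 and an undergraduate with probability 0.723, so the student is much more likely to be an undergraduate. The priors alone, P(A1) = 0.8 versus P(A2) = 0.2, point the same way.
c) Let C be the event that a student lives in a dorm.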
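Then P(C|A2) = 30%, P(C|A1) = 10%
P(C) = P(C|A1)P(A1) + P(C|A2)P(A2) = 0.14
Assuming smoking and dorm residence are independent within each student group:
P(BC) = P(B|A1)P(C|A1)P(A1) + P(B|A2)P(C|A2)P(A2) = 0.15 × 0.1 × 0.8 + 0.23 × 0.3 × 0.2 = 0.0258
P(A2|BC) = P(B|A2)P(C|A2)P(A2) / P(BC) = (0.23 × 0.3 × 0.2) / 0.0258 = 0.535
P(A1|BC) = 1 - 0.535 = 0.465
These results show that the student is more likely to be a graduate student.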
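A quick way to check part (c) is to compute the joint term for each student type and normalize, so that the two posteriors sum to 1. The sketch below assumes that smoking and dorm residence are conditionally independent given the student type; the variable names are illustrative.

```python
priors = {"undergrad": 0.8, "grad": 0.2}     # P(A1), P(A2)
smokes = {"undergrad": 0.15, "grad": 0.23}   # P(B|A1), P(B|A2)
dorm = {"undergrad": 0.10, "grad": 0.30}     # P(C|A1), P(C|A2)

# ASSUMPTION: B and C are conditionally independent given the student type,
# so P(B, C | Ai) = P(B|Ai) * P(C|Ai).
joint = {k: priors[k] * smokes[k] * dorm[k] for k in priors}
evidence = sum(joint.values())               # P(B, C)
post_bc = {k: v / evidence for k, v in joint.items()}
print({k: round(v, 3) for k, v in post_bc.items()})
```

Under this assumption the graduate posterior is the larger of the two, which matches the conclusion that a smoking dorm resident is more likely to be a graduate student.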
Question 4:
4. Suppose that the data mining task is to cluster the following ten points (with (x, y, z)
representing location) into three clusters:
A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2), C2(1,4,6), C3(9,1,7), C4(5,6,7)
The distance function is Euclidean distance. Suppose initially we assign A1, B1, C1 as the center of each cluster, respectively. Use the K-Means algorithm to show only
(a) The three cluster centers after the first round of execution
(b) The final three clusters
Answer:
a) Euclidean distances from each point to the cluster centers, round 1:
The resulting three clusters are:
{A1, A3, B3, C2, C3, C4} {B1, B2} {C1, A2}
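So the new centers of the three clusters are: (4.5,4.5,6.83), (1.5,2,1.5), (10.5,7,2)
Round 2:
The new cluster means are: (4.5,4.5,6.83), (1.5,2,1.5), (10.5,7,2)
The means are unchanged from round 1, so the algorithm has converged.
b) The final three clusters are {A1, A3, B3, C2, C3, C4}, {B1, B2}, {C1, A2}.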
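The two rounds can be reproduced with a small K-Means sketch in plain Python. It uses the ten points and the initial centers A1, B1, C1 from the problem; the function name `kmeans` and the `max_rounds` cap are illustrative choices, and the sketch assumes no cluster ever becomes empty (true for this data).

```python
import math

points = {
    "A1": (4, 2, 5), "A2": (10, 5, 2), "A3": (5, 8, 7),
    "B1": (1, 1, 1), "B2": (2, 3, 2), "B3": (3, 6, 9),
    "C1": (11, 9, 2), "C2": (1, 4, 6), "C3": (9, 1, 7), "C4": (5, 6, 7),
}

def kmeans(points, centers, max_rounds=100):
    """Plain K-Means with Euclidean distance.

    Returns the final clusters and the list of centers after each round."""
    history = []
    for _ in range(max_rounds):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for name, p in points.items():
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(name)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [
            tuple(sum(points[n][d] for n in c) / len(c) for d in range(3))
            for c in clusters
        ]
        history.append(new_centers)
        if new_centers == centers:   # means unchanged: converged
            break
        centers = new_centers
    return clusters, history

clusters, history = kmeans(points, [points["A1"], points["B1"], points["C1"]])
print(clusters)
print(history[0])
```

Here `history[0]` holds the centers after the first round, and the loop stops at the second round because the means no longer change.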