Data Mining: Exercise Bank with Answers
Data Mining Review Exercises and Answers

I. Consider the training set for a binary classification problem shown below (Table 4.8, the data set of Exercise 3).

Table 4.8. Data set for Exercise 3.

  Instance  a1  a2  a3   Target class
  1         T   T   1.0  +
  2         T   T   6.0  +
  3         T   F   5.0  -
  4         F   F   4.0  +
  5         F   T   7.0  -
  6         F   T   3.0  -
  7         F   F   8.0  -
  8         T   F   7.0  +
  9         F   T   5.0  -

1. What is the entropy of the whole training set with respect to the class attribute?
2. What are the information gains of a1 and a2 relative to the training set?
3. For the continuous attribute a3, compute the information gain of every possible split.
4. According to information gain, which of a1, a2, a3 is the best split?
5. According to the classification error rate, which of a1 and a2 is best?
6. According to the Gini index, which of a1 and a2 is best?

Answer 1. The training set contains 4 positive and 5 negative examples, so P(+) = 4/9 and P(-) = 5/9. The entropy of the training set with respect to the class attribute is

  Entropy = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911.
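This entropy figure is easy to check numerically; below is a minimal Python sketch (the helper name `entropy` is ours, not from the text):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution, given a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class distribution of the whole training set: 4 positive, 5 negative.
print(round(entropy([4, 5]), 4))  # 0.9911
```

A pure node such as `entropy([0, 6])` yields 0 and an evenly split node such as `entropy([3, 3])` yields 1, matching the extremes of the measure.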
Answer 2. Information gain is defined as

  Gain_split = Entropy(parent) - sum_{i=1..k} (n_i / n) * Entropy(i),

where the parent node is split into k partitions and n_i is the number of records in partition i. It measures the reduction in entropy achieved by the split; decision tree induction chooses the split that achieves the largest reduction (maximum gain). This criterion is used in ID3 and C4.5; its disadvantage is a tendency to prefer splits that result in a large number of partitions, each being small but pure.

For attribute a1, the class counts are a1 = T: 3 positive, 1 negative; a1 = F: 1 positive, 4 negative. The entropy after splitting on a1 is

  (4/9)[-(3/4)log2(3/4) - (1/4)log2(1/4)] + (5/9)[-(1/5)log2(1/5) - (4/5)log2(4/5)] = 0.7616,

so the information gain of a1 is 0.9911 - 0.7616 = 0.2294.

For attribute a2, the class counts are a2 = T: 2 positive, 3 negative; a2 = F: 2 positive, 2 negative. The entropy after splitting on a2 is

  (5/9)[-(2/5)log2(2/5) - (3/5)log2(3/5)] + (4/9)[-(2/4)log2(2/4) - (2/4)log2(2/4)] = 0.9839,

so the information gain of a2 is 0.9911 - 0.9839 = 0.0072.
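The two gains can likewise be verified in a few lines of Python (a sketch; `info_gain` is our name for the Gain_split formula above):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child partitions."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

# a1 splits the 4+/5- training set into (3+, 1-) and (1+, 4-).
print(round(info_gain([4, 5], [[3, 1], [1, 4]]), 4))  # 0.2294
# a2 splits it into (2+, 3-) and (2+, 2-).
print(round(info_gain([4, 5], [[2, 3], [2, 2]]), 4))  # 0.0072
```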
Answer 3. For a continuous attribute such as a3, sort the records on the attribute values and scan them linearly, updating the class counts and recomputing the impurity at each candidate split position. The sorted values of a3, with their classes, are 1.0 (+), 3.0 (-), 4.0 (+), 5.0 (-), 5.0 (-), 6.0 (+), 7.0 (-), 7.0 (+), 8.0 (-). The entropy and information gain at each candidate split point are:

  Split point  Entropy  Info gain
  2.0          0.8484   0.1427
  3.5          0.9885   0.0026
  4.5          0.9183   0.0728
  5.5          0.9839   0.0072
  6.5          0.9728   0.0183
  7.5          0.8889   0.1022

The best split for a3 is therefore at 2.0, with information gain 0.1427.

Answer 4. According to information gain, a1 produces the best split (gain 0.2294, versus 0.0072 for a2 and 0.1427 for the best split on a3).

Answer 5. The classification error of a node is Error(t) = 1 - max_i P(i|t). The weighted error after splitting on a1 is (4/9)(1/4) + (5/9)(1/5) = 2/9; after splitting on a2 it is (5/9)(2/5) + (4/9)(2/4) = 4/9. Therefore, according to the error rate, a1 produces the best split.

Answer 6. For attribute a1, the Gini index of the split is

  (4/9)[1 - (3/4)^2 - (1/4)^2] + (5/9)[1 - (1/5)^2 - (4/5)^2] = 0.3444.

For attribute a2, the Gini index is

  (5/9)[1 - (2/5)^2 - (3/5)^2] + (4/9)[1 - (2/4)^2 - (2/4)^2] = 0.4889.

Since the Gini index of a1 is smaller, a1 produces the better split.

II. Consider the following data set for a binary classification problem.

  A  B  Class label
  T  F  +
  T  T  +
  T  T  +
  T  F  -
  T  T  +
  F  F  -
  F  F  -
  F  F  -
  T  T  -
  T  F  -

Figure 4.13 shows a comparison among the impurity measures for binary classification problems.

1. Compute the information gains of A and B. Which attribute would the decision tree induction algorithm choose?
The contingency tables after splitting on attributes A and B are:

       A=T  A=F        B=T  B=F
  +     4    0     +    3    1
  -     3    3     -    1    5

The overall entropy before splitting is

  E_orig = -0.4 log2(0.4) - 0.6 log2(0.6) = 0.9710.

The information gain after splitting on A is

  E(A=T) = -(4/7)log2(4/7) - (3/7)log2(3/7) = 0.9852
  E(A=F) = -(3/3)log2(3/3) - (0/3)log2(0/3) = 0
  Gain = E_orig - (7/10)E(A=T) - (3/10)E(A=F) = 0.2813.

The information gain after splitting on B is

  E(B=T) = -(3/4)log2(3/4) - (1/4)log2(1/4) = 0.8113
  E(B=F) = -(1/6)log2(1/6) - (5/6)log2(5/6) = 0.6500
  Gain = E_orig - (4/10)E(B=T) - (6/10)E(B=F) = 0.2565.

Therefore, attribute A will be chosen to split the node.

2. Compute the Gini index gains of A and B. Which attribute would decision tree induction choose?

The overall Gini index before splitting is

  G_orig = 1 - 0.4^2 - 0.6^2 = 0.48.

The gain in the Gini index after splitting on A is

  G(A=T) = 1 - (4/7)^2 - (3/7)^2 = 0.4898
  G(A=F) = 1 - (3/3)^2 - (0/3)^2 = 0
  Gain = G_orig - (7/10)G(A=T) - (3/10)G(A=F) = 0.1371.

The gain in the Gini index after splitting on B is

  G(B=T) = 1 - (3/4)^2 - (1/4)^2 = 0.3750
  G(B=F) = 1 - (1/6)^2 - (5/6)^2 = 0.2778
  Gain = G_orig - (4/10)G(B=T) - (6/10)G(B=F) = 0.1633.

Therefore, attribute B will be chosen to split the node.

3. Figure 4.13 shows that both entropy and the Gini index are monotonically increasing on [0, 0.5] and monotonically decreasing on [0.5, 1]. Is it possible for information gain and the Gini index gain to favor different attributes? Explain.

Yes. Even though these measures have a similar range and similar monotone behavior, their respective gains, which are scaled differences of the measures, do not necessarily behave in the same way, as illustrated by the results in parts 1 and 2: information gain favors attribute A, while the Gini gain favors attribute B.
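The disagreement between the two criteria can be reproduced mechanically; the sketch below (helper names are ours) parameterizes the gain by the impurity measure:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gain(impurity, parent, children):
    """Impurity of the parent minus the weighted impurity of the children."""
    n = sum(parent)
    return impurity(parent) - sum(sum(c) / n * impurity(c) for c in children)

parent = [4, 6]              # 4 positive, 6 negative records
split_A = [[4, 3], [0, 3]]   # class counts for A=T and A=F
split_B = [[3, 1], [1, 5]]   # class counts for B=T and B=F

print(round(gain(entropy, parent, split_A), 3))  # 0.281 -> entropy prefers A
print(round(gain(entropy, parent, split_B), 3))  # 0.256
print(round(gain(gini, parent, split_A), 3))     # 0.137
print(round(gain(gini, parent, split_B), 3))     # 0.163 -> Gini prefers B
```

Running this confirms the answer to part 3: the same split tables, scored by two monotone impurity measures, rank the attributes differently.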
Bayesian Classification

7. Consider the data set shown in Table 5.10.

Table 5.10. Data set for Exercise 7.

  Record  A  B  C  Class
  1       0  0  0  +
  2       0  0  1  -
  3       0  1  1  -
  4       0  1  1  -
  5       0  0  1  +
  6       1  0  1  +
  7       1  0  1  -
  8       1  0  1  -
  9       1  1  0  +
  10      1  0  0  +

(a) Estimate the conditional probabilities P(A|+), P(B|+), P(C|+), P(A|-), P(B|-), and P(C|-).
(b) Use the conditional probabilities from (a) to predict the class label of the test sample (A=0, B=1, C=0) with the naive Bayes approach.
(c) Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4.
(d) Repeat (b) using the conditional probabilities from (c).
(e) Compare the two approaches for estimating the probabilities. Which is better and why?

Answers:

(a) P(A=1|-) = 2/5 = 0.4, P(B=1|-) = 2/5 = 0.4, P(C=1|-) = 1, P(A=0|-) = 3/5 = 0.6, P(B=0|-) = 3/5 = 0.6, P(C=0|-) = 0; P(A=1|+) = 3/5 = 0.6, P(B=1|+) = 1/5 = 0.2, P(C=1|+) = 2/5 = 0.4, P(A=0|+) = 2/5 = 0.4, P(B=0|+) = 4/5 = 0.8, P(C=0|+) = 3/5 = 0.6.

(b) Let P(A=0, B=1, C=0) = K. Then

  P(+|A=0, B=1, C=0) = P(A=0|+) P(B=1|+) P(C=0|+) P(+) / K
                     = (0.4 x 0.2 x 0.6 x 0.5) / K = 0.024/K,
  P(-|A=0, B=1, C=0) = P(A=0|-) P(B=1|-) P(C=0|-) P(-) / K
                     = (0.6 x 0.4 x 0 x 0.5) / K = 0.

The class label should be +.

(c) With the m-estimate (p = 1/2, m = 4), each conditional probability becomes (count + mp) / (class size + m):

  P(A=0|+) = (2+2)/(5+4) = 4/9,  P(A=0|-) = (3+2)/(5+4) = 5/9,
  P(B=1|+) = (1+2)/(5+4) = 3/9,  P(B=1|-) = (2+2)/(5+4) = 4/9,
  P(C=0|+) = (3+2)/(5+4) = 5/9,  P(C=0|-) = (0+2)/(5+4) = 2/9.

(d) Let P(A=0, B=1, C=0) = K. Then

  P(+|A=0, B=1, C=0) = (4/9 x 3/9 x 5/9 x 0.5) / K = 0.0412/K,
  P(-|A=0, B=1, C=0) = (5/9 x 4/9 x 2/9 x 0.5) / K = 0.0274/K.

The class label should be +.

(e) When one of the conditional probabilities is zero, the estimate obtained with the m-estimate approach is better, because we do not want the entire posterior expression to become zero on account of a single unseen attribute value.
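Parts (b) and (d) can be checked with a small Python sketch (the helpers `nb_score` and `m_est` are ours; the counts come from the answer to (a)):

```python
def nb_score(cond_probs, prior=0.5):
    """Unnormalized naive Bayes posterior: class prior times the product
    of the conditional probabilities of the test record's values."""
    score = prior
    for p in cond_probs:
        score *= p
    return score

# Test record (A=0, B=1, C=0), maximum-likelihood estimates:
plus  = nb_score([2/5, 1/5, 3/5])   # P(A=0|+), P(B=1|+), P(C=0|+)
minus = nb_score([3/5, 2/5, 0/5])   # P(C=0|-) = 0 zeroes the whole product
print(round(plus, 3), minus)        # 0.024 0.0 -> predict +

# m-estimates with p = 1/2, m = 4: (count + m*p) / (class size + m)
def m_est(count, n, m=4, p=0.5):
    return (count + m * p) / (n + m)

plus_m  = nb_score([m_est(2, 5), m_est(1, 5), m_est(3, 5)])
minus_m = nb_score([m_est(3, 5), m_est(2, 5), m_est(0, 5)])
print(round(plus_m, 4), round(minus_m, 4))  # 0.0412 0.0274 -> still +
```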
8. Consider the data set shown in Table 5.11.

Table 5.11. Data set for Exercise 8.

  Instance  A  B  C  Class
  1         0  0  1  -
  2         1  0  1  +
  3         0  1  0  -
  4         1  0  0  -
  5         1  0  1  +
  6         0  0  1  +
  7         1  1  0  -
  8         0  0  0  -
  9         0  1  0  +
  10        1  1  1  +

(a) Estimate the conditional probabilities P(A=1|+), P(B=1|+), P(C=1|+), P(A=1|-), P(B=1|-), and P(C=1|-).
(b) Use the conditional probabilities from (a) to predict the class label of the test sample (A=1, B=1, C=1) with the naive Bayes approach.
(c) Compare P(A=1), P(B=1), and P(A=1, B=1). State the relationship between A and B.
(d) Repeat the analysis in (c) for P(A=1), P(B=0), and P(A=1, B=0).
(e) Compare P(A=1, B=1 | Class=+) against P(A=1 | Class=+) and P(B=1 | Class=+). Are the variables A and B conditionally independent given the class +?

Answers:

(a) P(A=1|+) = 0.6, P(B=1|+) = 0.4, P(C=1|+) = 0.8, P(A=1|-) = 0.4, P(B=1|-) = 0.4, and P(C=1|-) = 0.2.

(b) Let R : (A=1, B=1, C=1) be the test record. To determine its class, we need to compare P(+|R) and P(-|R). Using Bayes' theorem, P(+|R) = P(R|+)P(+)/P(R) and P(-|R) = P(R|-)P(-)/P(R). Since P(+) = P(-) = 0.5 and P(R) is constant, R can be classified by comparing P(R|+) with P(R|-). For this question,

  P(R|+) = P(A=1|+) x P(B=1|+) x P(C=1|+) = 0.6 x 0.4 x 0.8 = 0.192,
  P(R|-) = P(A=1|-) x P(B=1|-) x P(C=1|-) = 0.4 x 0.4 x 0.2 = 0.032.

Since P(R|+) is larger, the record is assigned to the + class.

(c) P(A=1) = 0.5, P(B=1) = 0.4, and P(A=1, B=1) = 0.2 = P(A=1) x P(B=1). Therefore, A and B are independent.

(d) P(A=1) = 0.5, P(B=0) = 0.6, and P(A=1, B=0) = 0.3 = P(A=1) x P(B=0). A and B are still independent.

(e) Compare P(A=1, B=1|+) = 0.2 against P(A=1|+) = 0.6 and P(B=1|+) = 0.4. Since the product P(A=1|+) x P(B=1|+) = 0.24 is not equal to P(A=1, B=1|+) = 0.2, A and B are not conditionally independent given the class.
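The independence checks in parts (c) through (e) can be verified directly over Table 5.11 (a sketch; the `prob` helper and the tuple encoding are ours):

```python
# Table 5.11 as (A, B, C, class) tuples.
records = [
    (0, 0, 1, '-'), (1, 0, 1, '+'), (0, 1, 0, '-'), (1, 0, 0, '-'),
    (1, 0, 1, '+'), (0, 0, 1, '+'), (1, 1, 0, '-'), (0, 0, 0, '-'),
    (0, 1, 0, '+'), (1, 1, 1, '+'),
]

def prob(pred, rows=records):
    """Fraction of rows satisfying the predicate."""
    return sum(1 for r in rows if pred(r)) / len(rows)

# (c) Unconditionally, the joint equals the product of the marginals:
joint = prob(lambda r: r[0] == 1 and r[1] == 1)
print(joint)                                                  # 0.2
print(prob(lambda r: r[0] == 1) * prob(lambda r: r[1] == 1))  # 0.2

# (e) Conditioned on class +, it does not:
pos = [r for r in records if r[3] == '+']
joint_pos = prob(lambda r: r[0] == 1 and r[1] == 1, pos)
prod_pos = prob(lambda r: r[0] == 1, pos) * prob(lambda r: r[1] == 1, pos)
print(joint_pos, round(prod_pos, 2))                          # 0.2 0.24
```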
