Question details
English translation
Tree Pruning
Most real-world datasets do not have the convenient properties shown here. Instead, noisy data with measurement errors or incorrect classification for some examples can lead to very bushy trees in which the rule tree has many special cases to classify small numbers of uninteresting samples.
One way to address this problem is to use "rule-tree pruning." Instead of stopping when the negentropy reaches zero, you stop when it reaches some sufficiently small value, indicating that you are near the end of a branch. This pruning leaves a small number of examples incorrectly classified, but the overall structure of the decision tree will be preserved. Finding the exact, nonzero cutoff value will be a matter of experimentation.
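As a minimal sketch of that stopping test (the cutoff value and the names below are assumptions, not part of the article's listings), the check might look like this in C:

#define PRUNE_CUTOFF 0.05   /* nonzero cutoff, found by experimentation per the text */

int stop_splitting(double negentropy_value, int n_examples)
{
    /* Treat a nearly pure (or trivially small) partition as a leaf
       instead of splitting it further. */
    return negentropy_value <= PRUNE_CUTOFF || n_examples <= 1;
}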
Implementing Decision Trees
In implementing the decision tree described here, I've represented it within the program (see Listing One) as a binary tree, constructed of NODE structures pointing to other NODEs or to NULLs for terminal nodes.
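For illustration, a NODE structure along these lines might be declared as follows; the field names are assumptions, not the article's actual Listing One:

typedef struct node {
    int          attribute;   /* index of the attribute tested at this node */
    double       threshold;   /* split point for a real-valued attribute */
    int          class_label; /* class assigned when this node is terminal */
    struct node *left;        /* subtree for values below the threshold (NULL at a leaf) */
    struct node *right;       /* subtree for values at or above the threshold (NULL at a leaf) */
} NODE;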
Rather than copying the data table for each partition, I pass the partially formed data tree to the routine that calculates negentropy, allowing the program to exclude records that are not relevant for that part of the tree. Negentropy of a partition is calculated in routine negentropy (see Listing Two), which is called for all attribute/threshold combinations by routine ID3 (Listing Three).
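The article's Listing Two is not reproduced here, so the following two-class impurity measure is only a hedged sketch consistent with the text (it is zero when a partition is pure), not the author's actual negentropy() routine:

#include <math.h>

double partition_negentropy(int n_class0, int n_class1)
{
    int    n = n_class0 + n_class1;
    double h = 0.0;

    if (n == 0)
        return 0.0;
    if (n_class0 > 0) {
        double p = (double)n_class0 / n;
        h -= p * log(p);
    }
    if (n_class1 > 0) {
        double p = (double)n_class1 / n;
        h -= p * log(p);
    }
    return h;   /* zero when every example in the partition has the same class */
}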
The ability to use real-valued as well as binary-valued attributes comes at a price. To ensure the correct value of r, we scan through all attribute values in the dataset, a process that can be quite computationally intensive for large datasets.
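A self-contained sketch of that exhaustive scan is shown below: every value the attribute takes is tried as a candidate threshold and scored with an entropy-style impurity, giving roughly O(n^2) work per attribute for n examples, which matches the cost the text warns about. The helper and names are assumptions, not the article's Listing Three.

#include <math.h>

static double side_score(int n0, int n1)
{
    int    n = n0 + n1;
    double h = 0.0;
    if (n == 0) return 0.0;
    if (n0 > 0) { double p = (double)n0 / n; h -= p * log(p); }
    if (n1 > 0) { double p = (double)n1 / n; h -= p * log(p); }
    return h;
}

/* Returns the best (lowest) weighted score and stores the chosen threshold. */
double scan_thresholds(const double *values, const int *labels, int n, double *best_t)
{
    int    i, j;
    double best_score = 1e30;
    *best_t = values[0];
    for (i = 0; i < n; i++) {            /* every observed value is a candidate threshold */
        int lo0 = 0, lo1 = 0, hi0 = 0, hi1 = 0;
        for (j = 0; j < n; j++) {        /* count classes on each side of the split */
            if (values[j] < values[i]) { if (labels[j]) lo1++; else lo0++; }
            else                       { if (labels[j]) hi1++; else hi0++; }
        }
        {
            double score = ((lo0 + lo1) * side_score(lo0, lo1) +
                            (hi0 + hi1) * side_score(hi0, hi1)) / n;
            if (score < best_score) { best_score = score; *best_t = values[i]; }
        }
    }
    return best_score;
}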
No claims are made for the efficiency of this implementation. For cases where many sample attribute values are the same, or where a mixture of real-valued and binary-valued attributes is to be considered, the user is probably better advised to sort the attributes into a list and to eliminate repeated values. I've also not considered the case where a question can have more than two outcomes.
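The suggested speed-up could be sketched as follows: sort one attribute's values and drop repeats, so each distinct value is tried as a threshold only once. The names are illustrative assumptions; nothing like this appears in the article's listings.

#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts values[0..n-1] in place and returns the number of distinct values,
   which are left packed at the front of the array. */
int unique_sorted_values(double *values, int n)
{
    int i, m;
    if (n == 0) return 0;
    qsort(values, n, sizeof(double), cmp_double);
    for (i = 1, m = 1; i < n; i++)
        if (values[i] != values[m - 1])
            values[m++] = values[i];
    return m;
}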
Two illustrative datasets are available electronically; see "Availability," page 3. The first is a set of sample data from a botanical classification problem, in which a type of flower, an iris, is to be classified into one of two subgenera (Virgin, Setosa) according to the dimensions of sample pistils and stamens. The data is taken from M. James' book Classification Algorithms (Collins, 1985). Figure 3 shows the resulting decision tree.
The second dataset is a torture test for the algorithm. Given the attributes {John, Paul, George, Ringo}, which are random numbers between 0 and 1, and a target attribute that takes random values from {0, 1}, the program returns a large and complex tree that classifies 100 examples. On a 486/75, the calculation took about 220 seconds to run to completion. Most real-world datasets will produce much simpler decision trees than Listing Four.
Best answer
Answer and explanation
This paper describes a classification system based on decision trees, focusing on how the system was built with the ID3 decision-tree algorithm using the development tool Visual C++ 6.0. It first introduces the relevant background on machine learning, inductive learning, and decision-tree learning. It then covers the theory of decision trees and the ID3 algorithm in detail, including information entropy and the principles of the algorithm, and analyzes the strengths and weaknesses of ID3. With respect to the actual system, the paper describes each module and its implementation, as well as the complete workflow of the system's functions.
In addition, the paper introduces the development tool Visual C++ 6.0 in some detail, covering the material relevant to implementing this system. By analyzing the experimental results obtained on different test datasets, it demonstrates all of the functions the system can perform. The paper closes with an overall conclusion and points out the many shortcomings that remain in this classification system, which lays the groundwork for further research.