PMML 1.1 -- DTD of Tree Model
A Tree Model consists of four major parts:

Definitions:
TreeModel - starts the definition of a tree model.
Node - this element encapsulates the definition of either a split or an apex in the tree model architecture. Every Node contains at least one Predicate that identifies a rule for choosing the Node itself or any of its siblings. Several Predicate constructs may be used to identify each split rule within a node.
modelName - the value of modelName in a TreeModel element identifies the model with a unique name in the context of the PMML file. This attribute is not required. Users reading the models of a PMML file are free to manage model naming at their discretion.

Each Node consists of:

Definitions:
score - The value of score in a Node serves as the predicted value for a record that chooses the Node.
recordCount - The value of recordCount in a Node serves as a base size for the recordCount values in ScoreDistribution elements. These numbers do not necessarily equal the number of records that were used to build or train the model. Nevertheless, they make it possible to determine the relative size of each value in a score distribution, as well as the relative size of a node compared to its parent node.

The PREDICATES are:
Each Node has one %PREDICATES; element, which may be a Predicate, a CompoundPredicate, a True, or a False.

Definitions:
Predicate - this element defines a rule in the form of a simple boolean expression. The rule consists of a dataField, a binary comparison operator (booleanOperator), and a value. Mathematically, the rule is expressed as dataField booleanOperator value; that is, the dataField is the left operand and the value is the right operand. The following samples are all equivalent to "age < 30":
<Predicate dataField="age" operator="lessThan" value="30" />
<Predicate value="30" operator="lessThan" dataField="age" />
<Predicate operator="lessThan" value="30" dataField="age" />
field - This attribute of the Predicate element names one of the miningField elements in the MiningSchema.
operator - This attribute of the Predicate element is one of the six predefined comparison operators:
Operator          Math Symbol
equal             =
notEqual          !=
lessThan          <
lessOrEqual       <=
greaterThan       >
greaterOrEqual    >=
value - This attribute of the Predicate element is the value that the field is compared against.
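
The rule "field operator value" described above can be sketched in a few lines of Python. This is only an illustration, not part of the DTD: the function name evaluate_predicate, the use of a plain dict as the record, the assumption that the value has already been converted to the field's type, and the choice to return None for a missing field are all assumptions of the sketch.

```python
import operator

# Map the six operator names from the table above to Python comparisons.
OPERATORS = {
    "equal": operator.eq,
    "notEqual": operator.ne,
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
}


def evaluate_predicate(record, field, op, value):
    """Evaluate `field op value` against a record held as a plain dict.

    Returns None when the field is missing from the record; treating a
    missing value as "unknown" is an assumption of this sketch.
    """
    if record.get(field) is None:
        return None
    # The field is the left operand and the value is the right operand.
    return OPERATORS[op](record[field], value)


# The rule "age < 30" applied to three hypothetical records.
print(evaluate_predicate({"age": 25}, "age", "lessThan", 30))  # True
print(evaluate_predicate({"age": 42}, "age", "lessThan", 30))  # False
print(evaluate_predicate({}, "age", "lessThan", 30))           # None
```

Because the attributes are looked up by name, the order in which they appear in the Predicate element does not matter, which is why the three samples above are equivalent.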

Definition: CompoundPredicate - an encapsulating element for combining two or more elements as defined by the entity %PREDICATES;. The attribute associated with this element, booleanOperator, can take one of the following logical (boolean) operators: and, or, xor, or cascade.
The operator and evaluates to TRUE if all the predicates evaluate to TRUE.
The operator or evaluates to TRUE if at least one of the predicates evaluates to TRUE.
The operator xor evaluates to TRUE if all the predicates evaluate to the same value (either all TRUE or all FALSE).
The operator cascade allows surrogate predicates to be specified. cascade is used where a missing value leaves the evaluation of the parent predicate undetermined, so that an alternative predicate is available.
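
As a rough illustration of how nested predicates combine, the following Python sketch builds a compound rule out of smaller ones. It covers only the and and or operators and assumes that no values are missing. Every name in it (compound, young, low_salary, always_false) is hypothetical, and predicates are modelled as plain functions from a record (a dict) to a boolean rather than as XML elements.

```python
def compound(boolean_operator, *predicates):
    """Combine two or more predicate functions into a single predicate."""
    if len(predicates) < 2:
        raise ValueError("a CompoundPredicate combines two or more predicates")
    if boolean_operator not in ("and", "or"):
        raise ValueError(f"operator not covered by this sketch: {boolean_operator}")

    def evaluate(record):
        results = (predicate(record) for predicate in predicates)
        # "and": every child must be TRUE; "or": at least one child is TRUE.
        return all(results) if boolean_operator == "and" else any(results)

    return evaluate


def always_false(record):
    """Stands in for the False predicate element."""
    return False


def young(record):
    return record["age"] < 30


def low_salary(record):
    return record["salary"] <= 35000


# (age < 30 and salary <= 35000) or False
rule = compound("or", compound("and", young, low_salary), always_false)

print(rule({"age": 25, "salary": 30000}))  # True
print(rule({"age": 25, "salary": 60000}))  # False
```

Since the result of compound is itself a predicate function, it can be passed to another compound call, which mirrors the way a CompoundPredicate may contain any element of %PREDICATES;.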

Definition: True - a predicate element that identifies the boolean constant TRUE.

Definition: False - a predicate element that identifies the boolean constant FALSE.

ScoreDistribution
A method to list predicted values in a classification tree structure.

Definitions:
ScoreDistribution - an element of Node that represents segments of the score that a node predicts in a classification model. If the node holds an enumeration, each entry of the enumeration is stored in one ScoreDistribution element.
value - This attribute of ScoreDistribution is the label in a classification model.
recordCount - This attribute of ScoreDistribution is the size (possibly the number of records) associated with the value attribute.

Extensions:
<!ATTLIST TreeModel x-splitCharacteristic (binarySplit | multiSplit) #REQUIRED >
Definition: x-splitCharacteristic - indicates whether the tree model has exactly two splits per node or multiple splits per node. With multiSplit, each node may independently have two or more splits.

Examples
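
How recordCount gives relative sizes
The recordCount of a Node is the base against which the recordCount values of its ScoreDistribution elements are read, and it can also be compared with the recordCount of the parent node. The Python sketch below shows that arithmetic only. The labels and counts are made up for illustration, and the function names relative_sizes and relative_node_size are hypothetical.

```python
def relative_sizes(node_record_count, score_distribution):
    """Return the share of the node's recordCount held by each value.

    score_distribution maps each ScoreDistribution value (a class label)
    to its recordCount.
    """
    return {
        value: count / node_record_count
        for value, count in score_distribution.items()
    }


def relative_node_size(node_record_count, parent_record_count):
    """Return the size of a node relative to its parent node."""
    return node_record_count / parent_record_count


# Hypothetical node: recordCount 100, split over three class labels.
shares = relative_sizes(100, {"low": 60, "medium": 30, "high": 10})
for value, share in shares.items():
    print(f"{value}: {share:.2f}")

# Hypothetical child node of size 40 under a parent of size 100.
print(f"relative to parent: {relative_node_size(40, 100):.2f}")
```

The counts need not be the number of training records. Only their ratios are used here.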
How cascade may be useful
The CART algorithm has the concept of a surrogate split. Suppose a record being classified reaches a node where the primary split is "salary <= 35000", and the record has a missing value for salary, which happens quite naturally. CART deals with this situation by applying a sequence of surrogate rules, in cascade-like fashion, until one of them can classify the given record. There may be zero or more surrogate splits available. In our example, we could have "age <= 28" and "homeowner == 0" as surrogates. If age is not missing, the record is classified according to the age value. If age is missing, we try homeowner. If homeowner is also missing, we have run out of surrogates and we apply a True or False, as specified by the XML document. For example, (salary <= 35000) cascade (True) means "classify the record according to salary if salary is not missing; if salary is missing, the predicate returns True anyway".
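
The surrogate sequence from this example can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the CART implementation: records are plain dicts, a missing value is modelled as an absent key or None, and the helper names simple and cascade are hypothetical.

```python
def simple(field, compare):
    """Build a predicate that returns None when the field is missing."""
    def evaluate(record):
        value = record.get(field)
        return None if value is None else compare(value)
    return evaluate


def cascade(*predicates):
    """Return the result of the first predicate that can be evaluated."""
    def evaluate(record):
        for predicate in predicates:
            result = predicate(record)
            if result is not None:
                return result
        return None  # every predicate was undetermined
    return evaluate


# Primary split, two surrogates, then a constant True as the last resort.
rule = cascade(
    simple("salary", lambda v: v <= 35000),
    simple("age", lambda v: v <= 28),
    simple("homeowner", lambda v: v == 0),
    lambda record: True,
)

print(rule({"salary": 50000, "age": 25}))  # False: the primary split decides
print(rule({"age": 25}))                   # True: salary missing, age decides
print(rule({"homeowner": 1}))              # False: only homeowner is present
print(rule({}))                            # True: surrogates exhausted
```

Placing the constant predicate last guarantees that the cascade always produces an answer, which is the role the True or False element plays in the text above.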
Example TreeModel