DMG.ORG

PMML 1.1 -- DTD of Tree Model


The tree modeling framework allows for defining either a classification or a prediction structure. Each Node holds a rule, expressed as one of the PREDICATES, that determines whether the Node or one of its branching Nodes is chosen. Branching can be multi-ary at each Node. The framework builds in no branching restrictions and no description of the branching style. However, an Extension for the branching style has been defined on the attributes of TreeModel; it states whether all Nodes are binary or multi-ary (see the Extensions section: x-splitCharacteristic).

A Tree Model consists of four major parts:


<!ELEMENT TreeModel (Extension*, MiningSchema, ModelStats?, Node)>
<!ATTLIST TreeModel
   modelName      CDATA     #IMPLIED
>
Definitions:

TreeModel - starts the definition for a tree model.

Node - this element is an encapsulation for defining either a split or a leaf in a tree model. Every Node contains at least one Predicate that identifies a rule for choosing itself or any of its siblings. Several Predicate constructs may be used to express each split rule within a Node.

modelName - the value of modelName in a TreeModel element identifies the model with a unique name in the context of the PMML file. This attribute is not required. Users reading models from a PMML file are free to manage model names at their discretion.

Each Node consists of:


 <!ELEMENT Node ( Extension*, (%PREDICATES;), Node*, ScoreDistribution* )>
 <!ATTLIST Node
    score         CDATA      #REQUIRED
    recordCount   %NUMBER;   #IMPLIED
 >
Definitions:

score - The value of score in a Node serves as the predicted value for a record that chooses the Node.

recordCount - The value of recordCount in a Node serves as a base size for recordCount values in ScoreDistribution elements. These numbers do not necessarily give the number of records that were used to build/train the model. Nevertheless, they make it possible to determine the relative size of given values in a score distribution as well as the relative size of a node when compared to its parent node.
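
The role of score can be illustrated with a minimal Python sketch. The DTD does not spell out a traversal rule, so the sketch assumes that scoring starts at the root Node, descends into the first child Node whose predicate is TRUE for the record, and stops when no child matches; the score of the last Node chosen is the prediction. The names TreeNode and predict are hypothetical and not part of PMML.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class TreeNode:
    """Hypothetical in-memory stand-in for a PMML Node element."""
    score: str
    predicate: Callable[[Record], bool]
    children: List["TreeNode"] = field(default_factory=list)


def predict(root: TreeNode, record: Record) -> str:
    """Return the score of the deepest Node chosen by the record.

    Assumption: the root is always chosen, and at each level the first
    child whose predicate holds is chosen.
    """
    node = root
    while True:
        chosen = next((c for c in node.children if c.predicate(record)), None)
        if chosen is None:
            return node.score
        node = chosen


def always(record: Record) -> bool:
    """Plays the role of the True predicate."""
    return True


tree = TreeNode("play", always, [
    TreeNode("no_play", lambda r: r["windy"] == "true"),
    TreeNode("play", always),
])
```

With this tree, a record with windy equal to "true" chooses the first child and is scored "no_play"; any other record falls through to the second child and is scored "play".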

The PREDICATES are:

Each Node has one %PREDICATES; that may be a Predicate, a CompoundPredicate, a True, or a False.
<!ENTITY % PREDICATES "( Predicate | 
                         CompoundPredicate | 
                         True | False  ) " >

<!ELEMENT Predicate EMPTY>
<!ATTLIST Predicate
   field        %FIELD-NAME;   #REQUIRED
   operator     ( equal       | notEqual | 
                  lessThan    | lessOrEqual | 
                  greaterThan | greaterOrEqual )  #REQUIRED
   value        CDATA          #REQUIRED
>
Definitions:

Predicate - this element defines a rule in the form of a simple boolean expression. The rule consists of a field, a binary comparison operator (operator), and a value. Mathematically, the rule is expressed as field operator value; that is, the field is the left operand and the value is the right operand. Attribute order is not significant, so the following samples are all equivalent to "age < 30":
      <Predicate field="age" operator="lessThan" value="30" />
      <Predicate value="30" operator="lessThan" field="age" />
      <Predicate operator="lessThan" value="30" field="age" />

field - This attribute of the Predicate element is the name of one of the MiningField elements in the MiningSchema.

operator - This attribute of Predicate is one of the six pre-defined comparison operators.
    Operator          Math Symbol
    equal             =
    notEqual          !=
    lessThan          <
    lessOrEqual       <=
    greaterThan       >
    greaterOrEqual    >=

value - This attribute of the Predicate element is the value that the field is compared against.
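
As an illustration of the three Predicate attributes, the following Python sketch evaluates "field operator value" against a record held as a dictionary. The function name eval_predicate, the use of None for a missing field, and the rule that numeric fields are compared as numbers while all other fields are compared as strings are assumptions of this sketch; the DTD itself only declares value as CDATA.

```python
import operator

# Map each PMML operator name to the matching Python comparison.
OPERATORS = {
    "equal": operator.eq,
    "notEqual": operator.ne,
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
}


def eval_predicate(record, field, op, value):
    """Evaluate `field op value`; return None when the field is missing.

    Assumption: numeric fields are compared as floats, everything else
    as strings. `value` is always the attribute text, e.g. "30".
    """
    actual = record.get(field)
    if actual is None:
        return None
    compare = OPERATORS[op]
    if isinstance(actual, (int, float)):
        return compare(float(actual), float(value))
    return compare(str(actual), value)
```

For example, eval_predicate({"age": 25}, "age", "lessThan", "30") corresponds to the sample "age < 30" and evaluates to TRUE for this record.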

<!ELEMENT CompoundPredicate ( %PREDICATES; , (%PREDICATES;)+ )>
<!ATTLIST CompoundPredicate
    booleanOperator (or | and | xor | cascade) #REQUIRED>
Definition:
CompoundPredicate - an encapsulating element for combining two or more elements as defined by the entity 
%PREDICATES;. The attribute associated with this element, booleanOperator, can take one of the following logical (boolean) 
operators: and, or, xor, or cascade.
The operator and indicates an evaluation to TRUE if all the predicates evaluate to TRUE.
The operator or indicates an evaluation to TRUE if at least one of the predicates evaluates to TRUE.
The operator xor indicates an evaluation to TRUE if an odd number of the predicates evaluate to TRUE and all 
others evaluate to FALSE.
The operator cascade allows for specifying surrogate predicates. cascade is used for 
cases where a missing value leaves the evaluation of the parent predicate undetermined, so that an alternative predicate is available.
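
A minimal Python sketch of how the results of the nested predicates might be combined for and, or, and xor. It assumes every nested predicate has already evaluated to TRUE or FALSE (no missing values), and it treats xor in its conventional exclusive-or sense, TRUE when an odd number of operands are TRUE; both points are assumptions of this sketch, as is the function name combine. cascade is left out here because it only matters when values are missing.

```python
from functools import reduce


def combine(boolean_operator, results):
    """Combine two or more known truth values (True/False)."""
    if len(results) < 2:
        raise ValueError("CompoundPredicate needs at least two predicates")
    if boolean_operator == "and":
        return all(results)
    if boolean_operator == "or":
        return any(results)
    if boolean_operator == "xor":
        # Assumption: conventional exclusive-or, folded pairwise, so the
        # result is True exactly when an odd number of operands are True.
        return reduce(lambda a, b: a != b, results)
    raise ValueError("unsupported operator: " + boolean_operator)
```

The length check mirrors the content model of CompoundPredicate, which requires one %PREDICATES; element followed by at least one more.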

<!ELEMENT True EMPTY>
Definition:
True - a predicate element that identifies the boolean constant TRUE.

<!ELEMENT False EMPTY>
Definition:
False - a predicate element that identifies the boolean constant FALSE.

ScoreDistribution

A method to list predicted values in a classification tree structure.

<!ELEMENT ScoreDistribution EMPTY>
<!ATTLIST ScoreDistribution 
   value       CDATA          #REQUIRED
   recordCount %NUMBER;       #REQUIRED
>
Definitions:
ScoreDistribution - an element of Node that represents segments of the score that a node predicts in 
a classification model. If the node holds an enumeration, each entry of the enumeration is stored in one 
ScoreDistribution element.
value - This attribute of ScoreDistribution is the label in a classification model.
recordCount - This attribute of ScoreDistribution is the size (possibly the number of records) 
associated with the value attribute.
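
Because the recordCount of a Node serves as the base size for the recordCount values of its ScoreDistribution elements, the relative size of each value can be computed by a simple division. The Python sketch below shows this; the function name relative_sizes and the sample counts are hypothetical.

```python
def relative_sizes(node_record_count, distribution):
    """Return each ScoreDistribution value's share of the Node's recordCount.

    `distribution` maps a value label to its recordCount.
    """
    if node_record_count <= 0:
        raise ValueError("recordCount must be positive")
    return {
        value: count / node_record_count
        for value, count in distribution.items()
    }


# Hypothetical Node with recordCount 100 and two ScoreDistribution entries.
shares = relative_sizes(100, {"play": 75, "no_play": 25})
```

Here shares holds 0.75 for "play" and 0.25 for "no_play". The same division, applied to the recordCount of a child Node and that of its parent, gives the relative size of the child.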


Extensions:

<!ATTLIST TreeModel
   x-splitCharacteristic (binarySplit | multiSplit) #REQUIRED
>
Definition:
x-splitCharacteristic - indicates whether the tree model has exactly two splits per node, or multiple splits 
per node. In the multiSplit case, each node may independently have two or more splits.
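
One possible way to derive the value of this attribute from an existing tree is sketched below in Python, using the standard xml.etree.ElementTree module. The function name split_characteristic is hypothetical, and the sample is a simplified fragment (it omits MiningSchema) used only to exercise the check.

```python
import xml.etree.ElementTree as ET


def split_characteristic(tree_model):
    """Return "binarySplit" if every branching Node has exactly two
    child Nodes, otherwise "multiSplit"."""
    for node in tree_model.iter("Node"):
        children = node.findall("Node")  # direct child Nodes only
        if children and len(children) != 2:
            return "multiSplit"
    return "binarySplit"


# Simplified, hypothetical fragment: the root Node has three children.
SAMPLE = """
<TreeModel modelName="demo">
  <Node score="a"><True/>
    <Node score="a"><True/></Node>
    <Node score="b"><True/></Node>
    <Node score="c"><True/></Node>
  </Node>
</TreeModel>
"""
model = ET.fromstring(SAMPLE)
```

For this fragment the function returns "multiSplit", because the root Node branches three ways.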


Examples

How cascade may be useful
The CART algorithm has the concept of a surrogate split. Suppose a record being classified reaches 
a node where the primary split is "salary <= 35000". Further assume the record has a missing value for 
salary, which happens quite naturally. CART deals with this situation by applying a sequence of surrogate rules, 
in cascade-like fashion, until one of them can classify the given record. There may be zero or more surrogate splits 
available. In our example, we could have "age <= 28" and "homeowner == 0" as surrogates. If age is not missing, 
the record is classified according to the age value. If age is missing, we try homeowner. If homeowner is also missing,
we have run out of surrogates and we apply a True or False, as specified by the XML document. 
For example, (salary <= 35000) cascade (True) means "classify the record according to salary if salary is not missing; 
if salary is missing, the predicate returns True anyway".
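
The surrogate sequence described above can be sketched in Python as follows. The sketch assumes that a predicate whose field is missing yields None and that cascade returns the first result that is not None; the function names cascade, leq, and primary_split are hypothetical.

```python
def cascade(results):
    """Return the first result that is not missing (None).

    `results` holds the outcomes of the primary predicate and its
    surrogates, in order; None marks a predicate that could not be
    evaluated because its field is missing.
    """
    for result in results:
        if result is not None:
            return result
    return None


def leq(record, field, threshold):
    """Evaluate `field <= threshold`, or None when the field is missing."""
    value = record.get(field)
    return None if value is None else value <= threshold


def primary_split(record):
    """salary <= 35000, then the surrogates age <= 28 and homeowner == 0,
    then the constant True once all surrogates are exhausted."""
    homeowner = record.get("homeowner")
    return cascade([
        leq(record, "salary", 35000),
        leq(record, "age", 28),
        None if homeowner is None else homeowner == 0,
        True,
    ])
```

A record with salary 40000 is decided by the primary split alone. A record with no salary but age 20 is decided by the first surrogate, and a record with none of the three fields falls through to the final True.
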

Example TreeModel

<?xml version="1.0" ?>
<PMML version="1.1" >
  <Header copyright="www.MagnifyResearch.com" description="A very small binary tree model to show structure."/>
  <DataDictionary numberOfFields="5" >
     <DataField name="temperature" optype="continuous"/>
     <DataField name="humidity" optype="continuous"/>
     <DataField name="windy" optype="categorical" >
        <Value value="true"/>
        <Value value="false"/>
     </DataField>
     <DataField name="outlook" optype="categorical" >
        <Value value="sunny"/>
        <Value value="overcast"/>
        <Value value="rain"/>
     </DataField>
     <DataField name="whatIdo" optype="categorical" >
        <Value value="play"/>
        <Value value="no_play"/>
     </DataField>
  </DataDictionary>
  <TreeModel modelName="golfing">
     <MiningSchema>
        <MiningField name="temperature"/>
        <MiningField name="humidity"/>
        <MiningField name="windy"/>
        <MiningField name="outlook"/>
        <MiningField name="whatIdo" usageType="predicted"/>
     </MiningSchema>
     <Node score="play">
        <Predicate field="outlook" operator="equal" value="sunny"/>
        <Node score="play">
           <CompoundPredicate booleanOperator="and" >
              <Predicate field="temperature" operator="lessThan" value="90F" />
              <Predicate field="temperature" operator="greaterThan" value="50F" />
              <Predicate field="humidity" operator="lessThan" value="70" />
           </CompoundPredicate>
           <Node score="play"> <True/> </Node>
           <Node score="no_play"> <True/> </Node>
        </Node>
        <Node score="play">
           <Predicate field="outlook" operator="equal" value="rain"/>
           <Node score="no_play"> <True/> </Node>
           <Node score="play">
              <Predicate field="windy" operator="equal" value="true" />
              <Node score="no_play"> <True/> </Node>
              <Node score="play"> 
                 <CompoundPredicate booleanOperator="and" >
                    <Predicate field="temperature" operator="lessThan" value="100F" />
                    <Predicate field="humidity" operator="lessThan" value="60" />
                 </CompoundPredicate>
                 <Node score="play"> <True/> </Node>
                 <Node score="no_play"> <True/> </Node>      
              </Node>
           </Node>
        </Node>
     </Node>
  </TreeModel>
</PMML>

Copyright © 2000 DMG.org All Rights Reserved.