PMML 4.4 - Tree Models

The TreeModel in PMML allows for defining either a classification or prediction structure. Each Node holds a logical predicate expression that defines the rule for choosing the Node or any of the branching Nodes.

<xs:element name="TreeModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="MiningSchema"/>
      <xs:element ref="Output" minOccurs="0"/>
      <xs:element ref="ModelStats" minOccurs="0"/>
      <xs:element ref="ModelExplanation" minOccurs="0"/>
      <xs:element ref="Targets" minOccurs="0"/>
      <xs:element ref="LocalTransformations" minOccurs="0"/>
      <xs:element ref="Node"/>
      <xs:element ref="ModelVerification" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="modelName" type="xs:string"/>
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
    <xs:attribute name="algorithmName" type="xs:string"/>
    <xs:attribute name="missingValueStrategy" type="MISSING-VALUE-STRATEGY" default="none"/>
    <xs:attribute name="missingValuePenalty" type="PROB-NUMBER" default="1.0"/>
    <xs:attribute name="noTrueChildStrategy" type="NO-TRUE-CHILD-STRATEGY" default="returnNullPrediction"/>
    <xs:attribute name="splitCharacteristic" default="multiSplit">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="binarySplit"/>
          <xs:enumeration value="multiSplit"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="isScorable" type="xs:boolean" default="true"/>
  </xs:complexType>
</xs:element>

Definitions:
Each Node consists of:

<xs:element name="Node">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="PREDICATE"/>
      <xs:choice>
        <xs:sequence>
          <xs:element ref="Partition" minOccurs="0"/>
          <xs:element ref="ScoreDistribution" minOccurs="0" maxOccurs="unbounded"/>
          <xs:element ref="Node" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
        <xs:group ref="EmbeddedModel"/>
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string"/>
    <xs:attribute name="score" type="xs:string"/>
    <xs:attribute name="recordCount" type="NUMBER"/>
    <xs:attribute name="defaultChild" type="xs:string"/>
  </xs:complexType>
</xs:element>

Definitions:
The content of the attribute id can be any string that is unique within a model. Any numbering scheme can be used: ids can be enumerated as 1, 2, 3, 4, etc., or they can follow a hierarchical scheme like the one used for chapters, sections, and subsections in a book, e.g., 1.1.2.1 or 1.2.2.2.

Predicates

Each Node has one PREDICATE, which may be a SimplePredicate, a CompoundPredicate, a SimpleSetPredicate, a True, or a False.

<xs:group name="PREDICATE">
  <xs:choice>
    <xs:element ref="SimplePredicate"/>
    <xs:element ref="CompoundPredicate"/>
    <xs:element ref="SimpleSetPredicate"/>
    <xs:element ref="True"/>
    <xs:element ref="False"/>
  </xs:choice>
</xs:group>

<xs:element name="SimplePredicate">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="operator" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="equal"/>
          <xs:enumeration value="notEqual"/>
          <xs:enumeration value="lessThan"/>
          <xs:enumeration value="lessOrEqual"/>
          <xs:enumeration value="greaterThan"/>
          <xs:enumeration value="greaterOrEqual"/>
          <xs:enumeration value="isMissing"/>
          <xs:enumeration value="isNotMissing"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="value" type="xs:string"/>
  </xs:complexType>
</xs:element>

If the operator is isMissing or isNotMissing, the attribute value must not appear. With all other operators, however, the attribute value is required.

The predicates in the subnodes are evaluated left-to-right. The application algorithm chooses the first Node where the predicate evaluates to TRUE. Typically the rightmost Node just contains the predicate <True/>. If no Node applies and no final <True/> node is present, the noTrueChildStrategy applies (see below).

Definitions:
Mathematically, the rule is expressed as field operator value; that is, field is the left operand and value is the right operand. The order of the attributes does not matter: the following samples are all equivalent to age < 30.

<SimplePredicate field="age" operator="lessThan" value="30"/>
<SimplePredicate value="30" operator="lessThan" field="age"/>
<SimplePredicate operator="lessThan" value="30" field="age"/>

Compound predicates

<xs:element name="CompoundPredicate">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:sequence minOccurs="2" maxOccurs="unbounded">
        <xs:group ref="PREDICATE"/>
      </xs:sequence>
    </xs:sequence>
    <xs:attribute name="booleanOperator" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="or"/>
          <xs:enumeration value="and"/>
          <xs:enumeration value="xor"/>
          <xs:enumeration value="surrogate"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>

Definitions:
The operator and indicates an evaluation to TRUE if all the predicates evaluate to TRUE.

The operator or indicates an evaluation to TRUE if at least one of the predicates evaluates to TRUE.

The operator xor indicates an evaluation to TRUE if an odd number of the predicates evaluate to TRUE and all others evaluate to FALSE.

The operator surrogate allows for specifying surrogate predicates. They are used when a missing value appears in the evaluation of the primary predicate, so that an alternative predicate is available.

Simple set predicates

<xs:element name="SimpleSetPredicate">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Array"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="booleanOperator" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="isIn"/>
          <xs:enumeration value="isNotIn"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>

Definition:
The set of values is specified by the array in the content. The booleanOperator attribute of this element takes one of two operators: isIn or isNotIn. The operator isIn indicates an evaluation to TRUE if the field value is contained in the list of values in the array. The operator isNotIn indicates an evaluation to TRUE if the field value is not contained in the list of values in the array.

<xs:element name="True">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Definition:

<xs:element name="False">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Definition:
Sub-predicates (the child predicates of a CompoundPredicate) are grouped together and evaluated together. For example,

(temperature > 60) and (temperature < 100) and (outlook = overcast)

is represented by

<CompoundPredicate booleanOperator="and">
  <SimplePredicate field="temperature" operator="greaterThan" value="60"/>
  <SimplePredicate field="temperature" operator="lessThan" value="100"/>
  <SimplePredicate field="outlook" operator="equal" value="overcast"/>
</CompoundPredicate>

Where the children of a CompoundPredicate are themselves CompoundPredicates, each nested CompoundPredicate is evaluated as a group. For example,

(temperature < 90 and temperature > 50) or (humidity >= 80)

is represented by

<CompoundPredicate booleanOperator="or">
  <CompoundPredicate booleanOperator="and">
    <SimplePredicate field="temperature" operator="lessThan" value="90"/>
    <SimplePredicate field="temperature" operator="greaterThan" value="50"/>
  </CompoundPredicate>
  <SimplePredicate field="humidity" operator="greaterOrEqual" value="80"/>
</CompoundPredicate>

Predicates on missing values

The value of any field in a logical expression may be missing. A SimplePredicate evaluates to UNKNOWN if the value of field is missing. Note that the DataDictionary and MiningSchema may contain a definition on how to handle a missing value, e.g., by replacing it with a substitute. In that case the substituted value is used to evaluate the predicate.
The result of a CompoundPredicate with an operator and, or or xor is determined by the following table:
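A minimal Python sketch of three-valued evaluation, assuming that and, or, and xor follow the usual Kleene rules: UNKNOWN propagates unless the other operands already determine the result. The function names and the use of None to stand for UNKNOWN are illustrative choices, not part of PMML.

```python
from functools import reduce
from typing import List, Optional

# Three-valued truth: True, False, or None (standing in for UNKNOWN).
Tri = Optional[bool]

def tri_and(a: Tri, b: Tri) -> Tri:
    # A single FALSE decides the result even if the other operand is UNKNOWN.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def tri_or(a: Tri, b: Tri) -> Tri:
    # A single TRUE decides the result even if the other operand is UNKNOWN.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def tri_xor(a: Tri, b: Tri) -> Tri:
    # xor can never be decided while any operand is UNKNOWN.
    if a is None or b is None:
        return None
    return a != b

def evaluate_compound(boolean_operator: str, results: List[Tri]) -> Tri:
    """Fold the child predicate results with the chosen operator."""
    operators = {"and": tri_and, "or": tri_or, "xor": tri_xor}
    return reduce(operators[boolean_operator], results)
```

Folding xor pairwise yields TRUE exactly when an odd number of operands are TRUE and none is UNKNOWN, which agrees with the definition of xor given earlier.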
The operator surrogate provides a special means to handle logical expressions with missing values. It is applied to a sequence of predicates. The order of the predicates matters: the first predicate is the primary, and the following predicates are the surrogates. Evaluation order is left-to-right. The cascaded predicates are applied when the primary predicate evaluates to UNKNOWN. A surrogate predicate can therefore resolve a predicate that would otherwise remain undetermined.

Example

<CompoundPredicate booleanOperator="surrogate">
  <CompoundPredicate booleanOperator="and">
    <SimplePredicate field="temperature" operator="lessThan" value="90"/>
    <SimplePredicate field="temperature" operator="greaterThan" value="50"/>
  </CompoundPredicate>
  <SimplePredicate field="humidity" operator="greaterOrEqual" value="80"/>
  <False/>
</CompoundPredicate>

The primary predicate is temperature < 90 and temperature > 50. If it evaluates to UNKNOWN, the first surrogate, humidity >= 80, is evaluated. If that is also UNKNOWN, the final <False/> applies and the CompoundPredicate evaluates to FALSE.

ScoreDistribution

This element comprises a method to list predicted values in a classification tree structure.

<xs:element name="ScoreDistribution">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="value" type="xs:string" use="required"/>
    <xs:attribute name="recordCount" type="NUMBER" use="required"/>
    <xs:attribute name="confidence" type="PROB-NUMBER"/>
    <xs:attribute name="probability" type="PROB-NUMBER"/>
  </xs:complexType>
</xs:element>

Attribute Definitions
When a Node is selected as the final Node and this Node has no score attribute, the highest recordCount in the ScoreDistribution determines which value is selected as the predicted class. If a Node contains a sequence of ScoreDistribution elements in which more than one entry has the highest recordCount, the first such entry is selected. Note: If a Node has an attribute score, this attribute value overrides the computation of a predicted value from the ScoreDistribution.

Missing Value Strategies and Penalties

The purpose of the missing value strategy is to define what happens when missing values are encountered in a case to be scored by the tree model, in situations where the main predicate defined at a decision tree node evaluates to UNKNOWN. See the section Predicates on missing values for an explanation of how missing values can cause a predicate to evaluate to UNKNOWN.

missingValueStrategy:

This optional attribute of TreeModel indicates which strategy to apply when a Node's predicate evaluates to UNKNOWN during the scoring of a case:

<xs:simpleType name="MISSING-VALUE-STRATEGY">
  <xs:restriction base="xs:string">
    <xs:enumeration value="lastPrediction"/>
    <xs:enumeration value="nullPrediction"/>
    <xs:enumeration value="defaultChild"/>
    <xs:enumeration value="weightedConfidence"/>
    <xs:enumeration value="aggregateNodes"/>
    <xs:enumeration value="none"/>
  </xs:restriction>
</xs:simpleType>

Definitions:
Note: The missingValueStrategy is not invoked if missing values are handled within predicates, either by compound predicates composed using the surrogate operator or by simple predicates containing the comparison operators isMissing or isNotMissing. When the predicate contains these operators, it can evaluate to TRUE or FALSE even when fields referenced within the predicate have missing values.

missingValuePenalty:

This optional attribute of TreeModel allows computed confidences to be reduced by a specified factor each time certain kinds of missing value handling are invoked during the scoring of a case. For each Node where either surrogate rules or the defaultChild strategy had to be used to select a child, the final confidences are multiplied by this factor. Note that this is based on the number of Nodes, not on the overall number of missing values that were encountered (with operator surrogate, multiple missing values can be encountered within a single Node). For example, if two Nodes with missing values were encountered to get to the final prediction, the confidence is multiplied by missingValuePenalty twice.

Handling the situation where scoring cannot continue

noTrueChildStrategy:

During the scoring of a case, if the scoring reaches an internal Node at which none of the subnodes' predicates evaluate to TRUE, and no missing value handling strategy (if defined) is invoked for any of these subnodes, this optional attribute of TreeModel determines what to do next:

<xs:simpleType name="NO-TRUE-CHILD-STRATEGY">
  <xs:restriction base="xs:string">
    <xs:enumeration value="returnNullPrediction"/>
    <xs:enumeration value="returnLastPrediction"/>
  </xs:restriction>
</xs:simpleType>

Definitions:
In the following example, if scoring reaches N1, but the case to be scored has a value for field prob1 which is less than or equal to 0.33, the noTrueChildStrategy defined for the tree determines what action to take. If set to returnNullPrediction, then no prediction is returned. If set to returnLastPrediction, then the score of N1 (0) is returned.

<Node id="N1" score="0">
  <True/>
  <Node id="T1" score="1">
    <SimplePredicate field="prob1" operator="greaterThan" value="0.33"/>
  </Node>
</Node>

Examples

How to use surrogate

The CART algorithm features the concept of surrogate splits. For example,
when classifying a record, the record is dropped to a node where the primary
split is For example,
means:
Classify the record according to salary, if salary is not missing. If salary is missing, the predicate returns TRUE anyway.

Example TreeModel

<PMML xmlns="https://www.dmg.org/PMML-4_4" version="4.4">
  <Header copyright="www.dmg.org" description="A very small binary tree model to show structure."/>
  <DataDictionary numberOfFields="5">
    <DataField name="temperature" optype="continuous" dataType="double"/>
    <DataField name="humidity" optype="continuous" dataType="double"/>
    <DataField name="windy" optype="categorical" dataType="string">
      <Value value="true"/>
      <Value value="false"/>
    </DataField>
    <DataField name="outlook" optype="categorical" dataType="string">
      <Value value="sunny"/>
      <Value value="overcast"/>
      <Value value="rain"/>
    </DataField>
    <DataField name="whatIdo" optype="categorical" dataType="string">
      <Value value="will play"/>
      <Value value="may play"/>
      <Value value="no play"/>
    </DataField>
  </DataDictionary>
  <TreeModel modelName="golfing" functionName="classification">
    <MiningSchema>
      <MiningField name="temperature"/>
      <MiningField name="humidity"/>
      <MiningField name="windy"/>
      <MiningField name="outlook"/>
      <MiningField name="whatIdo" usageType="target"/>
    </MiningSchema>
    <Node score="will play">
      <True/>
      <Node score="will play">
        <SimplePredicate field="outlook" operator="equal" value="sunny"/>
        <Node score="will play">
          <CompoundPredicate booleanOperator="and">
            <SimplePredicate field="temperature" operator="lessThan" value="90"/>
            <SimplePredicate field="temperature" operator="greaterThan" value="50"/>
          </CompoundPredicate>
          <Node score="will play">
            <SimplePredicate field="humidity" operator="lessThan" value="80"/>
          </Node>
          <Node score="no play">
            <SimplePredicate field="humidity" operator="greaterOrEqual" value="80"/>
          </Node>
        </Node>
        <Node score="no play">
          <CompoundPredicate booleanOperator="or">
            <SimplePredicate field="temperature" operator="greaterOrEqual" value="90"/>
            <SimplePredicate field="temperature" operator="lessOrEqual" value="50"/>
          </CompoundPredicate>
        </Node>
      </Node>
      <Node score="may play">
        <CompoundPredicate booleanOperator="or">
          <SimplePredicate field="outlook" operator="equal" value="overcast"/>
          <SimplePredicate field="outlook" operator="equal" value="rain"/>
        </CompoundPredicate>
        <Node score="may play">
          <CompoundPredicate booleanOperator="and">
            <SimplePredicate field="temperature" operator="greaterThan" value="60"/>
            <SimplePredicate field="temperature" operator="lessThan" value="100"/>
            <SimplePredicate field="outlook" operator="equal" value="overcast"/>
            <SimplePredicate field="humidity" operator="lessThan" value="70"/>
            <SimplePredicate field="windy" operator="equal" value="false"/>
          </CompoundPredicate>
        </Node>
        <Node score="no play">
          <CompoundPredicate booleanOperator="and">
            <SimplePredicate field="outlook" operator="equal" value="rain"/>
            <SimplePredicate field="humidity" operator="lessThan" value="70"/>
          </CompoundPredicate>
        </Node>
      </Node>
    </Node>
  </TreeModel>
</PMML>

Scoring Procedure

We will use the above example to illustrate the steps that should be followed in the scoring process. The input data is assumed to be:
Scoring Procedure with Missing Value Strategies

Example calculations of scoring with missing values are based on the following tree model:

<PMML xmlns="https://www.dmg.org/PMML-4_4" version="4.4">
  <Header copyright="www.dmg.org" description="A very small tree model to demonstrate missing value handling and confidence calculation."/>
  <DataDictionary numberOfFields="4">
    <DataField name="temperature" optype="continuous" dataType="double"/>
    <DataField name="humidity" optype="continuous" dataType="double"/>
    <DataField name="outlook" optype="categorical" dataType="string">
      <Value value="sunny"/>
      <Value value="overcast"/>
      <Value value="rain"/>
    </DataField>
    <DataField name="whatIdo" optype="categorical" dataType="string">
      <Value value="will play"/>
      <Value value="may play"/>
      <Value value="no play"/>
    </DataField>
  </DataDictionary>
  <TreeModel modelName="golfing" functionName="classification" missingValueStrategy="weightedConfidence">
    <MiningSchema>
      <MiningField name="temperature"/>
      <MiningField name="humidity"/>
      <MiningField name="outlook"/>
      <MiningField name="whatIdo" usageType="target"/>
    </MiningSchema>
    <Node id="1" score="will play" recordCount="100" defaultChild="2">
      <True/>
      <ScoreDistribution value="will play" recordCount="60" confidence="0.6"/>
      <ScoreDistribution value="may play" recordCount="30" confidence="0.3"/>
      <ScoreDistribution value="no play" recordCount="10" confidence="0.1"/>
      <Node id="2" score="will play" recordCount="50" defaultChild="3">
        <SimplePredicate field="outlook" operator="equal" value="sunny"/>
        <ScoreDistribution value="will play" recordCount="40" confidence="0.8"/>
        <ScoreDistribution value="may play" recordCount="2" confidence="0.04"/>
        <ScoreDistribution value="no play" recordCount="8" confidence="0.16"/>
        <Node id="3" score="will play" recordCount="40">
          <CompoundPredicate booleanOperator="surrogate">
            <SimplePredicate field="temperature" operator="greaterOrEqual" value="50"/>
            <SimplePredicate field="humidity" operator="lessThan" value="80"/>
          </CompoundPredicate>
          <ScoreDistribution value="will play" recordCount="36" confidence="0.9"/>
          <ScoreDistribution value="may play" recordCount="2" confidence="0.05"/>
          <ScoreDistribution value="no play" recordCount="2" confidence="0.05"/>
        </Node>
        <Node id="4" score="no play" recordCount="10">
          <CompoundPredicate booleanOperator="surrogate">
            <SimplePredicate field="temperature" operator="lessThan" value="50"/>
            <SimplePredicate field="humidity" operator="greaterOrEqual" value="80"/>
          </CompoundPredicate>
          <ScoreDistribution value="will play" recordCount="4" confidence="0.4"/>
          <ScoreDistribution value="may play" recordCount="0" confidence="0.0"/>
          <ScoreDistribution value="no play" recordCount="6" confidence="0.6"/>
        </Node>
      </Node>
      <Node id="5" score="may play" recordCount="50">
        <CompoundPredicate booleanOperator="or">
          <SimplePredicate field="outlook" operator="equal" value="overcast"/>
          <SimplePredicate field="outlook" operator="equal" value="rain"/>
        </CompoundPredicate>
        <ScoreDistribution value="will play" recordCount="20" confidence="0.4"/>
        <ScoreDistribution value="may play" recordCount="28" confidence="0.56"/>
        <ScoreDistribution value="no play" recordCount="2" confidence="0.04"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>

Example 1 - Scoring with explicit confidences

For the case to be scored, scoring ends at node 4. The prediction at node 4 is "no play" and the associated confidence (given by the confidence attribute where value="no play" in the score distribution) is 0.6.

Example 2 - Scoring with a missing value, and weightedConfidence missing value handling

The case to be scored has outlook="sunny", with temperature and humidity unknown. Scoring leads to node 2, but because temperature and humidity are not known, the predicate of node 2's first child (node 3) evaluates to UNKNOWN and the missingValueStrategy weightedConfidence is invoked at this point. This is resolved by deriving confidences for each class resulting from choosing each child node of node 2 where the predicate does not evaluate to FALSE (nodes 3 and 4).
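The weighting used in Example 2 can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function name weighted_confidence and the tuple layout are assumptions made here, while the record counts and confidences are copied from nodes 3 and 4 of the model above.

```python
from typing import Dict, List, Tuple

# Each candidate child is (recordCount, {class value: confidence}).
Child = Tuple[float, Dict[str, float]]

def weighted_confidence(children: List[Child]) -> Dict[str, float]:
    """Combine per-class confidences, weighting each child by its share of records."""
    total = sum(count for count, _ in children)
    combined: Dict[str, float] = {}
    for count, confidences in children:
        weight = count / total
        for label, confidence in confidences.items():
            combined[label] = combined.get(label, 0.0) + confidence * weight
    return combined

# Values taken from the ScoreDistribution elements of nodes 3 and 4.
node3 = (40, {"will play": 0.9, "may play": 0.05, "no play": 0.05})
node4 = (10, {"will play": 0.4, "may play": 0.0, "no play": 0.6})

node2 = weighted_confidence([node3, node4])
prediction = max(node2, key=node2.get)  # class with the highest combined confidence
```

With these inputs the combined values agree, up to floating-point rounding, with the confidences 0.8, 0.04, and 0.16 listed in node 2's own ScoreDistribution.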
Node 3 confidences: will play 0.9, may play 0.05, no play 0.05

Node 4 confidences: will play 0.4, may play 0.0, no play 0.6

Now these confidences are recombined, weighted according to the relative numbers of records assigned to nodes 3 (40 records) and 4 (10 records).

Node 2 confidences:
will play: 0.9 * 40/50 + 0.4 * 10/50 = 0.8
may play: 0.05 * 40/50 + 0.0 * 10/50 = 0.04
no play: 0.05 * 40/50 + 0.6 * 10/50 = 0.16

The overall prediction returned for this case is the one with the highest confidence, "will play".

Example 3 - Scoring with multiple missing values, and weightedConfidence missing value handling

The case to be scored has unknown values for outlook, humidity and temperature. The predicate of node 2 evaluates to UNKNOWN, due to the missing value for outlook, so the missingValueStrategy weightedConfidence is invoked at this point. Confidences for each class are derived from each child of node 1 where the predicate does not evaluate to FALSE (nodes 2 and 5) and recombined. Node 2 confidences for this case are computed using the steps in Example 2.

Node 2 confidences: will play 0.8, may play 0.04, no play 0.16

Node 5 confidences: will play 0.4, may play 0.56, no play 0.04

Now the confidences are recombined, weighted according to the numbers of records assigned to nodes 2 (50 records) and 5 (50 records).

Node 1 confidences:
will play: 0.8 * 50/100 + 0.4 * 50/100 = 0.6
may play: 0.04 * 50/100 + 0.56 * 50/100 = 0.3
no play: 0.16 * 50/100 + 0.04 * 50/100 = 0.1

The overall prediction returned for this case is the one with the highest confidence, "will play".

Example 4 - Scoring with defaultChild missing value handling

Suppose we alter the example TreeModel to set the missingValueStrategy attribute to defaultChild. We also add the attribute missingValuePenalty and set it to 0.8. Now consider how to score a case in which outlook is missing. The predicate of node 2 evaluates to UNKNOWN, due to the missing value for outlook, so the missingValueStrategy defaultChild is invoked at this point. Scoring continues by selecting node 1's defaultChild (node 2). Scoring then continues normally from node 2. The prediction returned is "no play", but the confidence returned is 0.6 multiplied by the missingValuePenalty of 0.8, which is 0.48.

Example 5 - Scoring with defaultChild missing value handling, multiple missing values

Suppose we alter the example TreeModel to set the missingValueStrategy attribute to defaultChild. We also add the attribute missingValuePenalty and set it to 0.8. Now consider how to score a case in which both outlook and temperature are missing. The predicate of node 2 evaluates to UNKNOWN, due to the missing value for outlook, so the missingValueStrategy defaultChild is invoked at this point. Scoring continues by selecting node 1's defaultChild (node 2). At node 2, the surrogate predicate based on humidity is used to select node 3. The prediction returned is "will play", but the confidence returned is 0.9 multiplied by the missingValuePenalty of 0.8 for each of the two nodes (node 1, node 2) where missing value handling was used, giving 0.9 * 0.8 * 0.8 = 0.576.
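
The penalty arithmetic of Examples 4 and 5 can be sketched in Python, assuming the penalty is applied once for every Node at which surrogate or defaultChild handling was needed. The function name penalized_confidence is hypothetical.

```python
def penalized_confidence(confidence: float, penalty: float, penalized_nodes: int) -> float:
    """Multiply the confidence by the penalty once per Node that needed missing value handling."""
    return confidence * penalty ** penalized_nodes

# Example 4: one Node (node 1) used defaultChild; node 4's confidence for "no play" is 0.6.
example4 = penalized_confidence(0.6, 0.8, 1)  # approximately 0.48

# Example 5: node 1 used defaultChild and node 2 used a surrogate; node 3's confidence is 0.9.
example5 = penalized_confidence(0.9, 0.8, 2)  # approximately 0.576
```

Counting Nodes rather than missing fields matters here: a single Node that falls through several surrogates still contributes only one factor of missingValuePenalty.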

Example 6 - Scoring with lastPrediction missing value handling

Suppose we alter the example TreeModel to set the missingValueStrategy attribute to lastPrediction. Now consider how to score a case with

Example 7 - Scoring with nullPrediction missing value handling

Suppose we alter the example TreeModel to set the missingValueStrategy attribute to nullPrediction. Now consider how to score a case with

Example 8 - Scoring with missingValueStrategy aggregateNodes

The case to be scored has temperature="45" and humidity="90", but outlook is unknown. Evaluation of node 2 is UNKNOWN because outlook is unknown. The missingValueStrategy aggregateNodes is invoked at this point, and it is assumed that node 2 evaluates to TRUE. Under this assumption, node 3 evaluates to FALSE, but node 4 evaluates to TRUE. The remaining sibling nodes of node 2 must also be evaluated: node 5 evaluates to TRUE.

Node 4 recordCounts: will play 4, may play 0, no play 6

Node 5 recordCounts: will play 20, may play 28, no play 2

Now the recordCounts are accumulated, which leads to the following total recordCounts: will play 24, may play 28, no play 8
The overall prediction returned for this case is the one with the highest accumulated recordCount, "may play". The confidence is calculated as follows:
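
A minimal Python sketch of the aggregation, assuming the confidence is the winning class's share of all accumulated record counts. The function name aggregate_nodes is hypothetical; the counts are copied from the ScoreDistribution elements of nodes 4 and 5.

```python
from collections import Counter
from typing import Dict, List, Tuple

def aggregate_nodes(leaves: List[Dict[str, float]]) -> Tuple[str, float, Counter]:
    """Sum recordCounts per class over all reached leaves and pick the largest total."""
    totals: Counter = Counter()
    for record_counts in leaves:
        totals.update(record_counts)  # adds counts class by class
    prediction = max(totals, key=totals.get)
    confidence = totals[prediction] / sum(totals.values())
    return prediction, confidence, totals

# recordCounts of the leaves reached in Example 8.
node4 = {"will play": 4, "may play": 0, "no play": 6}
node5 = {"will play": 20, "may play": 28, "no play": 2}

prediction, confidence, totals = aggregate_nodes([node4, node5])
```

Under that assumption the confidence for "may play" is 28 / (24 + 28 + 8), roughly 0.47.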

Example 9 - Scoring with missingValueStrategy none

...
<TreeModel modelName="golfing" functionName="classification" missingValueStrategy="none">
  ...
  <Node id="1" score="will play" recordCount="100">
    <True/>
    <Node id="2" score="will play" recordCount="50">
      <SimplePredicate field="age" operator="lessThan" value="30"/>
    </Node>
    <Node id="3" score="will not play" recordCount="20">
      <SimplePredicate field="age" operator="greaterOrEqual" value="30"/>
    </Node>
    <Node id="4" score="will play" recordCount="30">
      <True/>
    </Node>
  </Node>
...

Now consider how to score a case in which age is unknown. All valid values for age are covered by nodes 2 and 3 and would never reach node 4, but missingValueStrategy="none" prevents either of them from firing, since a missing value is neither less than 30 nor greater than or equal to 30. The final node, which always fires, takes care of missing values for age.
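
The child selection in Example 9 can be sketched in Python, assuming that under missingValueStrategy none a predicate that evaluates to UNKNOWN simply does not fire. The helper names and the use of None for UNKNOWN are illustrative, not part of PMML.

```python
from typing import Callable, Dict, List, Optional, Tuple

UNKNOWN = None  # stand-in for a predicate result of UNKNOWN
Record = Dict[str, Optional[float]]
Predicate = Callable[[Record], Optional[bool]]

def less_than(field: str, threshold: float) -> Predicate:
    def evaluate(record: Record) -> Optional[bool]:
        value = record.get(field)
        return UNKNOWN if value is None else value < threshold
    return evaluate

def greater_or_equal(field: str, threshold: float) -> Predicate:
    def evaluate(record: Record) -> Optional[bool]:
        value = record.get(field)
        return UNKNOWN if value is None else value >= threshold
    return evaluate

def always_true(record: Record) -> Optional[bool]:
    return True

# Children of node 1 in document order: (id, score, predicate).
CHILDREN: List[Tuple[str, str, Predicate]] = [
    ("2", "will play", less_than("age", 30)),
    ("3", "will not play", greater_or_equal("age", 30)),
    ("4", "will play", always_true),
]

def select_child(record: Record) -> Optional[Tuple[str, str]]:
    """Return the first child whose predicate is TRUE; UNKNOWN never fires."""
    for node_id, score, predicate in CHILDREN:
        if predicate(record) is True:
            return node_id, score
    return None
```

A record with a known age stops at node 2 or node 3, while a record without age falls through to node 4.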