PMML 4.4.1 - Mining Schema

The MiningSchema is the gatekeeper for its model element: all data entering a model must pass through the MiningSchema. Each model element contains one MiningSchema, which lists the fields used in that model. While the MiningSchema contains information that is specific to a certain model, the DataDictionary contains data definitions that do not vary per model. The main purpose of the MiningSchema is to list the fields that have to be provided in order to apply the model.

MiningFields also define the usage of each field (active, supplementary, target, ...) as well as policies for treating missing, invalid or outlier values.
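As an illustration, a MiningSchema inside a model might look like the following minimal sketch. The field names (age, income, region, churn) are hypothetical and are assumed to be declared in the DataDictionary.

```xml
<MiningSchema>
  <!-- Inputs that must be supplied to apply the model; usageType defaults to "active" -->
  <MiningField name="age"/>
  <MiningField name="income" usageType="active"/>
  <!-- Descriptive only, not required for scoring -->
  <MiningField name="region" usageType="supplementary"/>
  <!-- The field the model was trained to predict -->
  <MiningField name="churn" usageType="target"/>
</MiningSchema>
```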

<xs:element name="MiningSchema">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element maxOccurs="unbounded" ref="MiningField"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="MiningField">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="name" type="FIELD-NAME" use="required"/>
    <xs:attribute name="usageType" type="FIELD-USAGE-TYPE" default="active"/>
    <xs:attribute name="optype" type="OPTYPE"/>
    <xs:attribute name="importance" type="PROB-NUMBER"/>
    <xs:attribute name="outliers" type="OUTLIER-TREATMENT-METHOD" default="asIs"/>
    <xs:attribute name="lowValue" type="NUMBER"/>
    <xs:attribute name="highValue" type="NUMBER"/>
    <xs:attribute name="missingValueReplacement" type="xs:string"/>
    <xs:attribute name="missingValueTreatment" type="MISSING-VALUE-TREATMENT-METHOD"/>
    <xs:attribute name="invalidValueTreatment" type="INVALID-VALUE-TREATMENT-METHOD" default="returnInvalid"/>
    <xs:attribute name="invalidValueReplacement" type="xs:string"/>
  </xs:complexType>
</xs:element>

<xs:simpleType name="FIELD-USAGE-TYPE">
  <xs:restriction base="xs:string">
    <xs:enumeration value="active"/>
    <xs:enumeration value="predicted"/>
    <xs:enumeration value="target"/>
    <xs:enumeration value="supplementary"/>
    <xs:enumeration value="group"/>
    <xs:enumeration value="order"/>
    <xs:enumeration value="frequencyWeight"/>
    <xs:enumeration value="analysisWeight"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="OUTLIER-TREATMENT-METHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="asIs"/>
    <xs:enumeration value="asMissingValues"/>
    <xs:enumeration value="asExtremeValues"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="MISSING-VALUE-TREATMENT-METHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="asIs"/>
    <xs:enumeration value="asMean"/>
    <xs:enumeration value="asMode"/>
    <xs:enumeration value="asMedian"/>
    <xs:enumeration value="asValue"/>
    <xs:enumeration value="returnInvalid"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="INVALID-VALUE-TREATMENT-METHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="returnInvalid"/>
    <xs:enumeration value="asIs"/>
    <xs:enumeration value="asMissing"/>
    <xs:enumeration value="asValue"/>
  </xs:restriction>
</xs:simpleType>

name: symbolic name of field, must refer to a field in the scope of the parent of the MiningSchema's model element. For information on the scope of field names, see Scope of Fields.

If the DataDictionary defines a displayName for a field, the attribute name is still used for matching the input parameters to the internal formulas. displayName allows human-readable names at the interface while artificial identifiers are used within the semantics of the model.
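The relationship between name and displayName can be sketched as follows. The identifier f_0042 and its display name are made up for illustration.

```xml
<!-- In the DataDictionary: displayName is for presentation only -->
<DataField name="f_0042" displayName="Customer age" optype="continuous" dataType="integer"/>

<!-- In the model's MiningSchema: the field is referenced by name, not by displayName -->
<MiningField name="f_0042"/>
```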

usageType

active: field used as input (independent field).

target: field that was used as a training target for supervised models.

predicted: field whose value is predicted by the model. As of PMML 4.2, this is deprecated and it has been replaced by the usage type target.

supplementary: field holding additional descriptive information. Supplementary fields are not required to apply a model; they are provided as additional information for explanatory purposes. When a field has gone through preprocessing transformations before a model is built, an additional supplementary field is typically used to describe the statistics for the original field values.

group: field similar to the SQL GROUP BY. For example, this is used by AssociationModel and SequenceModel to group items into transactions by customerID or by transactionID.

order: This field defines the order of items or transactions and is currently used in SequenceModel and TimeSeriesModel. Similarly to group, it is motivated by the SQL syntax, namely by the ORDER BY statement.

frequencyWeight and analysisWeight: These fields are not needed for scoring, but provide very important information on how the model was built. Frequency weight usually has positive integer values and is sometimes called "replication weight". Its values can be interpreted as the number of times each record appears in the data. Analysis weight can have fractional positive values; it could be used as the regression weight in regression models or as the case weight in trees, etc. It can be interpreted as the differing importance of the cases in the model. Counts in ModelStats and Partitions can be computed using the frequency weight; mean and standard deviation values can be computed using both weights.
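To illustrate the group and order usage types described above, here is a hedged sketch of a MiningSchema for sequence-style transactional data. The field names (customerID, visitTime, item) are hypothetical, and whether a given model type accepts this exact combination should be checked against that model's own specification.

```xml
<MiningSchema>
  <!-- Rows sharing the same customerID are grouped together, as with SQL GROUP BY -->
  <MiningField name="customerID" usageType="group"/>
  <!-- Defines the order of items within a group, as with SQL ORDER BY -->
  <MiningField name="visitTime" usageType="order"/>
  <!-- The items that are grouped and ordered -->
  <MiningField name="item" usageType="active"/>
</MiningSchema>
```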

The definition of target fields in the MiningSchema is not required and, in most cases, does not have an impact on the scoring results. For supervised models, it is useful since, along with the corresponding data field, it provides information about the training target values and the values expected to be computed by the model. In addition, it is necessary when:

The model has more than one target field and disambiguation is required, as in the case of KNN models.
The model needs to compute residual values as one of the outputs.

optype: The attribute value overrides the corresponding value in the DataField. That is, a DataField can be used with different optypes in different models. For example, a 0/1 indicator could be used as a numeric input field in a regression model while the same field is used as a categorical field in a tree model.
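The override can be sketched as follows, assuming a hypothetical 0/1 indicator named is_member. The three fragments would live in different parts of a PMML document and are shown together only for comparison.

```xml
<!-- DataDictionary: the indicator is declared once -->
<DataField name="is_member" optype="continuous" dataType="integer"/>

<!-- MiningSchema of a regression model: no optype given, so the DataField optype applies -->
<MiningField name="is_member"/>

<!-- MiningSchema of a tree model: the optype is overridden for this model only -->
<MiningField name="is_member" optype="categorical"/>
```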

importance: states the relative importance of the field. This indicator is typically used in predictive models in order to rank fields by their predictive contribution. A value of 1.0 suggests that the target field is directly correlated to this field. A value of 0.0 suggests that the field is completely irrelevant. Most likely such a field would have usageType="supplementary" rather than usageType="active".
Note that the importance cannot be negative. Unlike a Pearson correlation coefficient, it does not indicate the 'direction' of a correlation with a negative number if a higher field value correlates to a lower target value. There is no commonly accepted correlation measure that is applicable to all combinations of numeric and categorical fields. But this attribute is still useful as it provides a mechanism for representing the results of feature selection.
Note that other mining standards such as JDM include algorithms for computing the importance of input fields. The results can be represented by this attribute in PMML.
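A feature selection result might be recorded as in the sketch below. The field names and importance values are invented for illustration and do not come from any real model.

```xml
<MiningSchema>
  <!-- Importance is a value between 0.0 and 1.0; higher means more predictive -->
  <MiningField name="income" importance="0.82"/>
  <MiningField name="age" importance="0.35"/>
  <!-- An irrelevant field would typically be supplementary rather than active -->
  <MiningField name="zip_code" usageType="supplementary" importance="0.0"/>
  <MiningField name="churn" usageType="target"/>
</MiningSchema>
```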

outliers: This attribute determines how outliers are handled by the model. Outliers are valid numeric values which are either greater than the specified highValue or less than the specified lowValue.

asIs: field values treated at face value.

asMissingValues: outlier values are treated as if they were missing.

asExtremeValues: outlier values are changed to a specific high or low value defined in MiningField.

highValue and lowValue: bounds of the valid range for the field, used in conjunction with the outliers attribute. At least one of these values is required when outliers="asExtremeValues" or outliers="asMissingValues".
Usage as extreme values:

if x<lowValue then x = lowValue

if x>highValue then x = highValue

Note that outliers applies only to fields defined in the MiningSchema and hence cannot be used for DerivedFields.
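The two non-default outlier treatments can be sketched as follows. The field names and bounds are hypothetical.

```xml
<!-- Clamp to the range 18 to 90: an input of 12 is scored as 18, 120 as 90, and 45 is unchanged -->
<MiningField name="age" outliers="asExtremeValues" lowValue="18" highValue="90"/>

<!-- Only an upper bound is given: values above 500000 are treated as missing -->
<MiningField name="income" outliers="asMissingValues" highValue="500000"/>
```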

The DataDictionary describes the Value element, which includes a property attribute for defining values as valid, invalid or missing. While valid values can be dealt with unchanged, the standard allows special treatment for missing and invalid values. The next two sections describe how missing and invalid values should be handled.

Missing Values

missingValueReplacement: If this attribute is specified, a missing input value is automatically replaced by the given value. That is, the model itself works as if the given value had been found in the original input. For example, the surrogate operator in TreeModel does not apply if the MiningField specifies a replacement value.

missingValueTreatment: In a PMML consumer this attribute is for information only, unless its value is returnInvalid. In that case, if a missing value is encountered in the given field, the model should return a value indicating an invalid result. For any other value, the consumer only looks at missingValueReplacement: if a replacement value is present, it replaces missing values. Except for returnInvalid, the missingValueTreatment attribute just indicates how the missingValueReplacement was derived and places no behavioral requirement on the consumer.

missingValueTreatment is a useful parameter in an API for training, and it can be copied into the PMML model. The scoring function, however, does not always know the actual mean, mode, median, etc., so the corresponding value must be present in the attribute missingValueReplacement.

For example, if you want the scoring function to replace missing values by the mean value, and the mean value in the training data is 3.14, write

<MiningField name="foo" missingValueReplacement="3.14" missingValueTreatment="asMean"/>

The replacement value MUST be specified using the missingValueReplacement attribute.
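By contrast, returnInvalid is the one missingValueTreatment value that changes consumer behavior. A minimal sketch, reusing the field name foo from the example above:

```xml
<!-- A missing value for foo makes the model return an invalid result; no replacement is applied -->
<MiningField name="foo" missingValueTreatment="returnInvalid"/>
```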

Specifications for missing values occur in several places in PMML.

The external representation of missing values is not directly defined by PMML. A PMML consumer system may implement them as null values in a database, or as blank strings in a file, etc.
The DataDictionary allows for an optional list of values which indicate a missing value. E.g., the data source may use the string "-" or "NA". If such a value occurs in the input data, a PMML consumer must treat it as a missing value.
The MiningSchema within a model may define an optional replacement value. If an input value is missing, then a PMML consumer must replace it with the specified value.
For each PMML model type, there is a specific method for how missing values are used in the computation of the score results.
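The DataDictionary and MiningSchema pieces listed above can work together as in the following sketch. The field color, its values, and the choice of "red" as the mode are assumptions made for illustration.

```xml
<!-- DataDictionary: the string "NA" in the input denotes a missing value -->
<DataField name="color" optype="categorical" dataType="string">
  <Value value="red"/>
  <Value value="green"/>
  <Value value="NA" property="missing"/>
</DataField>

<!-- MiningSchema: a missing color is scored as if "red" had been supplied;
     asMode only records that "red" was the most frequent training value -->
<MiningField name="color" missingValueReplacement="red" missingValueTreatment="asMode"/>
```

With these definitions, an input of "NA" is first recognized as missing via the DataDictionary and then replaced by "red" via the MiningSchema.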

Invalid Values

Invalid values are defined in PMML as those values not explicitly defined as valid or missing in DataField.

invalidValueTreatment: This attribute specifies how invalid input values are handled.

returnInvalid: the default. When an invalid input is encountered, the model should return a value indicating an invalid result.

asIs: the input is used without modification.

asMissing: an invalid input value is treated as a missing value and follows the behavior specified by the missingValueReplacement attribute, if present (see above). If asMissing is specified but no missingValueReplacement is present, a missing value is passed on for eventual handling by successive transformations via DerivedFields or in the actual mining model.

asValue: an invalid input value is replaced with the value specified by the attribute invalidValueReplacement, which must be present in this case; otherwise the PMML is invalid.
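Two of the invalid value treatments can be sketched as follows. The field risk and its values are hypothetical, and the two MiningField elements are alternatives rather than parts of one schema.

```xml
<!-- DataDictionary: only "low", "medium" and "high" are declared as valid -->
<DataField name="risk" optype="categorical" dataType="string">
  <Value value="low"/>
  <Value value="medium"/>
  <Value value="high"/>
</DataField>

<!-- Variant 1: any other input, such as "unknown", is replaced by "medium" -->
<MiningField name="risk" invalidValueTreatment="asValue" invalidValueReplacement="medium"/>

<!-- Variant 2: an invalid input becomes missing and then picks up the missing value replacement -->
<MiningField name="risk" invalidValueTreatment="asMissing" missingValueReplacement="low" missingValueTreatment="asValue"/>
```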
