PMML 4.4.1 - Mining Schema

The MiningSchema is the gatekeeper for its model element. All data entering a model must pass through the MiningSchema. Each model element contains one MiningSchema, which lists the fields as used in that model. While the MiningSchema contains information that is specific to a certain model, the DataDictionary contains data definitions which do not vary per model. The main purpose of the MiningSchema is to list the fields that have to be provided in order to apply the model. MiningFields also define the usage of each field (active, supplementary, target, ...) as well as policies for treating missing, invalid or outlier values.

<xs:element name="MiningSchema">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element maxOccurs="unbounded" ref="MiningField"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="MiningField">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="name" type="FIELD-NAME" use="required"/>
    <xs:attribute name="usageType" type="FIELD-USAGE-TYPE" default="active"/>
    <xs:attribute name="optype" type="OPTYPE"/>
    <xs:attribute name="importance" type="PROB-NUMBER"/>
    <xs:attribute name="outliers" type="OUTLIER-TREATMENT-METHOD" default="asIs"/>
    <xs:attribute name="lowValue" type="NUMBER"/>
    <xs:attribute name="highValue" type="NUMBER"/>
    <xs:attribute name="missingValueReplacement" type="xs:string"/>
    <xs:attribute name="missingValueTreatment" type="MISSING-VALUE-TREATMENT-METHOD"/>
    <xs:attribute name="invalidValueTreatment" type="INVALID-VALUE-TREATMENT-METHOD" default="returnInvalid"/>
    <xs:attribute name="invalidValueReplacement" type="xs:string"/>
  </xs:complexType>
</xs:element>

<xs:simpleType name="FIELD-USAGE-TYPE">
  <xs:restriction base="xs:string">
    <xs:enumeration value="active"/>
    <xs:enumeration value="predicted"/>
    <xs:enumeration value="target"/>
    <xs:enumeration value="supplementary"/>
    <xs:enumeration value="group"/>
    <xs:enumeration value="order"/>
    <xs:enumeration value="frequencyWeight"/>
    <xs:enumeration value="analysisWeight"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="OUTLIER-TREATMENT-METHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="asIs"/>
    <xs:enumeration value="asMissingValues"/>
    <xs:enumeration value="asExtremeValues"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="MISSING-VALUE-TREATMENT-METHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="asIs"/>
    <xs:enumeration value="asMean"/>
    <xs:enumeration value="asMode"/>
    <xs:enumeration value="asMedian"/>
    <xs:enumeration value="asValue"/>
    <xs:enumeration value="returnInvalid"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="INVALID-VALUE-TREATMENT-METHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="returnInvalid"/>
    <xs:enumeration value="asIs"/>
    <xs:enumeration value="asMissing"/>
    <xs:enumeration value="asValue"/>
  </xs:restriction>
</xs:simpleType>

name: the symbolic name of the field. It must refer to a field in the scope of the parent of the MiningSchema's model element. For information on the scope of field names, see Scope of Fields. Even if the DataDictionary defines a displayName for a field, the attribute name is still used for matching the input parameters to the internal formulas. displayName allows human-readable names at the interface while artificial identifiers are used within the semantics of the model.

usageType:
  active: field used as input (independent field).
  target: field that was used as a training target for supervised models.
  predicted: field whose value is predicted by the model. As of PMML 4.2, this is deprecated and has been replaced by the usage type target.
  supplementary: field holding additional descriptive information. Supplementary fields are not required to apply a model; they are provided as additional information for explanatory purposes.
  When a field has gone through preprocessing transformations before a model is built, an additional supplementary field is typically used to describe the statistics for the original field values.
  group: field similar to the SQL GROUP BY. For example, this is used by AssociationModel and SequenceModel to group items into transactions by customerID or by transactionID.
  order: field that defines the order of items or transactions. It is currently used in SequenceModel and TimeSeriesModel. Like group, it is motivated by SQL syntax, namely by the ORDER BY statement.
  frequencyWeight and analysisWeight: These fields are not needed for scoring, but provide very important information on how the model was built. Frequency weight usually has positive integer values and is sometimes called "replication weight". Its values can be interpreted as the number of times each record appears in the data. Analysis weight can have fractional positive values; it could be used as the regression weight in regression models or as the case weight in trees, etc. It can be interpreted as the differing importance of the cases in the model. Counts in ModelStats and Partitions can be computed using frequency weight; mean and standard deviation values can be computed using both weights.

The definition of target fields in the MiningSchema is not required and, in most cases, it does not have an impact on the scoring results. For supervised models, it is useful since, along with the corresponding data field, it provides information about the training target values and the values expected to be computed by the model. In addition, it is necessary when:
optype: The attribute value overrides the corresponding value in the DataField. That is, a DataField can be used with different optypes in different models. For example, a 0/1 indicator could be used as a numeric input field in a regression model while the same field is used as a categorical field in a tree model.

importance: states the relative importance of the field. This indicator is typically used in predictive models in order to rank fields by their predictive contribution. A value of 1.0 suggests that the target field is directly correlated to this field. A value of 0.0 suggests that the field is completely irrelevant. Most likely such a field would have usageType="supplementary" rather than usageType="active".

outliers: This attribute determines how outliers are handled by the model. Outliers are valid numeric values which are either greater than the specified highValue or less than the specified lowValue.
  asIs: field values are treated at face value.
  asMissingValues: outlier values are treated as if they were missing.
  asExtremeValues: outlier values are changed to a specific high or low value defined in MiningField.

highValue and lowValue: bounds of the valid range for the field, used in conjunction with the outliers attribute. At least one of those values is required when outliers="asExtremeValues" or outliers="asMissingValues".
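As an illustration of the outlier attributes, the following fragment is a minimal sketch rather than a normative example. The field names (age, income, churn) and the bounds are hypothetical, and the fragment assumes matching DataField entries exist in the DataDictionary.

```xml
<MiningSchema>
  <!-- Hypothetical field: valid values below 18 or above 90 count as
       outliers and are changed to lowValue or highValue respectively. -->
  <MiningField name="age" outliers="asExtremeValues" lowValue="18" highValue="90"/>
  <!-- Hypothetical field: only an upper bound is given, which satisfies
       the "at least one bound" rule; values above it are treated as missing. -->
  <MiningField name="income" outliers="asMissingValues" highValue="500000"/>
  <!-- Hypothetical target field; outliers is omitted and defaults to asIs. -->
  <MiningField name="churn" usageType="target"/>
</MiningSchema>
```

Under this reading, an age of 120 would be scored as 90 and an age of 12 as 18, while an income of 750000 would be handled like a missing input.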
Note that outliers applies only to fields defined in the MiningSchema and hence cannot be used for DerivedFields.

The DataDictionary describes the Value element, which includes a property attribute for defining values as valid, invalid or missing. While valid values can be dealt with unchanged, the standard allows special treatment for missing and invalid values. The next two sections describe how missing and invalid values should be handled.

Missing Values

missingValueReplacement: If this attribute is specified, a missing input value is automatically replaced by the given value. That is, the model itself works as if the given value had been found in the original input. For example, the surrogate operator in TreeModel does not apply if the MiningField specifies a replacement value.

missingValueTreatment: In a PMML consumer this attribute is for information only, unless the value is returnInvalid. In that case, if a missing value is encountered in the given field, the model should return a value indicating an invalid result. Otherwise, the consumer only looks at missingValueReplacement: if a value is present, it replaces missing values. Except as described above, the missingValueTreatment attribute just indicates how the missingValueReplacement was derived, but places no behavioral requirement on the consumer.

missingValueTreatment is a useful parameter in an API for training, and the parameter can be copied into the PMML model. The scoring function, however, does not always know the actual mean, mode, median, etc., so the corresponding value must be present in the attribute missingValueReplacement. For example, if you want the scoring function to replace missing values by the mean value, and the mean value in the training data is 3.14, write

  <MiningField name="foo" missingValueReplacement="3.14" missingValueTreatment="asMean"/>

The replacement value MUST be specified using the missingValueReplacement attribute.
Specifications for missing values occur at a couple of places in PMML.
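The missing-value rules above can be sketched side by side in one schema. This is an illustrative fragment only: apart from foo, which repeats the example from the text, the field names and replacement values are hypothetical.

```xml
<MiningSchema>
  <!-- From the text: a missing foo is scored as 3.14; asMean only
       documents how that value was derived during training. -->
  <MiningField name="foo" missingValueReplacement="3.14" missingValueTreatment="asMean"/>
  <!-- Hypothetical categorical field: a missing value is scored as "red". -->
  <MiningField name="color" missingValueReplacement="red" missingValueTreatment="asMode"/>
  <!-- Hypothetical field: a missing value makes the model return a value
       indicating an invalid result. -->
  <MiningField name="bar" missingValueTreatment="returnInvalid"/>
  <!-- Hypothetical field with no replacement: the missing value is left
       for the model itself to handle. -->
  <MiningField name="baz"/>
</MiningSchema>
```

Only the bar entry changes consumer behavior through missingValueTreatment; for foo and color the replacement comes entirely from missingValueReplacement.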
Invalid Values

Invalid values are defined in PMML as those values not explicitly defined as valid or missing in DataField.

invalidValueTreatment: This attribute specifies how invalid input values are handled.
  returnInvalid: the default. When an invalid input is encountered, the model should return a value indicating an invalid result.
  asIs: the input is used without modification.
  asMissing: an invalid input value is treated as a missing value and follows the behavior specified by the missingValueReplacement attribute, if present (see above). If asMissing is specified but no missingValueReplacement is present, a missing value is passed on for eventual handling by successive transformations via DerivedFields or in the actual mining model.
  asValue: an invalid input value is replaced with the value specified by the attribute invalidValueReplacement, which must be present in this case; otherwise the PMML is invalid.
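The four invalidValueTreatment options can be sketched as follows. The field names a, b, c and d and the replacement values are hypothetical and serve only to show which attributes go together.

```xml
<MiningSchema>
  <!-- Default behavior written out explicitly: an invalid input yields a
       value indicating an invalid result. -->
  <MiningField name="a" invalidValueTreatment="returnInvalid"/>
  <!-- The invalid input is passed to the model unchanged. -->
  <MiningField name="b" invalidValueTreatment="asIs"/>
  <!-- The invalid input is treated as missing, so the hypothetical
       replacement value 0 is used instead. -->
  <MiningField name="c" invalidValueTreatment="asMissing" missingValueReplacement="0"/>
  <!-- asValue requires invalidValueReplacement; omitting it would make
       the PMML invalid. -->
  <MiningField name="d" invalidValueTreatment="asValue" invalidValueReplacement="unknown"/>
</MiningSchema>
```

If field c had no missingValueReplacement, the invalid input would instead be passed on as a missing value for later handling by DerivedFields or by the model.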