PMML 1.1 -- Data Dictionary
The data dictionary contains definitions for fields as used
in mining models.
It specifies the types and value ranges. These definitions are assumed to
be independent of specific data sets as used for training or building a
specific model.
A data dictionary can be shared by multiple models. Statistics and other
information related to the training set are stored within a model;
see also the DTDs for statistics and mining fields.
The value of 'numberOfFields' is the number of fields which are defined in the content of 'DataDictionary'. This number can be added for consistency checks.
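The consistency check described above could be sketched as follows. This is a minimal illustration, not part of the specification: the fragment, the field names and the helper `check_number_of_fields` are all hypothetical, and the attribute is treated as optional because the text says it "can be added".

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; field names are invented for illustration only.
PMML_FRAGMENT = """
<DataDictionary numberOfFields="2">
  <DataField name="age" optype="continuous"/>
  <DataField name="colour" optype="categorical"/>
</DataDictionary>
"""

def check_number_of_fields(dictionary):
    """Return True if numberOfFields is absent or equals the DataField count."""
    declared = dictionary.get("numberOfFields")
    if declared is None:
        return True  # attribute assumed optional, so there is nothing to check
    actual = len(dictionary.findall("DataField"))
    return int(declared) == actual

dictionary = ET.fromstring(PMML_FRAGMENT)
consistent = check_number_of_fields(dictionary)
```

A reader that finds a mismatch would presumably reject or flag the document; the specification text here does not say which.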
The name of a data field must be unique in the data dictionary.
The displayName is a string which may be used by applications to
refer to that field.
Within the XML document only the value of name is significant.
If displayName is not given, then name is the default.
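The two naming rules above (unique name, and displayName falling back to name) might be applied by a consuming application roughly like this. The helper names `duplicate_names` and `display_name` and the sample fields are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment; only "age" carries an explicit displayName.
PMML_FRAGMENT = """
<DataDictionary>
  <DataField name="age" displayName="Age in years" optype="continuous"/>
  <DataField name="colour" optype="categorical"/>
</DataDictionary>
"""

def duplicate_names(dictionary):
    """Return the sorted list of field names that occur more than once."""
    counts = Counter(f.get("name") for f in dictionary.findall("DataField"))
    return sorted(name for name, count in counts.items() if count > 1)

def display_name(field):
    """Return displayName if given, otherwise fall back to name."""
    return field.get("displayName", field.get("name"))

dictionary = ET.fromstring(PMML_FRAGMENT)
duplicates = duplicate_names(dictionary)
labels = {f.get("name"): display_name(f) for f in dictionary.findall("DataField")}
```

Keying `labels` by name reflects the rule that only name is significant inside the XML document; displayName is used purely for presentation.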
The fields are separated into different types depending on which operations are defined on the values; this is defined by the attribute optype. Categorical fields have the operator "=", ordinal fields have an additional "<", and continuous fields also have arithmetic operators. Cyclic fields have a distance measure which takes into account that the maximal value and minimal value are close together.

The content of a DataField defines the set of values which are considered to be valid. Mining models distinguish three properties of values: valid, invalid, and missing.
The following element definitions for Value and Interval are used to define the types and value ranges for fields in the data dictionary. The range of valid values can be defined either by specifying the set itself or by specifying its complement. Note that PMML does not define how an interpreter of a model actually represents invalid or missing values; this depends on the application environment.
A continuous field may have at most two intervals defining the range of valid values.
If intervals are present, any data outside the intervals is considered invalid.
If no intervals are present, the entire real axis (except for discrete missing values) consists of valid values.
Intervals are not allowed for non-continuous fields.
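The structural rules above (at most two intervals on a continuous field, none on any other field) lend themselves to a simple check. The sketch below is one possible reading; `interval_errors` and the sample fields are hypothetical.

```python
import xml.etree.ElementTree as ET

def interval_errors(field):
    """Return a list of rule violations for one DataField element."""
    errors = []
    count = len(field.findall("Interval"))
    if field.get("optype") != "continuous":
        if count > 0:
            errors.append("Interval not allowed on non-continuous field")
    elif count > 2:
        errors.append("continuous field has more than two intervals")
    return errors

# Hypothetical fields: one well-formed, one violating the last rule.
ok_field = ET.fromstring(
    '<DataField name="age" optype="continuous">'
    '<Interval closure="closedOpen" leftMargin="0" rightMargin="18"/>'
    '<Interval closure="closedClosed" leftMargin="65" rightMargin="120"/>'
    '</DataField>'
)
bad_field = ET.fromstring(
    '<DataField name="colour" optype="categorical">'
    '<Interval closure="openOpen" leftMargin="0" rightMargin="1"/>'
    '</DataField>'
)

ok_errors = interval_errors(ok_field)
bad_errors = interval_errors(bad_field)
```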
If a categorical or ordinal field contains at least one Value element where
the value of property is 'valid' or unspecified, then the set of Value elements
completely defines the set of valid values. Otherwise any value
is valid by default.
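The rule above for categorical and ordinal fields can be illustrated with a short sketch. Several things here are assumptions rather than statements from this section: the attribute name `value` on the Value element, the property strings 'invalid' and 'missing', and the choice to treat explicitly flagged values as not valid when no valid values are listed (one reading of defining the range by its complement).

```python
import xml.etree.ElementTree as ET

# Hypothetical field; the "value" attribute name is an assumption.
FIELD = """
<DataField name="colour" optype="categorical">
  <Value value="red"/>
  <Value value="green" property="valid"/>
  <Value value="unknown" property="missing"/>
</DataField>
"""

def valid_value_set(field):
    """Return the explicit set of valid values, or None if none are listed."""
    valid = {
        v.get("value")
        for v in field.findall("Value")
        if v.get("property", "valid") == "valid"  # unspecified counts as valid
    }
    return valid or None

def is_valid(field, value):
    """Apply the rule: an explicit valid set wins, otherwise any value is valid."""
    valid = valid_value_set(field)
    if valid is not None:
        return value in valid
    # No valid values listed: assume everything is valid except the values
    # that are explicitly flagged (as invalid or missing) in the field.
    flagged = {v.get("value") for v in field.findall("Value")}
    return value not in flagged

field = ET.fromstring(FIELD)
red_ok = is_valid(field, "red")
blue_ok = is_valid(field, "blue")
```

In this sample, "blue" is rejected because the field lists valid values and "blue" is not among them.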
The element Interval defines a range of numeric values.

<!ELEMENT Interval EMPTY>
<!ATTLIST Interval
    closure     (openClosed | openOpen | closedOpen | closedClosed) #REQUIRED
    leftMargin  %NUMBER; #IMPLIED
    rightMargin %NUMBER; #IMPLIED
>

The attributes leftMargin and rightMargin are optional but at least one value must be defined. If a margin is missing, then +/- infinity is assumed.
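To make the closure and margin semantics concrete, here is a minimal membership test for a single Interval element. The function name `interval_contains` is hypothetical, and the sketch assumes that a missing leftMargin means minus infinity and a missing rightMargin means plus infinity.

```python
import xml.etree.ElementTree as ET

def interval_contains(interval, x):
    """Return True if the number x lies inside the given Interval element."""
    # A missing margin is treated as unbounded on that side.
    left = float(interval.get("leftMargin", "-inf"))
    right = float(interval.get("rightMargin", "inf"))
    closure = interval.get("closure")
    # The first word of closure describes the left end, the second the right end.
    left_ok = x >= left if closure.startswith("closed") else x > left
    right_ok = x <= right if closure.endswith("Closed") else x < right
    return left_ok and right_ok

# Hypothetical intervals: [0, 10) and (-infinity, 100].
bounded = ET.fromstring('<Interval closure="closedOpen" leftMargin="0" rightMargin="10"/>')
half_open = ET.fromstring('<Interval closure="openClosed" rightMargin="100"/>')

inside_left_edge = interval_contains(bounded, 0)
inside_right_edge = interval_contains(bounded, 10)
```

With closedOpen, the left margin 0 belongs to the interval while the right margin 10 does not.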