PMML 1.1 -- General Structure of a PMML Document

PMML uses XML to represent mining models. The structure of the models is described by a DTD, called the PMML DTD. A PMML document can contain one or more mining models. A PMML document is an XML document with a root element of type PMML. The general structure of a PMML document is:
A PMML document is not required to have a DOCTYPE declaration. If there is one, the document must not depend on external parameters; that is, we assume the default attribute standalone="yes" in the <?xml?> declaration. Although a PMML document must be valid with respect to the PMML DTD, it must not require a validating parser, which would load external entities.

In addition to being a valid XML document, a valid PMML document must obey a number of further rules, which are described at various places in the PMML specification. See also the conformance rules for valid PMML documents, producers, and consumers.
The root element of a PMML document must have type PMML.
A PMML document can contain more than one model. If the application system provides a means of selecting models by name and the PMML consumer specifies a model name, then that model is used; otherwise the first model is used. A PMML 1.1 compliant system is not required to provide model selection by name.

For PMML version 1.1 the attribute version must have the value 1.1.
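The root, version, and model-selection rules above can be sketched in Python as follows. This is only an illustration of consumer behaviour, not part of the specification: the function names, the attribute name modelName, the set of non-model child elements, and the sample document are all assumptions. The text does not say what happens when a requested name is not found; raising an error is a choice made for this sketch.

```python
import xml.etree.ElementTree as ET

# Assumption: children of the root with these tags are not models.
NON_MODEL_TAGS = {"Header", "DataDictionary", "Extension"}

# Hypothetical, heavily abbreviated document used only for illustration.
SAMPLE = """<PMML version="1.1">
  <Header/>
  <DataDictionary/>
  <TreeModel modelName="churn"/>
  <RegressionModel modelName="sales"/>
</PMML>"""


def check_root(root):
    """Check the root element type and the version attribute."""
    if root.tag != "PMML":
        raise ValueError("root element must have type PMML")
    if root.get("version") != "1.1":
        raise ValueError("attribute version must have the value 1.1")


def select_model(root, name=None):
    """Return the model with the given name, or the first model if no name is given."""
    models = [child for child in root if child.tag not in NON_MODEL_TAGS]
    if not models:
        raise ValueError("document contains no model")
    if name is None:
        return models[0]
    for model in models:
        # Assumption: the model name is stored in an attribute called modelName.
        if model.get("modelName") == name:
            return model
    # Behaviour for an unknown name is not specified; this sketch raises.
    raise LookupError("no model named %r" % name)


root = ET.fromstring(SAMPLE)
check_root(root)
print(select_model(root).tag)
print(select_model(root, "sales").tag)
```

A system that does not support selection by name would simply always call select_model(root) without a name.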
For all PMML models the structure of the top-level model element is similar to:
The non-empty list of mining fields defines a so-called mining schema. The univariate statistics contain global statistics on (a subset of) the mining fields. Other model-specific elements follow ModelStats in the content of XModel. For a list of the models that have been defined in PMML 1.1, see the entity A-PMML-MODEL above.
The naming conventions for PMML are
Extension Mechanism

The PMML DTD contains a mechanism for extending the content of a model. Extension elements are included in the content definition of many element types. These extension elements have a content model of ANY, allowing considerable freedom in the nature of the extensions. One use of the extension mechanism could be to associate display information for a particular tool.
Extension is intended to replace occurrences of the element called 'info' in PMML 1.0. For compatibility with info, the attributes 'name' and 'value' are defined. However, the use of these attributes is discouraged because future versions of PMML may remove them. The extension data should be part of the content within the elements of type Extension.
With XML 1.0 one can add attribute declarations to given elements without changing an external DTD. An XML parser may give a warning, but a document which uses the additional attributes can be valid. That is, if the standard PMML DTD contains an element TreeNode, then a document may declare additional attributes on TreeNode. PMML 1.1 adopts this rule, but attribute names must have the prefix 'x-' in order to make an extension obvious. The same convention is used for vendor-specific element types which can be contained in an Extension element; the tag name must start with 'X-'. This convention also helps to avoid conflicts with possible future extensions to standard PMML.

If a document uses local namespaces, then the name of the namespace must not start with 'PMML' or 'DMG' or any variant of these names with lowercase characters. They are reserved for future use in PMML. An extended PMML document could look like:
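As a hedged illustration of these conventions, the Python sketch below embeds a hypothetical extended fragment and checks the naming rules against it. The attribute x-color, the element X-DisplayHint, and all function names are invented for this example and are not part of the PMML DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical extended fragment; x-color and X-DisplayHint are invented names.
EXTENDED = """<TreeNode x-color="red">
  <Extension>
    <X-DisplayHint shape="box"/>
  </Extension>
</TreeNode>"""

# Reserved namespace name prefixes, compared case-insensitively.
RESERVED_PREFIXES = ("pmml", "dmg")


def is_valid_extension_attribute(name):
    """Extension attribute names must start with 'x-'."""
    return name.startswith("x-")


def is_valid_vendor_element(tag):
    """Vendor-specific element names must start with 'X-'."""
    return tag.startswith("X-")


def is_allowed_namespace_name(name):
    """Namespace names must not start with PMML or DMG in any letter case."""
    return not name.lower().startswith(RESERVED_PREFIXES)


def vendor_elements(root):
    """Return the tags of elements nested directly inside any Extension element."""
    return [child.tag for ext in root.iter("Extension") for child in ext]


root = ET.fromstring(EXTENDED)
print(all(is_valid_vendor_element(tag) for tag in vendor_elements(root)))
print(is_valid_extension_attribute("x-color"))
print(is_allowed_namespace_name("PmmlTools"))
```

Which attributes count as extensions can only be decided against the DTD itself, so the sketch checks names that the caller has already identified as extensions.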
Basic data types and entities

The definition of NUMBER is commonly used for distinguishing numeric values from other data. Numbers may have a leading sign, fractions, and an exponent. In addition to NUMBER there are a couple of more specific definitions:
Content must be an integer, with no fraction or exponent.

Content can be any number; this covers the C/C++ types 'float', 'long', and 'double'. Scientific notation, e.g. 1.23e4, is allowed.

A REAL-NUMBER between 0.0 and 1.0, usually describing a probability.

A REAL-NUMBER between 0.0 and 100.0.
Note that these entities do not cause the XML parser to check the data types. However, they still define requirements for a valid PMML document.

Many elements contain references to input fields. PMML does not use IDREF to represent field names because field names are not necessarily valid XML identifiers. However, given the definition

then references to input fields (in the data dictionary) will be obvious from the DTD syntax.
PMML 1.1 uses the character '.' as the decimal point in the representation of REAL-NUMBER values. In the future PMML might be extended to use locale values as specified by XML; for example, with xml:lang="fr-FR" the document would use a comma as the decimal separator.
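Because the XML parser does not check these types, a consumer that wants to enforce them has to inspect the text itself. The following Python sketch shows one way to do so. The function names and the exact regular expressions are assumptions; in particular, whether forms such as "1." or ".5" are acceptable is not stated in the text, and this sketch accepts them.

```python
import re

# Optional sign followed by digits only: no fraction, no exponent.
_INTEGER = re.compile(r"[+-]?[0-9]+")

# Optional sign, digits with an optional '.' fraction, optional exponent.
# Only '.' is accepted as the decimal point.
_REAL = re.compile(r"[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?")


def is_integer(text):
    """True if the text is an integer without fraction or exponent."""
    return _INTEGER.fullmatch(text.strip()) is not None


def is_real(text):
    """True if the text is a number, optionally in scientific notation."""
    return _REAL.fullmatch(text.strip()) is not None


def is_probability(text):
    """True if the text is a real number between 0.0 and 1.0."""
    return is_real(text) and 0.0 <= float(text) <= 1.0


def is_percentage(text):
    """True if the text is a real number between 0.0 and 100.0."""
    return is_real(text) and 0.0 <= float(text) <= 100.0


print(is_real("1.23e4"), is_real("1,23"))
print(is_integer("42"), is_integer("4.0"))
print(is_probability("0.25"), is_percentage("250"))
```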
Compact arrays of values

Instances of mining models often contain sets with a large number of values. The type Array is defined as a container structure which implements arrays of numbers and strings in a fairly compact way.
The content of 'Array' is a blank-separated sequence of values; multiple blanks are as good as one blank. The attribute 'n' determines the number of elements in the sequence. If n is given, it must match the number of values in the content; otherwise the PMML document is invalid. The attribute 'type' indicates the data type of the values in the array. This attribute is optional because in many cases the data type is implied by the context where the array is used.

String values may be enclosed within " characters, which are not considered to be part of the value. If a string value contains the " character, then it must be escaped by a backslash character \; this is the same escaping mechanism as used in C/C++.
Examples:
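A minimal Python sketch of how a consumer might tokenize Array content under the rules above is given here. The function name parse_array and the sample inputs are assumptions made for illustration; they are not taken from the PMML DTD. Values are returned as strings, and conversion according to the type attribute is left out.

```python
def parse_array(content, n=None):
    """Split Array content into values; check the count if n is given."""
    values = []
    i = 0
    length = len(content)
    while i < length:
        if content[i].isspace():
            # Multiple blanks are as good as one blank.
            i += 1
            continue
        token = []
        if content[i] == '"':
            # Quoted string: the enclosing quotes are not part of the value.
            i += 1
            closed = False
            while i < length:
                if content[i] == "\\" and i + 1 < length and content[i + 1] == '"':
                    token.append('"')  # escaped quote
                    i += 2
                elif content[i] == '"':
                    closed = True
                    i += 1
                    break
                else:
                    token.append(content[i])
                    i += 1
            if not closed:
                raise ValueError("unterminated quoted string")
        else:
            # Unquoted value: runs until the next blank.
            while i < length and not content[i].isspace():
                if content[i] == "\\" and i + 1 < length and content[i + 1] == '"':
                    token.append('"')
                    i += 2
                else:
                    token.append(content[i])
                    i += 1
        values.append("".join(token))
    if n is not None and n != len(values):
        raise ValueError("n=%d does not match %d values" % (n, len(values)))
    return values


print(parse_array("1 22   3", n=3))
print(parse_array('ab  "a b"   "with \\"quotes\\" "', n=3))
```

A mismatch between n and the number of values raises an error here, which corresponds to treating the document as invalid.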
If there is a value for the dimension attribute n, then the number of entries in the content must match this value; otherwise the PMML document is not valid.

Similar to the entities for different types of numbers, we define entities for arrays which should have a specific content type. Again, these entities just map to a single XML markup.
%NUM-ARRAY; is an array of numbers.
The following entities define arrays which contain integers, reals, or strings.