PMML 1.1 -- DTD for Statistics

This DTD subset for statistics provides a basic framework for representing univariate statistics. It is used by the DTDs for data mining models as ModelStats. The general guideline for PMML models is: if a model needs statistics, their representation should look like the elements defined below. It is not required to use exactly the same elements, but for ease of presentation and implementation it is recommended to use the same basic structure.

The statistics for a model consist of the collection of the statistics for the individual fields.

Univariate Statistics

A UnivariateStats element contains statistical information
about a single mining field.
Discrete and continuous statistics can both be present
for a numeric field. This may be important if a numeric field
has too many discrete values. The statistics can include
the most frequent values as well as a complete histogram distribution.
Statistics for ordinal fields are contained in DiscrStats. It may be necessary to extend quantiles to ordinal fields.

The element Counts carries counters for the frequency of values with respect to their state of being missing, invalid, or valid.

totalFreq
counts all records; it is the same for the statistics of all mining fields.
missingFreq
counts the number of records where the value is missing.
invalidFreq
counts the number of records with invalid values.
The total frequency includes the missing values and invalid values.

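The three counters can be illustrated with a minimal Python sketch. The class `Counts`, the function `count_states`, and the two predicates are hypothetical names chosen for this illustration; they are not part of the DTD, and the rule for what counts as missing or valid is supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Plain container mirroring the three counters described above."""
    totalFreq: int
    missingFreq: int
    invalidFreq: int


def count_states(records, is_missing, is_valid):
    """Classify each record's value as missing, invalid, or valid and count."""
    total = missing = invalid = 0
    for value in records:
        total += 1                      # totalFreq counts every record
        if is_missing(value):
            missing += 1
        elif not is_valid(value):       # present, but not a valid value
            invalid += 1
    return Counts(totalFreq=total, missingFreq=missing, invalidFreq=invalid)


# Hypothetical field: None marks a missing value, negative numbers are invalid.
records = [3, None, 7, -1, 5, None]
counts = count_states(records, lambda v: v is None, lambda v: v >= 0)

# Because totalFreq includes missing and invalid values, the number of
# valid values follows by subtraction.
valid_freq = counts.totalFreq - counts.missingFreq - counts.invalidFreq
```

In this sketch the number of valid values is not stored separately; it is derived as totalFreq minus missingFreq minus invalidFreq.
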
The values for
mean, minimum, maximum, and standardDeviation
are defined as usual;
median
is calculated as the 50% quantile;
interQuartileRange
is calculated as (75% quantile - 25% quantile).

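As a rough illustration, these values could be computed with Python's standard `statistics` module. The text does not fix a quantile interpolation rule or say whether the standard deviation is the population or the sample form; the sketch assumes the "inclusive" interpolation method and the population form. The function name `numeric_summary` is hypothetical.

```python
import statistics


def numeric_summary(values):
    """Compute the numeric summary values for a list of valid numbers.

    Needs at least two values, since quartiles are interpolated.
    """
    data = sorted(values)
    # Quartile cut points; the "inclusive" method is an assumed convention.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(data),
        "minimum": data[0],
        "maximum": data[-1],
        # Population standard deviation (an assumption, see the lead-in).
        "standardDeviation": statistics.pstdev(data),
        "median": q2,                    # the 50% quantile
        "interQuartileRange": q3 - q1,   # 75% quantile - 25% quantile
    }


summary = numeric_summary([2, 4, 4, 4, 5, 5, 7, 9])
```

A different interpolation rule would give slightly different quartiles for small samples, so producers and consumers of the statistics should agree on one.
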
quantileLimit
is a percentage between 0 and 100.
quantileValue
is the corresponding value in the domain of field values.

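The pairing of quantileLimit and quantileValue can be sketched as follows. The helper names `quantile_value` and `quantile_pairs` are hypothetical, and linear interpolation between neighbouring sorted values is an assumption; the text above does not prescribe how the value for a given limit is obtained.

```python
def quantile_value(sorted_values, quantile_limit):
    """Return the value at quantile_limit percent of an ascending list.

    Linear interpolation between neighbours is an assumed convention.
    """
    if not sorted_values:
        raise ValueError("at least one value is required")
    if not 0 <= quantile_limit <= 100:
        raise ValueError("quantileLimit must lie between 0 and 100")
    position = (len(sorted_values) - 1) * quantile_limit / 100.0
    lower = int(position)                        # floor, position is >= 0
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    low_value = sorted_values[lower]
    return low_value + (sorted_values[upper] - low_value) * fraction


def quantile_pairs(values, limits):
    """Build (quantileLimit, quantileValue) pairs for the requested limits."""
    data = sorted(values)
    return [(limit, quantile_value(data, limit)) for limit in limits]


pairs = quantile_pairs([50, 10, 40, 20, 30], [0, 25, 50, 75, 100])
```

With this convention the 0% and 100% limits return the minimum and maximum of the data, and the 50% limit returns the median.
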
modalValue
is the most frequent discrete value.
The INT-ARRAY contains a compact representation
of all frequency numbers.
If there is an array of string values, the frequency numbers
in the INT-ARRAY correspond to the string values one by one, in the same order.
Otherwise the frequency numbers correspond to the list of (valid) values
as given in the DataDictionary.

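The positional correspondence between values and frequency numbers can be sketched as below. The function names are hypothetical, and `ordered_values` stands for either the explicit array of string values or the list of valid values from the DataDictionary. Returning the first value on a tie is an assumption of this sketch.

```python
from collections import Counter


def frequency_array(ordered_values, records):
    """Return frequencies aligned position by position with ordered_values."""
    counts = Counter(records)
    # Values that never occur get frequency 0; records whose value is not
    # listed in ordered_values are ignored here.
    return [counts[value] for value in ordered_values]


def modal_value(ordered_values, frequencies):
    """Return the most frequent value; ties go to the earliest position."""
    if len(ordered_values) != len(frequencies):
        raise ValueError("values and frequencies must have the same length")
    best = max(range(len(frequencies)), key=frequencies.__getitem__)
    return ordered_values[best]


ordered_values = ["red", "green", "blue"]
records = ["green", "red", "green", "blue", "green", "red"]
frequencies = frequency_array(ordered_values, records)
mode = modal_value(ordered_values, frequencies)
```

Because the correspondence is purely positional, a consumer must read the values in exactly the order in which the frequencies were written.
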
The three ARRAYs contain the frequencies, the sums of values, and
the sums of squared values for each interval.
Note: Interval is defined in the DTD for DataDictionary.

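Storing the frequency, the sum, and the sum of squares per interval is enough to recover the mean and the variance later without revisiting the records. The sketch below illustrates this; the function names are hypothetical, intervals are modelled as (low, high) pairs, and treating them as half-open (low <= value < high) is an assumption, since the closure of an Interval is defined in the DataDictionary DTD rather than here.

```python
def interval_arrays(values, intervals):
    """Return (frequencies, sums, sums of squares), one entry per interval."""
    freqs = [0] * len(intervals)
    sums = [0.0] * len(intervals)
    squares = [0.0] * len(intervals)
    for value in values:
        for i, (low, high) in enumerate(intervals):
            if low <= value < high:        # half-open interval (assumed)
                freqs[i] += 1
                sums[i] += value
                squares[i] += value * value
                break                      # each value falls in one interval
    return freqs, sums, squares


def mean_and_variance(freqs, sums, squares):
    """Recover the overall mean and population variance from the arrays."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("no values were counted")
    mean = sum(sums) / n
    variance = sum(squares) / n - mean * mean
    return mean, variance


freqs, sums, squares = interval_arrays([1, 2, 3, 4], [(0, 2.5), (2.5, 5)])
mean, variance = mean_and_variance(freqs, sums, squares)
```

The same two formulas apply to a single interval, which gives the mean and variance within that interval.
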
A partition contains statistics for a subset of records; for example, it can describe the population in a cluster.

name
identifies the partition.
size
is the number of records. All AGGREGATEs in
PartitionFieldStats
must have total-frequency=size.

field
refers to (the name of) a MiningField for background statistics.
The sequence of NUM-ARRAYs is the same as for ContStats. For categorical fields
there is only one array, containing the frequencies; for numeric fields, the
second and third arrays contain the sums of values and the sums of squared
values, respectively.
The number of values in each array must match the number of categories or
intervals in the
UnivariateStats
of the field.

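The layout rules for the arrays can be checked mechanically. The following sketch is only an illustration: the function `check_partition_field_stats` and its parameters are hypothetical, arrays are represented as plain Python lists, and `n_bins` stands for the number of categories or intervals taken from the UnivariateStats of the field.

```python
def check_partition_field_stats(arrays, is_numeric, n_bins):
    """Return a list of problems; an empty list means the layout is consistent.

    arrays     -- the NUM-ARRAYs in document order, each as a list of numbers
    is_numeric -- True for a numeric field, False for a categorical field
    n_bins     -- number of categories or intervals for the field
    """
    problems = []
    # Categorical: frequencies only. Numeric: frequencies, sums, squared sums.
    expected_arrays = 3 if is_numeric else 1
    if len(arrays) != expected_arrays:
        problems.append(
            f"expected {expected_arrays} array(s), found {len(arrays)}"
        )
    for index, array in enumerate(arrays):
        if len(array) != n_bins:
            problems.append(
                f"array {index} has {len(array)} values, expected {n_bins}"
            )
    return problems


categorical_ok = check_partition_field_stats([[5, 3, 2]], False, 3)
numeric_ok = check_partition_field_stats(
    [[5, 3], [10.0, 9.0], [30.0, 29.0]], True, 2
)
too_short = check_partition_field_stats([[5, 3]], False, 3)
```

The check covers only the number and length of the arrays; whether the frequencies add up to the partition size is a separate condition that this sketch does not test.
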