PMML 1.1 -- DTD for Clustering Models
A cluster model basically consists of a set of clusters.
For each cluster a center vector can be given.
In center-based models a cluster is defined by a vector of
center coordinates. Some distance measure is used to determine
the nearest center, that is the nearest cluster
for a given input record.
For distribution-based models (e.g. in demographic clustering)
the clusters are defined by their statistics.
Some similarity measure is used to determine the best matching
cluster for a given record.
The center vectors then only approximate the clusters.
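As an illustration of how a center-based model assigns a record, the following minimal sketch picks the nearest center. The function names (`euclidean`, `nearest_cluster`) and the sample centers are hypothetical, and plain Euclidean distance is only one possible choice of distance measure.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_cluster(record, centers, distance=euclidean):
    """Return the index of the center that is closest to the record."""
    return min(range(len(centers)), key=lambda i: distance(record, centers[i]))

# Hypothetical centers for three clusters in a two-dimensional, normalized space.
centers = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(nearest_cluster([0.9, 0.8], centers))  # prints 1
```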
The attribute modelClass specifies whether the clusters are defined by center vectors or whether they are defined by the statistics. The latter is used by demographic clustering.
The fields used in the center vectors are normalized; in particular, this makes it possible to map categorical input fields to numeric values in center vectors. For %NORM-INPUT; see the DTD on normalization.
MiningField information (in MiningSchema) must be present for each active variable. For numeric variables it specifies the treatment of outliers. Note that there may be supplementary mining fields. The statistics for these fields are part of the model, but they are not required to apply the model.
For each active MiningField, UnivariateStats (in ModelStats) holds information about the overall (background) population. This includes (required) DiscrStats or ContStats, which include possible field values and interval boundaries. Optionally, statistical information is included for the background data.
Each Partition corresponds to a cluster and holds a center vector and/or field statistics to describe it.
The %NUM-ARRAY; contains the center coordinates for the cluster. The correspondence between input fields and their coordinates is defined by CenterFields, see ClusteringModel. Note that categorical fields can have more than one coordinate, depending on the normalization method (see %NORM-INPUT;).
If some normalization is defined for input fields, then the center coordinates are defined using the normalized values. For numeric fields the values in the original domain could also be used. For categorical fields, however, the center vector is not required to contain only the indicator values 0.0 or 1.0. It can also contain any value between 0.0 and 1.0. Such values indicate a distribution of categorical values, defining a kind of virtual center point.
If the cluster model contains statistics with mean values, then the center coordinates are not necessarily identical to the corresponding mean values. There may be differences depending on the kind of normalization of input values.
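To make the idea of a virtual center point concrete, here is a small sketch. It assumes that a categorical field is normalized to one indicator coordinate per category (the actual mapping is given by %NORM-INPUT; and may differ) and that the center is obtained by averaging the indicator vectors of the cluster members. The helper names `one_hot` and `virtual_center` are hypothetical.

```python
def one_hot(value, categories):
    """Indicator coding: 1.0 for the matching category, 0.0 for all others."""
    return [1.0 if value == c else 0.0 for c in categories]

def virtual_center(values, categories):
    """Average the indicator vectors of all members, coordinate by coordinate."""
    encoded = [one_hot(v, categories) for v in values]
    n = len(encoded)
    return [sum(column) / n for column in zip(*encoded)]

categories = ["red", "green", "blue"]      # assumed category order
members = ["red", "red", "green", "red"]   # field values of the records in one cluster
print(virtual_center(members, categories))  # prints [0.75, 0.25, 0.0]
```

The resulting coordinates lie between 0.0 and 1.0 and describe the distribution of the categories within the cluster rather than any single record.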
A covariance matrix stores coordinate-by-coordinate variances (diagonal cells) and covariances (non-diagonal cells). The covariance matrix must be symmetric, so only half of the non-diagonal covariance cells need to be stored. Missing covariance cells are reconstructed by symmetry. The sequence of rows/columns corresponds to the sequence in MiningSchema.
field references (the name of) a MiningField.
fieldWeight is the importance factor for the field.
similarityScale is the distance at which the similarity becomes 0.5.
compareFunction is a function that takes two field values and a similarityScale and defines their similarity/distance. It can override the general specification of compareFunction in ComparisonMeasure. For the computation of distances and similarities see below.
Comparisons is a matrix which contains the similarity values or distance values, depending on the attribute modelClass in ClusteringModel. The order of the rows and columns corresponds to the order of discrete values or intervals in that field.
Matrix
There are several kinds of matrices which are used within cluster models, e.g., to describe covariances and similarities. In order to save space, a matrix can be stored as sparse matrix, diagonal matrix, etc.
A matrix may be represented as a sequence of arrays. If the matrix is diagonal, then the content is just one array of numbers representing the diagonal values. Otherwise, each array contains elements of one row in the matrix. If the kind of the matrix is any, then all values are given. If the matrix is symmetric, then the first array contains the matrix element M(0,0), the second array contains M(1,0), M(1,1), and so on (that's the lower left triangle). Other elements are defined by symmetry.
A sparse matrix may also be represented in a compact form as an enumeration of MatCell. Each MatCell contains the numeric value of a single cell. In this case, diagonal has no significance for the matrix representation.
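The array-based representations described above can be expanded into a full square matrix roughly as follows. This is only a sketch: the function name `expand_matrix` is hypothetical, the kind is passed as a plain string, and filling the off-diagonal cells of a diagonal matrix with 0.0 is an assumption made for the illustration.

```python
def expand_matrix(kind, arrays):
    """Expand the array form of a matrix into a full list of rows."""
    if kind == "diagonal":
        diag = arrays[0]  # a single array holding the diagonal values
        n = len(diag)
        # Assumption: cells off the diagonal are taken to be 0.0.
        return [[diag[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    if kind == "symmetric":
        n = len(arrays)  # row i holds M(i,0) .. M(i,i), the lower left triangle
        return [[arrays[i][j] if j <= i else arrays[j][i] for j in range(n)]
                for i in range(n)]
    if kind == "any":
        return [list(row) for row in arrays]  # all values are given
    raise ValueError("unsupported matrix kind: " + kind)

lower = [[1.0], [2.0, 3.0], [4.0, 5.0, 6.0]]
print(expand_matrix("symmetric", lower))
# prints [[1.0, 2.0, 4.0], [2.0, 3.0, 5.0], [4.0, 5.0, 6.0]]
```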
Evaluating a matrix element M(i,j) proceeds as follows:
Distance or Similarity Measure
When two records are compared then either the distance or the similarity is of interest. In both cases the measures can be computed by a combination of an 'inner' function and an 'outer' function. The inner function compares two single field values and the outer function computes an aggregation over all fields.
Each field has a comparison function. It is either defined as a default in ClusteringModel or it can be defined per ClusteringField.
Given two field values x and y, the inner function can be one of:
absDiff: absolute difference c(x,y) = |x-y|
gaussSim: gaussian similarity c(x,y) = exp(-ln(2)*z*z/(s*s)) where z=x-y, and s is the value of attribute similarityScale (required in this case) in the ClusteringField
delta: c(x,y) = 0 if x=y, 1 else
equal: c(x,y) = 1 if x=y, 0 else
table: c(x,y) = lookup in similarity matrix
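The inner functions listed above translate almost directly into code. The following sketch uses hypothetical Python function names; for the table lookup it assumes that the matrix is available as a list of rows and that a dictionary `index` maps each field value to its row/column position.

```python
import math

def abs_difference(x, y):
    """Absolute difference: c(x,y) = |x-y|."""
    return abs(x - y)

def gauss_sim(x, y, s):
    """Gaussian similarity; s is the similarityScale of the field."""
    z = x - y
    return math.exp(-math.log(2) * z * z / (s * s))

def delta(x, y):
    """0 if the values are equal, 1 otherwise (a distance)."""
    return 0.0 if x == y else 1.0

def equal(x, y):
    """1 if the values are equal, 0 otherwise (a similarity)."""
    return 1.0 if x == y else 0.0

def table(x, y, matrix, index):
    """Look up the comparison value of x and y in a matrix."""
    return matrix[index[x]][index[y]]

# At a distance of exactly similarityScale the gaussian similarity is one half.
print(math.isclose(gauss_sim(3.0, 1.0, 2.0), 0.5))  # prints True
```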
Per ClusteringModel there is one aggregation function. Depending on the attribute kind in ComparisonMeasure, the aggregated value is optimal if it is 0 (for a distance measure), or greater values indicate a better fit (for a similarity measure).
euclidean: kind=distance
For binary or categorical data, let two individuals X and Y compare their values for each attribute, and
simpleMatching: kind=similarity min=0 max=1
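The formulas for the aggregation functions are not shown in this text, so the following sketch falls back on the textbook definitions and should be read as an assumption: the Euclidean aggregate is the square root of the sum of squared inner values, and simple matching is the fraction of attributes on which two records agree. Field weights are left out, and the function names are hypothetical.

```python
import math

def euclidean_aggregate(inner_values):
    """Distance aggregate: square root of the sum of squared inner values."""
    return math.sqrt(sum(c * c for c in inner_values))

def simple_matching(x, y):
    """Similarity aggregate: share of attributes with equal values, in [0, 1]."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

print(euclidean_aggregate([3.0, 4.0]))               # prints 5.0
print(simple_matching([1, 0, 1, 1], [1, 0, 0, 1]))   # prints 0.75
```

Under these definitions a Euclidean aggregate of 0 means identical records, while a simple matching value of 1 means agreement on every attribute, which matches the distance and similarity conventions described above.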
Conformance
Center-based clustering:
Example for a center-based clustering model