subclasses. Writing a new filter essentially involves overriding some of
these. Filter also documents the purpose of these methods, and how they
need to be changed for particular types of filter algorithm.
3 1 6 CHAPTER EIGHT | MACHINE LEARNING ALGORITHMS IN JAVA
Table 8.9 Public methods in the Filter class.

Method                            Description
boolean inputFormat(Instances)    Set input format of data, returning true if output format can be collected immediately
Instances outputFormat()          Return output format of data
boolean input(Instance)           Input instance into filter, returning true if instance can be output immediately
boolean batchFinished()           Inform filter that all training data has been input, returning true if instances are pending output
Instance output()                 Output instance from the filter
Instance outputPeek()             Output instance without removing it from output queue
int numPendingOutput()            Return number of instances waiting for output
boolean isOutputFormatDefined()   Return true if output format can be collected

The first step in using a filter is to inform it of the input data format,
accomplished by the method inputFormat(). This takes an object of class
Instances and uses its attribute information to interpret future input
instances. The filter's output data format can be obtained by calling
outputFormat(); it is also stored as an object of class Instances. For filters that
process each instance as soon as it is input, the output format is determined as soon as the
input format has been specified. However, for those that must see the whole
dataset before processing any individual instance, the situation depends on
the particular filter algorithm. For example, DiscretizeFilter needs to see
all training instances before determining the output format, because the
number of discretization intervals is determined by the data. Consequently,
the method inputFormat() returns true if the output format can be
determined as soon as the input format has been specified, and false
otherwise. Another way of checking whether the output format exists is to
call isOutputFormatDefined().
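
The handshake described above can be illustrated with a small, self-contained sketch. It does not use the Weka classes: FormatSketchFilter, PassThroughSketch, and DeferredSketch are hypothetical stand-ins, and a list of attribute names stands in for an Instances object. Only the method names inputFormat(), outputFormat(), and isOutputFormatDefined() are taken from Table 8.9.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a filter; this is NOT the Weka Filter class.
abstract class FormatSketchFilter {
    protected List<String> inputAttrs;   // attribute names of the incoming data
    protected List<String> outputAttrs;  // null until the output format is known

    // Returns true if the output format is known as soon as the input format is set.
    boolean inputFormat(List<String> attributes) {
        inputAttrs = new ArrayList<>(attributes);
        outputAttrs = null;
        return determineOutputFormatEarly();
    }

    // Sketch choice (an assumption): fail loudly if the format is requested too early.
    List<String> outputFormat() {
        if (outputAttrs == null) {
            throw new IllegalStateException("output format not yet defined");
        }
        return outputAttrs;
    }

    boolean isOutputFormatDefined() {
        return outputAttrs != null;
    }

    // Subclasses decide whether the output format can be fixed without seeing data.
    protected abstract boolean determineOutputFormatEarly();
}

// Processes each instance on arrival, so the output format mirrors the input format.
class PassThroughSketch extends FormatSketchFilter {
    @Override
    protected boolean determineOutputFormatEarly() {
        outputAttrs = new ArrayList<>(inputAttrs);
        return true;
    }
}

// Must see the whole training set first (as a discretization filter would),
// so the output format stays undefined after inputFormat().
class DeferredSketch extends FormatSketchFilter {
    @Override
    protected boolean determineOutputFormatEarly() {
        return false;
    }
}
```

Calling inputFormat() on a PassThroughSketch returns true and isOutputFormatDefined() is then true, whereas a DeferredSketch returns false from inputFormat() and reports no output format until a subclass sets one.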
Two methods are used for piping instances through the filter: input()
and output(). As its name implies, the former gets an instance into the
filter; it returns true if the processed instance is available immediately and
false otherwise. The latter outputs an instance from the filter and removes
it from its output queue. The outputPeek() method outputs a filtered
instance without removing it from the output queue, and the number of
instances in the queue can be obtained with numPendingOutput().
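
The queue behaviour of these four methods can be sketched as follows. QueueSketchFilter is a hypothetical class, not part of Weka; it represents an instance as a double[] and doubles every value purely so that the output differs visibly from the input.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical filter that processes each instance as soon as it arrives.
class QueueSketchFilter {
    private final Queue<double[]> outputQueue = new ArrayDeque<>();

    // Processes the instance at once and queues the result, so it always returns true.
    boolean input(double[] instance) {
        double[] processed = instance.clone();
        for (int i = 0; i < processed.length; i++) {
            processed[i] *= 2.0;  // arbitrary transformation for illustration
        }
        outputQueue.add(processed);
        return true;
    }

    // Returns the next filtered instance and removes it from the queue (null if empty).
    double[] output() {
        return outputQueue.poll();
    }

    // Returns the next filtered instance but leaves it in the queue (null if empty).
    double[] outputPeek() {
        return outputQueue.peek();
    }

    // Number of filtered instances still waiting in the queue.
    int numPendingOutput() {
        return outputQueue.size();
    }
}
```

After one call to input(), numPendingOutput() is 1. It remains 1 after outputPeek() and drops to 0 only after output().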
Filters that must see the whole dataset before processing instances need
to be notified when all training instances have been input. This is done by
calling batchFinished(), which tells the filter that the statistics obtained
from the input data gathered so far (the training data) should not be
updated when further data is received. For all filter algorithms, once
batchFinished() has been called, the output format can be read and the
filtered training instances are ready for output. The first time input() is
called after batchFinished(), the output queue is reset; that is, all training
instances are removed from it. If there are training instances awaiting
output, batchFinished() returns true, otherwise false.
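
One way to picture the batch protocol is the sketch below. It assumes a hypothetical CenteringSketchFilter, not a Weka class, which subtracts the training mean from each value. A single double stands in for an instance, and the field names are invented for illustration. The point is the ordering: statistics are frozen in batchFinished(), and the first input() afterwards resets the output queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical batch filter: subtracts the mean of the training data from each value.
class CenteringSketchFilter {
    private final List<Double> buffered = new ArrayList<>();      // training data seen so far
    private final Queue<Double> outputQueue = new ArrayDeque<>();
    private boolean newBatch = true;      // true if the next input() starts a new batch
    private boolean statsFrozen = false;  // true once batchFinished() has fixed the mean
    private double mean;

    boolean input(double value) {
        if (newBatch) {
            outputQueue.clear();          // first input() of a batch resets the queue
            newBatch = false;
        }
        if (!statsFrozen) {
            buffered.add(value);          // still collecting training data
            return false;                 // nothing can be output yet
        }
        outputQueue.add(value - mean);    // later data: use the frozen statistic
        return true;
    }

    boolean batchFinished() {
        if (!statsFrozen) {
            double sum = 0.0;
            for (double v : buffered) {
                sum += v;
            }
            mean = buffered.isEmpty() ? 0.0 : sum / buffered.size();
            for (double v : buffered) {
                outputQueue.add(v - mean);  // filtered training data becomes available
            }
            buffered.clear();
            statsFrozen = true;
        }
        newBatch = true;
        return !outputQueue.isEmpty();    // true if instances are pending output
    }

    Double output() {
        return outputQueue.poll();
    }

    int numPendingOutput() {
        return outputQueue.size();
    }
}
```

With training values 1 and 3, both input() calls return false. batchFinished() then returns true with two instances pending. A subsequent input(5) clears those two, queues 5 minus the frozen mean of 2, and returns true.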
An example filter
It's time for an example. The ReplaceMissingValuesFilter takes a dataset
and replaces missing values with a constant. For numeric attributes, the
constant is the attribute's mean value; for nominal ones, its mode. This
filter must see all the training data before any output can be determined,
and once these statistics have been computed, they must remain fixed when
future test data is filtered. Figure 8.12 shows the source code.
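
Independently of the source code in Figure 8.12, the statistics involved can be sketched in a few lines. The MissingValueSketch class below is hypothetical and is not the filter's actual implementation. As assumptions of this sketch, a missing numeric value is encoded as NaN, a missing nominal value as null, and ties for the mode go to the value that reaches the top count first.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating mean/mode replacement of missing values.
class MissingValueSketch {

    // Mean of the non-missing values; NaN marks a missing value in this sketch.
    static double mean(double[] column) {
        double sum = 0.0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }

    // Most frequent non-missing label; null marks a missing value in this sketch.
    static String mode(String[] column) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String label : column) {
            if (label == null) {
                continue;
            }
            int c = counts.merge(label, 1, Integer::sum);
            if (c > bestCount) {
                best = label;
                bestCount = c;
            }
        }
        return best;
    }

    // Returns a copy of the column with every missing value replaced by the constant.
    static double[] replace(double[] column, double constant) {
        double[] result = column.clone();
        for (int i = 0; i < result.length; i++) {
            if (Double.isNaN(result[i])) {
                result[i] = constant;
            }
        }
        return result;
    }

    // Nominal counterpart of the method above.
    static String[] replace(String[] column, String constant) {
        String[] result = column.clone();
        for (int i = 0; i < result.length; i++) {
            if (result[i] == null) {
                result[i] = constant;
            }
        }
        return result;
    }
}
```

To respect the requirement that statistics stay fixed, mean() and mode() would be called once on the training columns, and the resulting constants passed to replace() for both the training data and any later test data.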
InputFormat()
ReplaceMissingValuesFilter overrides three of the methods defined in
Filter: inputFormat(), input(), and batchFinished(). In inputFormat(), as
you can see from Figure 8.12, a dataset m_InputFormat is created with the
required input format and capacity zero; this will hold incoming instances.
The method setOutputFormat(), which is a protected method in Filter, is
called to set the output format. Then the variable b_NewBatch, which
indicates whether the next incoming instance belongs to a new batch of
data, is set to true because a new dataset is to be processed; and
m_ModesAndMeans, which will hold the filter's statistics, is initialized. The
variables b_NewBatch and m_InputFormat are the only fields declared in the
superclass Filter that are visible in ReplaceMissingValuesFilter, and they