November 5, 2013
SPSS Modeler 15 - How to use the Auto Classifier Node
IBM SPSS Modeler V15.0 enables you to
build predictive models to solve business issues, quickly and
intuitively, without the need for programming.
In this demonstration we are going to show how you can use the Auto Classifier node. The Auto Classifier node works with nominal or binary targets, and it tests and compares various models in a single run.
You can select which algorithms (decision trees, neural networks, KNN,
...) you want to include, and even tweak the properties of each algorithm so
that you can run several variations of a single algorithm. This makes it
really easy to evaluate all the algorithms at once, and the node keeps the best
models for scoring or further analysis. In the end you can choose which
model you want to use for scoring, or use them all in an ensemble!
First, a brief description of the data. The data comes from the 1994 US
Census database and is available from the UCI Machine Learning Repository
at http://archive.ics.uci.edu/ml/datasets/Adult. The goal is to determine
whether a person makes over 50K a year. The dataset has 14 variables, both
categorical and numeric.
The first step is to import the data. The data are in CSV format, so we can
use the "Var. File" node to import them. All you have to do is define
the source path, and we are ready to import the data.
Then we can use the "Data Audit" node to inspect the data. This is one
of the most useful nodes in SPSS Modeler: it displays a graph and
statistics for every variable and flags any missing values or
outliers in the data. I am going to write more about this node in another post.
After inspecting the data we can see
that we do not have any serious problems with missing values or
outliers, but we will do a couple of transformations to improve the
performance and the interpretability of the model.
First, we will reclassify the country variable. About 90% of the records
are US, and the rest are spread across many countries, each with a frequency
of 1% or less. So we will use a "Reclassify" node to collapse the variable
into two categories: US and Non-US.
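The effect of this reclassification can be sketched in a few lines of plain Python (illustrative only; in Modeler it is all done through the node's dialog, and "United-States" is how the UCI data encodes US records):

```python
def reclassify_country(country):
    """Collapse the many rare country values into two categories,
    mirroring the Reclassify step: US vs. Non-US."""
    return "US" if country.strip() == "United-States" else "Non-US"

sample = ["United-States", "Mexico", "United-States", "India"]
print([reclassify_country(c) for c in sample])
# → ['US', 'Non-US', 'US', 'Non-US']
```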
Another thing we can do is binning. The "Binning" node has a very useful
feature called optimal binning. This method tries to find the optimal bins
according to a supervisor field, which is usually the target, so that
the new binned variable is a better predictor of the target.
And here are the results after binning the age variable. The "Binning"
node created 8 bins that help to categorise the age variable better
with respect to the target variable.
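Optimal binning is supervised, meaning the target is used to place the cut points. As a rough, naive illustration of the idea (a single-cut search by Gini impurity, not the method Modeler actually uses, and with made-up data):

```python
def best_cut(ages, labels):
    """Naive one-cut supervised binning: pick the age threshold that
    minimises the weighted Gini impurity of the two resulting bins."""
    def gini(ys):
        if not ys:
            return 0.0
        p = sum(ys) / len(ys)
        return 2 * p * (1 - p)

    pairs = sorted(zip(ages, labels))
    best, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        cut = pairs[i][0]
        left = [y for a, y in pairs if a < cut]
        right = [y for a, y in pairs if a >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best, best_score = cut, score
    return best

# toy example: the low-income labels cluster below age 45
print(best_cut([20, 25, 30, 45, 50, 60], [0, 0, 0, 1, 1, 1]))  # → 45
```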
The next step is to partition our data so we can test the performance of our
models on a fresh set of data. For this we can use the "Partition"
node. We divided the data into two random samples, with the training sample
containing 70% of the records and the testing sample the remaining 30%.
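The split performed by the Partition node can be sketched like this (a minimal stdlib-Python illustration, not Modeler's implementation):

```python
import random

def partition(records, train_frac=0.7, seed=42):
    """Randomly split records into training and testing partitions,
    mirroring what the Partition node does with a 70/30 split."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = partition(list(range(100)))
print(len(train), len(test))  # → 70 30
```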
Then we need to instantiate our data, decide which variables will be
included, and define which variable will be our target, if we haven't done
so already in the import node. For this we use the "Type" node. In this
case I do not want to include the original age and country variables in the
model, but rather the new binned and reclassified versions. So I set the
role of the original variables to None, set the role of the new variables
we created to Input, and define the class variable as our target.
Then it is time for the modelling part. We drag and drop the
"Auto Classifier" node onto the canvas and edit it to customise it.
The first tab is the "Fields" tab, where you can set the target
variable, the input variables and the partitions, or just use the defaults,
which read these settings from the "Type" and "Partition" nodes that we
set up earlier.
Let's move to the "Model" tab. Here we can adjust settings such as:
- how we want Modeler to rank the models that will be automatically built,
- which partition to use to evaluate the models,
- how many models to keep,
- whether Modeler should calculate predictor importance,
- the criteria for the lift chart, and
- the costs for wrong predictions and revenues for correct
predictions, so that Modeler can estimate the profit we would get by
applying each model.
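The profit estimate driven by those cost/revenue settings can be illustrated with a toy calculation (the revenue and cost figures below are made-up examples, not Modeler defaults):

```python
def model_profit(predictions, actuals, revenue=10.0, cost=2.0):
    """Toy profit estimate: each correctly predicted positive earns
    `revenue`, each wrong prediction costs `cost` (illustrative figures)."""
    profit = 0.0
    for p, a in zip(predictions, actuals):
        if p == a == 1:
            profit += revenue      # correct positive prediction
        elif p != a:
            profit -= cost         # wrong prediction
    return profit

print(model_profit([1, 1, 0, 1], [1, 0, 0, 1]))  # → 18.0
```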
Next is the "Expert" tab, where we can
check which algorithms we want to use, and also adjust various
parameters and thus create multiple models from each algorithm. But
caution is advised: although this functionality is really
powerful, choosing to run many models may increase the build time.
In the "Discard" tab we set properties so that Modeler keeps only the
models we want and discards the rest.
The final tab is the "Settings" tab, where you define properties for the
ensemble, that is, if you decide to use all the generated models in an
ensemble.
Now we can click run to train our
models. After the training is over, a model nugget will be created. If
we open that model nugget we can see a lot of details about our models.
We can see how long it took for each model to be built, the maximum profit
we get from each model, the number of fields used and some statistics about
how good our models are. The statistics include:
- Lift - if we score our database and choose the top 30%, the lift
shows how much better the results will be by using the model instead of
selecting records at random,
- Overall Accuracy - the percentage of records whose value is predicted correctly,
- Area Under Curve - the area under the ROC curve; the higher the value, the better the model.
But these statistics will not always agree. In our example we can see
that, according to overall accuracy, the C5 decision tree is the best, but
it is not the best by Area Under Curve or Lift. It is up to us to decide
which model to use. Of course, with SPSS Modeler you can create evaluation
charts and other statistics to help you, but we are going to present that
in another blog post.
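The lift figure, for example, can be computed by hand; here is a minimal sketch with made-up scores and labels:

```python
def lift_at(scores, labels, top_frac=0.3):
    """Lift at the top `top_frac` of records, ranked by model score:
    hit rate in the top segment divided by the overall hit rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), reverse=True)]
    k = int(len(ranked) * top_frac)
    top_rate = sum(ranked[:k]) / k
    overall = sum(labels) / len(labels)
    return top_rate / overall

# toy data: 10 records, 4 positives, the model scores 3 of them highest
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.01]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(lift_at(scores, labels))  # → 2.5
```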
Next is the Graph tab. The graph on the left is based on the ensemble
of the models and shows the performance of the ensemble. On the right
we can see the predictor importance, really useful information for
analysing our results.
Then, in the "Summary" tab, we can see information about each model, such
as the fields used or the settings that were defined for each model.
Finally in the settings tab we can define how the ensemble will use all
the models to make a prediction. Options are:
- Confidence - Weighted Voting
- Raw Propensity - Weighted Voting
- Highest Confidence Wins
- Average Raw Propensity
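Confidence-Weighted Voting, for instance, works roughly like this (a sketch with made-up class labels and confidences):

```python
from collections import defaultdict

def weighted_vote(predictions):
    """Confidence-Weighted Voting: each model's vote for its predicted
    class is weighted by its confidence; the class with the highest
    total weight wins."""
    totals = defaultdict(float)
    for label, confidence in predictions:
        totals[label] += confidence
    return max(totals, key=totals.get)

# three models: two predict ">50K" with modest confidence,
# one predicts "<=50K" very confidently - the two still outweigh it
print(weighted_vote([(">50K", 0.55), (">50K", 0.60), ("<=50K", 0.90)]))
# → >50K
```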
That's it! We have created and compared various models from various
algorithms, and we are ready to make predictions!