Linear classification functions were suggested in 1936 by
the noted statistician R. A. Fisher as the basis for improved
separation of observations into classes. The idea is to find
linear functions of the measurements that maximize the
ratio of between-class variability to within-class
variability. In other words, we would obtain classes that
are very homogeneous and differ the most from each other.
For each observation, these functions are used to compute
scores that measure the proximity of that observation to
each of the classes. An observation is classified as
belonging to the class for which it has the highest
classification score (equivalent to the smallest statistical
distance).
FIGURE 12.3 DISCRIMINANT ANALYSIS OUTPUT FOR RIDING-MOWER DATA, DISPLAYING THE ESTIMATED CLASSIFICATION FUNCTIONS

USING CLASSIFICATION FUNCTION SCORES TO CLASSIFY
For each record, we calculate the value of the classification function (one for each class); whichever class’s function has the highest value (= score) is the class assigned to that record.
The classification functions are estimated using software
(see Figure 12.3). Note that the number of classification
functions is equal to the number of classes (in this case, 2).
To classify a family into the class of owners or nonowners,
we use the functions above to compute the family’s
classification scores: A family is classified into the class of
owners if the owner function is higher than the nonowner
function and into nonowners if the reverse is the case.
These functions are specified in a way that can be
generalized easily to more than two classes. The values
given for the functions are simply the weights to be
associated with each variable in the linear function in a
manner analogous to multiple linear regression. For
instance, the first household has an income of $60K and a
lot size of 18.4K ft2. Their owner score is therefore –73.16
+ (0.43)(60) + (5.47)(18.4) = 53.2, and their nonowner
score is –51.42 + (0.33)(60) + (4.68)(18.4) = 54.48. Since
the second score is higher, the household is (mis)classified
by the model as a nonowner. The scores for all 24
households are given in Figure 12.4.
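To make the arithmetic concrete, here is a minimal sketch in Python (not part of the XLMiner output) that applies the two classification functions from Figure 12.3 to the first household, using the coefficient values quoted above:

```python
import numpy as np

# Estimated classification function coefficients from Figure 12.3
# (constant, income in $000s, lot size in 000s ft2)
owner_coef = np.array([-73.16, 0.43, 5.47])
nonowner_coef = np.array([-51.42, 0.33, 4.68])

def classification_scores(income, lot_size):
    """Return (owner score, nonowner score) for one household."""
    x = np.array([1.0, income, lot_size])   # the leading 1 multiplies the constant
    return float(owner_coef @ x), float(nonowner_coef @ x)

# First household: income $60K, lot size 18.4 (000s ft2)
owner, nonowner = classification_scores(60, 18.4)
print(round(owner, 2), round(nonowner, 2))           # approx. 53.29 and 54.49
print("owner" if owner > nonowner else "nonowner")   # classified as nonowner
```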
An alternative way for classifying an observation into one
of the classes is to compute the probability of belonging to
each of the classes and assigning the observation to the
most likely class. If we have two classes, we need only
compute a single probability for each observation (e.g., of
belonging to owners). Using a cutoff of 0.5 is equivalent to
assigning the observation to the class with the highest
classification score. The advantage of this approach is that
we can sort the records in order of descending probabilities
and generate lift curves. Let us assume that there are m
classes. To compute the probability of belonging to a
certain class k, for a certain observation i, we need to
compute all the classification scores
FIGURE 12.4 CLASSIFICATION SCORES FOR
RIDING-MOWER DATA
c1(i), c2(i), ..., cm(i) and combine them using the following formula:

P(observation i belongs to class k) = e^{ck(i)} / [e^{c1(i)} + e^{c2(i)} + ... + e^{cm(i)}]

In XLMiner these probabilities are computed automatically, as can be seen in Figure 12.5.
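As a sketch of this conversion outside XLMiner, the scores can be exponentiated and normalized exactly as in the formula above; the score values below are the ones computed earlier for the first household:

```python
import numpy as np

def class_probabilities(scores):
    """Convert classification scores c1(i), ..., cm(i) into
    class membership probabilities via the formula above."""
    scores = np.asarray(scores, dtype=float)
    # Subtracting the maximum score keeps the exponentials numerically
    # stable without changing the resulting probabilities.
    e = np.exp(scores - scores.max())
    return e / e.sum()

# First household: owner score 53.29, nonowner score 54.49
print(class_probabilities([53.29, 54.49]))   # approx. [0.23, 0.77]
```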
We now have three misclassifications, compared to four in
our original (ad hoc) classifications. This can be seen in
Figure 12.6, which includes the line resulting from the
discriminant model.2
FIGURE 12.5 DISCRIMINANT ANALYSIS OUTPUT
FOR RIDING-MOWER DATA, DISPLAYING THE
ESTIMATED PROBABILITY OF OWNERSHIP FOR
EACH FAMILY
FIGURE 12.6 CLASS SEPARATION OBTAINED FROM THE DISCRIMINANT MODEL (COMPARED TO AD HOC LINE FROM FIGURE 12.1)
12.4 Classification Performance of Discriminant Analysis
The discriminant analysis method relies on two main
assumptions to arrive at classification scores: First, it
assumes that the measurements in all classes come from a
multivariate normal distribution. When this assumption is
reasonably met, discriminant analysis is a more powerful
tool than other classification methods, such as logistic
regression. In fact, it is 30% more efficient than logistic
regression if the data are multivariate normal, in the sense
that we require 30% less data to arrive at the same results.
In practice, it has been shown that this method is relatively
robust to departures from normality in the sense that
predictors can be nonnormal and even dummy variables.
This is true as long as the smallest class is sufficiently
large (approximately more than 20 cases). This method is
also known to be sensitive to outliers in both the univariate
space of single predictors and in the multivariate space.
Exploratory analysis should therefore be used to locate
extreme cases and determine whether they can be
eliminated.
The second assumption behind discriminant analysis is that
the correlation structure between the different
measurements within a class is the same across classes.
This can be roughly checked by estimating the correlation
matrix for each class and comparing matrices. If the
correlations differ substantially across classes, the
classifier will tend to classify cases into the class with the
largest variability. When the correlation structure differs
significantly and the dataset is very large, an alternative is
to use quadratic discriminant analysis.3
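A rough version of this check can be scripted. The sketch below assumes the riding-mower data sit in a CSV file with columns named Income, Lot_Size, and Ownership (hypothetical names) and uses scikit-learn's quadratic discriminant analysis as the alternative mentioned above:

```python
import pandas as pd
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Hypothetical file and column names; adjust to the actual data.
df = pd.read_csv("RidingMowers.csv")
predictors, target = ["Income", "Lot_Size"], "Ownership"

# Rough check: estimate the correlation matrix separately within each class.
for cls, group in df.groupby(target):
    print(cls)
    print(group[predictors].corr().round(2))

# If the correlations differ substantially across classes (and the dataset
# is large), quadratic discriminant analysis fits a separate covariance
# matrix per class instead of assuming a common one.
qda = QuadraticDiscriminantAnalysis().fit(df[predictors], df[target])
```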
With respect to the evaluation of classification accuracy,
we once again use the general measures of performance
that were described in Chapter 5 (judging the performance
of a classifier), with the principal ones based on the
confusion matrix (accuracy alone or combined with costs)
and the lift chart. The same argument for using the
validation set for evaluating performance still holds. For
example, in the riding-mower example, families 1, 13, and
17 are misclassified. This means that the model yields an
error rate of 12.5% for these data. However, this rate is a
biased estimate—it is overly optimistic because we have
used the same data for fitting the classification parameters
and for estimating the error. Therefore, as with all other
models, we test performance on a validation set that
includes data that were not involved in estimating the
classification functions.
To obtain the confusion matrix from a discriminant
analysis, we either use the classification scores directly or
the probabilities of class membership that are computed
from the classification scores. In both cases we decide on
the class assignment of each observation based on the
highest score or probability. We then compare these
classifications to the actual class memberships of these
observations. This yields the confusion matrix. In the
Universal Bank case we use the estimated classification
functions in Figure 12.4 to predict the probability of loan
acceptance in a validation set that contains 2000 customers
(these data were not used in the modeling step).
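The bookkeeping itself is straightforward; a minimal sketch with hypothetical score values, in which each validation record is assigned to the class with the highest score and then tallied against its actual class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# One row of classification scores per validation record
# (column 0 = owner, column 1 = nonowner); hypothetical values.
scores = np.array([[53.29, 54.49],
                   [47.82, 46.01],
                   [55.10, 52.33]])
actual = np.array(["owner", "owner", "nonowner"])   # true classes (hypothetical)
classes = np.array(["owner", "nonowner"])

# Assign each record to the class with the highest score ...
predicted = classes[scores.argmax(axis=1)]
# ... and tally predictions against actual memberships.
print(confusion_matrix(actual, predicted, labels=classes))
```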
12.5 Prior Probabilities
So far we have assumed that our objective is to minimize
the classification error. The method presented above
assumes that the chances of encountering an item from
either class requiring classification are the same. If the
probability of encountering an item for classification in the
future is not equal for the different classes, we should
modify our functions to reduce our expected (long-run
average) error rate. The modification is done as follows:
Let us denote by pj the prior or future probability of
membership in class j (in the two-class case we have p1
and p2 = 1 – p1). We modify the classification function for
each class by adding log(pj).4 To illustrate this, suppose
that the percentage of riding-mower owners in the
population is 15% (compared to 50% in the sample). This
means that the model should classify fewer households as
owners. To account for this, we adjust the constants in the
classification functions from Figure 12.3 and obtain the
adjusted constants –73.16 + log(0.15) = –75.06 for owners
and –51.42 + log(0.85) = –51.58 for nonowners. To see
how this can affect classifications, consider family 13,
which was misclassified as an owner in the case involving
equal probability of class membership. When we account
for the lower probability of owning a mower in the
population, family 13 is classified properly as a nonowner
(its nonowner classification score now exceeds its owner
score).
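The adjustment is a one-line change to each constant term. A minimal sketch using natural logarithms, reproducing the adjusted constants above:

```python
import numpy as np

owner_coef = np.array([-73.16, 0.43, 5.47])     # coefficients from Figure 12.3
nonowner_coef = np.array([-51.42, 0.33, 4.68])

p_owner = 0.15                                   # prior probability of ownership
# Add log(pj) to each class's constant term.
owner_adj = owner_coef.copy()
owner_adj[0] += np.log(p_owner)
nonowner_adj = nonowner_coef.copy()
nonowner_adj[0] += np.log(1 - p_owner)

print(round(owner_adj[0], 2), round(nonowner_adj[0], 2))   # approx. -75.06 and -51.58
```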
12.6 Unequal Misclassification Costs
A second practical modification is needed when
misclassification costs are not symmetrical. If the cost of
misclassifying a class 1 item is very different from the cost
of misclassifying a class 2 item, we may want to minimize
the expected cost of misclassification rather than the
simple error rate (which does not take cognizance of
unequal misclassification costs). In the two-class case, it is
easy to manipulate the classification functions to account
for differing misclassification costs (in addition to prior
probabilities). We denote by C1 the cost of misclassifying
a class 1 member (into class 2). Similarly, C2 denotes the
cost of misclassifying a class 2 member (into class 1).
These costs are integrated into the constants of the
classification functions by adding log(C1) to the constant
for class 1 and log(C2) to the constant of class 2. To
incorporate both prior probabilities and misclassification
costs, add log(p1C1) to the constant of class 1 and
log(p2C2) to that of class 2.
In practice, it is not always simple to come up with
misclassification costs C1 and C2 for each class. It is
usually much easier to estimate the ratio of costs C2/C1
(e.g., the cost of misclassifying a credit defaulter is 10
times that of misclassifying a nondefaulter). Luckily, the
relationship between the
classification functions depends only on this ratio.
Therefore, we can set C1 = 1 and C2 = ratio and simply
add log(C2/C1) to the constant for class 2.
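In code, only the constant of class 2 changes. A brief sketch assuming a cost ratio C2/C1 of 10 and an illustrative class 2 constant:

```python
import numpy as np

cost_ratio = 10.0          # C2 / C1: misclassifying a class 2 member is
                           # assumed 10 times as costly as a class 1 member
class2_constant = -51.42   # illustrative constant of the class 2 function

# Add log(C2/C1) to the constant of class 2 only (C1 is set to 1).
class2_constant += np.log(cost_ratio)
print(round(class2_constant, 2))   # approx. -49.12
```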
12.7 Classifying More Than Two Classes
Example 3: Medical Dispatch to Accident Scenes
Ideally, every automobile accident call to 911 results in the
immediate dispatch of an ambulance to the accident scene.
However, in some cases the dispatch might be delayed
(e.g., at peak accident hours or in some resource-strapped
towns or shifts). In such cases, the 911 dispatchers must
make decisions about which units to send based on sketchy
information. It is useful to augment the limited information
provided in the initial call with additional information in
order to classify the accident as minor injury, serious
injury, or death. For this purpose we can use data that were
collected on automobile accidents in the United States in
2001 that involved some type of injury. For each accident,
additional information is recorded, such as day of week,
weather conditions, and road type. Figure 12.7 shows a
small sample of records with 10 measurements of interest.
The goal is to see how well the predictors can be used to
classify injury type correctly. To evaluate this, a sample of
1000 records was drawn and partitioned into training and
validation sets, and a discriminant analysis was performed
on the training data. The output structure is very similar to
that for the two-class case. The only difference is that each
observation now has three classification functions (one for
each injury type), and the confusion and error matrices are
3 × 3 to account for all the combinations of correct and
incorrect classifications (see Figure 12.8). The rule for
classification is still to classify an observation to the class
that has the highest corresponding classification score. The
classification scores are computed, as before, using the
classification function coefficients. This can be seen in
Figure 12.9. For instance, the no injury classification score
for the first accident in the training set is –24.51 + 1.95(1)
+ 1.19(0) +... + 16.36(1) = 30.93. The nonfatal score is
similarly computed as 31.42 and the fatal score as 25.94.
Since the nonfatal score is highest, this accident is
(correctly) classified as having nonfatal injuries.
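The same rule scales to any number of classes. A minimal sketch with three classes and entirely hypothetical coefficient values (the estimated functions themselves appear in Figure 12.9):

```python
import numpy as np

# One row of coefficients per class (constant first); values are
# hypothetical stand-ins for the estimated classification functions.
coef = np.array([[-24.51, 1.95, 1.19, 16.36],    # no injury
                 [-22.30, 1.10, 2.05, 15.40],    # nonfatal
                 [-30.00, 0.80, 3.10, 14.20]])   # fatal
classes = np.array(["no injury", "nonfatal", "fatal"])

def classify(x):
    """x: predictor vector for one accident (without the leading 1).
    Returns the class with the highest score, plus all scores."""
    scores = coef @ np.concatenate(([1.0], x))
    return classes[scores.argmax()], scores

label, scores = classify(np.array([1, 0, 1]))
print(label, scores.round(2))
```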
FIGURE 12.7 SAMPLE OF 20 AUTOMOBILE ACCIDENTS FROM THE 2001 DEPARTMENT OF TRANSPORTATION DATABASE. EACH ACCIDENT IS CLASSIFIED AS ONE OF THREE INJURY TYPES (NO INJURY, NONFATAL, OR FATAL) AND HAS 10 MEASUREMENTS (EXTRACTED FROM A LARGER SET OF MEASUREMENTS)