Linear classification functions were suggested in 1936 by
the noted statistician R. A. Fisher as the basis for improved
separation of observations into classes. The idea is to find
linear functions of the measurements that maximize the
ratio of between-class variability to within-class
variability. In other words, we would obtain classes that
are very homogeneous and differ the most from each other.
For each observation, these functions are used to compute
scores that measure the proximity of that observation to
each of the classes. An observation is classified as
belonging to the class for which it has the highest
classification score (equivalent to the smallest statistical
distance).
FIGURE 12.3 DISCRIMINANT ANALYSIS OUTPUT FOR RIDING-MOWER DATA, DISPLAYING THE ESTIMATED CLASSIFICATION FUNCTIONS

USING CLASSIFICATION FUNCTION SCORES TO CLASSIFY
For each record, we calculate the value of the classification function (one for each class); whichever class’s function has the highest value (= score) is the class assigned to that record.
The classification functions are estimated using software
(see Figure 12.3). Note that the number of classification
functions is equal to the number of classes (in this case, 2).
To classify a family into the class of owners or nonowners,
we use the functions above to compute the family’s
classification scores: A family is classified into the class of
owners if the owner function is higher than the nonowner
function and into nonowners if the reverse is the case.
These functions are specified in a way that can be
generalized easily to more than two classes. The values
given for the functions are simply the weights to be
associated with each variable in the linear function in a
manner analogous to multiple linear regression. For
instance, the first household has an income of $60K and a
lot size of 18.4K ft2. Their owner score is therefore –73.16
+ (0.43)(60) + (5.47)(18.4) = 53.2, and their nonowner
score is –51.42 + (0.33)(60) + (4.68)(18.4) = 54.48. Since
the second score is higher, the household is (mis)classified
by the model as a nonowner. The scores for all 24
households are given in Figure 12.4.
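To make the arithmetic concrete, here is a minimal sketch in Python (not part of the XLMiner output) that applies the two classification functions from Figure 12.3 to the first household, using the coefficient values quoted above:

```python
import numpy as np

# Estimated classification function coefficients from Figure 12.3
# (constant, income in $000s, lot size in 000s ft2)
owner_coef = np.array([-73.16, 0.43, 5.47])
nonowner_coef = np.array([-51.42, 0.33, 4.68])

def classification_scores(income, lot_size):
    """Return (owner score, nonowner score) for one household."""
    x = np.array([1.0, income, lot_size])   # the leading 1 multiplies the constant
    return float(owner_coef @ x), float(nonowner_coef @ x)

# First household: income $60K, lot size 18.4 (000s ft2)
owner, nonowner = classification_scores(60, 18.4)
print(round(owner, 2), round(nonowner, 2))           # approx. 53.29 and 54.49
print("owner" if owner > nonowner else "nonowner")   # classified as nonowner
```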
An alternative way for classifying an observation into one
of the classes is to compute the probability of belonging to
each of the classes and assigning the observation to the
most likely class. If we have two classes, we need only
compute a single probability for each observation (e.g., of
belonging to owners). Using a cutoff of 0.5 is equivalent to
assigning the observation to the class with the highest
classification score. The advantage of this approach is that
we can sort the records in order of descending probabilities
and generate lift curves. Let us assume that there are m
classes. To compute the probability of belonging to a
certain class k, for a certain observation i, we need to
compute all the classification scores
FIGURE 12.4 CLASSIFICATION SCORES FOR
RIDING-MOWER DATA
c1(i), c2(i), ..., cm(i) and combine them using the following formula:

P(observation i belongs to class k) = e^{ck(i)} / [e^{c1(i)} + e^{c2(i)} + ... + e^{cm(i)}]

In XLMiner these probabilities are computed automatically, as can be seen in Figure 12.5.
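As a sketch of this conversion outside XLMiner, the scores can be exponentiated and normalized exactly as in the formula above; the score values below are the ones computed earlier for the first household:

```python
import numpy as np

def class_probabilities(scores):
    """Convert classification scores c1(i), ..., cm(i) into
    class membership probabilities via the formula above."""
    scores = np.asarray(scores, dtype=float)
    # Subtracting the maximum score keeps the exponentials numerically
    # stable without changing the resulting probabilities.
    e = np.exp(scores - scores.max())
    return e / e.sum()

# First household: owner score 53.29, nonowner score 54.49
print(class_probabilities([53.29, 54.49]))   # approx. [0.23, 0.77]
```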
We now have three misclassifications, compared to four in
our original (ad hoc) classifications. This can be seen in
Figure 12.6, which includes the line resulting from the
discriminant model.2
FIGURE 12.5 DISCRIMINANT ANALYSIS OUTPUT
FOR RIDING-MOWER DATA, DISPLAYING THE
ESTIMATED PROBABILITY OF OWNERSHIP FOR
EACH FAMILY
FIGURE 12.6 CLASS SEPARATION OBTAINED FROM THE DISCRIMINANT MODEL (COMPARED TO AD HOC LINE FROM FIGURE 12.1)
12.4 Classification Performance of Discriminant Analysis
The discriminant analysis method relies on two main
assumptions to arrive at classification scores: First, it
assumes that the measurements in all classes come from a
multivariate normal distribution. When this assumption is
reasonably met, discriminant analysis is a more powerful
tool than other classification methods, such as logistic
regression. In fact, it is 30% more efficient than logistic
regression if the data are multivariate normal, in the sense
that we require 30% less data to arrive at the same results.
In practice, it has been shown that this method is relatively
robust to departures from normality in the sense that
predictors can be nonnormal and even dummy variables.
This is true as long as the smallest class is sufficiently
large (approximately more than 20 cases). This method is
also known to be sensitive to outliers in both the univariate
space of single predictors and in the multivariate space.
Exploratory analysis should therefore be used to locate
extreme cases and determine whether they can be
eliminated.
The second assumption behind discriminant analysis is that
the correlation structure between the different
measurements within a class is the same across classes.
This can be roughly checked by estimating the correlation
matrix for each class and comparing matrices. If the
correlations differ substantially across classes, the
classifier will tend to classify cases into the class with the
largest variability. When the correlation structure differs
significantly and the dataset is very large, an alternative is
to use quadratic discriminant analysis.3
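A rough version of this check can be scripted. The sketch below assumes the riding-mower data sit in a CSV file with columns named Income, Lot_Size, and Ownership (hypothetical names) and uses scikit-learn's quadratic discriminant analysis as the alternative mentioned above:

```python
import pandas as pd
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Hypothetical file and column names; adjust to the actual data.
df = pd.read_csv("RidingMowers.csv")
predictors, target = ["Income", "Lot_Size"], "Ownership"

# Rough check: estimate the correlation matrix separately within each class.
for cls, group in df.groupby(target):
    print(cls)
    print(group[predictors].corr().round(2))

# If the correlations differ substantially across classes (and the dataset
# is large), quadratic discriminant analysis fits a separate covariance
# matrix per class instead of assuming a common one.
qda = QuadraticDiscriminantAnalysis().fit(df[predictors], df[target])
```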
With respect to the evaluation of classification accuracy,
we once again use the general measures of performance
that were described in Chapter 5 (judging the performance
of a classifier), with the principal ones based on the
confusion matrix (accuracy alone or combined with costs)
and the lift chart. The same argument for using the
validation set for evaluating performance still holds. For
example, in the riding-mower example, families 1, 13, and
17 are misclassified. This means that the model yields an
error rate of 12.5% for these data. However, this rate is a
biased estimate—it is overly optimistic because we have
used the same data for fitting the classification parameters
and for estimating the error. Therefore, as with all other
models, we test performance on a validation set that
includes data that were not involved in estimating the
classification functions.
To obtain the confusion matrix from a discriminant
analysis, we either use the classification scores directly or
the probabilities of class membership that are computed
from the classification scores. In both cases we decide on
the class assignment of each observation based on the
highest score or probability. We then compare these
classifications to the actual class memberships of these
observations. This yields the confusion matrix. In the
Universal Bank case we use the estimated classification
functions in Figure 12.4 to predict the probability of loan
acceptance in a validation set that contains 2000 customers
(these data were not used in the modeling step).
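The bookkeeping itself is straightforward; a minimal sketch with hypothetical score values, in which each validation record is assigned to the class with the highest score and then tallied against its actual class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# One row of classification scores per validation record
# (column 0 = owner, column 1 = nonowner); hypothetical values.
scores = np.array([[53.29, 54.49],
                   [47.82, 46.01],
                   [55.10, 52.33]])
actual = np.array(["owner", "owner", "nonowner"])   # true classes (hypothetical)
classes = np.array(["owner", "nonowner"])

# Assign each record to the class with the highest score ...
predicted = classes[scores.argmax(axis=1)]
# ... and tally predictions against actual memberships.
print(confusion_matrix(actual, predicted, labels=classes))
```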
12.5 Prior Probabilities
So far we have assumed that our objective is to minimize
the classification error. The method presented above
assumes that the chances of encountering an item from
either class requiring classification are the same. If the
probability of encountering an item for classification in the
future is not equal for the different classes, we should
modify our functions to reduce our expected (long-run
average) error rate. The modification is done as follows:
Let us denote by pj the prior or future probability of
membership in class j (in the two-class case we have p1
and p2 = 1 – p1). We modify the classification function for
each class by adding log(pj).4 To illustrate this, suppose
that the percentage of riding-mower owners in the
population is 15% (compared to 50% in the sample). This
means that the model should classify fewer households as
owners. To account for this, we adjust the constants in the
classification functions from Figure 12.3 and obtain the
adjusted constants –73.16 + log(0.15) = –75.06 for owners
and –51.42 + log(0.85) = –51.58 for nonowners. To see
how this can affect classifications, consider family 13,
which was misclassified as an owner in the case involving
equal probability of class membership. When we account
for the lower probability of owning a mower in the
population, family 13 is classified properly as a nonowner
(its nonowner classification score now exceeds its owner
score).
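The adjustment is a one-line change to each constant term. A minimal sketch using natural logarithms, reproducing the adjusted constants above:

```python
import numpy as np

owner_coef = np.array([-73.16, 0.43, 5.47])     # coefficients from Figure 12.3
nonowner_coef = np.array([-51.42, 0.33, 4.68])

p_owner = 0.15                                   # prior probability of ownership
# Add log(pj) to each class's constant term.
owner_adj = owner_coef.copy()
owner_adj[0] += np.log(p_owner)
nonowner_adj = nonowner_coef.copy()
nonowner_adj[0] += np.log(1 - p_owner)

print(round(owner_adj[0], 2), round(nonowner_adj[0], 2))   # approx. -75.06 and -51.58
```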
12.6 Unequal Misclassification Costs
A second practical modification is needed when
misclassification costs are not symmetrical. If the cost of
misclassifying a class 1 item is very different from the cost
of misclassifying a class 2 item, we may want to minimize
the expected cost of misclassification rather than the
simple error rate (which does not take cognizance of
unequal misclassification costs). In the two-class case, it is
easy to manipulate the classification functions to account
for differing misclassification costs (in addition to prior
probabilities). We denote by C1 the cost of misclassifying
a class 1 member (into class 2). Similarly, C2 denotes the
cost of misclassifying a class 2 member (into class 1).
These costs are integrated into the constants of the
classification functions by adding log(C1) to the constant
for class 1 and log(C2) to the constant of class 2. To
incorporate both prior probabilities and misclassification
costs, add log(p1C1) to the constant of class 1 and
log(p2C2) to that of class 2.
In practice, it is not always simple to come up with
misclassification costs C1 and C2 for each class. It is
usually much easier to estimate the ratio of costs C2/C1
(e.g., the cost of misclassifying a credit defaulter is 10
times that of misclassifying a nondefaulter). Luckily, the
relationship between the
classification functions depends only on this ratio.
Therefore, we can set C1 = 1 and C2 = ratio and simply
add log(C2/C1) to the constant for class 2.
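In code, only the constant of class 2 changes. A brief sketch assuming a cost ratio C2/C1 of 10 and an illustrative class 2 constant:

```python
import numpy as np

cost_ratio = 10.0          # C2 / C1: misclassifying a class 2 member is
                           # assumed 10 times as costly as a class 1 member
class2_constant = -51.42   # illustrative constant of the class 2 function

# Add log(C2/C1) to the constant of class 2 only (C1 is set to 1).
class2_constant += np.log(cost_ratio)
print(round(class2_constant, 2))   # approx. -49.12
```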
12.7 Classifying More Than Two Classes
Example 3: Medical Dispatch to Accident Scenes
Ideally, every automobile accident call to 911 results in the
immediate dispatch of an ambulance to the accident scene.
However, in some cases the dispatch might be delayed
(e.g., at peak accident hours or in some resource-strapped
towns or shifts). In such cases, the 911 dispatchers must
make decisions about which units to send based on sketchy
information. It is useful to augment the limited information
provided in the initial call with additional information in
order to classify the accident as minor injury, serious
injury, or death. For this purpose we can use data that were
collected on automobile accidents in the United States in
2001 that involved some type of injury. For each accident,
additional information is recorded, such as day of week,
weather conditions, and road type. Figure 12.7 shows a
small sample of records with 10 measurements of interest.
The goal is to see how well the predictors can be used to
classify injury type correctly. To evaluate this, a sample of
1000 records was drawn and partitioned into training and
validation sets, and a discriminant analysis was performed
on the training data. The output structure is very similar to
that for the two-class case. The only difference is that each
observation now has three classification functions (one for
each injury type), and the confusion and error matrices are
3 × 3 to account for all the combinations of correct and
incorrect classifications (see Figure 12.8). The rule for
classification is still to classify an observation to the class
that has the highest corresponding classification score. The
classification scores are computed, as before, using the
classification function coefficients. This can be seen in
Figure 12.9. For instance, the no injury classification score
for the first accident in the training set is –24.51 + 1.95(1)
+ 1.19(0) +... + 16.36(1) = 30.93. The nonfatal score is
similarly computed as 31.42 and the fatal score as 25.94.
Since the nonfatal score is highest, this accident is
(correctly) classified as having nonfatal injuries.
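The same rule scales to any number of classes. A minimal sketch with three classes and entirely hypothetical coefficient values (the estimated functions themselves appear in Figure 12.9):

```python
import numpy as np

# One row of coefficients per class (constant first); values are
# hypothetical stand-ins for the estimated classification functions.
coef = np.array([[-24.51, 1.95, 1.19, 16.36],    # no injury
                 [-22.30, 1.10, 2.05, 15.40],    # nonfatal
                 [-30.00, 0.80, 3.10, 14.20]])   # fatal
classes = np.array(["no injury", "nonfatal", "fatal"])

def classify(x):
    """x: predictor vector for one accident (without the leading 1).
    Returns the class with the highest score, plus all scores."""
    scores = coef @ np.concatenate(([1.0], x))
    return classes[scores.argmax()], scores

label, scores = classify(np.array([1, 0, 1]))
print(label, scores.round(2))
```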
FIGURE 12.7 SAMPLE OF 20 AUTOMOBILE ACCIDENTS FROM THE 2001 DEPARTMENT OF TRANSPORTATION DATABASE. EACH ACCIDENT IS CLASSIFIED AS ONE OF THREE INJURY TYPES (NO INJURY, NONFATAL, OR FATAL) AND HAS 10 MEASUREMENTS (EXTRACTED FROM A LARGER SET OF MEASUREMENTS)