classification is the goal and performance is evaluated by
reviewing results with a validation sample. For reference,
some key concepts from a classical statistical perspective
are included below.
FIGURE 10.8 OUTPUT FOR LOGISTIC REGRESSION MODEL WITH ONLY SEVEN PREDICTORS.
Appendix A: Why Linear Regression Is Inappropriate for a
Categorical Response
Now that you have seen how logistic regression works, we
explain why linear regression is not suitable. Technically,
one can apply a multiple linear regression model to this
problem, treating the dependent variable Y as continuous.
Of course, Y must be coded numerically (e.g., 1 for
customers who did accept the loan offer and 0 for
customers who did not accept it). Although the software will
yield output that at first glance may appear standard (e.g.,
Figure 10.9), a closer look reveals several anomalies:
1. Using the model to predict Y for each of the
observations (or classify them) yields predictions that are
not necessarily 0 or 1.
2. A look at the histogram or probability plot of the
residuals reveals that the assumption that the dependent
variable (or residuals) follows a normal distribution is
violated. Clearly, if Y takes only the values 0 and 1, it
cannot be normally distributed. In fact, a more appropriate
distribution for the number of 1’s in the dataset is the
binomial distribution with p = P (Y = 1).
3. The assumption that the variance of Y is constant across
all classes is violated. Each Y is a binary (Bernoulli)
variable, so its variance is p(1 – p). This means that the
variance will be higher for classes where the probability of
acceptance, p, is near 0.5 than where it is near 0 or 1, as
the sketch below illustrates.
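The following minimal Python sketch (not part of the book's XLMiner workflow; the data and the income predictor are simulated, purely for illustration) demonstrates all three anomalies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a binary response whose acceptance probability rises with income
n = 1000
income = rng.uniform(20, 200, n)                     # hypothetical predictor
p = 1 / (1 + np.exp(-(-6 + 0.06 * income)))          # true P(Y = 1)
y = rng.binomial(1, p)                               # 0/1 response

# Fit multiple linear regression by least squares: y ~ b0 + b1 * income
X = np.column_stack([np.ones(n), income])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

# Anomaly 1: predictions are not 0 or 1, and can fall outside [0, 1]
print("prediction range:", y_hat.min(), y_hat.max())

# Anomaly 2: for a given x the residual takes only two values
# (y - y_hat with y in {0, 1}), so residuals cannot be normal
residuals = y - y_hat
skew = ((residuals - residuals.mean()) ** 3).mean() / residuals.std() ** 3
print("residual skewness:", skew)

# Anomaly 3: Var(Y) = p(1 - p) is largest near p = 0.5, not constant
for lo, hi in [(20, 80), (80, 140), (140, 200)]:
    mask = (income >= lo) & (income < hi)
    print(f"income [{lo},{hi}): residual variance = {residuals[mask].var():.3f}")
```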
FIGURE 10.9 OUTPUT FOR MULTIPLE LINEAR
REGRESSION MODEL OF PERSONAL LOAN ON
THREE PREDICTORS.
Below you will find partial output from running a multiple
linear regression of Personal Loan (PL, coded as PL = 1
for customers who accepted the loan offer and PL = 0
otherwise) on three of the predictors.
The estimated model (with b3 denoting the CD Account coefficient reported in Figure 10.9) is
PL = –0.2346 + (0.0032)(Income) + (0.0329)(Family) + b3(CD)
To predict whether a new customer will accept the
personal loan offer (PL = 1) or not (PL = 0), we input that
customer's values for these three predictors. For example,
for a customer with an annual income of $50K, two family
members, and no CD account at Universal Bank, the
predicted loan acceptance is –0.2346 + (0.0032)(50) +
(0.0329)(2) = –0.009. Clearly, this is not a valid “loan
acceptance” value.
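As a quick arithmetic check, the same plug-in prediction can be computed directly (a minimal sketch; only the coefficients visible in the text are used, and CD = 0 makes the unknown b3 term drop out):

```python
# Coefficients read from the linear regression output (Figure 10.9);
# b3, the CD Account coefficient, is not needed here since CD = 0
b0, b_income, b_family = -0.2346, 0.0032, 0.0329

income, family = 50, 2                                # income in $K, family size
pl_hat = b0 + b_income * income + b_family * family   # + b3 * 0
print(round(pl_hat, 3))                               # -0.009: not a valid 0/1 value
```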
Furthermore, the histogram of the residuals (Figure 10.10)
reveals that the residuals are probably not normally
distributed. Therefore, our estimated model is based on
violated assumptions.
FIGURE 10.10 HISTOGRAM OF RESIDUALS
FROM A MULTIPLE LINEAR REGRESSION
MODEL OF LOAN ACCEPTANCE ON THE THREE
PREDICTORS. THIS SHOWS THAT THE
RESIDUALS DO NOT FOLLOW THE NORMAL
DISTRIBUTION THAT THE MODEL ASSUMES.
Appendix B: Evaluating Goodness of Fit
When the purpose of the analysis is profiling (i.e.,
explaining the differences between the classes in terms of
predictor values), we want to go beyond simply assessing
how well the model classifies new data and also assess
how well it fits the data it was trained on. For
example, if we are interested in characterizing loan offer
acceptors versus nonacceptors in terms of income,
education, and so on, we want to find a model that fits the
data best. However, since overfitting is a major danger in
classification, a “too good” fit of the model to the training
data should raise suspicions. In addition, questions
regarding the usefulness of specific predictors can arise
even in the context of classification models. We therefore
mention some of the popular measures that are used to
assess how well the model fits the data. Clearly, we look at
the training set in order to evaluate goodness of fit.
Overall Fit As in multiple linear regression, we first
evaluate the overall fit of the model to the data before
looking at single predictors. We ask: Is this group of
predictors better than a simple naive model for explaining
the different classes?
The deviance D is a statistic that measures overall
goodness of fit. It is similar to the concept of sum of
squared errors (SSE) in the case of least-squares estimation
(used in linear regression). We compare the deviance of
our model, D (called Std Dev Estimate in XLMiner, e.g.,
in Figure 10.11), to the deviance of the naive model, D0. If
the reduction in deviance is statistically significant (as
indicated by a low p-value or, in XLMiner, by a high
multiple R2), we consider our model to provide a good
overall fit. XLMiner’s Multiple-R-Squared measure is
computed as (D0 – D)/D0. Given the model deviance and
the multiple R2, we can compute the null deviance by D0 =
D/(1 – R2).
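As an illustration outside XLMiner, here is a minimal numpy sketch of how D, D0, and the multiple R2 relate (the outcomes and fitted probabilities below are made up, not from the Universal Bank data):

```python
import numpy as np

def deviance(y, p_hat):
    """Deviance D = -2 * log-likelihood of a binary model."""
    return -2 * np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Hypothetical outcomes and fitted probabilities from a logistic model
y = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
p_hat = np.array([0.8, 0.2, 0.1, 0.6, 0.3, 0.2, 0.1, 0.7, 0.2, 0.2])

D = deviance(y, p_hat)

# Naive model: predict the overall acceptance rate for everyone
p0 = y.mean()
D0 = deviance(y, np.full_like(p_hat, p0))

r2 = (D0 - D) / D0                   # XLMiner's Multiple-R-Squared
print(D, D0, r2)
print(np.isclose(D0, D / (1 - r2)))  # null deviance recovered as in the text
```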
FIGURE 10.11 MEASURES OF GOODNESS OF FIT
FOR UNIVERSAL BANK TRAINING DATA WITH
A 12-PREDICTOR MODEL
Finally, the classification matrix and lift chart for the
training data (Figure 10.12) give a sense of how
accurately the model classifies the data. If the model fits
the data well, we expect it to classify these data accurately
into their actual classes.
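For readers outside XLMiner, a minimal sketch of tabulating a classification matrix and the start of a cumulative lift calculation, using hypothetical training labels and scores and a 0.5 cutoff:

```python
import numpy as np

# Hypothetical actual classes and predicted probabilities on the training set
actual = np.array([1, 0, 0, 1, 0, 1, 0, 0])
p_hat  = np.array([0.9, 0.4, 0.2, 0.3, 0.1, 0.8, 0.6, 0.2])
predicted = (p_hat >= 0.5).astype(int)   # classify with a 0.5 cutoff

# 2x2 classification matrix: rows = actual class, columns = predicted class
matrix = np.zeros((2, 2), dtype=int)
for a, pr in zip(actual, predicted):
    matrix[a, pr] += 1
print(matrix)

# Cumulative lift: sort by descending score, count 1's captured so far
order = np.argsort(-p_hat)
cum_actual = np.cumsum(actual[order])
print(cum_actual)                        # compare against a random baseline
```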
Appendix C: Logistic Regression for More Than Two
Classes
The logistic model for a binary response can be extended
for more than two classes. Suppose that there are m
classes. Using a logistic regression model, for each
observation we would have m probabilities of belonging to
each of the m classes. Since the m probabilities must add
up to 1, we need to estimate only m – 1 probabilities.
Ordinal Classes Ordinal classes are classes that have a
meaningful order. For example, in stock recommendations,
the three classes buy, hold, and sell can be treated as
ordered. As a simple rule, if classes can be numbered in a
meaningful way, we consider them ordinal. When the
number of classes is large (typically, more than 5), we can
treat the dependent variable as continuous and perform
multiple linear regression. When m = 2, the logistic model
described above is used. We therefore need an extension
of the logistic regression model for a small number of ordinal
classes (3 ≤ m ≤ 5). There are several ways to extend the
binary class case. Here we describe the proportional odds
or cumulative logit method. For other methods, see
Hosmer and Lemeshow (2000).
FIGURE 10.12 (A) CLASSIFICATION MATRIX AND
(B) LIFT CHART FOR UNIVERSAL BANK
TRAINING DATA WITH 12 PREDICTORS
For simplicity of interpretation and computation, we look
at cumulative probabilities of class membership. For
example, in the stock recommendations we have m = 3
classes. Let us denote them by 1 = buy, 2 = hold, and 3 =
sell. The probabilities that are estimated by the model are
P(Y ≤ 1) (the probability of a buy recommendation) and
P(Y ≤ 2) (the probability of a buy or hold
recommendation). The three noncumulative probabilities
of class membership can easily be recovered from the two
cumulative probabilities:
P(Y = 1) = P(Y ≤ 1)
P(Y = 2) = P(Y ≤ 2) – P(Y ≤ 1)
P(Y = 3) = 1 – P(Y ≤ 2)
Next, we want to model each logit as a function of the
predictors. Corresponding to each of the m – 1 cumulative
probabilities is a logit. In our example we would have
logit(Y ≤ 1) = log[P(Y ≤ 1)/(1 – P(Y ≤ 1))]
logit(Y ≤ 2) = log[P(Y ≤ 2)/(1 – P(Y ≤ 2))]
Each of the logits is then modeled as a linear function of
the predictors (as in the two-class case). If in the stock
recommendations we have a single predictor x, we have
two equations:
logit(Y ≤ 1) = α0 + β1x
logit(Y ≤ 2) = β0 + β1x
This means that both lines have the same slope (β1) but
different intercepts. Once the coefficients α0, β0, β1 are
estimated, we can compute the class membership
probabilities by rewriting the logit equations in terms of
probabilities. For the three-class case, for example, we
would have
P(Y = 1) = P(Y ≤ 1) = 1/[1 + e^–(a0 + b1x)]
P(Y = 2) = P(Y ≤ 2) – P(Y ≤ 1) = 1/[1 + e^–(b0 + b1x)] – 1/[1 + e^–(a0 + b1x)]
P(Y = 3) = 1 – P(Y ≤ 2) = 1 – 1/[1 + e^–(b0 + b1x)]
where a0, b0, and b1 are the estimates obtained from the
training set.
For each observation we now have the estimated
probabilities that it belongs to each of the classes. In our
example, each stock would have three probabilities: for a
buy recommendation, a hold recommendation, and a sell
recommendation. The last step is to classify the
observation into one of the classes. This is done by
assigning it to the class with the highest membership
probability. So if a stock had estimated probabilities P(Y =
1) = 0.2, P(Y = 2) = 0.3, and P(Y = 3) = 0.5, we would
classify it as getting a sell recommendation.
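Putting the pieces together, here is a minimal Python sketch of the whole procedure. The estimates a0, b0, b1 are made up for illustration (the cumulative logit model itself must be fitted in a package that supports it, as noted next):

```python
import numpy as np

# Hypothetical estimates from a fitted cumulative logit model:
# two intercepts, one shared slope
a0, b0, b1 = -1.0, 1.5, 0.03

def class_probs(x):
    """Recover the three class probabilities from the two cumulative ones."""
    p_le_1 = 1 / (1 + np.exp(-(a0 + b1 * x)))   # P(Y <= 1): buy
    p_le_2 = 1 / (1 + np.exp(-(b0 + b1 * x)))   # P(Y <= 2): buy or hold
    return np.array([p_le_1, p_le_2 - p_le_1, 1 - p_le_2])

probs = class_probs(x=10)
labels = ["buy", "hold", "sell"]
# Classify by the highest membership probability
print(probs, "->", labels[int(np.argmax(probs))])
```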
This procedure is currently not implemented in XLMiner.
Other non-Excel-based packages that do have such an
implementation are Minitab and SAS.
Nominal Classes When the classes cannot be ordered and
are simply different from one another, we are in the case of
nominal classes. An example is the choice between several
brands of cereal. A simple test that the classes
are nominal is whether it makes sense to tag them as A, B, C,
..., with the assignment of letters to classes not mattering.
For simplicity, let us assume that there are m = 3 brands of
cereal that consumers can choose from (assuming that each
consumer chooses one). Then we estimate the probabilities
P(Y = A), P(Y = B), and P(Y = C). As before, if we know
two of the probabilities, the third probability is determined.
We therefore use one of the classes as the reference class.
Let us use C as the reference brand.
The goal, once again, is to model the class membership as
a function of predictors. So in the cereals example we
might want to predict which cereal will be chosen if we
know the cereal’s price, x.
Next, we form m – 1 pseudologit equations that are linear
in the predictors. In our example we would have
log[P(Y = A)/P(Y = C)] = α0 + α1x
log[P(Y = B)/P(Y = C)] = β0 + β1x
Once the four coefficients are estimated from the training
set (denote the estimates a0, a1, b0, and b1), we can
estimate the class membership probabilities:
P(Y = A) = e^(a0 + a1x)/[1 + e^(a0 + a1x) + e^(b0 + b1x)]
P(Y = B) = e^(b0 + b1x)/[1 + e^(a0 + a1x) + e^(b0 + b1x)]
P(Y = C) = 1/[1 + e^(a0 + a1x) + e^(b0 + b1x)]
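A matching sketch with hypothetical estimates a0, a1 (for brand A) and b0, b1 (for brand B), with C as the reference class; the price predictor and all coefficient values are made up:

```python
import numpy as np

# Hypothetical pseudologit estimates; brand C is the reference class
a0, a1 = 2.0, -0.5    # log[P(A)/P(C)] = a0 + a1 * price
b0, b1 = 1.0, -0.2    # log[P(B)/P(C)] = b0 + b1 * price

def brand_probs(price):
    """Class membership probabilities for brands A, B, and reference C."""
    ea = np.exp(a0 + a1 * price)
    eb = np.exp(b0 + b1 * price)
    denom = 1 + ea + eb
    return np.array([ea, eb, 1]) / denom

probs = brand_probs(price=3.0)
print(probs, probs.sum())   # three probabilities that sum to 1
```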