classification is the goal and performance is evaluated by
reviewing results with a validation sample. For reference,
some key concepts from a classical statistical perspective
are included below.
FIGURE 10.8 OUTPUT FOR LOGISTIC REGRESSION MODEL WITH ONLY SEVEN PREDICTORS.
Appendix A: Why Linear Regression Is Inappropriate for a
Categorical Response
Now that you have seen how logistic regression works, we
explain why linear regression is not suitable. Technically,
one can apply a multiple linear regression model to this
problem, treating the dependent variable Y as continuous.
Of course, Y must be coded numerically (e.g., 1 for
customers who did accept the loan offer and 0 for
customers who did not accept it). Although the software will
yield output that at first glance may appear standard (e.g.,
Figure 10.9), a closer look reveals several anomalies:
1. Using the model to predict Y for each of the
observations (or classify them) yields predictions that are
not necessarily 0 or 1.
2. A look at the histogram or probability plot of the
residuals reveals that the assumption that the dependent
variable (or residuals) follows a normal distribution is
violated. Clearly, if Y takes only the values 0 and 1, it
cannot be normally distributed. In fact, a more appropriate
distribution for the number of 1’s in the dataset is the
binomial distribution with p = P (Y = 1).
3. The assumption that the variance of Y is constant across
all classes is violated. Each Y is a binary (Bernoulli)
variable, so its variance is p(1 – p). This means that the
variance will be higher for classes where the probability of
acceptance, p, is near 0.5 than where it is near 0 or 1, as
the sketch below illustrates.
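The following minimal Python sketch (not part of the book's XLMiner workflow; the data and the income predictor are simulated, purely for illustration) demonstrates all three anomalies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a binary response whose acceptance probability rises with income
n = 1000
income = rng.uniform(20, 200, n)                     # hypothetical predictor
p = 1 / (1 + np.exp(-(-6 + 0.06 * income)))          # true P(Y = 1)
y = rng.binomial(1, p)                               # 0/1 response

# Fit multiple linear regression by least squares: y ~ b0 + b1 * income
X = np.column_stack([np.ones(n), income])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

# Anomaly 1: predictions are not 0 or 1, and can fall outside [0, 1]
print("prediction range:", y_hat.min(), y_hat.max())

# Anomaly 2: for a given x the residual takes only two values
# (y - y_hat with y in {0, 1}), so residuals cannot be normal
residuals = y - y_hat
skew = ((residuals - residuals.mean()) ** 3).mean() / residuals.std() ** 3
print("residual skewness:", skew)

# Anomaly 3: Var(Y) = p(1 - p) is largest near p = 0.5, not constant
for lo, hi in [(20, 80), (80, 140), (140, 200)]:
    mask = (income >= lo) & (income < hi)
    print(f"income [{lo},{hi}): residual variance = {residuals[mask].var():.3f}")
```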
FIGURE 10.9 OUTPUT FOR MULTIPLE LINEAR
REGRESSION MODEL OF PERSONAL LOAN ON
THREE PREDICTORS.
Below you will find partial output from running a multiple
linear regression of Personal Loan (PL, coded as PL = 1
for customers who accepted the loan offer and PL = 0
otherwise) on three of the predictors.
The estimated model (with b3 denoting the CD Account coefficient reported in Figure 10.9) is
PL = –0.2346 + (0.0032)(Income) + (0.0329)(Family) + b3(CD)
To predict whether a new customer will accept the
personal loan offer (PL = 1) or not (PL = 0), we input that
customer's values for these three predictors. For example,
for a customer with an annual income of $50K, two family
members, and no CD account at Universal Bank, the
predicted loan acceptance is –0.2346 + (0.0032)(50) +
(0.0329)(2) = –0.009. Clearly, this is not a valid “loan
acceptance” value.
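As a quick arithmetic check, the same plug-in prediction can be computed directly (a minimal sketch; only the coefficients visible in the text are used, and CD = 0 makes the unknown b3 term drop out):

```python
# Coefficients read from the linear regression output (Figure 10.9);
# b3, the CD Account coefficient, is not needed here since CD = 0
b0, b_income, b_family = -0.2346, 0.0032, 0.0329

income, family = 50, 2                                # income in $K, family size
pl_hat = b0 + b_income * income + b_family * family   # + b3 * 0
print(round(pl_hat, 3))                               # -0.009: not a valid 0/1 value
```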
Furthermore, the histogram of the residuals (Figure 10.10)
reveals that the residuals are probably not normally
distributed. Therefore, our estimated model is based on
violated assumptions.
FIGURE 10.10 HISTOGRAM OF RESIDUALS
FROM A MULTIPLE LINEAR REGRESSION
MODEL OF LOAN ACCEPTANCE ON THE THREE
PREDICTORS. THIS SHOWS THAT THE
RESIDUALS DO NOT FOLLOW THE NORMAL
DISTRIBUTION THAT THE MODEL ASSUMES.
Appendix B: Evaluating Goodness of Fit
When the purpose of the analysis is profiling (i.e.,
explaining the differences between the classes in terms of
predictor values), we want to go beyond simply assessing
how well the model classifies new data and also assess
how well it fits the data it was trained on. For
example, if we are interested in characterizing loan offer
acceptors versus nonacceptors in terms of income,
education, and so on, we want to find a model that fits the
data best. However, since overfitting is a major danger in
classification, a “too good” fit of the model to the training
data should raise suspicions. In addition, questions
regarding the usefulness of specific predictors can arise
even in the context of classification models. We therefore
mention some of the popular measures that are used to
assess how well the model fits the data. Clearly, we look at
the training set in order to evaluate goodness of fit.
Overall Fit As in multiple linear regression, we first
evaluate the overall fit of the model to the data before
looking at single predictors. We ask: Is this group of
predictors better than a simple naive model for explaining
the different classes?
The deviance D is a statistic that measures overall
goodness of fit. It is similar to the concept of sum of
squared errors (SSE) in the case of least-squares estimation
(used in linear regression). We compare the deviance of
our model, D (called Std Dev Estimate in XLMiner, e.g.,
in Figure 10.11), to the deviance of the naive model, D0. If
the reduction in deviance is statistically significant (as
indicated by a low p-value or, in XLMiner, by a high
multiple R2), we consider our model to provide a good
overall fit. XLMiner’s Multiple-R-Squared measure is
computed as (D0 – D)/D0. Given the model deviance and
the multiple R2, we can compute the null deviance by D0 =
D/(1 – R2).
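As an illustration outside XLMiner, here is a minimal numpy sketch of how D, D0, and the multiple R2 relate (the outcomes and fitted probabilities below are made up, not from the Universal Bank data):

```python
import numpy as np

def deviance(y, p_hat):
    """Deviance D = -2 * log-likelihood of a binary model."""
    return -2 * np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Hypothetical outcomes and fitted probabilities from a logistic model
y = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
p_hat = np.array([0.8, 0.2, 0.1, 0.6, 0.3, 0.2, 0.1, 0.7, 0.2, 0.2])

D = deviance(y, p_hat)

# Naive model: predict the overall acceptance rate for everyone
p0 = y.mean()
D0 = deviance(y, np.full_like(p_hat, p0))

r2 = (D0 - D) / D0                   # XLMiner's Multiple-R-Squared
print(D, D0, r2)
print(np.isclose(D0, D / (1 - r2)))  # null deviance recovered as in the text
```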
FIGURE 10.11 MEASURES OF GOODNESS OF FIT
FOR UNIVERSAL BANK TRAINING DATA WITH
A 12-PREDICTOR MODEL
Finally, the classification matrix and lift chart for the
training data (Figure 10.12) give a sense of how
accurately the model classifies the data. If the model fits
the data well, we expect it to classify these data accurately
into their actual classes.
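For readers outside XLMiner, a minimal sketch of tabulating a classification matrix and the start of a cumulative lift calculation, using hypothetical training labels and scores and a 0.5 cutoff:

```python
import numpy as np

# Hypothetical actual classes and predicted probabilities on the training set
actual = np.array([1, 0, 0, 1, 0, 1, 0, 0])
p_hat  = np.array([0.9, 0.4, 0.2, 0.3, 0.1, 0.8, 0.6, 0.2])
predicted = (p_hat >= 0.5).astype(int)   # classify with a 0.5 cutoff

# 2x2 classification matrix: rows = actual class, columns = predicted class
matrix = np.zeros((2, 2), dtype=int)
for a, pr in zip(actual, predicted):
    matrix[a, pr] += 1
print(matrix)

# Cumulative lift: sort by descending score, count 1's captured so far
order = np.argsort(-p_hat)
cum_actual = np.cumsum(actual[order])
print(cum_actual)                        # compare against a random baseline
```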
Appendix C: Logistic Regression for More Than Two
Classes
The logistic model for a binary response can be extended
for more than two classes. Suppose that there are m
classes. Using a logistic regression model, for each
observation we would have m probabilities of belonging to
each of the m classes. Since the m probabilities must add
up to 1, we need to estimate only m – 1 probabilities.
Ordinal Classes Ordinal classes are classes that have a
meaningful order. For example, in stock recommendations,
the three classes buy, hold, and sell can be treated as
ordered. As a simple rule, if classes can be numbered in a
meaningful way, we consider them ordinal. When the
number of classes is large (typically, more than 5), we can
treat the dependent variable as continuous and perform
multiple linear regression. When m = 2, the logistic model
described above is used. We therefore need an extension
of the logistic regression model for a small number of ordinal
classes (3 ≤ m ≤ 5). There are several ways to extend the
binary class case. Here we describe the proportional odds
or cumulative logit method. For other methods, see
Hosmer and Lemeshow (2000).
FIGURE 10.12 (A) CLASSIFICATION MATRIX AND
(B) LIFT CHART FOR UNIVERSAL BANK
TRAINING DATA WITH 12 PREDICTORS
For simplicity of interpretation and computation, we look
at cumulative probabilities of class membership. For
example, in the stock recommendations we have m = 3
classes. Let us denote them by 1 = buy, 2 = hold, and 3 =
sell. The probabilities that are estimated by the model are
P(Y ≤ 1) (the probability of a buy recommendation) and
P(Y ≤ 2) (the probability of a buy or hold
recommendation). The three noncumulative probabilities
of class membership can easily be recovered from the two
cumulative probabilities:
P(Y = 1) = P(Y ≤ 1)
P(Y = 2) = P(Y ≤ 2) – P(Y ≤ 1)
P(Y = 3) = 1 – P(Y ≤ 2)
Next, we want to model each logit as a function of the
predictors. Corresponding to each of the m – 1 cumulative
probabilities is a logit. In our example we would have
logit(Y ≤ 1) = log[P(Y ≤ 1)/(1 – P(Y ≤ 1))]
logit(Y ≤ 2) = log[P(Y ≤ 2)/(1 – P(Y ≤ 2))]
Each of the logits is then modeled as a linear function of
the predictors (as in the two-class case). If in the stock
recommendations we have a single predictor x, we have
two equations:
logit(Y ≤ 1) = α0 + β1x
logit(Y ≤ 2) = β0 + β1x
This means that both lines have the same slope (β1) but
different intercepts. Once the coefficients α0, β0, β1 are
estimated, we can compute the class membership
probabilities by rewriting the logit equations in terms of
probabilities. For the three-class case, for example, we
would have
P(Y = 1) = P(Y ≤ 1) = 1/[1 + e^–(a0 + b1x)]
P(Y = 2) = P(Y ≤ 2) – P(Y ≤ 1) = 1/[1 + e^–(b0 + b1x)] – 1/[1 + e^–(a0 + b1x)]
P(Y = 3) = 1 – P(Y ≤ 2) = 1 – 1/[1 + e^–(b0 + b1x)]
where a0, b0, and b1 are the estimates obtained from the
training set.
For each observation we now have the estimated
probabilities that it belongs to each of the classes. In our
example, each stock would have three probabilities: for a
buy recommendation, a hold recommendation, and a sell
recommendation. The last step is to classify the
observation into one of the classes. This is done by
assigning it to the class with the highest membership
probability. So if a stock had estimated probabilities P(Y =
1) = 0.2, P(Y = 2) = 0.3, and P(Y = 3) = 0.5, we would
classify it as getting a sell recommendation.
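Putting the pieces together, here is a minimal Python sketch of the whole procedure. The estimates a0, b0, b1 are made up for illustration (the cumulative logit model itself must be fitted in a package that supports it, as noted next):

```python
import numpy as np

# Hypothetical estimates from a fitted cumulative logit model:
# two intercepts, one shared slope
a0, b0, b1 = -1.0, 1.5, 0.03

def class_probs(x):
    """Recover the three class probabilities from the two cumulative ones."""
    p_le_1 = 1 / (1 + np.exp(-(a0 + b1 * x)))   # P(Y <= 1): buy
    p_le_2 = 1 / (1 + np.exp(-(b0 + b1 * x)))   # P(Y <= 2): buy or hold
    return np.array([p_le_1, p_le_2 - p_le_1, 1 - p_le_2])

probs = class_probs(x=10)
labels = ["buy", "hold", "sell"]
# Classify by the highest membership probability
print(probs, "->", labels[int(np.argmax(probs))])
```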
This procedure is currently not implemented in XLMiner.
Other non-Excel-based packages that do have such an
implementation are Minitab and SAS.
Nominal Classes When the classes cannot be ordered and
are simply different from one another, we are in the case of
nominal classes. An example is the choice between several
brands of cereal. A simple test that the classes
are nominal is whether it makes sense to tag them as A, B, C,
..., with the assignment of letters to classes not mattering.
For simplicity, let us assume that there are m = 3 brands of
cereal that consumers can choose from (assuming that each
consumer chooses one). Then we estimate the probabilities
P(Y = A), P(Y = B), and P(Y = C). As before, if we know
two of the probabilities, the third probability is determined.
We therefore use one of the classes as the reference class.
Let us use C as the reference brand.
The goal, once again, is to model the class membership as
a function of predictors. So in the cereals example we
might want to predict which cereal will be chosen if we
know the cereal’s price, x.
Next, we form m – 1 pseudologit equations that are linear
in the predictors. In our example we would have
log[P(Y = A)/P(Y = C)] = α0 + α1x
log[P(Y = B)/P(Y = C)] = β0 + β1x
Once the four coefficients are estimated from the training
set (denote the estimates a0, a1, b0, and b1), we can
estimate the class membership probabilities:
P(Y = A) = e^(a0 + a1x)/[1 + e^(a0 + a1x) + e^(b0 + b1x)]
P(Y = B) = e^(b0 + b1x)/[1 + e^(a0 + a1x) + e^(b0 + b1x)]
P(Y = C) = 1/[1 + e^(a0 + a1x) + e^(b0 + b1x)]
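A matching sketch with hypothetical estimates a0, a1 (for brand A) and b0, b1 (for brand B), with C as the reference class; the price predictor and all coefficient values are made up:

```python
import numpy as np

# Hypothetical pseudologit estimates; brand C is the reference class
a0, a1 = 2.0, -0.5    # log[P(A)/P(C)] = a0 + a1 * price
b0, b1 = 1.0, -0.2    # log[P(B)/P(C)] = b0 + b1 * price

def brand_probs(price):
    """Class membership probabilities for brands A, B, and reference C."""
    ea = np.exp(a0 + a1 * price)
    eb = np.exp(b0 + b1 * price)
    denom = 1 + ea + eb
    return np.array([ea, eb, 1]) / denom

probs = brand_probs(price=3.0)
print(probs, probs.sum())   # three probabilities that sum to 1
```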