public transportation. This case is based on the 2006
forecasting competition, with some modifications.
Problem Description
A public transportation company is expecting an increase
in demand for its services and is planning to acquire new
buses and to extend its terminals. These investments
require a reliable forecast of future demand, which can be
created from data on historic demand. The company's data
warehouse holds, for each 15-minute interval between
6:30 am and 10:00 pm, the number of passengers arriving
at the terminal. As a forecasting consultant, you have been
asked to create a forecasting method that can generate
forecasts for the number of passengers arriving at the
terminal.
Available Data
Part of the historic information is available in the file
bicup2006.xls. The file contains the worksheet “Historic
Information” with known demand for a 3-week period,
separated into 15-minute intervals. The second worksheet
(“Future”) contains dates and times for a future 3-day
period, for which forecasts should be generated (as part of
the 2006 competition).
Assignment Goal
Your goal is to create a model/method that produces
accurate forecasts. To evaluate your accuracy, partition the
given historic data into two periods: a training period (the
first two weeks) and a validation period (the last week).
Models should be fitted only to the training data and
evaluated on the validation data.
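The partition described above can be sketched in Python. The series below is synthetic, and the 62-intervals-per-day count and start date are assumptions; check bicup2006.xls for the actual layout.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the demand series in bicup2006.xls: 21 days of
# 15-minute passenger counts between 6:30 am and 10:00 pm. The number of
# intervals per day (62 here) is an assumption -- verify it in the file.
INTERVALS_PER_DAY = 62
DAYS = 21

stamps = [pd.Timestamp("2005-03-01 06:30")
          + pd.Timedelta(days=d, minutes=15 * i)
          for d in range(DAYS) for i in range(INTERVALS_PER_DAY)]
rng = np.random.default_rng(0)
demand = pd.Series(rng.poisson(50, size=len(stamps)),
                   index=pd.DatetimeIndex(stamps))

# Training period: first two weeks; validation period: last week.
cut = 14 * INTERVALS_PER_DAY
train = demand.iloc[:cut]
valid = demand.iloc[cut:]
```

Models are then fitted to `train` only and scored on `valid`.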
Although the competition's winning criterion was the lowest
Mean Absolute Error (MAE) on the future 3-day data, this
is not the goal of this assignment. Instead, considering a
more realistic business context, our goal is to create a
model that generates reasonably good forecasts at any
time and on any day of the week. Consider not only predictive
metrics such as MAE, MAPE, and RMSE, but also look at actual
and forecasted values overlaid on a time plot.
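The three numeric metrics can be computed directly from the actual and forecasted values; a minimal NumPy sketch:

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs(a - f))

def rmse(actual, forecast):
    """Root mean squared error."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return np.sqrt(np.mean((a - f) ** 2))

def mape(actual, forecast):
    """Mean absolute percentage error, in percent.
    Assumes no zero actual counts (see tip 6 below)."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * np.mean(np.abs((a - f) / a))

# Tiny illustration with made-up counts:
actual = [100.0, 80.0, 120.0]
pred = [90.0, 85.0, 110.0]
print(mae(actual, pred), rmse(actual, pred), mape(actual, pred))
```

Report each metric separately for the training and validation periods; a gap between the two suggests overfitting.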
Assignment
For your final model, present the following summary:
1. Name of the method/combination of methods.
2. A brief description of the method/combination.
3. All estimated equations associated with constructing
forecasts from this method.
4. The MAPE and MAE for the training period and the
validation period.
5. Forecasts for the future period (March 22–24), in
15-minute bins.
6. A single chart showing the fit of the final version of the
model to the entire period (including training, validation,
and future). Note that this model should be fitted using the
combined training + validation data.
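As an illustration of item 6 (refitting on the combined training + validation data before forecasting the future period), here is a deliberately simple stand-in method: a seasonal average per (weekday, time-of-day) bin. This is a sketch of the refit-then-forecast workflow, not the assignment's final model.

```python
import pandas as pd

def seasonal_average_forecast(history: pd.Series,
                              future_index: pd.DatetimeIndex) -> pd.Series:
    """Forecast each future 15-minute bin with the mean of the same
    (weekday, time-of-day) bin in the history."""
    means = history.groupby(
        [history.index.dayofweek, history.index.time]).mean()
    values = [means.loc[(ts.dayofweek, ts.time())] for ts in future_index]
    return pd.Series(values, index=future_index)

# Toy history: two Mondays at 6:30 am. Fit on *all* available data
# (training + validation combined), then forecast the next Monday.
history = pd.Series(
    [10.0, 20.0],
    index=pd.DatetimeIndex(["2005-03-07 06:30", "2005-03-14 06:30"]))
future = pd.DatetimeIndex(["2005-03-21 06:30"])
print(seasonal_average_forecast(history, future))  # forecasts 15.0
```

Whatever method you settle on, the same pattern applies: refit its parameters on the full 3-week history, then generate the 15-minute-bin forecasts for March 22–24.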
Tips and Suggested Steps
1. Use exploratory analysis to identify the components of
this time series. Is there a trend? Is there seasonality? If so,
how many “seasons” are there? Are there any other visible
patterns? Are the patterns global (the same throughout the
series) or local?
2. Consider the frequency of the data from a practical and
technical point of view. What are some options?
3. Compare the weekdays and weekends. How do they
differ? Consider how these differences can be captured by
different methods.
4. Examine the series for missing values or unusual values.
Think of solutions.
5. Based on the patterns that you found in the data, which
models or methods should be considered?
6. Consider how to handle actual counts of zero within the
computation of MAPE.
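For tip 6, one common option (an assumption here, not a prescription of the text) is to exclude zero-count bins from the MAPE and report MAE alongside it:

```python
import numpy as np

def mape_nonzero(actual, forecast):
    """MAPE in percent, computed only over bins with nonzero actual
    counts; the percentage error is undefined where the actual is zero."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    mask = a != 0
    return 100 * np.mean(np.abs((a[mask] - f[mask]) / a[mask]))

# The zero-count bin (first value) is skipped rather than dividing by zero.
print(mape_nonzero([0, 100, 50], [5, 90, 55]))  # 10.0
```

Alternatives include adding a small constant to the denominator or using a symmetric MAPE; whichever you choose, state it in your summary.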
1
The Charles Book Club case was derived, with the
assistance of Ms. Vinni Bhandari, from The Bookbinders
Club, a Case Study in Database Marketing, prepared by
Nissan Levin and Jacob Zahavi, Tel Aviv University; used
with permission.
2
This is available from
ftp.ics.uci.edu/pub/machine-learning-databases/statlog.
3
Resampling Stats, Inc. 2006; used with permission.
4
Cytel, Inc. and Resampling Stats, Inc. 2006; used with
permission.
5
Resampling Stats, Inc. 2006; used with permission.
6
This case was prepared by Professor Mark E. Haskins
and Professor Phillip E. Pfeifer. It was written as a basis
for class discussion rather than to illustrate effective or
ineffective handling of an administrative situation.
Copyright 1988 by the University of Virginia Darden
School Foundation, Charlottesville, VA. All rights
reserved. To order copies, send an e-mail to
sales@dardenpublishing.com. No part of this publication
may be reproduced, stored in a retrieval system, used in a
spreadsheet, or transmitted in any form or by any
means—electronic, mechanical, photocopying, recording,
or otherwise—without the permission of the Darden
School Foundation.
Index
accident data
discriminant analysis
naive Bayes
neural nets
additive seasonality
adjusted-R2
affinity analysis
agglomerative
agglomerative algorithm
aggregation
airfare data
multiple linear regression
algorithm
ALVINN
Amtrak data
time series
visualization
analytical techniques
antecedent
appliance shipments data
time series
visualization
applications
Apriori algorithm
AR models
ARIMA models
artificial intelligence
artificial neural networks
association rules
confidence
confidence intervals
cutoff
data format
item set
lift ratio
random selection
statistical significance
support
assumptions
asymmetric cost
asymmetric response
attribute
Australian wine sales data
time series
autocorrelation
average clustering
average error
average linkage
average squared errors
back propagation
backward elimination
balanced portfolios
bankruptcy data
bar chart
batch updating
bath soap data
benchmark
benchmark confidence value
best pruned tree
best subsets
bias
bias–variance trade-off
binning
binomial distribution
black box
Boston housing data
multiple linear regression