public transportation. This case is based on the 2006
forecasting competition, with some modifications.
Problem Description
A public transportation company is expecting an increase
in demand for its services and is planning to acquire new
buses and to extend its terminals. These investments
require a reliable forecast of future demand, which can be
created from data on historic demand. The company's data
warehouse holds, for each 15-minute interval between
6:30 am and 10:00 pm, the number of passengers arriving
at the terminal. As a forecasting consultant, you have been
asked to create a forecasting method that can generate
forecasts for the number of passengers arriving at the
terminal.
Available Data
Part of the historic information is available in the file
bicup2006.xls. The file contains the worksheet “Historic
Information” with known demand for a 3-week period,
separated into 15-minute intervals. The second worksheet
(“Future”) contains dates and times for a future 3-day
period, for which forecasts should be generated (as part of
the 2006 competition).
Assignment Goal
Your goal is to create a model/method that produces
accurate forecasts. To evaluate your accuracy, partition the
given historic data into two periods: a training period (the
first two weeks) and a validation period (the last week).
Models should be fitted only to the training data and
evaluated on the validation data.
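The partition described above can be sketched in Python. The series below is synthetic, and the 62-intervals-per-day count and start date are assumptions; check bicup2006.xls for the actual layout.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the demand series in bicup2006.xls: 21 days of
# 15-minute passenger counts between 6:30 am and 10:00 pm. The number of
# intervals per day (62 here) is an assumption -- verify it in the file.
INTERVALS_PER_DAY = 62
DAYS = 21

stamps = [pd.Timestamp("2005-03-01 06:30")
          + pd.Timedelta(days=d, minutes=15 * i)
          for d in range(DAYS) for i in range(INTERVALS_PER_DAY)]
rng = np.random.default_rng(0)
demand = pd.Series(rng.poisson(50, size=len(stamps)),
                   index=pd.DatetimeIndex(stamps))

# Training period: first two weeks; validation period: last week.
cut = 14 * INTERVALS_PER_DAY
train = demand.iloc[:cut]
valid = demand.iloc[cut:]
```

Models are then fitted to `train` only and scored on `valid`.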
Although the competition's winning criterion was the lowest
Mean Absolute Error (MAE) on the future 3-day data, this
is not the goal of this assignment. Instead, considering a
more realistic business context, our goal is to create a
model that generates reasonably good forecasts at any
time and on any day of the week. Consider not only predictive
metrics such as MAE, MAPE, and RMSE, but also look at actual
and forecasted values overlaid on a time plot.
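The three numeric metrics can be computed directly from the actual and forecasted values; a minimal NumPy sketch:

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs(a - f))

def rmse(actual, forecast):
    """Root mean squared error."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return np.sqrt(np.mean((a - f) ** 2))

def mape(actual, forecast):
    """Mean absolute percentage error, in percent.
    Assumes no zero actual counts (see tip 6 below)."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * np.mean(np.abs((a - f) / a))

# Tiny illustration with made-up counts:
actual = [100.0, 80.0, 120.0]
pred = [90.0, 85.0, 110.0]
print(mae(actual, pred), rmse(actual, pred), mape(actual, pred))
```

Report each metric separately for the training and validation periods; a gap between the two suggests overfitting.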
Assignment
For your final model, present the following summary:
1. Name of the method/combination of methods.
2. A brief description of the method/combination.
3. All estimated equations associated with constructing
forecasts from this method.
4. The MAPE and MAE for the training period and the
validation period.
5. Forecasts for the future period (March 22–24), in
15-minute bins.
6. A single chart showing the fit of the final version of the
model to the entire period (including training, validation,
and future). Note that this model should be fitted using the
combined training + validation data.
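As an illustration of item 6 (refitting on the combined training + validation data before forecasting the future period), here is a deliberately simple stand-in method: a seasonal average per (weekday, time-of-day) bin. This is a sketch of the refit-then-forecast workflow, not the assignment's final model.

```python
import pandas as pd

def seasonal_average_forecast(history: pd.Series,
                              future_index: pd.DatetimeIndex) -> pd.Series:
    """Forecast each future 15-minute bin with the mean of the same
    (weekday, time-of-day) bin in the history."""
    means = history.groupby(
        [history.index.dayofweek, history.index.time]).mean()
    values = [means.loc[(ts.dayofweek, ts.time())] for ts in future_index]
    return pd.Series(values, index=future_index)

# Toy history: two Mondays at 6:30 am. Fit on *all* available data
# (training + validation combined), then forecast the next Monday.
history = pd.Series(
    [10.0, 20.0],
    index=pd.DatetimeIndex(["2005-03-07 06:30", "2005-03-14 06:30"]))
future = pd.DatetimeIndex(["2005-03-21 06:30"])
print(seasonal_average_forecast(history, future))  # forecasts 15.0
```

Whatever method you settle on, the same pattern applies: refit its parameters on the full 3-week history, then generate the 15-minute-bin forecasts for March 22–24.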
Tips and Suggested Steps
1. Use exploratory analysis to identify the components of
this time series. Is there a trend? Is there seasonality? If so,
how many “seasons” are there? Are there any other visible
patterns? Are the patterns global (the same throughout the
series) or local?
2. Consider the frequency of the data from a practical and
technical point of view. What are some options?
3. Compare the weekdays and weekends. How do they
differ? Consider how these differences can be captured by
different methods.
4. Examine the series for missing values or unusual values.
Think of solutions.
5. Based on the patterns that you found in the data, which
models or methods should be considered?
6. Consider how to handle actual counts of zero within the
computation of MAPE.
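For tip 6, one common option (an assumption here, not a prescription of the text) is to exclude zero-count bins from the MAPE and report MAE alongside it:

```python
import numpy as np

def mape_nonzero(actual, forecast):
    """MAPE in percent, computed only over bins with nonzero actual
    counts; the percentage error is undefined where the actual is zero."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    mask = a != 0
    return 100 * np.mean(np.abs((a[mask] - f[mask]) / a[mask]))

# The zero-count bin (first value) is skipped rather than dividing by zero.
print(mape_nonzero([0, 100, 50], [5, 90, 55]))  # 10.0
```

Alternatives include adding a small constant to the denominator or using a symmetric MAPE; whichever you choose, state it in your summary.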
1
The Charles Book Club case was derived, with the
assistance of Ms. Vinni Bhandari, from The Bookbinders
Club, a Case Study in Database Marketing, prepared by
Nissan Levin and Jacob Zahavi, Tel Aviv University; used
with permission.
2
This is available from
ftp.ics.uci.edu/pub/machine-learning-databases/statlog.
3
Resampling Stats, Inc. 2006; used with permission.
4
Cytel, Inc. and Resampling Stats, Inc. 2006; used with
permission.
5
Resampling Stats, Inc. 2006; used with permission.
6
This case was prepared by Professor Mark E. Haskins
and Professor Phillip E. Pfeifer. It was written as a basis
for class discussion rather than to illustrate effective or
ineffective handling of an administrative situation.
Copyright 1988 by the University of Virginia Darden
School Foundation, Charlottesville, VA. All rights
reserved. To order copies, send an e-mail to
sales@dardenpublishing.com. No part of this publication
may be reproduced, stored in a retrieval system, used in a
spreadsheet, or transmitted in any form or by any
means—electronic, mechanical, photocopying, recording,
or otherwise—without the permission of the Darden
School Foundation.
Index
accident data
discriminant analysis
naive Bayes
neural nets
additive seasonality
adjusted-R2
affinity analysis
agglomerative
agglomerative algorithm
aggregation
airfare data
multiple linear regression
algorithm
ALVINN
Amtrak data
time series
visualization
analytical techniques
antecedent
appliance shipments data
time series
visualization
applications
Apriori algorithm
AR models
ARIMA models
artificial intelligence
artificial neural networks
association rules
confidence
confidence intervals
cutoff
data format
item set
lift ratio
random selection
statistical significance
support
assumptions
asymmetric cost
asymmetric response
attribute
Australian wine sales data
time series
autocorrelation
average clustering
average error
average linkage
average squared errors
back propagation
backward elimination
balanced portfolios
bankruptcy data
bar chart
batch updating
bath soap data
benchmark
benchmark confidence value
best pruned tree
best subsets
bias
bias–variance trade-off
binning
binomial distribution
black box
Boston housing data
multiple linear regression