• Examine scatterplots with added color/panels/size to
determine the need for interaction terms.
• Use various aggregation levels and zooming to determine
areas of the data with different behavior, and to evaluate the
level of global versus local patterns.
Classification
• Study relation of outcome to categorical predictors using bar
charts with the outcome on the y axis.
• Study relation of outcome to pairs of numerical predictors
via color-coded scatterplots (color denotes the outcome).
• Study relation of outcome to numerical predictors via
side-by-side boxplots: Plot boxplots of a numerical variable
by outcome. Create similar displays for each numerical
predictor. The most separable boxes indicate potentially
useful predictors.
• Use color to represent the outcome variable on a parallel
coordinate plot.
• Use distribution plots (boxplot, histogram) for determining
needed transformations of the outcome variable.
• Examine scatterplots with added color/panels/size to
determine the need for interaction terms.
• Use various aggregation levels and zooming to determine
areas of the data with different behavior, and to evaluate the
level of global versus local patterns.
Time Series Forecasting
• Create line graphs at different temporal aggregations to
determine types of patterns.
• Use zooming and panning to examine various shorter
periods of the series to determine areas of the data with
different behavior.
• Use various aggregation levels to identify global and local
patterns.
• Identify missing values in the series (that require handling).
• Overlay trend lines of different types to determine adequate
modeling choices.
Unsupervised Learning
• Create scatterplot matrices to identify pairwise relationships
and clustering of observations.
• Use heatmaps to examine the correlation table.
• Use various aggregation levels and zooming to determine
areas of the data with different behavior.
• Generate a parallel coordinate plot to identify clusters of
observations.
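To make the correlation-heatmap bullet concrete, here is a minimal pure-Python sketch of the correlation table such a heatmap would display. The three columns below are invented values used only for illustration; in practice they would come from your dataset.

```python
# Sketch: computing the correlation table that a heatmap would display.
from statistics import mean
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical numerical columns (values invented for illustration)
data = {
    "crime": [0.006, 0.027, 0.032, 0.069, 0.088],
    "rooms": [6.575, 6.421, 7.185, 6.998, 7.147],
    "medv":  [24.0, 21.6, 34.7, 33.4, 36.2],
}

names = list(data)
corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}
for a in names:
    print(a, [round(corr[a][b], 2) for b in names])
```

A heatmap is then just this table with cells shaded by the magnitude and sign of each correlation.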
PROBLEMS
3.1 Shipments of Household Appliances: Line Graphs.
The file ApplianceShipments.xls contains the series of
quarterly shipments (in million $) of U.S. household
appliances between 1985 and 1989 (data courtesy of Ken
Black).
a. Create a well-formatted time plot of the data using
Excel.
b. Does there appear to be a quarterly pattern? For a closer
view of the patterns, zoom in to the range of 3500–5000 on
the y axis.
c. Using Excel, create a separate line for each of Q1, Q2,
Q3, and Q4 and plot them as separate series on a single
line graph. (First order the data by quarter; alphabetical
sorting will work.) Zoom in to the range of 3500–5000 on
the y axis. Does there appear to be a difference between
quarters?
d. Using Excel, create a line graph of the series at a yearly
aggregated level (i.e., the total shipments in each year).
e. Re-create the above plots using an interactive
visualization tool. Make sure to enter the quarter
information in a format that is recognized by the software
as a date.
f. Compare the two processes of generating the line graphs
in terms of the effort as well as the quality of the resulting
plots. What are the advantages of each?
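As a cross-check on part (d), the yearly aggregation can be sketched in a few lines of Python. The shipment values below are invented for illustration; the real numbers come from ApplianceShipments.xls.

```python
# Sketch of part (d): aggregating a quarterly series to yearly totals.
from collections import defaultdict

# (year, quarter, shipments) rows; the values are invented
quarterly = [
    ("1985", "Q1", 4009), ("1985", "Q2", 4321),
    ("1985", "Q3", 4224), ("1985", "Q4", 3944),
    ("1986", "Q1", 4123), ("1986", "Q2", 4522),
    ("1986", "Q3", 4657), ("1986", "Q4", 4030),
]

yearly = defaultdict(int)
for year, quarter, value in quarterly:
    yearly[year] += value        # total shipments in each year

for year in sorted(yearly):
    print(year, yearly[year])
```

Plotting the resulting yearly totals as a line graph gives the aggregated view the problem asks for.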
3.2 Sales of Riding Mowers: Scatterplots. A company
that manufactures riding mowers wants to identify the best
sales prospects for an intensive sales campaign. In
particular, the manufacturer is interested in classifying
households as prospective owners or nonowners on the
basis of Income (in $1000s) and Lot Size (in 1000 ft²). The
marketing expert looked at a random sample of 24
households, included in the file RidingMowers.xls.
a. Using Excel, create a scatterplot of Lot Size vs. Income,
color coded by the outcome variable owner/nonowner.
Make sure to obtain a well-formatted plot (remove
excessive background and gridlines; create legible labels
and a legend, etc.). The result should be similar to Figure
9.2. Hint: First sort the data by the outcome variable, and
then plot the data for each category as separate series.
b. Create the same plot, this time using an interactive
visualization tool.
c. Compare the two processes of generating the plot in
terms of the effort as well as the quality of the resulting
plots. What are the advantages of each?
3.3 Laptop Sales at a London Computer Chain: Bar
Charts and Boxplots. The file
LaptopSalesJanuary2008.xls contains data for all sales of
laptops at a computer chain in London in January 2008.
This is a subset of the full dataset that includes data for the
entire year.
a. Create a bar chart, showing the average retail price by
store. Which store has the highest average? Which has the
lowest?
b. To better compare retail prices across stores, create
side-by-side boxplots of retail price by store. Now
compare the prices in the two stores above. Do you see a
difference between their price distributions? Explain.
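The side-by-side boxplots in part (b) are built from each store's quartiles, which can also be computed directly. The prices below are invented for illustration; the real ones come from the file.

```python
# Sketch: the five-number summary behind each box in a boxplot.
from statistics import quantiles

# Hypothetical retail prices for two stores (values invented)
prices = {
    "Store A": [420, 455, 480, 510, 530, 560, 600, 650],
    "Store B": [380, 410, 430, 450, 470, 490, 505, 520],
}

for store, vals in prices.items():
    q1, q2, q3 = quantiles(vals, n=4)   # quartiles (exclusive method)
    print(store, min(vals), q1, q2, q3, max(vals))
```

Comparing the medians and box widths across stores is exactly what the side-by-side boxplot shows visually.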
3.4 Laptop Sales at a London Computer Chain:
Interactive Visualization. The next exercises are designed
for use with an interactive visualization tool. The file
LaptopSales.txt is a comma-separated file with nearly
300,000 rows. ENBIS (the European Network for Business
and Industrial Statistics) provided these data as part of a
contest organized in the fall of 2009.
Scenario: You are a new analyst for Acell, a company
selling laptops. You have been provided with data about
products and sales. Your task is to help the company to
plan product strategy and pricing policies that will
maximize Acell’s projected revenues in 2009. Using an
interactive visualization tool, answer the following
questions.
a. Price Questions
i. At what prices are the laptops actually selling?
ii. Does price change with time? (Hint: Make sure that the
date column is recognized as such. The software should
then enable different temporal aggregation choices, e.g.,
plotting the data by weekly or monthly aggregates, or even
by day of week.)
iii. Are prices consistent across retail outlets?
iv. How does price change with configuration?
b. Location Questions
i. Where are the stores and customers located?
ii. Which stores are selling the most?
iii. How far would customers travel to buy a laptop?
Hint 1: You should be able to aggregate the data, for
example, plot the sum or average of the prices.
Hint 2: Use the coordinated highlighting between
multiple visualizations in the same page, for example,
select a store in one view to see the matching customers in
another visualization.
Hint 3: Explore the use of filters to see differences. Make
sure to filter in the zoomed-out view. For example, try
using a “store location” slider as an alternative way to
dynamically compare store locations. This is especially
useful for spotting outlier patterns when there are many
store locations to compare.
iv. Try an alternative way of looking at how far customers
traveled. Do this by creating a new data column that
computes the distance between customer and store.
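One way to sketch the distance column in part iv, assuming the data contain latitude/longitude coordinates for both customer and store (the column layout and sample rows below are invented), is the haversine formula for great-circle distance:

```python
# Sketch: deriving a customer-to-store distance column from coordinates.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Hypothetical rows: (customer lat, customer lon, store lat, store lon)
rows = [
    (51.51, -0.13, 51.51, -0.13),   # customer at the store
    (51.46, -0.11, 51.51, -0.13),   # a few km away
]
distances = [haversine_km(*r) for r in rows]
print([round(d, 1) for d in distances])
```

Once added as a column, the distance variable can be binned or plotted directly to answer "how far do customers travel?"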
c. Revenue Questions
i. How does the sales volume in each store relate to
Acell’s revenues?
ii. How does this depend on the configuration?
d. Configuration Questions
i. What are the details of each configuration? How does
this relate to price?
ii. Do all stores sell all configurations?
1. We refer here to a bar chart with vertical bars. The same
principles apply if using a bar chart with horizontal bars,
except that the x axis is now associated with the numerical
variable and the y axis with the categorical variable.
2. See http://blog.bzst.com/2009/08/creating-color-coded-scatterplots-in.html.
Chapter 4
Dimension Reduction
In this chapter we describe the important step of dimension
reduction. The dimension of a dataset, which is the number
of variables, must be reduced for the data mining
algorithms to operate efficiently. We present and discuss
several dimension reduction approaches: (1) Incorporating
domain knowledge to remove or combine categories, (2)
using data summaries to detect information overlap
between variables (and remove or combine redundant
variables or categories), (3) using data conversion
techniques such as converting categorical variables into
numerical variables, and (4) employing automated
reduction techniques, such as principal components
analysis (PCA), where a new set of variables (which are
weighted averages of the original variables) is created.
These new variables are uncorrelated and a small subset of
them usually contains most of their combined information
(hence, we can reduce dimension by using only a subset of
the new variables). Finally, we mention data mining
methods such as regression models and regression and
classification trees, which can be used for removing
redundant variables and for combining “similar” categories
of categorical variables.
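As a minimal sketch of the PCA idea described above (new uncorrelated variables that are weighted combinations of the originals, with the first one capturing most of the combined variance), here is a pure-Python two-variable example on invented data. For two variables, the 2×2 covariance matrix can be diagonalized in closed form, so no libraries are needed.

```python
# Sketch: PCA on two variables via a closed-form 2x2 rotation.
from statistics import mean
from math import atan2, cos, sin

# Toy data (invented for illustration)
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

mx, my = mean(x), mean(y)
n = len(x)
sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
syy = sum((b - my) ** 2 for b in y) / (n - 1)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Rotation angle that diagonalizes the 2x2 covariance matrix
theta = 0.5 * atan2(2 * sxy, sxx - syy)
w1 = (cos(theta), sin(theta))      # first principal direction
w2 = (-sin(theta), cos(theta))     # second, orthogonal to the first

pc1 = [(a - mx) * w1[0] + (b - my) * w1[1] for a, b in zip(x, y)]
pc2 = [(a - mx) * w2[0] + (b - my) * w2[1] for a, b in zip(x, y)]

var1 = sum(v * v for v in pc1) / (n - 1)
var2 = sum(v * v for v in pc2) / (n - 1)
print("share of variance in PC1:", var1 / (var1 + var2))
```

Note that the total variance is preserved (var1 + var2 equals the combined variance of the originals); dimension reduction consists of keeping only the first component when its share is large.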
4.1 Introduction
In data mining one often encounters situations where there
are a large number of variables in the database. In such
situations it is very likely that subsets of variables are
highly correlated with each other. Included in a
classification or prediction model, highly correlated
variables, or variables that are unrelated to the outcome of
interest, can lead to overfitting, and accuracy and
reliability can suffer. Large numbers of variables also pose
computational problems for some models (aside from
questions of correlation). In model deployment,
superfluous variables can increase costs due to the
collection and processing of these variables. The
dimensionality of a model is the number of independent or
input variables used by the model. One of the key steps in
data mining, therefore, is finding ways to reduce
dimensionality without sacrificing accuracy. In the
artificial intelligence literature, dimension reduction is
often referred to as factor selection or feature extraction.
4.2 Practical Considerations
Although data mining prefers automated methods over
domain knowledge, it is important at the first step of data
exploration to make sure that the variables measured are
reasonable for the task at hand. The integration of expert
knowledge through a discussion with the data provider (or
user) will probably lead to better results. Practical
considerations include: Which variables are most
important for the task at hand, and which are most likely to
be useless? Which variables are likely to contain much
error? Which variables will be available for measurement
(and what will it cost to measure them) in the future if the
analysis is repeated? Which variables can actually be
measured before the outcome occurs? (For example, if we
want to predict the closing price of an ongoing online
auction, we cannot use the number of bids as a predictor
because this will not be known until the auction closes.)
Example 1: House Prices in Boston
We return to the Boston housing example introduced in
Chapter 2. For each neighborhood, a number of variables
are given, such as the crime rate, the student/teacher ratio,
and the median value of a housing unit in the
neighborhood. A description of all 14 variables is given in
Table 4.1. The first 10 records of the data are shown in
Figure 4.1. The first row in this figure represents the first
neighborhood, which had an average per capita crime rate
of 0.006, 18% of the residential land zoned for lots over
25,000 ft², 2.31% of the land devoted to nonretail
business, no border on the Charles River, and so on.
TABLE 4.1 DESCRIPTION OF VARIABLES IN THE BOSTON
HOUSING DATASET
CRIM     Crime rate
ZN       Percentage of residential land zoned for lots over 25,000 ft²
INDUS    Percentage of land occupied by nonretail business
CHAS     Charles River dummy variable (= 1 if tract bounds river; = 0 otherwise)
NOX      Nitric oxide concentration (parts per 10 million)
RM       Average number of rooms per dwelling
AGE      Percentage of owner-occupied units built prior to 1940
DIS      Weighted distances to five Boston employment centers
RAD      Index of accessibility to radial highways
TAX      Full-value property tax rate per $10,000
PTRATIO  Pupil/teacher ratio by town
B        1000(Bk − 0.63)², where Bk is the proportion of blacks by town
LSTAT    Percentage of lower-status population
MEDV     Median value of owner-occupied homes in $1000s
FIGURE 4.1 FIRST NINE RECORDS IN THE BOSTON HOUSING
DATASET
4.3 Data Summaries
As we have seen in Chapter 3 on data visualization, an
important initial step of data exploration is getting familiar
with the data and their characteristics through summaries
and graphs. The importance of this step cannot be
overstated. The better you understand the data, the better
the results from the modeling or mining process will be.
Numerical summaries and graphs of the data are very
helpful for data reduction. The information that they
convey can assist in combining categories of a categorical
variable, in choosing variables to remove, in assessing the
level of information overlap between variables, and more.
Before discussing such strategies for reducing the
dimension of a data set, let us consider useful summaries
and tools.
Summary Statistics
Excel has several functions and facilities that assist in
summarizing data. The functions AVERAGE, STDEV, MIN,
MAX, MEDIAN, and COUNT are very helpful for learning
about the characteristics of each variable. First, they give us
information about the scale and type of values that the
variable takes. The MIN and MAX functions can be used to
detect extreme values that might be errors. The average and
the median give a sense of the central values of that
variable, and a large difference between the two indicates
skew. The standard deviation gives a sense of how
dispersed the data are (relative to the mean). Other
functions, such as COUNTBLANK, which gives the number
of empty cells, can tell us about missing values. It is also
possible to use Excel’s Descriptive Statistics facility in the
Data > Data Analysis menu (in Excel 2003: Tools > Data
Analysis), which generates a set of 13 summary statistics
for each of the variables.
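For readers working outside Excel, the same summaries can be mirrored with Python's standard library. The column of values below is invented for illustration; a None entry stands in for an empty cell.

```python
# Sketch: stdlib counterparts of the Excel summary functions above.
from statistics import mean, median, stdev

# Hypothetical column with one blank cell (values invented)
crim = [0.006, 0.027, 0.027, 0.032, 0.069, 0.030, 0.088,
        None, 0.145, 0.171]

values = [v for v in crim if v is not None]   # skip blanks, like Excel
summary = {
    "average": mean(values),
    "stdev": stdev(values),      # sample standard deviation, like STDEV
    "min": min(values),
    "max": max(values),
    "median": median(values),
    "count": len(values),
    "countblank": sum(v is None for v in crim),
}
print(summary)
```

Comparing the average with the median here plays the same role as in Excel: a large difference between them flags skew in the variable.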
FIGURE 4.2 SUMMARY STATISTICS FOR THE BOSTON
HOUSING DATA
Figure 4.2 shows six summary statistics for the Boston
housing example. We see immediately that the different