• Examine scatterplots with added color/panels/size to
determine the need for interaction terms.
• Use various aggregation levels and zooming to determine
areas of the data with different behavior, and to evaluate the
level of global versus local patterns.
Classification
• Study relation of outcome to categorical predictors using bar
charts with the outcome on the y axis.
• Study relation of outcome to pairs of numerical predictors
via color-coded scatterplots (color denotes the outcome).
• Study relation of outcome to numerical predictors via
side-by-side boxplots: Plot boxplots of a numerical variable
by outcome. Create similar displays for each numerical
predictor. The most separable boxes indicate potentially
useful predictors.
• Use color to represent the outcome variable on a parallel
coordinate plot.
• Use distribution plots (boxplot, histogram) for determining
needed transformations of the outcome variable.
• Examine scatterplots with added color/panels/size to
determine the need for interaction terms.
• Use various aggregation levels and zooming to determine
areas of the data with different behavior, and to evaluate the
level of global versus local patterns.
Time Series Forecasting
• Create line graphs at different temporal aggregations to
determine types of patterns.
• Use zooming and panning to examine various shorter
periods of the series to determine areas of the data with
different behavior.
• Use various aggregation levels to identify global and local
patterns.
• Identify missing values in the series (that require handling).
• Overlay trend lines of different types to determine adequate
modeling choices.
Unsupervised Learning
• Create scatterplot matrices to identify pairwise relationships
and clustering of observations.
• Use heatmaps to examine the correlation table.
• Use various aggregation levels and zooming to determine
areas of the data with different behavior.
• Generate a parallel coordinate plot to identify clusters of
observations.
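To make the correlation-heatmap bullet concrete, here is a minimal pure-Python sketch of the correlation table such a heatmap would display. The three columns below are invented values used only for illustration; in practice they would come from your dataset.

```python
# Sketch: computing the correlation table that a heatmap would display.
from statistics import mean
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical numerical columns (values invented for illustration)
data = {
    "crime": [0.006, 0.027, 0.032, 0.069, 0.088],
    "rooms": [6.575, 6.421, 7.185, 6.998, 7.147],
    "medv":  [24.0, 21.6, 34.7, 33.4, 36.2],
}

names = list(data)
corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}
for a in names:
    print(a, [round(corr[a][b], 2) for b in names])
```

A heatmap is then just this table with cells shaded by the magnitude and sign of each correlation.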
PROBLEMS
3.1 Shipments of Household Appliances: Line Graphs.
The file ApplianceShipments.xls contains the series of
quarterly shipments (in million $) of U.S. household
appliances between 1985 and 1989 (data courtesy of Ken
Black).
a. Create a well-formatted time plot of the data using
Excel.
b. Does there appear to be a quarterly pattern? For a closer
view of the patterns, zoom in to the range of 3500–5000 on
the y axis.
c. Using Excel, create a separate line for each of Q1, Q2,
Q3, and Q4 and plot them as separate series on a single
line graph. (First order the data by quarter; alphabetical
sorting will work.) Zoom in to the range of 3500–5000 on
the y axis. Does there appear to be a difference between
quarters?
d. Using Excel, create a line graph of the series at a yearly
aggregated level (i.e., the total shipments in each year).
e. Re-create the above plots using an interactive
visualization tool. Make sure to enter the quarter
information in a format that is recognized by the software
as a date.
f. Compare the two processes of generating the line graphs
in terms of the effort as well as the quality of the resulting
plots. What are the advantages of each?
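As a cross-check on part (d), the yearly aggregation can be sketched in a few lines of Python. The shipment values below are invented for illustration; the real numbers come from ApplianceShipments.xls.

```python
# Sketch of part (d): aggregating a quarterly series to yearly totals.
from collections import defaultdict

# (year, quarter, shipments) rows; the values are invented
quarterly = [
    ("1985", "Q1", 4009), ("1985", "Q2", 4321),
    ("1985", "Q3", 4224), ("1985", "Q4", 3944),
    ("1986", "Q1", 4123), ("1986", "Q2", 4522),
    ("1986", "Q3", 4657), ("1986", "Q4", 4030),
]

yearly = defaultdict(int)
for year, quarter, value in quarterly:
    yearly[year] += value        # total shipments in each year

for year in sorted(yearly):
    print(year, yearly[year])
```

Plotting the resulting yearly totals as a line graph gives the aggregated view the problem asks for.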
3.2 Sales of Riding Mowers: Scatterplots. A company
that manufactures riding mowers wants to identify the best
sales prospects for an intensive sales campaign. In
particular, the manufacturer is interested in classifying
households as prospective owners or nonowners on the
basis of Income (in $1000s) and Lot Size (in 1000 ft²). The
marketing expert looked at a random sample of 24
households, included in the file RidingMowers.xls.
a. Using Excel, create a scatterplot of Lot Size vs. Income,
color coded by the outcome variable owner/nonowner.
Make sure to obtain a well-formatted plot (remove
excessive background and gridlines; create legible labels
and a legend, etc.). The result should be similar to Figure
9.2. Hint: First sort the data by the outcome variable, and
then plot the data for each category as separate series.
b. Create the same plot, this time using an interactive
visualization tool.
c. Compare the two processes of generating the plot in
terms of the effort as well as the quality of the resulting
plots. What are the advantages of each?
3.3 Laptop Sales at a London Computer Chain: Bar
Charts and Boxplots. The file
LaptopSalesJanuary2008.xls contains data for all sales of
laptops at a computer chain in London in January 2008.
This is a subset of the full dataset that includes data for the
entire year.
a. Create a bar chart, showing the average retail price by
store. Which store has the highest average? Which has the
lowest?
b. To better compare retail prices across stores, create
side-by-side boxplots of retail price by store. Now
compare the prices in the two stores above. Do you see a
difference between their price distributions? Explain.
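The side-by-side boxplots in part (b) are built from each store's quartiles, which can also be computed directly. The prices below are invented for illustration; the real ones come from the file.

```python
# Sketch: the five-number summary behind each box in a boxplot.
from statistics import quantiles

# Hypothetical retail prices for two stores (values invented)
prices = {
    "Store A": [420, 455, 480, 510, 530, 560, 600, 650],
    "Store B": [380, 410, 430, 450, 470, 490, 505, 520],
}

for store, vals in prices.items():
    q1, q2, q3 = quantiles(vals, n=4)   # quartiles (exclusive method)
    print(store, min(vals), q1, q2, q3, max(vals))
```

Comparing the medians and box widths across stores is exactly what the side-by-side boxplot shows visually.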
3.4 Laptop Sales at a London Computer Chain:
Interactive Visualization. The next exercises are designed
for use with an interactive visualization tool. The file
LaptopSales.txt is a comma-separated file with nearly
300,000 rows. ENBIS (the European Network for Business
and Industrial Statistics) provided these data as part of a
contest organized in the fall of 2009.
Scenario: You are a new analyst for Acell, a company
selling laptops. You have been provided with data about
products and sales. Your task is to help the company to
plan product strategy and pricing policies that will
maximize Acell’s projected revenues in 2009. Using an
interactive visualization tool, answer the following
questions.
a. Price Questions
i. At what prices are the laptops actually selling?
ii. Does price change with time? (Hint: Make sure that the
date column is recognized as such. The software should
then enable different temporal aggregation choices, e.g.,
plotting the data by weekly or monthly aggregates, or even
by day of week.)
iii. Are prices consistent across retail outlets?
iv. How does price change with configuration?
b. Location Questions
i. Where are the stores and customers located?
ii. Which stores are selling the most?
iii. How far would customers travel to buy a laptop?
Hint 1: You should be able to aggregate the data, for
example, plot the sum or average of the prices.
Hint 2: Use the coordinated highlighting between
multiple visualizations in the same page, for example,
select a store in one view to see the matching customers in
another visualization.
Hint 3: Explore the use of filters to see differences. Make
sure to filter in the zoomed-out view. For example, try
using a “store location” slider as an alternative way to
dynamically compare store locations. This is especially
useful for spotting outlier patterns when there are many
store locations to compare.
iv. Try an alternative way of looking at how far customers
traveled. Do this by creating a new data column that
computes the distance between customer and store.
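One way to sketch the distance column in part iv, assuming the data contain latitude/longitude coordinates for both customer and store (the column layout and sample rows below are invented), is the haversine formula for great-circle distance:

```python
# Sketch: deriving a customer-to-store distance column from coordinates.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Hypothetical rows: (customer lat, customer lon, store lat, store lon)
rows = [
    (51.51, -0.13, 51.51, -0.13),   # customer at the store
    (51.46, -0.11, 51.51, -0.13),   # a few km away
]
distances = [haversine_km(*r) for r in rows]
print([round(d, 1) for d in distances])
```

Once added as a column, the distance variable can be binned or plotted directly to answer "how far do customers travel?"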
c. Revenue Questions
i. How does the sales volume in each store relate to
Acell’s revenues?
ii. How does this depend on the configuration?
d. Configuration Questions
i. What are the details of each configuration? How does
this relate to price?
ii. Do all stores sell all configurations?
1. We refer here to a bar chart with vertical bars. The same
principles apply if using a bar chart with horizontal bars,
except that the x axis is now associated with the numerical
variable and the y axis with the categorical variable.
2. See http://blog.bzst.com/2009/08/creating-color-coded-scatterplots-in.html.
Chapter 4
Dimension Reduction
In this chapter we describe the important step of dimension
reduction. The dimension of a dataset, which is the number
of variables, must be reduced for the data mining
algorithms to operate efficiently. We present and discuss
several dimension reduction approaches: (1) Incorporating
domain knowledge to remove or combine categories, (2)
using data summaries to detect information overlap
between variables (and remove or combine redundant
variables or categories), (3) using data conversion
techniques such as converting categorical variables into
numerical variables, and (4) employing automated
reduction techniques, such as principal components
analysis (PCA), where a new set of variables (which are
weighted averages of the original variables) is created.
These new variables are uncorrelated and a small subset of
them usually contains most of their combined information
(hence, we can reduce dimension by using only a subset of
the new variables). Finally, we mention data mining
methods such as regression models and regression and
classification trees, which can be used for removing
redundant variables and for combining “similar” categories
of categorical variables.
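As a minimal sketch of the PCA idea described above (new uncorrelated variables that are weighted combinations of the originals, with the first one capturing most of the combined variance), here is a pure-Python two-variable example on invented data. For two variables, the 2×2 covariance matrix can be diagonalized in closed form, so no libraries are needed.

```python
# Sketch: PCA on two variables via a closed-form 2x2 rotation.
from statistics import mean
from math import atan2, cos, sin

# Toy data (invented for illustration)
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

mx, my = mean(x), mean(y)
n = len(x)
sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
syy = sum((b - my) ** 2 for b in y) / (n - 1)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Rotation angle that diagonalizes the 2x2 covariance matrix
theta = 0.5 * atan2(2 * sxy, sxx - syy)
w1 = (cos(theta), sin(theta))      # first principal direction
w2 = (-sin(theta), cos(theta))     # second, orthogonal to the first

pc1 = [(a - mx) * w1[0] + (b - my) * w1[1] for a, b in zip(x, y)]
pc2 = [(a - mx) * w2[0] + (b - my) * w2[1] for a, b in zip(x, y)]

var1 = sum(v * v for v in pc1) / (n - 1)
var2 = sum(v * v for v in pc2) / (n - 1)
print("share of variance in PC1:", var1 / (var1 + var2))
```

Note that the total variance is preserved (var1 + var2 equals the combined variance of the originals); dimension reduction consists of keeping only the first component when its share is large.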
4.1 Introduction
In data mining one often encounters situations where there
are a large number of variables in the database. In such
situations it is very likely that subsets of variables are
highly correlated with each other. Included in a
classification or prediction model, highly correlated
variables, or variables that are unrelated to the outcome of
interest, can lead to overfitting, and accuracy and
reliability can suffer. Large numbers of variables also pose
computational problems for some models (aside from
questions of correlation). In model deployment,
superfluous variables can increase costs due to the
collection and processing of these variables. The
dimensionality of a model is the number of independent or
input variables used by the model. One of the key steps in
data mining, therefore, is finding ways to reduce
dimensionality without sacrificing accuracy. In the
artificial intelligence literature, dimension reduction is
often referred to as factor selection or feature extraction.
4.2 Practical Considerations
Although data mining prefers automated methods over
domain knowledge, it is important at the first step of data
exploration to make sure that the variables measured are
reasonable for the task at hand. The integration of expert
knowledge through a discussion with the data provider (or
user) will probably lead to better results. Practical
considerations include: Which variables are most
important for the task at hand, and which are most likely to
be useless? Which variables are likely to contain much
error? Which variables will be available for measurement
(and what will it cost to measure them) in the future if the
analysis is repeated? Which variables can actually be
measured before the outcome occurs? (For example, if we
want to predict the closing price of an ongoing online
auction, we cannot use the number of bids as a predictor
because this will not be known until the auction closes.)
Example 1: House Prices in Boston
We return to the Boston housing example introduced in
Chapter 2. For each neighborhood, a number of variables
are given, such as the crime rate, the student/teacher ratio,
and the median value of a housing unit in the
neighborhood. A description of all 14 variables is given in
Table 4.1. The first 10 records of the data are shown in
Figure 4.1. The first row in this figure represents the first
neighborhood, which had an average per capita crime rate
of 0.006, 18% of the residential land zoned for lots over
25,000 ft², 2.31% of the land devoted to nonretail
business, no border on the Charles River, and so on.
TABLE 4.1 DESCRIPTION OF VARIABLES IN THE BOSTON
HOUSING DATASET
CRIM     Crime rate
ZN       Percentage of residential land zoned for lots over 25,000 ft²
INDUS    Percentage of land occupied by nonretail business
CHAS     Charles River dummy variable (= 1 if tract bounds river; = 0 otherwise)
NOX      Nitric oxide concentration (parts per 10 million)
RM       Average number of rooms per dwelling
AGE      Percentage of owner-occupied units built prior to 1940
DIS      Weighted distances to five Boston employment centers
RAD      Index of accessibility to radial highways
TAX      Full-value property tax rate per $10,000
PTRATIO  Pupil/teacher ratio by town
B        1000(Bk − 0.63)², where Bk is the proportion of blacks by town
LSTAT    Percentage of lower-status population
MEDV     Median value of owner-occupied homes in $1000s
FIGURE 4.1 FIRST NINE RECORDS IN THE BOSTON HOUSING
DATASET
4.3 Data Summaries
As we have seen in Chapter 3 on data visualization, an
important initial step of data exploration is getting familiar
with the data and their characteristics through summaries
and graphs. The importance of this step cannot be
overstated. The better you understand the data, the better
the results from the modeling or mining process will be.
Numerical summaries and graphs of the data are very
helpful for data reduction. The information that they
convey can assist in combining categories of a categorical
variable, in choosing variables to remove, in assessing the
level of information overlap between variables, and more.
Before discussing such strategies for reducing the
dimension of a data set, let us consider useful summaries
and tools.
Summary Statistics
Excel has several functions and facilities that assist in
summarizing data. The functions AVERAGE, STDEV, MIN,
MAX, MEDIAN, and COUNT are very helpful for learning
about the characteristics of each variable. First, they give us
information about the scale and type of values that the
variable takes. The MIN and MAX functions can be used to
detect extreme values that might be errors. The average and
the median give a sense of the central values of that
variable, and a large difference between the two indicates
skew. The standard deviation gives a sense of how
dispersed the data are (relative to the mean). Other
functions, such as COUNTBLANK, which gives the number
of empty cells, can tell us about missing values. It is also
possible to use Excel’s Descriptive Statistics facility in the
Data > Data Analysis menu (in Excel 2003: Tools > Data
Analysis), which generates a set of 13 summary statistics
for each of the variables.
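For readers working outside Excel, the same summaries can be mirrored with Python's standard library. The column of values below is invented for illustration; a None entry stands in for an empty cell.

```python
# Sketch: stdlib counterparts of the Excel summary functions above.
from statistics import mean, median, stdev

# Hypothetical column with one blank cell (values invented)
crim = [0.006, 0.027, 0.027, 0.032, 0.069, 0.030, 0.088,
        None, 0.145, 0.171]

values = [v for v in crim if v is not None]   # skip blanks, like Excel
summary = {
    "average": mean(values),
    "stdev": stdev(values),      # sample standard deviation, like STDEV
    "min": min(values),
    "max": max(values),
    "median": median(values),
    "count": len(values),
    "countblank": sum(v is None for v in crim),
}
print(summary)
```

Comparing the average with the median here plays the same role as in Excel: a large difference between them flags skew in the variable.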
FIGURE 4.2 SUMMARY STATISTICS FOR THE BOSTON
HOUSING DATA
Figure 4.2 shows six summary statistics for the Boston
housing example. We see immediately that the different