final exam
The final exam will consist of these five questions. This is a comprehensive review of all chapters in the textbook. Please address the questions and then submit. Be sure to use proper APA citations for any content that is not your own work.

1. Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques such as clustering, classification, association rule mining, and anomaly detection can be applied.
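For orientation only, here is a minimal Python sketch (not part of the required answer) of how each of the four techniques might be applied to search-engine data. It assumes scikit-learn is available, and every query, label, session, and threshold in it is hypothetical.

```python
# Illustrative sketch only: all data, feature names, and thresholds are made up.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import IsolationForest

queries = ["cheap flights to paris", "paris hotels", "python list sort",
           "sort a list in python", "buy cheap flights", "python dict loop"]

# Clustering: group similar queries so related searches can share results and ads.
X = TfidfVectorizer().fit_transform(queries)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Classification: predict whether a query is navigational (1) or informational (0),
# given hypothetical labels for a small training sample.
labels = np.array([0, 1, 0, 0, 0, 0])
clf = LogisticRegression().fit(X, labels)

# Association rule mining: estimate the support of terms that co-occur in sessions
# (a toy count; a real system would run Apriori or FP-Growth over huge logs).
sessions = [{"flights", "hotels"}, {"flights", "hotels", "paris"}, {"python", "list"}]
support = sum({"flights", "hotels"} <= s for s in sessions) / len(sessions)

# Anomaly detection: flag unusual activity (e.g., bots or click fraud) from
# per-user features [queries_per_minute, clicks_per_query].
activity = np.array([[2, 1.0], [3, 0.8], [2, 1.2], [250, 0.01]])
flags = IsolationForest(random_state=0).fit_predict(activity)  # -1 marks anomalies

print(clusters, clf.predict(X[:2]), support, flags)
```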

2. Identify at least two advantages and two disadvantages of using color to visually represent information.

3. Consider a group of documents that has been selected from a much larger set of diverse documents so that the selected documents are as dissimilar from one another as possible. If we consider documents that are not highly related (connected, similar) to one another as being anomalous, then all of the documents that we have selected might be classified as anomalies. Is it possible for a data set to consist only of anomalous objects, or is this an abuse of the terminology?

4. For sparse data, discuss why considering only the presence of non-zero values might give a more accurate view of the objects than considering the actual magnitudes of values. When would such an approach not be desirable?
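To see the idea in question 4 concretely, the short NumPy sketch below compares cosine similarity computed from raw term counts with cosine similarity computed from 0/1 presence values; the two toy document vectors are hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical term-count vectors: both documents contain exactly the same
# three terms, but with very different frequencies.
d1 = np.array([50.0, 0, 0, 1, 0, 1])
d2 = np.array([1.0, 0, 0, 20, 0, 1])

print(cosine(d1, d2))                                          # about 0.07
print(cosine((d1 > 0).astype(float), (d2 > 0).astype(float)))  # exactly 1.0
# Presence-only similarity says "same vocabulary, probably same topic", while the
# magnitude-based similarity is dominated by the counts. When magnitudes carry
# real meaning (e.g., dollar amounts rather than word counts), discarding them
# would not be desirable.
```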

5. Assume that all documents have been normalized to have unit length of 1. What is the shape of a cluster that consists of all documents whose cosine similarity to a centroid is greater than some specified constant? In other words, cos(d, c) ≥ δ, where 0 < δ ≤ 1.

PANG-NING TAN, Michigan State University
MICHAEL STEINBACH, University of Minnesota
VIPIN KUMAR, University of Minnesota and Army High Performance Computing Research Center

Contents

Preface vii

1 Introduction 1
1.1 What Is Data Mining? . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Motivating Challenges . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 The Origins of Data Mining . . . . . . . . . . . . . . . . . . . . 6
1.4 Data Mining Tasks . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Scope and Organization of the Book . . . . . . . . . . . . . . . 11
1.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Data 19
2.1 Types of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.1 Attributes and Measurement . . . . . . . . . . . . . . . 23
2.1.2 Types of Data Sets . . . . . . . . . . . . . . . . . . . . . 29
2.2 Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2.1 Measurement and Data Collection Issues . . . . . . . . . 37
2.2.2 Issues Related to Applications . . . . . . . . . . . . . . 43
2.3 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3.1 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.3.2 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.3.3 Dimensionality Reduction . . . . . . . . . . . . . . . . . 50
2.3.4 Feature Subset Selection . . . . . . . . . . . . . . . . . . 52
2.3.5 Feature Creation . . . . . . . . . . . . . . . . . . . . . . 55
2.3.6 Discretization and Binarization . . . . . . . . . . . . . . 57
2.3.7 Variable Transformation . . . . . . . . . . . . . . . . . . 63

2.4 Measures of Similarity and Dissimilarity . . . . . . . . . . . . . 65
2.4.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.4.2 Similarity and Dissimilarity between Simple Attributes . 67
2.4.3 Dissimilarities between Data Objects . . . . . . . . . . . 69
2.4.4 Similarities between Data Objects . . . . . . . . . . . . 72

2.4.5 Examples of Proximity Measures . . . . . . . . . . . . . 73
2.4.6 Issues in Proximity Calculation . . . . . . . . . . . . . . 80
2.4.7 Selecting the Right Proximity Measure . . . . . . . . . . 83

2.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

3 Exploring Data 97
3.1 The Iris Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.2 Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 98

3.2.1 Frequencies and the Mode . . . . . . . . . . . . . . . . . 99
3.2.2 Percentiles . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.2.3 Measures of Location: Mean and Median . . . . . . . . 101
3.2.4 Measures of Spread: Range and Variance . . . . . . . . 102
3.2.5 Multivariate Summary Statistics . . . . . . . . . . . . . 104
3.2.6 Other Ways to Summarize the Data . . . . . . . . . . . 105

3.3 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.3.1 Motivations for Visualization . . . . . . . . . . . . . . . 105
3.3.2 General Concepts . . . . . . . . . . . . . . . . . . . . . . 106
3.3.3 Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.3.4 Visualizing Higher-Dimensional Data . . . . . . . . . . . 124
3.3.5 Do's and Don'ts . . . . . . . . . . . . . . . . . . . . . . 130

3.4 OLAP and Multidimensional Data Analysis . . . . . . . . . . . 131
3.4.1 Representing Iris Data as a Multidimensional Array . . 131
3.4.2 Multidimensional Data: The General Case . . . . . . . . 133
3.4.3 Analyzing Multidimensional Data . . . . . . . . . . . . 135
3.4.4 Final Comments on Multidimensional Data Analysis . . 139

3.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

4 Classification:
Basic Concepts, Decision Trees, and Model Evaluation 145
4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.2 General Approach to Solving a Classification Problem . . . . . 148
4.3 Decision Tree Induction . . . . . . . . . . . . . . . . . . . . . . 150

4.3.1 How a Decision Tree Works . . . . . . . . . . . . . . . . 150
4.3.2 How to Build a Decision Tree . . . . . . . . . . . . . . . 151
4.3.3 Methods for Expressing Attribute Test Conditions . . . 155
4.3.4 Measures for Selecting the Best Split . . . . . . . . . . . 158
4.3.5 Algorithm for Decision Tree Induction . . . . . . . . . . 164
4.3.6 An Example: Web Robot Detection . . . . . . . . . . . 166

4.3.7 Characteristics of Decision Tree Induction . . . . . . . . 168
4.4 Model Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . 172

4.4.1 Overfitting Due to Presence of Noise . . . . . . . . . . . 175
4.4.2 Overfitting Due to Lack of Representative Samples . . . 177
4.4.3 Overfitting and the Multiple Comparison Procedure . . 178
4.4.4 Estimation of Generalization Errors . . . . . . . . . . . 179
4.4.5 Handling Overfitting in Decision Tree Induction . . . . 184

4.5 Evaluating the Performance of a Classifier . . . . . . . . . . . . 186
4.5.1 Holdout Method . . . . . . . . . . . . . . . . . . . . . . 186
4.5.2 Random Subsampling . . . . . . . . . . . . . . . . . . . 187
4.5.3 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . 187
4.5.4 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . 188

4.6 Methods for Comparing Classifiers . . . . . . . . . . . . . . . . 188
4.6.1 Estimating a Confidence Interval for Accuracy . . . . . 189
4.6.2 Comparing the Performance of Two Models . . . . . . . 191
4.6.3 Comparing the Performance of Two Classifiers . . . . . 192

4.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 193
4.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

5 Classification: Alternative Techniques 207
5.1 Rule-Based Classifier . . . . . . . . . . . . . . . . . . . . . . . . 207

5.1.1 How a Rule-Based Classifier Works . . . . . . . . . . . . 209
5.1.2 Rule-Ordering Schemes . . . . . . . . . . . . . . . . . . 211
5.1.3 How to Build a Rule-Based Classifier . . . . . . . . . . . 212
5.1.4 Direct Methods for Rule Extraction . . . . . . . . . . . 213
5.1.5 Indirect Methods for Rule Extraction . . . . . . . . . . 221
5.1.6 Characteristics of Rule-Based Classifiers . . . . . . . . . 223

5.2 Nearest-Neighbor Classifiers . . . . . . . . . . . . . . . . . . . 223
5.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 225
5.2.2 Characteristics of Nearest-Neighbor Classifiers . . . . . 226

5.3 Bayesian Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.3.1 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . 228
5.3.2 Using the Bayes Theorem for Classification . . . . . . . 229
5.3.3 Naïve Bayes Classifier . . . . . . . . . . . . . . . . . . . 231
5.3.4 Bayes Error Rate . . . . . . . . . . . . . . . . . . . . . . 238
5.3.5 Bayesian Belief Networks . . . . . . . . . . . . . . . . . 240

5.4 Artificial Neural Network (ANN) . . . . . . . . . . . . . . . . . 246
5.4.1 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . 247
5.4.2 Multilayer Artificial Neural Network . . . . . . . . . . . 251
5.4.3 Characteristics of ANN . . . . . . . . . . . . . . . . . . 255

5.5 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . 256
5.5.1 Maximum Margin Hyperplanes . . . . . . . . . . . . . . 256
5.5.2 Linear SVM: Separable Case . . . . . . . . . . . . . . . 259
5.5.3 Linear SVM: Nonseparable Case . . . . . . . . . . . . . 266
5.5.4 Nonlinear SVM . . . . . . . . . . . . . . . . . . . . . . . 270
5.5.5 Characteristics of SVM . . . . . . . . . . . . . . . . . . 276

5.6 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . 276
5.6.1 Rationale for Ensemble Method . . . . . . . . . . . . . . 277
5.6.2 Methods for Constructing an Ensemble Classifier . . . . 278
5.6.3 Bias-Variance Decomposition . . . . . . . . . . . . . . . 281
5.6.4 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
5.6.5 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
5.6.6 Random Forests . . . . . . . . . . . . . . . . . . . . . . 290
5.6.7 Empirical Comparison among Ensemble Methods . . . . 294

5.7 Class Imbalance Problem . . . . . . . . . . . . . . . . . . . . . 294
5.7.1 Alternative Metrics . . . . . . . . . . . . . . . . . . . . . 295
5.7.2 The Receiver Operating Characteristic Curve . . . . . . 298
5.7.3 Cost-Sensitive Learning . . . . . . . . . . . . . . . . . . 302
5.7.4 Sampling-Based Approaches . . . . . . . . . . . . . . . . 305

5.8 Multiclass Problem . . . . . . . . . . . . . . . . . . . . . . . . . 306
5.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 309
5.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315

6 Association Analysis: Basic Concepts and Algorithms 327
6.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . 328
6.2 Frequent Itemset Generation . . . . . . . . . . . . . . . . . . . 332

6.2.1 The Apriori Principle . . . . . . . . . . . . . . . . . . . 333
6.2.2 Frequent Itemset Generation in the Apriori Algorithm . 335
6.2.3 Candidate Generation and Pruning . . . . . . . . . . . . 338
6.2.4 Support Counting . . . . . . . . . . . . . . . . . . . . . 342
6.2.5 Computational Complexity . . . . . . . . . . . . . . . . 345

6.3 Rule Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 349
6.3.1 Confidence-Based Pruning . . . . . . . . . . . . . . . . . 350
6.3.2 Rule Generation in Apriori Algorithm . . . . . . . . . . 350
6.3.3 An Example: Congressional Voting Records . . . . . . . 352

6.4 Compact Representation of Frequent Itemsets . . . . . . . . . . 353
6.4.1 Maximal Frequent Itemsets . . . . . . . . . . . . . . . . 354
6.4.2 Closed Frequent Itemsets . . . . . . . . . . . . . . . . . 355

6.5 Alternative Methods for Generating Frequent Itemsets . . . . . 359
6.6 FP-Growth Algorithm . . . . . . . . . . . . . . . . . . . . . . . 363

6.6.1 FP-Tree Representation . . . . . . . . . . . . . . . . . . 363
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm . 366

6.7 Evaluation of Association Patterns . . . . . . . . . . . . . . . . 370
6.7.1 Objective Measures of Interestingness . . . . . . . . . . 371
6.7.2 Measures beyond Pairs of Binary Variables . . . . . . . 382
6.7.3 Simpson's Paradox . . . . . . . . . . . . . . . . . . . . . 384

6.8 Effect of Skewed Support Distribution . . . . . . . . . . . . . . 386
6.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 390
6.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404

7 Association Analysis: Advanced Concepts 415
7.1 Handling Categorical Attributes . . . . . . . . . . . . . . . . . 415
7.2 Handling Continuous Attributes . . . . . . . . . . . . . . . . . 418

7.2.1 Discretization-Based Methods . . . . . . . . . . . . . . . 418
7.2.2 Statistics-Based Methods . . . . . . . . . . . . . . . . . 422
7.2.3 Non-discretization Methods . . . . . . . . . . . . . . . . 424

7.3 Handling a Concept Hierarchy . . . . . . . . . . . . . . . . . . 426
7.4 Sequential Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 429

7.4.1 Problem Formulation . . . . . . . . . . . . . . . . . . . 429
7.4.2 Sequential Pattern Discovery . . . . . . . . . . . . . . . 431
7.4.3 Timing Constraints . . . . . . . . . . . . . . . . . . . . . 436
7.4.4 Alternative Counting Schemes . . . . . . . . . . . . . . 439

7.5 Subgraph Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 442
7.5.1 Graphs and Subgraphs . . . . . . . . . . . . . . . . . . . 443
7.5.2 Frequent Subgraph Mining . . . . . . . . . . . . . . . . 444
7.5.3 Apriori-like Method . . . . . . . . . . . . . . . . . . . . 447
7.5.4 Candidate Generation . . . . . . . . . . . . . . . . . . . 448
7.5.5 Candidate Pruning . . . . . . . . . . . . . . . . . . . . . 453
7.5.6 Support Counting . . . . . . . . . . . . . . . . . . . . . 457

7.6 Infrequent Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 457
7.6.1 Negative Patterns . . . . . . . . . . . . . . . . . . . . . 458
7.6.2 Negatively Correlated Patterns . . . . . . . . . . . . . . 458
7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns . . . . . . . . 460
7.6.4 Techniques for Mining Interesting Infrequent Patterns . 461
7.6.5 Techniques Based on Mining Negative Patterns . . . . . 463
7.6.6 Techniques Based on Support Expectation . . . . . . . . 465

7.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 469
7.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473

8 Cluster Analysis: Basic Concepts and Algorithms 487
8.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490

8.1.1 What Is Cluster Analysis? . . . . . . . . . . . . . . . . . 490
8.1.2 Different Types of Clusterings . . . . . . . . . . . . . . . 491
8.1.3 Different Types of Clusters . . . . . . . . . . . . . . . . 493

8.2 K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
8.2.1 The Basic K-means Algorithm . . . . . . . . . . . . . . 497
8.2.2 K-means: Additional Issues . . . . . . . . . . . . . . . . 506
8.2.3 Bisecting K-means . . . . . . . . . . . . . . . . . . . . . 508
8.2.4 K-means and Different Types of Clusters . . . . . . . . 510
8.2.5 Strengths and Weaknesses . . . . . . . . . . . . . . . . . 510
8.2.6 K-means as an Optimization Problem . . . . . . . . . . 513

8.3 Agglomerative Hierarchical Clustering . . . . . . . . . . . . . . 515
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm 516
8.3.2 Specific Techniques . . . . . . . . . . . . . . . . . . . . . 518
8.3.3 The Lance-Williams Formula for Cluster Proximity . . . 524
8.3.4 Key Issues in Hierarchical Clustering . . . . . . . . . . . 524
8.3.5 Strengths and Weaknesses . . . . . . . . . . . . . . . . . 526

8.4 DBSCAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
8.4.1 Traditional Density: Center-Based Approach . . . . . . 527
8.4.2 The DBSCAN Algorithm . . . . . . . . . . . . . . . . . 528
8.4.3 Strengths and Weaknesses . . . . . . . . . . . . . . . . . 530

8.5 Cluster Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 532
8.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 533
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation . . . . 536
8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix . . . . . . 542
8.5.4 Unsupervised Evaluation of Hierarchical Clustering . . . 544
8.5.5 Determining the Correct Number of Clusters . . . . . . 546
8.5.6 Clustering Tendency . . . . . . . . . . . . . . . . . . . . 547
8.5.7 Supervised Measures of Cluster Validity . . . . . . . . . 548
8.5.8 Assessing the Significance of Cluster Validity Measures . 553

8.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 555
8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559

9 Cluster Analysis: Additional Issues and Algorithms 569
9.1 Characteristics of Data, Clusters, and Clustering Algorithms . 570

9.1.1 Example: Comparing K-means and DBSCAN . . . . . . 570
9.1.2 Data Characteristics . . . . . . . . . . . . . . . . . . . . 571

9.1.3 Cluster Characteristics . . . . . . . . . . . . . . . . . . . 573
9.1.4 General Characteristics of Clustering Algorithms . . . . 575

9.2 Prototype-Based Clustering . . . . . . . . . . . . . . . . . . . . 577
9.2.1 Fuzzy Clustering . . . . . . . . . . . . . . . . . . . . . . 577
9.2.2 Clustering Using Mixture Models . . . . . . . . . . . . . 583
9.2.3 Self-Organizing Maps (SOM) . . . . . . . . . . . . . . . 594

9.3 Density-Based Clustering . . . . . . . . . . . . . . . . . . . . . 600
9.3.1 Grid-Based Clustering . . . . . . . . . . . . . . . . . . . 601
9.3.2 Subspace Clustering . . . . . . . . . . . . . . . . . . . . 604
9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering . . . . 608
9.4 Graph-Based Clustering . . . . . . . . . . . . . . . . . . . . . . 612

9.4.1 Sparsification . . . . . . . . . . . . . . . . . . . . . . . . 613
9.4.2 Minimum Spanning Tree (MST) Clustering . . . . . . . 614
9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS . . 616
9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling . . . . . . 616
9.4.5 Shared Nearest Neighbor Similarity . . . . . . . . . . . 622
9.4.6 The Jarvis-Patrick Clustering Algorithm . . . . . . . . . 625
9.4.7 SNN Density . . . . . . . . . . . . . . . . . . . . . . . . 627
9.4.8 SNN Density-Based Clustering . . . . . . . . . . . . . . 629

9.5 Scalable Clustering Algorithms . . . . . . . . . . . . . . . . . . 630
9.5.1 Scalability: General Issues and Approaches . . . . . . . 630
9.5.2 BIRCH . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
9.5.3 CURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635

9.6 Which Clustering Algorithm? . . . . . . . . . . . . . . . . . . . 639
9.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 643
9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647

10 Anomaly Detection 651
10.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653

10.1.1 Causes of Anomalies . . . . . . . . . . . . . . . . . . . . 653
10.1.2 Approaches to Anomaly Detection . . . . . . . . . . . . 654
10.1.3 The Use of Class Labels . . . . . . . . . . . . . . . . . . 655
10.1.4 Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656

10.2 Statistical Approaches . . . . . . . . . . . . . . . . . . . . . . . 658
10.2.1 Detecting Outliers in a Univariate Normal Distribution 659
10.2.2 Outliers in a Multivariate Normal Distribution . . . . . 661
10.2.3 A Mixture Model Approach for Anomaly Detection . . . 662

10.2.4 Strengths and Weaknesses . . . . . . . . . . . . . . . . . 665
10.3 Proximity-Based Outlier Detection . . . . . . . . . . . . . . . . 666

10.3.1 Strengths and Weaknesses . . . . . . . . . . . . . . . . . 666
10.4 Density-Based Outlier Detection . . . . . . . . . . . . . . . . . 668

10.4.1 Detection of Outliers Using Relative Density . . . . . . 669
10.4.2 Strengths and Weaknesses . . . . . . . . . . . . . . . . . 670

10.5 Clustering-Based Techniques . . . . . . . . . . . . . . . . . . . 671
10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster . . . . 672
10.5.2 Impact of Outliers on the Initial Clustering . . . . . . . 674
10.5.3 The Number of Clusters to Use . . . . . . . . . . . . . . 674
10.5.4 Strengths and Weaknesses . . . . . . . . . . . . . . . . . 674

10.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 675
10.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680

Appendix A Linear Algebra 685
A.1 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685

A.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 685
A.1.2 Vector Addition and Multiplication by a Scalar . . . . . 685
A.1.3 Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . 687
A.1.4 The Dot Product, Orthogonality, and Orthogonal Projections . . . . . . 688
A.1.5 Vectors and Data Analysis . . . . . . . . . . . . . . . . 690

A.2 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
A.2.1 Matrices: Definitions . . . . . . . . . . . . . . . . . . . . 691
A.2.2 Matrices: Addition and Multiplication by a Scalar . . . 692
A.2.3 Matrices: Multiplication . . . . . . . . . . . . . . . . . . 693
A.2.4 Linear Transformations and Inverse Matrices . . . . . . 695
A.2.5 Eigenvalue and Singular Value Decomposition . . . . . . 697
A.2.6 Matrices and Data Analysis . . . . . . . . . . . . . . . . 699

A.3 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 700

Appendix B Dimensionality Reduction 701
B.1 PCA and SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . 701

B.1.1 Principal Components Analysis (PCA) . . . . . . . . . . 701
B.1.2 SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706

B.2 Other Dimensionality Reduction Techniques . . . . . . . . . . . 708
B.2.1 Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . 708
B.2.2 Locally Linear Embedding (LLE) . . . . . . . . . . . . . 710
B.2.3 Multidimensional Scaling, FastMap, and ISOMAP . . . 712

B.2.4 Common Issues . . . . . . . . . . . . . . . . . . . . . . . 715
B.3 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 716

Appendix C Probability and Statistics 719
C.1 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719

C.1.1 Expected Values . . . . . . . . . . . . . . . . . . . . . . 722
C.2 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723

C.2.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . 724
C.2.2 Central Limit Theorem . . . . . . . . . . . . . . . . . . 724
C.2.3 Interval Estimation . . . . . . . . . . . . . . . . . . . . . 725

C.3 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . 726

Appendix D Regression 729
D.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
D.2 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . 730

D.2.1 Least Square Method . . . . . . . . . . . . . . . . . . . 731
D.2.2 Analyzing Regression Errors . . . . . . . . . . . . . . . 733
D.2.3 Analyzing Goodness of Fit . . . . . . . . . . . . . . . . 735

D.3 Multivariate Linear Regression . . . . . . . . . . . . . . . . . . 736
D.4 Alternative Least-Square Regression Methods . . . . . . . . . . 737

Appendix E Optimization 739
E.1 Unconstrained Optimization . . . . . . . . . . . . . . . . . . . . 739

E.1.1 Numerical Methods . . . . . . . . . . . . . . . . . . . . 742
E.2 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . 746

E.2.1 Equality Constraints . . . . . . . . . . . . . . . . . . . . 746
E.2.2 Inequality Constraints . . . . . . . . . . . . . . . . . . . 747

Author Index 750

Subject Index 758

Copyright Permissions 769

1 Introduction

Rapid advances in data collection and storage technology have enabled or-
ganizations to accumulate vast amounts of data. However, extracting useful
information has proven extremely challenging. Often, traditional data analy-
sis tools and techniques cannot be used because of the massive size of a data
set. Sometimes, the non-traditional nature of the data means that traditional
approaches cannot be applied even if the data set is relatively small. In other
situations, the questions that need to be answered cannot be addressed using
existing data analysis techniques, and thus, new methods need to be devel-
oped.

Data mining is a technology that blends traditional data analysis methods
with sophisticated algorithms for processing large volumes of data. It has also
opened up exciting opportunities for exploring and analyzing new types of
data and for analyzing old types of data in new ways. In this introductory
chapter, we present an overview of data mining and outline the key topics
to be covered in this book. We start with a description of some well-known
applications that require new techniques for data analysis.

Business Point-of-sale data collection (bar code scanners, radio frequency
identification (RFID), and smart card technology) have allowed retailers to
collect up-to-the-minute data about customer purchases at the checkout coun-
ters of their stores. Retailers can utilize this information, along with other
business-critical data such as Web logs from e-commerce Web sites and cus-
tomer service records from call centers, to help them better understand the
needs of their customers and make more informed business decisions.

Data mining techniques can be used to support a wide range of business
intelligence applications such as customer profiling, targeted marketing, work-
flow management, store layout, and fraud detection. They can also help retailers
answer important business questions such as “Who are the most profitable
customers?” “What products can be cross-sold or up-sold?” and “What is the
revenue outlook of the company for next year?” Some of these questions mo-
tivated the creation of association analysis (Chapters 6 and 7), a new data
analysis technique.

Medicine, Science, and Engineering Researchers in medicine, science,
and engineering are rapidly accumulating data that is key to important new
discoveries. For example, as an important step toward improving our under-
standing of the Earth’s climate system, NASA has deployed a series of Earth-
orbiting satellites that continuously generate global observations of the land
surface, oceans, and atmosphere. However, because of the size and spatio-
temporal nature of the data, traditional methods are often not suitable for
analyzing these data sets. Techniques developed in data mining can aid Earth
scientists in answering questions such as “What is the relationship between
the frequency and intensity of ecosystem disturbances such as droughts and
hurricanes to global warming?” “How is land surface precipitation and temper-
ature affected by ocean surface temperature?” and “How well can we predict
the beginning and end of the growing season for a region?”

As another example, researchers in molecular biology hope to use the large
amounts of genomic data currently being gathered to better understand the
structure and function of genes. In the past, traditional methods in molecu-
lar biology allowed scientists to study only a few genes at a time in a given
experiment. Recent breakthroughs in microarray technology have enabled sci-
entists to compare the behavior of thousands of genes under various situations.
Such comparisons can help determine the function of each gene and perhaps
isolate the genes responsible for certain diseases. However, the noisy and high-
dimensional nature of data requires new types of data analysis. In addition
to analyzing gene array data, data mining can also be used to address other
important biological challenges such as protein structure prediction, multiple
sequence alignment, the modeling of biochemical pathways, and phylogenetics.

1.1 What Is Data Mining?

Data mining is the process of automatically discovering useful information in
large data repositories. Data mining techniques are deployed to scour large
databases in order to find novel and useful patterns that might otherwise
remain unknown. They also provide capabilities to predict the outcome of a
future observation, such as predicting whether a newly arrived customer will
spend more than $100 at a department store.

Not all information discovery tasks are considered to be data mining. For
example, looking up individual records using a database management system
or finding particular Web pages via a query to an Internet search engine are
tasks related to the area of information retrieval. Although such tasks are
important and may involve the use of sophisticated algorithms and data
structures, they rely on traditional computer science techniques and obvious
features of the data to create index structures for efficiently organizing and
retrieving information. Nonetheless, data mining techniques have been used
to enhance information retrieval systems.

Data Mining and Knowledge Discovery

Data mining is an integral part of knowledge discovery in databases
(KDD), which is the overall process of converting raw data into useful in-
formation, as shown in Figure 1.1. This process consists of a series of trans-
formation steps, from data preprocessing to postprocessing of data mining
results.

[Figure 1.1. The process of knowledge discovery in databases (KDD): input data -> preprocessing (feature selection, dimensionality reduction, normalization, data subsetting) -> data mining -> postprocessing (filtering patterns, visualization, pattern interpretation) -> information.]

The input data can be stored in a variety of formats (flat files, spread-
sheets, or relational tables) and may reside in a centralized data repository
or be distributed across multiple sites. The purpose of preprocessing is
to transform the raw input data into an appropriate format for subsequent
analysis. The steps involved in data preprocessing include fusing data from
multiple sources, cleaning data to remove noise and duplicate observations,
and selecting records and features that are relevant to the data mining task
at hand. Because of the many ways data can be collected and stored, data
preprocessing is perhaps the most laborious and time-consuming step in the
overall knowledge discovery process.

“Closing the loop” is the phrase often used to refer to the process of in-
tegrating data mining results into decision support systems. For example,
in business applications, the insights offered by data mining results can be
integrated with campaign management tools so that effective marketing pro-
motions can be conducted and tested. Such integration requires a postpro-
cessing step that ensures that only valid and useful results are incorporated
into the decision support system. An example of postprocessing is visualiza-
tion (see Chapter 3), which allows analysts to explore the data and the data
mining results from a variety of viewpoints. Statistical measures or hypoth-
esis testing methods can also be applied during postprocessing to eliminate
spurious data mining results.
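As a rough illustration of this preprocessing, data mining, and postprocessing flow, the sketch below chains normalization, dimensionality reduction, clustering, and a simple pattern filter using scikit-learn; the synthetic data and the minimum-cluster-size rule are assumptions made only for the example, not something described in the text.

```python
# Minimal sketch of the KDD flow: preprocessing -> data mining -> postprocessing.
# The data set and the "report only clusters with >= 20 members" rule are hypothetical.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 10))                 # stand-in for the raw input data

# Preprocessing: normalization and dimensionality reduction.
X = StandardScaler().fit_transform(raw)
X = PCA(n_components=3).fit_transform(X)

# Data mining: cluster the preprocessed data.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Postprocessing: filter the patterns, keeping only clusters large enough
# to be worth passing on to a decision support system.
sizes = np.bincount(labels)
kept = [c for c in range(4) if sizes[c] >= 20]
print("cluster sizes:", sizes, "reported clusters:", kept)
```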

1.2 Motivating Challenges

As mentioned earlier, traditional data analysis techniques have often encoun-
tered practical difficulties in meeting the challenges posed by new data sets.
The following are some of the specific challenges that motivated the develop-
ment of data mining.

Scalability Because of advances in data generation and collection, data sets
with sizes of gigabytes, terabytes, or even petabytes are becoming common.
If data mining algorithms are to handle these massive data sets, then they
must be scalable. Many data mining algorithms employ special search strate-
gies to handle exponential search problems. Scalability may also require the
implementation of novel data structures to access individual records in an ef-
ficient manner. For instance, out-of-core algorithms may be necessary when
processing da