Preface

This is the study guide for 161.324 Data Mining (undergraduate) and 161.777 Practical Data Mining (postgraduate) at Massey University.

Data mining sits at the intersection of statistics, machine learning, and computing. These notes take a statistician’s approach to the subject, emphasising a principled workflow: understanding the data, choosing an appropriate method, evaluating results, and communicating findings.

0.1 Course structure

The course is divided into two main parts:

  1. Prediction and classification (weeks 1–9): data wrangling, missing value imputation, linear models, regression trees, random forests, neural networks, linear discriminant analysis, logistic regression, classification trees, and ensemble methods.
  2. Unsupervised learning and pattern discovery (weeks 10–12): clustering (hierarchical, k-means, k-medoids), association rule mining, and ensemble learning including bagging, boosting (XGBoost), and stacked ensembles.

Each topic is accompanied by practical lab sessions in R, using the tidyverse and tidymodels ecosystems, along with a range of specialised packages for clustering, association rules, and ensemble methods.

0.2 Prerequisites

The notes assume familiarity with introductory statistics and basic R programming (data frames, vectors, functions). Some experience with ggplot2 and dplyr is helpful but not essential, as the early weeks include a review of these tools.
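
As a rough self-check of the assumed R background, the sketch below uses only base R to build vectors, combine them into a data frame, define a function, and compute a grouped summary. The variable names and values are invented for illustration and do not come from the course materials. If each line reads comfortably, the prerequisites are probably met.

```r
# Illustrative self-check only: all names and values here are made up.

# Two vectors: one numeric, one character
heights_cm <- c(162, 175, 181, 158)
groups <- c("a", "b", "a", "b")

# A data frame built from the two vectors
people <- data.frame(group = groups, height_cm = heights_cm)

# A user-defined function that converts centimetres to metres
cm_to_m <- function(x) {
  x / 100
}

# Add a derived column, then take the mean height within each group
people$height_m <- cm_to_m(people$height_cm)
mean_by_group <- tapply(people$height_m, people$group, mean)
print(mean_by_group)
```

The grouped summary uses base R's tapply so that the example runs without any packages installed. The same step can also be written with dplyr, one of the tools reviewed in the early weeks.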

0.3 How to use these notes

The notes are designed to be read alongside the lectures and labs. Each chapter corresponds to roughly one week of material. Code examples are embedded throughout; running them yourself will give you a much deeper understanding than reading alone.

0.4 Acknowledgements

These notes build on earlier course materials written by Jonathan Marshall and Martin Hazelton, with contributions from Nicholas Knowlton.