Data mining
2026-05-22
Preface
This is the study guide for 161.324 Data Mining (undergraduate) and 161.777 Practical Data Mining (postgraduate) at Massey University.
Data mining sits at the intersection of statistics, machine learning, and computing. These notes take a statistician’s approach to the subject, emphasising a principled workflow: understanding the data, choosing an appropriate method, evaluating results, and communicating findings.
0.1 Course structure
The course is divided into two main parts:
- Prediction and classification (weeks 1–9): data wrangling, missing value imputation, linear models, regression trees, random forests, neural networks, linear discriminant analysis, logistic regression, classification trees, and ensemble methods.
- Unsupervised learning and pattern discovery (weeks 10–12): clustering (hierarchical, k-means, k-medoids), association rule mining, and ensemble learning including bagging, boosting (XGBoost), and stacked ensembles.
Each topic is accompanied by practical lab sessions in R, using the tidyverse and tidymodels ecosystems, along with a range of specialised packages for clustering, association rules, and ensemble methods.
0.2 Prerequisites
The notes assume familiarity with introductory statistics and basic R programming (data frames, vectors, functions). Some experience with ggplot2 and dplyr is helpful but not essential, as the early weeks include a review of these tools.
0.3 How to use these notes
The notes are designed to be read alongside the lectures and labs. Each chapter corresponds to roughly one week of material. Code examples are embedded throughout; running them yourself will give you a much deeper understanding than reading alone.
0.4 Acknowledgements
These notes build on earlier course materials written by Jonathan Marshall and Martin Hazelton, with contributions from Nicholas Knowlton.