This is my final project submission for the Practical Machine Learning course offered by Johns Hopkins University on Coursera.
The objective of the project was to develop a predictive model using accelerometer data to accurately classify the quality of barbell lifts performed by athletes.
The provided weight lifting exercise data included measurements from accelerometers placed on the belt, forearm, arm, and dumbbell of 6 participants. The athletes were asked to perform barbell lifts correctly, and then incorrectly in four different ways.
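The analysis below assumes the course data files have already been downloaded and read into R. A minimal loading sketch is included for completeness; the object names training_raw and validation are chosen to match the code that follows, and the URLs are the standard course download locations.
library(dplyr)
library(caret)
#read the raw training data and the 20-case validation (quiz) data
training_raw <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
validation <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")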
The first step is to clean the loaded training data. I took the following steps to get the data into a state where it could be analyzed:
train_clean <- training_raw %>%
  #replace blank cells in character columns with NA
  mutate(across(where(is.character), ~na_if(trimws(.), ""))) %>%
  #convert the outcome variable into a factor
  mutate(classe = as.factor(classe))
#helper functions to inspect the percentage of missing values per variable
missing_pct <- function(data) {
  sapply(data, function(x) mean(is.na(x)) * 100)
}
vars_over_50_missing <- function(data) {
  missing_pct <- sapply(data, function(x) mean(is.na(x)) * 100)
  names(missing_pct[missing_pct >= 50])
}
##Code to see the list of variables with missing data
#missing_pct(train_clean)
#vars_over_50_missing(train_clean)
#every variable that contains missing data is ~97% missing,
#so it is safe here to remove variables with more than 95% missing values, although a lower threshold of 80% is often considered ideal
#remove variables with more than 95% missing values
train_clean <- train_clean %>%
  select(where(~ mean(!is.na(.)) > 0.95))

#drop identifier and timestamp columns, as non-sensor data will not be helpful for predicting
train_clean <- train_clean %>%
  select(-c(X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window))
Using a seed for reproducibility, I randomly split the training data into a training set (used to develop the model) and a testing set (used to estimate the out-of-sample error). For clarity, the project's provided “testing” set is referred to as the validation set throughout.
#create test and training data sets from the provided training data ####
set.seed(652)
inTrain <- createDataPartition(y = train_clean$classe, p = 0.7, list = FALSE)
train <- train_clean[inTrain,]
test <- train_clean[-inTrain,]
dim(train)
## [1] 13737 53
dim(test)
## [1] 5885 53
dim(validation)
## [1] 20 160
The training set for model building comprised 70% of the entire raw training set.
Random forest models are well suited for classification tasks and are generally robust to multicollinearity, making them a strong choice for this dataset.
I implemented a random forest model using 5-fold cross-validation, with the forest comprising 500 trees (the randomForest default, since ntree is not set explicitly). This level of k-fold cross-validation helps ensure that the model is robust and not overfitted to the training data.
#model using random forest for accuracy and interpretability
set.seed(892)
control <- trainControl(method = "cv", number = 5)
model_rf <- train(classe ~ ., data = train, method = "rf", trControl = control)
model_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.68%
## Confusion matrix:
## A B C D E class.error
## A 3899 5 2 0 0 0.001792115
## B 21 2630 7 0 0 0.010534236
## C 0 10 2376 10 0 0.008347245
## D 0 2 22 2226 2 0.011545293
## E 0 0 4 8 2513 0.004752475
The cross-validated model tried 27 variables at each split (mtry = 27) and had an out-of-bag (OOB) estimated error rate of 0.68%, corresponding to an estimated accuracy of 99.32% on the training data.
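These figures can also be read directly from the fitted objects. The snippet below is a quick sanity check, assuming the usual caret bestTune and randomForest err.rate fields:
#mtry value selected by cross-validation
model_rf$bestTune$mtry
#OOB error estimate after the final tree has been added
tail(model_rf$finalModel$err.rate[, "OOB"], 1)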
The next step is to use the random forest model to predict the outcome on the testing data and determine its performance, including the out-of-sample error rate.
#predictions ####
predictions <- predict(model_rf, test)
conf_mat <- confusionMatrix(predictions, test$classe)
#out of sample error
oos_error <- 1 - conf_mat$overall["Accuracy"]
#print(oos_error)
Below is a summary of the performance of the random forest model for both in and out of sample errors:
#Calculate in-sample error
train_predictions <- predict(model_rf, train)
train_conf_mat <- confusionMatrix(train_predictions, train$classe)
in_sample_error <- 1 - train_conf_mat$overall["Accuracy"]
#Create a comparison table
error_df <- data.frame(
  Dataset = c("Training (In-Sample)", "Testing (Out-of-Sample)"),
  Accuracy = c(train_conf_mat$overall["Accuracy"], conf_mat$overall["Accuracy"]),
  Error = c(in_sample_error, oos_error)
)
#Display as a nice table
knitr::kable(error_df, caption = "Comparison of In-Sample and Out-of-Sample Errors", digits = 4)
Table: Comparison of In-Sample and Out-of-Sample Errors

| Dataset                 | Accuracy |  Error |
|:------------------------|---------:|-------:|
| Training (In-Sample)    |   1.0000 | 0.0000 |
| Testing (Out-of-Sample) |   0.9929 | 0.0071 |
The model had an out-of-sample error rate of 0.71%, meaning it was 99.29% accurate on the held-out test set. This represents excellent predictive performance. The perfect in-sample accuracy is expected, since the random forest is being evaluated on the same data it was trained on; the out-of-sample figure is the meaningful estimate of how the model will generalize.
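For a closer look at where the remaining errors occur, the confusion matrix object also exposes per-class statistics. A short sketch using the conf_mat computed above:
#per-class sensitivity and specificity on the held-out test set
round(conf_mat$byClass[, c("Sensitivity", "Specificity")], 4)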
First, the same cleaning steps were applied to the testing (validation) data set as were applied in the initial cleaning of the raw training data.
validation_clean <- validation %>%
  #replace blank cells in character columns with NA
  mutate(across(where(is.character), ~na_if(trimws(.), "")))

#as before, all variables that contain missing data are ~97% missing,
#so it is safe to remove variables with more than 95% missing values
validation_clean <- validation_clean %>%
  select(where(~ mean(!is.na(.)) > 0.95))

#drop identifier and timestamp columns, as non-sensor data will not be helpful for predicting
validation_clean <- validation_clean %>%
  select(-c(X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window))
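An equivalent and arguably safer approach, sketched below, is to skip the separate missingness filter and simply keep the same predictor columns that survived the training-data cleaning; this assumes the two files share column names, and predictor_cols is a name introduced here for illustration.
#reuse the training column selection so the two data sets cannot diverge
predictor_cols <- setdiff(names(train_clean), "classe")
validation_clean <- validation %>%
  mutate(across(where(is.character), ~na_if(trimws(.), ""))) %>%
  select(any_of(predictor_cols))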
Next, the model is used to make predictions on the cleaned validation data.
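Below is a sketch of the prediction step that produces the printed results, assuming the predictions are stored in an object named validation_predictions:
#predict the classe outcome for the 20 validation cases
validation_predictions <- predict(model_rf, validation_clean)
validation_predictions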
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E