library(tidyverse)
library(ggplot2)
library(caret)
library(e1071)
library(kernlab)
library(rattle)
library(mlbench)
library(rpart)
library(partykit)
library(psych)
library(pROC)
library(kknn)
library(stats)
library(factoextra)
library(cluster)
library(tinytex)
library(rmarkdown)
Libraries needed: tidyverse, ggplot2, caret, e1071, kernlab, rattle, mlbench, rpart, partykit, psych, kknn, stats, factoextra, cluster, and pROC.
Libraries needed for Knit to PDF: tinytex and rmarkdown.
For this exercise, I will be using the penguins_size.csv data set to perform HW 5.
https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data
This dataset contains 344 rows of data. Each row represents a penguin.
There are 7 columns in total: 3 categorical variables and 4 numerical variables.
pathname <- "penguins_size.csv"
penguin_size <- read.csv(pathname)
# Setting seed
set.seed(123)
glimpse(penguin_size)
## Rows: 344
## Columns: 7
## $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
## $ culmen_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ culmen_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
penguins_size.csv contains the following variables:
species: penguin species (“Adelie”, “Chinstrap”, “Gentoo”)
culmen_length_mm: culmen length (mm)
culmen_depth_mm: culmen depth (mm)
flipper_length_mm: flipper length (mm)
body_mass_g: body mass (g)
island: island name in the Palmer Archipelago, Antarctica (“Torgersen”, “Biscoe”, “Dream”)
sex: penguin sex (“MALE”, “FEMALE”, NA, “.”)
The numerical columns show the following distributions:
culmen_length_mm: Mean = 43.92 mm, ranging from 32.10 mm to 59.60 mm.
culmen_depth_mm: Mean = 17.15 mm, ranging from 13.10 mm to 21.50 mm.
flipper_length_mm: Mean = 200.90 mm, ranging from 172.00 mm to 231.00 mm.
body_mass_g: Mean = 4202 g, ranging from 2700 g to 6300 g.
After exploring the data further, there appear to be 2 NAs in each of the columns: culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g.
At first glance, the sex column has 10 NAs plus 1 entry containing the unique character “.”. This “.” appears to be a placeholder for an unknown/missing data point. In total, the sex column has 11 NAs once the “.” is counted.
Relationship between pairs of variables: What are culmen length & depth and how do they relate to the penguin’s beak?
The culmen is “the upper ridge of a bird’s beak” (definition from Oxford Languages). For this penguin data, the culmen length and culmen depth are considered as the culmen width and culmen height of the penguin’s beak.
After c. Data Cleaning, let’s create a new variable called culmen size, which will be the value of (culmen length x culmen depth), to explore the data more and see whether adding this variable can help with prediction accuracy.
After c. Data Cleaning, I’ll also convert the body_mass_g column from grams to pounds using this conversion formula: body_mass_lbs = body_mass_g / 453.59237
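For example, applying the formula to the first penguin’s body mass (3750 g in the glimpse above) gives about 8.27 lbs; a minimal arithmetic check:
# Grams-to-pounds check for the first penguin (3750 g)
3750 / 453.59237
## [1] 8.267335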
head(penguin_size)
# numeric vector that calculates the sum of NAs per column in the penguin_size data.
nas_per_column<- colSums(is.na(penguin_size))
print(nas_per_column)
## species island culmen_length_mm culmen_depth_mm
## 0 0 2 2
## flipper_length_mm body_mass_g sex
## 2 2 10
summary(penguin_size)
## species island culmen_length_mm culmen_depth_mm
## Length:344 Length:344 Min. :32.10 Min. :13.10
## Class :character Class :character 1st Qu.:39.23 1st Qu.:15.60
## Mode :character Mode :character Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex
## Min. :172.0 Min. :2700 Length:344
## 1st Qu.:190.0 1st Qu.:3550 Class :character
## Median :197.0 Median :4050 Mode :character
## Mean :200.9 Mean :4202
## 3rd Qu.:213.0 3rd Qu.:4750
## Max. :231.0 Max. :6300
## NA's :2 NA's :2
unique_species <- unique(penguin_size$species)
print(unique_species)
## [1] "Adelie" "Chinstrap" "Gentoo"
unique_island <- unique(penguin_size$island)
print(unique_island)
## [1] "Torgersen" "Biscoe" "Dream"
unique_sex <- unique(penguin_size$sex)
print(unique_sex)
## [1] "MALE" "FEMALE" NA "."
# Numeric variable that calculates the sum of every instance of "." in the sex column
sum_unique_entry <- sum(penguin_size$sex == ".", na.rm = TRUE)
# Sum of instances of "."
print(paste("Sum of '.' entries in sex column:", sum_unique_entry))
## [1] "Sum of '.' entries in sex column: 1"
Let’s consider the NAs and determine how much effect they have on the penguin_size data.
First, I’ll convert the single “.” entry in the sex column to NA.
After calculating the proportion of missing data per column to see the significance of its impact on the data, here are the results:
species: 0%
island: 0%
culmen_length_mm: 0.58%
culmen_depth_mm: 0.58%
flipper_length_mm: 0.58%
body_mass_g: 0.58%
sex: 3.20%
The sex column has the most missing data, at 3.20%, compared with 0.58% for culmen_length_mm, culmen_depth_mm, flipper_length_mm, and body_mass_g.
Considering 3.20% is still relatively low, I will drop the NAs, since their removal should not have too large an effect on predictability.
After dropping the NAs, 11 rows have been removed, and the dataset now contains 333 rows.
# Changing the "." in the sex column to NA
penguin_size <- penguin_size %>%
mutate(sex = na_if(sex, "."))
# numeric vector that calculates the sum of NAs per column in the penguin_size data.
nas_per_column<- colSums(is.na(penguin_size))
print(nas_per_column)
## species island culmen_length_mm culmen_depth_mm
## 0 0 2 2
## flipper_length_mm body_mass_g sex
## 2 2 11
# Calculating the proportion of missing data per column to see the significance of its impact on the data.
# There are 344 rows
nas_per_column_percent <- (nas_per_column/344) * 100
print(nas_per_column_percent)
## species island culmen_length_mm culmen_depth_mm
## 0.0000000 0.0000000 0.5813953 0.5813953
## flipper_length_mm body_mass_g sex
## 0.5813953 0.5813953 3.1976744
# Dropping the NAs in the penguin_size data.
penguin_size <- na.omit(penguin_size)
# numeric vector that calculates the sum of NAs per column in the penguin_size data.
nas_per_column<- colSums(is.na(penguin_size))
# Observing the NAs per column after dropping them.
print(nas_per_column)
## species island culmen_length_mm culmen_depth_mm
## 0 0 0 0
## flipper_length_mm body_mass_g sex
## 0 0 0
glimpse(penguin_size)
## Rows: 333
## Columns: 7
## $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
## $ culmen_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
## $ culmen_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
## $ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
## $ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
## $ sex <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "FEMAL…
summary(penguin_size)
## species island culmen_length_mm culmen_depth_mm
## Length:333 Length:333 Min. :32.10 Min. :13.10
## Class :character Class :character 1st Qu.:39.50 1st Qu.:15.60
## Mode :character Mode :character Median :44.50 Median :17.30
## Mean :43.99 Mean :17.16
## 3rd Qu.:48.60 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## flipper_length_mm body_mass_g sex
## Min. :172 Min. :2700 Length:333
## 1st Qu.:190 1st Qu.:3550 Class :character
## Median :197 Median :4050 Mode :character
## Mean :201 Mean :4207
## 3rd Qu.:213 3rd Qu.:4775
## Max. :231 Max. :6300
Case: Let’s use this data to predict the sex of a penguin.
In this step, let’s create 2 new columns:
culmen_size_mm2: the value of (culmen length x culmen depth), added to explore the data more and see whether this variable can help with prediction accuracy.
body_mass_lbs: the body_mass_g column converted from grams to pounds with this conversion formula: body_mass_lbs = body_mass_g / 453.59237
Next, I’ll convert the categorical variables into factors.
Now, I’ll prepare a subset of the penguin_size data without the class labels (predictors_clustering) to perform clustering. Afterwards, I’ll preprocess that subset using the center/scale method.
Since we’re going to perform predictions with other classifiers, I’ll also partition the data using a 70/30 train-test split (train_penguin and test_penguin) and preprocess it when creating the classifier models.
# Creating culmen_size_mm2 column
penguin_size$culmen_size_mm2 <- penguin_size$culmen_length_mm * penguin_size$culmen_depth_mm
# Creating body_mass_lbs column
penguin_size$body_mass_lbs <- penguin_size$body_mass_g / 453.59237
# Converting <chr> to <fctr>'s
penguin_size$species <- as.factor(penguin_size$species)
penguin_size$island <- as.factor(penguin_size$island)
penguin_size$sex <- as.factor(penguin_size$sex)
# Observing changes
glimpse(penguin_size)
## Rows: 333
## Columns: 9
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ culmen_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
## $ culmen_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
## $ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
## $ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
## $ sex <fct> MALE, FEMALE, FEMALE, FEMALE, MALE, FEMALE, MALE, FE…
## $ culmen_size_mm2 <dbl> 731.17, 687.30, 725.40, 708.31, 809.58, 692.42, 768.…
## $ body_mass_lbs <dbl> 8.267335, 8.377566, 7.165024, 7.605948, 8.046873, 7.…
# Creating subset of penguin_size data to perform clustering
# Removing class labels
predictors_clustering <- penguin_size %>% select(-c(species, island, sex))
# Checking predictors_clustering subset
glimpse(predictors_clustering)
## Rows: 333
## Columns: 6
## $ culmen_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
## $ culmen_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
## $ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
## $ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
## $ culmen_size_mm2 <dbl> 731.17, 687.30, 725.40, 708.31, 809.58, 692.42, 768.…
## $ body_mass_lbs <dbl> 8.267335, 8.377566, 7.165024, 7.605948, 8.046873, 7.…
# Center scale allows us to standardize the data
preproc <- preProcess(predictors_clustering, method = c("center", "scale"))
# We have to call predict to fit our data based on preprocessing
# Using predictors_clustering for g. Clustering
predictors_clustering <- predict(preproc, predictors_clustering)
# Partitioning the data to be used in f. Classification
index = createDataPartition(y = penguin_size$sex, p = 0.7, list=FALSE)
# Everything in the generated index list
train_penguin = penguin_size[index,]
# Everything except the generated indices
test_penguin = penguin_size[-index,]
Using the predictors_clustering subset without class labels.
Let’s use kmeans to determine the best number for K before clustering, using the plots from the Within Sum of Squares (WSS) method and the Silhouette method.
Using the WSS method, K = 4 represents the last non-flat slope, though some would argue for K = 5 because that is the last point before the slope goes flat.
Using the Silhouette method, the best K = 2.
Either option can be justified, but since K = 4 also has the second-best silhouette score, it is a promising option under both methods. Using K = 4 as suggested by the plots, we can fit our model with the kmeans function.
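To make the silhouette comparison concrete, here is a minimal sketch (avg_sil is a hypothetical helper; it assumes the predictors_clustering subset from above and uses cluster::silhouette) that computes the average silhouette width for a few candidate values of K:
# Average silhouette width for candidate K values (sketch)
avg_sil <- function(k, data) {
  km <- kmeans(data, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(data))[, "sil_width"])
}
sapply(2:5, avg_sil, data = predictors_clustering)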
I generated the following plots:
Cluster plot
PCA Projection Colored by Penguin Sex
PCA Projection Colored by Cluster
# Checking subset of without class labels
glimpse(predictors_clustering)
## Rows: 333
## Columns: 6
## $ culmen_length_mm <dbl> -0.8946955, -0.8215515, -0.6752636, -1.3335592, -0.8…
## $ culmen_depth_mm <dbl> 0.77955895, 0.11940428, 0.42409105, 1.08424573, 1.74…
## $ flipper_length_mm <dbl> -1.4246077, -1.0678666, -0.4257325, -0.5684290, -0.7…
## $ body_mass_g <dbl> -0.567620576, -0.505525421, -1.188572125, -0.9401915…
## $ culmen_size_mm2 <dbl> -0.18555728, -0.56408104, -0.23534260, -0.38280034, …
## $ body_mass_lbs <dbl> -0.567620576, -0.505525421, -1.188572125, -0.9401915…
# Observing optimal number of K clusters using Within Sum of Square
fviz_nbclust(predictors_clustering, kmeans, method = "wss")
# ggsave("predictors_clustering_wss.png", plot = last_plot(), width = 10, height = 6)
# Observing optimal number of K clusters using Silhouette
fviz_nbclust(predictors_clustering, kmeans, method = "silhouette")
# ggsave("predictors_clustering_silhouette.png", plot = last_plot(), width = 10, height = 6)
# Fit the data
# K = 4 with 25 restarts
fit_clustering <- kmeans(predictors_clustering, centers = 4, nstart = 25)
# Display the kmeans object information
fit_clustering
## K-means clustering with 4 clusters of sizes 69, 66, 145, 53
##
## Cluster means:
## culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g
## 1 0.8308127 0.9238068 -0.2737298 -0.3597368
## 2 0.2980496 -1.4332672 0.8812373 0.6531137
## 3 -0.9318980 0.4640155 -0.8223302 -0.7312092
## 4 1.0967521 -0.6873450 1.5087468 1.6555031
## culmen_size_mm2 body_mass_lbs
## 1 1.4535194 -0.3597368
## 2 -0.8449154 0.6531137
## 3 -0.4273468 -0.7312092
## 4 0.3289974 1.6555031
##
## Clustering vector:
## 1 2 3 5 6 7 8 13 14 15 16 17 18 19 20 21 22 23 24 25
## 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3
## 46 47 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
## 3 3 3 1 3 3 3 1 3 3 3 3 3 3 3 1 3 3 3 3
## 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
## 3 3 3 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3
## 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## 3 3 3 1 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
## 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
## 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186
## 3 1 1 1 1 1 3 1 3 1 1 1 1 1 1 1 3 1 3 1
## 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206
## 1 1 1 1 3 1 1 1 1 3 1 1 1 1 1 1 1 1 3 1
## 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
## 3 1 3 1 1 1 1 3 3 1 3 1 1 1 2 4 2 4 2 2
## 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246
## 2 4 2 2 2 4 2 4 2 4 2 4 2 4 4 2 2 2 2 2
## 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267
## 4 2 4 4 2 2 4 4 4 2 4 2 4 2 4 2 2 4 2 2
## 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 288
## 4 2 4 2 2 2 4 2 2 2 2 2 4 2 2 2 4 2 4 4
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308
## 2 4 2 2 4 2 2 4 2 4 2 4 2 4 2 4 2 4 2 4
## 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 326 327 328 329
## 2 4 2 4 2 4 2 4 4 2 2 4 2 4 2 4 4 2 4 2
## 330 331 332 333 334 335 336 338 339 341 342 343 344
## 4 4 4 2 4 2 4 4 2 2 4 2 4
##
## Within cluster sum of squares by cluster:
## [1] 134.21844 48.68650 240.08827 55.15159
## (between_SS / total_SS = 76.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Display the cluster plot
fviz_cluster(fit_clustering, data = predictors_clustering)
# ggsave("fit_clustering.png", plot = last_plot(), width = 10, height = 6)
# Calculate PCA
pca = prcomp(predictors_clustering)
# Save as dataframe
rotated_data = as.data.frame(pca$x)
# Add original labels as a reference
rotated_data$Color <- penguin_size$sex
# Plot and color by labels
ggplot(data = rotated_data, aes(x = PC1, y = PC2, col = Color)) +
geom_point(alpha = 0.3)+
labs(title = "PCA Projection Colored by Penguin Sex")
# ggsave("pca_clustering.png", plot = last_plot(), width = 10, height = 6)
# Assign clusters as a new column
rotated_data$Clusters = as.factor(fit_clustering$cluster)
# Plot and color by labels
ggplot(data = rotated_data, aes(x = PC1, y = PC2, col = Clusters)) +
geom_point() +
labs(title = "PCA Projection Colored by Cluster")
# ggsave("pca_clustering_colored.png", plot = last_plot(), width = 10, height = 6)
Now that we’ve partitioned the data in d. Data Preprocessing, let’s preprocess it as we create the classifiers.
Using SVM, and grid search to tune for C: The final value used for the model was C = 0.1.
The best resampled ROC was 0.9777766 (note that the SVM was tuned on ROC rather than accuracy).
Using KNN, and tuning the choice of k plus the type of distance function: The final values used for the model were kmax = 7, distance = 2 and kernel = cos.
Accuracy was 0.9228261.
Using Decision Tree, with the basic rpart method: The final value used for the model was cp = 0.01724138.
Accuracy was 0.8425231.
When comparing the classifiers to predict the penguin’s sex, SVM performed the best, with a resampled ROC of 0.9777766, and KNN was a close second with an accuracy of 0.9228261. Since the SVM was selected on ROC while the other two models used accuracy, these resampled numbers are not on exactly the same scale.
For part g. Evaluation, let’s proceed with the best performing classifier, SVM.
# How to plot ROC Curve for SVM: https://topepo.github.io/caret/model-training-and-tuning.html#control
# Define the grid of tuning parameters
# C = (0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000)
grid <- expand.grid(C = 10^seq(log10(0.00001), log10(1000), 1))
# Define the training control
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 25,
classProbs = TRUE, summaryFunction = twoClassSummary)
# Scaling method
preproc = c("center", "scale")
# Training the SVM model
svm_grid <- train(sex ~ ., data = train_penguin, method = "svmLinear",
trControl = train_control, preProcess = preproc, tuneGrid = grid, metric = "ROC")
svm_grid
## Support Vector Machines with Linear Kernel
##
## 234 samples
## 8 predictor
## 2 classes: 'FEMALE', 'MALE'
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold, repeated 25 times)
## Summary of sample sizes: 211, 210, 211, 211, 210, 211, ...
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 1e-05 0.2668214 0.1937576 0.9116364
## 1e-04 0.8836749 0.1933939 0.9112727
## 1e-03 0.8836446 0.8833030 0.6864545
## 1e-02 0.9632826 0.8945152 0.9023333
## 1e-01 0.9777766 0.8927879 0.9305455
## 1e+00 0.9764396 0.9048182 0.9250606
## 1e+01 0.9749635 0.9105152 0.9171212
## 1e+02 0.9741568 0.9092121 0.9110909
## 1e+03 0.9733584 0.9055758 0.9047576
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.1.
# Using KNN classifier
# setup a tuneGrid with the tuning parameters
tuneGrid <- expand.grid(kmax = 1:7, # test a range of k values 1 to 7
kernel = c("rectangular", "cos"), # regular and cosine-based distance functions
distance = 1:3) # powers of Minkowski 1 to 3
# Evaluation method parameter using 10 Fold Cross Validation
train_control <- trainControl(method = "cv", number = 10)
# Scaling method
preproc = c("center", "scale")
# tune and fit the model with 10-fold cross validation,
kknn_fit <- train(sex ~ ., data = train_penguin, method = 'kknn', trControl = train_control, preProcess = preproc,
tuneGrid = tuneGrid)
kknn_fit
## k-Nearest Neighbors
##
## 234 samples
## 8 predictor
## 2 classes: 'FEMALE', 'MALE'
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 211, 210, 211, 211, 211, 210, ...
## Resampling results across tuning parameters:
##
## kmax kernel distance Accuracy Kappa
## 1 rectangular 1 0.8755435 0.7508191
## 1 rectangular 2 0.8759058 0.7515104
## 1 rectangular 3 0.8759058 0.7517089
## 1 cos 1 0.8755435 0.7508191
## 1 cos 2 0.8759058 0.7515104
## 1 cos 3 0.8759058 0.7517089
## 2 rectangular 1 0.8755435 0.7508191
## 2 rectangular 2 0.8759058 0.7515104
## 2 rectangular 3 0.8759058 0.7517089
## 2 cos 1 0.8755435 0.7508191
## 2 cos 2 0.8759058 0.7515104
## 2 cos 3 0.8759058 0.7517089
## 3 rectangular 1 0.9143116 0.8284392
## 3 rectangular 2 0.9143116 0.8281082
## 3 rectangular 3 0.8967391 0.7927771
## 3 cos 1 0.8969203 0.7935885
## 3 cos 2 0.8885870 0.7766558
## 3 cos 3 0.9012681 0.8020682
## 4 rectangular 1 0.9143116 0.8284392
## 4 rectangular 2 0.9143116 0.8281082
## 4 rectangular 3 0.8967391 0.7927771
## 4 cos 1 0.9012681 0.8023998
## 4 cos 2 0.9014493 0.8024129
## 4 cos 3 0.9097826 0.8189458
## 5 rectangular 1 0.9057971 0.8113609
## 5 rectangular 2 0.9144928 0.8285204
## 5 rectangular 3 0.8925725 0.7843147
## 5 cos 1 0.9097826 0.8196108
## 5 cos 2 0.8972826 0.7940796
## 5 cos 3 0.8971014 0.7936014
## 6 rectangular 1 0.9057971 0.8111629
## 6 rectangular 2 0.9144928 0.8285204
## 6 rectangular 3 0.9012681 0.8016041
## 6 cos 1 0.9097826 0.8196108
## 6 cos 2 0.9057971 0.8110296
## 6 cos 3 0.8971014 0.7934671
## 7 rectangular 1 0.9099638 0.8194962
## 7 rectangular 2 0.9184783 0.8363105
## 7 rectangular 3 0.9014493 0.8021503
## 7 cos 1 0.9141304 0.8282236
## 7 cos 2 0.9228261 0.8452535
## 7 cos 3 0.9141304 0.8274952
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were kmax = 7, distance = 2 and kernel
## = cos.
# Using Decision Trees (basic rpart method)
# Evaluation method parameter using 10 Fold Cross Validation
train_control <- trainControl(method = "cv", number = 10)
# Scaling method
preproc = c("center", "scale")
# Fit the model
decision_tree <- train(sex ~., data = train_penguin, method = "rpart", trControl = train_control, preProcess = preproc)
decision_tree
## CART
##
## 234 samples
## 8 predictor
## 2 classes: 'FEMALE', 'MALE'
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 212, 210, 210, 211, 211, 210, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01724138 0.8425231 0.6842408
## 0.10344828 0.7948452 0.5898477
## 0.63793103 0.6629611 0.3172756
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01724138.
# # Opening a PNG device
# png("decision_tree_plot.png", width = 800, height = 600)
# Plotting the decision tree
fancyRpartPlot(decision_tree$finalModel, caption = "")
# # Closing device to save the plot
# dev.off()
# Generating IF-THEN statements
if_then_rules <- as.party.rpart(decision_tree$finalModel, data = TRUE)
if_then_rules
##
## Model formula:
## .outcome ~ speciesChinstrap + speciesGentoo + islandDream + islandTorgersen +
## culmen_length_mm + culmen_depth_mm + flipper_length_mm +
## body_mass_g + culmen_size_mm2 + body_mass_lbs
##
## Fitted party:
## [1] root
## | [2] culmen_size_mm2 < -0.33888: FEMALE (n = 96, err = 11.5%)
## | [3] culmen_size_mm2 >= -0.33888
## | | [4] body_mass_g < -0.70936: FEMALE (n = 28, err = 28.6%)
## | | [5] body_mass_g >= -0.70936: MALE (n = 110, err = 10.0%)
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
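Because the SVM was tuned on ROC while KNN and the decision tree were tuned on accuracy, a quick hold-out comparison puts all three on one metric. A minimal sketch (assuming the fitted svm_grid, kknn_fit, and decision_tree models and the test_penguin split from above):
# Hold-out accuracy on test_penguin for all three fitted models (sketch)
models <- list(SVM = svm_grid, KNN = kknn_fit, Tree = decision_tree)
sapply(models, function(m) {
  mean(predict(m, newdata = test_penguin) == test_penguin$sex)
})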
After producing the 2 x 2 confusion matrix for the FEMALE and MALE class, here is the result:
| Prediction | FEMALE | MALE |
|---|---|---|
| FEMALE | 45 | 9 |
| MALE | 4 | 41 |
Since the confusion matrix object only shows statistics for ‘Positive’ Class : FEMALE, I created a second confusion matrix with ‘Positive’ Class : MALE and combined their class statistics into one data frame to pull calculations easily.
When calculating precision manually, here is the result:
| Class | Precision |
|---|---|
| FEMALE | 0.8333333 |
| MALE | 0.9111111 |
When calculating recall manually, here is the result:
| Class | Recall |
|---|---|
| FEMALE | 0.9183673 |
| MALE | 0.8200000 |
When analyzing the statistics above, MALE appears to have a higher precision and FEMALE appears to have a higher recall.
Precision measures how often the model’s positive predictions are correct: TP / (TP + FP). Recall measures the model’s ability to find all actual positives: TP / (TP + FN).
In summary, high precision is important when false positives are costly, that is, when positive predictions need to be trustworthy. High recall is important when false negatives are costly, that is, when failing to identify an actual positive is worse than raising a false alarm.
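For example, reading the counts straight off the 2 x 2 confusion matrix above with FEMALE as the positive class (a minimal arithmetic check, matching the tables above):
# Counts from the confusion matrix, FEMALE as the positive class
tp <- 45  # predicted FEMALE, actually FEMALE
fp <- 9   # predicted FEMALE, actually MALE
fn <- 4   # predicted MALE, actually FEMALE
tp / (tp + fp)  # precision
## [1] 0.8333333
tp / (tp + fn)  # recall
## [1] 0.9183673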
The ROC curve plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate); pROC draws it as sensitivity versus specificity with a reversed x-axis, which is equivalent. In this result, the area under the ROC curve (AUC) is high, at 0.951, indicating good test performance and classification ability. The model can distinguish the positive and negative classes effectively.
# (1) Produce a 2x2 confusion matrix of the best performing classifier.
# Now that our svm_grid is made, let's, evaluate the fit with a confusion matrix
pred_penguin_size <- predict(svm_grid, newdata = test_penguin)
# Confusion Matrix
cm_svm_grid <- confusionMatrix(pred_penguin_size, test_penguin$sex)
cm_svm_grid
## Confusion Matrix and Statistics
##
## Reference
## Prediction FEMALE MALE
## FEMALE 45 9
## MALE 4 41
##
## Accuracy : 0.8687
## 95% CI : (0.7859, 0.9282)
## No Information Rate : 0.5051
## P-Value [Acc > NIR] : 2.389e-14
##
## Kappa : 0.7376
##
## Mcnemar's Test P-Value : 0.2673
##
## Sensitivity : 0.9184
## Specificity : 0.8200
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.9111
## Prevalence : 0.4949
## Detection Rate : 0.4545
## Detection Prevalence : 0.5455
## Balanced Accuracy : 0.8692
##
## 'Positive' Class : FEMALE
##
# 'FEMALE' for the positive class
cm_positive_class_female <- confusionMatrix(pred_penguin_size, test_penguin$sex, positive = "FEMALE")
# 'MALE' for the positive class
cm_positive_class_male <- confusionMatrix(pred_penguin_size, test_penguin$sex, positive = "MALE")
cm_positive_class_male
## Confusion Matrix and Statistics
##
## Reference
## Prediction FEMALE MALE
## FEMALE 45 9
## MALE 4 41
##
## Accuracy : 0.8687
## 95% CI : (0.7859, 0.9282)
## No Information Rate : 0.5051
## P-Value [Acc > NIR] : 2.389e-14
##
## Kappa : 0.7376
##
## Mcnemar's Test P-Value : 0.2673
##
## Sensitivity : 0.8200
## Specificity : 0.9184
## Pos Pred Value : 0.9111
## Neg Pred Value : 0.8333
## Prevalence : 0.5051
## Detection Rate : 0.4141
## Detection Prevalence : 0.4545
## Balanced Accuracy : 0.8692
##
## 'Positive' Class : MALE
##
# Creating data frame of byClass statistics for 'FEMALE'
metrics_female <- as.data.frame(cm_positive_class_female$byClass)
colnames(metrics_female)[1] <- "FEMALE"
# Creating data frame byClass statistics for 'MALE'
metrics_male <- as.data.frame(cm_positive_class_male$byClass)
colnames(metrics_male)[1] <- "MALE"
# Combining the data frames by their metrics
combined_metrics <- cbind(metrics_female, metrics_male)
# Let's check the new data frame
combined_metrics
# Let's rotate the columns and rows so we can select by specific statistics for both Classes: FEMALE, MALE
# Transpose the combined_metrics data frame
transposed_metrics <- t(combined_metrics)
# Convert the transposed matrix back to a data frame
transposed_metrics_df <- as.data.frame(transposed_metrics)
# Row names as a first column in the data frame
transposed_metrics_df <- cbind(Class = row.names(transposed_metrics_df), transposed_metrics_df)
head(transposed_metrics_df)
# (2) Calculate the precision and recall
# Get the precision value for each class
transposed_metrics_df %>% select(c(Class, Precision))
# (2) Calculate the recall
# Get the recall value for each class
transposed_metrics_df %>% select(c(Class, Recall))
# (3) Let's create the ROC curve
str(penguin_size$sex)
## Factor w/ 2 levels "FEMALE","MALE": 2 1 1 1 2 1 2 1 2 2 ...
# Predict probabilities on the test set
pred_prob <- predict(svm_grid, test_penguin, type = "prob")
head(pred_prob)
# Build the ROC object from the FEMALE class probabilities (column 1)
roc_obj <- roc(test_penguin$sex, pred_prob[, 1])
## Setting levels: control = FEMALE, case = MALE
## Setting direction: controls > cases
# png("roc_obj.png", width = 800, height = 600)
plot(roc_obj, print.auc=TRUE)
# dev.off()
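Optionally, pROC can attach a confidence interval to that AUC. A minimal sketch, assuming the roc_obj built above:
# 95% confidence interval for the AUC (DeLong method by default)
ci.auc(roc_obj)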