
Predicting Penguin Gender Through Their Characteristics

Dataset

This dataset, penguins_size.csv, was sourced from Kaggle.

RMarkdown Code

Purpose

This project explores the accuracy of different machine-learning algorithms to predict penguin gender.

Our target variable will be penguin gender.

Part 1. Data Exploration

Categorical Variables

| Variable | Description | Unique Values |
|----------|-------------|---------------|
| Species | Penguin species | Adelie, Chinstrap, Gentoo |
| Island | Island name in the Palmer Archipelago (Antarctica) | Torgersen, Biscoe, Dream |
| Sex | Penguin gender | MALE, FEMALE, NA, “.” |

Numerical Variables

| Measurement | Mean | Min | Max |
|-------------|------|-----|-----|
| Culmen Length (mm) | 43.92 | 32.10 | 59.60 |
| Culmen Depth (mm) | 17.15 | 13.10 | 21.50 |
| Flipper Length (mm) | 200.90 | 172.00 | 231.00 |
| Body Mass (g) | 4202 | 2700 | 6300 |

After exploring the data further, there appear to be 2 NAs in each of the columns: culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g.

At first glance, the sex column has 10 NAs plus 1 entry containing the unique string “.”. This “.” appears to be a placeholder for an unknown or missing data point. Counting the “.”, there are 11 missing values in total.
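A quick way to surface these counts is with `colSums(is.na(...))` and `table()` (a sketch, assuming the CSV was read into `penguin_size` with `read.csv`):

```r
# Sketch: count missing values per column and inspect the sex levels
# (assumes penguin_size <- read.csv("penguins_size.csv"))
colSums(is.na(penguin_size))             # 2 NAs in each measurement column
table(penguin_size$sex, useNA = "ifany") # shows MALE, FEMALE, ".", and NA counts
```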

Part 2. Data Cleaning

Let’s consider the missing NAs and determine how much effect they have on the penguin_size data. First, I’ll convert the single “.” entry in the sex column to NA.
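The placeholder conversion is a one-line replacement (a sketch against the `penguin_size` data frame):

```r
# Sketch: recode the "." placeholder in the sex column as a proper NA
penguin_size$sex[penguin_size$sex == "."] <- NA
```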

After calculating the proportion of missing data per column to see the significance of its impact on the data, here are the results:

| Variable | Missing Percentage |
|----------|--------------------|
| species | 0% |
| island | 0% |
| culmen_length_mm | 0.58% |
| culmen_depth_mm | 0.58% |
| flipper_length_mm | 0.58% |
| body_mass_g | 0.58% |
| sex | 3.20% |
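The per-column percentages can be computed in one line (a sketch; column names are taken from the exploration above, and the “.” entry is assumed to have already been recoded as NA):

```r
# Sketch: percentage of missing values per column
missing_pct <- round(colMeans(is.na(penguin_size)) * 100, 2)
sort(missing_pct, decreasing = TRUE)  # sex should top the list at ~3.20
```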

The sex column has the most missing data, at 3.20%, compared with 0.58% for culmen_length_mm, culmen_depth_mm, flipper_length_mm, and body_mass_g.

Since 3.20% is relatively low, I will drop the rows containing NAs; their removal should not have too large an effect on predictability. After dropping the NAs, 11 rows have been removed, and the dataset now contains 333 rows.
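Dropping the incomplete rows is a one-liner (a sketch):

```r
# Sketch: remove every row containing at least one NA
penguin_size <- na.omit(penguin_size)
nrow(penguin_size)  # 333 rows remain after removing the 11 incomplete rows
```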

Part 3. Data Preprocessing

In this step, let’s create 2 new columns; one of them, culmen_size_mm2, comes up again in the Decision Tree discussion at the end of this writeup.

Next, I’ll convert the categorical variables into factors. I’ll also prepare a subset of the penguin_size data without the class labels to perform clustering. Afterward, I’ll preprocess the data using the center-and-scale method.
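The factor conversion and the clustering subset might look like this (a sketch; the numeric column names are taken from the tables in Part 1):

```r
# Sketch: convert the categorical variables to factors
penguin_size$species <- as.factor(penguin_size$species)
penguin_size$island  <- as.factor(penguin_size$island)
penguin_size$sex     <- as.factor(penguin_size$sex)

# Subset of numeric predictors, without the sex class label, for clustering
predictors_clustering <- penguin_size[, c("culmen_length_mm", "culmen_depth_mm",
                                          "flipper_length_mm", "body_mass_g")]
```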

Since we’re going to perform predictions using other classifiers, I’ll partition the data using a 70/30 train-test split and preprocess the data when creating the classifier models.

  # caret provides preProcess() and createDataPartition()
  library(caret)

  # Center/scale allows us to standardize the data
  preproc <- preProcess(predictors_clustering, method = c("center", "scale"))
  
  # We have to call predict() to transform our data based on the preprocessing
  # Using predictors_clustering for clustering
  predictors_clustering <- predict(preproc, predictors_clustering)
  
  # Partitioning the data to be used in classification
  index <- createDataPartition(y = penguin_size$sex, p = 0.7, list = FALSE)
  
  # Everything in the generated index list
  train_penguin <- penguin_size[index, ]
  
  # Everything except the generated indices
  test_penguin <- penguin_size[-index, ]

Part 4. Clustering

Before clustering with kmeans, let’s determine the best value of K using the plots for the Within Sum of Squares (elbow) method and the Silhouette method.

(Plots: the Within Sum of Squares elbow curve and the silhouette scores for predictors_clustering.)

Either option can be justified, but since K = 4 is also the second-best option by silhouette score, it is a promising choice for both methods. Using K = 4 as suggested by the plots, we can fit our model with the kmeans function.
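Fitting the model with the base kmeans function might look like this (a sketch; the seed and nstart values are illustrative, and predictors_clustering is the center-scaled subset from Part 3):

```r
# Sketch: fit k-means with K = 4 on the center-scaled predictors
set.seed(123)  # k-means depends on random starts, so fix a seed
kmeans_fit <- kmeans(predictors_clustering, centers = 4, nstart = 25)
kmeans_fit$size  # observations per cluster
```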

Below are the plots created with K = 4.

Part 5. Classification

Now that we’ve partitioned the data in Part 3 (Data Preprocessing), let’s preprocess the data as we create the classifiers.

Using SVM with a grid search to tune C: the final value used for the model was C = 0.1.
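The fit can be reproduced roughly as follows (a sketch; the control settings are inferred from the printed output below: 10-fold CV repeated 25 times, ROC as the selection metric, and a C grid from 1e-05 to 1e+03):

```r
library(caret)

# Sketch of the SVM fit; the seed is illustrative
set.seed(123)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 25,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
svm_grid <- train(sex ~ ., data = train_penguin,
                  method = "svmLinear",
                  preProcess = c("center", "scale"),
                  metric = "ROC",
                  trControl = ctrl,
                  tuneGrid = expand.grid(C = 10^seq(-5, 3)))
```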

svm_grid
Support Vector Machines with Linear Kernel 

234 samples
  8 predictor
  2 classes: 'FEMALE', 'MALE' 

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (10 fold, repeated 25 times) 
Summary of sample sizes: 211, 210, 211, 211, 210, 211, ... 
Resampling results across tuning parameters:

  C      ROC        Sens       Spec     
  1e-05  0.2668214  0.1937576  0.9116364
  1e-04  0.8836749  0.1933939  0.9112727
  1e-03  0.8836446  0.8833030  0.6864545
  1e-02  0.9632826  0.8945152  0.9023333
  1e-01  0.9777766  0.8927879  0.9305455
  1e+00  0.9764396  0.9048182  0.9250606
  1e+01  0.9749635  0.9105152  0.9171212
  1e+02  0.9741568  0.9092121  0.9110909
  1e+03  0.9733584  0.9055758  0.9047576

ROC was used to select the optimal model using the largest value.
The final value used for the model was C = 0.1.

Using KNN, tuning the choice of k plus the type of distance function: the final values used for the model were kmax = 7, distance = 2, and kernel = cos.
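A sketch of the weighted-KNN fit; the tuning grid mirrors the printed output below (kmax 1–7, distance 1–3, rectangular vs. cos kernels):

```r
# Sketch of the kknn fit via caret; the seed is illustrative
set.seed(123)
kknn_fit <- train(sex ~ ., data = train_penguin,
                  method = "kknn",
                  preProcess = c("center", "scale"),
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = expand.grid(kmax = 1:7,
                                         distance = 1:3,
                                         kernel = c("rectangular", "cos")))
```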

kknn_fit
k-Nearest Neighbors 

234 samples
  8 predictor
  2 classes: 'FEMALE', 'MALE' 

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 211, 210, 211, 211, 211, 210, ... 
Resampling results across tuning parameters:

  kmax  kernel       distance  Accuracy   Kappa    
  1     rectangular  1         0.8755435  0.7508191
  1     rectangular  2         0.8759058  0.7515104
  1     rectangular  3         0.8759058  0.7517089
  1     cos          1         0.8755435  0.7508191
  1     cos          2         0.8759058  0.7515104
  1     cos          3         0.8759058  0.7517089
  2     rectangular  1         0.8755435  0.7508191
  2     rectangular  2         0.8759058  0.7515104
  2     rectangular  3         0.8759058  0.7517089
  2     cos          1         0.8755435  0.7508191
  2     cos          2         0.8759058  0.7515104
  2     cos          3         0.8759058  0.7517089
  3     rectangular  1         0.9143116  0.8284392
  3     rectangular  2         0.9143116  0.8281082
  3     rectangular  3         0.8967391  0.7927771
  3     cos          1         0.8969203  0.7935885
  3     cos          2         0.8885870  0.7766558
  3     cos          3         0.9012681  0.8020682
  4     rectangular  1         0.9143116  0.8284392
  4     rectangular  2         0.9143116  0.8281082
  4     rectangular  3         0.8967391  0.7927771
  4     cos          1         0.9012681  0.8023998
  4     cos          2         0.9014493  0.8024129
  4     cos          3         0.9097826  0.8189458
  5     rectangular  1         0.9057971  0.8113609
  5     rectangular  2         0.9144928  0.8285204
  5     rectangular  3         0.8925725  0.7843147
  5     cos          1         0.9097826  0.8196108
  5     cos          2         0.8972826  0.7940796
  5     cos          3         0.8971014  0.7936014
  6     rectangular  1         0.9057971  0.8111629
  6     rectangular  2         0.9144928  0.8285204
  6     rectangular  3         0.9012681  0.8016041
  6     cos          1         0.9097826  0.8196108
  6     cos          2         0.9057971  0.8110296
  6     cos          3         0.8971014  0.7934671
  7     rectangular  1         0.9099638  0.8194962
  7     rectangular  2         0.9184783  0.8363105
  7     rectangular  3         0.9014493  0.8021503
  7     cos          1         0.9141304  0.8282236
  7     cos          2         0.9228261  0.8452535
  7     cos          3         0.9141304  0.8274952

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were kmax = 7, distance = 2 and kernel = cos.

Using a Decision Tree with the basic rpart method: the final value used for the model was cp = 0.01724138.
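A sketch of the decision-tree fit; tuneLength = 3 matches the three cp values in the printed output below:

```r
# Sketch of the CART fit via caret; the seed is illustrative
set.seed(123)
decision_tree <- train(sex ~ ., data = train_penguin,
                       method = "rpart",
                       preProcess = c("center", "scale"),
                       trControl = trainControl(method = "cv", number = 10),
                       tuneLength = 3)
# rpart.plot::rpart.plot(decision_tree$finalModel)  # tree diagram
```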

decision_tree
CART 

234 samples
  8 predictor
  2 classes: 'FEMALE', 'MALE' 

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 212, 210, 210, 211, 211, 210, ... 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa    
  0.01724138  0.8425231  0.6842408
  0.10344828  0.7948452  0.5898477
  0.63793103  0.6629611  0.3172756

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.01724138.

(Plot: the fitted decision tree.)

When comparing the classifiers for predicting the penguin’s sex, SVM performed the best, with a cross-validated ROC of 0.9777766, while KNN was a close second with an accuracy of 0.9228261 (note that caret selected the SVM by ROC and the other two models by accuracy, so these numbers measure slightly different things).

Part 6. Evaluation

After producing the 2 x 2 confusion matrix for the FEMALE and MALE classes, here is the result:

| Prediction | FEMALE | MALE |
|------------|--------|------|
| FEMALE | 45 | 9 |
| MALE | 4 | 41 |

Since the confusion-matrix object only shows statistics for ‘Positive’ Class : FEMALE, I created a second confusion matrix with ‘Positive’ Class : MALE and combined their class statistics into one dataset to easily pull calculations.
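This step might look like the following (a sketch; svm_grid and test_penguin are assumed from the earlier parts):

```r
# Sketch: confusion matrices with each class as the 'positive' level
pred <- predict(svm_grid, test_penguin)
cm_female <- confusionMatrix(pred, test_penguin$sex, positive = "FEMALE")
cm_male   <- confusionMatrix(pred, test_penguin$sex, positive = "MALE")

# Combine the per-class statistics side by side for easy lookup
class_stats <- cbind(FEMALE = cm_female$byClass, MALE = cm_male$byClass)
```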

When calculating precision manually, here is the result:

| Class | Precision |
|-------|-----------|
| FEMALE | 0.8333333 |
| MALE | 0.9111111 |

When calculating recall manually, here is the result:

| Class | Recall |
|-------|--------|
| FEMALE | 0.9183673 |
| MALE | 0.8200000 |
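These values follow directly from the confusion matrix (a self-contained check; rows are predictions, columns are the reference labels):

```r
# Confusion matrix from the table above
cm <- matrix(c(45, 9,
               4, 41),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("FEMALE", "MALE"),
                             Reference  = c("FEMALE", "MALE")))

precision <- diag(cm) / rowSums(cm)  # TP / (TP + FP), per predicted class
recall    <- diag(cm) / colSums(cm)  # TP / (TP + FN), per actual class

round(precision, 7)  # FEMALE 0.8333333, MALE 0.9111111
round(recall, 7)     # FEMALE 0.9183673, MALE 0.8200000
```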

When analyzing the statistics above, MALE appears to have a higher precision and FEMALE appears to have a higher recall.

Precision measures the accuracy of the model’s positive predictions. Recall measures the model’s ability to find all the actual positives.

In summary, high precision is important when a false positive (incorrectly labeling a negative case as positive) is more costly than a missed positive. High recall is important when failing to identify a positive case (a false negative) is more costly than raising a false alarm.

The ROC curve plots Recall or Sensitivity (True Positive Rate) against 1 − Specificity (False Positive Rate). In this result, the area under the ROC curve is high, with AUC: 0.951, indicating good test performance and classification ability: the model can distinguish the positive and negative classes effectively.
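The curve can be produced with the pROC package (a sketch, assuming svm_grid was trained with classProbs = TRUE so class probabilities are available):

```r
library(pROC)

# Sketch: ROC curve from the SVM's predicted class probabilities
probs <- predict(svm_grid, test_penguin, type = "prob")
roc_obj <- roc(response = test_penguin$sex, predictor = probs[, "MALE"])
plot(roc_obj, print.auc = TRUE)
auc(roc_obj)  # reported above as 0.951
```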

(Plot: the ROC curve for roc_obj.)

After performing the analysis to predict penguin gender using the three classifiers above, SVM performed the best, with a ROC of 0.9777766 at C = 0.1. Adjusting the SVM classifier’s train control and train method to plot an ROC curve initially posed a challenge, but figuring out that it is possible really opened my mind to just how vast and powerful a tool R can be for building machine-learning models.

Before performing the analysis, I predicted that the Decision Tree would work best since we are predicting 2 classes, MALE and FEMALE. However, it produced the worst result, with an accuracy of 0.8425231. One major takeaway from testing the Decision Tree classifier is that it used the variable I created, culmen_size_mm2, when classifying between MALE and FEMALE. So my original assumption that the classifier would use this new variable to help with predictions turned out to be true.

When cleaning the data to identify unique values for each categorical variable, I was surprised to find the anomalous entry of “.” in the sex column, which turned out to be a human error. If the data had not been cleaned prior to running the classifiers, this could have been detrimental to their performance. After reflecting on this scenario, it really is true that data cleaning can take 60-80% of the analysis process, and it is a step not to be ignored.