DataGators 👋

Help With Data Science at Allegheny College!

Exploratory Steps with Penguins Data

Introduction to R Programming with the Penguin Dataset

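Let’s work with the penguins dataset, which contains data about three species of penguins in the Palmer Archipelago, Antarctica. For each penguin, the dataset records the species, island, sex, and study year, along with four body measurements: bill length, bill depth, flipper length, and body mass.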

We will perform basic exploratory data analysis (EDA) by calculating correlations and creating visualizations to better understand the relationships between these variables.


Step 1: Install and Load the Required Libraries

First, we’ll need to install and load the palmerpenguins package, which contains the dataset, and load ggplot2, which we use for the plots in Step 4.

# Install the palmerpenguins package 
# (if not already installed)
# install.packages("palmerpenguins")

# Load the required libraries
library(palmerpenguins)
library(ggplot2)

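Explanation: install.packages() only needs to be run once per machine, so it is left commented out. library() must be run in every new R session before a package's functions and data can be used.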


Step 2: Load the Penguin Dataset

We can now load the penguins dataset into R.

# Load the penguins dataset
data("penguins")

# Check the first few rows of the dataset
head(penguins)

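Explanation: data("penguins") makes the dataset available in the current session, and head() prints its first six rows.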

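Output:

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Explanation: Each row describes one penguin. Row 4 has no measurements recorded (NA), so the correlation step below has to handle missing values. Two further columns, sex and year, do not fit in this narrow printout; the exact layout depends on your console width.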
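Beyond head(), it can help to check the size of the dataset and how many values are missing in each column before computing anything. The following is a minimal sketch using base R functions only; the variable names n_rows, n_cols, and missing_per_column are illustrative and not part of the original lesson.

```r
library(palmerpenguins)

# Number of rows (penguins) and columns (variables)
n_rows <- nrow(penguins)
n_cols <- ncol(penguins)

# Count the missing values (NA) in each column
missing_per_column <- colSums(is.na(penguins))

# Show each column's type together with a preview of its values
str(penguins)

print(c(rows = n_rows, columns = n_cols))
print(missing_per_column)
```

Columns with a nonzero count contain missing values, which is why Step 3 passes use = "complete.obs" to cor().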


Step 3: Calculate Correlations

We can calculate the correlation matrix for the four numeric measurements (leaving out categorical columns such as species and island) to explore the relationships between the features.

# Select numeric columns and calculate correlations
numeric_data <- penguins[, c(
  "bill_length_mm", 
  "bill_depth_mm", 
  "flipper_length_mm", 
  "body_mass_g")]
cor(
  numeric_data,
  use = "complete.obs")

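Explanation: We keep only the four measurement columns because cor() works on numeric data. Setting use = "complete.obs" drops any row that contains a missing value before the correlations are computed; without it, the correlations between these columns would come back as NA.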

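Output (rounded to three decimal places):

                  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm             1.000        -0.235             0.656       0.595
bill_depth_mm             -0.235         1.000            -0.584      -0.472
flipper_length_mm          0.656        -0.584             1.000       0.871
body_mass_g                0.595        -0.472             0.871       1.000

Explanation of Output: Each value lies between -1 and 1. Flipper length and body mass have the strongest relationship (0.871): heavier penguins tend to have longer flippers. Bill depth is negatively correlated with the other three measurements. Because all three species are pooled here, these values can hide patterns within each species, which is one reason the plots in Step 4 color the data by species.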
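Pooled correlations can differ from what happens inside each species. As a rough sketch (not part of the original lesson), the code below compares the overall correlation between bill length and bill depth with the same correlation computed separately for each species. The names bills, overall, and by_species are illustrative.

```r
library(palmerpenguins)

# Keep only the rows where both bill measurements are present
bill_cols <- c("bill_length_mm", "bill_depth_mm")
bills <- penguins[complete.cases(penguins[, bill_cols]), ]

# Correlation across all penguins, ignoring species
overall <- cor(bills$bill_length_mm, bills$bill_depth_mm)

# The same correlation computed separately within each species
by_species <- sapply(
  split(bills, bills$species),
  function(d) cor(d$bill_length_mm, d$bill_depth_mm)
)

print(overall)
print(by_species)
```

If the per-species values have a different sign from the overall value, the pooled number mostly reflects differences between species rather than a relationship within them.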


Step 4: Create Plots

Scatter Plot: Bill Length vs Bill Depth

We can create a scatter plot to visualize the relationship between bill length and bill depth.

# Scatter plot of Bill Length vs Bill Depth
ggplot(
  penguins,
  aes(
    x = bill_length_mm,
    y = bill_depth_mm,
    color = species)
  ) +
  geom_point() +
  labs(
    title = "Bill Length vs Bill Depth",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)") +
  theme_minimal()

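Explanation: aes() maps bill length to the x-axis, bill depth to the y-axis, and species to point color. geom_point() draws one point per penguin, labs() sets the title and axis labels, and theme_minimal() applies a clean theme. ggplot2 will warn that it removed 2 rows with missing values; this is expected.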

Output: The plot will show how the bill length and depth vary by species, with different colors representing different species.
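To make the per-species pattern easier to read, one option is to add a straight-line fit for each species on top of the points. This is a sketch rather than part of the original lesson: it stores the plot in a variable (bill_plot, an illustrative name) and adds ggplot2's geom_smooth() with method = "lm".

```r
library(palmerpenguins)
library(ggplot2)

# Same scatter plot as above, stored in a variable so it can be reused
bill_plot <- ggplot(
  penguins,
  aes(x = bill_length_mm, y = bill_depth_mm, color = species)
  ) +
  # na.rm = TRUE drops rows with missing values without a warning
  geom_point(na.rm = TRUE) +
  # One linear fit per species; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE) +
  labs(
    title = "Bill Length vs Bill Depth",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)") +
  theme_minimal()

# Printing the object draws the plot
print(bill_plot)
```

Because color = species is set in the top-level aes(), geom_smooth() fits a separate line for each species instead of a single line through all the points.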


Boxplot: Body Mass by Species

Next, we create a boxplot to compare the distribution of body mass across species.

# Boxplot of Body Mass by Species
ggplot(penguins, aes(
  x = species,
  y = body_mass_g,
  fill = species)) +
  geom_boxplot() +
  labs(
    title = "Body Mass Distribution by Species",
    x = "Species",
    y = "Body Mass (g)"
    ) +
  theme_minimal()

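Explanation: Mapping species to both x and fill gives each species its own colored box. geom_boxplot() computes the median, quartiles, and outliers for each group, and rows with a missing body mass are dropped with a warning.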

Output: This boxplot will display the distribution of body mass for each species, including the median, interquartile range, and outliers.
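The quantities a boxplot draws can also be computed directly, which is handy when you want the numbers rather than the picture. The following is a minimal sketch with base R's tapply(); median_mass and iqr_mass are illustrative names, not part of the original lesson.

```r
library(palmerpenguins)

# Median body mass (g) per species, ignoring missing values
median_mass <- tapply(
  penguins$body_mass_g, penguins$species, median, na.rm = TRUE)

# Interquartile range (g) per species: the height of each box
iqr_mass <- tapply(
  penguins$body_mass_g, penguins$species, IQR, na.rm = TRUE)

print(median_mass)
print(iqr_mass)
```

Each result is a named vector with one entry per species, so the values can be compared against the center line and box height in the plot.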


Pair Plot: Visualizing All Numeric Variables

To explore the relationships between multiple numeric variables, we can create a pair plot.

# Install GGally, which provides ggpairs()
# (if not already installed)
# install.packages("GGally")

library(GGally)

# Create a pair plot of numeric variables
ggpairs(
  penguins,
  columns = c(
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g"),
  mapping = aes(color = species))

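Explanation: ggpairs() from GGally draws every pairwise combination of the selected columns in a single grid, and the color aesthetic splits each panel by species.

Output: This will generate a grid of panels. With the default settings, the panels below the diagonal are scatter plots for each pair of numeric variables, the diagonal shows density plots of the individual variables, and the panels above the diagonal report correlation coefficients, giving a comprehensive view of the relationships between the features.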


Step 5: Conclusion

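In this lesson, we loaded the penguins dataset, inspected its first rows, calculated a correlation matrix for the numeric variables, and visualized the data with a scatter plot, a boxplot, and a pair plot.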

These steps are essential for performing exploratory data analysis (EDA) in R, which helps you understand the structure of your data and identify important patterns.


Next Steps: