DataGators 👋

Help With Data Science at Allegheny College!

Exploratory Steps With Iris Data

Introduction to R Programming with the Iris Dataset

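Let’s work with the built-in iris dataset, which contains measurements for 150 iris flowers, 50 from each of three species (setosa, versicolor, and virginica). Each entry records the flower’s sepal length, sepal width, petal length, and petal width (all in centimeters), along with its species.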

We will go through the steps of loading the dataset, calculating basic correlations, and creating visualizations to explore the relationships between the different features.


Step 1: Load the Iris Dataset

In R, you don’t need to load the iris dataset manually because it is built-in. We can directly access it.

# Load the iris dataset
data(iris)

# Check the first few rows of the dataset
head(iris)

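Explanation: data(iris) makes the built-in dataset available in the current session, and head(iris) prints its first six rows.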

Output:

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           5.1          3.5           1.4          0.2     setosa
2           4.9          3.0           1.4          0.2     setosa
3           4.7          3.2           1.3          0.2     setosa
4           4.6          3.1           1.5          0.2     setosa
5           5.0          3.6           1.4          0.2     setosa
6           5.4          3.9           1.7          0.4     setosa
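Beyond the first few rows, it can help to confirm the overall shape of the data before analyzing it. The following is a minimal sketch using only base R functions; the variable names iris_dim and species_counts are introduced here for illustration and are not part of the original lesson.

```r
# Dimensions: number of rows (flowers) and columns (variables)
iris_dim <- dim(iris)
print(iris_dim)

# Column types: four numeric measurements and one factor (Species)
str(iris)

# Number of flowers recorded for each species
species_counts <- table(iris$Species)
print(species_counts)

# Minimum, quartiles, median, mean, and maximum of each numeric column
print(summary(iris[, 1:4]))
```

If the counts per species are equal, as they should be here, comparisons between species in the later steps are not skewed by group size.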

Step 2: Calculate Correlations

We can calculate the correlation between numeric variables in the dataset to explore how they are related.

# Calculate correlations between 
# numeric features (excluding the Species column)
cor(iris[, 1:4])

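Explanation: iris[, 1:4] selects the four numeric columns, and cor() returns the matrix of pairwise Pearson correlation coefficients (Pearson is the default method).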

Output:

               Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        1.000000   -0.117570     0.871754     0.817941
Sepal.Width        -0.117570    1.000000    -0.428440    -0.366125
Petal.Length        0.871754   -0.428440     1.000000     0.962865
Petal.Width         0.817941   -0.366125     0.962865     1.000000

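Explanation of Output: Petal length and petal width are very strongly correlated (about 0.96), and sepal length is strongly correlated with both petal measurements (about 0.87 and 0.82). Sepal width is negatively correlated with the other three variables, though only weakly with sepal length (about -0.12).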
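The correlation matrix above pools all three species, which can hide patterns that hold within each group. As a rough sketch (the names by_species, within_cor, and overall_cor are illustrative, not from the original lesson), the correlation between sepal length and sepal width can be computed separately for each species and compared with the pooled value:

```r
# Split the data frame into a list of three data frames, one per species
by_species <- split(iris, iris$Species)

# Pearson correlation of sepal length and sepal width within each species
within_cor <- sapply(by_species, function(d) {
  cor(d$Sepal.Length, d$Sepal.Width)
})

# The same correlation computed over all 150 flowers together
overall_cor <- cor(iris$Sepal.Length, iris$Sepal.Width)

print(round(within_cor, 2))
print(round(overall_cor, 2))
```

The pooled correlation is slightly negative, while the correlation within each species is positive. This is one reason the plots in the next step color the points by species.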


Step 3: Create Basic Plots

Scatter Plot: Sepal Length vs Petal Length

We can plot Sepal Length against Petal Length to visualize their relationship.

# Scatter plot of Sepal Length vs Petal Length
plot(iris$Sepal.Length, iris$Petal.Length,
     main = "Scatter Plot of Sepal Length vs Petal Length",
     xlab = "Sepal Length",
     ylab = "Petal Length",
     pch = 19,
     col = iris$Species)

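Explanation: pch = 19 draws solid circles, and col = iris$Species colors each point by species, using the factor’s integer codes (1, 2, 3) as indices into the current color palette.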

Output: A scatter plot in which each point represents a flower and the color indicates its species.
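Because the colors come from the species factor codes, the plot does not say which color belongs to which species. One way to address this, sketched here with base R’s legend() function, is to add a legend that uses the same codes. The names species_levels and species_codes are introduced for illustration only.

```r
# Species names in factor-level order, and their integer codes (1, 2, 3)
species_levels <- levels(iris$Species)
species_codes <- seq_along(species_levels)

# Same scatter plot as above, with colors given explicitly as integer codes
plot(iris$Sepal.Length, iris$Petal.Length,
     main = "Scatter Plot of Sepal Length vs Petal Length",
     xlab = "Sepal Length",
     ylab = "Petal Length",
     pch = 19,
     col = as.integer(iris$Species))

# Legend mapping each color code back to its species name
legend("topleft",
       legend = species_levels,
       col = species_codes,
       pch = 19,
       title = "Species")
```

The legend is placed in the top-left corner on the assumption that this region is mostly empty for these two variables; a different position such as "bottomright" may suit other variable pairs better.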


Boxplot: Sepal Length by Species

A boxplot allows us to compare the distribution of sepal length across different species.

# Boxplot of Sepal Length grouped by Species
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Boxplot of Sepal Length by Species",
        xlab = "Species",
        ylab = "Sepal Length",
        col = c("lightblue", "lightgreen", "lightcoral"))

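Explanation: The formula Sepal.Length ~ Species groups the sepal lengths by species, and col assigns one fill color to each box.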

Output: This boxplot will display three boxes, one for each species, showing the range, median, and interquartile range of Sepal Length.
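The values that the boxplot draws can also be inspected numerically. The following sketch uses base R’s tapply() to compute the median and interquartile range of sepal length for each species; the names sepal_medians and sepal_iqrs are illustrative.

```r
# Median sepal length for each species (the thick line inside each box)
sepal_medians <- tapply(iris$Sepal.Length, iris$Species, median)

# Interquartile range for each species (approximately the height of each box)
sepal_iqrs <- tapply(iris$Sepal.Length, iris$Species, IQR)

print(sepal_medians)
print(sepal_iqrs)
```

Comparing these numbers with the plot is a quick check that the boxes are being read correctly.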


Pair Plot: Visualizing Relationships Between All Variables

To explore the relationships between all the numeric variables simultaneously, we can use a pair plot (also known as a scatterplot matrix).

# Install the GGally package if it is not already installed
# install.packages("GGally")
library(GGally)
library(ggplot2)  # provides aes()

# Create a pair plot of the four numeric variables, colored by species
ggpairs(iris, columns = 1:4,
        mapping = aes(color = Species))

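Explanation: ggpairs() draws a panel for every pairwise combination of the selected columns, and the color aesthetic splits each panel by species.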

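Output: This will generate a matrix of panels. With the GGally defaults for numeric columns, these are scatter plots below the diagonal, density plots on the diagonal, and correlation coefficients above it, making it easy to visually inspect the relationships between sepal length, sepal width, petal length, and petal width.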
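If installing GGally is not an option, base R’s pairs() function offers a simpler scatterplot matrix with no extra packages. This is a minimal sketch, not a full replacement: it shows scatter plots only, without the diagonal distributions or correlation values. The name numeric_cols is illustrative.

```r
# Select the four numeric measurement columns
numeric_cols <- iris[, 1:4]

# Scatterplot matrix colored by species (factor codes 1, 2, 3)
pairs(numeric_cols,
      main = "Scatterplot Matrix of Iris Measurements",
      pch = 19,
      col = as.integer(iris$Species))
```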


Step 4: Conclusion

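In this lesson, we have loaded the iris dataset, calculated the correlations between its numeric features, and visualized the data with a scatter plot, a boxplot, and a pair plot.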

This exercise introduces basic analysis and visualization techniques in R, which are essential for data exploration in data science.


Next Steps: