Exploratory Steps With Iris Data
Introduction to R Programming with the Iris Dataset
Let’s work with the built-in iris
dataset, which contains data on 150 different species of iris flowers. Each entry contains the following measurements for the flowers:
- Sepal length
- Sepal width
- Petal length
- Petal width
- Species
We will go through the steps of loading the dataset, calculating basic correlations, and creating visualizations to explore the relationships between the different features.
Step 1: Load the Iris Dataset
In R, you don’t need to load the iris
dataset manually because it is built-in. We can directly access it.
# Load the iris dataset
data(iris)
# Check the first few rows of the dataset
head(iris)
Explanation:
data(iris)
loads the dataset into memory.head(iris)
displays the first six rows of theiris
dataset to give us an idea of its structure.
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Step 2: Calculate Correlations
We can calculate the correlation between numeric variables in the dataset to explore how they are related.
# Calculate correlations between
# numeric features (excluding the Species column)
cor(iris[, 1:4])
Explanation:
iris[, 1:4]
selects all rows and the first four columns (which contain the numeric data: sepal length, sepal width, petal length, and petal width).cor()
computes the correlation matrix, which measures the linear relationships between pairs of variables.
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.000000 -0.117570 0.871754 0.817941
Sepal.Width -0.117570 1.000000 -0.428440 -0.366125
Petal.Length 0.871754 -0.428440 1.000000 0.962865
Petal.Width 0.817941 -0.366125 0.962865 1.000000
Explanation of Output:
The correlation values range from -1 to 1, where:
1
indicates a perfect positive correlation.-1
indicates a perfect negative correlation.0
indicates no linear correlation.
For example, Sepal Length and Petal Length have a high positive correlation (
0.87
), while Sepal Width and Petal Length have a moderate negative correlation (-0.43
).
Step 3: Create Basic Plots
Scatter Plot: Sepal Length vs Petal Length
We can plot Sepal Length against Petal Length to visualize their relationship.
# Scatter plot of Sepal Length vs Petal Length
plot(iris$Sepal.Length, iris$Petal.Length,
main = "
Scatter Plot of Sepal Length vs Petal Length",
xlab = "Sepal Length",
ylab = "Petal Length",
pch = 19,
col = iris$Species)
Explanation:
plot()
is used to create a scatter plot. Thepch = 19
argument sets the point type (solid circle), andcol = iris$Species
colors the points according to the flower species.main
,xlab
, andylab
set the title and axis labels for the plot.
Output: This plot will show a scatter plot where each point represents a flower, and the color indicates the species.
Boxplot: Sepal Length by Species
A boxplot allows us to compare the distribution of sepal length across different species.
# Boxplot of Sepal Length grouped by Species
boxplot(Sepal.Length ~ Species, data = iris,
main =
"Boxplot of Sepal Length by Species",
xlab = "Species",
ylab = "Sepal Length",
col = c(
"lightblue",
"lightgreen",
"lightcoral")
)
Explanation:
Sepal.Length ~ Species
specifies that we want to plot Sepal Length for each species.boxplot()
creates the boxplot, and thecol
argument specifies the colors for each species group.
Output: This boxplot will display three boxes, one for each species, showing the range, median, and interquartile range of Sepal Length.
Pair Plot: Visualizing Relationships Between All Variables
To explore the relationships between all the numeric variables simultaneously, we can use a pair plot (also known as a scatterplot matrix).
# Install and load the GGally
# package if not already installed
# install.packages("GGally")
library(GGally)
# Create a pair plot of the numeric variables
ggpairs(iris[, 1:4],
aes(color = iris$Species))
Explanation:
- The
ggpairs()
function from theGGally
package creates a matrix of scatter plots, showing the pairwise relationships between all numeric columns in theiris
dataset. - The
aes(color = iris$Species)
argument colors the points by species, making it easier to distinguish between the species in the plots.
Output: This will generate a matrix of scatter plots, histograms, and correlations, making it easy to visually inspect the relationships between sepal length, sepal width, petal length, and petal width.
Step 4: Conclusion
In this lesson, we have:
- Loaded and explored the
iris
dataset. - Calculated correlations between numeric variables.
- Created scatter plots, boxplots, and pair plots to visualize relationships between the features.
This exercise helps in understanding both the basic analysis and visualization techniques in R, which are essential for data exploration in data science.
Next Steps:
- You can try modifying the plots to explore other variables or try additional visualizations like histograms, density plots, or heatmaps.
- Experiment with different
ggplot2
functions for more advanced visualizations.