DataGators 👋

Help With Data Science at Allegheny College!

Exploratory Steps With MPG Data

Let’s explore the mpg dataset from the ggplot2 package in R. This lesson will cover how to load the dataset, calculate basic correlations, and create various plots, with detailed explanations for each step.


Introduction to R Programming with the MPG Dataset

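In this lesson, we will work with the mpg dataset, which is included in the ggplot2 package. The dataset contains information about the fuel efficiency (miles per gallon) of various car models, with additional details about each car such as the manufacturer and model, engine displacement (displ), model year (year), number of cylinders (cyl), transmission type (trans), drive train (drv), fuel type (fl), and vehicle class (class).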

We will load the dataset, explore basic correlations, and visualize the data using plots.


Step 1: Install and Load Required Libraries

To work with the mpg dataset, we first need to install and load the ggplot2 package. If it’s already installed, we can simply load it.

# Install ggplot2 package if not already installed
# install.packages("ggplot2")

# Load the ggplot2 package
library(ggplot2)

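Explanation: install.packages("ggplot2") downloads and installs the package; it is commented out here because it only needs to run once per R installation. library(ggplot2) loads the package into the current session, which makes its plotting functions and bundled datasets, including mpg, available.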
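If you are not sure whether ggplot2 is already installed, one common pattern is to install it only when it is missing. This is a minimal sketch; it assumes your R session has internet access and a CRAN mirror configured.

```r
# Install ggplot2 only when it is not already available, then load it
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

library(ggplot2)
```

Because the check runs every time, the same script works on a fresh machine and on one where the package is already installed.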


Step 2: Load the MPG Dataset

The mpg dataset is included in ggplot2 by default, so we can load it without any extra steps.

# Load the mpg dataset
data("mpg")

# Check the first few rows of the dataset
head(mpg)

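Explanation: data("mpg") makes the mpg dataset available in the current workspace. head(mpg) prints the first six rows, which is a quick way to check the column names and the kind of values each column holds.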

Output:

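# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compact
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compact
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compact
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compact
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compact
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compact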

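Explanation: Each row describes one car configuration, and the first six rows are all variants of the Audi A4. The displ column gives engine displacement in liters, while cty and hwy give fuel efficiency in miles per gallon for city and highway driving.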
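Beyond head(), a few base R functions give a fuller picture of the dataset's size and structure. The sketch below assumes ggplot2 is installed and uses only base R functions on the mpg data.

```r
library(ggplot2)

# Number of rows and columns
dim(mpg)

# Column names and types, with a preview of the values in each column
str(mpg)

# Minimum, quartiles, mean, and maximum for the numeric columns used later
summary(mpg[, c("displ", "cty", "hwy")])
```

Checking the column types early is useful because it shows which variables are numeric (suitable for correlations) and which are categorical (suitable for grouping).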


Step 3: Calculate Correlations

We can calculate the correlation between numeric variables such as displ (engine displacement), cty (city mileage), and hwy (highway mileage).

# Calculate correlations
# between numeric variables (displ, cty, hwy)
cor(mpg[, c("displ", "cty", "hwy")])

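Explanation: mpg[, c("displ", "cty", "hwy")] selects the three numeric columns of interest. cor() then computes the Pearson correlation (its default method) for every pair of columns and returns the results as a matrix.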

Output (rounded to three decimal places):

       displ    cty    hwy
displ  1.000 -0.799 -0.766
cty   -0.799  1.000  0.956
hwy   -0.766  0.956  1.000

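Explanation of Output: The diagonal entries are 1 because every variable is perfectly correlated with itself. displ is strongly negatively correlated with both cty and hwy, meaning cars with larger engines tend to get fewer miles per gallon. cty and hwy are strongly positively correlated, so cars that are efficient in the city tend to be efficient on the highway as well.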
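To look at a single pair of variables in more detail, cor() can also be called on two vectors, and cor.test() adds a significance test and confidence interval. This is a small sketch using base R's stats functions; the variable names r_displ_hwy, rho_displ_hwy, and test_result are chosen here for illustration.

```r
library(ggplot2)

# Pearson correlation for one pair of variables
r_displ_hwy <- cor(mpg$displ, mpg$hwy)
round(r_displ_hwy, 3)

# Spearman (rank-based) correlation, less sensitive to outliers and curvature
rho_displ_hwy <- cor(mpg$displ, mpg$hwy, method = "spearman")
round(rho_displ_hwy, 3)

# Significance test and 95% confidence interval for the Pearson correlation
test_result <- cor.test(mpg$displ, mpg$hwy)
test_result$estimate
test_result$conf.int
```

Comparing the Pearson and Spearman values is a quick check on whether the relationship is roughly linear or only monotonic.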


Step 4: Create Visualizations

Scatter Plot: Engine Displacement vs Highway Mileage

A scatter plot is a great way to visualize the relationship between two continuous variables. Let’s plot displ (engine displacement) against hwy (highway mileage).

# Scatter plot of Displacement vs Highway Mileage
ggplot(mpg, aes(x = displ, y = hwy)) +
  # Color points by car class
  geom_point(aes(color = class)) +
  labs(
    title = "Engine Displacement vs Highway Mileage",
    x = "Engine Displacement (L)",
    y = "Highway Mileage (mpg)"
  ) +
  theme_minimal()

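Explanation: ggplot(mpg, aes(x = displ, y = hwy)) sets the data and maps displ to the x-axis and hwy to the y-axis. geom_point(aes(color = class)) draws one point per car and colors it by vehicle class. labs() sets the title and axis labels, and theme_minimal() applies a clean, uncluttered theme.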

Output: This plot will display a scatter plot of engine displacement versus highway mileage, with points colored by car class.
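A trend line can make the direction of the relationship easier to see. The sketch below adds a straight-line fit with geom_smooth(); the choice of a linear model and the object name p_trend are assumptions made for illustration, not part of the original lesson.

```r
library(ggplot2)

p_trend <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  # One straight-line fit across all cars, with a shaded confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "black") +
  labs(
    title = "Engine Displacement vs Highway Mileage",
    x = "Engine Displacement (L)",
    y = "Highway Mileage (mpg)"
  ) +
  theme_minimal()

p_trend
```

Because the color mapping sits inside geom_point() rather than ggplot(), the smoother fits a single line to all cars instead of one line per class.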


Boxplot: Highway Mileage by Cylinder Count

A boxplot is useful for comparing the distribution of a numeric variable across different categories. We can use it to compare hwy (highway mileage) across different values of cyl (number of cylinders).

# Boxplot of Highway Mileage by Number of Cylinders
ggplot(mpg, aes(x = factor(cyl), y = hwy, fill = factor(cyl))) +
  geom_boxplot() +
  labs(
    title = "Highway Mileage by Number of Cylinders",
    x = "Number of Cylinders",
    y = "Highway Mileage (mpg)"
  ) +
  theme_minimal()

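Explanation: factor(cyl) tells ggplot2 to treat the cylinder count as a category rather than a continuous number, so each distinct value gets its own box. geom_boxplot() draws the boxes, and fill = factor(cyl) gives each cylinder group a different fill color.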

Output: The boxplot will display how highway mileage varies across different cylinder categories, showing the median, interquartile range, and potential outliers.
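The numbers behind the boxplot can also be computed directly, which helps when reading the plot. This is a minimal sketch with base R; median_hwy and cyl_counts are names introduced here for illustration.

```r
library(ggplot2)

# Median highway mileage for each cylinder count (the line inside each box)
median_hwy <- tapply(mpg$hwy, mpg$cyl, median)
median_hwy

# Number of cars in each cylinder group
cyl_counts <- table(mpg$cyl)
cyl_counts
```

Checking the group sizes matters because a box built from only a handful of cars is a much less reliable summary than one built from dozens.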


Histogram: Distribution of City Mileage

We can create a histogram to visualize the distribution of cty (city mileage).

# Histogram of City Mileage
ggplot(mpg, aes(x = cty)) +
  geom_histogram(
    binwidth = 1,
    fill = "skyblue",
    color = "black"
  ) +
  labs(
    title = "Distribution of City Mileage",
    x = "City Mileage (mpg)",
    y = "Frequency"
  ) +
  theme_minimal()

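Explanation: geom_histogram(binwidth = 1) groups the cty values into bins that are 1 mpg wide and draws one bar per bin, with the bar height showing how many cars fall in that bin. fill sets the bar color and color sets the bar outline.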

Output: This histogram will display how city mileage is distributed in the dataset, showing the frequency of different mileage values.
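The bin width strongly affects how a histogram looks, so it is worth comparing it with the raw counts and trying a second width. The sketch below is one way to do that; the wider bin width of 2 and the names cty_counts and p_wide are arbitrary choices for illustration.

```r
library(ggplot2)

# Number of cars at each distinct city mileage value
cty_counts <- table(mpg$cty)
cty_counts

# Numeric summary of the same variable
summary(mpg$cty)

# Same histogram with wider bins, which smooths out small bumps
p_wide <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  labs(
    title = "Distribution of City Mileage (bin width 2)",
    x = "City Mileage (mpg)",
    y = "Frequency"
  ) +
  theme_minimal()

p_wide
```

Since cty holds whole numbers, a bin width of 1 gives roughly one bar per mileage value, while wider bins trade that detail for a smoother overall shape.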


Step 5: Conclusion

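In this lesson, we have installed and loaded the ggplot2 package, loaded and previewed the mpg dataset, calculated correlations between displ, cty, and hwy, and created a scatter plot, a boxplot, and a histogram.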

These techniques are fundamental for performing exploratory data analysis (EDA) in R, which is crucial for understanding your data before moving on to more complex analyses.


Next Steps: