Lesson_2

Data Cleaning and Visualization in R

📘 Overview

Data cleaning is a crucial first step in any data analysis project. A commonly cited estimate is that about 80% of a data scientist’s work involves data preparation, with around 60% of that time spent cleaning and organizing data.

In this lesson, we will use the nycflights13::flights dataset to learn practical data cleaning and visualization techniques. This dataset contains 336,776 flights that departed from New York City in 2013.

Info

Expected Duration: 45 minutes
This is an introductory lesson - no prior R knowledge required!

Required packages:

  • nycflights13
  • dplyr
  • tidyr
  • ggplot2

📦 Package Installation

First, let’s install and load the required packages:

# Install necessary packages
install.packages("nycflights13")
install.packages("dplyr")
install.packages("tidyr")
install.packages("ggplot2")

# Load libraries
library(nycflights13)
library(dplyr)
library(tidyr)
library(ggplot2)

📌 Dataset Overview

The flights dataset contains:

  • Date & Time: year, month, day, dep_time, sched_dep_time
  • Delays: dep_delay, arr_delay
  • Flight Details: carrier, flight, tailnum
  • Route Info: origin, dest, distance, air_time

# Load and inspect the data
data("flights")
dim(flights)   # number of rows and columns
head(flights)  # first six rows

🛠 Data Cleaning Steps

1. Handling Missing Values

# Count NAs in delay columns
sum(is.na(flights$dep_delay))  # NAs in departure delays
sum(is.na(flights$arr_delay))  # NAs in arrival delays

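# Option 1: Remove rows with a missing value in ANY column
flights_no_na <- na.omit(flights)

# Option 2: Replace NAs with zeros
# (use with care: a missing delay often means the flight was cancelled
# or diverted, so a zero would count it as on time)
flights_filled <- flights %>%
    mutate(dep_delay = replace_na(dep_delay, 0),
           arr_delay = replace_na(arr_delay, 0))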
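
The two options above treat every missing value the same way. As a minimal sketch, assuming a small made-up data frame (toy_flights, which is not part of nycflights13) with the same delay columns, the following counts NAs in every column at once and drops rows only when specific columns are missing, using drop_na() from tidyr:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data that mimics three columns of flights
toy_flights <- tibble(
    dep_time  = c(517, NA, 542, 554),
    dep_delay = c(2, NA, 2, -6),
    arr_delay = c(11, NA, NA, -25)
)

# Count missing values in every column at once
na_counts <- colSums(is.na(toy_flights))
print(na_counts)

# Drop rows only where dep_delay is missing (arr_delay NAs are kept)
toy_dep_known <- toy_flights %>% drop_na(dep_delay)

# Drop rows where either delay column is missing
toy_both_known <- toy_flights %>% drop_na(dep_delay, arr_delay)

nrow(toy_dep_known)
nrow(toy_both_known)
```

Naming the columns in drop_na() keeps rows that are incomplete only in columns you do not need, which na.omit() would discard.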

2. Removing Duplicates

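# Check for duplicate rows by comparing row counts
nrow(flights)           # before removing duplicates
nrow(distinct(flights)) # after removing duplicates

# Keep only the unique rows
flights_unique <- distinct(flights)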
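
To see what distinct() does when duplicates are actually present, here is a small sketch on made-up data (the toy tibble is hypothetical and only borrows column names from flights). It also shows base R's duplicated() for flagging repeated rows, and the .keep_all argument of distinct() for keeping one row per key:

```r
library(dplyr)

# Hypothetical toy data: row 3 repeats row 1 exactly
toy <- tibble(
    carrier = c("UA", "AA", "UA", "UA"),
    flight  = c(1545, 1141, 1545, 1714),
    origin  = c("EWR", "JFK", "EWR", "LGA")
)

# Flag rows that are exact copies of an earlier row (base R)
dup_flag <- duplicated(toy)
sum(dup_flag)

# Keep one copy of each fully identical row
toy_unique <- distinct(toy)

# Keep the first row for each carrier, retaining the other columns
one_per_carrier <- distinct(toy, carrier, .keep_all = TRUE)

nrow(toy_unique)
nrow(one_per_carrier)
```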

3. Improving Data Format

# Clean up the data structure
flights_clean <- flights %>%
    filter(!is.na(dep_time)) %>%              # drop rows with no recorded departure time
    left_join(airlines, by = "carrier") %>%   # add full airline names
    rename(airline_name = name,
           carrier_code = carrier) %>%
    mutate(                                   # store categories as factors
        origin = factor(origin),
        dest = factor(dest),
        carrier_code = factor(carrier_code)
    )
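
After a join and a rename it is worth checking that nothing unexpected happened. The sketch below repeats the same steps on two made-up tables (toy_flights and toy_airlines are hypothetical stand-ins for flights and airlines) and then checks the row count, the matched names, and the factor levels:

```r
library(dplyr)

# Hypothetical toy tables standing in for flights and airlines
toy_flights <- tibble(
    carrier = c("UA", "AA", "B6"),
    origin  = c("EWR", "JFK", "JFK")
)
toy_airlines <- tibble(
    carrier = c("UA", "AA", "B6"),
    name    = c("United Air Lines Inc.", "American Airlines Inc.", "JetBlue Airways")
)

toy_clean <- toy_flights %>%
    left_join(toy_airlines, by = "carrier") %>%
    rename(airline_name = name, carrier_code = carrier) %>%
    mutate(origin = factor(origin))

# A left join on a unique key should not change the row count
nrow(toy_clean) == nrow(toy_flights)

# Every flight should have found a matching airline name
anyNA(toy_clean$airline_name)

# Factor levels list the distinct categories
levels(toy_clean$origin)
```

If the row count grows after a left join, the lookup table probably contains the same key more than once.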

📊 Data Visualization

1. Departure vs. Arrival Delays

ggplot(flights_clean, aes(x = dep_delay, y = arr_delay)) +
    geom_point(alpha = 0.2) +
    labs(title = "Departure vs. Arrival Delay",
         x = "Departure Delay (minutes)",
         y = "Arrival Delay (minutes)")
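
A scatter plot shows the relationship visually; a correlation coefficient puts a number on it. This is a minimal sketch on made-up values (the toy data frame is hypothetical), using base R's cor() with use = "complete.obs" so that rows containing an NA are skipped:

```r
# Hypothetical toy delays in minutes; one departure delay is missing
toy <- data.frame(
    dep_delay = c(0, 10, 20, NA),
    arr_delay = c(5, 15, 25, 40)
)

# use = "complete.obs" drops any row with an NA before computing
delay_cor <- cor(toy$dep_delay, toy$arr_delay, use = "complete.obs")
print(delay_cor)
```

Without the use argument, cor() returns NA as soon as either column contains a missing value.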

2. Delay Distribution

ggplot(flights_clean, aes(x = dep_delay)) +
    geom_histogram(binwidth = 15, fill = "skyblue", color = "black") +
    labs(title = "Distribution of Departure Delays",
         x = "Departure Delay (minutes)",
         y = "Number of Flights")
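
The histogram can be complemented with a few numbers. As a sketch on made-up delays (the vector and the bin boundaries below are illustrative assumptions, not values from the lesson), this computes the share of flights departing more than 15 minutes late and counts flights per delay range with base R's cut() and table():

```r
# Hypothetical departure delays in minutes, including one missing value
dep_delay <- c(-5, 0, 20, 45, NA, 130)

# Share of flights with a known delay that left more than 15 minutes late
share_late <- mean(dep_delay > 15, na.rm = TRUE)
print(share_late)

# Group delays into ranges; each interval includes its upper bound
delay_bins <- cut(
    dep_delay,
    breaks = c(-Inf, 0, 15, 60, Inf),
    labels = c("early or on time", "1-15 min", "16-60 min", "over 60 min")
)

# table() leaves out the NA by default
bin_counts <- table(delay_bins)
print(bin_counts)
```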

3. Delays by Airport

ggplot(flights_clean, aes(x = origin, y = arr_delay)) +
    geom_boxplot(fill = "orange") +
    labs(title = "Arrival Delays by Origin Airport",
         x = "Origin Airport",
         y = "Arrival Delay (minutes)")
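
A boxplot is easier to read next to the numbers behind it. The following sketch, on made-up data (the toy tibble is hypothetical), uses group_by() and summarise() from dplyr to compute the number of flights and the median and mean arrival delay for each origin airport:

```r
library(dplyr)

# Hypothetical toy data with one missing arrival delay
toy <- tibble(
    origin    = c("EWR", "EWR", "JFK", "JFK", "LGA"),
    arr_delay = c(10, 30, -5, NA, 20)
)

delay_summary <- toy %>%
    group_by(origin) %>%
    summarise(
        n_flights    = n(),                             # rows per airport, NAs included
        median_delay = median(arr_delay, na.rm = TRUE), # the line inside each box
        mean_delay   = mean(arr_delay, na.rm = TRUE)
    )

print(delay_summary)
```

The na.rm = TRUE argument matters here: without it, any group containing an NA would get NA as its summary.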

4. Monthly Flight Patterns

flights_per_month <- flights_clean %>%
    count(month)

ggplot(flights_per_month, aes(x = month, y = n)) +
    geom_line(color = "blue") +
    geom_point() +
    scale_x_continuous(breaks = 1:12) +  # one tick per month instead of fractional ticks
    labs(title = "Flights per Month (2013)",
         x = "Month",
         y = "Number of Flights")
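
The month column is stored as the numbers 1 to 12. If you prefer month names on the axis, one possible approach is sketched below on made-up data (the toy tibble is hypothetical). It uses R's built-in month.abb vector and an ordered set of factor levels so the months stay in calendar order rather than alphabetical order:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per flight, months as numbers
toy <- tibble(month = c(1, 1, 2, 3, 3, 3))

toy_per_month <- toy %>%
    count(month) %>%
    mutate(month_label = factor(month.abb[month], levels = month.abb))

# group = 1 tells geom_line to connect points across a categorical axis
p <- ggplot(toy_per_month, aes(x = month_label, y = n, group = 1)) +
    geom_line(color = "blue") +
    geom_point() +
    labs(title = "Flights per Month (toy data)",
         x = "Month",
         y = "Number of Flights")

print(toy_per_month)
```

Typing p at the console (or calling print(p)) draws the plot.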

✅ Key Takeaways

  • Learned to identify and handle missing values using na.omit() and replace_na()
  • Checked for duplicate records using distinct()
  • Improved data structure with rename() and mutate()
  • Created various visualizations to explore the data using ggplot2

🚀 Practice Exercise

Try the code yourself in our interactive notebook:

Open in Google Colab

Info

Important: After clicking the link above, click the “Copy to Drive” button in Colab to create your own editable copy of the notebook.