Lesson_2
Data Cleaning and Visualization in R
📘 Overview
Data cleaning is a crucial first step in any data analysis project. A commonly cited estimate is that about 80% of a data scientist’s work is data preparation, with roughly 60% of that time spent cleaning and organizing data.
In this lesson, we will use the nycflights13::flights dataset to learn practical data cleaning and visualization techniques. This dataset contains 336,776 flights that departed from New York City's three main airports (EWR, JFK, and LGA) in 2013.
Info
Expected Duration: 45 minutes
This is an introductory lesson - no prior R knowledge required!
Required packages:
- nycflights13
- dplyr
- tidyr
- ggplot2
📦 Package Installation
First, let’s install and load the required packages:
# Install necessary packages
install.packages("nycflights13")
install.packages("dplyr")
install.packages("tidyr")
install.packages("ggplot2")
# Load libraries
library(nycflights13)
library(dplyr)
library(tidyr)
library(ggplot2)
📌 Dataset Overview
The flights dataset contains:
- Date & Time: year, month, day, dep_time, sched_dep_time
- Delays: dep_delay, arr_delay
- Flight Details: carrier, flight, tailnum
- Route Info: origin, dest, distance, air_time
# Load and inspect the data
data("flights")
dim(flights)
head(flights)
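To look at only the columns listed above along with their types, a minimal sketch like the following may help. It assumes the packages from the installation step are available; flights_subset is a name introduced here purely for illustration.

```r
library(nycflights13)
library(dplyr)

# Keep only the columns described in the overview, grouped the same way
flights_subset <- flights %>%
  select(year, month, day, dep_time, sched_dep_time,  # date & time
         dep_delay, arr_delay,                        # delays
         carrier, flight, tailnum,                    # flight details
         origin, dest, distance, air_time)            # route info

# Print one line per column: name, type, and the first few values
glimpse(flights_subset)
```

glimpse() is often easier to read than head() for wide tables, because every column is shown even when the table does not fit on screen.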
🛠 Data Cleaning Steps
1. Handling Missing Values
# Count NAs in delay columns
sum(is.na(flights$dep_delay)) # NAs in departure delays
sum(is.na(flights$arr_delay)) # NAs in arrival delays
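# Option 1: Remove every row that has an NA in any column
flights_no_na <- na.omit(flights)
# Option 2: Replace NAs in the delay columns with zeros
# (this treats flights with no recorded delay as on time, so use it with care)
flights_filled <- flights %>%
  mutate(dep_delay = replace_na(dep_delay, 0),
         arr_delay = replace_na(arr_delay, 0))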
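Before choosing between the two options, it can help to see where the missing values are and how each choice changes a summary statistic. The sketch below is one possible check, not a prescribed workflow; na_counts, mean_dropped, and mean_filled are illustrative names. In this dataset a missing departure delay is often interpreted as a cancelled flight, which is an assumption worth keeping in mind when filling with zeros.

```r
library(nycflights13)
library(tidyr)

# Number of missing values in every column, largest first
na_counts <- sort(colSums(is.na(flights)), decreasing = TRUE)
na_counts[na_counts > 0]

# Mean departure delay when NAs are ignored (similar in spirit to Option 1)
mean_dropped <- mean(flights$dep_delay, na.rm = TRUE)

# Mean departure delay when NAs are replaced with zero (Option 2)
mean_filled <- mean(replace_na(flights$dep_delay, 0))

c(dropped = mean_dropped, filled = mean_filled)
```

Filling with zeros keeps the same total delay but divides it by more flights, so the filled mean is pulled toward zero. Which behavior is appropriate depends on the question being asked.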
2. Removing Duplicates
# Check for duplicates: distinct() drops repeated rows,
# so equal row counts mean there are no exact duplicates
nrow(flights) # before
nrow(distinct(flights)) # after
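distinct() with no arguments only catches rows that are identical in every column. A stricter check looks for repeated values in the columns that should identify a single flight. The sketch below assumes, for illustration only, that time_hour, carrier, and flight together form such a key; dup_keys and flights_unique are hypothetical names.

```r
library(nycflights13)
library(dplyr)

# Assumed key: scheduled hour + carrier + flight number
dup_keys <- flights %>%
  count(time_hour, carrier, flight, name = "n_rows") %>%
  filter(n_rows > 1)

# Zero rows here would mean no key appears more than once
nrow(dup_keys)

# Keep the first row for each key while retaining all other columns
flights_unique <- flights %>%
  distinct(time_hour, carrier, flight, .keep_all = TRUE)

nrow(flights_unique)
```

The .keep_all = TRUE argument matters: without it, distinct() would return only the three key columns.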
3. Improving Data Format
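# Clean up the data structure
flights_clean <- flights %>%
  # Drop flights with no recorded departure time
  filter(!is.na(dep_time)) %>%
  # Add the full airline name from the airlines lookup table
  left_join(airlines, by = "carrier") %>%
  rename(airline_name = name,
         carrier_code = carrier) %>%
  # Store categorical columns as factors
  mutate(
    origin = factor(origin),
    dest = factor(dest),
    carrier_code = factor(carrier_code)
  )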
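After a pipeline like this, a few quick checks can confirm that the join and type conversions behaved as intended. The following sketch rebuilds the cleaned table so it runs on its own (using across() as a shorthand for the three factor() calls), then tests three assumptions: the join added no rows, every carrier found a name, and origin became a factor.

```r
library(nycflights13)
library(dplyr)

# Rebuild the cleaned table so this snippet is self-contained
flights_clean <- flights %>%
  filter(!is.na(dep_time)) %>%
  left_join(airlines, by = "carrier") %>%
  rename(airline_name = name, carrier_code = carrier) %>%
  mutate(across(c(origin, dest, carrier_code), factor))

# A left join against a lookup table with one row per carrier
# should leave the row count unchanged
nrow(flights_clean) == sum(!is.na(flights$dep_time))

# Carriers with no match in airlines would show up as NA names
sum(is.na(flights_clean$airline_name))

# Factors expose their categories as levels
levels(flights_clean$origin)  # "EWR" "JFK" "LGA"
```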
📊 Data Visualization
1. Departure vs. Arrival Delays
ggplot(flights_clean, aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure vs. Arrival Delay",
       x = "Departure Delay (minutes)",
       y = "Arrival Delay (minutes)")
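The scatter plot suggests how closely the two delays move together. One way to put a number on that relationship is a Pearson correlation, sketched below on the raw flights table; delay_cor is an illustrative name, and use = "complete.obs" restricts the calculation to flights where both delays were recorded.

```r
library(nycflights13)

# Pearson correlation between departure and arrival delay,
# skipping flights where either value is missing
delay_cor <- cor(flights$dep_delay, flights$arr_delay, use = "complete.obs")
delay_cor
```

A value near 1 would indicate that flights leaving late tend to arrive late by a similar amount.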
2. Delay Distribution
ggplot(flights_clean, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Departure Delays",
       x = "Departure Delay (minutes)",
       y = "Number of Flights")
3. Delays by Airport
ggplot(flights_clean, aes(x = origin, y = arr_delay)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Arrival Delays by Origin Airport",
       x = "Origin Airport",
       y = "Arrival Delay (minutes)")
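A box plot draws the median and quartiles for each group, and the same numbers can be computed as a table when exact values are needed. The sketch below is one way to do that with group_by() and summarise(); delay_by_origin and its column names are chosen here for illustration.

```r
library(nycflights13)
library(dplyr)

delay_by_origin <- flights %>%
  filter(!is.na(arr_delay)) %>%          # keep flights with a recorded arrival delay
  group_by(origin) %>%
  summarise(
    n_flights    = n(),
    q1_delay     = quantile(arr_delay, 0.25, names = FALSE),
    median_delay = median(arr_delay),
    q3_delay     = quantile(arr_delay, 0.75, names = FALSE),
    .groups = "drop"
  )

delay_by_origin
```

Each row corresponds to one box in the plot: q1_delay and q3_delay are the box edges and median_delay is the line inside it.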
4. Monthly Flight Patterns
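# Count flights in each month
flights_per_month <- flights_clean %>%
  count(month)
ggplot(flights_per_month, aes(x = month, y = n)) +
  geom_line(color = "blue") +
  geom_point() +
  # Show one tick per month instead of fractional breaks
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Flights per Month (2013)",
       x = "Month",
       y = "Number of Flights")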
✅ Key Takeaways
- Learned to identify and handle missing values using na.omit() and replace_na()
- Checked for duplicate records using distinct()
- Improved data structure with rename() and mutate()
- Created various visualizations to explore the data using ggplot2
🚀 Practice Exercise
Try the code yourself in our interactive notebook:
Open in Google Colab
Info
Important: After clicking the link above, click the “Copy to Drive” button in Colab to create your own editable copy of the notebook.