Exploratory Data Analysis: A Systematic Approach
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is like being a detective with your data. Before you run fancy models or test hypotheses, you need to understand what you’re working with. EDA helps you:
- Spot patterns and relationships
- Identify outliers and anomalies
- Check assumptions
- Generate hypotheses
- Find data quality issues
- Tell stories with your data
The golden rule: Never skip EDA! It saves you from embarrassing mistakes later.
The EDA Framework: Your Step-by-Step Guide
Here’s a systematic approach that works for any dataset:
- First Look - Dimensions, types, missing values
- Univariate Analysis - One variable at a time
- Bivariate Analysis - Relationships between pairs
- Multivariate Analysis - Multiple variables together
- Hypothesis Generation - What questions emerge?
- Data Quality Check - Issues to address
Let’s walk through each step!
Step 1: The First Look
Always start with these basic questions about your data.
Python Approach
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
# Load data
data = pd.read_csv('your_dataset.csv')
print("=== DATASET OVERVIEW ===\n")
# 1. Shape
print(f"Rows: {data.shape[0]:,}")
print(f"Columns: {data.shape[1]}\n")
# 2. Column info
print("=== COLUMN INFORMATION ===")
data.info()  # .info() prints its report directly; wrapping it in print() would also print "None"
print()
# 3. First few rows
print("=== FIRST ROWS ===")
print(data.head())
print()
# 4. Basic statistics
print("=== NUMERIC SUMMARY ===")
print(data.describe())
print()
# 5. Missing values
print("=== MISSING VALUES ===")
missing = data.isnull().sum()
missing_pct = 100 * data.isnull().sum() / len(data)
missing_df = pd.DataFrame({
    'Missing': missing,
    'Percent': missing_pct
})
print(missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False))
R Approach
library(tidyverse)
library(skimr)
# Load data
data <- read_csv('your_dataset.csv')
cat("=== DATASET OVERVIEW ===\n\n")
# 1. Dimensions
cat(sprintf("Rows: %s\n", format(nrow(data), big.mark = ",")))
cat(sprintf("Columns: %d\n\n", ncol(data)))
# 2. Structure
cat("=== DATA STRUCTURE ===\n")
str(data)
# 3. First rows
cat("\n=== FIRST ROWS ===\n")
print(head(data))
# 4. Comprehensive summary
cat("\n=== COMPREHENSIVE SUMMARY ===\n")
skim(data)
# 5. Missing values detail
missing_summary <- data %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to = "column",
               values_to = "missing") %>%
  mutate(percent = 100 * missing / nrow(data)) %>%
  filter(missing > 0) %>%
  arrange(desc(missing))
cat("\n=== MISSING VALUES ===\n")
print(missing_summary)
Step 2: Univariate Analysis
Examine each variable individually to understand its distribution.
For Continuous Variables
Python:
def analyze_continuous_variable(data, column):
    """Comprehensive analysis of a continuous variable"""
    print(f"\n=== Analysis of {column} ===\n")
    # Statistics
    print(f"Mean: {data[column].mean():.2f}")
    print(f"Median: {data[column].median():.2f}")
    print(f"Std Dev: {data[column].std():.2f}")
    print(f"Min: {data[column].min():.2f}")
    print(f"Max: {data[column].max():.2f}")
    print(f"Skewness: {data[column].skew():.2f}")
    print(f"Kurtosis: {data[column].kurtosis():.2f}")
    # Create visualization
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    # Histogram
    axes[0].hist(data[column].dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[0].axvline(data[column].mean(), color='red', linestyle='--', label='Mean')
    axes[0].axvline(data[column].median(), color='green', linestyle='--', label='Median')
    axes[0].set_title(f'Distribution of {column}')
    axes[0].set_xlabel(column)
    axes[0].set_ylabel('Frequency')
    axes[0].legend()
    # Box plot
    axes[1].boxplot(data[column].dropna())
    axes[1].set_title(f'Box Plot of {column}')
    axes[1].set_ylabel(column)
    # Q-Q plot (for normality)
    from scipy import stats
    stats.probplot(data[column].dropna(), dist="norm", plot=axes[2])
    axes[2].set_title(f'Q-Q Plot of {column}')
    plt.tight_layout()
    plt.show()
# Use it
analyze_continuous_variable(data, 'age')
analyze_continuous_variable(data, 'income')
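One way to read the skewness figure printed above: values well above zero suggest a long right tail, and comparing the raw histogram to a log-transformed one often makes that easier to see. A minimal Python sketch, assuming 'income' is a non-negative, right-skewed column in your data:
import numpy as np
import matplotlib.pyplot as plt
# Hypothetical follow-up: compare raw vs. log1p-transformed distributions
# for an assumed non-negative, right-skewed column ('income' is a placeholder name)
col = 'income'
values = data[col].dropna()
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(values, bins=30, edgecolor='black', alpha=0.7)
axes[0].set_title(f'{col} (raw, skew = {values.skew():.2f})')
axes[1].hist(np.log1p(values), bins=30, edgecolor='black', alpha=0.7)
axes[1].set_title(f'log1p({col}) (skew = {np.log1p(values).skew():.2f})')
plt.tight_layout()
plt.show()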
R:
analyze_continuous_variable <- function(data, column) {
  cat(sprintf("\n=== Analysis of %s ===\n\n", column))
  # Statistics
  col_data <- data[[column]]
  cat(sprintf("Mean: %.2f\n", mean(col_data, na.rm = TRUE)))
  cat(sprintf("Median: %.2f\n", median(col_data, na.rm = TRUE)))
  cat(sprintf("Std Dev: %.2f\n", sd(col_data, na.rm = TRUE)))
  cat(sprintf("Min: %.2f\n", min(col_data, na.rm = TRUE)))
  cat(sprintf("Max: %.2f\n", max(col_data, na.rm = TRUE)))
  # Visualizations
  library(gridExtra)
  # Histogram
  p1 <- ggplot(data, aes_string(x = column)) +
    geom_histogram(bins = 30, fill = 'steelblue', color = 'black', alpha = 0.7) +
    geom_vline(aes(xintercept = mean(col_data, na.rm = TRUE)),
               color = 'red', linetype = 'dashed', size = 1) +
    geom_vline(aes(xintercept = median(col_data, na.rm = TRUE)),
               color = 'green', linetype = 'dashed', size = 1) +
    labs(title = paste('Distribution of', column),
         x = column, y = 'Frequency') +
    theme_minimal()
  # Box plot
  p2 <- ggplot(data, aes_string(y = column)) +
    geom_boxplot(fill = 'lightblue') +
    labs(title = paste('Box Plot of', column),
         y = column) +
    theme_minimal()
  # Density plot
  p3 <- ggplot(data, aes_string(x = column)) +
    geom_density(fill = 'steelblue', alpha = 0.5) +
    labs(title = paste('Density of', column),
         x = column, y = 'Density') +
    theme_minimal()
  grid.arrange(p1, p2, p3, ncol = 3)
}
# Use it
analyze_continuous_variable(data, 'age')
analyze_continuous_variable(data, 'income')
For Categorical Variables
Python:
def analyze_categorical_variable(data, column):
    """Comprehensive analysis of a categorical variable"""
    print(f"\n=== Analysis of {column} ===\n")
    # Value counts
    value_counts = data[column].value_counts()
    print("Value Counts:")
    print(value_counts)
    print()
    # Percentages
    print("Percentages:")
    print(data[column].value_counts(normalize=True) * 100)
    print()
    # Unique values
    print(f"Number of unique values: {data[column].nunique()}")
    # Visualizations
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    # Bar plot
    value_counts.plot(kind='bar', ax=axes[0], color='steelblue', edgecolor='black')
    axes[0].set_title(f'Count of {column}')
    axes[0].set_xlabel(column)
    axes[0].set_ylabel('Count')
    axes[0].tick_params(axis='x', rotation=45)
    # Pie chart (if not too many categories)
    if len(value_counts) <= 10:
        axes[1].pie(value_counts, labels=value_counts.index, autopct='%1.1f%%')
        axes[1].set_title(f'Distribution of {column}')
    else:
        # Show top 10
        value_counts.head(10).plot(kind='barh', ax=axes[1], color='coral')
        axes[1].set_title(f'Top 10 {column} Categories')
        axes[1].set_xlabel('Count')
    plt.tight_layout()
    plt.show()
# Use it
analyze_categorical_variable(data, 'category')
analyze_categorical_variable(data, 'region')
R:
analyze_categorical_variable <- function(data, column) {
  cat(sprintf("\n=== Analysis of %s ===\n\n", column))
  # Value counts
  value_counts <- data %>%
    count(!!sym(column)) %>%
    arrange(desc(n))
  cat("Value Counts:\n")
  print(value_counts)
  # Percentages
  cat("\nPercentages:\n")
  print(value_counts %>% mutate(percent = 100 * n / sum(n)))
  cat(sprintf("\nNumber of unique values: %d\n",
              n_distinct(data[[column]], na.rm = TRUE)))
  # Visualizations
  p1 <- ggplot(data, aes_string(x = column)) +
    geom_bar(fill = 'steelblue', color = 'black') +
    labs(title = paste('Count of', column),
         x = column, y = 'Count') +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  # If not too many categories, add pie chart
  if (n_distinct(data[[column]], na.rm = TRUE) <= 10) {
    p2 <- ggplot(value_counts, aes(x = "", y = n, fill = !!sym(column))) +
      geom_bar(stat = "identity", width = 1) +
      coord_polar("y", start = 0) +
      labs(title = paste('Distribution of', column)) +
      theme_void()
    grid.arrange(p1, p2, ncol = 2)
  } else {
    print(p1)
  }
}
# Use it
analyze_categorical_variable(data, 'category')
analyze_categorical_variable(data, 'region')
Step 3: Bivariate Analysis
Now look at relationships between pairs of variables.
Two Continuous Variables
Python:
def analyze_continuous_relationship(data, var1, var2):
    """Analyze relationship between two continuous variables"""
    # Keep only rows where both variables are present so the points and fit stay aligned
    pair = data[[var1, var2]].dropna()
    # Correlation (Pearson by default)
    correlation = pair[var1].corr(pair[var2])
    print(f"\nCorrelation between {var1} and {var2}: {correlation:.3f}")
    # Scatter plot with regression line
    plt.figure(figsize=(10, 6))
    plt.scatter(pair[var1], pair[var2], alpha=0.5)
    # Add regression line
    z = np.polyfit(pair[var1], pair[var2], 1)
    p = np.poly1d(z)
    plt.plot(pair[var1], p(pair[var1]), "r--", linewidth=2, label=f'y={z[0]:.2f}x+{z[1]:.2f}')
    plt.xlabel(var1)
    plt.ylabel(var2)
    plt.title(f'{var1} vs {var2}\nCorrelation: {correlation:.3f}')
    plt.legend()
    plt.show()
# Use it
analyze_continuous_relationship(data, 'age', 'income')
analyze_continuous_relationship(data, 'hours_studied', 'test_score')
R:
analyze_continuous_relationship <- function(data, var1, var2) {
  # Correlation
  correlation <- cor(data[[var1]], data[[var2]], use = "complete.obs")
  cat(sprintf("\nCorrelation between %s and %s: %.3f\n", var1, var2, correlation))
  # Scatter plot with regression line
  ggplot(data, aes_string(x = var1, y = var2)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = 'lm', color = 'red', linetype = 'dashed') +
    labs(title = sprintf('%s vs %s\nCorrelation: %.3f', var1, var2, correlation),
         x = var1, y = var2) +
    theme_minimal()
}
# Use it
analyze_continuous_relationship(data, 'age', 'income')
analyze_continuous_relationship(data, 'hours_studied', 'test_score')
Categorical vs Continuous
Python:
def analyze_cat_continuous(data, categorical, continuous):
    """Analyze relationship between categorical and continuous variables"""
    # Group statistics
    print(f"\n{continuous} by {categorical}:")
    print(data.groupby(categorical)[continuous].describe())
    # Visualizations
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    # Box plot
    data.boxplot(column=continuous, by=categorical, ax=axes[0])
    axes[0].set_title(f'{continuous} by {categorical}')
    axes[0].set_xlabel(categorical)
    axes[0].set_ylabel(continuous)
    # Violin plot
    sns.violinplot(data=data, x=categorical, y=continuous, ax=axes[1])
    axes[1].set_title(f'Distribution of {continuous} by {categorical}')
    axes[1].tick_params(axis='x', rotation=45)
    plt.tight_layout()
    plt.show()
# Use it
analyze_cat_continuous(data, 'education_level', 'income')
analyze_cat_continuous(data, 'region', 'sales')
R:
analyze_cat_continuous <- function(data, categorical, continuous) {
  # Group statistics
  cat(sprintf("\n%s by %s:\n", continuous, categorical))
  summary_stats <- data %>%
    group_by(!!sym(categorical)) %>%
    summarise(
      count = n(),
      mean = mean(!!sym(continuous), na.rm = TRUE),
      median = median(!!sym(continuous), na.rm = TRUE),
      sd = sd(!!sym(continuous), na.rm = TRUE)
    )
  print(summary_stats)
  # Visualizations
  p1 <- ggplot(data, aes_string(x = categorical, y = continuous)) +
    geom_boxplot(fill = 'lightblue') +
    labs(title = paste(continuous, 'by', categorical)) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  p2 <- ggplot(data, aes_string(x = categorical, y = continuous, fill = categorical)) +
    geom_violin() +
    labs(title = paste('Distribution of', continuous, 'by', categorical)) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'none')
  grid.arrange(p1, p2, ncol = 2)
}
# Use it
analyze_cat_continuous(data, 'education_level', 'income')
analyze_cat_continuous(data, 'region', 'sales')
Step 4: Multivariate Analysis
Look at multiple variables simultaneously.
Correlation Matrix
Python:
# Select numeric columns
numeric_cols = data.select_dtypes(include=[np.number]).columns
# Correlation matrix
correlation_matrix = data[numeric_cols].corr()
# Heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix,
            annot=True,
            cmap='coolwarm',
            center=0,
            square=True,
            linewidths=1,
            fmt='.2f')
plt.title('Correlation Matrix of Numeric Variables')
plt.tight_layout()
plt.show()
# Find strong correlations
print("\nStrong Correlations (|r| > 0.7):")
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.7:
            print(f"{correlation_matrix.columns[i]} <-> {correlation_matrix.columns[j]}: {correlation_matrix.iloc[i, j]:.3f}")
R:
library(corrplot)
# Select numeric columns
numeric_data <- data %>% select(where(is.numeric))
# Correlation matrix
correlation_matrix <- cor(numeric_data, use = "complete.obs")
# Visualization
corrplot(correlation_matrix,
         method = 'color',
         type = 'upper',
         addCoef.col = 'black',
         tl.col = 'black',
         tl.srt = 45,
         number.cex = 0.7)
# Find strong correlations
cat("\nStrong Correlations (|r| > 0.7):\n")
high_corr <- which(abs(correlation_matrix) > 0.7 & correlation_matrix < 1, arr.ind = TRUE)
for (i in seq_len(nrow(high_corr))) {  # seq_len() skips the loop cleanly when nothing matches
  row_idx <- high_corr[i, 1]
  col_idx <- high_corr[i, 2]
  if (row_idx < col_idx) { # Avoid duplicates
    cat(sprintf("%s <-> %s: %.3f\n",
                rownames(correlation_matrix)[row_idx],
                colnames(correlation_matrix)[col_idx],
                correlation_matrix[row_idx, col_idx]))
  }
}
Pair Plot
Python:
# For a subset of variables (pair plots get messy with too many)
selected_vars = ['var1', 'var2', 'var3', 'var4', 'target']
sns.pairplot(data[selected_vars],
             hue='target',  # Color by target variable
             diag_kind='kde',
             plot_kws={'alpha': 0.6})
plt.suptitle('Pair Plot of Selected Variables', y=1.01)
plt.show()
R:
library(GGally)
# Pair plot
selected_vars <- c('var1', 'var2', 'var3', 'var4', 'target')
ggpairs(data[selected_vars],
        aes(color = target, alpha = 0.6),
        upper = list(continuous = 'cor'),
        lower = list(continuous = 'points'),
        title = 'Pair Plot of Selected Variables')
Step 5: Hypothesis Generation
Based on your EDA, generate research questions:
Example thought process:
Observations from EDA:
- Income varies significantly by education level
- Age and income are positively correlated
- Region "West" has higher average sales
- Customer churn is higher for low-usage customers
Research hypotheses to test:
1. H1: Education level significantly predicts income, controlling for age
2. H2: Regional differences in sales persist after controlling for population density
3. H3: Usage hours can predict customer churn
4. H4: There's an interaction effect between education and region on income
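To make a hypothesis like H1 concrete, it can help to sketch the model you would eventually fit (formal testing comes later, in the Statistical Testing article). A minimal Python sketch using statsmodels' formula API; the column names income, age, and education_level come from the examples above and may differ in your dataset:
import statsmodels.formula.api as smf
# Illustrative only: does education level predict income after controlling for age?
# Column names (income, age, education_level) are assumed, not guaranteed to exist in your data.
model = smf.ols('income ~ age + C(education_level)', data=data).fit()
print(model.summary())  # inspect the education coefficients and their p-values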
Step 6: Data Quality Check
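Before the outlier checks below, a scan for exact duplicate rows is a cheap extra sanity check; a minimal Python sketch:
# Count rows that are exact duplicates of an earlier row
dup_count = data.duplicated().sum()
print(f"Duplicate rows: {dup_count} ({100 * dup_count / len(data):.1f}%)")
if dup_count > 0:
    print(data[data.duplicated(keep=False)].head())  # inspect a few before deciding whether to drop them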
Check for Outliers
Python:
def detect_outliers_iqr(data, column):
    """Detect outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    print(f"\n{column} outliers:")
    print(f"Lower bound: {lower_bound:.2f}")
    print(f"Upper bound: {upper_bound:.2f}")
    print(f"Number of outliers: {len(outliers)} ({100*len(outliers)/len(data):.1f}%)")
    return outliers
# Check all numeric columns
for col in data.select_dtypes(include=[np.number]).columns:
    outliers = detect_outliers_iqr(data, col)
R:
detect_outliers_iqr <- function(data, column) {
  col_data <- data[[column]]
  Q1 <- quantile(col_data, 0.25, na.rm = TRUE)
  Q3 <- quantile(col_data, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  outliers <- data %>%
    filter(!!sym(column) < lower_bound | !!sym(column) > upper_bound)
  cat(sprintf("\n%s outliers:\n", column))
  cat(sprintf("Lower bound: %.2f\n", lower_bound))
  cat(sprintf("Upper bound: %.2f\n", upper_bound))
  cat(sprintf("Number of outliers: %d (%.1f%%)\n",
              nrow(outliers),
              100 * nrow(outliers) / nrow(data)))
  return(outliers)
}
# Check all numeric columns
numeric_cols <- names(data %>% select(where(is.numeric)))
for (col in numeric_cols) {
  outliers <- detect_outliers_iqr(data, col)
}
Complete EDA Report Template
Here’s a template for documenting your EDA:
"""
Exploratory Data Analysis Report
Dataset: [Name]
Date: 2026-01-26
Analyst: Your Name
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Configuration
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
# Load data
data = pd.read_csv('data.csv')
print("="*60)
print("EXPLORATORY DATA ANALYSIS REPORT")
print("="*60)
# 1. DATASET OVERVIEW
print("\n1. DATASET OVERVIEW")
print(f" Rows: {len(data):,}")
print(f" Columns: {len(data.columns)}")
print(f" Memory usage: {data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# 2. MISSING DATA ANALYSIS
print("\n2. MISSING DATA ANALYSIS")
missing = data.isnull().sum()
if missing.sum() > 0:
    missing_df = pd.DataFrame({
        'Count': missing[missing > 0],
        'Percentage': (missing[missing > 0] / len(data) * 100).round(2)
    })
    print(missing_df)
else:
    print(" No missing values detected!")
# 3. NUMERIC VARIABLES SUMMARY
print("\n3. NUMERIC VARIABLES SUMMARY")
print(data.describe())
# 4. CATEGORICAL VARIABLES SUMMARY
print("\n4. CATEGORICAL VARIABLES SUMMARY")
cat_cols = data.select_dtypes(include=['object']).columns
for col in cat_cols:
    print(f"\n {col}:")
    print(f" Unique values: {data[col].nunique()}")
    print(f" Top 5 values:")
    print(data[col].value_counts().head())
# 5. KEY FINDINGS
print("\n5. KEY FINDINGS")
print(" - [List your observations]")
print(" - [Interesting patterns]")
print(" - [Data quality issues]")
print(" - [Relationships discovered]")
# 6. NEXT STEPS
print("\n6. RECOMMENDED NEXT STEPS")
print(" - [Data cleaning needed]")
print(" - [Variables to engineer]")
print(" - [Hypotheses to test]")
print(" - [Models to try]")
print("\n" + "="*60)
print("END OF REPORT")
print("="*60)
EDA Checklist
Before moving to modeling, make sure you’ve:
- ✅ Understood data dimensions and types
- ✅ Checked for missing values
- ✅ Examined distribution of each variable
- ✅ Identified outliers
- ✅ Explored relationships between variables
- ✅ Created key visualizations
- ✅ Generated hypotheses
- ✅ Documented findings
- ✅ Identified data quality issues
- ✅ Planned next steps
Common EDA Mistakes to Avoid
- Skipping EDA entirely - Never jump straight to modeling!
- Only looking at summary statistics - Always visualize!
- Ignoring missing data patterns - Missing data might be informative (see the sketch after this list)
- Not checking assumptions - Many methods assume normality, independence, etc.
- Confirmation bias - Look for evidence against your hypothesis too
- Forgetting domain knowledge - Statistical patterns should make sense!
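On point 3: one way to check whether missingness is informative is to turn it into an indicator and compare another variable across the missing and non-missing groups. A minimal Python sketch, reusing the income and age columns from the earlier examples (adjust the names to your data):
from scipy import stats
# Is income more likely to be missing at certain ages? ('income' and 'age' are example column names.)
data['income_missing'] = data['income'].isnull()
print(data.groupby('income_missing')['age'].describe())
# Welch's t-test as a rough check on whether mean age differs between the two groups
with_gap = data.loc[data['income_missing'], 'age'].dropna()
without_gap = data.loc[~data['income_missing'], 'age'].dropna()
t_stat, p_value = stats.ttest_ind(with_gap, without_gap, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")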
Next Steps
Now that you’ve explored your data:
- Data Cleaning Strategies - Clean up issues you found
- Feature Engineering - Create better variables
- Statistical Testing - Test your hypotheses
Remember: EDA is iterative! You’ll often come back to it as you learn more about your data.
Happy exploring! 🔍