Exploratory Data Analysis: A Systematic Approach

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is like being a detective with your data. Before you run fancy models or test hypotheses, you need to understand what you’re working with. EDA helps you:

  • Spot patterns and relationships
  • Identify outliers and anomalies
  • Check assumptions
  • Generate hypotheses
  • Find data quality issues
  • Tell stories with your data

The golden rule: Never skip EDA! It saves you from embarrassing mistakes later.


The EDA Framework: Your Step-by-Step Guide

Here’s a systematic approach that works for any dataset:

  1. First Look - Dimensions, types, missing values
  2. Univariate Analysis - One variable at a time
  3. Bivariate Analysis - Relationships between pairs
  4. Multivariate Analysis - Multiple variables together
  5. Hypothesis Generation - What questions emerge?
  6. Data Quality Check - Issues to address

Let’s walk through each step!


Step 1: The First Look

Always start by answering a few basic questions: how big is the dataset, what type is each column, what do the first rows look like, and where are values missing?

Python Approach

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Load data
data = pd.read_csv('your_dataset.csv')

print("=== DATASET OVERVIEW ===\n")

# 1. Shape
print(f"Rows: {data.shape[0]:,}")
print(f"Columns: {data.shape[1]}\n")

# 2. Column info
print("=== COLUMN INFORMATION ===")
print(data.info())
print()

# 3. First few rows
print("=== FIRST ROWS ===")
print(data.head())
print()

# 4. Basic statistics
print("=== NUMERIC SUMMARY ===")
print(data.describe())
print()

# 5. Missing values
print("=== MISSING VALUES ===")
missing = data.isnull().sum()
missing_pct = 100 * data.isnull().sum() / len(data)
missing_df = pd.DataFrame({
    'Missing': missing,
    'Percent': missing_pct
})
print(missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False))

R Approach

library(tidyverse)
library(skimr)

# Load data
data <- read_csv('your_dataset.csv')

cat("=== DATASET OVERVIEW ===\n\n")

# 1. Dimensions
cat(sprintf("Rows: %s\n", format(nrow(data), big.mark = ",")))
cat(sprintf("Columns: %d\n\n", ncol(data)))

# 2. Structure
cat("=== DATA STRUCTURE ===\n")
str(data)

# 3. First rows
cat("\n=== FIRST ROWS ===\n")
print(head(data))

# 4. Comprehensive summary
cat("\n=== COMPREHENSIVE SUMMARY ===\n")
skim(data)

# 5. Missing values detail
missing_summary <- data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), 
               names_to = "column", 
               values_to = "missing") %>%
  mutate(percent = 100 * missing / nrow(data)) %>%
  filter(missing > 0) %>%
  arrange(desc(missing))

cat("\n=== MISSING VALUES ===\n")
print(missing_summary)

Step 2: Univariate Analysis

Examine each variable individually to understand its distribution.

For Continuous Variables

Python:

def analyze_continuous_variable(data, column):
    """Comprehensive analysis of a continuous variable"""
    
    print(f"\n=== Analysis of {column} ===\n")
    
    # Statistics
    print(f"Mean: {data[column].mean():.2f}")
    print(f"Median: {data[column].median():.2f}")
    print(f"Std Dev: {data[column].std():.2f}")
    print(f"Min: {data[column].min():.2f}")
    print(f"Max: {data[column].max():.2f}")
    print(f"Skewness: {data[column].skew():.2f}")
    print(f"Kurtosis: {data[column].kurtosis():.2f}")
    
    # Create visualization
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Histogram
    axes[0].hist(data[column].dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[0].axvline(data[column].mean(), color='red', linestyle='--', label='Mean')
    axes[0].axvline(data[column].median(), color='green', linestyle='--', label='Median')
    axes[0].set_title(f'Distribution of {column}')
    axes[0].set_xlabel(column)
    axes[0].set_ylabel('Frequency')
    axes[0].legend()
    
    # Box plot
    axes[1].boxplot(data[column].dropna())
    axes[1].set_title(f'Box Plot of {column}')
    axes[1].set_ylabel(column)
    
    # Q-Q plot (for normality)
    from scipy import stats
    stats.probplot(data[column].dropna(), dist="norm", plot=axes[2])
    axes[2].set_title(f'Q-Q Plot of {column}')
    
    plt.tight_layout()
    plt.show()

# Use it
analyze_continuous_variable(data, 'age')
analyze_continuous_variable(data, 'income')

R:

analyze_continuous_variable <- function(data, column) {
  cat(sprintf("\n=== Analysis of %s ===\n\n", column))
  
  # Statistics
  col_data <- data[[column]]
  cat(sprintf("Mean: %.2f\n", mean(col_data, na.rm = TRUE)))
  cat(sprintf("Median: %.2f\n", median(col_data, na.rm = TRUE)))
  cat(sprintf("Std Dev: %.2f\n", sd(col_data, na.rm = TRUE)))
  cat(sprintf("Min: %.2f\n", min(col_data, na.rm = TRUE)))
  cat(sprintf("Max: %.2f\n", max(col_data, na.rm = TRUE)))
  
  # Visualizations
  library(gridExtra)
  
  # Histogram
  p1 <- ggplot(data, aes_string(x = column)) +
    geom_histogram(bins = 30, fill = 'steelblue', color = 'black', alpha = 0.7) +
    geom_vline(aes(xintercept = mean(col_data, na.rm = TRUE)), 
               color = 'red', linetype = 'dashed', size = 1) +
    geom_vline(aes(xintercept = median(col_data, na.rm = TRUE)), 
               color = 'green', linetype = 'dashed', size = 1) +
    labs(title = paste('Distribution of', column),
         x = column, y = 'Frequency') +
    theme_minimal()
  
  # Box plot
  p2 <- ggplot(data, aes_string(y = column)) +
    geom_boxplot(fill = 'lightblue') +
    labs(title = paste('Box Plot of', column),
         y = column) +
    theme_minimal()
  
  # Density plot
  p3 <- ggplot(data, aes_string(x = column)) +
    geom_density(fill = 'steelblue', alpha = 0.5) +
    labs(title = paste('Density of', column),
         x = column, y = 'Density') +
    theme_minimal()
  
  grid.arrange(p1, p2, p3, ncol = 3)
}

# Use it
analyze_continuous_variable(data, 'age')
analyze_continuous_variable(data, 'income')

For Categorical Variables

Python:

def analyze_categorical_variable(data, column):
    """Comprehensive analysis of a categorical variable"""
    
    print(f"\n=== Analysis of {column} ===\n")
    
    # Value counts
    value_counts = data[column].value_counts()
    print("Value Counts:")
    print(value_counts)
    print()
    
    # Percentages
    print("Percentages:")
    print(data[column].value_counts(normalize=True) * 100)
    print()
    
    # Unique values
    print(f"Number of unique values: {data[column].nunique()}")
    
    # Visualizations
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Bar plot
    value_counts.plot(kind='bar', ax=axes[0], color='steelblue', edgecolor='black')
    axes[0].set_title(f'Count of {column}')
    axes[0].set_xlabel(column)
    axes[0].set_ylabel('Count')
    axes[0].tick_params(axis='x', rotation=45)
    
    # Pie chart (if not too many categories)
    if len(value_counts) <= 10:
        axes[1].pie(value_counts, labels=value_counts.index, autopct='%1.1f%%')
        axes[1].set_title(f'Distribution of {column}')
    else:
        # Show top 10
        value_counts.head(10).plot(kind='barh', ax=axes[1], color='coral')
        axes[1].set_title(f'Top 10 {column} Categories')
        axes[1].set_xlabel('Count')
    
    plt.tight_layout()
    plt.show()

# Use it
analyze_categorical_variable(data, 'category')
analyze_categorical_variable(data, 'region')

R:

analyze_categorical_variable <- function(data, column) {
  cat(sprintf("\n=== Analysis of %s ===\n\n", column))
  
  # Value counts
  value_counts <- data %>% 
    count(!!sym(column)) %>%
    arrange(desc(n))
  
  cat("Value Counts:\n")
  print(value_counts)
  
  # Percentages
  cat("\nPercentages:\n")
  print(value_counts %>% mutate(percent = 100 * n / sum(n)))
  
  cat(sprintf("\nNumber of unique values: %d\n", 
              n_distinct(data[[column]], na.rm = TRUE)))
  
  # Visualizations
  library(gridExtra)  # grid.arrange() below comes from gridExtra
  p1 <- ggplot(data, aes_string(x = column)) +
    geom_bar(fill = 'steelblue', color = 'black') +
    labs(title = paste('Count of', column),
         x = column, y = 'Count') +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  # If not too many categories, add pie chart
  if (n_distinct(data[[column]], na.rm = TRUE) <= 10) {
    p2 <- ggplot(value_counts, aes(x = "", y = n, fill = !!sym(column))) +
      geom_bar(stat = "identity", width = 1) +
      coord_polar("y", start = 0) +
      labs(title = paste('Distribution of', column)) +
      theme_void()
    
    grid.arrange(p1, p2, ncol = 2)
  } else {
    print(p1)
  }
}

# Use it
analyze_categorical_variable(data, 'category')
analyze_categorical_variable(data, 'region')

Step 3: Bivariate Analysis

Now look at relationships between pairs of variables.

Two Continuous Variables

Python:

def analyze_continuous_relationship(data, var1, var2):
    """Analyze relationship between two continuous variables"""
    
    # Correlation
    correlation = data[[var1, var2]].corr().iloc[0, 1]
    print(f"\nCorrelation between {var1} and {var2}: {correlation:.3f}")
    
    # Scatter plot with regression line
    plt.figure(figsize=(10, 6))
    plt.scatter(data[var1], data[var2], alpha=0.5)
    
    # Add regression line (drop rows where either variable is missing so x and y stay aligned)
    clean = data[[var1, var2]].dropna()
    z = np.polyfit(clean[var1], clean[var2], 1)
    p = np.poly1d(z)
    x_line = np.linspace(clean[var1].min(), clean[var1].max(), 100)
    plt.plot(x_line, p(x_line), "r--", linewidth=2, label=f'y={z[0]:.2f}x+{z[1]:.2f}')
    
    plt.xlabel(var1)
    plt.ylabel(var2)
    plt.title(f'{var1} vs {var2}\nCorrelation: {correlation:.3f}')
    plt.legend()
    plt.show()

# Use it
analyze_continuous_relationship(data, 'age', 'income')
analyze_continuous_relationship(data, 'hours_studied', 'test_score')

R:

analyze_continuous_relationship <- function(data, var1, var2) {
  # Correlation
  correlation <- cor(data[[var1]], data[[var2]], use = "complete.obs")
  cat(sprintf("\nCorrelation between %s and %s: %.3f\n", var1, var2, correlation))
  
  # Scatter plot with regression line
  ggplot(data, aes_string(x = var1, y = var2)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = 'lm', color = 'red', linetype = 'dashed') +
    labs(title = sprintf('%s vs %s\nCorrelation: %.3f', var1, var2, correlation),
         x = var1, y = var2) +
    theme_minimal()
}

# Use it
analyze_continuous_relationship(data, 'age', 'income')
analyze_continuous_relationship(data, 'hours_studied', 'test_score')

Categorical vs Continuous

Python:

def analyze_cat_continuous(data, categorical, continuous):
    """Analyze relationship between categorical and continuous variables"""
    
    # Group statistics
    print(f"\n{continuous} by {categorical}:")
    print(data.groupby(categorical)[continuous].describe())
    
    # Visualizations
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Box plot
    data.boxplot(column=continuous, by=categorical, ax=axes[0])
    axes[0].set_title(f'{continuous} by {categorical}')
    axes[0].set_xlabel(categorical)
    axes[0].set_ylabel(continuous)
    
    # Violin plot
    sns.violinplot(data=data, x=categorical, y=continuous, ax=axes[1])
    axes[1].set_title(f'Distribution of {continuous} by {categorical}')
    axes[1].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

# Use it
analyze_cat_continuous(data, 'education_level', 'income')
analyze_cat_continuous(data, 'region', 'sales')

R:

analyze_cat_continuous <- function(data, categorical, continuous) {
  # Group statistics
  cat(sprintf("\n%s by %s:\n", continuous, categorical))
  summary_stats <- data %>%
    group_by(!!sym(categorical)) %>%
    summarise(
      count = n(),
      mean = mean(!!sym(continuous), na.rm = TRUE),
      median = median(!!sym(continuous), na.rm = TRUE),
      sd = sd(!!sym(continuous), na.rm = TRUE)
    )
  print(summary_stats)
  
  # Visualizations
  p1 <- ggplot(data, aes_string(x = categorical, y = continuous)) +
    geom_boxplot(fill = 'lightblue') +
    labs(title = paste(continuous, 'by', categorical)) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  p2 <- ggplot(data, aes_string(x = categorical, y = continuous, fill = categorical)) +
    geom_violin() +
    labs(title = paste('Distribution of', continuous, 'by', categorical)) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'none')
  
  grid.arrange(p1, p2, ncol = 2)
}

# Use it
analyze_cat_continuous(data, 'education_level', 'income')
analyze_cat_continuous(data, 'region', 'sales')

Step 4: Multivariate Analysis

Look at multiple variables simultaneously.

Correlation Matrix

Python:

# Select numeric columns
numeric_cols = data.select_dtypes(include=[np.number]).columns

# Correlation matrix
correlation_matrix = data[numeric_cols].corr()

# Heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, 
            annot=True, 
            cmap='coolwarm', 
            center=0,
            square=True,
            linewidths=1,
            fmt='.2f')
plt.title('Correlation Matrix of Numeric Variables')
plt.tight_layout()
plt.show()

# Find strong correlations
print("\nStrong Correlations (|r| > 0.7):")
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.7:
            print(f"{correlation_matrix.columns[i]} <-> {correlation_matrix.columns[j]}: {correlation_matrix.iloc[i, j]:.3f}")

R:

library(corrplot)

# Select numeric columns
numeric_data <- data %>% select(where(is.numeric))

# Correlation matrix
correlation_matrix <- cor(numeric_data, use = "complete.obs")

# Visualization
corrplot(correlation_matrix, 
         method = 'color', 
         type = 'upper',
         addCoef.col = 'black',
         tl.col = 'black',
         tl.srt = 45,
         number.cex = 0.7)

# Find strong correlations
cat("\nStrong Correlations (|r| > 0.7):\n")
high_corr <- which(abs(correlation_matrix) > 0.7 & correlation_matrix < 1, arr.ind = TRUE)
for (i in seq_len(nrow(high_corr))) {  # seq_len() safely handles the case of no strong correlations
  row_idx <- high_corr[i, 1]
  col_idx <- high_corr[i, 2]
  if (row_idx < col_idx) {  # Avoid duplicates
    cat(sprintf("%s <-> %s: %.3f\n",
                rownames(correlation_matrix)[row_idx],
                colnames(correlation_matrix)[col_idx],
                correlation_matrix[row_idx, col_idx]))
  }
}

Pair Plot

Python:

# For a subset of variables (pair plots get messy with too many)
selected_vars = ['var1', 'var2', 'var3', 'var4', 'target']

sns.pairplot(data[selected_vars], 
             hue='target',  # Color by target variable
             diag_kind='kde',
             plot_kws={'alpha': 0.6})
plt.suptitle('Pair Plot of Selected Variables', y=1.01)
plt.show()

R:

library(GGally)

# Pair plot
selected_vars <- c('var1', 'var2', 'var3', 'var4', 'target')

ggpairs(data[selected_vars],
        aes(color = target, alpha = 0.6),
        upper = list(continuous = 'cor'),
        lower = list(continuous = 'points'),
        title = 'Pair Plot of Selected Variables')

Step 5: Hypothesis Generation

Based on your EDA, generate research questions:

Example thought process:

Observations from EDA:
- Income varies significantly by education level
- Age and income are positively correlated
- Region "West" has higher average sales
- Customer churn is higher for low-usage customers

Research hypotheses to test:
1. H1: Education level significantly predicts income
   controlling for age
2. H2: Regional differences in sales persist after
   controlling for population density
3. H3: Usage hours can predict customer churn
4. H4: There's an interaction effect between education
   and region on income
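
Before testing these formally, a quick code sanity check can save time. Here is a minimal sketch for H1, assuming the illustrative age, education_level, and income columns used earlier; binning age is just one rough way to "control" for it during EDA:

# Rough check of H1: does income differ by education level within age bands?
# (Column names are the illustrative ones used earlier in this guide.)
age_band = pd.cut(data['age'], bins=[18, 30, 45, 60, 100])
print(
    data.groupby([age_band, 'education_level'], observed=True)['income']
        .mean()
        .unstack()
        .round(1)
)

If income still rises with education inside each age band, H1 is worth a formal test; if the pattern disappears, age may be doing the work.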

Step 6: Data Quality Check

Check for Outliers

Python:

def detect_outliers_iqr(data, column):
    """Detect outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    
    print(f"\n{column} outliers:")
    print(f"Lower bound: {lower_bound:.2f}")
    print(f"Upper bound: {upper_bound:.2f}")
    print(f"Number of outliers: {len(outliers)} ({100*len(outliers)/len(data):.1f}%)")
    
    return outliers

# Check all numeric columns
for col in data.select_dtypes(include=[np.number]).columns:
    outliers = detect_outliers_iqr(data, col)

R:

detect_outliers_iqr <- function(data, column) {
  col_data <- data[[column]]
  
  Q1 <- quantile(col_data, 0.25, na.rm = TRUE)
  Q3 <- quantile(col_data, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  outliers <- data %>% 
    filter(!!sym(column) < lower_bound | !!sym(column) > upper_bound)
  
  cat(sprintf("\n%s outliers:\n", column))
  cat(sprintf("Lower bound: %.2f\n", lower_bound))
  cat(sprintf("Upper bound: %.2f\n", upper_bound))
  cat(sprintf("Number of outliers: %d (%.1f%%)\n", 
              nrow(outliers), 
              100 * nrow(outliers) / nrow(data)))
  
  return(outliers)
}

# Check all numeric columns
numeric_cols <- names(data %>% select(where(is.numeric)))
for (col in numeric_cols) {
  outliers <- detect_outliers_iqr(data, col)
}
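
Outliers aren't the only quality issue worth flagging. Here is a short sketch (in Python, on the same generic data frame loaded earlier) of two other checks that often pay off: exact duplicate rows and stray whitespace in text columns.

# Exact duplicate rows often point to ingestion or merge problems
print(f"Duplicate rows: {data.duplicated().sum()}")

# Leading/trailing whitespace silently splits one category into several
for col in data.select_dtypes(include=['object']).columns:
    values = data[col].dropna().astype(str)
    if (values != values.str.strip()).any():
        print(f"Column '{col}' has values with leading/trailing whitespace")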

Complete EDA Report Template

Here’s a template for documenting your EDA:

"""
Exploratory Data Analysis Report
Dataset: [Name]
Date: 2026-01-26
Analyst: Your Name
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Load data
data = pd.read_csv('data.csv')

print("="*60)
print("EXPLORATORY DATA ANALYSIS REPORT")
print("="*60)

# 1. DATASET OVERVIEW
print("\n1. DATASET OVERVIEW")
print(f"   Rows: {len(data):,}")
print(f"   Columns: {len(data.columns)}")
print(f"   Memory usage: {data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# 2. MISSING DATA ANALYSIS
print("\n2. MISSING DATA ANALYSIS")
missing = data.isnull().sum()
if missing.sum() > 0:
    missing_df = pd.DataFrame({
        'Count': missing[missing > 0],
        'Percentage': (missing[missing > 0] / len(data) * 100).round(2)
    })
    print(missing_df)
else:
    print("   No missing values detected!")

# 3. NUMERIC VARIABLES SUMMARY
print("\n3. NUMERIC VARIABLES SUMMARY")
print(data.describe())

# 4. CATEGORICAL VARIABLES SUMMARY
print("\n4. CATEGORICAL VARIABLES SUMMARY")
cat_cols = data.select_dtypes(include=['object']).columns
for col in cat_cols:
    print(f"\n   {col}:")
    print(f"   Unique values: {data[col].nunique()}")
    print(f"   Top 5 values:")
    print(data[col].value_counts().head())

# 5. KEY FINDINGS
print("\n5. KEY FINDINGS")
print("   - [List your observations]")
print("   - [Interesting patterns]")
print("   - [Data quality issues]")
print("   - [Relationships discovered]")

# 6. NEXT STEPS
print("\n6. RECOMMENDED NEXT STEPS")
print("   - [Data cleaning needed]")
print("   - [Variables to engineer]")
print("   - [Hypotheses to test]")
print("   - [Models to try]")

print("\n" + "="*60)
print("END OF REPORT")
print("="*60)

EDA Checklist

Before moving to modeling, make sure you’ve:

  • ✅ Understood data dimensions and types
  • ✅ Checked for missing values
  • ✅ Examined distribution of each variable
  • ✅ Identified outliers
  • ✅ Explored relationships between variables
  • ✅ Created key visualizations
  • ✅ Generated hypotheses
  • ✅ Documented findings
  • ✅ Identified data quality issues
  • ✅ Planned next steps

Common EDA Mistakes to Avoid

  1. Skipping EDA entirely - Never jump straight to modeling!
  2. Only looking at summary statistics - Always visualize!
  3. Ignoring missing data patterns - Missing data might be informative (see the sketch after this list)
  4. Not checking assumptions - Many methods assume normality, independence, etc.
  5. Confirmation bias - Look for evidence against your hypothesis too
  6. Forgetting domain knowledge - Statistical patterns should make sense!
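
On mistake 3: a quick way to see whether missingness is informative is to compare another variable across rows where a column is missing versus present. A minimal sketch, reusing the illustrative income and age columns from earlier:

# Is 'income' missing at random, or does it depend on 'age'?
income_missing = data['income'].isnull()
print(data.groupby(income_missing)['age'].describe())

A large gap between the True and False rows suggests the missingness carries information and shouldn't simply be dropped.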

Next Steps

Now that you’ve explored your data:

  1. Data Cleaning Strategies - Clean up issues you found
  2. Feature Engineering - Create better variables
  3. Statistical Testing - Test your hypotheses

Remember: EDA is iterative! You’ll often come back to it as you learn more about your data.

Happy exploring! 🔍