Data Cleaning Strategies: From Messy to Analysis-Ready

Why Data Cleaning Matters

Real-world data is messy. Really messy. You’ll spend 60-80% of your time cleaning data, not running fancy models. But here’s the good news: clean data leads to better insights, more reliable results, and fewer “Wait, why did my model predict negative ages?” moments.

The truth: Garbage in, garbage out. No amount of sophisticated modeling can fix bad data.


The Data Cleaning Pipeline

Here’s a systematic approach that works for most datasets:

  1. Understand your data (you did EDA, right?)
  2. Handle missing values
  3. Address outliers
  4. Fix inconsistencies
  5. Handle duplicates
  6. Transform and encode
  7. Validate your cleaning

Let’s tackle each step!


Step 1: Missing Values

Missing data is the most common issue you’ll face.

Types of Missing Data

MCAR (Missing Completely At Random): Missingness is unrelated to any values, observed or not

  • Example: Survey responses lost due to a technical error

MAR (Missing At Random): Missingness depends only on other observed data

  • Example: Older respondents are less likely to report income, and age is recorded

MNAR (Missing Not At Random): Missingness depends on the unobserved value itself

  • Example: People with low income don’t report their income
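
You usually can’t prove which type you have, but a quick check is to see whether missingness in one column relates to other observed columns. If it does, the data are at best MAR, and simple deletion may bias your results. Here’s a minimal sketch in pandas (assuming columns named 'income', 'age', and 'education'):

import pandas as pd

data = pd.read_csv('data.csv')

# Flag rows where income is missing
income_missing = data['income'].isna()

# Compare an observed variable across missing vs. non-missing groups.
# Large differences suggest the missingness is not MCAR.
print(data.groupby(income_missing)['age'].describe())

# Share of missing income within each level of an observed category
print(data.groupby('education')['income'].apply(lambda s: s.isna().mean()))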

Strategy 1: Delete

When to use: Small amount of missing data (<5%), MCAR

Python:

import pandas as pd
import numpy as np

# Load data
data = pd.read_csv('data.csv')

# Check missing values
print("Missing values per column:")
print(data.isnull().sum())

# Option 1: Drop rows with ANY missing values
data_clean = data.dropna()
print(f"\nRows before: {len(data)}, after: {len(data_clean)}")

# Option 2: Drop rows with missing values in SPECIFIC columns
critical_cols = ['customer_id', 'date', 'amount']
data_clean = data.dropna(subset=critical_cols)

# Option 3: Drop columns with too many missing values
threshold = 0.5  # Drop if >50% missing
data_clean = data.loc[:, data.isnull().mean() < threshold]

R:

library(tidyverse)

# Load data
data <- read_csv('data.csv')

# Check missing values
cat("Missing values per column:\n")
print(colSums(is.na(data)))

# Option 1: Drop rows with ANY missing values
data_clean <- data %>% drop_na()
cat(sprintf("\nRows before: %d, after: %d\n", nrow(data), nrow(data_clean)))

# Option 2: Drop rows with missing in specific columns
critical_cols <- c('customer_id', 'date', 'amount')
data_clean <- data %>% drop_na(all_of(critical_cols))

# Option 3: Drop columns with too many missing values
threshold <- 0.5
missing_prop <- colMeans(is.na(data))
data_clean <- data %>% select(where(~mean(is.na(.)) < threshold))

Strategy 2: Impute with Statistics

When to use: Moderate missing data, continuous variables

Python:

# Mean imputation
data['age'] = data['age'].fillna(data['age'].mean())

# Median imputation (better for skewed data)
data['income'] = data['income'].fillna(data['income'].median())

# Mode imputation (for categorical)
data['category'] = data['category'].fillna(data['category'].mode()[0])

# Forward fill (for time series)
data['temperature'] = data['temperature'].ffill()

# Backward fill
data['temperature'] = data['temperature'].bfill()

# Group-wise imputation (sophisticated!)
# Fill missing income with mean income of same education level
data['income'] = data.groupby('education')['income'].transform(
    lambda x: x.fillna(x.mean())
)

R:

library(tidyverse)

# Mean imputation
data <- data %>%
  mutate(age = ifelse(is.na(age), mean(age, na.rm = TRUE), age))

# Median imputation
data <- data %>%
  mutate(income = ifelse(is.na(income), median(income, na.rm = TRUE), income))

# Mode imputation (categorical)
get_mode <- function(x) {
  ux <- unique(na.omit(x))
  ux[which.max(tabulate(match(x, ux)))]
}

data <- data %>%
  mutate(category = ifelse(is.na(category), get_mode(category), category))

# Forward fill (zoo package)
library(zoo)
data <- data %>%
  mutate(temperature = na.locf(temperature, na.rm = FALSE))

# Group-wise imputation
data <- data %>%
  group_by(education) %>%
  mutate(income = ifelse(is.na(income), mean(income, na.rm = TRUE), income)) %>%
  ungroup()

Strategy 3: Advanced Imputation

Python with sklearn:

from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Simple imputation
imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
data[['age', 'income']] = imputer.fit_transform(data[['age', 'income']])

# KNN imputation (uses similar rows)
num_cols = data.select_dtypes(include=[np.number]).columns
imputer = KNNImputer(n_neighbors=5)
data_imputed = pd.DataFrame(
    imputer.fit_transform(data[num_cols]),
    columns=num_cols,
    index=data.index
)

# MICE (Multiple Imputation by Chained Equations)
imputer = IterativeImputer(random_state=42, max_iter=10)
data_imputed = pd.DataFrame(
    imputer.fit_transform(data[num_cols]),
    columns=num_cols,
    index=data.index
)

R with mice:

library(mice)

# MICE imputation
# m=5 means create 5 imputed datasets
imputed_data <- mice(data, m = 5, method = 'pmm', seed = 42)

# Get one completed dataset
data_complete <- complete(imputed_data, 1)

# Or pool results from all 5 datasets (for modeling)
# This accounts for imputation uncertainty
model_fit <- with(imputed_data, lm(outcome ~ predictor1 + predictor2))
pooled_results <- pool(model_fit)
summary(pooled_results)

Strategy 4: Create Missing Indicator

When to use: Missingness might be informative

Python:

# Create binary indicator for missingness
data['income_missing'] = data['income'].isnull().astype(int)

# Then impute
data['income'] = data['income'].fillna(data['income'].median())

# Now you can analyze if missingness patterns are predictive!

R:

# Create missing indicator
data <- data %>%
  mutate(income_missing = as.integer(is.na(income)))

# Then impute
data <- data %>%
  mutate(income = ifelse(is.na(income), median(income, na.rm = TRUE), income))

Step 2: Handling Outliers

Outliers can be legitimate extreme values or errors. Be careful!

Detecting Outliers

Method 1: IQR Method

Python:

def remove_outliers_iqr(data, column):
    """Remove outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Count outliers
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    print(f"Found {len(outliers)} outliers in {column}")
    
    # Remove them
    data_clean = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]
    return data_clean

data_clean = remove_outliers_iqr(data, 'income')

R:

remove_outliers_iqr <- function(data, column) {
  Q1 <- quantile(data[[column]], 0.25, na.rm = TRUE)
  Q3 <- quantile(data[[column]], 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  outliers <- data %>% 
    filter(!!sym(column) < lower_bound | !!sym(column) > upper_bound)
  cat(sprintf("Found %d outliers in %s\n", nrow(outliers), column))
  
  data_clean <- data %>%
    filter(!!sym(column) >= lower_bound & !!sym(column) <= upper_bound)
  
  return(data_clean)
}

data_clean <- remove_outliers_iqr(data, 'income')

Method 2: Z-Score Method

Python:

from scipy import stats

def remove_outliers_zscore(data, column, threshold=3):
    """Remove outliers using z-score method"""
    z_scores = np.abs(stats.zscore(data[column].dropna()))
    outlier_indices = data[column].dropna().index[z_scores > threshold]
    
    print(f"Found {len(outlier_indices)} outliers in {column}")
    
    data_clean = data.drop(outlier_indices)
    return data_clean

data_clean = remove_outliers_zscore(data, 'income', threshold=3)

R:

remove_outliers_zscore <- function(data, column, threshold = 3) {
  x <- data[[column]]
  # Compute z-scores with NA-safe mean/sd so missing values don't poison the scaling
  z_scores <- abs((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
  
  outlier_count <- sum(z_scores > threshold, na.rm = TRUE)
  cat(sprintf("Found %d outliers in %s\n", outlier_count, column))
  
  # Keep rows within the threshold (and keep rows where the value is missing)
  data_clean <- data %>%
    filter(z_scores <= threshold | is.na(z_scores))
  
  return(data_clean)
}

data_clean <- remove_outliers_zscore(data, 'income', threshold = 3)

Transforming Outliers (Instead of Removing)

Python:

# Cap outliers at percentiles
lower_cap = data['income'].quantile(0.01)
upper_cap = data['income'].quantile(0.99)

data['income_capped'] = data['income'].clip(lower=lower_cap, upper=upper_cap)

# Log transformation (for right-skewed data)
data['log_income'] = np.log1p(data['income'])  # log1p handles zeros

# Square root transformation
data['sqrt_income'] = np.sqrt(data['income'])

# Box-Cox transformation (finds optimal transformation)
from scipy.stats import boxcox
data['income_boxcox'], lambda_param = boxcox(data['income'] + 1)  # +1 if zeros exist
print(f"Optimal lambda: {lambda_param}")

R:

# Cap outliers
lower_cap <- quantile(data$income, 0.01, na.rm = TRUE)
upper_cap <- quantile(data$income, 0.99, na.rm = TRUE)

data <- data %>%
  mutate(income_capped = pmax(pmin(income, upper_cap), lower_cap))

# Log transformation
data <- data %>%
  mutate(log_income = log1p(income))

# Square root transformation
data <- data %>%
  mutate(sqrt_income = sqrt(income))

# Box-Cox transformation (requires strictly positive values)
# Use MASS:: directly so MASS doesn't mask dplyr::select
bc_result <- MASS::boxcox(income ~ 1, data = data)
lambda <- bc_result$x[which.max(bc_result$y)]
cat(sprintf("Optimal lambda: %.2f\n", lambda))

data <- data %>%
  mutate(income_boxcox = (income^lambda - 1) / lambda)

Step 3: Fixing Inconsistencies

String Cleaning

Python:

# Convert to lowercase
data['name'] = data['name'].str.lower()

# Remove whitespace
data['name'] = data['name'].str.strip()

# Remove special characters
data['name'] = data['name'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)

# Standardize categories
category_map = {
    'USA': 'United States',
    'US': 'United States',
    'U.S.': 'United States',
    'United States of America': 'United States'
}
data['country'] = data['country'].replace(category_map)

# Fix typos with fuzzy matching
from fuzzywuzzy import process

def fix_typos(value, choices, threshold=80):
    """Fix typos using fuzzy string matching"""
    if pd.isna(value):
        return value
    match, score = process.extractOne(value, choices)
    return match if score >= threshold else value

correct_categories = ['Electronics', 'Clothing', 'Food', 'Books']
data['category'] = data['category'].apply(
    lambda x: fix_typos(x, correct_categories)
)

R:

library(stringr)

# Convert to lowercase
data <- data %>%
  mutate(name = str_to_lower(name))

# Remove whitespace
data <- data %>%
  mutate(name = str_trim(name))

# Remove special characters
data <- data %>%
  mutate(name = str_replace_all(name, "[^a-zA-Z0-9\\s]", ""))

# Standardize categories
data <- data %>%
  mutate(country = recode(country,
    'USA' = 'United States',
    'US' = 'United States',
    'U.S.' = 'United States',
    'United States of America' = 'United States'
  ))

# Fix typos with stringdist
library(stringdist)

fix_typos <- function(value, choices, threshold = 0.2) {
  if (is.na(value)) return(value)
  
  distances <- stringdist(value, choices, method = "jw")
  best_match_idx <- which.min(distances)
  
  if (distances[best_match_idx] <= threshold) {
    return(choices[best_match_idx])
  } else {
    return(value)
  }
}

correct_categories <- c('Electronics', 'Clothing', 'Food', 'Books')
data <- data %>%
  mutate(category = sapply(category, fix_typos, choices = correct_categories))

Date Cleaning

Python:

# Parse dates (unparseable values become NaT)
data['date'] = pd.to_datetime(data['date'], errors='coerce')

# Alternatively, if all dates share one known format, specify it (faster and stricter)
data['date'] = pd.to_datetime(data['date'],
                              format='%Y-%m-%d',
                              errors='coerce')

# Extract components
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day_of_week'] = data['date'].dt.dayofweek
data['is_weekend'] = data['day_of_week'].isin([5, 6]).astype(int)

# Fix date ranges
# Remove impossible dates
data = data[(data['date'] >= '1900-01-01') & (data['date'] <= '2030-12-31')]

R:

library(lubridate)

# Parse dates in a single known format
data <- data %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d"))

# Alternatively, try multiple formats on the raw character column
data <- data %>%
  mutate(date = parse_date_time(date, orders = c("ymd", "mdy", "dmy")))

# Extract components
data <- data %>%
  mutate(
    year = year(date),
    month = month(date),
    day_of_week = wday(date),
    is_weekend = as.integer(day_of_week %in% c(1, 7))  # Sunday=1, Saturday=7
  )

# Fix date ranges
data <- data %>%
  filter(date >= as.Date("1900-01-01") & date <= as.Date("2030-12-31"))

Step 4: Handling Duplicates

Python:

# Check for duplicates
print(f"Total duplicates: {data.duplicated().sum()}")

# See duplicate rows
duplicates = data[data.duplicated(keep=False)]
print(duplicates)

# Remove exact duplicates
data_clean = data.drop_duplicates()

# Remove duplicates based on specific columns
data_clean = data.drop_duplicates(subset=['customer_id', 'date'], keep='first')

# Find near-duplicates (fuzzy matching)
from difflib import SequenceMatcher

def similar(a, b, threshold=0.9):
    return SequenceMatcher(None, a, b).ratio() > threshold

# Identify similar names
for i in range(len(data)):
    for j in range(i+1, len(data)):
        if similar(data.iloc[i]['name'], data.iloc[j]['name']):
            print(f"Possible duplicates: {data.iloc[i]['name']} and {data.iloc[j]['name']}")

R:

# Check for duplicates
cat(sprintf("Total duplicates: %d\n", sum(duplicated(data))))

# See duplicate rows
duplicates <- data %>%
  filter(duplicated(.) | duplicated(., fromLast = TRUE))
print(duplicates)

# Remove exact duplicates
data_clean <- data %>% distinct()

# Remove duplicates based on specific columns
data_clean <- data %>%
  distinct(customer_id, date, .keep_all = TRUE)

# Find near-duplicates
library(stringdist)

names <- data$name
for (i in 1:(length(names)-1)) {
  for (j in (i+1):length(names)) {
    similarity <- stringsim(names[i], names[j], method = "jw")
    if (similarity > 0.9) {
      cat(sprintf("Possible duplicates: %s and %s (%.2f similarity)\n",
                  names[i], names[j], similarity))
    }
  }
}

Step 5: Data Validation

Always validate your cleaning!

Python:

def validate_data(data):
    """Comprehensive data validation"""
    
    print("=== DATA VALIDATION REPORT ===\n")
    
    # 1. Check for missing values
    missing = data.isnull().sum()
    if missing.sum() > 0:
        print("⚠️  Still have missing values:")
        print(missing[missing > 0])
    else:
        print("✅ No missing values")
    
    # 2. Check for duplicates
    dups = data.duplicated().sum()
    if dups > 0:
        print(f"\n⚠️  Found {dups} duplicate rows")
    else:
        print("\n✅ No duplicates")
    
    # 3. Check data types
    print("\n✅ Data types:")
    print(data.dtypes)
    
    # 4. Check value ranges
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        min_val = data[col].min()
        max_val = data[col].max()
        if min_val < 0 and col in ['age', 'price', 'quantity']:
            print(f"\n⚠️  {col} has negative values (min: {min_val})")
        if max_val > 150 and col == 'age':
            print(f"\n⚠️  {col} has suspiciously high values (max: {max_val})")
    
    # 5. Check categorical consistency
    cat_cols = data.select_dtypes(include=['object']).columns
    for col in cat_cols:
        unique_count = data[col].nunique()
        if unique_count > 100:
            print(f"\n⚠️  {col} has {unique_count} unique values (might need grouping)")
    
    print("\n=== END VALIDATION ===")

validate_data(data_clean)

R:

validate_data <- function(data) {
  cat("=== DATA VALIDATION REPORT ===\n\n")
  
  # 1. Check for missing values
  missing <- colSums(is.na(data))
  if (sum(missing) > 0) {
    cat("⚠️  Still have missing values:\n")
    print(missing[missing > 0])
  } else {
    cat("✅ No missing values\n")
  }
  
  # 2. Check for duplicates
  dups <- sum(duplicated(data))
  if (dups > 0) {
    cat(sprintf("\n⚠️  Found %d duplicate rows\n", dups))
  } else {
    cat("\n✅ No duplicates\n")
  }
  
  # 3. Check data types
  cat("\n✅ Data types:\n")
  print(sapply(data, class))
  
  # 4. Check value ranges
  numeric_cols <- names(data %>% select(where(is.numeric)))
  for (col in numeric_cols) {
    min_val <- min(data[[col]], na.rm = TRUE)
    max_val <- max(data[[col]], na.rm = TRUE)
    
    if (min_val < 0 && col %in% c('age', 'price', 'quantity')) {
      cat(sprintf("\n⚠️  %s has negative values (min: %.2f)\n", col, min_val))
    }
    if (max_val > 150 && col == 'age') {
      cat(sprintf("\n⚠️  %s has suspiciously high values (max: %.2f)\n", col, max_val))
    }
  }
  
  # 5. Check categorical consistency
  cat_cols <- names(data %>% select(where(is.character)))
  for (col in cat_cols) {
    unique_count <- n_distinct(data[[col]], na.rm = TRUE)
    if (unique_count > 100) {
      cat(sprintf("\n⚠️  %s has %d unique values (might need grouping)\n", 
                  col, unique_count))
    }
  }
  
  cat("\n=== END VALIDATION ===\n")
}

validate_data(data_clean)

Complete Cleaning Pipeline Example

Python:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

def clean_customer_data(filepath):
    """Complete data cleaning pipeline"""
    
    print("Starting data cleaning pipeline...")
    
    # 1. Load data
    data = pd.read_csv(filepath)
    print(f"Loaded {len(data)} rows, {len(data.columns)} columns")
    
    # 2. Handle duplicates
    data = data.drop_duplicates()
    print(f"After removing duplicates: {len(data)} rows")
    
    # 3. Fix data types
    data['date'] = pd.to_datetime(data['date'], errors='coerce')
    data['customer_id'] = data['customer_id'].astype(str)
    
    # 4. Clean text columns
    text_cols = ['name', 'city', 'category']
    for col in text_cols:
        if col in data.columns:
            data[col] = data[col].str.lower().str.strip()
    
    # 5. Handle missing values
    # Critical columns: drop rows
    data = data.dropna(subset=['customer_id', 'date'])
    
    # Numeric columns: impute with median
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    imputer = SimpleImputer(strategy='median')
    data[numeric_cols] = imputer.fit_transform(data[numeric_cols])
    
    # Categorical: impute with mode
    cat_cols = data.select_dtypes(include=['object']).columns
    for col in cat_cols:
        mode_value = data[col].mode()[0] if len(data[col].mode()) > 0 else 'Unknown'
        data[col] = data[col].fillna(mode_value)
    
    # 6. Handle outliers (cap at 1st and 99th percentile)
    for col in numeric_cols:
        lower = data[col].quantile(0.01)
        upper = data[col].quantile(0.99)
        data[col] = data[col].clip(lower=lower, upper=upper)
    
    # 7. Feature engineering
    data['age_group'] = pd.cut(data['age'], 
                               bins=[0, 18, 35, 50, 65, 100],
                               labels=['<18', '18-35', '35-50', '50-65', '65+'])
    
    # 8. Final validation
    assert data.isnull().sum().sum() == 0, "Still have missing values!"
    assert data.duplicated().sum() == 0, "Still have duplicates!"
    
    print(f"\nCleaning complete! Final shape: {data.shape}")
    
    # 9. Save cleaned data
    output_path = filepath.replace('.csv', '_cleaned.csv')
    data.to_csv(output_path, index=False)
    print(f"Saved to: {output_path}")
    
    return data

# Use it!
clean_data = clean_customer_data('raw_customer_data.csv')

Data Cleaning Checklist

Before moving to analysis:

  • ✅ No missing values (or properly handled)
  • ✅ No duplicates
  • ✅ Correct data types
  • ✅ Outliers addressed
  • ✅ Text standardized (lowercase, trimmed)
  • ✅ Dates parsed correctly
  • ✅ Categorical values consistent
  • ✅ Value ranges make sense
  • ✅ Documented all cleaning steps
  • ✅ Saved cleaned dataset

Common Pitfalls

  1. Not documenting cleaning decisions - Future you won’t remember why! (See the logging sketch after this list.)
  2. Automatically removing all outliers - Some are legitimate!
  3. Imputing without understanding - Different strategies have different implications
  4. Over-cleaning - Don’t remove real variance
  5. Not checking after cleaning - Always validate!
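
For pitfall 1, the fix can be as light as a helper that records what each step did to the row count. A minimal Python sketch (the log_step helper and the steps shown are just illustrative):

import pandas as pd

cleaning_log = []

def log_step(data, description):
    """Record the row count after a cleaning step."""
    cleaning_log.append({'step': description, 'rows': len(data)})
    return data

data = pd.read_csv('data.csv')
data = log_step(data, 'loaded raw data')
data = log_step(data.drop_duplicates(), 'dropped exact duplicates')
data = log_step(data.dropna(subset=['customer_id']), 'dropped rows missing customer_id')

# Save the log alongside the cleaned dataset
pd.DataFrame(cleaning_log).to_csv('cleaning_log.csv', index=False)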

Next Steps

Now that your data is clean:

  1. Feature Engineering - Create better variables
  2. Statistical Testing - Test hypotheses
  3. Model Evaluation - Build and evaluate models

Clean data = happy researcher! ✨