Data Cleaning Strategies: From Messy to Analysis-Ready
Why Data Cleaning Matters
Real-world data is messy. Really messy. You’ll spend 60-80% of your time cleaning data, not running fancy models. But here’s the good news: clean data leads to better insights, more reliable results, and fewer “Wait, why did my model predict negative ages?” moments.
The truth: Garbage in, garbage out. No amount of sophisticated modeling can fix bad data.
The Data Cleaning Pipeline
Here’s a systematic approach that works for most datasets:
- Understand your data (you did EDA, right?)
- Handle missing values
- Address outliers
- Fix inconsistencies
- Handle duplicates
- Transform and encode
- Validate your cleaning
Let’s tackle each step!
Step 1: Missing Values
Missing data is the most common issue you’ll face.
Types of Missing Data
MCAR (Missing Completely At Random): Missingness is unrelated to observed or unobserved data
- Example: Survey responses lost due to a technical error
MAR (Missing At Random): Missingness depends on observed data
- Example: Younger respondents are less likely to report income, and age is recorded
MNAR (Missing Not At Random): Missingness depends on the unobserved value itself
- Example: People with very low or very high incomes don't report income
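You usually can't prove which mechanism you're facing, but checking whether missingness is associated with observed variables can rule out MCAR. A minimal heuristic sketch, assuming hypothetical 'income' and 'age' columns:
Python:
import pandas as pd
from scipy import stats
# Load data (hypothetical file with 'income' and 'age' columns)
data = pd.read_csv('data.csv')
# Flag rows where income is missing
income_missing = data['income'].isnull()
# Compare an observed variable (age) between missing and non-missing groups.
# A clear difference is evidence against MCAR and points toward MAR.
t_stat, p_value = stats.ttest_ind(
    data.loc[income_missing, 'age'].dropna(),
    data.loc[~income_missing, 'age'].dropna(),
    equal_var=False  # Welch's t-test: don't assume equal variances
)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# Note: MNAR can't be detected from the data alone; it depends on values
# you never observed, so lean on domain knowledge.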
Strategy 1: Delete
When to use: Small amount of missing data (<5%), MCAR
Python:
import pandas as pd
import numpy as np
# Load data
data = pd.read_csv('data.csv')
# Check missing values
print("Missing values per column:")
print(data.isnull().sum())
# Option 1: Drop rows with ANY missing values
data_clean = data.dropna()
print(f"\nRows before: {len(data)}, after: {len(data_clean)}")
# Option 2: Drop rows with missing values in SPECIFIC columns
critical_cols = ['customer_id', 'date', 'amount']
data_clean = data.dropna(subset=critical_cols)
# Option 3: Drop columns with too many missing values
threshold = 0.5 # Drop if >50% missing
data_clean = data.loc[:, data.isnull().mean() < threshold]
R:
library(tidyverse)
# Load data
data <- read_csv('data.csv')
# Check missing values
cat("Missing values per column:\n")
print(colSums(is.na(data)))
# Option 1: Drop rows with ANY missing values
data_clean <- data %>% drop_na()
cat(sprintf("\nRows before: %d, after: %d\n", nrow(data), nrow(data_clean)))
# Option 2: Drop rows with missing in specific columns
critical_cols <- c('customer_id', 'date', 'amount')
data_clean <- data %>% drop_na(all_of(critical_cols))
# Option 3: Drop columns with too many missing values
threshold <- 0.5
missing_prop <- colMeans(is.na(data))  # inspect the proportion missing per column
data_clean <- data %>% select(where(~mean(is.na(.)) < threshold))
Strategy 2: Impute with Statistics
When to use: Moderate missing data, continuous variables
Python:
# Mean imputation
data['age'] = data['age'].fillna(data['age'].mean())
# Median imputation (better for skewed data)
data['income'] = data['income'].fillna(data['income'].median())
# Mode imputation (for categorical)
data['category'] = data['category'].fillna(data['category'].mode()[0])
# Forward fill (for time series; fillna(method=...) is deprecated)
data['temperature'] = data['temperature'].ffill()
# Backward fill
data['temperature'] = data['temperature'].bfill()
# Group-wise imputation (sophisticated!)
# Fill missing income with mean income of same education level
data['income'] = data.groupby('education')['income'].transform(
    lambda x: x.fillna(x.mean())
)
R:
library(tidyverse)
# Mean imputation
data <- data %>%
mutate(age = ifelse(is.na(age), mean(age, na.rm = TRUE), age))
# Median imputation
data <- data %>%
mutate(income = ifelse(is.na(income), median(income, na.rm = TRUE), income))
# Mode imputation (categorical)
get_mode <- function(x) {
ux <- unique(na.omit(x))
ux[which.max(tabulate(match(x, ux)))]
}
data <- data %>%
mutate(category = ifelse(is.na(category), get_mode(category), category))
# Forward fill (zoo package; tidyr's fill(temperature, .direction = "down") is an alternative)
library(zoo)
data <- data %>%
mutate(temperature = na.locf(temperature, na.rm = FALSE))
# Group-wise imputation
data <- data %>%
group_by(education) %>%
mutate(income = ifelse(is.na(income), mean(income, na.rm = TRUE), income)) %>%
ungroup()
Strategy 3: Advanced Imputation
Python with sklearn:
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Simple imputation
imputer = SimpleImputer(strategy='mean') # or 'median', 'most_frequent'
data[['age', 'income']] = imputer.fit_transform(data[['age', 'income']])
# KNN imputation (uses similar rows)
imputer = KNNImputer(n_neighbors=5)
data_imputed = pd.DataFrame(
    imputer.fit_transform(data.select_dtypes(include=[np.number])),
    columns=data.select_dtypes(include=[np.number]).columns
)
# MICE (Multiple Imputation by Chained Equations)
imputer = IterativeImputer(random_state=42, max_iter=10)
data_imputed = pd.DataFrame(
    imputer.fit_transform(data.select_dtypes(include=[np.number])),
    columns=data.select_dtypes(include=[np.number]).columns
)
R with mice:
library(mice)
# MICE imputation
# m=5 means create 5 imputed datasets
imputed_data <- mice(data, m = 5, method = 'pmm', seed = 42)
# Get one completed dataset
data_complete <- complete(imputed_data, 1)
# Or pool results from all 5 datasets (for modeling)
# This accounts for imputation uncertainty
model_fit <- with(imputed_data, lm(outcome ~ predictor1 + predictor2))
pooled_results <- pool(model_fit)
summary(pooled_results)
Strategy 4: Create Missing Indicator
When to use: Missingness might be informative
Python:
# Create binary indicator for missingness
data['income_missing'] = data['income'].isnull().astype(int)
# Then impute
data['income'] = data['income'].fillna(data['income'].median())
# Now you can analyze whether missingness patterns are predictive!
R:
# Create missing indicator
data <- data %>%
mutate(income_missing = as.integer(is.na(income)))
# Then impute
data <- data %>%
mutate(income = ifelse(is.na(income), median(income, na.rm = TRUE), income))
Step 2: Handling Outliers
Outliers can be legitimate extreme values or errors. Be careful!
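Before you apply any automatic rule, eyeball the suspects. A minimal inspection sketch, assuming a numeric 'income' column:
Python:
import pandas as pd
data = pd.read_csv('data.csv')  # hypothetical file with an 'income' column
# Summary statistics show how heavy the tails are
print(data['income'].describe())
# Inspect the most extreme rows by hand: a CEO's salary is legitimate,
# a negative income is almost certainly a data-entry error
print(data.nlargest(10, 'income'))
print(data.nsmallest(10, 'income'))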
Detecting Outliers
Method 1: IQR Method
Python:
def remove_outliers_iqr(data, column):
    """Remove outliers using the IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Count outliers
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    print(f"Found {len(outliers)} outliers in {column}")
    # Remove them
    data_clean = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]
    return data_clean

data_clean = remove_outliers_iqr(data, 'income')
R:
remove_outliers_iqr <- function(data, column) {
Q1 <- quantile(data[[column]], 0.25, na.rm = TRUE)
Q3 <- quantile(data[[column]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data %>%
filter(!!sym(column) < lower_bound | !!sym(column) > upper_bound)
cat(sprintf("Found %d outliers in %s\n", nrow(outliers), column))
data_clean <- data %>%
filter(!!sym(column) >= lower_bound & !!sym(column) <= upper_bound)
return(data_clean)
}
data_clean <- remove_outliers_iqr(data, 'income')
Method 2: Z-Score Method
Python:
from scipy import stats
def remove_outliers_zscore(data, column, threshold=3):
    """Remove outliers using the z-score method"""
    z_scores = np.abs(stats.zscore(data[column].dropna()))
    outlier_indices = data[column].dropna().index[z_scores > threshold]
    print(f"Found {len(outlier_indices)} outliers in {column}")
    data_clean = data.drop(outlier_indices)
    return data_clean

data_clean = remove_outliers_zscore(data, 'income', threshold=3)
R:
remove_outliers_zscore <- function(data, column, threshold = 3) {
z_scores <- abs(scale(data[[column]]))
outlier_count <- sum(z_scores > threshold, na.rm = TRUE)
cat(sprintf("Found %d outliers in %s\n", outlier_count, column))
data_clean <- data %>%
filter(abs(scale(!!sym(column))) <= threshold | is.na(!!sym(column)))
return(data_clean)
}
data_clean <- remove_outliers_zscore(data, 'income', threshold = 3)
Transforming Outliers (Instead of Removing)
Python:
# Cap outliers at percentiles
lower_cap = data['income'].quantile(0.01)
upper_cap = data['income'].quantile(0.99)
data['income_capped'] = data['income'].clip(lower=lower_cap, upper=upper_cap)
# Log transformation (for right-skewed data)
data['log_income'] = np.log1p(data['income']) # log1p handles zeros
# Square root transformation
data['sqrt_income'] = np.sqrt(data['income'])
# Box-Cox transformation (finds optimal transformation)
from scipy.stats import boxcox
data['income_boxcox'], lambda_param = boxcox(data['income'] + 1) # +1 if zeros exist
print(f"Optimal lambda: {lambda_param}")R:
# Cap outliers
lower_cap <- quantile(data$income, 0.01, na.rm = TRUE)
upper_cap <- quantile(data$income, 0.99, na.rm = TRUE)
data <- data %>%
mutate(income_capped = pmax(pmin(income, upper_cap), lower_cap))
# Log transformation
data <- data %>%
mutate(log_income = log1p(income))
# Square root transformation
data <- data %>%
mutate(sqrt_income = sqrt(income))
# Box-Cox transformation (requires income > 0)
library(MASS)  # note: MASS::select masks dplyr::select; use dplyr::select explicitly after this
bc_result <- boxcox(income ~ 1, data = data)
lambda <- bc_result$x[which.max(bc_result$y)]
cat(sprintf("Optimal lambda: %.2f\n", lambda))
data <- data %>%
mutate(income_boxcox = (income^lambda - 1) / lambda)
Step 3: Fixing Inconsistencies
String Cleaning
Python:
# Convert to lowercase
data['name'] = data['name'].str.lower()
# Remove whitespace
data['name'] = data['name'].str.strip()
# Remove special characters
data['name'] = data['name'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)
# Standardize categories
category_map = {
    'USA': 'United States',
    'US': 'United States',
    'U.S.': 'United States',
    'United States of America': 'United States'
}
data['country'] = data['country'].replace(category_map)
# Fix typos with fuzzy matching
from fuzzywuzzy import process  # actively maintained fork: thefuzz

def fix_typos(value, choices, threshold=80):
    """Fix typos using fuzzy string matching"""
    if pd.isna(value):
        return value
    match, score = process.extractOne(value, choices)
    return match if score >= threshold else value

correct_categories = ['Electronics', 'Clothing', 'Food', 'Books']
data['category'] = data['category'].apply(
    lambda x: fix_typos(x, correct_categories)
)
R:
library(stringr)
# Convert to lowercase
data <- data %>%
mutate(name = str_to_lower(name))
# Remove whitespace
data <- data %>%
mutate(name = str_trim(name))
# Remove special characters
data <- data %>%
mutate(name = str_replace_all(name, "[^a-zA-Z0-9\\s]", ""))
# Standardize categories
data <- data %>%
mutate(country = recode(country,
'USA' = 'United States',
'US' = 'United States',
'U.S.' = 'United States',
'United States of America' = 'United States'
))
# Fix typos with stringdist
library(stringdist)
fix_typos <- function(value, choices, threshold = 0.2) {
if (is.na(value)) return(value)
distances <- stringdist(value, choices, method = "jw")
best_match_idx <- which.min(distances)
if (distances[best_match_idx] <= threshold) {
return(choices[best_match_idx])
} else {
return(value)
}
}
correct_categories <- c('Electronics', 'Clothing', 'Food', 'Books')
data <- data %>%
mutate(category = sapply(category, fix_typos, choices = correct_categories))
Date Cleaning
Python:
# Parse dates
data['date'] = pd.to_datetime(data['date'], errors='coerce')
# Handle different formats
data['date'] = pd.to_datetime(data['date'],
format='%Y-%m-%d',
errors='coerce')
# Extract components
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day_of_week'] = data['date'].dt.dayofweek
data['is_weekend'] = data['day_of_week'].isin([5, 6]).astype(int)
# Fix date ranges
# Remove impossible dates
data = data[(data['date'] >= '1900-01-01') & (data['date'] <= '2030-12-31')]
R:
library(lubridate)
# Parse dates
data <- data %>%
mutate(date = as.Date(date, format = "%Y-%m-%d"))
# Or try multiple candidate formats with lubridate
data <- data %>%
mutate(date = parse_date_time(date, orders = c("ymd", "mdy", "dmy")))
# Extract components
data <- data %>%
mutate(
year = year(date),
month = month(date),
day_of_week = wday(date),
is_weekend = as.integer(day_of_week %in% c(1, 7)) # Sunday=1, Saturday=7
)
# Fix date ranges
data <- data %>%
filter(date >= as.Date("1900-01-01") & date <= as.Date("2030-12-31"))
Step 4: Handling Duplicates
Python:
# Check for duplicates
print(f"Total duplicates: {data.duplicated().sum()}")
# See duplicate rows
duplicates = data[data.duplicated(keep=False)]
print(duplicates)
# Remove exact duplicates
data_clean = data.drop_duplicates()
# Remove duplicates based on specific columns
data_clean = data.drop_duplicates(subset=['customer_id', 'date'], keep='first')
# Find near-duplicates (fuzzy matching)
from difflib import SequenceMatcher

def similar(a, b, threshold=0.9):
    return SequenceMatcher(None, a, b).ratio() > threshold

# Identify similar names (O(n^2) pairwise scan: fine for small tables, slow for large ones)
for i in range(len(data)):
    for j in range(i + 1, len(data)):
        if similar(data.iloc[i]['name'], data.iloc[j]['name']):
            print(f"Possible duplicates: {data.iloc[i]['name']} and {data.iloc[j]['name']}")
R:
# Check for duplicates
cat(sprintf("Total duplicates: %d\n", sum(duplicated(data))))
# See duplicate rows
duplicates <- data %>%
filter(duplicated(.) | duplicated(., fromLast = TRUE))
print(duplicates)
# Remove exact duplicates
data_clean <- data %>% distinct()
# Remove duplicates based on specific columns
data_clean <- data %>%
distinct(customer_id, date, .keep_all = TRUE)
# Find near-duplicates
library(stringdist)
names <- data$name
for (i in 1:(length(names)-1)) {
for (j in (i+1):length(names)) {
similarity <- stringsim(names[i], names[j], method = "jw")
if (similarity > 0.9) {
cat(sprintf("Possible duplicates: %s and %s (%.2f similarity)\n",
names[i], names[j], similarity))
}
}
}
Step 5: Data Validation
Always validate your cleaning!
Python:
def validate_data(data):
    """Comprehensive data validation"""
    print("=== DATA VALIDATION REPORT ===\n")
    # 1. Check for missing values
    missing = data.isnull().sum()
    if missing.sum() > 0:
        print("⚠️ Still have missing values:")
        print(missing[missing > 0])
    else:
        print("✅ No missing values")
    # 2. Check for duplicates
    dups = data.duplicated().sum()
    if dups > 0:
        print(f"\n⚠️ Found {dups} duplicate rows")
    else:
        print("\n✅ No duplicates")
    # 3. Check data types
    print("\n✅ Data types:")
    print(data.dtypes)
    # 4. Check value ranges
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        min_val = data[col].min()
        max_val = data[col].max()
        if min_val < 0 and col in ['age', 'price', 'quantity']:
            print(f"\n⚠️ {col} has negative values (min: {min_val})")
        if max_val > 150 and col == 'age':
            print(f"\n⚠️ {col} has suspiciously high values (max: {max_val})")
    # 5. Check categorical consistency
    cat_cols = data.select_dtypes(include=['object']).columns
    for col in cat_cols:
        unique_count = data[col].nunique()
        if unique_count > 100:
            print(f"\n⚠️ {col} has {unique_count} unique values (might need grouping)")
    print("\n=== END VALIDATION ===")

validate_data(data_clean)
R:
validate_data <- function(data) {
  cat("=== DATA VALIDATION REPORT ===\n\n")
  # 1. Check for missing values
  missing <- colSums(is.na(data))
  if (sum(missing) > 0) {
    cat("⚠️ Still have missing values:\n")
    print(missing[missing > 0])
  } else {
    cat("✅ No missing values\n")
  }
  # 2. Check for duplicates
  dups <- sum(duplicated(data))
  if (dups > 0) {
    cat(sprintf("\n⚠️ Found %d duplicate rows\n", dups))
  } else {
    cat("\n✅ No duplicates\n")
  }
  # 3. Check data types
  cat("\n✅ Data types:\n")
  print(sapply(data, class))
  # 4. Check value ranges
  numeric_cols <- names(data %>% select(where(is.numeric)))
  for (col in numeric_cols) {
    min_val <- min(data[[col]], na.rm = TRUE)
    max_val <- max(data[[col]], na.rm = TRUE)
    if (min_val < 0 && col %in% c('age', 'price', 'quantity')) {
      cat(sprintf("\n⚠️ %s has negative values (min: %.2f)\n", col, min_val))
    }
    if (max_val > 150 && col == 'age') {
      cat(sprintf("\n⚠️ %s has suspiciously high values (max: %.2f)\n", col, max_val))
    }
  }
  # 5. Check categorical consistency
  cat_cols <- names(data %>% select(where(is.character)))
  for (col in cat_cols) {
    unique_count <- n_distinct(data[[col]], na.rm = TRUE)
    if (unique_count > 100) {
      cat(sprintf("\n⚠️ %s has %d unique values (might need grouping)\n",
                  col, unique_count))
    }
  }
  cat("\n=== END VALIDATION ===\n")
}
validate_data(data_clean)
Complete Cleaning Pipeline Example
Python:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
def clean_customer_data(filepath):
    """Complete data cleaning pipeline"""
    print("Starting data cleaning pipeline...")
    # 1. Load data
    data = pd.read_csv(filepath)
    print(f"Loaded {len(data)} rows, {len(data.columns)} columns")
    # 2. Handle duplicates
    data = data.drop_duplicates()
    print(f"After removing duplicates: {len(data)} rows")
    # 3. Fix data types
    data['date'] = pd.to_datetime(data['date'], errors='coerce')
    data['customer_id'] = data['customer_id'].astype(str)
    # 4. Clean text columns
    text_cols = ['name', 'city', 'category']
    for col in text_cols:
        if col in data.columns:
            data[col] = data[col].str.lower().str.strip()
    # 5. Handle missing values
    # Critical columns: drop rows
    data = data.dropna(subset=['customer_id', 'date'])
    # Numeric columns: impute with median
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    imputer = SimpleImputer(strategy='median')
    data[numeric_cols] = imputer.fit_transform(data[numeric_cols])
    # Categorical: impute with mode
    cat_cols = data.select_dtypes(include=['object']).columns
    for col in cat_cols:
        mode_value = data[col].mode()[0] if len(data[col].mode()) > 0 else 'Unknown'
        data[col] = data[col].fillna(mode_value)
    # 6. Handle outliers (cap at 1st and 99th percentile)
    for col in numeric_cols:
        lower = data[col].quantile(0.01)
        upper = data[col].quantile(0.99)
        data[col] = data[col].clip(lower=lower, upper=upper)
    # 7. Feature engineering
    # note: ages outside (0, 100] become NaN here and would trip the assert below
    data['age_group'] = pd.cut(data['age'],
                               bins=[0, 18, 35, 50, 65, 100],
                               labels=['<18', '18-35', '35-50', '50-65', '65+'])
    # 8. Final validation
    assert data.isnull().sum().sum() == 0, "Still have missing values!"
    assert data.duplicated().sum() == 0, "Still have duplicates!"
    print(f"\nCleaning complete! Final shape: {data.shape}")
    # 9. Save cleaned data
    output_path = filepath.replace('.csv', '_cleaned.csv')
    data.to_csv(output_path, index=False)
    print(f"Saved to: {output_path}")
    return data

# Use it!
clean_data = clean_customer_data('raw_customer_data.csv')
Data Cleaning Checklist
Before moving to analysis (a programmatic version follows this list):
- ✅ No missing values (or properly handled)
- ✅ No duplicates
- ✅ Correct data types
- ✅ Outliers addressed
- ✅ Text standardized (lowercase, trimmed)
- ✅ Dates parsed correctly
- ✅ Categorical values consistent
- ✅ Value ranges make sense
- ✅ Documented all cleaning steps
- ✅ Saved cleaned dataset
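Checklists are easy to skip under deadline pressure, so consider encoding the mechanical items as assertions that fail loudly. A minimal sketch (the function name and column bounds are placeholders; adapt them to your data):
Python:
import pandas as pd

def assert_analysis_ready(data):
    """Fail loudly if the mechanical checklist items don't hold."""
    assert data.isnull().sum().sum() == 0, "missing values remain"
    assert data.duplicated().sum() == 0, "duplicate rows remain"
    # Value ranges: adjust columns and bounds to your domain
    if 'age' in data.columns:
        assert data['age'].between(0, 120).all(), "implausible ages"
    # Text standardization: object columns should be trimmed and lowercase
    for col in data.select_dtypes(include=['object']).columns:
        vals = data[col].dropna()
        assert (vals == vals.str.strip().str.lower()).all(), f"{col} not standardized"
    print("All checks passed!")

assert_analysis_ready(clean_data)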
Common Pitfalls
- Not documenting cleaning decisions - Future you won't remember why! (A lightweight logging sketch follows this list.)
- Automatically removing all outliers - Some are legitimate!
- Imputing without understanding - Different strategies have different implications
- Over-cleaning - Don’t remove real variance
- Not checking after cleaning - Always validate!
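For the first pitfall, one low-effort fix is to log each step's effect on the data as you go. A minimal sketch (the log_step helper is hypothetical):
Python:
import pandas as pd

cleaning_log = []

def log_step(data, description):
    """Record the step and the resulting shape, then pass the data through."""
    cleaning_log.append({'step': description, 'rows': len(data), 'cols': data.shape[1]})
    return data

data = pd.read_csv('data.csv')  # hypothetical file
data = log_step(data.drop_duplicates(), "dropped exact duplicates")
data = log_step(data.dropna(subset=['customer_id']), "dropped rows missing customer_id")

# Save the log next to the cleaned data so the decisions travel with it
pd.DataFrame(cleaning_log).to_csv('cleaning_log.csv', index=False)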
Next Steps
Now that your data is clean:
- Feature Engineering - Create better variables
- Statistical Testing - Test hypotheses
- Model Evaluation - Build and evaluate models
Clean data = happy researcher! ✨