Data Visualization: Making Your Data Tell a Story
Why Visualization Matters in Research
A great visualization can make the difference between your research being understood or ignored. Humans process visual information 60,000 times faster than text. Your job is to turn complex data into clear, compelling stories.
Remember: A picture isn’t worth a thousand words if nobody can understand it!
The Principles of Good Visualization
1. Choose the Right Chart Type
| Data Type | Use This Chart |
|---|---|
| Comparing categories | Bar chart, column chart |
| Showing distribution | Histogram, box plot, violin plot |
| Showing relationships | Scatter plot, bubble chart |
| Showing trends over time | Line chart, area chart |
| Showing composition | Stacked bar, pie chart (use sparingly!) |
| Showing geographic data | Choropleth map, heat map |
2. Remove Chartjunk
Less is more! Remove anything that doesn’t add meaning:
- ❌ 3D effects
- ❌ Unnecessary gridlines
- ❌ Decorative elements
- ❌ Too many colors
- ✅ Clear labels
- ✅ Meaningful titles
- ✅ Simple, clean design
Setting Up Your Visualization Environment
Python Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set style for all plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['legend.fontsize'] = 12
# For better quality in publications
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['savefig.bbox'] = 'tight'R Setup
library(tidyverse)
library(ggplot2)
library(scales)
library(patchwork) # For combining plots
# Set theme for all plots
theme_set(theme_minimal(base_size = 14))
# Custom theme for publications
publication_theme <- theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.title = element_text(size = 12),
legend.text = element_text(size = 11),
panel.grid.minor = element_blank()
)Chart Type 1: Bar Charts
Perfect for comparing categories.
Python Example
# Sample data
categories = ['A', 'B', 'C', 'D', 'E']
values = [23, 45, 56, 34, 51]
# Basic bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(categories, values, color='steelblue', edgecolor='black', alpha=0.7)
# Add value labels on bars
for bar in bars:
height = bar.get_height()
plt.text(bar.get_x() + bar.get_width()/2., height,
f'{height}',
ha='center', va='bottom', fontsize=12)
plt.title('Comparison Across Categories', fontsize=16, fontweight='bold')
plt.xlabel('Category', fontsize=14)
plt.ylabel('Value', fontsize=14)
plt.ylim(0, max(values) * 1.1) # Add space at top
plt.tight_layout()
plt.savefig('bar_chart.png', dpi=300)
plt.show()R Example
# Sample data
data <- tibble(
category = c('A', 'B', 'C', 'D', 'E'),
value = c(23, 45, 56, 34, 51)
)
# Create bar chart
ggplot(data, aes(x = category, y = value)) +
geom_col(fill = 'steelblue', color = 'black', alpha = 0.7) +
geom_text(aes(label = value), vjust = -0.5, size = 4) +
labs(
title = 'Comparison Across Categories',
x = 'Category',
y = 'Value'
) +
ylim(0, max(data$value) * 1.1) +
publication_theme
ggsave('bar_chart.png', width = 10, height = 6, dpi = 300)Chart Type 2: Line Charts
Perfect for showing trends over time.
Python Example
# Sample time series data
dates = pd.date_range('2024-01-01', periods=365, freq='D')
values = np.cumsum(np.random.randn(365)) + 100
plt.figure(figsize=(12, 6))
plt.plot(dates, values, linewidth=2, color='#2E86AB', label='Metric')
# Add trend line
z = np.polyfit(range(len(values)), values, 1)
p = np.poly1d(z)
plt.plot(dates, p(range(len(values))),
linestyle='--', color='red', linewidth=1.5,
label=f'Trend (slope={z[0]:.2f})')
plt.title('Trend Over Time', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=14)
plt.ylabel('Value', fontsize=14)
plt.legend(loc='best', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('line_chart.png', dpi=300)
plt.show()R Example
# Sample time series data
data <- tibble(
date = seq.Date(as.Date('2024-01-01'), by = 'day', length.out = 365),
value = cumsum(rnorm(365)) + 100
)
# Create line chart with trend
ggplot(data, aes(x = date, y = value)) +
geom_line(color = '#2E86AB', size = 1.2) +
geom_smooth(method = 'lm', se = FALSE,
linetype = 'dashed', color = 'red', size = 0.8) +
labs(
title = 'Trend Over Time',
x = 'Date',
y = 'Value'
) +
scale_x_date(date_breaks = '2 months', date_labels = '%b %Y') +
publication_theme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggsave('line_chart.png', width = 12, height = 6, dpi = 300)Chart Type 3: Scatter Plots
Perfect for showing relationships between variables.
Python Example
# Sample data
np.random.seed(42)
x = np.random.randn(200) * 10 + 50
y = x * 1.5 + np.random.randn(200) * 15 + 20
colors = np.random.rand(200)
sizes = np.random.randint(50, 500, 200)
plt.figure(figsize=(10, 8))
scatter = plt.scatter(x, y, c=colors, s=sizes,
alpha=0.6, cmap='viridis',
edgecolors='black', linewidth=0.5)
# Add regression line
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x, p(x), "r--", linewidth=2,
label=f'y = {z[0]:.2f}x + {z[1]:.2f}')
# Calculate and display correlation
correlation = np.corrcoef(x, y)[0, 1]
plt.text(0.05, 0.95, f'r = {correlation:.3f}',
transform=plt.gca().transAxes,
fontsize=14, verticalalignment='top',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.colorbar(scatter, label='Color Scale')
plt.title('Relationship Between Variables', fontsize=16, fontweight='bold')
plt.xlabel('Variable X', fontsize=14)
plt.ylabel('Variable Y', fontsize=14)
plt.legend(loc='lower right', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('scatter_plot.png', dpi=300)
plt.show()R Example
# Sample data
set.seed(42)
data <- tibble(
x = rnorm(200) * 10 + 50,
y = x * 1.5 + rnorm(200) * 15 + 20,
color = runif(200),
size = runif(200, 2, 10)
)
# Calculate correlation
correlation <- cor(data$x, data$y)
# Create scatter plot
ggplot(data, aes(x = x, y = y)) +
geom_point(aes(color = color, size = size), alpha = 0.6) +
geom_smooth(method = 'lm', se = FALSE,
linetype = 'dashed', color = 'red', size = 1) +
annotate('text', x = min(data$x) + 5, y = max(data$y) - 5,
label = sprintf('r = %.3f', correlation),
size = 5, hjust = 0,
fill = 'wheat', alpha = 0.7) +
scale_color_viridis_c() +
labs(
title = 'Relationship Between Variables',
x = 'Variable X',
y = 'Variable Y',
color = 'Color\nScale',
size = 'Size'
) +
publication_theme
ggsave('scatter_plot.png', width = 10, height = 8, dpi = 300)Chart Type 4: Box Plots and Violin Plots
Perfect for showing distributions across categories.
Python Example
# Sample data
np.random.seed(42)
data = {
'Group A': np.random.normal(100, 15, 200),
'Group B': np.random.normal(110, 20, 200),
'Group C': np.random.normal(95, 10, 200),
'Group D': np.random.normal(120, 25, 200)
}
df = pd.DataFrame(data)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Box plot
bp = axes[0].boxplot(df.values, labels=df.columns,
patch_artist=True, showmeans=True)
# Color the boxes
colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']
for patch, color in zip(bp['boxes'], colors):
patch.set_facecolor(color)
axes[0].set_title('Distribution by Group (Box Plot)',
fontsize=14, fontweight='bold')
axes[0].set_ylabel('Value', fontsize=12)
axes[0].grid(True, alpha=0.3)
# Violin plot
parts = axes[1].violinplot(df.values, showmeans=True, showmedians=True)
axes[1].set_xticks(range(1, len(df.columns) + 1))
axes[1].set_xticklabels(df.columns)
axes[1].set_title('Distribution by Group (Violin Plot)',
fontsize=14, fontweight='bold')
axes[1].set_ylabel('Value', fontsize=12)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('distribution_plots.png', dpi=300)
plt.show()R Example
# Sample data
set.seed(42)
data <- tibble(
group = rep(c('Group A', 'Group B', 'Group C', 'Group D'), each = 200),
value = c(
rnorm(200, 100, 15),
rnorm(200, 110, 20),
rnorm(200, 95, 10),
rnorm(200, 120, 25)
)
)
# Box plot
p1 <- ggplot(data, aes(x = group, y = value, fill = group)) +
geom_boxplot(alpha = 0.7) +
stat_summary(fun = mean, geom = 'point', shape = 23, size = 3, fill = 'red') +
labs(
title = 'Distribution by Group (Box Plot)',
x = '',
y = 'Value'
) +
publication_theme +
theme(legend.position = 'none')
# Violin plot
p2 <- ggplot(data, aes(x = group, y = value, fill = group)) +
geom_violin(alpha = 0.7) +
geom_boxplot(width = 0.1, fill = 'white', alpha = 0.8) +
labs(
title = 'Distribution by Group (Violin Plot)',
x = '',
y = 'Value'
) +
publication_theme +
theme(legend.position = 'none')
# Combine plots
combined <- p1 + p2
ggsave('distribution_plots.png', combined, width = 14, height = 6, dpi = 300)Chart Type 5: Heatmaps
Perfect for showing patterns in matrices or correlations.
Python Example
# Sample correlation matrix
np.random.seed(42)
variables = ['Var1', 'Var2', 'Var3', 'Var4', 'Var5']
corr_matrix = np.random.rand(5, 5)
corr_matrix = (corr_matrix + corr_matrix.T) / 2 # Make symmetric
np.fill_diagonal(corr_matrix, 1) # Diagonal is 1
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix,
annot=True, # Show values
fmt='.2f', # Format
cmap='coolwarm',
center=0,
square=True,
linewidths=1,
cbar_kws={'label': 'Correlation'},
xticklabels=variables,
yticklabels=variables)
plt.title('Correlation Matrix Heatmap', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('heatmap.png', dpi=300)
plt.show()R Example
library(reshape2)
# Sample correlation matrix
set.seed(42)
variables <- c('Var1', 'Var2', 'Var3', 'Var4', 'Var5')
corr_matrix <- matrix(runif(25), 5, 5)
corr_matrix <- (corr_matrix + t(corr_matrix)) / 2
diag(corr_matrix) <- 1
colnames(corr_matrix) <- variables
rownames(corr_matrix) <- variables
# Convert to long format
corr_long <- melt(corr_matrix)
# Create heatmap
ggplot(corr_long, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = 'white', size = 1) +
geom_text(aes(label = sprintf('%.2f', value)), size = 4) +
scale_fill_gradient2(low = 'blue', mid = 'white', high = 'red',
midpoint = 0, limits = c(-1, 1),
name = 'Correlation') +
labs(
title = 'Correlation Matrix Heatmap',
x = '',
y = ''
) +
coord_fixed() +
publication_theme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggsave('heatmap.png', width = 10, height = 8, dpi = 300)Color Theory for Research
Choosing Colors
For categorical data (qualitative):
# Python - use distinct colors
colors = sns.color_palette('Set2', n_colors=5)
# Or define your own
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']# R - use distinct colors
library(RColorBrewer)
colors <- brewer.pal(5, 'Set2')
# Or define your own
colors <- c('#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd')For continuous data (sequential):
# Python
colors = sns.color_palette('Blues', as_cmap=True)# R
scale_fill_gradient(low = 'white', high = 'darkblue')For diverging data (showing positive/negative):
# Python
colors = sns.diverging_palette(250, 10, as_cmap=True)# R
scale_fill_gradient2(low = 'blue', mid = 'white', high = 'red')Colorblind-Friendly Palettes
# Python - colorblind-safe palettes
colors = sns.color_palette('colorblind')# R - colorblind-safe palettes
library(viridis)
scale_fill_viridis_d() # For discrete
scale_fill_viridis_c() # For continuousCreating Publication-Quality Figures
Multi-Panel Figures
Python:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Plot 1: Line chart
axes[0, 0].plot(range(10), np.random.randn(10).cumsum())
axes[0, 0].set_title('(A) Time Series')
# Plot 2: Bar chart
axes[0, 1].bar(range(5), np.random.randint(10, 50, 5))
axes[0, 1].set_title('(B) Categories')
# Plot 3: Scatter plot
axes[1, 0].scatter(np.random.randn(100), np.random.randn(100))
axes[1, 0].set_title('(C) Relationship')
# Plot 4: Box plot
axes[1, 1].boxplot([np.random.randn(100), np.random.randn(100) + 2])
axes[1, 1].set_title('(D) Distributions')
plt.suptitle('Figure 1: Complete Analysis',
fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.savefig('multi_panel_figure.png', dpi=300, bbox_inches='tight')
plt.show()R:
library(patchwork)
# Create individual plots
p1 <- ggplot(data.frame(x = 1:10, y = cumsum(rnorm(10))),
aes(x, y)) +
geom_line() +
labs(title = '(A) Time Series') +
publication_theme
p2 <- ggplot(data.frame(x = 1:5, y = runif(5, 10, 50)),
aes(x, y)) +
geom_col() +
labs(title = '(B) Categories') +
publication_theme
p3 <- ggplot(data.frame(x = rnorm(100), y = rnorm(100)),
aes(x, y)) +
geom_point() +
labs(title = '(C) Relationship') +
publication_theme
p4 <- ggplot(data.frame(group = rep(c('A', 'B'), each = 100),
value = c(rnorm(100), rnorm(100) + 2)),
aes(group, value)) +
geom_boxplot() +
labs(title = '(D) Distributions') +
publication_theme
# Combine
combined <- (p1 + p2) / (p3 + p4) +
plot_annotation(title = 'Figure 1: Complete Analysis',
theme = theme(plot.title = element_text(size = 16, face = 'bold')))
ggsave('multi_panel_figure.png', combined, width = 12, height = 10, dpi = 300)Interactive Visualizations
For presentations and online reports.
Python with Plotly
import plotly.express as px
import plotly.graph_objects as go
# Sample data
df = pd.DataFrame({
'x': range(100),
'y': np.random.randn(100).cumsum(),
'category': np.random.choice(['A', 'B', 'C'], 100)
})
# Interactive scatter plot
fig = px.scatter(df, x='x', y='y', color='category',
title='Interactive Scatter Plot',
hover_data=['x', 'y', 'category'])
fig.update_layout(
font=dict(size=14),
title_font=dict(size=18),
hoverlabel=dict(bgcolor='white', font_size=12)
)
# Save as HTML
fig.write_html('interactive_plot.html')
# Or show in notebook
fig.show()R with Plotly
library(plotly)
# Sample data
data <- tibble(
x = 1:100,
y = cumsum(rnorm(100)),
category = sample(c('A', 'B', 'C'), 100, replace = TRUE)
)
# Interactive scatter plot
p <- ggplot(data, aes(x = x, y = y, color = category)) +
geom_point() +
labs(title = 'Interactive Scatter Plot') +
publication_theme
# Convert to interactive
interactive_plot <- ggplotly(p)
# Save as HTML
htmlwidgets::saveWidget(interactive_plot, 'interactive_plot.html')Visualization Checklist
Before publishing a figure:
- ✅ Clear, descriptive title
- ✅ Axis labels with units
- ✅ Legend (if needed)
- ✅ Readable font sizes (≥10pt for print)
- ✅ Colorblind-friendly colors
- ✅ High resolution (≥300 DPI for print)
- ✅ No chartjunk
- ✅ Consistent style across figures
- ✅ Caption explains what’s shown
- ✅ All elements are visible when printed in grayscale
Common Mistakes to Avoid
- Pie charts with too many slices - Use bar chart instead
- 3D charts - They distort perception
- Dual y-axes - Often misleading
- Red-green color schemes - Not colorblind-friendly
- Tiny fonts - Make everything readable
- Missing units - Always label with units
- Misleading scales - Start y-axis at zero (usually)
- Too much information - One message per chart
Resources
- “The Visual Display of Quantitative Information” by Edward Tufte - The classic
- Python Graph Gallery - https://python-graph-gallery.com
- R Graph Gallery - https://r-graph-gallery.com
- ColorBrewer - https://colorbrewer2.org
- Data-to-Viz - https://www.data-to-viz.com
Next Steps
- Writing Your First Paper - Use your visualizations effectively
- Building a Research Portfolio - Showcase your best visualizations
Make your data beautiful! 📊✨