Data Visualization: Making Your Data Tell a Story

Why Visualization Matters in Research

A great visualization can make the difference between your research being understood or ignored. Humans process visual information 60,000 times faster than text. Your job is to turn complex data into clear, compelling stories.

Remember: A picture isn’t worth a thousand words if nobody can understand it!


The Principles of Good Visualization

1. Choose the Right Chart Type

Data TypeUse This Chart
Comparing categoriesBar chart, column chart
Showing distributionHistogram, box plot, violin plot
Showing relationshipsScatter plot, bubble chart
Showing trends over timeLine chart, area chart
Showing compositionStacked bar, pie chart (use sparingly!)
Showing geographic dataChoropleth map, heat map

2. Remove Chartjunk

Less is more! Remove anything that doesn’t add meaning:

  • ❌ 3D effects
  • ❌ Unnecessary gridlines
  • ❌ Decorative elements
  • ❌ Too many colors
  • ✅ Clear labels
  • ✅ Meaningful titles
  • ✅ Simple, clean design

Setting Up Your Visualization Environment

Python Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for all plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['legend.fontsize'] = 12

# For better quality in publications
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['savefig.bbox'] = 'tight'

R Setup

library(tidyverse)
library(ggplot2)
library(scales)
library(patchwork)  # For combining plots

# Set theme for all plots
theme_set(theme_minimal(base_size = 14))

# Custom theme for publications
publication_theme <- theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 11),
    panel.grid.minor = element_blank()
  )

Chart Type 1: Bar Charts

Perfect for comparing categories.

Python Example

# Sample data
categories = ['A', 'B', 'C', 'D', 'E']
values = [23, 45, 56, 34, 51]

# Basic bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(categories, values, color='steelblue', edgecolor='black', alpha=0.7)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height}',
             ha='center', va='bottom', fontsize=12)

plt.title('Comparison Across Categories', fontsize=16, fontweight='bold')
plt.xlabel('Category', fontsize=14)
plt.ylabel('Value', fontsize=14)
plt.ylim(0, max(values) * 1.1)  # Add space at top
plt.tight_layout()
plt.savefig('bar_chart.png', dpi=300)
plt.show()

R Example

# Sample data
data <- tibble(
  category = c('A', 'B', 'C', 'D', 'E'),
  value = c(23, 45, 56, 34, 51)
)

# Create bar chart
ggplot(data, aes(x = category, y = value)) +
  geom_col(fill = 'steelblue', color = 'black', alpha = 0.7) +
  geom_text(aes(label = value), vjust = -0.5, size = 4) +
  labs(
    title = 'Comparison Across Categories',
    x = 'Category',
    y = 'Value'
  ) +
  ylim(0, max(data$value) * 1.1) +
  publication_theme

ggsave('bar_chart.png', width = 10, height = 6, dpi = 300)

Chart Type 2: Line Charts

Perfect for showing trends over time.

Python Example

# Sample time series data
dates = pd.date_range('2024-01-01', periods=365, freq='D')
values = np.cumsum(np.random.randn(365)) + 100

plt.figure(figsize=(12, 6))
plt.plot(dates, values, linewidth=2, color='#2E86AB', label='Metric')

# Add trend line
z = np.polyfit(range(len(values)), values, 1)
p = np.poly1d(z)
plt.plot(dates, p(range(len(values))), 
         linestyle='--', color='red', linewidth=1.5, 
         label=f'Trend (slope={z[0]:.2f})')

plt.title('Trend Over Time', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=14)
plt.ylabel('Value', fontsize=14)
plt.legend(loc='best', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('line_chart.png', dpi=300)
plt.show()

R Example

# Sample time series data
data <- tibble(
  date = seq.Date(as.Date('2024-01-01'), by = 'day', length.out = 365),
  value = cumsum(rnorm(365)) + 100
)

# Create line chart with trend
ggplot(data, aes(x = date, y = value)) +
  geom_line(color = '#2E86AB', size = 1.2) +
  geom_smooth(method = 'lm', se = FALSE, 
              linetype = 'dashed', color = 'red', size = 0.8) +
  labs(
    title = 'Trend Over Time',
    x = 'Date',
    y = 'Value'
  ) +
  scale_x_date(date_breaks = '2 months', date_labels = '%b %Y') +
  publication_theme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave('line_chart.png', width = 12, height = 6, dpi = 300)

Chart Type 3: Scatter Plots

Perfect for showing relationships between variables.

Python Example

# Sample data
np.random.seed(42)
x = np.random.randn(200) * 10 + 50
y = x * 1.5 + np.random.randn(200) * 15 + 20
colors = np.random.rand(200)
sizes = np.random.randint(50, 500, 200)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(x, y, c=colors, s=sizes, 
                     alpha=0.6, cmap='viridis', 
                     edgecolors='black', linewidth=0.5)

# Add regression line
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x, p(x), "r--", linewidth=2, 
         label=f'y = {z[0]:.2f}x + {z[1]:.2f}')

# Calculate and display correlation
correlation = np.corrcoef(x, y)[0, 1]
plt.text(0.05, 0.95, f'r = {correlation:.3f}', 
         transform=plt.gca().transAxes,
         fontsize=14, verticalalignment='top',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.colorbar(scatter, label='Color Scale')
plt.title('Relationship Between Variables', fontsize=16, fontweight='bold')
plt.xlabel('Variable X', fontsize=14)
plt.ylabel('Variable Y', fontsize=14)
plt.legend(loc='lower right', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('scatter_plot.png', dpi=300)
plt.show()

R Example

# Sample data
set.seed(42)
data <- tibble(
  x = rnorm(200) * 10 + 50,
  y = x * 1.5 + rnorm(200) * 15 + 20,
  color = runif(200),
  size = runif(200, 2, 10)
)

# Calculate correlation
correlation <- cor(data$x, data$y)

# Create scatter plot
ggplot(data, aes(x = x, y = y)) +
  geom_point(aes(color = color, size = size), alpha = 0.6) +
  geom_smooth(method = 'lm', se = FALSE, 
              linetype = 'dashed', color = 'red', size = 1) +
  annotate('text', x = min(data$x) + 5, y = max(data$y) - 5,
           label = sprintf('r = %.3f', correlation),
           size = 5, hjust = 0,
           fill = 'wheat', alpha = 0.7) +
  scale_color_viridis_c() +
  labs(
    title = 'Relationship Between Variables',
    x = 'Variable X',
    y = 'Variable Y',
    color = 'Color\nScale',
    size = 'Size'
  ) +
  publication_theme

ggsave('scatter_plot.png', width = 10, height = 8, dpi = 300)

Chart Type 4: Box Plots and Violin Plots

Perfect for showing distributions across categories.

Python Example

# Sample data
np.random.seed(42)
data = {
    'Group A': np.random.normal(100, 15, 200),
    'Group B': np.random.normal(110, 20, 200),
    'Group C': np.random.normal(95, 10, 200),
    'Group D': np.random.normal(120, 25, 200)
}

df = pd.DataFrame(data)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Box plot
bp = axes[0].boxplot(df.values, labels=df.columns,
                     patch_artist=True, showmeans=True)

# Color the boxes
colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

axes[0].set_title('Distribution by Group (Box Plot)', 
                  fontsize=14, fontweight='bold')
axes[0].set_ylabel('Value', fontsize=12)
axes[0].grid(True, alpha=0.3)

# Violin plot
parts = axes[1].violinplot(df.values, showmeans=True, showmedians=True)
axes[1].set_xticks(range(1, len(df.columns) + 1))
axes[1].set_xticklabels(df.columns)
axes[1].set_title('Distribution by Group (Violin Plot)', 
                  fontsize=14, fontweight='bold')
axes[1].set_ylabel('Value', fontsize=12)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('distribution_plots.png', dpi=300)
plt.show()

R Example

# Sample data
set.seed(42)
data <- tibble(
  group = rep(c('Group A', 'Group B', 'Group C', 'Group D'), each = 200),
  value = c(
    rnorm(200, 100, 15),
    rnorm(200, 110, 20),
    rnorm(200, 95, 10),
    rnorm(200, 120, 25)
  )
)

# Box plot
p1 <- ggplot(data, aes(x = group, y = value, fill = group)) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = 'point', shape = 23, size = 3, fill = 'red') +
  labs(
    title = 'Distribution by Group (Box Plot)',
    x = '',
    y = 'Value'
  ) +
  publication_theme +
  theme(legend.position = 'none')

# Violin plot
p2 <- ggplot(data, aes(x = group, y = value, fill = group)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.1, fill = 'white', alpha = 0.8) +
  labs(
    title = 'Distribution by Group (Violin Plot)',
    x = '',
    y = 'Value'
  ) +
  publication_theme +
  theme(legend.position = 'none')

# Combine plots
combined <- p1 + p2
ggsave('distribution_plots.png', combined, width = 14, height = 6, dpi = 300)

Chart Type 5: Heatmaps

Perfect for showing patterns in matrices or correlations.

Python Example

# Sample correlation matrix
np.random.seed(42)
variables = ['Var1', 'Var2', 'Var3', 'Var4', 'Var5']
corr_matrix = np.random.rand(5, 5)
corr_matrix = (corr_matrix + corr_matrix.T) / 2  # Make symmetric
np.fill_diagonal(corr_matrix, 1)  # Diagonal is 1

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, 
            annot=True,  # Show values
            fmt='.2f',   # Format
            cmap='coolwarm',
            center=0,
            square=True,
            linewidths=1,
            cbar_kws={'label': 'Correlation'},
            xticklabels=variables,
            yticklabels=variables)

plt.title('Correlation Matrix Heatmap', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('heatmap.png', dpi=300)
plt.show()

R Example

library(reshape2)

# Sample correlation matrix
set.seed(42)
variables <- c('Var1', 'Var2', 'Var3', 'Var4', 'Var5')
corr_matrix <- matrix(runif(25), 5, 5)
corr_matrix <- (corr_matrix + t(corr_matrix)) / 2
diag(corr_matrix) <- 1
colnames(corr_matrix) <- variables
rownames(corr_matrix) <- variables

# Convert to long format
corr_long <- melt(corr_matrix)

# Create heatmap
ggplot(corr_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = 'white', size = 1) +
  geom_text(aes(label = sprintf('%.2f', value)), size = 4) +
  scale_fill_gradient2(low = 'blue', mid = 'white', high = 'red',
                       midpoint = 0, limits = c(-1, 1),
                       name = 'Correlation') +
  labs(
    title = 'Correlation Matrix Heatmap',
    x = '',
    y = ''
  ) +
  coord_fixed() +
  publication_theme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave('heatmap.png', width = 10, height = 8, dpi = 300)

Color Theory for Research

Choosing Colors

For categorical data (qualitative):

# Python - use distinct colors
colors = sns.color_palette('Set2', n_colors=5)

# Or define your own
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
# R - use distinct colors
library(RColorBrewer)
colors <- brewer.pal(5, 'Set2')

# Or define your own
colors <- c('#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd')

For continuous data (sequential):

# Python
colors = sns.color_palette('Blues', as_cmap=True)
# R
scale_fill_gradient(low = 'white', high = 'darkblue')

For diverging data (showing positive/negative):

# Python
colors = sns.diverging_palette(250, 10, as_cmap=True)
# R
scale_fill_gradient2(low = 'blue', mid = 'white', high = 'red')

Colorblind-Friendly Palettes

# Python - colorblind-safe palettes
colors = sns.color_palette('colorblind')
# R - colorblind-safe palettes
library(viridis)
scale_fill_viridis_d()  # For discrete
scale_fill_viridis_c()  # For continuous

Creating Publication-Quality Figures

Multi-Panel Figures

Python:

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Line chart
axes[0, 0].plot(range(10), np.random.randn(10).cumsum())
axes[0, 0].set_title('(A) Time Series')

# Plot 2: Bar chart
axes[0, 1].bar(range(5), np.random.randint(10, 50, 5))
axes[0, 1].set_title('(B) Categories')

# Plot 3: Scatter plot
axes[1, 0].scatter(np.random.randn(100), np.random.randn(100))
axes[1, 0].set_title('(C) Relationship')

# Plot 4: Box plot
axes[1, 1].boxplot([np.random.randn(100), np.random.randn(100) + 2])
axes[1, 1].set_title('(D) Distributions')

plt.suptitle('Figure 1: Complete Analysis', 
             fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.savefig('multi_panel_figure.png', dpi=300, bbox_inches='tight')
plt.show()

R:

library(patchwork)

# Create individual plots
p1 <- ggplot(data.frame(x = 1:10, y = cumsum(rnorm(10))), 
             aes(x, y)) +
  geom_line() +
  labs(title = '(A) Time Series') +
  publication_theme

p2 <- ggplot(data.frame(x = 1:5, y = runif(5, 10, 50)), 
             aes(x, y)) +
  geom_col() +
  labs(title = '(B) Categories') +
  publication_theme

p3 <- ggplot(data.frame(x = rnorm(100), y = rnorm(100)), 
             aes(x, y)) +
  geom_point() +
  labs(title = '(C) Relationship') +
  publication_theme

p4 <- ggplot(data.frame(group = rep(c('A', 'B'), each = 100),
                        value = c(rnorm(100), rnorm(100) + 2)),
             aes(group, value)) +
  geom_boxplot() +
  labs(title = '(D) Distributions') +
  publication_theme

# Combine
combined <- (p1 + p2) / (p3 + p4) +
  plot_annotation(title = 'Figure 1: Complete Analysis',
                  theme = theme(plot.title = element_text(size = 16, face = 'bold')))

ggsave('multi_panel_figure.png', combined, width = 12, height = 10, dpi = 300)

Interactive Visualizations

For presentations and online reports.

Python with Plotly

import plotly.express as px
import plotly.graph_objects as go

# Sample data
df = pd.DataFrame({
    'x': range(100),
    'y': np.random.randn(100).cumsum(),
    'category': np.random.choice(['A', 'B', 'C'], 100)
})

# Interactive scatter plot
fig = px.scatter(df, x='x', y='y', color='category',
                 title='Interactive Scatter Plot',
                 hover_data=['x', 'y', 'category'])

fig.update_layout(
    font=dict(size=14),
    title_font=dict(size=18),
    hoverlabel=dict(bgcolor='white', font_size=12)
)

# Save as HTML
fig.write_html('interactive_plot.html')

# Or show in notebook
fig.show()

R with Plotly

library(plotly)

# Sample data
data <- tibble(
  x = 1:100,
  y = cumsum(rnorm(100)),
  category = sample(c('A', 'B', 'C'), 100, replace = TRUE)
)

# Interactive scatter plot
p <- ggplot(data, aes(x = x, y = y, color = category)) +
  geom_point() +
  labs(title = 'Interactive Scatter Plot') +
  publication_theme

# Convert to interactive
interactive_plot <- ggplotly(p)

# Save as HTML
htmlwidgets::saveWidget(interactive_plot, 'interactive_plot.html')

Visualization Checklist

Before publishing a figure:

  • ✅ Clear, descriptive title
  • ✅ Axis labels with units
  • ✅ Legend (if needed)
  • ✅ Readable font sizes (≥10pt for print)
  • ✅ Colorblind-friendly colors
  • ✅ High resolution (≥300 DPI for print)
  • ✅ No chartjunk
  • ✅ Consistent style across figures
  • ✅ Caption explains what’s shown
  • ✅ All elements are visible when printed in grayscale

Common Mistakes to Avoid

  1. Pie charts with too many slices - Use bar chart instead
  2. 3D charts - They distort perception
  3. Dual y-axes - Often misleading
  4. Red-green color schemes - Not colorblind-friendly
  5. Tiny fonts - Make everything readable
  6. Missing units - Always label with units
  7. Misleading scales - Start y-axis at zero (usually)
  8. Too much information - One message per chart

Resources


Next Steps

  1. Writing Your First Paper - Use your visualizations effectively
  2. Building a Research Portfolio - Showcase your best visualizations

Make your data beautiful! 📊✨