Python Data Analytics Lesson 4

Lesson 4: Understanding Basic Statistics and Summaries

📘 Overview

Statistics help us summarize our data with just a few key numbers. Think of statistics as a way to describe a whole group with simple, powerful facts. Today we’ll learn the most important statistical measures that every data analyst needs to know!

Expected Duration: 45-60 minutes

Required packages:

pandas
numpy
matplotlib
seaborn

🎯 Learning Goals

By the end of this lesson, you will be able to:

Calculate and interpret measures of central tendency (mean, median, mode)
Understand measures of spread (range, standard deviation, quartiles)
Create comprehensive statistical summaries
Compare statistics across different groups
Visualize statistical measures effectively

📌 Setting Up Our Data

Let’s use our familiar student dataset and add some new information:

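# Install required packages for data analytics
# Note: piplite and top-level "await" only work in JupyterLite notebooks.
# In a regular Python environment, install the packages with pip instead.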
import piplite
await piplite.install(['seaborn', 'matplotlib', 'pandas', 'numpy', 'scipy', 'plotly'])
print("Packages installed successfully!")
print("You can now import and use: seaborn, matplotlib, pandas, numpy, scipy, plotly")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Our student dataset
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva', 'Frank', 'Grace', 'Henry'],
    'Age': [16, 17, 16, 18, 17, 16, 17, 18],
    'Grade': ['10th', '11th', '10th', '12th', '11th', '10th', '11th', '12th'],
    'Math_Score': [85, 92, 78, 95, 88, 76, 91, 87],
    'Science_Score': [89, 87, 82, 98, 85, 79, 93, 89],
    'English_Score': [92, 85, 88, 91, 94, 83, 89, 86],
    'Hours_Studied': [5, 8, 4, 10, 6, 3, 9, 7],
    'Extracurriculars': [2, 1, 3, 2, 4, 1, 2, 3]
}

df = pd.DataFrame(student_data)
df['Average_Score'] = df[['Math_Score', 'Science_Score', 'English_Score']].mean(axis=1)

print("Our student data is ready for statistical analysis!")
print(df[['Name', 'Math_Score', 'Science_Score', 'English_Score', 'Average_Score']])

1. Measures of Central Tendency - “What’s Normal?”

These statistics tell us what a “typical” value looks like in our data.

The Mean (Average)

The mean is what most people think of as “average”:

# Calculate means for our test scores
math_mean = df['Math_Score'].mean()
science_mean = df['Science_Score'].mean()
english_mean = df['English_Score'].mean()

print("MEAN SCORES (Averages):")
print(f"Math: {math_mean:.1f} points")
print(f"Science: {science_mean:.1f} points") 
print(f"English: {english_mean:.1f} points")

# The mean tells us the "center" of our data
print(f"\nOverall, students average {math_mean:.1f} points in Math.")
print(f"This means if we added all Math scores and divided by {len(df)} students,")
print(f"we get {math_mean:.1f}.")

# Let's verify this calculation manually
total_math_points = df['Math_Score'].sum()
number_of_students = len(df)
manual_average = total_math_points / number_of_students

print(f"\nManual calculation: {total_math_points} ÷ {number_of_students} = {manual_average:.1f}")
print("✓ This matches our mean calculation!")

The Median (Middle Value)

The median is the middle value when all scores are lined up in order:

# Calculate medians
math_median = df['Math_Score'].median()
science_median = df['Science_Score'].median()
english_median = df['English_Score'].median()

print("MEDIAN SCORES (Middle Values):")
print(f"Math: {math_median:.1f} points")
print(f"Science: {science_median:.1f} points")
print(f"English: {english_median:.1f} points")

# Show how median works with Math scores
math_scores_sorted = sorted(df['Math_Score'])
print(f"\nMath scores in order: {math_scores_sorted}")
print(f"The middle values are: {math_scores_sorted[3]} and {math_scores_sorted[4]}")
print(f"Median is their average: ({math_scores_sorted[3]} + {math_scores_sorted[4]}) ÷ 2 = {math_median}")

# Compare mean vs median
print(f"\nMath - Mean: {math_mean:.1f}, Median: {math_median:.1f}")
if abs(math_mean - math_median) < 1:
    print("Mean and median are very close - data is fairly balanced!")
else:
    print("Mean and median are different - might have some extreme values.")
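To see why a gap between mean and median hints at extreme values, the sketch below replaces one Math score with a hypothetical outlier (a score of 20, which is not part of the lesson's data) and recomputes both statistics. The variable names are illustrative only.

```python
import pandas as pd

# Math scores from the lesson's dataset
math_scores = pd.Series([85, 92, 78, 95, 88, 76, 91, 87])

# Hypothetical scenario: the lowest score (76) is replaced by an extreme value (20)
with_outlier = math_scores.replace(76, 20)

mean_shift = math_scores.mean() - with_outlier.mean()
median_shift = math_scores.median() - with_outlier.median()

print(f"Original     - mean: {math_scores.mean():.1f}, median: {math_scores.median():.1f}")
print(f"With outlier - mean: {with_outlier.mean():.1f}, median: {with_outlier.median():.1f}")
print(f"Mean moved by {mean_shift:.1f} points, median moved by {median_shift:.1f} points")
```

In this scenario the mean drops by 7 points while the median does not move, because the median depends only on the middle of the sorted list.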

The Mode (Most Common Value)

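The mode is the value that appears most often. A dataset can have more than one mode when several values tie for most frequent:

# Find modes for Grade (categorical) and Age (numeric)
# .mode() returns every value tied for most frequent, so we keep them all
grade_modes = df['Grade'].mode().tolist()
age_modes = df['Age'].mode().tolist()

print("MODE (Most Common Values):")
print(f"Most common grade level(s): {', '.join(grade_modes)}")
print(f"Most common age(s): {', '.join(str(age) for age in age_modes)}")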

# Count frequencies to see why
print(f"\nGrade frequencies:")
print(df['Grade'].value_counts())
print(f"\nAge frequencies:")
print(df['Age'].value_counts())

2. Measures of Spread - “How Spread Out Are the Values?”

These statistics tell us how much variation there is in our data.

Range (Highest - Lowest)

# Calculate ranges
math_range = df['Math_Score'].max() - df['Math_Score'].min()
science_range = df['Science_Score'].max() - df['Science_Score'].min()
english_range = df['English_Score'].max() - df['English_Score'].min()

print("RANGE (Spread of Scores):")
print(f"Math: {df['Math_Score'].min()} to {df['Math_Score'].max()} (range: {math_range} points)")
print(f"Science: {df['Science_Score'].min()} to {df['Science_Score'].max()} (range: {science_range} points)")
print(f"English: {df['English_Score'].min()} to {df['English_Score'].max()} (range: {english_range} points)")

# Which subject has the most variation?
ranges = {'Math': math_range, 'Science': science_range, 'English': english_range}
most_varied = max(ranges, key=ranges.get)
least_varied = min(ranges, key=ranges.get)

print(f"\nMost variation: {most_varied} ({ranges[most_varied]} point range)")
print(f"Least variation: {least_varied} ({ranges[least_varied]} point range)")

Standard Deviation (Average Distance from Mean)

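Standard deviation is the most widely used measure of spread. It describes roughly how far a typical value sits from the mean. By default, the pandas .std() method computes the sample standard deviation (it divides by n - 1 rather than n):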

# Calculate standard deviations
math_std = df['Math_Score'].std()
science_std = df['Science_Score'].std()
english_std = df['English_Score'].std()

print("STANDARD DEVIATION (Average Distance from Mean):")
print(f"Math: {math_std:.1f} points")
print(f"Science: {science_std:.1f} points")
print(f"English: {english_std:.1f} points")

# Explain what this means
print(f"\nWhat does this mean?")
low, high = math_mean - math_std, math_mean + math_std
within_one_std = df['Math_Score'].between(low, high).sum()
print(f"In Math, a typical score is about {math_std:.1f} points away from the average ({math_mean:.1f})")
print(f"{within_one_std} of {len(df)} Math scores fall between {low:.1f} and {high:.1f}")
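The standard deviation can also be computed step by step, which makes the "distance from the mean" idea concrete. The sketch below is a minimal illustration using the lesson's Math scores. It also shows that pandas and NumPy use different defaults: pandas divides by n - 1 (sample standard deviation) while NumPy divides by n (population standard deviation). The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Math scores from the lesson's dataset
math_scores = pd.Series([85, 92, 78, 95, 88, 76, 91, 87])

# Step 1: distance of each score from the mean
deviations = math_scores - math_scores.mean()

# Step 2: square the distances and add them up
sum_of_squares = (deviations ** 2).sum()
n = len(math_scores)

# Step 3: average the squared distances, then take the square root
sample_std = np.sqrt(sum_of_squares / (n - 1))   # divides by n - 1
population_std = np.sqrt(sum_of_squares / n)     # divides by n

print(f"Sample std (n - 1): {sample_std:.2f}")
print(f"Population std (n): {population_std:.2f}")

# pandas defaults to ddof=1, NumPy defaults to ddof=0
print(f"pandas .std():      {math_scores.std():.2f}")
print(f"numpy np.std():     {np.std(math_scores.to_numpy()):.2f}")
```

With only 8 students the two versions differ noticeably, so it is worth knowing which one a library reports.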

3. Percentiles and Quartiles - “Where Do You Rank?”

These help us understand where individual values stand compared to the group:

# Calculate quartiles (25th, 50th, 75th percentiles)
math_q1 = df['Math_Score'].quantile(0.25)  # 25th percentile
math_q2 = df['Math_Score'].quantile(0.50)  # 50th percentile (median)
math_q3 = df['Math_Score'].quantile(0.75)  # 75th percentile

print("MATH SCORE QUARTILES:")
print(f"25th percentile (Q1): {math_q1:.1f} - Bottom 25% score below this")
print(f"50th percentile (Q2): {math_q2:.1f} - Half score below this (median)")  
print(f"75th percentile (Q3): {math_q3:.1f} - Top 25% score above this")

# Show where each student ranks
print(f"\nStudent Rankings in Math:")
for _, student in df.iterrows():
    score = student['Math_Score']
    # Share of students who scored strictly lower than this student
    percent_below = (df['Math_Score'] < score).mean() * 100

    # Labels are relative to the quartiles; Q2 is the median, not the mean
    if score >= math_q3:
        rank = "Top Quarter"
    elif score >= math_q2:
        rank = "Above Median"
    elif score >= math_q1:
        rank = "Below Median"
    else:
        rank = "Bottom Quarter"

    print(f"{student['Name']}: {score} points - {rank} (higher than {percent_below:.1f}% of students)")
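With only 8 scores, a quartile usually falls between two actual values, so pandas interpolates. The sketch below reproduces the default linear interpolation by hand and compares it with .quantile(). The helper linear_quantile is a hypothetical name written for this illustration, not a pandas function.

```python
import math
import pandas as pd

# Math scores from the lesson's dataset
math_scores = pd.Series([85, 92, 78, 95, 88, 76, 91, 87])

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)   # fractional index into the sorted list
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

for q in (0.25, 0.50, 0.75):
    manual = linear_quantile(math_scores, q)
    print(f"q={q:.2f}: manual = {manual:.2f}, pandas = {math_scores.quantile(q):.2f}")
```

For Q1, the position is 0.25 × 7 = 1.75, which lies three quarters of the way from the second sorted score (78) to the third (85), giving 83.25.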

4. Creating a Statistical Summary

Let’s create a comprehensive statistical profile:

# Generate complete statistical summary
print("="*60)
print("COMPLETE STATISTICAL SUMMARY")
print("="*60)

subjects = ['Math_Score', 'Science_Score', 'English_Score']
for subject in subjects:
    subject_name = subject.replace('_', ' ')
    print(f"\n{subject_name.upper()}:")
    
    # Central tendency
    mean_val = df[subject].mean()
    median_val = df[subject].median()
    
    # Spread
    std_val = df[subject].std()
    range_val = df[subject].max() - df[subject].min()
    
    # Quartiles
    q1 = df[subject].quantile(0.25)
    q3 = df[subject].quantile(0.75)
    iqr = q3 - q1  # Interquartile Range
    
    print(f"  Central Tendency:")
    print(f"    Mean (average): {mean_val:.1f}")
    print(f"    Median (middle): {median_val:.1f}")
    
    print(f"  Spread:")
    print(f"    Standard deviation: {std_val:.1f}")
    print(f"    Range: {range_val:.0f} (from {df[subject].min()} to {df[subject].max()})")
    print(f"    IQR (middle 50%): {iqr:.1f}")
    
    print(f"  Distribution:")
    print(f"    25th percentile: {q1:.1f}")
    print(f"    75th percentile: {q3:.1f}")
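Much of the summary above is also available from built-in pandas methods. The sketch below uses describe() for a quick overview and groupby() to compare Math statistics across grade levels, which is one of this lesson's learning goals. It rebuilds a reduced copy of the dataset (named scores here for illustration) so that it runs on its own.

```python
import pandas as pd

# A reduced copy of the lesson's dataset (only the columns needed here)
scores = pd.DataFrame({
    'Grade': ['10th', '11th', '10th', '12th', '11th', '10th', '11th', '12th'],
    'Math_Score': [85, 92, 78, 95, 88, 76, 91, 87],
    'Science_Score': [89, 87, 82, 98, 85, 79, 93, 89],
    'English_Score': [92, 85, 88, 91, 94, 83, 89, 86],
})

# describe() reports count, mean, std, min, quartiles, and max for numeric columns
summary = scores.describe().round(1)
print(summary)

# Compare Math statistics across grade levels
math_by_grade = (
    scores.groupby('Grade')['Math_Score']
    .agg(['count', 'mean', 'median', 'std'])
    .round(1)
)
print("\nMath score by grade:")
print(math_by_grade)
```

Each grade has only two or three students here, so the per-group numbers illustrate the technique rather than support real conclusions.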

5. Visualizing Statistics

Let’s create plots that show our statistical measures:

# Create a visualization of our statistics
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Statistical Analysis of Student Performance', fontsize=16, fontweight='bold')

# 1. Box plot showing quartiles and outliers
subjects_data = [df['Math_Score'], df['Science_Score'], df['English_Score']]
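box_plot = axes[0, 0].boxplot(subjects_data, patch_artist=True)
# Set tick labels separately: newer Matplotlib versions renamed boxplot's labels= to tick_labels=
axes[0, 0].set_xticklabels(['Math', 'Science', 'English'])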
axes[0, 0].set_title('Distribution Summary (Box Plot)')
axes[0, 0].set_ylabel('Score')

# Color the boxes
colors = ['lightblue', 'lightgreen', 'lightcoral']
for patch, color in zip(box_plot['boxes'], colors):
    patch.set_facecolor(color)

# 2. Histogram with mean and median lines
axes[0, 1].hist(df['Math_Score'], bins=5, alpha=0.7, color='lightblue', edgecolor='black')
axes[0, 1].axvline(df['Math_Score'].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {df["Math_Score"].mean():.1f}')
axes[0, 1].axvline(df['Math_Score'].median(), color='orange', linestyle='-', linewidth=2, label=f'Median: {df["Math_Score"].median():.1f}')
axes[0, 1].set_title('Math Score Distribution')
axes[0, 1].set_xlabel('Score')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()

# 3. Standard deviation visualization
subjects = ['Math', 'Science', 'English']
means = [df['Math_Score'].mean(), df['Science_Score'].mean(), df['English_Score'].mean()]
stds = [df['Math_Score'].std(), df['Science_Score'].std(), df['English_Score'].std()]

axes[1, 0].bar(subjects, means, yerr=stds, capsize=5, color=colors, alpha=0.7, edgecolor='black')
axes[1, 0].set_title('Mean ± Standard Deviation')
axes[1, 0].set_ylabel('Score')

# 4. Range comparison
ranges = [df['Math_Score'].max()-df['Math_Score'].min(),
          df['Science_Score'].max()-df['Science_Score'].min(), 
          df['English_Score'].max()-df['English_Score'].min()]

axes[1, 1].bar(subjects, ranges, color=colors, alpha=0.7, edgecolor='black')
axes[1, 1].set_title('Score Range by Subject')
axes[1, 1].set_ylabel('Range (Max - Min)')

plt.tight_layout()
plt.show()
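The box plot above draws points as outliers when they fall more than 1.5 × IQR beyond the quartiles, which is Matplotlib's default (whis=1.5). The sketch below applies the same rule by hand to the Math scores. The names lower_fence and upper_fence are illustrative only.

```python
import pandas as pd

# Math scores from the lesson's dataset
math_scores = pd.Series([85, 92, 78, 95, 88, 76, 91, 87])

q1 = math_scores.quantile(0.25)
q3 = math_scores.quantile(0.75)
iqr = q3 - q1  # width of the box in the box plot

# Points outside these fences would be drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = math_scores[(math_scores < lower_fence) | (math_scores > upper_fence)]

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"Outlier fences: below {lower_fence:.2f} or above {upper_fence:.2f}")
print(f"Number of outliers: {len(outliers)}")
```

No Math score falls outside the fences, which is why the Math box in the plot shows whiskers but no separate outlier points.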

✅ Key Learning Points

Central Tendency (Typical Values):
- Mean: Mathematical average
- Median: Middle value (less affected by extremes)
- Mode: Most common value
Spread (Variation):
- Range: Difference between highest and lowest
- Standard deviation: Typical distance of values from the mean
- Quartiles: Divide data into four equal parts
When to Use Which:
- Use median when you have extreme values (outliers)
- Use mean for roughly symmetric data without extreme values (for example, normally distributed data)
- Use standard deviation to understand consistency
Statistics tell stories - they help us compare and understand our data

💡 Practice Exercises

Try these exercises to practice your statistical analysis skills. Download the interactive notebook to work through them!

🚀 What’s Next?

Now that we understand basic statistics, we’re ready to explore how different variables relate to each other! In the next lesson, we’ll learn about correlations - discovering which things tend to go together.
