Python Data Analytics Lesson 4

Lesson 4: Understanding Basic Statistics and Summaries

๐Ÿ“˜ Overview

Statistics help us summarize our data with just a few key numbers. Think of statistics as a way to describe a whole group with simple, powerful facts. Today we’ll learn the most important statistical measures that every data analyst needs to know!

Expected Duration: 45-60 minutes

Required packages:

  • pandas
  • numpy
  • matplotlib
  • seaborn

๐ŸŽฏ Learning Goals

By the end of this lesson, you will be able to:

  • Calculate and interpret measures of central tendency (mean, median, mode)
  • Understand measures of spread (range, standard deviation, quartiles)
  • Create comprehensive statistical summaries
  • Compare statistics across different groups
  • Visualize statistical measures effectively

๐Ÿ“Œ Setting Up Our Data

Let’s use our familiar student dataset and add some new information:

# Install required packages for data analytics
import piplite
await piplite.install(['seaborn', 'matplotlib', 'pandas', 'numpy', 'scipy', 'plotly'])
print("Packages installed successfully!")
print("You can now import and use: seaborn, matplotlib, pandas, numpy, scipy, plotly")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Our student dataset
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva', 'Frank', 'Grace', 'Henry'],
    'Age': [16, 17, 16, 18, 17, 16, 17, 18],
    'Grade': ['10th', '11th', '10th', '12th', '11th', '10th', '11th', '12th'],
    'Math_Score': [85, 92, 78, 95, 88, 76, 91, 87],
    'Science_Score': [89, 87, 82, 98, 85, 79, 93, 89],
    'English_Score': [92, 85, 88, 91, 94, 83, 89, 86],
    'Hours_Studied': [5, 8, 4, 10, 6, 3, 9, 7],
    'Extracurriculars': [2, 1, 3, 2, 4, 1, 2, 3]
}

df = pd.DataFrame(student_data)
df['Average_Score'] = df[['Math_Score', 'Science_Score', 'English_Score']].mean(axis=1)

print("Our student data is ready for statistical analysis!")
print(df[['Name', 'Math_Score', 'Science_Score', 'English_Score', 'Average_Score']])

1. Measures of Central Tendency - “What’s Normal?”

These statistics tell us what a “typical” value looks like in our data.

The Mean (Average)

The mean is what most people think of as “average”:

# Calculate means for our test scores
math_mean = df['Math_Score'].mean()
science_mean = df['Science_Score'].mean()
english_mean = df['English_Score'].mean()

print("MEAN SCORES (Averages):")
print(f"Math: {math_mean:.1f} points")
print(f"Science: {science_mean:.1f} points") 
print(f"English: {english_mean:.1f} points")

# The mean tells us the "center" of our data
print(f"\nOverall, students average {math_mean:.1f} points in Math.")
print("This means if we added all Math scores and divided by 8 students,")
print(f"we get {math_mean:.1f}.")

# Let's verify this calculation manually
total_math_points = df['Math_Score'].sum()
number_of_students = len(df)
manual_average = total_math_points / number_of_students

print(f"\nManual calculation: {total_math_points} รท {number_of_students} = {manual_average:.1f}")
print("โœ“ This matches our mean calculation!")

The Median (Middle Value)

The median is the middle value when all scores are lined up in order:

# Calculate medians
math_median = df['Math_Score'].median()
science_median = df['Science_Score'].median()
english_median = df['English_Score'].median()

print("MEDIAN SCORES (Middle Values):")
print(f"Math: {math_median:.1f} points")
print(f"Science: {science_median:.1f} points")
print(f"English: {english_median:.1f} points")

# Show how median works with Math scores
math_scores_sorted = sorted(df['Math_Score'])
print(f"\nMath scores in order: {math_scores_sorted}")
print(f"The middle values are: {math_scores_sorted[3]} and {math_scores_sorted[4]}")
print(f"Median is their average: ({math_scores_sorted[3]} + {math_scores_sorted[4]}) รท 2 = {math_median}")

# Compare mean vs median
print(f"\nMath - Mean: {math_mean:.1f}, Median: {math_median:.1f}")
if abs(math_mean - math_median) < 1:
    print("Mean and median are very close - data is fairly balanced!")
else:
    print("Mean and median are different - might have some extreme values.")

The Mode (Most Common Value)

The mode is the value that appears most often:

# Find modes for our categorical data
grade_mode = df['Grade'].mode()[0]  # Most common grade
age_mode = df['Age'].mode()[0]      # Most common age

print("MODE (Most Common Values):")
print(f"Most common grade level: {grade_mode}")
print(f"Most common age: {age_mode}")

# Count frequencies to see why
print(f"\nGrade frequencies:")
print(df['Grade'].value_counts())
print(f"\nAge frequencies:")
print(df['Age'].value_counts())

2. Measures of Spread - “How Spread Out Are the Values?”

These statistics tell us how much variation there is in our data.

Range (Highest - Lowest)

# Calculate ranges
math_range = df['Math_Score'].max() - df['Math_Score'].min()
science_range = df['Science_Score'].max() - df['Science_Score'].min()
english_range = df['English_Score'].max() - df['English_Score'].min()

print("RANGE (Spread of Scores):")
print(f"Math: {df['Math_Score'].min()} to {df['Math_Score'].max()} (range: {math_range} points)")
print(f"Science: {df['Science_Score'].min()} to {df['Science_Score'].max()} (range: {science_range} points)")
print(f"English: {df['English_Score'].min()} to {df['English_Score'].max()} (range: {english_range} points)")

# Which subject has the most variation?
ranges = {'Math': math_range, 'Science': science_range, 'English': english_range}
most_varied = max(ranges, key=ranges.get)
least_varied = min(ranges, key=ranges.get)

print(f"\nMost variation: {most_varied} ({ranges[most_varied]} point range)")
print(f"Least variation: {least_varied} ({ranges[least_varied]} point range)")

Standard Deviation (Average Distance from Mean)

This is the most important measure of spread:

# Calculate standard deviations
math_std = df['Math_Score'].std()
science_std = df['Science_Score'].std()
english_std = df['English_Score'].std()

print("STANDARD DEVIATION (Average Distance from Mean):")
print(f"Math: {math_std:.1f} points")
print(f"Science: {science_std:.1f} points")
print(f"English: {english_std:.1f} points")

# Explain what this means
print(f"\nWhat does this mean?")
print(f"In Math, most students score within {math_std:.1f} points of the average ({math_mean:.1f})")
print(f"That means most Math scores are between {math_mean-math_std:.1f} and {math_mean+math_std:.1f}")

3. Percentiles and Quartiles - “Where Do You Rank?”

These help us understand where individual values stand compared to the group:

# Calculate quartiles (25th, 50th, 75th percentiles)
math_q1 = df['Math_Score'].quantile(0.25)  # 25th percentile
math_q2 = df['Math_Score'].quantile(0.50)  # 50th percentile (median)
math_q3 = df['Math_Score'].quantile(0.75)  # 75th percentile

print("MATH SCORE QUARTILES:")
print(f"25th percentile (Q1): {math_q1:.1f} - Bottom 25% score below this")
print(f"50th percentile (Q2): {math_q2:.1f} - Half score below this (median)")  
print(f"75th percentile (Q3): {math_q3:.1f} - Top 25% score above this")

# Show where each student ranks
print(f"\nStudent Rankings in Math:")
for _, student in df.iterrows():
    score = student['Math_Score']
    percentile = (df['Math_Score'] < score).mean() * 100
    
    if score >= math_q3:
        rank = "Top Quarter"
    elif score >= math_q2:
        rank = "Above Average"  
    elif score >= math_q1:
        rank = "Below Average"
    else:
        rank = "Bottom Quarter"
        
    print(f"{student['Name']}: {score} points - {rank} ({percentile:.0f}th percentile)")

4. Creating a Statistical Summary

Let’s create a comprehensive statistical profile:

# Generate complete statistical summary
print("="*60)
print("COMPLETE STATISTICAL SUMMARY")
print("="*60)

subjects = ['Math_Score', 'Science_Score', 'English_Score']
for subject in subjects:
    subject_name = subject.replace('_', ' ')
    print(f"\n{subject_name.upper()}:")
    
    # Central tendency
    mean_val = df[subject].mean()
    median_val = df[subject].median()
    
    # Spread
    std_val = df[subject].std()
    range_val = df[subject].max() - df[subject].min()
    
    # Quartiles
    q1 = df[subject].quantile(0.25)
    q3 = df[subject].quantile(0.75)
    iqr = q3 - q1  # Interquartile Range
    
    print(f"  Central Tendency:")
    print(f"    Mean (average): {mean_val:.1f}")
    print(f"    Median (middle): {median_val:.1f}")
    
    print(f"  Spread:")
    print(f"    Standard deviation: {std_val:.1f}")
    print(f"    Range: {range_val:.0f} (from {df[subject].min()} to {df[subject].max()})")
    print(f"    IQR (middle 50%): {iqr:.1f}")
    
    print(f"  Distribution:")
    print(f"    25th percentile: {q1:.1f}")
    print(f"    75th percentile: {q3:.1f}")

5. Visualizing Statistics

Let’s create plots that show our statistical measures:

# Create a visualization of our statistics
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Statistical Analysis of Student Performance', fontsize=16, fontweight='bold')

# 1. Box plot showing quartiles and outliers
subjects_data = [df['Math_Score'], df['Science_Score'], df['English_Score']]
box_plot = axes[0, 0].boxplot(subjects_data, labels=['Math', 'Science', 'English'], patch_artist=True)
axes[0, 0].set_title('Distribution Summary (Box Plot)')
axes[0, 0].set_ylabel('Score')

# Color the boxes
colors = ['lightblue', 'lightgreen', 'lightcoral']
for patch, color in zip(box_plot['boxes'], colors):
    patch.set_facecolor(color)

# 2. Histogram with mean and median lines
axes[0, 1].hist(df['Math_Score'], bins=5, alpha=0.7, color='lightblue', edgecolor='black')
axes[0, 1].axvline(df['Math_Score'].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {df["Math_Score"].mean():.1f}')
axes[0, 1].axvline(df['Math_Score'].median(), color='orange', linestyle='-', linewidth=2, label=f'Median: {df["Math_Score"].median():.1f}')
axes[0, 1].set_title('Math Score Distribution')
axes[0, 1].set_xlabel('Score')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()

# 3. Standard deviation visualization
subjects = ['Math', 'Science', 'English']
means = [df['Math_Score'].mean(), df['Science_Score'].mean(), df['English_Score'].mean()]
stds = [df['Math_Score'].std(), df['Science_Score'].std(), df['English_Score'].std()]

axes[1, 0].bar(subjects, means, yerr=stds, capsize=5, color=colors, alpha=0.7, edgecolor='black')
axes[1, 0].set_title('Mean ยฑ Standard Deviation')
axes[1, 0].set_ylabel('Score')

# 4. Range comparison
ranges = [df['Math_Score'].max()-df['Math_Score'].min(),
          df['Science_Score'].max()-df['Science_Score'].min(), 
          df['English_Score'].max()-df['English_Score'].min()]

axes[1, 1].bar(subjects, ranges, color=colors, alpha=0.7, edgecolor='black')
axes[1, 1].set_title('Score Range by Subject')
axes[1, 1].set_ylabel('Range (Max - Min)')

plt.tight_layout()
plt.show()

โœ… Key Learning Points

  1. Central Tendency (Typical Values):

    • Mean: Mathematical average
    • Median: Middle value (less affected by extremes)
    • Mode: Most common value
  2. Spread (Variation):

    • Range: Difference between highest and lowest
    • Standard deviation: Average distance from mean
    • Quartiles: Divide data into four equal parts
  3. When to Use Which:

    • Use median when you have extreme values (outliers)
    • Use mean for normally distributed data
    • Use standard deviation to understand consistency
  4. Statistics tell stories - they help us compare and understand our data

๐Ÿ’ก Practice Exercises

Try these exercises to practice your statistical analysis skills. Download the interactive notebook to work through them!

๐Ÿš€ What’s Next?

Now that we understand basic statistics, we’re ready to explore how different variables relate to each other! In the next lesson, we’ll learn about correlations - discovering which things tend to go together.

๐Ÿ““ Interactive Notebook

Want to practice with the interactive Jupyter notebook version of this lesson?

Download the Jupyter Notebook