Lesson 6: Advanced Analysis - P-values, T-tests, and Linear Regression

📘 Overview

Welcome to our final lesson in the data analytics series! Today we’ll learn about the tools that professional data scientists use to draw well-supported conclusions from data. We’ll explore p-values (how surprising would our results be if nothing were really going on?), t-tests (comparing group averages), and linear regression (modeling relationships to make predictions).

Expected Duration: 60-75 minutes

Required packages:

pandas
numpy
matplotlib
seaborn
scipy
scikit-learn

🎯 Learning Goals

By the end of this lesson, you will be able to:

Understand and interpret p-values for statistical significance
Perform t-tests to compare groups scientifically
Build linear regression models to make predictions
Create multiple regression models using several predictors
Interpret R-squared values and model accuracy
Generate comprehensive statistical reports

📌 Setting Up for Advanced Analysis

# Install required packages for data analytics
# (piplite is the installer used in JupyterLite notebooks)
import piplite
await piplite.install(['seaborn', 'matplotlib', 'pandas', 'numpy', 'scipy', 'scikit-learn'])
print("Packages installed successfully!")
print("You can now import and use: seaborn, matplotlib, pandas, numpy, scipy, sklearn")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats  # This gives us advanced statistical functions
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')  # Hide technical warnings

# Our complete student dataset
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva', 'Frank', 'Grace', 'Henry'],
    'Age': [16, 17, 16, 18, 17, 16, 17, 18],
    'Grade': ['10th', '11th', '10th', '12th', '11th', '10th', '11th', '12th'],
    'Math_Score': [85, 92, 78, 95, 88, 76, 91, 87],
    'Science_Score': [89, 87, 82, 98, 85, 79, 93, 89],
    'English_Score': [92, 85, 88, 91, 94, 83, 89, 86],
    'Hours_Studied': [5, 8, 4, 10, 6, 3, 9, 7],
    'Extracurriculars': [2, 1, 3, 2, 4, 1, 2, 3],
    'Sleep_Hours': [7, 6, 8, 6, 7, 9, 6, 7],
    'Screen_Time': [4, 6, 3, 2, 5, 7, 3, 4],
    'Books_Read': [12, 8, 15, 20, 10, 5, 18, 14]
}

df = pd.DataFrame(student_data)
df['Average_Score'] = df[['Math_Score', 'Science_Score', 'English_Score']].mean(axis=1)

print("Ready for advanced statistical analysis!")
print("We'll answer questions like:")
print("• Are older students REALLY better at math, or could it be random chance?")
print("• Can we PREDICT a student's English score from how many books they read?") 
print("• Is the difference between grades STATISTICALLY SIGNIFICANT?")

1. Understanding P-Values - “Could This Be Just Luck?”

P-values help us judge whether a result could plausibly have happened by chance alone:

print("UNDERSTANDING P-VALUES")
print("="*40)
print()
print("A p-value answers: 'If there were really no relationship,")
print("what's the probability we'd see results this extreme by pure chance?'")
print()
print("P-value interpretation (evidence against 'no relationship'):")
print("• p < 0.001  → Very strong evidence")
print("• p < 0.01   → Strong evidence")
print("• p < 0.05   → Moderate evidence ← Common threshold")
print("• p < 0.10   → Weak evidence")
print("• p ≥ 0.10   → Little or no evidence; consistent with chance")
print()
print("Note: a p-value is NOT the probability that the finding is real.")
print("In science, we usually want p < 0.05 to call something 'significant'")

# Let's test if the correlation between study hours and grades is significant
# pearsonr returns both the correlation coefficient and its p-value
study_correlation, p_value = stats.pearsonr(df['Hours_Studied'], df['Average_Score'])

print(f"\nExample: Study Hours ↔ Average Score")
print(f"Correlation: {study_correlation:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print(f"✓ SIGNIFICANT! (p < 0.05)")
    print(f"  A correlation this strong would be unlikely if there were no real relationship")
elif p_value < 0.10:
    print(f"? BORDERLINE (p < 0.10)")
    print(f"  Weak evidence; with only {len(df)} students we can't be very confident")
else:
    print(f"✗ NOT SIGNIFICANT (p ≥ 0.10)")
    print(f"  This could easily be due to chance")
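
One way to make the definition above concrete is a permutation test: shuffle the scores so that any real link to study hours is broken, then count how often chance alone produces a correlation at least as strong as the observed one. The sketch below uses the lesson's eight students; the variable names, the seed, and the number of shuffles are illustrative assumptions, and this simulation only approximates the formula-based p-value that `stats.pearsonr` reports.

```python
import numpy as np

# Data copied from the lesson's student dataset
hours = np.array([5, 8, 4, 10, 6, 3, 9, 7])
math_score = np.array([85, 92, 78, 95, 88, 76, 91, 87])
science_score = np.array([89, 87, 82, 98, 85, 79, 93, 89])
english_score = np.array([92, 85, 88, 91, 94, 83, 89, 86])
avg_score = (math_score + science_score + english_score) / 3

# Correlation actually observed in the data
observed_r = np.corrcoef(hours, avg_score)[0, 1]

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability
n_shuffles = 10_000                  # illustrative choice

# Shuffle the scores many times and record the correlation each time
shuffled_r = np.empty(n_shuffles)
for i in range(n_shuffles):
    shuffled_r[i] = np.corrcoef(hours, rng.permutation(avg_score))[0, 1]

# Two-sided p-value: share of shuffles at least as extreme as what we observed
perm_p = np.mean(np.abs(shuffled_r) >= abs(observed_r))

print(f"Observed correlation: {observed_r:.3f}")
print(f"Permutation p-value:  {perm_p:.4f}")
```

If very few shuffles match the observed correlation, the p-value is small, which is exactly what "unlikely under pure chance" means.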

2. T-Tests - Comparing Groups Scientifically

T-tests help us judge whether the difference between two group averages is larger than random variation alone would explain:

# Question: Do older students (11th/12th grade) perform better than younger students (10th grade)?

# Split students into two groups
older_students = df[df['Grade'].isin(['11th', '12th'])]
younger_students = df[df['Grade'] == '10th']

older_scores = older_students['Average_Score']
younger_scores = younger_students['Average_Score']

print("T-TEST: Do older students perform better?")
print("="*45)
print()
print("Groups:")
print(f"Older students (11th/12th): {len(older_students)} students")
print(f"Younger students (10th): {len(younger_students)} students")

print(f"\nGroup Averages:")
older_mean = older_scores.mean()
younger_mean = younger_scores.mean()
difference = older_mean - younger_mean

print(f"Older students average: {older_mean:.1f}")
print(f"Younger students average: {younger_mean:.1f}")
print(f"Difference: {difference:.1f} points")

# Perform the t-test (two-sided; scipy assumes equal group variances by default)
t_statistic, p_value_ttest = stats.ttest_ind(older_scores, younger_scores)

print(f"\nT-test Results:")
print(f"T-statistic: {t_statistic:.3f}")
print(f"P-value: {p_value_ttest:.4f}")

# Interpret the results
print(f"\nInterpretation:")
if p_value_ttest < 0.05:
    # Check the sign so the message matches the direction of the difference
    better_or_worse = "better" if difference > 0 else "worse"
    print(f"✓ SIGNIFICANT DIFFERENCE (p < 0.05)")
    print(f"  The {difference:.1f} point difference is statistically significant")
    print(f"  Older students perform significantly {better_or_worse} on average")
else:
    print(f"✗ NO SIGNIFICANT DIFFERENCE (p ≥ 0.05)")
    print(f"  The {difference:.1f} point difference could easily be due to chance")
    print(f"  We can't conclude that older students perform better")
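
With groups this small (five and three students), two companion checks are worth a look. The sketch below compares the default test with Welch's t-test (`equal_var=False`), which does not assume equal group variances, and computes Cohen's d as a rough effect size. The arrays are the per-student averages from the lesson's dataset; the variable names and the choice of Cohen's d are assumptions added for illustration.

```python
import numpy as np
from scipy import stats

# Average scores per student (sum of three subject scores divided by 3)
older = np.array([264, 284, 267, 273, 262]) / 3   # Bob, Diana, Eva, Grace, Henry
younger = np.array([266, 248, 238]) / 3           # Alice, Charlie, Frank

# Default test assumes equal variances; Welch's version drops that assumption
student = stats.ttest_ind(older, younger)
welch = stats.ttest_ind(older, younger, equal_var=False)

# Cohen's d: difference in means measured in pooled standard deviations
n1, n2 = len(older), len(younger)
pooled_var = ((n1 - 1) * older.var(ddof=1) + (n2 - 1) * younger.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (older.mean() - younger.mean()) / np.sqrt(pooled_var)

print(f"Student t-test: t = {student.statistic:.3f}, p = {student.pvalue:.4f}")
print(f"Welch t-test:   t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
print(f"Cohen's d:      {cohens_d:.2f}")
```

A large effect size paired with a p-value near or above 0.05 usually signals that the sample is too small to settle the question, not that the groups are the same.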

3. One-Sample T-Test - Testing Against a Standard

# Question: Are our students performing above the national average of 85?
national_average = 85
our_average = df['Average_Score'].mean()

print("ONE-SAMPLE T-TEST: Are we above national average?")
print("="*50)
print(f"National average: {national_average}")
print(f"Our school average: {our_average:.1f}")
print(f"Difference: {our_average - national_average:.1f} points")

# Perform one-sample t-test
t_stat_one, p_val_one = stats.ttest_1samp(df['Average_Score'], national_average)

print(f"\nOne-sample t-test results:")
print(f"T-statistic: {t_stat_one:.3f}")
print(f"P-value: {p_val_one:.4f}")

print(f"\nInterpretation:")
if p_val_one < 0.05:
    if our_average > national_average:
        print(f"✓ SIGNIFICANTLY ABOVE AVERAGE (p < 0.05)")
        print(f"  Our students perform significantly better than national average")
    else:
        print(f"✓ SIGNIFICANTLY BELOW AVERAGE (p < 0.05)")
        print(f"  Our students perform significantly worse than national average")
else:
    print(f"✗ NO SIGNIFICANT DIFFERENCE (p > 0.05)")
    print(f"  Our performance is not significantly different from national average")
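
To see where the t-statistic comes from, the sketch below rebuilds the one-sample test by hand (sample mean minus the reference value, divided by the standard error) and compares it with `stats.ttest_1samp`. It also computes a 95% confidence interval for the school mean. The data are the lesson's per-student averages; the variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Average score per student from the lesson's dataset
avg_score = np.array([266, 264, 248, 284, 267, 238, 273, 262]) / 3
national_average = 85

n = len(avg_score)
sample_mean = avg_score.mean()
standard_error = avg_score.std(ddof=1) / np.sqrt(n)

# t = (sample mean - reference value) / standard error
t_manual = (sample_mean - national_average) / standard_error
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), n - 1)

# The library result should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_1samp(avg_score, national_average)

# 95% confidence interval for the true mean
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=standard_error)

print(f"Manual: t = {t_manual:.3f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.3f}, p = {p_scipy:.4f}")
print(f"95% CI for the mean: ({ci_low:.1f}, {ci_high:.1f})")
```

When the confidence interval contains the reference value, the two-sided test at the 0.05 level will not be significant; the two views always agree.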

4. Linear Regression - Predicting the Future

Linear regression fits a straight line through our data so we can estimate one variable from another:

print("LINEAR REGRESSION: Predicting English Scores from Books Read")
print("="*60)

# Prepare data for regression
X = df[['Books_Read']]  # Predictor (must be 2D array)
y = df['English_Score']  # What we want to predict

# Create and fit the regression model
model = LinearRegression()
model.fit(X, y)

# Get the equation components
slope = model.coef_[0]      # How much English score changes per book
intercept = model.intercept_ # English score when books read = 0
r_squared = r2_score(y, model.predict(X))  # How well our line fits

print(f"Regression Equation:")
print(f"English Score = {intercept:.2f} + {slope:.2f} × (Books Read)")
print()
print(f"What this means:")
print(f"• Predicted English score with 0 books (the intercept): {intercept:.1f}")
print(f"• Each additional book read is associated with a {slope:+.2f} point change in English score")
print(f"• R-squared: {r_squared:.3f} ({r_squared*100:.1f}% of variation explained)")

# Make predictions for new students
# Note: our data covers 5 to 20 books, so the prediction for 25 is an extrapolation
print(f"\nPredictions for new students:")
test_books = [5, 10, 15, 20, 25]
for books in test_books:
    predicted_score = intercept + slope * books
    print(f"Student who reads {books:2d} books/year: {predicted_score:.1f} English score")
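
A fitted line always produces predictions, so it helps to ask whether the slope itself is statistically distinguishable from zero. The sketch below uses `scipy.stats.linregress`, which returns the slope, intercept, correlation, and a p-value for the slope in one call. This is an alternative to the scikit-learn model above, not a replacement for it, and the variable names are illustrative.

```python
from scipy import stats

# Data copied from the lesson's student dataset
books_read = [12, 8, 15, 20, 10, 5, 18, 14]
english_score = [92, 85, 88, 91, 94, 83, 89, 86]

# linregress fits the same least-squares line and also tests the slope
result = stats.linregress(books_read, english_score)

print(f"Slope:     {result.slope:.3f}")
print(f"Intercept: {result.intercept:.2f}")
print(f"R-squared: {result.rvalue ** 2:.3f}")
print(f"P-value for the slope: {result.pvalue:.4f}")

if result.pvalue < 0.05:
    print("The slope is statistically significant at the 0.05 level")
else:
    print("The slope is NOT statistically significant; treat predictions with caution")
```

For this dataset the slope's p-value comes out well above 0.05, so the books-to-English predictions are best read as a demonstration of the method, not as a reliable forecast.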

5. Visualizing Our Regression

# Create a comprehensive regression plot
plt.figure(figsize=(12, 8))

# Scatter plot of actual data
plt.scatter(df['Books_Read'], df['English_Score'], s=100, alpha=0.7, color='blue', label='Actual Students')

# Add student names
for _, student in df.iterrows():
    plt.annotate(student['Name'], 
                (student['Books_Read'], student['English_Score']),
                xytext=(5, 5), textcoords='offset points', fontsize=9)

# Plot regression line
books_range = np.linspace(df['Books_Read'].min(), df['Books_Read'].max(), 100)
predicted_scores = intercept + slope * books_range
plt.plot(books_range, predicted_scores, 'r-', linewidth=2, label=f'Regression Line (R² = {r_squared:.3f})')

plt.title('Predicting English Scores from Books Read', fontsize=14, fontweight='bold')
plt.xlabel('Books Read per Year')
plt.ylabel('English Score')
plt.legend()
plt.grid(True, alpha=0.3)

# Add equation text box
equation_text = f'English Score = {intercept:.1f} + {slope:.2f} × Books\nR² = {r_squared:.3f}'
plt.text(0.05, 0.95, equation_text, transform=plt.gca().transAxes, 
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
         verticalalignment='top', fontsize=10)

plt.tight_layout()
plt.show()

6. Multiple Regression - Using Several Predictors

print("MULTIPLE REGRESSION: Predicting Average Score from Multiple Factors")
print("="*65)

# Use multiple variables to predict average score
predictors = ['Hours_Studied', 'Sleep_Hours', 'Books_Read', 'Screen_Time']
X_multi = df[predictors]
y_multi = df['Average_Score']

# Fit multiple regression model
model_multi = LinearRegression()
model_multi.fit(X_multi, y_multi)

# Get coefficients
coefficients = model_multi.coef_
intercept_multi = model_multi.intercept_
r_squared_multi = r2_score(y_multi, model_multi.predict(X_multi))

print("Multiple Regression Equation:")
equation = f"Average Score = {intercept_multi:.2f}"
for pred, coef in zip(predictors, coefficients):
    pred_name = pred.replace('_', ' ')
    equation += f" {coef:+.2f}×{pred_name}"  # '+' in the format spec always shows the sign
print(equation)

print(f"\nR-squared: {r_squared_multi:.3f} ({r_squared_multi*100:.1f}% of variation explained)")

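print(f"\nCoefficient Interpretation (holding the other predictors constant):")
for pred, coef in zip(predictors, coefficients):
    pred_name = pred.replace('_', ' ').title()
    direction = "higher" if coef > 0 else "lower"
    print(f"• {pred_name}: Each one-unit increase goes with a {abs(coef):.2f} point {direction} predicted score")
print("Caution: with only 8 students and 4 predictors, this model can easily overfit")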

# Make a prediction for a hypothetical student
print(f"\nPrediction Example:")
print(f"New student: 8 hours studied, 7 hours sleep, 15 books read, 3 hours screen time")
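# Use a DataFrame with the same column names the model was fitted on
new_student = pd.DataFrame([[8, 7, 15, 3]], columns=predictors)
prediction = model_multi.predict(new_student)[0]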
print(f"Predicted average score: {prediction:.1f}")
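
An R-squared computed on the same data used for fitting tends to flatter the model, especially with four predictors and only eight students. The sketch below compares that in-sample R-squared with two more cautious figures: adjusted R-squared, and an R-squared built from leave-one-out predictions (each student is predicted by a model fitted on the other seven). The use of `LeaveOneOut` and `cross_val_predict` is an assumption added for illustration; the lesson itself does not cover cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Predictors and target copied from the lesson's student dataset
hours = [5, 8, 4, 10, 6, 3, 9, 7]
sleep = [7, 6, 8, 6, 7, 9, 6, 7]
books = [12, 8, 15, 20, 10, 5, 18, 14]
screen = [4, 6, 3, 2, 5, 7, 3, 4]
X = np.column_stack([hours, sleep, books, screen])
y = np.array([266, 264, 248, 284, 267, 238, 273, 262]) / 3  # average scores

n_samples, n_predictors = X.shape

# R-squared on the data the model was fitted to
model = LinearRegression().fit(X, y)
in_sample_r2 = r2_score(y, model.predict(X))

# Adjusted R-squared penalizes the number of predictors
adjusted_r2 = 1 - (1 - in_sample_r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Leave-one-out: predict each student from a model that never saw them
loo_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
loo_r2 = r2_score(y, loo_pred)

print(f"In-sample R-squared:     {in_sample_r2:.3f}")
print(f"Adjusted R-squared:      {adjusted_r2:.3f}")
print(f"Leave-one-out R-squared: {loo_r2:.3f}")
```

A large gap between the in-sample and leave-one-out values is a sign that the model is memorizing these eight students instead of capturing a pattern that would hold for new ones.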

7. Statistical Summary Report

print("="*70)
print("FINAL STATISTICAL ANALYSIS REPORT")
print("="*70)

print(f"\n1. CORRELATION FINDINGS:")
correlations_to_report = [
    ('Hours_Studied', 'Average_Score'),
    ('Books_Read', 'English_Score'), 
    ('Sleep_Hours', 'Average_Score'),
    ('Screen_Time', 'Average_Score')
]

for var1, var2 in correlations_to_report:
    # pearsonr returns the correlation and its p-value together
    corr, p_val = stats.pearsonr(df[var1], df[var2])
    significance = "Significant" if p_val < 0.05 else "Not significant"
    print(f"   {var1} ↔ {var2}: r = {corr:.3f}, p = {p_val:.4f} ({significance})")

print(f"\n2. GROUP COMPARISONS:")
print(f"   Older vs Younger students: {difference:.1f} point difference")
significance_group = "Significant" if p_value_ttest < 0.05 else "Not significant"
print(f"   Result: {significance_group}")

print(f"\n3. REGRESSION MODELS:")
print(f"   Single predictor (Books → English): R² = {r_squared:.3f}")
print(f"   Multiple predictors (→ Average): R² = {r_squared_multi:.3f}")

print(f"\n4. KEY INSIGHTS:")
print(f"   • Focus on factors with significant correlations")
print(f"   • Use regression models for predicting student outcomes")
print(f"   • Collect more data to improve statistical power")
print(f"   • Remember: correlation ≠ causation")

✅ Key Learning Points

P-values tell us how surprising our results would be if there were no real effect; p < 0.05 is a common threshold for calling a result significant
T-tests help compare groups scientifically (older vs younger students)
Linear regression lets us predict outcomes and understand relationships
R-squared shows how much of the variation our model explains
Multiple regression uses several factors at once, which can improve predictions but can also overfit small datasets
Statistical significance doesn’t guarantee practical importance
Always consider whether correlations might have alternative explanations

💡 Practice Exercises

Try these advanced exercises to practice your statistical skills. Download the interactive notebook to work through them!

🎉 Congratulations!

You’ve completed the data analytics series! You now know how to:

Load and explore data
Ask good analytical questions
Create meaningful visualizations
Calculate important statistics
Find correlations and relationships
Test statistical significance
Make predictions with regression

These skills form the foundation of data science. Keep practicing with your own datasets, and remember: good analysis combines statistical rigor with clear thinking about what the data really means!

🚀 Continue Your Journey

Continue your data science journey by exploring more advanced topics like machine learning, hypothesis testing, and experimental design.
