Lesson 6: Advanced Analysis - P-values, T-tests, and Linear Regression
📘 Overview
Welcome to our final lesson in the data analytics series! Today we’ll learn about the tools that professional data scientists use to make confident conclusions from data. We’ll explore p-values (are our findings real or just luck?), t-tests (comparing groups scientifically), and linear regression (predicting future outcomes).
Expected Duration: 60-75 minutes
Required packages:
- pandas
- numpy
- matplotlib
- seaborn
- scipy
- scikit-learn
🎯 Learning Goals
By the end of this lesson, you will be able to:
- Understand and interpret p-values for statistical significance
- Perform t-tests to compare groups scientifically
- Build linear regression models to make predictions
- Create multiple regression models using several predictors
- Interpret R-squared values and model accuracy
- Generate comprehensive statistical reports
📌 Setting Up for Advanced Analysis
# Install required packages for data analytics
import piplite
await piplite.install(['seaborn', 'matplotlib', 'pandas', 'numpy', 'scipy', 'scikit-learn'])
print("Packages installed successfully!")
print("You can now import and use: seaborn, matplotlib, pandas, numpy, scipy, scikit-learn")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats # This gives us advanced statistical functions
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')  # Hide technical warnings

# Our complete student dataset
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva', 'Frank', 'Grace', 'Henry'],
    'Age': [16, 17, 16, 18, 17, 16, 17, 18],
    'Grade': ['10th', '11th', '10th', '12th', '11th', '10th', '11th', '12th'],
    'Math_Score': [85, 92, 78, 95, 88, 76, 91, 87],
    'Science_Score': [89, 87, 82, 98, 85, 79, 93, 89],
    'English_Score': [92, 85, 88, 91, 94, 83, 89, 86],
    'Hours_Studied': [5, 8, 4, 10, 6, 3, 9, 7],
    'Extracurriculars': [2, 1, 3, 2, 4, 1, 2, 3],
    'Sleep_Hours': [7, 6, 8, 6, 7, 9, 6, 7],
    'Screen_Time': [4, 6, 3, 2, 5, 7, 3, 4],
    'Books_Read': [12, 8, 15, 20, 10, 5, 18, 14]
}
df = pd.DataFrame(student_data)
df['Average_Score'] = df[['Math_Score', 'Science_Score', 'English_Score']].mean(axis=1)
print("Ready for advanced statistical analysis!")
print("We'll answer questions like:")
print("• Are older students REALLY better at math, or could it be random chance?")
print("• Can we PREDICT a student's English score from how many books they read?")
print("• Is the difference between grades STATISTICALLY SIGNIFICANT?")1. Understanding P-Values - “Could This Be Just Luck?”
P-values help us determine if our findings are real or could have happened by chance:
print("UNDERSTANDING P-VALUES")
print("="*40)
print()
print("A p-value answers: 'If there was really no relationship,")
print("what's the probability we'd see results this extreme by pure chance?'")
print()
print("P-value interpretation:")
print("• p < 0.001 → Almost certainly real (99.9% confident)")
print("• p < 0.01 → Very likely real (99% confident)")
print("• p < 0.05 → Probably real (95% confident) ← Common threshold")
print("• p < 0.10 → Possibly real (90% confident)")
print("• p > 0.10 → Could easily be due to chance")
print()
print("In science, we usually want p < 0.05 to call something 'significant'")
# Let's test if the correlation between study hours and grades is significant
study_correlation = df['Hours_Studied'].corr(df['Average_Score'])
correlation_stat, p_value = stats.pearsonr(df['Hours_Studied'], df['Average_Score'])
print(f"\nExample: Study Hours ↔ Average Score")
print(f"Correlation: {study_correlation:.3f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print(f"✓ SIGNIFICANT! (p < 0.05)")
    print(f" We can be confident this relationship is real, not just luck")
elif p_value < 0.10:
    print(f"? BORDERLINE (p < 0.10)")
    print(f" Possibly real, but we can't be very confident")
else:
    print(f"✗ NOT SIGNIFICANT (p > 0.10)")
    print(f" This could easily be due to chance")

2. T-Tests - Comparing Groups Scientifically
T-tests help us determine if differences between groups are real or just random variation:
# Question: Do older students (11th/12th grade) perform better than younger students (10th grade)?
# Split students into two groups
older_students = df[df['Grade'].isin(['11th', '12th'])]
younger_students = df[df['Grade'] == '10th']
older_scores = older_students['Average_Score']
younger_scores = younger_students['Average_Score']
print("T-TEST: Do older students perform better?")
print("="*45)
print()
print("Groups:")
print(f"Older students (11th/12th): {len(older_students)} students")
print(f"Younger students (10th): {len(younger_students)} students")
print(f"\nGroup Averages:")
older_mean = older_scores.mean()
younger_mean = younger_scores.mean()
difference = older_mean - younger_mean
print(f"Older students average: {older_mean:.1f}")
print(f"Younger students average: {younger_mean:.1f}")
print(f"Difference: {difference:.1f} points")
# Perform the t-test
t_statistic, p_value_ttest = stats.ttest_ind(older_scores, younger_scores)
print(f"\nT-test Results:")
print(f"T-statistic: {t_statistic:.3f}")
print(f"P-value: {p_value_ttest:.4f}")
# Interpret the results
print(f"\nInterpretation:")
if p_value_ttest < 0.05:
    print(f"✓ SIGNIFICANT DIFFERENCE (p < 0.05)")
    print(f" The {difference:.1f} point difference is statistically significant")
    print(f" Older students DO perform significantly better")
else:
    print(f"✗ NO SIGNIFICANT DIFFERENCE (p > 0.05)")
    print(f" The {difference:.1f} point difference could easily be due to chance")
    print(f" We can't conclude that older students perform better")

3. One-Sample T-Test - Testing Against a Standard
# Question: Are our students performing above the national average of 85?
national_average = 85
our_average = df['Average_Score'].mean()
print("ONE-SAMPLE T-TEST: Are we above national average?")
print("="*50)
print(f"National average: {national_average}")
print(f"Our school average: {our_average:.1f}")
print(f"Difference: {our_average - national_average:.1f} points")
# Perform one-sample t-test
t_stat_one, p_val_one = stats.ttest_1samp(df['Average_Score'], national_average)
print(f"\nOne-sample t-test results:")
print(f"T-statistic: {t_stat_one:.3f}")
print(f"P-value: {p_val_one:.4f}")
print(f"\nInterpretation:")
if p_val_one < 0.05:
    if our_average > national_average:
        print(f"✓ SIGNIFICANTLY ABOVE AVERAGE (p < 0.05)")
        print(f" Our students perform significantly better than national average")
    else:
        print(f"✓ SIGNIFICANTLY BELOW AVERAGE (p < 0.05)")
        print(f" Our students perform significantly worse than national average")
else:
    print(f"✗ NO SIGNIFICANT DIFFERENCE (p > 0.05)")
    print(f" Our performance is not significantly different from national average")

4. Linear Regression - Predicting the Future
Linear regression helps us make predictions based on relationships we’ve found:
print("LINEAR REGRESSION: Predicting English Scores from Books Read")
print("="*60)
# Prepare data for regression
X = df[['Books_Read']] # Predictor (must be 2D array)
y = df['English_Score'] # What we want to predict
# Create and fit the regression model
model = LinearRegression()
model.fit(X, y)
# Get the equation components
slope = model.coef_[0] # How much English score changes per book
intercept = model.intercept_ # English score when books read = 0
r_squared = r2_score(y, model.predict(X)) # How well our line fits
print(f"Regression Equation:")
print(f"English Score = {intercept:.2f} + {slope:.2f} × (Books Read)")
print()
print(f"What this means:")
print(f"• Base English score (with 0 books): {intercept:.1f}")
print(f"• Each additional book read increases English score by {slope:.2f} points")
print(f"• R-squared: {r_squared:.3f} ({r_squared*100:.1f}% of variation explained)")
# Make predictions for new students
print(f"\nPredictions for new students:")
test_books = [5, 10, 15, 20, 25]
for books in test_books:
    predicted_score = intercept + slope * books
    print(f"Student who reads {books:2d} books/year: {predicted_score:.1f} English score")

5. Visualizing Our Regression
# Create a comprehensive regression plot
plt.figure(figsize=(12, 8))
# Scatter plot of actual data
plt.scatter(df['Books_Read'], df['English_Score'], s=100, alpha=0.7, color='blue', label='Actual Students')
# Add student names
for _, student in df.iterrows():
    plt.annotate(student['Name'],
                 (student['Books_Read'], student['English_Score']),
                 xytext=(5, 5), textcoords='offset points', fontsize=9)
# Plot regression line
books_range = np.linspace(df['Books_Read'].min(), df['Books_Read'].max(), 100)
predicted_scores = intercept + slope * books_range
plt.plot(books_range, predicted_scores, 'r-', linewidth=2, label=f'Regression Line (R² = {r_squared:.3f})')
plt.title('Predicting English Scores from Books Read', fontsize=14, fontweight='bold')
plt.xlabel('Books Read per Year')
plt.ylabel('English Score')
plt.legend()
plt.grid(True, alpha=0.3)
# Add equation text box
equation_text = f'English Score = {intercept:.1f} + {slope:.2f} × Books\nR² = {r_squared:.3f}'
plt.text(0.05, 0.95, equation_text, transform=plt.gca().transAxes,
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
         verticalalignment='top', fontsize=10)
plt.tight_layout()
plt.show()

6. Multiple Regression - Using Several Predictors
print("MULTIPLE REGRESSION: Predicting Average Score from Multiple Factors")
print("="*65)
# Use multiple variables to predict average score
predictors = ['Hours_Studied', 'Sleep_Hours', 'Books_Read', 'Screen_Time']
X_multi = df[predictors]
y_multi = df['Average_Score']
# Fit multiple regression model
model_multi = LinearRegression()
model_multi.fit(X_multi, y_multi)
# Get coefficients
coefficients = model_multi.coef_
intercept_multi = model_multi.intercept_
r_squared_multi = r2_score(y_multi, model_multi.predict(X_multi))
print("Multiple Regression Equation:")
equation = f"Average Score = {intercept_multi:.2f}"
for pred, coef in zip(predictors, coefficients):
    sign = "+" if coef >= 0 else ""
    pred_name = pred.replace('_', ' ')
    equation += f" {sign}{coef:.2f}×{pred_name}"
print(equation)
print(f"\nR-squared: {r_squared_multi:.3f} ({r_squared_multi*100:.1f}% of variation explained)")
print(f"\nCoefficient Interpretation:")
for pred, coef in zip(predictors, coefficients):
    pred_name = pred.replace('_', ' ').title()
    direction = "increases" if coef > 0 else "decreases"
    print(f"• {pred_name}: Each unit increase {direction} score by {abs(coef):.2f} points")
# Make a prediction for a hypothetical student
print(f"\nPrediction Example:")
print(f"New student: 8 hours studied, 7 hours sleep, 15 books read, 3 hours screen time")
new_student = pd.DataFrame([[8, 7, 15, 3]], columns=predictors)  # keep feature names consistent with training data
prediction = model_multi.predict(new_student)[0]
print(f"Predicted average score: {prediction:.1f}")7. Statistical Summary Report
print("="*70)
print("FINAL STATISTICAL ANALYSIS REPORT")
print("="*70)
print(f"\n1. CORRELATION FINDINGS:")
correlations_to_report = [
('Hours_Studied', 'Average_Score'),
('Books_Read', 'English_Score'),
('Sleep_Hours', 'Average_Score'),
('Screen_Time', 'Average_Score')
]
for var1, var2 in correlations_to_report:
    corr, p_val = stats.pearsonr(df[var1], df[var2])
    significance = "Significant" if p_val < 0.05 else "Not significant"
    print(f" {var1} ↔ {var2}: r = {corr:.3f}, p = {p_val:.4f} ({significance})")
print(f"\n2. GROUP COMPARISONS:")
print(f" Older vs Younger students: {difference:.1f} point difference")
significance_group = "Significant" if p_value_ttest < 0.05 else "Not significant"
print(f" Result: {significance_group}")
print(f"\n3. REGRESSION MODELS:")
print(f" Single predictor (Books → English): R² = {r_squared:.3f}")
print(f" Multiple predictors (→ Average): R² = {r_squared_multi:.3f}")
print(f"\n4. KEY INSIGHTS:")
print(f" • Focus on factors with significant correlations")
print(f" • Use regression models for predicting student outcomes")
print(f" • Collect more data to improve statistical power")
print(f" • Remember: correlation ≠ causation")✅ Key Learning Points
- P-values tell us if our findings are likely real (p < 0.05) or could be due to chance
- T-tests help compare groups scientifically (older vs younger students)
- Linear regression lets us predict outcomes and understand relationships
- R-squared shows how much of the variation our model explains
- Multiple regression uses several factors to make better predictions
- Statistical significance doesn’t guarantee practical importance
- Always consider whether correlations might have alternative explanations
💡 Practice Exercises
Try these advanced exercises to practice your statistical skills. Download the interactive notebook to work through them!
🎉 Congratulations!
You’ve completed the data analytics series! You now know how to:
- Load and explore data
- Ask good analytical questions
- Create meaningful visualizations
- Calculate important statistics
- Find correlations and relationships
- Test statistical significance
- Make predictions with regression
These skills form the foundation of data science. Keep practicing with your own datasets, and remember: good analysis combines statistical rigor with clear thinking about what the data really means!
🚀 Continue Your Journey
Continue your data science journey by exploring more advanced topics like machine learning, hypothesis testing, and experimental design.
📓 Interactive Notebook
Want to practice with the interactive Jupyter notebook version of this lesson?