Research Ethics in Data Science

Why Ethics Matters in Data Science

Data science has real-world impact. Your models and analyses affect people’s lives, from loan approvals to medical diagnoses to criminal justice. As researchers, we have a responsibility to do no harm.

Remember: Just because you can do something with data doesn’t mean you should.


Core Ethical Principles

1. Privacy: Protect Personal Information

What to consider:

  • Is this data about identifiable people?
  • Do subjects know their data is being used this way?
  • Could re-identification be possible?
  • Are you storing data securely?

Best practices:

# Remove direct identifiers
data = data.drop(['name', 'ssn', 'email', 'phone'], axis=1)

# Generalize quasi-identifiers
data['age_group'] = pd.cut(data['age'], bins=[0, 18, 30, 50, 100])
data = data.drop('age', axis=1)

# Add noise for privacy (differential privacy)
from diffprivlib import tools
private_mean = tools.mean(data['income'], epsilon=1.0, bounds=(0, 1000000))

2. Consent: Use Data Only as Intended

Questions to ask:

  • Did people consent to this specific use?
  • Is this within the scope of the original purpose?
  • Are you violating any terms of service?

Example - Twitter data:

  • ✅ Aggregate analysis of public tweets
  • ✅ Studying trends and patterns
  • ❌ Publishing usernames with controversial tweets
  • ❌ Identifying individuals without consent
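
For instance, here is a minimal sketch of the "aggregate, don't identify" idea, assuming a DataFrame tweets with hypothetical topic and region columns: report only group-level counts and suppress any group too small to stay anonymous.

import pandas as pd

def aggregate_tweet_counts(tweets, min_group_size=10):
    """Report counts by topic and region, never individual tweets or users."""
    counts = (
        tweets.groupby(['topic', 'region'])
              .size()
              .reset_index(name='n_tweets')
    )
    # Suppress small groups that could single out identifiable individuals
    return counts[counts['n_tweets'] >= min_group_size]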

3. Bias: Recognize and Mitigate Unfairness

Types of bias:

  • Selection bias: Non-representative sampling
  • Measurement bias: Systematic errors in data collection
  • Algorithmic bias: Models that discriminate
  • Confirmation bias: Seeing what you want to see

Checking for bias:

# Check representation across groups
print(data.groupby(['gender', 'race']).size())

# Check model performance across groups
from sklearn.metrics import accuracy_score

for group in data['demographic_group'].unique():
    group_data = data[data['demographic_group'] == group]
    accuracy = accuracy_score(group_data['true_label'], 
                             group_data['predicted_label'])
    print(f"{group}: {accuracy:.3f}")

# Fairness metrics
from fairlearn.metrics import demographic_parity_ratio
dpr = demographic_parity_ratio(y_true, y_pred, sensitive_features=sensitive)
print(f"Demographic parity ratio: {dpr:.3f}")  # Should be close to 1.0

4. Transparency: Be Open About Your Methods

What to document:

  • Data sources and collection methods
  • Preprocessing steps and decisions
  • Model choices and hyperparameters
  • Known limitations
  • Potential biases

Example documentation:

## Data Source
Customer transaction data from XYZ Corp (2020-2025)
- n = 50,000 customers
- Limited to US customers only
- Missing data for 15% of income field

## Known Limitations
- Data is US-centric and may not generalize
- Income is self-reported and may be inaccurate
- Low-income customers may be underrepresented

## Potential Biases
- Selection bias: Only includes customers who completed a profile
- Survivorship bias: Customers who churned before 2020 are excluded
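
A lightweight way to keep this documentation attached to the analysis itself (a minimal sketch; the field names and output path are illustrative) is to write it as a structured "data card" next to your code:

import json

data_card = {
    "source": "Customer transaction data from XYZ Corp (2020-2025)",
    "n_customers": 50000,
    "preprocessing": [
        "Dropped direct identifiers (name, email, phone)",
        "Binned age into age_group",
    ],
    "limitations": [
        "US-centric; may not generalize",
        "Income is self-reported and may be inaccurate",
    ],
    "known_biases": [
        "Selection bias: only customers who completed a profile",
        "Survivorship bias: customers who churned before 2020 are excluded",
    ],
}

# Save alongside the analysis so the documentation travels with the data
with open("data_card.json", "w") as f:
    json.dump(data_card, f, indent=2)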

5. Beneficence: Do No Harm

Consider potential harms:

  • Could this reinforce stereotypes?
  • Could this disadvantage vulnerable groups?
  • Could this be misused?
  • Are benefits distributed fairly?

Example - Predictive policing:

  • ⚠️ May reinforce discriminatory policing patterns
  • ⚠️ Historical data reflects biased enforcement
  • ⚠️ Could create feedback loops
  • ✅ Could improve resource allocation if done carefully
  • ✅ Transparency about limitations is critical

Specific Ethical Scenarios

Scenario 1: Using Social Media Data

Ethical considerations:

# ✅ Good practice
def ethical_twitter_analysis():
    """Analyze public tweets about climate change"""
    # 1. Only public tweets
    # 2. Aggregate analysis, no individual identification
    # 3. Don't republish tweet content verbatim
    # 4. Respect API terms of service
    # 5. Consider power dynamics (who has voice on Twitter?)
    
    tweets = collect_public_tweets(query="climate change")
    
    # Aggregate analysis
    sentiment_by_region = tweets.groupby('region')['sentiment'].mean()
    
    # Remove identifying information before saving a copy of the raw data
    tweets_anon = tweets.drop(['user_id', 'username', 'tweet_id'], axis=1)
    tweets_anon.to_csv('tweets_anonymized.csv', index=False)
    
    return sentiment_by_region

Scenario 2: Health Data Research

Special considerations:

  • HIPAA (US) and similar laws apply
  • Requires IRB approval for human subjects research
  • Extra protections for sensitive information
  • Secure storage required

# Security measures
def secure_health_data_analysis():
    """Handle sensitive health data properly"""
    # 1. Encrypt data at rest
    # 2. Use secure connections
    # 3. Log all access
    # 4. Minimize data retention
    # 5. De-identify before analysis
    
    # Example: k-anonymity check with plain pandas
    # (quasi-identifiers should already be generalized, e.g. age bands, 3-digit ZIPs)
    quasi_identifiers = ['age', 'zip', 'gender']
    anon_data = health_data.drop(columns=['name', 'ssn'], errors='ignore')  # drop direct identifiers
    
    # Every combination of quasi-identifiers should describe at least k=5 people
    group_sizes = anon_data.groupby(quasi_identifiers).size()
    assert (group_sizes >= 5).all(), "Not k-anonymous: generalize quasi-identifiers further"
    
    return anon_data

Scenario 3: Predictive Models for Decision-Making

When models affect people’s lives:

def responsible_model_deployment():
    """Deploy models responsibly"""
    
    # 1. Test for fairness
    from aif360.metrics import BinaryLabelDatasetMetric
    
    metric = BinaryLabelDatasetMetric(
        dataset, 
        privileged_groups=[{'gender': 1}],
        unprivileged_groups=[{'gender': 0}]
    )
    
    print(f"Disparate impact: {metric.disparate_impact()}")
    # Should be close to 1.0 (equal treatment)
    
    # 2. Provide explanations
    from shap import TreeExplainer
    
    explainer = TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    
    # Show why prediction was made
    
    # 3. Allow for human review
    # Flag edge cases for manual review
    confidence = model.predict_proba(X)
    uncertain = confidence.max(axis=1) < 0.7
    
    # 4. Monitor for drift
    # Regularly check if model is still fair and accurate
    
    # 5. Have an appeals process
    # Let people contest automated decisions
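
Steps 4 and 5 above are left as comments. As a minimal sketch of drift monitoring (assuming you keep a reference batch of scores from training time and a recent batch from production, both hypothetical here), you can test whether the score distribution has shifted:

from scipy.stats import ks_2samp

def check_score_drift(reference_scores, current_scores, alpha=0.05):
    """Flag possible drift when the score distribution shifts significantly."""
    # Two-sample Kolmogorov-Smirnov test between training-time and current scores
    statistic, p_value = ks_2samp(reference_scores, current_scores)
    if p_value < alpha:
        print(f"Possible drift (KS={statistic:.3f}, p={p_value:.4f}) - re-check fairness and accuracy")
    return statistic, p_value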

IRB and Research Approval

Many research projects require Institutional Review Board (IRB) approval.

When you need IRB approval:

  • Research involving human subjects
  • Collecting data from people
  • Using identifiable health information
  • Studying vulnerable populations

The IRB process:

  1. Submit research protocol
  2. Explain data collection methods
  3. Describe risks and benefits
  4. Show informed consent process
  5. Wait for approval before starting

Tips for IRB applications:

  • Start early (can take months)
  • Be thorough in documentation
  • Explain how you’ll protect privacy
  • Describe data storage and destruction plans
  • Show you’ve considered risks

Data Sharing Ethics

When sharing research data:

def prepare_data_for_sharing():
    """Prepare data for public sharing"""
    
    # 1. Remove identifiers
    data_clean = data.drop([
        'name', 'address', 'phone', 'email',
        'ssn', 'medical_record_number'
    ], axis=1)
    
    # 2. Generalize sensitive fields
    data_clean['income_bracket'] = pd.cut(
        data['income'],
        bins=[0, 30000, 60000, 100000, np.inf],
        labels=['Low', 'Medium', 'High', 'Very High']
    )
    
    # 3. Add noise to continuous variables, then bin them
    data_clean['age'] += np.random.randint(-2, 3, len(data_clean))
    data_clean['age_bracket'] = pd.cut(data_clean['age'], bins=[0, 18, 30, 50, 100])
    
    # 4. Check for re-identification risk
    # Ensure no combination of quasi-identifiers is rare
    # ('gender' and 'zip_prefix' are example quasi-identifier columns)
    unique_combinations = data_clean.groupby([
        'age_bracket', 'gender', 'zip_prefix'
    ]).size()
    
    assert (unique_combinations >= 5).all(), "Re-identification risk!"
    
    # 5. Document what you did
    create_data_dictionary()
    
    return data_clean
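
The create_data_dictionary() call above is just a placeholder; a minimal version (assuming data_clean is a pandas DataFrame and the output path is illustrative) could summarize each column like this:

import pandas as pd

def create_data_dictionary(df, path="data_dictionary.csv"):
    """Write a simple per-column summary to share alongside the data."""
    dictionary = pd.DataFrame({
        'column': df.columns,
        'dtype': [str(t) for t in df.dtypes],
        'n_missing': df.isna().sum().values,
        'n_unique': df.nunique().values,
    })
    dictionary.to_csv(path, index=False)
    return dictionary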

Include:

  • Clear documentation
  • Data dictionary
  • Limitations and known biases
  • Appropriate license (e.g., CC-BY)
  • Citation information

Algorithmic Fairness

Fairness definitions (pick based on context):

  1. Demographic parity: Equal positive rates across groups
  2. Equal opportunity: Equal true positive rates across groups
  3. Predictive parity: Equal precision across groups
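
As a rough sketch (assuming a predictions DataFrame with binary actual, predicted, and protected_group columns, matching the R example below), all three can be computed per group with scikit-learn:

from sklearn.metrics import precision_score, recall_score

for group, g in predictions.groupby('protected_group'):
    positive_rate = g['predicted'].mean()                     # demographic parity
    tpr = recall_score(g['actual'], g['predicted'])           # equal opportunity
    precision = precision_score(g['actual'], g['predicted'])  # predictive parity
    print(f"{group}: positive rate={positive_rate:.2f}, "
          f"TPR={tpr:.2f}, precision={precision:.2f}")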

The R fairness package offers similar group-level checks:

library(fairness)

# Check equalized odds (error rates) across groups
fairness_check <- equal_odds(
  data    = predictions,
  outcome = 'actual',
  preds   = 'predicted',
  group   = 'protected_group'
)

# Metric values by group
print(fairness_check$Metric)

# Built-in visualization of disparities across groups
fairness_check$Metric_plot

Your Ethics Checklist

Before starting research:

  • ✅ IRB approval obtained (if needed)
  • ✅ Data collection is ethical and legal
  • ✅ Informed consent obtained (if needed)
  • ✅ Privacy protections in place
  • ✅ Potential biases identified
  • ✅ Fairness metrics considered
  • ✅ Limitations documented
  • ✅ Benefits outweigh risks
  • ✅ Vulnerable populations protected
  • ✅ Plan for responsible sharing

Key Takeaways

  1. Privacy matters: Protect people’s information
  2. Recognize bias: In data, models, and yourself
  3. Be transparent: Document everything
  4. Do no harm: Consider negative impacts
  5. Get approval: Follow proper procedures
  6. Test fairness: Check for discriminatory outcomes
  7. Think critically: Question your assumptions
  8. Stay informed: Ethics evolves with technology

Remember: Good ethics is good science. Take the time to do it right!


Do research that makes the world better! 🌍✨