Setting Up Your Data Science Environment

Welcome to Your First Data Science Setup!

Hey! So you’re ready to start doing data science research, but your computer isn’t quite ready yet. No worries – we’ll get you set up with everything you need. This might take an hour or two, but once you’re done, you’ll have a professional-grade research environment.

What we’ll install:

  • Python (the most popular language for data science)
  • R (another great option, especially for statistics)
  • Jupyter Notebooks (for interactive coding)
  • VS Code (a powerful text editor)
  • Git (for version control)
  • Essential packages and libraries

Don’t worry if these names are unfamiliar – by the end, you’ll know what they all do!


Step 1: Installing Python

Python is the most widely-used language in data science. Let’s get it on your computer.

For Mac Users

The easiest way is to use Homebrew, a package manager for Mac:

# First, install Homebrew (if you don't have it)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Then install Python
brew install python@3.11

For Windows Users

  1. Go to python.org/downloads
  2. Download Python 3.11 or newer
  3. Important: Check “Add Python to PATH” during installation!
  4. Click “Install Now”

For Linux Users

# Ubuntu/Debian
sudo apt update
sudo apt install python3.11 python3-pip

# Fedora
sudo dnf install python3.11 python3-pip

Verify Your Installation

Open a terminal (Mac/Linux) or Command Prompt (Windows) and type:

python3 --version

You should see something like Python 3.11.x. Success! 🎉


Step 2: Installing R

R is fantastic for statistical analysis and has some unique packages Python doesn’t have.

All Operating Systems

  1. Go to r-project.org
  2. Click “download R”
  3. Choose a mirror near you
  4. Download for your operating system
  5. Install like any other application

RStudio makes working with R much easier:

  1. Go to posit.co/downloads
  2. Download RStudio Desktop (free version)
  3. Install it

Verify Installation

Open RStudio or a terminal and type:

R --version

You should see your R version. Great work!


Step 3: Setting Up Virtual Environments

Virtual environments keep your projects separate and organized. Think of them as separate workspaces for different projects.

Python Virtual Environments

# Install virtualenv
pip3 install virtualenv

# Create a new environment for your project
cd ~/my-research-project
python3 -m venv venv

# Activate it (Mac/Linux)
source venv/bin/activate

# Activate it (Windows)
venv\Scripts\activate

# Your prompt should now show (venv)

When you’re done working:

deactivate

Conda is another popular option that works great for data science:

# Download Miniconda from conda.io/miniconda.html
# Then create an environment:

conda create -n myresearch python=3.11
conda activate myresearch

# Install packages
conda install pandas numpy scikit-learn matplotlib

Step 4: Installing Essential Python Packages

With your environment activated, install these essential packages:

# Data manipulation and analysis
pip install pandas numpy scipy

# Visualization
pip install matplotlib seaborn plotly

# Machine learning
pip install scikit-learn

# Jupyter notebooks
pip install jupyter jupyterlab

# Data loading
pip install openpyxl xlrd requests beautifulsoup4

# Statistics
pip install statsmodels

Or install them all at once:

pip install pandas numpy scipy matplotlib seaborn plotly scikit-learn jupyter jupyterlab statsmodels requests beautifulsoup4

Step 5: Installing Essential R Packages

Open RStudio and run:

# Data manipulation
install.packages("tidyverse")  # This includes dplyr, ggplot2, and more!

# Additional useful packages
install.packages(c(
  "data.table",    # Fast data manipulation
  "caret",         # Machine learning
  "readr",         # Reading data
  "readxl",        # Excel files
  "jsonlite",      # JSON data
  "httr",          # HTTP requests
  "rvest"          # Web scraping
))

This might take a few minutes. Grab a coffee! ☕


Step 6: Installing Jupyter Notebook/Lab

Jupyter lets you write code in an interactive notebook format – perfect for research!

# Already installed if you followed Step 4
# To launch Jupyter Lab:
jupyter lab

# Or classic Jupyter Notebook:
jupyter notebook

Your browser should open automatically. You’re now in Jupyter! 🎉

Create Your First Notebook

  1. Click “New” → “Notebook”
  2. Choose Python 3
  3. In the first cell, type: print("Hello, research world!")
  4. Press Shift + Enter to run

Step 7: Installing VS Code

VS Code is a powerful, free editor that works great for Python, R, and more.

Installation

  1. Go to code.visualstudio.com
  2. Download for your operating system
  3. Install it

Essential Extensions

Open VS Code and install these extensions (click the Extensions icon on the left):

  • Python (by Microsoft)
  • Jupyter (by Microsoft)
  • R (by REditorSupport)
  • GitLens (for Git integration)
  • Markdown All in One (for documentation)

Step 8: Installing Git

Git tracks changes to your code and helps you collaborate.

For Mac

brew install git

For Windows

Download from git-scm.com and install.

For Linux

# Ubuntu/Debian
sudo apt install git

# Fedora
sudo dnf install git

Configure Git

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

Create a GitHub Account

Go to github.com and sign up. You’ll use this to share code and collaborate.


Step 9: Test Your Setup

Let’s make sure everything works together!

Create a Test Project

# Create a directory
mkdir ~/data-science-test
cd ~/data-science-test

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Install packages
pip install pandas matplotlib jupyter

# Start Jupyter
jupyter notebook

Run This Test Code

Create a new notebook and run:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Create sample data
data = pd.DataFrame({
    'x': range(10),
    'y': np.random.randn(10)
})

# Make a plot
plt.figure(figsize=(8, 6))
plt.plot(data['x'], data['y'], marker='o')
plt.title('Test Plot - Your Setup Works! 🎉')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.show()

print("Success! Your environment is ready for research!")

If you see a plot, you’re all set! 🎊


Bonus: Working with Computing Clusters

Many universities provide computing clusters for bigger analyses.

SSH Setup

# Generate SSH key
ssh-keygen -t ed25519 -C "your.email@example.com"

# Copy public key to clipboard (Mac)
cat ~/.ssh/id_ed25519.pub | pbcopy

# Then add it to your cluster account

Connect to Cluster

ssh username@cluster.university.edu

Transfer Files

# Upload to cluster
scp mydata.csv username@cluster.university.edu:~/

# Download from cluster
scp username@cluster.university.edu:~/results.csv ./

Keeping Your Environment Organized

Save Your Package List

For Python:

pip freeze > requirements.txt

To recreate on another computer:

pip install -r requirements.txt

For R, use renv:

install.packages("renv")
renv::init()      # Initialize
renv::snapshot()  # Save current packages
renv::restore()   # Restore saved packages

For Conda:

conda env export > environment.yml
conda env create -f environment.yml

Troubleshooting Common Issues

“Command not found”

Make sure you’ve added the program to your PATH. On Mac/Linux:

echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc

Python version conflicts

Use virtual environments! They keep different projects separate.

Package installation fails

Try:

pip install --upgrade pip
pip install --user package-name

Jupyter kernel issues

python -m ipykernel install --user --name=myenv

Your Setup Checklist

Before moving on, make sure you have:

  • ✅ Python 3.11+ installed and working
  • ✅ R and RStudio installed
  • ✅ Can create and activate virtual environments
  • ✅ Essential packages installed (pandas, matplotlib, etc.)
  • ✅ Jupyter Notebook/Lab running
  • ✅ VS Code installed with extensions
  • ✅ Git installed and configured
  • ✅ GitHub account created
  • ✅ Successfully ran the test code

Next Steps

Now that your environment is ready:

  1. Finding Data for Your Research Project - Learn where to get data
  2. Writing Clean, Reproducible Code - Best practices for research code
  3. Exploratory Data Analysis - Start analyzing data

Quick Reference: Common Commands

# Python virtual environment
python3 -m venv venv
source venv/bin/activate  # Mac/Linux
venv\Scripts\activate     # Windows
deactivate

# Conda
conda create -n myenv python=3.11
conda activate myenv
conda deactivate
conda list  # See installed packages

# Jupyter
jupyter lab
jupyter notebook

# Git basics
git init
git add .
git commit -m "Message"
git push

Congratulations! 🎉 You now have a professional data science research environment. This might have felt like a lot, but you’ve just built the foundation for all your future research. Everything gets easier from here!

Ready to find some data? Check out the Finding Data tutorial!