DataGators 👋

Help With Data Science at Allegheny College!

Exploratory Steps With Synthetic Data

Lesson: Introduction to R Programming and Data Science

Overview

In this lesson, we will introduce you to R programming by creating a synthetic dataset, calculating basic correlations, and visualizing the data with several plots. After each code block, we’ll explain how the code works step by step.

Prerequisites

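Before we begin, make sure that you have R installed. You can download R from CRAN and use RStudio as an IDE to write and execute R code. The code in this lesson also uses the tibble and ggplot2 packages, which can be installed from within R using install.packages().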
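
As a quick check before starting, the following sketch reports which of the packages used in this lesson (tibble and ggplot2) are not yet installed. The variable names are illustrative, and the snippet only prints a suggestion rather than installing anything on its own.

```r
# Packages used in this lesson
required <- c("tibble", "ggplot2")

# requireNamespace() returns TRUE when a package is installed and can be loaded
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!installed]

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
  message("Install each one with install.packages(), for example install.packages(\"tibble\")")
} else {
  message("All required packages are installed.")
}
```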


Step 1: Creating a Synthetic Dataset

First, we’ll create a synthetic dataset that simulates a simple survey of people’s ages, heights, weights, and annual income. We will store this data in a data frame, which is a fundamental data structure in R.

# Load the necessary library
library(tibble)

# Create a synthetic dataset
set.seed(123)  # Set a seed for reproducibility

data <- tibble(
  # Random ages between 20 and 60
  Age = sample(20:60, 100, replace = TRUE),  
  # Heights with a mean of 170 cm and SD of 10
  Height = rnorm(100, mean = 170, sd = 10),  
  # Weights with a mean of 70 kg and SD of 15
  Weight = rnorm(100, mean = 70, sd = 15),   
  # Income with a mean of 50,000 and SD of 15,000
  Income = rnorm(100, mean = 50000, sd = 15000)
)

# View the first few rows of the data
head(data)

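Explanation:

  1. library(tibble) loads the tibble package, which provides a modern version of the data frame.
  2. set.seed(123) fixes the state of the random number generator, so the same values are produced on every run.
  3. sample(20:60, 100, replace = TRUE) draws 100 whole-number ages from 20 through 60, allowing repeats.
  4. Each rnorm() call draws 100 values from a normal distribution with the given mean and standard deviation. The columns are generated independently of one another.
  5. head(data) prints the first six rows so that we can check the result.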
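
To get a feel for the structure of a data frame, a few base R functions are useful. The sketch below rebuilds the same four columns with base R's data.frame() so that it runs on its own; the name survey is an assumption made for this example, and the same calls also work on the tibble created in Step 1.

```r
set.seed(123)  # Same seed as in the lesson, for reproducibility

# Stand-in for the lesson's dataset, built with base R
# (assumption: "survey" is a name chosen for this example)
survey <- data.frame(
  Age = sample(20:60, 100, replace = TRUE),
  Height = rnorm(100, mean = 170, sd = 10),
  Weight = rnorm(100, mean = 70, sd = 15),
  Income = rnorm(100, mean = 50000, sd = 15000)
)

dim(survey)      # Number of rows and columns
str(survey)      # Column names, types, and a preview of the values
summary(survey)  # Minimum, quartiles, mean, and maximum of each column

# A single column can be pulled out with $ and used like any other vector
mean(survey$Height)
range(survey$Age)
```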


Step 2: Calculating Basic Correlations

Now, let’s calculate the correlation between different columns (variables) in the dataset. Correlation tells us the strength and direction of the relationship between two variables.

# Calculate correlations between numeric columns
correlation_matrix <- cor(
  data[, c("Age", "Height", "Weight", "Income")]
)

# View the correlation matrix
print(correlation_matrix)

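Explanation:

  1. data[, c(...)] selects the four numeric columns by name.
  2. cor() computes the Pearson correlation coefficient for every pair of columns and returns a 4 x 4 matrix.
  3. Each value lies between -1 and 1. Values near 1 or -1 indicate a strong positive or negative linear relationship, and values near 0 indicate little or no linear relationship. The diagonal is always 1 because every variable is perfectly correlated with itself.
  4. Because each column was generated independently, the off-diagonal values should be close to zero.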
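
To see what cor() computes, the Pearson coefficient can be worked out by hand for one pair of variables. This is a minimal sketch using two freshly generated vectors (x and y are illustrative names, not columns from the lesson): the sum of products of deviations is divided by the product of the two spreads.

```r
set.seed(123)

# Two independent samples, similar in spirit to Height and Weight
x <- rnorm(100, mean = 170, sd = 10)
y <- rnorm(100, mean = 70, sd = 15)

# Deviations of each value from its mean
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson correlation: sum of products over the product of the spreads
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_manual
cor(x, y)  # Built-in result, should agree with r_manual

# A variable is perfectly correlated with any increasing linear function of itself
cor(x, 2 * x + 5)
```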


Step 3: Creating Plots to Visualize the Data

Visualization is key in data science to explore relationships between variables. We will now generate a few plots.

1. Scatter Plot: Age vs. Income

# Load the ggplot2 library for data visualization
library(ggplot2)

# Create a scatter plot of Age vs. Income
ggplot(data, aes(x = Age, y = Income)) +
  geom_point() +
  labs(
    title = "Scatter Plot of Age vs. Income",
    x = "Age",
    y = "Income"
  )

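Explanation:

  1. ggplot(data, aes(x = Age, y = Income)) starts a plot and maps Age to the x-axis and Income to the y-axis.
  2. geom_point() draws one point for each row of the dataset.
  3. labs() sets the plot title and the axis labels.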
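
A fitted straight line can make the presence or absence of a trend easier to judge. The sketch below is one possible extension, not part of the original lesson: it rebuilds two columns under the hypothetical name survey, fits a line with lm(), and overlays it with geom_smooth().

```r
library(ggplot2)

set.seed(123)

# Stand-in data (assumption: "survey" is a name chosen for this example)
survey <- data.frame(
  Age = sample(20:60, 100, replace = TRUE),
  Income = rnorm(100, mean = 50000, sd = 15000)
)

# Fit Income as a linear function of Age and inspect intercept and slope
fit <- lm(Income ~ Age, data = survey)
coef(fit)

# Scatter plot with the fitted line drawn on top
p <- ggplot(survey, aes(x = Age, y = Income)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Scatter Plot of Age vs. Income with Linear Trend",
    x = "Age",
    y = "Income"
  )

print(p)
```

Because Age and Income are generated independently here, the fitted line should be close to flat.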

2. Histogram: Distribution of Height

# Create a histogram of Height
ggplot(data, aes(x = Height)) +
  geom_histogram(
    binwidth = 2,
    fill = "blue",
    color = "black",
    alpha = 0.7) +
  labs(
    title = "Histogram of Height",
    x = "Height (cm)",
    y = "Frequency")

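Explanation:

  1. aes(x = Height) maps Height to the x-axis. The y-axis shows the number of observations in each bin, which ggplot2 counts for us.
  2. geom_histogram(binwidth = 2) groups the heights into bins that are 2 cm wide and draws one bar per bin. The fill argument sets the bar color, color sets the outline, and alpha = 0.7 makes the bars slightly transparent.
  3. labs() sets the plot title and the axis labels.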
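
The effect of binwidth can be checked numerically. As a rough sketch, base R's hist() with plot = FALSE returns the bin counts without drawing anything. The vector heights and the break points are assumptions for this example, and ggplot2 may place its bin edges differently, so individual bar heights need not match the plot exactly.

```r
set.seed(123)

# Stand-in for the Height column
heights <- rnorm(100, mean = 170, sd = 10)

# Bin edges every 2 cm, covering the full range of the data
breaks_2cm <- seq(floor(min(heights)), ceiling(max(heights)) + 2, by = 2)
bins_2cm <- hist(heights, breaks = breaks_2cm, plot = FALSE)

# Wider bins (5 cm) for comparison
breaks_5cm <- seq(floor(min(heights)), ceiling(max(heights)) + 5, by = 5)
bins_5cm <- hist(heights, breaks = breaks_5cm, plot = FALSE)

length(bins_2cm$counts)  # Number of 2 cm bins
length(bins_5cm$counts)  # Fewer bins when each one is wider

# Every observation falls in exactly one bin, whatever the width
sum(bins_2cm$counts)
sum(bins_5cm$counts)
```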

3. Boxplot: Weight Distribution by Age Group

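# Assign each person to an age group
data$AgeGroup <- ifelse(
  data$Age < 30, "Under 30",
  ifelse(data$Age <= 40, "30-40", "Over 40")
)

# Order the groups from youngest to oldest (the default order is alphabetical)
data$AgeGroup <- factor(
  data$AgeGroup,
  levels = c("Under 30", "30-40", "Over 40")
)

# Create a boxplot of Weight by Age Group
ggplot(data, aes(x = AgeGroup, y = Weight, fill = AgeGroup)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of Weight by Age Group",
    x = "Age Group",
    y = "Weight (kg)"
  ) +
  theme_minimal()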

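Explanation:

  1. The nested ifelse() calls label each person as "Under 30", "30-40", or "Over 40" and store the label in a new AgeGroup column.
  2. geom_boxplot() draws one box per group. Each box spans the first to the third quartile, with a line at the median. The whiskers extend to the most extreme values within 1.5 times the interquartile range, and points beyond that are drawn individually.
  3. fill = AgeGroup colors each box by its group, and theme_minimal() applies a lighter theme with less background decoration.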
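
The numbers behind each box can also be computed directly. This sketch uses stand-in vectors (age, weight, and age_group are illustrative names) and cut() as an alternative to nested ifelse() for forming the groups. It then computes the quartiles that define the lower edge, middle line, and upper edge of each box.

```r
set.seed(123)

# Stand-ins for the Age and Weight columns
age <- sample(20:60, 100, replace = TRUE)
weight <- rnorm(100, mean = 70, sd = 15)

# cut() assigns each age to an interval: up to 29, 30 to 40, and above 40
age_group <- cut(
  age,
  breaks = c(-Inf, 29, 40, Inf),
  labels = c("Under 30", "30-40", "Over 40")
)

table(age_group)  # How many people fall in each group

# Quartiles of weight within each group (one column per group)
quartiles <- sapply(
  split(weight, age_group),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)
quartiles
```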


Step 4: Summarizing the Results

We have completed a few basic tasks:

  1. We created a synthetic dataset and explored its structure.
  2. We calculated correlations between the numeric variables to understand their relationships.
  3. We created visualizations using ggplot2 to explore the data further.

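The following key observations could be made:

  1. The correlations between different variables are expected to be close to zero, because each column was generated independently of the others.
  2. The histogram of Height should look roughly bell-shaped and centered near 170 cm, since the heights were drawn from a normal distribution with that mean.
  3. The boxplots of Weight should look similar across the age groups, since Weight was generated without reference to Age.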


Conclusion

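In this lesson, you learned how to:

  1. Create a synthetic dataset with tibble and reproducible random numbers.
  2. Calculate a correlation matrix with cor().
  3. Build a scatter plot, a histogram, and a boxplot with ggplot2.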