17 📝 Intermediate Exam Simulation - Healthcare Management

This is a comprehensive simulation of the intermediate exam covering all major statistical topics applied to healthcare management scenarios. Complete all 20 questions to test your knowledge and prepare for the actual exam.

Instructions: Answer all questions carefully. Solutions are provided at the end of this chapter.

17.1 🏥 Question 1: Data Wrangling - Reshaping Data

Exercise 17.1 You have a patient_data dataframe with columns: patient_id, emergency_visits, and outpatient_visits. Reshape the data into a “long” format with three columns: patient_id, visit_type (containing “emergency_visits” or “outpatient_visits”), and visit_count. Write the R command using the tidyr package.

Sample data:

set.seed(123)
patient_data <- data.frame(
  patient_id = 1:10,
  emergency_visits = rpois(10, 2),
  outpatient_visits = rpois(10, 5)
)

17.2 🏥 Question 2: Data Wrangling - Complex Filtering

Exercise 17.2 From a patient_satisfaction dataset with columns patient_id, department (e.g., “Cardiology”, “Orthopedics”, “Emergency”), satisfaction_score, and treatment_cost, write a dplyr pipeline to create a summary table showing the number of patients, mean satisfaction score, and total treatment cost for each department. Show only departments with more than 50 patients AND mean satisfaction above 4.0.

Sample data:

set.seed(456)
patient_satisfaction <- data.frame(
  patient_id = 1:200,
  department = sample(c("Cardiology", "Orthopedics", "Emergency", "Pediatrics", "Surgery"), 200, replace = TRUE),
  satisfaction_score = sample(1:5, 200, replace = TRUE, prob = c(0.1, 0.1, 0.2, 0.3, 0.3)),
  treatment_cost = rnorm(200, 5000, 1500)
)

17.3 🏥 Question 3: Data Wrangling - Advanced Calculations

Exercise 17.3 A hospital needs to calculate physician bonuses based on performance. Given a physician_data dataframe with columns physician_id, patient_satisfaction_score, years_experience, and specialty, create a new column bonus_amount that calculates 2% of satisfaction score per year of experience, with a maximum bonus of 8000 and a minimum bonus of 1000 for physicians with less than 3 years experience. Additionally, add a 15% specialty bonus for physicians in “Emergency Medicine” due to higher risk. Write the complete dplyr pipeline.

Sample data:

set.seed(789)
physician_data <- data.frame(
  physician_id = 1:100,
  patient_satisfaction_score = rnorm(100, 85, 10),
  years_experience = sample(1:20, 100, replace = TRUE),
  specialty = sample(c("Emergency Medicine", "Cardiology", "Orthopedics", "Pediatrics"), 100, replace = TRUE)
)

17.4 🏥 Question 4: Data Wrangling - Complex Reshaping

Exercise 17.4 From a patient_monitoring dataset with columns patient_id, measurement_date, pain_level, mobility_score, and measurement_time (with values “morning” or “evening”), filter to include only patients who have taken both types of measurements, and calculate the difference in their pain levels between these two measurement times. Also calculate the correlation between morning and evening pain levels. Write the complete dplyr pipeline.

Sample data:

set.seed(321)
patient_monitoring <- data.frame(
  patient_id = rep(1:50, each = 2),
  measurement_date = rep(seq.Date(as.Date("2024-01-01"), by = "day", length.out = 50), each = 2),
  measurement_time = rep(c("morning", "evening"), 50),
  pain_level = rnorm(100, 5, 1.5),
  mobility_score = rnorm(100, 7, 1.2)
)

17.5 🏥 Question 5: Data Wrangling - Advanced Grouping

Exercise 17.5 You have hospital performance data in long format. The hospital_performance dataframe contains hospital_id, quarter (with values “Q1”, “Q2”, “Q3”, “Q4”), patient_volume, and region. Reshape this data into wide format so that each hospital has one row with separate columns for each quarter (volume_Q1, volume_Q2, volume_Q3, volume_Q4). Then calculate the quarterly growth rate for each hospital (Q4 vs Q1). Write the complete pipeline.

Sample data:

set.seed(654)
hospital_performance <- data.frame(
  hospital_id = rep(1:30, each = 4),
  quarter = rep(c("Q1", "Q2", "Q3", "Q4"), 30),
  patient_volume = rnorm(120, 1000, 200),
  region = rep(sample(c("North", "South", "East", "West"), 30, replace = TRUE), each = 4)
)

17.6 🔬 Question 6: Hypothesis Testing - Single Population

Exercise 17.6 A hospital claims that their new patient management system reduces average waiting time by 15 minutes. In a pilot study with 45 patients, the mean reduction was 18 minutes with a standard deviation of 6 minutes. Test at α = 0.05 if the actual reduction is significantly greater than the claimed 15 minutes. Write the R command and interpret the results.

Sample data:

set.seed(123)
waiting_time_reduction <- rnorm(45, mean = 18, sd = 6)

17.7 🔬 Question 7: Hypothesis Testing - Proportions

Exercise 17.7 A hospital claims that 35% of patients are satisfied with their emergency department experience. In a sample of 400 patients, only 120 reported satisfaction. Write the R command to test if the true satisfaction rate is significantly lower than 35%. Calculate the 95% confidence interval for the true proportion and interpret both results.

17.8 🔬 Question 8: Hypothesis Testing - Two Populations

Exercise 17.8 A hospital is testing a new training program for nurses. The same 25 nurses took a patient safety competency test before and after the training. Test if the training program significantly improved the test scores. Write the appropriate R command and state whether this should be a paired or unpaired test.

Sample data:

set.seed(456)
nurse_data <- data.frame(
  nurse_id = 1:25,
  before = rnorm(25, 72, 5),
  after = rnorm(25, 78, 5)
)

17.9 🔬 Question 9: Hypothesis Testing - Two Proportions

Exercise 17.9 Two hospitals are comparing their patient satisfaction rates. Hospital A reports that 180 out of 250 patients (72%) were satisfied, while Hospital B reports that 210 out of 280 patients (75%) were satisfied. Write the R command to test if there is a significant difference in satisfaction rates between the two hospitals. Calculate the effect size and interpret the results.

17.10 🔬 Question 10: Hypothesis Testing - Advanced Concepts

Exercise 17.10 A study compares the patient satisfaction scores between two different hospital wards. Before running a t-test, you perform a Levene’s test for equality of variances and get a p-value of 0.02. Which type of two-sample t-test should you use and why? Write the R command for the appropriate t-test, assuming you have a patient_satisfaction dataframe with score and ward columns.

Sample data:

set.seed(789)
patient_satisfaction <- data.frame(
  score = c(rnorm(30, 4.2, 0.8), rnorm(35, 4.0, 1.2)),
  ward = rep(c("Ward_A", "Ward_B"), c(30, 35))
)

17.11 📊 Question 11: ANOVA - Basic Analysis

Exercise 17.11 A clinical trial tests three different pain management protocols (Protocol A, Protocol B, Protocol C) on different patients. The pain reduction scores are stored in a variable called pain_reduction and the treatment group is stored in protocol in a dataframe called pain_study. Write the R command to perform ANOVA and display the results. Also test the assumptions.

Sample data:

set.seed(321)
pain_study <- data.frame(
  pain_reduction = c(rnorm(20, 6.5, 1.2), rnorm(20, 7.2, 1.5), rnorm(20, 5.8, 1.0)),
  protocol = rep(c("Protocol_A", "Protocol_B", "Protocol_C"), each = 20)
)

17.12 📊 Question 12: ANOVA - Post-hoc Analysis

Exercise 17.12 After performing ANOVA and finding a significant result (p = 0.003), you want to perform post-hoc pairwise comparisons to identify which specific protocols differ. Your ANOVA model is stored in the variable anova_model. Write the R command to perform Tukey’s Honest Significant Difference (HSD) test and interpret the results.

Sample ANOVA output:

            Df Sum Sq Mean Sq F value Pr(>F)  
protocol     2  245.6   122.8   5.234 0.003 **
Residuals   57 1337.2    23.5                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

17.13 📊 Question 13: ANOVA - Assumptions Testing

Exercise 17.13 Before conducting ANOVA, you want to test the homogeneity of variances assumption using Levene’s test. Your data is in a dataframe called patient_data with a continuous variable recovery_time and a grouping variable hospital. Write the R command using the car package. Also test for normality of residuals and interpret both results.

Sample data:

set.seed(654)
patient_data <- data.frame(
  recovery_time = c(rnorm(25, 10, 2), rnorm(25, 12, 3), rnorm(25, 8, 1.5)),
  hospital = rep(c("Hospital_A", "Hospital_B", "Hospital_C"), each = 25)
)

17.14 📈 Question 14: Linear Regression - Multiple Regression

Exercise 17.14 You have a patient_data dataframe with variables length_of_stay (dependent variable), age, BMI, and comorbidity_count (independent variables). Write the R command to fit a multiple linear regression model, check for multicollinearity using VIF, and display the summary. Interpret the results.

Sample data:

set.seed(987)
patient_data <- data.frame(
  age = rnorm(100, 65, 15),
  BMI = rnorm(100, 28, 5),
  comorbidity_count = sample(0:5, 100, replace = TRUE),
  length_of_stay = 5 + 0.05 * age + 0.2 * BMI + 1.5 * comorbidity_count + rnorm(100, 0, 2)
)

17.15 📈 Question 15: Linear Regression - Model Prediction and Classification

Exercise 17.15 A hospital has developed a regression model to predict length of stay (LOS) based on patient age, BMI, and number of comorbidities. The model output is:

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)    
(Intercept)   3.250     0.450       7.22     < 0.001 ***
age           0.045     0.012       3.75     0.0003 ***
BMI           0.120     0.025       4.80     < 0.001 ***
comorbidities 1.450     0.180       8.06     < 0.001 ***

Using this model, predict the length of stay for a 65-year-old patient with BMI of 28 and 2 comorbidities. Then, classify this patient as “Short Stay” (< 7 days) or “Long Stay” (≥ 7 days). Show your calculations.

17.16 📈 Question 16: Linear Regression - Practical Prediction

Exercise 17.16 A hospital has fitted a multiple regression model to predict patient readmission risk score (scale 0-100) based on the following variables:

Sample data:

set.seed(456)
patient_data <- data.frame(
  age = c(45, 62, 71, 55, 68, 58, 75, 63, 52, 70),
  length_of_stay = c(3, 7, 9, 4, 8, 5, 10, 6, 4, 9),
  comorbidities = c(0, 2, 3, 1, 2, 1, 4, 2, 0, 3),
  readmission_risk = c(25, 55, 72, 35, 60, 40, 78, 50, 28, 68)
)

# Fit the model
risk_model <- lm(readmission_risk ~ age + length_of_stay + comorbidities, 
                 data = patient_data)

Task: 1. Fit the model and display the summary 2. Predict the readmission risk for a new 60-year-old patient with 6-day length of stay and 2 comorbidities 3. Based on the prediction, classify the patient as “Low Risk” (< 40), “Medium Risk” (40-60), or “High Risk” (> 60) 4. Write the complete R code

17.17 🎯 Question 17: Logistic Regression - Model Fitting

Exercise 17.17 You want to predict the probability of hospital readmission (binary variable: 1=readmitted, 0=not readmitted) based on age, diabetes status, and length of stay. The data is in patient_data. Write the R command to fit a logistic regression model, calculate odds ratios, and interpret the results.

Sample data:

set.seed(456)
patient_data <- data.frame(
  age = rnorm(200, 65, 15),
  diabetes_status = sample(0:1, 200, replace = TRUE),
  length_of_stay = rnorm(200, 7, 3),
  readmission = sample(0:1, 200, replace = TRUE, prob = c(0.8, 0.2))
)

17.18 🎯 Question 18: Logistic Regression - Prediction and Interpretation

Exercise 17.18 Using the logistic regression model from Question 17, predict the probability of readmission for a 75-year-old patient with diabetes and a 10-day length of stay. Also calculate the odds ratio for diabetes status and interpret what it means in practical terms.

Sample model:

# Using the model from Question 17
new_patient <- data.frame(age = 75, diabetes_status = 1, length_of_stay = 10)

17.19 🏷️ Question 19: Dummy Variables - Categorical Regression

Exercise 17.19 You have a clinical_data dataframe with recovery_time (continuous), age (continuous), and treatment_type (categorical: A, B, C). Write the R command to fit a linear regression model including treatment_type as a factor. Set treatment A as the reference category and interpret the coefficients.

Sample data:

set.seed(789)
clinical_data <- data.frame(
  age = rnorm(90, 65, 10),
  treatment_type = rep(c("A", "B", "C"), each = 30),
  recovery_time = c(rnorm(30, 15, 3), rnorm(30, 12, 2.5), rnorm(30, 18, 4))
)

17.20 🏷️ Question 20: Dummy Variables - Interaction Effects

Exercise 17.20 You want to test if the effect of age on patient_satisfaction differs between patients with and without chronic conditions. Write the R command to fit a regression model that includes an interaction term between age (continuous) and chronic_condition (binary dummy variable). Interpret the interaction coefficient if it is significant.

Sample data:

set.seed(321)
patient_data <- data.frame(
  age = rnorm(150, 55, 12),
  chronic_condition = sample(0:1, 150, replace = TRUE),
  patient_satisfaction = 7 + 0.02 * age + 0.5 * chronic_condition + 
                         0.01 * age * chronic_condition + rnorm(150, 0, 1)
)

17.21 ✅ SOLUTIONS

Answer to Question 1:

library(tidyr)
library(dplyr)

# Create sample data
set.seed(123)
patient_data <- data.frame(
  patient_id = 1:10,
  emergency_visits = rpois(10, 2),
  outpatient_visits = rpois(10, 5)
)

# Reshape to long format
patient_data_long <- patient_data %>%
  pivot_longer(
    cols = c(emergency_visits, outpatient_visits), 
    names_to = "visit_type", 
    values_to = "visit_count"
  )

print(patient_data_long)

Explanation: The pivot_longer() function transforms the wide format (separate columns for emergency_visits and outpatient_visits) into long format (single visit_count column with visit_type indicating the source).

Answer to Question 2:

library(dplyr)

# Create sample data
set.seed(456)
patient_satisfaction <- data.frame(
  patient_id = 1:200,
  department = sample(c("Cardiology", "Orthopedics", "Emergency", "Pediatrics", "Surgery"), 200, replace = TRUE),
  satisfaction_score = sample(1:5, 200, replace = TRUE, prob = c(0.1, 0.1, 0.2, 0.3, 0.3)),
  treatment_cost = rnorm(200, 5000, 1500)
)

# Solution
result <- patient_satisfaction %>%
  group_by(department) %>%
  summarise(
    n_patients = n(),
    mean_satisfaction = mean(satisfaction_score, na.rm = TRUE),
    total_cost = sum(treatment_cost, na.rm = TRUE)
  ) %>%
  filter(n_patients > 50 & mean_satisfaction > 4.0)

print(result)

Explanation: The pipeline groups by department, calculates summary statistics, then filters for departments with more than 50 patients AND mean satisfaction above 4.0.

Answer to Question 3:

library(dplyr)

# Create sample data
set.seed(789)
physician_data <- data.frame(
  physician_id = 1:100,
  patient_satisfaction_score = rnorm(100, 85, 10),
  years_experience = sample(1:20, 100, replace = TRUE),
  specialty = sample(c("Emergency Medicine", "Cardiology", "Orthopedics", "Pediatrics"), 100, replace = TRUE)
)

# Solution
physician_data <- physician_data %>%
  mutate(
    # Base bonus calculation
    bonus_amount = pmin(patient_satisfaction_score * 0.02 * years_experience, 8000),
    # Minimum bonus for physicians with < 3 years experience
    bonus_amount = ifelse(years_experience < 3, pmax(bonus_amount, 1000), bonus_amount),
    # Specialty bonus for Emergency Medicine
    bonus_amount = ifelse(specialty == "Emergency Medicine", bonus_amount * 1.15, bonus_amount)
  )

head(physician_data)

Explanation: The solution uses pmin() for maximum cap, pmax() for minimum bonus, and conditional logic for specialty bonuses. Emergency Medicine physicians get a 15% bonus on top of the calculated amount due to higher risk.

Answer to Question 4:

library(dplyr)
library(tidyr)

# Create sample data
set.seed(321)
patient_monitoring <- data.frame(
  patient_id = rep(1:50, each = 2),
  measurement_date = rep(seq.Date(as.Date("2024-01-01"), by = "day", length.out = 50), each = 2),
  measurement_time = rep(c("morning", "evening"), 50),
  pain_level = rnorm(100, 5, 1.5),
  mobility_score = rnorm(100, 7, 1.2)
)

# Solution
pain_analysis <- patient_monitoring %>%
  group_by(patient_id) %>%
  filter(n_distinct(measurement_time) == 2) %>%
  select(patient_id, measurement_time, pain_level) %>%
  pivot_wider(
    names_from = measurement_time,
    values_from = pain_level
  ) %>%
  mutate(pain_diff = morning - evening)

# Calculate correlation
correlation <- cor(pain_analysis$morning, pain_analysis$evening, use = "complete.obs")

print(pain_analysis)
print(paste("Correlation between morning and evening pain levels:", round(correlation, 3)))

Explanation: The solution filters for patients with both measurement times, reshapes to wide format, calculates the difference, and computes the correlation between morning and evening pain levels.

Answer to Question 5:

library(dplyr)
library(tidyr)

# Create sample data
set.seed(654)
performance_data <- data.frame(
  employee_id = rep(1:30, each = 4),
  quarter = rep(c("Q1", "Q2", "Q3", "Q4"), 30),
  sales_amount = rnorm(120, 10000, 2000),
  region = rep(sample(c("North", "South", "East", "West"), 30, replace = TRUE), each = 4)
)

# Solution
performance_wide <- performance_data %>%
  pivot_wider(
    names_from = quarter,
    values_from = sales_amount,
    names_prefix = "sales_"
  ) %>%
  mutate(
    quarterly_growth_rate = (sales_Q4 - sales_Q1) / sales_Q1 * 100
  )

head(performance_wide)

Explanation: The solution reshapes the data to wide format with separate columns for each quarter, then calculates the growth rate from Q1 to Q4 as a percentage.

Answer to Question 6:

# Create sample data
set.seed(123)
waiting_time_reduction <- rnorm(45, mean = 18, sd = 6)

# One-sample t-test (one-sided)
test_result <- t.test(waiting_time_reduction, mu = 15, alternative = "greater")
print(test_result)

# Interpretation
if(test_result$p.value < 0.05) {
  cat("Reject H0: The new system produces significantly greater reduction than claimed (p =", 
      round(test_result$p.value, 4), ")\n")
} else {
  cat("Fail to reject H0: No significant evidence of greater reduction (p =", 
      round(test_result$p.value, 4), ")\n")
}

Explanation: This is a one-sided test because we’re testing if the reduction is significantly greater than 15 minutes. The test compares the sample mean to the hypothesized population mean.

Answer to Question 7:

# Proportion test
prop_test <- prop.test(120, 400, p = 0.35, alternative = "less")
print(prop_test)

# Confidence interval
ci_test <- prop.test(120, 400, conf.level = 0.95)
print(ci_test)

# Interpretation
cat("Sample proportion:", round(120/400, 3), "\n")
cat("95% CI for true proportion: [", round(ci_test$conf.int[1], 3), ",", 
    round(ci_test$conf.int[2], 3), "]\n")

if(prop_test$p.value < 0.05) {
  cat("Reject H0: Click-through rate is significantly lower than 35% (p =", 
      round(prop_test$p.value, 4), ")\n")
}

Explanation: The test shows if the observed proportion (30%) is significantly lower than the claimed 35%. The confidence interval provides a range of plausible values for the true proportion.

Answer to Question 8:

# Create sample data
set.seed(456)
nurse_data <- data.frame(
  nurse_id = 1:25,
  before = rnorm(25, 72, 5),
  after = rnorm(25, 78, 5)
)

# Paired t-test
test_result <- t.test(nurse_data$after, nurse_data$before, 
                     paired = TRUE, alternative = "greater")
print(test_result)

# Interpretation
cat("This should be a PAIRED test because the same nurses are measured before and after training.\n")
cat("Mean improvement:", round(mean(nurse_data$after - nurse_data$before), 2), "points\n")

if(test_result$p.value < 0.05) {
  cat("Reject H0: Training significantly improved scores (p =", 
      round(test_result$p.value, 4), ")\n")
}

Explanation: This is a paired test because we’re comparing the same subjects (nurses) before and after treatment. The test examines if there’s a significant improvement in scores.

Answer to Question 9:

# Two-sample proportion test
prop_test <- prop.test(x = c(180, 210), n = c(250, 280), alternative = "two.sided")
print(prop_test)

# Calculate effect size (Cohen's h)
p1 <- 180/250
p2 <- 210/280
cohens_h <- 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))
cat("Cohen's h (effect size):", round(cohens_h, 3), "\n")

# Interpretation
cat("Hospital A satisfaction rate:", round(p1, 3), "\n")
cat("Hospital B satisfaction rate:", round(p2, 3), "\n")

if(prop_test$p.value < 0.05) {
  cat("Reject H0: Significant difference in satisfaction rates (p =", 
      round(prop_test$p.value, 4), ")\n")
} else {
  cat("Fail to reject H0: No significant difference (p =", 
      round(prop_test$p.value, 4), ")\n")
}

Explanation: The test compares two independent proportions. Cohen’s h measures the effect size, where values around 0.2, 0.5, and 0.8 represent small, medium, and large effects respectively.

Answer to Question 10:

library(car)

# Create sample data
set.seed(789)
cholesterol_data <- data.frame(
  level = c(rnorm(30, 200, 15), rnorm(35, 195, 25)),
  clinic = rep(c("Clinic_A", "Clinic_B"), c(30, 35))
)

# Levene's test
levene_test <- leveneTest(level ~ clinic, data = cholesterol_data)
print(levene_test)

# Appropriate t-test (Welch's t-test - default in R)
t_test <- t.test(level ~ clinic, data = cholesterol_data)
print(t_test)

# Interpretation
cat("Levene's test p-value:", round(levene_test$`Pr(>F)`[1], 4), "\n")
cat("Since p < 0.05, variances are significantly different.\n")
cat("Use Welch's t-test (default) which doesn't assume equal variances.\n")

Explanation: When Levene’s test shows unequal variances (p < 0.05), use Welch’s t-test instead of the standard two-sample t-test. Welch’s test adjusts for unequal variances.

Answer to Question 11:

# Create sample data
set.seed(321)
clinical_data <- data.frame(
  blood_pressure = c(rnorm(20, 140, 10), rnorm(20, 135, 12), rnorm(20, 130, 8)),
  treatment_group = rep(c("Drug_A", "Drug_B", "Drug_C"), each = 20)
)

# ANOVA
anova_result <- aov(blood_pressure ~ treatment_group, data = clinical_data)
summary(anova_result)

# Test assumptions
library(car)

# Normality of residuals
shapiro_test <- shapiro.test(residuals(anova_result))
cat("Shapiro-Wilk test for normality p-value:", round(shapiro_test$p.value, 4), "\n")

# Homogeneity of variances
levene_test <- leveneTest(blood_pressure ~ treatment_group, data = clinical_data)
cat("Levene's test p-value:", round(levene_test$`Pr(>F)`[1], 4), "\n")

# Interpretation
if(summary(anova_result)[[1]][["Pr(>F)"]][1] < 0.05) {
  cat("Reject H0: Significant differences among treatment groups\n")
} else {
  cat("Fail to reject H0: No significant differences among groups\n")
}

Explanation: ANOVA tests if there are significant differences among group means. The assumptions (normality and equal variances) should be checked before interpreting results.

Answer to Question 12:

# Assuming anova_model is already fitted
# Tukey's HSD test
tukey_result <- TukeyHSD(anova_model)
print(tukey_result)

# Plot the results
plot(tukey_result)

# Interpretation
cat("Tukey HSD identifies which specific groups differ significantly.\n")
cat("Look for p-values < 0.05 in the 'p adj' column.\n")
cat("The 'diff' column shows the difference in means between groups.\n")

Explanation: Tukey’s HSD test performs pairwise comparisons while controlling the family-wise error rate. Significant differences are indicated by p-values < 0.05 in the adjusted p-value column.

Answer to Question 13:

library(car)

# Create sample data
set.seed(654)
patient_data <- data.frame(
  recovery_time = c(rnorm(25, 10, 2), rnorm(25, 12, 3), rnorm(25, 8, 1.5)),
  hospital = rep(c("Hospital_A", "Hospital_B", "Hospital_C"), each = 25)
)

# ANOVA model
anova_model <- aov(recovery_time ~ hospital, data = patient_data)

# Levene's test for homogeneity of variances
levene_test <- leveneTest(recovery_time ~ hospital, data = patient_data)
print(levene_test)

# Normality test for residuals
shapiro_test <- shapiro.test(residuals(anova_model))
print(shapiro_test)

# Interpretation
cat("Levene's test p-value:", round(levene_test$`Pr(>F)`[1], 4), "\n")
if(levene_test$`Pr(>F)`[1] > 0.05) {
  cat("Assumption of equal variances is met (p > 0.05)\n")
} else {
  cat("Assumption of equal variances is violated (p < 0.05)\n")
}

cat("Shapiro-Wilk test p-value:", round(shapiro_test$p.value, 4), "\n")
if(shapiro_test$p.value > 0.05) {
  cat("Assumption of normality is met (p > 0.05)\n")
} else {
  cat("Assumption of normality is violated (p < 0.05)\n")
}

Explanation: Both tests check ANOVA assumptions. Levene’s test checks equal variances (homoscedasticity), while Shapiro-Wilk tests normality of residuals. Both assumptions should be met for valid ANOVA results.

Answer to Question 14:

library(car)

# Create sample data
set.seed(987)
health_data <- data.frame(
  age = rnorm(100, 55, 12),
  BMI = rnorm(100, 27, 4),
  exercise_hours = rnorm(100, 3, 2),
  blood_pressure = 90 + 0.4 * age + 1.8 * BMI - 2.1 * exercise_hours + rnorm(100, 0, 8)
)

# Multiple linear regression
model <- lm(blood_pressure ~ age + BMI + exercise_hours, data = health_data)
summary(model)

# Check for multicollinearity
vif_values <- vif(model)
print("VIF values:")
print(vif_values)

# Interpretation
cat("R-squared:", round(summary(model)$r.squared, 3), "\n")
cat("Adjusted R-squared:", round(summary(model)$adj.r.squared, 3), "\n")
cat("F-statistic p-value:", round(summary(model)$fstatistic[4], 4), "\n")

# Check VIF interpretation
if(all(vif_values < 5)) {
  cat("No multicollinearity concerns (all VIF < 5)\n")
} else if(any(vif_values > 10)) {
  cat("Severe multicollinearity detected (VIF > 10)\n")
} else {
  cat("Moderate multicollinearity (5 < VIF < 10)\n")
}

Explanation: The model shows how age, BMI, and exercise hours predict blood pressure. VIF values < 5 indicate no multicollinearity concerns. Higher VIF values suggest correlated predictors.

Answer to Question 15:

Step 1: Write the regression equation

LOS = 3.250 + 0.045(age) + 0.120(BMI) + 1.450(comorbidities)

Step 2: Plug in the values - age = 65 - BMI = 28 - comorbidities = 2

LOS = 3.250 + 0.045(65) + 0.120(28) + 1.450(2)
LOS = 3.250 + 2.925 + 3.360 + 2.900
LOS = 12.435 days

Step 3: Classification Since 12.435 ≥ 7 days, this patient is classified as “Long Stay”

R code:

# Coefficients from the model
intercept <- 3.250
coef_age <- 0.045
coef_bmi <- 0.120
coef_comorbidities <- 1.450

# New patient values
age <- 65
bmi <- 28
comorbidities <- 2

# Prediction
predicted_los <- intercept + coef_age * age + coef_bmi * bmi + coef_comorbidities * comorbidities
cat("Predicted Length of Stay:", round(predicted_los, 2), "days\n")

# Classification
if(predicted_los < 7) {
  cat("Classification: Short Stay\n")
} else {
  cat("Classification: Long Stay\n")
}

Explanation: The predicted length of stay is approximately 12.4 days, which exceeds the 7-day threshold, classifying this patient as a “Long Stay” case. This information can help the hospital plan resource allocation.

Answer to Question 16:

# Create the data
set.seed(456)
patient_data <- data.frame(
  age = c(45, 62, 71, 55, 68, 58, 75, 63, 52, 70),
  length_of_stay = c(3, 7, 9, 4, 8, 5, 10, 6, 4, 9),
  comorbidities = c(0, 2, 3, 1, 2, 1, 4, 2, 0, 3),
  readmission_risk = c(25, 55, 72, 35, 60, 40, 78, 50, 28, 68)
)

# Task 1: Fit the model and display summary
risk_model <- lm(readmission_risk ~ age + length_of_stay + comorbidities, 
                 data = patient_data)
summary(risk_model)

# Task 2: Predict for new patient
new_patient <- data.frame(age = 60, length_of_stay = 6, comorbidities = 2)
predicted_risk <- predict(risk_model, newdata = new_patient)
cat("Predicted readmission risk:", round(predicted_risk, 2), "\n")

# Task 3: Classification
if(predicted_risk < 40) {
  risk_category <- "Low Risk"
} else if(predicted_risk <= 60) {
  risk_category <- "Medium Risk"
} else {
  risk_category <- "High Risk"
}

cat("Risk Classification:", risk_category, "\n")

# Display complete results
cat("\n=== COMPLETE RESULTS ===\n")
cat("Patient Profile:\n")
cat("  - Age: 60 years\n")
cat("  - Length of Stay: 6 days\n")
cat("  - Comorbidities: 2\n")
cat("Predicted Risk Score:", round(predicted_risk, 2), "\n")
cat("Risk Category:", risk_category, "\n")

Interpretation: Based on the model, a 60-year-old patient with a 6-day length of stay and 2 comorbidities has a predicted readmission risk score around 49-51 (depending on the exact coefficients). This falls into the “Medium Risk” category (40-60), suggesting moderate follow-up care and monitoring would be appropriate.

Explanation: This practical approach shows how hospitals can use regression models to classify patients into risk categories, enabling targeted interventions and resource allocation for patients most at risk of readmission.

Answer to Question 17:

# Create sample data
set.seed(456)
patient_data <- data.frame(
  age = rnorm(200, 55, 12),
  smoking_status = sample(0:1, 200, replace = TRUE),
  cholesterol = rnorm(200, 200, 30),
  disease = sample(0:1, 200, replace = TRUE, prob = c(0.7, 0.3))
)

# Logistic regression
logit_model <- glm(disease ~ age + smoking_status + cholesterol, 
                   data = patient_data, family = binomial)
summary(logit_model)

# Odds ratios
odds_ratios <- exp(coef(logit_model))
print("Odds Ratios:")
print(odds_ratios)

# Interpretation
cat("Interpretation of coefficients:\n")
cat("Age: For each year increase, odds of disease multiply by", round(odds_ratios[2], 3), "\n")
cat("Smoking: Smokers have", round(odds_ratios[3], 3), "times the odds of disease vs non-smokers\n")
cat("Cholesterol: For each unit increase, odds multiply by", round(odds_ratios[4], 3), "\n")

Explanation: Logistic regression models log-odds of disease. Odds ratios > 1 indicate increased risk, < 1 indicate decreased risk. The coefficients represent log-odds changes, so we exponentiate to get odds ratios.

Answer to Question 18:

# Using the model from Question 17
new_patient <- data.frame(age = 60, smoking_status = 1, cholesterol = 250)

# Predict probability
predicted_prob <- predict(logit_model, newdata = new_patient, type = "response")
cat("Predicted probability of disease:", round(predicted_prob, 3), "\n")

# Odds ratio for smoking
smoking_or <- exp(coef(logit_model)[3])
cat("Odds ratio for smoking:", round(smoking_or, 3), "\n")

# Practical interpretation
cat("Practical interpretation:\n")
cat("A 60-year-old smoker with cholesterol 250 has a", round(predicted_prob*100, 1), "% chance of disease\n")
cat("Smokers have", round(smoking_or, 1), "times higher odds of disease than non-smokers\n")
if(smoking_or > 1) {
  cat("This means smoking increases disease risk by", round((smoking_or-1)*100, 1), "%\n")
}

Explanation: The predicted probability gives the likelihood of disease for this specific patient. The odds ratio quantifies how much smoking increases disease risk compared to non-smoking.

Answer to Question 19:

# Create sample data
set.seed(789)
clinical_data <- data.frame(
  age = rnorm(90, 65, 10),
  treatment_type = rep(c("A", "B", "C"), each = 30),
  recovery_time = c(rnorm(30, 15, 3), rnorm(30, 12, 2.5), rnorm(30, 18, 4))
)

# Set treatment A as reference
clinical_data$treatment_type <- relevel(factor(clinical_data$treatment_type), ref = "A")

# Linear regression with dummy variables
model <- lm(recovery_time ~ age + treatment_type, data = clinical_data)
summary(model)

# Interpretation
cat("Interpretation of coefficients:\n")
cat("Intercept: Expected recovery time for treatment A at age 0\n")
cat("Age: For each year increase, recovery time changes by", round(coef(model)[2], 3), "days\n")
cat("Treatment B: Recovery time differs by", round(coef(model)[3], 3), "days vs treatment A\n")
cat("Treatment C: Recovery time differs by", round(coef(model)[4], 3), "days vs treatment A\n")

Explanation: Treatment A is the reference category. The coefficients for treatments B and C show how much their recovery times differ from treatment A, holding age constant. Positive coefficients mean longer recovery times.

Answer to Question 20:

# Create sample data
set.seed(321)
patient_data <- data.frame(
  age = rnorm(150, 55, 12),
  smoking_status = sample(0:1, 150, replace = TRUE),
  blood_pressure = 90 + 0.4 * age + 12 * smoking_status + 
                   0.2 * age * smoking_status + rnorm(150, 0, 8)
)

# Regression with interaction term
model <- lm(blood_pressure ~ age * smoking_status, data = patient_data)
summary(model)

# Interpretation
cat("Model equation: BP =", round(coef(model)[1], 2), "+", round(coef(model)[2], 2), "*age +", 
    round(coef(model)[3], 2), "*smoking +", round(coef(model)[4], 2), "*age*smoking\n")

cat("Interpretation:\n")
cat("For non-smokers (smoking=0): BP =", round(coef(model)[1], 2), "+", round(coef(model)[2], 2), "*age\n")
cat("For smokers (smoking=1): BP =", round(coef(model)[1] + coef(model)[3], 2), "+", 
    round(coef(model)[2] + coef(model)[4], 2), "*age\n")

if(coef(model)[4] != 0) {
  cat("The interaction term shows that the effect of age on BP differs between smokers and non-smokers\n")
  cat("Age effect for smokers:", round(coef(model)[2] + coef(model)[4], 3), "\n")
  cat("Age effect for non-smokers:", round(coef(model)[2], 3), "\n")
}

Explanation: The interaction term (age * smoking_status) allows the effect of age on blood pressure to differ between smokers and non-smokers. If significant, it means the relationship between age and blood pressure is different for the two groups.

16 💻 Second Intermediate Sample Questions

18 ⚡ Course Performance