15 💻 First Intermediate Sample Questions

Hi guys, this is your favourite TA. I am aggregating questions that were asked in exam sessions of previous years, i.e. 2020/2021 and 2021/2022. They are representative of the actual exam, but take them with a grain of salt.

I will also provide some extra exercises if you are still anxious.

15.1 👨‍🎓 2020/2021

Exercise 15.1 Write the line of the R command that you use to produce a boxplot of the variable X
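
A minimal sketch of a possible answer, assuming X is a numeric vector (the values below are hypothetical, made up only so that the line runs):

```r
# Hypothetical data: X is assumed to be a numeric vector
X <- c(2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.5, 9.7)

# The one-line answer: draw the boxplot of X
bp <- boxplot(X, main = "Boxplot of X")

# boxplot() invisibly returns what it drew; stats holds the five plotted values
bp$stats
```

Storing the result in bp is optional; in the exam the single call boxplot(X) is what matters.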

Exercise 15.2 We want to statistically test the hypothesis that the performance of students at UCSC in Rome who graduated last year is better than that of those who graduated this year. Can we say that this is a paired sample test?

Exercise 15.3 Without using formulae, describe how you can calculate the test statistic in a hypothesis testing procedure on a single mean with known variance.
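
The exam asks for a verbal description, but it may help to see the same steps as a computation. The sketch below uses a hypothetical sample x, a hypothetical hypothesised mean mu0 and a hypothetical known standard deviation sigma: take the sample mean, subtract the hypothesised mean, and divide by the known standard deviation over the square root of the sample size.

```r
# Hypothetical inputs, chosen only for illustration
x     <- c(5.1, 4.9, 5.6, 5.8, 5.2)  # observed sample
mu0   <- 5                           # mean under H0
sigma <- 0.5                         # known population standard deviation

n <- length(x)

# Distance between sample mean and hypothesised mean, in standard-error units
z <- (mean(x) - mu0) / (sigma / sqrt(n))

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

z
p_value
```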

Exercise 15.4 Using the dataset Boston downloaded from the library spdep, write the correlation matrix of the variables MEDV, NOX and CRIM.

Exercise 15.5 How do you define the confidence of a statistical test?

Exercise 15.6 Given the following 2 variables X = (1,5,3,3,5,5) and Y= (4,4,6,3,2,3), write the cross-tabulation between X and Y.
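
A possible way to obtain the cross-tabulation is base R's table() function, sketched here with the two vectors from the exercise:

```r
X <- c(1, 5, 3, 3, 5, 5)
Y <- c(4, 4, 6, 3, 2, 3)

# Rows are the distinct values of X, columns the distinct values of Y
tab <- table(X, Y)
tab
```

Each cell counts how many positions share that pair of values, so the cells sum to the 6 observations.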

Exercise 15.7 Write the line of the R command that you use to simulate 1000 random observations from a normal distribution with mean 0 and variance 0.5.

Exercise 15.8 A law company is evaluating the performances of two departments measuring in terms of the time required for solving a conflict in the last year. The observed values are reported in the following table:

…

Can we accept the hypothesis H0: (the mean of Dept 1 is equal to the mean of Dept 2) versus a bilateral alternative hypothesis? (F)

Exercise 15.9 A company has recorded the number of customers in 10 sample stores before (variable X) and after (variable Y) a new advertising campaign was introduced. The observed values are reported in the following table:

…

Write the p-value of the test with H0: (the mean of X is equal to the mean of Y) versus a bilateral alternative hypothesis. (0.000341138)

Exercise 15.10 The HR office of a cleaning company wants to test whether there is gender discrimination among its employees. Call X = the income of a set of 20 male workers and Y = the income of a set of 35 female workers. Write the R command to run an appropriate test of hypothesis.
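
A sketch of one reasonable answer: the two groups are independent and of different sizes, so an unpaired two-sample t-test fits. The incomes below are simulated and purely hypothetical; only the t.test() line is the actual answer.

```r
set.seed(1)  # hypothetical data, simulated only to make the call runnable
X <- rnorm(20, mean = 1500, sd = 200)  # incomes of 20 male workers
Y <- rnorm(35, mean = 1450, sd = 200)  # incomes of 35 female workers

# Unpaired two-sample t-test; by default R runs the Welch version,
# which does not assume equal variances, with a two-sided alternative
res <- t.test(X, Y)
res
```

If the exercise states that the variances are equal, add var.equal = TRUE; for a one-sided question, set the alternative argument to "greater" or "less".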

Exercise 15.11 What is the power of a statistical test?

Exercise 15.12 Using the dataset boston.c downloaded from the library spdep, calculate the coefficient of skewness of the variable RM.

Answer to Exercise 15.12:

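library(moments)
# boston.c is created by data(boston); recent spdep releases ship this dataset in the spData package
data(boston, package = "spData")
skewness(boston.c$RM)

0.4024147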

Exercise 15.13 How do you define the significance of a statistical test?

15.2 👨‍🎓 2021/2022

Exercise 15.14 Given the dataset “Duncan” in the library “carData”, estimate the regression model where the variable prestige is regressed on the variables income. Looking at the following information,

Residuals:
    Min      1Q  Median      3Q     Max
-29.538  -6.417   0.655   6.605  34.641

Do residuals display.

Exercise 15.15 What are the consequences of collinearity among regressors?

Estimators become biased
Estimators become inefficient
Estimators become inconsistent
Estimators become unstable

Exercise 15.16 What is the correct definition of the variance inflation factor (VIF)?

\(1-R^2\)
\(\frac{1}{R^2}\)
\(\frac{1}{1-R^2}\)
\(1-\frac{1}{R^2}\)

Answer to Exercise 15.16:

The correct definition is \(\frac{1}{1-R^2}\), where \(R^2\) comes from regressing the predictor in question on all the other predictors. A general guideline is that a VIF larger than 5 or 10 is large, indicating that the model has problems estimating the coefficient. However, this in general does not degrade the quality of predictions. If the VIF is larger than \(1/(1-R^2)\), where \(R^2\) is the Multiple R-squared of the regression, then that predictor is more related to the other predictors than it is to the response.

install.packages("regclass")
library(regclass)
VIF(modello_regressione)

Alternatively, you can use the vif() function from the car package:

install.packages("car")
library(car)
vif(modello_regressione)
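
To see where the number comes from, here is a minimal sketch that computes a VIF by hand in base R. The regressors x1 and x2 are simulated and hypothetical; the point is only that the VIF of x1 is 1/(1 - R²) from the auxiliary regression of x1 on the other regressors.

```r
set.seed(42)  # hypothetical, simulated regressors
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.5)  # built to be correlated with x1

# Auxiliary regression: x1 explained by the other regressor(s)
r2_x1 <- summary(lm(x1 ~ x2))$r.squared

# Variance inflation factor of x1
vif_x1 <- 1 / (1 - r2_x1)
vif_x1
```

With only two regressors the auxiliary R² is just the squared correlation between them, so both regressors get the same VIF.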

Exercise 15.17 Using only the variables minority, crime, poverty, language, highschool and housing of the Ericksen data in the library carData, run a factor analysis. What is the percentage explained by the first two factors?

Answer: 90.130.001

Exercise 15.18 In a multiple linear regression model y= a+bx1+cx2, if Correlation(x1,x2)=0.9, do we have to discard one of the two variables for collinearity?

Answer: F

Exercise 15.19 Given the dataset Duncan in the library carData estimate the regression model where the variable prestige is regressed on the variables income and education. Which variable is the most significant?

Education
income

Answer to Exercise 15.19:

First, load the Duncan dataset:

library(carData)
data("Duncan")

Then specify the model and produce the summary:

duncan_regression = lm(prestige ~ income + education, data = Duncan)
summary(duncan_regression)

Then look at the p-values:

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept) -6.06466    4.27194  -1.420      0.163    
income       0.59873    0.11967   5.003 0.00001053 ***
education    0.54583    0.09825   5.555 0.00000173 ***

education is more significant than income, since its p-value is smaller (0.00000173 < 0.00001053).

Exercise 15.20 In a multiple linear regression model y= a+bx1+cx2, what is the level of correlation between x1 and x2 beyond which we have to discard one of the two variables for collinearity?

Answer: 0.948

Exercise 15.21 Given the dataset Duncan in the library carData estimate the regression model where the variable prestige is regressed on the variables income and education. What is the p-value of the coefficient of the variable education?

Answer to Exercise 15.21:

First, load the Duncan dataset:

library(carData)
data("Duncan")

Then specify the model and produce the summary:

duncan_regression = lm(prestige ~ income + education, data = Duncan)
summary(duncan_regression)

Then look at the p-values:

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept) -6.06466    4.27194  -1.420      0.163    
income       0.59873    0.11967   5.003 0.00001053 ***
education    0.54583    0.09825   5.555 0.00000173 ***

The p-value of the education coefficient is 0.00000173.

You may want to access it directly instead of copying and pasting it from the console summary output.
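
One way to do that, sketched under the assumption that the carData package is installed: the coefficient table of the summary is a plain matrix, so you can index it by row and column name.

```r
library(carData)
data("Duncan")

duncan_regression <- lm(prestige ~ income + education, data = Duncan)

# Coefficient table as a matrix: one row per term, one column per statistic
coef_table <- coef(summary(duncan_regression))

# Pick single cells by row and column name
p_education <- coef_table["education", "Pr(>|t|)"]
t_education <- coef_table["education", "t value"]

p_education
t_education
```

The same indexing works for the other columns, "Estimate" and "Std. Error".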

Exercise 15.22 What is the reason for adjusting the R2 in a multiple regression?

To account for the number of degrees of freedom
To account for the number of parameters
To reduce the uncertainty
To adjust for variance inflation factor

Answer: To account for the number of degrees of freedom

Exercise 15.23 Given the dataset Duncan in the library carData estimate the regression model where the variable prestige is regressed on the variables income and education. Using the VIF, do we have to exclude some variable due to collinearity?

Answer: F

Answer to Exercise 15.23:

First, load the Duncan dataset together with the car package, which provides vif():

library(carData)
library(car)
data("Duncan")

Then specify the model and compute the variance inflation factors:

duncan_regression = lm(prestige ~ income + education, data = Duncan)
vif(duncan_regression)

The output will look something like this:

 income education 
 2.1049    2.1049

Since both values are below 10, the rule of thumb we adopted to assess multicollinearity, we conclude that income and education are not collinear enough to be a problem, so no variable has to be excluded.

Exercise 15.24 Given the dataset Duncan in the library carData estimate the regression model where the variable prestige is regressed on the variables income and education. What is the t value of the coefficient of the variable education?

Answer to Exercise 15.24:

First, load the Duncan dataset:

library(carData)
data("Duncan")

Then specify the model and produce the summary:

duncan_regression = lm(prestige ~ income + education, data = Duncan)
summary(duncan_regression)

The output will look something like this:

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept) -6.06466    4.27194  -1.420      0.163    
income       0.59873    0.11967   5.003 0.00001053 ***
education    0.54583    0.09825   5.555 0.00000173 ***

By inspecting the summary we see that the t value (the "t value" column) for the variable education is 5.555.

Exercise 15.24 Using only the variables minority, crime, poverty, language, highschool and housing of the Ericksen data in the library carData, run a cluster analysis using the k-means method. If we divide the observations into 4 classes, what is the frequency of the largest class?

Answer: 26

Exercise 15.25 Using only the variables minority, crime, poverty, language, highschool and housing of the Ericksen data in the library carData, run a cluster analysis using the k-means method. What is the percentage explained by the first factor?

Answer: 7.391.719

Exercise 15.26 Using only the variables minority, crime, poverty, language, highschool and housing of the Ericksen data in the library carData, run a cluster analysis using the hierarchical method. If we divide the observations into 10 classes, what is the frequency of the largest class?

Answer: 27

Exercise 15.27 Given the dataset Duncan in the library carData estimate the regression model where the variable prestige is regressed on the variables income and education and report the \(R^2\).

Answer to Exercise 15.27:

First, load the Duncan dataset:

library(carData)
data("Duncan")

Then specify the model and produce the summary:

duncan_regression = lm(prestige ~ income + education, data = Duncan)
summary(duncan_regression)

The output will look something like this:

Residual standard error: 13.37 on 42 degrees of freedom
Multiple R-squared:  0.8282,    Adjusted R-squared:   0.82 
F-statistic: 101.2 on 2 and 42 DF,  p-value: < 0.00000000000000022

By inspecting the lower end of the summary we see that the multiple \(R^2\) of the model is 0.8282, which is high.
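
If you prefer not to read the value off the printed summary, the summary object stores it as a named element. A short sketch, again assuming carData is installed:

```r
library(carData)
data("Duncan")

duncan_regression <- lm(prestige ~ income + education, data = Duncan)
fit_summary <- summary(duncan_regression)

# Multiple and adjusted R-squared, stored as elements of the summary
r2     <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared

r2
adj_r2
```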

15.3 👨‍🎓 2022/2023

Exercise 15.28 Using the dataset Boston downloaded from the library spdep, calculate the coefficient of skewness of the variable RM.

Exercise 15.29 How do you define the significance of a statistical test?

Exercise 15.30 What is the power of statistical test?

Exercise 15.31 How do you define the confidence of a statistical test?

Exercise 15.32 A law company is evaluating the performances of two departments measuring in terms of the time required for solving a conflict in the last year. The observed values are reported in the following table:

perf_table = data.frame(
  stringsAsFactors = FALSE,
  month = c("january", "february", "march", "april", "may", "june",
            "july", "august", "september", "october", "november", "december"),
  dept_1 = c(NA, NA, NA, 3L, 6L, 9L, 7L, 5L, 7L, 3L, 4L, 6L),
  dept_2 = c(4L, 3L, 9L, 5L, 7L, 2L, 6L, 3L, 6L, 7L, 4L, 1L)
)

Can we reject the hypothesis H0: (the mean of Dept 1 is equal to the mean of Dept 2) versus a bilateral alternative hypothesis?
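
A sketch of one way to approach it, assuming the two departments are independent samples, so that an unpaired two-sample t-test applies. Only the two department columns are copied here; t.test() drops the missing values of dept_1 on its own.

```r
# Department columns copied from the table in the exercise
dept_1 <- c(NA, NA, NA, 3, 6, 9, 7, 5, 7, 3, 4, 6)
dept_2 <- c(4, 3, 9, 5, 7, 2, 6, 3, 6, 7, 4, 1)

# Unpaired, two-sided test; the default is the Welch version
# (no equal-variance assumption)
res <- t.test(dept_1, dept_2)
res

# Reject H0 at the 5% level only if the p-value is below 0.05
res$p.value < 0.05
```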

Exercise 15.33 A company has recorded the number of customers in 10 sample stores before (variable X) and after (variable Y) a new advertising campaign was introduced. The observed values are reported in the following table:

stores = data.frame(
     n_store = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L),
      before = c(113L, 110L, 108L, 108L, 103L, 101L, 96L, 101L, 104L, 98L),
       after = c(125L, 113L, 115L, 117L, 105L, 112L, 100L, 103L, 116L, 104L)
)

Can we reject the hypothesis H0: (the mean of X, i.e. before, is equal to the mean of Y, i.e. after) versus a bilateral alternative hypothesis?
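
A sketch of a possible approach: the same 10 stores are measured twice, so the observations come in pairs and a paired t-test is the natural choice. The data are copied from the table above.

```r
before <- c(113, 110, 108, 108, 103, 101, 96, 101, 104, 98)
after  <- c(125, 113, 115, 117, 105, 112, 100, 103, 116, 104)

# Paired, two-sided t-test on the store-by-store differences
res <- t.test(before, after, paired = TRUE)
res

# Reject H0 at the 5% level if the p-value is below 0.05
res$p.value < 0.05
```

Forgetting paired = TRUE would treat the two columns as independent samples and answer a different question.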

Exercise 15.34 Write the line of the R command that you use to simulate 2000 random observations from a normal distribution with mean 0 and variance 0.1.

Many of you fall into this trap! Tip: always use Tab for autocompletion, but also check what the arguments are. Here the exercise asks for 2000 draws (data points) from a normal distribution with mean 0 and variance 0.1. The argument of rnorm is sd, not the variance, so you have to take the square root!

Answer to Exercise 15.34:

rnorm(n = 2000, mean = 0, sd = 0.1^(1/2))
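
A quick sanity check you can run yourself (the seed is arbitrary and only there for reproducibility): the sample variance of the draws should land close to 0.1, not close to 0.01.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# sd is the square root of the requested variance
x <- rnorm(n = 2000, mean = 0, sd = sqrt(0.1))

# Should be near 0.1; passing sd = 0.1 by mistake would give about 0.01
var(x)
```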

Exercise 15.35 Write the line of the R command that you use to produce a boxplot of the variable X

Exercise 15.36 Given the following 2 variables X = (5,5,3,3,5,5) and Y= (4,4,3,3,3,3), test if the mean of X is significantly different from the mean of Y. Report the p-value of the appropriate test and your decision.

Exercise 15.37 Using the dataset boston.c downloaded from the library spdep, write the elements of the correlation matrix of the variables MEDV, NOX and CRIM.

Exercise 15.38 Without using formulae, describe how you can calculate the test statistics in a hypothesis testing procedure on a single mean with known variance.

Exercise 15.39 The HR office of a cleaning company wants to test whether there is a significant difference in salary between males and females. Call X = the salary of a set of 2000 male workers and Y = the salary of a set of 150 female workers. From a previous survey we know that the variances of the two groups are equal. Write the R command to run an appropriate test of hypothesis.

Exercise 15.40 We want to statistically test the hypothesis that students at UCSC in Rome perform better in the second year than in the first year. Can we say that this is a paired sample test?

Exercise 15.41 Using the dataset iris test if there is a significant difference between the mean of Petal.Length and the mean of Sepal.Width and report the outcome value of the t-test.

Exercise 15.42 Using the dataset iris calculate the correlation between Sepal.Length and Sepal.Width.

Exercise 15.43 Using the dataset iris report the highest correlation coefficient that you find between the four variables.

Exercise 15.44 Using the dataset iris report the highest correlation coefficient that you find between the four variables.
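
For the iris correlation exercises above, a possible sketch using only base R (iris ships with R). The diagonal of the correlation matrix is blanked out, since each variable correlates perfectly with itself.

```r
# Correlation between two specific variables
r_sepal <- cor(iris$Sepal.Length, iris$Sepal.Width)
r_sepal

# Correlation matrix of the four numeric variables
cm <- cor(iris[, 1:4])

# Ignore the diagonal (always 1) before looking for the maximum
diag(cm) <- NA
max_cor <- max(cm, na.rm = TRUE)
max_cor

# Which pair of variables reaches it
which(cm == max_cor, arr.ind = TRUE)
```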

Exercise 15.45 Using the dataset iris report the variance of Sepal.Length

Exercise 15.46 Using the dataset iris report the third quartile of Sepal.Length
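
Both of the last two quantities are one-liners in base R; a minimal sketch:

```r
# Sample variance (denominator n - 1)
v <- var(iris$Sepal.Length)
v

# Third quartile, i.e. the 75th percentile (R's default quantile type)
q3 <- quantile(iris$Sepal.Length, probs = 0.75)
q3
```

summary(iris$Sepal.Length) also prints the third quartile, under the label "3rd Qu.".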

Exercise 15.47 What is the reason for adjusting the R2 in a multiple regression?

Exercise 15.48 What is the correct definition of the variance inflation factor?

Exercise 15.49 What are the consequences of collinearity among regressors?

Exercise 15.50 Using the dataset Wong from the R library carData, estimate a multiple linear regression where the variable piq is expressed as a function of age, days and duration.
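
A sketch of how the model could be specified, assuming the carData package is installed; it follows the same pattern used for the Duncan exercises.

```r
library(carData)
data("Wong")

# piq as a linear function of age, days and duration
wong_regression <- lm(piq ~ age + days + duration, data = Wong)
summary(wong_regression)
```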