SOCGA2332 Lab 3 Solution


Logistics & Announcement
• Make sure to comment your code. You will get credit for demonstrating your thought process even if you don’t get the final answer correct.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
# Import data
weight_df <- read.csv("data/weight.csv")
Part 1 Review: Population and Sample
1. Write down the formula you use to calculate the following sample statistics (assume your sample size = 𝑛):
• Sample mean:
• Sample variance:
• Sample standard deviation:
• Standard error of sample mean:
• 95% confidence interval of the population mean:
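For reference, one standard way to write these (for a sample 𝑦1, …, 𝑦𝑛; the confidence interval uses the 𝑡 critical value, which is approximately 1.96 for large 𝑛):

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2, \qquad s = \sqrt{s^2}$$

$$SE(\bar{y}) = \frac{s}{\sqrt{n}}, \qquad \text{95\% CI for } \mu:\ \bar{y} \pm t_{.025,\,n-1}\cdot\frac{s}{\sqrt{n}}$$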
2. You have collected a sample of 25 on BMI (Body Mass Index). The sample mean is 23 and sample variance is 4.
• What is the point estimate of the population mean?
• What is the 95% confidence interval of the population mean (round to 2 d.p.)?
• What is the 95% confidence interval of the population mean if the sample size is 10,000 (round to 2 d.p.)?
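A minimal R sketch of these calculations (using the 𝑡 critical value with 𝑛 − 1 degrees of freedom; with the normal critical value 1.96 the first interval would be slightly narrower):

## point estimate of the population mean = the sample mean = 23
y_bar <- 23
s2    <- 4
n     <- 25
se    <- sqrt(s2) / sqrt(n)                      # standard error = 0.4
y_bar + c(-1, 1) * qt(0.975, df = n - 1) * se    # roughly (22.17, 23.83)

## with n = 10,000 the standard error shrinks to 0.02
n2  <- 10000
se2 <- sqrt(s2) / sqrt(n2)
y_bar + c(-1, 1) * qt(0.975, df = n2 - 1) * se2  # roughly (22.96, 23.04)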
Part 2: Hypothesis and Significance Test
First, let’s review the standard steps for conducting a significance test:
2.1 The standard procedure of a significance test
• 1. Formulate our research question in the null and alternative hypotheses
• 2. Select a significance level (𝛼) (in social science, usually 𝛼 = 0.05)
• 3. Select which test statistic to use (for small samples, we use the t test statistic)
• 4. If you are collecting first-hand data, select a sample size that provides you with sufficient statistical power
• 5. Derive the sampling distribution of the test statistic under the assumption that the null hypothesis is true
• For the t test statistic, its sampling distribution is approximately the Student t distribution with 𝑛−1 degrees of freedom
• When 𝑛 gets larger (usually 𝑛 > 30), the t distribution is approximately a standard normal distribution (see graph below)

Figure 1: t distribution 1
• 6(A) Derive the critical value of t and your rejection region according to the null hypothesis
• The critical value of t (𝐶𝑉𝑡) is the value beyond which we will regard our observed t as “unusual”
• The rejection region will be (−∞,−|𝐶𝑉𝑡|) ∪ (|𝐶𝑉𝑡|,∞).
• For samples with df ≥ 100, the critical value of t is approximately 1.96 for a significance level of 0.05. The rejection region is (−∞,−1.96) ∪ (1.96,∞)
• For samples with df < 100, you can use the “t Distribution Critical Values” table in your textbook to find the critical value and rejection region:
– For a two-tailed test with a significance level of 0.05, we find values from the 𝑡.025 column
– For a one-tailed test with a significance level of 0.05, we find values from the 𝑡.050 column
• You can also use the qt() function in R to find the critical value:
– To find the critical value of t for a two-tailed test, use qt(p = 0.5*your_alpha, df = your_degree_of_freedom)
– To find the critical value of t for a one-tailed test, use qt(p = your_alpha, df = your_degree_of_freedom)
– Note: the qt() function is the quantile function for the Student 𝑡 distribution in base R; it returns the 𝑡 value for the percentile you input. Because the p supplied here is a lower-tail probability, the result is negative; take its absolute value (or add lower.tail = FALSE) to get the positive critical value (see the short sketch after this list for a concrete example)
• 6(B) Alternatively, you can calculate the 𝑝-value of your observed t statistic
• The 𝑝-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic equal to (or more extreme than) the one we observed

Figure 2: t distribution 2
– To find the two-tail 𝑝-value, use 2*pt(q = abs(observed_t), df = your_degree_of_freedom, lower.tail = FALSE)
– To find the one-tail 𝑝-value (for an alternative of the form 𝜇 > 𝜇0), use pt(q = observed_t, df = your_degree_of_freedom, lower.tail = FALSE); for 𝜇 < 𝜇0, set lower.tail = TRUE

Figure 3: t distribution 3
• 7. Make a conclusion about whether to reject the null hypothesis
– You can use this online tool to visualize a 𝑡-test
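As a quick illustration of the qt() and pt() calls from steps 6(A) and 6(B), here is a sketch using df = 71 and an observed t of about 2.94 as example values (these are the numbers that come up in the one-sample test in Part 2.2 below):

## critical value of t for a two-tailed test at alpha = 0.05, df = 71
qt(p = 0.5 * 0.05, df = 71, lower.tail = FALSE)   # roughly 1.99

## critical value of t for a one-tailed test at alpha = 0.05, df = 71
qt(p = 0.05, df = 71, lower.tail = FALSE)         # roughly 1.67

## two-tail and one-tail p-values for an observed t of 2.9376
2 * pt(q = 2.9376, df = 71, lower.tail = FALSE)   # roughly 0.0045
pt(q = 2.9376, df = 71, lower.tail = FALSE)       # roughly 0.0022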
Exercise
With 𝜇0 = 0, 𝑦̄ = 1.54, sample size 𝑛 = 27, and 𝑠 = 3.25, derive:
(1) The t test statistic
(2) The critical value of t given 𝐻0 is true
(3) Your rejection region
(4) 𝑝-value
(5) Your conclusion of the significance test
## you can code your answer here
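One possible way to code the answer, assuming a two-tailed test at 𝛼 = 0.05 (the numbers in the comments are approximate):

mu_0  <- 0
y_bar <- 1.54
n     <- 27
s     <- 3.25

## (1) the t test statistic
t_obs <- (y_bar - mu_0) / (s / sqrt(n))                          # roughly 2.46

## (2) the critical value of t given H0 is true (two-tailed, df = 26)
cv_t <- qt(p = 0.025, df = n - 1, lower.tail = FALSE)            # roughly 2.06

## (3) the rejection region is (-Inf, -cv_t) U (cv_t, Inf), i.e. beyond about +/- 2.06

## (4) the two-tail p-value
p_val <- 2 * pt(q = abs(t_obs), df = n - 1, lower.tail = FALSE)  # roughly 0.02

## (5) since |t_obs| > cv_t (equivalently, p_val < 0.05), we reject H0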
2.2 One-sample t-test using R
• When do you use a one-sample t-test?
• R provides a simple function, t.test(), to perform hypothesis testing using the t test statistic

Figure 4: t distribution 4
• For example, the data object weight_df we just imported records the weight change of anorexic patients who went through therapy programs, and we want to know whether these therapies are effective.
## check data
head(weight_df)
## subj therapy before after change
## 1 1 b 80.5 82.2 1.7
## 2 2 b 84.9 85.6 0.7
## 3 3 b 81.5 81.4 -0.1
## 4 4 b 82.6 81.9 -0.7
## 5 5 b 79.9 76.4 -3.5
## 6 6 b 88.7 103.6 14.9
• Before performing any statistical test, it will be useful to (1) check the descriptive statistics and (2) plot the variables of interest.
## check descriptive statistics of all variables
summary(weight_df)
##       subj        therapy              before          after
##  Min.   : 1.00   Length:72          Min.   :70.00   Min.   : 71.30
##  1st Qu.:18.75   Class :character   1st Qu.:79.60   1st Qu.: 79.33
##  Median :36.50   Mode  :character   Median :82.30   Median : 84.05
##  Mean   :36.50                      Mean   :82.41   Mean   : 85.17
##  3rd Qu.:54.25                      3rd Qu.:86.00   3rd Qu.: 91.55
##  Max.   :72.00                      Max.   :94.90   Max.   :103.60
##      change
##  Min.   :-12.200
##  1st Qu.: -2.225
##  Median :  1.650
##  Mean   :  2.764
##  3rd Qu.:  9.100
##  Max.   : 21.500
## plot histogram and density curve
weight_df %>%
  ggplot(aes(x = change, y = after_stat(density))) +
  geom_histogram(binwidth = 1, fill = "grey", color = "black") +
  geom_density() +
  labs(title = "Distribution of Weight Change") +
  theme_classic()
Figure: Distribution of Weight Change (histogram with density curve)

• We test:
𝐻0 ∶ 𝜇change = 0
• the mean weight change is 0 against:
𝐻a1 ∶ 𝜇change ≠ 0
• the mean weight change is not 0, a two-tailed test:
𝐻a2 ∶ 𝜇change > 0
• the mean weight change is larger than 0, a one-tailed test, using the following code:
## mean of weight change
mean(weight_df$change)
## [1] 2.763889
# ---- one sample two-tail t-test ---- #
two_tail_t <- t.test(
  weight_df$change,          # the sample value vector that you want to test
  mu = 0,                    # mean given by your null hypothesis
  alternative = "two.sided", # direction of alternative hypothesis
  conf.level = 0.95
)
## extract test statistic
two_tail_t$statistic
## t
## 2.93757
## extract p-value
two_tail_t$p.value
## [1] 0.004457718
## extract the confidence interval of the mean
two_tail_t$conf.int
## [1] 0.8878354 4.6399424
## attr(,"conf.level")
## [1] 0.95
## display full result
two_tail_t
##
## One Sample t-test
##
## data: weight_df$change
## t = 2.9376, df = 71, p-value = 0.004458
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.8878354 4.6399424
## sample estimates:
## mean of x
## 2.763889
# ---- one sample one-tail t-test ---- #
t.test(weight_df$change,
       mu = 0,
       alternative = "greater",
       conf.level = 0.95)
##
## One Sample t-test
##
## data: weight_df$change
## t = 2.9376, df = 71, p-value = 0.002229
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
## 1.195825 Inf
## sample estimates:
## mean of x
## 2.763889
Exercise
The institution that offers therapy programs to the anorexic patients claims that their treatment will lead to a weight increase of 4 lbs. Using the weight_df data and 𝛼 = 0.05, perform both a two-tailed and a one-tailed test:
𝐻0 ∶ 𝜇change = 4
against
𝐻a1 ∶ 𝜇change ≠ 4 and 𝐻a2 ∶ 𝜇change < 4
• Report your hypothesis testing result.
• Hint: Make sure you put correct arguments for your t.test() function! (Are you testing for “two.sided”, “less”, or “greater”? What’s your mu?)
## you can code your answer here
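One way to set up the two t.test() calls for this exercise (interpret the output as in Part 2.2; note that the one-tailed alternative here is "less"):

## two-tailed test: H_a1: mu_change != 4
t.test(weight_df$change,
       mu = 4,
       alternative = "two.sided",
       conf.level = 0.95)

## one-tailed test: H_a2: mu_change < 4
t.test(weight_df$change,
       mu = 4,
       alternative = "less",
       conf.level = 0.95)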
Part 3: Comparing the Mean of Two Groups (Two-sample t-test)
3.1 Two independent samples
Using R, we can perform a two-sample 𝑡-test with the same t.test() function by supplying a second sample value vector.
For example, in treating anorexic patients, three different therapies are used. We can plot a boxplot to visualize how weight changes differ across these therapies.
## box plot
weight_df %>%
  ggplot(aes(x = therapy, y = change)) +
  geom_boxplot() +
  geom_point(shape = 1, alpha = 0.7) +
  labs(title = "Weight Changes by Therapy Program", y = "weight change") +
  theme_classic()
Figure: Weight Changes by Therapy Program (boxplot)

It looks like therapy f tends to result in a higher weight increase compared to other therapies. Let’s use a two-sample t-test to see if the mean weight change in therapy f is statistically different from that in therapy c:
𝐻0 ∶ 𝜇𝑓 − 𝜇𝑐 = 0
against
𝐻𝑎 ∶ 𝜇𝑓 − 𝜇𝑐 ≠ 0
## filter data for each therapy
weight_f <- weight_df %>% filter(therapy == "f")
weight_c <- weight_df %>% filter(therapy == "c")

# ---- two-group independent two-tailed t-test ---- #
t.test(
  x = weight_f$change,       # the value vector from the first sample
  y = weight_c$change,       # the value vector from the second sample
  mu = 0,                    # mean difference given by your null hypothesis
  alternative = "two.sided"  # direction of alternative hypothesis
)
##
## Welch Two Sample t-test
##
## data: weight_f$change and weight_c$change
## t = 3.2992, df = 36.979, p-value = 0.002152
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.976597 12.452815
## sample estimates:
## mean of x mean of y
## 7.264706 -0.450000
Note: The degrees of freedom of the t-distribution will be 𝑛0 + 𝑛1 − 2 if the population variances of the two groups are equal. This is often not a very realistic scenario. For this reason, Welch’s approximation (which we will not define here, but can be found here if you are curious) is often used for the degrees of freedom of the 𝑡 distribution. This is, in fact, the default option in the t.test() function that we use in R.
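For comparison, if you are willing to assume equal population variances, you can turn off the Welch approximation with var.equal = TRUE, which gives the pooled-variance (Student) test with 𝑛0 + 𝑛1 − 2 degrees of freedom:

# ---- pooled-variance two-sample t-test (assumes equal variances) ---- #
t.test(
  x = weight_f$change,
  y = weight_c$change,
  mu = 0,
  alternative = "two.sided",
  var.equal = TRUE   # use the pooled-variance (Student) t-test instead of Welch
)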
3.2 Two dependent samples
In fact, our example in the one-sample t-test in Part 2 is really a two dependent-sample (paired) t-test. For a two dependent-sample t-test, you can either create a new variable equal to the difference between the two dependent samples, as we did in Part 2, or use the t.test() function and set the argument paired = TRUE.
For example, in the weight_df data, if we want to test whether the mean weight before the treatment is different from the mean weight after the treatment:
𝐻0 ∶ 𝜇before − 𝜇after = 0
against
𝐻𝑎 ∶ 𝜇before − 𝜇after ≠ 0
# ---- two-group dependent two-tailed t-test ---- #
t.test(
  x = weight_df$before,      # the value vector from the first sample
  y = weight_df$after,       # the value vector from the second sample
  mu = 0,                    # mean difference given by your null hypothesis
  paired = TRUE,             # dependent (paired) samples
  alternative = "two.sided", # direction of alternative hypothesis
  conf.level = 0.95          # confidence level
)
##
## Paired t-test
##
## data: weight_df$before and weight_df$after
## t = -2.9376, df = 71, p-value = 0.004458
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -4.6399424 -0.8878354
## sample estimates:
## mean difference
## -2.763889
Part 3 Exercise
Perform a two-sample two-tailed 𝑡-test for the difference between therapy b and c:
𝐻0 ∶ 𝜇𝑏 − 𝜇𝑐 = 0
against
𝐻𝑎 ∶ 𝜇𝑏 − 𝜇𝑐 ≠ 0
## you can code your answer here
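One way to set this up, mirroring the therapy f vs. c example above:

## filter data for therapy b and therapy c
weight_b <- weight_df %>% filter(therapy == "b")
weight_c <- weight_df %>% filter(therapy == "c")

## two-sample two-tailed t-test (Welch by default)
t.test(
  x = weight_b$change,
  y = weight_c$change,
  mu = 0,
  alternative = "two.sided"
)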
