STAT292 Mid-Term Test 1: Discrete Distributions and Contingency Tables (Solved)
Instructions: There are 10 questions given on pages 2–6 worth a total of 100 marks. Answer ALL questions.
Solutions must be either typed or written neatly, and questions must be answered in order.
Some SAS output and a standard normal probability table are provided on pages 7–8.
Be sure to submit your assignment as a PDF and follow the instructions specified on the submission system.
Section A: Multi-Choice (40 Marks)
For Section A questions, record only the letter corresponding to your answer. Do not present any working to support your choice of answer.
Use the following information to answer Questions 1 to 6.
When MetService reports that there is a 30% chance of rain in Wellington for a given day, it means that they estimate the probability of any rain in Wellington for that day to be 0.3. Consider 8 randomly selected days where MetService reported that there was a 30% chance of rain in Wellington. Suppose that rain was recorded in Wellington for 5 of those days.
1. Assuming that MetService’s reported probability of rain in Wellington for each of those days is correct, what is the probability (to 4dp) that exactly 5 of the 8 days had rain? (5 marks)
a. 0.0008
b. 0.0013
c. 0.0467
d. 0.625
e. 0.9887
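The calculation behind Question 1 can be sketched as follows, assuming Y ~ Binomial(n = 8, p = 0.3) as the question states. The helper name binomial_pmf is illustrative only and is not part of the test.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Return P(Y = k) for Y ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability that exactly 5 of the 8 days had rain when p = 0.3.
prob_exactly_5 = binomial_pmf(5, 8, 0.3)
print(round(prob_exactly_5, 4))  # 0.0467
```

Under these assumptions the result agrees with option (c).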
2. For a random sample of 8 days where MetService reports that there is a 30% chance of rain, what are the mean and variance (to 4dp) for the number of days Y that have rain? (5 marks)
a. E(Y) = 0.3, V(Y) = 0.0262.
b. E(Y) = 0.3, V(Y) = 0.162.
c. E(Y) = 2.4, V(Y) = 1.2961.
d. E(Y) = 2.4, V(Y) = 1.68.
e. E(Y) = 2.4, V(Y) = 5.6.
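As a quick check of Question 2, the sketch below applies the standard binomial results E(Y) = np and V(Y) = np(1 − p), again assuming Y ~ Binomial(8, 0.3). The function name is hypothetical.

```python
def binomial_mean_var(n: int, p: float) -> tuple[float, float]:
    """Return (mean, variance) of a Binomial(n, p) random variable."""
    mean = n * p
    variance = n * p * (1 - p)
    return mean, variance


mean, variance = binomial_mean_var(8, 0.3)
print(round(mean, 4), round(variance, 4))  # 2.4 1.68
```

Note that 1.2961 in option (c) is the standard deviation, not the variance.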
3. Again assuming that MetService’s reported probability of rain in Wellington for each of the 8 days is correct, use a normal approximation to find the probability (to 4dp) that rain was recorded on fewer than 5 of the 8 days. (For your reference, a standard normal probability table is presented on page 8.) (5 marks)
a. 0.7422
b. 0.8023
c. 0.8907
d. 0.8944
e. 0.9474
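One way to reproduce Question 3 is sketched below. It assumes a continuity correction is intended, so that P(Y < 5) = P(Y ≤ 4) is approximated by P(Z ≤ (4.5 − np) / sqrt(np(1 − p))); without the correction a different value results. The standard normal CDF is computed from math.erf rather than read from the table on page 8, so small rounding differences from a table lookup are possible.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal CDF, P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def approx_prob_fewer_than(k: int, n: int, p: float) -> tuple[float, float]:
    """Normal approximation to P(Y < k) for Y ~ Binomial(n, p).

    Uses a continuity correction (an assumption about the intended method):
    P(Y < k) = P(Y <= k - 1) is approximated by P(Z <= (k - 0.5 - np) / sd).
    Returns the z-score and the approximate probability.
    """
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    z = (k - 0.5 - mean) / sd
    return z, normal_cdf(z)


z, prob = approx_prob_fewer_than(5, 8, 0.3)
print(round(z, 2), round(prob, 4))
```

With z rounded to 1.62 for the table lookup, this approach leads to option (e).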
4. Was it appropriate to use the normal approximation in Question 3? (5 marks)
a. No. One of np and n(1 − p) is less than 5.
b. No. Both np and n(1 − p) are less than 5.
c. Yes. One of np and n(1 − p) is at least 5.
d. Yes. Both np and n(1 − p) are at least 5.
e. Yes. Both np and np(1 − p) are at least 5.
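The rule of thumb referred to in Question 4 can be checked directly. This sketch assumes the usual criterion that both np and n(1 − p) should be at least 5; the function name and the threshold parameter are illustrative.

```python
def normal_approx_conditions(n: int, p: float, threshold: float = 5.0):
    """Return np, n(1 - p), and whether both meet the threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    ok = expected_successes >= threshold and expected_failures >= threshold
    return expected_successes, expected_failures, ok


np_value, nq_value, appropriate = normal_approx_conditions(8, 0.3)
print(round(np_value, 1), round(nq_value, 1), appropriate)  # 2.4 5.6 False
```

Here np = 2.4 falls below 5 while n(1 − p) = 5.6 does not, which corresponds to option (a).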
5. Now, suppose that the true probability of rain in Wellington for the 8 randomly sampled days is unknown. Using the observed number of days with rain (5), produce an Agresti-Coull adjusted 95% confidence interval (to 4dp) for the true probability of rain p. (5 marks)
a. (0.2895, 0.9605)
b. (0.3044, 0.8623)
c. (0.441, 0.7257)
d. (0.4538, 0.7962)
e. None of the above.
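A sketch of the interval in Question 5 follows. It assumes the "add two successes and two failures" form of the Agresti-Coull adjustment with z = 1.96; the variant that adds z²/2 successes and z²/2 failures gives slightly different endpoints. The function name is hypothetical.

```python
from math import sqrt


def agresti_coull_plus_four(y: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Adjusted Wald interval: add 2 successes and 2 failures, then use the Wald formula."""
    n_adj = n + 4
    p_adj = (y + 2) / n_adj
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - margin, p_adj + margin


lower, upper = agresti_coull_plus_four(5, 8)
print(round(lower, 4), round(upper, 4))  # 0.3044 0.8623
```

Under this assumed form the interval matches option (b).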
6. Finally, suppose that you want to estimate the true proportion of days with rain in Wellington, and you plan to present your results using a 90% confidence interval. Find the most conservative minimum sample size required if the interval is to have an approximate margin of error of 0.03. (5 marks)
a. 632
b. 752
c. 897
d. 1068
e. None of the above.
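The most conservative sample size in Question 6 uses p = 0.5, which maximises p(1 − p), so n ≥ (z / (2m))². The sketch below assumes the table value z = 1.645 for 90% confidence; the function name is illustrative.

```python
from math import ceil


def conservative_sample_size(margin: float, z: float) -> int:
    """Smallest n with z * sqrt(0.25 / n) <= margin (worst case p = 0.5)."""
    return ceil((z / (2 * margin)) ** 2)


# z = 1.645 is the assumed critical value for a 90% confidence interval.
print(conservative_sample_size(0.03, 1.645))  # 752
```

The unrounded value is about 751.67, so rounding up gives 752, option (b).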
Use the following information to answer Questions 7 to 8.
For students who graduated from a particular university 5 years ago, the following data show the numbers of students who have not changed their jobs since they graduated for a random selection of 110 bachelor’s degree students and 70 master’s degree students:
Degree        Sample size    Number of students in the same job
Bachelor’s    110            58
Master’s      70             32
7. Let pB denote the proportion of bachelor’s degree students who have not changed their jobs since they graduated and pM denote the proportion of master’s degree students who have not changed their jobs since they graduated. Calculate the test statistic z∗ (to 4dp) for a test of
H0 : pB = pM
H1 : pB ≠ pM.
(5 marks)
a. z∗ = 0.9174
b. z∗ = 0.5326
c. z∗ = −4.9566
d. z∗ = −5.4526
e. None of the above.
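The test statistic in Question 7 can be sketched as below, assuming the usual pooled-proportion standard error under H0 : pB = pM. The function name is hypothetical.

```python
from math import sqrt


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z statistic for H0: p1 = p2 using the pooled estimate of the common proportion."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se


z_star = two_proportion_z(58, 110, 32, 70)
print(round(z_star, 4))  # 0.9174
```

With the pooled proportion 90/180 = 0.5, this gives option (a).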
8. For the hypothesis test in Question 7, calculate the p-value (to 4dp). Use the standard normal probability table on page 8 to calculate the p-value. (5 marks)
a. p-value = 0.0000
b. p-value = 0.1788
c. p-value = 0.2981
d. p-value = 0.3576
e. p-value = 0.5962
Section B: Written Answers (60 Marks)
For Section B questions, you must write your response to the question. Page 7 includes SAS output which may prove useful to answering parts of Question 9.
9. (40 marks)
Consider data published in the 1950s on a case-control study investigating the relationship between smoking and lung cancer. A breakdown of lung cancer by smoker status (where smokers are classified as those smoking at least 1 cigarette per day for a year) and reported sex of the individual is presented in the partial contingency tables below.
                             Smoker status
Sex       Has lung cancer?   Smoker    Non-smoker
Male      Yes                647       2
          No                 622       27
Female    Yes                41        19
          No                 28        32
a. Estimate the conditional associations between incidence of lung cancer and smoker status, conditional on reported sex of the individual, using conditional odds ratios (to 4dp).
b. Assuming that the conditional associations estimated in part (a) are indicative of the true conditional odds ratios, do reported sex of the individual and smoker status interact in their effect on incidence of lung cancer? Explain why or why not.
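The conditional odds ratios in part (a) can be sketched as below, using the cell counts from the partial tables (rows: lung cancer yes/no; columns: smoker/non-smoker). The function name and argument order are illustrative.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio (a * d) / (b * c) for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


# Rows: has lung cancer (Yes, No); columns: (Smoker, Non-smoker).
or_male = odds_ratio(647, 2, 622, 27)
or_female = odds_ratio(41, 19, 28, 32)
print(round(or_male, 4), round(or_female, 4))  # 14.0426 2.4662
```

If these estimates are taken as the true conditional odds ratios, their difference bears directly on part (b): conditional odds ratios that differ across the levels of sex indicate that the smoking effect is not the same for males and females.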
Now consider the marginal table representing the relationship between lung cancer and smoker status, as shown below.
                      Smoker status
Has lung cancer?      Smoker    Non-smoker
Yes                   688       21
No                    650       59
c. Using the marginal table, estimate the association between smoker status and lung cancer using the odds ratio (to 4dp). Interpret the estimated odds ratio, and present a corresponding 95% confidence interval (to 4dp).
d. Using the marginal table, carry out a chi-square test of independence for smoker status and incidence of lung cancer. Relevant SAS output can be found on page 7 and may be used to answer this question (i.e., hand calculations are not required). Be sure to answer the following questions:
i. Is a chi-square test of independence appropriate for the data presented in the marginal table? Why or why not?
ii. What are the hypotheses to be tested?
iii. What are the Pearson and likelihood ratio chi-square test statistics? What are their distributions under the null hypothesis?
iv. What are the p-values corresponding to the Pearson and likelihood ratio chi-square test statistics?
v. What is your conclusion at the α = 0.05 significance level?
e. Again using the marginal table, now carry out Fisher’s exact test to determine if smokers are more likely to have lung cancer than non-smokers. Relevant SAS output can be found on page 7 and may be used to answer this question (i.e., hand calculations are not required). Be sure to answer the following questions:
i. What are the hypotheses to be tested?
ii. What is the p-value for the test? Clearly explain what row in the SAS output provides relevant information for this p-value.
iii. Although it would be possible to calculate a mid-p-value in theory, explain why a mid-p-value would be unnecessary in this case. (Note: You are being asked a conceptual question, not to try to calculate a mid-p-value.)
iv. What conclusion would you make at the α = 0.05 significance level?
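Since the SAS output on page 7 is not reproduced here, the following sketch computes the main quantities for parts (c) and (d) directly from the marginal table. It assumes a Wald interval on the log odds ratio scale with z = 1.96, and uses the fact that for one degree of freedom P(χ² > x) = erfc(sqrt(x / 2)). All function names are hypothetical, and the results should be checked against the SAS output rather than treated as a substitute for it.

```python
from math import erfc, exp, log, sqrt


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio for [[a, b], [c, d]] with a Wald CI built on the log scale."""
    or_hat = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_hat, exp(log(or_hat) - z * se_log), exp(log(or_hat) + z * se_log)


def chi_square_2x2(a, b, c, d):
    """Pearson and likelihood ratio statistics for independence in a 2x2 table."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected count = (row total * column total) / n under independence.
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    lr = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
    return pearson, lr, expected


def chi_square_p_value_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2))


# Rows: has lung cancer (Yes, No); columns: (Smoker, Non-smoker).
or_hat, ci_low, ci_high = odds_ratio_wald_ci(688, 21, 650, 59)
pearson, lr, expected = chi_square_2x2(688, 21, 650, 59)
print(round(or_hat, 4), round(ci_low, 4), round(ci_high, 4))
print(min(expected), round(pearson, 4), round(lr, 4))
print(chi_square_p_value_df1(pearson), chi_square_p_value_df1(lr))
```

The smallest expected count printed on the second line is relevant to part (d)(i), since the chi-square approximation is usually considered adequate when all expected counts are at least 5.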
10. (20 marks)
The Otago region has a rabbit problem, and farmers are interested to know whether or not the distribution of rabbit holes in the region is random. A researcher took a random sample of 30 areas (each 100 m²) and calculated an average of 1.6 rabbit holes per area. The frequency distribution for the number of rabbit holes per 100 m² is given below.
Number of rabbit holes (r)    Frequency (f_r)    P(Y = r)    Expected frequency (f̂_r)
0                             6                  0.2019      f̂_0
1                             6                  0.32303     9.6909
2                             12                 0.25843     7.7529
≥ 3                           6                  P(Y ≥ 3)    6.4992
It is of interest to test the hypotheses
H0 : The population distribution is Poisson.
H1 : The population distribution is not Poisson.
using a chi-square goodness-of-fit test.
a. What are the missing values for the probability (to 5dp) and expected frequency (to 4dp) corresponding to the cells marked P(Y ≥ 3) and f̂_0 in the above table?
b. Calculate the test statistic (to 4dp).
c. Name the probability distribution that the test statistic follows under the null hypothesis. Explain why this distribution has two degrees of freedom.
d. The p-value for the test is 0.1517 (to 4dp). State what you would conclude at the α = 0.05 significance level.
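The goodness-of-fit calculation in Question 10 can be sketched as follows, assuming a Poisson distribution with mean estimated by the sample average 1.6 and the four categories 0, 1, 2 and ≥ 3. The degrees of freedom are taken as (number of categories − 1 − number of estimated parameters) = 4 − 1 − 1 = 2, and for two degrees of freedom the chi-square upper-tail probability is exactly exp(−x / 2). Variable names are illustrative. Because the code uses unrounded expected frequencies, the statistic may differ in the fourth decimal place from a hand calculation based on the rounded table values.

```python
from math import exp, factorial


def poisson_pmf(r: int, mu: float) -> float:
    """P(Y = r) for Y ~ Poisson(mu)."""
    return exp(-mu) * mu**r / factorial(r)


mu_hat = 1.6                 # sample mean, used as the estimate of the Poisson mean
n_areas = 30
observed = [6, 6, 12, 6]     # counts for r = 0, 1, 2 and r >= 3

# Probabilities for r = 0, 1, 2, then the remainder for r >= 3.
probs = [poisson_pmf(r, mu_hat) for r in range(3)]
probs.append(1 - sum(probs))
expected = [n_areas * p for p in probs]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1   # one parameter (the mean) was estimated
p_value = exp(-statistic / 2)  # chi-square upper tail; valid for df = 2 only

print(round(probs[3], 5), round(expected[0], 4))  # 0.21664 6.0569
print(round(statistic, 4), df, round(p_value, 4))
```

The resulting p-value agrees with the 0.1517 quoted in part (d), which exceeds α = 0.05.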
Standard Normal Probabilities P(0 ≤ Z ≤ z) for Z ∼ N(0,1)