Introduction
Statistical skills sample
A note on time management
Task 1: Setting up libraries and seed value
Task 2a: Return to Statdew Valley: exploring sources of variance in a balanced experimental design (teaching and learning world)
Task 2b: Applying linear mixed models for the strawberry data (practical world)
Task 3a: Building a confidence interval interpreter
Task 3b: Building a p value interpreter
Task 3c: User instructions and disclaimer
Task 4: Creating a reproducible example (reprex)
Task 5: Simulating p values
Writing sample
Reflection
Rubric
Information Note
Name: Mini-portfolio
Type (Main, Mini or Basket): Mini
Value: 20% (Path A), 25% (Path B)
Submission instructions: Via MarkUs
Portfolio assessments aim to help you demonstrate your technical coding, statistical thinking, communication and reflection skills. This mini-portfolio also aims to recap and refresh knowledge from your previous statistics courses, as well as to build your ability to create quality data visualizations.
General instructions
• Be very careful to follow instructions on variable naming. If you do not, your code won’t pass autograding and you will not receive the associated grades. Such cases will not be eligible for regrading requests.
• Comment your code! In an R code chunk comments start with a # (pound sign or hashtag symbol). Don’t confuse this with the use of # to denote different levels of headings in the text parts (governed by Markdown syntax) of an R Markdown document.
• All figures and tables should have appropriate (and appropriately positioned) captions.
• Include your code in the body of the PDF itself (don’t set echo=FALSE, don’t hide/suppress etc.). Note that this is different to what you will be asked to do in the final project or in professional reporting. This is a demonstration of your skills.
• If asked to write and answer in words, write in full sentences.
• KNIT EARLY AND OFTEN! Don’t leave things till the last minute; your Rmd not knitting is not an emergency for which an extension will be granted.
Template
You can access the template for this assessment here.
There is currently a lot of ‘filler text’ and ‘filler code’ in the template that you will want to delete. Fun fact: All filler text sourced from Hipster Ipsum, which Katy Wang in the UG Stats office introduced me to.
Submission instructions
• Submit both your Rmd (must be called: sta303-w22-portfolio.Rmd) and PDF (must be called: sta303w22-portfolio.pdf) on MarkUs.
• You do not need to submit any data or tex files.
• The PDF must be knit directly from the Rmd (not from Word or HTML first).
Cover page
You don’t have to use the provided template, but you DO need to write your portfolio in RMarkdown and include a standalone cover page. The cover page must have:
• A title and subtitle (you can use my examples in the template or update them as you see fit; no points for changing them, but it might make this a better part of your own website or GitHub portfolio in the future)
• Your name
It should not have your table of contents or any analyses on it.
In the template, you can change the colour of this cover to any colour you would like by replacing 6C3082 in the YAML (line 13) with another hex code. You could use this tool to help you: https://htmlcolorcodes.com/color-picker/
Introduction
Write this section last or second to last (before the reflection).
Statistical skills sample
A note on time management
Tasks 3 and 5 are essentially revision tasks for previous courses as I saw that there were still quite a few misunderstandings about these core concepts. If you find yourself unable to completely finish this portfolio, focusing your attention on the new material in Task 2 is a good strategy.
Task 1: Setting up libraries and seed value
• Set up a chunk called setup_portfolio where you load the tidyverse library.
• Set your chunk options to message=FALSE so all the package loading information isn’t included in your output. You will need to make sure you run this chunk each time you start a new session so you can use many of the functions required.
• Create an object called last3digplus that takes the value 100 + the last three digits of your student ID number.
• Delete any placeholder library chunks in the template (if using).
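For reference, here is a minimal sketch of this setup, assuming a student ID ending in 123 (substitute your own digits):

# setup_portfolio chunk (with message=FALSE set in the chunk options)
library(tidyverse)

# 100 + the last three digits of your student ID (123 is a placeholder)
last3digplus <- 100 + 123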
Task 2a: Return to Statdew Valley: exploring sources of variance in a balanced experimental design (teaching and learning world)
In a class interactive, we explored some tomato and honey data in a statistical farming simulator.
Suppose you also have some data about strawberry yields (in kg) across 18 different strawberry patches on your farm. There had been a problem with birds eating more than their fair share of your grandmother’s strawberries (there wouldn’t be enough for her famous jam!), so she had tried out three ‘treatment’ types: a scarecrow, covering the strawberries with netting, and doing nothing (as a control).
Each patch was harvested 6 times, twice under each treatment. Your grandmother says weather and soil conditions were remarkably consistent across the entire time.
Growing your (grandmother’s) strawberry patch
# Don't edit this file
# Sourcing it makes a function available
# You CAN edit these comments

# Load grow_my_strawberries() function
source("grow_my_strawberries.R")
• Run the function grow_my_strawberries() (the chunk to source the function must be run first for this to work) with seed = last3digplus as the input, saving the output as the object my_patch.
• Alter the my_patch data so that treatment is a factor variable with the levels ordered as follows: “No netting”, “Netting”, “Scarecrow”.
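A minimal sketch of these two steps, assuming (as described above) that the sourced function returns a dataset with a treatment column:

# Grow the patch from your personalised seed
my_patch <- grow_my_strawberries(seed = last3digplus)

# Reorder the treatment factor levels
my_patch <- my_patch %>%
  mutate(treatment = fct_relevel(treatment, "No netting", "Netting", "Scarecrow"))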
Plotting the strawberry patch
Create a plot, using ggplot2, where the x-axis shows each patch (in alphabetical order) and the y-axis represents yield.
• Just for fun, make each point an upside-down triangle (strawberry-shaped).
– Within geom_point(), set pch = 25.
• Colour the points with the following colours:
– No netting (#78BC61), Netting (#E03400), Scarecrow (#520048).
– You will need to point both fill and color to the appropriate variable in your aesthetic mapping and use scale_fill_manual and scale_color_manual to set the colours.
• Apply theme_minimal() to the plot
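A sketch of such a plot, assuming the data has columns patch, treatment and yield:

my_patch %>%
  ggplot(aes(x = patch, y = yield, fill = treatment, color = treatment)) +
  geom_point(pch = 25) +  # upside-down triangles
  scale_fill_manual(values = c("No netting" = "#78BC61",
                               "Netting" = "#E03400",
                               "Scarecrow" = "#520048")) +
  scale_color_manual(values = c("No netting" = "#78BC61",
                                "Netting" = "#E03400",
                                "Scarecrow" = "#520048")) +
  theme_minimal()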
Model formula
Write out the model formulation for the model that fits a fixed effect for treatment, a random effect for patch, and a random effect for the interaction of treatment and patch.
• Make sure you also EXPLAIN each term. All random effects should have a statement of their distribution.
• While there are multiple ways to write out this model, I strongly suggest you mimic the structure written out for the honey model in the Statdew Valley interactive.
• Create a table with the following columns: Source of variation (the name for each source of variation), Variance (the values you’ve calculated: var_patch, var_int, var_ab), and Proportion (the proportion of variance in yield, not explained by treatment, that is explained by this source; round to 2 decimal places). Use kable to display the table nicely and add an appropriate caption.
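As a hedged sketch only (your notation should mimic the honey model from the Statdew Valley interactive), one common way to write a model with a fixed treatment effect, a random patch effect and a random treatment-by-patch interaction is:

$$y_{ijk} = \mu + \tau_i + b_j + (\tau b)_{ij} + \varepsilon_{ijk}$$

where $y_{ijk}$ is the yield from the $k$th harvest of patch $j$ under treatment $i$, $\mu$ is the grand mean yield, $\tau_i$ is the fixed effect of treatment $i$, $b_j \sim N(0, \sigma^2_b)$ is the random effect of patch $j$, $(\tau b)_{ij} \sim N(0, \sigma^2_{\tau b})$ is the random treatment-by-patch interaction, and $\varepsilon_{ijk} \sim N(0, \sigma^2)$ is the residual error.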
Objects to create (proof-read carefully for autograding)
Tibbles
• my_patch, your ‘patch’ of strawberries, grown based on using your last3digplus as the seed (created in the first part of the task).
• agg_patch, starting with my_patch, group by patch and then summarize to create a new variable, yield_avg_patch, the average strawberry yield for each patch.
• agg_int, starting with my_patch, group by both patch and treatment and then summarize to create a new variable, yield_avg_int that is the average strawberry yield for each patch and treatment combination.
Models
• int_mod, an interaction model including main effects, where yield is the response and patch and treatment are the predictors.
• patch_mod, an intercept only model, where yield_avg_patch is the response. Use appropriate data aggregated to find the average yield per patch.
• agg_mod, a main effects model where yield_avg_int is the response and patch and treatment are the predictors. Use appropriate data aggregated across both patch and treatment.
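A sketch of how these tibbles and models might be set up, assuming columns named patch, treatment and yield (treat this as a starting point, not a definitive answer):

# Aggregate to patch level and to patch-by-treatment level
agg_patch <- my_patch %>%
  group_by(patch) %>%
  summarise(yield_avg_patch = mean(yield), .groups = "drop")

agg_int <- my_patch %>%
  group_by(patch, treatment) %>%
  summarise(yield_avg_int = mean(yield), .groups = "drop")

# Linear models on the raw and aggregated data
int_mod <- lm(yield ~ patch * treatment, data = my_patch)
patch_mod <- lm(yield_avg_patch ~ 1, data = agg_patch)
agg_mod <- lm(yield_avg_int ~ patch + treatment, data = agg_int)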
Numeric values
Note: These should not be hardcoded (e.g., not var_patch = 3); they should make use of your models.
• var_patch, variance in average yield patch-to-patch
• var_int, the residual variance after fitting the version of this linear model with the most parameters
• var_ab, variance in yield explained by the interaction between patch and treatment, after accounting for the fixed effects and other sources.
# Example tibble
tibble(`Source of variation` = c("Source 1", "Source 2", "Source 4"),
       Variance = c("var_b", "var_ab", "var_int"),
       Proportion = c(round(0.3333333, 2),
                      round(0.3333333, 2),
                      round(0.3333333, 2))) %>%
  # It is okay if that caption goes off the page when knitting
  knitr::kable(caption = "An example table, where you can replace the appropriate values, including remo")
Source of variation Variance Proportion
Source 1 var_b 0.33
Source 2 var_ab 0.33
Source 4 var_int 0.33
Task 2b: Applying linear mixed models for the strawberry data (practical world)
• Fit 3 models (yield will be the response in all of them)
– mod0, a linear model with only treatment,
– mod1, a linear mixed model with treatment and patch (appropriate choices about what is a fixed vs random effect should be made)
– mod2, a linear mixed model with treatment, patch and the interaction of treatment and patch (appropriate choices about what is a fixed vs random effect should be made)
• Set up appropriate comparisons between these models, using a likelihood ratio test (lmtest::lrtest)
– Writing: make an explicit comment on whether/where you are using REML / ML and WHY.
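A sketch of what this fitting and comparison could look like, assuming the lme4 package for the mixed models (the REML/ML decisions are deliberately left for you to make and justify):

library(lme4)

mod0 <- lm(yield ~ treatment, data = my_patch)
# Whether to fit these with REML or ML is part of what you must decide and explain
mod1 <- lmer(yield ~ treatment + (1 | patch), data = my_patch)
mod2 <- lmer(yield ~ treatment + (1 | patch) + (1 | patch:treatment),
             data = my_patch)

lmtest::lrtest(mod0, mod1)
lmtest::lrtest(mod1, mod2)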
Justification and interpretation
• Choose which model, out of mod0, mod1 and mod2, is the most appropriate final model (make it clear which one you are choosing and justify why).
– Interpret the fixed effect coefficients.
– Discuss the proportion of variance, not explained by the fixed effects, that is explained by each source of variance in the model.
– Hint: If you have done both Task 2a and this task correctly, these will match your variances from one of these models.
This should be at MAX 1-2 paragraphs.
Task 3a: Building a confidence interval interpreter
Write a function that helps you interpret a frequentist confidence interval with the appropriate language. Remember that we need to be careful about what our confidence level means and what value the interval is for.
Note: For this and the next task, you are largely replacing and reworking what I have set up. I am definitely not claiming these are excellently specified functions; there are lots of other improvements we might try to make if we were really adding a useful function to a package. For now, I just want you to get a bit of a sense of where the errors and warnings you see when using other functions come from. The main focus should be on practising a sensible template for how to interpret these values. Honestly, this is not a place where you need to be creative in your writing. It will speed up your writing and communication if you have a mental script for each time you have to interpret a p value or confidence interval.
Specifications:
• The function should be called interpret_ci.
• It should take 4 arguments:
– lower, the lower bound of the confidence interval (numeric)
– upper, the upper bound of the confidence interval (numeric)
– ci_level, the confidence level this interval was calculated at, e.g. 99 or 95 (numeric)
– stat, a description of the statistic of interest
• The output must be a full sentence that uses all 4 of the inputs to the function. It must be CLEAR what the confidence interval is making a claim about and the interpretation must be statistically appropriate.
• Warnings must be produced in the following cases:
– If lower or upper is not numeric.
– If ci_level is not a numeric value between 0 and 100.
– If stat isn’t a character string.
• You must run (and include) the following code after setting up your function:
– interpret_ci(10, 20, 99, "mean number of shoes owned by students")
– interpret_ci(10, 20, -1, "mean number of shoes owned by students")
– interpret_ci(10, 20, -1, tibble(stat = 3))
– The results of the above must be printed out in the main text of the PDF (see examples in the template)
interpret_ci <- function(lower, upper, ci_level, stat){
  if(!is.character(stat)) {
    warning("stat should be a character string that describes the statistic of interest. The first part of this would be appropriate error text for this issue. You can reuse it as is in your code (not a plagiarism concern), just make sure you delete this part.")
  } else if(!is.numeric(lower)) {
    warning("This error message should be improved.")
  } else if(!is.numeric(upper)) {
    warning("This error message should be improved.")
  } else {
    # print interpretation
    # this is the main skill I want to see, writing a good CI interpretation
    str_c("This is the confidence level: ", ci_level,
          "%. And the text fed to stat is ", stat,
          ". There is also the lower and upper bounds: ", lower, " and ", upper,
          ". This part needs to be fixed!")
  }
}
# Test 1
ci_test1 <- interpret_ci(10, 20, 99, "mean number of shoes owned by students")
# Test 2
ci_test2 <- interpret_ci(10, 20, -1, "mean number of shoes owned by students")
# Test 3
ci_test3 <- interpret_ci(10, 20, -1, tibble(stat = 3))
CI function test 1: This is the confidence level: 99%. And the text fed to stat is mean number of shoes owned by students. There is also the lower and upper bounds: 10 and 20. This part needs to be fixed!
Task 3b: Building a p value interpreter
Write a function that will interpret p values based on strength of evidence. Use this table to help you. You can decide how to set up the ‘greater than’ vs ‘strictly greater than’ cases, etc.; just be consistent (and make sure you understand why I’m a little fuzzy about the borders anyway).
Here is an EXAMPLE. This is not the answer. I DO NOT want to see this submitted as an answer. This would be an example of a p value interpreter that uses a 0.05 threshold to reject or fail to reject the null hypothesis.

interpret_pval <- function(pval, nullhyp){
  if(!is.numeric(pval)){
    warning("This is not an appropriate error message, FYI. Make yours more professional.")
  } else if(pval > 1) {
    warning("No way. This is also not an appropriately professional error message.")
  } else if(pval < 0) {
    warning("This is also not an appropriately professional error message.")
  } else if(pval > 0.05){
    str_c("The p value is ", round(pval, 5),
          ", we fail to reject the hypothesis that ", nullhyp)
  } else if(pval <= 0.05){
    str_c("The p value is ", round(pval, 5),
          ", we reject the hypothesis that ", nullhyp, ".")
  }
}
pval_test1 <- interpret_pval(0.0000000003,
                             "the mean grade for statistics students is the same as for non-stats students")
pval_test2 <- interpret_pval(0.0499999,
                             "the mean grade for statistics students is the same as for non-stats students")
pval_test3 <- interpret_pval(0.050001,
                             "the mean grade for statistics students is the same as for non-stats students")
pval_test4 <- interpret_pval("0.05", 7)
p value function test 1: The p value is 0, we reject the hypothesis that the mean grade for statistics students is the same as for non-stats students.
p value function test 2: The p value is 0.05, we reject the hypothesis that the mean grade for statistics students is the same as for non-stats students.
p value function test 3: The p value is 0.05, we fail to reject the hypothesis that the mean grade for statistics students is the same as for non-stats students
• Specifications for your function:
– Your function must be called interpret_pval.
– It must take two arguments: pval and nullhyp.
– The interpretation messages must be statistically appropriate and should not talk about rejecting or failing to reject the null. Instead the strength of evidence should be discussed appropriately.
– The output should be:
∗ a full sentence,
∗ that includes the p value, rounded to 3 decimal places or replaced with “<.001” if less than 0.001 (based on APA guidelines for reporting statistics), and
∗ incorporates the text of the null hypothesis.
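As a hint for the APA-style rounding only, here is a sketch (format_pval is a hypothetical helper name, not part of the required function, and this is not the full interpreter):

# Sketch of APA-style p value formatting only
format_pval <- function(pval) {
  if (pval < 0.001) "<.001" else as.character(round(pval, 3))
}

format_pval(0.0004)  # "<.001"
format_pval(0.0422)  # "0.042"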
Task 3c: User instructions and disclaimer
Neither of these sections should be very long (1 to 3 paragraphs maximum). Please write in full sentences.
Instructions
Write brief instructions for how to use your two interpreters to someone who is new to statistics, but has been told they need to apply some basic statistical methods in their summer research project.
This should touch on what a null hypothesis is and what a population parameter is, and should give some examples and tips on wording a null hypothesis appropriately. It should also mention some common pitfalls in interpreting frequentist confidence intervals.
Disclaimer
Write a brief disclaimer for your two interpreters, flagging any limitations or caveats a user should keep in mind before relying on them.
Task 4: Creating a reproducible example (reprex)
In Module 1, we discussed creating reproducible examples in R.
Suppose you had written the below code. You want to summarize the dataset my_data into a new dataset called my_summary and find the mean value for each group, and just have one row per group, but for some reason it is not working.
Note: DO NOT FIX THIS CODE. That is not the point of this activity.
Goal: Create a reproducible example that would allow someone ELSE to understand the code and situation.
In your submission there should be two things:
1. A brief description IN YOUR OWN WORDS of what a reprex is and what you needed to consider to produce the reprex.
2. The reprex output that would allow someone to completely reproduce your problem exactly. Think carefully about what you would need to provide. I.e., what you might post on a help-board as part of a request for help. You do NOT need to write up the rest of the description.
Hint: The workflow will be something like: (1) create the code chunk for the reprex, (2) copy the code chunk to the clipboard (Ctrl+C or Cmd+C), (3) run reprex::reprex() in the console, (4) paste the output into your template (it won’t be a runnable code chunk; you should be able to paste it as is and not edit it at all). Delete the original chunk (i.e., there should be no code run in this section of your report when you knit it).
my_data <- tibble(group = rep(1:10, each = 10),
                  value = c(16, 18, 19, 15, 15, 23, 16, 8, 18, 18,
                            16, 17, 17, 16, 37, 23, 22, 13, 8, 35,
                            20, 19, 21, 18, 18, 18, 17, 14, 18, 22,
                            15, 27, 20, 15, 12, 18, 15, 24, 18, 21,
                            28, 22, 15, 18, 21, 18, 24, 21, 12, 20,
                            15, 21, 33, 15, 15, 22, 23, 27, 20, 23,
                            14, 20, 21, 19, 20, 18, 16, 8, 7, 23,
                            24, 30, 19, 21, 25, 15, 22, 12, 18, 18,
                            24, 23, 32, 22, 11, 24, 11, 23, 22, 26,
                            5, 16, 23, 26, 20, 25, 34, 27, 22, 28))

my_summary <- my_data %>%
  summarize(group_by = group, mean_val = mean(value))

glimpse(my_summary)
## Rows: 100
## Columns: 2
## $ group_by <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ mean_val <dbl> 19.67, 19.67, 19.67, 19.67, 19.67, 19.67, 19.67, 19.67, 19.67~
Task 5: Simulating p values
Setting up simulated data
This revision task is meant to connect to Question 16 of your pre-knowledge quiz and help you visualize the behaviour of p values.
• Set your seed to last3digplus (that you set up in Task 1)
• Create 3 simulated data sets (sim1 to sim3) of length 100000 and with two columns: group and val.
– Use tibble to create each dataset.
– For each dataset, set group = rep(1:1000, each = 100)
– For sim1 set val to be 100000 values drawn from N(0, 1) (see the function rnorm()).
– For sim2 set val to be 100000 values drawn from N(0.2, 1).
– For sim3 set val to be 100000 values drawn from N(1,1).
• Stack all 3 datasets into one new dataset called all_sim, with a new column at the beginning called sim that takes the value 1 if the observation is from sim1, 2 if from sim2, and so on.
– This is most of the code you will need: bind_rows(sim1, sim2, sim3, .id = "sim")
• Make any alterations necessary to all_sim so that you can join on the dataset sim_description to provide better labels for each simulation (but without including any irrelevant labels).
– You will need to include the code below to create this dataset.
• Plot histograms for the first three groups for each simulated dataset in one plot. Your final plot should use ggplot and facet_wrap. There should be three rows and three columns. The facets should be labelled with desc (the more informative description than just 1, 2, 3).
– Provide an appropriate figure caption, set with your chunk options.
– Apply theme_minimal().
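A sketch of the simulation setup described above; it assumes the sim_description tibble from the chunk below has already been created so the join can be made:

set.seed(last3digplus)

sim1 <- tibble(group = rep(1:1000, each = 100),
               val = rnorm(100000, mean = 0, sd = 1))
sim2 <- tibble(group = rep(1:1000, each = 100),
               val = rnorm(100000, mean = 0.2, sd = 1))
sim3 <- tibble(group = rep(1:1000, each = 100),
               val = rnorm(100000, mean = 1, sd = 1))

all_sim <- bind_rows(sim1, sim2, sim3, .id = "sim") %>%
  mutate(sim = as.integer(sim)) %>%  # .id creates a character column
  left_join(sim_description, by = "sim")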
# Create sim_description
# Dataset to merge with improved simulation names
sim_description <- tibble(sim = 1:4,
                          desc = c("N(0, 1)",
                                   "N(0.2, 1)",
                                   "N(1, 1)",
                                   "Pois(5)"))
all_sim %>%
  filter(group <= 3) %>%
  ggplot(aes(x = val)) +
  geom_histogram(bins = 40) +
  facet_wrap(desc ~ group, nrow = 3) +
  theme_minimal()
[Figure: a 3 × 3 grid of histograms of val, one column per group (1, 2, 3) and one row per simulation (N(0, 1), N(0.2, 1), N(1, 1)).]
Figure 2: Target first visualisation for task 5
Calculating p values
• Create a new dataset called pvals that starts with all_sim, groups by both desc AND group, and then summarizes to find the p value for each group, based on a one-sample, 2-sided t.test, where the null hypothesis is that the population mean is 0.
• t.test(val, mu = 0)$p.value will pull out the p value as a value for you (when included within the appropriate code).
• Plot histograms of the p values, faceted by desc. I would recommend adding scales = "free_y" to your facet_wrap() call as well (not assessed: can you figure out what it does?)
– Use these statements appropriately in your code to make the figure:
∗ geom_histogram(boundary = 0, binwidth = 0.05, fill = "grey", color = "black")
∗ xlim(0, 1)
– Provide an appropriate figure caption, set with your chunk options.
• Apply theme_minimal().
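A sketch of this calculation and plot, under the same assumptions as above:

pvals <- all_sim %>%
  group_by(desc, group) %>%
  summarise(pval = t.test(val, mu = 0)$p.value, .groups = "drop")

pvals %>%
  ggplot(aes(x = pval)) +
  geom_histogram(boundary = 0, binwidth = 0.05, fill = "grey", color = "black") +
  xlim(0, 1) +
  facet_wrap(~desc, scales = "free_y") +
  theme_minimal()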
[Figure: histograms of pval on the interval (0, 1), facetted by desc: N(0, 1), N(0.2, 1) and N(1, 1).]
Figure 3: Target second visualisation for task 5
QQ plots
• Create one final plot that creates a 3x1 figure with QQ plots for each simulation. You must determine what the distribution of a p value should be, WHEN THE NULL HYPOTHESIS IS TRUE, and compare it to the distributions of p values we have generated.
I have provided a template for the code; you just need to decide which quantile function to put where the XXXXXXX is now (e.g., qpois is the quantile function for the Poisson distribution). Also adjust the caption with your authorship and add an appropriate figure caption.
pvals %>%
  ggplot(aes(sample = pval)) +
  geom_qq(distribution = XXXXXXX) +
  geom_abline(intercept = 0, slope = 1) +
  facet_wrap(~desc) +
  theme_minimal()
Conclusion and summary
Remark briefly on:
• how what you have done in this task connects to the definition of a p value, and
• specifically how it connects to answering the below question from the pre-knowledge check.
(Note: so that you didn’t have to wait too long for your PDF to knit, I made the number of simulations and the simulation size a little smaller, but the idea is the same.)
Figure 4: Target final visualisation for task 5
Writing sample
Prompt
Read Motulsky (2014), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4203998/pdf/210_2014_Article_1037.pdf, and paraphrase it with the goal of giving advice to your future self. I.e., YOU in the future is your audience, but keep the tone professional enough that you could share it as a LinkedIn post, or provide a copy to an acquaintance as advice. This shouldn’t be a summary of the full article. Instead, focus on what surprised you, what you didn’t know before, what you don’t want to forget. You can also disagree with the claims in the article if you see fit.
As your task is to explicitly comment on this article, you must reference the article! The in-text APA style reference for this article is: (Motulsky, 2014) with the below the full reference to include at the end:
Motulsky, H. J. (2014). Common misconceptions about data analysis and statistics. Naunyn-Schmiedeberg’s Archives of Pharmacology, 387(11), 1017–1023. https://doi.org/10.1007/s00210-014-1037-6
You might also incorporate it as “Motulsky (2014) states that...”. You don’t need the reference at the end of every sentence, but it should be clear to a reader what ideas have come from the article and what has come from you, throughout the piece.
You can use another referencing style if you wish, or you can just copy this (that is not an academic offence!).
Please be very mindful of proper paraphrasing! The main point of this activity is to put things in YOUR OWN WORDS. Read the provided advice in Module 2.
There should not be too many direct quotes from the article, and it is better if there are none at all. IMPORTANT NOTE: A zero grade on paraphrasing will result in a zero on the ENTIRE writing component of your portfolio.
Make sure your writing sample has an introduction and comes to a conclusion.
Word count: 300–500 words (not including reference at the end). Please add a statement of your word count at the end of the passage.
Reflection
Briefly, 100 to 200 words each, answer the following questions:
• What is something specific I’m proud of in this portfolio?
• How might I apply what I’ve learned and demonstrated in this portfolio in future work and study, after STA303/1002?
• What is something I’d do differently next time?
Rubric
There is a breakdown of the overall contribution of each component to your grade below, followed by a more detailed rubric.
Component Percentage for component
Cover page 1%
Introduction 5%
Statistical skills sample 60%
Commenting of code 1%
Readability of R code 1%
Reproducibility 2%
Writing sample 20%
Reflection 5%
General (multiple components) 5%
Detailed rubric TBA.