Given two random variables (R.V.) measured on the same subjects, it can be of interest to determine whether there is a linear relationship between them. The linear relationship can be formally expressed as Y = aX + b, where a and b are real numbers and X and Y are the R.V. The information we obtain about the R.V. will be empirical. Further, because we are working with R.V., we would be surprised if our data fit a linear relationship perfectly. You can make an initial conjecture about which of the two variables is the independent variable. (If in the end it turns out you are wrong, you can start over with the variables swapped.) The first approach to studying a possible relationship between the variables is to obtain a visual representation.
Scatter Plot
Use the horizontal axis for the variable you decided is the independent variable. This is conventionally labeled x. The vertical axis will be the dependent variable and by convention will be labeled y. To make the scatter plot you will need pairs of sample values (x, y) from the subjects under study. You then plot these ordered pairs on the scatter plot. If there is a general trend upward, this may indicate what is termed positive correlation. If there is a general trend downward, this may indicate what is termed negative correlation. If no trend is indicated, there may not be a correlation of the linear type.
As always there is a desire to quantify or give a figure of merit to our perception. We will use some of the concepts from probability we have developed already. Remember the covariance between two variables?
$$\operatorname{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - \mu_X \mu_Y$$
It can be used to obtain a numerical value that conveys the strength and direction of a linear relationship between the variables.
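As a quick numerical check that the two forms of the covariance agree, the following sketch computes it both ways. The five data pairs are made up purely for illustration, the helper name `covariance_two_ways` is hypothetical, and the data set is treated as the whole population so that each expectation is a plain average.

```python
def covariance_two_ways(xs, ys):
    """Return the covariance computed from the definition and from the shortcut form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Definition: average product of the deviations from the means
    definition = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Shortcut: average of the products minus the product of the means
    shortcut = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    return definition, shortcut


# Illustrative (made-up) data pairs, not taken from the exercise
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

definition, shortcut = covariance_two_ways(xs, ys)
print(definition, shortcut)
```

The two results should agree up to floating-point rounding. A positive value here goes with the upward trend of the illustrative points.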
Correlation Coefficient
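Let X and Y be any two R.V. The correlation coefficient of X and Y is denoted ρ (or ρ(X, Y)) and is given by
$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.
Further,
|ρ| ≤ 1, and |ρ| = 1 if and only if Y = aX + b for some real numbers a ≠ 0 and b.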
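To see why an exact linear relationship forces |ρ| = 1, substitute Y = aX + b (assuming a ≠ 0) into the covariance-over-standard-deviations form of ρ. The short derivation below relies on two standard facts: Cov(X, aX + b) = a Var(X), and the standard deviation of aX + b is |a| times that of X.

```latex
\rho(X, aX + b)
  = \frac{\operatorname{Cov}(X, aX + b)}{\sigma_X \, \sigma_{aX+b}}
  = \frac{a \, \sigma_X^2}{\sigma_X \cdot |a| \, \sigma_X}
  = \frac{a}{|a|}
  = \pm 1
```

The sign of ρ matches the sign of the slope a.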
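The correlation coefficient ρ is a parameter, and in practice we will not know it. The statistic related to it is r, which can be determined from n sample pairs (x, y) by the formula below. As before, −1 ≤ r ≤ 1.
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$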
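A minimal sketch of how r might be computed from the sample pairs, using the common sum-based form of the sample correlation coefficient. The function name `sample_correlation` and the five data pairs are assumptions made for illustration only.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation coefficient r computed from the five basic sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Illustrative (made-up) data pairs, not taken from the exercise
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = sample_correlation(xs, ys)
print(round(r, 4))
```

Keeping the five sums as separate named quantities mirrors the columns one would build by hand or in a spreadsheet, which makes the intermediate values easy to check.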
A value for r is determined by entering the sample values into the formula for r. The heuristic employed is that if r is between 0.8 and 1, then a positive correlation may exist, and if r is between −1 and −0.8, then a negative correlation may exist. To make the interpretation more accurate, a process of quantifying the decision is employed.
Hypothesis Test
We will use the traditional method for the hypothesis test. Further, if the samples are small and the two random variables are normally or approximately normally distributed, then we can use the t-distribution.
The usual form of the statement of the hypothesis is (though it can be one sided):
H₀: ρ = 0        H₁: ρ ≠ 0
The critical value (C.V.) is determined using the t-table with degrees of freedom equal to n − 2, where n is the sample size (the number of data pairs). The usual level of significance is 5%. The derivation of the test value (T.V.) is tangential to our present interest; hence, the formula is given below.
$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$
If the T.V. falls in either of the two rejection regions determined by the critical values, then the decision is to reject the null hypothesis, and a correlation between the two variables is taken as the stepping-off point for further studies of these two variables.
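The decision rule can be sketched as follows, assuming the usual test value for a correlation coefficient, t = r·sqrt((n − 2)/(1 − r²)), with n − 2 degrees of freedom. The function name `correlation_t_test` is hypothetical, the critical values are two-tailed 5% entries copied from a standard t-table for a few small degrees of freedom, and the values r = 0.7746 and n = 5 are illustrative only.

```python
import math

# Two-tailed critical values at the 5% level from a standard t-table,
# keyed by degrees of freedom (only a few small entries are listed here)
T_CRITICAL_5_PERCENT = {
    3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
    7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
}


def correlation_t_test(r, n):
    """Return (test value, critical value, reject H0?) for H0: rho = 0."""
    df = n - 2
    test_value = r * math.sqrt(df / (1 - r ** 2))
    critical_value = T_CRITICAL_5_PERCENT[df]
    # Two-tailed test: reject when the T.V. lies beyond either critical value
    reject_null = abs(test_value) > critical_value
    return test_value, critical_value, reject_null


# Illustrative (hypothetical) summary values: r = 0.7746 from n = 5 pairs
test_value, critical_value, reject_null = correlation_t_test(0.7746, 5)
print(round(test_value, 3), critical_value, reject_null)
```

With so few pairs, even a fairly large r may fail to reach the rejection region, which is one reason the 0.8 heuristic alone is not relied upon.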
Correlation
There are five possible characterizations of a correlation.
1.) Cause and effect
2.) Reverse cause and effect: the roles of the variables are the reverse of what was first assumed
3.) A third or lurking variable is involved
4.) There is a spectrum of variables with a complexity of interrelationships
5.) The relationship, though it appears to exist, is coincidental
If the researcher wishes to pursue the line of argument that there is a linear relationship, then a straight-line model can be made.
Least Squares Fit
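The formulas for the constants in the line y′ = a + bx are
$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2}, \qquad b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$
The name for this line is the regression line. The regression line can, potentially, be used to predict future values of the dependent variable.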
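A minimal sketch of the least squares fit, assuming the common sum-based forms of the intercept a and slope b for the line y′ = a + bx. The function names `least_squares_line` and `predict` and the five data pairs are hypothetical and chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b), the intercept and slope of the regression line y' = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b


def predict(a, b, x):
    """Value of the dependent variable predicted by the regression line at x."""
    return a + b * x


# Illustrative (made-up) data pairs, not taken from the exercise
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
print(a, b, predict(a, b, 6))
```

Predictions are most trustworthy for x values inside or near the range of the sampled data; the further x lies outside that range, the less the line can be relied upon.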
The preceding has been a brief and limited discussion of correlation and regression.
Exercise
For the data provided below, draw a scatter plot, compute r, perform a hypothesis test at the 5% level of significance, determine the type of correlation, obtain the regression line, and use it to predict several new values of the dependent variable.
The table below lists the number of grams of carbohydrates and the number of kilocalories in 100-gram samples of various raw foods.
carbohydrates (g)    15.25    16.55    11.10    13.01    14.13    15.11
kilocalories         59       72       43       55       56       59
You may want to consider using Excel for this exercise.
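As an alternative way to cross-check a hand or spreadsheet computation, the sketch below runs the numerical steps of the exercise in Python. It assumes carbohydrates are the independent variable, uses the sum-based forms of r, the test value, and the regression constants, and takes 2.776 as the two-tailed 5% critical value for 4 degrees of freedom from a standard t-table. The three carbohydrate values used for prediction are hypothetical choices within the range of the data.

```python
import math

# Data pairs from the exercise table: carbohydrates (g) and kilocalories
carbs = [15.25, 16.55, 11.10, 13.01, 14.13, 15.11]
kcal = [59, 72, 43, 55, 56, 59]

# The five sums that the formulas require
n = len(carbs)
sum_x = sum(carbs)
sum_y = sum(kcal)
sum_xy = sum(x * y for x, y in zip(carbs, kcal))
sum_x2 = sum(x * x for x in carbs)
sum_y2 = sum(y * y for y in kcal)

# Building blocks shared by r and the regression constants
sxy = n * sum_xy - sum_x * sum_y
sxx = n * sum_x2 - sum_x ** 2
syy = n * sum_y2 - sum_y ** 2

r = sxy / math.sqrt(sxx * syy)                   # sample correlation coefficient
t = r * math.sqrt((n - 2) / (1 - r ** 2))        # test value with n - 2 d.f.
b = sxy / sxx                                    # slope of the regression line
a = (sum_y * sum_x2 - sum_x * sum_xy) / sxx      # intercept of the regression line

CRITICAL_VALUE = 2.776  # two-tailed, 5% level, 4 degrees of freedom (t-table)
reject_null = abs(t) > CRITICAL_VALUE

print("r =", round(r, 3))
print("T.V. =", round(t, 3), "C.V. = +/-", CRITICAL_VALUE, "reject H0:", reject_null)
print("regression line: y' =", round(a, 3), "+", round(b, 3), "x")

# Hypothetical new carbohydrate values, chosen inside the range of the data
for x_new in (12.0, 14.0, 16.0):
    print(x_new, "g ->", round(a + b * x_new, 1), "kcal")
```

The scatter plot itself is left to the reader's plotting tool of choice.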