Today, we will see how to run a chi-squared (chi2) test in R.
Chi2 memo
Pearson's chi-squared test (or chi2 test) helps us test whether:
- one variable fits a theoretical distribution (example: the results of 1000 throws of a die. Is it a fair die or a loaded one?).
- two random variables are independent (example: someone's eye color and their shoe size).
You can find more information here: Chi2 by Stattrek
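To illustrate the first case (goodness of fit), here is a minimal sketch for the die example. The counts are invented for illustration; they are not real data. I assume a fair die, so each face has probability 1/6.

# Hypothetical counts for faces 1 to 6 over 1000 throws (invented numbers)
throws <- c(160, 172, 165, 170, 168, 165)
# Goodness-of-fit test against the fair-die distribution (assumption: p = 1/6 per face)
fit <- chisq.test(throws, p = rep(1/6, 6))
fit$expected   # expected count per face under the null hypothesis
fit$parameter  # degrees of freedom: 6 faces - 1 = 5
fit$p.value    # a large p-value means no evidence that the die is loaded

When chisq.test() receives a single vector of counts and a vector p, it runs the goodness-of-fit version of the test instead of the independence version.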
How to run a chi2 test in R
I will use a data frame to run my chi2 test. I want to know whether two variables (marital status and education) are independent or not. First, I will display the contingency table of the two variables:
table(dataset$marital, dataset$educ)
I obtain:
          -9th 9-11th High School -College +College
Married    270    343         522      703      843
Widowed    110     84         110      101       61
Divorced    34     80         144      209      103
Separated   42     46          43       51       22
Single      48    138         244      450      308
Couple      46     90         103      142       59
I can see that 843 people are married and in the +College education category. I also see that 110 people are widowed and stopped their education in High School.
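If you do not have the dataset at hand, here is a small sketch of how table() cross-tabulates two columns. The data frame toy and its six rows are invented for illustration; only the column names marital and educ follow the ones used in this post.

# Toy data frame, invented for illustration (not the NHANES data)
toy <- data.frame(
  marital = c("Married", "Single", "Married", "Widowed", "Single", "Married"),
  educ    = c("+College", "High School", "High School", "-9th", "+College", "+College")
)
# Rows come from the first argument, columns from the second
tab <- table(toy$marital, toy$educ)
tab
# addmargins() appends the row and column totals
addmargins(tab)

Each cell counts the rows of the data frame that share one marital status and one education level.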
Now, we will test whether these two variables are independent (the null hypothesis) with the chi2 test:
chi2 <- chisq.test(dataset$marital, dataset$educ)
chi2
I obtain:
Pearson's Chi-squared test
data: dataset$marital and dataset$educ
X-squared = 390.2901, df = 20, p-value < 2.2e-16
I can see that my p-value is below 0.05 (p-value < 2.2e-16), so I can reject the null hypothesis at the 5% level. My two variables are not independent!
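If you want to rerun this test without the raw data, a possible sketch is to type the counts of the contingency table into a matrix and pass it to chisq.test(). The object name counts and the dimnames labels marital and educ are my own choices here; the numbers are the ones from the table in this post.

# Counts copied from the contingency table in this post (rows: marital status)
counts <- matrix(
  c(270, 343, 522, 703, 843,
    110,  84, 110, 101,  61,
     34,  80, 144, 209, 103,
     42,  46,  43,  51,  22,
     48, 138, 244, 450, 308,
     46,  90, 103, 142,  59),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    marital = c("Married", "Widowed", "Divorced", "Separated", "Single", "Couple"),
    educ    = c("-9th", "9-11th", "High School", "-College", "+College")
  )
)
# Given a matrix of counts, chisq.test() runs the test of independence
chi2 <- chisq.test(counts)
chi2$statistic       # the X-squared value
chi2$parameter       # degrees of freedom: (6 - 1) * (5 - 1) = 20
chi2$p.value < 0.05  # TRUE means the null hypothesis is rejected at the 5% level

The degrees of freedom are (number of rows - 1) * (number of columns - 1), which is where df = 20 comes from.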
How can I know which categories have an excess or a deficit? With the residuals:
chi2$residuals
I obtain:
               dataset$educ
dataset$marital       -9th      9-11th High School   -College    +College
Married          0.2617848 -1.76781487 -1.74227653  -3.432675   6.4889432
Widowed          9.3892736   2.2735136   1.2208028 -3.2281972 -5.19370145
Divorced        -2.9930002  -0.0251501   2.2137140  2.9820762 -3.37361722
Separated        4.8436372   3.2263128   0.0204511 -1.2662691 -4.09296860
Single          -6.4278820  -2.2586597  -0.3564617  5.0699334  0.52792194
Couple           0.3616861   3.5671765   1.0965409  0.9328732 -4.91334127
A positive residual means an excess and a negative residual means a deficit. This output shows me that in the category "married and +College education" I have more people than I expected under the null hypothesis (residual of about 6.49).
I can also get this information, as raw differences in counts, with:
chi2$observed - chi2$expected
For reference:
residuals = (observed - expected)/sqrt(expected)
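To check this formula by hand, here is a short sketch on a small table. The counts in obs are invented for illustration, and the names obs, small_test, expected and pearson are mine. I also use the standard formula for the expected count of a cell under independence: row total * column total / grand total.

# Small 2 x 3 table with invented counts, for illustration only
obs <- matrix(c(20, 30, 50,
                30, 30, 40), nrow = 2, byrow = TRUE)
small_test <- chisq.test(obs)
# Expected count of each cell = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
# Pearson residuals computed by hand
pearson <- (obs - expected) / sqrt(expected)
# Both comparisons should return TRUE
all.equal(pearson, small_test$residuals, check.attributes = FALSE)
all.equal(sum(pearson^2), unname(small_test$statistic))

The second comparison shows that the X-squared statistic is the sum of the squared residuals, so the cells with the largest absolute residuals contribute the most to the test.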
PS: All statistics were computed on the demographics dataset from NHANES 2011-2012 (only people >20 years old).