Statistics for Laboratory Scientists ( 140.615 )

Sampling Distributions

Example from class

Suppose \(X_1,X_2,\ldots,X_{10}\) are iid Normal ( \(\mu\) = 10, \(\sigma\) = 4 ).

It follows that the sample mean \(\bar{X}\) has a Normal ( \(\mu\) = 10, \(\sigma \ /\sqrt{n}\) = 1.26 ) distribution.

The densities of \(X \sim\) Normal ( \(\mu\) = 10, \(\sigma\) = 4 ) in blue and \(\bar{X} \sim\) Normal ( \(\mu\) = 10, \(\sigma \ /\sqrt{n}\) = 1.26 ) in red.

curve(dnorm(x,10,4/sqrt(10)), from=-2, to=22, n=501, lwd=2, col="red", ylim=c(0,0.35), ylab="")
curve(dnorm(x,mean=10,sd=4), add=TRUE, lwd=2, col="blue")

Simulate \(X_1,X_2,\ldots,X_{10}\) ( as ten independent draws from the blue density ) and calculate their sample mean ( which is the equivalent of one draw from the red density ).

set.seed(2)
x <- rnorm(10, mean=10, sd=4)
x
##  [1]  6.412342 10.739397 16.351381  5.478497  9.678993 10.529681 12.831819  9.041208 17.937896  9.444852
mean(x)
## [1] 10.84461

Simulate five instances of such data, and calculate the sample means.

x <- matrix(rnorm(5*10,10,4), ncol=10)
x
##           [,1]       [,2]      [,3]       [,4]      [,5]     [,6]      [,7]      [,8]      [,9]     [,10]
## [1,] 11.670603  0.7557237 18.363277  0.1931744 12.955754 7.617358  8.465655 17.963682  6.646851  2.136487
## [2,] 13.927011 13.5144183  5.200297 11.9089492 11.275842 3.096081  2.163587  8.778065 18.265205  8.708116
## [3,]  8.429219 10.1432269 16.358553  7.6137673 14.304657 6.389662  6.633180  9.636623  7.751012 13.743450
## [4,]  5.841324 14.0513148 17.818607 13.1688131  8.863369 7.763752 17.614190  9.263354 15.102862 14.556919
## [5,] 17.128916 11.7290606 10.019751 11.1585468  6.893299 9.013950 12.489976  5.204929  5.809709 16.686475
dim(x)
## [1]  5 10
apply(x,1,mean)
## [1]  8.676857  9.683757 10.100335 12.404451 10.613461

Simulate 10,000 instances of such data, calculate the sample means, and make a histogram.

x <- matrix(rnorm(10000*10,10,4), ncol=10)
dim(x)
## [1] 10000    10
m <- apply(x,1,mean)
length(m)
## [1] 10000
hist(m, breaks=seq(-2,22,by=0.5))

Represent these 10,000 means as proportions, and add the \(\bar{X} \sim\) Normal ( \(\mu\) = 10, \(\sigma \ /\sqrt{n}\) = 1.26 ) density to the plot.

hist(m, breaks=seq(-2,22,by=0.5), prob=TRUE, ylim=c(0,0.35), main="")
curve(dnorm(x, 10, 4/sqrt(10)), add=TRUE, lwd=2, col="red")

We can also simulate 10,000 independent draws from a Normal ( \(\mu\) = 10, \(\sigma \ /\sqrt{n}\) = 1.26 ) distribution directly.

m <- rnorm(10000, mean=10, sd=4/sqrt(10))
hist(m, breaks=seq(-2,22,by=0.5), prob=TRUE, ylim=c(0,0.35), main="")
curve(dnorm(x, 10, 4/sqrt(10)), add=TRUE, lwd=2, col="red")

What is the probability the sample mean is bigger than 12?

curve(dnorm(x, mean=10, sd=4/sqrt(10)), from=-2, to=22, lwd=2, col="red")
abline(h=0)
abline(v=12,lty="dotted")

Using the CDF.

pnorm(12, mean=10, sd=4/sqrt(10), lower=FALSE)
## [1] 0.05692315
pnorm((10-12) / (4/sqrt(10)))
## [1] 0.05692315
round(pnorm((10-12) / (4/sqrt(10))), 3)
## [1] 0.057

Estimation via simulation ( 100 random draws from \(\bar{X}\) ).

m <- rnorm(100, mean=10, sd=4/sqrt(10))
m
##   [1]  9.046601  9.609598 10.900771  7.506382 10.960653  8.760014  9.734709  7.750663 10.395367  9.598393
##  [11] 12.727253  9.739435  9.685666 10.604888  7.032293  8.641225 11.639845  9.130762  9.263315 11.082936
##  [21]  7.426709  8.610265 10.839825 11.976583  7.570574 10.362458 10.360318  9.677420  8.662667  8.349442
##  [31]  8.975238 10.352639  9.187727 11.695180  9.170882  8.972076 10.823614 10.909693 11.346058 10.544790
##  [41]  9.547472 12.592633  8.765388 11.412670 10.216970 11.029030 10.339447 11.334574 10.181115 11.038134
##  [51] 11.877069  9.165368  9.138246  8.302916 10.935300 10.269147 10.081504  9.138465 11.410697  7.846416
##  [61] 10.110097 10.845205 10.679155 10.137946  9.707903  9.964085  6.433407 12.928445  9.277801 10.217730
##  [71]  8.992177 10.242735  9.154868 10.121973  8.141926  9.435607  8.960050  7.900956 10.461073 10.147304
##  [81] 10.488639  8.955591 10.909913 10.454365  9.974919  9.496750 10.736377 10.916811  9.058601 11.284210
##  [91] 11.638376 11.798765  7.314513 11.109341  7.018857  8.745757 10.514798 10.655080 12.107459  9.892131
m>12
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [52] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [69] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [86] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
sum(m>12)
## [1] 4
mean(m>12)
## [1] 0.04

Estimation via simulation ( 100,000 random draws ).

m <- rnorm(100000, mean=10, sd=4/sqrt(10))
sum(m>12)
## [1] 5726
mean(m>12)
## [1] 0.05726

What is the chance the sample mean falls between 9.5 and 10.5 ( i.e. within half a unit of the population mean ) ?

curve(dnorm(x, mean=10, sd=4/sqrt(10)), from=-2, to=22, lwd=2, col="red")
abline(h=0)
abline(v=c(9.5,10.5),lty="dotted")

Using the CDF.

pnorm(10.5, mean=10, sd=4/sqrt(10))
## [1] 0.6536836
pnorm(9.5, mean=10, sd=4/sqrt(10))
## [1] 0.3463164
pnorm(10.5, 10, 4/sqrt(10)) - pnorm(9.5, 10, 4/sqrt(10))
## [1] 0.3073672
round(pnorm(10.5, 10, 4/sqrt(10)) - pnorm(9.5, 10, 4/sqrt(10)), 3)
## [1] 0.307

Estimation via simulation ( 100 random draws ).

m <- rnorm(100, mean=10, sd=4/sqrt(10))
m
##   [1]  9.447857 11.697260  9.441668  9.637500  8.622925 11.850755  9.191609  8.794434  9.037606  9.701813
##  [11]  8.370014 10.841709  7.863280 11.312261 10.196758  9.457513 10.698300 11.927690  9.746977  7.204793
##  [21]  8.115221  9.314579 11.936503 11.394658 10.830723  9.424826  8.995837 11.497951 11.149130 10.475976
##  [31]  9.998511 12.012303 11.572981  8.726872  7.708670 11.021226  9.047590  9.992818  8.712282 11.034607
##  [41]  5.960929  9.466259 11.615162  9.024361 10.938870  9.165003 11.574046 10.579790 10.424597  9.777232
##  [51]  8.216042  8.408506  7.999985  9.204514  8.494146  8.426051 10.496881  8.639924  9.041945 10.940673
##  [61] 12.293288 11.252905 11.724816  9.416055  9.714676  8.997239 10.494349 10.444072  9.932385  9.216208
##  [71]  9.848498  9.690198  7.324484  8.881614 11.422962 11.550887 13.823367  7.106822  9.720442 12.432338
##  [81]  8.815525  9.811895  8.633530 11.498508  5.878121 10.017842  8.900173  8.713918  9.563255 11.126727
##  [91]  7.063751 12.162897 10.044861 11.064124 10.961089 11.137365  9.523818  8.114910  9.445144  9.604066
m<10.5
##   [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
##  [18] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
##  [35]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
##  [52]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [69]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
##  [86]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
m<9.5
##   [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
##  [18] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [35]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
##  [52]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
##  [69] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
##  [86] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
mean(m<10.5)
## [1] 0.67
mean(m<9.5)
## [1] 0.44
mean(m<10.5) - mean(m<9.5)
## [1] 0.23
m>9.5 & m<10.5 
##   [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [18] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
##  [35] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
##  [52] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE
##  [69]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [86]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
mean(m>9.5 & m<10.5) 
## [1] 0.23

Estimation via simulation ( 100,000 random draws ).

m <- rnorm(100000, mean=10, sd=4/sqrt(10))
mean(m<10.5)
## [1] 0.65609
mean(m<9.5)
## [1] 0.34694
mean(m<10.5) - mean(m<9.5)
## [1] 0.30915
mean(m>9.5 & m<10.5) 
## [1] 0.30915

What is the chance the sample mean is less than 9 or larger than 11 ( i.e. more than one unit away from the population mean ) ?

curve(dnorm(x, mean=10, sd=4/sqrt(10)), from=-2, to=22, lwd=2, col="red")
abline(h=0)
abline(v=c(9,11),lty="dotted")

Using the CDF.

pnorm(9, mean=10, sd=4/sqrt(10))
## [1] 0.2145977
pnorm(11,mean=10, sd=4/sqrt(10), lower=FALSE)
## [1] 0.2145977
pnorm(9, 10, 4/sqrt(10)) + pnorm(11, 10, 4/sqrt(10), lower=FALSE)
## [1] 0.4291953
round(pnorm(9, 10, 4/sqrt(10)) + pnorm(11, 10, 4/sqrt(10), lower=FALSE), 3)
## [1] 0.429

Estimation via simulation ( 100 random draws ).

m <- rnorm(100, mean=10, sd=4/sqrt(10))
m
##   [1]  8.438822  9.171132  9.792386  9.839669 11.293565  9.198246  7.708930  6.811357 10.544466  9.027098
##  [11]  9.594502 10.325829  9.353004 11.387640  9.434517  8.433147  8.910395 10.819983  8.195336 11.503378
##  [21] 10.612452 12.344783  9.692761  8.451338  9.833801 10.549925 10.325532 11.431634  9.361531  6.707064
##  [31]  9.556983 11.038387  8.682359  8.977892 10.444864  9.870009 10.271136 11.367709  8.931747 10.562852
##  [41]  9.615851  9.624510  9.232280 10.701514 10.943003  8.876197  9.858351 10.054871  9.783857  9.188723
##  [51]  7.513860  9.047270  9.149116  9.915054  8.215888 11.883360  9.001377 10.991108 10.239029  8.276146
##  [61] 10.802986 11.213062 10.676254  9.927167  9.961855  9.769051  8.260860 11.182160  8.110080  8.378657
##  [71]  9.382504 11.003883 10.832473  9.476532 10.332445  8.978295  8.372930 10.535734  8.844135  9.992423
##  [81] 10.526703  8.176535  9.249963  9.466049 11.481737 11.708129 11.757776 10.713577  9.583837  9.672322
##  [91]  9.125612 13.516723  7.768200  9.885208  7.433144 10.238779 10.051908  9.966723 10.690069 10.827183
m<9
##   [1]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [18] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
##  [35] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
##  [52] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [69]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [86] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
m>11
##   [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [18] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [35] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [52] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [69] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [86]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
mean(m<9)
## [1] 0.24
mean(m>11)
## [1] 0.15
mean(m<9) + mean(m>11)
## [1] 0.39
m<9 | m>11
##   [1]  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE
##  [18] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
##  [35] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
##  [52] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [69]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [86]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
mean(m<9 | m>11)
## [1] 0.39

Estimation via simulation ( 100,000 random draws ).

m <- rnorm(100000, mean=10, sd=4/sqrt(10))
mean(m<9)
## [1] 0.21394
mean(m>11)
## [1] 0.21555
mean(m<9) + mean(m>11)
## [1] 0.42949
mean(m<9 | m>11)
## [1] 0.42949

Another example

Simulate n=10 random samples from a Uniform(0,1) distribution, and take their mean. Repeat 10,000 times. Do the same using the median instead of the mean. Describe and compare the distribution of the means and the distribution of the medians. Repeat the same for n=100. What do you conclude?

set.seed(51)
nit <- 10000

n <- 10
x <- matrix(runif(n*nit),ncol=n)
xA <- apply(x,1,mean)
x <- matrix(runif(n*nit),ncol=n)
xB <- apply(x,1,median)

n <- 100
x <- matrix(runif(n*nit),ncol=n)
xC <- apply(x,1,mean)
x <- matrix(runif(n*nit),ncol=n)
xD <- apply(x,1,median)

par(mfcol=c(2,2),las=1)
hist(xA,breaks=seq(0,1,length=41),col="lightgrey",main="A:  Mean n=10",xlab="",ylim=c(0,1200),cex.main=1.2)
hist(xB,breaks=seq(0,1,length=41),col="lightgrey",main="B:  Median  n=10",xlab="",ylim=c(0,1200),cex.main=1.2)
hist(xC,breaks=seq(0,1,length=41),col="lightgrey",main="C:  Mean  n=100",xlab="",ylim=c(0,3000),cex.main=1.2)
hist(xD,breaks=seq(0,1,length=41),col="lightgrey",main="D:  Median  n=100",xlab="",ylim=c(0,3000),cex.main=1.2)

round(mean(xA),3)
## [1] 0.5
round(mean(xB),3)
## [1] 0.498
round(mean(xC),3)
## [1] 0.499
round(mean(xD),3)
## [1] 0.5
round(sd(xA),2)
## [1] 0.09
round(sd(xB),2)
## [1] 0.14
round(sd(xC),2)
## [1] 0.03
round(sd(xD),2)
## [1] 0.05

Both the population mean and population median for a Uniform(0,1) distribution are 0.5, and all of the distributions above are centered around this value (the sample mean and the sample median are unbiased estimators). However, for a fixed n, the deviations of the sample means are a lot smaller than the deviations for the sample medians. In this setting, the sample mean is a more efficient estimator for the population mean (or population median) than the sample median. Obviously, using the same estimator, the deviation becomes smaller as n gets larger.