Basic unit-level models

The basic unit-level model (Battese, Harter, and Fuller 1988; Rao and Molina 2015) is given by \[ y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2) \] where \(j\) runs from 1 to \(n\), the number of unit-level observations, \(\beta\) is a vector of regression coefficients for given covariates \(x_j\), and \(v_i\) are random area intercepts.

We use the api dataset included in packages survey.

library(survey)
data(api)
apipop$cname <- as.factor(apipop$cname)
apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname))

The apipop data frame contains the complete population whereas apisrs is a simple random sample from it. The variable cname is the county name, and we will be interested in estimation at the county level. Not all counties in the population are sampled. In order to be able to make predictions for out-of-sample areas we make sure that the levels of the sample’s cname variable match those of its population counterpart.

The basic unit-level model with county random area effects is fit as follows

library(mcmcsae)
mod <- api00 ~ 
  reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") +
  gen(factor = ~ iid(cname), name="v")
sampler <- create_sampler(mod, data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
(summary(sim))
## llh_ :
##       Mean   SD t-value MCSE q0.05  q0.5 q0.95 n_eff R_hat
## llh_ -1090 4.49    -243 0.12 -1097 -1089 -1083  1394     1
## 
## sigma_ :
##        Mean   SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## sigma_ 56.2 3.11    18.1 0.067  51.4 56.1  61.4  2149     1
## 
## beta :
##                 Mean     SD t-value    MCSE    q0.05     q0.5    q0.95 n_eff R_hat
## (Intercept)  802.755 25.179   31.88 0.49801  761.691  802.847 843.8644  2556 0.999
## ell           -1.960  0.306   -6.40 0.00584   -2.481   -1.962  -1.4580  2751 1.001
## meals         -1.966  0.284   -6.92 0.00542   -2.435   -1.966  -1.5030  2748 1.000
## stypeH      -106.948 13.367   -8.00 0.25196 -128.710 -106.845 -84.3888  2814 1.000
## stypeM       -60.240 11.604   -5.19 0.21186  -79.362  -60.367 -40.9081  3000 1.000
## hsg           -0.618  0.413   -1.50 0.00754   -1.292   -0.613   0.0543  3000 1.000
## col.grad       0.580  0.507    1.14 0.01005   -0.234    0.573   1.4234  2541 0.999
## grad.sch       2.121  0.480    4.42 0.00936    1.338    2.106   2.9163  2629 1.000
## 
## v_sigma :
##         Mean   SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 23.3 6.26    3.72 0.202  13.7 22.9  34.1   962     1
## 
## v :
##                 Mean   SD  t-value  MCSE q0.05     q0.5 q0.95 n_eff R_hat
## Alameda      -25.745 15.2 -1.69449 0.327 -51.6 -25.3497 -1.11  2160 1.000
## Amador         0.162 24.2  0.00669 0.449 -39.8   0.0835 39.76  2915 1.000
## Butte         -0.452 23.7 -0.01906 0.456 -39.1  -0.2165 38.43  2699 1.000
## Calaveras      8.082 22.0  0.36734 0.431 -27.0   7.4774 44.89  2607 1.001
## Colusa         0.521 24.4  0.02130 0.453 -40.4   0.8411 40.15  2916 1.000
## Contra Costa -10.131 15.2 -0.66481 0.278 -35.3  -9.5220 13.53  3000 1.000
## Del Norte      1.001 23.9  0.04186 0.437 -38.3   0.3058 41.06  3000 0.999
## El Dorado      0.449 24.5  0.01833 0.477 -38.7   0.7864 39.66  2633 1.000
## Fresno         0.152 15.5  0.00982 0.283 -25.9   0.2895 25.52  3000 1.000
## Glenn         -0.614 23.8 -0.02579 0.434 -40.1   0.1035 37.19  3000 1.000
## ... 47 elements suppressed ...

We wish to estimate the area population means \[ \theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,, \] where \(U_i\) is the set of units in area \(i\) of size \(N_i\). The MCMC output in variable sim can be used to obtain draws from the posterior distribution for \(\theta_i\). The \(r\)th draw can be expressed as \[ \theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i} - n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i} \epsilon_{j;r} \right)\,, \] where \(\bar{y}_i\) is the sample mean of \(y\) in area \(i\) and \(t_{x;i}\) is a vector of population totals for area \(i\).

N <- table(apipop$cname)
samplesums <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
m <- match(apisrs$cds, apipop$cds)  # population units in the sample
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
##                 Mean    SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda          678 14.45    46.9 0.310   654  679   702  2170 1.000
## Amador           738 31.03    23.8 0.573   689  738   790  2938 1.000
## Butte            679 26.40    25.7 0.503   635  679   723  2750 1.000
## Calaveras        744 27.34    27.2 0.499   700  743   789  3000 1.000
## Colusa           552 31.90    17.3 0.595   499  553   605  2870 1.000
## Contra Costa     727 14.52    50.0 0.268   703  727   750  2928 1.000
## Del Norte        680 31.90    21.3 0.582   629  679   732  3000 0.999
## El Dorado        754 26.92    28.0 0.509   710  754   798  2798 1.000
## Fresno           595 15.08    39.4 0.275   570  595   619  3000 1.000
## Glenn            624 31.41    19.9 0.573   573  625   674  3000 1.000
## Humboldt         701 27.54    25.5 0.504   656  702   748  2984 1.001
## Imperial         558 25.21    22.1 0.517   518  556   600  2375 0.999
## Inyo             707 32.91    21.5 0.601   653  707   760  3000 1.000
## Kern             596 14.99    39.8 0.310   571  596   620  2332 0.999
## Kings            594 22.90    25.9 0.500   555  595   630  2094 1.001
## Lake             663 25.54    26.0 0.477   621  662   706  2863 1.000
## Lassen           706 26.44    26.7 0.495   662  707   748  2854 1.001
## Los Angeles      636  8.18    77.7 0.149   622  636   649  2995 1.000
## Madera           615 20.25    30.4 0.370   583  615   650  3000 1.001
## Marin            817 20.41    40.0 0.373   783  817   850  3000 1.000
## Mariposa         724 35.74    20.2 0.674   665  724   783  2816 1.000
## Mendocino        663 27.04    24.5 0.494   618  662   707  3000 1.002
## Merced           579 23.76    24.4 0.434   541  578   619  3000 1.001
## Modoc            672 30.30    22.2 0.598   623  672   722  2570 0.999
## Mono             703 41.74    16.8 0.762   634  703   772  3000 0.999
## Monterey         629 17.96    35.0 0.332   598  629   658  2919 1.000
## Napa             702 21.75    32.3 0.398   667  702   739  2984 0.999
## Nevada           799 29.66    26.9 0.541   749  798   847  3000 1.000
## Orange           693 15.23    45.5 0.294   669  693   718  2686 1.000
## Placer           772 24.02    32.1 0.461   734  772   812  2714 0.999
## Plumas           690 32.28    21.4 0.622   637  690   744  2696 1.000
## Riverside        636 14.26    44.6 0.273   613  636   658  2724 1.000
## Sacramento       669 15.40    43.4 0.281   644  668   694  3000 1.000
## San Benito       709 30.33    23.4 0.567   659  709   759  2862 0.999
## San Bernardino   647 13.11    49.3 0.239   625  647   669  3000 1.000
## San Diego        705 15.09    46.7 0.323   680  705   730  2185 1.000
## San Francisco    638 19.89    32.1 0.363   605  638   671  3000 0.999
## San Joaquin      630 17.66    35.7 0.368   600  630   658  2298 1.000
## San Luis Obispo  753 23.66    31.8 0.432   715  753   792  3000 1.000
## San Mateo        735 21.62    34.0 0.408   700  735   770  2812 1.001
## Santa Barbara    679 19.25    35.3 0.351   647  679   710  3000 1.001
## Santa Clara      733 15.77    46.5 0.295   707  733   758  2865 1.000
## Santa Cruz       680 20.01    34.0 0.400   646  680   712  2497 1.001
## Shasta           696 22.55    30.9 0.434   660  696   734  2705 1.001
## Sierra           708 40.49    17.5 0.777   641  708   775  2716 1.001
## Siskiyou         697 25.23    27.6 0.488   656  697   739  2677 1.000
## Solano           712 20.11    35.4 0.370   678  712   743  2954 1.000
## Sonoma           725 22.51    32.2 0.433   688  725   761  2703 1.000
## Stanislaus       661 19.69    33.6 0.382   629  661   693  2656 1.001
## Sutter           652 25.26    25.8 0.473   610  652   692  2852 1.000
## Tehama           659 28.67    23.0 0.538   612  659   707  2845 1.000
## Trinity          650 38.84    16.7 0.709   586  651   713  3000 1.000
## Tulare           583 21.60    27.0 0.415   546  583   618  2712 1.001
## Tuolumne         720 29.90    24.1 0.546   671  720   769  3000 1.000
## Ventura          710 17.42    40.7 0.332   681  709   739  2758 1.000
## Yolo             687 23.99    28.6 0.454   647  687   726  2791 1.000
## Yuba             627 28.80    21.8 0.533   580  626   675  2925 1.000
theta <- c(tapply(apipop$api00, apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

Binomial Unit-Level Model

A model with binomial likelihood can also be fit. We now model the target variable sch.wide, a binary variable indicating whether a school-wide growth target has been met. We use the same mean model structure as above for the linear model, but now using a logistic link function, \[ y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\ \mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \]

apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes")
sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
summary(sim)
## llh_ :
##       Mean  SD t-value   MCSE q0.05  q0.5 q0.95 n_eff R_hat
## llh_ -83.3 2.5   -33.3 0.0616 -87.7 -83.2 -79.6  1651     1
## 
## beta :
##                 Mean     SD t-value     MCSE   q0.05     q0.5    q0.95 n_eff R_hat
## (Intercept)  4.56516 1.4705  3.1044 0.053839  2.2341  4.50600  7.11086   746 1.002
## ell         -0.03653 0.0148 -2.4738 0.000397 -0.0615 -0.03621 -0.01266  1383 1.000
## meals        0.00123 0.0139  0.0886 0.000360 -0.0217  0.00131  0.02443  1496 1.001
## stypeH      -2.83251 0.6289 -4.5039 0.018450 -3.8958 -2.81534 -1.80446  1162 1.001
## stypeM      -1.66460 0.5866 -2.8378 0.018269 -2.6488 -1.66899 -0.69783  1031 1.000
## hsg         -0.02637 0.0222 -1.1892 0.000655 -0.0630 -0.02606  0.01058  1146 1.002
## col.grad    -0.03600 0.0267 -1.3468 0.000852 -0.0783 -0.03624  0.00844   984 1.001
## grad.sch     0.01267 0.0327  0.3871 0.001279 -0.0397  0.01183  0.06877   655 0.999
## 
## v_sigma :
##          Mean    SD t-value   MCSE  q0.05  q0.5 q0.95 n_eff R_hat
## v_sigma 0.372 0.267    1.39 0.0117 0.0335 0.325 0.869   525  1.01
## 
## v :
##                  Mean    SD  t-value    MCSE  q0.05      q0.5 q0.95 n_eff R_hat
## Alameda      -0.19342 0.429 -0.45131 0.01244 -1.041 -0.082722 0.335  1186 1.001
## Amador        0.01100 0.456  0.02412 0.00835 -0.704  0.002089 0.749  2986 1.000
## Butte         0.00190 0.451  0.00421 0.00824 -0.697  0.004309 0.692  2998 1.000
## Calaveras     0.04634 0.454  0.10213 0.00978 -0.613  0.004646 0.818  2154 1.001
## Colusa       -0.00633 0.455 -0.01389 0.00832 -0.729 -0.004004 0.711  3000 0.999
## Contra Costa -0.00822 0.396 -0.02074 0.00875 -0.666  0.000858 0.618  2051 1.000
## Del Norte    -0.00822 0.471 -0.01745 0.00883 -0.761 -0.002536 0.712  2843 1.000
## El Dorado    -0.00807 0.454 -0.01777 0.00933 -0.750  0.000655 0.707  2366 1.000
## Fresno       -0.04698 0.365 -0.12865 0.00715 -0.704 -0.015413 0.517  2608 1.000
## Glenn         0.01047 0.464  0.02254 0.00848 -0.687  0.002597 0.745  3000 1.000
## ... 47 elements suppressed ...

To predict the population fractions of schools that meet the growth target by county,

samplesums <- tapply(apisrs$target.met, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
##                  Mean     SD t-value     MCSE q0.05  q0.5 q0.95 n_eff R_hat
## Alameda         0.785 0.0601   13.05 0.001644 0.670 0.796 0.867  1338 1.002
## Amador          0.839 0.1185    7.08 0.002226 0.600 0.900 1.000  2834 1.000
## Butte           0.836 0.0744   11.23 0.001447 0.708 0.833 0.938  2645 1.000
## Calaveras       0.872 0.0960    9.08 0.001921 0.700 0.900 1.000  2499 1.000
## Colusa          0.650 0.1605    4.05 0.003092 0.333 0.667 0.889  2693 0.999
## Contra Costa    0.816 0.0572   14.27 0.001306 0.715 0.821 0.899  1918 1.001
## Del Norte       0.880 0.1090    8.08 0.002101 0.750 0.875 1.000  2693 0.999
## El Dorado       0.849 0.0752   11.29 0.001521 0.725 0.850 0.950  2444 1.000
## Fresno          0.782 0.0583   13.40 0.001205 0.683 0.785 0.866  2344 0.999
## Glenn           0.794 0.1398    5.68 0.002684 0.556 0.778 1.000  2713 0.999
## Humboldt        0.853 0.0734   11.62 0.001436 0.725 0.850 0.950  2609 1.000
## Imperial        0.698 0.1047    6.67 0.002090 0.525 0.700 0.850  2510 1.000
## Inyo            0.784 0.1504    5.21 0.002810 0.571 0.857 1.000  2867 1.000
## Kern            0.805 0.0516   15.59 0.001181 0.711 0.811 0.878  1911 1.000
## Kings           0.851 0.0798   10.66 0.001649 0.720 0.840 0.960  2345 1.000
## Lake            0.826 0.0948    8.71 0.001822 0.682 0.818 0.955  2706 1.001
## Lassen          0.837 0.1070    7.82 0.002028 0.636 0.818 1.000  2785 0.999
## Los Angeles     0.799 0.0427   18.73 0.001128 0.731 0.799 0.872  1431 1.003
## Madera          0.817 0.0738   11.07 0.001497 0.677 0.806 0.935  2430 1.001
## Marin           0.839 0.0675   12.43 0.001566 0.720 0.840 0.940  1856 1.000
## Mariposa        0.862 0.1454    5.93 0.002731 0.600 0.800 1.000  2834 1.000
## Mendocino       0.816 0.0905    9.01 0.001723 0.640 0.840 0.960  2760 1.000
## Merced          0.790 0.0807    9.80 0.001614 0.651 0.794 0.905  2500 1.000
## Modoc           0.835 0.1509    5.53 0.002993 0.600 0.800 1.000  2544 1.001
## Mono            0.760 0.2446    3.11 0.004505 0.333 0.667 1.000  2948 1.002
## Monterey        0.802 0.0685   11.71 0.001537 0.687 0.807 0.904  1988 1.000
## Napa            0.848 0.0808   10.50 0.001664 0.704 0.852 0.963  2357 1.001
## Nevada          0.870 0.0933    9.32 0.001769 0.714 0.857 1.000  2782 1.001
## Orange          0.774 0.0603   12.84 0.001327 0.672 0.778 0.871  2066 1.000
## Placer          0.815 0.0743   10.97 0.001705 0.682 0.818 0.924  1900 1.003
## Plumas          0.743 0.1524    4.87 0.002987 0.444 0.778 1.000  2602 1.000
## Riverside       0.825 0.0529   15.60 0.001238 0.729 0.831 0.895  1824 0.999
## Sacramento      0.847 0.0471   18.00 0.001229 0.767 0.847 0.924  1466 1.003
## San Benito      0.824 0.1232    6.69 0.002517 0.636 0.818 1.000  2396 1.000
## San Bernardino  0.861 0.0420   20.49 0.001134 0.790 0.862 0.928  1374 1.000
## San Diego       0.849 0.0420   20.22 0.000935 0.778 0.852 0.913  2016 0.999
## San Francisco   0.760 0.0756   10.06 0.001505 0.630 0.760 0.870  2521 1.000
## San Joaquin     0.829 0.0575   14.42 0.001274 0.726 0.839 0.911  2034 1.000
## San Luis Obispo 0.857 0.0681   12.57 0.001400 0.750 0.875 0.950  2368 1.001
## San Mateo       0.800 0.0732   10.92 0.001707 0.671 0.811 0.895  1840 1.003
## Santa Barbara   0.833 0.0640   13.01 0.001366 0.728 0.840 0.926  2196 1.001
## Santa Clara     0.784 0.0687   11.41 0.001917 0.649 0.796 0.875  1283 1.002
## Santa Cruz      0.786 0.0745   10.56 0.001571 0.653 0.796 0.898  2245 1.002
## Shasta          0.817 0.0764   10.69 0.001533 0.675 0.825 0.925  2486 0.999
## Sierra          0.730 0.2254    3.24 0.004407 0.333 0.667 1.000  2615 1.000
## Siskiyou        0.843 0.0992    8.51 0.002000 0.667 0.867 1.000  2458 1.001
## Solano          0.866 0.0621   13.94 0.001465 0.750 0.867 0.950  1799 1.001
## Sonoma          0.845 0.0630   13.41 0.001332 0.741 0.852 0.935  2235 1.002
## Stanislaus      0.812 0.0714   11.38 0.001809 0.692 0.824 0.901  1557 1.000
## Sutter          0.834 0.0907    9.20 0.001703 0.650 0.850 0.950  2835 1.000
## Tehama          0.841 0.0991    8.49 0.001876 0.647 0.824 1.000  2794 1.000
## Trinity         0.756 0.1999    3.78 0.003735 0.500 0.750 1.000  2864 1.000
## Tulare          0.831 0.0631   13.17 0.001325 0.718 0.836 0.918  2270 0.999
## Tuolumne        0.882 0.0920    9.59 0.001827 0.750 0.917 1.000  2535 1.000
## Ventura         0.837 0.0509   16.46 0.001085 0.752 0.839 0.913  2197 1.000
## Yolo            0.849 0.0737   11.52 0.001444 0.714 0.857 0.943  2608 1.002
## Yuba            0.798 0.0996    8.01 0.001933 0.632 0.789 0.947  2652 0.999
theta <- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

References

Battese, G. E., R. M. Harter, and W. A. Fuller. 1988. “An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data.” Journal of the American Statistical Association 83 (401): 28–36.
Rao, J. N. K., and I. Molina. 2015. Small Area Estimation. John Wiley & Sons.