Raphaël Jauslin

Deville systematic

2022-07-06T00:00:00+00:00

Introduction

The Deville systematic design is a sampling method developed in 1998 by Jean-Claude Deville. While it shares similarities with systematic sampling, it has distinct properties. Chauvet (2012) demonstrated that Deville systematic sampling and the ordered pivotal method are actually the same underlying sampling design.

This vignette explains how to use the functions sys_deville and sys_devillepi2 and includes a small simulation to verify that the second-order inclusion probabilities align with those calculated by the function spm from the BalancedSampling package, which implements the ordered pivotal method.

Generating Data

We generate unequal inclusion probabilities that are proportional to a uniform random variable and sum to the sample size n = 3.

library(sampling)
library(StratifiedSampling)
library(BalancedSampling)

set.seed(1)
N <- 20
n <- 3
pik <- inclusionprobabilities(runif(N),n)
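
To illustrate what "proportional to a uniform variable" means, here is a minimal base R sketch. It rescales a positive size variable so that the probabilities sum to n, and takes with certainty any unit whose probability would exceed 1. The helper name pik_prop is hypothetical and is not part of the sampling package; in practice inclusionprobabilities() should be preferred.

```r
# Hypothetical helper (not from any package): inclusion probabilities
# proportional to a positive size variable x, summing to n.
pik_prop <- function(x, n) {
  pik <- n * x / sum(x)
  # Units whose probability exceeds 1 are taken with certainty and the
  # remaining units are rescaled, repeating until all values are <= 1.
  while (any(pik > 1)) {
    certain <- pik >= 1
    pik[certain] <- 1
    pik[!certain] <- (n - sum(certain)) * x[!certain] / sum(x[!certain])
  }
  pik
}

set.seed(1)
x_demo <- runif(20)
pik_demo <- pik_prop(x_demo, 3)
sum(pik_demo)    # equals the sample size, up to rounding error
range(pik_demo)  # all values lie in (0, 1]
```

With a very skewed size variable, for example pik_prop(c(100, 1, 1, 1), 2), the first unit gets probability 1 and the other three share the remaining expected sample size equally.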

Simulations

To verify whether the function correctly computes second-order inclusion probabilities, we perform a large number of simulations to estimate the second-order inclusion probability matrix.

SIM <- 100000
PI_1 <-  PI_2 <-  matrix(rep(0,N*N),ncol = N,nrow = N)

for(i in 1:SIM){
  
  s1 <- BalancedSampling::spm(pik)
  s1_01 <- rep(0,N)
  s1_01[s1] <- 1
  
  s2 <- sys_deville(pik)
  s2_01 <- rep(0,N)
  s2_01[s2] <- 1
  PI_1 <- PI_1 + s1_01%*%t(s1_01)
  PI_2 <- PI_2 + s2_01%*%t(s2_01)

}

PI_1 <- PI_1/SIM
PI_2 <- PI_2/SIM

Exact second-order inclusion probability matrix

The function sys_devillepi2 computes the exact second-order inclusion probabilities.

PI <- sys_devillepi2(pik) # compute the second order inclusion probabilities

Results

We visualize and compare the second-order inclusion probability matrices.

library(Matrix) # provides the sparseMatrix class and its image() method
PI_1_sp <- as(as.matrix(PI_1),"sparseMatrix")
PI_2_sp <- as(as.matrix(PI_2),"sparseMatrix")
PI_sp <- as(as.matrix(PI),"sparseMatrix")

image(PI_1_sp)

image(PI_2_sp)

image(PI_sp)

Accuracy test

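To test accuracy, we verify that the estimated probabilities closely match the exact values. Each simulated entry is a proportion over SIM independent draws, so its standardized deviation from the exact probability is approximately standard normal, and roughly 95% of the entries should be smaller than 1.96 in absolute value.

# proportion test: these values should be approximately equal to 0.95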
length( which(abs((PI_1_sp@x - PI_sp@x)/sqrt(PI_sp@x*(1-PI_sp@x)/SIM)) < 1.96))/length(PI_sp@x)

#> [1] 0.961039

length( which(abs((PI_2_sp@x - PI_sp@x)/sqrt(PI_sp@x*(1-PI_sp@x)/SIM)) < 1.96))/length(PI_sp@x)

#> [1] 0.9448052
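
The same check can be tried on a design whose second-order inclusion probabilities are known in closed form. The sketch below is an illustration in base R only, not part of the original vignette. It uses simple random sampling without replacement, for which the joint inclusion probability of two distinct units is n(n-1)/(N(N-1)). The variable names (N_srs, n_srs, SIM_srs) are chosen here to avoid overwriting the objects defined above.

```r
set.seed(123)
N_srs <- 8
n_srs <- 3
SIM_srs <- 20000

# Exact second-order inclusion probabilities of simple random sampling
# without replacement: n/N on the diagonal, n(n-1)/(N(N-1)) elsewhere.
PI_exact <- matrix(n_srs * (n_srs - 1) / (N_srs * (N_srs - 1)), N_srs, N_srs)
diag(PI_exact) <- n_srs / N_srs

# Monte Carlo estimate of the same matrix.
PI_hat <- matrix(0, N_srs, N_srs)
for (i in seq_len(SIM_srs)) {
  s_01 <- rep(0, N_srs)
  s_01[sample.int(N_srs, n_srs)] <- 1
  PI_hat <- PI_hat + tcrossprod(s_01)
}
PI_hat <- PI_hat / SIM_srs

# Standardized deviations; about 95% should fall inside (-1.96, 1.96).
z <- (PI_hat - PI_exact) / sqrt(PI_exact * (1 - PI_exact) / SIM_srs)
coverage <- mean(abs(z) < 1.96)
coverage
```

Because the entries of the matrix are correlated (the matrix is symmetric and the sample size is fixed), the observed proportion only fluctuates around 0.95 and should not be expected to match it exactly.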

References

Chauvet, G. (2012), On a characterization of ordered pivotal sampling, Bernoulli, 18(4):1320-1340 DOI: https://doi.org/10.3150/11-BEJ380

Sequential balanced sampling

2022-06-06T00:00:00+00:00

Introduction

Balanced sampling plays a crucial role in applied statistics. In this vignette, we explain how to use the balseq function to select a balanced and spatially distributed sample. For a detailed explanation of the method, refer to doi:10.1002/env.2776.

Loading Data

We will use the belgianmunicipalities dataset from the sampling package, which does not contain spatial coordinates. Fortunately, a GeoJSON file is available on ArcGIS Hub. We transform it into an sf object and compute the municipalities’ centroids so that the sample can be spread across space.

# to load data
library(sampling)
library(geojsonio)
library(ggplot2) 
library(viridis)
library(rgeos) 
library(sf)
library(rmapshaper)

data("belgianmunicipalities")

# load geojson directly from the url
belg  <- geojson_read("https://opendata.arcgis.com/datasets/9589d9e5e5904f1ea8d245b54f51b4fd_0.geojson",what = "sp")

# simplify the geometry and transform it into an sf object
belg <- rmapshaper::ms_simplify(input = belg,keep = 0.01) %>%
  st_as_sf()

coord <- gCentroid(as(belg, "Spatial"), byid = TRUE)

# combine the geometry, the municipality data and the centroids
Belgium <- cbind(belg,belgianmunicipalities,coord)
head(Belgium)

#> Simple feature collection with 6 features and 26 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 4.216306 ymin: 51.08066 xmax: 4.564865 ymax: 51.37764
#> Geodetic CRS:  WGS 84
#>   OBJECTID   ADMUNAFR   ADMUNADU   ADMUNAGE   Communes CODE_INS arrond    Commune   INS Province Arrondiss
#> 1        1 AARTSELAAR AARTSELAAR AARTSELAAR Aartselaar    11001     11 Aartselaar 11001        1        11
#> 2        2     ANVERS  ANTWERPEN  ANTWERPEN  Antwerpen    11002     11     Anvers 11002        1        11
#> 3        3   BOECHOUT   BOECHOUT   BOECHOUT   Boechout    11004     11   Boechout 11004        1        11
#> 4        4       BOOM       BOOM       BOOM       Boom    11005     11       Boom 11005        1        11
#> 5        5   BORSBEEK   BORSBEEK   BORSBEEK   Borsbeek    11007     11   Borsbeek 11007        1        11
#> 6        6 BRASSCHAAT BRASSCHAAT BRASSCHAAT Brasschaat    11008     11 Brasschaat 11008        1        11
#>    Men04 Women04  Tot04  Men03 Women03  Tot03 Diffmen Diffwom DiffTOT TaxableIncome Totaltaxation averageincome
#> 1   6971    7169  14140   7010    7243  14253     -39     -74    -113     242104077      74976114         33809
#> 2 223677  233642 457319 221767  232405 454172    1910    1237    3147    5416418842    1423715652         22072
#> 3   6027    5927  11954   6005    5942  11947      22     -15       7     167616996      50739035         29453
#> 4   7640    8066  15706   7535    7952  15487     105     114     219     186075961      46636930         21907
#> 5   4948    5328  10276   4951    5322  10273      -3       6       3     143225590      40564374         26632
#> 6  18142   18916  37058  18217   18903  37120     -75      13     -62     533368826     153629397         30574
#>   medianincome        x        y                       geometry
#> 1        23901 4.382005 51.13223 MULTIPOLYGON (((4.400451 51...
#> 2        17226 4.369578 51.26067 MULTIPOLYGON (((4.368136 51...
#> 3        21613 4.516463 51.16576 MULTIPOLYGON (((4.530071 51...
#> 4        17537 4.371434 51.09387 MULTIPOLYGON (((4.385267 51...
#> 5        20739 4.488162 51.19138 MULTIPOLYGON (((4.509002 51...
#> 6        21523 4.500281 51.30940 MULTIPOLYGON (((4.540815 51...

Data visualization

We visualize Belgian municipalities with their average income using ggplot2.

p <- ggplot()+
  geom_sf(data = Belgium,aes(fill = averageincome),size = 0.1)+
  scale_fill_viridis_c(option = "G")
p

Inclusion probabilities

A good sample should maintain the population’s characteristics. Inclusion probabilities can be made proportional to an auxiliary variable, but here we use equal inclusion probabilities n/N. They sum to 50, i.e. the sample will contain 50 units.

N <- nrow(Belgium) # population size
n <- 50 # sample size

# variable of interest
y <- belgianmunicipalities$averageincome

# auxiliary variables
Xaux <- cbind(belgianmunicipalities$Tot04,
              belgianmunicipalities$Women04,
              belgianmunicipalities$TaxableIncome,
              belgianmunicipalities$Diffmen,
              belgianmunicipalities$Diffwom)


# inclusion probabilities
pik <- rep(n/N,N)

Xaux <- cbind(pik,Xaux) # add pik as a balancing variable to fix the sample size
Xspread <- coord

Balanced sampling

We compare balanced sampling balseq with two other methods: samplecube and simple random sampling srswor. The percentage deviation from auxiliary totals helps evaluate the balance quality.

library(StratifiedSampling)

s <- balseq(pik,Xaux)
s_cube <- samplecube(Xaux,pik,comment = FALSE)
s_srs <- srswor(n,N)


TOT <- colSums(Xaux)
EST1 <- colSums(Xaux[s,]/pik[s])
EST2 <- colSums(Xaux[s_cube == 1,]/pik[s_cube == 1])
EST3 <- colSums(Xaux[s_srs == 1,]/pik[s_srs == 1])


100* (EST1 - TOT)/TOT

#>       pik                                                   
#>  0.000000  4.908396  4.890552  6.873691 14.225213  6.440032

100* (EST2 - TOT)/TOT

#>        pik                                                        
#>   0.000000  -4.068363  -4.153367  -8.105537 -19.212958 -14.508554

100* (EST3 - TOT)/TOT

#>       pik                                                   
#>   0.00000  12.99101  12.52005  12.72183 -13.28123 -21.77427
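
The deviation of the pik column is zero for all three designs. The following base R sketch, using toy data that are not from the vignette, shows why. For a fixed-size sample the Horvitz-Thompson estimate of the total of pik is the sample size, which equals the sum of pik over the population. All object names (N_pop, n_smp, X_toy, and so on) are illustrative.

```r
set.seed(7)
N_pop <- 100
n_smp <- 10
pik_eq <- rep(n_smp / N_pop, N_pop)

# Toy auxiliary matrix: pik itself plus one size variable.
X_toy <- cbind(pik = pik_eq, size = rexp(N_pop, rate = 1 / 50))

# Fixed-size sample, here drawn by simple random sampling.
s_idx <- sample.int(N_pop, n_smp)

tot <- colSums(X_toy)
# Horvitz-Thompson estimates: each sampled row is divided by its pik
est <- colSums(X_toy[s_idx, ] / pik_eq[s_idx])
dev <- 100 * (est - tot) / tot
dev
```

The first entry of dev is zero up to rounding error, while the second depends on the sample drawn. A balanced design aims to make the deviations of the other auxiliary variables small as well.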

Spread sampling

To incorporate spatial distribution, we pass the geographic coordinates as a matrix to the Xspread argument of balseq. Here coord is the output of the function gCentroid, which is by construction an S4 object. To extract the coordinate matrix stored inside it, we access the coords slot with the @ operator.

s <-  balseq(pik,
             Xaux,
             Xspread = as.matrix(Xspread@coords))
# the fill scale is already part of p, so only the points are added
p <- p + 
  geom_point(data = Belgium,aes(x = x,y = y),shape = 1,alpha = 0.9)+
  geom_point(data = Belgium[s,],aes(x = x,y = y),colour = "red")
p

Wave Sampling

2021-10-19T09:11:51+00:00

Introduction

Geographical data are generally auto-correlated, making it preferable to avoid sampling neighboring units. We introduce a new method for selecting well-spread samples from a finite spatial population with equal or unequal inclusion probabilities. The proposed method, called wave (Weakly Associated Vectors), defines the contiguity structure using dense stratification. This method precisely satisfies inclusion probabilities while providing well-spread samples. This document serves as an introduction to using the wave() function.

Data Generation

We use the meuse dataset from the sp package, described as follows:

This dataset provides locations and topsoil heavy metal concentrations, along with various soil and landscape variables recorded at observation locations in a floodplain of the river Meuse, near Stein (NL).

As explained by Grafström and Tillé (2013), we generate inclusion probabilities proportional to copper concentration, a highly spatially correlated variable.

library(sp)
library(sf)
library(sampling)
library(WaveSampling)
data("meuse")
data("meuse.riv")
meuse.riv <- meuse.riv[which(meuse.riv[,2] < 334200 & meuse.riv[,2] > 329400),]
meuse_sf <- st_as_sf(meuse, coords = c("x", "y"), crs = 28992, agr = "constant")

X <- scale(as.matrix(meuse[,1:2]))
pik <- inclusionprobabilities(meuse$copper,30)

Sample selection

We perform sample selection easily using the wave() function.

s <- wave(X,pik)
sum(s)
#> [1] 30

Data visualization

The selected sample is visualized using ggplot2.

library(ggplot2)
p <- ggplot()+
  geom_sf(data = meuse_sf,aes(size=copper),show.legend = 'point',shape = 1,stroke = 0.3)+
  geom_polygon(data = data.frame(x = meuse.riv[,1],y = meuse.riv[,2]),
               aes(x = x,y = y),
               fill = "lightskyblue2",
               colour= "grey50")+
  geom_point(data = meuse,
             aes(x = x,y = y,size = copper),
             shape = 1,
             stroke = 0.3)+
  geom_point(data = meuse[which(s == 1),],
             aes(x = x,y = y,size = copper),
             shape = 16)+
  labs(x = "Longitude",
       y = "Latitude",
       title = NULL,
       size = "Copper",
       caption = NULL)+
  scale_size(range = c(0.5, 3.5))+
  theme_minimal()
p

Spatial balance

Voronoï polygons

One way of measuring the spread of a sample was developed by Stevens Jr. and Olsen (2004) and then suggested by Grafström et al. (2012). It is based on the Voronoï polygons and is given by

\[B(\bf s) = \frac{1}{n}\sum_{i \in s} (v_i - 1)^2\]

where \(v_i\) is equal to the sum of the inclusion probabilities inside the \(i\)th polygon and \(\bf s\) is the vector of size \(N\) with elements equal to 0 or 1. This quantity is implemented in the package BalancedSampling with the function sb(). We calculate the values of the \(v_i\) with the function sb_vk.

The closer \(B(\bf s)\) is to zero, the better the spatial balance of the sample. Graphically, we obtain the following plot.

library(sp)
library(sampling)
library(ggvoronoi)

data("meuse")
data("meuse.area")

v <- sb_vk(pik,as.matrix(meuse[,1:2]),s)
meuse$v <- v

p <- p + geom_voronoi(data = meuse[which(s == 1),],
               aes(x = x,y = y,fill = v),
               outline =as.data.frame(meuse.area),
               size = 0.1,
               colour = "black")+
  geom_point(data = meuse,
             aes(x = x,y = y,size = copper),
             shape = 1,
             stroke = 0.3)+
  geom_point(data = meuse[which(s == 1),],
             aes(x = x,y = y,size = copper),
             shape = 16)+
  scale_fill_gradient2(midpoint = 1)
p

BalancedSampling::sb(pik,as.matrix(meuse[,1:2]),which(s == 1))

#> [1] 0.0910097
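
The definition of \(B(\bf s)\) can be sketched in a few lines of base R. The helper sb_sketch below is hypothetical and is not the implementation used by BalancedSampling::sb(). It assigns each population unit to its nearest sampled unit and, as a simplifying assumption, gives a unit that is equidistant from several sampled units to the first of them.

```r
# Hypothetical helper: spatial balance from nearest-sample-unit cells.
sb_sketch <- function(pik, coords, s_idx) {
  # Squared distances between every population unit (rows)
  # and every sampled unit (columns)
  d2 <- outer(coords[, 1], coords[s_idx, 1], "-")^2 +
        outer(coords[, 2], coords[s_idx, 2], "-")^2
  # Column index of the closest sampled unit for each population unit
  nearest <- max.col(-d2, ties.method = "first")
  # v_i: sum of the inclusion probabilities falling in the i-th cell
  v <- tapply(pik, factor(nearest, levels = seq_along(s_idx)), sum)
  mean((v - 1)^2)
}

# Toy population: three groups of three units on a line, n = 3
coords_toy <- cbind(c(1, 2, 3, 11, 12, 13, 21, 22, 23), 0)
pik_toy <- rep(1 / 3, 9)

sb_sketch(pik_toy, coords_toy, c(2, 5, 8))  # one unit per group
sb_sketch(pik_toy, coords_toy, c(1, 2, 3))  # all units in the first group
```

In this toy example the first sample has one unit in each group, so every cell collects a total probability of 1 and the measure is 0. The second sample is clustered: two cells have \(v_i = 1/3\) and one has \(v_i = 7/3\), which gives a value of 8/9.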

Moran index

Another way to measure the spatial spread was developed by Tillé et al. (2018). It uses a corrected version of the traditional Moran’s \(I\) index. This measure uses spatial weights \(w_{ij}\) that indicate how close unit \(i\) is to unit \(j\). The weights are supposed to take the inclusion probabilities into account; hence the spatial weights matrix \(\bf W\) is generally not symmetric. The spatial balance measure is given by

\[I_B =\frac{(\bf s-\bf \bar{s}_w)^\top \bf W (\bf s-\bf \bar{s}_w)}{\sqrt{(\bf s-\bf \bar{s}_w)^\top \bf D (\bf s-\bf \bar{s}_w) (\bf s-\bf \bar{s}_w)^\top \bf B (\bf s-\bf \bar{s}_w) }},\]

where \(\bf D\) is the diagonal matrix containing the \(w_i\),

\[\bf \bar{s}_w = \bf 1 \frac{\bf s^\top \bf W \bf 1}{\bf 1^\top \bf W \bf 1},\]

and

\[\bf B = \bf W^\top \bf D^{-1} \bf W - \frac{\bf W^\top \bf 1\bf 1^\top \bf W}{\bf1^\top \bf W \bf 1}.\]

Moran’s \(I\) index is implemented in the function IB(). You can specify your own spatial weights with the argument W. There is no natural way of defining \(\bf W\); here we propose to consider, for each unit, only the nearest neighbours such that the inclusion probabilities of the resulting stratum sum to 1. This is implemented in the function wpik(). Another way of defining the spatial weights, developed by Tillé et al. (2018), uses the inverse of the inclusion probabilities \(1/\pi_i\) to determine the neighbours of unit \(i\). It is implemented in the function wpikInv(). As explained by Tillé et al. (2018), \(w_{ii}\) is supposed to be equal to 0 for all \(i \in U\). By construction, the matrix returned by wpik does not have a zero diagonal. So to calculate Moran’s \(I\) index with wpik, we need to subtract the diagonal from the returned matrix.

W <- wpik(X,pik)
W <- W - diag(diag(W))
IB(W,s)

#> [1] -0.4895601

W1 <- wpikInv(X,pik)
IB(W1,s)

#> [1] -0.4554427
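
To make the formula for \(I_B\) concrete, here is a base R sketch that evaluates it directly for a given weight matrix. The helper ib_sketch is hypothetical and is not the code behind IB(). It assumes that \(w_i\) denotes the \(i\)th row sum of \(\bf W\) and that all row sums are positive. The toy weights simply link consecutive units on a line and do not involve inclusion probabilities.

```r
# Hypothetical helper: direct evaluation of the I_B formula.
# Assumption: w_i is the i-th row sum of W, and all row sums are > 0.
ib_sketch <- function(W, s) {
  one <- rep(1, length(s))
  w_row <- rowSums(W)
  D <- diag(w_row)
  # Weighted mean of s: (s' W 1) / (1' W 1)
  s_bar <- sum(s * w_row) / sum(W)
  z <- s - s_bar
  B <- t(W) %*% diag(1 / w_row) %*% W -
    (t(W) %*% one %*% t(one) %*% W) / sum(W)
  num <- t(z) %*% W %*% z
  den <- sqrt((t(z) %*% D %*% z) * (t(z) %*% B %*% z))
  as.numeric(num / den)
}

# Toy weights: six units on a line, each linked to its direct neighbours
W_toy <- matrix(0, 6, 6)
W_toy[cbind(1:5, 2:6)] <- 1
W_toy[cbind(2:6, 1:5)] <- 1

ib_sketch(W_toy, c(1, 0, 1, 0, 1, 0))  # alternating sample
ib_sketch(W_toy, c(1, 1, 1, 0, 0, 0))  # clustered sample
```

In this toy setting the alternating sample reaches the lower bound of -1, while the clustered sample gives a positive value.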

References

Grafström, A., Lundström, N. L. P., and Schelin, L. (2012). Spatially balanced sampling through the pivotal method, Biometrics, 68(2):514-520 DOI: https://doi.org/10.1111/j.1541-0420.2011.01699.x

Grafström, A. and Tillé, Y. (2013). Doubly balanced spatial sampling with spreading and restitution of auxiliary totals, Environmetrics, 24(2):120-131 DOI: https://doi.org/10.1002/env.2194

Stevens Jr., D. L. and Olsen, A. R. (2004). Spatially balanced sampling of natural resources, Journal of the American Statistical Association, 99(465):262-278 DOI: https://doi.org/10.1198/016214504000000250

Tillé, Y., Dickson, M. M., Espa, G., and Giuliani, D. (2018). Measuring the spatial balance of a sample: A new measure based on Moran’s I index, Spatial Statistics, 23:182-192 DOI: https://doi.org/10.1016/j.spasta.2018.02.001

Statistical Matching using Optimal Transport

2021-10-19T00:00:00+00:00

Introduction

In this vignette, we explore how key functions from the package can be used to estimate a contingency table. Our analysis is based on the eusilc dataset from the laeken package. Each function discussed here is thoroughly explained in the manuscript by Raphaël Jauslin and Yves Tillé (2021), available at doi:10.1016/j.jspi.2022.12.003.

Contingency Table

To construct the contingency table, we examine the factor variable pl030, which represents economic status, in combination with a discretized version of the equivalized household income, eqIncome. The discretization process involves calculating specific percentiles (0.15, 0.30, 0.45, 0.60, 0.75, 0.90) of eqIncome and defining categorical intervals based on these values.

library(laeken)
library(sampling)
library(StratifiedSampling)

data("eusilc")
eusilc <- na.omit(eusilc)
N <- nrow(eusilc)


# Xm contains the matching variables and id the identifiers of the units
Xm <- eusilc[,c("hsize","db040","age","rb090","pb220a")]
Xmcat <- do.call(cbind,apply(Xm[,c(2,4,5)],MARGIN = 2,FUN = disjunctive))
Xm <- cbind(Xmcat,Xm[,-c(2,4,5)])
id <- eusilc$rb030


# categorical income, split at the percentiles
c_income  <- eusilc$eqIncome
q <- quantile(eusilc$eqIncome, probs = seq(0, 1, 0.15))
c_income[which(eusilc$eqIncome <= q[2])] <- "(0,15]"
c_income[which(q[2] < eusilc$eqIncome & eusilc$eqIncome <= q[3])] <- "(15,30]"
c_income[which(q[3] < eusilc$eqIncome & eusilc$eqIncome <= q[4])] <- "(30,45]"
c_income[which(q[4] < eusilc$eqIncome & eusilc$eqIncome <= q[5])] <- "(45,60]"
c_income[which(q[5] < eusilc$eqIncome & eusilc$eqIncome <= q[6])] <- "(60,75]"
c_income[which(q[6] < eusilc$eqIncome & eusilc$eqIncome <= q[7])] <- "(75,90]"
c_income[which(  eusilc$eqIncome > q[7] )] <- "(90,100]"

# variables of interest
Y <- data.frame(ecostat = eusilc$pl030)
Z <- data.frame(c_income = c_income)

# put same rownames
rownames(Xm) <- rownames(Y) <- rownames(Z)<- id

YZ <- table(cbind(Y,Z))
addmargins(YZ)

#>        c_income
#> ecostat (0,15] (15,30] (30,45] (45,60] (60,75] (75,90] (90,100]   Sum
#>     1      409     616     722     807     935    1025      648  5162
#>     2      189     181     205     184     165     154       82  1160
#>     3      137      90      72      75      59      52       33   518
#>     4      210     159     103      95      74      49       46   736
#>     5      470     462     492     477     459     435      351  3146
#>     6       57      25      28      30      17      11       10   178
#>     7      344     283     194     149     106      91       40  1207
#>     Sum   1816    1816    1816    1817    1815    1817     1210 12107
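
The discretization step can also be written with cut(). The sketch below is an alternative formulation on toy data, not the code used to produce the table above. The vector x_toy stands in for eusilc$eqIncome, and the labels are the same as in the vignette.

```r
x_toy <- 1:100   # toy stand-in for eusilc$eqIncome
q_toy <- quantile(x_toy, probs = c(0, 0.15, 0.30, 0.45, 0.60, 0.75, 0.90))

# Intervals are closed on the right, matching the "<=" comparisons above.
# The minimum (q_toy[1]) is replaced by -Inf so the smallest value is kept.
c_toy <- cut(x_toy,
             breaks = c(-Inf, q_toy[-1], Inf),
             labels = c("(0,15]", "(15,30]", "(30,45]", "(45,60]",
                        "(60,75]", "(75,90]", "(90,100]"))
table(c_toy)
```

On this toy vector the first six classes contain 15 values each and the last one contains 10. The result is a factor, so the class order is kept in tables without relying on alphabetical sorting.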

Sampling schemes

Here we set up the sampling designs and define all the quantities we will need for the rest of the vignette. The two samples are selected by simple random sampling without replacement, and the weights are equal to the inverse of the inclusion probabilities.

# size of sample
n1 <- 1000
n2 <- 500

# samples
s1 <- srswor(n1,N)
s2 <- srswor(n2,N)
  
# extract matching units
X1 <- Xm[s1 == 1,]
X2 <- Xm[s2 == 1,]
  
# extract variable of interest
Y1 <- data.frame(Y[s1 == 1,])
colnames(Y1) <- colnames(Y)
Z2 <- as.data.frame(Z[s2 == 1,])
colnames(Z2) <- colnames(Z)
  
# extract correct identities
id1 <- id[s1 == 1]
id2 <- id[s2 == 1]
  
# put correct rownames
rownames(Y1) <- id1
rownames(Z2) <- id2
  
# here weights are inverse of inclusion probabilities
d1 <- rep(N/n1,n1)
d2 <- rep(N/n2,n2)
  
# disjunctive form
Y_dis <- sampling::disjunctive(as.matrix(Y))
Z_dis <- sampling::disjunctive(as.matrix(Z))
  
Y1_dis <- Y_dis[s1 ==1,]
Z2_dis <- Z_dis[s2 ==1,]

Harmonization

Then the harmonization step must be performed. The harmonize function returns the harmonized weights. If the true population totals are known, it is possible to use these instead of the estimate made within the function.

re <- harmonize(X1,d1,id1,X2,d2,id2)  

# if we want to use the population totals to harmonize we can use 
re <- harmonize(X1,d1,id1,X2,d2,id2,totals = c(N,colSums(Xm)))

w1 <- re$w1
w2 <- re$w2

colSums(Xm)

#>      1      2      3      4      5      6      7      8      9     10     11 
#>    476    887   2340    763   1880   1021   2244   1938    558   6263   5844 
#>     12     13     14  hsize    age 
#>  11073    283    751  36380 559915

colSums(w1*X1)

#>      1      2      3      4      5      6      7      8      9     10     11 
#>    476    887   2340    763   1880   1021   2244   1938    558   6263   5844 
#>     12     13     14  hsize    age 
#>  11073    283    751  36380 559915

colSums(w2*X2)

#>      1      2      3      4      5      6      7      8      9     10     11 
#>    476    887   2340    763   1880   1021   2244   1938    558   6263   5844 
#>     12     13     14  hsize    age 
#>  11073    283    751  36380 559915
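
The outputs above show that both sets of harmonized weights reproduce the same totals. As a rough illustration of that calibration property, the base R sketch below adjusts design weights so that weighted sample totals match given totals. It uses linear calibration as a stand-in and makes no claim about the method harmonize uses internally. The data, totals and object names are invented for the example.

```r
set.seed(3)
n_a <- 50
# Toy matching variables: a constant and an age variable
X_a <- cbind(const = 1, age = round(runif(n_a, 20, 80)))
d_a <- rep(20, n_a)                      # design weights N / n with N = 1000
totals <- c(const = 1000, age = 50500)   # assumed known population totals

# Linear calibration: w = d * (1 + x' lambda), with lambda chosen so that
# the weighted sample totals reproduce `totals` exactly.
lambda <- solve(t(X_a) %*% (d_a * X_a), totals - colSums(d_a * X_a))
w_a <- d_a * as.numeric(1 + X_a %*% lambda)

colSums(d_a * X_a)   # totals estimated with the design weights
colSums(w_a * X_a)   # totals estimated with the calibrated weights
```

The calibrated weights return the target totals up to rounding error. Applying the same adjustment to two samples with common targets gives the kind of agreement seen in colSums(w1*X1) and colSums(w2*X2).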

Optimal transport matching

The statistical matching is done by using the otmatch function. The estimation of the contingency table is calculated by extracting the id1 units (respectively id2 units) and by using the function tapply with the correct weights.

# Optimal transport matching
object <- otmatch(X1,id1,X2,id2,w1,w2)
head(object[,1:3])

#>         id1    id2    weight
#> 702     702 251702 11.509002
#> 1      1401   1401 13.550397
#> 2506   2506 194004  8.315938
#> 2506.1 2506 324205  2.013395
#> 3001   3001 494702 10.976034
#> 3602   3602 503002 12.651816

Y1_ot <- cbind(X1[as.character(object$id1),],y = Y1[as.character(object$id1),])
Z2_ot <- cbind(X2[as.character(object$id2),],z = Z2[as.character(object$id2),])
YZ_ot <- tapply(object$weight,list(Y1_ot$y,Z2_ot$z),sum)

# transform NA into 0
YZ_ot[is.na(YZ_ot)] <- 0

# result
round(addmargins(YZ_ot),3)

#>       (0,15]  (15,30]  (30,45]  (45,60]  (60,75]  (75,90] (90,100]       Sum
#> 1    908.206  732.717  739.153  768.961  886.744  804.436  505.966  5346.183
#> 2    229.633  157.397  125.376  231.835  232.178  166.663  105.011  1248.094
#> 3    111.164  105.834   47.015   68.783   81.041   51.124   51.106   516.067
#> 4     60.987   66.797  104.289  210.126   76.667   92.290   12.085   623.241
#> 5    549.988  566.912  482.875  446.948  400.627  362.297  356.626  3166.273
#> 6      8.577   37.881   14.943   51.063    0.000   10.138    0.000   122.602
#> 7    176.177  164.798  193.780  190.152  119.938  166.566   73.129  1084.540
#> Sum 2044.732 1832.336 1707.432 1967.869 1797.195 1653.514 1103.922 12107.000
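
The replacement of NA by 0 is needed because tapply returns NA for every combination of categories that has no matched pair. A small base R sketch with invented data (the data.frame matched is purely illustrative) shows this:

```r
matched <- data.frame(
  y      = c("employed", "employed", "retired"),
  z      = c("low", "high", "low"),
  weight = c(10, 5, 8)
)

# Weighted contingency table: sum of the weights in each (y, z) cell
tab_raw <- tapply(matched$weight, list(matched$y, matched$z), sum)
tab_raw                 # the (retired, high) cell is NA: no matched pair

tab <- tab_raw
tab[is.na(tab)] <- 0    # an empty cell contributes a weight of zero
addmargins(tab)
```

After the replacement, the grand total in the margins equals the sum of all weights, here 23.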

Balanced sampling

As the previous section shows, optimal transport generally does not produce a one-to-one match: a unit in sample 1 can be matched to more than one unit in sample 2 with nonzero weight. The bsmatch function creates a one-to-one match by selecting a balanced stratified sample, which yields a data.frame where each unit in sample 1 has exactly one imputed unit from sample 2.

# Balanced Sampling 
BS <- bsmatch(object)
head(BS$object[,1:3])

#>         id1    id2    weight
#> 702     702 251702 11.509002
#> 1      1401   1401 13.550397
#> 2506   2506 194004  8.315938
#> 3001   3001 494702 10.976034
#> 3602   3602 503002 12.651816
#> 3901.1 3901 294601 10.970414

Y1_bs <- cbind(X1[as.character(BS$object$id1),],y = Y1[as.character(BS$object$id1),])
Z2_bs <- cbind(X2[as.character(BS$object$id2),],z = Z2[as.character(BS$object$id2),])
YZ_bs <- tapply(BS$object$weight/BS$q,list(Y1_bs$y,Z2_bs$z),sum)
YZ_bs[is.na(YZ_bs)] <- 0
round(addmargins(YZ_bs),3)

#>       (0,15]  (15,30]  (30,45]  (45,60]  (60,75]  (75,90] (90,100]       Sum
#> 1    950.180  747.323  753.780  706.298  903.800  786.384  498.417  5346.183
#> 2    202.833  138.314  153.513  220.030  237.620  175.001  120.782  1248.094
#> 3     93.911   92.113   52.996   78.145   85.264   42.861   70.776   516.067
#> 4     69.117   58.966  102.611  227.049   68.017   85.771   11.710   623.241
#> 5    516.973  554.095  495.790  464.549  395.341  331.096  408.429  3166.273
#> 6      8.367   37.881   14.943   51.274    0.000   10.138    0.000   122.602
#> 7    171.059  177.551  181.066  213.189  102.669  172.524   66.482  1084.540
#> Sum 2012.440 1806.244 1754.699 1960.535 1792.710 1603.775 1176.597 12107.000

# With Z2 as auxiliary information for stratified balanced sampling.
BS <- bsmatch(object,Z2)

Y1_bs <- cbind(X1[as.character(BS$object$id1),],y = Y1[as.character(BS$object$id1),])
Z2_bs <- cbind(X2[as.character(BS$object$id2),],z = Z2[as.character(BS$object$id2),])
YZ_bs <- tapply(BS$object$weight/BS$q,list(Y1_bs$y,Z2_bs$z),sum)
YZ_bs[is.na(YZ_bs)] <- 0
round(addmargins(YZ_bs),3)

#>       (0,15]  (15,30]  (30,45]  (45,60]  (60,75]  (75,90] (90,100]       Sum
#> 1    916.607  733.295  727.348  783.917  893.807  804.205  487.003  5346.183
#> 2    215.298  139.840  115.908  246.175  238.891  195.369   96.613  1248.094
#> 3    105.798  103.158   75.489   55.427   72.337   42.861   60.997   516.067
#> 4     46.193   70.368  114.037  190.878   91.443   98.613   11.710   623.241
#> 5    571.876  569.674  459.916  460.539  378.955  356.515  368.799  3166.273
#> 6      8.367   37.881   14.943   51.274    0.000   10.138    0.000   122.602
#> 7    186.803  174.473  191.122  180.442  130.024  144.693   76.984  1084.540
#> Sum 2050.942 1828.688 1698.763 1968.652 1805.457 1652.393 1102.105 12107.000
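
The idea of keeping one matched unit per id1 and dividing the weight by the selection probability can be sketched in base R. This is a simplified stand-in that draws independently within each id1; it is not the balanced stratified sampling performed by bsmatch. The toy data.frame obj_bs and its column q are invented for the example.

```r
set.seed(42)
obj_bs <- data.frame(
  id1    = c(1, 2, 2, 3, 3, 3),
  id2    = c(11, 12, 13, 11, 14, 15),
  weight = c(10, 8, 2, 3, 3, 4)
)

# Selection probability of each row within its id1 (sums to 1 per id1)
obj_bs$q <- obj_bs$weight / ave(obj_bs$weight, obj_bs$id1, FUN = sum)

# Draw exactly one row per id1, with probability q
pick <- unlist(lapply(split(seq_len(nrow(obj_bs)), obj_bs$id1), function(rows) {
  rows[sample.int(length(rows), size = 1, prob = obj_bs$q[rows])]
}))
one_to_one <- obj_bs[pick, ]

one_to_one
# Dividing by q restores the full weight carried by each id1
one_to_one$weight / one_to_one$q
```

Whichever rows are drawn, weight / q equals the total weight of the corresponding id1, so the sum over the selected rows stays equal to the sum of all weights. This is why the row margins of the tables above are identical for otmatch and bsmatch.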

Prediction

The optimal transport weights can also be used for prediction: normalizing the weights within each unit of sample 1 gives, for that unit, a predicted distribution over the categories of the variable observed in sample 2.

# split the weight by id1
q_l <- split(object$weight,f = object$id1)
# normalize in each id1
q_l <- lapply(q_l, function(x){x/sum(x)})
q <- as.numeric(do.call(c,q_l))
  
Z_pred <- t(q*disjunctive(object$id1))%*%disjunctive(Z2[as.character(object$id2),])
colnames(Z_pred) <- levels(factor(Z2$c_income))
head(Z_pred)

#>         (0,15]   (15,30]    (30,45] (45,60] (60,75] (75,90] (90,100]
#> [1,] 0.0000000 0.0000000 1.00000000       0       0       0        0
#> [2,] 0.0000000 0.0000000 0.00000000       1       0       0        0
#> [3,] 0.1949201 0.8050799 0.00000000       0       0       0        0
#> [4,] 0.0000000 0.0000000 0.00000000       0       1       0        0
#> [5,] 0.0000000 0.0000000 1.00000000       0       0       0        0
#> [6,] 0.7749145 0.1455486 0.07953691       0       0       0        0
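
The same computation can be followed on a small example. The base R sketch below uses an invented data.frame (obj_toy) in place of the otmatch output and replaces the disjunctive matrices by ave() and xtabs(). It is an equivalent formulation for illustration, not the code of the package.

```r
obj_toy <- data.frame(
  id1    = c("a", "b", "b", "c"),
  z      = c("low", "low", "high", "high"),   # category of the matched id2
  weight = c(10, 8, 2, 5)
)

# Normalize the weights within each id1 so that they sum to one
obj_toy$q <- obj_toy$weight / ave(obj_toy$weight, obj_toy$id1, FUN = sum)

# One row per id1, one column per category of z
Z_pred_toy <- xtabs(q ~ id1 + z, data = obj_toy)
Z_pred_toy
rowSums(Z_pred_toy)   # each row is a probability distribution
```

Unit b is matched to two units of the second sample with weights 8 and 2, so its predicted distribution is 0.8 on low and 0.2 on high. Units matched to a single unit get a row with a single 1, as in rows 1, 2, 4 and 5 of the output above.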