Title: | Spatially Balanced Sampling |
---|---|
Description: | Selection of spatially balanced samples. In particular, the implemented sampling designs make it possible to select probability samples that are well spread over the population of interest, in any dimension and using any distance function (e.g. Euclidean distance, Manhattan distance). For more details, see Pantalone F, Benedetti R, and Piersimoni F (2022) <doi:10.18637/jss.v103.c02>, Benedetti R and Piersimoni F (2017) <doi:10.1002/bimj.201600194>, and Benedetti R and Piersimoni F (2017) <arXiv:1710.09116>. The implementation is in C++, using 'Rcpp' and 'RcppArmadillo'. |
Authors: | Francesco Pantalone [aut, cre], Roberto Benedetti [aut], Federica Piersimoni [aut] |
Maintainer: | Francesco Pantalone <[email protected]> |
License: | GPL-3 |
Version: | 1.3.5 |
Built: | 2025-02-17 05:15:24 UTC |
Source: | https://github.com/francescopantalone/spbsampling |
Selects spatially balanced samples through the use of the Heuristic Product Within Distance design (HPWD). To have constant inclusion probabilities n/N, where n is the sample size and N is the population size, the distance matrix has to be standardized with the function stprod.
hpwd(dis, n, beta = 10, nrepl = 1L)
dis |
A distance matrix NxN that specifies how far all the pairs of units in the population are. |
n |
Sample size. |
beta |
Parameter controlling the spread of the sample; larger values give a more spread sample (default = 10). |
nrepl |
Number of samples to draw (default = 1). |
The HPWD design generates samples with approximately the same probabilities as the PWD design, but with a significantly smaller number of steps. The algorithm randomly selects a sample of size n in exactly n steps, updating at each step the selection probability of the not-yet-selected units according to their distance from the units already selected in the previous steps.
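The step-by-step selection described above can be illustrated with a small base-R sketch. It is not the package's C++ implementation: the helper name `hpwd_sketch` is hypothetical, and the update rule (each weight is multiplied by the distance to the newly selected unit raised to `beta`) is an assumption made only for illustration.

```r
# Hypothetical helper, not part of the package: a sequential draw of n units
# in n steps, where selection weights depend on the distance to the units
# already selected.
hpwd_sketch <- function(dis, n, beta = 10) {
  N <- nrow(dis)
  stopifnot(ncol(dis) == N, n >= 1, n <= N)
  log_w <- rep(0, N)            # equal weights before the first draw (log scale)
  selected <- integer(0)
  for (step in seq_len(n)) {
    candidates <- setdiff(seq_len(N), selected)
    lw <- log_w[candidates]
    p <- exp(lw - max(lw))      # unnormalised selection probabilities
    pick <- candidates[sample.int(length(candidates), 1, prob = p)]
    selected <- c(selected, pick)
    # Assumed update: weight of each unit is multiplied by dis(unit, pick)^beta,
    # so units close to the selected ones become unlikely to be drawn.
    log_w <- log_w + beta * log(dis[, pick])
  }
  selected
}

set.seed(1)
xy <- cbind(runif(50), runif(50))   # 50 simulated points in the unit square
dis <- as.matrix(dist(xy))          # Euclidean distance matrix
s <- hpwd_sketch(dis, n = 10)       # labels of the 10 selected units
```

Working on the log scale keeps the weights finite when `beta` is large; the sketch assumes all units have distinct coordinates.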
Returns an nrepl x n matrix containing the nrepl selected samples, one per row: the i-th row contains the labels of the units selected in the i-th sample.
Benedetti R, Piersimoni F (2017). A spatially balanced design with probability function proportional to the within sample distance. Biometrical Journal, 59(5), 1067-1084. doi:10.1002/bimj.201600194
Benedetti R, Piersimoni F (2017). Fast Selection of Spatially Balanced Samples. arXiv. https://arxiv.org/abs/1710.09116
    # Example 1
    # Draw 1 sample of dimension 10 without constant inclusion probabilities
    dis <- as.matrix(dist(cbind(lucas_abruzzo$x, lucas_abruzzo$y))) # distance matrix
    s <- hpwd(dis = dis, n = 10) # drawn sample

    # Example 2
    # Draw 1 sample of dimension 15 with constant inclusion probabilities
    # equal to n/N, with N = population size
    dis <- as.matrix(dist(cbind(lucas_abruzzo$x, lucas_abruzzo$y))) # distance matrix
    con <- rep(0, nrow(dis)) # vector of constraints
    stand_dist <- stprod(mat = dis, con = con) # standardized matrix
    s <- hpwd(dis = stand_dist$mat, n = 15) # drawn sample

    # Example 3
    # Draw 2 samples of dimension 15 with constant inclusion probabilities
    # equal to n/N, with N = population size, and an increased level of spread, beta = 20
    dis <- as.matrix(dist(cbind(lucas_abruzzo$x, lucas_abruzzo$y))) # distance matrix
    con <- rep(0, nrow(dis)) # vector of constraints
    stand_dist <- stprod(mat = dis, con = con) # standardized matrix
    s <- hpwd(dis = stand_dist$mat, n = 15, beta = 20, nrepl = 2) # drawn samples
The dataset contains the total income of the municipalities in the region "Emilia Romagna", in Italy, for the year 2015. Each municipality is identified by its own ISTAT (Istituto nazionale di statistica, Italy) code and a name. For each municipality there are the following auxiliary variables: province, number of taxpayers and spatial coordinates (geographical position).
income_emilia
A data frame with 334 rows and 7 variables:
identification municipality code
name of the municipality
province of the municipality
number of taxpayers in the municipality
average income of the municipality
coordinate x of the municipality
coordinate y of the municipality
The dataset is a rearrangement from the data released by the Italian Finance Department, MEF - Dipartimento delle Finanze (Italy).
The land use/cover area frame statistical survey, abbreviated as LUCAS, is a European field survey program funded and executed by Eurostat. Its objective is to set up area frame surveys for the provision of coherent and harmonised statistics on land use and land cover in the European Union (EU). Note that in the LUCAS survey the concept of land is extended to inland water areas (lakes, rivers, coastal areas, etc.) and does not embrace uses below the earth's surface (mine deposits, subways, etc.). The LUCAS survey is a point survey: the basic unit of observation is a circle with a radius of 1.5m (corresponding to an identifiable point on an orthophoto). The classification makes a clear distinction between land cover and land use: land cover means the physical cover ("material") observed at the earth's surface; land use means the socio-economic function of the observed earth's surface. Each point is assigned a code for both, identifying its type. Land cover has 8 main categories, which are indicated by a letter:
A: artificial land
B: cropland
C: woodland
D: shrubland
E: grassland
F: bareland
G: water
H: wetlands
Every main category has subclasses, which are indicated by the combination of the letter of the category and digits. Altogether there are 84 classes. Land use has 14 main categories. It has altogether 33 classes, which are indicated by the combination of the letter U and three digits.
lucas_abruzzo
A data frame with 2699 rows and 7 variables:
identification code of the unit spatial point
province
elevation of the unit spatial point, meant as the height above or below sea level
coordinate x
coordinate y
land cover code
land use code
The dataset is a rearrangement of the data from LUCAS 2012 for the region "Abruzzo", Italy. https://ec.europa.eu/eurostat/web/lucas/data/primary-data/2012
Selects spatially balanced samples through the use of the Product Within Distance design (PWD). To have constant inclusion probabilities n/N, where n is the sample size and N is the population size, the distance matrix has to be standardized with the function stprod.
pwd(dis, n, beta = 10, nrepl = 1L, niter = 10L)
dis |
A distance matrix NxN that specifies how far all the pairs of units in the population are. |
n |
Sample size. |
beta |
Parameter controlling the spread of the sample; larger values give a more spread sample (default = 10). |
nrepl |
Number of samples to draw (default = 1). |
niter |
Maximum number of iterations for the algorithm. More iterations are better but require more time. Usually 10 is very efficient (default = 10). |
Returns a list with the following components:
s, an nrepl x n matrix containing the nrepl selected samples, one per row: the i-th row contains the labels of the units selected in the i-th sample.
iterations, the number of iterations run by the algorithm.
Benedetti R, Piersimoni F (2017). A spatially balanced design with probability function proportional to the within sample distance. Biometrical Journal, 59(5), 1067-1084. doi:10.1002/bimj.201600194
    # Example 1
    # Draw 1 sample of dimension 15 without constant inclusion probabilities
    dis <- as.matrix(dist(cbind(lucas_abruzzo$x, lucas_abruzzo$y))) # distance matrix
    s <- pwd(dis = dis, n = 15)$s # drawn sample

    # Example 2
    # Draw 1 sample of dimension 15 with constant inclusion probabilities
    # equal to n/N, with N = population size
    dis <- as.matrix(dist(cbind(lucas_abruzzo$x, lucas_abruzzo$y))) # distance matrix
    con <- rep(0, nrow(dis)) # vector of constraints
    stand_dist <- stprod(mat = dis, con = con) # standardized matrix
    s <- pwd(dis = stand_dist$mat, n = 15)$s # drawn sample

    # Example 3
    # Draw 2 samples of dimension 15 with constant inclusion probabilities
    # equal to n/N, with N = population size, and an increased level of spread, beta = 20
    dis <- as.matrix(dist(cbind(lucas_abruzzo$x, lucas_abruzzo$y))) # distance matrix
    con <- rep(0, nrow(dis)) # vector of constraints
    stand_dist <- stprod(mat = dis, con = con) # standardized matrix
    s <- pwd(dis = stand_dist$mat, n = 15, beta = 20, nrepl = 2)$s # drawn samples
Computes the Spatial Balance Index (SBI), which is a measure of spatial balance of a sample. The lower it is, the better the spread.
sbi(dis, pi, s)
dis |
A distance matrix NxN that specifies how far all the pairs of units in the population are. |
pi |
A vector of first order inclusion probabilities of the units of the population. |
s |
A vector of labels of the sample. |
The SBI is based on Voronoi polygons. Given a sample s, each unit i in the sample has its own Voronoi polygon, which is composed of all population units closer to i than to any other sample unit j. Then, for each Voronoi polygon, define v_i as the sum of the inclusion probabilities of all units in the i-th Voronoi polygon. Finally, the variance of v_i is the SBI.
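The Voronoi-based construction above can be written out in a few lines of base R. This is a minimal sketch rather than the package's implementation: `sbi_sketch` is a hypothetical name, ties between equidistant sample units are broken by taking the first one, and the variance is assumed to be the usual sample variance returned by `var()` (the package may normalise differently).

```r
# Hypothetical helper, not part of the package.
sbi_sketch <- function(dis, pi, s) {
  s <- as.vector(s)
  # Assign every population unit to its nearest sample unit (its Voronoi polygon).
  nearest <- apply(dis[, s, drop = FALSE], 1, which.min)
  # v_i: sum of the inclusion probabilities of the units in the i-th polygon.
  v <- vapply(seq_along(s), function(i) sum(pi[nearest == i]), numeric(1))
  var(v)
}

# Four units on a line, sample size 2, constant inclusion probabilities 2/4.
x <- c(0, 1, 10, 11)
dis <- as.matrix(dist(x))
pi <- rep(0.5, 4)

balanced   <- sbi_sketch(dis, pi, s = c(1, 3))  # one unit per cluster: v = (1, 1), SBI = 0
unbalanced <- sbi_sketch(dis, pi, s = c(1, 2))  # both in one cluster: v = (0.5, 1.5), SBI = 0.5
```

The well-spread sample gives polygons with equal total probability and hence an index of 0, while the clustered sample gives a larger value, in line with "the lower it is, the better the spread".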
Returns the Spatial Balance Index.
Stevens DL, Olsen AR (2004). Spatially Balanced Sampling of Natural Resources. Journal of the American Statistical Association, 99(465), 262-278. doi:10.1198/016214504000000250
    dis <- as.matrix(dist(cbind(simul1$x, simul1$y))) # distance matrix
    con <- rep(0, nrow(dis)) # vector of constraints
    stand_dist <- stprod(mat = dis, con = con) # standardized matrix
    pi <- rep(100 / nrow(dis), nrow(dis)) # vector of inclusion probabilities
    s <- pwd(dis = stand_dist$mat, n = 100)$s # sample
    sbi(dis = dis, pi = pi, s = s)
The dataset contains a simulated georeferenced population of dimension N = 1000. The coordinates are generated as a simulated realization of a particular random point pattern: the Neyman-Scott process with Cauchy cluster kernel. The nine values for each unit are generated according to the outcome of a Gaussian stochastic process, with an intensity dependence parameter corresponding to low dependence and with no spatial trend.
simul1
A data frame with 1000 rows and 11 variables:
coordinate x
coordinate y
first value of the unit
second value of the unit
third value of the unit
fourth value of the unit
fifth value of the unit
sixth value of the unit
seventh value of the unit
eighth value of the unit
ninth value of the unit
Benedetti R, Piersimoni F (2017). A spatially balanced design with probability function proportional to the within sample distance. Biometrical Journal, 59(5), 1067-1084.
The dataset contains a simulated georeferenced population of dimension N = 1000. The coordinates are generated as a simulated realization of a particular random point pattern: the Neyman-Scott process with Cauchy cluster kernel. The nine values for each unit are generated according to the outcome of a Gaussian stochastic process, with an intensity dependence parameter corresponding to medium dependence and with a spatial trend.
simul2
A data frame with 1000 rows and 11 variables:
coordinate x
coordinate y
first value of the unit
second value of the unit
third value of the unit
fourth value of the unit
fifth value of the unit
sixth value of the unit
seventh value of the unit
eighth value of the unit
ninth value of the unit
Benedetti R, Piersimoni F (2017). A spatially balanced design with probability function proportional to the within sample distance. Biometrical Journal, 59(5), 1067-1084.
The dataset contains a simulated georeferenced population of dimension N = 1000. The coordinates are generated as a simulated realization of a particular random point pattern: the Neyman-Scott process with Cauchy cluster kernel. The nine values for each unit are generated according to the outcome of a Gaussian stochastic process, with an intensity dependence parameter corresponding to high dependence and with a spatial trend.
simul3
A data frame with 1000 rows and 11 variables:
coordinate x
coordinate y
first value of the unit
second value of the unit
third value of the unit
fourth value of the unit
fifth value of the unit
sixth value of the unit
seventh value of the unit
eighth value of the unit
ninth value of the unit
Benedetti R, Piersimoni F (2017). A spatially balanced design with probability function proportional to the within sample distance. Biometrical Journal, 59(5), 1067-1084.
Selection of spatially balanced samples. In particular, the implemented sampling designs make it possible to select probability samples that are well spread over the population of interest, in any dimension and using any distance function (e.g. Euclidean distance, Manhattan distance). The implementation is in C++, using Rcpp and RcppArmadillo.
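Because the designs only need a distance matrix, switching distance function amounts to changing how that matrix is built. The short example below uses base R's `dist()` on three made-up points (the coordinates are illustrative, not taken from the package datasets) to build a Euclidean and a Manhattan distance matrix.

```r
# Three illustrative points in the plane.
xy <- cbind(x = c(0, 3, 1), y = c(0, 4, 1))

dis_euclid    <- as.matrix(dist(xy))                        # default: Euclidean
dis_manhattan <- as.matrix(dist(xy, method = "manhattan"))  # sum of absolute differences

dis_euclid[1, 2]     # 5
dis_manhattan[1, 2]  # 7
```

Either matrix is a symmetric NxN matrix with a zero diagonal, which is the form expected for the dis argument of the sampling functions.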
Francesco Pantalone, Roberto Benedetti, Federica Piersimoni
Maintainer: Francesco Pantalone [email protected]
Pantalone F, Benedetti R, Piersimoni F (2022). An R Package for Spatially Balanced Sampling. Journal of Statistical Software, Code Snippets, 103(2), 1-22. <doi:10.18637/jss.v103.c02>
Benedetti R, Piersimoni F (2017). A spatially balanced design with probability function proportional to the within sample distance. Biometrical Journal, 59(5), 1067-1084. doi:10.1002/bimj.201600194
Benedetti R, Piersimoni F (2017). Fast Selection of Spatially Balanced Samples. arXiv. https://arxiv.org/abs/1710.09116
Standardizes a distance matrix to fixed row and column products. The function iteratively constrains a logarithmically transformed matrix to known products and, in order to keep the symmetry of the matrix, averages it with its transpose at each iteration. When the known products are all equal to a constant (e.g. 0), this method provides a simple and accurate way to scale a distance matrix to a doubly stochastic matrix.
stprod(mat, con, differ = 1e-15, niter = 1000L)
mat |
A distance matrix size NxN. |
con |
A vector of row (column) constraints. |
differ |
A scalar with the maximum accepted difference with the constraint (default = 1e-15). |
niter |
An integer with the maximum number of iterations (default = 1000). |
The standardized matrix will not be affected by problems arising from units with different inclusion probabilities caused by undesired features of the spatial distribution of the population, such as edge effects and/or isolated points.
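The iterative scheme described for stprod (constrain the log-transformed matrix, then average with the transpose) can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's C++ code: `stprod_sketch` is a hypothetical name, the constraints `con` are assumed to be on the log scale (so `con = 0` means every row product equals 1), the diagonal is excluded from the products, and a looser tolerance than the package default is used.

```r
# Hypothetical helper, not part of the package.
stprod_sketch <- function(mat, con, differ = 1e-10, niter = 1000L) {
  N <- nrow(mat)
  off <- row(mat) != col(mat)         # off-diagonal mask
  L <- matrix(0, N, N)
  L[off] <- log(mat[off])             # log-distances; the diagonal is left at 0
  for (it in seq_len(niter)) {
    gap <- con - rowSums(L)           # how far each row log-product is from its target
    if (max(abs(gap)) < differ) break
    L <- L + off * (gap / (N - 1))    # shift row i so that its log-product equals con[i]
    L <- (L + t(L)) / 2               # average with the transpose to restore symmetry
  }
  out <- exp(L)
  diag(out) <- 0
  list(mat = out, iterations = it, conv = max(abs(con - rowSums(L))))
}

set.seed(42)
xy <- cbind(runif(20), runif(20))     # 20 simulated points
dis <- as.matrix(dist(xy))
st <- stprod_sketch(dis, con = rep(0, nrow(dis)))

# Row products of the off-diagonal entries (the diagonal is replaced by 1).
row_products <- apply(st$mat + diag(nrow(dis)), 1, prod)
```

The row shift alone would break symmetry, which is why each pass ends with the averaging step; the two steps are repeated until the row constraints hold within `differ`.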
Returns a list with the following components:
mat, the standardized distance matrix of size NxN.
iterations, the number of iterations run by the algorithm.
conv, the convergence reached by the algorithm.
Benedetti R, Piersimoni F (2017). A spatially balanced design with probability function proportional to the within sample distance. Biometrical Journal, 59(5), 1067-1084. doi:10.1002/bimj.201600194
    dis <- as.matrix(dist(cbind(simul1$x, simul1$y))) # distance matrix
    con <- rep(0, nrow(dis)) # vector of constraints
    stand_dist <- stprod(mat = dis, con = con) # standardized matrix
Standardizes a distance matrix to fixed row and column totals. The function iteratively constrains the row sums of the matrix to known totals and, in order to keep the symmetry of the matrix, averages it with its transpose at each iteration. When the known totals are all equal to a constant (e.g. 1), this method provides a simple and accurate way to scale a distance matrix to a doubly stochastic matrix.
stsum(mat, con, differ = 1e-15, niter = 1000L)
mat |
A distance matrix size NxN. |
con |
A vector of row (column) constraints. |
differ |
A scalar with the maximum accepted difference with the constraint (default = 1e-15). |
niter |
An integer with the maximum number of iterations (default = 1000). |
The standardized matrix will not be affected by problems arising from units with different inclusion probabilities caused by undesired features of the spatial distribution of the population, such as edge effects and/or isolated points.
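The row-sum standardization described for stsum can be sketched in the same spirit. Again this is a base-R illustration and not the package's C++ code: `stsum_sketch` is a hypothetical name, and the tolerance is looser than the package default.

```r
# Hypothetical helper, not part of the package.
stsum_sketch <- function(mat, con, differ = 1e-10, niter = 1000L) {
  M <- mat
  for (it in seq_len(niter)) {
    gap <- con - rowSums(M)           # how far each row total is from its target
    if (max(abs(gap)) < differ) break
    M <- M * (con / rowSums(M))       # rescale row i so that it sums to con[i]
    M <- (M + t(M)) / 2               # average with the transpose to restore symmetry
  }
  list(mat = M, iterations = it, conv = max(abs(con - rowSums(M))))
}

set.seed(42)
xy <- cbind(runif(20), runif(20))     # 20 simulated points
dis <- as.matrix(dist(xy))
st <- stsum_sketch(dis, con = rep(1, nrow(dis)))
```

With all totals equal to 1 and a symmetric result, both the row sums and the column sums of `st$mat` equal 1, which is the doubly stochastic scaling mentioned in the description.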
Returns a list with the following components:
mat, the standardized distance matrix of size NxN.
iterations, the number of iterations run by the algorithm.
conv, the convergence reached by the algorithm.
Benedetti R, Piersimoni F (2017). A spatially balanced design with probability function proportional to the within sample distance. Biometrical Journal, 59(5), 1067-1084. doi:10.1002/bimj.201600194
    dis <- as.matrix(dist(cbind(simul2$x, simul2$y))) # distance matrix
    con <- rep(1, nrow(dis)) # vector of constraints
    stand_dist <- stsum(mat = dis, con = con) # standardized matrix
Selects spatially balanced samples through the use of the Sum Within Distance design (SWD). To have constant inclusion probabilities n/N, where n is the sample size and N is the population size, the distance matrix has to be standardized with the function stsum.
swd(dis, n, beta = 10, nrepl = 1L, niter = 10L)
dis |
A distance matrix NxN that specifies how far all the pairs of units in the population are. |
n |
Sample size. |
beta |
Parameter controlling the spread of the sample; larger values give a more spread sample (default = 10). |
nrepl |
Number of samples to draw (default = 1). |
niter |
Maximum number of iterations for the algorithm. More iterations are better but require more time. Usually 10 is very efficient (default = 10). |
Returns a list with the following components:
s, an nrepl x n matrix containing the nrepl selected samples, one per row: the i-th row contains the labels of the units selected in the i-th sample.
iterations, the number of iterations run by the algorithm.
Benedetti R, Piersimoni F (2017). A spatially balanced design with probability function proportional to the within sample distance. Biometrical Journal, 59(5), 1067-1084. doi:10.1002/bimj.201600194
    # Example 1
    # Draw 1 sample of dimension 15 without constant inclusion probabilities
    dis <- as.matrix(dist(cbind(income_emilia$x_coord, income_emilia$y_coord))) # distance matrix
    s <- swd(dis = dis, n = 15)$s # drawn sample

    # Example 2
    # Draw 1 sample of dimension 15 with constant inclusion probabilities
    # equal to n/N, with N = population size
    dis <- as.matrix(dist(cbind(income_emilia$x_coord, income_emilia$y_coord))) # distance matrix
    con <- rep(1, nrow(dis)) # vector of constraints
    stand_dist <- stsum(mat = dis, con = con) # standardized matrix
    s <- swd(dis = stand_dist$mat, n = 15)$s # drawn sample

    # Example 3
    # Draw 2 samples of dimension 15 with constant inclusion probabilities
    # equal to n/N, with N = population size, and an increased level of spread, i.e. beta = 20
    dis <- as.matrix(dist(cbind(income_emilia$x_coord, income_emilia$y_coord))) # distance matrix
    con <- rep(1, nrow(dis)) # vector of constraints
    stand_dist <- stsum(mat = dis, con = con) # standardized matrix
    s <- swd(dis = stand_dist$mat, n = 15, beta = 20, nrepl = 2)$s # drawn samples