Title: | Nonparametric Kernel Estimation of Distribution Function |
Version: | 1.3-1 |
Date: | 2025-06-18 |
Author: | Alejandro Quintela-del-Río [aut], Graciela Estévez-Pérez [aut], Ignacio López-de-Ullibarri [cre] |
Maintainer: | Ignacio López-de-Ullibarri <ignacio.lopezdeullibarri@udc.es> |
Depends: | R (≥ 4.5.0) |
Suggests: | chron, date, evir |
Description: | Nonparametric kernel distribution function estimation is performed. Three bandwidth selectors are implemented: the plug-in selectors of Altman and Leger and of Polansky and Baker, and the cross-validation selector of Bowman, Hall and Prvan. The exceedance function, the mean return period and the return level are also computed. For details, see Quintela-del-Río and Estévez-Pérez (2012) <doi:10.18637/jss.v050.i08>. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
LazyLoad: | yes |
Encoding: | UTF-8 |
NeedsCompilation: | no |
Packaged: | 2025-06-18 15:45:41 UTC; ilu |
Repository: | CRAN |
Date/Publication: | 2025-06-23 10:20:06 UTC |
Nonparametric Kernel Estimation of the Distribution Function
Description
Nonparametric kernel distribution function estimation for continuous random variables is performed. Three automatic bandwidth selection procedures for nonparametric kernel distribution function estimation are implemented: the plug-in method of Altman and Leger, the plug-in method of Polansky and Baker, and the modified cross-validation method of Bowman, Hall and Prvan. The exceedance function, the mean return period and the return level are also computed.
Author(s)
Graciela Estévez Pérez and Alejandro Quintela del Río
Maintainer: Ignacio López-de-Ullibarri
References
Quintela-del-Río, A. and Estévez-Pérez, G. (2012), "Nonparametric kernel distribution function estimation with kerdiest: an R package for bandwidth choice and applications", Journal of Statistical Software, 50(8), 1-21.
Compute the Plug-in Bandwidth of Altman and Leger
Description
The bandwidth parameter for the distribution function kernel estimator is calculated, using the plug-in method of Altman and Leger (1995). Four possible kernel functions can be used for the kernel estimator: "e" Epanechnikov, "n" Normal, "b" Biweight and "t" Triweight.
Usage
ALbw(type_kernel = "n", vec_data)
Arguments
type_kernel |
The kernel function. You can use four types: "e" Epanechnikov, "n" Normal, "b" Biweight and "t" Triweight. The normal kernel is used by default. |
vec_data |
The data sample. |
Details
Altman and Leger (1995) recommend the use of the Epanechnikov kernel, because in this case the rate of convergence for the kernel derivative estimator is improved. For the sake of uniformity along the package, the gaussian kernel is used by default, but the user can obviously choose the Epanechnikov function.
Value
A numeric value for the bandwidth parameter.
Author(s)
Graciela Estévez Pérez and Alejandro Quintela del Río
References
Altman, N. and Leger, C. (1995), "Bandwidth selection for kernel distribution function estimation", Journal of Statistical Planning and Inference, 46, 195-214.
Quintela-del-Río, A. and Estévez-Pérez, G. (2012), "Nonparametric kernel distribution function estimation with kerdiest: an R package for bandwidth choice and applications", Journal of Statistical Software, 50(8), 1-21.
Examples
## Compute the plug-in bandwidth for a sample of 100 random N(0,1) data
x <- rnorm(100, 0, 1)
h_AL <- ALbw(type_kernel = "e", vec_data = x)
h_AL
## A quick plot of a distribution function estimate
x <- rnorm(1000)
h_AL <- ALbw(vec_data = x)
F_AL <- kde(vec_data = x, bw = h_AL)
plot(F_AL$grid, F_AL$Estimated_values, type = "l")
## Plotting the distribution function estimate, controling the grid points
## and the kernel function
ss <- quantile(x, c(0.05, 0.95))
## Number of points to be used in the representation of estimated distribution
## function
n_pts <- 100
y <- seq(ss[1], ss[2], length.out = n_pts)
F_AL <- kde(type_kernel = "e", x, y, h_AL)$Estimated_values
## Plot of the theoretical and estimated distribution functions
plot(y, F_AL, type = "l", lty = 2)
lines(y, pnorm(y), type = "l", lty = 1)
legend(-1, 0.8, c("Real", "Nonparametric"), lty = 1:2)
Compute the Cross-Validation Bandwidth of Bowman et al (1998)
Description
The bandwidth parameter for the distribution function kernel estimator is calculated using the modified cross-validation method of Bowman, Hall and Prvan (1998). Four possible kernel functions can be used: "e" Epanechnikov, "n" Normal, "b" Biweight and "t" Triweight. The cross-validation function involves an integral term, that is calculated using Simpson's rule.
Usage
CVbw(type_kernel = "n", vec_data, n_pts = 100, seq_bws = NULL)
Arguments
type_kernel |
The kernel function. You can use four types: "e" Epanechnikov, "n" Normal, "b" Biweight and "t" Triweight. The normal kernel is used by default. |
vec_data |
The data sample. |
n_pts |
The number of points used to approximate, by Simpson's rule, the integral term in the cross-validation function. Because this numeric method increases the computing time, you can check different numbers, depending on your sample size. By default, 100 points are used. |
seq_bws |
The sequence of bandwidths in which to compute the cross-validation function. By default, the procedure defines a sequence of 50 points, from the range of the data divided by 200 to the range divided by 2. |
Value
A list consisting of:
seq_bws |
The sequence of bandwidths. |
CV function |
The values of the cross-validation function in the bandwidths grid. |
bw |
A numeric value for the cross-validation bandwidth. |
Author(s)
Graciela Estévez Pérez and Alejandro Quintela del Río
References
Bowman, A.W., Hall, P. and Prvan,T. (1998), "Cross-validation for the smoothing of distribution functions", Biometrika. 85, 799-808.
Quintela-del-Río, A. and Estévez-Pérez, G. (2012), "Nonparametric kernel distribution function estimation with kerdiest: an R package for bandwidth choice and applications", Journal of Statistical Software, 50(8), 1-21.
Examples
## Compute the cross-validation bandwidth for a sample of 100 random N(0,1)
## data
x <- rnorm(100, 0, 1)
num_bws <- 50
seq_bws <- seq(((max(x) - min(x))/2)/50, (max(x) - min(x))/2,length = num_bws)
hCV <- CVbw(type_kernel = "e", vec_data = x, n_pts = 200, seq_bws = seq_bws)
hCV
## The CV function is plotted
h_CV <- CVbw(vec_data = x)
h_CV$bw
plot(h_CV$seq_bws, h_CV$CVfunction, type = "l")
## Plotting the distribution function estimate controlling the grid points
## and the kernel function
ss <- quantile(x, c(0.05, 0.95))
## Number of points to be used in the representation of estimated distribution
## function
n_pts <- 100
y <- seq(ss[1], ss[2], length.out = n_pts)
F_CV <- kde(type_kernel = "e", x, y, h_CV$bw)$Estimated_values
## Plot of the theoretical and estimated distribution functions
plot(y, F_CV, type = "l", lty = 2)
lines(y, pnorm(y), type = "l", lty = 1)
legend(-1, 0.8, c("Real", "Nonparametric"), lty = 1:2)
Compute the Plug-in Bandwidth of Polansky and Baker
Description
The bandwidth parameter for the distribution function kernel estimator is calculated, using the plug-in method of Polansky and Baker (2000). Four possible kernel functions can be used for the kernel estimator: "e" Epanechnikov, "n" Normal, "b" Biweight and "t" Triweight. Because kernel estimators of derivatives of order greater than two are required, only the normal kernel is used in this case.
Usage
PBbw(type_kernel = "n", vec_data, num_stage = 2)
Arguments
type_kernel |
The kernel function used. You can use four types: "e" Epanechnikov, "n" Normal, "b" Biweight and "t" Triweight. The normal kernel is used by default. |
vec_data |
The data sample. |
num_stage |
The number of iterations in the Polansky and Baker method. The default, 2, is usually a good option. Values of 3 or 4 are also allowed. |
Value
A numeric value for the bandwidth parameter.
Author(s)
Graciela Estévez Pérez and Alejandro Quintela del Río
References
Polansky, A.M. and Baker, E.R. (2000), "Multistage plug-in bandwidth selection for kernel distribution function estimates", Journal of Statistical Computation and Simulation, 65, 63-80.
Quintela-del-Río, A. and Estévez-Pérez, G. (2012), "Nonparametric kernel distribution function estimation with kerdiest: an R package for bandwidth choice and applications", Journal of Statistical Software, 50(8), 1-21.
Examples
## Compute the plug-in bandwidth for a sample of 100 random N(0,1) data
x <- rnorm(100, 0, 1)
h_PB <- PBbw(vec_data = x, num_stage = 4)
h_PB
## A quick plot of a distribution function estimate
x <- rnorm(1000)
h_PB <- PBbw(vec_data = x)
F_PB <- kde(vec_data = x, bw = h_PB)
plot(F_PB$grid, F_PB$Estimated_values, type = "l")
## Plotting the distribution function estimate controling the grid points and
## the kernel function
ss <- quantile(x, c(0.05, 0.95))
## number of points to be used in the representation of the estimated
## distribution function
n_pts <- 100
y <- seq(ss[1], ss[2], length.out = n_pts)
F_PB <- kde(type_kernel = "e", x, y, h_PB)$Estimated_values
## Plot of the theoretical and estimated distribution functions
plot(y, F_PB, type = "l", lty = 2)
lines(y, pnorm(y), type = "l", lty = 1)
legend(-1.2, 0.8, c("Real", "Nonparametric"), lty = 1:2)
Exceedance Function Estimation
Description
Computes the exceedance probability, i.e., the probability that a
specified value c
(a magnitude of a seismic event, a flow level...)
will be exceeded in D
time units.
Usage
ef(type_kernel = "n", vec_data, c,
bw = PBbw(type_kernel = "n", vec_data, 2), Dmin = 0, Dmax = 15,
size_grid = 50, lambda)
Arguments
type_kernel |
The kernel function. You can use four types: "e" Epanechnikov, "n" Normal, "b" Biweight and "t" Triweight. The Normal kernel is used by default. |
vec_data |
The data sample (earthquake magnitudes, flow levels, wind speed...) |
c |
The concrete level in which we want to compute the exceedance probability. |
bw |
The bandwidth parameter. The plug-in method of Polansky and Baker (2000) is used by default. |
Dmin |
Minimum value for D time units (years, days... ). Default is 0. |
Dmax |
Maximum value for D time units (years, days... ). Default is 15. |
size_grid |
Length of a grid in which we compute the exceedance function. By default, 50. |
lambda |
The mean activity rate. |
Details
The exceedance function is usually calculated assuming that event
occurrence follows a Poisson process. In this case, the exceedance
function, i.e., the probability of an specific value c
is calculated
as
R(c,D) = 1 - exp(-\lambda D(1-F_h(c)).
See, e.g., Orlecka-Sikora (2008) or Quintela-del-Rio (2010) for earthquake data applications.
Value
Returns a list containing:
Estimated_values |
Vector containing the estimated function. |
grid |
The used grid. |
bw |
Value of the bandwidth. |
Author(s)
Graciela Estévez Pérez and Alejandro Quintela del Río
References
Orlecka-Sikora, B. (2008), "Resampling methods for evaluating the uncertainty of the nonparametric magnitude distribution estimation in the probabilistic seismic hazard analysis", Tectonophysics 456, 38-51.
Quintela-del-Rio, A. (2010), "On nonparametric techniques for area-characteristic seismic hazard parameters", Geophysical Journal International, 180, 339-346.
Quintela-del-Río, A. and Estévez-Pérez, G. (2012), "Nonparametric kernel distribution function estimation with kerdiest: an R package for bandwidth choice and applications", Journal of Statistical Software, 50(8), 1-21.
Examples
## Working with earthquake data. We use the catalogue of the National
## Geographic Institute (IGN) of Spain and select the data of the Northwest
## of the Iberian Peninsula.
data(nwip)
require(chron)
require(date)
## The data with magnitude greater than 3 are considered
mg <- nwip$magnitude[nwip$magnitude > 3.0]
x1 <- nwip$year
x2 <- nwip$month
x3 <- nwip$day
ys <- paste(x1, x2, x3)
earthquake_date <- as.character(ys)
y1s <- as.date(earthquake_date, order = "ymd")
## Computation of the total number of years
y2s <- as.POSIXct(y1s)
z <- years(y2s)
n.years <- length(levels(z))
## Mean rate of earthquakes per year
lambda <- length(mg)/n.years
## Estimation of the exceedance probability for magnitude = 4
est <- ef(vec_data = mg, c = 4, lambda = lambda)
plot(est$grid, est$Estimated_values, type = "l", xlab = "Years",
ylab = "Probability of Exceedance")
Kernel Estimation of the Distribution Function
Description
Computes the value of the kernel estimator of the distribution function, in a single value or in a grid. Four possibilites for the kernel function are implemented, and the bandwidth parameter can be directly calculated by the plug-in method of Polansky and Baker (2000).
Usage
kde(type_kernel = "n", vec_data, y = NULL,
bw = PBbw(type_kernel = "n", vec_data, 2))
Arguments
type_kernel |
The kernel function. You can use four types: "e" Epanechnikov, "n" Normal, "b" Biweight and "t" Triweight. The normal kernel is used by default. |
vec_data |
The data sample. |
y |
The single value or the grid vector where the distribution function is estimated. By default, a grid of 100 equidistant points from the minimum to the maximum of the data sample is selected. |
bw |
The bandwidth used. If it is not provided, the plug-in bandwidth of Polansky and Baker (2000) is computed. |
Value
A list containing:
Estimated_values |
Vector containing the estimated function in the grid values. |
grid |
The used grid. |
bw |
Value of the bandwidth. |
Author(s)
Graciela Estévez Pérez and Alejandro Quintela del Río
References
Reiss, R.D. (1981), "Nonparametric estimation of smooth distribution functions", Scandinavian Journal of Statistics, 8, 116-119.
Simonoff, J. (1996), "Smoothing Methods in Statistics", Springer, New York.
Polansky, A.M. and Baker, E.R. (2000), "Multistage plug-in bandwidth selection for kernel distribution function estimates", Journal of Statistical Computation and Simulation, 65, 63-80.
Quintela-del-Río, A. and Estévez-Pérez, G. (2012), "Nonparametric kernel distribution function estimation with kerdiest: an R package for bandwidth choice and applications", Journal of Statistical Software, 50(8), 1-21.
Examples
## Comparison of three bandwidth selection methods
x <- rnorm(100)
## The bandwidths by cross-validation, plug-in of Altman and Leger
## and plug-in of Polansky and Baker are computed using a normal kernel
## and a standard setting of parameters
h_CV <- CVbw(vec_data = x)$bw
h_CV
## Plug-in of Altman and Leger
h_AL <- ALbw(vec_data = x)
h_AL
## Plug-in of Polansky and Baker
h_PB <- PBbw(vec_data = x)
h_PB
## Plot of the three estimates together with the real distribution
F_CV <- kde(vec_data = x, bw = h_CV)
F_AL <- kde(vec_data = x, bw = h_AL)
F_PB <- kde(vec_data = x, bw = h_PB)
y <- F_CV$grid
Ft <- pnorm(y)
plot(y, Ft, ylab = "Distribution", xlab = "Data", type = "l", lty = 1)
lines(y, F_CV$Estimated_values, type = "l", lty = 2)
lines(y, F_AL$Estimated_values, type = "l", lty = 3)
lines(y, F_PB$Estimated_values, type = "l", lty = 4)
legend(1, 0.4, c("Real", "F_CV", "F_AL", "F_PB"), lty = 1:4)
Internal Grid Functions
Description
Internal Grid functions
Details
Internal functions not to be called by the user.
Mean Return Period Estimation
Description
This functions computes an estimate of the time between two values of a concrete level (size of an earthquake, flow lewel, wind speed...).
Usage
mrp(type_kernel = "n", vec_data, y = NULL,
bw = PBbw(type_kernel = "n", vec_data, 2), lambda)
Arguments
type_kernel |
The kernel function. You can use four types: "e" Epanechnikov, "n" Normal, "b" Biweight and "t" Triweight. The normal kernel is used by default. |
vec_data |
The data sample (earthquake magnitudes, flow levels, wind speed...). |
y |
A grid or a singular value where the estimator is computed. By default, a grid of 50 values between the minimum and the maximum of the data is computed. |
bw |
The bandwidth parameter. By default, the plug-in method of Polansky and Baker (2000) is used. |
lambda |
The mean activity rate. |
Details
The mean return period is usually calculated assuming that event
occurrence follows a Poisson process. In this case, the mean return
period of events of size c
is calculated as
T(c) = \frac{1}{ \lambda (1-F_h(c))}.
In Orlecka-Sikora (2008) or Quintela-del-Rio (2010) an application to
earthquake data is made. In hydrological applications, if we work with
annual maxima data, the parameter of the Poisson variable is 1 (one maximum
per year). The mean return period between flow levels of value c
is
calculated as
T(c) = \frac{1}{ 1-F_h(c)}.
See, for instance, Quintela-del-Rio (2011), for an application to data of Salt River near Roosevelt, AZ, USA (saltriver data).
Value
A list containing:
Estimated_values |
Vector containing the estimated function. |
grid |
The used grid. |
bw |
Value of the bandwidth. |
Author(s)
Graciela Estévez Pérez and Alejandro Quintela del Río
References
Orlecka-Sikora, B. (2008), "Resampling methods for evaluating the uncertainty of the nonparametric magnitude distribution estimation in the probabilistic seismic hazard analysis", Tectonophysics, 456, 38-51.
Quintela-del-Rio, A. (2010), "On nonparametric techniques for area- characteristic seismic hazard parameters", Geophysical Journal International, 180, 339-346.
Quintela-del-Rio, A. (2011), "On bandwidth selection for nonparametric estimation in flood frequency analysis", Hydrological Processes, 25, 671-678.
Quintela-del-Río, A. and Estévez-Pérez, G. (2012), "Nonparametric kernel distribution function estimation with kerdiest: an R package for bandwidth choice and applications", Journal of Statistical Software, 50(8), 1-21.
Examples
## Working with earthquake data. We use the catalogue of the National
## Geographic Institute (IGN) of Spain and select the data of the Northwest
## of the Iberian Peninsula.
data(nwip)
require(chron)
require(date)
## Data with magnitude greater than 3 are considered
mg <- nwip$magnitude[nwip$magnitude > 3.0]
x1 <- nwip$year
x2 <- nwip$month
x3 <- nwip$day
ys <- paste(x1, x2, x3)
earthquake_date <- as.character(ys)
y1s <- as.date(earthquake_date, order = "ymd")
## Computation of the total number of years
y2s <- as.POSIXct(y1s)
z <- years(y2s)
n.years <- length(levels(z))
## Mean rate of earthquakes per year
lambda <- length(mg)/n.years
## Estimation of the mean return period (in years) between earthquakes of
## the same magnitude
est2 <- mrp(vec_data = mg, lambda = lambda)
plot(est2$grid, est2$Estimated_values, type = "l", xlab = "Magnitude",
ylab = "Mean return period (years)")
## Working with hydrological data: annual peak instantaneous flow of the
## Salt River near Roosevelt, AZ, USA, for 1924-2009.
data(saltriver)
peak <- saltriver$peakflow
year <- saltriver$year
plot(year, peak, type = "l", xlab = "Year", ylab = "Annual peak flow")
## Mean return period for the Saltriver data
rp <- mrp(type_kernel = "n", vec_data = peak, lambda = 1)
plot(rp$grid, rp$Estimated_values, type = "l", xlab = "Flow level",
ylab = "Years ", main = "Mean return period")
Earthquakes of the Northwest Iberian Peninsula
Description
This data set corresponds to the earthquakes occurred in the Northwest of the Iberian Peninsula, from 25/Nov/1924 to 31/Jul/2010. The area is limited by the coordinates 41 N–44 N and 6 W–10 W, involving the autonomic region of Galicia (Spain) and northern Portugal.
Usage
data(nwip)
Format
A data frame with 3491 observations on the following 10 variables, corresponding to the earthquake epicenters and time of ocurrence.
day
Day
month
Month
year
Year
hour
Hour
minute
Minute
second
second
latitude
Latitude in degrees
longitude
Longitude in degrees
depth
Depth in km
magnitude
Magnitude in Richter Scale
Source
The data catalogue has been obtained from the National Geographic Institute (IGN) of Spain. The web page is www.ign.es.
References
Rueda, J., and J. Mezcua (2001), "Sismicidad, sismotectonica y peligrosidad sismica en Galicia", IGN Technical Publication, 35.
Quintela-del-Río, A. and Estévez-Pérez, G. (2012), "Nonparametric kernel distribution function estimation with kerdiest: an R package for bandwidth choice and applications", Journal of Statistical Software, 50(8), 1-21.
Return Level Estimation
Description
The T-return level is defined as the value of the observed
variable that can be expected to be once exceeded during a T-period of
time. This is computed as the quantile of the distribution corresponding
to the value F^{-1}(1-\frac{1}{T})
.
Usage
rl(type_kernel = "n", vec_data, time,
bw = PBbw(type_kernel = "n", vec_data, 2))
Arguments
type_kernel |
The kernel function used. You can use four types: "e" Epanechnikov, "n" Normal, "b" Biweight and "t" Triweight. The normal kernel is used by default. |
vec_data |
The data sample (earthquake magnitudes, flow levels, wind speeds...). |
time |
A time or a vector of times for T. |
bw |
The bandwidth parameter. By default, he plug-in method of Polansky and Baker (2000) is used. |
Details
In several scientific fields, it is of interest to estimate quantiles
corresponding to a probability of exceedance. E.g., in hydrology, the
T-return level x_T
is defined as the value of the observed flow
that can be expected to be once exceeded during a T-period of time; i.e.,
the quantile
x_T=F^{-1}(1-\frac{1}{T}).
It can be directly estimated by
\hat{x}_T=F_h^{-1}(1-\frac{1}{T}).
See, e.g., Quintela-del-Rio (2011), for an application to data of Salt River near Roosevelt, AZ, USA.
Value
A single value or an array for the estimated quantiles.
Author(s)
Graciela Estévez Pérez and Alejandro Quintela del Río
References
Quintela-del-Rio, A. (2011), "On bandwidth selection for nonparametric estimation in flood frequency analysis". Hydrological Processes 25, 671-678.
Quintela-del-Río, A. and Estévez-Pérez, G. (2012), "Nonparametric kernel distribution function estimation with kerdiest: an R package for bandwidth choice and applications", Journal of Statistical Software, 50(8), 1-21.
Examples
data(saltriver)
peak <- saltriver$peakflow
year <- saltriver$year
plot(year, peak, type = "l", ylab = "Annual peak flow")
## Calculating the return values for a period from 2 to 100 years
times <- seq(2,100, length.out = 100)
ret.lev <- rl(vec_data = peak, time = times)
plot(times, ret.lev, type = "l", xlab = "Years", ylab = "Flow (cumecs)",
main = "Return level Plot")
Data from Salt River
Description
The annual peak instantaneous flow of the Salt River near Roosevelt, AZ, USA, for 1924-2009. Data are in cfs (0.028317 m3/s); water year October-September. The data were examined in several papers related with extreme values in hydrology. Among others, they were analyzed by Anderson and Meerschaert (1998) and Dettinger and Diaz (2000), where they were fitted to GEV and GPD distributions. In Quintela-del-Rio (2011), a nonparametric analysis for this data set is presented.
Usage
data(saltriver)
Format
A data frame with 85 observations on the following 2 variables.
year
Year
peakflow
The annual observed maximum peak flow
Source
US Geological Survey http://water.usgs.gov/nwis/peak.
References
Anderson, P.L. and Meerschaert, M.M. (1998), "Modeling river flows with heavy tails", Water Resources Research, 34, 2271-2280.
Dettinger, M.D. and Diaz, H.F. (2000), "Global characteristics of stream flow seasonality and variability", Journal of Hydrometeorology, 1, 289-310.
Quintela-del-Rio, A. (2011), "On bandwidth selection for nonparametric estimation in flood frequency analysis". Hydrological Processes 25, 671-678.
Quintela-del-Río, A. and Estévez-Pérez, G. (2012), "Nonparametric kernel distribution function estimation with kerdiest: an R package for bandwidth choice and applications", Journal of Statistical Software, 50(8), 1-21.