The changepointGA package applies Genetic Algorithms (GAs) to detect both single and multiple changepoints in time series data. By encoding each candidate changepoint as an integer, changepointGA dynamically adjusts the model parameter vector to accommodate varying numbers of changepoints and optimizes the model fitness function accordingly. Additionally, it also provides the functionalities of estimating the model hyperparameters, parameters, and changepoint configurations simultaneously.
This package includes the time series simulation function
ts.sim()
with the ability to introduce changepoint effects.
The time series could be simulated from a class of time series
models,
\[Z_{t}=\mu_{t}+e_{t}.\]
The model could be specified by the arguments of phi
and
theta
, representing the autoregressive (AR) coefficients of
and the moving-average (MA) coefficients for the autoregressive
moving-average (ARMA) model. The default values of phi
and
theta
are NULL
, generating time series with
\(e_{t}\) that are independent and
identically \(N(0,\sigma^{2})\)
distributed. If we assign values to phi
and/or
theta
, the \(e_{t}\) will
be generated from an ARMA(\(p\),\(q\)) process, where \(p\) equals the number of values in
phi
and \(q\) equals the
number of values in theta
. For the above model, \(\mu_{t}\) denotes time-varying mean
function and the mean changepoint effects could be introduced through
indicator functions to \(\mu_{t}\).
Note that the changepointGA could also work with other
time series model as long as the fitness function is appropriately
specified. To illustrate the package usage, we will generate the \(e_{t}\) following an ARMA(\(p=1\),\(q=1\)) process. The AR coefficient is \(\phi=0.5\) and the MA coefficient is \(\theta=0.8\). The time varying mean values
are generated through the following equation,
\[\mu_{t}=\beta_{0}+\Delta_{1}I_{[t>\tau_{1}]} + \Delta_{2}I_{[t>\tau_{2}]},\]
including the intercept term with \(\beta_{0}=0.5\) and two changepoints at
\(\tau_{1}=125\) and \(\tau_{2}=375\) with \(\Delta_{1}=2\) and \(\Delta_{2}=-2\). The users can also include
additional covariates into matrix XMatT
for other possible
model dynamics, such as tread and seasonalities.
Ts = 200
betaT = c(0.5) # intercept
XMatT = matrix(rep(1, Ts), ncol=1)
colnames(XMatT) = c("intercept")
sigmaT = 1
phiT = c(0.5)
thetaT = c(0.8)
DeltaT = c(2, -2)
CpLocT = c(50, 150)
myts = ts.sim(beta=betaT, XMat=XMatT, sigma=sigmaT, phi=phiT, theta=thetaT, Delta=DeltaT, CpLoc=CpLocT, seed=1234)
str(myts)
#> Time-Series [1:200, 1] from 1 to 200: -2.006 -2.328 -1.47 0.526 1.17 ...
#> - attr(*, "DesignX")= num [1:200, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:3] "intercept" "" ""
#> - attr(*, "mu")= num [1:200, 1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
#> - attr(*, "theta")= num [1:3] 0.5 2 -2
#> - attr(*, "CpEff")= num [1:2] 2 -2
#> - attr(*, "CpLoc")= num [1:2] 50 150
#> - attr(*, "arEff")= num 0.5
#> - attr(*, "maEff")= num 0.8
#> - attr(*, "seed")= num 1234
The simulated time series could be plotted in the figure below. The vertical dashed blue lines indicate the true changepoint location and the horizontal red dashed segments represent the time series segment sample mean. We can clearly observe the mean changing effects from the introduced changepoints \(\tau_{1}=250\) and \(\tau_{2}=500\). The changepointGA package will be used to detected these two changepoints.
Considering the model used in the data simulation, we could use the R
function, ARIMA.BIC.Order()
, included in the
changepointGA package. This function inputs include
the
chromosome
, consists of the number of changepoints, the
order of AR part, the order of MA part, the changepoint locations, and a
value of time series sample size plus 1 (\(N+1\)) indicating the end of the
chromosome;plen
represent the number of model order
hyperparameters. We have plen=2
for this example
representing one hyperparameter for AR order and one hyperparameter for
MA order;XMat
requires the design matrix, including the
of-interested covariates;Xt
requires the simulated time series data
objects.Give the specific chromosome, including the ARMA model AR and MA
orders and the changepoint configuration information, the function
ARIMA.BIC.Order()
returns the model BIC value, which
represents the configuration’s fitness.
The genetic algorithm requires users set up a list object, including
all the GA hyperparameters. As shown below, we will have 40 individual
chromosomes in the population for each generation. p.range
is a list object containing integer values ranges for parameters
phi
and theta
. For example, in the code
snippet below, p.range
specifies that the AR and MA orders
can vary from the set \(\{0, 1, 2\}\).
Details for each hyperparameter definition can be found via the package
help documentation.
N = Ts
p.range = list(ar=c(0,2), ma=c(0,2))
GA_param = list(
popsize = 80,
Pcrossover = 0.95,
Pmutation = 0.3,
Pchangepoint = 10/N,
minDist = 1,
mmax = N/2 - 1,
lmax = 2 + N/2 - 1,
maxgen = 5000,
maxconv = 100,
option = "both",
monitoring = FALSE,
parallel = FALSE,
nCore = NULL,
tol = 1e-5,
seed = 1234
)
The default argument GA_operators
for the
GA()
function requires four operators,
population
, selection
, crossover
,
and mutation
. The population
operator will
provide a matrix like R object that store the randomly generated
chromosome representations as the initial population. As with any
optimization problem, better initial values can lead to improved and
faster-converging results. For convenience, users can specify their
preferred solutions in a matrix R object to serve as the initialized
population and replace the default operator in operators
,
but the matrix dimensions should be lmax
rows and
popsize
columns. Each column from this matrix represents an
individual chromosome.
pop = matrix(0, nrow=2 + N/2 - 1, ncol=80)
# put the two desired candidate solutions into pop matrix
pop[1:6,1] = c(2, 1, 1, 50, 150, N+1)
pop[1:7,2] = c(3, 1, 1, 50, 100, 150, N+1)
# fill the remain 38 individuals slot from default generation
tmppop = random_population(popsize=80, prange=p.range, N=N, minDist=1, Pb=10/N,
mmax=N/2 - 1, 2 + N/2 - 1)
pop[,3:80] = tmppop[,3:80]
pop[1:10, 1:10]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 2 3 12 16 9 9 7 11 9 7
#> [2,] 1 1 0 2 2 0 2 0 1 0
#> [3,] 1 1 1 0 0 0 1 1 1 0
#> [4,] 50 50 6 13 46 15 82 19 9 27
#> [5,] 150 100 37 28 65 22 83 42 20 32
#> [6,] 201 150 42 48 109 23 87 67 55 41
#> [7,] 0 201 67 68 110 112 120 81 67 61
#> [8,] 0 0 69 73 140 130 156 107 73 155
#> [9,] 0 0 85 76 144 140 183 117 82 162
#> [10,] 0 0 112 84 156 146 184 168 99 181
operators.self = list(population = pop,
selection = "selection_linearrank",
crossover = "uniformcrossover",
mutation = "mutation")
Since we are trying to estimate the changepoint configurations, it is
necessary to set the XMat
to include all other covariates
except the changepoint indicators, as if we are under a changepoint free
model.
Then we call function GA()
for changepoint searching and
simultaneously estimate the best fitted model AR and MA orders. The
estimated number of changepoints and corresponding changepoint locations
could be extracted as below.
res.changepointGA = suppressWarnings(GA(ObjFunc=ARIMA.BIC.Order, N=N, GA_param, GA_operators = operators.self,
p.range=p.range, XMat=XMatEst, Xt=myts))
fit.changepointGA = res.changepointGA$overbestfit
fit.changepointGA
#> [1] 614.0761
tau.changepointGA = res.changepointGA$overbestchrom
tau.changepointGA
#> [1] 4 1 1 50 54 149 155 201
The island GA model can be implemented by calling R function
IslandGA()
with similar arguments as in GA()
function. The argument of IslandGA_param
would be slightly
different. Related, the initialized population is a list object, the
length of such list should equal to the number of request islands. Here,
we use the default operator to initialize the population, but users can
also insert in their “reasonable” and “better” initial solutions to the
list object as in previous section for improved and faster-converging
results. The computational time is also showing below.
IslandGA_param = list(
subpopsize = 80,
Islandsize = 2,
Pcrossover = 0.95,
Pmutation = 0.15,
Pchangepoint = 10/N,
minDist = 1,
mmax = N/2 - 1,
lmax = 2 + N/2 - 1,
maxMig = 1000,
maxgen = 50,
maxconv = 20,
option = "both",
monitoring = FALSE,
parallel = FALSE, ###
nCore = NULL,
tol = 1e-5,
seed = 1234
)
tim1 = Sys.time()
res.Island.changepointGA = suppressWarnings(IslandGA(ObjFunc=ARIMA.BIC.Order, N=N, IslandGA_param,
p.range=p.range, XMat=XMatEst, Xt=myts))
tim2 = Sys.time()
tim2 - tim1
#> Time difference of 27.05757 secs
fit.Island.changepointGA = res.Island.changepointGA$overbestfit
fit.Island.changepointGA
#> [1] 587.0658
tau.Island.changepointGA = res.Island.changepointGA$overbestchrom
tau.Island.changepointGA
#> [1] 2 1 1 45 149 201
The default setting for both GA()
and
IslandGA()
implementations is to evaluate each changepoint
configuration sequentially and the initialized population fitness
evaluation is computational expensive. In some case, we could use the
parallel computing techniques to speed up the fitness evaluation, which
requires the R packages foreach and
doParallel. The parallel computing will be more
efficient when the island size ad the number of islands is larger (a
larger starting population). We then could implement the parallel
computing easily by setting parallel=TRUE
and apply
nCore=2
threads on a single computer as below (set
nCore
equals to the number of island is recommended).
IslandGA_param = list(
subpopsize = 80,
Islandsize = 2,
Pcrossover = 0.95,
Pmutation = 0.15,
Pchangepoint = 10/N,
minDist = 1,
mmax = N/2 - 1,
lmax = 2 + N/2 - 1,
maxMig = 1000,
maxgen = 50,
maxconv = 20,
option = "both",
monitoring = FALSE,
parallel = TRUE, ###
nCore = 2,
tol = 1e-5,
seed = 1234
)
tim3 = Sys.time()
res.Island.changepointGA = suppressWarnings(IslandGA(ObjFunc=ARIMA.BIC.Order, N=N, IslandGA_param,
p.range=p.range, XMat=XMatEst, Xt=myts))
tim4 = Sys.time()
tim4 - tim3
#> Time difference of 15.67648 secs
fit.Island.changepointGA = res.Island.changepointGA$overbestfit
fit.Island.changepointGA
#> [1] 587.0658
tau.Island.changepointGA = res.Island.changepointGA$overbestchrom
tau.Island.changepointGA
#> [1] 2 1 1 45 149 201
It is important to measure how different the estimated changepoint
configuration away from the true configuration used in data generation.
The pairwise distance metric proposed by Shi et al. (2022) can help
quantify such differences. We implemented and included the method via
the cpDist()
function in this package.
Shi, X., Gallagher, C., Lund, R., & Killick, R. (2022). A comparison of single and multiple changepoint techniques for time series data. Computational Statistics & Data Analysis, 170, 107433.
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.5.2
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Chicago
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] changepointGA_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.37 R6_2.5.1 codetools_0.2-20
#> [4] RcppArmadillo_14.2.2-1 fastmap_1.2.0 doParallel_1.0.17
#> [7] xfun_0.50 clue_0.3-65 iterators_1.0.14
#> [10] cachem_1.1.0 parallel_4.3.1 knitr_1.49
#> [13] htmltools_0.5.8.1 rmarkdown_2.29 lifecycle_1.0.4
#> [16] cli_3.6.3 foreach_1.5.2 sass_0.4.9
#> [19] jquerylib_0.1.4 compiler_4.3.1 rstudioapi_0.16.0
#> [22] tools_4.3.1 cluster_2.1.6 evaluate_1.0.3
#> [25] bslib_0.8.0 Rcpp_1.0.14 yaml_2.3.10
#> [28] rlang_1.1.4 jsonlite_1.8.9