Type: | Package |
Title: | Linkage Using Diagnosis Codes |
Version: | 0.2.0 |
Date: | 2021-08-18 |
LinkingTo: | Rcpp, RcppArmadillo |
Depends: | R (≥ 3.0.0), Rcpp (≥ 0.12.11), |
Imports: | fGarch, landpred, Matrix, methods, rootSolve |
Suggests: | testthat |
Description: | Probabilistic record linkage without direct identifiers using only diagnosis codes. Method is detailed in: Hejblum, Weber, Liao, Palmer, Churchill, Szolovits, Murphy, Kohane & Cai (2019) <doi:10.1038/sdata.2018.298> ; Zhang, Hejblum, Weber, Palmer, Churchill, Szolovits, Murphy, Liao, Kohane & Cai (2021) <doi:10.1101/2021.05.02.21256490>. |
BugReports: | https://github.com/borishejblum/ludic/issues |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.1.1 |
NeedsCompilation: | yes |
Packaged: | 2021-08-18 11:00:16 UTC; boris |
Author: | Boris P Hejblum [aut, cre], Harrison G Zhang [aut], Tianxi Cai [aut] |
Maintainer: | Boris P Hejblum <boris.hejblum@u-bordeaux.fr> |
Repository: | CRAN |
Date/Publication: | 2021-08-18 14:30:06 UTC |
ludic
Description
Linkage Using Diagnosis Codes
Details
This package implements probabilistic record linkage methods that rely solely on diagnosis codes, in the absence of direct identifiers.
Package: | ludic |
Type: | Package |
Version: | 0.2.0
Date: | 2021-08-18 |
License: | The "MIT License" (MIT) |
The main function of ludic is recordLink.
Author(s)
Boris P. Hejblum, Tianxi Cai — Maintainer: Boris P. Hejblum
References
Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T, Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes, Scientific Data, 6:180298 (2019). doi: 10.1038/sdata.2018.298.
Zhang HG, Hejblum BP, Weber G, Palmer N, Churchill S, Szolovits P, Murphy S, Liao KP, Kohane I and Cai T, ATLAS: An automated association test using probabilistically linked health records with application to genetic studies, JAMIA, in press (2021). doi: 10.1101/2021.05.02.21256490.
C++ implementation of the E and M steps of Winkler's EM algorithm for estimating the Fellegi-Sunter (FS) method, using sparse matrices for big sample sizes
Description
C++ implementation of the E and M steps of Winkler's EM algorithm for estimating the Fellegi-Sunter (FS) method, using sparse matrices for big sample sizes
Usage
EMstep_C_sparse_big(mat_A, mat_B, p, m, u)
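As a point of reference, the E step computes the posterior probability that a pair is a match from the current parameters p (proportion of matches), m and u (per-code agreement probabilities among matches and non-matches, respectively). The following plain R sketch illustrates this computation for a single 0/1 agreement vector under the standard Fellegi-Sunter model; it is an illustration only, not the internal C++ interface:
# posterior match probability for one agreement vector `gamma` of length K,
# given current Fellegi-Sunter parameters p, m, u (illustrative sketch)
estep_one <- function(gamma, p, m, u){
  a <- p * prod(m^gamma * (1 - m)^(1 - gamma))        # contribution if the pair is a match
  b <- (1 - p) * prod(u^gamma * (1 - u)^(1 - gamma))  # contribution if the pair is a non-match
  a / (a + b)
}
estep_one(gamma = c(1, 0, 1), p = 0.01, m = rep(0.9, 3), u = rep(0.1, 3))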
Anonymized binarized diagnosis codes from RA study.
Description
An anonymized version of the binarized diagnosis code data from the RA1 and RA2 datasets, over both 6-year and 11-year time spans.
Usage
data(RA)
Format
5 objects:
RA1_6y: an integer matrix of 0s and 1s containing 4,936 renamed diagnosis codes for 26,681 patients from the dataset RA1, recorded over a 6-year time span.
RA2_6y: an integer matrix of 0s and 1s containing 4,936 renamed diagnosis codes for 5,707 patients from the dataset RA2, recorded over a 6-year time span.
RA1_11y: an integer matrix of 0s and 1s containing 5,593 renamed diagnosis codes for 26,687 patients from the dataset RA1, recorded over an 11-year time span.
RA2_11y: an integer matrix of 0s and 1s containing 5,593 renamed diagnosis codes for 6,394 patients from the dataset RA2, recorded over an 11-year time span.
silverstandard_truematches: a character matrix with two columns containing the identifiers of the 3,831 pairs of silver-standard matches.
Details
The ICD-9 diagnosis codes have been masked, randomly reordered, and replaced by meaningless names. Finally, the silver-standard matching pairs are also provided, to allow the benchmarking of methods for probabilistic record linkage using diagnosis codes.
References
Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T, Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes, Scientific Data, 6:180298 (2019). doi: 10.1038/sdata.2018.298.
Liao, K. P. et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care & Research 62, 1120-1127 (2010). doi: 10.1002/acr.20184
Liao, K. P. et al. Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts. PLoS ONE 10, e0136651 (2015). doi: 10.1371/journal.pone.0136651
Examples
if(interactive()){
rm(list=ls())
library(ludic)
data(RA)
res_match_6y <- recordLink(data1 = RA1_6y, data2 = RA2_6y,
                           eps_plus = 0.01, eps_minus = 0.01,
                           aggreg_2ways = "mean",
                           min_prev = 0,
                           use_diff = FALSE)
res_match_11y <- recordLink(data1 = RA1_11y, data2 = RA2_11y,
                            eps_plus = 0.01, eps_minus = 0.01,
                            aggreg_2ways = "mean",
                            min_prev = 0,
                            use_diff = FALSE)
print.res_matching <- function(res, threshold=0.9, ref=silverstandard_truematches){
  have_match_row <- rowSums(res > threshold)
  have_match_col <- colSums(res > threshold)
  # for each column (patient in data2) with at least one score above the threshold,
  # keep the best-matching row (patient in data1)
  bestmatched_pairs_all <- cbind.data.frame(
    "D1" = rownames(res)[apply(res[, which(have_match_col > 0), drop=FALSE], 2, which.max)],
    "D2" = names(have_match_col)[which(have_match_col > 0)]
  )
  # compare the declared pairs to the silver-standard matches
  nTM_all <- nrow(ref)
  nP_all <- nrow(bestmatched_pairs_all)
  TPR_all <- sum(apply(bestmatched_pairs_all, 1, paste0, collapse="")
                 %in% apply(ref, 1, paste0, collapse=""))/nTM_all
  PPV_all <- sum(apply(bestmatched_pairs_all, 1, paste0, collapse="")
                 %in% apply(ref, 1, paste0, collapse=""))/nP_all
  cat("threshold: ", threshold,
      "\nnb matched: ", nP_all, "; nb true matches: ", nTM_all,
      "\nTPR: ", TPR_all, "; PPV: ", PPV_all, "\n\n", sep="")
}
print.res_matching(res_match_6y)
print.res_matching(res_match_11y)
}
Fast C++ implementation of agreement vector for the element-wise comparison of 2 matrices
Description
agree_C_sparse uses sparse matrices.
Usage
agree_C(mat_A, mat_B)
agree_C_sparse(mat_A, mat_B)
Arguments
mat_A |
a |
mat_B |
a |
Examples
mat1 <- matrix(round(rnorm(n=1000, sd=1.2)), ncol=10, nrow=100)
mat2 <- rbind(mat1[1:10, ],
matrix(round(rnorm(n=900, sd=1.2)), ncol=10, nrow=90)
)
rownames(mat1) <- paste0("A", 1:nrow(mat1))
rownames(mat1) <- paste0("B", 1:nrow(mat1))
mat1 <- 1*(mat1>1)
mat2 <- 1*(mat2>1)
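The agreement between the two binary matrices can then be computed with the documented usage above (a minimal illustration; agree_C_sparse would be called the same way on sparse inputs):
agree_C(mat1, mat2)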
Association testing by combining several matching thresholds
Description
Computes association test p-values from a generalized linear model for each considered threshold, and computes a p-value for the combination of all the considered thresholds through Fisher's method, using perturbation resampling.
Usage
atlas(
match_prob,
y,
x,
covar = NULL,
thresholds = seq(from = 0.1, to = 0.9, by = 0.2),
nb_perturb = 200,
dist_family = c("gaussian", "binomial"),
impute_strategy = c("weighted average", "best")
)
Arguments
match_prob |
matching probabilities matrix (e.g. obtained through |
y |
response variable of length |
x |
a |
covar |
a |
thresholds |
a vector (possibly of length |
nb_perturb |
the number of perturbation used for the p-value combination. Default is 200. |
dist_family |
a character string indicating the distribution family for the glm.
Currently, only |
impute_strategy |
a character string indicating which strategy to use to impute x
from the matching probabilities |
Value
a list containing the following:
- influencefn_pvals: p-values obtained from influence function perturbations, with the covariates as columns and the thresholds as rows, plus an additional row at the top for the combination
- wald_pvals: a matrix containing the p-values obtained from the Wald test, with the covariates as columns and the thresholds as rows
- ptbed_pvals: a list containing, for each covariate, a matrix with the nb_perturb perturbed p-values with the different thresholds as rows
- theta_impute: a matrix of the estimated coefficients from the glm when imputing the weighted average for covariates (as columns), with the thresholds as rows
- sd_theta: a matrix of the estimated SD (from the influence function) of the coefficients from the glm when imputing the weighted average for covariates (as columns), with the thresholds as rows
- ptbed_theta_impute: a list containing, for each covariate, a matrix with the nb_perturb perturbed estimated coefficients from the glm when imputing the weighted average for covariates, with the different thresholds as rows
- impute_strategy: a character string indicating which impute strategy was used (either "weighted average" or "best")
References
Zhang HG, Hejblum BP, Weber G, Palmer N, Churchill S, Szolovits P, Murphy S, Liao KP, Kohane I and Cai T, ATLAS: An automated association test using probabilistically linked health records with application to genetic studies, JAMIA, in press (2021). doi: 10.1101/2021.05.02.21256490.
Examples
# rm(list=ls())
n_sims <- 1 # 5000
mysim <- function(i){
  x <- matrix(ncol=2, nrow=99, stats::rnorm(n=99*2))
  # plot(density(rbeta(n=1000, 1, 2)))
  match_prob <- matrix(rbeta(n=103*99, 1, 2), nrow=103, ncol=99)
  # y <- rnorm(n=103, mean = 1, sd = 0.5)
  # return(atlas(match_prob, y, x, dist_family="gaussian")$influencefn_pvals)
  y <- rbinom(n=103, size = 1, prob=0.5)
  return(atlas(match_prob, y, x, dist_family="binomial")$influencefn_pvals)
}
# res <- pbapply::pblapply(1:n_sims, mysim, cl = parallel::detectCores()-1)
res <- lapply(1:n_sims, mysim)
size <- sapply(1:(ncol(res[[1]])-2),
               FUN = function(i){
                 rowMeans(sapply(res, function(m){m[, i] < 0.05}), na.rm = TRUE)
               }
)
rownames(size) <- rownames(res[[1]])
colnames(size) <- colnames(res[[1]])[-(-1:0 + ncol(res[[1]]))]
size
Fisher's rule for combining several p-values
Description
Compute the negative of the log-sum for a vector of p-values.
Usage
comb_pvals(pv)
Arguments
pv |
the vector of p-values to be combined together |
Details
According to Fisher's rule, the combined statistic follows a chi-square distribution under the null hypothesis when the p-values are independent; if the p-values are correlated, then it does not follow a simple chi-square mixture under the null.
Value
the Fisher combination of the p-values. See Details.
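For reference, the classical Fisher combination of independent p-values can be written in a few lines of plain R (an illustrative sketch, not the comb_pvals implementation itself):
pv <- c(0.01, 0.2, 0.5)
stat <- -2 * sum(log(pv))                              # Fisher's combined statistic
pchisq(stat, df = 2 * length(pv), lower.tail = FALSE)  # chi-square with 2k df under independence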
Implementation of Winkler's EM algorithm for Fellegi-Sunter matching method
Description
em_winkler implements Winkler's EM algorithm for the Fellegi-Sunter matching method. em_winkler_big implements the same method when the data are too big to compute the agreement matrix: agreement is then recomputed on the fly each time it is needed, and the EM steps are entirely done in C++. This decreases the RAM usage (which remains substantial), at the cost of increased computation time.
Usage
em_winkler(
data1,
data2,
tol = 0.001,
maxit = 500,
do_plot = TRUE,
oneone = FALSE,
verbose = FALSE
)
em_winkler_big(
data1,
data2,
tol = 0.001,
maxit = 500,
do_plot = TRUE,
oneone = FALSE,
verbose = FALSE
)
Arguments
data1 |
either a binary ( |
data2 |
either a binary ( |
tol |
tolerance for the EM algorithm convergence. |
maxit |
maximum number of iterations for the EM algorithm. |
do_plot |
a logical flag indicating whether a plot should be drawn for the EM convergence.
Default is |
oneone |
a logical flag indicating whether 1-1 matching should be enforced.
If |
verbose |
a logical flag indicating whether intermediate values from the EM algorithm should
be printed. Useful for debugging. Default is |
Value
a list containing:
- matchingScore: a matrix of size n1 x n2 with the matching score for each of the n1*n2 pairs.
- threshold_ms: threshold value for the matching scores, above which pairs are considered true matches.
- estim_nbmatch: an estimation of the number of true matches (N, the number of pairs considered, multiplied by p, the estimated proportion of true matches from the EM algorithm).
- convergence_status: a logical flag indicating whether the EM algorithm converged.
References
Winkler WE. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. Proc Sect Surv Res Methods, Am Stat Assoc 1988: 667-71.
Grannis SJ, Overhage JM, Hui S, et al. Analysis of a probabilistic record linkage technique without human review. AMIA 2003 Symp Proc 2003: 259-63.
Examples
mat1 <- matrix(round(rnorm(n=1000, sd=1.2)), ncol=10, nrow=100)
mat2 <- rbind(mat1[1:10, ],
matrix(round(rnorm(n=900, sd=1.2)), ncol=10, nrow=90)
)
rownames(mat1) <- paste0("A", 1:nrow(mat1))
rownames(mat1) <- paste0("B", 1:nrow(mat1))
mat1 <- 1*(mat1>1)
mat2 <- 1*(mat2>1)
em_winkler(mat1, mat2)
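The components listed in the Value section can then be inspected, for instance (a short sketch continuing the example above):
res_em <- em_winkler(mat1, mat2)
res_em$threshold_ms             # threshold above which pairs are considered true matches
res_em$estim_nbmatch            # estimated number of true matches
res_em$matchingScore[1:3, 1:3]  # matching scores for the first few pairs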
Fast C++ implementation of Winkler's Method E step
Description
Fast C++ implementation of Winkler's Method E step
Usage
estep_C_vect(agreemat, p, m, u)
Logit link function
Description
logit is the inverse of the logistic transformation function.
expit is the logistic transformation function.
expit_dev1 is the first derivative of the logistic transformation function.
Usage
logit(p)
expit(x)
expit_dev1(x)
Arguments
p |
a real number in [0,1] to be transformed to the real scale ( ]-infinity,infinity[ ) |
x |
a real number in ]-infinity,infinity[ to be transformed to the probability scale ( [0,1] ) |
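A quick illustrative check of these transformations, using their standard definitions (logit(p) = log(p/(1-p)) and expit(x) = 1/(1+exp(-x))):
logit(0.5)        # 0
expit(0)          # 0.5
logit(expit(2))   # 2, since expit is the inverse of logit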
C++ implementation of the pseudo-likelihood computation
Description
loglikC_bin provides an even faster C++ implementation of the pseudo-likelihood computation for binary variables. loglikC_bin_wDates provides a C++ implementation of the pseudo-likelihood computation for binary variables with dates.
Usage
loglikC_bin(Bmat, Amat, eps_p, eps_n, piA, piB)
loglikC_bin_wDates(Bmat, Amat, Bdates, Adates, eps_p, eps_n, piA, piB)
loglikratioC_diff_arbitrary(Bmat, Amat, d_max, cost)
Arguments
Bmat |
|
Amat |
|
eps_p |
a vector of length |
eps_n |
a vector of length |
piA |
a vector of length |
piB |
a vector of length |
Bdates |
|
Adates |
|
d_max |
a numeric vector of length |
cost |
a numeric vector of length |
Compute the matching probabilities for each pair of observations
Description
C++ version: for each observation in 1:n, all the matching probabilities are computed for the p possible pairs.
Usage
matchProbs_rank_full_C(computed_dist, prop_match)
Arguments
computed_dist |
a |
prop_match |
a priori proportion of matches ("rho_1") |
Value
an n x p matrix containing the matching probabilities for each pair
Fast C++ computation of the final posterior probabilities in the E-M Winkler's method
Description
matchingScore_C_sparse_big implements a version using sparse matrices. It manages memory better but is slightly slower (recommended for big matrices).
Usage
matchingScore_C(agreemat, m, u, nA, nB)
matchingScore_C_sparse_big(mat_A, mat_B, m, u)
Arguments
agreemat |
binary sparse matrix of dimensions |
m |
vector of length |
u |
vector of length |
nA |
integer indicating the number of observations to be matched. |
nB |
integer indicating the number of observations to be matched with. |
mat_A |
a |
mat_B |
a |
Computes a matching score from agreement vectors and weights
Description
Computes a matching score from agreement vectors and weights
Usage
matching_score(agree, m, u)
Examples
estep_vect <- function(ag_score, p, m, u){
  # posterior probabilities that each pair is a match (a) or a non-match (b),
  # given the agreement scores and the current Fellegi-Sunter parameters p, m, u
  a <- exp(log(p) + ag_score %*% log(m) + (1 - ag_score) %*% log(1 - m))
  b <- exp(log(1 - p) + ag_score %*% log(u) + (1 - ag_score) %*% log(1 - u))
  return(cbind(a/(a + b), b/(a + b)))
}
Compute p-values for a Z-score
Description
Compute the p-value for a Z-score, assuming a normal distribution of the Z-score under the null hypothesis H0.
Usage
pval_zscore(beta, sigma)
Arguments
beta |
the estimate |
sigma |
estimate's estimated variance |
Value
the p-value
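As an illustration, under a Gaussian null the two-sided p-value for a Z-score can be computed directly (a sketch assuming sigma plays the role of the standard error, i.e. z = beta/sigma):
beta <- 1.96
sigma <- 1
2 * pnorm(-abs(beta / sigma))   # approximately 0.05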
Probabilistic Patient Record Linkage
Description
Probabilistic Patient Record Linkage
Usage
recordLink(
data1,
data2,
dates1 = NULL,
dates2 = NULL,
eps_plus,
eps_minus,
aggreg_2ways = "mean",
min_prev = 0.01,
data1_cont2diff = NULL,
data2_cont2diff = NULL,
d_max,
use_diff = TRUE
)
Arguments
data1 |
either a binary ( |
data2 |
either a binary ( |
dates1 |
matrix or dataframe of dimension |
dates2 |
matrix or dataframe of dimension |
eps_plus |
discrepancy rate between |
eps_minus |
discrepancy rate between |
aggreg_2ways |
a character string indicating how to merge the two posterior probability matrices obtained for each of the 2 databases. Four possibilities are currently implemented: |
min_prev |
minimum prevalence for the variables used in matching. Default is 1%. |
data1_cont2diff |
either a matrix or dataframe of continuous features,
such as age, for which the similarity measure uses the difference with
|
data2_cont2diff |
either a matrix or dataframe of continuous features,
such as age, for which the similarity measure uses the difference with
|
d_max |
a numeric vector of length |
use_diff |
logical flag indicating whether continuous differentiable variables should be used in the |
Details
Dates: the use of dates1 and dates2 requires that at least one date interval matches across dates1 and dates2 before an agreement on a diagnosis code between data1 and data2 can be claimed, in addition to having that very same code recorded in both.
Value
a matrix of size n1 x n2 with the posterior probability of matching for each of the n1*n2 pairs
References
Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T, Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes, Scientific Data, 6:180298 (2019). doi: 10.1038/sdata.2018.298.
Examples
set.seed(123)
ncodes <- 500
npat <- 200
incid <- abs(rnorm(n=ncodes, 0.15, 0.07))
bin_codes <- rbinom(n=npat*ncodes, size=1, prob=rep(incid, npat))
bin_codes_mat <- matrix(bin_codes, ncol=ncodes, byrow = TRUE)
data1_ex <- bin_codes_mat[1:(npat/2+npat/10),]
data2_ex <- bin_codes_mat[c(1:(npat/10), (npat/2+npat/10 + 1):npat), ]
rownames(data1_ex) <- paste0("ID", 1:(npat/2+npat/10), "_data1")
rownames(data2_ex) <- paste0("ID", c(1:(npat/10), (npat/2+npat/10 + 1):npat), "_data2")
if(interactive()){
res <- recordLink(data1 = data1_ex, data2 = data2_ex,
use_diff = FALSE, eps_minus = 0.01, eps_plus = 0.01)
round(res[c(1:3, 19:23), c(1:3, 19:23)], 3)
}
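Continuing the interactive example above, the posterior probability matrix can be turned into declared matches by thresholding (an illustrative post-processing sketch; the 0.9 cut-off is arbitrary):
matched_idx <- which(res > 0.9, arr.ind = TRUE)
matched_pairs <- cbind(rownames(data1_ex)[matched_idx[, 1]],
                       rownames(data2_ex)[matched_idx[, 2]])
head(matched_pairs)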
Splitting a character string in C++
Description
Splitting a character string in C++
Usage
strsplitC(s, sep)
Arguments
s |
a character string to be split |
sep |
a character that delimits the splits |
Examples
strsplitC(c(";aaa;bb;cccc;ee;"), sep=";")
Association testing using Han & Lahiri estimating equations and jackknife approach
Description
Association testing using Han & Lahiri estimating equations and jackknife approach
Usage
test_han2018(
match_prob,
y,
x,
covar_y = NULL,
covar_x = NULL,
jackknife_nrep = 100,
jackknife_blocksize = max(floor(min(length(y), nrow(x))/jackknife_nrep), 1),
methods = c("F", "M", "M2"),
dist_family = c("gaussian", "binomial")
)
Arguments
match_prob |
matching probabilities matrix (e.g. obtained through |
y |
response variable of length |
x |
a |
covar_y |
a |
covar_x |
a |
jackknife_nrep |
the number of jackknife repetitions. Default is 100 (from Han et al.). |
jackknife_blocksize |
the number of observations to remove in each jackknife. |
methods |
a character vector which must be a subset of |
dist_family |
a character string indicating the distribution family for the glm.
Currently, only |
Value
a list containing the following for each estimator in methods:
- beta: a vector containing the p estimated coefficients
- varcov: the p x p variance-covariance matrix of the beta coefficients
- zscores: a vector containing the p Z-scores
- pval: the corresponding Gaussian assumption p-values
References
Han, Y., and Lahiri, P. (2019) Statistical Analysis with Linked Data. International Statistical Review, 87: S139-S157. doi: 10.1111/insr.12295.
Examples
# rm(list=ls())
# n_sims <- 500
# res <- pbapply::pblapply(1:n_sims, function(n){
# nx <- 99
# ny <- 103
# x <- matrix(ncol=2, nrow=ny, stats::rnorm(n=ny*2))
#
# #plot(density(rbeta(n=1000, 1,2)))
# match_prob <- diag(ny)[, 1:nx]#matrix(rbeta(n=ny*nx, 1, 2), nrow=ny, ncol=99)
#
# covar_y <- matrix(rnorm(n=ny, 1, 0.5), ncol=1)
# covar_x <- matrix(ncol=3, nrow=ny, stats::rnorm(n=ny*3))
#
# #y <- rnorm(n=ny, mean = x %*% c(2,-3) + covar_x %*% rep(0.2, ncol(covar_x)) + 0.5*covar_y, 0.5)
# y <- rbinom(n=ny, 1, prob=expit(x %*% c(2,-3) + covar_x %*%
# rep(0.2, ncol(covar_x)) + 0.5*covar_y))
# #glm(y~0+x+covar_y+covar_x, family = "binomial")
# return(
# #test_han2018(match_prob, y, x, jackknife_blocksize = 10, covar_x = NULL, covar_y = NULL)
# test_han2018(match_prob, y[1:ny], x[1:nx, ], dist_family = "binomial",
# jackknife_blocksize = 10, covar_x = covar_x[1:nx, ],
# covar_y = covar_y[1:ny, , drop=FALSE])
# )
# }, cl=parallel::detectCores()-1)
# pvals_F <- sapply(lapply(res, "[[", "F"), "[[", "beta")
# pvals_M <- sapply(lapply(res, "[[", "M"), "[[", "beta")
# pvals_M2 <- sapply(lapply(res, "[[", "M2"), "[[", "beta")
# quantile(pvals_F)
# quantile(pvals_M)
# quantile(pvals_M2)
# rowMeans(pvals_F<0.05)
# rowMeans(pvals_M<0.05)
# rowMeans(pvals_M2<0.05)
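A minimal runnable sketch, adapted from the commented simulation above (assumed argument roles: match_prob is an ny x nx matrix, y has length ny, x has nx rows):
if(interactive()){
  nx <- 99
  ny <- 103
  x <- matrix(stats::rnorm(n = nx * 2), nrow = nx, ncol = 2)
  y <- stats::rnorm(n = ny)
  match_prob <- diag(ny)[, 1:nx]   # toy matching probabilities, as in the commented example
  res <- test_han2018(match_prob, y, x, dist_family = "gaussian")
  res$F$pval
}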