| Type: | Package |
| Title: | Dataset Comparison with 'CDISC' Validation for Clinical Trial Data |
| Version: | 1.0.0 |
| Description: | A general-purpose toolkit for comparing any two data frames with optional 'CDISC' (Clinical Data Interchange Standards Consortium) validation for clinical trial data. Core comparison functions work on arbitrary datasets: variable-level and observation-level comparison, data type checking, metadata attribute analysis (types, labels, lengths, formats), missing value handling, key-based row matching, tolerance-based numeric comparisons, and group-wise comparisons. Optional z-score outlier detection is available when enabled. When working with clinical data, the package additionally validates 'SDTM' (Study Data Tabulation Model) and 'ADaM' (Analysis Data Model) datasets against CDISC standards (SDTM IG 3.3/3.4, ADaM IG 1.1/1.2/1.3), automatically detecting domains and flagging non-conformant variables. Generates unified comparison reports in text or HTML format with interactive dashboards. For CDISC standards, see https://www.cdisc.org/standards. |
| License: | MIT + file LICENSE |
| URL: | https://github.com/siddharthlokineni/clinCompare |
| BugReports: | https://github.com/siddharthlokineni/clinCompare/issues |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2 |
| Depends: | R (≥ 3.5.0) |
| Imports: | dplyr (≥ 1.0.0), haven (≥ 2.0.0), rlang (≥ 0.4.0), tidyr (≥ 1.0.0), methods, stats, tools, utils |
| Suggests: | ggplot2 (≥ 3.0.0), openxlsx (≥ 4.0.0), testthat (≥ 3.0.0), knitr, rmarkdown |
| VignetteBuilder: | knitr |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2026-02-15 01:22:16 UTC; siddharthlokineni |
| Author: | Siddharth Lokineni [aut, cre] |
| Maintainer: | Siddharth Lokineni <sidhu871@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-18 19:00:07 UTC |
clinCompare: Dataset Comparison with CDISC Validation
Description
A comprehensive toolkit for comparing clinical trial datasets. Provides functions for dataset comparison including variable-level and observation-level differences, data type checking, and missing value analysis. Integrates CDISC validation for SDTM and ADaM datasets.
Main Functions
- compare_datasets
High-level comparison of two datasets
- compare_variables
Compare variable names and types
- compare_observations
Row-wise value comparison
- cdisc_compare
Compare datasets with CDISC validation
- validate_cdisc
Validate a dataset against CDISC standards
- detect_cdisc_domain
Auto-detect CDISC domain or ADaM dataset
CDISC Standards Supported
- SDTM
DM, AE, LB, VS, EX, CM, MH, DS, SV, TA, TE domains
- ADaM
ADSL, ADAE, ADLB, ADTTE, ADEFF datasets
Author(s)
Maintainer: Siddharth Lokineni sidhu871@gmail.com
See Also
Useful links:
Report bugs at https://github.com/siddharthlokineni/clinCompare/issues
Package-Level Settings Environment
Description
Internal environment used to store package settings without modifying global options.
Usage
.clincompare_env
Format
An object of class environment of length 2.
Print Observation-Level Differences (Internal Helper)
Description
Shared helper used by both print.dataset_comparison and
print.cdisc_comparison. Prints a summary line, a per-variable
table, and up to n rows of the top variable's differing observations.
Usage
.print_observation_diffs(obs, n = 30, id_details = NULL, n_total_obs = NULL)
Arguments
obs |
Observation comparison list (with |
n |
Maximum number of differing rows to display (default 30). |
id_details |
Optional named list of ID detail data frames (from key-based comparison). |
n_total_obs |
Total number of observations (for percentage calculation). |
Value
Called for side effects (prints to console). Returns NULL invisibly.
Build Metadata Comparison
Description
Internal function to compare metadata attributes (types, labels, lengths, formats, and column order) between two datasets.
Usage
build_metadata_comparison(df1, df2)
Arguments
df1 |
First data frame (base). |
df2 |
Second data frame (compare). |
Value
A list with:
type_mismatches |
Data frame of variables with differing R classes |
label_mismatches |
Data frame of variables with differing labels |
length_mismatches |
Data frame of variables with differing lengths (max character width or haven width attribute) |
format_mismatches |
Data frame of variables with differing SAS format attributes (format.sas or display_format) |
order_match |
Logical: TRUE if common column ordering matches |
order_df1 |
Character: column order in df1 for common columns |
order_df2 |
Character: column order in df2 for common columns |
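The attribute checks listed above can be pictured with a short base R sketch. This is an illustration only, not the package's internal code: it assumes labels live in a "label" attribute (as haven sets them), and the toy data frames and the helper get_label() are hypothetical.

```r
# Toy inputs: AGE differs in class and label, and the column order differs.
df1 <- data.frame(AGE = c(34, 51), SEX = c("M", "F"), stringsAsFactors = FALSE)
df2 <- data.frame(SEX = c("M", "F"), AGE = c("34", "51"), stringsAsFactors = FALSE)
attr(df1$AGE, "label") <- "Age"
attr(df2$AGE, "label") <- "Age (years)"

common <- intersect(names(df1), names(df2))

# Type mismatches: compare the first class of each common column.
class1 <- vapply(df1[common], function(x) class(x)[1], character(1))
class2 <- vapply(df2[common], function(x) class(x)[1], character(1))
mismatch <- unname(class1 != class2)
type_mismatches <- data.frame(
  variable = common,
  class_df1 = unname(class1),
  class_df2 = unname(class2),
  stringsAsFactors = FALSE
)[mismatch, ]

# Label mismatches: a missing label is treated as NA (assumption).
get_label <- function(x) {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else as.character(lab)
}
label1 <- vapply(df1[common], get_label, character(1))
label2 <- vapply(df2[common], get_label, character(1))
label_differs <- !(is.na(label1) & is.na(label2)) &
  (is.na(label1) | is.na(label2) | label1 != label2)

# Column order: compare the order of the common columns only.
order_df1 <- names(df1)[names(df1) %in% common]
order_df2 <- names(df2)[names(df2) %in% common]
order_match <- identical(order_df1, order_df2)
```

Here AGE is reported as both a type and a label mismatch, and order_match is FALSE because the common columns appear in a different order.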
Build Unified Comparison Table
Description
Internal function that merges attribute differences (type, label, length, format) and value differences into a single data frame, giving a consolidated per-variable view of all differences.
Usage
build_unified_comparison(meta, obs_comp, id_vars, df1, df2)
Arguments
meta |
Metadata comparison list from |
obs_comp |
Observation comparison list from |
id_vars |
Character vector of ID variable names (or NULL). |
df1 |
First data frame (base), used to retrieve ID values. |
df2 |
Second data frame (compare). |
Value
A data frame with columns: variable, diff_type, row_or_key, base_value, compare_value. The diff_type column indicates whether the row is a Type, Label, Length, Format, or Value difference.
Compare Two Datasets with CDISC Validation
Description
Flagship function that compares two datasets AND runs CDISC validation on both. Combines dataset comparison with CDISC conformance analysis to provide comprehensive insights into both differences and regulatory compliance.
Usage
cdisc_compare(
df1,
df2,
domain = NULL,
standard = NULL,
id_vars = NULL,
vars = NULL,
ts_data = NULL,
detect_outliers = FALSE,
tolerance = 0,
where = NULL
)
Arguments
df1 |
First data frame to compare, or a file path (character string
ending in |
df2 |
Second data frame to compare, or a file path. |
domain |
Optional character string specifying the CDISC domain code or dataset name (e.g., "DM", "AE", "ADSL"). Strongly recommended – auto-detection can be ambiguous for datasets with common columns. If NULL, auto-detected from df1. |
standard |
Optional character string: "SDTM" or "ADaM". If NULL, auto-detected from df1. |
id_vars |
Optional character vector of ID variable names (e.g.,
|
vars |
Optional character vector of variable names to compare. Only these columns are included in value comparison. Structural and CDISC validation still covers all columns. |
ts_data |
Optional data frame of the TS (Trial Summary) domain. When provided, CDISC standard versions (e.g., SDTM IG 3.4, ADaM IG 1.3) are extracted and included in the results and reports. If NULL (default), version information is omitted. |
detect_outliers |
Logical. When TRUE, runs z-score outlier detection on numeric columns and includes results in the output. Defaults to FALSE. |
tolerance |
Numeric tolerance value for floating-point comparisons (default 0). When tolerance > 0, numeric values are considered equal if their absolute difference is within the tolerance threshold. Character and factor columns always use exact matching regardless of tolerance. |
where |
Optional filter expression as a string (e.g., "AESEV == 'SEVERE'"). Applied to both datasets before comparison. Equivalent to a WHERE clause. |
Value
A list containing:
domain |
Character: detected or supplied CDISC domain |
standard |
Character: detected or supplied CDISC standard (SDTM/ADaM) |
nrow_df1 |
Integer: number of rows in df1 |
ncol_df1 |
Integer: number of columns in df1 |
nrow_df2 |
Integer: number of rows in df2 |
ncol_df2 |
Integer: number of columns in df2 |
id_vars |
Character vector of ID variables used for matching (NULL if positional matching was used) |
comparison |
Result of |
variable_comparison |
Result of |
metadata_comparison |
List of metadata differences: type_mismatches, label_mismatches, length_mismatches, format_mismatches, column ordering |
observation_comparison |
Result of |
unified_comparison |
Data frame combining attribute and value differences per variable. Columns: variable, attribute, base_value, compare_value, and optionally id columns and row when value differences exist |
unmatched_rows |
List with df1_only and df2_only data frames of rows that could not be matched by id_vars (NULL when id_vars is not used) |
cdisc_validation_df1 |
CDISC validation results for df1 |
cdisc_validation_df2 |
CDISC validation results for df2 |
cdisc_conformance_comparison |
Data frame showing which CDISC issues are unique to df1, unique to df2, or common to both |
outlier_notes |
Data frame of z-score outliers (|z| > 3) found in numeric columns of either dataset (NULL when detect_outliers is FALSE) |
cdisc_version |
List of CDISC version information extracted from TS
domain (NULL when ts_data is not provided). See |
Examples
# Create sample SDTM DM domains
dm1 <- data.frame(
STUDYID = "STUDY001",
USUBJID = c("SUBJ001", "SUBJ002"),
DMSEQ = c(1, 1),
RACE = c("WHITE", "BLACK OR AFRICAN AMERICAN"),
stringsAsFactors = FALSE
)
dm2 <- data.frame(
STUDYID = "STUDY001",
USUBJID = c("SUBJ001", "SUBJ003"),
DMSEQ = c(1, 1),
RACE = c("WHITE", "ASIAN"),
ETHNIC = c("NOT HISPANIC", "NOT HISPANIC"),
stringsAsFactors = FALSE
)
# Positional matching (default)
result <- cdisc_compare(dm1, dm2, domain = "DM", standard = "SDTM")
# Key-based matching by ID variables
result <- cdisc_compare(dm1, dm2, domain = "DM", id_vars = c("USUBJID"))
names(result)
Check Compatibility of Two Datasets for Comparison
Description
Checks if two datasets are compatible for comparison by verifying their dimensions, column names, and data types. Returns a list indicating whether the datasets are compatible and detailing any structural differences.
Usage
check_compatibility(df1, df2)
Arguments
df1 |
The first data frame to be compared. |
df2 |
The second data frame to be compared. |
Value
A list containing details about the compatibility of the datasets, including information on dimension equality and common columns.
Clean Dataset
Description
Removes duplicate rows, standardizes column names and text values to uppercase or lowercase, and performs basic data cleaning on a data frame.
Usage
clean_dataset(
df,
variables = NULL,
remove_duplicates = TRUE,
convert_to_case = NULL
)
Arguments
df |
A data frame to be cleaned. |
variables |
Optional; a vector of variable names to specifically clean. If NULL, applies cleaning to all variables. |
remove_duplicates |
Logical; whether to remove duplicate rows. |
convert_to_case |
Optional; convert character variables to "lower" or "upper" case. |
Value
A cleaned data frame.
Examples
df <- data.frame(name = c("Alice", "Bob", "Alice"),
score = c(90, 85, 90),
stringsAsFactors = FALSE)
clean_dataset(df, remove_duplicates = TRUE, convert_to_case = "upper")
Compare Two Datasets by Group
Description
Compares two datasets within subgroups defined by grouping variables. Performs separate comparisons for each group and returns results organized by group.
Usage
compare_by_group(df1, df2, group_vars)
Arguments
df1 |
A data frame representing the first dataset. |
df2 |
A data frame representing the second dataset. |
group_vars |
A character vector of column names to group by. |
Value
A list of comparison results for each group.
Examples
df1 <- data.frame(region = c("A", "A", "B"), value = c(10, 20, 30),
stringsAsFactors = FALSE)
df2 <- data.frame(region = c("A", "A", "B"), value = c(10, 25, 30),
stringsAsFactors = FALSE)
compare_by_group(df1, df2, group_vars = "region")
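For intuition, a group-wise comparison can be sketched in base R by splitting both inputs on the grouping column and comparing each pair of subsets. This sketch is not the package's implementation; it assumes both inputs contain the same groups with the same number of rows per group, and it only counts positional differences in one column.

```r
df1 <- data.frame(region = c("A", "A", "B"), value = c(10, 20, 30),
                  stringsAsFactors = FALSE)
df2 <- data.frame(region = c("A", "A", "B"), value = c(10, 25, 30),
                  stringsAsFactors = FALSE)

# Split each dataset into one data frame per group.
groups1 <- split(df1, df1$region)
groups2 <- split(df2, df2$region)

# Count positional differences in `value` within each group.
diffs_by_group <- vapply(names(groups1), function(g) {
  sum(groups1[[g]]$value != groups2[[g]]$value)
}, integer(1))

diffs_by_group
```

Group "A" contains one differing value and group "B" none.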
Compare Two Datasets
Description
Compares two datasets at three levels in a single call:
- Dataset level – dimensions, column overlap, missing-value totals.
- Variable level – column name discrepancies and data-type mismatches (delegates to compare_variables()).
- Observation level – row-by-row value differences on common columns. Uses positional matching by default, or key-based matching when id_vars is provided.
The return value is a list with class "dataset_comparison", which has
a tidy print() method. The same object is accepted by
generate_summary_report(), generate_detailed_report(), and
compare_by_group().
Usage
compare_datasets(df1, df2, tolerance = 0, vars = NULL, id_vars = NULL)
Arguments
df1 |
A data frame (the base dataset). |
df2 |
A data frame (the compare dataset). |
tolerance |
Numeric tolerance value for floating-point comparisons (default 0). When tolerance > 0, numeric values are considered equal if their absolute difference is within the tolerance threshold. Character and factor columns always use exact matching regardless of tolerance. |
vars |
Optional character vector of variable names to compare. When provided, only these columns are included in the observation-level comparison. Structural comparison (extra columns, type mismatches) still covers all columns. Default is NULL (compare all common columns). |
id_vars |
Optional character vector of column names to use as matching
keys. When provided, rows are matched by these key columns instead of by
position. This allows comparison of datasets with different row counts or
different row orders. Rows that exist in only one dataset are reported in
|
Value
A dataset_comparison list containing:
nrow_df1, ncol_df1 |
Dimensions of df1. |
nrow_df2, ncol_df2 |
Dimensions of df2. |
common_columns |
Character vector of columns present in both. |
extra_in_df1 |
Columns only in df1. |
extra_in_df2 |
Columns only in df2. |
type_mismatches |
Data frame of columns whose class differs
(columns: |
missing_values |
Data frame summarising NA counts per column per
dataset (columns: |
variable_comparison |
Output of |
observation_comparison |
Output of |
id_vars |
Character vector of key columns used for matching, or
|
unmatched_rows |
List with |
Examples
# Positional matching (default)
df1 <- data.frame(id = 1:3, val = c(10, 20, 30))
df2 <- data.frame(id = 1:3, val = c(10, 25, 30))
result <- compare_datasets(df1, df2)
result
# Key-based matching (for different row counts or row orders)
df1 <- data.frame(id = c(1, 2, 3), val = c(10, 20, 30))
df2 <- data.frame(id = c(2, 3, 4), val = c(20, 35, 40))
result <- compare_datasets(df1, df2, id_vars = "id")
result
result$unmatched_rows
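The idea behind key-based matching can be sketched with base R's merge(). This is a conceptual illustration under simple assumptions (a single key column and unique keys), not the code clinCompare runs; the suffixes and object names are chosen for the example.

```r
df1 <- data.frame(id = c(1, 2, 3), val = c(10, 20, 30))
df2 <- data.frame(id = c(2, 3, 4), val = c(20, 35, 40))

# Inner join on the key: only ids present in both datasets are kept.
matched <- merge(df1, df2, by = "id", suffixes = c("_base", "_compare"))

# Matched rows whose values differ.
value_diffs <- matched[matched$val_base != matched$val_compare, ]

# Rows whose key appears in only one dataset.
df1_only <- df1[!(df1$id %in% df2$id), ]
df2_only <- df2[!(df2$id %in% df1$id), ]
```

With these inputs, id 3 is the only matched row with a value difference, id 1 exists only in df1, and id 4 exists only in df2.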
Compare Observations of Two Datasets
Description
Performs row-by-row comparison of two datasets on common columns, identifying specific value differences at the cell level. Returns discrepancy counts and details showing which rows differ and how their values diverge.
Usage
compare_observations(df1, df2, tolerance = 0)
Arguments
df1 |
A data frame representing the first dataset. |
df2 |
A data frame representing the second dataset. |
tolerance |
Numeric tolerance value for floating-point comparisons (default 0). When tolerance > 0, numeric values are considered equal if their absolute difference is within the tolerance threshold. Character and factor columns always use exact matching regardless of tolerance. |
Value
A list containing discrepancy counts and details of row differences.
Examples
df1 <- data.frame(id = 1:3, value = c(1.0, 2.0, 3.0))
df2 <- data.frame(id = 1:3, value = c(1.0, 2.5, 3.0))
compare_observations(df1, df2)
compare_observations(df1, df2, tolerance = 0.00001)
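The tolerance rule described above (values are equal when their absolute difference is within the threshold) can be sketched in base R as follows. The handling of missing values is an assumption made for this sketch: two NAs are treated as equal and an NA against a value as a difference. The package may treat NAs differently.

```r
base_vals <- c(1.0, 2.0, 3.0, NA)
comp_vals <- c(1.0, 2.0004, 3.5, NA)
tolerance <- 0.001

# NA handling (assumption): NA vs NA is equal, NA vs value is a difference.
both_na <- is.na(base_vals) & is.na(comp_vals)
one_na <- xor(is.na(base_vals), is.na(comp_vals))

# A pair differs when the absolute difference exceeds the tolerance.
differs <- !both_na & (one_na | abs(base_vals - comp_vals) > tolerance)

which(differs)
```

Only the third pair (3.0 vs 3.5) exceeds the tolerance; the second pair differs by 0.0004 and is treated as equal.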
Compare Observations by ID Variables
Description
Internal function to match rows between two datasets using specified key variables, then compare values on matched rows. Also identifies unmatched rows in either dataset.
Usage
compare_observations_by_id(df1, df2, id_vars, common_cols, tolerance = 0)
Arguments
df1 |
First data frame (base). |
df2 |
Second data frame (compare). |
id_vars |
Character vector of ID column names. |
common_cols |
Character vector of common column names. |
tolerance |
Numeric tolerance value for floating-point comparisons (default 0). When tolerance > 0, numeric values are considered equal if their absolute difference is within the tolerance threshold. Character and factor columns always use exact matching regardless of tolerance. |
Value
A list with:
observation_comparison |
List with discrepancies and details (same
structure as |
unmatched_rows |
List with df1_only and df2_only data frames |
Batch Compare CDISC Datasets Across Submission Directories
Description
Scans two directories for matching dataset files, runs cdisc_compare()
on each pair, and optionally generates a consolidated Excel report.
Usage
compare_submission(
base_dir,
compare_dir,
format = NULL,
id_vars = NULL,
tolerance = 0,
output_file = NULL
)
Arguments
base_dir |
Path to directory containing base/reference files. |
compare_dir |
Path to directory containing comparison files. |
format |
File format to match: "xpt", "sas7bdat", "csv", or "rds". When NULL (default), auto-detected from the most common file type in base_dir. |
id_vars |
Optional character vector of ID variables (passed to each comparison). When NULL, CDISC-standard keys are auto-detected per domain. |
tolerance |
Numeric tolerance for floating-point comparisons (default 0). |
output_file |
Optional path to Excel (.xlsx) file for consolidated report. |
Value
Named list of cdisc_compare() results, one per matched domain.
Examples
## Not run:
# Auto-detects format from directory contents
results <- compare_submission("v1/", "v2/",
output_file = "submission_diff.xlsx")
# Explicit format
results <- compare_submission("v1/", "v2/", format = "csv")
## End(Not run)
Compare Variables of Two Datasets
Description
Compares the structural attributes of two datasets including column names, data types, and variable ordering. Identifies common columns and reports columns that exist in only one dataset.
Usage
compare_variables(df1, df2)
Arguments
df1 |
A data frame representing the first dataset. |
df2 |
A data frame representing the second dataset. |
Value
A list containing variable comparison details and discrepancy count.
Examples
df1 <- data.frame(id = 1:3, name = c("A", "B", "C"))
df2 <- data.frame(id = 1:3, name = c("A", "B", "C"), score = c(90, 80, 70))
compare_variables(df1, df2)
Convert Data Types of Specified Variables in a Dataset
Description
Converts columns in a data frame to specified types based on a named list mapping column names to target types. Supports conversion to numeric, character, factor, integer, logical, and other R data types.
Usage
convert_data_types(df, conversions)
Arguments
df |
A data frame containing the variables to be converted. |
conversions |
A named list where names correspond to variable names in the dataset, and values are the desired data types (e.g., 'numeric', 'factor'). |
Value
A data frame with converted variable types.
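As an illustration of the named-list mapping described above, the following base R sketch applies a converter function per column. It is not the package's implementation; the lookup table converters and the toy data are hypothetical.

```r
df <- data.frame(id = c("1", "2"), grp = c("a", "b"), stringsAsFactors = FALSE)

# Mapping from column name to target type, as in the `conversions` argument.
conversions <- list(id = "numeric", grp = "factor")

# Hypothetical lookup from type name to base R coercion function.
converters <- list(
  numeric = as.numeric,
  character = as.character,
  factor = as.factor,
  integer = as.integer,
  logical = as.logical
)

# Apply the requested conversion to each named column.
for (v in names(conversions)) {
  df[[v]] <- converters[[conversions[[v]]]](df[[v]])
}

str(df)
```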
Create CDISC Conformance Comparison
Description
Internal function to compare CDISC validation results from two datasets and identify which issues are unique to each or common to both.
Usage
create_conformance_comparison(val_df1, val_df2)
Arguments
val_df1 |
Validation result data frame from df1. |
val_df2 |
Validation result data frame from df2. |
Value
A data frame showing CDISC issue distribution across datasets, with columns:
category |
Character: validation issue category |
variable |
Character: variable name |
df1_only |
Logical: TRUE if issue only appears in df1 |
df2_only |
Logical: TRUE if issue only appears in df2 |
both |
Logical: TRUE if issue appears in both datasets |
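The classification into df1_only, df2_only, and both amounts to set membership on (category, variable) pairs. A minimal base R sketch follows; the category names and validation rows are invented for the example, and the real validation data frames may carry additional columns.

```r
val_df1 <- data.frame(
  category = c("Missing Required", "Type Mismatch"),
  variable = c("ETHNIC", "AGE"),
  stringsAsFactors = FALSE
)
val_df2 <- data.frame(
  category = "Type Mismatch",
  variable = "AGE",
  stringsAsFactors = FALSE
)

# Build a composite key per issue so issues can be compared as sets.
key1 <- paste(val_df1$category, val_df1$variable, sep = "|")
key2 <- paste(val_df2$category, val_df2$variable, sep = "|")

# Union of distinct issues across both datasets.
all_issues <- unique(rbind(val_df1, val_df2))
all_keys <- paste(all_issues$category, all_issues$variable, sep = "|")

conformance <- data.frame(
  category = all_issues$category,
  variable = all_issues$variable,
  df1_only = (all_keys %in% key1) & !(all_keys %in% key2),
  df2_only = (all_keys %in% key2) & !(all_keys %in% key1),
  both = (all_keys %in% key1) & (all_keys %in% key2),
  stringsAsFactors = FALSE
)
```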
Detect CDISC Domain Type
Description
Detects whether a data frame looks like an SDTM domain or ADaM dataset by comparing column names against known CDISC standards. Calculates a confidence score based on the percentage of expected variables present.
Auto-detection is a convenience for exploratory use. For anything important –
validation reports, regulatory submissions, scripted pipelines – always pass
domain and standard explicitly. Datasets with common columns
(STUDYID, USUBJID, etc.) can match multiple domains, and a warning is issued
when the top two candidates score within 10 percentage points of each other.
Usage
detect_cdisc_domain(df, name_hint = NULL)
Arguments
df |
A data frame to analyze. |
name_hint |
Optional character string with the dataset name (e.g., "DM", "ADLB", or a filename like "adlb.xpt"). When provided and it matches a known CDISC domain, that candidate receives a strong confidence boost. This makes detection much more accurate when the filename is available. |
Value
A list containing:
standard |
Character: "SDTM", "ADaM", or "Unknown" |
domain |
Character: domain code (e.g., "DM", "AE") or dataset name (e.g., "ADSL"), or NA |
confidence |
Numeric between 0 and 1 indicating match quality |
message |
Character: human-readable explanation |
Examples
# Create a sample SDTM DM domain
dm <- data.frame(
STUDYID = "STUDY001",
USUBJID = "SUBJ001",
SUBJID = "001",
DMSEQ = 1,
RACE = "WHITE",
ETHNIC = "NOT HISPANIC OR LATINO",
ARMCD = "ARM01",
ARM = "Treatment A",
stringsAsFactors = FALSE
)
result <- detect_cdisc_domain(dm)
print(result)
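The confidence score described above (the share of expected variables that are present) can be sketched in a few lines of base R. The expected-variable lists below are hypothetical and heavily abbreviated; they are not the metadata shipped with the package, and the sketch ignores the name_hint boost.

```r
# Hypothetical, abbreviated expected-variable lists per domain.
expected <- list(
  DM = c("STUDYID", "USUBJID", "SUBJID", "RACE", "ETHNIC", "ARMCD", "ARM"),
  AE = c("STUDYID", "USUBJID", "AESEQ", "AETERM", "AESEV")
)

# Column names of the dataset being classified.
cols <- c("STUDYID", "USUBJID", "SUBJID", "RACE", "ETHNIC", "ARMCD", "ARM")

# Score each candidate by the fraction of its expected variables present.
scores <- vapply(expected, function(vars) mean(vars %in% cols), numeric(1))
ranked <- sort(scores, decreasing = TRUE)
best <- names(ranked)[1]

# Flag ambiguity when the top two candidates are within 10 percentage points.
ambiguous <- unname(ranked[1] - ranked[2]) < 0.10
```

Shared columns such as STUDYID and USUBJID give AE a non-zero score here, which is why passing domain and standard explicitly is the safer choice.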
Detect Outliers Using Z-Score Method
Description
Internal function to detect potential outliers in numeric columns of both datasets using the z-score method. Values whose absolute z-score exceeds threshold (|z| > 3 by default) are flagged. Results are returned as advisory notes for the user.
Usage
detect_outliers_zscore(df1, df2, threshold = 3)
Arguments
df1 |
First data frame (base). |
df2 |
Second data frame (compare). |
threshold |
Numeric z-score threshold (default 3). |
Value
A data frame with columns: dataset, variable, row, value, zscore. Empty data frame if no outliers found.
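A z-score check of this kind can be sketched in base R as below. This is an illustration for a single data frame, not the internal function: the helper name zscore_outliers() is hypothetical, and it assumes the sample standard deviation from sd() and skips constant columns.

```r
zscore_outliers <- function(df, dataset_name, threshold = 3) {
  out <- list()
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  for (v in numeric_cols) {
    x <- df[[v]]
    s <- sd(x, na.rm = TRUE)
    if (is.na(s) || s == 0) next  # skip constant or all-missing columns
    z <- (x - mean(x, na.rm = TRUE)) / s
    hit <- which(abs(z) > threshold)
    if (length(hit) > 0) {
      out[[v]] <- data.frame(
        dataset = dataset_name, variable = v, row = hit,
        value = x[hit], zscore = z[hit], stringsAsFactors = FALSE
      )
    }
  }
  if (length(out) == 0) return(data.frame())
  do.call(rbind, unname(out))
}

# Toy lab data: one extreme result among 20 records.
lab <- data.frame(LBSEQ = 1:20, LBSTRESN = c(rep(10, 19), 100))
outliers <- zscore_outliers(lab, "df1")
```

With the sample standard deviation, the largest possible |z| in a column of n values is (n - 1) / sqrt(n), so a threshold of 3 cannot flag anything in columns with fewer than 11 values.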
Export Comparison Report to File
Description
Exports a dataset or CDISC comparison result to a file in one of several formats. The format is detected automatically from the file extension (.html, .txt, .xlsx) unless format is given.
Usage
export_report(result, file, format = NULL)
Arguments
result |
A list from |
file |
Character string specifying the output file path. File extension determines format: .html, .txt, or .xlsx. |
format |
Character string specifying output format: "html", "text", or "excel". If NULL (default), format is auto-detected from file extension. |
Details
Supported formats:
- HTML (.html): Self-contained HTML report with styling and interactive charts.
- Text (.txt): Plain text report suitable for console review.
- Excel (.xlsx): Multi-sheet workbook with tabbed data:
  - "Summary": Dataset dimensions, domain, standard, matching type, tolerance
  - "Variable Diffs": Metadata attribute differences
  - "Value Diffs": Unified diff data frame from get_all_differences()
  - "CDISC Validation": Combined validation results (for CDISC comparisons only)
The result object can be either a dataset_comparison (from compare_datasets())
or cdisc_comparison (from cdisc_compare()). All features are supported for both.
Value
Invisibly returns the input result (useful for piping).
Examples
# Create sample datasets
df1 <- data.frame(
ID = c(1, 2, 3),
NAME = c("Alice", "Bob", "Charlie"),
AGE = c(25, 30, 35)
)
df2 <- data.frame(
ID = c(1, 2, 3),
NAME = c("Alice", "Bob", "Charles"),
AGE = c(25, 30, 36)
)
# Compare datasets
result <- compare_datasets(df1, df2)
# Export to different formats (write to tempdir)
export_report(result, file.path(tempdir(), "report.html"))
export_report(result, file.path(tempdir(), "report.txt"))
# Explicit format specification
export_report(result, file.path(tempdir(), "report.xlsx"), format = "excel")
Extract CDISC Version from TS Domain
Description
Reads a Trial Summary (TS) dataset and extracts the CDISC standard version information. Looks for SDTM IG version (TSPARMCD = "SDTIGVER" or "CDISCVER") and ADaM IG version (TSPARMCD = "ADAMIGVR") parameters.
Usage
extract_cdisc_version(ts_data)
Arguments
ts_data |
A data frame representing a TS (Trial Summary) domain. Must contain at minimum TSPARMCD and TSVAL columns. |
Value
A list containing:
sdtm_ig_version |
Character: SDTM IG version (e.g., "3.4"), or NA |
adam_ig_version |
Character: ADaM IG version (e.g., "1.3"), or NA |
study_id |
Character: STUDYID from TS if available, or NA |
protocol_title |
Character: Protocol title if available, or NA |
version_note |
Character: Formatted note string for reports |
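The lookup described above can be sketched in base R as a filter on TSPARMCD followed by reading TSVAL. The TS rows below are made up for illustration, and the helper ts_value() is hypothetical rather than part of the package.

```r
# Toy Trial Summary data with the two minimum columns plus STUDYID.
ts <- data.frame(
  STUDYID = "STUDY001",
  TSPARMCD = c("TITLE", "SDTIGVER", "ADAMIGVR"),
  TSVAL = c("Example protocol title", "3.4", "1.3"),
  stringsAsFactors = FALSE
)

# Return the first TSVAL whose TSPARMCD matches any of the codes, else NA.
ts_value <- function(ts_data, codes) {
  hit <- ts_data$TSVAL[ts_data$TSPARMCD %in% codes]
  if (length(hit) == 0) NA_character_ else hit[1]
}

sdtm_ig_version <- ts_value(ts, c("SDTIGVER", "CDISCVER"))
adam_ig_version <- ts_value(ts, "ADAMIGVR")
```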
Format Validation Results as HTML
Description
Internal function to format validation results as an HTML table.
Usage
format_validation_html(validation_df)
Arguments
validation_df |
Validation results data frame. |
Value
Character vector of HTML lines.
Format Validation Summary
Description
Internal function to format validation results as text.
Usage
format_validation_summary(validation_df)
Arguments
validation_df |
Validation results data frame. |
Value
Character vector of formatted lines.
Generate CDISC Validation Report
Description
Generates a formatted report from the results of cdisc_compare(). Supports both
text-based console output and HTML reports with professional styling and color-coding.
Usage
generate_cdisc_report(cdisc_results, output_format = "text", file_name = NULL)
Arguments
cdisc_results |
A list output from |
output_format |
Character string: either "text" (default) for console output or "html" for HTML report. |
file_name |
Optional character string specifying the output file path. For text format, the report is appended to this file. For HTML format, the path must be supplied explicitly. If NULL, output is not written to file. |
Details
The report includes:
- Dataset Comparison Summary
- CDISC Compliance for each dataset
- CDISC Conformance Comparison
For text output, formatting uses console-friendly layout. For HTML output, a self-contained report is generated with color-coded severity levels: red for ERROR, orange for WARNING, blue for INFO.
Value
Invisibly returns the input cdisc_results (useful for piping).
Examples
## Not run:
# Create sample datasets
dm1 <- data.frame(
STUDYID = "STUDY001",
USUBJID = c("SUBJ001", "SUBJ002"),
DMSEQ = c(1, 1),
RACE = c("WHITE", "BLACK OR AFRICAN AMERICAN")
)
dm2 <- data.frame(
STUDYID = "STUDY001",
USUBJID = c("SUBJ001", "SUBJ003"),
DMSEQ = c(1, 1),
RACE = c("WHITE", "ASIAN")
)
result <- cdisc_compare(dm1, dm2, domain = "DM")
# Generate text report to console
generate_cdisc_report(result, output_format = "text")
# Generate HTML report to file
out <- file.path(tempdir(), "report.html")
generate_cdisc_report(result, output_format = "html", file_name = out)
## End(Not run)
Generate Visualization for Data Comparison
Description
Creates a ggplot2 bar chart visualization showing the number of discrepancies per variable from comparison results. Provides a clear visual summary of data differences across variables in the datasets being compared.
Usage
generate_comparison_visualization(comparison_results)
Arguments
comparison_results |
A list or data frame containing the results of dataset comparisons. |
Value
A plot object visualizing the comparison results.
Generate a Detailed Report of Dataset Comparison
Description
Creates a detailed report outlining all the differences found in the comparison, including variable differences, observation differences, and group-based discrepancies.
Usage
generate_detailed_report(
comparison_results,
output_format = "text",
file_name = NULL
)
Arguments
comparison_results |
A list containing the results of dataset comparisons. |
output_format |
Format of the output ('text' or 'html'). |
file_name |
Name of the file to save the report to (applicable for 'html' format). |
Value
The detailed report. For 'text', prints to console. For 'html', writes to file.
Examples
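## Not run:
df1 <- data.frame(id = 1:3, val = c(10, 20, 30))
df2 <- data.frame(id = 1:3, val = c(10, 25, 30))
comparison_results <- compare_datasets(df1, df2)
generate_detailed_report(comparison_results, output_format = "text")
## End(Not run)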
Generate HTML Report
Description
Internal function to generate a self-contained HTML report with styling.
Usage
generate_html_report(cdisc_results)
Arguments
cdisc_results |
List from |
Value
Character string containing the HTML report.
Generate a Summary Report of Dataset Comparison
Description
Provides a summary of the comparison results, highlighting key points such as the number of differing observations and variables.
Usage
generate_summary_report(
comparison_results,
detail_level = "high",
output_format = "text",
file_name = NULL
)
Arguments
comparison_results |
A list containing the results of dataset comparisons. |
detail_level |
The level of detail ('high', 'medium', 'low') for the summary. |
output_format |
Format of the output ('text' or 'html'). |
file_name |
Name of the file to save the report to (applicable for 'html' format). |
Value
The summary report. For 'text', prints to console. For 'html', writes to file.
Examples
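## Not run:
df1 <- data.frame(id = 1:3, val = c(10, 20, 30))
df2 <- data.frame(id = 1:3, val = c(10, 25, 30))
comparison_results <- compare_datasets(df1, df2)
generate_summary_report(comparison_results, detail_level = "high", output_format = "text")
## End(Not run)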
Generate Text Report
Description
Internal function to generate a formatted text report from CDISC comparison results.
Usage
generate_text_report(cdisc_results)
Arguments
cdisc_results |
List from |
Value
Character string containing the formatted text report.
ADaM Metadata
Description
Returns metadata for ADaM datasets following CDISC standards. Provides information about required, conditional, and other variables for each ADaM analysis dataset.
Usage
get_adam_metadata(version = "1.3")
Arguments
version |
Character string specifying the ADaM IG version. Supported values: "1.3" (default), "1.2", "1.1". Note: All versions currently return identical variable definitions. The ADaM IG revisions (1.1 -> 1.3) changed guidance and rules but not the core variable inventory. The parameter exists for provenance tracking only – it does not enable version-specific validation. |
Details
Variable definitions are based on the published CDISC ADaM Implementation Guide. The canonical machine-readable source is the CDISC Library API (https://www.cdisc.org/cdisc-library), which requires CDISC membership. The metadata shipped with clinCompare is hand-curated from the published IG specifications.
Value
A named list where keys are ADaM dataset names and values are data.frames with columns:
- variable
Variable name (character)
- label
Variable label/description (character)
- type
Data type: "Char" for character or "Num" for numeric
- core
Importance level: "Req" (Required), "Cond" (Conditional)
Extract All Differences as a Unified Data Frame
Description
Converts per-variable observation differences into a single long-format
data frame suitable for filtering with dplyr, writing to CSV, or
programmatic analysis. This is the R equivalent of SAS PROC COMPARE's
OUT= dataset with _TYPE_ and _DIF_ variables.
Accepts output from compare_datasets(), cdisc_compare(), or any list
containing an observation_comparison element with the standard
discrepancies / details / id_details structure.
Usage
get_all_differences(comparison_results)
Arguments
comparison_results |
A |
Value
A data frame with one row per differing cell. Columns:
- Variable
Character: column name where the difference was found.
- Row
Integer: row index in df1 (positional matching).
- Base
The value in df1 (base dataset).
- Compare
The value in df2 (compare dataset).
- Diff
Numeric: Base - Compare (NA for character columns).
- PctDiff
Numeric: absolute percentage difference relative to Base (NA when Base is 0 or column is character).
When key-based matching was used (id_vars), the ID columns are prepended to the left of the data frame.
Returns an empty data frame with the expected columns when no differences exist or observation comparison was skipped.
Examples
df1 <- data.frame(id = 1:3, value = c(10, 20, 30), name = c("A", "B", "C"))
df2 <- data.frame(id = 1:3, value = c(10, 25, 30), name = c("A", "B", "D"))
result <- compare_datasets(df1, df2)
diffs <- get_all_differences(result)
head(diffs)
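To make the Diff and PctDiff columns concrete, here is one plausible reading of their definitions in base R. The exact formula is an assumption based on the column descriptions above (Base minus Compare, and the absolute difference as a percentage of Base); the package's own calculation may differ in detail.

```r
base <- c(10, 20, 0)
compare <- c(12, 15, 5)

# Diff: Base - Compare.
diff_vals <- base - compare

# PctDiff: absolute difference relative to Base, NA when Base is 0.
pct_diff <- ifelse(base == 0, NA_real_, abs(diff_vals) / abs(base) * 100)

data.frame(Base = base, Compare = compare, Diff = diff_vals, PctDiff = pct_diff)
```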
SDTM Metadata
Description
Returns metadata for SDTM domains following CDISC standards. Provides information about required, expected, and permissible variables for each SDTM domain.
Usage
get_sdtm_metadata(version = "3.4")
Arguments
version |
Character string specifying the SDTM IG version. Supported values: "3.4" (default, based on SDTM v2.0), "3.3" (based on SDTM v1.7). Version "3.3" excludes 7 domains introduced in v3.4 (GF, CP, BE, BS, SM, TD, TM). Within a domain, the variable lists are the same across versions – this parameter only controls which domains are available, not per-variable version differences. |
Details
Variable definitions are based on the published CDISC SDTM Implementation Guide. The canonical machine-readable source is the CDISC Library API (https://www.cdisc.org/cdisc-library), which requires CDISC membership. The metadata shipped with clinCompare is hand-curated from the published IG specifications and should be cross-referenced with the official CDISC Library for regulatory submissions.
Value
A named list where keys are SDTM domain codes and values are data.frames with columns:
- variable
Variable name (character)
- label
Variable label/description (character)
- type
Data type: "Char" for character or "Num" for numeric
- core
Importance level: "Req" (Required), "Exp" (Expected), or "Perm" (Permissible)
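To show how a list of this shape can be queried, the sketch below builds a small hand-written stand-in rather than calling get_sdtm_metadata(). The object meta and its four DM rows are illustrative assumptions covering only a fraction of a real domain; the column names follow the Value section above.

```r
# Hypothetical, heavily abridged stand-in for the returned list.
# Only the column layout (variable, label, type, core) is taken from the
# documentation; the rows are a small illustrative subset of DM.
meta <- list(
  DM = data.frame(
    variable = c("STUDYID", "DOMAIN", "USUBJID", "AGE"),
    label    = c("Study Identifier", "Domain Abbreviation",
                 "Unique Subject Identifier", "Age"),
    type     = c("Char", "Char", "Char", "Num"),
    core     = c("Req", "Req", "Req", "Exp"),
    stringsAsFactors = FALSE
  )
)

# Domain codes are the list names
domains <- names(meta)

# Required variables for one domain
required_dm <- meta$DM$variable[meta$DM$core == "Req"]

# How many variables fall under each core level
core_counts <- table(meta$DM$core)
```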
Get Tolerance Level for Comparisons
Description
Retrieves the currently set tolerance level for numeric comparisons.
Usage
get_tolerance()
Value
The current tolerance level as a numeric value.
Handle Missing Values in Dataset
Description
Handles missing values (NA) in a data frame using one of several strategies: exclude rows, replace with a value, fill with column mean, fill with column median, or flag with an indicator column.
Usage
handle_missing_values(df, method = "exclude", replace_with = NULL)
Arguments
df |
A data frame with potential missing values. |
method |
Method for handling missing values ('exclude', 'replace', 'mean', 'median', 'flag'). |
replace_with |
Optional; a value or named list to replace missing values with (used with 'replace' method). |
Value
A data frame after handling missing values.
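The five strategies can be sketched in base R as follows. handle_na_sketch() is a hypothetical helper, not the package function: it assumes that 'mean' and 'median' apply to numeric columns only, that 'replace' takes a single scalar (the named-list form is not covered), and that 'flag' adds one logical column, here called any_missing.

```r
# Illustrative base-R sketch; NOT clinCompare's implementation.
handle_na_sketch <- function(df, method = "exclude", replace_with = NULL) {
  is_num <- vapply(df, is.numeric, logical(1))
  # Fill NAs in the numeric columns with a per-column statistic
  fill_numeric <- function(stat) {
    for (v in names(df)[is_num]) {
      df[[v]][is.na(df[[v]])] <- stat(df[[v]], na.rm = TRUE)
    }
    df
  }
  switch(method,
    # Drop every row that contains at least one NA
    exclude = df[stats::complete.cases(df), , drop = FALSE],
    # Replace NAs with one scalar value (assumption: scalar only)
    replace = {
      stopifnot(!is.null(replace_with), length(replace_with) == 1)
      for (v in names(df)) {
        if (anyNA(df[[v]])) df[[v]][is.na(df[[v]])] <- replace_with
      }
      df
    },
    mean = fill_numeric(mean),
    median = fill_numeric(stats::median),
    # Keep all rows and mark the incomplete ones (column name is hypothetical)
    flag = {
      df$any_missing <- !stats::complete.cases(df)
      df
    },
    stop("Unknown method: ", method)
  )
}

scores <- data.frame(id = 1:4, score = c(90, NA, 70, 80))
excluded    <- handle_na_sketch(scores, "exclude")
mean_filled <- handle_na_sketch(scores, "mean")
replaced    <- handle_na_sketch(scores, "replace", replace_with = 0)
flagged     <- handle_na_sketch(scores, "flag")
```

Excluding rows changes the row count, which matters for positional comparison; the fill and flag strategies keep the two datasets aligned row for row.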
Initialize Settings for Data Comparison
Description
Initializes default settings for dataset comparison including tolerance and other parameters stored in a package environment.
Usage
initialize_comparison_settings(tolerance = 0, missing_value_method = "ignore")
Arguments
tolerance |
Default tolerance level for numeric comparisons. |
missing_value_method |
Default method for handling missing values in data comparison. |
Value
Invisible NULL. Called for its side effect of updating package settings.
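Storing settings in an environment, as described above, is a common R pattern. The sketch below shows the general idea only: .settings, init_settings(), read_tolerance(), and reset_settings() are hypothetical names, and the package's internal layout may differ.

```r
# Illustrative sketch of settings kept in an environment; all names here
# are hypothetical and not part of clinCompare.
.settings <- new.env(parent = emptyenv())
defaults <- list(tolerance = 0, missing_value_method = "ignore")

init_settings <- function(tolerance = 0, missing_value_method = "ignore") {
  stopifnot(is.numeric(tolerance), length(tolerance) == 1, tolerance >= 0)
  assign("tolerance", tolerance, envir = .settings)
  assign("missing_value_method", missing_value_method, envir = .settings)
  invisible(NULL)  # called for its side effect
}

read_tolerance <- function() get("tolerance", envir = .settings)

# Resetting is just re-initialising with the stored defaults
reset_settings <- function() do.call(init_settings, defaults)

init_settings(tolerance = 1e-8)
tol_after_init <- read_tolerance()

reset_settings()
tol_after_reset <- read_tolerance()
```

Environments have reference semantics, so a value written by one function is visible to every later call without being passed around explicitly.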
Prepare Datasets for Comparison
Description
Prepares two datasets for comparison by optionally sorting by specified columns and filtering rows based on a condition.
Usage
prepare_datasets(df1, df2, sort_columns = NULL, filter_criteria = NULL)
Arguments
df1 |
First dataset to be prepared. |
df2 |
Second dataset to be prepared. |
sort_columns |
Columns to sort the datasets by. |
filter_criteria |
Criteria for filtering the datasets. |
Value
A list containing two prepared datasets.
Examples
df1 <- data.frame(id = c(3, 1, 2), score = c(70, 90, 80))
df2 <- data.frame(id = c(2, 3, 1), score = c(80, 75, 90))
prepare_datasets(df1, df2, sort_columns = "id", filter_criteria = "score > 75")
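What sorting and filtering do to the two inputs can be sketched in base R. prepare_sketch() is a hypothetical helper, and two details are assumptions rather than documented behaviour: the criteria string is evaluated as an R expression against the data frame's columns, and filtering happens before sorting.

```r
# Illustrative base-R sketch; NOT clinCompare's implementation.
prepare_sketch <- function(df, sort_columns = NULL, filter_criteria = NULL) {
  if (!is.null(filter_criteria)) {
    # Evaluate the criteria string with the columns of df in scope
    keep <- eval(parse(text = filter_criteria), envir = df,
                 enclos = parent.frame())
    df <- df[!is.na(keep) & keep, , drop = FALSE]
  }
  if (!is.null(sort_columns)) {
    # order() accepts several sort keys; unname() avoids clashes with its
    # own argument names
    ord <- do.call(order, unname(as.list(df[sort_columns])))
    df <- df[ord, , drop = FALSE]
  }
  rownames(df) <- NULL
  df
}

df1 <- data.frame(id = c(3, 1, 2), score = c(70, 90, 80))
df2 <- data.frame(id = c(2, 3, 1), score = c(80, 75, 90))

p1 <- prepare_sketch(df1, sort_columns = "id", filter_criteria = "score > 75")
p2 <- prepare_sketch(df2, sort_columns = "id", filter_criteria = "score > 75")
```

After preparation both data frames hold ids 1 and 2 in the same order, so a positional comparison lines up the intended rows.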
Print CDISC Comparison Results
Description
Prints a concise summary of CDISC comparison results. Shows dataset dimensions, domain, number of differences, and a pass/fail verdict based on CDISC validation errors.
Usage
## S3 method for class 'cdisc_comparison'
print(x, ...)
Arguments
x |
A cdisc_comparison object returned by cdisc_compare(). |
... |
Additional arguments (ignored). |
Value
Invisibly returns x.
Print Dataset Comparison Results
Description
Prints the results held in a dataset_comparison object to the console.
Usage
## S3 method for class 'dataset_comparison'
print(x, ...)
Arguments
x |
A dataset_comparison object. |
... |
Ignored. |
Value
Invisibly returns x.
Print CDISC Validation Results
Description
Pretty-prints CDISC validation results to the console with a summary and grouped output by category. Displays counts of errors, warnings, and info messages.
Usage
print_cdisc_validation(validation_result)
Arguments
validation_result |
A data frame from validate_cdisc(). |
Details
Output includes:
- Summary counts of errors, warnings, and info messages
- Issues grouped by category
- Each issue displayed with its variable name and message
Value
Invisibly returns the input (useful for piping).
Examples
## Not run:
# Validate a dataset
dm <- data.frame(
STUDYID = "STUDY001",
USUBJID = c("SUBJ001", "SUBJ002"),
DMSEQ = c(1, 1),
RACE = c("WHITE", "BLACK OR AFRICAN AMERICAN")
)
validation_result <- validate_cdisc(dm, domain = "DM", standard = "SDTM")
print_cdisc_validation(validation_result)
## End(Not run)
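The summary counts and per-category grouping can also be reproduced by hand on any data frame with the documented columns. The sketch below uses a hand-written validation data frame whose rows are purely illustrative and were not produced by the package; only the column names, category labels, and severity levels are taken from this manual.

```r
# Hand-written stand-in for a validation result; the rows are illustrative.
validation <- data.frame(
  category = c("Missing Required Variable", "Missing Expected Variable",
               "Non-Standard Variable"),
  variable = c("DOMAIN", "AGE", "DMSEQ"),
  message  = c("Required variable is missing",
               "Expected variable is missing",
               "Variable is not in the specification"),
  severity = c("ERROR", "WARNING", "INFO"),
  stringsAsFactors = FALSE
)

# Counts per severity, keeping zero counts for levels that do not occur
severity_counts <- table(factor(validation$severity,
                                levels = c("ERROR", "WARNING", "INFO")))

# Issues grouped by category, each printed with its variable and message
by_category <- split(validation, validation$category)
for (cat_name in names(by_category)) {
  cat(cat_name, "\n", sep = "")
  issues <- by_category[[cat_name]]
  cat(sprintf("  [%s] %s: %s\n",
              issues$severity, issues$variable, issues$message), sep = "")
}

# Only the rows that would typically block a submission
errors_only <- validation[validation$severity == "ERROR", ]
```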
Generate a Report of Differences Found in Dataset Comparison
Description
Creates a formatted report summarizing all differences found between two data frames, including column-level and value-level differences.
Usage
report_differences(variable_diffs, observation_diffs)
Arguments
variable_diffs |
A data frame or list detailing the differences found in variables. |
observation_diffs |
A data frame or list detailing the differences found in observations. |
Value
A structured report of the differences, typically a list or a data frame.
Reset Comparison Settings to Defaults
Description
Resets all comparison settings back to their defaults, clearing any custom tolerance or other parameters.
Usage
reset_comparison_settings()
Value
Invisible NULL. Called for its side effect of resetting package settings.
Set Tolerance Level for Comparisons
Description
Sets the numeric tolerance for floating-point comparisons, allowing small differences within the tolerance to be treated as equal.
Usage
set_tolerance(tolerance = 0)
Arguments
tolerance |
A non-negative numeric value specifying the tolerance level. |
Value
Invisible NULL. Called for its side effect of updating the tolerance setting.
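The effect of a tolerance on a numeric comparison can be illustrated with a short base-R sketch. differs_beyond() is a hypothetical helper, and it assumes an absolute tolerance with a strict greater-than test; the manual does not state whether the package uses an absolute or a relative rule.

```r
# Illustrative sketch; assumes an ABSOLUTE tolerance, which is an assumption
# and not confirmed by the clinCompare documentation.
differs_beyond <- function(base, compare, tolerance = 0) {
  stopifnot(is.numeric(tolerance), length(tolerance) == 1, tolerance >= 0)
  # TRUE where the two values are further apart than the tolerance allows
  abs(base - compare) > tolerance
}

base    <- c(1.0, 2.0, 3.0)
compare <- c(1.0000001, 2.5, 3.0)

strict <- differs_beyond(base, compare)                    # tolerance = 0
loose  <- differs_beyond(base, compare, tolerance = 1e-6)
```

With a tolerance of 0 the first pair counts as a difference; with 1e-6 it is treated as equal, while the genuine change from 2.0 to 2.5 is still reported.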
Summarize CDISC Comparison Results
Description
Returns a concise one-row data frame summarizing the comparison: domain, standard, row/col counts, number of differences, and CDISC error/warning counts.
Usage
## S3 method for class 'cdisc_comparison'
summary(object, ...)
Arguments
object |
A cdisc_comparison object returned by cdisc_compare(). |
... |
Additional arguments (ignored). |
Value
A one-row data frame with summary metrics.
Transform Variables in a Dataset
Description
Applies mathematical or logical transformations to specified columns in a data frame based on a named list of transformation functions.
Usage
transform_variables(df, transformations)
Arguments
df |
A data frame containing the variables to be transformed. |
transformations |
A list of functions for transforming the variables. The names of the list should correspond to the variable names in the dataset. |
Value
A data frame with transformed variables.
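A named list of functions maps naturally onto columns. The sketch below shows one way this could work; apply_transformations() is a hypothetical helper, and stopping when a name does not match a column is an assumption about error handling.

```r
# Illustrative base-R sketch; NOT clinCompare's implementation.
apply_transformations <- function(df, transformations) {
  stopifnot(is.list(transformations), !is.null(names(transformations)))
  unknown <- setdiff(names(transformations), names(df))
  if (length(unknown) > 0) {
    stop("Columns not found: ", paste(unknown, collapse = ", "))
  }
  # Each list name selects a column; the function replaces that column
  for (v in names(transformations)) {
    df[[v]] <- transformations[[v]](df[[v]])
  }
  df
}

vitals <- data.frame(dose = c(150, 200), flag = c("y", "n"),
                     stringsAsFactors = FALSE)

out <- apply_transformations(
  vitals,
  list(dose = function(x) x / 2,  # halve the numeric column
       flag = toupper)            # normalise case before comparing
)
```

Applying the same transformation list to both datasets before comparison keeps differences in units or letter case from being reported as data differences.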
Validate ADaM Compliance
Description
Validates a data frame against a specific ADaM dataset specification. Similar to
validate_sdtm() but uses ADaM metadata and reports a missing Conditional variable at INFO severity.
Usage
validate_adam(df, domain)
Arguments
df |
A data frame to validate. |
domain |
Character string specifying the ADaM dataset name (e.g., "ADSL", "ADAE"). |
Details
Severity levels:
- ERROR: Required variable is missing
- WARNING: Data type mismatch detected
- INFO: Conditional variable missing, non-standard variable, or variable information
Value
A data frame with validation results containing columns:
category |
Character: validation issue type |
variable |
Character: variable name |
message |
Character: issue description |
severity |
Character: "ERROR", "WARNING", or "INFO" |
Validate CDISC Compliance
Description
Main validation entry point that checks whether a data frame conforms to CDISC standards.
If domain and standard are not provided, they are automatically detected via
detect_cdisc_domain(). Dispatches to validate_sdtm() or validate_adam() as appropriate.
Usage
validate_cdisc(df, domain = NULL, standard = NULL)
Arguments
df |
A data frame to validate. |
domain |
Optional character string specifying the CDISC domain code (e.g., "DM", "AE") or ADaM dataset name (e.g., "ADSL", "ADAE"). If NULL, auto-detected. |
standard |
Optional character string: "SDTM" or "ADaM". If NULL, auto-detected. |
Value
A data frame with columns:
category |
Character: type of validation issue ("Missing Required Variable", "Missing Expected Variable", "Type Mismatch", "Non-Standard Variable", "Variable Info") |
variable |
Character: variable name |
message |
Character: description of the issue |
severity |
Character: "ERROR", "WARNING", or "INFO" |
Examples
# Auto-detect domain
dm <- data.frame(
STUDYID = "STUDY001",
USUBJID = "SUBJ001",
DMSEQ = 1,
RACE = "WHITE",
stringsAsFactors = FALSE
)
results <- validate_cdisc(dm)
print(results)
# Validate with explicit domain specification
results <- validate_cdisc(dm, domain = "DM", standard = "SDTM")
Validate SDTM Compliance
Description
Validates a data frame against a specific SDTM domain specification. Checks for missing required/expected variables, data type mismatches, and non-standard variables.
Usage
validate_sdtm(df, domain)
Arguments
df |
A data frame to validate. |
domain |
Character string specifying the SDTM domain code (e.g., "DM", "AE", "VS"). |
Details
Severity levels:
- ERROR: Required variable is missing
- WARNING: Expected variable is missing or data type mismatch detected
- INFO: Non-standard variable present or variable information
Value
A data frame with validation results containing columns:
category |
Character: validation issue type |
variable |
Character: variable name |
message |
Character: issue description |
severity |
Character: "ERROR", "WARNING", or "INFO" |
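The severity rules listed under Details can be sketched as a check of a data frame against a specification table. check_against_spec() is a hypothetical helper written for illustration, the four-row spec is a small hand-written subset, and the message wording is invented; only the category labels, severity levels, and column names follow this manual.

```r
# Illustrative base-R sketch; NOT clinCompare's implementation.
check_against_spec <- function(df, spec) {
  issue <- function(category, variable, message, severity) {
    data.frame(category = category, variable = variable, message = message,
               severity = severity, stringsAsFactors = FALSE)
  }
  out <- list()
  add <- function(x) out[[length(out) + 1]] <<- x

  for (i in seq_len(nrow(spec))) {
    v <- spec$variable[i]
    if (!v %in% names(df)) {
      # Missing Required is an ERROR, missing Expected a WARNING
      if (spec$core[i] == "Req") {
        add(issue("Missing Required Variable", v,
                  "Required variable is missing", "ERROR"))
      } else if (spec$core[i] == "Exp") {
        add(issue("Missing Expected Variable", v,
                  "Expected variable is missing", "WARNING"))
      }
      next
    }
    # Present variables are checked for their Char/Num type
    actual <- if (is.numeric(df[[v]])) "Num" else "Char"
    if (actual != spec$type[i]) {
      add(issue("Type Mismatch", v,
                sprintf("Expected %s, found %s", spec$type[i], actual),
                "WARNING"))
    }
  }
  # Anything not in the specification is reported for information only
  for (v in setdiff(names(df), spec$variable)) {
    add(issue("Non-Standard Variable", v,
              "Variable is not in the specification", "INFO"))
  }
  if (length(out) == 0) {
    return(issue(character(), character(), character(), character()))
  }
  do.call(rbind, out)
}

# Hand-written, abridged specification (illustrative subset of DM)
spec <- data.frame(
  variable = c("STUDYID", "USUBJID", "AGE", "SEX"),
  type     = c("Char", "Char", "Num", "Char"),
  core     = c("Req", "Req", "Exp", "Req"),
  stringsAsFactors = FALSE
)

dm <- data.frame(STUDYID = "STUDY001", USUBJID = "SUBJ001",
                 AGE = "34", EXTRA = 1, stringsAsFactors = FALSE)

res <- check_against_spec(dm, spec)
```

Here AGE is stored as character, SEX is absent, and EXTRA is not in the specification, so the result has one WARNING, one ERROR, and one INFO row.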