WFDB: An Introduction to the Waveform Database Software Package

library(EGM)

Historically {EGM} wrapped the Waveform Database (WFDB) software package, a well documented suite of C and C++ applications for interacting with signal data. Recent releases of {EGM} now provide native readers, writers, and annotation utilities implemented directly in R and C++, so the core functionality of the package no longer depends on an external WFDB installation. You can work with WFDB headers, signals, and annotations on any system where {EGM} is installed, including CRAN environments.

Because {EGM} ships its own WFDB readers and writers, no additional system dependencies are required to begin analysing WFDB data in R. Installing {EGM} from CRAN (or the development version from GitHub) is sufficient for the examples in this guide to run end-to-end on any platform.

If you would still like to interface with the traditional C-based WFDB tools—for example to compare results or to call PhysioNet utilities that have not yet been ported—you can follow the instructions in the WFDB GitHub repository. The C-based WFDB is more robust and has a full set of tools and functionality not found in the R version, however the key features of reading and writing data and annotators is compatible with the R implementation used by {EGM}.

Signal Data

Understanding the WFDB Storage Format

WFDB files store signal data as integer values (called digital units) in binary .dat files. These integers represent the raw output from an analog-to-digital converter (ADC). To convert these integer values to meaningful physical units (like millivolts), the WFDB header file (.hea) contains conversion metadata:

The conversion formula between digital and physical units is:

\[ \text{physical} = \frac{\text{digital} - \text{baseline}}{\text{gain}} \]

And the inverse formula for writing:

\[ \text{digital} = (\text{physical} \times \text{gain}) + \text{baseline} \]

Reading Data in Different Units

The read_wfdb() and read_signal() functions support a units parameter that controls how signal data is returned:

Let’s demonstrate this concept using a WFDB file:

# Locate the WFDB files in the package
dir_path <- system.file('extdata', package = 'EGM')

# Read the same data in both digital and physical units
ecg_digital <- read_wfdb(
  record = "muse-sinus",
  record_dir = dir_path,
  units = "digital"
)

ecg_physical <- read_wfdb(
  record = "muse-sinus",
  record_dir = dir_path,
  units = "physical"
)

# Examine the header to see the conversion parameters
head(ecg_digital$header)
#> <header_table: 12 channels, 5000 samples @ 500 Hz> muse-sinus
#>         file_name storage_format number ADC_gain ADC_baseline ADC_units
#>            <char>          <int>  <int>    <num>        <int>    <char>
#> 1: muse-sinus.dat             16      1      200            0        mV
#> 2: muse-sinus.dat             16      2      200            0        mV
#> 3: muse-sinus.dat             16      3      200            0        mV
#> 4: muse-sinus.dat             16      4      200            0        mV
#> 5: muse-sinus.dat             16      5      200            0        mV
#> 6: muse-sinus.dat             16      6      200            0        mV
#>    ADC_zero ADC_resolution initial_value checksum blocksize  label   lead
#>       <int>          <int>         <int>    <int>     <int> <fctr> <char>
#> 1:        0             16           -10    -3015         0      I      I
#> 2:        0             16             5    -6357         0     II     II
#> 3:        0             16            15    -3468         0    III    III
#> 4:        0             16            10    -4851         0    AVF    AVF
#> 5:        0             16           -12      269         0    AVL    AVL
#> 6:        0             16             2     4498         0    AVR    AVR
#>    source additional_gain low_pass high_pass   color  scale
#>    <fctr>           <num>    <int>     <int>  <char> <lgcl>
#> 1:    ECG               1       NA        NA #000000     NA
#> 2:    ECG               1       NA        NA #000000     NA
#> 3:    ECG               1       NA        NA #000000     NA
#> 4:    ECG               1       NA        NA #000000     NA
#> 5:    ECG               1       NA        NA #000000     NA
#> 6:    ECG               1       NA        NA #000000     NA
#>   file_name storage_format ADC_gain ADC_baseline ADC_units ADC_resolution
#> 1  muse-sinus.dat      16      200            0        mV             16
#> 2  muse-sinus.dat      16      200            0        mV             16
#> 3  muse-sinus.dat      16      200            0        mV             16
#> 4  muse-sinus.dat      16      200            0        mV             16
#> 5  muse-sinus.dat      16      200            0        mV             16
#> 6  muse-sinus.dat      16      200            0        mV             16

Notice the ADC_gain and ADC_baseline columns in the header. In this example, all channels have a gain of 200 ADC units/mV and a baseline of 0. However, many real-world recordings have non-zero baseline values to account for DC offsets in the recording equipment.

Comparing Digital and Physical Values

Let’s compare the first few samples from lead II in both formats:

# Extract the first 10 samples from lead II
digital_values <- ecg_digital$signal$II[1:10]
physical_values <- ecg_physical$signal$II[1:10]

# Get the conversion parameters for lead II from the header
lead_ii_row <- ecg_digital$header[ecg_digital$header$label == "II", ]
gain <- lead_ii_row$ADC_gain
baseline <- lead_ii_row$ADC_baseline

# Create a comparison table
comparison <- data.frame(
  sample = 1:10,
  digital = digital_values,
  physical = round(physical_values, 3),
  calculated_physical = round((digital_values - baseline) / gain, 3)
)

print(comparison)
#>    sample digital physical calculated_physical
#> 1       1       5    0.025               0.025
#> 2       2       5    0.025               0.025
#> 3       3       5    0.025               0.025
#> 4       4      10    0.050               0.050
#> 5       5      10    0.050               0.050
#> 6       6      10    0.050               0.050
#> 7       7      10    0.050               0.050
#> 8       8      15    0.075               0.075
#> 9       9      20    0.100               0.100
#> 10     10      24    0.120               0.120
#>    Sample Digital Physical Calculated
#> 1       1       5    0.025      0.025
#> 2       2       5    0.025      0.025
#> 3       3       5    0.025      0.025
#> 4       4      10    0.050      0.050
#> 5       5      10    0.050      0.050
#> 6       6      10    0.050      0.050
#> 7       7      10    0.050      0.050
#> 8       8      15    0.075      0.075
#> 9       9      20    0.100      0.100
#> 10     10      24    0.120      0.120

# Verify the conversion formula
cat("\nLead II conversion parameters:\n")
#> 
#> Lead II conversion parameters:
cat("  Gain:", gain, "ADC units/mV\n")
#>   Gain: 200 ADC units/mV
cat("  Baseline:", baseline, "ADC units\n")
#>   Baseline: 0 ADC units
#> Lead II conversion parameters:
#>   Gain: 200 ADC units/mV
#>   Baseline: 0 ADC units

The calculated physical units are derived from the digital units using the ADC conversion, and thus maintains the fidelity of the original signal.

Annotations

The WFDB software package utilizes a binary format to store annotations. Annotations are essentially markers or qualifiers of specific components of a signal, specifying both the specific time or position in the plot, and what channel the annotation refers to. Annotations are polymorphic, and multiple can be applied to a single signal dataset. The credit for this work goes directly to the original software creators, as this is just a wrapper to allow for flexible integration into R. All of the demonstrations below rely on the native {EGM} parsers, so they run identically whether or not external WFDB binaries are present.

To begin, let’s take an example of a simple ECG dataset. This data is included in the package, and can be accessed as below.

fp <- system.file('extdata', 'muse-sinus.xml', package = 'EGM')
ecg <- read_muse(fp)
fig <- ggm(ecg) + theme_egm_light()
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
fig

We may have customized software, manual approaches, or machine learning models that may label signal data. We can use the annotation_table() function to create WFDB-compatible annotations, updating both the egm object and the written file. To trial this, let’s label the peaks of the QRS complex from this 12-lead ECG.

  1. Create a quick, non-robust function for labeling QRS complex peaks. The function pracma::findpeaks() is quite good, but to avoid dependencies we are writing our own.
  2. Evaluate the fit of the peaks to the dataset
  3. Place the annotations into a table, updating the egm object
  4. Plot the results
# Let x = 10-second signal dataset
# We will apply this across the dataset
# This is an oversimplified approach.
find_peaks <- function(x,
                       threshold = 
                         mean(x, na.rm = TRUE) + 2 * sd(x, na.rm = TRUE)
                       ) {
  
  # Ensure signal is "positive" for peak finding algorithm
  x <- abs(x)
  
  # Find the peaks
  peaks <- which(diff(sign(diff(x))) == -2) + 1
  
  # Filter the peaks
  peaks <- peaks[x[peaks] > threshold]
  
  # Return
  peaks
}

# Create a signal dataset
dat <- extract_signal(ecg)

# Find the peaks
sig <- dat[["I"]]
pk_loc <- find_peaks(sig)
pk_val <- sig[pk_loc]
pks <- data.frame(x = pk_loc, y = pk_val)

# Plot them
plot(sig, type = "l")
points(x = pks$x, y = pks$y, col = "orange")

The result is not bad for a simple peak finder, but it lets us generate a small dataset of annotations that can then be used. Please see the additional vignettes for advanced annotation options, such as multichannel plots, and multichannel annotations. We can take a look under the hood at the annotation positions we generated. The relevant arguments, which are displayed below, include:

# Find the peaks
raw_signal <- dat[["I"]]
peak_positions <- find_peaks(raw_signal)
peak_positions
#>  [1]   96  427  759 1091 1753 2085 2417 2750 3080 3412 3744 4076 4409

# Annotations do not need to store the value at that time point however
# The annotation table function has the following arguments
args(annotation_table)
#> function (annotator = character(), time = character(), sample = integer(), 
#>     frequency = integer(), type = character(), subtype = character(), 
#>     channel = integer(), number = integer(), ...) 
#> NULL

# We can fill this in as below using additional data from the original ECG
hea <- ecg$header
start <- attributes(hea)$record_line$start_time
hz <- attributes(hea)$record_line$frequency

ann <- annotation_table(
  annotator = "our_pks",
  sample = peak_positions,
  type = "R",
  frequency = hz,
  channel = "I"
)

# Here are our annotations
ann
#> <annotation_table: 13 `our_pks` annotations>
#>             time sample   type subtype channel number
#>           <char>  <num> <char>  <char>  <char>  <int>
#>  1: 00:00:00.192     96      R               I      0
#>  2: 00:00:00.854    427      R               I      0
#>  3: 00:00:01.518    759      R               I      0
#>  4: 00:00:02.182   1091      R               I      0
#>  5: 00:00:03.506   1753      R               I      0
#>  6:  00:00:04.17   2085      R               I      0
#>  7: 00:00:04.834   2417      R               I      0
#>  8:   00:00:05.5   2750      R               I      0
#>  9:  00:00:06.16   3080      R               I      0
#> 10: 00:00:06.824   3412      R               I      0
#> 11: 00:00:07.488   3744      R               I      0
#> 12: 00:00:08.152   4076      R               I      0
#> 13: 00:00:08.818   4409      R               I      0

# Then, add this back to the original signal
ecg$annotation <- ann
ecg
#> <Electrogram>
#> -------------------
#> Recording Duration:  10 seconds
#> Recording frequency  500  hz
#> Number of channels:  12 
#> Channel Names:  I II III AVF AVL AVR V1 V2 V3 V4 V5 V6 
#> Annotation:  our_pks 
#> Type: Standard 12-lead ECG

mirror server hosted at Truenetwork, Russian Federation.