---
title: "Introduction to rtransparency"
author: Stylianos Serghiou
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
description: >
  How to use the rtransparency package to identify and extract indicators of
  transparency from published biomedical articles, and how the detection works.
vignette: >
  %\VignetteIndexEntry{Introduction to rtransparency}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)
```

```{r setup}
library(rtransparency)
```

# Overview

`rtransparency` identifies and extracts **indicators of transparency** from the
full text of published biomedical articles. It works on two inputs: plain TXT
files (typically converted from PDFs) and PMC XML files (the JATS XML served by
PubMed Central). For each indicator it returns whether the indicator was found
and, when found, the sentence or statement that triggered the detection.

| Indicator | What it captures | TXT | PMC XML |
|---|---|:--:|:--:|
| Conflicts of interest | A COI / competing-interests disclosure | `rt_coi` | `rt_coi_pmc` |
| Funding | A funding / financial-support statement | `rt_fund` | `rt_fund_pmc` |
| Protocol registration | Registration on a trial / review registry | `rt_register` | `rt_register_pmc` |
| Novelty | Claims of novelty ("for the first time") | `rt_novelty` | `rt_novelty_pmc` |
| Replication | Replication / independent-validation components | `rt_replication` | `rt_replication_pmc` |
| Data sharing | Data deposited or made openly available | `rt_data_code` | `rt_data_code_pmc` |
| Code sharing | Source code / scripts made available | `rt_data_code` | `rt_data_code_pmc` |
| AI-use disclosure | A statement that generative AI was (or was not) used to prepare the manuscript | `rt_ai` | `rt_ai_pmc` |

`rt_all_pmc` runs all eight detectors together in a single pass: COI, funding,
registration, novelty, replication, data sharing, code sharing and AI-use
disclosure. (`rt_all` covers the first five from TXT; data, code and AI also have
standalone TXT detectors, `rt_data_code` and `rt_ai`, but are not part of the
`rt_all` wrapper.)

AI-use disclosure is the newest indicator. Journals have asked authors to
disclose any use of generative AI (ChatGPT and similar) in preparing a
manuscript only since 2023, so `rt_ai_pmc` evaluates the indicator only for
articles published in 2023 or later and returns `NA` for earlier ones.

The package and its validation are described in Serghiou et al., *Assessment of
transparency indicators across the biomedical literature: How open is open?*
(PLOS Biology, 2021, doi:10.1371/journal.pbio.3001107).

# How detection works

## Article parsing

PMC XML is parsed with `xml2` in three steps. First, the XML root is
standardized to the `<article>` node; the package accepts the OAI-PMH, EFetch
`<pmc-articleset>` and bare `<article>` shapes. Second, the namespace is
optionally stripped (`remove_ns = TRUE`). Third, the text is split into the
sections where each indicator usually appears: acknowledgments, footnotes /
author notes, the body, the methods, the abstract and supplementary material.
TXT files are read whole and split into paragraphs.
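
The parsing steps above can be sketched with `xml2` on a toy document. This is a
minimal illustration under stated assumptions, not the package's internal code:
the XML string, the namespace URI and the `section_text()` helper are invented
for this example, and the XPath expressions are plausible JATS selectors rather
than the ones the package necessarily uses.

```{r}
# Toy EFetch-style document: a <pmc-articleset> wrapping one namespaced
# <article>. The content and the namespace URI are invented for illustration.
toy_doc <- xml2::read_xml(paste0(
  "<pmc-articleset>",
  "<article xmlns=\"http://example.org/toy-ns\">",
  "<front><article-meta>",
  "<abstract><p>Toy abstract.</p></abstract>",
  "</article-meta></front>",
  "<body><sec sec-type=\"methods\"><p>Toy methods.</p></sec></body>",
  "<back><ack><p>We thank our colleagues.</p></ack></back>",
  "</article>",
  "</pmc-articleset>"
))

# Strip the default namespace so that plain XPath such as ".//article" matches
# (the kind of step that `remove_ns = TRUE` asks for).
xml2::xml_ns_strip(toy_doc)

# Standardize the root: keep the document if its root is already <article>,
# otherwise take the first <article> descendant.
article_node <- if (xml2::xml_name(toy_doc) == "article") {
  toy_doc
} else {
  xml2::xml_find_first(toy_doc, ".//article")
}

# Hypothetical helper: concatenate the text of every node matching an XPath.
section_text <- function(node, xpath) {
  paste(xml2::xml_text(xml2::xml_find_all(node, xpath)), collapse = " ")
}

toy_sections <- c(
  abstract        = section_text(article_node, ".//abstract"),
  methods         = section_text(article_node, ".//sec[@sec-type='methods']"),
  acknowledgments = section_text(article_node, ".//ack")
)
toy_sections
```

Splitting by section first means each detector only has to scan the part of the
article where its statement is expected.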

## Rule-based detection

Detection is **rule-based and interpretable**: each indicator is a curated set
of regular expressions applied to the relevant sections, rather than a machine
learning model. This keeps the output auditable (the matched statement is
returned) and reproducible.

* **Conflicts of interest.** Detected from structured COI footnotes
  (`fn-type = "conflict"`), from section titles ("Conflicts of interest",
  "Competing interests", "Declaration of interest", "Duality of interest"), and
  from a set of text patterns covering financial relationships, consulting,
  fees, board membership, patents and explicit "no competing interests"
  declarations. Honoraria-to-subjects and reference text are masked to reduce
  false positives.
* **Funding.** Detected from the XML `<funding-group>` element, from funding
  section titles, and from text patterns such as "supported by", "funded by",
  "grant from / number", named funders and award types. Acknowledged funding is
  required to use explicit funding language (a funding verb tied to a funder),
  so a bare mention of an institution or the word "support" is not enough.
  No-funding declarations are excluded.
* **Protocol registration.** Detected from registry identifiers
  (ClinicalTrials.gov `NCT`, PROSPERO `CRD`, ISRCTN, ANZCTR `ACTRN`, DRKS, IRCT,
  UMIN, ChiCTR) and from registration phrasing in the methods or footnotes.
* **Novelty and replication.** Detected from claim patterns such as "for the
  first time", "to our knowledge", "novel finding" (novelty) and "replicate",
  "independently validated", "confirmatory cohort" (replication), with negation
  filters ("failed to replicate").
* **Data and code sharing.** Detected by a native detector (`.detect_data_code`)
  built from public repository facts and curated benchmark statements:
  field-specific accession schemes (GEO `GSE`, SRA / BioProject `PRJNA`, PDB,
  ArrayExpress, dbGaP, ProteomeXchange, Dryad / Zenodo / figshare DOIs, ...),
  repository URLs and names, deposit / availability /
  data-availability-statement language, and supplement and file-format signals.
  Crucially, it distinguishes **sharing** ("data were deposited in GEO") from
  **reuse** ("data were downloaded from GEO") and excludes "available on
  request". Code repositories (GitHub, GitLab, Bitbucket) count as data only
  when paired with a data noun, so a code-only GitHub link is not mistaken for
  data sharing.
* **AI-use disclosure.** Detected from a "Declaration of generative AI" type
  section title, and from text that names a generative-AI tool (ChatGPT, GPT-4,
  Copilot, Gemini, an LLM, ...) in a manuscript-preparation context ("used
  ChatGPT to improve the readability") or in an explicit negation ("no
  generative AI was used"). A negative lookahead keeps "large language model",
  which names a tool, from matching the pattern for the text being written or
  edited, and AI used purely as a research method (not for writing) is not
  counted. The indicator is evaluated only for articles published in 2023 or
  later.
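
The rule-based approach can be illustrated with a few deliberately simplified
patterns. Everything in this sketch is hypothetical: the regular expressions,
the variable names and the toy sentences (including the accession `GSE00000`)
are invented for illustration and are far cruder than the package's curated
rules.

```{r}
# Hypothetical patterns, NOT the package's rules.
# A sentence is a data-sharing candidate if it names a repository or says
# that something is "available" ...
toy_candidate_rx <- "\\b(GEO|Dryad|Zenodo|figshare)\\b|\\bavailable\\b"
# ... unless it describes reuse or availability on request.
toy_exclude_rx <- paste0(
  "\\b(downloaded|obtained|retrieved) from\\b",
  "|\\b(on|upon) (reasonable )?request\\b"
)
# Registry identifiers: NCT plus 8 digits, PROSPERO CRD numbers,
# ISRCTN plus 8 digits, ACTRN plus 14 digits.
toy_registry_rx <- "\\b(NCT[0-9]{8}|CRD[0-9]{8,}|ISRCTN[0-9]{8}|ACTRN[0-9]{14})\\b"

toy_sentences <- c(
  "RNA-seq data were deposited in GEO under accession GSE00000.",
  "Expression data were downloaded from GEO.",
  "Data are available on request from the corresponding author.",
  "The trial was registered at ClinicalTrials.gov (NCT01234567)."
)

# Sharing = candidate sentence that is not excluded.
toy_is_sharing <- grepl(toy_candidate_rx, toy_sentences, perl = TRUE) &
  !grepl(toy_exclude_rx, toy_sentences, perl = TRUE)

# Return the matched statement, not only a flag, so the call can be audited.
toy_sentences[toy_is_sharing]

# Extract any registry identifier found in the sentences.
toy_registry_ids <- regmatches(
  toy_sentences, regexpr(toy_registry_rx, toy_sentences, perl = TRUE)
)
toy_registry_ids
```

Pairing each positive pattern with an exclusion pattern is what lets a rule set
of this kind separate "deposited in GEO" from "downloaded from GEO".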

## Languages

Conflict-of-interest and funding statements are detected not only in English but
also in **Spanish, Portuguese, French, German and Italian**, using
language-distinctive patterns matched on transliterated (accent-stripped) text.
The German conflict-of-interest detection rate, for example, rose from 33% to
97% once these were added. The other indicators are English-only for now.
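
As a rough sketch of matching on accent-stripped text, the chunk below
lower-cases a sentence, removes common accents with base R's `chartr()` and
then applies one pattern per language. The `strip_accents()` helper, the
patterns and the example sentences are assumptions made for illustration; the
package's own transliteration and patterns may differ.

```{r}
# Hypothetical helper: lower-case and replace common accented letters.
strip_accents <- function(x) {
  chartr(
    "áàâãäéèêëíìîïóòôõöúùûüçñ",
    "aaaaaeeeeiiiiooooouuuucn",
    tolower(x)
  )
}

# Hypothetical language-distinctive COI patterns, written without accents
# because they are matched against the stripped text.
toy_coi_rx <- c(
  es = "conflictos? de interes(es)?",
  pt = "conflitos? de interesses?",
  fr = "conflits? d'interets?",
  de = "interessenkonflikt",
  it = "conflitt[oi] di interess[ei]"
)

toy_statements <- c(
  "Los autores declaran no tener ningún conflicto de intereses.",
  "Les auteurs déclarent n'avoir aucun conflit d'intérêts.",
  "Die Autoren erklären, dass kein Interessenkonflikt besteht.",
  "The samples were stored at room temperature."
)

toy_coi_detected <- grepl(
  paste(toy_coi_rx, collapse = "|"), strip_accents(toy_statements)
)
data.frame(statement = toy_statements, detected = toy_coi_detected)
```

Matching on stripped text means one pattern covers both the accented and the
unaccented spelling, which matters for text extracted from PDFs.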

# Usage: PMC XML

The package ships an example PMC XML file. We use it below; replace the path
with your own file to analyze a different article.

```{r}
xml_path <- system.file(
  "extdata", "PMID32171256-PMC7071725.xml", package = "rtransparency"
)
```

## All indicators at once

`rt_all_pmc` returns all eight indicators in one call, together with the matched
statement text, the publication `year` and article metadata.

```{r}
all_indicators <- rt_all_pmc(xml_path, remove_ns = TRUE)

dplyr::glimpse(
  all_indicators[, c("pmid", "year", "is_coi_pred", "is_fund_pred",
                     "is_register_pred", "is_novelty_pred", "is_replication_pred",
                     "is_open_data", "is_open_code", "is_ai_pred")]
)
```

`is_ai_pred` is `NA` here because this example article predates 2023; for a 2023
or later article it would be `TRUE` or `FALSE`.

## Individual indicators

```{r}
coi <- rt_coi_pmc(xml_path, remove_ns = TRUE)
c(is_coi = coi$is_coi_pred, text = substr(coi$coi_text, 1, 120))
```

```{r}
fund <- rt_fund_pmc(xml_path, remove_ns = TRUE)
c(is_fund = fund$is_fund_pred, text = substr(fund$fund_text, 1, 120))
```

```{r}
register <- rt_register_pmc(xml_path, remove_ns = TRUE)
register$is_register_pred
```

## Data and code sharing

`rt_all_pmc` already reports `is_open_data` and `is_open_code`; `rt_data_code_pmc`
is the focused view that also returns the matched statements. Detection is native
and needs no external packages.

```{r}
data_code <- rt_data_code_pmc(xml_path, remove_ns = TRUE)

dplyr::glimpse(
  data_code[, c("is_open_data", "open_data_statements",
                "is_open_code", "open_code_statements")]
)
```

`rt_all_pmc` and `rt_data_code_pmc` also return `open_data_links` and
`open_code_links`: the repository and accession URLs extracted from the
statements, ready to pass to FAIR-assessment tooling such as
[`rfair`](https://github.com/choxos/rfair). Article metadata (title, journal,
identifiers, dates) is available separately via `rt_meta_pmc`.

```{r}
meta <- rt_meta_pmc(xml_path, remove_ns = TRUE)
dplyr::glimpse(meta[, c("pmid", "doi")])
```

## AI-use disclosure

`rt_ai_pmc` reports the publication `year`, the year-gated prediction
`is_ai_pred` (`NA` before 2023) and the matched text. The `ai-disclosure`
vignette covers this indicator in depth.

```{r}
ai <- rt_ai_pmc(xml_path, remove_ns = TRUE)
c(year = ai$year, is_ai = ai$is_ai_pred)
```

# Usage: TXT files

To analyze a PDF, first convert it to TXT with `rt_read_pdf` (this needs the
poppler `pdftotext` utility installed), then run the TXT detectors. The chunk
below is illustrative and is not executed when the vignette is built.

```{r, eval = FALSE}
pdf_path <- system.file(
  "extdata", "PMID32171256-PMC7071725.pdf", package = "rtransparency"
)
article <- rt_read_pdf(pdf_path)
writeLines(article, "article.txt")

rt_coi("article.txt")
rt_fund("article.txt")
rt_register("article.txt")
rt_data_code("article.txt")
rt_ai("article.txt")    # generative-AI-use disclosure
rt_all("article.txt")   # COI, funding, registration, novelty, replication
```

`rt_ai` is the plain-text counterpart of `rt_ai_pmc`. A text file carries no
reliable publication date, so `rt_ai` applies no 2023 year gate (`is_ai_pred` is
always `TRUE` or `FALSE`, never `NA`) and cannot confine the scan to back-matter
sections the way the XML detector does. Restrict it to articles published in 2023
or later, and expect a slightly higher false-positive rate on papers that use AI
as a research method.
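
If the publication year is known from another source, the 2023 gate can be
applied after the fact. The sketch below assumes a data frame with one row per
TXT file; the file names, the `year` column and the `is_ai_gated` column are
hypothetical, and `is_ai_pred` stands in for what `rt_ai` would report.

```{r}
toy_ai <- data.frame(
  file       = c("a.txt", "b.txt", "c.txt"),
  year       = c(2021, 2023, 2024),    # assumed to come from your own metadata
  is_ai_pred = c(TRUE, FALSE, TRUE)    # ungated, as a TXT detector reports it
)

# Keep the prediction for 2023 onward and set earlier articles to NA,
# mirroring the behavior described for rt_ai_pmc.
toy_ai$is_ai_gated <- ifelse(toy_ai$year >= 2023, toy_ai$is_ai_pred, NA)
toy_ai
```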

# Processing many articles

`rt_all_pmc_dir()` runs all eight indicators over an entire directory (or a
vector of file paths) in one call and is designed for corpus-scale analysis.

```{r, eval = FALSE}
# Sequential, in memory
res <- rt_all_pmc_dir("path/to/xml", remove_ns = TRUE)

# Resumable and parallel: results are written to a CSV in chunks, a re-run skips
# files already recorded, and a malformed file yields an is_success = FALSE row
# instead of aborting the run.
future::plan("multisession")
res <- rt_all_pmc_dir(
  "path/to/xml", remove_ns = TRUE, output = "results.csv", parallel = TRUE
)
```

# Summarizing a corpus

With one row per article, `rt_summary()` reports per-indicator prevalence with a
Wilson confidence interval and a sensitivity/specificity-corrected (Rogan-Gladen)
prevalence; `rt_score()` adds a per-article count of openness practices; and
`rt_plot()` draws prevalence bars and yearly trends. The
`transparency-summary` vignette covers this in depth.

```{r}
data(rt_demo)            # a small simulated example shipped with the package
rt_summary(rt_demo)[, c("indicator", "percent", "adj_percent")]
```
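
For readers who want to see the arithmetic, the two statistics named above can
be written out with their standard textbook formulas. This is a sketch, not the
code behind `rt_summary()`: the helper names `wilson_ci()` and `rogan_gladen()`
are hypothetical, and the sensitivity and specificity passed in are example
inputs only.

```{r}
# Wilson score interval for x positives out of n articles.
wilson_ci <- function(x, n, conf_level = 0.95) {
  z <- stats::qnorm(1 - (1 - conf_level) / 2)
  p <- x / n
  denom  <- 1 + z^2 / n
  center <- (p + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(lower = center - half, upper = center + half)
}

# Rogan-Gladen estimator: correct an apparent prevalence for the detector's
# sensitivity and specificity, then clip the result to [0, 1].
rogan_gladen <- function(p_apparent, sensitivity, specificity) {
  adjusted <- (p_apparent + specificity - 1) / (sensitivity + specificity - 1)
  min(max(adjusted, 0), 1)
}

# Example: 40 of 100 articles flagged by a detector with 88% sensitivity
# and 99% specificity.
wilson_ci(40, 100)
rogan_gladen(0.40, sensitivity = 0.88, specificity = 0.99)
```

The correction raises the estimate when sensitivity is the weaker property
(missed positives outnumber false alarms) and lowers it in the opposite case.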

# Downloading PMC XML

PMC full-text XML can be downloaded by PMCID. `rtransparency` does not provide
a download function, but the `europepmc` (on CRAN) and `metareadr` packages
both work well; the following is illustrative.

```{r, eval = FALSE}
pmc_xml <- europepmc::epmc_ftxt("PMC7071725")   # returns the XML document
metareadr::mt_read_pmcoa("7071725", "article.xml")
```

# Validation

The detectors were benchmarked against the human-labeled XML benchmark of
Serghiou et al. (2021). The current package reaches roughly 97% accuracy for
COI, 97% for funding and 98% for protocol registration. The native data/code
detector reaches 88% sensitivity and 99% specificity for code, and 77%
sensitivity and 99% specificity for data (see `inst/benchmark/` and
`data-raw/benchmark/` in the source repository for the reproducible benchmark).
These data/code values are reproducible benchmark and regression estimates;
they are not estimates from an untouched external validation set.
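
As a reminder of how such figures are derived, the sketch below computes
sensitivity, specificity and accuracy from human labels and detector output.
The labels are invented toy values and `benchmark_metrics()` is a hypothetical
helper, not part of the package or of its benchmark scripts.

```{r}
# Toy labels: `toy_truth` is the human annotation, `toy_pred` the detector call.
toy_truth <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
toy_pred  <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)

benchmark_metrics <- function(truth, pred) {
  tp <- sum(pred & truth)     # true positives
  fn <- sum(!pred & truth)    # false negatives
  tn <- sum(!pred & !truth)   # true negatives
  fp <- sum(pred & !truth)    # false positives
  c(
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / length(truth)
  )
}

benchmark_metrics(toy_truth, toy_pred)
```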

# Naming convention and dependencies

Functions that operate on TXT files do not end in `_pmc`; functions that operate
on PMC XML end in `_pmc`. Data and code detection is implemented natively and no
longer requires the `oddpub` or `tokenizers` packages.
