Getting Started with clinTrialData • clinTrialData

Introduction

clinTrialData is a community-grown library of clinical trial example datasets for R. The package ships with a core set of studies and is designed to expand over time — anyone can contribute a new data source, and users can download any available study on demand without waiting for a new package release.

Data is stored in Parquet format and accessed through the connector package, giving a consistent API regardless of which study you are working with.

Key features:

Growing library: New datasets are added by the community as GitHub Release assets — no CRAN resubmission needed
On-demand download: Use download_study() to fetch any available study and cache it locally
Generic interface: Use connect_clinical_data() to connect to any available data source
Automatic discovery: list_data_sources() finds all studies on your machine; list_available_studies() shows everything available to download
Data protection: Downloaded and bundled datasets are locked against accidental modification

Installation

# Install from CRAN
install.packages("clinTrialData")

# Or the development version from GitHub:
# install.packages("remotes")
remotes::install_github("Lovemore-Gakava/clinTrialData")

Available Data Sources

library(clinTrialData)

# Studies on your machine (bundled + previously downloaded)
list_data_sources()
#>        source
#> 1 cdisc_pilot
#>                                                                                       description
#> 1 CDISC Pilot 01 Study — standard ADaM and SDTM datasets widely used for training and prototyping
#>      domains  format location
#> 1 adam, sdtm parquet  bundled

Quick Start

Connect to a Data Source

The package bundles the CDISC Pilot 01 study, so you can connect immediately:

# Connect to CDISC Pilot data
db <- connect_clinical_data("cdisc_pilot")
#> ℹ Replace some metadata informations...
#> ────────────────────────────────────────────────────────────────────────────────
#> Connection to:
#> → adam
#> • connector_fs
#> • /home/runner/work/_temp/Library/clinTrialData/exampledata/cdisc_pilot/adam
#> ────────────────────────────────────────────────────────────────────────────────
#> Connection to:
#> → sdtm
#> • connector_fs
#> • /home/runner/work/_temp/Library/clinTrialData/exampledata/cdisc_pilot/sdtm

# List available datasets in the ADaM domain
db$adam$list_content_cnt()
#>  [1] "adae.parquet"     "adlbc.parquet"    "adlbh.parquet"    "adlbhy.parquet"  
#>  [5] "adqsadas.parquet" "adqscibc.parquet" "adqsnpix.parquet" "adsl.parquet"    
#>  [9] "adtte.parquet"    "advs.parquet"

# Read the subject-level dataset
adsl <- db$adam$read_cnt("adsl")
#> → Found one file: /home/runner/work/_temp/Library/clinTrialData/exampledata/cdisc_pilot/adam/adsl.parquet
head(adsl[, c("USUBJID", "TRT01A", "AGE", "SEX", "RACE")])
#> # A tibble: 6 × 5
#>   USUBJID     TRT01A                 AGE SEX   RACE 
#>   <chr>       <chr>                <dbl> <chr> <chr>
#> 1 01-701-1015 Placebo                 63 F     WHITE
#> 2 01-701-1023 Placebo                 64 M     WHITE
#> 3 01-701-1028 Xanomeline High Dose    71 M     WHITE
#> 4 01-701-1033 Xanomeline Low Dose     74 M     WHITE
#> 5 01-701-1034 Xanomeline High Dose    77 F     WHITE
#> 6 01-701-1047 Placebo                 85 F     WHITE

Discover and Download Additional Studies

Studies beyond the bundled data can be downloaded from GitHub Releases:

# What's available to download?
list_available_studies()

# Download a study once — cached locally from then on
download_study("cdisc_pilot_extended")

# Where is the cache?
cache_dir()

Explore the Data

# Dimensions
dim(adsl)
#> [1] 254  48

# Quick structure overview
str(adsl, list.len = 10)
#> tibble [254 × 48] (S3: tbl_df/tbl/data.frame)
#>  $ STUDYID : chr [1:254] "CDISCPILOT01" "CDISCPILOT01" "CDISCPILOT01" "CDISCPILOT01" ...
#>   ..- attr(*, "label")= chr "Study Identifier"
#>  $ USUBJID : chr [1:254] "01-701-1015" "01-701-1023" "01-701-1028" "01-701-1033" ...
#>   ..- attr(*, "label")= chr "Unique Subject Identifier"
#>  $ SUBJID  : chr [1:254] "1015" "1023" "1028" "1033" ...
#>   ..- attr(*, "label")= chr "Subject Identifier for the Study"
#>  $ SITEID  : chr [1:254] "701" "701" "701" "701" ...
#>   ..- attr(*, "label")= chr "Study Site Identifier"
#>  $ SITEGR1 : chr [1:254] "701" "701" "701" "701" ...
#>   ..- attr(*, "label")= chr "Pooled Site Group 1"
#>  $ ARM     : chr [1:254] "Placebo" "Placebo" "Xanomeline High Dose" "Xanomeline Low Dose" ...
#>   ..- attr(*, "label")= chr "Description of Planned Arm"
#>  $ TRT01P  : chr [1:254] "Placebo" "Placebo" "Xanomeline High Dose" "Xanomeline Low Dose" ...
#>   ..- attr(*, "label")= chr "Planned Treatment for Period 01"
#>  $ TRT01PN : num [1:254] 0 0 81 54 81 0 54 54 54 0 ...
#>   ..- attr(*, "label")= chr "Planned Treatment for Period 01 (N)"
#>  $ TRT01A  : chr [1:254] "Placebo" "Placebo" "Xanomeline High Dose" "Xanomeline Low Dose" ...
#>   ..- attr(*, "label")= chr "Actual Treatment for Period 01"
#>  $ TRT01AN : num [1:254] 0 0 81 54 81 0 54 54 54 0 ...
#>   ..- attr(*, "label")= chr "Actual Treatment for Period 01 (N)"
#>   [list output truncated]

Working with Different Domains

ADaM Datasets

# Read adverse events data
adae <- db$adam$read_cnt("adae")
#> → Found one file: /home/runner/work/_temp/Library/clinTrialData/exampledata/cdisc_pilot/adam/adae.parquet
head(adae[, c("USUBJID", "AEDECOD", "AESEV", "AESER")])
#> # A tibble: 6 × 4
#>   USUBJID     AEDECOD                              AESEV    AESER
#>   <chr>       <chr>                                <chr>    <chr>
#> 1 01-701-1015 APPLICATION SITE ERYTHEMA            MILD     N    
#> 2 01-701-1015 APPLICATION SITE PRURITUS            MILD     N    
#> 3 01-701-1015 DIARRHOEA                            MILD     N    
#> 4 01-701-1023 ERYTHEMA                             MILD     N    
#> 5 01-701-1023 ERYTHEMA                             MODERATE N    
#> 6 01-701-1023 ATRIOVENTRICULAR BLOCK SECOND DEGREE MILD     N

SDTM Datasets

# Read demographics
dm <- db$sdtm$read_cnt("dm")
#> → Found one file: /home/runner/work/_temp/Library/clinTrialData/exampledata/cdisc_pilot/sdtm/dm.parquet
head(dm[, c("USUBJID", "ARM", "AGE", "SEX", "RACE")])
#> # A tibble: 6 × 5
#>   USUBJID     ARM                    AGE SEX   RACE 
#>   <chr>       <chr>                <dbl> <chr> <chr>
#> 1 01-701-1015 Placebo                 63 F     WHITE
#> 2 01-701-1023 Placebo                 64 M     WHITE
#> 3 01-701-1028 Xanomeline High Dose    71 M     WHITE
#> 4 01-701-1033 Xanomeline Low Dose     74 M     WHITE
#> 5 01-701-1034 Xanomeline High Dose    77 F     WHITE
#> 6 01-701-1047 Placebo                 85 F     WHITE

Example Analysis

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Basic demographic summary by treatment
adsl |>
  group_by(TRT01A) |>
  summarise(
    n = n(),
    mean_age = mean(AGE, na.rm = TRUE),
    female_pct = mean(SEX == "F", na.rm = TRUE) * 100,
    .groups = "drop"
  )
#> # A tibble: 3 × 4
#>   TRT01A                   n mean_age female_pct
#>   <chr>                <int>    <dbl>      <dbl>
#> 1 Placebo                 86     75.2       61.6
#> 2 Xanomeline High Dose    84     74.4       47.6
#> 3 Xanomeline Low Dose     84     75.7       59.5

Contributing New Data Sources

Anyone can add a new study to the library. Datasets live on GitHub Releases, not inside the package — so no pull request or CRAN submission is needed to add data.

Step 1: Prepare your data

Organize your Parquet files by domain:

your_new_study/
├── adam/
│   ├── adsl.parquet
│   └── adae.parquet
└── sdtm/
    ├── dm.parquet
    └── ae.parquet

Step 2: Upload data and metadata to a GitHub Release

Open an issue to request a release slot, then use the helper script:

source("data-raw/upload_to_release.R")

# Upload the data zip
upload_study_to_release("your_new_study", tag = "v1.1.0")

# Generate and upload metadata (enables dataset_info() for your study)
generate_and_upload_metadata(
  source      = "your_new_study",
  description = "Brief description of your study",
  version     = "v1.1.0",
  license     = "Your license here",
  source_url  = "https://link-to-original-data",
  tag         = "v1.1.0"
)

Step 3: Users can inspect and access it immediately

dataset_info("your_new_study")       # inspect before downloading
download_study("your_new_study")     # download and cache
connect_clinical_data("your_new_study")

No CRAN submission required. The study is available to all users as soon as it is uploaded.