Get started • edstr

library(edstr)

The edstr package extracts structured variables from clinical free text stored in a clinical data warehouse (EDS, for the French Entrepôt de Données de Santé). The pipeline runs in four steps: configure, import, clean, and extract. A fifth function, edstr_view(), lets you explore matches interactively without saving any files.

Pipeline overview

flowchart LR
  A["edstr_config()"] --> B["edstr_import()"]
  B --> C["edstr_clean()"]
  C --> D["edstr_extract()"]
  C -.-> E["edstr_view()"]

Figure 1: edstr pipeline

Each step (except edstr_view()) saves a cache file in the output directory: edstr_import() and edstr_clean() save Parquet files; edstr_extract() saves an RDS file (the result is a complex nested list). If the file already exists, behaviour depends on the edstr_overwrite option set in edstr_config().

Configuration

edstr_config() sets shared parameters used by all downstream functions. Two arguments are required: the output directory and the file name prefix.

edstr_config(
  edstr_dirname = "output/fesf",
  edstr_filename = "fesf",
  edstr_text = "doc_texte",
  edstr_overwrite = TRUE
)

edstr_dirname and edstr_filename support glue syntax, e.g. "output/{Sys.Date()}".

Caching behaviour

edstr_overwrite = TRUE: overwrite without prompting.

edstr_overwrite = FALSE: load the cached file without prompting.

edstr_overwrite = NULL (default): show an interactive menu (load / overwrite / cancel).
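The three settings can be read as a small decision table. The sketch below is not edstr's implementation: cache_action() is a hypothetical base R helper that only mirrors the documented behaviour, and it assumes that a step is simply run when no cache file exists yet.

```r
# Hypothetical helper (not part of edstr): decide what to do with a cache file.
cache_action <- function(path, overwrite = NULL) {
  if (!file.exists(path)) {
    return("compute")   # assumption: nothing cached yet, so run the step
  }
  if (is.null(overwrite)) {
    return("prompt")    # default: ask the user (load / overwrite / cancel)
  }
  if (isTRUE(overwrite)) "overwrite" else "load"
}

# Simulate an existing cache file.
cached <- tempfile(fileext = ".parquet")
file.create(cached)

cache_action(cached, overwrite = TRUE)
cache_action(cached, overwrite = FALSE)
cache_action(cached)
```

In a non-interactive script, setting edstr_overwrite explicitly to TRUE or FALSE avoids the menu.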

Import

edstr_import() runs a SQL query against the Oracle database and saves the result.

df_import <- edstr_import(
  query = "sql/fesf.sql",
  head = 500,
  user = "my_user"
)

The query argument accepts either a SQL string or a path to a .sql file. head limits the number of rows returned, which is useful during development.
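One way to support both forms of query is to treat the argument as a path only when it points to an existing .sql file. The helper resolve_query() and the table name documents below are hypothetical, used purely for illustration; edstr may resolve the argument differently.

```r
# Hypothetical helper (not part of edstr): accept a SQL string or a .sql path.
resolve_query <- function(query) {
  is_sql_file <- grepl("\\.sql$", query, ignore.case = TRUE) && file.exists(query)
  if (is_sql_file) {
    # Read the file and rebuild a single SQL string.
    paste(readLines(query, warn = FALSE), collapse = "\n")
  } else {
    # Anything else is treated as the SQL text itself.
    query
  }
}

# Illustrative query file written to a temporary location.
sql_file <- tempfile(fileext = ".sql")
writeLines(c("SELECT id_pat, doc_texte", "FROM documents"), sql_file)

resolve_query(sql_file)
resolve_query("SELECT 1 FROM dual")
```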

Cleaning

edstr_clean() applies regex replacements to the text column. Cleaning patterns are defined as a named character vector (or a list of named vectors): names are patterns, values are replacements.

df_clean <- edstr_clean(
  data = df_import,
  replace = c("\\p{Zs}{2,}" = " ", "&amp;" = "&")
)
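To make the names-are-patterns convention concrete, the sketch below reproduces it with base R. apply_replacements() is a hypothetical helper, not edstr's code, and it assumes that the patterns are applied one after another in the order given.

```r
# Hypothetical helper (not part of edstr): apply each named pattern in turn.
apply_replacements <- function(text, replace) {
  for (pattern in names(replace)) {
    # perl = TRUE enables Unicode classes such as \p{Zs} (space separators).
    text <- gsub(pattern, replace[[pattern]], text, perl = TRUE)
  }
  text
}

raw <- c("fracture   du  col &amp; luxation", "RAS")
apply_replacements(raw, c("\\p{Zs}{2,}" = " ", "&amp;" = "&"))
```

Because each replacement sees the output of the previous one, the order of the vector can matter when patterns overlap.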

Reusable cleaning patterns

For complex patterns, define them in a separate R file and load them with source():

df_clean <- edstr_clean(
  data = df_import,
  replace = source(here::here("config/clean.R"))$value
)
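source() returns a list whose value element is the result of the last expression in the file, so the pattern file only needs to end with the named vector. The file contents below are an illustrative assumption, written to a temporary path so that the example runs on its own.

```r
# Write an illustrative pattern file (stand-in for config/clean.R).
pattern_file <- tempfile(fileext = ".R")
writeLines(c(
  '# The last expression is what source()$value returns.',
  'c(',
  '  "\\\\p{Zs}{2,}" = " ",   # collapse runs of spaces',
  '  "&amp;" = "&"            # decode HTML ampersands',
  ')'
), pattern_file)

# Load the named vector exactly as in the edstr_clean() call above.
patterns <- source(pattern_file)$value
patterns
```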

Extraction

edstr_extract() is the core of the pipeline. Starting from the cleaned text, it tokenises the documents, matches concepts with regular expressions, applies false-positive exclusions, and saves the results as XLSX, JSON, and RDS files.

Concepts

Concepts are named regex patterns defining the clinical entities to search for. A simple vector creates independent concepts; a named list creates sub-concepts.

df_extract <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture", femur = "f(e|é)mur|fesf"),
  token = c(1, 2),
  group = "id_pat"
)

Collapse and intersect

Two modes alter extraction behaviour when multiple concepts are defined:

collapse = TRUE: combines all patterns into a single regex (logical OR).
intersect = TRUE: keeps only documents matching all root concepts.

df_extract <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture", femur = "f(e|é)mur|fesf"),
  intersect = TRUE,
  group = "id_pat"
)
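The difference between the two modes can be illustrated on a few toy documents with base R. This is a sketch of the logic only, not of edstr's internals; the documents are invented and matching is assumed to be case-insensitive.

```r
texts <- c(
  doc1 = "fracture du col du fémur",
  doc2 = "fracture du poignet",
  doc3 = "douleur du femur gauche",
  doc4 = "bilan sanguin normal"
)
concepts <- c(fracture = "fracture", femur = "f(e|é)mur|fesf")

# One logical column per concept, one row per document.
hits <- vapply(
  concepts,
  function(p) grepl(p, texts, ignore.case = TRUE),
  logical(length(texts))
)
rownames(hits) <- names(texts)

# collapse-like behaviour: a single OR-ed regex, any concept is enough.
collapsed <- paste0("(", concepts, ")", collapse = "|")
any_match <- grepl(collapsed, texts, ignore.case = TRUE)

# intersect-like behaviour: every concept must match the document.
all_match <- apply(hits, 1, all)

names(texts)[any_match]
names(texts)[all_match]
```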

Exclusions

False positives can be filtered in two ways:

Manual: a regex pattern via exclus_manual.
Automatic: heuristics on long tokens (controlled by exclus_auto_token_min).

df_extract <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture"),
  exclus_manual = "fracture ouverte",
  exclus_auto_token_min = 10,
  group = "id_pat"
)
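As a rough illustration of a manual exclusion, the sketch below drops candidate matches that contain the exclusion pattern. It uses invented strings and assumes the pattern is tested against the matched text; how edstr applies exclus_manual internally, and how the automatic heuristics work, is not shown here.

```r
# Candidate matches (invented examples).
matches <- c(
  "fracture ouverte du tibia",
  "fracture du col",
  "fracture du poignet"
)
exclus_manual <- "fracture ouverte"

# Keep only the matches that do not hit the exclusion pattern.
kept <- matches[!grepl(exclus_manual, matches, ignore.case = TRUE)]
excluded <- setdiff(matches, kept)

kept
excluded
```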

Interactive exploration

edstr_view() lets you test regex patterns without running a full extraction. It saves nothing to disk and is designed for fast iteration.

df_view <- edstr_view(
  data = df_clean,
  pattern = "fractur",
  ngrams = 3
)

ngrams controls how many tokens are captured starting at the initial match: with ngrams = 3, pattern = "fractur" captures e.g. "fracture col femoral".
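The windowing idea can be sketched in base R. match_ngrams() is a hypothetical helper, not the edstr_view() implementation; it assumes whitespace tokenisation and a window that includes the matched token, as in the "fracture col femoral" example.

```r
# Hypothetical helper (not part of edstr): token windows starting at each match.
match_ngrams <- function(text, pattern, ngrams = 3) {
  tokens <- strsplit(text, "\\s+")[[1]]
  starts <- grep(pattern, tokens, ignore.case = TRUE)
  vapply(starts, function(i) {
    # Stop at the end of the text if fewer than `ngrams` tokens remain.
    end <- min(i + ngrams - 1, length(tokens))
    paste(tokens[i:end], collapse = " ")
  }, character(1))
}

match_ngrams(
  "patient admis pour fracture col femoral droit",
  pattern = "fractur",
  ngrams = 3
)
```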

Output summary

edstr_extract() returns a nested list and saves three files:

File	Contents
.xlsx	Excel workbook with one sheet per result type (extraction, counts, exclusions, mismatch, parameters)
.json	Full list with all intermediate objects (JSON format)
.rds	Full list with all intermediate objects (R-native, used for caching)

The .rds file is the one automatically reloaded by the caching system when it already exists.
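RDS suits this kind of result because it preserves arbitrary nested R objects across a save and reload. The sketch below shows the round trip with base R on a toy list; the element names borrow from the sheet names above but are assumptions, not the actual structure returned by edstr_extract().

```r
# Toy nested result (structure is illustrative only).
result <- list(
  extraction = data.frame(id_pat = c(1L, 2L), match = c("fracture", "fesf")),
  counts     = c(fracture = 1L, fesf = 1L),
  parameters = list(token = c(1, 2), group = "id_pat")
)

# Save to a temporary .rds file, then reload it.
rds_path <- tempfile(fileext = ".rds")
saveRDS(result, rds_path)
reloaded <- readRDS(rds_path)

# Inspect the top level of the reloaded list.
str(reloaded, max.level = 1)
```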