The edstr package extracts structured variables from clinical free text stored in a Clinical Data Warehouse (EDS). The pipeline runs in four steps: configure, import, clean, extract. A fifth function, edstr_view(), lets you explore matches interactively without saving any files.
Pipeline overview
flowchart LR
  A["edstr_config()"] --> B["edstr_import()"]
  B --> C["edstr_clean()"]
  C --> D["edstr_extract()"]
  C -.-> E["edstr_view()"]
Each step (except edstr_view()) saves a cache file in the output directory: edstr_import() and edstr_clean() save Parquet files; edstr_extract() saves an RDS file (the result is a complex nested list). If the file already exists, behaviour depends on the edstr_overwrite option set in edstr_config().
Configuration
edstr_config() sets shared parameters used by all downstream functions. Two arguments are required: the output directory and the file name prefix.
edstr_config(
  edstr_dirname = "output/fesf",
  edstr_filename = "fesf",
  edstr_text = "doc_texte",
  edstr_overwrite = TRUE
)
edstr_dirname and edstr_filename support glue syntax, e.g. "output/{Sys.Date()}".
Caching behaviour
- edstr_overwrite = TRUE: overwrite without prompting.
- edstr_overwrite = FALSE: load the cached file without prompting.
- edstr_overwrite = NULL (default): prompt with an interactive menu (load / overwrite / cancel).
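The three settings can be read as a small decision table. The sketch below is not edstr's code: cache_action() is a hypothetical helper written only to make the rule explicit, and it assumes that a step simply runs when no cache file exists yet.

```r
# Hypothetical helper (not part of edstr): decide what to do with a cache file.
cache_action <- function(overwrite, file_exists) {
  if (!file_exists) return("compute")       # assumption: no cache yet, run the step
  if (is.null(overwrite)) return("prompt")  # NULL: ask the user (load / overwrite / cancel)
  if (isTRUE(overwrite)) "overwrite" else "load"
}

cache_action(TRUE,  file_exists = TRUE)   # "overwrite"
cache_action(FALSE, file_exists = TRUE)   # "load"
cache_action(NULL,  file_exists = TRUE)   # "prompt"
```

Setting the option explicitly to TRUE or FALSE is what makes a script safe to run non-interactively, since only NULL leads to a prompt.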
Import
edstr_import() runs a SQL query against the Oracle database and saves the result.
df_import <- edstr_import(
  query = "sql/fesf.sql",
  head = 500,
  user = "my_user"
)
The query argument accepts either a SQL string or a path to a .sql file. head limits the number of returned rows, which is useful during development.
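How edstr_import() tells a SQL string from a file path is not described here. One plausible approach, shown purely as an assumption, is to treat the argument as a path when it ends in .sql and the file exists. resolve_query() and the documents table are hypothetical names used for illustration.

```r
# Hypothetical helper (not part of edstr): return SQL text from a string or a .sql path.
resolve_query <- function(query) {
  is_sql_file <- grepl("\\.sql$", query, ignore.case = TRUE) && file.exists(query)
  if (is_sql_file) paste(readLines(query, warn = FALSE), collapse = "\n") else query
}

# Throwaway .sql file standing in for sql/fesf.sql (table name is made up)
sql_path <- tempfile(fileext = ".sql")
writeLines(c("SELECT id_pat, doc_texte", "FROM documents"), sql_path)

from_file   <- resolve_query(sql_path)              # contents of the file
from_string <- resolve_query("SELECT 1 FROM dual")  # returned unchanged
```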
Cleaning
edstr_clean() applies regex replacements to the text column. Cleaning patterns are defined as a named character vector (or a list of named vectors): names are patterns, values are replacements.
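To make the "names are patterns, values are replacements" convention concrete, here is a base R sketch that applies such a vector in order. apply_replacements() is a hypothetical stand-in, not the function edstr uses internally, and the sample sentence is invented.

```r
# Illustrative only: apply each pattern -> replacement pair, in order.
apply_replacements <- function(text, replace) {
  for (pattern in names(replace)) {
    text <- gsub(pattern, replace[[pattern]], text, perl = TRUE)
  }
  text
}

# Names are regex patterns, values are their replacements
replace <- c(" {2,}" = " ", "&amp;" = "&")
cleaned <- apply_replacements("fracture   du col &amp; luxation", replace)
cleaned  # "fracture du col & luxation"
```

Because the pairs are applied sequentially in this sketch, a later pattern sees the output of the earlier ones, so order can matter.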
df_clean <- edstr_clean(
  data = df_import,
  replace = c("\\p{Zs}{2,}" = " ", "&amp;" = "&")
)
Reusable cleaning patterns
For complex patterns, define them in a separate R file and load them with source():
df_clean <- edstr_clean(
  data = df_import,
  replace = source(here::here("config/clean.R"))$value
)
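The $value accessor works because base R's source() returns a list whose value element holds the result of the last expression evaluated in the file. The sketch below uses a temporary file in place of config/clean.R, with example patterns chosen for illustration.

```r
# Write a throwaway pattern file; in a project this would be config/clean.R.
pattern_file <- tempfile(fileext = ".R")
writeLines('c(" {2,}" = " ", "&amp;" = "&")', pattern_file)

loaded  <- source(pattern_file)  # list with elements `value` and `visible`
replace <- loaded$value          # the named character vector defined in the file

names(replace)  # " {2,}" "&amp;"
```

The file therefore only needs to end with the expression that builds the vector (or list of vectors); no assignment is required.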
Extraction
edstr_extract() is the core of the pipeline. From cleaned text, it tokenises, matches concepts (regex), applies false-positive exclusions, and saves results as XLSX, JSON, and RDS files.
Concepts
Concepts are named regex patterns defining the clinical entities to search for. A simple vector creates independent concepts; a named list creates sub-concepts.
df_extract <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture", femur = "f(e|é)mur|fesf"),
  token = c(1, 2),
  group = "id_pat"
)
Collapse and intersect
Two modes alter extraction behaviour when multiple concepts are defined:
- collapse = TRUE: combines all patterns into a single regex (logical OR).
- intersect = TRUE: keeps only documents matching all root concepts.
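The difference between the two modes can be sketched in base R on a few invented documents. This illustrates the OR versus AND logic only; it is not edstr's implementation, and it ignores tokenisation and the group argument.

```r
# Invented example documents (\u00e9 is "é", escaped for portability)
docs <- c(
  d1 = "fracture du f\u00e9mur",
  d2 = "fracture du poignet",
  d3 = "douleur femur",
  d4 = "examen normal"
)
concepts <- c(fracture = "fracture", femur = "f(e|\u00e9)mur|fesf")

# collapse: one alternation, a document matches if any concept matches
collapsed <- paste0("(", concepts, ")", collapse = "|")
match_any <- grepl(collapsed, docs)

# intersect: a document is kept only if every concept matches
match_each <- vapply(concepts, function(p) grepl(p, docs), logical(length(docs)))
match_all  <- apply(match_each, 1, all)

names(docs)[match_any]  # "d1" "d2" "d3"
names(docs)[match_all]  # "d1"
```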
df_extract <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture", femur = "f(e|é)mur|fesf"),
  intersect = TRUE,
  group = "id_pat"
)
Exclusions
False positives can be filtered in two ways:
- Manual: a regex pattern via exclus_manual.
- Automatic: heuristics on long tokens (controlled by exclus_auto_token_min).
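As a rough picture of the manual mode, the sketch below drops matched snippets that also match an exclusion regex. It assumes exclusions are applied to matched token sequences, which this text does not confirm, and it does not attempt to reproduce the automatic heuristics. The snippets are invented.

```r
# Hypothetical matched snippets; edstr's internal representation may differ.
matches <- c("fracture ouverte du tibia", "fracture du col", "fracture ouverte")
exclus_manual <- "fracture ouverte"

is_excluded <- grepl(exclus_manual, matches)
kept     <- matches[!is_excluded]  # "fracture du col"
excluded <- matches[is_excluded]   # the two "fracture ouverte" snippets
```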
df_extract <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture"),
  exclus_manual = "fracture ouverte",
  exclus_auto_token_min = 10,
  group = "id_pat"
)
Interactive exploration
edstr_view() lets you test regex patterns without running a full extraction. It saves nothing and is designed for fast iteration.
df_view <- edstr_view(
  data = df_clean,
  pattern = "fractur",
  ngrams = 3
)
ngrams controls the number of tokens captured after the initial match: ngrams = 3 with pattern = "fractur" matches e.g. "fracture col femoral".
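The windowing idea can be approximated in base R as follows. view_ngrams() is a hypothetical helper, not edstr_view() itself; it assumes whitespace tokenisation and that the window of ngrams tokens starts at the matching token, which is consistent with the example above.

```r
# Hypothetical sketch: return the matching token plus the tokens that follow it,
# up to `ngrams` tokens in total.
view_ngrams <- function(text, pattern, ngrams = 3) {
  tokens <- strsplit(text, "\\s+")[[1]]
  starts <- grep(pattern, tokens)  # positions of tokens matching the pattern
  vapply(starts, function(i) {
    end <- min(i + ngrams - 1, length(tokens))  # do not run past the last token
    paste(tokens[i:end], collapse = " ")
  }, character(1))
}

view_ngrams("patient avec fracture col femoral gauche", "fractur", ngrams = 3)
# "fracture col femoral"
```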
Output summary
edstr_extract() returns a nested list and saves three files:
| File | Contents |
|---|---|
| .xlsx | Excel workbook with one sheet per result type (extraction, counts, exclusions, mismatch, parameters) |
| .json | Full list with all intermediate objects (JSON format) |
| .rds | Full list with all intermediate objects (R-native, used for caching) |
The .rds file is the one automatically reloaded by the caching system when it already exists.
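RDS suits this role because it serialises arbitrary R objects, nested lists included, and restores them with their structure and types. The sketch below round-trips a toy list through base R's saveRDS() and readRDS(); the list is an invented stand-in, not the actual shape of an edstr_extract() result.

```r
# Toy stand-in for an extraction result; the real structure is richer.
result <- list(
  extraction = data.frame(id_pat = c(1, 2), match = c("fracture", "fesf")),
  params     = list(token = c(1, 2), group = "id_pat")
)

rds_path <- tempfile(fileext = ".rds")
saveRDS(result, rds_path)
reloaded <- readRDS(rds_path)

# The reloaded object compares equal to the original, data frame included
same <- isTRUE(all.equal(result, reloaded))
```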