Extract structured variables from clinical text

Tokenise the source text, match concept patterns (regular expressions) against the tokens, apply exclusions, re-match at source level with accent normalisation, and save the results as XLSX, JSON, and RDS files.
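
As a rough illustration of the tokenisation and accent-normalisation steps, the sketch below uses base R only. The helpers strip_accents() and make_ngrams() are hypothetical names introduced here, and the replacement rules are an assumption; the package's built-in rules and tokeniser may differ.

```r
# Hypothetical helpers for illustration; not part of the package API.

# Replace a few accented Latin letters with their unaccented form.
# This rule set is an assumption, not the package's built-in list.
strip_accents <- function(x) {
  patterns <- c("[\u00e0\u00e2\u00e4]", "[\u00e7]", "[\u00e8\u00e9\u00ea\u00eb]",
                "[\u00ee\u00ef]", "[\u00f4\u00f6]", "[\u00f9\u00fb\u00fc]")
  replacements <- c("a", "c", "e", "i", "o", "u")
  for (i in seq_along(patterns)) {
    x <- gsub(patterns[i], replacements[i], x)
  }
  x
}

# Split a text on whitespace and build all n-grams of size n.
make_ngrams <- function(text, n) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

text <- strip_accents("ant\u00e9c\u00e9dent de diab\u00e8te")
unigrams <- make_ngrams(text, 1) # comparable in spirit to token = 1
bigrams <- make_ngrams(text, 2)  # comparable in spirit to token = c(1, 2)
```

Here unigrams holds the three single words and bigrams the two adjacent word pairs, all with accents removed.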

Usage

edstr_extract(
  data,
  text_input = getOption("edstr_text"),
  id = NULL,
  group = NULL,
  sample = NULL,
  seed = NULL,
  ano_hash = NULL,
  ano_hide = NULL,
  token = 1,
  concepts,
  collapse = FALSE,
  intersect = FALSE,
  starts_with_only = TRUE,
  exclus_manual = NULL,
  exclus_auto_escape = NULL,
  exclus_auto_token_min = 10,
  regex_replace = NULL,
  mismatch_data = FALSE,
  concept_color = "#0099FF",
  text_color = "#FF0000",
  save_as_gt = FALSE,
  dirname_suffix = if (!is.null(sample)) str_glue("sample_{sample}") else NULL,
  filename_suffix = dirname_suffix
)

Arguments

data: <data.frame> Input data containing at least a text column and a unique identifier column.
text_input: <character(1)> Name of the text column to analyse. Defaults to getOption("edstr_text") set by edstr_config().
id: <character(1)> Name of the unique identifier column. If not supplied, it is detected automatically (the first column with no duplicates and no NA values).
group: <character(1)> Optional grouping column (e.g. patient ID when rows are documents). If NULL, a sequential id_group is created.
sample: <integer(1)> Optional. Number of rows to randomly sample from data before extraction.
seed: <integer(1)> Optional. Random seed for reproducibility when sample is used.
ano_hash: <character> Column name(s) to pseudonymise by hashing.
ano_hide: <character> Column name(s) to pseudonymise by masking (replaced with "---").
token: <integer> N-gram sizes to use for tokenisation. Default 1 (unigrams). Use c(1, 2) for unigrams and bigrams.
concepts: <character|list> Named vector or nested named list of regex patterns defining the concepts to search for. Each name becomes a concept key; nested names create sub-concepts (e.g. list(cancer = c(sein = "sein|mammaire", poumon = "poumon"))).
collapse: <logical(1)> If TRUE, OR-collapse all concept patterns into a single regex per root concept. Requires at least 2 concepts.
intersect: <logical(1)> If TRUE, keep only documents matching ALL root-level concepts. Requires at least 2 concepts.
starts_with_only: <logical(1)> If TRUE (default), tokens are matched in prefix mode: the pattern must match the start of a token, and the rest of the token is accepted (\\S*$ is appended to the pattern).
exclus_manual: <character(1)> Optional regex pattern. Matched tokens containing this pattern are excluded (manual false-positive filter).
exclus_auto_escape: <character(1)> Optional regex pattern. Tokens matching this pattern are removed from data_match before auto-exclusion runs.
exclus_auto_token_min: <numeric(1)> Threshold for the automatic exclusion heuristics (default 10). Auto-exclusions apply only to tokens with n > exclus_auto_token_min.
regex_replace: <character> Optional named vector of additional regex replacements for source matching (appended to the built-in accent normalisation rules).
mismatch_data: <logical(1)> If TRUE, include unmatched documents in the mismatch output. Default FALSE.
concept_color: <character(1)> Hex colour for concept highlighting in XLSX and gt output. Default "#0099FF".
text_color: <character(1)> Hex colour for text/extract highlighting in XLSX and gt output. Default "#FF0000".
save_as_gt: <logical(1)> If TRUE, generate gt::gt() tables alongside XLSX output. Requires the gt package.
dirname_suffix: <character(1)> Optional suffix appended to the output directory name. Defaults to "sample_{sample}" when sample is set.
filename_suffix: <character(1)> Optional suffix appended to output file names. Defaults to dirname_suffix.
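
The behaviour described for starts_with_only, collapse, and intersect can be approximated with base R regular expressions. This is a minimal sketch under stated assumptions, not the package's implementation: prefix_regex() is a hypothetical helper, the leading "^" anchor is assumed, and the tokens and documents are made-up illustrations.

```r
# Hypothetical illustration of three options, using base R regex only.
tokens <- c("diabete", "diabetique", "prediabete", "tumeur")

# starts_with_only = TRUE: anchor the pattern at the token start and accept
# the rest of the token. The leading "^" is an assumption of this sketch.
prefix_regex <- function(pattern) paste0("^(?:", pattern, ")\\S*$")
prefix_hits <- grepl(prefix_regex("diabet"), tokens, perl = TRUE)

# collapse = TRUE: OR-combine the patterns of one root concept.
cancer <- c(sein = "sein|mammaire", poumon = "poumon")
cancer_collapsed <- paste(cancer, collapse = "|")

# intersect = TRUE: keep only documents that match every root concept.
docs <- c(a = "diabete et tumeur", b = "diabete seul", c = "tumeur seule")
roots <- c(diabete = "diabet", cancer = "cancer|tumeur")
hits <- vapply(roots, function(p) grepl(p, docs), logical(length(docs)))
keep <- docs[rowSums(hits) == length(roots)]
```

In this sketch "prediabete" is rejected in prefix mode because the pattern does not match at the start of the token, and only document a survives the intersection because it is the only one matching both root concepts.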

Value

A nested list, returned invisibly from the cache when the RDS file already exists, with the following elements:

data: List of data frames: base (input without text), match (initial matches), extract (final extraction).
regex: List: concepts (parsed patterns), replace (replacement rules), final (combined regex), match (source-level matches).
match: List: init (all matches), final (keep/drop after exclusions).
count: List: init (token-level counts), final (distinct match counts).
exclus: List: match (excluded matches), count (exclusion counts).
mismatch: List: id (unmatched IDs), regex (token vs source discrepancies).
summary: List: token (summary by token), concept (summary by concept), params (call parameters).
sheets: List: df (data frames per Excel sheet), gt (gt tables if save_as_gt = TRUE).

Details

Requires edstr_config() to be called first.

Examples

if (FALSE) { # \dontrun{
edstr_config(edstr_dirname = "output", edstr_filename = "my_study")

df <- edstr_import(query = "sql/my_query.sql")

result <- edstr_extract(
  data = df,
  concepts = c(diabete = "diabet", cancer = "cancer|tumeur"),
  token = c(1, 2),
  intersect = TRUE
)
} # }