Tokenize source text, match concept patterns (regex), apply exclusions, perform source-level re-matching with accent normalisation, and save results as XLSX, JSON, and RDS files.

Usage

edstr_extract(
  data,
  text_input = getOption("edstr_text"),
  id = NULL,
  group = NULL,
  sample = NULL,
  seed = NULL,
  ano_hash = NULL,
  ano_hide = NULL,
  token = 1,
  concepts,
  collapse = FALSE,
  intersect = FALSE,
  starts_with_only = TRUE,
  exclus_manual = NULL,
  exclus_auto_escape = NULL,
  exclus_auto_token_min = 10,
  regex_replace = NULL,
  mismatch_data = FALSE,
  concept_color = "#0099FF",
  text_color = "#FF0000",
  save_as_gt = FALSE,
  dirname_suffix = if (!is.null(sample)) str_glue("sample_{sample}") else NULL,
  filename_suffix = dirname_suffix
)

Arguments

data

<data.frame> Input data containing at least a text column and a unique identifier column.

text_input

<character(1)> Name of the text column to analyse. Defaults to getOption("edstr_text") set by edstr_config().

id

<character(1)> Name of the unique identifier column. If not supplied, automatically detected (first column with no duplicates and no NA).

group

<character(1)> Optional grouping column (e.g. patient ID when rows are documents). If NULL, a sequential id_group is created.

sample

<integer(1)> Optional. Number of rows to randomly sample from data before extraction.

seed

<integer(1)> Optional. Random seed for reproducibility when sample is used.

ano_hash

<character> Column name(s) to pseudonymise by hashing.

ano_hide

<character> Column name(s) to pseudonymise by masking (replaced with "---").

token

<integer> N-gram sizes to use for tokenisation. Default 1 (unigrams). Use c(1, 2) for unigrams and bigrams.

concepts

<character|list> Named vector or nested named list of regex patterns defining the concepts to search for. Each name becomes a concept key; nested names create sub-concepts (e.g. list(cancer = c(sein = "sein|mammaire", poumon = "poumon"))).

collapse

<logical(1)> If TRUE, OR-collapse all concept patterns into a single regex per root concept. Requires at least 2 concepts.

intersect

<logical(1)> If TRUE, keep only documents matching ALL root-level concepts. Requires at least 2 concepts.

starts_with_only

<logical(1)> If TRUE (default), token matching uses prefix mode: the pattern must match the start of a token, and the rest of the token is accepted (\\S*$ appended).

exclus_manual

<character(1)> Optional regex pattern. Matched tokens containing this pattern are excluded (manual false-positive filter).

exclus_auto_escape

<character(1)> Optional regex pattern. Tokens matching this pattern are removed from data_match before auto-exclusion runs.

exclus_auto_token_min

<numeric(1)> Minimum n-gram size for automatic exclusion heuristics (default 10). Auto-exclusions only apply to tokens with n > exclus_auto_token_min.

regex_replace

<character> Optional named vector of additional regex replacements for source matching (appended to the built-in accent normalisation rules).

mismatch_data

<logical(1)> If TRUE, include unmatched documents in the mismatch output. Default FALSE.

concept_color

<character(1)> Hex colour for concept highlighting in XLSX and gt output. Default "#0099FF".

text_color

<character(1)> Hex colour for text/extract highlighting in XLSX and gt output. Default "#FF0000".

save_as_gt

<logical(1)> If TRUE, generate gt::gt() tables alongside XLSX output. Requires the gt package.

dirname_suffix

<character(1)> Optional suffix appended to the output directory name. Defaults to "sample_{sample}" when sample is set.

filename_suffix

<character(1)> Optional suffix appended to output file names. Defaults to dirname_suffix.
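
The interplay of token, starts_with_only and collapse can be approximated in base R. This is a minimal sketch under stated assumptions, not the package's implementation: the helper names ngrams() and prefix_regex() are hypothetical, and the sample text is made up for illustration.

```r
# Illustrative only: base-R approximations, not the edstr implementation.

# Hypothetical helper: build n-grams of size n from a vector of words.
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

text  <- "diabetique avec tumeur du sein"
words <- strsplit(tolower(text), "\\s+")[[1]]

# token = c(1, 2): unigrams followed by bigrams
tokens <- c(ngrams(words, 1), ngrams(words, 2))

# Hypothetical helper mimicking starts_with_only = TRUE: the pattern is
# anchored at the start of the token and the rest of the token is accepted.
prefix_regex <- function(pattern) paste0("^(?:", pattern, ")\\S*$")

concepts <- c(diabete = "diabet", cancer = "cancer|tumeur")

# One vector of matching tokens per concept key
matches <- lapply(concepts, function(p) {
  grep(prefix_regex(p), tokens, value = TRUE, perl = TRUE)
})

# collapse = TRUE, sketched: OR-join the sub-concept patterns of each root
nested    <- list(cancer = c(sein = "sein|mammaire", poumon = "poumon"))
collapsed <- vapply(
  nested,
  function(p) paste(unlist(p), collapse = "|"),
  character(1)
)
```

In this sketch the trailing \S*$ cannot cross a space, so a single-word pattern only matches unigram tokens; a pattern meant to hit a bigram would need to include the space itself.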

Value

A nested list with the elements below. When the RDS file already exists, the cached result is returned invisibly.

data

List of data frames: base (input without text), match (initial matches), extract (final extraction).

regex

List: concepts (parsed patterns), replace (replacement rules), final (combined regex), match (source-level matches).

match

List: init (all matches), final (keep/drop after exclusions).

count

List: init (token-level counts), final (distinct match counts).

exclus

List: match (excluded matches), count (exclusion counts).

mismatch

List: id (unmatched IDs), regex (token vs source discrepancies).

summary

List: token (summary by token), concept (summary by concept), params (call parameters).

sheets

List: df (data frames per Excel sheet), gt (gt tables if save_as_gt = TRUE).

Details

Requires edstr_config() to be called first.
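
The dependency on edstr_config() shows up in the default of text_input, which is read from an option at call time. The sketch below uses only base R and assumes that edstr_config() stores the text column name in the "edstr_text" option; the option is set by hand here, and the column name "note_text" and the function f() are hypothetical.

```r
# Assumption: edstr_config() sets the "edstr_text" option. It is set manually
# here so the example runs without the package.
old <- options(edstr_text = NULL)            # start from an unconfigured state
unset <- is.null(getOption("edstr_text"))    # TRUE: no default available yet

options(edstr_text = "note_text")            # hypothetical column name

# Default arguments are evaluated lazily, so the option is read when f() runs.
f <- function(text_input = getOption("edstr_text")) text_input

from_option <- f()                 # falls back to the option value
explicit    <- f("other_column")   # an explicit argument overrides the option

options(old)                       # restore the previous option state
```

If the option is unset, a default of this form resolves to NULL, which is one plausible reason the configuration step has to come first.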

Examples

if (FALSE) { # \dontrun{
edstr_config(edstr_dirname = "output", edstr_filename = "my_study")

df <- edstr_import(query = "sql/my_query.sql")

result <- edstr_extract(
  data = df,
  concepts = c(diabete = "diabet", cancer = "cancer|tumeur"),
  token = c(1, 2),
  intersect = TRUE
)
} # }