edstr_extract() is the core of the pipeline. It tokenises cleaned text, matches user-defined concepts via regex, filters false positives, and exports results as XLSX, JSON, and RDS files.
Prerequisites
edstr_config(
  edstr_dirname = "output/my_study",
  edstr_filename = "my_study",
  edstr_text = "note_text",
  edstr_overwrite = FALSE
)
df_import <- edstr_import()
df_clean <- edstr_clean(data = df_import, replace = c("\\s+" = " "))

Defining concepts
Concepts are named regex patterns that define the clinical entities to search for. A named character vector creates flat, independent concepts:
result <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture", femur = "f(e|é)mur|fesf"),
  group = "id_pat"
)

A nested named list creates sub-concepts grouped under a root:
result <- edstr_extract(
  data = df_clean,
  concepts = list(
    fracture = c(
      fesf = "fesf|extremite superieure",
      col = "col (du )?femur"
    )
  ),
  group = "id_pat"
)

Tokenisation
The token argument controls the n-gram sizes used for tokenisation. Unigrams (token = 1) are the default; adding bigrams captures multi-word expressions.
result <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture"),
  token = c(1, 2),
  group = "id_pat"
)

By default, starts_with_only = TRUE appends \S*$ to patterns, matching any token that starts with the concept. Set starts_with_only = FALSE for exact matching.
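The following base R sketch illustrates the two ideas above: building unigram and bigram tokens, and the effect of appending \S*$ to a pattern. The ngrams() helper and the sample note are hypothetical and written for illustration; they are not the package's internals, and how the package anchors the start of a token is not shown here.

```r
# Hypothetical helper (not part of the package): word n-grams of size n.
ngrams <- function(text, n) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Made-up note text, already cleaned (single spaces, lower case).
note <- "fractures du col du femur"

# Equivalent in spirit to token = c(1, 2): unigrams plus bigrams.
tokens <- c(ngrams(note, 1), ngrams(note, 2))

# Bare pattern versus the same pattern with "\\S*$" appended.
pattern <- "fracture"
with_suffix <- paste0(pattern, "\\S*$")

# The bare pattern matches any token containing "fracture".
tokens[grepl(pattern, tokens, perl = TRUE)]

# With the suffix, the match must run to the end of the token,
# so the bigram "fractures du" no longer qualifies.
tokens[grepl(with_suffix, tokens, perl = TRUE)]
```

In this sketch the suffix tolerates inflected forms such as "fractures" while rejecting tokens that continue with further words.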
Collapse and intersect
When multiple concepts are defined, two modes alter the matching logic:
- collapse = TRUE: OR-collapses all concept patterns into a single regex per root concept. Useful when sub-concepts are synonyms that should be treated as one.
- intersect = TRUE: keeps only documents that match all root-level concepts. Useful for narrowing results to co-occurrences.
Both require at least two concepts.
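As a rough illustration of the logic behind the two modes, the base R sketch below OR-collapses sub-concept patterns into one regex and then keeps only documents that match every root concept. The documents and variable names are made up, and the package may implement both steps differently.

```r
# Hypothetical documents, for illustration only.
docs <- c(
  d1 = "fracture du col du femur",
  d2 = "fracture du poignet",
  d3 = "douleur du femur"
)

# OR-collapse: join sub-concept patterns with "|" so any one of them matches.
sub_patterns <- c(fesf = "fesf|extremite superieure", col = "col (du )?femur")
collapsed <- paste0("(", sub_patterns, ")", collapse = "|")
grepl(collapsed, docs)

# Intersect: one logical column per root concept, one row per document.
roots <- c(fracture = "fracture", femur = "femur|fesf")
hits <- vapply(roots, function(p) grepl(p, docs), logical(length(docs)))
rownames(hits) <- names(docs)

# Keep only documents in which every root concept matched.
kept <- names(docs)[rowSums(hits) == ncol(hits)]
kept
```

Wrapping each sub-pattern in parentheses before joining keeps alternations inside one pattern from merging with its neighbours.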
result <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture", femur = "f(e|é)mur|fesf"),
  intersect = TRUE,
  group = "id_pat"
)

Exclusions
False positives can be removed through two mechanisms:
Manual exclusions
A regex pattern passed to exclus_manual removes any matched token containing that pattern.
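The filtering idea can be sketched in base R as dropping every matched token that contains the exclusion pattern. The token vector below is invented for illustration and does not reproduce the package's internal objects.

```r
# Hypothetical matched tokens (unigrams and bigrams).
matched <- c(
  "fracture",
  "fracture ouverte",
  "ancienne fracture",
  "fracture du"
)

# Same pattern as would be passed to exclus_manual.
exclus_pattern <- "fracture ouverte|ancienne fracture"

# Keep only tokens that do NOT contain the exclusion pattern.
kept <- matched[!grepl(exclus_pattern, matched)]
kept
```

In this sketch a two-word exclusion pattern can only hit tokens of at least two words, so the unigram "fracture" is untouched.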
result <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture"),
  exclus_manual = "fracture ouverte|ancienne fracture",
  group = "id_pat"
)

Automatic exclusions
Automatic exclusion applies heuristic-based filtering to long tokens. exclus_auto_token_min sets the minimum n-gram size (default 10) above which auto-exclusion applies. exclus_auto_escape removes specific tokens from the match pool before auto-exclusion runs.
Sampling
For development and testing, sample draws a random subset of rows before extraction. Use seed for reproducibility.
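The reproducibility that a seed provides can be shown with a small base R sketch. The draw() helper and the toy data frame are hypothetical; the package's own sampling code may differ.

```r
# Toy data standing in for the cleaned notes.
df <- data.frame(
  id_pat = 1:1000,
  note_text = paste("note", 1:1000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fix the seed, then draw n rows at random.
draw <- function(data, n, seed) {
  set.seed(seed)
  data[sample(nrow(data), n), , drop = FALSE]
}

a <- draw(df, 5, seed = 42)
b <- draw(df, 5, seed = 42)

# Same seed, same rows.
identical(a, b)
```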
result <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture"),
  sample = 500,
  seed = 42,
  group = "id_pat"
)

Pseudonymisation
Two arguments handle de-identification before extraction:
- ano_hash: column names whose values are replaced by truncated hashes.
- ano_hide: column names whose values are masked with "---".
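The base R sketch below shows the effect of the two options on a toy data frame. Because base R has no general-purpose string hash, the hashing step is replaced here by a stable label per distinct value; this is a stand-in chosen for illustration, not a hash and not the package's method. The point it makes is that identical identifiers still map to identical pseudonyms, so grouping by the column keeps working.

```r
# Made-up patient data.
df <- data.frame(
  id_pat = c("A12", "B34", "A12"),
  patient_name = c("Dupont", "Martin", "Dupont"),
  stringsAsFactors = FALSE
)

# Stand-in for hashing (NOT a cryptographic hash):
# each distinct value gets one stable pseudonym.
df$id_pat <- sprintf("p%03d", match(df$id_pat, unique(df$id_pat)))

# Masking: replace every value with a fixed placeholder.
df$patient_name <- "---"

df
```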
result <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fracture"),
  ano_hash = "id_pat",
  ano_hide = "patient_name",
  group = "id_pat"
)

Output structure
edstr_extract() returns a nested list and saves three files:
| File | Contents |
|---|---|
| .xlsx | Excel workbook with one sheet per result type (extraction, token counts, exclusions, source matching, mismatch, parameters) |
| .json | Full list with all intermediate objects (JSON format) |
| .rds | Full list with all intermediate objects (R-native, used for caching) |
The returned list contains:
- data: data frames (base, match, extract)
- regex: parsed patterns, replacement rules, source-level matches
- match: initial and final (post-exclusion) matches
- count: token-level and distinct match counts
- exclus: excluded matches and counts
- mismatch: discrepancies between token and source matching
- summary: summaries by token and concept, plus call parameters
- sheets: data frames and optional gt tables for each Excel sheet
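A nested list saved as .rds can be reloaded and explored with base R. The sketch below uses a mock list with empty placeholders and a temporary file, since the exact path and contents of a real run depend on the configuration; only the element names data, base, match, extract, and summary are taken from the list above.

```r
# Mock result with empty placeholders (not real output).
mock <- list(
  data = list(
    base = data.frame(),
    match = data.frame(),
    extract = data.frame()
  ),
  summary = list()
)

# Round trip through an .rds file in a temporary location.
path <- tempfile(fileext = ".rds")
saveRDS(mock, path)
reloaded <- readRDS(path)

# Inspect the top level, then drill into one element.
names(reloaded)
names(reloaded$data)
str(reloaded, max.level = 1)
```

Limiting str() to max.level = 1 gives an overview of the top-level elements without printing every nested object.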