Tokenise source text, match concept patterns (regex), apply exclusions, perform source-level re-matching with accent normalisation, and save results as XLSX, JSON, and RDS files.
Usage
edstr_extract(
  data,
  text_input = getOption("edstr_text"),
  id = NULL,
  group = NULL,
  sample = NULL,
  seed = NULL,
  ano_hash = NULL,
  ano_hide = NULL,
  token = 1,
  concepts,
  collapse = FALSE,
  intersect = FALSE,
  starts_with_only = TRUE,
  exclus_manual = NULL,
  exclus_auto_escape = NULL,
  exclus_auto_token_min = 10,
  regex_replace = NULL,
  mismatch_data = FALSE,
  concept_color = "#0099FF",
  text_color = "#FF0000",
  save_as_gt = FALSE,
  dirname_suffix = if (!is.null(sample)) str_glue("sample_{sample}") else NULL,
  filename_suffix = dirname_suffix
)
Arguments
- data
<data.frame> Input data containing at least a text column and a unique identifier column.
- text_input
<character(1)> Name of the text column to analyse. Defaults to getOption("edstr_text"), set by edstr_config().
- id
<character(1)> Name of the unique identifier column. If not supplied, it is detected automatically (first column with no duplicates and no NA).
- group
<character(1)> Optional grouping column (e.g. patient ID when rows are documents). If NULL, a sequential id_group is created.
- sample
<integer(1)> Optional. Number of rows to randomly sample from data before extraction.
- seed
<integer(1)> Optional. Random seed for reproducibility when sample is used.
- ano_hash
<character> Column name(s) to pseudonymise by hashing.
- ano_hide
<character> Column name(s) to pseudonymise by masking (replaced with "---").
- token
<integer> N-gram sizes to use for tokenisation. Default 1 (unigrams). Use c(1, 2) for unigrams and bigrams.
- concepts
<character|list> Named vector or nested named list of regex patterns defining the concepts to search for. Each name becomes a concept key; nested names create sub-concepts (e.g. list(cancer = c(sein = "sein|mammaire", poumon = "poumon"))).
- collapse
<logical(1)> If TRUE, OR-collapse all concept patterns into a single regex per root concept. Requires at least 2 concepts.
- intersect
<logical(1)> If TRUE, keep only documents matching ALL root-level concepts. Requires at least 2 concepts.
- starts_with_only
<logical(1)> If TRUE (default), token matching uses prefix mode: the pattern must match the start of a token, and the rest of the token is accepted (\\S*$ is appended to the pattern).
- exclus_manual
<character(1)> Optional regex pattern. Matched tokens containing this pattern are excluded (manual false-positive filter).
- exclus_auto_escape
<character(1)> Optional regex pattern. Tokens matching this pattern are removed from data_match before auto-exclusion runs.
- exclus_auto_token_min
<numeric(1)> Minimum n-gram size for automatic exclusion heuristics (default 10). Auto-exclusions only apply to tokens with n > exclus_auto_token_min.
- regex_replace
<character> Optional named vector of additional regex replacements for source matching (appended to the built-in accent normalisation rules).
- mismatch_data
<logical(1)> If TRUE, include unmatched documents in the mismatch output. Default FALSE.
- concept_color
<character(1)> Hex colour for concept highlighting in XLSX and gt output. Default "#0099FF".
- text_color
<character(1)> Hex colour for text/extract highlighting in XLSX and gt output. Default "#FF0000".
- save_as_gt
<logical(1)> If TRUE, generate gt::gt() tables alongside XLSX output. Requires the gt package.
- dirname_suffix
<character(1)> Optional suffix appended to the output directory name. Defaults to "sample_{sample}" when sample is set.
- filename_suffix
<character(1)> Optional suffix appended to output file names. Defaults to dirname_suffix.
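Three of the behaviours described in the arguments (automatic id detection, n-gram tokenisation via token, and prefix mode via starts_with_only) can be illustrated in base R. This is a minimal sketch under stated assumptions, not the package's implementation: detect_id(), make_ngrams() and prefix_pattern() are hypothetical helpers, lower-casing and whitespace splitting are assumed, and anchoring the pattern with ^ is only one reading of "must match the start of a token".

```r
# Hypothetical helper: first column with no duplicates and no NA.
detect_id <- function(data) {
  ok <- vapply(
    data,
    function(col) !anyNA(col) && anyDuplicated(col) == 0,
    logical(1)
  )
  if (!any(ok)) stop("no column qualifies as a unique identifier")
  names(data)[which(ok)[1]]
}

# Hypothetical helper: n-grams of size n from a single text.
# Assumes lower-casing and splitting on whitespace.
make_ngrams <- function(text, n = 1) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Hypothetical helper: prefix mode. The pattern is anchored at the start of
# the token (assumption) and \\S*$ is appended so the rest of the token is accepted.
prefix_pattern <- function(pattern) paste0("^(?:", pattern, ")\\S*$")

docs <- data.frame(
  site = c("A", "A", "B"),
  doc_id = c(11, 12, 13),
  text = c(
    "patient diabetique avec tumeur du sein",
    "pas de cancer connu",
    "antidiabetique oral"
  ),
  stringsAsFactors = FALSE
)

id_col <- detect_id(docs)                  # "site" has duplicates, so "doc_id" is picked
unigrams <- make_ngrams(docs$text[1], 1)   # what token = 1 would produce here
bigrams <- make_ngrams(docs$text[1], 2)    # what token = 2 would produce here

concepts <- c(diabete = "diabet", cancer = "cancer|tumeur")
matched <- lapply(
  concepts,
  function(p) grep(prefix_pattern(p), unigrams, value = TRUE, perl = TRUE)
)

# In this sketch, prefix mode rejects a token where the pattern only occurs mid-word.
mid_word <- grepl(prefix_pattern("diabet"), "antidiabetique", perl = TRUE)
```

With starts_with_only = FALSE one would expect "antidiabetique" to match as well, which is the kind of false positive the prefix default appears designed to limit.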
Value
A nested list (invisibly returned from cache when the RDS file already exists) with elements:
- data
List of data frames: base (input without text), match (initial matches), extract (final extraction).
- regex
List: concepts (parsed patterns), replace (replacement rules), final (combined regex), match (source-level matches).
- match
List: init (all matches), final (keep/drop after exclusions).
- count
List: init (token-level counts), final (distinct match counts).
- exclus
List: match (excluded matches), count (exclusion counts).
- mismatch
List: id (unmatched IDs), regex (token vs source discrepancies).
- summary
List: token (summary by token), concept (summary by concept), params (call parameters).
- sheets
List: df (data frames per Excel sheet), gt (gt tables if save_as_gt = TRUE).
Details
edstr_config() must be called before edstr_extract(): it sets the options this function reads, such as edstr_text, the default for text_input.
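The description mentions source-level re-matching with accent normalisation, and regex_replace lets the caller append rules to the built-in ones. The sketch below shows one plausible reading of that step in base R: a plain pattern is widened so that it also matches accented spellings in the source text. The rules, the widen_pattern() helper, and the direction of the rewrite (pattern side rather than text side) are all assumptions made for illustration, not the package's actual rules.

```r
# Hypothetical built-in rules: plain letter -> accent-tolerant character class.
accent_rules <- c(
  a = "[a\u00e0\u00e2]",
  c = "[c\u00e7]",
  e = "[e\u00e9\u00e8\u00ea\u00eb]"
)

# Illustrative user-supplied extras, in the spirit of regex_replace:
# a named vector appended to the built-in rules.
extra_rules <- c(o = "[o\u00f4\u00f6]")
rules <- c(accent_rules, extra_rules)

# Hypothetical helper: rewrite a pattern one character at a time.
# It ignores regex escapes, so it is only suitable for simple literal patterns.
widen_pattern <- function(pattern, rules) {
  chars <- strsplit(pattern, "", fixed = TRUE)[[1]]
  hit <- chars %in% names(rules)
  chars[hit] <- rules[chars[hit]]
  paste(chars, collapse = "")
}

source_text <- "ant\u00e9c\u00e9dent de diab\u00e8te"
plain <- "diabet"
widened <- widen_pattern(plain, rules)

plain_hit <- grepl(plain, source_text)      # the accented "e" blocks the plain pattern
widened_hit <- grepl(widened, source_text)  # the widened pattern tolerates it
```

A pattern that matches a normalised token but not the raw source is the kind of token vs source discrepancy that the mismatch element of the return value is documented to report.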
Examples
if (FALSE) { # \dontrun{
  edstr_config(edstr_dirname = "output", edstr_filename = "my_study")
  df <- edstr_import(query = "sql/my_query.sql")
  result <- edstr_extract(
    data = df,
    concepts = c(diabete = "diabet", cancer = "cancer|tumeur"),
    token = c(1, 2),
    intersect = TRUE
  )
} # }
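The example above sets intersect = TRUE, and collapse has the same requirement of at least 2 concepts. The base R sketch below illustrates what the two options are documented to do, on toy data. It is an approximation under assumptions: matching is done on whole texts rather than tokens, and collapse is read as OR-joining the sub-concept patterns of each root concept.

```r
docs <- data.frame(
  id = 1:3,
  text = c("diabete et tumeur", "diabete seul", "cancer du poumon"),
  stringsAsFactors = FALSE
)

# intersect: keep only documents that match ALL root-level concepts.
concepts <- c(diabete = "diabet", cancer = "cancer|tumeur")
hits <- vapply(
  concepts,
  function(p) grepl(p, docs$text),
  logical(nrow(docs))
)                                   # one row per document, one column per concept
kept_ids <- docs$id[rowSums(hits) == length(concepts)]

# collapse: OR-join all patterns under each root concept into a single regex.
nested <- list(
  cancer = c(sein = "sein|mammaire", poumon = "poumon"),
  diabete = "diabet"
)
collapsed <- vapply(
  nested,
  function(p) paste(unlist(p), collapse = "|"),
  character(1)
)
```

Here only document 1 mentions both concepts, so it is the only one kept, and the two cancer sub-concepts end up as one alternation.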