Apply regex-based replacements to a text column and save the result as a
Parquet file. If a cached file already exists, behaviour depends on the
edstr_overwrite option (see edstr_config()).
Usage
edstr_clean(data, text = getOption("edstr_text"), replace)
Arguments
- data
<data.frame> The data to clean. Must contain the column specified by text.
- text
<character(1)> Name of the text column to clean. Defaults to the edstr_text option set by edstr_config().
- replace
A named character vector or a list of named character vectors. Names are regex patterns, values are replacements. When a list is provided, each element is applied sequentially via stringr::str_replace_all().
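The list form of replace can be pictured with plain stringr calls. The following is a minimal sketch of sequential application, not the internal code of edstr_clean(); the names notes and steps are hypothetical, and it assumes each list element is passed to stringr::str_replace_all() in order, as described above.

```r
library(stringr)

# Hypothetical input: a character vector standing in for the text column
notes <- c("diabete type\n2", "bilan   normal")

# Hypothetical replacement steps: names are regex patterns, values are replacements
steps <- list(
  c("\\s+" = " "),           # step 1: collapse any whitespace run (including newlines)
  c("diabete" = "diabetes")  # step 2: runs on the output of step 1
)

# Apply each named vector in turn, feeding the result of one step into the next
cleaned <- Reduce(
  function(x, patterns) str_replace_all(x, patterns),
  steps,
  notes
)

print(cleaned)
```

Because each step sees the output of the previous one, the order of the list elements can change the result; a single named vector is enough when the patterns do not interact.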
Value
A data.frame with the cleaned text column.
Details
Requires edstr_config() to be called first.
Examples
# \donttest{
edstr_config(
edstr_dirname = tempdir(), edstr_filename = "my_study",
edstr_text = "note_text", edstr_overwrite = TRUE
)
#>
#> ── edstr_config ────────────────────────────────────────────────────────────────
#>
#> ℹ Root : /home/runner/work/edstr/edstr
#>
#> ℹ Files will be saved in /tmp/RtmpewHAgD with prefix my_study
#>
#> ────────────────────────────────────────────────────────────────────────────────
df_import <- data.frame(
id = 1:3,
note_text = c("diabete type\n2", "bilan normal", "diabete gestationnel")
)
df_clean <- edstr_clean(
data = df_import,
replace = c("\\s+" = " ", "\\n" = " ")
)
#>
#> ── edstr_clean ─────────────────────────────────────────────────────────────────
#>
#> ℹ Cleaning text (note_text)
#> ✔ Cleaning text (note_text) [27ms]
#>
#> ℹ Saving file my_study_clean
#> ✔ Saving file my_study_clean [14ms]
#>
#> ✔ File my_study_clean saved to /tmp/RtmpewHAgD/my_study_clean.parquet
#>
#> ℹ Dimensions
#> • 3 documents
#> • 2 variables
#>
#> ────────────────────────────────────────────────────────────────────────────────
# }