Clean text data — edstr

Apply regex-based replacements to a text column and save the result as a Parquet file. If a cached file already exists, behaviour depends on the edstr_overwrite option (see edstr_config()).

Usage

edstr_clean(data, text = getOption("edstr_text"), replace)

Arguments

data: <data.frame> The data to clean. Must contain the column specified by text.
text: <character(1)> Name of the text column to clean. Defaults to the edstr_text option set by edstr_config().
replace: A named character vector or a list of named character vectors. Names are regex patterns and values are their replacements. When a list is provided, its elements are applied in order, each one via stringr::str_replace_all().
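
The two accepted shapes of replace can be illustrated with stringr alone. This is a minimal sketch and not edstr's actual implementation: the Reduce() loop is only an assumption about how a list of rule sets could be applied in order, and the names notes, single, steps and cleaned are hypothetical.

```r
library(stringr)

# Hypothetical sample notes, similar to those in the Examples section.
notes <- c("diabete  type\n2", "bilan normal")

# Shape 1: a single named character vector.
# Names are regex patterns, values are the replacement strings.
single <- str_replace_all(notes, c("\\s+" = " "))

# Shape 2: a list of named character vectors.
# Assumption: each element is applied to the output of the previous one.
steps <- list(
  c("\\s+" = " "),           # collapse runs of whitespace, newlines included
  c("diabete" = "diabetes")  # then normalise a term
)
cleaned <- Reduce(function(x, rules) str_replace_all(x, rules), steps, notes)

print(single)
print(cleaned)
```

Because the steps run in order, later rules see text that earlier rules have already rewritten, so putting whitespace normalisation first keeps the later patterns simple.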

Value

A data.frame in which the text column has been cleaned. The same data is also written to a Parquet file as a side effect.

Details

Call edstr_config() before edstr_clean(). It sets the options this function relies on, including edstr_text and edstr_overwrite.

Examples

# \donttest{
edstr_config(
  edstr_dirname = tempdir(), edstr_filename = "my_study",
  edstr_text = "note_text", edstr_overwrite = TRUE
)
#> 
#> ── edstr_config ────────────────────────────────────────────────────────────────
#> 
#> ℹ Root : /home/runner/work/edstr/edstr
#> 
#> ℹ Files will be saved in /tmp/RtmpF1ZXT2 with prefix my_study
#> 
#> ────────────────────────────────────────────────────────────────────────────────

df_import <- data.frame(
  id = 1:3,
  note_text = c("diabete  type\n2", "bilan normal", "diabete gestationnel")
)

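# Patterns are applied in order. "\\s+" already matches newlines, so the
# "\\n" rule has nothing left to match in this particular example.
df_clean <- edstr_clean(
  data = df_import,
  replace = c("\\s+" = " ", "\\n" = " ")
)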
#> 
#> ── edstr_clean ─────────────────────────────────────────────────────────────────
#> 
#> ℹ Cleaning text (note_text)
#> ✔ Cleaning text (note_text) [28ms]
#> 
#> ℹ Saving file my_study_clean
#> ✔ Saving file my_study_clean [13ms]
#> 
#> ✔ File my_study_clean saved to /tmp/RtmpF1ZXT2/my_study_clean.parquet
#> 
#> ℹ Dimensions
#>   • 3 documents
#>   • 2 variables
#> 
#> ────────────────────────────────────────────────────────────────────────────────
# }