Apply regex-based replacements to a text column and save the result as a Parquet file. If a cached file already exists, behaviour depends on the edstr_overwrite option (see edstr_config()).

Usage

edstr_clean(data, text = getOption("edstr_text"), replace)

Arguments

data

<data.frame> The data to clean. Must contain the column specified by text.

text

<character(1)> Name of the text column to clean. Defaults to the edstr_text option set by edstr_config().

replace

A named character vector or a list of named character vectors. Names are regex patterns, values are replacements. When a list is provided, each element is applied sequentially via stringr::str_replace_all().
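
The two accepted shapes of replace can be illustrated outside of edstr_clean() with stringr alone. This is only a minimal sketch: apply_replacements() is a hypothetical helper written for this illustration, not the package's internal code, and it assumes that "applied sequentially" means each list element is passed to stringr::str_replace_all() on the result of the previous one.

```r
library(stringr)

notes <- c("diabete  type\n2", "bilan normal")

# Hypothetical helper (not part of edstr): accepts either a single named
# vector or a list of named vectors, and applies them in order.
apply_replacements <- function(strings, replace) {
  if (!is.list(replace)) replace <- list(replace)
  Reduce(
    function(acc, patterns) str_replace_all(acc, patterns),
    replace,
    strings  # initial value of the fold
  )
}

# Step 1 collapses whitespace runs (including newlines) to a single space;
# step 2 then operates on the output of step 1.
steps <- list(
  c("\\s+" = " "),
  c("diabete" = "diabetes")
)

cleaned <- apply_replacements(notes, steps)
print(cleaned)
```

Splitting patterns into separate list elements makes the order of application explicit, which matters when a later pattern depends on the output of an earlier one.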

Value

A data.frame with the cleaned text column.

Details

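edstr_config() must be called before edstr_clean(): it sets the output directory, the file name prefix, and the edstr_text and edstr_overwrite options used here.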
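
Because the text argument defaults to getOption("edstr_text"), a quick check that the option is set can catch a missing configuration early. The snippet below is a sketch only: check_text_option() is a hypothetical helper, not an edstr function, and the options() call merely stands in for what edstr_config() is assumed to do.

```r
# Hypothetical guard (not part of edstr): returns the configured text
# column name, or stops with a hint if the option has not been set.
check_text_option <- function() {
  text_col <- getOption("edstr_text")
  if (is.null(text_col)) {
    stop("Option 'edstr_text' is not set; call edstr_config() first.",
         call. = FALSE)
  }
  text_col
}

# Stand-in for edstr_config(): set the option directly for this sketch.
options(edstr_text = "note_text")

text_col <- check_text_option()
print(text_col)
```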

Examples

# \donttest{
edstr_config(
  edstr_dirname = tempdir(), edstr_filename = "my_study",
  edstr_text = "note_text", edstr_overwrite = TRUE
)
#> 
#> ── edstr_config ────────────────────────────────────────────────────────────────
#> 
#>  Root : /home/runner/work/edstr/edstr
#> 
#>  Files will be saved in /tmp/RtmpewHAgD with prefix my_study
#> 
#> ────────────────────────────────────────────────────────────────────────────────

df_import <- data.frame(
  id = 1:3,
  note_text = c("diabete  type\n2", "bilan normal", "diabete gestationnel")
)

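# text is omitted, so it defaults to the edstr_text option ("note_text").
# Note: "\\s+" already matches newlines, so the "\\n" pattern is redundant here.
df_clean <- edstr_clean(
  data = df_import,
  replace = c("\\s+" = " ", "\\n" = " ")
)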
#> 
#> ── edstr_clean ─────────────────────────────────────────────────────────────────
#> 
#>  Cleaning text (note_text)
#>  Cleaning text (note_text) [27ms]
#> 
#>  Saving file my_study_clean
#>  Saving file my_study_clean [14ms]
#> 
#>  File my_study_clean saved to /tmp/RtmpewHAgD/my_study_clean.parquet
#> 
#>  Dimensions
#>   • 3 documents
#>   • 2 variables
#> 
#> ────────────────────────────────────────────────────────────────────────────────
# }