edstr_clean(): text cleaning

library(edstr)

edstr_clean() applies regex-based replacements to a text column and saves the cleaned data as a Parquet file. It requires edstr_config() to be called first.

Prerequisites

edstr_config(
  edstr_dirname = "output/my_study",
  edstr_filename = "my_study",
  edstr_text = "note_text",
  edstr_overwrite = FALSE
)

Defining replacement rules

The replace argument takes a named character vector whose names are regex patterns and whose values are the replacements. The patterns are applied one after another, in the order given, via stringr::str_replace_all(), so each pattern sees the output of the previous one.

df_clean <- edstr_clean(
  data = df_import,
  replace = c(
    "\\p{Zs}{2,}" = " ",  # collapse runs of space separators
    "&lt;"         = "<",
    "&gt;"         = ">",
    "&amp;"        = "&"   # decode last so "&amp;lt;" becomes "&lt;", not "<"
  )
)
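To make the sequential behaviour concrete, the sketch below calls stringr::str_replace_all() directly on a small made-up string, outside of edstr_clean(). The variable names and sample text are illustrative only; the second half shows why decoding "&amp;" before the other entities can decode a value twice.

```r
library(stringr)

# Illustrative input: doubled spaces plus two HTML entities
x <- "Tom  &amp;  Jerry &lt;3"

rules <- c(
  "\\p{Zs}{2,}" = " ",  # applied first
  "&lt;"        = "<",  # applied second
  "&amp;"       = "&"   # applied last
)
cleaned <- str_replace_all(x, rules)
cleaned
#> [1] "Tom & Jerry <3"

# Order matters: "&amp;lt;" is the escaped form of the literal text "&lt;"
escaped   <- "&amp;lt;"
amp_first <- str_replace_all(escaped, c("&amp;" = "&", "&lt;" = "<"))
amp_last  <- str_replace_all(escaped, c("&lt;" = "<", "&amp;" = "&"))
amp_first  # decoded twice
#> [1] "<"
amp_last   # decoded once
#> [1] "&lt;"
```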

Using a list for ordered replacements

When the order matters (e.g. decoding HTML entities before normalising whitespace), pass a list of named vectors. The list elements are applied in sequence, and the patterns within each element are applied in the order written.

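df_clean <- edstr_clean(
  data = df_import,
  replace = list(
    # decode "&amp;" last so "&amp;lt;" becomes "&lt;", not "<"
    entities = c("&lt;" = "<", "&gt;" = ">", "&amp;" = "&"),
    # turn newlines into spaces first, then collapse the resulting runs
    whitespace = c("\\n+" = " ", "\\p{Zs}{2,}" = " ")
  )
)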
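One way to picture "applied in sequence" is a fold over the list, where each group of rules receives the text produced by the previous group. The helper apply_rule_groups() below is hypothetical and is not how edstr_clean() is necessarily implemented; it is only a sketch of the idea using base Reduce() and stringr.

```r
library(stringr)

# Hypothetical helper (not part of edstr): apply each rule group in turn
apply_rule_groups <- function(x, groups) {
  Reduce(
    function(acc, rules) str_replace_all(acc, rules),
    groups,
    x  # initial value: the untouched text
  )
}

groups <- list(
  entities   = c("&lt;" = "<", "&gt;" = ">", "&amp;" = "&"),
  whitespace = c("\\n+" = " ", "\\p{Zs}{2,}" = " ")
)

grouped <- apply_rule_groups("a &lt;b&gt;\n\n  c", groups)
grouped
#> [1] "a <b> c"
```

Within the whitespace group, newlines are converted to spaces before runs of spaces are collapsed, so the spaces introduced by the first pattern are cleaned up by the second.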

Externalising cleaning rules

For complex or project-specific patterns, define replacements in a standalone R file and load them with source(). This keeps the cleaning logic versioned, testable, and reusable across scripts.

df_clean <- edstr_clean(
  data = df_import,
  replace = source(here::here("config/clean.R"))$value
)

The last expression in the sourced file should evaluate to a named character vector or list, because source() returns that value in its $value element. See demo/config/clean.R for an example.
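As a rough illustration of the mechanism, the sketch below writes a small rules file to a temporary location and reads it back with source(). The file contents are made up for this example and are not a copy of demo/config/clean.R.

```r
# Write an illustrative rules file; in a project this would live at
# config/clean.R and be kept under version control.
rules_file <- tempfile(fileext = ".R")
writeLines(c(
  "# The last expression is what source() hands back in $value",
  "c(",
  "  '&lt;'  = '<',",
  "  '&gt;'  = '>',",
  "  '&amp;' = '&'",
  ")"
), rules_file)

# source() returns a list with elements `value` and `visible`
rules <- source(rules_file)$value
names(rules)
#> [1] "&lt;"  "&gt;"  "&amp;"
```

If the rules file also defines helper objects, calling source(rules_file, local = TRUE) evaluates it in the calling environment rather than the global one, which keeps those helpers out of the global workspace.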

Specifying the text column

By default, edstr_clean() uses the edstr_text option set in edstr_config(). To override it for a specific call:

df_clean <- edstr_clean(
  data = df_import,
  text = "other_column",
  replace = c("\\s+" = " ")
)

Caching

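Like other pipeline functions, edstr_clean() saves its output as {filename}_clean.parquet, where {filename} is the edstr_filename value from edstr_config(). On subsequent calls, the edstr_overwrite setting determines whether the cached file is loaded or the cleaning step is run again.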
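The general load-or-compute pattern behind this kind of cache can be sketched as follows. This is an assumption-laden illustration, not edstr's code: cache_parquet() is a hypothetical helper, it assumes the arrow package is installed, and it assumes that overwrite = FALSE means "reuse an existing file".

```r
library(arrow)

# Hypothetical helper (not part of edstr): reuse a Parquet file if present
cache_parquet <- function(path, compute, overwrite = FALSE) {
  if (file.exists(path) && !overwrite) {
    return(read_parquet(path))  # cache hit: skip the computation
  }
  result <- compute()           # cache miss or forced refresh
  write_parquet(result, path)
  result
}

path <- file.path(tempdir(), "my_study_clean.parquet")
unlink(path)  # start from a clean state for the demonstration

# First call computes and writes the file
first <- cache_parquet(path, function() data.frame(note_text = "a b"))

# Second call reads the file; the compute function is never invoked
second <- cache_parquet(path, function() stop("not reached on a cache hit"))
```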