edstr_clean() applies regex-based replacements to a text column and saves the cleaned data as a Parquet file. It requires edstr_config() to be called first.
Prerequisites
edstr_config(
edstr_dirname = "output/my_study",
edstr_filename = "my_study",
edstr_text = "note_text",
edstr_overwrite = FALSE
)
Defining replacement rules
The replace argument takes a named character vector where names are regex patterns and values are their replacements. Replacements are applied sequentially via stringr::str_replace_all().
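To make the sequential behaviour concrete, here is a minimal sketch that calls stringr directly, outside of edstr_clean(). The sample strings and rules are made up for illustration. Because each pattern runs on the output of the previous one, the order of the vector can change the result.

```r
library(stringr)

# Made-up note text containing HTML entities and a run of spaces
note <- "Fever &gt; 38   &amp; cough"

# Names are regex patterns, values are replacements
rules <- c("&amp;" = "&", "&gt;" = ">", "\\p{Zs}{2,}" = " ")

cleaned <- str_replace_all(note, rules)
print(cleaned)

# Order dependence: the second rule sees the output of the first
chained  <- str_replace_all("a", c("a" = "b", "b" = "c"))  # "a" -> "b" -> "c"
reversed <- str_replace_all("a", c("b" = "c", "a" = "b"))  # "b" rule finds nothing, then "a" -> "b"
print(c(chained, reversed))
```

The same kind of named vector is what gets passed to the replace argument.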
df_clean <- edstr_clean(
data = df_import,
replace = c(
"\\p{Zs}{2,}" = " ",
"&amp;" = "&",
"&lt;" = "<",
"&gt;" = ">"
)
)
Using a list for ordered replacements
When the order matters (e.g. fixing HTML entities before stripping tags), pass a list of named vectors. Each element is applied in sequence.
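One way to picture "each element is applied in sequence" is to fold the list over the text with Reduce(). This is only a sketch under that assumption, not the package's actual implementation; the sample string is invented.

```r
library(stringr)

# Rule groups in the order they should run
rules <- list(
  entities   = c("&amp;" = "&", "&lt;" = "<", "&gt;" = ">"),
  whitespace = c("\\p{Zs}{2,}" = " ", "\\n+" = " ")
)

# Invented input: entities, a blank line, and a double space
note <- "BP &lt; 120\n\nHR &gt;  60"

# Apply each group to the result of the previous group
cleaned <- Reduce(function(txt, group) str_replace_all(txt, group), rules, note)
print(cleaned)
```

Grouping the rules this way makes the intended order explicit.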
df_clean <- edstr_clean(
data = df_import,
replace = list(
entities = c("&amp;" = "&", "&lt;" = "<", "&gt;" = ">"),
whitespace = c("\\p{Zs}{2,}" = " ", "\\n+" = " ")
)
)
Externalising cleaning rules
For complex or project-specific patterns, define replacements in a standalone R file and load them with source(). This keeps the cleaning logic versioned, testable, and reusable across scripts.
df_clean <- edstr_clean(
data = df_import,
replace = source(here::here("config/clean.R"))$value
)
The sourced file should return a named character vector or list; see demo/config/clean.R for an example.
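As a rough illustration of why the call extracts $value: source() returns a list whose value element holds the last expression evaluated in the file. The sketch below writes a hypothetical rules file to a temporary location and loads it. These rules are invented and are not the contents of demo/config/clean.R.

```r
# Hypothetical rules file: its last (here, only) expression is the rule list
rules_file <- tempfile(fileext = ".R")
writeLines(c(
  'list(',
  '  entities   = c("&amp;" = "&", "&lt;" = "<", "&gt;" = ">"),',
  '  whitespace = c(" {2,}" = " ")',
  ')'
), rules_file)

# source() returns list(value = ..., visible = ...); keep only the value
rules <- source(rules_file)$value
str(rules)
```

Because the file ends with the expression itself rather than an assignment to a global name, the rules can be loaded without adding objects to the calling environment.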
Specifying the text column
By default, edstr_clean() uses the edstr_text option set in edstr_config(). To override it for a specific call:
df_clean <- edstr_clean(
data = df_import,
text = "other_column",
replace = c("\\s+" = " ")
)
Caching
Like other pipeline functions, edstr_clean() saves its output as {filename}_clean.parquet. On subsequent calls, the edstr_overwrite setting determines whether that cached file is loaded or the cleaning is run again.
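The general load-or-recompute pattern can be sketched as follows. This is an assumption-laden illustration, not the package's code: cached_step() is a hypothetical helper, and it assumes that a FALSE overwrite flag means an existing file is reused. It relies on the arrow package for Parquet I/O.

```r
library(arrow)

# Hypothetical helper (not part of the package): reuse an existing Parquet
# file unless overwrite is TRUE, otherwise compute and save the result.
cached_step <- function(path, overwrite, compute) {
  if (file.exists(path) && !overwrite) {
    return(arrow::read_parquet(path))
  }
  result <- compute()
  arrow::write_parquet(result, path)
  result
}

path <- file.path(tempdir(), "my_study_clean.parquet")
unlink(path)  # start from a clean state for the demonstration

# First call computes and writes the file
first <- cached_step(path, overwrite = FALSE,
                     function() data.frame(note_text = "a b"))

# Second call reads the file; the compute function is never invoked
second <- cached_step(path, overwrite = FALSE,
                      function() stop("not reached: cache is used"))
```

Under this assumption, changing the replacement rules would have no effect until the flag is set to TRUE or the cached file is removed.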