edstr extracts structured variables from unstructured French clinical free text stored in an EDS (Entrepôt de Données de Santé, i.e. an institutional clinical data warehouse). It provides a pipeline that imports data from an Oracle database, cleans text with regex rules, tokenises and matches user-defined concepts, filters false positives, and exports results as Excel, JSON, and RDS files.
Installation
Install from CRAN:
install.packages("edstr")

Or install the development version from GitHub:

# install.packages("pak")
pak::pak("hebstr/edstr")

System dependencies
- Java JDK (>= 8): required by rJava, RJDBC, and DatabaseConnector for Oracle JDBC connectivity. On Debian/Ubuntu: sudo apt install default-jdk, then sudo R CMD javareconf.
- Oracle Instant Client: required for edstr_import() to connect to an Oracle database.
If you do not need the database import step (edstr_import()), you can still use the cleaning and extraction functions on locally loaded data frames.
Pipeline overview
edstr_config()
      |
      v
edstr_import()
      |
      v
edstr_clean()
      |
      +---------> edstr_view()
      |           (interactive, no save)
      v
edstr_extract()
      |
      v
.xlsx / .json / .rds

Each step (except edstr_view()) caches its output: edstr_import() and edstr_clean() save Parquet files; edstr_extract() saves an RDS file. If the file already exists, the caching system either loads it, overwrites it, or prompts the user, depending on the edstr_overwrite option.
edstr_view() branches off from cleaned data for interactive pattern exploration and does not save anything.
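The load / overwrite / prompt rule described above can be pictured with a small base R sketch. This is not the edstr implementation: cached() and its arguments are hypothetical names, RDS stands in for Parquet to avoid extra dependencies, and the fallback to loading when overwrite is NULL in a non-interactive session is an assumption.

```r
# Hypothetical helper illustrating a three-state overwrite option.
# overwrite = TRUE: recompute and overwrite; FALSE: load silently;
# NULL: ask the user (assumed to fall back to loading when non-interactive).
cached <- function(path, compute, overwrite = NULL) {
  if (!file.exists(path)) {
    result <- compute()
    saveRDS(result, path)
    return(result)
  }
  if (is.null(overwrite)) {
    overwrite <- if (interactive()) {
      isTRUE(utils::askYesNo(sprintf("Overwrite %s?", basename(path))))
    } else {
      FALSE
    }
  }
  if (isTRUE(overwrite)) {
    result <- compute()
    saveRDS(result, path)
    result
  } else {
    readRDS(path)
  }
}

path   <- tempfile(fileext = ".rds")
first  <- cached(path, function() data.frame(id = 1:3), overwrite = FALSE)
second <- cached(path, function() data.frame(id = 4:6), overwrite = FALSE)
third  <- cached(path, function() data.frame(id = 4:6), overwrite = TRUE)
```

In this sketch, second is loaded from the existing file and so still holds ids 1 to 3, while third is recomputed and replaces the file.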
Exported functions
| Function | Description |
|---|---|
| edstr_config() | Set global options: output directory, file prefix, text column, and caching behaviour. Must be called first. |
| edstr_import() | Execute a SQL query against an Oracle database and cache the result as Parquet. |
| edstr_clean() | Apply sequential regex replacements to a text column and cache the result. |
| edstr_extract() | Tokenize text, match concepts, filter false positives, re-match against source text, and export results as XLSX, JSON, and RDS. |
| edstr_view() | Interactively search for a regex pattern in text and display match frequencies. Does not save. |
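To make "sequential regex replacements" concrete, here is a minimal base R sketch in which each rule is applied to the output of the previous one, so rule order matters. The helper apply_replacements() and the sample notes are invented for illustration. edstr_clean() may use a different regex engine and different internals.

```r
# Illustrative helper, not part of edstr: apply a named vector of
# pattern = replacement rules one after another.
apply_replacements <- function(text, replace) {
  for (i in seq_along(replace)) {
    text <- gsub(names(replace)[i], replace[[i]], text, perl = TRUE)
  }
  text
}

notes <- c("Fracture  du\ncol   du femur", "Pas de\n\nchute")

# Newlines become spaces first, then runs of spaces are collapsed.
rules <- c("\\n" = " ", "[ \\t]{2,}" = " ")

cleaned <- apply_replacements(notes, rules)
cleaned
# "Fracture du col du femur" "Pas de chute"
```

Reversing the two rules would leave a double space in the second note, because the newlines would be converted only after the space-collapsing rule had already run.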
Quick start
Configure the pipeline
library(edstr)

edstr_config(
  edstr_dirname = "output/my_study",
  edstr_filename = "my_study",
  edstr_text = "note_text",
  edstr_overwrite = TRUE
)

Import and clean
df_import <- edstr_import(
  query = "sql/my_query.sql",
  head = 1000,
  user = "my_user"
)

df_clean <- edstr_clean(
  data = df_import,
  replace = c("\\p{Zs}{2,}" = " ", "\\n" = " ")
)

Explore patterns
Use edstr_view() to iterate on regex patterns before extraction.
edstr_view(
  data = df_clean,
  pattern = "fractur",
  ngrams = 3
)

Extract structured variables
Flat concepts are given as a named character vector where each element is an independent concept:

result <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fractur", femur = "f(e|e)mur|fesf"),
  token = c(1, 2),
  group = "id_pat"
)

Nested concepts are given as a named list grouping sub-concepts under a root:
result <- edstr_extract(
  data = df_clean,
  concepts = list(
    fracture = c(
      fesf = "fesf|extremite superieure",
      col = "col (du )?femur"
    )
  ),
  group = "id_pat",
  exclus_manual = "ancienne fracture|fracture ouverte"
)

Key features
Concept matching. Concepts are named regex patterns defining clinical entities. A named character vector creates independent concepts; a nested named list groups sub-concepts under a root.
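The difference between the two concept shapes can be sketched in base R. The helpers flatten_concepts() and match_concepts() are hypothetical, and the dotted names come from base R's unlist(), which is not necessarily how edstr labels sub-concepts.

```r
# Hypothetical helpers illustrating flat vs nested concept definitions.
flatten_concepts <- function(concepts) {
  if (is.character(concepts)) return(concepts)
  # unlist() joins root and sub-concept names with a dot ("fracture.fesf")
  unlist(concepts)
}

# One logical column per concept, one row per text
match_concepts <- function(texts, patterns) {
  vapply(patterns, function(p) grepl(p, texts), logical(length(texts)))
}

flat <- c(fracture = "fractur", femur = "femur|fesf")
nested <- list(
  fracture = c(
    fesf = "fesf|extremite superieure",
    col  = "col (du )?femur"
  )
)

texts <- c("fracture du col du femur", "fesf droite", "pas de lesion")

hits_flat   <- match_concepts(texts, flatten_concepts(flat))
hits_nested <- match_concepts(texts, flatten_concepts(nested))
colnames(hits_nested)
# "fracture.fesf" "fracture.col"
```

With the flat vector each column is an independent concept, whereas the nested list keeps the root name attached to each sub-concept, so hits can later be rolled up to the root.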
Collapse and intersect modes. collapse = TRUE OR-combines all patterns into a single regex. intersect = TRUE keeps only documents matching all root-level concepts.
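As a rough illustration of the two modes, the base R sketch below builds an OR-combined regex and, separately, keeps only the texts that match every concept. The sample texts are invented and the code only mirrors the behaviour described above. It does not show how edstr_extract() is implemented.

```r
patterns <- c(fracture = "fractur", femur = "femur|fesf")
texts <- c(
  "fracture du femur",
  "fracture du poignet",
  "douleur du femur",
  "rien a signaler"
)

# Collapse: wrap each pattern in a non-capturing group and join with "|"
collapsed <- paste0("(?:", patterns, ")", collapse = "|")
any_hit <- grepl(collapsed, texts, perl = TRUE)

# Intersect: a text is kept only if every concept matches it
hits <- vapply(patterns, function(p) grepl(p, texts), logical(length(texts)))
all_hit <- rowSums(hits) == ncol(hits)

texts[any_hit]  # first three texts
texts[all_hit]  # only "fracture du femur"
```

The non-capturing groups keep the alternation inside "femur|fesf" from leaking into its neighbours once the patterns are joined.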
False-positive filtering. False positives are removed in two ways: manual exclusions through a user-supplied regex (exclus_manual), and automatic heuristics applied to long tokens (exclus_auto_token_min).
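The manual exclusion step can be pictured as a second regex applied to the matched n-grams. The sketch below uses an invented data frame and plain base R. The automatic long-token heuristic is left out because its exact rule is not described here.

```r
# Invented matches, standing in for n-grams that hit a "fracture" concept
matches <- data.frame(
  id = 1:4,
  ngram = c(
    "fracture du col",
    "ancienne fracture",
    "fracture ouverte du",
    "fracture fesf"
  ),
  stringsAsFactors = FALSE
)

# Same regex as the exclus_manual value in the Quick start example
exclus_manual <- "ancienne fracture|fracture ouverte"

excluded <- grepl(exclus_manual, matches$ngram)
kept    <- matches[!excluded, ]
dropped <- matches[excluded, ]  # retained separately so exclusions can be reviewed
```

Keeping the dropped rows instead of discarding them mirrors the idea of an exclusions sheet in the exported workbook.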
Source re-matching. After token-level matching on ASCII-transliterated n-grams, patterns are re-matched against the original text with accent normalisation. Discrepancies between the two are flagged as mismatches for review.
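One way such a discrepancy can arise is sketched below in base R, under loudly stated assumptions: the tokeniser is assumed to drop punctuation, the accent table covers only a few lowercase French letters, and strip_accents(), tokenise(), ngrams(), and rematch() are all hypothetical helpers, not edstr functions.

```r
# Small accent table (lowercase only), written with \u escapes for portability
from <- c("\u00e0", "\u00e2", "\u00e7", "\u00e9", "\u00e8", "\u00ea",
          "\u00ee", "\u00f4", "\u00f9", "\u00fb")
to   <- c("a", "a", "c", "e", "e", "e", "i", "o", "u", "u")

strip_accents <- function(x) {
  for (i in seq_along(from)) x <- gsub(from[i], to[i], x, fixed = TRUE)
  x
}

# Assumed tokeniser: lowercase ASCII words, punctuation discarded
tokenise <- function(text) {
  words <- strsplit(tolower(strip_accents(text)), "[^a-z0-9]+")[[1]]
  words[nzchar(words)]
}

# Word n-grams of size n
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Compare the token-level match with a match on the normalised source text
rematch <- function(doc, pattern, n = 3) {
  token_hit  <- any(grepl(pattern, ngrams(tokenise(doc), n)))
  source_hit <- grepl(pattern, tolower(strip_accents(doc)))
  c(token = token_hit, source = source_hit, mismatch = token_hit != source_hit)
}

docs <- c("Fracture du col du f\u00e9mur", "Col. F\u00e9mur intact")
result <- vapply(docs, rematch, logical(3),
                 pattern = "col (du )?femur", USE.NAMES = FALSE)
result["mismatch", ]
# FALSE TRUE
```

In the second document the n-gram "col femur intact" matches, but the source text still contains the full stop after "col", so the re-match fails and the row is flagged for review.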
Built-in caching. edstr_import() and edstr_clean() write Parquet files; edstr_extract() writes an RDS file. The edstr_overwrite option (TRUE / FALSE / NULL) controls whether existing files are overwritten, loaded silently, or trigger an interactive prompt.
Output
edstr_extract() returns a nested list and saves three files:
| File | Contents |
|---|---|
| .xlsx | Excel workbook with one sheet per result type (extraction, counts, exclusions, mismatch, parameters) |
| .json | Full list with all intermediate objects (JSON format) |
| .rds | Full nested list with all intermediate objects (R-native, used for caching) |
Vignettes
Detailed documentation is available in six vignettes:
- Get started (vignette("edstr"))
- Pipeline configuration (vignette("config"))
- Data import (vignette("import"))
- Text cleaning (vignette("clean"))
- Text extraction (vignette("extract"))
- Interactive exploration (vignette("explore"))
Contributing
Bug reports and feature requests: https://github.com/hebstr/edstr/issues
Source code: https://github.com/hebstr/edstr