Skip to contents

edstr extracts structured variables from unstructured French clinical free text stored in an EDS (Entrepot de Donnees de Sante, i.e. an institutional clinical data warehouse). It provides a pipeline that imports data from an Oracle database, cleans text with regex rules, tokenises and matches user-defined concepts, filters false positives, and exports results as Excel, JSON, and RDS files.

Installation

Install from CRAN:

Or install the development version from GitHub:

#install.packages("pak")
pak::pak("hebstr/edstr")

System dependencies

  • Java JDK (>= 8): required by rJava, RJDBC, and DatabaseConnector for Oracle JDBC connectivity. On Debian/Ubuntu: sudo apt install default-jdk then sudo R CMD javareconf.
  • Oracle Instant Client: required for edstr_import() to connect to an Oracle database.

If you do not need the database import step (edstr_import()), you can still use the cleaning and extraction functions on locally loaded data frames.

Pipeline overview

edstr_config()
      |
      v
edstr_import()
      |
      v
edstr_clean()
      |
      +---------> edstr_view()
      |           (interactive, no save)
      v
edstr_extract()
      |
      v
  .xlsx / .json / .rds

Each step (except edstr_view()) caches its output: edstr_import() and edstr_clean() save Parquet files; edstr_extract() saves an RDS file. If the file already exists, the caching system either loads it, overwrites it, or prompts the user, depending on the edstr_overwrite option.

edstr_view() branches off from cleaned data for interactive pattern exploration and does not save anything.

Exported functions

Function Description
edstr_config() Set global options: output directory, file prefix, text column, and caching behaviour. Must be called first.
edstr_import() Execute a SQL query against an Oracle database and cache the result as Parquet.
edstr_clean() Apply sequential regex replacements to a text column and cache the result.
edstr_extract() Tokenize text, match concepts, filter false positives, re-match against source text, and export results as XLSX, JSON, and RDS.
edstr_view() Interactively search for a regex pattern in text and display match frequencies. Does not save.

Quick start

Configure the pipeline

library(edstr)

edstr_config(
  edstr_dirname = "output/my_study",
  edstr_filename = "my_study",
  edstr_text = "note_text",
  edstr_overwrite = TRUE
)

Import and clean

df_import <- edstr_import(
  query = "sql/my_query.sql",
  head = 1000,
  user = "my_user"
)

df_clean <- edstr_clean(
  data = df_import,
  replace = c("\\p{Zs}{2,}" = " ", "\\n" = " ")
)

Explore patterns

Use edstr_view() to iterate on regex patterns before extraction.

edstr_view(
  data = df_clean,
  pattern = "fractur",
  ngrams = 3
)

Extract structured variables

Flat concepts — a named character vector where each element is an independent concept:

result <- edstr_extract(
  data = df_clean,
  concepts = c(fracture = "fractur", femur = "f(e|e)mur|fesf"),
  token = c(1, 2),
  group = "id_pat"
)

Nested concepts — a named list grouping sub-concepts under a root:

result <- edstr_extract(
  data = df_clean,
  concepts = list(
    fracture = c(
      fesf = "fesf|extremite superieure",
      col = "col (du )?femur"
    )
  ),
  group = "id_pat",
  exclus_manual = "ancienne fracture|fracture ouverte"
)

Key features

Concept matching. Concepts are named regex patterns defining clinical entities. A named character vector creates independent concepts; a nested named list groups sub-concepts under a root.

Collapse and intersect modes. collapse = TRUE OR-combines all patterns into a single regex. intersect = TRUE keeps only documents matching all root-level concepts.

False-positive filtering. Manual exclusions via a user-supplied regex (exclus_manual) and automatic heuristics on long tokens (exclus_auto_token_min).

Source re-matching. After token-level matching on ASCII-transliterated n-grams, patterns are re-matched against the original text with accent normalisation. Discrepancies between the two are flagged as mismatches for review.

Built-in caching. edstr_import() and edstr_clean() write Parquet files; edstr_extract() writes an RDS file. The edstr_overwrite option (TRUE / FALSE / NULL) controls whether existing files are overwritten, loaded silently, or trigger an interactive prompt.

Output

edstr_extract() returns a nested list and saves three files:

File Contents
.xlsx Excel workbook with one sheet per result type (extraction, counts, exclusions, mismatch, parameters)
.json Full list with all intermediate objects (JSON format)
.rds Full nested list with all intermediate objects (R-native, used for caching)

Vignettes

Detailed documentation is available in six vignettes:

Contributing

Bug reports and feature requests: https://github.com/hebstr/edstr/issues

Source code: https://github.com/hebstr/edstr

License

GPL-3