Date: 2026-04-09
Milestone: 4 (Real Data Pipelines), remaining Important items
Status: Design approved, not started
Three related parser/recognizer features that improve handling of real R scripts without adding new pipeline node types or executors:

1. Ignorable code: silently skip library(), set.seed(), options(), etc. instead of emitting UnsupportedNode. Track loaded packages for disambiguation.
2. Alias resolution: map namespace-qualified calls (dplyr::filter), spelling variants (summarize), base R equivalents (read.delim, read.table), and fread() to canonical recognizer names via a data-driven lookup table.
3. Disambiguation: resolve ambiguous bare function names using the loaded packages (filter → dplyr vs stats).

No new pipeline node types. No mapper changes. No executor changes. This is entirely a parser/recognizer layer feature.
Before the normal AST walk, recognizeR() scans top-level statements for library() and require() calls. These are already keyword tokens in the lexer.

Behavior:

1. Extract the package name (first positional arg as string or identifier).
2. Add it to a Set<string> called loadedPackages.
3. library(tidyverse) expands to the tidyverse core set: dplyr, tidyr, ggplot2, readr, purrr, tibble, stringr, forcats, lubridate.
4. Mark the statement as consumed: no UnsupportedNode, no diagnostic.
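The pre-pass could look roughly like the sketch below. The Statement and Arg shapes and the collectLoadedPackages name are hypothetical stand-ins for the real AST types and helper, and the sketch assumes library()/require() reach the recognizer as ordinary call statements. It adds only the nine core packages for tidyverse, not the name "tidyverse" itself, which is an assumption the spec does not settle.

```typescript
// Hypothetical minimal AST shape; the real parser's node types may differ.
type Arg = { kind: "string" | "identifier"; value: string };
type Statement =
  | { type: "FunctionCall"; name: string; args: Arg[] }
  | { type: "Other" };

// Core tidyverse set as listed in this spec.
const TIDYVERSE_CORE: string[] = [
  "dplyr", "tidyr", "ggplot2", "readr", "purrr",
  "tibble", "stringr", "forcats", "lubridate",
];

// Scans top-level statements, records loaded packages, and returns the
// statements that were NOT consumed so the normal AST walk can handle them.
function collectLoadedPackages(statements: Statement[]): {
  loadedPackages: Set<string>;
  remaining: Statement[];
} {
  const loadedPackages = new Set<string>();
  const remaining: Statement[] = [];

  for (const stmt of statements) {
    if (
      stmt.type === "FunctionCall" &&
      (stmt.name === "library" || stmt.name === "require")
    ) {
      const first = stmt.args[0];
      if (first !== undefined) {
        if (first.value === "tidyverse") {
          TIDYVERSE_CORE.forEach((pkg) => loadedPackages.add(pkg));
        } else {
          loadedPackages.add(first.value);
        }
        continue; // consumed: no UnsupportedNode, no diagnostic
      }
    }
    remaining.push(stmt);
  }

  return { loadedPackages, remaining };
}
```

A library() call with no arguments is left in remaining here; whether it should be consumed silently as well is an open choice.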
A static set of function names that return null from
recognizeFunctionCall with no diagnostic
(silent skip):
| Function | Why ignorable | Notes |
|---|---|---|
| set.seed() | RNG seed; no effect on pipeline | Permanently ignorable |
| options() | Global R options | Permanently ignorable |
| rm() / remove() | Object cleanup | Permanently ignorable |
| print() / cat() / message() / warning() | Console output | Permanently ignorable |
| source() | Script inclusion; can't follow | Permanently ignorable |
| ggplot() + ggsave() | Visualization | Temporary: future visualization node. Note in code comment. |
| View() / head() / tail() / str() / glimpse() | Inspection | Permanently ignorable |
| install.packages() | Package management | Permanently ignorable |
| Sys.time() / proc.time() / system.time() | Timing | Permanently ignorable |
| stopifnot() / assert_that() | Assertions | Permanently ignorable |
| dir.create() / file.path() | File system | Permanently ignorable |
| suppressWarnings() / suppressMessages() | Warning suppression | Permanently ignorable |
| invisible() | Return value suppression | Permanently ignorable |
| setwd() / getwd() | Working directory | Permanently ignorable |
| theme_set() / theme() | ggplot theming | Temporary: moves with ggplot support |
| scale_color_*() / scale_fill_*() etc. | ggplot scales | Temporary: handled via scale_ prefix check, not exhaustive list |
Implementation: a Set<string> checked at the top
of recognizeFunctionCall, before the alias lookup and
switch.
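A minimal sketch of that check, assuming a hypothetical helper named isIgnorable that recognizeFunctionCall would call first. The set contents mirror the table above; the regular expression is one possible way to do the scale_ prefix check.

```typescript
// Names taken from the ignorable-function table in this spec.
const IGNORABLE_FUNCTIONS = new Set<string>([
  "set.seed", "options", "rm", "remove",
  "print", "cat", "message", "warning",
  "source",
  "ggplot", "ggsave", // temporary: future visualization node
  "View", "head", "tail", "str", "glimpse",
  "install.packages",
  "Sys.time", "proc.time", "system.time",
  "stopifnot", "assert_that",
  "dir.create", "file.path",
  "suppressWarnings", "suppressMessages",
  "invisible",
  "setwd", "getwd",
  "theme_set", "theme", // temporary: moves with ggplot support
]);

// Hypothetical helper: true when the call should be skipped silently.
function isIgnorable(functionName: string): boolean {
  if (IGNORABLE_FUNCTIONS.has(functionName)) {
    return true;
  }
  // ggplot scales are matched by prefix rather than listed exhaustively.
  return /^scale_/.test(functionName);
}
```

In recognizeFunctionCall this would translate to an early `return null` when isIgnorable(node.name) is true, before the alias lookup runs.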
loadedPackages propagation

The loadedPackages set is built during the pre-pass and
threaded through to the alias resolution and disambiguation logic
(Section 3). It is returned alongside the AnalysisCall[]
from recognizeR() for potential future use by the mapper or
UI (e.g., showing which packages the script expects).
A single Map<string, AliasEntry> mapping variant
function names to canonical recognizer names.
```typescript
type AliasEntry = {
  canonical: string; // the name the recognizer switch expects
  package?: string;  // originating package (for diagnostics/disambiguation)
};

declare const FUNCTION_ALIASES: Map<string, AliasEntry>;
```

In recognizeFunctionCall, before the existing switch:

1. Look up node.name in FUNCTION_ALIASES.
2. If found, use entry.canonical as the switch key and stash the original name in args.originalFunction.
3. If not found, use node.name as-is (existing behavior).

Namespace-qualified variants (every supported function gets pkg::fn entries):
| Alias | Canonical | Package |
|---|---|---|
| dplyr::filter | filter | dplyr |
| dplyr::mutate | mutate | dplyr |
| dplyr::select | select | dplyr |
| dplyr::summarise | summarise | dplyr |
| dplyr::summarize | summarise | dplyr |
| dplyr::arrange | arrange | dplyr |
| dplyr::rename | rename | dplyr |
| dplyr::group_by | group_by | dplyr |
| dplyr::left_join | left_join | dplyr |
| dplyr::inner_join | inner_join | dplyr |
| dplyr::full_join | full_join | dplyr |
| dplyr::right_join | right_join | dplyr |
| dplyr::anti_join | anti_join | dplyr |
| dplyr::semi_join | semi_join | dplyr |
| dplyr::bind_rows | bind_rows | dplyr |
| readr::read_csv | read_csv | readr |
| haven::read_dta | read_dta | haven |
| fixest::feols | feols | fixest |
| fixest::fepois | fepois | fixest |
| fixest::feglm | feglm | fixest |
| fixest::fenegbin | fenegbin | fixest |
| fixest::etable | etable | fixest |
| lfe::felm | felm | lfe |
| AER::ivreg | ivreg | AER |
| estimatr::lm_robust | lm | estimatr |
| texreg::screenreg | screenreg | texreg |
| texreg::htmlreg | htmlreg | texreg |
| stargazer::stargazer | stargazer | stargazer |
| modelsummary::modelsummary | modelsummary | modelsummary |
| sandwich::vcovHC | vcovHC | sandwich |
| sandwich::vcovCL | vcovCL | sandwich |
| lmtest::coeftest | coeftest | lmtest |
| data.table::fread | read.csv | data.table |
Base R equivalents:
| Alias | Canonical | Notes |
|---|---|---|
| read.delim | read.csv | Tab-separated; same data-load node for now |
| read.table | read.csv | General delimited; same data-load node for now |
| read.csv2 | read.csv | European CSV (; separator); same node for now |
| read_csv2 | read_csv | readr European CSV |
| read_tsv | read_csv | readr tab-separated |
| read_delim | read_csv | readr general delimited |
| fread | read.csv | data.table CSV reader |
| read_stata | read_dta | haven alias |
Future extension: As new recognizer patterns are added, their namespace-qualified forms are added as rows in this table. The table is the single source of truth for “what names map to what”.
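As an illustration of the lookup, the sketch below populates the map with a handful of rows from the tables and resolves a name to its switch key. The resolveAlias helper and its return shape are hypothetical; the real code would do this inline in recognizeFunctionCall and hold every row.

```typescript
type AliasEntry = {
  canonical: string; // the name the recognizer switch expects
  package?: string;  // originating package (for diagnostics/disambiguation)
};

// A few rows from the alias tables; the full map would contain all of them.
const FUNCTION_ALIASES = new Map<string, AliasEntry>([
  ["dplyr::filter", { canonical: "filter", package: "dplyr" }],
  ["dplyr::summarize", { canonical: "summarise", package: "dplyr" }],
  ["estimatr::lm_robust", { canonical: "lm", package: "estimatr" }],
  ["data.table::fread", { canonical: "read.csv", package: "data.table" }],
  ["fread", { canonical: "read.csv" }],
  ["read.delim", { canonical: "read.csv" }],
  ["read_tsv", { canonical: "read_csv" }],
]);

// Hypothetical helper: returns the key to switch on, plus the original
// name when the alias table rewrote it.
function resolveAlias(name: string): { key: string; originalFunction?: string } {
  const entry = FUNCTION_ALIASES.get(name);
  if (entry === undefined) {
    return { key: name }; // not an alias: existing behavior
  }
  return { key: entry.canonical, originalFunction: name };
}
```

With this shape, resolveAlias("fread") yields the key "read.csv" while keeping "fread" available for diagnostics, and an unlisted name such as "mutate" passes through unchanged.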
The data-load node currently only extracts the
file argument. Future work should extend it with optional
sep, header, select (column
subset), and nrows arguments, extracted from the aliased
function’s args. This enables correct handling of tab-separated and
fixed-width files.
Some bare function names are ambiguous across R packages. The primary
case: filter() is dplyr::filter (row
subsetting) or stats::filter (time series convolution).
```typescript
type AmbiguousResolution = {
  package: string;
  canonical: string | null; // null = unsupported in our system
};

declare const AMBIGUOUS_FUNCTIONS: Map<string, AmbiguousResolution[]>;
```

Known ambiguous functions:
| Function | dplyr (supported) | Conflict package (unsupported) |
|---|---|---|
| filter | dplyr::filter → filter | stats::filter |
| select | dplyr::select → select | MASS::select |
| lag | dplyr::lag (future) | stats::lag |
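For a call to an ambiguous function, the rules are checked in order:

1. Explicit namespace to the supported package (e.g., dplyr::filter()): resolved by alias table, no ambiguity.
2. Explicit namespace to the conflict package (e.g., stats::filter): resolve to unsupported, emit diagnostic.
3. loadedPackages contains the supported package (dplyr): resolve to dplyr.
4. loadedPackages contains the conflict package but NOT dplyr: resolve to unsupported, emit diagnostic.
5. Neither package is in loadedPackages: default to dplyr, emit diagnostic.

Rule 5 defaults to dplyr because in applied econ scripts, bare filter() is overwhelmingly dplyr. A false positive (treating stats::filter as dplyr's) would produce a visible error at execution time (wrong args), while a false negative (marking dplyr's filter as unsupported) silently drops a pipeline step. The visible failure is the safer of the two.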
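The bare-call part of this logic (supported package loaded, only the conflict package loaded, or neither) might be sketched as follows. The resolveBareCall name, the Resolution return type, and the diagnostic wording are all hypothetical. The sketch also assumes that the first supported entry in each list is the default when nothing relevant is loaded.

```typescript
type AmbiguousResolution = {
  package: string;
  canonical: string | null; // null = unsupported in our system
};

// Hypothetical result shape for this sketch.
type Resolution = { canonical: string | null; diagnostic?: string };

const AMBIGUOUS_FUNCTIONS = new Map<string, AmbiguousResolution[]>([
  ["filter", [
    { package: "dplyr", canonical: "filter" },
    { package: "stats", canonical: null },
  ]],
  ["select", [
    { package: "dplyr", canonical: "select" },
    { package: "MASS", canonical: null },
  ]],
]);

// Returns undefined when the name is not ambiguous at all.
function resolveBareCall(
  name: string,
  loadedPackages: Set<string>,
): Resolution | undefined {
  const candidates = AMBIGUOUS_FUNCTIONS.get(name);
  if (candidates === undefined) return undefined;

  const supported = candidates.filter((c) => c.canonical !== null);
  const conflicts = candidates.filter((c) => c.canonical === null);

  // Supported package is loaded: resolve to it, no diagnostic.
  for (const c of supported) {
    if (loadedPackages.has(c.package)) return { canonical: c.canonical };
  }
  // Only a conflict package is loaded: unsupported, with diagnostic.
  for (const c of conflicts) {
    if (loadedPackages.has(c.package)) {
      return {
        canonical: null,
        diagnostic: `${name}() resolves to ${c.package}::${name}, which is unsupported`,
      };
    }
  }
  // Nothing relevant loaded: default to the supported package, but say so.
  const fallback = supported[0];
  if (fallback === undefined) {
    return { canonical: null, diagnostic: `${name}() has no supported resolution` };
  }
  return {
    canonical: fallback.canonical,
    diagnostic: `${name}() is ambiguous; assuming ${fallback.package}::${name}`,
  };
}
```

Explicitly qualified calls never reach this function, since the alias table (or its absence of an entry) settles them first.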
Namespace-qualified calls (pkg::fn)

Add a DoubleColon token with pattern /::/ in the multi-char operators section, at higher priority than the existing Colon (:) token. Token ordering in the allTokens array:

```typescript
DoubleColon, // :: before :
Colon,       // : (sequence operator, interaction in formulas)
```
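To show why the ordering matters, here is a small hand-rolled first-match tokenizer. It is only a stand-in for the real lexer: the TokenDef type, the patterns other than /::/, and the tokenize function are assumptions made for the sketch, and it assumes the real lexer also gives priority to earlier entries in the array.

```typescript
type TokenDef = { name: string; pattern: RegExp };

// Order matters: the first definition that matches at the current position wins.
const allTokens: TokenDef[] = [
  { name: "DoubleColon", pattern: /^::/ }, // :: before :
  { name: "Colon", pattern: /^:/ },
  { name: "LParen", pattern: /^\(/ },
  { name: "RParen", pattern: /^\)/ },
  { name: "Identifier", pattern: /^[A-Za-z.][A-Za-z0-9._]*/ },
  { name: "Number", pattern: /^[0-9]+/ },
];

// Returns the sequence of token names for the input, skipping whitespace.
function tokenize(input: string, defs: TokenDef[] = allTokens): string[] {
  const names: string[] = [];
  let rest = input;
  while (rest.length > 0) {
    const ws = /^\s+/.exec(rest);
    if (ws) {
      rest = rest.slice(ws[0].length);
      continue;
    }
    let matched = false;
    for (const def of defs) {
      const m = def.pattern.exec(rest);
      if (m) {
        names.push(def.name);
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new Error(`Unexpected character: ${rest[0]}`);
    }
  }
  return names;
}
```

With this ordering, `dplyr::filter()` produces a single DoubleColon between the two identifiers and `1:10` still produces a Colon. If Colon were listed first, `::` would come out as two Colon tokens.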
Parse Identifier :: Identifier ( args ) as a
FunctionCallNode with name set to
"pkg::fn". This avoids adding a new AST node type.
Specifically, in the expression parsing rule where
Identifier followed by LParen produces a
FunctionCallNode: also check for
Identifier DoubleColon Identifier LParen and concatenate
the name as "${pkg}::${fn}".
The :: is not a general binary operator in our parser —
we only need it in the function-call position. This keeps the parser
change minimal.
No new node type. FunctionCallNode.name can now contain
:: (e.g., "dplyr::filter"). The alias table
normalizes this before the recognizer switch.
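The call-head check could be sketched like this. The Token shape and the parseCallName helper are hypothetical and work on a flat token array for illustration; the actual parser rule would express the same two alternatives in its own grammar style.

```typescript
// Hypothetical token shape: `image` is the matched source text.
type Token = {
  type: "Identifier" | "DoubleColon" | "LParen" | "Other";
  image: string;
};

// If a function call starts at `pos`, returns its name and the index of the
// opening parenthesis; otherwise returns null.
function parseCallName(
  tokens: Token[],
  pos: number,
): { name: string; lparenIndex: number } | null {
  const t0 = tokens[pos];
  if (t0 === undefined || t0.type !== "Identifier") return null;

  const t1 = tokens[pos + 1];
  // Plain call: Identifier LParen
  if (t1 !== undefined && t1.type === "LParen") {
    return { name: t0.image, lparenIndex: pos + 1 };
  }

  // Namespace-qualified call: Identifier DoubleColon Identifier LParen
  const t2 = tokens[pos + 2];
  const t3 = tokens[pos + 3];
  if (
    t1 !== undefined && t1.type === "DoubleColon" &&
    t2 !== undefined && t2.type === "Identifier" &&
    t3 !== undefined && t3.type === "LParen"
  ) {
    return { name: `${t0.image}::${t2.image}`, lparenIndex: pos + 3 };
  }

  return null;
}
```

Because `pkg::fn` without a following parenthesis returns null here, `::` stays out of general expression parsing, matching the minimal-change goal above.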
Add an optional field to AnalysisCall:
```typescript
interface AnalysisCall {
  // ... existing fields ...
  originalFunction?: string; // stashed when alias table rewrites the name
}
```

This preserves the original function name for diagnostics and UI display (e.g., showing “fread” in the node label even though it’s recognized as data-load).
| Area | Tests |
|---|---|
| Lexer | DoubleColon token produced for ::, Colon still works for : in formulas |
| Parser | dplyr::filter(x > 0) → FunctionCallNode with name: "dplyr::filter" |
| Ignorable code | library(dplyr) → no node, no diagnostic. set.seed(42) → no node, no diagnostic. ggplot(...) → no node. Verify these don’t appear in pipeline. |
| Loaded packages | library(dplyr) adds “dplyr” to set. library(tidyverse) expands to core packages. require(haven) adds “haven”. |
| Alias resolution | dplyr::filter(x > 0) → same result as filter(x > 0). fread("data.csv") → data-load node. read.delim("data.tsv") → data-load node. |
| Disambiguation | Bare filter() with library(dplyr) → recognized. Bare filter() without library → recognized with diagnostic. stats::filter() → unsupported with diagnostic. |
| fread | fread("file.csv") → data-load node. data.table::fread("file.csv") → data-load node. |
| Integration | Full script with library(dplyr); df <- read_csv("a.csv"); df %>% filter(x > 0) %>% mutate(y = log(x)): library consumed, pipe chain works, no spurious unsupported nodes. |
These features don’t change the UI — they affect what gets recognized
from code input. Existing E2E tests should continue to pass. A new E2E
test with library() + dplyr::filter() in the
code editor would be valuable but not required for this spec.
| File | Change |
|---|---|
| src/core/parsers/r/tokens.ts | Add DoubleColon token |
| src/core/parsers/r/parser.ts | Parse pkg::fn() → FunctionCallNode with composite name |
| src/core/parsers/r/recognizer.ts | Ignorable set, alias table, disambiguation table, loadedPackages pre-pass, alias resolution before switch |
| src/core/parsers/shared/analysis-call.ts | Add optional originalFunction field |
| src/core/parsers/r/recognizer.test.ts | Tests for all new recognition patterns |
| src/core/parsers/r/lexer.test.ts | DoubleColon token tests |
| src/core/parsers/r/parser.test.ts | Namespace-qualified call parsing tests |
| src/core/pipeline/integration.test.ts | Integration test with library + aliases |
- data.table DT[i, j, by] syntax: deferred to separate spec
- across() / pivot_longer(): nice-to-have, not in this spec
- data-load stays file-arg-only for now; sep/header/select/nrows are a future extension