M4 Remaining Important: Ignorable Code, Function Aliases, fread

Date: 2026-04-09
Milestone: 4 (Real Data Pipelines), remaining Important items
Status: Design approved, not started

Overview

Three related parser/recognizer features that improve handling of real R scripts without adding new pipeline node types or executors:

  1. Ignorable code — silently skip library(), set.seed(), options(), etc. instead of emitting UnsupportedNode. Track loaded packages for disambiguation.
  2. Function alias table — map namespace-qualified calls (dplyr::filter), spelling variants (summarize), base R equivalents (read.delim, read.table), and fread() to canonical recognizer names via a data-driven lookup table.
  3. Function disambiguation — use loaded packages set + pipe context to resolve ambiguous bare function names (filter → dplyr vs stats).

No new pipeline node types. No mapper changes. No executor changes. This is entirely a parser/recognizer layer feature.


1. Ignorable Code + Loaded Packages

Pre-pass: library/require

Before the normal AST walk, recognizeR() scans top-level statements for library() and require() calls. These are already keyword tokens in the lexer.

Behavior:

  1. Extract the package name (first positional arg as string or identifier)
  2. Add it to a Set<string> called loadedPackages
  3. library(tidyverse) expands to the tidyverse core set: dplyr, tidyr, ggplot2, readr, purrr, tibble, stringr, forcats, lubridate
  4. Mark the statement as consumed: no UnsupportedNode, no diagnostic
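The pre-pass behavior above can be sketched roughly as follows. This is a minimal illustration, not the recognizer's actual code: the Stmt shape, collectLoadedPackages, and TIDYVERSE_CORE are hypothetical names, and the package argument is assumed to have already been reduced to a plain string (quotes stripped).

```typescript
// Hypothetical, simplified statement shape; the real AST nodes will differ.
type Stmt = { kind: "call"; fn: string; args: string[] } | { kind: "other" };

// Core set that library(tidyverse) expands to, as listed in the spec.
const TIDYVERSE_CORE: string[] = [
  "dplyr", "tidyr", "ggplot2", "readr", "purrr",
  "tibble", "stringr", "forcats", "lubridate",
];

// Scans top-level statements, records loaded packages, and returns the
// statements that were not consumed by the pre-pass.
function collectLoadedPackages(
  stmts: Stmt[],
): { loadedPackages: Set<string>; remaining: Stmt[] } {
  const loadedPackages = new Set<string>();
  const remaining: Stmt[] = [];
  for (const stmt of stmts) {
    // First positional argument of library()/require(), if this is such a call.
    const pkg =
      stmt.kind === "call" && (stmt.fn === "library" || stmt.fn === "require")
        ? stmt.args[0]
        : undefined;
    if (pkg === undefined) {
      remaining.push(stmt); // not a package load: leave for the normal AST walk
      continue;
    }
    if (pkg === "tidyverse") {
      TIDYVERSE_CORE.forEach((p) => loadedPackages.add(p));
    } else {
      loadedPackages.add(pkg);
    }
    // Consumed: nothing is pushed, so no UnsupportedNode and no diagnostic.
  }
  return { loadedPackages, remaining };
}
```

In this sketch a script containing library(tidyverse) and require(haven) yields a set of ten package names, and only the non-load statements are passed on to the normal walk.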

Ignorable function list

A static set of function names for which recognizeFunctionCall returns null with no diagnostic (silent skip):

| Function | Why ignorable | Notes |
|---|---|---|
| set.seed() | RNG seed; no effect on pipeline | Permanently ignorable |
| options() | Global R options | Permanently ignorable |
| rm() / remove() | Object cleanup | Permanently ignorable |
| print() / cat() / message() / warning() | Console output | Permanently ignorable |
| source() | Script inclusion; cannot be followed | Permanently ignorable |
| ggplot() + ggsave() | Visualization | Temporary: future visualization node. Note in code comment. |
| View() / head() / tail() / str() / glimpse() | Inspection | Permanently ignorable |
| install.packages() | Package management | Permanently ignorable |
| Sys.time() / proc.time() / system.time() | Timing | Permanently ignorable |
| stopifnot() / assert_that() | Assertions | Permanently ignorable |
| dir.create() / file.path() | File system | Permanently ignorable |
| suppressWarnings() / suppressMessages() | Warning suppression | Permanently ignorable |
| invisible() | Return value suppression | Permanently ignorable |
| setwd() / getwd() | Working directory | Permanently ignorable |
| theme_set() / theme() | ggplot theming | Temporary: moves with ggplot support |
| scale_color_*() / scale_fill_*() etc. | ggplot scales | Temporary: handled via scale_ prefix check, not exhaustive list |

Implementation: a Set<string> checked at the top of recognizeFunctionCall, before the alias lookup and switch.
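A minimal sketch of that check, assuming a hypothetical helper named isIgnorable and showing only a subset of the list above. The scale_ prefix test mirrors the note in the table about not enumerating every ggplot scale function.

```typescript
// Subset of the ignorable list, for illustration; the full set would hold every row.
const IGNORABLE_FUNCTIONS = new Set<string>([
  "set.seed", "options", "rm", "remove", "print", "cat", "message",
  "warning", "source", "ggplot", "ggsave", "install.packages",
]);

// Hypothetical helper: true when a call should be skipped silently.
function isIgnorable(fnName: string): boolean {
  // ggplot scales are matched by prefix rather than listed exhaustively.
  return IGNORABLE_FUNCTIONS.has(fnName) || fnName.startsWith("scale_");
}
```

A recognizer following this sketch would return null as soon as isIgnorable(node.name) is true, before consulting the alias table.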

loadedPackages propagation

The loadedPackages set is built during the pre-pass and threaded through to the alias resolution and disambiguation logic (Section 3). It is returned alongside the AnalysisCall[] from recognizeR() for potential future use by the mapper or UI (e.g., showing which packages the script expects).


2. Function Alias Table

A single Map<string, AliasEntry> mapping variant function names to canonical recognizer names.

Data structure

type AliasEntry = {
  canonical: string;   // the name the recognizer switch expects
  package?: string;    // originating package (for diagnostics/disambiguation)
};

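const FUNCTION_ALIASES = new Map<string, AliasEntry>([
  ["dplyr::filter", { canonical: "filter", package: "dplyr" }],
  // ... one entry per row of the alias tables below
]);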

Resolution flow

In recognizeFunctionCall, before the existing switch:

  1. Look up node.name in FUNCTION_ALIASES
  2. If found → use entry.canonical as the switch key, stash original name in args.originalFunction
  3. If not found → use node.name as-is (existing behavior)
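The three steps above might look roughly like this. It is a sketch under assumptions: resolveAlias and SAMPLE_ALIASES are hypothetical names, the map holds only a few rows taken from the alias tables in this section, and the original name is returned as a plain field rather than stored in any particular place.

```typescript
type AliasEntry = { canonical: string; package?: string };

// A few rows from the alias tables, for illustration only.
const SAMPLE_ALIASES = new Map<string, AliasEntry>([
  ["dplyr::summarize", { canonical: "summarise", package: "dplyr" }],
  ["data.table::fread", { canonical: "read.csv", package: "data.table" }],
  ["read.delim", { canonical: "read.csv" }],
]);

// Hypothetical helper: returns the switch key, plus the original name
// when an alias was applied.
function resolveAlias(fnName: string): { key: string; originalFunction?: string } {
  const entry = SAMPLE_ALIASES.get(fnName);
  if (entry === undefined) {
    return { key: fnName }; // not an alias: existing behavior
  }
  return { key: entry.canonical, originalFunction: fnName };
}
```

Because the lookup is a single Map.get, adding support for a new spelling is a data change only; the switch itself does not grow.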

Alias categories

Namespace-qualified variants — every supported function gets pkg::fn entries:

| Alias | Canonical | Package |
|---|---|---|
| dplyr::filter | filter | dplyr |
| dplyr::mutate | mutate | dplyr |
| dplyr::select | select | dplyr |
| dplyr::summarise | summarise | dplyr |
| dplyr::summarize | summarise | dplyr |
| dplyr::arrange | arrange | dplyr |
| dplyr::rename | rename | dplyr |
| dplyr::group_by | group_by | dplyr |
| dplyr::left_join | left_join | dplyr |
| dplyr::inner_join | inner_join | dplyr |
| dplyr::full_join | full_join | dplyr |
| dplyr::right_join | right_join | dplyr |
| dplyr::anti_join | anti_join | dplyr |
| dplyr::semi_join | semi_join | dplyr |
| dplyr::bind_rows | bind_rows | dplyr |
| readr::read_csv | read_csv | readr |
| haven::read_dta | read_dta | haven |
| fixest::feols | feols | fixest |
| fixest::fepois | fepois | fixest |
| fixest::feglm | feglm | fixest |
| fixest::fenegbin | fenegbin | fixest |
| fixest::etable | etable | fixest |
| lfe::felm | felm | lfe |
| AER::ivreg | ivreg | AER |
| estimatr::lm_robust | lm | estimatr |
| texreg::screenreg | screenreg | texreg |
| texreg::htmlreg | htmlreg | texreg |
| stargazer::stargazer | stargazer | stargazer |
| modelsummary::modelsummary | modelsummary | modelsummary |
| sandwich::vcovHC | vcovHC | sandwich |
| sandwich::vcovCL | vcovCL | sandwich |
| lmtest::coeftest | coeftest | lmtest |
| data.table::fread | read.csv | data.table |

Base R equivalents and related reader aliases:

| Alias | Canonical | Notes |
|---|---|---|
| read.delim | read.csv | Tab-separated; same data-load node for now |
| read.table | read.csv | General delimited; same data-load node for now |
| read.csv2 | read.csv | European CSV (; separator); same node for now |
| read_csv2 | read_csv | readr European CSV |
| read_tsv | read_csv | readr tab-separated |
| read_delim | read_csv | readr general delimited |
| fread | read.csv | data.table CSV reader |
| read_stata | read_dta | haven alias |

Future extension: As new recognizer patterns are added, their namespace-qualified forms are added as rows in this table. The table is the single source of truth for “what names map to what”.

Future: data-load args

The data-load node currently only extracts the file argument. Future work should extend it with optional sep, header, select (column subset), and nrows arguments, extracted from the aliased function’s args. This enables correct handling of tab-separated and other delimited files.
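One reason this matters: the aliased readers have different default separators in R, so a future sep argument would probably need a per-function default when the script does not pass one. The sketch below illustrates the idea only. DEFAULT_SEP and effectiveSep are hypothetical names, and the "whitespace" and "auto" labels are placeholders for read.table's whitespace splitting and fread's automatic detection, not values this spec defines.

```typescript
// Default separators of the underlying R readers, keyed by the name the
// script actually used (the pre-alias name).
const DEFAULT_SEP: Record<string, string> = {
  "read.csv": ",",
  "read.csv2": ";",
  "read.delim": "\t",
  "read.table": "whitespace", // placeholder label: splits on runs of whitespace
  "read_csv": ",",
  "read_csv2": ";",
  "read_tsv": "\t",
  "fread": "auto", // placeholder label: separator is detected from the file
};

// Hypothetical helper: an explicit sep from the call wins, otherwise fall
// back to the reader's default; undefined when the reader has no separator.
function effectiveSep(originalFunction: string, explicitSep?: string): string | undefined {
  return explicitSep ?? DEFAULT_SEP[originalFunction];
}
```

This is also where the stashed original function name would become useful, since the canonical name alone (read.csv) no longer distinguishes a tab-separated file from a comma-separated one.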


3. Function Disambiguation

Problem

Some bare function names are ambiguous across R packages. The primary case: filter() can be either dplyr::filter (row subsetting) or stats::filter (time series convolution).

Ambiguous function table

type AmbiguousResolution = {
  package: string;
  canonical: string | null;  // null = unsupported in our system
};

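const AMBIGUOUS_FUNCTIONS = new Map<string, AmbiguousResolution[]>([
  ["filter", [
    { package: "dplyr", canonical: "filter" },
    { package: "stats", canonical: null },
  ]],
  // ... select and lag follow the same shape
]);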

Known ambiguous functions:

| Function | dplyr (supported) | Conflict package (unsupported) |
|---|---|---|
| filter | dplyr::filter → filter | stats::filter |
| select | dplyr::select → select | MASS::select |
| lag | dplyr::lag (future) | stats::lag |

Resolution priority

For a bare call to an ambiguous function:

  1. Namespace-qualified (dplyr::filter()) — resolved by alias table, no ambiguity
  2. In a pipe chain — must be the dplyr version (you can’t pipe into stats::filter)
  3. loadedPackages contains the supported package (dplyr) — resolve to dplyr
  4. loadedPackages contains the conflict package but NOT dplyr — resolve to unsupported, emit diagnostic
  5. Neither loaded, standalone call → default to dplyr with an informational diagnostic: “Assuming dplyr::filter(); add library(dplyr) to clarify”

Rule 5 defaults to dplyr because in applied econ scripts, bare filter() is overwhelmingly dplyr. A false positive (treating stats::filter as dplyr’s) would produce a visible error at execution time (wrong args), while a false negative (marking dplyr’s filter as unsupported) silently drops a pipeline step.
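The priority order above can be expressed as a short chain of checks. The following is a sketch, not the planned implementation: resolveAmbiguous, the Resolution type, and the wording of the rule 4 diagnostic are assumptions, and the supported and conflict package names are passed in directly rather than looked up in AMBIGUOUS_FUNCTIONS.

```typescript
type Resolution =
  | { status: "supported"; canonical: string; diagnostic?: string }
  | { status: "unsupported"; diagnostic: string };

// Hypothetical inputs: the bare name, its supported and conflicting packages,
// whether the call sits in a pipe chain, and the set built by the pre-pass.
function resolveAmbiguous(
  fn: string,
  supportedPkg: string,
  conflictPkg: string,
  inPipe: boolean,
  loadedPackages: Set<string>,
): Resolution {
  // Rule 1 (pkg::fn) never reaches this point: the alias table handles it.
  if (inPipe) {
    return { status: "supported", canonical: fn }; // rule 2
  }
  if (loadedPackages.has(supportedPkg)) {
    return { status: "supported", canonical: fn }; // rule 3
  }
  if (loadedPackages.has(conflictPkg)) {
    // Rule 4: only the conflicting package is loaded.
    return {
      status: "unsupported",
      diagnostic: `${fn}() resolves to ${conflictPkg}::${fn}, which is not supported`,
    };
  }
  // Rule 5: nothing loaded, standalone call; assume the supported package.
  return {
    status: "supported",
    canonical: fn,
    diagnostic: `Assuming ${supportedPkg}::${fn}(); add library(${supportedPkg}) to clarify`,
  };
}
```

Ordering the checks this way means a pipe context wins even when only stats is loaded, which matches rule 2 taking precedence over rule 4.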


4. Lexer + Parser: Namespace-Qualified Calls (pkg::fn)

Lexer

Add a DoubleColon token with pattern /::/ in the multi-char operators section, higher priority than the existing Colon (:) token. Token ordering in the allTokens array:

DoubleColon,   // :: before :
Colon,         // : (sequence operator, interaction in formulas)
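To illustrate why the ordering matters, here is a toy first-match tokenizer. It is hand-rolled for this example and does not use the project's lexer; TOKEN_PATTERNS and tokenize are hypothetical names, and characters outside the toy grammar are simply skipped.

```typescript
// Hypothetical token definitions; order matters because the first match wins.
const TOKEN_PATTERNS: Array<[string, RegExp]> = [
  ["DoubleColon", /^::/], // must come before Colon
  ["Colon", /^:/],
  ["Identifier", /^[A-Za-z.][A-Za-z0-9._]*/],
  ["LParen", /^\(/],
  ["RParen", /^\)/],
];

// Returns the sequence of token type names found in src.
function tokenize(src: string): string[] {
  const out: string[] = [];
  let rest = src;
  while (rest.length > 0) {
    let matched = false;
    for (const [tokenType, pattern] of TOKEN_PATTERNS) {
      const m = pattern.exec(rest);
      if (m !== null) {
        out.push(tokenType);
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) {
      rest = rest.slice(1); // skip characters this toy grammar does not cover
    }
  }
  return out;
}
```

With this ordering, dplyr::filter(x) tokenizes as Identifier, DoubleColon, Identifier, LParen, Identifier, RParen, while a:b in a formula still produces a single Colon. If Colon were listed first, :: would instead come out as two Colon tokens.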

Parser

Parse Identifier :: Identifier ( args ) as a FunctionCallNode with name set to "pkg::fn". This avoids adding a new AST node type.

Specifically, in the expression parsing rule where Identifier followed by LParen produces a FunctionCallNode: also check for Identifier DoubleColon Identifier LParen and concatenate the name as "${pkg}::${fn}".

The :: is not a general binary operator in our parser — we only need it in the function-call position. This keeps the parser change minimal.
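The function-call-position check described above could be sketched as a lookahead over the token stream. This is illustrative only: Tok and parseCallName are hypothetical, and a real parser rule would also go on to consume the arguments.

```typescript
type Tok = {
  type: "Identifier" | "DoubleColon" | "LParen" | "Other";
  text: string;
};

// Hypothetical helper: returns the function name if the tokens at pos start
// a call, either fn( or pkg::fn(, and null otherwise.
function parseCallName(tokens: Tok[], pos: number): string | null {
  const t0 = tokens[pos];
  const t1 = tokens[pos + 1];
  const t2 = tokens[pos + 2];
  const t3 = tokens[pos + 3];
  if (t0 === undefined || t0.type !== "Identifier" || t1 === undefined) {
    return null;
  }
  if (t1.type === "LParen") {
    return t0.text; // plain call: fn(
  }
  if (
    t1.type === "DoubleColon" &&
    t2 !== undefined && t2.type === "Identifier" &&
    t3 !== undefined && t3.type === "LParen"
  ) {
    return `${t0.text}::${t2.text}`; // namespace-qualified call: pkg::fn(
  }
  return null; // e.g. pkg::fn without a call; not handled by this rule
}
```

Because the composite name is just a string, downstream code sees an ordinary FunctionCallNode and the alias table is the only place that needs to know about the :: form.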

AST

No new node type. FunctionCallNode.name can now contain :: (e.g., "dplyr::filter"). The alias table normalizes this before the recognizer switch.


5. AnalysisCall Changes

Add an optional field to AnalysisCall:

interface AnalysisCall {
  // ... existing fields ...
  originalFunction?: string;  // stashed when alias table rewrites the name
}

This preserves the original function name for diagnostics and UI display (e.g., showing “fread” in the node label even though it’s recognized as data-load).


6. Testing Strategy

Unit tests

| Area | Tests |
|---|---|
| Lexer | DoubleColon token produced for ::, Colon still works for : in formulas |
| Parser | dplyr::filter(x > 0) → FunctionCallNode with name: "dplyr::filter" |
| Ignorable code | library(dplyr) → no node, no diagnostic. set.seed(42) → no node, no diagnostic. ggplot(...) → no node. Verify these don’t appear in pipeline. |
| Loaded packages | library(dplyr) adds “dplyr” to set. library(tidyverse) expands to core packages. require(haven) adds “haven”. |
| Alias resolution | dplyr::filter(x > 0) → same result as filter(x > 0). fread("data.csv") → data-load node. read.delim("data.tsv") → data-load node. |
| Disambiguation | Bare filter() with library(dplyr) → recognized. Bare filter() without library → recognized with diagnostic. stats::filter() → unsupported with diagnostic. |
| fread | fread("file.csv") → data-load node. data.table::fread("file.csv") → data-load node. |
| Integration | Full script with library(dplyr); df <- read_csv("a.csv"); df %>% filter(x > 0) %>% mutate(y = log(x)). Expected: library consumed, pipe chain works, no spurious unsupported nodes. |

No E2E changes

These features don’t change the UI — they affect what gets recognized from code input. Existing E2E tests should continue to pass. A new E2E test with library() + dplyr::filter() in the code editor would be valuable but not required for this spec.


7. Files Modified

| File | Change |
|---|---|
| src/core/parsers/r/tokens.ts | Add DoubleColon token |
| src/core/parsers/r/parser.ts | Parse pkg::fn() → FunctionCallNode with composite name |
| src/core/parsers/r/recognizer.ts | Ignorable set, alias table, disambiguation table, loadedPackages pre-pass, alias resolution before switch |
| src/core/parsers/shared/analysis-call.ts | Add optional originalFunction field |
| src/core/parsers/r/recognizer.test.ts | Tests for all new recognition patterns |
| src/core/parsers/r/lexer.test.ts | DoubleColon token tests |
| src/core/parsers/r/parser.test.ts | Namespace-qualified call parsing tests |
| src/core/pipeline/integration.test.ts | Integration test with library + aliases |

8. Non-Goals