Recursive Argument Recognition for Dataset/Model Bindings

Date: 2026-04-29
Owner: Mikel Petri
Status: Design, pending review

Problem

When a user writes a model call with an inline transformation inside the data= argument, the recognizer extracts the bare identifier and silently discards the surrounding expression. The model then runs on the unfiltered, untransformed dataset: the fitted coefficients differ from the author's, and no diagnostic is emitted.

Concrete example (paper #10 wage-risk, BHPS Heckman wage equation):

felm(log(he) ~ poly(age, 2) + ... | id, data = expo[expo$valid.he == 1])

The recognizer returns data: "expo" and discards [expo$valid.he == 1]. On the synthetic CSV every row satisfies valid.he == 1, so the bug hides; on real BHPS data, rows with he <= 0/NA/Inf are silently included and the fit changes.

Same idiom in paper #4 investor-memory and tier-2/3 audit-queue papers (data.table-native authors). Listed in BACKLOG as “Inline-filter in felm/feols data= arg silently dropped”.

Root cause

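The recognizer treats argument extraction as “find the binding name (a string)” rather than “find an upstream node (could itself be a recognized call)”. Two parallel extractors both implement the flat-name model: extractRefName on the AST path, and the regex-based extractDataFromArgs on the source-text path (felm/feols/ivreg/lm_robust).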

The standalone recognizers for subset(), filter(), and bracket subset (recognizeSubset :2556, recognizeBracketSubset :2573, dplyr filter :1766) do lift to data-filter AnalysisCalls correctly — but only when invoked from the top-level statement walker. They are never called from inside an argument extractor.

Goal

Replace the flat-name extraction with recursive recognition at every call site whose argument is a dataset or model binding. The argument’s AST sub-tree is treated as a candidate for recognition. If it is recognizable, we synthesize a deterministic lifted binding (__lift_<span>_<stage>), insert the lifted AnalysisCall(s) ahead of the model call, and rewrite the model call’s data= to reference the lifted binding. If it is not recognizable, we fall back to opaque-promoting the entire model call.

Non-goals

Architecture

The principle

A call site recurses iff its argument is a dataset or model binding (something the rest of the pipeline could consume as an upstream node). It stays flat iff its argument is a column name or scalar literal (a string label or value with no upstream node).

The recursive helper

Add a single helper in src/core/parsers/r/recognizer.ts:

function recognizeAsBinding(
  node: RNode,
  scope: Map<string, AnalysisCall>,
  loadedPackages: Set<string>,
  externalScope: ExternalScope,
  diagnostics: Diagnostic[],
  sourceCode: string,
): { name: string; lifted: AnalysisCall[] };

Behavior, in order:

  1. node.type === 'identifier' → { name: node.name, lifted: [] }. Today’s behavior.
  2. node.type === 'dollar-access' → { name: extractRefName(node), lifted: [] }. Today’s behavior.
  3. node.type === 'literal' && string → { name: node.value, lifted: [] }. Today’s behavior.
  4. node.type === 'subset' (bracket subset) → dispatch to liftBracketSubset(node, ...) (see below).
  5. node.type === 'function-call' → delegate to the existing AST-walker dispatcher recognizeFunctionCall(node, scope, ...). If it returns an AnalysisCall whose kind is one of the dataset-producing kinds — data-load, data-filter, data-mutate, data-select, data-summarise, data-arrange, data-rename, data-clean-names, data-join, data-bind-rows — capture it as a lifted binding. The standalone dispatch switch at recognizer.ts:2527-2564 already covers every dplyr verb (filter, mutate, select, summarise, arrange, rename), every join (inner_join, left_join, full_join, right_join, anti_join, semi_join), merge, rbind, group_by, and subset. Recursion gets all of them on day one with no per-function code. (Functions outside this set — na.omit, user-defined helpers, unrecognized package functions — recognizeFunctionCall returns null or a non-dataset kind; we fall through to rule 7 and trigger Edge C opaque promotion.)
  6. node.type === 'binary-op' and op === '|>', OR node.type === 'function-call' and the call sits inside a %>% chain → dispatch to the extracted pipe walker (see B1 below).
  7. Anything else → { name: '?', lifted: [] }. The caller is expected to opaque-promote.
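The ordered rules above can be sketched on a toy AST. This is a minimal illustration under stated assumptions, not the recognizer's code: ToyNode, ToyLiftedCall, ToyBinding, DATASET_CALLS, and recognizeAsBindingSketch are hypothetical names, the lookup table stands in for recognizeFunctionCall, and rules 2, 4, and 6 (dollar access, bracket subset, pipes) are left out for brevity.

```typescript
// Reduced stand-in for the real RNode union (assumption: only what the sketch needs).
type Span = { start: number; end: number };
type ToyNode =
  | { type: 'identifier'; name: string; span: Span }
  | { type: 'literal'; value: string | number; span: Span }
  | { type: 'function-call'; callee: string; args: ToyNode[]; span: Span }
  | { type: 'other'; span: Span };

interface ToyLiftedCall {
  kind: string;
  assignedTo: string;
  inputs: string[]; // upstream binding names, one per data argument
}

interface ToyBinding {
  name: string;
  lifted: ToyLiftedCall[];
}

// Hypothetical stand-in for recognizeFunctionCall:
// function name -> [dataset-producing kind, number of leading data arguments].
const DATASET_CALLS = new Map<string, [string, number]>([
  ['subset', ['data-filter', 1]],
  ['filter', ['data-filter', 1]],
  ['mutate', ['data-mutate', 1]],
  ['inner_join', ['data-join', 2]],
]);

function unrecognized(): ToyBinding {
  return { name: '?', lifted: [] };
}

function recognizeAsBindingSketch(node: ToyNode): ToyBinding {
  switch (node.type) {
    case 'identifier': // rule 1: plain binding name
      return { name: node.name, lifted: [] };
    case 'literal': // rule 3: only string literals name a binding
      return typeof node.value === 'string'
        ? { name: node.value, lifted: [] }
        : unrecognized();
    case 'function-call': { // rule 5: delegate, then lift dataset-producing calls
      const entry = DATASET_CALLS.get(node.callee);
      if (entry === undefined) return unrecognized(); // falls through to rule 7
      const [kind, dataArgCount] = entry;
      const lifted: ToyLiftedCall[] = [];
      const inputs: string[] = [];
      for (const arg of node.args.slice(0, dataArgCount)) {
        const inner = recognizeAsBindingSketch(arg); // each data arg recurses
        if (inner.name === '?') return unrecognized();
        lifted.push(...inner.lifted); // inner lifts come first (dependency order)
        inputs.push(inner.name);
      }
      const tag = kind.replace('data-', '');
      const name = `__lift_${node.span.start}_${node.span.end}_${tag}`;
      lifted.push({ kind, assignedTo: name, inputs });
      return { name, lifted };
    }
    default: // rule 7: caller is expected to opaque-promote
      return unrecognized();
  }
}
```

In this sketch an unrecognizable inner argument makes the whole result unrecognized, so the caller sees '?' and can opaque-promote the model call rather than run on a partially lifted chain.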

Each lifted AnalysisCall carries:

- A deterministic assignedTo of the form __lift_<span.start>_<span.end>_<kind-tag> (e.g., __lift_142_171_filter). This matches the inliner pattern noted in CLAUDE.md (“Synthetic lifted bindings are deterministic”).
- A sourceSpan equal to the original sub-tree’s span: the accurate byte range of the lifted expression, used for UI source-highlighting.
- A parentSpanStart: number field on the AnalysisCall, equal to the parent model call’s sourceSpan.start. The merge sort at recognizer.ts:356 is updated from a.sourceSpan.start - b.sourceSpan.start to (a.parentSpanStart ?? a.sourceSpan.start) - (b.parentSpanStart ?? b.sourceSpan.start). Lifted nodes thus sort by the position of their parent, guaranteeing they appear immediately before the parent in the final call list. Tie-break: lifted before non-lifted at equal keys.
- A lifted: true flag for the comparator tie-break and for downstream UI (“this node was synthesized from an inline argument”).
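The updated comparator can be sketched as follows. SortableCall, sortKey, and compareCalls are hypothetical names, and the interface keeps only the fields the sort reads; the real AnalysisCall has more.

```typescript
// Only the fields the merge sort needs (assumption: the real AnalysisCall is wider).
interface SortableCall {
  assignedTo: string;
  sourceSpan: { start: number; end: number };
  parentSpanStart?: number; // set on lifted calls only
  lifted?: boolean;
}

// Lifted calls sort by their parent's position; everything else by its own.
function sortKey(call: SortableCall): number {
  return call.parentSpanStart ?? call.sourceSpan.start;
}

function compareCalls(a: SortableCall, b: SortableCall): number {
  const byKey = sortKey(a) - sortKey(b);
  if (byKey !== 0) return byKey;
  // Tie-break: lifted before non-lifted at equal keys.
  return Number(b.lifted === true) - Number(a.lifted === true);
}
```

Two lifted calls with the same parent compare as equal here, so their relative order (for example filter before select) depends on the sort being stable and on the calls being inserted in dependency order. Array.prototype.sort is stable in ES2019 and later; whether the merge sort at recognizer.ts:356 relies on that is an assumption of this sketch.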

Adoption sites (binding sites recurse, column sites stay flat)

| File | Line | Site | Today | After |
|---|---|---|---|---|
| recognizer.ts | 2465, 2479 | dplyr mutate / summarise first arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2563 | subset(df, cond) first arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2577 | df[cond, ] node.object | extractRefName | recognizeAsBinding |
| recognizer.ts | 2649 | t.test(... data=df) data arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2682 | lm(... data=df) data arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2708 | ols(... data=df) data arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2801 | glm(... data=df) data arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2730, 2820 | robcov(fit, ...) / similar post-hoc model arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 992 | extractDataFromArgs (felm/feols/ivreg/lm_robust source-text path) | regex | hybrid: source-text formula + AST data= (see below) |

Sites that stay flat (column-name / scalar-literal contracts):

| Line | Site | Reason |
|---|---|---|
| 2225–6, 2496–7 | binary-op left/right operands (filter conditions) | column refs, not bindings |
| 2664, 2665 | t.test(x, y) x/y args | column refs (typically df$col) |
| 2683, 2710, 2804 | weights= arg | executor consumes as column name, not vector binding (separate decision if a paper demands change) |
| 2188 | other column-name sites | flat by design |

Source-text path (felm / feols / ivreg / lm_robust)

These four functions extract args from raw source text because the IV-formula sanitizer (sanitizeIVFormulas, replacing | with ,) corrupts the formula slot. The data= slot is uncorrupted — the sanitizer touches only the formula AST.

Hybrid extraction:

- Formula continues to be extracted via extractFormulaArg from raw source text (today’s behavior).
- data= is extracted from the AST of the same call. The pre-scan path already has a FunctionCallNode available (or can locate it by sourceSpan matching). Pass that AST node to recognizeAsBinding.

Implementation: extend recognizeIvregFromSource / recognizeFelmFromSource / recognizeFeolsFromSource / recognizeLmRobustFromSource to accept the FunctionCallNode in addition to rawArgs. Internally, look up the data arg via getNamedArg(node.args, 'data') and run it through recognizeAsBinding. Replace the regex extractDataFromArgs for these four functions; keep the regex for any other source-text consumers.

What the MVP unlocks for free

Because rule 5 reuses the existing dispatcher, every form below works in data= from day one — no per-function spec change:

| data= form | Lifted chain |
|---|---|
| data = subset(df, cond) | one data-filter |
| data = filter(df, cond) | one data-filter |
| data = mutate(df, x = expr) | one data-mutate |
| data = select(df, col1, col2) | one data-select |
| data = arrange(df, col) | one data-arrange |
| data = rename(df, new = old) | one data-rename |
| data = summarise(df, m = mean(x)) | one data-summarise |
| data = inner_join(a, b, by="id") | one data-join (recursive on both a and b!) |
| data = merge(a, b, by="id") | one data-join (via recognizeMerge) |
| data = rbind(a, b) | one data-bind-rows |
| data = subset(filter(df, c1), c2) | nested: outer data-filter consumes lifted inner data-filter |
| data = inner_join(subset(a, c1), b, by="id") | nested: lifted inner data-filter feeds into lifted data-join |

Joins and rbind recurse on each of their multiple data args independently, since each goes through recognizeAsBinding.

Edge handling

Edge A — joint filter + select bracket

SubsetNode.args carries 1 or 2 elements. Four sub-cases:

| AST shape | Inner expression | Lifted output |
|---|---|---|
| args: [cond] (1 arg, no comma) | df[cond] (data.table no-comma) | one data-filter |
| args: [cond, MISSING] (trailing comma, empty 2nd) | df[cond, ] | one data-filter |
| args: [MISSING, c(...)] (leading-empty, see grammar note) | df[, c("col")] | one data-select |
| args: [cond, c(...)] (both present) | df[cond, c("col")] | data-filter → data-select chain (two synthetic bindings, deterministic names) |

Grammar prerequisite: the existing parser rule (parser.ts:341) is argList = argument (Comma argument)* — the first argument is unconditional, so df[, c("col")] (leading empty) does not parse today. Spec includes a small grammar tweak: change argList to allow an empty first slot inside indexSuffix only (not inside callSuffix). Concretely, introduce indexArgList that wraps argList with leading-empty support:

indexArgList → (Comma argument (Comma argument)* trailing?)
             | argument (Comma argument)* trailing?

The visitor emits MISSING (or null) for empty slots so the recognizer can distinguish. The pure base-R df[, "col"] form works the same way after this.
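To make the expected slot shapes concrete, here is a small stand-alone sketch that splits the text between the brackets at top-level commas and yields null for empty slots. It is not the parser rule itself: splitIndexSlots is a hypothetical helper, it works on raw text rather than tokens, and it does not handle escaped quotes.

```typescript
// Hypothetical helper: split the inside of df[...] into slots, null for an empty slot.
function splitIndexSlots(inner: string): Array<string | null> {
  const slots: Array<string | null> = [];
  let depth = 0; // nesting depth of (), [], {}
  let quote: string | null = null; // active string delimiter, if any
  let current = '';

  const flush = (): void => {
    const text = current.trim();
    slots.push(text === '' ? null : text);
    current = '';
  };

  for (let i = 0; i < inner.length; i++) {
    const ch = inner.charAt(i);
    if (quote !== null) {
      // Inside a string literal: copy verbatim until the closing quote.
      current += ch;
      if (ch === quote) quote = null;
      continue;
    }
    if (ch === '"' || ch === "'") {
      quote = ch;
      current += ch;
      continue;
    }
    if (ch === '(' || ch === '[' || ch === '{') depth++;
    if (ch === ')' || ch === ']' || ch === '}') depth--;
    if (ch === ',' && depth === 0) {
      flush(); // a top-level comma closes the current slot
      continue;
    }
    current += ch;
  }
  flush(); // the last slot has no trailing comma
  return slots;
}
```

Under these assumptions, the leading-empty form yields a null first slot and the trailing-comma form yields a null second slot, which is the distinction the recognizer needs.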

The new column-only form (df[, c(...)]) also closes the BACKLOG bullet “Base-R column subset df[, c('col1','col2')]” as a free byproduct.

extractStringVector (already present in the recognizer) extracts the column name list from the c(...) AST node.
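The sub-case table can be read as a small decision function. The sketch below is illustrative only: IndexSlot, LiftKind, and planBracketLift are hypothetical names, the missing sentinel follows the shape suggested for ast.ts, and returning null for shapes the table does not list (such as both slots empty) is an assumption.

```typescript
// Hypothetical slot type: an empty bracket position or an expression sub-tree.
type IndexSlot = { type: 'missing' } | { type: 'expr'; text: string };
type LiftKind = 'data-filter' | 'data-select';

// Returns the kinds to lift, in chain order, or null when the shape is not covered.
function planBracketLift(args: IndexSlot[]): LiftKind[] | null {
  const [rows, cols] = args;
  if (rows === undefined || args.length > 2) return null;
  const hasRows = rows.type === 'expr';
  const hasCols = cols !== undefined && cols.type === 'expr';
  if (hasRows && hasCols) return ['data-filter', 'data-select']; // df[cond, c(...)]
  if (hasRows) return ['data-filter']; // df[cond] and df[cond, ]
  if (hasCols) return ['data-select']; // df[, c(...)]
  return null; // assumption: nothing to lift, caller decides
}
```

A null result would leave the decision to the caller, which in this design means the opaque-promotion path.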

Edge B — pipes inside data=

Today the pipe walker is inlined inside the top-level statement walker (recognizer.ts:1670+). It handles dplyr verbs, model functions, and currentDataVar threading.

Refactor it into a reusable function:

function recognizePipeChain(
  chain: RNode,        // the binary-op '|>' or function-call wrapped %>% chain
  scope: Map<...>,
  loadedPackages: Set<...>,
  externalScope: ExternalScope,
  diagnostics: Diagnostic[],
  sourceCode: string,
  initialDataVar?: string,
): { finalBinding: string; lifted: AnalysisCall[] };

The top-level statement walker calls it (passing initialDataVar = undefined and accepting the lifted calls as the sequence). recognizeAsBinding calls it for pipe-typed nodes encountered inside an argument, generating intermediate __lift_<span>_pipe<i> bindings for each step.

Synthetic intermediate bindings are deterministic per CLAUDE.md’s lifted-bindings invariant — so re-recognition produces identical IDs.
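A minimal sketch of how intermediate pipe bindings could be named and threaded. PipeStepSketch, LiftedPipeCall, and liftPipeSteps are hypothetical, the span in the name is assumed to be the whole chain's span, and the kind is derived as data-<verb>, which only holds for verbs whose kind follows that pattern.

```typescript
interface PipeStepSketch {
  verb: string; // e.g. 'filter', 'select'
}

interface LiftedPipeCall {
  kind: string;
  assignedTo: string;
  data: string; // upstream binding consumed by this step
}

function liftPipeSteps(
  chainSpan: { start: number; end: number },
  initialDataVar: string,
  steps: PipeStepSketch[],
): { finalBinding: string; lifted: LiftedPipeCall[] } {
  let current = initialDataVar;
  const lifted: LiftedPipeCall[] = [];
  steps.forEach((step, i) => {
    // The name depends only on the chain span and the step index, so it is deterministic.
    const assignedTo = `__lift_${chainSpan.start}_${chainSpan.end}_pipe${i}`;
    lifted.push({ kind: `data-${step.verb}`, assignedTo, data: current });
    current = assignedTo; // thread the new binding into the next step
  });
  return { finalBinding: current, lifted };
}
```

Because the names are a pure function of span and index, recognizing the same source twice yields the same identifiers.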

Edge C — unknown inner forms (opaque promotion)

When recognizeAsBinding returns { name: '?', lifted: [] } (rule 7 above), the caller is responsible for opaque-promoting the model call. Concretely: each model recognizer (recognizeLinearModel, recognizeFelmFromSource, etc.) receives the result from recognizeAsBinding, and if name === '?' AND the node was non-trivial (not a real missing arg — i.e., dataArg was present but unrecognizable), it returns an opaqueFallbackFromSource AnalysisCall instead of the typed model call.

opaqueFallbackFromSource already exists at recognizer.ts:1087. It produces a webr-opaque AnalysisCall whose rSource is the original (sanitized) call text. WebR evaluates the original R verbatim, including na.omit(df) or any user-defined helper.

Tradeoff (already accepted): the model loses typed RegressionResult for that one call. Future broom::tidy() typed-marshaler (existing BACKLOG item) auto-promotes opaque fits back to typed.
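The caller-side decision can be sketched as below. SketchCall and finishModelCall are hypothetical names, the two result shapes are reduced to the fields relevant here, and keeping the typed call when data= is genuinely absent is taken from the "not a real missing arg" condition above.

```typescript
type SketchCall =
  | { kind: 'linear-model'; data: string }
  | { kind: 'webr-opaque'; rSource: string };

function finishModelCall(
  dataArgPresent: boolean, // was a data= argument written at all?
  bindingName: string, // name returned by recognizeAsBinding
  callSource: string, // original (sanitized) call text
): SketchCall {
  if (dataArgPresent && bindingName === '?') {
    // The argument exists but could not be recognized: evaluate the original R verbatim.
    return { kind: 'webr-opaque', rSource: callSource };
  }
  // Recognized binding, or no data= argument at all: keep the typed model call.
  return { kind: 'linear-model', data: bindingName };
}
```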

Data flow

For lm(y ~ x, data = expo[expo$valid.he == 1]):

recognizeR (top-level walk)
  → recognizeFunctionCall('lm', node)
    → recognizeLinearModel(node)
      → dataArg = getNamedArg(node.args, 'data')   // SubsetNode
      → recognizeAsBinding(SubsetNode, ...)
        → liftBracketSubset(SubsetNode, ...)
          → returns { name: '__lift_42_71_filter',
                      lifted: [{ kind: 'data-filter',
                                 args: { data: 'expo', condition: 'valid.he == 1' },
                                 assignedTo: '__lift_42_71_filter',
                                 sourceSpan: <inner span> }] }
      → AnalysisCall { kind: 'linear-model',
                       args: { data: '__lift_42_71_filter', ... },
                       formula: { outcome: 'y', terms: [...] },
                       sourceSpan: <call span> }
  → prepend lifted calls to the result list
  → final sort by parentSpanStart ?? sourceSpan.start (lifted first on ties)

Mapper sees a normal two-node chain: data-filter → linear-model. No mapper changes required.

For the joint form data = df[cond, c("col1","col2")], the lifted sequence is:

[{ kind: 'data-filter', assignedTo: '__lift_<inner>_filter', ... },
 { kind: 'data-select', assignedTo: '__lift_<outer>_select',
   args: { data: '__lift_<inner>_filter', columns: ['col1','col2'] }, ... }]

Model call’s data arg becomes __lift_<outer>_select.

Error handling

Testing strategy

Unit tests (recognizer.test.ts)

  1. Inline filter forms, AST path (lm/glm/ols/t.test): each of the four bracket forms produces a lifted data-filter + the model call’s data= references the synthetic binding.
  1a. Inline dplyr / join / merge / rbind forms, AST path: data = mutate(df, ...), data = inner_join(a, b, by=...), data = subset(filter(df, c1), c2) (nested), data = inner_join(subset(a, c1), b, by=...) (nested across multiple data args). Each emits the right lifted chain.
  2. Inline filter forms — source-text path (felm/feols/ivreg/lm_robust): same four forms produce the same lifted shape via the hybrid AST-data-extraction + source-text-formula path.
  3. Joint filter + select (df[cond, c("col1","col2")]): two lifted nodes (data-filter then data-select) chained correctly.
  4. Column-only bracket (df[, c("col")]): one lifted data-select. Closes BACKLOG bullet.
  5. Pipe inside data= (data = df %>% filter(c) %>% select(...)): each pipe step lifted; final binding flows into model call.
  6. Unknown inner form (data = na.omit(df), data = my_helper(df)): model call opaque-promoted; no silent drop.
  7. Already-flat data arg (data = df): byte-identical to pre-fix behavior. Regression guard.
  8. Already-existing standalone-call patterns: byte-identical AnalysisCall sequence for code that doesn’t use inline forms (every existing paper test).

Integration tests (integration.test.ts)

  9. buildPipeline for each of the four bracket forms inside felm: data-filter → linear-model chain in the resulting pipeline; mapper edges resolve correctly.
  10. Joint filter+select inside felm: three-node chain, edge port types match.

Replication tests (paper-match)

  11. Wage-risk (paper #10): extend replicate-wage-risk.test.ts with a synthetic CSV variant that has rows with valid.he == 0. Pre-fix: coefficients differ from lfe::felm (the silent bug). Post-fix: agreement to <1e-3 relative tolerance. This is the canonical regression test for the issue.
  12. Investor-memory (paper #4): once the data.table := walrus blocker (separate BACKLOG item) is unblocked, the m[type == 0 & ...] inline-filter test follows the same pattern. Skipped in this MVP; the architecture supports it once the walrus lexer change lands.
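The relative-tolerance criterion in the wage-risk test could be checked with a helper along these lines. relativeError and coefficientsAgree are hypothetical names, and the existing paper-match tests may already have their own comparison utility.

```typescript
// Relative error of actual against expected; guards against division by zero.
function relativeError(actual: number, expected: number): number {
  const scale = Math.max(Math.abs(expected), Number.MIN_VALUE);
  return Math.abs(actual - expected) / scale;
}

// True when every expected term is present and within tol (default 1e-3).
function coefficientsAgree(
  actual: Record<string, number>,
  expected: Record<string, number>,
  tol: number = 1e-3,
): boolean {
  return Object.keys(expected).every((term) => {
    const a = actual[term];
    const e = expected[term];
    return typeof a === 'number' && typeof e === 'number' && relativeError(a, e) < tol;
  });
}
```

A missing term counts as disagreement in this sketch, so a fit that silently drops a regressor would fail the check rather than pass vacuously.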

Existing tests

  13. All existing paper-replication tests (replicate-monopoly, replicate-pollution, replicate-soil, replicate-keep-enemies-closer, etc.) must continue to pass byte-identically. None of them use inline-filter forms today, so the change should be invisible to them.

Verification gate

npm run build && npm test && npm run lint && npm run test:e2e — plus npm run test:paper-match for the wage-risk extended test.

Implementation phases

The work decomposes into five sequential phases. Each phase ends in a green build + tests.

  1. Phase 1 — recognizeAsBinding skeleton + rules 1–3 (identifier / dollar / literal) + rule 5 (function-call delegation). Adds the helper; rule 5 delegates to recognizeFunctionCall and accepts dataset-kind results as lifted bindings. Adopt at one easy site (recognizeLinearModel for lm()) to validate the wiring. Tests: regression guards + unit tests 1a (joins/mutates/dplyr inside data= of lm()).
  2. Phase 2 — Bracket subset cases (Edge A row-only forms) + rule 4. SubsetNode recognition for df[cond] and df[cond, ] inside data=. Adopt across lm/glm/ols/t.test. Tests: unit tests 1 (AST path).
  3. Phase 3 — Source-text path hybrid (felm/feols/ivreg/lm_robust). Wire AST-based data= extraction; keep source-text formula. Tests: unit tests 2 + replication test 11 (wage-risk regression).
  4. Phase 4 — Joint filter+select + parser grammar tweak (Edge A joint). Grammar change for indexArgList leading-empty; lift df[cond, c(...)] and df[, c(...)]. Tests: unit tests 3, 4 + integration 9, 10.
  5. Phase 5 — Pipe walker extraction + pipes inside data= (Edge B), opaque promotion finalization (Edge C). Tests: unit tests 5, 6.

Each phase is a single PR-sized commit, reviewable independently.

Open questions

None. Implementation can proceed.

Affected files

Modified:

- src/core/parsers/r/recognizer.ts: adds recognizeAsBinding and liftBracketSubset, refactors the model-call recognizers, and adds the hybrid source-text path.
- src/core/parsers/r/parser.ts: indexArgList grammar change for a leading-empty bracket arg.
- src/core/parsers/r/visitor.ts: emits a MISSING placeholder for empty arg slots inside indexSuffix.
- src/core/parsers/r/ast.ts: extends the SubsetNode.args element type to RNode | { type: 'missing'; span: Span } (or an equivalent sentinel).
- src/core/parsers/r/recognizer.test.ts: unit tests 1–8.
- src/core/pipeline/integration.test.ts: integration tests 9, 10.
- src/core/pipeline/replicate-wage-risk.test.ts: replication test 11.
- BACKLOG.md: strike “Inline-filter in felm/feols data= arg silently dropped” and “Base-R column subset df[, c('col1','col2')]”; update the wage-risk and investor-memory paper limitations.

Read-only references (no changes):

- src/core/pipeline/mapper.ts: consulted to verify that no changes are needed; the existing data-filter and data-select mapper cases handle the lifted nodes.
- src/core/parsers/r/ast.ts: the SubsetNode type is otherwise unchanged; only its args element type is extended, as listed under Modified.
