Date: 2026-04-29 Owner: Mikel Petri Status: Design — pending review
When a user writes a model call with an inline transformation inside
the data= argument, the recognizer extracts the bare
identifier and silently throws away the surrounding
expression. The model runs on the unfiltered/untransformed
dataset; coefficients drift; no diagnostic.
Concrete example (paper #10 wage-risk, BHPS Heckman wage equation):

```r
felm(log(he) ~ poly(age, 2) + ... | id, data = expo[expo$valid.he == 1])
```

The recognizer returns data: "expo" and discards
[expo$valid.he == 1]. On the synthetic CSV every row
satisfies valid.he == 1, so the bug hides; on real BHPS
data, rows with he <= 0/NA/Inf are silently included and
the fit changes.
Same idiom in paper #4 investor-memory and tier-2/3 audit-queue
papers (data.table-native authors). Listed in BACKLOG as
“Inline-filter in felm/feols
data= arg silently dropped”.
The recognizer treats argument extraction as “find the binding name (a string)” rather than “find an upstream node (could itself be a recognized call)”. Two parallel extractors both implement the flat-name model:

- extractRefName(node) at recognizer.ts:3353 handles only identifier / dollar-access / string-literal nodes; everything else returns '?'. Used by lm, glm, ols, t.test recognizers (:2674, :2700, :2682, :2649).
- extractDataFromArgs at recognizer.ts:992 regex-matches ^data\s*=\s*(\w+). The \w+ greedily matches the bare identifier and stops at [ / (. Used by felm, feols, ivreg, lm_robust (which bypass AST due to IV-formula | sanitization).

The standalone recognizers for subset(),
filter(), and bracket subset (recognizeSubset
:2556, recognizeBracketSubset
:2573, dplyr filter :1766) do lift to
data-filter AnalysisCalls correctly — but only when invoked
from the top-level statement walker. They are never called from inside
an argument extractor.
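
The regex failure mode described above can be reproduced in isolation. The sketch below is a stand-in, not the code at recognizer.ts:992: only the pattern is taken from this document, and the function name extractDataNameByRegex is hypothetical.

```typescript
// Hypothetical stand-in for the source-text path; only the pattern
// ^data\s*=\s*(\w+) comes from this document.
const DATA_ARG_PATTERN = /^data\s*=\s*(\w+)/;

function extractDataNameByRegex(argText: string): string | null {
  const match = DATA_ARG_PATTERN.exec(argText.trim());
  return match?.[1] ?? null;
}

const plain = extractDataNameByRegex("data = expo");
const bracket = extractDataNameByRegex("data = expo[expo$valid.he == 1]");
const wrapped = extractDataNameByRegex("data = na.omit(expo)");

console.log(plain);   // "expo"
console.log(bracket); // "expo" (the bracket filter is silently dropped)
console.log(wrapped); // "na"   (\w+ stops at "." as well as at "(")
```

The bracket and plain forms yield the same name, which is why the downstream pipeline cannot tell that a filter was ever present.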
Replace the flat-name extraction with recursive recognition
at every call site whose argument is a dataset or model
binding. The argument’s AST sub-tree is treated as a candidate
for recognition. If it is recognizable, we synthesize a deterministic
lifted binding (__lift_<span>_<stage>), insert
the lifted AnalysisCall(s) ahead of the model call, and rewrite the
model call’s data= to reference the lifted binding. If it
is not recognizable, we fall back to opaque-promoting the entire model
call.
- t.test x/y args, weights=). These remain flat: they consume a string label, not a binding.
- data-select).
- weights= as a vector binding), flagged as future work.

A call site recurses iff its argument is a dataset or model binding (something the rest of the pipeline could consume as an upstream node). It stays flat iff its argument is a column name or scalar literal (a string label or value with no upstream node).
Add a single helper in
src/core/parsers/r/recognizer.ts:

```typescript
function recognizeAsBinding(
  node: RNode,
  scope: Map<string, AnalysisCall>,
  loadedPackages: Set<string>,
  externalScope: ExternalScope,
  diagnostics: Diagnostic[],
  sourceCode: string,
): { name: string; lifted: AnalysisCall[] };
```

Behavior, in order:

1. node.type === 'identifier' → { name: node.name, lifted: [] }. Today’s behavior.
2. node.type === 'dollar-access' → { name: extractRefName(node), lifted: [] }. Today’s behavior.
3. node.type === 'literal' && string → { name: node.value, lifted: [] }. Today’s behavior.
4. node.type === 'subset' (bracket subset) → dispatch to liftBracketSubset(node, ...) (see below).
5. node.type === 'function-call' and node.name is one of subset, filter, na.omit (and any other future “produces-a-dataset” function) → dispatch to the corresponding standalone recognizer, capturing its returned AnalysisCall as a lifted binding.
6. node.type === 'binary-op' and op === '|>', OR node.type === 'function-call' and the call sits inside a %>% chain → dispatch to the extracted pipe walker (see B1 below).
7. Anything else → { name: '?', lifted: [] }. The caller is expected to opaque-promote.

Each lifted AnalysisCall carries:

- A deterministic assignedTo of the form __lift_<span.start>_<span.end>_<kind-tag> (e.g., __lift_142_171_filter). Matches the inliner pattern noted in CLAUDE.md (“Synthetic lifted bindings are deterministic”).
- A sourceSpan equal to the original sub-tree’s span. The recognizer’s call-ordering sort (by sourceSpan.start) must place lifted calls before their parent model call. Sorting on the raw start does not do this by itself, because the inner span starts at the same position as the parent’s start or after it. To guarantee strict ordering, the lifted span’s start is offset by -0.5 (a fractional sentinel) inside the comparator only; the lifted node’s printed span stays accurate. Note that the offset only resolves ties: when the inner span starts strictly after the parent’s start (the usual case for a data= argument), the lifted call needs to sort on its parent call’s start instead of its own. (Implementation: extend the sort comparator to break ties using lifted flag.)
- A lifted: true flag on the AnalysisCall for the comparator and for downstream UI (“this node was synthesized from an inline argument”). Optional; can be left implicit.

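
A minimal sketch of the naming and ordering rules for lifted calls. The Call shape is reduced to the fields needed here, and parentStart is a hypothetical field (not part of the design text) that lets a lifted call whose span starts after its parent's start still sort ahead of that parent.

```typescript
// Reduced, hypothetical shapes; the real AnalysisCall carries more fields.
interface Span { start: number; end: number }
interface Call {
  kind: string;
  assignedTo?: string;
  sourceSpan: Span;
  lifted?: boolean;
  parentStart?: number; // assumption: start offset of the call consuming this binding
}

// Deterministic name: the same span and kind tag always give the same binding.
function liftedName(span: Span, kindTag: string): string {
  return `__lift_${span.start}_${span.end}_${kindTag}`;
}

// Sort key used only inside the comparator; printed spans are untouched.
function sortKey(call: Call): number {
  if (!call.lifted) return call.sourceSpan.start;
  const anchor = call.parentStart ?? call.sourceSpan.start;
  return anchor - 0.5; // fractional sentinel: just ahead of the parent
}

function orderCalls(calls: Call[]): Call[] {
  // Array.prototype.sort is stable, so lifted siblings keep insertion order.
  return [...calls].sort((a, b) => sortKey(a) - sortKey(b));
}

const filterSpan: Span = { start: 42, end: 71 };
const model: Call = { kind: "linear-model", sourceSpan: { start: 30, end: 72 } };
const filter: Call = {
  kind: "data-filter",
  assignedTo: liftedName(filterSpan, "filter"),
  sourceSpan: filterSpan,
  lifted: true,
  parentStart: model.sourceSpan.start,
};
const ordered = orderCalls([model, filter]).map((c) => c.kind);

console.log(filter.assignedTo); // "__lift_42_71_filter"
console.log(ordered);           // ["data-filter", "linear-model"]
```

Keeping the adjustment inside sortKey means the stored sourceSpan of the lifted call is never modified, so diagnostics and UI still point at the inline expression.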
| File | Line | Site | Today | After |
|---|---|---|---|---|
| recognizer.ts | 2465, 2479 | dplyr mutate / summarise first arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2563 | subset(df, cond) first arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2577 | df[cond, ] node.object | extractRefName | recognizeAsBinding |
| recognizer.ts | 2649 | t.test(... data=df) data arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2682 | lm(... data=df) data arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2708 | ols(... data=df) data arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2801 | glm(... data=df) data arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 2730, 2820 | robcov(fit, ...) / similar post-hoc model arg | extractRefName | recognizeAsBinding |
| recognizer.ts | 992 | extractDataFromArgs (felm/feols/ivreg/lm_robust source-text path) | regex | hybrid: source-text formula + AST data= (see below) |

Sites that stay flat (column-name / scalar-literal contracts):

| Line | Site | Reason |
|---|---|---|
| 2225–6, 2496–7 | binary-op left/right operands (filter conditions) | column refs, not bindings |
| 2664, 2665 | t.test(x, y) x/y args | column refs (typically df$col) |
| 2683, 2710, 2804 | weights= arg | executor consumes as column name, not vector binding (separate decision if a paper demands change) |
| 2188 | other column-name sites | flat by design |

felm, feols, ivreg, and lm_robust extract args from raw source text because the
IV-formula sanitizer (sanitizeIVFormulas, which replaces "|" with ",")
corrupts the formula slot.
The data= slot is uncorrupted: the
sanitizer touches only the formula AST.
Hybrid extraction:

- Formula continues to be extracted via extractFormulaArg from raw source text (today’s behavior).
- data= is extracted from the AST of the same call. The pre-scan path already has a FunctionCallNode available (or can locate it by sourceSpan matching). Pass that AST node to recognizeAsBinding.

Implementation: extend recognizeIvregFromSource /
recognizeFelmFromSource /
recognizeFeolsFromSource /
recognizeLmRobustFromSource to accept the FunctionCallNode
in addition to rawArgs. Internally, look up the
data arg via getNamedArg(node.args, 'data')
and run it through recognizeAsBinding. Replace the regex
extractDataFromArgs for these four functions; keep the
regex for any other source-text consumers.
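
A rough sketch of the hybrid split, under stated assumptions: the node shapes are reduced and hypothetical, getNamedArg is assumed to return the value of the first argument with a matching name, and firstTopLevelArg is an illustrative stand-in for extractFormulaArg (it ignores string literals, which the real extractor may not).

```typescript
// Hypothetical, reduced AST shapes for illustration only.
type RNode =
  | { type: "identifier"; name: string }
  | { type: "subset"; object: RNode; args: RNode[] };
interface Arg { name?: string; value: RNode }
interface FunctionCallNode { type: "function-call"; name: string; args: Arg[] }

// Assumed behavior of getNamedArg: value of the first argument named `name`.
function getNamedArg(args: Arg[], name: string): RNode | undefined {
  return args.find((a) => a.name === name)?.value;
}

// Stand-in for extractFormulaArg: text up to the first comma that is not
// nested inside parentheses or brackets.
function firstTopLevelArg(rawArgs: string): string {
  let depth = 0;
  for (let i = 0; i < rawArgs.length; i++) {
    const ch = rawArgs[i];
    if (ch === "(" || ch === "[") depth++;
    else if (ch === ")" || ch === "]") depth--;
    else if (ch === "," && depth === 0) return rawArgs.slice(0, i).trim();
  }
  return rawArgs.trim();
}

// Formula from raw text (it still contains "|"); data= from the AST.
function extractHybrid(rawArgs: string, call: FunctionCallNode) {
  return {
    formulaText: firstTopLevelArg(rawArgs),
    dataNode: getNamedArg(call.args, "data"),
  };
}

const felmCall: FunctionCallNode = {
  type: "function-call",
  name: "felm",
  args: [
    // Placeholder: the sanitized formula AST is not consulted in this path.
    { value: { type: "identifier", name: "formula_placeholder" } },
    { name: "data", value: { type: "subset", object: { type: "identifier", name: "expo" }, args: [] } },
  ],
};
const hybrid = extractHybrid(
  "log(he) ~ poly(age, 2) | id, data = expo[expo$valid.he == 1]",
  felmCall,
);

console.log(hybrid.formulaText);    // "log(he) ~ poly(age, 2) | id"
console.log(hybrid.dataNode?.type); // "subset"
```

The data= node arrives as a full sub-tree, so it can be handed to recognizeAsBinding unchanged instead of being truncated by the regex.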
SubsetNode.args carries 1 or 2 elements. Four
sub-cases:

| AST shape | Inner expression | Lifted output |
|---|---|---|
| args: [cond] (1 arg, no comma) | df[cond] (data.table no-comma) | one data-filter |
| args: [cond, MISSING] (trailing comma, empty 2nd) | df[cond, ] | one data-filter |
| args: [MISSING, c(...)] (leading-empty, see grammar note) | df[, c("col")] | one data-select |
| args: [cond, c(...)] (both present) | df[cond, c("col")] | data-filter → data-select chain (two synthetic bindings, deterministic names) |

Grammar prerequisite: the existing parser rule
(parser.ts:341) is
argList = argument (Comma argument)* — the first argument
is unconditional, so df[, c("col")] (leading empty) does
not parse today. Spec includes a small grammar tweak: change
argList to allow an empty first slot inside
indexSuffix only (not inside callSuffix).
Concretely, introduce indexArgList that wraps
argList with leading-empty support:

```
indexArgList → (Comma argument (Comma argument)* trailing?)
             | argument (Comma argument)* trailing?
```

The visitor emits MISSING (or null) for
empty slots so the recognizer can distinguish. The pure base-R
df[, "col"] form works the same way after this.
The new column-only form (df[, c(...)]) also
closes the BACKLOG bullet “Base-R column subset
df[, c('col1','col2')]” as a free byproduct.
extractStringVector (already present in the recognizer)
extracts the column name list from the c(...) AST node.
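
The bracket-subset shapes tabulated earlier in this section could be lifted roughly as follows. This is a sketch under assumptions, not the planned liftBracketSubset: slots are carried as source text, an empty slot is represented as null (the document allows MISSING or null), the column list is assumed to be already extracted, and both lifted names reuse one span with different kind tags, whereas the design uses separate inner and outer spans.

```typescript
// Hypothetical, reduced shapes. A slot is condition/column source text,
// or null for an empty bracket position.
type Slot = string | null;
interface Span { start: number; end: number }
interface Lifted {
  kind: "data-filter" | "data-select";
  assignedTo: string;
  args: { data: string; condition?: string; columns?: string[] };
}

function liftBracketSubset(
  data: string,
  slots: Slot[],     // 1 or 2 entries, as in SubsetNode.args
  columns: string[], // assumed already extracted from c(...)
  span: Span,
): { name: string; lifted: Lifted[] } {
  const [rowSlot, colSlot] = slots;
  const lifted: Lifted[] = [];
  let current = data;

  // Row position present: lift a filter on the current binding.
  if (typeof rowSlot === "string") {
    const name = `__lift_${span.start}_${span.end}_filter`;
    lifted.push({ kind: "data-filter", assignedTo: name, args: { data: current, condition: rowSlot } });
    current = name;
  }
  // Column position present: lift a select on whatever came before.
  if (typeof colSlot === "string") {
    const name = `__lift_${span.start}_${span.end}_select`;
    lifted.push({ kind: "data-select", assignedTo: name, args: { data: current, columns } });
    current = name;
  }
  return { name: current, lifted };
}

const subsetSpan: Span = { start: 42, end: 71 };
const noComma = liftBracketSubset("df", ["x > 0"], [], subsetSpan);
const filterOnly = liftBracketSubset("df", ["x > 0", null], [], subsetSpan);
const selectOnly = liftBracketSubset("df", [null, 'c("a")'], ["a"], subsetSpan);
const both = liftBracketSubset("df", ["x > 0", 'c("a")'], ["a"], subsetSpan);

console.log(filterOnly.lifted.map((c) => c.kind)); // ["data-filter"]
console.log(selectOnly.lifted.map((c) => c.kind)); // ["data-select"]
console.log(both.name);                            // "__lift_42_71_select"
console.log(both.lifted[1]?.args.data);            // "__lift_42_71_filter"
```

In the joint form the select consumes the filter's synthetic binding, so the model call only needs the final name.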
Today the pipe walker is inlined inside the top-level statement
walker (recognizer.ts:1670+). It handles dplyr verbs, model
functions, and currentDataVar threading.
Refactor it into a reusable function:

```typescript
function recognizePipeChain(
  chain: RNode, // the binary-op '|>' or function-call wrapped %>% chain
  scope: Map<...>,
  loadedPackages: Set<...>,
  externalScope: ExternalScope,
  diagnostics: Diagnostic[],
  sourceCode: string,
  initialDataVar?: string,
): { finalBinding: string; lifted: AnalysisCall[] };
```

The top-level statement walker calls it (passing
initialDataVar = undefined and accepting the lifted calls
as the sequence). recognizeAsBinding calls it for
pipe-typed nodes encountered inside an argument, generating intermediate
__lift_<span>_pipe<i> bindings for each
step.
Synthetic intermediate bindings are deterministic per CLAUDE.md’s lifted-bindings invariant — so re-recognition produces identical IDs.
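
To illustrate how intermediate pipe bindings stay deterministic, here is a small sketch. It is not the planned recognizePipeChain: the chain is assumed to be pre-flattened into steps, the verb table contains only verbs named in this document, and all type names are hypothetical.

```typescript
// Hypothetical, reduced shapes: each pipe step is a verb plus raw argument text.
interface PipeStep { verb: string; argsText: string }
interface Span { start: number; end: number }
interface LiftedStep { kind: string; assignedTo: string; data: string; argsText: string }

// Assumed verb table for the sketch.
const VERB_KINDS = new Map<string, string>([
  ["filter", "data-filter"],
  ["select", "data-select"],
]);

function liftPipeChain(
  source: string,
  steps: PipeStep[],
  span: Span,
): { finalBinding: string; lifted: LiftedStep[] } {
  const lifted: LiftedStep[] = [];
  let current = source;
  for (let i = 0; i < steps.length; i++) {
    const step = steps[i];
    const kind = step ? VERB_KINDS.get(step.verb) : undefined;
    if (!step || kind === undefined) {
      // Unknown step: signal the caller to opaque-promote the whole model call.
      return { finalBinding: "?", lifted: [] };
    }
    // The name depends only on the span and step index, so re-recognition repeats it.
    const assignedTo = `__lift_${span.start}_${span.end}_pipe${i}`;
    lifted.push({ kind, assignedTo, data: current, argsText: step.argsText });
    current = assignedTo;
  }
  return { finalBinding: current, lifted };
}

const chainSpan: Span = { start: 10, end: 60 };
const chainSteps: PipeStep[] = [
  { verb: "filter", argsText: "c" },
  { verb: "select", argsText: "a, b" },
];
const firstRun = liftPipeChain("df", chainSteps, chainSpan);
const secondRun = liftPipeChain("df", chainSteps, chainSpan);
const unknownRun = liftPipeChain("df", [{ verb: "my_helper", argsText: "" }], chainSpan);

console.log(firstRun.finalBinding);   // "__lift_10_60_pipe1"
console.log(firstRun.lifted[1]?.data); // "__lift_10_60_pipe0"
console.log(unknownRun.finalBinding); // "?"
```

Running the walker twice on the same input yields identical names, which is the property the lifted-bindings invariant asks for.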
When recognizeAsBinding returns
{ name: '?', lifted: [] } (rule 7 above), the
caller is responsible for opaque-promoting the model
call. Concretely: each model recognizer
(recognizeLinearModel,
recognizeFelmFromSource, etc.) receives the result from
recognizeAsBinding, and if name === '?' AND
the node was non-trivial (not a real missing arg — i.e.,
dataArg was present but unrecognizable), it returns an
opaqueFallbackFromSource AnalysisCall instead of the typed
model call.
opaqueFallbackFromSource already exists at
recognizer.ts:1087. It produces a webr-opaque
AnalysisCall whose rSource is the original (sanitized) call
text. WebR evaluates the original R verbatim, including
na.omit(df) or any user-defined helper.
Tradeoff (already accepted): the model loses its typed
RegressionResult for that one call. A future
broom::tidy() typed marshaler (existing BACKLOG item)
would auto-promote opaque fits back to typed results.
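
The caller-side decision can be sketched as a small pure function. The names decideModelCall, BindingResult, and ModelDecision are hypothetical; in the design the opaque branch would return the AnalysisCall produced by opaqueFallbackFromSource rather than the reduced object used here.

```typescript
// Hypothetical, reduced shapes for illustration.
interface BindingResult { name: string; lifted: { assignedTo: string }[] }
type ModelDecision =
  | { kind: "typed"; data: string | undefined; liftedCount: number }
  | { kind: "webr-opaque"; rSource: string };

function decideModelCall(
  callSource: string,
  dataArgPresent: boolean,
  binding: BindingResult | undefined,
): ModelDecision {
  // No data= argument at all: a genuinely missing arg, so stay typed.
  if (!dataArgPresent || binding === undefined) {
    return { kind: "typed", data: undefined, liftedCount: 0 };
  }
  // data= present but unrecognizable: keep the original R text verbatim.
  if (binding.name === "?") {
    return { kind: "webr-opaque", rSource: callSource };
  }
  return { kind: "typed", data: binding.name, liftedCount: binding.lifted.length };
}

const helperCall = "lm(y ~ x, data = my_helper(df))";
const opaqueDecision = decideModelCall(helperCall, true, { name: "?", lifted: [] });
const liftedDecision = decideModelCall(
  "lm(y ~ x, data = df[df$ok == 1, ])",
  true,
  { name: "__lift_42_71_filter", lifted: [{ assignedTo: "__lift_42_71_filter" }] },
);
const missingDecision = decideModelCall("lm(y ~ x)", false, undefined);

console.log(opaqueDecision.kind);  // "webr-opaque"
console.log(liftedDecision.kind);  // "typed"
console.log(missingDecision.kind); // "typed"
```

Separating "argument absent" from "argument present but unrecognized" keeps calls without data= on the typed path.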
For lm(y ~ x, data = expo[expo$valid.he == 1]):

```
recognizeR (top-level walk)
  → recognizeFunctionCall('lm', node)
    → recognizeLinearModel(node)
      → dataArg = getNamedArg(node.args, 'data')   // SubsetNode
      → recognizeAsBinding(SubsetNode, ...)
        → liftBracketSubset(SubsetNode, ...)
          → returns { name: '__lift_42_71_filter',
                      lifted: [{ kind: 'data-filter',
                                 args: { data: 'expo', condition: 'valid.he == 1' },
                                 assignedTo: '__lift_42_71_filter',
                                 sourceSpan: <inner span> }] }
      → AnalysisCall { kind: 'linear-model',
                       args: { data: '__lift_42_71_filter', ... },
                       formula: { outcome: 'y', terms: [...] },
                       sourceSpan: <call span> }
    → prepend lifted calls to the result list
  → final sort by sourceSpan.start
```

Mapper sees a normal two-node chain: data-filter →
linear-model. No mapper changes required.
For the joint form data = df[cond, c("col1","col2")],
the lifted sequence is:

```
[{ kind: 'data-filter', assignedTo: '__lift_<inner>_filter', ... },
 { kind: 'data-select', assignedTo: '__lift_<outer>_select',
   args: { data: '__lift_<inner>_filter', columns: ['col1','col2'] }, ... }]
```

Model call’s data arg becomes
__lift_<outer>_select.
- data-filter is created normally; the data-filter executor surfaces the column-not-found error at runtime, same as today’s standalone filter behavior. No regression.
- data= → opaque-promote the entire model call (current pipe walker emits opaque for unknown steps; recognizeAsBinding propagates '?').

Unit tests (recognizer.test.ts):

- data-filter + the model call’s data= references the synthetic binding.
- df[cond, c("col1","col2")]: two lifted nodes (data-filter then data-select) chained correctly.
- df[, c("col")]: one lifted data-select. Closes BACKLOG bullet.
- data = df %>% filter(c) %>% select(...): each pipe step lifted; final binding flows into model call.
- data = na.omit(df), data = my_helper(df): model call opaque-promoted; no silent drop.
- data = df: byte-identical to pre-fix behavior. Regression guard.

Integration tests (integration.test.ts):

- data-filter → linear-model chain in the resulting pipeline; mapper edges resolve correctly.

Replication tests:

- replicate-wage-risk.test.ts with a synthetic CSV variant that has rows with valid.he == 0. Pre-fix: coefficients differ from lfe::felm (the silent bug). Post-fix: agreement to <1e-3 relative tolerance. This is the canonical regression test for the issue.
- Once the data.table := walrus blocker (separate BACKLOG item) is unblocked, the m[type == 0 & ...] inline-filter test follows the same pattern. Skipped in this MVP; the architecture supports it once the walrus lexer change lands.

Existing replication tests (replicate-monopoly, replicate-pollution, replicate-soil, replicate-keep-enemies-closer, etc.) must continue to pass byte-identically. None of them use inline-filter forms today, so the change should be invisible to them.

npm run build && npm test && npm run lint && npm run test:e2e, plus npm run test:paper-match for the wage-risk extended test.

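
The "<1e-3 relative tolerance" criterion in the wage-risk replication test could be checked with a helper along these lines. The function names are hypothetical and the coefficient values are arbitrary placeholders, not results from the paper or from lfe::felm.

```typescript
// Relative difference, with a small floor so a zero reference does not divide by zero.
function relativeError(actual: number, expected: number, floor = 1e-12): number {
  return Math.abs(actual - expected) / Math.max(Math.abs(expected), floor);
}

// True when every expected coefficient is present and within tolerance.
function coefficientsAgree(
  actual: Record<string, number>,
  expected: Record<string, number>,
  tolerance = 1e-3,
): boolean {
  return Object.keys(expected).every((name) => {
    if (!Object.prototype.hasOwnProperty.call(actual, name)) return false;
    const a = actual[name];
    const e = expected[name];
    if (a === undefined || e === undefined) return false;
    return relativeError(a, e) < tolerance;
  });
}

// Placeholder values chosen only to exercise both branches.
const reference = { age: 0.05, tenure: -0.2 };
const closeFit = { age: 0.05002, tenure: -0.20001 };
const driftedFit = { age: 0.051, tenure: -0.2 };

console.log(coefficientsAgree(closeFit, reference));   // true
console.log(coefficientsAgree(driftedFit, reference)); // false
```

A relative rather than absolute check keeps the criterion meaningful for coefficients of very different magnitudes.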
The work decomposes into five sequential phases. Each phase ends in a green build + tests.

1. recognizeAsBinding skeleton + identifier/dollar/literal cases. No behavior change; refactor existing call sites to use the helper. Tests: regression guards (existing tests pass byte-identically).
2. df[cond] and df[cond, ] lift correctly inside data=. Tests: unit tests 1 (AST path).
3. data= extraction; keep source-text formula. Tests: unit tests 2 + replication test 11 (wage-risk regression).
4. indexArgList leading-empty; lift df[cond, c(...)] and df[, c(...)]. Tests: unit tests 3, 4 + integration 9, 10.
5. data= (Edge B), opaque promotion on unknown forms (Edge C). Tests: unit tests 5, 6.

Each phase is a single PR-sized commit, reviewable independently.

None. Implementation can proceed.
Modified:

- src/core/parsers/r/recognizer.ts: adds recognizeAsBinding, liftBracketSubset, refactors model-call recognizers, hybrid source-text path.
- src/core/parsers/r/parser.ts: indexArgList grammar change for leading-empty bracket arg.
- src/core/parsers/r/ast-builder.ts (or equivalent visitor): emit MISSING placeholder for empty arg slots inside indexSuffix.
- src/core/parsers/r/recognizer.test.ts: unit tests 1–8.
- src/core/pipeline/integration.test.ts: integration tests 9, 10.
- src/core/pipeline/replicate-wage-risk.test.ts: replication test 11.
- BACKLOG.md: strike “Inline-filter in felm/feols data= arg silently dropped” and “Base-R column subset df[, c('col1','col2')]”; update wage-risk + investor-memory paper limitations.

Read-only references (no changes):

- src/core/pipeline/mapper.ts: verifies no changes needed; existing data-filter and data-select mapper cases handle the lifted nodes.
- src/core/parsers/r/ast.ts: SubsetNode type unchanged.

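- BACKLOG: “Inline-filter in felm/feols data= arg silently dropped”
- BACKLOG: “Base-R column subset df[, c('col1','col2')]”
- Recognizer call-ordering sort (by sourceSpan.start)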