Status: Draft, 2026-05-17. Closes:
BACKLOG bullets at lines 167 (list-style model$col = expr
is NOT data-mutate) and 285 (Sys.info()[K] /
actual[[K]] bracket-on-non-dataset). Affected
papers: #20, #39, #40, #42, #43, #44, #53 (8 papers,
cross-paper signal continues on any param-bag / vec[idx]
pattern).
Interlyse’s primitive set (data-mutate,
data-filter, data-select, …) is supposed to
model language-agnostic statistical/data atoms: “add a column to a
dataset”, “subset rows of a dataset”. They should mean the same thing in
R, Stata, Python, Julia. The recognizer’s job is to map source code to
this primitive set.
Today the recognizer over-fires on R-specific object syntax whose meaning is unrelated to Interlyse’s primitives:
```r
# Paper #39 finite-score: Gurobi solver parameter bag, NOT a dataset
model <- list()
model$vtype = c('B','B','C')   # → currently emits data-mutate
model$obj = c(1,2,3)           # → currently emits data-mutate
model$A = cbind(...)           # → currently emits data-mutate
gurobi(model, params)          # ← opaque consumer
```

```r
# Paper #44 privatizing-disability-insurance: numeric vector, NOT a dataset
Income.Drop <- c(0.1, 0.2, 0.3)
rho <- Income.Drop[2]          # → currently emits data-filter
```

```r
# Paper #40 targeting-precision-medicine: function-call object, NOT a binding
user <- Sys.info()[7]          # → currently emits data-filter
data_moments <- actual[[1]]    # → currently emits data-filter (actual is opaque-call return)
```

The root cause: the recognizer can’t tell list/$/vector
apart from data.frame/$/dataset because it doesn’t consult
the binding’s shape at the disambiguation sites. The
binding$col = expr and binding[expr]
recognition paths emit typed primitives unconditionally, manufacturing
phantom DAG nodes that operate on bindings that aren’t datasets.
The fix is to consult the binding’s known shape (when available)
before classifying — and to route mis-classifications to
webr-opaque, which is the honest representation: “this is
R-specific plumbing syntax that doesn’t map to an Interlyse
primitive.”
We are not promoting list$col = expr or
vec[idx] to typed primitives. Their consumers are opaque
(Gurobi, nloptr, mle2, optim, designmatch, JAGS — none are typed in
Interlyse), their contents are opaque, and no executable surface unlocks
by classifying them. They stay at webr-opaque, which
preserves the abstraction: Interlyse’s primitive set models
statistical/data intent, not language-specific object plumbing.
The cross-language abstraction lens validates the choice to route these to opaque rather than promote them to primitives:
| Pattern | R | Python | MATLAB | Julia | Stata |
|---|---|---|---|---|---|
| Param bag → opaque solver | `m <- list(); m$x = …; gurobi(m)` | `m = {}; m['x'] = …; gp.Model(m)` | `m.x = …; gurobi(m)` | `m = Dict(); m[:x] = …; gurobi(m)` | options on commands (no analogue) |
| Scalar from numeric vector | `vec[2]` | `vec[1]` | `vec(2)` | `vec[2]` | `vec[2]` |
The syntactic shape exists cross-language, but the semantic analysis intent does not: in every case the param bag’s content is consumed by an opaque solver and the scalar’s value feeds opaque calibration code. Promoting them to typed primitives would add DAG nodes without unlocking any inspection / computation / comparison / export surface — negative ROI.
Extend GlobalValue in
src/core/rdata/manifest.ts:
```typescript
export type GlobalValue =
  | { kind: 'scalar-string'; value: string }
  | { kind: 'scalar-number'; value: number }
  | { kind: 'scalar-bool'; value: boolean }
  | { kind: 'vector-string'; values: string[] }
  | { kind: 'vector-number'; values: number[] }
  | { kind: 'vector-bool'; values: boolean[] }
  | { kind: 'named-list'; members: Record<string, GlobalValue> } // existing; empty members map now allowed
  | { kind: 'list-opaque' }    // NEW: list-shaped, contents not statically extractable
  | { kind: 'vector-opaque' }; // NEW: vector-shaped, contents not statically extractable
```

The existing content-aware kinds (scalar-*,
vector-* with values, named-list with members)
stay unchanged and continue to carry content. The two new kinds carry
only the shape — used for disambiguation when content extraction can’t
succeed but the shape is unambiguous from the constructor call.
New shape predicates added alongside isScalarGlobal /
isVectorGlobal / isNamedListGlobal:
```typescript
export function isListShaped(v: GlobalValue): boolean {
  return v.kind === 'named-list' || v.kind === 'list-opaque';
}

export function isVectorShaped(v: GlobalValue): boolean {
  return v.kind === 'vector-string' || v.kind === 'vector-number'
    || v.kind === 'vector-bool' || v.kind === 'vector-opaque';
}

export function isScalarShaped(v: GlobalValue): boolean {
  return v.kind === 'scalar-string' || v.kind === 'scalar-number' || v.kind === 'scalar-bool';
}
```

We deliberately do not add an
opaque-call kind for arbitrary function-call returns. That
would collide with <- read.csv(...) and other recognized
reader/mutate/filter chains where the recognizer’s own
scope map correctly identifies the binding as a Dataset.
Disambiguation for unknown function-call returns (paper #40’s
actual <- moments_wgts(...) case) is handled by the
bracket-syntax check in §6, not by a new GlobalValue
kind.
Loosen classifyAsGlobalValue in
src/core/parsers/file-registry.ts:
| Source pattern | Today | After |
|---|---|---|
| `list()` (empty) | `undefined` | `{ kind: 'named-list', members: {} }` |
| `list(a = "x", b = 1)` (all-static) | `{ kind: 'named-list', members: {...} }` | unchanged |
| `list(a = some_call(), b = identifier)` (non-static members) | `undefined` | `{ kind: 'list-opaque' }` |
| `c(1, 2, 3)` (all-static numeric) | `{ kind: 'vector-number', values: [1,2,3] }` | unchanged |
| `c(some_call(), x)` (non-static elements) | `undefined` | `{ kind: 'vector-opaque' }` |
| `numeric(n)` / `integer(n)` / `character(n)` / `vector("numeric", n)` | `undefined` | `{ kind: 'vector-opaque' }` |
| any other function call (`read.csv`, `lm`, opaque) | `undefined` | `undefined` (unchanged; the recognizer’s scope owns this) |

The rule: classify by constructor name (list,
c,
numeric/integer/character/vector).
If the constructor matches and the contents are static, return the
existing rich kind. If the constructor matches but contents aren’t
static, fall through to the shape-only kind. Any other RHS stays
undefined, leaving the recognizer’s scope map
as the authoritative source for that binding.
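
The constructor-name rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual classifyAsGlobalValue: the `Expr` AST type and the name `classifySketch` are hypothetical, the GlobalValue union is trimmed to the kinds the sketch returns, and the handling of unnamed list members and of empty or mixed `c(...)` calls (both routed to the shape-only kind here) is a guess that the table above does not cover.

```typescript
// Hypothetical, simplified AST for the right-hand side of an assignment.
type Expr =
  | { type: 'number'; value: number }
  | { type: 'string'; value: string }
  | { type: 'identifier'; name: string }
  | { type: 'call'; callee: string; args: { name?: string; value: Expr }[] };

// Trimmed copy of the GlobalValue union (bool kinds omitted for brevity).
type GlobalValue =
  | { kind: 'scalar-string'; value: string }
  | { kind: 'scalar-number'; value: number }
  | { kind: 'vector-string'; values: string[] }
  | { kind: 'vector-number'; values: number[] }
  | { kind: 'named-list'; members: Record<string, GlobalValue> }
  | { kind: 'list-opaque' }
  | { kind: 'vector-opaque' };

const VECTOR_ALLOCATORS = new Set(['numeric', 'integer', 'character', 'vector']);

function classifySketch(rhs: Expr): GlobalValue | undefined {
  // Only constructor calls are classified; everything else stays undefined.
  if (rhs.type !== 'call') return undefined;

  if (rhs.callee === 'list') {
    const members: Record<string, GlobalValue> = {};
    for (const { name, value } of rhs.args) {
      // Assumption: unnamed or non-literal members make the list shape-only.
      if (name === undefined) return { kind: 'list-opaque' };
      if (value.type === 'string') members[name] = { kind: 'scalar-string', value: value.value };
      else if (value.type === 'number') members[name] = { kind: 'scalar-number', value: value.value };
      else return { kind: 'list-opaque' };
    }
    return { kind: 'named-list', members }; // list() yields members: {}
  }

  if (rhs.callee === 'c') {
    const nums: number[] = [];
    const strs: string[] = [];
    for (const { value } of rhs.args) {
      if (value.type === 'number') nums.push(value.value);
      else if (value.type === 'string') strs.push(value.value);
      else return { kind: 'vector-opaque' };
    }
    if (nums.length > 0 && strs.length === 0) return { kind: 'vector-number', values: nums };
    if (strs.length > 0 && nums.length === 0) return { kind: 'vector-string', values: strs };
    return { kind: 'vector-opaque' }; // assumption: empty or mixed c(...) is shape-only
  }

  // numeric(n), integer(n), character(n), vector("numeric", n): shape only.
  if (VECTOR_ALLOCATORS.has(rhs.callee)) return { kind: 'vector-opaque' };

  // read.csv, lm, and any other call: left to the recognizer's scope map.
  return undefined;
}
```

Returning `undefined` for every non-constructor call is what keeps the classifier from shadowing the scope map for bindings such as `df <- read.csv(...)`.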
The members: {} relaxation on named-list is
a non-breaking change: existing isNamedListGlobal and
resolveMemberAccess (in named-list-eval.ts)
already handle the empty case correctly —
resolveMemberAccess with a non-empty path against an
empty-members named-list returns undefined (path lookup
miss), which is the right semantics.
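
To illustrate why an empty members map is harmless, here is a small stand-in for a path lookup over a named-list. It is a sketch only: `lookupPath` is a hypothetical function, not the real resolveMemberAccess, whose signature is not shown in this document.

```typescript
// Trimmed GlobalValue union: just enough to demonstrate nested lookup.
type GlobalValue =
  | { kind: 'scalar-number'; value: number }
  | { kind: 'named-list'; members: Record<string, GlobalValue> };

// Hypothetical stand-in for a member-path lookup such as m$a$b.
function lookupPath(root: GlobalValue, path: string[]): GlobalValue | undefined {
  let current: GlobalValue | undefined = root;
  for (const key of path) {
    // Only named lists have members; anything else is a lookup miss.
    if (current === undefined || current.kind !== 'named-list') return undefined;
    current = Object.prototype.hasOwnProperty.call(current.members, key)
      ? current.members[key]
      : undefined;
  }
  return current;
}

const emptyList: GlobalValue = { kind: 'named-list', members: {} };
const populated: GlobalValue = {
  kind: 'named-list',
  members: { a: { kind: 'scalar-number', value: 1 } },
};

console.log(lookupPath(emptyList, ['a'])); // undefined (path lookup miss)
console.log(lookupPath(populated, ['a'])); // { kind: 'scalar-number', value: 1 }
```

In this sketch the empty list needs no special case: the ordinary miss path already returns `undefined`.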
Two seams in src/core/parsers/r/recognizer.ts. At each,
consult the binding’s shape via a new helper
bindingShape(name, externalScope, walkBindingShapes) that
checks:
1. walkBindingShapes.get(name) (in-walk map; see §7)
2. externalScope?.globalValues.get(name)?.value (pre-scanned cross-file shape)
3. undefined if neither matches.

$-mutate seam (recognizeAssignment, ~line 764)

Current: binding$col = expr always emits
data-mutate regardless of what binding
holds.
New:
```
let shape = bindingShape(dataName, ...)
if shape !== undefined and isListShaped(shape):
    skip the data-mutate emit; fall through to webr-opaque
else:
    emit data-mutate as today
```
The default-allow stance (emit data-mutate when shape is unknown)
preserves existing behavior for
df <- read.csv(...); df$newcol = expr —
read.csv isn’t pre-scanned by
classifyAsGlobalValue (it’s a function call), so
shape is undefined, and we fall through to the
existing data-mutate emit. The recognizer’s own scope map
handles the canonical-binding rewrite as today.
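
The default-allow decision can be written as a small pure function. This is a sketch under assumptions: `decideDollarAssign` is a hypothetical name (the real logic sits inside recognizeAssignment), and the GlobalValue union is trimmed to the shape-relevant kinds.

```typescript
// Trimmed copy of the GlobalValue union; only the shape-relevant kinds.
type GlobalValue =
  | { kind: 'vector-number'; values: number[] }
  | { kind: 'named-list'; members: Record<string, GlobalValue> }
  | { kind: 'list-opaque' }
  | { kind: 'vector-opaque' };

function isListShaped(v: GlobalValue): boolean {
  return v.kind === 'named-list' || v.kind === 'list-opaque';
}

// Hypothetical helper: which node kind `binding$col = expr` should emit.
function decideDollarAssign(shape: GlobalValue | undefined): 'data-mutate' | 'webr-opaque' {
  // Only a positively known list shape suppresses the data-mutate emit.
  if (shape !== undefined && isListShaped(shape)) return 'webr-opaque';
  // Unknown shape (e.g. df <- read.csv(...)) keeps today's behavior.
  return 'data-mutate';
}

console.log(decideDollarAssign({ kind: 'named-list', members: {} })); // webr-opaque
console.log(decideDollarAssign({ kind: 'list-opaque' }));             // webr-opaque
console.log(decideDollarAssign(undefined));                           // data-mutate
```

Because the pseudocode gates on isListShaped only, a vector-shaped binding also takes the data-mutate branch in this sketch.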
The SubsetNode.args shape carries R’s own
dataset-vs-vector signal:
| Bracket form | AST shape | Decision |
|---|---|---|
| `df[cond, ]` (comma, row-filter) | `[<expr>, <missing>]` | data-filter (today, unchanged) |
| `df[, cols]` (comma, col-select) | `[<missing>, <vector>]` | data-select (today, unchanged) |
| `df[expr]` (no comma, single arg) | `[<expr>]` | data-filter ONLY IF binding is known-dataset (via `scope.get(name)?.kind` in dataset-producing set). Otherwise → webr-opaque |
| `lst[["a"]]` (double-bracket) | outer `[<index>]`, inner `args: []` (named-list-eval encoding) | webr-opaque always (list-element access, not a Dataset operation) |
| `f()[K]` (bracket on function-call object) | `node.object.type === 'function-call'` | webr-opaque (no binding to verify) |

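
The decision column of the table above can be sketched as a single function. Everything here is illustrative: `RNode`, `Arg`, `decideBracket`, and the `isKnownDataset` callback are hypothetical stand-ins for the recognizer's real AST and scope lookup, and the order of the checks is an assumption. Forms the table does not list return `'not-covered'` rather than guessing.

```typescript
// Hypothetical, simplified node shapes; the real SubsetNode may differ.
type Arg = { type: 'missing' } | { type: 'expr' };

interface SubsetNode {
  type: 'subset';
  object: RNode;
  args: Arg[];
}

type RNode =
  | { type: 'identifier'; name: string }
  | { type: 'function-call'; callee: string }
  | SubsetNode;

type BracketDecision = 'data-filter' | 'data-select' | 'webr-opaque' | 'not-covered';

function decideBracket(
  node: SubsetNode,
  isKnownDataset: (name: string) => boolean, // stand-in for the scope check
): BracketDecision {
  const { object, args } = node;

  // f()[K]: no binding to verify.
  if (object.type === 'function-call') return 'webr-opaque';

  // lst[["a"]]: outer subset whose object is an inner subset with args: [].
  if (object.type === 'subset') {
    return object.args.length === 0 ? 'webr-opaque' : 'not-covered';
  }

  // From here on, object is an identifier.
  const form = args.map((a) => a.type).join(',');
  if (form === 'expr,missing') return 'data-filter'; // df[cond, ]
  if (form === 'missing,expr') return 'data-select'; // df[, cols]
  if (form === 'expr') {
    // df[expr]: typed only when the binding is a known dataset.
    return isKnownDataset(object.name) ? 'data-filter' : 'webr-opaque';
  }
  return 'not-covered';
}
```

Passing the dataset check in as a callback keeps the sketch independent of how the scope map is represented.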
The comma-form paths are dataset-specific R syntax and stay as today (they’re never the over-fire surface). The new logic refines:

- No-comma single-arg form (df[expr]). Current: emits data-filter unconditionally. New: only when the binding is known-dataset-shaped (from scope.get(name)). The “known-dataset-shaped” set is the existing dataset-producing kinds: data-load, data-mutate, data-filter, data-select, data-summarise, data-join, data-bind-rows, data-bind-cols, data-transform, and any recognized typed reader call that produces a Dataset. A helper isDatasetProducingKind(call) will be added if one is not already present.
- Double-bracket form. lst[["a"]] is encoded as a nested SubsetNode with an outer `args: [<literal>]` and inner `SubsetNode { object: identifier, args: [] }`. Detect this shape and route to opaque always: [[ ]] is list-element access in R, never a dataset operation.
- Function-call object. When node.object.type === 'function-call', there’s no binding to look up, so route to opaque.

If a future typed primitive needs to consume a binding-built
param bag (m <- list(); m$x = expr; analyze(m) where
analyze is recognized), the recognizer for that primitive
can whitelist m by adding it to a “param-bag-input” set
before the $-mutate seam consults the shape. No primitive
in the current corpus needs this; we add the hook only when a real case
arises.
recognizeR already maintains a local
scope: Map<string, AnalysisCall> for the
canonical-binding rewrite. We add a sibling local map:
```typescript
const walkBindingShapes: Map<string, GlobalValue> = new Map();
```

Populated at every recognizeAssignment entry where the
RHS classifies via the loosened classifyAsGlobalValue. This
catches inline assignments inside function bodies, if
blocks, and loops — sites the file-registry pre-scan doesn’t reach (the
pre-scan only walks top-level ast.statements).
Lookup precedence in bindingShape(name, externalScope, walkBindingShapes):

1. walkBindingShapes.get(name): most local, most current.
2. externalScope?.globalValues.get(name)?.value: cross-file pre-scanned.
3. undefined: fall through to default behavior.

The within-walk map is not shared across
recognizeR invocations on different files — each file’s
recognizer pass starts with an empty map. Cross-file shapes come
exclusively through externalScope.globalValues, which is
already wired by FileRegistry.runRecognizer.
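
A minimal sketch of the lookup precedence follows. The `ExternalScope` interface is an assumption inferred from the `globalValues.get(name)?.value` access pattern; the real scope object may carry more fields, and the GlobalValue union is trimmed.

```typescript
// Trimmed copy of the GlobalValue union.
type GlobalValue =
  | { kind: 'named-list'; members: Record<string, GlobalValue> }
  | { kind: 'list-opaque' }
  | { kind: 'vector-opaque' };

// Assumed entry shape, inferred from `globalValues.get(name)?.value`.
interface ExternalScope {
  globalValues: Map<string, { value: GlobalValue }>;
}

function bindingShape(
  name: string,
  externalScope: ExternalScope | undefined,
  walkBindingShapes: Map<string, GlobalValue>,
): GlobalValue | undefined {
  // 1. in-walk map, 2. cross-file pre-scan, 3. undefined.
  return walkBindingShapes.get(name) ?? externalScope?.globalValues.get(name)?.value;
}

// Example: the in-walk shape wins over a stale cross-file entry.
const demoExternal: ExternalScope = {
  globalValues: new Map([['model', { value: { kind: 'vector-opaque' } as GlobalValue }]]),
};
const demoWalk = new Map<string, GlobalValue>([['model', { kind: 'list-opaque' }]]);

console.log(bindingShape('model', demoExternal, demoWalk)?.kind);  // list-opaque
console.log(bindingShape('model', demoExternal, new Map())?.kind); // vector-opaque
console.log(bindingShape('df', demoExternal, demoWalk));           // undefined
```

Creating a fresh `walkBindingShapes` map per recognizeR invocation is what keeps in-walk shapes from leaking across files in this sketch.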
Tests in recognizer.test.ts:

1. `model <- list(); model$x = c(...)` → 1 webr-opaque (`list()`) + 1 webr-opaque (the `$x = ...` mutation). 0 data-mutate.
2. `model <- list(); model$x = c(...); model$y = c(...); model$z = cbind(...)` → 4 webr-opaque, 0 data-mutate.
3. `vec <- c(0.1, 0.2, 0.3); rho <- vec[2]` → 0 data-filter (was 1 today). The `vec[2]` access falls to opaque.
4. `Sys.info()[7]` → 0 data-filter, 1 webr-opaque (bare function-call object → opaque per §6.2).
5. `actual <- moments_wgts(data); data_moments <- actual[[1]]` → 0 data-filter for the `[[1]]` access. Double-bracket → opaque per §6.2.
6. `list()` classification: `model <- list()` followed by no mutations → `globalValues.get('model').value.kind === 'named-list'` and `members === {}`.
7. `list(a = some_call(), b = x)` classification: → `globalValues.get('m').value.kind === 'list-opaque'`.
8. `c(some_call(), x)` classification: → `globalValues.get('v').value.kind === 'vector-opaque'`.
9. `df <- read.csv("..."); df[cond, ]` → still 1 data-filter (comma-form path unchanged).
10. `df <- read.csv("..."); df$newcol = expr` → still 1 data-mutate (binding shape unknown for read.csv returns, default-allow).
11. `df <- read.csv("..."); df[1:5]` (no-comma single-arg) → 1 data-filter (binding is known-dataset via scope).
12. `m <- lm(y ~ x, data = df); m$residuals` → reads, not writes. Unaffected. (Sanity check.)
13. `feols(..., panel.id = ~ unit + time)`: formula-arg, no binding. Unaffected.
14. `stargazer(..., add.lines = list(c(...)))`: inline `list(...)`, no binding-write pattern. Unaffected.

Pinned-count changes in the replicate tests:

- `$=` assignments fall to webr-opaque (no phantom data-mutates). Adjust the existing pinned node count.
- `vec[idx]` accesses no longer emit data-filter.
- `Sys.info()[7]` and `actual[[1]]` no longer emit data-filter.

Paper-match opt-in tests (RUN_PAPER_MATCH=1) pin numerical outputs of headline regressions and are unaffected by recognizer-classification changes.
Medium risk. The fix is at two recognizer seams that fire across every recognized paper.

- $-mutate seam: the skip applies only to list-shaped bindings (list() + list-opaque) and doesn’t intersect with any existing recognized dataset-binding source. Default-allow (unknown shape → emit data-mutate) preserves all existing dataset-mutate paths.
- Bracket seam: the no-comma single-arg form is gated on known-dataset bindings (scope check, additive); double-bracket and function-call-object cases route to opaque (today they emit data-filter against meaningless bindings, so this is a pure improvement).
- Classifier loosening: matches only the listed constructors (list(...) / c(...) / numeric(N) / etc.). Doesn’t shadow scope.

Mitigations:

- Run the full test suite, including the existing replicate-*.test.ts corpus, before merging.
- Manually inspect node-count diffs on the 8 affected papers vs pre-existing pinned expectations.
- The Cross-paper signal text in BACKLOG bullets at lines 167 and 285 already enumerates the eight papers that need pinned-count updates; no surprise sites should emerge.

Not part of this change:

- Changing `scope: Map<string, AnalysisCall>` to add an explicit kind field per binding (Option C from brainstorming). Out of scope; the kind is already on the AnalysisCall itself.
- … globalValues map when they land; no parser-specific work here.
- A param-bag typed primitive for opaque solvers: discussed in §2/§3, rejected as negative ROI (consumers stay opaque).
- A scalar-extract typed primitive for vec[N]: same reasoning.
- Grouping binding$x = … opaque nodes into a single “param-bag construction” cluster: UI presentation refinement, separate work.

Files changed:

- src/core/rdata/manifest.ts: add list-opaque and vector-opaque kinds, isListShaped / isVectorShaped / isScalarShaped predicates.
- src/core/parsers/file-registry.ts: extend classifyAsGlobalValue (5 new patterns per §5).
- src/core/parsers/r/recognizer.ts: add bindingShape() helper, walkBindingShapes map, isDatasetProducingKind() helper, consultation logic at recognizeAssignment $-mutate path and recognizeBracketSubset.
- src/core/parsers/r/recognizer.test.ts: tests 1–14 (§8.1, 8.2).
- src/core/parsers/file-registry.test.ts: classifier tests for §5’s new patterns.
- `src/core/pipeline/replicate-*.test.ts`: adjust pinned node counts for papers #20, #39, #40, #42, #43, #44, #53.
- BACKLOG.md: mark bullets at lines 167 and 285 as `- [x] DONE` with the changeset summary.

Single PR with all changes (recognizer + classifier + replicate-test
count updates) since the count expectations are tightly coupled to the
recognizer behavior. CI gate:
npm run build && npm test && npm run lint.
E2E unaffected (no UI surface). Paper-match opt-in unaffected (numerical
results, not pipeline shape).