Replication Audit System — Design Spec

Date: 2026-04-14
Goal: A developer-facing metric system that tracks how much of each paper’s estimation pipeline Interlyse can execute, with dependency-aware blocker drill-down to prioritize feature development.

Core Concepts

Unit of Measurement: Model Execution

The atomic unit is a model node — any recognized estimation call (lm(), feols(), felm(), glm(), ivreg(), lm_robust(), rq(), etc.) that produces regression/test results. Table output (stargazer, etable) is a downstream presentation concern, to be tracked separately later.

Each model node has an upstream dependency chain in the pipeline DAG: data-load -> transforms -> … -> model. A model is only executable if every node in its chain is supported.

Two Layers

  1. Static analysis — parses R code, builds the DAG, classifies every node as supported or blocked. Runs on the full corpus (all downloaded packages, no data files needed). Answers: “what could work?”

  2. Execution verification — actually runs the pipeline with real data, compares TS output against R output. Runs on a curated subset where data is staged. Answers: “what actually works?”

Three Model Statuses

The three statuses are blocked, executable, and verified. A model can be executable but not verified (no data staged), or blocked but still partially informative (some upstream nodes work and the blocker is identified).

Data Model

Per-Model Record

type ModelStatus =
  | { status: 'blocked'; blockers: BlockerInfo[] }
  | { status: 'executable' }
  | { status: 'verified'; match: boolean; divergence?: DivergenceInfo }

type BlockerInfo = {
  feature: string        // e.g., "pivot_wider", "read_xlsx", "setwd"
  nodeType: string       // pipeline node type that's unsupported
  nodeLabel: string      // human-readable label from the R source
  upstreamOf: string[]   // which model nodes this blocks
}

type DivergenceInfo = {
  coefficient: string    // which coef diverged
  tsValue: number
  rValue: number
  delta: number
}

Paper Scorecard

type PaperScorecard = {
  paperId: string              // e.g., "qje-investor-memory"
  journal: string
  rFiles: number
  modelsDetected: number
  modelsExecutable: number     // static analysis
  modelsVerified: number       // execution pass (0 if not staged)
  modelsVerifiedPassing: number
  dataStaged: boolean
  blockers: Record<string, {   // keyed by feature name
    modelsBlocked: number
    modelLabels: string[]
  }>
}

Corpus Rollup

type CorpusRollup = {
  timestamp: string
  totalPapers: number
  totalModels: number
  totalExecutable: number
  totalVerified: number
  totalVerifiedPassing: number
  papersFullyExecutable: number
  featurePriority: {
    feature: string
    modelsUnblocked: number
    papersAffected: number
  }[]
  paperScorecards: PaperScorecard[]
}
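
The featurePriority list can be derived mechanically from the per-paper scorecards. A minimal sketch, with PaperScorecard trimmed to the fields needed here; the tie-break on papersAffected is an assumption, not decided behavior:

```typescript
// Aggregate per-paper blocker counts into the corpus-level feature ranking.
// PaperScorecard is reduced to the fields this function reads.
type PaperScorecard = {
  paperId: string;
  blockers: Record<string, { modelsBlocked: number; modelLabels: string[] }>;
};

type FeaturePriority = {
  feature: string;
  modelsUnblocked: number;
  papersAffected: number;
};

function rankFeatures(scorecards: PaperScorecard[]): FeaturePriority[] {
  const byFeature = new Map<string, FeaturePriority>();
  for (const card of scorecards) {
    for (const [feature, info] of Object.entries(card.blockers)) {
      const entry =
        byFeature.get(feature) ?? { feature, modelsUnblocked: 0, papersAffected: 0 };
      entry.modelsUnblocked += info.modelsBlocked;
      entry.papersAffected += 1; // each paper contributes at most one blocker entry per feature
      byFeature.set(feature, entry);
    }
  }
  // Rank by models unblocked, then by papers affected (assumed tie-break).
  return [...byFeature.values()].sort(
    (a, b) => b.modelsUnblocked - a.modelsUnblocked || b.papersAffected - a.papersAffected
  );
}
```

The ranking answers the prioritization question directly: the top row is the single feature whose implementation unblocks the most models.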

Component 1: Static Analysis Engine

Architecture

A Node CLI script (scripts/audit-static.ts) that reuses the actual Interlyse parser pipeline:

R files  ->  parseR()  ->  recognizeR()  ->  mapToPipeline()  ->  DAG
                                                                    |
                                                          classify each node
                                                                    |
                                                          walk back from models
                                                                    |
                                                          scorecard JSON

This reuses the real parser/recognizer/mapper so “is this supported?” stays in one place. No duplicated classification logic in Python.

Supported Features Manifest

The script doesn’t need a separate manifest — it uses the actual pipeline mapper and executor registry. If mapToPipeline() produces a node and executorRegistry has a handler for it, it’s supported. If the recognizer doesn’t recognize a function call, it’s unsupported. The source of truth is the code itself.

For function calls that the recognizer skips (not recognized), the script captures them as potential blockers by diffing “all function calls in the AST” against “function calls that produced pipeline nodes.”
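
That diff is a straightforward set difference over call names. A sketch, assuming the script has already flattened the AST into call-name lists (allCallNames and mappedCallNames are hypothetical inputs; the real data comes from the parser AST and mapToPipeline()):

```typescript
// Surface unrecognized calls by set-differencing all call names found in
// the AST against the calls that produced pipeline nodes.
function findUnrecognizedCalls(
  allCallNames: string[],
  mappedCallNames: string[]
): string[] {
  const mapped = new Set(mappedCallNames);
  const seen = new Set<string>();
  const unrecognized: string[] = [];
  for (const name of allCallNames) {
    // Preserve first-seen source order; dedupe repeats.
    if (!mapped.has(name) && !seen.has(name)) {
      seen.add(name);
      unrecognized.push(name);
    }
  }
  return unrecognized;
}
```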

Blocker Identification

For each model node in the DAG:

  1. Walk the dependency chain backward (follow input edges).
  2. For each upstream node, check: does an executor exist for this node type?
  3. For unrecognized function calls that appear in the R source between the data-load statement and the model call (by source position / Span), flag them as potential blockers — these likely represent transform steps the recognizer missed.
  4. The primary blocker is the first unsupported node in topological order (closest to the data source) — fixing it might unblock a cascade.
  5. Secondary blockers are additional unsupported nodes further downstream — even after fixing the primary, these still need work.
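
The backward walk can be sketched as follows. The Node shape and hasExecutor predicate are hypothetical stand-ins for the real DAG types and executor registry; for a linear chain, reversing DFS discovery order approximates topological order:

```typescript
// Walk upstream from a model node and return unsupported node ids,
// ordered closest-to-data-source first (so element 0 is the primary blocker).
type Node = { id: string; type: string; inputs: string[] };

function findBlockers(
  modelId: string,
  nodes: Map<string, Node>,
  hasExecutor: (nodeType: string) => boolean
): string[] {
  // Collect the upstream chain via DFS over input edges.
  const chain: Node[] = [];
  const visited = new Set<string>();
  const stack = [modelId];
  while (stack.length > 0) {
    const id = stack.pop()!;
    if (visited.has(id)) continue;
    visited.add(id);
    const node = nodes.get(id);
    if (!node) continue;
    chain.push(node);
    stack.push(...node.inputs);
  }
  // Discovery order runs model -> source; reverse it so the first
  // unsupported node is the one nearest the data source.
  return chain
    .reverse()
    .filter((n) => !hasExecutor(n.type))
    .map((n) => n.id);
}
```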

Models Without Data Sources

Some models have no recognizable upstream data-load — the paper uses attach(), pre-loaded objects, or data constructed entirely in custom functions. These models are classified as blocked with a synthetic blocker "data-source-unknown". They can transition to executable once the data path is resolved (e.g., by staging data and mapping it manually to the model’s expected variables).

Multi-File Handling

Papers with source() calls use the existing FileRegistry + topological sort (M5a). The audit script processes all R files in a package in dependency order, accumulating scope, just like the real pipeline would.
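
The dependency ordering amounts to a depth-first topological sort. A sketch, where the deps map (file -> files it source()s) is a hypothetical stand-in for what FileRegistry provides, and a source() cycle is treated as an error:

```typescript
// Order R files so that source()d dependencies are processed first.
function topoOrder(deps: Map<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();
  const visit = (file: string) => {
    const s = state.get(file);
    if (s === "done") return;
    if (s === "visiting") throw new Error(`source() cycle at ${file}`);
    state.set(file, "visiting");
    for (const dep of deps.get(file) ?? []) visit(dep); // dependencies first
    state.set(file, "done");
    order.push(file);
  };
  for (const file of deps.keys()) visit(file);
  return order;
}
```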

Running

# Audit all downloaded packages
node scripts/audit-static.js

# Audit a single paper
node scripts/audit-static.js --paper qje-investor-memory

# Output: reference-papers/audit/static-results.json

Component 2: Execution Verification Engine

Architecture

A test harness (scripts/audit-verify.ts) for papers with staged data:

Paper R code  ->  local R process  ->  ground-truth JSON (coefficients per model)
Paper R code  ->  Interlyse TS pipeline  ->  TS results (coefficients per model)
                                                    |
                                              diff TS vs R
                                                    |
                                           verification JSON

R Ground-Truth Extraction

For each staged paper, a small R wrapper script:

  1. source()s the paper’s R code
  2. Intercepts estimation function calls (wraps lm, feols, etc.)
  3. For each model, extracts: coefficients, standard errors, p-values, R-squared, N, residual df
  4. Writes structured JSON: reference-papers/audit/r-ground-truth/<paper-id>.json

This R wrapper is semi-automated — the static analysis tells us which functions appear and how many models to expect. The wrapper template handles the common cases (lm, feols, felm, glm, ivreg). Papers with unusual patterns may need manual adjustment.

TS Execution

The verify script:

  1. Loads the paper’s data files into Interlyse Dataset objects
  2. Parses the R code through the full pipeline (parser -> recognizer -> mapper -> executor)
  3. Extracts the same coefficient/SE/p-value values from each model node’s result
  4. Writes reference-papers/audit/ts-results/<paper-id>.json

Diffing

Tolerances (matching existing test conventions):

  - Coefficients/statistics: |ts - r| < 0.00005
  - P-values: |ts - r| < 0.00001
  - N, df: exact match

A model is verified passing when all extracted values match. A model is verified failing when it executes but values diverge — the divergence info captures which coefficient diverged and by how much.
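
A sketch of the per-model diff under these tolerances. ModelResult is a hypothetical flattened shape standing in for the ground-truth and ts-results JSON; SE and p-value checks would follow the same pattern:

```typescript
// Compare one model's TS output against the R ground truth.
type ModelResult = {
  coefficients: Record<string, number>;
  n: number;
  df: number;
};

type Divergence = { coefficient: string; tsValue: number; rValue: number; delta: number };

const COEF_TOL = 0.00005; // |ts - r| tolerance for coefficients/statistics

function diffModel(
  ts: ModelResult,
  r: ModelResult
): { passing: boolean; divergences: Divergence[] } {
  const divergences: Divergence[] = [];
  for (const [name, rValue] of Object.entries(r.coefficients)) {
    const tsValue = ts.coefficients[name];
    const delta = Math.abs(tsValue - rValue);
    // A coefficient missing on the TS side yields a NaN delta, which
    // fails the tolerance check and is reported as a divergence.
    if (!(delta < COEF_TOL)) divergences.push({ coefficient: name, tsValue, rValue, delta });
  }
  // N and df must match exactly.
  const passing = divergences.length === 0 && ts.n === r.n && ts.df === r.df;
  return { passing, divergences };
}
```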

Data Staging

Staged papers live in reference-papers/audit/data/<paper-id>/ with the CSV/DTA files needed for execution. A reference-papers/audit/staged-papers.json manifest tracks which papers are staged and ready for verification.

Running

# Verify all staged papers
node scripts/audit-verify.js

# Verify a single paper
node scripts/audit-verify.js --paper qje-investor-memory

# Output: reference-papers/audit/verification-results.json

Component 3: Reporting

Markdown Report

A script (scripts/audit-report.ts) that consumes the static and verification JSON outputs and produces reference-papers/REPLICATION-AUDIT.md.

Format:

# Replication Audit — YYYY-MM-DD

## Headline
Models executable: X / Y (Z%) across N R papers
Models verified:   A / B (C%) across M staged papers
Papers 100% executable: P / N

## Feature Priority
| Feature           | Models unblocked | Papers affected |
|-------------------|:----------------:|:---------------:|
| ...               | ...              | ...             |

## Per-Paper Scorecards
### paper-id (K models)
- Executable: X/K (Z%)
- Verified: A/B (C%) [or "not staged"]
- Blockers:
  - feature_name -> blocks N models

Diff Mode

npm run audit -- --diff compares against the previous REPLICATION-AUDIT.md (or a saved static-results.json snapshot) and outputs only what changed:

## Changes since last audit (YYYY-MM-DD)
- Added: distinct() support
- Models executable: 3,412 -> 3,455 (+43)
- Papers at 100%: 14 -> 16 (+2)
- Newly unblocked papers: jpe-dissecting-financial-crises, restud-inference-single-treated-cluster

Previous snapshots are saved in reference-papers/audit/history/ with timestamps.
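
The diff computation itself is a field-by-field comparison of two rollups. A sketch over two of the CorpusRollup counters; the message strings are illustrative, not a fixed format:

```typescript
// Compare the current rollup against a saved snapshot and report deltas.
// Snapshot is trimmed to the counters compared here.
type Snapshot = { totalExecutable: number; papersFullyExecutable: number };

function diffSnapshots(prev: Snapshot, curr: Snapshot): string[] {
  const lines: string[] = [];
  const fmt = (n: number) => (n >= 0 ? `+${n}` : `${n}`);
  if (curr.totalExecutable !== prev.totalExecutable) {
    lines.push(
      `Models executable: ${prev.totalExecutable} -> ${curr.totalExecutable} (${fmt(curr.totalExecutable - prev.totalExecutable)})`
    );
  }
  if (curr.papersFullyExecutable !== prev.papersFullyExecutable) {
    lines.push(
      `Papers at 100%: ${prev.papersFullyExecutable} -> ${curr.papersFullyExecutable} (${fmt(curr.papersFullyExecutable - prev.papersFullyExecutable)})`
    );
  }
  return lines; // empty when nothing changed
}
```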

Top-Level Command

# Full audit: static + verify + report
npm run audit

# Static only (no data needed)
npm run audit:static

# Verify only (staged papers)
npm run audit:verify

# Report with diff
npm run audit -- --diff

Wired up via package.json scripts.

File Structure

scripts/
  audit-static.ts          # Static DAG analysis CLI
  audit-verify.ts          # Execution verification CLI
  audit-report.ts          # Markdown report generator

reference-papers/
  audit/
    static-results.json    # Latest static analysis output
    verification-results.json  # Latest verification output
    r-ground-truth/        # R output per staged paper
      qje-investor-memory.json
      ...
    ts-results/            # TS output per staged paper
      qje-investor-memory.json
      ...
    data/                  # Staged data files per paper
      qje-investor-memory/
        data.csv
        ...
    staged-papers.json     # Manifest of staged papers
    history/               # Snapshots for diff mode
      2026-04-14.json
      ...
  REPLICATION-AUDIT.md     # Human-readable report (committed)

Scope & Non-Goals

Growth Path

  1. Start: Static analysis on all 36 R papers. Execution verification on 3-5 papers with easily accessible data (MIT-licensed packages, CSV-only data).
  2. Near term: Grow staged papers to 10-15 as you download more data. Add Stata papers when the Stata parser ships.
  3. Later: The execution pass could use WebR instead of a local R process, making it fully self-contained (no R installation required). The static analysis could run in CI as a regression check.