Binding-origin disambiguation — Interlyse design spec 2026-05-17

Binding-origin disambiguation for recognizer over-fire

Status: Draft, 2026-05-17.
Closes: BACKLOG bullets at lines 167 (list-style model$col = expr is NOT data-mutate) and 285 (Sys.info()[K] / actual[[K]] bracket-on-non-dataset).
Affected papers: #20, #39, #40, #42, #43, #44, #53 (8 papers, cross-paper signal continues on any param-bag / vec[idx] pattern).

1. Problem

Interlyse’s primitive set (data-mutate, data-filter, data-select, …) is supposed to model language-agnostic statistical/data atoms: “add a column to a dataset”, “subset rows of a dataset”. Each primitive should mean the same thing in R, Stata, Python, and Julia. The recognizer’s job is to map source code to this primitive set.

Today the recognizer over-fires on R-specific object syntax whose meaning is unrelated to Interlyse’s primitives:

# Paper #39 finite-score: Gurobi solver parameter bag, NOT a dataset
model <- list()
model$vtype = c('B','B','C')      # → currently emits data-mutate
model$obj   = c(1,2,3)             # → currently emits data-mutate
model$A     = cbind(...)           # → currently emits data-mutate
gurobi(model, params)              # ← opaque consumer

# Paper #44 privatizing-disability-insurance: numeric vector, NOT a dataset
Income.Drop <- c(0.1, 0.2, 0.3)
rho <- Income.Drop[2]              # → currently emits data-filter

# Paper #40 targeting-precision-medicine: function-call object, NOT a binding
user <- Sys.info()[7]              # → currently emits data-filter
data_moments <- actual[[1]]        # → currently emits data-filter (actual is opaque-call return)

The root cause: the recognizer cannot tell $ or bracket access on a list or vector apart from the same syntax on a data.frame (a dataset), because it does not consult the binding’s shape at the disambiguation sites. The binding$col = expr and binding[expr] recognition paths emit typed primitives unconditionally, manufacturing phantom DAG nodes that operate on bindings that aren’t datasets.

The fix is to consult the binding’s known shape (when available) before classifying — and to route mis-classifications to webr-opaque, which is the honest representation: “this is R-specific plumbing syntax that doesn’t map to an Interlyse primitive.”

2. Non-goal

We are not promoting list$col = expr or vec[idx] to typed primitives. Their consumers are opaque (Gurobi, nloptr, mle2, optim, designmatch, JAGS — none are typed in Interlyse), their contents are opaque, and no executable surface unlocks by classifying them. They stay at webr-opaque, which preserves the abstraction: Interlyse’s primitive set models statistical/data intent, not language-specific object plumbing.

3. Cross-language anchor

The cross-language abstraction lens validates the choice to route these to opaque rather than promote them to primitives:

| Pattern | R | Python | MATLAB | Julia | Stata |
|---|---|---|---|---|---|
| Param bag → opaque solver | m <- list(); m$x = …; gurobi(m) | m = {}; m['x'] = …; gp.Model(m) | m.x = …; gurobi(m) | m = Dict(); m[:x] = …; gurobi(m) | options on commands (no analogue) |
| Scalar from numeric vector | vec[2] | vec[1] | vec(2) | vec[2] | vec[2] |

The syntactic shape exists cross-language, but the semantic analysis intent does not: in every case the param bag’s content is consumed by an opaque solver and the scalar’s value feeds opaque calibration code. Promoting them to typed primitives would add DAG nodes without unlocking any inspection / computation / comparison / export surface — negative ROI.

4. Type extension

Extend GlobalValue in src/core/rdata/manifest.ts:

export type GlobalValue =
  | { kind: 'scalar-string'; value: string }
  | { kind: 'scalar-number'; value: number }
  | { kind: 'scalar-bool';   value: boolean }
  | { kind: 'vector-string'; values: string[] }
  | { kind: 'vector-number'; values: number[] }
  | { kind: 'vector-bool';   values: boolean[] }
  | { kind: 'named-list';    members: Record<string, GlobalValue> }  // existing — empty members map now allowed
  | { kind: 'list-opaque' }      // NEW: list-shaped, contents not statically extractable
  | { kind: 'vector-opaque' };   // NEW: vector-shaped, contents not statically extractable

The existing content-aware kinds (scalar-*, vector-* with values, named-list with members) stay unchanged and continue to carry content. The two new kinds carry only the shape — used for disambiguation when content extraction can’t succeed but the shape is unambiguous from the constructor call.

New shape predicates added alongside isScalarGlobal / isVectorGlobal / isNamedListGlobal:

export function isListShaped(v: GlobalValue): boolean {
  return v.kind === 'named-list' || v.kind === 'list-opaque';
}
export function isVectorShaped(v: GlobalValue): boolean {
  return v.kind === 'vector-string' || v.kind === 'vector-number'
      || v.kind === 'vector-bool'   || v.kind === 'vector-opaque';
}
export function isScalarShaped(v: GlobalValue): boolean {
  return v.kind === 'scalar-string' || v.kind === 'scalar-number' || v.kind === 'scalar-bool';
}
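
As a quick illustration of how the three predicates partition the union, the sketch below repeats the type and predicates from this section and runs them over one sample value per kind. The shapeLabel helper and the sample values are hypothetical and exist only for this example.

```typescript
// Copy of the GlobalValue union from this section so the sketch compiles on its own.
type GlobalValue =
  | { kind: 'scalar-string'; value: string }
  | { kind: 'scalar-number'; value: number }
  | { kind: 'scalar-bool'; value: boolean }
  | { kind: 'vector-string'; values: string[] }
  | { kind: 'vector-number'; values: number[] }
  | { kind: 'vector-bool'; values: boolean[] }
  | { kind: 'named-list'; members: Record<string, GlobalValue> }
  | { kind: 'list-opaque' }
  | { kind: 'vector-opaque' };

function isListShaped(v: GlobalValue): boolean {
  return v.kind === 'named-list' || v.kind === 'list-opaque';
}
function isVectorShaped(v: GlobalValue): boolean {
  return v.kind === 'vector-string' || v.kind === 'vector-number'
      || v.kind === 'vector-bool' || v.kind === 'vector-opaque';
}
function isScalarShaped(v: GlobalValue): boolean {
  return v.kind === 'scalar-string' || v.kind === 'scalar-number' || v.kind === 'scalar-bool';
}

// Hypothetical helper (not part of the spec): collapse a value to a coarse shape label.
type ShapeLabel = 'list' | 'vector' | 'scalar';
function shapeLabel(v: GlobalValue): ShapeLabel {
  if (isListShaped(v)) return 'list';
  if (isVectorShaped(v)) return 'vector';
  return 'scalar';
}

// One made-up sample per kind.
const samples: GlobalValue[] = [
  { kind: 'scalar-string', value: 'x' },
  { kind: 'scalar-number', value: 1 },
  { kind: 'scalar-bool', value: true },
  { kind: 'vector-string', values: ['B', 'B', 'C'] },
  { kind: 'vector-number', values: [0.1, 0.2, 0.3] },
  { kind: 'vector-bool', values: [true, false] },
  { kind: 'named-list', members: {} },
  { kind: 'list-opaque' },
  { kind: 'vector-opaque' },
];

for (const sample of samples) {
  console.log(sample.kind, '->', shapeLabel(sample));
}
```

Every kind matches exactly one predicate, so a seam can branch on shape without caring whether the content was statically extractable.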

We deliberately do not add an opaque-call kind for arbitrary function-call returns. That would collide with df <- read.csv(...) and other recognized reader/mutate/filter chains where the recognizer’s own scope map correctly identifies the binding as a Dataset. Disambiguation for unknown function-call returns (paper #40’s actual <- moments_wgts(...) case) is handled by the bracket-syntax check in §6, not by a new GlobalValue kind.

5. Classifier extension

Loosen classifyAsGlobalValue in src/core/parsers/file-registry.ts:

| Source pattern | Today | After |
|---|---|---|
| list() (empty) | undefined | { kind: 'named-list', members: {} } |
| list(a = "x", b = 1) (all-static) | { kind: 'named-list', members: {...} } | unchanged |
| list(a = some_call(), b = identifier) (non-static members) | undefined | { kind: 'list-opaque' } |
| c(1, 2, 3) (all-static numeric) | { kind: 'vector-number', values: [1,2,3] } | unchanged |
| c(some_call(), x) (non-static elements) | undefined | { kind: 'vector-opaque' } |
| numeric(n) / integer(n) / character(n) / vector("numeric", n) | undefined | { kind: 'vector-opaque' } |
| any other function call (read.csv, lm, opaque) | undefined | undefined (unchanged; the recognizer’s scope owns this) |

The rule: classify by constructor name (list, c, numeric/integer/character/vector). If the constructor matches and the contents are static, return the existing rich kind. If the constructor matches but contents aren’t static, fall through to the shape-only kind. Any other RHS stays undefined, leaving the recognizer’s scope map as the authoritative source for that binding.
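
The constructor-name rule can be sketched roughly as follows. This is a minimal illustration, not the real classifyAsGlobalValue: the Expr and Arg node types, the classifySketch name, and the treatment of unlabelled list members and empty c() as shape-only are all assumptions made for the example, and only literal numbers and strings count as static here.

```typescript
// Hypothetical, simplified RHS AST. The real parser's node types are not shown in this spec.
type Expr =
  | { type: 'number'; value: number }
  | { type: 'string'; value: string }
  | { type: 'identifier'; id: string }
  | { type: 'call'; callee: string; args: Arg[] };
interface Arg { label?: string; value: Expr }

// Subset of the GlobalValue union from §4, enough for this sketch.
type GlobalValue =
  | { kind: 'scalar-string'; value: string }
  | { kind: 'scalar-number'; value: number }
  | { kind: 'vector-string'; values: string[] }
  | { kind: 'vector-number'; values: number[] }
  | { kind: 'named-list'; members: Record<string, GlobalValue> }
  | { kind: 'list-opaque' }
  | { kind: 'vector-opaque' };

const VECTOR_ALLOCATORS = new Set(['numeric', 'integer', 'character', 'vector']);

function classifyList(args: Arg[]): GlobalValue {
  const members: Record<string, GlobalValue> = {};
  for (const arg of args) {
    const e = arg.value;
    // Assumption: only labelled literal members count as static in this sketch.
    if (arg.label === undefined) return { kind: 'list-opaque' };
    if (e.type === 'number') members[arg.label] = { kind: 'scalar-number', value: e.value };
    else if (e.type === 'string') members[arg.label] = { kind: 'scalar-string', value: e.value };
    else return { kind: 'list-opaque' }; // call or identifier: shape only
  }
  return { kind: 'named-list', members }; // list() lands here with members = {}
}

function classifyC(args: Arg[]): GlobalValue {
  const nums: number[] = [];
  const strs: string[] = [];
  for (const arg of args) {
    const e = arg.value;
    if (e.type === 'number') nums.push(e.value);
    else if (e.type === 'string') strs.push(e.value);
    else return { kind: 'vector-opaque' }; // non-static element: shape only
  }
  if (args.length > 0 && nums.length === args.length) return { kind: 'vector-number', values: nums };
  if (args.length > 0 && strs.length === args.length) return { kind: 'vector-string', values: strs };
  return { kind: 'vector-opaque' }; // empty or mixed literals (assumption for this sketch)
}

function classifySketch(rhs: Expr): GlobalValue | undefined {
  if (rhs.type !== 'call') return undefined; // literal scalars are out of scope for this sketch
  if (rhs.callee === 'list') return classifyList(rhs.args);
  if (rhs.callee === 'c') return classifyC(rhs.args);
  if (VECTOR_ALLOCATORS.has(rhs.callee)) return { kind: 'vector-opaque' };
  return undefined; // read.csv, lm, ...: the recognizer's scope map owns these
}

// Small builders for the examples below.
const num = (value: number): Expr => ({ type: 'number', value });
const call = (callee: string, ...args: Arg[]): Expr => ({ type: 'call', callee, args });

console.log(classifySketch(call('list')));
console.log(classifySketch(call('list', { label: 'a', value: call('some_call') })));
console.log(classifySketch(call('c', { value: num(1) }, { value: num(2) }, { value: num(3) })));
console.log(classifySketch(call('read.csv', { value: { type: 'string', value: 'f.csv' } })));
```

The point of the fall-through order is that a matched constructor never returns undefined: it degrades from the rich kind to the shape-only kind, while unmatched calls stay undefined.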

The members: {} relaxation on named-list is a non-breaking change: existing isNamedListGlobal and resolveMemberAccess (in named-list-eval.ts) already handle the empty case correctly — resolveMemberAccess with a non-empty path against an empty-members named-list returns undefined (path lookup miss), which is the right semantics.

6. Recognizer consultation seams

Two seams in src/core/parsers/r/recognizer.ts. At each, consult the binding’s shape via a new helper bindingShape(name, externalScope, walkBindingShapes) that checks:

  1. walkBindingShapes.get(name) (in-walk map — see §7)
  2. externalScope?.globalValues.get(name)?.value (pre-scanned cross-file shape)
  3. Returns undefined if neither matches.
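
A minimal sketch of the lookup order above, assuming a pared-down external scope. Only the bindingShape signature and the access path globalValues.get(name)?.value come from this spec; the ExternalScopeSketch interface and the sample bindings are hypothetical.

```typescript
// Subset of the GlobalValue union from §4; enough to show the lookup order.
type GlobalValue =
  | { kind: 'vector-number'; values: number[] }
  | { kind: 'named-list'; members: Record<string, GlobalValue> }
  | { kind: 'list-opaque' }
  | { kind: 'vector-opaque' };

// Assumed minimal shape of the external scope (hypothetical).
interface ExternalScopeSketch {
  globalValues: Map<string, { value: GlobalValue }>;
}

function bindingShape(
  bindingName: string,
  externalScope: ExternalScopeSketch | undefined,
  walkBindingShapes: Map<string, GlobalValue>,
): GlobalValue | undefined {
  // 1. in-walk map, 2. pre-scanned cross-file shape, 3. undefined.
  return walkBindingShapes.get(bindingName)
    ?? externalScope?.globalValues.get(bindingName)?.value;
}

// Made-up state: 'model' is known to both sources, 'grid' only to the pre-scan.
const externalScope: ExternalScopeSketch = {
  globalValues: new Map<string, { value: GlobalValue }>([
    ['model', { value: { kind: 'named-list', members: {} } }],
    ['grid', { value: { kind: 'vector-number', values: [1, 2, 3] } }],
  ]),
};
const walkBindingShapes = new Map<string, GlobalValue>([
  ['model', { kind: 'list-opaque' }],
]);

console.log(bindingShape('model', externalScope, walkBindingShapes)?.kind); // in-walk entry wins
console.log(bindingShape('grid', externalScope, walkBindingShapes)?.kind);  // falls back to pre-scan
console.log(bindingShape('df', externalScope, walkBindingShapes));          // unknown to both
```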

6.1 $-mutate seam (recognizeAssignment, ~line 764)

Current: binding$col = expr always emits data-mutate regardless of what binding holds.

New:

const shape = bindingShape(dataName, externalScope, walkBindingShapes);
if (shape !== undefined && isListShaped(shape)) {
  // skip the data-mutate emit; fall through to webr-opaque
} else {
  // emit data-mutate as today
}

The default-allow stance (emit data-mutate when shape is unknown) preserves existing behavior for df <- read.csv(...); df$newcol = expr: read.csv isn’t pre-scanned by classifyAsGlobalValue (it’s a function call), so shape is undefined, and we fall through to the existing data-mutate emit. The recognizer’s own scope map handles the canonical-binding rewrite as today.
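
The default-allow decision at this seam can be illustrated as a pure function over the looked-up shape. This is a sketch under assumptions: decideDollarAssign and the DollarAssignEmit type are hypothetical names, and the real seam emits nodes rather than returning a label.

```typescript
// Subset of the GlobalValue union from §4.
type GlobalValue =
  | { kind: 'vector-number'; values: number[] }
  | { kind: 'named-list'; members: Record<string, GlobalValue> }
  | { kind: 'list-opaque' }
  | { kind: 'vector-opaque' };

function isListShaped(v: GlobalValue): boolean {
  return v.kind === 'named-list' || v.kind === 'list-opaque';
}

// Hypothetical label for what the seam would emit for binding$col = expr.
type DollarAssignEmit = 'data-mutate' | 'webr-opaque';

function decideDollarAssign(shape: GlobalValue | undefined): DollarAssignEmit {
  // Only a known list shape suppresses the emit; unknown shapes keep today's behavior.
  return shape !== undefined && isListShaped(shape) ? 'webr-opaque' : 'data-mutate';
}

console.log(decideDollarAssign({ kind: 'named-list', members: {} })); // model <- list(); model$x = ...
console.log(decideDollarAssign({ kind: 'list-opaque' }));             // list with non-static members
console.log(decideDollarAssign(undefined));                           // df <- read.csv(...); df$newcol = ...
```

Note that, as the rule is written, only list-shaped bindings are diverted; any other known shape still takes the existing data-mutate path.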

6.2 Bracket-subset seam (recognizeBracketSubset, ~line 3577)

The SubsetNode.args shape carries R’s own dataset-vs-vector signal:

| Bracket form | AST shape | Decision |
|---|---|---|
| df[cond, ] (comma, row-filter) | [<expr>, <missing>] | data-filter (today, unchanged) |
| df[, cols] (comma, col-select) | [<missing>, <vector>] | data-select (today, unchanged) |
| df[expr] (no comma, single arg) | [<expr>] | data-filter ONLY IF binding is known-dataset (via scope.get(name)?.kind in dataset-producing set). Otherwise → webr-opaque |
| lst[["a"]] (double-bracket) | outer [<index>], inner args: [] (named-list-eval encoding) | webr-opaque always (list-element access, not a Dataset operation) |
| f()[K] (bracket on function-call object) | node.object.type === 'function-call' | webr-opaque (no binding to verify) |

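The comma-form paths are dataset-specific R syntax and stay as today (they’re never the over-fire surface). The new logic refines only the remaining three forms: the no-comma single-argument form, the double-bracket form, and brackets applied to a function-call object.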
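
The bracket-form decisions can be sketched as a single function. Everything about the node shape here is a simplification for illustration: SubsetSketch, the doubleBracket flag (the real encoding nests an inner node with empty args), and the isKnownDataset callback standing in for the scope lookup are hypothetical. Forms the spec does not cover return undefined rather than guessing.

```typescript
// Hypothetical, flattened stand-in for SubsetNode.
type Slot = 'present' | 'missing';
type SubsetObject = { type: 'identifier'; id: string } | { type: 'function-call' };
interface SubsetSketch {
  object: SubsetObject;
  doubleBracket: boolean; // simplification of the nested named-list-eval encoding
  slots: Slot[];          // one entry per bracket argument
}
type Decision = 'data-filter' | 'data-select' | 'webr-opaque';

function decideBracket(
  node: SubsetSketch,
  isKnownDataset: (id: string) => boolean, // stand-in for the scope.get(name)?.kind check
): Decision | undefined {
  // Comma forms: dataset-specific syntax, unchanged from today.
  if (node.slots.length === 2) {
    const [rows, cols] = node.slots;
    if (rows === 'present' && cols === 'missing') return 'data-filter';
    if (rows === 'missing' && cols === 'present') return 'data-select';
    return undefined; // not covered by the spec; left to existing behavior
  }
  const obj = node.object;
  if (obj.type === 'function-call') return 'webr-opaque'; // f()[K]: no binding to verify
  if (node.doubleBracket) return 'webr-opaque';           // lst[["a"]]: list-element access
  if (node.slots.length === 1) {
    return isKnownDataset(obj.id) ? 'data-filter' : 'webr-opaque';
  }
  return undefined;
}

// Made-up scope: only 'df' is a known dataset.
const knownDatasets = new Set(['df']);
const isDataset = (id: string): boolean => knownDatasets.has(id);
const ident = (id: string): SubsetObject => ({ type: 'identifier', id });

console.log(decideBracket({ object: ident('df'), doubleBracket: false, slots: ['present', 'missing'] }, isDataset));
console.log(decideBracket({ object: ident('Income.Drop'), doubleBracket: false, slots: ['present'] }, isDataset));
console.log(decideBracket({ object: { type: 'function-call' }, doubleBracket: false, slots: ['present'] }, isDataset));
```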

6.3 Future extensibility

If a future typed primitive needs to consume a binding-built param bag (m <- list(); m$x = expr; analyze(m) where analyze is recognized), the recognizer for that primitive can whitelist m by adding it to a “param-bag-input” set before the $-mutate seam consults the shape. No primitive in the current corpus needs this; we add the hook only when a real case arises.

7. New within-walk state

recognizeR already maintains a local scope: Map<string, AnalysisCall> for the canonical-binding rewrite. We add a sibling local map:

const walkBindingShapes: Map<string, GlobalValue> = new Map();

Populated at every recognizeAssignment entry where the RHS classifies via the loosened classifyAsGlobalValue. This catches inline assignments inside function bodies, if blocks, and loops — sites the file-registry pre-scan doesn’t reach (the pre-scan only walks top-level ast.statements).

Lookup precedence in bindingShape(name, externalScope, walkBindingShapes):

  1. walkBindingShapes.get(name): most local, most current.
  2. externalScope?.globalValues.get(name)?.value: cross-file pre-scanned.
  3. undefined: fall through to default behavior.

The within-walk map is not shared across recognizeR invocations on different files — each file’s recognizer pass starts with an empty map. Cross-file shapes come exclusively through externalScope.globalValues, which is already wired by FileRegistry.runRecognizer.
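
To make the difference between the top-level pre-scan and the within-walk map concrete, here is a small sketch over a toy statement tree. The Stmt type, shapeFromCtor (a stand-in for the loosened classifyAsGlobalValue that collapses everything to shape-only kinds), and both walker functions are hypothetical and much simpler than the real recognizer.

```typescript
// Shape-only kinds from §4; rich kinds are collapsed for brevity.
type ShapeOnly = { kind: 'list-opaque' } | { kind: 'vector-opaque' };

// Hypothetical toy AST: assignments whose RHS is a constructor call, and function bodies.
type Stmt =
  | { type: 'assign'; target: string; ctor: string }
  | { type: 'function-def'; body: Stmt[] };

// Stand-in for the loosened classifier: constructor name to shape, else undefined.
function shapeFromCtor(ctor: string): ShapeOnly | undefined {
  if (ctor === 'list') return { kind: 'list-opaque' };
  if (ctor === 'c' || ctor === 'numeric') return { kind: 'vector-opaque' };
  return undefined;
}

// Mimics the file-registry pre-scan: top-level statements only.
function preScanTopLevel(stmts: Stmt[]): Map<string, ShapeOnly> {
  const found = new Map<string, ShapeOnly>();
  for (const s of stmts) {
    if (s.type !== 'assign') continue;
    const shape = shapeFromCtor(s.ctor);
    if (shape) found.set(s.target, shape);
  }
  return found;
}

// Mimics the recognizer walk: records shapes at every assignment it enters.
function walkStatements(stmts: Stmt[], walkBindingShapes: Map<string, ShapeOnly>): void {
  for (const s of stmts) {
    if (s.type === 'function-def') {
      walkStatements(s.body, walkBindingShapes);
      continue;
    }
    const shape = shapeFromCtor(s.ctor);
    if (shape) walkBindingShapes.set(s.target, shape);
  }
}

const program: Stmt[] = [
  { type: 'assign', target: 'params', ctor: 'list' },
  {
    type: 'function-def',
    body: [
      { type: 'assign', target: 'model', ctor: 'list' },   // invisible to the pre-scan
      { type: 'assign', target: 'df', ctor: 'read.csv' },  // never classified
    ],
  },
];

const preScanned = preScanTopLevel(program);
const walkShapes = new Map<string, ShapeOnly>(); // fresh map per file, as described above
walkStatements(program, walkShapes);

console.log('pre-scan sees:', Array.from(preScanned.keys()));
console.log('walk sees:', Array.from(walkShapes.keys()));
```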

8. Testing

8.1 Unit tests (recognizer.test.ts)

  1. model <- list(); model$x = c(...) → 1 webr-opaque (list()) + 1 webr-opaque (the $x = ... mutation). 0 data-mutate.
  2. model <- list(); model$x = c(...); model$y = c(...); model$z = cbind(...) → 4 webr-opaque, 0 data-mutate.
  3. vec <- c(0.1, 0.2, 0.3); rho <- vec[2] → 0 data-filter (was 1 today). The vec[2] access falls to opaque.
  4. Sys.info()[7] → 0 data-filter, 1 webr-opaque (bare function-call object → opaque per §6.2).
  5. actual <- moments_wgts(data); data_moments <- actual[[1]] → 0 data-filter for the [[1]] access. Double-bracket → opaque per §6.2.
  6. Empty list() classification: model <- list() followed by no mutations → globalValues.get('model').value.kind === 'named-list' and members is an empty object (deep-equals {}).
  7. list(a = some_call(), b = x) classification: → globalValues.get('m').value.kind === 'list-opaque'.
  8. c(some_call(), x) classification: → globalValues.get('v').value.kind === 'vector-opaque'.

8.2 Regression coverage

  1. df <- read.csv("..."); df[cond, ] → still 1 data-filter (comma-form path unchanged).
  2. df <- read.csv("..."); df$newcol = expr → still 1 data-mutate (binding shape unknown for read.csv returns, default-allow).
  3. df <- read.csv("..."); df[1:5] (no-comma single-arg) → 1 data-filter (binding is known-dataset via scope).
  4. m <- lm(y ~ x, data = df); m$residuals → reads, not writes. Unaffected. (Sanity check.)
  5. feols(..., panel.id = ~ unit + time) — formula-arg, no binding. Unaffected.
  6. Inline arg constructions like stargazer(..., add.lines = list(c(...))) — inline list(...), no binding-write pattern. Unaffected.

8.3 Per-paper replicate tests

  1. Paper #39 finite-score: assert the Gurobi param-bag’s $-assignments (model$vtype, model$obj, model$A) fall to webr-opaque (no phantom data-mutates). Adjust the existing pinned node count.
  2. Paper #44 privatizing-disability-insurance: assert the 14 vec[idx] accesses no longer emit data-filter.
  3. Paper #20 rising-markups, #42 revolving-door, #43 monetary-fiscal, #53 bidding-for-firms: adjust expected node counts in their replicate tests (they pin current misclassified counts; expectations updated in the same commit).
  4. Paper #40 targeting-precision-medicine: assert Sys.info()[7] and actual[[1]] no longer emit data-filter.

8.4 Paper-match tests

Paper-match opt-in tests (RUN_PAPER_MATCH=1) pin numerical outputs of headline regressions — unaffected by recognizer-classification changes.

9. Regression-risk surface

Medium risk. The fix is at two recognizer seams that fire across every recognized paper.

Mitigations:

  - Run full test suite including the existing replicate-*.test.ts corpus before merging.
  - Manually inspect node-count diffs on the 8 affected papers vs pre-existing pinned expectations.
  - The Cross-paper signal text in BACKLOG bullets at lines 167 and 285 already enumerates the eight papers that need pinned-count updates; no surprise sites should emerge.

10. Out of scope

11. Files touched

12. Rollout

Single PR with all changes (recognizer + classifier + replicate-test count updates) since the count expectations are tightly coupled to the recognizer behavior. CI gate: npm run build && npm test && npm run lint. E2E unaffected (no UI surface). Paper-match opt-in unaffected (numerical results, not pipeline shape).