WebR RData Load — Design Spec

Date: 2026-05-01
Milestone: WebR follow-up (Session 5 of WebR follow-up sessions)
Predecessors:
- 2026-04-20 WebR Integration (Session 1: typed framework + lm_robust)
- 2026-04-22 WebR Opaque Nodes (Session 2: opaque path end-to-end)
- 2026-04-27 WebR VFS Bridge (Session 4: file-byte sync, extractFilePath, per-dispatch CWD)

1. Context

load(file = "*.RData") is the single biggest cross-paper blocker per the WebR audit. Cited by paper #4 (investor-memory), paper #13 (cost-of-consumer-collateral — every Table 4–6 script), and paper #17 (policy-targeting — ~245 chunks, top of the audit). All three rely on RData intermediates as their data input, and the current pipeline can’t wire those inputs into typed regression nodes.

The VFS bridge already syncs *.RData bytes into the worker’s /workspace/ and load is in KNOWN_READERS. The bytes reach R fine. The gap is recognition: load() falls through to recognizeOpaqueSideEffect, the worker runs it under captureR into globalenv, the loaded objects (dat, pdat, model_results) appear in R’s env, but the recognizer’s TS-side scope never learns of them. A subsequent feols(y ~ x, data = dat) resolves dat against an empty scope — the typed linear-model node either gets emitted with no data edge (silent pipeline malformation), or the entire feols call falls through to opaque, severing the typed model recognition.

Goal: make RData-loaded bindings first-class data sources in the typed pipeline. Recognizing load() should be enough that downstream feols(..., data = dat) flows through the existing typed-model machinery without the user pre-converting RData to CSV.

2. Approach summary

The recognizer treats load() as a declaration of intent rather than a runtime node: walking statements in source order, encountering load(file = "x.RData") updates a loadChain: string[] carried in recognizer state, but emits no AnalysisCall for the load() statement itself. When subsequent code references an unresolved identifier (dats <- dat, feols(..., data = dat)), the wildcard fires lazily and emits a synthetic webr-rdata-load AnalysisCall — a self-contained data-source primitive whose params capture the load chain and the bound name to extract.

Each webr-rdata-load node executes in an isolated R env: load all chain files into a fresh new.env(), return e[[boundName]]. The existing webr-opaque probe-and-marshal layer auto-detects is.data.frame and either marshals to a MarshaledDataset (auto-unwrapped to a TS Dataset by worker-manager) or returns an opaque-binding handle. No new marshaler infra; the data-source flows into the typed pipeline through the same path opaque nodes use today.

No RData-loaded object name is ever written to globalenv; the only globalenv write is the per-dispatch result binding described in §4.6. This eliminates state-pollution bugs across reruns and across sibling extracts within a single Run, at the documented cost that opaque code doing globalenv introspection (ls(), exists("dat"), mget(...)) does not see the RData-loaded names.

3. Scope

In

Out (future sessions / deferred)

4. Architecture

4.1 Module changes

src/core/parsers/r/extract-file-path.ts       [unchanged]
                                              `load` already in KNOWN_READERS
                                              for VFS-sync targeting

src/core/parsers/r/recognize-rdata-load.ts    [new]
                                              Wildcard state object;
                                              extract emission helper;
                                              deterministic synthetic name

src/core/parsers/r/recognizer.ts              [modified]
                                              Detect load() in statement walk;
                                              update loadChain state;
                                              wire scope-miss hook for unresolved
                                              identifier references

src/core/parsers/shared/analysis-call.ts      [modified]
                                              Add 'rdata-load' to AnalysisKind union

src/core/pipeline/types.ts                    [modified]
                                              Add WebRRDataLoadParams + WebRRDataLoadNode;
                                              extend PipelineNode union;
                                              add NODE_PORTS entry

src/core/pipeline/mapper.ts                   [modified]
                                              rdata-load AnalysisCall → WebRRDataLoadNode

src/core/pipeline/executor.ts                 [modified]
                                              Dispatch webr-rdata-load via worker-manager;
                                              receive auto-unwrapped Dataset (kind:'dataset')
                                              or opaque-binding result

src/core/webr/protocol.ts                     [modified]
                                              Add 'dispatch-rdata-load' request type
                                              (or extend dispatch-opaque)

src/workers/webr-worker.ts                    [modified]
                                              Handler for the new request type;
                                              private-env load + bare-name extract;
                                              existing probe-and-marshal

src/workers/worker-manager.ts                 [modified]
                                              Route webr-rdata-load nodes;
                                              compute cwd from originFile (existing helper)

src/ui/components/pipeline-node.tsx           [modified]
                                              Source-shape rendering with file icon;
                                              tooltip showing path + boundName

reference-papers/INTERLYSE-RUN-STATUS.md      [modified]
                                              §13 updated: drop CSV workaround;
                                              reclassify if applicable

BACKLOG.md                                    [modified]
                                              Mark load() bullet DONE

CLAUDE.md                                     [modified]
                                              Add No-Go invariants

4.2 Node type and ports

// src/core/pipeline/types.ts
export interface WebRRDataLoadParams {
  /**
   * Workspace-relative paths of every load() call encountered in source
   * order up to this extract's emission point, latest = last. The worker
   * loads them in this order into a fresh private env; latest-wins
   * overwrite semantics resolve naturally inside that env.
   */
  loadChain: string[];

  /**
   * R object name extracted from the loaded env. Deterministic per
   * (lastLoadSpan, name) pair via the synthetic-id scheme in §4.4.
   */
  boundName: string;

  /** Path of the R script that produced this call (workspace-relative). */
  originFile?: string;
}

export interface WebRRDataLoadNode extends PipelineNodeBase {
  type: 'webr-rdata-load';
  params: WebRRDataLoadParams;
  result?: OpaqueResult;   // matches webr-opaque shape — Dataset | opaque-binding
}

The result field uses the existing OpaqueResult discriminated union (same as webr-opaque). For typical RData files containing data.frames, the worker’s is.data.frame probe returns true and the payload is { kind: 'dataset', dataset: MarshaledDataset }. The worker-manager unwraps OpaqueResult{kind:'dataset'} to a TS Dataset before populating downstream inputData (per the existing webr-opaque executor invariant in CLAUDE.md). Non-data-frame results carry { kind: 'opaque-binding', ... } and bubble up as an upstream-error to Dataset-port consumers — same behavior as webr-opaque data-port consumers today.
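The unwrap rule described above can be sketched roughly as follows. The type shapes are simplified stand-ins, and the names PortInput and unwrapForDatasetPort are hypothetical, introduced only for illustration; the project's real Dataset, MarshaledDataset, and error types carry more detail.

```typescript
// Hypothetical, simplified stand-ins for the project's real types.
interface MarshaledDataset { columns: string[]; rows: unknown[][] }
interface Dataset { columns: string[]; rows: unknown[][] }

type OpaqueResult =
  | { kind: 'dataset'; dataset: MarshaledDataset }
  | { kind: 'opaque-binding'; binding: string };

// What a Dataset-typed consumer port receives: data, or an upstream error.
type PortInput =
  | { ok: true; data: Dataset }
  | { ok: false; error: string };

// Resolve an upstream result for a Dataset-typed consumer port.
function unwrapForDatasetPort(result: OpaqueResult): PortInput {
  switch (result.kind) {
    case 'dataset':
      // Data frames flow into the typed pipeline as plain datasets.
      return {
        ok: true,
        data: { columns: result.dataset.columns, rows: result.dataset.rows },
      };
    case 'opaque-binding':
      // Non-data-frame objects cannot feed a Dataset port.
      return {
        ok: false,
        error: `upstream-error: '${result.binding}' is not a data frame`,
      };
  }
}
```

Because the union is discriminated on kind, a consumer that forgets one of the two cases fails type-checking rather than failing at Run time.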

Ports are static (no marshaler-spec-driven dynamic shape):

// NODE_PORTS entry
'webr-rdata-load': {
  inputs: [],
  outputs: [{ name: 'out', dataType: 'any' }],
}

'any' matches 'dataset' for port validation (existing rule). When the consumer is a typed data port (Dataset-typed), runtime resolution checks result.kind === 'dataset' and either flows the Dataset or errors with the same message the opaque path uses today.

getPortsFor() returns the static NODE_PORTS['webr-rdata-load'] directly — no per-instance variation.

4.3 AnalysisCall shape

Add 'rdata-load' to the AnalysisKind union:

// src/core/parsers/shared/analysis-call.ts — modified
type AnalysisKind =
  | ... existing kinds ...
  | 'webr-typed'
  | 'webr-opaque'
  | 'rdata-load';   // NEW

The AnalysisCall produced by the recognizer:

{
  kind: 'rdata-load',
  args: {
    loadChain: string[],
    boundName: string,
  },
  assignedTo: '__load_<lastLoadSpan.start>_<lastLoadSpan.end>_<boundName>',
  sourceSpan: <lastLoadCall.span>,   // see §4.5 — points at the load() statement
  originFile: <currentFile.path>,
}

The mapper passes args.loadChain and args.boundName directly into the node’s params. assignedTo provides the deterministic synthetic binding name that downstream consumers reference.
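As a small illustration of the assignedTo scheme, the sketch below builds the synthetic name from a span and a bound name. It assumes Span exposes numeric start and end offsets, as the template in this section suggests; the helper name syntheticLoadName is hypothetical.

```typescript
// Assumed shape: numeric source offsets of the load() statement.
interface Span { start: number; end: number }

// Build the deterministic synthetic binding name for an extract.
// Same (span, name) pair in, same string out, on every recognition pass.
function syntheticLoadName(lastLoadSpan: Span, boundName: string): string {
  return `__load_${lastLoadSpan.start}_${lastLoadSpan.end}_${boundName}`;
}

const loadSpan: Span = { start: 120, end: 158 };
const datName = syntheticLoadName(loadSpan, 'dat');    // "__load_120_158_dat"
const pdatName = syntheticLoadName(loadSpan, 'pdat');  // "__load_120_158_pdat"
```

Two names extracted under the same load() share the span prefix and differ only in the suffix, so they map to distinct nodes that still point at the same source line.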

4.4 Recognizer state and lazy emission

Recognizer state extends with two fields:

interface RecognizerState {
  // ... existing scope: Map<string, AnalysisCall> ...
  // ... existing diagnostics, loadedPackages, ... ...

  /**
   * Source-order list of load() file paths encountered so far. Pushed on
   * each load() call seen. Used by the wildcard scope-miss hook to compose
   * the chain that goes into emitted rdata-load extracts.
   */
  loadChain: string[];

  /**
   * Span of the most recent load() statement. Used as the sourceSpan of any
   * extract emitted while this load is the active wildcard target — clicking
   * an extract in the DAG highlights this load() line in the editor.
   * Cleared/never-set if loadChain is empty.
   */
  lastLoadSpan?: Span;
}

Walking the AST: the existing top-level statement loop adds one new branch:

for (const stmt of program.body) {
  // ... existing: ignorable calls, loops, function defs, ... ...

  // NEW: detect load() with literal path; absorb into chain state.
  if (isLoadCall(stmt)) {
    const path = extractFilePath(stmt /* or stmt.value if assigned */);
    if (path) {
      state.loadChain.push(path);
      state.lastLoadSpan = stmt.span;
      continue;   // emit no AnalysisCall for the load() itself
    }
    // Programmatic path (paste0, variable) — fall through to opaque.
  }

  const result = tryRecognizeStatement(stmt, state);
  // ... existing emit ...
}

isLoadCall(stmt) returns true for:
- A bare top-level FunctionCallNode with name === 'load'.
- An assignment whose RHS is a FunctionCallNode with name === 'load' (rare; load() is normally unassigned).

Programmatic load paths (e.g., load(file = paste0(dir, "x.RData")) where dir is a variable) fall through because extractFilePath returns null. These emit as a webr-opaque side-effect, same as today. No wildcard is activated; downstream references to RData-loaded names will fail with the same clear “object not found” R error they get now. This is a documented limitation and matches the broader programmatic-path limitation called out in the VFS bridge spec.
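The literal-versus-programmatic distinction can be sketched as below. The AST shapes (Expr, Arg, Stmt) are pared-down assumptions rather than the parser's real node types, and literalLoadPath is a hypothetical stand-in for the part of extractFilePath that matters here.

```typescript
// Hypothetical, pared-down AST shapes; the real parser nodes carry more fields.
type Expr =
  | { type: 'StringLiteral'; value: string }
  | { type: 'Identifier'; name: string }
  | { type: 'FunctionCall'; name: string; args: Arg[] };
interface Arg { name?: string; value: Expr }
type Stmt =
  | { type: 'ExpressionStatement'; expr: Expr }
  | { type: 'Assignment'; target: string; value: Expr };

type CallExpr = Extract<Expr, { type: 'FunctionCall' }>;

// Return the load() call behind a statement, bare or assigned, if there is one.
function loadCallOf(stmt: Stmt): CallExpr | undefined {
  const expr = stmt.type === 'Assignment' ? stmt.value : stmt.expr;
  return expr.type === 'FunctionCall' && expr.name === 'load' ? expr : undefined;
}

function isLoadCall(stmt: Stmt): boolean {
  return loadCallOf(stmt) !== undefined;
}

// Yield a path only when it is a string literal, passed as file= or as the
// first positional argument. Anything computed (paste0, a variable) gives null.
function literalLoadPath(stmt: Stmt): string | null {
  const call = loadCallOf(stmt);
  if (!call) return null;
  const arg =
    call.args.find(a => a.name === 'file') ??
    call.args.find(a => a.name === undefined);
  return arg && arg.value.type === 'StringLiteral' ? arg.value.value : null;
}
```

A statement can therefore be a load() call and still yield no path; only the combination of both updates the chain state.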

The scope-miss hook is the heart of lazy emission. The existing scope is a Map<string, AnalysisCall> consulted in data= arg resolution (recognize-data-arg.ts) and in identifier-RHS resolution. We add a hook: when a lookup misses scope AND loadChain.length > 0, emit a synthetic extract on the fly and return it as if it had been there.

Implementation pattern:

// src/core/parsers/r/recognize-rdata-load.ts
export function resolveOrEmitRDataExtract(
  name: string,
  state: RecognizerState,
  emit: (call: AnalysisCall) => void,
): AnalysisCall | undefined {
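  const existing = state.scope.get(name);
  if (state.loadChain.length === 0 || !state.lastLoadSpan) return existing;

  const synthName = `__load_${state.lastLoadSpan.start}_${state.lastLoadSpan.end}_${name}`;
  // An alias minted under an earlier load() is stale once a newer load() is
  // active (see "Same name re-attributed across loads"); any other binding
  // already in scope still wins.
  const isStaleAlias = existing?.kind === 'rdata-load' && existing.assignedTo !== synthName;
  if (existing && !isStaleAlias) return existing;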
  // Idempotence: if a previous miss already emitted this extract, reuse.
  const cached = state.scope.get(synthName);
  if (cached) {
    state.scope.set(name, cached);
    return cached;
  }

  const call: AnalysisCall = {
    kind: 'rdata-load',
    args: {
      loadChain: [...state.loadChain],   // snapshot
      boundName: name,
    },
    assignedTo: synthName,
    sourceSpan: state.lastLoadSpan,
    originFile: state.currentFile,
  };
  emit(call);
  state.scope.set(synthName, call);
  state.scope.set(name, call);   // alias the user-visible name to the synthetic
  return call;
}

Call sites that consult scope and should route through this hook:
- recognize-data-arg.ts: data= arg resolution. The hook fires when data=dat references an unbound dat.
- The recognizer’s RHS-identifier handling in tryRecognizeStatement: dats <- dat references an unbound dat.
- webr-opaque input-binding resolution (unionInputBindings in recognize-opaque.ts): the regex/AST scan currently filters identifiers by scope.has(name); we change this to resolveOrEmitRDataExtract so opaque rSource that references an RData-loaded name gets the right edge.

Each call site is a small change: one-line replacement of scope.get(name) (or scope.has(name) in the opaque-scan filter) with the hook.

Idempotence: re-recognition of identical source produces identical synthetic ids (deterministic from lastLoadSpan) and the cached-call check ensures no duplicates. Matches the existing __lift_<span>_<tag> pattern (CLAUDE.md No-Go: “Synthetic lifted bindings are deterministic”).

Multi-load attribution: when load A appears at line 33, then load B at line 42, then feols(..., data=dat) at line 50 — lastLoadSpan at emission time is B’s span. The extract’s sourceSpan and synthetic id derive from B; the chain is [A.RData, B.RData]; boundName is dat. Latest-wins resolves naturally inside the worker’s private env (B overwrites A).

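If B doesn’t actually contain dat (only A did), the extract still succeeds: the private env holds A’s dat after both loads, so e[[boundName]] returns it even though the node is attributed to B (see §5.2). The worker errors with Error: object 'dat' not found in load chain only when no file in the chain defines dat; that surfaces as the extract node’s runtime error and mirrors the “object not found” error R itself would produce. Acceptable corner case; documented in the No-Go.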

Same name re-attributed across loads: when the user writes load(A); use(dat); load(B); use(dat), the first dat reference emits an extract attributed to A (__load_<A_span>_dat); the second reference, after B, is attributed to B (__load_<B_span>_dat). These are two distinct extract nodes, and both are valid: each pulls from its own private env. The user sees two source nodes in the DAG, one per attribution. This matches R’s “the dat after B is B’s dat, the dat before B is A’s dat” semantics.
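The attribution rules in this section can be exercised with a toy model of the statement walk. This is a sketch under simplifying assumptions: the event stream stands in for real AST statements, extracts are keyed only by their synthetic id, and the names WalkEvent, RDataExtract, and walk are hypothetical.

```typescript
interface Span { start: number; end: number }
interface RDataExtract { assignedTo: string; loadChain: string[]; boundName: string }

// Hypothetical event stream standing in for the top-level statement walk.
type WalkEvent =
  | { kind: 'load'; path: string; span: Span }
  | { kind: 'use'; name: string };

function walk(events: WalkEvent[]): RDataExtract[] {
  const loadChain: string[] = [];
  let lastLoadSpan: Span | undefined;
  const bySyntheticId = new Map<string, RDataExtract>();
  const emitted: RDataExtract[] = [];

  for (const ev of events) {
    if (ev.kind === 'load') {
      // load() is absorbed into chain state; nothing is emitted for it.
      loadChain.push(ev.path);
      lastLoadSpan = ev.span;
      continue;
    }
    if (!lastLoadSpan) continue;   // no load seen yet, so no wildcard is active
    const id = `__load_${lastLoadSpan.start}_${lastLoadSpan.end}_${ev.name}`;
    if (bySyntheticId.has(id)) continue;   // idempotent: reuse the earlier extract
    const extract: RDataExtract = {
      assignedTo: id,
      loadChain: [...loadChain],   // snapshot of the chain at emission time
      boundName: ev.name,
    };
    bySyntheticId.set(id, extract);
    emitted.push(extract);
  }
  return emitted;
}

// load(A); use(dat); load(B); use(dat); use(dat)
const extracts = walk([
  { kind: 'load', path: 'A.RData', span: { start: 0, end: 15 } },
  { kind: 'use', name: 'dat' },
  { kind: 'load', path: 'B.RData', span: { start: 40, end: 55 } },
  { kind: 'use', name: 'dat' },
  { kind: 'use', name: 'dat' },
]);
```

Here extracts holds two entries: one attributed to A with chain [A.RData], and one attributed to B with chain [A.RData, B.RData]. The third reference reuses the second extract.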

4.5 sourceSpan rationale

The extract’s sourceSpan points at the load() statement that introduced the binding (the last/most-recent load in the chain at emission time).

4.6 Worker dispatch

Two protocol options:

Option A (preferred): new request type 'dispatch-rdata-load'.

// src/core/webr/protocol.ts — additions
| { type: 'dispatch-rdata-load';
    id: string;
    loadChain: string[];     // workspace-relative paths
    boundName: string;
    cwd?: string;
  }

Response reuses the existing 'dispatch-result' shape with OpaquePayload (kind: 'dataset' | 'opaque-binding'). Errors use 'dispatch-error' with stages 'load' (a chain file failed to load), 'extract' (boundName not found in env), 'probe', 'marshal'.
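One way to picture the response side of the protocol is as a discriminated union. The sketch below is an assumption about shape only: it reuses the message type names from this section, while RDataLoadStage, RDataLoadResponse, and describeResponse are hypothetical names, and the dataset payload is left as unknown.

```typescript
type RDataLoadStage = 'load' | 'extract' | 'probe' | 'marshal';

type OpaquePayload =
  | { kind: 'dataset'; dataset: unknown }             // marshaled data frame
  | { kind: 'opaque-binding'; binding: string };      // handle to a non-data-frame object

type RDataLoadResponse =
  | { type: 'dispatch-result'; id: string; result: { kind: 'opaque'; payload: OpaquePayload } }
  | { type: 'dispatch-error'; id: string; stage: RDataLoadStage; error: string };

// Turn a worker response into a one-line status string for a node.
function describeResponse(res: RDataLoadResponse): string {
  if (res.type === 'dispatch-error') {
    return `failed at ${res.stage}: ${res.error}`;
  }
  return res.result.payload.kind === 'dataset'
    ? 'complete: dataset'
    : `complete: opaque binding ${res.result.payload.binding}`;
}
```

Keeping the stage as a closed union means the UI can branch on it without string matching.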

Worker handler (mirrors handleDispatchOpaque for the probe-and-marshal half; differs in the eval shape):

async function handleDispatchRDataLoad(
  req: Extract<WebRRequest, { type: 'dispatch-rdata-load' }>,
): Promise<void> {
  if (!webR) {
    post({ type: 'dispatch-error', id: req.id, stage: 'load', error: 'worker not initialized' });
    return;
  }

  const resultBinding = `.n_${req.id.replace(/[^a-z0-9]/gi, '_')}`;

  try {
    if (req.cwd) {
      await webR.evalRVoid(`suppressWarnings(setwd(${JSON.stringify(req.cwd)}))`);
    }

    // Prefix each chain path with /workspace per VFS bridge convention.
    const chainLiterals = req.loadChain
      .map(p => JSON.stringify(`/workspace/${p}`))
      .join(', ');
    const boundLit = JSON.stringify(req.boundName);

    // Private env per dispatch — globalenv untouched. Latest-wins inside e.
    const script = `
      ${resultBinding} <- local({
        e <- new.env()
        for (f in c(${chainLiterals})) {
          load(file = f, envir = e)
        }
        if (!exists(${boundLit}, envir = e, inherits = FALSE)) {
          stop(sprintf("object '%s' not found in load chain", ${boundLit}))
        }
        e[[${boundLit}]]
      })
    `;
    await webR.evalRVoid(script);
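  } catch (e) {
    const message = e instanceof Error ? e.message : String(e);
    // The stop() above marks a missing boundName ('extract' stage);
    // any other failure here is a chain file that failed to load.
    const stage = message.includes('not found in load chain') ? 'extract' : 'load';
    post({ type: 'dispatch-error', id: req.id, stage, error: message });
    return;
  }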

  // Probe + marshal — reuse the exact same path handleDispatchOpaque uses.
  let isDataFrame: boolean;
  try {
    const r = await webR.evalR(`is.data.frame(${resultBinding})`);
    isDataFrame = Boolean(await (r as unknown as { toBoolean: () => Promise<boolean> }).toBoolean());
  } catch (e) {
    post({ type: 'dispatch-error', id: req.id, stage: 'probe',
           error: e instanceof Error ? e.message : String(e) });
    return;
  }

  if (isDataFrame) {
    try {
      const marshaled = await marshalDatasetFromR(resultBinding);
      post({ type: 'dispatch-result', id: req.id,
             result: { kind: 'opaque', payload: { kind: 'dataset', dataset: marshaled } } });
    } catch (e) {
      post({ type: 'dispatch-error', id: req.id, stage: 'marshal',
             error: e instanceof Error ? e.message : String(e) });
    }
  } else {
    post({ type: 'dispatch-result', id: req.id,
           result: { kind: 'opaque', payload: { kind: 'opaque-binding', binding: resultBinding } } });
  }
}

Note on cleanup: unlike dispatch-typed which rm()s resultBinding after marshaling, rdata-load keeps the binding alive in globalenv (same as opaque assignment). Downstream opaque consumers may reference the synthetic binding name; cleanup happens at session end via worker termination. The private e env is local to the local({...}) block and garbage-collected after the dispatch.

Option B (rejected): extend dispatch-opaque with a flag. Considered and rejected — the rSource shape, error stages, and the assignment vs. side-effect dispatch logic are different enough that a dedicated request type is cleaner. ~30 lines saved by Option B aren’t worth the discriminant overload.

4.7 Worker-manager routing

// src/workers/worker-manager.ts — additions

async function dispatchRDataLoad(node: WebRRDataLoadNode): Promise<OpaqueResult> {
  const cwd = cwdFor(node.params.originFile);
  const response = await sendAndAwait({
    type: 'dispatch-rdata-load',
    id: nextDispatchId(),
    loadChain: node.params.loadChain,
    boundName: node.params.boundName,
    cwd,
  });
  // Same OpaqueResult shape returned by the existing dispatchOpaque path.
  // The executor's downstream consumer resolution (inputData population)
  // unwraps kind:'dataset' to TS Dataset and surfaces kind:'opaque-binding'
  // as an upstream-error for Dataset-port consumers — reuses the existing
  // webr-opaque consumer-resolution path verbatim.
  return resultFromOpaqueResponse(response);
}

Wire dispatchRDataLoad into the executor’s node-type switch. The existing prewarm set (onPipelineChange per CLAUDE.md No-Go) needs webr-rdata-load added — opaque-only and rdata-only pipelines should both warm WebR on first edit.

4.8 Mapper

// src/core/pipeline/mapper.ts — new case
case 'rdata-load': {
  return {
    id: freshIdFromAssignedTo(call),   // uses synthetic name as id seed
    type: 'webr-rdata-load',
    label: `Load: ${call.args.boundName} from ${basename(call.args.loadChain.at(-1) ?? '?')}`,
    sourceSpan: call.sourceSpan,
    params: {
      loadChain: call.args.loadChain,
      boundName: call.args.boundName,
      originFile: call.originFile,
    },
    status: 'pending',
    version: 0,
  };
}

Edge resolution: webr-rdata-load has no inputs, so no incoming edges. Outgoing edges are added by the existing binding-resolution pass — consumers’ data=/identifier args resolve to assignedTo (the synthetic name), which the mapper looks up in the binding map and adds an edge from <rdata-load>.out → <consumer>.<port>.
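The binding-resolution pass can be sketched as a lookup from binding name to producing node. The Producer, Consumer, and Edge shapes and the resolveEdges name are illustrative assumptions, not the mapper's actual data structures.

```typescript
interface Edge { from: string; fromPort: string; to: string; toPort: string }
interface Producer { nodeId: string; assignedTo: string }
// references: the binding name the consumer's data=/identifier arg resolved to.
interface Consumer { nodeId: string; port: string; references: string }

function resolveEdges(producers: Producer[], consumers: Consumer[]): Edge[] {
  // Binding map: synthetic (or user) binding name -> producing node id.
  const bindingMap = new Map<string, string>();
  for (const p of producers) bindingMap.set(p.assignedTo, p.nodeId);

  const edges: Edge[] = [];
  for (const c of consumers) {
    const producerId = bindingMap.get(c.references);
    if (producerId !== undefined) {
      edges.push({ from: producerId, fromPort: 'out', to: c.nodeId, toPort: c.port });
    }
    // An unresolved reference adds no edge; the consumer is left unwired.
  }
  return edges;
}

const edges = resolveEdges(
  [{ nodeId: 'n1', assignedTo: '__load_0_15_dat' }],
  [
    { nodeId: 'n2', port: 'data', references: '__load_0_15_dat' },
    { nodeId: 'n3', port: 'data', references: 'unknown_binding' },
  ],
);
```

In this example only n2 receives an edge, from n1.out to n2.data.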

4.9 Reference scan and VFS sync

extractFilePath already lists load in KNOWN_READERS, so VFS-bridge sync is unchanged: referencedFiles accumulates *.RData paths from every recognized load() call, and worker-manager pushes those bytes to /workspace/. The recognizer’s load() detection (§4.4) reuses extractFilePath for path extraction — single source of truth, per the No-Go invariant.

When the recognizer absorbs a load() into chain state and emits no AnalysisCall, the file path is still visited by the reference scan via the same extractFilePath call. No VFS-sync regression.

4.10 UI rendering

webr-rdata-load renders as a source-shape node (no input handles, one output handle), parallel to data-load, with a file icon and a tooltip showing the path and boundName (§4.1).

Status states match the standard set: pending, running, complete, error. The error state surfaces the worker’s stage-specific error (e.g., “object ‘dat’ not found in load chain” from the extract stage).

5. Behavior

5.1 Single load, single extract (paper #13’s headline shape)

load(file = "./data/data_with_charge_offs.RData")
dats <- dat
felm(yfill ~ collateral + log_amt | timeInt + Disaster_Id | 0 | Disaster_Id, data = dats)

Recognizer walk:
1. load(...) → loadChain = ["./data/data_with_charge_offs.RData"], lastLoadSpan = <load_span>. No AnalysisCall emitted.
2. dats <- dat → RHS is identifier dat. Scope miss for dat. Wildcard active → emit extract:
   - kind: 'rdata-load', args: { loadChain: [...], boundName: 'dat' }
   - assignedTo: '__load_<load_span.start>_<load_span.end>_dat'
   - sourceSpan: <load_span>
   - Scope: dat aliased to the extract.
   - The dats <- dat assignment itself: the aliasing recognizer sees that the RHS identifier dat resolves to the extract and emits a no-op or binding-rename AnalysisCall, as it does today for plain alias assignments.
3. felm(..., data = dats) → recognized as linear-model. data=dats resolves through the alias chain to the extract’s assignedTo. Edge added: <extract>.out → <linear-model>.data.

Pipeline shape:

[webr-rdata-load: dat from data_with_charge_offs.RData] ──► [linear-model: felm(...)]

Execution:
- webr-rdata-load dispatches to worker → loads file into private env → returns e$dat → is.data.frame probe TRUE → marshals to Dataset → worker-manager unwraps → extract.result = Dataset.
- linear-model reads extract.result from the TS pipeline store → runs OLS via the existing native executor → produces RegressionResult.

Coefficient agreement vs fixest::feols on the same data: exact (per paper #13 INTERLYSE-RUN-STATUS).

5.2 Multi-load, latest-wins

load("A.RData")     # introduces dat = A_value
load("B.RData")     # introduces dat = B_value (overwrites A's per R semantics)
feols(y ~ x, data = dat)

Recognizer:
1. load A → chain=[A].
2. load B → chain=[A, B], lastLoadSpan = B’s.
3. data=dat → wildcard fires with chain=[A, B]. Extract attributed to B’s span: __load_<B_span>_dat, loadChain: [A, B], boundName: dat.

Worker dispatch loads A then B into private env; B overwrites A’s dat in the env; returns B’s dat. ✓

If B doesn’t actually contain dat (only A did): private env after both loads has A’s dat (B’s load is a no-op for the dat name). e[[boundName]] returns A’s dat, even though we attributed to B. Acceptable — we surface A’s dat, which is what the user would get under R’s load(A); load(B); use(dat) semantics if B didn’t redefine dat.
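The latest-wins and fall-back-to-A behavior can be modeled without R. This is only a sketch: each RData file is represented as a map from object name to value, and extractFromChain is a hypothetical stand-in for what the worker's private-env script does.

```typescript
// Each "file" is modeled as object name -> value (a stand-in for RData contents).
type RDataFile = Map<string, unknown>;

function extractFromChain(chain: RDataFile[], boundName: string): unknown {
  const env = new Map<string, unknown>();   // fresh private env per dispatch
  for (const file of chain) {
    // Later files overwrite earlier ones, as load() does within one env.
    file.forEach((value, name) => env.set(name, value));
  }
  if (!env.has(boundName)) {
    throw new Error(`object '${boundName}' not found in load chain`);
  }
  return env.get(boundName);
}

const fileA: RDataFile = new Map<string, unknown>([['dat', 'A_value']]);
const fileB: RDataFile = new Map<string, unknown>([['dat', 'B_value']]);
const fileBWithoutDat: RDataFile = new Map<string, unknown>([['pdat', 'P_value']]);

const latestWins = extractFromChain([fileA, fileB], 'dat');              // 'B_value'
const fallsBackToA = extractFromChain([fileA, fileBWithoutDat], 'dat');  // 'A_value'
```

The error path is reached only when no file in the chain defines the requested name.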

5.3 Same name, different attributions

load("A.RData")
felm(y ~ x, data = dat)     # references dat, attributed to A
load("B.RData")
felm(y ~ x, data = dat)     # references dat, attributed to B

Two distinct extract nodes:
- __load_<A_span>_dat with chain=[A]
- __load_<B_span>_dat with chain=[A, B]

Each runs in its own private env: the first returns A’s dat; the second returns whichever-of-A-or-B has dat with B winning. Two source nodes, two felms, no false sharing. Matches R semantics where the two dat references could (in principle) resolve to different bindings depending on what B redefined.

5.4 Programmatic load path

data_dir <- "./data"
load(file = paste0(data_dir, "/x.RData"))
feols(y ~ x, data = dat)

extractFilePath returns null (paste0 isn’t a literal). load() falls through to recognizeOpaqueSideEffect: no chain update, no wildcard. The data=dat argument resolves via existing scope rules: the scope miss falls through to the existing webr-opaque path with dat as a free identifier outside scope (no edge). The feols call ends up opaque; the user sees the same behavior as today.

Documented limitation. Future work: extend extractFilePath to fold simple paste0 of file-scope literals (the “Numeric file-scope constants” backlog item generalizes here for strings too).

5.5 Side-effect-only load()

load("A.RData")
print(ls())

load() updates chain state. No subsequent name reference triggers an extract. No webr-rdata-load node is emitted; the load() statement effectively disappears from the DAG. print(ls()) is a webr-opaque side-effect, runs in globalenv, and sees an empty (RData-binding-free) globalenv. Documented limitation: globalenv-introspection of RData-loaded names doesn’t work.

In practice this corner is rare in the corpus (the audit doesn’t show globalenv-introspection idioms paired with load()). If a paper hits it, the per-paper workaround is to surface the binding via an explicit reference (e.g., add force(dat) or invisible(dat) after load, which makes dat referenced and triggers extract emission).

5.6 Rerun behavior

Today’s product semantics: Run = full topological re-execution of the entire DAG. Each webr-rdata-load re-dispatches on every Run.

Each dispatch runs in a fresh private env; globalenv state from prior Runs doesn’t affect correctness. Across two consecutive Runs of identical source:
- Run 1: each extract creates a fresh e, loads chain, returns e[[boundName]]. Result Dataset cached on node.result in the TS pipeline store.
- Run 2: each extract creates a fresh e (independent of Run 1’s), loads chain, returns the same value. ✓

Sibling extract execution order within a Run is arbitrary (topological sort); private envs prevent cross-extract pollution. ✓

Hypothetical per-node rerun (not in current product): would pull the cached Dataset from extractNode.result for downstream model nodes — no WebR re-dispatch. Re-running the extract itself (e.g., after editing the source) creates a fresh dispatch under the new params; no stale state.

5.7 Worker init / prewarm

webr-rdata-load nodes are added to the prewarm set in onPipelineChange (alongside webr-typed and webr-opaque). A pipeline with only rdata-load nodes (no opaque or typed) still triggers WebR worker boot on edit — without this, Run would hit cold-start latency.
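A minimal sketch of the prewarm check, assuming it reduces to a membership test on node type. The NodeType subset and the names WEBR_NODE_TYPES and shouldPrewarmWebR are hypothetical; the real onPipelineChange hook is not shown in this spec.

```typescript
// Illustrative subset of node types; the real PipelineNode union is larger.
type NodeType =
  | 'data-load'
  | 'linear-model'
  | 'webr-typed'
  | 'webr-opaque'
  | 'webr-rdata-load';

// Node types whose presence means the WebR worker will be needed on Run.
const WEBR_NODE_TYPES: ReadonlySet<NodeType> = new Set<NodeType>([
  'webr-typed',
  'webr-opaque',
  'webr-rdata-load',   // new in this session
]);

function shouldPrewarmWebR(nodes: { type: NodeType }[]): boolean {
  return nodes.some(n => WEBR_NODE_TYPES.has(n.type));
}

// An rdata-load source feeding a native linear-model still needs the worker.
const rdataOnly = shouldPrewarmWebR([{ type: 'webr-rdata-load' }, { type: 'linear-model' }]);
// A CSV-backed pipeline does not.
const csvOnly = shouldPrewarmWebR([{ type: 'data-load' }, { type: 'linear-model' }]);
```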

VFS-sync of referenced *.RData files happens through the existing pipeline (per the VFS bridge spec): on each pipeline rebuild, referencedFiles is recomputed; new entries are pushed to /workspace/ when the worker is ready. By the time webr-rdata-load dispatches, its chain files are present at /workspace/<path>.

6. Testing

6.1 Unit (Vitest)

src/core/parsers/r/recognize-rdata-load.test.ts (new file):
- Single-load + single-name reference → extract emitted with correct chain, boundName, sourceSpan = load span, deterministic synthetic id.
- Multi-load + name reference after both → extract chain has both files in source order; sourceSpan = last load.
- Same name referenced twice with one load between → two distinct extracts, each with its own attribution.
- Programmatic load path → no chain update; falls through to opaque.
- load() followed by no name reference → no extract emitted.
- Idempotence: re-recognize identical source → same synthetic ids; no duplicate AnalysisCalls.
- load(file = X) named-arg form vs load("path") positional form both recognized.

src/core/parsers/r/recognizer.rdata-load.test.ts (recognizer integration):
- load → dats <- dat → felm(..., data=dats) (paper #13 shape) → pipeline contains one rdata-load + one alias + one linear-model with the correct edges.
- load → opaque code referencing the loaded name → feols(...) → opaque code’s free-identifier resolution picks up the extract; all three nodes wired correctly.

src/core/pipeline/mapper.test.ts (extend):
- 'rdata-load' AnalysisCall → WebRRDataLoadNode with right params, label, sourceSpan.
- Edge resolution: synthetic assignedTo resolves to the extract; consumer data= arg picks up the edge.

src/workers/webr-worker.test.ts (extend; real WebR is gated, mock here):
- handleDispatchRDataLoad mock-WebR test: chain is loaded in order; e[[boundName]] returns expected value; probe runs.
- Error cases: chain file load failure → ‘load’ stage error; boundName missing in env → ‘extract’ stage error.

6.2 Integration (Vitest, gated WebR)

src/workers/webr-worker.integration.rdata.test.ts:
- Real WebR worker. Push a small .RData fixture into /workspace/. Dispatch dispatch-rdata-load with that path and a boundName. Assert the round-trip Dataset matches the file’s contents.
- Multi-file chain where the second file’s binding overwrites the first’s; assert latest-wins.
- Sibling-extract independence: dispatch two extracts referencing different boundNames in the same file; assert both succeed and return correct values regardless of dispatch order.

src/core/pipeline/integration.test.ts (extend):
- Paper #13 collateral structural assertion: paste the headline load → dats <- dat → felm(...) chunk; recognize + map; assert pipeline has 1 rdata-load + 1 alias + 1 linear-model + 0 opaque nodes; FE/cluster/2-term extraction correct.

6.3 Validation (paper-match, gated)

src/core/pipeline/replicate-collateral.test.ts — new file, two-test pattern (canonical example: replicate-soil.test.ts):

Test 1 (always runs in npm test): in-tree synthetic CSV. The existing structural assertion in integration.test.ts already covers the CSV path; this test extends to assert the rdata-load path works with a synthetic .RData fixture committed alongside the CSV. Generated via a small examples/collateral-bunching-loans.R helper script that reads the CSV and save()s it to examples/collateral-bunching-loans.RData. The synthetic .RData is bytes-stable (set.seed in the generator) so the test is deterministic.

Test 2 (gated on RUN_PAPER_MATCH=1 + data_with_charge_offs.RData presence): runs the paper’s actual headline chunk against the simulated-data RData in the deposit. Skipped silently in plain npm test and on fresh clones (no replication-package data).

Both tests verify: (a) the rdata-load node executes successfully, (b) the marshaled Dataset matches a parallel read.csv (test 1) or load+as.data.frame via R sidecar (test 2), (c) downstream felm/lm coefficients match the native-R baseline within the standard tolerance (<0.00005 for statistics, <0.00001 for p-values per CLAUDE.md).
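The coefficient comparison in (c) can be sketched as below, under the assumption that the tolerances are applied as absolute differences (this section does not say whether they are absolute or relative). The constant and function names are hypothetical, and the numbers in the example are illustrative only.

```typescript
// Tolerances quoted in this section; treated here as absolute differences.
const STATISTIC_TOLERANCE = 0.00005;
const P_VALUE_TOLERANCE = 0.00001;

function withinTolerance(actual: number, expected: number, tolerance: number): boolean {
  return Math.abs(actual - expected) < tolerance;
}

// Illustrative values only, not results from any paper.
const coefficientOk = withinTolerance(0.062789, 0.06279, STATISTIC_TOLERANCE);  // diff ~1e-6
const coefficientOff = withinTolerance(0.0628, 0.0629, STATISTIC_TOLERANCE);    // diff ~1e-4
const pValueOk = withinTolerance(0.0312, 0.031205, P_VALUE_TOLERANCE);          // diff ~5e-6
```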

6.4 E2E (Playwright)

e2e/webr-rdata-load.spec.ts:
- Drag-drop a small ZIP containing one .RData fixture and one .R script with the load() → felm(...) chunk.
- Wait for DAG containing a webr-rdata-load node (source-shaped, file icon, “R” badge) and a linear-model node connected to it.
- Click Run.
- Assert the rdata-load node transitions pending → running → complete; assert the regression result panel renders coefficients matching the expected synthetic-data values.

6.5 Audit regression

Re-run audit-opaque.ts after implementation. Expected:
- load occurrences in unsupported/webr-opaque drop substantially (every paper that used load() now has rdata-load nodes instead).
- New webr-rdata-load metric appears with non-zero count across multiple packages (paper #4, #13, #17 minimum).
- Models-blocked count drops by the count attributable to RData-only data inputs (paper #13’s 4-6 models, paper #4’s models, partial unblock for paper #17; full unblock requires the MILP primitives that are out of scope here).

7. CLAUDE.md No-Go list additions

8. Shipping criteria

  1. All unit and integration tests pass.
  2. The paper-match gated test (RUN_PAPER_MATCH=1) for paper #13 passes when the deposit’s RData fixture is present.
  3. E2E test passes in Chromium.
  4. npm run build && npm test && npm run lint && npm run test:e2e green.
  5. npm run test:paper-match green (when data is present; skip-cleanly otherwise per CLAUDE.md).
  6. Manual smoke: paste paper #13’s headline load → dats <- dat → felm(...) chunk, click Run, verify regression results match the documented coefficients (collateral 0.062789, log_amt 0.028844 on the synthetic data).
  7. INTERLYSE-RUN-STATUS §13 updated: drop the “Convert RData to CSV” workaround note; reclassify the load(.RData) blocker as resolved; add the rdata-load row to the “What was needed to make it run” table.
  8. BACKLOG load(file = "*.RData") bullet marked DONE with cross-reference to this spec and to paper #13’s verification.
  9. CLAUDE.md No-Go entries added.

9. Migration notes

10. Risk & rollback

11. Out-of-scope reminders