Date: 2026-05-01
Milestone: WebR follow-up (Session 5 of WebR follow-up sessions)
Predecessors:
- 2026-04-20 WebR Integration (Session 1: typed framework + lm_robust)
- 2026-04-22 WebR Opaque Nodes (Session 2: opaque path end-to-end)
- 2026-04-27 WebR VFS Bridge (Session 4: file-byte sync, extractFilePath, per-dispatch CWD)

load(file = "*.RData") is the single biggest cross-paper
blocker per the WebR audit. Cited by paper #4 (investor-memory), paper
#13 (cost-of-consumer-collateral — every Table 4–6 script), and paper
#17 (policy-targeting — ~245 chunks, top of the audit). All three rely
on RData intermediates as their data input, and the current pipeline
can’t wire those inputs into typed regression nodes.
The VFS bridge already syncs *.RData bytes into the
worker’s /workspace/ and load is in
KNOWN_READERS. The bytes reach R fine. The gap is
recognition: load() falls through to
recognizeOpaqueSideEffect, the worker runs it under
captureR into globalenv, the loaded objects
(dat, pdat, model_results) appear
in R’s env, but the recognizer’s TS-side scope never learns of them. A
subsequent feols(y ~ x, data = dat) resolves
dat against an empty scope — the typed
linear-model node either gets emitted with no
data edge (silent pipeline malformation), or the entire
feols call falls through to opaque, severing the typed
model recognition.
Goal: make RData-loaded bindings first-class data
sources in the typed pipeline. Recognizing load() should be
enough that downstream feols(..., data = dat) flows through
the existing typed-model machinery without the user pre-converting RData
to CSV.
The recognizer treats load() as a declaration of
intent rather than a runtime node: walking statements in source
order, encountering load(file = "x.RData") updates a
loadChain: string[] carried in recognizer state, but emits
no AnalysisCall for the load() statement itself. When
subsequent code references an unresolved identifier
(dats <- dat, feols(..., data = dat)), the
wildcard fires lazily and emits a synthetic webr-rdata-load
AnalysisCall — a self-contained data-source primitive whose params
capture the load chain and the bound name to extract.
Each webr-rdata-load node executes in an
isolated R env: load all chain files into a fresh
new.env(), return e[[boundName]]. The existing
webr-opaque probe-and-marshal layer auto-detects
is.data.frame and either marshals to a
MarshaledDataset (auto-unwrapped to a TS
Dataset by worker-manager) or returns an opaque-binding
handle. No new marshaler infra; the data-source flows into the typed
pipeline through the same path opaque nodes use today.
Globalenv is never touched by RData loads. This eliminates
state-pollution bugs across reruns and across sibling extracts within a
single Run, at the documented cost of opaque code that does
globalenv-introspection (ls(), exists("dat"),
mget(...)) not seeing the RData-loaded names.
- PipelineNode variant: WebRRDataLoadNode with params { loadChain, boundName, originFile? }.
- Recognizer state: loadChain: string[] updated as statements are walked; loadWildcard flag indicating “unresolved identifiers should resolve via the chain.”
- recognize-rdata-load.ts: new module owning the wildcard scope-miss hook and the synthetic AnalysisCall emission.
- Mapper: 'rdata-load' AnalysisKind → WebRRDataLoadNode.
- Executor: dispatches webr-rdata-load via a new 'dispatch-rdata-load' worker request type that wraps the existing probe-and-marshal layer.
- Ports: webr-rdata-load in NODE_PORTS and a passthrough in getPortsFor().
- UI: source shape (data-load style), label Load: <boundName> from <basename>, file icon.
- Verification: data_with_charge_offs.RData end-to-end (paper-match test gated on RUN_PAPER_MATCH + data presence).
- INTERLYSE-RUN-STATUS.md §13: drop the load(.RData) → CSV workaround.
- BACKLOG.md: load(file = "*.RData") bullet marked DONE.

- .RData binary header parser for advance name extraction. Would let the recognizer pre-validate that referenced names actually exist in the file (typo detection) and surface “this load brings: {dat, pdat, …}” in the inspector. Real cost (~one binary parser); marginal benefit until users hit typo-debugging pain. Documented as a future UX improvement.
- Prefix-cache scheme (load_<span>.dat named bindings in globalenv with reverse-fallback resolution). Worker-side perf optimization that avoids redundant .RData deserialization across sibling extracts and across Runs. Same node shape as the private-env design; a pure dispatch-layer change. Defer until a paper with large .RData files materializes the perf wall.
- webr-binary-load primitive unifying readRDS/haven::read_*/read_xlsx/etc. These return single values, not env-emitter bindings, which is a different shape from load(). They fit the existing typed-marshaler pattern (one R call → one typed payload) and should land as a separate session.
- Globalenv introspection (ls(), exists("dat"), mget(...) on RData-loaded names). Structurally incompatible with rerun correctness under shared state. Documented as a No-Go invariant; per-paper workaround is to surface the binding via an explicit reference rather than introspection.
- FileRegistry, audit, ~80 unit tests) for marginal benefit over lazy emission.
- source(...) functions, .n_<id> opaque bindings). Tracked separately; the per-extract private env design here is correct in isolation and doesn’t depend on a broader cleanup.

src/core/parsers/r/extract-file-path.ts [unchanged]
`load` already in KNOWN_READERS
for VFS-sync targeting
src/core/parsers/r/recognize-rdata-load.ts [new]
Wildcard state object;
extract emission helper;
deterministic synthetic name
src/core/parsers/r/recognizer.ts [modified]
Detect load() in statement walk;
update loadChain state;
wire scope-miss hook for unresolved
identifier references
src/core/parsers/shared/analysis-call.ts [modified]
Add 'rdata-load' to AnalysisKind union
src/core/pipeline/types.ts [modified]
Add WebRRDataLoadParams + WebRRDataLoadNode;
extend PipelineNode union;
add NODE_PORTS entry
src/core/pipeline/mapper.ts [modified]
rdata-load AnalysisCall → WebRRDataLoadNode
src/core/pipeline/executor.ts [modified]
Dispatch webr-rdata-load via worker-manager;
receive auto-unwrapped Dataset (kind:'dataset')
or opaque-binding result
src/core/webr/protocol.ts [modified]
Add 'dispatch-rdata-load' request type
(or extend dispatch-opaque)
src/workers/webr-worker.ts [modified]
Handler for the new request type;
private-env load + bare-name extract;
existing probe-and-marshal
src/workers/worker-manager.ts [modified]
Route webr-rdata-load nodes;
compute cwd from originFile (existing helper)
src/ui/components/pipeline-node.tsx [modified]
Source-shape rendering with file icon;
tooltip showing path + boundName
reference-papers/INTERLYSE-RUN-STATUS.md [modified]
§13 updated: drop CSV workaround;
reclassify if applicable
BACKLOG.md [modified]
Mark load() bullet DONE
CLAUDE.md [modified]
Add No-Go invariants
// src/core/pipeline/types.ts
export interface WebRRDataLoadParams {
/**
* Workspace-relative paths of every load() call encountered in source
* order up to this extract's emission point, latest = last. The worker
* loads them in this order into a fresh private env; latest-wins
* overwrite semantics resolve naturally inside that env.
*/
loadChain: string[];
/**
* R object name extracted from the loaded env. Deterministic per
* (lastLoadSpan, name) pair via the synthetic-id scheme in §4.4.
*/
boundName: string;
/** Path of the R script that produced this call (workspace-relative). */
originFile?: string;
}
export interface WebRRDataLoadNode extends PipelineNodeBase {
type: 'webr-rdata-load';
params: WebRRDataLoadParams;
result?: OpaqueResult; // matches webr-opaque shape — Dataset | opaque-binding
}

The result field uses the existing
OpaqueResult discriminated union (same as
webr-opaque). For typical RData files containing
data.frames, the worker’s is.data.frame probe returns true
and the payload is
{ kind: 'dataset', dataset: MarshaledDataset }. The
worker-manager unwraps OpaqueResult{kind:'dataset'} to a TS
Dataset before populating downstream inputData
(per the existing webr-opaque executor invariant in CLAUDE.md).
Non-data-frame results carry
{ kind: 'opaque-binding', ... } and bubble up as an
upstream-error to Dataset-port consumers — same behavior as
webr-opaque data-port consumers today.
Ports are static (no marshaler-spec-driven dynamic shape):
// NODE_PORTS entry
'webr-rdata-load': {
inputs: [],
outputs: [{ name: 'out', dataType: 'any' }],
}

'any' matches 'dataset' for port validation
(existing rule). When the consumer is a typed data port
(Dataset-typed), runtime resolution checks
result.kind === 'dataset' and either flows the Dataset or
errors with the same message the opaque path uses today.
getPortsFor() returns the static
NODE_PORTS['webr-rdata-load'] directly — no per-instance
variation.
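
The runtime half of that rule can be sketched as a small resolver over the result union. This is a minimal illustration, not the executor's actual code: `SketchDataset`, `SketchOpaqueResult`, `SketchInput`, `resolveDatasetInput`, and the error wording are hypothetical stand-ins for the real `Dataset`, `OpaqueResult`, and upstream-error types.

```typescript
// Hypothetical stand-ins for the real Dataset / OpaqueResult types.
interface SketchDataset {
  columns: string[];
  rowCount: number;
}

type SketchOpaqueResult =
  | { kind: "dataset"; dataset: SketchDataset }
  | { kind: "opaque-binding"; binding: string };

type SketchInput =
  | { ok: true; data: SketchDataset }
  | { ok: false; upstreamError: string };

// A Dataset-typed consumer port accepts the static 'any' output, then checks
// result.kind at run time: datasets flow through, opaque bindings become an
// upstream error. The message text here is an assumption.
function resolveDatasetInput(result: SketchOpaqueResult): SketchInput {
  switch (result.kind) {
    case "dataset":
      return { ok: true, data: result.dataset };
    case "opaque-binding":
      return {
        ok: false,
        upstreamError: `upstream binding ${result.binding} is not a data.frame`,
      };
  }
}

const sketchDatasetInput = resolveDatasetInput({
  kind: "dataset",
  dataset: { columns: ["y", "x"], rowCount: 3 },
});
const sketchBindingInput = resolveDatasetInput({
  kind: "opaque-binding",
  binding: ".n_42",
});
```

Keeping the port static and deferring the check to `result.kind` means the node needs no per-instance port computation, at the cost of surfacing non-data-frame loads only at Run time.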
Add 'rdata-load' to the AnalysisKind
union:
// src/core/parsers/shared/analysis-call.ts — modified
type AnalysisKind =
| ... existing kinds ...
| 'webr-typed'
| 'webr-opaque'
| 'rdata-load'; // NEW

The AnalysisCall produced by the recognizer:
{
kind: 'rdata-load',
args: {
loadChain: string[],
boundName: string,
},
assignedTo: '__load_<lastLoadSpan.start>_<lastLoadSpan.end>_<boundName>',
sourceSpan: <lastLoadCall.span>, // see §4.5 — points at the load() statement
originFile: <currentFile.path>,
}

The mapper passes args.loadChain and
args.boundName directly into the node’s params.
assignedTo provides the deterministic synthetic binding
name that downstream consumers reference.
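
As a quick illustration of the naming scheme, the synthetic name is a pure function of the load span and the bound name. The helper below is a sketch; `SketchSpan` and `syntheticLoadName` are hypothetical names, and the real Span type may carry more fields.

```typescript
// Hypothetical minimal span shape; only start/end matter for the name.
interface SketchSpan {
  start: number;
  end: number;
}

// Builds the deterministic binding name from the most recent load() span
// and the R object name being extracted.
function syntheticLoadName(lastLoadSpan: SketchSpan, boundName: string): string {
  return `__load_${lastLoadSpan.start}_${lastLoadSpan.end}_${boundName}`;
}

const sketchLoadSpan: SketchSpan = { start: 120, end: 171 };
const sketchNameDat = syntheticLoadName(sketchLoadSpan, "dat");
const sketchNameDatAgain = syntheticLoadName(sketchLoadSpan, "dat");
const sketchNamePdat = syntheticLoadName(sketchLoadSpan, "pdat");
```

Because no counter or timestamp is involved, re-recognizing unchanged source yields the same names, while two objects from the same load() get distinct names.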
Recognizer state extends with two fields:
interface RecognizerState {
// ... existing scope: Map<string, AnalysisCall> ...
// ... existing diagnostics, loadedPackages, ... ...
/**
* Source-order list of load() file paths encountered so far. Pushed on
* each load() call seen. Used by the wildcard scope-miss hook to compose
* the chain that goes into emitted rdata-load extracts.
*/
loadChain: string[];
/**
* Span of the most recent load() statement. Used as the sourceSpan of any
* extract emitted while this load is the active wildcard target — clicking
* an extract in the DAG highlights this load() line in the editor.
* Cleared/never-set if loadChain is empty.
*/
lastLoadSpan?: Span;
}

Walking the AST: the existing top-level statement loop adds one new branch:
for (const stmt of program.body) {
// ... existing: ignorable calls, loops, function defs, ... ...
// NEW: detect load() with literal path; absorb into chain state.
if (isLoadCall(stmt)) {
const path = extractFilePath(stmt /* or stmt.value if assigned */);
if (path) {
state.loadChain.push(path);
state.lastLoadSpan = stmt.span;
continue; // emit no AnalysisCall for the load() itself
}
// Programmatic path (paste0, variable) — fall through to opaque.
}
const result = tryRecognizeStatement(stmt, state);
// ... existing emit ...
}

isLoadCall(stmt) returns true for:
- A bare top-level FunctionCallNode with name === 'load'.
- An assignment whose RHS is a FunctionCallNode with name === 'load' (rare; load() is normally unassigned).

Programmatic load paths (e.g.,
load(file = paste0(dir, "x.RData")) where dir
is a variable) make extractFilePath return null, so the statement falls through.
These are emitted as webr-opaque side-effect nodes, same as today. No
wildcard is activated; downstream references to RData-loaded names will
fail with the same clear “object not found” R error they get now.
This is a documented limitation that matches the broader programmatic-path
limitation called out in the VFS bridge spec.
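
The literal-versus-programmatic split can be sketched on a toy AST. Everything here is illustrative: `SketchArg`, `SketchCall`, `literalLoadPath`, and `absorbLoads` are hypothetical, and the real recognizer delegates path extraction to extractFilePath rather than inspecting arguments itself.

```typescript
// Hypothetical, simplified AST shapes; the real parser nodes are richer.
type SketchArg =
  | { kind: "string"; value: string }
  | { kind: "call"; fn: string }
  | { kind: "identifier"; id: string };

interface SketchCall {
  fn: string;
  // Named args keyed by name; positional args keyed "0", "1", ...
  args: Record<string, SketchArg | undefined>;
}

// Returns the literal path of a load() call, or null when the call is not
// load() or its path is programmatic (paste0, variable).
function literalLoadPath(stmt: SketchCall): string | null {
  if (stmt.fn !== "load") return null;
  const arg = stmt.args["file"] ?? stmt.args["0"];
  return arg !== undefined && arg.kind === "string" ? arg.value : null;
}

// Literal loads are absorbed into the chain; everything else is passed on
// to the normal recognition path (here simply collected).
function absorbLoads(stmts: SketchCall[]): { chain: string[]; passedOn: SketchCall[] } {
  const chain: string[] = [];
  const passedOn: SketchCall[] = [];
  for (const stmt of stmts) {
    const path = literalLoadPath(stmt);
    if (path !== null) chain.push(path);
    else passedOn.push(stmt);
  }
  return { chain, passedOn };
}

const sketchWalk = absorbLoads([
  { fn: "load", args: { file: { kind: "string", value: "./data/x.RData" } } },
  { fn: "load", args: { "0": { kind: "string", value: "y.RData" } } },
  { fn: "load", args: { file: { kind: "call", fn: "paste0" } } },
]);
```

Only the first two statements update the chain; the paste0 form is left for the opaque path, which mirrors the limitation described above.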
The scope-miss hook is the heart of lazy emission.
The existing scope is a Map<string, AnalysisCall>
consulted in data= arg resolution (recognize-data-arg.ts)
and in identifier-RHS resolution. We add a hook: when a lookup misses
scope AND loadChain.length > 0, emit a synthetic extract
on the fly and return it as if it had been there.
Implementation pattern:
// src/core/parsers/r/recognize-rdata-load.ts
export function resolveOrEmitRDataExtract(
name: string,
state: RecognizerState,
emit: (call: AnalysisCall) => void,
): AnalysisCall | undefined {
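const existing = state.scope.get(name);
const span = state.lastLoadSpan;
if (state.loadChain.length === 0 || !span) return existing;
const synthName = `__load_${span.start}_${span.end}_${name}`;
// A wildcard alias minted under an earlier load() is stale once a later
// load() is active; without this check the second reference would reuse
// the first load's extract instead of being re-attributed.
const staleAlias =
existing?.kind === 'rdata-load' &&
existing.args.boundName === name &&
existing.assignedTo !== synthName;
if (existing && !staleAlias) return existing;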
// Idempotence: if a previous miss already emitted this extract, reuse.
const cached = state.scope.get(synthName);
if (cached) {
state.scope.set(name, cached);
return cached;
}
const call: AnalysisCall = {
kind: 'rdata-load',
args: {
loadChain: [...state.loadChain], // snapshot
boundName: name,
},
assignedTo: synthName,
sourceSpan: state.lastLoadSpan,
originFile: state.currentFile,
};
emit(call);
state.scope.set(synthName, call);
state.scope.set(name, call); // alias the user-visible name to the synthetic
return call;
}

Call sites that consult scope and should route through this hook:
- recognize-data-arg.ts: data= arg resolution. The hook fires when data=dat references an unbound dat.
- The recognizer’s RHS-identifier handling in tryRecognizeStatement: dats <- dat references an unbound dat.
- webr-opaque input-binding resolution (unionInputBindings in recognize-opaque.ts): the regex/AST scan currently filters identifiers by scope.has(name); we change this to resolveOrEmitRDataExtract so opaque rSource that references an RData-loaded name gets the right edge.

Each call site is a small change: one-line replacement of
scope.get(name) (or scope.has(name) in the
opaque-scan filter) with the hook.
Idempotence: re-recognition of identical source
produces identical synthetic ids (deterministic from
lastLoadSpan) and the cached-call check ensures no
duplicates. Matches the existing
__lift_<span>_<tag> pattern (CLAUDE.md No-Go:
“Synthetic lifted bindings are deterministic”).
Multi-load attribution: when load A appears at line
33, then load B at line 42, then feols(..., data=dat) at
line 50 — lastLoadSpan at emission time is B’s span. The
extract’s sourceSpan and synthetic id derive from B; the
chain is [A.RData, B.RData]; boundName is
dat. Latest-wins resolves naturally inside the worker’s
private env (B overwrites A).
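If B doesn’t actually contain dat (only A did), the private env still
holds A’s dat after both loads, so the extract returns A’s dat even though
it is attributed to B. If no file in the chain contains dat, the
worker errors with
Error: object 'dat' not found in load chain, which is clear and
surfaces as the extract node’s runtime error, analogous to R’s own
“object not found” error. Acceptable corner case; documented in the No-Go.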
Same name re-attributed across loads: when the user
writes load(A); use(dat); load(B); use(dat) — the first
dat reference emits an extract attributed to A
(__load_<A_span>_dat); the second reference, after B,
attributes to B (__load_<B_span>_dat). Two distinct
extract nodes. Both are valid: each pulls from its respective private
env. The user sees two source nodes in the DAG, one per attribution.
Matches R’s “the dat after B is B’s dat, the dat before B is A’s dat”
semantic.
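
The idempotence and re-attribution rules can be exercised together in a small simulation. This is a sketch under stated assumptions: the `Sketch*` shapes are hypothetical reductions of RecognizerState and AnalysisCall, and the resolver assumes that an alias minted under an earlier load() is treated as stale once a later load() is active.

```typescript
// Hypothetical, trimmed-down shapes for illustration only.
interface SketchSpan { start: number; end: number }
interface SketchExtract {
  kind: "rdata-load";
  args: { loadChain: string[]; boundName: string };
  assignedTo: string;
}
interface SketchState {
  scope: Map<string, SketchExtract>;
  loadChain: string[];
  lastLoadSpan?: SketchSpan;
}

function sketchLoad(state: SketchState, path: string, span: SketchSpan): void {
  state.loadChain.push(path);
  state.lastLoadSpan = span;
}

function sketchResolve(
  name: string,
  state: SketchState,
  emitted: SketchExtract[],
): SketchExtract | undefined {
  const existing = state.scope.get(name);
  const span = state.lastLoadSpan;
  if (state.loadChain.length === 0 || span === undefined) return existing;
  const synthName = `__load_${span.start}_${span.end}_${name}`;
  // Assumption: an alias created under an earlier load() is stale.
  const stale =
    existing !== undefined &&
    existing.args.boundName === name &&
    existing.assignedTo !== synthName;
  if (existing !== undefined && !stale) return existing;
  const cached = state.scope.get(synthName);
  if (cached !== undefined) {
    state.scope.set(name, cached);
    return cached;
  }
  const call: SketchExtract = {
    kind: "rdata-load",
    args: { loadChain: [...state.loadChain], boundName: name }, // snapshot
    assignedTo: synthName,
  };
  emitted.push(call);
  state.scope.set(synthName, call);
  state.scope.set(name, call);
  return call;
}

// Simulates: load(A); use(dat); use(dat); load(B); use(dat)
const sketchEmitted: SketchExtract[] = [];
const sketchState: SketchState = { scope: new Map(), loadChain: [] };
sketchLoad(sketchState, "A.RData", { start: 0, end: 15 });
const sketchUseA1 = sketchResolve("dat", sketchState, sketchEmitted);
const sketchUseA2 = sketchResolve("dat", sketchState, sketchEmitted);
sketchLoad(sketchState, "B.RData", { start: 40, end: 55 });
const sketchUseB = sketchResolve("dat", sketchState, sketchEmitted);
```

The two references before B share one extract with chain [A]; the reference after B gets its own extract with chain [A, B], so exactly two extracts are emitted.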
The extract’s sourceSpan points at the load() statement that introduced the binding (the last/most-recent load in the chain at emission time).
- Clicking the extract highlights the load() line. Tells the user “this is where this data came from.” Parallels data-load’s sourceSpan pointing at read.csv().
- dat and pdat extracted from the same load() both highlight the same source line, which communicates the relationship visually.
- Ordering keys on sourceSpan.start (CLAUDE.md). Extract’s start = load’s start places the extract adjacent to where the load was in source order. Edges to consumers (later in source) resolve naturally.
- No lifted: true / parentSpanStart machinery: extracts aren’t lifted out of a single parent call (the data-arg pattern); they’re emitted from wildcard scope on first reference, which can be in any of several downstream statements. sourceSpan = load span is the natural choice.

Two protocol options:
Option A (preferred): new request type
'dispatch-rdata-load'.
// src/core/webr/protocol.ts — additions
| { type: 'dispatch-rdata-load';
id: string;
loadChain: string[]; // workspace-relative paths
boundName: string;
cwd?: string;
}

Response reuses the existing 'dispatch-result' shape
with OpaquePayload (kind: 'dataset' |
'opaque-binding'). Errors use 'dispatch-error'
with stages 'load' (a chain file failed to load),
'extract' (boundName not found in env),
'probe', 'marshal'.
Worker handler (mirrors
handleDispatchOpaque for the probe-and-marshal half;
differs in the eval shape):
async function handleDispatchRDataLoad(
req: Extract<WebRRequest, { type: 'dispatch-rdata-load' }>,
): Promise<void> {
if (!webR) {
post({ type: 'dispatch-error', id: req.id, stage: 'load', error: 'worker not initialized' });
return;
}
const resultBinding = `.n_${req.id.replace(/[^a-z0-9]/gi, '_')}`;
try {
if (req.cwd) {
await webR.evalRVoid(`suppressWarnings(setwd(${JSON.stringify(req.cwd)}))`);
}
// Prefix each chain path with /workspace per VFS bridge convention.
const chainLiterals = req.loadChain
.map(p => JSON.stringify(`/workspace/${p}`))
.join(', ');
const boundLit = JSON.stringify(req.boundName);
// Private env per dispatch — globalenv untouched. Latest-wins inside e.
const script = `
${resultBinding} <- local({
e <- new.env()
for (f in c(${chainLiterals})) {
load(file = f, envir = e)
}
if (!exists(${boundLit}, envir = e, inherits = FALSE)) {
stop(sprintf("object '%s' not found in load chain", ${boundLit}))
}
e[[${boundLit}]]
})
`;
await webR.evalRVoid(script);
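} catch (e) {
const message = e instanceof Error ? e.message : String(e);
// The stop() raised by the exists() guard maps to the 'extract' stage;
// any other failure here (setwd, load) is reported as 'load'.
const stage = message.includes('not found in load chain') ? 'extract' : 'load';
post({ type: 'dispatch-error', id: req.id, stage, error: message });
return;
}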
// Probe + marshal — reuse the exact same path handleDispatchOpaque uses.
let isDataFrame: boolean;
try {
const r = await webR.evalR(`is.data.frame(${resultBinding})`);
isDataFrame = Boolean(await (r as unknown as { toBoolean: () => Promise<boolean> }).toBoolean());
} catch (e) {
post({ type: 'dispatch-error', id: req.id, stage: 'probe',
error: e instanceof Error ? e.message : String(e) });
return;
}
if (isDataFrame) {
try {
const marshaled = await marshalDatasetFromR(resultBinding);
post({ type: 'dispatch-result', id: req.id,
result: { kind: 'opaque', payload: { kind: 'dataset', dataset: marshaled } } });
} catch (e) {
post({ type: 'dispatch-error', id: req.id, stage: 'marshal',
error: e instanceof Error ? e.message : String(e) });
}
} else {
post({ type: 'dispatch-result', id: req.id,
result: { kind: 'opaque', payload: { kind: 'opaque-binding', binding: resultBinding } } });
}
}

Note on cleanup: unlike dispatch-typed
which rm()s resultBinding after marshaling,
rdata-load keeps the binding alive in globalenv (same as opaque
assignment). Downstream opaque consumers may reference the synthetic
binding name; cleanup happens at session end via worker termination. The
private e env is local to the local({...})
block and garbage-collected after the dispatch.
Option B (rejected): extend
dispatch-opaque with a flag. Considered and rejected — the
rSource shape, error stages, and the assignment vs. side-effect dispatch
logic are different enough that a dedicated request type is cleaner. ~30
lines saved by Option B aren’t worth the discriminant overload.
// src/workers/worker-manager.ts — additions
async function dispatchRDataLoad(node: WebRRDataLoadNode): Promise<OpaqueResult> {
const cwd = cwdFor(node.params.originFile);
const response = await sendAndAwait({
type: 'dispatch-rdata-load',
id: nextDispatchId(),
loadChain: node.params.loadChain,
boundName: node.params.boundName,
cwd,
});
// Same OpaqueResult shape returned by the existing dispatchOpaque path.
// The executor's downstream consumer resolution (inputData population)
// unwraps kind:'dataset' to TS Dataset and surfaces kind:'opaque-binding'
// as an upstream-error for Dataset-port consumers — reuses the existing
// webr-opaque consumer-resolution path verbatim.
return resultFromOpaqueResponse(response);
}

Wire dispatchRDataLoad into the executor’s node-type
switch. The existing prewarm set (onPipelineChange per
CLAUDE.md No-Go) needs webr-rdata-load added — opaque-only
and rdata-only pipelines should both warm WebR on first edit.
// src/core/pipeline/mapper.ts — new case
case 'rdata-load': {
return {
id: freshIdFromAssignedTo(call), // uses synthetic name as id seed
type: 'webr-rdata-load',
label: `Load: ${call.args.boundName} from ${basename(call.args.loadChain.at(-1) ?? '?')}`,
sourceSpan: call.sourceSpan,
params: {
loadChain: call.args.loadChain,
boundName: call.args.boundName,
originFile: call.originFile,
},
status: 'pending',
version: 0,
};
}

Edge resolution: webr-rdata-load has no inputs, so no
incoming edges. Outgoing edges are added by the existing
binding-resolution pass — consumers’ data=/identifier args
resolve to assignedTo (the synthetic name), which the
mapper looks up in the binding map and adds an edge from
<rdata-load>.out → <consumer>.<port>.
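
A minimal sketch of that binding-resolution pass, assuming each mapped node exposes the name it assigns and the name its data argument refers to. `SketchNodeRef`, `SketchEdge`, and `resolveDataEdges` are hypothetical; the real mapper handles more ports and argument kinds.

```typescript
// Hypothetical reduced node shape: what a node binds and what it consumes.
interface SketchNodeRef {
  id: string;
  assignedTo?: string; // binding name this node produces
  dataRef?: string;    // binding name its data= argument resolved to
}

interface SketchEdge {
  from: string;
  fromPort: string;
  to: string;
  toPort: string;
}

// Builds a binding map (name -> producer id), then adds one edge per
// consumer whose data reference matches a known producer.
function resolveDataEdges(nodes: SketchNodeRef[]): SketchEdge[] {
  const producers = new Map<string, string>();
  for (const node of nodes) {
    if (node.assignedTo !== undefined) producers.set(node.assignedTo, node.id);
  }
  const edges: SketchEdge[] = [];
  for (const node of nodes) {
    if (node.dataRef === undefined) continue;
    const producerId = producers.get(node.dataRef);
    if (producerId !== undefined) {
      edges.push({ from: producerId, fromPort: "out", to: node.id, toPort: "data" });
    }
  }
  return edges;
}

const sketchEdges = resolveDataEdges([
  { id: "extract-1", assignedTo: "__load_0_48_dat" },
  { id: "model-1", dataRef: "__load_0_48_dat" },
  { id: "model-2", dataRef: "unbound_name" },
]);
```

Only model-1 receives an edge; a reference that matches no producer is simply left unwired in this sketch.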
extractFilePath already lists load in
KNOWN_READERS, so VFS-bridge sync is unchanged:
referencedFiles accumulates *.RData paths from
every recognized load() call, and worker-manager pushes
those bytes to /workspace/. The recognizer’s
load() detection (§4.4) reuses extractFilePath
for path extraction — single source of truth, per the No-Go
invariant.
When the recognizer absorbs a load() into chain state
and emits no AnalysisCall, the file path is still visited by the
reference scan via the same extractFilePath call. No
VFS-sync regression.
webr-rdata-load renders as a source-shape node (no input handles, one output handle), parallel to data-load:
- File icon (data-load style) with a small “R” badge in the corner to distinguish the WebR origin.
- Label: Load: <boundName> on the first line, <basename of last chain path> on the second.

Status states match the standard set: pending, running, complete, error. The error state surfaces the worker’s stage-specific error (e.g., “object ‘dat’ not found in load chain” from the extract stage).
load(file = "./data/data_with_charge_offs.RData")
dats <- dat
felm(yfill ~ collateral + log_amt | timeInt + Disaster_Id | 0 | Disaster_Id, data = dats)

Recognizer walk:
1. load(...) → loadChain = ["./data/data_with_charge_offs.RData"], lastLoadSpan = <load_span>. No AnalysisCall emitted.
2. dats <- dat → RHS is identifier dat. Scope miss for dat. Wildcard active → emit extract:
   - kind: 'rdata-load', args: { loadChain: [...], boundName: 'dat' }
   - assignedTo: '__load_<load_span_start>_<load_span_end>_dat'
   - sourceSpan: <load_span>
   - Scope: dat aliased to the extract.
   - The dats <- dat assignment itself: the aliasing recognizer sees that RHS-identifier dat resolves to the extract and emits a no-op or binding-rename AnalysisCall, as it does today for plain alias assignments.
3. felm(..., data = dats) → recognized as linear-model. data=dats resolves to the alias chain, then to the extract’s assignedTo. Edge added: <extract>.out → <linear-model>.data.

Pipeline shape:
[webr-rdata-load: dat from data_with_charge_offs.RData] ──► [linear-model: felm(...)]
Execution:
- webr-rdata-load dispatches to worker → loads file into private env → returns e$dat → is.data.frame probe TRUE → marshals to Dataset → worker-manager unwraps → extract.result = Dataset.
- linear-model reads extract.result from the TS pipeline store → runs OLS via the existing native executor → produces RegressionResult.

Coefficient agreement vs fixest::feols on the same data:
exact (per paper #13 INTERLYSE-RUN-STATUS).
load("A.RData") # introduces dat = A_value
load("B.RData") # introduces dat = B_value (overwrites A's per R semantics)
feols(y ~ x, data = dat)

Recognizer:
1. load A → chain=[A].
2. load B → chain=[A, B], lastLoadSpan = B’s.
3. data=dat → wildcard fires with chain=[A, B]. Extract attributed to B’s span: __load_<B_span>_dat, loadChain: [A, B], boundName: dat.

Worker dispatch loads A then B into private env; B overwrites A’s dat in the env; returns B’s dat. ✓

If B doesn’t actually contain dat (only A did): private
env after both loads has A’s dat (B’s load is a no-op for the
dat name). e[[boundName]] returns A’s dat,
even though we attributed to B. Acceptable — we surface A’s dat, which
is what the user would get under R’s
load(A); load(B); use(dat) semantics if B didn’t redefine
dat.
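
The latest-wins behaviour inside the private env can be modelled without R by treating each RData file as a name-to-value map. This is only an analogy for what load(envir = e) does in sequence: `SketchRDataFile` and `extractFromChain` are hypothetical, and real RData contents are R objects rather than plain values.

```typescript
// Hypothetical model of an RData file: object name -> value.
type SketchRDataFile = Record<string, unknown>;

// Mirrors the worker script: fresh env per dispatch, files loaded in chain
// order (later files overwrite earlier names), then a guarded extract.
function extractFromChain(files: SketchRDataFile[], boundName: string): unknown {
  const env = new Map<string, unknown>();
  for (const file of files) {
    for (const key of Object.keys(file)) {
      env.set(key, file[key]);
    }
  }
  if (!env.has(boundName)) {
    throw new Error(`object '${boundName}' not found in load chain`);
  }
  return env.get(boundName);
}

const sketchBothDefine = extractFromChain([{ dat: "A" }, { dat: "B" }], "dat");
const sketchOnlyADefines = extractFromChain([{ dat: "A" }, { pdat: "P" }], "dat");

let sketchMissingMessage = "";
try {
  extractFromChain([{ pdat: "P" }], "dat");
} catch (e) {
  sketchMissingMessage = e instanceof Error ? e.message : String(e);
}
```

When both files define dat the later one wins; when only A defines it, A's value survives B's load; the error path is reached only when no file in the chain defines the name.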
load("A.RData")
felm(y ~ x, data = dat) # references dat, attributed to A
load("B.RData")
felm(y ~ x, data = dat) # references dat, attributed to B

Two distinct extract nodes:
- __load_<A_span>_dat with chain=[A]
- __load_<B_span>_dat with chain=[A, B]

Each runs in its own private env: the first returns A’s dat; the
second returns whichever-of-A-or-B has dat with B winning. Two source
nodes, two felms, no false sharing. Matches R semantics where the two
dat references could (in principle) resolve to different
bindings depending on what B redefined.
data_dir <- "./data"
load(file = paste0(data_dir, "/x.RData"))
feols(y ~ x, data = dat)

extractFilePath returns null (paste0 isn’t a literal).
load() falls through to
recognizeOpaqueSideEffect — no chain update, no wildcard.
The data=dat resolves via existing scope rules: scope miss
→ falls through to the existing webr-opaque opaque path with
dat as a free identifier outside scope (no edge). The feols
call ends up opaque; the user sees the same behavior as today.
Documented limitation. Future work: extend
extractFilePath to fold simple paste0 of
file-scope literals (the “Numeric file-scope constants” backlog item
generalizes here for strings too).
load()

load("A.RData")
print(ls())

load() updates chain state. No subsequent name reference
triggers an extract. No webr-rdata-load node is emitted;
the load() statement effectively disappears from the DAG.
print(ls()) is a webr-opaque side-effect, runs in
globalenv, and sees an empty (RData-binding-free) globalenv. Documented
limitation: globalenv-introspection of RData-loaded names doesn’t
work.
In practice this corner is rare in the corpus (the audit doesn’t show
globalenv-introspection idioms paired with load()). If a
paper hits it, the per-paper workaround is to surface the binding via an
explicit reference (e.g., add force(dat) or
invisible(dat) after load, which makes dat
referenced and triggers extract emission).
Today’s product semantics: Run = full topological re-execution of the
entire DAG. Each webr-rdata-load re-dispatches on every
Run.
Each dispatch runs in a fresh private env; globalenv state from prior Runs doesn’t affect correctness. Across two consecutive Runs of identical source:
- Run 1: each extract creates a fresh e, loads chain, returns e[[boundName]]. Result Dataset cached on node.result in the TS pipeline store.
- Run 2: each extract creates a fresh e (independent of Run 1’s), loads chain, returns the same value. ✓

Sibling extract execution order within a Run is arbitrary (topological sort); private envs prevent cross-extract pollution. ✓

Hypothetical per-node rerun (not in current
product): would pull the cached Dataset from
extractNode.result for downstream model nodes — no WebR
re-dispatch. Re-running the extract itself (e.g., after editing the
source) creates a fresh dispatch under the new params; no stale
state.
webr-rdata-load nodes are added to the prewarm set in
onPipelineChange (alongside webr-typed and
webr-opaque). A pipeline with only rdata-load nodes (no
opaque or typed) still triggers WebR worker boot on edit — without this,
Run would hit cold-start latency.
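
The prewarm decision reduces to a membership test over node types. The sketch below is illustrative only: `SketchNodeType`, `WEBR_NODE_TYPES`, and `pipelineNeedsWebR` are hypothetical names, and the real onPipelineChange hook may track more state.

```typescript
// Hypothetical subset of node types, enough to show the predicate.
type SketchNodeType =
  | "data-load"
  | "linear-model"
  | "webr-typed"
  | "webr-opaque"
  | "webr-rdata-load";

// Node types that require a booted WebR worker before Run.
const WEBR_NODE_TYPES: ReadonlySet<SketchNodeType> = new Set<SketchNodeType>([
  "webr-typed",
  "webr-opaque",
  "webr-rdata-load",
]);

// True when at least one node in the pipeline dispatches to WebR.
function pipelineNeedsWebR(nodeTypes: SketchNodeType[]): boolean {
  return nodeTypes.some((t) => WEBR_NODE_TYPES.has(t));
}

const sketchRdataOnly = pipelineNeedsWebR(["webr-rdata-load", "linear-model"]);
const sketchCsvOnly = pipelineNeedsWebR(["data-load", "linear-model"]);
```

Omitting "webr-rdata-load" from the set would make an rdata-only pipeline look like a pure-CSV one, which is the regression the prewarm rule guards against.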
VFS-sync of referenced *.RData files happens through the
existing pipeline (per the VFS bridge spec): on each pipeline rebuild,
referencedFiles is recomputed; new entries are pushed to
/workspace/ when the worker is ready. By the time
webr-rdata-load dispatches, its chain files are present at
/workspace/<path>.
src/core/parsers/r/recognize-rdata-load.test.ts (new file):
- Single-load + single-name reference → extract emitted with correct chain, boundName, sourceSpan = load span, deterministic synthetic id.
- Multi-load + name reference after both → extract chain has both files in source order; sourceSpan = last load.
- Same name referenced twice with one load between → two distinct extracts, each with its own attribution.
- Programmatic load path → no chain update; falls through to opaque.
- load() followed by no name reference → no extract emitted.
- Idempotence: re-recognize identical source → same synthetic ids; no duplicate AnalysisCalls.
- load(file = X) named-arg form vs load("path") positional form both recognized.

src/core/parsers/r/recognizer.rdata-load.test.ts (recognizer integration):
- load → dats <- dat → felm(..., data=dats) (paper #13 shape) → pipeline contains one rdata-load + one alias + one linear-model with the correct edges.
- load → opaque code referencing the loaded name → feols(...) → opaque code’s free-identifier resolution picks up the extract; all three nodes wired correctly.

src/core/pipeline/mapper.test.ts (extend):
- 'rdata-load' AnalysisCall → WebRRDataLoadNode with right params, label, sourceSpan.
- Edge resolution: synthetic assignedTo resolves to the extract; consumer data= arg picks up the edge.

src/workers/webr-worker.test.ts (extend; real WebR is gated, mock here):
- handleDispatchRDataLoad mock-WebR test: chain is loaded in order; e[[boundName]] returns expected value; probe runs.
- Error cases: chain file load failure → ‘load’ stage error; boundName missing in env → ‘extract’ stage error.

src/workers/webr-worker.integration.rdata.test.ts:
- Real WebR worker. Push a small .RData fixture into /workspace/. Dispatch dispatch-rdata-load with that path and a boundName. Assert the round-trip Dataset matches the file’s contents.
- Multi-file chain where the second file’s binding overwrites the first’s; assert latest-wins.
- Sibling-extract independence: dispatch two extracts referencing different boundNames in the same file; assert both succeed and return correct values regardless of dispatch order.

src/core/pipeline/integration.test.ts (extend):
- Paper #13 collateral structural assertion: paste the headline load → dats <- dat → felm(...) chunk; recognize + map; assert pipeline has 1 rdata-load + 1 alias + 1 linear-model + 0 opaque nodes; FE/cluster/2-term extraction correct.

src/core/pipeline/replicate-collateral.test.ts — new
file, two-test pattern (canonical example:
replicate-soil.test.ts):
Test 1 (always runs in npm test):
in-tree synthetic CSV. The existing structural assertion in
integration.test.ts already covers the CSV path; this test
extends to assert the rdata-load path works with a synthetic
.RData fixture committed alongside the CSV. Generated via a
small examples/collateral-bunching-loans.R helper script
that reads the CSV and save()s it to
examples/collateral-bunching-loans.RData. The synthetic
.RData is bytes-stable (set.seed in the generator) so the
test is deterministic.
Test 2 (gated on RUN_PAPER_MATCH=1 +
data_with_charge_offs.RData presence): runs the paper’s
actual headline chunk against the simulated-data RData in the deposit.
Skipped silently in plain npm test and on fresh clones (no
replication-package data).
Both tests verify: (a) the rdata-load node executes successfully, (b)
the marshaled Dataset matches a parallel read.csv (test 1)
or load+as.data.frame via R sidecar (test 2), (c)
downstream felm/lm coefficients match the
native-R baseline within the standard tolerance (<0.00005 for
statistics, <0.00001 for p-values per CLAUDE.md).
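
The tolerance check in (c) can be expressed as a small helper. The two thresholds come from the text above; the helper name `withinTolerance` and the constant names are hypothetical, and the second comparison value below is an arbitrary illustration rather than a documented result.

```typescript
// Thresholds quoted above (per CLAUDE.md): strict less-than comparisons.
const STAT_TOLERANCE = 0.00005;
const P_VALUE_TOLERANCE = 0.00001;

// True when the absolute difference is strictly below the tolerance.
function withinTolerance(actual: number, expected: number, tolerance: number): boolean {
  return Math.abs(actual - expected) < tolerance;
}

// 0.062789 is the documented collateral coefficient on the synthetic data;
// the comparison values are made up to show one pass and one fail.
const sketchCoefOk = withinTolerance(0.06279, 0.062789, STAT_TOLERANCE);
const sketchCoefOff = withinTolerance(0.0629, 0.062789, STAT_TOLERANCE);
const sketchPValueOk = withinTolerance(0.012345, 0.01234, P_VALUE_TOLERANCE);
```

Using an absolute rather than relative difference matches how the thresholds are stated; a relative check would behave differently for coefficients near zero.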
e2e/webr-rdata-load.spec.ts:
- Drag-drop a small ZIP containing one .RData fixture and one .R script with the load() → felm(...) chunk.
- Wait for DAG containing a webr-rdata-load node (source-shaped, file icon, “R” badge) and a linear-model node connected to it.
- Click Run.
- Assert the rdata-load node transitions pending → running → complete; assert the regression result panel renders coefficients matching the expected synthetic-data values.

Re-run audit-opaque.ts after implementation. Expected:
- load occurrences in unsupported/webr-opaque drop substantially (every paper that used load() now has rdata-load nodes instead).
- New webr-rdata-load metric appears with non-zero count across multiple packages (paper #4, #13, #17 minimum).
- Models-blocked count drops by the count attributable to RData-only data inputs (paper #13’s 4-6 models, paper #4’s models, partial unblock for paper #17; full unblock requires the MILP primitives that are out of scope here).

- webr-rdata-load extracts run in private envs: load() writes do not propagate to globalenv. Opaque code that does ls(), exists("name"), or mget() will not see RData-loaded bindings. If a paper hits this, the per-paper workaround is to surface the binding via an explicit reference (forces extract emission); never re-architect the load primitive to use shared globalenv state, because that breaks rerun and sibling-execution-order correctness.
- webr-rdata-load synthetic ids are deterministic: __load_<lastLoadSpan.start>_<lastLoadSpan.end>_<boundName>. Required for idempotent re-recognition. Same span ⇒ same id; cached scope entry prevents duplicate emission.
- load() chain state is recognizer-internal, not pipeline-visible: loadChain lives in recognizer state and is snapshotted into each emitted extract’s params. Never pass chain state to the worker as a “session-wide chain”; each extract dispatch is self-contained.
- extractFilePath remains the single source of truth for load’s file-arg position: the recognizer’s load() detection in §4.4 calls extractFilePath; never duplicate the function-name → arg-position table inline. Programmatic paths (paste0, vars) return null and fall through to opaque, matching the documented VFS-bridge limitation.
- webr-rdata-load is added to the onPipelineChange prewarm set: opaque-only and rdata-load-only pipelines both need a warm WebR worker; regressing this forces cold-start on Run.
- webr-rdata-load is added to collectRequiredPackages’s scan path if non-base packages become relevant: load() is base-R (no install needed), so today this is a no-op. If future extensions require a package (e.g., reading R<3.5 RData via a compatibility shim), the scan must be extended.

- Paper-match test (RUN_PAPER_MATCH=1) for
paper #13 passes when the deposit’s RData fixture is present.
- npm run build && npm test && npm run lint && npm run test:e2e green.
- npm run test:paper-match green (when data is present; skip-cleanly otherwise per CLAUDE.md).
- Paste the load → dats <- dat → felm(...) chunk, click Run, verify regression results match the documented coefficients (collateral 0.062789, log_amt 0.028844 on the synthetic data).
- INTERLYSE-RUN-STATUS.md §13 records the load(.RData) blocker as resolved; add the rdata-load row to the “What was needed to make it run” table.
- BACKLOG.md: load(file = "*.RData") bullet marked DONE with cross-reference to this spec and to paper #13’s verification.

- webR.evalRVoid and webR.evalR already exist; new.env() and load(envir=) are base R.
- AnalysisKind union gains 'rdata-load': TypeScript exhaustiveness will surface every missing switch case at build time.
- PipelineNode discriminated union gains WebRRDataLoadNode: same exhaustiveness coverage.
- NODE_PORTS and getPortsFor() get one new entry; no callers need updates beyond the discriminated-union ones.
- 'any' port-validation rule is unchanged from webr-opaque; no validation regressions.
- makeNode() helpers (per CLAUDE.md “Adding a New Primitive” extending guide).

- load() errors at runtime (corrupt RData, unsupported R version): worker returns 'load' stage error; extract node enters error state; consumer nodes show upstream-error. User can fix the file and re-run.
- boundName not in chain: worker 'extract' stage error with the clear message object 'dat' not found in load chain. User can audit the .RData (e.g., open in R) to confirm what names it brings.
- Performance (large .RData reloaded by many sibling extracts within a Run): documented limitation. Mitigation is the prefix-cache scheme deferred to a future session, a pure dispatch-layer change that doesn’t affect node shape or recognizer state.
- node.result: same shape as existing data-load nodes with large CSVs. Bounded by browser TS-side heap, not WebR’s WASM heap. Existing constraints apply.
- Same VITE_DISABLE_WEBR=1 env flag as the rest of WebR. With WebR disabled, webr-rdata-load nodes show as a clear “WebR disabled” error on Run; the rest of the pipeline still functions for pure-CSV paths.

- .RData binary header parser for advance name extraction → future UX improvement.
- webr-binary-load primitive → readRDS / haven::read_dta / etc. fit the typed-marshaler shape, separate session.
- Programmatic paths in load() calls → matches the VFS bridge’s existing limitation; out of scope.