Date: 2026-05-01
Milestone: WebR follow-up (Session 5)
Supersedes: 2026-05-01-webr-rdata-load-design.md (runtime-extract design — first pass; retained in git for trace-back).
Predecessors:
- 2026-04-20 WebR Integration (Session 1 — typed framework + lm_robust)
- 2026-04-22 WebR Opaque Nodes (Session 2 — opaque path end-to-end)
- 2026-04-27 WebR VFS Bridge (Session 4 — file-byte sync, extractFilePath, per-dispatch CWD)
load(file = "*.RData") is the single biggest cross-paper
blocker per the WebR audit. Cited by paper #4 (investor-memory), paper
#13 (cost-of-consumer-collateral — every Table 4–6 script), paper #17
(policy-targeting — ~245 chunks, top of the audit), and many other
corpus papers shipping .RData intermediates.
The first design pass (the superseded sibling spec) treated RData as
a runtime extension point: recognizer state, lazy synthetic extract
emission, a new webr-rdata-load primitive dispatched at Run
time. That design works but introduces a new node type, recognizer state
machinery, and per-Run worker round-trips for what is structurally a
static container format.
This spec reframes RData as a multi-dataset
container — semantically what it is. An .RData
file is a serialized bundle of R objects. Uploading one is uploading multiple datasets
and globals. The right architecture extracts at upload time, registers
the contents in the existing data registry under a namespaced path
scheme (X.RData::dat), and lets the existing
data-load primitive handle resolution. WebR is invoked once
per upload to extract, not at Run time for these data sources.
The user’s load(file = "X.RData") source line becomes a
recognizer-side namespace activation hint — pure
scope-resolution metadata, no runtime semantics. Everything downstream
of that goes through the same data-load →
linear-model flow that CSV uploads already use.
Upload-time extraction:
- .RData bytes hit WorkspaceStore (existing VFS-bridge sync).
- Worker-manager dispatches a one-time extraction request: load(envir = e); ls(e); classify each name.
- Each extracted name is classified and registered (a sketch of the classification result type follows this list):
  - data.frame (including tibbles, which inherit from data.frame) → marshal to MarshaledDataset → unwrap to TS Dataset → register under the namespaced path <file>::<binding> in the dataset registry.
  - Scalar (numeric, character, logical) → register as a workspace global.
  - Simple vector (numeric, character, logical) → register as a workspace global.
  - Named list of the above (recursive) → register as a structured global.
  - Anything else (S4, fitted models, matrices, factors-not-in-frame, complex named lists with non-extractable members, functions) → leave as an opaque-binding handle in the worker's globalenv; mark as "not surfaced to TS" with the binding name surfaced to the data panel under an "opaque" zone.
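For reference, a minimal sketch of that classification result, assuming the GlobalValue union defined in the manifest section below; the type name and import path are illustrative, and the real worker helper (classifyAndMarshal, below) may shape it differently:

import type { GlobalValue } from './core/rdata/manifest'; // path illustrative

// One classification outcome per extracted name.
type Classification =
  | { kind: 'dataset' }                     // data.frame / tibble; marshaled separately as a MarshaledDataset
  | { kind: 'global'; value: GlobalValue }  // scalar / vector / named list, surfaced to TS
  | { kind: 'opaque' };                     // left in worker globalenv under a prefixed binding name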
RDataManifest records: dataset names, global names, opaque-binding names, and any failures.

Recognizer:
- load(file = "X.RData") → activate the X.RData namespace by appending its file path to a per-recognizer-pass activatedNamespaces: string[]. No AnalysisCall is emitted for the load() statement itself.
- Unresolved identifier reference (feols(..., data = dat)) → walk activatedNamespaces in reverse (latest first), checking each file's manifest for dat. First hit wins. Resolves to the namespaced path <file>::dat. Emit a standard data-load AnalysisCall with args.file = "<file>::dat"; its sourceSpan points at the load() statement that activated the winning namespace (a shape sketch follows this list).
- Globals referenced by name (e.g., n_max from a global numeric) — resolved through the existing FileRegistry global-scope mechanism, extended to also pull from upload-extracted globals.
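For concreteness, a hedged sketch of the emitted call shape; the field names (kind, args, sourceSpan, assignedTo) follow the examples later in this spec, not the actual AnalysisCall type in src/core:

// Hypothetical shape of the call emitted for `dats <- dat` after
// load(file = "./data/data_with_charge_offs.RData") has activated its namespace.
interface Span { start: number; end: number }   // placeholder span shape
const loadSpan: Span = { start: 0, end: 48 };   // span of the load() statement

const emitted = {
  kind: 'data-load' as const,
  args: { file: 'data/data_with_charge_offs.RData::dat' }, // namespaced path, still a single string
  sourceSpan: loadSpan, // click-to-source lands on the load() line
  assignedTo: 'dats',   // lhs alias added to recognizer scope
};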
Run time:
- data-load node with namespaced path: the executor looks up <file>::<binding> in the dataset registry and returns the cached Dataset. No worker round-trip for the data source itself.
- Opaque code that references RData-sourced names: the existing input-binding mechanism marshals the workspace's Dataset into the worker as a globalenv binding — the same mechanism today's opaque dispatches use for any TS-Dataset input.
Net result: RData becomes “just another upload
format.” DAG shows data-load nodes with file paths; user
mental model is identical to CSV uploads. WebR’s role is reduced to a
one-shot extractor at upload time and the existing opaque-code execution
layer; not a runtime data dependency.
The changes, component by component:
- Worker: a new 'extract-rdata' request type. Loads the file into a fresh env, enumerates names, classifies each, marshals data.frames/scalars/vectors/named-lists, and leaves complex objects as opaque handles in globalenv.
- State: an RDataManifest capturing extracted names + types + values; namespaced dataset registry entries (<file>::<binding> keys); a workspace globals store (a parallel map to FileRegistry's global maps, keyed by binding name with discriminated value types).
- Upload flow: on .RData upload, kick off background extraction; show a progress indicator; populate manifest + registry on completion; surface failures via banner.
- Execution: <file>::<binding> paths flow through the existing dataset registry. data-load params unchanged (still { file: string }).
- Recognizer: activatedNamespaces: string[] in recognizer state; updated on each load(file = "literal") encounter; consulted by the unresolved-identifier resolution path with reverse-order lookup.
- Globals: extend the global value type to support string | string[] | number | number[] | boolean | NamedList; merge upload-extracted globals into the per-file scope alongside source-derived globals.
- Named lists: collect lst <- list(a = ..., b = ...) and register as structured globals (recursive shape); resolve lst$member and lst[["member"]] access against named-list globals (recursive).
- Numeric constants: collect numeric scalars (min.year <- 1996). Lands the BACKLOG "Numeric file-scope constants" bullet as a side effect (cited by paper #10 wage-risk).
- Lifecycle: on removal of an .RData upload, rm() the corresponding opaque handles from worker globalenv; clear manifest + registry entries; clear workspace globals from that file.
- Run status: mark the load(.RData) blocker as resolved.
- BACKLOG: mark load(file = "*.RData") DONE; mark "Numeric file-scope constants" DONE (side effect); cross-reference this spec.

Deferred or rejected:
- Streaming extraction if a very large .RData file surfaces memory pressure. Same architecture, finer-grained execution.
- .dta/.sav/.xlsx/.rds: the same pattern unifies these formats with RData. Out of scope here; a follow-up session can lift the upload-extraction routine to a polymorphic file-type dispatcher. The data-panel and recognizer changes in this spec are RData-shaped but extensible.
- A TS-side .RData header parser for advance name extraction without invoking WebR. Considered and rejected: WebR is the right tool for R-binary deserialization; the upload-time worker-boot cost is the real trade.

Known limitations:
- ls() / exists("dat") globalenv-introspection idioms (without referencing dat directly) won't see RData names because the recognizer's resolution doesn't visit them. Documented limitation.
- Two uploaded .RData files may both contain dat; if the source references dat without a load() to disambiguate, it currently falls through to opaque (no namespace activated). A "did you mean: A.RData::dat or B.RData::dat?" diagnostic is a future UX improvement.
- Programmatic load paths (load(file = paste0(dir, "x.RData"))): same documented limitation as the VFS bridge — literal paths only. Falls through to opaque.

src/ui/store/workspace.ts [modified]
Add rdataManifests: Map<filePath, RDataManifest>;
add rdataDatasets: Map<namespacedPath, Dataset>;
add globals: Map<name, GlobalValue>;
wiring: addFiles(...) triggers extraction
for .RData entries.
src/core/rdata/manifest.ts [new]
RDataManifest type;
GlobalValue discriminated union
(scalar | vector | named-list, recursive).
src/core/webr/protocol.ts [modified]
Add 'extract-rdata' request type +
'extract-rdata-result' response.
src/workers/webr-worker.ts [modified]
handleExtractRData: load → enumerate
→ classify → marshal data.frames →
return manifest + opaque-binding
handles left in globalenv.
src/workers/worker-manager.ts [modified]
extractRData(filePath): orchestrates
the extract dispatch; populates
workspace store; emits
extraction-complete events for UI.
src/core/parsers/file-registry.ts [modified]
GlobalConstantInfo extended to
discriminated value types;
merge upload-extracted globals
into per-file scope.
src/core/parsers/r/recognizer.ts [modified]
Detect load() with literal path;
update activatedNamespaces;
wire scope-miss hook for
unresolved identifier lookups
with reverse-order namespace walk.
src/core/parsers/r/file-scope-collector.ts [modified]
(or wherever source-side constants
are collected today) — extend to
recognize numeric literals and
named-list literals.
src/core/parsers/r/expression-evaluator.ts [modified]
(or wherever scope-resolution lives)
— handle lst$member and lst[["member"]]
against named-list globals.
src/core/pipeline/data-registry.ts [modified — if exists, else inline
in executor]
Resolve namespaced paths
(<file>::<binding>) by lookup
into rdataDatasets.
src/core/pipeline/executor.ts [modified]
data-load executor: when path
contains '::', resolve via
rdataDatasets registry.
src/ui/components/panels/data-panel.tsx [modified]
Tree-grouped view by file;
datasets/globals/opaque sections;
per-binding preview.
src/ui/components/banners/extraction-status.tsx [new]
Per-file banner showing
extraction progress and any
failed bindings.
reference-papers/INTERLYSE-RUN-STATUS.md [modified]
§13 updated: drop CSV workaround.
BACKLOG.md [modified]
Mark load() bullet DONE;
mark numeric constants bullet DONE.
CLAUDE.md [modified]
Add No-Go invariants.
// src/core/rdata/manifest.ts
/** Classification of every name found in an RData file at extraction. */
export interface RDataManifest {
filePath: string; // workspace-relative
datasets: string[]; // names that materialized as Datasets
globals: Record<string, GlobalValue>; // names that materialized as TS values
opaqueBindings: string[]; // names left as opaque handles in worker globalenv
failures: { name: string; reason: string }[]; // names that couldn't be classified or marshaled
extractedAt: number; // timestamp; for debugging
}
/** Discriminated union covering source-text constants and RData-extracted globals. */
export type GlobalValue =
| { kind: 'scalar-string'; value: string }
| { kind: 'scalar-number'; value: number }
| { kind: 'scalar-bool'; value: boolean }
| { kind: 'vector-string'; values: string[] }
| { kind: 'vector-number'; values: number[] }
| { kind: 'vector-bool'; values: boolean[] }
| { kind: 'named-list'; members: Record<string, GlobalValue> }; // recursive

The same GlobalValue type is used for both
source-text constants (collected from
min.year <- 1996, path <- "data.csv",
vars <- c("a","b"),
lst <- list(a = 1, b = "x")) and
upload-extracted globals. This unifies the recognizer’s
scope-resolution path; consumers don’t care about origin.
The existing FileRegistry.globalConstants (string
scalars) and globalVectors (string vectors) get folded into
a single
globalValues: Map<string, { value: GlobalValue; sourceFile: string }>
map. The sourceFile field is the R script for
source-derived globals or the .RData file path for
upload-extracted ones. Backward-compatible migration: existing call
sites accessing globalConstants /
globalVectors switch to globalValues filtered
by kind.
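A minimal migration sketch, assuming the GlobalValue union above; the accessor names are illustrative, not existing FileRegistry methods:

import type { GlobalValue } from '../rdata/manifest';

type GlobalEntry = { value: GlobalValue; sourceFile: string };

// Replaces direct reads of the old string-scalar globalConstants store.
function getStringConstant(globalValues: Map<string, GlobalEntry>, name: string): string | undefined {
  const v = globalValues.get(name)?.value;
  return v?.kind === 'scalar-string' ? v.value : undefined;
}

// Replaces direct reads of the old string-vector globalVectors store.
function getStringVector(globalValues: Map<string, GlobalEntry>, name: string): string[] | undefined {
  const v = globalValues.get(name)?.value;
  return v?.kind === 'vector-string' ? v.values : undefined;
}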
Protocol additions:
// src/core/webr/protocol.ts
| { type: 'extract-rdata';
id: string;
path: string; // workspace-relative; worker prefixes /workspace/
}
| { type: 'extract-rdata-result';
id: string;
manifest: RDataManifest;
error?: string; // file-level errors (load failed entirely, etc.)
}

Worker handler:
async function handleExtractRData(
req: Extract<WebRRequest, { type: 'extract-rdata' }>,
): Promise<void> {
if (!webR) {
post({ type: 'extract-rdata-result', id: req.id,
manifest: emptyManifest(req.path),
error: 'worker not initialized' });
return;
}
const absPath = `/workspace/${req.path}`;
const manifest: RDataManifest = {
filePath: req.path,
datasets: [],
globals: {},
opaqueBindings: [],
failures: [],
extractedAt: Date.now(),
};
try {
// Load into a fresh env to enumerate names without polluting globalenv yet.
await webR.evalRVoid(`.interlyse_extract_env <- new.env()`);
await webR.evalRVoid(`load(file = ${JSON.stringify(absPath)}, envir = .interlyse_extract_env)`);
const namesR = await webR.evalR(`ls(.interlyse_extract_env)`);
const names = await (namesR as unknown as { toArray: () => Promise<string[]> }).toArray();
for (const name of names) {
try {
const classification = await classifyAndMarshal(name);
switch (classification.kind) {
case 'dataset':
manifest.datasets.push(name);
break;
case 'global':
manifest.globals[name] = classification.value;
break;
case 'opaque':
// Promote to globalenv with a deterministic prefixed name to avoid
// collision across multiple uploaded RData files. The prefix uses
// a short hash of the file path; lifecycle (rm on file removal)
// tracks these via the manifest's opaqueBindings list.
await webR.evalRVoid(
`assign(${JSON.stringify(prefixedName(req.path, name))}, ` +
` .interlyse_extract_env[[${JSON.stringify(name)}]], ` +
` envir = globalenv())`,
);
manifest.opaqueBindings.push(name);
break;
}
} catch (e) {
manifest.failures.push({
name,
reason: e instanceof Error ? e.message : String(e),
});
}
}
} catch (e) {
post({ type: 'extract-rdata-result', id: req.id, manifest,
error: e instanceof Error ? e.message : String(e) });
return;
} finally {
await webR.evalRVoid(`rm(.interlyse_extract_env, envir = globalenv())`);
}
// Marshaled datasets are returned via a separate dataset-batch message —
// see §4.4. The manifest message itself only carries names + globals.
post({ type: 'extract-rdata-result', id: req.id, manifest });
}

classifyAndMarshal(name) is a worker-internal helper
that branches on R-side type checks:
# Pseudocode (actual worker code is TS calling R via evalR)
val <- .interlyse_extract_env[[name]]
if (is.data.frame(val)) {
# Marshal as MarshaledDataset (existing marshalDatasetFromR pattern)
return classification: 'dataset' (post bytes via separate message)
}
if (is.numeric(val) && length(val) == 1) {
return classification: 'global' { kind: 'scalar-number', value: as.numeric(val) }
}
if (is.character(val) && length(val) == 1) { ... 'scalar-string' ... }
if (is.logical(val) && length(val) == 1) { ... 'scalar-bool' ... }
if (is.numeric(val) && length(val) > 1) { ... 'vector-number' ... }
if (is.character(val) && length(val) > 1) { ... 'vector-string' ... }
if (is.logical(val) && length(val) > 1) { ... 'vector-bool' ... }
if (is.list(val) && !is.null(names(val)) && all(names(val) != "")) {
# Recursive classification of each member; if all members classify cleanly,
# promote to a 'named-list' global. If any member fails, fall back to opaque.
members <- ... recursive call on each member ...
if (members.allOk) return classification: 'global' { kind: 'named-list', members }
else return classification: 'opaque'
}
# Everything else: leave in globalenv as opaque binding.
return classification: 'opaque'

Memory note: MarshaledDataset for each
data.frame is built once at extraction; bytes flow back to TS once and
the worker’s copy in .interlyse_extract_env is
garbage-collected after the rm() in the
finally block. Opaque bindings stay alive in globalenv
(with prefixed names) until file removal or worker restart.
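prefixedName(path, name) is referenced throughout but not defined in this spec; a plausible sketch, assuming a short FNV-1a hash of the workspace-relative path (any stable hash works as long as the same (path, name) pair always maps to the same binding):

// Deterministic R binding name for opaque handles, matching the
// __rdata_<hash>_<name> pattern used in the examples below.
function prefixedName(filePath: string, binding: string): string {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < filePath.length; i++) {
    h ^= filePath.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return `__rdata_${h.toString(16)}_${binding}`;
}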
// src/workers/worker-manager.ts
async function extractRData(filePath: string): Promise<void> {
await ensureWebRWorker(); // boots if not yet running
const id = nextDispatchId();
const datasets = new Map<string, MarshaledDataset>();
// The dispatch may yield multiple messages: zero or more 'rdata-dataset'
// (one per data.frame), then one 'extract-rdata-result' (the manifest).
// Existing dispatcher correlates by id.
const result = await dispatchExtractRData(id, filePath, datasets);
// Populate workspace store
const workspace = useWorkspaceStore.getState();
workspace.registerRDataExtraction({
filePath,
manifest: result.manifest,
datasets, // Dataset values keyed by binding name
});
if (result.error) {
// Fatal extraction error (file-level): record + surface banner
workspace.recordExtractionError(filePath, result.error);
}
}

Trigger: WorkspaceStore.addFiles(...)
checks file extension; for .RData / .rdata,
schedules extractRData(path) in the background (does NOT
block the upload completion). The UI shows an “extracting…” indicator on
the file in the data panel until extraction completes.
Concurrency: extractions for different files run
sequentially (worker FIFO queue per CLAUDE.md atomic-end-to-end
invariant). A second .RData upload starts after the first
completes.
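A sketch of that trigger, assuming the store and worker-manager pieces above; setExtractionStatus is an illustrative name for whatever drives the "extracting…" indicator:

// Called from WorkspaceStore.addFiles(...) after file bytes are stored.
// Extraction is fire-and-forget: upload completion is never blocked on it,
// and the worker's FIFO queue serializes extractions for multiple files.
function scheduleRDataExtractions(paths: string[]): void {
  for (const path of paths) {
    if (!/\.rdata$/i.test(path)) continue; // .RData / .rdata only
    useWorkspaceStore.getState().setExtractionStatus(path, 'extracting'); // illustrative
    void extractRData(path).catch((e) =>
      useWorkspaceStore.getState().recordExtractionError(path, String(e)),
    );
  }
}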
// src/ui/store/workspace.ts — additions
interface WorkspaceState {
// ... existing files: Map<string, Uint8Array> ...
/** Per-file extraction manifest. Populated post-extraction. */
rdataManifests: Map<string, RDataManifest>;
/** Namespaced dataset registry: key = "<file>::<binding>". */
rdataDatasets: Map<string, Dataset>;
/**
* Workspace globals — both source-text and upload-extracted.
* Keyed by binding name; values are GlobalValue-typed.
* Conflicts (same name from multiple sources) are resolved at scope-merge
* time per-recognizer-pass (latest source-order wins, matching existing
* FileRegistry behavior).
*/
globals: Map<string, { value: GlobalValue; sourceFile: string }>;
/** Extraction lifecycle ops. */
registerRDataExtraction: (entry: ExtractionResult) => void;
removeRDataExtraction: (filePath: string) => Promise<void>; // also rm() opaque handles
recordExtractionError: (filePath: string, error: string) => void;
}

removeRDataExtraction(path) clears the manifest, deletes
namespaced registry entries
(<path>::<binding>), removes the file’s globals
from the workspace globals map, and posts an rm() request
to the worker for any opaque handles registered under that file.
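A sketch of the cleanup order, eliding the store's set() plumbing; removeBindings is a hypothetical worker-manager wrapper around the rm-bindings request described in the file-removal section below:

async function removeRDataExtraction(filePath: string): Promise<void> {
  const state = useWorkspaceStore.getState();
  const manifest = state.rdataManifests.get(filePath);
  if (!manifest) return;
  // 1. Drop the manifest and every namespaced dataset registered under this file.
  state.rdataManifests.delete(filePath);
  for (const key of [...state.rdataDatasets.keys()]) {
    if (key.startsWith(`${filePath}::`)) state.rdataDatasets.delete(key);
  }
  // 2. Drop globals that originated from this file.
  for (const [name, entry] of [...state.globals]) {
    if (entry.sourceFile === filePath) state.globals.delete(name);
  }
  // 3. Ask the worker to rm() the prefixed opaque handles.
  if (manifest.opaqueBindings.length > 0) {
    await workerManager.removeBindings( // hypothetical wrapper for the 'rm-bindings' request
      manifest.opaqueBindings.map((name) => prefixedName(filePath, name)),
    );
  }
}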
Recognizer state extension:
interface RecognizerState {
// ... existing scope, diagnostics, loadedPackages ...
/**
* Source-order list of RData file paths whose namespaces have been activated
* by load() calls in the current pass. Each entry's manifest is consulted
* for unresolved-identifier resolution; reverse-order walk (latest first).
* Programmatic load paths (paste0, vars) do not activate.
*/
activatedNamespaces: string[];
/**
* Per-activated-file span of the load() statement that activated it. Used
* to set sourceSpan on emitted data-load nodes so click-to-source lands on
* the load() line. If the same file is load()'d multiple times, the most
* recent activation's span overwrites — matches reverse-order resolution.
*/
namespaceLoadSpans: Map<string, Span>;
}

Walking the AST: the existing top-level statement
loop adds a branch for load():
for (const stmt of program.body) {
// ... existing: ignorable calls, loops, function defs ...
if (isLoadCall(stmt)) {
const path = extractFilePath(stmt /* or stmt.value if assigned */);
if (path && externalScope.rdataManifests.has(path)) {
state.activatedNamespaces.push(path);
state.namespaceLoadSpans.set(path, spanOf(stmt)); // record the activation span (accessor name illustrative)
continue; // emit no AnalysisCall for the load() itself
}
// Fall through: programmatic path or unknown file — treat as opaque
// side-effect (existing behavior).
}
const result = tryRecognizeStatement(stmt, state);
// ... existing emit ...
}

Scope-miss hook for unresolved identifiers: when the
recognizer’s RHS-identifier or data= arg resolution misses
scope (consults state.scope and
externalScope.globals and finds nothing), it falls through
to:
function resolveAgainstActivatedNamespaces(
name: string,
state: RecognizerState,
externalScope: ExternalScope,
): { kind: 'dataset'; namespacedPath: string; loadSpan: Span }
| { kind: 'global'; value: GlobalValue }
| undefined
{
// Walk activated namespaces in reverse — latest first.
for (let i = state.activatedNamespaces.length - 1; i >= 0; i--) {
const file = state.activatedNamespaces[i];
const manifest = externalScope.rdataManifests.get(file);
if (!manifest) continue;
if (manifest.datasets.includes(name)) {
// Dataset hit — caller emits a data-load AnalysisCall with namespaced path.
const loadSpan = state.namespaceLoadSpans.get(file); // tracked per activation
return { kind: 'dataset', namespacedPath: `${file}::${name}`, loadSpan };
}
if (name in manifest.globals) {
return { kind: 'global', value: manifest.globals[name] };
}
}
return undefined;
}

Caller integration:
- The data= arg recognizer (recognize-data-arg.ts) consults this hook on miss; on a 'dataset' hit, it emits a data-load AnalysisCall with args.file = namespacedPath, sourceSpan = loadSpan, and wires the resulting node as the data upstream.
- The RHS-identifier path (dats <- dat) does the same: emit a data-load with the namespaced path; alias the lhs in scope.
- Global hits resolve as the existing global-scope mechanism does.
Idempotence: emitted data-load
AnalysisCalls have deterministic shape (path is span-independent —
namespacedPath is purely a function of file + binding, both stable).
Re-recognition of identical source produces identical calls.
namespaceLoadSpans tracking: as
activatedNamespaces is appended on each load()
encounter, we also record the load-statement span per file so the
resulting data-load node’s sourceSpan can
point at the load() line. If the same file is load()’d
twice (defensive idiom — re-loading), latest wins (matches R semantics;
reverse-order resolution naturally picks the latest activation’s
span).
Datasets registered at <file>::<binding>
keys. The existing data-load executor — which today accepts
a file string and looks it up in the dataset registry —
gets one extra resolution rule:
function resolveDataset(path: string, registry: WorkspaceState): Dataset | undefined {
// Existing: direct file-path lookup against the parsed-dataset registry
// (CSVs / DTAs / etc., parsed at upload time). Field name follows whatever
// the current implementation uses (the spec deliberately doesn't pin the
// exact identifier — implementer reads the executor and parallels it).
const direct = registry.getParsedDataset(path);
if (direct) return direct;
// NEW: namespaced lookup for RData-extracted datasets — distinguished by
// the '::' separator in the path. Lookup always uses the full namespaced
// string as the key.
return registry.rdataDatasets.get(path);
}

data-load params unchanged:
{ file: string }. The path syntax is the entire encoding.
UI / mapper / recognizer don’t introspect on the :: — they
treat it as an opaque path string. Only the executor’s resolution logic
distinguishes.
The existing source-text constant collector (today: string scalars and string vectors) is extended:
- Numeric scalars: x <- 42, min.year <- 1996 → { kind: 'scalar-number', value: 42 }. Detection: assignment whose RHS is a numeric literal (or unary-minus over a numeric literal).
- Numeric vectors: years <- c(1996, 1997, 1998) → { kind: 'vector-number', values: [...] }.
- Named lists: lst <- list(a = 1, b = "x", c = c(1,2,3), d = list(inner = 42)) → recursively classify each member; if all members classify, { kind: 'named-list', members }. Members can be any GlobalValue (recursive).

The expression evaluator (used by data-mutate / data-filter for column expressions and by the recognizer's identifier-resolution for inline substitution) gets matching extensions:
- lst$a / lst[["a"]] → resolve via members.a.
- Nested access (lst$d$inner) → recursive resolution.
- Positional access (lst[[1]]) → ordered-member access; named-list preserves insertion order in Record<string, GlobalValue> (JS spec-guaranteed for non-integer-like string keys).

Bounded scope: this is not a full R list interpreter. We handle:
- Literal list(...) assignments at the top level (source-text or upload-extracted).
- Member access via $ and [["..."]].
- No support for: list mutation (lst$a <- 5), list concatenation, programmatic key access, runtime-computed members.

A minimal member-resolution sketch follows this list.
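A hedged sketch, assuming the GlobalValue union from the manifest section; the function name and path representation are illustrative:

import type { GlobalValue } from '../../rdata/manifest';

// Resolves lst$a / lst[["a"]] chains (and positional lst[[1]]) against a
// named-list GlobalValue. Returns undefined on any miss.
function resolveMember(value: GlobalValue, path: (string | number)[]): GlobalValue | undefined {
  if (path.length === 0) return value;
  if (value.kind !== 'named-list') return undefined;
  const [head, ...rest] = path;
  const key = typeof head === 'number'
    ? Object.keys(value.members)[head - 1] // R's [[1]] is 1-based; insertion order of members
    : head;
  const member = key !== undefined ? value.members[key] : undefined;
  return member ? resolveMember(member, rest) : undefined;
}

For the config example later in this spec, resolveMember(configGlobal, ['filters', 'min_age']) would return { kind: 'scalar-number', value: 18 }.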
Tree-grouped view, file-level groups expandable:
▼ data_with_charge_offs.RData (extracted)
Datasets (2)
▸ dat [preview] [bind to data-load]
▸ pdat [preview] [bind to data-load]
Globals (1)
▸ n_max = 42 (number)
Opaque (1)
▸ model_results (R list — available to opaque code)
▼ collateral-bunching-loans.csv
Datasets (1)
▸ collateral-bunching-loans [preview] [bind to data-load]
▶ extracting… large_file.RData (in progress, 3/8 names)
A per-file extraction-status banner appears on the right edge during extraction, transitioning to a small ✓ on completion or to a yellow ⚠ if there were per-binding failures (with a “see details” link expanding the failures list).
Auto-binding on completion: when extraction
completes, any pre-existing data-load nodes whose
params.file matches a freshly-registered namespaced path
get their binding-status updated from “missing” to “bound.” Existing
path-match auto-binding mechanism, no new logic.
File removal (user removes upload from data panel):
1. WorkspaceStore.removeRDataExtraction(filePath):
   - Delete manifest from rdataManifests.
   - Delete <filePath>::* entries from rdataDatasets.
   - Delete globals entries with sourceFile === filePath.
   - Post an rm-bindings worker request listing prefixedName(filePath, name) for each opaque binding.
2. Pipeline rebuild fires (existing onPipelineChange); the recognizer no longer finds filePath in externalScope.rdataManifests; data-load nodes pointing at <filePath>::* now resolve to a missing-Dataset error.
Worker restart (Reset WebR Session future feature): all globalenv state lost. Workspace store still has manifests; Datasets are TS-side and unaffected. Opaque-binding handles are gone; opaque code referencing them errors at next dispatch. Re-extraction-on-boot is deferred to the Reset Session feature spec.
Re-uploading a file with the same path: existing wipe-confirmation flow runs; on Replace, prior extraction is cleaned up via the file-removal path before new extraction kicks off.
Source:
load(file = "./data/data_with_charge_offs.RData")
dats <- dat
felm(yfill ~ collateral + log_amt | timeInt + Disaster_Id | 0 | Disaster_Id, data = dats)

Upload: data_with_charge_offs.RData arrives; worker
extracts; manifest =
{ datasets: ['dat', 'pdat'], globals: {}, opaqueBindings: [], ... };
namespaced datasets registered.
Recognizer pass:
1. load(...) → path matches an entry in externalScope.rdataManifests → activatedNamespaces.push("data/data_with_charge_offs.RData"). No AnalysisCall.
2. dats <- dat: RHS identifier dat misses scope → namespace walk: latest = data_with_charge_offs.RData; manifest contains dat. Hit — emit data-load AnalysisCall with args.file = "data/data_with_charge_offs.RData::dat", sourceSpan = <load_span>, assignedTo = <synthetic-or-just-dat>. Add to scope.
3. felm(..., data = dats): dats resolves to the alias from step 2 → consumes the data-load node's output.
Pipeline:
[data-load: data_with_charge_offs.RData::dat] ──► [linear-model: felm(...)]
Execution: data-load executor sees :: in
path → resolves via rdataDatasets → returns cached Dataset.
linear-model runs natively. No worker round-trip for the
data source.
load("A.RData") # manifest: datasets=['dat', 'x'], ...
load("B.RData") # manifest: datasets=['dat', 'y'], ...
feols(y ~ x, data = dat) # references dat — which one?
feols(y ~ x, data = x) # references x — which one?
feols(y ~ x, data = y) # references y — which one?

activatedNamespaces after both loads: ["A.RData", "B.RData"].

Resolution (reverse order):
- dat → check B (has dat) → resolve to B.RData::dat. ✓ Latest-wins.
- x → check B (no x) → fall back to A (has x) → resolve to A.RData::x. ✓ Cascade hit.
- y → check B (has y) → resolve to B.RData::y. ✓
The DAG has three data-load nodes, each with its own
namespaced path. The user gets correct attribution without writing any
extra code.
data_dir <- "./data"
load(file = paste0(data_dir, "/x.RData"))
feols(y ~ x, data = dat)

extractFilePath returns null (paste0 not literal).
load() falls through to existing webr-opaque side-effect
path (no namespace activated). dat reference doesn’t
resolve → falls through to webr-opaque too. Same behavior as today;
documented limitation parallel to the VFS bridge’s programmatic-path
limitation.
load() with no direct name references:

load("A.RData")
print(ls())

Namespace activated; no name reference in subsequent code triggers
any data-load emission. print(ls()) is a webr-opaque
side-effect; runs in worker globalenv.
What does ls() see? The opaque-binding
handles for non-extracted complex objects (registered with prefixed
names like __rdata_<hash>_funcfoo), if any. NOT the
extracted data.frames/scalars (those live in TS). Limitation: this idiom
is partial — opaque code’s ls() shows the prefixed handles,
not the user-visible binding names.
If a future paper needs full globalenv-introspection of RData names: option is to surface dataset/global bindings into globalenv too, with their original (un-prefixed) names. Adds memory/binding-lifecycle complexity; defer until needed.
config <- list(
threshold = 0.05,
vars = c("x1", "x2", "x3"),
filters = list(min_age = 18, max_age = 65)
)
df %>% filter(age >= config$filters$min_age, age <= config$filters$max_age)

Source collector classifies config as
{ kind: 'named-list', members: { threshold: { kind: 'scalar-number', value: 0.05 }, vars: { kind: 'vector-string', values: [...] }, filters: { kind: 'named-list', members: { min_age: ..., max_age: ... } } } }.
data-filter recognizer’s expression-evaluator resolves
config$filters$min_age → 18,
config$filters$max_age → 65, substitutes
literals into the filter condition. Existing data-filter
machinery does the rest.
load("A.RData")
felm(y ~ x, data = dat) # currently bound to A.RData::dat

User removes A.RData from the data panel. Workspace
lifecycle: manifest cleared, namespaced dataset entries removed, opaque
handles rm()d in worker. Pipeline rebuild fires:
recognizer’s externalScope.rdataManifests no longer
contains A.RData; the load() falls through (no
activation); data = dat doesn’t resolve → falls through to
webr-opaque. The data-load node referencing
A.RData::dat is no longer emitted; if it had been pinned in
the editor’s pipeline state, it shows a missing-binding error.
User can either re-upload, or remove the load() from
source. The behavior is consistent with how the existing CSV path
handles file removal.
src/core/rdata/manifest.test.ts: -
RDataManifest round-trip serialization. -
GlobalValue discriminated union exhaustiveness. -
prefixedName(path, name) produces deterministic,
collision-free names.
src/workers/webr-worker.extract-rdata.test.ts (mock
WebR): - Classification of fixtures: pure data.frame manifest, mixed
(data.frame + scalar + vector + named-list + opaque), all-failures. -
finally-block cleanup of the temp env.
src/core/parsers/file-registry.test.ts — extend: -
Numeric scalar collection from x <- 42,
x <- -3.14. - Named-list collection (single level +
nested). - Source-text and upload-extracted globals merge correctly.
src/core/parsers/r/recognizer.namespace.test.ts — new: -
Single load + reference: emits data-load with namespaced
path; sourceSpan = load span. - Multi-load with cascading:
name-only-in-A referenced after both loads → resolves to A. -
Programmatic load path: no namespace activated; falls through to opaque.
- Same file load()’d twice: latest activation’s span used
for sourceSpan. - Idempotence: re-recognition produces identical
AnalysisCalls.
src/core/parsers/r/expression-evaluator.named-list.test.ts
— new: - lst$a resolves to scalar. -
lst[["a"]] resolves identically. - Nested access
(lst$d$inner). - Missing member errors clearly.
src/core/pipeline/data-registry.test.ts (or
executor.test.ts): - data-load with namespaced path
resolves via rdataDatasets. - data-load with
plain CSV path still resolves via existing
parsedDatasets.
src/workers/worker-manager.extract-rdata.integration.test.ts:
- Real WebR worker. Sync a small .RData fixture into
/workspace/. Trigger extraction. Assert manifest + Dataset
bytes match expected values. - .RData containing all object
types (data.frame, scalar, vector, named list, S4 placeholder) →
manifest classifies each correctly;
data.frames/scalars/vectors/named-lists round-trip. - File removal
post-extraction → opaque handles cleared from worker globalenv (assert
via subsequent exists() probe).
src/core/pipeline/integration.test.ts — extend: - Paper
#13 collateral structural assertion using the namespaced-path flow:
paste headline; recognize + map; assert pipeline has 1 data-load
(namespaced) + 1 alias (or absorbed) + 1 linear-model; correct
edges.
src/core/pipeline/replicate-collateral.test.ts — extend
the existing file (today: synthetic-CSV-only, paper #13’s Table 4
modNaive verified against fixest::feols). Add the two-test
pattern (canonical: replicate-soil.test.ts) that exercises
the upload-extraction path:
Test 1 (in-tree, always runs): synthetic
examples/collateral-bunching-loans.RData generated by an
examples/collateral-bunching-loans.R script (reads the
existing CSV stub, save()s as RData with
set.seed(2026) for byte stability). Verifies the
upload-extraction → recognizer → executor chain end-to-end against the
synthetic data.
Test 2 (gated RUN_PAPER_MATCH=1 + data
presence): paper’s actual data_with_charge_offs.RData from
the deposit. Skipped silently in plain npm test and on
fresh clones.
Both verify (a) extraction succeeds, (b) the resulting Dataset
matches read.csv of the equivalent CSV (test 1) or
load()+as.data.frame via R sidecar (test 2), (c) downstream
felm/lm coefficients match within
tolerance.
e2e/rdata-upload-extraction.spec.ts: - Drag-drop ZIP
containing one .RData and one .R script with
the load() → felm(...) chunk. - Assert “extracting…”
indicator appears, then resolves with extraction-complete. - Assert data
panel shows the file with its Datasets section populated. - Assert
pipeline DAG has data-load (namespaced) →
linear-model. - Click Run. Assert regression result panel
renders coefficients matching expected values. - Remove the file from
the data panel. Assert data-load shows missing-binding
error.
Re-run audit-opaque.ts after implementation. Expected: -
load occurrences in
unsupported/webr-opaque drop to ~zero (every
paper now resolves through namespace activation). - Models-blocked count
drops by the count attributable to RData-only data inputs (paper #13 4-6
models, paper #4 models, partial unblock for paper #17 — full unblock
requires the MILP primitives that are out of scope). -
Numeric-scalar-not-substituted occurrences (the BACKLOG
numeric-constants bullet) drop to ~zero — collateral side effect.
No-Go invariants (added to CLAUDE.md):
- The recognizer reads from externalScope.rdataManifests; the executor reads from workspace.rdataDatasets. Both are populated once at upload.
- The <file>::<binding> path syntax is the encoding for namespaced datasets — never extend data-load params with separate file/binding fields. The path is the key into rdataDatasets. Invariant: a path containing :: resolves via the namespaced registry; a path without :: resolves via the parsed-CSV registry.
- Only a literal-path load() activates a namespace — programmatic paths (paste0, vars) do not. They fall through to the webr-opaque side-effect, which is the same behavior the file-not-found case already has.
- prefixedName(path, name) is deterministic; manifest.opaqueBindings lists the names; file removal sends rm() for the prefixed names. Never leak handles across file-removal cycles.
- Named-list support covers literal list(...) assignments plus $ / [["..."]] member access only. No list mutation, no programmatic key access, no runtime-computed members. Document the bound.
- Numeric constants collect into a { kind: 'scalar-number' } GlobalValue. The inliner's free-var substitution pulls these into expression-evaluator scope.
- .interlyse_extract_env is created, populated, queried, and rm()'d in a single dispatch. Never leave the temp env alive across dispatches; opaque bindings get promoted to globalenv with prefixed names before the temp env is cleared.

Acceptance:
- The gated paper-match test (RUN_PAPER_MATCH=1) for paper #13 passes when the deposit's RData is present.
- npm run build && npm test && npm run lint && npm run test:e2e green.
- npm run test:paper-match green when data is present; skips cleanly otherwise.
- Paste the load → dats <- dat → felm(...) chunk, click Run, and verify regression results match the documented coefficients (collateral 0.062789, log_amt 0.028844 on the synthetic data).
- INTERLYSE-RUN-STATUS.md §13 marks the load(.RData) blocker as resolved; an "RData upload-extraction" row is added to the paper's "What was needed to make it run" table.
- BACKLOG.md: the load(file = "*.RData") bullet is marked DONE with cross-reference; the "Numeric file-scope constants" bullet is marked DONE.

Compatibility and risk notes:
- No new R packages: load(), new.env(), ls(), is.data.frame(), assign() are all base R.
- AnalysisKind union: NO new kinds. RData uses the existing data-load kind.
- PipelineNode union: NO new variants.
- data-load params unchanged (single file string).
- FileRegistry.globalConstants and globalVectors are folded into globalValues: Map<string, { value: GlobalValue; sourceFile: string }>. Existing call sites update one line each (filter by kind).
- .RData bytes still sync to /workspace/. The new extraction request operates on those same bytes.
- Memory: an .RData containing many large data.frames materializes all of them in the TS heap. Lazy materialization is a deferred optimization.
- An extracted X.RData::dat and a CSV named dat could suggest an ambiguous registry key. In practice: namespaced paths always contain ::; plain CSV paths never do; collision is structurally impossible. Documented invariant.
- Extraction respects the same VITE_DISABLE_WEBR=1 env flag as the rest of WebR. With WebR disabled, RData uploads register in the workspace store but don't extract; the user sees an "extraction unavailable; WebR is disabled" banner; CSV/DTA paths are unaffected.

Out of scope:
- .dta / .sav / .xlsx / .rds (same pattern; separate session).
- A TS-side .RData header parser (rejected; WebR is the right tool).