RData Upload Extraction — Design Spec

Date: 2026-05-01
Milestone: WebR follow-up (Session 5 of WebR follow-up sessions)
Supersedes: 2026-05-01-webr-rdata-load-design.md (runtime-extract design — first pass; retained in git for trace-back).
Predecessors:
- 2026-04-20 WebR Integration (Session 1 — typed framework + lm_robust)
- 2026-04-22 WebR Opaque Nodes (Session 2 — opaque path end-to-end)
- 2026-04-27 WebR VFS Bridge (Session 4 — file-byte sync, extractFilePath, per-dispatch CWD)

1. Context

load(file = "*.RData") is the single biggest cross-paper blocker per the WebR audit. Cited by paper #4 (investor-memory), paper #13 (cost-of-consumer-collateral — every Table 4–6 script), paper #17 (policy-targeting — ~245 chunks, top of the audit), and many other corpus papers shipping .RData intermediates.

The first design pass (the superseded sibling spec) treated RData as a runtime extension point: recognizer state, lazy synthetic extract emission, a new webr-rdata-load primitive dispatched at Run time. That design works but introduces a new node type, recognizer state machinery, and per-Run worker round-trips for what is structurally a static container format.

This spec reframes RData as a multi-dataset container — semantically what it is. An .RData file is a serialized workspace image holding multiple named R objects. Uploading one is uploading multiple datasets and globals. The right architecture extracts at upload time, registers the contents in the existing data registry under a namespaced path scheme (X.RData::dat), and lets the existing data-load primitive handle resolution. WebR is invoked once per upload to extract; it is not invoked at Run time for these data sources.

The user’s load(file = "X.RData") source line becomes a recognizer-side namespace activation hint — pure scope-resolution metadata, no runtime semantics. Everything downstream of that goes through the same data-load → linear-model flow that CSV uploads already use.

2. Approach summary

Upload-time extraction:
- .RData bytes hit WorkspaceStore (existing VFS-bridge sync).
- Worker-manager dispatches a one-time extraction request: load(envir = e); ls(e); classify each name.
- Each extracted name is classified and registered:
  - data.frame (including tibbles, which inherit from data.frame) → marshal to MarshaledDataset → unwrap to TS Dataset → register under namespaced path <file>::<binding> in the dataset registry.
  - Scalar (numeric, character, logical) → register as a workspace global.
  - Simple vector (numeric, character, logical) → register as a workspace global.
  - Named list of the above (recursive) → register as a structured global.
  - Anything else (S4 objects, fitted models, matrices, factors-not-in-frame, complex named lists with non-extractable members, functions) → leave as an opaque-binding handle in the worker’s globalenv; mark as “not surfaced to TS” and surface the binding name in the data panel’s “opaque” zone.

Recognizer:
- load(file = "X.RData") → activate the X.RData namespace by appending its file path to a per-recognizer-pass activatedNamespaces: string[]. No AnalysisCall is emitted for the load() statement itself.
- Unresolved identifier reference (feols(..., data = dat)) → walk activatedNamespaces in reverse (latest first), checking each file’s manifest for dat. First hit wins, resolving to namespaced path <file>::dat. Emit a standard data-load AnalysisCall with args.file = "<file>::dat"; its sourceSpan points at the load() statement that activated the winning namespace.
- Globals referenced by name (e.g., n_max from a global numeric) → resolved through the existing FileRegistry global-scope mechanism, extended to also pull from upload-extracted globals.

Run time:
- data-load node with a namespaced path: the executor looks up <file>::<binding> in the dataset registry and returns the cached Dataset. No worker round-trip for the data source itself.
- Opaque code that references RData-sourced names: the existing input-binding mechanism marshals the workspace’s Dataset into the worker as a globalenv binding — the same mechanism today’s opaque dispatches use for any TS-Dataset input.

Net result: RData becomes “just another upload format.” DAG shows data-load nodes with file paths; user mental model is identical to CSV uploads. WebR’s role is reduced to a one-shot extractor at upload time and the existing opaque-code execution layer; not a runtime data dependency.

3. Scope

In

Out (future sessions / deferred)

4. Architecture

4.1 Module changes

src/ui/store/workspace.ts                     [modified]
                                              Add rdataManifests: Map<filePath, RDataManifest>;
                                              add rdataDatasets: Map<namespacedPath, Dataset>;
                                              add globals: Map<name, GlobalValue>;
                                              wiring: addFiles(...) triggers extraction
                                              for .RData entries.

src/core/rdata/manifest.ts                    [new]
                                              RDataManifest type;
                                              GlobalValue discriminated union
                                              (scalar | vector | named-list, recursive).

src/core/webr/protocol.ts                     [modified]
                                              Add 'extract-rdata' request type +
                                              'extract-rdata-result' response.

src/workers/webr-worker.ts                    [modified]
                                              handleExtractRData: load → enumerate
                                              → classify → marshal data.frames →
                                              return manifest + opaque-binding
                                              handles left in globalenv.

src/workers/worker-manager.ts                 [modified]
                                              extractRData(filePath): orchestrates
                                              the extract dispatch; populates
                                              workspace store; emits
                                              extraction-complete events for UI.

src/core/parsers/file-registry.ts             [modified]
                                              GlobalConstantInfo extended to
                                              discriminated value types;
                                              merge upload-extracted globals
                                              into per-file scope.

src/core/parsers/r/recognizer.ts              [modified]
                                              Detect load() with literal path;
                                              update activatedNamespaces;
                                              wire scope-miss hook for
                                              unresolved identifier lookups
                                              with reverse-order namespace walk.

src/core/parsers/r/file-scope-collector.ts    [modified]
                                              (or wherever source-side constants
                                              are collected today) — extend to
                                              recognize numeric literals and
                                              named-list literals.

src/core/parsers/r/expression-evaluator.ts    [modified]
                                              (or wherever scope-resolution lives)
                                              — handle lst$member and lst[["member"]]
                                              against named-list globals.

src/core/pipeline/data-registry.ts            [modified — if exists, else inline
                                              in executor]
                                              Resolve namespaced paths
                                              (<file>::<binding>) by lookup
                                              into rdataDatasets.

src/core/pipeline/executor.ts                 [modified]
                                              data-load executor: when path
                                              contains '::', resolve via
                                              rdataDatasets registry.

src/ui/components/panels/data-panel.tsx       [modified]
                                              Tree-grouped view by file;
                                              datasets/globals/opaque sections;
                                              per-binding preview.

src/ui/components/banners/extraction-status.tsx [new]
                                              Per-file banner showing
                                              extraction progress and any
                                              failed bindings.

reference-papers/INTERLYSE-RUN-STATUS.md      [modified]
                                              §13 updated: drop CSV workaround.

BACKLOG.md                                    [modified]
                                              Mark load() bullet DONE;
                                              mark numeric constants bullet DONE.

CLAUDE.md                                     [modified]
                                              Add No-Go invariants.

4.2 RDataManifest and GlobalValue

// src/core/rdata/manifest.ts

/** Classification of every name found in an RData file at extraction. */
export interface RDataManifest {
  filePath: string;                       // workspace-relative
  datasets: string[];                     // names that materialized as Datasets
  globals: Record<string, GlobalValue>;   // names that materialized as TS values
  opaqueBindings: string[];               // names left as opaque handles in worker globalenv
  failures: { name: string; reason: string }[];   // names that couldn't be classified or marshaled
  extractedAt: number;                    // timestamp; for debugging
}

/** Discriminated union covering source-text constants and RData-extracted globals. */
export type GlobalValue =
  | { kind: 'scalar-string'; value: string }
  | { kind: 'scalar-number'; value: number }
  | { kind: 'scalar-bool'; value: boolean }
  | { kind: 'vector-string'; values: string[] }
  | { kind: 'vector-number'; values: number[] }
  | { kind: 'vector-bool'; values: boolean[] }
  | { kind: 'named-list'; members: Record<string, GlobalValue> };   // recursive

The same GlobalValue type is used for both source-text constants (collected from min.year <- 1996, path <- "data.csv", vars <- c("a","b"), lst <- list(a = 1, b = "x")) and upload-extracted globals. This unifies the recognizer’s scope-resolution path; consumers don’t care about origin.

The existing FileRegistry.globalConstants (string scalars) and globalVectors (string vectors) get folded into a single globalValues: Map<string, { value: GlobalValue; sourceFile: string }> map. The sourceFile field is the R script for source-derived globals or the .RData file path for upload-extracted ones. Backward-compatible migration: existing call sites accessing globalConstants / globalVectors switch to globalValues filtered by kind.
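A minimal sketch of the unified map and a back-compat accessor, assuming a trimmed GlobalValue union and hypothetical names (getStringConstant stands in for whatever the migrated call sites actually use):

```typescript
// Trimmed GlobalValue union for illustration (the full union is in §4.2).
type GlobalValue =
  | { kind: 'scalar-string'; value: string }
  | { kind: 'scalar-number'; value: number }
  | { kind: 'vector-string'; values: string[] };

const globalValues = new Map<string, { value: GlobalValue; sourceFile: string }>([
  ['path',     { value: { kind: 'scalar-string', value: 'data.csv' }, sourceFile: 'main.R' }],
  ['min_year', { value: { kind: 'scalar-number', value: 1996 },       sourceFile: 'main.R' }],
]);

// Old call sites that read string-scalar globalConstants now filter by kind,
// so a numeric global with the same name is invisible to them.
function getStringConstant(name: string): string | undefined {
  const entry = globalValues.get(name);
  return entry !== undefined && entry.value.kind === 'scalar-string'
    ? entry.value.value
    : undefined;
}
```

Filtering by kind at the accessor keeps the migration mechanical: each legacy call site swaps one map for one filtered lookup.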

4.3 Worker-side extraction

Protocol additions:

// src/core/webr/protocol.ts
| { type: 'extract-rdata';
    id: string;
    path: string;     // workspace-relative; worker prefixes /workspace/
  }

| { type: 'extract-rdata-result';
    id: string;
    manifest: RDataManifest;
    error?: string;   // file-level errors (load failed entirely, etc.)
  }

Worker handler:

async function handleExtractRData(
  req: Extract<WebRRequest, { type: 'extract-rdata' }>,
): Promise<void> {
  if (!webR) {
    post({ type: 'extract-rdata-result', id: req.id,
           manifest: emptyManifest(req.path),
           error: 'worker not initialized' });
    return;
  }

  const absPath = `/workspace/${req.path}`;
  const manifest: RDataManifest = {
    filePath: req.path,
    datasets: [],
    globals: {},
    opaqueBindings: [],
    failures: [],
    extractedAt: Date.now(),
  };

  try {
    // Load into a fresh env to enumerate names without polluting globalenv yet.
    await webR.evalRVoid(`.interlyse_extract_env <- new.env()`);
    await webR.evalRVoid(`load(file = ${JSON.stringify(absPath)}, envir = .interlyse_extract_env)`);
    const namesR = await webR.evalR(`ls(.interlyse_extract_env)`);
    const names = await (namesR as unknown as { toArray: () => Promise<string[]> }).toArray();

    for (const name of names) {
      try {
        const classification = await classifyAndMarshal(name);
        switch (classification.kind) {
          case 'dataset':
            manifest.datasets.push(name);
            break;
          case 'global':
            manifest.globals[name] = classification.value;
            break;
          case 'opaque':
            // Promote to globalenv with a deterministic prefixed name to avoid
            // collision across multiple uploaded RData files. The prefix uses
            // a short hash of the file path; lifecycle (rm on file removal)
            // tracks these via the manifest's opaqueBindings list.
            await webR.evalRVoid(
              `assign(${JSON.stringify(prefixedName(req.path, name))}, ` +
              `       .interlyse_extract_env[[${JSON.stringify(name)}]], ` +
              `       envir = globalenv())`,
            );
            manifest.opaqueBindings.push(name);
            break;
        }
      } catch (e) {
        manifest.failures.push({
          name,
          reason: e instanceof Error ? e.message : String(e),
        });
      }
    }
  } catch (e) {
    post({ type: 'extract-rdata-result', id: req.id, manifest,
           error: e instanceof Error ? e.message : String(e) });
    return;
  } finally {
    await webR.evalRVoid(`rm(.interlyse_extract_env, envir = globalenv())`);
  }

  // Marshaled datasets are returned via a separate dataset-batch message —
  // see §4.4. The manifest message itself only carries names + globals.
  post({ type: 'extract-rdata-result', id: req.id, manifest });
}

classifyAndMarshal(name) is a worker-internal helper that branches on R-side type checks:

# Pseudocode (actual worker code is TS calling R via evalR)
val <- .interlyse_extract_env[[name]]

if (is.data.frame(val)) {
  # Marshal as MarshaledDataset (existing marshalDatasetFromR pattern)
  return classification: 'dataset' (post bytes via separate message)
}

if (is.numeric(val) && length(val) == 1) {
  return classification: 'global' { kind: 'scalar-number', value: as.numeric(val) }
}
if (is.character(val) && length(val) == 1) { ... 'scalar-string' ... }
if (is.logical(val) && length(val) == 1) { ... 'scalar-bool' ... }

if (is.numeric(val) && length(val) > 1) { ... 'vector-number' ... }
if (is.character(val) && length(val) > 1) { ... 'vector-string' ... }
if (is.logical(val) && length(val) > 1) { ... 'vector-bool' ... }

if (is.list(val) && !is.null(names(val)) && all(names(val) != "")) {
  # Recursive classification of each member; if all members classify cleanly,
  # promote to a 'named-list' global. If any member fails, fall back to opaque.
  members <- ... recursive call on each member ...
  if (members.allOk) return classification: 'global' { kind: 'named-list', members }
  else return classification: 'opaque'
}

# Everything else: leave in globalenv as opaque binding.
return classification: 'opaque'

Memory note: MarshaledDataset for each data.frame is built once at extraction; bytes flow back to TS once and the worker’s copy in .interlyse_extract_env is garbage-collected after the rm() in the finally block. Opaque bindings stay alive in globalenv (with prefixed names) until file removal or worker restart.
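The prefixed-name scheme above can be sketched as follows; the spec only requires a deterministic short hash of the file path, so the concrete hash (FNV-1a here) is an assumption:

```typescript
// FNV-1a 32-bit hash; deterministic, cheap, good enough for per-file prefixes.
function shortHash(s: string): string {
  let h = 0x811c9dc5 >>> 0;
  for (let i = 0; i < s.length; i++) {
    h = (h ^ s.charCodeAt(i)) >>> 0;
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, '0');
}

// Prefixed globalenv binding name for an opaque RData object.
function prefixedName(filePath: string, binding: string): string {
  return `__rdata_${shortHash(filePath)}_${binding}`;
}
```

Because the name is a pure function of file path and binding, the manifest's opaqueBindings list plus the file path is enough to reconstruct the handles to rm() at removal time.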

4.4 Worker-manager orchestration

// src/workers/worker-manager.ts

async function extractRData(filePath: string): Promise<void> {
  await ensureWebRWorker();   // boots if not yet running

  const id = nextDispatchId();
  const datasets = new Map<string, MarshaledDataset>();
  // The dispatch may yield multiple messages: zero or more 'rdata-dataset'
  // (one per data.frame), then one 'extract-rdata-result' (the manifest).
  // Existing dispatcher correlates by id.

  const result = await dispatchExtractRData(id, filePath, datasets);

  // Populate workspace store
  const workspace = useWorkspaceStore.getState();
  workspace.registerRDataExtraction({
    filePath,
    manifest: result.manifest,
    datasets,   // Dataset values keyed by binding name
  });

  if (result.error) {
    // Fatal extraction error (file-level): record + surface banner
    workspace.recordExtractionError(filePath, result.error);
  }
}

Trigger: WorkspaceStore.addFiles(...) checks file extension; for .RData / .rdata, schedules extractRData(path) in the background (does NOT block the upload completion). The UI shows an “extracting…” indicator on the file in the data panel until extraction completes.
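The extension check in that trigger is a one-liner; the predicate name is an assumption, and the case-insensitive match covers both .RData and .rdata per the paragraph above:

```typescript
// True for any path ending in .RData / .rdata (case-insensitive).
const isRDataFile = (path: string): boolean => /\.rdata$/i.test(path);
```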

Concurrency: extractions for different files run sequentially (worker FIFO queue per CLAUDE.md atomic-end-to-end invariant). A second .RData upload starts after the first completes.

4.5 Workspace store

// src/ui/store/workspace.ts — additions

interface WorkspaceState {
  // ... existing files: Map<string, Uint8Array> ...

  /** Per-file extraction manifest. Populated post-extraction. */
  rdataManifests: Map<string, RDataManifest>;

  /** Namespaced dataset registry: key = "<file>::<binding>". */
  rdataDatasets: Map<string, Dataset>;

  /**
   * Workspace globals — both source-text and upload-extracted.
   * Keyed by binding name; values are GlobalValue-typed.
   * Conflicts (same name from multiple sources) are resolved at scope-merge
   * time per-recognizer-pass (latest source-order wins, matching existing
   * FileRegistry behavior).
   */
  globals: Map<string, { value: GlobalValue; sourceFile: string }>;

  /** Extraction lifecycle ops. */
  registerRDataExtraction: (entry: ExtractionResult) => void;
  removeRDataExtraction: (filePath: string) => Promise<void>;   // also rm() opaque handles
  recordExtractionError: (filePath: string, error: string) => void;
}

removeRDataExtraction(path) clears the manifest, deletes namespaced registry entries (<path>::<binding>), removes the file’s globals from the workspace globals map, and posts an rm() request to the worker for any opaque handles registered under that file.
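The four removal steps can be sketched against plain Maps; the store shapes and the worker callback are assumptions standing in for the real Zustand store and dispatcher, and the `__rdata_` prefix here is a placeholder for the real prefixedName:

```typescript
interface ManifestLike { opaqueBindings: string[] }

function removeRDataExtraction(
  filePath: string,
  manifests: Map<string, ManifestLike>,
  rdataDatasets: Map<string, unknown>,
  globals: Map<string, { sourceFile: string }>,
  postRmBindings: (prefixedNames: string[]) => void,
): void {
  const manifest = manifests.get(filePath);
  manifests.delete(filePath);

  // Delete every "<filePath>::<binding>" dataset entry.
  for (const key of [...rdataDatasets.keys()]) {
    if (key.startsWith(`${filePath}::`)) rdataDatasets.delete(key);
  }

  // Drop globals sourced from this file.
  for (const [name, entry] of [...globals.entries()]) {
    if (entry.sourceFile === filePath) globals.delete(name);
  }

  // Ask the worker to rm() the prefixed opaque handles (placeholder prefix).
  if (manifest !== undefined && manifest.opaqueBindings.length > 0) {
    postRmBindings(manifest.opaqueBindings.map((n) => `__rdata_${filePath}_${n}`));
  }
}
```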

4.6 Recognizer namespace tracking

Recognizer state extension:

interface RecognizerState {
  // ... existing scope, diagnostics, loadedPackages ...

  /**
   * Source-order list of RData file paths whose namespaces have been activated
   * by load() calls in the current pass. Each entry's manifest is consulted
   * for unresolved-identifier resolution; reverse-order walk (latest first).
   * Programmatic load paths (paste0, vars) do not activate.
   */
  activatedNamespaces: string[];

  /**
   * Per-activated-file span of the load() statement that activated it. Used
   * to set sourceSpan on emitted data-load nodes so click-to-source lands on
   * the load() line. If the same file is load()'d multiple times, the most
   * recent activation's span overwrites — matches reverse-order resolution.
   */
  namespaceLoadSpans: Map<string, Span>;
}

Walking the AST: the existing top-level statement loop adds a branch for load():

for (const stmt of program.body) {
  // ... existing: ignorable calls, loops, function defs ...

  if (isLoadCall(stmt)) {
    const path = extractFilePath(stmt /* or stmt.value if assigned */);
    if (path && externalScope.rdataManifests.has(path)) {
      state.activatedNamespaces.push(path);
      continue;   // emit no AnalysisCall for the load() itself
    }
    // Fall through: programmatic path or unknown file — treat as opaque
    // side-effect (existing behavior).
  }

  const result = tryRecognizeStatement(stmt, state);
  // ... existing emit ...
}
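The literal-path check behind that fall-through can be sketched as below; the AST arg shape is an assumption, and programmatic paths (paste0, variables) yield null, matching §5.3:

```typescript
type RArgValue = { kind: 'string'; text: string } | { kind: 'other' };
interface RArg { name?: string; value: RArgValue }

// load() takes the path as `file =` or as the first positional argument;
// anything that isn't a string literal is treated as programmatic.
function extractLoadPath(args: RArg[]): string | null {
  const arg = args.find((a) => a.name === 'file') ?? args[0];
  if (arg === undefined || arg.value.kind !== 'string') return null;
  return arg.value.text;
}
```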

Scope-miss hook for unresolved identifiers: when the recognizer’s RHS-identifier or data= arg resolution misses scope (consults state.scope and externalScope.globals and finds nothing), it falls through to:

function resolveAgainstActivatedNamespaces(
  name: string,
  state: RecognizerState,
  externalScope: ExternalScope,
): { kind: 'dataset'; namespacedPath: string; loadSpan: Span }
  | { kind: 'global'; value: GlobalValue }
  | undefined
{
  // Walk activated namespaces in reverse — latest first.
  for (let i = state.activatedNamespaces.length - 1; i >= 0; i--) {
    const file = state.activatedNamespaces[i];
    const manifest = externalScope.rdataManifests.get(file);
    if (!manifest) continue;

    if (manifest.datasets.includes(name)) {
      // Dataset hit — caller emits a data-load AnalysisCall with namespaced path.
      const loadSpan = state.namespaceLoadSpans.get(file);   // tracked per activation
      return { kind: 'dataset', namespacedPath: `${file}::${name}`, loadSpan };
    }
    if (name in manifest.globals) {
      return { kind: 'global', value: manifest.globals[name] };
    }
  }
  return undefined;
}

Caller integration:
- The data= arg recognizer (recognize-data-arg.ts) consults this hook on miss; on a 'dataset' hit, it emits a data-load AnalysisCall with args.file = namespacedPath, sourceSpan = loadSpan, and wires the resulting node as the data upstream.
- The RHS-identifier path (dats <- dat) does the same: emit a data-load with the namespaced path; alias the lhs in scope.
- Global hits resolve as the existing global-scope mechanism does.

Idempotence: emitted data-load AnalysisCalls have deterministic shape (path is span-independent — namespacedPath is purely a function of file + binding, both stable). Re-recognition of identical source produces identical calls.

namespaceLoadSpans tracking: as activatedNamespaces is appended on each load() encounter, we also record the load-statement span per file so the resulting data-load node’s sourceSpan can point at the load() line. If the same file is load()’d twice (defensive idiom — re-loading), latest wins (matches R semantics; reverse-order resolution naturally picks the latest activation’s span).

4.7 Path syntax and dataset registry

Datasets registered at <file>::<binding> keys. The existing data-load executor — which today accepts a file string and looks it up in the dataset registry — gets one extra resolution rule:

function resolveDataset(path: string, registry: WorkspaceState): Dataset | undefined {
  // Existing: direct file-path lookup against the parsed-dataset registry
  // (CSVs / DTAs / etc., parsed at upload time). Field name follows whatever
  // the current implementation uses (the spec deliberately doesn't pin the
  // exact identifier — implementer reads the executor and parallels it).
  const direct = registry.getParsedDataset(path);
  if (direct) return direct;

  // NEW: namespaced lookup for RData-extracted datasets — distinguished by
  // the '::' separator in the path. Lookup always uses the full namespaced
  // string as the key.
  return registry.rdataDatasets.get(path);
}

data-load params unchanged: { file: string }. The path syntax is the entire encoding. UI / mapper / recognizer don’t introspect on the :: — they treat it as an opaque path string. Only the executor’s resolution logic distinguishes.
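A toy version of the two-step resolution rule, with plain Maps standing in for the real registries (names are assumptions):

```typescript
// Existing parsed-dataset registry (CSV/DTA, keyed by plain file path) and
// the new RData registry (keyed by the full "<file>::<binding>" string).
const parsedDatasets = new Map<string, string>([['loans.csv', 'csv-dataset']]);
const rdataDatasets = new Map<string, string>([['A.RData::dat', 'rdata-dataset']]);

// Direct file-path lookup first; namespaced lookup only on miss. The '::'
// separator is never parsed — the namespaced path is used whole as the key.
function resolveDataset(path: string): string | undefined {
  return parsedDatasets.get(path) ?? rdataDatasets.get(path);
}
```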

4.8 Source-code named-list and numeric-scalar collection

The existing source-text constant collector (today: string scalars and string vectors) is extended to also collect numeric scalars (min.year <- 1996) and literal named-list assignments (lst <- list(a = 1, b = "x")), all emitted as GlobalValue entries.

The expression evaluator (used by data-mutate / data-filter for column expressions and by the recognizer’s identifier resolution for inline substitution) gets a matching extension: lst$member and lst[["member"]] accesses resolve against named-list globals.

Bounded scope: this is not a full R list interpreter. We handle:
- Literal list(...) assignments at the top level (source-text or upload-extracted).
- Member access via $ and [["..."]].

No support for: list mutation (lst$a <- 5), list concatenation, programmatic key access, runtime-computed members.
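The member-access resolution can be sketched as a walk over the GlobalValue tree; both lst$member and lst[["member"]] normalize to the same key path before lookup (the function name is an assumption):

```typescript
// Trimmed GlobalValue union for illustration (full union in §4.2).
type GlobalValue =
  | { kind: 'scalar-number'; value: number }
  | { kind: 'scalar-string'; value: string }
  | { kind: 'named-list'; members: Record<string, GlobalValue> };

// Walk a key path (e.g. config$filters$min_age → ['filters', 'min_age']).
// Any non-list intermediate or missing key resolves to undefined.
function resolveMember(root: GlobalValue, path: string[]): GlobalValue | undefined {
  let cur: GlobalValue | undefined = root;
  for (const key of path) {
    if (cur === undefined || cur.kind !== 'named-list') return undefined;
    cur = cur.members[key];
  }
  return cur;
}
```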

4.9 UI — data panel

Rendering rule depends on whether the file is a multi-binding container:

▼ data_with_charge_offs.RData (extracted)
    Datasets
      ▸ dat              [preview] [bind to data-load]
      ▸ pdat             [preview] [bind to data-load]
    Globals
      ▸ n_max  = 42      (number)
    Opaque
      ▸ model_results    (R list — available to opaque code)

▸ collateral-bunching-loans.csv     [preview] [bind to data-load]

▶ extracting…  large_file.RData     (in progress, 3/8 names)

A per-file extraction-status banner appears on the right edge during extraction, transitioning to a small ✓ on completion or to a yellow ⚠ if there were per-binding failures (with a “see details” link expanding the failures list).

Auto-binding on completion: when extraction completes, any pre-existing data-load nodes whose params.file matches a freshly-registered namespaced path get their binding-status updated from “missing” to “bound.” Existing path-match auto-binding mechanism, no new logic.

Determining “multi-binding”: an RData file is multi-binding if its manifest has any of: (a) ≥2 datasets, (b) ≥1 dataset + ≥1 global, (c) ≥1 dataset + ≥1 opaque, or (d) zero datasets but ≥1 global or opaque (rare — RData with only scalars/lists). A file with exactly one dataset and zero globals/opaque renders flat; the dataset’s display name is the file’s basename (without .RData extension), with the namespaced-path lookup happening transparently behind it.
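The rule (a)–(d) transcribes directly; note that (b), (c), and (d) collapse to "any global or opaque present", which the sketch below exploits (the manifest subset shape is an assumption):

```typescript
interface ManifestShape {
  datasets: string[];
  globals: Record<string, unknown>;
  opaqueBindings: string[];
}

// (a) >= 2 datasets, OR (b)/(c)/(d) at least one global or opaque binding
// regardless of dataset count. Only "exactly one dataset, nothing else"
// renders flat.
function isMultiBinding(m: ManifestShape): boolean {
  const extras = Object.keys(m.globals).length + m.opaqueBindings.length;
  return m.datasets.length >= 2 || extras >= 1;
}
```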

4.10 Lifecycle

File removal (user removes upload from data panel):
1. WorkspaceStore.removeRDataExtraction(filePath):
   - Delete manifest from rdataManifests.
   - Delete <filePath>::* entries from rdataDatasets.
   - Delete globals entries with sourceFile === filePath.
   - Post rm-bindings worker request listing prefixedName(filePath, name) for each opaque binding.
2. Pipeline rebuild fires (existing onPipelineChange); the recognizer no longer finds filePath in externalScope.rdataManifests; data-load nodes pointing at <filePath>::* now resolve to a missing-Dataset error.

Worker restart (Reset WebR Session future feature): all globalenv state lost. Workspace store still has manifests; Datasets are TS-side and unaffected. Opaque-binding handles are gone; opaque code referencing them errors at next dispatch. Re-extraction-on-boot is deferred to the Reset Session feature spec.

Re-uploading a file with the same path: existing wipe-confirmation flow runs; on Replace, prior extraction is cleaned up via the file-removal path before new extraction kicks off.

5. Behavior

5.1 Single-load happy path (paper #13)

Source:

load(file = "./data/data_with_charge_offs.RData")
dats <- dat
felm(yfill ~ collateral + log_amt | timeInt + Disaster_Id | 0 | Disaster_Id, data = dats)

Upload: data_with_charge_offs.RData arrives; worker extracts; manifest = { datasets: ['dat', 'pdat'], globals: {}, opaqueBindings: [], ... }; namespaced datasets registered.

Recognizer pass:
1. load(...) → path matches an entry in externalScope.rdataManifests → activatedNamespaces.push("data/data_with_charge_offs.RData"). No AnalysisCall.
2. dats <- dat: RHS identifier dat misses scope → namespace walk: latest = data_with_charge_offs.RData; manifest contains dat. Hit — emit data-load AnalysisCall with args.file = "data/data_with_charge_offs.RData::dat", sourceSpan = <load_span>, assignedTo = <synthetic-or-just-dat>. Add to scope.
3. felm(..., data = dats): dats resolves to the alias from step 2 → consumes the data-load node’s output.

Pipeline:

[data-load: data_with_charge_offs.RData::dat] ──► [linear-model: felm(...)]

Execution: data-load executor sees :: in path → resolves via rdataDatasets → returns cached Dataset. linear-model runs natively. No worker round-trip for the data source.

5.2 Multi-load with cascading resolution

load("A.RData")     # manifest: datasets=['dat', 'x'], ...
load("B.RData")     # manifest: datasets=['dat', 'y'], ...
feols(y ~ x, data = dat)   # references dat — which one?
feols(y ~ x, data = x)     # references x — which one?
feols(y ~ x, data = y)     # references y — which one?

activatedNamespaces after both loads: ["A.RData", "B.RData"].

Resolution (reverse order):
- dat → check B (has dat) → resolve to B.RData::dat. ✓ Latest-wins.
- x → check B (no x) → fall back to A (has x) → resolve to A.RData::x. ✓ Cascade hit.
- y → check B (has y) → resolve to B.RData::y. ✓

The DAG has three data-load nodes, each with its own namespaced path. The user gets correct attribution without writing any extra code.

5.3 Programmatic load path

data_dir <- "./data"
load(file = paste0(data_dir, "/x.RData"))
feols(y ~ x, data = dat)

extractFilePath returns null (paste0 not literal). load() falls through to existing webr-opaque side-effect path (no namespace activated). dat reference doesn’t resolve → falls through to webr-opaque too. Same behavior as today; documented limitation parallel to the VFS bridge’s programmatic-path limitation.

5.4 Side-effect-only load()

load("A.RData")
print(ls())

Namespace activated; no name reference in subsequent code triggers any data-load emission. print(ls()) is a webr-opaque side-effect; runs in worker globalenv.

What does ls() see? The opaque-binding handles for non-extracted complex objects (registered with prefixed names like __rdata_<hash>_funcfoo), if any. NOT the extracted data.frames/scalars (those live in TS). Limitation: this idiom is partial — opaque code’s ls() shows the prefixed handles, not the user-visible binding names.

If a future paper needs full globalenv-introspection of RData names: option is to surface dataset/global bindings into globalenv too, with their original (un-prefixed) names. Adds memory/binding-lifecycle complexity; defer until needed.

5.5 Source-code named lists

config <- list(
  threshold = 0.05,
  vars = c("x1", "x2", "x3"),
  filters = list(min_age = 18, max_age = 65)
)
df %>% filter(age >= config$filters$min_age, age <= config$filters$max_age)

Source collector classifies config as { kind: 'named-list', members: { threshold: { kind: 'scalar-number', value: 0.05 }, vars: { kind: 'vector-string', values: [...] }, filters: { kind: 'named-list', members: { min_age: ..., max_age: ... } } } }.

data-filter recognizer’s expression evaluator resolves config$filters$min_age → 18 and config$filters$max_age → 65, substituting the literals into the filter condition. Existing data-filter machinery does the rest.

5.6 File removal mid-session

load("A.RData")
felm(y ~ x, data = dat)   # currently bound to A.RData::dat

User removes A.RData from the data panel. Workspace lifecycle: manifest cleared, namespaced dataset entries removed, opaque handles rm()d in worker. Pipeline rebuild fires: recognizer’s externalScope.rdataManifests no longer contains A.RData; the load() falls through (no activation); data = dat doesn’t resolve → falls through to webr-opaque. The data-load node referencing A.RData::dat is no longer emitted; if it had been pinned in the editor’s pipeline state, it shows a missing-binding error.

User can either re-upload, or remove the load() from source. The behavior is consistent with how the existing CSV path handles file removal.

6. Testing

6.1 Unit (Vitest)

src/core/rdata/manifest.test.ts:
- RDataManifest round-trip serialization.
- GlobalValue discriminated union exhaustiveness.
- prefixedName(path, name) produces deterministic, collision-free names.

src/workers/webr-worker.extract-rdata.test.ts (mock WebR):
- Classification of fixtures: pure data.frame manifest, mixed (data.frame + scalar + vector + named-list + opaque), all-failures.
- finally-block cleanup of the temp env.

src/core/parsers/file-registry.test.ts — extend:
- Numeric scalar collection from x <- 42, x <- -3.14.
- Named-list collection (single level + nested).
- Source-text and upload-extracted globals merge correctly.

src/core/parsers/r/recognizer.namespace.test.ts — new:
- Single load + reference: emits data-load with namespaced path; sourceSpan = load span.
- Multi-load with cascading: name-only-in-A referenced after both loads → resolves to A.
- Programmatic load path: no namespace activated; falls through to opaque.
- Same file load()’d twice: latest activation’s span used for sourceSpan.
- Idempotence: re-recognition produces identical AnalysisCalls.

src/core/parsers/r/expression-evaluator.named-list.test.ts — new:
- lst$a resolves to scalar.
- lst[["a"]] resolves identically.
- Nested access (lst$d$inner).
- Missing member errors clearly.
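Both R access syntaxes normalize to the same member path, so the evaluator side is a walk over a nested-record value. A sketch with hypothetical names (GlobalNode, resolveMember):

```typescript
// Registered named-list globals as nested records of extractable leaves.
type GlobalNode = number | string | boolean | { [member: string]: GlobalNode };

// Walk a normalized member path (lst$d$inner and lst[["d"]][["inner"]] both
// become ["d", "inner"]); a missing member fails with a clear error.
function resolveMember(root: GlobalNode, path: string[]): GlobalNode {
  let cur: GlobalNode = root;
  for (const seg of path) {
    if (typeof cur !== "object" || cur === null || !(seg in cur)) {
      throw new Error(`named-list member '${seg}' not found`);
    }
    cur = cur[seg];
  }
  return cur;
}
```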

src/core/pipeline/data-registry.test.ts (or executor.test.ts):
- data-load with namespaced path resolves via rdataDatasets.
- data-load with plain CSV path still resolves via existing parsedDatasets.
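The resolution order under test is small enough to sketch directly; the function and map names here are assumptions for illustration:

```typescript
// Namespaced paths ("file.RData::binding") come from upload-time extraction;
// everything else is a plain upload handled by the existing CSV registry.
function resolveDataset<T>(
  path: string,
  rdataDatasets: Map<string, T>,
  parsedDatasets: Map<string, T>,
): T | undefined {
  return path.includes("::") ? rdataDatasets.get(path) : parsedDatasets.get(path);
}
```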

6.2 Integration (Vitest, gated WebR for the worker round-trip)

src/workers/worker-manager.extract-rdata.integration.test.ts:
- Real WebR worker. Sync a small .RData fixture into /workspace/. Trigger extraction. Assert manifest + Dataset bytes match expected values.
- .RData containing all object types (data.frame, scalar, vector, named list, S4 placeholder) → manifest classifies each correctly; data.frames/scalars/vectors/named-lists round-trip.
- File removal post-extraction → opaque handles cleared from worker globalenv (assert via subsequent exists() probe).

src/core/pipeline/integration.test.ts — extend:
- Paper #13 collateral structural assertion using the namespaced-path flow: paste headline; recognize + map; assert the pipeline has 1 data-load (namespaced) + 1 alias (or absorbed) + 1 linear-model; correct edges.

6.3 Validation (paper-match, gated)

src/core/pipeline/replicate-collateral.test.ts — extend the existing file (today: synthetic-CSV-only, paper #13’s Table 4 modNaive verified against fixest::feols). Add the two-test pattern (canonical: replicate-soil.test.ts) that exercises the upload-extraction path:

Test 1 (in-tree, always runs): synthetic examples/collateral-bunching-loans.RData generated by an examples/collateral-bunching-loans.R script (reads the existing CSV stub, save()s as RData with set.seed(2026) for byte stability). Verifies the upload-extraction → recognizer → executor chain end-to-end against the synthetic data.

Test 2 (gated RUN_PAPER_MATCH=1 + data presence): paper’s actual data_with_charge_offs.RData from the deposit. Skipped silently in plain npm test and on fresh clones.

Both verify (a) extraction succeeds, (b) the resulting Dataset matches read.csv of the equivalent CSV (test 1) or load()+as.data.frame via R sidecar (test 2), (c) downstream felm/lm coefficients match within tolerance.
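The tolerance check in (c) can be sketched as a keyed relative comparison; coefsMatch and its relative-tolerance form are illustrative choices, not the test file’s actual helper:

```typescript
// Compare estimated coefficients to reference values within a relative
// tolerance (scaled by max(1, |reference|) so near-zero terms use an
// absolute tolerance instead).
function coefsMatch(
  got: Record<string, number>,
  want: Record<string, number>,
  relTol = 1e-6,
): boolean {
  const keys = Object.keys(want);
  if (Object.keys(got).length !== keys.length) return false;
  return keys.every(
    (k) =>
      k in got &&
      Math.abs(got[k] - want[k]) <= relTol * Math.max(1, Math.abs(want[k])),
  );
}
```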

6.4 E2E (Playwright)

e2e/rdata-upload-extraction.spec.ts:
- Drag-drop ZIP containing one .RData and one .R script with the load() → felm(...) chunk.
- Assert “extracting…” indicator appears, then resolves with extraction-complete.
- Assert data panel shows the file with its Datasets section populated.
- Assert pipeline DAG has data-load (namespaced) → linear-model.
- Click Run. Assert regression result panel renders coefficients matching expected values.
- Remove the file from the data panel. Assert data-load shows missing-binding error.

6.5 Audit regression

Re-run audit-opaque.ts after implementation. Expected:
- load occurrences in unsupported/webr-opaque drop to ~zero (every paper now resolves through namespace activation).
- Models-blocked count drops by the count attributable to RData-only data inputs (paper #13’s 4–6 models, paper #4’s models, partial unblock for paper #17 — full unblock requires the MILP primitives that are out of scope).
- Numeric-scalar-not-substituted occurrences (the BACKLOG numeric-constants bullet) drop to ~zero — a collateral side effect.

7. CLAUDE.md No-Go list additions

8. Shipping criteria

  1. All unit and integration tests pass.
  2. Paper-match gated test (RUN_PAPER_MATCH=1) for paper #13 passes when the deposit’s RData is present.
  3. E2E test passes in Chromium.
  4. npm run build && npm test && npm run lint && npm run test:e2e green.
  5. npm run test:paper-match green when data is present; skip-cleanly otherwise.
  6. Manual smoke: upload paper #13’s deposit ZIP, paste the headline load → dats <- dat → felm(...) chunk, click Run, verify regression results match documented coefficients (collateral 0.062789, log_amt 0.028844 on the synthetic data).
  7. INTERLYSE-RUN-STATUS §13 updated: drop the “Convert RData to CSV” workaround note; reclassify the load(.RData) blocker as resolved; add an “RData upload-extraction” row to the paper’s “What was needed to make it run” table.
  8. BACKLOG load(file = "*.RData") bullet marked DONE with cross-reference; “Numeric file-scope constants” bullet marked DONE.
  9. CLAUDE.md No-Go entries added.
  10. Superseded sibling spec retains its SUPERSEDED banner (linked from this spec’s header) and remains in git for trace-back.

9. Migration notes

10. Risk & rollback

11. Out-of-scope reminders