WebR VFS Bridge — Design Spec

Date: 2026-04-27
Milestone: WebR follow-up (Session 4 of WebR follow-up sessions)
Predecessors:
- 2026-04-20 WebR Integration (Session 1: typed framework + lm_robust)
- 2026-04-22 WebR Opaque Nodes (Session 2: opaque path end-to-end)

1. Context

The opaque-nodes session unlocked paste-as-written for ~70 unique R functions by emitting webr-opaque nodes for unrecognized assignments. The single remaining wall is file I/O inside the WebR worker: read_xlsx("../Data/MasterData.xlsx"), readRDS("results/m1.rds"), haven::read_sav("survey.sav"), even read.csv("foo.csv") on uploaded files all fail with “file not found” because the WebR worker’s filesystem is empty — uploads only ever reach the TS-side dataset registry.

Today’s binary-input papers (INTERLYSE-RUN-STATUS papers #2/#3/#4) require a manual workaround: convert .xlsx/.rds/.sav/.dta to CSV outside the app, place the CSV in examples/, and point the R code at it. This is a hard barrier: any paper using a binary input format fails on first paste, and there’s no in-app path forward.

Today’s worker also has no story for files written by R: ggsave(), write.csv(), writeLines(), etc. produce files in the worker’s VFS that die silently when the worker terminates. Replication papers routinely write tables and figures to disk; users have no way to retrieve them.

This session bridges both directions. Uploaded files (binary and otherwise) get mirrored into the WebR worker’s VFS so opaque R code can read them; files written by the worker get surfaced in an artifacts panel so users can download them. The is.data.frame probe added in session 2 already auto-marshals worker results to TS Datasets when applicable, so the moment read_xlsx(...) succeeds inside R, the data flows downstream into the existing pipeline machinery.

A future “editor mode” UI (Scripts/Data side-panels for authoring, separate from replication tree-view) will sit on top of the same workspace store this session establishes; that UI work is split into a follow-up spec.

2. Scope

In

Out (future sessions)

3. Architecture

3.1 Module changes

src/ui/store/workspace.ts                    [new]
                                              Path-keyed Uint8Array store + sync queue;
                                              originalUploads set; lifecycle ops

src/ui/store/files.ts                         [modified]
                                              Build WorkspaceStore from extracted ZIP;
                                              feed binary files (today discarded)

src/ui/components/toolbar/upload-zone.tsx    [modified]
                                              Accept .xlsx/.xls/.rds/.rdata/.sav;
                                              wipe-workspace confirmation

src/core/zip/extractor.ts                     [modified]
                                              Extract binary file bytes (today excluded);
                                              drop per-file size cap

src/core/parsers/shared/analysis-call.ts     [modified]
                                              Add originFile?: string

src/core/parsers/file-registry.ts             [modified]
                                              Thread originFile into recognizer call

src/core/parsers/r/recognizer.ts              [modified]
                                              Pass originFile through to AnalysisCall;
                                              call extractFilePath during opaque walk;
                                              return referencedFiles in result

src/core/parsers/r/extract-file-path.ts       [new]
                                              Shared helper: known-reader registry
                                              + extractFilePath(call) → string | null;
                                              also reusable by future typed marshalers

src/core/pipeline/types.ts                    [modified]
                                              Add originFile?: string to webr-typed
                                              and webr-opaque params

src/core/pipeline/mapper.ts                   [modified]
                                              Carry originFile from AnalysisCall to node

src/core/webr/protocol.ts                     [modified]
                                              Add cwd?: string to WebRRequest
                                              dispatch-typed and dispatch-opaque

src/core/webr/dispatch.ts                     [modified]
                                              Plumb cwd through dispatcher API

src/workers/webr-worker.ts                   [modified]
                                              Handle FS-write requests;
                                              setwd(cwd) before each eval

src/workers/worker-manager.ts                [modified]
                                              Sync workspace bytes on worker init;
                                              incrementally sync on new uploads;
                                              compute cwd from originFile;
                                              post-Run artifact discovery

src/ui/store/artifacts.ts                     [new]
                                              Discovered artifacts; preview cache;
                                              download orchestration

src/ui/components/panels/artifacts-panel.tsx [new]
                                              Collapsible artifacts panel

3.2 WorkspaceStore

// src/ui/store/workspace.ts
interface WorkspaceState {
  files: Map<string, Uint8Array>;         // path → bytes (path-keyed flat map; '/' in keys forms tree)
  originalUploads: Set<string>;           // paths present after last upload (for artifact diff)
  syncedToWebR: Set<string>;              // subset of files that have been pushed to /workspace/
  totalSize: number;                       // running sum for the 1.5GB cap

  addFiles: (entries: Array<{ path: string; bytes: Uint8Array }>) => void;
  wipe: () => Promise<void>;               // also wipes WebR /workspace/
  removeFile: (path: string) => void;
  getPendingSync: () => Array<{ path: string; bytes: Uint8Array }>;
  markSynced: (paths: string[]) => void;
  markUnsynced: () => void;                // called when WebR worker is recreated
}

The store is the single source of truth for “what’s in the workspace.” Both the existing TS-side dataset registry (parsed CSVs/DTAs) and the new VFS sync read from it. CSV bytes live here even though parsed Datasets exist elsewhere; with the reference scan in §3.3, CSV bytes are only pushed to VFS if opaque R code references them by path — so the duplication is opt-in, not automatic.
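The bookkeeping behind getPendingSync, markSynced and markUnsynced can be sketched as a plain class. This is a minimal illustration only: the class name WorkspaceSketch is hypothetical, the real store may sit on a state library, and filtering the pending list down to referenced files (§3.3) is left to the caller.

```typescript
type Entry = { path: string; bytes: Uint8Array };

// Hypothetical in-memory stand-in for WorkspaceState (no worker messaging).
class WorkspaceSketch {
  files = new Map<string, Uint8Array>();
  syncedToWebR = new Set<string>();
  totalSize = 0;

  addFiles(entries: Entry[]): void {
    for (const { path, bytes } of entries) {
      const previous = this.files.get(path);
      if (previous) this.totalSize -= previous.byteLength; // overwrite: drop old size
      this.files.set(path, bytes);
      this.totalSize += bytes.byteLength;
      this.syncedToWebR.delete(path); // new bytes must be pushed again
    }
  }

  removeFile(path: string): void {
    const bytes = this.files.get(path);
    if (!bytes) return;
    this.totalSize -= bytes.byteLength;
    this.files.delete(path);
    this.syncedToWebR.delete(path);
  }

  // Files present in the store but not yet pushed to /workspace/.
  getPendingSync(): Entry[] {
    const pending: Entry[] = [];
    this.files.forEach((bytes, path) => {
      if (!this.syncedToWebR.has(path)) pending.push({ path, bytes });
    });
    return pending;
  }

  markSynced(paths: string[]): void {
    for (const p of paths) {
      if (this.files.has(p)) this.syncedToWebR.add(p);
    }
  }

  // Called when the WebR worker is recreated: its VFS starts empty.
  markUnsynced(): void {
    this.syncedToWebR.clear();
  }
}
```

Re-adding a path clears its synced flag, so replaced bytes are pushed again on the next sync.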

3.3 Reference scan

A shared helper extracts file path arguments from known file-reader function calls. The same helper is reusable by future typed marshalers (Scenario B in design discussion) — its placement in src/core/parsers/r/ rather than inside the opaque walker is deliberate.

// src/core/parsers/r/extract-file-path.ts
import type { FunctionCallNode } from './ast.ts';

// Each entry: function name → which argument holds the path.
// argName is the keyword form; argPos is the 0-based positional index.
const KNOWN_READERS: Record<string, { argName?: string; argPos: number }> = {
  // Always-data.frame readers
  'read.csv':           { argPos: 0, argName: 'file' },
  'read.delim':         { argPos: 0, argName: 'file' },
  'read.table':         { argPos: 0, argName: 'file' },
  'read.dta':           { argPos: 0, argName: 'file' },
  'read_csv':           { argPos: 0, argName: 'file' },
  'read_tsv':           { argPos: 0, argName: 'file' },
  'read_delim':         { argPos: 0, argName: 'file' },
  'fread':              { argPos: 0, argName: 'input' },
  'read_xlsx':          { argPos: 0, argName: 'path' },
  'read_excel':         { argPos: 0, argName: 'path' },
  'readxl::read_xlsx':  { argPos: 0, argName: 'path' },
  'readxl::read_excel': { argPos: 0, argName: 'path' },
  'haven::read_dta':    { argPos: 0, argName: 'file' },
  'haven::read_sav':    { argPos: 0, argName: 'file' },
  'haven::read_sas':    { argPos: 0, argName: 'data_file' },
  // Polymorphic readers (must stay opaque permanently)
  'readRDS':            { argPos: 0, argName: 'file' },
  'readr::read_rds':    { argPos: 0, argName: 'file' },
  'load':               { argPos: 0, argName: 'file' },
};

/** Returns the literal file path arg if `call` is a known reader, else null. */
export function extractFilePath(call: FunctionCallNode): string | null {
  const entry = KNOWN_READERS[call.name];
  if (!entry) return null;

  // Prefer named arg if present
  if (entry.argName) {
    const named = call.args.find(a => a.name === entry.argName);
    if (named?.value.type === 'literal' && typeof named.value.value === 'string') {
      return named.value.value;
    }
  }
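  // Fall back to positional: index among unnamed args only, so a leading
  // keyword arg (e.g. read.csv(header = TRUE, "foo.csv")) doesn't shift the slot.
  const positional = call.args.filter(a => !a.name)[entry.argPos];
  if (positional?.value.type === 'literal' && typeof positional.value.value === 'string') {
    return positional.value.value;
  }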
  return null;  // Programmatic path (paste0, variable, etc.) — not extractable.
}

The recognizer’s binding-walk calls extractFilePath for every FunctionCallNode it visits during opaque emission. Hits accumulate into a referencedFiles: Set<string> returned alongside the existing calls: AnalysisCall[] from recognizeR(). FileRegistry aggregates per-file sets into one pipeline-wide set.
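As a rough illustration of the accumulation step, the sketch below scans calls per file and unions the hits into one pipeline-wide set. The Arg and Call shapes, the three-entry READERS table, and the helper names literalPath and collectReferencedFiles are all hypothetical simplifications; the real walker uses FunctionCallNode and the full KNOWN_READERS registry.

```typescript
// Hypothetical, trimmed AST shapes; the real FunctionCallNode lives in ./ast.ts.
type Arg = { name?: string; value: { type: string; value?: unknown } };
type Call = { name: string; args: Arg[] };

// Trimmed stand-in for KNOWN_READERS: function name → keyword of the path arg.
const READERS: Record<string, string> = {
  'read.csv': 'file',
  'read_xlsx': 'path',
  'readRDS': 'file',
};

// Simplified extractFilePath: named arg first, else the first unnamed arg.
function literalPath(call: Call): string | null {
  if (!Object.prototype.hasOwnProperty.call(READERS, call.name)) return null;
  const keyword = READERS[call.name];
  const arg = call.args.find(a => a.name === keyword) ?? call.args.find(a => !a.name);
  if (arg && arg.value.type === 'literal' && typeof arg.value.value === 'string') {
    return arg.value.value;
  }
  return null; // variable, paste0(...), etc.: not extractable
}

// Per-file scan, then one pipeline-wide union (the FileRegistry step).
function collectReferencedFiles(callsByFile: Record<string, Call[]>): Set<string> {
  const all = new Set<string>();
  for (const file of Object.keys(callsByFile)) {
    for (const call of callsByFile[file]) {
      const path = literalPath(call);
      if (path !== null) all.add(path);
    }
  }
  return all;
}
```

Because the result is a Set, the same literal path referenced from two scripts is recorded once; resolving it against each script's directory happens later, in worker-manager.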

Path resolution against the workspace happens in worker-manager (the recognizer doesn’t know the workspace). Resolution rules, tried in order:

  1. Exact match against workspace.files keys.

  2. Resolve relative to originFile’s directory (matches the CWD scheme).

  3. Basename match (case-sensitive, following Linux semantics inside WebR).

When a reference resolves to a workspace file, that file is added to the sync set. References that don’t resolve are dropped; they fail at R-eval time with a clear file-not-found error, which the user can address by uploading the missing file.
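The three resolution rules can be sketched as a pure function. The names normalizePath, basename and resolveReference are hypothetical, and two behaviours are assumptions not stated in the spec: a path that climbs above the workspace root is treated as unresolved for rule 2, and when several files share a basename the first one in iteration order wins.

```typescript
// Collapse "." and ".." segments; null if the path escapes the workspace root.
function normalizePath(path: string): string | null {
  const out: string[] = [];
  for (const seg of path.split('/')) {
    if (seg === '' || seg === '.') continue;
    if (seg === '..') {
      if (out.length === 0) return null;
      out.pop();
    } else {
      out.push(seg);
    }
  }
  return out.join('/');
}

function basename(path: string): string {
  return path.slice(path.lastIndexOf('/') + 1);
}

function resolveReference(
  ref: string,
  originFile: string | undefined,
  workspacePaths: Set<string>,
): string | null {
  // Rule 1: exact match against workspace keys.
  if (workspacePaths.has(ref)) return ref;

  // Rule 2: relative to the directory of the script that made the call.
  const dir =
    originFile !== undefined && originFile.lastIndexOf('/') >= 0
      ? originFile.slice(0, originFile.lastIndexOf('/'))
      : '';
  const joined = normalizePath(dir === '' ? ref : `${dir}/${ref}`);
  if (joined !== null && workspacePaths.has(joined)) return joined;

  // Rule 3: case-sensitive basename match (first hit wins; an assumption).
  const wanted = basename(ref);
  for (const candidate of Array.from(workspacePaths)) {
    if (basename(candidate) === wanted) return candidate;
  }
  return null;
}
```

With originFile set to code/01-prep.R, the reference ../Data/MasterData.xlsx normalizes to Data/MasterData.xlsx, which is the same file R would open after setwd("/workspace/code").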

The reference scan is recomputed on every pipeline rebuild (which already runs on every code edit via setCodeForTab and loadZip). The cost is one KNOWN_READERS map lookup per visited call node, O(nodes) overall, which is negligible.

3.4 VFS sync pipeline

Sync operates on the resolved subset syncTargets = referencedFiles ∩ workspace.files (resolved against originFile-relative directories per §3.3). Files in the workspace that no R code references are never pushed to VFS.

Three sync triggers share one code path (workerManager.syncWorkspaceToWebR); the fourth item, workspace wipe, is the reset path:

  1. Worker boot (ensureWebRWorker): after init-ready, compute syncTargets, iterate the unsynced subset, post FS-write messages before resolving the webrReady promise. Status stays at loading until sync completes; only then transitions to ready. Callers of ensureWebRWorker() can therefore assume the FS is populated for current syncTargets when the promise resolves.

  2. Pipeline rebuild adds new references: when referencedFiles grows (user edits R code to reference a new file), worker-manager posts FS-write for the newly-referenced files (if WebR is up). If WebR isn’t up, the files stay unsynced and are pushed at the next worker boot (trigger 1).

  3. New file added while it’s already referenced: workspace.addFiles(...) notifies worker-manager. If the new file matches any entry in referencedFiles, post FS-write immediately. Otherwise no-op until referenced.

  4. Workspace wipe: post fs-wipe; worker calls webR.FS.unlink() over each path under /workspace/. After ack, mark all files unsynced; the next sync trigger re-pushes whatever’s still in syncTargets.
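One way to keep all sync triggers on a single path is to derive the outgoing message from the current state every time, as in this sketch. The function name buildFsWrite is hypothetical; the message shape follows the fs-write request in the protocol additions.

```typescript
type FsWriteRequest = {
  type: 'fs-write';
  id: string;
  entries: Array<{ path: string; bytes: Uint8Array }>;
};

// Returns the fs-write message for targets not yet pushed, or null if nothing is pending.
function buildFsWrite(
  id: string,
  syncTargets: Set<string>,          // resolved, referenced workspace paths
  files: Map<string, Uint8Array>,    // workspace.files
  syncedToWebR: Set<string>,         // already present under /workspace/
): FsWriteRequest | null {
  const entries: FsWriteRequest['entries'] = [];
  syncTargets.forEach(path => {
    const bytes = files.get(path);
    if (bytes !== undefined && !syncedToWebR.has(path)) {
      entries.push({ path, bytes });
    }
  });
  return entries.length === 0 ? null : { type: 'fs-write', id, entries };
}
```

Worker boot, a grown reference set, and a newly uploaded file all reduce to calling this with fresh inputs; after a wipe, clearing syncedToWebR makes the next call re-push every remaining target.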

Protocol additions (all four FS operations consolidated here for reference; fs-list and fs-read are used in §3.6):

// src/core/webr/protocol.ts — additions to WebRRequest
| { type: 'fs-write'; id: string; entries: Array<{ path: string; bytes: Uint8Array }> }
| { type: 'fs-wipe';  id: string }
| { type: 'fs-list';  id: string; root: string }
| { type: 'fs-read';  id: string; path: string }

// additions to WebRResponse
| { type: 'fs-ack';         id: string; written?: string[]; error?: string }
| { type: 'fs-list-result'; id: string; entries: Array<{ path: string; size: number; mtime: number }> }
| { type: 'fs-read-result'; id: string; bytes?: Uint8Array; error?: string }

For fs-write, the worker creates parent directories as needed (webR.FS.mkdir recursive), then writes each file with webR.FS.writeFile. Paths in the request are workspace-relative (e.g., code/01-prep.R); the worker prepends /workspace/ to form absolute paths.
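The worker-side write step might look like the sketch below. FsLike is a hypothetical two-method interface standing in for whatever filesystem handle the worker uses; it is not the webR.FS signature, and the helper names parentDirs and writeEntries are illustrative.

```typescript
// Hypothetical minimal FS surface for this sketch; not the webR.FS signature.
interface FsLike {
  mkdir(path: string): void;
  writeFile(path: string, bytes: Uint8Array): void;
}

const WORKSPACE_ROOT = '/workspace';

// Parents of a workspace-relative path, shallowest first.
// 'a/b/c.csv' gives ['/workspace', '/workspace/a', '/workspace/a/b'].
function parentDirs(relPath: string): string[] {
  const segments = relPath.split('/').filter(s => s !== '');
  const dirs = [WORKSPACE_ROOT];
  for (let i = 0; i < segments.length - 1; i++) {
    dirs.push(`${dirs[dirs.length - 1]}/${segments[i]}`);
  }
  return dirs;
}

// Creates each missing parent once, writes the file, and returns the
// workspace-relative paths that were written (the fs-ack payload).
function writeEntries(
  fs: FsLike,
  entries: Array<{ path: string; bytes: Uint8Array }>,
  knownDirs: Set<string> = new Set<string>(),
): string[] {
  const written: string[] = [];
  for (const { path, bytes } of entries) {
    for (const dir of parentDirs(path)) {
      if (!knownDirs.has(dir)) {
        fs.mkdir(dir);
        knownDirs.add(dir);
      }
    }
    fs.writeFile(`${WORKSPACE_ROOT}/${path}`, bytes);
    written.push(path);
  }
  return written;
}
```

Tracking knownDirs avoids calling mkdir twice on the same directory within a batch; a caller that reuses the set across batches would need to clear it after fs-wipe.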

The fs-list root argument is /workspace; the worker walks recursively and returns one entry per file (not directory). fs-read round-trips bytes for downloads/previews.

3.5 CWD threading

// src/core/parsers/shared/analysis-call.ts — modified
interface AnalysisCall {
  // ... existing fields
  originFile?: string;  // path relative to workspace root, e.g. "code/01-prep.R"
}

FileRegistry.processFiles() already iterates per-file; the recognizer just needs to know the current entry.path and stamp it onto every emitted call. For inline-paste / single-file mode where there’s no meaningful “origin file” (just the editor’s tab content), originFile is left undefined.

// src/core/pipeline/types.ts — extend params
interface WebRTypedParams { /* ... */ originFile?: string; }
interface WebROpaqueParams { /* ... */ originFile?: string; }

// src/workers/worker-manager.ts: derive cwd before dispatch
function cwdFor(originFile: string | undefined): string {
  if (!originFile) return '/workspace';
  const slash = originFile.lastIndexOf('/');
  return slash < 0 ? '/workspace' : `/workspace/${originFile.slice(0, slash)}`;
}

Each dispatch-typed and dispatch-opaque request carries cwd; the worker runs setwd(cwd) before the eval. The change is sticky: the previous working directory is not restored afterwards. Cost: one extra evalRVoid per dispatch, negligible compared to the actual eval.
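A few concrete inputs make the mapping easier to check. cwdFor is repeated from the snippet above so the block stands alone; setwdStatement is a hypothetical helper showing one way to build the R statement, with quoting added as a precaution for directory names containing quotes or backslashes.

```typescript
// Repeated from the worker-manager snippet so this block is self-contained.
function cwdFor(originFile: string | undefined): string {
  if (!originFile) return '/workspace';
  const slash = originFile.lastIndexOf('/');
  return slash < 0 ? '/workspace' : `/workspace/${originFile.slice(0, slash)}`;
}

// Hypothetical helper: the R statement evaluated before each dispatch.
// Backslashes and double quotes are escaped so the path stays inside the literal.
function setwdStatement(cwd: string): string {
  const escaped = cwd.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return `setwd("${escaped}")`;
}

// Inline paste (no origin file) and root-level scripts run at the workspace root;
// nested scripts run in their own directory.
const examples: Array<[string | undefined, string]> = [
  [undefined, cwdFor(undefined)],
  ['main.R', cwdFor('main.R')],
  ['code/01-prep.R', cwdFor('code/01-prep.R')],
  ['a/b/c.R', cwdFor('a/b/c.R')],
];
```

With code/01-prep.R as the origin, read_xlsx("../Data/MasterData.xlsx") is evaluated from /workspace/code and therefore opens /workspace/Data/MasterData.xlsx.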

3.6 Artifact discovery

After a Run completes (all in-flight dispatches settled), worker-manager posts an fs-list request (see protocol additions in §3.4). The worker walks /workspace/ recursively (webR.FS.readdir + stat for each entry), returns one entry per file. Worker-manager diffs the result paths against workspace.originalUploads; any path not in the set is an artifact. Each artifact in the store carries: path, size, mtime, mime-type guess (from extension).
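The diff itself is a set-membership filter. In the sketch below, discoverArtifacts, guessMime and the MIME_BY_EXT table are hypothetical, and it assumes fs-list may return absolute paths under /workspace/ while originalUploads holds workspace-relative ones.

```typescript
type FsEntry = { path: string; size: number; mtime: number };
type Artifact = FsEntry & { mime: string };

// Hypothetical extension → MIME table; extend as needed.
const MIME_BY_EXT: Record<string, string> = {
  csv: 'text/csv',
  txt: 'text/plain',
  tex: 'application/x-tex',
  svg: 'image/svg+xml',
  png: 'image/png',
  pdf: 'application/pdf',
};

function guessMime(path: string): string {
  const name = path.slice(path.lastIndexOf('/') + 1);
  const dot = name.lastIndexOf('.');
  const ext = dot < 0 ? '' : name.slice(dot + 1).toLowerCase();
  return Object.prototype.hasOwnProperty.call(MIME_BY_EXT, ext)
    ? MIME_BY_EXT[ext]
    : 'application/octet-stream';
}

// Any listed file whose workspace-relative path was not uploaded is an artifact.
function discoverArtifacts(listing: FsEntry[], originalUploads: Set<string>): Artifact[] {
  const prefix = '/workspace/';
  const artifacts: Artifact[] = [];
  for (const entry of listing) {
    const rel = entry.path.indexOf(prefix) === 0 ? entry.path.slice(prefix.length) : entry.path;
    if (originalUploads.has(rel)) continue;
    artifacts.push({ path: rel, size: entry.size, mtime: entry.mtime, mime: guessMime(rel) });
  }
  return artifacts;
}
```

Because the comparison is by path only, an uploaded file that R overwrites in place would not show up as an artifact under this scheme.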

The artifacts store keeps the latest snapshot. The UI panel renders it grouped by parent directory; new-since-previous-Run paths get a “new” dot for one render cycle (cleared on next Run start).

Downloads: clicking an artifact triggers an fs-read request → bytes round-trip back to TS → Blob + <a download> synthetic link. Previews for text/CSV/SVG/PNG/PDF under 5MB use the same round-trip but render inline (text: pre-wrap; SVG: inline; PNG: img src=data URI; CSV: small table with first 50 rows; PDF: object embed).
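Whether an artifact gets a preview button can be decided from its path and size alone. This sketch is illustrative: previewKind and KIND_BY_EXT are hypothetical names, the extension list is a guess at what counts as text, and "5MB" is read as 5 × 1024 × 1024 bytes.

```typescript
type PreviewKind = 'text' | 'csv' | 'svg' | 'png' | 'pdf';

// Assumption: "under 5MB" means strictly less than 5 MiB.
const PREVIEW_LIMIT_BYTES = 5 * 1024 * 1024;

// Hypothetical extension table; anything not listed is download-only.
const KIND_BY_EXT: Record<string, PreviewKind> = {
  txt: 'text',
  log: 'text',
  csv: 'csv',
  svg: 'svg',
  png: 'png',
  pdf: 'pdf',
};

// Returns how to render the artifact inline, or null for download-only.
function previewKind(path: string, size: number): PreviewKind | null {
  if (size >= PREVIEW_LIMIT_BYTES) return null;
  const name = path.slice(path.lastIndexOf('/') + 1);
  const dot = name.lastIndexOf('.');
  if (dot < 0) return null;
  const ext = name.slice(dot + 1).toLowerCase();
  return Object.prototype.hasOwnProperty.call(KIND_BY_EXT, ext) ? KIND_BY_EXT[ext] : null;
}
```

Checking the size from the fs-list entry first means oversized files never trigger an fs-read round-trip just to be rejected.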

3.7 Lifecycle & wipe-confirmation

When a ZIP is dropped on UploadZone and the workspace already has files, show a modal asking “This will replace your current workspace. Continue?” with Cancel and Replace buttons. On Replace, call workspace.wipe() (which wipes WebR FS too) before extracting the new ZIP. On Cancel, abort the upload.

Single-file uploads (any type) skip the prompt and append. Editor mode (future spec) will create files programmatically; same append path.

The originalUploads set is recomputed at the end of every upload completion (i.e., originalUploads = new Set(workspace.files.keys())). After the wipe-and-replace flow, the new uploads are the new originals; previously-discovered artifacts are gone (they were in the wiped VFS) and the artifacts panel resets.

4. Behavior

4.1 Sync timing

The key invariant: when the webrReady promise resolves, all current syncTargets are present in /workspace/. Dispatches downstream of ensureWebRWorker() await it, so they can safely assume their inputs are readable. Files outside syncTargets are never written; if R code at runtime tries to read one, it gets a normal “file not found” error.

4.2 Error handling

4.3 CWD edge cases

4.4 Artifact lifecycle

5. Memory model

Steady state per file:
- Unreferenced files (most of workspace.files for typical packages): bytes exist only in TS-side WorkspaceStore. Not in WebR. One copy.
- Referenced files (the subset R code actually reads): bytes exist in TS-side WorkspaceStore and in WebR’s WASM heap simultaneously. Copy semantics (not Transferable). Two copies.

For a typical 100–200MB upload where ~10–30% of files are R-readable inputs, the extra-vs-today cost is bounded to that ~10–30%, not the full upload. Modern browsers handle this comfortably; the 1.5GB total cap covers it.
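The one-copy versus two-copy rule reduces to simple arithmetic. The helper below is a hypothetical illustration (steadyStateBytes is not part of the design) that takes file sizes in any consistent unit and reports the TS-side total, the WebR-side total, and their sum.

```typescript
// Every file is held once in the TS-side store; referenced files are
// held a second time inside the WebR heap.
function steadyStateBytes(
  sizes: Map<string, number>,   // path → size (any consistent unit)
  referenced: Set<string>,      // paths that opaque R code reads
): { store: number; webr: number; total: number } {
  let store = 0;
  let webr = 0;
  sizes.forEach((size, path) => {
    store += size;
    if (referenced.has(path)) webr += size;
  });
  return { store, webr, total: store + webr };
}

// Worked example in MB: a 200 MB upload where 60 MB (30%) is referenced.
const exampleFootprint = steadyStateBytes(
  new Map([
    ['Data/MasterData.xlsx', 60],
    ['docs/appendix.pdf', 90],
    ['figures/raw.zip', 50],
  ]),
  new Set(['Data/MasterData.xlsx']),
);
```

In this example the store holds 200 MB and WebR holds 60 MB, so the total footprint is 260 MB rather than the 400 MB a mirror-everything approach would need.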

Per file budget: none (per-file cap removed). Total cap stays at 1.5GB.

Future optimization (out of scope): switch to Transferable ArrayBuffer when the dataset-marshal Transferable backlog item lands. Both paths share the same postMessage envelope at that point.

6. UI

6.1 Upload zone

[Upload] button — accept attribute extended:
  ".zip,.R,.r,.csv,.dta,.xlsx,.xls,.rds,.rdata,.sav"

Drop and click handlers route binary types into workspace.addFiles(...) directly (no parsing) and trigger sync if WebR is up.

6.2 Wipe-confirmation modal

Plain modal, two buttons:
- “Replace workspace” (primary, destructive)
- “Cancel” (secondary)

Body: lists the files currently in the workspace that will be removed (collapsed to “X files (Y MB)” if >5).
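The body text rule could be implemented along these lines. wipeSummary is a hypothetical helper, and the size is shown in units of 1024 × 1024 bytes with one decimal, which is an assumption about formatting.

```typescript
// Lines for the modal body: individual paths for small workspaces,
// a single "X files (Y MB)" summary once there are more than five.
function wipeSummary(files: Map<string, Uint8Array>): string[] {
  const paths = Array.from(files.keys());
  if (paths.length <= 5) return paths;
  let bytes = 0;
  files.forEach(b => { bytes += b.byteLength; });
  const mb = (bytes / (1024 * 1024)).toFixed(1);
  return [`${paths.length} files (${mb} MB)`];
}
```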

6.3 Artifacts panel

Sits in the right sidebar (alongside the existing properties/results panels), collapsed by default. Header shows artifact count and total size. When expanded:

▼ Artifacts (3 files, 1.2 MB)
  output/
    ▸ tables/main.tex      [download]
    ▸ figs/coef-plot.svg   [preview] [download]
  results/
    ▸ m1-summary.csv       [preview] [download]

Preview opens an inline overlay (SVG inline; PNG via data URI; text/CSV with row truncation; PDF via <object>).
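The header line and directory grouping in the mock above could be derived as follows. Both helpers are hypothetical; the grouping key is the top-level directory, as the mock shows, and sizes are formatted in units of 1024 × 1024 bytes, which is an assumption.

```typescript
type ArtifactRow = { path: string; size: number };

// Group rows by top-level directory ('output/', 'results/'); files at the
// workspace root land under the empty key.
function groupByTopDir(rows: ArtifactRow[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const { path } of rows) {
    const slash = path.indexOf('/');
    const dir = slash < 0 ? '' : path.slice(0, slash + 1);
    const rest = slash < 0 ? path : path.slice(slash + 1);
    const list = groups.get(dir);
    if (list) list.push(rest);
    else groups.set(dir, [rest]);
  }
  return groups;
}

// "Artifacts (3 files, 1.2 MB)" style header.
function panelHeader(rows: ArtifactRow[]): string {
  const total = rows.reduce((sum, r) => sum + r.size, 0);
  const noun = rows.length === 1 ? 'file' : 'files';
  const mb = (total / (1024 * 1024)).toFixed(1);
  return `Artifacts (${rows.length} ${noun}, ${mb} MB)`;
}
```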

6.4 No changes to results / pipeline panels

Artifacts are orthogonal to typed pipeline outputs; they don’t appear in the DAG. (A future “promote artifact to dataset” feature is out of scope — the auto-marshal probe already handles the case where R loads a file and assigns to a binding, which is the normal data-input path.)

7. Testing

8. Migration notes

9. Risk & rollback

10. Out of scope reminders