WebR Opaque Nodes — Design Spec

Date: 2026-04-22
Milestone: M6 enabler (Session 2 of WebR follow-up sessions)
Predecessor: 2026-04-20 WebR Integration (Session 1: typed framework + lm_robust, merged)

1. Context

Session 1 landed the WebR framework plus one typed marshaler (lm_robust). The worker protocol scaffolded dispatch-opaque but nothing on the TS side emits or consumes it. This session wires the opaque path end-to-end, restructures the recognizer around a binding-level walk, and adds a statement-block fallback for parser failures.

Data from the 2026-04-22 opaque inventory (REPLICATION-AUDIT-OPAQUE.md, 83 packages, 10,753 unsupported-node occurrences):

Framework goal: every assignment whose RHS is call-like produces a pipeline node — typed when recognized, opaque when not. Scalar-literal, simple-vector, and evaluator-computable RHS (e.g., x <- 5, threshold <- 0.05, names <- c("a","b","c"), dir <- "/path") continue to update the recognizer’s scope map without emitting a node. The parser’s pattern table becomes a sub-dispatch inside the walk, not the primary traversal.

2. Scope

In

Out (future sessions)

3. Architecture

3.1 Module changes

src/core/parsers/r/recognizer.ts             [restructured]
                                              Binding-level walk as primary loop;
                                              pattern table becomes sub-dispatch

src/core/parsers/r/recognize-opaque.ts       [new]
                                              Free-variable extraction,
                                              opaque AnalysisCall construction

src/core/parsers/r/inliner.ts                [modified]
                                              Unwrap trailing return(x) → x

src/core/parsers/shared/analysis-call.ts     [modified]
                                              Add 'webr-opaque' to AnalysisKind

src/core/pipeline/types.ts                   [modified]
                                              Add WebROpaqueNode; add 'any' dataType;
                                              extend getPortsFor() for dynamic ports

src/core/pipeline/mapper.ts                  [modified]
                                              webr-opaque AnalysisCall → WebROpaqueNode

src/core/pipeline/executor.ts                [modified]
                                              Opaque executor branch; probe-result
                                              handling; downstream error propagation

src/core/webr/protocol.ts                    [modified]
                                              Extend dispatch-opaque response with
                                              probe-result discriminated union

src/workers/webr-worker.ts                   [modified]
                                              handleDispatchOpaque probes result,
                                              marshals Dataset or returns binding

src/workers/worker-manager.ts                [modified]
                                              Route webr-opaque nodes to dispatchOpaque

src/ui/components/pipeline-node.tsx          [modified]
                                              webr-opaque rendering variant

3.2 Node type and ports

// src/core/pipeline/types.ts
export interface WebROpaqueParams {
  rSource: string;         // RHS sliced from original source
  resultBinding: string;   // R name the output lives under (`.n_<id>`)
  inputBindings: string[]; // ordered R identifiers this node depends on
  origin: 'assignment' | 'statement-fallback';
  unassigned: boolean;     // true for top-level side-effect calls (no LHS)
}

export type OpaqueResult =
  | { kind: 'dataset'; dataset: Dataset }
  | { kind: 'opaque-binding'; binding: string }
  | { kind: 'side-effect'; capturedText?: string };

export interface WebROpaqueNode extends PipelineNodeBase {
  type: 'webr-opaque';
  params: WebROpaqueParams;
  result?: OpaqueResult;
}

Ports are per-instance and computed dynamically:

// src/core/pipeline/types.ts — extend getPortsFor()
if (node.type === 'webr-opaque') {
  return {
    inputs: node.params.inputBindings.map(name => ({ name, dataType: 'any' })),
    outputs: [{ name: 'out', dataType: 'any' }],
  };
}

New 'any' dataType is added to PortDataType. Port validation treats it as bidirectionally compatible with 'dataset' and 'model'. No existing port uses 'any' — it’s reserved for opaque edges.
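
The compatibility rule can be sketched as a small predicate. This is a minimal illustration, assuming PortDataType has only the three members named here and that the validator is a pure function; the name isPortCompatible is hypothetical and not taken from the codebase.

```typescript
// Hypothetical sketch: the real PortDataType may have more members.
type PortDataType = 'dataset' | 'model' | 'any';

// 'any' matches everything in either direction; other types must match exactly.
function isPortCompatible(from: PortDataType, to: PortDataType): boolean {
  if (from === 'any' || to === 'any') return true;
  return from === to;
}

const checks: Array<[PortDataType, PortDataType]> = [
  ['any', 'dataset'],   // opaque output feeding a native data input
  ['model', 'any'],     // typed model feeding an opaque input
  ['dataset', 'model'], // unrelated concrete types stay incompatible
];
const results = checks.map(([a, b]) => isPortCompatible(a, b));
```

Keeping the check symmetric means an opaque node can sit on either end of an edge without special-casing the direction.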

3.3 Rendering (UI)

pipeline-node.tsx adds a variant for webr-opaque:

webr-opaque is selectable, inspectable in the properties panel, and shows execution state (idle / running / dataset-marshaled / opaque-binding-only / side-effect-captured / error).

Results-panel rendering by result kind:

result.kind Panel content
dataset Dataset table viewer (same as native data-nodes)
opaque-binding Badge: “R value kept in WebR env (binding: <name>)” + source echo
side-effect + capturedText Preformatted text block (stdout/stderr from captureR)
side-effect without capturedText Empty-state: “This node ran with no captured output. If it produced a plot, image capture is coming in session 3.”

4. Inliner return() unwrap — prerequisite

Current bug: inline(f) where f <- function(x) { body; return(x) } emits AnalysisCalls for the body statements plus a top-level return(x) that the recognizer then flags as unsupported.

Fix: in tryInline() (src/core/parsers/r/inliner.ts), after argument substitution, inspect the final body statement:

Test coverage: add inliner-return.test.ts with five cases: single tail return, assigned tail return, no return (unchanged), early return (should fall through guard), and named return value (return(value = x) → x).

Success metric: audit-opaque.ts rerun shows return drops from 490 unsupported occurrences to ≤ 5 (tolerates residual edge cases).
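
A tail-return unwrap might look like the sketch below. It assumes a deliberately simplified statement shape (the Expr and Stmt types are hypothetical stand-ins for the real inliner AST) and only rewrites the final statement, leaving early returns untouched.

```typescript
// Simplified, hypothetical AST; the real node types live in the R parser.
type Expr =
  | { type: 'identifier'; name: string }
  | { type: 'call'; callee: string; args: { name?: string; value: Expr }[] };
type Stmt =
  | { type: 'expr'; expr: Expr }
  | { type: 'assignment'; target: string; value: Expr };

// If `e` is return(<one arg>), yield the argument; otherwise yield `e` unchanged.
function unwrapReturn(e: Expr): Expr {
  if (e.type === 'call' && e.callee === 'return' && e.args.length === 1) {
    const only = e.args[0];
    if (only) return only.value; // also covers the named form return(value = x)
  }
  return e;
}

// Only the final statement is rewritten; earlier statements pass through as-is.
function unwrapTailReturn(body: Stmt[]): Stmt[] {
  const last = body[body.length - 1];
  if (!last) return body;
  const rewritten: Stmt =
    last.type === 'expr'
      ? { type: 'expr', expr: unwrapReturn(last.expr) }
      : { type: 'assignment', target: last.target, value: unwrapReturn(last.value) };
  return [...body.slice(0, -1), rewritten];
}
```

A body with no trailing return comes back structurally unchanged, which corresponds to the "no return (unchanged)" test case.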

5. Recognizer — binding-level walk

5.1 Current structure

recognizeR today walks each top-level statement, and for AssignmentNode tries a fixed pattern table of function-name matches (tryRecognizeLm, tryRecognizeFeols, …). If no pattern matches, emits { kind: 'unsupported', ... }.

5.2 New structure

The primary loop iterates top-level statements in source order. For each, it dispatches through a sequence of sub-recognizers:

for (const stmt of program.body) {
  if (isIgnorableCall(stmt)) continue;              // library(), setwd(), etc.
  if (stmt.type === 'for' || stmt.type === 'control-flow') {
    handleLoopOrControlFlow(stmt); continue;
  }
  if (stmt.type === 'function-def') {
    registerFunctionDef(stmt); continue;
  }

  const result = tryRecognizeStatement(stmt, scope);
  //  ^ returns { calls }: typed AnalysisCall(s), one opaque AnalysisCall,
  //    or [] when the statement was evaluated in scope or ignored

  for (const call of result.calls) {
    if (call.assignedTo) scope.set(call.assignedTo, call);
    emit(call);
  }
}

tryRecognizeStatement is the new dispatch:

function tryRecognizeStatement(stmt, scope):
  1. Try typed patterns by RHS function name (existing tryRecognizeLm, tryRecognizeFeols, ...)
     → If one matches, return { calls: [typedCall] }
  2. Try scope-eval for evaluable RHS (existing mutate-column-arithmetic path)
     → If it's a scalar/vector the evaluator can handle, update scope and return { calls: [] }
  3. Try inline expansion (existing tryInline for user-defined function calls)
     → If successful, recurse on inlined body statements
  4. Try loop/apply expansion (existing tryExpandLoop, tryExpandApply)
     → If successful, flush expansion result
  5. **Opaque fallback (NEW):** construct a webr-opaque AnalysisCall from the assignment
     → Return { calls: [opaqueCall] }

Step 5 is the new insertion. Steps 1–4 are the existing logic refactored into a dispatch sequence. Pattern match still fires first — the only semantic change is that step 5 emits webr-opaque instead of unsupported.
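
The first-match-wins ordering can be illustrated with a toy dispatcher. Everything here is a hypothetical simplification: the Stmt and Call shapes, and the two stand-in sub-recognizers, exist only to show that the opaque fallback runs when no earlier step claims the statement.

```typescript
// Hypothetical shapes; the real AnalysisCall and statement types are richer.
interface Call { kind: string; assignedTo: string | null }
interface Stmt { target: string | null; fn: string }
type StepResult = { calls: Call[] };
type SubRecognizer = (stmt: Stmt) => StepResult | null;

// Run sub-recognizers in order; the first non-null result wins.
function dispatch(
  stmt: Stmt,
  steps: SubRecognizer[],
  fallback: (s: Stmt) => Call,
): StepResult {
  for (const step of steps) {
    const result = step(stmt);
    if (result !== null) return result;
  }
  return { calls: [fallback(stmt)] }; // step 5: opaque fallback
}

// Stand-ins for step 1 (typed pattern) and step 2 (scope-eval).
const typedLm: SubRecognizer = s =>
  s.fn === 'lm' ? { calls: [{ kind: 'lm', assignedTo: s.target }] } : null;
const scopeEval: SubRecognizer = s => (s.fn === 'c' ? { calls: [] } : null);
const opaque = (s: Stmt): Call => ({ kind: 'webr-opaque', assignedTo: s.target });

const steps = [typedLm, scopeEval];
const typed = dispatch({ target: 'm', fn: 'lm' }, steps, opaque);
const scoped = dispatch({ target: 'v', fn: 'c' }, steps, opaque);
const fellBack = dispatch({ target: 'd', fn: 'readRDS' }, steps, opaque);
```

Here typed carries an lm call, scoped carries no calls, and fellBack carries a single webr-opaque call.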

5.3 Opaque AnalysisCall construction (recognize-opaque.ts)

function recognizeOpaqueAssignment(
  assignment: AssignmentNode,
  scope: Map<string, AnalysisCall>,
  source: string,
): AnalysisCall {
  const rhs = assignment.value;
  const rhsSource = source.slice(rhs.span.start, rhs.span.end);
  const freeVars = extractFreeIdentifiers(rhs);
  const inputBindings = freeVars.filter(v => scope.has(v));

  return {
    kind: 'webr-opaque',
    args: {
      rSource: rhsSource,
      inputBindings,
      origin: 'assignment',
    },
    assignedTo: assignment.target.name,
    sourceSpan: assignment.span,
  };
}

For unassigned top-level calls (plot(x), print(summary(m))) that the pattern table doesn’t recognize and aren’t in the ignorable-call list:

function recognizeOpaqueSideEffect(call: FunctionCallNode, scope, source): AnalysisCall {
  return {
    kind: 'webr-opaque',
    args: {
      rSource: source.slice(call.span.start, call.span.end),
      inputBindings: extractFreeIdentifiers(call).filter(v => scope.has(v)),
      origin: 'assignment',  // treated identically; result kind will be 'side-effect' at execute
    },
    sourceSpan: call.span,
    // No assignedTo → mapper emits a terminal node (no outgoing edges)
  };
}

5.4 Free-variable extraction

extractFreeIdentifiers(node) walks the AST and returns identifiers that are referenced but not bound within the expression. Filters out:

Reuses isNSEContext() and related helpers from recognize-data-arg.ts rather than reimplementing.
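
As a rough illustration of free-variable extraction, the sketch below walks a toy AST and treats lambda parameters as bound. The Node type and the function name freeIdentifiersSketch are hypothetical; NSE handling and the other filters are deliberately left out.

```typescript
// Toy AST for illustration only; not the parser's real node types.
type Node =
  | { type: 'identifier'; name: string }
  | { type: 'literal' }
  | { type: 'call'; callee: string; args: Node[] }
  | { type: 'lambda'; params: string[]; body: Node };

// Returns identifiers referenced but not bound inside `node`, in first-seen order.
function freeIdentifiersSketch(node: Node): string[] {
  const out: string[] = [];
  const visit = (n: Node, bound: ReadonlySet<string>): void => {
    switch (n.type) {
      case 'identifier':
        if (!bound.has(n.name) && !out.includes(n.name)) out.push(n.name);
        break;
      case 'literal':
        break;
      case 'call':
        // The callee name is a function reference, not a data dependency here.
        n.args.forEach(a => visit(a, bound));
        break;
      case 'lambda':
        // Parameters shadow outer names for the duration of the body.
        visit(n.body, new Set([...bound, ...n.params]));
        break;
    }
  };
  visit(node, new Set());
  return out;
}

// sapply(xs, function(x) x * k)
const expr: Node = {
  type: 'call',
  callee: 'sapply',
  args: [
    { type: 'identifier', name: 'xs' },
    {
      type: 'lambda',
      params: ['x'],
      body: { type: 'call', callee: '*', args: [{ type: 'identifier', name: 'x' }, { type: 'identifier', name: 'k' }] },
    },
  ],
};
const free = freeIdentifiersSketch(expr);
```

The result for this expression is xs and k; x is excluded because the lambda binds it. Intersecting with scope, as recognizeOpaqueAssignment does, then drops any names that are not upstream bindings.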

5.5 Recognizer API — unchanged externally

recognizeR(ast, source, scope) → { calls, diagnostics, loadedPackages } signature stays identical. Only the internal structure changes. Existing callers (FileRegistry, audit, direct recognizer tests) need no modifications.

6. Statement-block opaque fallback

6.1 Parser error-recovery gap

The R parser uses Chevrotain with error recovery. When a statement fails to parse (e.g., a syntax the lexer doesn’t handle, an unrecognized construct), the parser skips to the next statement boundary. Today, the failed span is lost — no AnalysisCall is emitted for it, and no diagnostic carries its source.

Representative parse-failure cases observed in the corpus (17 files with errors across 15 papers; one file alone has 356 errors):

| Shape | Example | Today | With fallback |
| --- | --- | --- | --- |
| R 4.1 lambda shorthand | map(xs, \(x) x * 2) | lexer drops statement | opaque node evals in WebR |
| Complex felm multi-\| formula | felm(y ~ x \| fe1 + fe2 \| z ~ w \| cluster, data=df) | recovery skips the assignment, downstream m undefined | opaque node with resultBinding = .n_<id> that downstream can reference |
| S4 slot access cascade | val <- tree@data | @ trips recovery, following statements also dropped | one opaque node per skipped statement |
| Large swath in one file | qje-do-financial-concerns/.../Figure_1.R (183 parse errors) | ~180 silently-dropped statements; pipeline looks artificially small | surfaces as opaque nodes so the user can see what got routed to WebR |

6.2 Extension

parseR is extended to return an additional field unparsedRanges: Span[] — spans in the source that the parser skipped due to recovery. The recognizer receives these and emits one webr-opaque AnalysisCall per span:

for (const span of ast.unparsedRanges) {
  const rSource = source.slice(span.start, span.end);
  calls.push({
    kind: 'webr-opaque',
    args: {
      rSource,
      inputBindings: scanIdentifiersAgainstScope(rSource, scope),
      origin: 'statement-fallback',
    },
    sourceSpan: span,
    // No assignedTo → terminal node (best-effort evaluation; if the R side
    // assigns to user bindings, those are picked up by subsequent opaque
    // nodes via scope through worker-side R env persistence)
  });
}

6.3 scanIdentifiersAgainstScope — best-effort input inference

No AST means no proper free-variable extraction, but the recognizer already maintains a source-order scope map. A regex-based token scan intersected with scope is sufficient for the common case:

const R_RESERVED = new Set([
  'function', 'if', 'else', 'for', 'while', 'repeat', 'break', 'next',
  'return', 'TRUE', 'FALSE', 'NULL', 'NA', 'NA_integer_', 'NA_real_',
  'NA_character_', 'NaN', 'Inf', 'in',
]);

function scanIdentifiersAgainstScope(
  source: string,
  scope: Map<string, AnalysisCall>,
): string[] {
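  // Match maximal runs of identifier characters, then keep runs that start
  // with a letter or dot. (\b would split dot-prefixed names such as `.x`,
  // because `.` is not a word character.)
  const tokens = (source.match(/[A-Za-z0-9._]+/g) ?? [])
    .filter(tok => /^[A-Za-z.]/.test(tok));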
  const seen = new Set<string>();
  for (const tok of tokens) {
    if (R_RESERVED.has(tok)) continue;
    if (scope.has(tok)) seen.add(tok);
  }
  return [...seen];
}

Example — m <- felm(y ~ x1 + x2 | fe1 + fe2 | z1 + z2 ~ w1 | cluster, data = df) where the parser trips on the multi-| formula and emits a fallback span covering the whole assignment:
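
The scan can be tried on that statement with a small stand-alone version. This is a sketch rather than the module's code: the reserved-word list is shortened, the scope contents are invented for illustration, and scanAgainstScope is a hypothetical name.

```typescript
// Shortened reserved list for the sketch; see R_RESERVED for the full set.
const RESERVED = new Set(['function', 'if', 'else', 'for', 'while', 'in', 'TRUE', 'FALSE', 'NULL', 'NA']);

function scanAgainstScope(source: string, scope: ReadonlyMap<string, unknown>): string[] {
  // Maximal runs of identifier characters; numeric literals start with a digit.
  const runs = source.match(/[A-Za-z0-9._]+/g) ?? [];
  const seen = new Set<string>();
  for (const tok of runs) {
    if (!/^[A-Za-z.]/.test(tok)) continue; // skip numbers such as 0.05
    if (RESERVED.has(tok)) continue;
    if (scope.has(tok)) seen.add(tok);     // keep only names bound upstream
  }
  return [...seen];
}

const src = 'm <- felm(y ~ x1 + x2 | fe1 + fe2 | z1 + z2 ~ w1 | cluster, data = df)';
// Invented scope: `df` is an upstream dataset, `cluster` an unrelated earlier
// binding that happens to share its name with a column used in the formula.
const scope = new Map<string, unknown>([['df', {}], ['cluster', {}], ['other', {}]]);
const inputs = scanAgainstScope(src, scope);
```

With this scope the scan returns cluster and df, in order of first appearance. The cluster hit shows the cost of working without an AST: a formula term that collides with an upstream binding name is reported as an input even though it refers to a column.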

Trade-offs:

6.4 Span collection in parseR

The Chevrotain parser’s error recovery already computes the skip range internally. Extracting it requires hooking the SKIP_TOKEN recovery path in the CST visitor and recording { start, end } pairs. Implementation: a small skipListener attached to the parser instance, populating a per-parse unparsedRanges array.
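
One possible shape for such a listener is sketched below, independent of the parser library. It assumes skipped tokens are reported in source order with inclusive end offsets, and it merges tokens separated only by spaces or tabs into one range; createSkipListener, onSkip, and the merge rule are all assumptions, not the actual hook.

```typescript
interface Span { start: number; end: number }                     // end is exclusive
interface SkippedToken { startOffset: number; endOffset: number } // endOffset inclusive (assumption)

function createSkipListener(source: string) {
  const unparsedRanges: Span[] = [];

  // Called once per token dropped by recovery, in source order (assumption).
  function onSkip(tok: SkippedToken): void {
    const start = tok.startOffset;
    const end = tok.endOffset + 1;
    const last = unparsedRanges[unparsedRanges.length - 1];
    const gap = last ? source.slice(last.end, start) : null;
    if (last && gap !== null && /^[ \t]*$/.test(gap)) {
      last.end = end;                      // same line, only blanks between: extend
    } else {
      unparsedRanges.push({ start, end }); // new statement-level range
    }
  }

  return { unparsedRanges, onSkip };
}

const text = 'v <- t@d\nz';
const listener = createSkipListener(text);
// Tokens for: v, <-, t, @, d, then z on the next line.
for (const [s, e] of [[0, 0], [2, 3], [5, 5], [6, 6], [7, 7], [9, 9]] as const) {
  listener.onSkip({ startOffset: s, endOffset: e });
}
const ranges = listener.unparsedRanges;
```

For this input the listener ends with two ranges, one covering `v <- t@d` and one covering `z`, which matches the one-opaque-node-per-skipped-statement behaviour described in §6.1.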

7. Mapper

mapper.ts gets one new case:

case 'webr-opaque': {
  // Generate the id once so the node id and resultBinding always agree.
  const id = freshId(call);
  return {
    id,
    type: 'webr-opaque',
    label: formatOpaqueLabel(call.args.rSource),  // first ~40 chars of rSource
    span: call.sourceSpan,
    params: {
      rSource: call.args.rSource,
      resultBinding: `.n_${id.replace(/[^a-z0-9]/gi, '_')}`,
      inputBindings: call.args.inputBindings,
      origin: call.args.origin,
      unassigned: !call.assignedTo,
    },
  };
}

Edges to the opaque node are added by the existing binding-resolution pass: for each name in inputBindings, look up the upstream node in the binding map and add an edge upstream.out → opaque.<name>.
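
The edge construction can be sketched as follows. The Edge shape and the binding map from R name to upstream node id are hypothetical simplifications of the existing pass, and skipping unresolved names is a choice made for this sketch only.

```typescript
// Hypothetical edge shape; the real pipeline edge type may differ.
interface Edge { from: string; fromPort: string; to: string; toPort: string }

function edgesForOpaque(
  nodeId: string,
  inputBindings: string[],
  bindingMap: ReadonlyMap<string, string>, // R name -> upstream node id
): Edge[] {
  const edges: Edge[] = [];
  for (const name of inputBindings) {
    const upstreamId = bindingMap.get(name);
    if (upstreamId === undefined) continue; // unresolved names get no edge in this sketch
    // upstream.out -> opaque.<name>
    edges.push({ from: upstreamId, fromPort: 'out', to: nodeId, toPort: name });
  }
  return edges;
}

const bindingMap = new Map([['df', 'node-1'], ['m', 'node-2']]);
const edges = edgesForOpaque('node-3', ['df', 'm'], bindingMap);
```

Because the input port is named after the binding, the per-instance ports from getPortsFor() and the edges produced here line up by construction.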

8. Executor and worker protocol

8.1 Worker protocol extension

WebRResponse’s dispatch-result discriminated union gains a probe-aware opaque result:

type OpaquePayload =
  | { kind: 'dataset'; dataset: MarshaledDataset }
  | { kind: 'opaque-binding'; binding: string }
  | { kind: 'side-effect'; capturedText?: string };

type DispatchResult =
  | { kind: 'typed'; payload: unknown }
  | { kind: 'opaque'; payload: OpaquePayload };

8.2 Worker handleDispatchOpaque extended

The worker branches on req.unassigned before bindAndEval so side-effect calls don’t double-execute (running both as assignment and again under capture would redraw plots, re-write files, etc.).

Assignment path (default):

const rName = req.resultBinding;
const ok = await bindAndEval(req, rName);   // binds `rName <- rSource`
if (!ok) { /* post dispatch-error */ return; }

Side-effect path (unassigned):

if (req.unassigned) {
  // bindInputs first (no eval) so captureR can reference inputs
  await bindInputsOnly(req.inputs);

  const captured = await webR.captureR(req.rSource, { captureStreams: true });
  const capturedText = captured.output
    .filter(o => o.type === 'stdout' || o.type === 'stderr')
    .map(o => o.data)
    .join('\n');
  post({ type: 'dispatch-result', id, result: {
    kind: 'opaque',
    payload: { kind: 'side-effect', capturedText: capturedText || undefined },
  } });
  return;
}

Assignment path, continued: after a successful bindAndEval, probe the bound result:

const isDataFrame = await webR.evalR(`is.data.frame(${rName})`).then(r => r.toBoolean());
if (isDataFrame) {
  const marshaled = await marshalDatasetFromR(rName);  // reuse toMarshaled pattern in reverse
  post({ type: 'dispatch-result', id, result: { kind: 'opaque', payload: { kind: 'dataset', dataset: marshaled } } });
  // Keep rName in R env — downstream opaque/typed nodes may reference it
} else {
  post({ type: 'dispatch-result', id, result: { kind: 'opaque', payload: { kind: 'opaque-binding', binding: rName } } });
}

marshalDatasetFromR(rName) is a new helper — mirror of bindDatasetToR:

  1. ncol <- ncol(rName); nrow <- nrow(rName); colnames <- colnames(rName)
  2. For each column: col <- rName[[i]]; if (is.numeric(col)) typedArray else if (is.logical(col)) numeric 0/1 else if (is.factor(col)) levels+codes else coerce to character
  3. Assemble MarshaledDataset with columns in original order.

Numeric: as.numeric then toTypedArray(). Factor: as.integer(col) - 1L for codes, levels(col) for levels. Character: as.factor(col) then marshal as categorical. Logical: coerce to numeric 0/1 (documented lossiness).
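
The per-column rules can be sketched on the TS side as a pure conversion. The RColumn and MarshaledColumn shapes below are hypothetical (the real MarshaledDataset layout is not shown here), NA handling is omitted, and plain code-unit sorting stands in for the locale-aware ordering that as.factor() applies in R.

```typescript
// Hypothetical shapes for values already pulled out of R.
type RColumn =
  | { rtype: 'numeric'; values: number[] }
  | { rtype: 'logical'; values: boolean[] }
  | { rtype: 'factor'; codes: number[]; levels: string[] } // codes are 1-based, as in R
  | { rtype: 'character'; values: string[] };

type MarshaledColumn =
  | { kind: 'numeric'; data: Float64Array }
  | { kind: 'categorical'; codes: Int32Array; levels: string[] };

function marshalColumn(col: RColumn): MarshaledColumn {
  switch (col.rtype) {
    case 'numeric':
      return { kind: 'numeric', data: Float64Array.from(col.values) };
    case 'logical':
      // Lossy by design: TRUE/FALSE become 1/0.
      return { kind: 'numeric', data: Float64Array.from(col.values, v => (v ? 1 : 0)) };
    case 'factor':
      // as.integer(col) - 1L: shift R's 1-based codes to 0-based.
      return { kind: 'categorical', codes: Int32Array.from(col.codes, c => c - 1), levels: col.levels };
    case 'character': {
      // Approximates as.factor(): sorted unique values become the levels.
      const levels = [...new Set(col.values)].sort();
      const index = new Map(levels.map((l, i) => [l, i] as const));
      return { kind: 'categorical', codes: Int32Array.from(col.values, v => index.get(v) ?? -1), levels };
    }
  }
}

const fromChars = marshalColumn({ rtype: 'character', values: ['b', 'a', 'b'] });
const fromLogical = marshalColumn({ rtype: 'logical', values: [true, false] });
```

In this example the character column yields levels a, b with codes 1, 0, 1, and the logical column yields 1, 0.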

8.3 Executor branch

src/core/pipeline/executor.ts adds a case for webr-opaque:

case 'webr-opaque': {
  // Resolve upstream inputs — each inputBinding either comes from a Dataset-producing
  // node (native or opaque-probe-succeeded) or from an opaque-binding handle.
  const inputs: Record<string, MarshaledDataset> = {};
  const carriedBindings: string[] = [];  // R names already live in env

  for (const name of node.params.inputBindings) {
    const upstream = findUpstreamNode(node, name);
    const result = upstream.result;
    if (result?.kind === 'dataset') {
      inputs[name] = toMarshaled(result.dataset);
    } else if (result?.kind === 'opaque-binding') {
      carriedBindings.push(result.binding);  // already bound under result.binding name
      // Re-alias if binding name differs from input port name
      if (result.binding !== name) rebind(name, result.binding);
    } else {
      throw new Error(`Cannot resolve input "${name}" for opaque node ${node.id}`);
    }
  }

  const payload = await workerManager.webrDispatchOpaque({
    nodeId: node.id,
    rSource: node.params.rSource,
    resultBinding: node.params.resultBinding,
    inputs,
    unassigned: node.params.unassigned,   // threaded from the AnalysisCall (no assignedTo)
  });

  node.result = payload;  // { kind: 'dataset', ... } | { kind: 'opaque-binding', ... } | { kind: 'side-effect', capturedText? }
  break;
}

8.4 Downstream consumer handling

When a TS-native node’s data input comes from an opaque upstream:

Consumer port validation is done once, per-pipeline-run, before dispatch. All impacted model/transform nodes are marked error before the executor starts so the user sees the whole affected region at once.
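
Marking the whole affected region up front could be done with a breadth-first walk over the edge list, as sketched below. How the initial set of failing consumers is chosen is not covered here; the function only shows the propagation step, and markImpacted and the edge shape are hypothetical.

```typescript
interface SimpleEdge { from: string; to: string }

// Returns the failing nodes plus everything reachable downstream of them.
function markImpacted(failing: string[], edges: SimpleEdge[]): Set<string> {
  const impacted = new Set<string>(failing);
  const queue = [...failing];
  while (queue.length > 0) {
    const id = queue.shift();
    if (id === undefined) break;
    for (const e of edges) {
      if (e.from === id && !impacted.has(e.to)) {
        impacted.add(e.to);
        queue.push(e.to);
      }
    }
  }
  return impacted;
}

const graph: SimpleEdge[] = [
  { from: 'opaque', to: 'lm' },
  { from: 'lm', to: 'table' },
  { from: 'csv', to: 'plot' },
];
const impacted = markImpacted(['lm'], graph);
```

Here lm and table are marked, while csv and plot are untouched because they do not depend on the failing consumer.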

8.5 Sticky handling deferred

Session 3 will implement automatic promotion of downstream TS-native nodes to WebR (re-dispatch lm as webr-typed lm when its data arg is an opaque-binding). That requires both the broom bridge and lm/glm marshalers. Out of scope here — this session errors clearly instead.

9. Worker protocol — full request/response

Request (unchanged shape, new unassigned flag):

{
  type: 'dispatch-opaque';
  id: string;
  rSource: string;
  resultBinding: string;         // e.g., '.n_abc123'
  inputs: Record<string, MarshaledDataset>;
  unassigned?: boolean;          // true for side-effect calls
}

Response (extended):

{
  type: 'dispatch-result';
  id: string;
  result: { kind: 'opaque'; payload: OpaquePayload };
}
// or
{
  type: 'dispatch-error';
  id: string;
  stage: 'bind-inputs' | 'eval' | 'probe' | 'marshal';
  error: string;
}

New probe stage covers is.data.frame failures (e.g., result is an unevaluated promise, a missing value, or an error object). marshal covers marshalDatasetFromR failures (e.g., unsupported column type like complex numbers — documented failure mode).
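
On the receiving side, the two response shapes narrow cleanly by their type and kind tags. The sketch below restates the types locally (with dataset typed as unknown in place of MarshaledDataset) and uses a hypothetical describeResponse helper to show the narrowing; it is not the worker manager's actual handler.

```typescript
type Stage = 'bind-inputs' | 'eval' | 'probe' | 'marshal';
type OpaquePayload =
  | { kind: 'dataset'; dataset: unknown } // MarshaledDataset in the spec
  | { kind: 'opaque-binding'; binding: string }
  | { kind: 'side-effect'; capturedText?: string };
type WorkerResponse =
  | { type: 'dispatch-result'; id: string; result: { kind: 'opaque'; payload: OpaquePayload } }
  | { type: 'dispatch-error'; id: string; stage: Stage; error: string };

function describeResponse(res: WorkerResponse): string {
  if (res.type === 'dispatch-error') {
    return `error at ${res.stage}: ${res.error}`;
  }
  const p = res.result.payload;
  switch (p.kind) {
    case 'dataset':
      return 'dataset marshaled';
    case 'opaque-binding':
      return `kept in R env as ${p.binding}`;
    case 'side-effect':
      return p.capturedText ? 'side effect with output' : 'side effect, no output';
  }
}

const probeFailure = describeResponse({
  type: 'dispatch-error', id: 'r1', stage: 'probe', error: 'object not found',
});
const kept = describeResponse({
  type: 'dispatch-result', id: 'r2',
  result: { kind: 'opaque', payload: { kind: 'opaque-binding', binding: '.n_abc123' } },
});
```

Carrying the stage on the error lets the UI distinguish a failed R evaluation from a result that evaluated fine but could not be probed or marshaled.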

10. CLAUDE.md No-Go List additions

11. Testing

11.1 Unit (Vitest)

11.2 Integration

11.3 Audit regression

audit-opaque.ts is rerun after implementation. Expected deltas:

| Metric | Before | After |
| --- | --- | --- |
| return occurrences | 490 | ≤ 5 |
| Total unsupported nodes | 10,753 | ≤ 3,000 (most become webr-opaque; the rest are ignorable scalars/side-effects) |
| webr-opaque nodes (new metric) | 0 | ≥ 500 |
| Models blocked (existing audit) | 154 | ≤ 100 |

Regression criterion: webr-opaque nodes appear in ≥ 30 of 83 packages (≥ 36% coverage for the opaque path).
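
These thresholds are easy to encode as a single check that the audit script could run. The sketch below is hypothetical (AuditMetrics and failedThresholds are not existing names), and the sample values are made up purely to exercise the function.

```typescript
interface AuditMetrics {
  returnOccurrences: number;
  unsupportedNodes: number;
  opaqueNodes: number;
  modelsBlocked: number;
  packagesWithOpaque: number;
}

// Returns the names of the thresholds that were missed; empty means pass.
function failedThresholds(m: AuditMetrics): string[] {
  const failures: string[] = [];
  if (m.returnOccurrences > 5) failures.push('return occurrences');
  if (m.unsupportedNodes > 3000) failures.push('total unsupported nodes');
  if (m.opaqueNodes < 500) failures.push('webr-opaque nodes');
  if (m.modelsBlocked > 100) failures.push('models blocked');
  if (m.packagesWithOpaque < 30) failures.push('package coverage');
  return failures;
}

// 30 of 83 packages is the stated floor, i.e. about 36% coverage.
const coverageFloor = 30 / 83;

// Made-up values for illustration only, not audit results.
const sample: AuditMetrics = {
  returnOccurrences: 3,
  unsupportedNodes: 2800,
  opaqueNodes: 450,
  modelsBlocked: 90,
  packagesWithOpaque: 31,
};
const failures = failedThresholds(sample);
```

With the sample above only the webr-opaque node count falls short, so failures contains that one entry.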

11.4 E2E (Playwright)

e2e/webr-opaque.spec.ts:

11.5 Validation — golden comparison

Paired with session 1’s validation pattern: pick one opaque path (readRDS of a CSV-equivalent .rds fixture) and assert the loaded Dataset matches a TS-native read_csv of the same content.

12. Shipping criteria

  1. All unit and integration tests pass.
  2. return audit count drops to ≤ 5.
  3. Audit regression thresholds met (§11.3).
  4. E2E test passes in Chromium.
  5. npm run build && npm test && npm run lint && npm run test:e2e green.
  6. Manual smoke: upload a replication package containing an unknown R function that returns a data.frame, verify opaque node renders with dashed border, Run completes, downstream lm shows coefficients.

13. Follow-up sessions

| Session | Item | Rationale |
| --- | --- | --- |
| 3 | Broom auto-typing bridge | Run broom::tidy(binding) on opaque results to auto-promote to typed regression; covers ~200 model classes via one generic path. |
| 3 | Sticky-opaque / auto-WebR promotion of downstream | When auto-marshal fails (opaque result is a list/model), re-dispatch downstream TS-native nodes as webr-typed. Requires lm/glm marshalers beyond the existing lm_robust. |
| 3 | Plot/SVG capture for side-effect opaque nodes | captureR with captureGraphics: true returns SVG for plot()/ggplot(). Needs a UI renderer (canvas or SVG embed), scaling, download. This session only captures stdout/stderr text. |
| 3 | UI “cast opaque to regression” | User-triggered broom extraction for nodes the probe couldn’t auto-type. |
| 3+ | Virtual FS artifact download | Files written by ggsave(), write.csv(), etc. currently die with the worker. Surface a “Download artifacts” panel that enumerates new files in a designated output dir. |
| M6 | Native-alias + native-implement bucket work | data.frame, as.data.frame, cbind, distinct, fixef, etc.; tracked on the opaque-inventory report as Wave 1/2 backlog. |
| Any | Per-function typed marshalers | readRDS, excel readers, sf::*; each is a ~15–30 min registerMarshaler() call with a known output shape. Follow the lm_robust pattern. |

14. Open questions