M5b: Loop/Lapply Expansion — Design Spec

Date: 2026-04-12
Milestone: 5b (Code Intelligence)
Depends on: M5a (multi-file infrastructure, function inlining)

Problem

Real replication packages iterate over specifications using for-loops and lapply. The recognizer currently silently skips ForNode (recognizer.ts:177–180). A census of 3,777 R files across 93 Top 5 journal replication packages found 346 estimation loops — for/lapply/sapply calls with lm/feols/felm/glm/ivreg/lm_robust in the body.

Goal

Expand for-loops and lapply/sapply/map calls into N pipeline nodes (one per iteration), feeding into existing group detection (outcome variants, specification variants, etc.).

Coverage Census

346 estimation loops classified by iterable pattern:

Pattern            Count  %    Example
A: Literal vector  22     6%   for (y in c("earn", "hrs"))
B: Named variable  134    39%  outcomes <- c(...); for (y in outcomes)
C: Numeric index   166    48%  for (i in 1:N) { lm(...outcomes[i]...) }
D: Grid/complex    24     7%   Nested loops, expand.grid

By loop type: 308 for-loops (89%), 38 lapply/sapply (11%). Of the 38, only 7 contain direct estimation calls — all are Category 1 (anonymous function, single iterable).

By estimation function: lm 208 (60%), feols 50 (14%), glm 38 (11%), lm_robust 25 (7%), felm 17 (5%), ivreg 8 (2%).

All four patterns (A+B+C+D) are in scope — 100% coverage of estimation loops.

Architecture: Recognizer-Level Expansion

Expansion happens inside the recognizer, reusing the inliner’s substitution infrastructure. A new loop-expander.ts module is called where the silent skip currently lives. This mirrors how inliner.ts is a separate module called from the recognizer.

Expansion Chain

ForNode
  → resolve iterable to concrete values
  → for each value:
      → clone body (deepCloneAndSubstitute)
      → substitute loop variable with literal
      → collapse subscripts (vec[3] → literal)
      → evaluate paste0/paste calls
      → resolve formula()/as.formula() calls
      → re-recognize via recognizeR()
      → annotate provenance
  → return AnalysisCall[]

Module: src/core/parsers/r/loop-expander.ts

Public API:

interface LoopExpansionResult {
  calls: AnalysisCall[];
  diagnostics: Diagnostic[];
}

function tryExpandLoop(
  forNode: ForNode,
  globalVectors: Map<string, VectorConstantInfo>,
  globalConstants: Map<string, ConstantInfo>,
  globalFunctions: Map<string, FunctionInfo>,
): LoopExpansionResult | null;

function tryExpandApply(
  callNode: FunctionCallNode,
  globalVectors: Map<string, VectorConstantInfo>,
  globalConstants: Map<string, ConstantInfo>,
  globalFunctions: Map<string, FunctionInfo>,
): LoopExpansionResult | null;

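tryExpandLoop returns null when the loop cannot be expanded, for example when the iterable is not statically resolvable. In that case the recognizer emits a diagnostic and skips the loop, producing no calls for it (the same outcome as today's skip). tryExpandApply desugars the call to a synthetic ForNode and delegates to tryExpandLoop.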

Scope Extension: Vector Constants

ConstantInfo (file-registry.ts) currently stores single string values. Add parallel tracking for vector assignments.

New type:

interface VectorConstantInfo {
  values: string[];
  sourceFile: string;
}

Collection: In the FileRegistry.addFile() top-level assignment scan, when the value is a VectorNode whose elements are all string or numeric literals, store it in globalVectors. Numeric literals are stored as strings (e.g., c(1, 2, 3) → ["1", "2", "3"]).

Scope threading: RecognizerScope gains globalVectors: Map<string, VectorConstantInfo>. Flows through FileRegistry → recognizer → loop expander, same path as globalConstants and globalFunctions.
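
The collection rule can be illustrated with a minimal sketch. The node shapes (LiteralNode, OtherNode, VectorNode) are simplified, hypothetical stand-ins for the parser's real AST types, and extractVectorConstant is an illustrative helper name, not an existing function in file-registry.ts.

```typescript
// Hypothetical, simplified AST shapes; the real parser's node types may differ.
interface LiteralNode { type: 'literal'; value: string | number; }
interface OtherNode { type: 'identifier' | 'call'; name: string; }
type ElementNode = LiteralNode | OtherNode;
interface VectorNode { type: 'vector'; elements: ElementNode[]; }

interface VectorConstantInfo {
  values: string[];
  sourceFile: string;
}

// Returns a VectorConstantInfo when every element is a string or numeric
// literal; returns null otherwise, meaning the assignment is not tracked.
function extractVectorConstant(
  node: VectorNode,
  sourceFile: string,
): VectorConstantInfo | null {
  const values: string[] = [];
  for (const el of node.elements) {
    if (el.type !== 'literal') return null; // any non-literal disqualifies the vector
    values.push(String(el.value));          // numeric literals are stored as strings
  }
  return { values, sourceFile };
}
```

Under these assumptions, c(1, 2, 3) yields values ["1", "2", "3"], while c("earn", x) is rejected because x is not a literal.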

Iterable Resolution

resolveIterable(node, globalVectors, globalConstants) → string[] | null

AST shape                                                  Resolution
VectorNode with all literal elements                       Extract values directly
IdentifierNode matching a globalVectors entry              Look up stored values
BinaryOpNode with : operator, both sides numeric literals  Generate range [start..end]
FunctionCallNode seq_len(N) / seq_along(vec)               Resolve N or vec length, generate [1..N]
FunctionCallNode length(vec) / nrow(grid)                  Resolve from known vectors / expand.grid decomposition
Anything else                                              Return null (not statically resolvable)

For 1:length(outcomes) — the : operator has a FunctionCallNode("length", [IdentifierNode("outcomes")]) on the RHS. Resolve outcomes from globalVectors, return length, generate range.
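
The resolution rules above can be sketched as follows. This is a simplified illustration, not the final implementation: RNode is a hypothetical stand-in for the parser's AST types, the globalConstants parameter and nrow(grid) handling are omitted, and descending ranges such as 3:1 are treated as unresolvable for brevity.

```typescript
// Hypothetical, simplified AST shapes; the real parser's node types may differ.
type RNode =
  | { type: 'literal'; value: string | number }
  | { type: 'identifier'; name: string }
  | { type: 'vector'; elements: RNode[] }
  | { type: 'binary'; op: string; left: RNode; right: RNode }
  | { type: 'call'; name: string; args: RNode[] };

type Vectors = Map<string, { values: string[] }>;

// Inclusive ascending range, rendered as strings like the stored vector values.
function range(start: number, end: number): string[] {
  const out: string[] = [];
  for (let i = start; i <= end; i++) out.push(String(i));
  return out;
}

// Resolves a range bound: an integer literal, or length(vec) for a known vector.
function resolveBound(node: RNode, vectors: Vectors): number | null {
  if (node.type === 'literal') {
    return typeof node.value === 'number' && Number.isInteger(node.value) ? node.value : null;
  }
  if (node.type === 'call' && node.name === 'length' && node.args.length === 1) {
    const arg = node.args[0];
    if (arg !== undefined && arg.type === 'identifier') {
      const vec = vectors.get(arg.name);
      return vec ? vec.values.length : null;
    }
  }
  return null;
}

function resolveIterable(node: RNode, vectors: Vectors): string[] | null {
  switch (node.type) {
    case 'vector': {
      const out: string[] = [];
      for (const el of node.elements) {
        if (el.type !== 'literal') return null;
        out.push(String(el.value));
      }
      return out;
    }
    case 'identifier':
      return vectors.get(node.name)?.values ?? null;
    case 'binary': {
      if (node.op !== ':') return null;
      const start = resolveBound(node.left, vectors);
      const end = resolveBound(node.right, vectors);
      // Simplification: only ascending ranges are handled in this sketch.
      if (start === null || end === null || start > end) return null;
      return range(start, end);
    }
    case 'call': {
      const arg = node.args[0];
      if (arg === undefined || node.args.length !== 1) return null;
      if (node.name === 'seq_len') {
        const n = resolveBound(arg, vectors);
        return n === null ? null : range(1, n);
      }
      if (node.name === 'seq_along' && arg.type === 'identifier') {
        const vec = vectors.get(arg.name);
        return vec ? range(1, vec.values.length) : null;
      }
      return null;
    }
    default:
      return null; // anything else is not statically resolvable
  }
}
```

With outcomes registered as a three-element vector, the iterable 1:length(outcomes) resolves to ["1", "2", "3"], and an unknown identifier resolves to null.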

Substitution & Subscript Collapse

Pattern A/B (string iteration)

Loop variable maps directly to a string value. Substitution map: { y → LiteralNode("earnings") }. Applied via deepCloneAndSubstitute() from inliner.ts (already handles all node types including ForNode).

Pattern C (numeric index)

After substituting i → LiteralNode(3), the AST contains SubsetNode(IdentifierNode("outcomes"), [LiteralNode(3)]). A subscript collapse pass rewrites this:

collapseSubscripts(node, globalVectors) — post-order AST walk:

  1. vec[i]: SubsetNode where object is an identifier in globalVectors and args[0] is a numeric literal → LiteralNode(values[i - 1]) (R is 1-indexed)
  2. vec[[i]]: Same handling (double-bracket is single-element extraction). Implementation note: verify that the R parser produces SubsetNode for [[ — if it produces a distinct node type, add a case for it

This pass runs after substitution, before evaluatePasteCalls, so collapsed strings feed into paste evaluation:

substitute loop var → collapse subscripts → evaluate paste → resolve formula → re-recognize
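
A minimal sketch of the subscript collapse pass, assuming a simplified, hypothetical RNode shape in which both vec[i] and vec[[i]] parse to the same 'subset' node (the implementation note above still applies to the real parser). Out-of-range or non-integer indices are left uncollapsed.

```typescript
// Hypothetical, simplified AST shapes; the real parser's node types may differ.
type RNode =
  | { type: 'literal'; value: string | number }
  | { type: 'identifier'; name: string }
  | { type: 'subset'; object: RNode; args: RNode[] }
  | { type: 'call'; name: string; args: RNode[] };

type Vectors = Map<string, { values: string[] }>;

// Post-order walk: children are rewritten first, then the node itself.
function collapseSubscripts(node: RNode, vectors: Vectors): RNode {
  switch (node.type) {
    case 'call':
      return { ...node, args: node.args.map((a) => collapseSubscripts(a, vectors)) };
    case 'subset': {
      const object = collapseSubscripts(node.object, vectors);
      const args = node.args.map((a) => collapseSubscripts(a, vectors));
      const index = args[0];
      if (
        object.type === 'identifier' &&
        args.length === 1 &&
        index !== undefined &&
        index.type === 'literal' &&
        typeof index.value === 'number'
      ) {
        // R is 1-indexed, so vec[3] reads values[2].
        const value = vectors.get(object.name)?.values[index.value - 1];
        if (value !== undefined) return { type: 'literal', value };
      }
      return { type: 'subset', object, args };
    }
    default:
      return node; // literals and identifiers are returned unchanged
  }
}
```

Because the walk is post-order, a call such as paste0("y_", outcomes[2]) comes out with a literal second argument, which is what the paste evaluation step needs.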

expand.grid Decomposition

Rather than tracking grid objects and computing cross-product indices, decompose expand.grid loops into nested ForNodes.

Detection: When resolving an iterable like 1:nrow(specs), check if specs was assigned from an expand.grid(...) call where all arguments are named with string-vector values.

Decomposition: Synthesize nested ForNodes from the grid columns:

# Original:
specs <- expand.grid(outcome = c("a","b"), estimator = c("ols","iv"))
for (i in 1:nrow(specs)) { lm(specs$outcome[i], specs$estimator[i]) }

# Synthesized:
for (outcome in c("a","b")) {
  for (estimator in c("ols","iv")) {
    lm(outcome, estimator)  # specs$outcome[i] → outcome, specs$estimator[i] → estimator
  }
}

The nested loops expand naturally via recursive re-recognition — outer loop expands, each expanded body contains the inner loop, re-recognition expands the inner loop. Existing group detection discovers the cross-product axes.

Body rewriting: When synthesizing nested loops, rewrite specs$col[i] references in the body to plain identifiers matching the grid column names. This is a targeted AST rewrite: find SubsetNode(DollarAccessNode(IdentifierNode("specs"), "col"), [IdentifierNode("i")]) and replace with IdentifierNode("col").

Detection point: In the recognizer’s top-level statement walk, when we see specs <- expand.grid(...), parse the named arguments and store in a local expandGrids: Map<string, Map<string, string[]>> map (variable name → column name → values). When we later encounter for (i in 1:nrow(specs)), look up specs in this map to trigger decomposition. This map is local to the recognizer invocation, not persisted in RecognizerScope.
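
The decomposition and body rewrite can be sketched together. RNode is a simplified, hypothetical AST shape, and rewriteGridRefs and decomposeGridLoop are illustrative names; the columns argument corresponds to one entry of the expandGrids map described above.

```typescript
// Hypothetical, simplified AST shapes; the real parser's node types may differ.
type RNode =
  | { type: 'literal'; value: string }
  | { type: 'identifier'; name: string }
  | { type: 'dollar'; object: RNode; field: string }
  | { type: 'subset'; object: RNode; args: RNode[] }
  | { type: 'call'; name: string; args: RNode[] }
  | { type: 'vector'; elements: RNode[] }
  | { type: 'for'; variable: string; iterable: RNode; body: RNode };

// Rewrites every grid$col[index] reference to the plain identifier col.
function rewriteGridRefs(node: RNode, grid: string, index: string): RNode {
  const recur = (n: RNode): RNode => rewriteGridRefs(n, grid, index);
  switch (node.type) {
    case 'subset': {
      const { object, args } = node;
      const arg = args[0];
      if (
        object.type === 'dollar' &&
        object.object.type === 'identifier' &&
        object.object.name === grid &&
        args.length === 1 &&
        arg !== undefined &&
        arg.type === 'identifier' &&
        arg.name === index
      ) {
        return { type: 'identifier', name: object.field };
      }
      return { type: 'subset', object: recur(object), args: args.map(recur) };
    }
    case 'dollar':
      return { ...node, object: recur(node.object) };
    case 'call':
      return { ...node, args: node.args.map(recur) };
    case 'vector':
      return { ...node, elements: node.elements.map(recur) };
    case 'for':
      return { ...node, iterable: recur(node.iterable), body: recur(node.body) };
    default:
      return node;
  }
}

// Wraps the rewritten body in one ForNode per grid column.
// The first column becomes the outermost loop.
function decomposeGridLoop(
  body: RNode,
  gridName: string,
  indexVar: string,
  columns: Map<string, string[]>,
): RNode {
  let result = rewriteGridRefs(body, gridName, indexVar);
  const entries = Array.from(columns.entries()).reverse(); // build innermost loop first
  for (const [col, values] of entries) {
    result = {
      type: 'for',
      variable: col,
      iterable: { type: 'vector', elements: values.map((value): RNode => ({ type: 'literal', value })) },
      body: result,
    };
  }
  return result;
}
```

One caveat with this nesting: R's expand.grid varies the first factor fastest, so the nested loops visit the same combinations as the original index loop but in a different order. That should not matter for group detection, but it would affect the iteration field in provenance if callers compare it against the original row index.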

Lapply/Sapply/Map Desugaring

Match function calls in the recognizer where:

  - Name is lapply, sapply, vapply, map, purrr::map, map_df, map_dfr
  - First argument is the iterable
  - An argument is a FunctionDefNode (anonymous function)

Desugaring to synthetic ForNode:

// lapply(outcomes, function(y) { lm(... y ...) })
//   → ForNode { variable: "y", iterable: outcomes, body: funcDef.body }
const syntheticFor: ForNode = {
  type: 'for',
  variable: funcDef.params[0].name,
  iterable: callNode.args[0].value,  // the iterable expression
  body: funcDef.body,
  span: callNode.span,
};
return tryExpandLoop(syntheticFor, globalVectors, globalConstants, globalFunctions);

Then the standard expansion path handles everything.

Not in scope: map2 (zero estimation calls in census), ... args forwarding (zero occurrences), named function as FUN argument (zero occurrences).

Provenance

Each expanded AnalysisCall gets:

sourceLoop?: {
  variable: string;      // "y", "i"
  iteration: number;     // 0-based
  value: string;         // "earnings", "2"
  type: 'for' | 'lapply' | 'sapply' | 'map';
}

Plus sourceSpan set to the ForNode’s span (or the lapply call’s span for desugared calls).
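
A minimal sketch of the annotation step, assuming it runs once per iteration on the calls returned by re-recognition. annotateProvenance and LoopKind are hypothetical names, and the generic parameter T stands in for the real AnalysisCall type.

```typescript
type LoopKind = 'for' | 'lapply' | 'sapply' | 'map';

interface SourceLoop {
  variable: string;   // loop variable name, e.g. "y" or "i"
  iteration: number;  // 0-based position in the resolved iterable
  value: string;      // resolved value for this iteration, e.g. "earnings" or "2"
  type: LoopKind;
}

// Returns copies of the calls recognized for one iteration, each tagged with
// the loop it came from. The input objects are not mutated.
function annotateProvenance<T extends object>(
  calls: T[],
  variable: string,
  iteration: number,
  value: string,
  type: LoopKind,
): Array<T & { sourceLoop: SourceLoop }> {
  return calls.map((call) => ({
    ...call,
    sourceLoop: { variable, iteration, value, type },
  }));
}
```

Returning copies rather than mutating keeps the sketch safe if the same recognized call objects are ever shared between iterations; whether the real expander mutates in place is an open implementation choice.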

Guard Checks

Return null (unexpandable, silent skip + diagnostic) when:

Condition                  Diagnostic message
Iterable not resolvable    "For-loop iterable 'x' is not statically resolvable"
>200 iterations            "For-loop exceeds 200 iterations, skipping"
Body has >200 statements   "For-loop body too large, skipping"
Body contains NSE markers  "For-loop body uses eval/do.call, skipping"

No diagnostic for successful expansion (normal path, no noise). Info-level diagnostic when expand.grid is decomposed into nested loops.
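
The guard table can be sketched as a single check that runs before any cloning. The Diagnostic shape, the 'warning' level, and the LoopFacts summary are assumptions made for illustration; only the thresholds and messages come from the table above. Note that checkGuards returns a diagnostic when the loop must be skipped and null when expansion may proceed.

```typescript
// Hypothetical diagnostic shape; the project's real Diagnostic type may differ.
interface Diagnostic {
  level: 'info' | 'warning';
  message: string;
}

// Hypothetical summary of a loop, gathered before any cloning or substitution.
interface LoopFacts {
  iterableSource: string;     // source text of the iterable, used in messages
  values: string[] | null;    // resolved iterable values, or null if unresolved
  bodyStatementCount: number;
  calledFunctions: string[];  // names of all functions called in the body
}

const MAX_ITERATIONS = 200;
const MAX_BODY_STATEMENTS = 200;
const NSE_MARKERS = new Set(['eval', 'do.call']);

// Checks the guards in table order and reports the first one that fails.
function checkGuards(facts: LoopFacts): Diagnostic | null {
  const skip = (message: string): Diagnostic => ({ level: 'warning', message });
  if (facts.values === null) {
    return skip(`For-loop iterable '${facts.iterableSource}' is not statically resolvable`);
  }
  if (facts.values.length > MAX_ITERATIONS) {
    return skip(`For-loop exceeds ${MAX_ITERATIONS} iterations, skipping`);
  }
  if (facts.bodyStatementCount > MAX_BODY_STATEMENTS) {
    return skip('For-loop body too large, skipping');
  }
  if (facts.calledFunctions.some((fn) => NSE_MARKERS.has(fn))) {
    return skip('For-loop body uses eval/do.call, skipping');
  }
  return null; // all guards passed; the loop may be expanded
}
```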

Integration Points

Recognizer (recognizer.ts)

Replace the silent skip at lines 177–180:

} else if (node.type === 'for') {
  const result = tryExpandLoop(node, /* scope fields */);
  if (result) {
    for (const call of result.calls) {
      if (assignedTo) call.assignedTo = assignedTo;
      calls.push(call);
      if (call.assignedTo) scope.set(call.assignedTo, call);
    }
    diagnostics.push(...result.diagnostics);
  }
  return;
}

For lapply/sapply/map, add a check in recognizeFunctionCall before the existing dispatch. Note: recognizeFunctionCall already receives externalScope (which contains globalVectors); supporting lapply over grid iterables would additionally require passing in the local expandGrids map. The initial implementation needs only externalScope.

if (['lapply', 'sapply', 'vapply', 'map', 'map_df', 'map_dfr'].includes(name)) {
  const result = tryExpandApply(node, /* scope fields from externalScope */);
  if (result) return result.calls;
}

FileRegistry (file-registry.ts)

Add globalVectors collection in addFile() alongside stringConstants:

} else if (value.type === 'vector' && value.elements.every(isStringOrNumericLiteral)) {
  entry.vectorConstants.set(name, extractLiterals(value));
}

Thread through to RecognizerScope.

RecognizerScope

export interface RecognizerScope {
  globalFunctions: Map<string, FunctionInfo>;
  globalConstants: Map<string, ConstantInfo>;
  globalVectors: Map<string, VectorConstantInfo>;  // NEW
}

Inliner (inliner.ts)

No changes to the inliner itself. deepCloneAndSubstitute, evaluatePasteCalls, resolveFormulaCalls, and reconstructSource are imported and reused by the loop expander.

Group Detection

No changes. Expanded nodes are independent AnalysisCall objects with different outcome/RHS/data params. Existing detectOutcomeVariants, detectSpecificationVariants, etc. auto-group them.

Explicit Non-Goals

  1. No recursive loop expansion: expanded body goes through single-level re-recognition only
  2. No map2/parallel iteration: zero estimation usage in census
  3. No ... args forwarding: zero occurrences
  4. No dynamic iterable resolution: unique(df$group), function return values, etc. are skipped
  5. No list-element binding: results <- lapply(...) doesn’t connect expanded calls to downstream stargazer(results[[1]])
  6. No while loops: not used for specification iteration
  7. No grid object storage: expand.grid is decomposed to nested loops, not tracked as a persistent object

Testing Strategy

Unit tests (loop-expander.test.ts)

Integration tests (pipeline/integration.test.ts)

E2E tests

Validates Against

Expanded loops from at least 3 reference papers covering patterns A–D. The 346-loop census provides ground truth for coverage measurement.