Date: 2026-04-12
Milestone: 5b (Code Intelligence)
Depends on: M5a (multi-file infrastructure, function inlining)

Real replication packages iterate over specifications using for-loops and lapply. The recognizer currently skips ForNode silently (recognizer.ts:177–180). A census of 3,777 R files across 93 Top 5 journal replication packages found 346 estimation loops: for/lapply/sapply calls with lm/feols/felm/glm/ivreg/lm_robust in the body.

Expand for-loops and lapply/sapply/map calls into N pipeline nodes (one per iteration), feeding into existing group detection (outcome variants, specification variants, etc.).
346 estimation loops classified by iterable pattern:
| Pattern | Count | % | Example |
|---|---|---|---|
| A: Literal vector | 22 | 6% | for (y in c("earn", "hrs")) |
| B: Named variable | 134 | 39% | outcomes <- c(...); for (y in outcomes) |
| C: Numeric index | 166 | 48% | for (i in 1:N) { lm(...outcomes[i]...) } |
| D: Grid/complex | 24 | 7% | Nested loops, expand.grid |
By loop type: 308 for-loops (89%), 38 lapply/sapply (11%). Of the 38, only 7 contain direct estimation calls — all are Category 1 (anonymous function, single iterable).
By estimation function: lm 208 (60%), feols 50 (14%), glm 38 (11%), lm_robust 25 (7%), felm 17 (5%), ivreg 8 (2%).
All four patterns (A+B+C+D) are in scope — 100% coverage of estimation loops.
Expansion happens inside the recognizer, reusing the inliner’s
substitution infrastructure. A new loop-expander.ts module
is called where the silent skip currently lives. This mirrors how
inliner.ts is a separate module called from the
recognizer.

    ForNode
      → resolve iterable to concrete values
      → for each value:
          → clone body (deepCloneAndSubstitute)
          → substitute loop variable with literal
          → collapse subscripts (vec[3] → literal)
          → evaluate paste0/paste calls
          → resolve formula()/as.formula() calls
          → re-recognize via recognizeR()
          → annotate provenance
      → return AnalysisCall[]

src/core/parsers/r/loop-expander.ts

Public API:

    interface LoopExpansionResult {
      calls: AnalysisCall[];
      diagnostics: Diagnostic[];
    }

    function tryExpandLoop(
      forNode: ForNode,
      globalVectors: Map<string, VectorConstantInfo>,
      globalConstants: Map<string, ConstantInfo>,
      globalFunctions: Map<string, FunctionInfo>,
      expansionCounter?: { count: number }, // cumulative across nesting levels, cap at 200
    ): LoopExpansionResult | null;

    function tryExpandApply(
      callNode: FunctionCallNode,
      globalVectors: Map<string, VectorConstantInfo>,
      globalConstants: Map<string, ConstantInfo>,
      globalFunctions: Map<string, FunctionInfo>,
      expansionCounter?: { count: number },
    ): LoopExpansionResult | null;

tryExpandLoop returns null when the iterable is not statically resolvable; in that case the recognizer emits a diagnostic and skips the loop, as it does today but no longer silently. tryExpandApply desugars to a synthetic ForNode and delegates to tryExpandLoop.

ConstantInfo (file-registry.ts) currently stores single
string values. Add parallel tracking for vector assignments.
New type:

    interface VectorConstantInfo {
      values: string[];
      sourceFile: string;
    }

Collection: In the FileRegistry.addFile() top-level assignment scan, when the value is a VectorNode whose elements are all string or numeric literals, store it in globalVectors. Numeric literals are stored as strings (e.g., c(1, 2, 3) → ["1", "2", "3"]).

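A minimal sketch of the collection step, under stated assumptions: the RLiteral, RIdentifier, and RVector shapes are simplified stand-ins for the parser's real node types, and this extractLiterals is a hypothetical version that returns null instead of relying on a separate isStringOrNumericLiteral check.

```typescript
// Simplified, hypothetical node shapes for illustration only.
type RLiteral = { type: 'literal'; value: string | number | boolean | null };
type RIdentifier = { type: 'identifier'; name: string };
type RVector = { type: 'vector'; elements: Array<RLiteral | RIdentifier> };

interface VectorConstantInfo {
  values: string[];
  sourceFile: string;
}

// Returns the vector's values as strings, or null if any element is not a
// string or numeric literal (for example an identifier, TRUE, or NULL).
function extractLiterals(vec: RVector): string[] | null {
  const collected: string[] = [];
  for (const el of vec.elements) {
    if (el.type !== 'literal') return null;
    if (typeof el.value !== 'string' && typeof el.value !== 'number') return null;
    collected.push(String(el.value)); // numeric literals are stored as strings
  }
  return collected;
}

// Usage: register c(1, 2, 3) assigned to a variable named "cutoffs".
const globalVectors = new Map<string, VectorConstantInfo>();
const parsedVector: RVector = {
  type: 'vector',
  elements: [1, 2, 3].map((n): RLiteral => ({ type: 'literal', value: n })),
};
const cutoffValues = extractLiterals(parsedVector);
if (cutoffValues !== null) {
  globalVectors.set('cutoffs', { values: cutoffValues, sourceFile: 'main.R' });
}
```

Vectors with any non-literal element are simply not registered, so a later loop over them falls through to the "not statically resolvable" path.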
Scope threading: RecognizerScope gains
globalVectors: Map<string, VectorConstantInfo>. Flows
through FileRegistry → recognizer → loop expander, same
path as globalConstants and
globalFunctions.
resolveIterable(node, globalVectors, globalConstants) →
string[] | null
| AST shape | Resolution |
|---|---|
| VectorNode with all literal elements | Extract values directly |
| IdentifierNode matching a globalVectors entry | Look up stored values |
| BinaryOpNode with : operator, both sides numeric literals | Generate range [start..end] |
| FunctionCallNode seq_len(N) / seq_along(vec) | Resolve N or vec length, generate [1..N] |
| FunctionCallNode length(vec) / nrow(grid) | Resolve from known vectors / expand.grid decomposition |
| Anything else | Return null (not statically resolvable) |

For 1:length(outcomes), the : operator has a FunctionCallNode("length", [IdentifierNode("outcomes")]) on its right-hand side. Resolve outcomes from globalVectors, take its length, and generate the range.
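
The resolution rules above can be sketched as follows. This is an illustrative sketch, not the module's implementation: the RNode shapes and the helpers resolveCount and intRange are assumptions, globalConstants is omitted, and nrow(grid) is left to the expand.grid decomposition.

```typescript
// Simplified, hypothetical node shapes; the parser's real AST may differ.
type RNode =
  | { type: 'literal'; value: string | number }
  | { type: 'identifier'; name: string }
  | { type: 'vector'; elements: RNode[] }
  | { type: 'binary'; op: string; left: RNode; right: RNode }
  | { type: 'call'; name: string; args: RNode[] };

type Vectors = Map<string, { values: string[] }>;

// Resolve an expression to an integer, or null if it is not static.
function resolveCount(node: RNode, vectors: Vectors): number | null {
  if (node.type === 'literal' && typeof node.value === 'number') {
    return Number.isInteger(node.value) ? node.value : null;
  }
  if (node.type === 'call' && node.name === 'length' && node.args.length === 1) {
    const arg = node.args[0];
    const known = arg.type === 'identifier' ? vectors.get(arg.name) : undefined;
    return known ? known.values.length : null;
  }
  return null;
}

// Inclusive integer range as strings; counts down when start > end, like R's 5:1.
function intRange(start: number, end: number): string[] {
  const out: string[] = [];
  const step = start <= end ? 1 : -1;
  for (let i = start; i !== end + step; i += step) out.push(String(i));
  return out;
}

function resolveIterable(node: RNode, vectors: Vectors): string[] | null {
  switch (node.type) {
    case 'vector': {
      const out: string[] = [];
      for (const el of node.elements) {
        if (el.type !== 'literal') return null;
        out.push(String(el.value));
      }
      return out;
    }
    case 'identifier': {
      const known = vectors.get(node.name);
      return known ? known.values.slice() : null;
    }
    case 'binary': {
      if (node.op !== ':') return null;
      const start = resolveCount(node.left, vectors);
      const end = resolveCount(node.right, vectors);
      return start === null || end === null ? null : intRange(start, end);
    }
    case 'call': {
      if (node.args.length !== 1) return null;
      if (node.name === 'seq_len') {
        const n = resolveCount(node.args[0], vectors);
        if (n === null || n < 0) return null;
        return n === 0 ? [] : intRange(1, n);
      }
      if (node.name === 'seq_along') {
        const arg = node.args[0];
        const known = arg.type === 'identifier' ? vectors.get(arg.name) : undefined;
        if (!known) return null;
        return known.values.length === 0 ? [] : intRange(1, known.values.length);
      }
      return null;
    }
    default:
      return null; // anything else is not statically resolvable
  }
}
```

One design point the sketch makes visible: 1:length(vec) on an empty vector counts down from 1 to 0 in R, whereas seq_along(vec) yields an empty sequence, so the two forms are kept as separate cases.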
For literal and named-vector iterables (patterns A and B), the loop variable maps directly to a string value. The substitution map { y → LiteralNode("earnings") } is applied via deepCloneAndSubstitute() from inliner.ts (which already handles all node types, including ForNode).
For numeric-index loops (pattern C), after substituting i → LiteralNode(3), the AST contains SubsetNode(IdentifierNode("outcomes"), [LiteralNode(3)]). A subscript collapse pass rewrites this to a literal:
collapseSubscripts(node, globalVectors) is a post-order AST walk:

- vec[i]: SubsetNode where object is an identifier in globalVectors and args[0] is a numeric literal → LiteralNode(values[i - 1]) (R is 1-indexed)
- vec[[i]]: Same handling (double-bracket is single-element extraction). Implementation note: verify that the R parser produces SubsetNode for [[; if it produces a distinct node type, add a case for it

This pass runs after substitution, before evaluatePasteCalls, so collapsed strings feed into paste evaluation:

    substitute loop var → collapse subscripts → evaluate paste → resolve formula → re-recognize

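The collapse pass might look roughly like this. It is a sketch under assumptions: the RNode shapes are simplified stand-ins for the real SubsetNode and friends, only call and subset nodes are walked, and leaving an out-of-range index untouched is an assumed policy that the text above does not specify.

```typescript
// Simplified, hypothetical node shapes; the real SubsetNode may differ.
type RNode =
  | { type: 'literal'; value: string | number }
  | { type: 'identifier'; name: string }
  | { type: 'subset'; object: RNode; args: RNode[] }
  | { type: 'call'; name: string; args: RNode[] };

type Vectors = Map<string, { values: string[] }>;

// Post-order walk: children are collapsed first, then the node itself.
function collapseSubscripts(node: RNode, vectors: Vectors): RNode {
  if (node.type === 'call') {
    return { ...node, args: node.args.map((a) => collapseSubscripts(a, vectors)) };
  }
  if (node.type !== 'subset') return node;

  const object = collapseSubscripts(node.object, vectors);
  const args = node.args.map((a) => collapseSubscripts(a, vectors));
  const index = args[0];
  if (
    args.length === 1 &&
    object.type === 'identifier' &&
    index.type === 'literal' &&
    typeof index.value === 'number' &&
    Number.isInteger(index.value)
  ) {
    const known = vectors.get(object.name);
    // R is 1-indexed, so element i lives at values[i - 1].
    const value = known ? known.values[index.value - 1] : undefined;
    if (value !== undefined) return { type: 'literal', value };
  }
  // Unknown vector or out-of-range index (assumed policy): leave unchanged.
  return { type: 'subset', object, args };
}
```

Returning a new tree rather than mutating in place keeps the pass safe to run on a body that was cloned once per iteration.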
expand.grid Decomposition

Rather than tracking grid objects and computing cross-product indices, decompose expand.grid loops into nested ForNodes.

Detection: When resolving an iterable like
1:nrow(specs), check if specs was assigned
from an expand.grid(...) call where all arguments are named
with string-vector values.
Decomposition: Synthesize nested ForNodes from the grid columns:

    # Original:
    specs <- expand.grid(outcome = c("a","b"), estimator = c("ols","iv"))
    for (i in 1:nrow(specs)) { lm(specs$outcome[i], specs$estimator[i]) }

    # Synthesized:
    for (outcome in c("a","b")) {
      for (estimator in c("ols","iv")) {
        lm(outcome, estimator)  # specs$outcome[i] → outcome, specs$estimator[i] → estimator
      }
    }

The nested loops expand naturally via recursive re-recognition: the outer loop expands, each expanded body contains the inner loop, and re-recognition expands the inner loop. Existing group detection discovers the cross-product axes. Note that expand.grid varies its first argument fastest, so the synthesized nesting visits the same combinations in a different order than the original rows; this matters only where iteration order is significant.

Body rewriting: When synthesizing nested loops,
rewrite specs$col[i] references in the body to plain
identifiers matching the grid column names. This is a targeted AST
rewrite: find
SubsetNode(DollarAccessNode(IdentifierNode("specs"), "col"), [IdentifierNode("i")])
and replace with IdentifierNode("col").
Detection point: In the recognizer’s top-level
statement walk, when we see specs <- expand.grid(...),
parse the named arguments and store in a local
expandGrids: Map<string, Map<string, string[]>>
map (variable name → column name → values). When we later encounter
for (i in 1:nrow(specs)), look up specs in
this map to trigger decomposition. This map is local to the recognizer
invocation, not persisted in RecognizerScope.
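
As a rough illustration of the body rewrite and loop synthesis described above, the following sketch works on simplified, hypothetical node shapes (RNode, with 'dollar' standing in for DollarAccessNode). The function names rewriteGridRefs and decomposeGridLoop are assumptions for this example, and GridColumns corresponds to one inner entry of the expandGrids map.

```typescript
// Simplified, hypothetical node shapes; the parser's real AST may differ.
type RNode =
  | { type: 'literal'; value: string }
  | { type: 'identifier'; name: string }
  | { type: 'vector'; elements: RNode[] }
  | { type: 'dollar'; object: RNode; field: string }
  | { type: 'subset'; object: RNode; args: RNode[] }
  | { type: 'call'; name: string; args: RNode[] }
  | { type: 'for'; variable: string; iterable: RNode; body: RNode[] };

type GridColumns = Map<string, string[]>; // column name → values

// Rewrite grid$col[indexVar] into the plain identifier col.
function rewriteGridRefs(node: RNode, grid: string, indexVar: string, cols: GridColumns): RNode {
  const recur = (n: RNode): RNode => rewriteGridRefs(n, grid, indexVar, cols);
  switch (node.type) {
    case 'subset': {
      const obj = node.object;
      const idx = node.args[0];
      if (
        node.args.length === 1 &&
        obj.type === 'dollar' &&
        obj.object.type === 'identifier' &&
        obj.object.name === grid &&
        cols.has(obj.field) &&
        idx.type === 'identifier' &&
        idx.name === indexVar
      ) {
        return { type: 'identifier', name: obj.field };
      }
      return { type: 'subset', object: recur(obj), args: node.args.map(recur) };
    }
    case 'dollar':
      return { type: 'dollar', object: recur(node.object), field: node.field };
    case 'call':
      return { type: 'call', name: node.name, args: node.args.map(recur) };
    case 'vector':
      return { type: 'vector', elements: node.elements.map(recur) };
    case 'for':
      return { type: 'for', variable: node.variable, iterable: recur(node.iterable), body: node.body.map(recur) };
    default:
      return node;
  }
}

// Wrap the rewritten body in one loop per grid column, first column outermost.
function decomposeGridLoop(body: RNode[], grid: string, indexVar: string, cols: GridColumns): RNode[] {
  let current: RNode[] = body.map((n) => rewriteGridRefs(n, grid, indexVar, cols));
  const columnNames = Array.from(cols.keys()).reverse();
  for (const column of columnNames) {
    const columnValues = cols.get(column) || [];
    const iterable: RNode = {
      type: 'vector',
      elements: columnValues.map((v): RNode => ({ type: 'literal', value: v })),
    };
    current = [{ type: 'for', variable: column, iterable, body: current }];
  }
  return current;
}
```

Because the result is ordinary nested for nodes with literal-vector iterables, no grid-specific logic is needed downstream; the regular expansion path handles each level.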
Match function calls in the recognizer where:

- Name is lapply, sapply, vapply, map, purrr::map, map_df, map_dfr
- First argument is the iterable
- An argument is a FunctionDefNode (anonymous function)

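
The matching conditions listed above could be checked along these lines. The node shapes and the matchApplyCall helper are hypothetical, and the sketch assumes the parser reports namespaced calls with the full name purrr::map.

```typescript
// Simplified, hypothetical node shapes; the parser's real AST may differ.
type RNode =
  | { type: 'identifier'; name: string }
  | { type: 'funcdef'; params: { name: string }[]; body: RNode[] }
  | { type: 'call'; name: string; args: { name?: string; value: RNode }[] };

const APPLY_NAMES = ['lapply', 'sapply', 'vapply', 'map', 'purrr::map', 'map_df', 'map_dfr'];

interface ApplyMatch {
  iterable: RNode;  // first argument of the apply call
  variable: string; // first parameter of the anonymous function
  body: RNode[];    // body of the anonymous function
}

// Returns the pieces needed to build a synthetic for-loop, or null when the
// call is not an apply-style call with an anonymous function argument.
function matchApplyCall(node: RNode): ApplyMatch | null {
  if (node.type !== 'call') return null;
  if (APPLY_NAMES.indexOf(node.name) === -1) return null;
  if (node.args.length < 2) return null;

  const iterable = node.args[0].value;
  for (const arg of node.args.slice(1)) {
    const fn = arg.value;
    if (fn.type === 'funcdef' && fn.params.length >= 1) {
      return { iterable, variable: fn.params[0].name, body: fn.body };
    }
  }
  return null; // e.g. a named function passed as FUN, which is out of scope
}
```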
Desugaring to synthetic ForNode:

    // lapply(outcomes, function(y) { lm(... y ...) })
    // → ForNode { variable: "y", iterable: outcomes, body: funcDef.body }
    const syntheticFor: ForNode = {
      type: 'for',
      variable: funcDef.params[0].name,
      iterable: callNode.args[0].value, // the iterable expression
      body: funcDef.body,
      span: callNode.span,
    };
    // Forward expansionCounter so the 200-call cap stays cumulative.
    return tryExpandLoop(syntheticFor, globalVectors, globalConstants, globalFunctions, expansionCounter);

Then the standard expansion path handles everything.

Not in scope: map2 (zero estimation
calls in census), ... args forwarding (zero occurrences),
named function as FUN argument (zero occurrences).
Each expanded AnalysisCall gets:

    sourceLoop?: {
      variable: string;   // "y", "i"
      iteration: number;  // 0-based
      value: string;      // "earnings", "2"
      type: 'for' | 'lapply' | 'sapply' | 'map';
    }

Plus sourceSpan set to the ForNode’s span (or the lapply call’s span for desugared calls).

Return null (unexpandable; the loop is skipped and a diagnostic is emitted) when:
| Condition | Diagnostic message |
|---|---|
| Iterable not resolvable | "For-loop iterable 'x' is not statically resolvable" |
| >200 iterations in single loop | "For-loop exceeds 200 iterations, skipping" |
| Cumulative expanded calls exceed 200 | "Loop expansion cap reached (200 total calls), skipping remaining iterations" |
| Body has >200 statements | "For-loop body too large, skipping" |
| Body contains NSE markers | "For-loop body uses eval/do.call, skipping" |
No diagnostic for successful expansion (normal path, no noise).
Info-level diagnostic when expand.grid is decomposed into
nested loops.
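
A sketch of how the size checks in the table might be applied before expanding. Everything here is an assumption for illustration: planExpansion and ExpansionPlan are hypothetical names, the NSE check is omitted, and the sketch counts one produced call per iteration, whereas the real counter tracks produced calls.

```typescript
const MAX_EXPANSIONS = 200;

// Hypothetical result shape: how many iterations to expand, plus an optional message.
interface ExpansionPlan {
  iterations: number;
  diagnostic: string | null;
}

function planExpansion(
  resolved: string[] | null,  // result of resolving the iterable
  iterableText: string,       // source text of the iterable, for the message
  bodyStatementCount: number,
  counter: { count: number }, // cumulative across nesting levels
): ExpansionPlan {
  if (resolved === null) {
    return { iterations: 0, diagnostic: `For-loop iterable '${iterableText}' is not statically resolvable` };
  }
  if (resolved.length > MAX_EXPANSIONS) {
    return { iterations: 0, diagnostic: 'For-loop exceeds 200 iterations, skipping' };
  }
  if (bodyStatementCount > MAX_EXPANSIONS) {
    return { iterations: 0, diagnostic: 'For-loop body too large, skipping' };
  }
  // Assumption: one produced call per iteration.
  const remaining = Math.max(0, MAX_EXPANSIONS - counter.count);
  if (resolved.length > remaining) {
    return {
      iterations: remaining,
      diagnostic: 'Loop expansion cap reached (200 total calls), skipping remaining iterations',
    };
  }
  return { iterations: resolved.length, diagnostic: null }; // normal path, no noise
}

// Usage: the caller advances the shared counter after expanding.
const sharedCounter = { count: 199 };
const expansionPlan = planExpansion(['a', 'b', 'c'], 'outcomes', 4, sharedCounter);
sharedCounter.count += expansionPlan.iterations;
```

Sharing one counter object across nesting levels is what keeps a 20 × 20 nested loop from producing 400 calls: the inner expansions exhaust the budget and later iterations are dropped with a single diagnostic.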
Replace the silent skip at recognizer.ts:177–180:

    } else if (node.type === 'for') {
      const result = tryExpandLoop(node, /* scope fields */);
      if (result) {
        for (const call of result.calls) {
          if (assignedTo) call.assignedTo = assignedTo;
          calls.push(call);
          if (call.assignedTo) scope.set(call.assignedTo, call);
        }
        diagnostics.push(...result.diagnostics);
      }
      return;
    }

For lapply/sapply/map, add a check in recognizeFunctionCall before the existing dispatch.

Note: recognizeFunctionCall already receives externalScope. It would additionally need the local expandGrids map to support lapply over grid iterables. For the initial implementation, only externalScope (which contains globalVectors) is needed.

    if (['lapply', 'sapply', 'vapply', 'map', 'map_df', 'map_dfr'].includes(name)) {
      const result = tryExpandApply(node, /* scope fields from externalScope */);
      if (result) return result.calls;
    }

Add globalVectors collection in addFile() alongside stringConstants:

    } else if (value.type === 'vector' && value.elements.every(isStringOrNumericLiteral)) {
      entry.vectorConstants.set(name, extractLiterals(value));
    }

Thread through to RecognizerScope.

    export interface RecognizerScope {
      globalFunctions: Map<string, FunctionInfo>;
      globalConstants: Map<string, ConstantInfo>;
      globalVectors: Map<string, VectorConstantInfo>; // NEW
    }

No changes to the inliner itself.
deepCloneAndSubstitute, evaluatePasteCalls,
resolveFormulaCalls, and reconstructSource are
imported and reused by the loop expander.

Group detection needs no changes. Expanded nodes are independent AnalysisCall objects with different outcome/RHS/data params, so the existing detectOutcomeVariants, detectSpecificationVariants, etc. group them automatically.

Limits and exclusions:

- The expansionCounter caps total produced calls at 200 across all nesting levels to prevent combinatorial explosion
- map2/parallel iteration: zero estimation usage in census
- ... args forwarding: zero occurrences
- unique(df$group), function return values, etc. are skipped
- results <- lapply(...) doesn’t connect expanded calls to downstream stargazer(results[[1]])
- while loops: not used for specification iteration
- expand.grid is decomposed to nested loops, not tracked as a persistent object

Test cases:

- 1:5 → range, seq_len(N) → range, length(vec) → number, unresolvable → null
- for (y in c("a","b")) { lm(y ~ x) } → 2 AnalysisCall with different outcomes
- for (i in 1:2) { lm(outcomes[i] ~ x) } with known outcomes → subscript collapsed → 2 calls
- expand.grid → nested loop decomposition → cross-product of calls
- lapply(c("a","b"), function(y) { lm(y ~ x) }) → 2 calls
- sourceLoop metadata
- vec[2] → literal, vec[[3]] → literal, unknown vec → unchanged
- source()’d file with vectors defined in parent file → scope resolution works

Expanded loops from at least 3 reference papers covering patterns A–D. The 346-loop census provides ground truth for coverage measurement.