M5a: Multi-File Infrastructure — Design Spec

Date: 2026-04-11
Status: Design
Depends on: M4 complete (multi-dataset registry, auto-match, expression evaluator, pipe threading)

Motivation

Real replication packages aren’t single files. They’re ZIPs with 2–27 R files, helper scripts loaded via source(), custom function definitions wrapping lm()/felm() calls, and data files in multiple formats. M4 gave us the data pipeline; M5a gives us the code infrastructure to parse and execute multi-file packages.

Unlocks: Coverage jumps from 12% → ~29% of Top 5 packages. Makes Tier A replication viable for papers that use source() chains and custom wrapper functions.

Scope: ZIP upload + file extraction, source() dependency resolution, function definition tracking with file-scope string constants, single-level function inlining with paste0()/paste() evaluation. Parser extensions for function definitions and for loops (for loop bodies deferred to M5b but parsed now to avoid error recovery corruption).

Out of scope: Loop expansion, lapply()/map() expansion, dynamic formula construction beyond paste0() with string literals, multi-level function inlining, expression evaluator additions (Wave 1/2), Excel reading, code file dependency graph panel.


1. Parser Extensions

New AST Nodes

Two new node types in src/core/parsers/r/ast.ts, added to the RNode union:

export interface FunctionDefNode {
  type: 'function-def';
  params: FunctionParam[];
  body: RNode[];          // statements inside { }
  span: Span;
}

export interface FunctionParam {
  name: string;
  default?: RNode;        // default value expression, if any
}

export interface ForNode {
  type: 'for';
  variable: string;       // iteration variable name
  iterable: RNode;        // the expression after `in`
  body: RNode[];          // statements inside { }
  span: Span;
}
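For concreteness, here is a hedged sketch of the value the parser might produce for `function(x, y = 1) x + y`. The expression node shapes (identifier, literal, binary-op) are illustrative assumptions, not part of the spec; FunctionDefNode and FunctionParam mirror the interfaces above, with RNode loosened to a generic record so the sketch is self-contained.

```typescript
// Minimal stand-in types for the sketch (RNode loosened to a record)
type Node = Record<string, unknown>;
interface FunctionParam { name: string; default?: Node }
interface FunctionDefNode {
  type: 'function-def';
  params: FunctionParam[];
  body: Node[];
  span: { start: number; end: number };
}

const span = { start: 0, end: 23 };   // placeholder offsets

const fnDef: FunctionDefNode = {
  type: 'function-def',
  params: [
    { name: 'x' },                                             // no default
    { name: 'y', default: { type: 'literal', value: 1, span } },
  ],
  // a braceless single-expression body is wrapped in a one-element array
  body: [{
    type: 'binary-op',
    op: '+',
    lhs: { type: 'identifier', name: 'x', span },
    rhs: { type: 'identifier', name: 'y', span },
    span,
  }],
  span,
};

console.log(fnDef.params.length, fnDef.body.length);
```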

New Lexer Tokens

Three new tokens in src/core/parsers/r/tokens.ts: LBrace, RBrace, and In (the keyword between the loop variable and the iterable). The For and Function keyword tokens already exist in the lexer.

Parser Grammar Rules

functionDef: Triggered when atom encounters the Function keyword token.

functionDef → Function LParen paramList? RParen body
paramList   → param (Comma param)*
param       → Identifier (Assign expression)?     // default value
body        → LBrace NL* (statement (statementSep statement)*)? NL* RBrace
            | expression                           // single-expression body (no braces)

Single-expression function bodies (function(x) x + 1) wrap the expression in a one-element body array.

forStatement: New alternative in the statement rule.

forStatement → For LParen Identifier In expression RParen body
body         → LBrace NL* (statement (statementSep statement)*)? NL* RBrace
             | statement                           // single-statement body (no braces)

Recognizer Behavior

The recognizer gains access to the accumulated globalFunctions and globalConstants maps (see section 2). When it encounters a FunctionCallNode whose name matches a user-defined function, it delegates to the inliner (section 3); all other recognition behavior is unchanged.

Statement Rule Update

The statement rule needs updated GATE logic to detect for and function contexts. for starts with the For keyword token (already exists in lexer). Function definitions appear as Identifier Arrow Function or Identifier Assign Function — these are already parsed as assignments where the value expression reaches atom, which needs a new Function alternative.


2. FileRegistry

File: src/core/parsers/file-registry.ts (new)

The FileRegistry maps filenames to parse results, resolves source() dependencies, and accumulates function definitions + string constants across files in topological order.

Data Shape

interface FunctionInfo {
  node: FunctionDefNode;
  sourceFile: string;     // which file defined this function
}

interface ConstantInfo {
  value: string;
  sourceFile: string;
}

interface FileEntry {
  path: string;                              // relative to ZIP root
  source: string;                            // raw R source text
  ast: ProgramNode;                          // parsed AST
  sourceDeps: string[];                      // resolved paths from source() calls
  functionDefs: Map<string, FunctionDefNode>;
  stringConstants: Map<string, string>;
}

interface FileRegistry {
  files: Map<string, FileEntry>;
  allFiles: string[];                        // all file paths (R + data)
  processingOrder: string[];                 // topological sort of R files

  // Accumulated scope (built during ordered processing)
  globalFunctions: Map<string, FunctionInfo>;
  globalConstants: Map<string, ConstantInfo>;
}

Processing Pipeline

  1. Parse all R files — tokenize + parse each .R file independently → FileEntry with AST.

  2. Extract source() calls — walk each AST for FunctionCallNode where name is source. Extract the first positional string argument. This is a pre-pass before the recognizer runs.

  3. Resolve paths — for each source("path") call, try the following fallbacks in order:

    1. Exact path relative to ZIP root
    2. Exact path relative to calling file’s directory
    3. Rewrite here::here("a", "b") to the joined path "a/b", then resolve as in step 1
    4. Basename match across all R files in the package
    5. Best path-segment match from the right (e.g., source("Code/analysis/helpers.R") → prefer analysis/helpers.R over utils/helpers.R based on matching trailing segments)
    6. No match → warning diagnostic (“could not resolve source(‘…’)”), skip this edge
  4. Topological sort — build DAG from resolved sourceDeps, topological sort. Cycles → warning diagnostic + break cycle at the edge that closes it.

  5. Process files in order — walk files in topological order. For each file, merge its functionDefs and stringConstants into globalFunctions / globalConstants (later definitions shadow earlier ones), then run the recognizer over its AST with the accumulated scope.

  6. Merge results — all AnalysisCall[] from all files are concatenated into one flat list → passed to the mapper → one unified pipeline DAG.
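The step-3 resolution fallback can be sketched as a pure function. This is a hedged sketch: resolveSourcePath and the inline path helpers are illustrative names, forward-slash paths are assumed, and the here::here() rewrite is assumed to happen before this function runs.

```typescript
// Minimal forward-slash path helpers (no ".." handling; a real
// implementation would normalize paths first)
const basename = (p: string) => p.split('/').pop()!;
const dirname = (p: string) => (p.includes('/') ? p.slice(0, p.lastIndexOf('/')) : '');

function resolveSourcePath(
  raw: string,          // the string literal inside source()
  callerPath: string,   // path of the file containing the source() call
  rFiles: string[],     // all .R paths in the package, relative to ZIP root
): string | undefined {
  // 1. Exact path relative to ZIP root
  if (rFiles.includes(raw)) return raw;
  // 2. Exact path relative to the calling file's directory
  const dir = dirname(callerPath);
  const relToCaller = dir ? `${dir}/${raw}` : raw;
  if (rFiles.includes(relToCaller)) return relToCaller;
  // (3. the here::here() rewrite is assumed to have happened already)
  // 4. Unique basename match across all R files
  const candidates = rFiles.filter(f => basename(f) === basename(raw));
  if (candidates.length === 1) return candidates[0];
  // 5. Best trailing-segment match: count matching segments from the right
  const rawSegs = raw.split('/');
  let best: string | undefined;
  let bestScore = 0;
  for (const f of candidates) {
    const segs = f.split('/');
    let score = 0;
    while (
      score < Math.min(rawSegs.length, segs.length) &&
      rawSegs[rawSegs.length - 1 - score] === segs[segs.length - 1 - score]
    ) score++;
    if (score > bestScore) { bestScore = score; best = f; }
  }
  if (best) return best;
  // 6. No match: the caller emits a warning diagnostic and skips this edge
  return undefined;
}
```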

source() Handling

source stays in IGNORABLE_FUNCTIONS. The FileRegistry intercepts source() calls during step 2 (pre-pass on the AST), before the recognizer runs. The recognizer continues to silently skip source() calls as ignorable — the resolution has already happened at the FileRegistry layer. No recognizer changes needed for source().

In single-file mode (paste in editor), there’s no FileRegistry pre-pass — source() calls are silently skipped as today.
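The step-4 topological ordering can be sketched with Kahn's algorithm over the source() edges. A hedged sketch: processingOrder is an illustrative name, and cycle handling here simply reports the cyclic remainder and appends it, matching the "warn and break the cycle" behavior described above.

```typescript
// An entry deps.get(A) = [B] means "A sources B", so B must come first.
function processingOrder(
  deps: Map<string, string[]>,
): { order: string[]; cyclic: string[] } {
  const files = [...deps.keys()];
  const remaining = new Map(
    files.map(f => [f, new Set(deps.get(f) ?? [])] as [string, Set<string>]),
  );
  const order: string[] = [];
  let progress = true;
  while (remaining.size > 0 && progress) {
    progress = false;
    for (const [file, pending] of remaining) {
      // drop deps that are already processed (or never resolved to a file)
      for (const d of pending) if (!remaining.has(d)) pending.delete(d);
      if (pending.size === 0) {
        order.push(file);
        remaining.delete(file);
        progress = true;
      }
    }
  }
  // anything left participates in a cycle: warn and append in original order
  const cyclic = [...remaining.keys()];
  return { order: [...order, ...cyclic], cyclic };
}
```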

Scope Rules

Function definitions and string constants accumulate across files in processing order; a definition in a later file shadows an earlier one with the same name. This accumulated scope is what the recognizer and inliner consult when resolving user-defined function calls and free variables.

Single-File Unification

Pasting code in the editor creates a FileRegistry with one entry (synthetic path untitled.R). Uploading a single .R file creates a FileRegistry with one entry. The code path is unified: everything goes through FileRegistry, whether it’s a ZIP with 27 files or a single paste.


3. Function Inlining

File: src/core/parsers/r/inliner.ts (new)

When the recognizer encounters a FunctionCallNode whose name matches a user-defined function in globalFunctions, it delegates to the inliner.

Guard Checks

Before attempting inlining, verify:

- Function body contains at least one FunctionCallNode (otherwise nothing to recognize)
- No nested function definitions in the body
- No NSE markers in the body: eval(), do.call(), get(), assign(), environment()
- Body length ≤ 200 statements (sanity bound)

If any guard fails → emit UnsupportedNode with a diagnostic explaining why.
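The NSE-marker guard can be sketched as a recursive walk. A hedged sketch: the AnyNode shape is a hypothetical minimal stand-in for RNode (a `type` tag plus arbitrary child properties), not the real AST types.

```typescript
const NSE_MARKERS = new Set(['eval', 'do.call', 'get', 'assign', 'environment']);

// hypothetical minimal node shape for the sketch
type AnyNode = { type: string; name?: string; [key: string]: unknown };

function containsNSEMarker(node: AnyNode): boolean {
  // a call to any NSE marker anywhere in the body fails the guard
  if (node.type === 'function-call' && NSE_MARKERS.has(node.name ?? '')) return true;
  for (const value of Object.values(node)) {
    const children = Array.isArray(value) ? value : [value];
    for (const child of children) {
      // recurse into anything that looks like a child node
      if (child !== null && typeof child === 'object' && 'type' in child) {
        if (containsNSEMarker(child as AnyNode)) return true;
      }
    }
  }
  return false;
}
```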

Inlining Algorithm

  1. Look up function name in globalFunctions → get FunctionDefNode with params and body.

  2. Match arguments — pair call-site args to function params: positional args fill params in order, named args bind by name, and unmatched params fall back to their default values. A required param with no argument and no default fails inlining.

  3. Build substitution map — a Map<string, RNode> mapping each param name to its call-site argument AST node.

  4. Add file-scope constants — for free variables in the body (identifiers not in the param list and not local assignments), look up in globalConstants. If found, add name → LiteralNode(value) to the substitution map.

  5. Substitute — deep-clone the function body AST. Walk the clone, replacing every IdentifierNode whose name is in the substitution map with a deep clone of the mapped node.

  6. Evaluate paste0()/paste() — post-substitution pass on the cloned body. For any FunctionCallNode named paste0 or paste where all arguments are now LiteralNode strings, concatenate the values (paste0 with no separator; paste with its sep argument, default " ") and replace the call with a single string LiteralNode. Calls with any non-literal argument are left as-is.

  7. Resolve formula()/as.formula() — for any FunctionCallNode named formula or as.formula whose first argument is now a LiteralNode string: extract the string, parse it as a formula using the existing formula parser. Replace the FunctionCallNode with the resulting FormulaExprNode. If parsing fails, leave as-is.

  8. Re-recognize — run the recognizer on the substituted body statements. This is a single pass — if the body contains calls to other user-defined functions, those emit UnsupportedNode (no recursion). The recognizer treats the inlined body exactly like top-level statements, with the same accumulated scope (minus the function being inlined, to prevent self-recursion).

  9. Annotate provenance — each AnalysisCall produced from inlining gets sourceFunction set to the name of the inlined function and a sourceSpan pointing back at the original call site.
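Steps 5–6 can be sketched on a simplified node shape. A hedged sketch: the Node union and the substitute/evalPaste names are illustrative, not the real RNode types or inliner API.

```typescript
// Simplified node union for the sketch (not the real RNode types)
type Node =
  | { type: 'identifier'; name: string }
  | { type: 'string'; value: string }
  | { type: 'call'; name: string; args: Node[] };

// Step 5: replace identifiers found in the substitution map with deep
// clones of the mapped call-site argument nodes
function substitute(node: Node, subst: Map<string, Node>): Node {
  if (node.type === 'identifier' && subst.has(node.name)) {
    return structuredClone(subst.get(node.name)!);
  }
  if (node.type === 'call') {
    return { ...node, args: node.args.map(a => substitute(a, subst)) };
  }
  return node;
}

// Step 6: collapse paste0()/paste() calls whose arguments are all string
// literals; paste() joins with " " by default, paste0() with nothing
function evalPaste(node: Node): Node {
  if (node.type !== 'call') return node;
  const args = node.args.map(evalPaste);
  if ((node.name === 'paste0' || node.name === 'paste') &&
      args.every(a => a.type === 'string')) {
    const sep = node.name === 'paste' ? ' ' : '';
    return {
      type: 'string',
      value: args.map(a => (a as { type: 'string'; value: string }).value).join(sep),
    };
  }
  return { ...node, args };
}
```

Running evalPaste over a substituted body mirrors the walkthrough in this section: a paste0() call whose identifier arguments have all been replaced by string literals collapses to one string literal, ready for formula() resolution.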

Failure Handling

Inlining is best-effort. If any step fails (unresolved free variable, non-literal paste argument, guard check), the original call site emits UnsupportedNode with a descriptive diagnostic explaining which step failed. The pipeline continues — partial inlining of a file is fine.

Example Walkthrough (soil-heterogeneity)

# File scope constants:
my.GeoCtrls <- "Elevation + Slope + Flow_acc + ..."

# Function definition:
table1 <- function(dep.var.name, indep.var.name, FEs, reg.data, clusterVar) {
  m1 <- felm(formula(paste0(dep.var.name, " ~ ", indep.var.name, " | 1 | 0 | ", clusterVar)),
             data = reg.data)
  m2 <- felm(formula(paste0(dep.var.name, " ~ ", indep.var.name, " | ", FEs, " | 0 | ", clusterVar)),
             data = reg.data)
  m3 <- felm(formula(paste0(dep.var.name, " ~ ", indep.var.name, " + ", my.GeoCtrls,
                            " | ", FEs, " | 0 | ", clusterVar)),
             data = reg.data)
  # ... m4, m5, m6 similarly
}

# Call site:
table1(dep.var.name = "LNI_county_Native", indep.var.name = "SHI25",
       FEs = "state", reg.data = dt1, clusterVar = "state")

Step 2-3: Substitution map = { dep.var.name → "LNI_county_Native", indep.var.name → "SHI25", FEs → "state", reg.data → dt1, clusterVar → "state" }

Step 4: Add my.GeoCtrls → "Elevation + Slope + ..." from globalConstants

Step 5: Substitute into cloned body → paste0("LNI_county_Native", " ~ ", "SHI25", " | 1 | 0 | ", "state")

Step 6: Evaluate paste0 → "LNI_county_Native ~ SHI25 | 1 | 0 | state"

Step 7: formula("LNI_county_Native ~ SHI25 | 1 | 0 | state") → parsed FormulaExprNode

Step 8: Recognizer sees felm(formula, data = dt1) → emits linear-model AnalysisCall

Step 9: AnalysisCall.sourceFunction = "table1"

Result: 6 felm() AnalysisCalls from one table1() call. Each carries provenance back to the call site + function name.


4. ZIP Extraction

File: src/core/zip/extractor.ts (new)

Library

fflate (MIT, ~13KB gzipped). Synchronous decompression via unzipSync().

Extraction Pipeline

  1. Read ZIP buffer — File.arrayBuffer() from the upload input → Uint8Array

  2. Extract — fflate.unzipSync(buffer) → Record<string, Uint8Array> (path → content)

  3. Strip common prefix — if all paths share a single top-level directory (e.g., replication-package/...), strip it so paths become relative to the package root

  4. Classify files by extension:

     | Extension | Classification | Action |
     | --- | --- | --- |
     | .R, .r | R source | Decode as UTF-8, send to FileRegistry |
     | .csv | CSV data | Send to dataset registry (addDataset) |
     | .dta | Stata data | Send to dataset registry (existing .dta parser) |
     | .xlsx, .xls | Excel data | Show in file browser with “not yet supported” badge |
     | .RData, .rds | R binary data | Show in file browser with “not yet supported” badge |
     | .pdf | Paper PDF | Store reference for M6 verification |
     | Other (.do, .py, .m, .txt, images) | Non-R files | Show greyed out in file browser |
  5. Size gate for data files — check uncompressed size from ZIP metadata before extracting. Files above the threshold (500MB) are not extracted; they appear in the file browser with a size warning badge.

  6. Feed to systems — decoded R sources go to the FileRegistry; CSV and .dta files go to the dataset registry; all other files are surfaced in the file browser only.
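Step 3's prefix stripping is a small pure function over the path list returned by unzipSync. A hedged sketch; stripCommonPrefix is an illustrative name.

```typescript
function stripCommonPrefix(paths: string[]): string[] {
  // top-level segment of each path
  const tops = new Set(paths.map(p => p.split('/')[0]));
  const allNested = paths.every(p => p.includes('/'));
  // strip only when every path sits under the same single top-level directory
  if (tops.size === 1 && allNested) {
    const prefix = `${[...tops][0]}/`;
    return paths.map(p => p.slice(prefix.length));
  }
  return paths;
}
```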

Single-File Unification

All entry points produce a FileRegistry:

- Paste code in editor → FileRegistry with one entry, path untitled.R
- Upload single .R file → FileRegistry with one entry, path = filename
- Upload ZIP → FileRegistry with N entries

The downstream pipeline (recognizer → mapper → executor) is identical in all cases.


5. UI Changes

5.1 Unified Upload Zone

The existing upload button in the toolbar becomes a unified drop zone accepting any file type.

5.2 File Browser Panel

New collapsible left panel (~200px wide) showing the extracted folder structure as a tree.

Visibility:

- Hidden in single-file mode (paste in editor, single .R upload)
- Shown after ZIP upload
- Collapsible via toggle button

Tree contents:

- Folder hierarchy mirrors the ZIP structure (after common prefix stripping)
- R files: code icon, clickable → opens as editor tab
- Data files (CSV/dta): data icon, clickable → highlights in dataset toolbar dropdown
- Unsupported data files (xlsx, RData): data icon + warning badge, tooltip explains
- Oversized data files (>500MB): data icon + size warning badge
- Non-R files (.do, .py, images): greyed out, not clickable
- Folders are collapsible

File status indicators:

- Parsed successfully: green dot
- Parse errors: amber dot with error count
- Not applicable (non-R): no dot

5.3 Tabbed Code Editor

The existing CodeMirror editor gains a tab bar above it.

Tab behavior:

- Single-file paste: one tab labeled “untitled.R”
- ZIP upload: one tab per R file, labeled with filename (tooltip shows full path)
- Clicking a tab switches the editor content
- Tabs are editable (CodeMirror remains read-write)
- Active tab visually highlighted

Tab ordering:

- Topological order (files that are source()’d by others come first)
- Files without source() edges sorted alphabetically after dependency-ordered files
- Entry point files (e.g., Run_All.R, main.R) naturally sort last since they source everything

Overflow handling:

- If >10 tabs, the tab bar scrolls horizontally
- Overflow arrows on left/right edges

Re-parse on edit:

- When editor content changes (debounced), re-parse the active file
- Re-run FileRegistry processing for that file (re-extract source() calls, re-scan function defs + constants)
- Re-run recognizer + mapper to update the DAG
- Other files’ parse results are cached — only the edited file re-processes

5.4 Layout

Current layout: [Code Editor | DAG Canvas] with property sheet overlay on the right and data table panel at the bottom.

After M5a:

┌──────────┬──────────────────────┬─────────────────────────┐
│ File     │ tab1 | tab2 | tab3  │                         │
│ Browser  ├──────────────────────┤      DAG Canvas         │
│ (tree)   │                      │                         │
│          │   Code Editor        │                         │
│          │                      │                         │
│          │                      │      [Property Sheet]   │
├──────────┴──────────────────────┴─────────────────────────┤
│                    Data Table Viewer                       │
└───────────────────────────────────────────────────────────┘

6. Testing Strategy

Unit Tests

| Area | Tests |
| --- | --- |
| Parser: FunctionDefNode | Parse f <- function(x, y = 1) { x + y }, verify params with defaults. Parse function(x) x + 1 (no braces). Nested function bodies. |
| Parser: ForNode | Parse for (i in 1:10) { x <- i }, verify variable/iterable/body. Parse for (x in vec) print(x) (no braces). |
| Parser: LBrace/RBrace | Verify existing code still parses correctly with new tokens. Braces inside strings don’t tokenize. |
| FileRegistry: source resolution | Exact path from root. Relative to caller directory. Basename fallback. Path-segment scoring. No match → diagnostic. |
| FileRegistry: topological sort | Linear chain (A sources B sources C). Star topology (main sources A, B, C). Cycle detection + warning. |
| FileRegistry: scope accumulation | Function defs accumulate across files in order. String constants accumulate. Later files shadow earlier defs. |
| Inliner: argument matching | Positional args. Named args. Mixed positional + named. Default values. Missing required arg → failure. |
| Inliner: substitution | Simple identifier replacement. Nested expression replacement. Free variable from constants. |
| Inliner: paste0/paste evaluation | All-literal args → concatenated string. Mixed literal + non-literal → left as-is. paste() with custom sep. |
| Inliner: formula resolution | formula("y ~ x") → FormulaExprNode. as.formula(paste0(...)) after paste eval. |
| Inliner: guard checks | NSE markers → failure. Nested function defs → failure. Empty body → failure. |
| Inliner: provenance | sourceFunction set on resulting AnalysisCalls. sourceSpan points to call site. |
| ZIP extraction | Common prefix stripping. File classification by extension. Size gate thresholds. |

Integration Tests

| Test | What it verifies |
| --- | --- |
| soil-heterogeneity table1() | Define function with 6 felm() + paste0() + file-scope constants → call site produces 6 AnalysisCalls with correct formulas |
| dissecting-crises helper chain | source("loading_package_and_functions.R") → define helpers → call helpers in main script → lm() calls recognized |
| investor-memory flat source | Run_All.R sources 27 independent scripts → all produce independent AnalysisCalls → flat DAG |
| Multi-file → single pipeline | 3 files with source() chain → one unified pipeline with correct edge wiring |
| Single-file backward compat | Existing single-file R code paste → same behavior as before (one-entry FileRegistry) |

E2E Tests

| Test | What it verifies |
| --- | --- |
| ZIP upload | Upload ZIP with R files + CSV → file browser shows tree → tabs appear → DAG renders |
| Single R file upload | Upload one .R file → no file browser → one tab → DAG renders |
| Tab switching | Click tabs → editor content changes → DAG stays (same pipeline) |
| File browser click | Click R file in tree → corresponding tab activates |
| Edit tab → re-parse | Modify code in a tab → DAG updates after debounce |

7. Validates Against

Primary: jpe-soil-heterogeneity (2 R files, 206 models)

No source() needed — just 2 files with function definitions + call sites. Tests function inlining + paste0() evaluation + file-scope constants. Our highest-value validation: 206 models, 72% inlinable custom functions.

Upload the 2 R files + data → FileRegistry scans function defs + constants → table1() call sites inline to felm() calls → 206 linear-model nodes in the DAG.

Secondary: qje-investor-memory (27 R files, 137 models)

Flat source() topology. Tests source() resolution + multi-file merge. Run_All.R sources 27 scripts → each parsed independently → 137 felm() calls recognized → flat DAG.

Tertiary: jpe-dissecting-crises (9 R files, 25 models)

Dependency chain via source(). Tests source() resolution + function tracking across files + inlining of lm() wrappers.


8. Dependencies and New Packages

| Package | Version | License | Size | Purpose |
| --- | --- | --- | --- | --- |
| fflate | latest | MIT | ~13KB gzip | ZIP extraction in browser |

No other new dependencies. All other functionality builds on existing infrastructure (Chevrotain parser, M4 dataset registry, CodeMirror editor, React Flow DAG).


9. Files Modified / Created

| File | Change |
| --- | --- |
| src/core/parsers/r/ast.ts | Add FunctionDefNode, FunctionParam, ForNode to RNode union |
| src/core/parsers/r/tokens.ts | Add LBrace, RBrace, In tokens |
| src/core/parsers/r/parser.ts | Add functionDef, forStatement, body grammar rules. Update statement and atom. |
| src/core/parsers/r/visitor.ts | Add visitors for new CST nodes → AST nodes |
| src/core/parsers/shared/analysis-call.ts | Add optional sourceFunction?: string field to AnalysisCall |
| src/core/parsers/r/recognizer.ts | Accept globalFunctions + globalConstants params. Delegate to inliner for user-defined function calls. |
| src/core/parsers/r/inliner.ts | New. Function inlining: arg matching, substitution, paste0 evaluation, formula resolution, re-recognition. |
| src/core/parsers/file-registry.ts | New. FileRegistry: file tracking, source() resolution, topological sort, scope accumulation. |
| src/core/zip/extractor.ts | New. ZIP extraction, file classification, size gating. |
| src/ui/store/files.ts | New. Zustand store for file browser state, tab state, active file. |
| src/ui/components/file-browser/ | New. File browser tree panel component. |
| src/ui/components/editor/tab-bar.tsx | New. Tab bar component for code editor. |
| src/ui/components/editor/code-editor.tsx | Modify existing editor to support multiple files via tabs. |
| src/ui/components/toolbar/upload-zone.tsx | New. Unified upload zone replacing current upload button. |
| src/ui/App.tsx | Layout changes: add file browser panel, wire tab bar to editor, connect upload zone. |
| src/ui/store/pipeline.ts | Update runPipeline to use FileRegistry instead of single-source parse. |
| package.json | Add fflate dependency. |