Date: 2026-04-11
Status: Design
Depends on: M4 complete (multi-dataset registry, auto-match, expression evaluator, pipe threading)
Real replication packages aren’t single files. They’re ZIPs with 2–27
R files, helper scripts loaded via source(), custom
function definitions wrapping lm()/felm()
calls, and data files in multiple formats. M4 gave us the data pipeline;
M5a gives us the code infrastructure to parse and execute multi-file
packages.
Unlocks: Coverage jumps from 12% → ~29% of Top 5
packages. Makes Tier A replication viable for papers that use
source() chains and custom wrapper functions.
Scope: ZIP upload + file extraction,
source() dependency resolution, function definition
tracking with file-scope string constants, single-level function
inlining with paste0()/paste() evaluation.
Parser extensions for function definitions and
for loops (for loop bodies deferred to M5b but parsed now
to avoid error recovery corruption).
Out of scope: Loop expansion,
lapply()/map() expansion, dynamic formula
construction beyond paste0() with string literals,
multi-level function inlining, expression evaluator additions (Wave
1/2), Excel reading, code file dependency graph panel.
Two new node types in src/core/parsers/r/ast.ts, added
to the RNode union:
```ts
export interface FunctionDefNode {
  type: 'function-def';
  params: FunctionParam[];
  body: RNode[];     // statements inside { }
  span: Span;
}

export interface FunctionParam {
  name: string;
  default?: RNode;   // default value expression, if any
}

export interface ForNode {
  type: 'for';
  variable: string;  // iteration variable name
  iterable: RNode;   // the expression after `in`
  body: RNode[];     // statements inside { }
  span: Span;
}
```

New tokens:

- `LBrace` / `RBrace` — `{` and `}` (not currently tokenized)
- `In` — keyword token for `in` inside `for` loops

`functionDef`: triggered when `atom` encounters the `Function` keyword token.
```
functionDef → Function LParen paramList? RParen body
paramList   → param (Comma param)*
param       → Identifier (Assign expression)?   // default value
body        → LBrace NL* (statement (statementSep statement)*)? NL* RBrace
            | expression                        // single-expression body (no braces)
```
Single-expression function bodies (function(x) x + 1)
wrap the expression in a one-element body array.
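As a concrete instance of these node shapes, a parse of `f <- function(x, y = 1) x + y` would produce something like the sketch below. The `Span` and expression-node shapes here are simplified stand-ins for illustration, not the project's actual definitions:

```typescript
// Sketch only: Span and the expression shapes below are assumed.
type Span = { start: number; end: number };
type RNode =
  | { type: 'identifier'; name: string }
  | { type: 'literal'; value: number }
  | { type: 'binary'; op: string; lhs: RNode; rhs: RNode };

interface FunctionParam { name: string; default?: RNode }
interface FunctionDefNode {
  type: 'function-def';
  params: FunctionParam[];
  body: RNode[];  // single-expression bodies wrap in a one-element array
  span: Span;
}

// f <- function(x, y = 1) x + y
const fDef: FunctionDefNode = {
  type: 'function-def',
  params: [
    { name: 'x' },
    { name: 'y', default: { type: 'literal', value: 1 } },
  ],
  body: [{
    type: 'binary', op: '+',
    lhs: { type: 'identifier', name: 'x' },
    rhs: { type: 'identifier', name: 'y' },
  }],
  span: { start: 0, end: 29 },
};
```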
`forStatement`: new alternative in the `statement` rule.

```
forStatement → For LParen Identifier In expression RParen body
body         → LBrace NL* (statement (statementSep statement)*)? NL* RBrace
             | statement                        // single-statement body (no braces)
```
`FunctionDefNode`: not recognized directly. Consumed by the FileRegistry (Section 2) during scope building. When encountered during a normal recognizer walk, silently skip (no diagnostic, no UnsupportedNode).

`ForNode`: silently skip in M5a (no diagnostic). M5b adds loop expansion. Having it in the AST prevents error recovery from corrupting subsequent statements.

The `statement` rule needs updated GATE logic to detect `for` and `function` contexts. `for` starts with the `For` keyword token (already in the lexer). Function definitions appear as `Identifier Arrow Function` or `Identifier Assign Function` — these are already parsed as assignments whose value expression reaches `atom`, which needs a new `Function` alternative.
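The dispatch decision can be illustrated independently of Chevrotain with plain token lookahead. The token names mirror the grammar above; the tokenizer itself and these helper names are assumptions for the sketch:

```typescript
type TokenType = 'For' | 'Function' | 'Identifier' | 'Arrow' | 'Assign' | 'Other';
interface Token { type: TokenType; image: string }

// `for` begins a for-statement; everything else flows through the normal
// assignment/expression path, where `atom` gains a Function alternative.
function dispatchStatement(tokens: Token[], i: number): 'for-statement' | 'assignment-or-expression' {
  return tokens[i]?.type === 'For' ? 'for-statement' : 'assignment-or-expression';
}

// Detect `name <- function(...)` / `name = function(...)` at statement start.
function startsFunctionDefAssignment(tokens: Token[], i: number): boolean {
  return tokens[i]?.type === 'Identifier'
    && (tokens[i + 1]?.type === 'Arrow' || tokens[i + 1]?.type === 'Assign')
    && tokens[i + 2]?.type === 'Function';
}
```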
File: src/core/parsers/file-registry.ts
(new)
The FileRegistry maps filenames to parse results, resolves
source() dependencies, and accumulates function definitions
+ string constants across files in topological order.
```ts
interface FunctionInfo {
  node: FunctionDefNode;
  sourceFile: string;  // which file defined this function
}

interface ConstantInfo {
  value: string;
  sourceFile: string;
}

interface FileEntry {
  path: string;        // relative to ZIP root
  source: string;      // raw R source text
  ast: ProgramNode;    // parsed AST
  sourceDeps: string[];  // resolved paths from source() calls
  functionDefs: Map<string, FunctionDefNode>;
  stringConstants: Map<string, string>;
}

interface FileRegistry {
  files: Map<string, FileEntry>;
  allFiles: string[];         // all file paths (R + data)
  processingOrder: string[];  // topological sort of R files
  // Accumulated scope (built during ordered processing)
  globalFunctions: Map<string, FunctionInfo>;
  globalConstants: Map<string, ConstantInfo>;
}
```

1. **Parse all R files** — tokenize + parse each `.R` file independently → `FileEntry` with AST.
2. **Extract `source()` calls** — walk each AST for `FunctionCallNode`s whose name is `source`. Extract the first positional string argument. This is a pre-pass before the recognizer runs.
3. **Resolve paths** — for each `source("path")` call, resolve using a 5-step fallback:
   1. Exact path relative to the package root.
   2. Relative to the sourcing file's directory.
   3. Basename match anywhere in the package.
   4. Path-segment scoring when multiple basenames match (e.g., `source("Code/analysis/helpers.R")` → prefer `analysis/helpers.R` over `utils/helpers.R` based on matching trailing segments).
   5. No match → diagnostic.

   `here::here("a", "b")` arguments are joined (`path.join("a", "b")`) and then resolved as in step 1.
4. **Topological sort** — build a DAG from the resolved `sourceDeps` and topologically sort it. Cycles → warning diagnostic + break the cycle at the edge that closes it.
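Step 4 can be sketched with Kahn's algorithm. Here `deps` maps each file to the files it sources (names and the cycle-breaking rule are illustrative — leftover nodes after the queue drains belong to a cycle and are appended with a warning):

```typescript
// Kahn's algorithm over source() edges: a file's dependencies come before it.
function topoSort(deps: Map<string, string[]>): { order: string[]; cyclic: string[] } {
  const inDegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const file of deps.keys()) inDegree.set(file, 0);
  for (const [file, sourced] of deps) {
    for (const dep of sourced) {
      inDegree.set(file, (inDegree.get(file) ?? 0) + 1);
      dependents.set(dep, [...(dependents.get(dep) ?? []), file]);
    }
  }
  const queue = [...inDegree.entries()].filter(([, d]) => d === 0).map(([f]) => f);
  const order: string[] = [];
  while (queue.length) {
    const file = queue.shift()!;
    order.push(file);
    for (const dependent of dependents.get(file) ?? []) {
      const d = inDegree.get(dependent)! - 1;
      inDegree.set(dependent, d);
      if (d === 0) queue.push(dependent);
    }
  }
  // Anything not emitted is part of a cycle: warn and append it anyway,
  // which effectively breaks the cycle at the edge that closes it.
  const cyclic = [...deps.keys()].filter((f) => !order.includes(f));
  return { order: [...order, ...cyclic], cyclic };
}
```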
5. **Process files in order** — walk files in topological order. For each file:
   - `AssignmentNode` where the value is a `FunctionDefNode` → add to `globalFunctions`
   - `AssignmentNode` where the value is a `LiteralNode` with string type → add to `globalConstants`
6. **Merge results** — all `AnalysisCall[]` from all files are concatenated into one flat list → passed to the mapper → one unified pipeline DAG.
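Step 5's scan can be sketched as a walk over each file's top-level statements. The node shapes are simplified stand-ins for the real AST types; `Map.set` overwriting gives the later-files-shadow-earlier behavior:

```typescript
// Simplified stand-ins for the real AST/registry types.
type TopLevel =
  | { type: 'assignment'; name: string; value:
      | { type: 'function-def' }
      | { type: 'literal'; literalType: 'string' | 'number'; value: unknown }
      | { type: 'other' } }
  | { type: 'other' };

function accumulateScope(
  file: string,
  statements: TopLevel[],
  globalFunctions: Map<string, { sourceFile: string }>,
  globalConstants: Map<string, { value: string; sourceFile: string }>,
): void {
  for (const stmt of statements) {
    if (stmt.type !== 'assignment') continue;
    if (stmt.value.type === 'function-def') {
      // Later files shadow earlier definitions (Map.set overwrites).
      globalFunctions.set(stmt.name, { sourceFile: file });
    } else if (stmt.value.type === 'literal' && stmt.value.literalType === 'string') {
      globalConstants.set(stmt.name, { value: String(stmt.value.value), sourceFile: file });
    }
  }
}
```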
**`source()` handling.** `source` stays in `IGNORABLE_FUNCTIONS`. The FileRegistry intercepts `source()` calls during step 2 (pre-pass on the AST), before the recognizer runs. The recognizer continues to silently skip `source()` calls as ignorable — resolution has already happened at the FileRegistry layer. No recognizer changes are needed for `source()`.

In single-file mode (paste in editor) there is no FileRegistry pre-pass — `source()` calls are silently skipped, as today.
Constant tracking is deliberately narrow: only top-level assignments of the form `name <- "literal"` count. Not inside functions, not inside if/for blocks. The recognizer receives `globalFunctions` and `globalConstants` as read-only context.

Pasting code in the editor creates a FileRegistry with one entry (synthetic path `untitled.R`). Uploading a single `.R` file creates a FileRegistry with one entry. The code path is unified: everything goes through the FileRegistry, whether it's a ZIP with 27 files or a single paste.
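The unified entry point can be sketched as a small factory. The field shapes loosely follow the FileRegistry interfaces above; parsing and topological ordering are stubbed, and the helper names are illustrative, not the project's actual API:

```typescript
interface FileEntryLite { path: string; source: string }
interface FileRegistryLite { files: Map<string, FileEntryLite>; processingOrder: string[] }

// Pasted code, single-file uploads, and ZIPs all go through the same
// constructor; only the entry list differs.
function registryFromEntries(entries: Array<[string, string]>): FileRegistryLite {
  const files = new Map(
    entries.map(([path, source]) => [path, { path, source }] as [string, FileEntryLite]),
  );
  return { files, processingOrder: [...files.keys()] };  // real code topo-sorts here
}

// `untitled.R` is the synthetic path for editor pastes.
const fromPaste = (source: string) => registryFromEntries([['untitled.R', source]]);
```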
File: src/core/parsers/r/inliner.ts
(new)
When the recognizer encounters a FunctionCallNode whose
name matches a user-defined function in globalFunctions, it
delegates to the inliner.
Before attempting inlining, verify:

- Function body contains at least one `FunctionCallNode` (otherwise there is nothing to recognize)
- No nested function definitions in the body
- No NSE markers in the body: `eval()`, `do.call()`, `get()`, `assign()`, `environment()`
- Body length ≤ 200 statements (sanity bound)

If any guard fails → emit `UnsupportedNode` with a diagnostic explaining why.
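The guards can be expressed as a single pre-flight function. AST shapes are simplified for the sketch; the NSE marker list matches the one above:

```typescript
type AstNode = { type: string; name?: string; children: AstNode[] };

const NSE_MARKERS = new Set(['eval', 'do.call', 'get', 'assign', 'environment']);

function* walk(node: AstNode): Generator<AstNode> {
  yield node;
  for (const child of node.children) yield* walk(child);
}

// Returns null when inlining may proceed, otherwise a diagnostic reason.
function inlineGuard(body: AstNode[]): string | null {
  if (body.length > 200) return 'body exceeds 200 statements';
  let hasCall = false;
  for (const stmt of body) {
    for (const node of walk(stmt)) {
      if (node.type === 'function-def') return 'body contains nested function definitions';
      if (node.type === 'call') {
        hasCall = true;
        if (node.name && NSE_MARKERS.has(node.name)) {
          return `contains ${node.name}() (non-standard evaluation)`;
        }
      }
    }
  }
  return hasCall ? null : 'body contains no function calls';
}
```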
1. **Look up** the function name in `globalFunctions` → get the `FunctionDefNode` with params and body.
2. **Match arguments** — pair call-site args to function params:
   - Named arguments (e.g., `data = my_df`) match by name regardless of position
   - Positional arguments fill the remaining params in order
   - Unmatched params with defaults fall back to the default `RNode`
   - A missing required argument fails with `UnsupportedNode`: "Missing required argument 'X' for function 'Y'"
3. **Build substitution map** — `Map<string, RNode>` mapping param name → call-site argument AST node.
4. **Add file-scope constants** — for free variables in the body (identifiers not in the param list and not local assignments), look up in `globalConstants`. If found, add `name → LiteralNode(value)` to the substitution map.
5. **Substitute** — deep-clone the function body AST. Walk the clone, replacing every `IdentifierNode` whose name is in the substitution map with a deep clone of the mapped node.
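Steps 3–5 amount to a clone-and-replace walk. The node shapes are simplified, and `structuredClone` stands in for the project's deep-clone helper:

```typescript
type RNode =
  | { type: 'identifier'; name: string }
  | { type: 'literal'; value: string }
  | { type: 'call'; name: string; args: RNode[] };

// Replace every identifier found in the substitution map with a deep clone
// of the mapped call-site node; recurse into call arguments.
function substitute(node: RNode, subst: Map<string, RNode>): RNode {
  if (node.type === 'identifier') {
    const replacement = subst.get(node.name);
    return replacement ? structuredClone(replacement) : node;
  }
  if (node.type === 'call') {
    return { ...node, args: node.args.map((a) => substitute(a, subst)) };
  }
  return node;
}
```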
6. **Evaluate `paste0()`/`paste()`** — post-substitution pass on the cloned body. For any `FunctionCallNode` named `paste0` or `paste` where all arguments are now `LiteralNode` strings:
   - `paste0(a, b, c)` → `LiteralNode` with `a + b + c`
   - `paste(a, b, c)` → `LiteralNode` with `a + " " + b + " " + c` (default sep = `" "`)
   - If a `sep` named argument exists and is a string literal, use that separator
   - If any argument is not a literal, leave the call as-is (it becomes `UnsupportedNode` downstream)
7. **Resolve `formula()`/`as.formula()`** — for any `FunctionCallNode` named `formula` or `as.formula` whose first argument is now a `LiteralNode` string: extract the string and parse it with the existing formula parser. Replace the `FunctionCallNode` with the resulting `FormulaExprNode`. If parsing fails, leave as-is.
8. **Re-recognize** — run the recognizer on the substituted body statements. This is a single pass — if the body contains calls to other user-defined functions, those emit `UnsupportedNode` (no recursion). The recognizer treats the inlined body exactly like top-level statements, with the same accumulated scope (minus the function being inlined, to prevent self-recursion).
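Step 6's constant folding, sketched over simplified node shapes (only all-literal argument lists fold; the `sep` handling and leave-as-is path follow the rules above):

```typescript
type RNode =
  | { type: 'literal'; value: string }
  | { type: 'call'; name: string; args: RNode[]; namedArgs?: Record<string, RNode> };

// Fold paste0()/paste() calls whose arguments are all string literals into a
// single literal; anything else is returned unchanged (it will surface as
// UnsupportedNode downstream).
function foldPaste(node: RNode): RNode {
  if (node.type !== 'call' || (node.name !== 'paste0' && node.name !== 'paste')) return node;
  const values: string[] = [];
  for (const a of node.args) {
    if (a.type !== 'literal') return node;  // any non-literal arg → leave as-is
    values.push(a.value);
  }
  let sep = node.name === 'paste0' ? '' : ' ';  // paste() defaults to a space
  const sepArg = node.namedArgs?.sep;
  if (sepArg) {
    if (sepArg.type !== 'literal') return node;  // non-literal sep → leave as-is
    sep = sepArg.value;
  }
  return { type: 'literal', value: values.join(sep) };
}
```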
9. **Annotate provenance** — each `AnalysisCall` produced from inlining gets:
   - `sourceFunction: string` — the original function name (e.g., `"table1"`)
   - `sourceSpan` — points to the call site, not the function definition body

Inlining is best-effort. If any step fails (unresolved free variable, non-literal paste argument, guard check), the original call site emits `UnsupportedNode` with a descriptive diagnostic. The pipeline continues — partial inlining of a file is fine. Example diagnostics:

- "Could not inline 'table1': free variable 'my.GeoCtrls' not found in scope"
- "Could not inline 'run_sim': body contains nested function definitions"
- "Could not inline 'estimate': body exceeds 200 statements"
- "Could not inline 'my_func': contains eval() (non-standard evaluation)"

```r
# File-scope constant:
my.GeoCtrls <- "Elevation + Slope + Flow_acc + ..."

# Function definition:
table1 <- function(dep.var.name, indep.var.name, FEs, reg.data, clusterVar) {
  m1 <- felm(formula(paste0(dep.var.name, " ~ ", indep.var.name, " | 1 | 0 | ", clusterVar)),
             data = reg.data)
  m2 <- felm(formula(paste0(dep.var.name, " ~ ", indep.var.name, " | ", FEs, " | 0 | ", clusterVar)),
             data = reg.data)
  m3 <- felm(formula(paste0(dep.var.name, " ~ ", indep.var.name, " + ", my.GeoCtrls,
                            " | ", FEs, " | 0 | ", clusterVar)),
             data = reg.data)
  # ... m4, m5, m6 similarly
}

# Call site:
table1(dep.var.name = "LNI_county_Native", indep.var.name = "SHI25",
       FEs = "state", reg.data = dt1, clusterVar = "state")
```

Steps 2–3 build the substitution map:
```
{ dep.var.name → "LNI_county_Native", indep.var.name → "SHI25",
  FEs → "state", reg.data → dt1, clusterVar → "state" }
```

- Step 4: add `my.GeoCtrls → "Elevation + Slope + ..."` from `globalConstants`.
- Step 5: substitute into the cloned body → `paste0("LNI_county_Native", " ~ ", "SHI25", " | 1 | 0 | ", "state")`.
- Step 6: evaluate `paste0` → `"LNI_county_Native ~ SHI25 | 1 | 0 | state"`.
- Step 7: `formula("LNI_county_Native ~ SHI25 | 1 | 0 | state")` → parsed `FormulaExprNode`.
- Step 8: the recognizer sees `felm(formula, data = dt1)` → emits a linear-model `AnalysisCall`.
- Step 9: `AnalysisCall.sourceFunction = "table1"`.

Result: 6 `felm()` AnalysisCalls from one `table1()` call, each carrying provenance back to the call site and function name.
File: `src/core/zip/extractor.ts` (new)

Library: fflate (MIT, ~13KB gzipped). Synchronous decompression via `unzipSync()`.

1. **Read ZIP buffer** — `File.arrayBuffer()` from the upload input → `Uint8Array`.
2. **Extract** — `fflate.unzipSync(buffer)` → `Record<string, Uint8Array>` (path → content).
3. **Strip common prefix** — if all paths share a single top-level directory (e.g., `replication-package/...`), strip it so paths become relative to the package root.
4. **Classify files by extension:**
| Extension | Classification | Action |
|---|---|---|
| `.R`, `.r` | R source | Decode as UTF-8, send to FileRegistry |
| `.csv` | CSV data | Send to dataset registry (`addDataset`) |
| `.dta` | Stata data | Send to dataset registry (existing `.dta` parser) |
| `.xlsx`, `.xls` | Excel data | Show in file browser with "not yet supported" badge |
| `.RData`, `.rds` | R binary data | Show in file browser with "not yet supported" badge |
| `.pdf` | Paper PDF | Store reference for M6 verification |
| Other (`.do`, `.py`, `.m`, `.txt`, images) | Non-R files | Show greyed out in file browser |
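Steps 3–4 can be sketched without the fflate call itself — in the real extractor the path list comes from the `fflate.unzipSync` result, and the classification table above drives `classify`. Function names here are illustrative:

```typescript
type FileClass = 'r-source' | 'csv' | 'stata' | 'unsupported-data' | 'pdf' | 'other';

// Extension-based classification, mirroring the table above.
function classify(path: string): FileClass {
  const ext = path.toLowerCase().split('.').pop() ?? '';
  if (ext === 'r') return 'r-source';
  if (ext === 'csv') return 'csv';
  if (ext === 'dta') return 'stata';
  if (['xlsx', 'xls', 'rdata', 'rds'].includes(ext)) return 'unsupported-data';
  if (ext === 'pdf') return 'pdf';
  return 'other';
}

// If every path lives under a single top-level directory, strip that prefix
// so paths become relative to the package root.
function stripCommonPrefix(paths: string[]): string[] {
  const tops = new Set(paths.map((p) => p.split('/')[0]));
  const hasBare = paths.some((p) => !p.includes('/'));
  if (tops.size !== 1 || hasBare) return paths;
  const prefix = [...tops][0] + '/';
  return paths.map((p) => p.slice(prefix.length));
}
```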
5. **Size gate for data files** — check uncompressed size from ZIP metadata before extracting; oversized files (>500MB) get a size warning badge in the file browser instead of being loaded.
Feed to systems:

- R files → FileRegistry: resolve `source()`, process in order → unified `AnalysisCall[]`
- Data files → dataset registry: auto-match data-load nodes by filename (existing M4 logic)
- `AnalysisCall[]` → mapper → one flat pipeline DAG

All entry points produce a FileRegistry:

- Paste code in editor → FileRegistry with one entry, path `untitled.R`
- Upload single `.R` file → FileRegistry with one entry, path = filename
- Upload ZIP → FileRegistry with N entries

The downstream pipeline (recognizer → mapper → executor) is identical in all cases.
The existing upload button in the toolbar becomes a unified drop zone accepting `.zip`, `.R`, `.csv`, and `.dta`:

- `.R` → opens as a new editor tab, creates a single-file FileRegistry
- `.csv` / `.dta` → feeds the dataset registry (existing M4 flow)

File browser: a new collapsible left panel (~200px wide) showing the extracted folder structure as a tree.
Visibility:

- Hidden in single-file mode (paste in editor, single `.R` upload)
- Shown after ZIP upload
- Collapsible via toggle button

Tree contents:

- Folder hierarchy mirrors the ZIP structure (after common prefix stripping)
- R files: code icon, clickable → opens as editor tab
- Data files (CSV/dta): data icon, clickable → highlights in dataset toolbar dropdown
- Unsupported data files (xlsx, RData): data icon + warning badge, tooltip explains
- Oversized data files (>500MB): data icon + size warning badge
- Non-R files (`.do`, `.py`, images): greyed out, not clickable
- Folders are collapsible

File status indicators:

- Parsed successfully: green dot
- Parse errors: amber dot with error count
- Not applicable (non-R): no dot
The existing CodeMirror editor gains a tab bar above it.

Tab behavior:

- Single-file paste: one tab labeled "untitled.R"
- ZIP upload: one tab per R file, labeled with filename (tooltip shows full path)
- Clicking a tab switches the editor content
- Tabs are editable (CodeMirror remains read-write)
- Active tab visually highlighted

Tab ordering:

- Topological order (files that are `source()`'d by others come first)
- Files without `source()` edges sorted alphabetically after dependency-ordered files
- Entry point files (e.g., `Run_All.R`, `main.R`) naturally sort last since they source everything

Overflow handling:

- If >10 tabs, the tab bar scrolls horizontally
- Overflow arrows on left/right edges

Re-parse on edit:

- When editor content changes (debounced), re-parse the active file
- Re-run FileRegistry processing for that file (re-extract `source()` calls, re-scan function defs + constants)
- Re-run recognizer + mapper to update the DAG
- Other files' parse results are cached — only the edited file re-processes
Current layout: `[Code Editor | DAG Canvas]` with the property sheet overlay on the right and the data table panel at the bottom.

After M5a:

```
┌──────────┬──────────────────────┬─────────────────────────┐
│ File     │ tab1 | tab2 | tab3   │                         │
│ Browser  ├──────────────────────┤       DAG Canvas        │
│ (tree)   │                      │                         │
│          │     Code Editor      │                         │
│          │                      │                         │
│          │                      │    [Property Sheet]     │
├──────────┴──────────────────────┴─────────────────────────┤
│                     Data Table Viewer                     │
└───────────────────────────────────────────────────────────┘
```
| Area | Tests |
|---|---|
| Parser: FunctionDefNode | Parse `f <- function(x, y = 1) { x + y }`, verify params with defaults. Parse `function(x) x + 1` (no braces). Nested function bodies. |
| Parser: ForNode | Parse `for (i in 1:10) { x <- i }`, verify variable/iterable/body. Parse `for (x in vec) print(x)` (no braces). |
| Parser: LBrace/RBrace | Verify existing code still parses correctly with new tokens. Braces inside strings don't tokenize. |
| FileRegistry: source resolution | Exact path from root. Relative to caller directory. Basename fallback. Path-segment scoring. No match → diagnostic. |
| FileRegistry: topological sort | Linear chain (A sources B sources C). Star topology (main sources A, B, C). Cycle detection + warning. |
| FileRegistry: scope accumulation | Function defs accumulate across files in order. String constants accumulate. Later files shadow earlier defs. |
| Inliner: argument matching | Positional args. Named args. Mixed positional + named. Default values. Missing required arg → failure. |
| Inliner: substitution | Simple identifier replacement. Nested expression replacement. Free variable from constants. |
| Inliner: paste0/paste evaluation | All-literal args → concatenated string. Mixed literal + non-literal → left as-is. `paste()` with custom sep. |
| Inliner: formula resolution | `formula("y ~ x")` → FormulaExprNode. `as.formula(paste0(...))` after paste eval. |
| Inliner: guard checks | NSE markers → failure. Nested function defs → failure. Empty body → failure. |
| Inliner: provenance | `sourceFunction` set on resulting AnalysisCalls. `sourceSpan` points to call site. |
| ZIP extraction | Common prefix stripping. File classification by extension. Size gate thresholds. |
| Test | What it verifies |
|---|---|
| soil-heterogeneity `table1()` | Define function with 6 `felm()` + `paste0()` + file-scope constants → call site produces 6 AnalysisCalls with correct formulas |
| dissecting-crises helper chain | `source("loading_package_and_functions.R")` → define helpers → call helpers in main script → `lm()` calls recognized |
| investor-memory flat source | `Run_All.R` sources 27 independent scripts → all produce independent AnalysisCalls → flat DAG |
| Multi-file → single pipeline | 3 files with `source()` chain → one unified pipeline with correct edge wiring |
| Single-file backward compat | Existing single-file R code paste → same behavior as before (one-entry FileRegistry) |
| Test | What it verifies |
|---|---|
| ZIP upload | Upload ZIP with R files + CSV → file browser shows tree → tabs appear → DAG renders |
| Single R file upload | Upload one `.R` file → no file browser → one tab → DAG renders |
| Tab switching | Click tabs → editor content changes → DAG stays (same pipeline) |
| File browser click | Click R file in tree → corresponding tab activates |
| Edit tab → re-parse | Modify code in a tab → DAG updates after debounce |
**soil-heterogeneity** — no `source()` needed: just 2 files with function definitions + call sites. Tests function inlining + `paste0()` evaluation + file-scope constants. Our highest-value validation: 206 models, 72% inlinable custom functions. Upload the 2 R files + data → FileRegistry scans function defs + constants → `table1()` call sites inline to `felm()` calls → 206 linear-model nodes in the DAG.

**investor-memory** — flat `source()` topology. Tests `source()` resolution + multi-file merge. `Run_All.R` sources 27 scripts → each parsed independently → 137 `felm()` calls recognized → flat DAG.

**dissecting-crises** — dependency chain via `source()`. Tests `source()` resolution + function tracking across files + inlining of `lm()` wrappers.
| Package | Version | License | Size | Purpose |
|---|---|---|---|---|
| fflate | latest | MIT | ~13KB gzip | ZIP extraction in browser |
No other new dependencies. All other functionality builds on existing infrastructure (Chevrotain parser, M4 dataset registry, CodeMirror editor, React Flow DAG).
| File | Change |
|---|---|
| `src/core/parsers/r/ast.ts` | Add `FunctionDefNode`, `FunctionParam`, `ForNode` to the `RNode` union |
| `src/core/parsers/r/tokens.ts` | Add `LBrace`, `RBrace`, `In` tokens |
| `src/core/parsers/r/parser.ts` | Add `functionDef`, `forStatement`, `body` grammar rules. Update `statement` and `atom`. |
| `src/core/parsers/r/visitor.ts` | Add visitors for new CST nodes → AST nodes |
| `src/core/parsers/shared/analysis-call.ts` | Add optional `sourceFunction?: string` field to `AnalysisCall` |
| `src/core/parsers/r/recognizer.ts` | Accept `globalFunctions` + `globalConstants` params. Delegate to inliner for user-defined function calls. |
| `src/core/parsers/r/inliner.ts` | New. Function inlining: arg matching, substitution, paste0 evaluation, formula resolution, re-recognition. |
| `src/core/parsers/file-registry.ts` | New. FileRegistry: file tracking, `source()` resolution, topological sort, scope accumulation. |
| `src/core/zip/extractor.ts` | New. ZIP extraction, file classification, size gating. |
| `src/ui/store/files.ts` | New. Zustand store for file browser state, tab state, active file. |
| `src/ui/components/file-browser/` | New. File browser tree panel component. |
| `src/ui/components/editor/tab-bar.tsx` | New. Tab bar component for the code editor. |
| `src/ui/components/editor/code-editor.tsx` | Modify existing editor to support multiple files via tabs. |
| `src/ui/components/toolbar/upload-zone.tsx` | New. Unified upload zone replacing the current upload button. |
| `src/ui/App.tsx` | Layout changes: add file browser panel, wire tab bar to editor, connect upload zone. |
| `src/ui/store/pipeline.ts` | Update `runPipeline` to use the FileRegistry instead of single-source parse. |
| `package.json` | Add `fflate` dependency. |