Replication Audit System — Design Spec

Date: 2026-04-14
Goal: A developer-facing metric system that tracks how much of each paper’s estimation pipeline Interlyse can execute, with dependency-aware blocker drill-down to prioritize feature development.

Core Concepts

Unit of Measurement: Model Execution

The atomic unit is a model node — any recognized estimation call (lm(), feols(), felm(), glm(), ivreg(), lm_robust(), rq(), etc.) that produces regression/test results. Table output (stargazer, etable) is a downstream presentation concern, to be tracked separately later.

Each model node has an upstream dependency chain in the pipeline DAG: data-load -> transforms -> … -> model. A model is only executable if every node in its chain is supported.

Two Layers

  1. Static analysis — parses R code, builds the DAG, classifies every node as supported or blocked. Runs on the full corpus (all downloaded packages, no data files needed). Answers: “what could work?”

  2. Execution verification — actually runs the pipeline with real data, compares TS output against R output. Runs on a curated subset where data is staged. Answers: “what actually works?”

Three Model Statuses

The three statuses are blocked, executable, and verified. A model can be executable but not verified (no data staged), or blocked but still partially informative (some upstream nodes work and the blocker is identified).

Data Model

Per-Model Record

type ModelStatus =
  | { status: 'blocked'; blockers: BlockerInfo[] }
  | { status: 'executable' }
  | { status: 'verified'; match: boolean; divergence?: DivergenceInfo }

type BlockerInfo = {
  feature: string        // e.g., "pivot_wider", "read_xlsx", "setwd"
  nodeType: string       // pipeline node type that's unsupported
  nodeLabel: string      // human-readable label from the R source
  upstreamOf: string[]   // which model nodes this blocks
}

type DivergenceInfo = {
  coefficient: string    // which coef diverged
  tsValue: number
  rValue: number
  delta: number
}

Paper Scorecard

type PaperScorecard = {
  paperId: string              // e.g., "qje-investor-memory"
  journal: string
  rFiles: number
  modelsDetected: number
  modelsExecutable: number     // static analysis
  modelsVerified: number       // execution pass (0 if not staged)
  modelsVerifiedPassing: number
  dataStaged: boolean
  blockers: Record<string, {   // keyed by feature name
    modelsBlocked: number
    modelLabels: string[]
  }>
}

Corpus Rollup

type CorpusRollup = {
  timestamp: string
  totalPapers: number
  totalModels: number
  totalExecutable: number
  totalVerified: number
  totalVerifiedPassing: number
  papersFullyExecutable: number
  featurePriority: {
    feature: string
    modelsUnblocked: number
    papersAffected: number
  }[]
  paperScorecards: PaperScorecard[]
}
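
The featurePriority list can be derived mechanically from the per-paper scorecards. A minimal sketch, with PaperScorecard trimmed to the fields needed here; the tie-break on papersAffected is an assumption, not decided behavior:

```typescript
// Aggregate per-paper blocker counts into the corpus-level feature ranking.
// PaperScorecard is reduced to the fields this function reads.
type PaperScorecard = {
  paperId: string;
  blockers: Record<string, { modelsBlocked: number; modelLabels: string[] }>;
};

type FeaturePriority = {
  feature: string;
  modelsUnblocked: number;
  papersAffected: number;
};

function rankFeatures(scorecards: PaperScorecard[]): FeaturePriority[] {
  const byFeature = new Map<string, FeaturePriority>();
  for (const card of scorecards) {
    for (const [feature, info] of Object.entries(card.blockers)) {
      const entry =
        byFeature.get(feature) ?? { feature, modelsUnblocked: 0, papersAffected: 0 };
      entry.modelsUnblocked += info.modelsBlocked;
      entry.papersAffected += 1; // each paper contributes at most one blocker entry per feature
      byFeature.set(feature, entry);
    }
  }
  // Rank by models unblocked, then by papers affected (assumed tie-break).
  return [...byFeature.values()].sort(
    (a, b) => b.modelsUnblocked - a.modelsUnblocked || b.papersAffected - a.papersAffected
  );
}
```

The ranking answers the prioritization question directly: the top row is the single feature whose implementation unblocks the most models.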

Component 1: Static Analysis Engine

Architecture

A Node CLI script (scripts/audit-static.ts) that reuses the actual Interlyse parser pipeline:

R files  ->  parseR()  ->  recognizeR()  ->  mapToPipeline()  ->  DAG
                                                                    |
                                                          classify each node
                                                                    |
                                                          walk back from models
                                                                    |
                                                          scorecard JSON

This reuses the real parser/recognizer/mapper so “is this supported?” stays in one place. No duplicated classification logic in Python.

Supported Features Manifest

The script doesn’t need a separate manifest — it uses the actual pipeline mapper and executor registry. If mapToPipeline() produces a node and executorRegistry has a handler for it, it’s supported. If the recognizer doesn’t recognize a function call, it’s unsupported. The source of truth is the code itself.

For function calls that the recognizer skips (not recognized), the script captures them as potential blockers by diffing “all function calls in the AST” against “function calls that produced pipeline nodes.”
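
That diff is a straightforward set difference over call names. A sketch, assuming the script has already flattened the AST into call-name lists (allCallNames and mappedCallNames are hypothetical inputs; the real data comes from the parser AST and mapToPipeline()):

```typescript
// Surface unrecognized calls by set-differencing all call names found in
// the AST against the calls that produced pipeline nodes.
function findUnrecognizedCalls(
  allCallNames: string[],
  mappedCallNames: string[]
): string[] {
  const mapped = new Set(mappedCallNames);
  const seen = new Set<string>();
  const unrecognized: string[] = [];
  for (const name of allCallNames) {
    // Preserve first-seen source order; dedupe repeats.
    if (!mapped.has(name) && !seen.has(name)) {
      seen.add(name);
      unrecognized.push(name);
    }
  }
  return unrecognized;
}
```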

Blocker Identification

For each model node in the DAG:

  1. Walk the dependency chain backward (follow input edges).
  2. For each upstream node, check: does an executor exist for this node type?
  3. For unrecognized function calls that appear in the R source between the data-load statement and the model call (by source position / Span), flag them as potential blockers — these likely represent transform steps the recognizer missed.
  4. The primary blocker is the first unsupported node in topological order (closest to the data source) — fixing it might unblock a cascade.
  5. Secondary blockers are additional unsupported nodes further downstream — even after fixing the primary, these still need work.
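
The backward walk can be sketched as follows. The Node shape and hasExecutor predicate are hypothetical stand-ins for the real DAG types and executor registry; for a linear chain, reversing DFS discovery order approximates topological order:

```typescript
// Walk upstream from a model node and return unsupported node ids,
// ordered closest-to-data-source first (so element 0 is the primary blocker).
type Node = { id: string; type: string; inputs: string[] };

function findBlockers(
  modelId: string,
  nodes: Map<string, Node>,
  hasExecutor: (nodeType: string) => boolean
): string[] {
  // Collect the upstream chain via DFS over input edges.
  const chain: Node[] = [];
  const visited = new Set<string>();
  const stack = [modelId];
  while (stack.length > 0) {
    const id = stack.pop()!;
    if (visited.has(id)) continue;
    visited.add(id);
    const node = nodes.get(id);
    if (!node) continue;
    chain.push(node);
    stack.push(...node.inputs);
  }
  // Discovery order runs model -> source; reverse it so the first
  // unsupported node is the one nearest the data source.
  return chain
    .reverse()
    .filter((n) => !hasExecutor(n.type))
    .map((n) => n.id);
}
```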

Models Without Data Sources

Some models have no recognizable upstream data-load — the paper uses attach(), pre-loaded objects, or data constructed entirely in custom functions. These models are classified as blocked with a synthetic blocker "data-source-unknown". They can transition to executable once the data path is resolved (e.g., by staging data and mapping it manually to the model’s expected variables).

Multi-File Handling

Papers with source() calls use the existing FileRegistry + topological sort (M5a). The audit script processes all R files in a package in dependency order, accumulating scope, just like the real pipeline would.
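
The dependency ordering amounts to a depth-first topological sort. A sketch, where the deps map (file -> files it source()s) is a hypothetical stand-in for what FileRegistry provides, and a source() cycle is treated as an error:

```typescript
// Order R files so that source()d dependencies are processed first.
function topoOrder(deps: Map<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();
  const visit = (file: string) => {
    const s = state.get(file);
    if (s === "done") return;
    if (s === "visiting") throw new Error(`source() cycle at ${file}`);
    state.set(file, "visiting");
    for (const dep of deps.get(file) ?? []) visit(dep); // dependencies first
    state.set(file, "done");
    order.push(file);
  };
  for (const file of deps.keys()) visit(file);
  return order;
}
```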

Running

# Audit all downloaded packages
node scripts/audit-static.js

# Audit a single paper
node scripts/audit-static.js --paper qje-investor-memory

# Output: reference-papers/audit/static-results.json

Component 2: Execution Verification Engine

Architecture

A test harness (scripts/audit-verify.ts) for papers with staged data:

Paper R code  ->  local R process  ->  ground-truth JSON (coefficients per model)
Paper R code  ->  Interlyse TS pipeline  ->  TS results (coefficients per model)
                                                    |
                                              diff TS vs R
                                                    |
                                           verification JSON

R Ground-Truth Extraction

For each staged paper, a small R wrapper script:

  1. source()s the paper’s R code
  2. Intercepts estimation function calls (wraps lm, feols, etc.)
  3. For each model, extracts: coefficients, standard errors, p-values, R-squared, N, residual df
  4. Writes structured JSON: reference-papers/audit/r-ground-truth/<paper-id>.json

This R wrapper is semi-automated — the static analysis tells us which functions appear and how many models to expect. The wrapper template handles the common cases (lm, feols, felm, glm, ivreg). Papers with unusual patterns may need manual adjustment.

TS Execution

The verify script:

  1. Loads the paper’s data files into Interlyse Dataset objects
  2. Parses the R code through the full pipeline (parser -> recognizer -> mapper -> executor)
  3. Extracts the same coefficient/SE/p-value values from each model node’s result
  4. Writes reference-papers/audit/ts-results/<paper-id>.json

Diffing

Tolerances (matching existing test conventions):

  - Coefficients/statistics: |ts - r| < 0.00005
  - P-values: |ts - r| < 0.00001
  - N, df: exact match

A model is verified passing when all extracted values match. A model is verified failing when it executes but values diverge — the divergence info captures which coefficient diverged and by how much.
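
A sketch of the per-model diff under these tolerances. ModelResult is a hypothetical flattened shape standing in for the ground-truth and ts-results JSON; SE and p-value checks would follow the same pattern:

```typescript
// Compare one model's TS output against the R ground truth.
type ModelResult = {
  coefficients: Record<string, number>;
  n: number;
  df: number;
};

type Divergence = { coefficient: string; tsValue: number; rValue: number; delta: number };

const COEF_TOL = 0.00005; // |ts - r| tolerance for coefficients/statistics

function diffModel(
  ts: ModelResult,
  r: ModelResult
): { passing: boolean; divergences: Divergence[] } {
  const divergences: Divergence[] = [];
  for (const [name, rValue] of Object.entries(r.coefficients)) {
    const tsValue = ts.coefficients[name];
    const delta = Math.abs(tsValue - rValue);
    // A coefficient missing on the TS side yields a NaN delta, which
    // fails the tolerance check and is reported as a divergence.
    if (!(delta < COEF_TOL)) divergences.push({ coefficient: name, tsValue, rValue, delta });
  }
  // N and df must match exactly.
  const passing = divergences.length === 0 && ts.n === r.n && ts.df === r.df;
  return { passing, divergences };
}
```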

Data Staging

Staged papers live in reference-papers/audit/data/<paper-id>/ with the CSV/DTA files needed for execution. A reference-papers/audit/staged-papers.json manifest tracks which papers are staged and ready for verification.

Running

# Verify all staged papers
node scripts/audit-verify.js

# Verify a single paper
node scripts/audit-verify.js --paper qje-investor-memory

# Output: reference-papers/audit/verification-results.json

Component 3: Reporting

Markdown Report

A script (scripts/audit-report.ts) that consumes the static and verification JSON outputs and produces reference-papers/REPLICATION-AUDIT.md.

Format:

# Replication Audit — YYYY-MM-DD

## Headline
Models executable: X / Y (Z%) across N R papers
Models verified:   A / B (C%) across M staged papers
Papers 100% executable: P / N

## Feature Priority
| Feature           | Models unblocked | Papers affected |
|-------------------|:----------------:|:---------------:|
| ...               | ...              | ...             |

## Per-Paper Scorecards
### paper-id (K models)
- Executable: X/K (Z%)
- Verified: A/B (C%) [or "not staged"]
- Blockers:
  - feature_name -> blocks N models

Diff Mode

npm run audit -- --diff compares against the previous REPLICATION-AUDIT.md (or a saved static-results.json snapshot) and outputs only what changed:

## Changes since last audit (YYYY-MM-DD)
- Added: distinct() support
- Models executable: 3,412 -> 3,455 (+43)
- Papers at 100%: 14 -> 16 (+2)
- Newly unblocked papers: jpe-dissecting-financial-crises, restud-inference-single-treated-cluster

Previous snapshots are saved in reference-papers/audit/history/ with timestamps.
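
The diff computation itself is a field-by-field comparison of two rollups. A sketch over two of the CorpusRollup counters; the message strings are illustrative, not a fixed format:

```typescript
// Compare the current rollup against a saved snapshot and report deltas.
// Snapshot is trimmed to the counters compared here.
type Snapshot = { totalExecutable: number; papersFullyExecutable: number };

function diffSnapshots(prev: Snapshot, curr: Snapshot): string[] {
  const lines: string[] = [];
  const fmt = (n: number) => (n >= 0 ? `+${n}` : `${n}`);
  if (curr.totalExecutable !== prev.totalExecutable) {
    lines.push(
      `Models executable: ${prev.totalExecutable} -> ${curr.totalExecutable} (${fmt(curr.totalExecutable - prev.totalExecutable)})`
    );
  }
  if (curr.papersFullyExecutable !== prev.papersFullyExecutable) {
    lines.push(
      `Papers at 100%: ${prev.papersFullyExecutable} -> ${curr.papersFullyExecutable} (${fmt(curr.papersFullyExecutable - prev.papersFullyExecutable)})`
    );
  }
  return lines; // empty when nothing changed
}
```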

Top-Level Command

# Full audit: static + verify + report
npm run audit

# Static only (no data needed)
npm run audit:static

# Verify only (staged papers)
npm run audit:verify

# Report with diff
npm run audit -- --diff

Wired up via package.json scripts.

File Structure

scripts/
  audit-static.ts          # Static DAG analysis CLI
  audit-verify.ts          # Execution verification CLI
  audit-report.ts          # Markdown report generator

reference-papers/
  audit/
    static-results.json    # Latest static analysis output
    verification-results.json  # Latest verification output
    r-ground-truth/        # R output per staged paper
      qje-investor-memory.json
      ...
    ts-results/            # TS output per staged paper
      qje-investor-memory.json
      ...
    data/                  # Staged data files per paper
      qje-investor-memory/
        data.csv
        ...
    staged-papers.json     # Manifest of staged papers
    history/               # Snapshots for diff mode
      2026-04-14.json
      ...
  REPLICATION-AUDIT.md     # Human-readable report (committed)

Scope & Non-Goals

Growth Path

  1. Start: Static analysis on all 36 R papers. Execution verification on 3-5 papers with easily accessible data (MIT-licensed packages, CSV-only data).
  2. Near term: Grow staged papers to 10-15 as you download more data. Add Stata papers when the Stata parser ships.
  3. Later: The execution pass could use WebR instead of a local R process, making it fully self-contained (no R installation required). The static analysis could run in CI as a regression check.