M4.5: DiD Estimators — Design Spec

Goal

Add a did pipeline node that runs difference-in-differences estimators as a single coherent analysis. Internally, each estimator composes over existing regression primitives (FE demeaning, OLS, clustered SEs). Externally, it’s one node with typed params and a unified result containing event-study coefficients, ATT summaries, and pre-trends tests.

User story: Researcher pastes R code containing att_gt(), did2s(), or feols() with event-study syntax → Interlyse recognizes it as a DiD analysis → executes the estimator(s) → displays an event-study plot overlaying all estimators with confidence intervals.

Why now (between M4 and M5): DiD is the dominant method in applied econ. 3 of 6 reference papers use it. M5’s replication workflow needs DiD to validate against the JEL-DiD paper. The engine (M3 regression + M4 data pipelines) is ready; this milestone adds the meta-estimator layer on top.

Estimators (Tier 2)

| Estimator | R Package | Internal approach |
|---|---|---|
| TWFE event-study | fixest (i() syntax) | Expand i(event_time, ref=...) into binary indicators, run existing feols() path |
| Gardner two-stage | did2s | Stage 1: OLS on untreated obs → residuals. Stage 2: OLS on residualized outcome with event indicators. Two calls to existing regression.ts |
| Callaway-Sant'Anna | did (att_gt) | Loop over (cohort, time) pairs, run 2×2 DiD on subsets via OLS, aggregate with proper weights. Bootstrap for inference (default 1000 iterations) |
| Sun-Abraham | fixest (sunab()) | Construct cohort × relative-time interaction indicators, run feols, reweight coefficients for aggregation |
| Borusyak imputation | didimputation | OLS on untreated observations → impute Y(0) for treated → ATT = Y - Y_hat(0). Influence-function SEs |

Deferred (Tier 3): Roth-Sant’Anna (staggered) — specialized variance formula, lower usage. Slots in later as another run* function.

Architecture: Hybrid Monolithic Executor (Approach C)

One did executor with per-estimator error isolation via try/catch. Failed estimators produce error entries in the result; succeeded estimators display normally.

did-executor
│
├─ validatePanel(dataset, params)     → PanelInfo { units, times, cohorts, neverTreated }
├─ preparePanelData(dataset, params)  → adds time_to_treat, ever_treated, treatment columns
│
├─ try { runTWFE() }       → success | error
├─ try { runGardner() }    → success | error
├─ try { runCS() }         → success | error
├─ try { runSunAbraham() } → success | error
├─ try { runBorusyak() }   → success | error
│
└─ DiDResult
    ├─ succeeded: EstimatorResult[]
    ├─ failed: { estimator, error }[]
    ├─ eventStudyCoefs: Map<estimator, { time, coef, se, ciLower, ciUpper }[]>
    ├─ attSummary: Map<estimator, { att, se, pValue }>
    └─ preTrendsTest: Map<estimator, { fStat, pValue, df }>
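
The per-estimator error isolation in the diagram can be sketched as follows. The `EstimatorOutcome` shape matches the result type defined later in this spec; the `runEstimators` helper and its list-of-runners signature are illustrative, not existing code:

```typescript
// Per-estimator error isolation: one failing run* call produces an
// error entry instead of aborting the whole did executor.
type EstimatorName = 'twfe' | 'gardner' | 'callaway-santanna' | 'sun-abraham' | 'borusyak';

type EstimatorOutcome =
  | { estimator: EstimatorName; status: 'success' }
  | { estimator: EstimatorName; status: 'error'; error: string };

function runEstimators(
  runners: { estimator: EstimatorName; run: () => void }[],
): EstimatorOutcome[] {
  return runners.map(({ estimator, run }) => {
    try {
      run(); // each run* call is isolated
      return { estimator, status: 'success' as const };
    } catch (e) {
      // a failed estimator becomes an error entry; the others still run
      const error = e instanceof Error ? e.message : String(e);
      return { estimator, status: 'error' as const, error };
    }
  });
}
```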

Why this approach

Reuses existing infrastructure

New primitives needed

Pipeline Types

Node type

interface DiDNode extends BaseNode {
  type: 'did';
  params: DiDParams;
  result?: DiDResult;
}

interface DiDParams {
  yname: string;                          // outcome variable
  tname: string;                          // calendar time variable
  idname: string;                         // unit/panel ID variable
  gname: string;                          // cohort variable (first treatment period; 0/Infinity = never-treated)
  estimators: DiDEstimator[];             // which estimators to run
  xformla?: Formula;                      // optional covariates (RHS of ~)
  weights?: string;                       // optional weight column
  clusterVar?: string;                    // cluster variable (defaults to idname)
  controlGroup: 'nevertreated' | 'notyettreated';  // who is the comparison group
  eventHorizon?: [number, number];        // relative time range [min_e, max_e] (default: data-driven)
  bootstrapIterations?: number;           // for C-S bootstrap (default: 1000)
  confidenceLevel?: number;               // for CIs (default: 0.95)
}

type DiDEstimator = 'twfe' | 'gardner' | 'callaway-santanna' | 'sun-abraham' | 'borusyak';

Result type

interface DiDResult {
  type: 'did';
  panelInfo: PanelInfo;
  estimatorResults: EstimatorOutcome[];
  eventStudyCoefs: Record<DiDEstimator, EventStudyCoefficient[]>;
  attSummary: Record<DiDEstimator, ATTSummary>;
  preTrendsTest: Record<DiDEstimator, PreTrendsTest>;
}

type EstimatorOutcome =
  | { estimator: DiDEstimator; status: 'success' }
  | { estimator: DiDEstimator; status: 'error'; error: string };

interface PanelInfo {
  nUnits: number;
  nPeriods: number;
  cohorts: { treatmentTime: number; nUnits: number }[];
  nNeverTreated: number;
  balancedPanel: boolean;
}

interface EventStudyCoefficient {
  relativeTime: number;   // e.g., -3, -2, -1, 0, 1, 2, 3
  estimate: number;
  se: number;
  ciLower: number;
  ciUpper: number;
  pValue: number;
  isReference: boolean;   // true for omitted reference period
}

interface ATTSummary {
  att: number;            // overall average treatment effect on the treated
  se: number;
  pValue: number;
  ciLower: number;
  ciUpper: number;
  nTreatedObs: number;
}

interface PreTrendsTest {
  fStat: number;
  pValue: number;
  df: [number, number];
  preCoefs: EventStudyCoefficient[];  // just the t < 0 coefficients
}
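
The joint pre-trends test can be sketched as a Wald statistic on the pre-period coefficients, W = b' V⁻¹ b, with F = W / q in the finite-sample version. This is a minimal sketch assuming the estimator's covariance matrix for the pre-period coefficients is available; the `solve` helper is a throwaway Gaussian elimination for the small q×q system, not existing code (a real implementation would use the clustered covariance from the regression layer):

```typescript
// Minimal linear solve (Gaussian elimination with partial pivoting)
// for the small q x q system V x = b.
function solve(A: number[][], b: number[]): number[] {
  const n = b.length;
  const M = A.map((row, i) => [...row, b[i]]);
  for (let col = 0; col < n; col++) {
    let piv = col;
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(M[r][col]) > Math.abs(M[piv][col])) piv = r;
    }
    [M[col], M[piv]] = [M[piv], M[col]];
    for (let r = 0; r < n; r++) {
      if (r === col) continue;
      const f = M[r][col] / M[col][col];
      for (let c = col; c <= n; c++) M[r][c] -= f * M[col][c];
    }
  }
  return M.map((row, i) => row[n] / M[i][i]);
}

// Wald statistic on the t < 0 coefficients (H0: all pre-period
// coefficients are zero). Divide by q = preCoefs.length for an F-stat.
function jointWald(preCoefs: number[], vcov: number[][]): number {
  const x = solve(vcov, preCoefs); // V^{-1} b
  return preCoefs.reduce((s, b, i) => s + b * x[i], 0);
}
```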

Ports

'did': {
  inputs: [{ name: 'dataset', type: 'dataset', label: 'Panel Data' }],
  outputs: [{ name: 'result', type: 'did-result', label: 'DiD Result' }],
}

Recognizer Patterns

Individual estimator calls → single did node

# Callaway-Sant'Anna
att_gt(yname = "y", tname = "year", idname = "id", gname = "first_treat",
       data = df, control_group = "notyettreated")
# → did node with estimators: ['callaway-santanna']

# Gardner two-stage
did2s(data, yname = "y", first_stage = ~0 | id + year,
      second_stage = ~i(time_to_treat), treatment = "treat", cluster_var = "id")
# → did node with estimators: ['gardner']

# Borusyak imputation
did_imputation(data, yname = "y", gname = "g", tname = "t", idname = "i")
# → did node with estimators: ['borusyak']

event_study() wrapper → single did node with multiple estimators

event_study(data = df, yname = "y", idname = "id", tname = "year",
            gname = "first_treat", estimator = "all")
# → did node with estimators: ['twfe', 'gardner', 'callaway-santanna', 'sun-abraham', 'borusyak']

feols() with event-study syntax → routed to did instead of linear-model

# i() event-study syntax → did node (estimators: ['twfe'])
feols(y ~ i(time_to_treat, ref = c(-1, -Inf)) | id + year, data = df)

# sunab() syntax → did node (estimators: ['sun-abraham'])
feols(y ~ sunab(first_treat, time_to_treat) | id + year, data = df)

The recognizer inspects the formula: if it contains i(event_var, ref=...) or sunab(...), route to did node. Plain feols(y ~ treat | id + year) without event-study syntax stays as linear-model — that’s standard TWFE, not necessarily a DiD event study.
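
The routing rule can be sketched with a string-level check. The real recognizer works on the parsed R AST; this regex version is only illustrative, and `routeFeolsFormula` is a hypothetical helper name:

```typescript
// Route a feols() formula: i(event_var, ref=...) or sunab(...) on the
// RHS signals an event-study DiD; otherwise keep the linear-model path.
function routeFeolsFormula(formula: string): 'did' | 'linear-model' {
  const hasEventSyntax = /\bi\s*\(/.test(formula) || /\bsunab\s*\(/.test(formula);
  return hasEventSyntax ? 'did' : 'linear-model';
}
```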

Grouping: multiple individual calls → one did node

When multiple DiD calls in the same script share the same panel structure (same yname, tname, idname, gname — matched by string equality on the column name arguments), the mapper merges them into a single did node with the union of their estimators. Similar to how multiple feols() calls get grouped in the spec explorer today. If panel structures differ (different outcome variable, different unit ID), they remain separate did nodes.

# These three calls share panel structure → one did node with 3 estimators
cs_result <- att_gt(yname="y", tname="year", idname="id", gname="g", data=df)
gardner_result <- did2s(df, yname="y", ...)
feols(y ~ i(time_to_treat, ref=-1) | id + year, data = df)
# → did node with estimators: ['callaway-santanna', 'gardner', 'twfe']
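
The merge rule above can be sketched as grouping on a panel-structure key. The four field names come from DiDParams; the `panelKey`/`groupCalls` helpers are illustrative, not existing mapper code:

```typescript
// Key a DiD call by its panel structure (string equality on the four
// column-name arguments); calls with the same key merge into one node
// with the union of their estimators.
interface PanelKeyParams { yname: string; tname: string; idname: string; gname: string; }

function panelKey(p: PanelKeyParams): string {
  return [p.yname, p.tname, p.idname, p.gname].join('\u0000');
}

function groupCalls<T extends PanelKeyParams & { estimator: string }>(
  calls: T[],
): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const c of calls) {
    const key = panelKey(c);
    const estimators = groups.get(key) ?? [];
    if (!estimators.includes(c.estimator)) estimators.push(c.estimator); // union
    groups.set(key, estimators);
  }
  return groups;
}
```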

aggte() → modifies the parent did node’s aggregation params

result <- att_gt(yname="y", tname="year", idname="id", gname="g", data=df)
aggte(result, type = "dynamic", min_e = -5, max_e = 10)
# → sets eventHorizon: [-5, 10] on the did node; aggte is not a separate node

UI Output

Primary: Event-Study Plot

Observable Plot chart showing:

- X-axis: relative time (periods before/after treatment)
- Y-axis: estimated coefficient (treatment effect)
- One series per estimator, color-coded (reuse COLORS from plot-theme.ts)
- 95% CI bands (shaded or error bars)
- Vertical dashed line at t = 0 (treatment onset)
- Horizontal dashed line at y = 0 (null effect)
- Reference period marked (typically t = -1)

Secondary: ATT Summary Table

| Estimator | ATT | SE | 95% CI | p-value |
|---|---|---|---|---|
| TWFE | 0.045 | 0.012 | [0.021, 0.069] | 0.000 |
| Gardner | 0.042 | 0.011 | [0.020, 0.064] | 0.000 |
| C-S | 0.038 | 0.013 | [0.013, 0.063] | 0.003 |

Pre-trends test summary:

| Estimator | Joint F | p-value | Pre-trend? |
|---|---|---|---|
| TWFE | 1.23 | 0.294 | No |
| Gardner | 0.98 | 0.421 | No |

Panel Info Summary

Displayed in the property sheet or results header:

- N units, N periods, balanced/unbalanced
- Treatment cohorts with counts
- Never-treated count
- Control group type

Estimator Implementation Notes

TWFE Event-Study

Expand i(time_to_treat, ref=c(-1, -Inf)) into binary indicator columns (one per relative time period, omitting reference periods). Construct design matrix, run existing feols() regression path with unit + time FE. Clustered SEs on idname. Straightforward — mostly indicator generation + existing infrastructure.
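
The indicator expansion can be sketched as below. `expandEventIndicators` is a hypothetical helper (the spec's file layout puts this logic in indicators.ts); never-treated units are assumed to carry time_to_treat = -Infinity, matching the ref = c(-1, -Inf) convention:

```typescript
// Expand a time_to_treat column into one binary indicator column per
// relative time period, omitting reference periods and the
// never-treated code (-Infinity).
function expandEventIndicators(
  timeToTreat: number[],
  refs: number[],
): { relativeTime: number; column: number[] }[] {
  const periods = [...new Set(timeToTreat)]
    .filter((e) => Number.isFinite(e) && !refs.includes(e))
    .sort((a, b) => a - b);
  return periods.map((e) => ({
    relativeTime: e,
    column: timeToTreat.map((t) => (t === e ? 1 : 0)),
  }));
}
```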

Gardner Two-Stage (did2s)

  1. Subset to untreated observations (time_to_treat < 0 or ever_treated == FALSE)
  2. Run OLS: y ~ 0 | unit + time on untreated subset → get unit/time FE estimates
  3. Residualize outcome for full dataset: y_resid = y - unit_FE - time_FE
  4. Run OLS: y_resid ~ event_indicators on full dataset
  5. Adjust SEs for two-stage estimation (cluster on idname)

Two calls to existing regression.ts. The SE adjustment is the main new piece — a standard two-stage correction analogous to the 2SLS SE adjustment we already have.
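
Step 3 (residualizing the full sample against the stage-1 FE estimates) can be sketched as follows. The `Obs` shape and the FE maps are illustrative; in the real pipeline the FE estimates come from the untreated-only regression in step 2:

```typescript
// Residualize the outcome against stage-1 unit and time FE estimates:
// y_resid = y - unit_FE - time_FE, over the full dataset.
interface Obs { unit: string; time: number; y: number; }

function residualize(
  data: Obs[],
  unitFE: Map<string, number>,
  timeFE: Map<number, number>,
): number[] {
  return data.map(
    (o) => o.y - (unitFE.get(o.unit) ?? 0) - (timeFE.get(o.time) ?? 0),
  );
}
```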

Callaway-Sant’Anna (att_gt)

For each (cohort g, time period t) where t >= g:

  1. Subset data to: units in cohort g + control units (never-treated or not-yet-treated at t)
  2. Take two time periods: t and a base period (typically g - 1 for universal base)
  3. Compute ATT(g,t) via outcome regression (difference-in-means or with covariates)
  4. Store influence function values for inference

Aggregate ATT(g,t) → overall ATT (simple weighted average) and event-study (average across cohorts at each relative time).

Bootstrap inference: resample unit-level blocks (all periods for a unit stay together), re-compute all ATT(g,t) + aggregation, derive SEs and CIs from bootstrap distribution. Default 1000 iterations.
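
The unit-block resampling can be sketched as below (a deterministic `rand` is injected for testability; `resampleUnits` is an illustrative name for what the spec's bootstrap.ts would provide):

```typescript
// Unit-block bootstrap draw: sample units with replacement, keeping
// all of each drawn unit's periods together.
function resampleUnits<T extends { unit: string }>(
  data: T[],
  rand: () => number = Math.random,
): T[] {
  const units = [...new Set(data.map((r) => r.unit))];
  const byUnit = new Map<string, T[]>();
  for (const r of data) {
    const rows = byUnit.get(r.unit);
    if (rows) rows.push(r);
    else byUnit.set(r.unit, [r]);
  }
  const sample: T[] = [];
  for (let i = 0; i < units.length; i++) {
    const drawn = units[Math.floor(rand() * units.length)];
    sample.push(...byUnit.get(drawn)!); // whole unit block stays together
  }
  return sample;
}
```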

Performance note: For typical applied econ panels (500-5000 units, 10-30 periods, 3-8 cohorts), each (g,t) regression is small. The bootstrap is the bottleneck — 1000 iterations × ~50 (g,t) pairs × small OLS = ~50K small regressions. At ~0.1ms each on modern hardware, this is ~5 seconds. Acceptable for a one-shot analysis.

Sun-Abraham (sunab)

Construct interaction indicators: for each cohort g and relative time e, create I(cohort == g) * I(time_to_treat == e). Run feols with these interactions + unit/time FE. Then reweight: the TWFE coefficient on I(time_to_treat == e) is a weighted average of cohort-specific effects; Sun-Abraham recovers the properly-weighted average by summing cohort-specific coefficients weighted by cohort shares.

Uses existing feols path for the regression. New logic: indicator construction + coefficient reweighting.
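
The reweighting step can be sketched as a cohort-share-weighted average of the cohort × relative-time coefficients. The input shape and the `reweightByCohortShare` helper are illustrative (real cohort shares would be estimated from the sample):

```typescript
// Collapse cohort-specific interaction coefficients at each relative
// time e into one event-study coefficient, weighting by cohort size.
function reweightByCohortShare(
  cohortCoefs: { cohort: number; relativeTime: number; coef: number; nUnits: number }[],
): Map<number, number> {
  const byTime = new Map<number, { coef: number; nUnits: number }[]>();
  for (const c of cohortCoefs) {
    const list = byTime.get(c.relativeTime) ?? [];
    list.push({ coef: c.coef, nUnits: c.nUnits });
    byTime.set(c.relativeTime, list);
  }
  const out = new Map<number, number>();
  for (const [e, list] of byTime) {
    const total = list.reduce((s, x) => s + x.nUnits, 0);
    out.set(e, list.reduce((s, x) => s + x.coef * (x.nUnits / total), 0));
  }
  return out;
}
```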

Borusyak Imputation (did_imputation)

  1. Subset to untreated observations (pre-treatment + never-treated)
  2. Run OLS: y ~ unit_FE + time_FE (+ covariates) on untreated subset
  3. Predict Y(0) for all treated observations using estimated FEs
  4. Imputed ATT for each treated (i,t): tau_hat(i,t) = Y(i,t) - Y_hat(0)(i,t)
  5. Aggregate imputed ATTs by relative time for event-study, or overall
  6. Influence-function SEs (analytical, no bootstrap needed)

One regression + prediction + aggregation. The influence-function SE derivation is the complex part.
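
Steps 3-5 can be sketched as below, assuming the FE estimates from the untreated-only regression are available as maps; the `TreatedObs` shape and `imputedEventStudy` helper are illustrative:

```typescript
// Impute Y(0) for each treated observation from the estimated FEs,
// take tau_hat = Y - Y_hat(0), and average by relative time.
interface TreatedObs { unit: string; time: number; y: number; timeToTreat: number; }

function imputedEventStudy(
  treated: TreatedObs[],
  unitFE: Map<string, number>,
  timeFE: Map<number, number>,
): Map<number, number> {
  const sums = new Map<number, { total: number; n: number }>();
  for (const o of treated) {
    const y0Hat = (unitFE.get(o.unit) ?? 0) + (timeFE.get(o.time) ?? 0); // imputed Y(0)
    const tau = o.y - y0Hat;                                             // per-(i,t) ATT
    const acc = sums.get(o.timeToTreat) ?? { total: 0, n: 0 };
    acc.total += tau;
    acc.n += 1;
    sums.set(o.timeToTreat, acc);
  }
  return new Map([...sums].map(([e, a]) => [e, a.total / a.n]));
}
```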

Testing Strategy

Unit tests (per estimator, validated against R output)

Run each estimator in R on a known dataset, capture coefficients/SEs/p-values, assert Interlyse matches within tolerance. Use the same tolerance standards as existing regression tests: <0.00005 for statistics, <0.00001 for p-values.

Test datasets:

- Simulated staggered adoption panel — small (100 units × 10 periods × 3 cohorts) with known DGP so we can verify both point estimates and SEs
- Castle doctrine dataset (from did2s package) — real data with published R output to validate against

Integration tests

E2E tests

Validates against

File structure

src/core/stats/
  did/
    panel.ts          — PanelInfo, validatePanel(), preparePanelData()
    indicators.ts     — event-study indicator expansion, i() and sunab() column generation
    twfe.ts           — runTWFE()
    gardner.ts        — runGardner()
    callaway.ts       — runCS(), bootstrap logic
    sun-abraham.ts    — runSunAbraham()
    borusyak.ts       — runBorusyak()
    aggregate.ts      — ATT aggregation, pre-trends test
    bootstrap.ts      — panel bootstrap (unit-block resampling)
    types.ts          — DiDParams, DiDResult, EventStudyCoefficient, etc.
    index.ts          — did executor entry point (orchestrates all of the above)
    *.test.ts         — colocated tests per file
src/core/pipeline/
  types.ts            — add DiDNode to PipelineNode union
  executor.ts         — register 'did' executor
src/core/parsers/r/
  recognizer.ts       — add att_gt, did2s, event_study, did_imputation patterns;
                        route feols+i()/sunab() to did
src/ui/components/
  results/
    event-study-plot.tsx  — Observable Plot event-study chart
    did-results.tsx       — ATT table + pre-trends table + panel info