M3: Generalized Linear Models (GLM)

Date: 2026-04-04
Status: Design
Depends on: OLS regression (design matrix, formula handling), robust SEs (sandwich.ts), FE demeaning (demean.ts — for the future FE-GLM upgrade path)

Motivation

~20% of applied econ papers use logit or probit models, primarily for propensity scores and binary outcomes. glm() is the fourth most common estimation function after lm(), feols(), and ivreg(). Without GLM, Interlyse silently skips propensity score models that feed into downstream estimators (IPW, doubly robust DiD).

Unlocks: glm(y ~ x, family=binomial), glm(y ~ x, family=binomial(link="probit")), glm(y ~ x, family=poisson), and all standard GLM families. feglm()/fepois() remain recognized-with-warning but the type system is ready for FE-GLM.


1. Type System

core/stats/types.ts — new types

// GLM family identifiers (match R naming)
export type GLMFamily = 'gaussian' | 'binomial' | 'poisson' | 'Gamma' | 'inverse.gaussian';

// Link function identifiers
export type GLMLink =
  | 'identity' | 'logit' | 'probit' | 'cloglog' | 'cauchit'  // binomial links
  | 'log' | 'sqrt'                                             // poisson/gamma links
  | 'inverse' | '1/mu^2';                                      // gamma/inv-gaussian links

export interface GLMResult {
  type: 'glm';
  family: GLMFamily;
  link: GLMLink;
  coefficients: CoefficientRow[];  // reuses existing type — z-stat in tStatistic field
  deviance: number;
  nullDeviance: number;
  aic: number;
  bic: number;
  dispersion: number;             // 1 for binomial/poisson, estimated for gaussian/Gamma/inv.gaussian
  dfResidual: number;
  dfNull: number;
  residuals: number[];            // deviance residuals
  fittedValues: number[];         // on response scale (μ, not η)
  converged: boolean;
  iterations: number;
  vcovType: VcovType;
  clusterInfo?: ClusterInfo[];    // future: clustered GLM SEs
  fixedEffects?: FEInfo[];        // future: FE-GLM
}

Note on CoefficientRow reuse: GLM uses z-statistics (normal distribution), not t-statistics. The existing CoefficientRow.tStatistic field stores the z-value. The field name is inherited — renaming it to statistic would break all existing consumers. The UI can display “z” vs “t” based on result type.

core/parsers/shared/analysis-call.ts

Add 'glm' to AnalysisKind:

export type AnalysisKind =
  | 'data-load' | 'data-filter' | 'descriptive' | 't-test'
  | 'linear-model' | 'glm'
  | 'model-summary' | 'model-comparison' | 'unsupported';

core/pipeline/types.ts — new node

export interface GLMParams {
  formula: Formula;
  data: string;
  family: GLMFamily;
  link: GLMLink;
  // Future: filled by recognizer when feglm/fepois support is added
  fixedEffects?: string[];
  vcovType?: HCType;
  clusterVars?: string[];
  // Future: weighted GLM
  weights?: string;
  offset?: string;
}

export interface GLMNode extends PipelineNodeBase {
  type: 'glm';
  params: GLMParams;
  result?: GLMResult;
}

Add GLMNode to the PipelineNode union. Add GLMResult to re-exports. Add port definition:

'glm': {
  inputs: [{ name: 'data', dataType: 'dataset' }],
  outputs: [{ name: 'model', dataType: 'model' }, { name: 'out', dataType: 'result' }],
},

Same ports as linear-model — GLM nodes can feed into model-summary and comparison-table.


2. Family Function Bundles

New file: core/stats/glm-families.ts

Each family is a pure object implementing a GLMFamilyFunctions interface:

export interface GLMFamilyFunctions {
  /** Link function: μ → η */
  linkFn(mu: number): number;
  /** Inverse link: η → μ */
  linkinv(eta: number): number;
  /** Derivative of linkinv: dμ/dη */
  muEta(eta: number): number;
  /** Variance function: Var(Y) = φ · V(μ) */
  variance(mu: number): number;
  /** Unit deviance: d(y, μ) */
  devResid(y: number, mu: number): number;
  /** Initialize μ from y (before first iteration) */
  initialize(y: number): number;
  /** Whether dispersion is estimated (true) or fixed at 1 (false) */
  estimateDispersion: boolean;
}

| Family           | Default link | V(μ)   | Dispersion     |
| ---------------- | ------------ | ------ | -------------- |
| gaussian         | identity     | 1      | estimated (σ²) |
| binomial         | logit        | μ(1−μ) | fixed = 1      |
| poisson          | log          | μ      | fixed = 1      |
| Gamma            | inverse      | μ²     | estimated      |
| inverse.gaussian | 1/mu^2       | μ³     | estimated      |

| Link     | g(μ)           | g⁻¹(η)        | g′(μ)                |
| -------- | -------------- | ------------- | -------------------- |
| identity | μ              | η             | 1                    |
| logit    | log(μ/(1−μ))   | 1/(1+e^(−η))  | 1/(μ(1−μ))           |
| probit   | Φ⁻¹(μ)         | Φ(η)          | 1/φ(Φ⁻¹(μ))          |
| cloglog  | log(−log(1−μ)) | 1−e^(−e^η)    | 1/((1−μ)(−log(1−μ))) |
| log      | log(μ)         | e^η           | 1/μ                  |
| inverse  | 1/μ            | 1/η           | −1/μ²                |
| sqrt     | √μ             | η²            | 1/(2√μ)              |
| 1/mu^2   | 1/μ²           | 1/√η          | −2/μ³                |
| cauchit  | tan(π(μ−0.5))  | 0.5+atan(η)/π | π(1+tan²(π(μ−0.5)))  |

Implementation: Compose link functions and variance functions independently, then combine into family bundles. A getFamily(family, link) factory returns the correct bundle, using the family’s default link if none specified.
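
The composition can be sketched as follows for the binomial family with the logit link. This is illustrative only — LinkBundle and binomialFamily are hypothetical names, not the final glm-families.ts API:

```typescript
// Sketch: compose a link bundle with a family's variance/deviance into a
// GLMFamilyFunctions-style object (binomial/logit only; names illustrative).

interface LinkBundle {
  linkFn(mu: number): number;   // μ → η
  linkinv(eta: number): number; // η → μ
  muEta(eta: number): number;   // dμ/dη
}

const logitLink: LinkBundle = {
  linkFn: (mu) => Math.log(mu / (1 - mu)),
  linkinv: (eta) => 1 / (1 + Math.exp(-eta)),
  // dμ/dη for logit is μ(1−μ) evaluated at μ = linkinv(η)
  muEta: (eta) => {
    const mu = 1 / (1 + Math.exp(-eta));
    return mu * (1 - mu);
  },
};

function binomialFamily(link: LinkBundle) {
  return {
    ...link,
    variance: (mu: number) => mu * (1 - mu),
    // Unit deviance for binary y, with 0·log(0) treated as 0
    devResid: (y: number, mu: number) => {
      const a = y > 0 ? y * Math.log(y / mu) : 0;
      const b = y < 1 ? (1 - y) * Math.log((1 - y) / (1 - mu)) : 0;
      return 2 * (a + b);
    },
    initialize: (y: number) => (y + 0.5) / 2,
    estimateDispersion: false,
  };
}
```

A getFamily(family, link) factory would then pick the link bundle (or the family's default) and hand it to the matching family constructor.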

Probit prerequisites

Probit needs the standard normal CDF (Φ) and PDF (φ). We already have tCDF in distributions.ts. Add:

- normalCDF(x) — standard normal CDF (can delegate to tCDF(x, Infinity) or use a direct rational approximation)
- normalPDF(x) — standard normal PDF: (1/√(2π)) × e^(−x²/2)
- normalQuantile(p) — inverse normal CDF (for the probit link function), via a rational approximation (Abramowitz & Stegun or similar)
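
The two simpler helpers can be sketched directly; this uses the Abramowitz & Stegun 7.1.26 rational approximation of erf (max absolute error ≈ 1.5e−7), which is one option among several. normalQuantile needs its own approximation (e.g. Acklam's) and is omitted here:

```typescript
// Sketch of normalPDF / normalCDF for distributions.ts.

export function normalPDF(x: number): number {
  return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
}

// Abramowitz & Stegun 7.1.26: erf(x) ≈ 1 − poly(t)·e^(−x²), t = 1/(1+px)
function erfApprox(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))));
  return sign * (1 - poly * Math.exp(-ax * ax));
}

export function normalCDF(x: number): number {
  // Φ(x) = (1 + erf(x/√2)) / 2
  return 0.5 * (1 + erfApprox(x / Math.SQRT2));
}
```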


3. IRLS Algorithm

New file: core/stats/glm.ts

export function computeGLM(
  formula: Formula,
  dataset: Dataset,
  family: GLMFamily,
  link: GLMLink,
): GLMResult

Algorithm

  1. Build design matrix — reuse expandColumn / design matrix logic from regression.ts. Extract into shared helper if not already factored out (see section 7).

  2. Initialize μ — family-specific, following R's defaults: binomial μ₀ = (y + 0.5)/2, poisson μ₀ = y + 0.1, and gaussian/Gamma/inverse.gaussian μ₀ = y (with positivity guards where the link requires them).

  3. IRLS loop (max 25 iterations, convergence when deviance change < 1e-8):

    η = link(μ)
    z = η + (y - μ) / muEta(η)        // working response
    w = muEta(η)² / variance(μ)        // IRLS weights (always positive)
    β = solve(X'WX, X'Wz)              // weighted least squares
    η_new = Xβ
    μ_new = linkinv(η_new)
    deviance = Σ devResid(y_i, μ_i)
    check |deviance_old - deviance_new| / (|deviance_new| + 0.1) < 1e-8
  4. Weighted least squares solve — form X'WX and X'Wz explicitly (W is diagonal, so X'WX = Σ w_i · x_i · x_i'). Solve via existing solveAndInverse() from matrix.ts. This returns both β and (X'WX)⁻¹ for SE computation.

  5. Standard errors — se_j = sqrt(dispersion × (X'WX)⁻¹_jj)

  6. z-statistics and p-values — z_j = β_j / se_j, with two-tailed p from the standard normal distribution

  7. Fit statistics — deviance, null deviance (closed form, below), AIC = −2ℓ̂ + 2k and BIC = −2ℓ̂ + k·log(n) from the family log-likelihood, and the dispersion (fixed at 1 for binomial/poisson, otherwise the Pearson estimate Σ[(y_i − μ_i)²/V(μ_i)]/(n − p)).
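
Steps 3–6 can be sketched end to end for the logit case. This is illustrative, not the final computeGLM — no design-matrix handling, a naive Gaussian-elimination solve, and the binomial/logit family inlined (irlsLogit and solve are hypothetical names):

```typescript
// Compact IRLS sketch: binomial family, logit link.
type Vec = number[];

function solve(A: number[][], b: Vec): Vec {
  // Naive Gauss-Jordan elimination with partial pivoting (fine for small p)
  const n = b.length;
  const M = A.map((row, i) => [...row, b[i]]);
  for (let col = 0; col < n; col++) {
    let piv = col;
    for (let r = col + 1; r < n; r++)
      if (Math.abs(M[r][col]) > Math.abs(M[piv][col])) piv = r;
    [M[col], M[piv]] = [M[piv], M[col]];
    for (let r = 0; r < n; r++) {
      if (r === col) continue;
      const f = M[r][col] / M[col][col];
      for (let c = col; c <= n; c++) M[r][c] -= f * M[col][c];
    }
  }
  return M.map((row, i) => row[n] / M[i][i]);
}

function irlsLogit(X: number[][], y: Vec, maxIter = 25, tol = 1e-8) {
  const n = X.length, p = X[0].length;
  const linkinv = (eta: number) => 1 / (1 + Math.exp(-eta));
  let mu = y.map((yi) => (yi + 0.5) / 2);          // binomial initialization
  let eta = mu.map((m) => Math.log(m / (1 - m)));
  let beta: Vec = new Array(p).fill(0);
  let dev = Infinity, converged = false, iter = 0;
  for (; iter < maxIter; iter++) {
    // For logit, muEta(η) = μ(1−μ) = variance(μ), so w = muEta²/variance = μ(1−μ)
    const w = mu.map((m) => m * (1 - m));
    const z = eta.map((e, i) => e + (y[i] - mu[i]) / w[i]); // working response
    // Form X'WX and X'Wz explicitly (W diagonal)
    const XtWX = Array.from({ length: p }, () => new Array(p).fill(0));
    const XtWz = new Array(p).fill(0);
    for (let i = 0; i < n; i++)
      for (let j = 0; j < p; j++) {
        XtWz[j] += w[i] * X[i][j] * z[i];
        for (let k = 0; k < p; k++) XtWX[j][k] += w[i] * X[i][j] * X[i][k];
      }
    beta = solve(XtWX, XtWz);
    eta = X.map((row) => row.reduce((s, x, j) => s + x * beta[j], 0));
    mu = eta.map(linkinv);
    const devNew = y.reduce((s, yi, i) =>
      s + 2 * ((yi > 0 ? yi * Math.log(yi / mu[i]) : 0) +
               (yi < 1 ? (1 - yi) * Math.log((1 - yi) / (1 - mu[i])) : 0)), 0);
    if (Math.abs(dev - devNew) / (Math.abs(devNew) + 0.1) < tol) {
      converged = true; dev = devNew; break;
    }
    dev = devNew;
  }
  return { beta, deviance: dev, converged, iterations: iter + 1 };
}
```

For an intercept-only model this converges to β₀ = logit(ȳ), which makes a convenient sanity check without R output.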

Null deviance computation

Rather than running a second IRLS, compute the null-model deviance directly from closed forms, with ȳ = mean(y):

- Binomial: nullDeviance = 2 × Σ[y·log(y/ȳ) + (1−y)·log((1−y)/(1−ȳ))]
- Poisson: nullDeviance = 2 × Σ[y·log(y/ȳ) − (y − ȳ)]
- Gaussian: nullDeviance = Σ(y − ȳ)²
- Gamma/inverse.gaussian: similar closed forms with ȳ
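
A minimal sketch of the first three closed forms (function names illustrative; 0·log(0) is treated as 0 for binomial, matching the limit of the deviance term):

```typescript
// Closed-form null deviances for binomial, poisson, and gaussian.

function mean(y: number[]): number {
  return y.reduce((s, v) => s + v, 0) / y.length;
}

export function nullDevianceBinomial(y: number[]): number {
  const ybar = mean(y);
  return 2 * y.reduce((s, yi) =>
    s + (yi > 0 ? yi * Math.log(yi / ybar) : 0) +
        (yi < 1 ? (1 - yi) * Math.log((1 - yi) / (1 - ybar)) : 0), 0);
}

export function nullDeviancePoisson(y: number[]): number {
  const ybar = mean(y);
  return 2 * y.reduce((s, yi) =>
    s + (yi > 0 ? yi * Math.log(yi / ybar) : 0) - (yi - ybar), 0);
}

export function nullDevianceGaussian(y: number[]): number {
  const ybar = mean(y);
  return y.reduce((s, yi) => s + (yi - ybar) ** 2, 0);
}
```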

Convergence edge cases

If IRLS hits the iteration cap without meeting the deviance tolerance, return the last iterate with converged: false and surface a warning rather than throwing. The main practical trigger for binomial models is perfect separation, where the coefficient estimates diverge toward ±∞ (covered by the "Convergence failure" test in section 10).

4. Recognizer

core/parsers/r/recognizer.ts

Add case 'glm' to recognizeFunctionCall(). The recognizer needs to parse the family argument, which in R has several syntactic forms:

# Bare name (most common)
glm(y ~ x, data = df, family = binomial)
glm(y ~ x, data = df, family = poisson)

# Function call with link override
glm(y ~ x, data = df, family = binomial(link = "probit"))
glm(y ~ x, data = df, family = binomial("probit"))

# String (less common)
glm(y ~ x, data = df, family = "binomial")

# Positional (family is 3rd arg)
glm(y ~ x, df, binomial)

recognizeGLM() implementation

function recognizeGLM(
  node: FunctionCallNode,
  assignedTo: string | undefined,
  source: string,
): AnalysisCall
  1. Extract formula (1st arg or formula=) — reuse extractFormula()
  2. Extract data (2nd arg or data=) — reuse existing pattern
  3. Extract family — new logic handling the four syntactic forms shown above
  4. Extract link override from family call args if present
  5. Produce AnalysisCall with kind: 'glm', formula, args: { data, family, link }

Family name mapping

Map R family names to our GLMFamily type:

- binomial → 'binomial'
- poisson → 'poisson'
- gaussian → 'gaussian'
- Gamma → 'Gamma' (capital G, matches R)
- inverse.gaussian → 'inverse.gaussian'
- quasibinomial, quasipoisson → recognize but produce an unsupported node with the warning "quasi-families not yet supported". These differ from the base families by estimating dispersion; adding them later is straightforward once the base families work.
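
The defaultLinkForFamily() helper referenced by the mapper (section 5) can be a simple lookup mirroring the default-link column of the family table in section 2. A minimal sketch:

```typescript
// Default link per family, matching R's family() defaults.

type GLMFamily = 'gaussian' | 'binomial' | 'poisson' | 'Gamma' | 'inverse.gaussian';
type GLMLink =
  | 'identity' | 'logit' | 'probit' | 'cloglog' | 'cauchit'
  | 'log' | 'sqrt' | 'inverse' | '1/mu^2';

const DEFAULT_LINKS: Record<GLMFamily, GLMLink> = {
  'gaussian': 'identity',
  'binomial': 'logit',
  'poisson': 'log',
  'Gamma': 'inverse',
  'inverse.gaussian': '1/mu^2',
};

export function defaultLinkForFamily(family: GLMFamily | undefined): GLMLink {
  // Fall back to gaussian/identity when no family was recognized
  return family ? DEFAULT_LINKS[family] : 'identity';
}
```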


5. Mapper & Executor

Mapper: core/pipeline/mapper.ts

Add case 'glm' in createNode():

case 'glm':
  return {
    ...base,
    type: 'glm',
    params: {
      formula: call.formula!,
      data: (call.args['data'] as string) ?? '',
      family: (call.args['family'] as GLMFamily) ?? 'gaussian',
      link: (call.args['link'] as GLMLink) ?? defaultLinkForFamily(call.args['family'] as GLMFamily),
      ...(call.args['fixedEffects'] ? { fixedEffects: call.args['fixedEffects'] as string[] } : {}),
      ...(call.args['vcovType'] ? { vcovType: call.args['vcovType'] as HCType } : {}),
      ...(call.args['clusterVars'] ? { clusterVars: call.args['clusterVars'] as string[] } : {}),
    },
  };

Executor: core/pipeline/executor.ts

Stub executor (for tests without real computation):

const stubGLM: PrimitiveExecutor = {
  execute: (): GLMResult => ({
    type: 'glm',
    family: 'gaussian',
    link: 'identity',
    coefficients: [],
    deviance: 0, nullDeviance: 0,
    aic: 0, bic: 0, dispersion: 1,
    dfResidual: 0, dfNull: 0,
    residuals: [], fittedValues: [],
    converged: true, iterations: 0,
    vcovType: 'classical',
  }),
};

Real executor:

const realGLM: PrimitiveExecutor = {
  execute: (node: PipelineNode, inputs: Record<string, unknown>, ctx?: ExecutionContext): GLMResult => {
    const glmNode = node as GLMNode;
    const inputDataset = (ctx?.datasets?.get(glmNode.params.data) ?? inputs['data']) as Dataset;
    if (!inputDataset) throw new Error(`Dataset "${glmNode.params.data}" not found`);
    return computeGLM(glmNode.params.formula, inputDataset, glmNode.params.family, glmNode.params.link);
  },
};

Register in registerRealExecutors():

registerExecutor('glm', realGLM);

6. Comparison Table Compatibility

The comparison table executor currently casts model inputs as RegressionResult[] and accesses .rSquared, .adjustedRSquared, .fStatistic. To support mixed GLM + OLS tables:

Option: Duck-typing on coefficients

Both RegressionResult and GLMResult have coefficients: CoefficientRow[]. The coefficient display logic (core/control classification, estimate/SE/p extraction) works identically.

For fit statistics, branch on result type:

// In realComparisonTable executor
const fitStatistics: { name: string; values: (number | null)[] }[] = [];

// R² row: show for regression, null for GLM
if (modelResults.some(r => r.type === 'regression')) {
  fitStatistics.push({
    name: 'R²',
    values: modelResults.map(r => r.type === 'regression' ? r.rSquared : null),
  });
}

// Deviance row: show for GLM, null for regression
if (modelResults.some(r => r.type === 'glm')) {
  fitStatistics.push({
    name: 'Deviance',
    values: modelResults.map(r => r.type === 'glm' ? r.deviance : null),
  });
  fitStatistics.push({
    name: 'AIC',
    values: modelResults.map(r => r.type === 'glm' ? r.aic : null),
  });
}

// N: universal
fitStatistics.push({
  name: 'N',
  values: modelResults.map(r => {
    if (r.type === 'regression') return r.dfModel + r.dfResidual + 1;
    if (r.type === 'glm') return r.dfNull + 1;
    return null;
  }),
});

This requires changing the comparison table’s input type from RegressionResult[] to (RegressionResult | GLMResult)[]. The coefficient extraction path stays the same — only fit statistics branch.


7. Design Matrix Extraction

The design matrix construction in regression.ts (expandColumn, dummy coding, factor handling, interaction expansion, valid-row masking) is ~200 lines of logic that GLM needs identically. Two approaches:

A) Extract to shared module — Move buildDesignMatrix() and helpers to core/stats/design-matrix.ts. Both regression.ts and glm.ts import from it.

B) GLM calls regression internals — Import the unexported helpers directly.

Go with A. The design matrix is logically independent of the estimation method. Extract it cleanly:

// core/stats/design-matrix.ts
export interface DesignMatrixResult {
  X: number[][];          // n × p matrix (row-major)
  y: number[];            // response vector
  validRows: boolean[];   // which rows survived NA removal
  columnNames: string[];  // coefficient names matching X columns
  hasIntercept: boolean;
  n: number;              // rows in X (after NA removal)
  p: number;              // columns in X
}

export function buildDesignMatrix(
  formula: Formula,
  dataset: Dataset,
): DesignMatrixResult

This extraction also benefits future estimators (weighted regression, etc.).


8. FE-GLM Upgrade Path

The GLMParams type already includes fixedEffects?: string[]. When we implement FE-GLM:

  1. Extend demean() to accept weights — demean(columns, feDims, weights?). Weighted group means: mean_g = Σ(w_i × x_i) / Σ(w_i) for group g. A small, localized change to demean.ts.

  2. Wire weighted demeaning into IRLS — At each iteration, after computing IRLS weights w and working response z, demean both z and X columns using w as the demeaning weight vector. The WLS solve then operates on demeaned data.

  3. Update recognizer — feglm() and fepois() are already pre-scanned and parsed. Change their kind from 'linear-model' to 'glm' with the appropriate family/link; the FE vars flow through args['fixedEffects'] as they already do.

This is a clean incremental upgrade — no architectural changes needed, just:

- ~20 lines in demean.ts (weighted means)
- ~10 lines in glm.ts (call demean inside the loop)
- ~5 lines in recognizer.ts (remap the feglm/fepois kind)
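
The weighted group-mean step can be sketched as a standalone helper (weightedDemean is an illustrative name, not the existing demean.ts API, and this handles a single column and a single FE dimension only):

```typescript
// Sketch: subtract the weighted group mean from each observation,
// mean_g = Σ(w_i·x_i) / Σ(w_i) over observations i in group g.

export function weightedDemean(
  x: number[],
  groups: (string | number)[],
  w: number[],
): number[] {
  const sumWX = new Map<string | number, number>();
  const sumW = new Map<string | number, number>();
  for (let i = 0; i < x.length; i++) {
    sumWX.set(groups[i], (sumWX.get(groups[i]) ?? 0) + w[i] * x[i]);
    sumW.set(groups[i], (sumW.get(groups[i]) ?? 0) + w[i]);
  }
  return x.map((xi, i) => xi - sumWX.get(groups[i])! / sumW.get(groups[i])!);
}
```

Inside the IRLS loop, the IRLS weight vector w would be passed here to demean both z and the X columns before the WLS solve.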


9. UI Changes

Results panel (ui/panels/results-panel.tsx)

For GLM results, display:

- Coefficient table (same component as regression — z-stat header instead of t-stat)
- Family and link function
- Deviance and null deviance
- AIC, BIC
- Dispersion parameter (if estimated)
- Convergence status and iteration count

Spec comparison view

GLM nodes participate in comparison tables via the compatibility changes in section 6. No separate spec-comparison UI needed — GLM models appear as columns alongside OLS models.

Parameter schema (core/pipeline/param-schema.ts)

Add a GLMParams schema for the property sheet:

- family: dropdown (gaussian, binomial, poisson, Gamma, inverse.gaussian)
- link: dropdown (filtered by family — only show valid links)
- Formula terms: same as linear-model
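
The family-filtered link dropdown needs a family → valid-links map. A minimal sketch (linksForFamily is an illustrative name; the sets below follow R's family() conventions for the links this design supports, with the default link listed first):

```typescript
// Valid links per family, default first, for the property-sheet dropdown.

const LINKS_FOR_FAMILY: Record<string, string[]> = {
  'gaussian': ['identity', 'log', 'inverse'],
  'binomial': ['logit', 'probit', 'cloglog', 'cauchit', 'log'],
  'poisson': ['log', 'identity', 'sqrt'],
  'Gamma': ['inverse', 'identity', 'log'],
  'inverse.gaussian': ['1/mu^2', 'inverse', 'identity', 'log'],
};

export function linksForFamily(family: string): string[] {
  return LINKS_FOR_FAMILY[family] ?? [];
}
```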


10. Testing

Unit tests: core/stats/glm.test.ts

All validated against R glm() output. Tolerance: <0.00005 for coefficients/SEs, <0.00001 for p-values.

| Test | Family | Link | What it validates |
| --- | --- | --- | --- |
| Logit basic | binomial | logit | Coefficients, SEs, deviance, AIC vs R |
| Probit basic | binomial | probit | Coefficients, SEs vs R |
| Poisson basic | poisson | log | Coefficients, deviance, AIC vs R |
| Gaussian GLM = OLS | gaussian | identity | Coefficients exactly match computeRegression |
| Gamma basic | Gamma | inverse | Coefficients, dispersion vs R |
| Inverse Gaussian | inverse.gaussian | 1/mu^2 | Coefficients, dispersion vs R |
| Non-default link | binomial | cloglog | Coefficients vs R |
| Multiple predictors | binomial | logit | 3+ covariates, dummy coding |
| Factor in GLM | binomial | logit | factor() terms expand correctly |
| Convergence failure | binomial | logit | Perfect separation → converged=false + warning |

Unit tests: core/stats/glm-families.test.ts

Pure math tests for each family function bundle — link, linkinv, muEta, variance, devResid. Verify round-trip: linkinv(linkFn(μ)) ≈ μ.

Unit tests: core/stats/distributions.test.ts

Tests for normalCDF, normalPDF, normalQuantile against known values.

Recognizer tests: core/parsers/r/recognizer.test.ts

| Input | Expected |
| --- | --- |
| glm(y ~ x, data=df, family=binomial) | kind=glm, family=binomial, link=logit |
| glm(y ~ x, data=df, family=binomial(link="probit")) | kind=glm, family=binomial, link=probit |
| glm(y ~ x, data=df, family="poisson") | kind=glm, family=poisson, link=log |
| glm(y ~ x + z, df, Gamma) | kind=glm, family=Gamma, link=inverse |
| glm(y ~ x, data=df, family=stats::binomial) | kind=glm, family=binomial, link=logit |

Integration tests: core/pipeline/integration.test.ts

End-to-end buildPipeline() with realistic multi-line R code:

dat <- read.csv("data.csv")
mod_ols <- lm(y ~ x1 + x2, data = dat)
mod_logit <- glm(employed ~ x1 + x2, data = dat, family = binomial)
stargazer(mod_ols, mod_logit)

Verify: 2 model nodes (linear-model + glm), edges from data-load, comparison table accepts both.

Test dataset

Include a small binary-outcome CSV in examples/ (~200 rows) for integration and E2E testing. Columns: binary outcome, 2-3 numeric predictors, 1 categorical. Generate with known R glm() output for validation.


11. Files Changed

| Action | File | What |
| --- | --- | --- |
| Modify | src/core/stats/types.ts | Add GLMFamily, GLMLink, GLMResult |
| Modify | src/core/parsers/shared/analysis-call.ts | Add 'glm' to AnalysisKind |
| Modify | src/core/pipeline/types.ts | Add GLMParams, GLMNode, NODE_PORTS entry |
| Modify | src/core/pipeline/mapper.ts | Add case 'glm' in createNode() |
| Modify | src/core/pipeline/executor.ts | Add stub + real GLM executors, register |
| Modify | src/core/pipeline/executor.ts | Update comparison table to handle GLMResult |
| Modify | src/core/parsers/r/recognizer.ts | Add case 'glm', implement recognizeGLM() |
| Modify | src/core/pipeline/param-schema.ts | Add GLM param schema |
| Modify | src/ui/panels/results-panel.tsx | GLM result display |
| Create | src/core/stats/glm.ts | computeGLM() — IRLS algorithm |
| Create | src/core/stats/glm-families.ts | Family function bundles + registry |
| Create | src/core/stats/design-matrix.ts | Extracted shared design matrix builder |
| Create | src/core/stats/glm.test.ts | GLM unit tests vs R |
| Create | src/core/stats/glm-families.test.ts | Family function pure math tests |
| Create | examples/glm-test-data.csv | Binary outcome test dataset |
| Modify | src/core/stats/distributions.ts | Add normalCDF, normalPDF, normalQuantile |
| Modify | src/core/stats/distributions.test.ts | Normal distribution tests |
| Modify | src/core/parsers/r/recognizer.test.ts | GLM recognizer tests |
| Modify | src/core/pipeline/integration.test.ts | GLM integration tests |
| Modify | src/core/pipeline/dag.test.ts | Add makeGLMNode helper |
| Modify | src/core/pipeline/executor.test.ts | Add GLM stub test, update makeNode |

Not changed (deferred)