Date: 2026-04-04
Status: Design
Depends on: OLS regression (design matrix, formula handling), robust SEs (sandwich.ts), FE demeaning (demean.ts — for future FE-GLM upgrade path)
~20% of applied econ papers use logit or probit models, primarily for
propensity scores and binary outcomes. glm() is the fourth
most common estimation function after lm(),
feols(), and ivreg(). Without GLM, Interlyse
silently skips propensity score models that feed into downstream
estimators (IPW, doubly robust DiD).
Unlocks: glm(y ~ x, family=binomial),
glm(y ~ x, family=binomial(link="probit")),
glm(y ~ x, family=poisson), and all standard GLM families.
feglm()/fepois() remain
recognized-with-warning but the type system is ready for FE-GLM.
### core/stats/types.ts — new types

```ts
// GLM family identifiers (match R naming)
export type GLMFamily = 'gaussian' | 'binomial' | 'poisson' | 'Gamma' | 'inverse.gaussian';

// Link function identifiers
export type GLMLink =
  | 'identity' | 'logit' | 'probit' | 'cloglog' | 'cauchit' // binomial links
  | 'log' | 'sqrt'                                          // poisson/gamma links
  | 'inverse' | '1/mu^2';                                   // gamma/inv-gaussian links

export interface GLMResult {
  type: 'glm';
  family: GLMFamily;
  link: GLMLink;
  coefficients: CoefficientRow[]; // reuses existing type — z-stat in tStatistic field
  deviance: number;
  nullDeviance: number;
  aic: number;
  bic: number;
  dispersion: number;      // 1 for binomial/poisson, estimated for gaussian/Gamma/inv.gaussian
  dfResidual: number;
  dfNull: number;
  residuals: number[];     // deviance residuals
  fittedValues: number[];  // on response scale (μ, not η)
  converged: boolean;
  iterations: number;
  vcovType: VcovType;
  clusterInfo?: ClusterInfo[]; // future: clustered GLM SEs
  fixedEffects?: FEInfo[];     // future: FE-GLM
}
```

**Note on CoefficientRow reuse:** GLM uses z-statistics (normal distribution), not t-statistics. The existing CoefficientRow.tStatistic field stores the z-value. The field name is inherited — renaming it to statistic would break all existing consumers. The UI can display "z" vs "t" based on result type.
### core/parsers/shared/analysis-call.ts

Add 'glm' to AnalysisKind:

```ts
export type AnalysisKind =
  | 'data-load' | 'data-filter' | 'descriptive' | 't-test'
  | 'linear-model' | 'glm'
  | 'model-summary' | 'model-comparison' | 'unsupported';
```

### core/pipeline/types.ts — new node

```ts
export interface GLMParams {
  formula: Formula;
  data: string;
  family: GLMFamily;
  link: GLMLink;
  // Future: filled by recognizer when feglm/fepois support is added
  fixedEffects?: string[];
  vcovType?: HCType;
  clusterVars?: string[];
  // Future: weighted GLM
  weights?: string;
  offset?: string;
}

export interface GLMNode extends PipelineNodeBase {
  type: 'glm';
  params: GLMParams;
  result?: GLMResult;
}
```

Add GLMNode to the PipelineNode union. Add GLMResult to re-exports. Add the port definition:

```ts
'glm': {
  inputs: [{ name: 'data', dataType: 'dataset' }],
  outputs: [{ name: 'model', dataType: 'model' }, { name: 'out', dataType: 'result' }],
},
```

Same ports as linear-model — GLM nodes can feed into model-summary and comparison-table.
### core/stats/glm-families.ts

Each family is a pure object implementing a GLMFamilyFunctions interface:

```ts
export interface GLMFamilyFunctions {
  /** Link function: μ → η */
  linkFn(mu: number): number;
  /** Inverse link: η → μ */
  linkinv(eta: number): number;
  /** Derivative of linkinv: dμ/dη */
  muEta(eta: number): number;
  /** Variance function: Var(Y) = φ · V(μ) */
  variance(mu: number): number;
  /** Unit deviance: d(y, μ) */
  devResid(y: number, mu: number): number;
  /** Initialize μ from y (before first iteration) */
  initialize(y: number): number;
  /** Whether dispersion is estimated (true) or fixed at 1 (false) */
  estimateDispersion: boolean;
}
```

| Family | Default link | V(μ) | Dispersion |
|---|---|---|---|
| gaussian | identity | 1 | estimated (σ²) |
| binomial | logit | μ(1−μ) | fixed = 1 |
| poisson | log | μ | fixed = 1 |
| Gamma | inverse | μ² | estimated |
| inverse.gaussian | 1/mu^2 | μ³ | estimated |
| Link | g(μ) | g⁻¹(η) | g′(μ) |
|---|---|---|---|
| identity | μ | η | 1 |
| logit | log(μ/(1−μ)) | 1/(1+e^(−η)) | 1/(μ(1−μ)) |
| probit | Φ⁻¹(μ) | Φ(η) | 1/φ(Φ⁻¹(μ)) |
| cloglog | log(−log(1−μ)) | 1−e^(−e^η) | 1/((1−μ)(−log(1−μ))) |
| log | log(μ) | e^η | 1/μ |
| inverse | 1/μ | 1/η | −1/μ² |
| sqrt | √μ | η² | 1/(2√μ) |
| 1/mu^2 | 1/μ² | 1/√η | −2/μ³ |
| cauchit | tan(π(μ−0.5)) | 0.5+atan(η)/π | π(1+tan²(π(μ−0.5))) |
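As a concrete instance of the two tables above, here is a sketch of the binomial/logit bundle. The interface is repeated locally so the snippet stands alone, and devResid follows the usual convention 0·log 0 = 0:

```typescript
interface GLMFamilyFunctions {
  linkFn(mu: number): number;
  linkinv(eta: number): number;
  muEta(eta: number): number;
  variance(mu: number): number;
  devResid(y: number, mu: number): number;
  initialize(y: number): number;
  estimateDispersion: boolean;
}

// x·log(y) with the 0·log(0) = 0 convention used by the deviance formulas
const xlogy = (x: number, y: number): number => (x === 0 ? 0 : x * Math.log(y));

const binomialLogit: GLMFamilyFunctions = {
  linkFn: (mu) => Math.log(mu / (1 - mu)),
  linkinv: (eta) => 1 / (1 + Math.exp(-eta)),
  muEta: (eta) => {
    const mu = 1 / (1 + Math.exp(-eta));
    return mu * (1 - mu); // dμ/dη for the logit link
  },
  variance: (mu) => mu * (1 - mu),
  // Unit deviance: 2·[y·log(y/μ) + (1−y)·log((1−y)/(1−μ))]
  devResid: (y, mu) => 2 * (xlogy(y, y / mu) + xlogy(1 - y, (1 - y) / (1 - mu))),
  initialize: (y) => (y + 0.5) / 2,
  estimateDispersion: false,
};
```

The round-trip property linkinv(linkFn(μ)) ≈ μ in the glm-families tests falls out directly.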
**Implementation:** Compose link functions and variance functions independently, then combine them into family bundles. A getFamily(family, link) factory returns the correct bundle, using the family's default link if none is specified.
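A sketch of the default-link resolution inside the getFamily factory; the bundle registry itself is elided. The defaultLinkForFamily helper is the same one the mapper (section below) calls when no link is given:

```typescript
type GLMFamily = 'gaussian' | 'binomial' | 'poisson' | 'Gamma' | 'inverse.gaussian';
type GLMLink =
  | 'identity' | 'logit' | 'probit' | 'cloglog' | 'cauchit'
  | 'log' | 'sqrt' | 'inverse' | '1/mu^2';

// Default links per family, matching the table above (and R's defaults)
const DEFAULT_LINK: Record<GLMFamily, GLMLink> = {
  gaussian: 'identity',
  binomial: 'logit',
  poisson: 'log',
  Gamma: 'inverse',
  'inverse.gaussian': '1/mu^2',
};

function defaultLinkForFamily(family: GLMFamily): GLMLink {
  return DEFAULT_LINK[family];
}
```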
Probit needs the normal CDF (Φ) and PDF (φ). We already have tCDF in distributions.ts. Add:

- normalCDF(x) — standard normal CDF (can delegate to tCDF(x, Infinity) or use a direct rational approximation)
- normalPDF(x) — standard normal PDF: (1/√(2π)) · e^(−x²/2)
- normalQuantile(p) — inverse normal CDF (for the probit link function). Rational approximation (Abramowitz & Stegun or similar).
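A sketch of the three helpers. The CDF uses the Abramowitz & Stegun 26.2.17 polynomial (absolute error below 7.5e-8); for brevity the quantile here inverts the CDF by bisection instead of a rational approximation, which is accurate enough to illustrate the probit link:

```typescript
function normalPDF(x: number): number {
  // (1/√(2π)) · e^(−x²/2)
  return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
}

function normalCDF(x: number): number {
  // Abramowitz & Stegun 26.2.17: upper-tail Q(z) ≈ φ(z)·(b₁t + … + b₅t⁵), t = 1/(1+0.2316419·z)
  const z = Math.abs(x);
  const t = 1 / (1 + 0.2316419 * z);
  const poly = t * (0.319381530 + t * (-0.356563782 +
    t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  const tail = normalPDF(z) * poly; // P(Z > z) for z ≥ 0
  return x >= 0 ? 1 - tail : tail;
}

function normalQuantile(p: number): number {
  // Bisection on normalCDF; a rational approximation would be faster in production
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 80; i++) {
    const mid = (lo + hi) / 2;
    if (normalCDF(mid) < p) lo = mid; else hi = mid;
  }
  return (lo + hi) / 2;
}
```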
### core/stats/glm.ts

```ts
export function computeGLM(
  formula: Formula,
  dataset: Dataset,
  family: GLMFamily,
  link: GLMLink,
): GLMResult
```

**Build design matrix** — reuse expandColumn / design matrix logic from regression.ts. Extract into a shared helper if not already factored out (see section 7).

**Initialize μ** — family-specific:
- binomial: (y + 0.5) / 2
- poisson: y + 0.1 (avoid log(0))
- gaussian: y
- Gamma: y (must be > 0)
- inverse.gaussian: y (must be > 0)

**IRLS loop** (max 25 iterations, convergence when deviance change < 1e-8):

```
η = link(μ)
z = η + (y - μ) / muEta(η)   // working response
w = muEta(η)² / variance(μ)  // IRLS weights (always positive)
β = solve(X'WX, X'Wz)        // weighted least squares
η_new = Xβ
μ_new = linkinv(η_new)
deviance = Σ devResid(y_i, μ_i)
check |deviance_old - deviance_new| / (|deviance_new| + 0.1) < 1e-8
```

**Weighted least squares solve** — form X'WX and X'Wz explicitly (W is diagonal, so X'WX = Σ w_i · x_i · x_i'). Solve via the existing solveAndInverse() from matrix.ts, which returns both β and (X'WX)⁻¹ for SE computation.
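The loop above, specialized to the binomial/logit case, can be sketched as follows. A plain Gaussian-elimination solve() stands in for solveAndInverse(); this is an illustration, not the production code:

```typescript
// Gauss-Jordan solve of A·x = b (stand-in for solveAndInverse)
function solve(A: number[][], b: number[]): number[] {
  const n = b.length;
  const M = A.map((row, i) => [...row, b[i]]);
  for (let c = 0; c < n; c++) {
    let piv = c;
    for (let r = c + 1; r < n; r++) if (Math.abs(M[r][c]) > Math.abs(M[piv][c])) piv = r;
    [M[c], M[piv]] = [M[piv], M[c]];
    for (let r = 0; r < n; r++) {
      if (r === c) continue;
      const f = M[r][c] / M[c][c];
      for (let k = c; k <= n; k++) M[r][k] -= f * M[c][k];
    }
  }
  return M.map((row, i) => row[n] / row[i][i]);
}

function irlsLogit(X: number[][], y: number[], maxIter = 25, tol = 1e-8): number[] {
  const n = X.length;
  const p = X[0].length;
  let mu = y.map((yi) => (yi + 0.5) / 2);          // binomial initialization
  let eta = mu.map((m) => Math.log(m / (1 - m)));  // η = logit(μ)
  let beta = new Array(p).fill(0);
  let devOld = Infinity;
  for (let iter = 0; iter < maxIter; iter++) {
    // For logit, muEta(η) = μ(1−μ) = variance(μ), so w = muEta²/variance = μ(1−μ)
    const w = mu.map((m) => m * (1 - m));
    const z = eta.map((e, i) => e + (y[i] - mu[i]) / w[i]); // working response
    // Accumulate X'WX and X'Wz (W diagonal)
    const XtWX = Array.from({ length: p }, () => new Array(p).fill(0));
    const XtWz = new Array(p).fill(0);
    for (let i = 0; i < n; i++)
      for (let j = 0; j < p; j++) {
        XtWz[j] += w[i] * X[i][j] * z[i];
        for (let k = 0; k < p; k++) XtWX[j][k] += w[i] * X[i][j] * X[i][k];
      }
    beta = solve(XtWX, XtWz);
    eta = X.map((row) => row.reduce((s, x, j) => s + x * beta[j], 0));
    mu = eta.map((e) => 1 / (1 + Math.exp(-e)));
    // Binomial deviance with the 0·log(0) = 0 convention
    const dev = y.reduce((s, yi, i) =>
      s + 2 * ((yi > 0 ? yi * Math.log(yi / mu[i]) : 0) +
               (yi < 1 ? (1 - yi) * Math.log((1 - yi) / (1 - mu[i])) : 0)), 0);
    if (Math.abs(devOld - dev) / (Math.abs(dev) + 0.1) < tol) break;
    devOld = dev;
  }
  return beta;
}
```

For an intercept-only model the fitted μ must equal mean(y), so β₀ converges to logit(ȳ), which makes a convenient sanity check.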
**Standard errors** — se_j = sqrt(dispersion × (X'WX)⁻¹_jj), with:
- binomial/poisson: dispersion = 1 (no estimation)
- gaussian: dispersion = deviance / dfResidual
- Gamma/inverse.gaussian: dispersion = Pearson χ² / dfResidual, where χ² = Σ (y_i − μ_i)² / V(μ_i)

**z-statistics and p-values** — z_j = β_j / se_j, two-tailed p from the normal distribution.

**Fit statistics:**
- deviance = Σ devResid(y_i, μ_i) — from the final iteration
- nullDeviance = deviance of the intercept-only model (run a second IRLS or compute analytically)
- aic = −2·loglik + 2·p — family-specific log-likelihood
- bic = −2·loglik + log(n)·p
- dfResidual = n − p
- dfNull = n − 1

Rather than running a second IRLS, compute the null model deviance directly:
- Binomial: nullDeviance = 2 × Σ[y·log(y/ȳ) + (1−y)·log((1−y)/(1−ȳ))] where ȳ = mean(y)
- Poisson: nullDeviance = 2 × Σ[y·log(y/ȳ) − (y − ȳ)]
- Gaussian: nullDeviance = Σ(y − ȳ)²
- Gamma/inverse.gaussian: similar closed forms with ȳ
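The binomial and poisson closed forms translate directly; xlogy handles the 0·log(0) = 0 convention:

```typescript
// x·log(y) with 0·log(0) treated as 0
const xlogy = (x: number, y: number): number => (x === 0 ? 0 : x * Math.log(y));

// nullDeviance = 2 × Σ[y·log(y/ȳ) + (1−y)·log((1−y)/(1−ȳ))]
function binomialNullDeviance(y: number[]): number {
  const ybar = y.reduce((s, v) => s + v, 0) / y.length;
  return 2 * y.reduce((s, yi) =>
    s + xlogy(yi, yi / ybar) + xlogy(1 - yi, (1 - yi) / (1 - ybar)), 0);
}

// nullDeviance = 2 × Σ[y·log(y/ȳ) − (y − ȳ)]
function poissonNullDeviance(y: number[]): number {
  const ybar = y.reduce((s, v) => s + v, 0) / y.length;
  return 2 * y.reduce((s, yi) => s + xlogy(yi, yi / ybar) - (yi - ybar), 0);
}
```

For y = [0, 1] the binomial null deviance is 4·log 2, a handy hand-checkable fixture for the test suite.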
**Edge cases:**
- Fitted probabilities at the boundary (e.g. perfect separation): R's glm() issues a "fitted probabilities numerically 0 or 1 occurred" warning; do the same.
- Non-convergence after max iterations: set converged: false and return the last-iteration results. The UI shows a warning.
- Clamp μ away from the boundary by ε = 1e-10 so link and variance evaluations stay finite.

### core/parsers/r/recognizer.ts

Add case 'glm' to recognizeFunctionCall().
The recognizer needs to parse the family argument, which in R has several syntactic forms:

```r
# Bare name (most common)
glm(y ~ x, data = df, family = binomial)
glm(y ~ x, data = df, family = poisson)

# Function call with link override
glm(y ~ x, data = df, family = binomial(link = "probit"))
glm(y ~ x, data = df, family = binomial("probit"))

# String (less common)
glm(y ~ x, data = df, family = "binomial")

# Positional (family is 3rd arg)
glm(y ~ x, df, binomial)
```

recognizeGLM() implementation:

```ts
function recognizeGLM(
  node: FunctionCallNode,
  assignedTo: string | undefined,
  source: string,
): AnalysisCall
```

- formula (1st arg or formula=) — reuse extractFormula()
- data (2nd arg or data=) — reuse the existing pattern
- family — new logic:
  - binomial → family=binomial, link=default
  - binomial(link="probit") → family=binomial, link=probit
  - "binomial" → family=binomial, link=default
  - stats::binomial → strip the qualifier, family=binomial
  - link override taken from the family call's args if present
- Returns an AnalysisCall with kind: 'glm', formula, args: { data, family, link }

Map R family names to our GLMFamily type:
- binomial → 'binomial'
- poisson → 'poisson'
- gaussian → 'gaussian'
- Gamma → 'Gamma' (capital G, matches R)
- inverse.gaussian → 'inverse.gaussian'
- quasibinomial, quasipoisson → recognize but produce an unsupported node with the warning "quasi-families not yet supported". These differ from the base families by estimating dispersion; adding them later is straightforward once the base families work.
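The normalization rules above can be illustrated with a string-based stand-in. The real recognizer walks FunctionCallNode ASTs; parseFamilyArg here is hypothetical and regex-based, purely to show the mapping:

```typescript
type GLMFamily = 'gaussian' | 'binomial' | 'poisson' | 'Gamma' | 'inverse.gaussian';

const FAMILY_NAMES = new Set(['gaussian', 'binomial', 'poisson', 'Gamma', 'inverse.gaussian']);
const DEFAULT_LINK: Record<string, string> = {
  gaussian: 'identity', binomial: 'logit', poisson: 'log',
  Gamma: 'inverse', 'inverse.gaussian': '1/mu^2',
};

// Normalize an R family expression (as source text) to { family, link },
// or null for unrecognized families (e.g. quasipoisson → unsupported node).
function parseFamilyArg(src: string): { family: GLMFamily; link: string } | null {
  let s = src.trim().replace(/^stats::/, ''); // strip namespace qualifier
  s = s.replace(/^"(.*)"$/, '$1');            // string form: "binomial"
  // Call form: binomial(link = "probit") or binomial("probit")
  const call = s.match(/^(\w[\w.]*)\s*\(\s*(?:link\s*=\s*)?"?([\w./^]*)"?\s*\)$/);
  const name = call ? call[1] : s;
  const link = call && call[2] ? call[2] : undefined;
  if (!FAMILY_NAMES.has(name)) return null;
  return { family: name as GLMFamily, link: link ?? DEFAULT_LINK[name] };
}
```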
### core/pipeline/mapper.ts

Add case 'glm' in createNode():

```ts
case 'glm':
  return {
    ...base,
    type: 'glm',
    params: {
      formula: call.formula!,
      data: (call.args['data'] as string) ?? '',
      family: (call.args['family'] as GLMFamily) ?? 'gaussian',
      link: (call.args['link'] as GLMLink) ?? defaultLinkForFamily(call.args['family'] as GLMFamily),
      ...(call.args['fixedEffects'] ? { fixedEffects: call.args['fixedEffects'] as string[] } : {}),
      ...(call.args['vcovType'] ? { vcovType: call.args['vcovType'] as HCType } : {}),
      ...(call.args['clusterVars'] ? { clusterVars: call.args['clusterVars'] as string[] } : {}),
    },
  };
```

### core/pipeline/executor.ts

Stub executor (for tests without real computation):
```ts
const stubGLM: PrimitiveExecutor = {
  execute: (): GLMResult => ({
    type: 'glm',
    family: 'gaussian',
    link: 'identity',
    coefficients: [],
    deviance: 0, nullDeviance: 0,
    aic: 0, bic: 0, dispersion: 1,
    dfResidual: 0, dfNull: 0,
    residuals: [], fittedValues: [],
    converged: true, iterations: 0,
    vcovType: 'classical',
  }),
};
```

Real executor:
```ts
const realGLM: PrimitiveExecutor = {
  execute: (node: PipelineNode, inputs: Record<string, unknown>, ctx?: ExecutionContext): GLMResult => {
    const glmNode = node as GLMNode;
    const inputDataset = (ctx?.datasets?.get(glmNode.params.data) ?? inputs['data']) as Dataset;
    if (!inputDataset) throw new Error(`Dataset "${glmNode.params.data}" not found`);
    return computeGLM(glmNode.params.formula, inputDataset, glmNode.params.family, glmNode.params.link);
  },
};
```

Register in registerRealExecutors():

```ts
registerExecutor('glm', realGLM);
```

The comparison table executor currently casts model inputs as
RegressionResult[] and accesses .rSquared,
.adjustedRSquared, .fStatistic. To support
mixed GLM + OLS tables:
Both RegressionResult and GLMResult have
coefficients: CoefficientRow[]. The coefficient display
logic (core/control classification, estimate/SE/p extraction) works
identically.
For fit statistics, branch on result type:
```ts
// In realComparisonTable executor
const fitStatistics: { name: string; values: (number | null)[] }[] = [];

// R² row: show for regression, null for GLM
if (modelResults.some(r => r.type === 'regression')) {
  fitStatistics.push({
    name: 'R²',
    values: modelResults.map(r => r.type === 'regression' ? r.rSquared : null),
  });
}

// Deviance row: show for GLM, null for regression
if (modelResults.some(r => r.type === 'glm')) {
  fitStatistics.push({
    name: 'Deviance',
    values: modelResults.map(r => r.type === 'glm' ? r.deviance : null),
  });
  fitStatistics.push({
    name: 'AIC',
    values: modelResults.map(r => r.type === 'glm' ? r.aic : null),
  });
}

// N: universal
fitStatistics.push({
  name: 'N',
  values: modelResults.map(r => {
    if (r.type === 'regression') return r.dfModel + r.dfResidual + 1;
    if (r.type === 'glm') return r.dfNull + 1;
    return null;
  }),
});
```

This requires changing the comparison table's input type from
RegressionResult[] to
(RegressionResult | GLMResult)[]. The coefficient
extraction path stays the same — only fit statistics branch.
### core/stats/design-matrix.ts — shared extraction

The design matrix construction in regression.ts (expandColumn, dummy coding, factor handling, interaction expansion, valid-row masking) is ~200 lines of logic that GLM needs identically. Two approaches:
A) Extract to shared module — Move
buildDesignMatrix() and helpers to
core/stats/design-matrix.ts. Both
regression.ts and glm.ts import from it.
B) GLM calls regression internals — Import the unexported helpers directly.
Go with A. The design matrix is logically independent of the estimation method. Extract it cleanly:
```ts
// core/stats/design-matrix.ts
export interface DesignMatrixResult {
  X: number[][];         // n × p matrix (row-major)
  y: number[];           // response vector
  validRows: boolean[];  // which rows survived NA removal
  columnNames: string[]; // coefficient names matching X columns
  hasIntercept: boolean;
  n: number;             // rows in X (after NA removal)
  p: number;             // columns in X
}

export function buildDesignMatrix(
  formula: Formula,
  dataset: Dataset,
): DesignMatrixResult
```

This extraction also benefits future estimators (weighted regression, etc.).
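A numeric-only sketch of the extracted builder. Factor expansion, interactions, and dummy coding are elided, and the Formula and Dataset shapes here are simplified stand-ins for the real types:

```typescript
interface Formula { response: string; terms: string[]; intercept: boolean }
type Dataset = Record<string, (number | null)[]>;

interface DesignMatrixResult {
  X: number[][]; y: number[]; validRows: boolean[];
  columnNames: string[]; hasIntercept: boolean; n: number; p: number;
}

function buildDesignMatrix(formula: Formula, data: Dataset): DesignMatrixResult {
  const cols = [formula.response, ...formula.terms];
  const nRaw = data[formula.response].length;
  // Listwise deletion: a row is valid only if every used column is non-null
  const validRows = Array.from({ length: nRaw }, (_, i) =>
    cols.every((c) => data[c][i] != null));
  const y: number[] = [];
  const X: number[][] = [];
  for (let i = 0; i < nRaw; i++) {
    if (!validRows[i]) continue;
    y.push(data[formula.response][i] as number);
    const row = formula.intercept ? [1] : []; // leading intercept column
    for (const t of formula.terms) row.push(data[t][i] as number);
    X.push(row);
  }
  const columnNames = [...(formula.intercept ? ['(Intercept)'] : []), ...formula.terms];
  return {
    X, y, validRows, columnNames,
    hasIntercept: formula.intercept, n: X.length, p: columnNames.length,
  };
}
```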
### FE-GLM upgrade path

The GLMParams type already includes fixedEffects?: string[]. When we implement FE-GLM:

1. **Extend demean() to accept weights** — demean(columns, feDims, weights?). Weighted group means: mean_g = Σ(w_i × x_i) / Σ(w_i) for group g. A small, localized change to demean.ts.
2. **Wire weighted demeaning into IRLS** — at each iteration, after computing the IRLS weights w and working response z, demean both z and the X columns using w as the demeaning weight vector. The WLS solve then operates on the demeaned data.
3. **Update recognizer** — feglm() and fepois() are already pre-scanned and parsed. Change their kind from 'linear-model' to 'glm' with the appropriate family/link; the FE vars flow through args['fixedEffects'] as they already do.

This is a clean incremental upgrade — no architectural changes needed, just:
- ~20 lines in demean.ts (weighted means)
- ~10 lines in glm.ts (call demean inside the loop)
- ~5 lines in recognizer.ts (remap the feglm/fepois kind)
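The weighted group-mean step, sketched for a single FE dimension (the real demean.ts iterates over multiple dimensions; weightedDemean here is an illustrative helper, not its actual signature):

```typescript
// Subtract the weighted group mean Σ(wᵢxᵢ)/Σ(wᵢ) within each group
function weightedDemean(x: number[], group: string[], w: number[]): number[] {
  const sumWX = new Map<string, number>();
  const sumW = new Map<string, number>();
  for (let i = 0; i < x.length; i++) {
    sumWX.set(group[i], (sumWX.get(group[i]) ?? 0) + w[i] * x[i]);
    sumW.set(group[i], (sumW.get(group[i]) ?? 0) + w[i]);
  }
  return x.map((xi, i) => xi - sumWX.get(group[i])! / sumW.get(group[i])!);
}
```

With unit weights this reduces to the unweighted demeaning demean.ts already performs, which makes a natural regression test for the extension.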
### ui/panels/results-panel.tsx

For GLM results, display:
- Coefficient table (same component as regression — z-stat header instead of t-stat)
- Family and link function
- Deviance and null deviance
- AIC, BIC
- Dispersion parameter (if estimated)
- Convergence status and iteration count

GLM nodes participate in comparison tables via the compatibility changes in section 6. No separate spec-comparison UI needed — GLM models appear as columns alongside OLS models.

### core/pipeline/param-schema.ts

Add a GLMParams schema for the property sheet:
- family: dropdown (gaussian, binomial, poisson, Gamma, inverse.gaussian)
- link: dropdown (filtered by family — only show valid links)
- Formula terms: same as linear-model
### core/stats/glm.test.ts

All tests validated against R glm() output. Tolerance: < 0.00005 for coefficients/SEs, < 0.00001 for p-values.
| Test | Family | Link | What it validates |
|---|---|---|---|
| Logit basic | binomial | logit | Coefficients, SEs, deviance, AIC vs R |
| Probit basic | binomial | probit | Coefficients, SEs vs R |
| Poisson basic | poisson | log | Coefficients, deviance, AIC vs R |
| Gaussian GLM = OLS | gaussian | identity | Coefficients exactly match computeRegression |
| Gamma basic | Gamma | inverse | Coefficients, dispersion vs R |
| Inverse Gaussian | inverse.gaussian | 1/mu^2 | Coefficients, dispersion vs R |
| Non-default link | binomial | cloglog | Coefficients vs R |
| Multiple predictors | binomial | logit | 3+ covariates, dummy coding |
| Factor in GLM | binomial | logit | factor() terms expand correctly |
| Convergence failure | binomial | logit | Perfect separation → converged=false + warning |
### core/stats/glm-families.test.ts

Pure math tests for each family function bundle — link, linkinv, muEta, variance, devResid. Verify the round-trip: linkinv(linkFn(μ)) ≈ μ.

### core/stats/distributions.test.ts

Tests for normalCDF, normalPDF, normalQuantile against known values.
### core/parsers/r/recognizer.test.ts

| Input | Expected |
|---|---|
| glm(y ~ x, data=df, family=binomial) | kind=glm, family=binomial, link=logit |
| glm(y ~ x, data=df, family=binomial(link="probit")) | kind=glm, family=binomial, link=probit |
| glm(y ~ x, data=df, family="poisson") | kind=glm, family=poisson, link=log |
| glm(y ~ x + z, df, Gamma) | kind=glm, family=Gamma, link=inverse |
| glm(y ~ x, data=df, family=stats::binomial) | kind=glm, family=binomial, link=logit |
### core/pipeline/integration.test.ts

End-to-end buildPipeline() with realistic multi-line R code:

```r
dat <- read.csv("data.csv")
mod_ols <- lm(y ~ x1 + x2, data = dat)
mod_logit <- glm(employed ~ x1 + x2, data = dat, family = binomial)
stargazer(mod_ols, mod_logit)
```

Verify: 2 model nodes (linear-model + glm), edges from data-load, comparison table accepts both.
Include a small binary-outcome CSV in examples/ (~200
rows) for integration and E2E testing. Columns: binary outcome, 2-3
numeric predictors, 1 categorical. Generate with known R
glm() output for validation.
| Action | File | What |
|---|---|---|
| Modify | src/core/stats/types.ts | Add GLMFamily, GLMLink, GLMResult |
| Modify | src/core/parsers/shared/analysis-call.ts | Add 'glm' to AnalysisKind |
| Modify | src/core/pipeline/types.ts | Add GLMParams, GLMNode, NODE_PORTS entry |
| Modify | src/core/pipeline/mapper.ts | Add case 'glm' in createNode() |
| Modify | src/core/pipeline/executor.ts | Add stub + real GLM executors, register |
| Modify | src/core/pipeline/executor.ts | Update comparison table to handle GLMResult |
| Modify | src/core/parsers/r/recognizer.ts | Add case 'glm', implement recognizeGLM() |
| Modify | src/core/pipeline/param-schema.ts | Add GLM param schema |
| Modify | src/ui/panels/results-panel.tsx | GLM result display |
| Create | src/core/stats/glm.ts | computeGLM() — IRLS algorithm |
| Create | src/core/stats/glm-families.ts | Family function bundles + registry |
| Create | src/core/stats/design-matrix.ts | Extracted shared design matrix builder |
| Create | src/core/stats/glm.test.ts | GLM unit tests vs R |
| Create | src/core/stats/glm-families.test.ts | Family function pure math tests |
| Create | examples/glm-test-data.csv | Binary outcome test dataset |
| Modify | src/core/stats/distributions.ts | Add normalCDF, normalPDF, normalQuantile |
| Modify | src/core/stats/distributions.test.ts | Normal distribution tests |
| Modify | src/core/parsers/r/recognizer.test.ts | GLM recognizer tests |
| Modify | src/core/pipeline/integration.test.ts | GLM integration tests |
| Modify | src/core/pipeline/dag.test.ts | Add makeGLMNode helper |
| Modify | src/core/pipeline/executor.test.ts | Add GLM stub test, update makeNode |
Out of scope:
- feglm()/fepois() remain linear-model with warning — FE-GLM is a follow-up
- demean.ts — no weight support yet (FE-GLM upgrade path)