data.table .SD/.SDcols Column-Set Abstraction — Design

Date: 2026-05-30
Author: Mikel Petri
Milestone: M6 preparatory; top cross-paper blocker carried over from the 2026-05-08 data.table spec’s deferred set
Predecessor: docs/superpowers/specs/2026-05-08-data-table-syntax-design.md (data.table walrus / .() / bracket-form dispatch)

Goal

Land typed pipeline support for the .SD / .SDcols column-set idioms that the prior data.table spec explicitly deferred. Two corpus forms are in scope; both reduce to a single abstraction — a language-agnostic column-selector + per-column template — layered on the existing data-mutate and data-summarise primitives.

Reframing — why this is mostly a mapping problem, not a new primitive

.SD, .SDcols, and .I are data.table-specific syntactic vehicles, not analysis primitives. .SD = “the Subset of Data for this group” (a column-set handle); .SDcols = “restrict that handle to these columns”; .I = “integer row-indices.” Each vehicle expresses an abstract operation that may or may not already have a PipelineNode home. Grounded in the corpus:

Idiom	Paper	Abstract operation	Verdict
`dt[, lapply(.SD, mean, na.rm=TRUE), by=Year]`	#78 dissecting-mechanisms	group-wise reducer applied to all non-key columns	Mapping gap — `data-summarise` already exists; needs a column-set wildcard the recognizer can’t expand without the schema
`dt[, paste0("12mavg_",cols) := frollmean(.SD,n=12), .SDcols=vars]`	#43 monetary-fiscal	apply a window fn over a named column subset, emitting one prefixed output col each	Mapping gap (column-iteration over `data-mutate`) + a separate missing-evaluator bullet (`frollmean`)
`dt[, .SD[1], by=key]`	#69 internationalizing	first row per group (distinct / slice-head)	New primitive — separate spec
`.I` / `.I[which.max(x)]`	(no corpus use today)	row-index selection	Neither — defer (YAGNI)

This spec covers only the first two rows — the column-set abstraction. The slice-per-group primitive and .I are out of scope.

Scope (in)

#	Form	Maps to
1	`dt[, lapply(.SD, FUN, ...args), by=g]` (bare `.SD`)	`data-summarise` + `columnMap { selector: all-but [g], template: "FUN(__COL__, ...args)", outputPattern: "__COL__" }`
2	`dt[, lapply(.SD, FUN, ...args), by=g, .SDcols=cs]`	`data-summarise` + `columnMap { selector: explicit cs, ... }`
3	`dt[, paste0("<prefix>", names) := <fn>(.SD, ...args), .SDcols=cs]`	`data-mutate` + `columnMap { selector: explicit cs, template: "<fn>(__COL__, ...args)", outputPattern: "<prefix>___COL__" }`
4	`dt[, c("a_z","b_z") := <fn>(.SD), .SDcols=c("a","b")]` (positional name list LHS)	`data-mutate` + `columnMap { selector: explicit cs, template: "<fn>(__COL__)", outputPattern: "__POSITIONAL__", positionalOutputs: ["a_z","b_z"] }`; positional name list bound to selector order
5	`.SDcols=vars` where `vars <- c("a","b","c")` is a literal binding upstream	`selector: explicit` (resolved via inliner’s free-var pass)

Scope (out)

dt[, .SD[1], by=key] — slice-first-per-group is a genuinely new primitive (data-slice / data-distinct); its own spec.
.I / .I[which.max(x)] — no corpus usage; YAGNI.
data.table(WeightedMode(.SD)) — j-expression returning a multi-col data.table; carried over from the prior spec’s deferred set, not picked up here.
dt[, stri_split_fixed(col, "|"), by = .(id, col)] — by-group list-explode (paper #56); carried over from the prior spec’s deferred set.
lapply(...) %>% rbindlist(fill=TRUE) — depends on lapply-batched-function-application; carried over.
Computed .SDcols (e.g. names(dt)[sapply(...)]) that can’t be folded by the inliner to a literal vector — falls through to webr-opaque.
The frollmean evaluator function itself — separate BACKLOG bullet. Paper #43 will produce correct typed structure under this spec but stay execution-gated on that bullet. Paper #78’s mean is already in the evaluator, so it executes end-to-end.

Success bar

Mapping + generic reducer path: the recognizer produces correct typed data-summarise / data-mutate nodes for the in-scope forms, and the lapply(.SD, FUN, ...) reducer path executes end-to-end for any evaluator-known FUN (mean, sum, sd, median, min, max, var, n, length). Paper #78’s yearly-aggregation block produces verified group-mean values pinned by the paper-match test.

Architecture

The abstraction (language-agnostic)

Lives in src/core/stats/types.ts (alongside the other column-shaped types) so it can be reused later by Stata / Python front-ends:

export type ColumnSelector =
  | { kind: 'explicit'; names: string[] }   // .SDcols=c("a","b"), or vars resolved to a literal vector
  | { kind: 'all-but'; exclude: string[] }; // bare .SD → every column except these (the by= keys)

export interface ColumnMap {
  selector: ColumnSelector;
  template: string;        // per-column expression with __COL__ placeholder
  outputPattern: string;   // output column name pattern with __COL__
}

Only two selector kinds — that’s all the corpus needs. The __COL__ token is recognizer-constructed (never user-supplied), so collision risk with real column names is bounded by recognizer discipline; if a paper ever surfaces a real column literally named __COL__, the test catches it and we switch to a more exotic sentinel.
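
One way to make "recognizer discipline" checkable is to reject any user-originating fragment (function name, stringified extra argument, LHS prefix) that already contains the sentinel before it is spliced into a template. The sketch below is illustrative only: `COL_SENTINEL` and `firstSentinelCollision` are hypothetical names, not existing helpers in the codebase.

```typescript
// Hypothetical guard (not part of the spec): the sentinel the recognizer splices in.
const COL_SENTINEL = '__COL__';

/**
 * Returns the first user-originating fragment that already contains the
 * sentinel, or null when every fragment is safe to splice into a template.
 */
function firstSentinelCollision(fragments: string[]): string | null {
  const hit = fragments.find(f => f.includes(COL_SENTINEL));
  return hit ?? null;
}

// Fragments taken from lapply(.SD, mean, na.rm=TRUE): nothing collides.
console.log(firstSentinelCollision(['mean', 'na.rm = TRUE'])); // null

// An extra argument that references a column literally named __COL__ collides.
console.log(firstSentinelCollision(['12mavg_', 'w = __COL__'])); // "w = __COL__"
```

A recognizer using such a guard could return null (opaque fall-through) on a collision rather than emit a template that would be substituted incorrectly.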

This maps cleanly to equivalents in other languages: Stata collapse (mean) * is all-but + mean; pandas df.groupby(g).agg('mean') is the same; Stata egen over a varlist is explicit + transform. Future Stata/Python recognizers emit the same ColumnMap.

Where it attaches (no new node type)

// src/core/pipeline/types.ts
export interface DataMutateParams {
  expressions: MutateExpr[];
  groupBy?: string[];
  orderBy?: { name: string; desc: boolean }[];
  columnMap?: ColumnMap;        // NEW — executor expands into expressions at run time
}

export interface DataSummariseParams {
  groupBy: string[];
  aggregations: { name: string; expr: string }[];
  columnMap?: ColumnMap;        // NEW — executor expands into aggregations at run time
}

columnMap is optional. Existing nodes without it work unchanged. Existing nodes that happen to set both expressions/aggregations AND columnMap get the union — the corpus has no such case, but the union is free.

Layering

src/core/stats/types.ts                 # +ColumnSelector, +ColumnMap
src/core/pipeline/types.ts              # +columnMap on DataMutateParams, DataSummariseParams
src/core/parsers/r/recognize-data-table.ts
  # +isDataTableBracket trigger extensions for lapply(.SD, ...) and <fn>(.SD, ...) + .SDcols=
  # +Form-A and Form-B recognition emitting AnalysisCall with columnMap args
  # +.SDcols= named-arg extractor (literal vector + inliner free-var resolution)
src/core/parsers/r/recognize-data-table.test.ts   # +unit tests
src/core/pipeline/mapper.ts             # pass columnMap from AnalysisCall to node params
src/core/pipeline/executor.ts           # +column-map expansion pre-pass in realDataMutate, realDataSummarise
src/core/pipeline/param-schema.ts       # +ParamDef kind 'column-map' on mutate/summarise
src/ui/components/nodes/...             # +read-only column-map summary in node body / property sheet

Recognizer dispatch (`recognize-data-table.ts`)

Trigger extensions for `isDataTableBracket`

Two new triggers, additive to the existing predicate:

Form A trigger: j-slot is FunctionCallNode { name: 'lapply', args: [.SD, FUN, ...] }. The first arg is the identifier .SD (literal symbol). Routes to Form A regardless of whether by= is present (with no by=, summarise has empty groupBy).
Form B trigger: j-slot is a WalrusAssignNode whose value is a FunctionCallNode taking .SD as its first arg, AND the bracket has a .SDcols= named arg. Without .SDcols=, Form B falls through (we don’t support bare .SD in mutate context — it’s ambiguous without explicit column declaration).

.SDcols= becomes a recognized named arg alongside by= and with= in the predicate.
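
The two triggers can be sketched as structural predicates. The node shapes below are a simplified, hypothetical stand-in for the real R AST (the type tags `'walrus-assign'` and `'subset'` and the `Arg` interface are assumptions made for illustration); only the matching logic mirrors the description above.

```typescript
// Simplified, hypothetical AST: just enough shape to express the two triggers.
interface Arg { name?: string; value: RNode }
type RNode =
  | { type: 'identifier'; name: string }
  | { type: 'literal'; value: string | number | boolean }
  | { type: 'function-call'; name: string; args: Arg[] }
  | { type: 'walrus-assign'; target: RNode; value: RNode }
  | { type: 'subset'; object: RNode; index: RNode };

const isSD = (n: RNode | undefined): boolean =>
  n !== undefined && n.type === 'identifier' && n.name === '.SD';

// Form A: j-slot is lapply(.SD, FUN, ...); by= is optional.
function isFormATrigger(j: RNode): boolean {
  return j.type === 'function-call'
    && j.name === 'lapply'
    && j.args.length >= 2
    && isSD(j.args[0]?.value);
}

// Form B: j-slot is `target := fn(.SD, ...)` AND the bracket carries .SDcols=.
function isFormBTrigger(j: RNode, bracketArgNames: string[]): boolean {
  return j.type === 'walrus-assign'
    && j.value.type === 'function-call'
    && isSD(j.value.args[0]?.value)
    && bracketArgNames.includes('.SDcols');
}

const sd: RNode = { type: 'identifier', name: '.SD' };
const lapplyMean: RNode = {
  type: 'function-call',
  name: 'lapply',
  args: [{ value: sd }, { value: { type: 'identifier', name: 'mean' } }],
};
const sliceFirst: RNode = { type: 'subset', object: sd, index: { type: 'literal', value: 1 } };
const rolling: RNode = {
  type: 'walrus-assign',
  target: { type: 'identifier', name: 'X' },
  value: {
    type: 'function-call',
    name: 'frollmean',
    args: [{ value: sd }, { name: 'n', value: { type: 'literal', value: 12 } }],
  },
};

console.log(isFormATrigger(lapplyMean));            // true
console.log(isFormATrigger(sliceFirst));            // false (.SD[1] is a subset node)
console.log(isFormBTrigger(rolling, ['.SDcols']));  // true
console.log(isFormBTrigger(rolling, []));           // false (no .SDcols= on the bracket)
```

Because both predicates match on node type first, the slice form `.SD[1]` never reaches either branch.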

Recognizing `.SDcols`

function extractSDcolsArg(arg: ArgumentNode, scope: InlinerScope): ColumnSelector | null {
  const v = arg.value;
  // Literal c("a","b","c") or character vector
  if (v.type === 'function-call' && v.name === 'c') {
    const names = extractStringVector(v);
    if (names) return { kind: 'explicit', names };
  }
  if (v.type === 'vector') {
    const names = extractStringVector(v);
    if (names) return { kind: 'explicit', names };
  }
  // Identifier — try to resolve via inliner free-var pass to a literal string vector
  if (v.type === 'identifier') {
    const resolved = scope.resolveLiteralStringVector?.(v.name);
    if (resolved) return { kind: 'explicit', names: resolved };
  }
  return null; // computed / not resolvable → fall through to opaque
}

The inliner’s resolveLiteralStringVector is the existing free-var pass; we just consume it here. If it returns null, the bracket falls through to webr-opaque.

Form A — reduce (paper #78)

// j is FunctionCallNode { name: 'lapply', args: [.SD, FUN, ...extraArgs] }
// byGroups = extractByGroupNames(byArg) (existing helper)
// sdcols = sdcolsArg ? extractSDcolsArg(sdcolsArg, scope) : { kind: 'all-but', exclude: byGroups }
// If sdcolsArg present but unresolvable → return null (opaque fall-through)

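const fnNode = j.args[1]?.value;                 // the FUN; undefined if lapply(.SD) was given no FUN
const extraArgs = j.args.slice(2);                // ...args (named or positional)
const fnName = fnNode?.type === 'identifier' ? fnNode.name : null;
if (!fnName) return null;                         // missing or anonymous fn → opaque

// Keep argument names (na.rm = TRUE): dropping them would silently turn named
// args into positional ones. Assumes ArgumentNode exposes a named arg's name as `name`.
const argList = extraArgs
  .map(a => (a.name ? `${a.name} = ${exprToString(a.value, '?')}` : exprToString(a.value, '?')))
  .join(', ');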
const template = argList
  ? `${fnName}(__COL__, ${argList})`
  : `${fnName}(__COL__)`;

return [{
  kind: 'data-summarise',
  args: {
    data: dataName,
    groupBy: byGroups,
    aggregations: [],
    columnMap: { selector: sdcols, template, outputPattern: '__COL__' },
  },
  sourceSpan: node.span,
}];
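
As a worked illustration of the template construction, the sketch below builds the per-column template from a function name and already-stringified extra arguments. `CallArg` and `buildTemplate` are hypothetical names introduced for this example; the real recognizer works on AST nodes via `exprToString`.

```typescript
// Hypothetical helper for illustration: args arrive already stringified.
interface CallArg {
  name?: string; // present for named args such as na.rm=TRUE
  text: string;  // the argument expression rendered as R source
}

const COL = '__COL__';

/** Builds "FUN(__COL__, ...args)", keeping names on named arguments. */
function buildTemplate(fnName: string, extraArgs: CallArg[]): string {
  const rendered = extraArgs.map(a => (a.name ? `${a.name} = ${a.text}` : a.text));
  return `${fnName}(${[COL, ...rendered].join(', ')})`;
}

// lapply(.SD, mean, na.rm=TRUE)
console.log(buildTemplate('mean', [{ name: 'na.rm', text: 'TRUE' }]));
// → mean(__COL__, na.rm = TRUE)

// lapply(.SD, sum)
console.log(buildTemplate('sum', []));
// → sum(__COL__)
```

The first result matches the template expected for the paper #78 form in the unit-test table.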

Form B — transform (paper #43)

// j is WalrusAssignNode { target, value: FunctionCallNode { args: [.SD, ...args] } }
// .SDcols= named arg required (else fall through)
const sdcols = extractSDcolsArg(sdcolsArg, scope);
if (!sdcols) return null;

const fnCall = j.value;                            // FunctionCallNode { name, args: [.SD, ...] }
const restArgs = fnCall.args.slice(1);
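// Keep argument names (n = 12) so named args are not demoted to positional ones.
// Assumes ArgumentNode exposes a named arg's name as `name`.
const argList = restArgs
  .map(a => (a.name ? `${a.name} = ${exprToString(a.value, '?')}` : exprToString(a.value, '?')))
  .join(', ');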
const template = argList
  ? `${fnCall.name}(__COL__, ${argList})`
  : `${fnCall.name}(__COL__)`;

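// LHS → outputPattern (plus the paired names for the positional form)
const outputPattern = extractOutputPattern(j.target, sdcols);
if (!outputPattern) return null;
// extractOutputPattern only returns the sentinel when j.target is a literal
// c("...") vector, so re-extracting the names here cannot fail.
const positionalOutputs =
  outputPattern === '__POSITIONAL__' ? extractStringVector(j.target) : null;

return [{
  kind: 'data-mutate',
  args: {
    data: dataName,
    expressions: [],
    columnMap: {
      selector: sdcols,
      template,
      outputPattern,
      ...(positionalOutputs ? { positionalOutputs } : {}),
    },
    ...(byGroups.length > 0 ? { groupBy: byGroups } : {}),
  },
  sourceSpan: node.span,
}];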

LHS → `outputPattern`

function extractOutputPattern(target: RNode, sdcols: ColumnSelector): string | null {
  // Form 3: paste0("<lit>", <names-identifier>) — extract literal prefix.
  // paste0("12mavg_", avg_names) → "12mavg___COL__". We don't verify that the
  // names-identifier equals the .SDcols identifier; positional alignment is
  // assumed (true in all observed uses). If they diverge in a future paper,
  // a recognizer-side equality check can be added then.
  if (target.type === 'function-call' && target.name === 'paste0' && target.args.length === 2) {
    const lit = target.args[0].value;
    if (lit.type === 'literal' && typeof lit.value === 'string') {
      return `${lit.value}__COL__`;
    }
  }
  // Form 4: c("a_z","b_z") positional name list — must match selector arity for explicit
  if (target.type === 'function-call' && target.name === 'c') {
    const names = extractStringVector(target);
    if (names && sdcols.kind === 'explicit' && names.length === sdcols.names.length) {
      // positional binding: the output names are not a single template, so return
      // the '__POSITIONAL__' sentinel; the paired names travel separately in
      // ColumnMap.positionalOutputs (see the "Positional LHS note" below).
      return '__POSITIONAL__';   // sentinel; executor zips selector names with positionalOutputs
    }
  }
  return null;
}

Positional LHS note. Form 4 (c("a_z","b_z") := <fn>(.SD), .SDcols=c("a","b")) needs paired output names. Rather than overload outputPattern with a list, the recognizer emits outputPattern: '__POSITIONAL__' (sentinel) plus a positionalOutputs: string[] field on ColumnMap (optional). Executor: when outputPattern === '__POSITIONAL__', zip selector.names with positionalOutputs and require equal length. Keeps the common pattern-based form clean while making the positional form explicit. Update ColumnMap:

export interface ColumnMap {
  selector: ColumnSelector;
  template: string;
  outputPattern: string;
  positionalOutputs?: string[];   // when outputPattern === '__POSITIONAL__'
}
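
The two naming modes can be sketched side by side. This is a minimal illustration, not the executor itself: `OutputSpec` and `resolveOutputNames` are hypothetical names, and `split`/`join` stands in for whatever substitution the executor uses.

```typescript
// Hypothetical, trimmed-down view of ColumnMap: only the naming fields.
interface OutputSpec {
  outputPattern: string;
  positionalOutputs?: string[];
}

const POSITIONAL = '__POSITIONAL__';

/** Resolves one output name per selected column, by pattern or by position. */
function resolveOutputNames(spec: OutputSpec, cols: string[]): string[] {
  if (spec.outputPattern === POSITIONAL) {
    const outs = spec.positionalOutputs;
    if (!outs || outs.length !== cols.length) {
      throw new Error('column-map: positional outputs / selector length mismatch');
    }
    return outs; // paired with cols by index
  }
  // Pattern mode: substitute the column name for every __COL__ occurrence.
  return cols.map(c => spec.outputPattern.split('__COL__').join(c));
}

// Form 3: paste0("12mavg_", names) := ...
console.log(resolveOutputNames({ outputPattern: '12mavg___COL__' }, ['a', 'b']));
// → [ '12mavg_a', '12mavg_b' ]

// Form 4: c("a_z","b_z") := ..., .SDcols=c("a","b")
console.log(
  resolveOutputNames({ outputPattern: POSITIONAL, positionalOutputs: ['a_z', 'b_z'] }, ['a', 'b']),
);
// → [ 'a_z', 'b_z' ]
```

Keeping the positional names in their own field means the pattern string never has to encode a list.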

Executor (`executor.ts`)

A small pre-pass before delegating to executeMutate / executeSummarise:

function expandColumnMap(
  inputDataset: Dataset,
  map: ColumnMap,
  byGroups: string[],
): { name: string; expr: string }[] {
  // 1. Resolve selector
  let cols: string[];
  if (map.selector.kind === 'explicit') {
    cols = map.selector.names;
    for (const c of cols) {
      if (!inputDataset.columns.find(col => col.name === c)) {
        throw new Error(`column-map: column "${c}" not found in input dataset`);
      }
    }
  } else {
    const exclude = new Set([...map.selector.exclude, ...byGroups]);
    cols = inputDataset.columns.map(c => c.name).filter(n => !exclude.has(n));
  }
  if (cols.length === 0) {
    throw new Error('column-map: selector resolved to zero columns');
  }

  // 2. Resolve output names
  let outNames: string[];
  if (map.outputPattern === '__POSITIONAL__') {
    if (!map.positionalOutputs || map.positionalOutputs.length !== cols.length) {
      throw new Error('column-map: positional outputs / selector length mismatch');
    }
    outNames = map.positionalOutputs;
  } else {
    // Function replacer: a "$" inside a column name is inserted literally
    outNames = cols.map(c => map.outputPattern.replaceAll('__COL__', () => c));
  }

  // 3. Check for duplicates
  const seen = new Set<string>();
  for (const n of outNames) {
    if (seen.has(n)) throw new Error(`column-map: duplicate output name "${n}"`);
    seen.add(n);
  }

  // 4. Expand template
  return cols.map((c, i) => ({
    name: outNames[i],
    // Function replacer: a "$" inside a column name is inserted literally
    expr: map.template.replaceAll('__COL__', () => c),
  }));
}

// realDataMutate becomes:
const realDataMutate: PrimitiveExecutor = {
  execute: (node, inputs) => {
    const mutateNode = node as DataMutateNode;
    const inputDataset = inputs['data'] as Dataset | undefined;
    if (!inputDataset) throw new Error('No input dataset for mutate');
    const p = mutateNode.params;
    let expressions = p.expressions;
    if (p.columnMap) {
      const expanded = expandColumnMap(inputDataset, p.columnMap, p.groupBy ?? []);
      expressions = [...p.expressions, ...expanded];
    }
    return executeMutate(inputDataset, expressions, p.groupBy, p.orderBy);
  },
};

// realDataSummarise becomes:
const realDataSummarise: PrimitiveExecutor = {
  execute: (node, inputs) => {
    const summariseNode = node as DataSummariseNode;
    const inputDataset = inputs['data'] as Dataset | undefined;
    if (!inputDataset) throw new Error('No input dataset for summarise');
    const p = summariseNode.params;
    let aggregations = p.aggregations;
    if (p.columnMap) {
      const expanded = expandColumnMap(inputDataset, p.columnMap, p.groupBy);
      aggregations = [...p.aggregations, ...expanded];
    }
    return executeSummarise(inputDataset, p.groupBy, aggregations);
  },
};

The underlying executeMutate / executeSummarise are not modified. The bounded evaluator handles the expanded expressions exactly as it does for hand-written ones today.

Mapper / ports / param-schema / UI

mapper.ts: AnalysisCall.args.columnMap is copied into node.params.columnMap for data-summarise and data-mutate. No change for any other kind.
Ports: unchanged — data in, out (Dataset).
param-schema.ts: new ParamDef kind: 'column-map' added to both summarise and mutate. Renders read-only at this milestone (no editor).
UI: node body text becomes e.g. lapply(.SD, mean) over all-but [Year] when columnMap is set, falling back to the existing assignment-list rendering otherwise. Property sheet shows selector summary + template + outputPattern (read-only). No new component needed beyond extending the existing summarise/mutate node body renderers.
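
A read-only summary string could be derived from the `ColumnMap` alone. The sketch below is one possible rendering, with hypothetical function names; it prints the template rather than reconstructing the original `lapply(.SD, ...)` call, so its wording differs slightly from the example text above.

```typescript
type ColumnSelector =
  | { kind: 'explicit'; names: string[] }
  | { kind: 'all-but'; exclude: string[] };

interface ColumnMap {
  selector: ColumnSelector;
  template: string;
  outputPattern: string;
  positionalOutputs?: string[];
}

// Hypothetical renderer: selector shown as "[a, b]" or "all-but [Year]".
function describeSelector(s: ColumnSelector): string {
  return s.kind === 'explicit'
    ? `[${s.names.join(', ')}]`
    : `all-but [${s.exclude.join(', ')}]`;
}

// Hypothetical renderer: one-line node body text for a column-map node.
function describeColumnMap(m: ColumnMap): string {
  return `${m.template} over ${describeSelector(m.selector)}`;
}

const yearly: ColumnMap = {
  selector: { kind: 'all-but', exclude: ['Year'] },
  template: 'mean(__COL__, na.rm = TRUE)',
  outputPattern: '__COL__',
};

console.log(describeColumnMap(yearly));
// → mean(__COL__, na.rm = TRUE) over all-but [Year]
```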

Testing

Unit tests (`recognize-data-table.test.ts`)

Each row produces one test:

Input	Expected output
`dt[, lapply(.SD, mean, na.rm=TRUE), by=Year]`	`data-summarise` groupBy=[Year] columnMap=`{all-but [Year], "mean(__COL__, na.rm = TRUE)", "__COL__"}`
`dt[, lapply(.SD, sum), by=g, .SDcols=c("a","b")]`	`data-summarise` groupBy=[g] columnMap=`{explicit [a,b], "sum(__COL__)", "__COL__"}`
`dt[, paste0("12mavg_",v) := frollmean(.SD,n=12), .SDcols=vars]` with upstream `vars <- c("a","b","c")`	`data-mutate` columnMap=`{explicit [a,b,c], "frollmean(__COL__, n = 12)", "12mavg___COL__"}`
`dt[, c("a_z","b_z") := scale(.SD), .SDcols=c("a","b")]`	`data-mutate` columnMap=`{explicit [a,b], "scale(__COL__)", "__POSITIONAL__", positionalOutputs=[a_z,b_z]}`
`dt[, .SD[1], by=key]`	null (slice form deliberately not matched)
`dt[, .SD, by=g]` (no lapply, no walrus)	null (bare `.SD` alone not supported)
`dt[, lapply(.SD, mean), by=g, .SDcols=names(dt)[1:3]]` (computed)	null (fall through to opaque)
`dt[, X := <fn>(.SD)]` (Form B without `.SDcols=`)	null (fall through)

Executor tests (`executor.test.ts`)

explicit selector resolves correctly; output names match outputPattern substitution.
all-but selector excludes both selector.exclude and groupBy.
Missing column in explicit → throws with column name.
Empty resolved set → throws.
Duplicate output names → throws with the duplicate name.
__POSITIONAL__ with mismatched lengths → throws.
Generic reducer path: expansion produces the right expressions for mean / sum / sd / median / min / max / var — feeds the existing bounded evaluator unchanged.

Integration test (`pipeline/integration.test.ts`)

A paper-#78-shaped snippet end-to-end via buildPipeline:

TB_raw[, Year := floor(T)]
TB_yearly <- TB_raw[, lapply(.SD, mean, na.rm=TRUE), by=Year]

Assert: one data-summarise node (no phantom data-filter); columnMap is all-but [Year]; against a small fixture with columns T, Year, a, b, c and 24 rows, the executed result has one row per distinct Year, the key column Year plus one mean column each for T, a, b, c (all-but [Year] excludes only the key, so T is averaged as well), and verified group means for a, b, c.
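
Expected values for the fixture can be computed independently of the pipeline. The sketch below is a plain reference implementation of a group-wise mean that skips missing values (the `na.rm=TRUE` behaviour); `Row`, `groupMeans`, and the three-row fixture are hypothetical and much smaller than the 24-row fixture described above.

```typescript
// Hypothetical reference implementation used only to derive expected test values.
type Row = Record<string, number | null>;

/** Mean of each column in `cols` per distinct value of `by`, ignoring nulls. */
function groupMeans(rows: Row[], by: string, cols: string[]): Map<number, Record<string, number>> {
  const groups = new Map<number, Row[]>();
  for (const row of rows) {
    const key = row[by];
    if (typeof key !== 'number') throw new Error(`group key "${by}" must be a number`);
    const bucket = groups.get(key) ?? [];
    bucket.push(row);
    groups.set(key, bucket);
  }

  const result = new Map<number, Record<string, number>>();
  groups.forEach((bucket, key) => {
    const means: Record<string, number> = {};
    for (const col of cols) {
      const vals = bucket
        .map(r => r[col])
        .filter((v): v is number => typeof v === 'number');
      // 0 / 0 yields NaN when every value in the group is missing
      means[col] = vals.reduce((sum, v) => sum + v, 0) / vals.length;
    }
    result.set(key, means);
  });
  return result;
}

const fixture: Row[] = [
  { Year: 2001, a: 1, b: 10 },
  { Year: 2001, a: 3, b: null }, // b is missing and is skipped
  { Year: 2002, a: 5, b: 20 },
];

const means = groupMeans(fixture, 'Year', ['a', 'b']);
console.log(means.get(2001)); // { a: 2, b: 10 }
console.log(means.get(2002)); // { a: 5, b: 20 }
```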

Paper-match test (`replicate-dissecting-mechanisms.test.ts`)

The yearly-aggregation block (lines 47–48 of the paper) currently emits opaque nodes. Update fixture so the block is typed: 1 data-mutate (Year := floor(T)) + 1 data-summarise (the lapply(.SD, mean, na.rm=TRUE), by=Year line). Pin one numerical assertion: mean(credit_spread) for a specific Year group. Per CLAUDE.md, also re-run npm run test:paper-match.

E2E (`e2e/`)

One Playwright test: paste a #78-shaped snippet, run, assert no phantom data-filter, one data-summarise visible in the DAG, edges connect, the node body shows the templated form.

Regression-risk surface

Triggers in isDataTableBracket are additive — existing .() / single-col walrus / paren-walrus / no-comma filter / with=FALSE / setDT chain forms unchanged. The new triggers fire only on lapply(.SD, ...) (j-slot) or walrus RHS with <fn>(.SD,...) and .SDcols= — both shapes the current dispatch returns null for.
columnMap is optional on params — every existing node-construction site keeps compiling unchanged.
The hardest guard: the trigger must NOT fire for .SD[1] (slice) or .SD used in any other position. The match is structural — FunctionCallNode { name: 'lapply', args: [identifier '.SD', ...] } or WalrusAssignNode { value: FunctionCallNode { args: [identifier '.SD', ...] } }. A SubsetNode { object: identifier '.SD' } (the slice form) doesn’t match either trigger. The unit-test table includes this as an explicit negative case.

Migration / Cleanup

BACKLOG bullet at line 125 of BACKLOG.md already notes “.SD/.SDcols advanced idioms deferred”; after landing, append a (2026-MM-DD) close note pointing at this spec.
The line-83 frollmean bullet stays open (separate concern); add a one-line note there saying that recognition of paper #43’s frollmean(.SD,...) is unblocked once this spec lands, leaving only the evaluator function to complete execution.
GAP-ANALYSIS.md lines 1356, 1621 reference the .SD/.SDcols deferral; update on close.

Out-of-scope (re-stated)

.SD[1] slice-first-per-group → separate spec (the new primitive path).
.I row-index → no corpus usage; YAGNI.
frollmean evaluator function → separate BACKLOG bullet.
data.table(WeightedMode(.SD)) → carried over from the prior spec’s deferred set.
dt[, stri_split_fixed(col,"|"), by=.(id,col)] → carried over.
lapply(...) %>% rbindlist(fill=TRUE) → carried over.
Computed .SDcols unresolvable by the inliner → falls through to webr-opaque.
