M4: Real Data Pipelines — Design Spec

Date: 2026-04-05
Status: Design
Depends on: Columnar storage (Float64Array/CategoricalColumn — complete), R parser (PipeChainNode — parsing complete), data-filter node (subset() — complete), pipeline executor + worker protocol

Motivation

Real papers don’t start with lm(). They start with 20–50 lines of data loading and wrangling. M3 gave us research-grade estimators, but they have no data to run on because we can’t execute the pipeline that produces it. Currently we handle ~6% of data manipulation patterns found in applied econ code.

Unlocks: Ability to parse and execute the full code from a real paper — read_csv() through dplyr pipelines to lm(). Validates against JEL-DiD paper (heavy dplyr pipelines).

Scope: Required tier only — pipes, read_csv(), filter(), mutate(), select(), group_by()+summarise(), as.factor()/as.numeric(), data table viewer. Important tier (joins, read_dta(), case_when(), data.table) follows as M4b reusing the same infrastructure.


1. Vectorized Expression Evaluator

File: src/core/data/expression.ts (new)

A standalone recursive-descent expression parser + tree-walker evaluator. Decoupled from the R parser — takes expression source text as a string, operates on typed column arrays, returns column-shaped results.

Why standalone instead of reusing the R parser? Portability (Stata data manipulation later), simpler testing, avoids coupling data execution to Chevrotain. The expression language is tiny and bounded.

Evaluation model: column-wise (vectorized)

Identifiers resolve to full column arrays. Operators work element-wise on typed arrays. This matches R’s vectorized semantics (log(wage) operates on a vector, not a scalar) and aligns with the Float64Array columnar storage already in place.

"log(wage)"
  → resolve "wage" → Float64Array[50000]
  → apply log element-wise → Float64Array[50000]

"wage_after - wage_before"
  → resolve both → two Float64Array[50000]
  → element-wise subtract → Float64Array[50000]

"state == \"NJ\""
  → resolve "state" → CategoricalColumn
  → element-wise string equality → Uint8Array[50000] (boolean mask)

No separate “aggregation mode” — aggregate functions (mean(), sum(), n(), etc.) always operate on whatever column slice they’re given. The executor controls the slice: whole column (ungrouped) or group partition (grouped).

Grammar

expression  → logical_or
logical_or  → logical_and ( ("||" | "|") logical_and )*
logical_and → equality ( ("&&" | "&") equality )*
equality    → comparison ( ("==" | "!=") comparison )*
comparison  → addition ( ("<" | ">" | "<=" | ">=") addition )*
addition    → multiply ( ("+" | "-") multiply )*
multiply    → unary ( ("*" | "/") unary )*
unary       → ("!" | "-") unary | power
power       → call ("^" power)?         // right-associative
call        → IDENT "(" args? ")" | primary
primary     → NUMBER | STRING | "TRUE" | "FALSE" | "NA" | IDENT | "(" expression ")"
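The grammar feeds a small AST that the tree-walker evaluates column-wise. A minimal sketch of that evaluator core follows; the ExprNode shape and names are illustrative, not the final expression.ts API, and only a few operators/functions are shown.

```typescript
// Minimal column-wise evaluator sketch (node shape is illustrative).
type ExprNode =
  | { kind: 'num'; value: number }
  | { kind: 'ident'; name: string }
  | { kind: 'binary'; op: '+' | '-' | '*' | '/'; left: ExprNode; right: ExprNode }
  | { kind: 'call'; fn: string; args: ExprNode[] };

const UNARY_MATH: Record<string, (x: number) => number> = {
  log: Math.log, exp: Math.exp, sqrt: Math.sqrt, abs: Math.abs,
};

function evalColumn(
  node: ExprNode,
  columns: ReadonlyMap<string, Float64Array>,
  rowCount: number,
): Float64Array {
  switch (node.kind) {
    case 'num':
      // Scalar literals broadcast to column length.
      return new Float64Array(rowCount).fill(node.value);
    case 'ident': {
      const col = columns.get(node.name);
      if (!col) throw new Error(`unknown column: ${node.name}`);
      return col;
    }
    case 'binary': {
      const l = evalColumn(node.left, columns, rowCount);
      const r = evalColumn(node.right, columns, rowCount);
      const out = new Float64Array(rowCount);
      for (let i = 0; i < rowCount; i++) {
        out[i] = node.op === '+' ? l[i] + r[i]
               : node.op === '-' ? l[i] - r[i]
               : node.op === '*' ? l[i] * r[i]
               : l[i] / r[i];
      }
      return out;
    }
    case 'call': {
      const fn = UNARY_MATH[node.fn];
      if (!fn) throw new Error(`unknown function: ${node.fn}`);
      const arg = evalColumn(node.args[0], columns, rowCount);
      // NaN (the NA sentinel) propagates through Math.* automatically.
      return Float64Array.from(arg, fn);
    }
  }
}
```

Under this model, "wage_after - wage_before" parses to one binary node over two ident nodes and returns a fresh Float64Array in a single pass.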

Supported constructs

Category            Operators / Functions
Literals            numeric (3.14), string ("NJ"), boolean (TRUE/FALSE), NA
Column references   bare identifiers (wage, state) → resolve to column array
Arithmetic          +, -, *, /, ^, unary -
Comparison          ==, !=, <, >, <=, >=
Logical             &&, ||, !, &, |
Math (7)            log, exp, sqrt, abs, round, ceiling, floor
Type (3)            as.numeric, as.factor, as.character
Predicates (1)      is.na
Conditional (1)     ifelse(test, yes, no)
String (1)          nchar
Pairwise (2)        pmin, pmax
Aggregate (7)       mean, sum, sd, median, min, max, n
Parentheses         arbitrary nesting

Total: 15 scalar functions + 7 aggregate functions.

NA propagation

NA follows R semantics: any arithmetic, comparison, or math function applied to an NA element yields NA for that element. In Float64Array columns, NaN serves as the NA sentinel; is.na() is the only function that maps NA to a definite boolean.
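With NaN as the NA sentinel in Float64Array columns (an assumption consistent with the columnar storage), arithmetic propagation comes free from IEEE 754, but comparisons do not: IEEE says NaN < x is false, while R says NA < x is NA, so the evaluator must special-case them. A small illustration (the tri-state Int8Array is illustrative only; the spec's filter masks are Uint8Array):

```typescript
const NA = Number.NaN; // NA sentinel in numeric columns (assumed convention)
const wage = Float64Array.from([5.25, NA, 5.5]);

// Arithmetic: NaN propagates element-wise with no extra work.
const bumped = wage.map((w) => w + 0.5); // index 1 stays NaN

// Comparison: must check explicitly, or NA rows silently become FALSE.
function ltMask(col: Float64Array, threshold: number): Int8Array {
  const out = new Int8Array(col.length); // 1 = TRUE, 0 = FALSE, -1 = NA
  for (let i = 0; i < col.length; i++) {
    out[i] = Number.isNaN(col[i]) ? -1 : col[i] < threshold ? 1 : 0;
  }
  return out;
}
```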

Interface

// Parse once, evaluate many times
function parseExpression(source: string): ExprNode;

// Column-wise evaluation against a dataset (filter, mutate)
function evaluateColumns(
  ast: ExprNode,
  columns: ReadonlyMap<string, Column>,
  rowCount: number,
): Column;

// Aggregate evaluation for summarise — same AST, operates on group slice
function evaluateAggregate(
  ast: ExprNode,
  columns: ReadonlyMap<string, Column>,
  rowCount: number,
): number | string | null;

Integration with existing subset()

The current parseCondition() + evaluateCondition() in transform.ts handles only column op value. After M4, the data-filter executor calls evaluateColumns() instead, producing a boolean mask. The old subset() function is kept as a thin wrapper that delegates to evaluateColumns(), which subsumes it.


2. Pipe Threading in the Recognizer

File: src/core/parsers/r/recognizer.ts (extend)

The parser already produces PipeChainNode with steps: RNode[] for both %>% and |>. The recognizer flattens this into multiple AnalysisCall entries with threaded data dependencies.

Algorithm

  1. Walk PipeChainNode.steps left to right
  2. Step 0 (the head): resolve as a variable reference or function call
  3. Steps 1–N: match function name → emit AnalysisCall with appropriate kind. Set data arg to previous step’s synthetic variable name.
  4. Synthetic variables: each intermediate step gets assignedTo: '__pipe_{chainIndex}_{stepIndex}_{verb}' (e.g., __pipe_0_1_filter, __pipe_0_2_mutate). The final step inherits the real assignedTo from the assignment context (e.g., result).
  5. Pipe argument threading: LHS is the implicit first argument. df %>% filter(x > 0) means filter(df, x > 0) — the recognizer treats the piped-in value as the data argument, not as a positional arg to extract.
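The steps above can be sketched as follows. Step and Call are simplified stand-ins for the real RNode/AnalysisCall types, and a function-call head (e.g. read_csv) would first emit a data-load call, elided here:

```typescript
// Simplified stand-ins for recognizer types (illustrative only).
interface Step { verb: string; args: string[] }
interface Call { kind: string; dataArg: string; assignedTo: string; groupBy?: string[] }

function walkPipeChain(steps: Step[], chainIndex: number, finalName: string): Call[] {
  const calls: Call[] = [];
  let pendingGroupBy: string[] = [];
  // Step 0 (the head): here assumed to be a variable reference.
  let prev = steps[0].verb;
  for (let i = 1; i < steps.length; i++) {
    const step = steps[i];
    // group_by emits nothing; it stashes columns for the next consumer.
    if (step.verb === 'group_by') { pendingGroupBy = step.args; continue; }
    if (step.verb === 'ungroup') { pendingGroupBy = []; continue; }
    const isLast = i === steps.length - 1;
    // Intermediate steps get synthetic names; the last inherits the real one.
    const assignedTo = isLast ? finalName : `__pipe_${chainIndex}_${i}_${step.verb}`;
    const call: Call = { kind: `data-${step.verb}`, dataArg: prev, assignedTo };
    if ((step.verb === 'summarise' || step.verb === 'mutate') && pendingGroupBy.length) {
      call.groupBy = pendingGroupBy; // consume the stashed grouping
      pendingGroupBy = [];
    }
    calls.push(call);
    prev = assignedTo; // thread the output into the next step's data arg
  }
  return calls;
}
```

So df %>% filter(x > 0) %>% group_by(s) %>% summarise(m = mean(x)) assigned to result yields two calls: a data-filter writing __pipe_0_1_filter, and a data-summarise with groupBy ["s"] reading it and writing result.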

group_by state threading

The pipe walker maintains pendingGroupBy: string[] state:

  - When it hits group_by(x, y), it does not emit a call; it stashes ["x", "y"]
  - The next summarise or mutate step consumes it, populating their groupBy field
  - A group_by at the end of a pipe (no consumer) is silently dropped (no-op)
  - ungroup() clears the pending state

Non-data-verb steps

If a pipe step is a recognized model function (e.g., df %>% filter(x > 0) %>% lm(y ~ x, data = .)), emit it as a linear-model call with data pointing to the previous pipe step. The pipe threading is verb-agnostic — each step is dispatched to the appropriate recognizer pattern.

Unrecognized steps emit kind: 'unsupported'. The chain continues — downstream steps wire to the placeholder’s output, preserving DAG topology even if execution can’t proceed through that step.

Edge case — pipes starting with a function call

result <- read_csv("data.csv") %>% filter(x > 0)

Step 0 is read_csv("data.csv") — a function call, not a variable reference. The recognizer emits it as a data-load call, then threads subsequent steps from its synthetic variable.

New AnalysisCall kinds

Add to AnalysisKind union:

export type AnalysisKind =
  | 'data-load' | 'data-filter' | 'data-mutate' | 'data-select'
  | 'data-summarise' | 'data-arrange' | 'data-rename'
  | 'descriptive' | 't-test' | 'linear-model' | 'glm'
  | 'model-summary' | 'model-comparison' | 'vcov-override'
  | 'unsupported';

3. New Pipeline Node Types

Five new node types join the existing data-load and data-filter. All follow the same port pattern: input data: dataset, output out: dataset. Uniform chaining.

Type definitions

Add to PipelineNode union in core/pipeline/types.ts:

// ── data-mutate ──────────────────────────────────────────────
export interface DataMutateParams {
  expressions: { name: string; expr: string }[];
  groupBy?: string[];        // optional: group_by() %>% mutate()
}
export interface DataMutateNode extends PipelineNodeBase {
  type: 'data-mutate';
  params: DataMutateParams;
  result?: Dataset;
}

// ── data-select ──────────────────────────────────────────────
export interface DataSelectParams {
  columns: string[];         // positive selection: select(x, y)
  drop: string[];            // negative selection: select(-z)
}
export interface DataSelectNode extends PipelineNodeBase {
  type: 'data-select';
  params: DataSelectParams;
  result?: Dataset;
}

// ── data-summarise ───────────────────────────────────────────
export interface DataSummariseParams {
  groupBy: string[];
  aggregations: { name: string; expr: string }[];
}
export interface DataSummariseNode extends PipelineNodeBase {
  type: 'data-summarise';
  params: DataSummariseParams;
  result?: Dataset;
}

// ── data-arrange ─────────────────────────────────────────────
export interface DataArrangeParams {
  columns: { name: string; desc: boolean }[];
}
export interface DataArrangeNode extends PipelineNodeBase {
  type: 'data-arrange';
  params: DataArrangeParams;
  result?: Dataset;
}

// ── data-rename ──────────────────────────────────────────────
export interface DataRenameParams {
  mapping: { from: string; to: string }[];
}
export interface DataRenameNode extends PipelineNodeBase {
  type: 'data-rename';
  params: DataRenameParams;
  result?: Dataset;
}

NODE_PORTS additions

All five use the same port shape:

'data-mutate':    { inputs: [{ name: 'data', dataType: 'dataset' }], outputs: [{ name: 'out', dataType: 'dataset' }] },
'data-select':    { inputs: [{ name: 'data', dataType: 'dataset' }], outputs: [{ name: 'out', dataType: 'dataset' }] },
'data-summarise': { inputs: [{ name: 'data', dataType: 'dataset' }], outputs: [{ name: 'out', dataType: 'dataset' }] },
'data-arrange':   { inputs: [{ name: 'data', dataType: 'dataset' }], outputs: [{ name: 'out', dataType: 'dataset' }] },
'data-rename':    { inputs: [{ name: 'data', dataType: 'dataset' }], outputs: [{ name: 'out', dataType: 'dataset' }] },

group_by() handling

Not a separate node. group_by() is consumed by the recognizer’s pipe walker:

Pattern                                    Node type        groupBy  Output shape
mutate(y = log(x))                         data-mutate      []       same rows
group_by(s) %>% mutate(y = x - mean(x))    data-mutate      ["s"]    same rows, group-scoped aggregates
group_by(s) %>% summarise(y = mean(x))     data-summarise   ["s"]    one row per group

Grouped mutate: executor partitions rows by group, evaluates expressions within each partition (aggregate functions resolve to group-scoped values), then reassembles all rows in original order.
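That partition-evaluate-reassemble cycle can be sketched as follows, assuming stringified group keys and a single numeric column; evalGroup stands in for per-group expression evaluation:

```typescript
// Grouped mutate: evaluate within each group, scatter back to original order.
function groupedMutate(
  groupKey: string[],                               // one key per row
  values: Float64Array,
  evalGroup: (slice: Float64Array) => Float64Array, // same length in/out
): Float64Array {
  // 1. Partition rows by group key (insertion order preserved by Map).
  const groups = new Map<string, number[]>();
  groupKey.forEach((k, i) => {
    const rows = groups.get(k);
    if (rows) rows.push(i); else groups.set(k, [i]);
  });
  const out = new Float64Array(values.length);
  for (const rows of groups.values()) {
    // 2. Evaluate the expression over the group's slice.
    const result = evalGroup(Float64Array.from(rows, (r) => values[r]));
    // 3. Scatter results back by original row index.
    rows.forEach((r, j) => { out[r] = result[j]; });
  }
  return out;
}
```

With this shape, group_by(s) %>% mutate(y = x - mean(x)) becomes groupedMutate(sKeys, xCol, slice => demean(slice)), and the output rows stay in their original order.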

as.factor() / as.numeric() handling

Two paths:

  - Inside mutate(): handled by the expression evaluator's functions; mutate(year_f = as.factor(year)) evaluates normally
  - Standalone df$col <- as.factor(df$col): the recognizer detects the $<- assignment pattern and emits a data-mutate node with a single expression

read_csv() execution

Already recognized as data-load. The executor already loads CSV via PapaParse. For M4, verify it works end-to-end with piped chains: read_csv("data.csv") %>% filter(...) should produce a data-load node followed by a data-filter node with correct edge wiring.


4. Recognizer Patterns for dplyr Verbs

Each verb gets a recognizer match in the pipe walker. The recognizer extracts arguments from the R AST and emits AnalysisCall entries.

  - filter(cond1, cond2) → data-filter (reuse existing): conditions joined with && into a single expression string
  - mutate(name = expr, ...) → data-mutate: named assignments extracted as {name, expr}[]
  - select(col1, col2) / select(-col) → data-select: positive/negative column refs extracted as columns[] / drop[]
  - group_by(col1, col2) → no node emitted: group columns stashed in pipe state, consumed by the next verb
  - summarise(name = agg_expr, ...) (matches summarise and summarize) → data-summarise: named assignments as {name, expr}[] plus inherited groupBy
  - arrange(col, desc(col2)) → data-arrange: column refs with desc() wrapper detection, as {name, desc}[]
  - rename(new = old) → data-rename: named assignments as {from, to}[]
  - as.factor(col) / as.numeric(col) in a standalone $<- assignment → data-mutate: single expression built from the column and function

Argument extraction strategy

The recognizer uses source spans to extract expression text from AST nodes. For mutate(log_wage = log(wage)):

  1. Find the AssignmentNode inside the function call args (R = in call context)
  2. Extract name: "log_wage" from the LHS identifier
  3. Extract expr: "log(wage)" from the RHS source span (slice the original source text)

This raw text is what the expression evaluator receives. The R parser AST gives us the structure to find where expressions are; the expression evaluator’s own parser handles the expression content.
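The span-slicing step in miniature; the Spanned shape and inclusive end offset are assumptions for illustration, the real nodes carry whatever offsets the parser records:

```typescript
// Span-based extraction: the AST tells us where the RHS lives; we slice the
// raw source rather than re-serialize the tree.
interface Spanned { start: number; end: number } // inclusive offsets (assumed)

function sliceSpan(source: string, span: Spanned): string {
  return source.slice(span.start, span.end + 1);
}

const source = 'mutate(log_wage = log(wage))';
const rhs: Spanned = { start: 18, end: 26 }; // span of the RHS expression
const exprText = sliceSpan(source, rhs);     // "log(wage)"
```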

filter() vs existing subset()

Both produce data-filter nodes. The recognizer already handles subset(df, condition). Adding filter(condition) is another function name match. Difference: filter() takes multiple conditions as separate arguments (implicit &&), while subset() takes one. The recognizer joins multiple filter args with &&.


5. Executor Implementation

Files: src/core/data/expression.ts (new), src/core/pipeline/executor.ts (extend)

data-mutate executor

1. Receive input Dataset from 'data' port
2. Build column map: Map<string, Column> from dataset.columns
3. For each expression in params.expressions:
   a. parseExpression(expr.expr)
   b. If params.groupBy is set:
      - Partition rows by group key columns
      - For each group: evaluateColumns(ast, groupSlice, groupRowCount)
      - Reassemble results in original row order
   c. Else: evaluateColumns(ast, allColumns, rowCount)
   d. Add result column (name = expr.name) to column map
      - Important: later expressions can reference earlier ones in the same mutate()
4. Return new Dataset with original columns + new/replaced columns

data-select executor

1. Receive input Dataset
2. If params.columns is non-empty: keep only those columns
3. If params.drop is non-empty: remove those columns
4. Return new Dataset with filtered column set

data-summarise executor

1. Receive input Dataset
2. Partition rows by params.groupBy columns
3. For each group:
   a. Build group column slice (typed array views or copies)
   b. For each aggregation in params.aggregations:
      evaluateAggregate(ast, groupColumns, groupRowCount)
   c. Produces one value per aggregation per group
4. Assemble output Dataset: groupBy columns (one row per group) + aggregation columns
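The grouping pass above can be sketched like this, simplified to one group column and one aggregation (the real executor handles several of each):

```typescript
// Summarise: one output row per group, in group encounter order.
function summariseBy(
  groupCol: string[],
  valueCol: Float64Array,
  agg: (slice: Float64Array) => number,
): { group: string[]; value: Float64Array } {
  // Partition row indices by group key.
  const partitions = new Map<string, number[]>();
  groupCol.forEach((g, i) => {
    const rows = partitions.get(g);
    if (rows) rows.push(i); else partitions.set(g, [i]);
  });
  // One aggregated value per group; group column shrinks to one row each.
  const group: string[] = [];
  const value = new Float64Array(partitions.size);
  let out = 0;
  for (const [g, rows] of partitions) {
    group.push(g);
    value[out++] = agg(Float64Array.from(rows, (r) => valueCol[r]));
  }
  return { group, value };
}
```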

data-arrange executor

1. Receive input Dataset
2. Build sort key array from params.columns
3. Compute row index permutation via multi-key sort (stable)
4. Reorder all columns by permutation
5. Return new Dataset
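Steps 2-4 in miniature, leaning on the fact that Array.prototype.sort is spec-guaranteed stable (since ES2019). Keys are assumed numeric here: Float64Array for numeric columns, and the integer code array for categoricals (an assumption about how CategoricalColumn exposes order):

```typescript
// Sort keys: numeric arrays plus a per-key descending flag.
interface SortKey { values: ArrayLike<number>; desc: boolean }

function sortPermutation(rowCount: number, keys: SortKey[]): number[] {
  const perm = Array.from({ length: rowCount }, (_, i) => i);
  perm.sort((a, b) => {
    for (const k of keys) {
      const d = k.values[a] - k.values[b];
      if (d !== 0) return k.desc ? -d : d;
    }
    return 0; // ties keep original relative order (stable sort)
  });
  return perm;
}

// Step 4: reorder each column through the permutation.
const reorder = (col: Float64Array, perm: number[]): Float64Array =>
  Float64Array.from(perm, (i) => col[i]);
```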

data-rename executor

1. Receive input Dataset
2. For each mapping in params.mapping:
   - Find column by from name, change its name to to
3. Return new Dataset with renamed columns (data unchanged)

data-filter executor upgrade

Replace parseCondition() + evaluateCondition() with the expression evaluator:

1. Receive input Dataset
2. parseExpression(params.condition)
3. evaluateColumns(ast, columns, rowCount) → boolean mask (Uint8Array)
4. Build keepRows index array from mask
5. Slice all columns by keepRows
6. Return new Dataset
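Steps 4-5 in miniature, with the mask convention 1 = keep in a Uint8Array:

```typescript
// Build the keep-row index list once, then gather every column through it.
function maskToKeepRows(mask: Uint8Array): number[] {
  const keep: number[] = [];
  for (let i = 0; i < mask.length; i++) if (mask[i]) keep.push(i);
  return keep;
}

const gather = (col: Float64Array, keep: number[]): Float64Array =>
  Float64Array.from(keep, (i) => col[i]);
```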

The existing subset() in transform.ts is preserved as a convenience wrapper but internally delegates to the expression evaluator.

Registration

All five new executors register in registerRealExecutors(). Each receives ExecutionContext with runtime dataset as today.


6. Data Table Viewer

Architecture

Interaction model

Click any data-producing node (data-load, data-filter, data-mutate, etc.) → fixed-height bottom panel (~250px) appears below the DAG canvas showing the dataset. Click a non-data node or click the close button → panel disappears.

Panel layout

┌─────────────────────────────────────────────────────────────┐
│ [filter] state == "NJ"              253 rows × 8 cols  [×]  │  ← header bar
├────┬──────────┬──────────┬──────────┬──────────┬───────────┤
│    │ state ↕  │ wage ↕   │ emp ↕    │ chain ↕  │ ...       │  ← column headers (sortable)
│    │ str      │ num      │ num      │ fct      │           │  ← type pills
├────┼──────────┼──────────┼──────────┼──────────┼───────────┤
│  1 │ NJ       │    5.25  │      40  │ bk       │           │  ← data rows
│  2 │ NJ       │    5.15  │      —   │ kfc      │           │  ← em-dash for NA
│  3 │ NJ       │    5.50  │      35  │ wendys   │           │
│ .. │ ...      │    ...   │    ...   │ ...      │           │
└────┴──────────┴──────────┴──────────┴──────────┴───────────┘

Header bar

Table features

Performance

Zustand integration

New state in the pipeline store:

interface PipelineState {
  // ... existing fields
  inspectedNodeId: string | null;    // node whose data is shown in the viewer
  inspectedDataset: Dataset | null;  // the dataset to display
}

inspectedNodeId is independent of selectedNodeId — selecting a node for the property sheet vs inspecting its data are separate actions. Double-click or a dedicated “inspect” button on data-producing nodes sets inspectedNodeId.

When inspectedNodeId changes, the store sends a REQUEST_DATASET message to the worker (protocol already exists) and populates inspectedDataset from the DATASET_RESPONSE.


7. Mapper Changes

The mapper (src/core/pipeline/mapper.ts) needs cases for the five new AnalysisCall kinds (data-mutate, data-select, data-summarise, data-arrange, data-rename). All follow the existing data-filter pattern:

  1. createNode() — construct the typed node from call.args
  2. addDataEdges() — resolve call.args['data'] against scope map → create edge to data input port
  3. scope.set(call.assignedTo, { nodeId, port: 'out' }) — register output for downstream resolution

The mapper already handles flat AnalysisCall[] with variable-scope edge resolution. Synthetic pipe variable names (__pipe_0_1_filter) slot cleanly into the existing scope map. No new mapper concepts needed.


8. DAG Visualization

New node types need visual treatment in the React Flow canvas.

Node styling

Data pipeline nodes get a distinct visual style from model/stats nodes:

  - Color family: muted blue-green palette (vs warm palette for model nodes)
  - Per-verb accent: filter (blue), mutate (green), select (slate), summarise (purple), arrange (amber), rename (teal)
  - Compact labels: verb + key info, e.g. filter: state == "NJ", mutate: +log_wage, +treated, select: 3 cols, summarise: by state

Pipe chain visual grouping

Pipe chains are common in real code (3–8 nodes in sequence). They should read as a connected pipeline, not scattered nodes. dagre layout already handles this via topological ordering — sequential chains render as vertical sequences. No special grouping needed beyond correct edge wiring.


9. Implementation Sequence

Ordered by dependency — each step builds on the previous.

Step 1: Expression evaluator (src/core/data/expression.ts)

Step 2: Pipeline types + ports

Step 3: Recognizer pipe threading

Step 4: Mapper + executor wiring

Step 5: Data table viewer UI

Step 6: Integration test against JEL-DiD


Validates against

JEL-DiD paper — heavy dplyr pipelines with read_csv, filter, mutate, group_by + summarise, feeding into M3’s estimators. After M4, we should parse and execute its full data pipeline end-to-end.


Appendix: Existing Infrastructure Leveraged

Component                                     Status                  M4 usage
PipeChainNode in R parser                     Parsing complete        Recognizer walks this
Float64Array / CategoricalColumn              Complete (2026-04-05)   Expression evaluator operates on typed arrays directly
REQUEST_DATASET / DATASET_RESPONSE protocol   Complete (2026-04-05)   Data table viewer retrieves dataset from worker
subset() in transform.ts                      Complete (basic)        Upgraded to use expression evaluator
data-filter node + executor                   Complete                Reused for filter() verb; executor upgraded
data-load node + executor                     Complete                Reused for read_csv() in pipe heads
Mapper scope resolution                       Complete                Synthetic pipe variables use same mechanism
registerRealExecutors()                       Complete                5 new executors added here