Date: 2026-04-05
Status: Design
Depends on: Columnar storage (`Float64Array`/`CategoricalColumn` — complete), R parser (`PipeChainNode` — parsing complete), data-filter node (`subset()` — complete), pipeline executor + worker protocol
Real papers don’t start with lm(). They start with 20–50
lines of data loading and wrangling. M3 gave us research-grade
estimators, but they have no data to run on because we can’t execute the
pipeline that produces it. Currently we handle ~6% of data manipulation
patterns found in applied econ code.
Unlocks: Ability to parse and execute the full code
from a real paper — read_csv() through dplyr pipelines to
lm(). Validates against JEL-DiD paper (heavy dplyr
pipelines).
Scope: Required tier only — pipes,
read_csv(), filter(), mutate(),
select(), group_by()+summarise(),
as.factor()/as.numeric(), data table viewer.
Important tier (joins, read_dta(),
case_when(), data.table) follows as M4b reusing the same
infrastructure.
File: src/core/data/expression.ts
(new)
A standalone recursive-descent expression parser + tree-walker evaluator. Decoupled from the R parser — takes expression source text as a string, operates on typed column arrays, returns column-shaped results.
Why standalone instead of reusing the R parser? Portability (Stata data manipulation later), simpler testing, avoids coupling data execution to Chevrotain. The expression language is tiny and bounded.
Identifiers resolve to full column arrays. Operators work
element-wise on typed arrays. This matches R’s vectorized semantics
(log(wage) operates on a vector, not a scalar) and aligns
with the Float64Array columnar storage already in place.
"log(wage)"
→ resolve "wage" → Float64Array[50000]
→ apply log element-wise → Float64Array[50000]
"wage_after - wage_before"
→ resolve both → two Float64Array[50000]
→ element-wise subtract → Float64Array[50000]
"state == \"NJ\""
→ resolve "state" → CategoricalColumn
→ element-wise string equality → Uint8Array[50000] (boolean mask)
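The traces above can be sketched as plain typed-array loops. A minimal illustration, assuming simplified column shapes (a raw `Float64Array` for numerics, and codes + levels arrays for categoricals, not the real `Column` types):

```typescript
// "log(wage)": apply a unary math function element-wise over a numeric column
function mapUnary(col: Float64Array, fn: (x: number) => number): Float64Array {
  const out = new Float64Array(col.length);
  for (let i = 0; i < col.length; i++) out[i] = fn(col[i]);
  return out;
}

// 'state == "NJ"': element-wise string equality over a categorical column
// (integer codes indexing into a levels array), producing a Uint8Array mask
function eqCategorical(
  codes: Uint32Array,
  levels: string[],
  rhs: string,
): Uint8Array {
  const target = levels.indexOf(rhs); // -1 if the level doesn't exist
  const mask = new Uint8Array(codes.length);
  for (let i = 0; i < codes.length; i++) mask[i] = codes[i] === target ? 1 : 0;
  return mask;
}

const wage = new Float64Array([Math.E, 1]);
const logWage = mapUnary(wage, Math.log);
const mask = eqCategorical(new Uint32Array([0, 1, 0]), ['NJ', 'PA'], 'NJ');
```

Both operations are O(n) single passes over the typed array, which is what makes the columnar layout pay off for 50k-row datasets.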
- `filter()` evaluates to a boolean mask → selects matching rows from all columns
- `mutate()` evaluates to a column array → appended as a new column
- `summarise()` evaluates per group slice — same column-wise logic on each group's partition

No separate "aggregation mode" — aggregate functions (`mean()`, `sum()`, `n()`, etc.) always operate on whatever column slice they're given. The executor controls the slice: whole column (ungrouped) or group partition (grouped).
```
expression  → logical_or
logical_or  → logical_and ( ("||" | "|") logical_and )*
logical_and → equality ( ("&&" | "&") equality )*
equality    → comparison ( ("==" | "!=") comparison )*
comparison  → addition ( ("<" | ">" | "<=" | ">=") addition )*
addition    → multiply ( ("+" | "-") multiply )*
multiply    → unary ( ("*" | "/") unary )*
unary       → ("!" | "-") unary | power
power       → call ("^" power)?          // right-associative
call        → IDENT "(" args? ")" | primary
primary     → NUMBER | STRING | "TRUE" | "FALSE" | "NA" | IDENT | "(" expression ")"
```
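The grammar translates mechanically into one function per rule. A minimal self-contained sketch of the arithmetic slice (addition, multiply, right-associative power, primary), evaluating numbers directly rather than building `ExprNode`, and eliding the `unary` and `call` rules:

```typescript
// Tokenize numbers and single-char arithmetic operators (sketch only)
function tokenize(src: string): string[] {
  return src.match(/\d+(\.\d+)?|[()+\-*/^]/g) ?? [];
}

function parse(src: string): number {
  const toks = tokenize(src);
  let pos = 0;
  const peek = () => toks[pos];
  const next = () => toks[pos++];

  // addition → multiply ( ("+" | "-") multiply )*
  function addition(): number {
    let left = multiply();
    while (peek() === '+' || peek() === '-') {
      const op = next();
      const right = multiply();
      left = op === '+' ? left + right : left - right;
    }
    return left;
  }
  // multiply → power ( ("*" | "/") power )*   (unary elided in this sketch)
  function multiply(): number {
    let left = power();
    while (peek() === '*' || peek() === '/') {
      const op = next();
      const right = power();
      left = op === '*' ? left * right : left / right;
    }
    return left;
  }
  // power → primary ("^" power)?   right-associative via self-recursion
  function power(): number {
    const base = primary();
    if (peek() === '^') { next(); return base ** power(); }
    return base;
  }
  // primary → NUMBER | "(" expression ")"
  function primary(): number {
    if (peek() === '(') { next(); const v = addition(); next(); return v; }
    return Number(next());
  }
  return addition();
}
```

The loop-per-level structure gives left associativity for `+ - * /`; the recursive call inside `power()` gives `2 ^ 3 ^ 2 = 2 ^ (3 ^ 2) = 512`, matching R.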
| Category | Operators / Functions |
|---|---|
| Literals | numeric (`3.14`), string (`"NJ"`), boolean (`TRUE`/`FALSE`), `NA` |
| Column references | bare identifiers (`wage`, `state`) → resolve to column array |
| Arithmetic | `+`, `-`, `*`, `/`, `^`, unary `-` |
| Comparison | `==`, `!=`, `<`, `>`, `<=`, `>=` |
| Logical | `&&`, `\|\|`, `!`, `&`, `\|` |
| Math (7) | `log`, `exp`, `sqrt`, `abs`, `round`, `ceiling`, `floor` |
| Type (3) | `as.numeric`, `as.factor`, `as.character` |
| Predicates (1) | `is.na` |
| Conditional (1) | `ifelse(test, yes, no)` |
| String (1) | `nchar` |
| Pairwise (2) | `pmin`, `pmax` |
| Aggregate (7) | `mean`, `sum`, `sd`, `median`, `min`, `max`, `n` |
| Parentheses | arbitrary nesting |
Total: 15 scalar functions + 7 aggregate functions.
NA semantics:

- Arithmetic on `NaN` → `NaN` (Float64Array native behavior)
- Comparisons involving `NaN` → `false`
- `is.na()` → `true` for `NaN` (numeric) or the `0xFFFFFFFF` sentinel (categorical)
- Aggregates skip `NaN` by default (the `na.rm = TRUE` convention)

```typescript
// Parse once, evaluate many times
function parseExpression(source: string): ExprNode;

// Column-wise evaluation against a dataset (filter, mutate)
function evaluateColumns(
  ast: ExprNode,
  columns: ReadonlyMap<string, Column>,
  rowCount: number,
): Column;

// Aggregate evaluation for summarise — same AST, operates on group slice
function evaluateAggregate(
  ast: ExprNode,
  columns: ReadonlyMap<string, Column>,
  rowCount: number,
): number | string | null;
```

Relationship to `subset()`: the current `parseCondition()` + `evaluateCondition()` in transform.ts handle only `column op value`. After M4, the data-filter executor calls `evaluateColumns()` instead, producing a boolean mask. The old `subset()` function becomes a thin wrapper or is replaced entirely. `evaluateColumns()` subsumes it.
File: src/core/parsers/r/recognizer.ts
(extend)
The parser already produces PipeChainNode with
steps: RNode[] for both %>% and
|>. The recognizer flattens this into multiple
AnalysisCall entries with threaded data dependencies.
The recognizer walks `PipeChainNode.steps` left to right:

- Step 0 is a variable reference (e.g., `df`): it becomes the data dependency for step 1
- Step 0 is a function call (e.g., `read_csv("data.csv")`): emit as a `data-load` call, thread from its synthetic variable
- Each verb step: emit an `AnalysisCall` with the appropriate kind. Set the `data` arg to the previous step's synthetic variable name.
- Intermediate steps get `assignedTo: '__pipe_{chainIndex}_{stepIndex}_{verb}'` (e.g., `__pipe_0_1_filter`, `__pipe_0_2_mutate`). The final step inherits the real `assignedTo` from the assignment context (e.g., `result`).

`df %>% filter(x > 0)` means `filter(df, x > 0)` — the recognizer treats the piped-in value as the data argument, not as a positional arg to extract.

The pipe walker maintains `pendingGroupBy: string[]` state:

- When it hits `group_by(x, y)`, it doesn't emit a call — it stashes `["x", "y"]`
- The next `summarise` or `mutate` step consumes it, populating their `groupBy` field
- A `group_by` at the end of a pipe (no consumer) is silently dropped (no-op)
- `ungroup()` clears the pending state
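A sketch of that `group_by` threading, with simplified stand-ins for the real types (`Step` and `Call` here are illustrative, not the actual `RNode`/`AnalysisCall` shapes):

```typescript
interface Step { verb: string; groupCols?: string[] }
interface Call { kind: string; groupBy: string[] }

function walkPipe(steps: Step[]): Call[] {
  const calls: Call[] = [];
  let pendingGroupBy: string[] = [];
  for (const step of steps) {
    if (step.verb === 'group_by') {
      pendingGroupBy = step.groupCols ?? []; // stash; emit nothing
    } else if (step.verb === 'ungroup') {
      pendingGroupBy = [];                   // clear pending state
    } else if (step.verb === 'summarise' || step.verb === 'mutate') {
      calls.push({ kind: `data-${step.verb}`, groupBy: pendingGroupBy });
      pendingGroupBy = [];                   // consumed by this step
    } else {
      calls.push({ kind: `data-${step.verb}`, groupBy: [] });
    }
  }
  // a trailing group_by with no consumer is silently dropped here
  return calls;
}

const calls = walkPipe([
  { verb: 'filter' },
  { verb: 'group_by', groupCols: ['state'] },
  { verb: 'summarise' },
]);
```

The `group_by` step never appears in the output; only its columns survive, attached to the consuming verb's `groupBy` field.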
If a pipe step is a recognized model function (e.g.,
df %>% filter(x > 0) %>% lm(y ~ x, data = .)),
emit it as a linear-model call with data
pointing to the previous pipe step. The pipe threading is verb-agnostic
— each step is dispatched to the appropriate recognizer pattern.
Unrecognized steps emit kind: 'unsupported'. The chain
continues — downstream steps wire to the placeholder’s output,
preserving DAG topology even if execution can’t proceed through that
step.
```r
result <- read_csv("data.csv") %>% filter(x > 0)
```

Step 0 is `read_csv("data.csv")` — a function call, not a variable reference. The recognizer emits it as a `data-load` call, then threads subsequent steps from its synthetic variable.
Add to AnalysisKind union:
```typescript
export type AnalysisKind =
  | 'data-load' | 'data-filter' | 'data-mutate' | 'data-select'
  | 'data-summarise' | 'data-arrange' | 'data-rename'
  | 'descriptive' | 't-test' | 'linear-model' | 'glm'
  | 'model-summary' | 'model-comparison' | 'vcov-override'
  | 'unsupported';
```

Five new node types join the existing `data-load` and `data-filter`. All follow the same port pattern: input `data: dataset`, output `out: dataset`. Uniform chaining.
Add to PipelineNode union in
core/pipeline/types.ts:
```typescript
// ── data-mutate ──────────────────────────────────────────────
export interface DataMutateParams {
  expressions: { name: string; expr: string }[];
  groupBy?: string[]; // optional: group_by() %>% mutate()
}
export interface DataMutateNode extends PipelineNodeBase {
  type: 'data-mutate';
  params: DataMutateParams;
  result?: Dataset;
}

// ── data-select ──────────────────────────────────────────────
export interface DataSelectParams {
  columns: string[]; // positive selection: select(x, y)
  drop: string[];    // negative selection: select(-z)
}
export interface DataSelectNode extends PipelineNodeBase {
  type: 'data-select';
  params: DataSelectParams;
  result?: Dataset;
}

// ── data-summarise ───────────────────────────────────────────
export interface DataSummariseParams {
  groupBy: string[];
  aggregations: { name: string; expr: string }[];
}
export interface DataSummariseNode extends PipelineNodeBase {
  type: 'data-summarise';
  params: DataSummariseParams;
  result?: Dataset;
}

// ── data-arrange ─────────────────────────────────────────────
export interface DataArrangeParams {
  columns: { name: string; desc: boolean }[];
}
export interface DataArrangeNode extends PipelineNodeBase {
  type: 'data-arrange';
  params: DataArrangeParams;
  result?: Dataset;
}

// ── data-rename ──────────────────────────────────────────────
export interface DataRenameParams {
  mapping: { from: string; to: string }[];
}
export interface DataRenameNode extends PipelineNodeBase {
  type: 'data-rename';
  params: DataRenameParams;
  result?: Dataset;
}
```

All five use the same port shape:

```typescript
'data-mutate':    { inputs: [{ name: 'data', dataType: 'dataset' }], outputs: [{ name: 'out', dataType: 'dataset' }] },
'data-select':    { inputs: [{ name: 'data', dataType: 'dataset' }], outputs: [{ name: 'out', dataType: 'dataset' }] },
'data-summarise': { inputs: [{ name: 'data', dataType: 'dataset' }], outputs: [{ name: 'out', dataType: 'dataset' }] },
'data-arrange':   { inputs: [{ name: 'data', dataType: 'dataset' }], outputs: [{ name: 'out', dataType: 'dataset' }] },
'data-rename':    { inputs: [{ name: 'data', dataType: 'dataset' }], outputs: [{ name: 'out', dataType: 'dataset' }] },
```

`group_by()` is not a separate node. It is consumed by the recognizer's pipe walker:
| Pattern | Node type | groupBy | Output shape |
|---|---|---|---|
| `mutate(y = log(x))` | `data-mutate` | `[]` | same rows |
| `group_by(s) %>% mutate(y = x - mean(x))` | `data-mutate` | `["s"]` | same rows, group-scoped aggregates |
| `group_by(s) %>% summarise(y = mean(x))` | `data-summarise` | `["s"]` | one row per group |
Grouped mutate: executor partitions rows by group, evaluates expressions within each partition (aggregate functions resolve to group-scoped values), then reassembles all rows in original order.
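A sketch of that partition/evaluate/reassemble strategy for the `group_by(s) %>% mutate(y = x - mean(x))` case, using plain arrays in place of the real column types:

```typescript
function groupedDemean(key: string[], x: Float64Array): Float64Array {
  // 1. partition row indices by group key
  const groups = new Map<string, number[]>();
  key.forEach((k, i) => {
    const rows = groups.get(k) ?? [];
    rows.push(i);
    groups.set(k, rows);
  });

  // 2. evaluate within each partition; aggregate functions (mean here)
  //    resolve to group-scoped values
  const out = new Float64Array(x.length);
  for (const rows of groups.values()) {
    const mean = rows.reduce((s, i) => s + x[i], 0) / rows.length;
    for (const i of rows) out[i] = x[i] - mean; // write back to original position
  }

  // 3. writing by original index means no separate reassembly pass is needed
  return out;
}

const demeaned = groupedDemean(
  ['NJ', 'PA', 'NJ', 'PA'],
  new Float64Array([4, 10, 6, 20]),
);
```

Keeping row indices (rather than copying group slices) preserves the original row order for free when writing results back.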
Two paths:

- Inside `mutate()`: handled by the expression evaluator's type functions — `mutate(year_f = as.factor(year))` evaluates normally
- Standalone `df$col <- as.factor(df$col)`: the recognizer detects the `$<-` assignment pattern and emits a `data-mutate` node with a single expression
`read_csv()` is already recognized as `data-load`. The executor already
loads CSV via PapaParse. For M4, verify it works end-to-end with piped
chains: read_csv("data.csv") %>% filter(...) should
produce a data-load node followed by a
data-filter node with correct edge wiring.
Each verb gets a recognizer match in the pipe walker. The recognizer
extracts arguments from the R AST and emits AnalysisCall
entries.
| Verb | Function match | Extracted params | AnalysisCall kind |
|---|---|---|---|
| `filter(cond1, cond2)` | `filter` | conditions joined with `&&` → single expression string | `data-filter` (reuse existing) |
| `mutate(name = expr, ...)` | `mutate` | named assignments → `{name, expr}[]` | `data-mutate` |
| `select(col1, col2)` / `select(-col)` | `select` | positive/negative column refs → `columns[]` / `drop[]` | `data-select` |
| `group_by(col1, col2)` | N/A — consumed by next verb | group columns stashed in pipe state | — |
| `summarise(name = agg_expr, ...)` | `summarise`/`summarize` | named assignments → `{name, expr}[]` + inherited `groupBy` | `data-summarise` |
| `arrange(col, desc(col2))` | `arrange` | column refs, detect `desc()` wrapper → `{name, desc}[]` | `data-arrange` |
| `rename(new = old)` | `rename` | named assignments → `{from, to}[]` | `data-rename` |
| `as.factor(col)` / `as.numeric(col)` | standalone `$<-` assignment | column + function → single-expression `data-mutate` | `data-mutate` |
The recognizer uses source spans to extract expression text from AST nodes. For `mutate(log_wage = log(wage))`:

1. Find the `AssignmentNode` inside the function call args (R `=` in call context)
2. Extract `name: "log_wage"` from the LHS identifier
3. Extract `expr: "log(wage)"` from the RHS source span (slice the original source text)
This raw text is what the expression evaluator receives. The R parser AST gives us the structure to find where expressions are; the expression evaluator’s own parser handles the expression content.
Both produce data-filter nodes. The recognizer already
handles subset(df, condition). Adding
filter(condition) is another function name match.
Difference: filter() takes multiple conditions as separate
arguments (implicit &&), while
subset() takes one. The recognizer joins multiple filter
args with &&.
Files: src/core/data/expression.ts
(new), src/core/pipeline/executor.ts (extend)
`data-mutate`:
1. Receive input Dataset from 'data' port
2. Build column map: Map<string, Column> from dataset.columns
3. For each expression in params.expressions:
a. parseExpression(expr.expr)
b. If params.groupBy is set:
- Partition rows by group key columns
- For each group: evaluateColumns(ast, groupSlice, groupRowCount)
- Reassemble results in original row order
c. Else: evaluateColumns(ast, allColumns, rowCount)
d. Add result column (name = expr.name) to column map
- Important: later expressions can reference earlier ones in the same mutate()
4. Return new Dataset with original columns + new/replaced columns
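A sketch of the sequencing in step 3: each result column joins the working column map before the next expression runs, so later expressions can reference earlier ones. `evalExpr` here is a hypothetical stand-in for `parseExpression` + `evaluateColumns`:

```typescript
type Cols = Map<string, Float64Array>;

function runMutate(
  columns: Cols,
  expressions: { name: string; evalExpr: (c: Cols) => Float64Array }[],
): Cols {
  const out = new Map(columns); // copy: don't mutate the input dataset
  for (const e of expressions) {
    // each result is visible to subsequent expressions via `out`
    out.set(e.name, e.evalExpr(out));
  }
  return out;
}

const result = runMutate(new Map([['x', new Float64Array([1, 2])]]), [
  { name: 'y', evalExpr: (c) => c.get('x')!.map((v) => v * 2) },
  { name: 'z', evalExpr: (c) => c.get('y')!.map((v) => v + 1) }, // uses y
]);
```

Evaluating against the growing map (not a frozen snapshot) is what makes `mutate(y = x * 2, z = y + 1)` work in a single node.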
`data-select`:
1. Receive input Dataset
2. If params.columns is non-empty: keep only those columns
3. If params.drop is non-empty: remove those columns
4. Return new Dataset with filtered column set
`data-summarise`:
1. Receive input Dataset
2. Partition rows by params.groupBy columns
3. For each group:
a. Build group column slice (typed array views or copies)
b. For each aggregation in params.aggregations:
evaluateAggregate(ast, groupColumns, groupRowCount)
c. Produces one value per aggregation per group
4. Assemble output Dataset: groupBy columns (one row per group) + aggregation columns
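The steps above can be sketched for a single mean aggregation, with simplified types (the real executor evaluates arbitrary aggregate expressions per group slice):

```typescript
function summariseMean(
  key: string[],
  x: number[],
): { group: string[]; mean: number[] } {
  // accumulate sum and count per group in one pass
  const acc = new Map<string, { sum: number; n: number }>();
  key.forEach((k, i) => {
    const a = acc.get(k) ?? { sum: 0, n: 0 };
    a.sum += x[i];
    a.n += 1;
    acc.set(k, a);
  });

  // one output row per group: the groupBy column plus the aggregate column
  const group: string[] = [];
  const mean: number[] = [];
  for (const [k, a] of acc) { // Map preserves first-seen group order
    group.push(k);
    mean.push(a.sum / a.n);
  }
  return { group, mean };
}

const out = summariseMean(['NJ', 'PA', 'NJ'], [4, 10, 6]);
```

The output dataset has a different shape from the input (one row per group), which is why summarise gets its own node type rather than being a mutate variant.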
`data-arrange`:
1. Receive input Dataset
2. Build sort key array from params.columns
3. Compute row index permutation via multi-key sort (stable)
4. Reorder all columns by permutation
5. Return new Dataset
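Steps 3 and 4 can be sketched with a stable comparator over row indices (`Array.prototype.sort` is guaranteed stable since ES2019). `SortKey` here is illustrative, not the real params shape, and numeric keys only for brevity:

```typescript
interface SortKey { values: number[]; desc: boolean }

// compute a row-index permutation via multi-key stable sort
function sortPermutation(n: number, keys: SortKey[]): number[] {
  const perm = Array.from({ length: n }, (_, i) => i);
  perm.sort((a, b) => {
    for (const k of keys) {
      const d = k.values[a] - k.values[b];
      if (d !== 0) return k.desc ? -d : d; // desc() flips the comparison
    }
    return 0; // tie: stable sort keeps original relative order
  });
  return perm;
}

// reorder one column by the permutation (applied to every column)
function reorder(col: number[], perm: number[]): number[] {
  return perm.map((i) => col[i]);
}

const perm = sortPermutation(4, [
  { values: [2, 1, 2, 1], desc: false }, // primary key, ascending
  { values: [9, 8, 7, 6], desc: true },  // secondary key, descending
]);
```

Sorting indices once and reordering every column by the same permutation avoids materializing row objects.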
`data-rename`:
1. Receive input Dataset
2. For each mapping in params.mapping:
- Find column by from name, change its name to to
3. Return new Dataset with renamed columns (data unchanged)
Replace parseCondition() +
evaluateCondition() with the expression evaluator:
1. Receive input Dataset
2. parseExpression(params.condition)
3. evaluateColumns(ast, columns, rowCount) → boolean mask (Uint8Array)
4. Build keepRows index array from mask
5. Slice all columns by keepRows
6. Return new Dataset
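Steps 3 through 5 can be sketched directly, assuming a `Uint8Array` mask and `Float64Array` columns:

```typescript
// step 4: turn the boolean mask into a keepRows index array
function maskToKeepRows(mask: Uint8Array): number[] {
  const keep: number[] = [];
  for (let i = 0; i < mask.length; i++) if (mask[i]) keep.push(i);
  return keep;
}

// step 5: slice one column by keepRows (applied to every column)
function sliceColumn(col: Float64Array, keepRows: number[]): Float64Array {
  const out = new Float64Array(keepRows.length);
  keepRows.forEach((src, dst) => { out[dst] = col[src]; });
  return out;
}

const keep = maskToKeepRows(new Uint8Array([1, 0, 1, 1]));
const sliced = sliceColumn(new Float64Array([5.25, 5.15, 5.5, 6]), keep);
```

Computing `keepRows` once and reusing it across all columns keeps the filter a single O(rows × cols) pass.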
The existing subset() in transform.ts is
preserved as a convenience wrapper but internally delegates to the
expression evaluator.
All five new executors register in
registerRealExecutors(). Each receives
ExecutionContext with runtime dataset as today.
- `src/core/data/` — pure functions, no React
- `src/ui/` — thin React shell, replaceable in a future UI rework

Click any data-producing node (data-load, data-filter, data-mutate, etc.) → a fixed-height bottom panel (~250px) appears below the DAG canvas showing the dataset. Click a non-data node or the close button → the panel disappears.
```
┌─────────────────────────────────────────────────────────────┐
│ [filter] state == "NJ"        253 rows × 8 cols        [×]  │ ← header bar
├────┬──────────┬──────────┬──────────┬──────────┬───────────┤
│    │ state ↕  │ wage ↕   │ emp ↕    │ chain ↕  │ ...       │ ← column headers (sortable)
│    │ str      │ num      │ num      │ fct      │           │ ← type pills
├────┼──────────┼──────────┼──────────┼──────────┼───────────┤
│ 1  │ NJ       │ 5.25     │ 40       │ bk       │           │ ← data rows
│ 2  │ NJ       │ 5.15     │ —        │ kfc      │           │ ← em-dash for NA
│ 3  │ NJ       │ 5.50     │ 35       │ wendys   │           │
│ .. │ ...      │ ...      │ ...      │ ...      │           │
└────┴──────────┴──────────┴──────────┴──────────┴───────────┘
```
state == "NJ")253 rows × 8 cols(filtered from 410)position: sticky; left: 0 so row index stays visible during
horizontal scrollposition: sticky; top: 0num/str/fct pills under each
column header, styled as muted badges—)
in muted italic, not the text NA. More compact, less
noisy.Column[] (aligns with typed columnar storage), not
row-major. Rendering maps row index → column value via typed array
access.New state in the pipeline store:
```typescript
interface PipelineState {
  // ... existing fields
  inspectedNodeId: string | null;   // node whose data is shown in the viewer
  inspectedDataset: Dataset | null; // the dataset to display
}
```

`inspectedNodeId` is independent of `selectedNodeId` — selecting a node for the property sheet vs inspecting its data are separate actions. Double-click or a dedicated "inspect" button on data-producing nodes sets `inspectedNodeId`.
When inspectedNodeId changes, the store sends a
REQUEST_DATASET message to the worker (protocol already
exists) and populates inspectedDataset from the
DATASET_RESPONSE.
The mapper (src/core/pipeline/mapper.ts) needs cases for
the six new AnalysisCall kinds. All follow the existing
data-filter pattern:
- `createNode()` — construct the typed node from `call.args`
- `addDataEdges()` — resolve `call.args['data']` against the scope map → create edge to the `data` input port
- `scope.set(call.assignedTo, { nodeId, port: 'out' })` — register the output for downstream resolution

The mapper already handles flat `AnalysisCall[]` with variable-scope edge resolution. Synthetic pipe variable names (`__pipe_0_1_filter`) slot cleanly into the existing scope map. No new mapper concepts needed.
New node types need visual treatment in the React Flow canvas.
Data pipeline nodes get a distinct visual style from model/stats nodes:

- Color family — muted blue-green palette (vs warm palette for model nodes)
- Per-verb accent — filter (blue), mutate (green), select (slate), summarise (purple), arrange (amber), rename (teal)
- Compact labels — show verb + key info: `filter: state == "NJ"`, `mutate: +log_wage, +treated`, `select: 3 cols`, `summarise: by state`
Pipe chains are common in real code (3–8 nodes in sequence). They should read as a connected pipeline, not scattered nodes. dagre layout already handles this via topological ordering — sequential chains render as vertical sequences. No special grouping needed beyond correct edge wiring.
Ordered by dependency — each step builds on the previous.

1. Expression evaluator (`src/core/data/expression.ts`)
2. New node types in the `PipelineNode` union
3. `NODE_PORTS` entries
4. Test coverage in `dag.test.ts`, `executor.test.ts`
5. Recognizer: walk `PipeChainNode`, emit flat `AnalysisCall[]` with synthetic variables
6. `group_by` state threading (`pendingGroupBy`)
7. Mapper cases for the new `AnalysisCall` kinds
8. Upgrade the `data-filter` executor to use the expression evaluator
9. New executors registered in `registerRealExecutors()`
10. `inspectedNodeId` / `inspectedDataset` store state
11. `REQUEST_DATASET`/`DATASET_RESPONSE` worker protocol wiring for the viewer

Validation target: JEL-DiD paper — heavy dplyr pipelines with `read_csv`, `filter`, `mutate`, `group_by` + `summarise`, feeding into M3's estimators. After M4, we should parse and execute its full data pipeline end-to-end.
| Component | Status | M4 usage |
|---|---|---|
| `PipeChainNode` in R parser | Parsing complete | Recognizer walks this |
| `Float64Array` / `CategoricalColumn` | Complete (2026-04-05) | Expression evaluator operates on typed arrays directly |
| `REQUEST_DATASET` / `DATASET_RESPONSE` protocol | Complete (2026-04-05) | Data table viewer retrieves dataset from worker |
| `subset()` in transform.ts | Complete (basic) | Upgraded to use expression evaluator |
| `data-filter` node + executor | Complete | Reused for `filter()` verb, executor upgraded |
| `data-load` node + executor | Complete | Reused for `read_csv()` in pipe heads |
| Mapper scope resolution | Complete | Synthetic pipe variables use same mechanism |
| `registerRealExecutors()` | Complete | 5 new executors added here |