Rungs Internal

Expert's Guide

This page is restricted. Enter the access code to continue.

Private Reference — Study and Internalize

Rungs: The Expert's Guide

Everything you need to speak as an authority on causal reasoning, the Rungs engine, and its applications — in any room, with any audience.

01
Why Correlation Isn't Causation — And Why It Matters for Sales
The foundation. Get this right and everything else clicks.

The Sprinkler Problem

Picture a lawn. The grass is wet. Did rain cause it, or did the sprinkler? You observe wet grass — but observation alone can't tell you the direction of causation. Both rain and sprinklers cause wet grass. If you see wet grass and reason "it rained," you could be right 70% of the time — but that 30% error can matter enormously depending on what you do next.

This is the core problem with every statistical model that exists today: they observe patterns. They cannot distinguish cause from effect. A model trained on historical data learns "when grass is wet, it probably rained" — and that association is real. But association is not causation, and conflating the two leads to decisions that fail when the world changes.

Why Every ML Model Is Stuck at Correlation

Machine learning models — from linear regression to GPT-4 to transformers — are all trained to minimize prediction error on historical data. They learn associations. They get very good at asking: "What is associated with X?" But they cannot ask: "What will happen to Y if I force X to change?" That requires a fundamentally different operation: intervention.

A model that learned "ice cream sales are correlated with drowning rates" will, if deployed naively, predict that banning ice cream reduces drowning. Both are driven by a third variable (summer heat) — a confounder. The model can't see confounders unless you explicitly encode them. And in real business data, confounders are everywhere.

The Three Questions — Observational, Interventional, Counterfactual

Question 1 — Observational
What IS?

"Customers who bought product A also bought product B." This is what your BI tools do today. It's useful for description. It cannot tell you what to do.

Question 2 — Interventional
What HAPPENS IF I change something?

"If I increase the price of product A by 10%, what happens to sales?" This requires knowing the causal mechanism — not just the correlation. A price-sales correlation can be confounded by seasonality, promotional activity, competitor pricing. The causal question cuts through the confounders.

Question 3 — Counterfactual
What WOULD HAVE happened if I'd decided differently?

"We raised prices last quarter. Sales dropped. Would sales have dropped anyway due to the recession?" This is the hardest question. It requires reasoning about a world that didn't happen. It's the question at the heart of litigation, insurance claims, and every post-mortem analysis.

The Business Translation

Say This to a Business Buyer

"Your current BI tools tell you what happened. Rungs tells you why it happened, what will happen if you change something, and what would have happened if you'd decided differently. That's the difference between a rearview mirror and a flight simulator."

How to Answer: "Isn't this just regression?"

Your answer: "Regression tells you correlation. Rungs tells you causation. They look similar — both estimate relationships between variables — but they give different answers, and in high-stakes decisions, dangerously different answers. A regression on customer data might tell you that customers who call support more are more loyal. The causal model tells you that's backwards — loyal customers call more because they're engaged, but forcing support calls doesn't create loyalty. Acting on the regression destroys value. Acting on the causal model creates it."

If they push back: "The mathematical difference is this: regression conditions on observations. Do-calculus computes the effect of interventions. Pearl's 2009 paper and fifty subsequent experimental confirmations show these give different — and the causal one is the right — answer in systems with confounding."
02
Pearl's Three Rungs — The Ladder of Causation
This is the taxonomy that underlies everything. Know it cold.
Rung 1 — Association (Seeing)
P(Y | X)
P(Y | X) — Conditional probability

"What is the probability of Y, given that I observe X?" This is the domain of statistics and all ML. Every neural network, every regression, every Bayesian classifier lives here. It can only learn from passive observation — from data that already exists.

Limitation: Cannot distinguish correlation from causation. Cannot answer questions about interventions. Will give wrong answers when confounders are present.

Business translation: "Your revenue went up when you ran ads. But did the ads cause it, or were customers already planning to buy?"
Rung 2 — Intervention (Doing)
P(Y | do(X))
P(Y | do(X=x)) — Causal effect

"What is the probability of Y if I force X to take value x?" The do() operator represents actual intervention — not just observing X=x, but setting it. This requires a causal model. It cannot be computed from data alone. You need to know the structure of how variables influence each other.

The key insight: P(Y | do(X=x)) ≠ P(Y | X=x) when there is confounding. The difference between these two quantities is the confounding bias that all observational studies struggle with.

Business translation: "If we actually raise the price — not just observe cases where the price happened to be higher — what will happen to sales?"
Rung 3 — Counterfactual (Imagining)
P(Y_x | X=x', Y=y')
P(Y_x=y | X=x', Y=y') — Potential outcome

"What would Y have been if X had been x, given that we actually observed X=x' and Y=y'?" This is reasoning about a world that didn't happen. It requires knowing the noise terms — the unmeasured individual-level variation. This is the rung of attribution, legal liability, and policy evaluation.

Why it's hardest: You need a fully specified SCM, including the noise distribution, to compute counterfactuals exactly. In linear models, this is solvable in closed form (abduction: infer the noise, then compute). In nonlinear models, it requires sampling.

Business translation: "We changed the product last quarter and revenue dropped. Would it have dropped anyway due to market conditions? How much of the drop was caused by our decision?"

Simpson's Paradox — Why the Rungs Can't Be Conflated

The Berkeley admissions case is the canonical example. Aggregated data showed women were admitted at a lower rate than men — apparent discrimination. But when you looked department by department, women were admitted at a higher rate in every individual department. The paradox: overall discrimination, but per-department advantage?

The answer is confounding. Women applied to more competitive departments at higher rates. Department selectivity is a confounder. The correct causal analysis asks: P(admit | do(gender=F)) — controlling for the confounder. When you do this properly, the apparent discrimination vanishes. The Rung 1 analysis (what's associated with admission?) gave the wrong policy conclusion. The Rung 2 analysis gave the right one.

How to Answer: "What's the difference between Rungs and a Bayesian network?"

Your answer: "A Bayesian network is a Rung 1 tool. It represents a joint probability distribution over variables and lets you do conditional inference: 'given that I observe X, what's the probability of Y?' That's association. Rungs operates at all three rungs. It can answer Rung 1 questions like Bayesian networks, but it can also answer Rung 2 interventional questions — 'what happens if I set X?' — and Rung 3 counterfactual questions — 'what would have happened?' The math is fundamentally different for each rung, and the algorithms are different. Bayesian networks weren't designed for Rungs 2 and 3."

If they push back: "Pearl himself draws this distinction in 'The Book of Why' and 'Causality.' The graphical structure is the same — both use DAGs — but the semantics are different. A Bayesian network edge means statistical dependence. A causal DAG edge means causal influence. These overlap but are not equivalent."
03
Structural Causal Models — The Math Under the Hood (Plain English)
What you're actually computing when Rungs runs a query.

What an SCM Is

A Structural Causal Model (SCM) has three components: a set of variables, a set of structural equations, and a noise distribution. Each variable X gets an equation: X := f(parents(X), noise_X). The function f describes how X is determined by its direct causes plus some irreducible randomness (noise_X). The noise terms are independent across variables — they represent everything not captured by the model.

This is different from regression. Regression says "Y and X are correlated and here's a coefficient." An SCM says "X causes Y through this specific mechanism, and here's the residual noise." The structural equation is a claim about the world, not just about the data.

DAGs — Directed Acyclic Graphs

The visual representation of an SCM is a DAG. Each node is a variable. Each directed edge (arrow) represents direct causal influence. The direction matters: an arrow from A to B means A is a direct cause of B, not the reverse. Acyclic means there are no feedback loops — causation flows forward, not in circles. (Time-series models handle feedback by unrolling the loop across time.)

The analogy that works with non-technical audiences: "Think of it as a plumbing diagram for your data. Each pipe shows which variable feeds into which. When you want to know what happens if you change a valve upstream, you trace the pipes downstream. That's exactly what Rungs does — it traces the causal pipes."

d-Separation — Reading Independence from a Graph

d-separation is the fundamental tool for determining whether two variables are independent given a set of conditions, just by reading the graph structure — without looking at any data. If X and Y are d-separated by a set Z, then conditioning on Z makes X and Y independent in any distribution that's consistent with the graph.

This is powerful because it lets you determine what you need to measure, what you can ignore, and whether an identification strategy is valid — all from the causal graph before you've seen a single data point.

The Three Graph Patterns

Chain
X → Z → Y

Z mediates the effect of X on Y. Conditioning on Z blocks the path from X to Y. This is how you analyze mediation: how much of X's effect on Y goes through Z vs. directly?

Fork (Common Cause)
X ← Z → Y

Z is a confounder — a common cause of X and Y. X and Y are correlated because of Z, not because of causation. Conditioning on Z (blocking the backdoor) removes the confounding and isolates the causal effect.

Collider (Common Effect)
X → Z ← Y

Z is caused by both X and Y. X and Y are independent — until you condition on Z. This is the paradox: conditioning on a collider opens a path and induces spurious correlation. This is how selection bias works.

V-Structures and Causal Discovery

An unshielded collider (X → Z ← Y, where X and Y have no direct connection) is the only graph pattern that can be reliably detected from data alone. It's the signature of causation in observational data. Causal discovery algorithms like PC and FCI use v-structures as anchors to orient edges and distinguish cause from effect — something correlation analysis cannot do.

How to Explain This to a Non-Technical Executive

Your answer: "Think of it like a plumbing diagram for your business. Each pipe shows which variable feeds which. There are three kinds of pipe configurations: a chain (A flows through B to reach C), a fork (one pipe splits to feed two separate things — that's where confounding comes from), and a collider (two pipes flow into the same tank — and looking at the tank level actually tells you something about both inputs). Rungs reads this plumbing diagram and figures out exactly where to look and what to measure to answer your causal question."

If they want more: "The math that makes this work is called d-separation — a graph algorithm that determines independence relationships just from structure. It was Pearl's key insight in the late 1980s and is now the foundation of modern causal inference."
04
Do-Calculus — The Core Algorithm
The mathematical heart of causal inference. The algorithm Rungs is built on.

What "do" Means: Graph Surgery

The do() operator represents an intervention. When you write do(X=x), you are performing a mental surgery on the causal graph: remove all incoming edges to X (X is no longer influenced by its normal causes — you've overridden them), set X=x, then propagate the effect forward through the remaining graph.

This is the mathematical formalization of "what happens if we actually change this?" It's distinct from observation because observation leaves all causal mechanisms intact. Intervention cuts the causal history of X and forces a new value.

Example: Price is normally set by supply, demand, and competitor pricing. To compute P(sales | do(price = $10)), you cut those incoming edges, fix price at $10, and ask what happens to sales. You're not conditioning on historical cases where price happened to be $10 — you're computing the effect of forcing it to be $10, regardless of why it got there.

The Three Do-Calculus Rules (Plain English)

  • Rule 1 — Ignoring observations: If a variable W is irrelevant to Y given the current set of conditions (i.e., W is d-separated from Y after the appropriate graph surgery), you can safely drop it from your conditioning set. This is the rule for simplifying complex expressions.
  • Rule 2 — Action/observation exchange: Under specific graph conditions, a do() intervention on X can be replaced by simple conditioning on X — just treating X as an observed variable rather than an intervened one. The backdoor adjustment is a special case of this rule. This is what makes observational causal inference possible: you don't need to run an experiment if the conditions for Rule 2 are met.
  • Rule 3 — Ignoring actions: If an action (do()) on a variable Z doesn't affect Y given the current conditioning set, you can remove the do(Z) from the expression. This simplifies multi-intervention queries.

Pearl proved in 1995 that these three rules are complete — any causal effect that can be identified from observational data can be identified by applying these three rules. If you can't identify an effect using do-calculus, no amount of data will help; you need a randomized trial.

Identification: When Can You Compute Causal Effects from Observational Data?

A causal effect is identifiable if it can be expressed as a function of observed data alone — no intervention required. Not all causal effects are identifiable. When latent confounders exist between X and Y, and you can't block the backdoor path, the effect may not be identifiable without additional assumptions.

The Backdoor Criterion

A set of variables Z satisfies the backdoor criterion for estimating the effect of X on Y if: (1) Z blocks all backdoor paths from X to Y (paths that go through parents of X), and (2) Z contains no descendants of X.

Example: You want to know if education (X) causes income (Y). Ability (Z) is a confounder — it causes both education and income. Condition on ability and you block the backdoor path. The adjusted estimate P(Y | do(X)) = Σ_z P(Y | X, Z=z) P(Z=z) — this is the backdoor adjustment formula. It's a weighted average of the conditional probabilities, where the weight is the marginal distribution of the confounder.

The Front-Door Criterion

When you can't block all backdoor paths (e.g., the confounder is unmeasured), the front-door criterion offers an alternative. If there's a mediator M that (1) is caused by X, (2) fully mediates the effect of X on Y, and (3) has no unblocked backdoor paths to Y, you can use M to identify the causal effect even without conditioning on the confounder.

The classic example: smoking (X) → tar in lungs (M) → cancer (Y). You can't measure all confounders between smoking and cancer. But if tar fully mediates the smoking-cancer effect, you can identify the effect via the front-door adjustment.

How to Answer: "Do you need a randomized trial to use Rungs?"

Your answer: "No. Rungs uses the causal graph to identify effects from observational data — the same data you already have. An RCT is just one way to answer a causal question; it answers it by design. Rungs answers it from observational data using do-calculus, as long as the causal structure is known or can be discovered. The backdoor and front-door criteria are mathematical conditions that tell you when observational data is sufficient. If those conditions are met, the answer from Rungs is as valid as an RCT. If they're not met, Rungs tells you that too — and tells you what additional data you'd need."

If they push back: "Pearl's work formalized exactly this: the conditions under which observational data answers causal questions. The conditions are testable from the causal graph. We don't assume they're always met — Rungs checks and reports identification status for every query."
05
The Rungs Engine — What's Actually Built
The specific algorithms and components inside the engine. Be precise.
Variable Elimination

Handles large graphs without exponential blowup by eliminating variables in an optimal order, reusing computations. Core inference algorithm for exact probabilistic queries.

PC Algorithm

Causal discovery from data using conditional independence tests. Starts with a fully connected graph, removes edges where independence is detected, then orients v-structures. Recovers the true DAG (up to Markov equivalence) from observational data.

FCI Algorithm

PC's extension for the realistic case where latent confounders may exist. Produces PAGs (partial ancestral graphs) instead of DAGs. Bidirected edges (↔) represent unmeasured common causes. The honest answer when you can't fully identify the structure.

Linear Gaussian SCM

Exact do-calculus for continuous variables. Supports full mediation analysis: natural direct effect (NDE) and natural indirect effect (NIE). Manski bounds for partial identification. Rosenbaum sensitivity analysis for unmeasured confounding.

Granger Causality

Detects causal direction in time-series data using F-tests. "Does knowing X's history improve prediction of Y beyond Y's own history?" Handles lagged causal effects in sequential data.

Sheaf Backbone

Neural architecture for adversarial causal chains up to 100 nodes. Achieved 100% generalization (up from 74.4% with standard GNNs). 5.19M parameters. Trained checkpoint: 8.5MB. Handles long-range causal dependencies that standard message passing can't.

NL→DAG Parser

Plain English → causal graph. Three backends: Claude (high accuracy), Ollama (local/private), regex (deterministic fallback). Bridges natural language descriptions to the engine's tensor representation.

IPW / AIPW

Inverse probability weighting and augmented IPW (doubly-robust estimator) for data-driven causal effect estimation. AIPW is consistent if either the propensity model or the outcome model is correct — robust to model misspecification.

IV / 2SLS

Instrumental variable estimation and two-stage least squares. For when you have an instrument (a variable that affects X but has no direct effect on Y except through X) — allows identification even with unmeasured confounding.

Bootstrap + Sensitivity

95% confidence intervals on every estimate via bootstrap resampling. Rosenbaum bounds (how strong would unmeasured confounding need to be to overturn the result?). E-values (Pearl and VanderWeele). Every output includes uncertainty quantification.

How to Answer: "Is this just a wrapper around an LLM?"

Your answer: "No — and this is a critical distinction. The Rungs engine is pure mathematics — graph algorithms, do-calculus, conditional independence tests, bootstrap resampling. No LLM, no GPU in the inference path, no sampling from a probability distribution over tokens. It runs in 1.3 milliseconds per query on CPU. LLMs are used in Mode 2 as the language layer — they take the user's natural language query, extract the causal structure, pass it to Rungs as a structured API call, and wrap the structured JSON response back into natural language. The computation is entirely deterministic. The LLM handles the language; Rungs handles the math."

If they push back: "Ask GPT-4 to compute P(sales | do(price=10)) given a specific causal graph. It will produce a plausible-sounding answer. It won't run the algorithm. Rungs runs the algorithm. The difference shows up on benchmarks — 98.6% vs. 69% on CLadder."
06
Benchmarks — The Numbers and What They Mean
Know these cold. Be able to cite them without hesitation.
98.6%
Rungs on CLadder
~69%
GPT-4 on CLadder
+29pts
Rungs lead
1.3ms
Per query (CPU)

CLadder Benchmark

CLadder is a benchmark of 10,059 causal reasoning questions across all three rungs, designed by Pearl's research group at UCLA. Questions span associational, interventional, and counterfactual reasoning in diverse domains and graph structures. It is the gold-standard benchmark for causal reasoning in AI systems.

System Score Correct / Total Notes
Rungs 98.6% 9,922 / 10,059 At theoretical ceiling. 7 verified dataset bugs, remainder are rounding artifacts.
GPT-4 ~69% ~6,941 / 10,059 Best published LLM baseline. Fails systematically on Rung 2 and Rung 3.
GPT-3.5 ~56% ~5,633 / 10,059 Near random on counterfactual questions.
Random baseline 50% 5,030 / 10,059 Binary questions. Coin flip.

Why 98.6% and Not 100%

The 137 misses break down as follows: 7 are verified bugs in the CLadder dataset itself (confirmed by independent review). The remainder are rounding artifacts — questions where the expected answer is computed to more decimal places than the benchmark's grading tolerates. The engine is at the theoretical ceiling for this benchmark.

Other Benchmarks

  • LESA Benchmark: 8 batteries testing core causal reasoning capabilities — identification, estimation, discovery, counterfactuals, mediation, sensitivity, time-series, and graph surgery. All 8 passing.
  • BBH Coverage: 27/27 tasks from Big-Bench Hard. 6,508/6,511 instances (100.0%). 55 solver files covering all task types.
  • Rung 3 Training: 99.44% classification accuracy, R²=0.988, MSE=0.056 on combined linear and nonlinear counterfactual data (42.4K examples, 22 domains). 1.58M parameters.

Performance

1.3ms per query on CPU. No GPU required in the inference path. $0 per query — no API calls to any external service during engine computation. This matters for enterprise deployment: no latency spikes from third-party APIs, no data leaving the customer's environment, no per-query cost at scale.

How to Answer: "How do you know it's better than GPT-4?"

Your answer: "We ran the CLadder benchmark — 10,059 causal reasoning questions designed by Pearl's research group at UCLA. It's the standard for evaluating causal reasoning in AI. GPT-4 scores around 69%. Rungs scores 98.6%. The 29-point gap isn't statistical noise — it's the difference between correlation-based reasoning and actual causal computation. GPT-4 fails systematically on interventional and counterfactual questions because it doesn't have an algorithm for those rungs. It produces plausible-sounding answers. Rungs computes the correct one."

If they push back: "The benchmark is published. The methodology is reproducible. We can run it on any system you'd like to compare against. The numbers stand on their own."
07
Mode 2 Architecture — How Rungs Deploys
The deployment model that makes Rungs work at scale. Be clear on what each layer does.

The Four-Step Pipeline

Mode 2 Pipeline

Step 1: User asks in plain English — "Did our pricing change cause the churn increase last quarter, or was it something else?"

Step 2: LLM extracts causal structure — identifies variables (price, churn, market conditions, seasonality), relationships, and maps to a DAG via the NL→DAG parser.

Step 3: Rungs computes — applies do-calculus, runs backdoor adjustment or front-door criterion as appropriate, computes P(churn | do(price_change)), returns structured JSON with point estimate + confidence interval + sensitivity analysis.

Step 4: LLM formats the answer — wraps the structured result in clear business language, explains the conclusion, notes caveats, and generates a follow-up question if needed.

Why This Division of Labor Is Correct

LLMs are extraordinarily good at language tasks: parsing intent, understanding context, generating coherent text, asking clarifying questions. They are demonstrably bad at causal computation: they confuse correlation and causation, they don't run algorithms, they hallucinate plausible-sounding but mathematically incorrect answers to quantitative questions.

Rungs is extraordinarily good at causal computation and produces no language output at all — just structured JSON. It doesn't understand intent. It doesn't know what a "pricing change" means in a business context.

Mode 2 gives each layer the job it's suited for. The LLM handles language in both directions. Rungs handles the computation in the middle. Neither layer is asked to do what it's bad at.

The Calculator Analogy

Use This With Technical Buyers

"Think of Rungs as a calculator that the LLM uses. When your CFO asks a financial question, she uses Excel — she doesn't try to do compound interest in her head. Mode 2 is the same principle: LLMs are good at language, Rungs is good at causal math. The LLM calls Rungs the way a spreadsheet calls a math library. Use the right tool for each job."

The MCP Server

Rungs exposes 8 tools via the Model Context Protocol (MCP). Any LLM that supports tool calls — Claude, GPT-4, Gemini — can use Rungs as a backend without any custom integration. The tools cover: do-calculus queries, counterfactual estimation, causal discovery, mediation analysis, sensitivity analysis, graph surgery, backdoor/front-door identification, and natural language DAG parsing (parse_causal_text).

The Alternative — Pure LLM

Without Rungs in the loop, an LLM asked "did our price change cause the churn?" does the following: it pattern-matches to training data about pricing and churn, generates a plausible-sounding narrative, and produces a number that feels right. It cannot distinguish P(churn | do(price)) from P(churn | price). It does not run a conditional independence test. It does not check whether the backdoor criterion is satisfied. It will confidently give wrong answers in the presence of confounding.

How to Answer: "Why not just fine-tune an LLM on causal data?"

Your answer: "Because causal reasoning requires computation, not pattern matching. A fine-tuned LLM gets better at producing causal-sounding text. It doesn't get better at actually computing P(Y | do(X)). That computation requires running the do-calculus algorithm — which is a specific mathematical procedure, not a statistical pattern. Fine-tuning gives you better language about causal concepts. Rungs gives you correct causal answers. The CLadder benchmark makes this concrete: GPT-4 fine-tuned on causal reasoning texts still scores in the 70s. Rungs scores 98.6% because it runs the algorithm."

If they push back: "The Sparks of AGI paper from Microsoft, Pearl's CLadder paper, and dozens of follow-up benchmarks all confirm: LLMs fail at causal reasoning at the algorithmic level. This isn't a fine-tuning problem. It's an architecture problem. Rungs solves it at the architecture level."
08
Industry Applications — Scripts for Each Vertical
Know your vertical cold before you walk in. Adapt the causal question to their language.
Security / Alvaka
"Did this process cause the breach, or is it a coincidence?"
Rungs traces the causal kill chain through the process tree. It computes P(breach | process_A, do(block_process_A)) using the security event graph. Every alert is scored: is this process causally upstream of the breach, or just correlated because it runs at the same time? No more alert fatigue from correlational detections.
Business value: Reduce false positive alerts by 60–80%. Every alert has a causal confidence score and an audit trail. Compliance teams can show regulators the deterministic causal chain, not a black-box ML score.
"Every EDR today correlates events to find threats. Correlation means false positives. Rungs traces causation through the process tree. If process A didn't cause the lateral movement, it doesn't appear in the kill chain. Zero false positives from correlation."
Finance / Revenue Analytics
"Did our pricing change cause the churn, or did the market move?"
Rungs computes P(churn | do(price_change)) vs. the observed P(churn | price_change). The difference is the confounding bias — how much of the observed correlation was driven by the market conditions that moved simultaneously with the price change. The causal effect is isolated using the backdoor adjustment over market variables.
Business value: Know whether to reverse a pricing decision or hold it. Know whether the churn was caused by your action or by an external force you couldn't control. Counterfactual analysis of pricing experiments post-hoc.
"Your BI dashboard shows churn went up when prices went up. Did the price cause the churn? Or did the recession hit at the same time? Rungs separates those. You get the causal number, not the correlation."
Healthcare / Clinical Decision Support
"Did this drug cause the adverse event, or was it the underlying condition?"
Counterfactual analysis: P(adverse_event | drug=1) − P(adverse_event | drug=0), estimated from observational EHR data using backdoor adjustment over comorbidities, age, and baseline severity. The causal attributable risk is separated from the background rate driven by the condition itself.
Business value: Pharmacovigilance with causal attribution instead of disproportionality statistics. Liability reduction: deterministic audit trail for adverse event attribution. Clinical protocol optimization: "which pathway is causally indicated for this patient presentation?"
"FAERS reports show a signal for drug X and outcome Y. But sick people take drug X. The correlation is confounded by severity. Rungs computes the causal attributable risk — the part of the outcome you can actually attribute to the drug."
Legal / Expert Witness
"Did the defendant's action cause the harm?"
But-for causation is a Rung 3 counterfactual: P(harm | action=0) — what would the outcome have been if the defendant hadn't acted? Rungs computes this from the causal model of the case. The output is a structured report with the counterfactual probability, confidence bounds, and sensitivity analysis showing how robust the conclusion is to unmeasured confounding.
Business value: Expert witness reports that survive Daubert scrutiny because the methodology is published, reproducible, and mathematically grounded. Attorneys pay $50K–$500K per complex causation case. The tool writes the analysis; the expert testifies to the methodology.
"The Daubert standard requires scientific reliability. Rungs produces a mathematically verifiable causal attribution with confidence bounds and sensitivity analysis — reproducible by any qualified expert with the same graph and data."
Manufacturing / Operations
"Did the maintenance skip cause the equipment failure?"
Intervention query: P(failure | do(maintenance=0)) using the equipment process DAG. The causal model encodes the maintenance-failure mechanism, separating it from baseline failure rates and other operational variables. Compares counterfactual failure probability against observed failure rate.
Business value: Maintenance optimization — which specific maintenance actions causally reduce failure rates, and which are correlated with good outcomes because they're performed by better-trained teams (the actual causal variable). Root cause analysis with attribution scores instead of anecdotal post-mortems.
"Your maintenance log shows correlations with failures. But were those maintenance tasks causing failures to be prevented, or were good operators doing both maintenance and other things that mattered more? Rungs tells you which one is the actual cause."
Insurance / Underwriting
"Was it the policy change or the weather event that drove the claims increase?"
Mediation analysis: decompose the total effect into the direct effect (policy change → claims) and the indirect effect (policy change → behavior → claims). How much of the increase goes through each pathway? The weather event's contribution is computed via backdoor adjustment over climate variables — separating systemic risk from policy-specific risk.
Business value: Accurate loss attribution for reinsurance pricing. Subrogation: identify whether a third party's action causally contributed to the loss. Fraud detection: distinguish claims where the causal chain is consistent with the reported event from claims where the story doesn't match the data.
"Claims went up 30% last year. Is that the weather, the policy changes you made, or behavioral changes from the economy? Mediation analysis gives you the exact split — with confidence bounds — so you can price next year's reinsurance correctly."
09
Tough Questions — How to Answer Every Hard Question
Fifteen questions you will get. Know the answer before they ask.
How is this different from what we already have with ChatGPT?
ChatGPT generates plausible text about causal topics. It cannot run causal algorithms. Ask GPT-4 to compute a do-calculus expression on a specific graph and it will produce a number with no mathematical grounding — it pattern-matches to training data. Rungs runs the actual algorithm: graph surgery, backdoor adjustment, counterfactual abduction. The difference shows up when the answer matters — when a confident-sounding wrong answer costs you something. On CLadder, GPT-4 scores 69%. Rungs scores 98.6%. That gap is the gap between language generation and causal computation.
Ask any LLM "does conditioning on a collider open or close the path?" — half the time it gets it wrong. Rungs never gets it wrong because it's running the d-separation algorithm, not guessing.
Can't I just use a Bayesian network?
A Bayesian network answers Rung 1 questions — conditional probability given observations. It cannot answer Rung 2 interventional questions (do-calculus requires a causal graph, not just a probabilistic one) or Rung 3 counterfactual questions (those require noise-level inference). Bayesian networks are valuable and Rungs uses them as a substrate. But the causal layer sits on top. If your question is "what will happen if I change X?", a Bayesian network gives you the wrong answer. You need do-calculus.
Pearl himself defines the distinction in Causality (2009). The graphical structures look the same but the semantics are different. A causal graph makes stronger claims than a Bayesian network, which is why it can answer harder questions.
Do I need to know the causal graph in advance?
No — though having domain knowledge accelerates things. Rungs includes causal discovery (PC and FCI algorithms) that learn the causal structure from your data. You can also use the NL→DAG parser to describe the causal relationships in plain English and Rungs constructs the graph automatically. In most enterprise settings, domain experts know the causal relationships — they just haven't formalized them. Rungs gives them a way to encode that knowledge and query it rigorously.
Causal discovery from pure data is possible but limited — it can recover structure up to Markov equivalence, meaning some edge directions remain ambiguous. Domain knowledge resolves the ambiguity. The combination of discovery algorithms plus domain knowledge is more powerful than either alone.
What if my graph is wrong?
Rungs is transparent about this in two ways. First, sensitivity analysis (Rosenbaum bounds and E-values) quantifies how wrong the graph would need to be to change the conclusion. Second, Rungs reports identification status — if the effect can't be identified from the current graph and data, it says so rather than producing a spurious answer. No model is perfectly specified. The question is whether the conclusion is robust to plausible misspecifications, and Rungs answers that directly.
The alternative — using a black-box ML model — is also wrong, but in ways you can't quantify. At least with Rungs, the assumptions are explicit and testable.
Isn't this just regression with extra steps?
No. Regression estimates the association P(Y | X). Rungs computes the causal effect P(Y | do(X)). These are mathematically different objects and give different numerical answers when confounding is present. The extra steps aren't bureaucracy — they're the steps required to get the right answer. Regression in the presence of a confounder gives the wrong answer. Rungs, using the backdoor adjustment, gives the right answer. In high-stakes decisions, the difference is the difference between a good decision and a costly mistake.
Try it on the Berkeley admissions data. Regression on gender and admission gives the wrong sign. Causal analysis with department as a confounder gives the right answer. The "extra steps" reversed the conclusion.
How does it handle missing data?
Missing data in causal inference is its own sub-problem — the mechanism of missingness can itself be causal. Rungs handles three missingness patterns: MCAR (missing completely at random — standard imputation works), MAR (missing at random — inverse probability weighting on the missingness model), and MNAR (missing not at random — requires modeling the missingness mechanism explicitly). The engine reports which assumption is being made for each variable with missing data.
Most ML tools treat missing data as a nuisance to be imputed without modeling the mechanism. In causal inference, imputing without modeling the mechanism can induce bias. Rungs models the mechanism and adjusts accordingly.
What's the accuracy on our specific data?
That depends on the quality of the causal graph and the data. The 98.6% CLadder number measures reasoning accuracy on well-specified causal questions. For your data, we'd run a validation exercise: split historical data, use the causal model to predict the effect of past interventions you know the outcome of, and compare the predictions to actuals. In most enterprise datasets, the causal estimates are more accurate than correlational benchmarks because they're not biased by confounders.
The right framing isn't "accuracy" in the ML sense — it's "are the assumptions warranted?" Rungs exposes those assumptions explicitly, so you can verify them with domain knowledge rather than just hoping the model learned the right thing.
Is this open source? Can a competitor copy it?
The core IP is proprietary and patent-protected (provisional filed). The benchmark results are published and reproducible, which is how science works — but the specific implementation, the architectural innovations (Sheaf backbone, holographic integration modules, the Mode 2 deployment architecture), and the vertical-specific implementations are not open. The open-source causal inference landscape (DoWhy, CausalML, EconML) covers Rung 1 estimation well. The Rung 2/3 computation, the 98.6% CLadder score, and the Mode 2 architecture are Rungs-specific.
A competitor could implement the standard PC and FCI algorithms — they're published. They could not replicate the Sheaf backbone's performance on adversarial chains, the 100% BBH coverage, or the specific MCP toolchain without significant R&D investment.
What happens when the causal structure changes over time?
Rungs includes Granger causality testing for time-series data and supports dynamic causal models where relationships can shift. For structural changes (a regulation changes the causal mechanism, a market shock rewires consumer behavior), Rungs handles this via graph versioning — you update the causal graph and re-run. The Sheaf backbone's 100% generalization to 100-node chains (2x training length) shows that the architecture handles distribution shift in chain length; similar logic applies to structural shifts in graph topology.
All causal models are static approximations of a dynamic world. The right approach is regular calibration — update the graph as domain knowledge updates. Rungs is designed for this: the graph is an explicit, editable object, not buried inside model weights.
Why is this better than A/B testing?
A/B testing is the gold standard for one specific question: does intervention X cause outcome Y, on average, in the population you can randomize? Rungs is better in three cases. First, when you can't randomize (legal, ethical, or practical constraints). Second, when you want to study subgroup effects or mediation (A/B tells you average treatment effect; Rungs tells you heterogeneous effects and pathways). Third, when you want counterfactuals on individuals, not averages — "what would have happened to this specific customer if we had shown them the other variant?"
A/B testing and Rungs are complementary. Use A/B for high-stakes binary decisions where you can randomize and need ironclad average effect estimates. Use Rungs for everything that A/B can't cover — which is most of the causal questions businesses actually have.
Do I need to retrain it on our data?
The causal inference algorithms (do-calculus, backdoor adjustment, PC algorithm) don't require training in the ML sense — they're mathematical procedures applied to your data and causal graph. The neural components (Sheaf backbone, NL→DAG parser) have pre-trained weights that generalize across domains. What you do provide is: your data and either a causal graph (encoded in plain English or as a JSON DAG specification) or consent to run causal discovery. Setup is measured in hours, not months.
The distinction between "training" and "parameterization" matters here. You're not training a new model on your data. You're providing the causal graph that encodes your domain knowledge, and Rungs applies the algorithms to your data given that structure. Faster, more interpretable, and more robust than training from scratch.
What does 1.3ms per query actually mean in practice?
It means Rungs can handle 750+ queries per second on a single CPU core. For an enterprise BI tool with 100 concurrent analysts, that's effectively infinite throughput — Rungs will never be the bottleneck. It also means you're not dependent on GPU availability or API rate limits. The entire stack can run on-premises on commodity hardware. For real-time applications (security event analysis, live trading, IoT sensor fusion), 1.3ms means Rungs can be in the critical path without adding perceptible latency.
Compare to calling the OpenAI API: 500ms–2000ms per query, $0.01–$0.06 per query, rate-limited, and data leaves your environment. Rungs: 1.3ms, $0, unlimited, fully on-prem. At 10M queries per month, that's $100K–$600K per month in API costs vs. $0.
How do I explain this to my board?
Three sentences: "Every AI tool today tells you what's correlated. Rungs tells you what's actually causing outcomes — and what would have happened if you'd decided differently. That's the difference between a rearview mirror and a simulator." Then give one example from their domain. If they ask how: "It uses a mathematical framework called do-calculus, developed by Turing Award winner Judea Pearl at UCLA, that separates causation from correlation in any data set. We've implemented it as a 1.3ms engine that any AI system can call as a tool."
Boards care about two things: competitive moat and liability reduction. Moat: "no other production causal reasoning engine scores 98.6% on the standard benchmark." Liability: "every output is an auditable, deterministic causal chain — not a black-box score."
What's the minimum data requirement?
It depends on the complexity of the causal graph and the size of the effect you're trying to detect. For simple backdoor adjustments with a few confounders, a few hundred observations are often sufficient. For causal discovery (learning the graph from data), you need more — typically thousands of observations to reliably detect edge directions. The key insight: Rungs often needs less data than ML models because it's not estimating a high-dimensional function — it's applying a specific mathematical procedure to a targeted set of variables. Domain knowledge (the graph) replaces a lot of data.
The data requirement for causal inference is smaller than for prediction modeling for the same reason that a navigator using a map needs fewer observations than one learning the terrain from scratch. The graph is the map.
Why hasn't a big company built this?
A few have tried: Microsoft Research has DoWhy, Amazon has CausalML, IBM has some work in this space. None of them hit 98.6% on CLadder. None of them have the Mode 2 deployment architecture. None of them have the Sheaf backbone's adversarial chain generalization. The academic tools (DoWhy, EconML) are research code, not production engines. Building a production causal inference engine at this accuracy level required deep specialization and years of focused research. Big companies have too many competing priorities. This is the founder's dilemma working in our favor — the focused team beats the distracted giant.
"Why didn't Google build Stripe?" Specialization beats scale when the problem requires deep domain expertise and the market is large enough to justify it. The $700B causal inference TAM (across all verticals) justifies it. The technical depth required to do it right protects the moat.
10
The 90-Second Pitch — Memorize This
Three versions. Written out word-for-word. Adapt to the room, not the script.

Version 1 — The Elevator Pitch (30 seconds)

Use this with anyone you meet at a conference, on a flight, at an event. Pure value, no jargon. Goal: make them want to know more.

30-Second Elevator
Every AI tool today — ChatGPT, all of it — tells you what's correlated in your data. But correlation isn't causation, and in high-stakes decisions, the difference costs you. We built Rungs: the first production causal reasoning engine. It tells you not just what happened, but why it happened, what will happen if you change something, and what would have happened if you'd decided differently. Think of it as a flight simulator for your data — not a rearview mirror. We're working with enterprise security and finance teams now, and the results are measurable.

Version 2 — The Demo Setup (60 seconds)

Use this immediately before a live demo with a technical or business buyer. Sets up what they're about to see, primes them to notice the right things.

60-Second Demo Setup
Before I show you the demo, let me frame what you're looking at. Every BI tool in the market answers what we call Rung 1 questions — association, correlation, "what is." Your dashboards are Rung 1. Rungs goes to Rung 2 and Rung 3. Rung 2 is intervention: if you actually change X — force it, set it, make a decision — what happens to Y? That's the causal effect, not the correlation. Rung 3 is counterfactual: given what actually happened, what would have happened if you'd decided differently? That's attribution, liability, and post-mortem analysis. What you're about to see is a demo where we take your data — or data that looks like yours — and answer questions your current tools can't touch. When you see a confidence interval in the output, that's the Rosenbaum sensitivity bound: how wrong would the model need to be before the conclusion reverses? When you see a p-value, it's from a conditional independence test on your actual data. This is math, not language generation. Let's start.

Version 3 — The Investor Version (90 seconds)

For a VC, angel, or strategic investor. Market + moat + traction. Don't get into the technical weeds — keep it at the level of: why does this matter, why now, why us, why it can't be copied.

90-Second Investor Version
Every organization in the world is sitting on data that could answer their most important questions — but their tools only answer the easy ones. "What happened?" Business intelligence has solved that. "Why did it happen, what will happen if we change something, and what would have happened if we'd decided differently?" No one can answer those. That's a $700 billion problem — spanning security, finance, healthcare, legal, insurance, manufacturing. Every one of those industries turns on causation, not correlation. We built Rungs: a causal reasoning engine that scores 98.6% on the academic gold standard — 29 points above GPT-4 — runs in 1.3 milliseconds per query on CPU at zero marginal cost, and deploys via a four-step pipeline where any LLM calls Rungs as a tool. The moat is three things: a patent-pending algorithm architecture, benchmark results that took years of focused research to achieve, and the network effect of being the causal layer that every LLM deployment can call. We're live with our first enterprise customer in cybersecurity, we have a clear path to adjacent verticals, and we're raising to accelerate sales and build the enterprise wrapper. The question isn't whether causation matters — it's why it took this long for someone to productize it correctly. We're that product.