<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://hugonissar.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://hugonissar.github.io/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-05-19T14:40:08+02:00</updated><id>https://hugonissar.github.io/feed.xml</id><title type="html">Hugo Nissar — MarTech engineering on Google Cloud</title><subtitle>Open-source MarTech engineering — BigQuery MCP servers for AI agents, GA4 purchase propensity models with BigQuery ML, and Google Ads remarketing pipelines. Written and maintained by Hugo Nissar.</subtitle><author><name>Hugo Nissar</name></author><entry><title type="html">Self-hosted vs Google’s official BigQuery MCP server: a security and cost comparison</title><link href="https://hugonissar.github.io/blog/bigquery-mcp-server-comparison/" rel="alternate" type="text/html" title="Self-hosted vs Google’s official BigQuery MCP server: a security and cost comparison" /><published>2026-05-18T00:00:00+02:00</published><updated>2026-05-18T00:00:00+02:00</updated><id>https://hugonissar.github.io/blog/bigquery-mcp-server-comparison</id><content type="html" xml:base="https://hugonissar.github.io/blog/bigquery-mcp-server-comparison/"><![CDATA[<p>Google released an official <strong>BigQuery MCP server</strong> in 2025. It works, it’s
maintained, it’s the right default for many teams. But the moment you put an
agent in front of an untrusted user — a customer chatbot, a third-party
integration, anywhere prompt injection is a real threat — its defaults stop
being the right defaults.</p>

<p>This post compares the two architectures across the dimensions that actually
matter in production: <strong>table access control</strong>, <strong>per-query scan cost</strong>,
<strong>rate limiting</strong>, and <strong>idle cost</strong>. By the end you’ll know which one to
deploy and why.</p>

<h2 id="tldr">TL;DR</h2>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Google official</th>
      <th>Self-hosted (this guide)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Table access control</td>
      <td>IAM only — every reachable table</td>
      <td>Hard allowlist of <code class="language-plaintext highlighter-rouge">(dataset, table)</code> pairs, parsed in code</td>
    </tr>
    <tr>
      <td>Per-query scan cap</td>
      <td>None built in</td>
      <td><code class="language-plaintext highlighter-rouge">MAX_SCAN_MB</code> enforced via dry-run before the job runs</td>
    </tr>
    <tr>
      <td>Rate limiting</td>
      <td>None (“no limit on the number of calls”, per Google’s docs)</td>
      <td>Token bucket + burst, configurable</td>
    </tr>
    <tr>
      <td>Result row cap</td>
      <td>None</td>
      <td><code class="language-plaintext highlighter-rouge">MAX_RESULT_ROWS</code>, truncates server-side</td>
    </tr>
    <tr>
      <td>Source code</td>
      <td>Closed</td>
      <td>Open, MIT, ~1400 lines</td>
    </tr>
    <tr>
      <td>Idle cost</td>
      <td>Managed</td>
      <td>~$0 (Cloud Run scales to zero)</td>
    </tr>
    <tr>
      <td>Prompt-injection scanning</td>
      <td>Yes (Model Armor add-on)</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Forecasting built in</td>
      <td>Yes</td>
      <td>No</td>
    </tr>
  </tbody>
</table>

<p><strong>Pick Google’s server</strong> if you trust the agent absolutely, want Model Armor
prompt-injection scanning, or need built-in <code class="language-plaintext highlighter-rouge">forecast</code> and ARIMA tools.</p>

<p><strong>Pick the self-hosted alternative</strong> for anything else — especially customer-facing
agents, multi-tenant deployments, regulated environments, and anywhere the
words <em>“scan budget”</em> or <em>“rate limit”</em> matter.</p>

<h2 id="the-three-risks-googles-defaults-dont-cover">The three risks Google’s defaults don’t cover</h2>

<h3 id="1-any-table-the-service-account-can-reach">1. Any table the service account can reach</h3>

<p>Google’s server exposes every BigQuery table the underlying service account
has IAM permission to read. For an internal data team that’s the right default
— the service account has the same scope as the analyst using it.</p>

<p>For an agent talking to a customer, it isn’t. A misconfigured IAM grant, a
fork of the service account’s role, or a future colleague who adds a dataset
to <em>“the analytics service account, it’s already got access to everything”</em>
can quietly widen the surface that an agent can read. The next prompt
injection now has more to work with.</p>

<p>The self-hosted alternative inverts the default: tables are listed explicitly
in <code class="language-plaintext highlighter-rouge">(dataset, table)</code> pairs as environment variables. Anything outside the
allowlist is rejected at the <strong>SQL parser layer</strong> — before a job is ever
submitted to BigQuery. Cross-dataset references like
<code class="language-plaintext highlighter-rouge">wrong_dataset.allowed_table</code> are caught and rejected too.</p>
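<p>As a rough illustration of that check, here is a minimal sketch. It is not the repository’s code: the names <code class="language-plaintext highlighter-rouge">ALLOWED_TABLES</code>, <code class="language-plaintext highlighter-rouge">referenced_tables</code> and <code class="language-plaintext highlighter-rouge">check_allowlist</code> are hypothetical, and it assumes a regular expression over <code class="language-plaintext highlighter-rouge">FROM</code>/<code class="language-plaintext highlighter-rouge">JOIN</code> clauses where a real server would use a proper SQL parser.</p>

```python
import re

# Hypothetical allowlist; the post configures these pairs through environment variables.
ALLOWED_TABLES = {("analytics", "events")}

# Deliberately naive: matches FROM x.y / JOIN x.y, with optional backticks around
# the whole reference and an optional leading project id. A production server
# needs a real SQL parser instead of a regular expression.
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+`?((?:[\w-]+\.)?\w+\.\w+)`?", re.IGNORECASE)


def referenced_tables(sql: str) -> set[tuple[str, str]]:
    """Return the (dataset, table) pairs that the query reads from."""
    pairs = set()
    for ref in _TABLE_REF.findall(sql):
        parts = ref.split(".")
        pairs.add((parts[-2], parts[-1]))
    return pairs


def check_allowlist(sql: str) -> None:
    """Raise PermissionError before any BigQuery job would be submitted."""
    refs = referenced_tables(sql)
    if not refs:
        raise PermissionError("no fully qualified table reference found")
    blocked = refs - ALLOWED_TABLES
    if blocked:
        raise PermissionError(f"tables outside the allowlist: {sorted(blocked)}")
```

<p>The important property is the ordering: the check runs on the SQL text, so a rejected query never reaches BigQuery at all.</p>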

<h3 id="2-the-ai-just-ran-a-2000-query-problem">2. The “AI just ran a $2,000 query” problem</h3>

<p>There is no per-query scan ceiling in Google’s MCP server. The
<a href="https://docs.cloud.google.com/bigquery/docs/use-bigquery-mcp">official docs</a>
suggest enforcing one via custom IAM roles or BigQuery-level quotas. Both work,
both require Ops effort, and neither is on by default.</p>

<p>The self-hosted server dry-runs every query first. If BigQuery’s estimate
exceeds <code class="language-plaintext highlighter-rouge">MAX_SCAN_MB</code> (default 100), the real job is never submitted — no
bytes billed, no surprise on the invoice.</p>
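<p>The dry-run pattern can be sketched roughly as follows. The function names are hypothetical and this is not the repository’s implementation; it assumes the <code class="language-plaintext highlighter-rouge">google-cloud-bigquery</code> client library and treats one MB as 1024 × 1024 bytes, which may differ from how the server counts.</p>

```python
def estimate_scan_bytes(client, sql: str) -> int:
    """Ask BigQuery for a scan estimate without running (or billing) the query."""
    from google.cloud import bigquery  # needs the google-cloud-bigquery package

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


def enforce_scan_cap(estimated_bytes: int, max_scan_mb: int = 100) -> None:
    """Raise if the dry-run estimate is above the cap; otherwise return quietly."""
    # Assumption: 1 MB = 1024 * 1024 bytes.
    limit_bytes = max_scan_mb * 1024 * 1024
    if estimated_bytes > limit_bytes:
        raise ValueError(
            f"query would scan {estimated_bytes / 1024 / 1024:.1f} MB, "
            f"cap is {max_scan_mb} MB"
        )
```

<p>A caller would run <code class="language-plaintext highlighter-rouge">enforce_scan_cap(estimate_scan_bytes(client, sql))</code> and submit the real job only if no exception is raised.</p>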

<h3 id="3-no-built-in-rate-limiting">3. No built-in rate limiting</h3>

<p>From Google’s docs, verbatim: <em>“The BigQuery MCP server doesn’t have its own
quotas. There is no limit on the number of calls that can be made to the MCP
server.”</em></p>

<p>For a server fronted by an agent that retries on failure, that’s a foot-gun.
The self-hosted alternative ships with a configurable
<code class="language-plaintext highlighter-rouge">RATE_LIMIT_QPM</code> and <code class="language-plaintext highlighter-rouge">RATE_LIMIT_BURST</code> token bucket, plus separate
concurrency semaphores for queries and metadata calls.</p>
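<p>For readers unfamiliar with the pattern, a token bucket fits in a few lines. This is a generic sketch under my own naming (<code class="language-plaintext highlighter-rouge">TokenBucket</code>, <code class="language-plaintext highlighter-rouge">allow</code>), not the server’s actual class; the two constructor arguments stand in for <code class="language-plaintext highlighter-rouge">RATE_LIMIT_QPM</code> and <code class="language-plaintext highlighter-rouge">RATE_LIMIT_BURST</code>.</p>

```python
import time


class TokenBucket:
    """Refills at `qpm` tokens per minute and holds at most `burst` tokens."""

    def __init__(self, qpm: float, burst: int, clock=time.monotonic):
        self.rate_per_sec = qpm / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the caller should be throttled."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

<p>The burst size sets how many calls can arrive back to back; the per-minute rate sets how quickly that allowance comes back, which is what keeps a retrying agent from hammering the backend.</p>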

<h2 id="when-the-managed-version-still-wins">When the managed version still wins</h2>

<p>The self-hosted server isn’t a strict superset. Google’s managed offering
gives you <strong>Model Armor integration</strong> for prompt-injection scanning (a paid
add-on, but a real one) and <strong>built-in forecasting / ARIMA tools</strong> that the
self-hosted server intentionally doesn’t ship. If those matter, use Google’s
version. If they don’t, the cost / security tradeoff swings hard toward
self-hosted.</p>

<h2 id="deploying-the-self-hosted-alternative">Deploying the self-hosted alternative</h2>

<p>Three steps, ~10 minutes:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 1. Generate API keys</span>
openssl rand <span class="nt">-hex</span> 32 | gcloud secrets create mcp-api-key <span class="nt">--data-file</span><span class="o">=</span>-
openssl rand <span class="nt">-hex</span> 32 | gcloud secrets create mcp-admin-key <span class="nt">--data-file</span><span class="o">=</span>-

<span class="c"># 2. Build the image</span>
gcloud builds submit <span class="se">\</span>
  <span class="nt">--tag</span> europe-north2-docker.pkg.dev/<span class="nv">$PROJECT_ID</span>/bigquery-readonly-mcp/server:latest

<span class="c"># 3. Deploy to Cloud Run</span>
gcloud run deploy bigquery-readonly-mcp <span class="se">\</span>
  <span class="nt">--image</span><span class="o">=</span>europe-north2-docker.pkg.dev/<span class="nv">$PROJECT_ID</span>/bigquery-readonly-mcp/server:latest <span class="se">\</span>
  <span class="nt">--region</span><span class="o">=</span>europe-north2 <span class="se">\</span>
  <span class="nt">--service-account</span><span class="o">=</span><span class="s2">"bigquery-readonly-mcp@</span><span class="nv">$PROJECT_ID</span><span class="s2">.iam.gserviceaccount.com"</span> <span class="se">\</span>
  <span class="nt">--set-secrets</span><span class="o">=</span><span class="s2">"MCP_API_KEY=mcp-api-key:latest,MCP_ADMIN_KEY=mcp-admin-key:latest"</span> <span class="se">\</span>
  <span class="nt">--set-env-vars</span><span class="o">=</span><span class="s2">"GCP_PROJECT_ID=</span><span class="nv">$PROJECT_ID</span><span class="s2">,BQ_DATASET_ID=analytics,BQ_ALLOWED_TABLE=events,MAX_SCAN_MB=100,RATE_LIMIT_QPM=20"</span>
</code></pre></div></div>

<p>The full IAM least-privilege guide, multi-table configuration, and operations
playbook are in the <a href="https://github.com/hugonissar/BigQuery-Read-Only-MCP-Server">repository README</a>.</p>

<h2 id="frequently-asked-questions">Frequently asked questions</h2>

<h3 id="does-this-work-with-claude-desktop">Does this work with Claude Desktop?</h3>

<p>Yes. It also works with Cursor, Windsurf, Claude Code, ChatGPT’s deep
research, the OpenAI Responses API, and anything else that speaks
streamable-HTTP MCP. Configuration is the standard <code class="language-plaintext highlighter-rouge">url</code> + <code class="language-plaintext highlighter-rouge">headers</code> MCP
client config.</p>

<h3 id="can-i-run-it-inside-a-private-vpc">Can I run it inside a private VPC?</h3>

<p>Yes. Add <code class="language-plaintext highlighter-rouge">--vpc-connector</code> and <code class="language-plaintext highlighter-rouge">--ingress=internal</code> to the Cloud Run deploy
command, then put an internal load balancer in front.</p>

<h3 id="how-much-does-it-cost-at-idle">How much does it cost at idle?</h3>

<p>Roughly nothing. Cloud Run scales to zero between requests; the only standing
costs are Artifact Registry storage (~$0.10/GB/month for the image) and
Secret Manager versions (cents/month). Under modest load — say 1000 queries
a day — total monthly cost is in the single-digit dollars.</p>

<h3 id="what-about-column-level-masking">What about column-level masking?</h3>

<p>The allowlist is table-granularity. If you need to hide PII columns, create
a BigQuery authorized view that projects only safe columns and add the <em>view</em>
to the allowlist instead of the underlying table.</p>

<h2 id="source-code">Source code</h2>

<p>The full source is one Python file, MIT-licensed, ~1400 lines:
<a href="https://github.com/hugonissar/BigQuery-Read-Only-MCP-Server">github.com/hugonissar/BigQuery-Read-Only-MCP-Server</a>.
PRs welcome.</p>]]></content><author><name>Hugo Nissar</name></author><category term="mcp" /><category term="bigquery" /><category term="cloud-run" /><category term="security" /><category term="ai-agents" /><summary type="html"><![CDATA[When to pick Google's managed BigQuery MCP server vs a self-hosted one. A side-by-side comparison of allowlists, scan ceilings, rate limiting, and idle cost — with deploy commands for the self-hosted alternative.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://hugonissar.github.io/assets/images/og-default.png" /><media:content medium="image" url="https://hugonissar.github.io/assets/images/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">When +35% is really −60%: a GA4 anomaly detector where the LLM only writes the report</title><link href="https://hugonissar.github.io/blog/ga4-anomaly-detector/" rel="alternate" type="text/html" title="When +35% is really −60%: a GA4 anomaly detector where the LLM only writes the report" /><published>2026-05-18T00:00:00+02:00</published><updated>2026-05-18T00:00:00+02:00</updated><id>https://hugonissar.github.io/blog/ga4-anomaly-detector</id><content type="html" xml:base="https://hugonissar.github.io/blog/ga4-anomaly-detector/"><![CDATA[<p>I’ve been using MCP servers for about a year now, which means I’ve been
doing what most of the tech industry has been doing: hooking AI agents up
to real data sources and watching them make us extremely productive most
of the time, with the occasional confident mistake that no one could catch
without the original SQL output.</p>

<p>The most surprising failure mode, for me, was always the simple
calculations. Not the hard reasoning. Not the complex multi-step joins. The
<em>simple calculations.</em> “Revenue grew from $5,000 to $8,000” — fine. “That’s
a 60% increase” — sometimes fine, sometimes confidently 35%, sometimes
confidently <em>negative.</em> The model returns a number, the chat UI shows the
number in bold, and unless the user happens to do the long division in
their head before pasting it into a Slack channel, the number sticks.</p>

<p>This isn’t a controversial observation among people who use these tools
heavily — every analytics engineer I know has stories. But what tends to be
<em>less</em> talked about is the architectural lesson: <strong>if the math matters,
don’t make the LLM do it.</strong> Hand the LLM the result of the math, computed
deterministically in code, and let it do the part it’s actually good at —
writing the prose.</p>

<p>That’s the lesson behind a weekend project I shipped this month:
<a href="https://github.com/hugonissar/GA4-Anomaly-Detector"><strong>GA4-Anomaly-Detector</strong></a>.
It started as a benchmark — <em>can an LLM correctly find anomalies in a
GA4 BigQuery export if I just hand it the rows, or do I need to compute the
anomalies myself first?</em> — and became a tool that does anomaly detection in
code and uses the LLM purely for narrative. This post walks through what
it does, why the architecture matters, the full CLI reference, and an
honest word of caution about what it deliberately <em>doesn’t</em> have yet.</p>

<h2 id="the-benchmark-that-made-the-design-obvious">The benchmark that made the design obvious</h2>

<p>The original question wasn’t even my own. It was already answered in a
<a href="https://blog.coupler.io/how-to-analyse-ga4-with-ai/">piece by Coupler.io</a>
benchmarking Google’s official GA4 MCP server against the GA4 sample
dataset. The setup was simple: ask the agent to describe the recent traffic
trend; check the answer against the actual numbers. <strong>Google’s MCP server
reported a 35% increase in traffic when the actual trend was a 60%
decrease.</strong> The answer did not just miss the magnitude; it got the direction wrong.</p>

<p>This isn’t because Google’s engineers built it badly. It’s because the
architecture handed the LLM the raw query results and asked it to do the
analysis. That’s the structural mistake. An LLM scanning a column of
numbers will sometimes pattern-match to a description that fits the most
salient sub-window rather than the actual trend. Sometimes it will
confidently average the wrong subset of rows. Sometimes it will simply get
the percentage backwards. The chart in your head and the chart in the
LLM’s head are not always the same chart.</p>

<p>The fix is structural too: <strong>run the statistics in code, hand the LLM only
the structured findings.</strong> That’s what <code class="language-plaintext highlighter-rouge">ga4-anomaly-detector</code> does. The
LLM never sees the daily metric values. It never sees the raw event rows.
It sees a list of findings (“revenue dropped 38% on 2021-01-19 and stayed
down”) and a system prompt forbidding speculation about causes that aren’t
corroborated by other findings. If the math is right, the math is right
in code, before any LLM call.</p>

<h2 id="what-it-produces">What it produces</h2>

<p>The output is a markdown report you can pipe into Slack, email, a doc, or
<code class="language-plaintext highlighter-rouge">cat</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># GA4 Anomaly Detector
*2021-01-01 → 2021-01-31 · `bigquery-public-data.ga4_obfuscated_sample_ecommerce`*

## Headline
Revenue stepped down by 38% starting 2021-01-19 and has held.

## Key findings
- **revenue** level shift on 2021-01-19: ~$8,200 → ~$5,100
  (↓ -38% sustained over 14 days)
- **conversions** on 2021-01-22: 47 vs ~89 expected
  (↓ -47%, high severity)
- **sessions** on 2021-01-08: 4,820 vs ~3,100 expected
  (↑ +56%, medium severity)

## What changed in the mix
**revenue by source medium** (2021-01-18–2021-01-24 → 2021-01-25–2021-01-31)
- Gainer: `(direct) / (none)` 31% → 44% share
- Loser: `google / cpc` 28% → 17% share
</code></pre></div></div>

<p>The headline at the top is the only sentence the LLM writes that isn’t
directly grounded in a numerical finding. Everything below it has a
specific anomaly object with <code class="language-plaintext highlighter-rouge">metric</code>, <code class="language-plaintext highlighter-rouge">date</code>, <code class="language-plaintext highlighter-rouge">expected</code>, <code class="language-plaintext highlighter-rouge">observed</code>,
<code class="language-plaintext highlighter-rouge">severity</code> behind it.</p>
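<p>To make that concrete, a finding object along these lines would be enough to render each bullet. The field names come from the list above; the class name <code class="language-plaintext highlighter-rouge">Finding</code> and the helpers <code class="language-plaintext highlighter-rouge">pct_change</code> and <code class="language-plaintext highlighter-rouge">to_bullet</code> are hypothetical, and the real project may structure this differently.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    metric: str
    date: str        # ISO date, e.g. "2021-01-22"
    expected: float
    observed: float
    severity: str    # e.g. "medium" or "high"

    @property
    def pct_change(self) -> float:
        """Signed percentage change of observed relative to expected."""
        return (self.observed - self.expected) / self.expected * 100.0

    def to_bullet(self) -> str:
        """Render the finding as one markdown bullet, with the percentage computed in code."""
        arrow = "↑" if self.observed >= self.expected else "↓"
        return (
            f"- **{self.metric}** on {self.date}: {self.observed:,.0f} vs "
            f"~{self.expected:,.0f} expected ({arrow} {self.pct_change:+.0f}%, "
            f"{self.severity} severity)"
        )
```

<p>The percentage is derived from the two stored values every time, so the prose layer has no opportunity to restate it incorrectly.</p>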

<h2 id="how-the-math-gets-done">How the math gets done</h2>

<p>Three detectors, picked because the alternatives produced worse signal on
GA4-shaped data:</p>

<ul>
  <li><strong>Point anomalies use STL residual z-scores.</strong> GA4 metrics have strong
weekday/weekend cycles. A naive rolling-mean z-score flags every
Saturday as anomalous. STL decomposes the series into trend + weekly
seasonal + residual; we z-score the residual. A day is flagged when its
residual exceeds the sigma threshold.</li>
  <li><strong>Change points use PELT with an RBF cost.</strong> A one-day spike is not a
change point. A site migration after which sessions drop and stay down
<em>is.</em> PELT separates the two. The RBF cost is roughly scale-invariant,
so the same penalty works for <code class="language-plaintext highlighter-rouge">sessions</code> (thousands) and
<code class="language-plaintext highlighter-rouge">conversion_rate</code> (single digits).</li>
  <li><strong>Mix shifts use Jensen-Shannon divergence.</strong> This catches the case
GA4 dashboards hide: total sessions look flat, but direct doubled while
organic collapsed. We compute share-of-voice distributions across
adjacent windows and measure the divergence. JS is bounded <code class="language-plaintext highlighter-rouge">[0, ln 2]</code>
so the threshold is interpretable.</li>
</ul>
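<p>Of the three, the mix-shift measure is the easiest to show without extra dependencies. The sketch below computes Jensen-Shannon divergence with natural logarithms between two windows of per-channel counts. The function name and the input shape (a dict of channel to count) are assumptions for illustration, not the project’s actual interface.</p>

```python
import math


def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence (natural log) between two share distributions.

    Inputs are raw counts per category; they are normalized to shares here.
    """
    keys = set(p) | set(q)
    total_p, total_q = sum(p.values()), sum(q.values())
    divergence = 0.0
    for k in keys:
        pk = p.get(k, 0.0) / total_p
        qk = q.get(k, 0.0) / total_q
        mk = 0.5 * (pk + qk)  # the midpoint distribution
        if pk > 0:
            divergence += 0.5 * pk * math.log(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * math.log(qk / mk)
    return divergence
```

<p>Identical mixes give 0 and completely disjoint mixes give ln 2, which is why a fixed threshold stays meaningful across dimensions.</p>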

<p>The LLM — Gemini 3 by default, but the architecture takes any client
implementing the <code class="language-plaintext highlighter-rouge">LLMClient</code> Protocol — receives the resulting
<code class="language-plaintext highlighter-rouge">AnomalyReport</code> and turns it into prose. It cannot invent a finding
because there’s nothing in its input it could invent one from.</p>
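<p>A structural Protocol of this kind might look like the sketch below. Only the name <code class="language-plaintext highlighter-rouge">LLMClient</code> comes from the project; the method name <code class="language-plaintext highlighter-rouge">generate</code>, its signature, the <code class="language-plaintext highlighter-rouge">EchoClient</code> stand-in and the <code class="language-plaintext highlighter-rouge">narrate</code> helper are assumptions made for illustration.</p>

```python
import json
from typing import Protocol


class LLMClient(Protocol):
    # Hypothetical method name and signature; the real Protocol may differ.
    def generate(self, system_prompt: str, findings_json: str) -> str: ...


class EchoClient:
    """Stand-in client that satisfies the Protocol without any network call."""

    def generate(self, system_prompt: str, findings_json: str) -> str:
        return f"Findings received: {findings_json}"


def narrate(client: LLMClient, findings: list[dict]) -> str:
    """Serialize the findings and hand only that structured payload to the client."""
    payload = json.dumps(findings, sort_keys=True)
    system_prompt = "Describe only the findings provided. Do not speculate about causes."
    return client.generate(system_prompt, payload)
```

<p>Because the Protocol is structural, any class with a matching method can be swapped in, including a test double like the one above.</p>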

<hr />

<h2 id="full-cli-reference">Full CLI reference</h2>

<p>The CLI has two subcommands and a shared set of flags. This section
documents every option in <code class="language-plaintext highlighter-rouge">cli.py</code>.</p>

<h3 id="subcommands">Subcommands</h3>

<table>
  <thead>
    <tr>
      <th>Command</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">sample</code></td>
      <td>Run against Google’s public obfuscated GA4 sample dataset. You don’t need a GA4 export of your own, only a GCP project for query billing. The date range defaults to the window covered by the public sample.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">run</code></td>
      <td>Run against your own GA4 BigQuery export. Requires <code class="language-plaintext highlighter-rouge">--project-id</code> and <code class="language-plaintext highlighter-rouge">--dataset</code>.</td>
    </tr>
  </tbody>
</table>

<h3 id="top-level-help-flag">Top-level help flag</h3>

<table>
  <thead>
    <tr>
      <th>Flag</th>
      <th>What it does</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">-h</code>, <code class="language-plaintext highlighter-rouge">--help</code></td>
      <td>Standard argparse help (top level + subcommand names only).</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--h</code>, <code class="language-plaintext highlighter-rouge">--help-all</code></td>
      <td>Custom action that prints the top-level help <em>plus every subcommand’s full help in one pass</em>. Use this when you want to see every flag everywhere without typing <code class="language-plaintext highlighter-rouge">-h</code> per subcommand. The two-dash <code class="language-plaintext highlighter-rouge">--h</code> spelling is intentional: <code class="language-plaintext highlighter-rouge">allow_abbrev=False</code> keeps argparse from auto-expanding it to <code class="language-plaintext highlighter-rouge">--help</code>.</td>
    </tr>
  </tbody>
</table>
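<p>A help-all action of this kind can be built on argparse roughly as follows. This is a sketch, not the code in <code class="language-plaintext highlighter-rouge">cli.py</code>: the class name <code class="language-plaintext highlighter-rouge">HelpAllAction</code> and the <code class="language-plaintext highlighter-rouge">build_parser</code> helper are hypothetical, and enumerating subcommands relies on argparse’s private <code class="language-plaintext highlighter-rouge">_SubParsersAction</code> class.</p>

```python
import argparse


class HelpAllAction(argparse.Action):
    """Print the top-level help followed by each subcommand's help, then exit."""

    def __init__(self, option_strings, dest=argparse.SUPPRESS,
                 default=argparse.SUPPRESS, help=None):
        super().__init__(option_strings=option_strings, dest=dest,
                         default=default, nargs=0, help=help)

    def __call__(self, parser, namespace, values, option_string=None):
        parser.print_help()
        # _SubParsersAction is a private argparse class; relying on it is a
        # common but unofficial way to enumerate subcommands.
        for action in parser._actions:
            if isinstance(action, argparse._SubParsersAction):
                for name, subparser in action.choices.items():
                    print(f"\n--- {name} ---")
                    subparser.print_help()
        parser.exit()


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cli.py", allow_abbrev=False)
    parser.add_argument("--h", "--help-all", action=HelpAllAction,
                        help="show help for every subcommand and exit")
    sub = parser.add_subparsers(dest="command")
    sub.add_parser("sample", help="run against the public GA4 sample dataset")
    run = sub.add_parser("run", help="run against your own GA4 export")
    run.add_argument("--project-id", required=True)
    run.add_argument("--dataset", required=True)
    return parser
```

<p>Calling <code class="language-plaintext highlighter-rouge">build_parser().parse_args(["--h"])</code> prints all three help screens and exits with status 0, the same way the built-in help does.</p>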

<h3 id="run-specific-required-flags"><code class="language-plaintext highlighter-rouge">run</code>-specific required flags</h3>

<table>
  <thead>
    <tr>
      <th>Flag</th>
      <th>Required</th>
      <th>What it is</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--project-id PROJECT</code></td>
      <td>Yes</td>
      <td>The GCP project hosting your GA4 BigQuery export.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--dataset DATASET</code></td>
      <td>Yes</td>
      <td>The BigQuery dataset, typically <code class="language-plaintext highlighter-rouge">analytics_&lt;property_id&gt;</code> (e.g. <code class="language-plaintext highlighter-rouge">analytics_123456789</code>).</td>
    </tr>
  </tbody>
</table>

<h3 id="date-range-both-subcommands">Date range (both subcommands)</h3>

<table>
  <thead>
    <tr>
      <th>Flag</th>
      <th>Type</th>
      <th>Default</th>
      <th>What it is</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--start YYYY-MM-DD</code></td>
      <td>date</td>
      <td><code class="language-plaintext highlighter-rouge">sample</code>: <code class="language-plaintext highlighter-rouge">2020-12-01</code> (start of the sample’s window). <code class="language-plaintext highlighter-rouge">run</code>: today minus 30 days.</td>
      <td>Inclusive start of the analysis window.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--end YYYY-MM-DD</code></td>
      <td>date</td>
      <td><code class="language-plaintext highlighter-rouge">sample</code>: <code class="language-plaintext highlighter-rouge">2021-01-31</code> (end of the sample’s window). <code class="language-plaintext highlighter-rouge">run</code>: yesterday.</td>
      <td>Inclusive end of the analysis window.</td>
    </tr>
  </tbody>
</table>

<p>Validation: if <code class="language-plaintext highlighter-rouge">--start &gt; --end</code>, the CLI exits with code <code class="language-plaintext highlighter-rouge">2</code> and an error.</p>
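<p>The defaults and the validation rule for <code class="language-plaintext highlighter-rouge">run</code> can be sketched as below. The helper name <code class="language-plaintext highlighter-rouge">resolve_window</code> and its <code class="language-plaintext highlighter-rouge">today</code> parameter are hypothetical; the sketch relies on the fact that argparse’s <code class="language-plaintext highlighter-rouge">parser.error</code> exits with status 2, which may or may not be how <code class="language-plaintext highlighter-rouge">cli.py</code> produces that exit code.</p>

```python
import argparse
from datetime import date, timedelta


def resolve_window(parser, args, today=None):
    """Fill in the `run` defaults and reject an inverted range."""
    today = today or date.today()
    start = args.start or today - timedelta(days=30)
    end = args.end or today - timedelta(days=1)
    if start > end:
        # parser.error prints usage plus the message and exits with status 2
        parser.error(f"--start ({start}) must not be after --end ({end})")
    return start, end


parser = argparse.ArgumentParser(prog="cli.py")
parser.add_argument("--start", type=date.fromisoformat, default=None)
parser.add_argument("--end", type=date.fromisoformat, default=None)
```

<p>Using <code class="language-plaintext highlighter-rouge">date.fromisoformat</code> as the argument type also means a malformed date is rejected by argparse itself, before the range check runs.</p>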

<h3 id="billing-and-auth-both-subcommands">Billing and auth (both subcommands)</h3>

<table>
  <thead>
    <tr>
      <th>Flag</th>
      <th>Default</th>
      <th>What it is</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--billing-project PROJECT</code></td>
      <td><code class="language-plaintext highlighter-rouge">GOOGLE_CLOUD_PROJECT</code> env var</td>
      <td>GCP project to bill queries against. <strong>Required</strong>, either via this flag or the env var — the CLI exits with code <code class="language-plaintext highlighter-rouge">2</code> if neither is set.</td>
    </tr>
  </tbody>
</table>

<p>Auth itself is provided by Google’s Application Default Credentials —
run <code class="language-plaintext highlighter-rouge">gcloud auth application-default login</code> once, or set
<code class="language-plaintext highlighter-rouge">GOOGLE_APPLICATION_CREDENTIALS</code> to a service-account JSON. The CLI
surfaces this hint automatically if BigQuery client initialization fails.</p>

<h3 id="data-shape-both-subcommands">Data shape (both subcommands)</h3>

<table>
  <thead>
    <tr>
      <th>Flag</th>
      <th>Type</th>
      <th>Default</th>
      <th>What it is</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--dimensions CSV</code></td>
      <td>comma-sep list</td>
      <td><code class="language-plaintext highlighter-rouge">source_medium</code></td>
      <td>Which dimensions to slice mix-shift detection on. Valid values: <code class="language-plaintext highlighter-rouge">source_medium</code>, <code class="language-plaintext highlighter-rouge">device_category</code>, <code class="language-plaintext highlighter-rouge">country</code>, <code class="language-plaintext highlighter-rouge">browser</code>. Pass <code class="language-plaintext highlighter-rouge">--dimensions ""</code> (empty string) to skip mix-shift detection entirely — useful on large exports where it’s the slowest step.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--metrics CSV</code></td>
      <td>comma-sep list</td>
      <td>All standard metrics (<code class="language-plaintext highlighter-rouge">DEFAULT_METRICS</code>)</td>
      <td>Which metrics to run anomaly detection over. Limited to metrics that can be derived from the GA4 export.</td>
    </tr>
  </tbody>
</table>

<h3 id="output-both-subcommands">Output (both subcommands)</h3>

<table>
  <thead>
    <tr>
      <th>Flag</th>
      <th>Default</th>
      <th>What it is</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">-o</code>, <code class="language-plaintext highlighter-rouge">--output FILE</code></td>
      <td>stdout</td>
      <td>If supplied, the markdown report is written to this path. Otherwise it goes to stdout, ready to pipe into <code class="language-plaintext highlighter-rouge">pbcopy</code>, <code class="language-plaintext highlighter-rouge">tee</code>, or anything else.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">-v</code>, <code class="language-plaintext highlighter-rouge">--verbose</code></td>
      <td>off</td>
      <td>Verbose logging on stderr. Logs go to stderr and the report goes to stdout, so <code class="language-plaintext highlighter-rouge">python cli.py sample &gt; report.md</code> still works cleanly.</td>
    </tr>
  </tbody>
</table>

<h3 id="detection-tuning-both-subcommands-grouped-as-detection-tuning">Detection tuning (both subcommands, grouped as “Detection tuning”)</h3>

<table>
  <thead>
    <tr>
      <th>Flag</th>
      <th>Type</th>
      <th>Default</th>
      <th>What it is</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--sigma-threshold FLOAT</code></td>
      <td>float</td>
      <td><code class="language-plaintext highlighter-rouge">3.0</code></td>
      <td>Z-score threshold for point anomalies. <strong>Lower means more sensitive.</strong> Drop to <code class="language-plaintext highlighter-rouge">2.5</code> for small-traffic sites with high day-to-day noise; raise to <code class="language-plaintext highlighter-rouge">4.0</code> for high-volume sites where you only want extreme deviations.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--pelt-penalty FLOAT</code></td>
      <td>float</td>
      <td><code class="language-plaintext highlighter-rouge">10.0</code></td>
      <td>Penalty parameter for the PELT change-point detector. <strong>Lower means more change points.</strong> Raise if you’re seeing spurious breaks; lower if real shifts are being missed.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--mix-window-days INT</code></td>
      <td>int</td>
      <td><code class="language-plaintext highlighter-rouge">7</code></td>
      <td>Width of each comparison window for mix-shift detection. The default compares the last 7 days against the 7 days before, anchored to the most recent date in the export.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--known-events DATES</code></td>
      <td>comma-sep dates</td>
      <td>empty</td>
      <td>Dates (YYYY-MM-DD) to exclude from point-anomaly detection. Without this, a December run will dutifully flag Christmas as a -60% anomaly. Excluded dates are also dropped from the noise-floor estimate, so they don’t bias the threshold. Example: <code class="language-plaintext highlighter-rouge">--known-events 2026-12-24,2026-12-25,2026-12-31,2027-01-01</code>. The <code class="language-plaintext highlighter-rouge">holidays</code> Python package can generate per-country lists if you don’t want to hardcode them.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--conversion-events CSV</code></td>
      <td>comma-sep list</td>
      <td><code class="language-plaintext highlighter-rouge">purchase</code> (the default for ecommerce)</td>
      <td>GA4 event names that count as conversions. Matters because the <code class="language-plaintext highlighter-rouge">conversions</code> metric is derived from this list. SaaS or lead-gen sites should override: <code class="language-plaintext highlighter-rouge">--conversion-events sign_up,subscribe</code>. Without this, non-ecommerce sites will report 0 conversions and the narrative will confidently say so.</td>
    </tr>
  </tbody>
</table>
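<p>The interaction between <code class="language-plaintext highlighter-rouge">--sigma-threshold</code> and <code class="language-plaintext highlighter-rouge">--known-events</code> is easier to see in code. The sketch below assumes the residuals have already been produced by a decomposition step and uses a plain mean and standard deviation as the noise floor; the function name is hypothetical and the real detector may use a different estimator.</p>

```python
import statistics


def flag_point_anomalies(residuals, sigma_threshold=3.0, known_events=frozenset()):
    """Return the dates whose residual z-score exceeds the threshold.

    `residuals` maps ISO date strings to residual values, assumed to come from
    a decomposition step that has already removed trend and weekly seasonality.
    """
    # Known events are removed first, so they neither get flagged nor
    # inflate the noise-floor estimate.
    usable = {d: r for d, r in residuals.items() if d not in known_events}
    values = list(usable.values())
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return sorted(d for d, r in usable.items() if abs(r - mean) / stdev > sigma_threshold)
```

<p>One consequence worth noting: a large known event left in the series widens the standard deviation, which can hide a smaller but genuine anomaly on another day. Excluding it tightens the noise floor again.</p>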

<h3 id="llm-narrative-both-subcommands-grouped-as-llm-narrative">LLM narrative (both subcommands, grouped as “LLM narrative”)</h3>

<table>
  <thead>
    <tr>
      <th>Flag</th>
      <th>Type</th>
      <th>Default</th>
      <th>What it is</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--no-llm</code></td>
      <td>flag</td>
      <td>off</td>
      <td>Skip the LLM call entirely; use the deterministic template renderer instead. Same output shape, zero tokens, useful for tuning the detector parameters in a tight loop.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--vertex</code></td>
      <td>flag</td>
      <td>off, but reads <code class="language-plaintext highlighter-rouge">GOOGLE_GENAI_USE_VERTEXAI=true</code> from env</td>
      <td>Use Vertex AI (gcloud auth) for the Gemini call instead of an AI Studio API key. Reuses <code class="language-plaintext highlighter-rouge">--billing-project</code> and your Application Default Credentials. Right choice if you want unified auth + Cloud Logging audit trail; AI Studio + an API key is right for personal use.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--model MODEL</code></td>
      <td>string</td>
      <td><code class="language-plaintext highlighter-rouge">GeminiClient.DEFAULT_MODEL</code> (Gemini 3 Preview at the time of writing)</td>
      <td>The Gemini model string. Override if you want a different cost/quality tradeoff.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--api-key KEY</code></td>
      <td>string</td>
      <td><code class="language-plaintext highlighter-rouge">GEMINI_API_KEY</code>, falling back to <code class="language-plaintext highlighter-rouge">GOOGLE_API_KEY</code> env var</td>
      <td>API key for AI Studio mode only. Ignored in <code class="language-plaintext highlighter-rouge">--vertex</code> mode.</td>
    </tr>
  </tbody>
</table>

<h3 id="environment-variables-the-cli-reads">Environment variables the CLI reads</h3>

<p>A consolidated list of the env vars the CLI consults, with which flag each
one substitutes for:</p>

<table>
  <thead>
    <tr>
      <th>Env var</th>
      <th>Substitutes for</th>
      <th>Effect</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">GOOGLE_CLOUD_PROJECT</code></td>
      <td><code class="language-plaintext highlighter-rouge">--billing-project</code></td>
      <td>Default project for billing and (in <code class="language-plaintext highlighter-rouge">--vertex</code> mode) Vertex AI.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">GOOGLE_APPLICATION_CREDENTIALS</code></td>
      <td>—</td>
      <td>Path to a service-account JSON, used by Google client libraries’ ADC chain.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">GOOGLE_GENAI_USE_VERTEXAI</code></td>
      <td><code class="language-plaintext highlighter-rouge">--vertex</code></td>
      <td>If set to <code class="language-plaintext highlighter-rouge">true</code>, defaults the CLI to Vertex AI mode.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">GOOGLE_CLOUD_LOCATION</code></td>
      <td>—</td>
      <td>Vertex AI region. Defaults to <code class="language-plaintext highlighter-rouge">global</code> (required for Gemini 3 Preview models). Set to e.g. <code class="language-plaintext highlighter-rouge">europe-west4</code> for data residency, but make sure your model supports that region.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">GEMINI_API_KEY</code></td>
      <td><code class="language-plaintext highlighter-rouge">--api-key</code></td>
      <td>Primary fallback for the AI Studio API key.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">GOOGLE_API_KEY</code></td>
      <td><code class="language-plaintext highlighter-rouge">--api-key</code></td>
      <td>Secondary fallback if <code class="language-plaintext highlighter-rouge">GEMINI_API_KEY</code> isn’t set.</td>
    </tr>
  </tbody>
</table>
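
<p>The key fallback order in the table can be sketched in a few lines. This is an illustration only, not the repo’s code: the helper names <code class="language-plaintext highlighter-rouge">resolve_api_key</code> and <code class="language-plaintext highlighter-rouge">vertex_enabled</code> are hypothetical, and only the precedence (flag, then <code class="language-plaintext highlighter-rouge">GEMINI_API_KEY</code>, then <code class="language-plaintext highlighter-rouge">GOOGLE_API_KEY</code>) comes from the table above.</p>

```python
from typing import Mapping, Optional


def resolve_api_key(flag_value: Optional[str], env: Mapping[str, str]) -> Optional[str]:
    """Return the AI Studio key using the documented precedence (hypothetical helper)."""
    # 1. An explicit --api-key value wins.
    if flag_value:
        return flag_value
    # 2. GEMINI_API_KEY is the primary fallback, GOOGLE_API_KEY the secondary one.
    for name in ("GEMINI_API_KEY", "GOOGLE_API_KEY"):
        value = env.get(name)
        if value:
            return value
    # Nothing configured: the caller decides how to report this.
    return None


def vertex_enabled(flag: bool, env: Mapping[str, str]) -> bool:
    """True if --vertex was passed or GOOGLE_GENAI_USE_VERTEXAI is set to "true"."""
    return flag or env.get("GOOGLE_GENAI_USE_VERTEXAI", "") == "true"
```

<p>Passing the environment in as a mapping (for example <code class="language-plaintext highlighter-rouge">os.environ</code>) keeps the precedence logic easy to test without touching real credentials.</p>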

<h3 id="exit-codes">Exit codes</h3>

<table>
  <thead>
    <tr>
      <th>Code</th>
      <th>When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0</code></td>
      <td>Success — report written or printed.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">1</code></td>
      <td>Operational error — BigQuery auth failed, query failed, or no data was returned for the requested window. The CLI logs an actionable hint (e.g. “run <code class="language-plaintext highlighter-rouge">gcloud auth application-default login</code>”) before exiting.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">2</code></td>
      <td>Usage error — <code class="language-plaintext highlighter-rouge">--start &gt; --end</code>, an unknown dimension was requested, or no billing project was supplied.</td>
    </tr>
  </tbody>
</table>
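
<p>If you call the CLI from a scheduler or wrapper script, the three exit codes are enough to decide what to do next. The sketch below is one possible wrapper, not part of the repo: the function names are hypothetical, and the idea that only code <code class="language-plaintext highlighter-rouge">1</code> may be worth retrying later (for example after a delayed GA4 export lands) is an assumption, since an auth failure will not fix itself.</p>

```python
import subprocess
import sys

# Meanings taken from the exit-code table; the wording here is paraphrased.
EXIT_MEANINGS = {
    0: "success",
    1: "operational error (auth, query, or empty window)",
    2: "usage error (bad dates, unknown dimension, or missing billing project)",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into a short human-readable label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def should_retry(code: int) -> bool:
    """Assumption: only operational errors can succeed on a later attempt."""
    return code == 1


def run_cli(args: list[str]) -> int:
    """Run cli.py with the given arguments and return its exit code."""
    completed = subprocess.run([sys.executable, "cli.py", *args])
    return completed.returncode
```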

<h3 id="putting-it-together--three-canonical-invocations">Putting it together — three canonical invocations</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 1. Test against the public sample, no LLM, no tokens spent.</span>
python cli.py sample <span class="nt">--billing-project</span> my-gcp <span class="nt">--no-llm</span>

<span class="c"># 2. Run against your own export, last two weeks, three dimensions, save to a file.</span>
python cli.py run <span class="se">\</span>
    <span class="nt">--project-id</span> my-project <span class="se">\</span>
    <span class="nt">--dataset</span> analytics_123456789 <span class="se">\</span>
    <span class="nt">--start</span> 2026-05-01 <span class="nt">--end</span> 2026-05-15 <span class="se">\</span>
    <span class="nt">--dimensions</span> source_medium,device_category,country <span class="se">\</span>
    <span class="nt">--output</span> weekly-report.md

<span class="c"># 3. Same, but using Vertex AI for narration (gcloud auth, no API key).</span>
python cli.py run <span class="se">\</span>
    <span class="nt">--project-id</span> my-project <span class="se">\</span>
    <span class="nt">--dataset</span> analytics_123456789 <span class="se">\</span>
    <span class="nt">--start</span> 2026-05-01 <span class="nt">--end</span> 2026-05-15 <span class="se">\</span>
    <span class="nt">--vertex</span> <span class="se">\</span>
    <span class="nt">--model</span> gemini-3-preview <span class="se">\</span>
    <span class="nt">--output</span> weekly-report.md
</code></pre></div></div>

<hr />

<h2 id="using-it-from-claude-desktop--and-a-word-of-caution">Using it from Claude Desktop — and a word of caution</h2>

<p>The repo also ships an <code class="language-plaintext highlighter-rouge">mcp_server.py</code> that exposes the analyze pipeline as
an MCP tool. Any compatible client (Claude Desktop, Cursor, Claude Code,
Gemini CLI) can call it. The server returns the structured findings as
JSON; the client’s LLM narrates them in context. This is the closest you
can get to “ask Claude about my GA4 traffic” while keeping the math out of
the LLM’s hands.</p>

<p><strong>But this is the part where I have to be honest about what’s missing.</strong></p>

<p>This repo is <strong>pre-release</strong>. The MCP server in it is intentionally
minimal. It does <em>not</em> yet have any of the guardrails my other open-source
MCP server has:</p>

<table>
  <thead>
    <tr>
      <th>Guardrail</th>
      <th><a href="/projects/bigquery-readonly-mcp-server/">BigQuery Read-Only MCP</a></th>
      <th>GA4 Anomaly Detector MCP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Hard table allowlist</td>
      <td>Yes — parsed before query submission</td>
      <td>No — it queries whatever dataset you tell it to</td>
    </tr>
    <tr>
      <td>Per-query scan ceiling</td>
      <td>Yes — dry-run + reject above <code class="language-plaintext highlighter-rouge">MAX_SCAN_MB</code></td>
      <td>No</td>
    </tr>
    <tr>
      <td>Token-bucket rate limit</td>
      <td>Yes — configurable QPM + burst</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Result row cap</td>
      <td>Yes — <code class="language-plaintext highlighter-rouge">MAX_RESULT_ROWS</code> truncation</td>
      <td>No</td>
    </tr>
    <tr>
      <td>API key auth</td>
      <td>Yes — constant-time header validation</td>
      <td>No — runs locally, trusts the calling client</td>
    </tr>
    <tr>
      <td>Suitable for multi-tenant or customer-facing deployment</td>
      <td>Yes</td>
      <td><strong>No</strong></td>
    </tr>
  </tbody>
</table>

<p>The anomaly detector’s MCP server is designed for <strong>single-user, local
use</strong> — it runs on your laptop, against your own GA4 export, called by
your own Claude Desktop. In that context the missing guardrails matter far
less, because the only caller is a client you control. But if you’re
thinking about deploying this server to Cloud Run or putting it behind a
proxy so a team can share it, <strong>don’t</strong>, not without adding the layers
above first.</p>
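
<p>To make one of those layers concrete, here is a minimal token-bucket rate limiter of the kind the table refers to. It is a sketch under assumptions, not the implementation in either repo: the class name, the <code class="language-plaintext highlighter-rouge">qpm</code> and <code class="language-plaintext highlighter-rouge">burst</code> parameter names, and the injectable clock are all illustrative.</p>

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `qpm` queries per minute on average, with bursts up to `burst`."""

    def __init__(self, qpm: float, burst: int, clock: Callable[[], float] = time.monotonic):
        self.rate = qpm / 60.0        # tokens added per second
        self.capacity = float(burst)  # maximum tokens the bucket can hold
        self.tokens = float(burst)    # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

<p>A shared server would call <code class="language-plaintext highlighter-rouge">allow()</code> before every tool invocation and return an error to the client when it comes back false.</p>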

<p>The longer write-up on why guardrails matter is in
<a href="/blog/bigquery-cost-control-mcp/"><em>Stopping the $2,000 AI query</em></a> — the
short version is that any MCP server fronted by an LLM is one bad prompt
away from a four-figure invoice unless the <em>server</em> refuses to run queries
above a cost cap. The pattern from that post — dry-run, reject above
ceiling — is exactly the patch this repo needs before 1.0.</p>
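
<p>The dry-run-then-reject pattern is small. The sketch below is an assumption about how it could look here, not code from either repo: <code class="language-plaintext highlighter-rouge">enforce_scan_ceiling</code> and <code class="language-plaintext highlighter-rouge">ScanCeilingExceeded</code> are hypothetical names, and the byte estimate is injected as a callable. In a real server that callable would wrap a BigQuery dry run (in the Python client, a query job configured with <code class="language-plaintext highlighter-rouge">dry_run=True</code> reports <code class="language-plaintext highlighter-rouge">total_bytes_processed</code> without running the query).</p>

```python
from typing import Callable


class ScanCeilingExceeded(Exception):
    """Raised when a query's estimated scan is above the configured ceiling."""


def enforce_scan_ceiling(sql: str, estimate_bytes: Callable[[str], int], max_scan_mb: int) -> int:
    """Estimate the scan first and refuse to run the query if it is too large.

    estimate_bytes is injected so the check stays testable; it is assumed to
    return the number of bytes the query would process.
    """
    estimated = estimate_bytes(sql)
    ceiling = max_scan_mb * 1024 * 1024
    if estimated > ceiling:
        raise ScanCeilingExceeded(
            f"estimated scan {estimated} bytes exceeds ceiling {ceiling} bytes"
        )
    # Under the ceiling: the caller may now submit the real query.
    return estimated
```

<p>The important property is that the refusal happens in the server, before any billable work, so no prompt can talk its way past it.</p>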

<p>If you want to use this tool today, the safest mode is the <strong>CLI</strong>, not
the MCP server. The CLI runs once per command, against the dataset you
explicitly point it at, and exits. There’s no persistent process for a
prompt injection to talk to.</p>

<h2 id="faq">FAQ</h2>

<h3 id="why-not-just-use-looker-studios-built-in-anomaly-detection">Why not just use Looker Studio’s built-in anomaly detection?</h3>

<p>Looker Studio’s anomaly bands are useful for eyeballing. They don’t give
you change points, don’t give you mix shifts, don’t give you a narrative
you can paste into Slack, and don’t accept tuning parameters. This tool
produces something closer to what an analyst would write at the end of a
review — <em>“here are the four things you should know about this week”</em> —
not a chart you have to interpret.</p>

<h3 id="can-i-plug-in-a-different-llm-than-gemini">Can I plug in a different LLM than Gemini?</h3>

<p>Yes. The narrative module defines an <code class="language-plaintext highlighter-rouge">LLMClient</code> Protocol. <code class="language-plaintext highlighter-rouge">GeminiClient</code>
implements it; you can drop in an Anthropic or OpenAI client by writing
your own and passing it to <code class="language-plaintext highlighter-rouge">render_with_llm()</code>. The CLI doesn’t have a
flag for that yet — easy patch if anyone wants it.</p>
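
<p>A rough idea of what dropping in another client involves. Everything about the shape below is an assumption: the real <code class="language-plaintext highlighter-rouge">LLMClient</code> Protocol may use a different method name and signature, <code class="language-plaintext highlighter-rouge">EchoClient</code> is a stand-in rather than a real provider wrapper, and <code class="language-plaintext highlighter-rouge">narrate</code> is a hypothetical substitute for <code class="language-plaintext highlighter-rouge">render_with_llm()</code>. Check the narrative module for the actual interface.</p>

```python
from typing import Protocol


class LLMClient(Protocol):
    """Assumed shape of the protocol; the real method name may differ."""

    def generate(self, prompt: str) -> str: ...


class EchoClient:
    """Stand-in client for local testing; replace the body with a provider call."""

    def generate(self, prompt: str) -> str:
        return f"[narrative for {len(prompt)} chars of prompt]"


def narrate(findings_json: str, client: LLMClient) -> str:
    """Hypothetical stand-in for render_with_llm(): only narration is delegated."""
    prompt = "Summarize these precomputed findings without recomputing them:\n" + findings_json
    return client.generate(prompt)
```

<p>Because the protocol is structural, a wrapper around any provider’s SDK satisfies it as long as it exposes the same method; no inheritance is needed.</p>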

<h3 id="will-this-work-on-a-small-site-with-low-daily-traffic">Will this work on a small site with low daily traffic?</h3>

<p>Probably, with tuning. The defaults are picked for medium-traffic ecommerce
(roughly the shape of the public sample dataset). On a small site, drop
<code class="language-plaintext highlighter-rouge">--sigma-threshold</code> to <code class="language-plaintext highlighter-rouge">2.5</code> and <code class="language-plaintext highlighter-rouge">--pelt-penalty</code> to <code class="language-plaintext highlighter-rouge">5.0</code> so the
detectors still fire when the signal is weak relative to the noise. The mix-shift
detector is more forgiving — it operates on share-of-voice, not raw
volume, so it works fine even at low traffic.</p>
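
<p>Why a share-based detector is indifferent to volume is easy to show. The sketch below computes a Jensen-Shannon divergence between two channel mixes; it illustrates the idea and is not the repo’s detector. The function names and the sample channel counts are made up for the example.</p>

```python
import math
from typing import Mapping


def shares(counts: Mapping[str, float]) -> dict[str, float]:
    """Convert raw session counts per channel into shares that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def js_divergence(p: Mapping[str, float], q: Mapping[str, float]) -> float:
    """Jensen-Shannon divergence in bits between two share distributions."""
    div = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)  # the midpoint distribution
        if pk > 0:
            div += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            div += 0.5 * qk * math.log2(qk / mk)
    return div


# Illustrative data: same mix at 100x the volume, then a genuinely shifted mix.
small = shares({"organic": 60, "paid": 30, "email": 10})
large = shares({"organic": 6000, "paid": 3000, "email": 1000})
shifted = shares({"organic": 30, "paid": 60, "email": 10})
```

<p>Here <code class="language-plaintext highlighter-rouge">js_divergence(small, large)</code> is zero because the mix is identical, while <code class="language-plaintext highlighter-rouge">js_divergence(small, shifted)</code> is clearly positive even though total sessions did not change.</p>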

<h3 id="what-about-consent-mode-v2-and-modeled-data">What about Consent Mode v2 and modeled data?</h3>

<p>The detector reads whatever’s in your GA4 export. If your export contains
modeled data due to consent mode, the detector treats it the same as
observed data — the math doesn’t distinguish. That’s usually the right
call for <em>trend</em> detection (you want to see the modeled traffic move), but
it means a sudden change in the modeled/observed ratio (e.g., from a
cookie banner change) will show up as a real anomaly. That’s correct
behavior; just be aware when interpreting the result.</p>

<h3 id="why-a-weekend-project-not-a-full-product">Why a weekend project, not a full product?</h3>

<p>Because I wanted to know if the architecture worked before committing more
time. The benchmark — <em>do the structured findings actually make the
narrative more accurate?</em> — got a clear answer (yes, dramatically) and the
tool was already useful, so I shipped it. The roadmap in the README has
the obvious next steps (synthetic-anomaly tests, more detectors, the
guardrails for MCP mode) — happy to take PRs.</p>

<h3 id="how-does-this-interact-with-the-bigquery-mcp-server">How does this interact with the BigQuery MCP server?</h3>

<p>They’re complementary. The
<a href="https://github.com/hugonissar/BigQuery-Read-Only-MCP-Server">BigQuery Read-Only MCP server</a>
is for general-purpose SQL access from an agent, with guardrails. This
anomaly detector is for a specific computation (anomaly detection on GA4)
where the computation is in code and only the narration is the agent’s
job. If you connect Claude to <em>both</em> servers, you can ask Claude “did
anything weird happen last week” (handled by the anomaly detector) and
then “show me the top 10 affected pages” (handled by the BigQuery server).
Different jobs; different tools.</p>

<h2 id="source-code">Source code</h2>

<p>Full source — STL/PELT/JS-divergence detectors, BigQuery fetcher, MCP
server, CLI:
<a href="https://github.com/hugonissar/GA4-Anomaly-Detector">github.com/hugonissar/GA4-Anomaly-Detector</a>.
MIT-licensed. Pre-release; PRs welcome, especially around the guardrails
work needed before the MCP server is safe to share across a team.</p>]]></content><author><name>Hugo Nissar</name></author><category term="mcp" /><category term="ga4" /><category term="anomaly-detection" /><category term="claude" /><category term="gemini" /><category term="llm-trust" /><summary type="html"><![CDATA[A weekend project that started as a benchmark — can an LLM be trusted to compute anomalies on raw GA4 data, or should the math stay in code? — and became a GA4 anomaly detector with the trust boundary drawn explicitly. Plus the full CLI reference and a word of caution about running it as an unguarded MCP server.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://hugonissar.github.io/assets/images/og-default.png" /><media:content medium="image" url="https://hugonissar.github.io/assets/images/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Forecasting GA4 revenue in pure SQL with BigQuery ML — and the model-quality checks you can’t skip</title><link href="https://hugonissar.github.io/blog/ga4-bigquery-ml-sales-forecast/" rel="alternate" type="text/html" title="Forecasting GA4 revenue in pure SQL with BigQuery ML — and the model-quality checks you can’t skip" /><published>2026-05-18T00:00:00+02:00</published><updated>2026-05-18T00:00:00+02:00</updated><id>https://hugonissar.github.io/blog/ga4-bigquery-ml-sales-forecast</id><content type="html" xml:base="https://hugonissar.github.io/blog/ga4-bigquery-ml-sales-forecast/"><![CDATA[<p>A revenue forecast is a number that looks the same whether the model fit your
data brilliantly or learned almost nothing useful. The chart in Data Studio
draws a line either way. That’s the trap with auto-ARIMA forecasting in
BigQuery ML — you get output before you’ve earned the right to trust it, and
the output is presentable enough to end up in a deck.</p>

<p>This post walks through a new pre-release pipeline I’ve been building,
<a href="https://github.com/hugonissar/GA4-BigQuery-ML-Sales-Forecast"><strong>GA4-BigQuery-ML-Sales-Forecast</strong></a>,
which produces a 30-day daily revenue forecast in pure SQL from your GA4
BigQuery export. The architecture is interesting — pure SQL, no Python, no
Vertex AI, no Cloud Functions — but the more important story is the
<strong>model-quality checks</strong>, because they’re what decides whether the forecast
is useful or decorative.</p>

<h2 id="tldr">TL;DR</h2>

<table>
  <thead>
    <tr>
      <th> </th>
      <th> </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>What it is</strong></td>
      <td>A pure-SQL pipeline that produces a 30-day daily GA4 revenue forecast, with custom holidays, automatic seasonality, ad-spend regressors, and seasonal sub-forecasts for upstream funnel metrics</td>
    </tr>
    <tr>
      <td><strong>How it runs</strong></td>
      <td>One scheduled query refreshes data daily; another retrains the model weekly. Output is a single BigQuery view that plugs straight into Data Studio</td>
    </tr>
    <tr>
      <td><strong>Algorithm</strong></td>
      <td><code class="language-plaintext highlighter-rouge">ARIMA_PLUS_XREG</code> for the main revenue model, four <code class="language-plaintext highlighter-rouge">ARIMA_PLUS</code> univariate models for the funnel sub-forecasts</td>
    </tr>
    <tr>
      <td><strong>Status</strong></td>
      <td><strong>Pre-release.</strong> Validated on synthetic and small production datasets. Real-world testing wanted</td>
    </tr>
    <tr>
      <td><strong>License</strong></td>
      <td>MIT</td>
    </tr>
    <tr>
      <td><strong>The catch</strong></td>
      <td>Auto-ARIMA gives a confident output whether or not the model fits well. Skip §4 of the SQL and you have no idea what you’ve built</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="what-the-pipeline-does">What the pipeline does</h2>

<p>Two phases:</p>

<ol>
  <li><strong>Build</strong> (first run, then weekly retrain). Aggregate GA4 events and ad
metrics into a daily fact table. Train one <code class="language-plaintext highlighter-rouge">ARIMA_PLUS_XREG</code> model on
revenue, plus four small <code class="language-plaintext highlighter-rouge">ARIMA_PLUS</code> sub-models on the funnel metrics
that will become the future regressors.</li>
  <li><strong>Forecast</strong> (daily refresh, live on every query). The sub-forecasts
predict tomorrow’s funnel volume. The trained XREG model uses those plus
your planned ad spend to predict tomorrow’s revenue. A view unions
forecast with history for Data Studio.</li>
</ol>

<p>After running the script you end up with <strong>11 objects</strong> in your dataset —
3 tables, 6 models, 2 views. No Python, no Cloud Functions, no Vertex
endpoints, no orchestrator. Everything is scheduled queries plus views.</p>

<h3 id="what-makes-the-design-interesting">What makes the design interesting</h3>

<p>A few choices worth highlighting:</p>

<ul>
  <li>
    <p><strong>Funnel-as-regressors with their own forecasts.</strong> The main model uses
<code class="language-plaintext highlighter-rouge">view_item</code>, <code class="language-plaintext highlighter-rouge">add_to_cart</code>, <code class="language-plaintext highlighter-rouge">begin_checkout</code>, and ad-spend variables as
exogenous regressors. The challenge with regressors is always <em>“how do I
know tomorrow’s value?”</em> — solved here by training a separate
<code class="language-plaintext highlighter-rouge">ARIMA_PLUS</code> model per regressor, so the system forecasts its own inputs.
Forecasting the inputs this way works because the funnel metrics are
themselves seasonal and stable enough to predict on their own.</p>
  </li>
  <li>
    <p><strong>Custom occasions table.</strong> Built-in holiday regions catch most
calendar effects, but they don’t know about your Black Friday week, your
May campaign, or your anniversary sale. The <code class="language-plaintext highlighter-rouge">custom_holidays</code> table lets
you name these and tag a window around each. The model learns a lift
per named occasion and applies it to future occurrences.</p>
  </li>
  <li>
    <p><strong>Scenario planning baked in.</strong> Because ad spend is an explicit
regressor, the forecast reacts to changes in your media plan. Override
the trailing-average defaults in <code class="language-plaintext highlighter-rouge">future_regressors</code> with values from a
<code class="language-plaintext highlighter-rouge">media_plan</code> table and the forecast updates the next time the view is
queried.</p>
  </li>
  <li>
    <p><strong>Explainability view as a first-class citizen.</strong> §8 of the SQL
builds a <code class="language-plaintext highlighter-rouge">sales_forecast_explained</code> view backed by <code class="language-plaintext highlighter-rouge">ML.EXPLAIN_FORECAST</code>.
Every forecasted point decomposes into trend, weekly seasonality, yearly
seasonality, holiday lift, and per-regressor attribution. This is the
view that makes the eval step actionable rather than abstract.</p>
  </li>
</ul>

<h2 id="why-you-cannot-trust-auto-arima-without-evaluating-it">Why you cannot trust auto-ARIMA without evaluating it</h2>

<p>This is the part that matters most, so it gets its own section.</p>

<p><code class="language-plaintext highlighter-rouge">ARIMA_PLUS_XREG</code> is a great default — Google’s implementation does
automatic order selection, automatic seasonality detection, automatic
outlier handling, and automatic structural-break detection. The “automatic”
is the appeal and the risk in the same breath. The model <strong>always</strong> fits
something. It always produces a prediction. It always plots a smooth
forecast curve with confidence bands. None of that tells you whether the
fit is any good.</p>

<p>Real ways the same pipeline can go wrong while producing identical-looking
output:</p>

<ul>
  <li><strong>The training data has a structural break the model can’t reach over.</strong>
A pricing change six months ago, a new market launched, a viral spike
during one campaign. Auto-ARIMA may or may not flag these as outliers;
even when it does, the rest of the model is fit on data that’s no longer
representative of the present. The forecast looks plausible — and is
systematically biased.</li>
  <li><strong>A regressor is silently doing nothing.</strong> You added ad spend as a
regressor expecting it to drive the forecast. The model gave it
near-zero weight because in your historical data, spend and revenue
weren’t correlated (maybe revenue is dominated by organic, maybe spend
is too steady to be informative). Now your “spend-aware” forecast isn’t
spend-aware at all, and your scenario plans don’t shift the line.</li>
  <li><strong>Yearly seasonality is overfit on too little history.</strong> With less than
about 18 months of clean data, the yearly seasonality term picks up noise as
if it were signal. Next January’s forecast reflects last January’s
freak week.</li>
  <li><strong>A custom occasion was added with one historical example.</strong> The model
learns whatever happened that one Black Friday and projects it forward
with high confidence. One data point is not a season.</li>
  <li><strong>Recent data is missing or stale.</strong> GA4 export delays, a paused tag, a
broken scheduled query upstream — and the model trains on a truncated
view of the last 30 days. The trend term picks up the artificial drop
and forecasts continued decline.</li>
</ul>

<p>None of those situations produce an error. All of them produce a chart.</p>

<h2 id="the-validation-workflow--4-of-the-sql">The validation workflow — §4 of the SQL</h2>

<p>§4 of the repo’s SQL is the eval step, and it’s the section that should
never be skipped on a first run or after a major retrain. The pattern:</p>

<ol>
  <li>Train an <strong>eval model</strong> identical to production, but on data ending
<strong>N days ago</strong> (default: 30 days).</li>
  <li>Run <code class="language-plaintext highlighter-rouge">ML.EVALUATE</code> against the held-out N days that the eval model
never saw.</li>
  <li>Read the metrics.</li>
  <li><strong>Drop the eval model.</strong> It exists only for this measurement; the
production model is the one that trained on everything.</li>
</ol>
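
<p>The date arithmetic behind that split is worth spelling out, because an off-by-one here leaks held-out days into training. A small sketch, with a hypothetical helper name and the 30-day default from the list above:</p>

```python
from datetime import date, timedelta


def holdout_window(today: date, n_days: int = 30) -> tuple[date, date, date]:
    """Return (train_end, holdout_start, holdout_end) for an N-day backtest."""
    holdout_end = today - timedelta(days=1)         # yesterday, the last complete day
    holdout_start = today - timedelta(days=n_days)  # first day the eval model never saw
    train_end = holdout_start - timedelta(days=1)   # eval model trains up to here
    return train_end, holdout_start, holdout_end


train_end, holdout_start, holdout_end = holdout_window(date(2026, 5, 31))
```

<p>With the default, the held-out window is exactly 30 days long and the eval model’s training data stops the day before it begins.</p>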

<p>The key SQL for the evaluation step is small enough to read in one breath:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="n">ML</span><span class="p">.</span><span class="n">EVALUATE</span><span class="p">(</span>
  <span class="n">MODEL</span> <span class="nv">`YOUR_PROJECT.YOUR_DATASET.revenue_forecast_eval`</span><span class="p">,</span>
  <span class="p">(</span>
    <span class="k">SELECT</span> <span class="k">day</span><span class="p">,</span> <span class="n">revenue</span><span class="p">,</span> <span class="n">view_item</span><span class="p">,</span> <span class="n">add_to_cart</span><span class="p">,</span>
           <span class="n">begin_checkout</span><span class="p">,</span> <span class="n">spend</span><span class="p">,</span> <span class="n">impressions</span><span class="p">,</span> <span class="n">clicks</span>
    <span class="k">FROM</span> <span class="nv">`YOUR_PROJECT.YOUR_DATASET.daily_features`</span>
    <span class="k">WHERE</span> <span class="k">day</span> <span class="k">BETWEEN</span> <span class="n">DATE_SUB</span><span class="p">(</span><span class="k">CURRENT_DATE</span><span class="p">(),</span> <span class="n">INTERVAL</span> <span class="mi">30</span> <span class="k">DAY</span><span class="p">)</span>
                  <span class="k">AND</span> <span class="n">DATE_SUB</span><span class="p">(</span><span class="k">CURRENT_DATE</span><span class="p">(),</span> <span class="n">INTERVAL</span> <span class="mi">1</span> <span class="k">DAY</span><span class="p">)</span>
  <span class="p">),</span>
  <span class="n">STRUCT</span><span class="p">(</span><span class="mi">0</span><span class="p">.</span><span class="mi">95</span> <span class="k">AS</span> <span class="n">confidence_level</span><span class="p">,</span> <span class="k">TRUE</span> <span class="k">AS</span> <span class="n">perform_aggregation</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>You get back five numbers: <code class="language-plaintext highlighter-rouge">mean_absolute_error</code>,
<code class="language-plaintext highlighter-rouge">mean_squared_error</code>, <code class="language-plaintext highlighter-rouge">root_mean_squared_error</code>,
<code class="language-plaintext highlighter-rouge">mean_absolute_percentage_error</code>, and <code class="language-plaintext highlighter-rouge">symmetric_mean_absolute_percentage_error</code>.</p>

<p>The one you care about most is <strong>MAPE</strong> (mean absolute percentage error).</p>

<h2 id="what-mape-numbers-actually-mean">What MAPE numbers actually mean</h2>

<p>There’s a reasonable amount of confusion about what counts as a “good”
MAPE for revenue forecasting because the right answer depends on your
business volatility. Practical operating ranges for daily ecommerce
revenue:</p>

<table>
  <thead>
    <tr>
      <th>MAPE</th>
      <th>Interpretation</th>
      <th>What to do</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>&lt; 8%</strong></td>
      <td>Excellent. Usually means very stable revenue (subscription, B2B) and clean data</td>
      <td>Trust it. Use it for planning</td>
    </tr>
    <tr>
      <td><strong>8–15%</strong></td>
      <td>Healthy. The pipeline is working as expected for typical ecommerce with normal noise</td>
      <td>Trust it. Most forecasts in production live here</td>
    </tr>
    <tr>
      <td><strong>15–25%</strong></td>
      <td>Usable but verify. Likely either high day-to-day volatility, modest data history, or a recently changed business</td>
      <td>Use for <em>directional</em> planning. Don’t pin budget decisions to specific forecast values</td>
    </tr>
    <tr>
      <td><strong>25–40%</strong></td>
      <td>Don’t trust this in isolation. The model is fitting something, but it’s noisy enough that a 30% miss tomorrow would not be surprising</td>
      <td>Investigate before publishing the forecast</td>
    </tr>
    <tr>
      <td><strong>&gt; 40%</strong></td>
      <td>Broken. Either data quality, training history, or a structural mismatch with the model</td>
      <td>Diagnose; do not deploy</td>
    </tr>
  </tbody>
</table>
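
<p>If you want the eval step to gate a deployment automatically, the table translates directly into a lookup. This is a sketch with assumptions: the function name and band labels are mine, boundary values are assigned to the higher band, and the input is expected as a percentage, so check whether the MAPE value you read is a fraction or a percent before passing it in.</p>

```python
def mape_band(mape_pct: float) -> str:
    """Map a MAPE expressed in percent to the operating bands from the table."""
    if mape_pct < 8:
        return "excellent"
    if mape_pct < 15:
        return "healthy"
    if mape_pct < 25:
        return "usable, verify"
    if mape_pct <= 40:
        return "do not trust in isolation"
    return "broken"


def safe_to_publish(mape_pct: float) -> bool:
    """Assumed policy: publish automatically only in the two best bands."""
    return mape_band(mape_pct) in ("excellent", "healthy")
```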

<p>A subtlety: MAPE weights every day’s percent error equally, regardless of that
day’s volume. A 50% miss on a $100 day and a 50% miss on a $100,000 day
count the same. Percent errors tend to be largest on low-volume days, where a
small absolute miss is a large fraction of actual revenue, so if your business
has many such days your MAPE will read worse than the model deserves. Look at
<strong>MAE</strong> in actual currency units to sanity-check.</p>

<p><code class="language-plaintext highlighter-rouge">SMAPE</code> (symmetric MAPE) is more forgiving on small-denominator days; if
MAPE is much worse than SMAPE on the same period, low-volume days are
distorting the average.</p>
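
<p>A two-day toy example shows how the three metrics disagree when one day is tiny. The formulas below are the common textbook definitions, returned as fractions rather than percentages; BigQuery ML’s exact SMAPE formula is not confirmed here, so treat this as an illustration of the effect, not a reimplementation of <code class="language-plaintext highlighter-rouge">ML.EVALUATE</code>. The revenue figures are invented.</p>

```python
from typing import Sequence


def mae(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute error, in the same currency units as the inputs."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error as a fraction; actuals must be non-zero."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric MAPE as a fraction; a 0/0 day contributes zero error."""
    terms = []
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        terms.append(0.0 if denom == 0 else 2 * abs(a - f) / denom)
    return sum(terms) / len(terms)


# One near-empty day missed by 20 units, one normal day forecast exactly.
actual = [10.0, 1000.0]
forecast = [30.0, 1000.0]
```

<p>On this data MAPE is 100%, SMAPE is 50%, and MAE is 10 currency units: the model is off by a trivial amount in money terms, and the gap between MAPE and SMAPE points at the low-volume day as the culprit.</p>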

<h2 id="reading-mlexplain_forecast-where-the-prediction-comes-from">Reading ML.EXPLAIN_FORECAST: where the prediction comes from</h2>

<p>Knowing the model has a decent MAPE is necessary but not sufficient. The
next question is: <em>where is each forecasted dollar coming from?</em> If you
deploy a “spend-aware” forecast and 95% of every prediction’s value is
attributed to the trend term, your forecast isn’t actually responding to
spend; it’s a trend line dressed up as a media plan.</p>

<p>The <code class="language-plaintext highlighter-rouge">sales_forecast_explained</code> view (§8) makes this visible. Each
forecasted day has columns like:</p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>What it tells you</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">attribution_trend</code></td>
      <td>The slow-moving level component</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">attribution_weekly_seasonality</code></td>
      <td>The day-of-week effect (your Tuesday is +12%, your Sunday is −20%, etc.)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">attribution_yearly_seasonality</code></td>
      <td>The calendar position effect (you’re up 8% relative to the yearly average)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">attribution_holiday</code></td>
      <td>Lift from named occasions (Black Friday, your custom sales)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">attribution_spend</code></td>
      <td>How much of tomorrow’s predicted revenue is driven by tomorrow’s planned spend</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">attribution_view_item</code>, <code class="language-plaintext highlighter-rouge">attribution_add_to_cart</code>, etc.</td>
      <td>Funnel metric contributions</td>
    </tr>
  </tbody>
</table>

<p>Two questions to ask every time you look at this view:</p>

<ol>
  <li>
    <p><strong>Does one term dominate everything else?</strong> A healthy model spreads
attribution across components. If <code class="language-plaintext highlighter-rouge">attribution_trend</code> is 90% of every
forecast, your regressors and seasonality are doing almost nothing —
the model has effectively collapsed to a plain trend line.</p>
  </li>
  <li>
    <p><strong>Does the attribution match your intuition?</strong> If you know spend is the
most important driver of your revenue and <code class="language-plaintext highlighter-rouge">attribution_spend</code> is near
zero, something is wrong upstream. Most often: spend and revenue aren’t
actually correlated in your historical data (verify with a quick
<code class="language-plaintext highlighter-rouge">CORR()</code>), the regressor was added recently with too little history, or
the regressor’s variance is too low for the model to learn from.</p>
  </li>
</ol>
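
<p>The first question can be automated once a row of the explained view is loaded into a dict. The sketch below is illustrative: the helper name and the 0.9 threshold are assumptions, and it compares absolute values because seasonal and holiday components can be negative. The sample rows are invented.</p>

```python
from typing import Mapping, Optional


def dominant_component(attribution: Mapping[str, float], threshold: float = 0.9) -> Optional[str]:
    """Return the component holding at least `threshold` of total attribution, if any."""
    total = sum(abs(v) for v in attribution.values())
    if total == 0:
        return None
    name, value = max(attribution.items(), key=lambda kv: abs(kv[1]))
    return name if abs(value) / total >= threshold else None


# Invented rows: one collapsed onto trend, one with attribution spread out.
collapsed = {
    "attribution_trend": 950.0,
    "attribution_weekly_seasonality": 30.0,
    "attribution_spend": 20.0,
}
balanced = {
    "attribution_trend": 500.0,
    "attribution_weekly_seasonality": -200.0,
    "attribution_spend": 300.0,
}
```

<p>Running this over every forecasted day and counting how often it returns <code class="language-plaintext highlighter-rouge">attribution_trend</code> gives a quick read on whether the regressors are contributing at all.</p>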

<h2 id="what-to-do-when-the-eval-fails">What to do when the eval fails</h2>

<p>Concrete responses to the most common failure modes:</p>

<table>
  <thead>
    <tr>
      <th>Symptom</th>
      <th>Likely cause</th>
      <th>Response</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MAPE &gt; 25% on a long-running stable business</td>
      <td>Recent structural break (pricing, market, product mix)</td>
      <td>Limit training window to post-break period via <code class="language-plaintext highlighter-rouge">data_window</code> in the model options. Accept some short-term volatility cost</td>
    </tr>
    <tr>
      <td>MAPE &gt; 25% on a young business</td>
      <td>Not enough history — yearly seasonality overfits</td>
      <td>Either disable yearly seasonality (<code class="language-plaintext highlighter-rouge">auto_arima=TRUE</code> with <code class="language-plaintext highlighter-rouge">seasonalities=['WEEKLY']</code> only) or wait for more history before going to production</td>
    </tr>
    <tr>
      <td>MAPE looks fine but <code class="language-plaintext highlighter-rouge">attribution_spend</code> is near zero</td>
      <td>Spend and revenue weren’t correlated historically</td>
      <td>Either remove spend as a regressor, or check whether the right lag is being used (today’s spend often doesn’t drive today’s revenue — try a 1- or 2-day lag)</td>
    </tr>
    <tr>
      <td>One huge spike day skews everything</td>
      <td>Auto-outlier detection didn’t catch it</td>
      <td>Add the day to <code class="language-plaintext highlighter-rouge">custom_holidays</code> with <code class="language-plaintext highlighter-rouge">holiday_name='one_off'</code> and re-train, or filter it out of <code class="language-plaintext highlighter-rouge">daily_features</code></td>
    </tr>
    <tr>
      <td>Forecasts look flat over the next 30 days</td>
      <td>Funnel sub-forecasts have too little data to learn weekday seasonality</td>
      <td>Confirm <code class="language-plaintext highlighter-rouge">daily_features</code> has at least 13 months of data and the GA4 ecommerce events are actually firing</td>
    </tr>
  </tbody>
</table>
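
<p>Two checks in the table above are easy to run by hand before touching the model: the MAPE itself, and whether spend lines up better with revenue at a 1- or 2-day lag. A minimal sketch, assuming you have exported daily actuals, forecasts, and spend as plain Python lists; the function names and the toy numbers are illustrative and not part of the pipeline.</p>

```python
from math import sqrt


def mape(actual, forecast):
    """Mean absolute percentage error in percent, skipping zero-revenue days."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def lagged_correlation(spend, revenue, lag):
    """Pearson correlation between spend on day t and revenue on day t + lag."""
    x = spend[: len(spend) - lag]
    y = revenue[lag:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy data (made up): revenue follows spend with a one-day delay.
spend = [100, 300, 150, 400, 200, 350, 120]
revenue = [500, 1000, 3000, 1500, 4000, 2000, 3500]

for lag in (0, 1, 2):
    print(lag, round(lagged_correlation(spend, revenue, lag), 3))
```

<p>On this toy series the lag-1 correlation is 1.0 by construction. On real data, a clear peak at lag 1 or 2 is the hint that the regressor should be shifted before training.</p>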

<h2 id="a-note-on-retraining-cadence">A note on retraining cadence</h2>

<p>The pipeline schedules a weekly retrain by default. That’s the right
default for most ecommerce businesses — the underlying behavior changes
slowly enough that weekly is fresh enough, and <code class="language-plaintext highlighter-rouge">ARIMA_PLUS_XREG</code> training
is cheap enough that running it every Monday morning costs essentially
nothing.</p>

<p>Two cases where you’d want to retrain more often:</p>

<ul>
  <li><strong>You’re heading into a high-stakes period</strong> (Black Friday, an
anniversary sale) and the most recent four weeks contain the kind of
signal you want the model to weight. Daily retrains for two weeks
before the event, then revert to weekly.</li>
  <li><strong>A regressor just changed regime</strong> (you doubled your media budget, a
new channel went live). Retrain on demand once you have 2–3 weeks of
data at the new level so the model recalibrates its regressor weights.</li>
</ul>

<p>And one case where you’d want to retrain <em>less</em> often:</p>

<ul>
  <li><strong>You ran §4 and got a great MAPE.</strong> Don’t retrain mid-week if you’ve
already validated the current model. Weekly retrains are about keeping
the model fresh; if the current model is fitting well and the underlying
business is stable, additional retraining only adds variance.</li>
</ul>
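
<p>The cadence rules above can be written down as one small decision function. This is a sketch under stated assumptions: the 14-day lead window and the single retrain exactly 14 days after a regime change are my reading of “two weeks” and “2–3 weeks”, and <code class="language-plaintext highlighter-rouge">should_retrain</code> is a hypothetical helper, not something the pipeline ships.</p>

```python
from datetime import date


def should_retrain(today, event_dates=(), regime_change_date=None):
    """Return True when a retrain should run on `today`."""
    # Daily retrains during the two weeks leading up to a high-stakes event.
    for event in event_dates:
        days_to_event = (event - today).days
        if 14 >= days_to_event >= 0:
            return True
    # One on-demand retrain once two weeks of data exist at the new level.
    if regime_change_date is not None and (today - regime_change_date).days == 14:
        return True
    # Otherwise fall back to the weekly Monday retrain.
    return today.weekday() == 0


black_friday = date(2026, 11, 27)
print(should_retrain(date(2026, 11, 18), event_dates=[black_friday]))
```

<p>Outside the event window and with no regime change, the function only says yes on Mondays, which matches the default schedule.</p>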

<h2 id="this-is-pre-release--testers-wanted">This is pre-release — testers wanted</h2>

<p>The pipeline has been validated on synthetic data and small production
datasets, but real-world ecommerce data has quirks that only show up at
scale. The README has a <a href="https://github.com/hugonissar/GA4-BigQuery-ML-Sales-Forecast#contributing">Contributing</a>
section laying out exactly what would help most — currently I’m looking
for testers with 2+ years of stable GA4 export history and at least 500
daily orders or 50k daily sessions, especially in less-tested setups:
subscription/recurring revenue, heavy promotional calendars, non-US
holiday regions.</p>

<p>The contribution loop is small: run the pipeline on your data, open an
issue with your MAPE, business profile, and anything that broke or
surprised you. The goal is a public benchmark table of realistic MAPE
numbers across different business types so other users have a calibration
point before they deploy.</p>

<h2 id="faq">FAQ</h2>

<h3 id="how-is-this-different-from-googles-ga4-predictive-metrics">How is this different from Google’s GA4 predictive metrics?</h3>

<p>GA4 ships predictive metrics (purchase probability, predicted revenue) at
the <em>user</em> level — given this user’s behavior, how likely are they to
purchase? Useful for audience targeting. This pipeline operates at the
<em>business</em> level — given the whole property, what’s tomorrow’s total
revenue? Different question, different model class, complementary outputs.</p>

<h3 id="why-arima_plus_xreg-and-not-prophet-neuralprophet-or-deepar">Why ARIMA_PLUS_XREG and not Prophet, NeuralProphet, or DeepAR?</h3>

<p><code class="language-plaintext highlighter-rouge">ARIMA_PLUS_XREG</code> is the only BQML-native option that supports exogenous
regressors with built-in seasonality and holiday handling. Prophet would
work but requires moving data out of BigQuery into Python. NeuralProphet
and DeepAR require even more infrastructure. For a marketing analytics
team that lives in BigQuery + Data Studio, the operational cost of
introducing a Python orchestrator is high. The pipeline trades off some
modeling ceiling for radically lower ops complexity.</p>

<h3 id="what-if-i-dont-have-ad-spend-data">What if I don’t have ad spend data?</h3>

<p>You can run the pipeline with only the funnel regressors — drop <code class="language-plaintext highlighter-rouge">spend</code>,
<code class="language-plaintext highlighter-rouge">impressions</code>, and <code class="language-plaintext highlighter-rouge">clicks</code> from §1 and from the model options in §3. The
ad-aware scenario planning goes away, but the seasonal forecasting still
works. Expect MAPE to be 2–5 points worse depending on how much of your
revenue volatility was driven by spend.</p>

<h3 id="how-does-this-handle-consent-mode-v2">How does this handle Consent Mode v2?</h3>

<p>Revenue is computed from <code class="language-plaintext highlighter-rouge">purchase</code> events in the GA4 export, which
respect consent settings at collection time: non-consented users either
don’t appear (basic Consent Mode) or appear as cookieless pings without
identifiers (advanced Consent Mode). GA4’s behavioral modeling is applied in
the GA4 interface and is not part of the export. The pipeline doesn’t add a
separate consent filter because at the aggregate-revenue level you want all
observed revenue, whatever the consent state.</p>

<h3 id="can-i-forecast-more-than-30-days-out">Can I forecast more than 30 days out?</h3>

<p>You can — extend the forecast horizon in <code class="language-plaintext highlighter-rouge">ML.FORECAST</code> and in
<code class="language-plaintext highlighter-rouge">future_regressors</code>. Expect accuracy to fall off after roughly 30 days
because the sub-forecasts driving the regressors degrade in the same way
the main model does. For longer horizons, switch to weekly granularity
(aggregate <code class="language-plaintext highlighter-rouge">daily_features</code> to weekly first) or expect 25–35% MAPE in
weeks 5–8.</p>
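
<p>For the weekly-granularity route, the aggregation itself is simple. A minimal sketch, assuming daily rows arrive as <code class="language-plaintext highlighter-rouge">(date, revenue)</code> pairs and that weeks start on Monday; in the real pipeline you would do the same roll-up in SQL and aggregate every regressor column, not just revenue.</p>

```python
from collections import defaultdict
from datetime import date, timedelta


def to_weekly(daily_rows):
    """Sum daily revenue into weeks keyed by the Monday that starts each week."""
    weekly = defaultdict(float)
    for day, revenue in daily_rows:
        week_start = day - timedelta(days=day.weekday())
        weekly[week_start] += revenue
    return dict(sorted(weekly.items()))


start = date(2026, 6, 1)  # a Monday
daily = [(start + timedelta(days=i), 100.0) for i in range(14)]
print(to_weekly(daily))
```

<p>Fourteen days of 100.0 collapse into two weekly rows of 700.0 each. Drop partial weeks at either end of the real series before training so they don’t look like revenue dips.</p>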

<h3 id="are-there-cost-limits">Are there cost limits?</h3>

<p><code class="language-plaintext highlighter-rouge">ARIMA_PLUS_XREG</code> training scans the full training window each time. For
2 years of daily data with ~10 regressor columns, each train scans on the
order of a few megabytes — well under a cent per train at on-demand
pricing. Daily refresh is similarly trivial. Even on the most generous
weekly retrain + daily refresh schedule, total monthly BigQuery cost from
this pipeline is in the low single dollars.</p>

<h2 id="source-code">Source code</h2>

<p>Full SQL, dataflow diagrams, IAM setup, troubleshooting guide, and the
contributing checklist are at
<a href="https://github.com/hugonissar/GA4-BigQuery-ML-Sales-Forecast">github.com/hugonissar/GA4-BigQuery-ML-Sales-Forecast</a>.
MIT-licensed (placeholder, will finalize for 1.0).</p>

<p>If you run it on production data, please open an issue with your MAPE and
business profile — that’s the single most useful contribution at this
stage of the project.</p>]]></content><author><name>Hugo Nissar</name></author><category term="bqml" /><category term="ga4" /><category term="forecasting" /><category term="arima" /><category term="model-quality" /><summary type="html"><![CDATA[A pre-release pure-SQL pipeline that turns the GA4 BigQuery export into a 30-day revenue forecast — with custom holidays, ad-spend regressors, and seasonal sub-forecasts. The architecture is interesting; the model-quality step is what decides whether to trust the output.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://hugonissar.github.io/assets/images/og-default.png" /><media:content medium="image" url="https://hugonissar.github.io/assets/images/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">How to build a GA4 purchase propensity model with BigQuery ML (without Vertex AI)</title><link href="https://hugonissar.github.io/blog/ga4-purchase-propensity-bqml/" rel="alternate" type="text/html" title="How to build a GA4 purchase propensity model with BigQuery ML (without Vertex AI)" /><published>2026-05-15T00:00:00+02:00</published><updated>2026-05-15T00:00:00+02:00</updated><id>https://hugonissar.github.io/blog/ga4-purchase-propensity-bqml</id><content type="html" xml:base="https://hugonissar.github.io/blog/ga4-purchase-propensity-bqml/"><![CDATA[<p>GA4 ships predictive audiences out of the box — <em>purchase probability</em>,
<em>churn probability</em> — and they’re a fine default if you have the traffic for
them. The threshold is ≥1,000 returning users with purchases <em>and</em> ≥1,000
returning users without, in the past 28 days. Many small and mid-sized
ecommerce sites never cross it.</p>

<p>If that’s you, you can roll your own with <strong>BigQuery ML</strong>. No Vertex AI, no
AutoML, no third-party SaaS — just BigQuery, a Cloud Function, the GA4
Measurement Protocol, and Cloud Scheduler. This post walks through the design
end-to-end, with the full source open on GitHub.</p>

<h2 id="what-the-pipeline-looks-like">What the pipeline looks like</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GA4 → BigQuery export → weekly BQML training → daily Cloud Function scoring
                                                          ↓
                                            GA4 Measurement Protocol push
                                                          ↓
                                            GA4 audiences → Google Ads
</code></pre></div></div>

<p>Two files, one model, one Cloud Function. Weekly training runs as a scheduled
BigQuery query (Sunday 00:01). Daily scoring runs as a Cloud Function (06:00
local time) that scores every user active in the last 24 hours, buckets them
into <code class="language-plaintext highlighter-rouge">low</code> / <code class="language-plaintext highlighter-rouge">mid</code> / <code class="language-plaintext highlighter-rouge">high</code>, and pushes the bucket back to GA4 as a user
property via the Measurement Protocol.</p>
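
<p>To make the push step concrete, here is roughly what one Measurement Protocol request looks like. A sketch, not the repo’s Cloud Function: the user property name <code class="language-plaintext highlighter-rouge">propensity_bucket</code> and the event name <code class="language-plaintext highlighter-rouge">propensity_scored</code> are placeholders I picked for illustration, and the function only builds the URL and JSON body so the caller can POST it with whatever HTTP client it already uses.</p>

```python
import json
from urllib.parse import urlencode

MP_ENDPOINT = "https://www.google-analytics.com/mp/collect"


def build_mp_request(measurement_id, api_secret, client_id, bucket):
    """Build the URL and JSON body for one user-property update."""
    query = urlencode({"measurement_id": measurement_id, "api_secret": api_secret})
    url = f"{MP_ENDPOINT}?{query}"
    payload = {
        "client_id": client_id,
        # Hypothetical property name; register it as a user-scoped custom dimension.
        "user_properties": {"propensity_bucket": {"value": bucket}},
        # User properties travel with an event; this event name is a placeholder.
        "events": [{"name": "propensity_scored", "params": {}}],
    }
    return url, json.dumps(payload)


url, body = build_mp_request("G-XXXXXXX", "my-secret", "123.456", "high")
print(url)
print(body)
```

<p>The daily job is then a loop over scored rows that calls this once per user and sends the result.</p>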

<h2 id="why-bqml-over-vertex-ai-for-this">Why BQML over Vertex AI for this</h2>

<p>For a single binary classification model on GA4 data, BQML is the right tool.
Three reasons:</p>

<ol>
  <li><strong>Data stays in BigQuery.</strong> Training reads directly from the GA4 export.
No data movement, no scheduling, no separate feature store.</li>
  <li><strong>HP tuning is built in.</strong> <code class="language-plaintext highlighter-rouge">OPTIONS(num_trials=20, ...)</code> sweeps
<code class="language-plaintext highlighter-rouge">learn_rate</code>, <code class="language-plaintext highlighter-rouge">max_tree_depth</code>, <code class="language-plaintext highlighter-rouge">subsample</code>, and <code class="language-plaintext highlighter-rouge">l2_reg</code> and reports
<code class="language-plaintext highlighter-rouge">ML.TRIAL_INFO</code> after training. No separate orchestration.</li>
  <li><strong>Inference is a SQL query.</strong> <code class="language-plaintext highlighter-rouge">ML.PREDICT(MODEL ...)</code> returns scored rows.
The whole daily job is one BigQuery query plus an HTTP loop.</li>
</ol>

<p>Vertex AI is the right answer when you outgrow this — multi-model serving,
custom containers, GPU training. For a ~30-feature boosted-tree classifier on
180 days of GA4 events, BQML is simpler, cheaper, and equally accurate.</p>

<h2 id="the-features">The features</h2>

<p>The model uses ~30 features across five categories:</p>

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Features</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Engagement</td>
      <td><code class="language-plaintext highlighter-rouge">total_engagement_seconds</code>, <code class="language-plaintext highlighter-rouge">engagement_last_30d_seconds</code>, <code class="language-plaintext highlighter-rouge">engagement_last_7d_seconds</code></td>
    </tr>
    <tr>
      <td>Sessions</td>
      <td><code class="language-plaintext highlighter-rouge">sessions_total</code>, <code class="language-plaintext highlighter-rouge">engaged_sessions</code>, <code class="language-plaintext highlighter-rouge">engaged_session_rate</code>, <code class="language-plaintext highlighter-rouge">avg_session_depth</code>, <code class="language-plaintext highlighter-rouge">avg_session_engagement_seconds</code>, <code class="language-plaintext highlighter-rouge">max_session_engagement_seconds</code></td>
    </tr>
    <tr>
      <td>Funnel events</td>
      <td><code class="language-plaintext highlighter-rouge">view_item_list_count</code>, <code class="language-plaintext highlighter-rouge">select_item_count</code>, <code class="language-plaintext highlighter-rouge">view_item_count</code>, <code class="language-plaintext highlighter-rouge">add_to_wishlist_count</code>, <code class="language-plaintext highlighter-rouge">add_to_cart_count</code>, <code class="language-plaintext highlighter-rouge">view_cart_count</code>, <code class="language-plaintext highlighter-rouge">remove_from_cart_count</code>, <code class="language-plaintext highlighter-rouge">begin_checkout_count</code>, <code class="language-plaintext highlighter-rouge">add_shipping_info_count</code>, <code class="language-plaintext highlighter-rouge">add_payment_info_count</code></td>
    </tr>
    <tr>
      <td>Activity</td>
      <td><code class="language-plaintext highlighter-rouge">page_view_count</code>, <code class="language-plaintext highlighter-rouge">total_events</code>, <code class="language-plaintext highlighter-rouge">days_since_last_visit</code></td>
    </tr>
    <tr>
      <td>Funnel ratios</td>
      <td><code class="language-plaintext highlighter-rouge">list_to_view_ratio</code>, <code class="language-plaintext highlighter-rouge">view_to_cart_ratio</code>, <code class="language-plaintext highlighter-rouge">cart_to_checkout_ratio</code>, <code class="language-plaintext highlighter-rouge">checkout_to_payment_ratio</code>, <code class="language-plaintext highlighter-rouge">cart_abandon_ratio</code></td>
    </tr>
  </tbody>
</table>

<p>The ratios matter most. In every ecommerce model I’ve trained, the top three
feature importances are some combination of <code class="language-plaintext highlighter-rouge">checkout_to_payment_ratio</code>,
<code class="language-plaintext highlighter-rouge">add_payment_info_count</code>, and <code class="language-plaintext highlighter-rouge">cart_to_checkout_ratio</code>. Users who reach
checkout but don’t pay are the highest-value retargeting segment.</p>

<h2 id="arent-begin_checkout-and-add_payment_info-data-leakage">Aren’t <code class="language-plaintext highlighter-rouge">begin_checkout</code> and <code class="language-plaintext highlighter-rouge">add_payment_info</code> data leakage?</h2>

<p>A common worry. The short answer: no, they aren’t.</p>

<p>Data leakage means using post-prediction-time information at training time.
This model trains features from the <strong>180-day feature window</strong> (210 to 31
days ago) and labels from a <strong>separate 30-day window</strong> (30 to 1 day ago). The
windows don’t overlap. An <code class="language-plaintext highlighter-rouge">add_payment_info</code> event 60 days ago is a real,
legitimate predictor of a <code class="language-plaintext highlighter-rouge">purchase</code> event 10 days ago — exactly the signal
you want the model to learn.</p>
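
<p>The window arithmetic is worth checking once by hand. A small sketch that derives both windows from a run date and confirms they don’t overlap; <code class="language-plaintext highlighter-rouge">training_windows</code> is an illustrative helper, the real windowing lives in the training SQL.</p>

```python
from datetime import date, timedelta


def training_windows(run_date):
    """Return ((feature_start, feature_end), (label_start, label_end)), inclusive."""
    feature_start = run_date - timedelta(days=210)
    feature_end = run_date - timedelta(days=31)
    label_start = run_date - timedelta(days=30)
    label_end = run_date - timedelta(days=1)
    return (feature_start, feature_end), (label_start, label_end)


features, labels = training_windows(date(2026, 5, 15))
print("feature days:", (features[1] - features[0]).days + 1)
print("label days:", (labels[1] - labels[0]).days + 1)
print("gap-free and non-overlapping:", labels[0] - features[1] == timedelta(days=1))
```

<p>The feature window covers 180 days, the label window 30, and the label window starts the day after the feature window ends.</p>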

<p>The full training SQL with the windowing is in <code class="language-plaintext highlighter-rouge">training.sql</code>
in the <a href="https://github.com/hugonissar/GA4-Ecommerce-BQML-Purchase-Propensity">repo</a>.</p>

<h2 id="consent-filter-at-the-push-step-not-at-training">Consent: filter at the push step, not at training</h2>

<p>This is the design decision that gets the most pushback, so it’s worth being
explicit.</p>

<p><strong>Training data is not filtered by consent.</strong> <strong>Scoring data is.</strong></p>

<p>The pipeline reads all GA4 events at training time to maximize the positive
class. A high-intent user behaves the same regardless of which consent button
they clicked, so filtering training data to consented users only shrinks the
positive set by 30–40% (typical EEA rates) and materially hurts model quality.</p>

<p>At the <strong>scoring step</strong>, the SQL filters to
<code class="language-plaintext highlighter-rouge">ads_personalization_consent = 'GRANTED'</code>. Non-consented users never get a
propensity bucket, never enter the GA4 audience, never reach Google Ads. The
moment data is <em>used for ad targeting</em> is the moment consent legally binds —
that’s the push step, and that’s where the filter lives.</p>
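
<p>In code terms the asymmetry looks like this. A sketch over plain dictionaries, assuming each row carries an <code class="language-plaintext highlighter-rouge">ads_personalization_consent</code> field as in the scoring SQL; the function name is made up for illustration.</p>

```python
def split_for_pipeline(rows):
    """Training sees every row; scoring keeps only consented users."""
    training_rows = list(rows)  # no consent filter at training time
    scoring_rows = [
        row for row in rows
        if row.get("ads_personalization_consent") == "GRANTED"
    ]
    return training_rows, scoring_rows


rows = [
    {"user_pseudo_id": "a", "ads_personalization_consent": "GRANTED"},
    {"user_pseudo_id": "b", "ads_personalization_consent": "DENIED"},
    {"user_pseudo_id": "c"},  # consent state missing: treated as not granted
]
training, scoring = split_for_pipeline(rows)
print(len(training), len(scoring))
```

<p>Rows with a missing consent state are dropped from scoring as well, which is the conservative default for anything that feeds ad targeting.</p>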

<p>If your organisation requires a stricter “no non-consented data in any
downstream system” policy, the repo has a one-CTE patch that adds a
consent filter to training too. See
<a href="https://github.com/hugonissar/GA4-Ecommerce-BQML-Purchase-Propensity#-consent-model">the README’s Consent model section</a>.</p>

<h2 id="three-buckets-fixed-thresholds">Three buckets, fixed thresholds</h2>

<p>The model outputs a probability. The pipeline buckets it:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">low</code> — predicted probability &lt; 0.40</li>
  <li><code class="language-plaintext highlighter-rouge">mid</code> — 0.40 ≤ p &lt; 0.70</li>
  <li><code class="language-plaintext highlighter-rouge">high</code> — p ≥ 0.70</li>
</ul>
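
<p>The bucketing rule above fits in a few lines. A minimal sketch with the thresholds from the list; the function name is illustrative.</p>

```python
HIGH_THRESHOLD = 0.70
MID_THRESHOLD = 0.40


def propensity_bucket(probability):
    """Map a predicted purchase probability to a fixed bucket label."""
    if probability >= HIGH_THRESHOLD:
        return "high"
    if probability >= MID_THRESHOLD:
        return "mid"
    return "low"


for p in (0.12, 0.40, 0.69, 0.70):
    print(p, propensity_bucket(p))
```

<p>Keeping the thresholds as named constants makes the one-time revisit after the first model run a two-line change.</p>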

<p>Thresholds are <strong>fixed</strong>, not percentile-based, so bucket meaning stays stable
across retrains. The first model run gives you a distribution; revisit the
thresholds once if it’s heavily skewed. After that, leave them alone.</p>

<h2 id="how-much-traffic-do-you-need">How much traffic do you need?</h2>

<p>Realistic operating points, assuming a 3% conversion rate:</p>

<ul>
  <li><strong>Workable:</strong> 1,000+ weekly visitors / 30+ weekly purchases — usable for
ranking, expect noise in bucket sizes.</li>
  <li><strong>Comfortable:</strong> ~3,300 weekly visitors / ~100 weekly purchases — stable AUC.</li>
  <li><strong>Ideal:</strong> ~8,000 weekly visitors / ~250 weekly purchases — reliable for production.</li>
  <li><strong>Robust:</strong> ~16,000 weekly visitors / ~500 weekly purchases — safe to
fully automate against ad spend.</li>
</ul>
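
<p>If you want to place your own property on that scale, the arithmetic is just visitors times conversion rate. A rough sketch: the tier cut-offs are the visitor counts from the list above and only hold near a 3% conversion rate, and <code class="language-plaintext highlighter-rouge">operating_point</code> is a hypothetical helper.</p>

```python
# Visitor thresholds from the list above, valid at roughly a 3% conversion rate.
TIERS = [
    (16_000, "robust"),
    (8_000, "ideal"),
    (3_300, "comfortable"),
    (1_000, "workable"),
]


def operating_point(weekly_visitors, conversion_rate=0.03):
    """Return (tier name, expected weekly purchases) for a traffic level."""
    expected_purchases = round(weekly_visitors * conversion_rate)
    for min_visitors, name in TIERS:
        if weekly_visitors >= min_visitors:
            return name, expected_purchases
    return "below workable", expected_purchases


print(operating_point(1_000))
print(operating_point(16_000))
```

<p>At a very different conversion rate, compare expected purchases rather than visitors: the positives are what the model trains on.</p>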

<p>Below 1,000 weekly visitors, both this pipeline and GA4’s built-in audiences
struggle. The issue is statistical, not technical — there just aren’t enough
positives to train against.</p>

<h2 id="deploying-it">Deploying it</h2>

<p>The repo has a six-step quick start: create the BQML dataset, train the model
once manually, schedule the weekly training as a BigQuery scheduled query,
deploy the Cloud Function, schedule the daily run with Cloud Scheduler, and
verify in GA4 DebugView. Total setup time is about an hour if you have GA4
BigQuery export already turned on; longer if you don’t (you need both daily
and streaming exports on).</p>

<p>The full quick-start is in the
<a href="https://github.com/hugonissar/GA4-Ecommerce-BQML-Purchase-Propensity#-quick-start">repository README</a>.</p>

<h2 id="faq">FAQ</h2>

<h3 id="why-not-write-audiences-directly-via-the-google-ads-api">Why not write audiences directly via the Google Ads API?</h3>

<p>The Ads API audience push requires more permissions, more setup, and doesn’t
keep GA4 reporting in sync. Writing back as a GA4 user property keeps
everything in one place: GA4 reports, GA4 audiences, and Google Ads remarketing
all read from the same source.</p>

<h3 id="why-exclude-recent-buyers-from-both-training-and-scoring">Why exclude recent buyers from both training and scoring?</h3>

<p>Two reasons. First, retargeting people who just bought wastes ad spend.
Second, in training, frequent buyers dominate the positive class — the model
learns <em>“frequent buyers buy again”</em> instead of <em>“high-intent prospects buy.”</em>
You don’t need ML to target recent buyers; create a rule-based audience in GA4
(<code class="language-plaintext highlighter-rouge">purchase event in last 30 days</code>) and run retention campaigns against it
separately.</p>

<h3 id="can-i-run-this-without-consent-mode-v2">Can I run this without Consent Mode v2?</h3>

<p>Technically yes (remove the consent filter from the SQL), but then you can’t
legally push the data to Google Ads for ad targeting in jurisdictions that
require explicit consent. The pipeline is designed consent-first because
that’s the constraint that actually matters in production.</p>

<h3 id="does-it-work-with-firebase--app-properties">Does it work with Firebase / app properties?</h3>

<p>With minor changes to the SQL — <code class="language-plaintext highlighter-rouge">user_pseudo_id</code> exists in app exports too,
but the consent param keys differ. Easy to adapt.</p>

<h2 id="source-code">Source code</h2>

<p>Full source, training SQL, IAM least-privilege guide, and Cloud Function
deployment script on GitHub:
<a href="https://github.com/hugonissar/GA4-Ecommerce-BQML-Purchase-Propensity">github.com/hugonissar/GA4-Ecommerce-BQML-Purchase-Propensity</a>.
MIT-licensed.</p>]]></content><author><name>Hugo Nissar</name></author><category term="bqml" /><category term="ga4" /><category term="bigquery" /><category term="propensity" /><category term="google-ads" /><summary type="html"><![CDATA[A complete walkthrough for building a purchase propensity model on GA4 ecommerce data using BigQuery ML — boosted trees, hyperparameter tuning, consent-aware scoring, and pushing buckets back to GA4 via the Measurement Protocol for Google Ads remarketing.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://hugonissar.github.io/assets/images/og-default.png" /><media:content medium="image" url="https://hugonissar.github.io/assets/images/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Stopping the $2,000 AI query: how to cap BigQuery scan cost from an MCP server</title><link href="https://hugonissar.github.io/blog/bigquery-cost-control-mcp/" rel="alternate" type="text/html" title="Stopping the $2,000 AI query: how to cap BigQuery scan cost from an MCP server" /><published>2026-05-11T00:00:00+02:00</published><updated>2026-05-11T00:00:00+02:00</updated><id>https://hugonissar.github.io/blog/bigquery-cost-control-mcp</id><content type="html" xml:base="https://hugonissar.github.io/blog/bigquery-cost-control-mcp/"><![CDATA[<p>Every AI agent connected to BigQuery is one bad query away from a four-figure
invoice. The agent doesn’t have to be malicious — a well-meaning request like
<em>“find me users with similar behavior to the ones who converted last month”</em>
can quietly become a <code class="language-plaintext highlighter-rouge">JOIN</code> of three multi-terabyte tables with no
partition filter. BigQuery happily runs it, scans 6 TB, charges you about $37.50
(at $6.25/TB on-demand), and emails the bill.</p>

<p>This post is about preventing that. There are three layers of defense; only
one stops an oversized query before it runs. The right setup combines all three.</p>

<h2 id="the-three-layers">The three layers</h2>

<table>
  <thead>
    <tr>
      <th>Layer</th>
      <th>Where it lives</th>
      <th>What it stops</th>
      <th>Bulletproof?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Application-layer scan ceiling</strong></td>
      <td>Your MCP server (dry-run + reject)</td>
      <td>Any single query above the ceiling</td>
      <td>Yes — if the ceiling is set</td>
    </tr>
    <tr>
      <td><strong>IAM custom role with quota</strong></td>
      <td>GCP IAM (custom role with project quota)</td>
      <td>A misconfigured service account exceeding its quota</td>
      <td>Mostly — quotas are per-day, not per-query</td>
    </tr>
    <tr>
      <td><strong>Project-wide bytes-billed quota</strong></td>
      <td>BigQuery → Quotas</td>
      <td>A total spend cap across all queries from the project</td>
      <td>Yes — but at project granularity, not user/role</td>
    </tr>
  </tbody>
</table>

<h2 id="layer-1-application-layer-scan-ceiling">Layer 1: application-layer scan ceiling</h2>

<p>This is the cheapest and most precise layer. Every query the MCP server is about
to run gets a <strong>dry-run first</strong>. BigQuery’s dry-run returns the estimated
bytes scanned without actually executing the query or charging for it. If
the estimate exceeds your ceiling, the query is rejected before the real
job ever runs.</p>

<p>The Python is short — this is what
<a href="https://github.com/hugonissar/BigQuery-Read-Only-MCP-Server">BigQuery-Read-Only-MCP-Server</a>
does on every query:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">os</span>

<span class="kn">from</span> <span class="n">google.cloud</span> <span class="kn">import</span> <span class="n">bigquery</span>

<span class="n">bq</span> <span class="o">=</span> <span class="n">bigquery</span><span class="p">.</span><span class="nc">Client</span><span class="p">(</span><span class="n">project</span><span class="o">=</span><span class="n">GCP_PROJECT_ID</span><span class="p">)</span>
<span class="n">MAX_SCAN_MB</span> <span class="o">=</span> <span class="nf">int</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">MAX_SCAN_MB</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">100</span><span class="sh">"</span><span class="p">))</span>
<span class="n">MAX_SCAN_BYTES</span> <span class="o">=</span> <span class="n">MAX_SCAN_MB</span> <span class="o">*</span> <span class="mi">1024</span> <span class="o">*</span> <span class="mi">1024</span>

<span class="k">def</span> <span class="nf">reject_if_too_big</span><span class="p">(</span><span class="n">sql</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="n">job_config</span> <span class="o">=</span> <span class="n">bigquery</span><span class="p">.</span><span class="nc">QueryJobConfig</span><span class="p">(</span>
        <span class="n">dry_run</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">use_query_cache</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="n">job</span> <span class="o">=</span> <span class="n">bq</span><span class="p">.</span><span class="nf">query</span><span class="p">(</span><span class="n">sql</span><span class="p">,</span> <span class="n">job_config</span><span class="o">=</span><span class="n">job_config</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">job</span><span class="p">.</span><span class="n">total_bytes_processed</span> <span class="o">&gt;</span> <span class="n">MAX_SCAN_BYTES</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">Query would scan </span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">total_bytes_processed</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s"> bytes, </span><span class="sh">"</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">exceeding cap of </span><span class="si">{</span><span class="n">MAX_SCAN_BYTES</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="sh">"</span>
        <span class="p">)</span>
</code></pre></div></div>

<p><strong>Notes:</strong></p>

<ul>
  <li>Dry-runs are free and fast (typically 50–200ms). Caching them in a small
LRU removes the extra round trip when the agent retries the same query.</li>
  <li><code class="language-plaintext highlighter-rouge">total_bytes_processed</code> is an estimate, not a guarantee. For highly
optimized queries against partitioned + clustered tables, the actual scan
can come in slightly under the estimate. The reverse — actual scan
exceeding the estimate — is extremely rare in practice, but the defense
is <em>belt-and-braces</em>, so layer 3 still has a role.</li>
  <li>Setting <code class="language-plaintext highlighter-rouge">MAX_SCAN_MB</code> requires knowing your data. 100 MB is fine for
exploratory queries against a GA4 export (you can answer most reasonable
questions in under 100 MB if your tables are partitioned by date). For
unpartitioned reporting tables, you may need 500 MB. Don’t go above 2 GB
without a very specific reason.</li>
</ul>
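
<p>The LRU idea from the first note can be sketched without touching the BigQuery client at all. This is an illustration, not the server’s code: <code class="language-plaintext highlighter-rouge">make_cached_guard</code> is a hypothetical wrapper, and the estimator is injected so a stub can stand in for the real dry-run call.</p>

```python
from functools import lru_cache

MAX_SCAN_BYTES = 100 * 1024 * 1024  # 100 MB, matching the default above


def make_cached_guard(estimate_bytes, max_bytes=MAX_SCAN_BYTES, cache_size=256):
    """Wrap a dry-run estimator so repeated SQL text skips the extra round trip."""
    cached_estimate = lru_cache(maxsize=cache_size)(estimate_bytes)

    def guard(sql):
        estimated = cached_estimate(sql.strip())
        if estimated > max_bytes:
            raise ValueError(
                f"Query would scan {estimated:,} bytes, exceeding cap of {max_bytes:,}"
            )
        return estimated

    return guard


# Stub estimator standing in for the real dry-run call.
calls = []


def fake_dry_run(sql):
    calls.append(sql)
    return 5_000_000 if "WHERE" in sql else 6 * 1024**4


guard = make_cached_guard(fake_dry_run)
guard("SELECT 1 FROM t WHERE d = '2026-05-01'")
guard("SELECT 1 FROM t WHERE d = '2026-05-01'")  # served from the cache
print(len(calls))  # prints 1
```

<p>One caveat with this design: a cached estimate goes stale as tables grow, so keep the cache small and short-lived, or clear it on a timer.</p>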

<p>This is the <strong>bulletproof, per-query</strong> defense. No query exceeding the cap
ever runs. No bytes are billed. No surprise charges.</p>

<h2 id="layer-2-iam-custom-role-with-quota">Layer 2: IAM custom role with quota</h2>

<p>The native BigQuery roles (<code class="language-plaintext highlighter-rouge">bigquery.dataViewer</code>, <code class="language-plaintext highlighter-rouge">bigquery.jobUser</code>,
<code class="language-plaintext highlighter-rouge">bigquery.user</code>) don’t have per-role byte quotas. You can’t say
<em>“this service account is allowed to scan 50 GB/day”</em> through them.</p>

<p>You can get close with a <strong>custom role plus a project-level per-user quota override</strong>.
The override applies to every principal in the project rather than being keyed
to one service account, and the configuration is non-obvious. The shape:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 1. Create the custom role (essentially jobUser + read scoped tighter)</span>
gcloud iam roles create bqAgentJobUser <span class="se">\</span>
  <span class="nt">--project</span><span class="o">=</span><span class="nv">$PROJECT_ID</span> <span class="se">\</span>
  <span class="nt">--title</span><span class="o">=</span><span class="s2">"BQ Agent Job User"</span> <span class="se">\</span>
  <span class="nt">--permissions</span><span class="o">=</span>bigquery.jobs.create,bigquery.jobs.get <span class="se">\</span>
  <span class="nt">--stage</span><span class="o">=</span>GA

<span class="c"># 2. Bind to the agent service account</span>
gcloud projects add-iam-policy-binding <span class="nv">$PROJECT_ID</span> <span class="se">\</span>
  <span class="nt">--member</span><span class="o">=</span><span class="s2">"serviceAccount:bigquery-readonly-mcp@</span><span class="k">${</span><span class="nv">PROJECT_ID</span><span class="k">}</span><span class="s2">.iam.gserviceaccount.com"</span> <span class="se">\</span>
  <span class="nt">--role</span><span class="o">=</span><span class="s2">"projects/</span><span class="k">${</span><span class="nv">PROJECT_ID</span><span class="k">}</span><span class="s2">/roles/bqAgentJobUser"</span>

<span class="c"># 3. Set a "Query usage per day per user" quota override on the Cloud</span>
<span class="c"># Console Quotas &amp; System Limits page (filter to BigQuery).</span>
<span class="c"># Choose a value like 50 GB.</span>
</code></pre></div></div>

<p>The catch: quotas are <strong>per-user-per-day</strong>. They’re a backstop, not a
per-query control. A single 6 TB query happily eats through the daily cap
in one shot. So this layer protects against <strong>sustained misbehavior</strong>, not
<strong>a single bad query</strong>.</p>

<p>When this layer pays off: it catches the case where an agent is running 100
well-formed queries an hour for days, slowly burning down your budget.
Layer 1’s per-query cap doesn’t notice that pattern.</p>
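
<p>A quick back-of-the-envelope calculation shows why the two layers complement each other. A sketch assuming every query lands right at the Layer 1 ceiling, with the 100 MB cap and 50 GB daily quota used elsewhere in this post; the function name is illustrative.</p>

```python
def hours_until_quota(queries_per_hour, mb_per_query, daily_quota_gb):
    """Hours of steady querying before the per-user daily quota is exhausted."""
    mb_per_hour = queries_per_hour * mb_per_query
    return daily_quota_gb * 1024 / mb_per_hour


# 100 queries an hour, each just under a 100 MB per-query cap, 50 GB daily quota.
print(round(hours_until_quota(100, 100, 50), 2))
```

<p>Every one of those queries passes the per-query check, yet the daily quota is gone in a little over five hours. That pattern is only visible to a layer that counts across queries.</p>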

<h2 id="layer-3-project-wide-bytes-billed-quota">Layer 3: project-wide bytes-billed quota</h2>

<p>This is the <strong>whole project’s emergency brake</strong>. In Google Cloud Console:</p>

<ol>
  <li><strong>APIs &amp; Services → Quotas &amp; System Limits</strong>.</li>
  <li>Filter to BigQuery.</li>
  <li>Find <em>“Query usage per day”</em>.</li>
  <li><strong>Edit Quota</strong> and set a daily ceiling. Default is 200 TB/day — drop it
to something defensible (50 GB? 500 GB?).</li>
  <li>Save. The change takes effect within minutes.</li>
</ol>

<p>When this layer matters: a query that somehow slipped past the application
layer ceiling (a bug, a misconfiguration, a deploy without the cap),
combined with an IAM quota that’s too loose. The project-wide quota stops
<em>all</em> BigQuery jobs in the project when hit — which is disruptive but
bounded. Set it high enough that legitimate batch jobs (your GA4 ML
training, scheduled queries, etc.) don’t trip it.</p>

<h2 id="what-this-looks-like-together">What this looks like together</h2>

<p>For a production deployment of the MCP server I maintain, my actual
configuration:</p>

<ul>
  <li><strong>Layer 1</strong>: <code class="language-plaintext highlighter-rouge">MAX_SCAN_MB=100</code> for exploratory agents, <code class="language-plaintext highlighter-rouge">MAX_SCAN_MB=500</code>
for known-good internal use.</li>
  <li><strong>Layer 2</strong>: a custom role for the service account, with a 50 GB
per-user-per-day quota.</li>
  <li><strong>Layer 3</strong>: project-wide cap of 1 TB/day. High enough that nothing
legitimate ever hits it, low enough that a runaway script gets stopped
before it costs four figures.</li>
</ul>

<p>Total cost at idle: zero (Cloud Run scales to zero). Total worst-case
exposure if every layer except the last fails: 1 TB × $6.25/TB = <strong>$6.25</strong>
per day.</p>
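<p>The same arithmetic can be applied to each cap in the list above. This is a sketch, not billing logic: the helper name is hypothetical, the $6.25/TB rate is the one quoted in this post, and decimal units are assumed.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PRICE_PER_TB_USD = 6.25  # on-demand rate quoted in this post
MB, GB, TB = 10**6, 10**9, 10**12  # decimal units, assumed


def worst_case_usd(cap_bytes: int, price_per_tb: float = PRICE_PER_TB_USD) -&gt; float:
    """Cost of scanning exactly cap_bytes at the on-demand rate."""
    return cap_bytes / TB * price_per_tb


caps = {
    "layer 1, per query (500 MB)": 500 * MB,
    "layer 2, per user per day (50 GB)": 50 * GB,
    "layer 3, per project per day (1 TB)": 1 * TB,
}
for name, cap in caps.items():
    print(f"{name}: ${worst_case_usd(cap):.4f}")
</code></pre></div></div>

<p>That works out to roughly a third of a cent per query at Layer 1, about 31 cents per user per day at Layer 2, and $6.25 per day at Layer 3.</p>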

<h2 id="why-just-use-iam-isnt-enough">Why “just use IAM” isn’t enough</h2>

<p>If you ask in a typical Cloud forum <em>“how do I prevent expensive BigQuery
queries from an agent?”</em>, the default answer is <em>“use IAM.”</em> IAM is
necessary. It is not sufficient.</p>

<p>The reason: IAM controls <strong>who can run jobs</strong> and <strong>on which datasets</strong>.
It doesn’t control <strong>how big each job is</strong>. A perfectly IAM-correct setup
with <code class="language-plaintext highlighter-rouge">bigquery.dataViewer</code> on three datasets and <code class="language-plaintext highlighter-rouge">bigquery.jobUser</code> on the
project lets an agent run a 6 TB scan against the largest of those
datasets, no questions asked. IAM doesn’t see the query plan; only
BigQuery does.</p>

<p>The pattern in this post — dry-run before submit, reject above the cap —
is what bridges the gap. IAM gates <strong>access</strong>, the application-layer cap
gates <strong>cost</strong>, the quotas gate <strong>sustained spend</strong>. All three are
necessary, but only the application-layer cap can stop a single
catastrophic query.</p>
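<p>Stripped of the BigQuery specifics, the dry-run-then-reject gate might look like the sketch below. The names <code class="language-plaintext highlighter-rouge">guard_query</code>, <code class="language-plaintext highlighter-rouge">ScanCapExceeded</code>, <code class="language-plaintext highlighter-rouge">fake_estimate</code> and <code class="language-plaintext highlighter-rouge">fake_run</code> are hypothetical and not taken from the server’s source. The estimator is passed in as a callable so the example runs without credentials.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Any, Callable


class ScanCapExceeded(Exception):
    """Raised when the estimated scan is larger than the configured cap."""


def guard_query(
    sql: str,
    estimate_bytes: Callable[[str], int],
    run_query: Callable[[str], Any],
    cap_bytes: int,
) -&gt; Any:
    """Estimate first, then run only if the estimate fits under the cap.

    With google-cloud-bigquery, estimate_bytes could wrap a dry run:
    client.query(sql, job_config=QueryJobConfig(dry_run=True,
    use_query_cache=False)).total_bytes_processed
    """
    estimated = estimate_bytes(sql)
    if estimated &gt; cap_bytes:
        raise ScanCapExceeded(
            f"estimated scan of {estimated} bytes exceeds cap of {cap_bytes} bytes"
        )
    return run_query(sql)


# Stand-ins for the BigQuery client (assumptions for illustration only).
def fake_estimate(sql: str) -&gt; int:
    return 6 * 10**12 if "huge_table" in sql else 10**6


def fake_run(sql: str) -&gt; str:
    return f"ran: {sql}"


CAP = 100 * 10**6  # 100 MB, matching MAX_SCAN_MB=100
print(guard_query("SELECT 1", fake_estimate, fake_run, CAP))
try:
    guard_query("SELECT * FROM huge_table", fake_estimate, fake_run, CAP)
except ScanCapExceeded as err:
    print(f"rejected: {err}")
</code></pre></div></div>

<p>The point of the shape is that the expensive call is unreachable unless the estimate has already passed the check. IAM and the quotas sit outside this function and never see the estimate.</p>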

<h2 id="faq">FAQ</h2>

<h3 id="doesnt-dry-run-also-cost-money">Doesn’t dry-run also cost money?</h3>

<p>No. Dry-runs are explicitly free per
<a href="https://cloud.google.com/bigquery/pricing#dry-run">Google’s BigQuery pricing docs</a>.
They are still API requests, so request rate limits apply, but they have no billing impact.</p>

<h3 id="what-about-queries-that-scan-less-than-expected">What about queries that scan less than expected?</h3>

<p>Dry-run gives an upper bound, not the actual scan. In practice the estimate
is accurate to within a few percent for partitioned + clustered tables, and
within ~10% for unpartitioned ones. Either way, you’re being conservative —
you never pay for a query bigger than the dry-run estimate.</p>

<h3 id="how-does-this-interact-with-bigquerys-slot-based-pricing">How does this interact with BigQuery’s slot-based pricing?</h3>

<p>It doesn’t directly. The scan ceiling controls <strong>bytes scanned</strong>, which is
the unit of on-demand pricing. If you’re on flat-rate slots, the relevant
metric is <strong>slot-seconds</strong>, not bytes, and a dry-run only estimates bytes;
slot usage is reported after a job has run. The byte ceiling still works there
as a rough proxy for query size, while the reservation’s slot count bounds spend.</p>

<h3 id="can-i-do-this-on-snowflake">Can I do this on Snowflake?</h3>

<p>Snowflake exposes query plan estimates via <code class="language-plaintext highlighter-rouge">EXPLAIN</code>. The principle is the
same: estimate before run, reject above a cap. The implementation differs
(Snowflake’s estimates are less precise than BigQuery’s dry-run) but the
defense-in-depth structure is identical.</p>
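<p>As a rough sketch of the Snowflake side: the code below assumes the plan was fetched with <code class="language-plaintext highlighter-rouge">EXPLAIN USING JSON</code> and that the payload carries a top-level <code class="language-plaintext highlighter-rouge">GlobalStats</code> object with a <code class="language-plaintext highlighter-rouge">bytesAssigned</code> field. Check both against your own account’s output before relying on them. The function names and the sample payload are hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json


def snowflake_bytes_assigned(explain_json: str) -&gt; int:
    """Read the estimated bytes to scan from an EXPLAIN USING JSON payload.

    Assumes a top-level "GlobalStats" object with a "bytesAssigned" field.
    """
    plan = json.loads(explain_json)
    return int(plan["GlobalStats"]["bytesAssigned"])


def within_cap(explain_json: str, cap_bytes: int) -&gt; bool:
    """True when the estimated scan fits under the cap."""
    return snowflake_bytes_assigned(explain_json) &lt;= cap_bytes


# Illustrative payload, not captured from a real account.
sample_plan = json.dumps(
    {
        "GlobalStats": {
            "partitionsTotal": 120,
            "partitionsAssigned": 40,
            "bytesAssigned": 750_000_000,
        }
    }
)
print(within_cap(sample_plan, cap_bytes=500 * 10**6))
print(within_cap(sample_plan, cap_bytes=10**9))
</code></pre></div></div>

<p>The check has the same shape as the BigQuery one: parse an estimate, compare it with the cap, and only then submit the query.</p>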

<h2 id="source-code">Source code</h2>

<p>Full source — including the dry-run cache, the LRU TTL implementation, and
the rest of the security layers (allowlist, rate limiting, result truncation):
<a href="https://github.com/hugonissar/BigQuery-Read-Only-MCP-Server">github.com/hugonissar/BigQuery-Read-Only-MCP-Server</a>.
MIT-licensed.</p>]]></content><author><name>Hugo Nissar</name></author><category term="mcp" /><category term="bigquery" /><category term="cost-control" /><category term="security" /><category term="ai-agents" /><summary type="html"><![CDATA[Three layers of cost control for BigQuery queries originating from an MCP server — per-query scan ceilings via dry-run, IAM custom roles, and project-wide quotas. With code and example IAM bindings.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://hugonissar.github.io/assets/images/og-default.png" /><media:content medium="image" url="https://hugonissar.github.io/assets/images/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>