Hugo Nissar — MarTech engineering on Google Cloud

Self-hosted vs Google’s official BigQuery MCP server: a security and cost comparison

2026-05-18T00:00:00+02:00

Google released an official BigQuery MCP server in 2025. It works, it’s maintained, it’s the right default for many teams. But the moment you put an agent in front of an untrusted user — a customer chatbot, a third-party integration, anywhere prompt injection is a real threat — its defaults stop being the right defaults.

This post compares the two architectures across the dimensions that actually matter in production: table access control, per-query scan cost, rate limiting, and idle cost. By the end you’ll know which one to deploy and why.

TL;DR

Dimension	Google official	Self-hosted (this guide)
Table access control	IAM only — every reachable table	Hard allowlist of `(dataset, table)` pairs, parsed in code
Per-query scan cap	None built in	`MAX_SCAN_MB` enforced via dry-run before the job runs
Rate limiting	None (“no limit on the number of calls”, per Google’s docs)	Token bucket + burst, configurable
Result row cap	None	`MAX_RESULT_ROWS`, truncates server-side
Source code	Closed	Open, MIT, ~1400 lines
Idle cost	Managed	~$0 (Cloud Run scales to zero)
Prompt-injection scanning	Yes (Model Armor add-on)	No
Forecasting built in	Yes	No

Pick Google’s server if you trust the agent absolutely, want Model Armor prompt-injection scanning, or need built-in forecast and ARIMA tools.

Pick the self-hosted alternative for anything else — especially customer-facing agents, multi-tenant deployments, regulated environments, and anywhere the words “scan budget” or “rate limit” matter.

The three risks Google’s defaults don’t cover

1. Any table the service account can reach

Google’s server exposes every BigQuery table the underlying service account has IAM permission to read. For an internal data team that’s the right default — the service account has the same scope as the analyst using it.

For an agent talking to a customer, it isn’t. A misconfigured IAM grant, a fork of the service account’s role, or a future colleague who adds a dataset to “the analytics service account, it’s already got access to everything” can quietly widen the surface that an agent can read. The next prompt injection now has more to work with.

The self-hosted alternative inverts the default: tables are listed explicitly in (dataset, table) pairs as environment variables. Anything outside the allowlist is rejected at the SQL parser layer — before a job is ever submitted to BigQuery. Cross-dataset references like wrong_dataset.allowed_table are caught and rejected too.
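To make the allowlist idea concrete, here is a minimal sketch of a pre-submission check. It is not the server's actual parser: it uses a regular expression where a real implementation would use a proper SQL parser, and the names `ALLOWED_TABLES`, `referenced_tables`, and `check_allowlist` are illustrative assumptions. It fails closed, so anything it cannot match to the allowlist is rejected.

```python
import re

# Hypothetical allowlist of (dataset, table) pairs; the real server reads
# its pairs from environment variables.
ALLOWED_TABLES = {("analytics", "events")}

# Captures the identifier that follows FROM or JOIN, with optional backticks.
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+`?([\w.\-]+)`?", re.IGNORECASE)


def referenced_tables(sql: str, default_dataset: str) -> set:
    """Return every (dataset, table) pair the query appears to reference."""
    refs = set()
    for match in _TABLE_REF.finditer(sql):
        parts = match.group(1).split(".")
        table = parts[-1]
        # An unqualified table name falls back to the default dataset.
        dataset = parts[-2] if len(parts) >= 2 else default_dataset
        refs.add((dataset, table))
    return refs


def check_allowlist(sql: str, default_dataset: str = "analytics") -> None:
    """Raise PermissionError unless every referenced table is allowlisted."""
    refs = referenced_tables(sql, default_dataset)
    if not refs:
        raise PermissionError("no table reference found; refusing to run")
    rejected = refs - ALLOWED_TABLES
    if rejected:
        raise PermissionError(f"tables outside the allowlist: {sorted(rejected)}")


check_allowlist("SELECT COUNT(*) FROM `my-project.analytics.events`")  # passes
```

A regex like this over-matches (CTE names and `EXTRACT(DAY FROM ts)` both look like table references), which is acceptable for a sketch because every false positive ends in a rejection, never in a query being let through.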

2. The “AI just ran a $2,000 query” problem

There is no per-query scan ceiling in Google’s MCP server. The official docs suggest enforcing one via custom IAM roles or BigQuery-level quotas. Both work, both require Ops effort, and neither is on by default.

The self-hosted server dry-runs every query first. If BigQuery’s estimate exceeds MAX_SCAN_MB (default 100), the real job is never submitted — no bytes billed, no surprise on the invoice.
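The dry-run pattern is short enough to sketch with the `google-cloud-bigquery` client. This is an illustration under assumptions, not the server's code: the function names are made up for the example, and treating `MAX_SCAN_MB` as mebibytes (1024 × 1024 bytes) is a guess about the unit convention.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: MAX_SCAN_MB counts mebibytes


def estimate_scan_bytes(client, sql: str) -> int:
    """Ask BigQuery how many bytes a query would process, without running it."""
    from google.cloud import bigquery  # requires the google-cloud-bigquery package

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


def check_scan_budget(estimated_bytes: int, max_scan_mb: int = 100) -> None:
    """Raise ValueError when the estimate exceeds the configured ceiling."""
    limit = max_scan_mb * BYTES_PER_MB
    if estimated_bytes > limit:
        raise ValueError(
            f"query would scan {estimated_bytes} bytes; limit is {limit} bytes"
        )


def run_guarded(client, sql: str, max_scan_mb: int = 100):
    """Dry-run first; submit the real job only if the estimate fits the budget."""
    check_scan_budget(estimate_scan_bytes(client, sql), max_scan_mb)
    return client.query(sql).result()
```

Keeping the budget check separate from the BigQuery call makes the decision logic testable without credentials.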

3. No built-in rate limiting

From Google’s docs, verbatim: “The BigQuery MCP server doesn’t have its own quotas. There is no limit on the number of calls that can be made to the MCP server.”

For a server fronted by an agent that retries on failure, that’s a foot-gun. The self-hosted alternative ships with a configurable RATE_LIMIT_QPM and RATE_LIMIT_BURST token bucket, plus separate concurrency semaphores for queries and metadata calls.
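For readers who have not met a token bucket before, here is a minimal single-process sketch. The class name and constructor arguments are assumptions chosen to mirror `RATE_LIMIT_QPM` and `RATE_LIMIT_BURST`; the server's own implementation may differ, and this version is not thread-safe.

```python
import time


class TokenBucket:
    """Allow `burst` calls immediately, then refill at `qpm` tokens per minute."""

    def __init__(self, qpm: float, burst: int, clock=time.monotonic):
        self.rate = qpm / 60.0          # tokens added per second
        self.capacity = float(burst)    # maximum tokens the bucket can hold
        self.tokens = float(burst)      # start full so a burst is allowed
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when rate-limited."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(qpm=20, burst=5)
results = [bucket.allow() for _ in range(6)]  # five allowed, the sixth refused
```

The injectable `clock` exists only so the refill behaviour can be tested without sleeping.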

When the managed version still wins

The self-hosted server isn’t a strict superset. Google’s managed offering gives you Model Armor integration for prompt-injection scanning (a paid add-on, but a real one) and built-in forecasting / ARIMA tools that the self-hosted server intentionally doesn’t ship. If those matter, use Google’s version. If they don’t, the cost / security tradeoff swings hard toward self-hosted.

Deploying the self-hosted alternative

Three steps (four gcloud commands), ~10 minutes:

# 1. Generate API keys
openssl rand -hex 32 | gcloud secrets create mcp-api-key --data-file=-
openssl rand -hex 32 | gcloud secrets create mcp-admin-key --data-file=-

# 2. Build the image
gcloud builds submit \
  --tag europe-north2-docker.pkg.dev/$PROJECT_ID/bigquery-readonly-mcp/server:latest

# 3. Deploy to Cloud Run
gcloud run deploy bigquery-readonly-mcp \
  --image=europe-north2-docker.pkg.dev/$PROJECT_ID/bigquery-readonly-mcp/server:latest \
  --region=europe-north2 \
  --service-account="bigquery-readonly-mcp@$PROJECT_ID.iam.gserviceaccount.com" \
  --set-secrets="MCP_API_KEY=mcp-api-key:latest,MCP_ADMIN_KEY=mcp-admin-key:latest" \
  --set-env-vars="GCP_PROJECT_ID=$PROJECT_ID,BQ_DATASET_ID=analytics,BQ_ALLOWED_TABLE=events,MAX_SCAN_MB=100,RATE_LIMIT_QPM=20"

The full IAM least-privilege guide, multi-table configuration, and operations playbook are in the repository README.

Frequently asked questions

Does this work with Claude Desktop?

Yes. It also works with Cursor, Windsurf, Claude Code, ChatGPT’s deep research, the OpenAI Responses API, and anything else that speaks streamable-HTTP MCP. Configuration is the standard url + headers MCP client config.

Can I run it inside a private VPC?

Yes. Add --vpc-connector and --ingress=internal to the Cloud Run deploy command, then put an internal load balancer in front.

How much does it cost at idle?

Roughly nothing. Cloud Run scales to zero between requests; the only standing costs are Artifact Registry storage (~$0.10/GB/month for the image) and Secret Manager versions (cents/month). Under modest load — say 1000 queries a day — total monthly cost is in the single-digit dollars.

What about column-level masking?

The allowlist is table-granularity. If you need to hide PII columns, create a BigQuery authorized view that projects only safe columns and add the view to the allowlist instead of the underlying table.

Source code

The full source is one Python file, MIT-licensed, ~1400 lines: github.com/hugonissar/BigQuery-Read-Only-MCP-Server. PRs welcome.

When +35% is really −60%: a GA4 anomaly detector where the LLM only writes the report

2026-05-18T00:00:00+02:00

I’ve been using MCP servers for about a year now, which means I’ve been doing what most of the tech industry has been doing: hooking AI agents up to real data sources and watching them make us extremely productive most of the time and occasionally make a confident mistake that no one without the original SQL output would catch.

The most surprising failure mode, for me, was always the simple calculations. Not the hard reasoning. Not the complex multi-step joins. The simple calculations. “Revenue grew from $5,000 to $8,000” — fine. “That’s a 60% increase” — sometimes fine, sometimes confidently 35%, sometimes confidently negative. The model returns a number, the chat UI shows the number in bold, and unless the user happens to do the long division in their head before pasting it into a Slack channel, the number sticks.

This isn’t a controversial observation among people who use these tools heavily — every analytics engineer I know has stories. But what tends to be less talked about is the architectural lesson: if the math matters, don’t make the LLM do it. Hand the LLM the result of the math, computed deterministically in code, and let it do the part it’s actually good at — writing the prose.
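A minimal sketch of that division of labour, using hypothetical helper names: the arithmetic happens in code, and the LLM only ever receives the finished number and its direction.

```python
def pct_change(previous: float, current: float) -> float:
    """Percentage change from previous to current, computed deterministically."""
    if previous == 0:
        raise ValueError("percentage change is undefined for a zero baseline")
    return (current - previous) / previous * 100.0


def describe_change(metric: str, previous: float, current: float) -> dict:
    """Build the structured finding that would be handed to the LLM."""
    change = pct_change(previous, current)
    direction = "up" if change > 0 else "down" if change < 0 else "flat"
    return {
        "metric": metric,
        "previous": previous,
        "current": current,
        "pct_change": round(change, 1),
        "direction": direction,
    }


finding = describe_change("revenue", 5000, 8000)
# finding["pct_change"] is 60.0 and finding["direction"] is "up"
```

With the sign and magnitude fixed before the prompt is built, the model has no opportunity to report 35% or a negative number.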

That’s the lesson behind a weekend project I shipped this month: GA4-Anomaly-Detector. It started as a benchmark — can an LLM correctly find anomalies in a GA4 BigQuery export if I just hand it the rows, or do I need to compute the anomalies myself first? — and became a tool that does anomaly detection in code and uses the LLM purely for narrative. This post walks through what it does, why the architecture matters, the full CLI reference, and an honest word of caution about what it deliberately doesn’t have yet.

The benchmark that made the design obvious

The original question wasn’t even my own. It was already answered in a piece by Coupler.io benchmarking Google’s official GA4 MCP server against the GA4 sample dataset. The setup was simple: ask the agent to describe the recent traffic trend; check the answer against the actual numbers. Google’s MCP server reported a 35% increase in traffic when the actual trend was a 60% decrease. It got not just the magnitude wrong but the direction too.

This isn’t because Google’s engineers built it badly. It’s because the architecture handed the LLM the raw query results and asked it to do the analysis. That’s the structural mistake. An LLM scanning a column of numbers will sometimes pattern-match to a description that fits the most salient sub-window rather than the actual trend. Sometimes it will confidently average the wrong subset of rows. Sometimes it will simply get the percentage backwards. The chart in your head and the chart in the LLM’s head are not always the same chart.

The fix is structural too: run the statistics in code, hand the LLM only the structured findings. That’s what ga4-anomaly-detector does. The LLM never sees the daily metric values. It never sees the raw event rows. It sees a list of findings (“revenue dropped 38% on 2021-01-19 and stayed down”) and a system prompt forbidding speculation about causes that aren’t corroborated by other findings. If the math is right, the math is right in code, before any LLM call.

What it produces

The output is a markdown report you can pipe into Slack, email, a doc, or cat:

# GA4 Anomaly Detector
*2021-01-01 → 2021-01-31 · `bigquery-public-data.ga4_obfuscated_sample_ecommerce`*

## Headline
Revenue stepped down by 38% starting 2021-01-19 and has held.

## Key findings
- **revenue** level shift on 2021-01-19: ~$8,200 → ~$5,100
  (↓ -38% sustained over 14 days)
- **conversions** on 2021-01-22: 47 vs ~89 expected
  (↓ -47%, high severity)
- **sessions** on 2021-01-08: 4,820 vs ~3,100 expected
  (↑ +56%, medium severity)

## What changed in the mix
**revenue by source medium** (2021-01-18–2021-01-24 → 2021-01-25–2021-01-31)
- Gainer: `(direct) / (none)` 31% → 44% share
- Loser: `google / cpc` 28% → 17% share

The headline at the top is the only sentence the LLM writes that isn’t directly grounded in a numerical finding. Everything below it has a specific anomaly object with metric, date, expected, observed, severity behind it.

How the math gets done

Three detectors, picked because the alternatives produced worse signal on GA4-shaped data:

Point anomalies use STL residual z-scores. GA4 metrics have strong weekday/weekend cycles, so a naive rolling-mean z-score flags every Saturday as anomalous. STL decomposes the series into trend + weekly seasonal + residual; we z-score the residual. A day is flagged when the absolute z-score of its residual exceeds the sigma threshold (--sigma-threshold, default 3.0).
Change points use PELT with an RBF cost. A one-day spike is not a change point. A site migration that drops sessions and they stay dropped is. PELT separates the two. The RBF cost is roughly scale-invariant, so the same penalty works for sessions (thousands) and conversion_rate (single digits).
Mix shifts use Jensen-Shannon divergence. This catches the case GA4 dashboards hide: total sessions look flat, but direct doubled while organic collapsed. We compute share-of-voice distributions across adjacent windows and measure the divergence. JS is bounded [0, ln 2] so the threshold is interpretable.
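Of the three detectors, the mix-shift measure is the easiest to show in a few lines. The sketch below computes Jensen-Shannon divergence with natural logarithms over two windows of per-channel counts. The function names and the session counts are made up for illustration; the repo's implementation may differ in detail.

```python
import math


def normalize(counts: dict) -> dict:
    """Turn raw per-channel counts into shares that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {key: value / total for key, value in counts.items()}


def js_divergence(p_counts: dict, q_counts: dict) -> float:
    """Jensen-Shannon divergence (natural log), bounded by [0, ln 2]."""
    p = normalize(p_counts)
    q = normalize(q_counts)
    keys = set(p) | set(q)

    def kl_to_mixture(a: dict) -> float:
        # KL divergence from distribution a to the midpoint mixture of p and q.
        total = 0.0
        for key in keys:
            a_k = a.get(key, 0.0)
            if a_k > 0:
                m_k = 0.5 * (p.get(key, 0.0) + q.get(key, 0.0))
                total += a_k * math.log(a_k / m_k)
        return total

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical sessions by channel for two adjacent 7-day windows.
previous_week = {"(direct) / (none)": 3100, "google / cpc": 2800, "google / organic": 4100}
latest_week = {"(direct) / (none)": 4400, "google / cpc": 1700, "google / organic": 3900}
shift = js_divergence(previous_week, latest_week)
```

Identical mixes score 0 and completely disjoint mixes score ln 2, which is what makes a fixed threshold meaningful across sites of different size.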

The LLM — Gemini 3 by default, but the architecture takes any client implementing the LLMClient Protocol — receives the resulting AnomalyReport and turns it into prose. It cannot invent a finding because there’s nothing in its input it could invent one from.

Full CLI reference

The CLI has two subcommands and a shared set of flags. This section documents every option in cli.py.

Subcommands

Command	Purpose
`sample`	Run against Google’s public obfuscated GA4 sample dataset. No GA4 export of your own needed — only a GCP project for query billing. Sensible defaults for the date range that match the data available in the public sample.
`run`	Run against your own GA4 BigQuery export. Requires `--project-id` and `--dataset`.

Top-level help flag

Flag	What it does
`-h`, `--help`	Standard argparse help (top level + subcommand names only).
`--h`, `--help-all`	Custom action that prints the top-level help plus every subcommand’s full help in one pass. Use this when you want to see every flag everywhere without typing `-h` per subcommand. The short double-dash form `--h` is intentional: `allow_abbrev=False` keeps argparse from auto-expanding it to `--help`.
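A custom help-all action of this kind can be sketched with the standard library alone. This is an assumption about how such an action might be written, not a copy of cli.py, and it leans on two private argparse internals (`_actions` and `_SubParsersAction`) that could change between Python versions.

```python
import argparse


class HelpAllAction(argparse.Action):
    """Print the top-level help, then every subcommand's help, then exit."""

    def __init__(self, option_strings, dest, help=None):
        # nargs=0 makes this a flag; SUPPRESS keeps it out of the namespace.
        super().__init__(option_strings, dest, nargs=0,
                         default=argparse.SUPPRESS, help=help)

    def __call__(self, parser, namespace, values, option_string=None):
        parser.print_help()
        # _actions and _SubParsersAction are private argparse internals.
        for action in parser._actions:
            if isinstance(action, argparse._SubParsersAction):
                for name, subparser in action.choices.items():
                    print(f"\n--- {name} ---")
                    subparser.print_help()
        parser.exit()


def build_parser() -> argparse.ArgumentParser:
    """Build a cut-down parser with the two subcommands described above."""
    parser = argparse.ArgumentParser(prog="cli.py", allow_abbrev=False)
    parser.add_argument("--h", "--help-all", action=HelpAllAction,
                        help="show help for every subcommand and exit")
    subparsers = parser.add_subparsers(dest="command", required=True)
    subparsers.add_parser("sample", help="run against the public sample dataset")
    run = subparsers.add_parser("run", help="run against your own GA4 export")
    run.add_argument("--project-id", required=True)
    run.add_argument("--dataset", required=True)
    return parser
```

Calling `build_parser().parse_args(["--h"])` prints the combined help and exits with status 0, before the required-subcommand check has a chance to fail.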

`run`-specific required flags

Flag	Required	What it is
`--project-id PROJECT`	Yes	The GCP project hosting your GA4 BigQuery export.
`--dataset DATASET`	Yes	The BigQuery dataset, typically `analytics_<property_id>` (e.g. `analytics_123456789`).

Date range (both subcommands)

Flag	Type	Default	What it is
`--start YYYY-MM-DD`	date	`sample`: `2020-12-01` (start of the sample’s window). `run`: today minus 30 days.	Inclusive start of the analysis window.
`--end YYYY-MM-DD`	date	`sample`: `2021-01-31` (end of the sample’s window). `run`: yesterday.	Inclusive end of the analysis window.

Validation: if --start > --end, the CLI exits with code 2 and an error.
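The `run` defaults and the validation rule fit in a few lines. The sketch below is an assumed shape, with hypothetical function names, rather than the code in cli.py; taking `today` as a parameter keeps the default window testable.

```python
import sys
from datetime import date, timedelta


def resolve_window(start, end, today: date):
    """Apply the `run` defaults, then validate that start is not after end."""
    if start is None:
        start = today - timedelta(days=30)
    if end is None:
        end = today - timedelta(days=1)
    if start > end:
        raise ValueError(f"--start ({start}) is after --end ({end})")
    return start, end


def window_or_exit(start, end):
    """CLI wrapper: report a usage error on stderr and exit with code 2."""
    try:
        return resolve_window(start, end, date.today())
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        sys.exit(2)
```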

Billing and auth (both subcommands)

Flag	Default	What it is
`--billing-project PROJECT`	`GOOGLE_CLOUD_PROJECT` env var	GCP project to bill queries against. Required, either via this flag or the env var — the CLI exits with code `2` if neither is set.

Auth itself is provided by Google’s Application Default Credentials — run gcloud auth application-default login once, or set GOOGLE_APPLICATION_CREDENTIALS to a service-account JSON. The CLI surfaces this hint automatically if BigQuery client initialization fails.

Data shape (both subcommands)

Flag	Type	Default	What it is
`--dimensions CSV`	comma-sep list	`source_medium`	Which dimensions to slice mix-shift detection on. Valid values: `source_medium`, `device_category`, `country`, `browser`. Pass `--dimensions ""` (empty string) to skip mix-shift detection entirely — useful on large exports where it’s the slowest step.
`--metrics CSV`	comma-sep list	All standard metrics (`DEFAULT_METRICS`)	Which metrics to run anomaly detection over. Sticks to the GA4-export-derivable set.

Output (both subcommands)

Flag	Default	What it is
`-o`, `--output FILE`	stdout	If supplied, the markdown report is written to this path. Otherwise it goes to stdout, ready to pipe into `pbcopy`, `tee`, or anything else.
`-v`, `--verbose`	off	Verbose logging on stderr. Logs end up in stderr, output ends up in stdout, so `python cli.py sample > report.md` still works cleanly.

Detection tuning (both subcommands, grouped as “Detection tuning”)

Flag	Type	Default	What it is
`--sigma-threshold FLOAT`	float	`3.0`	Z-score threshold for point anomalies. Lower means more sensitive. Drop to `2.5` for small-traffic sites with high day-to-day noise; raise to `4.0` for high-volume sites where you only want extreme deviations.
`--pelt-penalty FLOAT`	float	`10.0`	Penalty parameter for the PELT change-point detector. Lower means more change points. Raise if you’re seeing spurious breaks; lower if real shifts are being missed.
`--mix-window-days INT`	int	`7`	Width of each comparison window for mix-shift detection. The default compares the last 7 days against the 7 days before, anchored to the most recent date in the export.
`--known-events DATES`	comma-sep dates	empty	Dates (YYYY-MM-DD) to exclude from point-anomaly detection. Without this, a December run will dutifully flag Christmas as a -60% anomaly. Excluded dates are also dropped from the noise-floor estimate, so they don’t bias the threshold. Example: `--known-events 2026-12-24,2026-12-25,2026-12-31,2027-01-01`. The `holidays` Python package will generate per-country lists if you don’t want to hardcode.
`--conversion-events CSV`	comma-sep list	`purchase` (the default for ecommerce)	GA4 event names that count as conversions. Matters because the `conversions` metric is derived from this list. SaaS or lead-gen sites should override: `--conversion-events sign_up,subscribe`. Without this, non-ecommerce sites will report 0 conversions and the narrative will confidently say so.
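How `--sigma-threshold` and `--known-events` interact is easier to see in code. The sketch below assumes the STL residuals have already been computed and arrive as a date-to-residual mapping; the function name and the plain mean/standard-deviation noise floor are simplifying assumptions, not the repo's exact method.

```python
from datetime import date
from statistics import mean, stdev


def flag_point_anomalies(residuals: dict, sigma_threshold: float = 3.0,
                         known_events=frozenset()) -> list:
    """Return (day, z) pairs whose residual z-score exceeds the threshold.

    Known-event dates are excluded from detection and from the noise-floor
    estimate, so they cannot inflate the standard deviation.
    """
    usable = {day: r for day, r in residuals.items() if day not in known_events}
    values = list(usable.values())
    if len(values) < 2:
        return []
    center = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []
    flagged = []
    for day in sorted(usable):
        z = (usable[day] - center) / spread
        if abs(z) > sigma_threshold:
            flagged.append((day, round(z, 2)))
    return flagged
```

Passing a date in `known_events` removes it from both sides of the comparison: it is never flagged, and it no longer widens the threshold for every other day.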

LLM narrative (both subcommands, grouped as “LLM narrative”)

Flag	Type	Default	What it is
`--no-llm`	flag	off	Skip the LLM call entirely; use the deterministic template renderer instead. Same output shape, zero tokens, useful for tuning the detector parameters in a tight loop.
`--vertex`	flag	off, but reads `GOOGLE_GENAI_USE_VERTEXAI=true` from env	Use Vertex AI (gcloud auth) for the Gemini call instead of an AI Studio API key. Reuses `--billing-project` and your Application Default Credentials. Right choice if you want unified auth + Cloud Logging audit trail; AI Studio + an API key is right for personal use.
`--model MODEL`	string	`GeminiClient.DEFAULT_MODEL` (Gemini 3 Preview at the time of writing)	The Gemini model string. Override if you want a different cost/quality tradeoff.
`--api-key KEY`	string	`GEMINI_API_KEY`, falling back to `GOOGLE_API_KEY` env var	API key for AI Studio mode only. Ignored in `--vertex` mode.

Environment variables the CLI reads

A consolidated list of the env vars the CLI consults, with which flag each one substitutes for:

Env var	Substitutes for	Effect
`GOOGLE_CLOUD_PROJECT`	`--billing-project`	Default project for billing and (in `--vertex` mode) Vertex AI.
`GOOGLE_APPLICATION_CREDENTIALS`	—	Path to a service-account JSON, used by Google client libraries’ ADC chain.
`GOOGLE_GENAI_USE_VERTEXAI`	`--vertex`	If set to `true`, defaults the CLI to Vertex AI mode.
`GOOGLE_CLOUD_LOCATION`	—	Vertex AI region. Defaults to `global` (required for Gemini 3 Preview models). Set to e.g. `europe-west4` for data residency, but make sure your model supports that region.
`GEMINI_API_KEY`	`--api-key`	Primary fallback for the AI Studio API key.
`GOOGLE_API_KEY`	`--api-key`	Secondary fallback if `GEMINI_API_KEY` isn’t set.

Exit codes

Code	When
`0`	Success — report written or printed.
`1`	Operational error — BigQuery auth failed, query failed, or no data was returned for the requested window. The CLI logs an actionable hint (e.g. “run `gcloud auth application-default login`”) before exiting.
`2`	Usage error — `--start > --end`, an unknown dimension was requested, or no billing project was supplied.

Putting it together — three canonical invocations

# 1. Test against the public sample, no LLM, no tokens spent.
python cli.py sample --billing-project my-gcp --no-llm

# 2. Run against your own export, last two weeks, three dimensions, save to a file.
python cli.py run \
    --project-id my-project \
    --dataset analytics_123456789 \
    --start 2026-05-01 --end 2026-05-15 \
    --dimensions source_medium,device_category,country \
    --output weekly-report.md

# 3. Same, but using Vertex AI for narration (gcloud auth, no API key).
python cli.py run \
    --project-id my-project \
    --dataset analytics_123456789 \
    --start 2026-05-01 --end 2026-05-15 \
    --vertex \
    --model gemini-3-preview \
    --output weekly-report.md

Using it from Claude Desktop — and a word of caution

The repo also ships an mcp_server.py that exposes the analyze pipeline as an MCP tool. Any compatible client (Claude Desktop, Cursor, Claude Code, Gemini CLI) can call it. The server returns the structured findings as JSON; the client’s LLM narrates them in context. This is the closest you can get to “ask Claude about my GA4 traffic” while keeping the math out of the LLM’s hands.

But this is the part where I have to be honest about what’s missing.

This repo is pre-release. The MCP server in it is intentionally minimal. It does not yet have any of the guardrails my other open-source MCP server has:

Guardrail	BigQuery Read-Only MCP	GA4 Anomaly Detector MCP
Hard table allowlist	Yes — parsed before query submission	No — it queries whatever dataset you tell it to
Per-query scan ceiling	Yes — dry-run + reject above `MAX_SCAN_MB`	No
Token-bucket rate limit	Yes — configurable QPM + burst	No
Result row cap	Yes — `MAX_RESULT_ROWS` truncation	No
API key auth	Yes — constant-time header validation	No — runs locally, trusts the calling client
Suitable for multi-tenant or customer-facing deployment	Yes	No

The anomaly detector’s MCP server is designed for single-user, local use — it runs on your laptop, against your own GA4 export, called by your own Claude Desktop. In that context, none of the missing guardrails matter, because you can’t injection-attack your own laptop. But if you’re thinking about deploying this server to Cloud Run or putting it behind a proxy so a team can share it, don’t, not without adding the layers above first.
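As one example of what those layers involve, constant-time API key validation needs only the standard library. This is a generic sketch with a hypothetical function name, not the code from either repo, and how the key is read from the request is left out.

```python
import hmac


def is_authorized(presented_key, expected_key: str) -> bool:
    """Compare API keys without leaking match position through timing."""
    if not presented_key or not expected_key:
        return False
    # compare_digest takes time independent of where the inputs differ.
    return hmac.compare_digest(
        presented_key.encode("utf-8"), expected_key.encode("utf-8")
    )
```

A plain `==` comparison can return earlier the sooner the strings differ, which is the timing signal `hmac.compare_digest` is designed to remove.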

The longer write-up on why guardrails matter is in Stopping the $2,000 AI query — the short version is that any MCP server fronted by an LLM is one bad prompt away from a four-figure invoice unless the server refuses to run queries above a cost cap. The pattern from that post — dry-run, reject above ceiling — is exactly the patch this repo needs before 1.0.

If you want to use this tool today, the safest mode is the CLI, not the MCP server. The CLI runs once per command, against the dataset you explicitly point it at, and exits. There’s no persistent process for a prompt injection to talk to.

FAQ

Why not just use Looker Studio’s built-in anomaly detection?

Looker Studio’s anomaly bands are useful for eyeballing. They don’t give you change points, don’t give you mix shifts, don’t give you a narrative you can paste into Slack, and don’t accept tuning parameters. This tool produces something closer to what an analyst would write at the end of a review — “here are the four things you should know about this week” — not a chart you have to interpret.

Can I plug in a different LLM than Gemini?

Yes. The narrative module defines an LLMClient Protocol. GeminiClient implements it; you can drop in an Anthropic or OpenAI client by writing your own and passing it to render_with_llm(). The CLI doesn’t have a flag for that yet — easy patch if anyone wants it.
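The general shape of that plug-in point looks roughly like the sketch below. Everything in it is an assumption for illustration: the method name `generate`, its signature, and the `render_with_llm_sketch` stand-in are not taken from the repo, so check the narrative module for the real Protocol before writing a client.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMClient(Protocol):
    # Hypothetical method name and signature; the repo's Protocol may differ.
    def generate(self, system_prompt: str, user_prompt: str) -> str: ...


class EchoClient:
    """Stand-in client; a real one would call an Anthropic or OpenAI SDK here."""

    def generate(self, system_prompt: str, user_prompt: str) -> str:
        return f"[narrative for {len(user_prompt)} characters of findings]"


def render_with_llm_sketch(findings_json: str, client: LLMClient) -> str:
    """Hypothetical stand-in for the repo's render_with_llm()."""
    system_prompt = (
        "Describe only the findings provided. "
        "Do not speculate about causes that the findings do not support."
    )
    return client.generate(system_prompt, findings_json)


report = render_with_llm_sketch('{"findings": []}', EchoClient())
```

Because a Protocol is structural, a replacement client needs no inheritance; it only has to provide a method with the matching name and signature.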

Will this work on a small site with low daily traffic?

Probably with tuning. The defaults are picked for medium-traffic ecommerce (roughly the shape of the public sample dataset). On a small site, drop --sigma-threshold to 2.5 and --pelt-penalty to 5.0 so the detectors actually fire on the smaller signal-to-noise ratio. The mix-shift detector is more forgiving — it operates on share-of-voice, not raw volume, so it works fine even at low traffic.

How does it treat modeled data from consent mode?

The detector reads whatever’s in your GA4 export. If your export contains modeled data due to consent mode, the detector treats it the same as observed data, because the math doesn’t distinguish. That’s usually the right call for trend detection (you want to see the modeled traffic move), but it means a sudden change in the modeled/observed ratio (e.g., from a cookie banner change) will show up as a real anomaly. That’s correct behavior; just be aware when interpreting the result.

Why a weekend project, not a full product?

Because I wanted to know if the architecture worked before committing more time. The benchmark — do the structured findings actually make the narrative more accurate? — got a clear answer (yes, dramatically) and the tool was already useful, so I shipped it. The roadmap in the README has the obvious next steps (synthetic-anomaly tests, more detectors, the guardrails for MCP mode) — happy to take PRs.

How does this interact with the BigQuery MCP server?

They’re complementary. The BigQuery Read-Only MCP server is for general-purpose SQL access from an agent, with guardrails. This anomaly detector is for a specific computation (anomaly detection on GA4) where the computation is in code and only the narration is the agent’s job. If you connect Claude to both servers, you can ask Claude “did anything weird happen last week” (handled by the anomaly detector) and then “show me the top 10 affected pages” (handled by the BigQuery server). Different jobs; different tools.

Source code

Full source — STL/PELT/JS-divergence detectors, BigQuery fetcher, MCP server, CLI: github.com/hugonissar/GA4-Anomaly-Detector. MIT-licensed. Pre-release; PRs welcome, especially around the guardrails work needed before the MCP server is safe to share across a team.

Forecasting GA4 revenue in pure SQL with BigQuery ML — and the model-quality checks you can’t skip

2026-05-18T00:00:00+02:00

A revenue forecast is a number that looks the same whether the model fit your data brilliantly or learned almost nothing useful. The chart in Data Studio draws a line either way. That’s the trap with auto-ARIMA forecasting in BigQuery ML — you get output before you’ve earned the right to trust it, and the output is presentable enough to end up in a deck.

This post walks through a new pre-release pipeline I’ve been building, GA4-BigQuery-ML-Sales-Forecast, which produces a 30-day daily revenue forecast in pure SQL from your GA4 BigQuery export. The architecture is interesting — pure SQL, no Python, no Vertex AI, no Cloud Functions — but the more important story is the model-quality checks, because they’re what decides whether the forecast is useful or decorative.

TL;DR


What it is	A pure-SQL pipeline that produces a 30-day daily GA4 revenue forecast, with custom holidays, automatic seasonality, ad-spend regressors, and seasonal sub-forecasts for upstream funnel metrics
How it runs	One scheduled query refreshes data daily; another retrains the model weekly. Output is a single BigQuery view that plugs straight into Data Studio
Algorithm	`ARIMA_PLUS_XREG` for the main revenue model, four `ARIMA_PLUS` univariate models for the funnel sub-forecasts
Status	Pre-release. Validated on synthetic and small production datasets. Real-world testing wanted
License	MIT
The catch	Auto-ARIMA gives a confident output whether or not the model fits well. Skip §4 of the SQL and you have no idea what you’ve built

What the pipeline does

Two phases:

Build (first run, then weekly retrain). Aggregate GA4 events and ad metrics into a daily fact table. Train one ARIMA_PLUS_XREG model on revenue, plus four small ARIMA_PLUS sub-models on the funnel metrics that will become the future regressors.
Forecast (daily refresh, live on every query). The sub-forecasts predict tomorrow’s funnel volume. The trained XREG model uses those plus your planned ad spend to predict tomorrow’s revenue. A view unions forecast with history for Data Studio.

After running the script you end up with 11 objects in your dataset — 3 tables, 6 models, 2 views. No Python, no Cloud Functions, no Vertex endpoints, no orchestrator. Everything is scheduled queries plus views.

What makes the design interesting

A few choices worth highlighting:

Funnel-as-regressors with their own forecasts. The main model uses view_item, add_to_cart, begin_checkout, and ad-spend variables as exogenous regressors. The challenge with regressors is always “how do I know tomorrow’s value?” — solved here by training a separate ARIMA_PLUS model per regressor, so the system forecasts its own inputs. Recursive forecasting like this works because the funnel metrics are themselves seasonal and stable enough to predict on their own.
Custom occasions table. Built-in holiday regions catch most calendar effects, but they don’t know about your Black Friday week, your May campaign, or your anniversary sale. The custom_holidays table lets you name these and tag a window around each. The model learns a lift per named occasion and applies it to future occurrences.
Scenario planning baked in. Because ad spend is an explicit regressor, the forecast reacts to changes in your media plan. Override the trailing-average defaults in future_regressors with values from a media_plan table and the forecast updates the next time the view is queried.
Explainability view as a first-class citizen. §8 of the SQL builds a sales_forecast_explained view backed by ML.EXPLAIN_FORECAST. Every forecasted point decomposes into trend, weekly seasonality, yearly seasonality, holiday lift, and per-regressor attribution. This is the view that makes the eval step actionable rather than abstract.

Why you cannot trust auto-ARIMA without evaluating it

This is the part that matters most, so it gets its own section.

ARIMA_PLUS_XREG is a great default — Google’s implementation does automatic order selection, automatic seasonality detection, automatic outlier handling, and automatic structural-break detection. The “automatic” is the appeal and the risk in the same breath. The model always fits something. It always produces a prediction. It always plots a smooth forecast curve with confidence bands. None of that tells you whether the fit is any good.

Real ways the same pipeline can go wrong while producing identical-looking output:

The training data has a structural break the model can’t reach over. A pricing change six months ago, a new market launched, a viral spike during one campaign. Auto-ARIMA may or may not flag these as outliers; even when it does, the rest of the model is fit on data that’s no longer representative of the present. The forecast looks plausible — and is systematically biased.
A regressor is silently doing nothing. You added ad spend as a regressor expecting it to drive the forecast. The model gave it near-zero weight because in your historical data, spend and revenue weren’t correlated (maybe revenue is dominated by organic, maybe spend is too steady to be informative). Now your “spend-aware” forecast isn’t spend-aware at all, and your scenario plans don’t shift the line.
Yearly seasonality is over-fit on too little history. With under ~18 months of clean data, the yearly seasonality term picks up noise as if it were signal. Next January’s forecast reflects last January’s freak week.
A custom occasion was added with one historical example. The model learns whatever happened that one Black Friday and projects it forward with high confidence. One data point is not a season.
Recent data is missing or stale. GA4 export delays, a paused tag, a broken scheduled query upstream — and the model trains on a truncated view of the last 30 days. The trend term picks up the artificial drop and forecasts continued decline.

None of those situations produce an error. All of them produce a chart.

The validation workflow — §4 of the SQL

The repo’s section §4 is the eval step, and it’s the section that should never be skipped on a first run or after a major retrain. The pattern:

1. Train an eval model identical to production, but on data ending N days ago (default: 30 days).
2. Run ML.EVALUATE against the held-out N days that the eval model never saw.
3. Read the metrics.
4. Drop the eval model. It exists only for this measurement; the production model is the one that trained on everything.

The key SQL for the evaluation step is small enough to read in one breath:

SELECT *
FROM ML.EVALUATE(
  MODEL `YOUR_PROJECT.YOUR_DATASET.revenue_forecast_eval`,
  (
    SELECT day, revenue, view_item, add_to_cart,
           begin_checkout, spend, impressions, clicks
    FROM `YOUR_PROJECT.YOUR_DATASET.daily_features`
    WHERE day BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
                  AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  ),
  STRUCT(0.95 AS confidence_level, TRUE AS perform_aggregation)
);

You get back five numbers: mean_absolute_error, mean_squared_error, root_mean_squared_error, mean_absolute_percentage_error, and symmetric_mean_absolute_percentage_error.

The one you care about most is MAPE (mean absolute percentage error).

What MAPE numbers actually mean

There’s a reasonable amount of confusion about what counts as a “good” MAPE for revenue forecasting because the right answer depends on your business volatility. Practical operating ranges for daily ecommerce revenue:

MAPE	Interpretation	What to do
< 8%	Excellent. Usually means very stable revenue (subscription, B2B) and clean data	Trust it. Use it for planning
8–15%	Healthy. The pipeline is working as expected for typical ecommerce with normal noise	Trust it. Most forecasts in production live here
15–25%	Usable but verify. Likely either high day-to-day volatility, modest data history, or a recently-changed business	Use for directional planning. Don’t pin budget decisions to specific forecast values
25–40%	Don’t trust this in isolation. The model is fitting something but it’s noisy enough that a +30% surprise tomorrow wouldn’t be unexpected	Investigate before publishing the forecast
> 40%	Broken. Either data quality, training history, or a structural mismatch with the model	Diagnose; do not deploy

A subtlety: MAPE penalizes percent errors equally regardless of the day’s volume. A 50% miss on a $100 day and a 50% miss on a $100,000 day both count the same. If your business has many low-volume days, your MAPE will read worse than the model deserves — look at MAE in actual currency units to sanity-check.

SMAPE (symmetric MAPE) is more forgiving on small denominator days; if MAPE is much worse than SMAPE on the same period, low-volume days are distorting the average.
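To see the low-volume distortion in numbers, here is a minimal sketch that computes MAE, MAPE, and SMAPE by hand. The data is made up (six normal days plus one very quiet day), the function name `error_metrics` is hypothetical, and the SMAPE formula used here is the common "mean of absolute actual and forecast" variant, which may differ in detail from what ML.EVALUATE reports. It also assumes no day has zero actual revenue.

```python
def error_metrics(actual, forecast):
    """Return MAE (currency units), MAPE and SMAPE (percent) for paired daily values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("need two non-empty series of equal length")
    n = len(actual)
    abs_err = [abs(a - f) for a, f in zip(actual, forecast)]
    mae = sum(abs_err) / n
    # MAPE divides each error by the actual value, so tiny actuals inflate it.
    mape = 100 * sum(e / abs(a) for e, a in zip(abs_err, actual)) / n
    # SMAPE divides by the mean of |actual| and |forecast| instead.
    smape = 100 * sum(
        e / ((abs(a) + abs(f)) / 2)
        for e, a, f in zip(abs_err, actual, forecast)
    ) / n
    return {"mae": mae, "mape": mape, "smape": smape}


# Hypothetical week: the last day is a $100 day forecast at $400.
actual = [10_000, 11_000, 9_500, 10_500, 10_000, 12_000, 100]
forecast = [10_400, 10_600, 9_900, 10_100, 10_300, 11_500, 400]
metrics = error_metrics(actual, forecast)
print({name: round(value, 1) for name, value in metrics.items()})
```

On this toy week the absolute miss on the quiet day is no larger than on the other days, yet that one day pushes MAPE far above SMAPE. That gap is the signal to go and read MAE in currency units.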

Reading ML.EXPLAIN_FORECAST: where the prediction comes from

Knowing the model has a decent MAPE is necessary but not sufficient. The next question is: where is each forecasted dollar coming from? If you deploy a “spend-aware” forecast and 95% of every prediction’s value attributes to the trend term, your forecast isn’t actually responding to spend — it’s a stationary line dressed up as a media plan.

The sales_forecast_explained view (§8) makes this visible. Each forecasted day has columns like:

Column	What it tells you
`attribution_trend`	The slow-moving level component
`attribution_weekly_seasonality`	The day-of-week effect (your Tuesday is +12%, your Sunday is −20%, etc.)
`attribution_yearly_seasonality`	The calendar position effect (you’re up 8% relative to year average)
`attribution_holiday`	Lift from named occasions (Black Friday, your custom sales)
`attribution_spend`	How much of tomorrow’s predicted revenue is driven by tomorrow’s planned spend
`attribution_view_item`, `attribution_add_to_cart`, etc.	Funnel metric contributions

Two questions to ask every time you look at this view:

Does one term dominate everything else? A healthy model spreads attribution across components. If attribution_trend is 90% of every forecast, your regressors and seasonality are doing almost nothing — the model has effectively collapsed to a simple time-series.
Does the attribution match your intuition? If you know spend is the most important driver of your revenue and attribution_spend is near zero, something is wrong upstream. Most often: spend and revenue aren’t actually correlated in your historical data (verify with a quick CORR()), the regressor was added recently with too little history, or the regressor’s variance is too low for the model to learn from.
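The first question can be turned into a quick check. The sketch below is an illustration only: it assumes each row of the view has been loaded as a dict, treats every key starting with `attribution_` as a component, and measures dominance as a share of total absolute attribution. The function names and the 0.9 threshold are hypothetical choices, not something the repo defines.

```python
def attribution_shares(row):
    """Share of total absolute attribution per component for one forecast day."""
    parts = {k: abs(v) for k, v in row.items() if k.startswith("attribution_")}
    total = sum(parts.values())
    if total == 0:
        raise ValueError("row has no non-zero attribution columns")
    return {k: v / total for k, v in parts.items()}


def dominant_component(row, threshold=0.9):
    """Return (name, share) if one component reaches the threshold, else None."""
    shares = attribution_shares(row)
    name, share = max(shares.items(), key=lambda kv: kv[1])
    return (name, share) if share >= threshold else None


# Hypothetical forecast day where the trend term carries almost everything.
collapsed = {
    "day": "2026-06-01",
    "attribution_trend": 9200.0,
    "attribution_weekly_seasonality": -300.0,
    "attribution_spend": 150.0,
    "attribution_view_item": 350.0,
}
# Hypothetical forecast day with attribution spread across components.
healthy = {
    "day": "2026-06-02",
    "attribution_trend": 5000.0,
    "attribution_weekly_seasonality": 1500.0,
    "attribution_spend": 2500.0,
    "attribution_view_item": 1000.0,
}
print(dominant_component(collapsed))
print(dominant_component(healthy))
```

Absolute values are used so that a negative seasonal effect still counts as "doing something". If you prefer signed shares, the dominance test needs a different definition.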

What to do when the eval fails

Concrete responses to the most common failure modes:

Symptom	Likely cause	Response
MAPE > 25% on a long-running stable business	Recent structural break (pricing, market, product mix)	Limit training window to post-break period via `data_window` in the model options. Accept some short-term volatility cost
MAPE > 25% on a young business	Not enough history — yearly seasonality overfits	Either disable yearly seasonality (`auto_arima=TRUE` with `seasonal_periods=['WEEKLY']` only) or wait for more history before going to production
MAPE looks fine but `attribution_spend` is near zero	Spend and revenue weren’t correlated historically	Either remove spend as a regressor, or check whether the right lag is being used (today’s spend often doesn’t drive today’s revenue — try a 1- or 2-day lag)
One huge spike day skews everything	Auto-outlier detection didn’t catch it	Add the day to `custom_holidays` with `holiday_name='one_off'` and re-train, or filter it out of `daily_features`
Forecasts look flat over the next 30 days	Funnel sub-forecasts have too little data to learn weekday seasonality	Confirm `daily_features` has at least 13 months of data and the GA4 ecommerce events are actually firing
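For the spend-lag row in particular, a lagged correlation check is cheap to run before retraining. This is a sketch under stated assumptions: `spend` and `revenue` are plain lists of daily values in date order with no gaps, the helper names are hypothetical, and Pearson correlation is only a rough screen, not a substitute for re-running the eval.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no correlation to measure")
    return cov / sqrt(var_x * var_y)


def lagged_correlations(spend, revenue, max_lag=2):
    """Correlate spend on day t with revenue on day t + lag, for each lag."""
    if len(spend) != len(revenue):
        raise ValueError("spend and revenue must cover the same days")
    result = {}
    for lag in range(max_lag + 1):
        xs = spend[: len(spend) - lag]
        ys = revenue[lag:]
        result[lag] = pearson(xs, ys)
    return result


# Made-up data where revenue follows yesterday's spend.
spend = [100, 300, 200, 500, 400, 600, 150, 350]
revenue = [1200] + [1000 + 3 * s for s in spend[:-1]]
correlations = lagged_correlations(spend, revenue)
best_lag = max(correlations, key=correlations.get)
print(best_lag, correlations)
```

If the best lag is not zero, shifting the spend column by that many days in `daily_features` is the change to test. A constant spend series raises an error here, which matches the "variance is too low for the model to learn from" case.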

A note on retraining cadence

The pipeline schedules a weekly retrain by default. That’s the right default for most ecommerce businesses — the underlying behavior changes slowly enough that weekly is fresh enough, and ARIMA_PLUS_XREG training is cheap enough that running it every Monday morning costs essentially nothing.

Two cases where you’d want to retrain more often:

You’re heading into a high-stakes period (Black Friday, an anniversary sale) and the most recent four weeks contain the kind of signal you want the model to weight. Daily retrains for two weeks before the event, then revert to weekly.
A regressor just changed regime (you doubled your media budget, a new channel went live). Retrain on demand once you have 2–3 weeks of data at the new level so the model recalibrates its regressor weights.

And one case where you’d want to retrain less often:

You ran §4 and got a great MAPE. Don’t retrain mid-week if you’ve already validated the current model. Weekly retrains are about keeping the model fresh; if the current model is fitting well and the underlying business is stable, additional retraining only adds variance.

This is pre-release — testers wanted

The pipeline has been validated on synthetic data and small production datasets, but real-world ecommerce data has quirks that only show up at scale. The README has a Contributing section laying out exactly what would help most — currently I’m looking for testers with 2+ years of stable GA4 export history and at least 500 daily orders or 50k daily sessions, especially in less-tested setups: subscription/recurring revenue, heavy promotional calendars, non-US holiday regions.

The contribution loop is small: run the pipeline on your data, open an issue with your MAPE, business profile, and anything that broke or surprised you. The goal is a public benchmark table of realistic MAPE numbers across different business types so other users have a calibration point before they deploy.

FAQ

How is this different from Google’s GA4 predictive metrics?

GA4 ships predictive metrics (purchase probability, revenue prediction) at the user level — given this user’s behavior, how likely are they to purchase? Useful for audience targeting. This pipeline operates at the business level — given the whole property, what’s tomorrow’s total revenue? Different question, different model class, complementary outputs.

Why ARIMA_PLUS_XREG and not Prophet, NeuralProphet, or DeepAR?

ARIMA_PLUS_XREG is the only BQML-native option that supports exogenous regressors with built-in seasonality and holiday handling. Prophet would work but requires moving data out of BigQuery into Python. NeuralProphet and DeepAR require even more infrastructure. For a marketing analytics team that lives in BigQuery + Looker Studio (formerly Data Studio), the operational cost of introducing a Python orchestrator is high. The pipeline trades some modeling ceiling for radically lower ops complexity.

What if I don’t have ad spend data?

You can run the pipeline with only the funnel regressors — drop spend, impressions, and clicks from §1 and from the model options in §3. The ad-aware scenario planning goes away, but the seasonal forecasting still works. Expect MAPE to be 2–5 points worse depending on how much of your revenue volatility was driven by spend.

How does this handle Consent Mode v2?

Revenue is computed from purchase events in the GA4 export, which respect consent settings at collection time: non-consented users either don’t appear or appear as modeled estimates depending on your GA4 setup. The pipeline doesn’t add a separate consent filter because at the aggregate-revenue level you want all observed revenue, modeled or not.

Can I forecast more than 30 days out?

You can — extend the forecast horizon in ML.FORECAST and in future_regressors. Expect accuracy to fall off after roughly 30 days because the sub-forecasts driving the regressors degrade in the same way the main model does. For longer horizons, switch to weekly granularity (aggregate daily_features to weekly first) or expect 25–35% MAPE in weeks 5–8.

Are there cost limits?

ARIMA_PLUS_XREG training scans the full training window each time. For 2 years of daily data with ~10 regressor columns, each train scans on the order of a few megabytes — well under a cent per train at on-demand pricing. Daily refresh is similarly trivial. Even on the most generous weekly retrain + daily refresh schedule, total monthly BigQuery cost from this pipeline is in the low single dollars.

Source code

Full SQL, dataflow diagrams, IAM setup, troubleshooting guide, and the contributing checklist are at github.com/hugonissar/GA4-BigQuery-ML-Sales-Forecast. MIT-licensed (placeholder, will finalize for 1.0).

If you run it on production data, please open an issue with your MAPE and business profile — that’s the single most useful contribution at this stage of the project.

How to build a GA4 purchase propensity model with BigQuery ML (without Vertex AI)

2026-05-15T00:00:00+02:00

GA4 ships predictive audiences out of the box — purchase probability, churn probability — and they’re a fine default if you have the traffic for them. The threshold is ≥1,000 returning users with purchases and ≥1,000 returning users without, in the past 28 days. Many small and mid-sized ecommerce sites never cross it.

If that’s you, you can roll your own with BigQuery ML. No Vertex AI, no AutoML, no third-party SaaS — just BigQuery, a Cloud Function, the GA4 Measurement Protocol, and Cloud Scheduler. This post walks through the design end-to-end, with the full source open on GitHub.

What the pipeline looks like

GA4 → BigQuery export → weekly BQML training → daily Cloud Function scoring
                                                          ↓
                                            GA4 Measurement Protocol push
                                                          ↓
                                            GA4 audiences → Google Ads

Two files, one model, one Cloud Function. Weekly training runs as a scheduled BigQuery query (Sunday 00:01). Daily scoring runs as a Cloud Function (06:00 local time) that scores every user active in the last 24 hours, buckets them into low / mid / high, and pushes the bucket back to GA4 as a user property via the Measurement Protocol.
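The push step at the end of that chain can be sketched as follows. This is a minimal illustration, not the repo's Cloud Function: the user property name `propensity_bucket` and the event name `propensity_scored` are hypothetical, and it assumes a web stream where `user_pseudo_id` from the export can be used as the Measurement Protocol `client_id`. It only builds the request; sending it and handling retries are left out.

```python
import json
from urllib.parse import urlencode

MP_ENDPOINT = "https://www.google-analytics.com/mp/collect"
VALID_BUCKETS = {"low", "mid", "high"}


def build_mp_request(measurement_id, api_secret, client_id, bucket):
    """Return (url, json_body) for one Measurement Protocol POST."""
    if bucket not in VALID_BUCKETS:
        raise ValueError(f"unknown bucket: {bucket!r}")
    query = urlencode({"measurement_id": measurement_id, "api_secret": api_secret})
    payload = {
        "client_id": client_id,
        # The bucket travels as a user property so GA4 audiences can use it.
        "user_properties": {"propensity_bucket": {"value": bucket}},
        # The payload needs at least one event; this name is a placeholder.
        "events": [{"name": "propensity_scored", "params": {}}],
    }
    return f"{MP_ENDPOINT}?{query}", json.dumps(payload)


url, body = build_mp_request("G-TEST123", "secret", "123.456", "high")
print(url)
print(body)
```

The user property also has to be registered as a user-scoped custom dimension in GA4 before it can be used in an audience definition.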

Why BQML over Vertex AI for this

For a single binary classification model on GA4 data, BQML is the right tool. Three reasons:

Data stays in BigQuery. Training reads directly from the GA4 export. No data movement, no scheduling, no separate feature store.
HP tuning is built in. OPTIONS(num_trials=20, ...) sweeps learn_rate, max_tree_depth, subsample, and l2_reg and reports ML.TRIAL_INFO after training. No separate orchestration.
Inference is a SQL query. ML.PREDICT(MODEL ...) returns scored rows. The whole daily job is one BigQuery query plus an HTTP loop.

Vertex AI is the right answer when you outgrow this — multi-model serving, custom containers, GPU training. For a ~30-feature boosted-tree classifier on 180 days of GA4 events, BQML is simpler, cheaper, and equally accurate.

The features

The model uses ~30 features across five categories:

Category	Features
Engagement	`total_engagement_seconds`, `engagement_last_30d_seconds`, `engagement_last_7d_seconds`
Sessions	`sessions_total`, `engaged_sessions`, `engaged_session_rate`, `avg_session_depth`, `avg_session_engagement_seconds`, `max_session_engagement_seconds`
Funnel events	`view_item_list_count`, `select_item_count`, `view_item_count`, `add_to_wishlist_count`, `add_to_cart_count`, `view_cart_count`, `remove_from_cart_count`, `begin_checkout_count`, `add_shipping_info_count`, `add_payment_info_count`
Activity	`page_view_count`, `total_events`, `days_since_last_visit`
Funnel ratios	`list_to_view_ratio`, `view_to_cart_ratio`, `cart_to_checkout_ratio`, `checkout_to_payment_ratio`, `cart_abandon_ratio`

The ratios matter most. In every ecommerce model I’ve trained, the top three feature importances are some combination of checkout_to_payment_ratio, add_payment_info_count, and cart_to_checkout_ratio. Users who reach checkout but don’t pay are the highest-value retargeting segment.

Aren’t `begin_checkout` and `add_payment_info` data leakage?

A common worry. The short answer: no, they aren’t.

Data leakage means using information at training time that would not be available at prediction time. This model builds features from the 180-day feature window (210 to 31 days ago) and labels from a separate 30-day window (30 to 1 day ago). The windows don’t overlap. An add_payment_info event 60 days ago is a real, legitimate predictor of a purchase event 10 days ago, which is exactly the signal you want the model to learn.

The full training SQL with the windowing is in training.sql in the repo.
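The window arithmetic is easy to get off by one, so here is a small sketch that derives both windows from a reference date and checks that they don't overlap. The function name and the dict layout are hypothetical; only the day offsets come from the description above.

```python
from datetime import date, timedelta


def training_windows(today, feature_days=180, label_days=30):
    """Return inclusive (start, end) date pairs for the feature and label windows."""
    label_end = today - timedelta(days=1)
    label_start = today - timedelta(days=label_days)
    # The feature window ends the day before the label window starts.
    feature_end = label_start - timedelta(days=1)
    feature_start = feature_end - timedelta(days=feature_days - 1)
    return {
        "feature": (feature_start, feature_end),
        "label": (label_start, label_end),
    }


today = date(2026, 5, 15)
windows = training_windows(today)
feature_start, feature_end = windows["feature"]
label_start, label_end = windows["label"]
print("features:", feature_start, "to", feature_end)
print("labels:  ", label_start, "to", label_end)
```

With the defaults, the feature window runs from 210 to 31 days before the reference date and the label window from 30 to 1 day before it, so no event can land in both.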

Consent: filter at the push step, not at training

This is the design decision that gets the most pushback, so it’s worth being explicit.

Training data is not filtered by consent. Scoring data is.

The pipeline reads all GA4 events at training time to maximize the positive class. A high-intent user behaves the same regardless of which consent button they clicked, so filtering training data to consented users only shrinks the positive set by 30–40% (typical EEA rates) and materially hurts model quality.

At the scoring step, the SQL filters to ads_personalization_consent = 'GRANTED'. Non-consented users never get a propensity bucket, never enter the GA4 audience, never reach Google Ads. The moment data is used for ad targeting is the moment consent legally binds — that’s the push step, and that’s where the filter lives.

If your organisation requires a stricter “no non-consented data in any downstream system” policy, the repo has a one-CTE patch that adds a consent filter to training too. See the README’s Consent model section.

Three buckets, fixed thresholds

The model outputs a probability. The pipeline buckets it:

low — predicted probability < 0.40
mid — 0.40 ≤ p < 0.70
high — p ≥ 0.70

Thresholds are fixed, not percentile-based, so bucket meaning stays stable across retrains. The first model run gives you a distribution; revisit the thresholds once if it’s heavily skewed. After that, leave them alone.
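In code, the fixed-threshold bucketing plus the consent filter at the scoring step might look like the sketch below. It assumes scored rows arrive as dicts; the key `predicted_probability` and the function names are hypothetical, while `ads_personalization_consent` and the thresholds are the ones described above. In the pipeline itself the consent filter lives in the scoring SQL, so treat this as an illustration of the logic rather than the implementation.

```python
def propensity_bucket(p):
    """Map a predicted purchase probability to low / mid / high."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {p}")
    if p >= 0.70:
        return "high"
    if p >= 0.40:
        return "mid"
    return "low"


def buckets_for_push(rows):
    """Bucket only users who granted ads personalization consent."""
    return {
        row["user_pseudo_id"]: propensity_bucket(row["predicted_probability"])
        for row in rows
        if row.get("ads_personalization_consent") == "GRANTED"
    }


# Hypothetical scored rows.
rows = [
    {"user_pseudo_id": "a", "ads_personalization_consent": "GRANTED", "predicted_probability": 0.82},
    {"user_pseudo_id": "b", "ads_personalization_consent": "DENIED", "predicted_probability": 0.90},
    {"user_pseudo_id": "c", "ads_personalization_consent": "GRANTED", "predicted_probability": 0.40},
    {"user_pseudo_id": "d", "ads_personalization_consent": "GRANTED", "predicted_probability": 0.05},
]
pushed = buckets_for_push(rows)
print(pushed)
```

User `b` scores highest but never gets a bucket, which is the whole point of filtering at the push step.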

How much traffic do you need?

Realistic operating points, assuming a 3% conversion rate:

Workable: 1,000+ weekly visitors / 30+ weekly purchases — usable for ranking, expect noise in bucket sizes.
Comfortable: ~3,300 weekly visitors / ~100 weekly purchases — stable AUC.
Ideal: ~8,000 weekly visitors / ~250 weekly purchases — reliable for production.
Robust: ~16,000 weekly visitors / ~500 weekly purchases — safe to fully automate against ad spend.

Below 1,000 weekly visitors, both this pipeline and GA4’s built-in audiences struggle. The issue is statistical, not technical — there just aren’t enough positives to train against.

Deploying it

The repo has a six-step quick start: create the BQML dataset, train the model once manually, schedule the weekly training as a BigQuery scheduled query, deploy the Cloud Function, schedule the daily run with Cloud Scheduler, and verify in GA4 DebugView. Total setup time is about an hour if you have GA4 BigQuery export already turned on; longer if you don’t (you need both daily and streaming exports on).

The full quick-start is in the repository README.

FAQ

Why not write audiences directly via the Google Ads API?

The Ads API audience push requires more permissions, more setup, and doesn’t keep GA4 reporting in sync. Writing back as a GA4 user property keeps everything in one place: GA4 reports, GA4 audiences, and Google Ads remarketing all read from the same source.

Why exclude recent buyers from both training and scoring?

Two reasons. First, retargeting people who just bought wastes ad spend. Second, in training, frequent buyers dominate the positive class — the model learns “frequent buyers buy again” instead of “high-intent prospects buy.” You don’t need ML to target recent buyers; create a rule-based audience in GA4 (purchase event in last 30 days) and run retention campaigns against it separately.

Can I run this without Consent Mode v2?

Technically yes (remove the consent filter from the SQL), but then you can’t legally push the data to Google Ads for ad targeting in jurisdictions that require explicit consent. The pipeline is designed consent-first because that’s the constraint that actually matters in production.

Does it work with Firebase / app properties?

With minor changes to the SQL — user_pseudo_id exists in app exports too, but the consent param keys differ. Easy to adapt.

Source code

Full source, training SQL, IAM least-privilege guide, and Cloud Function deployment script on GitHub: github.com/hugonissar/GA4-Ecommerce-BQML-Purchase-Propensity. MIT-licensed.

Stopping the $2,000 AI query: how to cap BigQuery scan cost from an MCP server

2026-05-11T00:00:00+02:00

Every AI agent connected to BigQuery is one bad query away from a four-figure invoice. The agent doesn’t have to be malicious: a well-meaning request like “find me users with similar behavior to the ones who converted last month” can quietly become a JOIN of three multi-terabyte tables with no partition filter. BigQuery happily runs it, scans 6 TB, charges you about $37.50 at $6.25/TB, and emails the bill. Let an agent retry that a few dozen times and you are in four-figure territory.

This post is about preventing that. There are three layers of defense; only one is bulletproof at the level of a single query. The right setup combines all three.

The three layers

Layer	Where it lives	What it stops	Bulletproof?
Application-layer scan ceiling	Your MCP server (dry-run + reject)	Any single query above the ceiling	Yes — if the ceiling is set
IAM custom role with quota	GCP IAM (custom role with project quota)	A misconfigured service account exceeding its quota	Mostly — quotas are per-day, not per-query
Project-wide bytes-billed quota	BigQuery → Quotas	A total spend cap across all queries from the project	Yes — but at project granularity, not user/role

Layer 1: application-layer scan ceiling

This is the cheapest and most precise. Every query the MCP server is about to run gets a dry-run first. BigQuery’s dry-run returns the estimated bytes scanned without actually executing the query or charging for it. If the estimate exceeds your ceiling, the query is rejected before the real job ever runs.

The Python is short — this is what BigQuery-Read-Only-MCP-Server does on every query:

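import os

from google.cloud import bigquery

# Assumption: the project ID comes from the environment, like the cap below.
GCP_PROJECT_ID = os.environ["GCP_PROJECT_ID"]
MAX_SCAN_MB = int(os.environ.get("MAX_SCAN_MB", "100"))
MAX_SCAN_BYTES = MAX_SCAN_MB * 1024 * 1024

bq = bigquery.Client(project=GCP_PROJECT_ID)

def reject_if_too_big(sql: str) -> None:
    """Dry-run the query and raise if the estimated scan exceeds the cap."""
    job_config = bigquery.QueryJobConfig(
        dry_run=True,           # estimate only: nothing executes, nothing is billed
        use_query_cache=False,  # estimate the real scan, not a cached result
    )
    job = bq.query(sql, job_config=job_config)
    if job.total_bytes_processed > MAX_SCAN_BYTES:
        raise ValueError(
            f"Query would scan {job.total_bytes_processed:,} bytes, "
            f"exceeding cap of {MAX_SCAN_BYTES:,}"
        )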

Notes:

Dry-runs are free and fast (typically 50–200ms). Caching them with a small LRU dramatically reduces overhead when the agent retries the same query.
total_bytes_processed is an estimate, not a guarantee. For highly optimized queries against partitioned + clustered tables, the actual scan can come in slightly under the estimate. The reverse — actual scan exceeding the estimate — is extremely rare in practice, but the defense is belt-and-braces, so layer 3 still has a role.
Setting MAX_SCAN_MB requires knowing your data. 100 MB is fine for exploratory queries against a GA4 export (you can answer most reasonable questions in under 100 MB if your tables are partitioned by date). For unpartitioned reporting tables, you may need 500 MB. Don’t go above 2 GB without a very specific reason.

This is the bulletproof, per-query defense. No query exceeding the cap ever runs. No bytes are billed. No surprise charges.
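The dry-run cache mentioned in the notes can be as small as the sketch below. It is not the repo's implementation: the class name, the defaults, and the injected `estimate_fn` (any callable that takes SQL and returns estimated bytes, for example a thin wrapper around a BigQuery dry-run) are all assumptions made so the example runs without a BigQuery client.

```python
import time
from collections import OrderedDict


class DryRunCache:
    """Small LRU cache with a TTL for dry-run byte estimates."""

    def __init__(self, estimate_fn, max_entries=256, ttl_seconds=300.0,
                 clock=time.monotonic):
        self._estimate_fn = estimate_fn
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # sql -> (estimated_bytes, stored_at)

    def estimate(self, sql):
        key = sql.strip()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[1] < self._ttl:
            self._entries.move_to_end(key)  # mark as most recently used
            return hit[0]
        value = self._estimate_fn(sql)
        self._entries[key] = (value, now)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return value


def check_scan_cap(cache, sql, max_scan_bytes):
    """Raise if the (possibly cached) estimate exceeds the cap."""
    estimated = cache.estimate(sql)
    if estimated > max_scan_bytes:
        raise ValueError(
            f"Query would scan {estimated:,} bytes, exceeding cap of {max_scan_bytes:,}"
        )
    return estimated


# Usage with a stand-in estimator and a controllable clock.
calls = []
fake_now = [0.0]

def fake_estimate(sql):
    calls.append(sql)
    return 50 * 1024 * 1024  # pretend every query scans 50 MB

cache = DryRunCache(fake_estimate, max_entries=2, ttl_seconds=10,
                    clock=lambda: fake_now[0])
check_scan_cap(cache, "SELECT 1", 100 * 1024 * 1024)
check_scan_cap(cache, "SELECT 1", 100 * 1024 * 1024)  # served from cache
print(len(calls))
```

Keep the TTL short. A table that grows between the cached dry-run and the real job makes a stale estimate too optimistic, and the key here is the exact SQL text, so two queries that differ only in a literal are estimated separately.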

Layer 2: IAM custom role with quota

The native BigQuery roles (bigquery.dataViewer, bigquery.jobUser, bigquery.user) don’t have per-role byte quotas. You can’t say “this service account is allowed to scan 50 GB/day” through them.

You can do it with a custom role plus a project-level quota override keyed to the service account, but the configuration is non-obvious. The shape:

# 1. Create the custom role (essentially jobUser + read scoped tighter)
gcloud iam roles create bqAgentJobUser \
  --project=$PROJECT_ID \
  --title="BQ Agent Job User" \
  --permissions=bigquery.jobs.create,bigquery.jobs.get \
  --stage=GA

# 2. Bind to the agent service account
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:bigquery-readonly-mcp@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="projects/${PROJECT_ID}/roles/bqAgentJobUser"

# 3. Set a per-user-per-day query bytes quota override in the Cloud Console:
# Quotas & System Limits → filter to BigQuery → "Query usage per day per user".
# Choose a value like 50 GB.

The catch: quotas are per-user-per-day. They’re a backstop, not a per-query control. A single 6 TB query happily eats through the daily cap in one shot. So this layer protects against sustained misbehavior, not a single bad query.

When this layer pays off: it catches the case where an agent is running 100 well-formed queries an hour for days, slowly burning down your budget. Layer 1’s per-query cap doesn’t notice that pattern.

Layer 3: project-wide bytes-billed quota

This is the whole project’s emergency brake. In Google Cloud Console:

APIs & Services → Quotas & System Limits.
Filter to BigQuery.
Find “Query usage per day”.
Edit Quota and set a daily ceiling. Default is 200 TB/day — drop it to something defensible (50 GB? 500 GB?).
Save. The change takes effect within minutes.

When this layer matters: a query that somehow slipped past the application layer ceiling (a bug, a misconfiguration, a deploy without the cap), combined with an IAM quota that’s too loose. The project-wide quota stops all BigQuery jobs in the project when hit — which is disruptive but bounded. Set it high enough that legitimate batch jobs (your GA4 ML training, scheduled queries, etc.) don’t trip it.

What this looks like together

For a production deployment of the MCP server I maintain, my actual configuration:

Layer 1: MAX_SCAN_MB=100 for exploratory agents, MAX_SCAN_MB=500 for known-good internal use.
Layer 2: a custom role for the service account, with a 50 GB per-user-per-day quota.
Layer 3: project-wide cap of 1 TB/day. High enough that nothing legitimate ever hits it, low enough that a runaway script gets stopped before it costs four figures.

Total cost at idle: zero (Cloud Run scales to zero). Total worst-case exposure per day if every layer except the last fails: 1 TB × $6.25/TB = $6.25.
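The same arithmetic works for every cap in the list. A minimal sketch, assuming the $6.25/TB on-demand rate used above, binary units, and ignoring BigQuery's minimum billing increments (the helper name is hypothetical):

```python
ON_DEMAND_USD_PER_TB = 6.25  # rate used in this post; check current pricing

MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4


def scan_cost_usd(bytes_scanned, usd_per_tb=ON_DEMAND_USD_PER_TB):
    """On-demand cost of scanning the given number of bytes."""
    return bytes_scanned / TB * usd_per_tb


caps = {
    "layer 1, per query (100 MB)": 100 * MB,
    "layer 1, per query (500 MB)": 500 * MB,
    "layer 2, per user per day (50 GB)": 50 * GB,
    "layer 3, per project per day (1 TB)": 1 * TB,
}
for name, cap_bytes in caps.items():
    print(f"{name}: ${scan_cost_usd(cap_bytes):.4f}")
```

Running the numbers like this before choosing a cap makes it obvious which layer bounds which kind of loss: fractions of a cent per query at layer 1, well under a dollar per day at layer 2, single dollars per day at layer 3.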

Why “just use IAM” isn’t enough

If you ask in a typical Cloud forum “how do I prevent expensive BigQuery queries from an agent?”, the default answer is “use IAM.” IAM is necessary. It is not sufficient.

The reason: IAM controls who can run jobs and on which datasets. It doesn’t control how big each job is. A perfectly IAM-correct setup with bigquery.dataViewer on three datasets and bigquery.jobUser on the project lets an agent run a 6 TB scan against the largest of those datasets, no questions asked. IAM doesn’t see the query plan; only BigQuery does.

The pattern in this post (dry-run before submit, reject above the cap) is what bridges the gap. IAM gates access, the application-layer cap gates per-query cost, and the quotas gate sustained spend. All three are necessary, and only the application-layer cap is sufficient for stopping a single catastrophic query.

FAQ

Doesn’t dry-run also cost money?

No. Dry-runs are explicitly free per Google’s BigQuery pricing docs. You can run unlimited dry-runs without billing impact.

What about queries that scan less than expected?

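Treat the dry-run number as a conservative estimate rather than the actual scan. In practice it is accurate to within a few percent for partitioned + clustered tables, and within ~10% for unpartitioned ones, and the real job usually comes in at or under it. Either way you’re being conservative: a query that passes the check should almost never bill more than the dry-run estimate, and the quotas in layers 2 and 3 cover the rare exception.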

How does this interact with BigQuery’s slot-based pricing?

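It doesn’t directly. The scan ceiling controls bytes scanned, which is the unit of on-demand pricing. If you’re on slot-based (capacity) pricing, the relevant metric is slot time, not bytes, and a dry-run only estimates bytes; slot usage is reported after a job has actually run. The bytes cap still works as a rough proxy for query size, but bound slot consumption separately, for example with a job timeout or a limit on the reservation the agent’s project can use.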

Can I do this on Snowflake?

Snowflake exposes query plan estimates via EXPLAIN. The principle is the same: estimate before run, reject above a cap. The implementation differs (Snowflake’s estimates are less precise than BigQuery’s dry-run) but the defense-in-depth structure is identical.

Source code

Full source — including the dry-run cache, the LRU TTL implementation, and the rest of the security layers (allowlist, rate limiting, result truncation): github.com/hugonissar/BigQuery-Read-Only-MCP-Server. MIT-licensed.
