healthcare-ai-data-engineer · L1 control room

🏥 HEALTHCARE DATA PLATFORM — L1 CONTROL ROOM

Can humans + AI trust hospital data today?

PROD | Updated 08:02 AM

🟢 SYSTEM STATUS — SHOULD WE PANIC TODAY?

No — all systems operational. 🎉

No fake patients

No broken feeds

No critical quality failures

Charts are safe to read. Data is safe to feed agents.

❤️ Can we trust the chart?

QC passed? 99.2%

Missing keys? 0.04%

Duplicate visits? 0.00%

Fake patients? none

[Open B2]

⏰ Is the data fresh?

Latest ingest: 08:01

Data delay: 2m

Stale alert: ON

SLA met: 99.1%

[Open B4]

🔧 Is it alive?

Uptime: 99.94%

Jobs OK: 99.1%

MTTR: 38m

Silent fails: 0

[Open B4]

📋 Can people use this?

Star schema: PASS

Contracts: PASS

Query marts: YES

[Open B3] [Open B5]

🛡 Will compliance yell?

PII scan: PASS

Audit lineage: ON

HIPAA tagging: PASS

[Open B2] [Open B6]

🤖 Is it agent-ready?

Marts agent-allowed: 4 / 5

PII-safe views: ON

Contract coverage: 100%

Agent freshness SLA: met

[Open B2] [Open B5]

🚨 IF SOMETHING TURNS RED

Trust issue → Open B2

Modeling issue → Open B3

Pipeline issue → Open B4

Warehouse issue → Open B5

Agent-readiness issue → Open B5

Architecture → Open B6

traces to: /api/control-room 200 dashboard_spec.yml data/quality/l1_checkpoint_report.json

❤️ B2 TRUST INVESTIGATION ROOM

Can we trust the patient and visit numbers?

PROD | Updated 08:02 AM

😌 CURRENT STATUS

Mostly healthy — 1 issue needs attention (NOT patient-facing yet)

No fake patients detected

❤️ TRUST VITALS (value + inline benchmark)

1. 🧪 QC passed? 99.2% (good ≥95 | strong ≥99)

Open dbt tests Open run_results.json Open Job Log

2. 👁 Missing key fields? 0.04% (good <1 | strong <0.1)

Open null offenders query Show sample rows Open Lineage

3. 👯 Duplicate visits? 0.00% (good <1 | strong =0)

Open duplicate keys query Show validation

4. 🧑 Fake patients? 100.0% of patients verified real (good ≥99 | strong =100)

Open orphan records query Open relationship test Open upstream job

5. 🕵 Can we trace every number? 94% (good ≥90 | strong ≥95)

Open manifest.json Open lineage gaps Open model owners

6. 🧮 Do systems agree? 99.94% (good ≥99 | strong ≥99.9)

Open recon query Open KPI definitions Open decision log
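
The inline benchmarks above can be read as a small grading rule: check the "strong" bound first, then the "good" bound, otherwise flag the vital. The sketch below is one way to express that in Python. The metric keys, the `Benchmark` type, and the "needs attention" label are illustrative assumptions, not names from this project; only the threshold numbers come from the vitals list.

```python
import operator
from dataclasses import dataclass
from typing import Tuple

# Comparison operators used by the inline benchmarks (>=, <, =).
_OPS = {">=": operator.ge, "<": operator.lt, "==": operator.eq}


@dataclass(frozen=True)
class Benchmark:
    """One vital's thresholds, each stored as (operator, limit)."""
    good: Tuple[str, float]
    strong: Tuple[str, float]


# Hypothetical metric keys; the limits mirror the vitals listed above.
BENCHMARKS = {
    "qc_passed_pct": Benchmark(good=(">=", 95.0), strong=(">=", 99.0)),
    "missing_key_fields_pct": Benchmark(good=("<", 1.0), strong=("<", 0.1)),
    "duplicate_visits_pct": Benchmark(good=("<", 1.0), strong=("==", 0.0)),
    "lineage_traced_pct": Benchmark(good=(">=", 90.0), strong=(">=", 95.0)),
    "systems_agree_pct": Benchmark(good=(">=", 99.0), strong=(">=", 99.9)),
}


def grade(value: float, bench: Benchmark) -> str:
    """Return 'strong', 'good', or 'needs attention' for one metric value."""
    strong_op, strong_limit = bench.strong
    good_op, good_limit = bench.good
    if _OPS[strong_op](value, strong_limit):
        return "strong"
    if _OPS[good_op](value, good_limit):
        return "good"
    return "needs attention"


if __name__ == "__main__":
    # Lineage coverage of 94% clears the 'good' bar but not 'strong'.
    print(grade(94.0, BENCHMARKS["lineage_traced_pct"]))
```

Keeping the operator next to each limit avoids a separate "higher is better" flag, since the vitals mix at-least, less-than, and exact-match bounds.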

📑 EVIDENCE (PROOF) — show me receipts (direct clicks to bad files)

visit → patient relationship: 97.8% (expected =100)

direct proof clicks:

Open Query Open failing rows Open dbt Test Open Job Log Open Owner

blast radar

impacts: Patient Count KPI, ER Census, RAG Patient Lookup
scope: clinical marts only
patient-facing dashboards: NOT impacted (yet)

MRN null spike: 0.12% (strong <0.10)

direct proof clicks:

Show Rows Open Query Open Lineage Open Model Open Owner

blast radar

impacts: reporting only (for now)
risk: if it grows → becomes patient-identity risk

duplicate visit_id: 0.00% (required =0)

proof: Show Validation

blast radar

none

🚑 TRIAGE — what do we do + who owns it

Broken visit → patient join

owner: Data Platform (on-call) | ETA: < 1h | action: Page on-call

runbook: Open runbook Open rollback plan Open incident thread

MRN null spike

owner: Analytics Engineering | ETA: next sprint | action: Create ticket

paper trail: Open ticket Open data contract Open upstream owner

🛟 AUTO-MITIGATIONS — what we already auto healed so humans don't panic

✅ Retried failed dbt job (success) | Open retry logs

✅ Refreshed affected marts | Open refresh job

✅ Switched dashboards to last-known-good snapshot | Open snapshot ID Open diff

✅ Added warning banner (degraded mode) | Open dashboard link

✅ Notified owner | Open incident thread

🤖 Auto-remediation coverage: 50%

✅ Machines stabilized the symptoms — dashboards stayed up, no bad data shipped.
⏳ The other 50% needs a human: the KPI definition call below. 👇

🤝 GOOD LUCK HUMAN — HITL your turn now

💰 Finance visits: 1,024 vs 🧾 Billing visits: 1,011 ⚠️

🤖 Machine verdict: both valid 😵 (definition fight, not a data bug)

⏳ What happens if you ignore this:

📤 Finance ships "1,024" to execs
📤 Billing ships "1,011" to Ops
📊 BI publishes both (by accident)
📸 Someone screenshots the mismatch in Slack
🎉 Congratulations: you just scheduled a 90-minute "who's lying" meeting

🥊 Who fights who:

💰 Finance Lead: "Visits = posted revenue events"
🧾 Billing Lead: "Visits = billable encounters"
📊 Data Lead: "Please stop redefining reality in Google Sheets"
⚖️ Compliance (walks in late): "Which one is in the audit report?"

📡 blast radar:

🎯 Patient Count KPI (exec dashboard)
🏥 ER Census (ops)
🔍 Downstream RAG "patient lookup" confidence (counts stop matching)

🛠 Fix it in 3 moves — each step has a button

🏆 Pick the winning definition (or publish both, clearly labelled, like civilized people)

⚖️ Compare both definitions 📊 Open recon query

📜 Write it down as a KPI contract (one paragraph, not a novel)

✍️ Open contract draft 📖 Open KPI definitions

🔒 Enforce it in dbt (tests + semantic layer) so this argument can't respawn next week

🧬 Open dbt model 📐 Open semantic metric

👉 Decision owner: Data Lead · 👤 Assign owner 📝 Open decision log 💬 Start Slack thread

traces to: /api/trust-room 200 trust_metrics_spec.yml evidence_links.md

🛒 B3 DATA MARKETPLACE — Mart Catalog + dbt Lineage

Can humans + AI pick the right dataset without join hell?

dbt + SQL · Gold Layer · BI / AI ready

🛒 MART CATALOG — ready-to-query data products

📦 mart_er_triage

ER ops / census / triage

Grain: 1 row = 1 ER visit

Consumers: BI + AI + Ops

✅ precomputed ✅ prejoined ✅ precleaned ✅ preaggregated

SELECT * FROM mart_er_triage;

📦 mart_patient_summary

patient lookup / repeat visits

Grain: 1 row = 1 patient

Consumers: BI + AI

✅ precomputed ✅ prejoined ✅ precleaned ✅ preaggregated

SELECT * FROM mart_patient_summary;

📦 mart_claims_summary

billing / recon

Grain: 1 row = 1 claim

Consumers: Fin + Ops

✅ precomputed ✅ prejoined ✅ precleaned ✅ preaggregated

SELECT * FROM mart_claims_summary;

🧬 LINEAGE PREVIEW — where mart_er_triage comes from

raw_ehr_visit

▼

stg_visit

▼

fct_patient_encounters

▼

mart_er_triage

consumed by ▼

📊 Executive Dashboard

🤖 AI Retrieval

Open dbt lineage Open model_map.md Open sample_queries.sql

📜 CONTRACT SNAPSHOT — do we agree what the mart means?

visit = completed care encounter: PASS

ER census = active ER encounters in reporting window: PASS

patient = unique human receiving care: PASS

claim = billable / reimbursable event: PASS

💡 Main message: precomputed + prejoined + precleaned + preaggregated so humans can SELECT * FROM mart_ instead of writing join hell.

traces to: mart_catalog_ascii.md lineage_ascii.md sample_queries.sql

🔄 B4 PIPELINE OPERATIONS DAG

What runs first? What depends on what? What breaks downstream?

tech: Airflow + Python + dbt + GitHub Actions

1. data/raw/

source healthcare data

▼

2. ingest_raw.py

validate / load raw

▼

3. identity_resolver.py

patient_identity_map.json

4. provider_cleaning.py

provider reference data

both must finish ▼

5. dbt build

bronze → silver → gold

▼

6. dbt tests

not_null / unique

7. schema checks

contracts valid

8. recon checks

finance vs billing

all must pass ▼

9. quality_gate.py

PASS → publish
FAIL → block + alert

▼

10. mart_patient

precomputed mart

11. mart_visit

prejoined mart

12. mart_claims

finance mart

marts published together ▼

13. api_refresh

FastAPI / OpenAPI

portfolio/ = consumers only, not pipeline ▼

14. B1 dashboard

executive cockpit

15. B2 trust view

quality cockpit

16. AI consumers

RAG / agents

🔗 DEPENDENCY RULES

1 → 2
2 → 3, 4
3, 4 → 5
5 → 6, 7, 8
6, 7, 8 → 9
9 → 10, 11, 12
10, 11, 12 → 13
13 → 14, 15, 16
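
The dependency rules can be checked mechanically. The sketch below encodes them with Python's standard-library `graphlib.TopologicalSorter` and groups the steps into waves that could run in parallel. The `DEPENDS_ON` mapping and the `execution_waves` helper are illustrative names, and this is not a description of how the project's Airflow DAG is actually defined.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Step number -> set of steps it waits for, transcribed from the rules above.
DEPENDS_ON = {
    2: {1},
    3: {2}, 4: {2},
    5: {3, 4},
    6: {5}, 7: {5}, 8: {5},
    9: {6, 7, 8},
    10: {9}, 11: {9}, 12: {9},
    13: {10, 11, 12},
    14: {13}, 15: {13}, 16: {13},
}


def execution_waves(depends_on):
    """Group steps into waves; every step in a wave has all its inputs done."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the rules contain a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {wave}")
```

Steps that share a wave (3 and 4, then 6 to 8, then 10 to 12, then 14 to 16) have no dependency on each other, which matches the "both must finish" and "all must pass" joins in the DAG.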

💥 BLAST RADIUS EXAMPLES

If (3) identity_resolver.py fails:

patient_identity_map.json is not regenerated
↓ dbt build may still run, but trust quality drops
↓ B2 Trust Dashboard turns yellow / red

If (6) dbt tests fail:

quality_gate.py blocks publish
↓ marts do not refresh
↓ API / dashboard serve last-known-good snapshot

If (9) quality_gate.py fails:

mart_patient / mart_visit / mart_claims blocked
↓ B1 Executive Dashboard shows degraded mode
↓ AI consumers do not receive bad data
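
A blast radius like the ones above is the set of steps reachable downstream of the failed step. The sketch below computes it with a breadth-first walk over the step numbers from the DAG. The `DOWNSTREAM` mapping and `blast_radius` function are illustrative names. The result is an upper bound: as the first example notes, some downstream steps may still run in a degraded state rather than stop.

```python
from collections import deque

# Step number -> steps that consume its output, transcribed from the DAG.
DOWNSTREAM = {
    1: [2],
    2: [3, 4],
    3: [5], 4: [5],
    5: [6, 7, 8],
    6: [9], 7: [9], 8: [9],
    9: [10, 11, 12],
    10: [13], 11: [13], 12: [13],
    13: [14, 15, 16],
}


def blast_radius(failed_step, downstream):
    """Return every step reachable from failed_step, in ascending order."""
    seen = set()
    queue = deque([failed_step])
    while queue:
        step = queue.popleft()
        for child in downstream.get(step, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


if __name__ == "__main__":
    # Step 9 is quality_gate.py: the marts, the API refresh, and all consumers.
    print(blast_radius(9, DOWNSTREAM))
```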

✅ WHAT B4 PROVES

🕒 Freshness

data arrives + refreshes on schedule

🔁 Reliability

tasks run in dependency order

🧯 Recovery

failed jobs retry, or block publish safely

💥 Blast radius

you know exactly what breaks downstream
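
The "block publish safely" behaviour hinges on the gate at step 9: publish only when every upstream check passed, otherwise block and report what failed. The sketch below shows that decision in isolation. It is not the contents of quality_gate.py; the `GateDecision` type and the check names used in the example are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GateDecision:
    publish: bool             # True only if every check passed
    failed_checks: List[str]  # names to include in the alert when blocked


def quality_gate(check_results: Dict[str, bool]) -> GateDecision:
    """Decide PASS (publish) or FAIL (block + alert) from named check results."""
    failed = sorted(name for name, passed in check_results.items() if not passed)
    # An empty result set is treated as a failure: no evidence, no publish.
    publish = bool(check_results) and not failed
    return GateDecision(publish=publish, failed_checks=failed)


if __name__ == "__main__":
    # Hypothetical check names for steps 6, 7, and 8.
    decision = quality_gate(
        {"dbt_tests": True, "schema_checks": True, "recon_checks": False}
    )
    print(decision)
```

Treating "no results" as a block is a deliberate choice in this sketch: a gate that passes when nothing ran would be a silent failure.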

traces to: dag_ascii.md runbook.md

🏭 B5 WAREHOUSE EXPLORER

Do the tables exist — and are they modelled so every join is safe?

dataset: healthcare_dw · BigQuery (dbt core models)

Real tables → modelled as a star schema → integrity enforced by dbt tests → every number traces back to SQL.

🩺 WAREHOUSE AT A GLANCE

1 dataset

11 tables

8 gold models

0 views

last refresh 12:51

health Healthy 🟢

497 encounters

⭐ THE STAR SCHEMA — 7 conformed dimensions → 1 fact

dim_patient

dim_doctor

dim_hospital

dim_diagnosis

dim_insurance

dim_medication

dim_date

7 FK relationships → 1 fact ▼

⭐ fact_patient_encounters

1 row = 1 encounter · 7 surrogate FKs + 8 measures

🥉🥈🥇 MEDALLION PATH — where the star comes from

🥉 raw/healthcare_dataset.csv

▼

🥈 stg_healthcare

▼

int_encounters_enriched

int_readmissions

▼

🥇 gold star schema

8 models

🔒 INTEGRITY ENFORCED — dbt tests that gate every build

✅ encounter_id: not_null + unique → no duplicate or ghost encounters

✅ 7 FK relationships (patient/doctor/hospital/diagnosis/insurance/medication/date keys) → joins can't silently drop rows

✅ accepted_values on is_emergency / is_readmission [0,1] → clinical flags can't go dirty

🧪 These run on dbt test and gate dbt build. Honest scope: row-shape integrity (the cheap tests that stop silent FK drops) — not semantic clinical validation.
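
To make the three test families concrete, the sketch below re-implements their logic in plain Python over in-memory rows. It is an illustration of what the checks assert, not the project's dbt configuration. The `encounter_id` and `is_emergency` columns come from the list above; `patient_key`, the sample rows, and the function names are hypothetical.

```python
from collections import Counter


def check_not_null_unique(rows, column):
    """Count null keys and list key values that appear more than once."""
    values = [row.get(column) for row in rows]
    nulls = sum(value is None for value in values)
    counts = Counter(value for value in values if value is not None)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return {"nulls": nulls, "duplicates": duplicates}


def check_relationship(fact_rows, fk_column, dim_keys):
    """Return fact rows whose foreign key has no matching dimension key.

    Null foreign keys are skipped here and left to a separate not-null check.
    """
    return [
        row for row in fact_rows
        if row.get(fk_column) is not None and row[fk_column] not in dim_keys
    ]


def check_accepted_values(rows, column, allowed):
    """Return rows whose value in `column` is outside the allowed set."""
    return [row for row in rows if row.get(column) not in allowed]


# Hypothetical sample data, deliberately broken in three different ways.
SAMPLE_FACTS = [
    {"encounter_id": 1, "patient_key": 10, "is_emergency": 0},
    {"encounter_id": 2, "patient_key": 99, "is_emergency": 1},  # orphan FK
    {"encounter_id": 2, "patient_key": 10, "is_emergency": 2},  # dup + bad flag
    {"encounter_id": None, "patient_key": 11, "is_emergency": 1},  # null key
]
SAMPLE_DIM_PATIENT_KEYS = {10, 11}

if __name__ == "__main__":
    print(check_not_null_unique(SAMPLE_FACTS, "encounter_id"))
    print(check_relationship(SAMPLE_FACTS, "patient_key", SAMPLE_DIM_PATIENT_KEYS))
    print(check_accepted_values(SAMPLE_FACTS, "is_emergency", {0, 1}))
```

Each function returns the offending rows or values rather than a boolean, so a failing check can point straight at the rows to inspect.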

🔎 PROOF QUERY — verified against the real dbt model

SELECT medical_condition, COUNT(*) AS encounters
FROM fact_patient_encounters
GROUP BY 1
ORDER BY encounters DESC;

🟢 Open SQL 📈 Open Lineage 👁 Preview Data

📏 497 encounters (synthetic dataset). The star schema + enforced FKs are the skill — they hold the same at 497 rows or 497M. Row count isn't the flex; the modelling is.

traces to: /api/warehouse-room 200 dbt-project/models/marts/core/ warehouse_room_payload.json

🏗 B6 SYSTEM ARCHITECTURE

How does the whole machine connect? (the 10-second version)

📥 Sources

EHR · claims · providers

▼

⚙️ dbt → 🏭 BigQuery

transform + test → trusted marts

▼

🔌 API on Cloud Run

serves the room payloads

▼

👁 Humans

B1 / B2 cockpit

🤖 Agents

RAG · agent-allowed only

🔌 API surface — what Cloud Run serves

/api/control-room · /api/trust-room · /api/warehouse-room · /api/retrieve · /api/ask

🤖 L2 grounded agent — answers grounded on trusted marts, every claim cites [doc N]

BM25 retrieves top-K from the redacted enriched corpus → Gemini answers only from that evidence. No raw PII indexed.
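
The retrieve-then-answer flow can be sketched with a small Okapi BM25 scorer. The code below is a self-contained illustration, not the project's retriever: the whitespace tokenizer, the `k1` and `b` defaults, the sample corpus, and the 1-based `[doc N]` numbering are all assumptions. The Gemini call is deliberately left out; the sketch stops at the labelled evidence that would be passed to the model.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 over a small, non-empty in-memory corpus."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = list(docs)
        self.k1, self.b = k1, b
        tokens = [tokenize(doc) for doc in self.docs]
        self.doc_len = [len(t) for t in tokens]
        self.avg_len = sum(self.doc_len) / len(self.docs)
        self.term_freqs = [Counter(t) for t in tokens]
        doc_freq = Counter()
        for t in tokens:
            doc_freq.update(set(t))
        n = len(self.docs)
        # Smoothed IDF variant that stays positive for every term.
        self.idf = {
            term: math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            for term, df in doc_freq.items()
        }

    def score(self, query, index):
        tf = self.term_freqs[index]
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[index] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            freq = tf.get(term, 0)
            if freq:
                total += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
        return total

    def top_k(self, query, k=3):
        """Return indices of the k best-scoring docs that match at least one term."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        matched = [(s, i) for s, i in scored if s > 0]
        matched.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for _, i in matched[:k]]


def format_evidence(docs, indices):
    """Label retrieved docs as [doc N] (1-based) so answers can cite them."""
    return [f"[doc {i + 1}] {docs[i]}" for i in indices]


# Hypothetical redacted snippets; no patient-level fields appear.
CORPUS = [
    "er census counts active er encounters in the reporting window",
    "claims summary lists billable events per claim",
    "patient summary tracks repeat visits per patient",
]

if __name__ == "__main__":
    retriever = BM25(CORPUS)
    print(format_evidence(CORPUS, retriever.top_k("er census", k=2)))
```

Returning nothing when no query term matches gives the caller an explicit "no evidence" case, which fits the rule that the agent answers only from retrieved evidence.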

[Ask grounded agent →]

▲ Why Cloud Run: stateless API, scales to zero when idle, one container, deploys from the same repo CI already guards.

traces to: architecture.mmd dependency_map.mmd