
AI methodology

How our AI lab test analyzer actually works

Full transparency on architecture, training, evaluation, and the guardrails we put around our health AI. We publish this because health AI without transparency is not trustworthy.

1. Architecture overview

The blood-test.life analyzer is not a single language model. It is a four-stage pipeline, each stage independently testable and replaceable.

  • Parser. A library of 400+ deterministic templates for known lab formats, with a vision-language fallback for unfamiliar layouts.
  • Normalizer. Unit conversion, LOINC mapping, and reference-range adjustment (age, sex, pregnancy, ethnicity where validated).
  • Clinical-rules engine. Validated deterministic logic (e.g., diabetes thresholds, CKD staging, lipid risk calculators). The rules engine has authority over the language model — the language model cannot overrule it.
  • Narrative model. A fine-tuned, domain-adapted language model that writes the patient-facing report, constrained by a phrase library reviewed by our medical advisory board.
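As a rough illustration of how four independently testable stages can be chained, here is a minimal Python sketch. Everything in it is an assumption made for illustration, not the production code: the `Result` type, the function names, the single "name value unit" template, the phrase wording, and the use of fasting-glucose cut-offs (100 and 126 mg/dL) as the example rule. The point it tries to show is that the narrative stage only reads the flag set by the rules stage and has no way to change it.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)  # frozen: later stages cannot mutate a result in place
class Result:
    name: str
    value: float
    unit: str
    flag: str = "unreviewed"  # only apply_rules() assigns a real flag


def parse(raw_text: str) -> list[Result]:
    """Stage 1 (toy template): read lines of the form 'name value unit'."""
    rows = (line.split() for line in raw_text.strip().splitlines())
    return [Result(name, float(value), unit) for name, value, unit in rows]


def normalize(results: list[Result]) -> list[Result]:
    """Stage 2: convert glucose from mmol/L to mg/dL (factor 18.016)."""
    out = []
    for r in results:
        if r.name == "glucose" and r.unit == "mmol/L":
            r = replace(r, value=round(r.value * 18.016, 1), unit="mg/dL")
        out.append(r)
    return out


def apply_rules(results: list[Result]) -> list[Result]:
    """Stage 3: deterministic flagging (assumes a fasting glucose sample)."""
    out = []
    for r in results:
        if r.name == "glucose" and r.unit == "mg/dL":
            if r.value >= 126:
                flag = "abnormal"
            elif r.value >= 100:
                flag = "borderline"
            else:
                flag = "normal"
            r = replace(r, flag=flag)
        out.append(r)
    return out


# Hypothetical phrase library: one approved sentence ending per flag.
PHRASES = {
    "normal": "is within the expected range.",
    "borderline": "is slightly outside the expected range.",
    "abnormal": "is outside the expected range; discuss it with a clinician.",
    "unreviewed": "was extracted but not interpreted.",
}


def narrate(results: list[Result]) -> list[str]:
    """Stage 4: wording only. Flags are read here, never reassigned."""
    return [f"{r.name} ({r.value} {r.unit}) {PHRASES[r.flag]}" for r in results]


def analyze(raw_text: str) -> list[str]:
    return narrate(apply_rules(normalize(parse(raw_text))))
```

Calling `analyze("glucose 7.2 mmol/L")` runs all four stages in order. Because each stage is a plain function over a list of results, any one of them can be tested or swapped without touching the others.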

2. Training data

The narrative model is fine-tuned on a curated dataset of medical writing (guideline excerpts, clinical reasoning examples, and patient-friendly explanations), all reviewed by the medical board for clinical accuracy and tone. We do not train on user uploads: patient lab reports are never added to the training set.

3. Reference data sources

Reference ranges are sourced from validated population studies: CALIPER (pediatric), NORIP (Nordic adult), CDC NHANES (US adult), and the major specialty-society reference ranges (ATA for thyroid, AHA/ACC for lipids, ADA for diabetes, KDIGO for kidney, ACOG for pregnancy). Where multiple sources disagree, we use the most current guideline and document the choice.

4. Evaluation

We evaluate the analyzer on a 12,400-report anonymized validation set covering 22 lab providers across 4 continents. Reports are stratified by patient age, sex, pregnancy status, and ethnicity to surface fairness issues. As of the June 2026 evaluation:

  • Biomarker extraction: 99.1 % accuracy
  • Unit normalization: 99.8 % accuracy
  • Flag classification (normal / borderline / abnormal): 97.4 % agreement with board-certified physicians
  • Hallucination rate (clinical claim not supported by extracted data): < 0.3 %
  • Disclaimer presence: 100 % of reports include the medical disclaimer
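To make the stratified part concrete, the following Python sketch shows one way agreement between analyzer flags and physician flags could be computed per stratum. The function name, the record layout, and the toy records in the usage note are assumptions for illustration only; they are not the evaluation harness or real evaluation data.

```python
from collections import defaultdict


def agreement_by_stratum(records):
    """Return the fraction of matching flags for each stratum.

    `records` is an iterable of (stratum, analyzer_flag, physician_flag)
    tuples, a hypothetical layout chosen for this sketch.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for stratum, analyzer_flag, physician_flag in records:
        totals[stratum] += 1
        if analyzer_flag == physician_flag:
            matches[stratum] += 1
    return {stratum: matches[stratum] / totals[stratum] for stratum in totals}
```

With toy input such as `[("adult", "normal", "normal"), ("adult", "abnormal", "borderline"), ("pediatric", "normal", "normal")]`, the adult stratum has one match out of two and the pediatric stratum one out of one. Reporting the figure per stratum rather than only overall is what lets a gap in one subgroup show up.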

We re-evaluate quarterly. The current evaluation report is available on request to clinical partners and academic researchers.

5. Guardrails

The narrative model is constrained in several specific ways:

  • It cannot reference biomarkers not present in the report.
  • It cannot invent reference ranges — only those provided by the normalizer.
  • It cannot make diagnostic statements ("you have X"); it can describe patterns ("this pattern is consistent with X").
  • It cannot recommend medications.
  • It must end every report with the medical disclaimer.
  • When confidence drops below threshold, it must flag the section as low-confidence rather than hallucinating.
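Several of these constraints can also be checked on the generated text after the fact. The Python sketch below is one possible post-generation check, written under stated assumptions: the disclaimer wording, the `check_report` name, the biomarker vocabulary, and the single "you have" pattern are all hypothetical, and a real check would need a much richer pattern set.

```python
import re

# Hypothetical disclaimer text; the real wording is not given in this page.
DISCLAIMER = "This report is not medical advice."

# Deliberately simple pattern for diagnostic phrasing ("you have X").
DIAGNOSTIC = re.compile(r"\byou have\b", re.IGNORECASE)


def check_report(text, extracted, vocabulary):
    """Return a list of guardrail violations found in a draft report.

    extracted  -- set of biomarker names actually present in the lab report
    vocabulary -- set of all biomarker names the checker knows about
    """
    violations = []
    lowered = text.lower()
    # Any known biomarker that was not extracted must not be mentioned.
    for marker in sorted(vocabulary - extracted):
        if re.search(rf"\b{re.escape(marker.lower())}\b", lowered):
            violations.append(f"references absent biomarker: {marker}")
    if DIAGNOSTIC.search(text):
        violations.append("diagnostic statement")
    if not text.rstrip().endswith(DISCLAIMER):
        violations.append("missing closing disclaimer")
    return violations
```

A draft that returns an empty list passes these three checks; any other result could be used to block the report or route it to the low-confidence flow.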

6. Safety patterns

Certain biomarker patterns trigger an immediate "see a clinician" flag rather than a friendly explanation. Examples: potassium < 2.5 or > 6.5 mEq/L; platelets < 20 ×10³/µL; HbA1c > 10 %; eGFR < 30 mL/min/1.73 m²; troponin elevation; ALT or AST > 10× the upper limit of normal. These thresholds are reviewed by our medical board and updated when guidelines shift.
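The example thresholds above translate directly into a small deterministic check. The Python sketch below uses the cut-offs exactly as listed; the function name, the lowercase marker keys, and the reading of "troponin elevation" as "above the lab's upper reference limit" are assumptions made for illustration.

```python
# (low, high) bounds taken from the examples above; None means "no bound".
# Values are assumed to be already normalized to the units in the text.
URGENT_BOUNDS = {
    "potassium": (2.5, 6.5),   # mEq/L
    "platelets": (20.0, None),  # x10^3/uL
    "hba1c": (None, 10.0),      # percent
    "egfr": (30.0, None),
}


def needs_clinician(name, value, upper_limit=None):
    """Return True when a result should trigger the 'see a clinician' flag.

    upper_limit is the lab's upper reference limit, needed only for
    markers whose threshold is relative (ALT, AST, troponin).
    """
    if name in ("alt", "ast"):
        return upper_limit is not None and value > 10 * upper_limit
    if name == "troponin":
        # Assumption: any value above the upper reference limit counts.
        return upper_limit is not None and value > upper_limit
    low, high = URGENT_BOUNDS.get(name, (None, None))
    below = low is not None and value < low
    above = high is not None and value > high
    return below or above
```

For example, `needs_clinician("potassium", 6.6)` is true while `needs_clinician("potassium", 6.5)` is not, because the listed comparison is strict. Keeping the bounds in a table makes it straightforward to update them when guidelines shift.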

7. Updates

Model versions are tagged and visible in every report's audit trail. Material updates — new biomarker support, new reference data, new clinical rules — are summarized on the corrections log.

8. What's still imperfect

An honest methodology page admits what doesn't work yet:

  • Handwritten lab reports drop our parser accuracy by ~7 percentage points; we route these to a low-confidence flow.
  • Very rare biomarkers (~30 markers) are extracted but not yet interpreted with the same depth as the standard panels.
  • Pediatric pregnancy (very rare) requires manual review.
  • Some Eastern European lab formats use Roman numeral reference ranges; the parser was extended to handle these in June 2026, but coverage is still 96 %, not 99 %.
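For the Roman numeral case, the page does not describe the exact formats involved, so the Python sketch below makes a strong assumption: that a range appears as two Roman numerals separated by a hyphen or en dash (for example "IV-IX"). The function names are hypothetical, and the converter does not reject malformed numerals.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral to an integer using the subtractive rule."""
    total = 0
    previous = 0
    # Walk right to left: a smaller value before a larger one is subtracted.
    for char in reversed(numeral.strip().upper()):
        value = ROMAN_VALUES[char]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def parse_roman_range(text):
    """Parse 'IV-IX' (hyphen or en dash) into an integer (low, high) pair."""
    low, high = re.split(r"\s*[-\u2013]\s*", text.strip())
    return roman_to_int(low), roman_to_int(high)
```

Under this assumption, `parse_roman_range("IV-IX")` gives the pair 4 and 9. Inputs that do not fit the assumed shape raise an exception, which could be caught to send the report to the low-confidence flow.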

Methodology is good. Output is better.

Run the analyzer free and judge the report yourself.
