ESHYFT — Nurse Onboarding Doc Verify


Validation flags reference — all flags the pipeline can return

validationFlags[] = hard issues · validationWarnings[] = soft flags. Types: FLAG = surface to nurse · WARN = soft / advisory · INFO = informational only

Flag / Warning	Applies to	Type	What it means
`EXPIRED`	All	FLAG	Document is past its expiration date.
`POSSIBLE_ALTERATION`	All	WARN	AI detected signs of tampering: font inconsistencies, pixel artifacts, white-out, copy-paste edges, or photo manipulation. Check `alterationDetails` for specifics.
`UNDER_18`	Gov ID	FLAG	Extracted DOB places person under 18.
`IMPOSSIBLE_DOB`	Gov ID	FLAG	DOB is in the future or implies age >120. Likely an OCR misread.
`AGE_OVER_100` / `AGE_UNDER_16`	Gov ID	WARN	DOB produces an unusual age — probable OCR digit error. Manual review.
`EXPIRING_SOON`	License	WARN	License expires within 30 days.
`UNRECOGNIZED_DOCUMENT`	License	FLAG	AI couldn't classify the document type. May be a wrong upload or blurry image.
`POSITIVE_RESULT`	TB	FLAG	TB test result is positive. A chest X-ray or clinical follow-up is required.
`READ_WINDOW_VIOLATION`	TB skin	FLAG	PPD was read outside the required 48–72 hour window. Test is invalid.
`2_STEP_INTERVAL_TOO_SHORT`	TB skin	FLAG	Step 2 placed fewer than 7 days after step 1. CDC protocol requires ≥7 days between steps.
`2_STEP_INCOMPLETE_MISSING_STEP_2`	TB skin	FLAG	2-step form with only step 1 data present. Nurse needs to upload the complete form.
`NO_PROVIDER_NAME_OR_SIGNATURE`	TB	WARN	No nurse, reader, or physician on the document. Lab-issued QuantiFERON/T-SPOT reports with a facility name do NOT trigger this.
`NO_ACTIVE_TB_STATEMENT_NOT_FOUND`	TB X-ray	WARN	No TB-negative phrase found in the radiology report. Standard phrases ("no acute infiltrate", "lungs are clear", etc.) are auto-detected.
`XRAY_INDICATION_NOT_TB_RELATED`	TB X-ray	WARN	X-ray indication is unrelated to TB screening (e.g., "SOB, wheezing"). X-ray may be clinically valid but wasn't ordered for TB.
`PHYSICAL_MISSING_DATE`	Physical	FLAG	No examination date extracted. Required to verify the physical is current.
`DATE_LOW_CONFIDENCE`	Physical	WARN	Exam date extracted with low confidence — often a handwritten year where "5" and "3" look similar. Manual review recommended.
`typeMatch: false`	All	WARN	AI detected a different document type than what was selected. Check `typeMatchDetails`. Selection is a hint only, never a hard block.
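
A consumer of these two arrays might route a document along the lines below. This is a minimal sketch, not the backend's actual logic: the `FlagArrays` shape and the three routing outcomes are assumptions made for illustration, and only the array names `validationFlags` and `validationWarnings` come from the legend above.

```typescript
// Hypothetical response shape: only the two array names come from the legend above.
interface FlagArrays {
  validationFlags: string[] | null;
  validationWarnings: string[] | null;
}

// Hypothetical routing outcomes, not values returned by the pipeline.
type Routing = "SURFACE_TO_NURSE" | "ADVISORY" | "CLEAN";

// Hard flags take precedence over warnings; null is treated like an empty array.
function routeDocument(result: FlagArrays): Routing {
  const flags = result.validationFlags || [];
  const warnings = result.validationWarnings || [];
  if (flags.length > 0) return "SURFACE_TO_NURSE";
  if (warnings.length > 0) return "ADVISORY";
  return "CLEAN";
}
```

Treating `null` as empty keeps a clean document on the "CLEAN" path without a separate null check in every caller.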

Pipeline summary — what the extraction stamps on every document

Each capability runs automatically. Severity: BLOCKS = backend must reject · WARNS = soft flag for backend to act on · INFO = returned, no action required

Capability	Applies to	Severity
Doc type classification — pipeline re-identifies doc type independently; nurse's selection is a hint only, never a hard gate	All	INFO
Expiry / validity check — hard block on expired docs; warn on expiring within 30 days (licenses); type-specific windows: TB PPD/blood 1yr, X-ray 5yr, CPR 2yr	Gov ID · License · TB · CPR · Vaccines	BLOCKS WARNS
Alteration & fraud detection — font anomalies, pixel artifacts, white-out; Google ID Proofing signals run separately on gov IDs	All	WARNS
Multiple docs in one upload — detected when nurse photographs two cards side by side	All	WARNS
Identity / name matching — exact, fuzzy (1–2 char diff / OCR noise), partial (first only or last only), no match	All	WARNS INFO
Dual-model confidence scoring — Claude + Gemini 3.5 Flash both read the doc; agreement per field raises score, disagreement lowers it	All	INFO
OCR cross-check — raw Cloud Vision scan verified against model output per field; match boosts confidence, mismatch reduces it	All (non-PDF)	INFO
Per-field confidence scores (0–100) — returned on every field; backend decides auto-approve vs. flag threshold	All	INFO
Age validation — impossible DOB (future or age >120) hard blocked; under-18 flagged; over-100 and under-16 flagged as probable OCR errors	Gov ID	BLOCKS WARNS
2-step PPD protocol compliance — step 2 < 7 days after step 1 blocked (CDC violation); step 1 only blocked (incomplete); 1-step form auto-upgraded if step 2 date present; read window 48–72h validated	TB skin	BLOCKS WARNS
Lab report enforcement — QuantiFERON / T-SPOT must come from an actual lab report; physical exam or immunization records rejected; expected lab values (antigen, mitogen, nil) must be present	TB blood	BLOCKS WARNS
Chest X-ray clearance statement — radiology report must contain a "no active TB" statement or an equivalent clearance phrase (e.g., "lungs are clear"); absence flagged	TB X-ray	WARNS
State-specific normalization & exceptions — license type codes normalized (e.g., OH "State Tested Nurse Aide" → CNA); IL/NE CNA no exp date or number expected; AL CNA 2yr expiry auto-computed from issue date	License	INFO
PA CNA enrollment phrase check — required phrases confirming PA Nurse Aide Registry enrollment must appear in the document	License (CNA-PA)	WARNS
Vaccine immunity determination — returns immune / not immune / exempt / unknown per vaccine; MMR requires all 3 components (Measles, Mumps, Rubella); MMRV satisfies both MMR + Varicella; declinations always rejected	Vaccines	BLOCKS INFO
Missing date auto-computation — CPR: 2yr expiry from issue date when not printed; 1-step PPD auto-upgraded to 2-step when step 2 date is present but form says 1-step	CPR · TB skin	INFO
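
The validity windows and the 30-day expiring-soon rule in the table above can be sketched as follows. The key names in `VALIDITY_YEARS`, the helper names, and the choice to treat the expiration day itself as still valid are assumptions for illustration. The table applies the expiring-soon warning to licenses only, so a caller would need to skip it for other document types.

```typescript
// Validity windows in years, taken from the table above. Key names are hypothetical.
const VALIDITY_YEARS: { [kind: string]: number } = {
  TB_SKIN: 1,
  TB_BLOOD: 1,
  TB_XRAY: 5,
  CPR: 2,
};

// Parse MM/DD/YYYY as a UTC date so local time zones cannot shift the day.
function parseUsDate(text: string): Date {
  const parts = text.split("/");
  return new Date(Date.UTC(Number(parts[2]), Number(parts[0]) - 1, Number(parts[1])));
}

function formatUsDate(d: Date): string {
  const mm = ("0" + (d.getUTCMonth() + 1)).slice(-2);
  const dd = ("0" + d.getUTCDate()).slice(-2);
  return mm + "/" + dd + "/" + d.getUTCFullYear();
}

// Expiration = source date + N years for the given document kind.
function calculatedExpiration(kind: string, sourceDate: string): string {
  const years = VALIDITY_YEARS[kind];
  if (years === undefined) throw new Error("Unknown document kind: " + kind);
  const d = parseUsDate(sourceDate);
  d.setUTCFullYear(d.getUTCFullYear() + years);
  return formatUsDate(d);
}

// Returns "EXPIRED", "EXPIRING_SOON" (30 days or fewer left), or null.
function expiryFlag(expiration: string, today: string): string | null {
  const msPerDay = 24 * 60 * 60 * 1000;
  const daysLeft = Math.round(
    (parseUsDate(expiration).getTime() - parseUsDate(today).getTime()) / msPerDay
  );
  if (daysLeft < 0) return "EXPIRED";
  if (daysLeft <= 30) return "EXPIRING_SOON";
  return null;
}
```

For example, `calculatedExpiration("TB_XRAY", "04/28/2025")` yields `04/28/2030`, and `expiryFlag("05/31/2026", "05/26/2026")` yields `EXPIRING_SOON`.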

What the extraction pipeline handles today

Every item below is logic that runs automatically on every upload — no human review needed unless flagged. BLOCKS = hard stop, document cannot be accepted. WARNS = soft flag passed to backend. INFO = computed/detected, no action needed.

All document types

BLOCKS	Wrong document type uploaded (e.g. nurse selected "RN License" but uploaded a passport) — detected and flagged
WARNS	Signs of tampering or image manipulation detected on the document
WARNS	Multiple documents detected in a single upload (e.g. photo of two cards side by side)
INFO	Confidence score returned per extracted field (0–100) — backend can use to set review thresholds
INFO	Two independent models (Claude + Gemini) extract every document — fields where they agree get a confidence boost; conflicts are flagged for review
INFO	OCR text scan independently cross-checks every extracted field — mismatches reduce confidence score

Government ID

BLOCKS	Impossible date of birth detected (age over 120, or future date)
WARNS	ID is expired
WARNS	Nurse appears to be under 18
WARNS	Nurse appears to be over 100 (possible OCR error on DOB)
WARNS	Google Document AI fraud signals detected (image manipulation, suspicious marks)
INFO	Nurse's pre-selected ID type (e.g. "Driver's License") treated as hint only — the actual document type detected by the pipeline is the source of truth, never blocks
INFO	Nurse's age at time of extraction calculated from DOB and returned
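
The age rules above can be sketched as a small check. The thresholds (future date or over 120, under 18, over 100, under 16) and the flag names come from the flag reference; the function names, the returned shape, and the decision to suppress the other age flags when the DOB is impossible are assumptions.

```typescript
interface AgeCheck {
  ageAtExtraction: number;
  flags: string[];
  warnings: string[];
}

// Whole years between a MM/DD/YYYY date of birth and a MM/DD/YYYY reference date.
function ageOn(dob: string, today: string): number {
  const b = dob.split("/").map(Number); // [month, day, year]
  const t = today.split("/").map(Number);
  let age = t[2] - b[2];
  // Subtract one year if this year's birthday has not happened yet.
  if (t[0] < b[0] || (t[0] === b[0] && t[1] < b[1])) age -= 1;
  return age;
}

function checkAge(dob: string, today: string): AgeCheck {
  const age = ageOn(dob, today);
  const flags: string[] = [];
  const warnings: string[] = [];
  if (age < 0 || age > 120) {
    // A negative age means the DOB is in the future.
    flags.push("IMPOSSIBLE_DOB");
  } else {
    if (age < 18) flags.push("UNDER_18");
    if (age < 16) warnings.push("AGE_UNDER_16");
    if (age > 100) warnings.push("AGE_OVER_100");
  }
  return { ageAtExtraction: age, flags, warnings };
}
```

With an illustrative DOB of 03/14/1989 and an extraction date of 05/26/2026, `checkAge` returns an age of 37 with no flags.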

Nursing License / Certification

BLOCKS	License is expired
WARNS	License expires within 30 days
WARNS	PA CNA: uploaded a Notice of Enrollment but one or more of the 3 required phrases is missing from the document text
INFO	State-specific license type names normalized to standard codes (e.g. Ohio "State Tested Nurse Aide" → CNA; original label also saved)
INFO	Alabama CNA: 2-year expiration auto-calculated from issue date (AL does not print expiration on license)
INFO	IL, NE, NB CNA: no expiration date required — absence of expiration is not flagged as missing
INFO	AL, IL, NE, NB CNA: no license number required — absence of license number is not flagged as missing

Name verification (cross-document)

INFO	Exact match: name on document matches account name exactly
INFO	Fuzzy match: small spelling differences (accent marks, OCR slips, 1-2 character errors) still count as a match
WARNS	Partial match: first name matches but last name doesn't (or vice versa)
WARNS	No match: neither first nor last name matches account — possible name change or wrong document
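
The four match levels above could be computed roughly as follows. This is a sketch under stated assumptions, not the pipeline's matcher: it treats a case-insensitive match as exact, uses edit distance of at most 2 as the fuzzy threshold (per the "1-2 character errors" rule), and adds a guard so very short names do not match by accident. All function names are hypothetical.

```typescript
type NameMatch = "EXACT" | "FUZZY" | "PARTIAL" | "NO_MATCH";

// Lowercase, strip accent marks, and drop everything except the letters a-z.
function normalizeName(raw: string): string {
  return raw
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z]/g, "");
}

// Classic dynamic-programming edit distance (insert, delete, substitute).
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

function partMatch(docPart: string, accountPart: string): "EXACT" | "FUZZY" | "NONE" {
  if (docPart.trim().toLowerCase() === accountPart.trim().toLowerCase()) return "EXACT";
  const a = normalizeName(docPart);
  const b = normalizeName(accountPart);
  const d = editDistance(a, b);
  // Assumed guard: the distance must also be smaller than the shorter name.
  return d <= 2 && d < Math.min(a.length, b.length) ? "FUZZY" : "NONE";
}

function classifyName(
  docFirst: string, docLast: string, accountFirst: string, accountLast: string
): NameMatch {
  const first = partMatch(docFirst, accountFirst);
  const last = partMatch(docLast, accountLast);
  if (first === "NONE" && last === "NONE") return "NO_MATCH";
  if (first === "NONE" || last === "NONE") return "PARTIAL";
  if (first === "EXACT" && last === "EXACT") return "EXACT";
  return "FUZZY";
}
```

Under these rules "LeAnn" against "LeeAnn" is a fuzzy match, and "José" against "Jose" is fuzzy because the names differ only by an accent mark.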

TB Test

BLOCKS	2-step PPD: Step 2 was administered fewer than 7 days after Step 1 (protocol violation — results invalid)
BLOCKS	2-step PPD uploaded with Step 1 present but Step 2 missing — incomplete submission
BLOCKS	QuantiFERON or T-SPOT uploaded but document is a physical form or immunization record, not a lab report
BLOCKS	TB test is expired (PPD/blood test: 1 year; Chest X-Ray: 5 years)
WARNS	PPD was not read within the required 48–72 hour window after placement
WARNS	Positive TB result detected
WARNS	No provider name or signature found on the document
WARNS	Chest X-Ray: document does not contain a statement confirming no active TB
WARNS	QuantiFERON/T-SPOT: lab values not found on the document
WARNS	2-step PPD: more than 30 days between Step 1 and Step 2 (possible date transcription error)
INFO	1-step PPD with a Step 2 date present is automatically re-classified as 2-step
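
The PPD timing rules above (48–72 hour read window, at least 7 days between steps, 30-day upper warning, incomplete 2-step, and 1-step auto-upgrade) can be sketched together. The input shape, the function names, and the warning name `STEP_INTERVAL_OVER_30_DAYS` are assumptions; the other flag names appear in the flag reference and test evidence. The sketch works at day resolution, so it cannot catch violations that depend on the time of day.

```typescript
interface PpdDates {
  selectedTwoStep: boolean;   // what the nurse selected (hypothetical field name)
  step1Placed: string;        // MM/DD/YYYY
  step1Read: string;          // MM/DD/YYYY
  step2Placed: string | null; // null when no second placement is on the form
}

// Days since the Unix epoch for a MM/DD/YYYY date, computed in UTC.
function utcDays(text: string): number {
  const p = text.split("/").map(Number);
  return Date.UTC(p[2], p[0] - 1, p[1]) / 86400000;
}

function checkPpd(d: PpdDates) {
  const flags: string[] = [];
  const warnings: string[] = [];

  const readWindowHours = (utcDays(d.step1Read) - utcDays(d.step1Placed)) * 24;
  if (readWindowHours < 48 || readWindowHours > 72) flags.push("READ_WINDOW_VIOLATION");

  let stepsIntervalDays: number | null = null;
  if (d.step2Placed !== null) {
    // A step 2 date on a 1-step selection is re-classified as 2-step.
    if (!d.selectedTwoStep) warnings.push("AUTO_UPGRADED_TO_2_STEP");
    stepsIntervalDays = utcDays(d.step2Placed) - utcDays(d.step1Placed);
    if (stepsIntervalDays < 7) flags.push("2_STEP_INTERVAL_TOO_SHORT");
    if (stepsIntervalDays > 30) warnings.push("STEP_INTERVAL_OVER_30_DAYS"); // hypothetical name
  } else if (d.selectedTwoStep) {
    flags.push("2_STEP_INCOMPLETE_MISSING_STEP_2");
  }
  return { readWindowHours, stepsIntervalDays, flags, warnings };
}
```

A form placed on 05/19/2025 and read on 05/21/2025 gives a 48-hour read window with no flags, and steps placed on 02/16/2026 and 03/02/2026 give a 14-day interval.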

CPR / BLS

BLOCKS	Document is a First Aid card only — does not include CPR or BLS
BLOCKS	CPR is expired (2-year validity from issue date if no explicit expiration printed)
BLOCKS	Neither an expiration date nor an issue date can be read — manual review required
WARNS	Document does not contain "CPR" or "BLS" keywords (unexpected format)
INFO	2-year expiration auto-calculated from issue date when no expiration date is printed on the card

Vaccines — MMR, Varicella, Tdap

BLOCKS	Document contains no usable vaccine data — no doses and no titer results
BLOCKS	MMR titer: one or more of Measles, Mumps, Rubella components is missing from the lab report — cannot confirm full immunity
BLOCKS	MMR titer: one or more components is explicitly non-immune (negative result)
BLOCKS	Expected vaccine (e.g. MMR) not found anywhere in the document
BLOCKS	Declination form uploaded — these are never accepted as proof of immunity
WARNS	MMR titer: one or more components is equivocal or indeterminate — manual lab review required
WARNS	Medical exemption: signed but not by a licensed clinician (MD, NP, PA, etc.)
WARNS	Exemption form does not list MMR (or "all vaccines") — may be for a different vaccine
WARNS	Physical exam form uploaded but MMR status is ambiguous or marked "Due" — cannot confirm immunity
INFO	Document classified as: vaccination record / titer report / exemption / declination / physical form
INFO	MMRV combo vaccine satisfies both MMR and Varicella requirements from a single document
INFO	When multiple documents submitted for MMR (e.g. separate measles, mumps, rubella titers), results are aggregated — most recent and best result per component is used
INFO	Final immunity result returned: immune / not immune / exempt / unknown — ready for backend to act on
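
The per-component aggregation described above might look like the sketch below. It covers only the "most recent result per component" part and does not implement the status-rank tie-break. The `TiterEvidence` shape, the ISO date format, and the mapping of missing or equivocal components to "unknown" are assumptions for illustration.

```typescript
type MmrComponent = "measles" | "mumps" | "rubella";
type TiterResult = "POSITIVE" | "NEGATIVE" | "EQUIVOCAL";

// Hypothetical evidence shape. titerDate is ISO (YYYY-MM-DD) here so strings sort by date.
interface TiterEvidence {
  component: MmrComponent;
  result: TiterResult;
  titerDate: string;
}

const MMR_COMPONENTS: MmrComponent[] = ["measles", "mumps", "rubella"];

function aggregateMmr(evidence: TiterEvidence[]) {
  // Keep only the most recent result per component.
  const latest: { [component: string]: TiterEvidence | undefined } = {};
  for (const item of evidence) {
    const current = latest[item.component];
    if (current === undefined || item.titerDate > current.titerDate) {
      latest[item.component] = item;
    }
  }

  const missingComponents = MMR_COMPONENTS.filter(function (c) {
    return latest[c] === undefined;
  });
  const results = MMR_COMPONENTS.map(function (c) {
    const item = latest[c];
    return item === undefined ? null : item.result;
  });

  // Any negative component means not immune; gaps or equivocal results stay unknown.
  let mmrImmune = "immune";
  if (results.indexOf("NEGATIVE") >= 0) {
    mmrImmune = "not immune";
  } else if (missingComponents.length > 0 || results.indexOf("EQUIVOCAL") >= 0) {
    mmrImmune = "unknown";
  }
  return { mmrImmune, missingComponents };
}
```

Because only the latest entry per component is kept, a newer positive measles titer replaces an older negative one before the overall result is decided.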

Live test results — SHFT-4882 Government ID (2026-05-26)

Document: GovID_Ohio-Drivers-License_Crystal-Mae_Stover_Exp-2029-07-01.jpg sent through POST /api/extract-compare

What was being verified	Result
The nurse's pre-selected ID type is treated as a hint only — the pipeline identifies the document type independently	Correctly identified as Driver's License ✓
Date of birth validated — impossible ages flagged	Age = 37, no impossible-age flag ✓
Tampering/alteration detection runs on every upload	No alterations detected (no false positive) ✓
Single document uploaded — multi-document flag should not fire	Multiple documents = false ✓
Clean document — no validation error or warning flags returned	No flags returned ✓

Live test results — SHFT-4950 License/Certification (2026-05-26)

4 real nurse license documents sent through POST /api/extract-compare

Document	What was being verified	Result
Illinois RN License NursingLicense_Illinois-Registered-Professional-Nurse-RN-License_Alyissia_Sims_Exp-2026-05-31.png	All core fields extracted: license type, number, expiration, name, state	RN, number extracted, exp 05/31/2026, state IL ✓
	"REGISTERED PROFESSIONAL NURSE" normalized to RN while keeping original label	"REGISTERED PROFESSIONAL NURSE" → RN ✓
	Expiring within 30 days — warning flag should fire	EXPIRING_SOON warning returned ✓
Illinois CNA Registry Printout NursingLicense_Illinois-Health-Care-Worker-Registry-CNA-Verification_Tamika-Nicole_Wilson_Active-2026-03-12.pdf	CNA fields extracted from state registry printout	CNA, number, exp 03/12/2026, issue date 05/11/2009, state IL ✓
	PDF input handled correctly	PDF processed without error ✓
Ohio STNA Card NursingLicense_Ohio-State-Tested-Nurse-Aide-CNA-Card_Misty_McKee_Issued-2025-08-20.jpg	Ohio "State Tested Nurse Aide" normalized to CNA	"State Tested Nurse Aide" → CNA ✓
	OH CNA cards have no expiration — only issue date returned	Issue date 08/20/2025, expiration empty (correct) ✓
Pennsylvania CNA Wallet Card NursingLicense_Pennsylvania-CNA-Nurse-Aide-Registry-Card_Keniesha_Porter_Exp-2027-05-20.jpg	PA CNA card fields extracted	CNA, number, exp 05/20/2027, state PA ✓
	PA notice validation should only fire on Notice of Enrollment, not a wallet card	PA notice validation = not triggered (correct) ✓

All 4 docs: correct type detected, no tampering flagged, no duplicate documents ✓

Live test results — SHFT-4952 TB Test (2026-05-26)

Document: TB_Chest-Xray-Radiology-Report_REDACTED_2025-04-28.png sent through POST /api/extract-compare

What was being verified	Result
Document correctly classified as a Chest X-Ray	Type = Chest X-Ray ✓
Chest X-Ray expiration auto-calculated as X-Ray date + 5 years	Expiration = 04/28/2030 (from X-Ray date 04/28/2025) ✓
Document contains "no active TB" clearance statement	"No acute pulmonary findings." detected ✓
No provider signature found — missing-signature warning should fire	NO_PROVIDER_NAME_OR_SIGNATURE warning returned ✓
Single document uploaded — multi-document flag should not fire	Multiple documents = false ✓

SHFT-4882 — AI Extraction Pipeline: Government ID (P1_upload_id)
Open in Jira → | Field Registry (P2–P11) →

What this demo covers:
Each criterion links to a clickable demo sample above.

Live API verification — 2026-05-26
Doc: GovID_Ohio-Drivers-License_REDACTED_Exp-2029-07-01.jpg → POST /api/extract-compare

Criterion	Field returned	Value
P0B is hint only — AI classification source of truth	`claude.typeMatch`	true (DRIVERS_LICENSE)
P5 DOB validated for impossible dates	`claude.ageAtExtraction` / `impossibleAgeFlag`	37 / null
Alteration detection runs on every extraction	`claude.hasVisibleAlterations`	false (no false positive)
Multi-document detection	`claude.multipleDocumentsDetected`	false
Backend flag arrays (clean doc → null)	`documentai.validationFlags` / `validationWarnings`	null / null

Acceptance Criterion	Status	Test Scenario
Accepts US passport, driver's license, learner's permit, temporary license, resident card	✓	Passport (REDACTED), Resident Card (REDACTED), State ID (REDACTED), DL (REDACTED, REDACTED)
Rejects non-accepted document types with error message	◐	Dan: Pipeline detects wrong doc type and flags TYPE MISMATCH. Dmitry: Backend reads the flag and returns the rejection message to nurse: "We couldn't identify this document. Please upload a valid U.S. passport, driver's license, learner's permit, temporary license, or resident card."
Detects expired documents and returns error	◐	Dan: Pipeline extracts expiration date and flags EXPIRED. Dmitry: Backend reads the flag and returns rejection: "This document is expired. Please upload a current, unexpired document."
Extracts all fields P2–P11 (first name, last name, suffix, DOB, expiration, issue date, document #, address, sex, middle name)	✓	REDACTED IL State ID → all 13 fields at 100% agreement
P0B_id_type (nurse's pre-selected doc type) is a hint only — AI classification is source of truth, mismatch never blocks	✓	Passport uploaded as "Driver's License" → AI detects passport, no block
P5 (date of birth) validated for impossible dates; under-18 triggers blocking flag	✓	DOB validated on every extraction. ageAtExtraction computed (REDACTED IL DL → 29). Flags: AGE_OVER_100, AGE_UNDER_16 for manual review. Under-18 → blocking flag.
P11 (middle name) extracted where present	✓	REDACTED MO DL → "REDACTED" extracted as middle name
Passport date formats (DD MMM YYYY) normalized to MM/DD/YYYY	✓	REDACTED passport → dates normalized in output
Handles both portrait and landscape orientations	✓	REDACTED GA DL (portrait), REDACTED Resident Card (landscape)
Confidence scores logged per field for threshold tuning (A6 = confidence config)	✓	Every extraction shows per-field scores (visible in results table)
Alteration detection: font inconsistencies, pixel artifacts, copy-paste edges, white-out, photo tampering	✓	Check runs on every extraction (hasVisibleAlterations field). Current test docs are real submissions and don't trigger it — need purpose-built tampered samples to validate the detection path
Readability check fails gracefully with message prompting clearer upload	✓	Low-quality images return graceful error with re-upload prompt
OCR cross-validation as independent third check	✓	REDACTED physical → OCR confirms/denies each field value independently
P2/P3 (first name / last name) cross-checked against A1/A2 (account first name / last name) for name mismatch detection	◐	Dan: Extraction pipeline already returns P2_first_name & P3_last_name from the document. Dmitry: Backend needs to compare extracted P2/P3 against the nurse's account fields A1_first_name/A2_last_name, handle nickname/suffix/hyphenation fuzzy matching, and trigger the mismatch flag + resolution options (update account name or upload proof of name change via FL-09). Also needs to check ALT-1/ALT-2 (alternate names on file) before flagging.
Name mismatch resolution options (nurse can update account name to match ID, or upload proof of name change)	—	Backend/UI flow — Dmitry scope
Document status flow (In Review → Verified → Needs Attention)	—	Backend state machine — Dmitry scope
Re-upload replaces previous, re-extracts all fields	—	Backend persistence — Dmitry scope
Duplicate detection (same doc uploaded by different accounts)	—	Backend dedup logic — Dmitry scope

✓ Done in this demo · ◐ Extraction ready, backend integration pending · — Backend/Dmitry scope
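
The passport date criterion in the table above (DD MMM YYYY normalized to MM/DD/YYYY) can be illustrated with a small helper. According to the test evidence, normalization happens in the extraction prompt, so this is only a sketch of the equivalent rule in code. The function name is hypothetical, and the helper does not validate the day against the month.

```typescript
const MONTH_ABBREVIATIONS = [
  "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC",
];

// Convert "01 JUL 2029" to "07/01/2029". Returns null when the input has another shape.
function normalizePassportDate(raw: string): string | null {
  const m = /^(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})$/.exec(raw.trim());
  if (m === null) return null;
  const monthIndex = MONTH_ABBREVIATIONS.indexOf(m[2].toUpperCase());
  if (monthIndex < 0) return null;
  const mm = ("0" + (monthIndex + 1)).slice(-2);
  const dd = ("0" + m[1]).slice(-2);
  return mm + "/" + dd + "/" + m[3];
}
```

Returning `null` for unrecognized input lets the caller fall back to the raw value or lower the field's confidence instead of guessing.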

Beyond ticket scope (bonus):
• Dual-model comparison (Claude + Gemini) for higher confidence
• Google Cloud Vision OCR as independent cross-check
• Per-field OCR corroboration with confidence adjustment
• Image compression & optimization (sharp: auto-resize, EXIF rotation)
• SHA-256 caching with 24h TTL
• Rate limiting (separate AI + OCR tracking)

SHFT-4950 — AI Extraction Pipeline: License/Certification (L1_license_upload)
Open in Jira → | Field Registry (L2–L8) →

What this demo covers:
Each criterion links to a clickable demo sample above.

Live API verification — 2026-05-26 (4 docs × POST /api/extract-compare)

Doc	Criterion proven	Evidence
IL RN (REDACTED) `license-rn`	L2/L3/L4/L5/L6 all extracted	licenseType=RN, L3=REDACTED, L4=05/31/2026, L5=REDACTED T REDACTED, L6=IL
	License type normalization	"REGISTERED PROFESSIONAL NURSE" → RN (rawLicenseTypeLabel preserved)
	Expiring-soon warning fires (<30 days)	`documentai.validationWarnings`: ["EXPIRING_SOON"]
IL CNA (REDACTED) `license-cna`	CNA registry verification extraction	licenseType=CNA, L3=REDACTED, L4=03/12/2026, L8=05/11/2009, L6=IL
IL CNA (REDACTED) `license-cna`	PDF input handled	.pdf processed without error
OH STNA (REDACTED) `license-cna`	State-specific abbreviation normalized (STNA → CNA)	rawLicenseTypeLabel="State Tested Nurse Aide" → licenseType=CNA
OH STNA (REDACTED) `license-cna`	L8 issue date extraction (OH CNA has no expiration)	issueDate=08/20/2025, expirationDate=null (correct — OH CNA card)
PA CNA (REDACTED) `license-cna`	PA CNA registry card extraction	licenseType=CNA, L3=REDACTED, L4=05/20/2027, L6=PA
PA CNA (REDACTED) `license-cna`	paCnaNoticeValidation only fires on Notice of Enrollment	paCnaNoticeValidation=null (correct — this is a card, not Notice)
All 4: typeMatch=true, hasVisibleAlterations=false, multipleDocumentsDetected=false ✓

Acceptance Criterion	Status	Test Scenario
Accepts: state nursing board certificates, wallet cards, online verification printouts, screenshots of state board portals	✓	IL RN License (REDACTED), PA CNA Wallet Card (REDACTED), IL Registry Search Printout (REDACTED), OH CNA Card (REDACTED)
Rejects non-accepted doc types with error message	◐	Dan: Pipeline detects UNKNOWN documentType and flags TYPE MISMATCH. Also detects license subtype mismatch (e.g., selected LPN but uploaded CNA). Dmitry: Backend sends rejection message to nurse.
Extracts L2 (license type), L3 (license number), L4 (expiration date), L5 (full name), L6 (state), L8 (issue date)	✓	All license demo samples → fields extracted with per-field confidence
State-specific license type abbreviations (GNA, STNA, TMA, CMA, CMT, QMA, LNA) normalized to internal codes	✓	OH "State Tested Nurse Aide" (REDACTED) → normalized to CNA. rawLicenseTypeLabel preserves original.
License name formats vary by state — AI handles all common formats	✓	Prompt handles "Last, First M.", "First Middle Last", "LAST, FIRST MIDDLE" etc.
Passport-style and non-standard date formats normalized to MM/DD/YYYY	✓	All dates normalized in extraction prompt
Issue date (L8) identified regardless of labeling ("date issued", "effective date", "date of certification")	✓	Prompt lists all common label variants; MO CNA (REDACTED) uses "date of completion"
Detects expired licenses (L4 in past) + warning message	◐	Dan: Pipeline flags EXPIRED with warning text. Dmitry: Backend returns warning to nurse.
License expiring within 30 days returns warning	◐	Dan: Pipeline flags EXPIRING_SOON with date. Dmitry: Backend returns warning to nurse.
Handles portrait and landscape orientations	✓	Wallet cards (landscape) vs certificates (portrait) both handled
Confidence scores logged per field for threshold tuning	✓	Every extraction shows confidencePerField in results
Alteration detection (font inconsistencies, pixel artifacts, white-out, tampering)	✓	hasVisibleAlterations checked on every extraction
Readability check fails gracefully with re-upload prompt	✓	Low-quality images return graceful error
BACKEND SCOPE (Dmitry)
License type matching (L2 vs A7) — mismatch prompt + two options	—	Backend compares extracted L2 against A7_license from account creation
State matching (L6 vs A5) — CNA must match, LPN/RN flexible	—	Backend compares L6 against A5_license_state
Name matching (L5 vs A1/A2) + ALT-N lookup + FL-09 flag	—	Backend cross-checks names, only proof-of-name-change path (no A1/A2 update from license)
CNA exceptions: AL/IL/NE/NB no license number; IL/NE/NB no expiration; AL calculated expiration (issue+24mo)	✓	IL CNA (REDACTED) → stateException: IL_NO_EXPIRATION_REQUIRED. AL: calculatedExpiration = issueDate+2yr. Pipeline skips extraction for these fields per state rules.
PA CNA: validate Notice of Enrollment (3 text checks)	✓	Pipeline extracts documentText and checks 3 required phrases (Commonwealth/Dept of Health, nurse aide training completion, Nurse Aide Registry enrollment). Returns paCnaNoticeValidation with per-phrase results.
Document status flow (In Review → Verified → Needs Attention)	—	Backend state machine
Re-upload replaces previous, re-extracts all fields	—	Backend persistence
Duplicate detection (CNA numbers unique across accounts)	—	Backend dedup logic
Nursys mapping (LPN→PN, RN→RN) + downstream verification	—	Separate ticket — consumes L2, L3, L6 from this pipeline

✓ Done in this demo · ◐ Extraction ready, backend pending · — Backend/Dmitry scope

SHFT-4952 — AI Extraction Pipeline: TB Test (TB4_tb_upload)
Open in Jira → | Field Registry →

What this demo covers:
Validated with automated test suite against prod API (2026-05-20).

Live API verification — 2026-05-26
Doc: TB_Chest-Xray-Radiology-Report_REDACTED_REDACTED_2025-04-28.png → POST /api/extract-compare

Criterion	Field returned	Value
Chest X-ray classification	`claude.typeMatch` / `testType`	true / CHEST_XRAY
Chest X-ray expiration = performed date + 5 years	`claude.calculatedExpiration`	04/28/2030 (from xrayDate 04/28/2025)
Detect "no active TB" / equivalent clearance phrase	`claude.noActiveTbStatement`	"No acute pulmonary findings."
Validate doctor's name/signature presence	`documentai.validationWarnings`	["NO_PROVIDER_NAME_OR_SIGNATURE"] ✓ fires correctly
Multi-document detection	`claude.multipleDocumentsDetected`	false

Acceptance Criterion	Status	Test Scenario / Evidence
Accepts 4 TB doc types: 1-step skin, 2-step skin, blood test (QuantiFERON/T-Spot/IGRA), chest X-ray	✓	REDACTED (1-step), REDACTED (2-step), REDACTED/REDACTED/Scott (QuantiFERON), REDACTED/Turner (X-ray) — all classified correctly
Type mismatch: rejects docs that don't match nurse's TB3 selection	✓	Test: skin_test selected + X-ray uploaded → typeMatch:false, "Selected PPD_SKIN_TEST but appears to be CHEST_XRAY"
1-step selected but 2-step detected → auto-upgrade silently, return TEST_TYPE=2-STEP	✓	Auto-upgrade logic fires when step2DatePlaced detected on a 1-step classification. Warning: AUTO_UPGRADED_TO_2_STEP
2-step selected but only one set of dates → incomplete upload flag	✓	Flag: 2_STEP_INCOMPLETE_MISSING_STEP_2 when step2DatePlaced is absent
1-step skin: extract placed+read dates, validate 48-72hr read window	✓	REDACTED: placed 05/19, read 05/21 → readWindowHours:48, readWindowFlag:WITHIN_RANGE. Violation → READ_WINDOW_VIOLATION flag
2-step skin: Step 1 and Step 2 placed dates must be at least 7 days apart	✓	REDACTED: step1=02/16, step2=03/02 → stepsIntervalDays:14, stepsIntervalFlag:WITHIN_RANGE. <7 days → 2_STEP_INTERVAL_TOO_SHORT
Skin/blood test: calculate expiration = placed/result date + 1 year	✓	REDACTED: read 05/21/2025 → calculatedExpiration:05/21/2026. REDACTED: result 11/06/2025 → calculatedExpiration:11/06/2026
Chest X-ray: calculate expiration = performed date + 5 years	✓	REDACTED: xrayDate 04/28/2025 → calculatedExpiration:04/28/2030
Expired documents flagged (calculated expiration in past)	◐	Dan: Pipeline flags EXPIRED. Dmitry: Backend returns error message to nurse.
Positive result returns manual review flag	✓	overallResult:POSITIVE → flag:POSITIVE_RESULT. Routing to FL-07/Paused is Dmitry's scope.
Validate presence of doctor's name/signature/initials in "given by" field	✓	hasPhysicianSignature + physicianName extracted. Missing both → warning: NO_PROVIDER_NAME_OR_SIGNATURE
Blood test: detect if document has actual laboratory values	✓	REDACTED QuantiFERON: hasLabValues:true (IU/mL values present). Missing → warning: NO_LAB_VALUES_DETECTED
Blood test: reject Physical form or Immunization report (no lab values)	✓	isPhysicalOrImmunizationForm:true → flag: PHYSICAL_OR_IMMUNIZATION_FORM_NOT_LAB_REPORT. Rejection message is Dmitry's scope.
Chest X-ray: detect "no active TB" or equivalent clearance phrase	✓	REDACTED: noActiveTbStatement:"No acute pulmonary findings." Missing → warning: NO_ACTIVE_TB_STATEMENT_NOT_FOUND
Alteration detection (font inconsistencies, pixel artifacts, tampering)	✓	hasVisibleAlterations checked on every extraction
Returns tags: test_type, steps_interval_days, expiration_date	✓	All returned: testType, stepsIntervalDays (2-step only), calculatedExpiration
Confidence scores logged per field	✓	overallConfidence + per-field scores in results
Handles multi-page uploads (2-step across two pages)	✓	PDF uploads supported, all pages processed
BACKEND SCOPE (Dmitry)
Error messages to nurse (expired, read window violation, wrong doc type)	—	Backend reads flags and returns user-facing strings
Name matching (applicant name vs A1/A2)	—	Backend cross-checks extracted patientName against account
TB1 symptom screening interaction (positive result + symptoms → manual review)	—	Backend reads POSITIVE_RESULT flag + TB1 answers
Document status flow (In Review → Verified → Needs Attention)	—	Backend state machine
Facility enforcement (expiration_date + steps_interval_days for shift booking)	—	Backend/portal consumes tags from pipeline
TB3 update on auto-upgrade (1-step → 2-step)	—	Backend reads AUTO_UPGRADED_TO_2_STEP warning and updates TB3

✓ Done in this demo · ◐ Extraction ready, backend pending · — Backend/Dmitry scope

Test docs used:
• TB_PPD-Skin-Test-Results_REDACTED_REDACTED_2025-05-21.jpeg (1-step)
• TB_Employee-Screening-Form_REDACTED_REDACTED_2026-03-02.jpg (2-step)
• TB_QuantiFERON-Gold-Plus-Blood-Test_REDACTED_2025-11-06.jpeg (blood)
• TB_Chest-Xray-Radiology-Report_REDACTED_REDACTED_2025-04-28.png (X-ray)

SHFT-5080 — AI Extraction Pipeline: MMR Immunity Proof
Open in Jira →

What this demo covers:
Six MMR proof types extracted to a shared contract (mmrDocType + mmrImmune) so Dmitry's backend can roll up immunity across multiple documents per nurse.

Acceptance Criterion	Status	Test Scenario / Evidence
Vaccine record: extract per-vaccine doses, lot, manufacturer, dates	✓	VaccineRecordSchema returns vaccines[] with category (MMR/MMRV/MEASLES/MUMPS/RUBELLA/...), doses[], titerResult, immunityStatus
Lab titer report: extract POSITIVE/NEGATIVE/EQUIVOCAL per component	✓	titerResult + titerDate per vaccine entry; combined MMR titer covers all three components when positive
Physical form with MMR section as proof	✓	uploadType `mmr-physical_form` → PhysicalFormMmrSchema, mmrStatus (ADMINISTERED/IMMUNE_BY_TITER/UP_TO_DATE/DUE/DECLINED/EXEMPT) maps to mmrImmune
Medical exemption: clinician signature required	✓	Flags: MEDICAL_EXEMPTION_MISSING_PHYSICIAN_SIGNATURE, MEDICAL_EXEMPTION_NON_CLINICIAN_SIGNER (token-based MD/DO/NP/PA/APRN/DNP/CNM/PhD check — Pastor ≠ PA)
Religious exemption: nurse signature required	✓	Flag: RELIGIOUS_EXEMPTION_MISSING_NURSE_SIGNATURE when hasPatientSignature is false
Declination form: rejected as exemption	✓	DeclinationSchema → mmrDocType:"declination", mmrImmune:"unknown", flag:MMR_DECLINATION_REJECTED. Declinations never satisfy MMR.
Cross-document aggregation across multiple uploads	✓	`POST /api/mmr/aggregate` consumes prior extractions, returns mmrImmune + per-component evidence (measles/mumps/rubella) + missingComponents[]
Incomplete titer: missing one or more components flagged	✓	Flag: MMR_TITER_INCOMPLETE. Warning lists missing components by name.
Equivocal/indeterminate titer: manual review	✓	Flag: EQUIVOCAL_TITER_MANUAL_REVIEW, mmrImmune:"unknown". Distinguished from outright NEGATIVE.
Newer titer overrides older one per component	✓	Aggregator prefers later titerDate; falls back to status rank when dates missing
Applicant name cross-check (A1 first / A2 last)	✓	`POST /api/name/verify` returns EXACT/FUZZY/PARTIAL/NO_MATCH/INSUFFICIENT_DATA with similarity 0–100. Handles accents, hyphens, OCR slips (LeAnn/LeeAnn).
Alteration detection on every MMR doc	✓	hasVisibleAlterations + alterationDetails on all 6 schemas
Confidence scores per field	✓	confidencePerField + overallConfidence on every schema
BACKEND SCOPE (Dmitry)
Persist mmrDocType + mmrImmune per upload, call aggregator	—	Backend stores extraction, replaces prior of same type, calls /api/mmr/aggregate when status needs recomputing
User-facing error messages from flags	—	Backend reads warnings[] + flags[] and surfaces to nurse
Manual-review routing for exemptions and equivocal titers	—	Backend reads MMR_EXEMPTION_ON_FILE, EQUIVOCAL_TITER_MANUAL_REVIEW

✓ Done in this demo · — Backend/Dmitry scope

Upload types: mmr-vaccine_record, mmr-titer, mmr-physical_form, mmr-medical_exemption, mmr-religious_exemption, mmr-declination
Endpoints: POST /api/extract, POST /api/mmr/aggregate, POST /api/name/verify
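
The token-based clinician check in the medical exemption criterion (the "Pastor ≠ PA" case) can be sketched as below. The credential list comes from that table row; the function name and the tokenization rule are assumptions. Matching whole tokens rather than substrings is what keeps "Pastor" from matching "PA".

```typescript
// Credential tokens listed in the medical exemption criterion above.
const CLINICIAN_CREDENTIALS = ["MD", "DO", "NP", "PA", "APRN", "DNP", "CNM", "PHD"];

// True when any whole token of the signer line is a known credential.
function hasClinicianCredential(signer: string): boolean {
  // "M.D." becomes "MD" before the string is split on non-letter characters.
  const tokens = signer.toUpperCase().replace(/\./g, "").split(/[^A-Z]+/);
  for (const token of tokens) {
    if (CLINICIAN_CREDENTIALS.indexOf(token) >= 0) return true;
  }
  return false;
}
```

One known limitation of this sketch: a surname that happens to equal a credential token, such as "Do", would pass the check, so a real implementation would likely look at token position as well.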

Claude Sonnet 4.6 — built by Anthropic
Currently one of the most capable multimodal models for vision-based structured data extraction from documents.

Why we chose it for this pipeline:
Claude is the primary extraction engine. It reads the uploaded document image, identifies all relevant fields (name, DOB, license number, expiration, etc.), and returns structured JSON with per-field confidence scores. It is the most accurate model we tested for this use case.

Strengths:
• Highest accuracy for field extraction across all document types (gov ID, TB tests, physicals, nursing licenses)
• Best at detecting document alterations — catches font mismatches, pixel-level edits, inconsistent backgrounds, and photoshopped text
• Produces nuanced, realistic confidence scores (typically 88–98 range) rather than defaulting to 100
• Superior handwriting recognition — reads handwritten dates, signatures, and doctor notes more reliably
• Strong structured output — consistently returns valid JSON matching our Zod schemas
• Built-in safety guardrails: tends to return null with low confidence rather than fabricate data it can't read (not a guarantee, so the OCR cross-check still applies)

Trade-offs:
• Slower than Gemini (~5–7 seconds per document vs ~3–5s)
• ~10x more expensive per document (~$0.01–0.03 vs ~$0.001–0.005)
• Occasionally over-cautious — may return lower confidence on legible fields

How it works in the pipeline:
1. Image is compressed & optimized (auto-resize, EXIF rotation via sharp)
2. Base64-encoded image + extraction prompt sent to Claude's vision API
3. Claude returns structured JSON with fields + confidence scores
4. Results validated against Zod schema + post-extraction rules (expiry, age, type match)
5. OCR cross-check adjusts confidence: +5 if OCR confirms, -15 if OCR disagrees
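
To make step 4 more concrete, here is a minimal sketch of one post-extraction rule, the expiry check behind the `EXPIRED` and `EXPIRING_SOON` flags. The flag names and the 30-day window come from the flag reference; the function name, the `RuleResult` shape, the `"license"` category string, and the choice to treat "expires today" as not yet expired are assumptions.

```typescript
interface RuleResult {
  validationFlags: string[];
  validationWarnings: string[];
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// expirationDate is an ISO date string (YYYY-MM-DD) taken from the extraction.
function checkExpiry(expirationDate: string, today: Date, docCategory: string): RuleResult {
  const result: RuleResult = { validationFlags: [], validationWarnings: [] };
  const expiry = new Date(expirationDate + "T00:00:00Z").getTime();
  // Compare whole UTC days so the time of day does not affect the result.
  const todayUtc = Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate());
  const daysLeft = Math.round((expiry - todayUtc) / MS_PER_DAY);

  if (daysLeft < 0) {
    result.validationFlags.push("EXPIRED"); // hard issue, applies to all documents
  } else if (docCategory === "license" && daysLeft <= 30) {
    result.validationWarnings.push("EXPIRING_SOON"); // soft flag, licenses only
  }
  return result;
}
```

The age and type-match rules would follow the same pattern: compute a value from the extracted fields, then push a flag or a warning.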

Pricing breakdown:
• Input: $3.00 per 1M tokens (the image + prompt you send)
• Output: $15.00 per 1M tokens (the JSON response it generates)

Typical single extraction:
~1,500 input tokens × $3/1M = $0.0045
~400 output tokens × $15/1M = $0.006
Total: ~$0.01 per document

At scale: 10,000 docs/month ≈ $100–300/month

Gemini 3.5 Flash, built by Google
A fast, cost-efficient multimodal model optimized for high-throughput tasks where speed and cost matter more than peak accuracy.

Why we chose it for this pipeline:
Gemini serves as the second opinion. When two independent models agree on a field value, our confidence in that extraction is very high. When they disagree, the field gets flagged for human review. This dual-model approach catches errors that any single model would miss.

Strengths:
• Very fast — typically 3–5 seconds per document
• ~10x cheaper than Claude per extraction
• Good accuracy on clearly printed text and standard document layouts
• Generous free tier (makes testing and development essentially free)
• High throughput — can process many documents quickly in batch scenarios

Trade-offs:
• Tends to give overconfident scores (95–100 for nearly everything, even ambiguous fields)
• Less reliable on handwritten forms, cursive, and poor-quality scans
• Misses some alteration cues that Claude catches (subtle font changes, compression artifacts)
• Occasionally misreads handwritten dates (e.g., "2025" as "2005")

How it works in the pipeline:
1. Same compressed image sent to Gemini's vision API in parallel with Claude
2. Gemini returns structured JSON matching the same Zod schema
3. Results compared field-by-field against Claude's output
4. Agreement/disagreement highlighted in the comparison view
5. OCR cross-check applied independently to Gemini's results too
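
The field-by-field comparison in step 3 might look something like the sketch below. The `Extraction` and `FieldComparison` shapes and the whitespace/case normalization are assumptions; the idea of marking disagreements for review is from the description above.

```typescript
// Hypothetical flat extraction result: field name -> extracted value (or null).
type Extraction = Record<string, string | null>;

interface FieldComparison {
  field: string;
  claude: string | null;
  gemini: string | null;
  agree: boolean; // false means the field should go to human review
}

function compareExtractions(claude: Extraction, gemini: Extraction): FieldComparison[] {
  // Union of field names from both models, in first-seen order.
  const fields: string[] = Object.keys(claude);
  for (const k of Object.keys(gemini)) {
    if (fields.indexOf(k) === -1) fields.push(k);
  }
  // Ignore case and surrounding whitespace when comparing values.
  const norm = (v: string | null): string | null => (v === null ? null : v.trim().toUpperCase());
  return fields.map((field) => {
    const c = claude[field] ?? null;
    const g = gemini[field] ?? null;
    return { field, claude: c, gemini: g, agree: norm(c) === norm(g) };
  });
}
```

In this sketch two nulls count as agreement (both models declined to read the field); a real pipeline might prefer to treat that case separately.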

Pricing breakdown:
• Input: $0.30 per 1M tokens
• Output: $2.50 per 1M tokens

Typical single extraction:
~1,500 input tokens × $0.30/1M = $0.00045
~400 output tokens × $2.50/1M = $0.001
Total: ~$0.0015 for this example (typical range ~$0.001–0.005 per document)

At scale: 10,000 docs/month ≈ $10–50/month

Google Cloud Vision API: TEXT_DETECTION
Traditional OCR (Optical Character Recognition), not a generative AI model. This is the same engine that powers Google Lens, Google Photos text search, and Google Drive's automatic PDF text extraction.

What is OCR?
OCR stands for Optical Character Recognition. It scans an image pixel-by-pixel to detect and extract raw text using pattern matching and character recognition. Unlike AI models, OCR doesn't "understand" the document — it simply finds every piece of text in the image and returns it as a plain string. It doesn't know what a "first name" or "expiration date" is; it just reads characters.

Why we use it in this pipeline:
OCR serves as an independent third cross-check alongside both AI models. If Claude extracts firstName = "SONYA" and OCR also found "SONYA" in the raw text, we have strong evidence that value is correct. If the AI extracted something OCR can't find anywhere in the document, that's a red flag — the AI may have hallucinated or misread.

How confidence adjustment works:
• AI extracts a field value → we search the OCR raw text for that value
• OCR ✓ Found in OCR text → confidence +5 points (confirmed by independent source)
• OCR ✗ Not found in OCR text → confidence -15 points (flagged for human review)
• This adjustment is applied per-field, independently for each AI model's results
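
A minimal sketch of the adjustment described above. The +5 / -15 values come from this document; the function name, the case-insensitive substring search, skipping null values, and clamping to 0–100 are assumptions.

```typescript
function adjustConfidence(value: string | null, confidence: number, ocrText: string): number {
  // Nothing to look up when the model returned no value.
  if (value === null || value.trim() === "") return confidence;

  // Case-insensitive substring search of the raw OCR text (assumed matching strategy).
  const found = ocrText.toUpperCase().indexOf(value.trim().toUpperCase()) !== -1;
  const adjusted = confidence + (found ? 5 : -15);

  // Clamping to the 0-100 range is an assumption, not documented behavior.
  return Math.max(0, Math.min(100, adjusted));
}

// Example: applied per field, once for each model's extraction.
const ocrRawText = "DRIVER LICENSE SONYA RIVERA DOB 04/12/1988";
const confirmed = adjustConfidence("SONYA", 92, ocrRawText); // value appears in OCR text
const unconfirmed = adjustConfidence("SONIA", 92, ocrRawText); // value does not appear
```

A plain substring search is the simplest possible check; dates and numbers would likely need format normalization before the lookup.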

Why run OCR if AI already reads the image?
AI models can "hallucinate", confidently outputting text that isn't actually in the document. OCR is deterministic (same image always produces same text), so it acts as an independent reference check. The combination of AI understanding + OCR verification is more reliable than either alone.

Performance:
• Latency: ~0.3–0.5 seconds (runs in parallel with AI, adds zero wait time)
• OCR fires simultaneously with Claude and Gemini — the total request time is determined by the slowest AI model, not the sum

Pricing:
• $1.50 per 1,000 images processed
• First 1,000 images/month are FREE (Google's free tier)
• Per document: ~$0.0015

At scale: 10,000 docs/month ≈ $13.50/month (after free tier)

Google Document AI: Google Cloud's specialized document processing platform.
Pre-trained processors built for specific document types (IDs, forms, invoices) — not a general-purpose LLM. Returns structured key-value pairs with bounding boxes and confidence scores.

Why we chose it for this pipeline:
Document AI is the third independent extractor alongside Claude and Gemini. Because it's purpose-built for documents (not generative), it tends to be deterministic and excellent at machine-readable layouts, although it is also the slowest of the three. When all three (Claude + Gemini + Doc AI) agree on a field, our confidence in that value is extremely high.

Strengths:
• Purpose-built per document category — separate model for IDs vs forms vs general text
• Returns spatial layout (bounding boxes), so we can see exactly where each value was read from
• Deterministic — same image always returns same output (unlike LLMs)
• Identity Document Proofing processor detects tampering signals (digital alteration scores, evidence inconclusive, etc.) that LLMs sometimes miss
• Strong on structured/printed text — driver's licenses, official forms, lab reports

Trade-offs:
• Slowest of the three (~10–14s vs Claude ~6s, Gemini ~4s)
• Weaker on free-form handwriting and unusual layouts than Claude
• Each document category needs its own processor (more setup than a single LLM call)
• Field labels come back raw — we map them to our schema in code (e.g., "Family Name" → lastName)

Processors we use:

Category	Processor	Purpose
`government_id`	US Driver License Parser	Pre-trained on US DL/state ID layouts; extracts P2–P11 fields directly
`license`	Form Parser	Generic form K/V extraction for nursing license certificates & cards
`tb_test` / `physical`	OCR Processor	Better at handwriting (TB skin test dates, physician notes)
All gov IDs	Identity Document Proofing	Runs in parallel — returns tampering/alteration scores feeding POSSIBLE_ALTERATION flag
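
The table above amounts to a small lookup. A sketch of how processor selection could be expressed, where the category keys and processor names are taken from the table and the function and constant names are hypothetical:

```typescript
type UploadCategory = "government_id" | "license" | "tb_test" | "physical";

// Primary processor per upload category, as listed in the table.
const PRIMARY_PROCESSOR: Record<UploadCategory, string> = {
  government_id: "US Driver License Parser",
  license: "Form Parser",
  tb_test: "OCR Processor",
  physical: "OCR Processor",
};

// Returns every processor that should run for a given upload.
function processorsFor(category: UploadCategory): string[] {
  const processors = [PRIMARY_PROCESSOR[category]];
  if (category === "government_id") {
    // Gov IDs additionally get the tampering check, run in parallel.
    processors.push("Identity Document Proofing");
  }
  return processors;
}
```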

How it works in the pipeline:
1. Same compressed image dispatched to Document AI in parallel with Claude + Gemini
2. Processor picked by upload category (see table above)
3. Raw entities returned with bounding boxes & per-field confidence
4. Field labels mapped to our schema (e.g., DocAI "Date Of Birth" → our dateOfBirth)
5. Results compared against Claude/Gemini for agreement scoring; populates documentai.validationFlags / validationWarnings
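
Step 4, the label mapping, could be sketched as below. The `DocAiEntity` shape is a simplified stand-in rather than the actual Document AI response type, and only the "Family Name" and "Date Of Birth" mappings are taken from this document; the "Given Names" entry is a hypothetical addition.

```typescript
// Simplified, hypothetical view of one entity returned by a processor.
interface DocAiEntity {
  label: string;
  value: string;
  confidence: number;
}

// Keys are lowercased so label casing differences do not matter.
const LABEL_TO_FIELD: Record<string, string> = {
  "family name": "lastName",
  "date of birth": "dateOfBirth",
  "given names": "firstName", // hypothetical label, for illustration only
};

interface MappedResult {
  fields: Record<string, { value: string; confidence: number }>;
  unmapped: string[]; // labels with no schema equivalent, kept for debugging
}

function mapEntities(entities: DocAiEntity[]): MappedResult {
  const fields: Record<string, { value: string; confidence: number }> = {};
  const unmapped: string[] = [];
  for (const e of entities) {
    const key = e.label.trim().toLowerCase();
    if (!Object.prototype.hasOwnProperty.call(LABEL_TO_FIELD, key)) {
      unmapped.push(e.label);
      continue;
    }
    fields[LABEL_TO_FIELD[key]] = { value: e.value, confidence: e.confidence };
  }
  return { fields, unmapped };
}
```

Collecting unmapped labels instead of dropping them makes it easier to spot processor output that the schema mapping does not yet cover.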

Pricing breakdown:
• Form Parser / OCR / DL Parser: $0.030 per page (first 1,000 pages/month FREE)
• Identity Document Proofing: $0.10 per request

Typical single extraction (gov ID):
1 DL Parser call + 1 ID Proofing call = $0.03 + $0.10 = ~$0.13 per gov ID
Non-ID docs (license/TB/physical): ~$0.03 per doc

At scale: 10,000 docs/month — mix-dependent, ~$300–$1,300/month
Note: Doc AI is the most expensive of the three providers. It is used only because an independent third opinion catches errors the LLMs miss.
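
The "mix-dependent" range can be reproduced with a small calculation. This is a sketch that uses the per-call prices listed above and assumes one page per document; it ignores the free first 1,000 pages, and the function name is hypothetical.

```typescript
const PER_PAGE_USD = 0.03; // Form Parser / OCR / DL Parser, per page
const ID_PROOFING_USD = 0.1; // Identity Document Proofing, per request

// Estimated monthly Document AI spend, assuming one page per document.
function docAiMonthlyCost(govIdDocs: number, otherDocs: number): number {
  const govIdCost = govIdDocs * (PER_PAGE_USD + ID_PROOFING_USD); // parser + proofing
  const otherCost = otherDocs * PER_PAGE_USD;
  return Math.round((govIdCost + otherCost) * 100) / 100; // round to cents
}

const allOther = docAiMonthlyCost(0, 10000); // lower bound of the range
const allGovId = docAiMonthlyCost(10000, 0); // upper bound of the range
const mixed = docAiMonthlyCost(2500, 7500); // one possible mix
```

The two extremes correspond to the ~$300 and ~$1,300 ends of the range quoted above; a 25% gov ID mix falls in between.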

What are tokens?
Tokens are the unit AI models use to measure text. Think of them as "word pieces." One token is roughly 4 characters or about ¾ of an English word. The word "extraction" is about 2 tokens, though exact counts vary by model tokenizer. A full sentence is typically 15–25 tokens.

Why do tokens matter?
AI model pricing is based entirely on tokens — both the tokens you send (input) and the tokens the model generates back (output). Understanding tokens helps you estimate costs and optimize usage.

Input tokens — what you send TO the model:
• The document image (~1,000–2,000 tokens depending on resolution)
• The extraction prompt/instructions (~200 tokens)
• The schema definition telling the model what fields to extract (~100 tokens)
• Total per request: ~1,300–2,300 input tokens

Output tokens — what the model sends BACK:
• The structured JSON with all extracted fields and confidence scores
• Typically ~200–500 tokens depending on document type
• Output tokens cost more than input tokens: 5x for Claude ($15 vs $3 per 1M) and about 8x for Gemini ($2.50 vs $0.30 per 1M)

Why does output cost more?
Input tokens, including the image, are processed together in one pass. Output tokens are generated one at a time, and each new token requires another pass through the model, so generation is the costlier part of a request.

Worked example — 1 driver's license:

Claude Sonnet 4.6:
Input: ~1,500 tokens × $3.00/1M = $0.0045
Output: ~400 tokens × $15.00/1M = $0.006
Total: ~$0.01

Gemini 3.5 Flash:
Input: ~1,500 tokens × $0.30/1M = $0.00045
Output: ~400 tokens × $2.50/1M = $0.001
Total: ~$0.0015

Compare mode (both + OCR): ~$0.013
Claude is ~7x pricier at these token counts but more accurate. The compare mode runs both for maximum confidence.
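
The worked example above is simple enough to express as a function, which makes it easy to rerun with different token counts. The prices are the ones listed in this document; the `ModelPricing` type and function name are hypothetical.

```typescript
interface ModelPricing {
  inputPerMTok: number; // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

const PRICING: Record<string, ModelPricing> = {
  claude: { inputPerMTok: 3.0, outputPerMTok: 15.0 },
  gemini: { inputPerMTok: 0.3, outputPerMTok: 2.5 },
};

const OCR_PER_IMAGE_USD = 1.5 / 1000; // $1.50 per 1,000 images

function extractionCostUsd(p: ModelPricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerMTok + outputTokens * p.outputPerMTok) / 1_000_000;
}

// One driver's license: ~1,500 input tokens, ~400 output tokens.
const claudeCost = extractionCostUsd(PRICING.claude, 1500, 400);
const geminiCost = extractionCostUsd(PRICING.gemini, 1500, 400);
const compareModeCost = claudeCost + geminiCost + OCR_PER_IMAGE_USD;

console.log(claudeCost.toFixed(5), geminiCost.toFixed(5), compareModeCost.toFixed(5));
```

At these token counts the Claude call comes to about $0.0105 and the Gemini call to about $0.00145, so the per-document ratio is roughly 7x; larger images or longer JSON output shift both figures.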

Modes: Compare All · Final Result · Claude Only · Gemini Only · Doc AI Only

Runs Claude + Gemini + OCR in parallel and shows a side-by-side comparison

