eShyft
Nurse Onboarding Doc Verify
Upload a document to verify and extract fields
Validation flags reference: all flags the pipeline can return
validationFlags[] = hard issues  ·  validationWarnings[] = soft flags
FLAG = surface to nurse  ·  WARN = soft / advisory  ·  INFO = informational only
Flag / Warning | Applies to | Type | What it means
EXPIRED | All | FLAG | Document is past its expiration date.
POSSIBLE_ALTERATION | All | WARN | AI detected signs of tampering: font inconsistencies, pixel artifacts, white-out, copy-paste edges, or photo manipulation. Check alterationDetails for specifics.
UNDER_18 | Gov ID | FLAG | Extracted DOB places person under 18.
IMPOSSIBLE_DOB | Gov ID | FLAG | DOB is in the future or implies age >120. Likely an OCR misread.
AGE_OVER_100 / AGE_UNDER_16 | Gov ID | WARN | DOB produces an unusual age, probably an OCR digit error. Manual review.
EXPIRING_SOON | License | WARN | License expires within 30 days.
UNRECOGNIZED_DOCUMENT | License | FLAG | AI couldn't classify the document type. May be a wrong upload or blurry image.
POSITIVE_RESULT | TB | FLAG | TB test result is positive. A chest X-ray or clinical follow-up is required.
READ_WINDOW_VIOLATION | TB skin | FLAG | PPD was read outside the required 48–72 hour window. Test is invalid.
2_STEP_INTERVAL_TOO_SHORT | TB skin | FLAG | Step 2 placed fewer than 7 days after step 1. CDC protocol requires ≥7 days between steps.
2_STEP_INCOMPLETE_MISSING_STEP_2 | TB skin | FLAG | 2-step form with only step 1 data present. Nurse needs to upload the complete form.
NO_PROVIDER_NAME_OR_SIGNATURE | TB | WARN | No nurse, reader, or physician on the document. Lab-issued QuantiFERON/T-SPOT reports with a facility name do NOT trigger this.
NO_ACTIVE_TB_STATEMENT_NOT_FOUND | TB X-ray | WARN | No TB-negative phrase found in the radiology report. Standard phrases ("no acute infiltrate", "lungs are clear", etc.) are auto-detected.
XRAY_INDICATION_NOT_TB_RELATED | TB X-ray | WARN | X-ray indication is unrelated to TB screening (e.g., "SOB, wheezing"). X-ray may be clinically valid but wasn't ordered for TB.
PHYSICAL_MISSING_DATE | Physical | FLAG | No examination date extracted. Required to verify the physical is current.
DATE_LOW_CONFIDENCE | Physical | WARN | Exam date extracted with low confidence, often a handwritten year where "5" and "3" look similar. Manual review recommended.
typeMatch: false | All | WARN | AI detected a different document type than what was selected. Check typeMatchDetails. Selection is a hint only, never a hard block.
Pipeline summary: what the extraction stamps on every document
Each capability runs automatically. Severity: BLOCKS = backend must reject  ·  WARNS = soft flag for backend to act on  ·  INFO = returned, no action required
Capability | Applies to | Severity
Doc type classification: pipeline re-identifies doc type independently; nurse's selection is a hint only, never a hard gate | All | INFO
Expiry / validity check: hard block on expired docs; warn on expiring within 30 days (licenses); type-specific windows: TB PPD/blood 1yr, X-ray 5yr, CPR 2yr | Gov ID · License · TB · CPR · Vaccines | BLOCKS · WARNS
Alteration & fraud detection: font anomalies, pixel artifacts, white-out; Google ID Proofing signals run separately on gov IDs | All | WARNS
Multiple docs in one upload: detected when nurse photographs two cards side by side | All | WARNS
Identity / name matching: exact, fuzzy (1–2 char diff / OCR noise), partial (first only or last only), no match | All | WARNS · INFO
Dual-model confidence scoring: Claude + Gemini 3.5 Flash both read the doc; agreement per field raises score, disagreement lowers it | All | INFO
OCR cross-check: raw Cloud Vision scan verified against model output per field; match boosts confidence, mismatch reduces it | All (non-PDF) | INFO
Per-field confidence scores (0–100): returned on every field; backend decides auto-approve vs. flag threshold | All | INFO
Age validation: impossible DOB (future or age >120) hard blocked; under-18 and over-100 flagged as probable OCR errors | Gov ID | BLOCKS · WARNS
2-step PPD protocol compliance: step 2 < 7 days after step 1 blocked (CDC violation); step 1 only blocked (incomplete); 1-step form auto-upgraded if step 2 date present; read window 48–72h validated | TB skin | BLOCKS · WARNS
Lab report enforcement: QuantiFERON / T-SPOT must come from an actual lab report; physical exam or immunization records rejected; expected lab values (antigen, mitogen, nil) must be present | TB blood | BLOCKS · WARNS
Chest X-ray clearance statement: radiology report must explicitly state "no active TB"; absence flagged | TB X-ray | WARNS
State-specific normalization & exceptions: license type codes normalized (e.g., OH "State Tested Nurse Aide" → CNA); IL/NE CNA no exp date or number expected; AL CNA 2yr expiry auto-computed from issue date | License | INFO
PA CNA enrollment phrase check: required phrases confirming PA Nurse Aide Registry enrollment must appear in the document | License (CNA-PA) | WARNS
Vaccine immunity determination: returns immune / not immune / exempt / unknown per vaccine; MMR requires all 3 components (Measles, Mumps, Rubella); MMRV satisfies both MMR + Varicella; declinations always rejected | Vaccines | BLOCKS · INFO
Missing date auto-computation: CPR: 2yr expiry from issue date when not printed; 1-step PPD auto-upgraded to 2-step when step 2 date is present but form says 1-step | CPR · TB skin | INFO
What the extraction pipeline handles today
Every item below is logic that runs automatically on every upload; no human review is needed unless flagged. BLOCKS = hard stop, document cannot be accepted. WARNS = soft flag passed to backend. INFO = computed/detected, no action needed.
All document types
BLOCKS | Wrong document type uploaded (e.g. nurse selected "RN License" but uploaded a passport): detected and flagged
WARNS | Signs of tampering or image manipulation detected on the document
WARNS | Multiple documents detected in a single upload (e.g. photo of two cards side by side)
INFO | Confidence score returned per extracted field (0–100); backend can use it to set review thresholds
INFO | Two independent models (Claude + Gemini) extract every document; fields where they agree get a confidence boost, and conflicts are flagged for review
INFO | OCR text scan independently cross-checks every extracted field; mismatches reduce the confidence score
Government ID
BLOCKS | Impossible date of birth detected (age over 120, or future date)
WARNS | ID is expired
WARNS | Nurse appears to be under 18
WARNS | Nurse appears to be over 100 (possible OCR error on DOB)
WARNS | Google Document AI fraud signals detected (image manipulation, suspicious marks)
INFO | Nurse's pre-selected ID type (e.g. "Driver's License") treated as hint only; the actual document type detected by the pipeline is the source of truth, and a mismatch never blocks
INFO | Nurse's age at time of extraction calculated from DOB and returned
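The age rules above can be sketched as a small classifier. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, the thresholds are taken from the list above, and AGE_UNDER_16 is left out because the document does not say how it interacts with the under-18 rule.

```typescript
// Hypothetical helper names; thresholds come from the Government ID rules above.
type AgeFlag = "IMPOSSIBLE_DOB" | "AGE_OVER_100" | "UNDER_18" | null;

// Whole years between dob and today, compared in UTC to avoid timezone drift.
function ageOn(dob: Date, today: Date): number {
  let age = today.getUTCFullYear() - dob.getUTCFullYear();
  const hadBirthday =
    today.getUTCMonth() > dob.getUTCMonth() ||
    (today.getUTCMonth() === dob.getUTCMonth() &&
      today.getUTCDate() >= dob.getUTCDate());
  if (!hadBirthday) age -= 1;
  return age;
}

function ageFlag(dob: Date, today: Date): AgeFlag {
  if (dob.getTime() > today.getTime()) return "IMPOSSIBLE_DOB"; // DOB in the future
  const age = ageOn(dob, today);
  if (age > 120) return "IMPOSSIBLE_DOB"; // hard block
  if (age > 100) return "AGE_OVER_100"; // probable OCR digit error
  if (age < 18) return "UNDER_18";
  return null; // no age-related flag
}
```

Passing `today` explicitly keeps the function deterministic, which makes the boundary cases (birthday today, exactly 18) easy to test.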
Nursing License / Certification
BLOCKS | License is expired
WARNS | License expires within 30 days
WARNS | PA CNA: uploaded a Notice of Enrollment but one or more of the 3 required phrases is missing from the document text
INFO | State-specific license type names normalized to standard codes (e.g. Ohio "State Tested Nurse Aide" → CNA; original label also saved)
INFO | Alabama CNA: 2-year expiration auto-calculated from issue date (AL does not print expiration on license)
INFO | IL, NE, NB CNA: no expiration date required; absence of expiration is not flagged as missing
INFO | AL, IL, NE, NB CNA: no license number required; absence of license number is not flagged as missing
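The expired / expiring-soon split above comes down to a day count against the expiration date. A minimal sketch follows; `expiryStatus` and the "OK" value are hypothetical names, and treating the expiration day itself as still valid is an assumption the document does not confirm.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

type ExpiryStatus = "EXPIRED" | "EXPIRING_SOON" | "OK";

// Dates are expected at UTC midnight so the difference is a whole number of days.
function expiryStatus(expiration: Date, today: Date, warnDays = 30): ExpiryStatus {
  const daysLeft = Math.floor((expiration.getTime() - today.getTime()) / DAY_MS);
  if (daysLeft < 0) return "EXPIRED"; // assumption: expiring today is not yet expired
  if (daysLeft <= warnDays) return "EXPIRING_SOON"; // the 30-day license warning
  return "OK";
}
```

With the test date used later in this document (2026-05-26), a license expiring 05/31/2026 falls in the warning window, which matches the EXPIRING_SOON result reported for the Illinois RN license.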
Name verification (cross-document)
INFO | Exact match: name on document matches account name exactly
INFO | Fuzzy match: small spelling differences (accent marks, OCR slips, 1-2 character errors) still count as a match
WARNS | Partial match: first name matches but last name doesn't (or vice versa)
WARNS | No match: neither first nor last name matches account (possible name change or wrong document)
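One way to realize the four match levels above is an edit-distance comparison on accent-stripped names. The sketch below is an illustration under stated assumptions, not the pipeline's matcher: all function names are hypothetical, "exact" is treated as case-insensitive, and the 2-character tolerance is applied per name part.

```typescript
type NameMatch = "EXACT" | "FUZZY" | "PARTIAL" | "NO_MATCH";

// Strip accents, case, and non-letters so "José" and "JOSE" compare equal.
function normalizeName(s: string): string {
  return s
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z]/g, "");
}

// Classic Levenshtein distance with a rolling row.
function editDistance(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = cur;
  }
  return prev[b.length];
}

function isClose(a: string, b: string): boolean {
  return editDistance(normalizeName(a), normalizeName(b)) <= 2;
}

function matchName(
  docFirst: string, docLast: string, acctFirst: string, acctLast: string
): NameMatch {
  const same = (x: string, y: string) =>
    x.trim().toLowerCase() === y.trim().toLowerCase();
  if (same(docFirst, acctFirst) && same(docLast, acctLast)) return "EXACT";
  const firstOk = isClose(docFirst, acctFirst);
  const lastOk = isClose(docLast, acctLast);
  if (firstOk && lastOk) return "FUZZY";
  if (firstOk || lastOk) return "PARTIAL";
  return "NO_MATCH";
}
```

A fixed tolerance of 2 edits is too generous for very short names (two unrelated 2-letter names would count as "close"), so a real matcher would likely scale the tolerance with name length.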
TB Test
BLOCKS | 2-step PPD: Step 2 was administered fewer than 7 days after Step 1 (protocol violation; results invalid)
BLOCKS | 2-step PPD uploaded with Step 1 present but Step 2 missing (incomplete submission)
BLOCKS | QuantiFERON or T-SPOT uploaded but document is a physical form or immunization record, not a lab report
BLOCKS | TB test is expired (PPD/blood test: 1 year; Chest X-Ray: 5 years)
WARNS | PPD was not read within the required 48–72 hour window after placement
WARNS | Positive TB result detected
WARNS | No provider name or signature found on the document
WARNS | Chest X-Ray: document does not contain a statement confirming no active TB
WARNS | QuantiFERON/T-SPOT: lab values not found on the document
WARNS | 2-step PPD: more than 30 days between Step 1 and Step 2 (possible date transcription error)
INFO | 1-step PPD with a Step 2 date present is automatically re-classified as 2-step
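The PPD timing rules above (48–72 hour read window, at least 7 days between steps) can be sketched as two small checks. Function names are hypothetical; the returned strings reuse flag names that appear in this document. The "more than 30 days" warning is omitted because the document gives no flag name for it.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Read window: the PPD must be read 48-72 hours after placement (inclusive).
function readWindowFlag(placed: Date, read: Date): "WITHIN_RANGE" | "READ_WINDOW_VIOLATION" {
  const hours = (read.getTime() - placed.getTime()) / HOUR_MS;
  return hours >= 48 && hours <= 72 ? "WITHIN_RANGE" : "READ_WINDOW_VIOLATION";
}

// 2-step interval: step 2 must be placed at least 7 days after step 1.
function stepsIntervalFlag(step1Placed: Date, step2Placed: Date | null) {
  if (step2Placed === null) return "2_STEP_INCOMPLETE_MISSING_STEP_2";
  const days = Math.round((step2Placed.getTime() - step1Placed.getTime()) / DAY_MS);
  return days < 7 ? "2_STEP_INTERVAL_TOO_SHORT" : "WITHIN_RANGE";
}
```

When a form records only dates and no times, the hour difference is always a multiple of 24, so a test placed on day 0 passes only if it was read on day 2 or day 3.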
CPR / BLS
BLOCKS | Document is a First Aid card only and does not include CPR or BLS
BLOCKS | CPR is expired (2-year validity from issue date if no explicit expiration printed)
BLOCKS | Neither an expiration date nor an issue date can be read; manual review required
WARNS | Document does not contain "CPR" or "BLS" keywords (unexpected format)
INFO | 2-year expiration auto-calculated from issue date when no expiration date is printed on the card
Vaccines — MMR, Varicella, Tdap
BLOCKS | Document contains no usable vaccine data: no doses and no titer results
BLOCKS | MMR titer: one or more of Measles, Mumps, Rubella components is missing from the lab report, so full immunity cannot be confirmed
BLOCKS | MMR titer: one or more components is explicitly non-immune (negative result)
BLOCKS | Expected vaccine (e.g. MMR) not found anywhere in the document
BLOCKS | Declination form uploaded; these are never accepted as proof of immunity
WARNS | MMR titer: one or more components is equivocal or indeterminate; manual lab review required
WARNS | Medical exemption: signed but not by a licensed clinician (MD, NP, PA, etc.)
WARNS | Exemption form does not list MMR (or "all vaccines"); may be for a different vaccine
WARNS | Physical exam form uploaded but MMR status is ambiguous or marked "Due", so immunity cannot be confirmed
INFO | Document classified as: vaccination record / titer report / exemption / declination / physical form
INFO | MMRV combo vaccine satisfies both MMR and Varicella requirements from a single document
INFO | When multiple documents are submitted for MMR (e.g. separate measles, mumps, rubella titers), results are aggregated; the most recent and best result per component is used
INFO | Final immunity result returned: immune / not immune / exempt / unknown, ready for backend to act on
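The per-component aggregation described above (most recent result wins, with a fallback to the best result) could look roughly like the sketch below. The types and function names are hypothetical, exemptions are not modeled, and returning "unknown" when a component is missing is an assumption; the document only says immunity cannot be confirmed in that case.

```typescript
type MmrComponent = "measles" | "mumps" | "rubella";
type TiterStatus = "POSITIVE" | "EQUIVOCAL" | "NEGATIVE";
interface Titer {
  component: MmrComponent;
  status: TiterStatus;
  titerDate?: string; // ISO yyyy-mm-dd, so string comparison orders dates
}

const RANK: Record<TiterStatus, number> = { POSITIVE: 2, EQUIVOCAL: 1, NEGATIVE: 0 };

// Prefer the later titerDate; fall back to status rank when dates are missing or equal.
function better(a: Titer, b: Titer): Titer {
  if (a.titerDate && b.titerDate && a.titerDate !== b.titerDate) {
    return a.titerDate > b.titerDate ? a : b;
  }
  return RANK[a.status] >= RANK[b.status] ? a : b;
}

function aggregate(titers: Titer[]) {
  const best = new Map<MmrComponent, Titer>();
  for (const t of titers) {
    const cur = best.get(t.component);
    best.set(t.component, cur ? better(cur, t) : t);
  }
  const all: MmrComponent[] = ["measles", "mumps", "rubella"];
  const missingComponents = all.filter((c) => !best.has(c));
  let mmrImmune: "immune" | "not immune" | "unknown";
  if (missingComponents.length > 0) mmrImmune = "unknown"; // assumption, see lead-in
  else if (all.every((c) => best.get(c)!.status === "POSITIVE")) mmrImmune = "immune";
  else if (all.some((c) => best.get(c)!.status === "NEGATIVE")) mmrImmune = "not immune";
  else mmrImmune = "unknown"; // equivocal results go to manual review
  return { mmrImmune, missingComponents };
}
```

Keeping one best record per component means a newer positive measles titer replaces an older negative one without touching the mumps or rubella evidence.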
Live test results: SHFT-4882 Government ID (2026-05-26)
Document: GovID_Ohio-Drivers-License_Crystal-Mae_Stover_Exp-2029-07-01.jpg sent through POST /api/extract-compare
What was being verified | Result
The nurse's pre-selected ID type is treated as a hint only; the pipeline identifies the document type independently | Correctly identified as Driver's License ✓
Date of birth validated, impossible ages flagged | Age = 37, no impossible-age flag ✓
Tampering/alteration detection runs on every upload | No alterations detected (no false positive) ✓
Single document uploaded, so the multi-document flag should not fire | Multiple documents = false ✓
Clean document: no validation error or warning flags returned | No flags returned ✓
Live test results: SHFT-4950 License/Certification (2026-05-26)
4 real nurse license documents sent through POST /api/extract-compare
Document | What was being verified | Result
Illinois RN License
NursingLicense_Illinois-Registered-Professional-Nurse-RN-License_Alyissia_Sims_Exp-2026-05-31.png
All core fields extracted: license type, number, expiration, name, state | RN, number extracted, exp 05/31/2026, state IL ✓
"REGISTERED PROFESSIONAL NURSE" normalized to RN while keeping original label | "REGISTERED PROFESSIONAL NURSE" → RN ✓
Expiring within 30 days, so the warning flag should fire | EXPIRING_SOON warning returned ✓
Illinois CNA Registry Printout
NursingLicense_Illinois-Health-Care-Worker-Registry-CNA-Verification_Tamika-Nicole_Wilson_Active-2026-03-12.pdf
CNA fields extracted from state registry printout | CNA, number, exp 03/12/2026, issue date 05/11/2009, state IL ✓
PDF input handled correctly | PDF processed without error ✓
Ohio STNA Card
NursingLicense_Ohio-State-Tested-Nurse-Aide-CNA-Card_Misty_McKee_Issued-2025-08-20.jpg
Ohio "State Tested Nurse Aide" normalized to CNA | "State Tested Nurse Aide" → CNA ✓
OH CNA cards have no expiration, so only the issue date is returned | Issue date 08/20/2025, expiration empty (correct) ✓
Pennsylvania CNA Wallet Card
NursingLicense_Pennsylvania-CNA-Nurse-Aide-Registry-Card_Keniesha_Porter_Exp-2027-05-20.jpg
PA CNA card fields extracted | CNA, number, exp 05/20/2027, state PA ✓
PA notice validation should only fire on Notice of Enrollment, not a wallet card | PA notice validation = not triggered (correct) ✓
All 4 docs: correct type detected, no tampering flagged, no duplicate documents ✓
Live test results: SHFT-4952 TB Test (2026-05-26)
Document: TB_Chest-Xray-Radiology-Report_REDACTED_2025-04-28.png sent through POST /api/extract-compare
What was being verified | Result
Document correctly classified as a Chest X-Ray | Type = Chest X-Ray ✓
Chest X-Ray expiration auto-calculated as X-Ray date + 5 years | Expiration = 04/28/2030 (from X-Ray date 04/28/2025) ✓
Document contains "no active TB" clearance statement | "No acute pulmonary findings." detected ✓
No provider signature found, so the missing-signature warning should fire | NO_PROVIDER_NAME_OR_SIGNATURE warning returned ✓
Single document uploaded, so the multi-document flag should not fire | Multiple documents = false ✓
SHFT-4882 - AI Extraction Pipeline: Government ID (P1_upload_id)

What this demo covers:
Each criterion links to a clickable demo sample above.

Live API verification — 2026-05-26
Doc: GovID_Ohio-Drivers-License_REDACTED_Exp-2029-07-01.jpg  ·  POST /api/extract-compare
Criterion | Field returned | Value
P0B is hint only; AI classification is source of truth | claude.typeMatch | true (DRIVERS_LICENSE)
P5 DOB validated for impossible dates | claude.ageAtExtraction / impossibleAgeFlag | 37 / null
Alteration detection runs on every extraction | claude.hasVisibleAlterations | false (no false positive)
Multi-document detection | claude.multipleDocumentsDetected | false
Backend flag arrays (clean doc → null) | documentai.validationFlags / validationWarnings | null / null
Acceptance Criterion Status Test Scenario
Accepts US passport, driver's license, learner's permit, temporary license, resident card | Passport (REDACTED), Resident Card (REDACTED), State ID (REDACTED), DL (REDACTED, REDACTED)
Rejects non-accepted document types with error message | Dan: Pipeline detects wrong doc type and flags TYPE MISMATCH. Dmitry: Backend reads the flag and returns the rejection message to nurse: "We couldn't identify this document. Please upload a valid U.S. passport, driver's license, learner's permit, temporary license, or resident card."
Detects expired documents and returns error | Dan: Pipeline extracts expiration date and flags EXPIRED. Dmitry: Backend reads the flag and returns rejection: "This document is expired. Please upload a current, unexpired document."
Extracts all fields P2–P11 (first name, last name, suffix, DOB, expiration, issue date, document #, address, sex, middle name) | REDACTED IL State ID → all 13 fields at 100% agreement
P0B_id_type (nurse's pre-selected doc type) is a hint only; AI classification is source of truth, mismatch never blocks | Passport uploaded as "Driver's License" → AI detects passport, no block
P5 (date of birth) validated for impossible dates; under-18 triggers blocking flag | DOB validated on every extraction. ageAtExtraction computed (REDACTED IL DL → 29). Flags: AGE_OVER_100, AGE_UNDER_16 for manual review. Under-18 → blocking flag.
P11 (middle name) extracted where present | REDACTED MO DL → "REDACTED" extracted as middle name
Passport date formats (DD MMM YYYY) normalized to MM/DD/YYYY | REDACTED passport → dates normalized in output
Handles both portrait and landscape orientations | REDACTED GA DL (portrait), REDACTED Resident Card (landscape)
Confidence scores logged per field for threshold tuning (A6 = confidence config) | Every extraction shows per-field scores (visible in results table)
Alteration detection: font inconsistencies, pixel artifacts, copy-paste edges, white-out, photo tampering | Check runs on every extraction (hasVisibleAlterations field). Current test docs are real submissions and don't trigger it; purpose-built tampered samples are needed to validate the detection path
Readability check fails gracefully with message prompting clearer upload | Low-quality images return graceful error with re-upload prompt
OCR cross-validation as independent third check | REDACTED physical → OCR confirms/denies each field value independently
P2/P3 (first name / last name) cross-checked against A1/A2 (account first name / last name) for name mismatch detection | Dan: Extraction pipeline already returns P2_first_name & P3_last_name from the document. Dmitry: Backend needs to compare extracted P2/P3 against the nurse's account fields A1_first_name/A2_last_name, handle nickname/suffix/hyphenation fuzzy matching, and trigger the mismatch flag + resolution options (update account name or upload proof of name change via FL-09). Also needs to check ALT-1/ALT-2 (alternate names on file) before flagging.
Name mismatch resolution options (nurse can update account name to match ID, or upload proof of name change) | Backend/UI flow (Dmitry scope)
Document status flow (In Review → Verified → Needs Attention) | Backend state machine (Dmitry scope)
Re-upload replaces previous, re-extracts all fields | Backend persistence (Dmitry scope)
Duplicate detection (same doc uploaded by different accounts) | Backend dedup logic (Dmitry scope)
Status legend: Done in this demo  ·  Extraction ready, backend integration pending  ·  Backend/Dmitry scope
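The passport date criterion above (DD MMM YYYY normalized to MM/DD/YYYY) is a small string transformation. The sketch below is illustrative only: per the document, normalization happens in the extraction prompt, so a function like `normalizePassportDate` (a hypothetical name) would at most be a post-check. It handles only the plain three-letter English month form.

```typescript
const MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"];

// "01 JUL 2029" -> "07/01/2029"; returns null for anything it does not recognize.
function normalizePassportDate(raw: string): string | null {
  const m = /^(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})$/.exec(raw.trim());
  if (!m) return null;
  const month = MONTHS.indexOf(m[2].toUpperCase()) + 1;
  const day = Number(m[1]);
  if (month === 0 || day < 1 || day > 31) return null;
  const pad = (n: number) => String(n).padStart(2, "0");
  return `${pad(month)}/${pad(day)}/${m[3]}`;
}
```

Returning null instead of guessing keeps unrecognized formats visible to whatever review step follows.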
Beyond ticket scope (bonus):
• Dual-model comparison (Claude + Gemini) for higher confidence
• Google Cloud Vision OCR as independent cross-check
• Per-field OCR corroboration with confidence adjustment
• Image compression & optimization (sharp: auto-resize, EXIF rotation)
• SHA-256 caching with 24h TTL
• Rate limiting (separate AI + OCR tracking)
SHFT-4950 - AI Extraction Pipeline: License/Certification (L1_license_upload)

What this demo covers:
Each criterion links to a clickable demo sample above.

Live API verification — 2026-05-26 (4 docs × POST /api/extract-compare)
Doc | Criterion proven | Evidence
IL RN (REDACTED)
license-rn
L2/L3/L4/L5/L6 all extracted | licenseType=RN, L3=REDACTED, L4=05/31/2026, L5=REDACTED T REDACTED, L6=IL
License type normalization | "REGISTERED PROFESSIONAL NURSE" → RN (rawLicenseTypeLabel preserved)
Expiring-soon warning fires (<30 days) | documentai.validationWarnings: ["EXPIRING_SOON"]
IL CNA (REDACTED)
license-cna
CNA registry verification extraction | licenseType=CNA, L3=REDACTED, L4=03/12/2026, L8=05/11/2009, L6=IL
PDF input handled | .pdf processed without error
OH STNA (REDACTED)
license-cna
State-specific abbreviation normalized (STNA → CNA) | rawLicenseTypeLabel="State Tested Nurse Aide" → licenseType=CNA
L8 issue date extraction (OH CNA has no expiration) | issueDate=08/20/2025, expirationDate=null (correct: OH CNA card)
PA CNA (REDACTED)
license-cna
PA CNA registry card extraction | licenseType=CNA, L3=REDACTED, L4=05/20/2027, L6=PA
paCnaNoticeValidation only fires on Notice of Enrollment | paCnaNoticeValidation=null (correct: this is a card, not a Notice)
All 4: typeMatch=true, hasVisibleAlterations=false, multipleDocumentsDetected=false ✓
Acceptance Criterion Status Test Scenario
Accepts: state nursing board certificates, wallet cards, online verification printouts, screenshots of state board portals | IL RN License (REDACTED), PA CNA Wallet Card (REDACTED), IL Registry Search Printout (REDACTED), OH CNA Card (REDACTED)
Rejects non-accepted doc types with error message | Dan: Pipeline detects UNKNOWN documentType and flags TYPE MISMATCH. Also detects license subtype mismatch (e.g., selected LPN but uploaded CNA). Dmitry: Backend sends rejection message to nurse.
Extracts L2 (license type), L3 (license number), L4 (expiration date), L5 (full name), L6 (state), L8 (issue date) | All license demo samples → fields extracted with per-field confidence
State-specific license type abbreviations (GNA, STNA, TMA, CMA, CMT, QMA, LNA) normalized to internal codes | OH "State Tested Nurse Aide" (REDACTED) → normalized to CNA. rawLicenseTypeLabel preserves original.
License name formats vary by state; AI handles all common formats | Prompt handles "Last, First M.", "First Middle Last", "LAST, FIRST MIDDLE" etc.
Passport-style and non-standard date formats normalized to MM/DD/YYYY | All dates normalized in extraction prompt
Issue date (L8) identified regardless of labeling ("date issued", "effective date", "date of certification") | Prompt lists all common label variants; MO CNA (REDACTED) uses "date of completion"
Detects expired licenses (L4 in past) + warning message | Dan: Pipeline flags EXPIRED with warning text. Dmitry: Backend returns warning to nurse.
License expiring within 30 days returns warning | Dan: Pipeline flags EXPIRING_SOON with date. Dmitry: Backend returns warning to nurse.
Handles portrait and landscape orientations | Wallet cards (landscape) vs certificates (portrait) both handled
Confidence scores logged per field for threshold tuning | Every extraction shows confidencePerField in results
Alteration detection (font inconsistencies, pixel artifacts, white-out, tampering) | hasVisibleAlterations checked on every extraction
Readability check fails gracefully with re-upload prompt | Low-quality images return graceful error
BACKEND SCOPE (Dmitry)
License type matching (L2 vs A7): mismatch prompt + two options | Backend compares extracted L2 against A7_license from account creation
State matching (L6 vs A5): CNA must match, LPN/RN flexible | Backend compares L6 against A5_license_state
Name matching (L5 vs A1/A2) + ALT-N lookup + FL-09 flag | Backend cross-checks names, only proof-of-name-change path (no A1/A2 update from license)
CNA exceptions: AL/IL/NB no license number; IL/NB no expiration; AL calculated expiration (issue+24mo) | IL CNA (REDACTED) → stateException: IL_NO_EXPIRATION_REQUIRED. AL: calculatedExpiration = issueDate+2yr. Pipeline skips extraction for these fields per state rules.
PA CNA: validate Notice of Enrollment (3 text checks) | Pipeline extracts documentText and checks 3 required phrases (Commonwealth/Dept of Health, nurse aide training completion, Nurse Aide Registry enrollment). Returns paCnaNoticeValidation with per-phrase results.
Document status flow (In Review → Verified → Needs Attention) | Backend state machine
Re-upload replaces previous, re-extracts all fields | Backend persistence
Duplicate detection (CNA numbers unique across accounts) | Backend dedup logic
Nursys mapping (LPN→PN, RN→RN) + downstream verification | Separate ticket; consumes L2, L3, L6 from this pipeline
Status legend: Done in this demo  ·  Extraction ready, backend pending  ·  Backend/Dmitry scope
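The license type normalization criterion above maps a state-specific label to an internal code while keeping the original label. A minimal lookup-table sketch follows. The three aliases are the only mappings this document confirms; the table name, the function name, and the "UNKNOWN" fallback are assumptions for illustration.

```typescript
// Only mappings confirmed in this document; codes for GNA, TMA, CMA, CMT, QMA, LNA are not stated.
const LICENSE_TYPE_ALIASES: Record<string, string> = {
  "STATE TESTED NURSE AIDE": "CNA",
  "STNA": "CNA",
  "REGISTERED PROFESSIONAL NURSE": "RN",
};

function normalizeLicenseType(rawLabel: string) {
  // Collapse whitespace and case so "State  Tested nurse aide" still matches.
  const key = rawLabel.trim().toUpperCase().replace(/\s+/g, " ");
  return {
    licenseType: LICENSE_TYPE_ALIASES[key] ?? "UNKNOWN", // hypothetical fallback value
    rawLicenseTypeLabel: rawLabel, // original label preserved alongside the code
  };
}
```

Returning both values mirrors the licenseType / rawLicenseTypeLabel pair shown in the live API evidence, so a reviewer can always see what the document actually said.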
SHFT-4952 - AI Extraction Pipeline: TB Test (TB4_tb_upload)

What this demo covers:
Validated with automated test suite against prod API (2026-05-20).

Live API verification — 2026-05-26
Doc: TB_Chest-Xray-Radiology-Report_REDACTED_REDACTED_2025-04-28.png  ·  POST /api/extract-compare
Criterion | Field returned | Value
Chest X-ray classification | claude.typeMatch / testType | true / CHEST_XRAY
Chest X-ray expiration = performed date + 5 years | claude.calculatedExpiration | 04/28/2030 (from xrayDate 04/28/2025)
Detect "no active TB" / equivalent clearance phrase | claude.noActiveTbStatement | "No acute pulmonary findings."
Validate doctor's name/signature presence | documentai.validationWarnings | ["NO_PROVIDER_NAME_OR_SIGNATURE"] ✓ fires correctly
Multi-document detection | claude.multipleDocumentsDetected | false
Acceptance Criterion Status Test Scenario / Evidence
Accepts 4 TB doc types: 1-step skin, 2-step skin, blood test (QuantiFERON/T-Spot/IGRA), chest X-ray | REDACTED (1-step), REDACTED (2-step), REDACTED/REDACTED/Scott (QuantiFERON), REDACTED/Turner (X-ray): all classified correctly
Type mismatch: rejects docs that don't match nurse's TB3 selection | Test: skin_test selected + X-ray uploaded → typeMatch:false, "Selected PPD_SKIN_TEST but appears to be CHEST_XRAY"
1-step selected but 2-step detected → auto-upgrade silently, return TEST_TYPE=2-STEP | Auto-upgrade logic fires when step2DatePlaced detected on a 1-step classification. Warning: AUTO_UPGRADED_TO_2_STEP
2-step selected but only one set of dates → incomplete upload flag | Flag: 2_STEP_INCOMPLETE_MISSING_STEP_2 when step2DatePlaced is absent
1-step skin: extract placed+read dates, validate 48-72hr read window | REDACTED: placed 05/19, read 05/21 → readWindowHours:48, readWindowFlag:WITHIN_RANGE. Violation → READ_WINDOW_VIOLATION flag
2-step skin: Step 1 and Step 2 placed dates must be at least 1 week (≥7 days) apart | REDACTED: step1=02/16, step2=03/02 → stepsIntervalDays:14, stepsIntervalFlag:WITHIN_RANGE. <7 days → 2_STEP_INTERVAL_TOO_SHORT
Skin/blood test: calculate expiration = placed/result date + 1 year | REDACTED: read 05/21/2025 → calculatedExpiration:05/21/2026. REDACTED: result 11/06/2025 → calculatedExpiration:11/06/2026
Chest X-ray: calculate expiration = performed date + 5 years | REDACTED: xrayDate 04/28/2025 → calculatedExpiration:04/28/2030
Expired documents flagged (calculated expiration in past) | Dan: Pipeline flags EXPIRED. Dmitry: Backend returns error message to nurse.
Positive result returns manual review flag | overallResult:POSITIVE → flag:POSITIVE_RESULT. Routing to FL-07/Paused is Dmitry's scope.
Validate presence of doctor's name/signature/initials in "given by" field | hasPhysicianSignature + physicianName extracted. Missing both → warning: NO_PROVIDER_NAME_OR_SIGNATURE
Blood test: detect if document has actual laboratory values | REDACTED QuantiFERON: hasLabValues:true (IU/mL values present). Missing → warning: NO_LAB_VALUES_DETECTED
Blood test: reject Physical form or Immunization report (no lab values) | isPhysicalOrImmunizationForm:true → flag: PHYSICAL_OR_IMMUNIZATION_FORM_NOT_LAB_REPORT. Rejection message is Dmitry's scope.
Chest X-ray: detect "no active TB" or equivalent clearance phrase | REDACTED: noActiveTbStatement:"No acute pulmonary findings." Missing → warning: NO_ACTIVE_TB_STATEMENT_NOT_FOUND
Alteration detection (font inconsistencies, pixel artifacts, tampering) | hasVisibleAlterations checked on every extraction
Returns tags: test_type, steps_interval_days, expiration_date | All returned: testType, stepsIntervalDays (2-step only), calculatedExpiration
Confidence scores logged per field | overallConfidence + per-field scores in results
Handles multi-page uploads (2-step across two pages) | PDF uploads supported, all pages processed
BACKEND SCOPE (Dmitry)
Error messages to nurse (expired, read window violation, wrong doc type) | Backend reads flags and returns user-facing strings
Name matching (applicant name vs A1/A2) | Backend cross-checks extracted patientName against account
TB1 symptom screening interaction (positive result + symptoms → manual review) | Backend reads POSITIVE_RESULT flag + TB1 answers
Document status flow (In Review → Verified → Needs Attention) | Backend state machine
Facility enforcement (expiration_date + steps_interval_days for shift booking) | Backend/portal consumes tags from pipeline
TB3 update on auto-upgrade (1-step → 2-step) | Backend reads AUTO_UPGRADED_TO_2_STEP warning and updates TB3
Status legend: Done in this demo  ·  Extraction ready, backend pending  ·  Backend/Dmitry scope
Test docs used:
• TB_PPD-Skin-Test-Results_REDACTED_REDACTED_2025-05-21.jpeg (1-step)
• TB_Employee-Screening-Form_REDACTED_REDACTED_2026-03-02.jpg (2-step)
• TB_QuantiFERON-Gold-Plus-Blood-Test_REDACTED_2025-11-06.jpeg (blood)
• TB_Chest-Xray-Radiology-Report_REDACTED_REDACTED_2025-04-28.png (X-ray)
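The calculatedExpiration values reported in this section follow from the validity windows stated earlier (1 year for skin and blood tests, 5 years for a chest X-ray). A minimal sketch of that arithmetic, assuming MM/DD/YYYY input as used throughout the document: CHEST_XRAY and PPD_SKIN_TEST are identifiers that appear in this document, while BLOOD_TEST and the function name are hypothetical.

```typescript
type TbTestType = "PPD_SKIN_TEST" | "BLOOD_TEST" | "CHEST_XRAY";

const VALIDITY_YEARS: Record<TbTestType, number> = {
  PPD_SKIN_TEST: 1,
  BLOOD_TEST: 1, // hypothetical identifier for QuantiFERON / T-SPOT
  CHEST_XRAY: 5,
};

// Adds the validity window to an MM/DD/YYYY date and returns MM/DD/YYYY.
function calculatedExpiration(testType: TbTestType, dateMmDdYyyy: string): string {
  const [mm, dd, yyyy] = dateMmDdYyyy.split("/").map(Number);
  const d = new Date(Date.UTC(yyyy + VALIDITY_YEARS[testType], mm - 1, dd));
  const pad = (n: number) => String(n).padStart(2, "0");
  return `${pad(d.getUTCMonth() + 1)}/${pad(d.getUTCDate())}/${d.getUTCFullYear()}`;
}
```

Because the shifted date goes through `Date.UTC`, a Feb 29 source date rolls over to Mar 1 when the target year is not a leap year; whether the real pipeline does the same is not stated here.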
SHFT-5080 - AI Extraction Pipeline: MMR Immunity Proof

What this demo covers:
Six MMR proof types extracted to a shared contract (mmrDocType + mmrImmune) so Dmitry's backend can roll up immunity across multiple documents per nurse.

Acceptance Criterion Status Test Scenario / Evidence
Vaccine record: extract per-vaccine doses, lot, manufacturer, dates | VaccineRecordSchema returns vaccines[] with category (MMR/MMRV/MEASLES/MUMPS/RUBELLA/...), doses[], titerResult, immunityStatus
Lab titer report: extract POSITIVE/NEGATIVE/EQUIVOCAL per component | titerResult + titerDate per vaccine entry; combined MMR titer covers all three components when positive
Physical form with MMR section as proof | uploadType mmr-physical_form → PhysicalFormMmrSchema, mmrStatus (ADMINISTERED/IMMUNE_BY_TITER/UP_TO_DATE/DUE/DECLINED/EXEMPT) maps to mmrImmune
Medical exemption: clinician signature required | Flags: MEDICAL_EXEMPTION_MISSING_PHYSICIAN_SIGNATURE, MEDICAL_EXEMPTION_NON_CLINICIAN_SIGNER (token-based MD/DO/NP/PA/APRN/DNP/CNM/PhD check, so Pastor ≠ PA)
Religious exemption: nurse signature required | Flag: RELIGIOUS_EXEMPTION_MISSING_NURSE_SIGNATURE when hasPatientSignature is false
Declination form: rejected as exemption | DeclinationSchema → mmrDocType:"declination", mmrImmune:"unknown", flag:MMR_DECLINATION_REJECTED. Declinations never satisfy MMR.
Cross-document aggregation across multiple uploads | POST /api/mmr/aggregate consumes prior extractions, returns mmrImmune + per-component evidence (measles/mumps/rubella) + missingComponents[]
Incomplete titer: missing one or more components flagged | Flag: MMR_TITER_INCOMPLETE. Warning lists missing components by name.
Equivocal/indeterminate titer: manual review | Flag: EQUIVOCAL_TITER_MANUAL_REVIEW, mmrImmune:"unknown". Distinguished from outright NEGATIVE.
Newer titer overrides older one per component | Aggregator prefers later titerDate; falls back to status rank when dates missing
Applicant name cross-check (A1 first / A2 last) | POST /api/name/verify returns EXACT/FUZZY/PARTIAL/NO_MATCH/INSUFFICIENT_DATA with similarity 0–100. Handles accents, hyphens, OCR slips (LeAnn/LeeAnn).
Alteration detection on every MMR doc | hasVisibleAlterations + alterationDetails on all 6 schemas
Confidence scores per field | confidencePerField + overallConfidence on every schema
BACKEND SCOPE (Dmitry)
Persist mmrDocType + mmrImmune per upload, call aggregator | Backend stores extraction, replaces prior of same type, calls /api/mmr/aggregate when status needs recomputing
User-facing error messages from flags | Backend reads warnings[] + flags[] and surfaces to nurse
Manual-review routing for exemptions and equivocal titers | Backend reads MMR_EXEMPTION_ON_FILE, EQUIVOCAL_TITER_MANUAL_REVIEW
Status legend: Done in this demo  ·  Backend/Dmitry scope
Upload types: mmr-vaccine_record, mmr-titer, mmr-physical_form, mmr-medical_exemption, mmr-religious_exemption, mmr-declination
Endpoints: POST /api/extract, POST /api/mmr/aggregate, POST /api/name/verify
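The "token-based" clinician check mentioned for medical exemptions (Pastor ≠ PA) matters because a substring search for "PA" would match "Pastor". A minimal sketch of whole-token matching is below. The credential list is copied from the criterion above; the function name and the tokenization rules are assumptions.

```typescript
// Credential tokens listed in the medical exemption criterion above.
const CLINICIAN_CREDENTIALS = new Set(["MD", "DO", "NP", "PA", "APRN", "DNP", "CNM", "PHD"]);

// True when any whole token of the signer line is a recognized credential.
function signerIsClinician(signerLine: string): boolean {
  const tokens = signerLine
    .toUpperCase()
    .replace(/\./g, "") // "M.D." -> "MD"
    .split(/[^A-Z]+/) // split on spaces, commas, hyphens, digits
    .filter(Boolean);
  return tokens.some((t) => CLINICIAN_CREDENTIALS.has(t));
}
```

Whole-token matching still has false positives: a surname such as "Do" or a state abbreviation like "PA" in an address would match, so this kind of check should be applied only to the signature line and backed by manual review.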
Claude Sonnet 4.6 - built by Anthropic
Currently one of the most capable multimodal models for vision-based structured data extraction from documents.

Why we chose it for this pipeline:
Claude is the primary extraction engine. It reads the uploaded document image, identifies all relevant fields (name, DOB, license number, expiration, etc.), and returns structured JSON with per-field confidence scores. It is the most accurate model we tested for this use case.

Strengths:
• Highest accuracy for field extraction across all document types (gov ID, TB tests, physicals, nursing licenses)
• Best at detecting document alterations — catches font mismatches, pixel-level edits, inconsistent backgrounds, and photoshopped text
• Produces nuanced, realistic confidence scores (typically 88–98 range) rather than defaulting to 100
• Superior handwriting recognition — reads handwritten dates, signatures, and doctor notes more reliably
• Strong structured output — consistently returns valid JSON matching our Zod schemas
• Built-in safety guardrails — won't fabricate data it can't read; returns null with low confidence instead

Trade-offs:
• Slower than Gemini (~5–7 seconds per document vs ~3–5s)
• ~10x more expensive per document (~$0.01–0.03 vs ~$0.001–0.005)
• Occasionally over-cautious — may return lower confidence on legible fields

How it works in the pipeline:
1. Image is compressed & optimized (auto-resize, EXIF rotation via sharp)
2. Base64-encoded image + extraction prompt sent to Claude's vision API
3. Claude returns structured JSON with fields + confidence scores
4. Results validated against Zod schema + post-extraction rules (expiry, age, type match)
5. OCR cross-check adjusts confidence: +5 if OCR confirms, -15 if OCR disagrees
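
The post-extraction rules in step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the `ExtractedDates` and `RuleResult` shapes and the `checkDates` function are hypothetical, and the thresholds (expired, 30-day expiring window, under 18, future DOB or age over 120) are taken from the flag reference.

```typescript
// Hypothetical shapes for illustration; the real Zod schemas are not shown in this doc.
interface ExtractedDates {
  expirationDate?: string; // ISO date, e.g. "2026-01-31"
  dateOfBirth?: string;    // ISO date
}

interface RuleResult {
  validationFlags: string[];    // hard issues
  validationWarnings: string[]; // soft flags
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole years between two dates, counting a year only once the birthday has passed.
function ageInYears(dob: Date, today: Date): number {
  let age = today.getUTCFullYear() - dob.getUTCFullYear();
  const beforeBirthday =
    today.getUTCMonth() < dob.getUTCMonth() ||
    (today.getUTCMonth() === dob.getUTCMonth() && today.getUTCDate() < dob.getUTCDate());
  if (beforeBirthday) age -= 1;
  return age;
}

// Compares raw timestamps; a production version would likely normalize to calendar days.
// Unparseable dates produce NaN, which fails every comparison and so raises no flag here.
function checkDates(doc: ExtractedDates, today: Date): RuleResult {
  const result: RuleResult = { validationFlags: [], validationWarnings: [] };

  if (doc.expirationDate) {
    const daysLeft = (Date.parse(doc.expirationDate) - today.getTime()) / MS_PER_DAY;
    if (daysLeft < 0) result.validationFlags.push("EXPIRED");
    else if (daysLeft <= 30) result.validationWarnings.push("EXPIRING_SOON");
  }

  if (doc.dateOfBirth) {
    const dob = new Date(doc.dateOfBirth);
    const age = ageInYears(dob, today);
    if (dob.getTime() > today.getTime() || age > 120) result.validationFlags.push("IMPOSSIBLE_DOB");
    else if (age < 18) result.validationFlags.push("UNDER_18");
  }

  return result;
}
```

Passing `today` as a parameter keeps the rules deterministic and easy to test. In the real pipeline each rule would apply only to its document type (for example, EXPIRING_SOON to licenses and the DOB rules to gov IDs).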

Pricing breakdown:
• Input: $3.00 per 1M tokens (the image + prompt you send)
• Output: $15.00 per 1M tokens (the JSON response it generates)

Typical single extraction:
~1,500 input tokens × $3/1M = $0.0045
~400 output tokens × $15/1M = $0.006
Total: ~$0.01 per document

At scale: 10,000 docs/month ≈ $100–300/month
Gemini 3.5 Flash (built by Google)
A fast, cost-efficient multimodal model optimized for high-throughput tasks where speed and cost matter more than peak accuracy.

Why we chose it for this pipeline:
Gemini serves as the second opinion. When two independent models agree on a field value, our confidence in that extraction is very high. When they disagree, the field gets flagged for human review. This dual-model approach catches errors that a single model can miss.

Strengths:
• Very fast — typically 3–5 seconds per document
• ~10x cheaper than Claude per extraction
• Good accuracy on clearly printed text and standard document layouts
• Generous free tier (makes testing and development essentially free)
• High throughput — can process many documents quickly in batch scenarios

Trade-offs:
• Tends to give overconfident scores (95–100 for nearly everything, even ambiguous fields)
• Less reliable on handwritten forms, cursive, and poor-quality scans
• Misses some alteration cues that Claude catches (subtle font changes, compression artifacts)
• Occasionally misreads handwritten dates (e.g., "2025" as "2005")

How it works in the pipeline:
1. Same compressed image sent to Gemini's vision API in parallel with Claude
2. Gemini returns structured JSON matching the same Zod schema
3. Results compared field-by-field against Claude's output
4. Agreement/disagreement highlighted in the comparison view
5. OCR cross-check applied independently to Gemini's results too
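
The field-by-field comparison in step 3 could look roughly like this. It is a sketch only: the flat `Extraction` map, the `compareExtractions` function, and the normalization rule (trim, ignore case) are assumptions made for illustration, not the pipeline's actual comparison logic.

```typescript
// Hypothetical shape: each model returns a flat map of field name -> extracted value (or null).
type Extraction = Record<string, string | null>;

interface FieldComparison {
  field: string;
  claude: string | null;
  gemini: string | null;
  agree: boolean;
}

// Assumption: values are compared after trimming and upper-casing,
// so "Sonya " and "SONYA" count as agreement.
function normalizeValue(value: string): string {
  return value.trim().toUpperCase();
}

function compareExtractions(claude: Extraction, gemini: Extraction): FieldComparison[] {
  // Union of field names from both models, sorted for stable output.
  const fields = Array.from(new Set(Object.keys(claude).concat(Object.keys(gemini)))).sort();

  return fields.map((field) => {
    const c = claude[field] ?? null;
    const g = gemini[field] ?? null;
    // A field counts as agreed only when both models read a value and the values match.
    const agree = c !== null && g !== null && normalizeValue(c) === normalizeValue(g);
    return { field, claude: c, gemini: g, agree };
  });
}
```

Fields where `agree` is false are the ones that would be highlighted in the comparison view and routed to human review.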

Pricing breakdown:
• Input: $0.30 per 1M tokens
• Output: $2.50 per 1M tokens

Typical single extraction:
~1,500 input tokens × $0.30/1M = $0.00045
~400 output tokens × $2.50/1M = $0.001
Total: ~$0.001–0.005 per document

At scale: 10,000 docs/month ≈ $10–50/month
Google Cloud Vision API: TEXT_DETECTION
Traditional OCR (Optical Character Recognition), not a generative AI model. This is the same engine that powers Google Lens, Google Photos text search, and Google Drive's automatic PDF text extraction.

What is OCR?
OCR stands for Optical Character Recognition. It scans an image pixel-by-pixel to detect and extract raw text using pattern matching and character recognition. Unlike AI models, OCR doesn't "understand" the document — it simply finds every piece of text in the image and returns it as a plain string. It doesn't know what a "first name" or "expiration date" is; it just reads characters.

Why we use it in this pipeline:
OCR serves as an independent third cross-check alongside both AI models. If Claude extracts firstName = "SONYA" and OCR also found "SONYA" in the raw text, we have strong evidence that value is correct. If the AI extracted something OCR can't find anywhere in the document, that's a red flag — the AI may have hallucinated or misread.

How confidence adjustment works:
• AI extracts a field value → we search the OCR raw text for that value
OCR ✓ Found in OCR text → confidence +5 points (confirmed by independent source)
OCR ✗ Not found in OCR text → confidence -15 points (flagged for human review)
• This adjustment is applied per-field, independently for each AI model's results
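
The adjustment above can be sketched as a small function. The +5 and -15 values come from this doc. The function name, the case- and whitespace-insensitive matching, and the clamp to 0–100 are assumptions for illustration.

```typescript
const OCR_CONFIRMED_BONUS = 5;  // from the doc: +5 when OCR finds the value
const OCR_MISSING_PENALTY = 15; // from the doc: -15 when OCR does not

// Assumption: matching ignores case and collapses runs of whitespace.
function normalizeText(text: string): string {
  return text.toUpperCase().replace(/\s+/g, " ").trim();
}

interface AdjustedField {
  confidence: number;
  needsReview: boolean;
}

function adjustConfidence(value: string, confidence: number, ocrText: string): AdjustedField {
  const needle = normalizeText(value);
  // An empty value can never be "confirmed" by OCR.
  const found = needle.length > 0 && normalizeText(ocrText).indexOf(needle) !== -1;
  const adjusted = found ? confidence + OCR_CONFIRMED_BONUS : confidence - OCR_MISSING_PENALTY;
  // Assumption: scores are clamped to the 0..100 range.
  return { confidence: Math.min(100, Math.max(0, adjusted)), needsReview: !found };
}
```

A plain substring search is the simplest possible match. Dates and other fields that can be printed in several formats would need format-aware matching before this check is meaningful.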

Why run OCR if AI already reads the image?
AI models can "hallucinate": confidently output text that isn't actually in the document. OCR is deterministic (the same image always produces the same text), so it acts as an independent check on what the AI reports, although OCR itself can still misread characters. The combination of AI understanding + OCR verification is more reliable than either alone.

Performance:
• Latency: ~0.3–0.5 seconds (runs in parallel with AI, adds zero wait time)
• OCR fires simultaneously with Claude and Gemini — the total request time is determined by the slowest AI model, not the sum
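
A rough sketch of why parallel dispatch adds no wait time: with `Promise.all`, the elapsed time tracks the slowest call rather than the sum. `fakeCall` is a stand-in that only simulates latency, and `runAll` is a hypothetical helper, not the pipeline's actual dispatcher.

```typescript
// Stand-in for a provider call: resolves with `name` after `ms` milliseconds.
function fakeCall(name: string, ms: number): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve(name), ms));
}

interface RunSummary {
  results: string[];
  elapsedMs: number;
}

// Starts every call immediately, then waits for all of them to finish.
async function runAll(calls: Array<() => Promise<string>>): Promise<RunSummary> {
  const start = Date.now();
  const results = await Promise.all(calls.map((call) => call()));
  return { results, elapsedMs: Date.now() - start };
}
```

With simulated latencies of 60 ms, 40 ms, and 5 ms, `elapsedMs` comes out near 60 rather than 105. `Promise.all` returns results in input order regardless of which call finishes first, and it rejects as soon as any call fails, so a real dispatcher that wants partial results would use `Promise.allSettled` instead.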

Pricing:
• $1.50 per 1,000 images processed
• First 1,000 images/month are FREE (Google's free tier)
• Per document: ~$0.0015

At scale: 10,000 docs/month ≈ $13.50/month (after free tier)
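
The at-scale figure follows from subtracting the free tier before applying the per-image rate. A small sketch, assuming one image per document and simple linear billing (the function name is hypothetical):

```typescript
const PRICE_PER_1000_IMAGES_USD = 1.5; // from the pricing above
const FREE_IMAGES_PER_MONTH = 1000;    // from the pricing above

// Assumption: cost scales linearly per image once the free tier is used up.
function monthlyOcrCost(imagesPerMonth: number): number {
  const billable = Math.max(0, imagesPerMonth - FREE_IMAGES_PER_MONTH);
  return (billable / 1000) * PRICE_PER_1000_IMAGES_USD;
}

console.log(monthlyOcrCost(10000)); // 9,000 billable images at $1.50 per 1,000
```
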
Google Document AI: Google Cloud's specialized document processing platform.
Pre-trained processors built for specific document types (IDs, forms, invoices) — not a general-purpose LLM. Returns structured key-value pairs with bounding boxes and confidence scores.

Why we chose it for this pipeline:
Document AI is the third independent extractor alongside Claude and Gemini. Because it's purpose-built for documents (not generative), it tends to be deterministic, fast, and excellent at machine-readable layouts. When all three (Claude + Gemini + Doc AI) agree on a field, our confidence in that value is extremely high.

Strengths:
• Purpose-built per document category — separate model for IDs vs forms vs general text
• Returns spatial layout (bounding boxes), so we can see exactly where each value was read from
• Deterministic — same image always returns same output (unlike LLMs)
• Identity Document Proofing processor returns tampering signals (digital-alteration scores, "evidence inconclusive" results, etc.) that LLMs sometimes miss
• Strong on structured/printed text — driver's licenses, official forms, lab reports

Trade-offs:
• Slowest of the three (~10–14s vs Claude ~6s, Gemini ~4s)
• Weaker on free-form handwriting and unusual layouts than Claude
• Each document category needs its own processor (more setup than a single LLM call)
• Field labels come back raw — we map them to our schema in code (e.g., "Family Name" → lastName)

Processors we use:
Category | Processor | Purpose
government_id | US Driver License Parser | Pre-trained on US DL/state ID layouts; extracts P2–P11 fields directly
license | Form Parser | Generic form K/V extraction for nursing license certificates & cards
tb_test / physical | OCR Processor | Better at handwriting (TB skin test dates, physician notes)
All gov IDs | Identity Document Proofing | Runs in parallel; returns tampering/alteration scores feeding the POSSIBLE_ALTERATION flag

How it works in the pipeline:
1. Same compressed image dispatched to Document AI in parallel with Claude + Gemini
2. Processor picked by upload category (see table above)
3. Raw entities returned with bounding boxes & per-field confidence
4. Field labels mapped to our schema (e.g., DocAI "Date Of Birth" → our dateOfBirth)
5. Results compared against Claude/Gemini for agreement scoring; populates documentai.validationFlags / validationWarnings
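
The label mapping in step 4 might be sketched like this. Only the two label pairs mentioned in this doc are included. The `DocAiEntity` shape is a simplified, hypothetical stand-in for the processor response, and `mapDocAiEntities` is an illustrative name.

```typescript
// Simplified, hypothetical view of one entity returned by a processor.
interface DocAiEntity {
  label: string;
  value: string;
  confidence: number;
}

// Raw label -> schema field. Only the examples given in this doc are listed.
const LABEL_TO_FIELD = new Map<string, string>([
  ["Family Name", "lastName"],
  ["Date Of Birth", "dateOfBirth"],
]);

interface MappedResult {
  fields: Record<string, { value: string; confidence: number }>;
  unmappedLabels: string[]; // kept so unknown labels can be logged rather than silently dropped
}

function mapDocAiEntities(entities: DocAiEntity[]): MappedResult {
  const result: MappedResult = { fields: {}, unmappedLabels: [] };
  for (const entity of entities) {
    const field = LABEL_TO_FIELD.get(entity.label);
    if (field === undefined) {
      result.unmappedLabels.push(entity.label);
      continue;
    }
    result.fields[field] = { value: entity.value, confidence: entity.confidence };
  }
  return result;
}
```

Collecting unmapped labels makes it easier to notice when a processor starts returning a label the mapping table does not cover yet.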

Pricing breakdown:
• Form Parser / OCR / DL Parser: $0.030 per page (first 1,000 pages/month FREE)
• Identity Document Proofing: $0.10 per request

Typical single extraction (gov ID):
1 DL Parser call + 1 ID Proofing call = $0.03 + $0.10 = ~$0.13 per gov ID
Non-ID docs (license/TB/physical): ~$0.03 per doc

At scale: 10,000 docs/month — mix-dependent, ~$300–$1,300/month
Note: Doc AI is the most expensive of the three providers. It is used only because the independent third opinion catches errors the LLMs miss.
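
The mix-dependent range can be reproduced with a short calculation. This sketch uses the per-call prices listed above, assumes one page per document, and ignores the free tier, as the range itself does. The function name is hypothetical, and amounts are kept in cents to avoid floating-point rounding.

```typescript
const DL_PARSER_CENTS = 3;    // $0.03 per page
const ID_PROOFING_CENTS = 10; // $0.10 per request
const NON_ID_DOC_CENTS = 3;   // $0.03 per page (Form Parser / OCR)

// Returns the monthly cost in USD for a given document mix.
function monthlyDocAiCost(govIdDocs: number, otherDocs: number): number {
  const cents =
    govIdDocs * (DL_PARSER_CENTS + ID_PROOFING_CENTS) + otherDocs * NON_ID_DOC_CENTS;
  return cents / 100;
}

console.log(monthlyDocAiCost(0, 10000));    // no gov IDs: low end of the range
console.log(monthlyDocAiCost(10000, 0));    // all gov IDs: high end of the range
console.log(monthlyDocAiCost(2500, 7500));  // a 25% gov ID mix
```
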
What are tokens?
Tokens are the unit AI models use to measure text. Think of them as "word pieces." One token is roughly 4 characters or about ¾ of an English word. The word "extraction" is 2 tokens. A full sentence is typically 15–25 tokens.

Why do tokens matter?
AI model pricing is based entirely on tokens — both the tokens you send (input) and the tokens the model generates back (output). Understanding tokens helps you estimate costs and optimize usage.

Input tokens — what you send TO the model:
• The document image (~1,000–2,000 tokens depending on resolution)
• The extraction prompt/instructions (~200 tokens)
• The schema definition telling the model what fields to extract (~100 tokens)
• Total per request: ~1,300–2,300 input tokens

Output tokens — what the model sends BACK:
• The structured JSON with all extracted fields and confidence scores
• Typically ~200–500 tokens depending on document type
• Output tokens cost more than input tokens: 5x for Claude ($15.00 vs $3.00 per 1M) and about 8x for Gemini ($2.50 vs $0.30 per 1M) at the rates listed here

Why does output cost more?
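Input tokens (the image and prompt) can be processed together in a single pass. Output tokens are generated one at a time, and each new token requires another pass through the model, which is the main reason providers price output higher. The JSON the model generates is where its analysis shows up: identified fields, text it read (including handwriting), confidence scores, and alteration checks.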

Worked example — 1 driver's license:

Claude Sonnet 4.6:
Input: ~1,500 tokens × $3.00/1M = $0.0045
Output: ~400 tokens × $15.00/1M = $0.006
Total: ~$0.01

Gemini 3.5 Flash:
Input: ~1,500 tokens × $0.30/1M = $0.00045
Output: ~400 tokens × $2.50/1M = $0.001
Total: ~$0.0015

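Compare mode (both + OCR): ~$0.0105 + ~$0.0015 + ~$0.0015 ≈ $0.013
Claude is roughly 7x pricier in this example (between 6x and 10x depending on the input/output token mix) but more accurate. The compare mode runs both for maximum confidence.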
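
The per-document arithmetic in the worked example can be reproduced with a few lines. The prices and token counts are the ones quoted above. The `ModelPricing` type and `costPerDocument` function are illustrative names, not part of the pipeline.

```typescript
interface ModelPricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Rates as listed in this doc.
const CLAUDE_PRICING: ModelPricing = { inputPerMillionUsd: 3.0, outputPerMillionUsd: 15.0 };
const GEMINI_PRICING: ModelPricing = { inputPerMillionUsd: 0.3, outputPerMillionUsd: 2.5 };

// Cost in USD of one request given its input and output token counts.
function costPerDocument(pricing: ModelPricing, inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens * pricing.inputPerMillionUsd + outputTokens * pricing.outputPerMillionUsd) / 1e6
  );
}

const claudeCost = costPerDocument(CLAUDE_PRICING, 1500, 400);
const geminiCost = costPerDocument(GEMINI_PRICING, 1500, 400);
console.log(claudeCost.toFixed(5), geminiCost.toFixed(5), (claudeCost / geminiCost).toFixed(1));
```

Because the input and output rates differ by different multiples between the two models, the cost ratio shifts with the token mix: image-heavy requests with short JSON responses widen the gap, while long responses narrow it.
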
Compare All
Final Result
Claude Only
Gemini Only
Doc AI Only
Runs Claude + Gemini + OCR in parallel — shows side-by-side comparison
Drop file here or browse
JPG, PNG, or PDF — max 20 MB