Data Quality Checks

Overview

After scanned forms are submitted and processed through the OCR pipeline, a second layer of automated data quality checks runs server-side against the full dataset. These checks are defined in mw_tb_clean__tb_reg_checks.sql and execute as part of the dbt pipeline on every data refresh.

Note

How this differs from in-phone validation

In-Phone Validation Pipeline DQA Checks
When At photo capture, before submission After submission, on the full dataset
Who sees it Field worker, immediately Data manager, in review dashboard
Can prevent bad data entering? βœ… Yes β€” worker corrects and retakes photo ❌ No β€” data already submitted
Scope 26 fields, 5 rows per page All fields, all records
Triggered by ScanForm mobile app dbt pipeline run

Each check produces one row per failure. Records that pass all checks produce no rows. The output table (mw_tb_clean__tb_reg_checks) is the single source of truth for data quality review and is consumed by the DQA dashboard.


Severity levels

Check severity levels
Icon Severity Effect on Record Typical Use Case
β›” Error Error Record is flagged AND excluded from the clean dataset Data that cannot be meaningfully interpreted downstream (e.g. unparseable registration date, clinical contradiction)
⚠️ Warning Warning Record is flagged but retained in the clean dataset Data issues that should be reviewed but do not invalidate the entire record (e.g. unparseable optional date, multiple ovals marked)

Check catalogue

Five check blocks are defined, unioned into a single output table. Each is described below.

πŸ”’ Integer conversion checks

CTE: convert_to_int_checks
Macro: batch_check_convert_to_int
Severity: ⚠️ Warning
Tag: invalid-number-check

Checks that OCR-produced text values for digit-box fields can be parsed as integers. A Warning row is generated for each field on each record where the raw OCR text cannot be converted.

Integer conversion checks (Warning)
5 fields checked; one Warning row generated per unparseable value per record
Variable Label Expected Range Notes
participant_age Age Positive integer Age in years or months depending on age_unit
smear_result Smear Result (baseline) 0–9 Single digit count; combined with smear_plus oval downstream
smear_result_2 Smear Result – 2 month 0–9 Follow-up smear
smear_result_5 Smear Result – 5 month 0–9 Follow-up smear; note source variable has typo smea_result_5
smear_result_6 Smear Result – 6 month 0–9 Follow-up smear

πŸ“… Date conversion checks β€” Warnings

CTE: convert_to_date_checks
Macro: batch_check_convert_to_date
Severity: ⚠️ Warning
Tag: invalid-date-check

Checks that OCR text for optional date fields can be parsed as a valid date. Records with unparseable optional dates are flagged but retained in the clean dataset.

Date conversion checks β€” Warning (9 fields)
Optional date fields; record is retained even if date is invalid
Variable Label Expected Format
lam_date LAM Date dd/mm/yy
smear_date Smear Date (baseline) dd/mm/yy
smear_date_2 Smear Date – 2 month dd/mm/yy
smear_date_5 Smear Date – 5 month dd/mm/yy
xpert_date Xpert Date dd/mm/yy
culture_date Culture Date dd/mm/yy
xray_date XRay Date dd/mm/yy
dst_date DST Date dd/mm/yy
cpt_start_date CPT Start Date dd/mm/yy

πŸ“… Date conversion checks β€” Errors

CTE: convert_to_date_checks_errors
Macro: batch_check_convert_to_date
Severity: β›” Error
Tag: invalid-date-check

Checks that OCR text for key date fields can be parsed as a valid date. Records where these fields are unparseable are excluded from the clean dataset entirely, because these dates are essential for record identification or outcome analysis.

Date conversion checks β€” Error (3 fields)
Key date fields; unparseable value causes record exclusion from clean data
Variable Label Expected Format Why Error (not Warning)
tb_reg_date Registration Date dd/mm/yyyy Primary temporal anchor for the patient record; used in all joins, filters and DQA metadata
outcome_date Outcome Date dd/mm/yy Core outcome field; an unparseable outcome date makes the treatment outcome unanalysable
smear_date_6 Smear Date – 6 month dd/mm/yy 6-month smear is the key end-of-treatment bacteriological assessment
Important

Impact of date Error checks
Any record where tb_reg_date, outcome_date, or smear_date_6 cannot be parsed as a valid date will appear in the checks table with severity = 'Error' and will be filtered out of mw_tb_clean__tb_reg and all downstream refinery and intermediary models.


β˜‘οΈ Multiple-answer selection checks

CTE: melting_checks
Macro: batch_check_melt_columns
Severity: ⚠️ Warning
Tag: multiple-answer-selected-check

ScanForm exports each oval option as a separate boolean column (true/false). For single-select questions, exactly one column should be true. This check flags records where more than one option is marked for the same question β€” indicating either a scanning error or that the field worker crossed multiple ovals.

A Warning row is generated per affected question per record.

Multiple-answer selection checks β€” Warning (22 fields)
Fires when >1 oval is marked for a single-select question
Clean Variable Label Options Raw Boolean Columns Checked
sex Sex 2 male, female
risk_group Occupation / Risk Group 8 miners, prisoner, ex_miners, hh_contact, mining_commun, migrant, health_care_w, risk_other
referral Referral 7 community, ncd, private, walk_in, opd, ward, art_referral
treatment_regimen Treatment Regimen 4 regimen_stan, regimen_resi1, regimen_resi2, regimen_othe
disease_site Site of Disease 2 pulmonary, extra_pulmonary
treatment_history Treatment History 6 new, relap, ralf, fail, other, unknown
how_diagnosed Bact/Cl Diagnosed 2 dx_bacteria, dx_clinically
lam_result LAM Result 2 mtb_detected, mtb_not_detected
xpert_result Xpert Result 2 mtb_detected, mtb_not_detected
culture_result Culture Result 2 mtb_detected, mtb_not_detecte
xray_result XRay Result 2 mtb_detected, mtb_not_detected
dst_result DST Result 3 dst_res, dst_non_res, dst_indeterminate
second_line Moved to 2nd Line Tx 2 yes, no
treat_outcome Treatment Outcome 6 cured, completed, dead, failure, lost_followup, not_evaluated
adherence_support DOT and Adherence Support 4 guardian, health_sur_a, volunteer, healthcare_w
hiv_status HIV Status 3 positive, negative, unknown
hiv_test_time HIV Test Time 2 hiv_test_before_, hiv_test_after_t
arv_status ARV Start Time 3 before_tb, while_on_tb, not_on_tb
cpt_status CPT Status 2 yes, no
enrolled_prp Enrolled into PRP 2 yes, no
ptld_how_diagnosed PTLD Diagnosis Method 4 spiro, res_symptom, chest_xray, reduced_exe
prp_outcome PRP Outcome 5 prp_completed, prp_died, prp_lost_followup, prp_active_TB, prp_excluded

🧬 Clinical logic check

CTE: drug_resistant_standard_regimen_check
Type: Custom SQL
Severity: β›” Error
Tag: clinical-logic-check

This check implements a clinical contradiction rule: a patient cannot simultaneously have a DST result of RIF Resistant Detected and be prescribed the standard first-line regimen (2RHZE/4RH). If both conditions are true on the same record, the record is flagged as an Error and excluded from the clean dataset.

Important

Check message:
β€œPatient marked as drug resistant but prescribed standard treatment regimen”

Trigger condition:
dst_result_dst_res = 'true' AND treatment_regimen_regimen_stan = 'true'

Clinical logic check β€” Error
Drug-resistant result combined with standard first-line regimen
Field 1 Label 1 Value 1 Field 2 Label 2 Value 2 Severity Outcome
dst_result_dst_res DST Result RIF Resistant Detected treatment_regimen_regimen_stan Treatment Regimen 2RHZE/4RH (standard) Error Record excluded from clean dataset

All checks at a glance

Complete check inventory β€” mw_tb_clean__tb_reg_checks
5 check blocks; 41 total field-check combinations
CTE Name Check Type Tag Severity Fields Fires When
convert_to_int_checks Integer conversion invalid-number-check Warning 5 OCR text cannot be parsed as integer
convert_to_date_checks Date conversion invalid-date-check Warning 9 OCR text cannot be parsed as date (optional fields)
convert_to_date_checks_errors Date conversion invalid-date-check Error 3 OCR text cannot be parsed as date (key fields)
melting_checks Multiple-answer selection multiple-answer-selected-check Warning 22 More than one oval marked for a single-select question
drug_resistant_standard_regimen_check Clinical logic clinical-logic-check Error 2 DST = RIF Resistant AND regimen = 2RHZE/4RH simultaneously

Checks output table structure

Every check block produces rows in a common schema. These are unioned into the final mw_tb_clean__tb_reg_checks table.

Output table schema β€” mw_tb_clean__tb_reg_checks
One row per check failure per record; passing records produce no rows
Column Type Description
Entry_URL text URL uniquely identifying the ScanForm submission
Page_ID text Identifier for the scanned page
Submission_Date date Date the form was submitted to ScanForm
tb_reg_date date Patient registration date (parsed); used for temporal reporting
book_number text Register book number extracted from Page_ID
page_number text Page number within the book
Record_number integer Patient row number (1–5) within the page
record_identifier text Surrogate key: Page_ID + '-' + Record_number
check_message text Human-readable description of the specific check failure
Submitted_by text User account that submitted the scan
implementation_error boolean True if this row is also flagged as an implementation error
discard_page boolean True if the page was marked as discarded
discard_row boolean True if the patient row was marked as discarded
severity text 'Error' or 'Warning' (see severity levels above)
tags array List of check category tags, e.g. ['invalid-date-check']

Relationship to in-phone validation

In-phone validation vs. pipeline DQA checks
Aspect πŸ“± In-Phone Validation πŸ—„οΈ Pipeline DQA Checks
Timing Instant, at photo capture After submission, on next pipeline run
Audience Field worker Data manager / supervisor
Can block submission? Yes β€” worker must correct and retake No β€” data already in system
Fields covered 26 key fields 41 fields across 5 check blocks
Check types Required fill, exactly one oval, at most one oval Integer parse, date parse, multi-select, clinical logic
Output On-screen alert in ScanForm app Row in mw_tb_clean__tb_reg_checks table
Effect on clean data Prevents bad data entering Error severity: excludes record; Warning: flags only
Overlap Checks presence and single-selection only Checks parseable values and clinical consistency
Tip

The two validation layers are complementary, not redundant.

In-phone validation catches structural problems (missing required fields, multiple ovals) at the moment of capture β€” when correction is still possible. Pipeline DQA checks verify that OCR outputs are mathematically and clinically coherent after the fact, and provide the permanent audit trail that data managers use to assess data quality over time.