| Check severity levels | |||
| Icon | Severity | Effect on Record | Typical Use Case |
|---|---|---|---|
| β Error | Error | Record is flagged AND excluded from the clean dataset | Data that cannot be meaningfully interpreted downstream (e.g. unparseable registration date, clinical contradiction) |
| β οΈ Warning | Warning | Record is flagged but retained in the clean dataset | Data issues that should be reviewed but do not invalidate the entire record (e.g. unparseable optional date, multiple ovals marked) |
Data Quality Checks
Overview
After scanned forms are submitted and processed through the OCR pipeline, a second layer of automated data quality checks runs server-side against the full dataset. These checks are defined in mw_tb_clean__tb_reg_checks.sql and execute as part of the dbt pipeline on every data refresh.
How this differs from in-phone validation
| In-Phone Validation | Pipeline DQA Checks | |
|---|---|---|
| When | At photo capture, before submission | After submission, on the full dataset |
| Who sees it | Field worker, immediately | Data manager, in review dashboard |
| Can prevent bad data entering? | β Yes β worker corrects and retakes photo | β No β data already submitted |
| Scope | 26 fields, 5 rows per page | All fields, all records |
| Triggered by | ScanForm mobile app | dbt pipeline run |
Each check produces one row per failure. Records that pass all checks produce no rows. The output table (mw_tb_clean__tb_reg_checks) is the single source of truth for data quality review and is consumed by the DQA dashboard.
Severity levels
Check catalogue
Five check blocks are defined, unioned into a single output table. Each is described below.
π’ Integer conversion checks
CTE: convert_to_int_checks
Macro: batch_check_convert_to_int
Severity: β οΈ Warning
Tag: invalid-number-check
Checks that OCR-produced text values for digit-box fields can be parsed as integers. A Warning row is generated for each field on each record where the raw OCR text cannot be converted.
| Integer conversion checks (Warning) | |||
| 5 fields checked; one Warning row generated per unparseable value per record | |||
| Variable | Label | Expected Range | Notes |
|---|---|---|---|
| participant_age | Age | Positive integer | Age in years or months depending on age_unit |
| smear_result | Smear Result (baseline) | 0β9 | Single digit count; combined with smear_plus oval downstream |
| smear_result_2 | Smear Result β 2 month | 0β9 | Follow-up smear |
| smear_result_5 | Smear Result β 5 month | 0β9 | Follow-up smear; note source variable has typo smea_result_5 |
| smear_result_6 | Smear Result β 6 month | 0β9 | Follow-up smear |
π Date conversion checks β Warnings
CTE: convert_to_date_checks
Macro: batch_check_convert_to_date
Severity: β οΈ Warning
Tag: invalid-date-check
Checks that OCR text for optional date fields can be parsed as a valid date. Records with unparseable optional dates are flagged but retained in the clean dataset.
| Date conversion checks β Warning (9 fields) | ||
| Optional date fields; record is retained even if date is invalid | ||
| Variable | Label | Expected Format |
|---|---|---|
| lam_date | LAM Date | dd/mm/yy |
| smear_date | Smear Date (baseline) | dd/mm/yy |
| smear_date_2 | Smear Date β 2 month | dd/mm/yy |
| smear_date_5 | Smear Date β 5 month | dd/mm/yy |
| xpert_date | Xpert Date | dd/mm/yy |
| culture_date | Culture Date | dd/mm/yy |
| xray_date | XRay Date | dd/mm/yy |
| dst_date | DST Date | dd/mm/yy |
| cpt_start_date | CPT Start Date | dd/mm/yy |
π Date conversion checks β Errors
CTE: convert_to_date_checks_errors
Macro: batch_check_convert_to_date
Severity: β Error
Tag: invalid-date-check
Checks that OCR text for key date fields can be parsed as a valid date. Records where these fields are unparseable are excluded from the clean dataset entirely, because these dates are essential for record identification or outcome analysis.
| Date conversion checks β Error (3 fields) | |||
| Key date fields; unparseable value causes record exclusion from clean data | |||
| Variable | Label | Expected Format | Why Error (not Warning) |
|---|---|---|---|
| tb_reg_date | Registration Date | dd/mm/yyyy | Primary temporal anchor for the patient record; used in all joins, filters and DQA metadata |
| outcome_date | Outcome Date | dd/mm/yy | Core outcome field; an unparseable outcome date makes the treatment outcome unanalysable |
| smear_date_6 | Smear Date β 6 month | dd/mm/yy | 6-month smear is the key end-of-treatment bacteriological assessment |
Impact of date Error checks
Any record where tb_reg_date, outcome_date, or smear_date_6 cannot be parsed as a valid date will appear in the checks table with severity = 'Error' and will be filtered out of mw_tb_clean__tb_reg and all downstream refinery and intermediary models.
βοΈ Multiple-answer selection checks
CTE: melting_checks
Macro: batch_check_melt_columns
Severity: β οΈ Warning
Tag: multiple-answer-selected-check
ScanForm exports each oval option as a separate boolean column (true/false). For single-select questions, exactly one column should be true. This check flags records where more than one option is marked for the same question β indicating either a scanning error or that the field worker crossed multiple ovals.
A Warning row is generated per affected question per record.
| Multiple-answer selection checks β Warning (22 fields) | |||
| Fires when >1 oval is marked for a single-select question | |||
| Clean Variable | Label | Options | Raw Boolean Columns Checked |
|---|---|---|---|
| sex | Sex | 2 | male, female |
| risk_group | Occupation / Risk Group | 8 | miners, prisoner, ex_miners, hh_contact, mining_commun, migrant, health_care_w, risk_other |
| referral | Referral | 7 | community, ncd, private, walk_in, opd, ward, art_referral |
| treatment_regimen | Treatment Regimen | 4 | regimen_stan, regimen_resi1, regimen_resi2, regimen_othe |
| disease_site | Site of Disease | 2 | pulmonary, extra_pulmonary |
| treatment_history | Treatment History | 6 | new, relap, ralf, fail, other, unknown |
| how_diagnosed | Bact/Cl Diagnosed | 2 | dx_bacteria, dx_clinically |
| lam_result | LAM Result | 2 | mtb_detected, mtb_not_detected |
| xpert_result | Xpert Result | 2 | mtb_detected, mtb_not_detected |
| culture_result | Culture Result | 2 | mtb_detected, mtb_not_detecte |
| xray_result | XRay Result | 2 | mtb_detected, mtb_not_detected |
| dst_result | DST Result | 3 | dst_res, dst_non_res, dst_indeterminate |
| second_line | Moved to 2nd Line Tx | 2 | yes, no |
| treat_outcome | Treatment Outcome | 6 | cured, completed, dead, failure, lost_followup, not_evaluated |
| adherence_support | DOT and Adherence Support | 4 | guardian, health_sur_a, volunteer, healthcare_w |
| hiv_status | HIV Status | 3 | positive, negative, unknown |
| hiv_test_time | HIV Test Time | 2 | hiv_test_before_, hiv_test_after_t |
| arv_status | ARV Start Time | 3 | before_tb, while_on_tb, not_on_tb |
| cpt_status | CPT Status | 2 | yes, no |
| enrolled_prp | Enrolled into PRP | 2 | yes, no |
| ptld_how_diagnosed | PTLD Diagnosis Method | 4 | spiro, res_symptom, chest_xray, reduced_exe |
| prp_outcome | PRP Outcome | 5 | prp_completed, prp_died, prp_lost_followup, prp_active_TB, prp_excluded |
𧬠Clinical logic check
CTE: drug_resistant_standard_regimen_check
Type: Custom SQL
Severity: β Error
Tag: clinical-logic-check
This check implements a clinical contradiction rule: a patient cannot simultaneously have a DST result of RIF Resistant Detected and be prescribed the standard first-line regimen (2RHZE/4RH). If both conditions are true on the same record, the record is flagged as an Error and excluded from the clean dataset.
Check message:
βPatient marked as drug resistant but prescribed standard treatment regimenβ
Trigger condition:
dst_result_dst_res = 'true' AND treatment_regimen_regimen_stan = 'true'
| Clinical logic check β Error | |||||||
| Drug-resistant result combined with standard first-line regimen | |||||||
| Field 1 | Label 1 | Value 1 | Field 2 | Label 2 | Value 2 | Severity | Outcome |
|---|---|---|---|---|---|---|---|
| dst_result_dst_res | DST Result | RIF Resistant Detected | treatment_regimen_regimen_stan | Treatment Regimen | 2RHZE/4RH (standard) | Error | Record excluded from clean dataset |
All checks at a glance
Complete check inventory β mw_tb_clean__tb_reg_checks |
|||||
| 5 check blocks; 41 total field-check combinations | |||||
| CTE Name | Check Type | Tag | Severity | Fields | Fires When |
|---|---|---|---|---|---|
| convert_to_int_checks | Integer conversion | invalid-number-check | Warning | 5 | OCR text cannot be parsed as integer |
| convert_to_date_checks | Date conversion | invalid-date-check | Warning | 9 | OCR text cannot be parsed as date (optional fields) |
| convert_to_date_checks_errors | Date conversion | invalid-date-check | Error | 3 | OCR text cannot be parsed as date (key fields) |
| melting_checks | Multiple-answer selection | multiple-answer-selected-check | Warning | 22 | More than one oval marked for a single-select question |
| drug_resistant_standard_regimen_check | Clinical logic | clinical-logic-check | Error | 2 | DST = RIF Resistant AND regimen = 2RHZE/4RH simultaneously |
Checks output table structure
Every check block produces rows in a common schema. These are unioned into the final mw_tb_clean__tb_reg_checks table.
Output table schema β mw_tb_clean__tb_reg_checks |
||
| One row per check failure per record; passing records produce no rows | ||
| Column | Type | Description |
|---|---|---|
| Entry_URL | text | URL uniquely identifying the ScanForm submission |
| Page_ID | text | Identifier for the scanned page |
| Submission_Date | date | Date the form was submitted to ScanForm |
| tb_reg_date | date | Patient registration date (parsed); used for temporal reporting |
| book_number | text | Register book number extracted from Page_ID |
| page_number | text | Page number within the book |
| Record_number | integer | Patient row number (1β5) within the page |
| record_identifier | text | Surrogate key: Page_ID + '-' + Record_number |
| check_message | text | Human-readable description of the specific check failure |
| Submitted_by | text | User account that submitted the scan |
| implementation_error | boolean | True if this row is also flagged as an implementation error |
| discard_page | boolean | True if the page was marked as discarded |
| discard_row | boolean | True if the patient row was marked as discarded |
| severity | text | 'Error' or 'Warning' (see severity levels above) |
| tags | array | List of check category tags, e.g. ['invalid-date-check'] |
Relationship to in-phone validation
| In-phone validation vs. pipeline DQA checks | ||
| Aspect | π± In-Phone Validation | ποΈ Pipeline DQA Checks |
|---|---|---|
| Timing | Instant, at photo capture | After submission, on next pipeline run |
| Audience | Field worker | Data manager / supervisor |
| Can block submission? | Yes β worker must correct and retake | No β data already in system |
| Fields covered | 26 key fields | 41 fields across 5 check blocks |
| Check types | Required fill, exactly one oval, at most one oval | Integer parse, date parse, multi-select, clinical logic |
| Output | On-screen alert in ScanForm app | Row in mw_tb_clean__tb_reg_checks table |
| Effect on clean data | Prevents bad data entering | Error severity: excludes record; Warning: flags only |
| Overlap | Checks presence and single-selection only | Checks parseable values and clinical consistency |
The two validation layers are complementary, not redundant.
In-phone validation catches structural problems (missing required fields, multiple ovals) at the moment of capture β when correction is still possible. Pipeline DQA checks verify that OCR outputs are mathematically and clinically coherent after the fact, and provide the permanent audit trail that data managers use to assess data quality over time.