flowchart LR
subgraph source["π‘ Source"]
S[(mat_pip_MW_TB_Register_v1)]
end
subgraph base["π₯ Base"]
B[mat_pip_MW_TB_Register_v1]
end
subgraph pre_clean["π§ Pre-Clean"]
PC_RAW[tb_reg_raw]
PC[tb_reg]
end
subgraph clean["β
Clean"]
CL[tb_reg]
CH[tb_reg_checks]
end
subgraph intermediary["π Intermediary"]
IM_AGG[tb_aggregated]
IM_LOC[checks_with_locations]
end
subgraph refinery["π Refinery"]
RF[tb_reg]
RF_CH[tb_reg_checks]
end
S --> base
base --> PC_RAW
PC_RAW --> PC
PC_RAW --> CH
PC --> CL
CH --> CL
CL --> IM_AGG
CH --> IM_LOC
CL --> RF
CH --> RF_CH
IM_AGG --> RF
Data Pipeline
Pipeline Overview
Scanned forms are ingested from ScanForm, exported as a materialised dataset, and then processed through five transformation layers before reaching dashboards and reporting tools. Each layer has a specific responsibility: ingestion, filtering, cleaning, aggregation, and final delivery.
Layer-by-Layer Description
π₯ Base β mw_tb_base__mat_pip_MW_TB_Register_v1
The base model reads directly from the ScanForm materialised export (mat_pip_MW_TB_Register_v1). Its sole job is column renaming and minor structural tidying:
- All column names are normalised to
snake_case. book_numberandpage_numberare extracted fromPage_IDusing substring positions (characters 2β6 and 7β9 respectively) as a legacy fallback.- The
smea_result_5typo present in the raw export is aliased tosmear_result_5at this stage. - No rows are filtered; all scanned records β including discarded, unverified, and duplicate entries β pass through.
- Every select-one field is exported as individual boolean columns per option (e.g.
participant_sex_male,participant_sex_female), reflecting ScanFormβs materialised export format.
Why keep everything at the base layer? Cleaning decisions (filtering discards, handling invalids) must happen downstream so that checks can audit the full raw universe of records.
π§ Pre-Clean β Two models
mw_tb_pre_clean__tb_reg_raw
The first pre-clean model applies row-level exclusion filters to the base table:
- Removes rows where
discard_row = true - Removes pages where
discard_page = true - Excludes records where
tb_reg_dateis the blank template string'__/__/20__' - Excludes records where
tb_reg_numberequals the blank template'MTB-_____-___-__'
A record_identifier surrogate key is constructed as Page_ID || '-' || Record_number to uniquely identify each patient row across the dataset.
All other columns remain as raw text strings β no type casting, no value mapping β preserving the data exactly as OCR produced it so that the checks model can audit raw values.
mw_tb_pre_clean__tb_reg
The second pre-clean model applies field-level transformations to the filtered raw table:
| Pre-clean transformations applied per field | ||
Model: mw_tb_pre_clean__tb_reg |
||
| Output Field | Transformation | Detail |
|---|---|---|
| age_years | Unit-aware numeric conversion | If age_unit = Y: age as float. If age_unit = M: age / 12. Yields decimal years. |
| sex | Melt boolean columns | participant_sex_male / participant_sex_female β single 'Male' / 'Female' value |
| risk_group | Melt boolean columns | 8 individual risk columns β single label (Miner, Ex-Miner, Mining Community, HCW, Prisoner, HH Contact, Migrant, Other) |
| referral | Melt boolean columns | 7 referral columns β Community, NCD, Private, Walk In, OPD, Ward, ART |
| treatment_regimen | Melt boolean columns | 4 regimen columns β 2RHZE/4RH, BPalM, BPal, Other Regimen |
| treatment_regimen_other | pretty_text() | Removes underscores from OCR-produced spacing artefacts |
| disease_site | Melt boolean columns | Pulmonary / Extra-Pulmonary |
| treatment_history | Melt boolean columns | New, Relap, RALF, Fail, Other, Unknown |
| how_diagnosed | Melt boolean columns | Bacteriologically / Clinically |
| smear_result (x4) | Digit + plus-sign combination logic | 0βNegative; 1/2/3 + oval β '1+'/'2+'/'3+'; 1/2/3 without oval β exact count; 4β9 β exact count; else Invalid |
| lam/xpert/culture/xray/dst results | Melt boolean columns | Two- or three-option ovals collapsed to single label string |
| dst_result | Melt boolean columns | RIF Resistant Detected / RIF Resistant Not Detected / RIF Resistant Indeterminate |
| treat_outcome | Melt boolean columns | Cured, Completed, Dead, Treatment Failure, Lost to Follow-up, Not Evaluated |
| adherence_support | Melt boolean columns | Guardian, Health Surveilance Assistant, Volunteer, Healthcare Worker |
| hiv_status | Melt boolean columns | Positive / Negative / Unknown |
| arv_status | Melt boolean columns | Before TB Tx / While on TB Tx / Not While on TB Tx |
| arv_id | Null-blank replacement | '_____________' (13 underscores) β NULL |
| art_reg_id | Null-blank replacement | 'ART-_____-___-__' β NULL |
| participant_reg_id | Null-blank replacement | '_____-___-__' β NULL |
| tb_reg_number | Null-blank replacement | 'MTB-_____-___-__' β NULL |
| All date fields | convert_to_date() | String β DATE type using dd/mm/yy or dd/mm/yyyy format macro |
| enrolled_prp / cpt_status / second_line | Melt boolean columns | Yes / No |
| ptld_how_diagnosed | Melt boolean columns | Spirometry/LFT, Persistent Respiratory Symptoms, Abnormalities X-ray, Reduced Exercise |
| prp_outcome | Melt boolean columns | Completed, Died, Lost to Follow-up, Active TB, Excluded |
Smear microscopy is recorded on paper as two complementary fields: a digit box (0β9 bacilli count) and a plus oval (to indicate a grade). The pipeline combines them as follows:
| Digit | Plus oval marked? | Clean value |
|---|---|---|
| 0 | β | Negative |
| 1 | No | 1 |
| 1 | Yes | 1+ |
| 2 | No | 2 |
| 2 | Yes | 2+ |
| 3 | No | 3 |
| 3 | Yes | 3+ |
| 4β9 | β | Exact count |
| Other | β | Invalid |
This logic is applied identically to all four smear timepoints (baseline, 2-month, 5-month, 6-month).
β Clean β Two models
mw_tb_clean__tb_reg_checks
The checks model runs data quality assertions against mw_tb_pre_clean__tb_reg_raw (raw strings, before type casting) and unions them into a single checks table. Each row in the output represents one triggered check for one record.
| Data quality checks | ||||
Model: mw_tb_clean__tb_reg_checks |
||||
| Check | Input Columns | Severity | Tag | Description |
|---|---|---|---|---|
| convert_to_int | participant_age, smear_result (x4) | Warning | invalid-number-check | Flags records where age or smear digit boxes cannot be cast to integer |
| convert_to_date (Warning) | lam_date, smear_date, smear_date_2, smear_date_5, xpert_date, culture_date, xray_date, dst_date, cpt_start_date | Warning | invalid-date-check | Flags unparseable dates for optional date fields |
| convert_to_date (Error) | tb_reg_date, outcome_date, smear_date_6 | Error | invalid-date-check | Flags unparseable dates for key date fields; these records are excluded from clean data |
| batch_check_melt_columns | All select-one fields (22 groups) | Warning | multiple-answer-selected-check | Flags records where more than one oval was marked for a single-select question |
| drug_resistant_standard_regimen_check | dst_result_dst_res + treatment_regimen_regimen_stan | Error | clinical-logic-check | Flags records where DST is RIF Resistant Detected but the standard 2RHZE/4RH regimen is recorded β a clinical contradiction |
Error vs Warning
- Error β the record is excluded from
mw_tb_clean__tb_regand all downstream models. Errors indicate data that cannot be meaningfully interpreted (e.g. an unparseable registration date). - Warning β the record passes through to the clean table but the issue is flagged for review in the checks dashboard (e.g. a multiple-selection on a single-select oval field).
mw_tb_clean__tb_reg
The clean table selects all transformed fields from mw_tb_pre_clean__tb_reg and filters out any record_identifier that triggered an Error-severity check. The resulting dataset contains only records that are structurally valid and interpretable, with all fields cast to appropriate types and all select-one columns collapsed to clean label strings.
π Intermediary β Two models
mw_tb_intermediary__tb_aggregated
Produces a single aggregate summary row across all clean records. Each patient record is first scored with binary flags, then summed:
| Aggregated summary metrics | |
Model: mw_tb_intermediary__tb_aggregated |
|
| Metric | Aggregation Logic |
|---|---|
| total_treated | Count of records where tb_reg_date is not NULL |
| average_age | Mean of age_years across all records |
| total_female | Count where sex = 'Female' |
| total_male | Count where sex = 'Male' |
| total_standard_regimen | Count where treatment_regimen = '2RHZE/4RH' |
| total_other_regimen | Count where treatment_regimen = 'Other Regimen' |
| total_pulmonary_tb | Count where disease_site = 'Pulmonary' |
| total_previously_treated | Count where treatment_history IN ('Relap', 'RALF', 'Fail', 'Other') |
| total_resistant_tests | Count where dst_date is not NULL |
| total_resistant_tb | Count where dst_result = 'RIF Resistant Detected' |
| total_positive_outcome | Count where treat_outcome IN ('Cured', 'Completed') |
| total_hiv_positive | Count where hiv_status = 'Positive' |
| total_enrolled_prp | Count where enrolled_prp = 'Yes' |
| total_completed_prp | Count where prp_outcome = 'Completed' |
mw_tb_intermediary__checks_with_locations
Joins mw_tb_clean__tb_reg_checks with mw_tb_intermediary__books_with_locations on book_number to enrich check records with facility/location metadata. This powers location-disaggregated DQA dashboards where data managers can see which site generated the most errors or warnings.
π Refinery β Two models
Refinery models are the final delivery layer consumed by dashboards and external sinks (e.g. DHIS2, Metabase). They mirror the clean and checks content with any final adjustments needed for downstream consumers.
mw_tb_refinery__tb_reg
A pass-through of all columns from mw_tb_clean__tb_reg. No further transformation is applied; the refinery layer simply exposes the clean patient-level dataset in the marts schema where it is accessible to dashboard tooling.
mw_tb_refinery__tb_reg_checks
Selects from mw_tb_clean__tb_reg_checks and applies one additional filter:
- Excludes rows where
implementation_error = trueβ implementation errors reflect paper form issues (e.g. duplicate rows printed in error) rather than genuine data quality problems, and should not appear in operational DQA reports shown to facility staff.
Key Design Decisions
| Key pipeline design decisions | |
| Decision | Rationale |
|---|---|
| Raw export kept as text strings throughout pre_clean_raw | Allows checks to inspect the actual OCR output (e.g. '__/__/20__') before any casting masks the raw value |
| record_identifier = Page_ID + Record_number | Provides a stable surrogate key linking checks back to the specific patient row and page scan |
| Errors exclude from clean; Warnings do not | Errors (e.g. unparseable registration date, clinical contradiction) make a record unanalysable. Warnings flag issues for review without losing the record entirely |
| Blank template strings β NULL | ScanForm exports unfilled structured fields as the template pattern (e.g. 'MTB-_____-___-__'). These are replaced with NULL so that downstream aggregations and joins treat them as missing rather than a valid value |
| Age normalised to decimal years | Infants recorded in months are divided by 12 so that age_years is a single comparable numeric field across all patients |
| Smear result combines digit + plus oval | The paper form encodes grade information across two separate OCR fields. The pipeline merges them into a single interpretable clinical value (e.g. '2+') |
| implementation_error filtered from refinery checks | Facility staff reviewing DQA reports should not see errors caused by form printing issues outside their control |
| book_number extracted from Page_ID substring | Legacy handling: in the absence of a separate book metadata export, position 2β6 of Page_ID encodes the book identifier |
Model Inventory
| All pipeline models | ||||
| 9 models across 5 layers | ||||
| Model | Layer | Reads From | Output Granularity | Notes |
|---|---|---|---|---|
| mw_tb_base__mat_pip_MW_TB_Register_v1 | Base | scanform_materialized_exports | All scanned rows | Column rename only; no filtering |
| mw_tb_pre_clean__tb_reg_raw | Pre-Clean | base__mat_pip_MW_TB_Register_v1 | Excludes discards | Row filters + record_identifier; all values remain as raw text |
| mw_tb_pre_clean__tb_reg | Pre-Clean | pre_clean__tb_reg_raw | Same as raw | Type casts, melt, smear combination, blankβNULL |
| mw_tb_clean__tb_reg_checks | Clean | pre_clean__tb_reg_raw | One row per check triggered | Union of 5 check types; Error and Warning severity |
| mw_tb_clean__tb_reg | Clean | pre_clean__tb_reg | Excludes Error records | Final clean patient-level table |
| mw_tb_intermediary__tb_aggregated | Intermediary | clean__tb_reg | 1 summary row | 14 aggregate metrics for dashboard KPI tiles |
| mw_tb_intermediary__checks_with_locations | Intermediary | clean__tb_reg_checks + books_with_locations | One row per check (with location) | Joins checks to facility metadata |
| mw_tb_refinery__tb_reg | Refinery | clean__tb_reg | Same as clean | Pass-through; exposed in marts schema |
| mw_tb_refinery__tb_reg_checks | Refinery | clean__tb_reg_checks | Excludes implementation_error rows | DQA checks surfaced to facility staff |