Data Pipeline

Pipeline Overview

Scanned forms are ingested from ScanForm, exported as a materialised dataset, and then processed through five transformation layers before reaching dashboards and reporting tools. Each layer has a specific responsibility: ingestion, filtering, cleaning, aggregation, and final delivery.

flowchart LR
  subgraph source["📡 Source"]
    S[(mat_pip_MW_TB_Register_v1)]
  end

  subgraph base["📥 Base"]
    B[mat_pip_MW_TB_Register_v1]
  end

  subgraph pre_clean["🔧 Pre-Clean"]
    PC_RAW[tb_reg_raw]
    PC[tb_reg]
  end

  subgraph clean["✅ Clean"]
    CL[tb_reg]
    CH[tb_reg_checks]
  end

  subgraph intermediary["🔀 Intermediary"]
    IM_AGG[tb_aggregated]
    IM_LOC[checks_with_locations]
  end

  subgraph refinery["🚀 Refinery"]
    RF[tb_reg]
    RF_CH[tb_reg_checks]
  end

  S --> base
  base --> PC_RAW
  PC_RAW --> PC
  PC_RAW --> CH
  PC --> CL
  CH --> CL
  CL --> IM_AGG
  CH --> IM_LOC
  CL --> RF
  CH --> RF_CH
  IM_AGG --> RF

Layer-by-Layer Description

📥 Base — `mw_tb_base__mat_pip_MW_TB_Register_v1`

The base model reads directly from the ScanForm materialised export (mat_pip_MW_TB_Register_v1). Its sole job is column renaming and minor structural tidying:

All column names are normalised to snake_case.
book_number and page_number are extracted from Page_ID using substring positions (characters 2–6 and 7–9 respectively) as a legacy fallback.
The smea_result_5 typo present in the raw export is aliased to smear_result_5 at this stage.
No rows are filtered; all scanned records — including discarded, unverified, and duplicate entries — pass through.
Every select-one field is exported as individual boolean columns per option (e.g. participant_sex_male, participant_sex_female), reflecting ScanForm’s materialised export format.

Note

Why keep everything at the base layer? Cleaning decisions (filtering discards, handling invalids) must happen downstream so that checks can audit the full raw universe of records.

🔧 Pre-Clean — Two models

`mw_tb_pre_clean__tb_reg_raw`

The first pre-clean model applies row-level exclusion filters to the base table:

Removes rows where discard_row = true
Removes pages where discard_page = true
Excludes records where tb_reg_date is the blank template string '__/__/20__'
Excludes records where tb_reg_number equals the blank template 'MTB-_____-___-__'

A record_identifier surrogate key is constructed as Page_ID || '-' || Record_number to uniquely identify each patient row across the dataset.

All other columns remain as raw text strings — no type casting, no value mapping — preserving the data exactly as OCR produced it so that the checks model can audit raw values.

`mw_tb_pre_clean__tb_reg`

The second pre-clean model applies field-level transformations to the filtered raw table:

Output Field	Transformation	Detail
Pre-clean transformations applied per field
Model: `mw_tb_pre_clean__tb_reg`
age_years	Unit-aware numeric conversion	If age_unit = Y: age as float. If age_unit = M: age / 12. Yields decimal years.
sex	Melt boolean columns	participant_sex_male / participant_sex_female → single 'Male' / 'Female' value
risk_group	Melt boolean columns	8 individual risk columns → single label (Miner, Ex-Miner, Mining Community, HCW, Prisoner, HH Contact, Migrant, Other)
referral	Melt boolean columns	7 referral columns → Community, NCD, Private, Walk In, OPD, Ward, ART
treatment_regimen	Melt boolean columns	4 regimen columns → 2RHZE/4RH, BPalM, BPal, Other Regimen
treatment_regimen_other	pretty_text()	Removes underscores from OCR-produced spacing artefacts
disease_site	Melt boolean columns	Pulmonary / Extra-Pulmonary
treatment_history	Melt boolean columns	New, Relap, RALF, Fail, Other, Unknown
how_diagnosed	Melt boolean columns	Bacteriologically / Clinically
smear_result (x4)	Digit + plus-sign combination logic	0→Negative; 1/2/3 + oval → '1+'/'2+'/'3+'; 1/2/3 without oval → exact count; 4–9 → exact count; else Invalid
lam/xpert/culture/xray/dst results	Melt boolean columns	Two- or three-option ovals collapsed to single label string
dst_result	Melt boolean columns	RIF Resistant Detected / RIF Resistant Not Detected / RIF Resistant Indeterminate
treat_outcome	Melt boolean columns	Cured, Completed, Dead, Treatment Failure, Lost to Follow-up, Not Evaluated
adherence_support	Melt boolean columns	Guardian, Health Surveilance Assistant, Volunteer, Healthcare Worker
hiv_status	Melt boolean columns	Positive / Negative / Unknown
arv_status	Melt boolean columns	Before TB Tx / While on TB Tx / Not While on TB Tx
arv_id	Null-blank replacement	'_____________' (13 underscores) → NULL
art_reg_id	Null-blank replacement	'ART-_____-___-__' → NULL
participant_reg_id	Null-blank replacement	'_____-___-__' → NULL
tb_reg_number	Null-blank replacement	'MTB-_____-___-__' → NULL
All date fields	convert_to_date()	String → DATE type using dd/mm/yy or dd/mm/yyyy format macro
enrolled_prp / cpt_status / second_line	Melt boolean columns	Yes / No
ptld_how_diagnosed	Melt boolean columns	Spirometry/LFT, Persistent Respiratory Symptoms, Abnormalities X-ray, Reduced Exercise
prp_outcome	Melt boolean columns	Completed, Died, Lost to Follow-up, Active TB, Excluded

Smear result combination logic

Smear microscopy is recorded on paper as two complementary fields: a digit box (0–9 bacilli count) and a plus oval (to indicate a grade). The pipeline combines them as follows:

Digit	Plus oval marked?	Clean value
0	—	Negative
1	No	1
1	Yes	1+
2	No	2
2	Yes	2+
3	No	3
3	Yes	3+
4–9	—	Exact count
Other	—	Invalid

This logic is applied identically to all four smear timepoints (baseline, 2-month, 5-month, 6-month).

✅ Clean — Two models

`mw_tb_clean__tb_reg_checks`

The checks model runs data quality assertions against mw_tb_pre_clean__tb_reg_raw (raw strings, before type casting) and unions them into a single checks table. Each row in the output represents one triggered check for one record.

Check	Input Columns	Severity	Tag	Description
Data quality checks
Model: `mw_tb_clean__tb_reg_checks`
convert_to_int	participant_age, smear_result (x4)	Warning	invalid-number-check	Flags records where age or smear digit boxes cannot be cast to integer
convert_to_date (Warning)	lam_date, smear_date, smear_date_2, smear_date_5, xpert_date, culture_date, xray_date, dst_date, cpt_start_date	Warning	invalid-date-check	Flags unparseable dates for optional date fields
convert_to_date (Error)	tb_reg_date, outcome_date, smear_date_6	Error	invalid-date-check	Flags unparseable dates for key date fields; these records are excluded from clean data
batch_check_melt_columns	All select-one fields (22 groups)	Warning	multiple-answer-selected-check	Flags records where more than one oval was marked for a single-select question
drug_resistant_standard_regimen_check	dst_result_dst_res + treatment_regimen_regimen_stan	Error	clinical-logic-check	Flags records where DST is RIF Resistant Detected but the standard 2RHZE/4RH regimen is recorded — a clinical contradiction

Important

Error vs Warning

Error — the record is excluded from mw_tb_clean__tb_reg and all downstream models. Errors indicate data that cannot be meaningfully interpreted (e.g. an unparseable registration date).
Warning — the record passes through to the clean table but the issue is flagged for review in the checks dashboard (e.g. a multiple-selection on a single-select oval field).

`mw_tb_clean__tb_reg`

The clean table selects all transformed fields from mw_tb_pre_clean__tb_reg and filters out any record_identifier that triggered an Error-severity check. The resulting dataset contains only records that are structurally valid and interpretable, with all fields cast to appropriate types and all select-one columns collapsed to clean label strings.

🔀 Intermediary — Two models

`mw_tb_intermediary__tb_aggregated`

Produces a single aggregate summary row across all clean records. Each patient record is first scored with binary flags, then summed:

Metric	Aggregation Logic
Aggregated summary metrics
Model: `mw_tb_intermediary__tb_aggregated`
total_treated	Count of records where tb_reg_date is not NULL
average_age	Mean of age_years across all records
total_female	Count where sex = 'Female'
total_male	Count where sex = 'Male'
total_standard_regimen	Count where treatment_regimen = '2RHZE/4RH'
total_other_regimen	Count where treatment_regimen = 'Other Regimen'
total_pulmonary_tb	Count where disease_site = 'Pulmonary'
total_previously_treated	Count where treatment_history IN ('Relap', 'RALF', 'Fail', 'Other')
total_resistant_tests	Count where dst_date is not NULL
total_resistant_tb	Count where dst_result = 'RIF Resistant Detected'
total_positive_outcome	Count where treat_outcome IN ('Cured', 'Completed')
total_hiv_positive	Count where hiv_status = 'Positive'
total_enrolled_prp	Count where enrolled_prp = 'Yes'
total_completed_prp	Count where prp_outcome = 'Completed'

`mw_tb_intermediary__checks_with_locations`

Joins mw_tb_clean__tb_reg_checks with mw_tb_intermediary__books_with_locations on book_number to enrich check records with facility/location metadata. This powers location-disaggregated DQA dashboards where data managers can see which site generated the most errors or warnings.

🚀 Refinery — Two models

Refinery models are the final delivery layer consumed by dashboards and external sinks (e.g. DHIS2, Metabase). They mirror the clean and checks content with any final adjustments needed for downstream consumers.

`mw_tb_refinery__tb_reg`

A pass-through of all columns from mw_tb_clean__tb_reg. No further transformation is applied; the refinery layer simply exposes the clean patient-level dataset in the marts schema where it is accessible to dashboard tooling.

`mw_tb_refinery__tb_reg_checks`

Selects from mw_tb_clean__tb_reg_checks and applies one additional filter:

Excludes rows where implementation_error = true — implementation errors reflect paper form issues (e.g. duplicate rows printed in error) rather than genuine data quality problems, and should not appear in operational DQA reports shown to facility staff.

Key Design Decisions

Decision	Rationale
Key pipeline design decisions
Raw export kept as text strings throughout pre_clean_raw	Allows checks to inspect the actual OCR output (e.g. '__/__/20__') before any casting masks the raw value
record_identifier = Page_ID + Record_number	Provides a stable surrogate key linking checks back to the specific patient row and page scan
Errors exclude from clean; Warnings do not	Errors (e.g. unparseable registration date, clinical contradiction) make a record unanalysable. Warnings flag issues for review without losing the record entirely
Blank template strings → NULL	ScanForm exports unfilled structured fields as the template pattern (e.g. 'MTB-_____-___-__'). These are replaced with NULL so that downstream aggregations and joins treat them as missing rather than a valid value
Age normalised to decimal years	Infants recorded in months are divided by 12 so that age_years is a single comparable numeric field across all patients
Smear result combines digit + plus oval	The paper form encodes grade information across two separate OCR fields. The pipeline merges them into a single interpretable clinical value (e.g. '2+')
implementation_error filtered from refinery checks	Facility staff reviewing DQA reports should not see errors caused by form printing issues outside their control
book_number extracted from Page_ID substring	Legacy handling: in the absence of a separate book metadata export, position 2–6 of Page_ID encodes the book identifier

Model Inventory

Model	Layer	Reads From	Output Granularity	Notes
All pipeline models
9 models across 5 layers
mw_tb_base__mat_pip_MW_TB_Register_v1	Base	scanform_materialized_exports	All scanned rows	Column rename only; no filtering
mw_tb_pre_clean__tb_reg_raw	Pre-Clean	base__mat_pip_MW_TB_Register_v1	Excludes discards	Row filters + record_identifier; all values remain as raw text
mw_tb_pre_clean__tb_reg	Pre-Clean	pre_clean__tb_reg_raw	Same as raw	Type casts, melt, smear combination, blank→NULL
mw_tb_clean__tb_reg_checks	Clean	pre_clean__tb_reg_raw	One row per check triggered	Union of 5 check types; Error and Warning severity
mw_tb_clean__tb_reg	Clean	pre_clean__tb_reg	Excludes Error records	Final clean patient-level table
mw_tb_intermediary__tb_aggregated	Intermediary	clean__tb_reg	1 summary row	14 aggregate metrics for dashboard KPI tiles
mw_tb_intermediary__checks_with_locations	Intermediary	clean__tb_reg_checks + books_with_locations	One row per check (with location)	Joins checks to facility metadata
mw_tb_refinery__tb_reg	Refinery	clean__tb_reg	Same as clean	Pass-through; exposed in marts schema
mw_tb_refinery__tb_reg_checks	Refinery	clean__tb_reg_checks	Excludes implementation_error rows	DQA checks surfaced to facility staff

Pipeline Overview

Layer-by-Layer Description

📥 Base — mw_tb_base__mat_pip_MW_TB_Register_v1

🔧 Pre-Clean — Two models

mw_tb_pre_clean__tb_reg_raw

mw_tb_pre_clean__tb_reg

✅ Clean — Two models

mw_tb_clean__tb_reg_checks

mw_tb_clean__tb_reg

🔀 Intermediary — Two models

mw_tb_intermediary__tb_aggregated

mw_tb_intermediary__checks_with_locations

🚀 Refinery — Two models

mw_tb_refinery__tb_reg

mw_tb_refinery__tb_reg_checks

Key Design Decisions

Model Inventory

📥 Base — `mw_tb_base__mat_pip_MW_TB_Register_v1`

`mw_tb_pre_clean__tb_reg_raw`

`mw_tb_pre_clean__tb_reg`

`mw_tb_clean__tb_reg_checks`

`mw_tb_clean__tb_reg`

`mw_tb_intermediary__tb_aggregated`

`mw_tb_intermediary__checks_with_locations`

`mw_tb_refinery__tb_reg`

`mw_tb_refinery__tb_reg_checks`