Data Pipeline

Pipeline Overview

Scanned forms are ingested from ScanForm, exported as a materialised dataset, and then processed through five transformation layers before reaching dashboards and reporting tools. Each layer has a specific responsibility: ingestion, filtering, cleaning, aggregation, and final delivery.

flowchart LR
  subgraph source["πŸ“‘ Source"]
    S[(mat_pip_MW_TB_Register_v1)]
  end

  subgraph base["πŸ“₯ Base"]
    B[mat_pip_MW_TB_Register_v1]
  end

  subgraph pre_clean["πŸ”§ Pre-Clean"]
    PC_RAW[tb_reg_raw]
    PC[tb_reg]
  end

  subgraph clean["βœ… Clean"]
    CL[tb_reg]
    CH[tb_reg_checks]
  end

  subgraph intermediary["πŸ”€ Intermediary"]
    IM_AGG[tb_aggregated]
    IM_LOC[checks_with_locations]
  end

  subgraph refinery["πŸš€ Refinery"]
    RF[tb_reg]
    RF_CH[tb_reg_checks]
  end

  S --> base
  base --> PC_RAW
  PC_RAW --> PC
  PC_RAW --> CH
  PC --> CL
  CH --> CL
  CL --> IM_AGG
  CH --> IM_LOC
  CL --> RF
  CH --> RF_CH
  IM_AGG --> RF


Layer-by-Layer Description

πŸ“₯ Base β€” mw_tb_base__mat_pip_MW_TB_Register_v1

The base model reads directly from the ScanForm materialised export (mat_pip_MW_TB_Register_v1). Its sole job is column renaming and minor structural tidying:

  • All column names are normalised to snake_case.
  • book_number and page_number are extracted from Page_ID using substring positions (characters 2–6 and 7–9 respectively) as a legacy fallback.
  • The smea_result_5 typo present in the raw export is aliased to smear_result_5 at this stage.
  • No rows are filtered; all scanned records β€” including discarded, unverified, and duplicate entries β€” pass through.
  • Every select-one field is exported as individual boolean columns per option (e.g. participant_sex_male, participant_sex_female), reflecting ScanForm’s materialised export format.
Note

Why keep everything at the base layer? Cleaning decisions (filtering discards, handling invalids) must happen downstream so that checks can audit the full raw universe of records.


πŸ”§ Pre-Clean β€” Two models

mw_tb_pre_clean__tb_reg_raw

The first pre-clean model applies row-level exclusion filters to the base table:

  • Removes rows where discard_row = true
  • Removes pages where discard_page = true
  • Excludes records where tb_reg_date is the blank template string '__/__/20__'
  • Excludes records where tb_reg_number equals the blank template 'MTB-_____-___-__'

A record_identifier surrogate key is constructed as Page_ID || '-' || Record_number to uniquely identify each patient row across the dataset.

All other columns remain as raw text strings β€” no type casting, no value mapping β€” preserving the data exactly as OCR produced it so that the checks model can audit raw values.

mw_tb_pre_clean__tb_reg

The second pre-clean model applies field-level transformations to the filtered raw table:

Pre-clean transformations applied per field
Model: mw_tb_pre_clean__tb_reg
Output Field Transformation Detail
age_years Unit-aware numeric conversion If age_unit = Y: age as float. If age_unit = M: age / 12. Yields decimal years.
sex Melt boolean columns participant_sex_male / participant_sex_female β†’ single 'Male' / 'Female' value
risk_group Melt boolean columns 8 individual risk columns β†’ single label (Miner, Ex-Miner, Mining Community, HCW, Prisoner, HH Contact, Migrant, Other)
referral Melt boolean columns 7 referral columns β†’ Community, NCD, Private, Walk In, OPD, Ward, ART
treatment_regimen Melt boolean columns 4 regimen columns β†’ 2RHZE/4RH, BPalM, BPal, Other Regimen
treatment_regimen_other pretty_text() Removes underscores from OCR-produced spacing artefacts
disease_site Melt boolean columns Pulmonary / Extra-Pulmonary
treatment_history Melt boolean columns New, Relap, RALF, Fail, Other, Unknown
how_diagnosed Melt boolean columns Bacteriologically / Clinically
smear_result (x4) Digit + plus-sign combination logic 0β†’Negative; 1/2/3 + oval β†’ '1+'/'2+'/'3+'; 1/2/3 without oval β†’ exact count; 4–9 β†’ exact count; else Invalid
lam/xpert/culture/xray/dst results Melt boolean columns Two- or three-option ovals collapsed to single label string
dst_result Melt boolean columns RIF Resistant Detected / RIF Resistant Not Detected / RIF Resistant Indeterminate
treat_outcome Melt boolean columns Cured, Completed, Dead, Treatment Failure, Lost to Follow-up, Not Evaluated
adherence_support Melt boolean columns Guardian, Health Surveilance Assistant, Volunteer, Healthcare Worker
hiv_status Melt boolean columns Positive / Negative / Unknown
arv_status Melt boolean columns Before TB Tx / While on TB Tx / Not While on TB Tx
arv_id Null-blank replacement '_____________' (13 underscores) β†’ NULL
art_reg_id Null-blank replacement 'ART-_____-___-__' β†’ NULL
participant_reg_id Null-blank replacement '_____-___-__' β†’ NULL
tb_reg_number Null-blank replacement 'MTB-_____-___-__' β†’ NULL
All date fields convert_to_date() String β†’ DATE type using dd/mm/yy or dd/mm/yyyy format macro
enrolled_prp / cpt_status / second_line Melt boolean columns Yes / No
ptld_how_diagnosed Melt boolean columns Spirometry/LFT, Persistent Respiratory Symptoms, Abnormalities X-ray, Reduced Exercise
prp_outcome Melt boolean columns Completed, Died, Lost to Follow-up, Active TB, Excluded

Smear microscopy is recorded on paper as two complementary fields: a digit box (0–9 bacilli count) and a plus oval (to indicate a grade). The pipeline combines them as follows:

Digit Plus oval marked? Clean value
0 β€” Negative
1 No 1
1 Yes 1+
2 No 2
2 Yes 2+
3 No 3
3 Yes 3+
4–9 β€” Exact count
Other β€” Invalid

This logic is applied identically to all four smear timepoints (baseline, 2-month, 5-month, 6-month).


βœ… Clean β€” Two models

mw_tb_clean__tb_reg_checks

The checks model runs data quality assertions against mw_tb_pre_clean__tb_reg_raw (raw strings, before type casting) and unions them into a single checks table. Each row in the output represents one triggered check for one record.

Data quality checks
Model: mw_tb_clean__tb_reg_checks
Check Input Columns Severity Tag Description
convert_to_int participant_age, smear_result (x4) Warning invalid-number-check Flags records where age or smear digit boxes cannot be cast to integer
convert_to_date (Warning) lam_date, smear_date, smear_date_2, smear_date_5, xpert_date, culture_date, xray_date, dst_date, cpt_start_date Warning invalid-date-check Flags unparseable dates for optional date fields
convert_to_date (Error) tb_reg_date, outcome_date, smear_date_6 Error invalid-date-check Flags unparseable dates for key date fields; these records are excluded from clean data
batch_check_melt_columns All select-one fields (22 groups) Warning multiple-answer-selected-check Flags records where more than one oval was marked for a single-select question
drug_resistant_standard_regimen_check dst_result_dst_res + treatment_regimen_regimen_stan Error clinical-logic-check Flags records where DST is RIF Resistant Detected but the standard 2RHZE/4RH regimen is recorded β€” a clinical contradiction
Important

Error vs Warning

  • Error β€” the record is excluded from mw_tb_clean__tb_reg and all downstream models. Errors indicate data that cannot be meaningfully interpreted (e.g. an unparseable registration date).
  • Warning β€” the record passes through to the clean table but the issue is flagged for review in the checks dashboard (e.g. a multiple-selection on a single-select oval field).

mw_tb_clean__tb_reg

The clean table selects all transformed fields from mw_tb_pre_clean__tb_reg and filters out any record_identifier that triggered an Error-severity check. The resulting dataset contains only records that are structurally valid and interpretable, with all fields cast to appropriate types and all select-one columns collapsed to clean label strings.


πŸ”€ Intermediary β€” Two models

mw_tb_intermediary__tb_aggregated

Produces a single aggregate summary row across all clean records. Each patient record is first scored with binary flags, then summed:

Aggregated summary metrics
Model: mw_tb_intermediary__tb_aggregated
Metric Aggregation Logic
total_treated Count of records where tb_reg_date is not NULL
average_age Mean of age_years across all records
total_female Count where sex = 'Female'
total_male Count where sex = 'Male'
total_standard_regimen Count where treatment_regimen = '2RHZE/4RH'
total_other_regimen Count where treatment_regimen = 'Other Regimen'
total_pulmonary_tb Count where disease_site = 'Pulmonary'
total_previously_treated Count where treatment_history IN ('Relap', 'RALF', 'Fail', 'Other')
total_resistant_tests Count where dst_date is not NULL
total_resistant_tb Count where dst_result = 'RIF Resistant Detected'
total_positive_outcome Count where treat_outcome IN ('Cured', 'Completed')
total_hiv_positive Count where hiv_status = 'Positive'
total_enrolled_prp Count where enrolled_prp = 'Yes'
total_completed_prp Count where prp_outcome = 'Completed'

mw_tb_intermediary__checks_with_locations

Joins mw_tb_clean__tb_reg_checks with mw_tb_intermediary__books_with_locations on book_number to enrich check records with facility/location metadata. This powers location-disaggregated DQA dashboards where data managers can see which site generated the most errors or warnings.


πŸš€ Refinery β€” Two models

Refinery models are the final delivery layer consumed by dashboards and external sinks (e.g. DHIS2, Metabase). They mirror the clean and checks content with any final adjustments needed for downstream consumers.

mw_tb_refinery__tb_reg

A pass-through of all columns from mw_tb_clean__tb_reg. No further transformation is applied; the refinery layer simply exposes the clean patient-level dataset in the marts schema where it is accessible to dashboard tooling.

mw_tb_refinery__tb_reg_checks

Selects from mw_tb_clean__tb_reg_checks and applies one additional filter:

  • Excludes rows where implementation_error = true β€” implementation errors reflect paper form issues (e.g. duplicate rows printed in error) rather than genuine data quality problems, and should not appear in operational DQA reports shown to facility staff.

Key Design Decisions

Key pipeline design decisions
Decision Rationale
Raw export kept as text strings throughout pre_clean_raw Allows checks to inspect the actual OCR output (e.g. '__/__/20__') before any casting masks the raw value
record_identifier = Page_ID + Record_number Provides a stable surrogate key linking checks back to the specific patient row and page scan
Errors exclude from clean; Warnings do not Errors (e.g. unparseable registration date, clinical contradiction) make a record unanalysable. Warnings flag issues for review without losing the record entirely
Blank template strings β†’ NULL ScanForm exports unfilled structured fields as the template pattern (e.g. 'MTB-_____-___-__'). These are replaced with NULL so that downstream aggregations and joins treat them as missing rather than a valid value
Age normalised to decimal years Infants recorded in months are divided by 12 so that age_years is a single comparable numeric field across all patients
Smear result combines digit + plus oval The paper form encodes grade information across two separate OCR fields. The pipeline merges them into a single interpretable clinical value (e.g. '2+')
implementation_error filtered from refinery checks Facility staff reviewing DQA reports should not see errors caused by form printing issues outside their control
book_number extracted from Page_ID substring Legacy handling: in the absence of a separate book metadata export, position 2–6 of Page_ID encodes the book identifier

Model Inventory

All pipeline models
9 models across 5 layers
Model Layer Reads From Output Granularity Notes
mw_tb_base__mat_pip_MW_TB_Register_v1 Base scanform_materialized_exports All scanned rows Column rename only; no filtering
mw_tb_pre_clean__tb_reg_raw Pre-Clean base__mat_pip_MW_TB_Register_v1 Excludes discards Row filters + record_identifier; all values remain as raw text
mw_tb_pre_clean__tb_reg Pre-Clean pre_clean__tb_reg_raw Same as raw Type casts, melt, smear combination, blank→NULL
mw_tb_clean__tb_reg_checks Clean pre_clean__tb_reg_raw One row per check triggered Union of 5 check types; Error and Warning severity
mw_tb_clean__tb_reg Clean pre_clean__tb_reg Excludes Error records Final clean patient-level table
mw_tb_intermediary__tb_aggregated Intermediary clean__tb_reg 1 summary row 14 aggregate metrics for dashboard KPI tiles
mw_tb_intermediary__checks_with_locations Intermediary clean__tb_reg_checks + books_with_locations One row per check (with location) Joins checks to facility metadata
mw_tb_refinery__tb_reg Refinery clean__tb_reg Same as clean Pass-through; exposed in marts schema
mw_tb_refinery__tb_reg_checks Refinery clean__tb_reg_checks Excludes implementation_error rows DQA checks surfaced to facility staff