← Back to projects
// live · production

Multi-Source Lead-Generation Pipeline

Role  Senior Data Engineer (sole engineer) Stack  PostgreSQL · Python · AWS · Streamlit

A regional direct-mail marketing firm ran its entire mortgage-protection lead-generation operation out of spreadsheets: daily closing data from four regional mail partners, mail-piece production and delivery tracking, inbound call-center responses, and lead handoff to an insurance agency — all tracked by hand. I replaced that with a PostgreSQL-backed pipeline deployed on AWS, with a login-gated web app for daily monitoring and reporting.

~11.4M
mortgage records ingested per year
4 → 1
partner feeds unified into one code path
100%
barcode decoder pass rate on 11,396 real pairs
2
AWS tenants from one template

The Problem

The core engineering challenge was heterogeneity. Four partners delivered the same logical data — mortgage records, mail-piece barcodes, delivery confirmations — but in three different barcode encodings and four different file formats across three transport channels. Each weekday, data for all four partners arrives as six files that are not uniform:

A naive design would build four parallel pipelines. Instead I unified the data path wherever the inputs were logically the same and isolated only the genuinely partner-specific deltas.

PARTNERS TRANSPORT STAGING CANONICAL CORE DOWNSTREAM Partner A · SEXLSX · IMb embedded Partner B · WCSV · no IMb Partner C · NECSV · no IMb Partner D · MWCSV · no IMb FTPS endpoint Email · Graph API Partner SFTP stg_mortgageunified · 4 partnerspartner ← filename stg_imb_mapbarcode-map backfill mortgage_recordsimb_digits — UNIQUE 20-digitON CONFLICT DO NOTHING imb.py · USPS Pub 97FADT↔digits · 100% on 11,396 Delivery confirm4 feeds · first-wins IVR responsesmtg_id + tracking# 12 mat. viewsA/B · delivery · callback
Data flow. Four partners arrive over three transport channels and converge into one unified staging table; all barcode encodings are normalized to a single canonical 20-digit key (cyan path) before promotion to the core table, which feeds the three downstream tracks.

Architecture

Canonical-form ingestion

The pivot of the whole design is choosing one canonical key — the 20-digit barcode tracking code — and converting everything to it at ingestion. A row's barcode might arrive in any of the three encodings; a single conversion module brings them all to the canonical 20 digits, stored under a UNIQUE constraint that doubles as the cross-partner delivery join key. All four partners' mortgage data lands in one shared staging table, with partner identity derived from the source filename. A single core-processing function promotes staged rows to the main table — partner travels with the row rather than forking the code path.

The barcode decoder

The single highest-risk component was the barcode encoder/decoder — a from-scratch implementation of USPS Publication 97 (11-bit CRC, codeword table, bit-to-bar mapping) to convert between the bar-state string and the numeric forms. I validated it against 11,396 real production barcode pairs with a 100% round-trip pass rate. Because of that validation, a decode failure at runtime is treated as a data-quality signal — malformed partner input — not a bug in the decoder. A meaningful operational distinction.

Correctness under re-runs

The pipeline is idempotent at every layer, so re-sends and re-runs can't corrupt state:

Per-row error isolation means one malformed row never blocks the rest of a file.

Business logic in the database

All lifecycle rules are encapsulated in PL/pgSQL functions, so every ingestion path inherits identical behavior. Delivery confirmation, for example, filters by partner inside the function — so one partner's feed can never accidentally overwrite another's record. Putting these invariants in the database rather than scattering them across scripts is what kept four input channels behaving consistently.

Reporting layer

Roughly a dozen materialized views back the standard reports — A/B response rates by mailer version, delivery performance by partner, callback rates by geography and mail-piece type — all refreshed concurrently once per day after ingestion completes. A read-only web app exposes them alongside an ad-hoc SQL window that runs inside a READ ONLY transaction so it can't be used to mutate data.

Deployment

The whole system ships as a single parameterized CloudFormation stack: managed Postgres (RDS), shared file storage (EFS), containerized ingestion tasks (ECS Fargate), and orchestration via Step Functions state machines triggered on a daily schedule. Failures publish to a notification topic; all task output goes to centralized logs.

Multi-tenant by construction. Every account-specific value (network, credentials, schedules) is a stack parameter, so the same template and the same container image redeploy unchanged into a new client's AWS account. Onboarding a tenant is mechanical — no code changes, no cross-account trust.

Keyless operator access. Ops access to the database runs through a session-manager bastion with no SSH keys and no open ports, scoped by an account-local IAM group. Adding an operator is a single command.

CLOUDFORMATION STACK — one template · per-tenant parameters SCHEDULE EventBridge 3 cron rules mortgage · sftp · email ORCHESTRATION Step Functions · 6 state machines Fargate taskingest scripts Fargate taskview refresh ECRimages · git SHA Streamlit appFargate + ALB SNS failure topic + CloudWatch Logs per-machine alerts · stdout/stderr capture DATA LAYER · VPC-private RDS PostgreSQL 14 encrypted · not public 22 tables · 12 MV · 12 fn EFS shared /app/data · 3-folder pattern Secrets Manager DB pwd · FTPS · SFTP · OAuth OPS ACCESS SSM bastion no SSH · no open ports IAM ops grp account-local scoped policy
AWS architecture. A single CloudFormation stack provisions everything: EventBridge triggers Step Functions, which run ingestion and refresh as ephemeral Fargate tasks against a VPC-private RDS instance and shared EFS. Ops access is keyless via an SSM bastion. Every account-specific value is a stack parameter.

Results

PostgreSQL Python AWS CloudFormation ECS Fargate Step Functions ETL Idempotency Multi-Tenant

This case study describes a real client engagement. The client, partners, and call-center vendor are anonymized, and no production data, credentials, or proprietary identifiers are included.