// live · production

Multi-Source Lead-Generation Pipeline

Role Senior Data Engineer (sole engineer) Stack PostgreSQL · Python · AWS · Streamlit

A regional direct-mail marketing firm ran its entire mortgage-protection lead-generation operation out of spreadsheets: daily closing data from four regional mail partners, mail-piece production and delivery tracking, inbound call-center responses, and lead handoff to an insurance agency — all tracked by hand. I replaced that with a PostgreSQL-backed pipeline deployed on AWS, with a login-gated web app for daily monitoring and reporting.

~11.4M

mortgage records ingested per year

4 → 1

partner feeds unified into one code path

100%

barcode decoder pass rate on 11,396 real pairs

AWS tenants from one template

The Problem

The core engineering challenge was heterogeneity. Four partners delivered the same logical data — mortgage records, mail-piece barcodes, delivery confirmations — but in three different barcode encodings and four different file formats across three transport channels. Each weekday, data for all four partners arrives as six files that are not uniform:

Three transport channels — an FTPS file endpoint, a shared email inbox (read via the Microsoft Graph API), and partner SFTP.
Two mortgage-file shapes — one partner embeds the mail barcode directly in a ~32-column spreadsheet; the other three send ~15-column CSVs with no barcode at all.
Three barcode encodings — the USPS Intelligent Mail barcode arrives as a 65-character bar-state string, a 20-digit tracking code, or a 31-digit form. All three encode the same physical barcode.
Four delivery-confirmation formats — positional CSV, fixed-width text inside a zip, and two spreadsheet variants, each on its own channel.

A naive design would build four parallel pipelines. Instead I unified the data path wherever the inputs were logically the same and isolated only the genuinely partner-specific deltas.

Data flow. Four partners arrive over three transport channels and converge into one unified staging table; all barcode encodings are normalized to a single canonical 20-digit key (cyan path) before promotion to the core table, which feeds the three downstream tracks.

Architecture

Canonical-form ingestion

The pivot of the whole design is choosing one canonical key — the 20-digit barcode tracking code — and converting everything to it at ingestion. A row's barcode might arrive in any of the three encodings; a single conversion module brings them all to the canonical 20 digits, stored under a UNIQUE constraint that doubles as the cross-partner delivery join key. All four partners' mortgage data lands in one shared staging table, with partner identity derived from the source filename. A single core-processing function promotes staged rows to the main table — partner travels with the row rather than forking the code path.

The barcode decoder

The single highest-risk component was the barcode encoder/decoder — a from-scratch implementation of USPS Publication 97 (11-bit CRC, codeword table, bit-to-bar mapping) to convert between the bar-state string and the numeric forms. I validated it against 11,396 real production barcode pairs with a 100% round-trip pass rate. Because of that validation, a decode failure at runtime is treated as a data-quality signal — malformed partner input — not a bug in the decoder. A meaningful operational distinction.

Correctness under re-runs

The pipeline is idempotent at every layer, so re-sends and re-runs can't corrupt state:

File level — every inbound file is SHA-256 hashed before a byte is read; a previously seen hash is rejected.
Mail-piece level — the canonical barcode is UNIQUE; duplicate pieces are silently skipped via ON CONFLICT DO NOTHING.
Delivery level — first-wins confirmation: once a piece is marked delivered, a later confirmation (even with a different date) is logged for audit and ignored.

Per-row error isolation means one malformed row never blocks the rest of a file.

Business logic in the database

All lifecycle rules are encapsulated in PL/pgSQL functions, so every ingestion path inherits identical behavior. Delivery confirmation, for example, filters by partner inside the function — so one partner's feed can never accidentally overwrite another's record. Putting these invariants in the database rather than scattering them across scripts is what kept four input channels behaving consistently.

Reporting layer

Roughly a dozen materialized views back the standard reports — A/B response rates by mailer version, delivery performance by partner, callback rates by geography and mail-piece type — all refreshed concurrently once per day after ingestion completes. A read-only web app exposes them alongside an ad-hoc SQL window that runs inside a READ ONLY transaction so it can't be used to mutate data.

Deployment

The whole system ships as a single parameterized CloudFormation stack: managed Postgres (RDS), shared file storage (EFS), containerized ingestion tasks (ECS Fargate), and orchestration via Step Functions state machines triggered on a daily schedule. Failures publish to a notification topic; all task output goes to centralized logs.

Multi-tenant by construction. Every account-specific value (network, credentials, schedules) is a stack parameter, so the same template and the same container image redeploy unchanged into a new client's AWS account. Onboarding a tenant is mechanical — no code changes, no cross-account trust.

Keyless operator access. Ops access to the database runs through a session-manager bastion with no SSH keys and no open ports, scoped by an account-local IAM group. Adding an operator is a single command.

AWS architecture. A single CloudFormation stack provisions everything: EventBridge triggers Step Functions, which run ingestion and refresh as ephemeral Fargate tasks against a VPC-private RDS instance and shared EFS. Ops access is keyless via an SSM bastion. Every account-specific value is a stack parameter.

Results

Replaced a spreadsheet workflow with a production pipeline processing four partners through one unified code path.
Barcode decoder validated at 100% on 11,396 real production pairs.
Full lifecycle tracking — every mail piece moves through mailed → delivered → responded → converted → handed off, with a complete audit log of each transition.
Self-service reporting — a dozen materialized views and an ad-hoc SQL console replaced manual report assembly.
Deployed to two AWS tenants from one template, designed for mechanical rollout to more.

PostgreSQL Python AWS CloudFormation ECS Fargate Step Functions ETL Idempotency Multi-Tenant

This case study describes a real client engagement. The client, partners, and call-center vendor are anonymized, and no production data, credentials, or proprietary identifiers are included.