Multi-Source Lead-Generation Pipeline
A regional direct-mail marketing firm ran its entire mortgage-protection lead-generation operation out of spreadsheets: daily closing data from four regional mail partners, mail-piece production and delivery tracking, inbound call-center responses, and lead handoff to an insurance agency — all tracked by hand. I replaced that with a PostgreSQL-backed pipeline deployed on AWS, with a login-gated web app for daily monitoring and reporting.
The Problem
The core engineering challenge was heterogeneity. Four partners delivered the same logical data — mortgage records, mail-piece barcodes, delivery confirmations — but in three different barcode encodings and four different file formats across three transport channels. Each weekday, data for all four partners arrives as six files that are not uniform:
- Three transport channels — an FTPS file endpoint, a shared email inbox (read via the Microsoft Graph API), and partner SFTP.
- Two mortgage-file shapes — one partner embeds the mail barcode directly in a ~32-column spreadsheet; the other three send ~15-column CSVs with no barcode at all.
- Three barcode encodings — the USPS Intelligent Mail barcode arrives as a 65-character bar-state string, a 20-digit tracking code, or a 31-digit form. All three encode the same physical barcode.
- Four delivery-confirmation formats — positional CSV, fixed-width text inside a zip, and two spreadsheet variants, each on its own channel.
A naive design would build four parallel pipelines. Instead I unified the data path wherever the inputs were logically the same and isolated only the genuinely partner-specific deltas.
Architecture
Canonical-form ingestion
The pivot of the whole design is choosing one canonical key — the 20-digit barcode tracking code — and converting everything to it at ingestion. A row's barcode might arrive in any of the three encodings; a single conversion module brings them all to the canonical 20 digits, stored under a UNIQUE constraint that doubles as the cross-partner delivery join key. All four partners' mortgage data lands in one shared staging table, with partner identity derived from the source filename. A single core-processing function promotes staged rows to the main table — partner travels with the row rather than forking the code path.
The barcode decoder
The single highest-risk component was the barcode encoder/decoder — a from-scratch implementation of USPS Publication 97 (11-bit CRC, codeword table, bit-to-bar mapping) to convert between the bar-state string and the numeric forms. I validated it against 11,396 real production barcode pairs with a 100% round-trip pass rate. Because of that validation, a decode failure at runtime is treated as a data-quality signal — malformed partner input — not a bug in the decoder. A meaningful operational distinction.
Correctness under re-runs
The pipeline is idempotent at every layer, so re-sends and re-runs can't corrupt state:
- File level — every inbound file is SHA-256 hashed before a byte is read; a previously seen hash is rejected.
- Mail-piece level — the canonical barcode is
UNIQUE; duplicate pieces are silently skipped viaON CONFLICT DO NOTHING. - Delivery level — first-wins confirmation: once a piece is marked delivered, a later confirmation (even with a different date) is logged for audit and ignored.
Per-row error isolation means one malformed row never blocks the rest of a file.
Business logic in the database
All lifecycle rules are encapsulated in PL/pgSQL functions, so every ingestion path inherits identical behavior. Delivery confirmation, for example, filters by partner inside the function — so one partner's feed can never accidentally overwrite another's record. Putting these invariants in the database rather than scattering them across scripts is what kept four input channels behaving consistently.
Reporting layer
Roughly a dozen materialized views back the standard reports — A/B response rates by mailer version, delivery performance by partner, callback rates by geography and mail-piece type — all refreshed concurrently once per day after ingestion completes. A read-only web app exposes them alongside an ad-hoc SQL window that runs inside a READ ONLY transaction so it can't be used to mutate data.
Deployment
The whole system ships as a single parameterized CloudFormation stack: managed Postgres (RDS), shared file storage (EFS), containerized ingestion tasks (ECS Fargate), and orchestration via Step Functions state machines triggered on a daily schedule. Failures publish to a notification topic; all task output goes to centralized logs.
Multi-tenant by construction. Every account-specific value (network, credentials, schedules) is a stack parameter, so the same template and the same container image redeploy unchanged into a new client's AWS account. Onboarding a tenant is mechanical — no code changes, no cross-account trust.
Keyless operator access. Ops access to the database runs through a session-manager bastion with no SSH keys and no open ports, scoped by an account-local IAM group. Adding an operator is a single command.
Results
- Replaced a spreadsheet workflow with a production pipeline processing four partners through one unified code path.
- Barcode decoder validated at 100% on 11,396 real production pairs.
- Full lifecycle tracking — every mail piece moves through mailed → delivered → responded → converted → handed off, with a complete audit log of each transition.
- Self-service reporting — a dozen materialized views and an ad-hoc SQL console replaced manual report assembly.
- Deployed to two AWS tenants from one template, designed for mechanical rollout to more.
This case study describes a real client engagement. The client, partners, and call-center vendor are anonymized, and no production data, credentials, or proprietary identifiers are included.