I want to walk through the design of a payment reconciliation engine in enough detail to be actually useful, because most articles on this topic stay at the level of "use idempotency keys" without explaining what that means in a system that is processing thousands of concurrent webhooks from three different payment providers.
Why Reconciliation Is Hard
The surface problem is simple: match incoming payment events to orders in your database. The real problem is that payment events arrive in a hostile environment.
Webhooks arrive out of order. A payment gateway might deliver the payment.captured event before the payment.authorized event. Your system cannot assume that arrival order reflects the actual sequence of state transitions.
Events are delivered at least once, not exactly once. Payment providers retry failed webhook deliveries. Your handler will receive duplicate events, sometimes minutes or hours apart. Processing a payment.captured event twice must not result in double-fulfillment.
Network errors cause retries on your side too. If your webhook handler crashes after updating the database but before returning a 200, the provider will retry. You need to handle reprocessing the same event without corruption.
The gateway's state and your state can drift. If a refund is processed directly through the payment gateway's dashboard without going through your system, your local records are now wrong. The reconciliation engine must be able to detect and repair this drift.
The Foundation: Idempotency Keys
Every operation in a payment system must be idempotent — safe to execute more than once with the same result. This is not optional.
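For incoming webhooks, I use the event ID provided by the payment gateway as the idempotency key. Before processing any event, I check a processed_webhook_events table for that ID. If it exists, I return 200 immediately without reprocessing. If it does not exist, I insert it and process the event within the same database transaction. The event ID column needs a unique constraint: if two deliveries of the same event race past the existence check, the second insert fails and its transaction rolls back instead of processing the event twice.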
This is the only reliable way to handle duplicate delivery. An in-memory cache is not sufficient — if your application restarts between the cache write and the database update, you have lost the deduplication record.
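A minimal sketch of that pattern, using SQLite only so the example runs anywhere. The orders table and its fulfilled_count column are hypothetical stand-ins for real event processing, and the sketch lets the primary key on event_id act as the existence check rather than issuing a separate SELECT.

```python
import sqlite3


def handle_event(conn: sqlite3.Connection, event_id: str, order_id: str) -> bool:
    """Return True if the event was processed, False if it was a duplicate."""
    try:
        # `with conn` commits on success and rolls back if the block raises,
        # so the dedup record and the side effect succeed or fail together.
        with conn:
            # The PRIMARY KEY on event_id makes this insert fail for a duplicate,
            # even when two handlers receive the same event at the same moment.
            conn.execute(
                "INSERT INTO processed_webhook_events (event_id) VALUES (?)",
                (event_id,),
            )
            # Hypothetical side effect standing in for real event processing.
            conn.execute(
                "UPDATE orders SET fulfilled_count = fulfilled_count + 1 WHERE id = ?",
                (order_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # already processed: acknowledge with 200 and do nothing


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_webhook_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, fulfilled_count INTEGER NOT NULL)")
conn.execute("INSERT INTO orders VALUES ('order_1', 0)")
conn.commit()

first = handle_event(conn, "evt_123", "order_1")
second = handle_event(conn, "evt_123", "order_1")  # duplicate delivery
count = conn.execute(
    "SELECT fulfilled_count FROM orders WHERE id = 'order_1'"
).fetchone()[0]
```

Here the second call returns False and the order is fulfilled only once, because the duplicate insert violates the key and the transaction rolls back.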
The State Machine
Payment status is not a boolean. I model it as a state machine with explicit transitions and transition guards.
A payment can be in one of seven states: pending, authorized, captured, partially_refunded, refunded, failed, disputed. Transitions between states follow strict rules: you cannot move from refunded to captured, and a disputed payment requires human review before any other transition.
I implement this as a state machine in the domain layer, separate from the database and the payment gateway API. The state machine validates every transition attempt and raises a domain exception if an invalid transition is attempted. This catches the out-of-order webhook problem: if a captured event arrives for a payment that is already refunded, the state machine rejects it and flags it for review rather than corrupting the record.
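As a rough illustration, the state machine can be sketched as a transition table plus a guard function. The state names come from the list above; the specific edges in ALLOWED, the InvalidTransition name, and the choice to allow pending to captured (so a capture that arrives before its authorization is not rejected) are assumptions for this example, not a definitive rule set.

```python
from enum import Enum


class PaymentState(Enum):
    PENDING = "pending"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    PARTIALLY_REFUNDED = "partially_refunded"
    REFUNDED = "refunded"
    FAILED = "failed"
    DISPUTED = "disputed"


S = PaymentState

# Assumed transition table: the edges are illustrative, not authoritative.
ALLOWED = {
    S.PENDING: {S.AUTHORIZED, S.CAPTURED, S.FAILED},
    S.AUTHORIZED: {S.CAPTURED, S.FAILED},
    S.CAPTURED: {S.PARTIALLY_REFUNDED, S.REFUNDED, S.DISPUTED},
    S.PARTIALLY_REFUNDED: {S.PARTIALLY_REFUNDED, S.REFUNDED, S.DISPUTED},
    S.REFUNDED: set(),
    S.FAILED: set(),
    S.DISPUTED: set(),  # leaves this state only through human review
}


class InvalidTransition(Exception):
    """Domain exception raised when a transition is not allowed."""


def transition(current: PaymentState, target: PaymentState) -> PaymentState:
    """Return the new state, or raise InvalidTransition if the move is illegal."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

A caller would catch InvalidTransition, leave the payment record untouched, and write the offending event to a review queue.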
Handling Race Conditions
In a system processing concurrent webhooks, two events for the same payment can arrive simultaneously and be processed on different application instances. Without coordination, you get race conditions.
I use database-level pessimistic locking on the payment record during state transitions. In PostgreSQL this is SELECT ... FOR UPDATE; in SQL Server, which has no FOR UPDATE clause on ordinary queries, the equivalent is the WITH (UPDLOCK, ROWLOCK) table hint. The row lock is acquired when the SELECT runs and held until the transaction commits or rolls back. The second concurrent handler blocks until the first completes, then reads the updated state, at which point the state machine determines whether the second event is still applicable.
Optimistic locking (row version comparison) is an alternative, but in high-concurrency payment scenarios, I prefer the explicit guarantee of pessimistic locking. The performance cost is acceptable given that the critical section is the state transition itself, which is fast.
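A sketch of the pessimistic variant in PostgreSQL terms. It assumes a DB-API connection with autocommit off and psycopg-style %s placeholders; the payments table, its id and status columns, and the is_allowed guard are hypothetical names for illustration.

```python
def apply_transition_locked(conn, payment_id, target_state, is_allowed):
    """Lock the payment row, validate the transition, and update it atomically.

    `conn` is assumed to be a DB-API connection to PostgreSQL, and
    `is_allowed(current, target)` is a hypothetical transition guard.
    Returns True if the transition was applied, False if it was rejected.
    """
    cur = conn.cursor()
    try:
        # FOR UPDATE takes a row lock that is held until commit or rollback,
        # so a concurrent handler for the same payment blocks on this statement.
        cur.execute(
            "SELECT status FROM payments WHERE id = %s FOR UPDATE",
            (payment_id,),
        )
        row = cur.fetchone()
        if row is None or not is_allowed(row[0], target_state):
            conn.rollback()  # releases the lock without changing anything
            return False
        cur.execute(
            "UPDATE payments SET status = %s WHERE id = %s",
            (target_state, payment_id),
        )
        conn.commit()  # releases the lock; a waiting handler now sees the new status
        return True
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

On SQL Server the SELECT would instead read from payments WITH (UPDLOCK, ROWLOCK), and the placeholder style depends on the driver.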
The Async Webhook Handler
Webhook handlers should be fast and dumb. My pattern:
- Validate the webhook signature (reject anything unsigned)
- Write the raw event payload to a webhook_queue table
- Return 200 immediately
A separate background worker reads from webhook_queue, processes events using the state machine, and handles errors. This decouples webhook receipt from processing — if processing fails, the event stays in the queue for retry without the gateway ever receiving a non-200 response.
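The receive-then-process split might look roughly like this. The signature scheme (hex-encoded HMAC-SHA256 over the raw body) is an assumption, since each provider documents its own; the webhook_queue columns, the function names, and the 401 response for a bad signature are likewise illustrative, and SQLite stands in for the real database.

```python
import hashlib
import hmac
import sqlite3

# Hypothetical queue schema for this sketch.
SCHEMA = """
CREATE TABLE webhook_queue (
    id INTEGER PRIMARY KEY,
    payload BLOB NOT NULL,
    processed INTEGER NOT NULL DEFAULT 0
)
"""


def receive_webhook(conn, secret: bytes, body: bytes, signature_hex: str) -> int:
    """Verify the signature, enqueue the raw payload, and return an HTTP status."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so it does not leak how much matched.
    if not hmac.compare_digest(expected.encode("ascii"), signature_hex.encode("utf-8")):
        return 401
    with conn:  # commit the enqueue before acknowledging
        conn.execute("INSERT INTO webhook_queue (payload) VALUES (?)", (body,))
    return 200


def drain_queue(conn, process) -> int:
    """Process pending rows oldest first; rows that fail stay queued for retry."""
    done = 0
    rows = conn.execute(
        "SELECT id, payload FROM webhook_queue WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        try:
            process(payload)
        except Exception:
            continue  # leave the row unprocessed so the next run retries it
        with conn:
            conn.execute("UPDATE webhook_queue SET processed = 1 WHERE id = ?", (row_id,))
        done += 1
    return done
```

In this sketch, processing a row and marking it processed are separate steps, so a crash between them means the event is processed again on the next run. That is safe only because the processing step itself is idempotent.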
The Reconciliation Sweep
Even with all of the above, state drift is possible. The reconciliation sweep is a scheduled job that runs every few hours and compares your local payment records against the payment gateway's records via its reporting API.
For each discrepancy, the sweep creates a reconciliation_exception record with the local state, the gateway state, and the difference. Exceptions fall into automated categories (fixable by the sweep) and manual categories (requiring human review). Automated fixes include marking payments as captured when the gateway shows capture but the local record shows authorized — a common result of a webhook delivery failure.
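The comparison step of the sweep can be sketched as below. Both sides are simplified to dictionaries mapping payment ID to state; fetching from the gateway's reporting API is out of scope, and the AUTO_FIXABLE policy and field names are assumptions built around the one automated fix named above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed policy: (local_state, gateway_state) pairs the sweep may repair itself.
AUTO_FIXABLE = {("authorized", "captured")}


@dataclass
class ReconciliationException:
    payment_id: str
    local_state: Optional[str]    # None if the payment is missing locally
    gateway_state: Optional[str]  # None if the gateway has no such payment
    category: str                 # "automated" or "manual"


def sweep(local: Dict[str, str], gateway: Dict[str, str]) -> List[ReconciliationException]:
    """Record every mismatch and repair the ones covered by AUTO_FIXABLE."""
    exceptions = []
    for payment_id in sorted(local.keys() | gateway.keys()):
        local_state = local.get(payment_id)
        gateway_state = gateway.get(payment_id)
        if local_state == gateway_state:
            continue
        fixable = (local_state, gateway_state) in AUTO_FIXABLE
        category = "automated" if fixable else "manual"
        # The exception is recorded even when fixed, so it serves as the audit trail.
        exceptions.append(
            ReconciliationException(payment_id, local_state, gateway_state, category)
        )
        if fixable:
            local[payment_id] = gateway_state  # repair the local drift
    return exceptions
```

A refund issued from the gateway dashboard would show up here as a captured/refunded mismatch and, under this assumed policy, go to manual review.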
This sweep is the safety net. The event-driven pipeline handles 99%+ of cases correctly in real time. The sweep catches the remainder and provides an audit trail that satisfies compliance requirements.
For teams working on financial systems who also need to think about database security alongside architecture, my post on database security practices covers the infrastructure layer.
If you are designing or auditing a payment system and want a review of the architecture, get in touch.