Introduction: The Deceptive Simplicity of Webhooks

Integrating a payment gateway like Razorpay often starts with a false sense of security. The documentation shows a happy path: user pays, Razorpay calls your webhook, you update the database, and ship the product. In a development environment with one request per minute, this works flawlessly.

In production, however, this naive approach is a ticking time bomb.

At scale, the internet is hostile. Networks are flaky, servers restart, and third-party APIs experience latency. To ensure reliability, Razorpay (like Stripe and others) operates on an “at-least-once” delivery guarantee. This means they will resend the same webhook until you acknowledge it with a 200 OK, so you must expect to see the same event more than once. Furthermore, ordering is not guaranteed. You might receive a payment.captured event before the payment.authorized event for the same transaction.

If your backend treats every incoming webhook as a command to immediately execute business logic (e.g., “Add Credits to Wallet”), you are architecting a system that will inevitably double-credit users, corrupt data states, and fail under load.

This deep dive explores how to architect a production-grade, event-driven ingestion engine for Razorpay webhooks. We will move beyond simple handlers to a robust system capable of handling duplicates, out-of-order delivery, and high concurrency.


Pillar 1: The “Accept-Then-Process” Pattern (Async Architecture)

The Bottleneck: The Fatal Mistake of Synchronous Processing

The most common anti-pattern in webhook integration is performing business logic inside the HTTP request handler.

Imagine your webhook endpoint looks like this:

  1. Receive Request.
  2. Verify Signature.
  3. Query Database.
  4. Call internal “Provisioning Service” (gRPC/HTTP).
  5. Send Email Receipt (SMTP).
  6. Return 200 OK to Razorpay.

Why this fails:

  • Timeouts: Razorpay has a strict timeout (usually 5-10 seconds). If your SMTP server lags or your database locks, the request times out. Razorpay marks the delivery as failed and retries.
  • Cascading Retries: Because the first request “failed” (timed out), Razorpay sends it again. Your server might have actually completed the provisioning but failed to respond in time. Now, the retry triggers the provisioning again.
  • Coupling: Your ingress is now tightly coupled to the availability of your email service and internal microservices.

The Strategy: Ingress as a Buffer

To solve this, we decouple Ingestion from Processing. The webhook endpoint should have zero knowledge of business logic. Its Single Responsibility is to verify the message and persist it safely.

The flow changes to:

  1. API Layer: Verify Signature -> Persist Payload to webhook_logs -> Push Event ID to Queue (SQS/RabbitMQ/Redis) -> Return 200 OK.
  2. Worker Layer: Consume ID -> Load Payload -> Execute Business Logic -> Update Status.
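
The two layers above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the in-memory dict standing in for webhook_logs, the in-process queue standing in for SQS/RabbitMQ/Redis, and the names ingress, worker_step, and WEBHOOK_SECRET are all assumptions made for the example.

```python
import hashlib
import hmac
import json
import queue

# Hypothetical stand-ins: a dict for the webhook_logs table and an
# in-process queue for SQS/RabbitMQ/Redis.
WEBHOOK_SECRET = b"test_secret"
webhook_logs = {}
event_queue = queue.Queue()


def ingress(event_id, signature, body):
    """API layer: verify, persist, enqueue, acknowledge. No business logic."""
    expected = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 400
    if event_id not in webhook_logs:
        # Duplicate deliveries are acknowledged but never re-queued
        webhook_logs[event_id] = {"payload": body, "status": "PENDING"}
        event_queue.put(event_id)
    return 200


def worker_step(handle):
    """Worker layer: consume one ID, load the payload, run the logic, update status."""
    event_id = event_queue.get()
    record = webhook_logs[event_id]
    try:
        handle(json.loads(record["payload"]))
        record["status"] = "COMPLETED"
    except Exception as exc:
        record["status"] = "FAILED"
        record["error_message"] = str(exc)
    return event_id
```

Note that ingress never calls the business handler. A slow or failing handler only affects worker_step, so the response time seen by Razorpay stays independent of downstream services.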


Pillar 2: Idempotency & Deduplication

Razorpay explicitly states that during network fluctuations, they may send the same webhook event multiple times. If your server processes the same payment.captured event twice, you might trigger duplicate fulfillment.

The X-Razorpay-Event-Id

Every webhook request contains a header: X-Razorpay-Event-Id. Its value is generated by Razorpay and uniquely identifies a specific event occurrence. Even if Razorpay retries the delivery 10 times, this ID remains constant.

Do not use the payment_id for deduplication at the ingress layer. A single payment ID will have multiple events (authorized, captured, failed). The Event ID is your unique constraint key.
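
A small illustration of why the choice of key matters. The event and payment IDs below are made up for the example, and dedupe is a hypothetical helper; the third delivery is a retry of the first.

```python
# Hypothetical deliveries as (event_id, payment_id, event_type).
deliveries = [
    ("evt_A", "pay_1", "payment.authorized"),
    ("evt_B", "pay_1", "payment.captured"),
    ("evt_A", "pay_1", "payment.authorized"),  # retry of the first delivery
]


def dedupe(items, key):
    """Keep the first delivery for each distinct key value."""
    seen, kept = set(), []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            kept.append(item)
    return kept


by_event_id = dedupe(deliveries, key=lambda d: d[0])
by_payment_id = dedupe(deliveries, key=lambda d: d[1])

print([d[2] for d in by_event_id])    # ['payment.authorized', 'payment.captured']
print([d[2] for d in by_payment_id])  # ['payment.authorized']
```

Keying on the event ID drops only the retry. Keying on the payment ID also drops the legitimate payment.captured event.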

Implementation: The Redis Guard

To handle high concurrency (e.g., “Retry Storms” where two identical requests arrive milliseconds apart), a database unique constraint might be too slow or result in lock contention. A Redis-based distributed lock or a simple key check is preferred at the API gateway level.

import hashlib
import hmac

import redis.asyncio as redis
from fastapi import HTTPException, Request

# Async client so Redis calls do not block the event loop
r = redis.Redis(host='localhost', port=6379, db=0)

async def razorpay_webhook_ingress(request: Request):
    # 1. Extract Headers
    signature = request.headers.get('X-Razorpay-Signature')
    event_id = request.headers.get('X-Razorpay-Event-Id')
    payload_body = await request.body()

    # Reject requests that lack either header before doing any work
    if not signature or not event_id:
        raise HTTPException(status_code=400, detail="Missing Headers")

    # 2. Security: Verify Signature (HMAC SHA256)
    # ALWAYS do this first. CPU is cheaper than DB connections.
    secret = b"YOUR_WEBHOOK_SECRET"
    generated_signature = hmac.new(secret, payload_body, hashlib.sha256).hexdigest()

    # Compare as bytes: compare_digest raises on non-ASCII str input
    if not hmac.compare_digest(generated_signature.encode(), signature.encode()):
        raise HTTPException(status_code=400, detail="Invalid Signature")

    # 3. Idempotency Check (Redis SET with NX and EX)
    # Key format: webhook:event_id
    # TTL: 24 hours (Razorpay retries usually happen within hours)
    cache_key = f"webhook:{event_id}"

    # A single SET NX EX is atomic. A separate SETNX followed by EXPIRE can
    # leave a key without a TTL if the process dies between the two calls.
    # Returns True if the key was set, None if it already existed.
    is_new = await r.set(cache_key, "received", nx=True, ex=86400)

    if not is_new:
        # We have already received this event ID.
        # Immediately return 200 OK to stop Razorpay from retrying.
        # Do NOT process it again.
        return {"status": "ok", "message": "Duplicate event ignored"}

    # 4. Proceed to Persistence (See Pillar 3)
    try:
        await persist_to_db(event_id, payload_body)
    except Exception:
        # Release the guard so Razorpay's retry is not swallowed as a duplicate
        await r.delete(cache_key)
        raise

    return {"status": "ok"}
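
For unit tests it can help to exercise the guard without a running Redis. The class below is a hypothetical test double that mimics "set only if absent, with a TTL"; the names InMemoryGuard, first_seen, and release are inventions for this sketch. The release method is an assumption of the sketch as well: it models deleting the key when persistence fails, so that a later retry is not silently dropped.

```python
import time


class InMemoryGuard:
    """Test double that mimics the semantics of Redis SET key value NX EX ttl."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expires_at = {}

    def first_seen(self, event_id, ttl_seconds=86400):
        """Return True only for the first call per event_id within the TTL."""
        now = self._clock()
        expiry = self._expires_at.get(event_id)
        if expiry is not None and expiry > now:
            return False  # still guarded: treat as a duplicate
        self._expires_at[event_id] = now + ttl_seconds
        return True

    def release(self, event_id):
        """Forget an event so that a redelivery is accepted again."""
        self._expires_at.pop(event_id, None)
```

Injecting the clock lets a test advance time past the TTL without sleeping.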


Pillar 3: Storage Implementation (The Raw Log)

Before any processing happens, the raw payload must be saved. This provides an audit trail and allows for re-playing events if your worker logic contains bugs.

We need a table specifically for ingestion, separate from your payments or orders tables.

Database Schema (PostgreSQL)

CREATE TYPE processing_status AS ENUM ('PENDING', 'PROCESSING', 'COMPLETED', 'FAILED', 'IGNORED');

CREATE TABLE razorpay_event_logs (
    id BIGSERIAL PRIMARY KEY,
    
    -- The Holy Grail of Idempotency
    razorpay_event_id VARCHAR(255) NOT NULL UNIQUE, 
    
    -- e.g. "payment.authorized", "payment.captured", "payment.failed"
    event_type VARCHAR(50) NOT NULL, 
    
    -- Flexible storage for evolving API schemas
    payload JSONB NOT NULL, 
    
    -- Tracking lifecycle
    status processing_status NOT NULL DEFAULT 'PENDING',
    
    -- For diagnostics
    error_message TEXT,
    retry_count INT DEFAULT 0,
    
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Index for querying by entity (e.g. find all logs for order_123)
CREATE INDEX idx_rpay_payload_entity ON razorpay_event_logs ((payload->'payload'->'payment'->'entity'->>'order_id'));

Key Decisions:

  • razorpay_event_id Unique Index: This is the hard stop at the database level. If Redis fails or is flushed, the database ensures integrity.
  • JSONB Payload: Razorpay adds fields to their webhooks occasionally. Using a rigid schema here effectively breaks your integration when they update their API. Always store the raw JSON.
  • Status Enum: We track the lifecycle of the event processing, not the payment itself.
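
To see the unique constraint acting as the hard stop, here is a minimal sketch that uses SQLite from the Python standard library as a stand-in for PostgreSQL. The trimmed-down table and the persist_event helper are assumptions for illustration. In PostgreSQL the equivalent statement would be INSERT ... ON CONFLICT (razorpay_event_id) DO NOTHING.

```python
import json
import sqlite3

# In-memory SQLite database standing in for PostgreSQL
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE razorpay_event_logs (
        id INTEGER PRIMARY KEY,
        razorpay_event_id TEXT NOT NULL UNIQUE,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'PENDING'
    )
    """
)


def persist_event(event_id, event_type, payload):
    """Insert the raw event; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO razorpay_event_logs "
        "(razorpay_event_id, event_type, payload) VALUES (?, ?, ?)",
        (event_id, event_type, json.dumps(payload)),
    )
    conn.commit()
    # rowcount is 1 when a row was inserted and 0 when the insert was ignored
    return cur.rowcount == 1
```

Only a call that returns True should go on to enqueue the event, which keeps the database as the final arbiter even when the cache is cold.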

Pillar 4: Handling Sequencing & Race Conditions (The “Out-of-Order” Problem)

This is the most complex challenge.

Scenario:

  1. User completes payment.
  2. Razorpay emits payment.authorized.
  3. Razorpay emits payment.captured.
  4. Network anomaly: Your server receives payment.captured (Event B) at 10:00:01.
  5. Your server receives payment.authorized (Event A) at 10:00:05.

If you blindly process these events, you might move your local order state from CAPTURED back to AUTHORIZED, or trigger a “New Payment” workflow twice.

We have two primary strategies to handle this.

Solution 1: The State Machine Guard

This approach relies on a strict definition of valid state transitions in your application logic. You must check the current state of the Order/Payment in your database before applying the webhook update.

The Logic:

  • CREATED -> AUTHORIZED (Valid)
  • AUTHORIZED -> CAPTURED (Valid)
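  • CAPTURED -> AUTHORIZED (INVALID - Ignore)

import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

# `db` (a SQLAlchemy session) and `Order` (the model) come from your application
def process_webhook_event(event_type, payment_id, payload):
    # Lock the row to prevent race conditions during the check
    order = db.query(Order).with_for_update().filter_by(payment_id=payment_id).first()

    if not order:
        # Handle case where webhook arrives before your system created the order record
        # (Rare, but possible if Order creation is async)
        return handle_orphan_payment(payload)

    now = datetime.now(timezone.utc)

    if event_type == "payment.authorized":
        if order.status in ["CAPTURED", "COMPLETED"]:
            logger.info(f"Skipping authorized event for {payment_id}; state is already {order.status}")
            db.rollback()  # Release the row lock
            return  # No-op

        order.status = "AUTHORIZED"
        order.authorized_at = now

    elif event_type == "payment.captured":
        # Capture is terminal for the payment flow usually, so a repeated or
        # late capture must not overwrite CAPTURED or regress COMPLETED
        if order.status in ["CAPTURED", "COMPLETED"]:
            logger.info(f"Skipping captured event for {payment_id}; state is already {order.status}")
            db.rollback()  # Release the row lock
            return  # No-op

        order.status = "CAPTURED"
        order.captured_at = now

    db.commit()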
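
As the number of states grows, chains of if/elif become hard to audit. One alternative is to express the valid transitions as data. The sketch below is illustrative only: the FAILED state, the choice to allow CREATED -> CAPTURED (so a capture that arrives first is accepted), and the names ALLOWED_TRANSITIONS, EVENT_TO_STATE, and next_state are assumptions, not part of Razorpay's API.

```python
# Assumed local states; extend the table to match your own order lifecycle.
ALLOWED_TRANSITIONS = {
    "CREATED": {"AUTHORIZED", "CAPTURED", "FAILED"},
    "AUTHORIZED": {"CAPTURED", "FAILED"},
    "CAPTURED": set(),
    "FAILED": set(),
}

EVENT_TO_STATE = {
    "payment.authorized": "AUTHORIZED",
    "payment.captured": "CAPTURED",
    "payment.failed": "FAILED",
}


def next_state(current, event_type):
    """Return the new state, or the current one if the event is stale or unknown."""
    target = EVENT_TO_STATE.get(event_type)
    if target in ALLOWED_TRANSITIONS.get(current, set()):
        return target
    return current


# Out-of-order delivery: captured arrives first, authorized arrives late.
state = next_state("CREATED", "payment.captured")
state = next_state(state, "payment.authorized")
print(state)  # CAPTURED
```

The late payment.authorized event is ignored because CAPTURED has no outgoing transitions in the table.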

Solution 2: Fetch-on-Sync (The Source of Truth)

The “State Machine” approach assumes your local logic perfectly mirrors Razorpay’s logic. A more robust (albeit slightly slower) approach is to treat the webhook merely as a “Wake Up Call”.

When a webhook arrives, we don’t trust the payload’s state implicitly. Instead, we ask Razorpay for the current state.

The Logic:

  1. Receive payment.authorized (Event A).
  2. Worker picks up Event A.
  3. Worker calls Razorpay API: GET /payments/{payment_id}.
  4. API returns: status: captured (Because the capture happened milliseconds later on Razorpay’s side).
  5. Worker updates local DB to CAPTURED.
  6. Later, payment.captured (Event B) arrives.
  7. Worker calls Razorpay API: GET /payments/{payment_id}.
  8. API returns: status: captured.
  9. Worker updates local DB to CAPTURED (Idempotent update).

This sidesteps the out-of-order problem because the arrival order of events no longer matters: every event, early or late, causes the worker to converge on the current state held by the source.

Code Flow

import razorpay

client = razorpay.Client(auth=("KEY", "SECRET"))

class RetryableException(Exception):
    """Signals the queue consumer to retry this event later."""

def worker_process_event(event_id, payload):
    payment_id = payload['payload']['payment']['entity']['id']

    # FETCH-ON-SYNC STRATEGY
    # Ignore the state inside the webhook payload.
    # Fetch the authoritative state from Razorpay.
    try:
        live_payment = client.payment.fetch(payment_id)
    except Exception as e:
        # Handle API failures/rate limits with retry (exponential backoff)
        raise RetryableException(str(e)) from e

    current_status = live_payment['status'] # e.g., 'captured'
    amount = live_payment['amount']

    # Upsert logic: Create or Update based on payment_id
    # (upsert_payment_record and mark_event_log_completed are your own helpers)
    upsert_payment_record(payment_id, current_status, amount)

    mark_event_log_completed(event_id)

Trade-off: This introduces an external API call for every webhook, which increases latency and eats into rate limits. However, for financial data consistency, this is often the preferred architecture.
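
One way to soften the rate-limit cost is to retry failed fetches with exponential backoff and jitter. The helper below is a sketch under stated assumptions: RetryableError and fetch_with_backoff are hypothetical names, and the fetch callable is whatever function you use to call Razorpay. Nothing here is part of the Razorpay SDK.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, HTTP 429/5xx)."""


def fetch_with_backoff(fetch, payment_id, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch(payment_id), retrying RetryableError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch(payment_id)
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: let the queue redeliver the event later
            delay = base_delay * (2 ** attempt)
            # Random jitter keeps many workers from retrying in lockstep
            sleep(delay + random.uniform(0, delay))
```

Passing sleep as a parameter keeps the helper testable: a test can record the requested delays instead of actually waiting.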


Pillar 5: Reconciliation (The Safety Net)

No matter how good your webhook architecture is, you will miss events.

  • Your servers might be down during a maintenance window.
  • A bad deployment might break the signature verification logic.
  • Razorpay might disable webhooks if your endpoint returns 500s too often.

To catch what the webhooks miss, you need a Reconciliation Worker.

The Cron Job

Every night (or every hour), a background job should run to close the gap between your database and Razorpay.

Logic:

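  1. Fetch all payments from Razorpay created in the last X minutes/hours (the list API filters on creation time, so keep the window wide enough to cover payments that are captured late).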
  2. Iterate through them.
  3. Compare the status in Razorpay vs. your Local DB.
  4. If different, trigger the logic used in the “Fetch-on-Sync” worker.

# Sketch of the reconciliation job
# (`client`, `db`, `logger`, create_payment and update_payment come from your application)
from datetime import datetime, timedelta, timezone

def reconcile_payments(time_window_minutes=60, page_size=100):
    start_time = datetime.now(timezone.utc) - timedelta(minutes=time_window_minutes)
    skip = 0

    while True:
        # Razorpay supports filtering by 'from' timestamp (Unix seconds)
        # 'count' and 'skip' paginate through high-volume result sets
        payments = client.payment.all({
            'from': int(start_time.timestamp()),
            'count': page_size,
            'skip': skip,
        })
        items = payments['items']

        for rp_payment in items:
            local_payment = db.get_payment(rp_payment['id'])

            if not local_payment:
                logger.warning(f"MISSING PAYMENT FOUND: {rp_payment['id']}")
                create_payment(rp_payment)
            elif local_payment.status != rp_payment['status']:
                logger.warning(f"STATUS MISMATCH: {rp_payment['id']} Local: {local_payment.status} Remote: {rp_payment['status']}")
                update_payment(local_payment, rp_payment['status'])

        # A short page means there is nothing left to fetch
        if len(items) < page_size:
            break
        skip += page_size
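
The comparison step is easier to unit test when it is separated from the API and database calls. A minimal sketch follows, assuming remote payments are dicts with id and status keys and local state is a plain mapping from payment ID to status; diff_payments is a hypothetical helper name.

```python
def diff_payments(remote_items, local_status_by_id):
    """Return (missing, mismatched); mismatched holds (id, local_status, remote_status)."""
    missing, mismatched = [], []
    for item in remote_items:
        local_status = local_status_by_id.get(item["id"])
        if local_status is None:
            missing.append(item["id"])
        elif local_status != item["status"]:
            mismatched.append((item["id"], local_status, item["status"]))
    return missing, mismatched


remote = [
    {"id": "pay_1", "status": "captured"},
    {"id": "pay_2", "status": "captured"},
    {"id": "pay_3", "status": "failed"},
]
local = {"pay_1": "captured", "pay_2": "authorized"}

print(diff_payments(remote, local))
# (['pay_3'], [('pay_2', 'authorized', 'captured')])
```

The job can then log or alert on the two lists before applying any repair, which gives you a dry-run mode at no extra cost.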


Conclusion

Building a webhook ingestion engine for payments is not just about creating an API endpoint. It is about acknowledging that distributed systems are inherently unreliable and architecting defenses against that unreliability.

By implementing the Accept-Then-Process pattern, guarding against duplicates via Event IDs, handling out-of-order delivery through State Machines or Fetch-on-Sync, and implementing a Reconciliation safety net, you transform a fragile integration into a bulletproof financial engine.

Your payment infrastructure is the lifeline of your business. Don’t let a network retry be the reason that lifeline breaks.