Beyond the Happy Path: Architecting Fault-Tolerant Recurring Payments with Razorpay

The Illusion of the “Happy Path”
In the world of distributed systems, the “happy path” is a sedative. It lulls engineers into a false sense of security where networks never partition, latency is zero, and customers always have sufficient funds in their bank accounts. When building one-off payment flows, you might get away with a simple request-response model. But in the domain of recurring payments and subscriptions, the happy path is the exception, not the rule.
A subscription system is a long-running process. It is a temporal contract between your architecture and a user’s financial state. Over a 12-month period, cards expire, banks flag transactions as fraudulent, webhooks get dropped, and users change their minds.
If you are architecting a subscription engine on top of Razorpay for a high-scale FinTech or SaaS product, treating payment failure as an edge case is a catastrophic design flaw. It must be a core state of your system.
This article is not a “getting started” guide. We are diving deep into the architectural patterns required to build a fault-tolerant, resilient subscription system. We will cover state machine recovery, idempotent webhook ingestion, complex proration logic, and the “Dunning” architecture that prevents involuntary churn.
1. The Subscription State Machine: Purgatory vs. Death
The most common misconception I see in backend implementations is treating the Razorpay subscription status as a binary: active or cancelled. The reality is nuanced. The lifecycle of a subscription involves critical intermediate states that dictate whether your system should retry a charge or revoke access.
The two states that cause the most confusion—and revenue loss—are pending and halted.
The Critical Distinction
- Active: The happy path. Payment was successful, access is granted.
- Pending (Purgatory): Razorpay attempted a charge, but it failed (e.g., insufficient funds). Crucially, retries are active. Razorpay’s internal scheduler will attempt to charge the card again based on your retry configuration (usually T+1, T+3 days).
- Your Action: Do not cancel access immediately. This is often a temporary glitch.
- Halted (The Cliff): All retry attempts have been exhausted. Razorpay has stopped trying to charge the card.
- Your Action: This is where most implementations fail. They treat
haltedascancelled.
- Your Action: This is where most implementations fail. They treat

The “Rescue” Operation: Reviving Halted Subscriptions
When a subscription hits the halted state, the intuitive UI flow is to ask the user to “Update Payment Method.” However, purely updating the card doesn’t always immediately trigger a charge or move the state back to active instantly.
A superior architectural pattern is to programmatically rescue the subscription using the Invoice API.
Instead of creating a new subscription (which breaks historical data and metrics), you can issue an ad-hoc invoice linked to the subscription. When this invoice is paid, Razorpay automatically transitions the subscription from halted back to active.
The Implementation Strategy:
- Listen for the
subscription.haltedwebhook. - In your frontend, present a “Pay Outstanding Balance” flow.
- Backend generates a manual invoice for the pending amount.
- User pays the invoice -> Subscription auto-heals.
import razorpay
from my_app.exceptions import PaymentError
client = razorpay.Client(auth=("KEY_ID", "KEY_SECRET"))
def rescue_halted_subscription(subscription_id: str, user_id: str):
"""
Rescues a halted subscription by creating a manual invoice
for the failed billing cycle.
"""
try:
# 1. Fetch Subscription to get details
sub = client.subscription.fetch(subscription_id)
if sub['status'] != 'halted':
raise PaymentError("Subscription is not in halted state.")
# 2. Identify the unpaid amount (usually the plan amount)
# In a real scenario, calculate strictly based on outstanding cycles
amount_due = sub['plan_details']['amount']
# 3. Create an Invoice linked to this subscription
# This tells Razorpay: "If this is paid, fix the sub."
invoice_payload = {
"type": "invoice",
"customer_id": sub['customer_id'],
"line_items": [
{
"name": "Outstanding Balance Recovery",
"description": "Payment for failed renewal",
"amount": amount_due,
"currency": "INR"
}
],
"subscription_id": subscription_id, # CRITICAL LINK
"sms_notify": 1,
"email_notify": 1
}
invoice = client.invoice.create(invoice_payload)
return invoice # Return invoice URL to frontend
except Exception as e:
# Log critical failure
raise PaymentError(f"Failed to create rescue invoice: {str(e)}")
By linking the subscription_id in the invoice creation, you maintain the continuity of the subscription entity, preserving the billing_cycle count and lifetime value (LTV) data.
2. Webhook Architecture & Idempotency
Webhooks are the heartbeat of a subscription system. However, webhooks are asynchronous, unreliable, and prone to duplication. If your architecture assumes that webhooks arrive exactly once and in the correct order, you will eventually double-credit a user or disable an account that just paid.
The Challenges
- Out-of-Order Delivery: You might receive
subscription.chargedbeforesubscription.pending. - Duplication: Razorpay (and most gateways) operates on an “at-least-once” delivery guarantee. If your server times out or responds with a 500 error, they will send the event again.
- Security: An attacker could spoof a webhook to grant themselves free access.

Security: HMAC Verification
Before looking at the payload, you must verify the signature. This is not optional. It prevents Man-in-the-Middle attacks and spoofing.
import hmac
import hashlib
import json
def verify_webhook_signature(request_body: bytes, signature: str, secret: str) -> bool:
"""
Verifies the X-Razorpay-Signature header.
Args:
request_body: Raw bytes of the POST body
signature: The hex string from X-Razorpay-Signature header
secret: Your webhook secret configured in Dashboard
"""
generated_signature = hmac.new(
key=bytes(secret, 'utf-8'),
msg=request_body,
digestmod=hashlib.sha256
).hexdigest()
# hmac.compare_digest prevents timing attacks
return hmac.compare_digest(generated_signature, signature)
Idempotency: Handling Duplicates
Idempotency ensures that processing the same event multiple times yields the same result as processing it once.
The Anti-Pattern: checking your database for if transaction_exists: return.
The Better Pattern: Using a distributed lock (Redis) combined with an Event Log.
You should use the x-razorpay-event-id header or the payload.id to uniquely identify the event.

Implementation with Redis (Python/Pseudo-code):
import redis
from django.db import transaction
r = redis.Redis(host='localhost', port=6379, db=0)
def process_webhook(event_id, payload):
# 1. Atomic Check via Redis
# SETNX sets the key only if it doesn't exist.
# We set an expiry (TTL) of 24 hours to prevent infinite growth.
is_new_event = r.set(f"webhook_event:{event_id}", "processing", ex=86400, nx=True)
if not is_new_event:
# We have seen this event. Return 200 to stop Razorpay retries.
return 200, "Event already processed"
try:
# 2. Database Transaction
with transaction.atomic():
# ... process business logic (provision access, record payment) ...
pass
# 3. Mark as success in Redis (Optional, for debugging)
r.set(f"webhook_event:{event_id}", "success", xx=True)
return 200, "OK"
except Exception as e:
# 4. Critical: If processing fails, DELETE the key so we can retry
r.delete(f"webhook_event:{event_id}")
raise e # Let the server 500 so Razorpay retries
This logic ensures that even if two requests with the same ID hit your servers in parallel (race condition), only one will acquire the Redis lock and process the logic.
3. Handling Upgrades & Downgrades (Proration)
Moving a user from a “Basic” plan (₹500/mo) to a “Pro” plan (₹1000/mo) in the middle of a billing cycle is mathematically complex. You cannot simply switch the plan_id. You must account for the money already paid for the unused days of the Basic plan.
schedule_change_at: The Lever of Control
Razorpay provides the schedule_change_at parameter in the subscription update API. You have two choices:
cycle_end: The upgrade happens when the current month finishes. Easy logic, but delayed revenue and user gratification.now: The upgrade happens immediately. This triggers Proration.

The Proration Flow
When you update a subscription with schedule_change_at="now" and proration_behavior="create_prorations", Razorpay does the following:
- Calculates the unused amount of the current plan (Credits).
- Calculates the cost of the new plan for the remaining days of the cycle (Debits).
- Creates an immediate invoice for the difference:
(New Plan Cost) - (Old Plan Unused Credit).
Architectural Warning: Your local database must stay in sync with this logic. Do not try to calculate proration yourself. Rely on the webhook subscription.charged that results from the upgrade to update your local “Plan Tier” field.
Code Example: Performing an Immediate Upgrade
def upgrade_subscription(sub_id: str, new_plan_id: str):
try:
# Razorpay handles the math (credits/debits)
resp = client.subscription.edit(sub_id, {
"plan_id": new_plan_id,
"schedule_change_at": "now",
"proration_behavior": "create_prorations",
"quantity": 1
})
# NOTE: Do NOT update user access to 'Pro' here immediately.
# Wait for the successful payment webhook (subscription.charged)
# generated by this change.
# If the proration charge fails, the user stays on the old plan.
return resp
except Exception as e:
# Handle scenarios like "Upgrade not possible on halted sub"
handle_error(e)
Key Takeaway: Never update user permissions synchronously during an upgrade request. The proration charge might fail (card declined). Always drive permission updates via webhooks.
4. Resiliency Patterns: The “Dunning” Process
Dunning is the process of communicating with customers to ensure the collection of accounts receivable. In SaaS, it means “How do we handle failed renewal charges without losing the customer?”
A naive implementation listens for subscription.charged (failed) and immediately sets user.is_active = False. This leads to high involuntary churn. A user’s card might fail due to a bank server timeout, and you’ve just locked them out of their account.
The Grace Period Architecture
You need to decouple Payment Status from Access Status.
- Payment Status: Tracks the Razorpay state (
active,pending,halted). - Access Status: Tracks whether the user can use the app (
granted,revoked).
When a payment fails (subscription.pending), the user enters a Grace Period.

The Logic Flow:
- T+0 (Due Date): Razorpay attempts charge. Fails.
- Webhook:
subscription.pending. - System Action: Send email “Payment Failed”. Keep access active.
- Webhook:
- T+1 to T+3: Razorpay auto-retries.
- If Success (
subscription.charged): Send “Thank You” email. Extend validity. - If Fail: Do nothing. Keep access active.
- If Success (
- T+5 (Max Retries Exhausted): Razorpay gives up.
- Webhook:
subscription.halted. - System Action: Enter Grace Period (Internal). Give them 48 more hours to fix it manually.
- Webhook:
- T+7 (Hard Cutoff):
- Cron Job checks
subscription.halted+grace_period_expired. - System Action: Revoke Access.
- Cron Job checks
Configuration
In your Database, your Subscription model should look something like this:
class Subscription(models.Model):
razorpay_id = models.CharField(max_length=50)
status = models.CharField(choices=['active', 'pending', 'halted', 'cancelled'])
current_period_end = models.DateTimeField()
# The field that actually controls login/feature access
access_valid_until = models.DateTimeField()
def handle_payment_failure(self):
"""
Called on subscription.pending.
We don't revoke access yet, but we might flag the UI.
"""
self.ui_warning_flag = True
self.save()
def handle_halted(self):
"""
Called on subscription.halted.
Set a hard stop date, perhaps 2 days from now.
"""
self.access_valid_until = datetime.now() + timedelta(days=2)
self.save()
This decoupled architecture ensures that temporary banking glitches do not result in a degraded user experience.
5. System Architecture & Scaling
To tie this all together, your backend architecture needs to be robust. You cannot rely on a single monolithic web server to handle user traffic and webhook processing simultaneously.

The Recommended Stack
- Webhook Ingestion Layer: A lightweight API endpoint that does nothing but signature verification and pushing the payload to a Message Queue (SQS, RabbitMQ, Kafka). This ensures you respond
200 OKto Razorpay instantly (under 2 seconds is the requirement). - Worker Fleet: Asynchronous workers consume events from the queue. This is where the heavy lifting happens (Database writes, Email triggers, Redis locking).
- Reconciliation Cron: Even with the best webhook architecture, events get dropped. Run a nightly cron job that fetches all
activelocal subscriptions and compares their state with the Razorpay API. If a mismatch is found (Local:active, Remote:halted), auto-correct the local state.
Conclusion
Building a recurring payment system is 20% integrating APIs and 80% handling failure states. The difference between a junior and senior implementation lies in how the system behaves when things go wrong.
By implementing a robust State Machine that understands the difference between pending and halted, securing your webhooks with HMAC and Redis-based idempotency, and architecting a compassionate Dunning process, you turn payment failures from a churn event into a recovery opportunity.
The goal isn’t just to process payments; it’s to build a system that maintains trust with your users, even when their bank declines the card.