This article was written under the guidance of Michael, our Senior DevOps Engineer with 10+ years of experience in this domain.
We recommend reading the previous articles from our blog, which cover access to critical information, data preservation, monitoring, and data integrity (the last in two parts):
Where Do Your Secrets Live? Security Architecture in Gambling and Fintech;
Data Monitoring in FinTech and Gambling: A Question of Survival, Not Comfort;
Data Integrity in FinTech and Gambling: How Not to Lose Data, Transactions, and Money. Part 1;
Data Integrity in FinTech and Gambling: How Not to Lose Data, Transactions, and Money. Part 2.
Why does this matter? Because we are about to address something most IT companies prefer not to discuss: what to do when all precautions have proven insufficient and a critical failure has occurred regardless. In FinTech and Gambling projects, having a backup alone is not enough to resolve such a situation. You need a pre-prepared plan for rapidly restoring the project in a different data center or with a different cloud provider, without losing critical data. As we established in previous articles: data equals money.
Multi-Datacenter / Multi-Cloud
If you follow the development of online business, you likely already have a general understanding of how data is stored and processes are handled: everything converges on servers in a specific data center. For FinTech and Gambling projects, there must be more than one such location — unlike most other IT business verticals. The requirement is straightforward: processes must be continuous. So even if one data center goes offline due to a power outage, a natural disaster, or any other unforeseen event, another must take over all operations within the shortest possible time and possess the entire dataset that has become unavailable. Simply having a system backup on the same server where it is deployed is insufficient; data copies must be distributed across separate, geographically dispersed data centers. This is called “disaster resilience” — and the name speaks for itself.
A project should be stored or replicated across multiple environments simultaneously. The following is an example of data distribution between two data centers:
Primary Datacenter: | Secondary Datacenter: |
Main application cluster | Standby application nodes |
Primary database | Replica database |
Redis / RabbitMQ / Kafka | Backup storage |
Monitoring | Prepared Docker/Kubernetes manifests |
Production traffic | Prepared secrets |
| Monitoring endpoints |
In a cloud-based implementation, the distribution is simpler: a Primary Cloud on a well-known and reliable provider, and a Secondary Cloud on a less prominent but equally reliable one — or even a self-managed server:
- Primary: Hetzner / OVH / AWS / Vexxhost;
- Secondary: a different provider or a different region.
Alternatively, the system can be distributed across several providers, each with its own role, though this must be approached with great care:
- Hetzner → primary infrastructure;
- Vexxhost → standby infrastructure;
- AWS S3 / Backblaze / Wasabi → backup storage;
- Cloudflare → DNS / WAF / traffic switch.
RTO and RPO
Let us examine the two key metrics used to measure recovery success. They are used to evaluate the effectiveness of the architecture that has been built. In FinTech and Gambling projects, both values should approach zero — or at least be as small as possible.
- RTO (Recovery Time Objective): the maximum acceptable time to restore the system after a critical failure;
- RPO (Recovery Point Objective): the maximum acceptable data loss, expressed as a window of time before the failure.
Both metrics are used together and are critically important, and both are usually expressed in minutes. An entry of “RTO=30” means the system must be restored within a maximum of 30 minutes. An entry of “RPO=0–5” means that losing up to five minutes’ worth of data is acceptable for the business. Exceeding either threshold constitutes a critical failure that will inevitably result in financial losses.
From practical experience: five minutes is already too long. For payments, balances, bets, and withdrawals, even three minutes of inactivity is problematic. Within that time, significant financial discrepancies can accumulate — as discussed in previous articles — leading to user dissatisfaction and regulatory scrutiny.
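Both values can be checked against a real incident or a drill with simple timestamp arithmetic. The following is a minimal sketch; the function name `measure_recovery`, the timestamps, and the targets are assumptions made for illustration, not figures from a real system.

```python
from datetime import datetime, timedelta


def measure_recovery(last_durable_write, failure_at, service_restored_at,
                     rto_target, rpo_target):
    """Compare achieved recovery time and data loss with the agreed targets."""
    achieved_rto = service_restored_at - failure_at   # how long the service was down
    achieved_rpo = failure_at - last_durable_write    # window of writes that were lost
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto_target,
        "rpo_met": achieved_rpo <= rpo_target,
    }


# Hypothetical incident: the last replicated write landed at 12:00:40,
# the outage began at 12:01:00, and service resumed at 12:19:00.
report = measure_recovery(
    last_durable_write=datetime(2024, 1, 1, 12, 0, 40),
    failure_at=datetime(2024, 1, 1, 12, 1, 0),
    service_restored_at=datetime(2024, 1, 1, 12, 19, 0),
    rto_target=timedelta(minutes=30),
    rpo_target=timedelta(minutes=5),
)
print(report)
```

Recording these two numbers after every incident and drill shows whether the stated targets are realistic for the architecture as built.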
The Foundation for Recovery
Unlike other IT projects, FinTech and Gambling cannot rely on a backup alone for recovery. Typically, backup storage contains only ZIP archives of files and database dumps, that is, copies holding the full structure and/or content of a database at a specific point in time. These are files (most commonly in .sql format) that capture tables, records, and settings, so the data can be restored or migrated, but only to the state it had when the dump was taken.
For the vast majority of less demanding projects, this is perfectly sufficient. However, FinTech and Gambling require not just a database and files, but an entire set of preserved information organized into a single recovery package:
- Code repository;
- Docker images or Dockerfiles;
- Docker Compose / Kubernetes manifests;
- Terraform / Ansible / Helm charts;
- Database backups;
- WAL/binlog for point-in-time recovery;
- Object storage backups;
- Secrets in Secret Manager / Vault;
- CI/CD pipeline;
- DNS switch plan;
- Recovery steps documentation;
- Post-recovery monitoring;
- Reconciliation scripts.
This covers the essentials. The core purpose of a recovery package is to provide not merely a backup, but everything needed to rebuild the system and verify it after a failure.
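A recovery package is only useful if it is complete, and completeness can be checked automatically. The sketch below assumes a hypothetical manifest: a dictionary that maps each component from the list above to the location where it is stored. The component keys and the function name are illustrative.

```python
# Component keys are hypothetical names for the items in the list above.
REQUIRED_COMPONENTS = {
    "code_repository",
    "container_images",
    "deployment_manifests",
    "iac_definitions",
    "database_backups",
    "wal_or_binlog_archive",
    "object_storage_backups",
    "secrets_reference",
    "ci_cd_pipeline",
    "dns_switch_plan",
    "recovery_runbook",
    "monitoring_config",
    "reconciliation_scripts",
}


def missing_components(manifest):
    """Return required components that are absent or have an empty location."""
    return sorted(name for name in REQUIRED_COMPONENTS if not manifest.get(name))
```

Running such a check in CI, for example on every change to the infrastructure repository, would flag a package that has silently become incomplete.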
Infrastructure as Code
Rapid and reliable recovery requires not only complete information, but also the right methods for working with it. We typically picture the configuration process as routine manual work performed by a specialist interacting with various services and graphical interfaces, checking boxes in dashboards, and verifying the correctness of each step. While this is common and feasible in many contexts, it is not appropriate for FinTech and Gambling, where every millisecond counts and the human factor introduces unacceptable risk.
This is where Infrastructure as Code (IaC) comes in — an approach to managing and configuring IT infrastructure (servers, networks, databases) using specialized code files. Rather than manually configuring systems through a sequence of console commands, the infrastructure is described as a single codebase that can be edited, version-controlled, and automatically deployed. The advantages are clear:
- Speed: deploying an entire environment (servers, databases) takes minutes, not hours or days;
- Standardization: an identical environment is always created (for development, testing, and production) without errors introduced by the human factor;
- Version control: configuration files are stored in Git, making it possible to track who changed what and when, and to quickly roll back to a previous version;
- Scalability: new servers or network resources can be added as needed without additional manual configuration.
For writing this type of code, the following tools are recommended based on extensive hands-on experience:
- Terraform → provisioning servers, networks, firewalls, volumes, load balancers;
- Ansible → configuring servers, Docker, users, packages, configs;
- Helm / Kubernetes manifests → running services in Kubernetes;
- Docker Compose → rapid deployment of small to medium infrastructure;
- GitHub Actions / GitLab CI / Jenkins → automated deployment.
An example of an effective recovery sequence might look like this:
git clone infrastructure-repo
terraform apply
ansible-playbook setup.yml
docker compose up -d
restore database
switch DNS
The key point about IaC: unlike manual administration, this must be a fully automated solution with no human involvement. Working scripts must be tested repeatedly, and in a critical situation they must execute without requiring any manual intervention.
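The sequence above can be wrapped in a small runner that stops at the first failed step and reports it. This is a sketch under stated assumptions: the repository URL and the two helper scripts for database restore and DNS switching are hypothetical placeholders, and working directories are omitted for brevity. The `-auto-approve` flag tells Terraform to skip its interactive confirmation.

```python
import subprocess

# Commands mirror the sequence above. The repository URL and the two
# ./scripts/*.sh helpers are hypothetical and must be replaced.
RECOVERY_STEPS = [
    ["git", "clone", "git@example.com:org/infrastructure-repo.git"],
    ["terraform", "apply", "-auto-approve"],
    ["ansible-playbook", "setup.yml"],
    ["docker", "compose", "up", "-d"],
    ["./scripts/restore_database.sh"],
    ["./scripts/switch_dns.sh"],
]


def run_recovery(steps, run=subprocess.run):
    """Run the steps in order; stop at the first non-zero exit code."""
    completed = []
    for command in steps:
        result = run(command)
        if result.returncode != 0:
            return {"ok": False, "failed_step": command, "completed": completed}
        completed.append(command)
    return {"ok": True, "failed_step": None, "completed": completed}
```

The `run` parameter exists so the runner can be exercised in tests with a fake executor, without touching real infrastructure.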
Database Recovery
We now arrive at the actual data restoration process. Before initiating recovery procedures (if they are not fully automated), a complete checklist of required resources must be verified.
For the database:
- Regular full backups;
- Incremental backups;
- Binlog / WAL archiving;
- Replication;
- Point-in-time recovery;
- Restore testing;
- Backup encryption;
- Backup integrity checks.
For MySQL:
Full backup + binlog → restoration to a specific point in time.
For PostgreSQL:
Base backup + WAL → point-in-time recovery.
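Both schemes follow the same logic: take the newest full backup made before the target moment, then replay the archived logs up to that moment. The sketch below illustrates only the selection step and deliberately simplifies it by describing backups and log segments with timestamps. Real PostgreSQL and MySQL tooling tracks log positions rather than wall-clock ranges, and the function and segment names here are hypothetical.

```python
from datetime import datetime


def plan_point_in_time_recovery(base_backups, log_segments, target_time):
    """Choose the full backup and the log segments needed to reach target_time.

    base_backups: list of (taken_at, name)
    log_segments: list of (start, end, name), each covering [start, end)
    """
    candidates = [b for b in base_backups if b[0] <= target_time]
    if not candidates:
        raise ValueError("no full backup exists before the target time")
    taken_at, backup_name = max(candidates)
    needed = sorted(
        s for s in log_segments if s[1] > taken_at and s[0] < target_time
    )
    # The segments must form an unbroken chain from the backup to the target.
    covered_until = taken_at
    for start, end, _ in needed:
        if start > covered_until:
            raise ValueError(f"gap in archived logs at {covered_until}")
        covered_until = max(covered_until, end)
    if covered_until < target_time:
        raise ValueError("archived logs end before the target time")
    return backup_name, [name for _, _, name in needed]
```

A single missing segment makes every later moment unreachable, which is why log archiving deserves the same monitoring as the backups themselves.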
Successful completion of the restoration process is not yet a reason to stand down. For most IT projects, this would be sufficient — but not for FinTech and Gambling. The recovery process takes time, during which errors may accumulate and lead to financial discrepancies as a result of data integrity violations. The following additional operations are therefore MANDATORY:
- Ledger verification;
- Balance reconciliation;
- Payment provider reconciliation;
- Withdrawal reconciliation;
- Outbox replay;
- Queue verification;
- Read model rebuild;
- CRM sync check.
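As an illustration of the first two items, balance reconciliation can be reduced to recomputing every balance from the ledger and comparing the result with the stored value. This is a minimal sketch: it assumes the ledger is available as signed entries per account, and the function name and data shapes are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def reconcile_balances(ledger_entries, stored_balances):
    """Recompute each balance from the ledger and report any mismatch.

    ledger_entries: iterable of (account_id, signed_amount)
    stored_balances: dict of account_id -> balance kept in the balances table
    """
    computed = defaultdict(Decimal)
    for account_id, amount in ledger_entries:
        computed[account_id] += amount
    mismatches = {}
    for account_id in set(computed) | set(stored_balances):
        expected = computed.get(account_id, Decimal("0"))
        actual = stored_balances.get(account_id, Decimal("0"))
        if expected != actual:
            mismatches[account_id] = {"ledger": expected, "stored": actual}
    return mismatches
```

`Decimal` is used instead of floating-point numbers so that monetary amounts compare exactly. An empty result means the ledger and the stored balances agree.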
Active-Passive vs. Active-Active
Two more important architectural approaches must be addressed here, as they relate to high availability and server fault tolerance. They determine how load is handled and how the system responds to failures.
Active-Passive
In this configuration, only one server (the active node) is in operation. The second server (the passive, or standby node) remains on standby and does not receive traffic. All work is performed by the primary server. In the event of its failure, the passive server “wakes up” and assumes control — the failover process discussed in previous articles. This solution is simpler to configure and manage, but may involve a brief delay or service interruption during the switchover. Additionally, the standby server consumes resources while idle.
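The decision to promote the standby node is usually based on repeated health checks rather than on a single failed probe. The following sketch shows one possible rule: switch only after several consecutive failures. The class name and the threshold of three are assumptions chosen for illustration.

```python
class FailoverDetector:
    """Promote the standby only after several consecutive failed health checks,
    so that one dropped probe does not trigger a switchover."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.active = "primary"

    def record_health_check(self, primary_healthy):
        if self.active != "primary":
            # Already failed over; in this sketch, failback is a manual decision.
            return self.active
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "standby"
        return self.active
```

A higher threshold reduces false switchovers but lengthens detection time, and that time counts toward the RTO.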
Active-Active
In this architecture, all servers or nodes operate simultaneously and collectively process user requests. A load balancer distributes incoming traffic across all active servers. If one server fails, the system simply redirects requests to the remaining nodes. The primary advantage is near-seamless failover with minimal or no downtime, and full utilization of all resources. The system scales better and performs faster. However, the cost of this solution is significantly higher (a more powerful infrastructure must be maintained), and management and development are more complex due to the need for real-time data synchronization between servers. For FinTech and Gambling, this presents a substantial challenge, as it requires simultaneously addressing a range of critical issues:
- Write conflicts;
- Distributed transactions;
- Global balance consistency;
- Split-brain;
- Idempotency across data centers;
- Event ordering;
- Consensus;
- Inter-region latency.
Based on extensive experience with high-load projects, including online casinos, the optimal approach is a hybrid model that uses both architectures for different purposes: active-passive for the financial core, and active-active only for the stateless application layer or read-only services.
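One of the issues listed above, idempotency across data centers, can be illustrated briefly. During a failover, clients and providers tend to retry requests, and a retried deposit must not be credited twice. The sketch below uses an in-memory dictionary as a stand-in for a durable store that is replicated to the standby; the class and method names are hypothetical.

```python
from decimal import Decimal


class IdempotentPaymentHandler:
    """Apply each deposit at most once, keyed by an idempotency key
    supplied by the caller."""

    def __init__(self):
        # Stand-in for a durable store replicated to the standby data center.
        self.results_by_key = {}
        self.balance = Decimal("0")

    def apply_deposit(self, idempotency_key, amount):
        if idempotency_key in self.results_by_key:
            # Replay of a request already processed: return the saved result.
            return self.results_by_key[idempotency_key]
        self.balance += amount
        result = {"status": "applied", "balance_after": self.balance}
        self.results_by_key[idempotency_key] = result
        return result
```

In a real system, the key lookup and the balance update would need to happen in one database transaction; otherwise a crash between the two steps reintroduces the duplicate.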
Stateless Application Layer
It is worth noting that the speed of server recovery, or of its restart in a different data center, depends directly on the amount of information stored locally. The stateless approach supports this goal: each request is processed as new, and the application server keeps no client session data between requests. The application server itself must not hold any of the following:
- User sessions;
- Uploaded files;
- Critical temporary data;
- Queue state;
- Payment state;
- Generated reports;
- Cache as a single source of truth.
The best solution is a well-designed distribution of data across different services and storage systems. A stateless application can then be quickly launched on virtually any other server and simply pointed at the appropriate databases, without migrating or duplicating them:
- Sessions → Redis / database / external session storage;
- Files → S3-compatible object storage;
- Queues → RabbitMQ / Kafka with persistence;
- Cache → rebuildable;
- Logs → centralized logging;
- Secrets → Secret Manager / Vault.
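In practice, a stateless service learns where its storage lives from configuration supplied at start-up, not from anything kept on the server. The sketch below reads the endpoints from environment variables and refuses to start if one is missing. The variable names and the `ServiceEndpoints` class are assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEndpoints:
    database_url: str
    session_store_url: str
    object_storage_url: str
    queue_url: str


def load_endpoints(env=os.environ):
    """Build the endpoint set from the environment; fail fast if any is missing."""
    # Hypothetical variable names; each deployment defines its own.
    names = {
        "database_url": "DATABASE_URL",
        "session_store_url": "SESSION_STORE_URL",
        "object_storage_url": "OBJECT_STORAGE_URL",
        "queue_url": "QUEUE_URL",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"missing configuration: {', '.join(missing)}")
    return ServiceEndpoints(**{field: env[var] for field, var in names.items()})
```

Starting the same image in the secondary data center then only requires a different set of environment values.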
Object Storage and Files
Continuing the topic, it is worth addressing other “heavy” files that accumulate continuously. They burden the system and tie the data to a single machine, so storing them on the application server itself is undesirable. Such files include:
- User uploads;
- KYC documents;
- Payment reports;
- Invoices;
- Exports;
- Logs;
- Backoffice files;
- Media;
- Generated documents.
Organizing the handling of such files in advance will make the project faster and more efficient — and, most importantly, will prevent problems in a critical situation. A number of reliable services and methodologies are available for this purpose. The following are recommended based on direct experience:
- S3 / S3-compatible storage;
- MinIO;
- Wasabi;
- Backblaze B2;
- AWS S3;
- OVH Object Storage;
- Vexxhost OpenStack Swift;
- Replication between buckets;
- Versioning;
- Lifecycle policies;
- Encryption.
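Replication between buckets should be verified rather than assumed. The sketch below compares two object listings, each a mapping of object key to checksum, and reports what the secondary bucket lacks or holds in a different version. How the listings are obtained depends on the storage provider and is outside this example; the function names are hypothetical.

```python
import hashlib


def sha256_of(data):
    """Checksum used to compare object contents."""
    return hashlib.sha256(data).hexdigest()


def find_replication_gaps(primary_objects, secondary_objects):
    """Compare two {key: checksum} listings and report what the secondary lacks."""
    missing = sorted(k for k in primary_objects if k not in secondary_objects)
    mismatched = sorted(
        k for k in primary_objects
        if k in secondary_objects and primary_objects[k] != secondary_objects[k]
    )
    return {"missing": missing, "mismatched": mismatched}
```

Running such a comparison on a schedule turns a silent replication failure into an alert long before the secondary bucket is needed.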
DNS and Traffic Switch
Another critically important issue following data recovery, and one that requires heightened attention in FinTech and Gambling projects, is traffic switching. Even if everything is restored quickly and flawlessly, the system will sit idle until user traffic is actually routed to it, and the business will continue to lose money and reputation. The following tools and techniques can be used for traffic management:
- Cloudflare;
- Route53;
- Low TTL for DNS;
- Load balancer;
- Health checks;
- Failover records;
- WAF (Web Application Firewall);
- Reverse proxy.
It is important to remember that DNS TTL must not be set to 24 hours, as may be acceptable for other IT projects: resolvers cache a record for the duration of its TTL, so a long TTL keeps sending users to the failed data center long after the switch. Furthermore, FinTech and Gambling administrators must clearly know which domains need to be switched and under what conditions, and must be able to act on that as quickly as possible. Special attention must be paid to API, admin, webhook, and callback URLs: all of these must be accounted for and routed to the new data center. In addition to clear switching instructions, a step-by-step action plan should also be prepared.
Finally, particular attention should be paid to the webhooks and callbacks registered with external providers, which must also be updated during a traffic switch:
payment callback URL | KYC provider callback | CRM webhook |
withdrawal callback URL | affiliate webhook | game provider callback |
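If the callback endpoints move to a different hostname during the switch, the list of URLs to register with each provider can be generated rather than edited by hand. The sketch below replaces only the host part of each URL and keeps the path and query intact. The hostnames and the function name are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def retarget_callbacks(callback_urls, new_host):
    """Return the same callback URLs with the host replaced by new_host."""
    retargeted = {}
    for name, url in callback_urls.items():
        parts = urlsplit(url)
        retargeted[name] = urlunsplit(parts._replace(netloc=new_host))
    return retargeted


# Hypothetical callbacks registered with external providers.
current = {
    "payment": "https://api.primary.example.com/payments/callback?v=1",
    "kyc": "https://api.primary.example.com/kyc/callback",
}
print(retarget_callbacks(current, "api.secondary.example.com"))
```

Keeping the full callback list in the repository, next to the DNS switch plan, ensures that no provider is forgotten under pressure.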
Secrets and Access Credentials
This topic was covered in detail in the first article of this series, Where Do Your Secrets Live? Security Architecture in Gambling and Fintech, but a brief recap of the key points is worthwhile here. For rapid project recovery, secrets must not be hard-coded for a specific server. The correct approach to secrets management:
Secret Manager / Vault
↓
service account / IAM role
↓
application retrieves secrets at runtime
The following secrets must be available in the secondary data center:
- DB credentials;
- API keys;
- Payment provider credentials;
- JWT/private keys;
- OAuth secrets;
- Webhook secrets;
- Object storage credentials;
- Queue credentials.
However, it must be emphasized: developers must never have direct access to production secrets. Strict access control is a prerequisite for the security of the project.
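The runtime retrieval flow described above can be sketched as follows. The in-memory store is only a stand-in for a real secret manager client, and the secret names are hypothetical. The point of the example is the start-up check: the application refuses to start in the secondary data center if any required secret is unavailable there.

```python
class InMemorySecretStore:
    """Stand-in for a real secret manager client; for illustration only."""

    def __init__(self, secrets):
        self._secrets = dict(secrets)

    def get(self, name):
        return self._secrets.get(name)


# Hypothetical secret names covering a few of the items listed above.
REQUIRED_SECRETS = [
    "db_credentials",
    "payment_provider_key",
    "jwt_private_key",
    "webhook_secret",
]


def load_secrets(store, required=REQUIRED_SECRETS):
    """Fetch every required secret at start-up; refuse to start if one is missing."""
    loaded = {name: store.get(name) for name in required}
    missing = [name for name, value in loaded.items() if value is None]
    if missing:
        raise RuntimeError(f"secrets unavailable in this environment: {missing}")
    return loaded
```

Running the same check during a recovery drill reveals a missing secret before a real failover does.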
Post-Recovery Monitoring
Monitoring is the subject of a dedicated article on our blog: Data Monitoring in FinTech and Gambling: A Question of Survival, Not Comfort. It bears emphasizing that after failover, it is not sufficient to verify only that the website or application is running. A full range of indicators must be checked:
API availability | webhook delivery | ledger mismatch |
5xx errors | queue lag | CRM sync lag |
payment success rate | replication status | SSL certificates |
pending transactions | outbox events | DNS propagation |
withdrawal queue | dead-letter queue | external provider callbacks |
KYC provider status | balance mismatch |
In other words, recovery in FinTech and Gambling projects is considered successful only when business processes are functioning — not merely when servers are running. This is a fundamental distinction from most other IT solutions.
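This rule can be expressed as an automated gate: recovery is declared complete only when every business indicator is within its limit. The sketch below is illustrative, and the indicator names and thresholds are assumptions, not recommended values.

```python
def evaluate_recovery(indicators, limits):
    """Return the indicators that breach their limits.

    limits maps indicator name -> (kind, threshold), where kind is
    "max" (value must not exceed) or "min" (value must not fall below).
    An empty result means recovery succeeded at the business level.
    """
    breaches = {}
    for name, (kind, threshold) in limits.items():
        value = indicators.get(name)
        if value is None:
            breaches[name] = "no data"  # a missing metric counts as a failure
        elif kind == "max" and value > threshold:
            breaches[name] = value
        elif kind == "min" and value < threshold:
            breaches[name] = value
    return breaches


# Hypothetical thresholds for a few of the indicators listed above.
LIMITS = {
    "payment_success_rate": ("min", 0.95),
    "queue_lag_seconds": ("max", 60),
    "ledger_mismatch_count": ("max", 0),
    "dead_letter_messages": ("max", 0),
}
```

Treating a missing metric as a breach matters after failover, because monitoring itself may not have been restored.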
Disaster Recovery Drill
In closing, we want to make sure this message is heard. Years of experience with complex financial projects, including numerous critical incidents and remediation efforts, have shaped a clear view: if you are building a business in FinTech or Gambling, engage professional developers and do not neglect the elements that can save your project and your money. Many companies genuinely have backups but have never tested a restore, which leaves them dangerously exposed. The following practice, refined over years of work, is recommended for addressing this risk. Ideally it should be performed monthly, and at least once per quarter:
- Test infrastructure rebuild from scratch;
- Restore from backup;
- Run smoke tests;
- Verify payment sandbox;
- Verify queue processing;
- Verify monitoring;
- Measure actual RTO;
- Verify RPO;
- Update the runbook.
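The drill itself can be scripted so that the actual RTO is measured rather than estimated. The sketch below runs the drill steps in order, times each one, and compares the total with the target. The step callables are placeholders to be supplied by the team, and the function name is hypothetical.

```python
import time


def run_drill(steps, rto_target_seconds, clock=time.monotonic):
    """Run drill steps in order, time each one, and compare the total with the target.

    steps: list of (name, callable); each callable returns True on success.
    The drill stops at the first failed step.
    """
    started = clock()
    timings = []
    for name, action in steps:
        step_started = clock()
        succeeded = bool(action())
        timings.append((name, clock() - step_started, succeeded))
        if not succeeded:
            break
    total = clock() - started
    all_passed = len(timings) == len(steps) and all(ok for _, _, ok in timings)
    return {
        "timings": timings,
        "actual_rto_seconds": total,
        "rto_met": all_passed and total <= rto_target_seconds,
    }
```

The per-step timings show where the recovery time is actually spent, which is the most useful input for the final item on the list: updating the runbook.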