Data Integrity in FinTech and Gambling: How Not to Lose Data, Transactions, and Money. Part 2

This article was written under the guidance of Michael,
our Senior DevOps Engineer with 10+ years of expertise in this field.

Been waiting? Here it is — the conclusion of our article on data integrity in FinTech and Gambling projects, where IT plays a decisive role. Part 1 is available on our blog: “Data Integrity in FinTech and Gambling: How Not to Lose Data, Transactions, and Money. Part 1”. We recommend reading it first if you arrived here directly, as it covers the key differences between FinTech and Gambling projects and other online businesses.

As a reminder: data in FinTech and Gambling projects is not merely a collection of zeros and ones — it is your money. We established that every time you read the word “data,” you should picture money. And conversely: money is data — the foundation of financial security, user trust, business stability, and regulatory compliance. With that in mind, let’s continue.

Ledger as the Source of Financial Truth

In any financial IT project, the formation and presentation of data must be approached with the utmost seriousness. A balance figure should not be just a number that “somewhere gets updated” — it must be a verified value based on accurate and current information. By “accurate” we don’t simply mean untampered with — the critical point is that the number must reflect reality and cannot be distorted by various factors before it reaches users or administrators.

Let’s look at a concrete example. Here is an unsafe approach to displaying a user balance:

❌ users.balance = 150

This is a variable containing a number. Nothing seems unusual — it appears to be correct. But in FinTech and Gambling, this approach is unacceptable. Every significant number visible to users must be the result of a reliable function, not simply a stored value. The correct and secure approach looks like this:

✅ users.balance = cached/current value;
   ledger_transactions = full history of all changes.

In IT, the Ledger architectural pattern is used to create audit trails. All transactions are recorded using an append-only concept — meaning they cannot be deleted or silently modified after the fact. The name is borrowed from the centuries-old practice of recording financial operations in large accounting journals. Here is an example of ledger_transactions entries:

deposit +100

bet -20

win +50

bonus +10

withdrawal -80

rollback +20

As shown, the balance is not a simple number: it is the result of applying each recorded event, precisely and in sequence. The key advantages of the ledger approach include the ability to:

  • Reconstruct a balance from its history;
  • Identify discrepancies;
  • Perform audits;
  • Verify who changed funds and when;
  • Execute reconciliation;
  • Roll back or compensate for errors;
  • Build read models for CQRS;
  • Validate cached balances.
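The first and last advantages in the list can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the `LedgerEntry` class and both function names are hypothetical, and the entries are the sample ledger_transactions shown earlier.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LedgerEntry:
    """One immutable, append-only record of a balance change (hypothetical type)."""
    kind: str
    amount: Decimal  # signed: positive credits, negative debits


def balance_from_ledger(entries):
    """Rebuild the balance by summing every recorded change."""
    return sum((e.amount for e in entries), Decimal("0"))


def cached_balance_is_valid(cached, entries):
    """A cached balance is trusted only if it matches the ledger history."""
    return cached == balance_from_ledger(entries)


# The sample entries from the text above.
ledger = [
    LedgerEntry("deposit", Decimal("100")),
    LedgerEntry("bet", Decimal("-20")),
    LedgerEntry("win", Decimal("50")),
    LedgerEntry("bonus", Decimal("10")),
    LedgerEntry("withdrawal", Decimal("-80")),
    LedgerEntry("rollback", Decimal("20")),
]

print(balance_from_ledger(ledger))                      # 80
print(cached_balance_is_valid(Decimal("150"), ledger))  # False
```

With these entries the reconstructed balance is 80, so a cached users.balance of 150 would be flagged as a mismatch instead of being shown to the user.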

The Ledger principle works seamlessly alongside CQRS, which we covered in detail in the previous article. The relationship can be summarized as follows: the ledger is the write-side source of truth, and every read model is a projection derived from it.

The key conclusion: a player’s balance is not simply a number in a table; it is the result of calculating a precise, controlled history of financial events. Any read model can be rebuilt, and therefore also replaced, which makes it a direct vulnerability if someone gains unauthorized access. In contrast, a properly designed financial source of truth cannot be substituted, which is why the Ledger approach is essential for FinTech and Gambling projects.

Transactional Outbox: Ensuring No Events Are Lost

The next pattern deserving special attention concerns guaranteed delivery of events from the database to the message broker. This is a fairly common problem: the database write succeeds, but the service fails before the corresponding event is published, and the entire system enters an inconsistent state. Consider the following example:

  • We update the payment status in the database;
  • An event should then be sent to Kafka/RabbitMQ;
  • But between these two actions, the service crashes and the operation does not complete;
  • The database contains the status, but the event was never sent;
  • The read model, CRM, or analytics never received the required event — data integrity is compromised.

An effective solution to this problem is the Transactional Outbox — an architectural pattern in software development that guarantees reliable and consistent delivery of event messages to a message broker (such as Kafka or RabbitMQ) after data has been saved to the database. It prevents situations where data in the service has changed but no notification of that change was sent. In this approach, a single database transaction handles the payment update, creates the ledger transaction, and writes a record to outbox_events.

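A separate worker then reads outbox_events, sends the event to Kafka/RabbitMQ, and marks it as “sent.” The flow looks like this: database transaction (payment update, ledger transaction, outbox_events record) → outbox worker → Kafka/RabbitMQ → consumers such as the read model, CRM, and analytics.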

To summarize: using the Transactional Outbox pattern in FinTech and Gambling projects ensures that no events are lost between the database and the message broker. And preserving data — as we established — means preserving money.
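A minimal sketch of the pattern, assuming an in-memory SQLite database in place of the production database and a plain Python callable in place of the Kafka/RabbitMQ producer. The table layout and the names `confirm_payment` and `relay_outbox` are illustrative assumptions; the ledger write is omitted to keep the focus on the outbox.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox_events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        sent INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO payments (id, status) VALUES ('pay_1', 'pending');
""")


def confirm_payment(payment_id):
    """Update the payment and record the event in ONE transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "UPDATE payments SET status = 'confirmed' WHERE id = ?",
            (payment_id,),
        )
        payload = json.dumps(
            {"type": "payment_confirmed", "payment_id": payment_id}
        )
        conn.execute(
            "INSERT INTO outbox_events (payload) VALUES (?)", (payload,)
        )


def relay_outbox(publish):
    """Worker step: publish unsent events, then mark each one as sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox_events WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for event_id, payload in rows:
        publish(payload)  # if this raises, the row stays unsent for a later run
        with conn:
            conn.execute(
                "UPDATE outbox_events SET sent = 1 WHERE id = ?", (event_id,)
            )
    return len(rows)


published = []  # stands in for the message broker
confirm_payment("pay_1")
relay_outbox(published.append)
```

Because the worker marks an event as sent only after publishing it, a crash between those two steps leads to the event being published again on the next run. Delivery is therefore at-least-once, and consumers still need idempotent handling.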

Queues, Retries, and Dead-Letter Queues

Before proceeding, it is important to establish a key point. In FinTech and Gambling projects, many processes operate asynchronously — meaning events or actions occur independently of the main execution flow and do not require an immediate response to continue. Instead of halting and waiting for a task to complete, the system continues executing other actions. A helpful analogy is YouTube: when you request a video to load, the website or app does not freeze. While the video loads in the background, you can continue scrolling, clicking other buttons, and reading comments. Typical asynchronous processes include:

  • Payment callbacks;
  • Withdrawal processing;
  • KYC verification;
  • CRM sync;
  • Bonus calculation;
  • Settlement;
  • Email / SMS notifications;
  • Affiliate events;
  • Webhook delivery;
  • Report generation.

In each of these cases, a failure or delay may occur, and the request may need to be retried, especially if the previous attempt received no response. This is known as a Retry: the automatic process of re-queuing a message or task for reprocessing after a previous attempt has failed. It is used to ensure operational continuity and protect against data loss. However, retries must be handled with great care, and each one must be controllable. The following are non-negotiable requirements:

  • Retry limit;
  • Exponential backoff;
  • Dead-letter queue;
  • Idempotent handlers;
  • Event deduplication;
  • Correlation_id;
  • Manual replay;
  • Poison message detection;
  • Alert on dead-letter queue;
  • Logging of job failure reasons.

To illustrate the risks, consider a highly dangerous scenario:

  • A payment callback job stops working;
  • The queue automatically retries;
  • The handler is not idempotent;
  • Each retry modifies the user’s balance again.

We discussed the dangers of duplicate operations in detail in the previous article. The overall conclusion for this section is: a retry is necessary for reprocessing in the event of failure, but the resulting business outcome must be applied only once. BeFund specialists have correctly configured this process across numerous projects.
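The rule "retry the processing, apply the outcome once" can be sketched as follows. This is a simplified in-memory illustration: the retry limit, the delay, the integer amounts in minor units, and all names are assumptions, and a real system would persist the deduplication set and the dead-letter queue.

```python
import time

MAX_ATTEMPTS = 3           # retry limit (illustrative value)
BASE_DELAY_SECONDS = 0.01  # kept tiny so the sketch runs quickly

balances = {"user_1": 0}      # amounts in minor units (e.g. cents)
processed_event_ids = set()   # deduplication store
dead_letter_queue = []


def apply_deposit(event):
    """Idempotent handler: the same event_id changes the balance only once."""
    if event["event_id"] in processed_event_ids:
        return
    balances[event["user_id"]] += event["amount"]
    processed_event_ids.add(event["event_id"])


def process_with_retry(event, handler, sleep=time.sleep):
    """Retry with exponential backoff; park the event in the DLQ on exhaustion."""
    last_error = None
    for attempt in range(MAX_ATTEMPTS):
        try:
            handler(event)
            return True
        except Exception as exc:
            last_error = repr(exc)  # keep the failure reason for the log
            if attempt < MAX_ATTEMPTS - 1:
                sleep(BASE_DELAY_SECONDS * 2 ** attempt)
    dead_letter_queue.append({"event": event, "reason": last_error})
    return False


def always_fails(event):
    raise RuntimeError("provider timeout")


deposit = {"event_id": "evt_1", "user_id": "user_1", "amount": 100}
process_with_retry(deposit, apply_deposit)
process_with_retry(deposit, apply_deposit)  # redelivery of the same event

process_with_retry(
    {"event_id": "evt_2", "user_id": "user_1", "amount": 50}, always_fails
)
```

The redelivered deposit leaves the balance at 100, while the event that keeps failing ends up in the dead-letter queue together with the reason, where an alert and a manual replay can pick it up.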

Backup ≠ Data Integrity

The most widespread — and unfortunately mistaken — belief is that backups are the primary defense against data integrity violations. Backups are absolutely necessary, but their purpose is to improve the chances of data recovery after a disaster, not to ensure data integrity. A clear example:

🤗 Backup was created at 02:00 AM;

🤬 Incident occurred at 02:35 PM;

🤔 What happens to the transactions between 02:00 AM and 02:35 PM?
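To make the gap concrete, the window of transactions that the backup alone cannot restore can be computed directly. The function name is hypothetical and the calendar date is arbitrary; only the two times come from the example above.

```python
from datetime import datetime


def exposure_window(last_recovery_point, incident_at):
    """Time span whose transactions are not covered by the last recovery point."""
    return incident_at - last_recovery_point


backup_at = datetime(2024, 1, 1, 2, 0)      # 02:00 AM
incident_at = datetime(2024, 1, 1, 14, 35)  # 02:35 PM the same day

print(exposure_window(backup_at, incident_at))  # 12:35:00
```

With a backup alone, twelve hours and thirty-five minutes of financial activity have no copy to restore from.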

For financial systems, which include FinTech and Gambling projects, the following backup and recovery measures are mandatory:

  • Regular backups;
  • Point-in-time recovery;
  • Binlog / WAL archiving;
  • Replica;
  • Transaction logs;
  • Audit logs;
  • Ledger;
  • Reconciliation reports;
  • Restore testing;
  • Disaster recovery plan;
  • Backup encryption;
  • Backup integrity checks.

Of particular importance here is Point-in-Time Recovery (PITR) — the process of restoring a database or system to the exact state it was in at a specific moment in the past. This allows the system to be “rewound” — for example, to the second before a user error or data failure occurred. Rather than relying solely on a single daily backup, PITR uses a combination of a full backup and change logs (WAL, transaction logs, or binary logs) that continuously record all transactions and changes occurring after the backup was created. Even with PITR in place, verification checks must be configured for:

  • Payments confirmed by the provider;
  • Withdrawals that were actually executed;
  • Bets marked as settled;
  • Bonus transactions created;
  • Webhooks received;
  • Events in queues;
  • Outbox events sent;
  • Read models that need to be rebuilt.

As demonstrated, the primary purpose of backup is data recovery, not the preservation of data integrity. Relying solely on backups is therefore insufficient. That said, a recoverable data set is a prerequisite for restoring integrity, and backups contribute to it when properly configured in conjunction with the other services and systems described here.

Reconciliation: Verifying Internal State Against External Sources

The most unpleasant — and dangerous — scenario in FinTech and Gambling projects is when a discrepancy in financial figures is discovered by a user. Uncredited deposits, incorrect bonuses, erroneous fund withdrawals — all of these are serious blows to reputation. And even if the situation is resolved quickly and all figures return to normal, the negative impression remains and affects the client’s future attitude and word-of-mouth. But can discrepancies be caught before users encounter them? Yes — by using external sources.

Consider some common examples of potential data discrepancies:

  • Our payment status vs. provider payment status;
  • Our withdrawal status vs. payout provider status;
  • Our balance ledger vs. cached balance;
  • Our bet settlement vs. game provider settlement;
  • Our CRM status vs. core system status;
  • Our bonus ledger vs. bonus balance;
  • Our affiliate commission vs. partner platform.

In practice, discrepancies look like this:

To address these scenarios, external data reconciliation should be used. Based on BeFund’s experience, the following time intervals are sufficient to address 93% of potential errors. Reconciliation jobs can run:

🕐 Every 5 minutes for payments;

🕑 Every 15 minutes for withdrawals;

🕒 Hourly for CRM/KYC;

🕓 Daily for financial ledger reports;

🕔 Separately after any incident or failover.

In addition to scheduled reconciliations, checks should also be triggered immediately following specific normal — and especially abnormal — events:

  • Failover;
  • Restoration from backup;
  • Payment provider outage;
  • Webhook delays;
  • Queue issues;
  • Settlement errors;
  • Manual intervention by the support or admin team.

This comprehensive set of measures enables financial discrepancies to be identified and corrected before they become visible to users at large. While clients can only respond to such errors with open dissatisfaction, discrepancies identified by regulators will invariably create additional and unnecessary problems for the business. Reconciliation should therefore be built into the development process at the project planning stage.
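The core of a reconciliation job is a comparison of two views of the same records. Here is a minimal sketch for the first pair in the list above, our payment status against the provider's. The function name, payment IDs, and status values are made up for illustration, and fetching the provider's statuses is outside the scope of the sketch.

```python
def reconcile(internal, provider):
    """Compare payment statuses by ID and return every mismatch found."""
    mismatches = []
    for payment_id in sorted(internal.keys() | provider.keys()):
        ours = internal.get(payment_id, "missing")
        theirs = provider.get(payment_id, "missing")
        if ours != theirs:
            mismatches.append((payment_id, ours, theirs))
    return mismatches


# Hypothetical snapshots of the same time window from both sides.
internal_statuses = {"pay_1": "confirmed", "pay_2": "pending", "pay_3": "confirmed"}
provider_statuses = {"pay_1": "confirmed", "pay_2": "confirmed", "pay_4": "confirmed"}

for payment_id, ours, theirs in reconcile(internal_statuses, provider_statuses):
    print(f"{payment_id}: internal={ours} provider={theirs}")
```

The sketch reports three kinds of discrepancy: a status that differs (pay_2), a payment we hold that the provider does not know (pay_3), and a payment the provider confirmed that never reached us (pay_4). Each one would feed an alert or a correction workflow instead of waiting for a user complaint.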

Data Recovery Without Financial Loss

When data loss has already occurred and recovery is required, FinTech and Gambling projects have an important distinction: it is not enough to simply restore what was lost — the entire financial context must be reconstructed. As discussed above, a user’s balance is not simply a number in a table; it is the result of calculating a precise, controlled history of financial events. The mandatory data points for recovery include:

  • The last valid backup;
  • Binlog / WAL position;
  • State of primary and replica;
  • Replication lag;
  • Whether a split-brain condition occurred;
  • Which transactions were committed;
  • Which outbox events were not sent;
  • Which jobs remained in queues;
  • Which webhooks may have been missed;
  • Which payments changed status at the provider;
  • Which balances need to be verified against the ledger;
  • Which read models need to be rebuilt.

To ensure correct recovery without the risk of data loss due to an incorrect sequence of actions, the BeFund team recommends following this time-tested procedure:

  • Stop write operations or switch the system to maintenance mode;
  • Identify the point of failure;
  • Verify primary / replica consistency;
  • Restore the database via backup + PITR;
  • Verify the ledger;
  • Verify outbox events;
  • Restart or replay jobs;
  • Run reconciliation with providers;
  • Check for balance mismatches;
  • Gradually restore write traffic;
  • Conduct a postmortem;
  • Add monitoring and alerts to prevent recurrence.

Demonstrating the correct system state to users following an incident in financial projects is a business requirement. Without evidence and reliable proof, an owner cannot simply announce: “The issue has been resolved, everything is working — invest your funds with confidence.” No one accepts unsubstantiated claims when their assets are at stake.

Data Integrity Monitoring

In our previous article, “Data Monitoring in FinTech and Gambling: A Question of Survival, Not Comfort” from the BeFund blog series, we covered the objectives and tools of data monitoring in FinTech and Gambling projects in detail. On the topic of today’s article, it is worth stating plainly: data integrity without monitoring is faith, not control. This subject is directly tied to observability — the need to see not only CPU and RAM metrics, but financial consistency as well. Without the confirmation that monitoring provides, we can never be certain that configured processes are executing correctly or that our data reflects reality. The following are the areas requiring mandatory monitoring and the alerts that must be configured immediately:

Monitoring:

  • Replication lag;
  • Failed transactions;
  • Duplicate transaction attempts;
  • Ledger mismatch;
  • Balance mismatch;
  • Pending payments;
  • Stuck withdrawals;
  • Outbox events not sent;
  • Dead-letter queue size;
  • Reconciliation differences;
  • Failed recovery jobs;
  • Split-brain risk;
  • Backup age;
  • Last successful restore test;
  • Replica health;
  • Binlog/WAL archiving status;
  • Provider webhook delay;
  • Read model lag;
  • CQRS projection lag.

Alerts:

  • Replication lag > N seconds;
  • Payment pending for more than N minutes;
  • Withdrawal stuck for more than N minutes;
  • Outbox events not being sent;
  • Dead-letter queue growing;
  • Ledger balance ≠ cached balance;
  • Provider status ≠ internal status;
  • Read model lag exceeds threshold;
  • Backup older than acceptable threshold;
  • Restore test not performed recently.

Example of a Correct Architecture

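How should data integrity be properly ensured? The following diagram illustrates this best: webhook → idempotency check → DB transaction (ledger, balance update, outbox event) → message broker → read model.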

As the diagram shows:

  • idempotency check → protects against repeated webhook processing;
  • DB transaction → guarantees atomicity of financial changes;
  • ledger → preserves the financial history;
  • balance update → updates the current state;
  • outbox event → guarantees that the event will not be lost;
  • message broker → delivers events to other services;
  • read model → enables fast data reads without placing load on the primary database.

In other words, the most effective data integrity is achieved through the combination of idempotency, transactionality, the ledger approach, CQRS, outbox, message broker, reconciliation, and monitoring.
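The write side of this combination can be sketched with an in-memory SQLite database. The schema, the `handle_deposit_webhook` function, and the integer `balance_cents` column are assumptions made for the illustration; the message broker and read model are left out.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE processed_webhooks (idempotency_key TEXT PRIMARY KEY);
    CREATE TABLE ledger_transactions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id TEXT NOT NULL,
        kind TEXT NOT NULL,
        amount_cents INTEGER NOT NULL
    );
    CREATE TABLE users (id TEXT PRIMARY KEY, balance_cents INTEGER NOT NULL);
    CREATE TABLE outbox_events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        sent INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO users (id, balance_cents) VALUES ('user_1', 0);
""")


def handle_deposit_webhook(idempotency_key, user_id, amount_cents):
    """Return True if the deposit was applied, False for a repeated webhook."""
    with db:  # one atomic transaction: every write commits or none does
        try:
            # Idempotency check: the primary key rejects a repeated delivery.
            db.execute(
                "INSERT INTO processed_webhooks (idempotency_key) VALUES (?)",
                (idempotency_key,),
            )
        except sqlite3.IntegrityError:
            return False
        # Ledger: append the financial event.
        db.execute(
            "INSERT INTO ledger_transactions (user_id, kind, amount_cents) "
            "VALUES (?, 'deposit', ?)",
            (user_id, amount_cents),
        )
        # Balance update: refresh the cached current state.
        db.execute(
            "UPDATE users SET balance_cents = balance_cents + ? WHERE id = ?",
            (amount_cents, user_id),
        )
        # Outbox event: recorded in the same transaction for later delivery.
        db.execute(
            "INSERT INTO outbox_events (payload) VALUES (?)",
            (f"deposit:{user_id}:{amount_cents}",),
        )
    return True


first = handle_deposit_webhook("wh_1", "user_1", 10_000)
second = handle_deposit_webhook("wh_1", "user_1", 10_000)  # provider resend
```

The second call returns False and changes nothing, so a resent webhook cannot credit the deposit twice, and the ledger, the cached balance, and the outbox always move together.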

Core Principles of Data Integrity in FinTech and Gambling — Summary

The time has come to summarize both parts of this article, which has grown quite substantial — but necessarily so, given the exceptionally broad range of critical issues it addresses for FinTech and Gambling. Data integrity is not merely a technical quality of the system. It is the protection of money, users, licenses, and business reputation. Replication, backup, and failover are important, but they are not sufficient — the system must be idempotent, transactional, observable, and capable of reconciliation. We have identified 12 principles that describe the core requirements:

  • Critical writes must go through the primary database;
  • Financial operations must be transactional;
  • The handling of every external event must be idempotent;
  • Balances must be validated against ledger history;
  • Replicas must not be used for critical read queries that are sensitive to stale data;
  • CQRS should reduce load without breaking consistency;
  • Read models can be rebuilt;
  • The financial source of truth must be protected;
  • Retry without idempotency is dangerous;
  • Backup without PITR, ledger, and reconciliation is insufficient;
  • Failover without consistency verification can trigger a financial incident;
  • Monitoring must cover not only servers but financial integrity as well.

The correct system architecture — which BeFund specialists will help you build — guarantees stable operation, eliminates problems, and prevents the most critical errors:

  • A single payment will not be credited twice;
  • A withdrawal will not be lost;
  • A balance can be reconstructed from the ledger;
  • Read models can be rebuilt;
  • Replica lag will not break financial logic;
  • Failover will not result in lost transactions;
  • Any discrepancy can be found and corrected without financial loss.