How I solved real-time sync between SQL Server and PostgreSQL with Apache Kafka

William Prigol Lopes
7 min read · Apr 12, 2020

Note: This article focuses more on concepts and advantages than on hands-on steps. A follow-up article will explain how to do this in a small simulated scenario.

At one of my clients, I faced a problem involving an infrastructure change. I had started working at the company more than four years before the change, developing a series of small systems. Over time, and with the client's trust, I built the e-commerce platform and some border systems around the ERP (Enterprise Resource Planning) system.

The scenario I inherited used a single RDBMS technology; the structure was stable, easily reproducible (if needed), and ran with two approaches:

  1. Warehouse data: Native PostgreSQL streaming replication works very well, keeping the replica in sync essentially in real time and keeping query load off the main server, so the ERP RDBMS server's resources are never saturated (a small monitoring sketch follows Figure 1).
  2. Border systems: Border systems run their SELECTs directly against the warehouse database; any data manipulation is requested through APIs.
Figure 1. The RDBMS structure with PostgreSQL in all databases

In the figure above (Figure 1), we can see a simple description of the RDBMS structure. Highlighted in red is the main system, the “ERP Database”. To its right is the “Warehouse” database, a read-only replica that uses streaming replication to keep data in sync. To the right of the “Warehouse” database are the border (auxiliary) systems, such as e-commerce. On the other side of the “ERP Database”, two third-party databases communicate directly with the main database.
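For reference, the health of this first setup can be checked directly on the primary. Below is a minimal monitoring sketch using psycopg2 and the pg_stat_replication view, assuming PostgreSQL 10 or newer; the hostname, database, and credentials are placeholders, not the client's real values.

```python
import psycopg2

# Connect to the primary (ERP) database and inspect streaming replication status.
conn = psycopg2.connect(host="erp-db.example", dbname="erp",
                        user="monitor", password="********")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name, state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication
    """)
    for name, state, lag_bytes in cur.fetchall():
        # A healthy replica shows state = 'streaming' and a lag close to zero.
        print(f"{name}: {state}, replay lag ~ {lag_bytes} bytes")
```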

In this scenario, everything worked fine. But as the database grew, the challenge became: how do we create a new structure that allows generating materialized views and creating indexes, i.e., one without the “read-only” restriction? The “Warehouse” database was heavily used (more than 5k queries daily), the data analysis was relatively deep, and it started to carry a high CPU cost because we could not create specific indexes on a read-only database.

At the same time, the company received a notice from the ERP vendor that PostgreSQL compatibility would be discontinued in future releases, so a switch to SQL Server became a necessity.

— “Our system will no longer be compatible with PostgreSQL databases. The company has decided to discontinue the connectors for this RDBMS, so we advise your company to migrate to SQL Server.”

So, at that moment, we asked: OK, what do we need to get this done? How can we assess the impact on the systems, and how can we make this transition in a transparent way that avoids disrupting daily operations?

Since the ERP was considered the Single Source of Truth (SSOT) for 99% of the company's data, I needed to maintain compatibility across all the systems. The border systems, such as e-commerce, the data warehouse, and some complementary small systems, were built on PostgreSQL as their permanent storage. Changing their database would significantly impact border systems that rely on PostgreSQL-specific features such as full-text search, and new SQL Server licenses would add extra costs plus months of work to adapt those features.

To deal with the change, I made some decisions:

  • The only database we would change is the ERP database.
  • The border systems' databases need to consume data from the ERP database with low latency, to preserve the competitive advantage of up-to-date information.
  • Queries from the border systems must run against the warehouse, with no direct queries on the ERP database, so daily operations are not slowed down by query load on the main database.

So we needed a technology that would make this transition transparent and low-latency. The research began, and I considered two approaches:

  1. ETL (Extract, Transform, Load): Basically, a set of scripts that run periodically and update the information in the warehouse; the “traditional” way to sync data by batching modifications and replaying them at intervals (minutes, hours, days…).
  2. Streaming replication: The same approach the database sync already used with PostgreSQL, but between different database technologies. Is streaming replication even possible between different databases?

The first option was discarded because some border systems need the information on the fly. For the second option, we found Apache Kafka, which describes itself as “a distributed streaming platform”, to be a viable solution. Looking at where Apache Kafka is used (Netflix, Uber, etc.), we can see the power of this stack. For our scenario, the question became: “Is Apache Kafka viable at our scale, given that it is usually applied to much bigger situations? How complex is it to actually put this stack in place?”

And the answer was: “Yes, we can do it with Kafka, and no, it's not complex.”

The new scenario uses Kafka and Debezium, as we can see below.

The new scenario

The main server, highlighted in red, now runs SQL Server with CDC (Change Data Capture) enabled, using the Debezium SQL Server connector as the source producer and the default JDBC Kafka Connect connector as the sink consumer.

Figure 2: Servers with SQL Server and Streaming Replication
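For context, CDC has to be switched on at the database and table level before Debezium can read anything. Below is a minimal sketch using pyodbc; the server name, credentials, and the dbo.orders table are hypothetical placeholders.

```python
import pyodbc

# Connect to the SQL Server instance that hosts the ERP database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlserver.example;DATABASE=erp;UID=sa;PWD=********"
)
conn.autocommit = True
cursor = conn.cursor()

# Enable CDC once for the whole database...
cursor.execute("EXEC sys.sp_cdc_enable_db")

# ...and then for each table that should be streamed to Kafka.
cursor.execute(
    "EXEC sys.sp_cdc_enable_table "
    "@source_schema = N'dbo', @source_name = N'orders', @role_name = NULL"
)
```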

To understand this better, two Kafka Connect concepts are extremely important: source and sink, explained below:

  • Source: The source is the origin of the data. In Kafka, anything that needs to be streamed can be a source: system logs, geolocation records, database changes, etc. Anything that creates a flow of information is collected here; in our case, it was the CDC feed from SQL Server.
  • Sink: The data collected from sources is organized and retained in Kafka for a defined period of time and must then be consumed. This consumption is done by sink consumers; in our case, initially, two PostgreSQL databases (see the registration sketch after this list).
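To make the source/sink pair concrete, here is a minimal sketch of registering both connectors through the Kafka Connect REST API with Python's requests library. The hostnames, credentials, and table list (dbo.orders, dbo.customers) are hypothetical placeholders, and the exact config keys may vary slightly across Debezium and JDBC connector versions.

```python
import requests

CONNECT_URL = "http://localhost:8083/connectors"  # Kafka Connect REST endpoint

# Debezium SQL Server source: reads the CDC tables and publishes change events.
source = {
    "name": "erp-sqlserver-source",
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "sqlserver.example",
        "database.port": "1433",
        "database.user": "debezium",
        "database.password": "********",
        "database.dbname": "erp",
        "database.server.name": "erp",
        "table.whitelist": "dbo.orders,dbo.customers",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.erp",
    },
}

# JDBC sink: flattens the Debezium envelope and upserts rows into PostgreSQL.
sink = {
    "name": "warehouse-postgres-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "connection.url": "jdbc:postgresql://warehouse.example:5432/warehouse",
        "connection.user": "kafka_sink",
        "connection.password": "********",
        "topics": "erp.dbo.orders,erp.dbo.customers",
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "insert.mode": "upsert",
        "pk.mode": "record_key",
        "auto.create": "true",
        "delete.enabled": "true",
    },
}

for connector in (source, sink):
    response = requests.post(CONNECT_URL, json=connector)
    response.raise_for_status()
```

With pk.mode set to record_key and delete.enabled turned on, deletes on the source become deletes on the warehouse instead of stale rows.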

The interpretation of the information is defined by the source and sink connectors; Kafka takes care of collecting and organizing the data so it can be consumed the right way. Data flows through named streams called “topics”, and every topic is a well-defined stream of information. In our case, every topic mirrors a table: it is produced by the Debezium connector and consumed and translated into PostgreSQL by the JDBC sink connector.

The image below shows a simple way to see how source and sink data are used and where the topic fits in.
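To see what actually flows through a topic, the sketch below uses kafka-python to read a hypothetical topic named erp.dbo.orders (Debezium names topics as server-name.schema.table) and prints each change event, assuming the default JSON converter.

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "erp.dbo.orders",                      # hypothetical topic: <server.name>.<schema>.<table>
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    if message.value is None:              # tombstone record emitted after a delete
        continue
    # Debezium wraps each change in an envelope: op (c/u/d), before, after.
    payload = message.value.get("payload", message.value)
    print(payload.get("op"), payload.get("after"))
```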

The big change is here. Source production and sink consumption are very fast, consistent, and stable. From the moment data was produced by a CDC change to its consumption by the sink, the mean latency was about 200 ms. Yes, 200 milliseconds! You can choose which topics to consume (in this case, which tables I chose), so you don't need to replicate everything. And finally, the warehouse no longer needs to be read-only, so we can manage indexes and create views and materialized views for specific scenarios, resulting in reliably better performance and flexibility.
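For illustration, this is the kind of tuning a writable warehouse allows, sketched with psycopg2 against hypothetical table and column names:

```python
import psycopg2

conn = psycopg2.connect(host="warehouse.example", dbname="warehouse",
                        user="dw_admin", password="********")
with conn, conn.cursor() as cur:
    # An index tailored to the warehouse's analytical queries,
    # impossible while the replica was read-only.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_orders_created_at ON orders (created_at)")

    # A materialized view that precomputes a heavy aggregation.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS daily_sales AS
        SELECT date_trunc('day', created_at) AS day, sum(total) AS revenue
        FROM orders
        GROUP BY 1
    """)
```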

Another point is leader election with ZooKeeper. As a distributed system, Kafka stays consistent in scenarios where you need “bulletproof” availability. You can run additional Kafka brokers on different machines or containers and distribute more ZooKeeper nodes for leader election. The technology is very flexible on this point.
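If you pre-create topics yourself instead of relying on auto-creation, the replication factor is where that availability comes from. A minimal sketch with kafka-python and hypothetical broker addresses:

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Point the client at all brokers so it survives the loss of any single one.
admin = KafkaAdminClient(
    bootstrap_servers=["kafka1:9092", "kafka2:9092", "kafka3:9092"]
)

# replication_factor=3 keeps every partition available if one broker goes down.
admin.create_topics([
    NewTopic(name="erp.dbo.orders", num_partitions=3, replication_factor=3)
])
```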

Overall, the ZooKeeper + Kafka + Debezium stack raised the bar in solving the client's problem. The final result was a better-distributed system with true real-time communication between different databases. In addition, having a read-write warehouse database gave us the freedom to explore its resources more deeply to optimize and organize the warehouse, for example by creating indexes, materialized views, triggers, etc.

All of these tools are under open-source licenses, so you or your team only need to learn how they work. There are also good open-source monitoring and management tools for Kafka. If you need an enterprise solution, you can use Confluent, a paid platform that makes things very easy and fast, from the company that is one of the largest contributors to Kafka's core tooling.

The system has been running for a year, and maintenance consisted only of updates and reloads when something broke on the main system (such as stops and database changes when new ERP versions were released). As of this post (first half of 2020), schema changes (table structure changes such as new or altered columns) still require a small manual intervention to be reproduced in an already running stream, a point the Debezium team seems to be working on.

The only part that needs close attention is the initial database load. In my case, the database was around 250 GB; load time depends on your infrastructure capacity, and the tables in the source database need to be temporarily locked during this first load. In our case, the load took 2 to 4 hours for the sink databases to be 100% online, depending on how busy the systems were.
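The initial load corresponds to Debezium's snapshot phase. The fragment below shows the kind of options that control it, merged into the source connector config from the earlier sketch; the values are illustrative and should be checked against the Debezium SQL Server connector documentation for your version.

```python
# Extra keys for the Debezium source connector config (illustrative values).
snapshot_options = {
    # Take a full snapshot of the selected tables before streaming CDC changes.
    "snapshot.mode": "initial",
    # How tables are locked during the snapshot; snapshot isolation avoids long
    # blocking locks but requires ALLOW_SNAPSHOT_ISOLATION on the database.
    "snapshot.isolation.mode": "snapshot",
}

source["config"].update(snapshot_options)  # 'source' dict from the registration sketch
```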

And finally… a special thanks to Stephane Maarek. His courses and tutorials about how Kafka works allowed me to apply this to a real-life scenario faster than I imagined.
