Rebuild This Castle: A Data Migration Story

Imagine yourself in a run-down house. Make it a big one. Huge. Some walls are still standing, but barely. Some rooms are just a pile of bricks lying in the middle of the floor. There are secret passageways connecting random chambers, almost like a labyrinth. Did you make the house gigantic in your mind? Go on, make it even bigger. A quirky, dysfunctional, enormous castle.

 

Here is your challenge: you need to rebuild the castle on another land. In its new version, the building should regain its splendor, while guests and inhabitants alike should be able to find their way seamlessly. Oh, and if you fail, even slightly, even by a fraction, all the inhabitants of the palace lose their livelihood.

 

Proceed.

 

This is the magnitude of the challenge that software architect Sebastian Gavril found himself in, when he was approached to migrate a fintech’s core databases to the cloud.

 

Tinka, a Dutch fintech company, already had some of its services in the cloud, such as microservices, authentication systems, message brokers, CI/CD, and monitoring tools. But the core services, related to the actual financial processes, remained on-premise, part of the so-called legacy system: the virtual machines, databases, the Enterprise Service Bus, and workflows.

 

The job entailed moving from on-premise into the cloud 30+Tb of data, split in over 30 Oracle database instances across multiple Exadata machines — mini data centers that can be rented out — and 7 environments.

 

In other words, the core services were still in the “old castle” and needed to be moved. All those discarded bricks, secret passageways — they were actually scattered databases on old servers, connected among each other in a complex network, being used by various applications, microservices, or monitoring tools.

 

So how does this story unfold?

The heroes

This job needed an architect. A software architect. “Normally, an architect first needs to understand his customers, their needs and desires, constraints and opportunities. Then he translates that into a formal language that can be used by engineers to construct the building his customers desire. And during the construction phase, the architect is always present to make sure the plan is followed and change requests from clients are included with minimal impact on the integrity of the building.”

    

This is what Sebastian did. He needed to have an ensemble view of the challenge, decide what’s needed, and get the team together. They were: Simona Gavrilescu (Delivery Manager), Ancuta Costan (Test Lead), and Monica Coros (DevOps Engineer), as well as employees of Tinka.

    

At Levi9, the team is defined together with the customer, with the aim of enhancing collaboration and working together as one team.

The castle

Moving core databases is an extremely complex project. When you move databases, you don’t just move the data, but you need to make sure that all the critical systems will function correctly.

 

Plus, the old castle — the legacy systems — were responsible for a large part of Tinka’s income. This is true for most companies.

 

While looking to achieve a positive impact for the clients, one of Levi9’s goals of this project was to have no negative effect on the Tinka customers. Customers were not supposed to notice anything. However, if things did go wrong, the company could have lost or corrupted customer data.

The new kingdom

So where was this new land where the castle was being built? The cloud.

 

Until recently, “the cloud” actually meant just a couple of key players, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. Nowadays, any large, established company starts to develop its own cloud: Oracle Cloud, IBM Cloud, Digital Ocean, Heroku, Alibaba. To attract customers, many offers features such as niche-specific services, better integration between cloud and legacy systems, and migration tools.

 

The chosen solution was the RDS – Relational Database Services, the database service offered by AWS, mainly because part of Tinka’s infrastructure was already in AWS. After migration, AWS would be the one keeping up with maintenance and updates and many more as part of the package.

The challenge

“It was imperative to understand the complexity, which I saw on two levels,” says Sebastian.

 

The first level of complexity is that of the database. In this case, the team used a tool provided by the AWS Data Migration Service (DMS) of AWS called the Schema conversion tool. “When you move a database from one place to another, this tool generates a scheme that shows the objects, triggers, procedures, and packages. It also shows which ones can be converted automatically and where manual changes need to be made, through simple, medium, or complex actions. This helps us understand what database-specific effort we need,” explains software architect Sebastian Gavril.

 

The second level of complexity is that of the “landscape”. That means everything that calls on the database. The landscape was very rich. From old, legacy servers – that nobody really understood, to modern microservices, with everything in between: ESB, custom jobs, etc.

 

It helps us understand where the problems are. What we need to change: connection strings, drivers, maybe even code that used specific functionality in Oracle Enterprise Edition (an edition we wanted to move away from).

The map

So what does moving the castle entail? Well, when planned systematically, things look pretty straightforward: you need to move the bricks and you need to redesign the rooms. In data transfer language this means that you need to take into consideration two aspects: the applications and the databases.

   

On the applications side, there are some conversions to be made: for example update libraries, change versions, rewrite some code, and make sure the configurations are in order.

    

The databases also need to be converted, either by using tools such as the schema conversion tool or manually. For some, you can find a procedure. For others, they need to be built from the ground up.

    

For migrating the actual data, just like renting a moving truck, AWS has a Database Migration Service offer. You can rent a machine, point to the “departure” address, set an “arrival” address, and up you go. It can take minutes, hours, days. Or weeks, in one of our cases.

It all comes crumbling down: Part I

Every project starts with “it’s not much”. “It’s practically done”. Sebastian Gavril knows to cringe at this one. The first migration project was started 1 year before and this was how it was presented to the team. The first go-live attempt was possible about 2 months after these famous phrases, after a lot of tweaking and testing on the initial solution.

 

And this was just one room.

 

In such a complex data transfer project, issues keep popping up at every step, sometimes baffling entire teams. The funniest thing was when we migrated a database and ran a simple smoke test to validate that everything is working. The test failed repeatedly so we spent the whole day trying to fix the issue, only to roll back in the end. The second day we realized that we were simply testing the wrong way. We were probably a bit too nervous.”

Slaying the budget monster

Every castle needs its own monster and this was no exception. As one of the key issues of the project was to lower costs, the budget was the monster that needed slaying and slaying. “Traditionally, software engineers didn’t have to think about costs too much.” Having his own very healthy financial mindset for expenses at home, Sebastian Gavril explains why he took care to chip away costs at every opportunity. “With the rise of cloud providers, engineers are using cloud services more and more and taking most of the decisions in regards to the configuration of those resources. For example, the biggest databases you can order at the click of a button in AWS are ~80 times more expensive than what we chose for one of our medium workloads.”

 

For example, by keeping an eye on resource metrics for a database, the Levi9 team was able to cut costs by about 75% from initial estimates.

Time to say goodbye

Some rooms of the castle simply got left behind and it was for the better.

 

One of the biggest surprises of the Levi9 team was when it realized it could take some smart decisions. The goal was to stop using our data center, not necessarily to migrate everything.

 

“We found out that the largest database, a 15 TB behemoth, was actually what we call “highly available”. That means, there are 2 databases instead of 1. The second one is there only if the first one crashes for various reasons. But this was not a business-critical database, so we did not need to have the db up and running 24/7. So we decided we can just drop one of the databases and rely on backups to restore the primary database in case of failure.”

 

It was only because of the Levi9 team’s culture of focusing on one common goal that enabled it to know the business and the client so well, that they were able to take such surprisingly easy, but important steps.

It all comes crumbling down: Part II

And then there was the huge data latency problem.

 

Levi9 practically moved the databases from the premises in Zwolle (Netherlands) to Dublin (Ireland). Some apps were left in Zwolle (Netherlands), to be migrated later, but they were calling on the database in Dublin.

 

“Data transfer is pretty fast nowadays, so normally this extra distance is not noticed. But due to the way one of the applications was written, the data was going to and from the database much more than needed. It became unusable, as some common operations were now taking tens of seconds to complete.”

 

Because of the architecture of this application, it was like going to the castle for a lamp, then for the bed, then for the cover. The team was on the brink of giving up because it was not able to change the application. In the end, they managed to tweak the infrastructure so that it would allow for much more traffic than usual. It was like sending 3 people to the castle each for one thing only – not ideal, but it worked.

The welcome party

For a while, you will live between castles. You will still have the old one, but also the new one, perfectly in sync. And then?

 

And then you just need to make up your mind and throw a welcome party!

 

In data migration, the welcome party is extremely tricky. The inhabitants must not notice any difference, almost like they would be teleported. The service cannot be down for more than 15 minutes. The data about each Tinka client must be precisely the same. And, God forbid, DO NOT, under any circumstances, mess up their bank account. Good luck!

 

When it was time for the last final move, Sebastian Gavril did what he always does. He planned for it. Sebastian Gavril and his team had a step-by-step, second-by-second to-do list.

 

How detailed should this plan be? “Just to give you an idea,” says Sebastian Gavril, “we prepared so thoroughly that one of the go-live plans had 66 tasks, each representing an action.”

 

There are hundreds of small decisions that need to be taken before the key moment: for example at what time do we go live? And there are numerous connections that should be made between databases and services, all with the purpose of making a seamless transition, with as little downtime as possible. Maybe most importantly, what to do not if, rather when things go wrong.

 

Sebastian Gavril and his team split the very detailed action plan into three: before go-live (checking everything is in order at the new site), during go-live (making sure that each team member knows what to do at each moment) and after go-live (monitoring the new setup and being ready to take action if needed).

The happy ever after

It all ends well, as it should.

 

For a result-driven, impact-delivering, change-making company like Levi9, failure was not an option at any point. No data was lost. Customers of Tinka might have noticed a two to a three-minute website glitch, but nothing more.

 

“Having managed to deliver this feat was a great self-accomplishment. It’s also the kind of project that challenges and empowers a senior engineer to grow, because it involves many aspects that are not part of the day-to-day life: coding, database administration, database development, cloud, networking, DevOps, architecture, third-party management, finance. A very good opportunity to develop as a well-versed engineer.”

 

To move critical systems, to make everything run better, to significantly increase budget savings, all the while protecting the client from any negative consequence — this could all be summarized in what Levi9 wants to achieve: impact.

In this article:
Published:
29 September 2023

Related posts