Technical insights from building a massive Data Lake

Peter Yeung
4 min read · Dec 8, 2021

One aspect of work that I particularly enjoy is solving difficult problems. I am lucky to be responsible for a platform that works with a large volume of data. Having worked in enterprise tech for most of my career, it no longer fazes me to work on a couple of million rows. To have any fun, I need to work on billions of rows.

I am going to share some lessons on how to architect a massive data lake that scales and works.


1. Make sure to implement delta detection from the transactional system

The transactional system is the source of the data. This can be CRM, ERP, or any application that collects data.

It is important to import only the rows that have changed since the last data ingestion. In traditional ETL terms, this mechanism is known as Change Data Capture (CDC). Extracting only the deltas allows us to minimize the load on the source system.
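As a rough illustration only, here is a minimal watermark-based sketch of delta extraction. It assumes the source table carries a last_modified timestamp and that the last successful extraction time is tracked in an etl_watermark table; both names are illustrative, and real CDC tooling (log-based capture, for example) may replace this entirely.

```sql
-- Minimal sketch of timestamp-based delta detection (illustrative names).
-- Assumes orders.last_modified is maintained by the source system and
-- etl_watermark stores the high-water mark of the previous extraction.
SELECT o.*
FROM orders AS o
WHERE o.last_modified > (SELECT last_extracted_at
                         FROM etl_watermark
                         WHERE table_name = 'orders');

-- After the batch lands successfully, advance the watermark.
UPDATE etl_watermark
SET last_extracted_at = UTC_TIMESTAMP()
WHERE table_name = 'orders';
```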

2. Do not go from the transactional system directly to the reporting layer

Data lakes have become popular for a reason: they provide flexibility and a layer of abstraction between the source and the target. What if the source changes? Or what if the reporting requirements change? We don’t want to tightly couple the two. Implement a Data Lake or an ingestion layer in between.

3. Mirror the data of the transactional system in your Data Lake

It helps to have an exact replica of your transaction tables. You don’t want to keep querying the transactional system, which might be a SaaS platform with limited API calls or a backend database that you end up slamming with queries. The purpose of the transactional system is to record interactions, not to support ad hoc queries. For this reason, it is very convenient to have a replica of the transactional system’s tables in your Data Lake.

This is very straightforward if you extract full tables from the source system. It gets a little trickier if you only extract changes. Off-the-shelf tools can be used to perform the ETL; however, building a replica of a table from deltas alone breaks down into three operations: 1) inserts, 2) updates, and 3) deletes.

SQL statements like REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE are very handy in these scenarios. Let the database handle inserts and updates, and you don’t need to code complex logic; a sketch is shown below.
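As a sketch only, assuming a MySQL-style mirror table in the Data Lake keyed on customer_id and a staging table holding the latest delta batch (all names are illustrative):

```sql
-- Apply a delta batch to the mirror table: inserts and updates in one pass.
-- lake.customers is the replica; staging.customers_delta holds the batch.
INSERT INTO lake.customers (customer_id, name, email, updated_at)
SELECT customer_id, name, email, updated_at
FROM staging.customers_delta
ON DUPLICATE KEY UPDATE
    name       = VALUES(name),
    email      = VALUES(email),
    updated_at = VALUES(updated_at);

-- Deletes still need their own pass, e.g. from a list of keys the source
-- reports as deleted.
DELETE FROM lake.customers
WHERE customer_id IN (SELECT customer_id FROM staging.customers_deleted);
```

REPLACE INTO would also work here, though note that it deletes and re-inserts the matching row rather than updating it in place.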

4. You do not need to extract every single table. Only extract tables that contain useful information.

This sounds very obvious, but a lot of implementations do not do it. I see many projects where every single table is downloaded just in case there is something useful. The reasoning goes: storage is cheap, so why not download everything, and then we never have to go back and extract tables we later realize are missing. Here is why that approach falls short:

a) Not all data from the transactional system is useful for analytical and reporting purposes. Some data exists purely to support the UI and screen navigation. Some data doesn’t provide much insight. Pick and choose which tables and subject areas you want to focus on. Go deep, not wide.

b) Extracting every single table also goes against the philosophy of using agile and iterative MVP approaches.

c) New tables are created in the transactional system all the time anyway. This negates the argument that it is better to extract everything in one shot, since the list of tables to extract has to be revisited periodically regardless.

5. Choosing the right technology can reduce effort by half or even more

I often see a technology stack chosen because it is popular, because some other team is using it, or because our developers are already familiar with it. You will save some upfront time and cost, but you will end up paying more over the long term. And we architect and build for the long term.

For example, some people may want to use MySQL for both transactional tables and analytical tables like star schemas. MySQL is fantastic at handling inserts, updates, and deletes, as well as guaranteeing atomic transactions even on very large tables. However, it is not very good at GROUP BY aggregations across a large table, as the query below illustrates.
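To make the workload difference concrete, here is the shape of query a star schema invites; the table and column names are hypothetical. On a row store like MySQL this means scanning the entire fact table, whereas a columnar engine reads only the few columns involved.

```sql
-- Typical star-schema rollup: full scan of the fact table plus aggregation.
-- Row stores struggle with this shape at billions of rows; columnar engines
-- are built for it.
SELECT d.calendar_month,
       p.product_category,
       SUM(f.sales_amount) AS total_sales,
       COUNT(*)            AS order_lines
FROM fact_sales AS f
JOIN dim_date    AS d ON d.date_key    = f.date_key
JOIN dim_product AS p ON p.product_key = f.product_key
GROUP BY d.calendar_month, p.product_category;
```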

a) The “some other team is using it” argument doesn’t hold much weight. The other team does not have the same goals and desired outcomes as you. Sometimes it helps to build an architecture from scratch rather than inherit technical debt. Also, if your core devs build it themselves, they will actually know and understand the code a lot better than if they inherit a codebase and end up making Frankenstein-style changes to support your platform.

b) The “let’s use jQuery because a developer is familiar with it” argument doesn’t hold either. It is very possible the developer’s knowledge is a bit outdated because they were supporting a very old system. Choosing a new language or framework comes with a learning curve, but that disadvantage quickly disappears when the newer framework lets you implement features more easily, cleanly, and quickly.
