Lessons learned building a high-performance data lake for CRM using AWS and Apache Druid

Peter Yeung
6 min read · Jan 5, 2021

Last year, I had the pleasure of building a massive data platform from scratch. I was given the opportunity to build the stack from the ground up, unconstrained by legacy platform decisions, with a couple of exceptions. First, it had to be on AWS. Second, it had to be very cost effective. This ruled out commercial offerings like Snowflake.

In this article, I will write about some of the lessons learned from this experience.

Business Challenge

First, let’s start with a description of the problem we were trying to solve. At a high level, we needed a data platform to store a lot of CRM data. Storing the data is not enough; it has to be in a format conducive to reporting.

Data Architecture

[Diagram: Logical Topology]

The architecture was extremely simple, and kept that way on purpose. We wanted to start generating valuable, real-world insights as soon as possible, and we got no points for spending a long time building fancy infrastructure just in case.

We followed a minimum viable product (MVP) model. The bare-bones requirement was just to get a database up. That enabled SQL queries, whose output could then be extracted and put into Excel for reporting.

The diagram above is simplified in that it does not show how data moves from one data store to another, i.e. the ETL portion. In keeping with modern practice, data movement followed an ELT model: data was extracted and loaded first, and transformations were added only where reporting proved not fast enough.

Using MySQL as the Operational Data Store

The first decision was what type of database to use. We chose Amazon Aurora on AWS RDS (https://aws.amazon.com/rds/aurora/serverless/), using the Serverless option.

Why MySQL? Extracting data and dumping it raw into S3 was neither flexible nor fast enough. It is possible to use Athena and just query S3 directly. However, as with any data platform, cleaning the data is a huge part of the undertaking. The data coming from CRM is structured into tables, but the contents of various fields were anything but regular. S3 does not allow files to be updated in place to fill in values for certain columns. Unless the data arrives super clean from the source, there is always a need to clean it: standardize values and/or fill in missing ones.
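As an illustration, here is a minimal sketch of the kind of in-place cleanup that raw files on S3 cannot support. The contacts table, column names, and connection details are hypothetical, not our actual schema.

```python
# Minimal sketch of in-place data cleaning in MySQL, assuming a hypothetical
# `contacts` table with inconsistent `country` values and missing lead sources.
# Requires: pip install pymysql
import pymysql

conn = pymysql.connect(
    host="my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",  # placeholder endpoint
    user="admin",
    password="...",  # placeholder
    database="crm",
)

with conn.cursor() as cur:
    # Standardize country values that arrive in several spellings.
    cur.execute(
        "UPDATE contacts SET country = 'US' "
        "WHERE country IN ('USA', 'U.S.', 'United States')"
    )
    # Fill in missing values with a sensible default.
    cur.execute(
        "UPDATE contacts SET lead_source = 'unknown' WHERE lead_source IS NULL"
    )

conn.commit()
conn.close()
```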

Now that we have established the need for a transactional database, let’s talk about why we went with RDS Serverless. We could have gone with traditional RDS, where we would not have to host and maintain the database ourselves. But that wasn’t enough. To get off the ground as quickly as possible, we needed something that could scale up fast. RDS Serverless adds Capacity Units quickly, going from 2 all the way up to 256.

Our usage patterns also matched RDS Serverless well. We extracted data once a week and ran reporting maybe only one or two days of the week. With Serverless, you always pay for storage, but you are only charged for compute while you are actually using it. This was great for us while we were ramping up: we had a lot of data on day one, but not a lot of queries.
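For reference, here is a sketch of how this scaling range and auto-pause behaviour can be configured with boto3 for Aurora Serverless v1. The cluster identifier and limits are placeholders, not our actual settings.

```python
# Sketch: configure Aurora Serverless v1 scaling and auto-pause via boto3.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_cluster(
    DBClusterIdentifier="crm-data-platform",  # placeholder cluster name
    ScalingConfiguration={
        "MinCapacity": 2,               # scaling range in Aurora Capacity Units
        "MaxCapacity": 256,
        "AutoPause": True,              # pause compute when idle...
        "SecondsUntilAutoPause": 1800,  # ...after 30 minutes with no connections
    },
)
```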

Lessons Learned for using RDS Serverless:

  1. Since it is serverless, there is no server! This means a simple MySQL SELECT ... INTO OUTFILE will not work, because there is no local storage. Creating table dumps does not work either, for the same reason. One has to use SQL to query and extract the data (see the sketch after this list). SQL has overhead, and hence is not as fast as a table dump.
  2. Capacity Units do not scale all that seamlessly. This might be because our data sits in one large table; scaling may work automatically when queries are spread across multiple tables. For now, queries against our large table lock something, and the AWS RDS console simply says that an autoscaling point cannot be found. A partial workaround was to set the minimum capacity units to something high, like 64: either the database was sleeping, or, if it wasn’t, we wanted it to start at 64 units rather than gradually scale up from 2, to 4, to 8, and so on.
  3. MySQL on RDS Serverless still felt a little slow. However, there is not much tuning you can do, since the admin user has no access to the underlying operating system or other low-level operations. But perhaps that is the beauty of RDS Serverless: an occasional admin does not have to worry about low-level tuning.
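Following up on the first lesson above, here is a minimal sketch of extracting a table over a plain SQL connection instead of SELECT ... INTO OUTFILE. The table, columns, and connection details are hypothetical.

```python
# Sketch: stream a table to CSV over a SQL connection, since Aurora Serverless
# has no local disk for SELECT ... INTO OUTFILE or table dumps.
# Requires: pip install pymysql
import csv
import pymysql

conn = pymysql.connect(
    host="my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",  # placeholder
    user="admin",
    password="...",  # placeholder
    database="crm",
    cursorclass=pymysql.cursors.SSCursor,  # server-side cursor: stream rows instead of buffering them all
)

with conn.cursor() as cur, open("contacts.csv", "w", newline="") as out:
    cur.execute("SELECT id, country, lead_source, updated_at FROM contacts")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # header row
    for row in cur:
        writer.writerow(row)

conn.close()
```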

Standing up reporting using Apache Superset

As mentioned, you can get reporting on Day 1 by just using SQL and Excel. This bought us enough time to evaluate several reporting options.

  1. AWS Quicksight — at first glance, this looked like a quick option. Since it is a managed service, there is nothing to install. Granting access was super easy as well, since we could use native AWS security constructs like roles. However, in terms of features and the types of charts available, Quicksight was very basic. There were the standard line and pie charts, etc., but nothing like heatmaps or bubble charts, which I consider normal for modern reporting tools. When selecting a reporting platform, it is important to keep in mind not just the current features but what lies in store for the future, and it felt to me that AWS wasn’t aggressively expanding Quicksight’s feature set.
  2. Metabase — this is a commonly recommended open source option, but it also lacked rich charting functionality. At first glance it looked like a cool tool, since it automatically computes statistics about a table, such as how many records it has and which values are the most common. But it kept scanning our database endlessly. This was bad for two reasons: first, it locked certain tables, and second, it kept MySQL connections open constantly, so RDS Serverless never went to sleep, which negated any cost savings from using Serverless.
  3. Apache Superset — we finally settled on Superset. It satisfied the requirement of being open source with no licensing fees, and it had an incredible range of charting options. Superset cannot compete with commercial options like Tableau or Qlik, but for an occasionally used reporting tool, it was fantastic (see the connection sketch after this list).
  4. Tableau — business users can continue to use Tableau.
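To make the Superset choice concrete, here is a rough sketch of the kind of SQLAlchemy URIs Superset uses when registering databases, along with a quick connectivity check outside Superset. The hostnames and credentials are placeholders, and the druid:// dialect assumes the pydruid package is installed.

```python
# Sketch: SQLAlchemy URIs of the form Superset expects for its database connections.
# Requires: pip install sqlalchemy pymysql pydruid
from sqlalchemy import create_engine, text

MYSQL_URI = (
    "mysql+pymysql://reporting_user:password@"            # placeholder credentials
    "my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com:3306/crm"
)
DRUID_URI = "druid://druid-broker.internal:8082/druid/v2/sql/"  # placeholder broker host

# Quick connectivity check using the same URIs Superset would use.
for uri in (MYSQL_URI, DRUID_URI):
    engine = create_engine(uri)
    with engine.connect() as conn:
        print(uri.split("://")[0], conn.execute(text("SELECT 1")).scalar())
```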

Our first iteration had Superset reporting directly against MySQL. But since some tables were easily a billion rows, any aggregate query performed poorly no matter how many indexes we added. MySQL is a transactional database, and it can never perform as well as a data store purpose-built for reporting.

Apache Druid as a high speed high volume analytics tool

While MySQL was sufficient initially, as the data volume grew, it could no longer keep pace on the reporting side. It was still sufficient on the transactional side.

I must say that I was pleasantly surprised by the performance of Druid. I have used Redshift and Hive before. Redshift was much better than Hive, but developers still had to build aggregate tables from the base tables. Redshift was also not meant to support a high number of concurrent users swamping it with queries. On the plus side, Redshift is a columnar data store, which makes it quite fast for GROUP BY aggregate queries.

Hive was decent, but it did not add much speed, just more horizontal scale.

Apache Druid was perfect for our GROUP BY types of queries. Because it builds bitmap indexes automatically during ingestion, it was also able to fetch records by id columns (high cardinality) extremely quickly. It also offers approximate “count distinct” and approximate “top N” aggregations. I am not sure how these work under the covers, but they were sufficiently accurate. After all, the data is not perfect, so a margin of error of around 1% is a fine trade for a double-digit speedup.
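As an example of the kind of query this enabled, here is a sketch that posts an approximate distinct count to Druid’s SQL endpoint. The datasource, columns, and broker hostname are hypothetical.

```python
# Sketch: approximate distinct count over Druid SQL, posted to the broker.
# Requires: pip install requests
import requests

query = """
SELECT
  campaign,
  APPROX_COUNT_DISTINCT(contact_id) AS approx_contacts
FROM crm_activities
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '90' DAY
GROUP BY campaign
ORDER BY approx_contacts DESC
LIMIT 20
"""

resp = requests.post(
    "http://druid-broker.internal:8082/druid/v2/sql",  # broker SQL endpoint (placeholder host)
    json={"query": query},
)
resp.raise_for_status()
for row in resp.json():
    print(row)
```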

All the pieces coming together

After standing up Druid, we did not repoint every query from MySQL to Druid. Sometimes it was very useful to run basic Superset reports directly against MySQL tables without first ingesting into Druid. As in many transactional systems, some tables are very large, but not all. So the large tables ended up in Druid, and the smaller tables remained in MySQL.
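For completeness, here is a rough sketch of what a native batch ingestion spec for one of the large tables might look like, assuming the weekly extract is landed in S3 as CSV. The bucket, datasource, column names, and Druid hostnames are placeholders.

```python
# Sketch: submit a Druid native batch (index_parallel) ingestion task for a
# hypothetical large table exported weekly from MySQL to S3 as CSV.
# Requires: pip install requests
import requests

ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "s3", "uris": ["s3://crm-extracts/weekly/crm_activities.csv"]},
            "inputFormat": {"type": "csv", "findColumnsFromHeader": True},
        },
        "dataSchema": {
            "dataSource": "crm_activities",
            "timestampSpec": {"column": "activity_date", "format": "auto"},
            "dimensionsSpec": {"dimensions": ["contact_id", "campaign", "activity_type"]},
            "granularitySpec": {"segmentGranularity": "month", "queryGranularity": "none"},
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

# Submit the task to the Overlord (here via the Router, which proxies the task API).
resp = requests.post(
    "http://druid-router.internal:8888/druid/indexer/v1/task",  # placeholder host
    json=ingestion_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the task id
```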

Thanks for taking the time to read this!
