I have dreamed of building a data platform that is much less dependent on cloud-managed services. I know that avoiding cloud services entirely is impossible these days. The age of on-premises infrastructure saw its dusk once cloud services became common and everyone could taste world-class infrastructure without worrying about a huge upfront budget for server procurement. It’s so easy that I can rent space on a rack in seconds and be charged only for the server uptime I use.
Things have become even more efficient in the past few years, as containerized applications squeeze even more utilization out of servers and make running services in the cloud cheaper. For example, if I have three services, I usually need three virtual machines, one for each service. That’s not counting overhead such as the OS itself and any monitoring or security agents. When those services run as containerized applications, I can squeeze them into just one virtual machine.
Maybe that example doesn’t show significant optimization, so let’s say I have two services in two clusters, each service/cluster with 15 virtual machines, for a whopping total of 30 virtual machines running to their specifications. With containerization, these services, along with their replicas, can be squeezed into a container cluster of only six to eight virtual machines; a rough sketch of that capacity arithmetic follows.
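Here is a minimal back-of-the-envelope sketch of that consolidation in Python. Every number (per-pod CPU request, node size, usable headroom) is an illustrative assumption, not a measurement:

```python
import math

LEGACY_VMS = 2 * 15        # two services/clusters, 15 VMs each
CPU_REQUEST_PER_POD = 1.5  # cores one containerized replica requests (assumed)
NODE_CORES = 16            # cores per Kubernetes worker node (assumed)
NODE_HEADROOM = 0.5        # usable fraction after OS/kubelet/agent overhead (assumed)

# Rough model: one pod replaces one legacy VM.
required_cores = LEGACY_VMS * CPU_REQUEST_PER_POD
nodes = math.ceil(required_cores / (NODE_CORES * NODE_HEADROOM))
print(f"{LEGACY_VMS} legacy VMs -> about {nodes} container nodes")
```

With these assumptions the script prints about six nodes, in line with the six-to-eight estimate above.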
Since almost every application is now ready to be used as a containerized application, I also dream of building a complete data platform inside Kubernetes. Yes, a full platform inside Kubernetes is possible today. I have also tried to make this platform less dependent on any particular cloud provider, which makes the solution possible to implement on any cloud provider’s Kubernetes service, or even on a Kubernetes cluster running on-premises.

Let’s start from the left:
- “Services” refers to any service outside the platform.
- “Streaming modules” uses Kafka.
Kafka is still the most flexible option so far, and of course we also use some of Kafka’s complementary modules, such as Kafka Connect, Schema Registry, and source (e.g. Debezium) and sink connectors (e.g. the GCS, S3, or HDFS sinks). The streaming modules stream data as-is from the source services’ databases; see the Debezium sketch after this list.
- “Data Lake” is the distributed storage of your choice (e.g. Google Cloud Storage, AWS S3, or HDFS).
- “Orchestrator / Dataset generator” uses Airflow.
We use Airflow to transform ingested raw data into ready-to-consume datasets. Airflow is very versatile, and the reason I like it is that it’s fully programmable. I know it’s hard to adopt for non-tech-savvy users, since it’s not drag-and-drop ETL software, but I’m sure most data engineers are quite well versed in programming, especially Python. And with the decorator implementation in recent Airflow releases, it’s even easier to adapt plain Python modules into Airflow DAGs; see the DAG sketch after this list.
- “Query Engine” uses Trino.
I’ve been following PrestoSQL quite closely since 2019. I tried to implement it at my previous workplace, but unfortunately our choice fell on Dremio. It’s very refreshing that the developers decided to rebrand it as Trino and use a cute space rabbit for the logo. Unlike the aforementioned Dremio, Trino is a pure query engine, and since it’s just a query engine we need something more reliable as its front end. Of course, you can run SQL directly against Trino (see the client sketch after this list), but I don’t think that’s very wise for the whole system.
- “SQL Frontend” uses Hue.
Hue has been around for quite a long time, and if I’m not mistaken it’s the oldest stack used in this dream data platform. It’s a very mature web-based SQL editor (or what they call a SQL assistant), primarily used for the Hadoop ecosystem before expanding to more and more connectors, so it can now connect to almost any type of database or query engine. As a mature application, its UI may not have a “modern touch”, but it really does its job, and the project is very much alive and fully maintained.
- “BI Platform” uses Superset.
Most data visualization software comes with a hefty price tag, as it is crucial to decision making. That hasn’t stopped the open-source community from building something of its own. Thus comes Superset, to save the community from those hefty price tags. Superset is very powerful and customizable and, more importantly, has strong community support. It may be “free”, but you need a team of engineers to deploy and maintain it. If you need a ready-to-use version of Superset, you can subscribe to their premium service; don’t worry, it’s still relatively cheap compared to the more popular visualization options. And since Superset reaches databases through SQLAlchemy, wiring it to Trino is a one-line URI; see the last sketch after this list.
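As a taste of the streaming modules, here is a minimal sketch of registering a Debezium PostgreSQL source connector through the Kafka Connect REST API. The service names (`kafka-connect`, `schema-registry`, `orders-db`), credentials, and table list are hypothetical placeholders, and some config keys differ between Debezium versions (e.g. `topic.prefix` is the Debezium 2.x name):

```python
import requests

# Kafka Connect REST endpoint; hostname is a hypothetical in-cluster DNS name.
CONNECT_URL = "http://kafka-connect:8083/connectors"

connector = {
    "name": "orders-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db",   # source service database (assumed)
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "********",
        "database.dbname": "orders",
        "table.include.list": "public.orders",
        "topic.prefix": "orders",
        # Serialize through the Schema Registry so downstream sinks stay consistent.
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

resp = requests.post(CONNECT_URL, json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())
```

The same endpoint accepts sink connector configs (GCS, S3, or HDFS), which is how the topics land in the data lake.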
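For the orchestrator, here is a minimal TaskFlow-style DAG showing the decorator approach mentioned above. The task bodies are placeholders, and the sketch assumes a recent Airflow 2.x (where the `schedule` argument replaced `schedule_interval`):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def build_daily_dataset():
    """Toy pipeline: raw lake files in, ready-to-consume dataset out."""

    @task
    def extract() -> list[dict]:
        # Placeholder: read raw ingested files from the data lake.
        return [{"order_id": 1, "amount": 42.0}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Placeholder business logic: keep only valid rows.
        return [r for r in rows if r["amount"] > 0]

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder: write the curated dataset back to the lake.
        print(f"writing {len(rows)} rows")

    load(transform(extract()))


build_daily_dataset()
```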
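Running SQL directly against Trino is a plain DB-API call with the `trino` Python client. The coordinator hostname, catalog, and table here are placeholders for this sketch:

```python
import trino

# Connect to the Trino coordinator; hostname is a hypothetical in-cluster service name.
conn = trino.dbapi.connect(
    host="trino-coordinator",
    port=8080,
    user="analyst",
    catalog="hive",    # e.g. a Hive connector pointed at the data lake
    schema="curated",
)

cur = conn.cursor()
cur.execute("SELECT order_id, amount FROM orders LIMIT 10")
for row in cur.fetchall():
    print(row)
```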
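Finally, because Superset connects to databases through SQLAlchemy URIs, the same URI you would paste into its database form can be smoke-tested outside Superset first. This sketch assumes the Trino dialect is installed (`pip install "trino[sqlalchemy]"`) and reuses the hypothetical hostnames above:

```python
from sqlalchemy import create_engine, text

# The same "trino://user@host:port/catalog" URI goes into Superset's database form.
engine = create_engine("trino://analyst@trino-coordinator:8080/hive")

with engine.connect() as conn:
    # A trivial query to confirm the URI resolves and Trino answers.
    print(conn.execute(text("SELECT 1")).scalar())
```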
I truly dream that, together with my team, I can fully implement this platform at my current workplace.