Data Lake + DWH = Lakehouse đź’™

Alex Gordienko · Published in Geek Culture · Jan 6, 2021 · 4 min read


Mature companies today think about how to increase the value of the data stored in their data warehouses (DWH) and data lakes (DL) while cutting maintenance and support costs. Young companies, meanwhile, face a big question about what to deploy first to get the maximum advantage from their data: a DWH or a DL.

Many forward-looking people would say that now is the best time to combine both approaches and get a double benefit. But first, let's briefly look at the two already established concepts.

Photo by janer zhang on Unsplash

According to Wikipedia, data warehouses are systems used for reporting and data analysis. DWHs were invented in the late '80s and are fed by data coming from other systems, such as accounting or controlling systems. They are central repositories of data for company reporting: they store both current and historical data in a single place, which will most likely be a relational database management system (RDBMS). A single place simplifies and unifies the connections to your data. The data warehouse concept implies extract, transform, and load (ETL) processes. In general, this means that data coming into the system is transformed as it is loaded. The ETL approach helps your company create a single version of the truth.
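In code, a toy ETL pipeline might look like the sketch below. The CSV layout, table name, and cleaning rules are illustrative assumptions, and SQLite stands in for a real RDBMS; the point is only that data is normalized *before* it lands in the warehouse:

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source system (here, an in-memory CSV).
raw_csv = io.StringIO(
    "order_id,amount,currency\n"
    "1, 19.99 ,usd\n"
    "2, 5.00 ,USD\n"
)
rows = list(csv.DictReader(raw_csv))

# Transform: enforce one consistent representation before loading,
# so every consumer sees the same "single version of the truth".
cleaned = [
    (int(r["order_id"]), round(float(r["amount"]), 2), r["currency"].strip().upper())
    for r in rows
]

# Load: write the unified data into the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)

total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Every report built on `orders` now agrees on amounts and currency codes, which is exactly the consistency a DWH is meant to provide.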

DWHs are widely used to create analytical and management reports for workers throughout the enterprise, because the data is stored in a consistent and highly structured form. Moreover, they help simplify system audits, which is crucial for some regulated types of business.

All of the above brings us to the need for complicated transformation mechanisms, which are obligatory for unifying the data ingested into a data warehouse. Sophisticated transformations inevitably complicate scaling the system as the number of sources or the volume of data grows. On top of that, in most cases it is impossible to handle streaming and raw data with data warehouse systems.

Semi-structured and unstructured data then became very popular and widely used. JSON, XML, and a few other formats are now recognized as industry standards. Data volumes rose significantly, and a new problem emerged in the 2010s: companies needed easily scalable, multipurpose systems on the one hand, and systems not tied to structured data on the other. Data warehouses could not cover this need. This is when data lakes appeared.

The world rejoiced. It was an incredible innovation in organizing and storing big data (a term born in the same epoch, by the way), and it seemed that the DL could displace the DWH in the near future. Big data, patterns, and analytics together ranked second in Gartner's list of top 10 IT infrastructure and operations trends. Data lakes can answer a variety of "how-to" questions.

Bringing a new ELT approach to the world, data lakes are now able to handle terabytes of data at the click of a finger. There is no longer a need for heavyweight transformation of data while it is being loaded: you can put your data in raw formats directly into your DL, and transformations are applied later, when the data is read by its consumers. The speed of data ingestion now allows companies to easily gather tons of data from IoT devices, web analytics, game telemetry, and other streaming sources. The popularity of a new profession, the data scientist, is growing. These are the people who can find insights and patterns in data, and build and use ML models for data analysis and predictions.
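The ELT idea can be sketched in a few lines. Here a local directory stands in for object storage, and the event shape is invented for the example; the key contrast with ETL is that records land exactly as they arrive and are only shaped at read time:

```python
import json
import tempfile
from pathlib import Path

# "Object storage" stood in by a local temporary directory.
lake = Path(tempfile.mkdtemp()) / "raw" / "telemetry"
lake.mkdir(parents=True)

# Load first: land events exactly as they arrive, with no upfront schema.
events = [
    {"device": "sensor-1", "temp_c": 21.5},
    {"device": "sensor-2", "temp_c": 19.0},
    {"device": "sensor-1", "temp_c": 22.5},
]
(lake / "batch-001.json").write_text("\n".join(json.dumps(e) for e in events))

# Transform later, at read time, for one specific consumer's question.
lines = (lake / "batch-001.json").read_text().splitlines()
readings = [json.loads(line) for line in lines]
sensor1 = [r["temp_c"] for r in readings if r["device"] == "sensor-1"]
avg_sensor1 = sum(sensor1) / len(sensor1)
```

Because ingestion does no transformation at all, the write path stays fast and simple no matter how many streaming sources feed the lake.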

By the end of the decade, the opinion that the DL could replace the DWH had faded. The two approaches simply pursue different goals:

  • Data: a DWH stores structured, consistent data; a DL stores raw, semi-structured, and unstructured data.
  • Loading: a DWH uses ETL (data is transformed on load); a DL uses ELT (data is transformed on read).
  • Scalability: a DWH is hard to scale as sources and volumes grow; a DL is easily scalable.
  • Streaming: mostly impossible in a DWH; a natural fit for a DL.
  • Typical users: BI and reporting users for a DWH; data scientists for a DL.

Supporting and maintaining two solutions simultaneously places a significant burden on the support team and adds costs for the company. It is expensive, complex, and unreliable as a single version of the truth. The professional community began to understand that the data lake and the data warehouse would co-exist. This is why the data lakehouse popped up as a symbiosis of the DWH and the DL.

The main features provided by a lakehouse solution are:

  • Schema enforcement. The solution should have a way to support and enforce a schema (e.g. star or snowflake).
  • Concurrent reading and writing of data. Operational users should be able to access data quickly through familiar BI tools, while streaming and raw data are saved into storage without interruption.
  • Easy access to the data. Data should be available for analysis and exploration by industry-recognized tools (Python, R, Scala, etc.).
  • Data governance. The solution should be able to keep track of data integrity and have solid governance and auditing mechanisms.
  • Support for diverse data types. The solution should operate structured data as well as semi-structured and unstructured data.
  • End-to-end streaming. Support for streaming is needed for real-time enterprise reports.
  • Cost reduction. The data is stored in cheap object storage such as S3, Azure Blob Storage, etc.

Tools already on the market include Delta Lake, Snowflake, Azure Synapse Analytics, AWS Redshift with Redshift Spectrum, and others. Each solution provides its own additional feature set, but the lakehouse is certainly the paradigm of the decade and a modern way to increase the value of your data.

There are additional resources to read about lakehouse architecture:

https://www.snowflake.com/guides/what-data-lakehouse

https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html

https://docs.microsoft.com/en-us/azure/synapse-analytics/overview-what-is

https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html#c-spectrum-overview
