Data Lake Houses Join Data Lake in Big Data Analytics Race

Big data continues to evolve, as data lake houses seek to combine the best of data warehouses with that of data lakes. For some, though, it prompts a re-evaluation of build-buy strategy.

Jack Vaughan

April 21, 2021


Edge computing is prominent in IoT discussions these days, particularly where artificial intelligence comes into play. Software architects see advantage in processing or pre-processing data on the edge of the Internet.

But, for now, much Internet of Things (IoT) analytics occurs in the cloud. Even as edge moves forward, correlating edge data with historical enterprise data troves – often referred to as big data – will be the norm.

Certainly, IoT architects need to sift through expanding edge options. These options are expected to grow along with edge infrastructure that, according to IDC estimates, will increase by more than 50% by 2023. At the same time, IoT architects face options for cloud-based big data analytics that also continue to expand.

The latest option is the data lake house, which is closely related to the cloud data warehouse. These systems strive to combine the best aspects of relational data warehouses with those of Hadoop data lakes.

This data lake house combination appears to be a vibrant new part of a global big data market that MarketsandMarkets suggests will grow at 10.6% annually to $229.4 billion by 2025.

Unlike a data warehouse, a data lake house is meant to handle a wide variety of unstructured incoming data – that is, data that doesn’t follow a predefined data model – and is underpinned by highly scalable and relatively inexpensive cloud object storage. Unlike early data lakes, data lake houses can readily enable queries for analytics, while ensuring transactional integrity where it’s needed. This description works for many new cloud data warehouses too.

All this is driven by an industry trend that sees IT shops move processing – including what was formerly done as on-premises data analytics – to the cloud. Importantly, decisions on the data lake house/cloud data warehouse invite IoT architects to revisit an enterprise’s build-buy strategy.

Purveyors of data lake houses and cloud data warehouses are many, and opinions on qualifications vary. Arguably – and some vendors avoid the label – one could include all or part of AWS Lake House Architecture, Databricks Delta Lake and Delta Engine, Google BigQuery, the IBM Cloud DB Reference Architecture, Microsoft Azure Synapse Analytics, Oracle Autonomous Data Warehouse and the Snowflake Cloud Data Platform. Other entries also vie in these vigorously competitive categories.

Flaws in Data Lake Models

Like the centralized data lakes before them, data lake houses are the object of technical criticism. They appear at a time when decentralized formats abound. Neil Raden, industry analyst, consultant, and founder of HiredBrains, addressed the data lake house issues in a recent blog commentary.

“The concept of a data lake is flawed,” he writes, while concluding the same for the data lake house. As always, a universal data store that holds “one version of the truth” has drawbacks. “In an age of multi-cloud and hybrid-cloud distributed data, not to mention sprawling sensor farms of IoT, there is no advantage to pulling it all together,” Raden argued.

However, edge architecture that complements cloud computing will take time to grow, he said.

“People are collecting either tons of data or portions of data. But, in terms of intelligence at the edge, it’s early,” Raden said in an interview.

The edge analytics movement is potent, but nascent, agrees Igor Shaposhnikov, director of business development at SciForce, a Ukraine-based IT company that specializes in software development.

“The development of 5G will benefit edge analytics,” he said via e-mail message, while noting there are constraints to analytics at the edge. Edge analytics, he indicated, should not be seen as a full replacement for centralized data analytics. Instead, developers will need to be flexible as they make constant tradeoffs between full-scale collection of data off-line and prompt data analysis in real time.

Hadoop As History

It’s been a rocky road to data lake houses, with evolving offerings improving on predecessors, but still falling short in some regard.

The data warehouse arose in the 1990s to support specialized analytics, set apart from the enterprise’s warhorse transactional database.

As unstructured data became more common, and data warehouse costs went up, data warehouses were challenged in the 2000s by open-source Hadoop systems that supported massively distributed cloud-style processing along with the Hadoop Distributed File System (HDFS).

These systems formed clusters that came to be called data lakes – places where data was cast, to be organized and archived later “downstream.”

While the Hadoop style surfaced at cloud providers’ data centers, commercial versions were tried out mostly at on-premises data centers, where Hadoop called for system programmers and configuration specialists. The Hadoop data lake became the locus of a build movement among open source developers who created new software for every possible big data job, from data ingestion and streaming to analytical querying and, eventually, machine learning.

Views on the Data Lake House

The continued growth of cloud, and a growing sense of disorganization surrounding the data lake, prompted new takes on big data processing. And the time came, some would suggest, when it seemed right to come up with a new name to distinguish this year’s analytics from things beginning to resemble legacy systems.

The marriage of the best in data warehouses (data quality, consistency, and SQL analytics) and the data lake (massive processing and elastic scalability) is a natural, according to Joel Minnick, vice president, marketing at Databricks.

The company, whose founders originated the popular open source Apache Spark analytics engine, an alternative to Hadoop’s MapReduce processing engine, has recently released the cloud-based Delta Lake and, in turn, promoted the data lake house notion.

According to Minnick, Delta Lake is marked by a transactional data layer that brings data quality, governance and performance that surpass those of original data lake designs.
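To make the idea of a transactional layer over object storage more concrete, here is a minimal sketch in Python. It is a toy illustration only and is not Delta Lake's actual implementation or API: `ToyObjectStore`, `ToyTable` and the key layout are all hypothetical names, and the in-memory store simply assumes that an object store can offer an atomic "write only if the key is absent" operation. Data files stay invisible until a numbered log entry names them, so two writers racing for the same version cannot both succeed.

```python
import json
import uuid


class ToyObjectStore:
    """In-memory stand-in for cloud object storage (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put_if_absent(self, key, data):
        # Assumed atomic primitive: never overwrite an existing key.
        if key in self._blobs:
            return False
        self._blobs[key] = data
        return True

    def get(self, key):
        return self._blobs[key]

    def keys(self, prefix):
        return sorted(k for k in self._blobs if k.startswith(prefix))


class ToyTable:
    """Append-only table whose visible contents are defined by a commit log."""

    def __init__(self, store, name):
        self.store = store
        self.name = name

    def version(self):
        # The number of committed log entries is the current table version.
        return len(self.store.keys(f"{self.name}/_log/"))

    def commit(self, rows, expected_version):
        # Step 1: write the data file under a unique name. Readers ignore it
        # until a log entry points at it.
        data_key = f"{self.name}/data/{uuid.uuid4().hex}.json"
        self.store.put_if_absent(data_key, json.dumps(rows))
        # Step 2: claim the next log slot. If another writer already took it,
        # this returns False and the data file above stays invisible.
        log_key = f"{self.name}/_log/{expected_version:08d}.json"
        return self.store.put_if_absent(log_key, json.dumps({"add": data_key}))

    def read(self):
        # Replay the log in order and collect rows from the files it names.
        rows = []
        for log_key in self.store.keys(f"{self.name}/_log/"):
            entry = json.loads(self.store.get(log_key))
            rows.extend(json.loads(self.store.get(entry["add"])))
        return rows
```

In this sketch a writer that loses the race would re-read the table, pick up the new version number and retry. A production system would also need schema checks, deletes and cleanup of orphaned data files, none of which are shown here.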

The notion was simple, he said. Delta Lake was designed “by looking at what was good about the old architectures and shutting off the bad.”

He said the original data lakes became silos, with specialized lakes emerging for data warehousing, streaming workloads, data engineering groups and data science cohorts. Data is often locked up.

Now, especially as machine learning is applied to IoT and other data types, collaboration across groups is needed more than ever, he said.

According to David Langton, the data lake house is about convergence. Langton, who is vice president for products at data integration software maker Matillion, said his company last year partnered with Databricks to launch new extract, transform, load (ETL) capabilities for Delta Lake.

For Matillion, Databricks and others, drag-and-drop interfaces for assembling ETL pipelines for analytics on cloud have become de rigueur.

“The lake house is kind of a paradigm where you acquire, clean and store data once. That is valuable as the volume of data is growing and the number of different sources of data that you need to combine to get a single view of something is also increasing,” he said. This complexity, he continued, is made somewhat easier by moving to cloud.

According to Bernd Gross, CTO, Software AG, the classic Hadoop ship has sailed, and fans are hard to find.

“It is a bit out of fashion,” he said. “Today, you look to keep the data where it is produced, and process it on the fly.”

Still, there are existing systems out there representing considerable investment. Gross said the Software AG Cumulocity IoT DataHub works to combine newly acquired end-point sensor data with historical data, so this Software AG IoT platform integrates with data warehouses and data lakes already in place. It supports cloud-based object storage formats as well.
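As a rough illustration of what combining newly acquired sensor data with historical data can look like, the Python sketch below scores each new reading against a per-sensor baseline computed from history. It is not based on Cumulocity IoT DataHub or any vendor API. The function names, the record layout (`sensor_id`, `value`) and the choice of a z-score as the comparison are all assumptions made for the example.

```python
from statistics import mean, stdev


def build_baselines(history):
    """Summarize historical readings per sensor as (mean, standard deviation)."""
    baselines = {}
    for sensor_id, values in history.items():
        if len(values) >= 2:  # stdev needs at least two data points
            baselines[sensor_id] = (mean(values), stdev(values))
    return baselines


def enrich(reading, baselines):
    """Attach a z-score to a new reading, or None if no usable baseline exists."""
    baseline = baselines.get(reading["sensor_id"])
    if baseline is None or baseline[1] == 0:
        return {**reading, "z_score": None}
    avg, spread = baseline
    return {**reading, "z_score": (reading["value"] - avg) / spread}


# Hypothetical historical data, as might be pulled from an existing warehouse or lake.
history = {"pump-1": [10.0, 12.0, 14.0]}
baselines = build_baselines(history)

# A newly arrived reading is scored against that history.
scored = enrich({"sensor_id": "pump-1", "value": 16.0}, baselines)
```

Here the historical values would come from whatever store is already in place, while the new readings arrive from the endpoint sensors. A sensor with no history simply gets a `z_score` of `None` rather than a guess.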

End-to-End Data Pipelines

There are a lot of moving parts in IoT pipelines that feed data warehouses, data lakes, data lake houses and cloud data warehouses. The need is to assemble tools to ingest streaming data, filter out the noise from the signal, highlight anomalies and in many cases display output on a map for analysis, according to Suzanne Foss, product manager at geographic information specialist Esri.
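The first three stages Foss describes (ingest, filter, highlight) can be sketched as a chain of Python generators. This is a simplified illustration, not Esri's implementation: the rolling median filter, the fixed expected range and the sample temperature values are all assumptions chosen for the example, and the map display step is left out.

```python
from collections import deque
from statistics import median


def smooth(readings, window=3):
    """Filter noise by replacing each value with a rolling median."""
    recent = deque(maxlen=window)  # keeps only the last `window` values
    for timestamp, value in readings:
        recent.append(value)
        yield timestamp, median(recent)


def flag_anomalies(readings, low, high):
    """Mark readings that fall outside an expected range."""
    for timestamp, value in readings:
        yield {"t": timestamp, "value": value, "anomaly": not (low <= value <= high)}


# Hypothetical stream of (timestamp, temperature) pairs: one isolated spike
# at t=2, then a sustained shift starting at t=4.
stream = [(0, 20.0), (1, 21.0), (2, 95.0), (3, 22.0), (4, 80.0), (5, 82.0), (6, 81.0)]

results = list(flag_anomalies(smooth(stream), low=0.0, high=50.0))
anomalous_times = [r["t"] for r in results if r["anomaly"]]
```

With these inputs the lone spike at t=2 is smoothed away as noise, while the sustained shift from t=4 onward is flagged. Because each stage is a generator, readings flow through one at a time, which is the shape a streaming pipeline needs.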

Many users still maintain on-premises Hadoop and Spark big data processing architecture, Foss said, but cloud processing is taking on a big role in handling IoT users’ workloads. Assigning staff to administer complicated hardware clusters can be work they’d rather forgo.

“Big data in and of itself is increasingly commoditized, and organizations are getting to the point where they want all that handled for them,” she said. Kubernetes microservices, too, impel users to package up data processing complexity and run jobs on the cloud, she indicated.

These are among the drivers behind the design of recently released ArcGIS Velocity, a cloud-native update to ArcGIS Analytics for IoT intended to bring end-to-end (data ingest, processing, storage, query and analysis) capabilities that reduce complexity for end users.

For a technologist charged with helping to track hazardous waste in the state of California, cloud hosting was a favorable aspect of ArcGIS Velocity. According to Roger Cleaves, GIS specialist with the California Department of Toxic Substances Control, the pace of running such systems is accelerating, as agencies move from paper manifests to direct digital feeds of vehicle operations, and as end users become habituated to immediately available maps showing assets in motion.

This is important for a department that must monitor the whereabouts of dangerous toxic waste, whether in the ground or in transit. So, the department is moving toward real-time geographic tracking of such waste for capacity planning and impact modeling.

With ArcGIS Velocity, Cleaves said, his department can start getting streaming data and creating feature layers to analyze, while placing data into cloud object storage for historical purposes. The department accesses ArcGIS Velocity via the ArcGIS Online service, he said.

All this happens while offloading server administration and workload scaling tasks to a cloud service provider.

“Today we’re always looking at cloud-native technology because that is where we want to live,” he said. “Cloud is just the way of the future.”

That future is likely to include advances on the edge side of the IoT equation. But even as edge analytics processing techniques come online to complement cloud processing, the drive will continue to simplify systems in the face of greater complexity, and to forge end-to-end systems with less assembly required.

