A data lake, as a concept, has been around for a few years now. Its promise is to harness the power of big data and be a reliable, single source of truth repository for all of an organization’s data with almost infinite storage and massive processing power.
However, the vast array of different data sources has led enterprises to rethink the whole data lake architecture and adopt a slightly different approach: data virtualization.
Data virtualization’s promise is to help enterprises overcome many of the issues related to data lakes through a more agile, reliable, secure, and user-friendly data integration approach.
The idea behind a data lake is to store large volumes of structured and unstructured data from multiple locations in a single repository (such as Hadoop, S3, etc.). Unlike in a traditional data warehouse, this data is not modeled and remains in its granular form. Hence the data lake concept was initially embraced by so many: it is less expensive to run, and it promises a significant ROI.
What Are the Shortcomings of a Centralized Repository, or “Data Swamp”?
Much criticism about data lake technology centers on organizations that can’t access and analyze all this big data quickly and reliably.
The advance of IoT and the myriad new data sources that came with it over the last 10 years have led to an exponential increase in the data volumes organizations want to store. It is forecast that 41.6 billion IoT devices will generate some 79.4 zettabytes (79.4 trillion gigabytes) of data by 2025!
But all that big data can provide business value only if one can analyze and use it.
Data replication is unreliable
Merging all data sources into one huge repository requires continual data replication: the organization’s local data sources have to be kept in sync with the central repository. This means heavy traffic and a high potential for data inconsistency. There is also the problem of freshness: the data is only as fresh as the last sync point.
It’s a messy place
The principle “load now, use later” quickly leads to ungoverned data riddled with duplicates, old versions, and redundant database tables. It is easy to imagine how this happens: data changes in one place may not be reflected in other areas, resulting in a chaotic environment.
However, the biggest challenge is related to data security and governance. GDPR regulations restricting data location mean organizations cannot move sensitive data into the cloud or a centralized repository but have to keep it in its native location. As a result, organizations can’t use that data for analytics.
These challenges negate ROI, delay project launches, and decrease value while increasing operational costs, leading to frustration over failed expectations and unfulfilled promises.
Leaving data uncontrolled and unmanaged in one big pool will inevitably turn a data lake into a “data swamp.” And the data lake won’t deliver on the promises.
What Virtualization Brings To The Table
Data virtualization (DV) is a logical rather than a physical approach: it makes it possible to view, access, and analyze data virtually, without replicating or moving it. DV accomplishes this with a virtual layer on top of the data lake platform, in which data is accessed, managed, and delivered without the consumer needing to know where it resides in the repository.
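To make the idea of a virtual layer concrete, here is a minimal, illustrative sketch using SQLite’s ATTACH mechanism as a stand-in for federation. The source names, tables, and columns are all invented for the example; a real DV platform does this at scale across heterogeneous systems. The key point it demonstrates: consumers query one logical view, while the data stays in its sources and is read only at query time.

```python
import sqlite3

# Virtual layer: one connection that federates two independent "source
# systems" without copying their data into a central store. (A toy sketch
# with SQLite ATTACH; all names are hypothetical.)
vlayer = sqlite3.connect(":memory:")
vlayer.execute("ATTACH DATABASE ':memory:' AS crm")      # source 1
vlayer.execute("ATTACH DATABASE ':memory:' AS billing")  # source 2

# Populate the sources as if they lived in their native locations.
vlayer.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
vlayer.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                   [(1, "Alice"), (2, "Bob")])
vlayer.execute("CREATE TABLE billing.invoices (customer_id INTEGER, total REAL)")
vlayer.executemany("INSERT INTO billing.invoices VALUES (?, ?)",
                   [(1, 199.0), (2, 42.5)])

# The "virtual view": consumers see one logical table; the rows are read
# from the sources at query time, never replicated into a central copy.
vlayer.execute("""
    CREATE TEMP VIEW customer_spend AS
    SELECT c.name, i.total
    FROM crm.customers c
    JOIN billing.invoices i ON i.customer_id = c.id
""")

print(vlayer.execute("SELECT * FROM customer_spend ORDER BY name").fetchall())
```

If a source system changes a row, the next query against `customer_spend` sees the new value immediately, because nothing was materialized.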
Truth: No data replication
Since there’s no forced data replication, copying data becomes an option, not a necessity. An organization still has access to all its data assets regardless of their location and format and can keep track of their transformations, lineage, and definitions. Tracing where information originated and how it has been modified is automatic, which fixes the “messy place” problem described above.
Perhaps the main difference data virtualization makes is the agility to react to changes. Adding a new data source becomes a breeze, and one can start using the new data within minutes.
Data transformation occurs instantly, too, because all that is needed is a new rule applied to the data at query time. A data virtualization solution can also immediately detect structural modifications in the sources, so, for example, adding a new data column takes effect right away.
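Why do query-time rules take effect immediately? Because nothing is materialized against an old schema: the rule runs when data is read. The toy sketch below, with invented record and rule names, shows both effects described above: a newly added rule applies on the very next query, and a new source column flows through untouched.

```python
# Query-time transformation: a "rule" here is a plain function applied
# when data is read, so changing a rule (or the source schema) takes
# effect on the next query -- no re-ingestion of a physical copy needed.
# (All names are illustrative.)

source = [  # the source system's current records
    {"id": 1, "amount_cents": 19900},
    {"id": 2, "amount_cents": 4250},
]

rules = []  # transformation rules applied at read time

def add_rule(fn):
    rules.append(fn)

def query(records):
    for rec in records:
        out = dict(rec)           # the source record itself is never mutated
        for rule in rules:
            out = rule(out)
        yield out

# A new rule: derive a dollar amount. It applies instantly to every query.
add_rule(lambda r: {**r, "amount_usd": r["amount_cents"] / 100})

print(list(query(source))[0])
# {'id': 1, 'amount_cents': 19900, 'amount_usd': 199.0}

# A new column appears in a source record; it flows through without any
# pipeline change, because nothing was frozen against the old schema.
source.append({"id": 3, "amount_cents": 100, "currency": "EUR"})
print(list(query(source))[-1])
# {'id': 3, 'amount_cents': 100, 'currency': 'EUR', 'amount_usd': 1.0}
```

Contrast this with a physical pipeline, where the same change would require editing ETL code and reloading the central copy.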
Moving data to the cloud creates specific security and privacy challenges. However, DV supports recreating an on-premises authentication model in the cloud, and DV solutions can integrate with the existing authentication system (e.g. LDAP/AD). Cloud sources can then use the same security mechanisms and access controls as on-premises sources, with role-based access control down to individual tables, views, rows, and columns.
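Row- and column-level access control in a virtual layer amounts to applying a policy at query time, before results ever reach the consumer. The sketch below is a deliberately simplified illustration (the roles, column names, and EU-region predicate are all invented), not a real DV product's policy engine.

```python
# Toy sketch of role-based access control enforced in the virtual layer:
# one policy object governs every source, cloud or on-premises alike.
# (Roles, columns, and the row predicate are invented for illustration.)

POLICIES = {
    "analyst": {
        "columns": {"name", "region", "total"},          # column-level mask
        "row_filter": lambda row: row["region"] == "EU", # row-level filter
    },
    "admin": {
        "columns": None,                                 # None = all columns
        "row_filter": lambda row: True,
    },
}

def authorized_view(rows, role):
    """Yield only the rows and columns the role is allowed to see."""
    policy = POLICIES[role]
    for row in rows:
        if not policy["row_filter"](row):
            continue
        if policy["columns"] is None:
            yield dict(row)
        else:
            yield {k: v for k, v in row.items() if k in policy["columns"]}

orders = [
    {"name": "Alice", "region": "EU", "total": 199.0, "ssn": "redacted"},
    {"name": "Bob",   "region": "US", "total": 42.5,  "ssn": "redacted"},
]

print(list(authorized_view(orders, "analyst")))
# [{'name': 'Alice', 'region': 'EU', 'total': 199.0}]  -- no 'ssn', no US rows
```

Because the filter runs in the virtual layer, sensitive columns never leave the source for an unauthorized role, which is exactly the property GDPR-style location constraints demand.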
Data virtualization serves data in the format organizations need: complete, reliable, structured, and real-time. It integrates disparate data sources in real time or near-real time and presents the data in a structured form for viewing, or publishes it to apps and devices.
The best part? Implementing a virtualized data lake does not require a costly replacement of physical hardware. An organization can deploy it either in the cloud or on-premises, and many organizations are doing exactly that.
In 2018, Gartner predicted that 60% of organizations would have deployed a data virtualization solution by 2020. In line with that, Varada’s 2021 State of Data Virtualization report showed that 60% of the data experts surveyed prefer data virtualization as an alternative to a data warehouse, and projected that more than 50% would have virtualization deployments by 2021.
If that wasn’t enough, many use cases have shown that a data lake coupled with virtualization can shorten development cycles, reduce operational costs, increase ROI, and more, compared to a traditional physical lake.
Not adopting data virtualization may result in a loss of competitive advantage for organizations that want to capitalize on data. Without data virtualization, organizations risk knowing less about their customers, missing real-time business insights, and spending more money and time to address data challenges.