In recent times, data lakehouses have emerged as the preferred data platform architecture for the BFSI industry. A hybrid of a data warehouse and a data lake, this data repository uses the cloud to store data cost-effectively. While it integrates with traditional business intelligence, a data lakehouse can also serve as the foundation for artificial intelligence, machine learning, data science, and other data analytics solutions. Deployed both on-premises and in the cloud, data lakehouses cater to banking and insurance use cases centered on customer centricity, regulatory adherence, and product innovation. To adopt a data lakehouse, organizations must begin by choosing the right platform from a market inundated with solutions, including Google BigQuery, Apache Drill, Snowflake, Amazon Athena, Delta Lake, and Microsoft Azure Data Lake.
Identifying the right data lakehouse solution
Organizations in the banking and financial services industry receive data from internal and external sources in varied formats and at varied frequencies. The preliminary validation involves checking for the elementary features that every data lakehouse provides. These include:
- Structured, semi-structured, and unstructured data support
- Integration with different data sources through connectors
- Data durability of 11 nines (99.999999999%)
- Data replication and failover across multiple availability zones (AZs), regions, or cloud service providers (CSPs)
Financial institutions must also consider enterprise-grade security features like access control, data governance, and data cataloging, along with custom organization-specific access policies such as virtual private networks (VPNs), firewalls, proxies, and multi-factor authentication (MFA).
Once a platform passes the aforementioned checks, banking and financial institutions must evaluate the data lakehouse solution against their future storage and performance requirements. Assessing solutions against the specific financial use case filters out needless features and reduces cost and complexity. The approximate data volume and its expected growth should be determined for storage and compute cost management. Given that the billing mechanisms of various service providers differ and cannot be compared on an equivalent scale, institutions must identify the limitations of each and choose the optimal solution accordingly. For example, if a financial dashboard runs simple data retrieval processes, such as fetching current assets, liabilities, and cash balances, over a high volume of data, a lakehouse with time-based billing is preferable. If it instead runs complex processes, such as computing totals or averages, time-series analysis, and multi-level rollups, over a low volume of data, a lakehouse with volume-based billing can reduce costs. In terms of storage, some platforms store data in a highly compressed form, further reducing cost.
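The time-based versus volume-based trade-off can be sketched numerically. The rates and workload figures below are illustrative assumptions, not actual vendor prices:

```python
# Hypothetical comparison of time-based vs volume-based billing models.
# All rates and workload figures are illustrative assumptions.

def time_based_cost(query_hours: float, rate_per_hour: float) -> float:
    """Cost when billed by compute time consumed."""
    return query_hours * rate_per_hour

def volume_based_cost(tb_scanned: float, rate_per_tb: float) -> float:
    """Cost when billed by data volume scanned."""
    return tb_scanned * rate_per_tb

RATE_PER_HOUR = 4.0  # assumed $/compute-hour
RATE_PER_TB = 5.0    # assumed $/TB scanned

# Simple retrieval dashboard: fast queries over a large data volume.
simple = {"hours": 2.0, "tb": 50.0}
# Complex analytics: long-running rollups over a small data volume.
complex_ = {"hours": 20.0, "tb": 2.0}

for name, wl in [("simple/high-volume", simple), ("complex/low-volume", complex_)]:
    t = time_based_cost(wl["hours"], RATE_PER_HOUR)
    v = volume_based_cost(wl["tb"], RATE_PER_TB)
    print(f"{name}: time-based ${t:.2f} vs volume-based ${v:.2f}")
```

Under these assumed rates, the simple high-volume workload is cheaper on time-based billing ($8 vs $250), while the complex low-volume workload is cheaper on volume-based billing ($10 vs $80), matching the guidance above.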
Cost and resource optimization with data lakehouse
Data lakehouse architectures that separate metadata and storage enable dynamic data masking and cloning, both substantial contributors to storage cost reduction. Data masking implements column-level security policies to selectively mask sensitive financial data at query runtime. Lakehouses with this feature reduce data redundancy and avoid creating unnecessary objects, such as views, purely to implement access controls. Data cloning, meanwhile, creates replicas of database objects for different environments without producing additional copies of the underlying data. Over time, this approach significantly reduces operational effort and cost for financial institutions.
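The masking idea above can be sketched in a few lines. This is a minimal illustration of column-level masking applied at query time; the role names, column names, and masking rules are assumptions for the example, not any specific platform's policy syntax:

```python
# Minimal sketch of column-level dynamic data masking applied at query time.
# Roles, columns, and masking rules here are illustrative assumptions.
from typing import Callable, Dict

# Masking rules for sensitive columns: non-privileged roles see a masked form.
MASKING_RULES: Dict[str, Callable[[str], str]] = {
    "account_number": lambda v: "****" + v[-4:],
    "ssn": lambda v: "***-**-" + v[-4:],
}

PRIVILEGED_ROLES = {"compliance_officer"}

def apply_masking(row: Dict[str, str], role: str) -> Dict[str, str]:
    """Mask sensitive columns at runtime instead of maintaining masked views."""
    if role in PRIVILEGED_ROLES:
        return dict(row)  # privileged roles see the raw values
    return {
        col: MASKING_RULES[col](val) if col in MASKING_RULES else val
        for col, val in row.items()
    }

row = {"customer": "A. Shah", "account_number": "9876543210", "ssn": "123-45-6789"}
print(apply_masking(row, "analyst"))             # masked account number and SSN
print(apply_masking(row, "compliance_officer"))  # raw values
```

Because the policy is evaluated per query, a single copy of the data serves both audiences, which is exactly what removes the redundant masked views and copies mentioned above.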
Lakehouses also run storage and compute on separate clusters and can scale to more concurrent users and larger data sizes. For optimized resource utilization, enterprises must select a solution that supports both vertical and horizontal scaling. Preferably, the lakehouse solution should be tested under heavy parallel loads of SQL-based read and write transactions. It should scale compute well enough to retrieve data quickly without relying on traditional performance aids such as indexes or partitions.
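A parallel-load test of the kind described above can be prototyped locally before pointing it at a candidate platform's SQL endpoint. The sketch below uses SQLite purely as a stand-in engine, with thread workers simulating concurrent users; the table, row counts, and concurrency levels are illustrative assumptions:

```python
# Sketch of a parallel read-load test against a SQL engine, using SQLite as a
# local stand-in. A real evaluation would target the candidate lakehouse's
# SQL endpoint; table names and figures here are illustrative.
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

DB_PATH = os.path.join(tempfile.mkdtemp(), "loadtest.db")

# Seed a small transactions table.
with sqlite3.connect(DB_PATH) as conn:
    conn.execute("CREATE TABLE txns (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO txns (amount) VALUES (?)",
        [(float(i % 100),) for i in range(10_000)],
    )

def worker(_: int) -> float:
    # Each simulated user opens its own connection, as concurrent clients would.
    with sqlite3.connect(DB_PATH) as conn:
        (total,) = conn.execute("SELECT SUM(amount) FROM txns").fetchone()
    return total

# 64 concurrent read queries across a pool of 16 worker threads.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(worker, range(64)))

assert len(set(results)) == 1  # all concurrent readers agree
print(f"64 concurrent reads completed; total = {results[0]:.0f}")
```

In an actual evaluation, the same harness would measure latency and throughput as concurrency rises, revealing whether the platform's compute scaling holds up without indexes or partitions.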
The migration path from an existing data system to the target platform should be easy to achieve in terms of datatype compatibility, SQL support, and similar factors. The transition should also support schema enforcement and evolution for data warehouse schemas such as star and snowflake. Financial institutions often have a vendor-neutrality requirement, where the lakehouse should be hostable on any cloud provider, such as GCP, AWS, or Azure, for greater flexibility in future data migration. The ease of use and compatibility of the new lakehouse solution with the existing one must also be evaluated to avoid significant training and knowledge transfer efforts and costs.
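Schema enforcement and evolution can be illustrated with a small sketch. Platforms such as Delta Lake implement these natively; the hand-rolled checker below only conveys the concept, and the fact-table columns and types are assumptions for the example:

```python
# Hedged sketch of schema enforcement and evolution for a star-schema fact
# table. Column names and types are illustrative; real lakehouses implement
# these guarantees natively.

# Declared schema for a transactions fact table.
SCHEMA = {"txn_id": int, "customer_id": int, "amount": float}

def enforce(row: dict, schema: dict) -> dict:
    """Reject rows whose columns or types do not match the declared schema."""
    if set(row) != set(schema):
        raise ValueError(f"columns {sorted(row)} != schema {sorted(schema)}")
    for col, typ in schema.items():
        if not isinstance(row[col], typ):
            raise TypeError(f"{col}: expected {typ.__name__}")
    return row

def evolve(schema: dict, new_col: str, typ: type) -> dict:
    """Add a column to the schema; old rows read back a default/null value."""
    return {**schema, new_col: typ}

row = {"txn_id": 1, "customer_id": 42, "amount": 99.5}
enforce(row, SCHEMA)  # passes under the original schema

SCHEMA_V2 = evolve(SCHEMA, "channel", str)
enforce({**row, "channel": "mobile"}, SCHEMA_V2)  # passes under the evolved schema
print("schema checks passed")
```

During migration, enforcement catches datatype-compatibility problems early, while evolution lets dimension and fact tables grow new columns without rewriting historical data.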
What’s next for lakehouses?
The outlook for the versatile data lakehouse architecture is promising, thanks to its performance and efficiency advantages and its support for transactions with atomicity, consistency, isolation, and durability (ACID). This is particularly important for banking and financial organizations looking to simplify legacy data platforms that cannot satisfy the emerging needs of different user sets. Banking organizations are also exploring data monetization strategies (as data providers) and improving marketing and sales decisions through external data enrichment (as data consumers). Although the data lakehouse is an emerging concept, it is gaining traction as organizations adopt it to become data-driven businesses and secure a competitive edge.