MANUFACTURING NEXT

Strong Data Pipelines for Mighty Data Platforms

July 25, 2019

In the era of analytics and insights, organizations are aiming to build mighty data platforms to explore new, disruptive opportunities within their enterprise data estates. Accordingly, enterprises are looking to leverage data platforms such as data lakes for data monetization, unlocking greater value from their data.

They also want to become insight-driven organizations, taking data science models that are presently in the lab or in exploratory phases into mission-critical use. Insights gained from deep dives into data empower organizations to explore new business models, ones that were probably never considered before but have the potential to create more value from data platforms.

These new-age business models must be designed to handle the famous Vs of Big Data: Volume, Velocity, Variety, and Veracity. Among these, the variety of data deserves particular attention, spanning structured, semi-structured, unstructured, and streaming data. While unstructured data in files presents many opportunities for manipulation and analysis, a lack of careful planning can quickly turn a data lake into a swamp.

Let us now examine how strong data platforms can assist organizations in the areas of data monetization, meaningful insights, data science capabilities, and data governance.

Mighty Data Platforms for Data Monetization

Data monetization is the act of turning enterprise data into currency. Though often considered the end goal, it is an evolving process that unfolds over a long period.

The aspiration of data monetization is supported by building strong data platforms with strong data acumen, driven by the qualities (the four Vs) of new-age data. Enterprise data architects need to focus especially on the data engineering process, ensuring the prerequisites of each file or data set are met: capturing and validating metadata that fully supports discovery of the available data.

How Stronger Data Pipelines Drive Insights

The data engineering pipeline plays an important role in discovering the desired business insights from new-age data platforms. These insights also need to be actionable, which is possible only with well-built platforms. Insights into market conditions, market risks, the supply chain, and other operations can be drawn only with deep knowledge of the metadata and by building an analytics model on top of it.

The data engineering pipeline should therefore focus on capturing every possible piece of metadata across three dimensions: business, technical, and operational.
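As a minimal sketch of what such a three-dimensional metadata record might look like, the hypothetical `DatasetMetadata` class below groups business, technical, and operational attributes for a data set registered in the lake (all names and fields are illustrative assumptions, not a prescribed schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetMetadata:
    """Illustrative metadata record spanning the three dimensions."""
    # Business metadata: meaning and ownership
    name: str
    owner: str
    description: str
    # Technical metadata: structure and storage
    schema: dict          # column name -> data type
    storage_path: str
    file_format: str
    # Operational metadata: pipeline run details
    source_system: str = "unknown"
    row_count: int = 0
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Example registration of a raw data set
orders = DatasetMetadata(
    name="orders",
    owner="sales-ops",
    description="Daily order extracts from the ERP system",
    schema={"order_id": "string", "amount": "decimal", "order_date": "date"},
    storage_path="/lake/raw/orders/",
    file_format="parquet",
    source_system="erp",
    row_count=120000,
)
```

Keeping all three dimensions in one record is what makes later discovery possible: business users can search by description and owner, engineers by schema and path, and operators by ingestion history.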

Stronger Data Pipelines Help in Scaling Up Data Science Capabilities

While scaling up data science capabilities through modern data platforms, attention must also be paid to the mechanism of collecting and storing data in a usable format. This builds the analytical and data science domain, and it requires stronger, more robust data engineering pipelines and greater speed in delivering data along the information value chain.

A robust data engineering pipeline establishes the information value chain and delivers business content by managing scenarios such as data exceptions, missing data identification, and data enrichment.
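A minimal sketch of those three scenarios, assuming a simple list-of-dicts record format and an illustrative `amount_band` enrichment rule (both are assumptions for the example, not a prescribed design):

```python
def run_pipeline(records):
    """Identify missing data, route bad rows to an exception store,
    and enrich the clean rows with a derived business attribute."""
    clean, exceptions = [], []
    for rec in records:
        # Missing data identification on required fields
        missing = [f for f in ("id", "amount") if rec.get(f) is None]
        if missing:
            # Data exception handling: quarantine instead of failing the run
            exceptions.append({"record": rec, "missing_fields": missing})
            continue
        # Data enrichment: derive business context from the raw value
        rec["amount_band"] = "high" if rec["amount"] >= 1000 else "standard"
        clean.append(rec)
    return clean, exceptions

clean, exceptions = run_pipeline([
    {"id": 1, "amount": 1500},
    {"id": 2, "amount": None},   # lands in the exception store
    {"id": 3, "amount": 200},
])
```

The design choice here is that bad records never stop the pipeline; they are quarantined with a reason, so the information value chain keeps delivering while data stewards work the exception queue.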

The speed of data engineering pipelines is also an important performance indicator, as it signifies how quickly data reaches the analytics region and how effectively it is available for analysis.

Stronger Data Governance for Building Successful Data Pipelines

Organizations also face multiple risk scenarios. Without data governance, data lake platforms turn into data swamps over time.

State-of-the-art data engineering pipelines ensure minimum turnaround time in reaching the desired information. Organizations need an enterprise data strategy and a data governance strategy to deploy new-age data platforms such as data lakes.

Implementing data platforms carries another risk: as data volume in the lake increases, data preparation effort rises with it, lengthening turnaround times, degrading the quality of data available in the lake, and ultimately impacting data science operations.

Considering these risk scenarios, let us look at the design guidelines for building stronger data pipelines.

Designing Data Engineering Pipeline Strategy

To achieve strong data governance while adopting a data engineering pipeline strategy, the following factors need to be considered:

  • Enterprise metadata management:
    • Capturing the right metadata at the right point, with the right technology, and storing it in the right place
    • Focusing on the quality of captured metadata with cleansing rules as data is transformed or enriched with business context; maintaining metadata lineage to ensure forward- and backward-looking traceability for all stakeholders
    • Creating a well-defined meta-lake alongside the data lake
  • Data lineage in data pipeline:
    • Building logical lineage between data sets (including file-to-file, table-to-table, and column-to-column mappings)
    • Capturing transformation definitions, formulae, conversions, and data type changes
  • Data security management:
    • Building stronger data security and access control through well-defined ACLs (access control lists)
    • Defining business-suitable file/table/schema/folder naming schemes for maximum searchability
  • Efficient data publishing mechanisms:
    • Consider the following key factors while designing the publish mechanism:
      • Empowering business users to create new data sets (data wrangling capabilities)
      • Enforcing stronger access control
      • Enabling faster response time of the queries
      • Enabling self-service reporting so business users can create and analyze their own data
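The lineage guideline above can be sketched concretely. The hypothetical registry below records column-to-column mappings together with their transformation definitions, and supports the backward-looking traceability mentioned earlier (the table and column names, and the `record_lineage`/`upstream_of` helpers, are illustrative assumptions):

```python
# Illustrative lineage registry: column-level mappings plus transformations
lineage = []

def record_lineage(src_table, src_col, dst_table, dst_col, transform):
    """Capture one logical column-to-column lineage edge."""
    lineage.append({
        "source": f"{src_table}.{src_col}",
        "target": f"{dst_table}.{dst_col}",
        "transform": transform,
    })

def upstream_of(target):
    """Backward-looking traceability: which sources feed a target column?"""
    return [edge["source"] for edge in lineage if edge["target"] == target]

def downstream_of(source):
    """Forward-looking traceability: which targets does a source column feed?"""
    return [edge["target"] for edge in lineage if edge["source"] == source]

# Example: a curated column derived from a raw amount and an FX rate
record_lineage("raw_orders", "amt", "curated_orders", "amount_usd",
               "CAST(amt AS DECIMAL(18,2)) * fx_rate")
record_lineage("raw_fx", "rate", "curated_orders", "amount_usd",
               "fx_rate lookup by order_date")
```

Because each edge carries its transformation definition, the same registry answers both governance questions ("where did this number come from?") and impact-analysis questions ("what breaks if we change this source column?").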

Viraj Khandalkar is a Consultant at TCS, currently leading the Data & Insight Center of Excellence for the UK manufacturing industry unit. He has more than 18 years of experience in BI, Big Data, advanced analytics, and consulting across the globe. Viraj is passionate about solution architecture, technology solutions, and governance setup. He holds an MBA in International Business Management and a Bachelor's degree in Electronics Engineering, and is an Oracle Certified Professional and TOGAF certified.