In my previous post, we discussed Big Data's big potential for pharma – its applications in patient care and its ability to impact pharma's bottom line. Taking cognizance of the technical challenges, we emphasized the need for assurance to go the extra mile – to look beyond data warehousing and formulate customized frameworks for Big Data assurance. Besides being customized for Big Data (and specifically addressing the 4V challenges discussed earlier), it's also important for these frameworks to be localized to the pharma context. In this post, I present a set of key challenges and propose an assurance response to each.
At the top of the priority list are two important challenges – heterogeneity and completeness. Input data must be carefully structured for machine analysis algorithms, because these algorithms expect it in homogeneous, not heterogeneous, form. However, the majority of data generated in organizations is heterogeneous, coming from varied internal and external sources – scans, diagnostic results, prescriptions, patient recordings, and others – exchanged over the web or in physical form. A related challenge is that data can take the form of text, graphics, audio recordings, video files, and more. Such unstructured data cannot be force-fitted into a structured row-column database framework, posing challenges for processing algorithms. To understand completeness, consider an electronic health record application that expects basic patient information such as date of birth, occupation, and blood group. What if some patients provide only part of this information? Most application developers tackle such scenarios by storing NULL values for optional fields in the database. Analytical exercises that depend on fields containing NULL or incomplete values will render incorrect results. To cope with such scenarios, assumptions must be made – but these assumptions hold good only if the missing values are likely to resemble common values in the database. Even after such scientific assumptions and thorough data cleaning, some errors will creep in, rendering the exercise incomplete.
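To make the completeness problem concrete, here is a minimal sketch in Python of the imputation approach described above: missing (None/NULL) fields are filled with the most common observed value. The record layout and field names are illustrative, not from any real EHR schema, and this is only one of several possible imputation strategies.

```python
# Hypothetical patient records; None stands in for a database NULL.
from collections import Counter

records = [
    {"dob": "1980-04-12", "occupation": "teacher",    "blood_group": "O+"},
    {"dob": "1975-09-30", "occupation": None,         "blood_group": "A+"},
    {"dob": "1990-01-08", "occupation": "nurse",      "blood_group": None},
    {"dob": "1984-06-21", "occupation": "pharmacist", "blood_group": "O+"},
]

def impute_with_mode(records, field):
    """Fill missing values with the most common observed value.

    This holds good only if missing values are likely to resemble
    common values in the database -- the assumption called out above.
    """
    observed = [r[field] for r in records if r[field] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    for r in records:
        if r[field] is None:
            r[field] = mode
    return records

impute_with_mode(records, "blood_group")
print(records[2]["blood_group"])  # missing value replaced by the mode
```

The imputed value is a guess, not a measurement – which is exactly why downstream analyses built on such fields need the QA scrutiny discussed next.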
The QA framework for data analysis must recognize and manage these challenges and errors by carefully structuring the data. It's imperative for QA to be involved at the outset, right from the data design stage.
Next come scale, size, and speed – the defining aspects of Big Data. Until recently, Moore's Law – processors getting steadily faster – was the default answer to growing data sizes. But with data volumes scaling faster than computing resources and CPU clock speeds plateauing, Moore's Law may no longer suffice. Big Data needs smarter frameworks that can do more with less. Processing voluminous data takes time, delaying the availability of results.
Developing partial results in advance can be an effective strategy, and one that assurance can induce into the lifecycle. This intervention, which introduces new data in small, incremental batches, helps speed up analytical results. The strategy is best suited for data sets that grow rapidly, evolve continuously, and must serve queries with tight response-time limits.
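The incremental-batch idea can be sketched with a running aggregate: each small batch updates the statistic, so a partial result is available after every batch rather than only after all the data has arrived. The streaming-mean update is a standard technique; the batch contents here are illustrative.

```python
# Sketch of the "partial results" strategy: fold data in small
# incremental batches, keeping a running aggregate throughout.

class RunningMean:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, batch):
        """Incorporate one incremental batch of readings."""
        self.count += len(batch)
        self.total += sum(batch)

    @property
    def value(self):
        return self.total / self.count if self.count else 0.0

stats = RunningMean()
for batch in ([4.0, 6.0], [5.0], [7.0, 3.0]):
    stats.update(batch)
    # A partial result is available here, before all data has arrived.
    print(f"after {stats.count} readings: mean = {stats.value:.2f}")
```

The same pattern generalizes to any aggregate that can be merged batch-by-batch (counts, sums, sketches), which is what makes it suit rapidly growing data sets.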
In the Big Data context, data privacy is a growing concern. Strict laws govern the usage of electronic health records. There is also concern regarding appropriate usage of personal data, particularly when this data is linked to multiple devices, systems, and sources. To realize Big Data's big promise, assurance controls must manage privacy from two perspectives – technical and sociological. Let's take the example of location-based mobile services. As these services require users to share their location with service providers, it's important to ensure the privacy of this information, transmitted electronically through cell phone towers. Although no personal information is transmitted in conjunction with the location data, hackers with sophisticated sniffing tools can derive close correlations between identity and movement patterns. By tracking connection points and movement from one cell phone tower to another, and correlating this data with set patterns over time, hackers can determine not just user locations and identities, but also extremely personal information related to health, habits, behavior, and even religious preferences. A smart (or smarter-than-the-sniffers!) security system is required to fend off such attackers.
Finally, understanding Big Data’s complexity requires multiple experts across domains. The right-size Big Data analysis framework must support multi-disciplinary and multi-domain inputs, as well as shared exploration of results. Given space, time, and location constraints, these experts may be spread across geographies, so a robust collaboration platform must support their distributed input. Crowdsourcing has emerged as a promising option for harnessing human ingenuity to solve such problems.
With better analysis of large-volume pharma data, pharma companies can make faster advances and the right moves in many scientific disciplines, thereby improving profitability and chances of success. But the road ahead is not easy, and challenges remain. In addition to right-sized frameworks and processes, pharma companies must drive a mindset change – look beyond the obvious issues of size, speed, and volume, and also consider heterogeneity, structure, error handling, privacy, timeliness, provenance, and visualization. It's also important to remember that a one-time intervention is not adequate: issues must be addressed continuously, across the analytical lifecycle – from the data acquisition phase right up to the interpretation of results. Yes, the path ahead appears challenging. But it's not as daunting as it seems. Companies that forge ahead, overcome the challenges, and do it right will have big rewards waiting.