
Building a Global Meta-data Fabric to Accelerate Data Science

2017-06-08 16:50 [TECH] Source:Netword
Guide:In a World Wide Herd, a meta-data fabric connects the data scientist to globally distributed data — and does a great deal of heavy lifting along the

By Patricia Florissi, Ph.D.

In an earlier blog post, Distributed Analytics Meets Distributed Data, I wrote about the concept of a World Wide Herd (WWH), which creates a global network of Apache™ Hadoop® instances that function as a single virtual computing cluster. WWH orchestrates the execution of distributed and parallel computations on a global scale, even across multi-clouds, pushing analytics to where the data resides. This approach enables analysis of geographically dispersed data without requiring the data to be moved to a single location before it can be analyzed. Only the privacy-preserving results of the analysis are shared.

WWH is a tremendous concept, and a key part of future data strategies as we look to a world that is projected to have 200 billion connected devices by 2031. Data will increasingly be distributed by nature, and little of it will be moved. WWH makes that data accessible for analysis, wherever the data happens to be.

That’s the big picture. But how do we really make data that is scattered around the world, in many different formats, locatable, accessible and useable for analysis by data scientists via a World Wide Herd? This is the topic for today’s post.

Let’s begin with a high-level architectural overview. At a simplified level, WWH has three tiers: a data fabric at the physical infrastructure level, a meta-data fabric in the middle, and an analytics fabric at the top level, where the data scientist works. For this “how-it-works” discussion, I will focus on the middle layer, the meta-data fabric.

Figure 1: The three layers in WWH (image credit: Dell EMC)

Meta-data, of course, is data about data. In the case of WWH, its meta-data fabric abstracts physical data resources, such as a file or a blob store, into meta-resources, which contain meta-data about the physical resources themselves. Two different actors contribute meta-data to meta-resources:

Data architects map meta-data related to the physical properties of the data, including its physical location, which helps the analytics fabric locate and address the data; and

Meta-data curators map meta-data related to the semantic properties of the data, including whether it stores genome data or financial data, and whether it stores Personally Identifiable Information (PII).
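
As a rough illustration of the two roles above, a meta-resource can be pictured as a record with a physical part and a semantic part. This is only a sketch: the class name MetaResource, its fields, the two helper functions and the example address are all hypothetical and are not taken from WWH itself.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class MetaResource:
    """Hypothetical record describing one physical data resource."""
    name: str
    # Physical properties, supplied by a data architect.
    location: str = ""
    store_type: str = ""
    # Semantic properties, supplied by a meta-data curator.
    tags: Set[str] = field(default_factory=set)
    contains_pii: bool = False


def map_physical(resource: MetaResource, location: str, store_type: str) -> None:
    """Data architect step: record where the data lives and how it is stored."""
    resource.location = location
    resource.store_type = store_type


def map_semantic(resource: MetaResource, tags: Set[str], contains_pii: bool) -> None:
    """Meta-data curator step: record what the data means."""
    resource.tags |= set(tags)
    resource.contains_pii = contains_pii


# Illustrative values only; the address below is made up.
cardiology = MetaResource(name="cardiology-records")
map_physical(cardiology, location="hdfs://zone-a.example/cardiology", store_type="file")
map_semantic(cardiology, tags={"patient-data", "heart-disease"}, contains_pii=True)
```

Keeping the two kinds of meta-data in one record, but filled in by separate steps, mirrors the split between the two actors: neither needs to know what the other contributed.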

Figure 2: Actors for meta-data creation (image credit: Dell EMC)

The meta-data fabric itself is a collection of distributed runtime engines, referred to as catalog nodes. Each catalog node stores all the meta-data about all the data in the local data-zone where it is located, and knows about at least one other catalog node, referred to as a next hop. This allows the meta-data fabric to be fully connected and accessible from any node.
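
To make the next-hop idea concrete, here is a minimal sketch of three catalog nodes linked in a ring, with a walk that follows next hops to find every data-zone reachable from a starting node. The CatalogNode class, the zone names and the traversal are assumptions made for illustration; the post does not describe how WWH actually propagates requests between catalog nodes. The ring is just one arrangement in which every node can reach every other.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogNode:
    """Hypothetical catalog node holding meta-data for one data-zone."""
    zone: str
    # Meta-resource name -> tags, for data in the local data-zone only.
    local_meta: Dict[str, Set[str]] = field(default_factory=dict)
    # Other catalog nodes this node knows about.
    next_hops: List["CatalogNode"] = field(default_factory=list)


def reachable_zones(start: CatalogNode) -> List[str]:
    """Follow next hops from `start` and list each zone once, in visit order."""
    visited: Set[str] = set()
    order: List[str] = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node.zone in visited:
            continue  # already seen; avoids looping forever around the ring
        visited.add(node.zone)
        order.append(node.zone)
        stack.extend(node.next_hops)
    return order


# Three zones, each knowing exactly one next hop, arranged in a ring.
zone_a = CatalogNode("zone-a")
zone_b = CatalogNode("zone-b")
zone_c = CatalogNode("zone-c")
zone_a.next_hops.append(zone_b)
zone_b.next_hops.append(zone_c)
zone_c.next_hops.append(zone_a)

zones_from_b = reachable_zones(zone_b)
```

Starting from any of the three nodes, the walk reaches all three zones, which is the property the fabric needs to be accessible from any node.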

Figure 3: Meta-data fabric distributed runtime engine (image credit: Dell EMC)

The meta-data fabric ties all the physical nodes together for the analytics fabric layer and enables the automation of key functions during code execution through the meta-data amalgamation process.

This architecture frees the data scientists from a great deal of the heavy lifting. The data scientist doesn’t need to know how to locate, address or access the data. WWH takes care of all of that coordination. The data scientist simply interacts with a virtual computing node, and the catalog tells this node how to address the data.

To make this story more tangible, consider, for example, a team of data scientists that wants to study the relationship between high blood pressure and heart disease among different cohorts, or groups of individuals who share common statistical factors, such as age and ethnicity. The data scientists start a WWH computation in a virtual computing node they have access to, passing as a parameter the name of a meta-resource that includes meta-data tags, such as “patient-data,” “heart-disease,” “age,” “ethnicity” and “blood-pressure.” This indicates that the computation should be performed only on data sources that contain information regarding patients with heart conditions, where the age and the ethnicity are known, and for which blood pressures are being measured. It is important to note that the data scientists are not concerned with the specific format of the data, the data store being used for the data itself, or the location and address of the individual data stores.
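
The selection implied by those tags can be sketched as a simple subset test: a data source qualifies only if it carries every requested tag. The function name select_sources, the source names and the extra tags are hypothetical, and treating the match as "all tags must be present" is my reading of the example rather than a documented WWH rule.

```python
from typing import Dict, List, Set


def select_sources(requested: Set[str], catalog: Dict[str, Set[str]]) -> List[str]:
    """Return the names of sources whose tags include every requested tag."""
    return sorted(name for name, tags in catalog.items() if requested <= tags)


requested_tags = {"patient-data", "heart-disease", "age", "ethnicity", "blood-pressure"}

# Made-up sources and tags, standing in for what catalog nodes would hold.
catalog = {
    "clinic-a": {"patient-data", "heart-disease", "age", "ethnicity", "blood-pressure"},
    "clinic-b": {"patient-data", "heart-disease", "age"},  # lacks two requested tags
    "clinic-c": requested_tags | {"genome"},                # extra tags are fine
    "trading-desk": {"financial-data"},
}

selected = select_sources(requested_tags, catalog)
```

Note that the request names only tags. Nothing in it refers to a file format, a data store or an address, which is the point of the example.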

Once started, the WWH computation first connects to the local catalog node, passing the name of the meta-resource defined by the data scientist. The catalog node returns to the WWH virtual computing node two types of information:

