Establishing Data Hierarchy


Data are the numbers, the words, and the images. Information is the meaning behind data, meaning that comes from placing that data into context. Placing data into context is the role of the domain expert, because domain expertise is knowledge of context: field specific but also timely.

Those who have studied steel metallurgy understand the general impact of carbon, of austenite stabilizers, and of ferrite stabilizers; they understand the role of dislocations and phase transformation in plasticity. Similarly, ceramists understand the role of oxygen ion vacancies and polarons in electrical conductivity. These are examples of General Expertise.

But domain expertise can also be specific and timely: knowing where Sample X came from, knowing how it was made, and from what precursors; information that is known only to those involved, and unless recorded, information that quickly disappears to history. This is Specific Expertise.

These are the two types of information we are looking to capture.

Structure data to capture General and Specific Expertise

Materials data are decidedly hierarchical. A material's properties depend on many things: on the precursor materials, their composition, and their form (powder, rod or bar stock); on synthesis methods; on processing history; and on the operational environment.

In complex systems, which is to say most systems, several domain experts work to impart their respective domain expertise and optimize the system. This requires a data collection and curation system that is able to capture and impart the general and specific domain knowledge brought to bear by these experts.

Use nodes to capture General Expertise

If nodes are to materials data what bytes are to digital data, then attributes are the bits, the indivisible quanta of information in the node. Each attribute could be a single measurement, observation, or state variable. Zero-or-more related attributes are grouped to form nodes. While each node may contain any number of attributes, each node must contain a unique identifier (UID).
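As a minimal sketch of the node described above, assuming Python dataclasses as the container: the field names `uid` and `attributes`, the use of a random UUID as the identifier, and the sample values are all illustrative choices, not part of the method itself.

```python
from dataclasses import dataclass, field
from typing import Any, Dict
import uuid


@dataclass
class Node:
    """A group of zero or more related attributes plus a mandatory UID."""
    # Hypothetical field names; the text only requires that a UID exist.
    uid: str = field(default_factory=lambda: uuid.uuid4().hex)
    attributes: Dict[str, Any] = field(default_factory=dict)


# A node holding two attributes (illustrative values, not real data).
tensile = Node(attributes={"yield_strength_MPa": 350.0, "test_temperature_C": 23})

# A node with no attributes is still valid, as long as it has a UID.
empty = Node()
```

Each attribute here is a single key and value, which mirrors the idea of an attribute as the smallest unit of information in the node.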

Use trees to capture Specific Expertise

To establish the hierarchical relationship between nodes, each node must, in addition to its UID, also include a reference to the parent node, or FID (foreign ID). The parent (FID)—child (UID) relationship defines the tree. But while the connection is a matter of history, the meaning of the connection is not.
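The parent and child relationship can be sketched with nothing more than pairs of identifiers. The example below assumes each record is a `(uid, fid)` pair with `fid` set to `None` for the root; the identifiers and the `lineage` helper are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Each record is (uid, fid); fid is None for the root. Identifiers are made up.
records = [
    ("bar-stock", None),
    ("anneal-1", "bar-stock"),
    ("section-A", "anneal-1"),
    ("section-B", "anneal-1"),
]

# Child -> parent lookup: this is the whole tree, seen from below.
parent_of = {uid: fid for uid, fid in records}

# Parent -> children lookup: the same tree, seen from above.
children_of = defaultdict(list)
for uid, fid in records:
    if fid is not None:
        children_of[fid].append(uid)


def lineage(uid):
    """Return the chain of UIDs from the root down to `uid`."""
    chain = []
    while uid is not None:
        chain.append(uid)
        uid = parent_of[uid]
    return chain[::-1]


history = lineage("section-B")
```

The sketch records only that a connection exists. What the connection means is deliberately absent, because that is the part a domain expert has to supply.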

The meaning of the connection demands interpretation by a domain expert. For example, some materials processing steps, such as precipitation annealing of alloys, doping of semiconductors, or cross-linking of polymers, fundamentally change the behavior of the material. Therefore, presenting that information to a domain expert is critical. An interesting (albeit very large) example connects unary, binary, and ternary compounds into a graph to show the relationships between these materials.

A precipitation anneal changes the child, such that the yield strength of the child is no longer representative of the yield strength of the parent. The electrical conductivity of the doped child is no longer representative of the conductivity of the parent. And cross-linking fundamentally alters the mechanical properties of the polymer. Therefore, a mechanism must exist for a domain expert to describe the relationship, not its existence but its nature, in order to control the flow of information. This is covered in greater depth in the section on aggregation below.

The use of trees to capture the hierarchical relationship in materials data is conceptually straightforward, but as the number of nodes and connections grows, the network connectivity graph becomes difficult to manage. The following figure shows 1,400 connections necessary to describe an average of 5.6 synthesis and processing steps that produced 33,327 distinct measurements of 161 unique techniques collected by five groups across four separate organizations.

Each connection and every attribute must be revisited with every new measurement or update.

Create unambiguous node relationships

While each node may have any number of children, each node may have one, and only one, parent. This ensures the origin and flow of information to or from any node is unambiguous. It also opens up the possibility of restructuring the data hierarchy based on connections other than time provenance, but building such a structure requires that every node have a value for the connecting property that points back to one, and only one, other node. Such a logical extension of the provenance model is still an active area of investigation.
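The single-parent rule lends itself to a mechanical check. The sketch below assumes the hierarchy is supplied as `(uid, fid)` pairs and, as a further assumption not stated in the text, that there is exactly one root whose `fid` is `None`. The function name `validate_tree` is hypothetical.

```python
def validate_tree(records):
    """Check a list of (uid, fid) pairs; return the parent lookup or raise ValueError."""
    parent_of = {}
    for uid, fid in records:
        if uid in parent_of:
            raise ValueError(f"node {uid!r} is listed with more than one parent")
        parent_of[uid] = fid

    # Assumption: a single root, marked by fid = None.
    roots = [uid for uid, fid in parent_of.items() if fid is None]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")

    for uid, fid in parent_of.items():
        if fid is not None and fid not in parent_of:
            raise ValueError(f"node {uid!r} points to unknown parent {fid!r}")

    # Walk upward from every node; revisiting a node means a cycle.
    for start in parent_of:
        seen = set()
        uid = start
        while uid is not None:
            if uid in seen:
                raise ValueError(f"cycle detected through node {uid!r}")
            seen.add(uid)
            uid = parent_of[uid]
    return parent_of


tree = validate_tree([("r", None), ("a", "r"), ("b", "a")])
```

Once these checks pass, the path from any node back to the root is unique, which is what makes the flow of information unambiguous.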

Propagation conveys history to descendant nodes

The root (top-most) node contains all the information we know, or care to include, about the starting point of the object under investigation. This is the beginning state of every level-1 child of the root. Each child adds some information, augmenting our knowledge of the state of the object or changing its state. This assumes all processes are stateful: the response of an object to external stimuli depends on the state of the system going into that process. Level-2 child nodes, like their level-1 parents, contribute to or modify the state of the system, but do so under the conditions established by the root and level-1 nodes. Algorithmically, this extends easily to any depth, and so to any tree that describes the history and hierarchy of the system, however complex.

Put simply, a child node inherits the properties of its parent as its starting state. This is why each child can have only one parent: unlike genetic inheritance, there is no general mechanism, at least not one found thus far, that can join or merge conflicting parent states.
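Propagation as described here can be sketched as a recursive merge from the root downward. The example assumes each node stores its parent reference and its own attributes, and that a child's attribute overrides an inherited attribute of the same name; the node names, attribute names, and values are illustrative only.

```python
# uid -> (fid, attributes); names and values are illustrative only.
nodes = {
    "ingot":    (None,       {"alloy": "Al-Cu", "form": "bar"}),
    "solution": ("ingot",    {"temper": "solutionized", "anneal_temp_C": 530}),
    "aged":     ("solution", {"temper": "aged", "age_time_h": 8}),
}


def propagated_state(uid):
    """Starting from the root, layer each ancestor's attributes down to `uid`."""
    fid, own = nodes[uid]
    inherited = propagated_state(fid) if fid is not None else {}
    # The child's own attributes add to, or override, what it inherits.
    return {**inherited, **own}


state = propagated_state("aged")
```

Because every node has exactly one parent, the recursion never has to reconcile two competing inherited states.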

Aggregation collects properties into predecessor nodes

If propagation imparts history to an object, aggregation collects the response of the system to that unique process history. During aggregation, a child node writes its properties as properties of the parent node. A child node could be a specific measurement of a parent sample, a sectioning of the parent sample into smaller subsamples, a property measurement of one or more of these subsamples, or any number of other operations acting on the parent sample. Aggregation assumes that each child node either is representative of the parent sample or is not.

This raises two questions. If the child nodes are representative, how are multiple operations handled by the parent node? And if they are not, what happens to the attributes of the child node during aggregation?
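Before taking up either question, it may help to see aggregation in its simplest form. The sketch below assumes every child is representative and that each value is appended to a list on every ancestor; the node names, the `hardness_HV` attribute, and the values are made up for illustration.

```python
from collections import defaultdict

# uid -> (fid, attributes); identifiers and values are illustrative only.
nodes = {
    "plate":   (None,      {}),
    "coupon1": ("plate",   {"hardness_HV": 182.0}),
    "coupon2": ("plate",   {"hardness_HV": 176.0}),
    "indent":  ("coupon1", {"hardness_HV": 185.0}),
}


def aggregate(nodes):
    """Collect every node's attributes into lists on each of its ancestors."""
    collected = {uid: defaultdict(list) for uid in nodes}
    for uid, (fid, attrs) in nodes.items():
        ancestor = fid
        while ancestor is not None:
            for name, value in attrs.items():
                collected[ancestor][name].append(value)
            ancestor = nodes[ancestor][0]
    return collected


collected = aggregate(nodes)
```

The parent ends up holding a list of values for each property rather than a single value, which is exactly the situation the two questions address.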

Reductions: Calculating summary statistics

Each child node can have one and only one parent node, but each parent may, and often does, have more than one child. This raises a question: how are multiple operations handled by the parent node? The answer is reductions.

In general, reductions accept a list (or array, or set, etc.) of values and return a reduced-order value, often but not necessarily a scalar. For lists of scalar numerical data, the min, max, mean, median, IQR, and standard deviation are the most obvious, and are sufficiently descriptive for Gaussian random processes, but there is no limit to the types of reductions that can be applied.
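The scalar reductions named above can be sketched with Python's standard `statistics` module. The `REDUCTIONS` table, the `reduce_values` helper, and the sample values are hypothetical; the IQR here uses the default quartile method of `statistics.quantiles`, and other quartile conventions would give slightly different values.

```python
import statistics


def iqr(values):
    """Interquartile range from the quartile cut points."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Reduction name -> function that maps a list of scalars to one scalar.
REDUCTIONS = {
    "min": min,
    "max": max,
    "mean": statistics.mean,
    "median": statistics.median,
    "iqr": iqr,
    "stdev": statistics.stdev,  # sample standard deviation
}


def reduce_values(values):
    """Apply every registered reduction to one list of aggregated values."""
    return {name: fn(values) for name, fn in REDUCTIONS.items()}


# Illustrative values only; at least two are needed for stdev and quartiles.
summary = reduce_values([176.0, 182.0, 185.0, 179.0])
```

Keeping the reductions in a table rather than hard-coding them makes it easy to add others later without touching the aggregation logic.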

After all, you can ignore data you have, but you cannot use data you don't.

Reductions are also possible on other types of data: sets of series data and of spectral, image, and categorical data. Reductions on spectral data could include peak identification and characterization: a reduction not in dimension but in data volume. Image reductions could include segmentation and connected component analysis. Reductions on series data could include identification of inflection points or onset temperatures.

The concept of reductions that take a vector of values (scalars, vectors, or matrices) as input and return a value that is also vector-valued raises the concept of the identity reduction, which is to say, no reduction at all. The identity reduction returns the unaltered list of aggregated values for future use, because again, you can ignore data you have, but you cannot use data you don't.

The list of reductions that should be attempted on all aggregated properties is set once by domain experts and applied with every update, however many nodes and connections exist.
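One way to picture a list of reductions that is set once and applied on every update is a registry keyed by property name. In the sketch below, `REGISTRY`, `DEFAULT`, and `apply_reductions` are hypothetical names, the properties and values are illustrative, and falling back to the identity reduction for unregistered properties is an assumption rather than something the text prescribes.

```python
import statistics


def identity(values):
    """The identity reduction: keep the aggregated values unchanged."""
    return list(values)


# Hypothetical registry: property name -> reductions chosen by a domain expert.
REGISTRY = {
    "hardness_HV": {"mean": statistics.mean, "max": max, "all": identity},
    "phase": {"mode": statistics.mode, "all": identity},
}

# Assumption: properties without an entry keep their raw values only.
DEFAULT = {"all": identity}


def apply_reductions(aggregated):
    """Apply the registered reductions to one node's aggregated properties."""
    return {
        name: {label: fn(values) for label, fn in REGISTRY.get(name, DEFAULT).items()}
        for name, values in aggregated.items()
    }


result = apply_reductions({
    "hardness_HV": [182.0, 176.0, 185.0],
    "phase": ["fcc", "fcc", "bcc"],
    "note": ["as-received"],
})
```

The registry is written once; calling `apply_reductions` again after new measurements arrive reuses the same choices without further input from the expert.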

When a node no longer represents its predecessors

As alluded to above, some operations fundamentally alter the nature of an object; the examples given were annealing, doping, and cross-linking. The state that results from such an operation is still a function of the input state, but the affected properties of the child object are no longer representative of the parent.

Therefore, domain experts must define the attributes affected by the operation, and during an update those attributes are blocked, withheld from the parent and all upstream nodes.
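Blocking can be sketched as a filter applied while values travel upstream. The example assumes each node carries a set of blocked attribute names, and that a blocked attribute stops at the blocking node for that node's own values and for values arriving from its descendants. That second point, like all names and values below, is an assumption made for illustration.

```python
from collections import defaultdict

# uid -> (fid, attributes, blocked); `blocked` names the attributes an operation
# has altered so they no longer represent the parent. All values are illustrative.
nodes = {
    "wafer":  (None,    {}, set()),
    "as_cut": ("wafer", {"conductivity_S_per_m": 1.0e-3, "thickness_um": 500.0}, set()),
    "doped":  ("wafer", {"conductivity_S_per_m": 2.0e2, "thickness_um": 499.0},
               {"conductivity_S_per_m"}),
    "probe":  ("doped", {"conductivity_S_per_m": 2.1e2}, set()),
}


def aggregate_with_blocking(nodes):
    """Aggregate upstream, withholding blocked attributes from all ancestors."""
    collected = {uid: defaultdict(list) for uid in nodes}
    for uid, (fid, attrs, _) in nodes.items():
        allowed = set(attrs)
        current = uid
        while nodes[current][0] is not None:
            # Attributes blocked at `current` stop here and go no further upstream.
            allowed -= nodes[current][2]
            parent = nodes[current][0]
            for name in allowed:
                collected[parent][name].append(attrs[name])
            current = parent
    return collected


collected = aggregate_with_blocking(nodes)
```

In this sketch the wafer receives thickness from both children but conductivity only from the undoped one, while the doped node still collects the conductivity measured on it.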


Without context, data are of limited value. Traditional databases are designed to warehouse data and rely on domain experts to either know or, through exploration, discover useful relationships. Many high-quality tools have been developed to assist in this exploration. But making that data useful outside its domain, so that manufacturing process and control data are useful to the metallurgist or metallurgical data are useful to the designer, requires multidisciplinary exploration, extensive collaboration, and time.

This method separates data content in nodes from data context in the connections. This enables, but does not simplify, learning from data by pressing on the domain experts not only to provide the numbers, but also to define the data's taxonomy. But by establishing these relationships conceptually, they can be applied to ever-growing materials data sets across ever-larger multidisciplinary collaborations.