By David Beck | December 2017
From high-throughput experimentation to large-scale observational studies and massive simulations, there has been an unending succession of advances in our ability to generate large data sets. These advances have been made possible by the commoditization of robotic instrumentation; the increasing availability, decreasing cost, and improved accuracy of sensing and imaging technologies; and the continually declining cost of scalable computation and data storage. However, large data sets are often synthesized from several sources, resulting in heterogeneity and complexity that require reconciliation. They may also contain noise or other challenging features. As a result, knowledge extraction has lagged behind data generation. Data Science is meant to address this bottleneck. At its core, Data Science is the intersection of statistics, data management, visualization, Machine Learning, and software engineering. It is at its most powerful when overlaid onto the fabric of substantive domain expertise.
The challenges of data management begin the moment data is collected. In the case of a large-scale streaming sensor network, do we save all the data–potentially billions of samples–or just time-windowed statistical summaries? What implications do these decisions have on our ability to use the data effectively in downstream analyses? In a high-throughput imaging experiment, how should the data be stored to be most effectively used by image processing software? In all cases, we wrestle with what metadata to record so that data is correctly interpreted and its provenance captured. Working effectively with data sets requires knowledge and understanding of databases and data management strategies.
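The save-everything-or-summarize trade-off can be made concrete with a minimal sketch (the function name and the simulated temperature readings are hypothetical, chosen only for illustration): each time window is reduced to a handful of statistics and the raw samples are discarded, shrinking storage dramatically at the cost of per-sample detail in downstream analyses.

```python
from statistics import mean, stdev

def window_summaries(stream, window_size):
    """Reduce a sensor stream to per-window statistical summaries.

    Rather than storing every raw sample, keep only a compact
    (mean, min, max, stdev) record per time window -- the storage
    trade-off discussed above.
    """
    summaries = []
    window = []
    for sample in stream:
        window.append(sample)
        if len(window) == window_size:
            summaries.append({
                "mean": mean(window),
                "min": min(window),
                "max": max(window),
                "stdev": stdev(window),
            })
            window = []  # discard raw samples once summarized
    return summaries

# A simulated temperature stream: six samples, windows of three.
readings = [20.1, 20.3, 20.2, 21.0, 20.8, 21.2]
print(window_summaries(readings, 3))
```

Billions of samples become thousands of summary records, but questions that depend on individual samples (e.g., detecting a single transient spike) can no longer be answered, which is exactly why the decision must be made before collection begins.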
Chemical engineering is an ideal environment to deploy Data Science methodologies. For example, chemical plants are highly automated settings in which sensor networks provide continuous streams of data. However, traditional model predictive control techniques often fail due to process complexity or a mismatch between the speed of computation and the response rate required for high performance. In such instances, Machine Learning methods can be used to learn process control rules with a high level of accuracy and millisecond run times.
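The idea of learning a fast control rule from logged process data can be sketched with a deliberately tiny stand-in for a real Machine Learning model: a one-nearest-neighbor lookup over hypothetical (temperature, valve-opening) pairs. All names and numbers here are invented for illustration; a production system would use a far richer model and state representation.

```python
def train_nn_controller(history):
    """'Train' a one-nearest-neighbor controller from logged
    (process_state, control_action) pairs. A minimal stand-in for
    the learned control rules described above: all the work happens
    offline, so a control decision at run time is a fast lookup."""
    return sorted(history)

def control_action(model, state):
    # Return the action whose recorded state is closest to the query.
    return min(model, key=lambda pair: abs(pair[0] - state))[1]

# Hypothetical log: reactor temperature (degrees C) -> coolant valve opening (%).
log = [(70.0, 10.0), (80.0, 25.0), (90.0, 55.0), (100.0, 90.0)]
model = train_nn_controller(log)
print(control_action(model, 88.0))  # nearest logged state is 90.0
```

The appeal over online optimization, as in the model predictive control comparison above, is that the expensive computation is moved to training time, leaving only a cheap evaluation in the control loop.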
Machine Learning is often broken down into supervised and unsupervised learning. In supervised learning, a model designed to predict the behavior of a system is trained on data labeled through domain expertise and careful experimental design. For example, to identify order or disorder in a microscopy image, a Machine Learning model may be trained with hundreds of labeled images. Unsupervised learning does not have the benefit of labels; instead, it tries to discover meaningful relationships and structures within the data. An example would be clustering gene expression data to design an industrial microbe that produces large quantities of a commercially valuable compound.
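The unsupervised case can be illustrated with a toy one-dimensional k-means clustering, in the spirit of the gene expression example above. The expression levels below are fabricated, and real analyses operate on thousands of genes at once, but the mechanics are the same: no labels are supplied, yet the algorithm recovers the group structure.

```python
from statistics import mean

def kmeans_1d(values, k, iters=20):
    """Tiny one-dimensional k-means, an unsupervised method:
    alternate between assigning each value to its nearest center
    and moving each center to the mean of its assigned values."""
    # Seed with k values spread across the sorted data.
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sorted(centers)

# Hypothetical expression levels for one gene across six samples:
levels = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]
print(kmeans_1d(levels, 2))  # two clear groups: low vs. high expressers
```

No one told the algorithm which samples were "high" or "low"; the two cluster centers emerge from the data alone, which is what distinguishes this from the supervised microscopy example.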
Data Science is equally powerful when applied to hybrid pipelines that couple high-throughput characterization experiments to data-driven modeling. These pipelines tend to be “virtuous design-build-test cycles”: experimental data are used for model building, models are used to design experiments that test model accuracy, and results are fed back into the models. As this cycle repeats and the model grows more accurate, it becomes possible to create new materials or medicines with unique properties.
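The design-build-test loop above can be sketched in miniature. Everything here is hypothetical: the "process" is a synthetic linear response, the model is an ordinary least-squares line, and the "design" step simply probes where existing data are sparsest. Real pipelines use far richer models and design criteria, but the feedback structure is the same.

```python
def fit_line(data):
    # Ordinary least squares for y = a*x + b over (x, y) pairs.
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

def next_experiment(data, candidates):
    # 'Design' step: probe the candidate farthest from all existing data.
    xs = [x for x, _ in data]
    return max(candidates, key=lambda c: min(abs(c - x) for x in xs))

def design_build_test(measure, data, candidates, rounds=3):
    for _ in range(rounds):
        x = next_experiment(data, candidates)  # design
        data.append((x, measure(x)))           # build + test
    return fit_line(data)                      # updated model

# Hypothetical process: yield rises linearly with temperature.
slope, intercept = design_build_test(lambda t: 2.0 * t + 1.0,
                                     [(0.0, 1.0), (1.0, 3.0)],
                                     [0.5, 2.0, 4.0, 8.0])
print(slope, intercept)  # recovers a=2, b=1
```

Each pass through the loop lets the current model choose the next experiment, and each new measurement sharpens the model, mirroring the virtuous cycle described above.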
Finally, statistics and visualization play important roles in Data Science. A rigorous knowledge of statistics, including probability distributions and hypothesis testing, is essential to high-quality experimental design that yields valid and reproducible results. Statistics also plays a vital role in mean-time-between-failure analysis and other production-floor concepts. In the case of visualization, when data contain thousands of variables across millions of observations, we have to go beyond scatterplots to convey the stories in our data to stakeholders and peers.
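The mean-time-between-failure idea is a small, concrete example of statistics on the production floor. The sketch below assumes an exponential failure model, under which the sample mean of observed uptimes is the maximum-likelihood estimate of the MTBF; the pump uptimes are fabricated for illustration.

```python
from math import exp
from statistics import mean

def mtbf(failure_intervals):
    """Estimate mean time between failures from observed uptimes.

    Under an exponential failure model, the sample mean is the
    maximum-likelihood estimate of the MTBF."""
    return mean(failure_intervals)

def reliability(t, mtbf_hours):
    # Probability a unit survives past time t under the same model:
    # R(t) = exp(-t / MTBF).
    return exp(-t / mtbf_hours)

# Hypothetical uptimes (hours) between pump failures on a production floor:
uptimes = [120.0, 95.0, 180.0, 150.0, 110.0]
m = mtbf(uptimes)
print(m)                      # estimated MTBF in hours
print(reliability(24.0, m))   # chance of surviving a 24-hour shift
```

Estimates like these feed directly into maintenance scheduling: a plant can compare the survival probability over a shift against its risk tolerance and plan inspections accordingly.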
Building on our leadership position in graduate Data Science training (see accompanying articles), we are developing curricula and enhanced experimental facilities at the undergraduate level. Students will continue to learn how to design statistically rigorous experiments, but will do so in a high-throughput setting. The large data sets they generate will then be used for Data Science coursework, such as process modeling and control with Machine Learning techniques.
These future chemical engineers will be poised to dive into a data-rich future to generate new knowledge and solutions for our changing world.