Accelerating discoveries using machine learning

Science quite often produces complicated mathematical models that successfully describe certain observable phenomena, and we use these models to predict a set of quantities or properties. For instance, we use classical physics to predict how much weight a bridge can tolerate. Quite often, the relationship between the target property (e.g., the amount of weight the bridge can handle) and the configuration of the object (e.g., the way the bridge was built, the materials used, etc.) requires expensive computations or experiments. Now, what is super cool is that with the help of machine learning and the "black-box" approach, you can sometimes take a shortcut to finding these "structure-property" relationships. I will try to make this clear in what follows, so bear with me.

Machine learning is a branch of computer science in which models are built to learn patterns from data, referred to as training data. For instance, given a set of images of cats and dogs, we want a model that takes the value at each pixel of an image and combines them in such a way that it can ultimately say whether the image shows a cat or a dog. Training such a model is an optimization process, and it requires enough labeled data, i.e., a set of (input, output) pairs used to find the configuration of the model parameters that minimizes the error in labeling new inputs.
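To make this concrete, here is a minimal sketch of that training loop in Python using scikit-learn. The data below is randomly generated just to show the workflow; in reality you would load real labeled images and flatten each one into a vector of pixel values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for real data: 200 "images" flattened to 64x64 = 4096 pixel values,
# each labeled 0 (cat) or 1 (dog).
rng = np.random.default_rng(0)
X = rng.random((200, 64 * 64))
y = rng.integers(0, 2, size=200)

# Hold out part of the labeled data to measure how well the model labels new inputs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# "Training" = optimizing the model parameters to minimize the labeling error
# on the training pairs.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("accuracy on images the model never saw:", model.score(X_test, y_test))
```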

The scope of applications of such models is very broad. A machine learning model can learn to capture any relationship between structure and property of the kind we described at the beginning. In general, if enough high-quality (not too noisy) data of (configuration, property) pairs is available, one can train a machine learning model to predict the property of a completely new configuration. This means we may no longer need first-principles models to do the same job. I emphasize again: this only holds if enough data is available.
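The same train-then-predict pattern applies to structure-property problems. Here is a hedged sketch, with made-up configuration features and property values standing in for real simulation or experimental data; the point is the workflow, not the numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Placeholder dataset: each row describes a configuration with 10 numeric features,
# and the target is the (expensive to compute or measure) property of that configuration.
rng = np.random.default_rng(1)
X = rng.random((500, 10))
y = X @ rng.random(10) + 0.1 * rng.standard_normal(500)  # toy structure-property relation

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = RandomForestRegressor(n_estimators=200, random_state=1)
model.fit(X_train, y_train)

# The held-out error tells us whether the model generalizes to configurations it never saw.
print("error on unseen configurations:",
      mean_absolute_error(y_test, model.predict(X_test)))
```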

The advantage is that these models, once trained, make predictions using very basic computations, essentially a series of matrix multiplications, so they are far less computationally intensive. This dramatically reduces the cost of simulating or measuring the properties of each new configuration of the object under study, which matters because the number of configurations that could in principle exist in the real world is practically infinite.
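To see why inference is so cheap, here is what "making a prediction" looks like for a small, already-trained neural network: a couple of matrix multiplications plus element-wise operations. The weights below are random placeholders standing in for the values that training would have produced.

```python
import numpy as np

# Placeholder weights for a tiny two-layer network mapping a 10-number
# configuration description to a single predicted property value.
rng = np.random.default_rng(2)
W1, b1 = rng.random((10, 32)), rng.random(32)  # would come from training
W2, b2 = rng.random((32, 1)), rng.random(1)

def predict(x):
    """Inference is just: matrix multiply, add bias, apply a nonlinearity, repeat."""
    h = np.maximum(0.0, x @ W1 + b1)  # hidden layer with ReLU
    return h @ W2 + b2                # output layer

new_configuration = rng.random(10)
print(predict(new_configuration))  # microseconds, versus hours for a simulation or experiment
```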

The science of mixology as an example

Let me give you a fun example to make this process, often referred to as "high-throughput accelerated discovery", clearer. Consider the problem of finding the best recipes for making good cocktails. The space of feasible configurations, i.e., the ways one can combine various ingredients and their proportions, is practically infinite! If you ask a mixologist to find the best recipe, they would use their sense of smell and taste (and of course vision and design too) and would run many experiments to find the best recipes. Of course this process is extremely slow, and one can only search a tiny subset of the infinite possibilities. So even if all mixologists worked together, it is very likely they would miss delicious cocktails never tried by humans before! (It's a bit sad that our species may go extinct without discovering the greatest recipes.)

If you ask a scientist, they would probably try to model our sense of taste by understanding the receptors at a molecular level and how they respond to various substances. They would then study how receptor function differs from person to person based on each person's physical and psychological characteristics...

Both of the aforementioned approaches are extremely slow and cannot search the practically infinite space of possibilities in a comprehensive way. Now, consider a data-driven discovery approach to this challenge. You take all the available data on cocktail recipes, together with data on how people have rated those recipes, and feed it to a machine learning model. The model learns the relationship between the recipe information and the ultimate outcome, i.e., how people would rate the associated cocktail. If enough data is available (which requires a diverse subsampling of the "search" space), the black-box model can predict how people would rate new recipes before they have ever been tried in the real world! Using this approach, one can find many promising candidate recipes and then test them in the real world to see whether they actually are good.
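As a hedged sketch of what this could look like in code: suppose each recipe is encoded as a vector of ingredient proportions and each known recipe comes with an average taster rating (both synthetic here, purely for illustration). We fit a model on the known recipes, score a huge batch of untried candidates, and keep only the most promising ones for a human to actually mix and taste.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical encoding: each recipe is a vector of proportions over 20 possible
# ingredients; the target is its average rating from tasters. All values are synthetic.
rng = np.random.default_rng(3)
known_recipes = rng.dirichlet(np.ones(20), size=300)  # recipes people have already tried
known_ratings = rng.uniform(1, 10, size=300)          # stand-in for real taster ratings

model = GradientBoostingRegressor(random_state=3)
model.fit(known_recipes, known_ratings)

# High-throughput screening: score recipes no one has ever mixed...
candidates = rng.dirichlet(np.ones(20), size=100_000)
predicted = model.predict(candidates)

# ...and hand only the top few to a human to actually taste.
top = np.argsort(predicted)[-5:][::-1]
print("candidate recipes worth trying:", top, predicted[top])
```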

Quantum physics example

Let me give you a more serious, and perhaps less fun, example (unless you are a nerd like me). Physicists have been studying high-Tc superconductivity for the past few decades, and we still do not have a proper understanding of how it works. Understanding the mechanism behind high-Tc would help scientists find materials with higher transition temperatures, and perhaps even room-temperature superconductors, which could revolutionize many industries (for example, by eliminating the electric power lost as heat in devices).

The interesting question is: "can a data-driven approach, built on the data collected so far, produce a black-box model of high-Tc that lets us search for, and eventually discover, new high-Tc materials without understanding the underlying mechanism?" Some physicists and materials scientists have already started trying this computer-science-based approach.
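Schematically, such a search could look like the sketch below: describe each known superconductor with a few composition-derived descriptors, fit a model to its measured Tc, and then rank hypothetical candidate materials by their predicted Tc. The descriptors and numbers here are purely illustrative placeholders, not a real materials dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative only: each known material is described by 8 composition-derived
# descriptors (e.g., averaged atomic properties), and the target is its measured Tc
# in kelvin. Real studies would use a curated database of measured transition temperatures.
rng = np.random.default_rng(4)
descriptors = rng.random((1000, 8))
tc_values = 40 * rng.random(1000)

model = RandomForestRegressor(n_estimators=300, random_state=4)
model.fit(descriptors, tc_values)

# Screen a large pool of hypothetical candidates and flag the ones predicted to have
# the highest Tc for first-principles calculations or actual synthesis.
candidates = rng.random((50_000, 8))
predicted_tc = model.predict(candidates)
print("most promising candidate indices:", np.argsort(predicted_tc)[-10:][::-1])
```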

I think this shortcut approach can lead to interesting discoveries in the near future. The turning point is when enough high-quality data becomes available, and for many applications the data may already have reached the volume and diversity required to make such shortcut discoveries possible!