How data is changing the classical way of scientific theorisation

JAN 14, 2022

The classical theory of problem-solving describes science through systematic observation, reasoning, prediction, and the formation and testing of hypotheses. These are the very measures that have long differentiated scientific activity from non-science. Moreover, scientists have been trained to differentiate between correlation and causation: one must find a cause-effect relationship to understand and accept a correlation. But with the advent of humongous datasets, machine learning and black-box models, we are shifting to mere correlations.

Data-driven shift

Chris Anderson predicted this shift in approach more than a decade ago, asserting that at the petabyte scale, information is a matter of dimensionally agnostic statistics, which requires us to forgo the visualisation of data. For a science that has always been about causation, the change pushes it towards the consideration of correlations. Data has to be viewed mathematically first and contextually later. Algorithms, machine learning, and data visualisation are all applied mathematics, where more subjective concerns take a back seat. For instance, as Laura Spinney illustrated, Facebook’s ML algorithm will predict a user’s preferences better than a psychologist would.

But Facebook’s algorithm is a black box, which is the main differentiating factor from the classical theory. These solutions are silent on how they work, how they know your preferences and how they arrive at the best outputs. Together, data and applied mathematics are replacing the process of hypothesise-model-test. Anderson argued that computers find relations better than we do, and that they expose our theories as mere oversimplifications of reality.

The petabytes enable us to assert that correlation is enough. The quality of output provided by the humongous data that can be fed into computers allows us to let go of modelling or hypothesising. “Somewhere between Newton and Mark Zuckerberg, the theory took a back seat,” Spinney said. In the new age of data, this is a fair observation, and one that calls for new ways of understanding it.

Theory will not die

When asked whether this change has already occurred, however, researchers are divided. Even so, the consensus is that theory will not die regardless of research techniques. AI and algorithms are not foolproof: they are biased, and they often make mistakes despite the tons of data they are fed.

Researchers are constantly working on theorising why AI behaves the way it does. At times, AI is not the end goal. It provides us with data-backed information for further analysis. For instance, consumer shopping behaviour may be modelled by an algorithm, but the algorithm only provides possible scenarios from which we have to theorise and form a solution. One could think of it as being led by the hand to explore theories in more detail than you could have done manually. This side of AI is still explainable and theorisable, merely supporting humans with better analysis. But that is not the case with all models, especially the more recent algorithms such as Facebook’s ad algorithm or Google’s hiring system, which are theory-free.

Theory-free black box

Theory-free algorithms abound, but they make us humans uncomfortable. To give up all control to a machine with no explanation is not in our nature; it does not sit well, and rightly so, as shown by the terrible mishaps created by Google’s sexist hiring algorithm or Facebook’s racist ad recommendation system. But as of now, in the absence of successful research into explaining them, most algorithmic models are black boxes. Models alone are replacing the hypothesise-model-test process. This is the side of science that many researchers are sceptical about, because we can’t give up power to a model that may have been trained on incorrect data or small datasets: it will provide false correlations as output.
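The point about small datasets can be made concrete with a short sketch. It is an illustration only, not taken from any of the researchers quoted here: the function names, the sample sizes and the number of trials are all assumptions chosen for the example. It repeatedly draws two unrelated random variables and records the strongest correlation that appears purely by chance.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(sample_size, trials=200, seed=0):
    """Largest |correlation| seen between pairs of independent variables.

    Hypothetical helper for illustration: every pair is drawn independently,
    so any correlation it reports is spurious by construction.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    strongest = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        strongest = max(strongest, abs(pearson(xs, ys)))
    return strongest


if __name__ == "__main__":
    for size in (5, 500):
        print(size, round(strongest_chance_correlation(size), 3))
```

With only five observations per variable, some pair will almost always look strongly related, while with 500 observations the strongest chance correlation stays weak. A model with no hypothesis to test against has no way of telling such accidents from real relationships.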

Human bias has always been a problem, and training datasets are biased too, but in traditional research the steps of verifying a hypothesis help mitigate that bias. In an unexplainable black box, one may not even know of the inherent bias for a long time. On the other hand, scientists in favour of algorithms point out that human biases are harder to interrogate and tougher to mitigate than AI’s.


Brunton and Beyeler’s paper on data-driven models in neuroscience maps out the contradiction in black boxes: their advantages and their inherent risks. The paper discusses how the black-box nature of data-driven methods impedes adoption by scientists. This human need for interpretability may have prevented researchers from identifying breakthrough insights that can only arise from massive datasets. By the same token, the authors also acknowledge that insights must be explainable and understandable if they are to be translated into solutions and trusted by humans.

“Meaningful progress requires the community to tackle open challenges in the realms of model interpretability and generalisability, training pipelines of data-fluent human neuroscientists, and integrated consideration of data ethics,” the paper explained.