The nature, limitations and future of AI
Some thoughts about the definition(s) and limitations of AI – and a few
“weird and wonderful” ideas for a model that tries to overcome them
There is a long-running and still ongoing debate about what artificial intelligence actually is. Let’s walk through the history of definitions very briefly. John McCarthy coined the phrase in 1956 during what is now called the Dartmouth Conference, the first get-together of scientists working on this subject. His definition was:
"Artificial intelligence is a sub-field of computer science. Its goal is to enable the development of computers that are able to do things normally done by people -- in particular, things associated with people acting intelligently."
There are several other versions of this definition, but the gist remains the same: McCarthy compares the abilities of machines with the abilities of intelligent humans. This raises the next question: What is human intelligence? Unfortunately, this term is equally ill-defined.
This of course doesn’t stop psychologists from measuring human intelligence. Hence, one could take a rather pragmatic approach and define a machine as intelligent if it passes a human intelligence test. But that doesn’t really help either.
Other definitions, for example in the Encyclopedia Britannica, stress that machines can be considered (artificially) intelligent if they are able to perform tasks better than humans.
By that standard, any simple calculator would be “intelligent”, as it most likely multiplies two ten-digit numbers more accurately and certainly faster than humans do.
Elaine Rich altered this notion of intelligence in a very smart way, and lots of people consider her definition of 1983 to be the best one to date. She said:
“Artificial Intelligence is the study of how to make computers do things at which, at the moment, people are better”
As Wolfgang Ertel put it in his textbook “Grundkurs Künstliche Intelligenz”, the charm of this definition is that it will still hold true in 50 years.
However, a closer look leaves us still unsatisfied.
Firstly, it is not quite what we normally think of as a definition. A machine playing chess or identifying lung cancer on an X-ray image is artificially intelligent as long as humans are still better at this task. Once that is no longer the case, the same machine ceases to be considered artificially intelligent.
Secondly, the definition is in a certain way quite mundane. Humans have always aimed at developing tools that outperform themselves. Take the first Stone Age person who discovered that you can shape materials, or kill animals, better with a stone than with your bare hands. I think one could argue that (almost) every human invention was aimed at performing a certain task better than humans could without it. A plough in front of a horse is more effective than a spade, a combine harvester is more effective than a scythe, and yes: a rifle is more effective than a bow and arrow.
Clearly, the “ideal” definition, if it exists, will not use direct or even temporary comparisons to human abilities.
I would like to make a suggestion, even though it is certainly not perfect and also might not cover the entire range of AI. I will still argue my case for it:
“AI is a discipline of IT that aims at developing, training and optimizing computer programs that learn from mistakes.”
There is, of course, a connotation to human behavior, as we all learn from mistakes too. To take a classic example: When a small child touches a hot stove plate, it learns to be more careful the next time. Or when pupils learn vocabulary, they are most likely to write the words in their native language on the left side of a sheet of paper and the translations on the right. They read both columns a few times, then cover the translations and check one by one whether they know them. All the ones they got wrong are learnt again. The two examples roughly correspond to unsupervised and supervised learning in AI.
There is a definition of AI (or machine learning) that does not use the comparison to humans. It is along the lines of “AI (ML) systems are computer programs that are not explicitly programmed”, and you might wonder why this does not suffice.
Well, it does and it doesn’t, at least in my opinion. If it does, it certainly is a huge exaggeration of what AI is actually able to do. But let’s look at the phrase “not explicitly programmed” a bit more closely.
At the core of any AI algorithm is a cost function, or to be more abstract: some metric that compares predictions either to true labels or to previous predictions or, again slightly more abstract, to a new state of the system that is based on previous predictions. That holds true for supervised learning as well as unsupervised learning.
The pseudo code for an ML system looks something like this:
- Initialize the model (more or less) randomly
- Pass your data through the system and get a temporary result
- Evaluate that result on the basis of a clearly defined cost function or metric
- Alter the state of the system with respect to the evaluation, again on the basis of precisely defined rules or procedures
- Stop at a predefined criterion
Now which bit in this iteration is “not explicitly programmed”?
Take for example a neural network.
Step 3) would be:
- Compare the temporary result to the true label and remember it as cost.
And step 4):
- Calculate the gradient of the cost, propagate it back through the network and update the model with respect to the gradient.
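The five steps can be made concrete with a minimal sketch. It is deliberately not a full neural network but a single linear unit trained by gradient descent on the mean squared error. The function name, the learning rate, the tolerance and the toy data are assumptions chosen purely for illustration.

```python
import random

def train_linear_model(xs, ys, learning_rate=0.05, max_epochs=2000,
                       tolerance=1e-10, seed=0):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    # Step 1: initialize the model (more or less) randomly.
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    n = len(xs)
    previous_cost = float("inf")
    cost = previous_cost
    for _ in range(max_epochs):
        # Step 2: pass the data through the model and get a temporary result.
        predictions = [w * x + b for x in xs]
        # Step 3: evaluate the result with an explicitly defined cost function.
        errors = [p - y for p, y in zip(predictions, ys)]
        cost = sum(e * e for e in errors) / n
        # Step 5: stop at a predefined criterion (cost no longer changes).
        if abs(previous_cost - cost) < tolerance:
            break
        previous_cost = cost
        # Step 4: alter the state along the negative gradient of the cost.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, cost

# Toy data generated by y = 2x + 1 (an assumption for the example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, cost = train_linear_model(xs, ys)
```

Every line here is explicitly programmed; only the final values of `w` and `b` come from the data.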
Or take clustering, as an example of an unsupervised machine learning algorithm.
Step 3) would be:
- Assign each data point to the nearest cluster center. “Nearest” of course refers to a predefined metric or distance function.
And step 4):
- Find the new cluster centers, again with respect to a predefined metric.
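For the clustering case, a minimal k-means sketch shows the same pattern. It assumes squared Euclidean distance as the predefined metric and the arithmetic mean as the rule for new centers; the function names and the toy points are hypothetical.

```python
import random

def squared_distance(p, q):
    """The predefined metric: squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the cluster centers randomly from the data.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 3: assign each data point to the nearest cluster center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 5: stop when the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centers, assignments

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, assignments = k_means(points, k=2)
```

There are no labels anywhere, yet the distance function, the update rule and the stopping criterion are all spelled out in advance.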
Take any AI or ML system and work through it: every step is well defined and there are clear rules to be carried out, in supervised learning as well as in unsupervised learning. To put it very bluntly: Without temporary results based on precisely defined cost functions, evaluation metrics, or whatever you want to call them, and without clear rules about what to do and how to proceed with these temporary results, AI and ML just would not happen.
A nice thing about the definition of AI as a system that “learns from mistakes” is that it puts the current state of AI, and the crucial role of a cost function, into perspective. It also allows us to trace the origin of AI and ML back to 1755, when Roger Joseph Boscovich and Christopher Maire (quite rightly) assumed that the Earth is not a perfect sphere.
They tried to determine the ellipticity from measurements of the length of one degree of latitude at five locations: the Cape of Good Hope, Quito, Rome, Paris and Lapland. Any two of the data points would normally be enough to get the desired value, but the measurements were too inaccurate. The two scientists then calculated the ellipticity for all combinations and investigated the deviation of the predicted lengths of a degree of latitude from the measured ones.
It was the first time in history that the error of measurements was used to derive an optimal result. They described the procedure in their book “De Litteraria Expeditione per Pontificiam Ditionem ad Dimetiendos Duos Meridiani Gradus”. It is viewed as the first application of linear regression, even though the two didn’t formalize it mathematically.
Adrien-Marie Legendre and Carl Friedrich Gauss picked this up about 50 years later, introducing squared errors as a metric and formalizing the technique. Francis Galton applied it, again some 70 years later, to heredity, looking at the size of sweet pea seeds and at human height. By doing so, he found out that unusually large parents tend to have smaller offspring, an effect now known as “regression to the mean”.
Let’s skip the history tour here. From linear regression it’s not far to logistic regression and non-linear regression, logits, perceptrons and finally neural networks, which by the way were thought up in the 1940s and hence also quite long ago.
In many ways, we are “only” carrying out what people thought of decades or even centuries ago, because we now have the computing power and the data sets to do so.
Don’t get me wrong: In no way do I want to belittle AI. It is an exciting and fascinating discipline and the results we achieve with it today are simply stunning. But I do wonder: Will we ever get beyond a cost function? And how does our brain do it? How do we learn? How does our brain make decisions, and how do we make decisions when our brain is switched off, i.e. when we are acting emotionally or intuitively?
Unfortunately, we do not know. There is that omnipresent skepticism especially about neural networks because they are supposed to be non-transparent. Some people even call them black boxes. But that is not true. As Andrew Ng puts it in one of his lectures on Coursera: Today we know more about how neural networks learn and work than we do about the human brain.
But let’s come back to the definition that states that AI systems are ones that “are not explicitly programmed”. I would personally rather take this as an aspiration than as a state we have already achieved. It is a worthwhile goal.
So let’s go and do it. Let’s find the restrictions of AI and ML and see whether something can be done about them. Let’s dream up a different model. A model that behaves more “humanly”.
When neural networks were invented back in the 1940s, as mentioned above, their pioneers were inspired by the (at the time) relatively new findings about how the human brain is structured. They tried to replicate the functionality of neurons. As far as I know, they did not look at how humans learn. And maybe that is a good starting point.
Let’s view the brain as a black box. It doesn’t matter what happens inside. We only consider what is going in and what is coming out.
We are confronted with thousands and thousands of impressions every day. We read advertisements or newspaper headlines (some people even still read newspapers) on our way to work, we smell things and we hear noises. Everywhere and all the time. It is an enormous flood of data.
And if we want to know, find out or “learn” something, we probably first search our memory for answers, but then we specifically look for the information we need or think we need. Google is our friend at this point.
That is precisely the point where most neural networks fall short. Their work begins when we have collected, explored and prepared the data, which normally takes up at least half of the time of any project.
So what about a model that chooses its own data? Just consider a multi-class, multi-label problem, say for the sake of the argument with huge class imbalances. Let our model reach out for more data itself. It is quite simple, and the pseudo code could look something like this:
- After every epoch, make a prediction on the training data.
- Flag all the incorrectly predicted data points.
- Add the augmented flagged data points to the list of file paths to the images of the training set.
- Feed the combined lists to the data generator in the next epoch. The data generator of course has to be altered: if it recognizes an image that is to be augmented, it will augment it.
An additional step, evaluating the previously augmented images, will be discussed later. But here, at the last bullet point, the code branches. We have two possibilities:
We can either keep the augmented data points through all epochs – or we can skip them after one, two or k epochs.
If we go for the latter, our model does something that we humans do all the time: It forgets. Forgetting is important for learning as well – as it clears up data that is (supposedly) not needed any more. By doing so, there is scope for new data or information. Forgetting in a way allows us to more effectively adjust our system (the black box brain) to a new state.
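A minimal sketch of this bookkeeping, including the forgetting, could look as follows. It only manages the list of file paths; the actual augmentation is left to the data generator. The function name `update_training_list`, the parameter `keep_epochs` and the `(path, augment)` tuple format are assumptions made for this example.

```python
def update_training_list(base_paths, predictions, labels,
                         augmented_pool, epoch, keep_epochs=2):
    """Return the file list for the next epoch and the updated pool.

    augmented_pool maps a file path to the epoch in which the image
    was last predicted incorrectly.
    """
    pool = dict(augmented_pool)  # do not modify the caller's dictionary
    # Flag all the incorrectly predicted data points.
    for path, predicted, label in zip(base_paths, predictions, labels):
        if predicted != label:
            pool[path] = epoch
    # Forget: drop entries that were last flagged keep_epochs or more epochs ago.
    pool = {path: flagged_at for path, flagged_at in pool.items()
            if epoch - flagged_at < keep_epochs}
    # Combine the original paths with the paths to be augmented.
    # The boolean tells the data generator whether to augment the image.
    next_epoch = [(path, False) for path in base_paths]
    next_epoch += [(path, True) for path in pool]
    return next_epoch, pool

paths = ["a.png", "b.png", "c.png"]
labels = [0, 1, 1]
# Epoch 0: b.png is predicted incorrectly and gets flagged.
items_0, pool = update_training_list(paths, [0, 0, 1], labels, {}, epoch=0)
# Epochs 1 and 2: everything is correct, so b.png is eventually forgotten.
items_1, pool = update_training_list(paths, [0, 1, 1], labels, pool, epoch=1)
items_2, pool = update_training_list(paths, [0, 1, 1], labels, pool, epoch=2)
```

Setting `keep_epochs` to a very large number corresponds to the first possibility, where the model never forgets.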
So far, so good. But let’s have a closer look at how we humans reach out for information if we need it. Most of the time, we have a pretty good idea about where to look and what to look for. We often know that a certain piece of information is in a certain book or on a certain website. On a high level, we have learnt how to learn.
Can our specific model learn how to learn as well?
Here is a suggestion.
We alter the pseudo code above in the following way:
Firstly, when augmenting the flagged data points, we randomly choose from a whole variety of augmentation methods. And we keep track of each augmented image, especially of the augmentation method used.
Secondly, after the prediction step, we evaluate the previously augmented images and set up a probability matrix in the following way: For each augmentation method, we store the fraction of images that have turned from an incorrect prediction to a correct prediction. And then, when augmenting, we still choose the augmentation methods randomly, but on the basis of the probability matrix. To be precise: If augmentation method 5 has turned a previously incorrect prediction into a correct one in 33 percent of all instances when it was applied, this augmentation method gets chosen with a 33 percent probability.
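One possible way to keep track of these fractions is sketched below. Two details are assumptions rather than part of the idea itself: the success rates of different methods will generally not sum to one, so they are used as relative weights, and a small `floor` keeps methods with a rate of zero selectable. The class and method names are hypothetical.

```python
import random

class AugmentationChooser:
    """Tracks how often each augmentation method fixed a wrong prediction."""

    def __init__(self, methods, floor=0.05, seed=0):
        self.methods = list(methods)
        self.applied = {m: 0 for m in self.methods}
        self.fixed = {m: 0 for m in self.methods}
        self.floor = floor  # assumption: no method ever drops out completely
        self.rng = random.Random(seed)

    def record(self, method, now_correct):
        """Store whether an image augmented with `method` is now correct."""
        self.applied[method] += 1
        if now_correct:
            self.fixed[method] += 1

    def success_rates(self):
        """Fraction of applications that turned incorrect into correct."""
        return {
            m: self.fixed[m] / self.applied[m] if self.applied[m] else 0.0
            for m in self.methods
        }

    def choose(self):
        """Pick a method at random, weighted by its success rate."""
        rates = self.success_rates()
        weights = [max(rates[m], self.floor) for m in self.methods]
        return self.rng.choices(self.methods, weights=weights, k=1)[0]

chooser = AugmentationChooser(["flip", "rotate", "blur"])
for outcome in (True, False, False):  # "flip" fixed one of three images
    chooser.record("flip", outcome)
rates = chooser.success_rates()
method = chooser.choose()
```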
This of course adds a whole load of new hyperparameters: Do you augment all incorrectly predicted images or only a fraction? How long do you keep augmentations in the pipeline, i.e. when is the model allowed to forget? Do you combine the probability matrix in some way with the class distribution – and if so: with the original class distribution or with the current one?
On the plus side, the outlined procedure might also help to avoid overfitting the model. And in any case, our system has now learnt not only to process the data but also to choose or even generate it. This adds a whole new dimension. But even though it is much more complex, there is still a cost function, or several. There are still explicit rules everywhere. Yes, our model is still acting rationally.
But we humans don’t always act rationally. We also act emotionally. Then we can behave totally irrationally. Some people hit other people when they are very angry or disappointed, or even injure or kill them. And then they end up in jail. Others start dancing on the road when they have had very good news (say their loved one has accepted their proposal) and get run over by a car, so that they don’t end up in church, but in hospital. And there are of course thousands of less extreme examples.
This is all natural, but the question is: Why do we do it? Why have we, during all these millennia, not learned that emotional reactions can be, and quite often are, dangerous? I don’t know the answer, but one possible explanation could be that it is overall beneficial to reset our system during a crisis, a shock or a moment of great happiness. Or maybe only that it is beneficial to sometimes not use the part of the brain we normally use.
So maybe it might be a good idea to add a few emotional states to our model. Let’s start with just two: joy and despair.
I will keep this informal and just give a few suggestions. Say “despair” sets in when the discrepancy between the training and the validation error exceeds a certain threshold. We know (and we explicitly program the model to know it as well) that any overall improvement is very unlikely at this point. The normal reaction would be to stop the model there and store it or at least keep the weights.
But instead, we let it go into despair mode and do silly things. For example, we increase the learning rate for one or several epochs to the extent that the weight updates overshoot. Or we wipe out the weights of a randomly (or otherwise) chosen layer, or some of the weights. There are endless possibilities. And yes, that will partially destroy the best state that we have reached so far, but maybe the model is able to build up to an even better end state from there.
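To make this slightly more tangible, here is a rough sketch of a despair trigger and two “silly” reactions. The threshold, the boost factor, the 50/50 choice between the two reactions and the representation of the model as a dictionary of weight lists are all assumptions for illustration; a real framework would need its own way of reaching into the model.

```python
import random

def in_despair(train_error, validation_error, threshold=0.15):
    """Despair: the gap between validation and training error is too large."""
    return validation_error - train_error > threshold

def despair_step(weights_by_layer, learning_rate, rng,
                 boost=10.0, wipe_scale=0.1):
    """Return a perturbed copy of the state and the name of the action taken."""
    new_weights = {name: list(w) for name, w in weights_by_layer.items()}
    if rng.random() < 0.5:
        # Reaction 1: raise the learning rate so that the updates overshoot.
        return new_weights, learning_rate * boost, "boost_learning_rate"
    # Reaction 2: wipe out one randomly chosen layer with fresh random values.
    layer = rng.choice(sorted(new_weights))
    new_weights[layer] = [rng.gauss(0.0, wipe_scale) for _ in new_weights[layer]]
    return new_weights, learning_rate, "wipe_layer:" + layer

weights = {"dense_1": [0.1, 0.2], "dense_2": [0.3]}
if in_despair(train_error=0.05, validation_error=0.30):
    new_weights, new_rate, action = despair_step(weights, 0.01, random.Random(1))
```

Note that even the despair is explicitly programmed: the trigger is a threshold and the “silly things” are drawn from a predefined list.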
The definition of joy could be similar, for example when the validation error decreases over one or several epochs more than in (a specified number of) previous epochs.
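One way to read this definition is sketched below: joy occurs when the average drop of the validation error over the most recent epochs is larger than the average drop over a number of epochs before that. The function name and the default values of `recent` and `history` are assumptions.

```python
def is_joyful(validation_errors, recent=1, history=3):
    """True if the validation error recently fell faster than before."""
    # Drop per epoch: positive values mean the error decreased.
    drops = [earlier - later for earlier, later
             in zip(validation_errors, validation_errors[1:])]
    if len(drops) < recent + history:
        return False  # not enough epochs to compare
    recent_drop = sum(drops[-recent:]) / recent
    previous_drop = sum(drops[-(recent + history):-recent]) / history
    return recent_drop > 0 and recent_drop > previous_drop

sudden_improvement = is_joyful([0.50, 0.48, 0.47, 0.46, 0.40])
slowing_down = is_joyful([0.50, 0.45, 0.40, 0.35, 0.34])
```

What the model should then do in its state of joy is left open here, just as in the text.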
But it could also be quite different, as there is no need to stay within the system. Have you noticed that we are still in the restricted space of the model and its data? Why not define joy as an event that occurs if there were k days with more than a specified number of hours of sunshine? We live in an open environment, so why should our model not do so as well?
Where has this all led us to?
Well, our model is certainly less restricted than a plain vanilla one – it has more dimensions and it is more randomized. Maybe it will be more robust, but not perform as well as standard models. Maybe it will be the other way round. Or it will be worse in every aspect. Feel free to try it (I am doing so as well, but it is too early for a judgement).
But with respect to the cost-function, there is no real progress. We still need to specify every single step in a mathematical manner. But maybe there is no way round it. Or how would you define a certain state without a definition?
I would be curious to hear your comments – and I hope this forum will turn into a lively discussion floor for similar “weird and wonderful” thoughts and ideas.
Note from the author:
This blog post is not supposed to be a scientific paper, rather an essay, with the intention of making you think. What I’ve written are my thoughts – and maybe several of them have been discussed in the literature, in which case I am not aware of it. If this is the case, feel free to add the papers to this forum.