Artificial Intelligence has been undergoing a fundamental change since 2017. After the publication of the paper "Attention Is All You Need", a paradigm shift began to take shape that continues to this day.
In this new series of articles, we will tell you everything you need to know to understand where Artificial Intelligence stands today and where it is headed.
This process of change led to the creation of a new generation of models (such as BERT, GPT-3, etc.) that share several characteristics (and differ in others). They are all trained on large amounts of data, generally using a concept called self-supervision at scale, and their main advantage is that they can be adapted to other tasks and achieve increasingly better results on the benchmarks the academic community uses to measure the state of the art.
These models are currently known as "Foundation Models", and they are the ones driving this new stage of NLP (Natural Language Processing).
However, like any new beginning, it brings both opportunities and risks. It is important to know these models' capabilities, their technical characteristics, the sectors where they can be applied, and the ethical impact they generate.
The companies being built around these models and their presence at the state of the art in so many tasks generate a lot of traction, but they also call for caution. We still need to understand how these models work, when they fail, and what they are capable of.
The objective of this series of articles is to explain how we got to where we are and where we are going, and to help differentiate hype and marketing from real advances in the field.
That being said, nothing better than starting at the beginning:
What is Machine Learning?
The beginnings of machine learning go back to the 1990s. Its development changed the way Artificial Intelligence models were built: instead of specifying how to solve a task, the idea was to introduce algorithms capable of learning from data. While these algorithms were a breakthrough, they did not have the ability to generalize. That is, they were capable of "solving" one task, but they could not be applied to a different one. This was particularly notable in NLP, where tasks of high semantic complexity still could not be solved by this type of traditional ML.
The beginning of Deep Learning
In the 2010s, deep neural networks made a comeback, primarily because they outperformed traditional ML algorithms on many tasks. This change, called "Deep Learning", was characterized by the use of neural networks, large amounts of data, an increase in compute (using specialized hardware called GPUs), and the learning of hierarchical features directly from raw data. It also meant a shift towards "generalization": instead of having one algorithm for every application, the same architecture could be used for multiple tasks.
Foundational models of AI
This new stage began at the end of 2018. At first, the most important factor was "transfer learning at scale", that is, the possibility of taking the knowledge learned on one task and transferring it to the resolution of another task.
The use of this technique is what made training this new type of model possible, but the key also lies in scale: that is precisely what makes these models powerful. Scale requires three ingredients:
- Hardware: GPUs.
- The development of the Transformer architecture, which exploits the parallelism of GPUs and allows training models with more and more parameters.
- Availability of massive amounts of data. This point is key, since correctly annotated data for solving tasks comes at a non-trivial cost and imposes limits on learning. However, by adding self-supervision, the pre-training task can be made unsupervised. For example, BERT is trained with a masked language modeling objective, whose goal is to predict a word in a sentence given its context, so the task can be run on raw text (no supervision or labels), as in the sketch below.
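To make the self-supervision idea concrete, here is a minimal sketch of the masked language modeling objective at work. It is our own illustration, and the use of the Hugging Face transformers library is an assumption on our part: a pretrained BERT recovers a hidden word from its context, with raw text as the only input and no human-provided labels.

```python
# Minimal sketch of BERT's masked language modeling objective.
# Assumes the Hugging Face `transformers` library is installed (pip install transformers).
from transformers import pipeline

# Load a pretrained BERT checkpoint behind a fill-mask head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The training signal comes from the text itself: a word is hidden and the model
# must recover it from the surrounding context. No annotated labels are required.
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```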
Self-supervised learning has reached several milestones over time:
1) Word embeddings (Mikolov et al. 2013)
2) Autoregressive language modeling: predicting the next word given the previous ones (Dai and Le 2015). See the sketch below this list.
3) Contextual language models such as:
a) GPT (Radford et al. 2018)
b) ELMo (Peters et al. 2018)
c) ULMFiT (Howard and Ruder 2018)
4) BERT (Devlin et al. 2019)
5) GPT-2 (Radford et al. 2019)
6) RoBERTa (Liu et al. 2019)
7) T5 (Raffel et al. 2019)
8) BART (Lewis et al. 2020)
All these models build on the concepts described above, add increasingly powerful deep encoders (bidirectional in the case of BERT-style models), and scale to ever larger architectures and data sets.
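As promised in milestone 2 above, here is a toy sketch of the autoregressive objective in plain Python. It is an illustration of ours, not code from any of the cited papers: every position in a raw text yields its own training example, and the "label" is simply the next token.

```python
# Toy illustration of autoregressive language modeling: the training pairs
# (context -> next token) are built directly from raw text, with no labeling effort.
text = "attention is all you need"
tokens = text.split()

# Each prefix of the sequence becomes a context whose target is the following token.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(f"given {context} -> predict {target!r}")
```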
As we already mentioned, one of the great objectives is generalization: the use of a single model for several tasks meant the beginning of the foundation models stage.
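To give a concrete (and hedged) picture of what "a single model for several tasks" looks like in practice, the sketch below again assumes the Hugging Face transformers library, which is our choice for illustration: the same pretrained checkpoint is loaded behind different task-specific heads, and only the small head, plus a light fine-tuning pass that is not shown here, changes per task.

```python
# Sketch: one pretrained foundation model, several task-specific heads.
# Assumes the Hugging Face `transformers` library; the checkpoint name is illustrative.
from transformers import (
    AutoModelForQuestionAnswering,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
)

checkpoint = "bert-base-uncased"  # the shared pretrained model

# Each task reuses the same pretrained encoder and only adds a small head on top.
sentiment = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
ner = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=9)
qa = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

# Fine-tuning each head on its own (much smaller) labeled dataset is what
# "transfer learning at scale" means in practice.
```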
The risk of AI foundational models
If we look at the SoTA (State of the Art) across NLP tasks, the models found on the leaderboards all derive from one of these foundational models. But this high ability to generalize is a double-edged sword: any improvement to the foundational models brings immediate benefits to all NLP tasks, but it is also a risk, since every system built on those models can inherit the flaws and biases they possess.
Scale also leads to the concept of "emergence". For example, GPT-3, with its 175 billion parameters (compared to 1.5 billion for GPT-2), enables something called in-context learning, where a language model can be adapted to another task simply by providing a prompt (a natural language description of the task).
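To illustrate what such a prompt can look like, here is a hedged sketch: the example values and the complete() function are placeholders of ours, not any specific model API. The point is only that the task is described inside the input text, with no weight updates involved.

```python
# Sketch of in-context (few-shot) learning: the task is specified in the prompt itself.
prompt = """Translate English to French:

sea otter => loutre de mer
cheese => fromage
peppermint => menthe poivrée
plush giraffe =>"""

def complete(text: str) -> str:
    """Hypothetical placeholder for a call to a large autoregressive language model."""
    raise NotImplementedError("plug in the model of your choice here")

# The model is expected to continue the pattern with the French translation,
# without any gradient updates or task-specific fine-tuning.
# print(complete(prompt))
```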
Generalization and the ability to "emerge" interact in ways we do not yet understand. Generalization can bring great benefits in domains where data availability is very limited. But since the capacity of these models comes from their emergent abilities, this puts us in a dilemma, because we know that they also make serious mistakes.
Mitigating risk is one of the keys to building and deploying this type of model, and it is something every company that uses these models in production must take into account. At Aivo we take this very seriously, especially given the domains in which our bots respond.
In future articles, we will describe in more detail the technical characteristics of these models, the benefits they bring to the field, the risks they imply, where the field is headed, and how we can mitigate the risks.
In the meantime, you can get an in-depth look at how Aivo's conversational AI works here.