Many companies, including OpenAI, Google/DeepMind, Microsoft, and countless others, have started the race for “truly” intelligent AI. For the majority of this article I’ll be referencing OpenAI’s GPT series of machine learning models. First, though, the question “what does it mean to be truly intelligent?” needs to be answered. OpenAI has modeled this problem as a text transformer. The model takes sequences of word pieces (byte-pair-encoded subword tokens) and tries to predict the next word pieces. They claim this demonstrates intelligence by embedding human knowledge in question-and-answer form. As shown in Kevin Lacker’s article, GPT3 can factually answer many questions like:
However, Lacker also highlights an important flaw in GPT3’s answers. He discovered that if a nonsense question is asked, the model has no problem returning a nonsense answer:
Lacker also makes some important observations about the model when answering certain nonsense questions:
Lacker says this about the answers:
Lacker says that these answers are good guesses at the correct answer because they are US-related political entities. I suspect that GPT3 operates in a manner similar to Word2Vec models, in that the model develops groups (e.g. people related to the US, people related to politics, etc.) in some latent representation (the intermediate output of one or more of the model’s layers). Unfortunately, GPT3 has not, as of this writing (7–23–2020), been released to the public, so I cannot test this hypothesis. In a personal project that I am currently working on, I have developed the technology to measure this (granted, it might not scale, but that is just another issue to fix). Even so, measuring this would be a significant undertaking. The reason is that when measuring the correlation between two categories, such as US and European political figures, you need to be very careful that you are actually measuring those categories and not some random correlation. Logically, for better performance, the lower layers should be disregarded because of neural networks’ hierarchical structure. To illustrate this, let’s look at a CNN’s activations, maximized layer by layer, created by Fabio M. Graetz; the full article by Fabio M. Graetz, with more explanations, can be found here:
The neural network was designed to classify images. We can see that its understanding of a class of images is developed through a hierarchical structure in which each layer constructs features out of the layers below it, eventually leading up to high-level features such as the class of an image.
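The category-correlation measurement described earlier could, in principle, look something like the following sketch. Everything here is hypothetical: the embedding vectors stand in for activations that would actually be extracted from an upper layer of the model, and the category names are just labels for illustration.

```python
import numpy as np

# Stand-in latent vectors for two categories; in practice these would be
# activations extracted from an upper layer of the model, one per entity.
rng = np.random.default_rng(0)
us_politicians = rng.normal(0.5, 0.1, size=(4, 8))
eu_politicians = rng.normal(-0.5, 0.1, size=(4, 8))

def centroid(vectors):
    """Mean vector of a category's embeddings."""
    return vectors.mean(axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# If the latent space really groups the concepts, within-category
# similarity should exceed between-category similarity.
within = cosine(us_politicians[0], centroid(us_politicians[1:]))
between = cosine(us_politicians[0], centroid(eu_politicians))
print(within > between)
```

The care the article calls for lives in how the stand-in vectors are obtained: with held-out entities and multiple layers checked, a gap between `within` and `between` is evidence of a real category, not a random correlation.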
The same principle applies to any neural network, so to find these correlations it’s best to start looking from top to bottom. Since the decoder’s job is to generate text (probabilities over tokens), its layers should be omitted from data collection. An illustration of text transformers by Jay Alammar:
However, what’s important to note is that GPT3 is not intelligent. GPT3 is, in a sense, the world’s most advanced search engine. It distills all of the knowledge it has seen into its weights, and when asked a question it constructs the most likely answer from the training data, which was produced by scraping the internet. This is demonstrated by its responses to Lacker’s questions, as well as by neural networks’ hierarchical structure. Even if the neural network performed human-like computation, the transformer architecture is impractical for human-level intelligence. The main issue with systems like GPT3 and BERT is that they are feed-forward neural networks, meaning information can only flow from input to output. An illustration of a feed-forward neural network, by Stanford:
The information flows from the input layer through both hidden layers and finally to the output layer. The key takeaway is that, fundamentally, BERT and GPT3 have the same structure in terms of information flow. Although attention layers in transformers can distribute information in a way that a normal neural network layer cannot, they retain the fundamental property of passing information forward from input to output.

The first problem with feed-forward neural nets is that they are inefficient. A processing chain can often be broken down into multiple small repetitive tasks. Addition, for example, is a cyclical process in which single-digit adders (or, in a binary system, full adders) are used together to compute the final result. In a linear information system, adding three numbers requires three adders chained together; this is inefficient, especially for neural networks, which would have to learn each adder unit separately when it is possible to learn one unit and reuse it. This is also not how back propagation tends to learn: the neural network would instead create a hierarchical decomposition of the process, which in this case would not ‘scale’ to more digits.

Another issue with using feed-forward neural networks to simulate “human-level intelligence” is thinking. Thinking is an optimization process; an example is designing the layout of a web page. GPT3 could attempt this task, but it is limited in computational power. In feed-forward neural networks, the number of layers and neurons is directly correlated with computational power. If we think of a group of layers as making up one ‘optimization unit’, we quickly see that the neural network can only perform so many optimization steps before presenting its output. However, when discussing these operation blocks, it is important to note that such structures do not usually occur in neural networks (with few exceptions due to architecture).
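The addition example above can be made concrete. The sketch below (a ripple-carry adder in plain Python; all names are mine, for illustration) defines a single-digit adder unit once and reuses it in a loop, one application per digit. A fixed-depth feed-forward network has no loop: it would need to learn a separate copy of this unit for each digit position, and a net trained on three-digit sums has no extra depth to spare for a fourth digit.

```python
def add_digits(a, b, carry=0):
    """Single-digit (base-10) adder unit: returns (digit, carry_out)."""
    s = a + b + carry
    return s % 10, s // 10

def add_numbers(x, y):
    """Reuse the same adder unit across all digit positions (ripple carry)."""
    # Least-significant digit first, so the carry propagates naturally.
    xs = [int(d) for d in str(x)][::-1]
    ys = [int(d) for d in str(y)][::-1]
    result, carry = [], 0
    for i in range(max(len(xs), len(ys))):
        a = xs[i] if i < len(xs) else 0
        b = ys[i] if i < len(ys) else 0
        digit, carry = add_digits(a, b, carry)
        result.append(digit)
    if carry:
        result.append(carry)
    return int("".join(map(str, result[::-1])))

print(add_numbers(479, 842))  # → 1321, using one adder unit three times
```

The loop is the part a feed-forward architecture cannot express: `add_numbers` scales to any number of digits precisely because the same learned unit is applied repeatedly, which is the kind of cyclic computation the article argues back propagation does not discover.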
This is because of the way back propagation updates neural networks. Back propagation, as seen in the image example, ‘distills’ the information hierarchically. Another issue with transformers is their lack of memory. GPT2’s maximum input length is 1024 tokens (byte pairs); anything longer will not be incorporated by the model (the model uses sliding windows, but it still has the 1024-token limit). Humans deal with this differently: we do not remember the exact words said, but maintain an internal ‘context variable’ that is updated as new information is processed, retaining all the information necessary to understand the conversation and help generate an answer. The gated recurrent unit (GRU) has been the best implementation of persistent memory in neural networks, but GRU networks have fallen out of favor in NLP applications since the transformer’s invention.
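The ‘context variable’ idea is exactly what a GRU implements. The sketch below runs a single GRU cell (standard gate equations, randomly initialized and untrained, so it only illustrates the information flow) over a 1000-token sequence: the context stays a fixed-size vector no matter how long the input gets, instead of hitting a hard window limit.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, embed = 16, 8

# Untrained GRU parameters; this demonstrates the update mechanics only.
Wz, Uz = rng.normal(size=(hidden, embed)), rng.normal(size=(hidden, hidden))
Wr, Ur = rng.normal(size=(hidden, embed)), rng.normal(size=(hidden, hidden))
Wh, Uh = rng.normal(size=(hidden, embed)), rng.normal(size=(hidden, hidden))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x):
    """Update the persistent context h with one new token embedding x."""
    z = sigmoid(Wz @ x + Uz @ h)            # update gate: how much to rewrite
    r = sigmoid(Wr @ x + Ur @ h)            # reset gate: how much history to use
    h_cand = np.tanh(Wh @ x + Uh @ (r * h)) # candidate new context
    return (1 - z) * h + z * h_cand

# Process a 1000-token "conversation" of random stand-in embeddings.
h = np.zeros(hidden)
for token_embedding in rng.normal(size=(1000, embed)):
    h = gru_step(h, token_embedding)
print(h.shape)  # the context remains a fixed-size vector
```

The contrast with GPT2’s sliding window is the point: the GRU never stores the tokens themselves, only a compressed summary it keeps rewriting, which is closer to how the article describes human memory.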
The next step for AI is to create non-structured neural networks, in which information does not flow linearly. This would solve the efficiency problems described above. Admittedly, these neural networks would have limited applications, due to the complexity of training them and the number of problems that feed-forward neural networks already solve. However, as with any technology, there exist problems that could only be solved well with such neural networks.
After reexamining the GPT3 paper, I believe the claims I have made hold up. A good visualization by Jennifer Yen:
I have no contention with GPT3 itself, only with the idea that it is the epitome of NLP systems; it has real limitations, such as being unable to learn ‘algorithmic/cyclic processes’.