Friday, June 2, 2023

The Making of ChatGPT

ChatGPT is a language model that was developed in two phases: pre-training and fine-tuning. 

In the pre-training phase, the model was trained on a large amount of text data using unsupervised learning techniques. This phase helped the model understand the structure of language and the relationships between words and sentences. 

Transformer, deep-learning model designed to process sequential data by capturing the relationships between different elements in the sequence, was utilized in this process. Unlike traditional recurrent neural networks (RNNs) that process input sequentially, transformers operate in parallel, making them more efficient for long-range dependencies. The core component of a transformer is the self-attention mechanism, which allows the model to weigh the importance of different words in the input sequence when generating representations. 

Sentences are transformed into vector embeddings - dense, low-dimensional representations of words that capture their semantic meaning. Each word in the sentence is mapped to its corresponding embedding vector. To apply the self-attention mechanism, these embedding vectors are divided into three parts: queries, keys, and values. For each word in the sequence, the self-attention mechanism computes a weighted sum of the values, where the weights are determined by the compatibility between the query and the keys. This is computed by taking the dot product between their respective vector representations. This dot product is then scaled by a factor of the square root of the dimension of the key vectors. This scaling ensures that the dot products do not grow too large as the dimensionality increases. Next, the scaled dot products are passed through a softmax function to obtain the attention weights. These attention weights indicate the importance or relevance of each word in the sequence to the current word. Words with higher weights are deemed more relevant and will contribute more to the weighted sum. The weighted sum of the values is computed by multiplying each value with its corresponding attention weight and summing them up. This represents the attended representation of the current word, which incorporates information from other words in the sequence based on their relevance.

The self-attention mechanism enables the model to capture contextual information effectively, allowing it to understand the dependencies between words or tokens in the sequence. The self-attention layer is repeated multiple times, which allows the model to learn increasingly complex relationships between the tokens in the input sequence. This is in contrast to RNNs, which can only learn one relationship at a time.

One common technique used in pre-training of GPT was language modeling, where the model is trained to predict the next word in a sentence given the preceding context. This task helps the model learn the statistical properties of language and the relationships between words.

Another technique was masked language modeling. In this task, random words in a sentence are masked, and the model is trained to predict the original words based on the context. This helps the model grasp the contextual dependencies between words and improves its ability to fill in missing information.

ChatGPT is built on a transformer-based neural network with some modifications to improve its performance: layer normalization (moved to the input of each sub-block), pre-activation residual network, and a modified initialization.  Additionally, an extra layer normalization is added after the final self-attention block. The modified initialization takes into account the accumulation on the residual path with model depth. The weights of the residual layers are scaled by a factor of 1/√N, where N represents the number of residual layers. The vocabulary size was significantly expanded (50,257 in GPT-2). The context size, aka the length of the input sequence, was increased (from 512 to 1024 tokens for GPT-2). 

Next innovation was the dataset. The model was trained on a diverse dataset of web pages called WebText, which was collected from various domains and contexts. 

In the fine-tuning phase, the model was trained on specific tasks like text completion, question-answering, and dialogue generation using labeled datasets. The model's parameters were adjusted to minimize the differences between its predicted outputs and the correct answers for those tasks. 

ChatGPT is paving the way for a future where knowledge creation is accelerated. Our paper highlights the remarkable adoption and expansion of ChatGPT across various domains.


Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. Attention is All you Need (

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI Blog. 2019. Available online:

Gabashvili I.S. The impact and applications of ChatGPT: a systematic review of literature reviews. Submitted on May 8, 2023. arXiv:2305.18086 [cs.CY].

No comments:

Post a Comment

From Asimov to AI Predicting Human Lives

For decades, storytellers have envisioned worlds where technology holds the key to predicting the future or shaping human destinies. Isaac A...