EleutherAI researchers have open-sourced GPT-NeoX-20B, a 20-billion-parameter natural language processing (NLP) AI model similar to GPT-3. The model was trained on 825 GB of publicly available text data, and its performance is comparable to that of similarly sized GPT-3 models.
The release was announced on the EleutherAI blog. GPT-NeoX-20B was trained on EleutherAI’s open-source Pile dataset using NVIDIA A100-SXM4-40GB GPUs. When evaluated on several common NLP benchmark tasks, GPT-NeoX-20B achieved accuracy close to a linear interpolation between OpenAI’s Curie and DaVinci models, while its performance on the MATH test dataset exceeded that of GPT-3 175B. EleutherAI claims that GPT-NeoX-20B is the largest open-source pre-trained autoregressive language model available, and adds:
We hope that the increased accessibility of models of this size will facilitate research into the safe use of AI systems and encourage anyone interested in working in this direction to contact us.
OpenAI first published a paper on Generative Pre-Trained Transformers (GPT) in 2018 and released its 1.5B-parameter GPT-2 model in 2019. In 2020, OpenAI announced a 175B-parameter model, GPT-3, but did not release the trained model files. Instead, OpenAI provided an API that allows developers to integrate the model into their code through web service calls. Since then, several models larger than GPT-2 have been open-sourced, including Megatron-11B, Pangu-α-13B, Meta’s Fairseq 13B, and EleutherAI’s previous models GPT-Neo and GPT-J-6B, which InfoQ covered last year.
In addition to these open-source models, there are even larger models, such as GPT-3, with hundreds of billions or even trillions of parameters. However, according to EleutherAI, these are “almost universally” either gated behind an API or not publicly available at all. Part of EleutherAI’s motivation for publishing its models is its belief that open access to such models is necessary to advance research in the field, since it is their large scale that makes them interesting.
The architecture of GPT-NeoX-20B is similar to that of GPT-3, with a few key differences. First, GPT-NeoX-20B uses rotary positional embeddings instead of learned embeddings to encode token positions. Second, GPT-NeoX-20B computes the attention and feed-forward layers in parallel rather than serially, resulting in a 15% increase in throughput. Finally, where GPT-3 alternates sparse and dense layers, GPT-NeoX-20B uses only dense layers.
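The first two differences can be illustrated with a minimal sketch in plain Python. It is not taken from the GPT-NeoX codebase: the function names, the interleaved pairing of dimensions, the `base` value of 10000, and the toy `attn`, `ff`, and layer-norm callables are all assumptions made for illustration. The rotary function rotates each pair of vector components by an angle proportional to the token position, so that the dot product of two rotated vectors depends only on the distance between their positions. The two block functions contrast the serial residual layout with the parallel one.

```python
import math

def rotary_embed(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Illustrative only: real implementations work on batched tensors and
    may pair dimensions differently.
    """
    assert len(vec) % 2 == 0, "vector length must be even"
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2-D rotation
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def serial_block(x, attn, ff, ln1, ln2):
    # Serial layout: the feed-forward layer sees the attention output.
    h = add(x, attn(ln1(x)))
    return add(h, ff(ln2(h)))

def parallel_block(x, attn, ff, ln1, ln2):
    # Parallel layout: both sub-layers read the same input, so they can be
    # computed concurrently and summed into the residual stream.
    return add(x, attn(ln1(x)), ff(ln2(x)))

# Toy stand-ins (hypothetical) for the real sub-layers.
identity = lambda v: list(v)
toy_attn = lambda v: [0.5 * a for a in v]
toy_ff = lambda v: [2.0 * a for a in v]
```

With these toy sub-layers the two layouts give different outputs for the same input, which shows that the parallel form is a change to the architecture and not just a reordering of the same computation.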
GPT-NeoX-20B was trained using EleutherAI’s custom codebase (also called GPT-NeoX), which is based on Megatron and DeepSpeed and is implemented in PyTorch. Since the model is too large to fit on a single GPU, the team used model parallelism as well as data parallelism during training. Additionally, since the team’s compute budget made a hyperparameter search “intractable”, they opted to reuse the hyperparameters published in the GPT-3 paper.
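A rough back-of-the-envelope calculation shows why a single 40 GB GPU is not enough. The figures below are assumptions rather than numbers reported by EleutherAI: a round 20 billion parameters, 2 bytes per parameter for half-precision weights, and about 16 bytes per parameter for mixed-precision training with the Adam optimizer (half-precision weights and gradients plus single-precision master weights and two optimizer moments). Activations and other buffers are ignored.

```python
import math

PARAMS = 20e9                # approximate parameter count (assumption)
GPU_MEMORY_GB = 40           # NVIDIA A100-SXM4-40GB
BYTES_FP16_WEIGHTS = 2       # half-precision weights only
BYTES_MIXED_ADAM = 16        # weights + grads + fp32 copy + Adam moments (assumption)

def memory_gb(params, bytes_per_param):
    """Memory needed to hold per-parameter state, in decimal gigabytes."""
    return params * bytes_per_param / 1e9

weights_only = memory_gb(PARAMS, BYTES_FP16_WEIGHTS)
training_state = memory_gb(PARAMS, BYTES_MIXED_ADAM)
min_gpus = math.ceil(training_state / GPU_MEMORY_GB)

print(f"fp16 weights alone: {weights_only:.0f} GB")
print(f"training state:     {training_state:.0f} GB")
print(f"GPUs needed just to hold that state: {min_gpus}")
```

Under these assumptions the half-precision weights alone fill an entire 40 GB card, and the training state has to be split across several GPUs before any activations are stored, which is why model parallelism is needed in addition to data parallelism.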
The researchers evaluated GPT-NeoX-20B on a “diverse collection” of NLP benchmarks, including LAMBADA and WinoGrande, as well as the HendrycksTest knowledge benchmark and the MATH dataset. They compared its performance to that of their previous model, GPT-J-6B, as well as Meta’s Fairseq 13B and several sizes of GPT-3. According to the team, the performance of GPT-NeoX-20B on NLP tasks “could be improved”, but its performance on scientific and mathematical tasks is “excellent”.
EleutherAI researcher Connor Leahy answered several questions about the model on Twitter. Asked about the impact of trying different random seeds, Leahy replied:
We only had enough compute for a single run of 20B, so we didn’t compare random seeds. However, we didn’t see any noticeable seed-based fluctuations in the smaller models. [Large language models] tend to converge to a similar loss, they are not as unstable as [reinforcement learning].
The GPT-NeoX-20B code and pre-trained model weights are available on GitHub.