GPT now supported with Transformer Models using CUDA 11.8 and cuDNN 8.6!

In our latest release, version 1.11.8.27, we now support GPT and Transformer Models based on the open source minGPT GitHub project by Karpathy [1].

GPT uses transformer blocks to learn the context of the input data via attention layers.  Stacking a set of transformer blocks allows the model to learn context at several different levels of the training input.  For more information on the minGPT project, see the Karpathy GitHub site or check out our previous blog post, minGPT – How It Works.
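
As a rough illustration of what an attention layer computes, here is a minimal NumPy sketch of single-head, scaled dot-product attention with a causal mask (the function and variable names are ours for illustration, not MyCaffe identifiers):

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (T, C) sequence.

    x          : (T, C) token activations for one sequence
    Wq, Wk, Wv : (C, C) learned query/key/value projection matrices
    """
    T, C = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # project to queries, keys, values
    att = (q @ k.T) / np.sqrt(C)                  # scaled dot-product scores, (T, T)
    future = np.triu(np.ones((T, T)), k=1) == 1   # positions after the current token
    att = np.where(future, -np.inf, att)          # causal mask: no peeking ahead
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)        # softmax over each row
    return att @ v                                # context-weighted sum of values, (T, C)
```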

GPT MyCaffe Model

Implementing GPT in MyCaffe required creating several new layers: TransformerBlockLayer, CausalSelfAttentionLayer, LayerNormLayer, SoftmaxCrossEntropy2LossLayer, TokenizedDataLayer, and a new AdamWSolver [2].  Used together, these form the minGPT model shown below.

minGpt MyCaffe Model

As you can see, the minGPT model feeds learned embeddings for both the token and position data into a stack of six TransformerBlockLayers, whose output is normalized and then fed to an InnerProductLayer to create the logits.  The logits are then fed through a SoftmaxLayer to produce the probability distribution for the next token given the input.
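
A rough NumPy sketch of that forward pass, with placeholder weights and a list of block callables standing in for the MyCaffe layer objects, looks like this:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (LayerNormLayer)."""
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def mingpt_forward(tokens, tok_emb, pos_emb, blocks, W_head):
    """Sketch of the forward pass described above.

    tokens  : (T,) integer token ids
    tok_emb : (vocab_size, C) learned token embedding table
    pos_emb : (block_size, C) learned position embedding table
    blocks  : list of callables, one per TransformerBlockLayer
    W_head  : (C, vocab_size) InnerProductLayer weights producing the logits
    """
    T = len(tokens)
    x = tok_emb[tokens] + pos_emb[:T]          # sum of token and position embeddings
    for block in blocks:                       # stack of six TransformerBlockLayers
        x = block(x)
    x = layer_norm(x)                          # final normalization
    logits = x @ W_head                        # InnerProductLayer -> logits
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)   # SoftmaxLayer -> next-token probabilities
```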

Inside each TransformerBlockLayer resides the CausalSelfAttentionLayer, which learns the context at the current transformer block.  The output is normalized and fed through the MLP, which includes the activation layer.  Note that we use a RELU activation layer whereas the minGPT project (and other GPT models) use a GELU layer.  We chose the RELU activation because its curve is close to that of the GELU and the observed results were very similar.
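
Under those assumptions (RELU activation and the pre-normalization residual layout used by minGPT), a single block can be sketched as follows; the attn argument stands in for the causal self-attention step, and the weight names are hypothetical:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def transformer_block(x, attn, W_fc, W_proj):
    """Sketch of one TransformerBlockLayer.

    x      : (T, C) input activations
    attn   : callable implementing causal self-attention (see earlier sketch)
    W_fc   : (C, 4*C) first MLP weight;  W_proj : (4*C, C) second MLP weight
    """
    x = x + attn(layer_norm(x))                    # attention over the normalized input + residual
    x = x + relu(layer_norm(x) @ W_fc) @ W_proj    # normalized output through the RELU MLP + residual
    return x
```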

Transformer Block Internal Layers

At the end of the GPT model, a SoftmaxCrossEntropyLoss layer computes the loss.  One difference you will notice when viewing the MyCaffe GitHub implementation of the new classes is that we not only compute the forward pass but also implement the gradient calculations of the backward pass, for MyCaffe uses direct gradient calculations.  This gives you, the developer, more fine-grained control over and access to the gradients.
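
As a simple illustration of what a direct gradient calculation means for this loss, the NumPy sketch below computes both the forward loss and the gradient with respect to the logits in one place (this is the textbook derivation, not the MyCaffe source):

```python
import numpy as np

def softmax_cross_entropy(logits, targets):
    """Forward: mean cross-entropy loss.  Backward: gradient w.r.t. the logits.

    logits  : (T, vocab_size) raw scores from the InnerProductLayer
    targets : (T,) integer ids of the correct next tokens
    """
    T = logits.shape[0]
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)               # softmax probabilities
    loss = -np.log(p[np.arange(T), targets]).mean()  # forward pass

    grad = p.copy()                                  # backward pass:
    grad[np.arange(T), targets] -= 1.0               # d(loss)/d(logits) = softmax - one_hot
    grad /= T                                        # averaged over the T positions
    return loss, grad
```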

GPT MyCaffe Model Results

When run on a Shakespeare sonnet, the MyCaffe minGPT model produces the following results after 2,200 iterations.

To get even better results, just train for around 10k iterations.

GPT MyCaffe Implementation

To see the implementation of the new classes used in the GPT model, check them out on the MyCaffe GitHub site:

  • TokenizedDataLayer
  • TransformerBlockLayer
  • CausalSelfAttentionLayer
  • LayerNormLayer
  • SoftmaxCrossEntropy2LossLayer
  • AdamWSolver
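
Of these, the AdamWSolver implements the decoupled weight decay update of [2].  A minimal NumPy sketch of a single AdamW parameter update (the function and variable names are ours, not the MyCaffe solver API) is:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update of parameter w; t is the 1-based iteration count."""
    m = beta1 * m + (1 - beta1) * grad             # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps)   # Adam step
                  + weight_decay * w)              # decoupled weight decay term
    return w, m, v
```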

New Features

The following new features have been added to this release.

  • CUDA 11.8.0.522/cuDNN 8.6.0.163/nvapi 510/driver 522.06
  • Windows 11 22H2
  • Windows 10 22H2, OS Build 19045.2251, SDK 10.0.19041.0
  • Added new LayerNormLayer
  • Added new TransformerBlockLayer
  • Added new TokenizedDataLayer
  • Added new SoftmaxCrossEntropy2LossLayer
  • Added new AdamWSolver

Bug Fixes

The following bugs have been fixed in this release.

  • Fixed bug in TransposeLayer backward
  • Fixed bug in SoftmaxCrossEntropyLoss layer when axis > 1
  • Fixed bug where SoftmaxCrossEntropyLoss was not converted when creating a run net

To create and train your own GPT project to learn Shakespeare, check out the new Create and Train a GPT model to learn Shakespeare tutorial.  For other great examples, including beating ATARI Pong, check out our Examples page.

Happy Deep Learning!


[1] Andrej Karpathy, karpathy/minGPT, GitHub, 2022.

[2] Ilya Loshchilov and Frank Hutter, Decoupled Weight Decay Regularization, arXiv:1711.05101, 2017 (revised 2019).