How Would You Apply GPT to Images?

Angelina Yang
2 min read · Aug 1, 2022

Generative Pre-trained Transformer (GPT) models, including GPT-2 and GPT-3, are autoregressive language models that use deep learning to produce human-like text. There are plenty of in-depth explanations elsewhere, so here I’d like to share tips on what you can say in an interview setting.

1. How would you apply GPT to images?

Here are some example answers for readers’ reference:

Actually, an ImageGPT model was proposed in “Generative Pretraining from Pixels” by Mark Chen and other researchers from OpenAI (2020). ImageGPT (iGPT) is a GPT-2-like model trained to predict the next pixel value, which allows for both unconditional and conditional image generation.

Mark Chen et al. [original paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf).
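To make next-pixel prediction concrete, here is a minimal sketch in plain Python. It is not the iGPT implementation: the function names, the 2×2 toy image, the four-value pixel vocabulary, and the uniform placeholder model are all assumptions made for illustration, and a real system would put a Transformer decoder where `uniform_model` sits. For simplicity the sketch also skips predicting the very first pixel.

```python
import math


def flatten_raster(image):
    """Flatten a 2-D grid of pixel values into a 1-D sequence, row by row."""
    return [pixel for row in image for pixel in row]


def next_pixel_pairs(sequence):
    """Build (context, target) pairs: predict pixel t from pixels 0..t-1."""
    return [(sequence[:t], sequence[t]) for t in range(1, len(sequence))]


def uniform_model(context, vocab_size):
    """Hypothetical stand-in for a Transformer: ignores the context and
    returns a uniform distribution over all possible pixel values."""
    return [1.0 / vocab_size] * vocab_size


def autoregressive_nll(sequence, model, vocab_size):
    """Average negative log-likelihood of each pixel given the pixels before it.
    This is the same objective as next-word prediction, with pixels as tokens."""
    pairs = next_pixel_pairs(sequence)
    total = 0.0
    for context, target in pairs:
        probs = model(context, vocab_size)
        total -= math.log(probs[target])
    return total / len(pairs)


# Toy 2x2 "image" whose pixels take one of 4 discrete values (an assumption).
image = [[0, 1], [2, 3]]
sequence = flatten_raster(image)
pairs = next_pixel_pairs(sequence)
loss = autoregressive_nll(sequence, uniform_model, vocab_size=4)
```

Once the image is flattened, the training loop looks the same as in language modeling: the only changes are what counts as a token and how large the vocabulary is.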

There are a few things you have to do. First, modify the autoregressive next-word prediction objective: you can think of an image as a very strange language in which the words are pixels, so at each position you predict the next pixel instead of the next word. In the language setting, we pre-train on a large unlabeled dataset from the internet and then fine-tune on question answering or other benchmarks. In images, the
