In April 2022, OpenAI announced DALL-E 2, an AI text-to-image model that builds on the company's earlier work on GPT-3, and the results are seriously impressive.
First, a bit of background. GPT-3, introduced by OpenAI in 2020, is a 175-billion-parameter language model that was trained on a huge internet dataset (think Wikipedia, web pages, social media, etc.). The true size of this dataset is difficult to estimate, but some reports suggest that the entire English Wikipedia (roughly 6 million articles) made up just 0.6% of its training data. The result is a language model that can perform natural language processing tasks like answering questions, completing text, reading comprehension, text summarization, and much more.
You may have heard about GPT-3 in 2020, when researchers and journalists alike played with its powerful text generator and found that GPT-3 was able to convincingly generate poems, articles, and even whole novels from just a single text prompt. GPT-3 was pushed further when Sharif Shameem demonstrated that it could generate functional JSX code, having been shown just two JSX code samples to learn from. The key point is that GPT-3 doesn't just output any old text: it is able to recognise the tone, style, and language of a piece of writing and generate convincing new text in that same style.
DALL-E 2 takes this one step further. Leveraging the natural language understanding abilities behind GPT-3, it is capable of generating photorealistic images that have never been seen before from simple text prompts. It was trained using CLIP, which stands for Contrastive Language-Image Pre-training. The model was shown millions of images and their associated captions and learned the relationship between the semantics of natural language and their visual representations. This understanding of how images relate to their captions, and vice versa, allows it to generate new, previously unseen images from descriptions that combine multiple features, objects, or styles, using a process called 'diffusion'. In the same way that GPT-3 can recognise writing styles and contexts, DALL-E 2 can 'understand' both a description of the content (e.g. "a car") and the style in which to generate it (e.g. "designed by Steve Jobs") to produce never-before-seen imagery, like the car below.
Source: Twitter (@AravSrinivas)
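The core idea behind CLIP's contrastive training can be shown in a few lines. The sketch below is a minimal, toy illustration (random vectors stand in for real encoder outputs, and the batch size, embedding size, and temperature value are illustrative assumptions): matching image-caption pairs are pushed together in a shared embedding space, and mismatched pairs are pushed apart, via a symmetric cross-entropy loss over a similarity matrix.

```python
import numpy as np

# Toy stand-ins for encoder outputs: in CLIP, an image encoder and a text
# encoder each map their input into a shared embedding space. Here we use
# random vectors for a batch of 4 image-caption pairs (hypothetical data).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(4, 8))

# L2-normalise so that dot products become cosine similarities
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Pairwise similarity matrix: entry (i, j) scores image i against caption j
logits = image_emb @ text_emb.T / 0.07  # 0.07: an assumed temperature

def cross_entropy(logits, targets):
    # Softmax cross-entropy; the correct "class" for row i is column i,
    # i.e. the caption that actually belongs to image i
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

targets = np.arange(4)  # matching pairs lie on the diagonal
# Symmetric loss: images-to-captions plus captions-to-images
loss = (cross_entropy(logits, targets) + cross_entropy(logits.T, targets)) / 2
print(loss)
```

Minimising this loss is what gives the model its shared notion of "what this caption looks like" and "what this image says" — the understanding DALL-E 2 builds on.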
Or given the text prompt “photograph of the Eiffel Tower built, rebuilt with bamboo sticks” DALL-E 2 is able to combine its understanding of what the Eiffel Tower looks like with its understanding of what bamboo looks like to create photorealistic images of a structure that doesn’t exist.
Source: Twitter (@hardmaru)
It also excels when given difficult prompts like “handwritten notes by Alan Turing about the design of a Brain-like Turing machine” as shown below.
Source: Twitter (@hardmaru)
It is even able to create complex visual effects, such as light distorting through glass. For example, when prompted with “looking through a glass of red wine while driving on the highway” it produces the artistic shot below.
Source: Twitter (@hardmaru)
It can also convincingly edit and retouch photos based on a simple language description, for example, "replace the cat in this photo with a koala bear, and give the koala bear a top hat", and can be used to create variations of images in different styles and orientations. It can also perform 'inpainting', where damaged, deteriorated, or missing parts of an image are regenerated from simple text prompts whilst preserving the context and style of the existing image.
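Mechanically, inpainting comes down to a compositing step: pixels outside a user-supplied mask are kept exactly as they were, and only the masked region is filled with the model's output. The sketch below is a minimal illustration of that step, with a random array standing in for the generated patch (the arrays, shapes, and values are all hypothetical).

```python
import numpy as np

# Original image: known pixels we want to preserve (a flat grey stand-in)
image = np.full((4, 4), 0.5)

# Binary mask marking the region to regenerate (True = fill this in)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

# Stand-in for the model's generated content for the masked region
generated = np.random.default_rng(1).uniform(size=(4, 4))

# Composite: take generated pixels inside the mask, original pixels outside,
# so the surrounding context and style are preserved exactly
result = np.where(mask, generated, image)
```

In a real system the "generated" patch is produced by the diffusion model conditioned on both the text prompt and the unmasked surroundings, which is what keeps the fill coherent with the rest of the photo.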
DALL-E 2 was closely followed by Imagen, a text-to-image AI model built by researchers at Google Brain. Unlike DALL-E 2, which was trained on a dataset of text-image pairs, the researchers at Google used a text encoder to convert input texts into embeddings (vector representations of discrete inputs), which are then fed into successive diffusion models. Again, the results are fairly mind-blowing. The left-hand image below was generated from the text prompt "Teddy bears swimming at the Olympics 400m butterfly event", while the right-hand image was generated from the prompt "An art gallery displaying Monet paintings. The art gallery is flooded. Robots are going around the art gallery using paddle boards".
Source: Imagen Research Google
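The 'diffusion' process that both models rely on can be illustrated with a toy sketch. During training, images are gradually corrupted with Gaussian noise until nothing but noise remains; the model then learns to reverse this, one step at a time, conditioned on the text embedding. The snippet below shows only the forward (noising) process on a 1-D stand-in for an image — the noise schedule and sizes are illustrative assumptions, and the learned denoiser is not shown.

```python
import numpy as np

rng = np.random.default_rng(42)
x0 = np.linspace(-1.0, 1.0, 8)       # stand-in for a clean image
betas = np.linspace(1e-4, 0.2, 50)   # hypothetical per-step noise schedule
alpha_bar = np.cumprod(1.0 - betas)  # cumulative fraction of signal retained

def noised(x0, t):
    # Closed form for the forward process:
    # x_t = sqrt(alpha_bar[t]) * x_0 + sqrt(1 - alpha_bar[t]) * noise
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

early, late = noised(x0, 0), noised(x0, 49)
# At step 0 the sample is almost the original image; by step 49 the signal
# is nearly gone (alpha_bar[-1] is close to zero) and only noise remains.
print(alpha_bar[0], alpha_bar[-1])
```

Generation runs this in reverse: starting from pure noise, the model repeatedly predicts and removes a little noise, steered at each step by the text embedding, until a clean image emerges.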
Arthur C. Clarke once said that "any sufficiently advanced technology is indistinguishable from magic", and DALL-E 2 and Imagen certainly feel like magic. The examples I've highlighted here demonstrate the creativity (depending on how you define it), flexibility, and power of these state-of-the-art image generation models. They can produce anything from photorealistic images of never-before-seen products, to artwork in the style of deceased artists, to futuristic cityscapes, and everything in between. What's more, they can generate these completely new images, in any drawing or photographic style we want, from just a simple text prompt.
It doesn’t take much to imagine a not-too-distant future where these AI generation systems are able to produce convincing audio and video from only simple text descriptions. These outputs could be easily tweaked and redesigned, again, just by describing what we want the output to look like. From UI design to architecture to filmmaking, image generation models can help us visualise designs, logos, buildings, or TV scenes without having to actually design or build them. They might allow us to iterate on these designs far more rapidly than ever before, completely transforming the relationship between humans and computers in the design process.
The implications of sophisticated AI-powered generation technologies are vast, and we have probably only just started to scratch the surface of what this might mean for the many industries underpinned by creativity and design. How will these tools shape the next 10 years of technology and media? Will we see AI-generated artwork, photography, film, and media, or will image generation simply replace the need for Getty Images? Let us know your thoughts!
Title image source: OpenAI