OpenAI showcases DALL-E 2, a powerful A.I. for creating photorealistic scenes from text descriptions

The creation and editing of photorealistic digital images is about to get much easier.

OpenAI, the San Francisco artificial intelligence company that is closely affiliated with Microsoft, just announced it has created an A.I. system that can take a description of an object or scene and automatically generate a highly realistic image depicting it. The system also allows a person to easily edit the image with simple tools and text modifications, rather than requiring traditional Photoshop or digital art skills.

“We hope tools like this democratize the ability for people to create whatever they want,” Alex Nichol, one of the OpenAI researchers who worked on the project, said. He said the tool could be useful for product designers, magazine cover designers, and artists—either to use for inspiration and brainstorming, or to actually create finished works. He also said computer game companies might want to use it to generate scenes and characters—although the software currently generates still images, not animation or videos.

Because the software could also be used to more easily generate racist memes, fake images for propaganda or disinformation, or, for that matter, pornography, OpenAI says it has taken steps to limit the software’s capabilities in these areas. It first tried to remove such images from the A.I.’s training data, and it also applies rule-based filters and human content reviews to the images the A.I. generates.

OpenAI is also trying to carefully control the release of the new A.I., which it describes as currently just a research project and not a commercial product. It is sharing the software only with what it describes as a select and screened group of beta testers. But in the past, OpenAI’s breakthroughs based on natural-language processing have often found their way into commercial products within about 18 months.

The software OpenAI has created is called DALL-E 2, and it is an updated version of a system that OpenAI debuted in early 2021, simply called DALL-E. (The name is meant to evoke a mashup of WALL-E, the animated robot of Pixar movie fame, and Dali, as in Salvador, the surrealist artist, which makes sense given the surreal nature of the images the system can generate.)

The original DALL-E could render images only in a cartoonish manner, often against a plain background. The new DALL-E 2 can generate photo-quality high-resolution images, complete with complex backgrounds, depth-of-field effects, realistic shadows, shading, and reflections.

While such realistic renderings were previously possible with computer graphics, creating them required serious artistic skill. Here, all a user has to do is type a description, such as “a shiba inu wearing a beret and a black turtleneck,” and DALL-E 2 spits out dozens of photorealistic variations on that theme.

This image of a Shiba Inu dog in a black turtleneck and beret was created by OpenAI’s DALL-E 2 image generation software.

DALL-E 2 also makes editing an image easy. A user can simply place a box around the part of the image they want to modify and specify the modification they want to make in natural-language instructions. You could, for instance, put a box around the Shiba Inu’s beret and type “make the beret red,” and the beret would be transformed without altering the rest of the image. In addition, DALL-E 2 can produce the same image in a wide range of styles, which the user can also specify in plain text.

The captioning and image classification algorithms that underpin DALL-E 2 are, according to tests OpenAI performed, less susceptible to attempts to trick them by labeling an object with text that differs from what the object actually is. For instance, previous algorithms trained to associate text and images, when shown an apple with a printed label saying “pizza” attached to it, would mistakenly label the image as a pizza. The system that now makes up part of DALL-E 2 does not make the same mistake; it still identifies the image as an apple.

Ilya Sutskever, OpenAI’s cofounder and chief scientist, said that DALL-E 2 was an important step toward OpenAI’s goal of trying to create artificial general intelligence (AGI), a single piece of A.I. software that can achieve human-level or better than human-level performance across a wide range of disparate tasks. AGI would need to possess “multimodal” conceptual understanding—being able to associate a word with an image or set of images and vice versa, Sutskever said. And DALL-E 2 is an attempt to create an A.I. with this sort of understanding, he said.

In the past, OpenAI has tried to pursue AGI through natural-language processing. The company’s one commercial product is a programming interface that lets other businesses access GPT-3, a massive natural-language processing system that can compose long passages of novel text, as well as perform a number of other natural-language tasks, from translation to summarization.

DALL-E 2 is far from perfect, though. The system sometimes cannot render details in complex scenes. It can get some of the lighting and shadow effects slightly wrong or merge the borders of two objects that should be distinct. It is also less adept than some other multimodal A.I. software at understanding “binding attributes.” Give it the instruction, “a red cube on top of a blue cube,” and it will sometimes offer variations in which the red cube appears below the blue cube.