Image generation for infinite games
[Paper: Unbounded: A Generative Infinite Game of Character Life Simulation]
Summary by Adrian Wilkins-Caruana
Video games offer players a sense of interactivity and open-endedness that things like films or books often can’t. But even video games don’t offer an infinite degree of interactivity and flexibility. For instance, they might bound the area that a character can explore or limit the actions they can take. The obvious reason for this is that game designers can’t create an infinite amount of content. But what if AI could create the content instead? This is where a new game called Unbounded comes in: Its content is generated by vision and language models in real time to create an endless gaming experience.
Unbounded’s concept is quite simple: It’s a story that unfolds as the player interacts with the game. For example, the image below shows Archibus the Wizard teaching his students, but then he gets hungry, so the player instructs him to eat some pears. The curious thing about this game is that the player can instruct Archibus to eat anything, because Unbounded generates its content on the fly based on the player’s instructions, and the game’s narrative progresses with each turn. The researchers call it a generative infinite game.
Several recent advances in generative AI have helped to make Unbounded possible, the first being latent consistency models (LCMs). LCMs are latent diffusion models that can generate high-resolution images in just a few diffusion steps, far fewer than the tens or hundreds that diffusion models typically need, allowing near real-time image generation at around 1 second per image. Another pair of technologies enables consistent generation of a character from scene to scene: Dreambooth and low-rank adapters (LoRA). Given just a few generic images of a subject (say, the wizard Archibus), Dreambooth fine-tunes a text-to-image model so that it can generate new images of that subject, like Archibus in a classroom studying arcane magic. The researchers fine-tuned the LCM + Dreambooth system for Unbounded’s scene-generation task using LoRA, an efficient fine-tuning method that trains small, low-rank weight matrices which are added to the original model’s frozen weights.
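To make the low-rank idea concrete, here’s a minimal sketch of a LoRA-style linear layer in PyTorch. This isn’t the paper’s code; the class name, rank, and scaling are illustrative, but it shows the core trick: the base weights stay frozen, and only two small matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (W + B @ A)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weights stay frozen
        # Only these low-rank factors are trained during fine-tuning.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output of the frozen layer plus the scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```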
In addition to generating consistent characters, Unbounded also needs to generate consistent environments. To do that, the researchers developed a new method called regional image prompt adapters, an extension of an existing technique called image prompt (IP) adapters. A regular IP adapter is a really effective way to generate a new image from an image prompt plus a text prompt. But Unbounded needs to generate scenes using two image prompts: one for the character and another for the environment. So, the regional IP adapter guides the diffusion model to render each prompt in a different part of the image using a dynamic mask (which I’ll explain shortly). The figure below shows how these components work together to generate the final scene.
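Here’s roughly how I picture the regional combination working inside a cross-attention layer. This is a sketch based on my reading, not the paper’s implementation: the function name, tensor shapes, and the additive combination with the text cross-attention output are my assumptions.

```python
import torch

def regional_ip_attention(
    text_out: torch.Tensor,   # (batch, num_patches, dim) attention output for the text prompt
    char_out: torch.Tensor,   # (batch, num_patches, dim) attention output for the character image prompt
    env_out: torch.Tensor,    # (batch, num_patches, dim) attention output for the environment image prompt
    char_mask: torch.Tensor,  # (batch, num_patches, 1) binary mask, 1 = character region
    ip_scale: float = 1.0,
) -> torch.Tensor:
    """Blend two image-prompt attention outputs by region (illustrative sketch)."""
    # Character prompt drives the masked region; the environment prompt drives the rest.
    image_out = char_mask * char_out + (1.0 - char_mask) * env_out
    return text_out + ip_scale * image_out
```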
The clever part about the dynamic mask is that the scene is partitioned into character regions and environment regions automatically as the model generates the scene. The researchers noticed that some parts of the image have high cross-attention to the character’s image prompt, while other parts have lower cross-attention. So, using a predefined threshold, they assigned the regions with the highest cross-attention to the character’s image prompt to be where the model generates the character, and the remaining regions to be where it generates the environment.
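Below is a minimal sketch of how such a mask could be computed from the cross-attention maps. The aggregation over tokens, the normalization, and the threshold value are my guesses; the paper’s exact recipe may differ.

```python
import torch

def dynamic_character_mask(char_attn: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Threshold cross-attention to the character prompt into a binary mask.

    char_attn: (batch, num_patches, num_char_tokens) attention weights from
        each spatial location to the character image-prompt tokens.
    Returns a (batch, num_patches, 1) mask: 1 where attention to the character
    prompt is high (character region), 0 elsewhere (environment region).
    """
    # Aggregate attention over the character tokens, then rescale to [0, 1].
    score = char_attn.mean(dim=-1, keepdim=True)
    lo = score.amin(dim=1, keepdim=True)
    hi = score.amax(dim=1, keepdim=True)
    score = (score - lo) / (hi - lo + 1e-8)
    return (score > threshold).float()
```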
The researchers used quantitative experiments to show that their method generates more consistent characters and environments than other scene-generation methods, and that the generated images align with the other details in the prompt. The figure below shows some examples from these experiments. (The “[V]” in the prompts is a special symbol that the model understands to mean the character.)
To turn this into a game, the researchers needed a way to convert the player’s open-ended input into a prompt the system can understand (e.g., replace “Archibus” with “[V]”) and to describe the progression of the game’s narrative (e.g., “Archibus is hungry”). To do this, they compiled a dataset of narrative topics, and then used these topics to simulate user-LLM interactions (with another LLM standing in for the user) that include various environments, character actions, and game mechanics. The researchers then used these simulations to fine-tune an LLM for prompt rewriting and narrative progression.
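Just to illustrate the rewriting step, here’s a tiny stub showing the “[V]” substitution that the fine-tuned LLM performs in a far more flexible way. The function name, signature, and prompt template are hypothetical.

```python
def rewrite_user_action(user_input: str, character_name: str, environment: str) -> str:
    """Turn free-form player input into an image-generation prompt (illustrative stub).

    In Unbounded this rewriting is handled by a fine-tuned LLM; this stub only
    shows the core substitution: the character's name is swapped for the
    special "[V]" token that the image model associates with the character.
    """
    action = user_input.replace(character_name, "[V]")
    return f"{action}, in {environment}"

# rewrite_user_action("Archibus eats some pears", "Archibus", "a wizard classroom")
# -> "[V] eats some pears, in a wizard classroom"
```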
The Unbounded game and generative infinite games are really cool ideas, and I can’t wait to see how game designers will use them. As exciting as the idea is, though, I wonder how interesting Unbounded is to play. Though its image-generation component is quite capable, the narration component was trained on a dataset of two LLMs interacting with each other, so I’d be surprised if its narratives are enthralling. But Unbounded’s LLM could conceivably be swapped out for one that’s better at coming up with good stories. I anticipate that future iterations of Unbounded will make progress in this area, and I think that would make for some very interesting games.