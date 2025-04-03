Today, the journal Nature is publishing our latest research, which introduces the first World and Human Action Model (WHAM). The WHAM, which we’ve named “Muse,” is a generative AI model of a video game that can generate game visuals, controller actions, or both. The paper in Nature offers a detailed look at Muse, which was developed by the Microsoft Research Game Intelligence and Teachable AI Experiences (Tai X) teams in collaboration with Xbox Game Studios’ Ninja Theory. Simultaneously, to help other researchers explore these models and build on our work, we are open sourcing the weights and sample data and making the executable available for the WHAM Demonstrator, a concept prototype that provides a visual interface for interacting with WHAM models and multiple ways of prompting the models. Developers can learn and experiment with the weights, sample data, and WHAM Demonstrator on Azure AI Foundry.

In our research, we focus on exploring the capabilities that models like Muse need to effectively support human creatives. I’m incredibly proud of our teams and the milestone we have achieved, not only by showing the rich structure of the game world that a model like Muse can learn, as you see in the video demo below, but also, and even more importantly, by demonstrating how to develop research insights to support creative uses of generative AI models.

Generated gameplay examples

What motivated this research?

As we release our research insights and model today, I keep thinking back to how this all started. There was a key moment back in December 2022 that I remember clearly. I had recently returned from maternity leave, and while I was away the machine learning world had changed in fundamental ways. ChatGPT had been publicly released, and those who had tried it were in awe of OpenAI’s technical achievements and the model’s capabilities. It was a powerful demonstration of what transformer-based generative models could do when trained on large amounts of (text) data. Coming back from leave at that moment, the key question on my mind was, “What are the implications of this achievement for our team’s work at the intersection of artificial intelligence and video games?”

A new research opportunity enabled by data

In our team, we had access to a very different source of data. For years, we had collaborated with Xbox Game Studios’ Ninja Theory (based in Cambridge, UK, just like our research team) to collect gameplay data from Bleeding Edge, their 2020 Xbox game. Bleeding Edge is a 4-versus-4 game where all games are played online, and matches are recorded if the player agrees to the End User License Agreement (EULA). We worked closely with our colleagues at Ninja Theory and with Microsoft compliance teams to ensure that the data was collected ethically and used responsibly for research purposes.

“It’s been amazing to see the variety of ways Microsoft Research has used the Bleeding Edge environment and data to explore novel techniques in a rapidly moving AI industry,” said Gavin Costello, technical director at Ninja Theory. “From the hackathon that started it all, where we first integrated AI into Bleeding Edge, to building AI agents that could behave more like human players, to the World and Human Action Model being able to dream up entirely new sequences of Bleeding Edge gameplay under human guidance, it’s been eye-opening to see the potential this type of technology has.”

Muse Training Data

Until that point in late 2022, we had used Bleeding Edge as a platform for human-like navigation experiments, but we had not yet made meaningful use of the large amount of human player data we now had available. With the powerful demonstration of what text models could do, the next question was clear: “What could we achieve if we trained a transformer-based model on large amounts of human gameplay data?”

Scaling up model training

As the team got to work, one of the key challenges was scaling up model training. We initially used a V100 cluster, where we worked out how to scale training to as many as 100 GPUs; that eventually paved the way to training at scale on H100s. Key design decisions we made early on focused on how best to leverage insights from the large language model (LLM) community, and included choices such as how to effectively represent controller actions and, especially, images.
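To make the representation question concrete, here is a minimal sketch of one way controller actions could be turned into discrete tokens and interleaved with image tokens so that a transformer sees a single sequence. This is an illustration under stated assumptions, not a description of how Muse does it: the vocabulary size, the number of stick buckets, and the function names `stick_to_bin`, `action_to_tokens`, and `interleave` are all hypothetical, and the image tokens are assumed to come from some separate image encoder that is not shown.

```python
from typing import List, Sequence

# All constants below are hypothetical and chosen only for illustration.
IMAGE_VOCAB_SIZE = 4096  # assumed number of discrete image token ids
NUM_BINS = 11            # assumed number of buckets per stick axis


def stick_to_bin(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a stick axis reading in [-1.0, 1.0] to a bucket index in [0, num_bins - 1]."""
    clamped = max(-1.0, min(1.0, value))
    # Rescale [-1, 1] to [0, num_bins - 1] and round to the nearest bucket.
    return int(round((clamped + 1.0) / 2.0 * (num_bins - 1)))


def action_to_tokens(stick_x: float, stick_y: float) -> List[int]:
    """Encode one stick reading as two token ids that do not collide with image ids."""
    x_token = IMAGE_VOCAB_SIZE + stick_to_bin(stick_x)
    y_token = IMAGE_VOCAB_SIZE + NUM_BINS + stick_to_bin(stick_y)
    return [x_token, y_token]


def interleave(frames: Sequence[Sequence[int]],
               actions: Sequence[Sequence[int]]) -> List[int]:
    """Flatten per-step tokens into one sequence: frame 0, action 0, frame 1, action 1, ..."""
    if len(frames) != len(actions):
        raise ValueError("expected one action per frame")
    sequence: List[int] = []
    for frame_tokens, action_tokens in zip(frames, actions):
        sequence.extend(frame_tokens)
        sequence.extend(action_tokens)
    return sequence
```

Offsetting the action ids past the image vocabulary keeps the two kinds of tokens distinguishable inside one shared vocabulary, which is one simple way to borrow the next-token setup used for text.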

The first sign that the hard work of scaling up training was paying off came in the form of a demo that thoroughly impressed me. Tim Pearce, at that time a researcher in Game Intelligence, had put together examples of what happened early versus later in training. You can see the demo here – it’s like watching the model learn. This led to our follow-up work showing how scaling laws emerge in these kinds of models.

Muse consistency over the course of training

Ground truth: human gameplay (original). Generated: game visuals produced by Muse with 206M parameters, conditioned on 1 second of real gameplay and 9 seconds of actions.

Capability                              10k training updates   100k training updates   1M training updates
Character recognizable                  ✔                      ✔                       ✔
Basic movements and geometry            ✔                      ✔                       ✔
No degeneration over time               ✘                      ✔                       ✔
Correct interaction with power cell     ✘                      ✘                       ✔
Models flying mechanic correctly        ✘                      ✘                       ✔

Multidisciplinary collaboration: Involving users from the beginning

We had started to investigate how to evaluate these types of models early on. For example, we wanted to understand the representations learned using linear probing, which was driven by Research Intern Gunshi Gupta and Senior Research Scientist Sergio Valcarcel Macua; to explore online evaluation, driven by Senior Research Scientist Raluca Georgescu; and to generate both visuals and actions, initially termed “full dreaming” and driven by Research Intern Tarun Gupta. But working through how to systematically evaluate Muse required a much broader set of insights. More importantly, we needed to understand how people might use these models in order to know how to evaluate them.

This was where the opportunity for multidisciplinary research became crucial. We had discussed aspects of this work with Senior Principal Research Manager Cecily Morrison and her Teachable AI Experiences team for several months. And we had already partnered on an engagement with game creatives (driven by Cecily, Design Researcher Linda Wen, and Principal Research Software Development Engineer Martin Grayson) to investigate how game creators would like to use generative AI capabilities in their creative practice.

“It was a great opportunity to join forces at this early stage to shape model capabilities to suit the needs of creatives right from the start, rather than try to retrofit an already developed technology,” Cecily said.

Linda offered some valuable insights about how we approached the work: “We’ve seen how technology-driven AI innovation has disrupted the creative industry—often catching creators off guard and leaving many feeling excluded,” she said. “This is why we invited game creators to help us shape this technology from the start. Recognizing that most AI innovations are developed in the Global North, we also made it a priority to recruit game creators from underrepresented backgrounds and geographies. Our goal was to create a technology that benefits everyone—not just those already in positions of privilege.”

Unlocking new creative use cases with the WHAM Demonstrator

Now, with the model’s emerging capabilities and user insights in mind, it was time to put all the pieces together. The teams joined forces during a Microsoft internal hackathon to explore new interaction paradigms and creative uses that Muse could unlock. As a result, we developed a prototype that we call the WHAM Demonstrator, which allows users to directly interface with the model.

“The Global Hackathon was the perfect opportunity for everyone to come together and build our first working prototype,” Martin said. “We wanted to develop an interface for the WHAM model that would allow us to explore its creative potential and start to test ideas and uses we had learned from our interviews with game developers.”

WHAM Demonstrator

The WHAM Demonstrator provides a visual interface for interacting with an instance of a World and Human Action Model such as Muse.

Identifying key capabilities and how to evaluate them

Hands-on exploration of Muse’s capabilities with the WHAM Demonstrator, combined with the insights we gained from the user study, allowed us to systematically identify the capabilities that game creatives would require to use generative models like Muse. This in turn allowed us to establish evaluation protocols for three key capabilities: consistency, diversity, and persistency. Consistency refers to a model’s ability to generate gameplay sequences that respect the dynamics of the game. For example, the character moves consistently with controller actions, does not walk through walls, and generally reflects the physics of the underlying game. Diversity refers to a model’s ability to generate a range of gameplay variants given the same initial prompt, covering a wide range of ways in which gameplay could evolve. Finally, persistency refers to a model’s ability to incorporate (or “persist”) user modifications into generated gameplay sequences, such as a character that is copy-pasted into a game visual. We give an overview of these capabilities below.
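As a rough illustration of what a diversity measurement could look like, the sketch below scores a set of rollouts generated from the same prompt by their average pairwise difference. It is a deliberately simple stand-in, not the evaluation protocol from our paper: each rollout is assumed to be an equal-length list of scalar action values, and the names `rollout_distance` and `diversity_score` are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def rollout_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length action sequences."""
    if len(a) != len(b):
        raise ValueError("rollouts must have the same length")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def diversity_score(rollouts: Sequence[Sequence[float]]) -> float:
    """Average pairwise distance across rollouts generated from one prompt.

    A score of 0.0 means every rollout is identical; larger values mean the
    generated gameplay variants differ more from one another.
    """
    pairs = list(combinations(rollouts, 2))
    if not pairs:
        return 0.0
    return sum(rollout_distance(a, b) for a, b in pairs) / len(pairs)
```

A score like this only captures how spread out the generations are. It says nothing about whether each variant is plausible, which is why diversity needs to be read alongside consistency rather than on its own.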

Muse evaluation of consistency, diversity and persistency

Consistency

Diversity

With our evaluation framework in place, and access to an H100 compute allocation, the team was able to further improve Muse instances with higher-resolution image encoders (our current models generate visuals at a resolution of 300×180 pixels, up from the 128×128 resolution of our earliest models) and larger models, and to expand to all seven Bleeding Edge maps. To show some of the capabilities of the model we are publishing today, we have included videos of 2-minute-long generated gameplay sequences above, which give an impression of the consistency and diversity of gameplay sequences that the model can generate.

According to Senior Researcher Tabish Rashid: “Being handed an allocation of H100s was initially quite daunting, especially in the early stages figuring out how to make best use of it to scale to larger models with the new image encoders. After months of experimentation, it was immensely rewarding to finally see outputs from the model on a different map (not to knock the lovely greenery of Skygarden) and not have to squint so much at smaller images. I’m sure at this point many of us have watched so many videos from Muse that we’ve forgotten what the real game looks like.”

One of my favorite capabilities of the model is how it can be prompted with modifications of gameplay sequences and persist newly introduced elements. For example, in the demo below, we’ve added a character onto the original visual from the game. Prompting the model with the modified visual, we can see how the model “persists” the added character and generates plausible variants of how the gameplay sequence could have evolved from this modified starting point.

Persistency

Conclusion

Today, our team is excited to be publishing our work in Nature and simultaneously releasing Muse open weights, the WHAM Demonstrator, and sample data to the community.

I look forward to seeing the many ways in which the community will explore these models and build on our research. I cannot wait to see how these models and subsequent research will deepen our understanding of how generative AI models of human gameplay may support gameplay ideation and pave the way for future, novel, AI-based game experiences, including the use cases that our colleagues at Xbox have already started to explore.

