Neural models that build cities
Paper: CityDreamer: Compositional Generative Model of Unbounded 3D Cities
Summary by Adrian Wilkins-Caruana
Picture this: You're soaring above a cityscape, marveling at the skyscrapers and parks below. It looks incredibly real, but guess what? It's not. It's a 3D-generated city, and it's almost indistinguishable from the real thing. Welcome to the world of CityDreamer, a fascinating new tool designed to create lifelike 3D cities that capture the complexity and diversity of actual urban environments. Unlike other 3D scene generators, which can be a bit hit-or-miss when it comes to cities, CityDreamer is finely tuned to make sure that buildings aren't just tall rectangles but have the intricate details you'd expect to see in real life. Here are a couple of examples:
The researchers behind CityDreamer have tackled the uniquely complex challenges of city generation, from the variety of building appearances to people’s sensitivity to distorted urban structures. There are several key technical components that contribute to CityDreamer’s state-of-the-art generation ability. The first is a layout generator that determines the locations and layout of things like roads, buildings, green spaces, construction sites, and bodies of water, as well as the heights of buildings. As shown below, this information is represented in two 2D maps: a height map (left) and a layout map (right), which is a semantic map of the various objects.
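In code, those two maps are just a pair of aligned 2D grids. Here’s a minimal sketch of what they might look like as arrays; the grid size, class list, and units are my own illustrative choices, not the paper’s exact specification:

```python
import numpy as np

# Hypothetical grid resolution; the paper's actual map sizes differ.
H, W = 256, 256

# Height map: one height value per cell (e.g., in meters).
height_map = np.zeros((H, W), dtype=np.float32)

# Semantic layout map: one class label per cell.
CLASSES = {0: "road", 1: "building", 2: "green space",
           3: "construction site", 4: "water"}
layout_map = np.zeros((H, W), dtype=np.int64)

# Example: a 40 x 30-cell building footprint that is 75 m tall.
layout_map[100:140, 50:80] = 1      # mark the cells as "building"
height_map[100:140, 50:80] = 75.0   # give the footprint a height
```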
CityDreamer then uses these two maps to generate an image of the city from any perspective. The authors break this task down into three steps: one model determines what the buildings look like from the camera’s perspective, while another model determines what the background (e.g., the ground) looks like. Finally, an image compositor puts all this info together to render the complete city image.
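Here’s a rough sketch of that three-step flow. The function names and the way building instances are identified are placeholders meant to show the structure, not CityDreamer’s actual API:

```python
import numpy as np

def render_city_view(layout_map, height_map, camera_pose,
                     render_background, render_building, composite):
    """Sketch of the three-step pipeline (all callables are placeholders).
    The background and each building instance are rendered separately from
    the same camera, then merged into one image."""
    background = render_background(layout_map, height_map, camera_pose)
    # Treat every distinct nonzero label as one building instance -- an
    # assumption for this sketch; the paper derives instance IDs explicitly.
    instance_ids = np.unique(layout_map[layout_map > 0])
    buildings = [render_building(i, layout_map, height_map, camera_pose)
                 for i in instance_ids]
    return composite(background, buildings)
```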
Creating the images for both the background and the buildings relies on the “semantic” map (capturing roads vs. rivers vs. parks, etc.) and the height map (how tall each building is) as precursors. Using that information, a neural radiance field (NeRF) model generates images from the chosen camera angle. To put this all together into a final image, the image compositor first renders the background, and then renders all the building instances in layers from the back of the scene to the front, as shown here:
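In code, that back-to-front layering could look something like the painter’s-algorithm sketch below. It assumes each building has already been rendered with a pixel mask and a camera distance; this is a simplification of mine, not the paper’s exact compositor:

```python
import numpy as np

def composite_back_to_front(background_rgb, building_layers):
    """Painter's-algorithm sketch of the compositing step. Start from the
    background image, then paint each rendered building from farthest to
    nearest so closer buildings correctly cover the ones behind them.

    background_rgb:  (H, W, 3) float array.
    building_layers: list of (rgb, mask, distance) tuples, where `rgb` is an
                     (H, W, 3) render of one building, `mask` is a boolean
                     (H, W) array of the pixels it covers, and `distance`
                     is that building's distance from the camera.
    """
    image = background_rgb.copy()
    for rgb, mask, _ in sorted(building_layers, key=lambda layer: -layer[2]):
        image[mask] = rgb[mask]   # nearer buildings overwrite farther ones
    return image
```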
CityDreamer can generate realistic cities and buildings because its models have been trained using vast amounts of data from actual cities. The first dataset was gathered from OpenStreetMap and covers 60 cities and over 6,000 square kilometers. The researchers used this data to train the model that generates the height and semantic layout maps. If you’ve noticed that the composited images kind of look like New York City, that’s because the training data for the NeRF models came from Google Earth Studio, and includes images of New York City from 400 “orbit trajectories,” which you can think of as views of the city from a plane window.
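If the “orbit trajectory” idea is hard to picture, the sketch below generates a ring of camera positions circling a target point, which is roughly the kind of viewpoint sweep being described. The function and the numbers are illustrative, not the dataset’s actual settings:

```python
import numpy as np

def orbit_camera_positions(center_xy, radius_m, altitude_m, num_views=24):
    """Rough sketch of an "orbit trajectory": camera positions spaced evenly
    on a circle around a target point, all at the same altitude and looking
    inward. (The real Google Earth Studio trajectories have their own
    configuration; this only illustrates the idea.)"""
    angles = np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False)
    cx, cy = center_xy
    return np.stack([cx + radius_m * np.cos(angles),
                     cy + radius_m * np.sin(angles),
                     np.full(num_views, float(altitude_m))], axis=1)

# e.g., 24 viewpoints circling a point of interest at a 500 m radius, 300 m up
positions = orbit_camera_positions(center_xy=(0.0, 0.0), radius_m=500.0,
                                   altitude_m=300.0)
```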
CityDreamer is significantly better than other city-generation methods both quantitatively (in terms of numerical performance metrics) and qualitatively (in terms of human perception). See for yourself in the examples below, which compare CityDreamer (bottom row) to two other methods (top and middle rows):
While CityDreamer generates diverse and consistent views of cities, its approach is limited in that it can’t generate buildings that are more complex than polygons extruded vertically from the ground (no Guggenheims). The building instance generation can also be quite slow, since the view of each building needs to be generated individually. At the same time, it’s amazing to create an entire city — one that feels real — at the click of a button.
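To make the “extruded polygon” limitation concrete, here’s a small sketch of that building representation: a 2D footprint plus a single height value, which can only ever produce prism-shaped buildings. The function and its details are my own illustration, not code from the paper:

```python
import numpy as np

def extrude_footprint(footprint_xy, height_m):
    """Illustrates the "extruded polygon" building representation: a 2D
    footprint plus one height defines a prism. Overhangs, setbacks, and
    curved forms like the Guggenheim can't be expressed this way.

    footprint_xy: (N, 2) array-like of the footprint's corner coordinates.
    Returns the prism's 2N vertices: the bottom ring followed by the top ring.
    """
    footprint_xy = np.asarray(footprint_xy, dtype=np.float32)
    zeros = np.zeros((len(footprint_xy), 1), dtype=np.float32)
    bottom = np.hstack([footprint_xy, zeros])   # corners at ground level
    top = bottom.copy()
    top[:, 2] = height_m                        # same corners, lifted up
    return np.vstack([bottom, top])

# A 20 m x 30 m rectangular footprint extruded to 75 m
vertices = extrude_footprint([(0, 0), (20, 0), (20, 30), (0, 30)], height_m=75.0)
```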