The path to AI as web developers
[Paper: Design2Code: How Far Are We From Automating Front-End Engineering?]
Summary by Adrian Wilkins-Caruana
Have you ever right-clicked a webpage and selected “View Page Source”? If so, then you’ve glimpsed the world of frontend web development — the source code that tells your browser how the things on the webpage should look. Unless you’re a frontend developer, you’d probably have a hard time seeing a page design and then mapping it to its source code. But could an AI do that? Today’s paper explores whether multimodal AIs like GPT-4V or Gemini Pro Vision can generate a webpage’s source code from an image of a page design. If it is possible, then this could become a component of webpage-building AI tools.
Design2Code is a framework based on the above premise — it auto-generates the code for a webpage based on an image of what that page should look like. The framework includes a dataset of webpage screenshots and their corresponding source code. The dataset contains a diverse range of webpages, including blogs, company/organization webpages, product pages, and news pages. Unlike other datasets of this kind, which are typically generated synthetically, Design2Code’s dataset is sourced from real-world webpages. Here are a few examples from it:
The Design2Code dataset contains only 484 examples because it’s not meant for training models, but for evaluating how well a model can generate webpages. The Design2Code framework scores a webpage-generating AI along five axes. To compute these scores, the framework divides both the input image and the generated webpage into blocks: rectangular sections of the image, each associated with a corresponding region of the generated page. The blocks come in pairs (one from the image, one from the generated webpage), and a good generation is one where the blocks in each pair are similar to each other. Within that context, the authors measured these forms of similarity between the image and the generated page:
Color: The perceptual difference between colors in the reference image and generated webpages.
Position: How closely the coordinates of blocks on the reference image and generated pages match.
Block-match: Overall, how closely the set of blocks on the generated page matches the set of blocks in the image. (This criterion penalizes dropped or hallucinated blocks.)
Text: How similar the text is between matched blocks from each webpage.
CLIP: How similar the generated webpage is to its reference image overall, computed by comparing image embeddings (semantic vectorizations) of a screenshot of the page and of the reference image.
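To make the block-level comparisons concrete, here is a minimal sketch of what two of these checks could look like. The block fields, coordinate normalization, and scoring formulas below are illustrative assumptions, not the paper’s exact definitions:

```python
import difflib

# Hypothetical sketch of two block-level similarity checks.
# The formulas and normalization are assumptions for illustration,
# not the paper's exact metric definitions.

def text_similarity(ref_text: str, gen_text: str) -> float:
    """Character-level similarity between the text of two matched blocks."""
    return difflib.SequenceMatcher(None, ref_text, gen_text).ratio()

def position_similarity(ref_xy, gen_xy, page_w, page_h) -> float:
    """One minus the average normalized offset between block coordinates."""
    dx = abs(ref_xy[0] - gen_xy[0]) / page_w
    dy = abs(ref_xy[1] - gen_xy[1]) / page_h
    return 1.0 - (dx + dy) / 2.0

print(text_similarity("About us", "About"))   # high, but not a perfect match
print(position_similarity((100, 40), (110, 40), 1280, 720))  # near 1.0
```

A perfect reproduction would score 1.0 on both checks; partial matches, like the truncated heading above, score somewhere in between.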
The first four of these axes are computed on a block-by-block basis. To do this, the authors needed a way to determine which blocks in the two pages correspond to each other, even when the blocks aren’t exactly the same. For example, an “About us” block in the reference page might be matched with the “About” block in the generated page. The authors find these correspondences with a standard assignment algorithm, the Jonker-Volgenant algorithm.
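The Jonker-Volgenant algorithm solves this assignment problem efficiently; for a handful of blocks, a brute-force sketch conveys what “best matching” means. The block names and cost values below are made up for the example:

```python
from itertools import permutations

# Illustrative stand-in for the Jonker-Volgenant assignment step.
# JV solves the same minimum-cost assignment problem much faster;
# brute force over permutations is only practical for tiny inputs.

def best_matching(cost):
    """Return the reference->generated block pairing with minimal total cost."""
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total

# cost[i][j]: dissimilarity between reference block i and generated block j
cost = [
    [0.1, 0.9, 0.8],   # "About us"    vs ("About", "News", "Contact")
    [0.9, 0.2, 0.7],   # "Latest news"
    [0.8, 0.7, 0.1],   # "Contact us"
]
matching, total = best_matching(cost)
print(matching, total)  # the diagonal pairing wins here
```

Once each reference block is paired with its best counterpart, the color, position, block-match, and text scores can all be computed over those pairs.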
The figure below shows a radar chart comparing the performance of four different webpage-generating models: GPT-4V (Vision); Gemini Pro Vision; WebSight, a model trained on a synthetic dataset of webpage-code data; and Design2Code, an open-source model the authors fine-tuned for this task.
The authors then used the Design2Code benchmarking data and these metrics to evaluate how well these models generate webpages. GPT-4V consistently scores the best or close to the best across all the metrics. Even when the researchers tried various prompting techniques (e.g., direct, text-augmented, and self-revision prompting), GPT-4V came out on top. This was true even when people evaluated the generated webpages — they preferred GPT-4V over Gemini Pro Vision and the open-source models.
The quantitative aspects of Design2Code are really important for making incremental improvements in automatic webpage generation. Yet, on their own, they left me feeling like these models aren’t quite up to the task yet. For example, even GPT-4V has a block-match score of only 78%! (Intuitively, this means that either the webpages made by GPT-4V had only 78% of the blocks they should have, or the reference image contained only 78% of the blocks in the generated page; the 22% discrepancy is bad either way.) However, this is where the most intriguing aspect of the paper comes in: The researchers had people compare an original webpage to a webpage that was generated by GPT-4V. They then asked these people: Can the AI-generated webpage replace the original webpage? And is the reference webpage or the AI generation better?
Amazingly, 49% of respondents considered the AI webpage to be interchangeable with the original, while 64% said they preferred the AI webpage to the original! I find this fascinating because, even though Design2Code provides really useful metrics for scoring webpage reproduction, none of them capture whether the generated webpage is functionally good enough. This is an aspect that would be great to see in more machine learning papers, especially ones that explore practical applications of AI. It’s a reminder that an AI model doesn’t need to score 100% on the relevant quantitative metrics for it to be useful. Sometimes, the bar can be much lower than that, and other times the metrics might not capture progress on fundamental questions like, Could this AI do a human’s job?
One final thought: As someone who shivers in fear when I hear the words “HTML” or “CSS,” I was really hoping that Design2Code would mean that I’d never have to write a line of webpage code again in my life. Unfortunately, it seems like the reality is that webpage generation remains a challenging task for AIs. But one takeaway from the paper is that webpages become much harder to generate (according to the five axes outlined above) as the total number of tags increases. Unsurprisingly, this means that simpler webpages are easier to reproduce than more complex ones. So, if you have a simple design in mind and you just can’t be bothered to code it up, there’s a decent chance that an AI might be able to do it for you. However, if your design is complicated, then it might be best to leave it to the pros!
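If you’re curious how complex your own page is by this yardstick, tag count is easy to measure with Python’s standard-library HTML parser. The sample HTML below is invented for illustration:

```python
from html.parser import HTMLParser

# Gauge a page's complexity by counting its opening tags -- the
# property the paper correlates with generation difficulty.

class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1

def count_tags(html: str) -> int:
    counter = TagCounter()
    counter.feed(html)
    return counter.count

simple = "<html><body><h1>Hi</h1><p>Welcome!</p></body></html>"
print(count_tags(simple))  # → 4
```

A page with only a few dozen tags is the kind of input the benchmarked models handle best; pages with hundreds of tags are where their scores fall off.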