LLMs have original, research-worthy ideas
[Paper: Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers]
Summary by Adrian Wilkins-Caruana
In the early days of my PhD — which was before ChatGPT — I spent a lot of time identifying the open problems in my field and assessing the feasibility of tackling them. While generating new research ideas is a highly speculative exercise, it’s arguably one of the most important skills of a scientist — which is why it was such an important part of my training. But even the most skilled researchers can spend years pursuing an idea that ultimately doesn’t work out or leads to a dead end. That’s why some researchers from Stanford have studied whether AI models like ChatGPT can automate the idea-generation process.
Their study involved comparing human- and LLM-generated research ideas. They broke the task of research ideation and evaluation into three steps:
Generating the idea.
Writing up the idea to explain and communicate it.
Expertly evaluating the idea.
I think this three-step process is quite fitting, especially because the second and third steps are akin to grant writing and appraisal, respectively, which are essential for getting new research projects funded.
The research methodology has a lot of moving parts. I won’t describe them all, but here are the highlights:
The LLM that generated the ideas was Claude 3.5 Sonnet, and it used something called retrieval-augmented generation, which means it could access a database of research papers and their references while generating ideas.
The researchers recruited 49 experts to write up ideas and 79 experts to review the write-ups (both the human- and LLM-generated ones).
Claude 3.5 also ranked the write-ups. To do this, it appraised pairs of research ideas and picked the “better” of the two, repeating the process until all the ideas were ranked. In this context, “better” is purely Claude’s opinion, but it’s not complete baloney: The researchers found that Claude (more than other LLMs) is 71.4% accurate at picking the higher-scoring submission in pairs of submissions to a major machine learning conference (ICLR).
The expert reviewers scored each write-up along five axes: novelty, excitement, feasibility, expected effectiveness, and a final overall score. Across 119 reviews of human-generated ideas and 109 reviews of AI-generated ones, the researchers found — with statistical significance — that the AI-generated ideas were rated as both more novel and more exciting than the human ones. But the results didn’t reveal any statistically significant difference in feasibility, effectiveness, or overall score. The figure below compares the scores of the human (yellow) and AI (light-blue) ideas. The AI+Rerank (dark blue) bars indicate scores where one of the authors further refined Claude’s rankings.
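To make the retrieval-augmented generation step above concrete, here’s a minimal sketch of the idea: retrieve the abstracts most relevant to a topic, then fold them into a generation prompt. Everything here (the toy corpus, the bag-of-words similarity, the prompt template) is a stand-in for the paper’s actual pipeline, which retrieves real papers and prompts Claude.

```python
# Minimal sketch of retrieval-augmented idea generation. The corpus, the
# similarity measure, and the prompt template are hypothetical stand-ins
# for the paper's pipeline, which retrieves real papers and prompts Claude.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return the titles of the k abstracts most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda title: cosine(q, Counter(corpus[title].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(topic: str, corpus: dict[str, str]) -> str:
    """Assemble an idea-generation prompt grounded in retrieved abstracts."""
    context = "\n".join(f"- {t}: {corpus[t]}" for t in retrieve(topic, corpus))
    return f"Related work:\n{context}\n\nPropose a novel research idea about {topic}."

corpus = {  # toy stand-in for a database of paper abstracts
    "Calibration of LMs": "confidence calibration for language model answers",
    "Low-resource MT": "machine translation for low resource languages with prompting",
    "Graph pooling": "hierarchical pooling operators for graph neural networks",
}
prompt = build_prompt("calibration of language model confidence", corpus)
```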
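The pairwise ranking procedure can also be sketched in a few lines. The judge function below is a hypothetical stand-in for asking Claude which of two ideas is better; here it simply compares hidden quality scores, and the tournament structure is a simplified Swiss-style pairing, not the paper’s exact protocol.

```python
# Sketch of ranking ideas via repeated pairwise judgments. judge() is a
# hypothetical stand-in for an LLM's "which idea is better?" call; the
# Swiss-style pairing below is a simplification, not the paper's protocol.
import random

def judge(a: str, b: str, quality: dict[str, float]) -> str:
    """Stand-in for an LLM pairwise judgment: return the 'better' idea."""
    return a if quality[a] >= quality[b] else b

def swiss_rank(ideas, quality, rounds=5, seed=0):
    """Rank ideas by points earned over several rounds of pairwise matchups.
    Each round pairs ideas with similar running scores; the judged winner
    of each pair gains a point."""
    rng = random.Random(seed)
    points = {i: 0 for i in ideas}
    for _ in range(rounds):
        # Pair adjacent ideas in the current standings (shuffle breaks ties).
        order = sorted(ideas, key=lambda i: (-points[i], rng.random()))
        for a, b in zip(order[::2], order[1::2]):
            points[judge(a, b, quality)] += 1
    return sorted(ideas, key=lambda i: -points[i])

quality = {"idea A": 0.9, "idea B": 0.4, "idea C": 0.7, "idea D": 0.1}
ranking = swiss_rank(list(quality), quality)
```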
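As for the “statistically significant” part: the comparison boils down to a two-sample test on review scores. Here’s a sketch using Welch’s t-statistic on toy numbers (not the paper’s data, and omitting the corrections for multiple comparisons that a real analysis needs).

```python
# Sketch of the kind of two-sample comparison behind a "statistically
# significant" score difference. The score lists are toy numbers, NOT the
# paper's data, and a real analysis also corrects for multiple comparisons.
import math
from statistics import mean, variance

def welch_t(x: list[float], y: list[float]) -> float:
    """Welch's t-statistic for two samples with unequal variances."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

ai_novelty = [6, 7, 5, 6, 7, 6, 5, 7]       # hypothetical 1-10 review scores
human_novelty = [5, 4, 6, 5, 4, 5, 6, 4]
t = welch_t(ai_novelty, human_novelty)      # positive: AI sample's mean is higher
```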
You might be wondering what some of these ideas look like, so I’ve included two examples and their problem statements (the first section of the write-up) at the end of this summary.
After studying these results, the researchers noticed some limitations of current LLMs that might prevent them from being useful scientific agents. First, they found that the LLMs didn’t generate diverse ideas — across 4,000 generated ideas, only about 200 were distinct (5%). The second issue is related to judging research ideas. Human reviewers of AI research papers typically agree with each other about 70% of the time, but the best LLM reviewer (Claude) only agreed with itself 54% of the time (meaning if you asked it to rank the same pair multiple times, it would often change its answer); this is only slightly better than random guessing (50%).
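The diversity finding depends on how you decide that two generated ideas are duplicates. Here’s a sketch of similarity-threshold deduplication; the Jaccard word-overlap measure and the 0.5 threshold are crude, dependency-free stand-ins for the embedding-based similarity a real pipeline would use.

```python
# Sketch of similarity-threshold deduplication, the kind of step behind
# the "only ~5% of 4,000 ideas were distinct" finding. Jaccard word
# overlap and the 0.5 threshold are crude stand-ins for the
# embedding-based similarity a real pipeline would use.
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two idea strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def dedup(ideas: list[str], threshold: float = 0.5) -> list[str]:
    """Keep an idea only if it is not too similar to any already-kept idea."""
    kept: list[str] = []
    for idea in ideas:
        if all(jaccard(idea, k) < threshold for k in kept):
            kept.append(idea)
    return kept

ideas = [
    "calibrate confidence of long form answers",
    "calibrate the confidence of long form answers",   # near-duplicate
    "prompting for low resource machine translation",
]
distinct = dedup(ideas)
```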
With both of these limitations, it’s probably best not to put too much weight on the paper’s main conclusion (that AI generates more novel and exciting research ideas than people do). But I think that’s missing the point. What’s clear to me from this study is that AI can absolutely generate research ideas! Sure, there’s some uncertainty about whether these ideas are slightly more novel or slightly less feasible than human ideas, but the ideas aren’t nonsense, which I think is incredible. I for one would have been so grateful to have an AI companion — even a flawed one — when I started my research.
As promised, here are a couple of examples of AI-generated ideas for NLP research:
Modular Calibration for Long-form Answers: Calibrating the confidence of Large Language Models (LLMs) when generating long-form answers, such as essays and code, remains an open challenge in the field of natural language processing.
Translation with LLMs through Prompting with Long-Form Context: Stable generation of text in low-resource languages is an unsolved issue in large language models.