My lab held a hackathon yesterday to play with places where large language models could help us with our research in cognitive science. The mandate was: "How can these models help us do what we do, but better and faster?"
Some impressions:
Whatever their flaws, chat-based LLMs are astonishing. My kids and I used ChatGPT to write birthday poems for their grandma. I would have bet money against this being possible even ten years ago.
But can they be used to improve research in cognitive science and psychology?
1. Using chat-based agents to retrieve factual knowledge is not effective. They are not trained for this and they do it poorly (the "hallucination problem"). Ask ChatGPT for a scientist bio, and the result will be broadly similar to the truth but with random swaps of institutions, dates, and other facts.
2. A new generation of retrieval-based agents are on their way but not here yet. These will have a true memory where they can look up individual articles, events, or entities rather than predicting general gestalts. Bing and Bard might be like this some day, but they aren't now.
3. Chat-based agents can accomplish pretty remarkable text formatting and analysis, which has applications in literature reading and data munging. E.g., they can pull out design characteristics from scientific papers, reformat numbers from tables, etc. Cool opportunities. These functions are critically dependent on long prompt windows. Despite GPT-4's notionally long prompt length, in practice we couldn't get more than 1.5k tokens consistently. That meant that pre-parsing inputs was critical, and this took too much manual work to be very useful.
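A minimal sketch of the pre-parsing step described in point 3: split a paper's text into chunks small enough to fit a conservative token budget before sending each chunk to a chat model for extraction. Here, call_llm is a hypothetical wrapper around whatever chat API you use, and the ~1.5k-token budget and the rough "1 token ≈ 0.75 words" heuristic are assumptions for illustration, not measured values.

def call_llm(prompt: str) -> str:
    """Placeholder for your chat-model API call (wire to your provider)."""
    raise NotImplementedError("connect this to whatever LLM API you use")

def chunk_text(text: str, max_tokens: int = 1500) -> list[str]:
    """Split text into word-based chunks under an approximate token budget."""
    max_words = int(max_tokens * 0.75)  # rough tokens-to-words conversion
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def extract_design_characteristics(paper_text: str) -> list[str]:
    """Ask the model to pull design characteristics out of each chunk."""
    results = []
    for chunk in chunk_text(paper_text):
        prompt = ("List the experimental design characteristics "
                  "(sample size, conditions, measures) in this excerpt:\n\n"
                  + chunk)
        results.append(call_llm(prompt))
    return results

Even with a helper like this, someone has to decide how to chunk, what to ask for, and how to stitch the pieces back together, which is exactly the manual overhead that limited how useful this was for us.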
4. A massive weakness for scientific use is that cutting-edge agents cannot easily be placed in a reproducible scientific pipeline. Pasting text into a chat window is not a viable route for science. You can get API access, but without random seeds this is not enough. (We got a huge object lesson in this reproducibility issue yesterday when OpenAI announced that they are retiring Codex, a model that underlies a large amount of the work on code generation from the past year. This shouldn't happen to our scientific workflows.) Of course we could download Alpaca or some other open model, set it up, and run it as part of a pipeline. But we are cognitive scientists, not LLM engineers. We don't want to do that just to make our data munging slightly easier!
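Here is a minimal sketch of what we would want from a pipeline-friendly model call: pin an exact model snapshot, set temperature to 0, and log every prompt, parameter set, and output. The call_llm wrapper and the snapshot name are placeholder assumptions rather than a specific vendor's API, and even temperature 0 does not guarantee identical outputs once the underlying model changes, which is exactly the problem the Codex retirement illustrates.

import json
import datetime

PIPELINE_CONFIG = {
    "model": "gpt-4-0314",   # pin an exact model snapshot (assumed name)
    "temperature": 0.0,       # minimize sampling variability
}

def call_llm(prompt: str, **params) -> str:
    """Placeholder for the actual API call; wire to your provider."""
    raise NotImplementedError

def logged_call(prompt: str, log_path: str = "llm_log.jsonl") -> str:
    """Run one model call and append prompt, parameters, and output to a log."""
    output = call_llm(prompt, **PIPELINE_CONFIG)
    record = {
        "timestamp": datetime.datetime.now().isoformat(),
        "config": PIPELINE_CONFIG,
        "prompt": prompt,
        "output": output,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return output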
5. Chat agents are not that helpful in breaking new ground. The problem is that, if you don't know the solution to a problem, then you can't tell whether the AI got it right, or is even going in the right direction! Instead, the primary use case seems to be helping people accomplish tasks they *already know how to do*, but more effectively and faster. If you can check the answer, then the AI can produce a candidate answer to check.
6. It was very easy for us to come up with one-off use cases that could be very helpful (e.g., help me debug this function, help me write this report or letter), and surprisingly hard to come up with cases that could benefit from automated workflows. At small scale, using chat AI to automate research tasks is trading one task (e.g., annotating data) for more menial and annoying ones (prompt engineering and data reformatting so that the AI can process it). That tradeoff is worth it for large problems, but not for small and medium ones.
7. Confidence rating is a critical functionality that we couldn't automate reliably. We need the AI to tell us when a particular output is low-confidence so that it can be rechecked.
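A sketch of the plumbing we wanted for point 7: ask the model to append a numeric confidence to each answer, parse it, and route low-confidence outputs to a human for recheck. Whether self-reported confidences are actually calibrated is a separate and open question; the call_llm wrapper and the 0.7 threshold below are placeholders, not recommendations.

import re

def call_llm(prompt: str) -> str:
    """Placeholder for your chat-model API call."""
    raise NotImplementedError

def annotate_with_confidence(item: str) -> tuple[str, float, bool]:
    """Return (answer, confidence, needs_recheck) for one annotation item."""
    prompt = ("Annotate the following item and end your reply with a line "
              "'CONFIDENCE: <number between 0 and 1>'.\n\n" + item)
    reply = call_llm(prompt)
    match = re.search(r"CONFIDENCE:\s*([01](?:\.\d+)?)", reply)
    confidence = float(match.group(1)) if match else 0.0  # missing score = recheck
    return reply, confidence, confidence < 0.7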
In sum: Chat AI is going to help us be faster at many tasks we already know how to do, and there are a few interesting scientific automation applications that we found. But for LLMs to change our research, we need better engineering around reliability and reproducibility.