Evaluation of Large Language Models in Advanced Physics Problem Solving
Recent research by Mohamad Ali-Dib and Kristen Menou, titled "Physics simulation capabilities of LLMs," evaluates the performance of state-of-the-art Large Language Models (LLMs) in solving advanced physics problems. The study, submitted to arXiv on December 4, 2023, investigates how well LLMs tackle PhD-level computational physics challenges using established coding frameworks in physics and astrophysics.
The authors present approximately 50 original problems across various domains, including celestial mechanics, stellar physics, fluid dynamics, and non-linear dynamics. These problems are designed to assess the models' ability to generate accurate coding solutions. The evaluation criteria include counts of different error types, such as coding and physics errors, as well as a Pass-Fail metric reflecting whether a solution captures the essential physical concepts of the problem.
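A grading scheme of this kind, with per-line error tallies plus an overall Pass-Fail judgment, could be recorded along the following lines. This is an illustrative sketch only; the field names, categories, and example numbers below are assumptions for exposition, not the authors' actual rubric or data:

```python
from dataclasses import dataclass

@dataclass
class SolutionGrade:
    """Illustrative per-problem tally of line-level judgments (hypothetical schema)."""
    correct: int                # lines judged necessary, sufficient, and correct
    physics_errors: int         # lines containing a physics mistake
    coding_errors: int          # lines containing a coding mistake
    unnecessary: int            # lines that should not be present
    captures_key_physics: bool  # the overall Pass-Fail judgment

    @property
    def total_lines(self) -> int:
        return self.correct + self.physics_errors + self.coding_errors + self.unnecessary

    @property
    def correct_fraction(self) -> float:
        # Share of produced lines judged correct.
        return self.correct / self.total_lines


def pass_rate(grades: list[SolutionGrade]) -> float:
    """Fraction of problems whose solution captured the key physics."""
    return sum(g.captures_key_physics for g in grades) / len(grades)
```

For example, a solution with 18 correct lines out of 20 would score a `correct_fraction` of 0.9, and `pass_rate` over all graded problems would summarize the Pass-Fail metric.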
Findings indicate that the current leading LLM, GPT-4, struggles with most of the presented problems, earning a passing grade on only about 40% of its solutions. Even so, roughly 70-90% of the code lines it produced were judged necessary, sufficient, and correct. The most frequent issues were physics and coding errors, along with some lines judged unnecessary or insufficient.
This research provides insights into the current limitations of AI in simulating physical phenomena and suggests areas for improvement. The authors emphasize that their work serves as a preliminary assessment of LLMs' computational abilities in classical physics, pointing out the need for advancements if AI systems are to achieve a basic level of autonomy in physics simulations. The full paper can be accessed at arXiv:2312.02091.