Can AI Teach Science?

Imagine that you are a college student taking a class on mathematics, physics, computer science, or environmental science. You attend your lectures diligently, taking notes. But when reviewing those notes later on, you find yourself a little confused. How exactly did the professor solve that differential equation? Continually baffled, you open your laptop and type your question into an AI language model. It provides you with an answer, but a nagging feeling remains: just how reliable is it? Can you really trust a solution that artificial intelligence has generated?
The effectiveness of AI models, particularly large language models (LLMs), as educational tools in STEM subjects has recently been evaluated by a cross-disciplinary research team that includes three IAS scholars: Alexis Chevalier, Member (2022–23) and Visitor (2023–24) in the School of Mathematics; Sebastian Mizera, Member (2019–24) in the School of Natural Sciences; and Toni Mikael Annala, Member (2022–24) in the School of Mathematics. Their results were published in the Proceedings of the 41st International Conference on Machine Learning.

“There is an increasing interest in utilizing AI for educational purposes,” said Chevalier. “But given the challenges that current language models face, such as generating incorrect answers or ‘hallucinating’ information, we felt that it would be useful to assess the correctness and helpfulness of the language models themselves.”
To do this, the scholars, each an expert in a different STEM field, first used open-source textbooks to devise a range of plausible questions that a student could ask based on the source material. These questions were diverse in nature: “Some were written as if they came from students who were completely confused about the topic and were just asking a basic question,” Mizera explained. “In other cases, we included very detailed questions about a specific sign in a specific equation, asking, for example, why it's there.”
Also included were questions that replicated real-life instances where a student might misunderstand the subject matter and include false assumptions within the question. A crude example of this is as follows: “If the E in E = mc² stands for electrons, what does the m stand for?” Such questions were, in Mizera’s words, the “trickiest.” This is because AI models “tend to be agreeable,” he continued. “They are often biased towards agreeing with the user. In this case, this is precisely something that you don't want because the student is clearly confused!”
Having devised the questions, the scholars then prepared a range of language models to assess. They began with a number of so-called “foundation models,” such as Llama. These open-source AI models had been pre-trained by Meta and other companies on extensive, high-quality datasets such as books, Wikipedia, and academic papers. As a result, they already “understood” general logic and language. Through a meticulous process, the scholars fine-tuned these foundation models with additional data from textbooks in the specific subject areas that they wanted to evaluate: mathematics, physics, computer science, and environmental science. They were then ready to begin their assessment. “Our aim was to design kind of a scoreboard in which we could assign grades to various language models’ abilities as science assistants or tutors,” stated Mizera.
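For readers curious about what this kind of fine-tuning looks like in practice, the sketch below shows the general recipe of continued training on subject-specific text. It is not the team’s actual pipeline: it assumes the Hugging Face transformers and datasets libraries, an illustrative choice of base model, and a hypothetical file of textbook passages.

```python
# A minimal causal-language-modeling fine-tuning loop using Hugging Face
# transformers/datasets. The base model name and the textbook file are
# illustrative assumptions, not the team's actual training configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed open-source base model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Hypothetical JSONL file with one textbook passage per line: {"text": "..."}
dataset = load_dataset("json", data_files="physics_textbook_chunks.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-physics-tutor",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```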

The fine-tuned models were then deployed to answer the questions the scholars had devised. Finally, the answers were evaluated for correctness and helpfulness. “To facilitate effective grading of each language model’s responses, we developed a benchmark of ‘key points’ that outlined what constituted a good answer for each question,” Mizera explained. The scholars assessed not only whether the answers were correct, but also whether they were correct for the right reasons.
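To make the “key points” idea concrete, here is a small illustrative sketch of how such a benchmark entry might be represented in code. The question, key points, and scoring rule shown here are hypothetical examples, not items from the team’s released benchmark.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    """One question plus the key points a good tutoring answer should cover."""
    subject: str
    question: str
    key_points: list[str] = field(default_factory=list)

def score_answer(item: BenchmarkItem, covered: list[bool]) -> float:
    """Return the fraction of key points a grader judged the answer to cover."""
    assert len(covered) == len(item.key_points)
    return sum(covered) / len(item.key_points)

# Hypothetical entry: a grader checks each key point against the model's answer.
item = BenchmarkItem(
    subject="physics",
    question="Why is there a minus sign in Faraday's law of induction?",
    key_points=[
        "Connects the minus sign to Lenz's law",
        "States that the induced EMF opposes the change in magnetic flux",
        "Mentions that this is consistent with conservation of energy",
    ],
)
print(score_answer(item, covered=[True, True, False]))  # -> 0.666...
```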
“Going in, we hypothesized that doing more fine-tuning would make the model score better on our benchmark,” outlined Chevalier. But this wasn’t exactly what they found. “Training models on the textbooks alone ended up having no impact on their performance,” he added.
Instead, they found that the models needed to be trained on data that was highly relevant to educational settings, namely pedagogical conversations between a teacher and a student. However, finding such student-teacher conversations was challenging, so the team resorted to creating synthetic dialogues with other LLMs, such as ChatGPT, to train their models.
“We instructed ChatGPT to rephrase the content of each textbook chapter in the form of a dialogue between a teacher and a student,” explained Chevalier. “Well-written textbooks can very easily be rephrased in this way, so this task wasn’t too difficult for the LLM.”
“Most interestingly, we found that the most useful conversations were ones where the synthetic student makes lots of mistakes, and the synthetic teacher corrects them,” he continued. “When the model is only trained on conversations where the student understands everything accurately, the model ends up always agreeing with the student, which would be problematic in real-life tutoring sessions. So, we generated a lot of simulated conversations where the student makes mistakes and the teacher corrects the student. Training our models on these conversations led to the biggest performance improvements.”
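A rough sketch of how such synthetic dialogues might be generated is shown below. It uses the OpenAI Python client, but the exact model, prompt wording, and function name are illustrative assumptions rather than the team’s actual data-generation setup.

```python
# Sketch: turn a textbook chapter into a tutoring dialogue in which the
# synthetic student makes mistakes and the synthetic teacher corrects them.
from openai import OpenAI  # assumes the openai package and an API key are configured

client = OpenAI()

SYSTEM_PROMPT = (
    "Rewrite the following textbook excerpt as a dialogue between a student and a teacher. "
    "The student should occasionally make plausible mistakes or state false assumptions, "
    "and the teacher should politely correct them and explain why."
)

def chapter_to_dialogue(chapter_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice for data generation
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": chapter_text},
        ],
        temperature=0.8,  # some variety across generated dialogues
    )
    return response.choices[0].message.content
```

Dialogues produced this way could then be fed back into the same kind of fine-tuning loop sketched earlier.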
Although the conversation-based training improved all of the models, some still emerged as more reliable than others. GPT-4, for example, was identified as a particularly strong model across all subject areas. It was so successful that it facilitated another innovative aspect of the research: as well as employing LLMs to generate the training data for their models, the scholars began deploying GPT-4 to grade other AI-generated answers! The team found that using GPT-4 to evaluate responses produced by other models yielded results comparable to human grading. “So, it's an AI grading an AI answering a human question!” said Mizera.
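The same client-library pattern can be used to sketch the “AI grading an AI” step. Again, the prompt, model choice, and function below are hypothetical illustrations of the general technique, not the team’s published grading code.

```python
# Sketch of LLM-as-judge grading: GPT-4 checks a model's answer against the
# human-written key points for that question.
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading a science tutor's answer.
Question: {question}

Key points a good answer should cover:
{key_points}

Answer to grade:
{answer}

For each key point, say whether the answer covers it correctly, then give an
overall score from 0 to 10 with a one-sentence justification."""

def grade_answer(question: str, key_points: list[str], answer: str) -> str:
    prompt = GRADER_PROMPT.format(
        question=question,
        key_points="\n".join(f"- {kp}" for kp in key_points),
        answer=answer,
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the grading as deterministic as possible
    )
    return response.choices[0].message.content
```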

Another surprising finding from the study was the lack of correlation between a model’s training in one scientific domain and its performance in another. For example, a model trained extensively in mathematics did not necessarily perform better when answering physics questions. This suggests that within AI language models, knowledge transfer between different scientific disciplines may be limited, indicating that specialized fine-tuning is crucial for effective tutoring in each subject area.
The implications of this research are significant for education technology. “Our work highlights both the potential benefits and limitations of using AI as a tutor in STEM subjects,” said Annala. “While some current language models show promise in assisting students with a range of questions, the challenging questions that could be asked by, for example, an advanced undergraduate student often lead to hallucinations. Furthermore, the effectiveness of different models varies widely.” The scholars’ research emphasizes the importance of ongoing improvements in AI capabilities and suggests that future developments could focus on enhancing knowledge transfer between disciplines and refining fine-tuning techniques.
The scholars have made their models, data, and evaluations openly available on GitHub. “In the future, if you want to assess the abilities of a language model as a science tutor, you can just run our benchmark and find out how it performs,” said Mizera. “Hopefully, this technology will continue to improve going forward.”