Monday, February 16, 2026

An LLM-backed "Socratic tutor" to replace reading responses

My hot take on college-level teaching is that reading responses are mostly a terrible assignment, and they're even worse in the age of AI. I'm piloting something a bit different with my co-instructor right now: a "Socratic tutor" bot that asks students to answer open-ended Socratic questions about a specific text and "passes" them when they show sufficient comprehension. Initial feedback from students in a first trial has been extremely positive, so I am thinking more about how this could be useful in the future, as well as some of the potential problems. LLMs are far from a panacea for education – they cause way more problems than they solve, at the moment! – but this might be an interesting use case.

As an instructor, one major challenge is getting students to read the assigned material and engage with it, so that what you do in class can build on that content in a meaningful way; some students would prefer not to (or just don't have time, or whatever). How do you solve this problem? Weekly quizzes are one option, but they're time-consuming to make and give, and annoying to grade; plus, they reinforce a memorization mindset rather than inviting students to engage.

The humble reading response is a frequent alternative: you ask students to respond to, critique, or build on their readings, usually in a short piece ranging from a paragraph to a page. At their best – in a well-prepared seminar – reading responses are read by the instructor beforehand, who synthesizes them and calls on individual students to share their reactions. But in a larger course this synthesis is often impossible, and so the reading response becomes an assignment that no one wants to write and that is tedious to read with the attention it deserves. Even worse, if you're never called on to share your reaction, it's possible to "respond" to a reading without having read it. And that's before you can simply ask an AI to write a response to a text that it has ingested at some point (or that you've pasted into its chat window). What do we do?

This quarter at Stanford I'm co-teaching SymSys 1, "Minds and Machines," with my long-time collaborator and friend, Noah Goodman. SymSys 1 is the introductory course to Symbolic Systems, Stanford's answer to cognitive science – the program I majored in as a student and now serve as faculty director for. We get up to 250 students per quarter, so assessment and personalized feedback are a real challenge. Around 2020, Noah, Dan Lassiter, Erica Yoon, and I redid the course as a pandemic-style flipped classroom with limited in-class lectures and a lot of video materials, section-based instruction, and interactive module projects. Overall, this revision has been a big success, boosting ratings for the course and leading to larger enrollments. But every course needs a refresh eventually, and so I joined the team this quarter to add some lectures and try some new post-AI evaluation practices.

Noah and I sat down to think about what the ideal educational assessment should look like in the post-AI era, and immediately started talking about the Socratic method. There's already an educational literature on the value of Socratic questioning in higher education (example guide), and I think the basic idea of teaching students to answer hard, reasoning-based questions about a particular topic is even more relevant now that it's so easy to engage superficially. The skill of reasoning is one that's built in discussion! Socratic questioning is also the kind of formative assessment that helps students learn what they don't know – and encourages them to go back and learn it. It's not about giving a grade; it's about encouraging thinking.

The trouble, of course, is that Socratic questioning is very hard to do at scale. Some of my colleagues are giving up a huge amount of instructional time to do oral exams with every student, but absent that, how do you ask good questions on a one-on-one basis in an intro course? Our answer was to use LLMs to do this.

Noah introduced me to Google AI Studio, a simple vibe-coding interface for making standalone web apps, and we quickly prototyped a single-purpose tool.* You choose a reading, the app ingests that reading as part of the prompt, and the model is then instructed to examine the student on it using a set of open-ended questions. (For AI uses, it's important to make sure content is licensed appropriately: we used readings from the Open Encyclopedia of Cognitive Science, a CC-BY-NC resource that I co-edit and that is intended for precisely this kind of use case.) Here's a screenshot:

[Screenshot of the Socratic tutor chat interface]
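
Under the hood, the app is very simple. Here's a minimal sketch of the core idea – assembling a system prompt from the licensed reading and handing it to a chat session – assuming the @google/genai TypeScript SDK that AI Studio apps typically scaffold; the function name and prompt wording are illustrative, not our exact app.

import { GoogleGenAI } from "@google/genai";

// API key handling depends on the deployment; shown here as an env var.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY ?? "" });

// Hypothetical sketch: the whole reading goes into the system prompt,
// along with the examiner instructions described below.
function makeTutorChat(readingTitle: string, readingText: string) {
  const systemInstruction = [
    `You are a Socratic tutor examining a student on the reading "${readingTitle}".`,
    "Ask open-ended questions that probe comprehension and reasoning, one at a",
    "time, over roughly 3-5 exchanges. Stay close to the reading. When you are",
    "satisfied the student understands it, tell them to download the transcript",
    "and turn it in; otherwise, tell them to reload the page and try again.",
    "",
    "READING:",
    readingText,
  ].join("\n");

  return ai.chats.create({
    model: "gemini-3-flash-preview", // the model named in the footnote below
    config: { systemInstruction },
  });
}

// Usage: one chat per student session, kept entirely client-side.
// const chat = makeTutorChat(title, text);
// const reply = await chat.sendMessage({ message: studentAnswer });
// console.log(reply.text);
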
The model** is instructed to go through 3-5 conversational exchanges with the student. Once the model is satisfied that the student understands the reading, it instructs the student to download the transcript (PDF button at top left) and turn it in to their TA. The PDF includes a short assessment of the student's knowledge at the top. If the student doesn't demonstrate understanding, the model instructs them to reload the page and try again. This isn't an AI-proof assessment (virtually nothing is!), but the chat window blocks copy-paste to try to defeat the most obvious strategy of pasting in another AI's responses. As an app, the Socratic tutor is about as simple as it gets – it doesn't even have a backend that stores chats or assessments, and there's currently no authentication layer either; students just turn in their own work.
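
The anti-paste measure is nothing fancy either – roughly a pair of DOM event handlers along these lines, where the element id is hypothetical:

// Minimal sketch: swallow paste and drop events on the chat input so the
// easiest shortcut – pasting in another model's answer – takes some effort.
const input = document.querySelector<HTMLTextAreaElement>("#chat-input");
input?.addEventListener("paste", (event) => event.preventDefault());
input?.addEventListener("drop", (event) => event.preventDefault());

Determined students can still retype from another window, of course; the point is friction, not security.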

Student responses to the Socratic tutor have been very positive. We have had each class section alternate weeks between the Socratic tutor and the regular reading response (so there's no confound of reading content with assessment method). In our midterm evaluation, one representative student wrote, "I enjoyed the Socratic Tutor much more than the traditional reading response. I thought it was much more engaging, and encouraged me to think about the readings in a deeper way." On a scale of 1-5 (1 = definitely Socratic tutor, 5 = definitely reading response), the average rating was 2.2, suggesting that students robustly preferred the Socratic tutor to the traditional reading response (61% of raters preferred the tutor). Not everyone loved the bot; some students said it was equivalent to the standard reading response (12%) or worse (17%). Of those who didn't like it, the sentiment was often similar to the student who wrote, "as long as you're detailed enough by spitting back the article, it'll pass you."

Just to be clear: AI is causing major problems for education (and especially higher education). Sadly, a lot of the text that college assessments ask for can simply be generated (now with relatively decent quality) by current LLMs. One response is to go straight back to blue books and oral exams. We may need to do that. But there are some places where the customizability and intelligence of LLMs actually can be helpful, if used correctly. Using LLMs to provide customized formative assessment to improve comprehension – basically, what we think the Socratic tutor does – might be one such place.***

FAQ

1. You are advocating an LLM tool – aren't you actually just abdicating your responsibility as an instructor?

Well, in the original class design, the students were going to write reading responses and the TAs were going to read them. Now the students are doing chats with the bot, and the TAs are going to read those. The difference is that the chats are more focused, and they are (slightly) more difficult to generate using AI tools. So no: I'm trying to be more responsive to student needs and behaviors, rather than less so.

2. Does the Socratic tutor hallucinate? How can you be sure it's correct?

Deploying LLMs in large-scale applications is a major challenge, because it is difficult to create hard-and-fast guardrails. I would not deploy a current commercial model as a tutor in this kind of way without grounding it in a base article. That said, our TA testing and student feedback suggest that the tutor stays very close to the reading it's being prompted with.

3. Do you think this kind of solution should be scaled up for all reading responses or evaluations? 

Definitely not. One reason it's useful right now is probably that it's novel. If it becomes ubiquitous, students will get bored and develop workarounds. We always need a broad and creative landscape of evaluations – even more so in the post-AI era.

4. What's the next step?

We're going to try running a real randomized experiment with knowledge evaluations as well as systematic measures of attitudes in a future quarter of SymSys 1. 

---
* This process was exhilarating – vibe coding like this is amazing and feels like a real expansion of capacity! I posted about how vibe coding is letting me try stuff as a teacher and researcher in ways that I hadn't expected: https://bsky.app/profile/did:plc:kkij4bfbznvqomfmzv36ksgw/post/3me5l2pmn6s2k
** We're using gemini-3-flash-preview as our model, which seems to be pretty good for this; rate limits on the pro API meant we couldn't use the most capable model. 
*** At least until everyone starts doing this, at which point someone will write a Chrome plugin to connect other models to the chat window – and then the Socratic tutor will just be a Gemini that's done the reading talking to a Gemini that hasn't. Womp womp.
