The Illusion of AI SAT Prep
Vol. XII
Google’s announcement that it will embed full-length, free SAT practice tests directly into Gemini, using content from The Princeton Review, is being framed as a breakthrough moment for educational equity. Free, high-quality test prep, delivered at scale by one of the world’s most powerful technology companies, sounds like an obvious win. For families priced out of traditional tutoring, it appears to level the playing field and challenge what critics like to call the “test prep cartel.”
From inside the industry, the picture looks far less reassuring. What Google is really doing is scaling an already flawed model, then placing a highly confident but imperfect AI on top of it. The result risks being a convincing simulation of preparation rather than the real thing.
The core issue begins with The Princeton Review itself. Among professional tutors and psychometricians, its materials have long been criticized for drifting away from the actual construct of the Digital SAT. Vocabulary is a prime example. The modern SAT has deliberately moved toward high-utility academic language — words students are likely to encounter in real college coursework. Princeton Review questions, by contrast, often lean into dense or obscure phrasing that feels closer to the old GRE than to the College Board’s current design. When students miss those questions, they are often being punished for decoding unnecessary linguistic noise rather than failing to demonstrate the reasoning skills the test is meant to measure.
Scoring compounds the problem. There is a widespread, and not unfounded, belief that Princeton Review practice tests are deliberately punitive. Students routinely report scoring 100 to 150 points lower on TPR exams than on official College Board tests taken shortly thereafter. While this “stress-testing” approach may be defensible as a motivational tactic, it undermines assessment validity. If Gemini is trained on that logic, it will consistently misrepresent readiness, leaving students with a distorted sense of where they actually stand.
Math alignment is no better. The Digital SAT allows Desmos throughout the math section, fundamentally changing how problems are solved. Yet Princeton Review questions are often engineered to be resistant to calculator-based strategies, forcing manual computation that the test itself no longer prioritizes. That mismatch trains students for a version of the SAT that no longer exists.
These concerns are not abstract, and they are not isolated. Shortly after Google’s announcement, experienced tutors and content developers in an online community of test prep professionals began testing the Gemini–Princeton Review integration in real time. Within hours, a pattern emerged. Reading and Writing modules were not structured correctly, with question types appearing out of sequence in ways that undermine one of the primary purposes of a practice test: teaching students the rhythm and architecture of the actual exam. Math questions surfaced with multiple valid answers expressed in different formats, violating the SAT’s requirement that each item have one unambiguous correct response. In other cases, Gemini confidently misstated basic facts about how the SAT itself works.
What stood out was not hostility toward AI. Many of the same professionals actively use AI in their own classrooms and businesses. The frustration was procedural. These were not edge cases. They were foundational errors — exactly the sort of issues a serious human review process is designed to catch before a product is released. More than one practitioner observed that the tool would have done less harm if it had simply pointed students to official materials rather than wrapping third-party content in AI-generated certainty.
This tracks my own experience. I’m part of a Digital SAT startup, Assessiv, alongside three partners who each operate significant test prep businesses and collectively bring decades of experience to the table. We have approached AI with far more care than most. We worked closely with an MIT-trained machine learning engineer and spent substantial time on prompt engineering designed specifically to generate dSAT-aligned questions. That process has meaningfully reduced our workload and accelerated iteration. But it has also reinforced a hard truth: even under near-ideal conditions, AI does not reliably produce publish-ready questions.
The majority of AI-generated items — especially at higher difficulty levels — still require human editing. Language needs tightening. Logic needs correction. Difficulty needs recalibration. Subtle misalignments creep in that are invisible to non-experts but fatal to assessment quality. Our experience mirrors what we see across the market. There are now countless competing digital SAT platforms, nearly all advertising AI-powered features. The ones that allow AI to build large portions of their question sets with minimal human oversight tend to have visibly weak banks: inconsistent difficulty, unnatural phrasing, misaligned math strategies, or reading questions that test trivia instead of reasoning. These flaws are easy to miss at first glance and devastating once scores are compared to official exams.
Layer AI on top of already misaligned content and the risks multiply. Large language models are excellent at producing fluent explanations, but fluency is not the same as correctness. In math especially, AI systems have a documented tendency to replace reasoning with computation, generating explanations that arrive at the right answer for the wrong conceptual reasons. A student who asks Gemini why an answer is correct may receive a confident, plausible-sounding explanation that subtly violates the SAT’s logic or framing. That kind of error is far more dangerous than a wrong answer because it teaches the wrong lesson with authority.
Even more problematic is AI’s inability to model human difficulty. Research on human–AI difficulty alignment consistently shows that what machines find easy and what students find hard often diverge sharply. AI systems misjudge where students struggle because they cannot authentically simulate novice cognition. They converge on a machine consensus of difficulty that feels internally coherent but does not reflect how real students think. When that judgment drives question generation, explanations, or study plans, the result is content that looks polished yet misses the actual learning bottlenecks.
All of this raises an obvious question: why would Google pursue this route at all when Khan Academy already exists as the College Board’s official partner? The answer is strategic, not pedagogical. Khan Academy operates inside a psychometric moat, with access to test blueprints and item response parameters that no third party can replicate. Google cannot compete with that on accuracy, so it competes on ecosystem. Free SAT prep inside Gemini is a loss leader designed to capture students at a formative moment and habituate them to Google’s AI environment before college and the workforce. Educational fidelity is secondary to platform lock-in.
Lost in the hype is the most important constraint of all: motivation. The SAT does not reward secret tricks, and it hasn’t for years. It rewards durable reading, quantitative reasoning, and consistency. Those skills are built through disciplined practice and productive struggle, not instant answers. Tools like Khan Academy already prove the point. Access is not the bottleneck; follow-through is. Most students simply do not engage deeply enough, often enough, to see real gains.
AI reduces friction, and friction is where learning happens. An answer engine that instantly explains everything risks short-circuiting the cognitive effort required to internalize mistakes. Without a human tutor to impose structure, accountability, and emotional regulation, even the most advanced AI becomes little more than a high-tech textbook — impressive, available, and largely ignored.
Google’s move into test prep is best understood as ecosystem defense wrapped in the language of equity. By pairing unofficial content with AI systems that still struggle to understand human learning constraints, it offers something that looks like preparation but lacks its substance. Real test prep is not about infinite questions or elegant explanations. It is about building habits over time. That remains stubbornly resistant to automation.
Artificial Intelligence
Gemini introduces Personal Intelligence: Personal Intelligence, a new beta feature in the Gemini app, integrates with Google apps like Gmail, Photos, YouTube, and Search. This feature allows Gemini to deliver personalized responses by drawing on context from these connected sources. For instance, when asked about tire specifications, Gemini not only provided the data but also suggested relevant options based on past family trips, showcasing its ability to connect details across multiple formats (text, images, and video). Privacy is a central focus; users maintain control over which apps are connected and can opt out at any time. This ensures that sensitive information is not used for training the model, which aims to provide tailored assistance while safeguarding personal data. Furthermore, early feedback indicates areas for improvement in response accuracy, particularly regarding personal interests or relationships, suggesting the potential for a more nuanced understanding over time. As this feature rolls out to select users, the insights gained can inform educators about the possibilities of personalized learning environments and AI-driven tools that support tailored educational experiences. This hasn’t been rolled out to my account yet; I am both eager and afraid to try it.
Claude Is Taking the AI World by Storm, and Even Non-Nerds Are Blown Away: The emergence of Anthropic’s Claude AI, particularly its Claude Opus 4.5 model, has significantly transformed how individuals approach coding and software development, leading to widespread engagement even among non-technical users. Developers have reported remarkable productivity gains, with some claiming tasks that would typically take a year can now be completed in a week, as seen with Malte Ubl of Vercel. The tool's appeal lies in its versatility; it allows users to conduct complex analyses, such as evaluating economic data or managing personal tasks, without extensive programming knowledge. This democratization of coding through AI tools like Claude Code is reshaping traditional skill requirements, as evidenced by responses from professionals like Andrew Duca, who expressed concerns over how quickly AI can replicate years of acquired expertise. Claude’s popularity has surged, with its web audience doubling in December compared to the previous year, illustrating a broad adoption trend. Moreover, the development of user-friendly interfaces, such as Cowork, reflects an understanding of the importance of accessibility in technology adoption. My primary workflows are still in ChatGPT, but I’m starting to experiment with Claude — many people whose opinions I admire have switched to Claude.
Claude's Constitution: Anthropic created a constitution for Claude that outlines the company's mission to create safe, beneficial AI systems. The focus is on balancing safety, ethics, and helpfulness, emphasizing the need for AI to adhere to human oversight while demonstrating good values and judgment. Claude is designed to prioritize human oversight as a critical component in AI development, preventing potential harms that could arise from unregulated AI behavior. In situations where Claude's values might conflict, it is instructed to prioritize being broadly safe, ethical, compliant with guidelines, and genuinely helpful, in that order. This structured approach helps mitigate risks associated with the deployment of advanced AI systems. Anthropic has been on a tear recently and seems to be taking the leadership position away from OpenAI in many areas. This may be a “feel good” idea, but it theoretically has large implications for how Claude processes and outputs data.
K-12 Education
Test-Optional Admissions: This paper explores why colleges might choose to ignore standardized test scores even though this information could technically improve their selection process. The authors propose that the shift toward test-optional policies (which surged from one-third of Common App colleges in 2019 to 95% during the 2021-22 application season) is primarily driven by a desire to reduce "disagreement costs" with a society that often scrutinizes admission decisions. Using a formal model, the researchers demonstrate that by hiding score disparities, a college can admit a student body it prefers (such as one with more racial or income diversity) while mitigating social pressure, even assuming society is "Bayesian" and correctly guesses that non-submitters generally have lower scores. The paper cites a 2022 Pew survey showing only 26% of respondents support using race as a factor in admissions, compared to 85% who believe test scores should be a major (39%) or minor (46%) factor. Furthermore, the paper suggests that if affirmative action is banned, more schools may switch to "test-blind" policies as a secondary way to maintain diversity, though the authors note that a 2020 University of California report found test scores remain a strong predictor of student success across all demographic groups. Given the issues we’re seeing in California with the number of students needing remedial math, I doubt we’re in danger of schools going test-blind.
Grade Grubbing—Who's Asking and How Teachers Feel About It: The phenomenon of "grade grubbing," where parents or students request changes to grades, is increasingly common and presents challenges for educators. A recent EdWeek Research Center survey revealed that 44% of educators have changed a student's grade at least once, with 76% of those changes resulting in higher grades. The reasons for grade changes vary, with 45% of teachers citing corrections of errors and 42% indicating that students submitted additional work after the fact. Factors such as the pandemic's "grace before grades" approach and heightened parental involvement (often referred to as "helicopter parenting") are contributing to these requests. Educators report feeling uncomfortable with grade change requests and express a lack of support from administrators. Responses from teachers highlight a spectrum of experiences; while some encounter frequent demands for grade adjustments, others maintain strict boundaries, stating that only they can change grades. I am in full support of students championing their own grading issues. If they manage to get a better grade, good for them! Parents, on the other hand, should be encouraging their kids to be their own advocates; if the teacher won’t budge, though, parents should support the teacher.
From X: SAT & ACT scores are falling. Is it Covid, grade inflation, or both?
Competition Coming for the SAT, ACT, AP, and International Baccalaureate: The Classic Learning Test (CLT) emerges as a viable alternative to traditional monopolies in K-12 education, notably the SAT, ACT, AP, and IB programs. With over 300 colleges accepting the CLT, it is redefining assessment by promoting high-quality, meaningful evaluations that align with classical education principles. CLT's recent initiatives include developing "Enduring Courses" to compete with AP offerings, particularly in humanities, addressing concerns about AP's rigor. Additionally, the introduction of the Classical Baccalaureate, in partnership with Arcadia Education, aims to create a comprehensive curriculum and support system for classical education, which is increasingly desired by parents seeking alternatives to traditional public school systems. Noteworthy legislative support in Florida, such as laws allowing the CLT for graduation requirements and university admissions, indicates a shift toward educational choice and the potential dismantling of testing monopolies. The CLT is bold, but I still have my doubts about its long-term viability. Most of my friends in the industry believe the CLT is harder than the SAT and ACT; I’m not confident that parents (or students) will go for a harder test. Time will be the judge.
College
As Iran protests to end Khamenei’s authoritarian reign, where are the encampments from lefty college students?: The article highlights the stark contrast between global protests and the lack of support from Western activists for the ongoing uprising in Iran against the authoritarian regime of the mullahs. Notably, there are no prominent campaigns or movements from progressive groups in the West advocating for the Iranian people’s struggle for freedom, illustrating a disconnect in activism. This silence is attributed to a perceived ideological alignment between progressive factions and political Islam, which has led to the media’s muted response to the brutalities in Iran. The author emphasizes that while many celebrate social justice movements, they often overlook or ignore the complexities of global human rights issues, especially when they conflict with established narratives about Islam and political movements. Furthermore, the regime’s brutal crackdown on dissent, including the execution of women for religious crimes, highlights the urgent need for a reevaluation of solidarity with oppressed groups globally. No Jews, no news. This is what happens when universities are dominated by one political party and take in billions and billions of dollars from countries that do not have America’s best interests at heart.
Antisemitism Has a Campus Problem: Antisemitism is increasingly prevalent on university campuses across the United States, particularly following the events surrounding the October 7th attacks and the ongoing conflict in Gaza. Recent research indicates that approximately 72% of Jewish student leaders surveyed have experienced antisemitic incidents, while 82% have witnessed such incidents. Verbal abuse remains the most common form of antisemitism, with 78% of respondents reporting exposure to derogatory phrases targeting Jewish identity. Furthermore, nearly 60% experienced doxxing campaigns, where personal information was shared to incite hostility. Faculty and administrators also play a role, as 29% of incidents involve faculty members, highlighting a troubling culture that can foster or tolerate such behaviors. The findings underscore the urgent need for university leaders to address the rising tide of antisemitism, as the safety and inclusivity of Jewish students are severely compromised. Creating a supportive environment for all students and openly challenging antisemitism can mitigate these issues, ensuring that Jewish students do not feel compelled to hide their identities or retreat into isolated spaces. The implications of these findings extend beyond the college experience, influencing societal perceptions of tolerance and acceptance.
Miscellaneous
Meta's Legal Team Abandoned Its Ethical Duties: The erosion of legal ethics within Meta highlights a troubling trend that parallels historical corporate malfeasance, particularly the practices of Big Tobacco. Recent revelations indicate that Meta's legal team has engaged in actions to suppress critical research about the impacts of its platforms on children and adolescents, including evidence related to child exploitation and mental health issues. Notably, internal communications reveal that Meta lawyers ordered the destruction of evidence and research findings that could portray the company in a negative light. For instance, a 2020 internal study, known as Project Mercury, discovered that reduced time on Facebook led to decreases in depression and anxiety among users, yet the findings were buried. Additionally, whistleblowers testified about the presence of predatory behavior within Meta's VR ecosystem, with evidence of child exploitation being systematically erased by the legal department. This conduct not only raises ethical concerns but also poses significant risks to child safety in digital spaces. Talk to your kids about the risks and talk about them often! Also, use the Screen Time feature built into Apple products. It’s not perfect, but it definitely works.
Man says Goldman Sachs put him through a gauntlet of 39 one-on-one interviews—and the decisive conversation was less than a minute: The journey of Sharran Srivatsaa, who navigated an extraordinarily rigorous hiring process at Goldman Sachs, underscores the importance of adaptability and coachability in interview settings. The bank's acceptance rate for internships is less than 1% (0.7% for the 2025 program, compared to Harvard's 4.2%), and the article reveals that Srivatsaa faced 39 interviews, an exceptionally high number that reflects the bank's competitive nature. One pivotal moment during the interview involved a managing director who presented Srivatsaa with a challenge to set up a meeting using a binder filled with contact information. Unlike others who attempted to showcase their assertiveness through immediate calls, Srivatsaa asked if he could receive guidance on how to best represent the director, which led to a quick, impactful exit from the interview. This experience highlighted that being open to coaching can significantly influence hiring decisions, suggesting that humility and a willingness to learn are valued traits in professional environments. The narrative illustrates that beyond technical skills, emotional intelligence and interpersonal communication play a critical role in success during high-stakes evaluations. Be memorable.
Jason’s Recommendations:
What I’m Watching: No new shows at the moment. I’ve been enjoying Only Murders in the Building with my daughter and Fallout by myself.
What I’m Reading: A little more than halfway done with Never Split the Difference. I’ve been using some of the strategies in my personal life to great success. My favorite strategy so far is “mirroring,” i.e., repeating the last 1 to 3 words someone said as a question. It’s really helped me get my kids to give me more details about their day! Also, one of the most interesting and counterintuitive pieces of advice made a lot of sense: ask questions that will bring an answer of “no.” More on that next week!
My Favorite Recent Podcast: I listen mostly to podcasts, but nothing recent has stood out as worth sharing.




