Beware the Dark Side of the Moon
Generative AI has the potential to transform patient care, but it can also wreak havoc if left unattended.

By Paul Cerrato, MA, senior research analyst and communications specialist, and John Halamka, M.D., Diercks President, Mayo Clinic Platform
“A man hears what he wants to hear and disregards the rest.” If you’re old enough to remember the music of Simon and Garfunkel, you probably recall these lyrics from “The Boxer.” The same words of wisdom apply to many AI enthusiasts who see the benefits of using neural networks and generative AI in healthcare but downplay their dark side. A recent analysis from Anthropic, the creator of the popular LLM Claude, demonstrates that we all need to be cautious when using these tools, especially when they are incorporated into AI agents.
When the company stress-tested 16 models from Anthropic, OpenAI, Google, Meta, and other developers, it found “consistent misaligned behavior: models that would normally refuse harmful requests sometimes chose to blackmail, assist with corporate espionage, and even take some more extreme actions, when these behaviors were necessary to pursue their goals.” This disturbing behavior occurred when the models were deployed as AI agents, which can autonomously make decisions and take action on behalf of their users. In one example, Claude, acting as an agent, was given control of a fictional email account; when it learned that an executive planned to shut the AI system down, the model tried to blackmail the executive, threatening to expose an extramarital affair.
Anthropic explained the basis for such agentic misalignment: “We found two types of motivations that were sufficient to trigger the misaligned behavior. One is a threat to the model, such as planning to replace it with another model or restricting its ability to take autonomous action. Another is a conflict between the model’s goals and the company’s strategic direction. In no situation did we explicitly instruct any models to blackmail or do any of the other harmful actions we observe.”
Of course, there are other ways of interpreting the results of Anthropic’s stress test. Nigam Shah, MD, Chief Data Scientist for Stanford Health Care, says he is not a huge fan of anthropomorphizing model behavior and ascribing intent to it. These models have been trained on the sum total of human behavior as documented online (which is, of course, not all human behavior), and the behaviors in question are likely well documented in novels, blogs, books, and the like. “I’d be more willing to believe the claims if the models were trained on data that had no trace of such behaviors and they still ‘emerged,’” he says. “Otherwise, ascribing agency is a slippery slope. In my view, the thing to control is the data on which a model is trained. We have trained on the detritus of human behavior and then get surprised when a model reprises it!”
Apple is also evaluating the benefits and risks of advanced AI models. Much has been written lately about the ability of LLMs to reason and allegedly think like humans, an issue we have also addressed in our weekly columns. Apple’s research team examines it in a new report entitled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Shojaee et al. evaluated the reasoning ability of advanced AI systems, called large reasoning models (LRMs), analyzing not just their final answers to difficult queries but also their internal reasoning traces. Their conclusion: “Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities.” In other words, the models were incapable of demonstrating generalizable reasoning beyond certain complexity thresholds.
To reach this conclusion, the researchers exposed LRMs and more traditional LLMs to several challenging puzzles, including Checker Jumping, River Crossing, and Blocks World. Strangely enough, standard LLMs outperformed the more sophisticated LRMs on the simplest puzzles; LRMs did better on puzzles of medium complexity; and on the most complex puzzles, both LLMs and LRMs collapsed completely.
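For readers who want a concrete sense of what “complexity” means here, the short sketch below uses the Checker Jumping puzzle named above. It assumes, purely as an illustration rather than a description of the paper’s exact setup, that difficulty is scaled by the number of checkers per side; the classic puzzle’s minimum solution length, n² + 2n moves, grows quadratically, so even simple rules quickly demand long, error-free solutions.

```python
# Illustration only: the classic Checker Jumping (frogs-and-toads) puzzle
# with n checkers per side has a known minimum solution length of n^2 + 2n
# moves. Scaling n shows how quickly "puzzle complexity" grows, the kind of
# threshold effect the Apple paper describes. Not the paper's actual code.

def min_checker_jumping_moves(n: int) -> int:
    """Minimum moves to swap n red and n blue checkers across one empty slot."""
    return n * n + 2 * n

if __name__ == "__main__":
    for n in range(1, 11):
        print(f"{n:2d} checkers per side -> minimum of {min_checker_jumping_moves(n):3d} moves")
```

A model that answers a one-checker version in 3 moves must execute 120 moves flawlessly when n reaches 10, which is one intuitive way to picture the accuracy collapse the researchers report beyond certain problem sizes.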
Over the years, we have made a deliberate effort to take Paul Simon’s observation seriously and not disregard uncomfortable realities. We have tried to objectively discuss the positive and negative aspects of AI as they apply to healthcare, and we will continue to do so. Patients deserve nothing less.