What’s Inside Generative AI’s Brain?
Large language models like ChatGPT are finding their way into healthcare, business, and everyday life. The study of mechanistic interpretability may help us create safer, more trustworthy algorithms by “looking under the hood.”

By Paul Cerrato, MA, senior research analyst and communications specialist, Mayo Clinic Platform, and John Halamka, M.D., Diercks President, Mayo Clinic Platform
Many clinicians think twice about using large language models because they remain black boxes that don’t explain exactly how they arrived at their answers, and because they sometimes generate believable lies that can harm patients—so-called hallucinations. A growing number of AI experts have been taking a deep dive into how LLMs work, hoping to address both these concerns. Some researchers refer to this effort as explainable AI while others have coined the term mechanistic interpretability. Neel Nanda, who heads up this research for Google DeepMind, explains: “I want to be able to look inside a model and see if it’s being deceptive.” He and his colleagues have developed a digital tool called Gemma Scope to help developers peer into the inner workings of LLMs.
To make sense of mechanistic interpretability, it helps to understand how LLMs work. Many users don’t realize that LLMs don’t consult a database of facts and theories when they give answers. If you have a question about how to diagnose appendicitis, for instance, you might refer to an online version of Harrison’s Principles of Internal Medicine or check PubMed for reliable answers. LLMs don’t refer to such forms of ground truth. Instead, they rely on statistical pattern matching over their training data, and when they find enough word sequences that match a user’s question, they generate a response. As ChatGPT itself admits, it evaluates the “probability of certain word sequences rather than any deep understanding of meaning, logic, or real-world truth.” That also means LLMs don’t actively check facts against a reliable source and don’t recognize the limits of their knowledge. To make matters worse, they seem to have a built-in tendency to please users, even if that means fabricating answers they think users want to hear.
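For readers who want something more concrete, the short Python sketch below (our own simplified illustration, not code from any actual model) shows what “evaluating the probability of word sequences” amounts to: the model assigns probabilities to candidate next words and picks one, without consulting any source of ground truth. The prompt and the probabilities here are invented for the example.

```python
# Toy illustration (not any real model's code): an LLM repeatedly picks the
# next word from a probability distribution learned during training, rather
# than looking facts up in a database.
import random

# Hypothetical probabilities a model might assign after the prompt
# "The color of the sky is" -- the numbers are invented for illustration.
next_token_probs = {
    "blue": 0.82,
    "gray": 0.10,
    "orange": 0.05,
    "purple": 0.03,
}

def sample_next_token(probs: dict[str, float]) -> str:
    """Pick one candidate word at random, weighted by its probability."""
    tokens = list(probs.keys())
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(next_token_probs))  # most often prints "blue"
```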
With these shortcomings in mind, several data scientists have been looking under the hood. At the operational level, we do understand many aspects of how LLMs work. They use the transformer architecture and its attention mechanism to respond to queries. The process, explained in the landmark paper by Ashish Vaswani and associates at Google and the University of Toronto, is illustrated in the graphic below. Unfortunately, without a degree in computer science, the diagram will likely make little sense. A few basic definitions of terms like tokenization, positional encoding, embedding, vectors, and decoding will help.

Computers don’t communicate using human language; they speak in numbers, namely the binary language of 0s and 1s. With that in mind, the first step in answering a user’s question, or “prompt,” is to convert it into numbers. A question like “What color is the sky?” would be converted into a series of numeric token IDs, roughly equivalent to the individual words in the query. Each token is then mapped to a new set of numbers called a vector, or embedding, which represents the token’s meaning.
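The toy Python sketch below, which uses an invented six-word vocabulary and made-up numbers, illustrates what tokenization and embedding look like in practice: each word in “What color is the sky?” is swapped for an integer ID, and each ID is mapped to a short list of numbers that stands in for its meaning. Real tokenizers split text into subwords and use vectors with hundreds or thousands of dimensions.

```python
# Illustrative only: a tiny "vocabulary" mapping words to token IDs.
# Real LLMs use subword tokenizers with vocabularies of 50,000+ entries.
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "sky": 4, "?": 5}

# Made-up 4-dimensional embedding vectors; production models learn
# much larger vectors during training.
embeddings = {
    0: [0.1, -0.3, 0.7, 0.2],
    1: [0.9, 0.4, -0.1, 0.0],
    2: [-0.2, 0.1, 0.3, 0.8],
    3: [0.0, 0.0, 0.5, -0.5],
    4: [0.6, 0.7, 0.2, -0.1],
    5: [-0.4, 0.2, 0.0, 0.1],
}

# The space before "?" is a simplification so a plain split() can tokenize it;
# real tokenizers handle punctuation automatically.
prompt = "What color is the sky ?"
token_ids = [vocab[word.lower()] for word in prompt.split()]
vectors = [embeddings[tid] for tid in token_ids]

print(token_ids)   # [0, 1, 2, 3, 4, 5]
print(vectors[1])  # the vector standing in for "color"
```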
These vectors also have to be assigned a position in the sentence for an LLM to understand it. Without this positional encoding, the system could confuse itself by scrambling the order of the words. The model learns to handle word order because it has been trained on a massive data set scraped from the Internet or some other source. The transformer architecture then helps the model grasp the relationships between important words and predict which words come next, a process we’ve explained in more detail in a previous column.
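One common way to supply that positional information, described in the Vaswani paper, is to add a sine-and-cosine “positional encoding” to each token’s embedding, so the same word at different positions no longer looks identical to the model. The sketch below implements that formula with toy dimensions.

```python
# Sinusoidal positional encoding as described in "Attention Is All You Need".
# Dimensions are kept tiny here for readability; real models use hundreds.
import math

def positional_encoding(position: int, d_model: int) -> list[float]:
    """Return the encoding vector for one position in the sequence."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# Each token's embedding gets its position's encoding added to it, so the
# same word at position 0 and position 2 is represented differently.
for pos in range(3):
    print(pos, [round(x, 3) for x in positional_encoding(pos, d_model=4)])
```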
But these steps don’t really tell us much about how LLMs reason at a deeper level. Ordinarily, data scientists might attempt to reverse engineer an LLM by looking at all the neuron activations created during a prompt. Unfortunately, that’s not very helpful in this context. The engineers at Anthropic, the makers of Claude, point out: “From interacting with a model like Claude, it’s clear that it’s able to understand and wield a wide range of concepts—but we can’t discern them from looking directly at neurons. It turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.”
To understand how LLMs deal with concepts, Google DeepMind is first trying to find the underlying features, namely the “categories of data that present larger concepts.” This is where Gemma Scope fits in. This set of digital tools is slowly digging into these neural networks with the help of “sparse autoencoders,” which have been compared to microscopes that can take a closer look at the layers within an LLM.
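The sketch below is a bare-bones illustration of the idea behind a sparse autoencoder, not Gemma Scope’s actual code: it takes a layer’s dense activation vector, expands it into a much wider set of candidate features, and then tries to reconstruct the original activations. In a trained autoencoder, a sparsity penalty ensures that only a handful of those features fire at once, which is what makes them easier to interpret; the random weights here are only placeholders.

```python
# Bare-bones sketch of a sparse autoencoder -- not Gemma Scope's actual code.
# It maps a dense activation vector into a much wider "feature" space and
# then reconstructs the original activations from those features.
import numpy as np

rng = np.random.default_rng(0)

d_model = 8      # size of the LLM layer's activation vector (toy value)
d_features = 32  # the autoencoder's wider, more interpretable feature space

# Randomly initialized weights stand in for weights that would be learned by
# training the autoencoder on millions of real activation vectors.
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
b_enc = np.full(d_features, -0.1)  # a negative bias helps push features to zero

def encode(activations: np.ndarray) -> np.ndarray:
    """ReLU zeroes out weak features; in a trained model, an added sparsity
    penalty keeps only a handful of features active at once."""
    return np.maximum(activations @ W_enc + b_enc, 0.0)

def decode(features: np.ndarray) -> np.ndarray:
    """Reconstruct the original activations from the feature vector."""
    return features @ W_dec

layer_activations = rng.normal(size=d_model)  # stand-in for real LLM internals
features = encode(layer_activations)
reconstruction = decode(features)

print("active features:", int((features > 0).sum()), "of", d_features)
print("reconstruction error:",
      float(np.square(reconstruction - layer_activations).mean()))
```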
Until data scientists fully understand the generative AI “brain,” clinicians and patients still need practical advice on how to reduce the risk of misleading or false responses when using ChatGPT, Gemini, Claude, and other popular LLMs. One solution is to switch to a chatbot trained on fit-for-purpose data, or to one that incorporates retrieval-augmented generation (RAG) tools that derive their content from scientifically reliable ground truth (a minimal sketch of the RAG idea appears after the list below). Developing prompt engineering skills can also help reduce the risk of the chatbot hallucinating in response to your query. Among the tips worth considering when you ask a question:
- Be precise and detailed.
- Insist on references from peer-reviewed medical literature when appropriate.
- Verify those references by going to the original source.
- Provide context. If you’re an experienced clinician or a patient with detailed information, making the LLM aware of that fact can help it narrow its search. For example, rather than just asking what’s causing this cough, provide specific details, including patient history, related signs and symptoms, timeline, and lab results.
- Don’t settle for the LLM’s first response. Analyze the bot’s initial response and recast your question based on this first response.
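For the curious, here is a minimal Python sketch of the RAG idea mentioned above, not any vendor’s implementation: before the question ever reaches the chatbot, the system retrieves a passage from a trusted reference and folds it into the prompt, so the answer is grounded in that source. The “trusted passages” below are placeholders, and the keyword-overlap retrieval stands in for the embedding-based search real systems use.

```python
# Minimal sketch of retrieval-augmented generation (RAG) -- not any vendor's
# implementation. A passage from a trusted source is retrieved and included
# in the prompt so the model's answer can be grounded in it.
# The passages below are placeholders, not real clinical guidance.

trusted_passages = [
    "Placeholder text from a vetted clinical reference about appendicitis.",
    "Placeholder text from a vetted clinical reference about pneumonia.",
    "Placeholder text from a vetted clinical reference about chronic cough.",
]

def retrieve(question: str, passages: list[str]) -> str:
    """Pick the passage sharing the most words with the question.
    Toy keyword overlap; real RAG systems compare embedding vectors."""
    q_words = set(question.lower().split())
    return max(passages, key=lambda p: len(q_words & set(p.lower().split())))

def build_prompt(question: str) -> str:
    """Combine the retrieved passage with the user's question."""
    context = retrieve(question, trusted_passages)
    return (
        "Answer using only the source below, and say so if it does not "
        "contain the answer.\n"
        f"Source: {context}\n"
        f"Question: {question}"
    )

print(build_prompt("What is the typical workup for a chronic cough?"))
```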
We may never fully understand how generative AI functions, but there’s still much to be gained by using it carefully and comparing its responses to ground truth.