Prompt Engineering Can Improve the Performance of Generative AI

By John Halamka • December 12, 2023

No sensible clinician would ever allow a large language model to replace them at the bedside, but those who want to use these digital tools to supplement their decision making might benefit from prompt engineering.

By John Halamka, M.D., President, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform.

With so many major technology developers creating large language models (LLMs), clinicians and healthcare executives often struggle to find the most effective way to use them for administrative and clinical purposes. Anyone who has used ChatGPT, for instance, knows that it can produce widely contrasting results, some of which are accurate while others are complete fabrications. One way to improve the results of your prompts is to use prompt engineering.

ChatGPT agrees. When asked to explain prompt engineering in simple terms, the chatbot responded: “Prompt engineering is the process of carefully crafting the questions or instructions you give to a language model to get the best and most relevant responses. It involves tweaking and refining the prompts to achieve the desired outcome.” It went on to point out that the process is best accomplished through experimentation and refinement.

Earlier this year, we asked ChatGPT to diagnose a patient with chest pain and several other signs and symptoms. We could have taken a more nuanced approach and used prompt engineering to improve the query and perhaps elicit a more accurate response. The initial response to our “unengineered” prompt was off target. It suggested the patient had a myocardial infarction when, in fact, at least one of his symptoms suggested a very different diagnosis, aortic dissection. In a subsequent blog, we discussed more sophisticated approaches to finding a possible diagnosis. One research group, for instance, asked ChatGPT to generate the top differential diagnoses for the scenario they created. Using the patients’ medical history and physician exams, clinicians “correctly included the diagnosis in the top differential diagnoses for 83% of cases,” compared to 87% for ChatGPT-4.

Prompt engineering can also take advantage of several other tactics. You can experiment with different prompt styles, for example: instead of asking a direct question, ask the chatbot to explain the step-by-step process of diagnosing a patient with a specific set of signs, symptoms, and lab findings. Bertalan Mesko, M.D., Ph.D., of the Medical Futurist Institute in Budapest, Hungary, has also suggested role playing as another tactic. You might pose the prompt this way: “Assume you are a cardiologist trying to explain to a patient what their signs and symptoms might signify.”
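For readers who script their prompts, here is a minimal Python sketch of the two styles just described: a direct question versus a role-play, step-by-step version. The helper name build_prompt, its parameters, and the exact wording of the instructions are our own illustrative assumptions, not part of any chatbot's interface. The sketch only assembles text and does not call a model.

```python
def build_prompt(findings, role=None, step_by_step=False):
    """Assemble a prompt string from a list of findings.

    Hypothetical helper for illustration only; the wording is an assumption.
    """
    parts = []
    if role:
        # Role playing: tell the chatbot which perspective to adopt.
        parts.append(f"Assume you are a {role}.")
    parts.append("A patient presents with: " + ", ".join(findings) + ".")
    if step_by_step:
        # Ask for the reasoning process rather than a one-line answer.
        parts.append("Explain, step by step, how you would work toward a diagnosis.")
    else:
        parts.append("What is the most likely diagnosis?")
    return " ".join(parts)


findings = ["chest pain", "pain in the left leg"]

direct_prompt = build_prompt(findings)
engineered_prompt = build_prompt(findings, role="cardiologist", step_by_step=True)

print(direct_prompt)
print(engineered_prompt)
```

Keeping the role and the step-by-step request as separate switches makes it easy to try each tactic on its own and compare the responses.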

Another tactic is to iterate and refine after the chatbot generates the response to your first query. One of the exceptional features of chat-based LLMs like ChatGPT and Bard, unlike Alexa or Siri, is that they remember your initial prompt and let you have a very long conversation on the same question. This enables you to modify the output based on the feedback by adding a second prompt and then further ones. This unique capability is a function of an LLM’s attention mechanism and the transformer architecture. In the use case we initially provided ChatGPT, about the patient with chest pain and pain in his left leg, we could have challenged the bot’s suggestion about an MI with a follow-up question: “The patient’s leg pain is not consistent with MI. Please explain what this pain might mean.”
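The iterate-and-refine pattern can be sketched in Python as a growing list of conversation turns. This assumes the common convention of representing each turn as a dictionary with "role" and "content" keys; the helper add_turn is hypothetical, and the assistant reply shown is a placeholder we wrote, not actual model output. No model is called here. In practice, the full history would typically be sent along with each new prompt so the model can take earlier turns into account.

```python
def add_turn(history, role, content):
    """Return a new history list with one more turn (hypothetical helper)."""
    return history + [{"role": role, "content": content}]


# Turn 1: the initial, "unengineered" prompt.
history = add_turn(
    [],
    "user",
    "A patient has chest pain and pain in his left leg. What is the diagnosis?",
)

# Placeholder standing in for the chatbot's first answer (not real output).
history = add_turn(
    history,
    "assistant",
    "[placeholder reply suggesting myocardial infarction]",
)

# Turn 2: challenge the first answer with a follow-up prompt.
history = add_turn(
    history,
    "user",
    "The patient's leg pain is not consistent with MI. "
    "Please explain what this pain might mean.",
)

for turn in history:
    print(f"{turn['role']}: {turn['content']}")
```

Because every earlier turn stays in the list, the follow-up question is read in the context of the original case description and the first answer.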

Use cases that demonstrate the value of prompt engineering include an analysis by Singhal et al., which introduced “instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine.” In addition, there are several tutorials to help users apply the latest advances in prompt engineering in general and in a healthcare setting.

Although prompt engineering has gained favor among many AI stakeholders, some critics question whether it is the best way to improve the outcomes generated by LLMs. As an alternative, they point to problem formulation as a worthwhile strategy. Oguz Acar explains the process in a recent Harvard Business Review article: “Prompt engineering focuses on crafting the optimal textual input by selecting the appropriate words, phrases, sentence structures, and punctuation. In contrast, problem formulation emphasizes defining the problem by delineating its focus, scope, and boundaries. Prompt engineering requires a firm grasp of a specific AI tool and linguistic proficiency while problem formulation necessitates a comprehensive understanding of the problem domain and ability to distill real-world issues. The fact is, without a well-formulated problem, even the most sophisticated prompts will fall short. However, once a problem is clearly defined, the linguistic nuances of a prompt become tangential to the solution.”

Prompt engineering may not be a panacea that magically generates perfectly accurate results and eliminates fabrications. It’s nonetheless a step in the right direction.
