A behavioral simulation system for practicing clinical conversations, surfacing communication patterns, and making patient-centered competence observable.
In clinical environments, providers are trained to diagnose. They are not consistently trained to be present, explain uncertainty, preserve autonomy, or notice when the patient has stopped following the conversation. Discharge paperwork can be clinically correct but functionally unreadable for elderly patients or those with low health literacy. Providers discuss patients in front of patients. Emotional cues pass without acknowledgment.
This gap became clear through my experience as an EMT, where effective care often depended less on diagnosis and more on how information was communicated in high-stress, low-context situations.
Clinical training often focuses on what not to say — for liability. It does not always train how to adapt communication in real time, navigate emotional resistance, or balance information extraction with empathetic presence.
The Clinical Roleplay Simulator gives clinicians a space to practice — without standardized patients, without scheduling overhead, and with evaluation specific enough to change behavior rather than just score it.
Three goals drove the system design: simulate real interactions with enough fidelity that communication habits form, evaluate both clinical and communication performance independently, and generate feedback that surfaces patterns the clinician can actually act on.
From a human factors perspective, communication failure is not simply a deficit in individual empathy. It is often a predictable output of systems that reward speed, certainty, and throughput over understanding, trust, and reflection.
I worked within a small team, where I led system design and built the full functional prototype, driving evaluation and iteration across all four domains.
Decisions in each domain were sequential: research findings drove product decisions, which drove architecture choices.
In early testing, when patient and evaluator states shared context, the LLM's alignment training caused it to drift toward helpfulness. Within a predictable number of turns, the patient persona began responding like a clinician — mirroring the very user it was meant to challenge.
Full state isolation prevented evaluation context from leaking into patient behavior. If the patient had access to the evaluation rubric, it would effectively guide the clinician through the test.
This required careful prompt structuring, context control, and retrieval design to ensure each state behaved independently while maintaining conversational coherence.
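A minimal sketch of that isolation, assuming an OpenAI-style chat API behind a generic `llm` callable (all names here are illustrative, not the project's actual code). The key property: the evaluator receives the transcript as data, while its rubric never enters the patient's context.

```python
from dataclasses import dataclass, field

PERSONA_PROMPT = "You are the patient described below. Stay in character..."  # persona sheet only
RUBRIC_PROMPT = "You are a clinical evaluator. Score against the rubric..."   # never shown to patient

@dataclass
class IsolatedState:
    """One conversational state with its own system prompt and history."""
    system_prompt: str
    history: list = field(default_factory=list)

    def messages(self):
        # The API payload is built only from this state's own context;
        # nothing from any other state is ever merged in.
        return [{"role": "system", "content": self.system_prompt}, *self.history]

patient = IsolatedState(system_prompt=PERSONA_PROMPT)     # sees persona + dialogue only
evaluator = IsolatedState(system_prompt=RUBRIC_PROMPT)    # sees rubric + read-only transcript

def patient_turn(clinician_utterance: str, llm) -> str:
    patient.history.append({"role": "user", "content": clinician_utterance})
    reply = llm(patient.messages())  # patient context only: no rubric, no scores
    patient.history.append({"role": "assistant", "content": reply})
    # The evaluator gets the exchange as transcript data, not shared state.
    evaluator.history.append(
        {"role": "user", "content": f"CLINICIAN: {clinician_utterance}\nPATIENT: {reply}"}
    )
    return reply
```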
Scenarios and personas are generated independently — the same clinical presentation behaves differently across an anxious patient versus a blunt, uncooperative one. This creates healthy dialogue diversity without sacrificing evaluation consistency.
Each of the 6 patient personas is defined by two governing variables: emotional state and cooperation level (a continuous 0–1 scale). These aren't decorative traits — they actively control how much information is shared, how responses are structured, and how the conversation evolves based on clinician behavior.
A cooperation score near 0 doesn't mean the session is impossible. It means the clinician's communication quality is the only variable that can unlock the information needed for diagnosis.
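One way to make those variables operative rather than decorative is a disclosure gate. The sketch below is hypothetical: the weighting and the `rapport` signal are illustrative assumptions, not the system's actual rule.

```python
import random
from dataclasses import dataclass

@dataclass
class Persona:
    emotion: str        # e.g. "anxious", "blunt"
    cooperation: float  # continuous 0-1; governs how freely history is shared

def discloses(persona: Persona, rapport: float) -> bool:
    """Decide whether the patient volunteers a piece of history this turn.

    `rapport` is a running 0-1 estimate of the clinician's communication
    quality so far. At cooperation near 0, disclosure depends almost
    entirely on rapport: good communication is the only key that turns.
    """
    threshold = 1.0 - (0.5 * persona.cooperation + 0.5 * rapport)
    return random.random() > threshold

# The same clinical presentation paired with two personas behaves differently:
anxious = Persona(emotion="anxious", cooperation=0.7)
blunt_uncooperative = Persona(emotion="blunt", cooperation=0.1)
```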
Emotional variables: anxious, fearful, annoyed, frustrated, demotivated, blunt. These are expressed through tone in the STT/TTS layer: behavioral realism in voice, not just text.
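As an illustration of how emotion could drive vocal delivery, a hypothetical emotion-to-prosody map. The parameter names and values are assumptions for a generic TTS engine exposing rate and pitch controls; the writeup does not specify the actual engine or settings.

```python
# Hypothetical mapping from persona emotion to TTS delivery parameters.
EMOTION_VOICE = {
    "anxious":     {"rate": 1.15, "pitch": +2.0},  # faster, raised
    "fearful":     {"rate": 0.95, "pitch": +1.5},  # hesitant, raised
    "annoyed":     {"rate": 1.05, "pitch": -0.5},  # clipped
    "frustrated":  {"rate": 1.10, "pitch": +0.5},  # pressured
    "demotivated": {"rate": 0.85, "pitch": -1.5},  # flat, slow
    "blunt":       {"rate": 1.00, "pitch": -1.0},  # level, curt
}

def synthesize(text: str, emotion: str, tts) -> bytes:
    # `tts` stands in for any engine accepting rate/pitch overrides.
    return tts(text, **EMOTION_VOICE[emotion])
```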
Persona stability depended on the same isolation: without state separation, the model defaulted to its trained behavior (resolution, helpfulness, clarification) and, after a predictable number of conversational turns, stopped being the patient.
More realism doesn't equal a better training system. If everything varies, evaluation becomes inconsistent and feedback loses meaning. 6 personas, OSCE-grounded scenarios, and tight guardrails meant every session produced evaluable output — and the system could scale without the feedback degrading.
Clinical competency functions as a hard constraint — it supersedes all communication scoring. A misdiagnosis costs a life. Poor communication costs trust, adherence, disclosure, and shared decision-making. The hierarchy was intentional from the start.
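The hierarchy is straightforward to express as gating logic. A sketch with illustrative names; the actual rubric aggregation is richer than a single float:

```python
from dataclasses import dataclass

@dataclass
class SessionScore:
    clinical_pass: bool   # hard constraint: diagnostic and safety competence
    communication: float  # 0-1 composite across communication criteria

def overall(score: SessionScore) -> float:
    # Clinical competency supersedes communication scoring: a clinical
    # failure caps the session regardless of how empathetic it sounded.
    if not score.clinical_pass:
        return 0.0
    return score.communication
```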
Most evaluation systems produce a rubric score or highlight specific errors. This system produces feedback from three distinct analytical positions — each one addressing a different kind of learning and turning abstract professionalism into language a learner can practice against.
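Mechanically, multi-perspective feedback can be as simple as running one transcript through three independent evaluator prompts. The perspective names below are stand-ins: the writeup names the standardized-patient and conversational-evaluator roles, while the third prompt here is a hypothetical illustration.

```python
# Three analytical positions over one transcript. Each has its own system
# prompt and produces independent feedback; none shares state with the
# patient persona. Prompts are illustrative, not the project's actual text.
PERSPECTIVES = {
    "standardized_patient": "As the patient, reflect on how this encounter felt...",
    "conversational_evaluator": "Score the dialogue against the rubric...",
    "coach": "Name one concrete communication pattern to practice next session...",
}

def evaluate(transcript: str, llm) -> dict:
    return {
        name: llm([{"role": "system", "content": prompt},
                   {"role": "user", "content": transcript}])
        for name, prompt in PERSPECTIVES.items()
    }
```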
After the system had played both standardized patient and conversational evaluator, I asked it to reflect on what it had learned. The output became a useful qualitative artifact: it named the latent dimensions the simulator was already trying to assess.
The reflections pointed toward a more mature product artifact: a shared language for what "good communication" actually means in a simulated encounter. The model below translates values into observable behaviors and failure modes.
Clinicians described feedback as specific, actionable, and directly tied to the dialogue.
The project matured through a professional product loop: domain research, architecture decisions, prototype delivery, clinician testing, and a roadmap for longitudinal skill growth.
The MVP proved the training loop. The next ceiling is longitudinal: making communication growth visible across sessions, personas, and transitions from student to resident to physician.