A behavioral simulation system for practicing clinical conversations, surfacing communication patterns, and making patient-centered competence observable.
In clinical environments, providers are trained to diagnose. They are not consistently trained to be present, explain uncertainty, preserve autonomy, or notice when the patient has stopped following the conversation. Discharge paperwork can be clinically correct but functionally unreadable for elderly patients or those with low health literacy. Providers discuss patients in front of patients. Emotional cues pass without acknowledgment.
This gap became clear through my experience as an EMT, where effective care often depended less on diagnosis and more on how information was communicated in high-stress, low-context situations.
Clinical training often focuses on what not to say — for liability. It does not always train how to adapt communication in real time, navigate emotional resistance, or balance information extraction with empathetic presence.
The Clinical Roleplay Simulator gives clinicians a space to practice — without standardized patients, without scheduling overhead, and with evaluation specific enough to change behavior rather than just score it.
Three goals drove the system design: simulate real interactions with enough fidelity that communication habits form, evaluate both clinical and communication performance independently, and generate feedback that surfaces patterns the clinician can actually act on.
From a human factors perspective, communication failure is not simply a deficit in individual empathy. It is often a predictable output of systems that reward speed, certainty, and throughput over understanding, trust, and reflection.
I worked within a small team, where I led system design and built the full functional prototype, driving evaluation and iteration across all four domains.
Decisions in each domain were sequential: research findings drove product decisions, which drove architecture choices.
In early testing, when patient and evaluator states shared context, the LLM's alignment training caused it to drift toward helpfulness. Within a predictable number of turns, the patient persona began responding like a clinician — mirroring the very user it was meant to challenge.
Full state isolation prevented evaluation context from leaking into patient behavior. If the patient had access to the evaluation rubric, it would effectively guide the clinician through the test.
This required careful prompt structuring, context control, and retrieval design to ensure each state behaved independently while maintaining conversational coherence.
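A minimal sketch of that isolation, assuming an OpenAI-style chat API behind a generic `llm` callable (all names here are illustrative, not the project's actual code). The key property: the evaluator receives the transcript as data, while its rubric never enters the patient's context.

```python
from dataclasses import dataclass, field

PERSONA_PROMPT = "You are the patient described below. Stay in character..."  # persona sheet only
RUBRIC_PROMPT = "You are a clinical evaluator. Score against the rubric..."   # never shown to patient

@dataclass
class IsolatedState:
    """One conversational state with its own system prompt and history."""
    system_prompt: str
    history: list = field(default_factory=list)

    def messages(self):
        # The API payload is built only from this state's own context;
        # nothing from any other state is ever merged in.
        return [{"role": "system", "content": self.system_prompt}, *self.history]

patient = IsolatedState(system_prompt=PERSONA_PROMPT)     # sees persona + dialogue only
evaluator = IsolatedState(system_prompt=RUBRIC_PROMPT)    # sees rubric + read-only transcript

def patient_turn(clinician_utterance: str, llm) -> str:
    patient.history.append({"role": "user", "content": clinician_utterance})
    reply = llm(patient.messages())  # patient context only: no rubric, no scores
    patient.history.append({"role": "assistant", "content": reply})
    # The evaluator gets the exchange as transcript data, not shared state.
    evaluator.history.append(
        {"role": "user", "content": f"CLINICIAN: {clinician_utterance}\nPATIENT: {reply}"}
    )
    return reply
```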
Scenarios and personas are generated independently — the same clinical presentation behaves differently across an anxious patient versus a blunt, uncooperative one. This creates healthy dialogue diversity without sacrificing evaluation consistency.
Each of the 6 patient personas is defined by two governing variables: emotional state and cooperation level (a continuous 0–1 scale). These aren't decorative traits — they actively control how much information is shared, how responses are structured, and how the conversation evolves based on clinician behavior.
A cooperation score near 0 doesn't mean the session is impossible. It means the clinician's communication quality is the only variable that can unlock the information needed for diagnosis.
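One way to make those variables operative rather than decorative is a disclosure gate. The sketch below is hypothetical: the weighting and the `rapport` signal are illustrative assumptions, not the system's actual rule.

```python
import random
from dataclasses import dataclass

@dataclass
class Persona:
    emotion: str        # e.g. "anxious", "blunt"
    cooperation: float  # continuous 0-1; governs how freely history is shared

def discloses(persona: Persona, rapport: float) -> bool:
    """Decide whether the patient volunteers a piece of history this turn.

    `rapport` is a running 0-1 estimate of the clinician's communication
    quality so far. At cooperation near 0, disclosure depends almost
    entirely on rapport: good communication is the only key that turns.
    """
    threshold = 1.0 - (0.5 * persona.cooperation + 0.5 * rapport)
    return random.random() > threshold

# The same clinical presentation paired with two personas behaves differently:
anxious = Persona(emotion="anxious", cooperation=0.7)
blunt_uncooperative = Persona(emotion="blunt", cooperation=0.1)
```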
Emotional variables: anxious, fearful, annoyed, frustrated, demotivated, blunt. These are expressed through tone in the STT/TTS layer: behavioral realism in voice, not just text.
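As an illustration of how emotion could drive vocal delivery, a hypothetical emotion-to-prosody map. The parameter names and values are assumptions for a generic TTS engine exposing rate and pitch controls; the writeup does not specify the actual engine or settings.

```python
# Hypothetical mapping from persona emotion to TTS delivery parameters.
EMOTION_VOICE = {
    "anxious":     {"rate": 1.15, "pitch": +2.0},  # faster, raised
    "fearful":     {"rate": 0.95, "pitch": +1.5},  # hesitant, raised
    "annoyed":     {"rate": 1.05, "pitch": -0.5},  # clipped
    "frustrated":  {"rate": 1.10, "pitch": +0.5},  # pressured
    "demotivated": {"rate": 0.85, "pitch": -1.5},  # flat, slow
    "blunt":       {"rate": 1.00, "pitch": -1.0},  # level, curt
}

def synthesize(text: str, emotion: str, tts) -> bytes:
    # `tts` stands in for any engine accepting rate/pitch overrides.
    return tts(text, **EMOTION_VOICE[emotion])
```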
Persona stability depended on the same isolation: without state separation, the model defaulted to its trained behavior (resolution, helpfulness, clarification) and, after a predictable number of conversational turns, stopped being the patient.
More realism doesn't equal a better training system. If everything varies, evaluation becomes inconsistent and feedback loses meaning. 6 personas, OSCE-grounded scenarios, and tight guardrails meant every session produced evaluable output — and the system could scale without the feedback degrading.
Clinical competency functions as a hard constraint — it supersedes all communication scoring. A misdiagnosis costs a life. Poor communication costs trust, adherence, disclosure, and shared decision-making. The hierarchy was intentional from the start.
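The hierarchy is straightforward to express as gating logic. A sketch with illustrative names; the actual rubric aggregation is richer than a single float:

```python
from dataclasses import dataclass

@dataclass
class SessionScore:
    clinical_pass: bool   # hard constraint: diagnostic and safety competence
    communication: float  # 0-1 composite across communication criteria

def overall(score: SessionScore) -> float:
    # Clinical competency supersedes communication scoring: a clinical
    # failure caps the session regardless of how empathetic it sounded.
    if not score.clinical_pass:
        return 0.0
    return score.communication
```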
Most evaluation systems produce a rubric score or highlight specific errors. This system produces feedback from three distinct analytical positions — each one addressing a different kind of learning and turning abstract professionalism into language a learner can practice against.
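Mechanically, multi-perspective feedback can be as simple as running one transcript through three independent evaluator prompts. The perspective names below are stand-ins: the writeup names the standardized-patient and conversational-evaluator roles, while the third prompt here is a hypothetical illustration.

```python
# Three analytical positions over one transcript. Each has its own system
# prompt and produces independent feedback; none shares state with the
# patient persona. Prompts are illustrative, not the project's actual text.
PERSPECTIVES = {
    "standardized_patient": "As the patient, reflect on how this encounter felt...",
    "conversational_evaluator": "Score the dialogue against the rubric...",
    "coach": "Name one concrete communication pattern to practice next session...",
}

def evaluate(transcript: str, llm) -> dict:
    return {
        name: llm([{"role": "system", "content": prompt},
                   {"role": "user", "content": transcript}])
        for name, prompt in PERSPECTIVES.items()
    }
```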
After the system had played both standardized patient and conversational evaluator, I asked it to reflect on what it had learned. The output became a useful qualitative artifact: it named the latent dimensions the simulator was already trying to assess.
The reflections pointed toward a more mature product artifact: a shared language for what "good communication" actually means in a simulated encounter. The model below translates values into observable behaviors and failure modes.
Clinicians described feedback as specific, actionable, and directly tied to the dialogue.
The project matured through a professional product loop: domain research, architecture decisions, prototype delivery, clinician testing, and a roadmap for longitudinal skill growth.
The MVP proved the training loop. The next ceiling is longitudinal: making communication growth visible across sessions, personas, and transitions from student to resident to physician.