Codes by Shrey
Clinical Roleplay · National Board of Medical Examiners · 2025

Designing a Clinical Simulation Roleplay System
for Patient-Centered Communication

A behavioral simulation system for practicing clinical conversations, surfacing communication patterns, and making patient-centered competence observable.

Clinical communication is not a soft skill. It is observable clinical behavior under constraint.
Clinical AI · OSCE Simulation · Human Factors · Assessment Design · Motivational Interviewing
Domain
Clinical AI · CUI · Human Factors
Role
Lead Product Engineer
AI Product Systems Designer · Lead Builder
Methods
User Testing · System Design · RAG Architecture
Outcome
Shipped MVP · 15 Clinician Tests
At a Glance — 30-second read
Problem
Clinicians are trained to diagnose, but communication often degrades under real-time pressure.
OSCE training is expensive, scarce, and episodic. It can score communication without giving learners enough rehearsal to build it.
Solution
An AI roleplay system that treats patient-centered communication as observable, trainable behavior.
Built 0-to-1: concept validation through shipped MVP with dialogue, evaluation, and feedback systems.
Key Innovation
Separated patient, scenario, and evaluator states to prevent LLM role drift and evaluation leakage.
Without this, the model's helpfulness training caused it to become the clinician after several turns.
Impact
15 clinicians tested → feedback shifted from personal judgment to developmental language.
Primary shift: clinicians changed how they structured questions, managed uncertainty, and responded to resistance.
01 — The Problem

Communication fails in the interaction, not the knowledge

In clinical environments, providers are trained to diagnose. They are not consistently trained to be present, explain uncertainty, preserve autonomy, or notice when the patient has stopped following the conversation. Discharge paperwork can be clinically correct but functionally unreadable for elderly patients or those with low health literacy. Providers discuss patients in front of patients. Emotional cues pass without acknowledgment.

This gap became clear through my experience as an EMT, where effective care often depended less on diagnosis and more on how information was communicated in high-stress, low-context situations.

Clinical training often focuses on what not to say — for liability. It does not always train how to adapt communication in real time, navigate emotional resistance, or balance information extraction with empathetic presence.

The OSCE format, the gold standard for assessing clinical communication, is a compressed moral laboratory. Under time pressure, learners reveal what has actually been internalized: curiosity or checklist thinking, empathy or efficiency theater.
Communication Breakdown
Clinician
clinical language · density · rushed pacing
Patient receives information
Confusion
Anxiety
Drop-off
What's missing
A training ground for the interaction itself
02 — The System

Not a chatbot. A behavioral simulation.

"The goal was not simply to simulate a patient. It was to create a structured environment where communication patterns could become visible, discussable, and improvable."

The Clinical Roleplay Simulator gives clinicians a space to practice — without standardized patients, without scheduling overhead, and with evaluation specific enough to change behavior rather than just score it.

Three goals drove the system design: simulate real interactions with enough fidelity that communication habits form, evaluate both clinical and communication performance independently, and generate feedback that surfaces patterns the clinician can actually act on.

From a human factors perspective, communication failure is not simply a deficit in individual empathy. It is often a predictable output of systems that reward speed, certainty, and throughput over understanding, trust, and reflection.

System Flow
Clinician enters session
Scenario + Patient generated independently
Roleplay dialogue unfolds
Evaluator assesses full session
3-perspective feedback delivered
03 — My Role

Across product, research, design, and engineering

Worked within a small team, where I led system design and built the full functional prototype — driving evaluation and iteration across all four domains.

Decisions in each domain were sequential: research findings drove product decisions, which drove architecture choices.

UX Research
15 clinician user tests
Persona validation
Scenario complexity calibration
Feedback usability
Product
0-to-1 lifecycle ownership
Feasibility testing
MVP scoping
Feature prioritization
Conversation Design
Dialogue logic
Turn-taking architecture
Feedback timing
MI framework integration
Technical
State architecture
RAG implementation
LLM selection
ASR / STT / TTS
04 — System Architecture

Three independent states — separation by design

The system struggled to produce consistent value until this architecture was introduced — state separation wasn't a refinement, it was a prerequisite.
3-State Architecture
Scenario Generator
Independent from persona · OSCE-grounded
Patient Persona State
Emotion + cooperation (0–1) · 6 personas
Conversation
Turn-taking · context window · dialogue logic
Evaluator State
Rubric-based · no persona access · post-session
⊘ No cross-access between states

Why separation was non-negotiable

In early testing, when patient and evaluator states shared context, the LLM's alignment training caused it to drift toward helpfulness. Within a predictable number of turns, the patient persona began responding like a clinician — mirroring the very user it was meant to challenge.

Full state isolation prevented evaluation context from leaking into patient behavior. If the patient had access to the evaluation rubric, it would effectively guide the clinician through the test.

This required careful prompt structuring, context control, and retrieval design to ensure each state behaved independently while maintaining conversational coherence.
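The isolation can be sketched as separate message assemblies, one per state. A minimal illustration assuming a chat-style message API; `IsolatedState`, the prompt strings, and the function names are hypothetical, not the production code:

```python
from dataclasses import dataclass, field

@dataclass
class IsolatedState:
    """One LLM role with its own system prompt and private context."""
    system_prompt: str
    context: list = field(default_factory=list)  # never shared across states

def build_patient_messages(patient: IsolatedState, transcript: list) -> list:
    # The patient sees only its persona prompt and the dialogue so far --
    # never the rubric, never the evaluator's notes.
    return [{"role": "system", "content": patient.system_prompt}, *transcript]

def build_evaluator_messages(evaluator: IsolatedState, transcript: list) -> list:
    # The evaluator runs post-session on the finished transcript; it holds
    # the rubric but has no write path back into the dialogue.
    flat = "\n".join(f"{t['role']}: {t['content']}" for t in transcript)
    return [
        {"role": "system", "content": evaluator.system_prompt},
        {"role": "user", "content": f"Score this session:\n{flat}"},
    ]
```

The design choice is that leakage becomes structurally impossible rather than prompt-discouraged: the rubric string simply never appears in any message list the patient model receives.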

The same property that makes LLMs useful — their tendency toward helpful resolution — becomes a fidelity failure in simulation contexts. Helpfulness and role-persistence are in tension by default.

Scenario independence

Scenarios and personas are generated independently — the same clinical presentation behaves differently across an anxious patient versus a blunt, uncooperative one. This creates healthy dialogue diversity without sacrificing evaluation consistency.

05 — Behavioral Model

Behavior as position, not personality

Each of the 6 patient personas is defined by two governing variables: emotional state and cooperation level (a continuous 0–1 scale). These aren't decorative traits — they actively control how much information is shared, how responses are structured, and how the conversation evolves based on clinician behavior.

A cooperation score near 0 doesn't mean the session is impossible. It means the clinician's communication quality is the only variable that can unlock the information needed for diagnosis.

An anxious patient may start answering a question — then drift into venting. A strong clinician holds space, mirrors back, and returns to the question on the next turn. That's a trainable pattern. The system was built to force it.

Emotional variables: anxious, fearful, annoyed, frustrated, demotivated, blunt. These are expressed through tone in the TTS layer: behavioral realism in voice, not just text.

Behavior Space — 6 Personas
(2×2 behavior map: cooperation axis from high to low, emotional axis from calm to distressed; personas cluster into cooperative, resistant, and ambivalent regions)
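The two governing variables can be sketched as a small state object in which cooperation gates unprompted disclosure and shifts in response to clinician behavior. A toy illustration, not the production logic; the cue list and the 0.1 increment are invented for the example (the real system scores clinician turns with the LLM, not keywords):

```python
from dataclasses import dataclass

@dataclass
class PatientPersona:
    emotion: str          # e.g. "anxious", "blunt"
    cooperation: float    # 0.0 (withholding) .. 1.0 (forthcoming)

    def facts_to_volunteer(self, hidden_facts: list) -> list:
        # Disclosure scales with cooperation: a resistant patient shares
        # almost nothing unprompted; a cooperative one shares most of it.
        n = round(self.cooperation * len(hidden_facts))
        return hidden_facts[:n]

    def update(self, clinician_turn: str) -> None:
        # Toy heuristic: reflective or permission-seeking language raises
        # cooperation, so communication quality unlocks information.
        cues = ("it sounds like", "that must be", "is it okay if")
        if any(cue in clinician_turn.lower() for cue in cues):
            self.cooperation = min(1.0, self.cooperation + 0.1)
```

This is the mechanism behind "cooperation near 0 doesn't mean the session is impossible": every reflective turn widens what the patient will volunteer.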
06 — Key System Challenge

The LLM's helpfulness is a design vulnerability

In early testing, without state separation, the model defaulted to its trained behavior — resolution, helpfulness, clarification. After a predictable number of conversational turns, it stopped being the patient.

✓ System Working — State Separated
Patient receives question
Patient responds within persona
Clinician adapts to resistance or emotion
Evaluator assesses dialogue independently
Role fidelity maintained throughout
✕ State Contamination — Early Build
Patient accumulates evaluator context
Model optimizes toward helpful resolution
Patient begins advising the clinician
Patient persona becomes the clinician
Simulation collapses — session invalid
Any AI system where role fidelity matters — training simulations, therapeutic tools, negotiation practice — needs explicit architectural separation between what the model knows and what role it's playing.
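Alongside full state isolation, one common mitigation for drift is to re-anchor the persona on every turn rather than relying on a single opening system prompt that dilutes as the conversation grows. A hedged sketch assuming a chat-style message list; the reminder text and the 12-turn window are illustrative:

```python
def patient_turn_messages(persona_prompt: str, transcript: list,
                          max_turns: int = 12) -> list:
    """Re-anchor the persona each turn instead of only at session start.

    Long conversations dilute a single opening instruction; trimming the
    window and restating the role every turn keeps the model in character.
    """
    recent = transcript[-max_turns:]  # bounded context window
    reminder = ("Stay in character as the patient. Do not advise, "
                "diagnose, or coach the clinician.")
    return [
        {"role": "system", "content": persona_prompt},
        *recent,
        {"role": "system", "content": reminder},
    ]
```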
07 — Core Design Tradeoff

Realism vs. Evaluability

Realism
Infinite variation · unpredictable · inconsistent scoring
Evaluability
Controlled variation · consistent feedback · scalable
→ We chose evaluability as the constraint · realism as the target within it

More realism doesn't equal a better training system. If everything varies, evaluation becomes inconsistent and feedback loses meaning. 6 personas, OSCE-grounded scenarios, and tight guardrails meant every session produced evaluable output — and the system could scale without the feedback degrading.

08 — Evaluation Engine

Three layers, one hierarchy

Clinical competency functions as a hard constraint — it supersedes all communication scoring. A misdiagnosis costs a life. Poor communication costs trust, adherence, disclosure, and shared decision-making. The hierarchy was intentional from the start.

Clinical
Hard Constraint
Must not misdiagnose · must ask critical follow-up questions
Failure here overrides everything. Missing a key diagnostic question, or making a false clinical assumption, supersedes any amount of empathetic language. Floor condition for the session.
Communication
Behavioral
6 patient-centered elements (King & Hoppe) · motivational interviewing · OSCE rubrics
Evaluates emotional acknowledgment, question framing, information pacing, responsiveness to resistance, autonomy support, and epistemic humility — assessed as patterns across the full session, not just individual turns.
Dialogue Quality
Heuristics
Sentence structure · tone · conversational flow
Too short reads robotic. Too long overwhelms. Empathetic language, logical question sequencing, and appropriate re-asking are assessed as patterns across the session arc.
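The floor-versus-weighted distinction can be made concrete: clinical competency is a gate, not a term in the average. A minimal sketch with illustrative weights, not the system's actual rubric arithmetic:

```python
def score_session(clinical_pass: bool, communication: float,
                  dialogue_quality: float) -> dict:
    """Clinical competency is a floor condition, not a weighted term.

    A session that misses a critical diagnostic question fails outright,
    no matter how empathetic the language was. Weights are illustrative.
    """
    if not clinical_pass:
        return {"result": "fail", "reason": "clinical floor violated",
                "score": 0.0}
    score = 0.7 * communication + 0.3 * dialogue_quality
    return {"result": "pass", "score": round(score, 2)}
```

A gate avoids the failure mode of weighted averages, where enough empathetic language could arithmetically outvote a missed diagnosis.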
09 — Feedback System

Three perspectives per session

Most evaluation systems produce a rubric score or highlight specific errors. This system produces feedback from three distinct analytical positions — each one addressing a different kind of learning and turning abstract professionalism into language a learner can practice against.

01
Scenario-Based
Given this patient and context, you should have...
Situational awareness. What the case demanded and whether the clinician read the room correctly.
02
Dialogue-Based
In this specific moment, your response could have been...
Turn-level repair. Concrete alternative phrasing tied to specific moments in the conversation.
03
User-Specific
You tend to do X — consider trying Y instead
Behavioral pattern recognition. Feedback calibrated to the clinician's own communication style across sessions.
Scenario + Dialogue + User-specific → Reflection → Pattern recognition → Behavior change
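The three perspectives ship together as one session artifact. A sketch of a plausible payload; the field names and rendering are assumptions for illustration, not the system's schema:

```python
from dataclasses import dataclass

@dataclass
class SessionFeedback:
    scenario_based: str   # "Given this patient and context, you should have..."
    dialogue_based: list  # [(turn_index, alternative_phrasing), ...]
    user_specific: str    # "You tend to do X -- consider trying Y instead."

    def render(self) -> str:
        # One readable artifact: situational read, turn-level repairs,
        # then the cross-session behavioral pattern.
        moments = "\n".join(f"  turn {i}: {alt}"
                            for i, alt in self.dialogue_based)
        return (f"Scenario: {self.scenario_based}\n"
                f"Moments:\n{moments}\n"
                f"Pattern: {self.user_specific}")
```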
10 — Agent Reflections

What the faux patient and evaluator surfaced

After the system had played both standardized patient and conversational evaluator, I asked it to reflect on what it had learned. The output became a useful qualitative artifact: it named the latent dimensions the simulator was already trying to assess.

01
Empirical
Medicine keeps rediscovering the same truth: communication isn't a soft skill, it's a clinical intervention. Patient satisfaction, adherence, even malpractice risk track how well clinicians listen, reflect, and explain — not just what they prescribe. Competence is observable behavior, not intent.
02
Empirical
We measure what we value. When communication becomes a formal competency — assessed, revisited, reinforced — outcomes improve. When it's treated as an early-course checkbox, empathy decays as training advances. Systems shape clinicians more than syllabi.
03
Philosophical
Competency-based education quietly reframes medicine's moral core: not "Did you finish training?" but "Can you be trusted with another human being's vulnerability?" Time served is not the same as trust earned.
04
Philosophical
Patient-centered care isn't about being nice. It's about epistemic humility — recognizing that the patient is the primary expert on their lived experience, while the clinician is an expert in uncertainty management. Truth emerges in dialogue, not monologue.
05
Theoretical
The OSCE is a compressed moral laboratory. Under time pressure, learners reveal what has been internalized: curiosity or checklist-thinking, empathy or efficiency theater. Simulations don't test knowledge — they surface values under constraint.
06
Theoretical
Motivational Interviewing works because it aligns with how humans actually change: autonomy before compliance, meaning before metrics. Education that ignores this trains technical obedience, not professional judgment.
07
Empirical to Theoretical Bridge
Communication failures aren't individual flaws; they're predictable outputs of systems that reward speed, certainty, and throughput. Fixing bedside manner without fixing incentives is treating symptoms, not disease.
08
Philosophical
Professionalism isn't rule-following — it's self-regulation in the presence of power. The moment clinicians stop noticing their own biases, fatigue, and limits is the moment professionalism becomes performative rather than ethical.
09
Theoretical
Competencies create a shared language across transitions: student to resident to physician. Without a shared mental model of good, assessment becomes noise and feedback becomes personal instead of developmental.
10
Closing Reflection
If medicine wants fewer burned-out clinicians and fewer unheard patients, the answer isn't more content — it's better conversations, rigorously taught and relentlessly practiced. What we assess is what we become.
11 — Competency Model

Turning reflection into an assessment artifact

The reflections pointed toward a more mature product artifact: a shared language for what "good communication" actually means in a simulated encounter. The model below translates values into observable behaviors and failure modes.

Listening
Reflects, summarizes, asks clarifying questions.
Failure mode: premature advice.
Epistemic Humility
Acknowledges uncertainty without losing clinical authority.
Failure mode: false certainty.
Autonomy
Offers choice, asks permission, frames change collaboratively.
Failure mode: compliance pressure.
Attunement
Names emotion and validates fear, frustration, or resistance.
Failure mode: efficiency theater.
Explanation
Uses plain language, paces information, checks understanding.
Failure mode: jargon dump.
Self-Regulation
Notices bias, fatigue, defensiveness, and premature closure.
Failure mode: performative professionalism.
Trust-Building
Aligns care with the patient's goals and lived constraints.
Failure mode: authority without relationship.
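Expressed as data, the model above becomes a machine-readable rubric an evaluator can check against, pairing each competency's observable behaviors with the failure mode to watch for. A sketch mirroring the list; the structure itself is assumed:

```python
# Competency model as data: observable behaviors and the corresponding
# failure mode, taken directly from the model above.
COMPETENCIES = {
    "listening": {
        "behaviors": ["reflects", "summarizes", "asks clarifying questions"],
        "failure_mode": "premature advice",
    },
    "epistemic_humility": {
        "behaviors": ["acknowledges uncertainty without losing authority"],
        "failure_mode": "false certainty",
    },
    "autonomy": {
        "behaviors": ["offers choice", "asks permission",
                      "frames change collaboratively"],
        "failure_mode": "compliance pressure",
    },
    "attunement": {
        "behaviors": ["names emotion", "validates fear or resistance"],
        "failure_mode": "efficiency theater",
    },
    "explanation": {
        "behaviors": ["plain language", "paces information",
                      "checks understanding"],
        "failure_mode": "jargon dump",
    },
    "self_regulation": {
        "behaviors": ["notices bias, fatigue, defensiveness"],
        "failure_mode": "performative professionalism",
    },
    "trust_building": {
        "behaviors": ["aligns care with patient goals and constraints"],
        "failure_mode": "authority without relationship",
    },
}
```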
This was the shift from "I built a chatbot" to "I designed a rehearsal environment for clinical judgment." The simulator's value was not replacing human standardized patients; it was making hidden communication patterns visible, measurable, and improvable.
12 — Outcome

What changed

15
Clinicians Tested
Realism After Redesign
Feedback Perspectives
Question Quality Shift

Clinicians described feedback as specific, actionable, and directly tied to the dialogue.

Primary shift observed: clinicians changed how they structured questions across sessions, not just what they asked. The framing, pacing, sequencing, and repair of information gathering shifted. That's a communication habit forming, not a rubric score improving.
Most healthcare AI systems optimize for correctness.
This system optimizes for evaluability — because what we assess is what we become.
Controlled variation · shared language · feedback loops · behavioral focus
14 — Process + Progress

From research gap to evaluable training system

The project matured through a professional product loop: domain research, architecture decisions, prototype delivery, clinician testing, and a roadmap for longitudinal skill growth.

01 · Research
Communication Training Gap
Literature review established the gap between clinical knowledge assessment and patient-centered communication practice.
02 · Architecture
State Separation
Scenario, patient, conversation, and evaluator states were separated to prevent role drift and evaluation leakage.
03 · Prototype
Shipped MVP
Built the roleplay loop with persona behavior, transcript-aware dialogue, RAG-supported context, and structured feedback.
04 · Testing
15 Clinician Sessions
User testing focused on realism, feedback usefulness, scenario fidelity, and communication behavior changes.
05 · Roadmap
Longitudinal Progress
Next iteration: profiles, saved evaluations, shared rubric language, deeper clinical complexity, and challenge scenarios.
15 — Technical Skills Demonstrated

AI product architecture, not chatbot assembly

LLM Orchestration
Scenario generation
Persona state control
Evaluator isolation
Context-window management
Evaluation Design
OSCE-style feedback
Motivational interviewing metrics
Transcript-based scoring
Actionable rephrasing
System Design
RAG architecture
ASR / STT / TTS workflows
State contamination prevention
Safety and non-goal boundaries
Human Factors
Clinician usability testing
Scenario calibration
Feedback comprehension
Behavioral failure-mode analysis
16 — What's Next

Where the ceiling is

The MVP proved the training loop. The next ceiling is longitudinal: making communication growth visible across sessions, personas, and transitions from student to resident to physician.

01 · Progress
Profile Tracking
Track clinician improvement across personas over time. Make the development arc visible rather than session-by-session. The tool needs a sense of progress to sustain engagement.
02 · Rubric Language
Shared Mental Model
Create stable competency language across handoffs, coaching, and repeated encounters. Feedback should feel developmental, not personal.
03 · Depth
Clinical Complexity
The area most constrained at MVP. Deeper diagnostic nuance without sacrificing evaluation consistency. As complexity increases, the empathy-information seesaw gets harder — that's where the real training value is.
Shreyas Sriram