Validating AI Interactions: Reimagining Educational Assessment
Project Context: Gen AI & Multimodal AI
In today's digital age, artificial intelligence is transforming how we measure and develop crucial skills. At SkillTrack, we faced a fascinating challenge: leveraging generative AI and multimodal interactions to assess skills that traditionally required human observation.
This wasn't just about creating another EdTech tool – it was about pioneering new ways to measure the intangible aspects of human communication and learning.
Impact:
🧠 Led research that united UX, Learning Science, and AI Engineering teams to define conversation design principles for skill measurement
⚡ Reduced AI feature validation time from months to weeks
🎮 Fostered a rigorous and enjoyable culture of team experimentation
Role: UX Researcher
Timeline: 2023 - 2024
Team: Learning Scientists, Product Manager, Instructional Designer, AI Engineers, Co-UX Researcher, Product Designers
*Some details are modified for confidentiality reasons.
Concept Validation | AI Interaction Design | Interdisciplinary Collaboration
The SkillTrack Product Suite
SkillTrack is a series of immersive 15-minute modules that feel more like interactive stories than tests. It measures crucial 21st-century skills including Effective Communication, Innovative Thinking, Professional Resilience, and Cultural Agility.
This case study focuses on measuring Effective Communication through AI-powered conversational tasks.
🙋Target Users
We focused on high school and college students preparing for their careers. SkillTrack helps them build and demonstrate essential professional abilities.
The Challenge: Validating the Unknown
How do you test AI interactions that don't yet exist? Traditional usability testing methods weren't enough for validating multimodal AI features that combined text generation, voice interaction, and adaptive feedback. In particular, we wanted to identify best practices for conversational agent design in culturally specific workplace contexts.
Core Research Questions I Defined:
How do different AI response patterns affect student engagement and trust?
What combination of voice and visual feedback creates the most natural learning experience?
How can we optimize AI interactions for authentic skill assessment?
Key Research Methods
Wizard of Oz Prototyping
Survey (Godspeed survey: likeability, intelligence, trust, etc.)
Role-playing
Semi-structured Interview
Co-design
Data Collection & Analysis
Voice and text responses
Coding of strategies participants used while conversing with the conversational AI during skill-measurement tasks
Survey: Godspeed measures
Thematic analysis of semi-structured interview data
Example: A 1-to-N conversation prototype I curated using UXD’s early concept designs, covering a wide range of interactions across both happy paths and edge cases.
Using Wizard of Oz prototyping, I combined wireframes with multimodal elements (audio, text, image) to simulate our future AI product. This approach, paired with Think Aloud studies, provided rich insights into user behavior with our yet-to-be-built features.
A/B Testing: AI Voice Personas
I implemented a voice generation pipeline using the ElevenLabs API to create two distinct personas for the user’s AI colleagues in the module (a minimal sketch of this pipeline follows the list below).
Learning Scientists customized each persona's:
Speech patterns
Pacing variations
Cultural context awareness
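For illustration, here is a minimal sketch of such a pipeline. It assumes the ElevenLabs v1 text-to-speech REST endpoint; the API key, voice IDs, persona settings, and scripted lines are placeholders rather than the actual SkillTrack configuration.

```python
import requests

# Assumed ElevenLabs v1 text-to-speech endpoint; key and voice IDs are placeholders.
ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
API_KEY = "YOUR_ELEVENLABS_API_KEY"

# Hypothetical persona settings, not the actual SkillTrack configuration.
PERSONAS = {
    "formal_mentor": {"voice_id": "VOICE_ID_A", "stability": 0.8, "similarity_boost": 0.6},
    "peer_mentor": {"voice_id": "VOICE_ID_B", "stability": 0.4, "similarity_boost": 0.8},
}


def synthesize_line(persona: str, text: str, out_path: str) -> None:
    """Generate one audio clip for a persona's scripted conversation turn."""
    cfg = PERSONAS[persona]
    response = requests.post(
        ELEVENLABS_TTS_URL.format(voice_id=cfg["voice_id"]),
        headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
        json={
            "text": text,
            "model_id": "eleven_monolingual_v1",
            "voice_settings": {
                "stability": cfg["stability"],
                "similarity_boost": cfg["similarity_boost"],
            },
        },
        timeout=60,
    )
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)  # audio bytes returned by the API


# Pre-generate scripted turns for each persona (example lines, not the real script).
synthesize_line("peer_mentor", "Hey! How did the client call go yesterday?", "tina_turn_01.mp3")
synthesize_line("formal_mentor", "Could you walk me through the outcome of yesterday's client call?", "saanvi_turn_01.mp3")
```

Pre-generating the scripted turns this way let us keep the Wizard of Oz sessions responsive while still varying voice delivery between the two personas.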
RQ: What effects does conversational agent (CA) personality (formal vs. informal) have on users' perceptions of the CA and on their behavior?
Persona formality scores: 0.152 vs. -1.3158
Key Findings
In collaboration with AI Engineers and Learning Scientists, we evaluated how different AI communication styles affected how users engaged with the communication tasks.
1. Disclosure Behavior
Tina's conversational style elicited longer responses from users (11.8 words on average) compared to Saanvi's more formal approach (11.3 words).
This suggests users tend to share more detailed responses with a peer-like AI personality (the comparison is sketched below).
“The conversation with Tina helped me open up about situations where I struggled to speak up in team meetings. With Saanvi, I was more focused on giving the ‘right’ answers.”
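As a hedged sketch of how this comparison can be run (not necessarily the exact pipeline we used): the file name and column names below are hypothetical, and Welch's t-test is one reasonable choice for comparing response lengths across the two personas.

```python
import pandas as pd
from scipy import stats

# Hypothetical transcript export: one row per user response,
# with columns participant_id, persona, response_text.
responses = pd.read_csv("conversation_responses.csv")
responses["word_count"] = responses["response_text"].str.split().str.len()

tina = responses.loc[responses["persona"] == "Tina", "word_count"]
saanvi = responses.loc[responses["persona"] == "Saanvi", "word_count"]

print(f"Mean words per response with Tina:   {tina.mean():.1f}")
print(f"Mean words per response with Saanvi: {saanvi.mean():.1f}")

# Welch's t-test: compares the two conditions without assuming equal variances.
t_stat, p_value = stats.ttest_ind(tina, saanvi, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```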
2. User Perceptions (Godspeed Metrics)
Tina scored slightly higher on Anthropomorphism (3.0 vs 2.5), indicating her casual style felt more human-like.
3. Communication Style Adaptation
The formality analysis (Coh-Metrix formality score, p = 0.01) revealed a striking pattern: users significantly adapted their language formality to match each mentor (an approximate way to compute such a score is sketched below).
With Saanvi, participants used more formal language (formality score: 0.656)
With Tina, they shifted to more casual communication (formality score: -0.224)
“With Mentor B, I felt like I was having a real conversation about my career goals. I shared things I wouldn’t normally tell a teacher.”
On communication style adaptation, an example of a user's casual reply to Tina: “Sounds good. :)”
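Coh-Metrix itself is a standalone analysis tool, so the following is only an illustrative stand-in: a rough formality proxy based on the Heylighen-Dewaele F-measure computed from part-of-speech frequencies. Note that its 0-100 scale differs from Coh-Metrix's z-scored formality composite, and spaCy's coarse tags only approximate the original word categories.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


def formality_f_measure(text: str) -> float:
    """Heylighen-Dewaele F-measure: higher means more formal (roughly 0-100).

    spaCy's ADP tag stands in for prepositions and DET for articles, so this
    is a rough proxy, not the Coh-Metrix formality composite.
    """
    tokens = [t for t in nlp(text) if not t.is_space and not t.is_punct]
    total = len(tokens) or 1

    def freq(pos: str) -> float:
        return 100.0 * sum(t.pos_ == pos for t in tokens) / total

    return (
        freq("NOUN") + freq("ADJ") + freq("ADP") + freq("DET")
        - freq("PRON") - freq("VERB") - freq("ADV") - freq("INTJ")
        + 100.0
    ) / 2.0


# Casual vs. formal phrasings of a similar reply score very differently.
print(formality_f_measure("Sounds good, I'll just ping them later!"))
print(formality_f_measure("The proposal will be forwarded to the stakeholders for review."))
```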
But what do these behavioral patterns tell us about measuring communication skills?
Actionable Insights: Informing Design Guidelines
To move from insights to action, I facilitated a collaborative synthesis workshop with our cross-functional teams.
Our AI Engineers brought technical feasibility perspectives, Learning Scientists provided pedagogical frameworks, Product Designers focused on user experience, and Content Strategists contributed workplace scenario expertise. This collaborative analysis helped us develop comprehensive design recommendations for authentic communication skill assessment.
Three key design guidelines were implemented based on the findings:
1. Authenticity & Engagement Optimization
Start with a peer-like conversation style to establish comfort and baseline behavior.
We implemented a simple onboarding prompt exercise in a casual setting to kick off the module.
2. Adaptive Formality Assessment
Design multi-stage scenarios that assess style-switching ability, from casual team interactions to formal client presentations. Measure adaptation across power dynamics using Coh-Metrix scoring (observed range: -0.224 to 0.656).
Mobile mockups: from casual conversations to formal meeting participation with key stakeholders.
3. Measurement Framework Design for Learning Science
Implement quantitative and qualitative metrics to assess communication effectiveness.
Track formality scores, response lengths, and engagement levels across different contexts. Include behavioral indicators like appropriate disclosure levels and situational awareness.
This comprehensive approach allows for authentic skill assessment while maintaining measurement rigor.
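To make the framework concrete, here is a hedged sketch of what a per-turn metrics record might look like. The field names, scales, and the style-switch helper are hypothetical illustrations, not the actual SkillTrack rubric.

```python
from dataclasses import dataclass


@dataclass
class TurnAssessment:
    """One scored user turn in a conversational assessment (illustrative fields only)."""
    participant_id: str
    scenario: str                 # e.g. "casual_team_chat" or "formal_client_meeting"
    formality_score: float        # Coh-Metrix-style formality estimate for the turn
    word_count: int               # proxy for disclosure / elaboration
    engagement_rating: int        # 1-5, from rubric-based or human scoring
    appropriate_disclosure: bool  # qualitative code applied by a researcher
    situational_awareness: bool   # qualitative code applied by a researcher


def style_switch_delta(casual: TurnAssessment, formal: TurnAssessment) -> float:
    """Formality shift between contexts: the adaptation signal we want to measure."""
    return formal.formality_score - casual.formality_score
```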
UXR Impact
Leading this research initiative, I bridged the gap between UX, Learning Science, and AI Engineering to revolutionize our approach to skill measurement.
Through structured yet creative experimentation, we not only accelerated our AI validation process from months to weeks but also fostered a collaborative culture where technical rigor met playful innovation.