Validating AI Interactions: Reimagining Educational Assessment

 

Project Context: Gen AI & Multimodal AI

In today's digital age, artificial intelligence is transforming how we measure and develop crucial skills. At SkillTrack, we faced a fascinating challenge: leveraging generative AI and multimodal interactions to assess skills that traditionally required human observation.

This wasn't just about creating another EdTech tool – it was about pioneering new ways to measure the intangible aspects of human communication and learning.

Impact:

🧠 Led research that united UX, Learning Science, and AI Engineering teams to define conversation design principles for skill measurement

⚡ Reduced AI feature validation time from months to weeks

🎮 Built a rigorous yet enjoyable experimentation culture within the team

Role: UX Researcher

Timeline: 2023 - 2024

Team: Learning Scientists, Product Manager, Instructional Designer, AI Engineers, Co-UX Researcher, Product Designers

*Some details are modified for confidentiality reasons.

Concept Validation | AI Interaction Design | Interdisciplinary Collaboration


 

The SkillTrack Product Suite

SkillTrack is a series of immersive 15-minute modules that feel more like interactive stories than tests. It measures crucial 21st-century skills including Effective Communication, Innovative Thinking, Professional Resilience, and Cultural Agility.

This case study focuses on measuring Effective Communication with AI-powered conversational tasks.

🙋 Target Users

We focused on high school and college students preparing for their careers. SkillTrack helps them build and demonstrate essential professional abilities.

 
 

The Challenge: Validating the Unknown

How do you test AI interactions that don't yet exist? Traditional usability testing methods weren't enough for validating multimodal AI features that combined text generation, voice interaction, and adaptive feedback. In particular, we were curious about best practices for conversational agent design in culturally specific workplace contexts.

 
 

Core Research Questions I Defined:

How do different AI response patterns affect student engagement and trust?

What combination of voice and visual feedback creates the most natural learning experience?

How can we optimize AI interactions for authentic skill assessment?

 

Key Research Methods

  • Wizard of Oz Prototyping

  • Survey (Godspeed questionnaire: likeability, perceived intelligence, trust, etc.)

  • Role-playing

  • Semi-structured Interview

  • Co-design

Data Collection & Analysis

  • Voice and text responses

  • Coding of strategies users employed while conversing with the conversational AI during skill-measurement tasks

  • Survey: Godspeed measures

  • Thematic analysis of semi-structured interview data

 
 

Example: a 1 vs. N conversation prototype I curated using UXD’s early concept designs, covering a wide range of interactions, from happy paths to edge cases.

Using Wizard of Oz prototyping, I combined wireframes with multimodal elements (audio, text, image) to simulate our future AI product. This approach, paired with Think Aloud studies, provided rich insights into user behavior with our yet-to-be-built features.

 
 

A/B Testing: AI Voice Personas

I implemented a voice generation pipeline using the ElevenLabs API to create two distinct personas for the user’s AI colleagues in the module.

For each persona, Learning Scientists customized the following (a rough pipeline sketch follows the list):

  • Speech patterns

  • Pacing variations

  • Cultural context awareness
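As a rough illustration of how such a pipeline can be wired up, here is a minimal sketch assuming the ElevenLabs v1 text-to-speech REST endpoint; the voice IDs, persona labels, and voice settings below are placeholders rather than the production configuration.

```python
# Minimal sketch of the persona voice pipeline (illustrative only).
# Assumes the ElevenLabs v1 text-to-speech REST endpoint; the voice IDs,
# persona labels, and voice_settings values below are placeholders.
import os
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]  # hypothetical env var for the key
TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

# Hypothetical persona configs: a formal mentor vs. a casual, peer-like one.
PERSONAS = {
    "saanvi_formal": {
        "voice_id": "VOICE_ID_FORMAL",  # placeholder voice ID
        "voice_settings": {"stability": 0.75, "similarity_boost": 0.75},
    },
    "tina_casual": {
        "voice_id": "VOICE_ID_CASUAL",  # placeholder voice ID
        "voice_settings": {"stability": 0.45, "similarity_boost": 0.75},
    },
}

def synthesize(persona: str, text: str, out_path: str) -> None:
    """Render one scripted line in a persona's voice and save the clip."""
    cfg = PERSONAS[persona]
    resp = requests.post(
        TTS_URL.format(voice_id=cfg["voice_id"]),
        headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
        json={
            "text": text,
            "model_id": "eleven_multilingual_v2",
            "voice_settings": cfg["voice_settings"],
        },
        timeout=60,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)  # audio bytes returned by the endpoint

# Pre-render scripted turns in both voices so a Wizard of Oz moderator
# can play the matching clip on cue during a session.
synthesize("saanvi_formal", "Could you walk me through your proposal before the client call?", "saanvi_turn01.mp3")
synthesize("tina_casual", "So, what's your take on the proposal? I'd love to hear it!", "tina_turn01.mp3")
```

Keeping the two personas' parameters side by side in one config makes it easy to hold everything else constant while varying only the voice characteristics between conditions.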

RQ: What effect does conversational agent (CA) personality (formal vs. informal) have on users' perceptions of the CA and on their behavior?

Persona script formality scores: Saanvi (formal) 0.152 vs. Tina (informal) -1.3158
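For readers curious what a formality number actually captures: the scores in this study come from Coh-Metrix, but a rough proxy can be computed from part-of-speech frequencies alone, following the Heylighen & Dewaele F-measure. The sketch below is purely illustrative: it is not the Coh-Metrix metric, and its 0-100 scale differs from the normalized scores reported here.

```python
# Rough formality proxy for a script, based on the Heylighen & Dewaele
# F-measure (POS-frequency based). Illustrative only: the study's scores come
# from Coh-Metrix, which reports a different, normalized scale.
import spacy

nlp = spacy.load("en_core_web_sm")

FORMAL = {"NOUN", "PROPN", "ADJ", "ADP", "DET"}    # nouns, adjectives, prepositions, articles/determiners (approx.)
DEICTIC = {"PRON", "VERB", "AUX", "ADV", "INTJ"}   # pronouns, verbs, adverbs, interjections

def f_measure(text: str) -> float:
    """Higher values indicate a more formal, noun-heavy style (0-100 scale)."""
    tokens = [t for t in nlp(text) if t.is_alpha]
    n = len(tokens) or 1
    formal_pct = 100 * sum(t.pos_ in FORMAL for t in tokens) / n
    deictic_pct = 100 * sum(t.pos_ in DEICTIC for t in tokens) / n
    return (formal_pct - deictic_pct + 100) / 2

print(f_measure("Could you please summarize the quarterly findings for the client?"))
print(f_measure("So, what do you think? Sounds good to me!"))
```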

 

Key Findings

In collaboration with AI Engineers and Learning Scientists, we evaluated how different AI communication styles affected how users engaged with the communication tasks.

 

1. Disclosure Behavior

Tina's conversational style elicited slightly longer responses from users (11.8 words on average) compared to Saanvi's more formal approach (11.3 words).

This suggests users tend to share more detailed responses with a peer-like AI personality.

The conversation with Tina helped me open up about situations where I struggled to speak up in team meetings. With Saanvi, I was more focused on giving the ‘right’ answers.
— Student (Senior Year)
 
 

2. User Perceptions (Godspeed Metrics)

Tina scored slightly higher on Anthropomorphism (3.0 vs 2.5), indicating her casual style felt more human-like.

 

3. Communication Style Adaptation

The formality analysis (Coh-Metrix score, p = .01) revealed a striking pattern: users significantly adapted their language formality to match each mentor (a minimal sketch of the comparison follows the bullets below).

  • With Saanvi, participants used more formal language (formality score: 0.656)

  • With Tina, they shifted to more casual communication (formality score: -0.224)
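A minimal sketch of how this paired comparison can be run, assuming the per-participant formality scores were already exported from the scoring tool into a CSV; the file name and column names are hypothetical:

```python
# Minimal sketch of the adaptation comparison. Per-participant formality
# scores are assumed to be exported from the scoring tool into a CSV;
# the file name and column names below are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("formality_scores.csv")  # columns: participant_id, saanvi, tina

t_stat, p_value = stats.ttest_rel(df["saanvi"], df["tina"])  # paired comparison
print(f"mean formality with Saanvi: {df['saanvi'].mean():.3f}")  # study mean: 0.656
print(f"mean formality with Tina:   {df['tina'].mean():.3f}")    # study mean: -0.224
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
```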

With Mentor B, I felt like I was having a real conversation about my career goals. I shared things I wouldn’t normally tell a teacher.
— Student (Junior Year)

On communication style adaptation: even one-line replies in the casual condition read like "Sounds good. :)"

But what do these behavioral patterns tell us about measuring communication skills?

 
 




Actionable Insights: Informing Design Guidelines

To move from insights to action, I facilitated a collaborative synthesis workshop with our cross-functional teams.

Our AI Engineers brought technical feasibility perspectives, Learning Scientists provided pedagogical frameworks, Product Designers focused on user experience, and Content Strategists contributed workplace scenario expertise. This collaborative analysis helped us develop comprehensive design recommendations for authentic communication skill assessment.

 
 

Three key design guidelines were implemented based on the findings:


1. Authenticity & Engagement Optimization

Start with peer-like conversation style to establish comfort and baseline behavior.

We implemented a simple onboarding prompt exercise in a casual setting to kick-start the module.

 

2. Adaptive Formality Assessment

Design multi-stage scenarios that assess style-switching ability, from casual team interactions to formal client presentations. Measure adaptation across power dynamics using Coh-Metrix scoring (observed range: -0.224 to 0.656).

Mobile mockups: from casual conversations to formal meeting participation with key stakeholders.

 

3. Measurement Framework Design for Learning Science

Implement quantitative and qualitative metrics to assess communication effectiveness.

Track formality scores, response lengths, and engagement levels across different contexts. Include behavioral indicators like appropriate disclosure levels and situational awareness.

This comprehensive approach allows for authentic skill assessment while maintaining measurement rigor.
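As one illustrative way to operationalize this framework, the sketch below bundles a few per-turn signals into a single record; the field names and values are assumptions, not the product's actual logging schema.

```python
# Illustrative sketch of per-turn assessment signals in one record.
# Field names and values are assumptions, not the production schema.
from dataclasses import dataclass, asdict

@dataclass
class TurnMetrics:
    participant_id: str
    scenario: str           # e.g., "casual_team_chat" or "formal_client_meeting"
    persona: str            # which AI colleague the learner spoke with
    response_words: int     # simple proxy for disclosure / elaboration
    formality_score: float  # e.g., a Coh-Metrix-style score for the learner's turn
    on_topic: bool          # coded indicator of situational awareness

turn = TurnMetrics(
    participant_id="P07",
    scenario="formal_client_meeting",
    persona="saanvi",
    response_words=14,
    formality_score=0.61,
    on_topic=True,
)
print(asdict(turn))  # ready to append to an analysis table or event log
```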

 

UXR Impact

Leading this research initiative, I bridged the gap between UX, Learning Science, and AI Engineering to revolutionize our approach to skill measurement.

Through structured yet creative experimentation, we not only accelerated our AI validation process from months to weeks but also fostered a collaborative culture where technical rigor met playful innovation.
