I’ve been working on something fun that I want to share with the design community: a ‘quick and dirty’ usability score for AI that I’m calling the AI Experience Score (AES). Not to be confused with the Advanced Encryption Standard (the name is still a work in progress), it’s inspired by the System Usability Scale and all the evaluative scores that came since.
Back in March 2024 I did some desk research to see how people build amazing AI products. I looked at HCI papers on AI, agentic design, and human-machine interaction. What I found was a distinct lack of specific guidance or methods. How do we run usability testing with AI interfaces? How do we measure how well we did? No one had the answer. So I set out to find a way to objectively measure how well an AI solution is performing.
Eventually I came across the original System Usability Scale paper and thought, “hey, I can follow the same process and make something useful!”.
With the help of many colleagues, I put together a basic questionnaire based on the SUS and UX-Lite—yes, Jeff Sauro is a bit of a hero of mine—and tested it across two rounds of user testing. I tweaked the wording of the questions in between to align with the 5 key principles of Human-Centred AI [1]:
- Usefulness
- Ease of Use
- Trustworthiness
- Controllability
- Empowerment
Xu also mentions Scalability and Sustainability, but I deemed these to be decided at the model level rather than at the level of a specific AI interface.
How to use AES
The questions are (5-point Likert; Strongly Disagree – Strongly Agree):
- The agent’s capabilities match my needs
- It was easy to achieve what I wanted using the agent
- I trust the agent’s responses
- Using the agent enhances my own capabilities
- I can consistently get the answers I want to my questions
This formula gives a final score out of 100:

AES = ((∑Q1–5) − 5) × (100/25) + 10

(Sum the five scores, subtract 5, multiply the result by 4, then add 10.)
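To make the arithmetic concrete, here’s a minimal sketch in Python of how one respondent’s score could be computed. The function name and example responses are my own, purely illustrative:

```python
def aes_score(responses):
    """Compute the AES from five Likert responses (1 = Strongly Disagree, 5 = Strongly Agree)."""
    if len(responses) != 5 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("Expected five responses, each between 1 and 5")
    # Sum the scores, subtract 5, multiply by 100/25 (= 4), then add 10
    return (sum(responses) - 5) * (100 / 25) + 10

# Example: a respondent answering 4, 5, 3, 4, 4 across the five questions
print(aes_score([4, 5, 3, 4, 4]))  # 70.0
```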
Administer this summative questionnaire after several scenarios at the end of a usability testing session, or add it to a contextual survey on your website to get a longitudinal view of the score.
Validating the score
After two rounds of testing and iteration, the final scaled score was reliable (Cronbach’s alpha = .88) and correlated with NPS (r = .80, p < .001, n = 36) and CSAT (r = .92, p < .001, n = 18).
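If you want to run the same reliability and correlation checks on your own data, here’s a rough sketch in Python. The file name and column names (Q1–Q5, NPS) are placeholders rather than my actual dataset; Cronbach’s alpha is computed directly from the item variances, and the correlation uses SciPy:

```python
import pandas as pd
from scipy.stats import pearsonr

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of Likert scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each question
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# One row per participant, columns Q1..Q5 plus an NPS (or CSAT) column
df = pd.read_csv("aes_responses.csv")              # placeholder file name
items = df[["Q1", "Q2", "Q3", "Q4", "Q5"]]

print("Cronbach's alpha:", cronbach_alpha(items))

# Scale each participant's responses to an AES score, then correlate with NPS
aes = (items.sum(axis=1) - 5) * (100 / 25) + 10
r, p = pearsonr(aes, df["NPS"])
print(f"Correlation with NPS: r = {r:.2f}, p = {p:.3f}")
```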
Though the results are encouraging, there are obvious limitations. I would really love for the UX community to test the score on a real product with larger samples and tell me how it went. I’m also keen to hear ideas and feedback on how to improve it.