Designing the voice interface for Capital One’s Alexa skill

Amazon Echo Dot by a laptop and keyboard

I worked with Capital One’s Director of Content Strategy and a product manager to design the home and auto loan voice experience on Alexa. We launched these 2 new features in July 2016.

As the content designer on the project, I wrote the conversations that Capital One customers could have with Alexa about their home and auto loans. I also updated the web content for the product landing page and the app copy for the Skill card. 

Home loan voice features:

  • “When is my next mortgage payment?”
  • “How much do I have to pay for my next mortgage payment?”
  • “Pay this month’s mortgage.”
  • “What’s my principal balance for home loan?”

Auto loan voice features:

  • “How much until my car is paid off?”
  • “When’s my next car payment due?”
  • “How much is my next car payment?”
  • “Make my car payment for [month].”
  • “What’s the principal balance on my car loan?”

You can read about the launch on Digital Trends, Bloomberg, CNET, Business Insider, and PYMNTS.


Reaching Feature Parity with Other Lines of Business

The features were part of the overall effort to help customers manage their money on their terms: anytime and anywhere. Previously, the team had launched the MVP of the Capital One Skill for Alexa during SXSW 2016. Customers could ask for their bank and credit card balances, pay their bills, and hear recent transactions.

We were tasked with creating an MVP experience for helping customers access their home and auto loan accounts. Some of the content was already written by the Director of Content Strategy when I started the project, so my role was to flesh out the experience and get it ready for launch.

Timeline: 8 weeks

Key Considerations

Designing for the right expectations

To continue building trust with people in a new channel, we needed to design with the system’s capabilities and limits in mind. We wanted our Skill to appear smart without exceeding the system’s ability to be smart.

Reducing cognitive load

People brought their own preconceived ideas about how to interact with our Skill. Using words they would use was key to creating good experiences, along with paying attention to voice inflection and cadence. We wanted to help people complete tasks as quickly as possible.

Building hypotheses into responses

We wanted to make smart assumptions but not the wrong ones. We treated content as hypotheses to answer the question, “Do people even want to do/hear this?” Since Amazon didn’t share platform data with Capital One, the product team could only make inferences about users’ questions and language based on which intents were invoked.

Limiting personality

Personality differentiates an experience, but it shouldn’t get in the way. For the MVP, we kept the personality to a minimum, with plans to explore appropriate places to inject it.


Aligning Product and Design

Every content strategy project sits on a foundation called Content Pillars, which guide the design of every piece of product content we put out, including voice UI. Content Pillars helped align the product and design teams around what the experience should be. I championed them often in my conversations with the product team.

Content Pillars:

  1. Use case specific
  2. Contextually relevant
  3. Natural language

Design Principles:

  • Design to answer their questions first; ask clarifying questions second.
  • Push the limits of what’s possible by talking like a normal person. Don’t lead with traditional bank syntax.
  • Design for specificity, not for scale. Scale what works; we can’t get there by being generic.
  • Alexa isn’t sarcastic! If statements could end with “…, asshole” – rewrite it.

Understanding How Alexa Works

To get acquainted with a new medium, I spent time understanding the technology behind Alexa and did desk research to understand best practices for voice design.

An illustration of high-level system architecture to show how Amazon's Alexa responds to a customer's question about their account.

From there, I outlined a high-level structure of a basic conversation with Alexa.

An interaction model of a generic conversational structure for Capital One's Alexa skill.

Defining the Architecture of Use Cases

What brings users here? What need are we solving for? To imagine different possibilities of why someone might interact with the skill in the first place, I created a list of possible users, situations, motivations, and outcomes using the Jobs To Be Done framework.

Jobs-to-be-done statements to categorize different customer cognitive states.
The conversation between Alexa and a customer would change based on the person’s awareness of their own finances and the likelihood of filling all required slots.

This exercise helped me gather my assumptions and orient my design around users’ mindsets and backgrounds.

Since the design was already scoped, I went through each use case and mind mapped possible conversation paths and necessary user inputs. Asking a bunch of “what if” questions helped us uncover stress cases as well as point to where some conversation paths should diverge or converge.

Mind map to visualize potential use cases for auto loans, in the form of "what if" questions.
Example of one use case I mind mapped.

During this time, I worked with the Director of Content Strategy to refine our design direction. Eventually, we arrived at a high-level conversation architecture for both auto loan and home loan.

Home loans and auto loans conversation architecture, including the use cases our team was designing for.
Since asking about payment due date and amount signaled a higher intent for paying a bill, the conversation for those use cases flowed into the pay bill use case, whereas other use cases were one-off conversations.
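
One way to picture this architecture is as a small lookup table from each use case to the use case it hands off to. The sketch below is only an illustration: the intent names are made up for this write-up and are not the identifiers used in the actual skill.

```python
from typing import Optional

# Hypothetical intent names, invented for illustration only.
PAY_BILL = "PayBill"

# Each use case maps to the use case it flows into, or None for a one-off.
FOLLOW_UP = {
    "PaymentDueDate": PAY_BILL,   # asking "when" signals intent to pay
    "PaymentAmount": PAY_BILL,    # asking "how much" signals intent to pay
    "PrincipalBalance": None,     # one-off conversation
    "PayoffAmount": None,         # one-off conversation
    PAY_BILL: None,               # end of the flow
}


def next_use_case(current: str) -> Optional[str]:
    """Return the use case a conversation flows into, if any."""
    return FOLLOW_UP.get(current)
```

Keeping the hand-offs in one table makes it easy to see at a glance which conversations converge on paying a bill and which ones end after a single answer.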

With starting points in place, I created paths to let customers explore where they want to go next. This meant writing conversation scripts that accommodated divergence from “happy paths” as well as error cases.

The voice UX design process starts with writing conversations that let customers explore different conversation paths. Each conversation is written out in an if-then format.

The Director of Content Strategy and I used pair writing to explore possible flows. Writing and editing together helped us keep the interaction honest and useful. It also let us add to our list of hypotheses to test later.

Keeping the Voice UI Contextually Relevant

Each context changes the expectations that people bring to an interaction. To get it right the first time, we needed to consider the different factors that could influence people’s experience with the skill. This meant meeting people where they were at with the language that specifically fit their situations.

I mind mapped various contexts that could influence how people might interact with each use case.

Mind map of different categories of contextual factors that might affect customer interactions with Capital One's Alexa skill.

These contexts were:

  • environmental
  • personal
  • technological
  • social and cultural
  • temporal
  • business

(The contexts that I had the most control over are bolded.)

Understanding possible contexts helped only if you broke down whole conversations into component parts. This was especially important for reducing the cognitive load of voice UI. I did this by breaking sentences into nouns and verbs.

Example voice UI for asking about the amount and due date for this month's auto loan payment.

Take the above conversation. Consider the noun order in Alexa’s response when a user asks:

[Customer] Alexa, what’s my car payment for this month?

Example voice UI revision for making car payments. The change reduced the cognitive load of Alexa's response.

The data that the customer is interested in is $1,231. The cognitive load would be lessened if it were put at the end of a sentence rather than at the beginning.

Also, with nouns and verbs labeled, it was easier to create unambiguous prompts. It’s clear that “Would you like to pay it now?” is a yes/no question referring to the due amount. Calls-to-action should present clear choices and minimize interpretation.
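
The ordering idea can be sketched as a response template. This is a rough illustration, not the skill's actual code: the sentence wording and the "July 15" date are placeholders I made up, while the $1,231 amount and the yes/no prompt come from the example above.

```python
def payment_response(amount: str, due_date: str) -> str:
    """Build a response that saves the dollar amount for the end of the sentence."""
    # Hypothetical wording: context first, the number the customer asked for last.
    statement = f"Your car payment due on {due_date} is {amount}."
    # A yes/no call-to-action that can only refer to the amount just stated.
    prompt = "Would you like to pay it now?"
    return f"{statement} {prompt}"


# "July 15" is a placeholder date for illustration.
print(payment_response("$1,231", "July 15"))
```

Because the amount is the last thing the customer hears before the prompt, "it" in "Would you like to pay it now?" has only one plausible referent.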

Sentence structure became even more important for responses with multiple sets of nouns and verbs. Below, I switched the order of the dollar amount and the explanation of principal balance for customers with multiple auto and home loans.

Differentiating voice UI for separate use cases for customers who have both home and auto loans through Capital One.
Differentiating the experience also meant differentiating the language.

Breaking sentences into nouns and verbs also helped me clarify the meanings of words. For example, the word balance could mean your:

  • credit card balance
  • checking/savings account balance
  • home/auto loan balance

I clarified “balance” to mean “home loan principal balance.” The trade-off was less concision, but the payoff was greater clarity.

By breaking the voice UI into parts, I identified where we needed customer inputs and what data we needed to return. This also allowed me to communicate with the engineering and product teams which synonyms customers might use.
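
A synonym list like the one I shared with engineering could be sketched as a simple mapping. The phrases below are assumptions made up for illustration (real lists would come from testing and usage data), and the function is not how Alexa itself resolves slots.

```python
from typing import Optional

# Hypothetical synonym lists, invented for illustration only.
SYNONYMS = {
    "home loan": ["home loan", "mortgage", "house payment"],
    "auto loan": ["auto loan", "car loan", "car payment"],
}


def resolve_account(utterance: str) -> Optional[str]:
    """Return the account type an utterance refers to, or None if unclear."""
    text = utterance.lower()
    for account, phrases in SYNONYMS.items():
        if any(phrase in text for phrase in phrases):
            return account
    # A bare "balance" matches nothing, so the skill would need to ask
    # a clarifying question instead of guessing.
    return None
```

A bare "What's my balance?" returns `None` here, which is the case where the word "balance" alone is too ambiguous to answer.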

Refining the Language to Be As Natural As Possible

Natural language is context-specific, and sometimes, it needs a reality check. I refined the voice UI by role playing conversations with my team to get real-time feedback on my conversation design.

Role playing helped me:

  • identify points of anxiety
  • replace bank jargon
  • gut check for “natural-ness” of the conversation

Role playing also helped me develop hypotheses to test. Since we didn’t know which conversation paths customers would take, or which ones were worth spending more time designing for, we wanted to let them tell us what they wanted to do next. We could do this by designing hypotheses into the language itself.

For example, if customers missed their payments for more than 1 month, Alexa says:

[Alexa] Unfortunately, your amount is past due. I’m unable to share the due dates at this time. Please go online to get the information. 

My hypotheses were:

  1. People won’t need any more information beyond “Please go online to get the information.”
  2. This scenario doesn’t happen often enough to warrant a new use case.

We can then use data to figure out how many people actually need to go through that conversation path. The trick is to keep customers talking – if they stop talking, we stop learning what they want to do next.
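
As a rough sketch of that kind of check, assuming all we have is a list of which intents were invoked, the share of conversations that hit a given path could be counted like this. The intent names and the sample data are made up for illustration.

```python
from collections import Counter
from typing import List

# Made-up sample of invoked intents; the names are hypothetical.
invoked = [
    "PaymentDueDate",
    "PastDueNotice",
    "PaymentDueDate",
    "PayBill",
    "PaymentDueDate",
]


def path_share(events: List[str], intent: str) -> float:
    """Return the fraction of invocations that belong to one intent."""
    if not events:
        return 0.0
    return Counter(events)[intent] / len(events)


share = path_share(invoked, "PastDueNotice")
print(f"{share:.0%} of invocations hit the past-due path")
```

A consistently small share would support the second hypothesis; a growing one would be a signal to design a dedicated use case.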

Sometimes, role playing helped me identify new scenarios like missed payments for home loans.

Example voice UI for the mortgage due date use case.

Embedding hypotheses into the language prevented me from over-designing features that create more work for developers.

But nothing beats robotic interactions like saying your conversation out loud. What sounds good on paper doesn’t always translate well to voice. To test the voice UI, I ran every piece of the conversation through the voice simulator in the Alexa Skills Kit.

Voice Simulator on Amazon's Alexa Skills Kit used for functional testing.

I experimented with putting commas in certain places to arrive at the right pace and changed the wording to get an appropriate inflection that matched the situation.

Putting the Voice UI in Front of Real Customers

We tested the usability of voice UI with both Capital One and non-Capital One customers. The product manager moderated all the tests, and I took notes in the observation room. We used testing to:

  • Assess meaning of words
  • Identify conversation flows that don’t make sense 
  • Uncover new utterances
  • Empathize with how people felt while using our Skill

Through testing, I identified potential points of anxiety in the language. For example, at the end of the pay bill use case, Alexa says,

[Alexa] And keep in mind, you can cancel it online later.

This phrase addresses people’s fear of making a mistake, which puts control back into their hands.

By testing the voice UI, we made the conversation more relevant to customers’ lives, since generic conversations wouldn’t have provided much value to the interaction.

We also iterated the design with our engineering team. I worked with them to implement recommendations from Amazon’s Skill Certification Team. For example, I made the confirmation language more specific for the home loan pay bill use case.

Revised voice UI after reviewing the design with Amazon's Skill Certification Team.

We also altered conversations in other intents so our new features wouldn’t negatively impact the existing user experience. For example, in the Recent Transactions use case, listing additional accounts (i.e. auto and/or home loan) could overwhelm users. We added a selection step so people would only hear about relevant accounts.

[Alexa] Ok, you have the following accounts. Home loan, auto loan, checking, and credit card. Please say the account type you’d like the transactions for.
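
The selection prompt above depends on which accounts a customer actually has, so it could be assembled from a list. This is a minimal sketch under that assumption; the function names are mine, not the skill's.

```python
from typing import List


def join_for_speech(items: List[str]) -> str:
    """Join items the way a person would say them aloud."""
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def account_prompt(accounts: List[str]) -> str:
    """Build the account selection prompt for the customer's own accounts."""
    listed = join_for_speech(accounts)
    listed = listed[:1].upper() + listed[1:]  # sentence-case the list
    return (
        "Ok, you have the following accounts. "
        f"{listed}. "
        "Please say the account type you’d like the transactions for."
    )


print(account_prompt(["home loan", "auto loan", "checking", "credit card"]))
```

With all four account types, this reproduces the prompt quoted above; a customer with fewer accounts hears a shorter list.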


Design for learning

Instead of making a big assumption, let people reveal the next steps and the new use cases. People satisfice in different ways, and what “nailing it” means is different for everyone. Ultimately we are modeling people, not systems; we use the system to refine our understanding about the people we’re serving.

What natural language means will differ with different kinds of people

Design at an atomic level so we don’t use generic, meaningless (or wrong) language that will work for a system but not for the person. Every single sentence or word should be a hypothesis we could test to see what “natural” means to people.

Don’t be limited by the technology or the system in place

Challenge convention: in moments of truth, ask “why shouldn’t we be able to do (this)?” We’re here to serve people, so technology should work around people, not limit them. You’re not annoying the engineering team (maybe a little bit); you’re being an advocate for the customers you’re designing for.
