Launching a voice interface for Capital One’s Alexa skill

I worked with Capital One’s Director of Content Strategy and a product manager to design the home and auto loan experience on Alexa. We launched these 2 new features in July 2016.

As the content designer on the project, I helped design the voice interface for the home and auto loan features. I was also responsible for go-to-market deliverables such as updated content for the FAQ page and the Skill Card on the Amazon Alexa app.

Home loan features:

  • When is my next mortgage payment?
  • How much do I have to pay for my next mortgage payment?
  • Pay this month’s mortgage.
  • What’s the principal balance on my home loan?

Auto loan features:

  • How much until my car is paid off?
  • When’s my next car payment due?
  • How much is my next car payment?
  • Make my car payment for [month].
  • What’s the principal balance on my car loan?

You can read about the launch on Digital Trends, Bloomberg, CNET, Business Insider, and PYMNTS.


Reach feature parity with other lines of business.

The features were part of the overall effort to help customers manage their money on their terms – anytime and anywhere. Previously, the team launched the MVP of the Capital One Skill for Alexa during SXSW 2016. Customers could ask for their bank and credit card balances, pay their bills, and hear recent transactions.

We were tasked with creating an MVP experience for helping customers access their home and auto loan accounts as well. Some of the content was already written by the Director of Content Strategy when I started the project, so my role was to flesh out the experience and get it ready for launch.

Timeline: 8 weeks


Designing for the right expectations. To continue building trust with people in a new channel, we needed to design with the system’s capabilities and limits in mind. We wanted our Skill to appear smart without exceeding the system’s ability to be smart.

Reducing cognitive load. People brought their own preconceived ideas about how to interact with our Skill. Using the words they would use was key to creating good experiences, along with paying attention to voice inflection and cadence. We wanted to help people complete tasks as quickly as possible.

Building hypotheses into responses. We wanted to make smart assumptions but not the wrong ones. We treated content as hypotheses to answer the question, “Do people even want to do/hear this?” Since Amazon didn’t share platform data with Capital One, the product team could only make inferences about users’ questions and language based on which intents were invoked.

Limiting personality. Personality differentiates an experience, but it shouldn’t get in the way. For the MVP, we kept the personality to a minimum, with plans to explore appropriate places to inject it.


Aligning Product and Design

Every content strategy project sits on a foundation called Content Pillars, which guides the design of every piece of product content we put out, including voice UI. The Content Pillars helped align the product and design teams around what the experience should be, and I championed them often in my conversations with the product team.

The 3 Content Pillars:

  1. Use case specific
  2. Contextually relevant
  3. Natural language

Design Principles:

  • Design to answer their questions first; ask clarifying questions second.
  • Push the limits of what’s possible by talking like a normal person. Don’t lead with traditional bank syntax.
  • Design for specificity, not for scale. Scale what works; we can’t get there by being generic.
  • Alexa isn’t sarcastic! If a statement could end with “…, asshole,” rewrite it.

Understanding How Alexa Works

Because this was a completely new medium, I spent time understanding the technology behind Alexa and did desk research to understand best practices for voice design.

From there, I outlined a high-level structure of a basic conversation with Alexa.
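That basic structure can be sketched in code: Alexa matches what a customer says to an intent (plus any slots), the skill returns text, and Alexa speaks it. The sketch below is a hypothetical illustration – the intent name, sample utterances, and response are invented, not Capital One’s actual skill.

```python
# Hypothetical sketch of the basic Alexa loop:
# speech -> intent (+ slots) -> text response -> Alexa speaks it.
interaction_model = {
    "intents": [
        {
            "name": "NextCarPaymentIntent",            # invented intent name
            "slots": [{"name": "PaymentDate", "type": "AMAZON.DATE"}],
            "samples": ["when's my next car payment due"],
        },
    ],
}

def route(intent_name):
    """Dispatch a recognized intent to a spoken response."""
    if intent_name == "NextCarPaymentIntent":
        return "Your next car payment is due on July 1st."
    return "Sorry, I can't help with that yet."
```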

Defining the Architecture of Use Cases

What brings users here? What need are you solving for? To imagine different possibilities of why someone might interact with the skill in the first place, I created a list of possible users, situations, motivations, and outcomes using the Jobs To Be Done framework.

The conversation between Alexa and a customer would change based on the person’s awareness of their own finances and the likelihood of filling all required slots.

This exercise helped me gather my assumptions and orient my design around users’ mindsets and backgrounds.
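The slot-filling dynamic mentioned above can be sketched as a simple branch: when a required slot is missing, the skill asks a clarifying question instead of failing. (The function, slot name, and wording below are hypothetical.)

```python
# Hypothetical sketch: a required slot (the month) may or may not be
# filled, and the conversation changes accordingly.
def handle_pay_car_loan(slots):
    month = slots.get("Month")
    if month is None:
        # Path for customers who left a required slot unfilled.
        return "For which month would you like to make your car payment?"
    return f"Ok. Would you like to pay your {month} car payment now?"
```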

Since the design was already scoped, I went through each use case and mind mapped possible conversation paths and necessary user inputs. Asking a bunch of “what if” questions helped me uncover stress cases as well as point to where some conversation paths should diverge/converge.

Auto Loan Use Case Mind Map

Example of one use case I mind mapped.

During this time, I worked with the Director of Content Strategy to refine our design direction. Eventually, we arrived at a high-level conversation architecture for both auto loan and home loan.

Home and Auto Loans Conversation Architecture

Since asking about payment due date and amount signaled a higher intent for paying a bill, the conversation for those use cases flowed into the pay bill use case, whereas other use cases were one-off conversations.

With starting points in place, I created paths to let customers explore where they want to go next. This meant writing conversation scripts that accommodated divergence from “happy paths” as well as error cases.
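One way to picture that architecture is as a small lookup from each intent to its follow-up node: higher-intent questions hand off to the pay-bill flow, while one-off use cases answer and end. (Intent and node names here are invented for illustration.)

```python
# Hypothetical encoding of the conversation architecture.
NEXT_STEP = {
    "AskDueDate": "OfferPayBill",    # signals intent to pay -> hand off
    "AskDueAmount": "OfferPayBill",  # signals intent to pay -> hand off
    "AskPrincipalBalance": None,     # one-off: answer, then end session
}

def follow_up(intent):
    """Return the next conversation node, or None to end."""
    return NEXT_STEP.get(intent)
```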

The Director of Content Strategy and I used pair writing to explore possible flows. Writing and editing together helped us keep the interaction honest and useful. It also let us add to our list of hypotheses to test later.

Keeping the Voice UI Contextually Relevant

Each context changes the expectations that people bring to an interaction. To get it right the first time, we needed to consider the different factors that could influence people’s experience with the skill. This meant meeting people where they were, with language that specifically fit their situations.

I mind mapped various contexts that could influence how people might interact with each use case.

These contexts were:

  • environmental
  • personal
  • technological
  • social and cultural
  • temporal
  • business


Understanding possible contexts helped only when whole conversations were broken down into component parts. This was especially important for reducing the cognitive load of voice UI. I did this by breaking sentences into nouns and verbs.

Take the conversation above, and consider the noun order in Alexa’s response when a customer asks:

[Customer] Alexa, what’s my car payment for this month?

The data point the customer cares about is $1,231. The cognitive load is lessened when it comes at the end of the sentence rather than at the beginning.

Also, with nouns and verbs labeled, it was easier to create unambiguous prompts. It’s clear that “Would you like to pay it now?” is a yes/no question referring to the due amount. CTAs should present clear choices and minimize interpretation.
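As a rough illustration, the two orderings can be compared side by side (the figures and exact phrasings are illustrative, not the shipped script):

```python
amount, due_date = "$1,231", "July 1st"

# Key datum up front: the listener must hold "$1,231" in memory
# while parsing the rest of the sentence.
front_loaded = f"{amount} is your car payment, due on {due_date}."

# Key datum at the end, followed by an unambiguous yes/no prompt.
end_loaded = (f"Your car payment, due on {due_date}, is {amount}. "
              "Would you like to pay it now?")
```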

Sentence structure became even more important for responses with multiple sets of nouns and verbs. Below, for customers with multiple auto and home loans, I swapped the order of the dollar amount and the explanation of the principal balance.

Multiple Auto and Home Loans

Differentiating the experience also meant differentiating the language.

Breaking sentences into nouns and verbs also helped me clarify the meanings of words. For example, the word balance could mean your:

  • credit card balance
  • checking/savings account balance
  • home/auto loan balance

I clarified “balance” to mean “home loan principal balance.” The trade-off is less concision; the payoff is clarity.

By breaking the voice UI into parts, I identified where we needed customer inputs and what data we needed to return. It also let me communicate to the engineering and product teams which synonyms customers might use.
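A sketch of the kind of synonym mapping that could be handed to engineering – the pairs below are assumed examples of customer phrasings resolving to one canonical account term:

```python
# Hypothetical synonym map: different customer phrasings resolve
# to a single canonical loan term before the skill responds.
LOAN_SYNONYMS = {
    "car loan": "auto loan",
    "car note": "auto loan",
    "mortgage": "home loan",
    "house loan": "home loan",
}

def canonical_loan(phrase):
    """Normalize a customer's phrasing to its canonical account term."""
    phrase = phrase.lower()
    return LOAN_SYNONYMS.get(phrase, phrase)
```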

Refining the Language to Be As Natural As Possible

Natural language is context-specific, and sometimes it needs a reality check. I refined the voice UI by role playing conversations with my team to get real-time feedback on my conversation design.

Role playing helped me:

  • identify points of anxiety
  • replace bank jargon
  • gut check the “naturalness” of the conversation

Role playing also helped me develop hypotheses to test. Since we wouldn’t know which conversation paths customers would take (and which ones were worth spending more time designing for), we wanted to let them tell us what they wanted to do next. We could do this by designing hypotheses into the language itself.

For example, if customers missed their payments for more than 1 month, Alexa says:

[Alexa] Unfortunately, your amount is past due. I’m unable to share the due dates at this time. Please go online to get the information. 

My hypotheses were:

  1. People wouldn’t need any more information beyond “Please go online to get the information.”
  2. The scenario wouldn’t happen often enough to warrant a new use case.

We can then use data to figure out how many people actually need to go through that conversation path. The trick is to keep customers talking – if they stop talking, we stop learning what they want to do next.

Sometimes, role playing helped me identify new scenarios, such as missed home loan payments.

Embedding hypotheses into the language kept me from over-designing features that would create more work for developers.

But nothing roots out robotic interactions like saying your conversation out loud. What sounds good on paper doesn’t always translate well to voice. To test the voice UI, I ran every piece of the conversation through the voice simulator in the Alexa Skills Kit.


I experimented with putting commas in certain places to arrive at the right pace and changed the wording to get an appropriate inflection that matched the situation.
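For finer control than comma placement, the Alexa Skills Kit also supports SSML markup such as `<break>` for explicit pauses. A hypothetical response (not the shipped script) with a deliberate pause before the yes/no prompt might look like:

```python
# SSML gives explicit control over pacing; here a 400ms pause
# separates the answer from the follow-up prompt.
ssml_response = (
    "<speak>"
    "Your car payment, due on July 1st, is $1,231."
    '<break time="400ms"/>'
    "Would you like to pay it now?"
    "</speak>"
)
```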

Putting the Voice UI in Front of Real Customers

We tested the usability of the voice UI with both Capital One and non-Capital One customers. The product manager moderated all the tests, and I took notes in the observation room. We used testing to:

  • Assess meaning of words
  • Identify conversation flows that don’t make sense 
  • Uncover new utterances
  • Empathize with how people felt while using our Skill

Through testing, I identified potential points of anxiety in the language. For example, at the end of the pay bill use case, Alexa says,

[Alexa] And keep in mind, you can cancel it online later.

This phrase addresses people’s fear of making a mistake and puts control back into their hands.

By testing the voice UI, we made the conversation more relevant to customers’ lives; generic conversations wouldn’t have added much value to the interaction.

We also iterated the design with our engineering team. I worked with them to implement recommendations from Amazon’s Skill Certification Team. For example, I made the confirmation language more specific for the home loan pay bill use case.

We also altered conversations in other intents so our new features wouldn’t negatively impact the existing user experience. For example, in the Recent Transactions use case, listing additional accounts (i.e. auto and/or home loan) could overwhelm users. We added a selection step so people would only hear about relevant accounts.

[Alexa] Ok, you have the following accounts. Home loan, auto loan, checking, and credit card. Please say the account type you’d like the transactions for.
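The added selection step could be sketched like this – the prompt mirrors the script above, while the helper function and account list are assumptions for illustration:

```python
# Sketch of the selection step: list the accounts, then require an
# explicit choice before reading transactions, so customers only
# hear about the relevant account.
ACCOUNTS = ["home loan", "auto loan", "checking", "credit card"]

def recent_transactions_prompt(accounts):
    """Build the account-selection prompt from the customer's accounts."""
    listed = ", ".join(accounts[:-1]) + ", and " + accounts[-1]
    return (f"Ok, you have the following accounts: {listed}. "
            "Please say the account type you'd like the transactions for.")
```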


Although there were plenty of lessons I took away, these 3 come to the fore:

Design for learning. Instead of making a big assumption, let people reveal the next steps and the new use cases. People satisfice in different ways, and what “nailing it” means is different for everyone. Ultimately we are modeling people, not systems; we use the system to refine our understanding about the people we’re serving.

What natural language means will differ with different kinds of people. Design at an atomic level so we don’t use generic, meaningless (or wrong) language that will work for a system but not for the person. Every single sentence or word should be a hypothesis we could test to see what “natural” means to people.

Don’t be limited by the technology or the system in place. Challenge convention: in moments of truth, ask “why shouldn’t we be able to do this?” We’re here to serve people, so technology should work around people, not limit them. You’re not annoying the engineering team (ok, maybe a little); you’re advocating for the customers you’re designing for.


Some resources that helped me think about and design for conversation:

  1. Designing the Conversational UI
  2. Alexa Skills Kit Voice Design Best Practices
  3. What is conversation? How can we design for effective conversation?
  4. Conversational Alignment
