We are fast moving from the app era to the era of the intelligent agent. Over the next half decade, we’ll witness the birth of hundreds if not thousands of autonomous intelligent agents. By definition, these agents complete entire jobs by themselves, which means they must learn to understand us and our objectives. Therein lies the technical challenge—for humans often don’t say what they mean. Worse, we believe that we’re being clear when our communications are riddled with ambiguity.

Take this simple statement, which was made to Amy, our AI personal assistant:

“Sure. How about Monday-Wednesday, perhaps 8? —Max”

  • Which Monday?
  • Which Wednesday?
  • Was it a range suggestion that includes Tuesday?
  • 8AM or 8PM?

Amy’s job is to schedule meetings. And for her to do so, we spend a massive amount of time and energy trying to figure out how to decipher a sentence like the one above with near 100% accuracy. And lest you doubt why we need to train Amy to be almost perfect, imagine what happens if we don’t: You’ve scheduled a meeting with a new business contact—she’s sitting at the Starbucks on West 21st Street and you’re sitting in another one three blocks away. You are both confused and unhappy.

The fact that humans are notoriously imperfect communicators means that our entire product rests on our ability to excel at NLP (Natural Language Processing).

In fact, the pressure on accuracy and the nature of AI make building Amy (and her brother Andrew) much more like building hardware than building your typical software product, for two reasons: getting to MVP is very labor- and time-intensive (read on), and we are building the tools to build the machine as we build the machine (read the next post in this series).

Let’s take the first piece, getting to MVP. Our MVP is an AI assistant who can seamlessly and nearly flawlessly schedule meetings for you via email.

To get Amy to do this one job extraordinarily well, we need to amass a huge volume of training data; how else will Amy learn about meetings and their key components (time, people, and location)? Google Self-Driving Cars have driven over 2 million miles in the past 6 years. We’ve been doing the same with meetings for nearly two years. That means Amy and Andrew have sent and received millions of meeting-related emails.  

At the same time that we’re amassing the data to teach Amy, we need to build and refine the models used to process all those emails and bring Amy to life.

This is where Machine Learning comes in. We can’t write a program full of business rules that would understand those emails well enough to enable Amy to schedule your meetings. Scheduling a meeting is simply too complex for this kind of deterministic logic. On average, it takes eight emails, including the invite, to schedule a single meeting. Each of those meeting conversations contains many pieces of relevant information, with an intricate set of dependencies specific to that particular conversation. The vast scale and heterogeneous structure of all of that data makes it impossible to write enough rules to encapsulate, “What does an email about a meeting mean?”

Instead, we use techniques from machine learning to learn from that data. In this process, we write programs which consume all of that data and produce models. These models allow Amy to develop specific capabilities, such as knowing what a time, location, or participant is. She also needs to be able to understand what the sender intended to tell Amy or the other participants in the email thread. Machine learning techniques build up these skills in models much the way that people do, by reading a lot of emails and seeing what happened next. Once the system has applied these machine-learned models to a given email and figured out what the sender meant, then the system can use rules to produce the logical response to help move the conversation one step closer to a scheduled meeting.

So in the example above, when Amy gets the note “Sure. How about Monday-Wednesday, perhaps 8?” she would classify “Monday-Wednesday” and “8” as temporal references. Based on our dialog model, earlier emails in the thread, user preferences, guest information, and previous history, Amy would venture a specific date and time.
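To make the ambiguity concrete, here is a minimal sketch that lists the readings Amy has to choose between for Max's reply. It is not Amy's implementation: the function name `candidate_interpretations` is hypothetical, and hand-written regular expressions stand in for the machine-learned models described above.

```python
import re

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
DAY = "|".join(WEEKDAYS)


def candidate_interpretations(text):
    """Enumerate readings of a reply like 'Monday-Wednesday, perhaps 8'.

    Hypothetical helper: regexes stand in for learned models.
    """
    lowered = text.lower()
    range_match = re.search(rf"\b({DAY})\s*-\s*({DAY})\b", lowered)
    hour_match = re.search(r"\b(1[0-2]|[1-9])\b", lowered)
    if not range_match or not hour_match:
        return None

    start = WEEKDAYS.index(range_match.group(1))
    end = WEEKDAYS.index(range_match.group(2))
    span = (end - start) % 7 + 1  # number of days if the range is inclusive
    hour = int(hour_match.group(1)) % 12
    return {
        # "Monday or Wednesday": only the two named days are on offer
        "days_if_two_options": [WEEKDAYS[start], WEEKDAYS[end]],
        # "Monday through Wednesday": Tuesday is included too
        "days_if_inclusive_range": [WEEKDAYS[(start + i) % 7] for i in range(span)],
        # morning and evening readings of the bare hour
        "hours_24h": [hour, hour + 12],
    }


readings = candidate_interpretations("Sure. How about Monday-Wednesday, perhaps 8?")
```

Even this toy version leaves "which Monday?" unanswered. That question can only be settled with context from outside the sentence, such as the date of the email and the rest of the thread.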

That’s not the end of the process. Building a system with near 100% accuracy requires a robust data set. And to do this, we insist on a very high “confidence level” for each piece of data that is extracted and labeled by the machine.

So we deploy a process called Supervised Learning to make sure the machine is extracting and classifying the data correctly (or at least as logically as humanly possible). In this process, humans, our AI trainers, verify the machine-generated annotations. They’re invoked when the system has a low degree of confidence in its classification of a particular data point. And we’ve developed very specific guidelines for data labeling; our guidelines for marking up time alone run 16 pages long!
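The routing step described here can be sketched in a few lines. This is an illustration under stated assumptions, not our production code: the `Annotation` type, the `route` function, and the 0.95 threshold are all hypothetical.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.95  # assumed value, chosen only for illustration


@dataclass
class Annotation:
    text: str          # fragment the machine extracted
    label: str         # label the machine proposed
    confidence: float  # model's confidence in that label, 0.0 to 1.0


def route(annotations, threshold=REVIEW_THRESHOLD):
    """Split annotations into auto-accepted ones and ones needing a human."""
    accepted, needs_review = [], []
    for annotation in annotations:
        if annotation.confidence >= threshold:
            accepted.append(annotation)
        else:
            needs_review.append(annotation)
    return accepted, needs_review


accepted, needs_review = route([
    Annotation("Monday-Wednesday", "time", 0.99),
    Annotation("8", "time", 0.62),
])
```

In this example the bare "8" falls below the threshold, so it would go to an AI trainer, who accepts or corrects the label before the data is used.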

AI trainers spend their days looking at fragments of text that the machine has extracted and then accepting or modifying the label the machine has offered. This process ensures that dirty, mislabeled data doesn’t pollute our models. In instances of ambiguity, once a human verifies the annotation in question, the data is fed back into the system, which then compiles a meaningful response to the sender.

AI trainers are key to Amy’s ultimate autonomy. Back to the Self-Driving car: the safety drivers take over when the car isn’t performing optimally, long before there’s an accident, and then feed information about that scenario back to the data science team, who improve the software models so that the next time the car encounters a similar situation, it navigates it better. The same is true for Amy. We’d rather Amy and Andrew take 15 minutes longer to schedule a meeting, and get it right, than speed through the process and get it wrong.

There is, of course, a huge payoff in insisting on this level of accuracy. Back to that ambiguous sentence. For a human to correctly interpret Max’s reply (“Sure. How about Monday-Wednesday, perhaps 8?”), he’d have to read back through the entire thread of the scheduling conversation in the hopes of gathering sufficient context. And he still might not succeed, because he might not know that Max has set his scheduling hours for 8AM to 4PM. Anyone who schedules their own meetings knows that this kind of detective work can take several minutes. Here, the machine has the advantage. Once we’ve processed and correctly labeled a sufficient volume of data, Amy will be able to make sense of such a statement nearly instantaneously and will be able to do so across millions of emails every day.
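Max's scheduling hours show how one stored preference can settle the "8AM or 8PM?" question. The sketch below is a simplified illustration, assuming a hypothetical `resolve_hour` helper and treating the 8AM to 4PM window as start-inclusive and end-exclusive; the real system weighs many more signals.

```python
def resolve_hour(hour_12, window_start=8, window_end=16):
    """Pick the 24-hour reading of a bare hour that fits the scheduling window.

    Returns None when zero or both readings fit, meaning more context is needed.
    """
    candidates = [hour_12 % 12, hour_12 % 12 + 12]  # AM and PM readings
    fitting = [h for h in candidates if window_start <= h < window_end]
    return fitting[0] if len(fitting) == 1 else None


max_hour = resolve_hour(8)  # Max's "perhaps 8" with an 8AM to 4PM window
```

With that window, only the morning reading of "8" fits, so the ambiguity disappears without anyone rereading the thread.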

We know we’ve taken on an audacious technical problem: making a machine understand the dashed-off notes of busy people trying to meet up with colleagues, clients, and friends. We know that it takes time (and 60+ people) to build and train a system to behave as well as, and in some cases better than, a human assistant. But we are inspired by the possibility of democratizing the personal assistant and by all of the love our beta customers have shown us along the way, as we work to bring Amy and Andrew to life, and to the rest of the world.