Our AI personal assistant Amy schedules meetings. To do this, she must carry on a natural language dialog with your guests over email: she finds out when they’re available and, if they’re traveling, when they arrive in the city; she figures out when her boss is free; and she makes sure a location is agreed upon. Amy negotiates all of these meeting details on her boss’s (our customer’s) behalf, and she does this at a level on par with many human assistants. And many mistake her for one.
It is only natural that some have tried to unmask the humans behind the curtain, Oz-like, writing emails. But that is not how you build and train a system like ours and certainly not how you acquire a high-quality dataset—which is a painstaking process, done through precise annotation and verification of text.
Let’s step back for a moment. In the simplest of interpretations, we have a text extraction (read) and a text generation (write) challenge ahead of us.
On the “read” side, Amy must parse emails from you and your guests. Because we humans are often far less clear than we mean to be, this presents a huge, multi-year data science problem. To no one’s surprise, imparting intelligence to a machine is fantastically difficult, even if you and your 71 propellerhead friends are allowed to sit and geek on this for almost three years and counting. (More on that in a moment.)
The “write” side of things is a different story. Pretty much ALL of Amy and Andrew’s output is automated. Over these last three years, we’ve been able to amass enough understanding of the meeting scheduling universe for Amy to assemble a correct response in ~99% of her exchanges. Here are some of our recent annotation metrics:
We expect to be at 100% automated—including fallbacks, or cases in which Amy has to tell a guest or host she doesn’t understand them—on the “write” side come end of this year (2016).
These numbers represent a massive design effort on our part. We do not have a collection of static dialog templates. Rather, we’ve had to engineer fluid scenarios that can be dynamically compiled into responses on the fly.
Additionally, the success of the “write” side hinges on the “read” side, since information that the machine has extracted is funneled into the response.
On the “read” side we’re still training on intent prediction (what a customer or guest is trying to do, e.g. reschedule a meeting), entity normalization, composition, anchoring, and intent linking.* And we annotate a ton of data to improve our models. We fight for accuracy on each and every one of the intents (and three primary entities) that we need to extract.
For those intents that occur often (for example, setting up a new meeting or cancelling a meeting), we run at a very high level of accuracy, having essentially “solved” the data science problem; Amy can “read” these intents without any human help (in the form of training). The chart below shows how we achieved this level of accuracy for the “NEW_MEETING” intent.
Other, rarer intents, such as “SET_TRAVEL_TIME” (which depends on the location of a particular meeting and hence varies by meeting), still require additional training. And this is where the humans do actually come in, in the form of AI Trainers.
The job of the AI Trainer is to make sure our data is properly annotated, within extremely strict annotation guidelines. This is their entire job. It’s not necessarily sexy work, unless you are a data science geek, but it is essential and core to what we do. (It’s called Supervised Learning, and you can read about it in this earlier post.)
To do this, we queue up tens of thousands of data points (in a custom-built annotation console). If the machine has already annotated these data points, AI Trainers will validate and verify the annotations. If the machine has not picked up an entity correctly, AI Trainers re-annotate the data. It looks like this:
Human AI Trainers approve or fix any of the annotations as above. (That is an exact screen of the annotation console from the end of August, by the way.)
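For the data science geeks, the validate-or-fix loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the annotation console itself: `DataPoint`, `review`, and `agreement_by_intent` are hypothetical names, and the three example emails are made up. The intent labels are the ones mentioned above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class DataPoint:
    """One queued item (hypothetical structure, for illustration only)."""
    text: str
    machine_label: Optional[str]          # None if the machine produced nothing
    verified_label: Optional[str] = None  # filled in by an AI Trainer


def review(point: DataPoint, trainer_label: str) -> bool:
    """Store the trainer's label; return True if it confirms the machine's."""
    point.verified_label = trainer_label
    return point.machine_label == trainer_label


def agreement_by_intent(points: Iterable[DataPoint]) -> Dict[str, float]:
    """Per verified intent, the fraction of reviewed points the machine got right."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for p in points:
        if p.verified_label is None:
            continue  # not reviewed yet
        totals[p.verified_label] += 1
        hits[p.verified_label] += int(p.machine_label == p.verified_label)
    return {intent: hits[intent] / totals[intent] for intent in totals}


# Made-up examples: the machine pre-annotated all three as NEW_MEETING.
queue = [
    DataPoint("Can we meet next week?", "NEW_MEETING"),
    DataPoint("Let's find a time to talk.", "NEW_MEETING"),
    DataPoint("I need 30 minutes to get there.", "NEW_MEETING"),
]
review(queue[0], "NEW_MEETING")      # trainer approves
review(queue[1], "NEW_MEETING")      # trainer approves
review(queue[2], "SET_TRAVEL_TIME")  # trainer fixes the annotation
print(agreement_by_intent(queue))
```

Grouping agreement by the verified intent is one simple way to see which intents the machine already reads well and which still need more annotated data.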
As long as we continue to expand the skills which we believe Amy should have in order to do an even better job, AI Trainers will continue to be part of our team. Any new skills start with our willingness to annotate a new large dataset. And AI Trainers will continue to verify data to create a baseline performance for our models.
* Here’s a mini-glossary for all of you nerds who have read this far:
Normalization. Transforming text into a single representation before we use it as input downstream, so that the machine reads “thursday” and “THURSDAY” as the same word and treats them the same way.
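A minimal sketch of that idea in Python, assuming normalization here means nothing more than trimming whitespace and folding case (a real pipeline would likely do more). The function name `normalize_token` is hypothetical.

```python
def normalize_token(text: str) -> str:
    """Map surface variants of a word to one canonical representation."""
    # Assumption: trimming and case folding are enough for this illustration.
    return text.strip().casefold()


# Both spellings collapse to the same downstream input.
print(normalize_token("thursday") == normalize_token("THURSDAY"))
```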
Composition. Say we pick up all the time-related entities in this sentence: “Set up a call for next Monday or Wednesday any time after 1 PM. I’m in EST.” From this data, we need to “compose” a precise time range, with a correct starting time (1 PM EST) and an implicit ending time, that can be used in our system.
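As a rough illustration, the composition step for that sentence could look like the sketch below. Everything specific in it is an assumption made for the example: the function name `compose_ranges`, the fixed UTC-5 offset standing in for EST, the 6 PM implicit end of day, and the two calendar dates (a Monday and a Wednesday in September 2016), which are taken as already resolved.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import List, Tuple

EST = timezone(timedelta(hours=-5), "EST")  # assumed fixed offset, no DST handling
ASSUMED_DAY_END = time(18, 0)               # hypothetical implicit ending time


def compose_ranges(
    days: List[date],
    earliest: time,
    tz: timezone,
    day_end: time = ASSUMED_DAY_END,
) -> List[Tuple[datetime, datetime]]:
    """Combine extracted days, a start time, and a timezone into explicit ranges."""
    return [
        (
            datetime.combine(d, earliest, tzinfo=tz),
            datetime.combine(d, day_end, tzinfo=tz),
        )
        for d in days
    ]


# "next Monday or Wednesday any time after 1 PM. I'm in EST."
ranges = compose_ranges([date(2016, 9, 5), date(2016, 9, 7)], time(13, 0), EST)
for start, end in ranges:
    print(start.isoformat(), "to", end.isoformat())
```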
Anchoring. The expression “Wednesday” within the text doesn’t provide enough information for us to know which Wednesday is being referred to. Anchoring to, say, today’s date or ANY other date, allows us to map “Wednesday” to a specific date on the calendar.
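Here is a small Python sketch of anchoring a weekday name to a date. It assumes one particular policy, namely picking the first matching weekday on or after the anchor date; the policy, the `WEEKDAYS` list, and the function name `anchor_weekday` are all illustrative choices, not a description of the production system.

```python
from datetime import date, timedelta

WEEKDAYS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def anchor_weekday(expression: str, anchor: date) -> date:
    """Resolve a weekday name to the first such day on or after `anchor`."""
    target = WEEKDAYS.index(expression.strip().casefold())
    # date.weekday() returns 0 for Monday, matching the list order above.
    days_ahead = (target - anchor.weekday()) % 7
    return anchor + timedelta(days=days_ahead)


# The same word maps to different dates depending on the anchor.
print(anchor_weekday("Wednesday", date(2016, 9, 1)))
print(anchor_weekday("Wednesday", date(2016, 9, 12)))
```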
Intent linking. Identifying a location in an email does not by itself tell us that we are to meet there. Linking an intent to an entity (people, location, or time) helps us understand the customer’s intent regarding that entity (were they accepting or rejecting the location?).
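One way to picture the result of intent linking is as a data structure that pairs each entity with the stance taken toward it. The sketch below shows only that representation, with the links supplied by hand; it does not predict them. The names `Stance`, `Entity`, `LinkedEntity`, and `agreed_locations`, and the example locations, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Stance(Enum):
    ACCEPT = "accept"
    REJECT = "reject"


@dataclass(frozen=True)
class Entity:
    kind: str  # "people", "location", or "time"
    text: str


@dataclass(frozen=True)
class LinkedEntity:
    """An entity together with the intent expressed toward it."""
    entity: Entity
    stance: Stance


def agreed_locations(linked: List[LinkedEntity]) -> List[str]:
    """Keep only locations that were accepted, not merely mentioned."""
    return [
        item.entity.text
        for item in linked
        if item.entity.kind == "location" and item.stance is Stance.ACCEPT
    ]


# Two locations are mentioned, but only one is accepted.
linked = [
    LinkedEntity(Entity("location", "the Midtown office"), Stance.REJECT),
    LinkedEntity(Entity("location", "the cafe on 5th"), Stance.ACCEPT),
]
print(agreed_locations(linked))
```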