A peek at x.ai’s data science architecture

Amy and Andrew are AI personal assistants who schedule meetings for you. There’s no app. Nothing to download. Once you’ve agreed to meet with someone, just cc amy@x.ai and she’ll take over from there. You won’t hear from her again until she’s successfully negotiated a time and place to meet.
We modeled Amy and her twin brother Andrew after human assistants. (As it turns out, human assistants are really good at scheduling meetings; however, they’re expensive, which means most people don’t have one.)
We want both our customers and their guests to interact with Amy as if she were a human assistant, using natural language. That means we have to teach Amy to understand meeting scheduling conversations conducted in plain English.
So what goes on inside Amy and Andrew’s brains? How do you teach a machine to understand us? We’ve written about this in the past. In this post, we’ll dive deeper into our data science architecture and Amy’s inner workings, with a focus on location, one of the three key elements of a meeting.

Natural Language Processing

When someone cc’s Amy or Andrew, the machine first needs to “read” the email. Natural Language Processing (NLP) helps identify the meeting dialog, extract the correct meeting-related entities (time, people, location), and detect and classify the sender’s intentions, that is, what they’re trying to do (schedule a new meeting, cancel a meeting, change the location, etc.).
The first step in our NLP system is to find any existing conversations between the participants (we call this our Preprocessing Module). Since it’s possible (and in fact likely) that the same customer has multiple conversations going on with different guests, Amy needs to be sure she’s found the correct meeting thread for follow-up emails. This step also extracts email metadata, including the date, sender, recipients, and email body.
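To make the metadata step concrete, here is a minimal sketch using Python’s standard `email` package. It is not x.ai’s Preprocessing Module: the sample message, the example.com addresses, and the idea of using the `In-Reply-To` header as a thread key are all assumptions made for illustration.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical raw email; names and addresses are invented for this sketch.
RAW_EMAIL = """\
From: Angela <angela@example.com>
To: Lewis <lewis@example.com>
Cc: Amy <amy@x.ai>
Date: Mon, 06 Jun 2016 09:30:00 -0400
Subject: Coffee next week?
Message-ID: <msg-2@example.com>
In-Reply-To: <msg-1@example.com>

Hi Lewis, let's grab coffee at Starbucks near the x.ai office.
"""

def extract_metadata(raw):
    """Pull out the date, sender, recipients, body, and a thread key."""
    msg = message_from_string(raw)
    to_and_cc = msg.get_all("To", []) + msg.get_all("Cc", [])
    return {
        "date": parsedate_to_datetime(msg["Date"]),
        "sender": getaddresses([msg["From"]])[0][1],
        "recipients": [addr for _, addr in getaddresses(to_and_cc)],
        # Assumption: a reply points at its thread via In-Reply-To;
        # a first message starts a new thread under its own Message-ID.
        "thread_key": msg.get("In-Reply-To") or msg["Message-ID"],
        "body": msg.get_payload().strip(),
    }
```

A real system would need more than one header to match threads (people forward, re-send, and change subject lines), which is why the sketch treats the thread key as a simplification.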
Next, the system needs to segment the email body into different components, such as the main text of the email and signature (or disclaimer). As you might imagine, an address in the main text of an email has a different relevance to the meeting than an address in the signature, and Amy needs to understand these differences.
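As a rough illustration of segmentation, the sketch below splits a body into main text and signature at the first sign-off line. The sign-off list and the `segment_body` helper are hypothetical; a production segmenter would more likely be a trained model than a fixed rule.

```python
import re

# Hypothetical sign-off cues; a real system would learn these from data.
SIGN_OFF = re.compile(
    r"(?:--|best regards|best|regards|thanks|cheers|sincerely)[,!.]?",
    re.IGNORECASE,
)

def segment_body(body):
    """Split an email body into main text and signature at the first sign-off."""
    lines = body.strip().splitlines()
    for i, line in enumerate(lines):
        if SIGN_OFF.fullmatch(line.strip()):
            return {
                "main": "\n".join(lines[:i]).strip(),
                "signature": "\n".join(lines[i:]).strip(),
            }
    # No sign-off found: treat the whole body as main text.
    return {"main": "\n".join(lines).strip(), "signature": ""}
```

With this split, an address such as “25 Broadway” that appears only after the sign-off can be treated as signature context rather than as a proposed meeting place.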
Entity Extraction
After the initial preprocessing step, the email then moves into the entity extraction pipeline, where the machine arrives at an “understanding” of the contents of the email.
At x.ai, we break down the essential components of a meeting into three categories: time, location, and people. While each of these components is equally important, we’ll focus on location for now to give you a sense of how the system works.

Once the system has segmented the email body and structured the meeting-related data, it moves on to Location Detection. This is a traditional Named Entity Recognition task. Essentially, the machine is trying to find all words and phrases that people use to refer to a location. Below, you’ll see an email exchange with all the possible location words highlighted in blue:

We’re still experimenting with different architectures for the Named Entity Recognition task. We’ve tested Conditional Random Fields (CRFs), a type of discriminative undirected probabilistic graphical model widely used for sequence labeling and a typical choice for Named Entity Recognition. CRFs directly model the conditional distribution p(y|x), where y represents the attributes of the sequence we want to predict, and x represents the observed knowledge from the input sequence.
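For reference, the textbook linear-chain form of a CRF is shown below. The post does not say which variant was tested, so the feature functions f_k, their weights λ_k, and the sequence length T are standard notation assumed here rather than details of our system.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Here each y_t is the label of the t-th token (for example, location or not location), and Z(x) sums over every possible label sequence so that the probabilities add up to one.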
Our current architecture for Location Detection uses a Long Short-Term Memory model (LSTM), which is a recurrent neural network. Our model is a two-layer bidirectional LSTM model with auxiliary classifiers connected to the intermediate layer.
The next step in the Entity Extraction pipeline is Location Normalization. This is where the system connects the extracted text to real objects in the world. In this email, the system connects the 6-letter string “Canada” to the country, “coffee” to the activity of drinking coffee, “Starbucks” to the specific chain of coffee shops, and “x.ai office” to an address on a map.
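One simple way to picture normalization is a lookup from a mention to a typed object. The sketch below assumes a tiny hand-written knowledge base and a per-user table of named places; `NormalizedLocation`, `KNOWLEDGE_BASE`, and `normalize` are hypothetical names, and the actual system is certainly more involved than a dictionary.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class NormalizedLocation:
    mention: str                 # text as it appeared in the email
    kind: str                    # "country", "activity", "chain", "address", or "unknown"
    value: Optional[str] = None  # canonical value, if one was found

# Hypothetical knowledge base covering only the mentions in the example email.
KNOWLEDGE_BASE = {
    "canada": ("country", "CA"),
    "coffee": ("activity", "drinking coffee"),
    "starbucks": ("chain", "Starbucks"),
}

def normalize(mention, user_places):
    """Map a mention to a typed object, checking the user's own places first."""
    key = mention.lower().strip()
    if key in user_places:
        return NormalizedLocation(mention, "address", user_places[key])
    if key in KNOWLEDGE_BASE:
        kind, value = KNOWLEDGE_BASE[key]
        return NormalizedLocation(mention, kind, value)
    return NormalizedLocation(mention, "unknown")
```

Checking the user’s places before the shared table reflects the idea that “x.ai office” only resolves to an address because of what the system knows about that particular customer.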
After we understand each individual location mentioned, the system infers and enhances the meeting place based on broader context. We call this the Location Inference and Composition stage. In the email above, the system needs to know that Canada is not related to the meeting, since Angela said “I just got back from Canada.” The system also needs to understand that “coffee” refers to an activity rather than a specific place; the machine categorizes “coffee” as the meeting activity.
In this exchange, Angela has chosen to meet Lewis at “Starbucks.” But there’s a constraint: the Starbucks should be “near the x.ai office.” Based on Angela’s settings, Amy knows x.ai is located at 25 Broadway in New York City. The system merges these two pieces of information to find the Starbucks near 25 Broadway in New York.
Quite often the system will return more than one result in the target area. In a dense city like New York, there are five Starbucks within 0.4 miles of the x.ai office. To help Amy find the best one among them, we’ve built a Ranking Module. The module takes distance, user preferences, and meeting history data as input to narrow down the best candidate, which is then sent to the next step as the final location.
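The ranking idea can be sketched as a weighted score over the three inputs named above. Everything specific in this example is an assumption: the weights, the `Candidate` fields, and the linear scoring formula are illustrative choices, not a description of the Ranking Module.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    lat: float
    lon: float
    past_meetings: int = 0     # meeting history: how often the user met here
    is_favorite: bool = False  # user preference

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    earth_radius_miles = 3958.8
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

def score(cand, anchor, w_dist=1.0, w_hist=0.05, w_fav=0.2):
    """Higher is better: closer, more familiar, and preferred places win.

    The weights are made-up values for this sketch.
    """
    dist = haversine_miles(anchor[0], anchor[1], cand.lat, cand.lon)
    return -w_dist * dist + w_hist * cand.past_meetings + (w_fav if cand.is_favorite else 0.0)

def rank(candidates, anchor, radius_miles=0.4):
    """Keep candidates within the radius of the anchor, best score first."""
    nearby = [
        c for c in candidates
        if haversine_miles(anchor[0], anchor[1], c.lat, c.lon) <= radius_miles
    ]
    return sorted(nearby, key=lambda c: score(c, anchor), reverse=True)
```

The anchor is a (latitude, longitude) pair for the constraint location, here the office. A linear score keeps the trade-off easy to inspect: a slightly farther branch can still win if the user has met there before or marked it as a favorite.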
Intent Classification
After extracting the entities (time, people, location), the last step in the NLP portion of the system is for Amy to understand the meaning of the entities and infer the actions she should take. In the email above to Lewis, the location-related intents generated by the system would be:

  • Positive_Location since Angela has identified a place to meet (Starbucks near x.ai’s offices)
  • Irrelevant_Location since Angela refers to her trip to Canada but Canada has nothing to do with the meeting
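To show what these two labels mean in practice, here is a deliberately naive rule-based stand-in. It is not how the system classifies intents: the cue phrases and the `location_intent` helper are hypothetical, and only the two label names come from the list above.

```python
import re

# Hypothetical cue phrases suggesting the place is part of a past trip.
PAST_TRAVEL_CUES = "got back from|returned from|came back from"

def location_intent(mention, sentence):
    """Label a location mention using the sentence it appeared in."""
    pattern = rf"\b(?:{PAST_TRAVEL_CUES})\s+(?:the\s+)?{re.escape(mention)}\b"
    if re.search(pattern, sentence, re.IGNORECASE):
        # The place is something the sender is reporting on, not proposing.
        return "Irrelevant_Location"
    return "Positive_Location"
```

A fixed phrase list like this breaks quickly on real email, which is the motivation for learning intents from labeled conversations instead.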


Natural Language Generation

Once the system has extracted all of the entities and analyzed intents, the machine needs to generate an appropriate response to move the meeting conversation forward. Amy’s goal, after all, is to schedule meetings with the fewest emails. Our Composing Module organizes all of the data derived from our NLP modules, combines it with user preferences and calendar information, and creates the response text. In this case, Amy’s reply looks like this:
Amy also sends both Angela and Lewis a calendar invite:
On average, it takes less than 10 minutes for Amy to parse incoming emails and deliver an appropriate response. But it has taken us nearly three years and 80+ people to build Amy and her twin brother Andrew.
For more on the Natural Language Generation step of this process, you can read this blog post by our AI Interaction Designer.