The Unreasonable Effectiveness Of Neural Machine Translation: A Breakthrough In Temporal Expression Understanding

Written by Rakesh Chada and Marcos Jimenez, data scientists

We strive to make the pain associated with scheduling meetings a thing of the past. We’ve built a virtual assistant (it goes by the name of Amy or Andrew) who can be cc’d into your typical request to meet with people over email. Amy will “understand” the hand-over and just take it from there with your guests, following up with them to nail down the time and location details for the meeting, just like a human assistant would.

Under the hood this means that Amy must automatically extract meeting-related pieces of information from your email and, mashing that up with your calendar and overall preferences, proceed to get your guests to agree to a time that works for you and them, plus gather whatever other details are needed for the meeting (phone conference number, meeting room, address, google hangout link, etc …). That’s the premise of the product. Pretty awesome, right?

Now the hard, cool, data-science part. Amy “understanding” all the pieces of information from free-form human text presents us with a number of formidable and fascinating data science challenges. This is the realm of natural language processing (NLP), where recent strides in deep learning have made tackling these problems viable. The problem goes far beyond simply detecting words related to times and locations, or named entity recognition (NER). It’s way more involved because those words are often just partial references to real locations or actual calendric intervals, and need to be resolved in context and sometimes even linked to specific people. Just think of the sentence “Amy, please find a time for a conference call with Iris on the earliest possible afternoon next week. She’ll provide the details. Brian is optional.” Here are some of the things Amy needs to be able to “understand” from that sentence: 

  1. Automatically detect the type of location for the meeting, in this case a phone conference. It could just as well have been an in-person lunch meeting at a cafe. There are many different location types (perhaps the subject of another blog ;-), and Amy must distinguish them since they determine how Amy follows up and what sort of times Amy suggests to guests
  2. Understand who needs to be asked for the location details. This may be a location Amy should “know” about, in which case the details may come from the user profile, or Amy may need to go out and request a conference number from a specific person, in this case, Iris
  3. Determine who is mandatory and who is optional in this meeting. Amy only really needs to negotiate times with those people who are mandatory
  4. Identify the time constraints for this meeting. Agreeing on a time is arguably the most important aspect of meeting scheduling. It is therefore critical that Amy capture the customer’s initial constraints correctly from the outset, in this case, the earliest possible afternoon next week

The list doesn’t end there, but this gives you a feeling for the complexity behind a conversational agent like Amy. Keep in mind that Amy is a 100% automated dialogue system. There is no human-in-the-loop verifying whether the meeting negotiation is proceeding fine or not, and intervening or handling certain tricky aspects of the meeting. That would be comforting, but it is expensive and not scalable. 

Not to say that human-in-the-loop is altogether a bad idea. We had that at the beginning, while we were collecting data and figuring this out (more on that later). Instead what we have today is pure machine learning (ML) and online performance metrics which we obsessively monitor. We also have various customer feedback channels that help us constantly improve customer experience. But the system itself is running solo. To the best of our knowledge, Amy is actually the first commercially viable multi-participant dialogue system out there, doing a previously purely human task in the real world. 

The fact that Amy is a fully automated system places high demands on the machine learning performance, and in particular in capturing and resolving meeting time constraints from email text, the subject of this blog. Our hope here is to provide an under-the-hood view into how we approached this complex task, including the main lessons learned along the way. This will get a little technical, but not so much that someone without a deep learning background would not be able to follow. At the end, we will arrive at what we believe is a truly novel, state-of-the-art approach to this problem (deserving of a publication). We consider the issue of extracting temporal constraints from human text basically “solved” in our specific domain, by which we mean that our ML has enough coverage and accuracy to make this a commercially viable product. This is proven by Amy’s success and frequent positive customer feedback. Of course, it is not perfect (as neither are human beings), and there is always room for improvement. 

Amy is constantly scheduling meetings for thousands of people. That means that if we design the feedback loop well, it is simultaneously a training-data acquisition machine. How user behavior feeds back into improving Amy’s performance is a topic on its own for a future blog. Here we will limit our focus to the meeting time extraction pipeline … But enough intro. Let’s do this! 

The problem

Before jumping into the solution space, let’s analyze the actual problem further. What do we mean by Amy needs to understand meeting times? The actual problem is that Amy (read ML) needs to infer temporal constraints from free-form text within context. That is, it is not sufficient for Amy to detect that you said next tuesday afternoon in your email. It must be a relevant, positive time constraint that needs to be resolved into a calendric interval to be compared with your calendar availability. Say the email text (the input) was 

I’m flying to New York tomorrow. Amy, schedule a call with Varun and Rakesh sometime next tuesday afternoon. Miguel is optional.

Amy needs to internally translate that text into something like 2019.07.09T13:00-17:00 EST. Notice that “tomorrow” has to be ignored in this case. Amy will then check the captured calendric range against your existing calendar events and suggest some available time slots to your guests.

Fun? Let’s look at a more complex (and more realistic) example. Let’s say this was an email received on June 25th: 

“Great chatting with you last week. Let’s do a follow-up. I’m busy tomorrow. Amy, schedule a 30 min call with Nikhil for next Tuesday. How about at 5 pm”

Amy must internally construct the following understanding from that email:

  • Ignore “last week” (the AI needs to internally learn irrelevance)
  • Resolve “tomorrow” as a negative constraint (the AI must learn the semantic notion of negative temporal expressions)
  • Detect and resolve “30 min” as the duration of the meeting (the AI must distinguish meeting durations)
  • Detect “next Tuesday” and “at 5 pm” as the desired constraints for the meeting (NER)
  • Compose “next Tuesday” and “at 5 pm” into a single temporal constraint of “next Tuesday at 5 pm” (composition)
  • Resolve “next Tuesday at 5 pm” into 2019.07.02T17:00 EST

The image below shows the information displayed by Amy for that exact email. Amy was able to infer successfully the correct constraints from the complex case above. Pretty neat! Spoiler: it’s all thanks to having quality training data and being creative with how we apply deep learning to it (as we’ll see later).
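To make the target of that example concrete, here is a minimal Python sketch of the structure Amy would need to end up with for the June 25th email. It is only an illustration, not our production representation: the class `MeetingConstraints`, its field names, and the helper `next_weekday` are hypothetical, the year 2019 is taken from the resolved date above, “next Tuesday” is assumed to mean the first Tuesday strictly after the sent date, and time zones are left out for brevity.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, time, timedelta


@dataclass
class MeetingConstraints:
    """Hypothetical container for what must be inferred from one email."""
    duration: timedelta
    positive: list = field(default_factory=list)  # times the customer wants
    negative: list = field(default_factory=list)  # times the customer ruled out


def next_weekday(sent: date, weekday: int) -> date:
    """First date strictly after `sent` that falls on `weekday` (Monday=0)."""
    days_ahead = (weekday - sent.weekday() - 1) % 7 + 1
    return sent + timedelta(days=days_ahead)


sent = date(2019, 6, 25)  # the email was received on June 25th

constraints = MeetingConstraints(
    # "30 min" is the meeting duration
    duration=timedelta(minutes=30),
    # "next Tuesday" composed with "at 5 pm"
    positive=[datetime.combine(next_weekday(sent, 1), time(17, 0))],
    # "I'm busy tomorrow" is a negative constraint
    negative=[sent + timedelta(days=1)],
)
# "last week" is irrelevant to the request, so it appears nowhere.

print(constraints.positive[0].isoformat())  # 2019-07-02T17:00:00
```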

Furthermore, time expressions in emails come in several forms and flavors: from simple phrases like “tomorrow”, “next week” or “morning” to the more complicated “later in the day”, “next tuesday between 5-6 p PST”, “between 7th and 12th Aug after 2 pm”, “the week following the second week of march”, “4/10 after 10H30 EST”, etc. The list is endless! And Amy has to understand them all …

The dataset

With great problems come great solutions. We knew from the outset that to get high-performance NLP you need large amounts of training data. There is a lot that can be done by training algorithms to learn a language using external, publicly available datasets and then transferring that knowledge to our domain. Depending on the NLP task, that alone may work. If the system you design is clever, you may even use the transfer learning strategy to bootstrap your way into some basic product and have users give you the additional training data you need to gradually improve. For us this wasn’t really possible. Scheduling meetings is a very sensitive, fault-intolerant task. The virtual assistant needs to be near flawless in order to be useful at all. We needed high-quality training data upfront – and lots of it. Our (arguably costly) solution was to put a human-in-the-loop in every meeting, meaning that initially, humans would actually be the ones doing the work, or at least some of it, labeling the data in the same way we wanted the machines to eventually do it. The human-in-the-loop system is itself complex from many angles. For one, humans are fallible, and ensuring good annotations from them is almost an art form … but that’s the topic of a different blog post.

The point here is that human-in-the-loop is not a scalable system. The more customers you have, the more humans in the loop you need. It did, however, provide us with an enormous training dataset (literally millions of hand-annotated temporal constraints from email text) and bought us time to develop domain knowledge about the problems that we actually had to solve. In the end, the training dataset those humans annotated is what today enables us to train state-of-the-art deep learning models, allowing us to leapfrog from having a human-in-the-loop system to our current 100% automated, fully scalable system.

A history of trial and error

“The difference between the amoeba and Einstein is that, although both make use of the method of trial and error elimination, the amoeba dislikes erring while Einstein is intrigued by it.” (Karl Popper)

In the problem statement, we saw that there are many “understanding” tasks related to inferring times from free-form text. It is not just a matter of detecting some time-related words. The temporal expressions must be standardized, filtered, composed, resolved, etc … until we arrive at the end product: the relevant calendric constraint a customer meant. From a more abstract point of view, the task is one of just going from human text (plus some context like the “sent” date of the email) to calendric constraints. But because that task can be broken down into a sequence of “simpler” subtasks (first detect words, then standardize, then combine, and so on), one may naively think that the best approach is to develop independent models for each of those subtasks, creating a pipeline of ML stages where the output of one stage becomes the input to the next. An assembly line of sorts, where a hard problem is broken down into simpler ones. Let’s call this the stacked NLP approach. Perhaps we’ve been unfair in saying this approach is naive. It has actually been the academic standard for decades (and it still is). It is only through the advent of powerful deep learning techniques that we can model this differently, assuming the training datasets are available (as in our case). Follow along as we delve into the stacked NLP approach and examine its strengths and weaknesses before discarding it.

Under the stacked NLP approach each stage in the pipeline is tackled independently and in sequence. For example, we may start by detecting tokens related to times, using a model trained to label tokens according to whether they are part of a temporal expression or not. Conditional Random Fields or Convolutional or Recurrent Neural Networks are standard ways of approaching this specific NER task, and you can play around with combinations of those to get state-of-the-art results at that specific task. Continuing with our pipeline, the output of that NER model would then be fed into the next stage, a different model whose task is to normalize those tokens into standardized formats. For example, any variant of the word tuesday (tue, Tues, Tuesday, tuseday, tu, etc … ) would get mapped to day-of-week-2 . The output of that model would then be fed as the input of a composition model, and so on … 
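As a rough illustration of what two such stages look like when chained, here is a toy Python sketch. It is not our actual pipeline: a regular expression stands in for the trained NER model, a lookup table stands in for the normalization model, and the function names `detect`, `normalize` and `pipeline` are hypothetical.

```python
import re

# Stage 1 (detection): a toy stand-in for an NER model that tags time-related tokens.
WEEKDAY_PATTERN = re.compile(r"\b(tu|tue|tues|tuesday|tuseday)\b", re.IGNORECASE)


def detect(text: str) -> list:
    """Return the raw tokens that look like a reference to Tuesday."""
    return [match.group(0) for match in WEEKDAY_PATTERN.finditer(text)]


# Stage 2 (normalization): map every surface variant to one standardized token.
NORMALIZED = {
    variant: "day-of-week-2"
    for variant in ("tu", "tue", "tues", "tuesday", "tuseday")
}


def normalize(tokens: list) -> list:
    """Replace each detected token with its standardized form."""
    return [NORMALIZED[token.lower()] for token in tokens]


def pipeline(text: str) -> list:
    """Each stage consumes the previous stage's output."""
    return normalize(detect(text))


print(pipeline("Can we meet Tues or is next tuseday better?"))
# ['day-of-week-2', 'day-of-week-2']
```

Note that a misspelling the detector has never seen (say, “tuesdy”) yields an empty list, and no downstream stage can recover from that miss.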

There are real strengths to this approach. The most obvious one is that a hard task has been broken down into a sequence of separate, simpler tasks each of which can be tackled individually. It’s a sort of divide and conquer. A robot in a car factory may not be able to build an entire car, but broken into an assembly line, 50 robots each doing simpler tasks in a sequence can produce a car at the end. The simpler tasks in the pipeline can be automated faster and with less training data. The harder tasks can then be singled out for more focus and tackled on their own, perhaps even piping them to a human expert (back to human-in-the-loop) without sacrificing the automation gained by having factored out all the other tasks. 

That’s great, but the stacked NLP approach also suffers from real weaknesses, and for that reason, it was eventually abandoned. Here are a few: 

  • Training data cost. Breaking down a task into subtasks implies acquiring training data for each subtask, which is expensive. If you decide to change the contracts between tasks or the task sequence itself, the old training data may no longer be useful. 
  • Development cost. Developing separate approaches to each task involves training and maintaining many machine learning models, along with online and offline performance metrics for each.
  • Performance cost. The stages are not intrinsically independent, but because the models are trained independently, existing correlations among them are not exploited. Moreover, errors aggregate and cascade through the pipeline in ways that are hard to predict and model. An error in the output of an upstream stage may be amplified as it travels down the pipeline. For example, a well-performing composition model will do poorly if its input, the output of the detection model, is erroneous. 
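A back-of-the-envelope way to see the performance cost: under the simplifying (and, per the point above, optimistic) assumption that stage errors are independent, the end-to-end accuracy is the product of the per-stage accuracies. The sketch below uses made-up numbers purely for illustration; they are not measurements from our system.

```python
def end_to_end_accuracy(stage_accuracies):
    """Probability that every stage is right, assuming independent errors."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


# Hypothetical numbers: five stages that are each 95% accurate in isolation.
print(round(end_to_end_accuracy([0.95] * 5), 3))  # 0.774
```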

The solution

It is probably fair to say that “neural machine translation” (NMT) as a deep learning technique really took off in late 2016 or early 2017. See for example Google AI’s blog. We refer the reader to that blog for an overview of how NMTs actually work. For our purposes here all we need to know is that NMTs essentially learn a mapping from an input sequence of some length to an output sequence of some other length. They are part of a broader class of encoder-decoder algorithms. The input sequence (encoder input) may, for example, be a sequence of English words. Those get mapped by the neural layers into internal “hidden” representations, combined through an “attention layer” (a fancy name for assigning weights to them), and finally fed as the input to the decoder. The output sequence (decoder output) can be anything. Let’s say it’s a sequence of French words, that is, the corresponding French translation. Hence the name “neural machine translation”. 
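For readers who want to see what “assigning weights” means, here is a minimal pure-Python sketch of one common variant, dot-product attention. It is an assumption on our part to use this variant for illustration; nothing here says which scoring function a given NMT uses, and the function names and toy vectors below are made up.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(decoder_state, encoder_states):
    """Dot-product attention over the encoder's hidden states.

    Returns one weight per input token and the weighted average
    (the "context" vector) that is handed to the decoder.
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    context = [
        sum(weight * state[i] for weight, state in zip(weights, encoder_states))
        for i in range(len(decoder_state))
    ]
    return weights, context


# Toy 2-dimensional hidden states for three input tokens (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([0.0, 2.0], encoder_states)
print([round(weight, 3) for weight in weights])
```

The input tokens whose hidden states align best with the current decoder state receive the largest weights, which is what the bright squares in an attention plot visualize.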

The incredible success of NMT models inspired us to take a radically different approach to the pipeline problem. What if we forgot all those intermediate stages, which the end-user doesn’t care about, and framed the problem as a single neural translation task? Instead of translating from English to French, we would be translating from English to calendric constraints. The input language is English, the output language is calendric constraints. Intuitively this is not crazy talk. If the machine can learn to translate to French, it should be able to learn to translate to the far simpler “language” of calendric intervals. In fact, encoder-decoder models have since also shown cutting-edge performance in summarization tasks, and you could view our problem as a summarization of the email as a set of calendric constraints. This approach would help get rid of error cascades and fully exploit correlations among the different stages by absorbing all stages into a single, end-to-end prediction. Thus, if it works, it could outperform the stacked NLP approach and, as we will see, it in fact does so in a major way. 

This approach is radically simpler than the stage-wise approach. There are no intermediate stages. An email comes in with some contextual information (such as the email sent date) and resolved calendar constraints come out the other end. The neural net might still carry out detection, standardization, composition, etc … internally in its hidden representation states, much like humans probably do when they read emails. But how those stages may be represented internally is actually not of our concern. What matters is that we have dramatically simplified the online architecture by getting rid of pipeline stage contracts and error cascading through stages while drastically improving the performance. 

In fact, if the customer corrects Amy, under the NMT approach that correction can itself become a new training datapoint, since the end-user is the ultimate judge of what is a correct result. No need for any intermediate training datasets labeled by humans-in-the-loop. An actual end-user could never directly provide training data for those intermediate stages. In essence, this approach allowed us to fully automate the system and replace the back-end human-in-the-loop with the actual source of truth: the end-user-in-the-loop.

Encoder input and decoder output representations

The encoder input to this system is simply a concatenation of the ‘sent’ date of the email with the email text. The ‘sent’ date is needed because the NMT needs to produce the fully resolved calendric interval as the output. Next tuesday can be any of an infinite number of calendar dates if one does not have a date (the sent date of the email in this case) to anchor it to. Without going into gory detail, the input strings are “cleaned” and “tokenized” as text preprocessing steps before creating the final encoder sequence. We played a lot with using transfer learning, for example by tuning an embedding layer that’s initialized with pre-trained representations such as word2vec. Lately, we have been playing around with hooking BERT as the encoder in our NMT, with outstanding results (more on that in the last section). 
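A minimal sketch of how such an encoder input could be assembled is shown below. The exact cleaning rules, the `<sep>` token, and the way the sent date is spelled out as tokens are all assumptions made for illustration, not a description of our production preprocessing; `build_encoder_input` is a hypothetical name.

```python
import re
from datetime import date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def build_encoder_input(sent: date, email_text: str) -> list:
    """Concatenate the sent date with the cleaned, tokenized email text."""
    # Cleaning: lowercase and collapse runs of whitespace.
    cleaned = re.sub(r"\s+", " ", email_text.strip().lower())
    # Tokenizing: words, numbers, and single punctuation marks.
    tokens = re.findall(r"[a-z]+|\d+|[^\sa-z\d]", cleaned)
    # Hypothetical date prefix: weekday name, then year, month and day.
    date_tokens = [WEEKDAYS[sent.weekday()],
                   str(sent.year), str(sent.month), str(sent.day)]
    return date_tokens + ["<sep>"] + tokens


print(build_encoder_input(date(2019, 6, 25),
                          "Amy, schedule a 30 min call  for next Tuesday."))
```

Including the weekday of the sent date explicitly is a design choice in this sketch: it spares the model from having to infer the day of the week from the date itself.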

The decoder representations are more interesting, complex, and play a critical role in the success of the whole design. In some ways, they represent our insider domain knowledge of the problem we are trying to solve. A decoder can only output sequences. We want calendric constraints. There are many types of calendric constraints. For example, someone may say `any afternoon next week` or `30 minutes sometime in late July`. We need notions of week-of-year, afternoon, after, soon, late, before, etc … which, together with specific values for them, as in year-2019-week-of-year-38 constitute the output language the NMT must learn. A considerable amount of work needs to go into modeling the output language and generating the training datasets. However, the final output language has around 100 unique tokens in it, compared to say French, which would be in the order of 100K. Thus, from the NMT point of view, we have a relatively simple output language. The issue was having to hand-craft it. 

Now it gets really interesting. Let’s think more abstractly about what we are asking the NMT to actually learn:

  1. The NMT must learn to go from free-form English to relevant “calendric types” such as “week-of-year” or “after tomorrow”
  2. The NMT must also learn basic calendar arithmetic operations that allow it to map “next week” to week-of-year-38. To assign the number 38 it must know that the email was sent in week 37 and must learn to compute 37 + 1. The math can get much more complex. For example, if the calendric type was “next tuesday” and the email was sent on July 30th, then the NMT would need to output 2019.08.06. This would require the modulo operation (30 + 7) mod 31 = 6, plus a carry into the next month
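The two calculations in the list above are trivial with a calendar library, which is exactly the knowledge the network does not have built in. The short Python check below uses the standard `datetime` module; the sent date of September 10th, 2019 is a hypothetical choice of a day that falls in ISO week 37.

```python
from datetime import date, timedelta

# "next week": the email was sent in ISO week 37, so the answer is 37 + 1.
sent = date(2019, 9, 10)  # hypothetical sent date inside week 37
sent_week = sent.isocalendar()[1]
next_week = (sent + timedelta(weeks=1)).isocalendar()[1]
print(sent_week, next_week)  # 37 38

# "next tuesday" from Tuesday, July 30th: (30 + 7) mod 31 = 6,
# with a carry from July into August.
sent = date(2019, 7, 30)
target = sent + timedelta(days=7)
print(target)  # 2019-08-06
```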

Despite being conceptually simple to us, learning this sort of calendar arithmetic is an extremely hard task for a machine learning model. This was also obvious from the results of our first model. To address this, we decided to move some of this “calendar arithmetic” outside of what the model needs to learn and implement it as a set of simple “post-processing” rules that run on top of the model predictions. The model can then just learn to predict “next week” and a post-processing step will mechanically determine that this is week 38 based on the sent date of the email. 
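Here is a minimal sketch of what such post-processing rules might look like. The rule table, the function names, and the convention that a week runs from Monday to Sunday are assumptions made for illustration; the real rule set is larger and is not reproduced here.

```python
from datetime import date, timedelta


def resolve_tomorrow(sent: date):
    """The single day after the sent date."""
    day = sent + timedelta(days=1)
    return day, day


def resolve_next_week(sent: date):
    """Monday to Sunday of the calendar week after the one containing `sent`."""
    monday = sent - timedelta(days=sent.weekday()) + timedelta(weeks=1)
    return monday, monday + timedelta(days=6)


# Hypothetical rule table: symbolic model output -> deterministic resolver.
RULES = {
    "tomorrow": resolve_tomorrow,
    "next week": resolve_next_week,
}


def post_process(prediction: str, sent: date):
    """Resolve a symbolic prediction into a concrete (start, end) date range."""
    return RULES[prediction](sent)


start, end = post_process("next week", date(2019, 9, 10))
print(start, end, start.isocalendar()[1])  # 2019-09-16 2019-09-22 38
```

The model only has to get the symbolic expression right; the date arithmetic is deterministic and cannot be “mislearned”.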

This essentially makes the output sequences capture “patterns” of time expressions, without having to worry too much about specific date values. For instance, all emails that mention a temporal expression referring to the week of a specific date would have output sequences of the form “week of <date> <month> <year>”. The email expressions themselves could come in several forms such as “week of 5/12”, “week of 15th may”, “w/c 11th october”, “week beginning 22nd march” etc. Examples of other temporal patterns include “this <weekday> at <hour> <minute>”, “tomorrow”, “<startdate> – <enddate> <month> <year>” etc. This design choice has enabled us to capture almost all forms of time expressions in an extremely effective manner. 

Model Architectural choices

We’ve seen great success with the standard sequence-to-sequence formulations that incorporate an attention mechanism. Apart from improving the performance, the attention mechanism also helped us understand and interpret the results better. As already mentioned, we’ve recently updated the encoder part of the model to a BERT-based system while retaining the GRU-based decoder. This has led to noticeable performance improvements in production.

Let’s look at some attention visualizations of our model on some example data.

The plot below shows how the model was able to correctly capture all relevant time expressions, including the timezone. The bright squares indicate which source token (x-axis) the model was paying attention to while producing the corresponding output token (y-axis).

Email: “Amy please schedule a meeting with Dennis for april 10 at 10:30 london time.”

Prediction Output:  “10 april 2019 at 10 30 am Europe/London”

The example below shows how the model was able to recognize and produce multiple time expressions. It’s also worth noting that it ignored “this week” mentioned at the beginning of the email.

Email: “Hi Marcos, I hope all is well with you. Dennis is visiting Barcelona this week. I would love to see you in person! Would you be able to drop by your office and meet the team next week? Wednesday, Thursday or Fri afternoon would be best. Amy can help us find a time that works.”

Prediction Output:  “next wednesday afternoon , next thursday afternoon , next friday afternoon”

The following four examples show how the model was able to adapt and produce correct output representations despite subtle minor differences in the input text.

Email 1: “Awesome! Looking forward to connecting with you. Amy, pls help to schedule a call in the morning.”

Output 1:  “today morning”

Email 2: “Awesome! Looking forward to connecting with you. Amy, pls help to schedule a call in the morning next week.”

Output 2:  “within next week morning”

Email 3: “Awesome! Looking forward to connecting with you. Amy, pls help to schedule a call in the morning between 1300-1800.”

Output 3:  “within next week morning between 1 – 6 pm”

Email 4: “Awesome! Looking forward to connecting with you. Amy, pls help to schedule a call in the morning between 1300-1800, from weds-fri.”

Output 4:  “next wednesday – friday between 1 – 6 pm”

The model was able to adapt and capture correctly both temporal types (week of year vs day of month) and values.

This model is currently in production serving all the email traffic. It is fascinating to see that it is performing at accuracy rates close to 93% on this incredibly complex task. This single model is now doing the combined (and a much better) job of more than five different models that we had earlier for each step in the stage-wise pipeline. Without throwing too many laurels at ourselves, we want to say that we are not aware of any system out there (even in purely academic settings) that has reached this level of performance at this particular task. 

We hope that this blog helps those out there trying to solve a similar problem. We also hope that those who are just curious about how Amy does its “magic” got a good glimpse of all the work that went on under the hood to develop this system. We are really looking forward to serving more customers with this system! From the data science team, all we can say is go and get started on a free trial and give it a whirl! 
