For more than five years, we at x.ai have been applying machine learning to the problem of meeting scheduling. Our AI scheduling assistants, Amy and Andrew, interact with people through email dialog, and we have built machine learning models that capture the relevant time, location and people information so that the virtual agents can schedule meetings with minimal human involvement. It may seem like a simple task, but the infrastructure behind it is quite complex.
Machine Learning Requirements
The data pipeline and machine learning infrastructure to support these models had the following general requirements:
- A scalable solution that can process millions of meetings
- Support for rapid experimentation through all phases: ideation, proof of concept, prototyping and shipping to production
- Adequate GPU, memory and compute resources for deep learning models
- Management of machine learning frameworks and dependencies, keeping the offline training and online inference components in sync
- A way for data scientists to keep track of models and their associated training, development and test datasets, so that results are reproducible and deployment is easier
- A real-time system overview with the necessary logging, monitoring and alerting
Building a Custom In-House System
We used the AWS cloud platform and frameworks like Mesos, but we still had to invest a fair amount of time up front in designing and building, in-house, the components needed to support the machine learning pipeline (along with all of the accompanying logging, monitoring, continuous integration and maintenance work).
Once we started experimenting with deep learning approaches, we needed additional libraries and frameworks, which in turn added infrastructure requirements. We had to rope in the devops team to help add GPU instances to our Mesos pool of resources. Basically, we needed computing power, and a lot of it.
As you probably guessed, we do not have the luxury of a large team of data engineers to provide the support that is needed to build and maintain such infrastructure from the ground up.
Our teammates outside the core data science team helped us as much as they could, but they had their own battles to fight. So, to ship these models to production, we data scientists put on our data engineering hats and took ownership of the entire machine learning pipeline. We managed Docker containers and synced the libraries, frameworks and dependencies ourselves. Amazon’s Deep Learning AMIs, Docker and pipenv helped with this process, but doing it ourselves still carried a huge overhead. Even after we had produced good offline results, we ended up spending a significant part of our development cycles on infrastructure, just keeping the lights on. Some of the discrepancies between offline model evaluation and online inference required manual effort to resolve.
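To make the dependency-sync problem concrete, here is a minimal sketch of the kind of check that can catch drift between a training environment and an inference environment. It assumes both environments are described by pipenv lock files and compares the pinned versions in their "default" sections. The function names and the sample lock contents are hypothetical, and this is not a description of the tooling we actually ran.

```python
def pinned_versions(lock):
    """Map package name -> pinned version from the "default" section
    of a parsed Pipfile.lock (a dict loaded with json.load)."""
    return {
        name: spec.get("version", "<unpinned>")
        for name, spec in lock.get("default", {}).items()
    }


def find_drift(training_lock, inference_lock):
    """Return {package: (training_version, inference_version)} for every
    package whose pin differs, or that exists on only one side."""
    train = pinned_versions(training_lock)
    serve = pinned_versions(inference_lock)
    drift = {}
    for name in sorted(set(train) | set(serve)):
        t = train.get(name, "<missing>")
        s = serve.get(name, "<missing>")
        if t != s:
            drift[name] = (t, s)
    return drift


# Hypothetical lock contents, for illustration only.
training_lock = {
    "default": {
        "numpy": {"version": "==1.16.2"},
        "tensorflow": {"version": "==1.13.1"},
    }
}
inference_lock = {
    "default": {
        "numpy": {"version": "==1.16.2"},
        "tensorflow": {"version": "==1.12.0"},
    }
}

drift = find_drift(training_lock, inference_lock)
for package, (train_version, serve_version) in drift.items():
    print(f"{package}: training {train_version} vs inference {serve_version}")
```

A check like this can run in continuous integration so that a mismatch fails the build rather than surfacing later as a gap between offline and online results.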
March 2019 Overhaul
In March 2019, we decided to overhaul our machine learning pipeline.
AWS SageMaker had been on our radar for over a year, but we had held off: we had already invested in building something that fit our needs, and because SageMaker was a relatively new framework, we were not sure how stable it was.
The following architecture diagram provides an overview of what x.ai’s machine learning pipeline looks like today after making the switch.
There was a learning curve in adapting to the new framework: reading the documentation and scoping out lots of small experiments. We hit some hurdles during the transition, when we needed to balance the migration with short-term production feature requirements. The out-of-the-box SageMaker algorithms, along with their cost functions, neural network architectures and regularization parameters, were not sufficiently transparent or customizable for our needs, and we initially observed a degradation in model performance.
We needed more fine-grained control over these aspects, so we invested time in figuring out how to build our own models within the SageMaker framework for our use cases.
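One common way to bring custom model code into SageMaker is a training script that the framework containers invoke, with hyperparameters passed as command-line flags and data and output locations exposed through environment variables such as SM_CHANNEL_TRAINING and SM_MODEL_DIR. The sketch below shows only that shape. The hyperparameter names and the placeholder train function are hypothetical, and it should not be read as the script we actually shipped.

```python
import argparse
import json
import os
import tempfile


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameters; in script mode SageMaker passes the
    # job's hyperparameters to the script as command-line flags.
    parser.add_argument("--learning-rate", type=float, default=1e-3)
    parser.add_argument("--l2-weight", type=float, default=0.0)
    parser.add_argument("--epochs", type=int, default=10)
    # Inside a SageMaker framework container these variables point at the
    # "training" input channel and the model output directory. The local
    # defaults are assumptions that allow a dry run outside SageMaker.
    parser.add_argument("--train-dir",
                        default=os.environ.get("SM_CHANNEL_TRAINING", "./data"))
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    return parser.parse_args(argv)


def train(args):
    # Placeholder for custom model code: this is where a custom cost
    # function, architecture and regularization would be defined.
    return {
        "learning_rate": args.learning_rate,
        "l2_weight": args.l2_weight,
        "epochs": args.epochs,
    }


def main(argv=None):
    args = parse_args(argv)
    summary = train(args)
    # Files written to the model directory are what SageMaker collects
    # as the model artifact when the job finishes.
    os.makedirs(args.model_dir, exist_ok=True)
    with open(os.path.join(args.model_dir, "summary.json"), "w") as f:
        json.dump(summary, f)
    return summary


# Local dry run against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    summary = main(["--epochs", "2", "--l2-weight", "0.01", "--model-dir", tmp])
```

Keeping the script runnable locally with plain defaults makes it possible to debug model code on a cheap machine before paying for a training instance.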
But on the other side of this transition, some of the impact we’ve seen with the new infrastructure and pipeline includes:
- Increased Speed: Experimentation is faster. With our new pipeline, we as data scientists have become more independent in shipping models to production. We have cut out a lot of the devops-related steps, which lets us spend more of our time on the actual modeling aspect of the problem.
- Decreased Cost: So far we’ve observed a significant reduction in the unit cost of running an experiment. This is mainly because, given the new modular design of the pipeline, we use the more expensive instances only for the training steps, and the training jobs can be configured with automatic stopping conditions. Ad-hoc exploration and data extraction, transformation and loading can be done on cheaper instances.
- Reproducible results and tracking: With the new pipeline, we have a dashboard that provides real-time insight into the experiments being run. Persistent storage for data, along with Git integration, enables versioning and source control for the data, source code, hyperparameters and models of any given experiment.
- Machine Learning frameworks and instances: With Deep Learning AMIs and frameworks like TensorFlow, PyTorch, Sockeye and scikit-learn, plus the usual data science dependencies like pandas and numpy, already installed and set up, we now have the flexibility to experiment more.
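The cost point above depends on two settings of a training job: which instance type it runs on, and a stopping condition that caps its runtime. The sketch below assembles a request in the shape accepted by the SageMaker CreateTrainingJob API. The helper name and every concrete value (job name, image URI, role ARN, bucket paths, instance type, time limit) are hypothetical placeholders, not our configuration.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri,
                               instance_type="ml.p3.2xlarge",
                               max_runtime_seconds=3600,
                               hyperparameters=None):
    """Assemble a CreateTrainingJob request as a plain dict."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": train_s3_uri,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # The expensive GPU instance is requested only for this job.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        # The job is stopped automatically once this limit is reached.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
        # The API expects hyperparameter values as strings.
        "HyperParameters": {
            key: str(value) for key, value in (hyperparameters or {}).items()
        },
    }


# Hypothetical values, for illustration only.
request = build_training_job_request(
    job_name="example-experiment-001",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-image:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
    max_runtime_seconds=1800,
    hyperparameters={"epochs": 5, "learning-rate": 0.001},
)
```

With AWS credentials configured, a dict like this could be submitted with boto3.client("sagemaker").create_training_job(**request). Storing the same dict alongside the code revision is one simple way to record the hyperparameters and data location of an experiment.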
Building your machine learning infrastructure is a tricky process that requires a lot of careful thought, planning and design to adapt to continuously evolving needs.
Switching to frameworks like AWS SageMaker comes at the price of vendor lock-in, switching costs and potential service disruptions. Given the tradeoffs involved in building and maintaining our own infrastructure, our team size and the needs of our data science team, we decided to bite the bullet and make this transition. Given the substantial gains we have realized as a result, it is a decision we wish we had made earlier.