At x.ai, the data engineering team faces a fairly classic big data problem. We have lots of data, a prerequisite for high-performance machine learning. As data scales up, engineering teams need to manage the growing amount of time, memory, and CPU that computation requires. At x.ai, we use Spark, a framework for distributed, parallel computation, to scale our algorithms across a cluster of machines. By managing a cluster through Spark, we can partition our data, collect the transformations we plan to apply, and execute the actual work as jobs sent to the member machines of that cluster. Sounds simple enough. What’s the catch?
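To make that model concrete, here is a minimal Spark sketch in Scala. The input path, partition count, and job name are hypothetical; the point is that the driver records a chain of lazy transformations over a partitioned dataset, and the work only runs on the cluster once an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordLengthJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-length-job"))

    // Read a (hypothetical) text dataset split across at least 64 partitions.
    val lines = sc.textFile("hdfs:///data/corpus/*.txt", minPartitions = 64)

    val longWords = lines
      .flatMap(_.split("\\s+")) // transformation: recorded lazily, nothing runs yet
      .filter(_.length > 10)    // another lazy transformation

    // Action: this is the point where jobs are actually shipped to the workers.
    println(s"long words: ${longWords.count()}")
    sc.stop()
  }
}
```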

The master machine in the cluster needs to know the structure of the original data before it can send jobs off to the worker machines. Otherwise, it would be incredibly tricky to prevent workers from duplicating work and difficult to harness the advantages of parallelism. By having the master machine break up the data and distribute it to the workers, you reduce how much each worker machine needs to know in order to complete its job.

However, having the master machine partition the original data set presents a serious challenge. One of the points of parallelism was to reduce the memory load on any single machine, but if the master has to partition the entire data set itself, it would, in theory, need to materialize the whole collection in memory to know its full size before shipping it across the wire to the workers. It would be like asking a bunch of friends to help you move into a third-floor apartment, but still having to carry all of your furniture from the street up to the second floor before anyone pitched in. An operation like that would largely erase the advantage of having worker machines in the first place.

At x.ai, we use the Mongo Hadoop connector, which sits between Spark and our MongoDB instances, to resolve this challenge. With it, Spark never materializes the entire collection in memory on the master machine. Instead, all the master needs to hold in memory are pointers to the start and end of each chunk of the split data set.
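As a rough sketch of what this looks like in practice, the snippet below builds an RDD through the connector's MongoInputFormat. The connection URI, database, and collection names are hypothetical; the key point is that each input split the connector computes becomes one Spark partition, and a worker only reads the documents in the splits assigned to it.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

object MongoHadoopExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mongo-hadoop-example"))

    // Hadoop configuration telling the connector which collection to read;
    // the URI, database, and collection names here are placeholders.
    val mongoConfig = new Configuration()
    mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/mydb.events")

    // Each Mongo input split becomes one Spark partition.
    val documents = sc.newAPIHadoopRDD(
      mongoConfig,
      classOf[MongoInputFormat],
      classOf[Object],      // key: the document's _id
      classOf[BSONObject])  // value: the BSON document itself

    println(s"partitions (one per Mongo input split): ${documents.getNumPartitions}")
    sc.stop()
  }
}
```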

Here’s how it works: the Mongo Hadoop connector acts as the translation layer between the Spark library and the MongoDB server, figuring out the optimal splits for the Mongo data. To do this, it employs “splitters.” A splitter contains the logic for the command to run against your MongoDB server, and the connector picks a splitter based on your database configuration. MongoDB supports several deployment configurations: replication (data is written to a primary database and, with a slight delay, redundantly stored on read-only secondaries) or sharding (your data is large enough that you split it up and store it across several different machines).
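The sketch below illustrates that selection step only; it is not the connector's actual code. The string values echo splitter class names shipped with the Mongo Hadoop connector, but the topology type and the matching logic here are simplified assumptions.

```scala
// Illustrative only: a schematic of choosing a splitter from the deployment topology.
sealed trait Topology
case object Standalone extends Topology
case object ReplicaSet extends Topology
case object Sharded    extends Topology

def chooseSplitter(topology: Topology): String = topology match {
  // Unsharded deployments: split the collection directly via splitVector.
  case Standalone | ReplicaSet => "StandaloneMongoSplitter"
  // Sharded clusters: reuse the chunk ranges the cluster already maintains.
  case Sharded                 => "ShardChunkMongoSplitter"
}

// Example usage:
// println(chooseSplitter(ReplicaSet))  // StandaloneMongoSplitter
```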

The default splitter runs the splitVector command on your database server with a pre-defined split size (8 megabytes by default) to group documents into chunks, and it returns the key range of each split. By letting the Mongo Hadoop connector decide which splitter to use, engineers get the efficiency of a Mongo-specific approach to partitioning, plus the added value of running the command on the MongoDB server itself, which is optimized to execute operations against its own collections. The Mongo input splits that the splitVector command returns are cast as Hadoop partitions, which Spark eventually sends out to the worker machines. Each worker then materializes in memory only the chunk of the collection it is working on and applies the transformations to that data.
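To see the kind of output the splitter works from, you can issue splitVector yourself. The sketch below uses the MongoDB Java driver from Scala with hypothetical server, database, and collection names; which database the command must be run against (admin here) and the privileges it requires can vary by deployment, so treat this as an approximation of what the connector does on your behalf.

```scala
import com.mongodb.MongoClient
import org.bson.Document

object SplitVectorExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical server, database, and collection names.
    val client = new MongoClient("localhost", 27017)
    val admin  = client.getDatabase("admin")

    val cmd = new Document("splitVector", "mydb.events")  // full namespace: <db>.<collection>
      .append("keyPattern", new Document("_id", 1))       // index to compute split points on
      .append("maxChunkSize", 8)                          // target split size in MB (the connector's default)

    // The response's "splitKeys" array holds the boundary _id values; each pair of
    // consecutive boundaries delimits one input split, i.e. one future Spark partition.
    val result = admin.runCommand(cmd)
    println(result.toJson)

    client.close()
  }
}
```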

Altogether, the data engineering team at x.ai has found that the Mongo Hadoop connector works well as an intermediary library between Spark and MongoDB. It extends Spark’s concept of partitions down to the database level and effectively avoids the memory-utilization problem that can arise when working with large data sets.