
Cornell University

Bioinformatics Facility redesigns scratch space for faster computing


What is scratch space and why do we need it?

Scratch space is the storage space needed for temporary data files used during computation. Imagine a jar with thousands of gummy bears. You want to count how many gummy bears of each color are in the jar. One way would be to pour the contents of the jar on a table, sort the candies by color, and then count the number of candies in each pile: a pile for yellow gummy bears, one for red gummy bears, etc. While the gummy bears are spread out on the table, they temporarily take up more space than when they were in the jar. Scratch space is like the table.

What is the issue with scratch space?

Counting a large number of gummy bears (or analyzing large amounts of data) generates two challenges.

First, the amount of scratch space required can be very large. The more colors in the jar, the more distinct piles you need to make, and the bigger the scratch space needs to be. Bioinformatics requires a huge amount of scratch space.

“When assembling sequencing data for a large genome of high complexity, you may need up to 20-30TB temporary storage. The majority of current BioHPC servers do not have this kind of capacity in locally attached disks,” says Jarek Pillardy, director of the Bioinformatics Facility.

A second issue is efficiently transporting your gummy bears from the warehouse where your candy is stored to the table where computation happens. Sending data back and forth can be a slow process, especially if you have lots of it.

“Typically, our users must copy data from central storage to local storage, do computing, and copy the results back. It works well for small and medium data sets but copying terabytes of data may take hours or even days to complete,” says Qi Sun, co-director of the Bioinformatics Facility.
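To get a feel for why copying can take hours or days, here is a rough back-of-the-envelope sketch. The throughput figures (100 MB/s and 1,000 MB/s) are illustrative assumptions, not measurements from BioHPC, and the function name is made up for this example.

```python
def transfer_hours(size_tb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to copy size_tb terabytes at a sustained throughput."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = size_mb / throughput_mb_per_s
    return seconds / 3600


# Assumed sustained copy rates in MB/s; real rates depend on network and disks.
for rate in (100, 1000):
    hours = transfer_hours(20, rate)
    print(f"20 TB at {rate} MB/s: about {hours:.1f} hours")
```

Under these assumptions, a 20 TB data set takes several hours at the faster rate and more than two days at the slower one, and that is for a single one-way copy.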

Why don’t we skip the transport, and perform computations where data is stored? Doesn’t BioHPC already have huge amounts of space for data storage?

Yes, BioHPC Cloud offers its users around 1.5PB of centralized network storage. That space, however, is designed for another purpose. This network storage, our data warehouse, is a cluster of servers running a file system called Lustre. Lustre systems can be configured in various ways depending on their function. Because our Lustre system is intended for data storage, it’s configured to be very safe, so data will be secure. In other words, our warehouse is designed to keep the candy safe, not to count it. With our current configuration, counting in the warehouse would be prohibitively slow.

Can we configure our Lustre storage differently?

No, because we need to keep our storage optimized for data safety. What we can do, however, is build another storage space that is optimized for computation. “The solution is to set up secondary Lustre storage designed as a centralized scratch space,” says Pillardy.

What would the new configuration look like?

Our initial example is not the only way to count candies. For example, we could make small piles without the effort of sorting the colors. With only 10 candies in each pile, we could see at a glance that the first pile has two yellow bears, one red, two blue, three purple, one orange, and one green. We could record these numbers, go to the next pile and repeat. It’s faster than sorting through gummy bears one by one.

The two counting strategies have different strengths and weaknesses. When using the color sorting strategy, if one pile falls on the floor, it affects the data for only one color and won’t compromise the count of all the other colors. It’s safe but slow. When using the second approach, if one pile is lost, it affects the data for all the colors in that pile. It’s less safe, but faster.
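For readers who like to see it written down, the two counting strategies can be sketched in a few lines of Python. This is only an illustration of the analogy: the jar contents, the function names, and the pile size of 10 are made up for the example.

```python
from collections import Counter


def count_by_sorting(jar):
    """Strategy 1: sort every bear into a pile per color, then count each pile."""
    piles = {}
    for bear in jar:
        piles.setdefault(bear, []).append(bear)
    return {color: len(pile) for color, pile in piles.items()}


def count_by_small_piles(jar, pile_size=10):
    """Strategy 2: take unsorted piles of pile_size bears and tally each one."""
    totals = Counter()
    for start in range(0, len(jar), pile_size):
        pile = jar[start:start + pile_size]
        totals.update(pile)  # record the colors seen in this pile
    return dict(totals)


# A made-up jar: 33 bears in six colors.
jar = ["yellow", "red", "blue", "purple", "orange", "green"] * 5 + ["red"] * 3

print(count_by_sorting(jar))
print(count_by_small_piles(jar))
print(sorted(set(jar[:10])))  # colors that would be affected if the first small pile were lost
```

Both functions give the same totals. The difference is in what a lost pile costs: a sorted pile holds one color, while a small unsorted pile usually holds several.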

Similarly, Lustre storage can be configured in two ways. Either each data file is located on a single server (safe and slow), or the files are striped (segmented) over multiple servers (riskier and faster). When data files are striped across servers, failure of a single server may affect many data files, but striping allows for very fast computing. This is the configuration that will be used for scratch space.
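The trade-off between the two layouts can be illustrated with a toy model. This sketch does not use Lustre itself or reflect BioHPC's actual settings; the file names, sizes, server count, and 1 MB stripe size are all hypothetical, and the round-robin placement is a simplifying assumption.

```python
def place_whole(files, n_servers):
    """Each file lives entirely on one server, assigned round-robin."""
    return {name: {i % n_servers} for i, name in enumerate(files)}


def place_striped(files, n_servers, stripe_mb=1):
    """Each file is cut into stripe_mb chunks spread round-robin over servers."""
    layout = {}
    for i, (name, size_mb) in enumerate(files.items()):
        n_chunks = -(-size_mb // stripe_mb)  # ceiling division
        layout[name] = {(i + c) % n_servers for c in range(n_chunks)}
    return layout


def affected_by_failure(layout, failed_server):
    """Files that have at least one piece on the failed server."""
    return sorted(name for name, servers in layout.items() if failed_server in servers)


# Hypothetical files (sizes in MB) on a hypothetical 4-server system.
files = {"a.fastq": 8, "b.fastq": 8, "c.fastq": 8, "d.fastq": 8}
whole = place_whole(files, 4)
striped = place_striped(files, 4)

print(affected_by_failure(whole, 0))    # only the file stored on server 0
print(affected_by_failure(striped, 0))  # every file with a stripe on server 0
print(len(striped["a.fastq"]))          # servers that can serve a.fastq in parallel
```

In this model, losing one server damages a single file in the unstriped layout but touches every file in the striped one. In exchange, each striped file can be read from several servers at once, which is where the speed comes from.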

So, is BioHPC building this warehouse?

Yes! BioHPC is setting up a second Lustre file system that is configured for speed and computing. “While not as safe as the current one, this new scratch Lustre should be very fast,” says Sun.

In practice, because data safety and integrity are paramount to the facility, users will store a permanent copy of the data on the safe Lustre, and keep a copy available on the fast Lustre for computing. “The idea is to use the new one as a scratch space, as a temporary storage to avoid moving large data sets to local storage,” says Sun. “Users will keep their data permanently on the regular Lustre.”

When will it be available?

System configuration will begin once COVID-19 restrictions are lifted. Facility staff need physical access to the servers to configure the scratch storage and tune it up. “We will conduct extensive tests under various striping and parallel data access parameters to determine the optimal configuration,” says Sun. These tests will help optimize speed, bandwidth, and other performance features. “We are excited by the project,” Pillardy says. “If it works, it would be fantastic, it would really change the bioinformatics computing capabilities at Cornell.”