MIT develops programming language to make parallel computing for big data much faster

Computer scientists in the US have developed a new programming language that, in the researchers' tests, ran programs four times as fast as versions written in existing languages, with the aim of easing the memory bottlenecks found in big data applications.

Researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a language called Milk that allows application developers to manage memory more efficiently in programs that need to deal with data scattered across large datasets.

At the moment, a computer chip manages its memory on the assumption that if a software program needs a chunk of data stored at a specific memory location, it will very likely also need the neighbouring chunks of data.

Unfortunately, big data doesn't work this way. Big data systems are designed to process enormous, often unstructured amounts of data in order to glean insights that help our daily lives, and to do so computers typically have to collect, store and process data points scattered across large datasets, forming connections between them and detecting patterns.

This puts a strain on computer chips and slows down the complex work that big data algorithms demand, because the chip ends up fetching data items from main memory one at a time.
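The difference between neighbouring and scattered data can be illustrated with a rough sketch. It assumes a 64-byte block fetched per memory request and 8-byte data items; both figures, and the function name lines_fetched, are illustrative assumptions rather than details from the researchers' work.

```python
CACHE_LINE_BYTES = 64  # assumed size of the block fetched from main memory
ITEM_BYTES = 8         # assumed size of a single data item


def lines_fetched(indices):
    """Count the distinct memory blocks touched when reading the items at `indices`."""
    items_per_line = CACHE_LINE_BYTES // ITEM_BYTES
    return len({i // items_per_line for i in indices})


neighbouring = list(range(64))             # 64 items stored side by side
scattered = [i * 1000 for i in range(64)]  # 64 items spread far apart

print(lines_fetched(neighbouring))  # 8
print(lines_fetched(scattered))     # 64
```

In this toy model, reading 64 neighbouring items costs 8 fetches because each fetch brings in 8 useful items, while reading 64 scattered items costs 64 fetches, each of which brings in only one item the program actually wants.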

To avoid requesting a single data item at a time, the researchers designed Milk to add a few commands to OpenMP, an extension of programming languages such as C and Fortran that makes it easier to write code for multicore processors.

In a computer chip, each processor (also known as a "core") has its own cache, which is a small, local, high-speed memory bank. The idea is for developers to use Milk to add a few additional lines of code around any instruction requesting data.

Rather than fetching a data item from memory immediately, a Milk program takes the address where the item is stored and adds it to a list kept for the core that needs it. Once there are enough data item addresses on the lists, the cores pool their lists together, and Milk figures out which data items are stored closest to each other.

The grouped addresses are then redistributed to the cores, so that each core requests from memory only the data items that are needed, and the data is retrieved much more quickly and efficiently.
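The pool, group and redistribute steps described above can be sketched in a few lines. This is a toy model and not Milk's actual compiler or runtime: the function name pool_and_redistribute, the BLOCK_SIZE threshold for what counts as "close", and the round-robin hand-out are all assumptions made for illustration.

```python
from collections import defaultdict

BLOCK_SIZE = 8  # assumed span of addresses treated as "close to each other"


def pool_and_redistribute(per_core_requests, num_cores):
    """Pool each core's deferred addresses, group nearby ones, and deal the groups back out."""
    # Step 1: pool the address lists gathered by every core.
    pooled = sorted(addr for requests in per_core_requests for addr in requests)

    # Step 2: group addresses that fall within the same block of memory.
    groups = defaultdict(list)
    for addr in pooled:
        groups[addr // BLOCK_SIZE].append(addr)

    # Step 3: hand whole groups to the cores in turn (an assumed policy).
    assignments = [[] for _ in range(num_cores)]
    for n, block in enumerate(sorted(groups)):
        assignments[n % num_cores].extend(groups[block])
    return assignments


# Each inner list holds the addresses one core has deferred so far.
deferred = [[17, 3, 42], [40, 1, 19], [2, 41, 16]]
batches = pool_and_redistribute(deferred, num_cores=3)
print(batches)  # [[1, 2, 3], [16, 17, 19], [40, 41, 42]]
```

Before pooling, every core holds addresses from three distant regions of memory. Afterwards, each core holds addresses from a single region, so in this model its requests land next to one another instead of being spread across the dataset.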

"Many important applications today are data-intensive, but unfortunately, the growing gap in performance between memory and CPU means they do not fully utilise current hardware," said Matei Zaharia, an assistant professor of computer science at Stanford University.

"Milk helps to address this gap by optimising memory access in common programming constructs. The work combines detailed knowledge about the design of memory controllers with knowledge about compilers to implement good optimisations for current hardware."