Advances in technology have allowed researchers to produce huge amounts of data in a short time. Making meaningful sense of this information in a timely manner is the next step in “big data.”
With a $1.2 million NSF grant, Sanguthevar Rajasekaran, Director of the Booth Engineering Center for Advanced Technologies, and a team of researchers will devise new algorithms that can efficiently make use of these almost inconceivably large sets of information. It is the first NSF-funded big data research project in the state.
“People end up generating so much data, and it’s a big challenge to process the humongous data sets,” said Rajasekaran, who will work on the project with researchers from the University of Florida and the Jackson Laboratory for Genomic Medicine on the UConn Health campus in Farmington, CT. The team includes Reda Ammar (CSE), Jinbo Bi (CSE), Joerg Graf (MCB), Sartaj Sahni (University of Florida), George Weinstock (JAX), and Yufeng Wu (CSE).
In biology, he said, researchers can generate terabytes of data on a daily basis, but the algorithms they currently have to process this data can’t keep up. “The algorithms take up too much space, as well as too much time.”
“If we want to advance science, we have to have that information quickly,” he said.
If a dataset is too large to fit in a computer’s main (core) memory, it must be kept in secondary storage, such as a hard disk or a solid-state drive. That greatly increases the time it takes to access the data.
Rajasekaran said his project aims to process these data sets more efficiently by developing out-of-core algorithms as well as parallel algorithms. Out-of-core algorithms process data too large for a computer’s main memory and are designed to efficiently retrieve information stored on hard drives or tape drives. Work has been done in this area, but very little of it addresses biological big data.
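To make the out-of-core idea concrete, here is a minimal Python sketch. It is not the team's algorithm, only an illustration of the general pattern: read a file that may be far larger than main memory in fixed-size chunks and keep only a small running summary. The function name `count_bases_out_of_core` and the `chunk_bytes` parameter are invented for this example, and the input is assumed to be a plain file of DNA letters.

```python
from collections import Counter


def count_bases_out_of_core(path, chunk_bytes=1 << 20):
    """Count A, C, G and T in a file while holding at most chunk_bytes of it in memory.

    Hypothetical helper for illustration; assumes a plain file of DNA letters.
    """
    counts = Counter()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_bytes)  # only this slice is in memory
            if not chunk:  # an empty read means end of file
                break
            counts.update(chunk)  # iterating bytes yields integer byte values
    # Translate the integer byte keys back into letters.
    return {base: counts[ord(base)] for base in "ACGT"}
```

Memory use is bounded by `chunk_bytes` rather than by the size of the file, which is the property that lets this style of processing scale past main memory. The cost is that every byte still has to be read from the slower storage device, so the running time is dominated by data access.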
“Not many people are doing the parallel algorithms and even fewer of the out-of-core ones,” he said.
The project will also include the work of George Weinstock, associate director for microbial genomics at the Jackson Laboratory, who will supply Rajasekaran with datasets from their research to test the algorithms.
“It’s now possible to produce amazing amounts of data,” Weinstock said. “But the data doesn’t do you any good unless you can manage it very efficiently and extract actual results from it. This is one of the very large and unmet needs right now in research.”
Weinstock’s work on the genome of the African green monkey will figure into the project.
“We’re very interested in what quantitative traits we can find – those are things like height, or blood pressure, or more complex things like the concentration of neurotransmitters in blood,” he said. “To figure out what the genes are that might have different mutations in them in high blood pressure, or height, or bad behavior, you have to do a genetic analysis of the entire genome in many subjects – sometimes thousands of them. For example, some place in the genome there are particular variants in the genomic sequence that only tall people have.”
Rooting out that particular variant, though, means finding an extremely small deviation in a mountain of data – a single DNA letter difference among 6 billion.
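As a rough illustration of that search, the Python sketch below scans two sequence files chunk by chunk and reports every position where a single letter differs. It rests on strong simplifying assumptions that real genetic analysis does not enjoy: the two sequences are already aligned, are the same length, and are stored as plain files of letters. The function name `find_mismatches` and its parameters are made up for this example.

```python
def find_mismatches(path_a, path_b, chunk_bytes=1 << 20):
    """Yield (offset, letter_a, letter_b) wherever two aligned sequence files differ.

    Hypothetical helper for illustration; assumes both files have the same length.
    """
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_bytes)
            chunk_b = file_b.read(chunk_bytes)
            if not chunk_a and not chunk_b:  # both files exhausted
                break
            if chunk_a != chunk_b:  # inspect letters only when chunks differ
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        yield offset + i, chr(x), chr(y)
            offset += len(chunk_a)
```

Even this toy version has to read every letter of both files once. Studies of the kind described here compare many subjects, and their sequences are not neatly aligned, which is part of why the real computation is so much more expensive.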
It’s possible to do that today, but it can take weeks or months. Getting that time down to a day or two would make a huge difference. It wouldn’t just make researchers’ lives easier – it could revolutionize science and medicine.
“The Holy Grail in all this is to apply it clinically in medicine,” Weinstock said. Go to the doctor now and you can get the results of a blood test back in a day or so. Having the technology to analyze a genomic sequence in the same time could greatly advance cancer treatment and personalized medicine.
“If you could do those genetic computations in just days, now it becomes a clinical tool that helps in medical care, and that would be huge.”
As Rajasekaran envisions it, the results will have a far-reaching impact. The algorithms will be disseminated widely as a software library, and incorporated in undergraduate and graduate courses.