Big data detectives

A*STAR researchers are bringing big data genome analytics to Singapore

Published online Nov 2, 2016

Computing, analyzing and storing millions of gigabytes of genomic data calls for large-scale collaboration.

© Mischa Keijser/Cultura/Getty

The cost of sequencing a human genome has plummeted over the past 15 years from one hundred million dollars to one thousand. Every seven months, the rate of sequencing doubles, with projections that by 2025 the genome of every known and catalogued species on Earth will have been spelled out, sometimes several times over.
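
As a rough check on those figures, the implied rates can be worked out in a few lines. This is a back-of-the-envelope sketch, not a forecast: it assumes a steady exponential trend, takes the article's 2016 publication date as the starting point for the 2025 projection, and uses function names invented for illustration.

```python
import math

def halving_time_months(start_cost, end_cost, years):
    """Months for cost to halve, assuming a steady exponential decline."""
    halvings = math.log2(start_cost / end_cost)
    return years * 12 / halvings

def growth_factor(months_elapsed, doubling_months=7):
    """How many times larger a quantity becomes when it doubles every `doubling_months`."""
    return 2 ** (months_elapsed / doubling_months)

# Figures from the text: $100 million down to $1,000 over 15 years.
cost_halving = halving_time_months(100_000_000, 1_000, 15)

# Assumption: count the nine years from 2016 (publication) to 2025.
rate_growth = growth_factor(9 * 12)

print(f"Cost halves roughly every {cost_halving:.1f} months")
print(f"Sequencing rate grows about {rate_growth:,.0f}-fold in nine years")
```

Under these assumptions, the cost has halved a little faster than once a year, while a seven-month doubling time compounds into a tens-of-thousands-fold increase in sequencing rate over nine years.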

“Genome analytics is getting more and more powerful and is pervading all aspects of science and biomedicine, both on the research side and on the clinical side,” says Shyam Prabhakar, a group leader at the A*STAR Genome Institute of Singapore (GIS). But the resources needed to compute and store millions of gigabytes of data are enormous, and scaling up the analysis is not as straightforward as hooking up more computer processors and hiring more staff.

“A common misconception is that big data genomics just involves pouring money down a pit, turning a crank, and waiting for the answers to come out. It is a much more expert-driven exercise,” says Prabhakar. “When you go from ten samples to a hundred samples to a hundred thousand samples, suddenly, even the simplest problems become awfully complicated.”

“In theory, more data means more information and more knowledge. But in practice, we tend to get overwhelmed by the data and don’t know where to look,” says Niranjan Nagarajan, also at the GIS. “We have to invest in building systems that can cope with the data.”

Prabhakar and Nagarajan recently joined colleagues from the GIS and other A*STAR institutes to launch a major initiative to establish infrastructure for big data genome analytics in Singapore. The Centre for Big Data and Integrative Genomics (c-BIG) is a collaboration between four A*STAR institutes: the GIS, the Bioinformatics Institute (BII), the Institute for Infocomm Research (I2R) and the Institute of High Performance Computing (IHPC). It is set to formally launch at a symposium on 10 November 2016.

c-BIG includes projects designed to catalog genetic variation in Singapore, predict a cancer patient’s response to drugs, prevent the next viral pandemic or track down the source of a bacterial outbreak. Ultimately, data-driven genomics could lead to more precise diagnoses and treatments in the clinic.

Singapore sequenced

Most genomic analyses rely on a reference genome, a representative assembly of the entire set of nucleotides in the genome of a particular species.

For humans, the six-billion-letter reference has since 2001 been maintained and regularly updated by a small international consortium of scientists. The entire genomes of Asian and African individuals have been published; however, the current genomic literature is substantially skewed toward Caucasian males.

The A*STAR Centre for Big Data and Integrative Genomics is building reference genomes for the Chinese, Malay and Indian populations in Singapore.

© Aleksander Yrovskih/E+/Getty

c-BIG is supporting the Platinum Genome Project to construct reference genomes for the Chinese, Malay and Indian populations in Singapore. These genomes will be assembled using sophisticated algorithms that patch together two types of sequencing information — long, but error-prone, DNA fragments containing up to 10,000 base pairs, and concise, precise 150-nucleotide strands. Combining the two formats will offer an extremely accurate representation of genomes in the region, to inform a growing field of medical genomics better tailored to local needs.
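
The idea of letting short, accurate reads correct long, error-prone ones can be illustrated with a toy example. The sketch below is not the Platinum Genome Project's algorithm, which the article does not describe. It assumes the short reads have already been positioned on the long read, handles only single-letter substitution errors (not insertions or deletions), and uses a hypothetical function name, `polish_long_read`.

```python
from collections import Counter

def polish_long_read(long_read, short_reads):
    """Correct a noisy long read by majority vote of accurate short reads.

    `short_reads` is a list of (offset, sequence) pairs, where `offset` is
    the assumed, pre-computed start position of the short read on the long read.
    """
    votes = [Counter() for _ in long_read]
    for offset, seq in short_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1

    polished = []
    for original, counter in zip(long_read, votes):
        if counter:
            # Trust the most common base reported by the short reads.
            polished.append(counter.most_common(1)[0][0])
        else:
            # No short-read coverage here: keep the long read's base.
            polished.append(original)
    return "".join(polished)

# Invented toy data: a ten-letter "genome" and a long read with one error.
truth = "ACGTACGTAC"
noisy_long_read = "ACGTTCGTAC"  # wrong base at index 4
short_reads = [(0, "ACGTAC"), (2, "GTACGT"), (4, "ACGTAC")]

polished = polish_long_read(noisy_long_read, short_reads)
print(polished == truth)
```

The long read supplies the overall layout, while the overlapping short reads outvote its errors wherever they provide coverage.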

“We want to do our analysis locally, and not have to borrow insights from other populations. Everything in Singapore is different — from the diets, to the lifestyles, genomes, healthcare systems and compliance to drug regimens,” says Prabhakar.

A larger c-BIG project called SG10K is building a database of 10,000 home-grown genomes to understand the genetic diversity within Singapore. Every section of these genomes will be read and re-read a dozen times. To achieve this, computational engineers at A*STAR, together with several research institutes in Singapore, will need to invent new algorithms. “We are taking the algorithm that many people in the world use to analyse hundreds of genomes, and scaling it up with innovative improvements to analyse 10,000 genomes efficiently,” says Chaolong Wang, a computational geneticist at the GIS in charge of data analytics for SG10K.

Cell scale

Every human being is born with a unique set of DNA. And every cell in their body typically contains an identical copy of this genetic barcode. So what differentiates brain cells from fat cells? How can we account for variation among cells the way we do for variation among people? The answer, for some, is big data.

Mixed into the cellular soup are noodles of RNA that provide a snapshot of the genes being expressed. Every cell has a unique RNA expression profile, which researchers at c-BIG are beginning to characterize through the Cellular Human Bodymap (CellHuB) project. They have already sequenced every tiny snip of RNA, collectively known as the transcriptome, present in 10,000 cells and, working up to a rate of 3,000 cells a day on average, plan to profile a million cells in two years.

“It is a formidable data analytics challenge,” says Prabhakar. Such single-cell analysis, he explains, could be used to compare healthy states with diseased states, or even to classify and diagnose patients more precisely.

“The big data, precision-medicine dream is to build a database on a national scale,” says Prabhakar — one that includes genome sequence data, RNA extracts, patient medical records, and other demographic intelligence. “Once the database is large enough, we will start to see patterns.”

To really succeed in that endeavor, the GIS has to work closely with the high performance computing experts at the IHPC and the hard-core data scientists at I2R. “We are the muscle behind the analytics,” says bioinformatician Feng Mengling.

Mix and match

While researchers at the GIS are more familiar with interpreting genomic sequences, researchers at I2R have extensive experience in extracting and analyzing other heterogeneous clinical information.

In 2014 Feng and his team at I2R developed a tool to assess the benefits of red blood cell transfusions. When to administer such interventions for patients in intensive care has been a subject of controversy. Their statistical evaluation was based on clinical reports of close to 40,000 individuals admitted to hospitals in the United States between 2001 and 2008. Feng’s work found that blood transfusions doubled the chance of survival for older, sicker patients, but halved survival rates in younger, healthier patients.

Feng is also working with several hospitals in Singapore to help them anticipate when a patient might need to undergo intubation. “It is a small operation but still requires a bit of prep time. We are developing a deep learning predictive model that will give clinical care staff 12 hours to prepare for the procedure,” says Feng, referring to a type of artificial intelligence that enables computers to learn by recognizing patterns. Many of these systems have been inspired by the neural networks in the human brain.

Under c-BIG, A*STAR plans to collaborate with the academic medical centers in Singapore to integrate clinical and genomic data for the first time to form one large pool of information that scientists and clinicians can dip into. “Data analytics can offer physicians the evidence needed to make more effective decisions, which will benefit their patients,” says Feng.

Meanwhile, researchers at the IHPC are developing high-performance artificial intelligence tools and a collaborative platform needed to power c-BIG. “We are excited to contribute our expertise and technologies in high-performance computing and artificial intelligence to efficiently and intelligently analyse the vast trove of medical and genomic data,” says computer scientist Rick Goh Siow Mong at the IHPC. “Through this program, we hope to advance the study of how specific medicine can be administered based on the detected variability in an individual — this is no mean feat.”

Unwelcome superbugs

In the time it takes to sequence a single human genome, machines can sequence a hundred bacterial genomes. “Bacteria are cheaper to sequence, which means that we can collect thousands or even millions of genomes to get a really fine-grained view of how bacteria evolve in an environment,” says Nagarajan, who is doing exactly that as part of Resistance and Outbreak Tracking in Singapore (ROUTES), a joint project between c-BIG and the GIS Efficient Rapid Microbial Sequencing (GERMS) platform.

Nagarajan is building a tool that can study the diversity and evolution of the hundreds of trillions of bacteria residing inside the human gut. More specifically, he is looking at how some ‘superbugs’ become resistant to a last-resort class of antibiotics known as carbapenems. Killing almost half of the patients they infect, these superbugs are spreading fast, from New York to Israel, Greece and further east, but they have yet to cause serious trouble in the Singaporean stomach.

Nagarajan wants to find out how these bacteria are transmitted between individuals, what conditions make them feel more or less welcome in the gut, and how their presence affects the gut environment. His research could even lead to potential remedies, whether it be a bacterium that the new residents find repulsive or one that can kill them. “There is an arms race within bacteria, and a lot of groups are searching the genetic information of microbial communities for potential antibiotics.”

A fishy infection

Data detectives confirmed the food source of a spike in group B streptococcus infections in Singapore in 2015.

Courtesy of CDC/ James Archer

In early 2015, a mysterious outbreak of group B streptococcus appeared in Singapore. The bacterium typically lines the intestines and urinary tracts of one-third of healthy adults, and is mostly considered harmless. The only doctors who really worry about it are obstetricians, because of the risk it poses to newborns. But here were strong and fit adults succumbing to the infection, arriving at emergency wards feeling feverish and confused, with severe headaches.

Between January and July, 238 streptococcus cases were detected in Singapore, compared to an annual average of 150. Government officials soon traced these cases to a popular dish of raw freshwater fish called yusheng, which is served with rice porridge at hawker stalls. “Group B streptococcus was not known to be a food-borne illness until this outbreak,” says Swaine Chen at the National University of Singapore and GIS.

Chen and his colleagues used genomic sleuthing to further investigate the source of the outbreak. “We found that more than 90 per cent of the group B streptococcus cases during that period were infected by the exact same strain,” he says. “It was a total slam dunk.”
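
The kind of comparison behind such a finding can be sketched in miniature: count the single-letter differences between each patient's isolate and a candidate outbreak strain, then see what share of isolates match. The code below is an illustration only, with invented sequences and hypothetical function names. It assumes the genomes are already aligned to equal length and is not the pipeline Chen's team used.

```python
def snp_distance(seq_a, seq_b):
    """Count positions where two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

def share_of_outbreak_strain(isolates, outbreak_strain, max_snps=0):
    """Return the fraction of isolates within `max_snps` of the outbreak strain."""
    matching = [
        name for name, seq in isolates.items()
        if snp_distance(seq, outbreak_strain) <= max_snps
    ]
    return len(matching) / len(isolates), matching

# Invented toy data: five short "genomes" standing in for patient isolates.
outbreak_strain = "ACGTTGCAAG"
isolates = {
    "patient_1": "ACGTTGCAAG",
    "patient_2": "ACGTTGCAAG",
    "patient_3": "ACGTTGCAAG",
    "patient_4": "ACGATGCTAG",  # unrelated background strain
    "patient_5": "ACGTTGCAAG",
}

share, matching = share_of_outbreak_strain(isolates, outbreak_strain)
print(f"{share:.0%} of isolates match the outbreak strain")
```

Real bacterial genomes run to millions of letters, and `max_snps` would usually allow for a handful of differences, since even closely related isolates accumulate a few mutations.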

Genomics offers a level of precision that other detective tools cannot, and has become common practice in outbreak surveillance. It was used to assess the 2014 Ebola epidemic, and is regularly used to monitor listeria and salmonella contamination in food. As part of ROUTES, Chen is trying to increase the scale and speed of such analyses for almost any bacterium.

Driving the initiative is the proliferation of genomic bacterial data. Chen now has access to 2,000 strains of group B streptococcus, with close to 50,000 strains of Escherichia coli and 130,000 strains of Streptococcus pneumoniae becoming publicly available in the next year. Once the system is in place, his team could analyze the raw bacterial data of an emerging outbreak within six hours, instead of the week it took them in 2015. Eventually, the system will be so easy to use that physicians could do the detective work themselves. “We want clinicians to be able to manage the deluge of data without needing to know much about the low-level processing.”

Penguin flu

Influenza virus sequencing has shown that even penguins can catch the flu.

© Simon Bottomley/DigitalVision/Getty

What Chen is doing for bacteria, Sebastian Maurer-Stroh at the BII is doing for viruses. In the midst of the swine flu pandemic in 2009, he developed FluSurver, a public online tool for analyzing sequences of the influenza virus to identify mutations and determine how those changes affect the structure of the virus.
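
At its simplest, identifying mutations means comparing a newly sequenced viral protein against a reference, position by position. The sketch below illustrates only that first step and makes no claim about how FluSurver itself works. It assumes the two sequences are already aligned to the same length, uses invented toy sequences, and reports changes in the common notation of reference residue, 1-based position, then new residue.

```python
def find_mutations(reference, query):
    """List substitutions between two aligned protein sequences.

    Each mutation is written as <reference residue><1-based position><new residue>.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{ref_residue}{position}{new_residue}"
        for position, (ref_residue, new_residue)
        in enumerate(zip(reference, query), start=1)
        if ref_residue != new_residue
    ]

# Invented toy sequences, not real influenza proteins.
reference_protein = "MKTIIALSYI"
sampled_protein = "MKTIVALSYL"

mutations = find_mutations(reference_protein, sampled_protein)
print(mutations)
```

Working out how each change affects the structure of the virus is the harder part, and requires structural and functional annotation well beyond this comparison.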

The World Health Organization’s national influenza centers, based in 113 countries, use FluSurver for their surveillance. The tool is connected to the virus database of the Global Initiative for Sharing All Influenza Data (GISAID), and up to 20,000 influenza sequences are analyzed every year. It has been used to spot where and when new mutations emerge in distant reaches of the world, including antiviral-resistant swine flu variants in Singapore and Australia, as well as highly pathogenic strains of the avian influenza virus in Mexican chicken farms and in live poultry markets in eastern China. In 2014, a large collaborative team including Maurer-Stroh confirmed for the first time that even penguins catch the flu, and that their flu type is not dangerous to humans.

Beyond FluSurver, the BII has applied genomic analysis tools to outbreaks of norovirus, adenovirus, dengue, hepatitis C, and even the Zika virus in Singapore. In 2016, for example, the BII team, together with the National Public Health Laboratory of the Ministry of Health, rapidly characterized the local Zika strain linked to a large cluster of cases in Singapore. Under c-BIG, Maurer-Stroh also plans to work with colleagues at I2R to map the genetic diversity of dengue against the movement of commuters through the public transport system.

Eventually, c-BIG expects to take such analyses into our everyday lives. “In the future, genomics will be just as ubiquitous as computing — everyone will have sequencers in their hands, their homes and various devices, essentially serving as sensors for life,” says Nagarajan. “We need to develop algorithms and robust systems that can aggregate data, make inferences and provide useful information about our environment in real-time.”