Big data for a big country: Mind the gap
[Nature India Special Volume: Biotechnology — An agent for sustainable socio-economic transformation]
doi:10.1038/nindia.2016.71 Published online 31 May 2016
The advent of high-throughput technology for nucleic acid, protein and metabolite profiling has revolutionised life science research. In the last five years, experiments on individual exomes, genomes, transcriptomes, epigenomes, proteomes and metabolomes have produced data on the scale of petabytes. The unprecedented scale at which data is being generated in genome science requires large compute power to store and analyse such data. For example, it is estimated that 10^21 supercomputers are required to store the total amount of information encoded in the DNA of the biosphere [1].
Furthermore, some anticipate that the computing resources needed to handle genome data will exceed those of the social networking site Twitter and the popular video sharing website YouTube [2]. While the high-throughput tools and the data they generate now make it possible to answer many of the puzzling questions in biology, dealing effectively with the data deluge and making biological meaning out of it will require eclectic solutions. For the first time in biology, the cost of data storage and computing is set to overtake the cost of data production.
It is important to understand the scale and volume of the data before we discuss solutions. For example, the Pan-Cancer Analysis of Whole Genomes (PCAWG) project of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) project are coordinating analysis of more than 2,800 cancer whole genomes to make the data available over the cloud. Genome data, along with other high-throughput data and associated metadata, from this effort alone will run into several petabytes. Data from cancer consortia, along with data from other large consortia like the Exome Aggregation Consortium (ExAC) and the Genotype-Tissue Expression (GTEx) project, and from future large population-wide programmes, will touch hundreds of petabytes by 2025.
Dealing with this unprecedented scale and pace (velocity) of data production will require precise and accurate solutions for data storage, analysis, sharing and security. Additionally, the heterogeneous nature, multi-dimensionality and complexity of biological data will require a deep understanding of multiple quantitative domains outside biology, medicine and agriculture, such as mathematics, computer science, physics, statistics and information technology.
Learning from the past
What can India learn from past efforts elsewhere, and how can scientists and institutions in India prepare to deal with large amounts of data? We will need to build infrastructure at multiple levels, both locally and in the cloud, to deal with data at this massive scale.
First, we need to put processes in place to check sample quality at the source (hospitals, research universities and institutions). Once the genome centres get good quality samples, they need to produce high quality sequence data, a problem that is to a large extent resolved by the instrumentation and biological assays available from commercial vendors.
Second, we need to build hardware and cloud computing infrastructure for effectively storing, securely sharing and transferring information from the genome centres to a centrally managed server.
Third, we need to design user-friendly and intuitive software, allowing researchers to perform smart analysis, visualisation and interpretation of data.
Fourth, we need to build robust databases that support collaboration and secure data sharing for wider dissemination. Last, and equally important, will be to create policies and ethical guidelines covering data sharing, data ownership, and protection standards with proper encryption and authentication safeguards. In this context, we can learn from and follow the policies and guidelines put together by consortia like the Global Alliance for Genomics and Health (https://genomicsandhealth.org).
What is the status of India's infrastructure and how do we plan to deal with the challenges? Whether one supports India’s involvement in large sequencing projects or not, we need to be ready to utilise the data coming out of genome centres. In the case of India-centric diseases and population-specific genetic variations, we need to generate our own data. The price of generating genome-scale sequencing information has come down drastically (to below 1 lakh rupees in reagent cost for sequencing a single human genome at 30X coverage).
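To give a rough sense of what sequencing at 30X coverage means for storage, the arithmetic can be sketched as below. This is a back-of-the-envelope illustration, not a figure from the article: the genome size of about 3.1 billion bases and the allowance of two bytes per sequenced base (one for the base call, one for its quality score, uncompressed) are assumptions, and the function names are hypothetical.

```python
# Back-of-the-envelope estimate of raw read data for whole-genome sequencing.
# ASSUMPTIONS (not from the article): genome size of ~3.1e9 bases, and
# 2 bytes per sequenced base (base call + quality score, uncompressed).

HUMAN_GENOME_BP = 3_100_000_000  # assumed approximate human genome size


def raw_read_bytes(genome_bp: int, coverage: int, bytes_per_base: int = 2) -> int:
    """Estimate uncompressed bytes of read data for one genome."""
    return genome_bp * coverage * bytes_per_base


def to_gigabytes(n_bytes: int) -> float:
    """Convert bytes to gigabytes (10^9 bytes)."""
    return n_bytes / 10**9


def to_petabytes(n_bytes: int) -> float:
    """Convert bytes to petabytes (10^15 bytes)."""
    return n_bytes / 10**15


if __name__ == "__main__":
    one_genome = raw_read_bytes(HUMAN_GENOME_BP, coverage=30)
    print(f"One genome at 30X: about {to_gigabytes(one_genome):.0f} GB")
    # Hypothetical population-scale programme of one million genomes.
    print(f"One million genomes: about {to_petabytes(one_genome * 1_000_000):.0f} PB")
```

Under these assumptions a single genome comes to roughly 186 GB of uncompressed reads, and a hypothetical programme of one million genomes to roughly 186 PB. Compressed formats are considerably smaller in practice, but the order of magnitude shows why storage and computing costs become the dominant concern.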
India needs to focus on developing inexpensive hardware and assays for sequencing smaller genomes and/or a small number of genes in the human genome for clinical applications. India also needs to focus on building databases. We need to build databases for India-centric diseases and mirror the genomic data held in large databases outside India, for example at the National Center for Biotechnology Information (NCBI) in the USA, the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ), for faster download of sequence information. These databases also need to link to the vast amount of data currently stored outside government-supported centres like NCBI, primarily with commercial vendors who are helping to manage a large amount of data [3], for easier download and analysis.
India’s strengths in fields like frameworks and models, data and metadata mining and management, design and integration of platforms, parallel and distributed systems, and database design and implementation will come in handy in optimising low-cost design, infrastructure and protocols for data storage and sharing. By leveraging the best practices in information technology, India can not only create such an infrastructure but also be a global leader in the inexpensive design of this infrastructure (both at the local level and in the cloud) for data analysis, sharing and security at zettabyte (10^21 bytes) or even yottabyte (10^24 bytes) scale.
None of this is possible without high-bandwidth connectivity and a proper policy framework. The Government’s initiatives on Digital India (DI) and the Internet of Things (IoT), and infrastructure initiatives like the National Knowledge Network (NKN) to connect educational and research institutions and universities, will help in realising the dream of accessing big data anywhere and everywhere.
However, solving the big data problem to enhance innovation in biotechnology is not attainable without a proper big data policy from the Government and an institution that can store, archive and hold all the research data. Setting up a nodal institution under the Department of Biotechnology to administer, store, archive and make big data in biotechnology widely available and accessible should be a priority. Such an institution should follow a hub-and-spoke model. The data-generating institutions around the country (the spokes) will generate and transfer data to this central institution (the hub), which will then be responsible for quality control of the data, building proper long-term infrastructure for data archiving, building databases, setting up data sharing protocols on the cloud and ensuring data availability across the country.
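One small piece of the quality control that a hub could apply to incoming data is an integrity check on each transferred file. The sketch below is only an illustration under stated assumptions: the article does not prescribe any particular method, the choice of SHA-256 is an assumption, and the function names are hypothetical. The idea is that a spoke publishes a checksum alongside each file and the hub recomputes it on arrival.

```python
# Minimal sketch of a file-integrity check for spoke-to-hub transfers.
# ASSUMPTIONS (not from the article): SHA-256 is the agreed checksum, and
# each spoke sends the expected hex digest alongside the file.
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks
    so that very large sequence files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(path: str, expected_sha256: str) -> bool:
    """Return True if the file at `path` matches the digest sent by the spoke.
    The comparison ignores letter case and surrounding whitespace."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A hub could call `verify_transfer` on every received file and request a re-send when it returns False. This only confirms that the bytes arrived unchanged; checks on sample and sequence quality would be separate steps.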
Additionally, the Government needs to come up with white papers and policy documents in the area of information access. An “Open National Science Information Access Policy” document should detail who owns one’s genome data and set guidelines for data distribution and sharing. The first step in making this a reality is to mandate free and immediate availability of all scientific information, including all raw data, scripts and code, while maintaining confidentiality.
In this direction, the Open Access policy from the Department of Science and Technology (DST) and the Department of Biotechnology (DBT) on research manuscripts is praiseworthy. However, it needs to go further and mandate making raw data, scripts and code openly and immediately available.
Finally, in addition to scientists, the Government needs to engage a wider range of experts, including sociologists, lawyers, jurists, genetic counsellors, anthropologists and parliamentarians, in formulating the policies and helping Parliament make appropriate laws in the area of human data sharing and data distribution.
Employing best practices
Once the infrastructure, institutions and policies are in place, we need to put the best practices into action to spur innovation. The Government needs to engage with industry so that the fruits of big data in biology reach citizens through products and services. For this to happen, industry needs to see a financial incentive.
While engaging positively and encouraging entrepreneurship through various schemes in the areas of big data in medicine and agriculture, the Government must make sure that the information of ordinary Indians is utilised ethically and within a proper legal, cultural and financial framework. In molecular medicine, for example, the three important stakeholders, government, industry and academia, have to work together and in synergy [4].
In an interconnected world, knowledge is no longer the monopoly of a select few. You don't have to attend the best universities to learn from the best professors, visit the best libraries to read journal articles, or buy expensive hardware to analyse large datasets. All of this can be achieved online. Therefore, access through universal connectivity has become more important than physical possession. This requires providing 350 million young Indians below the age of 25 years with free, open and uninterrupted access to all scientific information.
However, access to data and information must be followed through with advice from domain experts, as data do not necessarily translate into solutions and more data do not necessarily lead to an understanding of causation in the many unsolved problems in biology. If done right, big data could vitalise innovation and employment in India for the next few decades.
*The author is affiliated with Ganit Labs, Bio-IT Centre, Institute of Bioinformatics and Applied Biotechnology; and Strand Life Sciences, Bangalore, India.