doi:10.1038/nindia.2011.170 Published online 25 November 2011

Virtual cell: How to make one

Constructing a virtual cell out of a real cell calls for massive number crunching of experimental biological data into computable units. The key is to use the right combination of methods and tools, says Pawan Dhar.

A computational model of any cell is a reduced-dimension representation of reality. It converts a dynamic system into a series of snapshots which, when run together, create a movie.

Studying the model and understanding the collective properties of system components vis-à-vis higher-level behavior is the new science of systems biology.

Real to virtual

One starts off by collecting experimental data, converting that into a model and watching the model grow over time. The first step is to make a whole cell map out of gene-RNA-protein interactions.

The most useful map is the one that is based on the thinking: 'if this condition exists, show me the molecular interaction'. If the question involves several organs, then a cell-cell interaction map would be the most useful for starters.

In the next step, the connectivity relationship between any two molecules is represented qualitatively or quantitatively. Qualitative data answer questions such as: is the gene on or off, is the protein produced or not? Quantitative information, on the other hand, answers questions such as: how much protein is produced, for how long is it produced, what is the strength of protein-DNA binding, and what is the flux through a certain metabolic pathway at steady state?
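
The distinction can be made concrete with a small sketch in Python. It is an illustration only: the two-gene circuit, the function names boolean_step and protein_level, and the rate constants are assumptions made for this example, not data from any real cell. The qualitative version tracks only on/off states, while the quantitative version tracks how much protein accumulates over time.

```python
# Qualitative view: a hypothetical circuit in which gene A activates gene B.
def boolean_step(state):
    """One synchronous update: B is on in the next step only if A is on now."""
    return {"A": state["A"], "B": state["A"]}


# Quantitative view: protein amount under constant synthesis and
# first-order degradation, dP/dt = k_syn - k_deg * P, with P(0) = 0.
def protein_level(k_syn, k_deg, t_end, dt=0.01):
    """Integrate the rate equation with the forward Euler method."""
    p = 0.0
    for _ in range(int(round(t_end / dt))):
        p += dt * (k_syn - k_deg * p)
    return p


# Illustrative inputs only (arbitrary units).
qualitative = boolean_step({"A": True, "B": False})
quantitative = protein_level(k_syn=2.0, k_deg=0.5, t_end=40.0)
```

In this sketch the qualitative model can only say that B switches on, whereas the quantitative model settles near the steady-state level k_syn / k_deg.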

The choice of method depends upon the availability of data and familiarity with the appropriate tools [1].

While modeling whole cell systems, one has to keep track of hundreds of processes occurring simultaneously. For efficient management of large volumes of data, researchers use special data analysis, data visualisation and data storage tools.

Once the model is constructed, they use simulation and analytical tools to understand its behavior. By altering components and their associated parameters, one can study system behavior under various conditions and make appropriate predictions. The model allows one to ask questions like: what would happen if we knocked genes in or out? The accuracy of the answer depends upon the correctness and completeness of the data that go into the model.
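
A knock-out question of this kind can be sketched with a toy Boolean network in Python. Everything here is hypothetical: the three genes, their wiring (A activates B, B represses C) and the simulate helper are assumptions chosen for illustration, not a published model.

```python
def simulate(rules, state, knocked_out=(), max_steps=20):
    """Update all genes synchronously until the state stops changing.

    Genes listed in knocked_out are held off throughout. If no fixed point
    is reached within max_steps (for example, the network oscillates),
    the last state visited is returned.
    """
    state = dict(state)
    for gene in knocked_out:
        state[gene] = False
    for _ in range(max_steps):
        new_state = {
            gene: False if gene in knocked_out else rule(state)
            for gene, rule in rules.items()
        }
        if new_state == state:
            break
        state = new_state
    return state


# Hypothetical wiring: A is an external input, A activates B, B represses C.
rules = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"],
    "C": lambda s: not s["B"],
}
start = {"A": True, "B": False, "C": False}

wild_type = simulate(rules, start)
knockout_b = simulate(rules, start, knocked_out=["B"])
```

Comparing wild_type with knockout_b shows the predicted effect of removing B: in this toy network, C is no longer repressed. As the article notes, such a prediction is only as good as the wiring that went into the model.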

The inventories

Making a virtual cell calls for massive data integration. The data come in the form of a parts inventory, an interaction inventory and a context inventory.

The 'parts inventory' is made up of genes, various types of RNAs and proteins. The 'interaction inventory' is composed of RNA-protein, protein-protein, RNA-RNA and protein-DNA interactions stitched into metabolic pathways, gene regulatory networks and signaling networks. The 'context inventory' is composed of all the environmental and cell culture conditions under which the data were collected.
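
One way to picture the three inventories is as linked records, sketched below in Python. The class names, fields and the placeholder entries (geneX, proteinY, the two culture conditions) are all assumptions made for this illustration; real databases use far richer schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """Parts inventory entry: a gene, an RNA or a protein."""
    name: str
    kind: str


@dataclass(frozen=True)
class Interaction:
    """Interaction inventory entry, e.g. a protein-DNA interaction."""
    source: Part
    target: Part
    kind: str


@dataclass(frozen=True)
class Context:
    """Context inventory entry: conditions under which data were collected."""
    description: str


@dataclass(frozen=True)
class Observation:
    """Ties an interaction to the context in which it was observed."""
    interaction: Interaction
    context: Context


def interactions_in(observations, context):
    """Answer: 'if this condition exists, show me the interactions'."""
    return [o.interaction for o in observations if o.context == context]


# Placeholder entries, not real biological data.
gene_x = Part("geneX", "gene")
protein_y = Part("proteinY", "protein")
binding = Interaction(protein_y, gene_x, "protein-DNA")
heat_shock = Context("hypothetical heat-shock culture")
standard = Context("hypothetical standard culture")
observations = [Observation(binding, heat_shock)]

found = interactions_in(observations, heat_shock)
missing = interactions_in(observations, standard)
```

Keeping the context as a record of its own, rather than a free-text note, is what lets the same interaction be queried condition by condition.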

One frequently ends up integrating gene expression, protein expression, RNA and protein interaction data into contextual relationships. This calls for developing appropriate interfaces among various data types. It also calls for data management, data visualisation and a suite of analytical tools that help find answers across databases.

Modeling lingo

Currently, more than 150 tools assist researchers in biological model building. To ensure that different tools talk to each other seamlessly, exchange data and enable data display and analysis, the Systems Biology Markup Language (SBML) was invented [2].

The SBML framework brings rich, heterogeneous data outputs and operating-system formats down to a common denominator: an XML-encoded, machine-readable format. With the help of client-side libraries, a model built on Linux, for example, can be opened for editing and compilation by another tool developed on Windows.
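
Because SBML is an XML-based format, the idea of tool-to-tool exchange can be sketched with nothing more than Python's standard XML library. The snippet below is a hand-built skeleton under stated assumptions: the helper names build_model and read_species, the model id and the species ids are hypothetical, and the output has not been checked with an SBML validator. Real tools would normally rely on a dedicated SBML library instead.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 3 Version 1 Core.
SBML_NS = "http://www.sbml.org/sbml/level3/version1/core"


def tag(name):
    """Return an element name qualified with the SBML namespace."""
    return f"{{{SBML_NS}}}{name}"


def build_model(model_id, species_ids):
    """'Tool A': write a skeleton SBML-style document as a string."""
    ET.register_namespace("", SBML_NS)
    root = ET.Element(tag("sbml"), {"level": "3", "version": "1"})
    model = ET.SubElement(root, tag("model"), {"id": model_id})
    compartments = ET.SubElement(model, tag("listOfCompartments"))
    ET.SubElement(compartments, tag("compartment"),
                  {"id": "cell", "constant": "true"})
    species_list = ET.SubElement(model, tag("listOfSpecies"))
    for species_id in species_ids:
        ET.SubElement(species_list, tag("species"), {
            "id": species_id,
            "compartment": "cell",
            "hasOnlySubstanceUnits": "false",
            "boundaryCondition": "false",
            "constant": "false",
        })
    return ET.tostring(root, encoding="unicode")


def read_species(xml_text):
    """'Tool B': recover the species ids from the shared document."""
    root = ET.fromstring(xml_text)
    return [s.get("id") for s in root.iter(tag("species"))]


document = build_model("toy_model", ["S1", "S2"])
species = read_species(document)
```

The writer and the reader share nothing except the text of the document, which is the point of a common exchange format.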

Despite enormous efforts in the systems biology community over the last decade to address data treatment and data portability issues, there is still a need to integrate data emerging from various technologies and to build data standards for storage and communication.

For example, it is quite challenging to model gene expression scenarios and merge them with spatial and temporal information of pathways and networks. Likewise, semantically codified biological data is sparse due to lack of common data publishing standards and non-uniform data storage formats among biological databases.

Data power

In quantitative modeling, a lack of relevant and sufficient data often hampers a well-planned modeling effort. Frequently, one is forced to import data from unrelated systems to generate a sense of completeness in the model. For example, if the aim is to build a pathway model for a certain plant and data on the catalytic turnover rates of its enzymes are missing, one frequently ends up importing the values from yeast and E. coli!

Another important issue is the accuracy of published data. For example, the enzyme kinetic data reported in publications come from experiments that use aqueous buffers at a certain pH. In reality, however, a cell is not a bag of aqueous solution. It resembles a gel. One also often comes across data conflicts that are difficult to resolve.

It is standard practice to use the Michaelis-Menten equation to model metabolic pathways. However, enzyme kinetic equations were derived using a number of assumptions that are unrealistic in a biological setting, such as a well-mixed reactor.
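
For reference, the Michaelis-Menten rate law gives the reaction rate as v = Vmax * [S] / (Km + [S]), where [S] is the substrate concentration, Vmax the maximal rate and Km the substrate concentration at which the rate is half of Vmax. A minimal Python sketch follows; the function name and the numerical values are illustrative assumptions, not measured kinetic parameters.

```python
def michaelis_menten_rate(substrate, v_max, k_m):
    """Reaction rate v = v_max * S / (k_m + S).

    Like the rate law itself, this assumes a well-mixed system.
    """
    if substrate < 0 or v_max < 0 or k_m <= 0:
        raise ValueError("need substrate >= 0, v_max >= 0 and k_m > 0")
    return v_max * substrate / (k_m + substrate)


# Illustrative values in arbitrary units.
half_rate = michaelis_menten_rate(substrate=2.0, v_max=10.0, k_m=2.0)
near_saturation = michaelis_menten_rate(substrate=2.0e6, v_max=10.0, k_m=2.0)
```

With the substrate concentration equal to Km the rate is exactly half of Vmax, and at very high substrate concentrations it approaches Vmax. Neither property says anything about whether the well-mixed assumption holds inside a gel-like cell.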

For these reasons, modeling is more of an art, and it depends upon one's personal understanding of the most biologically relevant abstraction.

One tends to heavily mathematize biology in a quest to create a linear system out of non-linear biology. However, forcing too much mathematics onto biology creates its own problems: a lack of knowledge of fundamental parameter values, a huge quantitative space of unknowns, and error-prone predictive methods to fill that space.

In an ideal setting, one would want to integrate the cell-level inventory with the tissue-level inventory. However, the challenge is enormous from the point of view of data integration, which involves keeping track of bad data, the false discovery rate and the incompleteness of the model.

Though a mathematized transformation of biological data allows better tractability, the nature of the abstraction itself unfortunately constrains the evolution of the system to within a given set of parameters. Also, one does not know how to handle an input for which prior knowledge was not hardwired into the model. For this reason, emergent properties that arise from simple interactions are difficult to capture and simulate.

Finally, there are situations where part of the system may be modeled qualitatively and part of the system quantitatively. An interface that seamlessly enables data movement between two different subsystems, within the same model, needs to be developed. This calls for innovative thinking and sound knowledge of biology.
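
As a rough sketch of what such an interface might look like, the Python snippet below couples a qualitative gene switch to a quantitative protein equation through two small conversion functions. The thresholding rule, the function names and all numbers are assumptions made for illustration; choosing a biologically defensible conversion is precisely the open problem described above.

```python
def to_qualitative(concentration, threshold):
    """Quantitative -> qualitative: the signal counts as 'on' above a threshold."""
    return concentration >= threshold


def to_quantitative(is_on, rate_when_on):
    """Qualitative -> quantitative: an 'on' gene contributes a fixed synthesis rate."""
    return rate_when_on if is_on else 0.0


def hybrid_run(signal_levels, threshold, k_syn, k_deg, dt=0.1):
    """Step a Boolean gene switch and a protein rate equation together.

    At each time step the measured signal is reduced to on/off, and that
    state sets the synthesis term in dP/dt = synthesis - k_deg * P, which
    is advanced by one forward Euler step.
    """
    p = 0.0
    trace = []
    for level in signal_levels:
        gene_on = to_qualitative(level, threshold)
        p += dt * (to_quantitative(gene_on, k_syn) - k_deg * p)
        trace.append(p)
    return trace


# Illustrative inputs: the signal crosses the threshold at the third step.
trace = hybrid_run([0.0, 0.0, 1.0, 1.0], threshold=0.5, k_syn=1.0, k_deg=0.0)
```

Here the protein level stays at zero while the gene is off and rises once the signal crosses the threshold. Information is lost at each conversion, which is one reason a seamless interface is hard to design.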

This article is the second in a series entitled 'Virtual Cell'.


  1. Ghosh, S. et al. Software for systems biology: from tools to integrated platforms. Nat. Rev. Genet. 12, 821-832 (2011)
  2. Hucka, M. et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19, 524-531 (2003)