Taking computational care of gut metagenomics data

Joseph Nathaniel Paulson (@dorageh) and Paul Igor Costea (@CosteaPaul) are both mathematical and computer scientists. Their studies focus on processing large metagenomic datasets, specifically with gut microbiota and health in mind. They recently exchanged correspondence in Nature Methods about normalization methods for extracting signal from complex omics data. They were happy to give us their points of view on metagenomics data processing and on how statistical analysis can affect biological interpretation.

1) What is your background?

Joe: My background began with an undergraduate degree in Mathematics and Jewish Studies at the University of Maryland, College Park. While completing my bachelor’s degree, I did a summer internship modeling vesicle pool dynamics at ENS, Ulm in David Holcman’s lab. Once there, I decided I would focus on Applied Math / Statistics and apply it to biology. Towards the end of my undergraduate degree I began working with Mihai Pop on problems in modeling systems and learned about the fascinating field of metagenomics. It was a new field that I could not imagine not pursuing, so I decided to stay on at the Center for Bioinformatics and Computational Biology. I am still at the Center, and my focus has evolved to normalization and statistics under the guidance of Mihai Pop and Hector Corrada Bravo.

Paul: I hold an undergraduate degree in Computer Science from the Technical University of Cluj-Napoca. During that time I worked a lot in the private sector as a software developer for various companies. I then started a master’s in robotics at the same university but quickly discovered it was not what I wanted to do with my career. After spending another year as a developer I decided it was biology that I wanted to do and completed a master’s in Computational and Systems Biology at the Royal Institute of Technology in Stockholm. There I worked in the lab of Joakim Lundeberg on cancer omics. Currently I am a PhD student in the lab of Peer Bork at EMBL, working in the mad field of metagenomics.

Joseph N. Paulson

2) What are the main challenges in studying gut microbiota using metagenomics data?

Joe: I would argue that detecting bacteria associated with a clinical phenotype is a major challenge in studying gut microbiota. In particular, sparsity of features (be it species or genera) causes anguish. For a variety of reasons, it seems that when one sequences 1,000 samples, for example, the majority of discovered species will not be present in even half of the samples. Many of these issues can affect our ability to detect clinically relevant bacteria, such as detecting novel associations with diarrhea.
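To make the sparsity point concrete, here is a minimal sketch of how one might measure it on a count table. The table, the taxon names, and the helper functions `prevalence` and `sparsity` are all invented for illustration; they are not taken from Joe's work or from any particular package.

```python
# Toy count table: one row per sample, one read count per taxon.
# All numbers are made up for illustration.
counts = {
    "sample_1": {"taxon_A": 120, "taxon_B": 0, "taxon_C": 3, "taxon_D": 0},
    "sample_2": {"taxon_A": 80,  "taxon_B": 0, "taxon_C": 0, "taxon_D": 0},
    "sample_3": {"taxon_A": 200, "taxon_B": 5, "taxon_C": 0, "taxon_D": 0},
    "sample_4": {"taxon_A": 95,  "taxon_B": 0, "taxon_C": 0, "taxon_D": 1},
}


def prevalence(table):
    """Fraction of samples in which each taxon has a nonzero count."""
    n_samples = len(table)
    taxa = sorted({taxon for row in table.values() for taxon in row})
    return {
        taxon: sum(1 for row in table.values() if row.get(taxon, 0) > 0) / n_samples
        for taxon in taxa
    }


def sparsity(table):
    """Fraction of all (sample, taxon) entries that are zero."""
    taxa = {taxon for row in table.values() for taxon in row}
    total = len(table) * len(taxa)
    zeros = sum(1 for row in table.values() for taxon in taxa if row.get(taxon, 0) == 0)
    return zeros / total


prev = prevalence(counts)
# Taxa seen in fewer than half of the samples.
rare = [taxon for taxon, p in prev.items() if p < 0.5]
print(prev)
print("taxa present in fewer than half of the samples:", rare)
print("fraction of zero entries:", sparsity(counts))
```

In this toy table three of the four taxa appear in only one sample each, which is the pattern Joe describes: most features are absent from most samples.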

Paul: I would argue there are two main challenges. Firstly, there is the challenge of defining what we mean by any given taxonomic unit. So, for example, what is a species of bacteria? I realize there is no good answer to this, but better approximations should be a focus of research. I think this influences a lot of the downstream results and interpretation, because it is crucial for quantification. Measuring an ill-defined unit cannot yield a coherent estimate. Secondly, the detection threshold poses another problem. Even replicates from the same sample will not overlap to the desired extent, which implies that a lot of the “absent” bacteria may simply be below the detection threshold. This results in a space that is a lot sparser than it is in reality.
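The detection-threshold point can be illustrated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each read is drawn independently from the community, so a taxon with relative abundance p is missed by all N reads with probability (1 - p)^N. Real sequencing data have additional sources of bias, and the function names here are hypothetical.

```python
def detection_probability(rel_abundance, depth):
    """Probability of seeing at least one read from a taxon.

    Assumes reads are sampled independently, so the chance of missing the
    taxon in every one of `depth` reads is (1 - rel_abundance) ** depth.
    """
    return 1.0 - (1.0 - rel_abundance) ** depth


def replicate_outcomes(rel_abundance, depth):
    """Probabilities that two replicates of the same sample agree or disagree."""
    d = detection_probability(rel_abundance, depth)
    return {
        "both": d * d,                 # detected in both replicates
        "one_only": 2 * d * (1 - d),   # detected in exactly one replicate
        "neither": (1 - d) ** 2,       # missed in both replicates
    }


# Compare taxa of decreasing relative abundance at a fixed sequencing depth.
for p in (1e-3, 1e-4, 1e-5):
    out = replicate_outcomes(p, depth=10_000)
    print(f"relative abundance {p:g}: " + ", ".join(f"{k}={v:.3f}" for k, v in out.items()))
```

Under this simplified model, a taxon at a relative abundance of 1 in 10,000 is detected in only about 63% of replicates sequenced to 10,000 reads, so two replicates of the same sample will often disagree about whether it is present.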

Paul I. Costea

3) How can bioinformatics and statistical processing affect biological interpretation?

Paul: When working with immense amounts of data and complex statistical analysis, it is very easy to draw the wrong conclusions, even more so when the system under analysis is hugely complex. Having the tools (both statistical and computational) to make the analysis correct and consistent will ensure our biological interpretations are better.

Joe: Statistics and bioinformatics can help highlight biological/phenotypic variability and help tease out technical biases. Due to the sparsity, and partially due to under-sampling, it’s very easy to cluster samples according to what is ‘missing.’ That can lead to potential batch effects. Accounting for these sorts of issues can help biologists/clinicians avoid incorrect conclusions.

4) What are the main limitations of metagenomics approaches?

Paul: In my opinion, the main limitation at the moment is the considerable variation that exists in the methods, from extraction protocols to primer biases and statistical analysis. This results in considerable batch effects, which make between-study comparison close to impossible. A standardization effort such as that undertaken by the International Human Microbiome Standards consortium would do much to mitigate this issue.

Joe: One of the main limitations of metagenomics is assembly. Assembly is currently a very popular topic for single genomes, and many papers are being published that discuss tweaks to improve the relevant statistics. Without longer reads or other external information, assembly for a single genome is impossible. Now, try to assemble thousands of mixed genomes. Another limitation would definitely be accurate abundance estimation. Perhaps that topic is better left for an essay rather than an interview!

5) How can your work help clinicians in the future?

Joe: We are already able to help clinicians with our research. My research on normalization and differential abundance estimation will help researchers find associations between bacteria and clinical phenotypes of any disease. One particular project I have been fortunate enough to work on is a study on moderate-to-severe diarrhea. Following the results of that study, we plan to confirm novel associations, which will lead to better public health and clinical diagnosis.

Paul: Understanding which analysis methods are best will create a climate in which diagnostics based on the human gut microbiome become a common reality. This has great implications for clinicians and human health overall.

References
Paulson, J.N., Stine, O.C., Bravo, H.C. & Pop, M. Differential abundance analysis for microbial marker-gene surveys. Nat. Methods 10, 1200–1202 (2013).
Costea, P.I., Zeller, G., Sunagawa, S. & Bork, P. A fair comparison. Nat. Methods 11, 359 (2014). doi:10.1038/nmeth.2897
Paulson, J.N., Bravo, H.C. & Pop, M. Reply to: “A fair comparison”. Nat. Methods 11, 359–360 (2014). doi:10.1038/nmeth.2898