A study published in Nature Biotechnology in July 2014 showed that it was possible to segregate human gut metagenomics data into specific biological entities (e.g. microbial species) without the need for reference sequences. Henrik Bjørn Nielsen and Mathieu Almeida agreed to give us insights into their method, their main results, and how this could help clinicians in the future.
What is your background?

Mathieu Almeida (MA): I obtained a Bioinformatics Licence and Master’s Degree at the University of Paris 7: Denis Diderot (France). Then I obtained a doctorate in Bioinformatics at the INRA Institute (Jouy-en-Josas, France). During my PhD, I focused on the characterization of microbial communities in dairy samples under the guidance of Dr. Pierre Renault, and on the study of the microbial composition of the human intestinal tract with Dr. Dusko Ehrlich. I am currently working as a postdoctoral researcher at the Center for Bioinformatics and Computational Biology (Maryland, USA) under the guidance of Dr. Mihai Pop, and I focus on metagenomic assembly and human oral microbiota analysis.
Henrik Bjørn Nielsen (HBN): I started out as a biologist and my Master’s was in plant molecular biology in the John Mundy lab at Copenhagen University. This was mostly wetlab work. However, I quickly got a taste for in silico biology, and moved to Brunak’s lab to do a PhD in bioinformatics (2003). Since then I have worked with bioinformatics at the Technical University of Denmark. It was mostly transcriptomics in the beginning, but during the last six years it has been human microbiome research. I have been very fortunate to be involved in MetaHIT, where I met many very competent researchers, and to have a small but very dedicated and competent team around me.
What is the context of your work?
MA: This work was part of an international project named MetaHIT (METAgenomics of the Human Intestinal Tract), which aims to explore the composition of the human intestinal tract microbiota and its impact on the host’s health. In 2010, the MetaHIT consortium provided a catalog of 3.3 million genes, generated from hundreds of human intestinal tract samples. However, more than 70% of those genes come from unknown organisms, which makes the catalog hard to use in clinical association studies. So we proposed an approach that groups genes into clusters to reduce the complexity of the catalog and improve the characterization of the human intestinal tract microbiota.
HBN: In this study we set out to crack one of the hard problems in biology: namely, how to untangle a complex community of microorganisms into biological entities, like species, phages etc. One needs to understand that the majority of the species in the gut, as in all other complex microbial communities, cannot be cultivated, and therefore we only have a reference genome for a fraction of these species. I remember that we started by looking at the abundance profiles of genes that matched a reference genome – it just didn’t make sense that genes from one organism could have such inconsistent abundance profiles. In the end we decided to invert this view and look for clusters of genes with very consistent abundance profiles across many samples. In the beginning we only thought of identifying metagenomic species, but it quickly became apparent that there was a lot more in a microbiome than microorganisms. We found a lot of small entities with very consistent gene abundance profiles that looked like phages, plasmids or clonal heterogeneity, and a lot of these could be affiliated to the metagenomic species. It is of course not new that there are phages etc. in the microbiome, but identifying them and affiliating them to their host microorganisms in bulk is new, and it is very important if we want to understand the interplay between the phages and the microorganisms. To this end, time series data is very powerful, and I think the concept of conditional persistence probabilities is an important new way of analyzing such interactions.

Known and unknown microbial species reveal themselves as co-abundant entities in metagenomics data. Here gut microbial genes are projected on a two dimensional Pearson correlation space and coloured by species. (From Henrik Bjørn Nielsen)
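
The co-abundance idea described above can be illustrated with a small Python sketch. It is not the authors' published implementation, only a simplified greedy grouping under stated assumptions: genes are rows of a hypothetical abundance matrix, samples are columns, and the function name `coabundance_clusters` and the 0.9 correlation threshold are chosen here purely for illustration.

```python
import numpy as np


def coabundance_clusters(abundance, min_corr=0.9):
    """Greedily group genes (rows) whose abundance profiles across samples
    (columns) have a Pearson correlation with a seed gene of at least
    min_corr. Illustrative only; the threshold is an assumption."""
    corr = np.corrcoef(abundance)  # gene-by-gene Pearson correlation matrix
    unassigned = list(range(abundance.shape[0]))
    clusters = []
    while unassigned:
        seed = unassigned[0]
        # The seed always joins its own cluster, so the loop always advances.
        members = [seed] + [
            g for g in unassigned[1:] if corr[seed, g] >= min_corr
        ]
        clusters.append(members)
        unassigned = [g for g in unassigned if g not in members]
    return clusters


# Toy data: two made-up "species" with unrelated abundance across 6 samples.
species_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
species_b = np.array([6.0, 1.0, 5.0, 2.0, 4.0, 3.0])

# Each gene follows the abundance of its species, up to a scaling factor.
genes = np.vstack([
    species_a,        # gene 0, species A
    2.0 * species_a,  # gene 1, species A
    species_b,        # gene 2, species B
    3.0 * species_b,  # gene 3, species B
    0.5 * species_a,  # gene 4, species A
])

print(coabundance_clusters(genes))  # [[0, 1, 4], [2, 3]]
```

In this toy case the genes fall back into their two source "species" without any reference genome being consulted, which is the point of the principle. Real data would add noise, millions of genes, and entities of very different sizes, none of which this sketch attempts to handle.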
What was the most challenging part of your work?
MA: At the beginning of the project in 2010, the MetaHIT catalog was one of the biggest biological data sets ever generated, and no existing tools were able to handle it. So we needed to design a scalable clustering method that would be able to efficiently cluster millions of genes from different types of microorganisms – such as bacteria, archaea and viruses – at the same time. Additionally, since the majority of the microorganisms are unknown, our method could not rely on existing reference genomes to infer the clusters.
HBN: I think the most challenging aspect of this study was to put this into words. There is so much information in this study and many exciting details that we had to leave out. The main bioinformatic analysis in this study was done in a very short time.
What was the main finding?
MA: First of all, we reconstructed more than 200 genomes of bacteria, and more than 800 phages, with a quality comparable to that obtained by sequencing isolated, cultured microbes. As a result, we almost doubled the number of different species from the intestinal tract available in public databases. Secondly, we drew a dependency map between microorganisms and their hosts. For example, we illustrated relationships between phages and their host bacteria. This allowed the discovery of functions that improved the persistence of some microorganisms in their ecosystem.
HBN: The main finding is how powerful the co-abundance principle is in identifying biologically meaningful genetic entities in a complex community. But the affiliations by dependency associations are also very important. I also think the conditional persistence probability is very important and it shows how critical a small set of genes or a phage can be for the persistence of microorganisms in the gut.
What are the limitations?
HBN: I think the main limitation of the methodology is the number of samples that it requires to identify rare species and dependency associations. This is still beyond many studies outside the microbiome field. And then, of course, there is the challenge of interpreting all of this information. It is still a long way from identifying the species and phages to understanding their role.
MA: The method we developed relies on high-throughput shotgun sequencing of multiple samples, which is still a costly procedure. However, we expect that this will no longer be an issue in the coming years, since the price of sequencing is continuously decreasing. Also, some of the clusters may represent chimeric genomes. Because of that, we defined multiple metrics to detect them, but more work can be done to minimize this problem.
How could your study help clinicians in the future?
MA: Some microorganisms are hard to culture, so inferring their genomes gives a unique opportunity to predict their beneficial or pathogenic roles in human health. Additionally, a dependency map could help to isolate and culture new organisms involved in human health, and to propose new types of probiotics.
HBN: This is fundamental or basic research, and so the future will show its clinical relevance. I think it will be very important for clinical research – not only have we discovered potentially clinically important microorganisms, but we have also made the profiling of known species much more accurate. It is also possible that the relationship between phages and their host microorganisms may guide future phage therapy.
—
Source: Bjørn Nielsen, Almeida et al. MetaHIT consortium. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nature Biotechnology 2014.
Published by K. Campbell, with files from J. Tap.