Björn Canbäck Bioinformatics

Supplementary information to Pommier et al. “RAMI, a tool for identification and characterization of phylogenetic clusters in microbial communities”

Method description

Collection of sequences, quality checking and taxonomic inference

We used existing 16S rRNA sequences of marine bacterioplankton from eight locations spread around the world (Acinas et al., 2004; Pommier et al., 2007). Taken together, these two sampling efforts represent the largest data sets of rRNA gene sequences from environmental clone libraries of marine bacterial communities yet available. Both studies also aimed to reveal the fine-scale structure of the bacterioplankton community and followed nearly the same experimental procedure: comprehensive clone libraries were constructed from each location, with careful attention to sequence quality and monitoring to ensure significant clone coverage. However, because of their small community size, we excluded two samples from Pommier et al.: the Baffin Bay and Arctic Ocean locations. Since gene microdiversity may result from PCR biases or sequencing errors, a scrupulous sequence quality control of the clone libraries was carried out (Pommier et al., 2007).
This strict selection for accurate sequences retained 2,878 sequences from the seven locations in Pommier et al. and 1,081 sequences from Acinas et al. All sequences were aligned using the online tool from Greengenes (DeSantis et al., 2006), in sets defined by geographic origin and, for the sequences retrieved during the global sampling in 2003 (Pommier et al., 2007), also in sets defined by the phylogenetic group to which they belong. Sequences that could not be successfully aligned were removed from the dataset. Sequences were assigned to phyla or proteobacterial divisions with the aid of the Greengenes Compare-Classify module. A sequence was assigned only if at least four of the six taxonomic nomenclatures used in Greengenes (RDP, NCBI, G2-chip, Pace, Ludwig and Hugenholtz) gave the same result. Sequences with ambiguous taxonomic assignments were removed, leaving a final set of 2,738 sequences from the seven locations in Pommier et al. and 1,081 sequences from Acinas et al.
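The four-of-six agreement rule can be illustrated with a short Python sketch. It is not the Greengenes Compare-Classify module itself. The function name `consensus_assignment`, the dictionary layout and the example labels are assumptions made for illustration only.

```python
from collections import Counter
from typing import Mapping, Optional


def consensus_assignment(
    labels: Mapping[str, Optional[str]], min_agree: int = 4
) -> Optional[str]:
    """Return the taxon named by at least `min_agree` taxonomies, else None.

    `labels` maps a taxonomy name (e.g. "RDP") to the taxon it reports for
    one sequence; a missing classification is given as None.
    """
    counts = Counter(taxon for taxon in labels.values() if taxon)
    if not counts:
        return None
    taxon, votes = counts.most_common(1)[0]
    # With six taxonomies and a threshold of four, at most one taxon can pass.
    return taxon if votes >= min_agree else None


# Hypothetical labels for a single sequence
example = {
    "RDP": "Bacteroidetes",
    "NCBI": "Bacteroidetes",
    "G2-chip": "Bacteroidetes",
    "Pace": "Bacteroidetes",
    "Ludwig": "Cyanobacteria",
    "Hugenholtz": None,
}
print(consensus_assignment(example))  # Bacteroidetes
```

A sequence for which the function returns `None` corresponds to an ambiguous assignment and would be dropped from the dataset.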

Comparison of cluster assemblies

To produce figure 3, all full-length gammaproteobacterial 16S rRNA sequences were downloaded from the Greengenes database core set in aligned FASTA format. The alignment was converted to the PHYLIP format, which is the required input format for both dnadist from the PHYLIP package (Felsenstein, 2005) and RAxML (Stamatakis et al., 2005). dnadist was run using a gamma distribution and a rate of 0.5 and produced the input matrix for DOTUR, which was used with nearest neighbor as the clustering method (-c n). RAxML, which produced the input tree file for RAMI, was run with the switch -m set to GTRGAMMA and otherwise default values. To run BLASTclust, all gaps were removed from the alignment to recover the raw sequences. BLASTclust was run with default settings. Both RAMI and BLASTclust were run with various thresholds to create cluster assemblies with different numbers of clusters.
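The gap-removal step used to prepare the BLASTclust input could look roughly like the Python sketch below. It is an illustration, not the script used in the study. The helper names and the assumption that gaps are written as `-` or `.` are ours.

```python
from typing import Dict, Iterable, List

GAP_CHARS = "-."  # assumption: gaps in the aligned FASTA appear as '-' or '.'


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA records into {identifier: sequence}, joining wrapped lines."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier is the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def strip_gaps(aligned: Dict[str, str]) -> Dict[str, str]:
    """Remove every gap character so only the raw sequence remains."""
    table = str.maketrans("", "", GAP_CHARS)
    return {name: seq.translate(table) for name, seq in aligned.items()}


# Hypothetical two-record alignment
aligned = parse_fasta([">seq1 example", "AC--GT..", "-A", ">seq2", "----"])
raw = strip_gaps(aligned)
print(raw["seq1"])  # ACGTA
```

A record that consists only of gaps ends up as an empty string and would need to be discarded before clustering.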

Alignment, phylogenetic trees and clustering
Unlike in our previous study (Pommier et al., 2007), the sequences were now aligned with the aid of Greengenes (DeSantis et al., 2006; http://greengenes.lbl.gov), which aligns 16S rRNA sequences to full-length gene templates of 7,682 characters. Thresholds were set to 250 bp for minimum length and 60% for minimum percent identity. The 2,738 retrieved aligned sequences, mostly consisting of gaps, were grouped into the phyla Actinobacteria, Bacteroidetes, Cyanobacteria, Planctomycetales and Verrucomicrobiae and, in the case of Proteobacteria, into the divisions α-, β-, δ- and γ-Proteobacteria because of the dominance of these sequences in the dataset. In parallel, sequences were also classified according to their geographic origin (i.e. location).
Phylogenetic trees based on the resulting alignments were reconstructed with the aid of RAxML (Stamatakis et al., 2005). RAxML implements a fast maximum likelihood algorithm that makes it suitable for large data sets. Initial rearrangement settings were determined by the procedure described in the manual, with the exception that the number of rate categories was not estimated. Another online tool, iTOL (Letunic and Bork, 2007; http://itol.embl.de/), was used for tree drawing. To cluster sequences, RAMI was run with a patristic distance threshold of 0.01.
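For readers unfamiliar with the term, the patristic distance between two sequences is the sum of the branch lengths on the path that connects them in the tree. The Python sketch below computes it on a toy tree and groups leaves whose distance falls at or below a threshold. The grouping rule (single linkage, `<=` comparison), the tree encoding and all names are assumptions for illustration. This section does not describe RAMI's own clustering procedure, and the sketch should not be read as such.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Toy tree encoding: child -> (parent, branch length); the root has no entry.
Tree = Dict[str, Tuple[str, float]]


def root_path(tree: Tree, node: str) -> Dict[str, float]:
    """Map `node` and each of its ancestors to the distance from `node`."""
    dist = {node: 0.0}
    total = 0.0
    while node in tree:
        parent, length = tree[node]
        total += length
        dist[parent] = total
        node = parent
    return dist


def patristic(tree: Tree, a: str, b: str) -> float:
    """Sum of branch lengths on the path between leaves `a` and `b`."""
    up_a, up_b = root_path(tree, a), root_path(tree, b)
    # The lowest common ancestor gives the smallest combined distance
    # (branch lengths are assumed to be non-negative).
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def cluster(tree: Tree, leaves: List[str], threshold: float = 0.01) -> List[List[str]]:
    """Single-linkage grouping of leaves by patristic distance (assumed rule)."""
    link = {leaf: leaf for leaf in leaves}

    def find(x: str) -> str:
        while link[x] != x:
            link[x] = link[link[x]]  # path halving
            x = link[x]
        return x

    for a, b in combinations(leaves, 2):
        if patristic(tree, a, b) <= threshold:
            link[find(a)] = find(b)
    groups: Dict[str, List[str]] = {}
    for leaf in leaves:
        groups.setdefault(find(leaf), []).append(leaf)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical tree: A and B are close relatives, C is distant.
toy = {"n1": ("root", 0.05), "A": ("n1", 0.002), "B": ("n1", 0.003), "C": ("root", 0.004)}
print(cluster(toy, ["A", "B", "C"], threshold=0.01))  # [['A', 'B'], ['C']]
```

In this toy tree, A and B are about 0.005 apart and so fall into one cluster at a threshold of 0.01, while C lies about 0.056 from A and remains on its own.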

Analysis and visualization of sequence clusters
With the aid of JColorGrid (Joachimiak et al., 2006), which generates color grids from matrices, clusters in the taxonomic dataset were visualized together with the geographic origin of the corresponding sequences. The patristic distance matrix retrieved from RAMI was converted to a new matrix in such a way that sequence origins and clusters could be visualized with JColorGrid. In this way, sequence clusters appear as multicolored squares along the diagonal, and potential endemism can be revealed by the presence of a uniformly colored square. By combining the color grid with the phylogenetic tree drawn with iTOL, it becomes possible to compare sequence clusters with the tree topology.
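The exact matrix conversion is not specified here, so the following Python sketch shows only one plausible encoding, under stated assumptions: sequences are ordered so that cluster members are adjacent, a cell is set to the location code of its column sequence when row and column belong to the same cluster, and all other cells are 0. The function name, the location names and the integer codes are hypothetical.

```python
from typing import Dict, List, Sequence


def color_grid(
    order: Sequence[str],
    cluster_of: Dict[str, int],
    location_of: Dict[str, str],
    location_codes: Dict[str, int],
) -> List[List[int]]:
    """Build a square matrix whose diagonal blocks mark clusters.

    Cell (row, col) holds the location code of `col` when both sequences
    share a cluster, and 0 (background) otherwise. A cluster drawn from a
    single location therefore yields a block with one value only.
    """
    grid = []
    for row in order:
        grid.append([
            location_codes[location_of[col]]
            if cluster_of[row] == cluster_of[col]
            else 0
            for col in order
        ])
    return grid


# Hypothetical data: s1 and s2 form a mixed cluster, s3 and s4 a one-site cluster.
order = ["s1", "s2", "s3", "s4"]
cluster_of = {"s1": 1, "s2": 1, "s3": 2, "s4": 2}
location_of = {"s1": "site_A", "s2": "site_B", "s3": "site_A", "s4": "site_A"}
location_codes = {"site_A": 1, "site_B": 2}

for line in color_grid(order, cluster_of, location_of, location_codes):
    print(line)
```

With this encoding the first block contains two different codes (a multicolored square), while the second block contains a single code, which is the pattern the text associates with potential endemism.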

References

Acinas, S.G., Marcelino, L.A., Klepac-Ceraj, V., and Polz, M.F. (2004a) Divergence and redundancy of 16S rRNA sequences in genomes with multiple rrn operons. J Bacteriol 186: 2629-2635.
Acinas, S.G., Klepac-Ceraj, V., Hunt, D.E., Pharino, C., Ceraj, I., Distel, D.L., and Polz, M.F. (2004b) Fine-scale phylogenetic architecture of a complex bacterial community. Nature 430: 551-554.
DeSantis, T.Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E.L., Keller, K. et al. (2006) Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72: 5069-5072.
Felsenstein, J. (2005) PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.
Joachimiak, M.P., Weisman, J.L., and May, B.C.H. (2006) JColorGrid: software for the visualization of biological measurement. BMC Bioinformatics 7.
Letunic, I., and Bork, P. (2007) Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23: 127-128.
Pommier, T., Canbäck, B., Riemann, L., Boström, H.K., Lundberg, P., Tunlid, A., and Hagström, Å. (2007) Global patterns of diversity and community structure in marine bacterioplankton. Mol Ecol 16: 867-880.
Stamatakis, A., Ludwig, T., and Meier, H. (2005) RAxML-II: a program for sequential, parallel and distributed inference of large phylogenetic trees. Concurrency and Computation: Practice and Experience 17: 1705-1723.

Figure S1


Phylogeny of proteobacteria

Fig. S1. Clustering of sequences using alternative methods. Three widely used programs for sequence clustering, BLASTclust, Clusterer and DOTUR, were run with parameters set to produce three sequence clusters. Due to variation in G+C content, none of them produced the desired result of placing alpha- and gammaproteobacterial sequences into two distinct clusters. The software described in this report, RAMI, correctly places the two classes into two distinct clusters. It should be noted that the result of DOTUR is based on a PHYLIP-generated distance matrix, which is the input format for this software, while the result of RAMI is based on the matrix used to construct the phylogenetic tree shown in the figure. The tree was built with the aid of the Phylo_win software using the Galtier and Gouy substitution model and neighbour-joining. Sequence clusters are represented by white, grey and black boxes. Branch lengths are shown above the branches and G+C contents below. The scale bar represents substitutions per site. Accession numbers, starting from the top: EF156507, AF069062, DQ130027, M59060, AF159575, AJ310648 and AY126632.

References

Galtier, N., Gouy, M., and Gautier, C. (1996) SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny. Comput Appl Biosci 12: 543-548.

 

©2008 Björn Canbäck Bioinformatics