As an alternative to MLSA data, we are also making available scripts and data for whole genome analysis of Agrobacterium and Rhodococcus isolates. Due to the large time and computational requirements, this analysis must be run on the command line of your own computer. The following instructions describe the pipeline and process for generating a core genome phylogenetic tree from whole genome sequencing data. The instructions are written for the Rhodococcus dataset, however analysis of Agrobacterium works the same way.
This pipeline relies on paired end Illumina sequencing reads, however pileup files can be generated from other sequencing data using SMALT, which is not described here.
Begin by installing the following software programs and make sure the executables are in your command line path. If you wish to remove sites resulting from recombination (which can skew phylogenetic analyses) before generating a phylogeny, also install the Gubbins program.
Next, download the core genome pipeline scripts and a SMALT index file and pre-computed pileup files for either Agrobacterium or Rhodococcus depending on your sequenced isolates.
Pipeline scripts for core genome analysis: Download
Agrobacterium tumefaciens strain C58 SMALT Index files: Download
Note: This index was generated for NCBI Refseq assembly GCF_000XXXXX.1 using SMALT v0.7.5, with a hash length of 13 and a word space of 2 ( -k 13 -s 2).
Rhodocccus fascians strain A44a SMALT Index files: Download
Note: This index was generated for NCBI Refseq assembly GCF_000760735.1 using SMALT v0.7.5, with a hash length of 13 and a word space of 2 ( -k 13 -s 2).
If you wish to build your own SMALT index for use with this pipeline, name your reference genome file in FASTA format as "reference.fna" and place it in the "index" folder of the pipeline. Then while in the "index" folder run the command:
smalt index -k 13 -s 2 reference reference.fnaThis will generate a SMALT hash index using a word length of 13 (-k 13) and a word size of 2 (-s 2). You can change these parameters depending on what is optimal for your dataset.
First, download the pipeline scripts from GitHub into its own folder using the following command:
git clone https://github.com/osuchanglab/WGS-Pipeline.git Rhodococcus_analysisUnzip the downloaded SMALT index and pre-compiled pileup files into the same folder:
tar xvzf Rhodococcus_A44a_smalt_index.tar.gz -C ./Rhodococcus_analysis/index tar xvzf Rhodococcus_A44a_smalt_pileups.tar.gz -C ./Rhodococcus_analysis/pileupCopy your Illumina paired-end read files in fastq format into the analysis folder:
cp my_isolate.1.fastq ./Rhodococcus_analysis/reads/ cp my_isolate.2.fastq ./Rhodococcus_analysis/reads/You can add sequenced isolates into the folder, just make sure that each dataset is consistently named (ie a dataset named datasetname has the files datasetname.1.fastq and datasetname.2.fastq for forward and reverse reads, respectively).
cd ./Rhodococcus_analysis ./generate_pileup.shThe script will ask for the location of the ssaha_pileup pipeline folder as well as the estimated insert size for each genome. Generated pileup files will be placed in the pileup folder and named with the format 'datasetname.pileup'.
./generate_core_alignment.shThis creates a file called './core_alignment.fasta' containing the 90% shared core alignment in CLUSTAL format.
If you want to remove sites putatively resulting from recombination events from your alignment, the program Gubbins can be used with the following script:
./remove_recombination.shThis will produce a SNP alignment file containing all nonrecombinant polymorphic sites called './core_alignment_filtered_SNPs.fasta'. It also produces a pdf file (core_alignment.recombinant_snps.pdf) detailing the identified recombinant sites.
To generate a maximum likelihood phylogeny using RAxML, the following script can be used:
./generate_phylogeny.shIf you removed recombinant sites in the previous step, the script will use the output of Gubbins (core_alignment_filtered_SNPs.fasta) instead of the full core alignment (core_alignment.fasta).
If you wish to publish output from this pipeline, please cite the following papers and programs:
Ning Z, Cox AJ and Mullikin JC. SSAHA: a fast search method for large DNA databases. 2001. Genome research (11) 10:1725-9. PUBMED: 11591649; PMC: 311141; DOI: 10.1101/gr.194201
Hannes Ponstingl. SMALT. Welcome Trust Sanger Institute. http://smalt.sourceforge.net
A. Stamatakis: "RAxML Version 8: A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies". In Bioinformatics, 2014, open access.
Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. "Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins". doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014.