Core Genome Analysis

As an alternative to MLSA data, we are also making available scripts and data for whole genome analysis of Agrobacterium and Rhodococcus isolates. Due to the large time and computational requirements, this analysis must be run on the command line of your own computer. The following instructions describe the pipeline and process for generating a core genome phylogenetic tree from whole genome sequencing data. The instructions are written for the Rhodococcus dataset, however analysis of Agrobacterium works the same way.


This pipeline relies on paired end Illumina sequencing reads, however pileup files can be generated from other sequencing data using SMALT, which is not described here.
Begin by installing the following software programs and make sure the executables are in your command line path. If you wish to remove sites resulting from recombination (which can skew phylogenetic analyses) before generating a phylogeny, also install the Gubbins program.

Required Software

SSAHA2 pileup
Perl -Installed by default on Mac and most Linux distributions
Gubbins -If compiling from source, first install the modified version of FastML here

Next, download the core genome pipeline scripts and a SMALT index file and pre-computed pileup files for either Agrobacterium or Rhodococcus depending on your sequenced isolates.

Pipeline scripts

Pipeline scripts for core genome analysis: Download


Agrobacterium tumefaciens strain C58 SMALT Index files: Download
Note: This index was generated for NCBI Refseq assembly GCF_000XXXXX.1[1] using SMALT v0.7.5, with a hash length of 13 and a word space of 2 ( -k 13 -s 2).


Rhodocccus fascians strain A44a SMALT Index files: Download
Note: This index was generated for NCBI Refseq assembly GCF_000760735.1[1] using SMALT v0.7.5, with a hash length of 13 and a word space of 2 ( -k 13 -s 2).

SMALT Pileup files for 19 Rhodococcus isolates ( 700 MB): Download

Custom SMALT index

If you wish to build your own SMALT index for use with this pipeline, name your reference genome file in FASTA format as "reference.fna" and place it in the "index" folder of the pipeline. Then while in the "index" folder run the command:

    		smalt index -k 13 -s 2 reference reference.fna
This will generate a SMALT hash index using a word length of 13 (-k 13) and a word size of 2 (-s 2). You can change these parameters depending on what is optimal for your dataset.

Generating a 90% core alignment

First, download the pipeline scripts from GitHub into its own folder using the following command:

	git clone https://github.com/osuchanglab/WGS-Pipeline.git Rhodococcus_analysis
Unzip the downloaded SMALT index and pre-compiled pileup files into the same folder:
	tar xvzf Rhodococcus_A44a_smalt_index.tar.gz -C ./Rhodococcus_analysis/index
	tar xvzf Rhodococcus_A44a_smalt_pileups.tar.gz -C ./Rhodococcus_analysis/pileup
Copy your Illumina paired-end read files in fastq format into the analysis folder:
	cp my_isolate.1.fastq ./Rhodococcus_analysis/reads/
	cp my_isolate.2.fastq ./Rhodococcus_analysis/reads/
You can add sequenced isolates into the folder, just make sure that each dataset is consistently named (ie a dataset named datasetname has the files datasetname.1.fastq and datasetname.2.fastq for forward and reverse reads, respectively).

Generate SMALT pileup files for each dataset using the downloaded index:
	cd ./Rhodococcus_analysis
The script will ask for the location of the ssaha_pileup pipeline folder as well as the estimated insert size for each genome. Generated pileup files will be placed in the pileup folder and named with the format 'datasetname.pileup'.

Compile all of the pileup files into a 90% shared core alignment using the following script:
This creates a file called './core_alignment.fasta' containing the 90% shared core alignment in CLUSTAL format.

Optional: Remove recombinant sites

If you want to remove sites putatively resulting from recombination events from your alignment, the program Gubbins can be used with the following script:

This will produce a SNP alignment file containing all nonrecombinant polymorphic sites called './core_alignment_filtered_SNPs.fasta'. It also produces a pdf file (core_alignment.recombinant_snps.pdf) detailing the identified recombinant sites.

Generate a phylogeny

To generate a maximum likelihood phylogeny using RAxML, the following script can be used:

If you removed recombinant sites in the previous step, the script will use the output of Gubbins (core_alignment_filtered_SNPs.fasta) instead of the full core alignment (core_alignment.fasta).


If you wish to publish output from this pipeline, please cite the following papers and programs:

Ning Z, Cox AJ and Mullikin JC. SSAHA: a fast search method for large DNA databases. 2001. Genome research (11) 10:1725-9. PUBMED: 11591649; PMC: 311141; DOI: 10.1101/gr.194201

Hannes Ponstingl. SMALT. Welcome Trust Sanger Institute. http://smalt.sourceforge.net

A. Stamatakis: "RAxML Version 8: A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies". In Bioinformatics, 2014, open access.

Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. "Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins". doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014.