Beginner's guide to comparative bacterial genome analysis using next-generation sequence data
David J. Edwards, Kathryn E. Holt
BMC Microbial Informatics, 2013
Last updated March 2013
1. Genome assembly and annotation
1.1 Downloading E. coli sequences for assembly
1.2 Examining quality of reads (FastQC)
1.3 Velvet - assembling reads into contigs
1.3.1 Using VelvetOptimiser to optimise de novo assembly with Velvet
1.4 Ordering contigs against a reference using Mauve
1.4.1 Viewing the ordered contigs (Mauve)
1.4.2 Viewing the ordered contigs (ACT)
1.5 Mauve Assembly Metrics - Statistical View of the Contigs
1.6 Annotation with RAST
1.6.1 Alternatives to RAST
2. Comparative genome analysis
2.1 Downloading E. coli genome sequences for comparative analysis
2.2 Mauve for multiple genome alignment
2.3 ACT for detailed pairwise genome comparisons
2.3.1 Generating comparison files for ACT
2.3.2 Viewing genome comparisons in ACT
2.4 BRIG - Visualizing reference-based comparisons of multiple sequences
3. Typing and specialist tools
3.1 PHAST for identification of phage sequences
3.2 ResFinder for identification of resistance gene sequences
3.3 Multilocus sequence typing
3.4 PATRIC online genome comparison tool
Comparative Genomics Tutorial
1. Genome assembly and annotation
1.1 Downloading E. coli sequences for assembly
In this part of the tutorial, we will create a draft-quality E. coli O104:H4 genome assembly to use in the comparative genome analysis. To start with, we need sequences to assemble. For the worked example we are using Illumina HiSeq paired-end reads from E. coli O104:H4 strain TY-2482 (ENA accession SRR292770), available at http://www.ebi.ac.uk/ena/data/view/SRR292770&display=html. Locate the 'Fastq files (ftp)' column and right-click on each of the two file links, choosing 'Save Link as...' to save them to your computer. These are in fastq format (see http://en.wikipedia.org/wiki/FASTQ_format) and compressed using gzip (you do not need to uncompress them).
Remember to download both forward and reverse reads (named 'SRR292770_1.fastq.gz' and 'SRR292770_2.fastq.gz'). Save these to a new folder (directory) with a suitable name, e.g. 'comparison_tut'. This will be our working directory for the tutorial.
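Before moving on, it can be worth confirming that both downloads are complete. The following is a minimal Python sketch, not part of the original tutorial, that counts the reads in each gzip-compressed fastq file and checks that the forward and reverse files contain the same number of records. The function names are our own, and the sketch assumes plain four-line fastq records.

```python
import gzip


def count_fastq_records(path):
    """Count reads in a gzip-compressed FASTQ file (4 lines per record)."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A truncated download usually leaves an incomplete final record.
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def pairs_match(forward_path, reverse_path):
    """Return True if both files of a paired-end set hold the same number of reads."""
    return count_fastq_records(forward_path) == count_fastq_records(reverse_path)
```

Run from the working directory, pairs_match('SRR292770_1.fastq.gz', 'SRR292770_2.fastq.gz') would be expected to return True for an intact download. A corrupted gzip file will typically raise an error while it is being read.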
1.2 Examining quality of reads (FastQC)
Prior to attempting to assemble a set of reads, it is good practice to examine the reads to see if they are of good quality. A simple package to install and run for this purpose is FastQC.
Website: Download and install FastQC from http://www.bioinformatics.babraham.ac.uk/projects/fastqc. The website also features examples of good and poor quality read sets for a number of sequencing platforms.
Compatibility: Java based, available for Windows, Linux and Mac OS X. This tutorial was created using FastQC 0.10.1 on Mac OS X. Some versions of Java have been disabled on Mac OS X, and if you do not have a version of Java higher than version 7u11 installed, you may need to follow the suggestions in the FAQ for Java (http://www.java.com/en/download/faq/java_mac.xml).
Inputs: forward and reverse read sequence files (fastq format)
Instructions
Once FastQC has been installed, open the program to begin. Then:
1. To select the sequence file to check, use 'File > Open' in the FastQC menu. Navigate to the folder where you put the TY-2482 reads and select the 'SRR292770_1.fastq.gz' file. Make sure 'File Format' is set to 'Sequence Files', then hit the 'Open' button. FastQC will commence the analysis.
2. When the analysis has finished, you will be presented with a series of reports on the sequences. Select 'Per base sequence quality' to see a graph of the same. It should look like this:
You can also examine the other reports. Note that this sequence set passes most of the tests, though the sequence duplication level is a little high (around 26%). The assembly could be improved by first removing duplicates using a fastq quality control package such as the command-line tools FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit) or Trimmomatic (http://www.usadellab.org/cms/index.php?page=trimmomatic). However, as the reads for the tutorial are otherwise of good quality, we shall leave the important topic of quality control, and its pitfalls, for others to describe. The websites for the two packages are a good place to start, along with the supporting information for FastQC.
You can now close FastQC and continue with the rest of the tutorial. If you wish to save the report beforehand, use 'File > Save report...' before closing.
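To give a feel for what a duplication level means, here is a rough Python sketch that estimates the fraction of reads whose sequence exactly repeats another read. This is a simplified illustration under our own assumptions (exact-match counting over the first max_reads reads); it is not FastQC's algorithm and will not reproduce its figure exactly.

```python
import gzip
from collections import Counter


def duplication_fraction(fastq_gz_path, max_reads=200_000):
    """Estimate the fraction of reads that are exact duplicates of another read.

    Only the first `max_reads` reads are examined to keep memory use modest.
    """
    counts = Counter()
    total = 0
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 != 1:  # the sequence is the 2nd line of each record
                continue
            counts[line.strip()] += 1
            total += 1
            if total >= max_reads:
                break
    if total == 0:
        return 0.0
    # Every read beyond the first copy of each distinct sequence is a duplicate.
    return 1.0 - len(counts) / total
```

A value of 0.25, for example, would mean that a quarter of the sampled reads are repeat copies of a sequence already seen.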
1.3 Velvet - assembling reads into contigs
Website: Download and install Velvet and its manual (a download in the 20-30 MB range) from http://www.ebi.ac.uk/~zerbino/velvet
Compatibility: Can be compiled for Windows, Mac OS X and Linux, though a 64-bit environment and a minimum of 4 GB of RAM are recommended. This tutorial was created using Velvet 1.2.08 on Mac OS X.
Reference: Zerbino, D. R. and Birney, E., "Velvet: algorithms for de novo short read assembly using de Bruijn graphs". Genome Res, 2008. gr.074492.107 [pii] 10.1101/gr.074492.107.
Instructional Reference: Zerbino, D. R., "Using the Velvet de novo assembler for short-read sequencing technologies". Current Protocols in Bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.], 2010. 10.1002/0471250953.bi1105s31.
Inputs: forward and reverse read sequence files (fastq format)
Instructions
The de novo assembly program Velvet we used was installed with 'MAXKMERLENGTH' set at 101 bp (make MAXKMERLENGTH=101) - see the Velvet manual for more details. Note that a maximum k-mer of 41 will be sufficient for this exercise, but longer k-mers are required when working with longer HiSeq and MiSeq generated reads (which are now typically >100 bp). Note you will also need to add the Velvet directory to your path, or use the full path to the 'velvetg' and 'velveth' executables in the commands below.
1. Open a terminal session and change the directory to that containing the SRR292770 reads files:
2. Run velveth on the two read files. This will take ~1-2 minutes and will produce a hash table of the reads using the specified k-mer length (k=35), saving it to the folder 'out_data_35'. The -shortPaired and -separate tags tell Velvet we are supplying short, paired-end reads with separate files for forward and reverse reads. See the manual for other input options.
3. The next Velvet step to run is velvetg, to build the graph. Enter:
This will take a few minutes. Running this command will output a number of files to the same folder as velveth, including the file containing our newly assembled contigs - this will be labelled 'contigs.fa'. Minimum contig length is set to 200 bp as this is the shortest length allowed for GenBank submission of draft genomes. The coverage cut-offs specified here are ones we have pre-determined to be optimal for assembly of this read set. See below for info on using VelvetOptimiser to set cut-offs for different read sets.
4. Copy the contigs file from the Velvet output folder and rename it:
You can then delete the output folder 'out_data_35', though you may want to either save or look at the statistics file, 'stats.txt', before doing so.
Whilst we provide 'optimal' values for the three options of Velvet (k-mer = 35, expected coverage = 20, coverage cutoff of 2.81), these can be changed to examine how each affects the contigs produced. Note: if you are varying only the latter two and keeping the k-mer constant, you can rerun just the velvetg command with new values, provided you keep the Velvet output folder between runs of velvetg.
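The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' exact commands: the parameter values (k = 35, expected coverage 20, coverage cutoff 2.81, minimum contig length 200 bp) and the file names are taken from the text, the option names are those we understand Velvet to document (-fastq.gz, -shortPaired, -separate, -exp_cov, -cov_cutoff, -min_contig_lgth), and the helper function names are our own. Check the Velvet manual for your installed version before running anything.

```python
import os
import shutil


def velveth_command(out_dir, kmer, forward, reverse):
    """Build the velveth argument list for separate paired-end gzipped fastq files."""
    return ["velveth", out_dir, str(kmer),
            "-fastq.gz", "-shortPaired", "-separate", forward, reverse]


def velvetg_command(out_dir, exp_cov, cov_cutoff, min_contig_lgth):
    """Build the velvetg argument list with the coverage and length settings."""
    return ["velvetg", out_dir,
            "-exp_cov", str(exp_cov),
            "-cov_cutoff", str(cov_cutoff),
            "-min_contig_lgth", str(min_contig_lgth)]


def copy_contigs(out_dir, destination):
    """Copy contigs.fa out of the Velvet output folder under a new name."""
    shutil.copyfile(os.path.join(out_dir, "contigs.fa"), destination)
    return destination
```

Each list can be passed to subprocess.run(cmd, check=True) once Velvet is on your path, for example velveth_command('out_data_35', 35, 'SRR292770_1.fastq.gz', 'SRR292770_2.fastq.gz') followed by velvetg_command('out_data_35', 20, 2.81, 200) and copy_contigs('out_data_35', 'SRR292770_unordered.fasta'). Building the commands as lists avoids shell quoting problems and makes it easy to rerun velvetg alone with different coverage values.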
1.3.1 Using VelvetOptimiser to optimise de novo assembly with Velvet
To get the 'optimal' values used here, we made use of the Perl script VelvetOptimiser (we used version 2.2.5), available for download at http://bioinformatics.net.au/software.velvetoptimiser.shtml. Here, we provide instructions for running VelvetOptimiser to demonstrate how these values were obtained, and for those interested in doing the same - we include it as a further exercise in making use of Velvet. Those interested in exploring both even further should begin with the instructional paper by Zerbino (2010). (Those not yet comfortable with Unix, Perl and the command line may wish to skip the following.)
In order to run VelvetOptimiser, you will also need to download and install both Perl (version 5.8 or later, http://www.perl.org) and BioPerl (version 1.4 or later, http://www.bioperl.org/wiki/Main_Page). Obviously, you also need Velvet as above.
1. Open a terminal session and change to the directory containing the reads files.
With these settings, VelvetOptimiser will set up a series of velveth runs using odd-numbered k-mers over a narrow range ending at 41. It then runs velvetg for each, taking the one with the best N50 as the seed for the final optimisation of the coverage cutoff, where the number of bases in contigs longer than 100 bp is used as the optimising statistic. The output is the same as for a regular Velvet run, though the output folder will have the prefix 'SRR292770' to keep it separate from the earlier Velvet run described above. The logfile for the run (SRR292770_logfile.txt) contains details of the run, including the commands used to run velveth and velvetg.
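The first stage of that search can be pictured with a small Python sketch: enumerate the odd k-mer values in a range, then keep the one whose assembly gave the highest N50. This only illustrates the selection idea; it is not VelvetOptimiser's code, the function names are our own, and the N50 values in the usage note are made up.

```python
def odd_kmers(start, end):
    """List the odd k-mer lengths from start to end inclusive."""
    return [k for k in range(start, end + 1) if k % 2 == 1]


def best_kmer(n50_by_kmer):
    """Return the k-mer whose assembly has the highest N50.

    Ties are broken in favour of the larger k-mer (an arbitrary choice).
    """
    return max(n50_by_kmer, key=lambda k: (n50_by_kmer[k], k))
```

For example, with hypothetical results {31: 18000, 33: 25000, 35: 21000}, best_kmer would pick 33 as the seed for the later coverage cutoff optimisation.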
For those interested in assembling Ion Torrent sequence reads, we recommend you try MIRA (version 3, http://www.chevreux.org/projects_mira.html). This assembler is also useful for those interested in assembling reads from different sequencing technologies into the one assembly - MIRA is probably the best for this kind of assembly project. Once you have assembled the reads into contigs using MIRA, the rest of the analysis can make use of the tools and methods described here for Illumina-based reads.
1.4 Ordering contigs against a reference using Mauve
Once the sequence reads have been assembled into contigs, it is useful to order them against a suitable reference genome. One simple way to accomplish this is to use the 'Move Contigs' option available in Mauve (which is also used below for genome comparisons).
Website: http://asap.ahabs.wisc.edu/mauve (Includes download links, installation instructions and user guide)
Compatibility: Java based, available for Windows, Mac OS X, and Linux. This tutorial was created using Mauve 2.3.1 on Mac OS X.
Reference: Darling, A. E., Mau, B. and Perna, N. T., "progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement". PLoS One, 2010. 5(6): e11147.
Inputs: These will be your newly assembled contigs and a reference genome - here we have chosen to use Ec55989 (NCBI accession NC_011748), a closely related strain with a complete genome, available for download from NCBI. Go to ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_55989_uid59383 and download the sequence in fasta format, NC_011748.fna (right-click to save to your computer).
If you don't want to run your own Velvet assembly, you can do the rest of the exercises using pre-assembled E. coli O104:H4 contigs. Go to http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=AFVS01 and click the 'Download' tab, then right-click the fasta file to save it to your computer and unzip it.
Instructions
Once you have installed Mauve and located your reference genome and contigs, we can order the contigs.
1. Launch the Mauve application.
2. From the Tools menu, select 'Move Contigs'.
3. A dialogue box should appear, with a box labelled 'Choose location to keep output files and folders'. Navigate to the folder with the sequences and the copied contigs, then click the 'Create New Folder' radio button. Give this folder a suitable name, e.g. 'MauveOutput', and then hit 'OK'.
4. A message should appear telling you about the iterative process involved in reordering the contigs. Take note of it, then hit 'OK' to dismiss it.
5. A dialogue box should appear, with a box labelled 'Align and Reorder Contigs'. Click the button below the box 'Add Sequence...' and navigate to the reference genome to align against, in this case 'NC_011748.fna'.
6. Click the 'Add Sequence...' button again and navigate to the fasta file of the contigs you wish to align, 'SRR292770_unordered.fasta' from the assembly exercise above. Check that you have put the reference genome first, and the draft second, as expected by Mauve.
7. Click 'Start' to run the reordering. This might take half an hour or so in total. A new window should appear marked 'Mauve Console' where the progress of the run will be displayed, including any error messages (see below for an example). The reordering will take from four to seven iterations (for Mac OS X; even up to 16 iterations and a bit more time on a 32-bit Windows OS). A new window of the visualization tool should launch for each completed iteration, marked 'Mauve unknown - alignmentX', where X is the iteration number.
If you encounter errors, check that you have specified the right files for input - they should be fasta or multi-fasta sequence files.
8. Finally, a message telling you the reorder is completed should appear. Hit 'OK' and quit Mauve - though you can inspect the final alignment (and the others) beforehand.
9. The final set of ordered and oriented contigs is in the fasta file located in the last of the iterated alignments. To find it, look in the 'MauveOutput' folder created above. For each iteration of the reordering there will be an output folder, so the final output is the contig file located in the subdirectory 'alignmentX' with the highest X, where X is the iteration number. In this subdirectory, rename 'SRR292770_unordered.fasta' to 'SRR292770.fasta' and copy it to your main working directory (i.e. the one with the original sequence files). Make sure you rename the ordered contigs file before copying it, so that it does not overwrite the unordered contigs file 'SRR292770_unordered.fasta', which we will use in a later exercise. You can then delete the 'alignmentX' folders.
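Locating the last iteration by eye is easy to get wrong once there are ten or more folders, because 'alignment10' sorts before 'alignment2' alphabetically. The following Python sketch picks the folder with the highest iteration number by comparing the numbers themselves. It assumes the subfolders are named 'alignment' followed by digits, as described above; the function name is our own.

```python
import re
from pathlib import Path


def latest_alignment_dir(mauve_output_dir):
    """Return the 'alignmentX' subfolder with the highest iteration number X."""
    pattern = re.compile(r"^alignment(\d+)$")
    numbered = []
    for child in Path(mauve_output_dir).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            numbered.append((int(match.group(1)), child))
    if not numbered:
        raise FileNotFoundError(f"no alignmentX folders found in {mauve_output_dir}")
    # Compare on the integer iteration number, not on the folder name.
    return max(numbered, key=lambda item: item[0])[1]
```

Calling latest_alignment_dir('MauveOutput') returns a Path to the final iteration, from which the ordered contigs file can be copied and renamed as in step 9.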
Those who are used to Unix and sequence analysis may prefer to use a command-line based solution for ordering contigs. We recommend Abacas (http://abacas.sourceforge.net/index.html), which requires installation of MUMmer (http://mummer.sourceforge.net), Perl and BioPerl.
The command for ordering against a reference genome is (assuming you have added Abacas [version 1.3.1] to your $PATH and renamed the contigs file from the first exercise to 'SRR292770_unordered.fasta' first):
abacas.1.3.1.pl -r NC_011748.fasta -q SRR292770_unordered.fasta -p nucmer -c -m -b -o SRR292770.fasta
Using either method, you should end up with a set of contigs ordered against the reference strain in multi-fasta format in a file called 'SRR292770.fasta'. This is the file to use for the following steps.
1.4.1 Viewing the ordered contigs (Mauve)
To examine the newly ordered contigs, we provide two GUI-based approaches. For the first, both the program Mauve and instructions for the comparison method are as detailed below, albeit with a few minor (but important) changes.
In this example, we will generate a multiple alignment of the ordered contigs from the O104:H4 outbreak genome, the Ec55989 genome used as the reference for ordering, and another assembly created using more read sets than our draft genome, and a different assembler. This alternative assembly of strain TY-2482 (NCBI accession AFVR01) is available for download at http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=AFVR01 in gzip-compressed fasta format via the download tab. After downloading, unzip the file before continuing. Order this alternative assembly to the Ec55989 reference genome first - use the instructions provided above.
Instructions:
1. Launch the Mauve application.
2. From the File menu, select 'Align with progressiveMauve...'
3. A dialogue box should appear, with a box labelled 'Sequences to align:'. Click the button below the box 'Add Sequence...' and navigate to your ordered contigs file, 'SRR292770.fasta'.
4. Click the 'Add Sequence...' button again and navigate to the fasta file of a genome you wish to align. In this case, we will start with the alternative assembly, 'AFVR01.fasta'. If you provide a multi-fasta file containing contigs, Mauve will concatenate these together before running the alignment.
5. Repeat step 4 to add any other sequences of interest. In our example we will just add the EAEC genome Ec55989.
6. Now we need to specify the output file. Click the button marked '...' to select an output file. Navigate to the directory in which you want the output to appear. Now specify a name for the output file (e.g. 'mauve_output'), and click 'Save'.
7. Click 'Align...' to run the alignment. This might take half an hour or so. A new window should appear marked 'Mauve Console' where the progress of the run will be displayed, including any error messages (example below).
If you encounter errors, check that you have specified the right files for input - they should all be fasta or multi-fasta sequence files, and can include up to one genome in GenBank format (to provide an annotation).
8. When the alignment is finished, the visualization tool will appear. To simplify the image a little, select View -> Style -> uncheck 'LCB connecting lines'. It should look like this:
Row 1 = O104 ordered contigs.
Row 2 = alternative assembly.
Row 3 = Ec55989 (EAEC) genome.
Coloured blocks indicate regions of sequence with homology in the other genomes. Red lines indicate contig boundaries.
Notice the similarity in the order of contigs between our Velvet assembly and the alternative assembly. Both assemblies contain contigs that don't map to the reference genome.
You can save a static image of what you are viewing by selecting Tools -> Export -> Export image.
1.4.2 Viewing the ordered contigs (ACT)
We will now use ACT to compare the same three genomes: our O104:H4 assembly, the alternative assembly and the reference genome Ec55989. Note both assemblies should have been ordered against Ec55989 as outlined above.
Details of downloading and using ACT are given below (2.3.1).
Inputs: ACT can display pairwise comparisons between genomes. To do this it needs the genome sequences themselves (in fasta format or an annotated sequence format such as GenBank or EMBL files) and a comparison file. Comparison files can be created on your computer if you have BLAST installed, or using an online tool like WebACT (http://www.webact.org) or DoubleACT (http://www.hpa-bioinfotools.org.uk/pise/double_act.html), see steps 1-2 below.
Instructions
Use the instructions for using ACT below to:
1. Generate first a single fasta file for the two assemblies (step 1, 2.3.1).
2. Generate a comparison file for our O104:H4 assembly against both Ec55989 and the alternative assembly separately (step 2, 2.3.1).
3. View the comparison(s) in ACT.
a. Launch the ACT application.
b. Select File -> Open.
c. Initially, boxes for 2 sequence files and 1 comparison file will be displayed. Click 'more files...' to create boxes for a second comparison file and a third sequence file.
d. Click the 'Choose...' buttons to select each of your sequence files and comparison files. Note that you can load in your multi-fasta contigs files at this point for the E. coli O104:H4 and alternative assembly. We want the E. coli O104:H4 assembly in the middle, with the comparisons to Ec55989 and the alternative assembly above and below it, like this:
e. The comparison between the three genomes will be displayed. See the ACT manual for details of how to navigate around the viewer. Here, we are comparing the new E. coli O104:H4 genome assembly (middle) with Ec55989 (top) and the alternative assembly (bottom). We have zoomed out by clicking the down arrow at the bottom right of the window. Since our contigs were ordered against the Ec55989 genome, all the E. coli O104:H4 contigs with no homology to Ec55989 (i.e. no coloured bars linking them to Ec55989) appear at the end of the sequence. Some of these contigs do map to the genome of the alternative assembly. Also note that there is much higher homology between our O104 assembly and the alternative one.
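Step 1 above asks for a single fasta sequence per assembly. One way this could be done, sketched here in Python, is to join the ordered contigs end to end into one record. This is our own illustration and not necessarily the procedure given in section 2.3.1; the function names and the 60-column line width are assumptions.

```python
def read_fasta(path):
    """Read a (multi-)fasta file into a list of (name, sequence) tuples."""
    records = []
    name = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                name = line[1:]
                chunks = []
            else:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def concatenate_contigs(in_path, out_path, new_name, width=60):
    """Join all contigs, in file order, into one fasta record; return its length."""
    sequence = "".join(seq for _, seq in read_fasta(in_path))
    with open(out_path, "w") as out:
        out.write(f">{new_name}\n")
        for start in range(0, len(sequence), width):
            out.write(sequence[start:start + width] + "\n")
    return len(sequence)
```

Because the contigs are written in file order, running this on the ordered file 'SRR292770.fasta' keeps the ordering established against the reference. Note that contig boundaries are lost in the joined sequence.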
1.5 Mauve Assembly Metrics - Statistical View of the Contigs
Compatibility: Available for Windows, Mac OS X, and Linux versions of Mauve, but see the text for more details. It also requires the R statistical program to be installed. See above link for more details.
Reference: 1. Darling, A. E., et al., "Mauve assembly metrics". Bioinformatics, 2011. 10.1093/bioinformatics/btr451.
Installation note: The authors indicate that it is possible to use Mauve Assembly Metrics via the Mauve GUI tool when only a single pairwise comparison is run, but as they do not provide specific instructions, we can only describe how to do so for Mac OS X. The following may also work for the Linux version, but has not been tested by us. Unfortunately, we do not yet have a solution for installing Mauve Assembly Metrics in the Windows-based version of the Mauve GUI.
The simplest way of installing Mauve Assembly Metrics into the GUI tool of Mauve for Mac OS X is to use the instructions for installing Mauve Assembly Metrics by script (see the above website for details) with one important change - edit the target '.dmg' file to the most current update from the Mauve download website. You may still have to install Mauve as an application by 'drag-and-drop'.
Mauve Assembly Metrics are only available as a GUI tool for single pairwise comparisons, e.g. between the reference genome and the assembly as ordered contigs. You will know the tool is installed successfully if a new button appears after running such a comparison, as highlighted here with the red circle:
Instructions
In this example, we will generate Mauve Assembly Metrics for the assembly we created, using as the reference a complete genome from the outbreak, E. coli O104:H4 strain 2011C-3493 (NCBI accession NC_018658.1; download NC_018658.fna from ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_O104_H4_2011C_3493_uid176127).
1. Using the instructions above, reorder the SRR292770 unordered contigs to the new reference, 2011C-3493.
2. When the alignment has finished, don't close Mauve but instead, on the final alignment window, hit the Mauve Assembly Metrics button. This will launch the report window, which should look like this:
Notice that along with the summary shown here, you can also generate a report of SNPs and gaps in alignment, and save these reports. To understand the report fully, read the reference for Mauve Assembly Metrics given above.
Some highlights from our assembly include an N50 of around 44-46 kbp, a largest contig of just over 141 kbp and a smallest of 200 bp (as expected, as we set that in Velvet). Notably, less than 5% of the bases of the reference genome have been missed, though our assembly has an extra 2% of bases, making up an extra 116 (small) contigs that don't align. There are also more than 1,100 SNPs between the two sequences, and interestingly, our assembly seemingly has fewer gaps (371) than the reference 2011C-3493 (411).
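For readers unfamiliar with N50, it is the contig length at which, going from the longest contig downwards, the running total first reaches half of the total assembly length. The short Python sketch below computes N50 along with the largest and smallest contig lengths from a multi-fasta file. It is an independent illustration using our own function names, not the code used by Mauve Assembly Metrics.

```python
def contig_lengths(fasta_path):
    """Return the length of each sequence in a (multi-)fasta file."""
    lengths = []
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def summarise(lengths):
    """Collect a few basic contig statistics into a dictionary."""
    return {
        "contigs": len(lengths),
        "n50": n50(lengths),
        "largest": max(lengths, default=0),
        "smallest": min(lengths, default=0),
    }
```

For instance, contigs of 8, 4 and 2 bp total 14 bp, and the longest contig alone already covers half of that, so the N50 is 8.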
Mauve Assembly Metrics can also be run as a command-line tool, with the instructions for installing and running the metrics tool provided at the same link as above. The advantage of the command-line version is that more than one assembly can be tested for inclusion in the same report output. As we are dealing with a single assembly in this tutorial, we leave this as an exercise for those so interested.
1.6 Annotation using RAST
Website: http://rast.nmpdr.org (need to register to use the service)
Compatibility: Most web browsers. This tutorial was created using RAST version 4 on the Firefox web browser (version 17.0).
Reference: Aziz, R. K., et al., "The RAST Server: rapid annotations using subsystems technology". BMC Genomics, 2008. 10.1186/1471-2164-9-75.
Inputs: Ordered contigs file (multi-fasta format)
Instructions
In this example, we will generate a GenBank annotation for the newly assembled and ordered contigs of the E. coli O104:H4 strain in multi-fasta format (use the contigs ordered against Ec55989, as we have done). You must first register for a RAST user account.
1. Go to http://rast.nmpdr.org in a web browser and log into your account.
2. Under the 'Your Jobs' tab (top left corner) select 'Upload New Job'.
3. You should be taken to a page titled 'Upload your genome'. At the bottom of the page there is a box labelled 'File Upload:'. Click the button and navigate to your ordered contigs file ('SRR292770.fasta'). Then hit the 'Use this data and go to step 2' button. This may take a little while as it is uploading your sequence file over your internet connection.
4. Eventually the next page will open with the same heading as the last, with the sub-heading 'Review genome data', and some contig statistics. You will be asked to enter further details about the organism. In the first field, labelled 'Taxonomy ID', enter the code for E. coli (562), and hit the 'Look up taxonomy ID at NCBI' button. This will populate the rest of the fields for you, except the last, the strain. Enter 'TY-2482' into the space for the strain, and then hit the 'Use this data and go to step 3' button.
(Note: if you were doing this with something other than E. coli, you can find the right taxonomy ID at http://www.ncbi.nlm.nih.gov/taxonomy.)
5. The next page should have the sub-heading 'Complete Upload'. You can enter optional information (Sequencing Method = 'other', Coverage = '>8x', Number of contigs = '101-500'), but this is not necessary to use RAST. The other options for the RAST annotation pipeline should at least be considered, though we will use the default options as shown when the page first loads. Your final page should look like this:
If it does, hit the 'Finish the upload' button to start the job. Your job will join the submission queue, and you will be sent an email (to the address you used to register) when the job is completed. This could take half a day or even much longer, depending on the number of jobs in the queue before you.
6. Once you receive the completion email from the Annotation Server, click on the link in the email to return to the RAST server (if you have logged out, you will have to log back in to continue). This time select 'Jobs Overview' under the 'Your Jobs' tab.
7. This will open the Jobs Overview page, where you will see a list of your jobs with a number of details and the status of the job. Click on the '[ view details ]' link for the job (in the 'Annotation Progress' column, under the green progress bars).
8. This opens the 'Job Details' page, which will include the available downloads if the job has completed. Select 'Genbank (EC numbers stripped)' and then hit 'Download'. The file will be called '562.<job_no.>.ec-stripped.gbk' - change this to 'SRR292770.gbk' and move the file to your work folder (where 'SRR292770.fasta' is located).
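A quick sanity check on the downloaded annotation is to count how many coding sequences it contains. The Python sketch below does this by scanning the feature table of a GenBank flat file for 'CDS' feature keys. It is a plain-text illustration with our own function name that relies on the standard column layout of the flat-file format (feature key within the first 21 columns); a full parser such as the one in Biopython would be more robust.

```python
def count_cds_features(genbank_path):
    """Count CDS features across all records in a GenBank flat file."""
    count = 0
    in_features = False
    with open(genbank_path) as handle:
        for line in handle:
            if line.startswith("FEATURES"):
                in_features = True       # start of a record's feature table
            elif line.startswith("ORIGIN"):
                in_features = False      # sequence data follows
            elif in_features and line[:21].strip() == "CDS":
                # Feature keys sit in the first 21 columns; qualifier and
                # continuation lines leave those columns blank.
                count += 1
    return count
```

Since RAST writes one record per contig, the function sums over every record in the file, so count_cds_features('SRR292770.gbk') gives the total for the whole draft genome.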
1.6.1 Alternatives to RAST
A number of command-line tools are available for annotation on a local machine. For fast de novo annotation we recommend trying Prokka (http://www.vicbioinformatics.com/software.prokka.shtml), though Prokka in turn relies on the installation of a long list of other programs (see the link for details). For those interested in comparative annotation, you could try BG7 (http://bg7.ohnosequences.com). Otherwise, you now have an annotated draft genome for E. coli O104:H4 strain TY-2482, and can move on to the comparative genome analysis that follows.
2. Comparative genome analysis
2.1 Downloading E. coli genome sequences for comparative analysis
In this part of the tutorial, we will compare our E. coli O104:H4 genome assembly to other E. coli using various software packages on our computer. You will need to download the programs from the web using the links given in each section. In addition you will need to download some E. coli data for comparison.
For the Mauve and ACT comparisons, we will use these:
(This one we have already used above) EAEC str. Ec55989 (NC_011748) - download NC_011748.fna from ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_55989_uid59383
2.2 Mauve for multiple genome alignment
We have already introduced Mauve above, for ordering contigs and inspecting assembly statistics.
In this example, we will generate a multiple alignment of the newly assembled and annotated O104:H4 outbreak genome (GenBank format) with an EHEC chromosome and the chromosome of EAEC strain Ec55989 (fasta). We will then view the alignment and use it to inspect genes that are annotated in the outbreak genome but missing from the other pathogen chromosomes.
Website: http://asap.ahabs.wisc.edu/mauve (Includes download links, installation instructions and user guide)
Compatibility: Java based, available for Windows, Mac OS X, and Linux. This tutorial was created using Mauve 2.3.1 on Mac OS X.
Reference: Darling, A. E., Mau, B. and Perna, N. T., "progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement". PLoS One, 2010. 5(6): e11147.
Inputs: Genome sequence files (fasta format) and up to one annotated genome sequence (GenBank format).
Instructions
1. Launch the Mauve application.
2. From the File menu, select 'Align with progressiveMauve...'
3. A dialogue box should appear, with a box labelled 'Sequences to align:'. Click the button below the box 'Add Sequence...' and navigate to your annotated genome (GenBank file generated by RAST).
4. Click the 'Add Sequence...' button again and navigate to the fasta file of a genome you wish to align. In this case, we will start with the genome of the EHEC O157:H7 strain EDL933 (NC_002655.fna). If you provide a multi-fasta file containing contigs, Mauve will concatenate these together before running the alignment.
5. Repeat step 4 to add any other sequences of interest. In our example we will just add the EAEC genome Ec55989.
6. Now we need to specify the output file. Click the button marked '...' to select an output file. Navigate to the directory in which you want the output to appear. Now specify a name for the output file (e.g. 'mauve_output') and click 'Save'.
7. Click 'Align...' to run the alignment. This might take half an hour or so. A new window should appear marked 'Mauve Console' where the progress of the run will be displayed, including any error messages.
If you encounter errors, check that you have specified the right files for input - they should all be fasta or multi-fasta sequence files, and can include up to one genome in GenBank format (to provide an annotation).
Compaiative uenomics Tutoiial p2S
8. When the alignment is finished, the visualization tool will appear. To simplify the image a little, select View -> Style -> uncheck 'LCB connecting lines'. It should look like this:
Row 1 = annotated O104 genome. Row 2 = EHEC genome. Row 3 = EAEC genome. Coloured blocks indicate regions of sequence with homology in the other genomes. Red lines indicate contig boundaries.
You can save a static image of what you are viewing by selecting Tools -> Export -> Export image.
9. Notice that the EHEC genome has more 'white space', i.e. sequences not in homology blocks, meaning these sequences are missing from the new O104:H4 genome and Ec55989. The other genomes have fewer white blocks, as they share a lot of their genome sequence.
To see what the 'unique' sequences are in the O104:H4 assembly, zoom in by clicking the '+' magnifying glass at the top of the window until you see boxes appear under the O104 sequence; these are annotated genes.
Scroll around to a region that is not within a coloured block, and mouse-over a gene to see its annotation. In our example, we are looking at a region of sequence in which IncI1 plasmid genes have been annotated. So, we know the O104:H4 genome assembly contains an IncI1 plasmid.
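The same alignment can also be started without the graphical interface, because Mauve ships with a command-line program called progressiveMauve. The Python sketch below only assembles and prints such a command. The file names are hypothetical placeholders, and the location of the progressiveMauve executable on your system is an assumption (it is normally found inside the Mauve installation directory).

```python
import shutil
import subprocess


def build_mauve_command(output_path, sequence_paths):
    """Return the argument list for a progressiveMauve run."""
    if len(sequence_paths) < 2:
        raise ValueError("an alignment needs at least two input sequences")
    # progressiveMauve takes the output file as --output=<file>,
    # followed by the sequence files to align.
    return ["progressiveMauve", f"--output={output_path}", *sequence_paths]


def run_mauve(output_path, sequence_paths):
    """Run the alignment if progressiveMauve can be found on the PATH."""
    if shutil.which("progressiveMauve") is None:
        raise FileNotFoundError("progressiveMauve is not on the PATH")
    subprocess.run(build_mauve_command(output_path, sequence_paths), check=True)


# Hypothetical file names; replace them with your own files.
example = build_mauve_command(
    "mauve_output",
    ["O104_H4_annotated.gbk", "NC_002655.fna", "NC_011748.fna"],
)
print(" ".join(example))
```

The resulting output file can then be opened in the Mauve viewer via the File menu, so the inspection steps above stay the same.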
2.3 ACT for detailed pairwise genome comparisons
Website: http://www.sanger.ac.uk/resources/software/act (to download, click the 'Downloads' tab and look for the FTP download link for your operating system). Note that the download should contain Artemis as well as ACT.
Compatibility: Java based, available for Windows, Mac OS X, and Linux. This tutorial was created using ACT version 11.0.0 on Mac OS X.
Reference: Carver, T. J., Rutherford, K. M., Berriman, M., Rajandream, M. A., Barrell, B. G. and Parkhill, J., "ACT: the Artemis Comparison Tool", Bioinformatics, 2005. 21:3422-3. 10.1093/bioinformatics/bti553
Inputs: ACT can display pairwise comparisons between genomes. To do this it needs the genome sequences themselves (in fasta format or an annotated sequence format such as GenBank or EMBL files) and a comparison file. Comparison files can be created on your computer if you have BLAST installed, or using an online tool like WebACT (http://www.webact.org) or DoubleACT (http://www.hpa-bioinfotools.org.uk/pise/double_act.html); see steps 1-2 below.
Instructions
In this example we will visualize a comparison of our newly assembled and ordered E. coli O104:H4 TY-2482 contigs against enteroaggregative E. coli Ec55989 (accession NC_011748) and the EHEC genome EDL933 (accession NC_002655).
2.3.1 Generating comparison files for ACT
1. To generate the comparison file, you will need to have both of the genome sequences in single-fasta format. Generation of the comparison will not work with multi-fasta sequences such as those containing several contig sequences as output by Velvet or other assemblers. So, we first need to change the multi-fasta contig sequences file into a fasta file with a single entry, which includes all of our contig sequences concatenated together into one big sequence. An easy way to do this is to open the contig file in Artemis first.
a. Launch Artemis.
b. Select File -> Open.
c. Navigate to the location of your contig file in fasta format and click 'Open'. The contig sequences should be displayed, with the boundaries of each contig marked up as a feature and coloured in alternating orange/brown colours.
d. To write out the concatenated contig sequences to a single-entry fasta file, select File -> Write -> All Bases -> FASTA Format and name the new file something like 'genomeXX_single.fasta' so you can easily identify this as a single-entry file.
2. Generate a comparison between your single-entry fasta files by one of the following methods:
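a. If you have BLAST installed locally on your computer, open up a terminal and run a nucleotide BLAST comparison of the two single-entry fasta files, saving the output as your comparison file.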
b. If you prefer to use a web-based tool, go to the WebACT site (http://www.webact.org) and click the 'Generate' tab at the top of the page. Under 'Sequence 1' paste in the accession for your reference genome, e.g. NC_011748. Under 'Sequence 2' click the 'Browse' button and navigate to your single-entry genome sequence file to compare. Click 'Submit'. It may take a while to upload your sequence (1-10 minutes), and a while longer for the results to be returned (1-60 minutes).
When WebACT is finished, you will see a Results screen. Click 'Download files'. Enter a file name (a sensible choice is something that includes the full identifier of both genomes being compared, e.g. SRR292770_NC_011748.zip), download the file and unzip it. Inside will be a set of files including the input sequences; the comparison file is the one named ''. Rename it to something more informative (e.g. SRR292770_NC_011748.crunch) and copy it to the directory with your sequence files. (You can now delete the rest of the WebACT output.)
c. You can also try the DoubleACT website (http://www.hpa-bioinfotools.org.uk/pise/double_act.html). Click the 'Browse...' buttons to upload your single-entry genome sequence file and the reference file for comparison, then click the 'Blastn' radio button, enter your email address and click 'Run genome blast'.
When the comparison file is created, you will receive an email with a link to download the results. The comparison file is the one named 'genome_blast.result'. Right-click to save it to your computer in the same directory as your sequence files, and name it something more informative (e.g. SRR292770_NC_011748.crunch).
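If you would rather script the concatenation in step 1 than use Artemis, the idea can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial authors' method: the function names and the record header are made up for the example, and unlike Artemis it does not record where each contig begins and ends.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def concatenate_contigs(lines, new_header, width=60):
    """Join all contigs, in file order, into one FASTA record (as text)."""
    sequence = "".join(seq for _, seq in read_fasta(lines))
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{new_header}"] + wrapped) + "\n"


def concatenate_file(in_path, out_path, new_header):
    """Read a multi-fasta file and write a single-entry fasta file."""
    with open(in_path) as handle:
        text = concatenate_contigs(handle, new_header)
    with open(out_path, "w") as handle:
        handle.write(text)


# Small in-memory demonstration with two toy contigs.
demo = [">contig_1", "ACGT", "AC", ">contig_2", "GGTT"]
print(concatenate_contigs(demo, "genomeXX_single", width=4))
```

For a real assembly you would call concatenate_file with your contig file and an output name such as 'genomeXX_single.fasta'. Contigs are joined in the order they appear in the file, so order them against the reference first if that matters for your comparison.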
We generated comparison files for the E. coli O104 assembly vs Ec55989 (accession NC_011748), and for the E. coli O104 assembly vs EHEC genome EDL933 (accession NC_002655).
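For the local BLAST route in step 2a, one common way to produce a comparison file that ACT can read is to run blastn from the BLAST+ package with tabular output (-outfmt 6). The sketch below is an assumption about a workable command, not the command used by the tutorial authors, and the file names are hypothetical. Which sequence should be the query and which the subject depends on the order in which you load them into ACT, so check the ACT manual if the comparison does not display correctly.

```python
import shutil
import subprocess


def build_blastn_command(query_fasta, subject_fasta, out_path):
    """Return a blastn argument list that writes tabular (outfmt 6) output."""
    return [
        "blastn",
        "-query", query_fasta,      # first single-entry fasta file
        "-subject", subject_fasta,  # second single-entry fasta file
        "-outfmt", "6",             # tab-separated hit table
        "-out", out_path,           # comparison file to load into ACT
    ]


def run_blastn(query_fasta, subject_fasta, out_path):
    """Run the comparison if blastn can be found on the PATH."""
    if shutil.which("blastn") is None:
        raise FileNotFoundError("blastn (BLAST+) is not on the PATH")
    subprocess.run(
        build_blastn_command(query_fasta, subject_fasta, out_path), check=True
    )


# Hypothetical file names; replace them with your own files.
example = build_blastn_command(
    "NC_011748_single.fasta",
    "SRR292770_single.fasta",
    "SRR292770_NC_011748.crunch",
)
print(" ".join(example))
```

Using -subject avoids having to build a BLAST database first, which keeps the example short; for repeated comparisons against the same reference a formatted database may be faster.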
2.3.2 Viewing genome comparisons in ACT
3. Initially, boxes for 2 sequence files and 1 comparison file will be displayed. Click 'more files...' to create boxes for a second comparison file and a third sequence file.
4. Click the 'Choose...' buttons to select each of your sequence files and comparison files. Note that you can load in your multi-fasta contigs file at this point for the E. coli O104:H4 genome. We want the E. coli O104:H4 assembly in the middle, with the comparisons to Ec55989 and EHEC above and below it, like this:
5. The comparison between the two genomes will be displayed. See the ACT manual for details of how to navigate around the viewer. Here, we are comparing the new E. coli O104:H4 genome assembly (bottom) with Ec55989 (top). We have zoomed out by clicking the down arrow at the bottom right of the window. Since our contigs were ordered against the Ec55989 genome, all the E. coli O104:H4 contigs with no homology to Ec55989 (i.e. no coloured bars linking them to Ec55989) appear at the end of the sequence.
Zoom into this region by clicking on one of the unmapped contigs in this area and then clicking the up arrow to the side of the O104:H4 sequence.
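The unmapped contigs seen at the end of the sequence can also be listed programmatically. The sketch below assumes something this tutorial does not do: a separate BLAST run in which the multi-fasta contigs file is the query and the reference genome is the subject, written in tabular format (-outfmt 6). The function name and the 500 bp minimum alignment length are illustrative choices, not recommended thresholds.

```python
def contigs_without_hits(contig_ids, blast_lines, min_length=500):
    """Return contig IDs that have no BLAST hit of at least min_length bases.

    blast_lines are rows of BLAST tabular output whose columns are:
    qseqid, sseqid, pident, length, mismatch, gapopen,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    hit_ids = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query_id, alignment_length = fields[0], int(fields[3])
        if alignment_length >= min_length:
            hit_ids.add(query_id)
    # Preserve the original contig order in the result.
    return [contig for contig in contig_ids if contig not in hit_ids]


# Toy example: contig_2 only has a short hit and contig_3 has none.
rows = [
    "contig_1\tNC_011748\t99.0\t1200\t5\t0\t1\t1200\t10\t1209\t0.0\t2000\n",
    "contig_2\tNC_011748\t95.0\t100\t5\t0\t1\t100\t10\t109\t1e-20\t150\n",
]
print(contigs_without_hits(["contig_1", "contig_2", "contig_3"], rows))
```

A list like this is only a starting point; viewing the contigs in ACT, as described above, shows whether a short or partial match is biologically meaningful.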
2.4 BRIG - Visualizing reference-based comparisons of multiple sequences
Download from: http://brig.sourceforge.net. The site contains download links, installation instructions, a manual and a tutorial which you may find useful.
Compatibility: Java based, available for Windows, Mac OS X, and Linux. This tutorial was created using BRIG version 0.95 on Mac OS X.
Dependencies: BRIG also requires BLAST to be installed on your computer. You can download BLAST+ from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST. Ensure you select the file that matches your operating system, e.g. 'ncbi-blast-x.x.x+-universal-macosx.tar.gz' for Mac OS X or 'ncbi-blast-2.2.27+-win32.exe' for Windows.
Reference: Alikhan, N. F., Petty, N. K., Ben Zakour, N. L. and Beatson, S. A., "BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons", BMC Genomics, 2011. 12:402. PMID: 21824423
Instructions
NOTE: If you have not used BRIG before, you will probably find it useful to work through the BRIG tutorial available at http://brig.sourceforge.net/brig-tutorial-1-whole-genome-comparisons before working through the rest of our example.
1. Select your reference sequence and the location of your query sequences. In this analysis, we will use our de novo assembled E. coli O104:H4 contig sequences as the reference, and the EHEC and EPEC genomes as queries (see download links in 2.1). We will also include the sequences for the Stx2 phage and the LEE pathogenicity island.
2. Click 'Next' to be taken to the 'Customize rings' window. This is where you can specify which query sequences you want to be represented by rings, and the order and colour in which they will be displayed.
3. Find the Ec55989 sequence in the 'data pool' box and click 'Add data'. Click the coloured box and change the colour to red. In the box marked 'Legend text:' type in a name for this ring, e.g. '1: Ec55989'.
4. Click 'Add new ring'. Now find the EHEC genome in the data pool and click 'Add data'. Set the colour to purple and change the legend text to '2: EHEC str 1'.
5. Repeat step 3 with the remaining EHEC and EPEC genomes. We used the 4 EHEC genomes listed in this tutorial under 'Downloads' and coloured them all purple, two EPEC genomes coloured blue, and one atypical EPEC genome coloured green.
6. Go to the Preferences menu and select Image Options. Under the 'Global settings' tab change the 'Width' field to 2Suu. This will make the image canvas wide enough to display the legend text next to the ring image, without obscuring the image itself. Click 'Save & close'.
7. Click 'Next' and enter a title for your image (this will be printed in the middle of the circular diagram). Click 'Browse' and navigate to where you want the output to be saved, then type in a name for the output file (this will be a single image file) and select the format for the image (e.g. png). Click 'Submit'.
8. While BRIG is running, it will print details of its progress on the console within the same window where you just pressed 'Submit'. When it tells you it has finished, go to where you asked for the output to be saved and open the image file to view the result.
The image shows that, in terms of gene content, the novel O104:H4 outbreak strain is closest to EAEC strain Ec55989 (red), then the atypical EPEC strain E110019 (green). There are several regions of the outbreak strain's sequence that are missing from the EHEC and EPEC strains.
9. An alternative way to make the comparison is to use an EHEC genome as the reference sequence, to see how much of the characteristic EHEC sequence is present in the outbreak genome. Click the 'Prev' button to get to the 'Customize rings' window, then click 'Prev' again to get to the input data window. Change the reference sequence to EHEC strain EDL933 and click 'Next'. Now change ring 2 to be the new O104:H4 genome: click on 'Ring 2' in the list of rings in the first box; click the old EHEC strain in the 'data' box and click 'Remove data'; find the O104:H4 file in the data pool and click 'Add data'. Change the legend text to '2: O104:H4' and change the ring colour to black. We also added the Stx2 phage (orange) and the LEE pathogenicity island (purple), to make it easy to see where these regions are. Click 'Next' and change the image title and output filename to indicate that the reference sequence is an EHEC strain, then click 'Submit' to generate the image.
10. Try changing the reference strain to the Ec55989, EPEC or atypical EPEC genomes and see how the figures and the interpretations change.
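To put a rough number on what each ring shows, you can calculate the fraction of the reference sequence covered by BLAST hits from one query genome. The sketch below illustrates that calculation only and does not reproduce BRIG's own code. It assumes you have already extracted the hit coordinates on the reference as 1-based, inclusive (start, end) pairs; the function names are made up for the example.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    # Normalise each pair so start <= end (reverse-strand hits are flipped).
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(intervals, reference_length):
    """Return the fraction of the reference covered by at least one hit."""
    covered = sum(end - start + 1 for start, end in merge_intervals(intervals))
    return covered / reference_length


# Toy example: three hits on a 1000 bp reference, one on the reverse strand.
hits = [(1, 100), (50, 150), (300, 201)]
print(covered_fraction(hits, 1000))
```

Comparing this fraction across query genomes gives a crude ranking of how much of the reference each one shares, which can be checked against the visual impression from the rings.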
3. Typing and specialist tools
3.1 PHAST - for identification of phage sequences
Type: Web service
URL: http://phast.wishartlab.com
Reference: Zhou, Y., et al., "PHAST: a fast phage search tool", Nucleic Acids Research, 2011. 10.1093/nar/gkr485.
Input: Contigs in FASTA format (single or multiple fasta)
Outputs:
- Summary Table (summarising the location of prophage sequences)
- Detailed Table (giving the locations of individual genes within the prophage)
- Circular genome map (showing the locations of prophages within the genome)
- Linear maps of each prophage (showing the individual genes)
See the PHAST website for more detailed documentation.
3.2 ResFinder - for identification of resistance gene sequences
Type: Web service
URL: http://www.cbs.dtu.dk/services/ResFinder
Reference: Zankari, E., et al., "Identification of acquired antimicrobial resistance genes", J Antimicrobial Chemother, 2012. 10.1093/jac/dks261.
Input: Contigs in FASTA format or reads (will assemble first).
Outputs: List of resistance genes identified within the sequences
3.3 Multilocus sequence typing
Type: Web service
URL: http://cge.cbs.dtu.dk/services/MLST
Reference: Larsen, M. V., et al., "Multilocus Sequence Typing of Total Genome Sequenced Bacteria", J Clin Microbiol, 2012. 10.1128/JCM.06094-11.
Input: Contigs in FASTA format or reads (will assemble first).
Parameters: Select the MLST database to query.
Outputs: Top hitting alleles for each locus used in the MLST scheme, and the sequence type (ST) assigned to that combination of alleles.
3.4 PATRIC - online genome comparison tool
For an introduction to what PATRIC can do, try looking at their analysis of the E. coli O104 genome, posted at http://enews.patricbrc.org/1172/e-coli-outbreak-new-comprehensive-comparisons
A Protocol For Extraction and Purification of High-Quality and Quantity Bacterial DNA Applicable For Genome Sequencing: A Modified Version of The Marmur Procedure.