iBet uBet web content aggregator. Adding the entire web to your favor.
iBet uBet web content aggregator. Adding the entire web to your favor.



Link to original content: http://bio-bwa.sourceforge.net/
Burrows-Wheeler Aligner

Introduction

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.


FAQ

How can I cite BWA?
The short read alignment component (bwa-short) has been published:
    Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]
If you use BWA-SW, please cite:
    Li H. and Durbin R. (2010) Fast and accurate long-read alignment with Burrows-Wheeler Transform. Bioinformatics, Epub. [PMID: 20080505]
(See also Errata below for a minor correction to the formulae in these papers.)
There are three algorithms, which one should I choose?
For 70bp or longer Illumina, 454, Ion Torrent and Sanger reads, assembly contigs and BAC sequences, BWA-MEM is usually the preferred algorithm. For short sequences, BWA-backtrack may be better. BWA-SW may have better sensitivity when alignment gaps are frequent.
With BWA-MEM/BWA-SW, my tools are complaining about multiple primary alignments. Is it a bug?
It is not. Multi-part alignments are possible in the presence of structural variations, gene fusion or reference misassembly. However, representing multi-part alignments in SAM has not been finalized. To make BWA work with your tools, please use option `-M' to flag extra hits as secondary.
What is the tolerance of sequencing errors?
Bwa-back is mainly designed for sequencing error rates below 2%. Although users can ask it to tolerate more errors by tuning command-line options, its performance is quickly degraded. Note that for Illumina reads, bwa-backtrack may optionally trim low-quality bases from the 3'-end before alignment and thus is able to align more reads with high error rate in the tail, which is typical to Illumina data.
BWA-SW and BWA-MEM both tolerate more errors given longer alignment. Simulation suggests that they may work well given 2% error for an 100bp alignment, 3% error for a 200bp, 5% for 500bp and 10% for 1000bp or longer alignment.
Does BWA find chimeric reads?
Yes, both BWA-SW and BWA-MEM are able to find chimera. BWA usually reports one alignment for each read but may output two or more alignments if the read/contig is a chimera.
Does BWA call SNPs like MAQ?
No, BWA only does alignment. Nonetheless, it outputs alignments in the SAM format which is supported by several generic SNP callers such as samtools and GATK.
I see one read in a pair has high mapping quality, but the other read has zero. Is this right?
This is correct. Mapping quality is assigned for individual read, not for a read pair. It is possible that one read can be mapped unambiguously, but its mate falls in a tandom repeat and thus its accurate position cannot be determined.
I see a read stands out the end of a chromosome and is flagged as unmapped (flag 0x4). What is happening here?
Internally BWA concatenates all reference sequences into one long sequence. A read may be mapped to the junction of two adjacent reference sequences. In this case, BWA will flag the read as unmapped, but you will see position, CIGAR and all the tags. A better solution would be to choose an alternative position or trim the alignment out of the end, but this is quite complicated in programming and is not implemented at the moment.
Does BWA work on reference sequences longer than 4GB in total?
Yes. Since 0.6.x, all BWA algorithms work with a genome with total length over 4GB. However, invidual chromosome should not be longer than 2GB.

Errata

The suffix array interval of an empty string should [0,n-1] where n is the length of database string, not [1,n-1] as is stated in Li and Durbin (2009 and 2010). Correspondingly, we need to define O(a,-1)=0 and revise the pseudocode in Figure 3 from Li and Durbin (2009). BWA implementation is actually correct. The mistake only occurs to the paper. We apologize for the confusion and thank Nils Homer and Abel Antonio Carrion Collado for pointing this out.