Why is assembling paired end illumina without any input parameters an important problem?

In one of the comments in this question about multiple sequence alignment, it was stated

@5heikki: btw if you want a good bioinformatics problem, come up with an assembler that assembles any paired end illumina run optimally de novo without any input parameters.

What is a paired end illumina? How is optimally defined in this context? What are the usual input parameters?

Next-generation sequencers cannot sequence a very long stretch of DNA with good reliability (~150 bases for the recent model, HiSeq2000; even less for older models such as GA (40), GA-II (70), GA-IIx (90)). To increase the confidence in a given hit, each fragment is sequenced from both ends. For example, if you have selected a 500bp DNA fragment, then after ligating adapters to both ends, it is sequenced from both directions up to 150bp. This leaves an unsequenced "insert" region of 200bp in the middle. (In the example image below, they have sequenced up to 40bp [the case of the old GA].)
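The arithmetic in this example can be written down as a minimal Python sketch. The function name is hypothetical and the calculation ignores adapters; it only restates the numbers given above.

```python
def unsequenced_gap(fragment_len: int, read_len: int) -> int:
    """Length of the middle part of a fragment that neither read covers.

    A negative value means the two reads overlap by that many bases.
    """
    return fragment_len - 2 * read_len


print(unsequenced_gap(500, 150))  # 200, the example from the text
print(unsequenced_gap(500, 40))   # 420, with the 40bp reads of the old GA
print(unsequenced_gap(250, 150))  # -50, i.e. the reads overlap by 50 bases
```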

During assembly you stitch together the fragments of DNA to find out the larger DNA from where the fragments arise. In case of RNAseq, these arise from a transcript, and your assembly should give you the complete transcript (mRNA or ncRNA etc). There are two basic types of assembly: reference guided assembly and de-novo assembly. In the former you use a sequence such as the genome as a reference to assemble the transcripts. If such a reference is not available then you have to go for de-novo assembly.

The assembly algorithms use several parameters and since these are computer algorithms and not some kind of magic, their output depends to an extent on the different parameters.

In the case of paired-end data, some parameters are particularly important. The most important is the size of the insert. With a 500bp fragment, you'll end up with an unsequenced region of 200bp. This is not much of a problem with reference-guided assembly, because you can figure out the sequence of the insert based on where the sequenced regions align to the reference. The average insert length is important for removing discordant reads (pairs aligning too far apart in the reference). In the case of de-novo assembly, the insert will remain unsequenced even if you know that the final transcript looks something like:


So, to get the sequence of the assembly, you need to sequence the insert regions. This is not a problem if you at least know the order of fragments in the assembly. However you should know the insert size to get the assembly size correct and as skyminge said, in scaffolding. Obtaining this insert length is not that difficult (You need not provide it as a parameter. Most algorithms can calculate it automatically).
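As a rough illustration of how a program could work out this length by itself and use it to flag discordant pairs, here is a minimal Python sketch. It assumes the read pairs have already been placed on the same contig or reference. The function names, the coordinate convention (leftmost start, rightmost end of each pair) and the 25% tolerance are hypothetical and not taken from any specific assembler.

```python
from statistics import median


def fragment_sizes(pairs):
    """Fragment size per pair, given (leftmost start, rightmost end) coordinates."""
    return [end - start for start, end in pairs if end > start]


def split_by_concordance(pairs, tolerance=0.25):
    """Estimate the typical fragment size and flag pairs that deviate from it.

    tolerance is an assumed relative deviation, chosen only for illustration.
    """
    expected = median(fragment_sizes(pairs))
    concordant, discordant = [], []
    for start, end in pairs:
        size = end - start
        if abs(size - expected) <= tolerance * expected:
            concordant.append((start, end))
        else:
            discordant.append((start, end))
    return expected, concordant, discordant


# Made-up coordinates: four pairs around 500bp and one far too long.
pairs = [(100, 598), (2000, 2505), (5300, 5795), (7000, 7502), (9000, 12000)]
expected, ok, bad = split_by_concordance(pairs)
print(expected)  # 502
print(bad)       # [(9000, 12000)]
```

The median is used instead of the mean so that a few badly placed pairs do not distort the estimate.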

Another parameter in de-novo assembly is k-mer length (the sequence reads are broken down into k-mers for better assembly). I cannot explain the algorithm of assembly here in detail. You can check the manuals/papers of common assembly algorithms like Velvet, SOAPdenovo, Euler [de novo]; cufflinks [reference based]
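To make the k-mer idea concrete without going into any assembler's internals, a minimal Python sketch of breaking a read into k-mers could look like this. The function name and the example read are made up for illustration.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return all overlapping substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmers("ATGGCGT", 3))  # ['ATG', 'TGG', 'GGC', 'GCG', 'CGT']
```

A read of length L yields L - k + 1 k-mers, so a larger k gives fewer but more specific k-mers. That tradeoff is why k is usually exposed as a parameter.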

I have mentioned transcriptome sequencing here but the principles are same for genome sequencing too.

Back to your main question: Why is assembling paired end illumina without any input parameters an important problem?

Because it is less effort; but tweaking may be difficult. I won't consider it as an important problem. There are other important algorithmic optimizations that are required with de-novo assembly.

In Illumina sequencing, the DNA is (usually randomly) sheared into fragments. For paired end sequencing, fragments of a specific size range are selected and then sequenced from both sides. This results in two reads for each fragment. As read length is fixed, also the remaining "middle part" of the fragment is in a specific size range. In some cases there is no middle part, because the fragments have been chosen so small, that the reads overlap.

The information about the size of the fragment and/or the "middle part", as well as the read length, are some of the most important parameters you need for de novo assembly. You could get away with not taking read length as a parameter; if you need it, you can still run over all reads and check. But fragment size or insert size is important for placing the reads, especially in scaffolding.
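"Running over all reads" to find the read length is straightforward. Below is a minimal Python sketch that assumes a plain four-line-per-record FASTQ file with no wrapped sequence lines; the function name and the tiny in-memory example are hypothetical.

```python
import io
from collections import Counter


def read_length_counts(handle):
    """Count how many reads of each length occur in a 4-line-per-record FASTQ."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of each record holds the sequence
            counts[len(line.strip())] += 1
    return counts


# Two made-up records, read from memory instead of a file.
fastq = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n")
print(dict(read_length_counts(fastq)))  # {8: 1, 6: 1}
```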

This blog entry also has some nice information about the often upcoming discussion what is meant by insert size (fragment size, the size of the middle part) and what can happen with overlapping reads and read-through.

There is lots more to say about this. Illumina also provides some nice videos available on youtube.

RNA-Seq data processing and gene expression analysis

This document outlines the essential steps in the process of analyzing gene expression data using RNA sequencing (mRNA, specifically), and recommends commonly used tools and techniques for this purpose. It is assumed in this document that the experimental design is simple and that differential expression is being assessed between 2 experimental conditions, i.e. a simple 1:1 comparison, with some information about analyzing data from complex experimental designs. The focus of the SOP is on single-end strand-specific reads, however special measures to be taken for analysis of paired-end data are also briefly discussed. The recommended coverage for RNA-Seq on human samples is 30-50 million reads (single-end), with a minimum of three replicates per condition, preferably more if one can budget accordingly. Preference is also generally given for a higher number of replicates with a lower per-sample sequence yield (15-20 million reads) if there is a tradeoff between the number of reads per sample and the total number of replicates.

Glossary of associated terms and jargon

Procedural steps

This protocol paper 2 was a very good resource for understanding the procedural steps involved in any RNA-Seq analysis. The datasets they use in that paper are freely available, but the source of RNA was the fruitfly Drosophila melanogaster, and not Human tissue. In addition, they exclusively use the “tuxedo” suite developed in their group.

Several papers are now available that describe the steps in greater detail for preparing and analyzing RNA-Seq data, including using more recent statistical tools:

In addition, newer alignment-free methods have also been published and are increasingly being used in analysis (we include a second protocol detailing the use of these):

The sections below detail those protocols and suggest tools.

Figure 1. Steps in RNA-Seq Workflow


As a cost-effective, high-throughput alternative to classical Sanger sequencing technology, emerging next-generation sequencing technologies have revolutionized biological research. When compared to Sanger sequencing technology, NGS platforms (e.g. 454, Illumina and ABI-SOLiD) [1] have their drawbacks, including shorter sequence read length, higher base-call error rate, non-uniform coverage and platform-specific artifacts [2–4] that can severely affect the downstream data analysis efforts.

One of the most important areas of NGS data analysis is de novo genome or transcriptome assembly. De novo assembly is essential for studying non-model organisms where a reference genome or transcriptome is not available. A common approach for de novo assembly of NGS sequences uses De Bruijn Graph (DBG) [5] data structure, which manages the large volume and short read length of NGS data better than classical Overlap-Layout-Consensus assemblers such as TIGR and Phrap [6, 7]. In the DBG-based approach, reads are decomposed into K-mers that in turn become the nodes of a DBG. Sequencing errors complicate the DBG because a single mis-called base can result in a new K-mer sequence that will subsequently introduce a new path in the DBG. These incorrect K-mers increase the complexity of the DBG, prolong assembler runtime, increase memory footprint, and ultimately lead to poor quality assembly [8]. Pre-processing NGS reads to remove mis-called bases would be beneficial to DBG assembler performance and the resulting assembly.
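The effect of a single mis-called base on the K-mer set can be illustrated with a small Python sketch. The reads, the value of K and the function name are made up for this example, and the sketch only counts distinct K-mers (the graph nodes) rather than building a full DBG.

```python
def kmer_set(reads, k):
    """Distinct K-mers (candidate DBG nodes) found across all reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


k = 4
clean = ["ACGTTGCAAG"]
# Same read again, but with one mis-called base (T -> A at index 4).
with_error = ["ACGTTGCAAG", "ACGTAGCAAG"]

extra = kmer_set(with_error, k) - kmer_set(clean, k)
print(sorted(extra))  # ['AGCA', 'CGTA', 'GTAG', 'TAGC']
```

Here one wrong base adds K new K-mers, because every K-mer window that spans the error changes. This is how errors inflate the graph.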

Another important area of NGS data analysis is reference-based assembly i.e. mapping or aligning reads to a reference genome or transcriptome. This step is crucial for many NGS applications including RNA-Seq [9], ChIP-Seq [10], and SNP and genomic structural variant detection [11]. The correct mapping of reads to a reference depends heavily on read quality [12, 13]. For example, some mapping tools use the base quality scores of a read to determine mismatch locations. Chimeric reads or other sequencing artifacts can introduce gaps in the alignment. Erroneous bases add additional complexity to the correct identification of actual mismatch positions during the mapping process. Therefore, cleaning up raw sequencing reads can improve the accuracy and performance of alignment tools.

We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package that implements many commonly used pre-processing algorithms gathered from the sequencing and assembly literature. In addition, we performed systematic assessments of the effects of using pre-processed short read sequences generated by different algorithms on the resulting de novo and reference-based assembly of three genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157 H7. We also compared the performance of ngsShoRT with other existing trimming tools: CutAdapt[14], NGS QC Toolkit[15] and Trimmomatic[16].


We applied the nine trimming algorithms on four different datasets (see Materials and Methods). The quality of these datasets was assessed with FastQC (see File S1 and Figure S1 for Q distribution plots) and measured by different metrics, such as the average PHRED error score, GC content biases and position-specific quality variations. The datasets vary conspicuously: the Yeast DNA-Seq dataset has almost perfect quality parameters, while the Lovell raw reads are of average-to-high quality (Figure S1). Among the RNA-Seq datasets, the Arabidopsis thaliana reads are representative of high quality reads, while in the Homo sapiens-derived data the error probability is both high and highly variable across read length.
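For readers unfamiliar with PHRED scores, the relation between a quality value Q and the error probability P is Q = -10 log10(P). The following minimal Python sketch shows the conversion. It assumes the common Phred+33 ASCII encoding of FASTQ quality strings, which may not match every dataset, and the function names are illustrative only.

```python
def phred_to_error_probability(q: float) -> float:
    """Error probability implied by a PHRED quality score Q."""
    return 10 ** (-q / 10)


def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding by default."""
    return [ord(c) - offset for c in qual]


scores = decode_quality_string("II5+")
print(scores)  # [40, 40, 20, 10]
for q in (10, 20, 30):
    print(q, phred_to_error_probability(q))
```

So Q=20 corresponds to about one expected error in 100 base calls, and Q=30 to about one in 1000.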

Effects of Read Trimming on Gene Expression Analysis

We tested the performance of nine different trimming algorithms on two RNA-Seq datasets originating from human and Arabidopsis (see materials and methods). We assessed the number of reads and nucleotides aligning over the respective reference genomes, allowing for gap openings of the reads over spliced regions. It is evident how the trimming process in all cases reduces the number of reads, while increasing the percentage of the surviving dataset capable of correctly aligning over the reference genome. In the case of the low quality Homo sapiens dataset (Figure 1), while 72.2% of the untrimmed dataset reads are aligned, the trimmed ones reach values above 90%, with peaks in ConDeTri at 97.0% (HQ=15, LQ=10) and SolexaQA (Q=5) at 96.7% (Table 2). However, SolexaQA achieves the highest quality while keeping the highest number of reads, and therefore seems to be the optimal tool to maximize the tradeoff between loss of reads and increase in quality, at least in low quality RNASeq datasets such as the one analyzed here (Figure 2). For this dataset, we could observe a pseudo-optimal tradeoff between read loss and quality of the remaining reads, expressed as number of reads aligned over the total number of reads (Figure 1), which is between Q=20 and Q=30 for SolexaQA-BWA, Trimmomatic, Sickle, Cutadapt and ERNE-FILTER. Other trimmers, such as FASTX, being able to operate only from 3’end, do not achieve the same performance as the other tools (Figure 2). While retaining a similar ratio of correctly mapped reads (roughly assessed by the percentage of reads mapping within UCSC gene models), the loss of information is consistent when compared to untrimmed datasets (Figure S2).

For ConDeTri, two basic parameters are necessary, and combinations of both are reported (which explains the non-monotonic appearance of the barplots). Red bars indicate the percentage of reads aligning in the trimmed dataset. Blue bars indicate the number of reads surviving trimming.

Tool | RNA-Seq, Arabidopsis dataset: max % mapped reads (threshold) | RNA-Seq, Human dataset: max % mapped reads (threshold) | Genotyping, Yeast dataset: APOMAC at default threshold | Genotyping, Peach dataset: APOMAC at default threshold | Genome assembly, Yeast dataset: N50 (bp) | Genome assembly, Yeast dataset: accuracy | Genome assembly, Yeast dataset: recall | Genome assembly, Peach dataset: N50 (bp) | Genome assembly, Peach dataset: accuracy | Genome assembly, Peach dataset: recall
ConDeTri | 98.980% (HQ=40,LQ=35) | 96.973% (HQ=15,LQ=10) | 0.0485% | 0.0851% | 4,830 | 99.600% | 91.834% | 14,525 | 96.389% | 75.090%
Cutadapt | 99.422% (Q=40) | 91.751% (Q=26) | 0.0647% | 0.1589% | 6,256 | 99.692% | 92.874% | 17,653 | 95.349% | 74.466%
ERNE-FILTER | 98.687% (Q=38) | 95.475% (Q=30) | 0.0638% | 0.1564% | 6,214 | 99.691% | 92.863% | 17,665 | 95.374% | 74.482%
FASTX | 98.733% (Q=40) | 87.733% (Q=40) | 0.0655% | 0.1614% | 6,357 | 99.692% | 92.892% | 17,692 | 95.399% | 74.510%
PRINSEQ | 98.752% (Q=40) | 88.616% (Q=40) | 0.0652% | 0.1599% | 6,357 | 99.692% | 92.890% | 17,690 | 95.345% | 74.465%
Sickle | 99.422% (Q=40) | 95.971% (Q=20) | 0.0547% | 0.1308% | 5,382 | 99.446% | 92.194% | 17,074 | 95.495% | 74.504%
SolexaQA | 99.002% (Q=40) | 96.743% (Q=5) | 0.0644% | 0.1581% | 3,209 | 99.642% | 89.770% | 13,571 | 96.223% | 74.490%
SolexaQA-BWA | 98.705% (Q=38) | 91.947% (Q=26) | 0.0409% | 0.0645% | 6,256 | 99.692% | 92.875% | 17,662 | 95.328% | 74.449%
Trimmomatic | 99.422% (Q=40) | 95.875% (Q=22) | 0.0511% | 0.1119% | 4,784 | 99.579% | 91.851% | 16,141 | 95.766% | 74.629%

Table 2. Summary of comparisons between the trimming tools investigated in this study.

Each symbol corresponds to a quality threshold. Peak Q parameters for each tool are reported.

It is interesting to note that in general every tool shows different optimal Q thresholds (Figure 2 and Table 2) for maximizing the quality of the trimmed reads (expressed in this case by the percentage of reads mapping to the reference). Also, every tool shows different trends between Q and mappability (percentage of post-trimming reads mapped on the reference genome): for some (such as SolexaQA and ConDeTri), loose thresholds are enough to achieve the most robust output. For others (such as FASTX and PRINSEQ), the highest possible threshold seems the optimal solution in terms of quality (with a concurrent loss of reads). Finally, some tools (like Cutadapt, Sickle, SolexaQA-BWA and Trimmomatic) possess an ideal intermediate Q threshold maximizing the relative amount of surviving reads alignable on the reference genome. In the case of the higher quality dataset originating from Arabidopsis thaliana, all tools have a comparable performance and there is no clearly identifiable best Q for the tradeoff between mappability and read loss. Starting from an untrimmed baseline mappability of 82.8%, all tools reach a mappability above 98.5% with stringent thresholds (Q>30, see Table 2 and Table S1). In both cases, however, trimming removes the most “unmappable” parts of the dataset already at lower thresholds. Carrying a trimmed but reliable subset of the original RNA-Seq reads can reduce the need for disk space and the time needed for the overall alignment process, as high-error sequences would have already been eliminated.
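To illustrate what a Q threshold does to an individual read, here is a deliberately simple Python sketch of 3'-end quality trimming followed by a minimum-length filter. It is not the algorithm of any of the tools compared here, each of which uses its own strategy; the function names, the minimum length of 4 and the example values are assumptions made for illustration.

```python
def trim_3prime(seq, quals, threshold):
    """Remove bases from the 3' end while their quality is below threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


def survives(seq, min_len=4):
    """A read survives trimming only if enough bases remain (min_len is assumed)."""
    return len(seq) >= min_len


seq = "ACGTACGT"
quals = [38, 37, 35, 30, 22, 18, 12, 5]  # quality decays towards the 3' end
for q in (20, 30, 40):
    trimmed, _ = trim_3prime(seq, quals, q)
    print(q, trimmed, survives(trimmed))
```

Raising the threshold shortens the surviving read and eventually discards it entirely. This is the loss of reads that has to be weighed against the gain in mappability.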

Effects of read trimming on SNP identification

In order to assess the impact of trimming on SNP identification, we used reads originating from dihaploid genome samples, specifically from the Prunus persica Lovell variety and from the Saccharomyces cerevisiae YDJ25 strain. In such genetic backgrounds, it is possible to evaluate any non-homozygous nucleotide call as a direct estimate of false positive SNP calling. In order to do so, we assessed the Average Percentage of Minor Allele Calls as an index termed APOMAC. At the same time, we measured the Average Percentage of Non-reference Allele Calls (APONAC), although the latter is an underestimation of APOMAC, since it assumes that the sequenced individual has a genome identical to the reference sequence. The total non-homozygous nucleotide presence, related to false positive SNP calling and assessed by the APOMAC index, is, as expected, reduced by trimming (Figure 3). All trimmers drastically reduce the percentage of alternative allele nucleotides aligned over the reference genomes, both in Prunus persica (Figure 3) and in yeast (Table 2 and Table S1), bringing this false positive call indicator from 30% to 10% or less of the total aligned nucleotides. This rather spectacular loss of noise can be achieved with any trimmer with a Q threshold equal to or above 20 (Table S1). The best performing tools, in terms of APOMAC and APONAC, are ConDeTri and SolexaQA, which quickly reduce the number of minor allele calls. While the quality of SNP calling increases, the coverage loss due to trimming is minor: FASTX, SolexaQA-BWA, PRINSEQ, Cutadapt and ERNE-FILTER at default Q values all process the reads without a noticeable loss of covered reference genome. This has been tested and reported for different minimum coverage thresholds (Figure 4).
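The exact formula behind the APOMAC index is not given in this section. The Python sketch below therefore shows only one plausible reading, stated here as an assumption: at each covered position, take the percentage of aligned bases that differ from the most frequent base, then average over positions. The function name, the input layout and the example pileup are hypothetical.

```python
def apomac(pileup):
    """Average percentage of minor allele calls over covered positions.

    pileup: list of dicts mapping base -> number of reads showing that base.
    ASSUMPTION: 'minor' means every call that is not the most frequent base.
    """
    percentages = []
    for counts in pileup:
        depth = sum(counts.values())
        if depth == 0:
            continue  # uncovered positions are skipped
        minor = depth - max(counts.values())
        percentages.append(100.0 * minor / depth)
    return sum(percentages) / len(percentages) if percentages else 0.0


# Made-up pileup: three positions, one of them with a single discordant call.
pileup = [{"A": 20}, {"C": 19, "T": 1}, {"G": 10}]
print(round(apomac(pileup), 3))  # 1.667
```

In a dihaploid sample every such minor call is expected to be noise, which is why the index can serve as a proxy for false positive SNP calls.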

Several read trimming method/threshold combinations are tested. The Average Percentage of Minor Allele Call (APOMAC) or of Non-reference Allele Call (APONAC) are reported, together with the total number of high-confidence SNPs.

The analysis was performed on untrimmed reads and after trimming with 9 tools at Q=20 (for ConDeTri, default parameters HQ=25 and LQ=10 were used).

Effects of read trimming on de novo genome assembly

Read trimming affects genome assembly results only partially, and there is no big difference among results from the different datasets (see Figure 5 and Table 2). Negative effects are seen for high quality values (e.g. Q>30) on most datasets. Trimmed datasets from ConDeTri, Trimmomatic, Sickle and especially SolexaQA produce slightly more fragmented assemblies, probably due to a more stringent trimming, which is also reflected in lower computational needs (see Figure 6). The assembler used, ABySS, models and deals with sequencing errors; therefore, assembly of the untrimmed dataset performs best under certain metrics (average scaffold length, longest scaffold, N50 in bp), but at the cost of a slightly lower precision and a much higher computational demand. Conversely, stringent trimming tends to heavily remove data and decrease overall assembly quality.

Several read trimming method/threshold combinations are tested. Yellow bars report the N50 (relative to the untrimmed dataset N50). Blue bars report the accuracy of the assembly (% of the assembled nucleotides that could be aligned on the reference Prunus persica genome). Red bars report the recall of the assembly (% of the reference Prunus persica genome covered by the assembly).
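For reference, N50 is the length of the contig (or scaffold) at which the cumulative length, counted from the longest sequence downwards, first reaches half of the total assembly length. A minimal Python sketch with made-up contig lengths:

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


print(n50([100, 80, 50, 30, 20, 20]))  # 80
```

Because N50 only looks at lengths, it says nothing about correctness. This is why accuracy and recall against the reference are reported alongside it.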

Overall effects of read trimming

An overall analysis of the three computational biology analyses investigated here allows us to draw three conclusions. First, trimming is beneficial in RNA-Seq, SNP identification and genome assembly procedures, with the best effects evident for intermediate quality thresholds (Q between 20 and 30). Second, while all tools behave quite well (compared to untrimmed scenarios), some datasets with specific issues or low overall quality (Figure 2) benefit more from the most recent algorithms that operate on both 5’ and 3’ ends of the read, such as ERNE-FILTER, or those allowing low quality islands surrounded by high quality stretches, such as ConDeTri. Third, the choice of an optimal threshold is always a tradeoff between the amount of information retained (i.e., the number of surviving reads/nucleotides) and its reliability, i.e., in RNA-Seq the alignable fraction, in SNP identification the amount of true positive aligned nucleotides and in genome assembly the percentage of the scaffolds correctly assembled and mappable on the reference genome. Overall, trimming also gives an advantage in terms of computational resources used and execution time, assessed for genome assembly in the present study (Figure 6) but evident also for the other analyses (data not shown). The performance of trimming seems to be dependent on the Q distribution of the input dataset. For example, we observe a sudden drop in called SNPs above Q trimming thresholds of around 35 (Figure 3); in fact, Q=35 is roughly the inflection point in the Q distribution of the Prunus persica dataset (Figure S1). On the other hand, for the higher quality Saccharomyces cerevisiae dataset, the drop in called SNPs is indeed present, but more gradual, and observed at Q values above 36, while the Q distribution for this dataset shows an inflection point at Q=37 (Figure S1).


Accurate sequence reads and their reliable assembly are crucial for all downstream applications of NGS projects [15]. Without a reference genome, estimating the number of genes sequenced, their % coverage, and whether they have been assembled correctly is challenging [3, 23]. As the use of NGS continues to increase for non-model organisms, the need for assembly algorithms that perform well in de novo assembly concomitantly increases, especially for the assembly of the short read sequence data from the Solexa/Illumina platform [3].

The performance of the three short read assemblers investigated (VELVET, NGEN and OASES) differed greatly. While VELVET resulted in the highest number of total contigs, only nine percent of these were larger than 200bp. In contrast, over 50% of the NGEN and OASES assembled contigs were larger than 200bp. As mapping accuracy increases with increasing contig size [14], we reason that the latter contig sets should be of higher overall quality. This assumption was strengthened by the results of the BLAST searches. Meta-assembly of the four contig sets resulted in longer contigs, which also produced a higher number of BLAST hits in most searches.

Indirect contig quality assessment

Although the VELVET assembly had the largest number of contigs and the greatest number of hits against various databases, these are due to the poor assembly of contigs. Importantly, our ability to gain this insight is dependent upon the reference database used for BLAST searches and thus requires careful attention. In the BLAST comparisons against the UniProt database, the number of UniGen hits for the VELVET contigs is substantially smaller compared to the other assemblies with a cutoff value of < e -10 and also contigs > 200bp (Figure 2). The discrepancy between total and UniGen hits derives most probably from the incomplete assembly of contigs by the VELVET assembler, resulting in many independent contigs each hitting similar genes, while these are joined together by the other programs and thus constitute single hits to given genes for the NGEN and OASES assemblies. Additionally, the long contig assemblies by the other programs generate more high quality BLAST hits than those found for VELVET (Figure 2). However, BLAST results against RefSeq indicate a much greater number of UniGen hits by the VELVET assembly than the other two methods (Figure 3). This result arises due to the highly redundant nature of the RefSeq database, as it contains unique sets of genes for numerous species. The RefSeq database should therefore be used with caution since the number of unique types of genes should not differ significantly from those identified using the UniProt database.
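The difference between total hits and unique gene hits can be made concrete with a small Python sketch. It assumes BLAST results have already been reduced to (contig, matched gene, e-value, contig length) tuples; the function name, the tuple layout and the example hits are hypothetical, while the cutoffs mirror the ones mentioned in the text.

```python
def summarize_hits(hits, max_evalue=1e-10, min_contig_len=200):
    """Return (total hits, unique genes hit) after e-value and length filtering."""
    kept = [h for h in hits if h[2] < max_evalue and h[3] > min_contig_len]
    unique_genes = {gene for _, gene, _, _ in kept}
    return len(kept), len(unique_genes)


# Made-up hits: three fragmented contigs all hit geneA.
hits = [
    ("c1", "geneA", 1e-30, 250),
    ("c2", "geneA", 1e-25, 220),
    ("c3", "geneA", 1e-12, 210),
    ("c4", "geneB", 1e-40, 900),
    ("c5", "geneC", 1e-3, 500),   # fails the e-value cutoff
    ("c6", "geneB", 1e-50, 150),  # fails the contig length cutoff
]
print(summarize_hits(hits))  # (4, 2)
```

A fragmented assembly inflates the first number but not the second, which is the pattern described for the VELVET contigs.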

In the BLAST comparison against a database consisting of a single closely related species, B. glabrata, the NGEN assembly resulted in the highest number of UniGens, and VELVET showing especially poor performance when considering assembled contigs > 200bp in length.

Combining all of these assemblies into the meta-assembly resulted in contigs that outperformed the other assemblies in the BLAST X searches against the UniProt and B. glabrata databases in all but one category.

An additional means of assessing contig assembly performance is to compare the actual hits identified by the different assemblies. Similar hits indicate similar contig sequence and accuracy. Comparisons were made between assemblies for the BLAST X search versus the UniProt database (cutoff value < e -10 , contig length > 200bp), which showed that the proportion of contigs leading to identical gene hits was highest between the NGEN and OASES contig sets. This again strengthened our interpretation that the quality of the NGEN and OASES contigs exceeded those of VELVET .

Direct contig quality assessment

The different contig assemblies were directly assessed by comparing their performance among the 13 mitochondrial genes of R. balthica [24]. In general, VELVET contigs had the highest number of hits against these genes due to these contigs being much shorter. The other assemblies had longer and fewer contigs, which had a higher average alignment length, with the meta-assembly showing the best performance (i.e. the fewest contigs with the highest average aligned contig length; Table 2). Longer contigs had a lower identity match to the mtDNA genes, which likely arises from genetic differences between the samples used for this study and the published mtDNA genome for R. balthica, and potentially from sequencing errors (which have a higher probability of occurring in long contigs compared to short ones). We identified some contigs whose middle region did not resemble the reference sequence and we identified these as assembly errors. In addition, most contigs of the NGEN and OASES assemblies had a 20-30bp extension attached at the beginning of the contig that does not match the mtDNA genome. For the NGEN assembly, this extension was identified as the Illumina sequencing adaptor, not removed during filtering due to low identity matching. For the OASES contigs we currently lack an explanation for the origin of the attachment. As the extensions seem to be an almost systematic error, cutting the first 30bp of each contig sequence is one means to solve this problem (although some good quality sequence may be lost).

Despite these differences, coverage of the mtDNA genes was quite similar among the contig assemblies, averaging around 50 - 55% (Table 2). Pooling all the contigs from the assemblies covered 79% of the mt genes. Thus, even though contigs of the three assemblers overlap to a large extent, each contig set covers some parts which are missed by the others, with at least 24% of the available bp information not being used by any one of the three assemblers. We identified 27 clusters with 2 to 25 overlapping, and to a large extent identical, VELVET contigs. In contrast, among the NGEN contigs not more than two contigs with more than 30bp overlap were found. Visual inspection of the mt genome alignments revealed two main reasons for insufficient assembly: insufficient read overlap, and missed assemblies even though identical and sufficient overlap was present. This might be traced back to the use of RNA from several pooled individuals, which leads to a larger number of SNP variants, and thus might hamper assembly [11]. In our study we identified 6.3 SNPs per thousand base pairs (n = 52), similar to the 6.7 SNPs identified in the Vera et al. [11] study. The estimated number of sequencing errors is almost identical (n = 51), and results in a sequencing error rate of 0.6%. Obviously SNP variation and sequencing errors affect the VELVET assembly, but do not appear to influence the other two assemblers. The meta-assembly combined the short VELVET SNP-containing contigs into one, thus largely eliminating redundancy (Additional file 4). However, although the meta-assembly decreased the number of contigs from 560 to 82, this only resulted in a modest improvement in net coverage compared to VELVET (58% vs. 55% respectively).
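The SNP density and the sequencing error rate quoted here are consistent with each other, as a short back-of-the-envelope check in Python suggests. The aligned length is not stated in this section, so the value computed below is only what the two SNP figures imply, not a reported number.

```python
snps = 52           # SNPs identified
snps_per_kb = 6.3   # reported SNP density per thousand base pairs
errors = 51         # estimated sequencing errors

# Aligned length implied by the two SNP figures (an inference, not a reported value).
implied_length_bp = snps / snps_per_kb * 1000
error_rate_percent = errors / implied_length_bp * 100

print(round(implied_length_bp))      # 8254
print(round(error_rate_percent, 1))  # 0.6
```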

Two other important observations merit discussion. First, contigs giving a hit against the mt genes can be split in two groups. One group of contigs shows a clear relation between the alignment length of the contig and the total contig length. The other group consists of contigs that passed the cutoff value < e -5 , but only have a very short alignment length to the reference sequence and are therefore effectively due to random, non-homologous matches (Figure 7). Second, while a clear relationship between cutoff value and alignment length is visible for the NGEN and OASES contigs, both the VELVET and meta-assembly contigs have clear outliers that may be assembly errors. These are contigs near the cutoff value with low alignment length, and with very high stringency cutoff values (e.g. < e -65 ).

Comparison to other studies

The number of UniGen matches against the UniProt database found in other transcriptome studies of non-model organisms based on the 454/Roche platform is roughly similar to the 5380 meta-assembly matches detected in this study, at a cutoff value of < e -5 (e.g. [11] Melitaea cinxia: 6122 at < e -5 ). However, given our increased sequencing effort compared to previous studies (total quality data produced: 976 Mbp vs. 66 Mbp, i.e. 14 times higher compared to the M. cinxia study [11]), we expected to identify more genes. Previous observations of low BLAST results in mollusk species can be traced back to three main factors [25, 26]. First, the low number of hits can be explained by the lack of EST datasets from mollusk species in GenBank [25, 26], and the general paucity of mollusk genetic data compared to insects and fish. Second, a large proportion of genes in mollusk species do not share orthologous relationships, but rather represent novel gene families [26]. Third, the high level of amino acid divergence from other, better studied invertebrate lineages and the evolutionary distance to other organisms reduce the probability and quality of BLAST hits [26, 27]. These points highlight the need for more genomic data from mollusks to increase our knowledge and facilitate genomic studies in this phylum.

An optimized approach for local de novo assembly of overlapping paired-end RAD reads from multiple individuals

Restriction site-associated DNA (RAD) sequencing is revolutionizing studies in ecological, evolutionary and conservation genomics. However, the assembly of paired-end RAD reads with random-sheared ends is still challenging, especially for non-model species with high genetic variance. Here, we present an efficient optimized approach with a pipeline software, RADassembler, which makes full use of paired-end RAD reads with random-sheared ends from multiple individuals to assemble RAD contigs. RADassembler integrates the algorithms for choosing the optimal number of mismatches within and across individuals at the clustering stage, and then uses a two-step assembly approach at the assembly stage. RADassembler also uses data reduction and parallelization strategies to promote efficiency. Compared to other tools, both the assembly results based on simulation and real RAD datasets demonstrated that RADassembler could always assemble the appropriate number of contigs with high qualities, and more read pairs were properly mapped to the assembled contigs. This approach provides an optimal tool for dealing with the complexity in the assembly of paired-end RAD reads with random-sheared ends for non-model species in ecological, evolutionary and conservation studies. RADassembler is available at

1. Introduction

Recent developments of high-throughput sequencing techniques are revolutionizing studies of ecological, evolutionary and conservation genetics. Restriction site-associated DNA sequencing (RAD-seq) [1,2], which harnesses the massive throughput of next-generation sequencing, enables low-cost discovery and genotyping of thousands of genetic markers in both model and non-model species [3,4]. Illumina paired-end (PE) sequencing techniques make the original RAD (RPE) [5,6] more attractive for de novo studies. The first reads begin at the restriction enzyme cut site, while the second reads are staggered over a local genomic region of usually several hundred base pairs. Furthermore, the overlapping RPE reads of each RAD locus can be individually assembled into one contig with the enzyme cut site at one end. The assembled contigs can provide more sequence information for BLAST annotations and the removal of paralogues [4,6,7]. In addition, RPE reads can also be used to remove polymerase chain reaction (PCR) duplicates, which will improve downstream genotyping accuracy, and the overlapping reads can further improve genotyping accuracy towards the ends of the reads [4].

To increase sequence coverage for RAD contigs assembly, it is a standard practice to pool multiple individuals' reads, which might introduce assembly complexity especially for non-model species with little knowledge of the genomic background [8,9]. Assembly software is challenged by repeats, sequencing errors, polymorphisms in the target and the computational complexity of large data volumes [10]. The polymorphisms among different individuals also complicate the assembly, and this could be more challenging particularly for species with high genetic variance. The assembly for RPE reads is more difficult compared to other RAD approaches that produce RAD loci of fixed length (flRAD), such as ddRAD [11]. PE ddRAD is much easier to assemble, because both the paired reads start at the restriction enzyme cut sites with fixed read length of uniform coverage of depth, and the reads could be easily stacked up. However, RPE is more difficult to assemble, as the second reads are staggered because of sonication and size selection, thus their coverage is non-uniform. In addition, there is huge difference of depth between the first reads and the second reads, which makes the assembly of RPE reads more challenging.

Previous studies have assembled RPE reads into contigs using different assembly tools [5,8,12], such as the de Bruijn Graph (DBG) based software Velvet [13] and the Overlap-Layout-Consensus (OLC) based software CAP3 [14] and LOCAS [15]. Davey et al. [9] demonstrated that VelvetOptimiser was the best assembly tool for RAD data by comparing nine assembly tools. However, Hohenlohe et al. [8] found CAP3 performed much better than Velvet. The results of Hohenlohe et al. showed that most reads of a locus could be each assembled into one contig by using CAP3, while Velvet failed to connect the overlapping PE reads at many loci. The conflicting results between the two studies might be attributed to the fact that Davey et al. did not use the overlapping RPE library preparation protocol and they only used the second reads for assembly, and therefore the information for the first reads was lost. Several software packages for the assembly of RAD data support PE reads, such as Stacks [16,17], Rainbow [18], pyRAD [19] and dDocent [20]. However, many of these tools cannot directly and fully support RPE datasets with staggered PE reads. Many studies did not make full use of RPE reads either for assembly or single nucleotide polymorphism (SNP) discovery due to the lack of software or approaches that are specially optimized for RPE assembly. Therefore, easy-to-use software and an approach specially optimized for the assembly of RPE reads are urgently needed. Here, we present an optimized assembly approach with a pipeline software, RADassembler, to deal with the complexity of RAD assembly, which could take full advantage of the overlapping RPE reads.

The goals of this study are to: (a) present an optimized approach with the pipeline software RADassembler for local de novo assembly of the overlapping RPE reads from multiple individuals and (b) compare the performances of RADassembler with the original Stacks, Rainbow, and dDocent on both simulation and real RPE datasets.

2. Material and methods

By making full use of the features of RPE reads, we can firstly cluster the first reads (the forward reads with enzyme cut sites) into RAD loci based on the sequence similarity, then group the read pairs of each locus accordingly and perform the local de novo assembly. The pipeline software RADassembler, written in Bash and Perl, mainly uses Stacks and CAP3 to perform the local de novo assembly of the RPE reads. Specifically, Stacks is used for clustering, and CAP3 is used for assembly. We chose Stacks (version 1.48) for clustering due to its popularity in analysing RAD-seq data in prior studies.

2.1. Choosing the optimal similarity thresholds for clustering

As the similarity thresholds (the number of mismatches) for clustering are critical for the downstream analysis, we adopted a protocol from Ilut et al. [21] for optimal similarity threshold selection within individuals. Two main components of Stacks were used for the selection of optimal similarity thresholds, ustacks and cstacks. Data from each individual were grouped into loci by ustacks, and loci were grouped together across individuals by cstacks. RADassembler would run ustacks of Stacks using a set of mismatches (e.g. from 1 to 10) using a single individual. The optimal number of mismatches within individual was chosen to maximize the number of clusters with two haplotypes (alleles) and simultaneously minimize the number of clusters with one haplotype (allele). A novel method for choosing the similarity threshold across individuals (cstacks) was also introduced: RADassembler would run cstacks of Stacks using a set of mismatches (e.g. from 1 to 10) on a subset of data (e.g. randomly select several individuals from each population). The optimal number of mismatches across individuals was chosen at the point of inflection, such that the number of incremental loci for each merging individual using different mismatches changed little. All the above parameters can be set by users.
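The two selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RADassembler's implementation: the locus counts are made-up numbers, the function names are hypothetical, the within-individual rule is scored as the difference between two-allele and one-allele locus counts, and the "point of inflection" is approximated by a relative tolerance of 5%.

```python
"""Sketch of mismatch-threshold selection from per-threshold locus counts."""


def pick_within_individual(counts):
    """counts maps mismatch M -> (loci_with_one_allele, loci_with_two_alleles).

    Assumption: "maximize two-allele loci while minimizing one-allele loci"
    is scored as two_allele - one_allele; ties go to the smallest M.
    """
    return max(sorted(counts), key=lambda m: counts[m][1] - counts[m][0])


def pick_across_individuals(incremental, tolerance=0.05):
    """incremental maps mismatch n -> new loci added as each individual is merged.

    Returns the smallest n whose curve differs from the curve of the next n
    by less than `tolerance` (relative) at every merged individual.
    Falls back to the largest n tested if no curve flattens out.
    """
    ns = sorted(incremental)
    for n, nxt in zip(ns, ns[1:]):
        change = max(
            abs(a - b) / max(a, 1)
            for a, b in zip(incremental[n], incremental[nxt])
        )
        if change < tolerance:
            return n
    return ns[-1]


# Hypothetical counts from preliminary ustacks runs on one individual.
ustacks_counts = {
    1: (9000, 2000),
    2: (8000, 3500),
    3: (7200, 4300),
    4: (7000, 4400),
    5: (7100, 4350),
}
# Hypothetical incremental loci from preliminary cstacks runs on a subset.
cstacks_incremental = {
    1: [1000, 800, 700],
    2: [700, 500, 420],
    3: [600, 410, 330],
    4: [590, 405, 326],
}

best_within = pick_within_individual(ustacks_counts)
best_across = pick_across_individuals(cstacks_incremental)
print(best_within, best_across)
```

In practice the curves are usually inspected by eye, as in the figures of this study; a fixed tolerance is just one way to make the choice reproducible.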

2.2. De novo assembly of RAD contigs

After choosing the optimal number of mismatches within and across individuals, the first reads were sent to Stacks for clustering. A minimum depth of 5 was set to create a stack, and the number of mismatches allowed between stacks was set to the optimum to maintain the true alleles from paralogues. Deleveraging and removal algorithms of Stacks were turned on to resolve over merged loci and to filter out highly repetitive, likely paralogous loci. When building the catalogue, the number of mismatches allowed between loci across individuals was set to the optimum to attempt to merge loci together. Finally, only the second reads of each RAD locus from multiple individuals were collected into separate fasta files by using a modified version of ‘' of Stacks. RADassembler used data reduction techniques to select a certain number of reads (maximum of 400 and minimum of 10, set by users) for assembly.
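The data reduction step can be pictured as a per-locus subsampling of reads. The sketch below is an assumption about how such a step might look, not the pipeline's actual code: the function name is hypothetical, loci below the minimum are simply skipped, and loci above the maximum are randomly subsampled.

```python
import random


def reduce_reads(reads, min_reads=10, max_reads=400, seed=0):
    """Return the reads to assemble for one locus, or None if there are too few.

    Assumptions: loci with fewer than `min_reads` reads are skipped, and loci
    with more than `max_reads` reads are randomly subsampled to `max_reads`.
    """
    reads = list(reads)
    if len(reads) < min_reads:
        return None
    if len(reads) <= max_reads:
        return reads
    # A seeded generator keeps the subsample reproducible between runs.
    return random.Random(seed).sample(reads, max_reads)


# Toy loci: the reads are placeholder strings, not real sequences.
sparse_locus = reduce_reads(["read"] * 5)
normal_locus = reduce_reads(["read"] * 50)
deep_locus = reduce_reads(["read"] * 1000)
print(sparse_locus, len(normal_locus), len(deep_locus))
```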

To reduce the complexity for assembly, we present here a two-step assembly approach implemented in RADassembler (figure 1). Firstly, the second reads (the reverse reads) with random-sheared ends from multiple individuals corresponding to each RAD locus were sent to CAP3 to assemble separately, and the resulting contigs of each locus were then merged with the corresponding consensus sequence of the first reads from Stacks catalogue into one file. Secondly, each merged file was then locally assembled again into the final RAD contigs using CAP3. In the second step, if the contigs from the first step did not overlap with the consensus sequences, they would be concatenated by ten ‘N'. The assembly approach was parallelized to achieve the maximum efficiency. RADassembler used parameters specifically optimized for short reads assembly following the manual of CAP3 (see electronic supplementary material for the details of parameters).
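The second step (assemble if the pieces overlap, otherwise pad with ten 'N') can be illustrated with a toy function. This is a deliberately simplified stand-in for CAP3, which performs real alignment-based overlap detection: here an overlap means an exact suffix/prefix match on the same strand, and `join_or_pad` and `min_overlap` are hypothetical names chosen for the example.

```python
def join_or_pad(consensus, contig, min_overlap=20, pad="N" * 10):
    """Join two sequences on an exact suffix/prefix overlap, else pad with N.

    Simplification: CAP3 tolerates mismatches and handles orientation;
    this sketch only accepts exact overlaps on the same strand.
    """
    longest = min(len(consensus), len(contig))
    for size in range(longest, min_overlap - 1, -1):
        if consensus[-size:] == contig[:size]:
            # Overlap found: keep the shared bases only once.
            return consensus + contig[size:]
    # No overlap: concatenate with ten 'N' as a spacer.
    return consensus + pad + contig


# Overlapping case: the contig starts with the last 25 bases of the consensus.
first = "ACGTTGCA" * 5
second = first[-25:] + "GGGGGCCCCC"
joined = join_or_pad(first, second)

# Non-overlapping case with two 125 bp pieces.
padded = join_or_pad("A" * 125, "C" * 125)
print(len(joined), len(padded))
```

The padded case has length 125 + 10 + 125 = 260 bp, which is a useful number to keep in mind when reading length distributions of contigs built from 125 bp reads.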

Figure 1. Flow chart for the two-step assembly approach on RPE reads. (i) The first reads (the forward reads with enzyme cut sites) were clustered. (ii) The second reads (the reverse reads with random-sheared ends) were sorted into separate files accordingly (each locus represented by different colours contained reads from multiple individuals). Reads were assembled by a two-step assembly strategy: (iii) first step, the second reads were locally assembled into contigs and merged with the corresponding consensus sequences of the first reads; (iv) second step, the merged files were locally assembled again into the final RAD contigs. If the contigs of the second reads do not overlap with the consensus sequences, ten ‘N’ will be padded (locus in blue).

2.3. RADassembler on simulation data

To evaluate the performance of RADassembler, we simulated 12 individuals with high levels of heterozygosity (0.02) on the reference genome of the Genome Reference Consortium Zebrafish Build 11 (GRCz11, NCBI accession: GCF_000002035.6) digested with the enzyme SbfI. Only the primary assembly on 25 chromosomes of GRCz11 was retained for in silico digest. By using ‘ezmsim', a modified version of ‘wgsim' [22] from Rainbow, PE reads of length 125 bp were simulated from a range of insert-size libraries starting at 200 bp and elongated in 10 steps of 50 bp each. Mean depth of the PE reads was set to 10 for each step, and a sequencing error rate of 0.01 was randomly introduced according to a common error rate of approximately 0.1–1 × 10⁻² for Illumina sequencing machines [23]. The expected coverage for each simulated RAD locus is therefore 700 bp (200 bp plus 10 × 50 bp), and SNPs were random across all individuals. After checking the optimal number of similarity thresholds (see Results and figure 2), the number of mismatches within individual (ustacks) was set to 6, and the number of mismatches across individuals (cstacks) was set to 4. All the simulation and subsequent analysis were performed on a workstation with 20 CPUs (2.30 GHz) and 256 GB memory; 30 threads were used when parallelization was available.

Figure 2. The selection of the optimal number of mismatches within (a) and across (b) individuals on simulation datasets. Reads from each individual were grouped into loci by ustacks, and loci were grouped together across individuals by cstacks to build the catalogue. The optimal number of mismatches within individual (ustacks) was chosen to maximize the number of loci (Y-axis on the left) with two alleles and simultaneously minimize the number of loci with one allele. In this case, six mismatches should be an appropriate value for ustacks. For cstacks, the optimal number of mismatches across individuals was chosen at the point of inflection, such that the number of incremental loci (Y-axis on the right) for each merging individual (X-axis on the right) using different mismatch thresholds (represented by different line types) changed little. In this case, four mismatches should be an appropriate value for cstacks.

2.4. RADassembler on real data

Overlapping RPE reads of 24 individuals for the small yellow croaker Larimichthys polyactis from Zhang et al. [24] were selected as a real dataset, with approximate insert sizes from 200 to 600 bp. Raw reads were firstly processed by cutadapt [25] to remove potential adaptors, then were passed to process_radtags of Stacks to drop low-quality read pairs with a window size of 0.1 and a score limit of 13. Only read pairs containing enzyme cut sites were retained. In addition, PCR duplicates were removed by clone_filter of Stacks. The final retained reads from the 24 individuals were sent to RADassembler for optimized RAD contigs assembly. The numbers of mismatches were set to 3 (ustacks) and 3 (cstacks) following the optimal similarity thresholds choosing method (see Results and figure 3). Adaptors were removed from the assembled contigs using cutadapt, and a minimum contig length of 125 bp was also required.

Figure 3. The selection of the optimal number of mismatches within (a) and across (b) individuals on real datasets (L. polyactis). The optimal number of mismatches within individual should be 3 (a), and optimal number of mismatches across individuals should be 3 (b), although liberal values might be more appropriate.

2.5. Comparisons of the performance to other tools

We compared the assembly performance of RADassembler with three other popular tools that supported RPE reads, including the original Stacks (version 1.48), Rainbow (version 2.04) and dDocent (version 2.2.20). The performances of assembly on both simulation and real datasets were compared. Parameters in the original Stacks were identical to those used in RADassembler, except that all the read pairs from multiple individuals for each locus were extracted by using a modified version of ‘’, and then were sent to the wrapper ‘’ provided by Stacks to assemble contigs. This wrapper will run Velvet on each locus and collect the sequences into the final contigs; a minimum contig length of 125 bp was required. Rainbow is an ultra-fast and memory efficient solution for clustering and assembling short reads produced by RAD-seq. Rainbow includes three steps in assembling RAD contigs: clustering, dividing and merging (assembly). Parameters in Rainbow were set according to those used in RADassembler and dDocent, which were adjusted for multiple individuals. dDocent is an analysis pipeline that uses data reduction techniques and other stand-alone software packages to perform quality filtering, de novo assembly of RAD loci, read mapping, SNP calling and data filtering. Cut-off values for coverage of reads in the first assembly step of dDocent were set to 5 (within individual) and 2 (across individuals), respectively; the similarity threshold for the last reference clustering was set to the optimal value used in the RADassembler clustering stage. All the detailed parameters used in the above programs are presented in the electronic supplementary material.

We evaluated the performances of the assembly of different tools using the commonly used statistics, including N50, mean contig length and total contig length (coverage). In addition, for simulation data the assembled contigs were also mapped to the original reference genome using the local BLAST+ [26] program, and the mean identity and coverage were calculated. For real datasets, as no reference genome was currently available for L. polyactis, the reference genome (NCBI accession: GCF_000972845.1) of the congener (Larimichthys crocea) was selected for blast mappings. Furthermore, read pairs were also mapped back to the assembled contigs by using BWA 0.7.15 [27] to check the number of mapped reads and properly mapped reads. The properly mapped reads were those with the forward read and the reverse read mapped on the same contig (locus) and with the right orientation as well as proper insert size, as identified by the SAM flags given by the aligner. For simplification and consistency, only the read pairs used for the assembly in Stacks were used for all the read mappings, which would represent a comprehensive subset of the raw input reads. The ‘mem' algorithm [28] in BWA was used for mapping, and the parameters were set to default. Mapping statistics were calculated by Samtools 1.6 [22].
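For readers unfamiliar with these summary statistics, the sketch below computes N50, mean contig length and total length from a list of contig lengths. It uses the common definition of N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length); the function name and the example lengths are illustrative only.

```python
def assembly_stats(lengths):
    """Return N50, mean and total length for a list of contig lengths (bp)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # N50: first contig at which the cumulative length reaches half the total.
        if running * 2 >= total:
            n50 = length
            break
    return {"n50": n50, "mean": total / len(ordered), "total": total}


# Toy example with five contigs.
stats = assembly_stats([700, 650, 600, 300, 150])
print(stats)
```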

3. Results

3.1. Comparisons of RAD contigs assembly on simulation data

Using in silico digest, there were 29 242 cut sites of SbfI on the main assembly of the 25 chromosomes of GRCz11. Thus, an expected approximate coverage of 20 469 400 bp RAD library for each individual was generated, which covered approximately 1.52% of the genome. Using a set of mismatches (from 1 to 10 for ustacks, from 1 to 8 for cstacks) to cluster the first reads from preliminary runs, the optimal number of mismatches within individual (ustacks) was set to 6, and the optimal number of mismatches across individuals (cstacks) was set to 4 (figure 2). RADassembler exported a total of 29 533 loci for assembly, all of which were successfully assembled. The assembled contigs had an N50 of 698 bp, a mean contig length of 661 bp and a total coverage of 19 633 933 bp (table 1). Length distribution for the assembled contigs is presented in figure 4.
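The expected library size quoted above follows directly from the simulation settings, and the arithmetic is easy to check. In the short sketch below, the implied genome size and the recovered fraction are derived here for illustration and are not figures reported in the text.

```python
# Values taken from the text.
cut_sites = 29_242            # SbfI cut sites found by in silico digest
start_insert_bp = 200         # smallest simulated insert size
steps, step_bp = 10, 50       # 10 elongation steps of 50 bp each
assembled_bp = 19_633_933     # total coverage of the RADassembler contigs

# Expected length of one simulated RAD locus: 200 + 10 * 50 = 700 bp.
expected_locus_bp = start_insert_bp + steps * step_bp

# Expected RAD library size per individual: 29 242 * 700 bp.
library_bp = cut_sites * expected_locus_bp

# Derived, not reported: genome size implied by "approximately 1.52%".
implied_genome_bp = library_bp / 0.0152

# Derived, not reported: share of the expected library covered by contigs.
recovered_fraction = assembled_bp / library_bp

print(expected_locus_bp, library_bp)
print(round(implied_genome_bp / 1e9, 2), round(recovered_fraction, 3))
```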

Figure 4. Length distribution of contigs assembled by the four tools on simulation datasets. Program versions: Stacks 1.48, Rainbow 2.04, dDocent 2.2.20.

Table 1. Assembly statistics of the four tools on simulation datasets. Comparison statistics including (from left to right): number of clusters (loci) assembled, number of clusters that mapped to the reference genome (Identical Clusters), N50 (bp), mean contig length (Mean, bp), total coverage (Total Cov, bp), identical bases to the reference genome (Identical Cov, bp), identical bases to the reference genome in proportion of the total coverage (Cov Ratio), mean identity of those mapped to the reference genome (Mean Identity), total mapping rate of the read pairs (Total Mapped), proper mapping rate of the read pairs (Proper Paired).

Compared to the other three tools, RADassembler identified the most appropriate number of clusters (loci), and the assembled contigs were generally of high quality (table 1). By mapping the contigs to the reference genome, 99.96% of the clusters (assembled contigs) were mapped to the reference, with a mean identity of 98.78%. RADassembler showed the highest coverage ratio and proper mapping rate, with 98.60% of the reads properly mapped. Stacks and dDocent assembled many contigs of short length, which were not in accordance with the expectation (should be around the maximum insert size, 700 bp). Stacks (Velvet) failed to assemble most loci, though it recovered the appropriate number of loci in the clustering stage. The original Stacks assembled only 8717 loci, and most of the reads could not be properly mapped back (only 11.12% were properly mapped), which might suggest that Velvet was inappropriate for the assembly of RPE reads. Rainbow assembled many more loci (154 410) than the other tools, which was not in accordance with the expectation, suggesting the existence of many redundant loci. dDocent assembled 20 248 loci with an N50 of only 262 bp. When mapping back the read pairs, only 36.62% of the read pairs were properly mapped to the assembled contigs of dDocent. Although dDocent was the most time-efficient one among the four tools (see electronic supplementary material for benchmark details), RADassembler was still more efficient than the original Stacks and Rainbow. From a comprehensive perspective, RADassembler was the best performing tool among the four; comparison details are presented in table 1.

3.2. Comparisons of RAD contigs assembly on real data

After quality filtering, a total of 62 960 475 read pairs were retained for the 24 individuals of L. polyactis, with a mean of 2 623 353 read pairs per individual. Using preliminary runs to check the optimal similarity thresholds, the number of mismatches within individual was set to 3, and the number of mismatches across individuals was set to 3 (figure 3). RADassembler exported a total of 303 929 loci for assembly and all of these were successfully assembled. The assembled contigs, with an N50 of 539 bp, mean contig length of 511 bp and a total coverage of 157 941 578 bp, were also of high quality (table 2). Most of the read pairs (98.98%) were mapped to the contigs, and 95.99% of these were properly mapped. When mapping the assembled contigs to the reference genome of L. crocea, 98.33% of the assembled contigs were mapped to the reference with a mean identity of 95.85%.
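Mapped and properly mapped rates of this kind are read off the FLAG field of each SAM record. The sketch below shows the bit tests involved, using the flag bits defined in the SAM specification (0x2 properly paired, 0x4 unmapped, 0x100 secondary, 0x800 supplementary). The list of flags is a made-up example, and the function is an illustration rather than the way Samtools computes its statistics.

```python
PROPER_PAIR = 0x2      # each segment properly aligned according to the aligner
UNMAPPED = 0x4         # segment unmapped
SECONDARY = 0x100      # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary alignment


def mapping_rates(flags):
    """Return (mapped fraction, properly paired fraction of mapped reads).

    Only primary records are counted, so each read contributes once.
    """
    primary = [f for f in flags if not f & (SECONDARY | SUPPLEMENTARY)]
    mapped = [f for f in primary if not f & UNMAPPED]
    proper = [f for f in mapped if f & PROPER_PAIR]
    mapped_rate = len(mapped) / len(primary) if primary else 0.0
    proper_rate = len(proper) / len(mapped) if mapped else 0.0
    return mapped_rate, proper_rate


# Hypothetical flags: a proper pair (99, 147), an unmapped pair (77, 141),
# a pair mapped but not properly paired (65, 129), one supplementary record.
example_flags = [99, 147, 77, 141, 65, 129, 2147]
mapped_rate, proper_rate = mapping_rates(example_flags)
print(mapped_rate, proper_rate)
```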

Table 2. Assembly statistics of the four tools on real datasets of L. polyactis. The parameters of comparisons were the same as those used in the simulation datasets.

RADassembler also outperformed the other three tools on the real datasets (table 2). It showed the highest proper mapping rate, and the length of contigs conformed to the expected size (figure 5). Similar to their performances on simulation datasets, Stacks (Velvet) and dDocent performed poorly in recovering the appropriate contig size on the real datasets (figure 5), with many of their contigs being short. The original Stacks (Velvet) and Rainbow assembled more clusters (loci), and the total coverage was 181 151 234 bp and 182 080 648 bp, respectively. However, a large proportion of the read pairs could not be properly mapped. For the original Stacks, 87.89% of reads were mapped, but only 49.16% of these were properly mapped. Rainbow, however, performed better than Stacks on the real datasets, and better than it did on the simulation datasets. The size of the contigs assembled by Rainbow was also in accordance with the expected insert size. Its total and proper mapping rates were 92.34% and 85.47%, respectively, which were still lower than those of RADassembler. dDocent assembled 183 763 clusters, and most of the assembled contigs were small, which was consistent with its performance on the simulation datasets. Most of the contigs assembled by dDocent were around 260 bp, which was the length of the forward read (125 bp) and the reverse read (125 bp) plus ten ‘N', suggesting its failure in the assembly of the second reads with randomly sheared ends (figure 5). The details of comparison of performance of the tools are presented in table 2.

Figure 5. Length distribution of contigs assembled by the four tools on real datasets (L. polyactis). Program versions: Stacks 1.48, Rainbow 2.04, dDocent 2.2.20.

4. Discussion

Several analysis tools have been released and widely applied to help researchers to deal with RAD-seq data. However, previous studies based on RPE have only used either the first reads [29,30] for SNP calling and downstream population genetic analysis, or only the second reads for contigs assembly [6,31,32], and the information for the other reads was wasted then. Although most of these tools support PE reads, many of them do not directly support RPE reads with random-sheared ends. Many studies have not taken full advantage of RPE read pairs for both assembly and SNP calling. The main constraints here may be the highly uneven coverage depth of the read pairs and the generally low depth of the second reads, as shown in Davey et al. [9]. However, RADassembler helped in reducing the complexity of RPE assembly, and the results presented in this study demonstrated its high promise and wide applicability.

RADassembler offered two advantages in its assembly for RPE reads: (i) it used methods to choose the optimal similarity thresholds within and across individuals and (ii) it used a two-step assembly approach to efficiently reduce the assembly complexity. Similarity threshold selection is critical for the downstream analysis. Stringent thresholds will cause over-splitting, which creates false homozygosity, and liberal thresholds will cause under-splitting, which creates false heterozygosity [21,33]. Incorrect similarity thresholds affect the inferences of the level of variation in the downstream population genetic and phylogeographic estimates [33]. RADassembler could efficiently identify the optimal threshold within and across individuals without the prior knowledge of heterozygosity. As a pipeline software, dDocent also includes a two-step strategy in the assembly of RAD reads, but the rationale of which is quite different from RADassembler. dDocent was originally designed and optimized for flRAD datasets [20], although its current version also supports RPE datasets. At the first step of assembly, dDocent uses the concatenated PE reads (only the first reads were used for RPE) to count the occurrences of unique reads, then users can choose a cut-off level of coverage for reads to be used for assembly. The choice for a cut-off of unique reads within individual is similar to that in ustacks (the parameter of minimum depth of coverage required to create a stack) of Stacks. The remaining concatenated reads are then divided back into read pairs, clustered and locally assembled by Rainbow (in the current version of dDocent, CD-HIT [34,35] is used for clustering). At last, the assembled contigs are clustered based on the overall sequence similarity using CD-HIT. By contrast, RADassembler only uses the second reads of each locus for local assembly in its first assembly step. 
The assembled contigs for the second reads are then merged (either assembled or padded by ten ‘N') with the corresponding consensus sequences of the first reads. The output reference contigs of dDocent represent only a subset of the total genomic information content of the raw input [20], which might be the cause of its lower proper mapping rate in the results. However, RADassembler will assemble more comprehensive information for a de novo assembly of RAD loci. The comprehensive RAD reference is useful for downstream annotations and will increase the chance of discovering individual level polymorphisms.

RADassembler also supports multi-threading, and it includes a data reduction step before assembly. Users can choose a cut-off level of coverage to restrict the minimum and maximum number of reads for each locus used in assembly. Thus, RADassembler achieved a better running efficiency compared to the original Stacks and Rainbow. Rainbow includes a dividing step after the first clustering to distinguish sequencing errors from heterozygote or variants between repetitive sequences [18]. While this step worked perfectly for data of a single individual, it performed not so well in pooled data from multiple individuals, especially in species with high polymorphisms, as shown in the simulation datasets. Rainbow might be inappropriate for the assembly of RPE datasets from multiple individuals with high heterozygotes, though the parameters need further optimizations. RADassembler uses Stacks for better clustering and it is more appropriate in dealing with polymorphisms among multiple individuals. Stacks mainly uses two steps for de novo assembly of loci, ustacks for clustering within individuals and cstacks for constructing catalogue across individuals. The original Stacks uses DBG-based assembler Velvet to assemble contigs only for the second reads of RPE reads. By modifying the program to include the first reads, however, Velvet did not perform well and failed to connect overlapping RPE reads at many loci. Similar results were also observed in Hohenlohe et al. [8]. Both the OLC-based assemblers CAP3 (used in RADassembler) and Rainbow assembled the appropriate size of contigs, suggesting their advantages over DBG-based assemblers in the assembly of RPE reads.

There are two categories of widely used NGS assemblers, which are based on either the OLC methods or the DBG methods [10]. The OLC methods rely on an overlap graph involving three phases: overlap, layout and consensus [36]. The OLC-based assemblers perform pairwise alignments (which is computationally expensive) to discover overlaps and the length of overlaps are not required to be uniform. The DBG methods rely on k-mer graph, which use fixed-length subsequence (k-mer) as its nodes and overlaps between consecutive k-mer as its edges. K-mer graph methods do not require all-against-all overlap discovery [10], thus might lose some true overlaps, but have advantages in efficiency of assembly for high-throughput short reads. The k-mer graph-based assembler has been applied on RAD data in many studies, such as those using Velvet [32,37] and VelvetOptimiser [6,9]. However, DBG-based assemblers did not perform well in the presented study, as well as in Hohenlohe et al. [8]. The general problem may be due to the highly uneven sequence coverage of depth expected in each locus for the RPE datasets [8], which makes it hard for Velvet to correctly assemble contigs. Indeed, Velvet is confounded by non-uniform coverage of the target sequences, as it uses coverage-based heuristics to distinguish putative unique regions from putative repetitive regions [38]. Nonetheless, compared to overlap graphs, k-mer graphs are more sensitive to repeats and sequencing errors [10], suggesting that k-mer graph based tools (such as Velvet) might be less powerful for assembly of pooled reads from multi-individuals. The polymorphisms among individuals will also complicate the assembly, particularly for k-mer graph methods. The OLC methods performed much better though a bit more computationally expensive, but still affordable after data reduction and parallelization. Additionally, RADassembler uses a two-step strategy to further reduce the complexity of RPE assembly. 
This strategy offers two advantages: firstly, it reduces the complexity of reads from multiple individuals as well as the computational demands by using consensus sequences of the first reads and data reduction techniques (randomly selecting a subset of reads); secondly, it makes the depth in each assembly step uniform. At the same time, it is also crucial for researchers to vary the parameters to optimize the assembly. One solution is to estimate assembly parameters for each locus [9], and use a hybrid assembly strategy (use both OLC and DBG assemblers). However, this would cause severe computational demands. Our approach presented here provides a good tool for dealing with the complexity of RAD assembly, particularly for assembly of RPE reads from multiple individuals with high genetic variation.
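To make the k-mer graph description concrete, the following sketch builds such a graph from a handful of reads, with k-mers as nodes and an edge between each pair of consecutive k-mers in a read. It is a toy illustration, not how Velvet is implemented; the reads are invented, and they differ at a single base to show how a polymorphism between pooled individuals turns into a branch that the assembler then has to resolve.

```python
from collections import defaultdict


def kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        # Consecutive k-mers overlap by k - 1 bases.
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return dict(graph)


# Two toy reads from the same locus that differ at one position (a SNP).
reads = ["ACGTACC", "ACGTTCC"]
graph = kmer_graph(reads, k=3)

# Nodes with more than one successor are branching points.
branches = {node: nxt for node, nxt in graph.items() if len(nxt) > 1}
print(branches)
```

An overlap-based assembler would instead align the two reads against each other and could tolerate the single mismatch within one overlap, which is one way to picture why pooled, polymorphic data behave differently under the two approaches.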

RAD contigs are attractive for the detection and annotation of loci of interest (e.g. outliers). The assembled contigs hold higher probabilities to hit the database than that of the single-end consensus sequences. These annotations are important for population genomic and conservation genetic applications. In addition, RAD contigs provide more chances for outlier detection. The longer continuous sequences are expected to contain more SNPs that might be relevant to local adaptations. The assembled RAD contigs also provide sufficient flanking sequences for the design of primers or arrays that could be further used to perform functional verifications or studies of adaptive evolution based on more samples.

In the present study, we provided an optimized approach with the pipeline software RADassembler to deal with the assembly complexity for RPE reads from multiple individuals. The results on both simulation and real datasets suggested its high accuracy and efficiency. RADassembler included the protocols for choosing the optimal similarity thresholds, data reduction techniques as well as a two-step assembly approach to reduce the assembly complexity for RPE reads. RADassembler could provide an optimal tool for dealing with the complexity of RAD assembly for non-model species in ecological, evolutionary and conservation studies, especially for species with high polymorphisms.

Quality assessment with FastQC

More quality control! Yay! Note that we still have not done anything close to interpreting the biological results of our sequencing run.

FastQC is a great tool to get a first look at your data. It can give you an impression if there are certain biases in your data or if something went wrong in your library prep or sequencing run. It’s really easy to run, just type the following in the command line (after it’s installed):

You’ll receive a .html with different plots. More information, as always, in the documentation. Don’t rely on the pass/warning/fails on FastQC, it really depends on your library prep. In bisulphite sequencing for example, there will be little to no cytosines in the per base sequence content, but this is to be expected (as they are almost all converted to thymine). FastQC mislabels these as “fail”. Like Basespace, experience is important when interpreting FastQC plots. If you’re puzzled about something, ask a colleague.

Most often you will be analyzing more than one sample and will generate quite a few log files. MultiQC is a fantastic piece of software, with one command:

you can aggregate all your log files into one report. I use it all the time and can’t recommend it enough.
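The exact commands are not shown above, so here is a minimal sketch of typical invocations, wrapped in Python so they can be dropped into a script. The file names and the `qc` output directory are placeholders, the helper names are hypothetical, and the sketch assumes `fastqc` and `multiqc` are on your PATH; check each tool's documentation for the options your version supports.

```python
import shutil
import subprocess


def fastqc_command(fastq_files, outdir="qc"):
    """Build a FastQC call; the output directory must already exist."""
    return ["fastqc", "--outdir", outdir, *fastq_files]


def multiqc_command(log_dir=".", outdir="qc"):
    """Build a MultiQC call that scans log_dir for reports and log files."""
    return ["multiqc", log_dir, "--outdir", outdir]


def run_if_installed(cmd):
    """Run the command only when the tool is available; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


# Placeholder file names for a paired-end sample.
fastqc_cmd = fastqc_command(["sample_R1.fastq.gz", "sample_R2.fastq.gz"])
multiqc_cmd = multiqc_command("qc")
print(" ".join(fastqc_cmd))
print(" ".join(multiqc_cmd))
```

Calling `run_if_installed(fastqc_cmd)` followed by `run_if_installed(multiqc_cmd)` would then produce the per-sample reports and the aggregated one.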

If you’re interested in writing best-practice pipelines to process your fastq files, you might be interested in bcbio. But you still need to know what goes on under the hood of the bcbio pipeline, or you might want to develop one yourself.


The bench-top sequencing revolution has led to a ‘democratization’ of sequencing, meaning most research laboratories can afford to sequence whole bacterial genomes when their work demands it. However, analysing the data is now a major bottleneck for most laboratories. We have provided a starting point for biologists to quickly begin working with their own bacterial genome data, without investing money in expensive software or training courses. The figures show examples of what can be achieved with the tools presented, and the accompanying tutorial gives step-by-step instructions for each kind of analysis.


Konnector creates long pseudo-reads from paired-end sequencing reads (Figure 1) by searching for connecting paths between read pairs using a Bloom filter representation of a de Bruijn graph. In addition to connecting read pairs, Konnector v2.0 can also extend connected or unconnected sequences by following paths from the ends of sequences up to the next branching point or dead end in the de Bruijn graph. When the sequence extension feature of Konnector v2.0 is enabled, an additional Bloom filter is employed to avoid the production of an intractable quantity of duplicate sequences. Figure 2 provides a flowchart overview of the Konnector 2.0 algorithm.

Figure 1. A connecting path between two non-overlapping paired-end sequencing reads within a de Bruijn graph. Konnector joins the sequence provided by the input paired-end reads (green) by means of a graph search for a connecting path (blue). Sequencing errors in the input sequencing data produce bubbles and branches in the de Bruijn graph of up to k nodes in length (red). Bloom filter false positives produce additional branches (yellow) with lengths that are typically much shorter than the error branches.

The Konnector2 algorithm. (1): The algorithm builds a Bloom filter representation of the de Bruijn graph by loading all k-mers from the input paired-end sequencing data. (2): For each read pair, a graph search for connecting paths within the de Bruijn graph is performed. (3): If one or more connecting paths are found, a consensus sequence for the paths is built. (4): If no connecting paths are found, error-correction is attempted on reads 1 and 2. (5) and (6): the algorithm queries for the existence of either the consensus connecting sequence or the error-corrected reads in the "duplicate filter". The duplicate filter is an additional Bloom filter, separate from the Bloom filter de Bruijn graph, which tracks the parts of the genome that have already been assembled. (7) and (8): If one or more of the k-mers in the query sequence are not found in the duplicate filter, the sequence is extended outwards in the de Bruijn graph, until either a dead end or a branching point is encountered in the graph. Finally, the extended sequences are written to the output pseudo-reads file.

Bloom filter de Bruijn graph

As the throughput of the Illumina platforms increased rapidly to generate up to 1Tb in a six-day run with the HiSeq SBS V4 Kits, one important concern for pseudo-read generating tools is their computational efficiency. In related problems, bioinformatics tools have used strategies such as parallel computing [11, 12], FM indexing [13, 14], and compressed data structures [15] for handling big data.

To fit large assembly problems in small memory, one recent approach has been the use of Bloom filters [16, 3] to represent de Bruijn graphs, as demonstrated by the Minia assembler [17]. Konnector adopts a similar approach. Briefly, a Bloom filter is a bit array that acts as a compact representation of a set, where the presence or absence of an element in the set is indicated by the state of one or more bits in the array. The particular position of the bits that correspond to each element is determined by a fixed set of hash functions. While Bloom filters are very memory-efficient, the principal challenge of developing Bloom filter algorithms is in dealing with the possibility of false positives. A false positive occurs when the bit positions of an element that is not in the set collide with the bit positions of an element that is in the set. In the context of Bloom filter de Bruijn graphs, false positives manifest themselves as false branches, as depicted by the yellow nodes in Figure 1.
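The bit-array idea described above can be illustrated with a minimal sketch. This is not Konnector's implementation: the class name, the use of salted SHA-256 as the "fixed set of hash functions", and the sizes chosen are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus a fixed set of hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive the hash functions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if every bit for this item is set. This can be a false
        # positive, but an item that was added is never reported absent.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


kmer_filter = BloomFilter(num_bits=1 << 16, num_hashes=3)
kmer_filter.add("ACGTA")
print("ACGTA" in kmer_filter)  # an added k-mer is always found
```

A query for a k-mer that was never added usually returns False, but may return True when its bit positions collide with those of stored k-mers; in the graph setting, that is what appears as a false branch.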

In the first step of the algorithm (Figure 2, step (1)), the Bloom filter de Bruijn graph is constructed by shredding the input reads into k-mers, and loading the k-mers into a Bloom filter. To diminish the effect of sequencing errors at later stages of the algorithm, k-mers are initially propagated between two Bloom filters, where the first Bloom filter contains k-mers that have been seen at least once, and the second Bloom filter contains k-mers that have been seen at least twice. At the end of k-mer loading, the first Bloom filter is discarded, and the second Bloom filter is kept for use in the rest of the algorithm. We note here that only the k-mers of the input reads, corresponding to the nodes in the de Bruijn graph, are stored in the Bloom filter whereas there is no explicit storage of edges. Instead, the neighbours of a k-mer are determined during graph traversal by querying for the presence of all four possible neighbours (i.e. single base extensions) at each step.
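As a rough sketch of this loading step, the two-level filtering and the implicit edges might look as follows. Plain Python sets stand in for the two Bloom filters, the function names are hypothetical, and reverse complements are ignored for brevity.

```python
def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def load_solid_kmers(reads, k):
    """Return the k-mers that were seen at least twice."""
    seen_once = set()   # stand-in for the first Bloom filter
    seen_twice = set()  # stand-in for the second Bloom filter
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in seen_once:
                seen_twice.add(kmer)
            else:
                seen_once.add(kmer)
    return seen_twice   # the first level is discarded after loading


def successors(kmer, graph):
    """Edges are implicit: query all four single-base extensions."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in graph]


def predecessors(kmer, graph):
    """Same query in the opposite direction."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in graph]


graph = load_solid_kmers(["ACGTA", "ACGTA", "ACGTT"], k=4)
print(sorted(graph))               # ['ACGT', 'CGTA']
print(successors("ACGT", graph))   # ['CGTA']
```

The k-mer CGTT occurs only once in the toy input, so it never reaches the second level and is treated as a likely sequencing error.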

Searching for connecting paths

In a second pass over the input sequencing data, Konnector searches for connecting paths within the de Bruijn graph between each read pair (Figure 2, step (2)). The graph search is initiated by choosing a start k-mer in the first read and a goal k-mer in the second read, and is carried out by means of a depth-limited, bidirectional, breadth-first search between these two k-mers.

The start and goal k-mers are selected to reduce the probability of dead-end searches due to sequencing errors or Bloom filter false positives. First, the putative non-error k-mers of each read are identified by querying for their existence in the Bloom filter de Bruijn graph. (Recall that after the loading stage, this Bloom filter only contains k-mers that occur twice or more.) Next, the algorithm attempts to find a consecutive run of three non-error k-mers within the read, and chooses the k-mer on the distal end (i.e. 5' end) of the run as the start/goal k-mer. This method ensures that if the chosen start/goal k-mer is a Bloom filter false positive, the path search will still proceed through at least two more k-mers instead of stopping at a dead end. In the likely case that there are multiple runs of "good" k-mers within a read, the run that is closest to the 3' (gap-facing) end of the read is chosen, in order to reduce the depth of the subsequent path search. In the case that there are no runs of three good k-mers, the algorithm falls back to using the longest run found (i.e. two k-mers or a single k-mer).
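The selection rule can be sketched as below, under simplifying assumptions: the read is given 5' to 3', the graph is a plain set of k-mers rather than a Bloom filter, and the function name pick_seed_index is hypothetical. Handling of the second read (which would be reverse-complemented) is left out.

```python
def pick_seed_index(read, k, graph, run_length=3):
    """Return the index of the start/goal k-mer, or None if no k-mer is good.

    Prefers a run of `run_length` good k-mers, taking the run closest to the
    3' end and returning the k-mer at the 5' end of that run. Falls back to
    shorter runs (two k-mers, then one) when no full run exists.
    """
    good = [read[i:i + k] in graph for i in range(len(read) - k + 1)]
    for want in range(run_length, 0, -1):
        # Scan candidate run starts from the 3' end towards the 5' end.
        for start in range(len(good) - want, -1, -1):
            if all(good[start:start + want]):
                return start
    return None


read = "AACCGGTT"
graph = {"AAC", "ACC", "CCG", "CGG"}       # the last two k-mers look erroneous
print(pick_seed_index(read, 3, graph))     # 1, i.e. the k-mer "ACC"
```

In the example the good k-mers occupy indices 0 to 3, so the run of three nearest the 3' end covers indices 1 to 3 and its 5' member, at index 1, is selected.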

Once the start and goal k-mers have been selected, Konnector performs the search for connecting paths. In order to maximize the accuracy of the sequence connecting the reads, it is important for the algorithm to consider all possible paths between the reads, up to the depth limit dictated by the DNA fragment length. For this reason, a breadth-first search is employed rather than a shortest path algorithm such as Dijkstra or A*. Konnector implements a bidirectional version of breadth-first search, which improves performance by conducting two half-depth searches, and thus reducing the overall expansion of the search frontier. The bidirectional search is implemented by alternating between two standard breadth-first searches that can "see" each other's visited node lists. If one search encounters a node that has already been visited by the other search, the edge leading to that node is recorded as a "common edge", and the search proceeds no further through that particular node. As the two searches proceed, all visited nodes and edges are added to a temporary, in-memory "search graph". This facilitates the final step, where the full set of connecting paths are constructed by performing an exhaustive search both backwards and forwards from each common edge towards the start and goal k-mers, respectively.
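To make the idea of enumerating every path up to a depth limit concrete, here is a deliberately simplified sketch. It is a unidirectional breadth-first enumeration, not the bidirectional search with common edges that Konnector uses, and the function names and the max_paths cap are assumptions.

```python
from collections import deque


def connecting_paths(start, goal, graph, max_depth, max_paths=100):
    """Enumerate all paths from start to goal using at most max_depth edges."""
    paths = []
    queue = deque([[start]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_depth:
            continue  # depth limit, dictated by the fragment length
        for base in "ACGT":
            nxt = node[1:] + base
            if nxt in graph:
                queue.append(path + [nxt])
    return paths


def path_to_sequence(path):
    """Spell out the DNA sequence of a path of overlapping k-mers."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


# Toy graph (k = 3) with a bubble caused by a single-base difference.
graph = {"AAC", "ACG", "CGT", "GTT", "TTC", "ACC", "CCT", "CTT"}
for p in connecting_paths("AAC", "TTC", graph, max_depth=6):
    print(path_to_sequence(p))
```

Both branches of the bubble are reported, which is the reason for preferring an exhaustive breadth-first search over a shortest-path algorithm here. The bidirectional variant described above reaches the same set of paths while expanding two half-depth frontiers instead of one full-depth frontier.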

If the search algorithm finds a unique path between the start and goal k-mers, then the path is converted to a DNA sequence, and is used to join the read sequences into a single pseudo-read. In the case of multiple paths, a multiple sequence alignment is performed, and the resulting consensus sequence is used to join the reads instead (Figure 2, step (3)). In order to fine-tune the quality of the results, the user may specify limits with respect to the maximum number of paths that can be collapsed to a consensus and/or the maximum number of mismatches that should be tolerated between alternate paths.
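The effect of the two user limits can be illustrated with a much-reduced stand-in for the consensus step. This sketch assumes all candidate paths have equal length, so a column-wise comparison replaces the multiple sequence alignment, and it marks disagreeing columns with "N"; both choices, and the parameter names, are assumptions rather than Konnector's behaviour.

```python
def consensus(paths, max_paths=2, max_mismatches=2):
    """Collapse candidate connecting sequences, or return None if over a limit."""
    if not paths or len(paths) > max_paths:
        return None
    if len({len(p) for p in paths}) != 1:
        return None  # unequal lengths would need a real alignment
    columns = list(zip(*paths))
    mismatches = sum(1 for col in columns if len(set(col)) > 1)
    if mismatches > max_mismatches:
        return None
    # Assumption: a disagreeing column is reported as the ambiguity code N.
    return "".join(col[0] if len(set(col)) == 1 else "N" for col in columns)


print(consensus(["AACCTTC", "AACGTTC"]))  # AACNTTC
```

Tightening max_paths or max_mismatches causes more read pairs to be left unconnected, in exchange for fewer joins across ambiguous regions.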

Extending connected and unconnected sequences

Konnector v2.0 introduces a new capability to extend both connected and unconnected sequences by traversing from the ends of sequences to the next branching point or dead end in the de Bruijn graph (Figure 2, steps (7) and (8)). If a read pair is successfully connected, the algorithm will extend the pseudo-read outwards in both directions; if the read pair is not successfully connected, each of the two reads will be extended independently, both inwards and outwards. The extensions are seeded in the same manner described above for the connecting path searches: a putative non-error k-mer is selected near the end of the sequence, following two consecutive non-error k-mers if possible.
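A minimal sketch of a rightward extension follows, assuming the graph is a plain set of k-mers with k of at least 2 and that the last k-mer of the sequence is a valid seed. The function name and the max_steps guard against cycles are hypothetical, and only out-branching is checked.

```python
def extend_right(seq, k, graph, max_steps=10000):
    """Extend seq base by base until a fork or a dead end is reached."""
    for _ in range(max_steps):
        suffix = seq[-(k - 1):]
        candidates = [b for b in "ACGT" if suffix + b in graph]
        if len(candidates) != 1:
            break  # zero successors: dead end; several: branching point
        seq += candidates[0]
    return seq


# Toy graph (k = 3): a linear path that forks after "GTT".
graph = {"AAC", "ACG", "CGT", "GTT", "TTC", "TTA"}
print(extend_right("AAC", 3, graph))  # AACGTT
```

Leftward extension is symmetric: prepend a base and query the first k-1 bases of the sequence instead of the last.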

The extension of connected reads or unconnected reads that are contained within the same linear path of the de Bruijn graph results in identical sequences. For this reason, the algorithm uses an additional Bloom filter to track the k-mers of sequences that have already been assembled. (Hereafter this Bloom filter will be referred to as the "duplicate filter" in order to reduce confusion with the Bloom filter de Bruijn graph.)

The logic for tracking duplicate sequences differs for the cases of connected and unconnected read pairs. In the case of connected reads, only the k-mers of the connecting sequence are used to query the duplicate filter (Figure 2, step (5)). By virtue of being present in the Bloom filter de Bruijn graph, the connecting k-mers are putative non-error k-mers that have occurred at least twice in the input sequencing data, and thus a 100% match is expected in the case that the genomic region in question has already been covered. If one or more k-mers from the connecting sequence are not found in the duplicate filter, the pseudo-read is kept and is extended outwards to its full length (Figure 2, step (7)). The k-mers of the extended sequence are then added to the duplicate filter, and the sequence is written to the output pseudo-reads file.

In the case of unconnected reads, the reads must first be corrected prior to querying the duplicate filter (Figure 2, step (4)). This is done by first extracting the longest contiguous sequence of non-error k-mers within the read, where k-mers that are present in the Bloom filter de Bruijn graph are considered to be putative non-error k-mers. An additional step is then performed to correct for recurrent read-errors that may have made it past the two-level Bloom filter. Starting from the rightmost k-mer of the selected subsequence, the algorithm steps left by k nodes, aborting the correction step if it encounters a branching point or dead-end before walking the full distance. As the longest branch that can be created by a single sequencing error is k nodes, this navigates out of any possible branch or bubble created by an error (red nodes of Figure 1). Finally, the algorithm steps right up to (k+1) nodes to generate a high confidence sequence for querying the duplicate filter. The second rightward step stops early upon encountering a branching point or dead-end, but any sequence generated up to that point is kept, and is still used to query the duplicate filter. Following error correction, the subsequent steps for handling unconnected reads are similar to the case for connected reads. If the high confidence sequence contains k-mers that are not found in the duplicate filter, the sequence is extended to its full length, added to the duplicate filter, and written to the output pseudo-reads file.
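The duplicate-filter check shared by the connected and unconnected cases can be sketched as follows. A Python set stands in for the duplicate Bloom filter, and the function names are hypothetical; the query is either the connecting sequence or the high confidence sequence from error correction.

```python
def kmers(seq, k):
    """Return every k-mer of a sequence, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def emit_if_novel(query, extended, k, duplicate_filter, output):
    """Write `extended` to the output unless `query` is already fully covered.

    query:    connecting sequence, or the corrected read
    extended: the same sequence after extension to its full length
    """
    if all(kmer in duplicate_filter for kmer in kmers(query, k)):
        return False  # 100% match: this region has already been assembled
    duplicate_filter.update(kmers(extended, k))
    output.append(extended)
    return True


seen, pseudo_reads = set(), []
print(emit_if_novel("CGT", "AACGTTC", 3, seen, pseudo_reads))  # True
print(emit_if_novel("GTT", "AACGTTC", 3, seen, pseudo_reads))  # False
print(pseudo_reads)                                            # ['AACGTTC']
```

Only the k-mers of the extended sequence are recorded, so a later read pair that falls inside the same linear path matches completely and is skipped.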

Finally, some additional look-ahead logic is employed in the extension algorithm to handle the common cases of false positive branches and simple bubbles created by heterozygous SNPs. All branches shorter than or equal to three nodes in length are assumed to be false positive branches and are ignored during extension. Upon reaching a fork with two (non-false-positive) branches, a look-ahead of (k+1) nodes is performed to see if the branches re-converge. If so, the bubble is collapsed and the extension continues.
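The look-ahead for simple bubbles can be sketched like this, again with a plain set of k-mers as the graph and hypothetical function names. The handling of short false-positive branches is omitted.

```python
def walk(node, graph, max_nodes):
    """Follow unique successors from node, visiting at most max_nodes nodes."""
    visited = [node]
    while len(visited) < max_nodes:
        tail = visited[-1][1:]
        nexts = [tail + b for b in "ACGT" if tail + b in graph]
        if len(nexts) != 1:
            break
        visited.append(nexts[0])
    return visited


def bubble_reconverges(branch_a, branch_b, graph, k):
    """Return the first node shared by both branches within k+1 nodes, or None."""
    seen_b = set(walk(branch_b, graph, k + 1))
    for node in walk(branch_a, graph, k + 1):
        if node in seen_b:
            return node
    return None


# Toy bubble (k = 3) from two sequences differing by one base:
# AACGTTC and AACCTTC fork after "AAC" and rejoin at "TTC".
graph = {"AAC", "ACG", "CGT", "GTT", "TTC", "ACC", "CCT", "CTT"}
print(bubble_reconverges("ACG", "ACC", graph, k=3))  # TTC
```

A single-base difference alters k consecutive k-mers, so each branch of such a bubble is k nodes long and the branches meet at node k+1, which is why a look-ahead of that length suffices.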

© 2013 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License, which permits unrestricted use, provided the original author and source are credited.



