Abstract: We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases.
Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high-quality extensions in the dataset, avoiding the explicit error correction step used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by 280 bp or 3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
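The idea of walking only through k-mers with unique extensions can be illustrated with a minimal Python sketch. It is not the meraculous implementation: it assumes error-free strand handling (reverse complements are ignored), it replaces the quality-based filter described in the abstract with a simple count threshold, and the names `extension_table`, `walk_right`, and `min_count` are hypothetical.

```python
from collections import defaultdict


def extension_table(reads, k, min_count=2):
    """Map each k-mer to (prev_bases, next_bases) seen at least min_count times.

    Assumption: a count threshold stands in for the quality-based filter.
    """
    left = defaultdict(lambda: defaultdict(int))
    right = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if i > 0:
                left[kmer][read[i - 1]] += 1   # base preceding this k-mer
            if i + k < len(read):
                right[kmer][read[i + k]] += 1  # base following this k-mer
    table = {}
    for kmer in set(left) | set(right):
        prev_bases = {b for b, n in left.get(kmer, {}).items() if n >= min_count}
        next_bases = {b for b, n in right.get(kmer, {}).items() if n >= min_count}
        table[kmer] = (prev_bases, next_bases)
    return table


def walk_right(seed, table):
    """Extend seed to the right while every step is unique in both directions."""
    contig = seed
    kmer = seed
    seen = {seed}
    empty = (set(), set())
    while True:
        _, next_bases = table.get(kmer, empty)
        if len(next_bases) != 1:
            break  # dead end or fork: stop conservatively
        (base,) = next_bases
        successor = kmer[1:] + base
        prev_bases, _ = table.get(successor, empty)
        if len(prev_bases) != 1 or successor in seen:
            break  # successor has an ambiguous predecessor, or we hit a cycle
        contig += base
        kmer = successor
        seen.add(successor)
    return contig


if __name__ == "__main__":
    genome = "ATGCGTACCTGAAGTC"  # toy sequence, not real data
    # Every 10-base window, each observed twice, plus one read with an error.
    reads = [genome[i:i + 10] for i in range(7)] * 2 + ["ATGCGTTCCT"]
    print(walk_right("ATGCG", extension_table(reads, k=5)))
```

In this toy run the erroneous base is seen only once, so it never qualifies as an extension and the walk passes through it without a separate correction step. Lowering `min_count` to 1 makes the same position a fork, and the walk stops there instead of guessing.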
Publication Date: 2011-08-18
Research Org.: Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.: Genomics Division
OSTI Identifier: 1056549
Report Number(s): LBNL-5312E
DOE Contract Number: DE-AC02-05CH11231
Resource Type: Journal Article
Journal Name: PLoS ONE
Additional Journal Information: Journal Volume: 6; Journal Issue: 8; Journal ID: ISSN 1932-6203
Publisher: Public Library of Science
Country of Publication: United States
Language: English
Subject: 59 BASIC BIOLOGICAL SCIENCES

Paired-end library sequencing has been proven useful in scaffold construction during de novo assembly of genomic sequences.
De novo Genome Assembly for Illumina Data Protocol
Written and maintained by Simon Gladman, Melbourne Bioinformatics (formerly VLSCI).
Protocol Overview / Introduction
In this protocol we discuss and outline the process of de novo assembly for small to medium-sized genomes.
The ability to generate mate pairs with insert sizes of 8 kb or greater is especially important for genomes containing long repeats. Although the current 454 GS LT Paired-end library preparation protocol can successfully construct libraries with a 3 kb insert size, it fails to generate longer insert sizes because the protocol is optimized to purify shorter fragments.
We have made several changes to the protocol in order to increase the fragment length. These changes include the use of a Promega column to increase the yield of large DNA fragments, two gel purification steps to remove contaminating short fragments, and a large reaction volume in the circularization step to decrease the formation of chimeras. We have also made additional changes to the protocol to increase the overall quality of the libraries. The primary quality of the libraries was measured by a set of metrics: levels of redundant reads, linker-positive, linker-negative, and half-linker reads, driver DNA contamination, and read length distribution. We have also assessed the quality of the resulting mate pairs, including levels of chimeras, distribution of insert sizes, and genome coverage, after the assemblies were completed. Our data indicate that all these changes have improved the quality of the longer insert size libraries.
Fosmid or BAC end sequencing plays an important role in de novo assembly of large genomes such as those of fungi and plants. However, construction and Sanger sequencing of fosmid or BAC libraries are laborious and costly. The current 454 Paired-End (PE) Library and Illumina Jumping Library construction protocols are limited to gap sizes of approximately 20 kb and 8 kb, respectively. In an attempt to understand the limitations of constructing PE libraries with gaps greater than 30 kb, we purified 18, 28, 45, and 65 kb sheared DNA fragments from yeast and circularized the ends using the Cre-loxP approach described in the 454 PE Library protocol. With increasing fragment size, we found a general trend of decreasing library quality in several areas.
First, redundant reads and reads containing multiple loxP linkers increase when the average fragment size increases. Second, the contamination of short distance pairs (.

Paired-end library sequencing has been proven useful in scaffold construction during de novo whole genome shotgun assembly.
The ability to generate mate pairs with 8 kb insert sizes is especially important for genomes containing long repeats. To make mate-pair libraries for next generation sequencing, DNA fragments need to be circularized to bring the ends together. Several methods can be used for DNA circularization, namely ligation, hybridization, and Cre-LoxP recombination. Because of its higher circularization efficiency with large-insert DNA fragments, the Cre-LoxP recombination method has generally been used for constructing 8 kb insert size paired-end libraries. The second fragmentation step is also crucial for maintaining high library complexity and uniform genome coverage. Here we describe three fragmentation methods: restriction enzyme digestion, random shearing, and nick translation. We will present the comparison results for these three methods.
Our data showed that all three methods are able to generate paired-end libraries with inserts greater than 20 kb. Advantages and disadvantages of these three methods will be discussed as well.