Whole Genome Technical Documentation

Technical details

Plasmidsaurus Whole Genome Sequencing supports de novo assembly and annotation of reference-quality genomes from clonal populations of bacteria, yeast, fungi, or multicellular eukaryotic species.

 

Sequencing

We sequence each sample with Oxford Nanopore (ONT) long reads to very high depth before generating an assembly using the latest super-accurate basecalling model and polishing software:

  • We construct an amplification-free long-read sequencing library using the newest v14 library prep chemistry, including minimal fragmentation of the input genomic DNA in a sequence-independent manner via tagmentation.
  • We sequence the library with a primer-free protocol using the most accurate R10.4.1 flow cells (raw data is delivered in .fastq format).

We require a minimum raw read Qscore of 10 (90% accuracy) during sequencing, although most raw reads are above Q20 (99% accuracy). 
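
The accuracy figures quoted for each Qscore follow the standard Phred relationship, accuracy = 1 - 10^(-Q/10). A minimal sketch of that conversion (the function name is illustrative and not part of any Plasmidsaurus tooling):

```python
def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to per-base accuracy (1 - error probability)."""
    return 1.0 - 10.0 ** (-q / 10.0)


# Q10 is the minimum raw read Qscore, Q20 is typical for raw reads,
# and Q40 is the assembly accuracy mentioned later in this document.
for q in (10, 20, 40):
    print(f"Q{q}: {phred_to_accuracy(q):.4%} accuracy")
```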

The Hybrid ONT + Illumina bacterial genome sequencing service polishes your long-read ONT assembly with Illumina short reads (paired-end, 2x150 bp).

 

Bioinformatics analysis and genome assembly

The steps of our pipeline for bacterial WGS are:

  • Basecalling | Dorado Super-Accurate Basecalling with default Q10 quality filtering
  • Quality filtering | Remove worst 5% of reads with Filtlong v0.2.1
  • Downsampling | Downsample the reads to 250 Mb via Filtlong to create a rough sketch of the assembly with Miniasm v0.3
    • Using information acquired from the Miniasm assembly, re-downsample the reads to ~100x coverage (do nothing if there isn't at least 100x coverage) with heavy weight applied to removing low quality reads (helps small plasmids stick around)
  • Assembly | Run a Flye v2.9.1 assembly with parameters selected for high quality ONT reads
  • Polishing | Polish Flye assembly via Medaka v1.8.0 using the reads generated in step
  • Analysis | Run several analyses:
  • Hybrid Option | If you select the hybrid sequencing option, we target the same deliverables listed above for your ONT-only assembly, PLUS we run the following additional step: Polish the ONT .fna assembly with Illumina .fastq reads using Polypolish v0.6.0, which yields a new .fasta polished hybrid assembly file.
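
The re-downsampling rule in the bacterial pipeline (aim for ~100x, do nothing below 100x) can be sketched as simple arithmetic. This is an illustration only: the function and parameter names are hypothetical, the genome size is assumed to come from the rough Miniasm sketch, and the exact Filtlong invocation used in the pipeline is not stated here. The returned number is the kind of value one could pass to Filtlong's --target_bases option.

```python
from typing import Optional


def downsample_target_bases(total_bases: int,
                            genome_size: int,
                            target_coverage: int = 100) -> Optional[int]:
    """Return the number of bases to keep for ~target_coverage depth.

    Returns None when the reads do not reach the target coverage,
    meaning no downsampling should be done.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    wanted = genome_size * target_coverage
    if total_bases < wanted:
        return None  # below 100x: keep all reads
    return wanted


# Hypothetical example: 1 Gb of reads for a 5 Mb genome estimate.
print(downsample_target_bases(1_000_000_000, 5_000_000))
# Hypothetical example: 210 Mb of reads for the same genome (42x, left untouched).
print(downsample_target_bases(210_000_000, 5_000_000))
```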

 

The steps of our pipeline for yeast & eukaryotic WGS are:

  • Basecalling | Dorado Super-Accurate Basecalling with default Q10 quality filtering
  • Quality filtering | Remove the worst 5% of fastq reads via Filtlong v0.2.1 (with heavy weight applied to removing low quality reads, -qual_weight 10)
  • Assembly | Run a Hifiasm v0.25.0 assembly with parameters selected for high quality ONT reads
  • Analysis | Run several analyses:
  • Hybrid Option | If you select the hybrid sequencing option for yeast genome sequencing, we target the same deliverables listed above for your ONT-only assembly, PLUS we run the following additional step: Polish the ONT .fna assembly with Illumina .fastq reads using Polypolish v0.6.0, which yields a new .fasta polished hybrid assembly file.

If sufficient coverage to meet our target is obtained, we typically see assembled contigs with ~Q40 (99.99%) accuracy.

 

 

Data targets and service levels

Bacterial Genome Sequencing

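Service Level | Genome Size | Sequencing Platform | Raw Data Target | Sample Submission from gDNA | Sample Submission from Cells
Standard | <7 Mb | ONT | 210 Mb of ONT raw data | 20 μL at 50 ng/μL normalized concentration | 4-6 x 10^9 cells in 500 µL Zymo 1X DNA/RNA Shield
Big | 7-12 Mb | ONT | 360 Mb of ONT raw data | 20 μL at 50 ng/μL normalized concentration | 4-6 x 10^9 cells in 500 µL Zymo 1X DNA/RNA Shield
Hybrid | <7 Mb | ONT + Illumina | 210 Mb of ONT raw data + 100 Mb of Illumina raw data | 30 μL at 50 ng/μL normalized concentration | 4-6 x 10^9 cells in 500 µL Zymo 1X DNA/RNA Shield
Big Hybrid | 7-12 Mb | ONT + Illumina | 360 Mb of ONT raw data + 170 Mb of Illumina raw data | 30 μL at 50 ng/μL normalized concentration | 4-6 x 10^9 cells in 500 µL Zymo 1X DNA/RNA Shield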
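
The raw data targets can be related to approximate sequencing depth by dividing by genome size. The sketch below assumes depth = raw bases / genome size and ignores reads removed during quality filtering; the function name is hypothetical.

```python
def expected_coverage(raw_data_mb: float, genome_size_mb: float) -> float:
    """Approximate fold coverage from a raw data amount and a genome size, both in Mb."""
    if genome_size_mb <= 0:
        raise ValueError("genome_size_mb must be positive")
    return raw_data_mb / genome_size_mb


# Standard target at the largest Standard genome size (7 Mb).
print(expected_coverage(210, 7))
# Big target at the largest Big genome size (12 Mb).
print(expected_coverage(360, 12))
```

By this estimate, the ONT targets correspond to roughly 30x at the upper end of each genome size range, and smaller genomes receive proportionally more depth.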

 

Yeast Genome Sequencing

Service Level | Genome Size | Sequencing Platform | Raw Data Target | Sample Submission from gDNA | Sample Submission from Cells
Standard | <20 Mb | ONT | 600 Mb of ONT raw data | 20 μL at 50 ng/μL normalized concentration | 50 - 100 mg cell pellet in 500 µL Zymo Shield
Hybrid | <20 Mb | ONT + Illumina | 600 Mb of ONT raw data + 290 Mb of Illumina raw data | 30 μL at 50 ng/μL normalized concentration | 50 - 100 mg cell pellet in 500 µL Zymo Shield

 

Eukaryotic Genome Sequencing

Service Level | Genome Size Range (30X Coverage) | Sequencing Platform | Sample Submission from gDNA | Sample Submission from Cells
1 Gb data target | 20-60 Mb | ONT | 20 μL at 50 ng/μL normalized concentration | 5 x 10^6 cells in 200 µL Zymo Shield
5 Gb data target | 75-250 Mb | ONT | 40 μL at 50 ng/μL normalized concentration | 5 x 10^6 cells in 200 µL Zymo Shield
15 Gb data target | 300-750 Mb | ONT | 60 μL at 50 ng/μL normalized concentration | 5 x 10^6 cells in 200 µL Zymo Shield
Full flow cell (50-100 Gb data target) | 1 - 3.3 Gb | ONT | 60 μL at 50 ng/μL normalized concentration | 5 x 10^6 cells in 200 µL Zymo Shield

 

Data deliverables and file types

We provide a range of data deliverables for whole genome sequencing, which can be accessed via your Dashboard.

 

Data deliverables for bacterial genome sequencing

  • Raw reads:
    • (.fastq.gz file) A compressed file of all the raw ONT sequencing reads
  • Annotated sequence:
    • (.gbff, .gbk, .embl files) ONT-polished and annotated consensus sequence of the genome in various formats.
  • FASTA formatted sequences:
    • (.fna file) Full sequence of each contig.
    • (.faa file) Amino acid sequence of each protein annotation.
    • (.ffn file) DNA sequence of each annotated gene.
  • Annotations and species identifications:
    • (.txt files) Species identifications as found by Mash and Sourmash analyses.
    • (.gff file) Summary of annotations from Bakta analysis.
  • Summary report:
    • (.html, .png files) An analytical report of the figures and key metrics for the assembly.
  • Genome comparison tool:
    • Interactive tool for comparing two bacterial genomes. See demo video here

 

Data deliverables for yeast and eukaryotic genome sequencing

  • .fastq.gz = a compressed file of all the raw ONT sequencing reads
  • .fasta = polished consensus sequence of the genome (may contain multiple contigs)
  • .gff = gene annotations for the polished genome
  • .html = a summary report compiling the assembly metrics, including completeness of the assembly based on BUSCO v5.7.1 and general species identification of the contigs. The metrics summarized in the report are also delivered as discrete files:
    • reads.png = histogram of all raw reads (indicating read length vs. Phred score), including coloration to distinguish reads that are retained for assembly vs. reads that are rejected
    • stats.tsv = metrics assessing the quality and size of the polished genome assembly and the raw reads that were used for assembly
    • busco-short-summary.txt = metrics assessing the completeness of the polished genome assembly
    • contigs.png = graph of the contig topology and their connections in the assembly
    • contigs.txt = metrics assessing the quantity and lengths of the contigs
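
To reproduce the kind of per-read numbers plotted in reads.png (read length vs. Phred score) from the delivered .fastq.gz, something like the following could be used. It is a sketch under stated assumptions, not the code behind the report: it assumes standard 4-line FASTQ records with Phred+33 qualities, averages quality over error probabilities rather than raw Q values, and all function names and the file name are hypothetical.

```python
import gzip
import math
from typing import Iterable, Iterator, List, Tuple


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean read quality, averaged over per-base error probabilities."""
    if not quality:
        raise ValueError("empty quality string")
    probs = [10.0 ** (-(ord(ch) - offset) / 10.0) for ch in quality]
    return -10.0 * math.log10(sum(probs) / len(probs))


def fastq_read_stats(lines: Iterable[str]) -> Iterator[Tuple[int, float]]:
    """Yield (read length, mean Phred score) for each 4-line FASTQ record."""
    # Group the line stream into records of four lines; an incomplete
    # trailing record is silently dropped.
    records = zip(*[iter(lines)] * 4)
    for _header, seq, _plus, qual in records:
        yield len(seq.strip()), mean_phred(qual.strip())


def summarize(path: str) -> List[Tuple[int, float]]:
    """Collect per-read stats from a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return list(fastq_read_stats(handle))


# Usage (hypothetical file name; substitute the file from your own delivery):
# stats = summarize("sample.fastq.gz")
```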

 

 

Troubleshooting

Common errors and cases of low confidence basecalls

The most common error modes for Oxford Nanopore are deletions in homopolymer stretches (especially if longer than 8 bp), errors at the Dam methylation site GATC, and errors at the middle position of the Dcm methylation site CCTGG or CCAGG. You can learn more about these errors and how our bioinformatics pipelines address them in this blog post.

If you know that you need single-nucleotide accuracy in these regions of your assembly, please consider selecting the Hybrid sequencing option for bacterial and yeast genomes to polish out those errors with Illumina data.
 

Determining "complete" and "fail" sequencing status

“Complete” samples are defined by achieving either one of these two scenarios:

EITHER:

  • The target amount of raw data is produced (see section above)

OR:

  • The amount of raw data produced is below target, but
    • For the standard service, we are still able to assemble high coverage, high quality ONT-only contigs from that data
    • For the hybrid service, we are still able to assemble high coverage, high quality Illumina-polished contigs from that data

“Fail” samples are defined by achieving neither the target amount of raw data, NOR any high coverage, high quality contigs.
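
The status rule above reduces to a two-condition check. The sketch below is illustrative: the function and parameter names are hypothetical, and because this document does not give numeric thresholds for "high coverage, high quality" contigs, that judgment is passed in as a boolean.

```python
def sequencing_status(raw_data_mb: float,
                      target_mb: float,
                      has_quality_contigs: bool) -> str:
    """Classify a sample as "Complete" or "Fail".

    has_quality_contigs means ONT-only contigs for the standard service,
    or Illumina-polished contigs for the hybrid service.
    """
    if raw_data_mb >= target_mb or has_quality_contigs:
        return "Complete"
    return "Fail"


print(sequencing_status(210, 210, False))  # target met
print(sequencing_status(120, 210, True))   # below target, but good contigs
print(sequencing_status(120, 210, False))  # neither condition met
```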
 

Troubleshooting failed sequencing

Although we do not provide definitive reasons why a specific sample failed (or had low coverage or otherwise poor results), by far the most common reasons are:

  • Your samples did not have enough cells or the required DNA concentration of 50 ng/μL.
    • The most common cause of this is using a Nanodrop to quantify DNA concentration. We strongly recommend using a Qubit or equivalent fluorometric assay.
  • The gDNA in your samples is degraded or fragmented.
    • At least 50% of the DNA should be longer than 15 kb, and samples should be handled with the utmost care:
      • Pipetting with wide-bore tips
      • Minimal freeze/thaw cycles
      • No vortexing
      • No extreme temperature/pH
      • No intercalating dyes
      • No UV radiation
      • Not over-dried
  • Your samples contain inhibitors, such as:
    • RNA
    • Denaturants (guanidinium salts, phenol, etc.)
    • Detergents (SDS, Triton-X100, etc.)
    • Residual contaminants from the organism/tissue (heme, humic acid, polyphenols, polysaccharides, lipids, etc.)
    • Insoluble, colored, or cloudy material
    • Other inhibitors (EDTA, etc.)
  • The DNA you sent is not from a single isolate.
    • This service is intended for a clonal population (single species). If your sample contains a mixture of different species, it may fail to produce an assembly.
  • For the extraction option: You did not ship us the required number of cells.
    • Please perform a cell count while preparing your preserved cells to confirm that you are sending the number of cells we require.

See sample prep for detailed instructions on how to submit either extracted gDNA or cells for Whole Genome Sequencing. 

 

Guarantees and rerun policy

We do not provide any guarantees for this service, as our ability to deliver both our internal target amount of raw sequencing data and a high quality, high coverage genome assembly is directly dependent on the quantity, quality, and purity of the samples that are sent to us. 

In cases of failure, we will evaluate the results of the initial sequencing attempt to determine whether additional sequencing may produce a more successful outcome, and if so we will repeat the sequencing (with possible protocol adjustments) at no additional charge. We will also combine the data from the two runs together to increase chances of success on the repeat attempt.

If we are not able to achieve the data target or a high quality assembly outcome after the free rerun, we will not perform further reruns on the sample. We do still charge for failed samples, since we spend more time and resources on them than we do on successes.

If you wish to sequence the sample again, please prepare new samples that meet all our QC requirements before submitting a new sequencing request.