Whole Genome Technical Documentation

Technical details

Plasmidsaurus Whole Genome Sequencing allows for de novo assembly and annotation of reference quality genomes for clonal populations of bacteria, yeast, fungi, or multicellular eukaryotic species.

Sequencing

We sequence each sample with Oxford Nanopore long reads to very high depth before generating an assembly using the latest super-accurate basecalling model and polishing software:

We construct an amplification-free long-read sequencing library using the newest v14 library prep chemistry, including minimal fragmentation of the input genomic DNA in a sequence-independent manner via tagmentation.
We sequence the library with a primer-free protocol using the most accurate R10.4.1 flow cells (raw data is delivered in .fastq format).

We use the latest flow cells and chemistry kits from Oxford Nanopore, along with the latest Super Accurate basecalling model. Genome assembly accuracy is typically around Q50-60: Q50 corresponds to 99.999% accuracy (one error per 100,000 bases) and Q60 to 99.9999% accuracy (one error per 1,000,000 bases).
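
The Q-scores quoted above follow the standard Phred relationship, Q = -10 * log10(error probability). As an illustration only (the function names are ours for this example and are not part of the pipeline), the conversion can be sketched in Python:

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Per-base error probability for a Phred quality score Q."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Phred quality score for a per-base error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (50, 60):
        p = phred_to_error_rate(q)
        accuracy = 100 * (1 - p)
        print(f"Q{q}: {accuracy:.4f}% accuracy, about one error per {round(1 / p):,} bases")
```

The same helper can be used to interpret per-read or per-contig quality values reported by other tools, as long as they use the Phred scale.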

The Hybrid Oxford Nanopore + Illumina bacterial genome sequencing service polishes your long-read ONT assembly with Illumina short reads (paired-end, 2x150 bp), which can improve assembly accuracy even further.

Bioinformatics analysis and genome assembly

If sufficient coverage to meet our target is obtained, we typically see assembled contigs with ~Q60 (99.9999%) accuracy.

The steps of our pipeline for Bacterial WGS are:

Basecalling | Dorado v4.3 Super-Accurate Basecalling with default Q10 quality filtering
Quality filtering | Remove worst 5% of reads with Filtlong v0.2.1 (default parameters)
Genome size estimation | Estimate genome size using Autocycler helper functions
Downsampling | Generate multiple subsampled read sets using Autocycler with optimal coverage for each subsample
Assembly | Perform multiple assemblies using Autocycler with three different assemblers:
- Flye v2.9.6+ with parameters optimized for high quality ONT reads
- Hifiasm for generating high-quality assemblies from long reads
- Plassembler v1.8.0+ for detecting and assembling plasmids
Consensus | Remove low-depth and small contigs, then compress, cluster, trim, resolve, and clean the assemblies using Autocycler to identify the best consensus assembly from all assemblers. Rotate the assembly to start at the optimal position using dnaapler.
Polishing | Polish Flye assembly via Medaka v1.8.0 using the filtered reads
Analysis | Run several analyses:
- Annotation with Bakta v1.11
- Contig analysis with Bandage v0.8.1
- Genome completeness and contamination with CheckM v1.2.2
- Species / plasmid identification with Sourmash v4.9.4 against a custom database (GTDB rs226, RefSeq plasmids, and phage sequences)
Hybrid Option | If you select the hybrid sequencing option, we target the same deliverables listed above for your ONT-only assembly, PLUS we run the following additional steps:
- Quality Control: Short reads are processed using fastp with quality filtering (Phred ≥15, length ≥50bp, adapter trimming, poly-G trimming)
- Alignment: Quality-controlled short reads are aligned to the long-read reference assembly using bwa-mem2
- Polishing: The long-read assembly (.fna) is polished using Polypolish v0.6.0 with the aligned short reads
  - Uses --careful mode when coverage is <25x
  - This produces a new .fasta polished hybrid assembly file
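
To illustrate the coverage-dependent polishing step above, here is a minimal Python sketch that builds a `polypolish polish` argument list and adds `--careful` when short-read coverage is below 25x. It assumes coverage is estimated as total short-read bases divided by assembly length; how our pipeline measures coverage internally is not specified here, and the function name and file names are hypothetical.

```python
def polypolish_command(draft, sam_1, sam_2, short_read_bases, assembly_length,
                       careful_below=25.0):
    """Return (estimated coverage, argument list) for `polypolish polish`.

    Assumption: coverage = total short-read bases / assembly length.
    """
    coverage = short_read_bases / assembly_length
    cmd = ["polypolish", "polish"]
    if coverage < careful_below:
        # Low short-read depth: use the more conservative mode.
        cmd.append("--careful")
    # Polypolish takes the draft assembly followed by the alignment files
    # and writes the polished sequence to standard output.
    cmd += [draft, sam_1, sam_2]
    return coverage, cmd


if __name__ == "__main__":
    # Hypothetical inputs: 100 Mb of Illumina data against a 5 Mb assembly.
    coverage, cmd = polypolish_command(
        "assembly.fna", "aligned_1.sam", "aligned_2.sam",
        short_read_bases=100_000_000, assembly_length=5_000_000,
    )
    print(f"{coverage:.1f}x:", " ".join(cmd))
```

Returning the argument list rather than running the tool keeps the decision logic easy to inspect and test on its own.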

The steps of our pipeline for Yeast WGS are:

Basecalling | Dorado v4.3 Super-Accurate Basecalling with default Q10 quality filtering
Quality filtering | Remove the worst 5% of reads with Filtlong v0.2.1 (with heavy weight applied to removing low-quality reads, --mean_q_weight 10)
Assembly | Run a Hifiasm v0.25.0 assembly with parameters selected for high quality ONT reads
Analysis | Run several analyses:
- Annotation:
  - Augustus v3.5.0 (the best gene model is selected automatically based on the closest reference)
  - BLAST v2.15.0 (aligns ORFs against UniProt database v2024_04 and uses the top hit if its E-value is < 0.05)
- Contig analysis:
  - Bandage v0.8.1
- Genome completeness and contamination:
  - BUSCO v5.7.1
Hybrid Option | If you select the hybrid sequencing option, we target the same deliverables listed above for your ONT-only assembly, PLUS we run the following additional steps:
- Quality Control: Short reads are processed using fastp with quality filtering (Phred ≥15, length ≥50bp, adapter trimming, poly-G trimming)
- Alignment: Quality-controlled short reads are aligned to the long-read reference assembly using bwa-mem2
- Polishing: The long-read assembly (.fna) is polished using Polypolish v0.6.0 with the aligned short reads
  - Uses --careful mode when coverage is <25x
  - This produces a new .fasta polished hybrid assembly file
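
The BLAST annotation rule in the yeast pipeline above (use the top hit if its E-value is below 0.05) can be sketched as follows. This assumes BLAST tabular output with the default 12 columns (`-outfmt 6`), where column 11 is the E-value and column 12 the bit score; the output format actually used by our pipeline is not stated here, and the identifiers in the example are made up.

```python
import csv
import io


def top_hits(blast_tsv: str, max_evalue: float = 0.05) -> dict:
    """Map each query to its best hit (subject, evalue, bitscore).

    The best hit is the one with the lowest E-value (ties broken by the
    highest bit score); it is kept only if its E-value is below max_evalue.
    """
    best = {}
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        current = best.get(query)
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[query] = (subject, evalue, bitscore)
    return {q: hit for q, hit in best.items() if hit[1] < max_evalue}


if __name__ == "__main__":
    # Hypothetical ORF and protein identifiers, for illustration only.
    example = "\n".join([
        "orf1\tprotein_A\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t540",
        "orf1\tprotein_B\t60.0\t280\t100\t4\t5\t284\t10\t290\t2e-40\t160",
        "orf2\tprotein_C\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.8\t22.1",
    ])
    for query, (subject, evalue, bitscore) in top_hits(example).items():
        print(query, subject, evalue, bitscore)
```

In this example `orf2` receives no annotation because its only hit has an E-value above the threshold.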

The steps of our pipeline for Eukaryotic WGS are:

Basecalling | Dorado v5.0 High-Accuracy Basecalling with default Q9 quality filtering
Quality filtering | Remove the worst 5% of reads with Filtlong v0.2.1 (with heavy weight applied to removing low-quality reads, --mean_q_weight 10)
Assembly | Run a Hifiasm v0.25.0 assembly with parameters selected for high quality ONT reads
Analysis | Run several analyses:
- Annotation with funannotate2, a eukaryotic genome annotation pipeline that uses:
  - Genome cleaning and preparation
  - Gene prediction using multiple methods (Augustus, GeneMark-ES, SNAP, GlimmerHMM, etc.)
  - Automated ab-initio training of gene prediction algorithms
  - Evidence-based gene prediction using transcript and protein alignment evidence
  - Functional annotation using various databases (Pfam, dbCAN, MEROPS, SwissProt, etc.)
  - BUSCO-based quality assessment
- Contig analysis with Bandage v0.8.1
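
The quality-filtering step in the pipelines above discards the worst 5% of the read set. The Python sketch below shows the general idea in a deliberately simplified form: reads are ranked by mean quality alone and the best reads are kept until about 95% of the total bases are retained. This is not Filtlong's actual scoring, which also takes read length and other factors into account; the function names and the read representation are assumptions made for this example.

```python
import math


def mean_phred(qualities):
    """Mean quality, averaged in error-probability space and converted back."""
    mean_error = sum(10 ** (-q / 10) for q in qualities) / len(qualities)
    return -10 * math.log10(mean_error)


def keep_best_reads(reads, keep_percent=95.0):
    """Keep the highest-quality reads until keep_percent of all bases is reached.

    reads: list of (name, sequence, per-base Phred scores) tuples.
    """
    total_bases = sum(len(seq) for _, seq, _ in reads)
    target_bases = total_bases * keep_percent / 100
    ranked = sorted(reads, key=lambda r: mean_phred(r[2]), reverse=True)
    kept, kept_bases = [], 0
    for read in ranked:
        if kept_bases >= target_bases:
            break
        kept.append(read)
        kept_bases += len(read[1])
    return kept


if __name__ == "__main__":
    # Four toy reads of equal length with different uniform qualities.
    reads = [(name, "A" * 100, [q] * 100)
             for name, q in (("a", 30), ("b", 20), ("c", 10), ("d", 5))]
    print([name for name, _, _ in keep_best_reads(reads, keep_percent=75)])
```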

Note that the Eukaryotic WGS service does not offer a Hybrid option at this time. If you do need Hybrid data, please contact us at support@plasmidsaurus.com to discuss options.

Data targets and service levels

Bacterial Genome Sequencing

Service Level	Genome Size	Sequencing Platform	Raw Data Target	Sample Submission from gDNA	Sample Submission from Cells
Standard	<7 Mb	ONT	210 Mb of ONT raw data	20 μL at 50 ng/μL normalized concentration	4-6 x 10^9 cells in 500 µL Zymo 1X DNA/RNA Shield
Big	7-12 Mb	ONT	360 Mb of ONT raw data	20 μL at 50 ng/μL normalized concentration
Hybrid	<7 Mb	ONT + Illumina	210 Mb of ONT raw data + 100 Mb of Illumina raw data	30 μL at 50 ng/μL normalized concentration
Big Hybrid	7-12 Mb	ONT + Illumina	360 Mb of ONT raw data + 170 Mb of Illumina raw data	30 μL at 50 ng/μL normalized concentration
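
As a rough guide to what the raw data targets in the table above mean for depth, expected coverage can be estimated as raw data divided by genome size. A minimal sketch (the function name is ours, and actual coverage depends on read quality and filtering):

```python
def expected_coverage(raw_data_mb: float, genome_size_mb: float) -> float:
    """Approximate sequencing depth: total raw bases divided by genome size."""
    return raw_data_mb / genome_size_mb


if __name__ == "__main__":
    # ONT targets at the upper genome-size bound of each service level.
    print(expected_coverage(210, 7))    # Standard
    print(expected_coverage(360, 12))   # Big
    # A smaller genome on the same target receives proportionally more depth.
    print(expected_coverage(210, 5))
```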

Yeast Genome Sequencing

Service Level	Genome Size	Sequencing Platform	Raw Data Target	Sample Submission from gDNA	Sample Submission from Cells
Standard	<20 Mb	ONT	600 Mb of ONT raw data	20 μL at 50 ng/μL normalized concentration	50 - 100 mg cell pellet in 500 µL Zymo 1X DNA/RNA Shield
Hybrid	<20 Mb	ONT + Illumina	600 Mb of ONT raw data + 290 Mb of Illumina raw data	30 μL at 50 ng/μL normalized concentration	50 - 100 mg cell pellet in 500 µL Zymo 1X DNA/RNA Shield

Eukaryotic Genome Sequencing

Service Level	Genome Size Range (30X Coverage)	Sequencing Platform	Sample Submission from gDNA	Sample Submission from Cells
1 Gb data target	20-60 Mb	ONT	20 μL at 50 ng/μL normalized concentration	5 x 10^6 cells in 200 µL Zymo 1X DNA/RNA Shield
5 Gb data target	75-250 Mb	ONT	40 μL at 50 ng/μL normalized concentration
15 Gb data target	300-750 Mb	ONT	60 μL at 50 ng/μL normalized concentration
Full flow cell (50-100 Gb data target)	1 - 3.3 Gb	ONT	60 μL at 50 ng/μL normalized concentration
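
The genome size ranges in the table above can be expressed as a simple lookup. The sketch below is illustrative only: it returns no service level for sizes that fall outside or between the listed ranges (for example 65 Mb), since the table does not say how those cases are handled.

```python
# (minimum Mb, maximum Mb, service level), copied from the table above.
EUKARYOTIC_TIERS = [
    (20, 60, "1 Gb data target"),
    (75, 250, "5 Gb data target"),
    (300, 750, "15 Gb data target"),
    (1000, 3300, "Full flow cell (50-100 Gb data target)"),
]


def tier_for_genome(genome_size_mb: float):
    """Return the listed service level for a genome size, or None if not listed."""
    for low, high, name in EUKARYOTIC_TIERS:
        if low <= genome_size_mb <= high:
            return name
    return None


if __name__ == "__main__":
    for size in (40, 120, 65, 2500):
        print(size, "Mb ->", tier_for_genome(size))
```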

Data deliverables and file types

We provide a range of data deliverables for whole genome sequencing that can be accessed via your Dashboard.

Data deliverables for bacterial genome sequencing

Raw reads:
- (.fastq.gz file) A compressed file of all the raw ONT sequencing reads
Annotated sequence:
- (.gbff, .gbk, .embl files) ONT-polished and annotated consensus sequence of the genome in various formats.
FASTA formatted sequences:
- (.fna file) Full sequence of each contig.
- (.faa file) Amino acid sequence of each protein annotation.
- (.ffn file) DNA sequence of each annotated gene.
Annotations and species identifications:
- (.txt files) Species identifications as found by Sourmash analysis.
- (.gff file) Summary of annotations from Bakta analysis.
Summary report:
- (.html, .png files) An analytical report of the figures and key metrics for the assembly
Genome comparison tool:
- Interactive tool for comparing two bacterial genomes. See demo video here.

Data deliverables for yeast and eukaryotic genome sequencing

.fastq.gz = a compressed file of all the raw ONT sequencing reads
.fasta = polished consensus sequence of the genome (may contain multiple contigs)
.gff = gene annotations for the polished genome
.html = a summary report compiling the assembly metrics, including completeness of the assembly based on BUSCO v5.7.1 and general species identification of the contigs. The metrics summarized in the report are also delivered as discrete files:
- reads.png = histogram of all raw reads (indicating read length vs. Phred score), including coloration to distinguish reads that are retained for assembly vs. reads that are rejected
- stats.tsv = metrics assessing the quality and size of the polished genome assembly and the raw reads that were used for assembly
- busco-short-summary.txt = metrics assessing the completeness of the polished genome assembly
- contigs.png = graph of the contig topology and their connections in the assembly
- contigs.txt = metrics assessing the quantity and lengths of the contigs

Troubleshooting

Common errors and cases of low confidence basecalls

The most common error modes for Oxford Nanopore are deletions in homopolymer stretches (especially if longer than 8 bp), errors at the Dam methylation site GATC, and errors at the middle position of the Dcm methylation site CCTGG or CCAGG. You can learn more about these errors and how our bioinformatics pipelines address them in this blog post.

If you know that you need single-nucleotide accuracy in these regions of your assembly, please consider selecting the Hybrid sequencing option for bacterial and yeast genomes, which polishes out those errors with Illumina data.
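
To check whether a region of interest contains any of the error-prone contexts described above, a sequence can be scanned with a few regular expressions. This is a minimal sketch, not part of our pipeline; it treats "longer than 8 bp" as homopolymer runs of 9 or more bases, and the pattern labels are our own.

```python
import re

# Sequence contexts named above as common Oxford Nanopore error modes.
RISK_PATTERNS = {
    "homopolymer (>8 bp)": re.compile(r"A{9,}|C{9,}|G{9,}|T{9,}"),
    "Dam site (GATC)": re.compile(r"GATC"),
    "Dcm site (CCTGG/CCAGG)": re.compile(r"CC[AT]GG"),
}


def find_risky_sites(sequence: str):
    """Return sorted (start, end, label) tuples, 0-based with end exclusive."""
    seq = sequence.upper()
    hits = []
    for label, pattern in RISK_PATTERNS.items():
        for match in pattern.finditer(seq):
            hits.append((match.start(), match.end(), label))
    return sorted(hits)


if __name__ == "__main__":
    example = "ACGT" + "A" * 10 + "CGGATCTTCCAGGT"
    for start, end, label in find_risky_sites(example):
        print(start, end, label)
```

Both methylation motifs are their own reverse complements (CCTGG pairs with CCAGG), and a homopolymer on one strand is a homopolymer on the other, so scanning a single strand is sufficient.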

Determining "complete" and "fail" sequencing status

A sample is marked “Complete” if it meets either of these two conditions:

EITHER:

The target amount of raw data is produced (see section above)

OR:

The amount of raw data produced is below target, but
- For the standard service, we are still able to assemble high coverage, high quality ONT-only contigs from that data
- For the hybrid service, we are still able to assemble high coverage, high quality Illumina-polished contigs from that data

A sample is marked “Fail” if it achieves neither the target amount of raw data nor any high-coverage, high-quality contigs.
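
The status rule above reduces to a single logical OR. In the sketch below, whether high-coverage, high-quality contigs were obtained is passed in as a yes/no input, because the criteria for that judgment are not defined in this document; the function and parameter names are hypothetical.

```python
def sequencing_status(raw_data_mb: float, target_mb: float,
                      has_high_quality_contigs: bool) -> str:
    """Classify a sample as "complete" or "fail" using the rule above."""
    if raw_data_mb >= target_mb or has_high_quality_contigs:
        return "complete"
    return "fail"


if __name__ == "__main__":
    print(sequencing_status(250, 210, False))  # target met
    print(sequencing_status(150, 210, True))   # below target, good contigs
    print(sequencing_status(150, 210, False))  # neither condition met
```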

Troubleshooting failed sequencing

Although we do not provide definitive reasons why a specific sample failed (or had low coverage or otherwise poor results), by far the most common reasons are:

Your samples did not have enough cells or the required DNA concentration of 50 ng/μL.
- The most common cause of this is using a Nanodrop to quantify DNA concentration. We strongly recommend using a Qubit or equivalent fluorometric assay.
The gDNA in your samples is degraded or fragmented.
- At least 50% of the DNA should be above 15kb in length, and samples should be handled with utmost care:
  - Pipetting with wide-bore tips
  - Minimal freeze/thaw cycles
  - No vortexing
  - No extreme temperature/pH
  - No intercalating dyes
  - No UV radiation
  - Not over-dried
Your samples contain inhibitors, such as:
- RNA
- Denaturants (guanidinium salts, phenol, etc.)
- Detergents (SDS, Triton-X100, etc.)
- Residual contaminants from the organism/tissue (heme, humic acid, polyphenols, polysaccharides, lipids, etc.)
- Insoluble, colored, or cloudy material
- Other inhibitors (EDTA, etc.)
The DNA you sent is not from a single isolate.
- This service is intended for a clonal population (single species). If your sample contains a mixture of different species, it may fail to produce an assembly.
For the extraction option: You did not ship us the required number of cells
- Please perform a cell count while preparing your preserved cells to confirm that you are sending the number of cells we require.
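
For the fragment-length criterion in the list above (at least 50% of the DNA above 15 kb), the sketch below computes the fraction of total DNA contained in fragments longer than a threshold. It assumes that "50% of the DNA" is measured by mass (total bases) rather than by fragment count, and that you have a list of fragment lengths from your own sizing data; both are assumptions for this example.

```python
def mass_fraction_above(fragment_lengths_bp, threshold_bp=15_000):
    """Fraction of total bases found in fragments longer than threshold_bp."""
    total = sum(fragment_lengths_bp)
    if total == 0:
        return 0.0
    long_bases = sum(length for length in fragment_lengths_bp
                     if length > threshold_bp)
    return long_bases / total


if __name__ == "__main__":
    intact = [20_000, 30_000, 5_000, 5_000]
    fragmented = [16_000] + [4_000] * 10
    for name, lengths in (("intact", intact), ("fragmented", fragmented)):
        fraction = mass_fraction_above(lengths)
        verdict = "meets" if fraction >= 0.5 else "does not meet"
        print(f"{name}: {fraction:.0%} above 15 kb, {verdict} the criterion")
```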

See sample prep for detailed instructions on how to submit either extracted gDNA or cells for Whole Genome Sequencing.

Guarantees and rerun policy

We do not provide any guarantees for this service, as our ability to deliver both our internal target amount of raw sequencing data and a high quality, high coverage genome assembly is directly dependent on the quantity, quality, and purity of the samples that are sent to us.

In cases of failure, we will evaluate the results of the initial sequencing attempt to determine whether additional sequencing may produce a more successful outcome, and if so we will repeat the sequencing (with possible protocol adjustments) at no additional charge. We will also combine the data from the two runs together to increase chances of success on the repeat attempt.

If we are not able to achieve the data target or a high quality assembly outcome after the free rerun, we will not perform further reruns on the sample. We do still charge for failed samples, since we spend more time and resources on them than we do on successes.

If you wish to sequence the sample again, please prepare new samples that meet all our QC requirements before submitting a new sequencing request.