MedGenome NGS Data Quality Control Criteria and Metrics

January 5, 2023

By Parimala Nagaraja, NGS Assistant Manager, MedGenome Inc.

Next-generation sequencing (NGS) technology is at the forefront of biological research. NGS platforms generate vast amounts of data, running into gigabases in a single sequencing run. However, several sequencing artifacts, such as read errors (base-calling errors and small insertions/deletions), poor-quality reads, and primer/adapter contamination, are very common in the data obtained after sequencing. These artifacts can have significant implications for downstream analyses such as sequence assembly, single nucleotide polymorphism (SNP) identification, and gene expression studies.

Quality control (QC) metrics play an important role in minimizing errors and obtaining high-quality data for successful experimental research. MedGenome maintains strict guidelines for QC metrics in order to provide high-quality data to our customers.

QC metrics are mainly applied at three levels.

• Sample QC (DNA/RNA)
• Library QC
• Sequence QC

Sample quality control

An ideal NGS assay requires high-quality DNA/RNA. Quality is typically determined using a Tapestation/Bioanalyzer, which provides DIN/RIN (DNA/RNA Integrity Number) values ranging from 1 to 10, where 10 indicates the highest-quality sample and 1 indicates a severely degraded, poor-quality sample.

Depending on the assay type and sample source, MedGenome provides clients with a set of quantity and quality guidelines. At MedGenome, all samples are first checked using a Qubit to determine quantity and a Tapestation/Bioanalyzer to determine quality.

Based on the QC results, samples are classified as pass, marginal, or fail. Replacement samples are typically requested for samples that fail sample QC. For marginal samples, replacement is highly recommended; if no replacement is available, we proceed to library preparation after client approval.
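The pass/marginal/fail triage can be sketched in a few lines of Python. The cutoffs below (minimum amount in ng and the two integrity-number thresholds) are hypothetical placeholders chosen for illustration; they are not MedGenome's actual guidelines, which vary by assay type and sample source.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    name: str
    amount_ng: float   # total quantity, e.g. from a Qubit reading
    integrity: float   # DIN or RIN from the Tapestation/Bioanalyzer (1-10)


def classify_sample(sample, min_amount_ng=100.0, pass_integrity=7.0,
                    marginal_integrity=5.0):
    """Return 'pass', 'marginal', or 'fail'. All cutoffs are illustrative."""
    if sample.amount_ng < min_amount_ng or sample.integrity < marginal_integrity:
        return "fail"
    if sample.integrity < pass_integrity:
        return "marginal"
    return "pass"


samples = [
    SampleQC("S1", amount_ng=500.0, integrity=8.9),
    SampleQC("S2", amount_ng=250.0, integrity=6.1),
    SampleQC("S3", amount_ng=40.0, integrity=9.2),
]
for s in samples:
    print(s.name, classify_sample(s))
```

In practice each assay would carry its own set of cutoffs, so passing them as parameters keeps the classification logic separate from the guideline values.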

Library QC

All libraries prepared in-house are quality checked using the Tapestation/Bioanalyzer and quantified using the Qubit. Tapestation and Bioanalyzer results are thoroughly reviewed for expected library size, adapter contamination, primer dimers, and PCR artifacts before pooling and loading onto the sequencer. MedGenome also provides sequencing support for premade libraries prepared by customers, based on project requirements. All premade libraries are reviewed with the same MedGenome QC methodology and categorized as pass, marginal, or fail prior to sequencing. The following images show examples of good and bad library QC.

**Figure 1:** An example of good library QC. The library is of the expected size, has a single bell curve, and is free of adapter dimer contamination.

**Figure 2:** Bad library QC example 1: the library is too large.

Sequencing QC

Illumina's Sequencing Analysis Viewer (SAV) software lets users monitor runs in real time without interfering with run performance. The software is compatible with all HiSeq, NextSeq, MiSeq, and NovaSeq platforms. The following table describes the metrics used for sequencing QC evaluation.

Term	Definition
Intensity	90th-percentile extracted intensity for a given image (lane/tile/cycle/channel combination). Platforms that use 4-channel sequencing display four channels (A, C, G, and T).
FWHM	Average full width of clusters at half maximum (approximate size in pixels).
% Base	Percentage of clusters for which the selected base was called.
%Q ≥ 20, %Q ≥ 30	Percentage of bases with a Phred (Q) quality score of at least 20 or at least 30, respectively.
Density	Density of clusters for each tile (in thousands per mm²).
Density PF	Density of clusters passing filter for each tile (in thousands per mm²).
Clusters	Number of clusters for each tile (in millions).
Clusters PF	Number of clusters passing filter for each tile (in millions). (Metrics shown in the image below.)
% Pass Filter	Percentage of clusters that passed the chastity filter. (Metrics shown in the image below.)
% Phasing, % Prephasing	Average rate (per cycle) at which molecules in a cluster fall behind (phasing) or jump ahead (prephasing) during the run.
% Aligned	Percentage of passing-filter clusters that aligned to the PhiX genome.
Error Rate	Calculated error rate, as determined by the PhiX alignment. Subsequent columns show the error rate for cycles 1-35, 1-75, and 1-100.
Yield Total	Number of bases sequenced, updated as the run progresses. (Metrics shown in the image below.)
Projected Total Yield	Number of bases expected to be sequenced by the end of the run.

Table 1: Various terms seen in SAV and their corresponding definitions.
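Several of the metrics in Table 1 are simple ratios, and the Q-score thresholds follow from the standard Phred definition, Q = -10 · log10(p), where p is the probability that a base call is wrong. The sketch below is illustrative only: the function names are our own and this is not how SAV computes its values internally.

```python
def phred_to_error_prob(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def percent_q_at_least(quality_scores, threshold):
    """Percentage of bases whose Q score is >= threshold (e.g. 20 or 30)."""
    if not quality_scores:
        return 0.0
    hits = sum(1 for q in quality_scores if q >= threshold)
    return 100.0 * hits / len(quality_scores)


def percent_pass_filter(clusters, clusters_pf):
    """% Pass Filter = clusters passing filter / total clusters * 100."""
    return 100.0 * clusters_pf / clusters


# Q20 corresponds to a 1-in-100 error probability, Q30 to 1-in-1000.
print(phred_to_error_prob(20), phred_to_error_prob(30))

# Made-up Q scores for five bases, purely for illustration.
scores = [38, 37, 30, 25, 12]
print(percent_q_at_least(scores, 30), percent_q_at_least(scores, 20))
```

Because Q30 means one expected error per 1,000 bases, %Q ≥ 30 is the figure most often quoted when comparing a run against platform specifications.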

Illumina publishes standard specifications for read output, reads passing filter, and quality scores for each flow cell type on all of its sequencing platforms. The images below show these specifications for the various flow cells on the NovaSeq 6000.

**Figure 5:** NovaSeq 6000 Read Output Specifications
**Image source:** https://www.illumina.com/systems/sequencing-platforms/novaseq/specifications.html

**Figure 6:** Total reads passing filter on the NovaSeq platform
**Image source:** https://www.illumina.com/systems/sequencing-platforms/novaseq/specifications.html

**Figure 7:** Illumina Standards for Read Quality on the NovaSeq 6000
**Image source:** https://www.illumina.com/systems/sequencing-platforms/novaseq/specifications.html

Sequencing QC also depends on the types of libraries pooled in the same lane or flow cell. When libraries prepared with the same protocol (e.g., Illumina Stranded mRNA) are pooled and sequenced together, NovaSeq runs typically exceed Illumina specifications. However, this ideal scenario is uncommon for companies that provide NGS services at high throughput and with fast turnaround times. Pooling libraries of different types is therefore expected to introduce variability in run performance and data yield. The following images show examples of sequencing statistics obtained from pools of similar and mixed libraries.

**Figure 8:** SAV statistics for a run with a single library type. An example of the best statistics displayed in SAV for an S4 run performed at MedGenome: Clusters PF (%) is >85%, total yield per lane (~3.4 billion PE) exceeds Illumina specifications, and %Q ≥ 30 is over 95%.

**Figure 9:** SAV statistics for a sequencing run with mixed library types. Clusters PF (%) is ~80%, total yield per lane is ~3 billion PE, and %Q ≥ 30 ranged from 85% to 93%. Even so, this run met all Illumina standard metrics and can be classified as a “good run”.

Quality control of sequencing raw data

Quality control of the raw data should be the first step in data analysis for a successful study. Several publicly available tools exist for quality control of raw FASTQ files. FastQC, developed by the Babraham Institute Bioinformatics Group, is one of the most popular; it reports QC parameters such as the average base quality score per read, the distribution of GC content, and the most duplicated reads.

The key parameters to check the quality of the raw sequencing data are:

• Base quality
• Nucleotide distribution
• %GC distribution
• PCR duplication

Base quality check:

A common way to visualize base quality is to plot the base Q-score against the sequencing cycle.

Sequencing data generated on Illumina platforms typically show median base quality scores between 35 and 40 on the Phred scale. Large variations in base quality scores (Figure 10a) usually indicate poor library QC. A sharp drop in the quality score (Figure 10b) usually indicates adapter dimer contamination or fluidics problems within the instrument. For paired-end reads, it is common to observe higher quality in Read 1 than in Read 2, because of the length of time the template has been on the instrument and the cumulative laser exposure.

**Figure 10a:** Variation in quality scores due to poor library QC
**Image source:** Guo Y, Ye F, Sheng Q, Clark T, Samuels DC. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform. 2014 Nov;15(6):879-89. doi: 10.1093/bib/bbt069. Epub 2013 Sep 24. PMID: 24067931; PMCID: PMC4492405.

**Figure 10b:** Quality degradation due to contamination with adapter dimers.
**Image source:** Guo Y, Ye F, Sheng Q, Clark T, Samuels DC. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform. 2014 Nov;15(6):879-89. doi: 10.1093/bib/bbt069. Epub 2013 Sep 24. PMID: 24067931; PMCID: PMC4492405.
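The Q-score-versus-cycle view can be approximated directly from the quality lines of a FASTQ file. The following is a minimal sketch, assuming Phred+33 encoding (the default on current Illumina instruments); the 10-point `min_drop` cutoff used to flag a sharp drop is a hypothetical value, not a published standard.

```python
from statistics import median


def per_cycle_median_quality(quality_strings, offset=33):
    """Median Phred score at each cycle (read position).

    quality_strings: FASTQ quality lines; offset=33 assumes Phred+33.
    """
    max_len = max((len(q) for q in quality_strings), default=0)
    medians = []
    for cycle in range(max_len):
        scores = [ord(q[cycle]) - offset
                  for q in quality_strings if len(q) > cycle]
        medians.append(median(scores))
    return medians


def find_sharp_drops(medians, min_drop=10):
    """1-based cycles where the median falls by >= min_drop vs. the previous cycle."""
    return [i + 1 for i in range(1, len(medians))
            if medians[i - 1] - medians[i] >= min_drop]


# Toy quality lines: 'I' encodes Q40 and '5' encodes Q20 in Phred+33.
quals = ["IIIII5555", "IIIIH5544", "IIIII5555"]
medians = per_cycle_median_quality(quals)
print(medians)
print(find_sharp_drops(medians))
```

Tools such as FastQC report the full per-cycle distribution rather than only the median, which is what makes large cycle-to-cycle variation visible.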

Nucleotide distribution

This parameter is useful for whole-genome and whole-exome libraries (high diversity), but not for amplicon or RNA libraries (medium to low diversity). In an ideal sequencing run, the proportions of the four nucleotides (A, T, C, and G) should remain relatively stable across all cycles of the reads (Figure 11).
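Per-cycle base composition is straightforward to compute from the read sequences. The sketch below tallies the fraction of each base at every cycle and reports how far one base's fraction swings across the read; the helper names are our own, and no particular spread is assumed to be a pass/fail cutoff.

```python
from collections import Counter


def per_cycle_base_fractions(reads):
    """Fraction of A, C, G, and T at each cycle; N calls are ignored."""
    max_len = max((len(r) for r in reads), default=0)
    fractions = []
    for cycle in range(max_len):
        counts = Counter(r[cycle].upper() for r in reads if len(r) > cycle)
        total = sum(counts[b] for b in "ACGT")
        fractions.append(
            {b: (counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return fractions


def max_base_spread(fractions, base):
    """Largest minus smallest fraction of one base across all cycles."""
    values = [f[base] for f in fractions]
    return max(values) - min(values) if values else 0.0


# Toy reads in which every base appears once at every cycle.
reads = ["ACGT", "CGTA", "GTAC", "TACG"]
fractions = per_cycle_base_fractions(reads)
print(fractions[0])
print(max_base_spread(fractions, "A"))
```

A high-diversity library should give a small spread for all four bases, whereas amplicon libraries will often show large swings in the first cycles.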

%GC distribution

GC content varies between species and between regions within a genome. For human exome regions, the GC content is approximately 49-51%, whereas for human whole-genome sequencing it is approximately 38-40%. Abnormal GC content (more than 10% deviation from the normal range) may indicate contamination.
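A GC check along these lines can be sketched as follows. The expected ranges come from the figures quoted above; reading the "10% deviation" as 10 percentage points outside the expected range is our assumption, as are the function names.

```python
def gc_percent(reads):
    """Overall %GC across reads, ignoring N and other ambiguous bases."""
    gc = at = 0
    for read in reads:
        for base in read.upper():
            if base in "GC":
                gc += 1
            elif base in "AT":
                at += 1
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def gc_is_abnormal(observed, expected_range, tolerance=10.0):
    """True if observed %GC is more than `tolerance` points outside the range."""
    low, high = expected_range
    return observed < low - tolerance or observed > high + tolerance


HUMAN_EXOME_GC = (49.0, 51.0)   # approximate range quoted in the text
HUMAN_WGS_GC = (38.0, 40.0)     # approximate range quoted in the text

observed = gc_percent(["GGCC", "AATT"])
print(observed, gc_is_abnormal(observed, HUMAN_EXOME_GC))
```

A flagged sample is only a prompt for further investigation, for example by screening a subset of reads against other genomes.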

PCR duplication

PCR duplicates arise during library preparation, when adapter-ligated fragments are amplified by PCR. PCR duplicates can introduce bias into variant calling algorithms, so most bioinformatics analysis pipelines remove them during data preprocessing. Common causes of high PCR duplication include low input, over-sequencing, too many PCR cycles, low pre-enrichment/final library yields, and short library fragments.
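As a rough illustration, a duplication rate can be estimated from raw reads by counting exact sequence copies. This is a simplification: production pipelines usually identify duplicates after alignment, from mapping coordinates (for example with Picard MarkDuplicates), so the sketch below should be treated as a quick pre-alignment approximation only.

```python
from collections import Counter


def duplication_percent(reads):
    """Percentage of reads that are exact-sequence copies of another read.

    The first occurrence of each sequence counts as unique; every
    further copy counts as a duplicate.
    """
    if not reads:
        return 0.0
    counts = Counter(reads)
    duplicates = len(reads) - len(counts)
    return 100.0 * duplicates / len(reads)


# Toy example: three copies of one read plus one unique read.
reads = ["ACGT", "ACGT", "ACGT", "TTGA"]
print(duplication_percent(reads))
```

Exact matching will miss duplicates that differ by a sequencing error and will miscount genuinely independent fragments with identical sequence, which is one reason coordinate-based marking is preferred for variant calling.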

References

https://research.medgenome.com/medgenomes-quality-control-standards-and-metrics-for-ngs-data/ MedGenome NGS Data Quality Control Criteria and Metrics