MedGenome NGS Data Quality Control Criteria and Metrics

By Parimala Nagaraja, NGS Assistant Manager, MedGenome Inc.
NGS technology is at the forefront of biological research. NGS platforms generate vast amounts of data, running into gigabases in a single sequencing run. However, several sequencing artifacts, such as read errors (base-calling errors and small insertions/deletions), poor-quality reads, and primer/adapter contamination, are very common in NGS data. These artifacts can have significant implications for downstream analyses such as sequence assembly, single nucleotide polymorphism (SNP) identification, and gene expression studies.
Quality control (QC) metrics play an important role in minimizing errors and obtaining high-quality data for successful experimental research. MedGenome maintains strict guidelines on QC metrics in order to provide high-quality data to our customers.
QC metrics are mainly applied at three levels.
- Sample QC (DNA/RNA)
- Library QC
- Sequence QC
Sample quality control
An ideal NGS assay requires high-quality DNA/RNA. Quality is typically determined using a Tapestation/Bioanalyzer, which reports a DIN/RIN (DNA/RNA Integrity Number) value ranging from 1 to 10, where 10 indicates the highest-quality sample and 1 indicates a severely degraded, poor-quality sample.
Depending on the assay type and sample source, MedGenome provides clients with a set of sample quantity and quality guidelines. At MedGenome, all samples first undergo QC using a Qubit to determine quantity and a Tapestation/Bioanalyzer to determine quality.
Based on the QC results, samples are classified as pass, marginal, or fail. Replacement samples are typically requested for samples that fail sample QC. For marginal samples, replacement is highly recommended; if no replacement is available, the sample proceeds to library preparation after client approval.
Library QC
All in-house prepared libraries are quality checked using the Tapestation/Bioanalyzer and quantified using the Qubit. Tapestation and Bioanalyzer results are thoroughly reviewed for expected library size, adapter contamination, primer dimers, and PCR artifacts before pooling and loading onto the sequencer. MedGenome also provides sequencing support for premade libraries prepared by customers, based on project requirements. All premade libraries are likewise carefully reviewed following the MedGenome QC methodology and categorized as pass, marginal, or fail prior to sequencing. The following images show examples of good and bad library QC.



Sequencing QC
Illumina allows users to monitor runs in real time, without interfering with run performance, using software called Sequencing Analysis Viewer (SAV). The software is compatible with all HiSeq, NextSeq, MiSeq, and NovaSeq platforms. The following table describes the metrics used for sequencing QC evaluation.
Term | Definition |
---|---|
Intensity | 90th percentile of the extracted intensity for a given image (lane/tile/cycle/channel combination). Platforms that use 4-channel sequencing display 4 channels (A, C, G, and T). |
FWHM | Average full width of clusters at half maximum (approximate size in pixels). |
% Base | Percentage of clusters for which the selected base was called. |
%Q ≥ 20, %Q ≥ 30 | Percentage of bases with a Phred (Q) quality score of at least 20 or at least 30, respectively. |
Density | Density of clusters for each tile (in thousands per mm²). |
Density PF | Density of clusters passing filter for each tile (in thousands per mm²). |
Clusters | Number of clusters for each tile (in millions). |
Clusters PF | Number of clusters passing filter for each tile (in millions). (Metric shown in the image below.) |
% Pass Filter | Percentage of clusters passing the chastity filter. (Metric shown in the image below.) |
% Phasing, % Prephasing | Average rate (per cycle) at which molecules in a cluster fall behind (phasing) or run ahead (prephasing) during the run. |
% Aligned | Percentage of passing-filter clusters that aligned to the PhiX genome. |
Error Rate | Calculated error rate, as determined by the PhiX alignment. Subsequent columns show the error rate for cycles 1-35, 1-75, and 1-100. |
Yield Total | Number of bases sequenced, updated as the run progresses. (Metric shown in the image below.) |
Projected Total Yield | Number of bases expected to be sequenced by the end of the run. |
Table 1: Various terms seen in SAV and their corresponding definitions.
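Two of the metrics in Table 1 reduce to simple arithmetic, sketched below as an illustration rather than as SAV's own implementation. The Phred relationship Q = -10·log10(P) is the standard definition of a quality score. The function names, and the idea of computing % Pass Filter directly from the Clusters and Clusters PF columns, are assumptions made for this example.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q into a base-call error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert a base-call error probability into a Phred quality score."""
    return -10 * math.log10(p)


def percent_pass_filter(clusters: float, clusters_pf: float) -> float:
    """Share of clusters that passed the filter, as a percentage.

    Both arguments must use the same unit (for example, millions per tile).
    """
    if clusters <= 0:
        raise ValueError("clusters must be positive")
    return 100.0 * clusters_pf / clusters


def percent_q_at_least(scores, threshold: int) -> float:
    """Percentage of base quality scores that are >= threshold (e.g. 20 or 30)."""
    scores = list(scores)
    if not scores:
        raise ValueError("no quality scores supplied")
    return 100.0 * sum(1 for q in scores if q >= threshold) / len(scores)
```

Read this way, Q20 corresponds to an expected error of 1 in 100 base calls and Q30 to 1 in 1,000, which is why %Q ≥ 30 is the stricter of the two thresholds.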
Illumina publishes standard specifications for read output, reads passing filter, and quality scores for each flow cell type on every sequencing platform. The images below show these specifications for the various flow cells on the NovaSeq 6000.

Image source: https://www.illumina.com/systems/sequencing-platforms/novaseq/specifications.html

Image source: https://www.illumina.com/systems/sequencing-platforms/novaseq/specifications.html

Image source: https://www.illumina.com/systems/sequencing-platforms/novaseq/specifications.html
Sequencing QC also depends on the types of libraries pooled in the same lane or flow cell. When libraries prepared with the same protocol (e.g., Illumina Stranded mRNA) are pooled and sequenced together, NovaSeq runs exceed Illumina specifications. However, this ideal scenario is rarely practical for companies that provide NGS services with high throughput and fast turnaround times. Pooling libraries of different types is therefore expected to cause variability in run performance and data yield. The following images show examples of sequencing statistics obtained by pooling similar and mixed libraries.


Quality control of sequencing raw data
Raw data quality control should be the first step in data analysis for a successful study. Several publicly available tools perform quality control of raw FASTQ files. FastQC, developed by the Babraham Institute Bioinformatics Group, is one of the most popular; it reports QC parameters such as the average base quality score per read, the distribution of GC content, and the most duplicated reads.
The key parameters to check the quality of the raw sequencing data are:
- Base quality
- Nucleotide distribution
- %GC distribution
- PCR duplication
Base quality check:
A common way to visualize base quality is to plot the per-base Q-score against the sequencing cycle.
Sequencing data generated on Illumina platforms typically show median base quality scores between 35 and 40 on the Phred scale. Large variations in base quality scores (Figure 10a) usually indicate poor library quality. A sharp drop in the quality score (Figure 10b) usually indicates adapter dimer contamination or fluidics problems within the instrument. For paired-end reads, it is common to observe higher quality in the first read than in the second, because of the length of time the template has been on the instrument and the cumulative laser exposure.

Image source: Guo Y, Ye F, Sheng Q, Clark T, Samuels DC. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform. 2014 Nov;15(6):879-89. doi: 10.1093/bib/bbt069. Epub 2013 Sep 24. PMID: 24067931; PMCID: PMC4492405.

Image source: Guo Y, Ye F, Sheng Q, Clark T, Samuels DC. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform. 2014 Nov;15(6):879-89. doi: 10.1093/bib/bbt069. Epub 2013 Sep 24. PMID: 24067931; PMCID: PMC4492405.
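The Q-score versus cycle view described above can be approximated directly from a FASTQ file. The following is a minimal sketch, not FastQC's implementation: it assumes 4-line FASTQ records and Phred+33 quality encoding (the usual encoding for current Illumina output), and the function names and the tiny in-memory demo records are made up for illustration.

```python
import io
from statistics import median


def read_fastq(handle):
    """Yield (sequence, quality_string) pairs from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield seq, qual


def per_cycle_median_quality(handle, offset=33):
    """Median Phred score at each cycle (read position) across all reads."""
    per_cycle = []  # per_cycle[i] collects the scores observed at cycle i
    for _, qual in read_fastq(handle):
        for cycle, ch in enumerate(qual):
            if cycle == len(per_cycle):
                per_cycle.append([])
            per_cycle[cycle].append(ord(ch) - offset)
    return [median(scores) for scores in per_cycle]


# Hypothetical demo data: three short reads with quality falling at the 3' end.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nII5#\n"
    "@r3\nACG\n+\nIF5\n"
)
medians = per_cycle_median_quality(demo)
print(medians)
```

For a real run, the handle would come from `open("sample.fastq")` (or `gzip.open(..., "rt")` for compressed files), and the resulting list can be plotted against cycle number to look for the wide variation or sharp drop described above.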
Nucleotide distribution
This parameter is informative for whole-genome and whole-exome libraries (high diversity), but not for amplicon or RNA libraries (medium to low diversity). In an ideal sequencing run, the proportions of the four nucleotides (A, T, C, G) should remain relatively stable across all read positions (Figure 11).

Image source: Guo Y, Ye F, Sheng Q, Clark T, Samuels DC. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform. 2014 Nov;15(6):879-89. doi: 10.1093/bib/bbt069. Epub 2013 Sep 24. PMID: 24067931; PMCID: PMC4492405.
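As a rough illustration of the nucleotide distribution check, the sketch below tabulates the fraction of each base at every read position and reports how far any base drifts across positions. The function names and the "drift" summary are assumptions introduced for this example; the article does not define a numeric stability threshold, so none is applied here.

```python
from collections import Counter


def per_cycle_base_fractions(sequences):
    """Return one dict per read position with the fraction of A, C, G, and T.

    Ambiguous calls such as N count toward the total at a position but are
    not reported, so the four fractions may sum to slightly less than 1.
    """
    counts = []  # counts[i] tallies the bases seen at position i
    for seq in sequences:
        for cycle, base in enumerate(seq.upper()):
            if cycle == len(counts):
                counts.append(Counter())
            counts[cycle][base] += 1
    fractions = []
    for counter in counts:
        total = sum(counter.values())
        fractions.append({b: counter[b] / total for b in "ACGT"})
    return fractions


def max_base_drift(fractions):
    """Largest (max - min) spread of any single base's fraction across positions."""
    return max(
        max(f[b] for f in fractions) - min(f[b] for f in fractions)
        for b in "ACGT"
    )


# Hypothetical reads: a balanced set, and a set that is skewed at the last position.
balanced = per_cycle_base_fractions(["ACGT", "CGTA", "GTAC", "TACG"])
skewed = per_cycle_base_fractions(["AAAA", "AAAC"])
print(max_base_drift(balanced), max_base_drift(skewed))
```

A small drift suggests the stable profile expected of high-diversity libraries, while a large drift is normal for amplicon or RNA libraries and only a warning sign for whole-genome or whole-exome data.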
%GC distribution
GC content varies between species and between regions within a genome. For exome regions, the GC content is approximately 49-51%, whereas for human whole-genome sequencing it is approximately 38-40%. Abnormal GC content (more than 10% deviation from the normal range) may indicate contamination.
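The GC check can be sketched as follows. The expected ranges are the ones quoted above; the function names are hypothetical, and the 10% rule is interpreted here as 10 percentage points beyond the edge of the expected range, which is an assumption because the text does not say whether the deviation is absolute or relative.

```python
# Expected GC ranges (percent), taken from the paragraph above.
EXPECTED_GC_RANGE = {
    "exome": (49.0, 51.0),
    "wgs_human": (38.0, 40.0),
}


def gc_percent(sequences):
    """Overall GC percentage across reads, ignoring ambiguous bases such as N."""
    gc = total = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        total += sum(s.count(b) for b in "ACGT")
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return 100.0 * gc / total


def gc_deviation(observed, library_type):
    """Distance (in percentage points) of observed GC from the expected range."""
    low, high = EXPECTED_GC_RANGE[library_type]
    if observed < low:
        return low - observed
    if observed > high:
        return observed - high
    return 0.0


def gc_is_abnormal(observed, library_type, tolerance=10.0):
    """Flag GC content lying more than `tolerance` points outside the range."""
    return gc_deviation(observed, library_type) > tolerance
```

For example, `gc_is_abnormal(gc_percent(reads), "wgs_human")` would flag a human whole-genome sample whose reads average above 50% or below 28% GC under this interpretation.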
PCR duplication
PCR duplicates arise during library preparation, when PCR amplifies adapter-ligated fragments. PCR duplicates can bias variant calling algorithms, so most bioinformatics analysis pipelines remove them during data preprocessing. Common causes of a high PCR duplication rate include low input, over-sequencing, too many PCR cycles, low pre-enrichment/final library yields, and short library fragments.
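A quick, alignment-free estimate of duplication can be obtained by counting identical read sequences, as in the sketch below. This is only an approximation under stated assumptions: analysis pipelines typically identify duplicates from alignment coordinates after mapping rather than from raw sequence identity, and exact-match counting will miss duplicates that differ by a sequencing error. The function names are illustrative.

```python
from collections import Counter


def duplication_rate(sequences):
    """Percentage of reads that are extra copies of an already-seen sequence."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads supplied")
    duplicates = total - len(counts)  # every copy beyond the first of each sequence
    return 100.0 * duplicates / total


def most_duplicated(sequences, n=3):
    """Return the n most frequent read sequences with their counts."""
    return Counter(sequences).most_common(n)


# Hypothetical reads: one sequence seen three times, one seen twice, one unique.
reads = ["ACGT", "ACGT", "ACGT", "TTTT", "GGGG", "GGGG"]
rate = duplication_rate(reads)
top = most_duplicated(reads, n=1)
print(rate, top)
```

Tracking this figure across libraries can help tell whether a high rate follows from low input or excess PCR cycles, two of the causes listed above.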
References
https://research.medgenome.com/medgenomes-quality-control-standards-and-metrics-for-ngs-data/ MedGenome NGS Data Quality Control Criteria and Metrics