{"id":7246,"date":"2020-03-23T12:40:48","date_gmt":"2020-03-23T12:40:48","guid":{"rendered":"https:\/\/www.kolabtree.com\/blog\/?p=7246"},"modified":"2023-04-18T11:12:54","modified_gmt":"2023-04-18T11:12:54","slug":"a-step-by-step-guide-to-dna-sequencing-data-analysis","status":"publish","type":"post","link":"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/","title":{"rendered":"A Step-By-Step Guide to DNA Sequencing Data Analysis"},"content":{"rendered":"<p><em><span style=\"font-weight: 300;\">Dr. 
Javier Quilez Oliete, an experienced <a href=\"https:\/\/www.kolabtree.com\/find-an-expert\/subject\/bioinformatics\" target=\"_blank\" rel=\"noopener\">freelance bioinformatics consultant<\/a> on Kolabtree, provides a comprehensive guide to DNA sequencing data analysis, including tools and software used to read data.<\/span><\/em><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Introduction\"><\/span><b>Introduction<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 300;\">Deoxyribonucleic acid (DNA) is the molecule that carries most of the genetic information <\/span><span style=\"font-weight: 300;\">of an organism<\/span><span style=\"font-weight: 300;\">. (In some types of virus, genetic information is carried by ribonucleic acid (RNA).) Nucleotides (conventionally represented by the letters A, C, G or T) are the basic units of DNA molecules. Conceptually, <a href=\"https:\/\/www.kolabtree.com\/find-an-expert\/subject\/dna-sequencing?utm_source=Blog&amp;utm_medium=Post&amp;utm_campaign=DNASeqGuide\">DNA sequencing<\/a> is the process of reading the nucleotides that comprise a DNA molecule (e.g. \u201cGCAAACCAAT\u201d is a 10-nucleotide DNA string). Current sequencing technologies produce millions of such DNA reads <\/span><span style=\"font-weight: 300;\">in a reasonable time and at a relatively low cost. As a reference, the cost of sequencing a human genome &#8211; a genome is the complete set of DNA molecules in an organism &#8211; has broken the <\/span><a href=\"https:\/\/www.technologyreview.com\/s\/615289\/china-bgi-100-dollar-genome\/\"><span style=\"font-weight: 300;\">$100 barrier<\/span><\/a><span style=\"font-weight: 300;\"> and it can be done in a matter of days. 
This contrasts with the first initiative to sequence the <\/span><a href=\"https:\/\/www.nature.com\/articles\/35057062\"><span style=\"font-weight: 300;\">human genome<\/span><\/a><span style=\"font-weight: 300;\">, which was completed in a decade and cost about $2.7 billion.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">This capability to sequence DNA at high throughput and low cost has enabled the development of a growing number of sequencing-based methods and applications. For example, sequencing entire genomes or their protein-coding regions (two approaches known respectively as whole genome and exome sequencing) in diseased and healthy individuals can hint at disease-causing DNA alterations. Also, the sequencing of the RNA that is transcribed from DNA, a technique known as RNA-sequencing, is used to quantify gene activity and how this changes in different conditions (e.g. untreated versus treated). In turn, chromosome conformation capture sequencing methods detect interactions between nearby DNA molecules and thus help to determine the spatial distribution of chromosomes within the cell.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">Common to these and other applications of DNA sequencing is the generation of datasets on the order of gigabytes, comprising millions of read sequences. Therefore, making sense of high-throughput sequencing (HTS) experiments requires substantial data analysis capabilities. Fortunately, dedicated computational and statistical tools and relatively standard analysis workflows exist for most HTS data types. While some of the (initial) analysis steps are common to most sequencing data types, further downstream analysis will depend on the kind of data and\/or the ultimate goal of the analysis. 
Below I provide a primer on the fundamental steps in the analysis of HTS data and I refer to popular tools.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">Some of the sections below are focused on the analysis of data generated from short-read sequencing technologies (mostly <\/span><a href=\"https:\/\/www.illumina.com\/\"><span style=\"font-weight: 300;\">Illumina<\/span><\/a><span style=\"font-weight: 300;\">), as these have historically dominated the HTS market. However, newer technologies that generate longer reads (e.g. <\/span><a href=\"https:\/\/nanoporetech.com\/\"><span style=\"font-weight: 300;\">Oxford Nanopore Technologies<\/span><\/a><span style=\"font-weight: 300;\">, <\/span><a href=\"https:\/\/www.pacb.com\/\"><span style=\"font-weight: 300;\">PacBio<\/span><\/a><span style=\"font-weight: 300;\">) are gaining ground rapidly. As long-read sequencing has some particularities (e.g. higher error rates), specific tools are being developed for the analysis of this sort of data.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Quality_control_QC_of_raw_reads\"><\/span><b>Quality control (QC) of raw reads<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 300;\">The eager analyst will start the analysis from FASTQ files; the <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/FASTQ_format\"><span style=\"font-weight: 300;\">FASTQ format<\/span><\/a><span style=\"font-weight: 300;\"> has long been the standard for storing short-read sequencing data. In essence, FASTQ files contain the nucleotide sequence and the per-base<\/span><span style=\"font-weight: 300;\"> calling quality for millions of reads. Although file size will depend on the actual number of reads, FASTQ files are typically large (on the order of megabytes to gigabytes) and compressed. 
Of note, most tools that use FASTQ files as input can handle them in compressed format so, in order to save disk space, it is recommended not to uncompress them. As a convention, here I will equate a FASTQ file to a sequencing sample.<\/span><\/p>\n<p><a href=\"https:\/\/www.bioinformatics.babraham.ac.uk\/projects\/fastqc\/\"><span style=\"font-weight: 300;\">FastQC<\/span><\/a><span style=\"font-weight: 300;\"> is likely the most popular tool to carry out the QC of the raw reads. It can be run through a visual interface or programmatically. While the first option may be more convenient for users who do not feel comfortable with the command-line environment, the latter offers incomparable scalability and reproducibility (think of how tedious and error-prone it can be to manually run the tool for tens of files). Either way, the main output of FastQC is an <\/span><a href=\"https:\/\/www.bioinformatics.babraham.ac.uk\/projects\/fastqc\/good_sequence_short_fastqc.html\"><span style=\"font-weight: 300;\">HTML file<\/span><\/a><span style=\"font-weight: 300;\"> reporting key summary statistics about the overall quality of the raw sequencing reads from a given sample. Inspecting tens of FastQC reports one by one is tedious and it complicates the comparison across samples. Therefore, you may want to use <\/span><a href=\"https:\/\/multiqc.info\/\"><span style=\"font-weight: 300;\">MultiQC<\/span><\/a><span style=\"font-weight: 300;\">, which aggregates the HTML reports from FastQC (as well as from other tools used downstream, e.g. 
adapter trimming, alignment) into a single report<\/span><span style=\"font-weight: 300;\">.<\/span><\/p>\n<div id=\"attachment_7265\" style=\"width: 712px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-7265\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-7265 size-large\" src=\"https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/MultiQC-1024x576.png\" alt=\"\" width=\"702\" height=\"395\" srcset=\"https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/MultiQC-1024x576.png 1024w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/MultiQC-300x169.png 300w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/MultiQC-768x432.png 768w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/MultiQC-1536x864.png 1536w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/MultiQC-1080x608.png 1080w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/MultiQC.png 1600w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/MultiQC-300x169@2x.png 600w\" sizes=\"(max-width: 702px) 100vw, 702px\" \/><p id=\"caption-attachment-7265\" class=\"wp-caption-text\">MultiQC<\/p><\/div>\n<p><span style=\"font-weight: 300;\">QC information is intended to allow the user to judge whether samples have good quality and can be therefore used for the subsequent steps or they need to be discarded. Unfortunately, there is not a consensus threshold based on the FastQC metrics to classify samples as of good or bad quality. The approach that I use is the following. I expect all samples that have gone through the same procedure (e.g. DNA extraction, library preparation) to have similar quality statistics and a majority of \u201cpass\u201d flags. If some samples have lower-than-average quality, I will still use them in the downstream analysis bearing this in mind. 
On the other hand, if all samples in the experiment systematically get \u201cwarning\u201d or \u201cfail\u201d flags in multiple metrics (see <\/span><a href=\"https:\/\/www.bioinformatics.babraham.ac.uk\/projects\/fastqc\/bad_sequence_fastqc.html\"><span style=\"font-weight: 300;\">this example<\/span><\/a><span style=\"font-weight: 300;\">), I suspect that something went wrong in the experiment (e.g. bad DNA quality, library preparation, etc.) and I recommend repeating it.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Read_trimming\"><\/span><b>Read trimming<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 300;\">QC of raw reads helps to identify problematic samples but it does not improve the actual quality of the reads. To do so, we need to trim reads to remove technical sequences and low-quality ends.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">Technical sequences are leftovers from the experimental procedure (e.g. sequencing adapters). If such sequences are adjacent to the true sequence of the read, alignment (see below) may map reads to the wrong position in the genome or decrease the confidence in a given alignment. Besides technical sequences, we may also want to remove sequences of biological origin if these are highly abundant among the reads. For instance, suboptimal DNA preparation procedures may leave a high proportion of DNA-converted ribosomal RNA (rRNA) in the sample. Unless this type of nucleic acid is the target of the sequencing experiment, keeping reads derived from rRNA will just increase the computational burden of the downstream steps and may confound the results. 
Of note, if the levels of technical sequences, rRNA or other contaminant are very high, which will probably have been already highlighted by the QC, you may want to discard the whole sequencing sample.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">In short-read sequencing, the DNA sequence is determined one nucleotide at a time (technically, one nucleotide every sequencing cycle). In other words, the number of sequencing cycles determines read length. A known issue of HTS sequencing methods is the decay of the accuracy with which nucleotides are determined as sequencing cycles accumulate. This is reflected in an overall decrease of the per-base calling quality especially towards the end of the read. As happens with technical sequences, trying to align reads that contain low-quality ends can lead to misplacement or poor mapping quality.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">To remove technical\/contaminant sequences and low-quality ends, read trimming tools like <\/span><a href=\"http:\/\/www.usadellab.org\/cms\/?page=trimmomatic\"><span style=\"font-weight: 300;\">Trimmomatic<\/span><\/a><span style=\"font-weight: 300;\"> and <\/span><a href=\"https:\/\/cutadapt.readthedocs.io\/en\/stable\/\"><span style=\"font-weight: 300;\">Cutadapt<\/span><\/a><span style=\"font-weight: 300;\"> exist and are widely used. In essence, such tools will remove technical sequences (internally available and\/or provided by the user) and trim reads based on quality while maximizing read length. Reads that are left too short after the trimming are discarded (reads excessively short, e.g. &lt;36 nucleotides, complicate the alignment step as these will likely map to multiple sites in the genome). 
You may want to look at the percentage of reads that survive the trimming, as a high rate of discarded reads is likely a sign of bad-quality data.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">Finally, I typically re-run FastQC on the trimmed reads to check that this step was effective and systematically improved the QC metrics.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Alignment\"><\/span><b>Alignment<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 300;\">With exceptions (e.g. <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/De_novo_sequence_assemblers\"><span style=\"font-weight: 300;\">de novo assembly<\/span><\/a><span style=\"font-weight: 300;\">), alignment (also referred to as mapping) is typically the next step for most HTS data types and applications. Read alignment consists of determining the position in the genome from which the sequence of the read derived (typically expressed as chromosome:start-end). Hence, at this step we require a reference sequence to align\/map the reads to.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">The choice of the reference sequence will be determined by multiple factors. One is the species from which the sequenced DNA is derived. While the number of species with a high-quality reference sequence available is increasing, this may still not be the case for some less studied organisms. In those cases, you may want to align reads to an evolutionarily close species for which a reference genome is available. For instance, as there is no reference sequence for the genome of the coyote, we can use that of the closely related dog for the read alignment. Similarly, we may still want to align our reads to a closely related species for which a higher-quality reference sequence exists. 
For example, while the genome of the gibbon has been <\/span><a href=\"https:\/\/www.nature.com\/articles\/nature13679\"><span style=\"font-weight: 300;\">published<\/span><\/a><span style=\"font-weight: 300;\">, the assembly is broken into thousands of fragments that do not fully recapitulate the organization of that genome into tens of chromosomes; in that case, carrying out the alignment using the human reference sequence may be beneficial.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">Another factor to consider is the version of the reference sequence assembly, since new versions are released as the sequence is updated and improved. Importantly, the coordinates of a given alignment can vary between versions. For instance, multiple versions of the human genome can be found in the <\/span><a href=\"https:\/\/genome.ucsc.edu\/cgi-bin\/hgGateway?redirect=manual&amp;source=genome.ucsc.edu\"><span style=\"font-weight: 300;\">UCSC Genome Browser<\/span><\/a><span style=\"font-weight: 300;\">. In any species, I strongly favor migrating to the newest assembly version once that is fully released. This may cause some nuisance during the transition, as already existing results will be relative to older versions, but it pays off in the long run.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">The type of sequencing data also matters. Reads generated from DNA-seq, ChIP-seq or Hi-C protocols will be aligned to the genome reference sequence. On the other hand, as RNA transcribed from DNA is further processed into mRNA (i.e. introns removed), many RNA-seq reads (those spanning exon-exon junctions) will fail to align contiguously to a genome reference sequence. Instead, we need to either align them to transcriptome reference sequences or use splice-aware aligners (see below) when using the genome sequence as a reference. Related to this is the choice of source for the annotation of the reference sequence, that is, the database with the coordinates of the genes, transcripts, centromeres, etc. 
I typically use the <\/span><a href=\"https:\/\/www.gencodegenes.org\/human\/\"><span style=\"font-weight: 300;\">GENCODE annotation<\/span><\/a><span style=\"font-weight: 300;\"> as it combines comprehensive gene annotation and transcript sequences.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">A long list of short-read sequence alignment tools has been developed (see the Short-read sequence alignment section <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_sequence_alignment_software\"><span style=\"font-weight: 300;\">here<\/span><\/a><span style=\"font-weight: 300;\">). Reviewing them is beyond the scope of this article (details about the algorithms behind these tools can be found <\/span><a href=\"https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC5425171\/\"><span style=\"font-weight: 300;\">here<\/span><\/a><span style=\"font-weight: 300;\">). In my experience, among the most popular are <\/span><a href=\"http:\/\/bowtie-bio.sourceforge.net\/bowtie2\/index.shtml\"><span style=\"font-weight: 300;\">Bowtie2<\/span><\/a><span style=\"font-weight: 300;\">, <\/span><a href=\"http:\/\/bio-bwa.sourceforge.net\/\"><span style=\"font-weight: 300;\">BWA<\/span><\/a><span style=\"font-weight: 300;\">, <\/span><a href=\"http:\/\/daehwankimlab.github.io\/hisat2\/\"><span style=\"font-weight: 300;\">HISAT2<\/span><\/a><span style=\"font-weight: 300;\">, <\/span><a href=\"https:\/\/github.com\/lh3\/minimap2\"><span style=\"font-weight: 300;\">Minimap2<\/span><\/a><span style=\"font-weight: 300;\">, <\/span><a href=\"https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC3530905\/\"><span style=\"font-weight: 300;\">STAR<\/span><\/a><span style=\"font-weight: 300;\"> and <\/span><a href=\"http:\/\/ccb.jhu.edu\/software\/tophat\/index.shtml\"><span style=\"font-weight: 300;\">TopHat<\/span><\/a><span style=\"font-weight: 300;\">. 
My recommendation is that you choose your aligner considering key factors like the type of HTS data<\/span><span style=\"font-weight: 300;\"> and application as well as acceptance by the community, quality of the documentation and number of users. For example, one needs aligners like STAR or HISAT2 that are aware of exon-exon junctions when mapping RNA-seq reads to the genome.<\/span><\/p>\n<p><span style=\"font-weight: 300;\">Common to most mappers is the need to index the sequence used as reference before the actual alignment takes place. This step may be time-consuming but it only needs to be done once for each reference sequence. Most mappers will store alignments in SAM\/BAM files, which follow the <\/span><a href=\"https:\/\/samtools.github.io\/hts-specs\/SAMv1.pdf\"><span style=\"font-weight: 300;\">SAM\/BAM format<\/span><\/a><span style=\"font-weight: 300;\"> (BAM files are binary versions of SAM files). The alignment is among the most computationally intensive and time-consuming steps in the analysis of sequencing data and SAM\/BAM files are large (on the order of gigabytes). Therefore, it is important to make sure that you have the required resources (see the final section below) to run the alignment in a reasonable time and store the results. 
Similarly, due to the size and binary format of BAM files, avoid opening them with text editors; instead use Unix commands or dedicated tools like <\/span><a href=\"http:\/\/www.htslib.org\/\"><span style=\"font-weight: 300;\">SAMtools<\/span><\/a><span style=\"font-weight: 300;\">.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"From_the_alignments\"><\/span><b>From the alignments<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 300;\">I would say that there is not a clear common step after the alignment, since at this point is where each HTS data type and application may differ.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 300;\">A common downstream analysis for DNA-seq data is variant calling, that is, the identification of positions in the genome that vary relative to the genome reference and between individuals. A popular analysis framework for this application is <\/span><a href=\"https:\/\/gatk.broadinstitute.org\/hc\/en-us\"><span style=\"font-weight: 300;\">GATK<\/span><\/a><span style=\"font-weight: 300;\"> for single nucleotide polymorphism (SNP) or small insertions\/deletions (indels) (<\/span><b>Figure 2<\/b><span style=\"font-weight: 300;\">). Variants comprising larger chunks of DNA (also referred to as structural variants) require dedicated calling methods (see <\/span><a href=\"https:\/\/genomebiology.biomedcentral.com\/articles\/10.1186\/s13059-019-1720-5\"><span style=\"font-weight: 300;\">this article<\/span><\/a><span style=\"font-weight: 300;\"> for a comprehensive comparison). 
As with the aligners, I advise selecting the right tool considering key factors like the sort of variants (SNP, indel or structural variants), acceptance by the community, quality of the documentation and number of users.<\/span><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-7262 size-large\" src=\"https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/gatk-1024x576.png\" alt=\"\" width=\"702\" height=\"395\" srcset=\"https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/gatk-1024x576.png 1024w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/gatk-300x169.png 300w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/gatk-768x432.png 768w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/gatk-1536x864.png 1536w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/gatk-1080x608.png 1080w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/gatk.png 1600w, https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/gatk-300x169@2x.png 600w\" sizes=\"(max-width: 702px) 100vw, 702px\" \/><\/p>\n<p><span style=\"font-weight: 300;\">Probably the most frequent application of RNA-seq is gene expression quantification. Historically, reads needed to be aligned to the reference sequence and then the number of reads aligned to a given gene or transcript was used as a proxy to quantify its expression levels. This alignment+quantification approach is performed by tools like <\/span><a href=\"http:\/\/cole-trapnell-lab.github.io\/cufflinks\/manual\/\"><span style=\"font-weight: 300;\">Cufflinks<\/span><\/a><span style=\"font-weight: 300;\">, <\/span><a href=\"https:\/\/github.com\/deweylab\/RSEM\"><span style=\"font-weight: 300;\">RSEM<\/span><\/a><span style=\"font-weight: 300;\"> or <\/span><a href=\"http:\/\/subread.sourceforge.net\/\"><span style=\"font-weight: 300;\">featureCounts<\/span><\/a><span style=\"font-weight: 300;\">. 
However, such an approach has been increasingly surpassed by newer methods implemented in software like <\/span><a href=\"https:\/\/pachterlab.github.io\/kallisto\/\"><span style=\"font-weight: 300;\">Kallisto<\/span><\/a><span style=\"font-weight: 300;\"> and <\/span><a href=\"https:\/\/combine-lab.github.io\/salmon\/\"><span style=\"font-weight: 300;\">Salmon<\/span><\/a><span style=\"font-weight: 300;\">. Conceptually, with such tools the full sequence of a read does not need to be aligned to the reference sequence. Instead, we only need to align enough nucleotides to be confident that a read originated from a given transcript. Put simply, the alignment+quantification approach is reduced to a single step. This approach is known as pseudo-mapping (or pseudoalignment) and greatly increases the speed of the gene expression quantification. On the other hand, keep in mind that pseudo-mapping will not be suitable for applications where the full alignment is needed (e.g. variant calling from RNA-seq data).<\/span><\/p>\n<p><span style=\"font-weight: 300;\">Another example of the differences in the downstream analysis steps and the required tools across sequencing-based applications is ChIP-seq. Reads generated with this technique will be used for peak calling, which consists of detecting regions in the genome with a significant excess of reads that indicates where the target protein is bound. Several peak callers exist and <\/span><a href=\"https:\/\/academic.oup.com\/bib\/article\/18\/3\/441\/2453291\"><span style=\"font-weight: 300;\">this publication<\/span><\/a><span style=\"font-weight: 300;\"> surveys them. As a final example I will mention Hi-C data, in which alignments are used as input for tools that determine the interaction matrices and, from these, the 3D features of the genome. 
Commenting on all the sequencing-based assays is beyond the scope of this article (for a relatively complete list see <\/span><a href=\"https:\/\/liorpachter.wordpress.com\/seq\/\"><span style=\"font-weight: 300;\">this article<\/span><\/a><span style=\"font-weight: 300;\">).<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Before_you_start%E2%80%A6\"><\/span><b>Before you start&#8230;<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 300;\">The remaining part of this article touches on aspects that may not strictly be considered steps in the analysis of HTS data and that are largely ignored. I argue, however, that it is essential that you think about the questions posed in <\/span><b>Table 1<\/b><span style=\"font-weight: 300;\"> before you start analyzing HTS data (or indeed any kind of data), and I have written on these topics <\/span><a href=\"https:\/\/www.slideshare.net\/slideshow\/embed_code\/key\/vwyxcqSsQTYBhl\"><span style=\"font-weight: 300;\">here<\/span><\/a><span style=\"font-weight: 300;\"> and <\/span><a href=\"https:\/\/academic.oup.com\/gigascience\/article\/6\/11\/gix100\/4557140\"><span style=\"font-weight: 300;\">here<\/span><\/a><span style=\"font-weight: 300;\">.<\/span><\/p>\n<p><b>Table 1<\/b><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Think about it<\/b><\/td>\n<td><b>Proposed action<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 300;\">Do you have all the information about your sample that is needed for the analysis?<\/span><\/td>\n<td><span style=\"font-weight: 300;\">Systematically collect the metadata of the experiments<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 300;\">Will you be able to unequivocally identify your sample?<\/span><\/td>\n<td><span style=\"font-weight: 300;\">Establish a system to assign each sample a unique identifier<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 300;\">Where will data and results be?<\/span><\/td>\n<td><span style=\"font-weight: 
300;\">Adopt a structured and hierarchical organization of the data<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 300;\">Will you be able to process multiple samples seamlessly?<\/span><\/td>\n<td><span style=\"font-weight: 300;\">Make the code scalable, parallelizable, automatically configurable and modular<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 300;\">Will you or anybody else be able to reproduce the results?<\/span><\/td>\n<td><span style=\"font-weight: 300;\">Document your code and procedures!<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 300;\">As mentioned above, HTS raw data and some of the files generated during their analysis are on the order of gigabytes, so it is not exceptional that a project including tens of samples requires terabytes of storage. Besides, some steps in the analysis of HTS data are computationally intensive (e.g. alignment). The storage and computing infrastructure required for analyzing HTS data is therefore an important consideration, yet it is often overlooked or not discussed. As an example, as part of a recent analysis, we reviewed tens of published papers performing phenome-wide association analysis (PheWAS). Modern PheWAS analyze hundreds to thousands of both genetic variants and phenotypes, which entails substantial data storage and computing requirements. And yet, virtually none of the papers we reviewed commented on the infrastructure needed for the PheWAS analysis. Not surprisingly, my recommendation is that you plan upfront the storage and computing requirements that you will face and share them with the community.<\/span><\/p>\n<p><strong>Need help with analyzing DNA sequencing data? 
Get in touch with <a href=\"https:\/\/www.kolabtree.com\/find-an-expert\/subject\/bioinformatics?utm_source=Blog&amp;utm_medium=Post&amp;utm_campaign=DNASeqGuide\">freelance bioinformatics specialist<\/a> and <a href=\"https:\/\/www.kolabtree.com\/find-an-expert\/subject\/genomics\">genomics experts<\/a> on Kolabtree.\u00a0<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Dr. Javier Quilez Oliete, an experienced freelance bioinformatics consultant on Kolabtree, provides a comprehensive guide to DNA sequencing data analysis, including tools and software used to read data.\u00a0 Introduction Deoxyribonucleic acid (DNA) is the molecule that carries most of the genetic information of an organism. (In some types of virus, genetic information is carried by<\/p>\n<div class=\"read-more\"><a href=\"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/\" title=\"Read More\">Read More<\/a><\/div>\n","protected":false},"author":12,"featured_media":7266,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[442,398,435],"tags":[755,754],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v20.1 (Yoast SEO v20.1) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>A Step-By-Step Guide to DNA Sequencing Data Analysis<\/title>\n<meta name=\"description\" content=\"An expert guide to DNA sequencing data analysis, including tools used for reading raw data, trimming reads and quality control.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A Step-By-Step Guide to DNA Sequencing Data Analysis\" 
\/>\n<meta property=\"og:description\" content=\"An expert guide to DNA sequencing data analysis, including tools used for reading raw data, trimming reads and quality control.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/\" \/>\n<meta property=\"og:site_name\" content=\"The Kolabtree Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/kolabtree\" \/>\n<meta property=\"article:published_time\" content=\"2020-03-23T12:40:48+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-04-18T11:12:54+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/dna-sequencing-data-analysis-guide.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1350\" \/>\n\t<meta property=\"og:image:height\" content=\"900\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Ramya Sriram\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@kolabtree\" \/>\n<meta name=\"twitter:site\" content=\"@kolabtree\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Ramya Sriram\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"A Step-By-Step Guide to DNA Sequencing Data Analysis","description":"An expert guide to DNA sequencing data analysis, including tools used for reading raw data, trimming reads and quality control.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/","og_locale":"en_US","og_type":"article","og_title":"A Step-By-Step Guide to DNA Sequencing Data Analysis","og_description":"An expert guide to DNA sequencing data analysis, including tools used for reading raw data, trimming reads and quality control.","og_url":"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/","og_site_name":"The Kolabtree Blog","article_publisher":"https:\/\/www.facebook.com\/kolabtree","article_published_time":"2020-03-23T12:40:48+00:00","article_modified_time":"2023-04-18T11:12:54+00:00","og_image":[{"width":1350,"height":900,"url":"https:\/\/www.kolabtree.com\/blog\/wp-content\/uploads\/2020\/03\/dna-sequencing-data-analysis-guide.jpg","type":"image\/jpeg"}],"author":"Ramya Sriram","twitter_card":"summary_large_image","twitter_creator":"@kolabtree","twitter_site":"@kolabtree","twitter_misc":{"Written by":"Ramya Sriram","Est. 
reading time":"13 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/#article","isPartOf":{"@id":"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/"},"author":{"name":"Ramya Sriram","@id":"https:\/\/www.kolabtree.com\/blog\/#\/schema\/person\/81992f5863a1b06d132a47822e7b4400"},"headline":"A Step-By-Step Guide to DNA Sequencing Data Analysis","datePublished":"2020-03-23T12:40:48+00:00","dateModified":"2023-04-18T11:12:54+00:00","mainEntityOfPage":{"@id":"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/"},"wordCount":2769,"commentCount":0,"publisher":{"@id":"https:\/\/www.kolabtree.com\/blog\/#organization"},"keywords":["DNA Sequencing Data Analysts","Freelance Bioinformatics Consultants"],"articleSection":["Biotechnology","Data Science","Research"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/","url":"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/","name":"A Step-By-Step Guide to DNA Sequencing Data Analysis","isPartOf":{"@id":"https:\/\/www.kolabtree.com\/blog\/#website"},"datePublished":"2020-03-23T12:40:48+00:00","dateModified":"2023-04-18T11:12:54+00:00","description":"An expert guide to DNA sequencing data analysis, including tools used for reading raw data, trimming reads and quality 
control.","breadcrumb":{"@id":"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.kolabtree.com\/blog\/a-step-by-step-guide-to-dna-sequencing-data-analysis\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.kolabtree.com\/blog\/"},{"@type":"ListItem","position":2,"name":"A Step-By-Step Guide to DNA Sequencing Data Analysis"}]},{"@type":"WebSite","@id":"https:\/\/www.kolabtree.com\/blog\/#website","url":"https:\/\/www.kolabtree.com\/blog\/","name":"The Kolabtree Blog","description":"Expert Views on Science, Innovation and Product Development","publisher":{"@id":"https:\/\/www.kolabtree.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.kolabtree.com\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.kolabtree.com\/blog\/#organization","name":"Kolabtree","url":"https:\/\/www.kolabtree.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.kolabtree.com\/blog\/#\/schema\/logo\/image\/","url":"","contentUrl":"","caption":"Kolabtree"},"image":{"@id":"https:\/\/www.kolabtree.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/kolabtree","https:\/\/twitter.com\/kolabtree","https:\/\/instagram.com\/kolabtree","https:\/\/www.linkedin.com\/company\/kolabtree","https:\/\/en.m.wikipedia.org\/wiki\/Kolabtree"]},{"@type":"Person","@id":"https:\/\/www.kolabtree.com\/blog\/#\/schema\/person\/81992f5863a1b06d132a47822e7b4400","name":"Ramya 
Sriram","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.kolabtree.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/8100b45c960ebbbbe420e8b3f250515f?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/8100b45c960ebbbbe420e8b3f250515f?s=96&d=mm&r=g","caption":"Ramya Sriram"},"description":"Ramya Sriram manages digital content and communications at Kolabtree (kolabtree.com), the world's largest freelancing platform for scientists. She has over a decade of experience in publishing, advertising and digital content creation.","url":"https:\/\/www.kolabtree.com\/blog\/author\/ramyas\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.kolabtree.com\/blog\/wp-json\/wp\/v2\/posts\/7246"}],"collection":[{"href":"https:\/\/www.kolabtree.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kolabtree.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kolabtree.com\/blog\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kolabtree.com\/blog\/wp-json\/wp\/v2\/comments?post=7246"}],"version-history":[{"count":8,"href":"https:\/\/www.kolabtree.com\/blog\/wp-json\/wp\/v2\/posts\/7246\/revisions"}],"predecessor-version":[{"id":10583,"href":"https:\/\/www.kolabtree.com\/blog\/wp-json\/wp\/v2\/posts\/7246\/revisions\/10583"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kolabtree.com\/blog\/wp-json\/wp\/v2\/media\/7266"}],"wp:attachment":[{"href":"https:\/\/www.kolabtree.com\/blog\/wp-json\/wp\/v2\/media?parent=7246"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kolabtree.com\/blog\/wp-json\/wp\/v2\/categories?post=7246"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kolabtree.com\/blog\/wp-json\/wp\/v2\/tags?post=7246"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}