NGS analysis

NGS data analysis workflow:

Here we provide a general workflow for NGS data analysis with RNA-Seq, ChIP-Seq, and RRBS-Seq (Reduced Representation Bisulfite Sequencing, a cost-efficient alternative to whole-genome bisulfite sequencing).

Getting started with your own analysis on NGS data:

If you want to do your own analysis, first see HPC to request an account for yourself if you are UCI affiliated. That page also gives a brief introduction to manipulating data on Linux and explains how to transfer your data to/from HPC. Once you have obtained your account and transferred your data to HPC, you can start running software to do your analysis. (Use “module av” to see a full list of software available on HPC.) Note that HPC does not perform regular backups, so you are responsible for archiving your own data.

Computational resources required for NGS reads assembly:

NGS sequencers produce orders of magnitude more data to sift through, analyze, and share, increasing the complexity of sequencing data analysis workflows. A tightly integrated, scalable high-performance computing platform with intelligent data management is recommended. Here at UCI, we recommend using the high-performance Linux cluster HPC, which is available to the whole campus, to analyze your NGS data. If you prefer to use your own desktop/laptop, then depending on your experiments and data volume, a minimum of 8 GB of RAM is recommended.

Strategies/software involved in assembly and alignment:

In most NGS data analysis workflows (exome sequencing, RNA-Seq, ChIP-Seq, etc.), the first analysis step is to map (also called “align”) each of the short reads produced by the sequencer to a reference genome, to infer the genomic location from which the read is derived. Depending on the size of the reference genome and the total number of reads, this step can be computationally very challenging. Many open-source tools have been developed to solve this problem, and their underlying algorithms largely fall into two categories. One category is hash-based, hashing either the reads (MAQ 2008, ELAND 2007) or the reference genome (Mosaik 2008). The second category is based on the Burrows-Wheeler transform (BWT) and associated data structures, which support fast retrieval of exact or approximate string matches (BWA 2009, Bowtie 2009, Bowtie2 2012). The BWT-based methods are designed to align short reads with an exact match or a small number of mismatches; as such, their performance tends to deteriorate as reads become longer and the allowed number of mismatches increases.
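To make the second category concrete, here is a minimal, illustrative Python sketch of the Burrows-Wheeler transform itself. This is a naive rotation-sort on a toy string, not the compressed FM-index machinery that BWA and Bowtie actually use; it is only meant to show the transform those tools build on.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform: append a sentinel, sort all rotations
    of the string, and take the last column. BWT-based aligners build
    an FM-index on this column to find read matches quickly."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last):
    """Invert the transform by repeatedly prepending the last column
    to the (re-sorted) table of partial rotations."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original text is the row that ends with the sentinel.
    row = next(r for r in table if r.endswith("$"))
    return row.rstrip("$")


print(bwt("ACAACG"))            # -> GC$AAAC
print(inverse_bwt("GC$AAAC"))   # -> ACAACG
```

The transform is lossless (as the round trip shows) and tends to group identical characters together, which is what makes the compressed indexes behind BWA and Bowtie both small and fast to search.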

All of the above-mentioned software is available on HPC. Depending on your data volume, the size of your reference genome, and the time you can spend on computation, you can choose either category-one software (slower, more accurate) or category-two software (faster, less sensitive).
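For a rough picture of how the hash-based (first) category works, the following Python sketch builds a k-mer index of a toy reference and uses it for seed-and-extend lookup. The function names, parameters, and the toy sequence are hypothetical illustrations; real aligners such as MAQ and ELAND add many refinements on top of this basic idea.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash every k-mer of the reference to the positions where it
    occurs -- the core data structure of hash-based aligners."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=1):
    """Use the read's first k-mer as a seed to find candidate
    positions, then verify each candidate by counting mismatches
    over the full read length."""
    hits = []
    for pos in index.get(read[:k], []):
        candidate = reference[pos:pos + len(read)]
        if len(candidate) == len(read):
            mismatches = sum(a != b for a, b in zip(read, candidate))
            if mismatches <= max_mismatches:
                hits.append(pos)
    return hits


ref = "GATTACAGATTACCA"
idx = build_kmer_index(ref, k=4)
print(seed_and_extend("GATTAC", ref, idx, k=4))  # -> [0, 7]
```

Hashing makes seed lookup fast regardless of genome size, but the extend/verify step over every candidate is what makes this category slower than BWT-based search when seeds occur at many positions.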

At UCI, we have also developed our own read aligner, called Hobbes, which is both fast and accurate but requires relatively large memory. Hobbes also provides a free web service that lets you align your reads online, a good option if the total number of reads in your study is less than 5 million.

Integrated Bioinformatics Tools:

HPC also provides integrated tools for the analysis and comprehension of biological data. The R-based Bioconductor is open source and available to all users. CLCbio Genomic Workbench, a commercial package specifically designed to let biologists easily analyze high-throughput data, is also available on HPC for GHTF users; see the CLCbio manual for detailed instructions on how to use it to analyze your NGS data. The Genome Analysis Toolkit (GATK), developed by the Broad Institute, is also available on HPC for next-generation sequencing data analysis and human medical resequencing data analysis.