RNA-Seq Workshop for the Bioinformatician
by
28 April 2014
Seasoned Bioinformaticians from the University of Pennsylvania are giving a one day RNA-Seq analysis workshop.
RNA-Seq has not been living up to its potential because the unfortunate reality is that it is still far from routine and cannot be done effectively by off-the-shelf tools. Instead serious considerations must be made to process and normalize the data in order to get anything out of the effort that could not already be obtained via microarrays.
 
In short, in order to make RNA-Seq worth the extra expense, the bioinformatician must have well-developed and specialized skills. One must delve deeply to discover some of the best kept secrets of the field. For example which aligners work as advertized and which do not. And most importantly the many issues that arise with feature quantification and normalization. The standard pipelines will show you how to normalize for read depth, feature length, and possibly GC content; however they ignore other crucial factors that can introduce extreme variance into the data, which can result in an underpowered statistical analysis.
 
This workshop will focus on mRNA analysis, with particular attention on processing the data for differential expression and differential splicing analysis, which covers the majority of use cases.
We are not going to pretend that just anybody can do this. Participants must have a working knowledge of the UNIX environment and of basic concepts of statistics in biology (such as p-values and multiple testing). This workshop is aimed at bioinformaticians, bioengineers and statisticians who may be responsible for setting up and/or running the computational pipelines that take raw data all the way through to high level analysis. We will focus on basic to advanced analysis using a UNIX command line environment to run RNA-Seq software. All software used is freely available.
We will first discuss the nature of the data and then we will learn about alignment, normalization, quantification, statistical analysis and data management. We will not take the standard push-button out-of-the-box approach to RNA-Seq analysis, but will instead look deeper into the issues that make every study a special case, with the aim of getting more out of the data than the lowest hanging fruit. In other words, we are not just going to show you how to run Tophat and Cufflinks, instead we will evaluate these tools carefully and see if they really work. We will look across the spectrum at many available methods and look closely at issues that have been largely ignored, but which have considerable impact on the power of the downstream analysis.
 
This is a practical workshop which will provide down-to-earth material and hands on experience. You will learn how to perform the analyses on compute clusters (either local or cloud).

PRE-REQUISITES:
Familiarity with Unix.
Familiarity with some command line plain text editor. E.g. emacs or vi.

 
FORMAT:
The morning session will be in lecture format. The afternoon session will involve hands-on practice on a compute cluster taking some test data through the analysis pipeline.

 
TOPICS:
Note: This is a Unix based workshop on data analysis. We will not describe in detail the process of generating the data, except to the extent that it will help clarify any analysis issues.
RNA-Seq generalities and caveats
Essentials of compute clusters and cloud computing
Study design
File formats (fasta, fastq, SAM)
Alignment (What works best as determined by rigorous benchmarking.)
Data visualization (and more) with the UCSC Genome Browser and Table Browser
Normalization and quantification
Issues with splice form reconstruction and isoform quantification
Data and analysis sharing and management