(pronounced “seek academy”)
An easy-to-use, all-in-one tutorial for bioinformatics analyses.
Keywords: RNA-Seq, ChIP-Seq, alignment, differential gene expression, peak-calling, education, tutorial, pipeline
What is SeqAcademy?
SeqAcademy is a resource for anyone – regardless of skill level – to perform bioinformatics research. It is a user-friendly, notebook-based educational pipeline for RNA-Seq and ChIP-Seq data analysis. Bioinformaticians use RNA-Seq to study gene expression and ChIP-Seq to study protein interactions within genomes.
RNA-Seq and ChIP-Seq experiments generate large amounts of data and rely on pipelines for efficient analysis. However, existing tools either perform only specific portions of the pipeline or offer a complete pipeline aimed at advanced programmers.
SeqAcademy addresses these problems by providing an easy-to-use tutorial that outlines the complete RNA-Seq and ChIP-Seq analysis workflow and requires no prior programming experience.
Created by Hussain Ather (shussainather [at] gmail [dot] com)
Who is SeqAcademy for?
SeqAcademy is for students and researchers with little to no bioinformatics experience interested in hands-on bioinformatics tutorials. Anyone will feel comfortable analyzing epigenomic and RNA-Seq data using this simple educational tool.
What does SeqAcademy teach?
This tutorial uses the HISAT2 aligner to align sample reads to a reference.
It uses quantification methods (such as Salmon for RNA-Seq and peak-calling for ChIP-Seq) to quantify expression and determine protein binding.
The output is analyzed (differential gene expression for RNA-Seq and peak analysis for ChIP-Seq), and the results are visualized.
Then it runs MultiQC to collect quality control information from the aligned reads.
The model organism for this project is yeast (Saccharomyces cerevisiae). For RNA-Seq, yeast data from euploid and aneuploid conditions are compared. For ChIP-Seq, yeast data from 3-AT-treated and untreated conditions are compared.
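As a rough illustration of the alignment and quality-control stages described above, the sketch below prints a typical command sequence for one single-end sample. The sample name, file names, and index prefix are placeholders invented for this example, and the options used in the tutorial notebooks may differ. The function only prints the commands; it does not run them.

```shell
# Print (do not run) a typical command sequence for one single-end sample.
# All file names are placeholders invented for this sketch.
plan_alignment() {
    sample=$1                        # sample name, e.g. an SRA run accession
    reference=$2                     # reference genome in FASTA format
    index="${reference%.*}_index"    # HISAT2 index prefix derived from the FASTA name

    # 1. Build the HISAT2 index from the reference genome (needed only once).
    echo "hisat2-build ${reference} ${index}"
    # 2. Align the reads: -x index prefix, -U unpaired reads, -S SAM output.
    echo "hisat2 -x ${index} -U ${sample}.fastq -S ${sample}.sam"
    # 3. Sort and index the alignments for downstream tools.
    echo "samtools sort -o ${sample}.sorted.bam ${sample}.sam"
    echo "samtools index ${sample}.sorted.bam"
    # 4. Summarize the quality-control logs found in the current directory.
    echo "multiqc ."
}

plan=$(plan_alignment sample1 yeast_genome.fa)
echo "$plan"
```

Printing the plan first makes it easy to read each step before anything is executed. Piping the output to sh would run the commands, assuming the tools from the conda environment described below are installed and the input files exist.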
How do I use SeqAcademy?
- Identify and open the terminal emulator program on your computer. Mac and Linux systems come with Terminal installed, and Windows systems come with Command Prompt. If there isn’t one installed, download one online.
- Type pwd and press enter. This command shows your current working directory. Typing commands and pressing enter will be the primary way of running commands in this tutorial. Type ls to display the directories and files in the current directory.
- If you’d like to use the tutorial in this current working directory, skip to step 5. Otherwise, you may make a new directory or move to another one. To make a new directory, run mkdir DIRECTORY, in which DIRECTORY is the name of the directory you’d like to make. To move to another directory, run cd DIRECTORY, in which DIRECTORY is the name of the directory you’d like to move to. To move up a directory, run
cd ..
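The navigation commands from these first steps can be tried together in a throwaway directory. This is only a practice sketch: the directory name seqacademy_practice is made up for the illustration, and the variables exist only to record where each command leaves you.

```shell
# Work inside a temporary directory so none of your real files are touched.
start=$(pwd)                  # remember where we began
scratch=$(mktemp -d)          # create a temporary directory and store its path
cd "$scratch"                 # move into it

pwd                           # print the current working directory
mkdir seqacademy_practice     # make a new directory (made-up name)
ls                            # list the contents: shows seqacademy_practice
cd seqacademy_practice        # move into the new directory
inside=$(pwd)                 # record the location after moving in
cd ..                         # move back up one directory
outside=$(pwd)                # record the location after moving up

cd "$start"                   # return to the directory we began in
```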
- Given the disk space and RAM requirements, it’s likely you’ll want to use a virtual machine for this tutorial. To connect to a virtual machine, make sure you use your own domain name or IP address.
If you know the hostname you’d like to connect to, run
ssh -L PORTNUMBER:localhost:PORTNUMBER USERNAME@HOSTNAME
in which PORTNUMBER is a port number of your choice that is not already in use, USERNAME is your username, and HOSTNAME is your hostname.
If you know the IP you’d like to connect to, run
ssh -L PORTNUMBER:localhost:PORTNUMBER USERNAME@IP
in which IP is the IP address of the machine you wish to connect to.
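For a concrete picture of what a filled-in command looks like, the sketch below assembles one from three values. The port 8888, the user name alice, and the host vm.example.org are all invented placeholders; substitute your own. The command is printed rather than executed, so nothing connects until you run it yourself.

```shell
# Hypothetical values: replace all three with your own.
port=8888               # any free port; 8888 is only an example
user=alice              # made-up user name
host=vm.example.org     # made-up host name; an IP address is used the same way

# -L LOCAL:localhost:REMOTE forwards port LOCAL on your computer
# to port REMOTE on the machine you log in to.
ssh_command="ssh -L ${port}:localhost:${port} ${user}@${host}"

# Print the command for inspection before running it at the prompt.
echo "$ssh_command"
```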
- Download and install Anaconda (https://www.anaconda.com/download/) and git (https://git-scm.com/downloads).
- Run
git clone https://github.com/NCBI-Hackathons/seqacademy.git
to clone the repository and download the tutorial. This will download a folder called seqacademy.
- Before running any programs, we’ll make sure that each piece of software is installed correctly. This tutorial uses Bioconda (https://bioconda.github.io/), a channel for the conda package manager that specializes in bioinformatics software. The available packages are listed here: https://bioconda.github.io/recipes.html#recipes.
You will need to add the bioconda channel as well as the other channels bioconda depends on. It is important to add them in the order given below so that the priority is set correctly: each newly added channel goes to the top of the list, so bioconda, added last, gets the highest priority.
The conda-forge channel contains many general-purpose packages not already found in the defaults channel. The r channel was only ever included for backward compatibility and is not added by the commands below. It is not mandatory, but without it, packages compiled against R 3.3.1 might not work.
This tutorial uses cells written in Python, R, and Unix shell to perform its analyses.
Run the following three commands on the command line, in this order:
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
In this tutorial we will create an environment named “tutorial” and install the packages there. An environment is an isolated collection of packages, so different projects can use different versions of Python and other packages without interfering with one another. You can create, export, list, remove, and update environments. Switching or moving between environments is called activating the environment. You can also share an environment file.
Run the following command to create the environment. The -n flag specifies the name of the environment to create (“tutorial”), and the packages listed after the name are installed into that environment. This will most likely take 10-15 minutes.
conda create -n tutorial jupyter hisat2 multiqc macs2 bioconductor-deseq matplotlib ggplot samtools bioconductor-rsamtools bedtools htseq --yes
Then activate the environment with the following command:
For Mac and Linux with older versions of conda:
source activate tutorial
For conda 4.4 and later, on any operating system:
conda activate tutorial
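Once the environment is active, one way to confirm that the software is reachable is to ask the shell whether each program is on the PATH. The helper below is a minimal sketch and not part of the tutorial itself; the function name check_tools is invented, and the executable names passed to it are assumptions about what the installed packages provide (for example, that the htseq package supplies htseq-count).

```shell
# Report whether each named program can be found on the PATH.
# Prints one line per tool and returns non-zero if any tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names below are assumptions, not taken from the tutorial.
check_tools jupyter hisat2 multiqc macs2 samtools bedtools htseq-count ||
    echo "Some tools are missing; is the tutorial environment active?"

# The configured channel order can be inspected with:
#   conda config --show channels
```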
Begin the tutorial
Follow the instructions in chipseq.md for the corresponding tutorial.
The following figures present the RNA-Seq data used in this tutorial. The tutorial analyzes RNA-Seq data on aneuploidy in yeast (source: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP106028).
Principal component analysis (PCA) suggests that gene expression for the euploid yeast samples (haploid) clusters distinctly from that of the aneuploid yeast samples (diploid chromosome X). The first two PCs account for ~70% of the variance in expressed genes. Data provided by Mulla et al. (https://elifesciences.org/articles/27991).
A volcano plot of differentially expressed genes between euploid and aneuploid yeast colonies. The x-axis represents the difference in gene expression between the conditions. The false discovery rate (FDR), a measure used to control for multiple testing, is along the y-axis. Each point represents a tested gene (N=3,926). Red points are those reaching genome-wide significance (FDR < 0.05, N=3,263). Data provided by Mulla et al. (https://elifesciences.org/articles/27991).
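To make the thresholding concrete, here is a small sketch of how points on a volcano plot could be classified. The gene names and numbers are invented toy values, not the tutorial's results, and the y-coordinate assumes the conventional -log10 scaling of the FDR, which the caption does not state explicitly.

```shell
# Toy results table: gene, log2 fold change, FDR.
# The values are invented for illustration only.
results=$(printf '%s\n' \
    "geneA 2.10 0.0001" \
    "geneB -1.50 0.0300" \
    "geneC 0.20 0.6000" \
    "geneD 0.90 0.0500" \
    "geneE -0.10 0.9000")

# Volcano-plot coordinates: x = log2 fold change, y = -log10(FDR).
# A gene is flagged as significant (a "red point") when FDR < 0.05.
volcano=$(echo "$results" | awk '{
    y = -log($3) / log(10)
    flag = ($3 < 0.05) ? "significant" : "not_significant"
    printf "%s %.2f %.2f %s\n", $1, $2, y, flag
}')
echo "$volcano"

# Count the significant genes.
n_significant=$(echo "$volcano" | awk '$4 == "significant"' | wc -l | tr -d ' ')
echo "significant genes: $n_significant"
```

Because the comparison is strict, a gene with an FDR of exactly 0.05 (geneD in the toy table) is not counted as significant.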
The relative enrichment of chrX for differentially expressed genes suggests the downstream results of this processing pipeline are consistent with biological expectations. The RNA-seq experiment was performed on yeast colonies with an extra chromosome X. Data provided by Mulla et al. (https://elifesciences.org/articles/27991).
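One simple way to look at this kind of enrichment is to compare, per chromosome, the fraction of tested genes that are differentially expressed. The sketch below does that for an invented toy table; it is not the tutorial's analysis, and a formal enrichment test would go further than comparing fractions.

```shell
# Toy table: gene, chromosome, FDR (invented values, not the tutorial's results).
genes=$(printf '%s\n' \
    "g1 chrX 0.001" \
    "g2 chrX 0.010" \
    "g3 chrX 0.200" \
    "g4 chrIV 0.700" \
    "g5 chrIV 0.020" \
    "g6 chrIV 0.900" \
    "g7 chrIV 0.500")

# For each chromosome, count the tested genes and those with FDR < 0.05,
# then report the fraction that is differentially expressed.
# Output columns: chromosome, tested, significant, fraction.
enrichment=$(echo "$genes" | awk '
    { tested[$2]++; if ($3 < 0.05) sig[$2]++ }
    END {
        for (chr in tested)
            printf "%s %d %d %.2f\n", chr, tested[chr], sig[chr], sig[chr] / tested[chr]
    }' | sort)
echo "$enrichment"
```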
The following figure presents the ChIP-Seq data used in this tutorial. The tutorial analyzes ChIP-Seq data on induction by 3-AT in yeast (source: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP132584).
Distribution of intersected peaks across the yeast genome. The bottom row of this IGV screenshot shows the peaks shared between the two treatment conditions of the yeast samples. The genes matching each intersected peak can then be analyzed.
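The idea of intersecting peaks can be sketched with two tiny peak lists. The coordinates are invented, and the awk loop is only a toy stand-in for a dedicated tool; it is not presented as the method used to produce the figure.

```shell
# Toy peak lists in BED-like form (chromosome, start, end); coordinates are invented.
treated=$(printf '%s\n' "chrI 100 200" "chrI 500 650" "chrII 300 400")
untreated=$(printf '%s\n' "chrI 150 260" "chrI 700 800" "chrII 350 380")

# Report the overlapping part of every pair of peaks on the same chromosome.
# Intervals are half-open, as in BED: two peaks overlap when each one
# starts before the other ends.
intersected=$(
    { echo "$treated" | sed 's/^/A /'; echo "$untreated" | sed 's/^/B /'; } |
    awk '
        $1 == "A" { n++; chr[n] = $2; s[n] = $3 + 0; e[n] = $4 + 0; next }
        $1 == "B" {
            bs = $3 + 0; be = $4 + 0
            for (i = 1; i <= n; i++)
                if (chr[i] == $2 && s[i] < be && bs < e[i]) {
                    lo = (s[i] > bs) ? s[i] : bs    # later of the two starts
                    hi = (e[i] < be) ? e[i] : be    # earlier of the two ends
                    print $2, lo, hi
                }
        }'
)
echo "$intersected"
```

With real BED files, bedtools (installed in the tutorial environment) offers this kind of comparison through a command of the form bedtools intersect -a treated_peaks.bed -b untreated_peaks.bed, where both file names are hypothetical.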