(pronounced “seek academy”)
An easy-to-use, all-in-one tutorial for bioinformatics analyses.
Keywords: RNA-Seq, ChIP-Seq, alignment, differential gene expression, peak-calling, education, tutorial, pipeline
What is SeqAcademy?
SeqAcademy is a resource for anyone – regardless of skill level – to perform bioinformatics research. It is a user-friendly, notebook-based educational pipeline for RNA-Seq and ChIP-Seq data analysis. Bioinformaticians use RNA-Seq to study RNA and ChIP-Seq to study protein interactions within genomes.
RNA-Seq and ChIP-Seq experiments generate large amounts of data and rely on pipelines for efficient analysis. However, existing tools either perform only specific portions of the pipeline or offer a complete pipeline solution aimed at the advanced programmer.
SeqAcademy addresses these problems by providing an easy-to-use tutorial that outlines the complete RNA-Seq and ChIP-Seq analysis workflow and requires no prior programming experience. The tutorial is delivered through Jupyter Notebook, an application for sharing documents and code.
Created by Hussain Ather (shussainather [at] gmail [dot] com)
Who is SeqAcademy for?
SeqAcademy is for students and researchers with little to no bioinformatics experience interested in hands-on bioinformatics tutorials. Anyone will feel comfortable analyzing epigenomic and RNA-Seq data using this simple educational tool.
What does SeqAcademy teach?
The tutorial uses the HISAT2 aligner to align sample reads to a reference genome.
It then uses quantification methods (such as Salmon for RNA-Seq and peak-calling for ChIP-Seq) to quantify expression and determine protein binding.
The output is analyzed (differential gene expression for RNA-Seq and peak analysis for ChIP-Seq), and the results are visualized.
Finally, it runs MultiQC to extract quality control information from the aligned reads.
The model organism for this project is yeast (Saccharomyces cerevisiae). For RNA-Seq, yeast data from euploid and aneuploid conditions are compared. For ChIP-Seq, yeast data from 3AT-treated and untreated conditions are compared.
How do I use SeqAcademy?
- Identify and open the terminal emulator program on your computer. Mac and Linux systems come with Terminal installed, and Windows systems come with Console. If there isn’t one installed, download one online.
- Type pwd and press enter. This command shows your current working directory. Typing a command and pressing enter is the primary way of running commands in this tutorial. Type ls to display the directories and files in the current directory.
- If you’d like to use the tutorial in this current working directory, skip to step 5. Otherwise, you may make a new directory or move to another one. To make a new directory, run mkdir DIRECTORY, in which DIRECTORY is the name of the directory you’d like to make. To move to another directory, run cd DIRECTORY, in which DIRECTORY is the name of the directory you’d like to move to. To move up a directory, run cd ..
- Given the disk space and RAM requirements, it’s likely you’ll want to use a virtual machine for this tutorial. To connect to a virtual machine, make sure you use your own domain name or IP address.
If you know the hostname you’d like to connect to, run
ssh -L PORTNUMBER:localhost:PORTNUMBER USERNAME@HOSTNAME in which PORTNUMBER is a port number of your choice that is not already in use, USERNAME is your username, and HOSTNAME is your hostname.
If you know the IP you’d like to connect to, run
ssh -L PORTNUMBER:localhost:PORTNUMBER USERNAME@IP in which IP is the IP address of the machine you wish to connect to.
- Run git clone https://github.com/NCBI-Hackathons/seqacademy.git to clone the repository and download the tutorial. This will download a folder called seqacademy.
- Enter the seqacademy folder and follow the instructions in
The RNA-Seq data used in this tutorial are described below. The tutorial uses RNA-Seq data on aneuploidy in yeast (source: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP106028).
Principal component analysis (PCA) suggests that gene expression for the euploid yeast samples (haploid) clusters distinctly from that of the aneuploid yeast samples (diploid chromosome X). The first two PCs account for ~70% of the variance in expressed genes. Data provided by Mulla et al. (https://elifesciences.org/articles/27991).
A volcano plot of differentially expressed genes between euploid and aneuploid yeast colonies. The x-axis represents the difference in gene expression between the conditions. The false discovery rate (FDR), a measure used to control for multiple testing, is along the y-axis. Each point represents a tested gene (N=3,926). Red points are those reaching genome-wide significance (FDR < 0.05, N=3,263). Data provided by Mulla et al. (https://elifesciences.org/articles/27991).
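As a rough illustration of how the values behind a volcano plot can be derived, the sketch below adjusts raw p-values with the Benjamini-Hochberg procedure and converts them to plot coordinates. This is an assumption for illustration only: the source does not say which FDR method the tutorial uses, and the log2 fold change on the x-axis, the -log10 transform on the y-axis, the function names, and the toy values are all hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value is
    # capped by the adjusted values of all larger p-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def volcano_points(log2_fold_changes, p_values, fdr_cutoff=0.05):
    """Return (x, y, significant) per gene, with y = -log10(FDR).

    Assumes every p-value is greater than zero.
    """
    fdr = benjamini_hochberg(p_values)
    return [
        (lfc, -math.log10(q), q < fdr_cutoff)
        for lfc, q in zip(log2_fold_changes, fdr)
    ]


# Hypothetical p-values and log2 fold changes for five genes.
toy_p = [0.001, 0.02, 0.01, 0.0005, 0.8]
toy_lfc = [1.2, -0.3, 0.8, -2.1, 0.1]
toy_fdr = benjamini_hochberg(toy_p)
toy_points = volcano_points(toy_lfc, toy_p)
```

In this toy input, the first four genes fall below the 0.05 cutoff after adjustment and the fifth does not, which corresponds to the red versus non-red points in the plot.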
The relative enrichment of chrX among differentially expressed genes suggests that the downstream results of this processing pipeline are consistent with biological expectations, since the RNA-Seq experiment was performed on yeast colonies with an extra chromosome X. Data provided by Mulla et al. (https://elifesciences.org/articles/27991).
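One simple way to express per-chromosome enrichment is to compare the fraction of differentially expressed genes on each chromosome with the genome-wide fraction. The sketch below does this on made-up data. The source does not describe how the tutorial computes enrichment, so the function name, the ratio-based definition, and the gene names are assumptions.

```python
from collections import Counter


def chromosome_enrichment(tested_genes, de_genes):
    """Return, per chromosome, the DE fraction divided by the overall DE fraction.

    tested_genes: dict mapping gene name -> chromosome for all tested genes.
    de_genes: set of gene names called differentially expressed.
    """
    tested_per_chrom = Counter(tested_genes.values())
    de_per_chrom = Counter(
        tested_genes[gene] for gene in de_genes if gene in tested_genes
    )
    total_de = sum(de_per_chrom.values())
    if total_de == 0:
        raise ValueError("no differentially expressed genes among tested genes")
    overall_fraction = total_de / len(tested_genes)
    return {
        chrom: (de_per_chrom[chrom] / n_tested) / overall_fraction
        for chrom, n_tested in tested_per_chrom.items()
    }


# Hypothetical example: eight tested genes on two chromosomes.
toy_tested = {
    "g1": "chrX", "g2": "chrX", "g3": "chrX", "g4": "chrX",
    "g5": "chrI", "g6": "chrI", "g7": "chrI", "g8": "chrI",
}
toy_de = {"g1", "g2", "g3", "g5"}
toy_enrichment = chromosome_enrichment(toy_tested, toy_de)
```

A ratio above 1 means a chromosome carries more differentially expressed genes than expected from the genome-wide rate. A formal analysis would normally add a statistical test on top of this ratio.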
The ChIP-Seq data used in this tutorial are described below. The tutorial uses ChIP-Seq data on induction by 3-AT in yeast (source: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP132584).
Distribution of intersected peaks across the yeast genome. The bottom row of this IGV screenshot shows the peaks shared between the two treatment conditions of the yeast samples. The genes matching each intersected peak can then be analyzed.
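To make the idea of intersected peaks concrete, the sketch below finds the regions where peaks from two conditions overlap. It is a minimal illustration and not the tool the tutorial runs: the function name, the BED-style half-open coordinates, and the example peaks are assumptions.

```python
def intersect_peaks(peaks_a, peaks_b):
    """Return the overlapping regions between two peak lists.

    Each peak is a (chrom, start, end) tuple with half-open
    coordinates [start, end), as in BED files.
    """
    # Group the second peak set by chromosome for quick lookup.
    by_chrom_b = {}
    for chrom, start, end in peaks_b:
        by_chrom_b.setdefault(chrom, []).append((start, end))

    overlaps = []
    for chrom, start_a, end_a in peaks_a:
        for start_b, end_b in by_chrom_b.get(chrom, []):
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            # Keep only pairs that share at least one base.
            if start < end:
                overlaps.append((chrom, start, end))
    return sorted(overlaps)


# Hypothetical peaks for treated and untreated samples.
toy_treated = [("chrI", 100, 200), ("chrII", 500, 650)]
toy_untreated = [("chrI", 150, 260), ("chrII", 700, 800), ("chrIII", 10, 50)]
toy_shared = intersect_peaks(toy_treated, toy_untreated)
```

Here only the chrI peaks overlap, so the result contains a single shared region from position 150 to 200. The pairwise comparison is fine for small inputs; dedicated tools use sorted or indexed intervals to scale to whole genomes.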