Cohort RNA-seq data

HBA-DEALS is designed for the analysis of RNA-seq cohort data. The cohorts can be defined arbitrarily, but they are commonly labeled as cases and controls, or as group 1 and group 2.

In general, there are many ways to obtain such data, including generating RNA-seq data yourself or downloading it from a public repository. For this tutorial, we download a dataset from the NCBI Sequence Read Archive (SRA), which is the primary archive of next-generation sequencing (NGS) datasets.

Downloading from SRA

We will use the SRA Toolkit.

We will use 8 RNA-seq samples from the dataset SRP149366, which investigated the estrogen-responsive transcriptome of estrogen receptor-positive normal human breast cells in 3D cultures. Four samples are controls and four were treated with estradiol.

Group	Samples
control	SRR7236472, SRR7236473, SRR7236474, SRR7236475
estradiol	SRR7236480, SRR7236481, SRR7236482, SRR7236483
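Because the same accessions are needed in several steps, it can help to keep the two groups in shell variables. The following is a minimal sketch for a POSIX-style shell such as bash; the variable names and the sample_sheet helper are hypothetical and are not part of HBA-DEALS or the SRA Toolkit.

```shell
# Accessions from the table above, grouped by condition.
controls="SRR7236472 SRR7236473 SRR7236474 SRR7236475"
treated="SRR7236480 SRR7236481 SRR7236482 SRR7236483"

# Hypothetical helper: print one "accession<TAB>group" line per sample.
# The unquoted variables are split on spaces on purpose.
sample_sheet() {
    for srr in $controls; do printf '%s\t%s\n' "$srr" control; done
    for srr in $treated; do printf '%s\t%s\n' "$srr" estradiol; done
}

sample_sheet
```

Redirecting the output (for example, sample_sheet > samples.tsv, where the file name is an arbitrary choice) gives a small table that records which sample belongs to which group.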

First, install the SRA Toolkit, which provides the prefetch and fasterq-dump commands, on your system according to the SRA Toolkit instructions. The download page may be helpful.
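Before starting a long download, it may be worth confirming that both commands are on your PATH. This is a small sketch; missing_tools is a hypothetical helper name, and the check only tests that the commands can be found, not that they are a particular version.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools prefetch fasterq-dump)
if [ -n "$missing" ]; then
    # Left unquoted so that several missing names print on one line.
    echo "Not found on PATH:" $missing >&2
else
    echo "SRA Toolkit commands found."
fi
```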

Then execute the following command from the shell.

Downloading RNA-seq files with the SRA Toolkit

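 # Create the temporary and output directories (harmless if they already exist).
 mkdir -p tmp tutorial
 for srr in SRR7236472 SRR7236473 SRR7236474 SRR7236475 SRR7236480 SRR7236481 SRR7236482 SRR7236483; do
     # Fetch the .sra archive, then convert it to FASTQ files.
     prefetch "$srr"
     fasterq-dump -t tmp/ --split-files --threads 8 --outdir tutorial/ "$srr"
 done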

If necessary, change the --threads argument according to the resources of your system. The downloaded *.fastq files will be written to the tutorial directory. The directories that prefetch creates for each accession (e.g., SRR7236472, which contains the file SRR7236472.sra) can be deleted once the FASTQ files have been written.
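Since the download can be interrupted, a quick check that every expected file is present may save time later. The sketch below assumes the naming used in this tutorial (two files per accession, ending in _1.fastq and _2.fastq); missing_fastq is a hypothetical helper, not an SRA Toolkit command.

```shell
# Hypothetical helper: list expected FASTQ files that are missing or empty.
# Usage: missing_fastq DIR ACCESSION...
missing_fastq() {
    dir=$1
    shift
    for srr in "$@"; do
        for mate in 1 2; do
            f="$dir/${srr}_${mate}.fastq"
            # -s is true only if the file exists and is not empty.
            [ -s "$f" ] || echo "$f"
        done
    done
}

missing_fastq tutorial SRR7236472 SRR7236473 SRR7236474 SRR7236475 \
    SRR7236480 SRR7236481 SRR7236482 SRR7236483
```

If nothing is printed, all 16 files exist and are non-empty. Note that this does not verify that the files are complete or uncorrupted.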

This step will typically take a few hours.

Cleaning the reads

fastp performs quality control, adapter trimming, quality filtering, per-read quality pruning and many other operations with a single scan of the FASTQ data.

fastp can be downloaded from its GitHub site, which also has installation instructions for various platforms. If you have a Debian or Ubuntu system, the easiest way to install it is:

sudo apt install fastp

After you have installed fastp, run the following command in the shell.

Cleaning FASTQ files with fastp

 for srr in SRR7236472 SRR7236473 SRR7236474 SRR7236475 SRR7236480 SRR7236481 SRR7236482 SRR7236483; do
     # -i/-I are the paired input files; -o/-O are the trimmed outputs.
     fastp -i "tutorial/${srr}_1.fastq" -I "tutorial/${srr}_2.fastq" \
           -o "tutorial/${srr}_trimmed_1.fastq" -O "tutorial/${srr}_trimmed_2.fastq"
 done

For each downloaded FASTQ file, this creates a cleaned file with “_trimmed” inserted into its name; for example, SRR7236472_1.fastq becomes SRR7236472_trimmed_1.fastq. At this point, you can delete the original *.fastq files if desired.
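If you do delete the originals, take care not to remove the trimmed files as well, since both sets match *.fastq. One cautious approach, sketched here under the file-naming assumptions of this tutorial, is to remove an accession's original pair only when both of its trimmed files exist and are non-empty. The function name remove_trimmed_originals is hypothetical.

```shell
# Hypothetical helper: delete the original FASTQ pair of each accession,
# but only if both trimmed files exist and are not empty.
# Usage: remove_trimmed_originals DIR ACCESSION...
remove_trimmed_originals() {
    dir=$1
    shift
    for srr in "$@"; do
        if [ -s "$dir/${srr}_trimmed_1.fastq" ] && [ -s "$dir/${srr}_trimmed_2.fastq" ]; then
            rm -f "$dir/${srr}_1.fastq" "$dir/${srr}_2.fastq"
        else
            echo "keeping originals for $srr: trimmed output missing or empty" >&2
        fi
    done
}

remove_trimmed_originals tutorial SRR7236472 SRR7236473 SRR7236474 SRR7236475 \
    SRR7236480 SRR7236481 SRR7236482 SRR7236483
```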