Using ProkSeq on Matilda

Overview

ProkSeq is an automated RNA-seq data analysis package for prokaryotes. It lets users perform all the necessary steps of RNA-seq data analysis, from quality control to pathway enrichment analysis.

ProkSeq has been adapted for use on Matilda from the original Docker image and Conda distributions. A limited number of changes have been made to some scripts so that application paths can be found from the user's working space. This document briefly describes how to use ProkSeq on Matilda.

Getting Started

Begin by loading the ProkSeq modulefile:

module load ProkSeq

The modulefile will load all of the packages required by ProkSeq and set the appropriate environment variables.

In order to run ProkSeq, you will need to set up a parameter file and a sample file, both of which are passed as inputs on the command line. For your convenience, loading the modulefile sets the environment variable $PROKSEQ_HOME to the installation directory. For examples of parameter and sample files, please refer to the contents of:

$PROKSEQ_HOME/example

Similarly, examples of the annotation files that may be used can be found in:

$PROKSEQ_HOME/data
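
Before copying anything, it can help to confirm that the modulefile really did set $PROKSEQ_HOME and that both directories are present. The following is a minimal sketch; "check_prokseq_home" is a hypothetical helper name and is not part of ProkSeq or the modulefile.

```shell
# Hypothetical helper (not part of ProkSeq): verify that PROKSEQ_HOME is set
# and that the "example" and "data" directories exist, then list them.
check_prokseq_home() {
    if [ -z "${PROKSEQ_HOME:-}" ]; then
        echo "PROKSEQ_HOME is not set; run 'module load ProkSeq' first." >&2
        return 1
    fi
    for sub in example data; do
        if [ ! -d "$PROKSEQ_HOME/$sub" ]; then
            echo "Missing directory: $PROKSEQ_HOME/$sub" >&2
            return 1
        fi
    done
    # Show the example parameter/sample files and the annotation files.
    ls "$PROKSEQ_HOME/example" "$PROKSEQ_HOME/data"
}
```

After "module load ProkSeq", calling "check_prokseq_home" should list both directories; a non-zero return status suggests the module is not loaded in the current shell.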

The Parameter File

You must change at least one line of the parameter file used as input to ProkSeq. Presented below is an excerpt of the file $PROKSEQ_HOME/example/param.bowtie.yaml:

## ENTER PATH TO WORKING DIRECTORY BELOW **
PATH ROOT : /path/to/working/directory
#       If the above environment (depend, scripts, data) is true, the following
#       line maye uncommented.
#PATH DEFAULT : "TRUE"
#       Specify the path to geneBody_coverage
PATH geneBody_coverage : /cm/shared/apps/ProkSeq/2.0-py368/depend/RSeQC-2.6.2/scripts/
#       Specify the path to FEATURECOUNTS
PATH FEATURECOUNTS : /cm/shared/apps/ProkSeq/2.0-py368/depend/subread-1.4.6-p5-Linux-i386/bin/
#       Specify the path to fastqc
PATH FASTQC : /cm/shared/apps/ProkSeq/2.0-py368/depend/FastQC
#       Specify the path to bowtie
PATH BOWTIE : /cm/shared/apps/ProkSeq/2.0-py368/depend/bowtie2/bowtie2-2.3.5.1-linux-x86_64
#       Specify the path to salmon if salmon is required
PATH SALMON : /cm/shared/apps/ProkSeq/2.0-py368/depend/salmon-latest_linux_x86_64/bin
#       Specify the path to pypy required for running afterqc
....

Please change the line "PATH ROOT :" to match the path of your current working directory!

You should NOT change the PATHs for any of the executables; otherwise you will break ProkSeq. You may wish to alter certain input flag parameters for BOWTIE, SALMON, or AFTERQC. You may also wish to alter the names of certain files, such as the featureCounts input GTF file. Please refer to the ProkSeq documentation for more information.

Running an Example

You may wish to run a couple of examples to get a feel for how ProkSeq works. Let's start by changing to your working directory, loading the modulefile, and copying over the example files:

cd /scratch/users/<username>
module load ProkSeq
rsync -avp $PROKSEQ_HOME/example/* .

Now, as previously instructed, change the "PATH ROOT" line in both the "param.bowtie.yaml" and "param.salmon.yaml" files. For example:

## ENTER PATH TO WORKING DIRECTORY BELOW **
PATH ROOT : /scratch/users/<username>
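
If you prefer not to edit the files by hand, the same change can be scripted. The sketch below is one possible approach, not something shipped with ProkSeq: "set_path_root" is a hypothetical helper, and it assumes the directory path contains no "|", "&", or backslash characters, which are special in the sed expression.

```shell
# Hypothetical helper (not part of ProkSeq): rewrite the "PATH ROOT" line of a
# parameter file so that it points at the given directory.
# Assumption: the directory path contains no '|', '&', or '\' characters.
set_path_root() {
    file=$1
    dir=$2
    # Write to a temporary file first so the original is replaced only on success.
    sed "s|^PATH ROOT :.*|PATH ROOT : ${dir}|" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}

# Example use, pointing both example parameter files at the current directory:
# for f in param.bowtie.yaml param.salmon.yaml; do
#     set_path_root "$f" "$PWD"
# done
```

Only lines that begin with "PATH ROOT :" are rewritten, so comment lines and the executable PATH entries are left untouched.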

Now let's run the paired-end bowtie example:

prokseq.py -s samples.bowtie.PEsample -p param.bowtie.yaml -n 4

This should produce a directory called "Output" and shouldn't throw any exceptions or errors. Now rename the Output directory to save your results, and then run the salmon example:

mv Output Output.bowtie
prokseq.py -s samples.salmon.PEsample -p param.salmon.yaml -n 4

Again, you should have a new "Output" folder. Inspect the contents of "Output" and "Output.bowtie".
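
Since both examples write to a directory called "Output", renaming it between runs is worth doing carefully: if the target name already exists as a directory, a plain "mv" moves "Output" inside it rather than renaming it. A small guard, sketched here with a hypothetical helper name "save_output" (not part of ProkSeq), avoids that:

```shell
# Hypothetical helper (not part of ProkSeq): rename "Output" to "Output.<label>",
# refusing to proceed if the target already exists.
save_output() {
    label=$1
    if [ ! -d Output ]; then
        echo "No Output directory found in $PWD" >&2
        return 1
    fi
    if [ -e "Output.$label" ]; then
        echo "Output.$label already exists; choose another label." >&2
        return 1
    fi
    mv Output "Output.$label"
}
```

For instance, "save_output salmon" after the second run would leave "Output.bowtie" and "Output.salmon" side by side.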

Interactive vs. Batch Jobs

When running the above examples interactively, you will be asked to answer a question before the analysis proceeds:

Done with package checks. Seems all the required packages are available.
Do you want to continue? (Y/N) :

Normally you would provide an answer (generally "Y") and the analysis would proceed. However, in a batch cluster job you need to supply the answer as part of the command line. For example:

echo "Y" | prokseq.py -s samples.salmon.PEsample -p param.salmon.yaml -n 4

This is important because you cannot answer the prompt while a batch job is running; the job will fail unless the command line is modified as shown above.
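
To show where the piped answer fits in a complete job, here is a sketch that writes a batch script to a file. It rests on several assumptions that this page does not confirm: that the scheduler is Slurm, that the resource values shown are suitable (they are placeholders), and that "-n 4" sets the number of processors, which is why "--cpus-per-task=4" is used. The file name "prokseq_job.sh" is illustrative.

```shell
# Write a batch job script to prokseq_job.sh (illustrative file name).
# Assumption: the scheduler is Slurm; all resource values are placeholders.
cat > prokseq_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=prokseq
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=02:00:00

# Slurm starts the job in the directory it was submitted from, so submit
# this script from the working directory named in "PATH ROOT".
module load ProkSeq

# Answer the "Do you want to continue? (Y/N)" prompt non-interactively.
echo "Y" | prokseq.py -s samples.salmon.PEsample -p param.salmon.yaml -n 4
EOF

# Submit with: sbatch prokseq_job.sh
```

Adjust the time limit and CPU count to your data set, and check the cluster's own job script documentation for any required partition or account options.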

Additional Information

Please do NOT rely upon the documentation on the ProkSeq GitHub page. As of this writing, we have found that it is out of date and in some cases refers to deprecated commands.

Do use the dedicated ProkSeq documentation page for more information.

