Using ProkSeq on Matilda
Overview
ProkSeq is an automated RNA-seq data analysis package for prokaryotes, where users can perform all the necessary steps of RNA-seq data analysis from quality control to pathway enrichment analysis.
ProkSeq has been adapted for use on Matilda from the original Docker image and Conda setup. Limited changes have been made to some scripts so that application paths can be found in the user's working space. This document briefly describes how to use ProkSeq on Matilda.
Getting Started
Begin by loading the ProkSeq modulefile:
module load ProkSeq
The modulefile loads all of the packages required by ProkSeq and sets the appropriate environment variables.
To run ProkSeq, you will need to set up a parameter file and a sample file, both of which are passed as inputs on the command line. For your convenience, loading the modulefile sets the environment variable $PROKSEQ_HOME to the installation directory. For examples of parameter and sample files, please refer to the contents of:
$PROKSEQ_HOME/example
Similarly, an example of the annotation files which may be used can be found in:
$PROKSEQ_HOME/data
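Before copying or reading these files, it can help to confirm that the modulefile was actually loaded in the current shell. The following is a minimal sketch, not part of ProkSeq: the function name prokseq_env_ok is hypothetical, and it only assumes that $PROKSEQ_HOME contains the example and data directories mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ProkSeq): confirm the module environment
# before trying to copy or read the example files.
prokseq_env_ok() {
    # The modulefile is expected to set PROKSEQ_HOME.
    if [ -z "${PROKSEQ_HOME:-}" ]; then
        echo "PROKSEQ_HOME is not set; run 'module load ProkSeq' first" >&2
        return 1
    fi
    local sub
    # Check the two directories this page refers to.
    for sub in example data; do
        if [ ! -d "${PROKSEQ_HOME}/${sub}" ]; then
            echo "missing directory: ${PROKSEQ_HOME}/${sub}" >&2
            return 1
        fi
    done
    return 0
}

# Example use (not executed here):
#   prokseq_env_ok && ls "$PROKSEQ_HOME/example"
```

The function returns a non-zero status and prints a short message to standard error when something is missing, so it can be used as a guard at the top of a script.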
The Parameter File
You must change at least one line of the parameter file used as input to ProkSeq. Presented below is an excerpt of the file $PROKSEQ_HOME/example/param.bowtie.yaml:
## ENTER PATH TO WORKING DIRECTORY BELOW **
PATH ROOT : /path/to/working/directory
# If the above environment (depend, scripts, data) is true, the following
# line maye uncommented.
#PATH DEFAULT : "TRUE"
# Specify the path to geneBody_coverage
PATH geneBody_coverage : /cm/shared/apps/ProkSeq/2.0-py368/depend/RSeQC-2.6.2/scripts/
# Specify the path to FEATURECOUNTS
PATH FEATURECOUNTS : /cm/shared/apps/ProkSeq/2.0-py368/depend/subread-1.4.6-p5-Linux-i386/bin/
# Specify the path to fastqc
PATH FASTQC : /cm/shared/apps/ProkSeq/2.0-py368/depend/FastQC
# Specify the path to bowtie
PATH BOWTIE : /cm/shared/apps/ProkSeq/2.0-py368/depend/bowtie2/bowtie2-2.3.5.1-linux-x86_64
# Specify the path to salmon if salmon is required
PATH SALMON : /cm/shared/apps/ProkSeq/2.0-py368/depend/salmon-latest_linux_x86_64/bin
# Specify the path to pypy required for running afterqc
....
Please change the line "PATH ROOT :" to match the path of your current working directory!
You should NOT change the PATHs for any of the executables; doing so will break ProkSeq. You may wish to alter certain input flag parameters for BOWTIE, SALMON, or AFTERQC. You may also wish to alter the names of certain files, such as the featureCounts input GTF file. Please refer to the ProkSeq documentation for more information.
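The "PATH ROOT :" line can be edited by hand, but when several parameter files need the same change a small helper may be less error-prone. The sketch below is an illustration only: set_path_root is a hypothetical function, not something ProkSeq provides, and it assumes the line has the "PATH ROOT : <directory>" form shown in the excerpt above and that the directory name contains no '|' or '&' characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ProkSeq).
# set_path_root FILE DIR: rewrite the "PATH ROOT :" line of FILE to point at DIR.
# All other lines, including the executable PATH entries, are left untouched.
set_path_root() {
    local file="$1" dir="$2"
    local tmp="${file}.tmp.$$"
    # '|' is used as the sed delimiter because the directory contains '/'.
    # The pattern is anchored at line start, so "#PATH DEFAULT" is not matched.
    sed "s|^PATH ROOT[[:space:]]*:.*|PATH ROOT : ${dir}|" "$file" > "$tmp" \
        && mv "$tmp" "$file"
}

# Example use (not executed here):
#   set_path_root param.bowtie.yaml "$PWD"
```

Writing to a temporary file and then moving it into place avoids relying on the in-place option of any particular sed implementation.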
Running an Example
You may wish to run a couple of examples to get a feel for how ProkSeq works. Let's start by changing to your working directory, loading the modulefile, and copying over the example files:
cd /scratch/users/<username>
module load ProkSeq
rsync -avp $PROKSEQ_HOME/example/* .
Now, as previously instructed, change the "PATH ROOT" line in both the "param.bowtie.yaml" and "param.salmon.yaml" files. For example:
## ENTER PATH TO WORKING DIRECTORY BELOW **
PATH ROOT : /scratch/users/<username>
Now let's run the paired-end bowtie example:
prokseq.py -s samples.bowtie.PEsample -p param.bowtie.yaml -n 4
This should produce a directory called "Output" and shouldn't throw any exceptions or errors. Now rename the Output directory to save your results, and then run the salmon example:
mv Output Output.bowtie
prokseq.py -s samples.salmon.PEsample -p param.salmon.yaml -n 4
Again, you should have a new "Output" folder. Inspect the contents of "Output" and "Output.bowtie".
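Because each run writes to the same "Output" directory, results need to be renamed before the next run. The following sketch wraps that rename with two safety checks; save_output is a hypothetical helper, not a ProkSeq command, and it assumes it is called from the working directory that contains "Output".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ProkSeq).
# save_output LABEL: rename ./Output to ./Output.LABEL, refusing to overwrite
# results that were saved earlier under the same label.
save_output() {
    local label="$1"
    local target="Output.${label}"
    if [ ! -d Output ]; then
        echo "no Output directory found in ${PWD}" >&2
        return 1
    fi
    if [ -e "$target" ]; then
        echo "${target} already exists; not overwriting" >&2
        return 1
    fi
    mv Output "$target"
}

# Example use (not executed here):
#   save_output bowtie
```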
Interactive vs. Batch Jobs
When running the above examples interactively, you will be prompted to confirm before the analysis proceeds:
Done with package checks. Seems all the required packages are available. Do you want to continue? (Y/N) :
Normally you type an answer (generally "Y") and the analysis proceeds. In a batch cluster job, however, you must supply the answer as part of the command line. For example:
echo "Y" | prokseq.py -s samples.salmon.PEsample -p param.salmon.yaml -n 4
This is important: no one can answer the prompt while a batch job is running, so the job will fail unless the command line is modified as shown above.
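The same idea can be wrapped in a small function for use inside a job script. This is a sketch under stated assumptions: run_prokseq_noninteractive is a hypothetical name, the arguments simply mirror the -s, -p, and -n options used in the examples above, and scheduler directives are site-specific and deliberately omitted.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of ProkSeq): answer the "Do you want to
# continue? (Y/N)" prompt automatically so the command can run without a
# terminal attached.
run_prokseq_noninteractive() {
    local samples="$1" params="$2"
    local n="${3:-4}"   # value passed to -n; defaults to 4 as in the examples
    # printf supplies the single "Y" that the prompt reads from standard input.
    # The function's exit status is that of prokseq.py, the last command in
    # the pipeline.
    printf 'Y\n' | prokseq.py -s "$samples" -p "$params" -n "$n"
}

# Example use inside a job script (not executed here):
#   module load ProkSeq
#   run_prokseq_noninteractive samples.salmon.PEsample param.salmon.yaml 4
```

Since the wrapper returns the status of prokseq.py, a job script can test it directly to decide whether to continue with later steps.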
Additional Information
Please do NOT rely upon the documentation on the ProkSeq GitHub page. As of this writing, we have found that it is out of date and in some cases refers to deprecated commands.
Do use the dedicated ProkSeq documentation page for more information.