--------------------------------------------------------------------------------------------------
	HuntMi: an efficient and taxon-specific approach in pre-miRNA identification

	Adam Gudys, Michal Wojciech Szczesniak, Marek Sikora and Izabela Makalowska
--------------------------------------------------------------------------------------------------

The basic functionality of the HuntMi package is to let the user identify real pre-miRNAs
in a set of sequences given in a FASTA file. This task is carried out by two scripts:
- extractFeatures.py (extracts features from sequences in FASTA files and stores them in ARFF format),
- classify.py (generates predictions on the basis of ARFF files).
Additionally, HuntMi allows one to train new models on custom datasets with the use of Weka software.
Detailed descriptions of these actions are given below.


--------------------------------------------------------------------------------------------------
			Feature extraction (extractFeatures.py script)
--------------------------------------------------------------------------------------------------

1. PURPOSE
extractFeatures.py takes one or two FASTA files provided by the user and generates a set of features for each
sequence. The results are saved in ARFF format (.arff files) and can subsequently be used in the HuntMi
classification engine, or in the Weka data mining tool for training custom models or for classification experiments.

2. SYNTAX
extractFeatures.py can be run in two modes:
	a) Generating features to be used in classification experiments with the provided models for human, A. thaliana,
	   animals, plants, or viruses. Here only one input file, containing the sequences to be tested, should be provided, for instance:

	   python extractFeatures.py test_file.fasta

	b) Calculating features for positive and negative datasets that are then used to generate custom models.
	   In this case two files need to be provided, the positive one first and the negative one second, as in the example:

	   python extractFeatures.py positive.fasta negative.fasta

3. INPUT
Input data should be in FASTA format, e.g.:

>sequence 1
CATAGACCTCTGCCAAAAGGAAAGTACACTGGATGAATGCCTGAGCTACCTCTGCAGGTGGATCCACTACAG
>sequence2
ATTTCTCAACTACATGGAAGCTGAACAACCTGCTCCTGAAAATGAAGGCATAAATAAAGATGTTCTTTGAAACTGA
>sequence3
ATCATATGCACATACGATCGATAGCTACACATCGTAGCATCTGTTTTTTTTTTGCTCCGGCCGTAGCAGCGCGCG

The input files are stored in the /HuntMi/data directory.
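
Before running feature extraction it can be useful to check that a file really follows the layout shown above. The following is a minimal sketch, not part of HuntMi: the function names read_fasta and invalid_records are hypothetical, and the accepted alphabet (A, C, G, T, U, N) is an assumption, since this document does not state which characters extractFeatures.py accepts.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_records(records, alphabet="ACGTUN"):
    """Return headers of records that are empty or contain unexpected characters.

    The default alphabet is an assumption made for this sketch.
    """
    allowed = set(alphabet)
    return [name for name, seq in records if not seq or set(seq) - allowed]
```

With a file on disk, the check could be run as invalid_records(read_fasta(open("test_file.fasta"))); an empty list means that every record passed.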

4. OUTPUT
Result files are stored in the /HuntMi/results directory.

The results are in ARFF format.
For each input file two .arff files are generated: input_filename.arff and input_filename.filtered.arff, where input_filename
is the user-provided name of the file. The first file contains the results for all sequences in the input file. However,
some features cannot always be calculated, e.g. when a sequence is too short. In such cases input_filename.arff
contains '?' signs instead of numbers. In input_filename.filtered.arff these cases are filtered out. Moreover, the
input_filename.removed file provides the user with the FASTA sequences that failed the feature generation step.
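
The relation between the unfiltered and the filtered ARFF file can be illustrated with a short sketch that separates data rows containing '?' from complete ones. This is not HuntMi's own filtering code; split_by_missing is a hypothetical helper, and it assumes that all attribute values are plain comma-separated fields without quoted commas.

```python
def split_by_missing(arff_lines):
    """Return (complete_rows, incomplete_rows) from the @data section of an ARFF file."""
    complete, incomplete = [], []
    in_data = False
    for raw in arff_lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if not in_data:
            # Header lines (@relation, @attribute) are skipped until @data is seen.
            in_data = line.lower().startswith("@data")
            continue
        fields = [field.strip() for field in line.split(",")]
        # A row is incomplete if any value is the ARFF missing-value marker '?'.
        (incomplete if "?" in fields else complete).append(fields)
    return complete, incomplete
```

Applied to input_filename.arff, the number of incomplete rows indicates how many sequences failed the feature generation step.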

A directory is also created for each input file: /HuntMi/results/input_filename. The directory contains 7 files:
	a) selected_21_micropred_features: 21 microPred features
	b) input_filename.dustmasker: dm feature
	c) input_filename.fold.triplet: tri_A, tri_U, tri_G, and tri_C features
	d) input_filename.loops: loops feature
	e) input_filename.translate: orf feature
	f) input_filename.fold: secondary structures predicted by RNAfold
	g) input_filename.all_features: all features from a) - e). Its content is equivalent to the input_filename.arff file.

5. KNOWN ISSUES
The HuntMi package uses the ViennaRNA library, which is by default compiled for Perl 5.10. If HuntMi is run on a system
with a different release of Perl, an error occurs. The error message looks like this:

perl: symbol lookup error: ./miPred/ViennaRNA-1.6.4/Perl/blib/arch/auto/RNA/RNA.so: undefined symbol: Perl_Istack_sp_ptr

To solve this, rebuild ViennaRNA with the currently installed Perl release (root privileges are assumed):

cd HuntMi/progs/miPred/ViennaRNA-1.6.4
./configure
make
make install


--------------------------------------------------------------------------------------------------
			Classification (classify.py script)
--------------------------------------------------------------------------------------------------

1. PURPOSE
classify.py takes an ARFF file with the data representation (calculated by the extractFeatures.py script) and
generates class labels for all sequences.

2. SYNTAX
python classify.py modelFile featuresFile    

3. INPUT
modelFile is a file containing a classification model. It can be a model trained by the user in Weka software
or one of the precomputed models:
	human.model,
	arabidopsis.model,
	animal.model,
	plant.model,
	virus.model.
Model files are placed in the HuntMi/classifier directory.

featuresFile is an ARFF file with the representation of the sequences to be classified. This file is placed in the
HuntMi/results directory.
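
The two scripts can also be chained from Python. The sketch below only mirrors the command lines given in this document; huntmi_commands and run_huntmi are hypothetical names. It assumes that a "python" interpreter suitable for HuntMi is on the PATH and that the working directory is the one containing both scripts. The features file name is passed in explicitly rather than derived from the FASTA file name.

```python
import subprocess


def huntmi_commands(fasta_file, model_file, features_file):
    """Build the two command lines described in this document; nothing is executed here."""
    return [
        ["python", "extractFeatures.py", fasta_file],
        ["python", "classify.py", model_file, features_file],
    ]


def run_huntmi(fasta_file, model_file, features_file):
    """Run feature extraction and then classification, stopping at the first failure."""
    for command in huntmi_commands(fasta_file, model_file, features_file):
        # check=True raises CalledProcessError if a script exits with a non-zero status.
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes it possible to print and inspect the commands before anything is run.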

4. OUTPUT
Classifier predictions are stored in the /HuntMi/results/featuresFile.predictions file. Each line corresponds to a single sequence and
contains 0 if the sequence has been classified as non-miRNA, or 1 if it has been classified as miRNA.
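
A predictions file in this format can be summarized with a few lines of Python. This is a minimal sketch with a hypothetical helper name, summarize_predictions; it assumes that each non-empty line holds nothing but the label 0 or 1, and it raises an error if that assumption does not hold.

```python
def summarize_predictions(lines):
    """Count predicted miRNA (1) and non-miRNA (0) labels, one label per line."""
    counts = {"miRNA": 0, "non-miRNA": 0}
    for number, raw in enumerate(lines, start=1):
        label = raw.strip()
        if not label:
            continue  # tolerate blank lines, e.g. at the end of the file
        if label == "1":
            counts["miRNA"] += 1
        elif label == "0":
            counts["non-miRNA"] += 1
        else:
            raise ValueError(f"unexpected label {label!r} on line {number}")
    return counts
```

For a file on disk the call would be summarize_predictions(open(path)), where path points to the .predictions file.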


--------------------------------------------------------------------------------------------------
				Model training 
--------------------------------------------------------------------------------------------------

The HuntMi package allows the user to train models on custom datasets. This can be done with the help of the Weka
data mining software (version 3.6.x is required). Please download the ROCSelect.jar file from the webpage
and add it to the 'cp' variable in the Weka/RunWeka.ini file. E.g. if ROCSelect.jar has been downloaded
to the Weka/plugins/ subdirectory, one should alter the 'cp' variable in the following way:

cp=%CLASSPATH%;./plugins/ROCSelect.jar

After starting Weka, the user can load a training dataset in ARFF format generated by the extractFeatures.py script
(mode (b) should be used, as class labels need to be written to the ARFF file as well). Next, go to the classification
tab in Weka, choose the ROCSelect procedure under the weka/classifiers/meta category, configure all the parameters and train a new model.
The model must be saved to the /HuntMi/classifier folder. The default ROCSelect parameters (the same as in the HuntMi classifier)
can be found on the webpage in the ROCSelect.model-cfg file.