Help & Documentation

Learn how to use PFGPred for fusion gene prediction and training

Background of the Project

Traditional gene fusion detection tools often show high false-positive rates in plants because their algorithms were designed for human cancer datasets, whose genomic structure differs from that of plant genomes. PFGPred is a robust deep learning–based model tailored for fusion gene discovery in plants, trained on high-confidence datasets validated with RNA-Seq and Whole-Genome Sequencing (WGS) data.

What are Fusion Genes?

Fusion genes are hybrid genes formed by combining sequences from two or more distinct genes. In plants, they play a critical role in stress response, gene evolution, and the development of novel traits, making their systematic detection essential for understanding gene regulation, adaptation, and crop evolution.

Figure S1: Overview of the PFGPred pipeline and fusion gene detection workflow.

Frequently Asked Questions (FAQ)

Detailed answers and video guides for advanced usage.

1. How to use PFGPred?

PFGPred operates in two main steps:

  • Fusion Detection and Feature Extraction: Use the custom Python script provided on GitHub to detect fusion transcripts from RNA-Seq data and extract the corresponding features. This step generates a feature table in CSV format.
  • Prediction: Upload the generated feature table to the “Prediction” section of the PFGPred web server to obtain fusion gene predictions and confidence scores.

2. What type of input data is required for PFGPred?

PFGPred requires RNA-Seq data in FASTQ format. Both single-end and paired-end reads are supported.

3. Can I use RNA-Seq data from any organism?

PFGPred was trained on Arabidopsis thaliana, Oryza sativa, Zea mays, and Triticum aestivum. Its performance was also tested on Cicer arietinum, Glycine max, and Setaria italica, and the model showed decent performance on all of the species listed here.

4. Does PFGPred require WGS data for fusion validation?

No. Once trained, PFGPred uses only RNA-Seq data for fusion gene detection.

5. What should be the ideal threshold for high-confidence fusion gene prediction?

The default confidence threshold in PFGPred is 0.9, which generally provides a good balance between sensitivity and specificity. However, our evaluations across different species indicate that the optimal threshold may vary depending on the organism. Users are encouraged to adjust the threshold based on their desired stringency level.
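
As a minimal sketch of adjusting the stringency level, the snippet below filters a prediction table at a user-chosen cutoff. The column names `fusion_name` and `confidence`, the sample rows, and the choice to treat the cutoff as inclusive are all assumptions made for illustration; the actual output of the web server may be laid out differently.

```python
import csv
import io

# Hypothetical prediction table. The column names and values are
# assumptions for illustration, not the documented PFGPred output.
SAMPLE_PREDICTIONS = """fusion_name,confidence
GeneA--GeneB,0.97
GeneC--GeneD,0.91
GeneE--GeneF,0.62
"""


def filter_by_threshold(csv_text, threshold=0.9, score_column="confidence"):
    """Return the rows whose score is at or above the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row[score_column]) >= threshold]


# Default cutoff of 0.9, then a stricter cutoff chosen by the user.
default_calls = filter_by_threshold(SAMPLE_PREDICTIONS)
strict_calls = filter_by_threshold(SAMPLE_PREDICTIONS, threshold=0.95)

for row in strict_calls:
    print(row["fusion_name"], row["confidence"])
```

Raising the cutoff keeps fewer, higher-scoring candidates; lowering it keeps more candidates at the cost of more potential false positives.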

6. How can I retrain the model for a new species?

Retraining PFGPred for a new species requires both RNA-Seq and whole-genome sequencing (WGS) data to ensure accurate model calibration. The retraining process involves the following steps:

  1. Fusion Detection and Feature Extraction: Use the custom Python script available on GitHub to detect fusion transcripts from RNA-Seq data and extract corresponding features. This step generates a feature table (CSV format) containing expression-, sequence-, and structure-based features for each candidate fusion.
  2. Validation of Fusion Events Using WGS: Validate the detected fusion events using the WGS validation pipeline (available on GitHub). This step helps distinguish true fusion events from false positives.
  3. Construction of Positive and Negative Datasets:
    • Positive dataset: WGS-validated fusions
    • Negative dataset: Non-WGS-validated fusions
    Use the dataset construction script provided on GitHub to generate the final training data.
  4. Model Training: Upload the positive and negative datasets to the “Prediction & Training” section of the PFGPred web server. The system will train a new model specifically optimized for the target species.
  5. Prediction for the Target Species: Once trained, the new model can be used to predict fusion genes using RNA-Seq data alone, without requiring additional WGS validation.
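
The labeling logic behind step 3 can be sketched as follows. This is not the dataset construction script from GitHub, only an illustration of the idea that WGS-validated candidates become positives and the remaining candidates become negatives. The `fusion_id` column, the sample rows, and the function name are hypothetical.

```python
def label_candidates(feature_rows, validated_ids, id_column="fusion_id"):
    """Copy each feature row and add label=1 if WGS-validated, else label=0."""
    labeled = []
    for row in feature_rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        new_row["label"] = 1 if row[id_column] in validated_ids else 0
        labeled.append(new_row)
    return labeled


# Hypothetical candidates from the feature table (only one feature shown).
candidates = [
    {"fusion_id": "F1", "FFPM": "1.8"},
    {"fusion_id": "F2", "FFPM": "0.3"},
    {"fusion_id": "F3", "FFPM": "2.4"},
]
# Hypothetical identifiers of fusions confirmed by the WGS validation step.
validated = {"F1", "F3"}

labeled = label_candidates(candidates, validated)
positives = [r for r in labeled if r["label"] == 1]
negatives = [r for r in labeled if r["label"] == 0]
print(len(positives), "positive;", len(negatives), "negative")
```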

Data Format Requirements

* Your CSV file must include the following columns for processing.

Key Feature Columns:

  • LeftBreakpoint
  • RightBreakpoint
  • FFPM
  • Total_Count_(SC+RC)
  • Splice_Site
  • Total_Mapped_Reads
  • Left_Exon
  • Right_Exon
  • 5_gene_start
  • 5_gene_end
  • 5_gene_length
  • 3_gene_start
  • 3_gene_end
  • 3_gene_length
  • alternate_junction_count
  • exon_count5
  • exon_count3
  • LeftStrand_+
  • LeftStrand__
  • RightStrand__
  • RightStrand_+
  • Chromosome_Feature_Interchromosomal
  • Chromosome_Feature_Intrachromosomal
  • Same_Strand_No
  • Same_Strand_Yes
  • Reciprocal_Fusion_Yes
  • Reciprocal_Fusion_No
  • Splice_Pattern_InFrame
  • Splice_Pattern_FrameShift
  • Splice_Pattern_Unknown
  • Splice_Pattern_Class_CanonicalPattern
  • Splice_Pattern_Class_NonCanonicalPattern
  • 5_loc_M
  • 5_loc_S
  • 5_loc_E
  • 5_loc_O
  • 3_loc_M
  • 3_loc_S
  • 3_loc_E
  • 3_loc_O
  • alternative_junction_Yes
  • alternative_junction_No

* For training, include a target column (e.g., "label") with 1 for true fusions and 0 for false fusions.
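
Before uploading, it may help to confirm that the CSV header contains every column listed above. The sketch below uses only the Python standard library; the function names are hypothetical, and the target column name `label` follows the example given above (use whatever name your training file actually has).

```python
import csv
import io

# Column names copied from the list above.
REQUIRED_COLUMNS = [
    "LeftBreakpoint", "RightBreakpoint", "FFPM", "Total_Count_(SC+RC)",
    "Splice_Site", "Total_Mapped_Reads", "Left_Exon", "Right_Exon",
    "5_gene_start", "5_gene_end", "5_gene_length",
    "3_gene_start", "3_gene_end", "3_gene_length",
    "alternate_junction_count", "exon_count5", "exon_count3",
    "LeftStrand_+", "LeftStrand__", "RightStrand__", "RightStrand_+",
    "Chromosome_Feature_Interchromosomal",
    "Chromosome_Feature_Intrachromosomal",
    "Same_Strand_No", "Same_Strand_Yes",
    "Reciprocal_Fusion_Yes", "Reciprocal_Fusion_No",
    "Splice_Pattern_InFrame", "Splice_Pattern_FrameShift",
    "Splice_Pattern_Unknown",
    "Splice_Pattern_Class_CanonicalPattern",
    "Splice_Pattern_Class_NonCanonicalPattern",
    "5_loc_M", "5_loc_S", "5_loc_E", "5_loc_O",
    "3_loc_M", "3_loc_S", "3_loc_E", "3_loc_O",
    "alternative_junction_Yes", "alternative_junction_No",
]


def read_header(csv_text):
    """Return the first row (the header) of a CSV document."""
    return next(csv.reader(io.StringIO(csv_text)))


def missing_columns(header, training=False, label_column="label"):
    """List required columns absent from the header, in the order above."""
    required = REQUIRED_COLUMNS + ([label_column] if training else [])
    present = set(header)
    return [name for name in required if name not in present]


# Example: a header that lacks FFPM and has no target column.
example_header = [c for c in REQUIRED_COLUMNS if c != "FFPM"]
csv_text = ",".join(example_header) + "\n"

print(missing_columns(read_header(csv_text)))                 # prediction upload
print(missing_columns(read_header(csv_text), training=True))  # training upload
```

An empty list means the header is complete for that mode. To check a file on disk, read its text with `open(path, newline="")` and pass the contents to `read_header`.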

Contact Support

For technical support or research inquiries, contact the SKLab at shailesh@nipgr.ac.in or support@PFGPred.example.com.
