Plant Fusion Gene Predictor

PFGPred is a machine learning (ML) framework for accurate detection of fusion genes from RNA-Seq data. Its development pipeline comprises data collection and preprocessing, construction of positive and negative datasets, feature extraction, model development, and performance evaluation. First, fusion transcripts were identified from RNA-Seq data and then validated against whole-genome sequencing (WGS) data to distinguish true fusion events from false positives. For both WGS-validated and non-validated fusions, a comprehensive set of features was derived from the RNA-Seq data and the available genome annotation. The prediction model was trained on this dataset using an ensemble approach that integrates Random Forest, XGBoost, and Long Short-Term Memory (LSTM) networks to make predictions more robust. PFGPred was evaluated on independent datasets using standard metrics: sensitivity, specificity, accuracy, precision, F1-score, Matthews correlation coefficient (MCC), and area under the ROC curve (AUC).
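To make the evaluation metrics listed above concrete, here is a minimal sketch of how they can be computed from binary labels (1 = true fusion, 0 = false positive) and predicted scores. It is not taken from the PFGPred code base: the function names `confusion_counts`, `classification_metrics`, and `roc_auc` are hypothetical, and the choice to return 0.0 when a denominator is zero is an assumption made for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = fusion, 0 = non-fusion)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def _safe_div(num, den):
    # Assumption for this sketch: an undefined ratio is reported as 0.0.
    return num / den if den else 0.0


def classification_metrics(y_true, y_pred):
    """Threshold-based metrics computed from the confusion matrix."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _safe_div(tp, tp + fn),  # recall on true fusions
        "specificity": _safe_div(tn, tn + fp),
        "accuracy": _safe_div(tp + tn, tp + tn + fp + fn),
        "precision": _safe_div(tp, tp + fp),
        "f1": _safe_div(2 * tp, 2 * tp + fp + fn),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_den),
    }


def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return 0.0  # AUC is undefined with a single class; 0.0 by assumption
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for illustration; for large datasets a rank-based implementation such as the one in scikit-learn would be the more practical choice.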

PFGPred Graphical Abstract