GINESTRA: Graph-based Embeddings for Natural Product Classification

Abstract

Natural Products (NPs) represent a rich source of bioactive compounds with high structural diversity and therapeutic potential. Automatic classification of NPs is critical to ensure safety, support regulatory compliance, inform product usage, and enable the discovery of new pharmacologically relevant molecules. However, traditional rule-based approaches and hand-crafted molecular fingerprints often fall short in capturing the structural and biosynthetic complexity of NPs. Graph Neural Networks (GNNs) are well suited to this task, as they can model both the topology and the local chemical environments of molecules. We evaluate multiple GNN architectures on a curated NP dataset and assess their ability to generalize across hierarchical classification targets. Our findings highlight the potential of GNNs as effective tools for NP classification. By leveraging graph-based representations, GNNs offer a scalable, data-driven approach that better reflects the structural and functional complexity of natural products. This work provides methodological guidance and encourages broader adoption of deep learning in natural product research and drug discovery.

Models Implemented

- GCN (Graph Convolutional Network)
- GAT (Graph Attention Network)
- GIN (Graph Isomorphism Network)
- GINE (Graph Isomorphism Network with Edge Features)
- GATE (Graph Attention Network with Edge Features)
- MLP (Multi-Layer Perceptron, as a baseline)
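
To make the difference between GIN and GINE concrete, here is a minimal pure-Python sketch of one message-passing step on a toy three-atom chain. It is not the code in models/; the function name `gin_layer`, the toy features, and the bond vector are hypothetical. It assumes the standard GIN update (sum of neighbour features) and the commonly used GINE variant, in which each message is ReLU(neighbour feature + edge feature).

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


def gin_layer(
    h: Dict[int, Vector],
    edges: List[Tuple[int, int]],
    edge_attr: Optional[Dict[Tuple[int, int], Vector]] = None,
    eps: float = 0.0,
    mlp: Callable[[Vector], Vector] = lambda x: x,
) -> Dict[int, Vector]:
    """One GIN-style update: h_v' = mlp((1 + eps) * h_v + sum of messages).

    Without edge_attr the message from u is h_u (GIN).
    With edge_attr the message is ReLU(h_u + e_uv) (GINE-style).
    """
    dim = len(next(iter(h.values())))
    agg = {v: [0.0] * dim for v in h}
    for u, v in edges:  # directed edge u -> v; list both directions per bond
        msg = h[u]
        if edge_attr is not None:
            e = edge_attr[(u, v)]
            msg = [max(0.0, a + b) for a, b in zip(h[u], e)]
        agg[v] = [a + m for a, m in zip(agg[v], msg)]
    return {
        v: mlp([(1.0 + eps) * a + s for a, s in zip(h[v], agg[v])])
        for v in h
    }


# Toy chain 0 - 1 - 2 with hypothetical 2-dimensional atom features.
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
bond = {e: [0.5, -1.0] for e in edges}  # same hypothetical bond feature everywhere

gin_out = gin_layer(h, edges)
gine_out = gin_layer(h, edges, edge_attr=bond)
print(gin_out[1])   # [2.0, 1.0]
print(gine_out[1])  # [3.0, 1.0]
```

The `mlp` argument defaults to the identity so the aggregation is easy to check by hand; in a trained model it would be a learned network.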

Installation

Clone the repository:

```
git clone https://github.com/YOUR_USERNAME/GINESTRA.git
cd GINESTRA
```

Create a virtual environment (recommended):

```
python -m venv venv
source venv/bin/activate  # On Windows use venv\Scripts\activate
```

Install the required packages:
```
bash ./setup/packages_installer.sh
```

Usage

Configuration: All hyperparameters, dataset paths, and experiment settings can be configured in the config.py file. The PARAM_GRID dictionary defines the search space for manual grid search.
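
As a rough illustration of how a grid like PARAM_GRID expands into individual configurations, the sketch below enumerates the Cartesian product of its values. The keys, the values, and the helper `iter_param_combinations` are hypothetical placeholders; the actual search space is whatever config.py defines.

```python
from itertools import product

# Hypothetical search space; the real keys and values live in config.py.
PARAM_GRID = {
    "hidden_dim": [64, 128],
    "num_layers": [2, 3],
    "lr": [1e-3, 1e-4],
}


def iter_param_combinations(grid):
    """Yield one dict per point of the Cartesian product of the grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


combos = list(iter_param_combinations(PARAM_GRID))
print(len(combos))  # 2 * 2 * 2 = 8 combinations
print(combos[0])    # {'hidden_dim': 64, 'num_layers': 2, 'lr': 0.001}
```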

Running an Experiment: To run an experiment for a specific model, execute its script. For example, to run the GIN model:
```
python GIN_main.py
```
The script will:
- Load the dataset specified in config.py
- Iterate through all hyperparameter combinations in PARAM_GRID
- Train and evaluate the model for N_RUNS with different random seeds
- Apply early stopping
- Save the best model weights, logs, and statistics in experiments/
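
The repeated-runs and early-stopping steps above can be sketched as follows. This is an assumption-laden illustration, not the contents of utils/earlystop.py or GIN_main.py: the `EarlyStopping` class, the constants, and `fake_validation_loss` (a stand-in for a real training and validation epoch) are all hypothetical, and patience-based stopping on validation loss is one common choice rather than a confirmed detail of this repository.

```python
import random

N_RUNS = 3        # assumed value; the real one is set in config.py
MAX_EPOCHS = 100  # hypothetical
PATIENCE = 5      # hypothetical


class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def fake_validation_loss(epoch, rng):
    """Stand-in for one train/validate epoch: decays, then plateaus with noise."""
    return max(0.2, 1.0 - 0.1 * epoch) + rng.uniform(0.0, 0.01)


def run_once(seed):
    """Run one seeded training loop; return (best loss, last epoch index)."""
    rng = random.Random(seed)  # per-run generator keeps each run reproducible
    stopper = EarlyStopping(PATIENCE)
    for epoch in range(MAX_EPOCHS):
        if stopper.step(fake_validation_loss(epoch, rng)):
            break
    return stopper.best, epoch


results = [run_once(seed) for seed in range(N_RUNS)]
best_losses = [loss for loss, _ in results]
print(sum(best_losses) / len(best_losses))  # mean best loss across seeds
```

In a real experiment the seed would also be passed to the deep learning framework, and the model weights would be checkpointed whenever `step` records an improvement.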

Repository Structure

GINESTRA/
├── experiments/      # Output directory for models, logs, and reports
├── data/             # Directory for datasets
├── models/           # GNN model definitions (.py files)
│   ├── GIN.py
│   └── ...
├── utils/            # Utility functions (early stopping, seeding, etc.)
│   ├── earlystop.py
│   └── ...
├── config.py         # Main configuration file
├── GIN_main.py       # Experiment script for GIN
├── GCN_main.py       # Experiment script for GCN
└── README.md         # This file

Citation

If you use this code or the ideas presented in our work, please cite our paper:

@article{moleculargnp2025,
  author  = {Prete, Alessia Lucia and Corradini, Barbara Toniella and Costanti, Filippo and Scarselli, Franco and Bianchini, Monica},
  title   = {Leveraging Molecular Graphs for Natural Product Classification},
  journal = {Computers in Biology and Medicine},
  year    = {2025},
  note    = {Under review}
}

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions about the code or the paper, please contact alessia.prete@example.com or open an issue on GitHub.