From 9532c0303994fc5fba60d3c9e89208d54313a20f Mon Sep 17 00:00:00 2001 From: imilev Date: Fri, 4 Jul 2025 15:34:36 +0300 Subject: [PATCH] Added high-level diagrams --- .codeboarding/Configuration_System.md | 113 ++++++++++++++++++ .codeboarding/Data_Pipeline.md | 159 +++++++++++++++++++++++++ .codeboarding/Model_Core_GNNs_.md | 87 ++++++++++++++ .codeboarding/Utility_Services.md | 39 ++++++ .codeboarding/on_boarding.md | 165 ++++++++++++++++++++++++++ 5 files changed, 563 insertions(+) create mode 100644 .codeboarding/Configuration_System.md create mode 100644 .codeboarding/Data_Pipeline.md create mode 100644 .codeboarding/Model_Core_GNNs_.md create mode 100644 .codeboarding/Utility_Services.md create mode 100644 .codeboarding/on_boarding.md diff --git a/.codeboarding/Configuration_System.md b/.codeboarding/Configuration_System.md new file mode 100644 index 0000000..49008b0 --- /dev/null +++ b/.codeboarding/Configuration_System.md @@ -0,0 +1,113 @@ +```mermaid + +graph LR + + Configuration_Files_Repository["Configuration Files Repository"] + + Configuration_Loader_Parser["Configuration Loader/Parser"] + + Pretraining_Script["Pretraining Script"] + + Downstream_Task_Script["Downstream Task Script"] + + Configuration_Files_Repository -- "Provides Configuration Data To" --> Configuration_Loader_Parser + + Configuration_Loader_Parser -- "Provides Parsed Configuration To" --> Pretraining_Script + + Configuration_Loader_Parser -- "Provides Parsed Configuration To" --> Downstream_Task_Script + +``` + + + +[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org) + + + +## Details + + + +The Configuration System is a 
cornerstone of this Deep Learning Research Framework, providing a robust and flexible mechanism for managing all experimental parameters. It adheres to the "Configuration-Driven Development" architectural pattern, ensuring reproducibility, ease of experimentation, and clear separation of concerns. + + + +### Configuration Files Repository + +This component serves as the central, organized storage for all experimental parameters, model architectures, hyperparameters, dataset paths, and training settings. It is meticulously structured into `pretrain` and `downstream` subdirectories, with further categorization by specific tasks (e.g., EC, Fold3D, GO-BP) and model types. This hierarchical organization is crucial for managing the complexity of deep learning experiments, promoting reusability of configurations, and facilitating reproducible research by providing a single source of truth for all experimental setups. + + + + + +**Related Classes/Methods**: + + + +- `config/downstream/EC/BERT.yaml` (1:1) + +- `config/pretrain/mc_gearnet_edge.yaml` (1:1) + + + + + +### Configuration Loader/Parser + +This component is responsible for programmatically reading, parsing, and transforming the raw YAML data from the `Configuration Files Repository` into structured, accessible objects. It likely leverages libraries such as `pyyaml` for parsing the YAML syntax and `easydict` to provide convenient dictionary-like access to the loaded configurations. This transformation is critical as it converts static configuration files into dynamic objects that can be easily consumed and manipulated by other parts of the system, such as training and evaluation scripts. + + + + + +**Related Classes/Methods**: + + + +- `util.py` (1:1) + + + + + +### Pretraining Script + +This script orchestrates the pretraining phase of protein representation models. 
It is a primary consumer of the configurations, relying on the `Configuration Loader/Parser` to load specific settings related to model architecture, dataset paths, training schedules, optimization parameters, and other pretraining-specific configurations. This component embodies the "Training Loop Pattern" and uses the loaded configurations to initialize and execute the pretraining process. + + + + + +**Related Classes/Methods**: + + + +- `script/pretrain.py` (1:1) + + + + + +### Downstream Task Script + +This script manages the execution and evaluation of models on various downstream biological tasks (e.g., Enzyme Commission (EC) number prediction, 3D fold prediction, Gene Ontology (GO) term prediction). Similar to the pretraining script, it heavily relies on the `Configuration Loader/Parser` to load task-specific settings, including model configurations, dataset paths for the specific task, evaluation metrics, and fine-tuning parameters. This component also follows the "Training Loop Pattern" but is tailored for task-specific fine-tuning and evaluation. 
+ + + + + +**Related Classes/Methods**: + + + +- `script/downstream.py` (1:1) + + + + + + + + + +### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq) \ No newline at end of file diff --git a/.codeboarding/Data_Pipeline.md b/.codeboarding/Data_Pipeline.md new file mode 100644 index 0000000..cb5cc36 --- /dev/null +++ b/.codeboarding/Data_Pipeline.md @@ -0,0 +1,159 @@ +```mermaid + +graph LR + + Data_Pipeline["Data Pipeline"] + + Model_Training_Evaluation["Model Training & Evaluation"] + + Model_Architecture["Model Architecture"] + + Configuration_Manager["Configuration Manager"] + + Experiment_Tracking_Checkpointing["Experiment Tracking & Checkpointing"] + + Data_Pipeline -- "provides data to" --> Model_Training_Evaluation + + Data_Pipeline -- "receives configuration from" --> Configuration_Manager + + Model_Training_Evaluation -- "consumes data from" --> Data_Pipeline + + Model_Training_Evaluation -- "uses" --> Model_Architecture + + Model_Training_Evaluation -- "receives parameters from" --> Configuration_Manager + + Model_Training_Evaluation -- "outputs to" --> Experiment_Tracking_Checkpointing + + Model_Architecture -- "is used by" --> Model_Training_Evaluation + + Model_Architecture -- "expects input from" --> Data_Pipeline + + Configuration_Manager -- "provides config to" --> Data_Pipeline + + Configuration_Manager -- "provides config to" --> Model_Training_Evaluation + + Configuration_Manager -- "provides config to" --> Model_Architecture + + Experiment_Tracking_Checkpointing -- "receives metrics and checkpoints from" --> Model_Training_Evaluation + + Experiment_Tracking_Checkpointing -- "provides checkpoints to" --> Model_Training_Evaluation + + click Data_Pipeline href "https://github.com/DeepGraphLearning/GearNet/blob/main/.codeboarding//Data_Pipeline.md" "Details" + +``` + + + 
+[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org) + + + +## Details + + + +This project is a Deep Learning Research Framework/Library for Protein Representation Learning. The Data Pipeline component is central to its functionality, handling all aspects of data preparation for protein representation learning, from raw data acquisition to structured protein graphs and dataset splitting. + + + +### Data Pipeline [[Expand]](./Data_Pipeline.md) + +Manages the entire data lifecycle, including loading, preprocessing, featurization, and dataset splitting for protein data. + + + + + +**Related Classes/Methods**: + + + +- `gearnet.dataset` (1:1) + + + + + +### Model Training & Evaluation + +Orchestrates the training loops, model optimization, validation, and evaluation of protein representation models. + + + + + +**Related Classes/Methods**: + + + +- `scripts.train` (1:1) + +- `gearnet.model` (1:1) + + + + + +### Model Architecture + +Defines the neural network architectures used for protein representation learning (e.g., graph neural networks). + + + + + +**Related Classes/Methods**: + + + +- `gearnet.model` (1:1) + +- `gearnet.layer` (1:1) + + + + + +### Configuration Manager + +Handles loading, parsing, and managing project configurations (e.g., model hyperparameters, dataset paths, training settings) from YAML files. + + + + + +**Related Classes/Methods**: + + + +- `utils.config` (1:1) + + + + + +### Experiment Tracking & Checkpointing + +Manages the logging of training metrics, saving model checkpoints, and potentially resuming training. 
+ + + + + +**Related Classes/Methods**: + + + +- `utils.checkpoint` (1:1) + +- `scripts.train` (1:1) + + + + + + + + + +### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq) \ No newline at end of file diff --git a/.codeboarding/Model_Core_GNNs_.md b/.codeboarding/Model_Core_GNNs_.md new file mode 100644 index 0000000..4aa93be --- /dev/null +++ b/.codeboarding/Model_Core_GNNs_.md @@ -0,0 +1,87 @@ +```mermaid + +graph LR + + gearnet_layer_IEConvLayer["gearnet.layer.IEConvLayer"] + + gearnet_layer_GeometricRelationalGraphConv["gearnet.layer.GeometricRelationalGraphConv"] + + gearnet_model_GearNetIEConv["gearnet.model.GearNetIEConv"] + + gearnet_model_FusionNetwork["gearnet.model.FusionNetwork"] + + gearnet_model_GearNetIEConv -- "uses" --> gearnet_layer_IEConvLayer + + gearnet_model_GearNetIEConv -- "uses" --> gearnet_layer_GeometricRelationalGraphConv + + gearnet_model_FusionNetwork -- "composes" --> gearnet_model_GearNetIEConv + +``` + + + +[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org) + + + +## Details + + + +The `gearnet` subsystem implements graph neural networks for protein representation learning. It comprises layers that perform specific graph operations and models that combine these layers to process protein structures. + + + +### gearnet.layer.IEConvLayer + +This component implements an IEConv (intrinsic-extrinsic convolution) layer. It is a building block for processing graph-structured data, designed to incorporate interaction information between neighboring nodes (e.g., atoms or residues in a protein). 
+ + + + + +**Related Classes/Methods**: _None_ + + + +### gearnet.layer.GeometricRelationalGraphConv + +This component implements a Geometric Relational Graph Convolutional layer. It is the other core layer type within the GearNet architecture and incorporates both geometric and relational information during graph convolutions. + + + + + +**Related Classes/Methods**: _None_ + + + +### gearnet.model.GearNetIEConv + +This is the GearNet model variant augmented with IEConv (intrinsic-extrinsic convolution) layers. It stacks multiple `GeometricRelationalGraphConv` layers and optionally `IEConvLayer` instances to build a deep graph neural network for protein representation learning. + + + + + +**Related Classes/Methods**: _None_ + + + +### gearnet.model.FusionNetwork + +This component is a higher-level model designed to combine the outputs of two distinct models: a `sequence_model` and a `structure_model`. It is likely used for multi-modal learning, integrating information from different representations (e.g., sequence data and structural data of a protein). 
+ + + + + +**Related Classes/Methods**: _None_ + + + + + + + +### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq) \ No newline at end of file diff --git a/.codeboarding/Utility_Services.md b/.codeboarding/Utility_Services.md new file mode 100644 index 0000000..d2ab4b8 --- /dev/null +++ b/.codeboarding/Utility_Services.md @@ -0,0 +1,39 @@ +```mermaid + +graph LR + + util["util"] + +``` + + + +[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org) + + + +## Details + + + +`util` appears to be a single module of helper functions rather than a package with a complex class structure. Further examination of its file structure is needed to understand its contents and functionality. + + + +### util + +A module containing helper functions. 
+ + + + + +**Related Classes/Methods**: _None_ + + + + + + + +### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq) \ No newline at end of file diff --git a/.codeboarding/on_boarding.md b/.codeboarding/on_boarding.md new file mode 100644 index 0000000..9a33bca --- /dev/null +++ b/.codeboarding/on_boarding.md @@ -0,0 +1,165 @@ +```mermaid + +graph LR + + Configuration_System["Configuration System"] + + Data_Pipeline["Data Pipeline"] + + Model_Core_GNNs_["Model Core (GNNs)"] + + Training_Evaluation_Engine["Training & Evaluation Engine"] + + Utility_Services["Utility Services"] + + Configuration_System -- "Defines Parameters For" --> Data_Pipeline + + Configuration_System -- "Defines Parameters For" --> Model_Core_GNNs_ + + Configuration_System -- "Defines Parameters For" --> Training_Evaluation_Engine + + Data_Pipeline -- "Provides Data To" --> Training_Evaluation_Engine + + Data_Pipeline -- "Configured By" --> Configuration_System + + Model_Core_GNNs_ -- "Used By" --> Training_Evaluation_Engine + + Model_Core_GNNs_ -- "Configured By" --> Configuration_System + + Training_Evaluation_Engine -- "Consumes Data From" --> Data_Pipeline + + Training_Evaluation_Engine -- "Trains/Evaluates" --> Model_Core_GNNs_ + + Training_Evaluation_Engine -- "Utilizes" --> Utility_Services + + Utility_Services -- "Loads From" --> Configuration_System + + Utility_Services -- "Supports" --> Training_Evaluation_Engine + + click Configuration_System href "https://github.com/DeepGraphLearning/GearNet/blob/main/.codeboarding//Configuration_System.md" "Details" + + click Data_Pipeline href "https://github.com/DeepGraphLearning/GearNet/blob/main/.codeboarding//Data_Pipeline.md" "Details" + + click Model_Core_GNNs_ href "https://github.com/DeepGraphLearning/GearNet/blob/main/.codeboarding//Model_Core_GNNs_.md" "Details" + + click Utility_Services href "https://github.com/DeepGraphLearning/GearNet/blob/main/.codeboarding//Utility_Services.md" "Details" + 
+``` + + + +[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org) + + + +## Details + + + +High-level data flow overview for GearNet, a Deep Learning Research Framework/Library for Protein Representation Learning. + + + +### Configuration System [[Expand]](./Configuration_System.md) + +This component serves as the central control for defining and managing all experimental parameters. It leverages YAML files to specify model hyperparameters, dataset paths, training schedules, and task-specific settings, ensuring flexible and reproducible experiment configurations. + + + + + +**Related Classes/Methods**: + + + +- `config` (1:1) + +- `util` (1:1) + + + + + +### Data Pipeline [[Expand]](./Data_Pipeline.md) + +Responsible for the entire data lifecycle, from acquiring and preprocessing raw protein data (e.g., HDF5 files) to parsing, featurizing atoms and residues, and transforming them into structured protein graphs. It also handles dataset splitting for training, validation, and testing. + + + + + +**Related Classes/Methods**: + + + +- `gearnet.dataset` (1:1) + + + + + +### Model Core (GNNs) [[Expand]](./Model_Core_GNNs_.md) + +This component implements the fundamental graph neural network layers and orchestrates their combination into complete deep learning models for protein representation. It defines how information propagates across protein graphs, incorporating geometric and relational inductive biases. 
+ + + + + +**Related Classes/Methods**: + + + +- `gearnet.layer` (1:1) + +- `gearnet.model` (1:1) + + + + + +### Training & Evaluation Engine + +Manages the entire model lifecycle, encompassing pre-training, fine-tuning, and evaluation on various protein-related tasks. It sets up and executes training loops, handles validation, testing, metric evaluation, learning rate scheduling, and model checkpointing, often leveraging `torchdrug` solvers. + + + + + +**Related Classes/Methods**: + + + +- `script.pretrain` (1:1) + +- `script.downstream` (1:1) + +- `util` (1:1) + + + + + +### Utility Services [[Expand]](./Utility_Services.md) + +Provides foundational helper functions essential for the framework's operation. This includes setting up logging, managing working directories, parsing command-line arguments, and dynamically constructing core framework elements like `torchdrug` solvers and learning rate schedulers based on the loaded configuration. + + + + + +**Related Classes/Methods**: + + + +- `util` (1:1) + + + + + + + + + +### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq) \ No newline at end of file