This repository contains a complete workflow for species distribution modeling using Maximum Entropy (MaxEnt) with sample weighting. This project is designed to predict the potential distribution of invasive species using bioclimatic, topographic, and NDVI variables while accounting for spatial and temporal bias in occurrence data.
This project implements an SDM pipeline that includes:
- Data preparation from multiple sources (GBIF, CABI, research publications)
- Pseudo-absence generation with various strategies (random, biased, biased-land-cover)
- Predictive variable processing (bioclimatic, topographic, NDVI)
- Weighted MaxEnt model training to address sampling bias
- Model evaluation with ROC-AUC and PR-AUC metrics (weighted and unweighted)
- Distribution mapping for historical conditions and future projections
- Variable set comparison for model optimization
SDM-TOOLBOX/
│
├── 0_Iteration.ipynb # Main notebook to run the entire pipeline
├── 00_data-preparation.ipynb # Occurrence data preparation and aggregation from multiple sources
├── 01_specie-distribution.ipynb # Species distribution analysis and train/test splitting
├── 02_pseudo-absence.ipynb # Pseudo-absence point generation
├── 03_Bioclim.ipynb # Bioclimatic data processing
├── 03_predictive-variables.ipynb # Predictive variable preparation
├── 04_variable-statistics.ipynb # Variable descriptive statistics
├── 05_run-model_weight.ipynb # Weighted MaxEnt model training
├── 06_model-evaluation.ipynb # Model performance evaluation
├── 07_Bioclim_mapping.ipynb # Bioclim mapping
└── 08_variable_sets_comparison.ipynb # Variable set comparison
- Aggregates occurrence data from multiple sources:
  - GBIF (Global Biodiversity Information Facility)
  - CABI (Centre for Agriculture and Bioscience International)
  - Research publications (Otieno et al. 2019, Peng et al. 2021, etc.)
  - Field collection data
- Standardizes formats and columns
- Combines all sources into a single dataset
- Visualizes global species distribution
- Filters by geographic regions
- Splits data into training and testing sets (70:30)
- Generates multiple splits for validation (default: 10 iterations)
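The repeated 70:30 splitting described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function name `repeated_splits` and its arguments are hypothetical, and it splits purely at random without any spatial blocking.

```python
import numpy as np

def repeated_splits(n_records, n_iterations=10, train_fraction=0.7, seed=42):
    """Yield (train_idx, test_idx) index arrays for repeated random splits."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_fraction * n_records))
    for _ in range(n_iterations):
        # A fresh shuffle per iteration gives a different 70:30 partition each time
        order = rng.permutation(n_records)
        yield order[:n_train], order[n_train:]

# Example: 200 occurrence records, 10 independent splits
splits = list(repeated_splits(n_records=200))
train_idx, test_idx = splits[0]
```

Each pair of index arrays can then be used to select rows from the occurrence table for one train/evaluate iteration.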
Supported strategies:
- Random: Random points across the entire region
- Biased: Random points with bias toward easily accessible areas
- Biased-land-cover: Points with bias based on land cover type (prioritizing forests)
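To make the difference between the strategies concrete, the sketch below draws background points from a set of candidate cells, either uniformly (random) or with probability proportional to a bias value (biased and biased-land-cover). The helper `sample_background`, the four-cell grid, and the 3:1 forest weighting are all assumptions made for illustration; the notebook's actual bias surfaces may be built differently.

```python
import numpy as np

def sample_background(lon, lat, count, bias=None, seed=42):
    """Draw background points from candidate cells.

    bias=None draws uniformly; otherwise each cell is drawn with
    probability proportional to its bias value.
    """
    rng = np.random.default_rng(seed)
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    prob = None
    if bias is not None:
        bias = np.asarray(bias, dtype=float)
        prob = bias / bias.sum()
    # Sampling with replacement because this toy grid is tiny; on a real
    # raster with many cells, replace=False would avoid duplicate points.
    idx = rng.choice(lon.size, size=count, replace=True, p=prob)
    return lon[idx], lat[idx], idx

# Hypothetical candidate cells, half of them forest
lon = np.array([100.0, 100.5, 101.0, 101.5])
lat = np.array([1.0, 1.0, 1.5, 1.5])
is_forest = np.array([True, False, True, False])
forest_bias = np.where(is_forest, 3.0, 1.0)  # assumed weighting, not from the project

_, _, random_idx = sample_background(lon, lat, count=2000)
_, _, biased_idx = sample_background(lon, lat, count=2000, bias=forest_bias)
forest_share_random = is_forest[random_idx].mean()
forest_share_biased = is_forest[biased_idx].mean()
```

With these weights the biased draw places a larger share of points in forest cells than the uniform draw does.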
- Bioclimatic Variables: 19 WorldClim variables (bio1-bio19)
- Topography: Elevation from SRTM
- NDVI: Normalized Difference Vegetation Index
- Support for historical data and future projections (RCP 8.5)
- Unit conversion (Kelvin to Celsius) and CRS adjustment
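One detail worth illustrating for the unit conversion: only bioclimatic layers that hold absolute temperatures need the 273.15 offset. Ranges (bio2, bio7), isothermality (bio3), and temperature seasonality (bio4) are differences or ratios and are unaffected by the offset. The sketch below assumes the inputs are in Kelvin, as is typical of climate-model output; the function name and the dictionary layout are hypothetical and not taken from the notebooks.

```python
import numpy as np

KELVIN_OFFSET = 273.15
# Bioclim layers that store absolute temperatures
ABSOLUTE_TEMPERATURE = {1, 5, 6, 8, 9, 10, 11}

def kelvin_to_celsius(layers):
    """Convert absolute-temperature layers from Kelvin to Celsius.

    layers: dict mapping bioclim index -> array of values.
    Layers that are not absolute temperatures are copied unchanged.
    """
    converted = {}
    for index, values in layers.items():
        values = np.asarray(values, dtype=float)
        if index in ABSOLUTE_TEMPERATURE:
            converted[index] = values - KELVIN_OFFSET
        else:
            converted[index] = values.copy()
    return converted

# Toy values: bio1 (annual mean temperature, K), bio7 (annual range), bio12 (precipitation)
layers = {
    1: np.array([298.15, 300.15]),
    7: np.array([12.0, 9.5]),
    12: np.array([2100.0, 1850.0]),
}
celsius = kelvin_to_celsius(layers)
```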
Weighted MaxEnt Features:
- Distance-based weighting: Reduces influence of overly clustered points
- Source-based weighting: Adjusts for data quality from different sources
- Temporal weighting: Gives higher weight to more recent data
- Model configuration:
  - Transform: Logistic
  - Beta multiplier: 1.5 (regularization)
  - Feature types: Linear, Hinge, Product
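The three weighting schemes listed above can be combined into a single per-record weight. The sketch below is one plausible formulation, not necessarily the one used in 05_run-model_weight.ipynb: it weights each point by its nearest-neighbour distance, applies an exponential decay with record age, and multiplies by a per-source quality factor. The half-life, the `SOURCE_QUALITY` values, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def distance_weights(xy):
    """Weight each point by the distance to its nearest neighbour (mean 1).

    Uses planar distance in degrees, which is only an approximation.
    Exact duplicate coordinates would receive weight 0.
    """
    xy = np.asarray(xy, dtype=float)
    diff = xy[:, None, :] - xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # ignore each point's distance to itself
    nearest = dist.min(axis=1)
    return nearest / nearest.mean()

def temporal_weights(years, reference_year, half_life=10.0):
    """Halve the weight for every `half_life` years of record age."""
    age = reference_year - np.asarray(years, dtype=float)
    return 0.5 ** (age / half_life)

# Hypothetical quality factors per data source
SOURCE_QUALITY = {"field": 1.0, "publication": 0.9, "gbif": 0.7, "cabi": 0.7}

def combined_weights(xy, years, sources, reference_year):
    """Multiply the three weights and rescale so the mean weight is 1."""
    w = distance_weights(xy) * temporal_weights(years, reference_year)
    w = w * np.array([SOURCE_QUALITY[s] for s in sources])
    return w / w.mean()

# Three tightly clustered records and one isolated record
xy = [(100.00, 1.00), (100.01, 1.00), (100.00, 1.01), (104.00, 3.00)]
years = [2020, 2020, 2020, 2020]
sources = ["gbif"] * 4
weights = combined_weights(xy, years, sources, reference_year=2020)
```

In this example the isolated record receives a much larger weight than any of the clustered ones, which is the intended effect of distance-based weighting. The resulting array would then be supplied as per-sample weights when fitting the model.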
Calculated metrics:
- ROC-AUC: Area Under ROC Curve (weighted and unweighted)
- PR-AUC: Precision-Recall AUC (weighted and unweighted)
- Permutation Importance: Importance of each predictive variable
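To show what "weighted" means for ROC-AUC, here is a small NumPy sketch that computes it as the weighted fraction of presence/background pairs ranked correctly, with ties counted as half. In practice scikit-learn's `roc_auc_score` and `average_precision_score` accept a `sample_weight` argument and would normally be used instead; the function `weighted_roc_auc` below is a hypothetical stand-in written only to make the calculation explicit.

```python
import numpy as np

def weighted_roc_auc(y_true, scores, sample_weight=None):
    """Weighted probability that a presence outscores a background point."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    if sample_weight is None:
        w = np.ones(scores.size)
    else:
        w = np.asarray(sample_weight, dtype=float)
    pos_s, neg_s = scores[y_true], scores[~y_true]
    pos_w, neg_w = w[y_true], w[~y_true]
    # Compare every presence with every background point
    wins = (pos_s[:, None] > neg_s[None, :]).astype(float)
    ties = (pos_s[:, None] == neg_s[None, :]).astype(float)
    pair_w = pos_w[:, None] * neg_w[None, :]
    return float((pair_w * (wins + 0.5 * ties)).sum() / pair_w.sum())

y = [1, 1, 0, 0]           # 1 = presence, 0 = background
s = [0.9, 0.4, 0.5, 0.1]   # model scores
w = [1.0, 3.0, 1.0, 1.0]   # the poorly ranked presence carries more weight

auc_unweighted = weighted_roc_auc(y, s)     # 0.75
auc_weighted = weighted_roc_auc(y, s, w)    # 0.625
```

Because the presence scored 0.4 is ranked below one background point and carries three times the weight, the weighted AUC drops below the unweighted value. The pairwise comparison uses memory proportional to the number of presence/background pairs, so it suits small illustrations more than very large datasets.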
- Relative Occurrence Probability (ROP) maps
- Historical vs. future prediction comparison
- Maps of potential distribution changes
- Evaluates performance of various bioclimatic variable combinations
- Identifies optimal variable sets
Edit the 0_Iteration.ipynb notebook to set parameters:
# Species selection
specie = 'leptocybe-invasa' # or 'thaumastocoris-peregrinus'
# Geographic regions
training = 'south-east-asia' # Region for training
interest = 'south-east-asia' # Region for prediction/testing
# Pseudo-absence parameters
pseudoabsence = 'biased-land-cover' # 'random', 'biased', or 'biased-land-cover'
count = 10000 # Number of background points
# Spatial resolution
ref_res = (0.01, 0.01) # degrees (~1.1 km at the equator)
# Bioclimatic variables
bioclim = list(range(1, 20)) # All 19 variables (bio1-bio19)
# Additional variables
topo1 = True # Include topography
ndvi1 = True # Include NDVI
# Number of iterations
n_iteration = 10
Run the main notebook:
jupyter notebook 0_Iteration.ipynb
The main notebook will automatically execute other notebooks in sequence:
- 00_data-preparation.ipynb
- 01_specie-distribution.ipynb
- 02_pseudo-absence.ipynb
- 03_Bioclim.ipynb
- 03_predictive-variables.ipynb
- 04_variable-statistics.ipynb
- 05_run-model_weight.ipynb
- 06_model-evaluation.ipynb
- 07_Bioclim_mapping.ipynb
- 08_variable_sets_comparison.ipynb
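For running the same sequence outside the main notebook, one option is to execute each notebook in order with `jupyter nbconvert`. This is an alternative sketched under assumptions, not a description of how 0_Iteration.ipynb works: nbconvert starts a fresh kernel per notebook, so parameters set in the main notebook would not carry over and each notebook would need to define or load its own settings. The helper names are hypothetical.

```python
import subprocess

NOTEBOOKS = [
    "00_data-preparation.ipynb",
    "01_specie-distribution.ipynb",
    "02_pseudo-absence.ipynb",
    "03_Bioclim.ipynb",
    "03_predictive-variables.ipynb",
    "04_variable-statistics.ipynb",
    "05_run-model_weight.ipynb",
    "06_model-evaluation.ipynb",
    "07_Bioclim_mapping.ipynb",
    "08_variable_sets_comparison.ipynb",
]

def build_command(notebook):
    """Command that executes a notebook and saves the outputs in place."""
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", notebook]

def run_pipeline(notebooks=NOTEBOOKS, dry_run=True):
    """Return the commands in order; run them only when dry_run is False."""
    commands = [build_command(nb) for nb in notebooks]
    if not dry_run:
        for command in commands:
            # check=True stops the pipeline at the first failing notebook
            subprocess.run(command, check=True)
    return commands

commands = run_pipeline()  # dry run: builds the commands without executing them
```

Calling `run_pipeline(dry_run=False)` from the repository root would execute the notebooks one after another.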
- pandas: Data manipulation
- geopandas: Geospatial data
- numpy: Numerical computing
- xarray & rioxarray: Multi-dimensional raster data
- elapid: Species Distribution Modeling
- scikit-learn: Model metrics and evaluation
- matplotlib & cartopy: Visualization and mapping
- dask: Parallel computing (optional)
- Species occurrence data (CSV with lat/lon columns)
- Bioclimatic rasters (WorldClim or CORDEX)
- DEM/SRTM for topography
- NDVI data (optional)
- Land cover data for land-cover-based pseudo-absence
out/
└── {specie}/
├── input/
│ ├── train/ # Training data (presence & background)
│ ├── test/ # Testing data
│ └── worldclim/ # Predictive variable rasters
└── output/
└── exp_{model}_{pseudoabsence}_{region}_{topo}_{ndvi}/
├── *.ela # Trained models
├── *.nc # Predictions in NetCDF format
├── *.tif # Predictions in GeoTIFF format
└── *.csv # Input data and evaluation results
- Distribution maps: PNG files in the figs/ and docs/ folders
- Leptocybe invasa (Gall wasp on Eucalyptus)
- Thaumastocoris peregrinus (Bronze bug on Eucalyptus)
The weighted MaxEnt model addresses several limitations of standard SDM:
- Spatial bias: Reduces influence of oversampled areas
- Temporal bias: Gives higher weight to more recent data
- Data quality: Adjusts weights based on data source
- Multi-source integration: Combines data from various sources with different quality levels
- Multiple train/test splits (default: 10 iterations)
- Evaluation on both training and testing data
- Weighted metrics for more realistic accuracy assessment
- Data Paths: Ensure paths to bioclimatic data and other rasters are correct in the 03_Bioclim.ipynb notebook
- Memory: This process requires sufficient memory for large raster data
- Dask Cluster: For parallel computing, a Dask cluster is initialized in the main notebook
- Reproducibility: Use random_state=42 for reproducible results
Managing risk in SE Asian forest biosecurity – supporting evidence-based standards for best practice (ACIAR FST/2018/179), sub-project 'Climate change modelling: Building biosecurity capacity for adaptation and changing risk'