Proposing a Method for Creating Test Data for Accuracy Assessment #431
Greetings @Cenan-Alhassan, and happy new year! Thank you for your proposal, I really appreciate your spirit of contribution! The tutorial you watched also has a web page (https://fromgistors.blogspot.com/2019/09/Accuracy-Assessment-of-Land-Cover-Classification.html) which describes the design of sampling units. In general, validation requires data that is not used for training the classification algorithm. I see that you are separating the polygons into training and testing data, which is similar to what is implemented in machine learning algorithms (such as the Multi-Layer Perceptron) to tune the algorithm.
I think the problem with the method you are proposing is that, even if you exclude the validation pixels from the training, you don't consider the real class area proportion (which can be different from the polygon area proportion), and the spatial distribution of samples is concentrated inside the input polygons (i.e. there could be spatial correlation between training and testing pixels, unless you have a very large number of input polygons randomly distributed across the image), which could increase the uncertainty of the accuracy metrics. If you haven't already, I invite you to read Olofsson et al. (2014), which describes the sample size design and the calculation of uncertainty (confidence intervals) of the metrics.
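As a concrete companion to that reference, below is a minimal sketch of the overall sample size formula for stratified random sampling from Olofsson et al. (2014, Eq. 13). The class proportions and conjectured user's accuracies in the example are illustrative placeholders, not values from any particular map.

```python
# Sketch of the overall sample size for stratified random sampling,
# Olofsson et al. (2014), Eq. 13:  n = ( sum_i(W_i * S_i) / S(O) )^2
#   W_i  : mapped area proportion of class i
#   S_i  : sqrt(U_i * (1 - U_i)), with U_i a conjectured user's accuracy for class i
#   S(O) : target standard error of the estimated overall accuracy
import math

def stratified_sample_size(W, U, target_se=0.01):
    assert abs(sum(W) - 1.0) < 1e-6, "mapped area proportions must sum to 1"
    S = [math.sqrt(u * (1.0 - u)) for u in U]                  # per-stratum standard deviations
    n = (sum(w * s for w, s in zip(W, S)) / target_se) ** 2    # Eq. 13
    return math.ceil(n)

# Illustrative values: four classes with mapped proportions W and guessed user's accuracies U
W = [0.02, 0.015, 0.320, 0.645]
U = [0.70, 0.60, 0.90, 0.95]
print(stratified_sample_size(W, U, target_se=0.01))  # ~641 sampling units in total
```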
-
Greetings, Mr Congedo.
I would like to propose a method for creating testing data for accuracy assessment.
Proposal
I have been working with the SCP plugin for a while, and one of the things I have needed to do a lot is create testing data to use for accuracy assessment.
As far as I am aware, there is no native method to create testing data using SCP. I came across a video on your channel where you created testing data: https://www.youtube.com/watch?v=H1cL0yhIygg
From what I understood, you created a certain number of pixel-sized vectors within each class of the classified raster and then manually assigned a class label to each vector. These labelled vectors were fed into the accuracy assessment.
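To illustrate that kind of workflow, here is a minimal sketch (not the plugin's actual code) that draws a fixed number of random pixels from each class of a classified raster and writes them out as pixel-sized squares to be labelled by hand. The function name, the n_per_class parameter, and the use of rasterio/geopandas are assumptions for illustration only, and a north-up raster is assumed.

```python
# Hypothetical sketch: draw N random pixels per class from a classified raster and
# save them as pixel-sized square polygons for manual labelling (assumes a north-up raster).
import numpy as np
import rasterio
import geopandas as gpd
from shapely.geometry import box

def random_pixel_squares(classified_raster, n_per_class=50, seed=0):
    rng = np.random.default_rng(seed)
    with rasterio.open(classified_raster) as src:
        band = src.read(1)
        classes = np.unique(band)
        if src.nodata is not None:
            classes = classes[classes != src.nodata]
        records = []
        for value in classes:
            rows, cols = np.nonzero(band == value)
            if len(rows) == 0:
                continue
            pick = rng.choice(len(rows), size=min(n_per_class, len(rows)), replace=False)
            for r, c in zip(rows[pick], cols[pick]):
                x0, y0 = src.transform * (c, r)          # upper-left corner of the pixel
                x1, y1 = src.transform * (c + 1, r + 1)  # lower-right corner of the pixel
                records.append({"map_class": int(value), "geometry": box(x0, y1, x1, y0)})
        return gpd.GeoDataFrame(records, geometry="geometry", crs=src.crs)

# e.g. random_pixel_squares("classification.tif").to_file("test_samples.gpkg")
```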
For my personal use, I wrote a script that creates testing data automatically. Let's say we have polygons that represent two classes (blue and green):
The script performs stratified random sampling on the polygons by separating a percentage of each polygon into testing data, while keeping the rest as training data.
For example, we can perform the split on the aforementioned polygons with two classes. A random area of 30% of each polygon is split into testing data.
Training polygons:

Testing polygons:

As you can see, each testing polygon retains the same class as the polygon it was split from. Using the overlap function between the original polygon and the testing polygon, we find that the overlap for all three polygons is roughly 30%, as intended (see bottom right):
The area of each testing polygon is composed of squares. Each square is the size of a pixel in the classification raster and sits on the same grid as the raster. This allows the testing vector to overlap the classified raster directly for proper accuracy assessment.
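To make the idea concrete, the following is a minimal sketch of such a split (not the actual script): for each labelled polygon, the pixel-sized squares lying on the raster grid are enumerated, roughly the requested fraction of them is moved into a testing polygon, and the remainder stays as the training polygon. The field name class_id, the explicit grid origin, and the use of geopandas/shapely are assumptions for illustration.

```python
# Hypothetical sketch: split each labelled polygon into testing squares (aligned to the
# raster grid) and a remaining training polygon.
import random
import geopandas as gpd
from shapely.geometry import box
from shapely.ops import unary_union

def split_training_testing(polygons_path, pixel_size, origin_x, origin_y,
                           test_fraction=0.3, class_field="class_id", seed=0):
    random.seed(seed)
    gdf = gpd.read_file(polygons_path)
    train_rows, test_rows = [], []
    for _, row in gdf.iterrows():
        geom = row.geometry
        minx, miny, maxx, maxy = geom.bounds
        # grid cell indices covering the polygon's bounds, snapped to the raster grid
        i0, i1 = int((minx - origin_x) // pixel_size), int((maxx - origin_x) // pixel_size) + 1
        j0, j1 = int((miny - origin_y) // pixel_size), int((maxy - origin_y) // pixel_size) + 1
        cells = []
        for i in range(i0, i1):
            for j in range(j0, j1):
                cell = box(origin_x + i * pixel_size, origin_y + j * pixel_size,
                           origin_x + (i + 1) * pixel_size, origin_y + (j + 1) * pixel_size)
                if geom.contains(cell):  # keep only squares fully inside the polygon
                    cells.append(cell)
        random.shuffle(cells)
        n_test = round(len(cells) * test_fraction)
        if n_test > 0:
            test_geom = unary_union(cells[:n_test])
            test_rows.append({class_field: row[class_field], "geometry": test_geom})
            geom = geom.difference(test_geom)  # remove the testing squares from the training polygon
        train_rows.append({class_field: row[class_field], "geometry": geom})
    return (gpd.GeoDataFrame(train_rows, geometry="geometry", crs=gdf.crs),
            gpd.GeoDataFrame(test_rows, geometry="geometry", crs=gdf.crs))
```

As a sanity check mirroring the overlap comparison above, the total area of the testing output divided by the total area of the input polygons should come out close to the requested fraction.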
Benefits Over Old Method
I believe this method has two very big benefits:
Inclusivity:
Since the testing polygons are derived from the training polygons, they will include every variation that is captured in the training polygons. So, assuming your training data is thorough and captures all the variations within each class, your testing data will as well. You can be assured that the testing data does not overlap the training data, but still includes all the spectral variations.
Convenience:
Creating the testing data follows the exact same process as creating the training data, instead of having to create the testing data separately, and the labelling is automatic.
Integrating the Script into SCP
In order to use this method at the moment, I need to first convert the .scpx training data to a standard vector, run the processing script, and then convert the training polygons back into .scpx for use with the ML models. The following are the parameters for the processing script:
It requires you to input the training data, the class IDs for reference, and the classification raster (for its pixel size), or to enter the pixel size manually. The split percentage can be entered manually or calculated with some other, more precise method.
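If this were wired up as a QGIS Processing algorithm, the inputs could be declared roughly as in the sketch below. The class name, parameter names, and outputs are hypothetical, and the split logic itself is left out.

```python
# Hypothetical QGIS Processing skeleton exposing the inputs described above.
from qgis.core import (
    QgsProcessing,
    QgsProcessingAlgorithm,
    QgsProcessingParameterFeatureSource,
    QgsProcessingParameterField,
    QgsProcessingParameterNumber,
    QgsProcessingParameterRasterLayer,
    QgsProcessingParameterVectorDestination,
)

class SplitTrainingTesting(QgsProcessingAlgorithm):

    def name(self):
        return 'split_training_testing'

    def displayName(self):
        return 'Split training polygons into training/testing'

    def createInstance(self):
        return SplitTrainingTesting()

    def initAlgorithm(self, config=None):
        self.addParameter(QgsProcessingParameterFeatureSource(
            'TRAINING', 'Training polygons', [QgsProcessing.TypeVectorPolygon]))
        self.addParameter(QgsProcessingParameterField(
            'CLASS_FIELD', 'Class ID field', parentLayerParameterName='TRAINING'))
        self.addParameter(QgsProcessingParameterRasterLayer(
            'CLASSIFICATION', 'Classification raster (used only for pixel size)', optional=True))
        self.addParameter(QgsProcessingParameterNumber(
            'TEST_PERCENT', 'Percentage split into testing data',
            type=QgsProcessingParameterNumber.Double, defaultValue=30.0,
            minValue=1.0, maxValue=50.0))
        self.addParameter(QgsProcessingParameterVectorDestination(
            'TRAIN_OUTPUT', 'Training polygons (output)'))
        self.addParameter(QgsProcessingParameterVectorDestination(
            'TEST_OUTPUT', 'Testing polygons (output)'))

    def processAlgorithm(self, parameters, context, feedback):
        # the grid-aligned split sketched earlier would run here
        return {}
```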
If implemented in SCP, all of these inputs except the split percentage could be pulled directly from the .scpx file and bandset that are loaded in the project, perhaps with a single click of a button.
Conclusion
I would like your opinion on whether this method is good or not, and whether it can or should be implemented in the plugin. I believe it is a very useful idea if done correctly. I would love to discuss it further! And happy new year!