
Conversation

@DariaMityagina (Contributor) commented Aug 19, 2025

Details:

  • The task is to explore the possibility of moving the reshaping process to the plugin. This would allow the compiler to receive either a network with batch size 1 if reshaping is successful, or a network with a non-1 batch size if reshaping fails. This approach aims to simplify the batch handling process and reduce dependencies on the compiler.

The concept behind this PR's approach

  1. Verify if the model is compatible with batching on the plugin side.
  2. Attempt to reshape the model using set_batch(1).
  3. If this is successful, we continue with the PLUGIN batch. The reshaped model is sent to the compiler.
  4. The dynamic batch determination function has been updated to rely on the actual tensor shape instead of shapeFromIRModel.

Basically, we want to consider any N=1 network as a potentially dynamically batched network.
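In pseudocode, the idea looks roughly like this (a sketch, not the actual diff; the helper name try_plugin_batching is illustrative, while ov::get_batch/ov::set_batch are the same calls used in the snippets below):

#include <openvino/openvino.hpp>
#include <memory>

// Sketch only: attempt plugin-side de-batching; fall back to compiler-side batching on failure.
bool try_plugin_batching(std::shared_ptr<ov::Model>& model) {
    try {
        if (ov::get_batch(model) != 1) {  // 1 corresponds to intel_npu::utils::DEFAULT_BATCH_SIZE
            ov::set_batch(model, 1);      // naive de-batching; a dedicated debatcher may be needed later
        }
        return true;   // PLUGIN batch mode: the (now N=1) model is sent to the compiler
    } catch (const std::exception&) {
        return false;  // COMPILER batch mode: the original model is sent to the compiler
    }
}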

Flow

  1. Plugin receives a batched model (whether its batch is static or dynamic)
  2. Plugin tries to reshape it to batch=1
  3. If success, plugin invokes the compiler to compile the model with batch=1
  4. Plugin dumps the blob in the FS
  5. A user calls benchmark_app -m model.blob -data_shape [4, 3, 224, 224]
  6. Plugin loads the model onto the device
  7. Plugin sees the batch -> it creates N (here, 4) infer requests (see the sketch below)
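Conceptually, steps 5-7 correspond to the following application-level code. This is an illustration only (it assumes the batch sits on dimension 0 and the blob was compiled for N=1 inputs of shape [1, 3, 224, 224]); the actual splitting happens inside the plugin:

#include <openvino/openvino.hpp>
#include <vector>

// Illustrative only: serve a batch-N input with N single-batch infer requests.
void infer_batched(ov::CompiledModel& compiled, const ov::Tensor& batched) {
    const size_t N = batched.get_shape()[0];  // e.g. 4 for -data_shape [4, 3, 224, 224]
    std::vector<ov::InferRequest> requests;
    for (size_t i = 0; i < N; ++i) {
        ov::InferRequest req = compiled.create_infer_request();
        // Zero-copy view over the i-th sample of the user's batched tensor.
        ov::Tensor sample(batched, {i, 0, 0, 0}, {i + 1, 3, 224, 224});
        req.set_input_tensor(sample);
        req.start_async();
        requests.push_back(std::move(req));
    }
    for (auto& req : requests) {
        req.wait();  // gather the N results
    }
}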

Tickets:

  • E-176749

@DariaMityagina self-assigned this Aug 19, 2025
@DariaMityagina added the category: NPU OpenVINO NPU plugin label Aug 19, 2025
@DariaMityagina force-pushed the icv/dm/plugin_batch-ver-2 branch from 30529c1 to 29bbcdc on August 19, 2025 07:01
@DariaMityagina force-pushed the icv/dm/plugin_batch-ver-2 branch 3 times, most recently from 6f2a537 to b9cbb0f on September 9, 2025 23:00
…and Compiler - no metadata changes - fix static tests
…and Compiler - fix BA issues - treat every model with batch 1 as a potentially dynamically batched one
@DariaMityagina force-pushed the icv/dm/plugin_batch-ver-2 branch from 105c432 to 3a4baa2 on September 10, 2025 11:51
if (autoOrPluginBatch && pluginBatchingIsSupported && batchedModel) {
_logger.info("Attempting to handle batching on the plugin side.");
ov::set_batch(modelForCompilation, 1);
// TODO: add debatcher for more complicated cases as set_batch is pretty naive.
@DariaMityagina (Author) Sep 17, 2025

Maybe it isn't needed. To be double-checked.

@DariaMityagina added this to the 2025.4 milestone Sep 22, 2025
@sivanov-work (Contributor) left a comment

IMHO, the batch/dynamic-batch handling logic here is too widely spread and unclear, which was actually imposed by the legacy design.

The things I'd be very eager to discuss are:

  1. I think we should remove any open presumption about the location of the N dimension (even more so, getting the batch from hardcoded BATCH_AXIS positions), because it will not work in 100% of cases and we will need to return to this problem again and again, doing more and more refactoring.
    Instead, I suggest we use layout information (or ask the user to supply it if it doesn't exist) to get the batch dimension, as this plugin.cpp part was designed to be generic rather than PLUGIN-batch-centric.

  2. Even if we get off BATCH_AXIS and retain these various "support-batch-case" checks: I see that these limitations are spread across TWO places: one part on the "plugin.cpp" side where we check PLUGIN mode, and the actual limitations contained in ZeroInferRequest where we check the data structures once again.
    IMHO, plugin.cpp should be clean of any assumptions about what the PLUGIN implementation can or cannot support and should not introduce such fine-grained condition testing. Otherwise we would be destined to always fix/add/invent symmetrical changes in both ZeroInferRequest and plugin.cpp.

  3. If we agree to consider each N=1 infer request as a possible batched/dynamic-batched infer request, then it's also possible to release the burden from this plugin.cpp part and always use set_batch(1) for any scenario where PLUGIN mode is involved, so that later we could reallocate inner data when set_tensor() comes with a new N.
    That would allow us to implement N=1 in ZeroInferRequest as the generic case and only specialize the set_tensor()/infer() implementation for the event when the user sets new tensors with a different batch value.

As I can see:

  1. We are forced to set_batch(1) when dealing with a batched network, since ZeroInferRequests in PLUGIN mode rely on that. It seems to me that we MUST set_batch(1) every time we deal with a batched network. We cannot invoke set_batch(1) for every network, as it is not reliable and may fail for a non-batched network; therefore we must determine that the network is a batched one. Even though this entire legacy logic is, IMHO, redundant and weak, I think we should stop here and not proceed further with separating the static-batch/dynamic-batch cases or bringing in more artificial limitations than we already have. So I hope that simply using set_batch(1) is enough here.

  2. ZeroInferRequest deals with the N=1 situation (or doesn't even care what it has got) and only adds a check in set_tensor() that the new tensor is N times larger or smaller than the existing one. If ALL tensors got changed in the same proportion N, then we are dealing with a batched case, whether static or dynamic.

  3. IMHO the ideal situation would be to remove all heuristics/limitations from plugin.cpp and encapsulate them all in ZeroInferRequest.

const bool batchedModel = ov::get_batch(modelForCompilation) != intel_npu::utils::DEFAULT_BATCH_SIZE;

if (autoOrPluginBatch && pluginBatchingIsSupported && batchedModel) {
_logger.info("Attempting to handle batching on the plugin side.");

Tip: I recommend extending the logger info by adding the current value of the batch, e.g. "... from {ov::get_batch(modelForCompilation)} to 1".
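For example, something like this (a sketch assuming the printf-style Logger::info already used elsewhere in this file):

#include <sstream>

std::stringstream batchStr;
batchStr << ov::get_batch(modelForCompilation);  // works for both static and dynamic batch dimensions
_logger.info("Attempting to handle batching on the plugin side: reshaping the batch from %s to 1.",
             batchStr.str().c_str());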

ov::set_batch(modelForCompilation, 1);
// TODO: add debatcher for more complicated cases as set_batch is pretty naive.
} else {
_logger.info("Unable to manage batching on the plugin side, so the compiler will take care of it.");

Tip: printing the value of the batch in this info message will also help with troubleshooting.

updateBatchMode(ov::intel_npu::BatchMode::COMPILER);
}

if (localConfig.isAvailable(ov::intel_npu::batch_mode.name())) {

Tip: I checked, and it seems that all of this logic requires the localConfig.isAvailable() condition to be met, so we could wrap all these lines under a single if:

if (localConfig.isAvailable(ov::intel_npu::batch_mode.name())) {
    // ... all new code
}

Perhaps that would improve readability.


updateBatchMode(ov::intel_npu::BatchMode::COMPILER);
} catch (const std::exception& ex) {
_logger.info("Couldn't validate and reshape the model. Batching will be handed by compiler.", ex.what());

AUTO & PLUGIN makes no difference then, as far I remember from the compiler code we roll back to the COMPILER mode only if we have got AUTO enabled so that we do not overwrite explicitly requested PLUGIN mode

ov::Layout layout = ov::layout::get_layout(input);

// Batching on plugin is working only when batching is found on 0th dimension
if ((shape.size() && shape[intel_npu::utils::BATCH_AXIS].get_max_length() != intel_npu::utils::DEFAULT_BATCH_SIZE) ||

I don't think it's a good idea to stick to the static BATCH_AXIS when determining the batch dimension; there are many cases where the 0th dimension is not a batch at all, as well as cases where the batch is not at the 0th position.

Since we have "layout" attribute affiliated to a model - we could adhere to it.
On the vpux-compiler side we have this logic in determining batches for the debatch method:

  1. check whether or not user specified "layouts" explicitly
  2. check ov::FindBatch outcome
  3. trying to set_batch(1) ...

What do you think about reusing similar logic? Or about not introducing any logic at the plugin.cpp level at all?
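A rough sketch of the layout-based step of that chain (the helper name and the -1 sentinel are illustrative only; ov::layout::get_layout, has_batch and batch_idx are the standard OpenVINO helpers):

#include <openvino/openvino.hpp>
#include <cstdint>

// Returns the batch axis taken from the user-supplied layout, or -1 if no layout is set,
// in which case the later steps (ov::pass::FindBatch, then a plain set_batch(1) attempt) apply.
int64_t batch_axis_from_layout(const ov::Output<const ov::Node>& input) {
    const ov::Layout layout = ov::layout::get_layout(input);
    if (!layout.empty() && ov::layout::has_batch(layout)) {
        return ov::layout::batch_idx(layout);  // position of 'N' in the layout
    }
    return -1;  // no explicit layout information available
}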


// Plugin batching is applied if:
// 1. Both inputs and outputs have batched dimensions
// 2. All batch sizes are consistent (should be only DEFAULT_BATCH_SIZE)

I doubt that in the dynamic-batch case we can perform this check here; we could only check that the Ns are dynamic.
What do you think about abolishing this check altogether (for the static case as well) and just calling set_batch(1) on the model unconditionally, provided that the mode is PLUGIN? All substantial checks on batch size would then be made at runtime, in set_tensor in ZeroInferRequest.

In other words, if we were going to fail (due to different N sizes), we will definitely fail later at the "runtime/set_tensor" phase. On the other hand, this disrupts the "fail fast" principle, since such errors could be detected at the compile phase; but we can always just execute set_tensor with the default/model-expected shapes right after model compilation.

Possible implementation:
We remove all limitation checks from plugin.cpp.
We remove ALL tensor and pipeline allocations from the ZeroInferRequest-related ctors & initialization.
We move all these allocations into a reallocate_pipeline method. We call reallocate_pipeline once all set_tensor calls have been made, i.e. right before infer().
So once a model has been compiled, we will call set_tensor for each input, which will trigger pipeline creation and device memory allocation.
Only after that does compile_model() appear to be done.
Later, in the inference phase, we reuse the reallocate_pipeline logic when new tensors with a new N are requested.

Pros:
we keep plugin.cpp generic;
we keep all limitation logic in ZeroInferRequest;
we keep ZeroInferRequest simple.

Cons:
we get the error caused by that limitation later, once ZeroInferRequest is created (?), and tests on compilation of batched models will perhaps require refactoring.
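A compact sketch of the proposal (class and member names are hypothetical, not existing ZeroInferRequest code; reallocate_pipeline is the method suggested above):

#include <openvino/openvino.hpp>
#include <map>
#include <string>

// Pseudocode sketch of deferred allocation: accept any batch N in set_tensor and
// postpone device memory / pipeline allocation until right before infer().
class DeferredBatchInferRequestSketch {
public:
    void set_tensor(const std::string& port_name, const ov::Tensor& tensor) {
        // Remember the tensor and mark the pipeline stale; no checks besides batch proportions.
        _tensors[port_name] = tensor;
        _pipeline_dirty = true;
    }

    void infer() {
        if (_pipeline_dirty) {
            reallocate_pipeline();  // all allocation logic lives here instead of the ctor
            _pipeline_dirty = false;
        }
        // ... run the N (possibly 1) underlying executions ...
    }

private:
    void reallocate_pipeline() {
        // Inspect _tensors, derive the common batch N, allocate per-batch pipelines.
    }

    std::map<std::string, ov::Tensor> _tensors;
    bool _pipeline_dirty = true;
};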

@DariaMityagina (Author):

Removed these changes and checks from dummy_model. Not applicable anymore. Thanks!


// Limitation: Plugin batching is not supported when there are dynamic
// dimensions other than the batch dimension.
if (checkModelDynamicDims(model)) {

As we discussed offline, the problem with the simultaneous existence of batch and dynamic dimensions is that we would have a "gap" in the plain tensor data after the end of one batch portion and before the start of the next.

IMHO that makes it harder to find where the next N-line for the next batch portion is located, and we cannot simply locate batch i in the whole tensor by computing i * get_size() / N.
Also, there is an open issue to overcome such a limitation.

What if we do not introduce that limitation at all? It manifests a lot of brittleness in the code and introduces very specific checks like hasOtherDynamicDims.
It's a rare situation to have N != 1 and a dynamic shape together; it only adds complexity for getting the next batch line in a result tensor, and it will be covered soon by a separate issue.

What do you think?

@DariaMityagina (Author):

> What if we do not introduce that limitation at all? It manifests a lot of brittleness in the code and introduces very specific checks like hasOtherDynamicDims.
> It's a rare situation to have N != 1 and a dynamic shape together; it only adds complexity for getting the next batch line in a result tensor, and it will be covered soon by a separate issue.

Sure, we can do it separately in E#179728. Thanks!

@DariaMityagina force-pushed the icv/dm/plugin_batch-ver-2 branch from e932c8d to ef91465 on September 22, 2025 22:04
@pereanub (Contributor) left a comment

I still don't think that considering all the models that have batch size set to 1 as dynamic models is the correct approach here.


OPENVINO_ASSERT(is_dynamic || port.get_shape() == tensor->get_shape(),
OPENVINO_ASSERT(is_dynamic || port.get_shape() == tensor->get_shape() ||
tensor->get_shape()[utils::BATCH_AXIS] % port.get_shape()[utils::BATCH_AXIS] == 0,

I'm trying to understand this extra check here, but it doesn't make sense to me. Could you please explain?

@DariaMityagina (Author):

Sure, let me provide additional context.

The check_tensor function is invoked within set_tensor, particularly when assigning a tensor to a designated port.
With this implementation, the port's shape is adjusted to have a modified batch size (set to 1), while the tensor retains its original requested dimensions.

So the condition is_dynamic || port.get_shape() == tensor->get_shape() would not pass on its own.

For instance, if the tensor has a batch size of 2:
2 % 1 = 0 -> de-batched case
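As an illustration of the relaxed condition (assuming the batch sits on dimension 0, as utils::BATCH_AXIS implies; this helper exists only for illustration and is not part of the PR):

#include <openvino/core/shape.hpp>

// Mirrors the relaxed OPENVINO_ASSERT condition quoted above for static shapes.
bool shape_accepted(const ov::Shape& port_shape, const ov::Shape& tensor_shape) {
    constexpr size_t BATCH_AXIS = 0;
    return port_shape == tensor_shape ||
           tensor_shape[BATCH_AXIS] % port_shape[BATCH_AXIS] == 0;
}
// shape_accepted({1, 3, 224, 224}, {2, 3, 224, 224}) -> true: 2 % 1 == 0, the de-batched case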

"doesn't match with total blobs count: ",
tensors_size);
OPENVINO_ASSERT(
batch.is_dynamic() || batch.get_length() == tensors_size || tensors_size % batch.get_length() == 0,

same comment here

@DariaMityagina (Author) commented Sep 23, 2025

> I still don't think that considering all the models that have batch size set to 1 as dynamic models is the correct approach here.

This approach was inspired by the requirements for this task:

1. Plugin detection: The plugin identifies a network with a dynamic batch size.
2. Batch conversion: It transforms the network into one with a batch size of 1, if possible.
3. Driver compilation: Only after this transformation does the network get passed to the driver for compilation.
4. Isolation from Compiler and Driver: Crucially, neither the compiler nor the driver is aware that the network originally had a dynamic/static batch size.

- Make dynamic batch compatible with older drivers
- Keep compatibility with old NPU plugins
- [Stretch goal] No need for metadata in compiled blobs; the NPU plugin handles batches != 1 automatically using concurrently created infer requests

To minimize the risks, I will try to limit the application of the current changes to the dynamic-batch case for now, after which we can work on a common solution.
