We currently schedule kernels from the host and poll for completion. This does not scale to small (short-running) kernels or to large numbers of kernels. We need to implement an FPGA-side work scheduler and a work-queueing mechanism. The scheduler should stay as close as possible to HSA concepts (to ease a potential future port to HSA) and should resolve dependency trees without host intervention ("chaining" in XRT/ERT terminology).
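
To make the intended behaviour concrete, here is a minimal behavioural sketch of a queue whose packets carry dependency signals, loosely modelled on HSA-style completion signals. It is a host-side Python model for discussion only, not the FPGA implementation, and it uses no XRT, ERT, or HSA API. All names (`Signal`, `Packet`, `Scheduler`, `enqueue`, `run`, `demo`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Signal:
    """Completion signal (assumed convention: 1 = pending, 0 = done)."""
    value: int = 1


@dataclass
class Packet:
    """One unit of queued work plus the signals it must wait on."""
    name: str
    kernel: Callable[[], None]
    deps: List[Signal] = field(default_factory=list)
    completion: Signal = field(default_factory=Signal)


class Scheduler:
    """Toy model of a device-side scheduler: once packets are queued,
    dependencies are resolved without any further host involvement."""

    def __init__(self) -> None:
        self.pending: List[Packet] = []

    def enqueue(self, name, kernel, deps=()) -> Signal:
        """Queue a packet and return its completion signal for chaining."""
        pkt = Packet(name, kernel, list(deps))
        self.pending.append(pkt)
        return pkt.completion

    def run(self) -> List[str]:
        """Dispatch packets in passes until the queue drains.

        Each pass launches every packet whose dependency signals are all 0.
        Returns the names in dispatch order.
        """
        order: List[str] = []
        while self.pending:
            ready = [p for p in self.pending
                     if all(s.value == 0 for s in p.deps)]
            if not ready:
                raise RuntimeError("dependency cycle or unsatisfied signal")
            for p in ready:
                p.kernel()
                p.completion.value = 0  # signal completion to dependants
                order.append(p.name)
                self.pending.remove(p)
        return order


def demo() -> List[str]:
    """Diamond-shaped dependency tree: load -> (fft, scale) -> store."""
    sched = Scheduler()
    noop = lambda: None
    load = sched.enqueue("load", noop)
    fft = sched.enqueue("fft", noop, deps=[load])
    scale = sched.enqueue("scale", noop, deps=[load])
    sched.enqueue("store", noop, deps=[fft, scale])
    return sched.run()


if __name__ == "__main__":
    print(demo())
```

In this model the host only enqueues packets and, if it wishes, waits on the final signal. The chaining between `load`, `fft`, `scale`, and `store` is resolved entirely inside `run`, which is the property the FPGA-side scheduler is meant to provide.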