You may be working with an implementation that divides several processing steps into separate kernels, and some of those kernels may be algorithmically independent of each other. If this is the case and you are developing for single-device execution, you might place the enqueue-kernel calls in some convenient order, with each group of independent kernels followed by a single call to clFinish(). At some point in the future, execution of the kernels can then be distributed over multiple devices without excessive recall about which execution dependencies exist in the implementation, because the clFinish() calls already mark them.
The following example illustrates this concept. Upper-case letters denote execution-independent kernels, lower-case letters denote kernels whose prior results are a dependence for forward computation, and each line represents an 'EnqueueKernel' call.
Current implementation (single device):
Dev 1 (A)
Dev 1 (B)
Dev 1 (C) (clFinish)
Dev 1 (d) (clFinish)
Dev 1 (e) (clFinish)
Dev 1 (F)
Dev 1 (G)
Dev 1 (H) (clFinish)
Dev 1 (i) (clFinish)
Dev 1 (j) (clFinish)
Notice that a single device is used and that the kernels are processed sequentially by that device. Note also the placement of the clFinish() calls: each one comes immediately before a processing step that requires all prior steps to be complete before continuing.
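As a rough sketch of what this pattern can look like in host code (the cl_kernel handles named after the letters above, the command queue q1, and the global work size are all hypothetical and assumed to have been created and configured elsewhere), the single-device version might be written as:

#include <CL/cl.h>

/* Sketch only: enqueue the ten kernels of the example on one queue,
 * placing clFinish() at each dependency boundary. Error checking omitted. */
void run_single_device(cl_command_queue q1,
                       cl_kernel A, cl_kernel B, cl_kernel C,
                       cl_kernel d, cl_kernel e,
                       cl_kernel F, cl_kernel G, cl_kernel H,
                       cl_kernel i, cl_kernel j)
{
    size_t gws = 1024;   /* hypothetical global work size */

    clEnqueueNDRangeKernel(q1, A, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, B, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, C, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1);        /* d requires A, B and C to be complete */

    clEnqueueNDRangeKernel(q1, d, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1);        /* e requires d to be complete */

    clEnqueueNDRangeKernel(q1, e, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1);

    clEnqueueNDRangeKernel(q1, F, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, G, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, H, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1);        /* i requires F, G and H to be complete */

    clEnqueueNDRangeKernel(q1, i, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1);

    clEnqueueNDRangeKernel(q1, j, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1);
}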
Let's say that at some point in the future your application requires further performance tuning, so you add two more devices to your system. The work can then be divided among the available devices, using the existing clFinish() calls as the split points:
Dev 1 (A), Dev 2 (B), Dev 3 (C) (clFinish)
Dev 1 (d) (clFinish)
Dev 1 (e) (clFinish)
Dev 1 (F), Dev 2 (G), Dev 3 (H) (clFinish)
Dev 1 (i) (clFinish)
Dev 1 (j) (clFinish)
Note that the clFinish() calls serve as markers showing where the work can be divided among devices or execution threads.
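A corresponding sketch of the three-device split (again with hypothetical kernel handles and global work size, and assuming one in-order command queue per device has already been created) might look like the following. The clFinish() markers from the single-device version now drain every queue that was given independent work before the next dependent step runs:

#include <CL/cl.h>

/* Sketch only: the same ten kernels spread across three command queues,
 * with the clFinish() markers acting as barriers. Error checking omitted. */
void run_three_devices(cl_command_queue q1, cl_command_queue q2,
                       cl_command_queue q3,
                       cl_kernel A, cl_kernel B, cl_kernel C,
                       cl_kernel d, cl_kernel e,
                       cl_kernel F, cl_kernel G, cl_kernel H,
                       cl_kernel i, cl_kernel j)
{
    size_t gws = 1024;   /* hypothetical global work size */

    /* A, B and C are independent, so each goes to a different device. */
    clEnqueueNDRangeKernel(q1, A, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q2, B, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q3, C, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1); clFinish(q2); clFinish(q3);   /* barrier before d */

    clEnqueueNDRangeKernel(q1, d, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1);                               /* barrier before e */

    clEnqueueNDRangeKernel(q1, e, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1);

    /* F, G and H are independent, so they are spread across the devices. */
    clEnqueueNDRangeKernel(q1, F, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q2, G, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q3, H, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1); clFinish(q2); clFinish(q3);   /* barrier before i */

    clEnqueueNDRangeKernel(q1, i, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1);

    clEnqueueNDRangeKernel(q1, j, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q1);
}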