Set the pixfmt conversion CUDA kernel as the cmpto_j2k_enc preprocessor
instead of running it directly.
This also eliminates the need for a conversion kernel whenever
conversion is needed; CPU conversion will be sufficient. Currently this
has no effect, because only R12L is converted and a kernel exists for it.
refer to GH-406
The comparison must include equality, because position_x indicates the
beginning of the block: the first comparison then covers at most the
last unfinished block, and the 2nd comparison covers the last unaligned
block (if any).
- separate the block computation
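
A minimal host-side sketch of the two comparisons described above. It is
not the actual kernel code: the enum, the function name, the exact form
of both conditions and the block size of 8 pixels are assumptions made
for illustration.

```cpp
// Hypothetical classification of a block by its starting pixel column.
enum class BlockKind { Full, Partial, Outside };

// Assumed number of pixels covered by one block.
constexpr int kBlockPixels = 8;

constexpr BlockKind classify_block(int position_x, int width) {
    // ">=" rather than ">": position_x is the first pixel of the block,
    // so a block starting exactly at `width` already lies past the line.
    if (position_x >= width) {
        return BlockKind::Outside;
    }
    // The block starts inside the line but would end past it: this is
    // the last, unaligned block (if the width is not a multiple of 8).
    if (position_x + kBlockPixels > width) {
        return BlockKind::Partial;
    }
    return BlockKind::Full;
}
```

With a strict ">" in the first test, a block starting at
position_x == width would be treated as belonging to the line.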
Drop the non-aligned end of every line, not just of the last line.
Although the out-of-bounds read is no longer a problem, an
out-of-bounds write could still occur, writing garbage at the
beginning of the following line.
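
The effect can be sketched on the CPU as follows. This is an
illustration only, not the kernel itself: the function names, the 0xFF
marker and the block geometry (8 pixels in 36 bytes) are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlockPixels = 8;   // assumed pixels per block
constexpr int kBlockBytes  = 36;  // assumed bytes per block

// Whole blocks that fit in one line; the unaligned tail is dropped.
constexpr int full_blocks_per_line(int width) {
    return width / kBlockPixels;
}

// Stand-in for the kernel: marks every output byte that would be written
// with 0xFF. dst_linesize is the output pitch in bytes and must hold at
// least full_blocks_per_line(width) * kBlockBytes bytes.
std::vector<std::uint8_t> mark_written(int width, int height,
                                       std::size_t dst_linesize) {
    std::vector<std::uint8_t> dst(dst_linesize * height, 0);
    for (int y = 0; y < height; ++y) {
        for (int b = 0; b < full_blocks_per_line(width); ++b) {
            const std::size_t off = y * dst_linesize + b * kBlockBytes;
            for (int i = 0; i < kBlockBytes; ++i) {
                dst[off + i] = 0xFF;
            }
        }
    }
    return dst;
}
```

For a 20-pixel-wide line, only two blocks (72 bytes) are written per
line, so the remainder of each line stays untouched and no write spills
into the start of the following line.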
to measure the duration of the newly created ->R12L kernel
It actually takes 1.6 ms for a 1920x1080 picture on a GeForce GTX
TITAN, which seems more or less OK for now (it could perhaps be
optimized, but it doesn't seem to be a blocker).
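
To put the 1.6 ms figure in context, the following sketch relates it to
a frame period and to pixel throughput. The helper names and the 60 fps
target are assumptions for illustration; only the 1.6 ms and 1920x1080
values come from the measurement above.

```cpp
// Fraction of one frame period spent in the kernel at a given frame rate.
constexpr double kernel_share_of_frame(double kernel_ms, double fps) {
    return kernel_ms / (1000.0 / fps);
}

// Pixels processed per second if the kernel ran back to back.
constexpr double pixels_per_second(int width, int height, double kernel_ms) {
    return static_cast<double>(width) * height / (kernel_ms / 1000.0);
}
```

Under these assumptions, kernel_share_of_frame(1.6, 60.0) is roughly
0.096, i.e. just under a tenth of a 60 fps frame period, and
pixels_per_second(1920, 1080, 1.6) is roughly 1.3e9.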