Setting frame-based multithreading for FFmpeg's internal HEVC decoder
allows parallel decoding of HEVC streams generated by NVENC.
Performance evaluation shows that, interestingly, this not only does not
increase decoding latency but actually reduces it. This holds for both
libx265 and NVENC (tested with `testcard:codec=R10k:pattern=gray`).
Note that the observation in the previous paragraph doesn't hold for all
codecs, e.g. setting frame+slice threading for the FFmpeg JPEG
encoder+decoder increases latency from some 200 ms to 400 ms.
Refer to GitHub CESNET/UltraGrid discussion #241.
The motivation for this filter is currently to measure compression
latency in UltraGrid, e.g.:
uv --verbose=+timestamps --capture-filter color -t testcard:pattern=gray \
-d dummy -p color [-c libavcodec]
Get every frame from video_pattern_generator, not only the first one.
This will allow more complex patterns than just sliding over one picture.
Behavior described in doxygen. Changes:
- do not override user selected nr. of threads/mode if OTHER (thr=0) or
SLICE (thr=<cpu>,type=slice) threading is supported
- do not set unsupported thread type
- do not set thread count if thread_mode=0
- allow user to select both slice and frame multithreading (slice
remains default)
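
The rules listed above can be sketched roughly as follows. This is a self-contained illustration, not the actual UltraGrid code: `ThreadRequest`, `ThreadConfig`, `select_threading` and the `THR_*` constants are hypothetical stand-ins (the real code would work with libavcodec's thread type and thread count settings and the codec capability flags), and the exact precedence of the rules is an assumption.

```cpp
// Hypothetical stand-ins for libavcodec's thread-type bits; not the real constants.
enum ThreadType : unsigned { THR_NONE = 0, THR_FRAME = 1u << 0, THR_SLICE = 1u << 1 };

// What the user asked for (hypothetical structure).
struct ThreadRequest {
    unsigned mode = THR_SLICE; // slice remains the default
    int count = 0;             // 0 = fall back to the CPU count
};

// What would be applied to the codec context (hypothetical structure).
struct ThreadConfig {
    unsigned type = THR_NONE;  // thread types to enable
    int count = 0;             // thread count to set, valid only if set_count
    bool set_count = false;    // false = leave the codec's thread count untouched
};

ThreadConfig select_threading(const ThreadRequest &req, unsigned codec_caps, int cpu_count) {
    ThreadConfig cfg;
    if (req.mode == THR_NONE) {
        return cfg;            // thread mode 0: do not set the thread count
    }
    cfg.type = req.mode & codec_caps; // never set an unsupported thread type
    if (cfg.type == THR_NONE) {
        return cfg;            // nothing the codec supports was requested
    }
    cfg.count = req.count > 0 ? req.count : cpu_count;
    cfg.set_count = true;
    return cfg;
}
```

Requesting `THR_FRAME | THR_SLICE` from a codec that supports only slice threading yields slice threading alone, and a request with mode `THR_NONE` leaves the thread count unset.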
Setting line buffering does not appear to work correctly on either
PowerShell or cmd.exe and instead behaves like full buffering (lines do
not appear until flush).
On some compilers, the << operator for stringstream returns the result
as basic_ostream, which does not have the .str() method (see C++ defect
report 1023).
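
A portable way around this is to keep the stream in a named variable and call `.str()` on that variable rather than on the result of the `<<` chain. A minimal sketch follows; the helper `format_pair` is hypothetical and only serves as an illustration.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats "<name>=<value>" into a std::string.
std::string format_pair(const std::string &name, int value) {
    // Not portable:
    //   return (std::ostringstream() << name << '=' << value).str();
    // The << chain may yield std::ostream&, which has no str() member.
    std::ostringstream oss;
    oss << name << '=' << value;
    return oss.str(); // str() is called on the ostringstream itself
}
```

For example, `format_pair("threads", 8)` yields the string `threads=8`.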
The previous implementation using atomics was not entirely correct,
since the following situation could happen:
1. Thread 1 detects a msg repeat
2. Thread 2 prints a message before thread 1 could print repeat notice
3. Thread 1 outputs "last msg repeated" for the msg from step 1
The stdout stream uses locking internally anyway, so this should not have
any significant overhead. In addition, this simplifies the code,
eliminates an allocation and fixes the leak on exit.
- the required space for the intermediate result was actually 2x larger
than the dst buffer could provide
+ make arguments of vc_copylineRGBAtoUYVY restrict again -- the function
is no longer used in situ, so the qualifier is valid again