Discussion:
[theora-dev] Multithread support
M. Pabis
2015-02-03 23:37:56 UTC
Hi,

I recently had to encode a few hours of desktop stream with Theora and I
noticed it used only one core for encoding. Was I missing something? I did
not find any "thread" options.

As I dug around, I found that there was a multithreading patch back in 2007,
and some ffmpeg2theora-multithread commits, but it looks like all of this
was dropped. Am I right?

If multithreaded encoding was dropped, may I ask why?

I think I could dedicate some of my free time to bringing multithreading to
the Theora encoder, but I would like to make sure I'm not duplicating
existing work ;-)
--
Thanks in advance for answers.
Mateusz Pabis
Timothy B. Terriberry
2015-02-04 04:17:42 UTC
Post by M. Pabis
If multithreaded encoding was dropped, may I ask why?
IIRC, the commits from 2007 only threaded the motion search, and gave
gains of only 10 to 20%. However, as part of the work to improve the
Theora encoder quality for the HTML5 video tag, the way this search was
done was completely rewritten, and we got significantly more than 10 to
20% speed-ups just by having a smarter single-threaded algorithm.
Post by M. Pabis
I think I could dedicate some of my free time to bringing multithreading to
the Theora encoder, but I would like to make sure I'm not duplicating
existing work ;-)
I don't believe anyone has been working on this for some years. There
are two basic approaches.

One is threading within a single frame, which does not require any API
behavior changes. In theory you can scale to a fairly decent number of
threads everywhere except the final conversion from tokens to VLC codes
in oc_enc_frame_pack(). However, the units of work are sufficiently
small and the task dependencies sufficiently involved that this needs
some kind of lock-free work-stealing queues to have a hope of getting
more benefit from the parallelism than you pay in synchronization
overhead. I'd started designing one with the hope that all memory
allocations could be done up-front at encoder initialization (to avoid
locking contention there), but this turns out to be sufficiently
different from how most lock-free data structures worked at the time
that it was a fair amount of work. I've been meaning to look at what
Mozilla's Servo project is doing for this these days (since they have
similar challenges).
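
Just to make the shape of option one concrete, here is a very naive sketch
(NOT the work-stealing design above, just a static split of a frame's
superblock rows across a pthread pool; every oc_* name below is a
placeholder, not actual libtheora internals):

#include <pthread.h>

#define MAX_THREADS 8

struct oc_enc_ctx;  /* opaque; stands in for the real encoder context */

/* hypothetical per-row work: motion search, mode decision, tokenization */
extern void oc_enc_analyze_sb_row(struct oc_enc_ctx *enc, int row);

typedef struct {
  struct oc_enc_ctx *enc;
  int row_start;  /* first superblock row handled by this worker */
  int row_end;    /* one past the last row */
} row_task;

static void *encode_rows(void *arg) {
  row_task *t = (row_task *)arg;
  int row;
  for (row = t->row_start; row < t->row_end; row++) {
    oc_enc_analyze_sb_row(t->enc, row);
  }
  return NULL;
}

/* Statically split the superblock rows of one frame across nthreads.
 * This ignores the cross-row dependencies that make the real problem
 * hard, and does nothing about uneven work per row; that imbalance is
 * exactly what the work-stealing queues are meant to fix. */
static void encode_frame_threaded(struct oc_enc_ctx *enc,
                                  int nsb_rows, int nthreads) {
  pthread_t tid[MAX_THREADS];
  row_task task[MAX_THREADS];
  int i;
  for (i = 0; i < nthreads; i++) {
    task[i].enc = enc;
    task[i].row_start = nsb_rows * i / nthreads;
    task[i].row_end = nsb_rows * (i + 1) / nthreads;
    pthread_create(&tid[i], NULL, encode_rows, &task[i]);
  }
  for (i = 0; i < nthreads; i++) pthread_join(tid[i], NULL);
  /* the final token-to-VLC packing (oc_enc_frame_pack) stays serial */
}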

The other is traditional FFmpeg-style frame threading, which gives each
thread a separate frame to encode, and merely waits for enough rows of
the previous frame to be finished so that it can start its motion
search. This is generally much more effective than threading within a
frame, but a) requires additional delay (the API supports this in
theory, but software using that API might not expect it, so it would
have to be enabled manually through some sort of th_encode_ctl call) and
b) requires changes to the rate control to deal with the fact that
statistics from the previous frame are not immediately available. b) was
the real blocker here.
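
At the API level that opt-in might look something like the following; the
control code is made up (only th_encode_ctl() itself exists today):

#include <theora/theoraenc.h>

/* Hypothetical control code; it does NOT exist in current libtheora.
 * The point is that frame threading, and the delay that comes with it,
 * would have to be requested explicitly by the application. */
#define TH_ENCCTL_SET_FRAME_THREADS (0x8000)

static int enable_frame_threading(th_enc_ctx *enc, int nthreads) {
  /* with today's library this simply returns TH_EIMPL */
  return th_encode_ctl(enc, TH_ENCCTL_SET_FRAME_THREADS,
                       &nthreads, sizeof(nthreads));
}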

In every encoder I know of (for any format), the second approach is much
more effective than the first.
M. Pabis
2015-02-04 10:48:05 UTC
Hi, thanks for the detailed answer.

On Wed, Feb 4, 2015 at 5:17 AM, Timothy B. Terriberry <***@vt.edu>
wrote:

Post by Timothy B. Terriberry
I don't believe anyone has been working on this for some years. There are
two basic approaches.
One is threading within a single frame, which does not require any API
behavior changes. In theory you can scale to a fairly decent number of
threads everywhere except the final conversion from tokens to VLC codes in
oc_enc_frame_pack(). However, the units of work are sufficiently small and
the task dependencies sufficiently involved that this needs some kind of
lock-free work-stealing queues to have a hope of getting more benefit from
the parallelism than you pay in synchronization overhead. I'd started
designing one with the hope that all memory allocations could be done
up-front at encoder initialization (to avoid locking contention there), but
this turns out to be sufficiently different from how most lock-free data
structures worked at the time that it was a fair amount of work. I've been
meaning to look at what Mozilla's Servo project is doing for this these
days (since they have similar challenges).
The other is traditional FFmpeg-style frame threading, which gives each
thread a separate frame to encode, and merely waits for enough rows of the
previous frame to be finished so that it can start its motion search. This
is generally much more effective than threading within a frame, but a)
requires additional delay (the API supports this in theory, but software
using that API might not expect it, so it would have to be enabled manually
through some sort of th_encode_ctl call) and b) requires changes to the
rate control to deal with the fact that statistics from the previous frame
are not immediately available. b) was the real blocker here.
I have read the Theora specification (from March 2011) and I have some
more ideas.

1. Each thread deals with the frames from one intra frame up to (but not
including) the next intra frame;
2. Each thread deals with 1/n-th of the duration, and all outputs are
finally concatenated.
3. Maybe not multithreading as such, but parallel/vector computing: encoding
one frame divided into small areas and processed with OpenCL or CUDA.

I'm aware these are rather naive approaches, mostly because they need
enough data available up front. For 1., encoding a live stream would
introduce some latency, and with today's processor power encoding can
already be done in real time, so there would be no speedup for streamed
video; maybe the extra time could instead be spent finding better
compression.

Well, 2. is totally naive. But if the whole video is available, the
speedup should be almost linear.

About 3.: well, it's vendor lock-in ;-) But hey, better that than nothing,
right? As this is a variation of concept #1 you described, CUDA and OpenCL
have efficient mechanisms for synchronization, memory sharing, etc. This
approach would probably benefit most at higher resolutions. CUDA and/or
OpenCL could also be used for concept #2, with the same limitations,
unfortunately.
--
Best regards
Mateusz Pabis
Timothy B. Terriberry
2015-02-04 11:31:44 UTC
Post by M. Pabis
1. Each thread deals with the frames from one intra frame up to (but not
including) the next intra frame;
This works if you know where the intra frames are. Currently the frame
type decision is made by trying to encode as an inter frame, and keeping
statistics on the expected rate and distortion of using all-intra modes
during mode decision. Then, if it looks like an all-intra frame is likely
to be more efficient, the frame is re-encoded as a keyframe. There is no
lookahead at all.
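
Roughly, with made-up names (this is a simplification, not the actual
encoder code):

struct oc_enc_ctx;  /* opaque encoder context */

typedef struct { long rate; long dist; } oc_frame_cost;  /* illustrative */

extern oc_frame_cost oc_enc_code_as_inter(struct oc_enc_ctx *enc);
extern oc_frame_cost oc_enc_tally_all_intra(struct oc_enc_ctx *enc);
extern void oc_enc_recode_as_keyframe(struct oc_enc_ctx *enc);

static void choose_frame_type(struct oc_enc_ctx *enc, double lambda) {
  /* Encode as an inter frame first; the all-intra cost is tallied as a
   * side effect of mode decision, not by a separate lookahead pass. */
  oc_frame_cost inter = oc_enc_code_as_inter(enc);
  oc_frame_cost intra = oc_enc_tally_all_intra(enc);
  if (intra.dist + lambda * intra.rate <
      inter.dist + lambda * inter.rate) {
    oc_enc_recode_as_keyframe(enc);  /* throw away the inter encoding */
  }
}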

You could certainly do this in two-pass mode, but the first pass is not
very much faster than the second pass. In fact, I'm pretty sure you
could do this without any modification to libtheora at all.
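
E.g., a throwaway pass with the stock library already tells you where the
keyframes land, just by looking at the frame-type flag in each data packet
(sketch only; next_frame() is a stand-in for however you feed pictures in,
and enc is assumed to be fully set up with its headers already flushed):

#include <theora/theoraenc.h>

extern int next_frame(th_ycbcr_buffer ycbcr);  /* placeholder frame source */

/* Record which frame numbers an unmodified libtheora chose to code as
 * keyframes.  Per the Theora spec, bit 0x40 of the first byte of a
 * video data packet is the frame-type flag: 0 means intra (key). */
static int find_keyframes(th_enc_ctx *enc, long *kf_at, int max_kf) {
  th_ycbcr_buffer ycbcr;
  ogg_packet op;
  long frameno = 0;
  int nkf = 0;
  while (next_frame(ycbcr)) {
    th_encode_ycbcr_in(enc, ycbcr);
    while (th_encode_packetout(enc, 0, &op) > 0) {
      if (op.bytes > 0 && !(op.packet[0] & 0x40) && nkf < max_kf) {
        kf_at[nkf++] = frameno;
      }
      frameno++;
    }
  }
  return nkf;
}
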
Post by M. Pabis
2. Each thread deals with 1/n-th of the duration, and all outputs are
finally concatenated.
This is pretty similar to 1, except that you can be more relaxed about
picking your partition points (i.e., if you put a keyframe in the wrong
place 4 or 8 times in a whole sequence, the overhead will not be that
large). Again, I think you can do this with no modifications to
libtheora at all.

In both cases the real trick will be rate control, since unless you're
doing average bitrate, the number of bits you want to spend on each
segment can vary quite a lot. If you are doing average bitrate, then
this is easy.
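
For average bitrate the per-segment budget is just proportional to
duration (trivial illustration):

/* e.g. 1,000,000 bps * 150 s = 150,000,000 bits (about 17.9 MiB) */
static long segment_bit_budget(double target_bps, double segment_seconds) {
  return (long)(target_bps * segment_seconds + 0.5);
}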

This is what sites like YouTube already do to reduce the latency between
upload and the video becoming available, and you can do this even with
an encoder that is itself multithreaded (i.e., splitting across multiple
machines instead of threads). Whether or not your encoder is
multithreaded just controls how many segments you need to split the
sequence into for a desired degree of parallelism.
Post by M. Pabis
3. Maybe not a multithreading, but parallel/vector computing - encoding
one frame, divided into small areas and processed on OpenCL or CUDA.
Lots of people have tried to do something like this for various codecs,
but I'm not aware of anyone ever getting any real improvements. A lot of
the processing does not work well on a GPU, and the data-marshalling to
get information back and forth between the CPU and GPU tends to wipe out
the gains from parallelism.

I would personally suggest not wasting time on this approach.
Post by M. Pabis
right? As this is a variation of concept #1 you described, CUDA and OpenCL
have efficient mechanisms for synchronization, memory sharing, etc. This
approach would probably benefit most at higher
Well, they mostly deal with it by not synchronizing. I.e., to get good
performance you need something on the order of 1000-way parallelism
among tasks that do not have to synchronize with each other. They have
grown some synchronization mechanisms, but they have a huge performance
penalty on these architectures.
Maik Merten
2015-02-04 12:06:34 UTC
Post by Timothy B. Terriberry
Post by M. Pabis
1. Each thread deals with the frames from one intra frame up to (but not
including) the next intra frame;
This works if you know where the intra frames are.
Could this information be gathered by having one thread encode a
downsampled version of the input video sequence, or would this be a bad
predictor?

Who knows, perhaps one can also gather data on relative bitrate
distribution between segments.


Maik
Timothy B. Terriberry
2015-02-04 12:16:36 UTC
Post by Maik Merten
Could this information be gathered by having one thread encode a
downsampled version of the input video sequence, or would this be a bad
predictor?
Yes. x264 has a lookahead thread with exactly this purpose (among
others). But current libtheora has nothing like this.
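
The downsampling itself would be the easy part; e.g., a 2x2 box filter per
plane is enough to build a quarter-resolution picture for a lookahead
encode (sketch only):

/* 2x2 box-filter decimation of one 8-bit plane.  Purely illustrative. */
static void downsample2x2(const unsigned char *src, int src_stride,
                          unsigned char *dst, int dst_stride,
                          int dst_w, int dst_h) {
  int x, y;
  for (y = 0; y < dst_h; y++) {
    const unsigned char *s0 = src + 2 * y * src_stride;
    const unsigned char *s1 = s0 + src_stride;
    for (x = 0; x < dst_w; x++) {
      dst[y * dst_stride + x] = (unsigned char)
        ((s0[2 * x] + s0[2 * x + 1] + s1[2 * x] + s1[2 * x + 1] + 2) >> 2);
    }
  }
}
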
Post by Maik Merten
Who knows, perhaps one can also gather data on relative bitrate
distribution between segments.
That requires significantly more lookahead, of course.
M. Pabis
2015-02-05 11:18:58 UTC
Post by Timothy B. Terriberry
Post by Maik Merten
Could this information be gathered by having one thread encode a
downsampled version of the input video sequence, or would this be a bad
predictor?
Yes. x264 has a lookahead thread with exactly this purpose (among
others). But current libtheora has nothing like this.
Great, thanks for the comments and ideas.
Is there anything obvious I'm missing that would make this improvement hard
or a waste of time?

Mateusz Pabis
