*********
Threading
*********

Thread Pool
===========

x265 creates a pool of worker threads and shares this thread pool
with all encoders within the same process (it is process global, aka a
singleton). The number of threads within the thread pool is determined
by the encoder which first allocates the pool, which by definition is
the first encoder created within each process.

:option:`--threads` specifies the number of threads the encoder will
try to allocate for its thread pool. If the thread pool was already
allocated this parameter is ignored. By default x265 allocates one
thread per (hyperthreaded) CPU core in your system.

Work distribution is job based. Idle worker threads ask their parent
pool object for jobs to perform. When no jobs are available, idle
worker threads block and consume no CPU cycles.

Objects which desire to distribute work to worker threads are known as
job providers (and they derive from the JobProvider class). When job
providers have work they enqueue themselves into the pool's provider
list (and dequeue themselves when they no longer have work). The thread
pool has a method to **poke** awake a blocked idle thread, and job
providers are recommended to call this method when they make new jobs
available.

Worker jobs are not allowed to block except when absolutely necessary
for data locking. If a job becomes blocked, the worker thread is
expected to drop that job and return to the pool to find more work.

.. note::

	x265_cleanup() frees the process-global thread pool, allowing
	it to be reallocated if necessary, but only if no encoders are
	allocated at the time it is called.

Wavefront Parallel Processing
=============================

New with HEVC, Wavefront Parallel Processing allows each row of CTUs to
be encoded in parallel, so long as each row stays at least two CTUs
behind the row above it, to ensure the intra references and other data
of the blocks above and above-right are available. WPP has almost no
effect on the analysis and compression of each CTU and so it has a very
small impact on compression efficiency relative to slices or tiles. The
compression loss from WPP has been found to be less than 1% in most of
our tests.

WPP has three effects which can impact efficiency. First, the row
starts must be signaled in the slice header; second, each row must be
padded to an even byte length; and third, the state of the entropy
coder is transferred from the second CTU of each row to the first CTU
of the row below it. In some conditions this transfer of state actually
improves compression, since the above-right state may have better
locality than the end of the previous row.
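The two-CTU lag rule can be sketched as a simple readiness check. This
is an illustrative helper, not x265 code; ``ctusDoneInRow`` tracks how
many CTUs each row has completed:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative helper (not x265 code): a CTU at (row, col) may be coded
// once the row above has finished the CTU two columns ahead, so that the
// above and above-right neighbours are available. Row 0 has no such
// dependency, and the last CTU of a row only needs the row above complete.
bool ctuReady(int row, int col, const std::vector<int>& ctusDoneInRow, int ctusPerRow)
{
    if (row == 0)
        return true;
    int needed = std::min(col + 2, ctusPerRow); // above + above-right CTUs
    return ctusDoneInRow[row - 1] >= needed;
}
```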

Parabola Research have published an excellent HEVC
`animation <http://www.parabolaresearch.com/blog/2013-12-01-hevc-wavefront-animation.html>`_
which visualizes WPP very well. It even correctly visualizes some of
WPP's key drawbacks, such as:

1. the low thread utilization at the start and end of each frame
2. a difficult block may stall the wave-front, and it takes a while for
   the wave-front to recover
3. 64x64 CTUs are big! there are far fewer rows than with H.264 and
   similar codecs

Because of these stall issues you rarely get the full parallelisation
benefit one would expect from row threading; 30% to 50% of the
theoretical perfect threading is typical.

In x265 WPP is enabled by default, since it not only improves
performance during encoding but also makes it possible for the decoder
to be threaded.

If WPP is disabled by :option:`--no-wpp` the frame will be encoded in
scan order and the entropy overheads will be avoided. If frame
threading is not disabled, the encoder will raise the default frame
thread count above what it would be with WPP enabled. The exact
formulas are described in the next section.

Parallel Mode Analysis
======================

When :option:`--pmode` is enabled, each CU (at all depths from 64x64 to
8x8) will distribute its analysis work to the thread pool. Each analysis
job will measure the cost of one prediction for the CU: merge, skip,
intra, inter (2Nx2N, Nx2N, 2NxN, and AMP). At slower presets, the amount
of increased parallelism is often enough to be able to reduce frame
parallelism while achieving the same overall CPU utilization. Reducing
frame threads is often beneficial to ABR and VBV rate control.
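The shape of this distribution can be sketched as below. The names and
costs are stand-ins, not x265's implementation, and ``std::async`` here
merely stands in for handing a job to the worker pool; the key property
is that each mode's cost is measured independently and the cheapest wins:

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <vector>

// Illustrative sketch of --pmode style analysis: each prediction type for
// a CU (merge, skip, intra, inter...) is costed as an independent job,
// then the cheapest mode is selected once all jobs complete.
int bestMode(const std::vector<std::function<int()> >& modeJobs)
{
    std::vector<std::future<int> > costs;
    for (size_t i = 0; i < modeJobs.size(); i++)
        // std::async stands in for enqueueing the job with the thread pool
        costs.push_back(std::async(std::launch::async, modeJobs[i]));

    int best = 0;
    int bestCost = costs[0].get();
    for (size_t i = 1; i < costs.size(); i++)
    {
        int c = costs[i].get();
        if (c < bestCost)
        {
            bestCost = c;
            best = (int)i;
        }
    }
    return best;
}
```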

Parallel Motion Estimation
==========================

When :option:`--pme` is enabled all of the analysis functions which
perform motion searches to reference frames will distribute those motion
searches as jobs for worker threads (if more than two motion searches
are required).

Frame Threading
===============

Frame threading is the act of encoding multiple frames at the same time.
It is a challenge because each frame will generally use one or more of
the previously encoded frames as motion references and those frames may
still be in the process of being encoded themselves.

Previous encoders such as x264 worked around this problem by limiting
the motion search region within these reference frames to just one
macroblock row below the coincident row being encoded. Thus a frame
could be encoded at the same time as its reference frames so long as it
stayed one row behind the encode progress of its references (glossing
over a few details).

x265 has the same frame threading mechanism, but we generally have much
less frame parallelism to exploit than x264 because of the size of our
CTU rows. For instance, with 1080p video x264 has 68 16x16 macroblock
rows available each frame while x265 only has 17 64x64 CTU rows.
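Those row counts are just the picture height divided by the block size,
rounded up:

```cpp
#include <cassert>

// Rows available for wave-front and frame parallelism at a given picture
// height: one row per CTU (or macroblock) of height, rounding up.
int blockRows(int pictureHeight, int blockSize)
{
    return (pictureHeight + blockSize - 1) / blockSize;
}
```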

The second complication is the loop filters. The pixels used for motion
reference must be processed by the loop filters, and the loop filters
cannot run until a full row has been encoded; they must run a full row
behind the encode process so that the pixels below the row being
filtered are available. When you add up all the row lags, each frame
ends up being 3 CTU rows behind its reference frames (the equivalent of
12 macroblock rows for x264).

The third complication is that when a frame being encoded is blocked
waiting for a reference frame row to become available, that frame's
wave-front becomes completely stalled, and when the row does become
available it can take quite some time for the wave to be restarted, if
it ever is. This makes WPP many times less effective when frame
parallelism is in use.

:option:`--merange` can have a negative impact on frame parallelism. If
the range is too large, more rows of CTU lag must be added to ensure
those pixels are available in the reference frames.

.. note::

	Even though the merange is used to determine the amount of reference
	pixels that must be available in the reference frames, the actual
	motion search is not necessarily centered around the coincident
	block. The motion search is actually centered around the motion
	predictor, but the available pixel area (mvmin, mvmax) is determined
	by merange and the interpolation filter half-heights.
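The arithmetic behind the row lag can be sketched as below. The constants
and rounding are assumptions for illustration, not x265's exact formula,
but they show how merange, the interpolation margin, and the one-row loop
filter lag combine into the 3-CTU-row figure mentioned earlier:

```cpp
#include <cassert>

// Illustrative arithmetic only (the constants are assumptions, not
// x265's actual code): the CTU-row lag a frame keeps behind its
// references grows out of merange, the interpolation filter margin, and
// the one-row loop filter lag.
int refLagRows(int merange, int ctuSize)
{
    const int interpHalfHeight = 4; // assumed margin for the 8-tap luma filter
    const int loopFilterLag = 1;    // loop filter runs one row behind encode

    // rows of reference pixels a vertical search of merange can reach
    int searchRows = (merange + interpHalfHeight + ctuSize - 1) / ctuSize;
    return searchRows + loopFilterLag + 1; // +1 for the row being encoded
}
```

With these assumed margins, a default-sized merange stays within one CTU
row of search reach, giving a total lag of three rows; a much larger
merange adds further rows of lag and costs frame parallelism.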

When frame threading is disabled, the entirety of all reference frames
are always fully available (by definition) and thus the available pixel
area is not restricted at all, and this can sometimes improve
compression efficiency. Because of this, the output of encodes with
frame parallelism disabled will not match the output of encodes with
frame parallelism enabled; but when enabled the number of frame threads
should have no effect on the output bitstream except when using ABR or
VBV rate control or noise reduction.

When :option:`--nr` is enabled, the outputs of each number of frame
threads will be deterministic but none of them will match, because each
frame encoder maintains a cumulative noise reduction state.

VBV introduces non-determinism in the encoder, at this point in time,
regardless of the amount of frame parallelism.

By default frame parallelism and WPP are enabled together. The number of
frame threads used is auto-detected from the (hyperthreaded) CPU core
count, but may be manually specified via :option:`--frame-threads`.

+-------+--------+
| Cores | Frames |
+=======+========+
| > 32  |   6    |
+-------+--------+
| >= 16 |   5    |
+-------+--------+
| >= 8  |   3    |
+-------+--------+
| >= 4  |   2    |
+-------+--------+

If WPP is disabled, then the frame thread count defaults to
**min(cpuCount, ctuRows / 2)**.
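Both defaults can be expressed as one small function. The final
``return 1`` for fewer than 4 cores is an assumption, as the table above
does not list that case:

```cpp
#include <algorithm>
#include <cassert>

// Sketch of the frame thread defaults described above.
int defaultFrameThreads(int cpuCount, int ctuRows, bool wpp)
{
    if (!wpp)
        return std::min(cpuCount, ctuRows / 2);
    if (cpuCount > 32)
        return 6;
    if (cpuCount >= 16)
        return 5;
    if (cpuCount >= 8)
        return 3;
    if (cpuCount >= 4)
        return 2;
    return 1; // assumed fallback, not stated in the table
}
```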

Over-allocating frame threads can be very counter-productive. They
each allocate a large amount of memory and because of the limited number
of CTU rows and the reference lag, you generally get limited benefit
from adding frame encoders beyond the auto-detected count, and often
the extra frame encoders reduce performance.

Given these considerations, you can understand why the faster presets
lower the max CTU size to 32x32 (making twice as many CTU rows available
for WPP and for finer grained frame parallelism) and reduce
:option:`--merange`.

Each frame encoder runs in its own thread (allocated separately from the
worker pool). This frame thread has some pre-processing responsibilities
and some post-processing responsibilities for each frame, but it spends
the bulk of its time managing the wave-front processing by making CTU
rows available to the worker threads when their dependencies are
resolved. The frame encoder threads spend nearly all of their time
blocked in one of four possible locations:

1. blocked, waiting for a frame to process
2. blocked on a reference frame, waiting for a CTU row of reconstructed
   and loop-filtered reference pixels to become available
3. blocked waiting for wave-front completion
4. blocked waiting for the main thread to consume an encoded frame

Lookahead
=========

The lookahead module of x265 (the lowres pre-encode which determines
scene cuts and slice types) uses the thread pool to distribute the
lowres cost analysis to worker threads. It follows the same wave-front
pattern as the main encoder except it works in reverse-scan order.

The function slicetypeDecide() itself may also be performed by a worker
thread if your system has enough CPU cores to make this a beneficial
trade-off, else it runs within the context of the thread which calls
x265_encoder_encode().

SAO
===

The Sample Adaptive Offset loopfilter has a large effect on encode
performance because of the peculiar way it must be analyzed and coded.

SAO flags and data are encoded at the CTU level before the CTU itself is
coded, but SAO analysis (deciding whether to enable SAO and with what
parameters) cannot be performed until that CTU is completely analyzed
(reconstructed pixels are available) as well as the CTUs to the right
and below. So in effect the encoder must perform SAO analysis in a
wavefront at least a full row behind the CTU compression wavefront.
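The one-row trailing dependency can be sketched with a readiness check
analogous to the WPP one. This is illustrative, not x265 code:

```cpp
#include <cassert>

// Illustrative (not x265 code): SAO analysis of CTU row r needs
// reconstructed pixels from rows r and r+1, so the SAO wavefront trails
// the compression wavefront by a full row. The last row can only be
// analyzed once the whole frame has been compressed.
bool saoRowReady(int row, int rowsCompressed, int totalRows)
{
    if (row == totalRows - 1)
        return rowsCompressed == totalRows;
    return rowsCompressed >= row + 2; // rows 0 .. row+1 fully compressed
}
```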

This extra latency forces the encoder to save the encode data of every
CTU until the entire frame has been analyzed, at which point a function
can code the final slice bitstream with the decided SAO flags and data
interleaved between each CTU. This second pass over the CTUs can be
expensive, particularly at large resolutions and high bitrates.