*********
Threading
*********

Thread Pool
===========

x265 creates a pool of worker threads and shares this thread pool
with all encoders within the same process (it is process global, aka a
singleton). The number of threads within the thread pool is determined
by the encoder which first allocates the pool, which by definition is
the first encoder created within each process.

:option:`--threads` specifies the number of threads the encoder will
try to allocate for its thread pool. If the thread pool was already
allocated, this parameter is ignored. By default x265 allocates one
thread per (hyperthreaded) CPU core in your system.

Work distribution is job based. Idle worker threads ask their parent
pool object for jobs to perform. When no jobs are available, idle
worker threads block and consume no CPU cycles.

Objects which distribute work to worker threads are known as job
providers (they derive from the JobProvider class). When job providers
have work, they enqueue themselves into the pool's provider list (and
dequeue themselves when they no longer have work). The thread pool has
a method to **poke** awake a blocked idle thread, and job providers are
recommended to call this method when they make new jobs available.

Worker jobs are not allowed to block except when absolutely necessary
for data locking. If a job becomes blocked, the worker thread is
expected to drop that job, return to the pool, and find other work.

.. note::

   x265_cleanup() frees the process-global thread pool, allowing
   it to be reallocated if necessary, but only if no encoders are
   allocated at the time it is called.

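The job-provider model described above can be sketched roughly as
follows. This is a simplified Python illustration, not x265's actual
C++ implementation; the names mirror the concepts (JobProvider, the
pool's provider list, the **poke** method), but every detail below is
an assumption made for the sketch.

```python
import threading
from collections import deque

class JobProvider:
    """An object with work to distribute (a sketch of the concept
    described above, not x265's actual JobProvider class)."""
    def find_job(self):
        """Return a callable to run, or None if no work is ready."""
        raise NotImplementedError

class ThreadPool:
    def __init__(self, num_threads):
        self.providers = deque()          # job providers with pending work
        self.wake = threading.Condition()
        self.running = True
        self.threads = [threading.Thread(target=self._worker)
                        for _ in range(num_threads)]
        for t in self.threads:
            t.start()

    def enqueue(self, provider):
        # providers enqueue themselves when they have work ...
        with self.wake:
            self.providers.append(provider)
            self.wake.notify_all()

    def poke(self):
        # ... and poke the pool awake when new jobs become available
        with self.wake:
            self.wake.notify()

    def _worker(self):
        while True:
            job = None
            with self.wake:
                if not self.running:
                    return
                for p in self.providers:
                    job = p.find_job()
                    if job is not None:
                        break
                if job is None:
                    # idle workers block here, consuming no CPU cycles
                    self.wake.wait(timeout=0.1)
            if job is not None:
                job()   # run outside the lock; jobs must not block

    def stop(self):
        with self.wake:
            self.running = False
            self.wake.notify_all()
        for t in self.threads:
            t.join()
```

A fuller sketch would also let providers dequeue themselves when they
run out of work; that bookkeeping is omitted here for brevity.
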
Wavefront Parallel Processing
=============================

New with HEVC, Wavefront Parallel Processing allows each row of CTUs to
be encoded in parallel, so long as each row stays at least two CTUs
behind the row above it, to ensure the intra references and other data
of the blocks above and above-right are available. WPP has almost no
effect on the analysis and compression of each CTU and so it has a very
small impact on compression efficiency relative to slices or tiles. The
compression loss from WPP has been found to be less than 1% in most of
our tests.

WPP has three effects which can impact efficiency. First, the row
starts must be signaled in the slice header; second, each row must be
padded to an even byte length; and third, the state of the entropy
coder is transferred from the second CTU of each row to the first CTU
of the row below it. In some conditions this transfer of state actually
improves compression, since the above-right state may have better
locality than the end of the previous row.

Parabola Research have published an excellent HEVC
`animation <http://www.parabolaresearch.com/blog/2013-12-01-hevc-wavefront-animation.html>`_
which visualizes WPP very well. It even correctly visualizes some of
WPP's key drawbacks, such as:

1. low thread utilization at the start and end of each frame
2. a difficult block may stall the wave-front, and it takes a while for
   the wave-front to recover
3. 64x64 CTUs are big! There are far fewer rows than with H.264 and
   similar codecs

Because of these stall issues you rarely get the full parallelisation
benefit one would expect from row threading; 30% to 50% of the
theoretical perfect threading is typical.

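The two-CTU lag can be modeled as a simple dependency schedule. The
sketch below (illustrative Python, not x265 code) computes the earliest
time step at which each CTU could start, assuming unlimited worker
threads and one time step per CTU:

```python
def wavefront_schedule(rows, cols):
    """Earliest time step each CTU can start, assuming one time step
    per CTU and unlimited worker threads (a model, not x265 code)."""
    step = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(step[r][c - 1] + 1)   # left neighbour finished
            if r > 0:
                # stay two CTUs behind the row above: the block
                # above-right must finish before this block can start
                deps.append(step[r - 1][min(c + 1, cols - 1)] + 1)
            step[r][c] = max(deps, default=0)
    return step

# A 1080p frame with 64x64 CTUs: 17 rows of 30 CTUs
steps = wavefront_schedule(17, 30)
```

In this model the last CTU of a 1080p frame cannot start until step
29 + 2×16 = 61, so the frame needs 62 steps versus 510 serially,
bounding the speedup near 8x despite 17 available rows — consistent
with the 30% to 50% figure above.
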
In x265 WPP is enabled by default, since it not only improves encode
performance but also makes it possible for the decoder to be threaded.

If WPP is disabled by :option:`--no-wpp`, the frame will be encoded in
scan order and the entropy overheads will be avoided. If frame
threading is not disabled, the encoder will change the default frame
thread count to be higher than if WPP was enabled. The exact formulas
are described in the next section.

Parallel Mode Analysis
======================

When :option:`--pmode` is enabled, each CU (at all depths from 64x64 to
8x8) will distribute its analysis work to the thread pool. Each analysis
job measures the cost of one prediction for the CU: merge, skip, intra,
or inter (2Nx2N, Nx2N, 2NxN, and AMP). At slower presets, the added
parallelism is often enough to allow frame parallelism to be reduced
while achieving the same overall CPU utilization. Reducing frame
threads is often beneficial to ABR and VBV rate control.

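The idea behind :option:`--pmode` can be sketched with an ordinary
thread pool: submit one cost measurement per candidate prediction, then
pick the cheapest. The mode names and cost functions below are
illustrative stand-ins, not x265's analysis functions.

```python
from concurrent.futures import ThreadPoolExecutor

def best_mode(cu, pool, cost_fns):
    # Submit one job per candidate prediction, as --pmode does per CU
    futures = {mode: pool.submit(fn, cu) for mode, fn in cost_fns.items()}
    # Wait for every cost, then choose the cheapest prediction
    costs = {mode: f.result() for mode, f in futures.items()}
    return min(costs, key=costs.get)
```

The benefit comes from the per-CU cost measurements running
concurrently on worker threads instead of serially on one thread.
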
Parallel Motion Estimation
==========================

When :option:`--pme` is enabled, all of the analysis functions which
perform motion searches to reference frames will distribute those motion
searches as jobs for worker threads (if more than two motion searches
are required).

Frame Threading
===============

Frame threading is the act of encoding multiple frames at the same time.
It is a challenge because each frame will generally use one or more of
the previously encoded frames as motion references, and those frames may
still be in the process of being encoded themselves.

Previous encoders such as x264 worked around this problem by limiting
the motion search region within these reference frames to just one
macroblock row below the coincident row being encoded. Thus a frame
could be encoded at the same time as its reference frames so long as it
stayed one row behind the encode progress of its references (glossing
over a few details).

x265 has the same frame threading mechanism, but we generally have much
less frame parallelism to exploit than x264 because of the size of our
CTU rows. For instance, with 1080p video x264 has 68 16x16 macroblock
rows available each frame while x265 only has 17 64x64 CTU rows.

The second complicating factor is the loop filters. The pixels used
for motion reference must be processed by the loop filters, and the loop
filters cannot run until a full row has been encoded; they must also run
a full row behind the encode process so that the pixels below the row
being filtered are available. When you add up all the row lags, each
frame ends up being 3 CTU rows behind its reference frames (the
equivalent of 12 macroblock rows for x264).

The third complicating factor is that when a frame being encoded
becomes blocked waiting for a reference frame row to become available,
that frame's wave-front becomes completely stalled, and when the row
becomes available again it can take quite some time for the wave to be
restarted, if it ever is. This makes WPP many times less effective when
frame parallelism is in use.

:option:`--merange` can have a negative impact on frame parallelism. If
the range is too large, more rows of CTU lag must be added to ensure
those pixels are available in the reference frames.

.. note::

   Even though the merange is used to determine the amount of reference
   pixels that must be available in the reference frames, the actual
   motion search is not necessarily centered around the coincident
   block. The motion search is actually centered around the motion
   predictor, but the available pixel area (mvmin, mvmax) is determined
   by merange and the interpolation filter half-heights.

When frame threading is disabled, all reference frames are by definition
always fully available, and thus the available pixel area is not
restricted at all; this can sometimes improve compression efficiency.
Because of this, the output of encodes with frame parallelism disabled
will not match the output of encodes with frame parallelism enabled.
When frame parallelism is enabled, however, the number of frame threads
should have no effect on the output bitstream except when using ABR or
VBV rate control or noise reduction.

When :option:`--nr` is enabled, the outputs for each number of frame
threads will be deterministic but none of them will match, because each
frame encoder maintains a cumulative noise reduction state.

VBV introduces non-determinism in the encoder, at this point in time,
regardless of the amount of frame parallelism.

By default frame parallelism and WPP are enabled together. The number of
frame threads used is auto-detected from the (hyperthreaded) CPU core
count, but may be manually specified via :option:`--frame-threads`.

   +-------+--------+
   | Cores | Frames |
   +=======+========+
   | > 32  | 6      |
   +-------+--------+
   | >= 16 | 5      |
   +-------+--------+
   | >= 8  | 3      |
   +-------+--------+
   | >= 4  | 2      |
   +-------+--------+

If WPP is disabled, then the frame thread count defaults to **min(cpuCount, ctuRows / 2)**.
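The defaults above can be expressed as a small function. This is a
sketch of the documented behaviour, not x265's actual code; in
particular, the value returned for fewer than 4 cores is an assumption,
since the table does not specify one.

```python
def default_frame_threads(cpu_count, ctu_rows, wpp=True):
    """Default frame-thread count, modeled on the table and formula
    above (a sketch of the documented defaults, not x265's code)."""
    if not wpp:
        return min(cpu_count, ctu_rows // 2)
    if cpu_count > 32:
        return 6
    if cpu_count >= 16:
        return 5
    if cpu_count >= 8:
        return 3
    if cpu_count >= 4:
        return 2
    return 1  # assumption: the table does not cover fewer than 4 cores
```

For example, a 1080p encode (17 CTU rows) on an 8-core machine would
default to 3 frame threads with WPP, or min(8, 8) = 8 without it.
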

Over-allocating frame threads can be very counter-productive. They
each allocate a large amount of memory, and because of the limited
number of CTU rows and the reference lag, you generally get limited
benefit from adding frame encoders beyond the auto-detected count;
often the extra frame encoders reduce performance.

Given these considerations, you can understand why the faster presets
lower the max CTU size to 32x32 (making twice as many CTU rows available
for WPP and for finer grained frame parallelism) and reduce
:option:`--merange`.

Each frame encoder runs in its own thread (allocated separately from the
worker pool). This frame thread has some pre-processing responsibilities
and some post-processing responsibilities for each frame, but it spends
the bulk of its time managing the wave-front processing by making CTU
rows available to the worker threads when their dependencies are
resolved. The frame encoder threads spend nearly all of their time
blocked in one of four possible locations:

1. blocked, waiting for a frame to process
2. blocked on a reference frame, waiting for a CTU row of reconstructed
   and loop-filtered reference pixels to become available
3. blocked waiting for wave-front completion
4. blocked waiting for the main thread to consume an encoded frame

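The four blocking points map naturally onto a producer/consumer loop.
The sketch below models a frame encoder thread with Python queues; it
is an illustration of the life cycle only, and points 2 and 3 (blocking
on reference rows and on wave-front completion) are folded into the
hypothetical encode_rows callback:

```python
import queue
import threading

def frame_encoder_loop(in_frames, out_frames, encode_rows):
    """Life cycle of one frame encoder thread (illustrative sketch)."""
    while True:
        frame = in_frames.get()   # 1. block waiting for a frame to process
        if frame is None:         # shutdown sentinel
            return
        # 2-3. drive the wave-front; a real frame encoder blocks here
        # on reference rows and on wave-front completion
        encode_rows(frame)
        out_frames.put(frame)     # 4. block until the consumer makes room
```

Giving out_frames a small maximum size would model the main thread
applying back-pressure when it is slow to consume encoded frames.
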
Lookahead
=========

The lookahead module of x265 (the lowres pre-encode which determines
scene cuts and slice types) uses the thread pool to distribute the
lowres cost analysis to worker threads. It follows the same wave-front
pattern as the main encoder, except that it works in reverse-scan order.

The function slicetypeDecide() itself may also be performed by a worker
thread if your system has enough CPU cores to make this a beneficial
trade-off; otherwise it runs within the context of the thread which
calls x265_encoder_encode().

SAO
===

The Sample Adaptive Offset loopfilter has a large effect on encode
performance because of the peculiar way it must be analyzed and coded.

SAO flags and data are encoded at the CTU level before the CTU itself is
coded, but SAO analysis (deciding whether to enable SAO and with what
parameters) cannot be performed until that CTU and the CTUs to its right
and below are completely analyzed (i.e., their reconstructed pixels are
available). So in effect the encoder must perform SAO analysis in a
wavefront at least a full row behind the CTU compression wavefront.

This extra latency forces the encoder to save the encode data of every
CTU until the entire frame has been analyzed, at which point a function
can code the final slice bitstream with the decided SAO flags and data
interleaved between each CTU. This second pass over the CTUs can be
expensive, particularly at large resolutions and high bitrates.
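
The save-then-interleave structure described above might be sketched
like this; every helper name (analyze_ctu, decide_sao, write) is a
hypothetical placeholder, not an x265 function:

```python
def code_frame_with_sao(ctus, analyze_ctu, decide_sao, write):
    # First pass: compress every CTU, saving its coded data, because the
    # SAO decision for a CTU needs its neighbours' reconstructed pixels.
    coded = [analyze_ctu(ctu) for ctu in ctus]
    # SAO parameters are decided once the whole frame has been analyzed.
    sao = [decide_sao(i, coded) for i in range(len(coded))]
    # Second pass: emit the slice with each CTU's SAO flags and data
    # interleaved immediately before the CTU itself.
    for params, data in zip(sao, coded):
        write(params)
        write(data)
```

The second loop is the extra pass the text refers to: it revisits every
CTU's saved data purely to produce the final interleaved bitstream.
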