*********
Threading
*********

Thread Pool
===========

x265 creates a pool of worker threads and shares this thread pool
with all encoders within the same process (it is process global, aka a
singleton). The number of threads within the thread pool is determined
by the encoder which first allocates the pool, which by definition is
the first encoder created within each process.

:option:`--threads` specifies the number of threads the encoder will
try to allocate for its thread pool. If the thread pool was already
allocated, this parameter is ignored. By default x265 allocates one
thread per (hyperthreaded) CPU core in your system.

Work distribution is job based. Idle worker threads ask their parent
pool object for jobs to perform. When no jobs are available, idle
worker threads block and consume no CPU cycles.

Objects which desire to distribute work to worker threads are known as
job providers (and they derive from the JobProvider class). When job
providers have work they enqueue themselves into the pool's provider
list (and dequeue themselves when they no longer have work). The thread
pool has a method to **poke** awake a blocked idle thread, and job
providers are recommended to call this method when they make new jobs
available.

Worker jobs are not allowed to block except when absolutely necessary
for data locking. If a job becomes blocked, the worker thread is
expected to drop that job and go back to the pool and find more work.
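
The enqueue-then-poke pattern above can be sketched with standard C++
primitives. This is a simplified, hypothetical model (x265's real
ThreadPool and JobProvider classes differ), but it shows how idle
workers block on a condition variable, consuming no CPU until a
provider pokes one awake:

.. code-block:: cpp

    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>

    class JobQueue
    {
        std::mutex m;
        std::condition_variable cv;
        std::deque<std::function<void()>> jobs;
        bool stopping = false;

    public:
        // A provider enqueues new work, then pokes one blocked worker awake.
        void enqueue(std::function<void()> job)
        {
            {
                std::lock_guard<std::mutex> lk(m);
                jobs.push_back(std::move(job));
            }
            cv.notify_one(); // the "poke"
        }

        // Wake everyone so idle workers can exit once the queue drains.
        void stop()
        {
            {
                std::lock_guard<std::mutex> lk(m);
                stopping = true;
            }
            cv.notify_all();
        }

        // Each worker thread runs this loop: block (using no CPU) while
        // idle, otherwise pop a job and run it outside the lock.
        void workerLoop()
        {
            for (;;)
            {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lk(m);
                    cv.wait(lk, [this] { return stopping || !jobs.empty(); });
                    if (jobs.empty())
                        return; // stopping and fully drained
                    job = std::move(jobs.front());
                    jobs.pop_front();
                }
                job();
            }
        }
    };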

.. note::

   x265_cleanup() frees the process-global thread pool, allowing
   it to be reallocated if necessary, but only if no encoders are
   allocated at the time it is called.

Wavefront Parallel Processing
=============================

New with HEVC, Wavefront Parallel Processing allows each row of CTUs to
be encoded in parallel, so long as each row stays at least two CTUs
behind the row above it, to ensure the intra references and other data
of the blocks above and above-right are available. WPP has almost no
effect on the analysis and compression of each CTU and so it has a very
small impact on compression efficiency relative to slices or tiles. The
compression loss from WPP has been found to be less than 1% in most of
our tests.
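
The two-CTU dependency rule can be expressed as a small predicate. This
is an illustrative helper, not code from x265, with ``done[r]`` holding
the count of CTUs already completed in row ``r``:

.. code-block:: cpp

    #include <vector>

    // May CTU (row, col) start encoding?  The row above must be at least
    // two CTUs ahead so that the above and above-right neighbours of
    // (row, col) are fully available.
    bool wppReady(int row, int col, const std::vector<int>& done)
    {
        if (row == 0)
            return true; // the top row has no row dependency
        return done[row - 1] >= col + 2;
    }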

WPP has three effects which can impact efficiency. The first is that the
row starts must be signaled in the slice header, the second is that each
row must be padded to an even byte in length, and the third is that the
state of the entropy coder is transferred from the second CTU of each
row to the first CTU of the row below it. In some conditions this
transfer of state actually improves compression, since the above-right
state may have better locality than the end of the previous row.

Parabola Research have published an excellent HEVC
`animation <http://www.parabolaresearch.com/blog/2013-12-01-hevc-wavefront-animation.html>`_
which visualizes WPP very well. It even correctly visualizes some of
WPP's key drawbacks, such as:

1. the low thread utilization at the start and end of each frame
2. a difficult block may stall the wave-front and it takes a while for
   the wave-front to recover
3. 64x64 CTUs are big! There are far fewer rows than with H.264 and
   similar codecs

Because of these stall issues you rarely get the full parallelisation
benefit one would expect from row threading; 30% to 50% of the
theoretical perfect threading is typical.

In x265 WPP is enabled by default, since it not only improves encode
performance but also makes it possible for the decoder to be threaded.

If WPP is disabled by :option:`--no-wpp` the frame will be encoded in
scan order and the entropy overheads will be avoided. If frame
threading is not disabled, the encoder will change the default frame
thread count to be higher than if WPP was enabled. The exact formulas
are described in the next section.

Parallel Mode Analysis
======================

When :option:`--pmode` is enabled, each CU (at all depths from 64x64 to
8x8) will distribute its analysis work to the thread pool. Each analysis
job will measure the cost of one prediction for the CU: merge, skip,
intra, inter (2Nx2N, Nx2N, 2NxN, and AMP). At slower presets, the amount
of increased parallelism is often enough to be able to reduce frame
parallelism while achieving the same overall CPU utilization. Reducing
frame threads is often beneficial to ABR and VBV rate control.
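
As a rough sketch (hypothetical types, not x265's internals), each
candidate prediction is costed by an independent job, and once all jobs
complete the CU keeps the cheapest result:

.. code-block:: cpp

    #include <limits>
    #include <string>
    #include <vector>

    struct ModeCost {
        std::string mode; // "merge", "skip", "intra", "inter_2Nx2N", ...
        int cost;         // rd-cost measured by that job (stand-in value)
    };

    // In the real encoder each cost is produced by a worker-thread job;
    // here the jobs are shown already complete, and only the final
    // reduction over their results is sketched.
    std::string bestMode(const std::vector<ModeCost>& results)
    {
        int best = std::numeric_limits<int>::max();
        std::string winner;
        for (const ModeCost& r : results)
            if (r.cost < best) { best = r.cost; winner = r.mode; }
        return winner;
    }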

Parallel Motion Estimation
==========================

When :option:`--pme` is enabled all of the analysis functions which
perform motion searches to reference frames will distribute those motion
searches as jobs for worker threads (if more than two motion searches
are required).

Frame Threading
===============

Frame threading is the act of encoding multiple frames at the same time.
It is a challenge because each frame will generally use one or more of
the previously encoded frames as motion references and those frames may
still be in the process of being encoded themselves.

Previous encoders such as x264 worked around this problem by limiting
the motion search region within these reference frames to just one
macroblock row below the coincident row being encoded. Thus a frame
could be encoded at the same time as its reference frames so long as it
stayed one row behind the encode progress of its references (glossing
over a few details).

x265 has the same frame threading mechanism, but we generally have much
less frame parallelism to exploit than x264 because of the size of our
CTU rows. For instance, with 1080p video x264 has 68 16x16 macroblock
rows available each frame while x265 only has 17 64x64 CTU rows.

The second complicating factor is the loop filters. The pixels used
for motion reference must be processed by the loop filters, and the loop
filters cannot run until a full row has been encoded; they must also run
a full row behind the encode process so that the pixels below the row
being filtered are available. When you add up all the row lags, each
frame ends up being 3 CTU rows behind its reference frames (the
equivalent of 12 macroblock rows for x264).

The third complicating factor is that when a frame being encoded
becomes blocked waiting for a reference frame row to become available,
that frame's wave-front becomes completely stalled, and when the row
does become available it can take quite some time for the wave to be
restarted, if it ever is. This makes WPP many times less effective when
frame parallelism is in use.

:option:`--merange` can have a negative impact on frame parallelism. If
the range is too large, more rows of CTU lag must be added to ensure
those pixels are available in the reference frames.

.. note::

   Even though the merange is used to determine the amount of reference
   pixels that must be available in the reference frames, the actual
   motion search is not necessarily centered around the coincident
   block. The motion search is actually centered around the motion
   predictor, but the available pixel area (mvmin, mvmax) is determined
   by merange and the interpolation filter half-heights.
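
The relationship between merange and the reference-row lag can be
approximated as follows. This is illustrative arithmetic, not x265's
actual code, assuming a 4-pixel luma interpolation half-filter and the
loop-filter lag described above; with the default merange of 57 and
64x64 CTUs it reproduces the 3-row lag quoted earlier:

.. code-block:: cpp

    // Rough estimate of how many CTU rows a frame must trail its
    // references by, for a given motion search range.
    int refLagRows(int merange, int ctuSize)
    {
        int pixels = merange + 4;                    // search range + half-filter
        int rows = (pixels + ctuSize - 1) / ctuSize; // round up to whole CTU rows
        return rows + 2;                             // loop filter runs a row behind
                                                     // and needs the row below it
    }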

When frame threading is disabled, the entirety of all reference frames
is always fully available (by definition) and thus the available pixel
area is not restricted at all, and this can sometimes improve
compression efficiency. Because of this, the output of encodes with
frame parallelism disabled will not match the output of encodes with
frame parallelism enabled; but when enabled, the number of frame threads
should have no effect on the output bitstream except when using ABR or
VBV rate control or noise reduction.

When :option:`--nr` is enabled, the outputs of each number of frame threads
will be deterministic but none of them will match, because each frame
encoder maintains a cumulative noise reduction state.

VBV introduces non-determinism in the encoder, at this point in time,
regardless of the amount of frame parallelism.

By default frame parallelism and WPP are enabled together. The number of
frame threads used is auto-detected from the (hyperthreaded) CPU core
count, but may be manually specified via :option:`--frame-threads`.

+-------+--------+
| Cores | Frames |
+=======+========+
| > 32  | 6      |
+-------+--------+
| >= 16 | 5      |
+-------+--------+
| >= 8  | 3      |
+-------+--------+
| >= 4  | 2      |
+-------+--------+

If WPP is disabled, then the frame thread count defaults to
**min(cpuCount, ctuRows / 2)**.
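
The table and formula above can be combined into one helper. The
fallback of one frame thread for machines with fewer than 4 cores is an
assumption for illustration, not something stated in this documentation:

.. code-block:: cpp

    #include <algorithm>

    int defaultFrameThreads(int cores, bool wpp, int ctuRows)
    {
        if (!wpp)
            return std::min(cores, ctuRows / 2); // WPP disabled: row-limited
        if (cores > 32)  return 6;
        if (cores >= 16) return 5;
        if (cores >= 8)  return 3;
        if (cores >= 4)  return 2;
        return 1; // assumed fallback for small machines
    }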

Over-allocating frame threads can be very counter-productive. They
each allocate a large amount of memory, and because of the limited number
of CTU rows and the reference lag, you generally get limited benefit
from adding frame encoders beyond the auto-detected count; often
the extra frame encoders reduce performance.

Given these considerations, you can understand why the faster presets
lower the max CTU size to 32x32 (making twice as many CTU rows available
for WPP and for finer grained frame parallelism) and reduce
:option:`--merange`.

Each frame encoder runs in its own thread (allocated separately from the
worker pool). This frame thread has some pre-processing responsibilities
and some post-processing responsibilities for each frame, but it spends
the bulk of its time managing the wave-front processing by making CTU
rows available to the worker threads when their dependencies are
resolved. The frame encoder threads spend nearly all of their time
blocked in one of four possible locations:

1. blocked, waiting for a frame to process
2. blocked on a reference frame, waiting for a CTU row of reconstructed
   and loop-filtered reference pixels to become available
3. blocked waiting for wave-front completion
4. blocked waiting for the main thread to consume an encoded frame

Lookahead
=========

The lookahead module of x265 (the lowres pre-encode which determines
scene cuts and slice types) uses the thread pool to distribute the
lowres cost analysis to worker threads. It follows the same wave-front
pattern as the main encoder, except it works in reverse-scan order.

The function slicetypeDecide() itself may also be performed by a worker
thread if your system has enough CPU cores to make this a beneficial
trade-off; otherwise it runs within the context of the thread which
calls x265_encoder_encode().

SAO
===

The Sample Adaptive Offset loopfilter has a large effect on encode
performance because of the peculiar way it must be analyzed and coded.

SAO flags and data are encoded at the CTU level before the CTU itself is
coded, but SAO analysis (deciding whether to enable SAO and with what
parameters) cannot be performed until that CTU is completely analyzed
(reconstructed pixels are available) as well as the CTUs to the right
and below. So in effect the encoder must perform SAO analysis in a
wavefront at least a full row behind the CTU compression wavefront.
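
That dependency can be written as a predicate. This is an illustrative
helper, not x265's code, with ``done[r]`` again counting the CTUs
reconstructed in row ``r``:

.. code-block:: cpp

    #include <algorithm>
    #include <vector>

    // May SAO analysis for CTU (row, col) run?  It needs reconstructed
    // pixels for that CTU, the CTU to its right, and the CTU below it
    // (dependencies are clamped at the frame edges).
    bool saoReady(int row, int col, const std::vector<int>& done,
                  int ctuCols, int ctuRowsTotal)
    {
        int rightNeeded = std::min(col + 2, ctuCols);
        if (done[row] < rightNeeded)
            return false; // this CTU or its right neighbour not done
        if (row + 1 < ctuRowsTotal && done[row + 1] < col + 1)
            return false; // the CTU below is not done
        return true;
    }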

This extra latency forces the encoder to save the encode data of every
CTU until the entire frame has been analyzed, at which point a function
can code the final slice bitstream with the decided SAO flags and data
interleaved between each CTU. This second pass over the CTUs can be
expensive, particularly at large resolutions and high bitrates.