*********
Threading
*********

Thread Pool
===========

x265 creates a pool of worker threads and shares this thread pool
with all encoders within the same process (it is process global, aka a
singleton). The number of threads within the thread pool is determined
by the encoder which first allocates the pool, which by definition is
the first encoder created within each process.

:option:`--threads` specifies the number of threads the encoder will
try to allocate for its thread pool. If the thread pool was already
allocated this parameter is ignored. By default x265 allocates one
thread per (hyperthreaded) CPU core in your system.

Work distribution is job based. Idle worker threads ask their parent
pool object for jobs to perform. When no jobs are available, idle
worker threads block and consume no CPU cycles.

Objects which desire to distribute work to worker threads are known as
job providers (and they derive from the JobProvider class). When job
providers have work they enqueue themselves into the pool's provider
list (and dequeue themselves when they no longer have work). The thread
pool has a method to **poke** awake a blocked idle thread, and job
providers are recommended to call this method when they make new jobs
available.

Worker jobs are not allowed to block except when absolutely necessary
for data locking. If a job becomes blocked, the worker thread is
expected to drop that job and return to the pool to find more work.

.. note::

	x265_cleanup() frees the process-global thread pool, allowing
	it to be reallocated if necessary, but only if no encoders are
	allocated at the time it is called.

Wavefront Parallel Processing
=============================

New with HEVC, Wavefront Parallel Processing allows each row of CTUs to
be encoded in parallel, so long as each row stays at least two CTUs
behind the row above it, to ensure the intra references and other data
of the blocks above and above-right are available. WPP has almost no
effect on the analysis and compression of each CTU and so it has a very
small impact on compression efficiency relative to slices or tiles. The
compression loss from WPP has been found to be less than 1% in most of
our tests.

WPP has three effects which can impact efficiency. First, the row
starts must be signaled in the slice header; second, each row must be
padded to an even byte length; and third, the state of the entropy
coder is transferred from the second CTU of each row to the first CTU
of the row below it. In some conditions this transfer of state actually
improves compression, since the above-right state may have better
locality than the end of the previous row.
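The two-CTU lag rule can be sketched as a simple readiness check. This
is an illustrative helper, not x265 code; ``ctusDoneInRow`` tracks how
many CTUs each row has completed:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative helper (not x265 code): a CTU at (row, col) may be coded
// once the row above has finished the CTU two columns ahead, so that the
// above and above-right neighbours are available. Row 0 has no such
// dependency, and the last CTU of a row only needs the row above complete.
bool ctuReady(int row, int col, const std::vector<int>& ctusDoneInRow, int ctusPerRow)
{
    if (row == 0)
        return true;
    int needed = std::min(col + 2, ctusPerRow); // above + above-right CTUs
    return ctusDoneInRow[row - 1] >= needed;
}
```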

Parabola Research have published an excellent HEVC
`animation <http://www.parabolaresearch.com/blog/2013-12-01-hevc-wavefront-animation.html>`_
which visualizes WPP very well. It even correctly visualizes some of
WPP's key drawbacks, such as:

1. the low thread utilization at the start and end of each frame
2. a difficult block may stall the wave-front, and it takes a while for
   the wave-front to recover
3. 64x64 CTUs are big! there are far fewer rows than with H.264 and
   similar codecs

Because of these stall issues you rarely get the full parallelisation
benefit one would expect from row threading; 30% to 50% of the
theoretical perfect threading is typical.

In x265 WPP is enabled by default, since it not only improves
performance during encoding but also makes it possible for the decoder
to be threaded.

If WPP is disabled by :option:`--no-wpp` the frame will be encoded in
scan order and the entropy overheads will be avoided. If frame
threading is not disabled, the encoder will raise the default frame
thread count above what it would be with WPP enabled. The exact
formulas are described in the next section.

Parallel Mode Analysis
======================

When :option:`--pmode` is enabled, each CU (at all depths from 64x64 to
8x8) will distribute its analysis work to the thread pool. Each analysis
job will measure the cost of one prediction for the CU: merge, skip,
intra, inter (2Nx2N, Nx2N, 2NxN, and AMP). At slower presets, the amount
of increased parallelism is often enough to be able to reduce frame
parallelism while achieving the same overall CPU utilization. Reducing
frame threads is often beneficial to ABR and VBV rate control.
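The shape of this distribution can be sketched as below. The names and
costs are stand-ins, not x265's implementation, and ``std::async`` here
merely stands in for handing a job to the worker pool; the key property
is that each mode's cost is measured independently and the cheapest wins:

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <vector>

// Illustrative sketch of --pmode style analysis: each prediction type for
// a CU (merge, skip, intra, inter...) is costed as an independent job,
// then the cheapest mode is selected once all jobs complete.
int bestMode(const std::vector<std::function<int()> >& modeJobs)
{
    std::vector<std::future<int> > costs;
    for (size_t i = 0; i < modeJobs.size(); i++)
        // std::async stands in for enqueueing the job with the thread pool
        costs.push_back(std::async(std::launch::async, modeJobs[i]));

    int best = 0;
    int bestCost = costs[0].get();
    for (size_t i = 1; i < costs.size(); i++)
    {
        int c = costs[i].get();
        if (c < bestCost)
        {
            bestCost = c;
            best = (int)i;
        }
    }
    return best;
}
```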

Parallel Motion Estimation
==========================

When :option:`--pme` is enabled all of the analysis functions which
perform motion searches to reference frames will distribute those motion
searches as jobs for worker threads (if more than two motion searches
are required).

Frame Threading
===============

Frame threading is the act of encoding multiple frames at the same time.
It is a challenge because each frame will generally use one or more of
the previously encoded frames as motion references and those frames may
still be in the process of being encoded themselves.

Previous encoders such as x264 worked around this problem by limiting
the motion search region within these reference frames to just one
macroblock row below the coincident row being encoded. Thus a frame
could be encoded at the same time as its reference frames so long as it
stayed one row behind the encode progress of its references (glossing
over a few details).

x265 has the same frame threading mechanism, but we generally have much
less frame parallelism to exploit than x264 because of the size of our
CTU rows. For instance, with 1080p video x264 has 68 16x16 macroblock
rows available each frame while x265 only has 17 64x64 CTU rows.
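Those row counts are just the picture height divided by the block size,
rounded up:

```cpp
#include <cassert>

// Rows available for wave-front and frame parallelism at a given picture
// height: one row per CTU (or macroblock) of height, rounding up.
int blockRows(int pictureHeight, int blockSize)
{
    return (pictureHeight + blockSize - 1) / blockSize;
}
```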

The second complication is the loop filters. The pixels used for motion
reference must be processed by the loop filters, and the loop filters
cannot run until a full row has been encoded; they must run a full row
behind the encode process so that the pixels below the row being
filtered are available. When you add up all the row lags, each frame
ends up being 3 CTU rows behind its reference frames (the equivalent of
12 macroblock rows for x264).

The third complication is that when a frame being encoded is blocked
waiting for a reference frame row to become available, that frame's
wave-front becomes completely stalled, and when the row does become
available it can take quite some time for the wave to be restarted, if
it ever is. This makes WPP many times less effective when frame
parallelism is in use.

:option:`--merange` can have a negative impact on frame parallelism. If
the range is too large, more rows of CTU lag must be added to ensure
those pixels are available in the reference frames.

.. note::

	Even though the merange is used to determine the amount of reference
	pixels that must be available in the reference frames, the actual
	motion search is not necessarily centered around the coincident
	block. The motion search is actually centered around the motion
	predictor, but the available pixel area (mvmin, mvmax) is determined
	by merange and the interpolation filter half-heights.
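The arithmetic behind the row lag can be sketched as below. The constants
and rounding are assumptions for illustration, not x265's exact formula,
but they show how merange, the interpolation margin, and the one-row loop
filter lag combine into the 3-CTU-row figure mentioned earlier:

```cpp
#include <cassert>

// Illustrative arithmetic only (the constants are assumptions, not
// x265's actual code): the CTU-row lag a frame keeps behind its
// references grows out of merange, the interpolation filter margin, and
// the one-row loop filter lag.
int refLagRows(int merange, int ctuSize)
{
    const int interpHalfHeight = 4; // assumed margin for the 8-tap luma filter
    const int loopFilterLag = 1;    // loop filter runs one row behind encode

    // rows of reference pixels a vertical search of merange can reach
    int searchRows = (merange + interpHalfHeight + ctuSize - 1) / ctuSize;
    return searchRows + loopFilterLag + 1; // +1 for the row being encoded
}
```

With these assumed margins, a default-sized merange stays within one CTU
row of search reach, giving a total lag of three rows; a much larger
merange adds further rows of lag and costs frame parallelism.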

When frame threading is disabled, the entirety of all reference frames
are always fully available (by definition) and thus the available pixel
area is not restricted at all, and this can sometimes improve
compression efficiency. Because of this, the output of encodes with
frame parallelism disabled will not match the output of encodes with
frame parallelism enabled; but when enabled the number of frame threads
should have no effect on the output bitstream except when using ABR or
VBV rate control or noise reduction.

When :option:`--nr` is enabled, the outputs of each number of frame
threads will be deterministic but none of them will match, because each
frame encoder maintains a cumulative noise reduction state.

VBV introduces non-determinism in the encoder, at this point in time,
regardless of the amount of frame parallelism.

By default frame parallelism and WPP are enabled together. The number of
frame threads used is auto-detected from the (hyperthreaded) CPU core
count, but may be manually specified via :option:`--frame-threads`.

+-------+--------+
| Cores | Frames |
+=======+========+
| > 32  |   6    |
+-------+--------+
| >= 16 |   5    |
+-------+--------+
| >= 8  |   3    |
+-------+--------+
| >= 4  |   2    |
+-------+--------+

If WPP is disabled, then the frame thread count defaults to
**min(cpuCount, ctuRows / 2)**.
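Both defaults can be expressed as one small function. The final
``return 1`` for fewer than 4 cores is an assumption, as the table above
does not list that case:

```cpp
#include <algorithm>
#include <cassert>

// Sketch of the frame thread defaults described above.
int defaultFrameThreads(int cpuCount, int ctuRows, bool wpp)
{
    if (!wpp)
        return std::min(cpuCount, ctuRows / 2);
    if (cpuCount > 32)
        return 6;
    if (cpuCount >= 16)
        return 5;
    if (cpuCount >= 8)
        return 3;
    if (cpuCount >= 4)
        return 2;
    return 1; // assumed fallback, not stated in the table
}
```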

Over-allocating frame threads can be very counter-productive. They
each allocate a large amount of memory and because of the limited number
of CTU rows and the reference lag, you generally get limited benefit
from adding frame encoders beyond the auto-detected count, and often
the extra frame encoders reduce performance.

Given these considerations, you can understand why the faster presets
lower the max CTU size to 32x32 (making twice as many CTU rows available
for WPP and for finer grained frame parallelism) and reduce
:option:`--merange`.

Each frame encoder runs in its own thread (allocated separately from the
worker pool). This frame thread has some pre-processing responsibilities
and some post-processing responsibilities for each frame, but it spends
the bulk of its time managing the wave-front processing by making CTU
rows available to the worker threads when their dependencies are
resolved. The frame encoder threads spend nearly all of their time
blocked in one of four possible locations:

1. blocked, waiting for a frame to process
2. blocked on a reference frame, waiting for a CTU row of reconstructed
   and loop-filtered reference pixels to become available
3. blocked waiting for wave-front completion
4. blocked waiting for the main thread to consume an encoded frame

Lookahead
=========

The lookahead module of x265 (the lowres pre-encode which determines
scene cuts and slice types) uses the thread pool to distribute the
lowres cost analysis to worker threads. It follows the same wave-front
pattern as the main encoder except it works in reverse-scan order.

The function slicetypeDecide() itself may also be performed by a worker
thread if your system has enough CPU cores to make this a beneficial
trade-off, else it runs within the context of the thread which calls
x265_encoder_encode().

SAO
===

The Sample Adaptive Offset loopfilter has a large effect on encode
performance because of the peculiar way it must be analyzed and coded.

SAO flags and data are encoded at the CTU level before the CTU itself is
coded, but SAO analysis (deciding whether to enable SAO and with what
parameters) cannot be performed until that CTU is completely analyzed
(reconstructed pixels are available) as well as the CTUs to the right
and below. So in effect the encoder must perform SAO analysis in a
wavefront at least a full row behind the CTU compression wavefront.
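The one-row trailing dependency can be sketched with a readiness check
analogous to the WPP one. This is illustrative, not x265 code:

```cpp
#include <cassert>

// Illustrative (not x265 code): SAO analysis of CTU row r needs
// reconstructed pixels from rows r and r+1, so the SAO wavefront trails
// the compression wavefront by a full row. The last row can only be
// analyzed once the whole frame has been compressed.
bool saoRowReady(int row, int rowsCompressed, int totalRows)
{
    if (row == totalRows - 1)
        return rowsCompressed == totalRows;
    return rowsCompressed >= row + 2; // rows 0 .. row+1 fully compressed
}
```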

This extra latency forces the encoder to save the encode data of every
CTU until the entire frame has been analyzed, at which point a function
can code the final slice bitstream with the decided SAO flags and data
interleaved between each CTU. This second pass over the CTUs can be
expensive, particularly at large resolutions and high bitrates.