*********
Threading
*********

Thread Pool
===========

x265 creates a pool of worker threads and shares this thread pool
with all encoders within the same process (it is process global, aka a
singleton). The number of threads within the thread pool is determined
by the encoder which first allocates the pool, which by definition is
the first encoder created within each process.

:option:`--threads` specifies the number of threads the encoder will
try to allocate for its thread pool. If the thread pool was already
allocated this parameter is ignored. By default x265 allocates one
thread per (hyperthreaded) CPU core in your system.

Work distribution is job based. Idle worker threads ask their parent
pool object for jobs to perform. When no jobs are available, idle
worker threads block and consume no CPU cycles.

Objects which desire to distribute work to worker threads are known as
job providers (and they derive from the JobProvider class). When job
providers have work they enqueue themselves into the pool's provider
list (and dequeue themselves when they no longer have work). The thread
pool has a method to **poke** awake a blocked idle thread, and job
providers are recommended to call this method when they make new jobs
available.

Worker jobs are not allowed to block except when absolutely necessary
for data locking. If a job becomes blocked, the worker thread is
expected to drop that job and go back to the pool and find more work.
.. note::

    x265_cleanup() frees the process-global thread pool, allowing
    it to be reallocated if necessary, but only if no encoders are
    allocated at the time it is called.

Wavefront Parallel Processing
=============================

New with HEVC, Wavefront Parallel Processing allows each row of CTUs to
be encoded in parallel, so long as each row stays at least two CTUs
behind the row above it, to ensure the intra references and other data
of the blocks above and above-right are available. WPP has almost no
effect on the analysis and compression of each CTU and so it has a very
small impact on compression efficiency relative to slices or tiles. The
compression loss from WPP has been found to be less than 1% in most of
our tests.
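
The two-CTU lag reduces to a simple readiness test per CTU. The function
below is a hypothetical sketch of that rule (the name ``ctuReady`` and
the per-row progress array are illustrative, not x265's bookkeeping):

```cpp
#include <algorithm>
#include <vector>

// WPP dependency rule sketch: CTU (row, col) may be coded once the row
// above has completed the CTU two positions to the right, making the
// above and above-right neighbours available.
// colsDone[r] = number of CTUs already completed in row r.
bool ctuReady(const std::vector<int>& colsDone, int row, int col, int ctuCols)
{
    if (colsDone[row] != col)  // CTUs within a row are coded in order
        return false;
    if (row == 0)
        return true;           // the top row has no upward dependency
    // above (col) and above-right (col + 1) must be complete, so the row
    // above must stay two CTUs ahead, clamped at the end of the row.
    return colsDone[row - 1] >= std::min(col + 2, ctuCols);
}
```

With this rule each row's wave can advance independently as long as the
row above keeps its two-CTU head start.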

WPP has three effects which can impact efficiency. First, the row
starts must be signaled in the slice header; second, each row must be
padded to an even byte length; and third, the state of the entropy
coder is transferred from the second CTU of each row to the first CTU
of the row below it. In some conditions this transfer of state actually
improves compression, since the above-right state may have better
locality than the end of the previous row.

Parabola Research have published an excellent HEVC
`animation <http://www.parabolaresearch.com/blog/2013-12-01-hevc-wavefront-animation.html>`_
which visualizes WPP very well. It even correctly visualizes some of
WPP's key drawbacks, such as:

1. the low thread utilization at the start and end of each frame
2. a difficult block may stall the wave-front and it takes a while for
   the wave-front to recover
3. 64x64 CTUs are big! There are far fewer rows than with H.264 and
   similar codecs

Because of these stall issues you rarely get the full parallelisation
benefit one would expect from row threading; 30% to 50% of the
theoretical perfect threading is typical.

In x265 WPP is enabled by default, since it not only improves encode
performance but also makes it possible for the decoder to be threaded.

If WPP is disabled by :option:`--no-wpp` the frame will be encoded in
scan order and the entropy overheads will be avoided. If frame
threading is not disabled, the encoder will change the default frame
thread count to be higher than if WPP was enabled. The exact formulas
are described in the next section.

Parallel Mode Analysis
======================

When :option:`--pmode` is enabled, each CU (at all depths from 64x64 to
8x8) will distribute its analysis work to the thread pool. Each analysis
job will measure the cost of one prediction for the CU: merge, skip,
intra, inter (2Nx2N, Nx2N, 2NxN, and AMP). At slower presets, the amount
of increased parallelism is often enough to be able to reduce frame
parallelism while achieving the same overall CPU utilization. Reducing
frame threads is often beneficial to ABR and VBV rate control.
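
The shape of this fan-out/gather is easy to sketch: cost each candidate
prediction as an independent job, then keep the cheapest. The enum below
is a subset of the modes listed above (AMP omitted), ``std::async``
stands in for x265's worker pool, and the cost function is a
placeholder, so treat this as an illustration rather than x265's
analysis code:

```cpp
#include <array>
#include <future>
#include <limits>

// Candidate predictions for one CU (illustrative subset).
enum Mode { MERGE, SKIP, INTRA, INTER_2Nx2N, INTER_Nx2N, INTER_2NxN, MODE_COUNT };

// Launch one costing job per mode, gather the results, pick the cheapest.
int bestMode(int (*cost)(Mode))
{
    std::array<std::future<int>, MODE_COUNT> costs;
    for (int m = 0; m < MODE_COUNT; m++)   // fan out: one job per prediction
        costs[m] = std::async(std::launch::async, cost, static_cast<Mode>(m));

    int best = 0, bestCost = std::numeric_limits<int>::max();
    for (int m = 0; m < MODE_COUNT; m++)   // gather and reduce
    {
        int c = costs[m].get();
        if (c < bestCost) { bestCost = c; best = m; }
    }
    return best;
}
```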

Parallel Motion Estimation
==========================

When :option:`--pme` is enabled, all of the analysis functions which
perform motion searches to reference frames will distribute those motion
searches as jobs for worker threads (if more than two motion searches
are required).

Frame Threading
===============

Frame threading is the act of encoding multiple frames at the same time.
It is a challenge because each frame will generally use one or more of
the previously encoded frames as motion references and those frames may
still be in the process of being encoded themselves.

Previous encoders such as x264 worked around this problem by limiting
the motion search region within these reference frames to just one
macroblock row below the coincident row being encoded. Thus a frame
could be encoded at the same time as its reference frames so long as it
stayed one row behind the encode progress of its references (glossing
over a few details).

x265 has the same frame threading mechanism, but we generally have much
less frame parallelism to exploit than x264 because of the size of our
CTU rows. For instance, with 1080p video x264 has 68 16x16 macroblock
rows available each frame while x265 only has 17 64x64 CTU rows.
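
These row counts follow directly from the frame height and the block
size (the height is rounded up to a whole number of blocks):

```cpp
// Rows of blocks available for row-level parallelism at a given height.
int blockRows(int frameHeight, int blockSize)
{
    return (frameHeight + blockSize - 1) / blockSize; // round up
}
```

The same arithmetic explains why capping the CTU size at 32x32 (as the
faster presets do, see below) doubles the rows available for threading.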

The second complicating factor is the loop filters. The pixels used
for motion reference must be processed by the loop filters, and the loop
filters cannot run until a full row has been encoded; they must also run
a full row behind the encode process so that the pixels below the row
being filtered are available. When you add up all the row lags, each
frame ends up being 3 CTU rows behind its reference frames (the
equivalent of 12 macroblock rows for x264).

The third complicating factor is that when a frame being encoded is
blocked waiting for a reference frame row to become available, that
frame's wave-front becomes completely stalled, and when the row does
become available it can take quite some time for the wave to be
restarted, if it ever is. This makes WPP many times less effective when
frame parallelism is in use.

:option:`--merange` can have a negative impact on frame parallelism. If
the range is too large, more rows of CTU lag must be added to ensure
those pixels are available in the reference frames.

.. note::

    Even though the merange is used to determine the amount of reference
    pixels that must be available in the reference frames, the actual
    motion search is not necessarily centered around the coincident
    block. The motion search is actually centered around the motion
    predictor, but the available pixel area (mvmin, mvmax) is determined
    by merange and the interpolation filter half-heights.
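
The interaction can be illustrated with a simplified vertical clamp:
given how many reference CTU rows are reconstructed and filtered, the
largest safe downward motion vector is whatever still keeps the search
(plus the interpolation filter's extra tap rows) inside the finished
area. The function and all its parameters are illustrative, not x265's
exact bookkeeping:

```cpp
// Simplified vertical motion clamp (illustrative, not x265's real code).
// refRowsDone: reconstructed and loop-filtered CTU rows in the reference.
// curRowTop:   top luma pixel row of the CTU row being encoded.
// halfFilter:  extra rows the interpolation filter reads below a block.
// Returns the largest downward vertical MV (in whole pixels) that stays
// inside the finished area; a negative result means the needed reference
// rows are not ready yet and the frame thread must block.
int maxDownwardMv(int refRowsDone, int ctuSize, int curRowTop, int halfFilter)
{
    int lastSafePixelRow = refRowsDone * ctuSize - 1 - halfFilter;
    int curBlockBottom   = curRowTop + ctuSize - 1;
    return lastSafePixelRow - curBlockBottom;
}
```

A larger merange simply demands a larger value here, which is why it
translates into extra rows of reference lag.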

When frame threading is disabled, the entirety of every reference frame
is always fully available (by definition), so the available pixel area
is not restricted at all, and this can sometimes improve compression
efficiency. Because of this, the output of encodes with frame
parallelism disabled will not match the output of encodes with frame
parallelism enabled; but when enabled, the number of frame threads
should have no effect on the output bitstream except when using ABR or
VBV rate control or noise reduction.

When :option:`--nr` is enabled, the outputs of each number of frame threads
will be deterministic but none of them will match, because each frame
encoder maintains a cumulative noise reduction state.

VBV currently introduces non-determinism in the encoder, regardless of
the amount of frame parallelism.

By default frame parallelism and WPP are enabled together. The number of
frame threads used is auto-detected from the (hyperthreaded) CPU core
count, but may be manually specified via :option:`--frame-threads`.

+-------+--------+
| Cores | Frames |
+=======+========+
| > 32  | 6      |
+-------+--------+
| >= 16 | 5      |
+-------+--------+
| >= 8  | 3      |
+-------+--------+
| >= 4  | 2      |
+-------+--------+

If WPP is disabled, then the frame thread count defaults to **min(cpuCount, ctuRows / 2)**.
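
Both defaults can be written as one small function. This is a paraphrase
of the documented rules above, not a copy of x265's source; in
particular, the assumption that fewer than 4 cores get a single frame
thread is ours:

```cpp
#include <algorithm>

// Default frame-thread count: the table above when WPP is on, otherwise
// min(cpuCount, ctuRows / 2). Paraphrased from the documented defaults.
int defaultFrameThreads(int cpuCount, int ctuRows, bool wpp)
{
    if (!wpp)
        return std::min(cpuCount, ctuRows / 2);
    if (cpuCount > 32)  return 6;
    if (cpuCount >= 16) return 5;
    if (cpuCount >= 8)  return 3;
    if (cpuCount >= 4)  return 2;
    return 1; // assumption: below 4 cores, a single frame thread
}
```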

Over-allocating frame threads can be very counter-productive. They
each allocate a large amount of memory and because of the limited number
of CTU rows and the reference lag, you generally get limited benefit
from adding frame encoders beyond the auto-detected count, and often
the extra frame encoders reduce performance.

Given these considerations, you can understand why the faster presets
lower the max CTU size to 32x32 (making twice as many CTU rows available
for WPP and for finer grained frame parallelism) and reduce
:option:`--merange`.

Each frame encoder runs in its own thread (allocated separately from the
worker pool). This frame thread has some pre-processing responsibilities
and some post-processing responsibilities for each frame, but it spends
the bulk of its time managing the wave-front processing by making CTU
rows available to the worker threads when their dependencies are
resolved. The frame encoder threads spend nearly all of their time
blocked in one of four possible locations:

1. blocked, waiting for a frame to process
2. blocked on a reference frame, waiting for a CTU row of reconstructed
   and loop-filtered reference pixels to become available
3. blocked waiting for wave-front completion
4. blocked waiting for the main thread to consume an encoded frame

Lookahead
=========

The lookahead module of x265 (the lowres pre-encode which determines
scene cuts and slice types) uses the thread pool to distribute the
lowres cost analysis to worker threads. It follows the same wave-front
pattern as the main encoder except it works in reverse-scan order.

The function slicetypeDecide() itself may also be performed by a worker
thread if your system has enough CPU cores to make this a beneficial
trade-off, else it runs within the context of the thread which calls
x265_encoder_encode().

SAO
===

The Sample Adaptive Offset loopfilter has a large effect on encode
performance because of the peculiar way it must be analyzed and coded.

SAO flags and data are encoded at the CTU level before the CTU itself is
coded, but SAO analysis (deciding whether to enable SAO and with what
parameters) cannot be performed until that CTU is completely analyzed
(reconstructed pixels are available) as well as the CTUs to the right
and below. So in effect the encoder must perform SAO analysis in a
wavefront at least a full row behind the CTU compression wavefront.

This extra latency forces the encoder to save the encode data of every
CTU until the entire frame has been analyzed, at which point a function
can code the final slice bitstream with the decided SAO flags and data
interleaved between each CTU. This second pass over the CTUs can be
expensive, particularly at large resolutions and high bitrates.
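
The second pass described above amounts to a deferred interleave: per-CTU
payloads are saved during analysis, and once SAO decisions exist for the
whole frame they are stitched in front of each CTU. A minimal sketch
(the strings standing in for bit-exact payloads, and the function name,
are illustrative):

```cpp
#include <string>
#include <vector>

// Sketch of the SAO two-pass slice coding: saved CTU payloads are
// interleaved with the frame's final SAO decisions in a second pass.
std::string writeSlice(const std::vector<std::string>& ctuData,
                       const std::vector<std::string>& saoParams)
{
    std::string bitstream;
    for (size_t i = 0; i < ctuData.size(); i++)
    {
        bitstream += saoParams[i]; // SAO flags/data precede each CTU
        bitstream += ctuData[i];
    }
    return bitstream;
}
```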