Working implementation with a very good benchmark based on the number of threads