
SSE2 Intrinsic optimized USRP short to float conversion

Overview

Author: Dominik Auras
Type: Case study of optimization of data conversion in gr-usrp module
State: closed / negative result
Outline:
  • USRP delivers 16-bit or 8-bit integers
  • gr-usrp module converts short integers to single-precision floating-point numbers
  • SIMD implementation of data conversion with SSE2 intrinsics
  • Comparison with existing solution, compiled at highest optimization level
  • No speedup achieved

Introduction

The USRP is the standard hardware used to receive and transmit signals with the GNU Radio project. It comes with its own driver, which accesses the hardware via libusb. GNU Radio contains a wrapper, the submodule gr-usrp, that integrates the USRP with GNU Radio: it uses the USRP driver and converts the data it reads to the format used by GNU Radio.

The USRP delivers data as 16-bit short integers, in groups of two shorts per complex sample. The gr-usrp module converts, for example, these shorts to complex values that consist of two floats. It can also convert 8-bit data from the USRP to floats, or perform no conversion at all, i.e. pass the shorts through unchanged. My main focus has been on the conversion of complex short integer samples to complex float samples.
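To make the data layout concrete, here is a minimal sketch of a scalar conversion. It is not the actual gr-usrp routine: the function name is hypothetical, the I-then-Q ordering of the two shorts is an assumption, and any byte-order handling the real code may perform is left out.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, NOT the actual gr-usrp code. Converts n_samples complex
// samples, stored as interleaved 16-bit pairs, to complex floats.
// Assumption: each pair is ordered I (real) first, then Q (imaginary).
void shorts_to_complex_generic(const std::int16_t* in,
                               std::complex<float>* out,
                               std::size_t n_samples)
{
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n_samples; ++i) {
        // Two consecutive shorts form one complex sample.
        out[i] = std::complex<float>(static_cast<float>(in[2 * i]),
                                     static_cast<float>(in[2 * i + 1]));
    }
}
```

A loop of this shape is simple enough that an optimizing compiler can already generate tight code for it.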

Implementation

The existing code is a straightforward implementation in C++ (using mostly the C subset). I doubt that a faster scalar conversion can be implemented. I therefore chose to examine the speedup achievable with the SSE instructions available in contemporary processors. A vectorized SIMD conversion might be faster by orders of magnitude, since it uses highly optimized CPU instructions and reduces the loop count by processing more than one sample in parallel.

Starting from the disassembly of the existing conversion routine, compiled with GCC at -O3, I worked towards an implementation that uses built-in intrinsics.

After consulting the Intel optimization manual, I decided to make use of four specific instructions available with SSE2.

The idea is to first unpack the packed 16-bit integers into the upper 16 bits of 32-bit integers, then shift each 32-bit value right by 16 so that the 16-bit value ends up in the lower 16 bits. Because the shift is arithmetic, it replicates the sign bit and thereby preserves the sign. Finally, the conversion to floats is carried out.
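The effect of the unpack-and-shift step on a single value can be modelled in scalar code. The following is only an illustration of the idea for one 32-bit lane, with a hypothetical function name. It relies on two's-complement conversion and an arithmetic right shift for signed integers, which mainstream compilers provide (and which C++20 guarantees).

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of what happens in one 32-bit lane (illustrative only).
float widen_and_convert(std::int16_t s)
{
    // Step 1: place the 16-bit pattern in the upper half of a 32-bit word;
    // the lower half is zero. This mimics unpacking against a zero register.
    const std::uint32_t upper =
        static_cast<std::uint32_t>(static_cast<std::uint16_t>(s)) << 16;

    // Step 2: arithmetic right shift by 16. The sign bit, now at bit 31,
    // is replicated into the upper 16 bits, so negative values stay negative.
    const std::int32_t widened = static_cast<std::int32_t>(upper) >> 16;
    assert(widened == s);

    // Step 3: integer-to-float conversion.
    return static_cast<float>(widened);
}
```

For example, -1 has the 16-bit pattern 0xFFFF; after step 1 the word is 0xFFFF0000, and after the arithmetic shift it is 0xFFFFFFFF, which is -1 as a 32-bit integer.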

The inner conversion loop produces 4 complex-valued samples per iteration. As a consequence, all buffers need to be aligned to a 16-byte boundary, and all accesses cover multiples of 16 bytes.
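A loop of this kind could look roughly like the sketch below. It is not the code from the patch. The function name is hypothetical, and the choice of intrinsics (_mm_unpacklo_epi16, _mm_unpackhi_epi16, _mm_srai_epi32, _mm_cvtepi32_ps) is an assumption that matches the unpack, shift and convert steps described above; the patch may use different ones. The output is written as interleaved floats, which has the same memory layout as an array of complex floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical sketch, NOT the code from the patch.
// Assumptions: both pointers are 16-byte aligned, and n_shorts is a
// multiple of 8 (8 shorts = 4 complex samples = 16 bytes of input).
void shorts_to_floats_sse2(const std::int16_t* in, float* out,
                           std::size_t n_shorts)
{
    assert(reinterpret_cast<std::uintptr_t>(in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
    assert(n_shorts % 8 == 0);

    const __m128i zero = _mm_setzero_si128();
    for (std::size_t i = 0; i < n_shorts; i += 8) {
        // Aligned load of 8 shorts (4 complex samples).
        const __m128i v =
            _mm_load_si128(reinterpret_cast<const __m128i*>(in + i));

        // Interleave zeros with the data so that each short lands in the
        // upper 16 bits of a 32-bit lane (low half: shorts 0..3, high: 4..7).
        __m128i lo = _mm_unpacklo_epi16(zero, v);
        __m128i hi = _mm_unpackhi_epi16(zero, v);

        // Arithmetic shift right by 16: moves the value to the lower
        // 16 bits while replicating the sign bit.
        lo = _mm_srai_epi32(lo, 16);
        hi = _mm_srai_epi32(hi, 16);

        // Convert 4 + 4 32-bit integers to floats and store them aligned.
        _mm_store_ps(out + i, _mm_cvtepi32_ps(lo));
        _mm_store_ps(out + i + 4, _mm_cvtepi32_ps(hi));
    }
}
```

The aligned load and store intrinsics are what impose the 16-byte alignment requirement; unaligned variants exist but would change what is being measured.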

Results

After verifying that the conversion is performed correctly, I wrote benchmark code and compared the timing of the existing C++ implementation with my new one. The results are presented here. They were measured on a Core 2 Duo E6750 with 4 GB RAM (dual channel).

             Generic: input: 6.4e+07  cpu:  0.232  items/sec:  2.759e+08
      SSE2 intrinsic: input: 6.4e+07  cpu:  0.230  items/sec:  2.783e+08

The speedup is negligible: under 1% in the run shown. It also varies, so that for some runs there is no gain at all. Fortunately, the new code never performs worse than the standard implementation. Nevertheless, the drawbacks are heavy. The optimized code needs its buffers aligned to 16 bytes (which requires a change to the gr-usrp module) and can only handle data sizes that are multiples of 16 bytes. While this seems to be no constraint in the current usage of the USRP, it may become one in the future. The standard implementation has neither constraint.
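The reported figures can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the sketch assumes that the "cpu" column is the CPU time in seconds for the whole input.

```cpp
#include <cassert>

// Throughput in items per second from an item count and a CPU time.
// Assumption: cpu_seconds is the total CPU time for n_items.
double items_per_second(double n_items, double cpu_seconds)
{
    assert(cpu_seconds > 0.0);
    return n_items / cpu_seconds;
}

// Relative gain of the optimized run over the baseline, as a fraction
// (0.01 means 1% faster).
double relative_gain(double baseline_seconds, double optimized_seconds)
{
    assert(optimized_seconds > 0.0);
    return baseline_seconds / optimized_seconds - 1.0;
}
```

With the numbers from the table, items_per_second(6.4e7, 0.232) gives about 2.759e8 and items_per_second(6.4e7, 0.230) about 2.783e8, and relative_gain(0.232, 0.230) is just under 0.009, i.e. a gain of less than 1%.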

Conclusion

The existing conversion solution is efficient. There is no significant speedup with the current SSE instructions. The benchmark results also show that a throughput of approximately 280 megasamples per second is achieved. That is far beyond any data rate that GNU Radio applications can process or the USRP can deliver, even in future revisions. This rate may decrease, however, if the conversion is run on smaller blocks than in my benchmark, due to function call overhead and similar costs.

License

This work is licensed under the terms of the GPL. A copy is included in the module archive. The patch file does not contain a copy, but the same license applies to that file.

I am not the author of the gr-usrp module. Relevant changes have been marked in the header of each file and there is a note added to the changelog.

Source code

Modified gr-usrp module TAR-Archive 2008-03-08 70kb
Patch TAR-Archive 2008-03-08 4.5kb