The encoder uses the code polynomials G0 = 1 + x + x^2 + x^3 + x^6 and G1 = 1 + x^2 + x^3 + x^5 + x^6.
These are the generator polynomials of the DVB-T mother code (171 and 133 octal), as defined in
ETSI EN 300 744 V1.5.1.
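
For reference, the following is a minimal bit-serial sketch of what the two polynomials compute. It
is not the SIMD encoder described below. The function names, the one-bit-per-byte data layout and
the zero initial state are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Tap masks: bit i set means the term x^i is present in the polynomial. */
#define G0_TAPS 0x4Fu /* 1 + x + x^2 + x^3 + x^6 */
#define G1_TAPS 0x6Du /* 1 + x^2 + x^3 + x^5 + x^6 */

/* XOR of all bits of a 7-bit value. */
static unsigned parity7(unsigned v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* Encode n input bits, one bit (0 or 1) per array element.
 * The shift register starts at zero. */
void conv_encode_ref(const uint8_t *in, size_t n, uint8_t *out0, uint8_t *out1)
{
    unsigned window = 0; /* bit i holds the input bit delayed by i steps */
    for (size_t k = 0; k < n; k++) {
        window = ((window << 1) | (in[k] & 1u)) & 0x7Fu;
        out0[k] = (uint8_t)parity7(window & G0_TAPS);
        out1[k] = (uint8_t)parity7(window & G1_TAPS);
    }
}
```

Feeding a single 1 followed by six 0s returns the coefficients of each polynomial in order of
increasing delay, which is a quick way to check the tap masks.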

The x86 SIMD instruction set extension SSSE3
offers the wonderful palignr instruction. It takes three operands,
two 128-bit registers and one immediate. It concatenates the two registers and shifts the result
right by the number of bytes given in the immediate. The result is the lower 128 bits
of the shifted concatenation. Using this instruction, we assemble 128-bit
vectors from the input that partially overlap (by 8 bits).
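
The behaviour of palignr can be modelled in portable C as shown below. This is a sketch for
illustration, not the encoder's code: palignr_model and assemble_overlapping are hypothetical
names, and the byte offset of 15 is only one plausible way to obtain an 8-bit overlap. In real
code the corresponding intrinsic is _mm_alignr_epi8 from <tmmintrin.h>.

```c
#include <stdint.h>
#include <string.h>

/* Portable model of palignr / _mm_alignr_epi8(hi, lo, n):
 * concatenate hi:lo (lo in the low 16 bytes), shift right by n bytes,
 * keep the low 16 bytes. Bytes shifted in from above are zero. */
void palignr_model(uint8_t dst[16], const uint8_t hi[16],
                   const uint8_t lo[16], unsigned n)
{
    uint8_t cat[48] = {0};  /* lo | hi | 16 zero bytes */
    memcpy(cat, lo, 16);
    memcpy(cat + 16, hi, 16);
    if (n > 32)
        n = 32;             /* everything shifted out: result is zero */
    memcpy(dst, cat + n, 16);
}

/* Assumed assembly scheme: from two consecutive 16-byte loads, build the
 * vector that starts at byte 15, so it shares one byte (8 bits) with the
 * first load. */
void assemble_overlapping(uint8_t dst[16], const uint8_t in[32])
{
    palignr_model(dst, in + 16, in, 15);
}
```

Passing the same vector as both sources with n = 8 swaps its two 64-bit halves, which is the
swap64 operation used in the shift trick.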

We then right shift the assembled vectors (by 1, 2, 3, 5 and 6 bits) and XOR them. Since there is no
128-bit right shift instruction with bit granularity, we use a trick: we combine the 64-bit SIMD
right shift (which shifts the two 64-bit halves of a 128-bit vector independently), the SIMD left
shift and, again, palignr. With palignr, we swap the two 64-bit halves. A right shift by 1 bit
then looks like this: a1 = (a >> 1) | (swap64(a) << 63).
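
The shift trick and the XOR step can be sketched in portable C with two 64-bit lanes standing in
for a 128-bit register. The type vec128 and the functions shr_bits, xor128 and encode_vec are
hypothetical names for this sketch. In SSE code the lane-wise shifts would be _mm_srli_epi64 and
_mm_slli_epi64. Note that the formula gives an exact 128-bit shift only in the low lane. In the
high lane the top n bits receive bits wrapped around from the low lane, so those bits are not
valid. Presumably the overlap between vectors covers them, but that is an interpretation, not
something stated above.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register: two 64-bit lanes. */
typedef struct { uint64_t lo, hi; } vec128;

/* Swap the two 64-bit lanes (palignr with both sources equal and imm = 8). */
static vec128 swap64(vec128 a)
{
    vec128 r = { a.hi, a.lo };
    return r;
}

/* (a >> n) | (swap64(a) << (64 - n)) with lane-wise shifts, for 0 < n < 64. */
vec128 shr_bits(vec128 a, unsigned n)
{
    vec128 s = swap64(a);
    vec128 r;
    r.lo = (a.lo >> n) | (s.lo << (64 - n)); /* exact 128-bit shift */
    r.hi = (a.hi >> n) | (s.hi << (64 - n)); /* top n bits wrap around */
    return r;
}

static vec128 xor128(vec128 a, vec128 b)
{
    vec128 r = { a.lo ^ b.lo, a.hi ^ b.hi };
    return r;
}

/* XOR the shifted copies selected by each polynomial. */
void encode_vec(vec128 a, vec128 *out0, vec128 *out1)
{
    vec128 a2 = shr_bits(a, 2), a3 = shr_bits(a, 3), a6 = shr_bits(a, 6);
    /* Terms shared by G0 and G1: 1 + x^2 + x^3 + x^6 */
    vec128 common = xor128(xor128(a, a2), xor128(a3, a6));
    *out0 = xor128(common, shr_bits(a, 1)); /* G0 adds x^1 */
    *out1 = xor128(common, shr_bits(a, 5)); /* G1 adds x^5 */
}
```

Computing the shared terms once saves a few XORs, which is a design choice of this sketch and not
necessarily what the encoder does.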

Finally, after XORing, we undo the partial overlap using palignr.

The encoder processes blocks of 1920 bits, producing 2 x 1920 code bits. It requires the last 16 bytes
of the previous block (all zero for the first block), so the input is an array of 256 bytes
(16 + 240). The output is stored in two 240-byte arrays.
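
The buffer sizes follow from the block length, and the history bytes can be carried from one block
to the next as sketched below. The constant names and prepare_input are hypothetical, and placing
the 16 history bytes in front of the 240 payload bytes is an assumption about the layout.

```c
#include <stdint.h>
#include <string.h>

enum {
    BLOCK_BITS  = 1920,
    BLOCK_BYTES = BLOCK_BITS / 8,           /* 240 */
    HIST_BYTES  = 16,                       /* tail of the previous block */
    IN_BYTES    = HIST_BYTES + BLOCK_BYTES  /* 256 */
};

/* Fill the 256-byte encoder input. Assumption: the 16 history bytes sit in
 * front of the 240 payload bytes. prev is the previous input array, or NULL
 * for the first block, in which case the history is all zero. */
void prepare_input(uint8_t in[IN_BYTES], const uint8_t *prev,
                   const uint8_t payload[BLOCK_BYTES])
{
    if (prev)
        memmove(in, prev + IN_BYTES - HIST_BYTES, HIST_BYTES);
    else
        memset(in, 0, HIST_BYTES);
    memcpy(in + HIST_BYTES, payload, BLOCK_BYTES);
}
```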

Performance figures

Measured on a Core 2 Duo 2.66GHz with:

    sudo chrt -f 99 taskset 1 time ./conv_test_inlined

Inlined:

~408 cycles / 1920 bit

i.e. ~12.5 Gbit/s

Not inlined:

~1144 cycles / 1920 bit

i.e. ~4.5 Gbit/s
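
The conversion from cycles per block to throughput is bits / cycles x clock frequency. A small
helper (the name throughput_gbps is hypothetical) makes the arithmetic explicit:

```c
/* Throughput in Gbit/s for a block of `bits` bits encoded in `cycles`
 * clock cycles on a CPU running at `ghz` GHz. */
double throughput_gbps(double bits, double cycles, double ghz)
{
    return bits / cycles * ghz;
}
```

With the figures above, throughput_gbps(1920, 408, 2.66) gives about 12.5 and
throughput_gbps(1920, 1144, 2.66) gives about 4.46.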

Note: the benchmark runs the encoder several times on the same input and output
arrays, with no other reads or writes to these memory locations in between. The arrays therefore
stay entirely in cache. This is unlikely in a real application, so the figures above are best-case
numbers.