Commit Graph

2 Commits

Author SHA1 Message Date
Adhemerval Zanella Netto 84cfc6479b x86: Add AVX2 optimized chacha20
It adds vectorized ChaCha20 implementation based on libgcrypt
cipher/chacha20-amd64-avx2.S.  It is used only if AVX2 is supported
and enabled by the architecture.

As for generic implementation, the last step that XOR with the
input is omited.  The final state register clearing is also
omitted.

On a Ryzen 9 5900X it shows the following improvements (using
formatted bench-arc4random data):

SSE                                        MB/s
-----------------------------------------------
arc4random [single-thread]               704.25
arc4random_buf(16) [single-thread]      1018.17
arc4random_buf(32) [single-thread]      1315.27
arc4random_buf(48) [single-thread]      1449.36
arc4random_buf(64) [single-thread]      1511.16
arc4random_buf(80) [single-thread]      1539.48
arc4random_buf(96) [single-thread]      1571.06
arc4random_buf(112) [single-thread]     1596.16
arc4random_buf(128) [single-thread]     1613.48
-----------------------------------------------

AVX2                                       MB/s
-----------------------------------------------
arc4random [single-thread]               922.61
arc4random_buf(16) [single-thread]      1478.70
arc4random_buf(32) [single-thread]      2241.80
arc4random_buf(48) [single-thread]      2681.28
arc4random_buf(64) [single-thread]      2913.43
arc4random_buf(80) [single-thread]      3009.73
arc4random_buf(96) [single-thread]      3141.16
arc4random_buf(112) [single-thread]     3254.46
arc4random_buf(128) [single-thread]     3305.02
-----------------------------------------------

Checked on x86_64-linux-gnu.
2022-07-22 11:58:27 -03:00
Adhemerval Zanella Netto e169aff0e9 x86: Add SSE2 optimized chacha20
It adds vectorized ChaCha20 implementation based on libgcrypt
cipher/chacha20-amd64-ssse3.S.  It replaces the ROTATE_SHUF_2 (which
uses pshufb) by ROTATE2 and thus making the original implementation
SSE2.

As for generic implementation, the last step that XOR with the
input is omited. The final state register clearing is also
omitted.

On a Ryzen 9 5900X it shows the following improvements (using
formatted bench-arc4random data):

GENERIC                                    MB/s
-----------------------------------------------
arc4random [single-thread]               443.11
arc4random_buf(16) [single-thread]       552.27
arc4random_buf(32) [single-thread]       626.86
arc4random_buf(48) [single-thread]       649.81
arc4random_buf(64) [single-thread]       663.95
arc4random_buf(80) [single-thread]       674.78
arc4random_buf(96) [single-thread]       675.17
arc4random_buf(112) [single-thread]      680.69
arc4random_buf(128) [single-thread]      683.20
-----------------------------------------------

SSE                                        MB/s
-----------------------------------------------
arc4random [single-thread]               704.25
arc4random_buf(16) [single-thread]      1018.17
arc4random_buf(32) [single-thread]      1315.27
arc4random_buf(48) [single-thread]      1449.36
arc4random_buf(64) [single-thread]      1511.16
arc4random_buf(80) [single-thread]      1539.48
arc4random_buf(96) [single-thread]      1571.06
arc4random_buf(112) [single-thread]     1596.16
arc4random_buf(128) [single-thread]     1613.48
-----------------------------------------------

Checked on x86_64-linux-gnu.
2022-07-22 11:58:27 -03:00