Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve siphash performance for longer data #27280

Merged
merged 4 commits into from Jul 28, 2015
Merged

Conversation

bluss
Copy link
Member

@bluss bluss commented Jul 25, 2015

Improve siphash performance for longer data

Use ptr::copy_nonoverlapping (aka memcpy) to load an u64 from the
byte stream. This is correct for any alignment, and the compiler will
use the appropriate instruction to load the data.

Also contains small tweaks that should benefit hashing short data too,
both the commit that removes a variable and the autovectorization of
the hash state initialization (in SipHash::reset).

Benchmarks show that hashing longer data benefits for the improved word loading.

Before (using benchmarks from the first commit in the PR):

The before benchmark is a bit noisy.

test hash::sip::bench_bytes_4                              ... bench:          41 ns/iter (+/- 0) = 97 MB/s
test hash::sip::bench_bytes_7                              ... bench:          49 ns/iter (+/- 2) = 142 MB/s
test hash::sip::bench_bytes_8                              ... bench:          42 ns/iter (+/- 4) = 190 MB/s
test hash::sip::bench_bytes_a_16                           ... bench:          57 ns/iter (+/- 14) = 280 MB/s
test hash::sip::bench_bytes_b_32                           ... bench:          85 ns/iter (+/- 74) = 376 MB/s
test hash::sip::bench_bytes_c_128                          ... bench:         278 ns/iter (+/- 33) = 460 MB/s
test hash::sip::bench_long_str                             ... bench:         825 ns/iter (+/- 103)
test hash::sip::bench_str_of_8_bytes                       ... bench:         151 ns/iter (+/- 66)
test hash::sip::bench_str_over_8_bytes                     ... bench:          59 ns/iter (+/- 3)
test hash::sip::bench_str_under_8_bytes                    ... bench:          47 ns/iter (+/- 56)
test hash::sip::bench_u32                                  ... bench:          39 ns/iter (+/- 93) = 205 MB/s
test hash::sip::bench_u32_keyed                            ... bench:          40 ns/iter (+/- 88) = 200 MB/s
test hash::sip::bench_u64                                  ... bench:          54 ns/iter (+/- 96) = 148 MB/s

After:

test hash::sip::bench_bytes_4                              ... bench:          41 ns/iter (+/- 3) = 97 MB/s
test hash::sip::bench_bytes_7                              ... bench:          48 ns/iter (+/- 0) = 145 MB/s
test hash::sip::bench_bytes_8                              ... bench:          35 ns/iter (+/- 1) = 228 MB/s
test hash::sip::bench_bytes_a_16                           ... bench:          45 ns/iter (+/- 1) = 355 MB/s
test hash::sip::bench_bytes_b_32                           ... bench:          60 ns/iter (+/- 0) = 533 MB/s
test hash::sip::bench_bytes_c_128                          ... bench:         161 ns/iter (+/- 5) = 795 MB/s
test hash::sip::bench_long_str                             ... bench:         514 ns/iter (+/- 5)
test hash::sip::bench_str_of_8_bytes                       ... bench:          44 ns/iter (+/- 0)
test hash::sip::bench_str_over_8_bytes                     ... bench:          51 ns/iter (+/- 0)
test hash::sip::bench_str_under_8_bytes                    ... bench:          52 ns/iter (+/- 6)
test hash::sip::bench_u32                                  ... bench:          40 ns/iter (+/- 2) = 200 MB/s
test hash::sip::bench_u32_keyed                            ... bench:          39 ns/iter (+/- 1) = 205 MB/s
test hash::sip::bench_u64                                  ... bench:          36 ns/iter (+/- 1) = 222 MB/s

@rust-highfive
Copy link
Collaborator

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

Use `ptr::copy_nonoverlapping` (aka memcpy) to load an u64 from the
byte stream. This is correct for any alignment, and the compiler will
use the appropriate instruction to load the data.

Use unchecked indexing.

This results in a large improvement of throughput (hashed bytes
/ second) for long data. Maximum improvement benches at a 70% increase
in throughput for large values (> 256 bytes) but already values of 16
bytes or larger improve.

Introducing unchecked indexing is motivated to reach as good throughput
as possible. Using ptr::copy_nonoverlapping without unchecked indexing
would land the improvement some 20-30 pct units lower.

We use a debug assertion so that the test suite checks our use of
unchecked indexing.
Without this temporary variable, codegen improves slightly and less
registers are spilled to the stack in SipHash::write.
If they are ordered v0, v2, v1, v3, the compiler can find just a few
simd optimizations itself.

The new optimization I could observe on x86-64 was using 128 bit
registers for the v = key ^ constant operations in new / reset.
@alexcrichton
Copy link
Member

@bors: r+ 27c44ce

@bluss
Copy link
Member Author

bluss commented Jul 27, 2015

Thank you!

@bors
Copy link
Contributor

bors commented Jul 28, 2015

⌛ Testing commit 27c44ce with merge ff6c6ce...

bors added a commit that referenced this pull request Jul 28, 2015
Improve siphash performance for longer data

Use `ptr::copy_nonoverlapping` (aka memcpy) to load an u64 from the
byte stream. This is correct for any alignment, and the compiler will
use the appropriate instruction to load the data.

Also contains small tweaks that should benefit hashing short data too,
both the commit that removes a variable and the autovectorization of
the hash state initialization (in SipHash::reset).

Benchmarks show that hashing longer data benefits for the improved word loading.

Before (using benchmarks from the first commit in the PR):

The before benchmark is a bit noisy.

```
test hash::sip::bench_bytes_4                              ... bench:          41 ns/iter (+/- 0) = 97 MB/s
test hash::sip::bench_bytes_7                              ... bench:          49 ns/iter (+/- 2) = 142 MB/s
test hash::sip::bench_bytes_8                              ... bench:          42 ns/iter (+/- 4) = 190 MB/s
test hash::sip::bench_bytes_a_16                           ... bench:          57 ns/iter (+/- 14) = 280 MB/s
test hash::sip::bench_bytes_b_32                           ... bench:          85 ns/iter (+/- 74) = 376 MB/s
test hash::sip::bench_bytes_c_128                          ... bench:         278 ns/iter (+/- 33) = 460 MB/s
test hash::sip::bench_long_str                             ... bench:         825 ns/iter (+/- 103)
test hash::sip::bench_str_of_8_bytes                       ... bench:         151 ns/iter (+/- 66)
test hash::sip::bench_str_over_8_bytes                     ... bench:          59 ns/iter (+/- 3)
test hash::sip::bench_str_under_8_bytes                    ... bench:          47 ns/iter (+/- 56)
test hash::sip::bench_u32                                  ... bench:          39 ns/iter (+/- 93) = 205 MB/s
test hash::sip::bench_u32_keyed                            ... bench:          40 ns/iter (+/- 88) = 200 MB/s
test hash::sip::bench_u64                                  ... bench:          54 ns/iter (+/- 96) = 148 MB/s
```

After:

```
test hash::sip::bench_bytes_4                              ... bench:          41 ns/iter (+/- 3) = 97 MB/s
test hash::sip::bench_bytes_7                              ... bench:          48 ns/iter (+/- 0) = 145 MB/s
test hash::sip::bench_bytes_8                              ... bench:          35 ns/iter (+/- 1) = 228 MB/s
test hash::sip::bench_bytes_a_16                           ... bench:          45 ns/iter (+/- 1) = 355 MB/s
test hash::sip::bench_bytes_b_32                           ... bench:          60 ns/iter (+/- 0) = 533 MB/s
test hash::sip::bench_bytes_c_128                          ... bench:         161 ns/iter (+/- 5) = 795 MB/s
test hash::sip::bench_long_str                             ... bench:         514 ns/iter (+/- 5)
test hash::sip::bench_str_of_8_bytes                       ... bench:          44 ns/iter (+/- 0)
test hash::sip::bench_str_over_8_bytes                     ... bench:          51 ns/iter (+/- 0)
test hash::sip::bench_str_under_8_bytes                    ... bench:          52 ns/iter (+/- 6)
test hash::sip::bench_u32                                  ... bench:          40 ns/iter (+/- 2) = 200 MB/s
test hash::sip::bench_u32_keyed                            ... bench:          39 ns/iter (+/- 1) = 205 MB/s
test hash::sip::bench_u64                                  ... bench:          36 ns/iter (+/- 1) = 222 MB/s
```
@bors bors merged commit 27c44ce into rust-lang:master Jul 28, 2015
@bluss bluss deleted the siphash-perf branch July 28, 2015 15:06
@brson brson added the relnotes Marks issues that should be documented in the release notes of the next release. label Aug 3, 2015
@brson
Copy link
Contributor

brson commented Aug 3, 2015

Nice wins.

@arthurprs
Copy link
Contributor

Awesome! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
relnotes Marks issues that should be documented in the release notes of the next release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants