Read [Peter]

June 9th, 2025

After I found some issues with my benchmark which invalidated my previous results, I have substantially revised my previous blog entry. There are two main differences:

A proper baseline revealed that my amd64 numbers were nonsense because I wasn’t fencing enough, and after tearing my hair out and eventually fixing that I found that the bithack conversion is one or two cycles faster.
A newer compiler can radically improve the multiply conversion on arm64 so it’s the same speed as the bithack conversion; I've added some source and assembly snippets to the blog post to highlight how nice arm64 is compared to amd64 for this task.

There are no comments on this entry. (Reply.)

We still haven't met with Senior Management: it's now due tomorrow, in person. I'm gently trying not to panic.

There's still been no message of support to all members of staff and students from the University, and nothing at all from the department. Though I understand they're still in discussions in the background. This is frustrating.

The subject was raised at a recent All Staff meeting (in which people submit questions as text, and senior management attempt to answer them). We were given broad assurances that the university values and supports trans people, but nothing actually useful or genuinely supportive was said.

In the meantime a new EHRC chair is due to be appointed, and they're considering a person with a known anti-trans background. There's an Open Letter available to sign in protest, written by a very good friend and colleague: https://docs.google.com/forms/d/e/1FAIpQLSe_Y77t7CQqKjdGifNa0lE3HKjDAb1UoJdjuLAbInhIQsRMhw/viewform

I've also seen a good template if you want to write directly: https://docs.google.com/document/d/1865KMfu24JgmwnWmYXaVc3jlzj5uQFEq69hXMxKP6BU/edit?tab=t.0

And I wrote my own version:

9th June 2025
Dear Women’s and Equalities Select Committee and Joint Committee on Human Rights,
Cc: Pippa Heylings, as my MP

I am writing to express my grave concern about the proposed appointment of Dr Mary-Ann Stephenson as the Chair of the Equality and Human Rights Commission.

I won't include a string of references here, because I think you will have seen them all already, but I think it is imperative that the next person appointed as Head of the EHRC should not be seen to have a strong anti-trans background. Trans people are currently scared. Scared for their jobs, if they cannot access their workplace in safety and dignity. Scared of being assaulted if they go to the "wrong" toilet. Scared of being outed as trans in public if they try to follow the new guidelines.

And I am scared as a cis woman, a woman who is not trans, at what is happening in our country, and what this means for my friends and colleagues and for trans people in general. For intersex people, non-binary people, and any woman who might be mistaken for being trans. Other women need to feel safe too, but excluding trans people is not the way to do this.

The EHRC needs to stand up for the rights of everyone, and to be seen to do so. I sincerely hope you will take this into account.

Kind Regards,

Eleanor Blair
Great Shelford, Cambridge, CB22

I'm not even going to attempt to get into the member of the EHRC who was quoted as effectively saying that trans people have been misled about their rights under the Equality Act for the last 15 years, and there will now be a period of adjustment, but they should just get used to having fewer rights than they thought they did. The Guardian changed their headline and reporting three times as a result of her protesting about being misquoted, but that seems to have been the gist of it. Not mentioning that the "misleading" guidance came from the EHRC themselves, and was based on the previous understanding of the Equalities Act and entirely consistent with it. FFS

There are 2 comments on this entry. (Reply.)

June 8th, 2025

https://dotat.at/@/2025-06-08-floats.html

A couple of years ago I wrote about random floating point numbers. In that article I was mainly concerned about how neat the code is, and I didn't pay attention to its performance.

Recently, a comment from Oliver Hunt and a blog post from Alisa Sireneva prompted me to wonder if I made an unwarranted assumption. So I wrote a little benchmark, which you can find in pcg-dxsm.git.

(Note 2025-06-09: I've edited this post substantially after discovering some problems with the results.)

recap

Briefly, there are two basic ways to convert a random integer to a floating point number between 0.0 and 1.0:

Use bit fiddling to construct an integer whose format matches a float between 1.0 and 2.0; this is the same span as the result but with a simpler exponent. Bitcast the integer to a float and subtract 1.0 to get the result.
Shift the integer down to the same range as the mantissa, convert to float, then multiply by a scaling factor that reduces it to the desired range. This produces one more bit of randomness than the bithacking conversion.

(There are other less basic ways.)

code

The double precision code for the two kinds of conversion is below. (Single precision is very similar so I'll leave it out.)

It's mostly as I expect, but there are a couple of ARM instructions that surprised me.

bithack

The bithack function looks like:

double bithack52(uint64_t u) {
    u = ((uint64_t)(1023) << 52) | (u >> 12);
    return(bitcast(double, u) - 1.0);
}

It translates fairly directly to amd64 like this:

bithack52:
    shr     rdi, 12
    movabs  rax, 0x3ff0000000000000
    or      rax, rdi
    movq    xmm0, rax
    addsd   xmm0, qword ptr [rip + .number]
    ret
.number:
    .quad   0xbff0000000000000

On arm64 the shift-and-or becomes one bfxil instruction (which is a kind of bitfield move), and the constant -1.0 is encoded more briefly. Very neat!

bithack52:
    mov     x8, #0x3ff0000000000000
    fmov    d0, #-1.00000000
    bfxil   x8, x0, #12, #52
    fmov    d1, x8
    fadd    d0, d1, d0
    ret

multiply

The shift-convert-multiply function looks like this:

double multiply53(uint64_t u) {
    return ((double)(u >> 11) * 0x1.0p-53);
}

It translates directly to amd64 like this:

multiply53:
    shr       rdi, 11
    cvtsi2sd  xmm0, rdi
    mulsd     xmm0, qword ptr [rip + .number]
    ret
.number:
    .quad     0x3ca0000000000000

GCC and earlier versions of Clang produce the following arm64 code, which is similar though it requires more faff to get the constant into the right register.

multiply53:
    lsr     x8, x0, #11
    mov     x9, #0x3ca0000000000000
    ucvtf   d0, x8
    fmov    d1, x9
    fmul    d0, d0, d1
    ret

Recent versions of Clang produce this astonishingly brief two instruction translation: apparently you can convert fixed-point to floating point in one instruction, which gives us the power of two scale factor for free!

multiply53:
    lsr     x8, x0, #11
    ucvtf   d0, x8, #53
    ret

benchmark

My benchmark has 2 x 2 x 2 tests:

bithacking vs multiplying
32 bit vs 64 bit
sequential integers vs random integers

I ran the benchmark on my Apple M1 Pro and my AMD Ryzen 7950X.

These functions are very small and work entirely in registers so it has been tricky to measure them properly.

To prevent the compiler from inlining and optimizing the benchmark loop to nothing, the functions are compiled in a separate translation unit from the test harness. This is not enough to get plausible measurements because the CPU overlaps successive iterations of the loop, so we also use fence instructions.

On arm64, a single ISB (instruction stream barrier) in the loop is enough to get reasonable measurements.

I have not found an equivalent of ISB on amd64, so I'm using MFENCE. It isn't effective unless I pass the argument and return values via pointers (because it's a memory fence) and place MFENCE instructions just before reading the argument and just after writing the result.

results

In the table below, the leftmost column is the number of random bits; "old" is arm64 with older clang, "arm" is newer clang, "amd" is gcc.

The first line is a baseline do-nothing function, showing the overheads of the benchmark loop, function call, load argument, store return, and fences.

The upper half measures sequential numbers, the bottom half is random numbers. The times are nanoseconds per operation.

         old    arm    amd

    00  21.44  21.41  21.42

    23  24.28  24.31  22.19
    24  25.24  24.31  22.94
    52  24.31  24.28  21.98
    53  25.32  24.35  22.25

    23  25.59  25.56  22.86
    24  26.55  25.55  23.03
    52  27.83  27.81  23.93
    53  28.57  27.84  25.01

The times vary a little from run to run but the difference in speed of the various loops is reasonably consistent.

The numbers on arm64 are reasonably plausible. The most notable thing is that the "old" multiply conversion is about 3 or 4 clock cycles slower, but with a newer compiler that can eliminate the multiply, it's the same speed as the bithacking conversion.

On amd64 the multiply conversion is about 1 or 2 clock cycles slower than the bithacking conversion.

conclusion

The folklore says that bithacking floats is faster than normal integer to float conversion, and my results generally agree with that, apart from on arm64 with a good compiler. It would be interesting to compare other CPUs to get a better idea of when the folklore is right or wrong -- or if any CPUs perform the other way round!

There are no comments on this entry. (Reply.)

June 5th, 2025

Hm. I'm still not sure about writing `2.` but if you have to use floats then I think I came around to preferring to write `2.0` over writing `2`.

There are 3 comments on this entry. (Reply.)

Peter

performance of random floats (amended)

EHRC nonsense

performance of random floats