Instant::now() what?

until Rust 1.60, Instant::now() included a heavyweight hammer to enforce the standard library’s guarantee that Instant::now() monotonically increases. that is, it does not go backwards. some hardware has clocks that go backwards, and Rust sees fit to guard against such ridiculousness.
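to make the hammer concrete, here is a minimal sketch of the idea, assuming clock readings are plain u64 ticks: remember the largest value ever handed out and never return anything smaller. Monotonizer and observe are names invented for this example, and this is an illustration of the technique, not the standard library's actual code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Remembers the largest tick value handed out so far.
/// (Hypothetical type, for illustration only.)
pub struct Monotonizer {
    last: AtomicU64,
}

impl Monotonizer {
    pub const fn new() -> Self {
        Self { last: AtomicU64::new(0) }
    }

    /// Returns `raw`, unless an earlier call already saw a larger value,
    /// in which case that larger value is returned instead.
    pub fn observe(&self, raw: u64) -> u64 {
        // fetch_max stores max(previous, raw) and returns the previous value.
        let previous = self.last.fetch_max(raw, Ordering::Relaxed);
        previous.max(raw)
    }
}
```

every reading now goes through shared state that all threads touch, which is where the cost of a hammer like this comes from.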

tl;dr at the bottom

so, which hardware/software pairs are actually_monotonic()? according to this comment:

and Firefox has a similar hammer to force apparent monotonicity of “now”.

i’ve seen people talk about this before, with shock and awe and horror. i’ve tweeted about this before. there have been sharp words about x86 TSCs in linux discussions.

what’s interesting to me today is that Rust concludes that on the same inconsistent hardware, windows and openbsd get clocks wrong in a way linux does not.

so: on windows, Rust uses QueryPerformanceCounter. on macos, mach_absolute_time, and on linux, clock_gettime(CLOCK_MONOTONIC). these all seem like the reasonable hardware-abstracting ways to get a monotonic clock, letting the OS paper over broken hardware when possible.

and they do (linux, windows). i didn’t care to figure out what openbsd does, but it certainly also tries to fast-path time checks on reasonable hardware.

so how does linux get steady monotonically increasing times on hardware that windows can’t make consistent? a random comment on stackoverflow believes that windows aggressively moves processes between cores, where linux tightly couples processes and cores, which might mean that windows happens to expose inconsistent clocks more often. it also proceeds right into claims about product claims for no good reason. linux just does what Rust also now does - “if the clock looks wrong just saturate and say it didn’t change” (as of 1.60 anyway), generally. on x86 specifically, linux falls back to the kernel if it decided a clock is no longer trustworthy, as determined by the last vdso data update.
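for illustration, the "saturate and say it didn't change" behavior can also be asked for explicitly through std's Instant API. a small sketch, where elapsed_or_zero is a name made up for this example:

```rust
use std::time::{Duration, Instant};

/// Time from `earlier` to `later`, or zero if `later` is actually
/// before `earlier` (i.e. the clock appears to have gone backwards).
pub fn elapsed_or_zero(earlier: Instant, later: Instant) -> Duration {
    later.saturating_duration_since(earlier)
}
```

checked_duration_since returns None in the same situation, if you would rather notice the backwards step than hide it.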

clock_gettime(CLOCK_MONOTONIC) also makes stronger claims than QueryPerformanceCounter, asserting that the returned time is with respect to the system’s startup, where QPC says that it’s independent of any external time source (so, not comparable to wallclock times). according to microsoft’s documentation for QPC, windows XP may be tripped up by hardware incorrectly reporting TSC variance, Vista chose to use HPET instead of a TSC, windows 7 was back to using a TSC if available (modulo incorrect hardware reporting), and windows 8+ use TSCs. i didn’t bother looking to see what windows 10 does in the kernel, but in ntdll.dll!RtlQueryPerformanceCounter, on x86, it certainly does rely on rdtsc with appropriate barriers for serialization.

but then linux developers report that some hardware will change TSCs and lie about the current time, which may lead to incorrect time reports from clock_gettime in the fallback kernel code anyway.

so why did Rust decide that windows is untrustworthy due to the presence of broken hardware, while linux is trusted to not give totally bogus times? idk, probably because there were reports of broken windows times on x86, and not reports of broken linux times on x86. maybe linux’s attempt at monotonization is sufficient for the worst cases of wacky hardware. maybe windows has a particularly bad time migrating between VMs, which might be hinted at by a section from this high-resolution time stamps document: “...; and on Hyper-V, the performance counter frequency is always 10 MHz when the guest virtual machine runs under a hypervisor that implements the hypervisor version 1.0 interface.” the windows issues all have some evidence of being related to times gathered in a VM (maybe even AWS specifically). the Firefox issue seems to relate to older hardware, but some comments suggest they actually saw instability on linux as well. i can’t see the old crash reports, so i don’t have any hope of seeing implicated hardware.

even if windows was penalized for what might be a primarily-in-VMs time issue, the hammer turns what was an uncontrolled, unpredictable crash due to hardware-level behavior into just a performance issue. that’s a good improvement.

tl;dr? is rust bad?

given that this was a fix for crashes with murky circumstances where the only clear information - especially easily available - is that the circumstance should be impossible and that buggy hardware is prevalent, the technical decisions made here were reasonable given what the parties knew at the time and the constraints they were subject to. it’s fine.


ps: some windows stuff

windows is closed source. so to know how it handles hardware differences in tsc consistency we get to read compiled code.

so here’s ntdll.dll!RtlQueryPerformanceCounter.

    ;-- RtlQueryPerformanceCounter:
      0x180040150      48895c2408     mov qword [rsp + 8], rbx
      0x180040155      57             push rdi
      0x180040156      4883ec20       sub rsp, 0x20
      0x18004015a      448a0425c603.  mov r8b, byte [0x7ffe03c6] ; [0x7ffe03c6:1]=255
      0x180040162      488bd9         mov rbx, rcx
      0x180040165      41f6c001       test r8b, 1                ; 1
    < 0x180040169      0f84fb680700   je 0x1800b6a6a
      0x18004016f      4c8b1c25b803.  mov r11, qword [0x7ffe03b8] ; [0x7ffe03b8:8]=-1
      0x180040177      41f6c002       test r8b, 2                ; 2
    < 0x18004017b      0f84bd680700   je 0x1800b6a3e
      0x180040181      4c8b0d408e12.  mov r9, qword [0x180168fc8] ; [0x180168fc8:8]=0
      0x180040188      4d85c9         test r9, r9
    < 0x18004018b      0f84d9680700   je 0x1800b6a6a
    > 0x180040191      458b11         mov r10d, dword [r9]
      0x180040194      4585d2         test r10d, r10d
    < 0x180040197      0f84cd680700   je 0x1800b6a6a
      0x18004019d      4584c0         test r8b, r8b
    < 0x1800401a0      7941           jns 0x1800401e3
      0x1800401a2      0f01f9         rdtscp
    > 0x1800401a5      48c1e220       shl rdx, 0x20
      0x1800401a9      480bd0         or rdx, rax
      0x1800401ac      498b4108       mov rax, qword [r9 + 8]
      0x1800401b0      498b4910       mov rcx, qword [r9 + 0x10]
      0x1800401b4      48f7e2         mul rdx
      0x1800401b7      418b01         mov eax, dword [r9]
      0x1800401ba      4803d1         add rdx, rcx
      0x1800401bd      413bc2         cmp eax, r10d
    < 0x1800401c0      75cf           jne 0x180040191
      0x1800401c2      8a0c25c703fe.  mov cl, byte [0x7ffe03c7]  ; [0x7ffe03c7:1]=255
      0x1800401c9      4903d3         add rdx, r11
      0x1800401cc      48d3ea         shr rdx, cl
      0x1800401cf      488913         mov qword [rbx], rdx
      0x1800401d2      488b5c2430     mov rbx, qword [rsp + 0x30]
      0x1800401d7      b801000000     mov eax, 1
      0x1800401dc      4883c420       add rsp, 0x20
      0x1800401e0      5f             pop rdi
      0x1800401e1      c3             ret
      0x1800401e2      cc             int3
    > 0x1800401e3      41f6c020       test r8b, 0x20             ; 32
    < 0x1800401e7      0f843f680700   je 0x1800b6a2c
      0x1800401ed      0faee8         lfence
      0x1800401f0      0f31           rdtsc
    < 0x1800401f2      ebb1           jmp 0x1800401a5

first, mov r8b, byte [0x7ffe03c6] loads a byte that will be used to check which way we should read time counters. r8b will be reused several times in this function.

all early je checks are to branch off to some cold code far away from this function. the happy path is to fall through to 0x18004019d where either we believe rdtscp is sufficient to read timers, or we should lfence; rdtsc and come back. either way this loads the TSC into edx:eax, which is reassembled into a 64-bit number before being scaled by [r9 + 8] (mul leaves the high 64 bits of the product in rdx) and offset by [r9 + 0x10], both read from some core-local (?) information in r9. then the dword at [r9] is read again and compared with the value loaded at 0x180040191: if it changed (the jne at 0x1800401c0), branch back and redo the whole read, which also re-checks whether we should use the cold path anyway. the rdtsc cold path rejoins at 0x1800401c2, and we eventually write to the out-pointer parameter in the mov at 0x1800401cf.
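one reading of the loop between 0x180040191 and 0x1800401c0, written out as a rust sketch. TscPage, its field names, read_counter, and the bias/shift parameters are labels invented for this example; only the offsets (+0, +8, +0x10) and the arithmetic come from the disassembly, and the meaning attached to each field is a guess.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical layout for the data behind r9:
/// a dword at +0, a qword at +8, a qword at +0x10.
#[repr(C)]
pub struct TscPage {
    pub sequence: AtomicU32, // dword [r9]
    pub scale: u64,          // qword [r9 + 8]
    pub offset: u64,         // qword [r9 + 0x10]
}

/// Returns None where the disassembly bails out to the syscall path.
/// `read_tsc` stands in for rdtscp or lfence; rdtsc.
pub fn read_counter(
    page: &TscPage,
    read_tsc: impl Fn() -> u64,
    bias: u64,  // the value loaded into r11
    shift: u32, // the value loaded into cl
) -> Option<u64> {
    loop {
        let seq = page.sequence.load(Ordering::Acquire);
        if seq == 0 {
            return None; // test r10d, r10d; je 0x1800b6a6a
        }
        let tsc = read_tsc();
        // mul rdx: keep only the high 64 bits of the 128-bit product.
        let scaled = ((tsc as u128 * page.scale as u128) >> 64) as u64;
        let value = scaled.wrapping_add(page.offset); // add rdx, rcx
        // cmp eax, r10d; jne: retry if the dword changed during the read.
        if page.sequence.load(Ordering::Acquire) == seq {
            // add rdx, r11; shr rdx, cl (the shift count is masked to 6 bits)
            return Some(value.wrapping_add(bias).wrapping_shr(shift));
        }
    }
}
```

real code reading a page shared with the kernel or a hypervisor would need volatile or atomic reads for scale and offset too; the plain field reads here only keep the sketch short.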

the cold path is interesting and worth looking at too:

      0x1800b6a3e      4584c0         test r8b, r8b
    < 0x1800b6a41      7905           jns 0x1800b6a48
      0x1800b6a43      0f01f9         rdtscp
    < 0x1800b6a46      eb16           jmp 0x1800b6a5e
    > 0x1800b6a48      41f6c020       test r8b, 0x20             ; 32
    < 0x1800b6a4c      7405           je 0x1800b6a53
      0x1800b6a4e      0faee8         lfence
    < 0x1800b6a51      eb09           jmp 0x1800b6a5c
    > 0x1800b6a53      41f6c010       test r8b, 0x10             ; 16
    < 0x1800b6a57      7403           je 0x1800b6a5c
      0x1800b6a59      0faef0         mfence
    > 0x1800b6a5c      0f31           rdtsc
    > 0x1800b6a5e      48c1e220       shl rdx, 0x20
      0x1800b6a62      480bd0         or rdx, rax
    < 0x1800b6a65      e95897f8ff     jmp 0x1800401c2
      0x1800b6a6a      33d2           xor edx, edx
      0x1800b6a6c      488d4c2440     lea rcx, qword [rsp + 0x40]
      0x1800b6a71      e80a69feff     call sym.ntdll.dll_NtQueryPerformanceCounter
      0x1800b6a76      488b442440     mov rax, qword [rsp + 0x40]
      0x1800b6a7b      488903         mov qword [rbx], rax
    < 0x1800b6a7e      e94f97f8ff     jmp 0x1800401d2

again we’re consulting r8b for which mechanism we can safely use. down at 0x1800b6a6a is the worst case, calling into NtQueryPerformanceCounter - a wrapper to make the syscall into the kernel for whatever fallback mechanism it has available. this is how windows eventually falls back to HPET if something is seriously wrong.
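the flag checks in the cold path at 0x1800b6a3e boil down to a priority list over bits of r8b, sketched below in rust. TscRead and pick_tsc_read are names invented for this example; the bit values (0x80, 0x20, 0x10) and their order are read straight off the disassembly, but what windows calls these bits is not something the listing tells us.

```rust
/// Which instruction sequence the cold path ends up using to read the TSC.
/// (Hypothetical names, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TscRead {
    Rdtscp,      // rdtscp
    LfenceRdtsc, // lfence; rdtsc
    MfenceRdtsc, // mfence; rdtsc
    PlainRdtsc,  // rdtsc with no fence
}

pub fn pick_tsc_read(flags: u8) -> TscRead {
    if flags & 0x80 != 0 {
        // test r8b, r8b; jns: the sign bit (bit 7) is set
        TscRead::Rdtscp
    } else if flags & 0x20 != 0 {
        TscRead::LfenceRdtsc
    } else if flags & 0x10 != 0 {
        TscRead::MfenceRdtsc
    } else {
        TscRead::PlainRdtsc
    }
}
```

bits 1 and 2 are checked earlier, in the hot path: if bit 1 is clear the function goes straight to the NtQueryPerformanceCounter call, and if bit 2 is clear it skips the r9 data and lands in this cold path.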

all in all, not dissimilar from linux’s implementation of tsc-based timers.