Instant::now()
what?
until Rust 1.60, Instant::now() included a heavyweight hammer to enforce the standard library's guarantee that Instant::now() monotonically increases - that is, it does not go backwards. some hardware has clocks that go backwards, and Rust sees fit to guard against such ridiculousness.
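the hammer amounts to clamping. here's a minimal sketch of the idea - not std's actual implementation, which is cheaper and more careful, just the shape of it: remember the last value handed out and refuse to return anything earlier.

use std::sync::Mutex;
use std::time::Instant;

// the last timestamp handed out. a real implementation would avoid a global
// mutex; this is only the shape of the idea.
static LAST: Mutex<Option<Instant>> = Mutex::new(None);

// return a timestamp that never appears earlier than any previously returned
// one, even if the underlying clock misbehaves.
fn monotonic_now() -> Instant {
    let raw = Instant::now();
    let mut last = LAST.lock().unwrap();
    let clamped = match *last {
        // the clock appeared to go backwards: pretend time stood still.
        Some(prev) if prev > raw => prev,
        _ => raw,
    };
    *last = Some(clamped);
    clamped
}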
so, which hardware/software pairs are actually_monotonic()? according to this comment:
- (OpenBSD, x86_64) is not monotonic
- (linux, arm64) is not monotonic (and again)
- (linux, s390x) is not monotonic
- (windows, x86) is not monotonic - hardware here might be a haswell chip, but under xen (details at the bottom of OP: Intel64 Family 6 Model 63 Stepping 2 GenuineIntel ~2400 Mhz, lookup)
- (windows, x86_64) is not monotonic - unknown hardware, also aws
- (windows, x86) is not monotonic
and Firefox has a similar hammer to force apparent monotonicity of “now”.
i’ve seen people talk about this before, with shock and awe and horror. i’ve tweeted about this before. there have been sharp words about x86 TSCs in linux discussions.
what’s interesting to me today is that Rust concludes that on the same inconsistent hardware, windows and openbsd get clocks wrong in a way linux does not.
so: on windows, Rust uses QueryPerformanceCounter. for macos, mach_absolute_time, and linux, clock_gettime(CLOCK_MONOTONIC). these all seem like the reasonable hardware-abstracting ways to get a monotonic clock, letting the OS paper over broken hardware when possible.
and they do (linux, windows). i didn’t care to figure out what openbsd does, but it certainly also tries to fast-path time checks on reasonable hardware.
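for a sense of what std is wrapping on the linux side, here's a minimal sketch that calls clock_gettime(CLOCK_MONOTONIC) through the libc crate - the crate and the assert are this example's choices, std does its own plumbing:

use std::time::Duration;

// read CLOCK_MONOTONIC directly; on linux this is usually serviced by the
// vdso rather than a real syscall.
fn monotonic_raw() -> Duration {
    let mut ts = libc::timespec { tv_sec: 0, tv_nsec: 0 };
    // SAFETY: ts is a valid, writable timespec.
    let rc = unsafe { libc::clock_gettime(libc::CLOCK_MONOTONIC, &mut ts) };
    assert_eq!(rc, 0, "clock_gettime(CLOCK_MONOTONIC) should not fail");
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}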
so how does linux get steady monotonically increasing times on hardware that windows can't make consistent? a random comment on stackoverflow believes that windows aggressively moves processes between cores, where linux tightly couples processes and cores, which might mean that windows happens to expose inconsistent TSC readings more often. it also proceeds right into product claims for no good reason. linux just does what Rust also now does - "if the clock looks wrong, just saturate and say it didn't change" (as of 1.60, anyway), generally. on x86 specifically, linux falls back to the kernel if it has decided a clock is no longer trustworthy, as determined by the last vdso data update.
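that saturating behavior is observable from safe code: as of 1.60, subtracting a later Instant from an earlier one yields zero rather than panicking, and saturating_duration_since spells it out explicitly. a small illustration:

use std::time::{Duration, Instant};

fn main() {
    let earlier = Instant::now();
    let later = Instant::now();

    // if the clock ever appeared to run backwards, this reports Duration::ZERO
    // instead of panicking (the Rust 1.60 behavior).
    let elapsed = later.saturating_duration_since(earlier);

    // deliberately backwards: saturates to zero.
    assert_eq!(earlier.saturating_duration_since(later), Duration::ZERO);
    println!("elapsed: {:?}", elapsed);
}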
clock_gettime(CLOCK_MONOTONIC) also makes stronger claims than QueryPerformanceCounter, asserting that the returned time is with respect to the system's startup, where QPC says only that it's independent of any external time source (so, not comparable to wallclock times). according to microsoft's documentation for QPC, windows XP may be tripped up by hardware incorrectly reporting TSC variance, Vista chose to use HPET instead of a TSC, windows 7 was back to using a TSC if available (modulo incorrect hardware reporting), and windows 8+ use TSCs. i didn't bother looking to see what windows 10 does in the kernel, but in ntdll.dll!RtlQueryPerformanceCounter, on x86, it certainly does rely on rdtsc with appropriate barriers for serialization.
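for reference, the windows entry point can be called from Rust with a couple of hand-rolled kernel32 bindings - a sketch, not how std wires it up internally:

// call QPC without the windows/winapi crates; this is the API whose ntdll
// innards are walked through below.
#[cfg(windows)]
mod qpc {
    #[link(name = "kernel32")]
    extern "system" {
        fn QueryPerformanceCounter(count: *mut i64) -> i32;
        fn QueryPerformanceFrequency(freq: *mut i64) -> i32;
    }

    // current counter reading converted to seconds since some boot-relative,
    // otherwise-unspecified origin.
    pub fn now_seconds() -> f64 {
        let (mut count, mut freq) = (0i64, 0i64);
        // SAFETY: both pointers are valid; per microsoft's docs these calls
        // always succeed on windows XP and later.
        unsafe {
            QueryPerformanceCounter(&mut count);
            QueryPerformanceFrequency(&mut freq);
        }
        count as f64 / freq as f64
    }
}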
but then linux developers report that some hardware will change TSCs and lie about the current time, which may lead to incorrect time reports from clock_gettime in the fallback kernel code anyway.
so why did Rust decide that windows is untrustworthy due to the presence of broken hardware, while linux is trusted to not give totally bogus times? idk, probably because there were reports of broken windows times on x86, and no reports of broken linux times on x86. maybe linux's attempt at monotonization is sufficient for the worst cases of wacky hardware. maybe windows has a particularly bad time migrating between VMs, which might be hinted at by a section from this high-resolution time stamps document: "on Hyper-V, the performance counter frequency is always 10 MHz when the guest virtual machine runs under a hypervisor that implements the hypervisor version 1.0 interface." the windows issues all have some evidence of being related to times gathered in a VM (maybe even AWS specifically). the Firefox issue seems to relate to older hardware, but some comments suggest they actually saw instability on linux as well. i can't see the old crash reports, so i don't have any hope of seeing implicated hardware.
even if windows was penalized for what might be a primarily-in-VMs time issue, the hammer turns what was an uncontrolled, unpredictable crash caused by hardware-level behavior into just a performance issue. that's a good improvement.
tl;dr? is rust bad?
given that this was a fix for crashes in murky circumstances, where the only clear (and easily available) information was that the circumstance should be impossible and that buggy hardware is prevalent, the technical decisions made here were reasonable given what the parties knew at the time and the constraints they were subject to. it's fine.
ps: some windows stuff
windows is closed source. so to know how it handles hardware differences in tsc consistency we get to read compiled code.
so here's ntdll.dll!RtlQueryPerformanceCounter.
;-- RtlQueryPerformanceCounter:
          0x180040150  48895c2408     mov qword [rsp + 8], rbx
          0x180040155  57             push rdi
          0x180040156  4883ec20       sub rsp, 0x20
          0x18004015a  448a0425c603.  mov r8b, byte [0x7ffe03c6]   ; [0x7ffe03c6:1]=255
          0x180040162  488bd9         mov rbx, rcx
          0x180040165  41f6c001       test r8b, 1                  ; 1
┌─<       0x180040169  0f84fb680700   je 0x1800b6a6a
│         0x18004016f  4c8b1c25b803.  mov r11, qword [0x7ffe03b8]  ; [0x7ffe03b8:8]=-1
│         0x180040177  41f6c002       test r8b, 2                  ; 2
┌──<      0x18004017b  0f84bd680700   je 0x1800b6a3e
││        0x180040181  4c8b0d408e12.  mov r9, qword [0x180168fc8]  ; [0x180168fc8:8]=0
││        0x180040188  4d85c9         test r9, r9
┌───<     0x18004018b  0f84d9680700   je 0x1800b6a6a
┌────>    0x180040191  458b11         mov r10d, dword [r9]
╎│││      0x180040194  4585d2         test r10d, r10d
┌─────<   0x180040197  0f84cd680700   je 0x1800b6a6a
│╎│││     0x18004019d  4584c0         test r8b, r8b
┌──────<  0x1800401a0  7941           jns 0x1800401e3
││╎│││    0x1800401a2  0f01f9         rdtscp
┌───────> 0x1800401a5  48c1e220       shl rdx, 0x20
╎││╎│││   0x1800401a9  480bd0         or rdx, rax
╎││╎│││   0x1800401ac  498b4108       mov rax, qword [r9 + 8]
╎││╎│││   0x1800401b0  498b4910       mov rcx, qword [r9 + 0x10]
╎││╎│││   0x1800401b4  48f7e2         mul rdx
╎││╎│││   0x1800401b7  418b01         mov eax, dword [r9]
╎││╎│││   0x1800401ba  4803d1         add rdx, rcx
╎││╎│││   0x1800401bd  413bc2         cmp eax, r10d
╎││└────< 0x1800401c0  75cf           jne 0x180040191
╎││ │││   0x1800401c2  8a0c25c703fe.  mov cl, byte [0x7ffe03c7]    ; [0x7ffe03c7:1]=255
╎││ │││   0x1800401c9  4903d3         add rdx, r11
╎││ │││   0x1800401cc  48d3ea         shr rdx, cl
╎││ │││   0x1800401cf  488913         mov qword [rbx], rdx
╎││ │││   0x1800401d2  488b5c2430     mov rbx, qword [rsp + 0x30]
╎││ │││   0x1800401d7  b801000000     mov eax, 1
╎││ │││   0x1800401dc  4883c420       add rsp, 0x20
╎││ │││   0x1800401e0  5f             pop rdi
╎││ │││   0x1800401e1  c3             ret
╎││ │││   0x1800401e2  cc             int3
╎└──────> 0x1800401e3  41f6c020       test r8b, 0x20               ; 32
╎ │┌────< 0x1800401e7  0f843f680700   je 0x1800b6a2c
╎ │││││   0x1800401ed  0faee8         lfence
╎ │││││   0x1800401f0  0f31           rdtsc
└───────< 0x1800401f2  ebb1           jmp 0x1800401a5
first, mov r8b, byte [0x7ffe03c6] loads a byte that will be used to check which way we should read time counters. r8b will be reused several times in this function.
all early je checks branch off to some cold code far away from this function. the happy path is to fall through to 0x18004019d, where either we believe rdtscp is sufficient to read timers, or we should lfence; rdtsc and come back. either way this loads the TSC into edx:eax, which is reassembled into a 64-bit number before being offset and scaled (?) by some core-local (?) information in r9. then, if the dword at [r9] no longer matches the value read at the top of the loop (a seqlock-style retry), branch back and see if we should use the cold path anyway. the cold path code returns here, where we eventually write to the out-pointer parameter in the mov at 0x1800401cf.
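to make that control flow concrete, here's a rough, hypothetical Rust rendering of the hot path (x86_64 only). the struct layout, field meanings, and names are guesses reconstructed from the disassembly above, not documented windows structures:

use std::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

// guessed layout of the calibration block r9 points at.
#[repr(C)]
struct TscCalibration {
    sequence: AtomicU32,   // dword [r9]: re-checked after sampling; retry if it changed
    multiplier: AtomicU64, // qword [r9 + 8]
    offset: AtomicU64,     // qword [r9 + 0x10]
}

// bias ~ qword [0x7ffe03b8], shift ~ byte [0x7ffe03c7], use_rdtscp ~ sign bit of r8b.
unsafe fn qpc_hot_path(calib: &TscCalibration, bias: u64, shift: u32, use_rdtscp: bool) -> u64 {
    loop {
        let seq = calib.sequence.load(Ordering::Acquire);
        let tsc = if use_rdtscp {
            let mut aux = 0u32;
            unsafe { __rdtscp(&mut aux) }
        } else {
            // the lfence; rdtsc variant
            unsafe { _mm_lfence() };
            unsafe { _rdtsc() }
        };
        // 64x64 -> 128-bit multiply keeping the high half, then add the offset:
        // the mul/add pair at 0x1800401b4..0x1800401ba.
        let scaled = ((tsc as u128 * calib.multiplier.load(Ordering::Relaxed) as u128) >> 64) as u64;
        let ticks = scaled.wrapping_add(calib.offset.load(Ordering::Relaxed));
        // seqlock-style retry: if the calibration changed while we were reading, go again.
        if calib.sequence.load(Ordering::Acquire) == seq {
            // add the shared-page bias and scale down, as at 0x1800401c9..0x1800401cc.
            return ticks.wrapping_add(bias) >> shift;
        }
    }
}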
the cold path is interesting and worth looking at too:
╎╎      0x1800b6a3e  4584c0       test r8b, r8b
┌───<   0x1800b6a41  7905         jns 0x1800b6a48
│╎╎     0x1800b6a43  0f01f9       rdtscp
┌────<  0x1800b6a46  eb16         jmp 0x1800b6a5e
│└───>  0x1800b6a48  41f6c020     test r8b, 0x20   ; 32
│┌───<  0x1800b6a4c  7405         je 0x1800b6a53
││╎╎    0x1800b6a4e  0faee8       lfence
┌─────< 0x1800b6a51  eb09         jmp 0x1800b6a5c
││└───> 0x1800b6a53  41f6c010     test r8b, 0x10   ; 16
││┌───< 0x1800b6a57  7403         je 0x1800b6a5c
│││╎╎   0x1800b6a59  0faef0       mfence
└─└───> 0x1800b6a5c  0f31         rdtsc
└────>  0x1800b6a5e  48c1e220     shl rdx, 0x20
╎╎      0x1800b6a62  480bd0       or rdx, rax
╎└─<    0x1800b6a65  e95897f8ff   jmp 0x1800401c2
╎       0x1800b6a6a  33d2         xor edx, edx
╎       0x1800b6a6c  488d4c2440   lea rcx, qword [rsp + 0x40]
╎       0x1800b6a71  e80a69feff   call sym.ntdll.dll_NtQueryPerformanceCounter
╎       0x1800b6a76  488b442440   mov rax, qword [rsp + 0x40]
╎       0x1800b6a7b  488903       mov qword [rbx], rax
└──<    0x1800b6a7e  e94f97f8ff   jmp 0x1800401d2
again we're consulting r8b for which mechanism we can safely use. down at 0x1800b6a6a is the worst case, calling into NtQueryPerformanceCounter - a wrapper that makes the syscall into the kernel for whatever fallback mechanism it has available. this is how windows eventually falls back to HPET if something is seriously wrong.
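that worst case is an ordinary ntdll export, so here's a hypothetical direct binding just to show its shape - normally you'd call QueryPerformanceCounter and let it decide when the kernel is needed:

// NTSTATUS NtQueryPerformanceCounter(PLARGE_INTEGER counter, PLARGE_INTEGER frequency);
// frequency may be null; NTSTATUS is a 32-bit status code.
#[cfg(windows)]
#[link(name = "ntdll")]
extern "system" {
    fn NtQueryPerformanceCounter(counter: *mut i64, frequency: *mut i64) -> i32;
}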
all in all, not dissimilar from linux's implementation of tsc-based timers.