Replyblogging: “Async Rust can be a pleasure to work with”

i was initially going to just write (bad sign) one of my rare reddit replies (worse sign) and be done with it, but then i pasted what i wrote* into the text box and it reflowed the page ’cause it was too many words (ouch).

*: if you need to write a reddit reply in vim because you don’t want to lose your work in the reddit text box if you hit ctrl+w, MAYBE YOUCH, TOUCH GRASS INSTEAD. THIS IS THE WORST SIGN.

anyway, sorry Evan, this is a lot of words to say that you’ve cited a lot of folks who’ve thought hard about this, and i agree there’s a lot that could be better about async Rust, but i don’t think these are the arguments that should persuade people much on work-stealing vs thread-per-core, how async Rust is designed, or necessarily how Send/Sync work.

this is the post that was floating around and prompted the aforementioned writing: https://emschwartz.me/async-rust-can-be-a-pleasure-to-work-with-without-send-sync-static/

OK, from here on out pretend you’re reading this inside the reddit reader of your choice, and pretend i have the appropriate shame of being a redditor.


there are several points i’d encourage you to reexamine in this post - it’s difficult to read it as a comparison between async approaches rather than a demonstration that… different programs are different.

Send, Sync

first, Sync does not indicate anything about mutability, or shared mutability. Sync is a marker trait that only means it is safe for multiple threads to hold references to the type (in other words, &T is Send). for example, Rc is not Sync because with a reference to an Rc, e.g. &Rc, you can call Rc::clone and increment the reference count. but Rc’s reference count is not safe for concurrent access, so it would be unsafe to call Rc::clone from multiple threads. that means it is not safe to share references to an Rc<T> between threads, and so Rc is not Sync. note that Rc<T> is never mutable and the referenced item is never mutable, unless there’s some form of interior mutability involved. the post misunderstands this, and it leads to a few incorrect claims:

Most types automatically implement Send except references, Rcs, and any value that contains one of these.

references are explicitly Send: an &T is Send as long as T is Sync. raw pointers are not Send; the reasoning is not well-documented in Send’s documentation, but it is discussed in the nomicon chapter on Send and Sync.

but as an example, the following would compile:

struct HasRef<'a> {
    // a plain shared reference to a str; &str is Send because str is Sync
    a: &'a str
}

fn requires_send<T: Send>(_t: T) {}

fn example() {
    let local_str = "hello".to_string();
    // a reference to a local is still Send; it's lifetimes, not Send, that
    // keep you from handing it to another thread that outlives `local_str`
    requires_send(HasRef { a: local_str.as_str() });
}

Sync is a stricter bound that means a type can not only be sent between threads but is also safe to modify from multiple threads.

Sync only means that you can have references to the type from multiple threads, nothing about mutability. *const T is not Sync, for example. this is for basically the same reason it is not Send: not because raw pointers indicate anything about the referenced type, but because it would be very easy to use the referenced data incorrectly if pointers were automatically Send or Sync.
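as a tiny illustration of what Sync actually buys you (just a sketch, the names are mine):

use std::thread;

// Sync literally means "it is safe to hand &T to another thread",
// i.e. &T: Send. nothing here says anything about mutation.
fn share_across_threads<T: Sync>(value: &'static T) {
    let handle = thread::spawn(move || {
        // the other thread only ever sees a shared reference
        let _borrowed: &T = value;
    });
    handle.join().unwrap();
}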

Rc is not Sync for the same reason it is not Send, unrelated to the interior mutability of its referent.
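mirroring the example above (again, names are mine), the Arc line compiles and the Rc line would not:

use std::rc::Rc;
use std::sync::Arc;

fn requires_sync<T: Sync>(_t: &T) {}

fn example() {
    let counted = Arc::new(5u32);
    requires_sync(&counted); // fine: Arc's reference count is atomic

    let local = Rc::new(5u32);
    // requires_sync(&local); // does not compile: two threads holding &Rc
    // could both call Rc::clone and race on the non-atomic reference count
    let _ = local;
}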

The subtle distinction is that any resources that are shared between the threads definitely need to be Send (and Sync if they are mutable), but our futures themselves don’t necessarily need to be Send.

types whose references are shared between threads need to be Sync, not Send. this is specifically what Sync indicates: that references to the type can be shared between threads. futures tend to need to be Send for the reason you indicate: the future (and so its locals) may be moved between threads, so Send-ness comes from all of its internal types being Send. generally you do not need to care whether a future is Sync; it’s pretty rare to reference a future from multiple threads at the same time.
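for example (a sketch assuming tokio’s multi-threaded runtime):

use std::sync::Arc;

async fn handler(shared: Arc<String>) {
    // `shared` lives across this await point, so it becomes part of the
    // future's state; the future is Send because Arc<String> is Send.
    // (an Rc<String> held here instead would make the future !Send.)
    tokio::task::yield_now().await;
    println!("{shared}");
}

#[tokio::main]
async fn main() {
    let shared = Arc::new("hello".to_string());
    // tokio::spawn requires the future to be Send + 'static because the
    // work-stealing scheduler may move it between worker threads.
    // it never asks for Sync.
    tokio::spawn(handler(shared)).await.unwrap();
}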

again, a Future having mutable state has no relationship to it being Sync.

these might seem like small points, but getting this wrong means you (or a reader of your blog) might confidently write an incorrect unsafe impl Sync. next thing you know you’re accidentally sharing non-atomic counters across threads and everything goes sideways. i’ve been here! it’s bad!
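to make the failure mode concrete (a made-up type, please don’t ship this):

use std::cell::Cell;

struct Metrics {
    hits: Cell<u64>,
}

// the tempting-but-wrong reasoning: "Metrics is never mutated through &mut,
// so it must be fine to share." it is not - Cell has no synchronization, so
// two threads calling record() through &Metrics race on the count.
// the fix is AtomicU64, not this impl.
unsafe impl Sync for Metrics {}

impl Metrics {
    fn record(&self) {
        self.hits.set(self.hits.get() + 1);
    }
}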

benchmarking

sorry! there’s a bunch here too. you disclaimed that you’re not a benchmarking expert, so i want to emphasize that i mean this all as helpful feedback.

You can see that the Glommio throughput slightly surpasses Tokio (13% higher), the 50th percentile latency is slightly lower (14% lower), and the 99th percentile latency is higher (40% higher).

as you mentioned earlier in the post, tokio does not use io_uring, while glommio does. this is not benchmarking “work-stealing” vs “thread-per-core”, it is benchmarking “work-stealing epoll” vs “thread-per-core io_uring”. this is important. in fact i’m somewhat surprised that glommio does not have better results here! for emphasis these are not comparing the things you and your readers think they are.

for example, if anyone ever figures out a reasonable way to glue tokio and io_uring together, i’d expect it to look pretty similar to glommio on the benchmarks you’ve written.

Throughput (Requests / Sec)

something with these values is wrong. i think your 50th Percentile Latency units should be microseconds, not milliseconds. i’m not as sure about the 99th percentile or maximums. if p50 latency was 1.29ms, where you have 8000 cpu-milliseconds per second available, you would expect the total throughput to be something like 8 * 1000 / 1.29 == 6201rps.

throughput in the hundreds of thousands of requests/second seems reasonable, but the units don’t work out. assuming the 50th percentile is microseconds, the estimate slightly overstates the real figure, but only by 10-15%, which is consistent with a long latency tail pulling the average above p50 (and so the real throughput below the p50-based estimate). p50 is probably fairly close to p25 and p5 here, so the average is probably actually above p50.

edit: of course, after posting this i realized what your numbers actually mean - latencies are as measured by wrk on the client side, so the concurrency here is 800, not 8. with each client seeing 1.29ms of latency, and 800,000 ms-per-second of time to fill at 1.29ms/req, that comes out to around 620krps, just a bit above your measured throughput and off for probably the same “50th percentile is a bit below the average” reason.
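for the record, the back-of-the-envelope i’m doing (numbers from your tables, so treat it as a sanity check rather than gospel):

fn main() {
    // throughput ~= concurrency / latency (Little's law, roughly)
    let client_connections = 800.0_f64; // wrk's open connections, not 8 runtime threads
    let p50_latency_secs = 1.29e-3;     // 1.29 ms as reported by wrk

    let estimated_rps = client_connections / p50_latency_secs;
    println!("~{estimated_rps:.0} requests/sec"); // ~620155, a bit above the measured figure
}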

and where each future synchronously sleeps for 10 microseconds (so each request takes between 0 and 100 microseconds total). The idea here is to simulate the request handler awaiting additional futures, each of which requires some CPU time.

this is also very important: with times in the range of 0-100 microseconds you’re calling nanosleep() and yielding threads at intervals approaching the actual time to schedule a thread on a core. there are a few outcomes here!

i’m having a hard time estimating which way this would bias the numbers, but i’d expect something here.

more importantly: if a slow HTTP handler was 100 microseconds, i feel like a lot of SREs would be substantially less stressed :) i’d strongly recommend something like 50 microseconds minimum, 100 microseconds p50, 2ms p99 for the CPU time of a request to something more front-end than a database. this is… still pretty optimistic, but i know my own experience here would suggest much much more pessimistic numbers.

this is very important for the work-stealing case, because the extra orders of magnitude are where i’d expect you to see a more severe distinction between round-robin and work-stealing, even at p99.

your glommio benchmark, in comparison, is functionally also work-stealing, because all of your glommio threads are binding to an SO_REUSEADDR socket. the work-stealing is just done by the kernel at accept() time, picking whichever thread is actually in accept() on the socket rather than having an acceptor thread distributing work as in the tokio benchmarks.

i would expect that if you changed the tokio benchmarks to TcpListener::bind() all num_cpus() times you would see somewhat better throughput in all cases - the tokio “work-stealing” benchmark would look more like eight tasks running at high utilization, rather than nine busy tasks (accept() + 8 workers + the glue conn.send(stream) temporaries) that are contending with each other. the tokio “round-robin” benchmark would not be round-robin, but would much more directly compare to the glommio benchmark. in all cases, you’ll probably see the tokio max latencies drop off significantly.

(minor detail: this spawn is super unnecessary - you could make the channel unbounded for the same effect without constantly spawning new tasks from the accept() task, instead of the extant behavior of treating the task pool itself as an unbounded queue)
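roughly what i mean, sketched from memory of the benchmark’s shape (made-up names, and assuming the accept task round-robins streams to workers over channels):

use tokio::net::{TcpListener, TcpStream};
use tokio::sync::mpsc::UnboundedSender;

async fn accept_loop(listener: TcpListener, workers: Vec<UnboundedSender<TcpStream>>) {
    let mut next = 0usize;
    loop {
        let (stream, _addr) = listener.accept().await.unwrap();
        // an unbounded send never waits, so the accept task can hand the
        // stream off directly instead of spawning a task per connection
        // just to perform a bounded send.
        workers[next % workers.len()].send(stream).unwrap();
        next += 1;
    }
}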

i’d entirely misread SO_REUSEADDR as SO_REUSEPORT, and how glommio handles the acceptor task. this part was just totally wrong. sorry!


ergonomics

finally, the examples are… difficult to compare. i was initially going to just talk about this, but then wrote all of the above :) but here at least i have a few play.rust-lang.org links to step through.

basically, the tokio example is doing a lot more, and most of that extra work i would consider a bug. the clearest tell is that the return types of the futures in while let Some(...) are different, which hints at the bodies themselves being different.

i started by writing up something that makes your tokio example typecheck: here.

tokio::try_join, just like futures::try_join, runs the futures to completion concurrently. so the tokio::spawn() for result_b and result_c should be unnecessary, right? i tried removing them, and then realized a subtle but important difference between the two versions. removing the tokio::spawns looks like this.

the error took me a moment to understand: the error being ignored by try_join here is “one of the futures panicked”, not an error from the futures themselves. so removing the spawn removes the implicit catch_unwind around running the tasks, and so removes the possibility of handling those panics. i’m really not sure how the moro-local version would fare if one of those futures panicked. either way, bringing these in line makes them look more similar: like so
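a stripped-down illustration of the difference (made-up futures, not your handler code):

#[tokio::main]
async fn main() {
    // with spawn: try_join!'s Err is a JoinError (panic or cancellation);
    // the handlers' own Results come back nested inside the Ok.
    let a = tokio::spawn(async { Ok::<i32, String>(1) });
    let b = tokio::spawn(async { Err::<i32, String>("handler error".into()) });
    let joined: Result<_, tokio::task::JoinError> = tokio::try_join!(a, b);
    println!("{joined:?}"); // Ok((Ok(1), Err("handler error")))

    // without spawn: try_join! short-circuits on the handlers' own Err,
    // but there is no longer anything catching a panic in either future.
    let direct = tokio::try_join!(
        async { Ok::<i32, String>(1) },
        async { Err::<i32, String>("handler error".into()) },
    );
    println!("{direct:?}"); // Err("handler error")
}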

the if let Err(...) in the tokio example is totally missing from the moro-local example. let’s clip that out too. now it looks pretty similar, and i’ll finally get into advice for Send-y async Rust - you probably just don’t want your async fns to take &T. your code will be strewn with Arc<T>, and an Arc<T> is a reference. just take Arc<T> in your arguments and a lot of the glue becomes less grating.

here’s an example. this works out nicely when you find you need to spawn new futures from whatever’s happening in do_something, or do_something_else, or … - you don’t need to go change your &T to Arc<T> in preparation for spawning a future. and your callers probably already have an Arc<T>, so taking &T means you’re just asking rustc to forget that the reference counting machinery is there; it’s all the same refcount changes, they just happen in the context of the async move future rather than the service function.

speaking of which, once you’re not relying on the function to own Arc<T> whose references you pass to service functions, you no longer need the inner async move { ... }, like this.
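in sketch form (made-up types and names, not your playground code):

use std::sync::Arc;
use tokio::sync::mpsc::UnboundedReceiver;

struct State;
struct Request;

// taking Arc<State> by value means this future owns its refcount, so it can
// be spawned (or spawn more work) without any lifetime gymnastics.
async fn handle_request(_state: Arc<State>, _req: Request) {
    // ... per-request work ...
}

async fn serve(state: Arc<State>, mut requests: UnboundedReceiver<Request>) {
    while let Some(req) = requests.recv().await {
        // the spawn takes the handler future directly; no `async move { ... }`
        // wrapper is needed just to move clones into the task.
        tokio::spawn(handle_request(Arc::clone(&state), req));
    }
}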

it’s still very clone()ful but i think this is a much clearer comparison of the actual code complexity you get for that. it’s annoying, i agree! but it’s much closer to your moro-local example than the post suggests.


again, sorry for the long post! i hope you find this useful! i’m going to go log off!!