Explaining Ruby Fibers

20·10·2021

Fibers have long been a neglected corner of the Ruby core API. Introduced in Ruby version 1.9 as a coroutine abstraction, fibers have never really seemed to realize their promise of lightweight concurrency, and remain relatively little explored. In spite of attempts to employ them for achieving concurrency, most notably em-synchrony, fibers have not caught on. Hopefully, with the advent of the FiberScheduler interface introduced in Ruby 3.0, and libraries such as Async and Polyphony, this situation will change and fibers will become better known and understood, and Ruby developers will be able to use them to their full potential.

My aim in this article is to explain how fibers work from the point of view of a concurrent Ruby application written using Polyphony. I’ll give an overview of fibers as concurrency constructs, and discuss how Polyphony harnesses Ruby fibers in order to provide an idiomatic and performant solution for writing highly-concurrent Ruby apps. For the sake of simplicity, I’ll omit some details, which will be mentioned in footnotes towards the end of this article.

What’s a Fiber?

Most developers (hopefully) know about threads, but just what are fibers? Linguistically we can already tell they have some relation to threads, but what is the nature of that relation? Is it one of aggregation (as in “a thread is made of one or more fibers”,) or is it one of similitude (as in “a fiber is sort of like a thread but lighter/smaller/whatever”?) As we shall see, it’s a little bit of both.

A fiber is simply an independent execution context that can be paused and resumed programmatically. We can think of fibers as story lines in a book or a movie: there are multiple happenings involving different persons at different places all occurring at the same time, but we can only follow a single story line at a time: the one we’re currently reading or watching.

This is one of the most important insights I had personally when I started working with Ruby fibers: there’s always a currently active fiber. In fact, even if you don’t use fibers and don’t do any fiber-related work, your code is still running in the context of the main fiber for the current thread, which is created automatically by the Ruby runtime for each thread.

So a fiber is an execution context, and in that regard it’s kind of like a thread: it keeps track of a sequence of operations, and as such it has its own stack and instruction pointer. Where fibers differ from threads is that they are not managed by the operating system, but instead are paused, resumed, and switched between by the userspace program itself. This programmatic way to switch between different execution contexts is also called “cooperative multitasking”, in contrast to the way threads are being switched by the operating system, called “preemptive multitasking”.

Switching between Fibers

Here’s a short example of how switching between fibers is done¹. In this example we’re doing the fiber switching manually:

require 'fiber'

@f1 = Fiber.new do
  puts 'Hi from f1'
  @f2.transfer
  puts 'Hi again from f1'
end

@f2 = Fiber.new do
  puts 'Hi from f2'
  @f1.transfer
end

puts 'Hi from main fiber'
@f1.transfer
puts 'Bye from main fiber'

In the above example, we create two fibers. The main fiber then transfers control (“switches”) to fiber @f1, which then transfers control to @f2, which then returns control to @f1. When @f1 has finished running, control is returned automatically to the main fiber, and the program terminates. The program’s output will be:

Hi from main fiber
Hi from f1
Hi from f2
Hi again from f1
Bye from main fiber

One peculiarity of the way switching between fibers is done, is that the context switch is always initiated from the currently active fiber. In other words, the currently active fiber must voluntarily yield control to another fiber. This is also the reason why this model of concurrency is called “cooperative concurrency.”

Another important detail to remember is that a fiber created using Fiber.new always starts in a suspended state. It will not run unless you first switch to it using Fiber#transfer.

Controlling fiber state via the context switch

What I find really interesting about the design of Ruby fibers is that the act of pausing a fiber, then resuming it later, is seen as a normal method call from the point of view of the fiber being resumed: we make a call to Fiber#transfer, and that call will return only when the fiber who made that call is itself resumed using a reciprocal call to Fiber#transfer. The return value will be the value given as an argument to the reciprocal call, as is demonstrated in the following example:

ping = Fiber.new do |peer|
  loop do
    msg = peer.transfer('ping')
    puts msg
  end
end

pong = Fiber.new do |msg|
  loop do
    puts msg
    msg = ping.transfer('pong')
  end
end

ping.transfer(pong)

The block provided to Fiber.new can take a block argument which will be set to the value given to Fiber#transfer the first time the fiber is being resumed. We use this to set ping’s peer, and then use Fiber#transfer to pass messages between the ping and pong fibers.

This characteristic of Fiber#transfer has profound implications: it means that we can control a fiber’s state using the value we pass to it as we resume it. Here’s an example for how this could be done:

main = Fiber.current
f = Fiber.new do |msg|
  count = 0
  loop do
    case msg
    when :reset
      count = 0
    when :increment
      count += 1
    end
    puts "count = #{count}"
    msg = main.transfer
  end
end

3.times { |i| f.transfer(:increment) } # count = 3
f.transfer(:reset)                     # count = 0
3.times { |i| f.transfer(:increment) } # count = 3

In the above example, the fiber f gets its state updated each time it is resumed. The main fiber can control f’s state by passing a special value when calling Fiber#transfer. As long as we have in place a well-defined convention for how the transferred value is interpreted by fibers in a given program, we can implement arbitrarily complex semantics for controlling our fibers’ state and life-cycle. Later on in this article we’ll see how Polyphony couples this mechanism with return values from blocking operations, as well as exceptions that permit us to cancel any long-running operation at any time.

Switching fibers on long-running operations

Now that we have a feel for how fibers can be paused and resumed, we can discuss how this can be used in conjunction with blocking operations. For the sake of our discussion, a blocking operation is any operation that potentially needs to wait for some external event to occur, such as a socket becoming readable, or waiting for a timer to elapse. We want to be able to have multiple concurrent tasks, each proceeding at its own pace, and each yielding control whenever it needs to wait for some external event to occur.

In order to demonstrate how this might work, let’s imagine a program where we have two fibers: one waits for data to arrive on STDIN, then echoes it; and a second fiber that prints the time once a second:

@echo_printer = Fiber.new do
  loop do
    if STDIN.readable?
      msg = STDIN.readpartial(1024)
      puts msg
    end
    @time_printer.transfer
  end
end

@time_printer = Fiber.new do
  timer = Timer.new(1)
  loop do
    if timer.elapsed?
      puts "Time: #{Time.now}"
      timer.reset
    end
    @echo_printer.transfer
  end
end

In each of the above fibers, we have have a condition that tells us if the fiber can proceed: for @echo_printer the condition is STDIN.readable?; for @time_printer it’s timer.elapsed?. What’s notable though about this example that the switching between fibers is done explicitly, and that each fiber needs to check a condition continually. Of course, this is not ideal, since the two fibers will just pass control between them until one of the conditions is met and actual work is done. If you run such a program, you’ll see one of your CPU cores saturated. But the main insight to draw here is that each fiber can yield control to another fiber if it cannot go on doing actual work.

Automatic fiber switching using an event reactor

Let’s see if we can avoid endlessly checking for readiness conditions, by using an event reactor - a piece of software that lets you subscribe to specific events, most importantly I/O readiness, and timers. In this case we’ll be using ever, a tiny Ruby gem that I wrote a few months ago, implementing an event reactor for Ruby apps based on libev. Here’s our rewritten example:

require 'ever'
require 'fiber'

evloop = Ever::Loop.new
@reactor = Fiber.current

@echo_printer = Fiber.new do
  loop do
    msg = STDIN.readpartial(1024)
    puts msg
    @reactor.transfer
  end
end

@time_printer = Fiber.new do
  loop do
    puts "Time: #{Time.now}"
    @reactor.transfer
  end
end

# register interest in events
evloop.watch_io(@echo_printer, STDIN, false, false)
evloop.watch_timer(@time_printer, 1, 1)

# run loop
evloop.each { |fiber| fiber.transfer }

As you can see, our fibers no longer need to check for readiness. The reactor, after registering interest in the respective event for each fiber, runs the event loop, and when an event becomes available, the corresponding fiber is resumed. Once resumed, each fiber does its work, then yields control back to the reactor.

This is already a great improvement, since our program does not need to endlessly check for readiness, and the code for each of the fibers doing actual work looks almost normal. The only sign we have of anything “weird” going on is that each fiber needs to yield control back to the reactor by calling @reactor.transfer.

If we look more closly at the above program, we can describe what’s actually happening as follows:

@echo_printer = Fiber.new do
  loop do
    msg = STDIN.readpartial(1024)
    puts msg

    wait_for_events if queued_events.empty?
    queued_events.each { |e| e.fiber.transfer }
  end
end

@time_printer = Fiber.new do
  loop do
    puts "Time: #{Time.now}"

    wait_for_events if queued_events.empty?
    queued_events.each { |e| e.fiber.transfer }
  end
end

...

For each of our fiber, at any point where the fiber needs to wait, we first look at our list of queued events. If there are none, we wait for events to occur. Finally we proceed to handle those events by transferring control to each respective fiber. This needs to be done for each blocking or long-running operation: reading from a file, reading from a socket, writing to a socket, waiting for a time period to elapsed, waiting for a process to terminate, etc. What if we had a tool that could automate this for us? This is where Polyphony enters the stage.

Using Polyphony for fiber-based concurrency

Polyphony automates all the different aspects of fiber switching, which boils down to knowing which fiber should be running at any given moment. Let’s see how Polyphony solves the problem of fiber switching by using it to rewrite the example above:

require 'polyphony'

echo_printer = spin do
  loop do
    msg = STDIN.read
    puts msg
  end
end

time_printer = spin do
  loop do
    sleep 1
    puts "Time: #{Time.now}"
  end
end

Fiber.await(echo_printer, time_printer)

As our rewritten example shows, we got completely rid of calls to Fiber#transfer. The fiber switching is handled automatically and implicitly by Polyphony². Each fiber is an autonomous infinite loop made of a sequence of operations that’s simple to write and simple to read. All the details of knowing when STDIN is ready or when a second has elapsed, and which fiber to run at any moment, are conveniently taken care of by Polyphony.

Even more importantly, Polyphony offers a fluent, idiomatic API that mostly gets out of your way and feels like an integral part of the Ruby core API. Polyphony introduces the #spin global method, which creates a new fiber and schedules it for running as soon as possible (remember: fibers start their life in a suspended state.) The call to Fiber.await means that the main fiber, from which the two other fibers were spun, will wait for those two fibers to terminate. Because both echo_printer and time_printer are infinite loops, the main fiber will wait forever.

Now that we saw how Polyphony handles behind the scenes all the details of switching between fibers, let’s examine how Polyphony does that.

The fiber switching dance

A fiber will spend its life in one of three states: :running, :waiting or :runnable³. As discussed above, only a single fiber can be :running at a given moment (for each thread). When a fiber needs to perform a blocking operation, such as reading from a file descriptor, it makes a call to the Polyphony backend associated with its thread, which performs the actual I/O operation. The backend will submit the I/O operation to the OS (using the io_uring interface⁴,) and will switch to the first fiber pulled from the runqueue, which will run until it too needs to perform a blocking operation, at which point another fiber switch will occur to the next fiber pulled from the runqueue.

Meanwhile, our original fiber has been put in the :waiting state. Eventually, the runqueue will be exhausted, which means that all fibers are waiting for some event to occur. At this point, the Polyphony backend will check whether any of the currently ongoing I/O operation have completed. For each completed operation, the corresponding fiber is “scheduled” by putting it on the runqueue, and the fiber transitions to the :runnable state. The runnable state means that the operation the fiber was waiting for has been completed (or cancelled), and the fiber can be resumed.

Once all completed I/O operations have been processed, the backend performs a fiber switch to the first fiber available on the runqueue, the fiber transitions back to the :running state, and the whole fiber-switching dance recommences.

What’s notable about this way of scheduling concurrent tasks is that Polyphony does not really have an event loop that wraps around your code. Your code does not run inside of a loop. Instead, whenever your code needs to perform some blocking operation, the Polyphony backend starts the operation, then switches to the next fiber that is :runnable, or ready to run.

The runqueue

The runqueue is simply a FIFO queue that contains all the currently runnable fibers, that is fibers that can be resumed. Let’s take our last example and examine how the contents of the runqueue changes as the program executes:

runqueue #=> []

echo_printer = spin { ... }
runqueue #=> [echo_printer]

time_printer = spin { ... }
runqueue #=> [echo_printer, time_printer]

# at this point the two fibers have been put on the runqueue and will be resumed
# once the current (main) fiber yields control:
Fiber.await(echo_printer, time_printer)

# while the main fiber awaits, the two fibers are resumed. echo_printer will
# wait for STDIN to become readable. time_printer will wait for 1 second to
# elapse.

# The runqueue is empty.
runqueue #=> []

# Since there's no runnable fiber left, the Polyphony backend will wait for
# io_uring to generate at least one completion entry. A second has elapsed, and
# time_printer's completion has arrived. The fiber becomes runnable and is put
# back on the runqueue.
runqueue #=> [time_printer]

# Polyphony pulls the fiber from the runqueue and switches to it. The time is
# printed, and time_printer goes back to sleeping for 1 second. The runqueue is
# empty again:
runqueue #=> []

# The Polyphony backend waits again for completions to occur. The user types a
# line and hits RETURN. The completion for echo_printer is received, and
# echo_printer is put back on the runqueue:
runqueue #=> [echo_printer]

# Polyphony pulls the fiber from the runqueue and switches to it.
runqueue #=> []

...

While the runqueue was represented above as a simple array of runnable fibers, its design is actually much more sophisticated than that. First of all, the runqueue is implemented as a ring buffer in order to achieve optimal performance. The use of a ring buffer algorithm results in predictable performance characteristics for both adding and removing of entries from the runqueue. When adding an entry to a runqueue that’s already full, the underlying ring buffer is reallocated to twice its previous size. For long running apps, the runqueue size will eventually stabilize around a value that reflects the maximum number of currently runnable fibers in the process.

In addition, each entry in the runqueue contains not only the fiber, but also the value with which it will be resumed. If you recall, earlier we talked about the fact that we can control a fiber’s state whenever we resume it by passing it a value using Fiber#transfer. The fiber will receive this value as the return value of its own previous call to Fiber#transfer.

In Polyphony, each time a fiber is resumed, the return value is checked to see if it’s an exception. In case of an exception, it will be raised in the context of the resumed fiber. Here’s an excerpt from Polyphony’s io_uring backend that implements the global #sleep method:

VALUE Backend_sleep(VALUE self, VALUE duration) {
  Backend_t *backend;
  GetBackend(self, backend);

  VALUE resume_value = Qnil;
  io_uring_backend_submit_timeout_and_await(backend, NUM2DBL(duration), &resume_value);
  RAISE_IF_EXCEPTION(resume_value);
  RB_GC_GUARD(resume_value);
  return resume_value;
}

In the code above, we first submit an iouring timeout entry, then yield control (by calling Fiber#transfer on the next runnable fiber) and await its completion. When the call to io_uring_backend_submit_timeout_and_await returns, our fiber has alreadey been resumed, with resume_value holding the value returned from our call to Fiber#transfer. We use RAISE_IF_EXCEPTION to check if resume_value is an exception, and raise it in case it is.

It’s also important to note the resume values stored alongside fibers in the runqueue can be used in effect to control a fiber’s state, just like in bare bones calls to Fiber#transfer. This can be done using the Polyphony-provided Fiber#schedule method, which puts the fiber on the runqueue, along with the provided resume value:

lazy = spin do
  loop do
    msg = suspend
    puts "Got #{msg}"
  end
end

every(1) { lazy.schedule('O hi!') }

In the example above, our lazy fiber suspends itself (using #suspend, which puts it in a :waiting state), and the main fiber schedules it once every second along with a message. The lazy fiber receives the message as the return value of the call to #suspend. One important difference between Fiber#schedule and Fiber#transfer is that Fiber#schedule does not perform a context switch. It simply puts the fiber on the runqueue along with its resume value. The fiber will be resumed as soon as all previous runnable fibers have been resumed and have yielded control.

Yielding control

As explained above, blocking operations involve submitting the operation to the io_uring interface, and then yielding (or #transferring) control to the next runnable fiber. A useful metaphor here is the relay race: only a single person runs at any given time (she holds the baton,) and eventually the runner will pass the baton to the next person who’s ready to run. Let’s examine what happens during this “passing of the baton” in a little more detail.

In effect, once the I/O operation has been submitted, the Polyphony backend calls the backend_base_switch_fiber function, which is responsible for this little ceremony, which consists of the following steps:

Shift the first runqueue entry from the runqueue.
If the entry is not nil:
- Check if it’s time to do a non-blocking poll (in order to prevent starvation of I/O completions. See discussion below.)
- proceed to step 4.
Otherwise:
- Perform a blocking poll.
- Go back to step 1.
Transfer control to the entry’s fiber with the entry’s resume value using Fiber#transfer.

All this happens in the context of the fiber that yields control, until a context switch is performed in step 4. To reuse our “relay race” metaphor, each time the current runner wishes to pass the baton to the next one, it’s as if she has a little gadget in her hand that holds the baton, performs all kinds of checks, finds out who the next runner is, and finally hands it over to the next runner.

Polling for I/O completions

When there are no runnable fibers left, Polyphony polls for at least one io_uring completion to arrive. For each received completion, the corresponding fiber is scheduled by putting it on the runqueue. Once the polling is done, the first fiber is pulled off the runqueue and is then resumed.

As we saw, polling for completions is only perofrmed done when the runqueue is empty. But what about situations where the runqueue is never empty? Consider the following example:

@f1 = spin { loop { @f2.schedule; puts 'f1' } }
@f2 = spin { loop { @f1.schedule; puts 'f2' } }

Fiber.await(@f1, @f2)

Each of the fibers above will run in an infinite loop, scheduling its peer and then printing a message. As shown above, the call to #puts, being an I/O operation, causes the Polyphony backend to submit the write operation to the io_uring interface, and then perform a context switch to the next runnable fiber. In order for the call to #puts to return, the Polyphony backend needs to poll for completions from the io_uring interface. But, since the runqueue is never empty (both fibers are scheduling each other, effectively adding each other to the runqueue,) the runqueue will never be empty!

In order to be able to deal with such circumstances, and prevent the “starvation” of completion processing, the Polyphony backend periodically performs a non-blocking check for any received completions. This mechanism assures that even in situations where our application becomes CPU-bound (since there’s always some fiber running!) we’ll continue to process io_uring completions, and our entire process will continue to behave normally.

Implications for performance

The asynchronous nature of the io_uring interface has some interesting implications for the performance of Polyphony’s io_uring backend. As mentioned above, the submission of SQE’s is deferred, and performed either when a certain number of submissions have acumulated, or before polling for completions.

In Polyphony, runnable fibers are always prioritized, and polling for events is done mostly when there are no runnable fibers. Theoretically, this might have a negative impact on latency, but I have seen the io_uring backend achieve more than twice the throughput achieved by the libev backend. Tipi, the Polyphony-based web server I’m currently developing, boasts a request rate of more than 138K requests/second at the time of this writing.

In fact, Polyphony provides multiple advantages over other concurrency solutions: The number of I/O-related syscalls is minimized (by using the io_uring interface.) In addition, the use Fiber#transfer to transfer control directly to the next runnable fiber halves the number of context-switches when compared to using Fiber#resume/Fiber.yield.

Most importantly, by prioritizing runnable fibers over processing of I/O completions (with an anti-starvation mechanism as described above,) Polyphony lets a Ruby app switch easily between I/O-bound work and CPU-bound work. For example, when a Polyphony-based web server receives 1000 connections in the space of 1ms, it needs to perform a lot of work, setting up a fiber, a parser and associated state for each connection. The design of Polyphony’s scheduling system allows the web server to do this burst of hard work while deferring any I/O work for later submission. When this CPU-bound burst has completed, the web server fires all of its pending I/O submissions at once, and can proceed to check for completions.

Cancellation

Now that we have a general idea of how Polyphony performs fiber switching, let’s examine how cancellation works in Polyphony. We want to be able to cancel any ongoing operation at any time. This can be done for any reason: a timeout has elapsed, an exception has occurred in a related fiber, or business logic that requires cancelling some specific operation in specific circumstances. The mechanism for doing this is simple, as mentioned above: we use an exception as the resume value that will be transferred to the fiber when the context switch occurs.

Here’s a simple example that shows how cancellation looks from the app’s point of view:

def gets_with_timeout(io, timeout)
  move_on_after(timeout) { io.gets }
end

The #move_on_after global API sets up a timer, then runs the given block. If the timer has elapsed before the block has finished executing, the fiber is scheduled with a Polyphony::MoveOn exception, any ongoing I/O operation is cancelled by the backend, the exception is caught by #move_on_after and an optional cancellation value is returned.

If we want to generate an exception on timeout, we can instead use the #cancel_after API, which will schedule the fiber with a Polyphony::Cancel exception, and the exception will have to be caught by the app code:

def gets_with_timeout(io, timeout)
  cancel_after(timeout) { io.gets }
rescue Polyphony::Cancel
  "gets was cancelled!"
end

More ways to control fiber execution

In addition to the various APIs discussed above, here’s a list of some of the various APIs used for controlling fiber execution:

#suspend - transition from :running to :waiting state. A fiber that was suspended will not be resumed until it is manually scheduled.
#snooze - transition from :running to :runnable state. The fiber is put on the runqueue, and will eventually get its turn again. This method is useful when a fiber performs a long-running CPU-bound operation and needs to periodically let other fibers have their turn.
Fiber#schedule - transition from :waiting to :runnable state, with an optional resume value.
Fiber#terminate - terminate a fiber.

Conclusion

Polyphony has been built to harness the full power of Ruby fibers, and provide a solid and joyful experience for those wishing to write highly-concurrent Ruby apps. There are many subtleties to designing a robust, well-behaving fiber-based concurrent environment. For example, there’s the problem of forking from arbitrary fibers, or the problem of correctly handling signals. Polyphony aims to take care of all these details, in order to be able to handle a broad range of applications and requirements.

I hope this article has helped clear up some of the mystery and misconceptions about Ruby fibers. Please let me know if you have specific questions about fibers in general or Polyphony in particular, and feel free to browse Polyphony’s source code. Contributions will be gladly accepted!

Footnotes

1. In this article we’ll confine ourselves to using the Fiber#transfer API for fiber switching, which is better suited for usage with symmetric coroutines. Using the Fiber.yield/Fiber#resume API implies an asymmetric usage which is better suited for fibers used as generators or iterators. Sadly, most articles dealing with Ruby fibers only discuss the latter, and make no mention of the former. Please note that in order to use Fiber#transfer we first need to require 'fiber'.

2. I’ve mentioned above the Fiber Scheduler interface introduced in Ruby 3.0, based on the work of Samuel Williams. This new feature, baked into the Ruby core, has roughly the same capabilities as Polyphony when it comes to automatically switching between fibers based on I/O readiness. As its name suggests, this is just a well-defined interface. In order to be able to employ it in your Ruby app, you’ll need to use an actual fiber scheduler. At the moment, the following fiber schedulers are available, in different states of production-readiness: evt, event, and my own libev-scheduler.

3. A fourth state, :dead is used for fibers that have terminated. A fiber’s state can be interrogated using Fiber#state.

4. Polyphony also includes an alternative backend used on non-Linux OSes or older Linux kernels. Both backends have the same capabilities. In this article we’ll discuss only the io_uring backend.