A Compositional Approach to Optimizing the Performance of Ruby Apps

05·10·2021

Ruby has long been derided as a slow programming language. While this accusation has some element of truth to it, successive Ruby versions, released yearly, have made great strides in improving Ruby’s performance characteristics.

In addition to all the iterative performance improvements - Ruby 2.6 and 2.7 were especially impressive in terms of performance gains - recent versions have introduced bigger features aimed at improving performance, namely: a JIT compiler, the Ractor API for achieving parallelism, and the Fiber scheduler interface aimed at improving concurrency for I/O bound applications.

While those three big developments have yet to prove themselves in real-life Ruby apps, they represent great opportunities for improving the performance of Ruby-based apps. The next few years will tell if any of those new technologies will deliver on its promise for Ruby developers.

While Ruby developers (especially those working in and around Ruby on Rails) are still looking for the holy grail of Ruby fastness, I’d like to explore a different way of thinking about developing Ruby apps (and gems) that can be employed to achieve optimal performance.

Ruby Makes Developers Happy, Until it Comes to Performance…

Why do I love programming in Ruby? First of all, because it’s optimized for developer happiness. Ruby is famous for allowing you to express your ideas in any number of ways. You can write functional programs, or go all in on Java-like enterprise OOP patterns. You can build DSLs, and you can replace or extend whole chunks of the Ruby core functionality by redefining core class methods. And metaprogramming in Ruby lets you do nuclear stuff!

All this comes, of course, with a price tag in the form of reduced performance when compared to other, less magical, programming languages, such as Go, C or C++. But experienced Ruby developers will have already learned how to get the most “bang for the buck” out of Ruby, by carefully designing their code so as to minimize object allocations (and subsequent GC cycles), and picking core Ruby and Stdlib APIs that provide better performance.

Another significant way Ruby developers have been dealing with problematic performance is by using Ruby C-extensions, which implement specific functionalities in native compiled code that can be invoked from plain Ruby code.

Compositional Programming in Ruby

It has occurred to me during my work on Polyphony, a concurrency library for Ruby, that a C-extension API can be designed in such a way as to provide a small, task-specific execution layer for small programs composed of multiple steps that can be expressed as data structures using plain Ruby objects. Let me explain using an example.

Let’s say we are implementing an HTTP server, and we would like to implement sending a large response using chunked encoding. Here’s how we can do this in Ruby:

def send_chunked_encoding(data, chunk_size)
  idx = 0
  len = data.bytesize
  while idx < len
    chunk = data.byteslice(idx, chunk_size) # slice by bytes, to match bytesize
    idx += chunk_size
    @socket << "#{chunk.bytesize.to_s(16)}\r\n#{chunk}\r\n"
  end
  # send empty chunk
  @socket << "0\r\n\r\n"
end

This is pretty short and sweet, but look how we’re allocating a string for each chunk and doing index arithmetic in Ruby. This kind of code surely could be made more efficient by reimplementing it as a C-extension. But if we already go to the trouble of writing a C-extension, we might want to generalize this approach, so we might be able to implement sending chunked data over other protocols as well.
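To get a rough feel for the allocation cost, we can count allocated objects with GC.stat. The following is only a sketch: it uses a variant of the loop above that takes the socket as an argument and slices by bytes, and count_allocations is a hypothetical helper, not part of any library. The exact numbers will vary between Ruby versions.

```ruby
require 'stringio'

# Variant of the loop above, writing to any object that responds to #<<.
def send_chunked_encoding(socket, data, chunk_size)
  idx = 0
  len = data.bytesize
  while idx < len
    chunk = data.byteslice(idx, chunk_size)
    idx += chunk_size
    socket << "#{chunk.bytesize.to_s(16)}\r\n#{chunk}\r\n"
  end
  socket << "0\r\n\r\n"
end

# Hypothetical helper: number of objects allocated while the block runs.
def count_allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

data = 'x' * (64 * 1024)
small = count_allocations { send_chunked_encoding(StringIO.new, data, 1024) }
large = count_allocations { send_chunked_encoding(StringIO.new, data, 16 * 1024) }
puts "1 KiB chunks:  #{small} objects allocated"
puts "16 KiB chunks: #{large} objects allocated"
```

The smaller the chunk size, the more iterations the loop runs and the more short-lived strings it creates.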

What if we could come up with a method implemented in C, that takes a description of what we’re trying to do? Suppose we have a method with the following interface:

def send_data_in_chunks(
    data,
    chunk_size,
    chunk_head,
    chunk_tail
  )
end

We could then implement HTTP/1 chunked encoding by doing the following:

def send_chunked_encoding(data, chunk_size)
  @socket.send_data_in_chunks(
    data,
    chunk_size,
    ->(len) { "#{len.to_s(16)}\r\n" }, # chunk size + CRLF
    "\r\n"                             # trailing CRLF
  )
  @socket << "0\r\n\r\n" # terminating empty chunk, as in the plain Ruby version
end

If the #send_data_in_chunks method is implemented in C, this means that Ruby code is not involved at all in the actual sending of the data. The C-extension code is responsible for looping and writing the data to the socket, and the Ruby code just provides instructions for what to send before and after each chunk.
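To make the intended semantics concrete, here is a pure-Ruby stand-in for such a method. It is only a sketch of the behavior a C implementation might have: the explicit io argument and the rule that the head and tail may each be a string, a callable or nil are assumptions made for this example.

```ruby
require 'stringio'

# Pure-Ruby stand-in for the hypothetical C method: writes `data` to `io` in
# chunks, framing each chunk with a head and a tail. `chunk_head` and
# `chunk_tail` may each be a String, a callable taking the chunk length, or nil.
def send_data_in_chunks(io, data, chunk_size, chunk_head, chunk_tail)
  frame = ->(part, len) { part.respond_to?(:call) ? part.call(len) : part }
  offset = 0
  total = data.bytesize
  while offset < total
    chunk = data.byteslice(offset, chunk_size)
    offset += chunk.bytesize
    head = frame.(chunk_head, chunk.bytesize)
    tail = frame.(chunk_tail, chunk.bytesize)
    io.write(head) if head
    io.write(chunk)
    io.write(tail) if tail
  end
  total
end

out = StringIO.new
send_data_in_chunks(out, 'hello world', 4, ->(len) { "#{len.to_s(16)}\r\n" }, "\r\n")
out << "0\r\n\r\n" # terminating empty chunk
p out.string
```

The caller only describes the framing. The loop, the slicing and the writes all live in one place, which is the part that would move into C.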

Polyphony’s chunked splicing API

The above approach is actually how static file responses are generated in Tipi, the web server for Ruby I’m currently developing. One of Tipi’s distinguishing features is that it can send large files without ever loading them into memory, by using Polyphony’s Backend#splice_chunks API (Polyphony emulates splicing on non-Linux OSes). Here’s an excerpt from Tipi’s HTTP/1 adapter code:

def respond_from_io(request, io, headers, chunk_size = 2**14)
  formatted_headers = format_headers(headers, true, true)
  request.tx_incr(formatted_headers.bytesize)

  Thread.current.backend.splice_chunks(
    io,
    @conn,
    formatted_headers,
    "0\r\n\r\n",
    ->(len) { "#{len.to_s(16)}\r\n" },
    "\r\n",
    chunk_size
  )
end

The Backend#splice_chunks method is slightly more sophisticated than the previous example, as it also takes a string to send before all chunks (here it’s the HTTP headers), and a string to send after all chunks (the empty chunk string "0\r\n\r\n"). My non-scientific benchmarks have shown speed gains of up to 64% for multi-megabyte HTTP responses!

The main idea behind the #splice_chunks API is that the application provides a plan, or a program for what to do, and the underlying system “runs” that program.

Chaining multiple I/O operations in a single Ruby method call

A similar approach was also used to implement chaining of multiple I/O operations, a feature particularly useful when running on recent Linux kernels with io_uring (Polyphony automatically uses io_uring starting from Linux version 5.6). Here again, the same idea is employed: the application provides a “program” expressed using plain Ruby objects. Here’s how chunked transfer encoding can be implemented using Backend#chain (when splicing a single chunk from an IO instance):

def send_chunk_from_io(io, chunk_size)
  r, w = IO.pipe
  len = w.splice(io, chunk_size)
  if len > 0
    Thread.current.backend.chain(
      [:write, @conn, "#{len.to_s(16)}\r\n"],
      [:splice, r, @conn, len],
      [:write, @conn, "\r\n"]
    )
  else
    @conn.write("0\r\n\r\n")
  end
  len
end

Let’s take a closer look at the call to #chain:

Thread.current.backend.chain(
  [:write, @conn, "#{len.to_s(16)}\r\n"],
  [:splice, r, @conn, len],
  [:write, @conn, "\r\n"]
)

The Backend#chain API takes one or more Ruby arrays, each describing an I/O operation. The currently supported operations are :write, :send and :splice. For each operation we provide the operation type followed by its arguments. The most interesting aspect of this API is that it allows us to reap the full benefits of using io_uring, as the given operations are linked so that they will be performed by the kernel one after the other without the Ruby code ever being involved! The #chain method will return control to the Ruby layer once all operations have been performed by the kernel.
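The shape of such a “program” can be illustrated without Polyphony. The sketch below is a hypothetical pure-Ruby emulation (run_chain is an invented name): it runs the operations one after the other in Ruby and approximates :splice with IO.copy_stream, so it shows only the data format, not the kernel-side linking that makes the real thing fast. It handles :write and :splice only.

```ruby
require 'stringio'

# Hypothetical emulation of a chain call: runs each operation in order and
# returns the result of the last one.
def run_chain(*ops)
  result = nil
  ops.each do |type, *args|
    case type
    when :write
      io, data = args
      result = io.write(data)
    when :splice
      src, dest, len = args
      # Copies up to `len` bytes from src to dest.
      result = IO.copy_stream(src, dest, len)
    else
      raise ArgumentError, "unsupported operation: #{type.inspect}"
    end
  end
  result
end

body = StringIO.new('hello')
conn = StringIO.new
len = body.size

run_chain(
  [:write, conn, "#{len.to_s(16)}\r\n"],
  [:splice, body, conn, len],
  [:write, conn, "\r\n"]
)
p conn.string
```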

Designing Compositional APIs

This approach to API design might be called compositional APIs - the idea here is that the API provides a way to compose multiple tasks or operations by describing them using native data structures.

Interestingly enough, io_uring itself takes this approach: you describe I/O operations using SQEs (submission queue entries), which are nothing more than C data structures conforming to a standard interface. In addition, as mentioned above, with io_uring you can chain multiple operations to be performed one after another.

Future plans for io_uring include making it possible to submit eBPF programs for running arbitrary eBPF code kernel side. That way, we might be able to implement chunked encoding in eBPF code, and submit it to the kernel using io_uring.

A More General Approach to Chaining I/O operations

It has recently occurred to me that the compositional approach to designing APIs can be further enhanced and generalized, for example by providing the ability to express flow control. Here’s how the chunk splicing functionality might be expressed using such an API:

def respond_from_io(request, io, headers, chunk_size = 2**14)
  formatted_headers = format_headers(headers, true, true)
  r, w = IO.pipe

  Thread.current.backend.submit(
    [:write, @conn, formatted_headers],
    [:loop,
      [:splice, io, w, chunk_size],
      [:break_if_ret_eq, 0],
      [:store_ret, :len], # store the return code in @len
      [:write, @conn, ->(ret) { "#{ret.to_s(16)}\r\n" }],
      [:splice, r, @conn, :len], # use stored @len value
      [:write, @conn, "\r\n"]
    ],
    [:write, @conn, "0\r\n\r\n"]
  )
end

Now there are clearly a few problems here: this kind of API can quickly run into the problem of Turing-completeness - will developers be able to express any kind of program using this API? Where are the boundaries and how do we define them?

Also, how can we avoid having to allocate all those arrays every time we call the #respond_from_io method? All those allocations can put more pressure on the Ruby GC, and can themselves be costly in terms of performance. And that proc we provide - it’s still Ruby code that needs to be called for every iteration of the loop. That too can hurt performance.

The answers to all those questions are still not clear to me, but one solution I thought about was to provide a “library” of operation types that is a bit higher-level than a simple write or splice. For example, we can come up with an operation to write the chunk header, which can look something like this:

Thread.current.backend.submit(
  [:write, @conn, formatted_headers],
  [:loop,
    [:splice, io, w, chunk_size],
    [:break_if_ret_eq, 0],
    [:store_ret, :len],
    [:write_cte_chunk_size, @conn, :len],
    [:splice, r, @conn, :len],
    [:write, @conn, "\r\n"]
  ],
  [:write, @conn, "0\r\n\r\n"]
)
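To check that this program format hangs together, here is a minimal interpreter for it in plain Ruby. Everything in it is an assumption made for illustration: the IOProgram class name, the rule that a symbol operand refers to a stored value, and the use of IO.copy_stream to emulate :splice. A real implementation would live in the C-extension.

```ruby
require 'stringio'

# Hypothetical interpreter: keeps the last return value in @ret and named
# values in @vars.
class IOProgram
  def initialize(*ops)
    @ops = ops
  end

  def run
    @ret = nil
    @vars = {}
    run_ops(@ops)
    @ret
  end

  private

  # Runs a list of operations; returns :break if a break operation fired.
  def run_ops(ops)
    ops.each do |type, *args|
      case type
      when :write
        io, data = args
        data = data.call(@ret) if data.respond_to?(:call)
        @ret = io.write(data)
      when :splice
        src, dest, len = args
        @ret = IO.copy_stream(src, dest, value(len))
      when :write_cte_chunk_size
        io, len = args
        @ret = io.write("#{value(len).to_s(16)}\r\n")
      when :store_ret
        @vars[args.first] = @ret
      when :break_if_ret_eq
        return :break if @ret == args.first
      when :loop
        loop { break if run_ops(args) == :break }
      else
        raise ArgumentError, "unknown op: #{type.inspect}"
      end
    end
    nil
  end

  # Symbols refer to stored values; anything else is a literal.
  def value(arg)
    arg.is_a?(Symbol) ? @vars.fetch(arg) : arg
  end
end

src = StringIO.new('hello world')
conn = StringIO.new
r, w = IO.pipe
w.sync = true # make sure writes reach the pipe immediately

IOProgram.new(
  [:write, conn, "HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n"],
  [:loop,
    [:splice, src, w, 4],
    [:break_if_ret_eq, 0],
    [:store_ret, :len],
    [:write_cte_chunk_size, conn, :len],
    [:splice, r, conn, :len],
    [:write, conn, "\r\n"]
  ],
  [:write, conn, "0\r\n\r\n"]
).run
[r, w].each(&:close)
p conn.string
```

Even this small interpreter needs a return-value register, a variable store and a break mechanism, which hints at how quickly such an API drifts toward a general-purpose language.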

Adding IO References

Another improvement we can make is to provide a way to reference io instances and dynamic strings from our respond_from_io “program” using indexes. This will allow us to avoid allocating all those arrays on each invocation:

# program references:
# 0 - headers
# 1 - io
# 2 - @conn
# 3 - pipe r
# 4 - pipe w
# 5 - chunk_size
RESPOND_FROM_IO_PROGRAM = [
  [:write, 2, 0],
  [:loop,
    [:splice, 1, 4, 5],
    [:break_if_ret_eq, 0],
    [:store_ret, :len],
    [:write_cte_chunk_size, 2, :len],
    [:splice, 3, 2, :len],
    [:write, 2, "\r\n"]
  ],
  [:write, 2, "0\r\n\r\n"]
]

def respond_from_io(request, io, headers, chunk_size = 2**14)
  formatted_headers = format_headers(headers, true, true)
  r, w = IO.pipe
  Thread.current.backend.submit(RESPOND_FROM_IO_PROGRAM, formatted_headers, io, @conn, r, w, chunk_size)
end
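One wrinkle with integer references is that some integers are literals: the 0 in [:break_if_ret_eq, 0] is a value to compare against, not reference number 0. The sketch below shows one possible way to disambiguate, using a table of which argument positions hold references for each operation type. The table and the resolve_references helper are hypothetical, and the helper builds new arrays purely to make the mapping visible; a C implementation would presumably look up references in place instead.

```ruby
# Hypothetical table: for each operation type, the argument positions that
# may hold references into the argument list rather than literal values.
REFERENCE_POSITIONS = {
  write: [0, 1],
  splice: [0, 1, 2],
  write_cte_chunk_size: [0]
}.freeze

# Returns a copy of `program` with integer references replaced by the
# matching entries of `refs`. Symbols, strings and literal operands are kept.
def resolve_references(program, refs)
  program.map do |type, *args|
    next [type, *resolve_references(args, refs)] if type == :loop

    positions = REFERENCE_POSITIONS.fetch(type, [])
    resolved = args.each_with_index.map do |arg, i|
      positions.include?(i) && arg.is_a?(Integer) ? refs.fetch(arg) : arg
    end
    [type, *resolved]
  end
end

PROGRAM = [
  [:write, 2, 0],
  [:loop,
    [:splice, 1, 4, 5],
    [:break_if_ret_eq, 0],
    [:splice, 3, 2, :len]
  ]
].freeze

# Placeholder values standing in for the real headers, IO objects and size.
refs = ['headers', :io, :conn, :pipe_r, :pipe_w, 16_384]
resolved = resolve_references(PROGRAM, refs)
p resolved
```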

Creating IO Programs Using a DSL

Eventually, we could provide a way for developers to express IO programs with a DSL, instead of with arrays. We could then also use symbols for representing IO indexes:

RESPOND_FROM_IO_PROGRAM = Polyphony.io_program(
  :headers, :io, :conn, :pipe_r, :pipe_w, :chunk_size
) do
  write :conn, :headers
  io_loop do
    splice :io, :pipe_w, :chunk_size
    break_if_ret_eq 0
    store_ret :len
    write_cte_chunk_size :conn, :len
    splice :pipe_r, :conn, :len
    write :conn, "\r\n"
  end
  write :conn, "0\r\n\r\n"
end
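A DSL like this can be a thin layer that simply builds the array form. Here is a minimal sketch using instance_eval; the IOProgramBuilder class and the top-level io_program method are stand-ins invented for this example (Polyphony.io_program does not exist yet), and the parameter names are only recorded, not validated.

```ruby
# Hypothetical builder: each DSL call appends one operation array.
class IOProgramBuilder
  OPS = %i[write splice break_if_ret_eq store_ret write_cte_chunk_size].freeze

  attr_reader :ops

  def initialize
    @ops = []
  end

  OPS.each do |op|
    define_method(op) { |*args| @ops << [op, *args] }
  end

  # Nested blocks become a [:loop, ...] operation holding their own ops.
  def io_loop(&block)
    @ops << [:loop, *self.class.build(&block)]
  end

  def self.build(&block)
    builder = new
    builder.instance_eval(&block)
    builder.ops
  end
end

# Stand-in for the proposed Polyphony.io_program.
def io_program(*params, &block)
  { params: params, ops: IOProgramBuilder.build(&block).freeze }.freeze
end

program = io_program(:headers, :io, :conn, :pipe_r, :pipe_w, :chunk_size) do
  write :conn, :headers
  io_loop do
    splice :io, :pipe_w, :chunk_size
    break_if_ret_eq 0
    store_ret :len
    write_cte_chunk_size :conn, :len
    splice :pipe_r, :conn, :len
    write :conn, "\r\n"
  end
  write :conn, "0\r\n\r\n"
end
p program[:ops]
```

Since the block runs once, when the constant is defined, the arrays are allocated a single time rather than on every request.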

Does this look better? I’m not sure. Anyways, there are some rough edges here that will need to be smoothed out for this approach to work.

Implementing a Protocol Parser Using the Compositional Approach

It has occurred to me that this kind of approach, expressing a “program” using plain Ruby objects, to be executed by a C-extension, could also be applied to protocol parsing. I’ve recently released a blocking HTTP/1 parser for Ruby, called h1p, implemented as a Ruby C extension, and I had some ideas about how this could be done.

We introduce an IO#parse method that accepts a program for parsing characters. The program consists of a set of steps, each one reading consecutive characters from the IO instance:

# for each part of the line we can express the valid range of lengths
REQUEST_LINE_RULES = [
  [:read, { delimiter: ' ', length: 1..40, invalid: ["\r", "\n"], consume_delimiter: true }],
  [:consume_whitespace],
  [:read, { delimiter: ' ', length: 1..2048, invalid: ["\r", "\n"], consume_delimiter: true }],
  [:consume_whitespace],
  [:read_to_eol, { consume_eol: true, length: 6..8 } ]
]

HEADER_RULES = [
  [:read_or_eol, { delimiter: ':', length: 1..128, consume_delimiter: true }],
  [:return_if_nil],
  [:consume_whitespace],
  [:read_to_eol, { consume_eol: true, length: 1..2048, consume_delimiter: true }]
]

def parse_http1_headers
  (method, request_path, protocol) = @conn.parse(REQUEST_LINE_RULES)
  headers = {
    ':method' => method,
    ':path' => request_path,
    ':protocol' => protocol
  }
  
  while true
    (key, value) = @conn.parse(HEADER_RULES)
    return headers if !key

    headers[key.downcase] = value
  end
end
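To see whether these rules are expressive enough, here is a rough pure-Ruby interpreter for them. It is a sketch under several assumptions: parse is a plain method taking the IO as its first argument, delimiters and line endings are always consumed (the consume_delimiter and consume_eol options are ignored), and all the helper names are invented for this example.

```ruby
require 'stringio'

class ParseError < StandardError; end

# Reads characters up to `delimiter` (always consumed in this sketch).
def read_token(io, delimiter, invalid: [], limit: nil)
  token = +''
  loop do
    c = io.getc
    raise ParseError, 'unexpected end of input' if c.nil?
    return token if c == delimiter
    raise ParseError, "invalid character #{c.inspect}" if invalid.include?(c)

    token << c
    raise ParseError, 'token too long' if limit && token.length > limit
  end
end

def check_length(token, range)
  return token if range.nil? || range.cover?(token.length)

  raise ParseError, "bad token length: #{token.length}"
end

# True (and the line ending is consumed) if the input is at an empty line.
def empty_line?(io)
  first = io.getc
  return true if first == "\n"

  second = first == "\r" ? io.getc : nil
  return true if second == "\n"

  io.ungetc(second) if second
  io.ungetc(first) if first
  false
end

# Runs `rules` against `io`; returns the tokens read, or nil if a
# :return_if_nil step fires.
def parse(io, rules)
  tokens = []
  rules.each do |type, opts|
    opts ||= {}
    range = opts[:length]
    case type
    when :read
      token = read_token(io, opts[:delimiter], invalid: opts.fetch(:invalid, []), limit: range&.max)
      tokens << check_length(token, range)
    when :read_or_eol
      if empty_line?(io)
        tokens << nil
      else
        tokens << check_length(read_token(io, opts[:delimiter], limit: range&.max), range)
      end
    when :read_to_eol
      limit = range && range.max + 1 # leave room for a trailing "\r"
      tokens << check_length(read_token(io, "\n", limit: limit).chomp("\r"), range)
    when :consume_whitespace
      while (c = io.getc)
        next if [' ', "\t"].include?(c)

        io.ungetc(c)
        break
      end
    when :return_if_nil
      return nil if tokens.last.nil?
    else
      raise ArgumentError, "unknown rule: #{type.inspect}"
    end
  end
  tokens
end

request_line_rules = [
  [:read, { delimiter: ' ', length: 1..40, invalid: ["\r", "\n"] }],
  [:consume_whitespace],
  [:read, { delimiter: ' ', length: 1..2048, invalid: ["\r", "\n"] }],
  [:consume_whitespace],
  [:read_to_eol, { length: 6..8 }]
]
header_rules = [
  [:read_or_eol, { delimiter: ':', length: 1..128 }],
  [:return_if_nil],
  [:consume_whitespace],
  [:read_to_eol, { length: 1..2048 }]
]

io = StringIO.new("GET /foo HTTP/1.1\r\nHost: example.com\r\n\r\n")
request_line = parse(io, request_line_rules)
header = parse(io, header_rules)
end_of_headers = parse(io, header_rules)
p request_line, header, end_of_headers
```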

Here too, we can imagine being able to express these parsing rules using a DSL:

REQUEST_LINE_RULES = Polyphony.parse_program do
  read delimiter: ' ', length: 1..40, invalid: ["\r", "\n"], consume_delimiter: true
  consume_whitespace
  read delimiter: ' ', length: 1..2048, invalid: ["\r", "\n"], consume_delimiter: true
  consume_whitespace
  read_to_eol consume_eol: true, length: 6..8
end

It remains to be seen where the limits of this approach lie: can we really express everything that we need in order to parse any conceivable protocol? In addition, it is not clear whether this kind of solution provides performance benefits.

Summary

In this article I have presented an approach to optimizing the performance of Ruby apps by separating the program into two layers: a top layer written in Ruby, expressing low-level operations using Ruby data structures; and an implementation layer written in C for executing those operations in an optimized manner. This approach is particularly interesting when dealing with long-running or complex operations: sending an HTTP response with chunked encoding, parsing incoming data, running I/O operations in loops etc.

As I have mentioned above, this approach is similar to the one employed by io_uring on Linux. The idea is the same: we express (I/O) operations using data structures, then offload the execution to a lower-level optimized layer - in io_uring’s case it’s the kernel, in Ruby’s case it’s a C-extension.

It seems to me that

This is definitely an avenue I intend on further exploring, and I invite other Ruby developers to join me in this exploration. While we wait for all those exciting Ruby developments I mentioned at the beginning of this article to materialize (the new YJIT effort from Shopify looks especially promising), we can investigate other approaches that take advantage of Ruby’s expressivity while relying on native C code to execute lower level code.