Crashing your production app with iteration
bug ruby
At work, several bugs were filed due to HTTP requests failing with internal server errors. They were all caused by the same exception being raised:
#<FiberError: can't set a guard page: Cannot allocate memory>
This looks like we’re running out of memory on the box. However, when I checked, the affected boxes were not using much memory: usage was never more than 50% at the time the exception was raised. Then, I came across this bug report: https://bugs.ruby-lang.org/issues/17263. The reported bug is unrelated, but one of the replies gave the cause of the error:
Regarding “can’t set a guard page” it’s because of your system is limiting the number of memory mapped segments. Each fiber stack requires a guard page and this is considered a separate memory map entry.
Why can’t it allocate memory?
The specific code block in Ruby that raises the FiberError is:
if (mprotect(page, RB_PAGE_SIZE, PROT_NONE) < 0) {
    munmap(allocation->base, count*stride);
    rb_raise(rb_eFiberError, "can't set a guard page: %s", ERRNOMSG);
}
https://github.com/ruby/ruby/blob/75ed086348da66e4cfe9488ae9ece5462dd2aef9/cont.c#L549-L552
The call to mprotect(2) is creating a guard page, and a return value less than
zero means it failed to change the memory protection on the page. A guard page
is a memory page used to separate the Fiber’s stack within the Ruby process’s
memory. It prevents Fibers from accidentally interacting due to a memory
overrun, as trying to access memory in the guard page would result in a
segmentation fault.
The ERRNOMSG global that Ruby uses in the exception message is set to “Cannot
allocate memory” when mprotect(2) returns the error value ENOMEM. There are
three situations that cause this error value to be returned:
- The kernel didn’t have enough memory available to allocate its internal structures. This didn’t happen because we saw no memory pressure on the box.
- The address range [page, page + RB_PAGE_SIZE) isn’t valid for the process, or includes pages that have not been mapped (via mmap(2)). This would indicate a bug in Ruby, which, while possible, seemed exceptionally unlikely.
- Changing the memory protection on a region would result in too many memory mappings. This seemed like the most likely: we were allocating too many Fibers. It is also what the reply on the Ruby bug report stated.
This call to mprotect(2) tries to create new memory mapped segments because it
is removing read and write access to a memory range to create the guard page. By
default, Linux limits the number of memory mapped segments per process to 65530
(the vm.max_map_count sysctl), but you can change it with:
sysctl -w vm.max_map_count=x
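To see how close a process is to that limit, something like the following could be used. It is only a sketch: the helper name `mapping_usage` is made up for illustration, and it assumes a Linux box where /proc/self/maps lists one memory mapped segment per line and /proc/sys/vm/max_map_count holds the limit.

```ruby
# Hypothetical helper (not part of Ruby or the original post): compares the
# number of memory mapped segments in this process with the system limit.
# Returns nil when the procfs files are not available (e.g. not on Linux).
def mapping_usage(maps_path: "/proc/self/maps",
                  limit_path: "/proc/sys/vm/max_map_count")
  return nil unless File.readable?(maps_path) && File.readable?(limit_path)

  used = File.foreach(maps_path).count          # one line per mapping
  limit = Integer(File.read(limit_path).strip)  # value of vm.max_map_count
  { used: used, limit: limit, ratio: used.fdiv(limit) }
end

usage = mapping_usage
if usage
  puts "#{usage[:used]} of #{usage[:limit]} memory mapped segments in use"
else
  puts "procfs not available on this system"
end
```

Logging the ratio from inside the application around the time of the errors would be one way to confirm the theory before changing any code.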
If we were creating lots of Fibers then it’s possible that we are running into
this limit because each guard page creates two new memory mapped segments. A
memory mapped segment is a contiguous block of memory with the same
protections. By changing the protection of a page in a segment from read/write
to none, we end up with three segments: read/write, none, read/write. Thus,
where there was once one memory mapped segment, there are now three segments.
FiberError is being raised because there are too many memory mapped segments in the Ruby process due to too many Fibers being created.
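The splitting can be illustrated with a toy model. This is not how the kernel tracks mappings; `Segment` and `protect` are names invented for this sketch, and addresses are just small integers standing in for pages.

```ruby
# Toy model of memory mapped segments: a contiguous range with one protection.
Segment = Struct.new(:start, :stop, :prot)

# Hypothetical stand-in for mprotect(2): changes the protection of
# [start, stop) and splits any segment that only partially overlaps it.
def protect(segments, start, stop, prot)
  segments.flat_map do |seg|
    # Untouched if there is no overlap or the protection already matches.
    next [seg] if stop <= seg.start || start >= seg.stop || seg.prot == prot

    lo = [start, seg.start].max
    hi = [stop, seg.stop].min
    parts = []
    parts << Segment.new(seg.start, lo, seg.prot) if seg.start < lo
    parts << Segment.new(lo, hi, prot)
    parts << Segment.new(hi, seg.stop, seg.prot) if hi < seg.stop
    parts
  end
end

before = [Segment.new(0, 16, :rw)]
after = protect(before, 8, 9, :none) # turn page 8 into a guard page

puts "#{before.size} segment before, #{after.size} after"
after.each { |seg| puts "#{seg.start}...#{seg.stop} #{seg.prot}" }
```

One read/write segment goes in, and read/write, none, read/write comes out, which is the two extra segments per guard page described above.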
Why are we creating too many Fibers?
The first clue was that all the exception stack traces came from the same
method: Enumerable#first!. This is a method that we monkey patch onto the
Enumerable module. It is like first except, rather than returning nil when
the object is empty, it raises an ArgumentError. This is useful because we use
Sorbet extensively. first returns a nilable value, whilst first! returns a
non-nilable value. This allows us to express the invariant that the enumerable
should never be empty. Around the time this error started occurring in the logs,
we had updated the method with the following patch:
 def first!
-  T.must(self.first)
+  self.to_enum.next
+rescue StopIteration
+  raise ArgumentError.new("Enumerable must not be empty: #{self.inspect}")
 end
We’d gone from using #first on the receiver object, to converting the receiver
to an Enumerator, then calling Enumerator#next. This meant we were now
performing external iteration. External iteration is where the caller
manually steps the iterator from one item to the next. Iteration is driven by
the caller. In Ruby, external iteration uses a Fiber. As we use
Enumerable#first! all over the code base, we were suddenly using lots of
Fibers, which led to us running into the memory mapped segments limit.
We’re creating too many Fibers because the implementation of Enumerable#first! was changed to use external iteration.
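The two styles of iteration look like this side by side. This is a small sketch using only core Ruby; the variable names are illustrative.

```ruby
items = [10, 20, 30]

# Internal iteration: #each drives the loop and calls our block for each item.
internal = []
items.each { |item| internal << item }

# External iteration: the caller drives the loop by calling #next itself.
enum = items.to_enum
external = []
loop { external << enum.next } # Kernel#loop ends quietly on StopIteration

# Asking an enumerator over an empty collection for an item raises
# StopIteration, which is what the patched #first! rescued.
empty_raised = begin
  [].to_enum.next
  false
rescue StopIteration
  true
end

puts internal.inspect
puts external.inspect
puts empty_raised
```

Both styles visit the same items; the difference is who is in control of the stepping.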
Why is a Fiber used for external iteration?
Fibers are a type of userspace thread. The important difference between
fibers and threads is that fibers use cooperative multitasking. This means
that different fibers have to work together, yielding to each other and being
manually resumed. Threads are scheduled by the operating system and don’t
require manual pausing and resumption to allow tasks to continue to make
progress. Ruby provides Fiber as a fiber implementation. The reason it is used
for external iteration of an Enumerator is that Enumerators are implemented
using internal iteration. Internal iteration is where you pass a function to
the iterator and the function is called with each item in the iterator.
Iteration is driven by the iterator. Enumerator#each is the method on which all
iteration of an Enumerator is built. It is a method to which you pass a block
and it calls the block with each item in the enumerator: internal iteration. By
using the cooperative multitasking provided by fibers, Ruby is able to switch
contexts between the caller iterating the Enumerator and the block given to
#each. In Iteration Inside and Out, Part 2, Bob Nystrom gives a great
simplified implementation exemplifying how Ruby converts internal iteration to
external iteration:
class MyEnumerator
  include Enumerable

  def initialize(obj)
    @fiber = Fiber.new do
      obj.each do |value|
        Fiber.yield(value)
      end
      raise StopIteration
    end
  end

  def next
    @fiber.resume
  end
end
The key insight here is that we’re using the Fiber to switch contexts between
the caller of #next and the block given to #each.
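To make the context switching visible, here is a small trace of control bouncing between a caller and a block running inside a Fiber. It is a sketch using only Fiber.new, Fiber.yield and Fiber#resume; the `log` array exists purely for illustration.

```ruby
log = []

fiber = Fiber.new do
  [1, 2].each do |value|
    log << "block got #{value}"
    Fiber.yield(value) # pause here and hand the value back to the caller
  end
  :done
end

log << "caller resumes"
first = fiber.resume   # runs the block until the first Fiber.yield
log << "caller got #{first}"
second = fiber.resume  # continues the block from where it paused
log << "caller got #{second}"

puts log
```

The log entries alternate between the caller and the block, which shows that #each is paused part-way through rather than restarted on each resume.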
The actual Ruby implementation is written in C and a bit more difficult to
follow, but the concept is the same. Interestingly, with the introduction of
JITs to Ruby (MJIT, YJIT, RJIT, etc.) we could see more implementations moving
out of C and into Ruby, of which Enumerator could be a candidate. So, in the
future the implementation may look similar to the Ruby one above.
External iteration creates a Fiber so that Enumerable, which is internally iterable, can provide an external iteration interface.
Can you not? External iteration without a fiber
What can we do to fix it? By using internal iteration we don’t need a
fiber. Enumerable#each performs internal iteration. Knowing this we can
re-write #first! as:
def first!
  self.each { |i| return i }
  raise ArgumentError.new("Enumerable must not be empty: #{self.inspect}")
end
The reason this works is that #each is being passed a block. When you return
from a block, it will return from the surrounding function, in this case
#first!. Therefore, if the enumerable is not empty we’ll return the first
item. If it is empty then the block given to #each won’t be called. Then we
execute the next line, raising ArgumentError. This is the first time I’ve
been grateful for Ruby’s differentiation between blocks and functions!
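The behaviour can be checked with a stand-alone version of the same idea. `first_or_raise` is a hypothetical name used here so that the sketch does not monkey patch Enumerable; the body mirrors the re-written method.

```ruby
# Hypothetical stand-alone equivalent of the re-written #first!.
def first_or_raise(enumerable)
  # `return` inside the block exits first_or_raise itself, not just the block.
  enumerable.each { |item| return item }
  raise ArgumentError, "Enumerable must not be empty: #{enumerable.inspect}"
end

puts first_or_raise([1, 2, 3]).inspect
puts first_or_raise([nil]).inspect # a nil first element is returned, not an error

begin
  first_or_raise([])
rescue ArgumentError => e
  puts e.message
end
```

Because only #each is called, no Enumerator is built and #next is never invoked, so this path does not go through the Fiber-based external iteration described earlier.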
Interestingly, my solution to prevent #first! using a fiber has a similar
shape to the usage of Fiber.yield to pass control back to the caller. The
difference is that my solution only works for a single iteration, while using a
Fiber is generalisable to iterating over the entire enumerator.
Sorbet is a gradual type system for Ruby. Amongst other things, it lets you define type signatures for your functions and methods then enforce them statically and at runtime. You can read more about it here: https://sorbet.org.
The reason for moving away from T.must(self.first) is that T.must will raise TypeError if it is called on nil. It’s totally valid for nil to be the first element of an enumerator, so calling T.must on the result will incorrectly raise. With #first we can’t tell the difference between, for example, [] and [nil], and T.must will raise on the result of #first for both.