Skip to content

Conversation

@dan42
Copy link

@dan42 dan42 commented Aug 24, 2023

What problem is this PR intended to solve?

Parsing html documents in threads has no performance benefit; there is no concurrency.

I have just learned today about the "nogvl" ruby API, described in thread.c, as a way to permit concurrent/parallel execution of non-ruby code by releasing the GVL (Global VM Lock).

So I tried experimenting a bit with wrapping gumbo_parse_with_options and build_tree inside rb_thread_call_without_gvl, and the results are promising. As seen in the table below, originally we have the same number of operations per second no matter the concurrency. By releasing the GVL during gumbo_parse_with_options we increase the number of operations per sec. Here the increase maxes out around 4 threads (the number of cores). 2x operations per sec is not bad at all I think. By also releasing the GVL during build_tree we get a bit of extra concurrency.

parsing N times a 216k html file
0: original with GVL, and therefore no concurrency
1: release GVL for gumbo_parse_with_options
2: release GVL for gumbo_parse_with_options and build_tree

    0 0 1 1 2 2
threads parse calls time parse/s time parse/s time parse/s
1 100 2.34 42.81 2.31 43.25 2.40 41.74
2 200 4.97 40.20 3.00 66.60 2.89 69.24
3 300 7.26 41.34 4.01 74.78 3.71 80.82
4 400 9.61 41.63 5.02 79.64 4.62 86.58
5 500 12.10 41.34 6.17 81.04 5.73 87.20
6 600 14.60 41.08 7.25 82.79 7.08 84.74
7 700 17.20 40.69 8.51 82.25 7.86 89.02
8 800 19.41 41.21 9.76 82.00 8.92 89.71
9 900 21.85 41.18 10.83 83.13 9.73 92.51
10 1000 24.21 41.30 11.92 83.91 11.09 90.15

What do you think?

Some caveats: I don't know very much about this nogvl API, so I'm not sure if I should have used rb_thread_call_without_gvl2, if the RUBY_UBF_IO unblock function is appropriate to use here, if the code would crash on some other input, etc. Also I'm assuming (at least it looked that way to me) that gumbo_parse_with_options and build_tree are pure gumbo/libxml functions that don't invoke any of the Ruby API that depends on the GVL.

Ideally there would only be one call to rb_thread_call_without_gvl instead of two, but that would require pulling build_tree inside perform_parse somehow and I have no idea if that's feasible, why build_tree is executed inside rb_ensure, etc.

@stevecheckoway
Copy link
Contributor

I forget why we needed to use rb_ensure. It might not be needed any longer. But there was some reason at some point. :)

I'll try to take a look at that code again soon.

@flavorjones flavorjones added this to the v1.18.0 milestone Jul 8, 2024
@flavorjones flavorjones modified the milestones: v1.18.0, v1.19.0 Dec 16, 2024
@mohamedhafez
Copy link

just a little vote of interest from those of us using nokogiri to parse HTML documents in different threads! ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants