One of the other cool things that came out of the Firefox
Desktop work week that I mentioned in my previous post was
increased exposure to a handy tool written by
taras and
bsmedberg:
jydoop.
Benjamin has
a post explaining what it is and how it works, but I wanted to describe one specific use: measuring
main-thread responsiveness issues in the wild.
First, a bit of backstory!
One of my team’s overarching goals for a long time has
been eliminating all sources of main-thread file I/O.[1] One of the worst main-thread I/O offenders in Firefox code
is our
SQLite Storage API. A few years ago we went a bit overboard using a version of
this API that is synchronous, and does its file I/O work on
the main thread. So recently we’ve been focused a lot on
eliminating all users of that API.
One consumer of this API that we
fixed in Firefox 23
is the
Form History code
(responsible for remembering information entered into web
forms and offering to re-fill it).
Neil Deakin,
markh and
mak worked together to
fix the core service (based on earlier work from a previous
Mozilla intern, Felix),
but after landing that improvement, taras noticed that there
were
still some remaining issues. We eventually determined that the likely primary culprit
for the remaining Form History main-thread I/O was Firefox
Sync’s use of the older Form History API.
markh and
rnewman worked
together to
fix Sync, but we weren’t entirely sure whether that was
sufficient to eliminate all of the main thread I/O we were
seeing reported by telemetry.
Now, to the jydoop part
Taras had identified that our initial Form History fix
wasn’t complete using the
SlowSQL telemetry dashboard. This gives a high-level overview of slow SQL statements
occurring in the wild, and he noticed that after we landed the
original fix for form history, some formhistory.sqlite
statements were still showing up. Unfortunately, the SlowSQL
dashboard only offers two time ranges: previous week, and the
previous month, and it aggregates data across all builds from
a specific channel (e.g. “nightly”), which means
that I couldn’t easily use it to confirm whether our
Sync fix was effective.
That’s where jydoop came into the picture. It allows
much more sophisticated analysis of the SlowSQL data. I wrote
a couple of jydoop scripts:
one
to summarize the slowSQL reports for formhistory.sqlite from
builds before the
Sync fix
(i.e. Nightly builds with a build ID < 20130623) and
another
to do the same for builds containing the fix. I then ran those
scripts against Telemetry reports from the period from
2013-06-21 to 2013-06-24 (inclusive, i.e. roughly two days on
either side of the landing of the fix).
The results!
The
results
are pretty conclusive: the reports of slow SQL statements for
formhistory.sqlite from builds including the Sync fix were
dramatically reduced. The most frequently slow statement from
builds before the fix (“COMMIT TRANSACTION”) was
responsible for a total of 417 seconds of main-thread
unresponsiveness (174993 occurences of slowness at an average
of 143ms of “jank” at a time), over that period.
The Sync fix prevented that statement (and many other slow
statements) from occurring at all, thus removing a significant
cause of main-thread responsiveness issues experienced by
Firefox users in the wild.
Next steps
If you look at the results in more detail, they reveal that
there are still some slow SQL main-thread operations happening
on formhistory.sqlite – they’re less frequent and
severe, and are tracked by
other bugs. I’m going to be using jydoop to gather more data on
the improvements my team’s already made, as well as
gauge the potential for improvement in areas we’re still
working on. I also want others on the team to learn to use
jydoop, so that people can make their own measurements to
validate the improvements they’re working on.
[1] Main-thread file I/O is a
term used to describe trying to read from or write to the
disk on the same thread (the “main thread”) as
you’re trying to respond to user input (e.g. mouse/key
events) and do other things like drawing the UI. Reading
from or writing to the disk can take a relatively long time,
particularly if you have a slow hard drive, so we want to
always do it on a background thread so that those other
important responsiveness tasks (responding to input and
updating the UI) aren’t affected.