One of the other cool things that came out of the Firefox Desktop work week that I mentioned in my previous post was increased exposure to a handy tool written by taras and bsmedberg: jydoop. Benjamin has a post explaining what it is and how it works, but I wanted to describe one specific use: measuring main-thread responsiveness issues in the wild.
First, a bit of backstory!
One of my team’s overarching goals for a long time has been eliminating all sources of main-thread file I/O.[1] One of the worst main-thread I/O offenders in Firefox code is our SQLite Storage API. A few years ago we went a bit overboard using a version of this API that is synchronous, and does its file I/O work on the main thread. So recently we’ve been focused a lot on eliminating all users of that API.
One consumer of this API that we
fixed in Firefox 23
is the
Form History code
(responsible for remembering information entered into web
forms and offering to re-fill it).
Neil Deakin,
markh and
mak worked together to
fix the core service (based on earlier work from a previous
Mozilla intern, Felix),
but after landing that improvement, taras noticed that there
were
still some remaining issues. We eventually determined that the likely primary culprit
for the remaining Form History main-thread I/O was Firefox
Sync’s use of the older Form History API.
markh and
rnewman worked
together to
fix Sync, but we weren’t entirely sure whether that was
sufficient to eliminate all of the main thread I/O we were
seeing reported by telemetry.
Now, to the jydoop part
Taras had identified that our initial Form History fix wasn’t complete using the SlowSQL telemetry dashboard. This gives a high-level overview of slow SQL statements occurring in the wild, and he noticed that after we landed the original fix for form history, some formhistory.sqlite statements were still showing up. Unfortunately, the SlowSQL dashboard only offers two time ranges: previous week, and the previous month, and it aggregates data across all builds from a specific channel (e.g. “nightly”), which means that I couldn’t easily use it to confirm whether our Sync fix was effective.
That’s where jydoop came into the picture. It allows much more sophisticated analysis of the SlowSQL data. I wrote a couple of jydoop scripts: one to summarize the slowSQL reports for formhistory.sqlite from builds before the Sync fix (i.e. Nightly builds with a build ID < 20130623) and another to do the same for builds containing the fix. I then ran those scripts against Telemetry reports from the period from 2013-06-21 to 2013-06-24 (inclusive, i.e. roughly two days on either side of the landing of the fix).
The results!
The results are pretty conclusive: the reports of slow SQL statements for formhistory.sqlite from builds including the Sync fix were dramatically reduced. The most frequently slow statement from builds before the fix (“COMMIT TRANSACTION”) was responsible for a total of 417 seconds of main-thread unresponsiveness (174993 occurences of slowness at an average of 143ms of “jank” at a time), over that period. The Sync fix prevented that statement (and many other slow statements) from occurring at all, thus removing a significant cause of main-thread responsiveness issues experienced by Firefox users in the wild.
Next steps
If you look at the results in more detail, they reveal that there are still some slow SQL main-thread operations happening on formhistory.sqlite – they’re less frequent and severe, and are tracked by other bugs. I’m going to be using jydoop to gather more data on the improvements my team’s already made, as well as gauge the potential for improvement in areas we’re still working on. I also want others on the team to learn to use jydoop, so that people can make their own measurements to validate the improvements they’re working on.
[1] Main-thread file I/O is a term used to describe trying to read from or write to the disk on the same thread (the “main thread”) as you’re trying to respond to user input (e.g. mouse/key events) and do other things like drawing the UI. Reading from or writing to the disk can take a relatively long time, particularly if you have a slow hard drive, so we want to always do it on a background thread so that those other important responsiveness tasks (responding to input and updating the UI) aren’t affected.
This is excellent. I hope we get many more responsiveness fixes.
Keep up the great work guys
Think there’s a typo in the “before” script – it’s filtering on ‘buildData’ instead of ‘buildDate’.
Indeed there is! I was tweaking the scripts manually, I must have typoed it after running it at some point. I will be more rigorous and version control them next time 🙂