Telemetry from Super-slow Startups, First Impressions
I have been investigating very slow startups by looking at Telemetry reports of startups taking longer than 30 seconds. There were 49 reports of super-slow startups during a one month period starting Dec 15th that met the following criteria:
- First paint* > 30 seconds
- Uninterrupted startup
- Windows OS
- Recent Nightly or Aurora build reporting on slow SQL Telemetry
- Uptime < 5minutes
* There are 3 metrics used for representing startup speed: “main”, “firstPaint” and “sessionRestore”. The names are mostly self-explanatory, but you can find more detail here.
Two thirds of the slow startup reports had more than 5 addons. While addons may be contributing to slow startups in some cases, there doesn’t seem to be much difference between Telemetry reports for browsers with few addons and many addons. There were also two reports with negative values for uptime (filed bug 722651, similar to bug 670008), but this is a minor bug and not a sign of corruption in the reports.
UPDATE: Some of these reports may come from developer builds with Telemetry manually enabled (Bug 722240) which might explain some of the long startup times. It doesn’t look like there is any way to distinguish local builds from official builds in this dataset (fixed in Bug 720875).
Startup times
The root causes behind the long startups are not immediately obvious from these Telemetry reports. This is partly because Telemetry doesn’t measure everything of interest yet and partly because the Telemetry data was not divided into startup and post-startup measurements (now fixed in bug 720456). Currently, the Telemetry histograms contain measurements both from startup and the initial 5 minutes of browser use.
Most of the startups (as measured by firstPaint) were between 30-60s long. In 32 of the 49 reports, there was a delay of at least 10 seconds before XRE_main was called, perhaps suggesting that these systems were very slow or the startup took place during a time of very high contention for the computer’s resources, perhaps during boot up or a resume from hibernation. In 7 reports, the time to reach XRE_main was even greater than the additional time for producing the first paint. The additional time required for session restore was either negligible (nothing to restore) or surprisingly relatively small compared to total startup time (less than 15%).
One observation of note is that sessionRestore sometimes occurs before firstPaint (bug 715402). This likely further degrades the user experience during slow startups.
Histogram data
The total times recorded by Telemetry histograms in the slow startup reports are not large enough to fully explain the lengthy start up times. The only histograms reporting times on par with the total start up times (or much larger!) are those for:
- asynchronous SQL execution (MOZ_STORAGE_ASYNC_REQUESTS_MS)
- URL classifier work (URLCLASSIFIER_PS_CONSTRUCT_TIME, MOZ_SQLITE_URLCLASSIFIER_READ_MS, …)
- and HTTP page loading (HTTP_SUBITEM_OPEN_LATENCY_TIME, HTTP_SUB_COMPLETE_LOAD_CACHED, …)
None of these operations should be affecting startup times: page-loading should not be happening before first paint (barring bug 715402 above) and URL classifier operations & asynchronous SQL are executed on worker threads and should not delay startup.
There are, however, a few operations lasting multiple seconds that should not be, such as DWRITEFONT_DELAYEDINITFONTLIST_TOTAL and DWRITEFONT_DELAYEDINITFONTLIST_COLLECT taking 17 seconds each in one report. Similar font performance issues are already being addressed in bug 705594.
The table below shows a few histograms of interest with total times in excess of 1 second during the initial 2-4 minutes:
# of Telemetry reports |
Histogram |
32 |
FIND_PLUGINS |
28 |
MOZ_SQLITE_OTHER_SYNC_MAIN_THREAD_MS |
28 |
CACHE_DEVICE_SEARCH |
25 |
MOZ_SQLITE_OPEN_MS |
22 |
MOZ_SQLITE_OPEN_MAIN_THREAD_MS |
21 |
NETWORK_DISK_CACHE_OPEN |
18 |
MOZ_SQLITE_PLACES_READ_MAIN_THREAD_MS |
16 |
TELEMETRY_PING |
12 |
GC_MS |
10 |
MOZ_SQLITE_OTHER_READ_MAIN_THREAD_MS |
9 |
GC_MARK_MS |
9 |
XUL_REFLOW_MS |
8 |
CYCLE_COLLECTOR |
8 |
MOZ_SQLITE_PLACES_SYNC_MAIN_THREAD_MS |
6 |
MOZ_SQLITE_WEBAPPS_SYNC_MAIN_THREAD_MS |
6 |
CACHE_DISK_SEARCH |
5 |
GC_SWEEP_MS |
5 |
SYSTEM_FONT_FALLBACK_FIRST |
3 |
MOZ_SQLITE_OTHER_SYNC_MS |
3 |
GDI_INITFONTLIST_TOTAL |
2 |
MOZ_SQLITE_OTHER_READ_MS |
2 |
NETWORK_DISK_CACHE_DELETEDIR |
2 |
IMAGE_DECODE_ON_DRAW_LATENCY |
2 |
FONTLIST_INITFACENAMELISTS |
2 |
FX_TAB_ANIM_OPEN_MS |
1 |
CHECK_JAVA_ENABLED |
1 |
FONTLIST_INITOTHERFAMILYNAMES |
1 |
MOZ_SQLITE_COOKIES_READ_MAIN_THREAD_MS |
1 |
MOZ_SQLITE_PLACES_WRITE_MAIN_THREAD_MS |
1 |
SYSTEM_FONT_FALLBACK |
Although the operations measured by the histograms above are likely not solely responsible for the long startup times, they merit further performance investigation.
Slow SQL data
Adding to the slow system I/O hypothesis, all of the slow startup reports contain multiple SQL statements with execution times over 100ms. Nevertheless, the time taken by these slow SQL statements is still not enough to adequately explain the slow startups.
These are the top 10 most common slow SQL statements across the slow startup reports:
Main Thread
# of Telemetry reports |
SQL string |
23 |
RELEASE SAVEPOINT 'default' |
13 |
SELECT h.id, h.url, IFNULL(b.title, h.title), h.rev_host, h.visit_count, h.last_visit_date, f.url, null, b.id, b.dateAdded, b.lastModified, b.parent, null, h.frecency, b.position, b.type, b.fk, b.guid FROM moz_bookmarks b LEFT JOIN moz_places h ON b.fk = h.id LEFT JOIN moz_favicons f ON h.favicon_id = f.id WHERE b.parent = :parent ORDER BY b.position ASC |
13 |
INSERT OR IGNORE INTO moz_places (url, title, rev_host, hidden, typed, frecency, guid) VALUES (:page_url, :page_title, :rev_host, :hidden, :typed, :frecency, GENERATE_GUID()) |
12 |
SELECT session FROM moz_historyvisits ORDER BY visit_date DESC |
12 |
SELECT COUNT(1) AS numEntries FROM moz_formhistory |
10 |
DELETE FROM addon WHERE internal_id=:internal_id |
10 |
DELETE FROM moz_formhistory WHERE lastUsed <= :expireTime |
10 |
SELECT a.item_id FROM moz_anno_attributes n JOIN moz_items_annos a ON n.id = a.anno_attribute_id WHERE n.name = :anno_name |
8 |
UPDATE moz_bookmarks SET lastModified = :date WHERE id = :item_id |
7 |
PRAGMA page_size |
Other Threads
# of Telemetry reports |
SQL string |
45 |
SELECT domain, partial_data, complete_data FROM moz_classifier |
36 |
DELETE FROM moz_classifier WHERE table_id=?1 AND chunk_id=?2 |
19 |
UPDATE moz_places SET frecency = ROUND(frecency * .975) WHERE frecency > 0 |
19 |
SELECT 1 FROM moz_places h WHERE url = ?1 AND last_visit_date NOTNULL |
18 |
INSERT INTO expiration_notify (v_id, url, guid, visit_date, expected_results) SELECT v.id, h.url, h.guid, v.visit_date, :limit_visits FROM moz_historyvisits v JOIN moz_places h ON h.id = v.place_id WHERE (SELECT COUNT(*) FROM moz_places) > :max_uris AND visit_date < strftime('%s','now','localtime','start of day','-7 days','utc') * 1000000 ORDER BY v.visit_date ASC LIMIT :limit_visits |
18 |
UPDATE moz_places SET frecency = CALCULATE_FRECENCY(:page_id) WHERE id = :page_id |
18 |
SELECT id, title, hidden, typed, guid FROM moz_places WHERE url = :page_url |
18 |
ANALYZE moz_places |
17 |
INSERT INTO moz_historyvisits (from_visit, place_id, visit_date, visit_type, session) VALUES (:from_visit, :page_id, :visit_date, :visit_type, :session) |
16 |
UPDATE moz_places SET frecency = CALCULATE_FRECENCY(id) WHERE frecency < 0 |
Most of the statements above were executed during regular browser use and not during startup time. Perhaps the most interesting query in the list above is the main thread “PRAGMA page_size” query. It should be a quick read-only query, but in one report it took 4.888s on the main thread while another, more complex query took the exact same amount of time on a different thread. As it turns out, several main thread queries suffer from lock contention with queries on other threads (e.g. bug 722242).
Another interesting point is the factor of 10 or even 100 difference between MOZ_STORAGE_ASYNC_REQUESTS_MS and the slow SQL telemetry times. This is caused by slow SQL telemetry limiting itself to reporting only on prepared statements executed against known Firefox databases. We should also track slow SQL executed from dynamic strings, perhaps by reporting the state of the JavaScript stack instead of the SQL string itself which may contain privacy-sensitive argument values (bug 722368).
Next steps
The obvious next steps are to 1) eliminate the SQL blind spots by tracking times for all SQL executed in the browser and 2) analyze startup data from the most recent Nightlies containing separate startup metrics. I will also be posting more startup analysis from the individual slow startup reports in this batch of data.