Symbolizing the intermittent Electron-startup SIGSEGV crash dumps (Ubuntu debug
symbols + frame-pointer walk) shows the fault is an upstream Pango concurrency
bug, not a VS Code bug:
pango: fc_thread_func -> init_in_thread -> FcInit() (pangofc-fontmap.c)
fontconfig: FcInit -> FcConfigParseAndLoadFromMemory -> _FcConfigParse
libexpat: XML_ParseBuffer -> libc (NULL deref, SIGSEGV)
Pango >= 1.52's pango_fc_font_map_init() unconditionally spawns a
"[pango] fontconfig" thread that runs FcInit(); that races with the
Electron/Chromium main thread's own fontconfig use during startup and corrupts
fontconfig's global config while it is being parsed. The threaded design is a
known-bad area upstream (pango#784 "single fontconfig thread introduces a hang
... seems to be due to a race condition", pango#872). The thread is still
present in Pango 1.56, and there is no env var to disable it.
It only manifests in our CI because the race window is microscopic: it needs a
cold process, two threads hitting first-time FcInit() simultaneously, and a slow
machine. Our smoke job is a near-perfect trigger — fresh contended runners, a
wiped fontconfig cache + custom FONTCONFIG_FILE (so FcInit re-parses cold), and
~25 cold Electron starts per run. (This also explains why the expat version was
irrelevant and why dropping the config DOCTYPE made it worse: it is pure timing,
not parser/content.)
Fix: initialize fontconfig once, single-threaded, from an ELF constructor that
runs before main() (and thus before any thread exists), via a tiny LD_PRELOAD
shim. Pango's later threaded FcInit() then finds fontconfig already initialized
and returns immediately, so the concurrent parse never happens and the race is
eliminated.
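
A minimal sketch of such a shim, not necessarily the file added by this change.
It assumes the runner's fontconfig soname is libfontconfig.so.1 and resolves
FcInit() through dlopen/dlsym so the shim carries no link-time dependency. The
names preinit_fontconfig, fc_preinit_ctor and g_preinit_state are hypothetical.

```c
#include <dlfcn.h>
#include <stddef.h>

/* FcInit() is declared as `FcBool FcInit(void)`; FcBool is a typedef for int. */
typedef int (*fc_init_fn)(void);

/* 0 = constructor has not run, 1 = FcInit() succeeded,
 * -1 = library or symbol unavailable, or FcInit() reported failure.
 * Kept only so the outcome can be inspected from a debugger or core dump. */
static int g_preinit_state = 0;

/* Load the given fontconfig shared object and call FcInit() once.
 * The handle is intentionally never dlclose()d so the initialized
 * library stays resident for the lifetime of the process. */
static int preinit_fontconfig(const char *soname)
{
    void *handle = dlopen(soname, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        return -1;
    }

    fc_init_fn init = (fc_init_fn)dlsym(handle, "FcInit");
    if (init == NULL) {
        return -1;
    }

    return init() ? 1 : -1;
}

/* ELF constructor: the dynamic loader runs this before main(), so no
 * other thread exists yet and the first config parse is single-threaded.
 * The soname below is an assumption about the target system. */
__attribute__((constructor))
static void fc_preinit_ctor(void)
{
    g_preinit_state = preinit_fontconfig("libfontconfig.so.1");
}
```

Such a file could be built with something like
`cc -shared -fPIC -o libfcpreinit.so fcpreinit.c -ldl` and injected by setting
LD_PRELOAD to the resulting library for the Electron launch (the file names here
are placeholders). A failure to load fontconfig is deliberately non-fatal, so
the shim never makes startup worse than it was without it.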
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Linux smoke-test job works around the expat 2.6.1 fontconfig NULL-deref
CVEs by pointing FONTCONFIG_FILE at a minimal config with <include> removed.
However, that config still declared an external DTD:
<!DOCTYPE fontconfig SYSTEM "urn:fontconfig:fonts.dtd">
fontconfig feeds that DTD to expat as an external *parameter* entity, which
still hits the not-yet-backported CVE-2026-32776 / CVE-2026-32778 crash paths
on expat 2.6.1 even with <include> gone. This was observed in CI as a SIGSEGV
inside libexpat (called from libfontconfig) during Chromium browser-process
font initialization, which crashed Electron at startup. Because the smoke-test
launch used no timeout, that crash surfaced only as an opaque 120s Mocha
"before all" hook timeout.
fontconfig does not require the DOCTYPE, so drop it to remove the last
external-entity codepath. The full workaround can be removed once the runner
ships libexpat >= 2.7.5 (the step already auto-disables itself in that case).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* ci: split Electron test jobs into unit/integration and smoke
The Linux, Windows and macOS Electron PR test jobs are the slowest in CI,
dominated by the smoke test run. Split each into two parallel jobs - one
running unit + integration tests, the other running smoke tests - to cut
wall-clock time.
Done via two new parameters on the reusable workflows
(unit_and_integration_tests and smoke_tests, both defaulting to true) so
Browser and Remote jobs are unchanged. Artifact names get a -smoke suffix
on the smoke-only job to avoid upload collisions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* ci: gate build and diagnostics to correct Electron test phase
Follow-up to the Electron job split. Ensure each half only does the work
it needs:
- Gate "Build integration tests" on unit_and_integration_tests so the
smoke-only job skips it.
- Scope the before/after diagnostics steps to their phase (combined with
always()) so they don't run in the wrong job.
- Move the Copilot extension build into the smoke phase (gated on
smoke_tests) instead of compiling it unconditionally; align Linux,
Windows and macOS on the same ordering.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* ci: drop space and parens from Electron-Smoke job name
The Windows 1ES runner builds its JobId label from job_name, producing
"windows-test-Electron (Smoke)-...". The space and parentheses prevented
the runner from picking up the job. Rename the smoke job to Electron-Smoke
on all three platforms so the JobId is a plain slug.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fixes
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
refactor: update restore-node-modules action to support lookup-only functionality
- Replaced 'extract' input with 'lookup-only' to allow cache entry checks without downloading or extracting.
- Updated action logic to conditionally extract node_modules based on the new 'lookup-only' input.
- Adjusted workflow files to utilize 'lookup-only' for cache-warming jobs on Linux, macOS, and Windows.
* CI: speed up node_modules cache with zstd + shared scripts
Switch the Linux/macOS node_modules cache from single-threaded gzip
(tar -czf) to multi-threaded zstd. The "Create node_modules archive"
step was spending ~5min of single-core gzip on a multi-GB tree on every
cache miss; zstd -T0 uses all cores and decompresses much faster, so
cache-hit jobs benefit too. Windows stays on 7-Zip (already threaded).
Extract the archive/extract commands into shared per-platform scripts
under .github/workflows/node_modules_cache/ (cache.sh / cache.ps1, each
dispatching on an archive|extract argument) so the format and flags live
in one place instead of being duplicated across ~8 workflows. Bump
build/.cachesalt to invalidate existing gzip caches.
Also remove the obsolete extensions/copilot CI workflows
(copilot-setup-steps.yml, ensure-node-modules-cache.yml, pr.yml) and the
unused build/listBuildCacheFiles.js, and drop their now-stale entries
(plus lit-html and signals-core) from .eslint-allowed-javascript-files.
* ci: seed copilot node_modules cache on main and rename cache keys
Add copilot-linux and copilot-windows jobs to pr-node-modules.yml so the
copilot node_modules cache is populated on main. Rename the copilot cache
keys to copilot-node_modules-linux / copilot-node_modules-windows in pr.yml.
* ci: extract node_modules cache into composite actions
Factor the repeated node_modules cache plumbing into two local composite
actions, restore-node-modules and save-node-modules, and migrate all
workflows that used the cache.sh/cache.ps1 archive flow (pr, pr-node-modules,
pr-{linux,darwin,win32}-test, copilot-setup-steps, component-fixtures,
css-order-scan).
- restore-node-modules computes the key, restores the cache, optionally
extracts on a hit, and exports the resolved key via $GITHUB_ENV.
- save-node-modules archives node_modules and saves it to the cache, reusing
the key exported by restore so callers don't repeat the prefix.
- Bespoke install steps stay in the workflows, so per-job env/secrets never
cross the action boundary.
- Only seed the cache on branch pushes (component-fixtures skips PRs, whose
caches aren't shared).
* save the node_modules cache for now to test it
* ci: fix node_modules cache save dropping the archive
cache.sh wrote its archive as cache.tzst, but actions/cache reserves that
name for its own tarball and passes --exclude cache.tzst, so our archive was
excluded and an empty (~200 B) cache was saved on Linux/macOS. Rename the
archive to node-modules.tzst and bump build/.cachesalt to invalidate the
broken cache entries.
* empty commit
* Stop saving the node_modules cache from PR steps again
* ci: restore chat pipeline to windows-latest
* chore: remove node-gyp override
* chore: restore node-gyp override with comment
* refactor: rm dependency on @keyv/sqlite
The module locks the node-gyp dependency to 8.x because of its transitive
sqlite3 native module dependency, which in turn blocks moving to newer
Windows CI images (refs https://github.com/microsoft/vscode/issues/321267).
The module can be replaced with the built-in sqlite support in Node.js
without losing the on-disk cache format that has already been committed.
* chore: restore minimist
* chore: set sqlite busy timeout
* fix: decode json-buffer values for keyv cache compat
The "chat-lib tests (windows-latest)" job started failing at the
"Extract chat-lib" step (npm ci in extensions/copilot). npm ci builds
the native sqlite3@5.1.7 module — a transitive dependency of the
@keyv/sqlite devDependency — via `prebuild-install -r napi || node-gyp
rebuild`. prebuild-install finds no matching prebuilt, so it falls back
to node-gyp, which fails on the runner because the GitHub-hosted
windows-latest label now resolves to the Windows Server 2025 + Visual
Studio 2026 image, whose VS 18 toolchain the bundled node-gyp cannot
detect ("unknown version undefined ... could not find a version of
Visual Studio 2017 or newer").
This was the only npm ci job exposed to the new image: every other
Windows npm ci job runs on self-hosted pools pinned to windows-2022
(still VS 2022), and all other copilot npm ci jobs run on Linux/macOS.
Pin this matrix entry to windows-2022 to match, as recommended by the
runner-images migration notice (actions/runner-images#14017).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Lots of logging for chat smoke tests
* PR test workflows: build extensions/copilot before smoke tests
* PR test workflows: drop duplicate copilot compile from linux/win32 (was already built before integration tests)
* smoke tests: remove musl Claude binary on Linux glibc runner
The musl variant is probed first by @anthropic-ai/claude-agent-sdk and
fails to exec on glibc (ENOENT from missing ELF interpreter), which
caused the Test Claude session tests to time out.
Follow-up to #313128. The VSCODE_OSS fallback isn't needed for the
api.github.com calls in core-ci — secrets.GITHUB_TOKEN already
authenticates those reads with permissions: contents: read (added in
#304929), so we don't hit the anonymous rate limit on 1ES.
* ci: switch PR workflows back to 1ES self-hosted runners with JobId
Re-applies #311975 (reverted in #312033). Adds per-run+attempt JobId
labels to scope 1ES agents to specific GitHub Actions runs and prevent
intermittent runner cancellations.
Also switches the pr.yml compile job's GITHUB_TOKEN from the
ephemeral repo-scoped runner token to secrets.VSCODE_OSS so cross-repo
GitHub API release fetches (vscode-js-debug, vscode-js-debug-companion,
vscode-js-profile-visualizer, etc.) authenticate properly. On 1ES pools
the shared egress IPs hit the anonymous 60/hr api.github.com rate limit
and produced 403 fan-out across PRs last time.
* ci: fall back to GITHUB_TOKEN for fork PRs
Match the historical pattern from before #255987 — fork PRs can't
access secrets.VSCODE_OSS, so use the conditional to pick GITHUB_TOKEN
for forks.
* Allow cherry-pick bot PRs in engineering system changes check
Add an exception for PRs created by vs-code-engineering[bot] whose title
starts with [cherry-pick] and that carry the cherry-pick-artifact label.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fetch cherry-pick-artifact label via API at runtime
The label is applied ~2s after PR creation, so the webhook payload may
not include it. Fetch current labels from the API instead, gated behind
cheap event-payload checks to avoid extra API calls on unrelated PRs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add label retry loop and consolidate guard expressions
Retry the cherry-pick-artifact label check up to 3 times (2s apart) to
handle the ~2s delay between PR creation and label application.
Consolidate the repeated exception guards into a single 'allowed' step
with a 'blocked' output, simplifying downstream conditions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>