vscode/build/agent-sdk
Tyler James Leonhardt 56414283d5 AH: per-platform agent SDK build + CDN upload (#321012)
* AH: replace agentSdks {url, sha256} with {urlTemplate}

product.agentSdks.<sdk> now ships {version, urlTemplate} instead of
{version, url, sha256}. The runtime substitutes {sdkTarget} into the
template per-launch via a new IAgentSdkPackage.currentSdkTarget()
hook — Claude appends -musl on musl Linux hosts (detected from Node's
process.report.header.glibcVersionRuntime, no subprocess), Codex
never does (statically musl-linked, single Linux SKU).
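
A minimal sketch of the subprocess-free libc detection described above. The `LibcFamily` and `detectLibc` names appear later in this series, but the signature here is hypothetical: the inputs are injected so the logic can be exercised on any host. Treating a missing `glibcVersionRuntime` on Linux as musl is an assumption of the sketch.

```typescript
type LibcFamily = 'glibc' | 'musl';

// Subset of the Node diagnostic report header that matters here (assumed shape).
interface ReportHeader {
	glibcVersionRuntime?: string;
}

// Hypothetical signature: platform and header are passed in rather than read from `process`.
function detectLibc(platform: string, header: ReportHeader | undefined): LibcFamily | undefined {
	if (platform !== 'linux') {
		return undefined; // the libc family is only meaningful on Linux
	}
	// Assumption: glibc builds of Node report a runtime glibc version and musl builds omit it.
	return header?.glibcVersionRuntime ? 'glibc' : 'musl';
}
```

On a real host the header would come from the `header` property of the object returned by `process.report.getReport()`.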

Why the shape change:

  1. macOS Universal bundles ship arm64 + x64 binaries sharing one
     product.json — a fixed per-platform {url, sha256} could only be
     correct for one of the two halves. The template lets the same
     bundle serve both.
  2. The sha256 was belt-and-suspenders: product.json is covered by
     product.checksums inside the signed app bundle, URLs are HTTPS
     to a Microsoft-controlled CDN. The sha only guarded "trusted URL
     string, tampered edge bytes" — a much harder attack than
     tampering with product.json itself.

Downloader changes: sdkTarget joins the cache key path
(<userDataPath>/agent-host/sdk-cache/<pkg>/<sdkVersion>/<sdkTarget>/)
so Universal launches with different resolved targets get independent
caches. .complete sentinel content is now the source URL (debug-only;
the file's existence is the integrity signal).
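
As an illustration of the cache key layout above, a small sketch that builds the directory path. The `sdkCacheDir` helper is hypothetical and joins with `/` for readability; the real downloader presumably goes through its own path or URI utilities.

```typescript
// Hypothetical helper mirroring <userDataPath>/agent-host/sdk-cache/<pkg>/<sdkVersion>/<sdkTarget>/.
function sdkCacheDir(userDataPath: string, pkg: string, sdkVersion: string, sdkTarget: string): string {
	return [userDataPath, 'agent-host', 'sdk-cache', pkg, sdkVersion, sdkTarget].join('/');
}

// Two launches of one Universal bundle resolve different targets and therefore different directories.
const arm64Dir = sdkCacheDir('/data', 'claude', '0.3.168', 'darwin-arm64');
const x64Dir = sdkCacheDir('/data', 'claude', '0.3.168', 'darwin-x64');
```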

isAvailable() now gates on both product config AND currentSdkTarget()
resolving, so the provider doesn't register on hosts with no SKU
(armhf) even if some future product.json carries an agentSdks block.

Tests: drop sha-mismatch + stale-cache-by-sha tests; add coverage for
{sdkTarget} substitution, separate-cache-dirs-for-different-targets
(the Universal motivating case), currentSdkTarget-undefined gating
isAvailable, and CodexSdkPackage.currentSdkTarget agreeing with the
existing codexPackageSuffix table.

Pairs with the build PR (#321012), which will be rebased to emit the
new shape once this lands. See build/agent-sdk/TODO.md on that branch.

* AH: simplify per-package SDK target resolution

Replaces per-package `currentSdkTarget()` (one method per SDK,
re-implementing the same platform/arch table modulo a musl branch)
with a single boolean `hasSeparateMuslLinuxPackage` on the package
descriptor and a shared `resolveSdkTarget(pkg, host)` in the
downloader. Claude sets it true; Codex sets it false. The supported-
platforms whitelist collapses from three copies (claudeSdkTarget,
codexPackageSuffix, build's getSdkTargetForBuild) to one runtime
resolver paired with the build helper.
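
A sketch of how a shared resolver driven by one boolean could look. `resolveSdkTarget`, `hasSeparateMuslLinuxPackage`, and `ISdkTargetHost` are named in this commit; the whitelist contents and the `<platform>-<arch>[-musl]` string format are assumptions made for illustration.

```typescript
type LibcFamily = 'glibc' | 'musl';

interface ISdkTargetHost {
	platform: string;
	arch: string;
	libc: LibcFamily | undefined;
}

interface SdkPackageDescriptor {
	hasSeparateMuslLinuxPackage: boolean;
}

// Assumed whitelist; the real table lives in the downloader and may differ.
const SUPPORTED_PLATFORMS = new Set(['darwin', 'linux', 'win32']);
const SUPPORTED_ARCHS = new Set(['x64', 'arm64']);

function resolveSdkTarget(pkg: SdkPackageDescriptor, host: ISdkTargetHost): string | undefined {
	if (!SUPPORTED_PLATFORMS.has(host.platform) || !SUPPORTED_ARCHS.has(host.arch)) {
		return undefined; // no SKU for this host (for example armhf)
	}
	const base = `${host.platform}-${host.arch}`;
	const wantsMusl = host.platform === 'linux' && host.libc === 'musl' && pkg.hasSeparateMuslLinuxPackage;
	return wantsMusl ? `${base}-musl` : base;
}

// Claude ships a separate musl package; Codex has a single Linux SKU.
const claude: SdkPackageDescriptor = { hasSeparateMuslLinuxPackage: true };
const codex: SdkPackageDescriptor = { hasSeparateMuslLinuxPackage: false };
```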

Host injection: AgentSdkDownloader gains an optional leading
`ISdkTargetHost` constructor param (per project convention, non-DI
before DI). Production call sites pass `undefined` to derive from
`process`; tests pass synthetic hosts to exercise Universal launches
and musl Linux without touching `process`.

Other simplifications applied from review:
- `_cacheHit` was a one-line passthrough → inlined to
  `_fileService.exists(sentinel)` at both call sites.
- `_pendingDownloads` key now uses `cacheDir` directly (already
  unique per pkg/version/target) instead of allocating a parallel
  `<pkg>/<version>/<target>` key string.
- `.complete` sentinel content is now empty — the file's existence
  is the integrity signal, the cache dir path already encodes
  `<pkg>/<version>/<sdkTarget>` for debugging.
- `detectLibc()` returns `LibcFamily | undefined` on non-Linux
  instead of `'glibc'`-by-convention (drops consumer-specific
  phrasing from the primitive).
- Test's `listLeftovers` recursive walker replaced with a direct
  `readdir` of the known version dir (the only level where scratch
  dirs can land).
- Tests collapse 4 direct `new AgentSdkDownloader(...)` blocks
  through `makeDownloader(null, host)`.
- `IAgentSdkProductConfig` JSDoc trimmed to interface contract;
  rationale lives in roadmap.md Phase 15.
- `_failureLatch` doc explains why per-id (not per-target) granularity
  is intentional.

Tests: 25 passing (3 libc + 3 resolveSdkTarget + 13 downloader +
6 codex paths). New `resolveSdkTarget` suite covers the cross-product
of {claude, codex} × {linux glibc, linux musl, darwin, win32} that
previously lived as scattered table tests.

* AH: drop test-only host injection on AgentSdkDownloader

The previous commit added a constructor param to inject a synthetic
`(platform, arch, libc)` into the downloader so tests could exercise
Universal launches and musl Linux from any CI host. Production passed
`undefined` and the body fell back to a derived host — a test-only
ceremony in production code.

Restructured so the runtime stays clean:

  - `resolveSdkTarget(pkg, host?)` keeps its optional `host` param,
    defaulting to the real process. Cross-host coverage lives in
    dedicated unit tests that call it directly.
  - `AgentSdkDownloader` no longer takes a host. Both call sites
    revert to `createInstance(AgentSdkDownloader)` with no extras.
  - Integration suite `suiteSetup` skips on hosts the downloader
    can't target (e.g. linux-armhf), and pins `hostSdkTarget` for
    path assertions. The "two-host cache key" assertion becomes a
    direct path check on the host's resolved target instead of an
    artificial second-host download.

Tests: 23 passing (3 libc + 3 resolveSdkTarget unit + 11 downloader
integration + 6 codex paths).

* AH: address PR review — validate urlTemplate placeholders + honor backpressure

Two findings from #321078 review:

1. `format2()` silently leaves unknown placeholders untouched, so a
   vscode-distro typo like `{sdkTaret}` would produce a 404 from the
   CDN with no hint at the real cause. Add a `{...}` scan after
   substitution that throws an actionable error pointing at the
   suspect product.json field. Covered by a new test.

2. The hand-rolled `_fetch` pipe ignored `out.write()`'s return value,
   so a slow disk (Windows AV scan, network home dir) could buffer
   the entire 70-95MB tarball in memory. Pause the source stream on
   write-buffer full, resume on drain. Can't use `stream/promises
   .pipeline()` here because `IRequestContext.stream` is a
   `VSBufferReadableStream`, not a Node Readable — the source's own
   `pause()`/`resume()` is what we have to work with.

Cancellation test still passes; backpressure change is transparent to
the cancel teardown.
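
The backpressure fix in item 2 can be sketched as follows. The two interfaces are simplified stand-ins invented for this example, not the real `VSBufferReadableStream` or Node `Writable` types; only the pause-on-full, resume-on-drain pattern is the point.

```typescript
// Simplified stand-in for a pausable source stream (hypothetical shape).
interface PausableSource {
	onData(listener: (chunk: Uint8Array) => void): void;
	onEnd(listener: () => void): void;
	pause(): void;
	resume(): void;
}

// Simplified stand-in for a writable sink shaped like a Node write stream (hypothetical shape).
interface DrainableSink {
	write(chunk: Uint8Array): boolean; // false means the internal buffer is full
	once(event: 'drain', listener: () => void): void;
	end(): void;
}

function pumpWithBackpressure(source: PausableSource, sink: DrainableSink): void {
	source.onData(chunk => {
		if (!sink.write(chunk)) {
			// Stop reading until the sink has flushed, so chunks do not pile up in memory.
			source.pause();
			sink.once('drain', () => source.resume());
		}
	});
	source.onEnd(() => sink.end());
}
```

Error handling and cancellation are left out of the sketch; a real implementation would also need to tear down both ends on failure.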

* AH: per-platform agent SDK build + CDN upload (#7885)

Per-platform VS Code build jobs now produce + upload the Claude and
Codex agent SDK tarballs to main.vscode-cdn.net and stamp the resulting
url/sha into `product.agentSdks` of their own packaged product.json.

The build step (`build/azure-pipelines/common/agent-sdk-produce.yml`)
runs inline in each existing platform job (darwin/linux/win32/alpine),
before the gulp packaging step. It always builds the tarballs. The
AzureCLI credential fetch and the CDN upload are gated on
`VSCODE_PUBLISH=true` — test pipeline runs leave the tarballs as a
pipeline artifact (`agent_sdk_<platform>_<arch>_tarballs`) for
inspection but don't touch the CDN, and ship product.json without
`agentSdks` (same shape as a local dev build).

The REH gulpfile only stamps `agentSdks` for `type === 'reh'`; REH-web
skips it because the agent host is node-only.

* AH: use npm.cmd on Windows in agent SDK build

`spawnSync('npm', ...)` fails on Windows because npm ships as a `.cmd`
shim and Node's child_process doesn't resolve PATHEXT without an
explicit suffix. The Windows pipeline jobs were dying with `exited
null` and no further context.

Also surface `result.error` so a future spawn-resolution failure shows
the actual ENOENT instead of a bare exit-code message.

* AH: pass shell:true when spawning npm.cmd on Windows

Node 20+ (CVE-2024-27980) refuses to spawn `.cmd`/`.bat` files without
`shell: true` and fails with `EINVAL`. The Windows pipeline jobs hit
this after the previous fix swapped `npm` for `npm.cmd`.
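
Taken together, the two Windows fixes amount to roughly the following. This is a sketch rather than the code in the build scripts; `npmInvocation` and `runNpm` are hypothetical names.

```typescript
import { spawnSync } from 'node:child_process';

// Pick the executable name and shell flag for the given platform.
function npmInvocation(platform: string): { command: string; shell: boolean } {
	const isWindows = platform === 'win32';
	// On Windows npm is a .cmd shim, and recent Node versions only spawn .cmd files through a shell.
	return { command: isWindows ? 'npm.cmd' : 'npm', shell: isWindows };
}

function runNpm(args: string[], cwd: string): void {
	const { command, shell } = npmInvocation(process.platform);
	const result = spawnSync(command, args, { cwd, shell, stdio: 'inherit' });
	if (result.error) {
		// Surface spawn failures such as ENOENT or EINVAL instead of a bare exit code.
		throw result.error;
	}
	if (result.status !== 0) {
		throw new Error(`${command} ${args.join(' ')} exited with code ${result.status}`);
	}
}
```

With `shell: true` the arguments are interpreted by the shell, so this pattern is only appropriate for fixed, trusted argument lists such as the ones a build script passes.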

* AH: emit {version, urlTemplate} per the runtime shape change

Stacked on top of the runtime PR (tyler/agent-sdk-url-template). With
the runtime now consuming `{version, urlTemplate}` and substituting
`{sdkTarget}` per launch, the build pipeline emits the matching shape:

  - `IAgentSdkResults[<sdk>]` drops `{url, sha256}` for
    `{version, urlTemplate}`.
  - `produce.ts` still uploads its platform's tarballs (idempotent
    HEAD-then-skip in upload.ts is unchanged), but the results JSON
    every job writes is identical per SDK — only the version differs.
    That's the whole point: macOS Universal can ship one product.json
    that covers both arm64 and x64 launches because the runtime
    resolves {sdkTarget} per launch.
  - New `buildCdnUrlTemplate(sdk, version)` mirrors `buildCdnUrl`'s
    path but leaves `{sdkTarget}` as the format2 placeholder.
  - README updated; TODO.md (the placeholder note left while the
    runtime PR was pending) deleted.

Tarballs at the existing CDN paths (e.g.
`agent-sdk/claude/0.3.168/darwin-arm64.tgz`) remain valid and reachable
— no re-upload needed, just a re-stamp of product.json on the next
publish run.

* AH: address PR review on build/agent-sdk (Copilot)

Five comments from the build PR review:

1. common.ts header named drift-check.ts (deleted during simplification)
   and missed produce.ts. Updated.
2. common.ts "single source of truth is package.json optionalDependencies"
   was aspirational — getSdkTargetForBuild is a hardcoded table. Reframed
   the comment to describe what we actually do (hardcoded table kept in
   lockstep by convention) and why (no runtime npm metadata lookup).
3. package.ts header said the library form is consumed by gulpfile
   packaging tasks — actually called from produce.ts as its own pipeline
   step. Updated.
4. + 5. isCliInvocation() in package.ts and upload.ts compared
   import.meta.url to a manually constructed `file://${process.argv[1]}`,
   which breaks on Windows (drive letters URL-encoded, spaces escaped).
   Repo already established the cleaner `import.meta.filename ===
   process.argv[1]` pattern (see build/npm/installStateHash.ts:143).
   Pure portability fix — only affects the dev-mode CLIs, the production
   pipeline calls these as library functions.
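
To illustrate why the hand-built URL in items 4 and 5 was fragile, the sketch below compares it with the URL Node itself produces for the same path. The helper names are hypothetical, and the example uses a path containing a space; drive letters on Windows diverge in a similar way.

```typescript
import { pathToFileURL } from 'node:url';

// The pattern that was removed: string concatenation with no escaping.
function naiveFileUrl(scriptPath: string): string {
	return `file://${scriptPath}`;
}

// What Node produces for the same path, with percent-encoding applied.
function properFileUrl(scriptPath: string): string {
	return pathToFileURL(scriptPath).href;
}

const scriptPath = '/tmp/my dir/package.ts';
const urlsAgree = naiveFileUrl(scriptPath) === properFileUrl(scriptPath); // false: the space is escaped in one only
```

Comparing `import.meta.filename` with `process.argv[1]`, as the fix does, avoids the URL round trip entirely.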

* AH: per-SDK agents/<sdk>/{package.json,package-lock.json} for byte determinism

Build 447090 surfaced sha drift across pipeline runs: same exact-pinned
SDK version, but transitive deps unlocked (`npm install
--no-package-lock`) → different bytes → CDN HEAD-then-fail rejected
the re-upload. Determinism is load-bearing for the security model
(content-addressed CDN URLs, no runtime sha verification — the bytes
at a given URL must be stable).

Fix: ship pinned lockfiles per SDK and use `npm ci`. New layout:

  build/agent-sdk/agents/
    claude/
      package.json       # exact one dep: @anthropic-ai/claude-agent-sdk@0.3.168
      package-lock.json  # full transitive graph
    codex/
      package.json       # exact one dep: @openai/codex@0.134.0
      package-lock.json

Bonus: the folder set IS the SDK list. Drops the hardcoded
`SDKS: readonly Sdk[]` and `PACKAGE_NAME: Record<Sdk, string>` from
common.ts; replaced with `getSdks()` (discovers from `agents/`) and
`getAgentMeta(sdk)` (reads from the agent's own package.json). Adding
a new SDK is now one folder + lockfile gen + commit.

Verified reproducible locally: two back-to-back runs of `package.ts
--sdk=codex --target=darwin-arm64` produce byte-identical tarballs.

NOTE: existing CDN blobs from build 446990 carry the old drifted
shas. The next publish will fail HEAD-then-skip against them. Need to
delete `agent-sdk/{claude,codex}/{0.3.168,0.134.0}/*.tgz` from the
vscodeweb storage account's $web container before re-publishing, or
the upload step will refuse with "blob already present with DIFFERENT
sha256".

* AH: bump pinned SDK versions to sidestep stale CDN blobs

claude 0.3.168 → 0.3.169 (one point release; 0.3.170/172/173/174/175
all exist upstream, sticking to the next bump for risk minimisation).
codex 0.134.0 → 0.135.0 (next stable; 0.135-0.139 are all stable
releases, picking the immediate successor).

Bumping versions changes the CDN URL path (`agent-sdk/<sdk>/<version>/...`)
so the next publish lands at fresh, never-uploaded blob URLs. Avoids
having to delete the drift-shaped blobs from build 446990 that would
otherwise trip HEAD-then-fail.

Bumped both the per-SDK `agents/<sdk>/package.json` (the build's pin)
and repo-root `package.json` devDeps (the runtime's type-import pin)
in lockstep, with all four lockfiles regenerated. Local reproducibility
re-verified: two back-to-back runs of `package.ts --sdk=codex
--target=darwin-arm64` produce byte-identical tarballs at the new pin.

Runtime typecheck clean — no API changes to either SDK in these point
releases.

* AH: stub usage_EXPERIMENTAL on test Query fakes (SDK 0.3.169)

Claude SDK 0.3.169 added `usage_EXPERIMENTAL_MAY_CHANGE_DO_NOT_RELY_ON_THIS_API_YET`
as a required method on `Query`. Three test files implement the
interface as fakes (FakeQuery, ImmediatelyDoneQuery, RoundTripQuery)
and broke the type-check on tsgo.

Stubbed each as `throw 'not modeled'` matching the existing pattern
for every other method these fakes don't exercise. The field name
makes it clear the SDK doesn't expect anyone to rely on it yet, so a
"not modeled" stub is honest.

* AH: authenticate npmrc before agent SDK `npm ci`

Build 447232 hit E401 from the private npm mirror: the platform job's
existing "Setup NPM Authentication" step is gated on the node_modules
cache being a miss (it lives in the cache-warming path), so on a cache
hit the user's ~/.npmrc has no auth token, and our agent-sdk `npm ci`
inherits the global registry override + missing auth → E401.

Fix: add an always-on auth step at the top of agent-sdk-produce.yml.
Captures the user's npmrc path, runs npmAuthenticate@0 against it. Now
runs independent of the node_modules cache state.

The previous npm install --no-package-lock path tolerated this because
it fell back to anonymous resolution against npmjs.org. `npm ci`
strictly resolves through the configured (private) registry, which
needs auth. The lockfile + private mirror combination is what we want
for supply-chain auditing — the fix is to make sure auth is set up
unconditionally rather than bypass the mirror.

Also reverts a brief stop-along-the-way that added
`--registry=https://registry.npmjs.org/` to the npm ci call — wrong
direction (would bypass the supply-chain mirror).

* AH: create ~/.npmrc with `npm config set` before authenticating

npmAuthenticate@0 errored on cache-hit runs: the .npmrc path returned
by `npm config get userconfig` is just where npm WOULD write — the
file doesn't exist until something actually writes to it. The platform
job's "Setup NPM" step creates it via `npm config set registry`, but
is skipped on cache hits.

Mirror that pattern in our prep step: run `npm config set registry`
ourselves (idempotent — rewrites the same value the existing config
already has on cache misses) so npmAuthenticate@0 has a real file to
edit.

* AH: move agent SDK step ahead of Download Copilot VSIX

Was: install-builtin → VSIX-background → Compile → … → VSIX-attach → agent-sdk → Build client.
Now: install-builtin → agent-sdk → VSIX-background → Compile → … → VSIX-attach → Build client.

No data dependency between the agent SDK step and the VSIX download
(or Compile, for that matter — agent SDK uses its own scratch dir,
its own npmrc, doesn't read node_modules or anything from out-build).

Benefit: fail-fast. The agent SDK step previously ran after Compile +
both VSIX wait points, so a CDN auth failure or a sha-mismatch
would only surface ~10 minutes into the job. Moving it earlier
catches those failures in seconds, before any heavy work runs.

Applied consistently across darwin/linux/win32/alpine (linux still
gated on `ne(VSCODE_ARCH, 'armhf')`).
2026-06-12 16:03:30 -04:00

build/agent-sdk

Per-platform agent SDK production. Each VS Code build (darwin-arm64, linux-x64, Alpine REH, etc.) uploads its own platform's SDK tarballs to main.vscode-cdn.net and stamps agentSdks into the shipped product.json with a {version, urlTemplate} per SDK. Every platform job emits the same urlTemplate per SDK — the runtime substitutes {sdkTarget} per launch via resolveSdkTarget(), which is what lets macOS Universal bundles share one product.json across arm64 + x64.

The runtime side (src/vs/platform/agentHost/) downloads and caches the SDK tarball at first use. See IAgentSdkProductConfig in src/vs/base/common/product.ts for the contract.
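
A minimal sketch of the per-launch substitution, including the leftover-placeholder check added during review. `expandUrlTemplate` is a hypothetical name, the host in the example URL is a placeholder, and the real runtime uses its own formatting helper rather than a regex replace.

```typescript
function expandUrlTemplate(urlTemplate: string, sdkTarget: string): string {
	const url = urlTemplate.replace(/\{sdkTarget\}/g, sdkTarget);
	// Any remaining {...} is most likely a typo in product.json, so fail with a pointer to it.
	const leftover = url.match(/\{[^{}]*\}/);
	if (leftover) {
		throw new Error(`Unresolved placeholder ${leftover[0]} in urlTemplate "${urlTemplate}"; check product.agentSdks in product.json`);
	}
	return url;
}

// Example with a placeholder host; only the path shape follows the CDN layout.
const exampleUrl = expandUrlTemplate('https://cdn.example.invalid/agent-sdk/claude/0.3.169/{sdkTarget}.tgz', 'darwin-arm64');
```

The same template yields a different URL for an x64 launch of the same bundle, which is the property the Universal build depends on.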

How the pipeline uses this

The platform packaging jobs (Linux, macOS, Windows, Alpine) each include the shared template build/azure-pipelines/common/agent-sdk-produce.yml before the existing gulp vscode-<platform>-<arch>-min-ci step:

- template: ../../common/agent-sdk-produce.yml@self
  parameters:
    vscodePlatform: linux

The template runs node build/agent-sdk/produce.ts --vscode-platform=<x> --arch=$(VSCODE_ARCH), which iterates the SDKs discovered by getSdks() (currently claude and codex), figures out the matching sdkTarget for (vscode-platform, arch, sdk) via getSdkTargetForBuild, runs buildOne for each in parallel, and drops the tarballs in $(Build.SourcesDirectory)/.build/agent-sdk/tarballs/.

Publish vs test runs

produce.ts reads the pipeline variable VSCODE_PUBLISH from env (Azure auto-injects all non-secret pipeline variables) to decide whether to hit the CDN:

  • VSCODE_PUBLISH=true (real release builds) — the AzureCLI@2 step inside the template fetches CDN credentials, produce.ts calls uploadOne for every tarball (HEAD-then-decide idempotent), writes the results JSON, and emits ##vso[task.setvariable variable=AGENT_SDK_RESULTS_FILE]<path>. The downstream gulp packaging step then stamps product.agentSdks via readAgentSdkResults().

  • VSCODE_PUBLISH unset or not 'true' (PR runs, CI runs, manual test runs with the publish toggle off) — the AzureCLI credential step is skipped, the upload is skipped, no results file is written, and task.setvariable is not emitted. The tarballs are still produced and published as a pipeline artifact named agent_sdk_<vscodePlatform>_<arch>_tarballs so you can download and inspect them. product.json ships without agentSdks — same shape as a local dev build, so the runtime falls back to the per-provider env-var override.
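
The two modes above reduce to a strict string comparison on one environment variable. A sketch, with a hypothetical `planForRun` helper and plan shape:

```typescript
interface ProducePlan {
	buildTarballs: boolean;
	upload: boolean;
	writeResultsFile: boolean;
}

// Hypothetical helper: only the literal string 'true' enables publishing.
function planForRun(env: Record<string, string | undefined>): ProducePlan {
	const publish = env['VSCODE_PUBLISH'] === 'true';
	return {
		buildTarballs: true, // tarballs are always built
		upload: publish,
		writeResultsFile: publish,
	};
}
```

Values such as `True`, `1`, or an empty string all fall into the test-run branch under this reading.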

Where the agentSdks gating lives

Inside packageTask's jsonEditor callback (the same one that injects commit / date / checksums / version), readAgentSdkResults() loads the results file (returns {} when the env var is unset) and merges agentSdks into product.json. The REH gulpfile only writes agentSdks for type === 'reh'; the REH-web variant skips it because the agent host is node-only and the SDK config has no consumer in a browser-served server.

Local gulp vscode-darwin-arm64 invocations don't set AGENT_SDK_RESULTS_FILE and don't have VSCODE_PUBLISH=true, so readAgentSdkResults() returns {} and product.json ships without agentSdks — same UX as today's no-config build.
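
A sketch of the reader and the merge it feeds, assuming the results file is plain JSON keyed by SDK id. The `stampAgentSdks` helper is hypothetical, and omitting the key when there are no results is an assumption drawn from the behaviour described above.

```typescript
import * as fs from 'node:fs';

interface AgentSdkEntry {
	version: string;
	urlTemplate: string;
}
type AgentSdkResults = Record<string, AgentSdkEntry>;

function readAgentSdkResults(env: Record<string, string | undefined> = process.env): AgentSdkResults {
	const file = env['AGENT_SDK_RESULTS_FILE'];
	if (!file) {
		return {}; // local builds and non-publish runs have no results file
	}
	return JSON.parse(fs.readFileSync(file, 'utf8')) as AgentSdkResults;
}

// Hypothetical merge step: leave product.json untouched when there is nothing to stamp.
function stampAgentSdks(product: Record<string, unknown>, results: AgentSdkResults): Record<string, unknown> {
	return Object.keys(results).length > 0 ? { ...product, agentSdks: results } : product;
}
```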

Why two steps, not inline-in-gulp

The agent SDK work is a distinct concern from the VS Code packaging gulp graph. As its own pipeline step:

  • Visible in the build log: operators see a discrete "Agent SDK: build + upload" step they can click into instead of grepping inside "Build client" output.
  • Independently re-triggerable — if the SDK step fails, the operator can re-run just the platform job; if it succeeds but the gulp step fails, the SDK upload is already idempotent (HEAD-then-skip).
  • Doesn't add async-stream complexity to the gulpfile. packageTask stays a sync stream-returning function; the only change is one synchronous readAgentSdkResults() call inside the existing jsonEditor callback.

Files

  • agents/<sdk>/ — one folder per SDK we ship. Each contains a package.json (single dependency: the SDK's own npm package, pinned to an exact version) and a package-lock.json (full transitive graph). Folder name = SDK id = key under product.agentSdks = path segment in the CDN URL. The set of folders IS the SDK list — no parallel array to keep in sync.
  • common.ts — types, getSdks() (discovers SDKs from agents/), getAgentMeta() / getSdkVersion() (reads from agents/<sdk>/package.json, rejects ^/~ ranges), getSdkTargetForBuild() ((vscodePlatform, arch, sdk) → npm-suffix), buildCdnUrl() / buildCdnUrlTemplate(), sha256OfFile(), parseFlags() for CLI flag parsing, and readAgentSdkResults() for the gulpfile-side reader.
  • package.ts - buildOne({ sdk, sdkTarget, outDir }). Runs on any OS: copies agents/<sdk>/{package.json,package-lock.json} into a scratch dir, npm ci with npm_config_libc/os/cpu fetches the foreign platform binary verbatim from the locked graph, then node-tar+gzip with reproducible flags. Has a thin CLI at bottom.
  • upload.ts - uploadOne(...). HEAD-then-decide: absent → upload; matching sha → skip (idempotent re-runs); different / no-metadata sha → fail loud, refusing to overwrite content-addressed history. Thin CLI.
  • produce.ts — pipeline-step entry. For one (vscode-platform, arch), iterates the SDKs in parallel, calls buildOne + uploadOne for each that applies, writes results to AGENT_SDK_RESULTS_FILE, and emits ##vso[task.setvariable] so downstream pipeline steps see the path.
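
The HEAD-then-decide rule in upload.ts can be sketched as a pure decision function. The `RemoteBlob` shape and the `decideUpload` name are assumptions for illustration; the real uploadOne also performs the HEAD request and the upload itself.

```typescript
type UploadDecision = 'upload' | 'skip';

// What a HEAD request is assumed to tell us about an existing blob (hypothetical shape).
interface RemoteBlob {
	sha256?: string; // undefined when the blob carries no sha metadata
}

function decideUpload(remote: RemoteBlob | undefined, localSha256: string): UploadDecision {
	if (remote === undefined) {
		return 'upload'; // nothing at this URL yet
	}
	if (remote.sha256 === localSha256) {
		return 'skip'; // idempotent re-run: same bytes are already published
	}
	// Different or missing sha: never overwrite bytes that clients may already have fetched.
	throw new Error(`blob already present with DIFFERENT sha256 (remote: ${remote.sha256 ?? 'none'}, local: ${localSha256})`);
}
```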

Bumping an SDK version

  1. Edit the dependencies version in build/agent-sdk/agents/<sdk>/package.json to the new exact version.
  2. From that directory: npm install --package-lock-only --ignore-scripts to refresh package-lock.json.
  3. Also bump the matching devDependencies entry in repo-root package.json (the runtime imports types from that copy) so the shipped types and the build-time pin stay in lockstep.
  4. npm install at repo root to refresh the root lockfile.
  5. Commit all four edits together.

The next pipeline run rebuilds + uploads each platform tarball at the new content-addressed CDN path and re-stamps each product.json with the new urlTemplate pointing at the bumped version.

No human-paste step into vscode-distro. No coordination between jobs.
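
Step 1 depends on the pin being an exact version. A sketch of the kind of check getSdkVersion() is described as doing; `getPinnedDependency` is a hypothetical name, and the regex here is stricter than rejecting only ^ and ~ ranges, which is an assumption of the sketch.

```typescript
interface AgentPackageJson {
	dependencies?: Record<string, string>;
}

function getPinnedDependency(pkg: AgentPackageJson): { name: string; version: string } {
	const entries = Object.entries(pkg.dependencies ?? {});
	if (entries.length !== 1) {
		throw new Error(`expected exactly one dependency, found ${entries.length}`);
	}
	const [name, version] = entries[0];
	// Accept only x.y.z with an optional prerelease tag; ranges such as ^1.2.3 or ~1.2.3 are rejected.
	if (!/^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?$/.test(version)) {
		throw new Error(`${name} must be pinned to an exact version, got "${version}"`);
	}
	return { name, version };
}
```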

Local dev

Build one tarball locally:

node build/agent-sdk/package.ts --sdk=claude --target=darwin-arm64 --out=/tmp/out

For OSS contributors who want to drive the agent host without going through the CDN, point the dev override env vars at a local SDK install:

VSCODE_AGENT_HOST_CLAUDE_SDK_ROOT=/path/to/anthropic-claude-sdk-install \
  ./scripts/code.sh

(See src/vs/platform/agentHost/common/agentService.ts for env var names.)