Ulugbek Abdullaev 02629fd4b7 nes-datagen: stream multi-GB inputs/outputs and switch output to JSON Lines (#320089)
* nes-datagen: stream large JSON array inputs to avoid 2 GiB readFile limit

Reading the whole input via fs.readFile fails for files larger than 2 GiB,
and content of that size would also exceed V8's max string length. Add a
streaming JSON-array parser and
use it in both the sequential and parallel pipeline paths so multi-GB
recordings can be processed with bounded memory.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* nes-datagen: also accept JSON Lines (NDJSON) input

Auto-detect the input format from the first non-whitespace character: a
leading '[' is parsed as a single JSON array, otherwise the file is parsed
as JSON Lines (one JSON object per line). Both formats are streamed so
multi-GB inputs work regardless of shape. Rename streamJsonArray ->
streamJsonRecords to reflect the broader purpose.
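
For illustration, the JSON Lines side of such a reader can be sketched as
below. This is a minimal sketch, not the code from this change:
parseJsonLines and parseLine are hypothetical names, and the input is
modelled as a synchronous iterable of string chunks, whereas a real
reader would consume an asynchronous file stream. The point it shows is
that a line split across two chunks has to be carried over rather than
parsed early.

```typescript
// Hypothetical helper (not from the commit): yields one parsed record per
// non-empty line, carrying an incomplete trailing line into the next chunk.
function* parseJsonLines<T>(chunks: Iterable<string>): Generator<T> {
	let carry = '';
	let lineNo = 0;
	for (const chunk of chunks) {
		const lines = (carry + chunk).split('\n');
		// The last piece may be an unfinished line; keep it for the next chunk.
		carry = lines.pop() ?? '';
		for (const line of lines) {
			lineNo++;
			const trimmed = line.trim(); // also strips a trailing '\r'
			if (trimmed.length > 0) {
				yield parseLine<T>(trimmed, lineNo);
			}
		}
	}
	// A final line without a trailing newline is still a record.
	lineNo++;
	const last = carry.trim();
	if (last.length > 0) {
		yield parseLine<T>(last, lineNo);
	}
}

// Hypothetical helper: wraps JSON.parse so errors name the offending line.
function parseLine<T>(line: string, lineNo: number): T {
	try {
		return JSON.parse(line) as T;
	} catch (err) {
		throw new Error(`Invalid JSON on line ${lineNo}: ${(err as Error).message}`);
	}
}
```

Only one line is held in memory at a time, so memory use is bounded by
the longest record rather than by the file size.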

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* nes-datagen: infer JSON vs JSON Lines input format from file extension

Use the file extension (.jsonl/.ndjson -> JSON Lines, otherwise JSON array)
to select the streaming parser instead of sniffing the content.
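
The extension rule above could look roughly like the sketch below.
inferInputFormat and the InputFormat type are hypothetical names, and
matching the extension case-insensitively is an assumption that the
commit does not state.

```typescript
// Hypothetical names; only the .jsonl/.ndjson rule comes from the commit.
type InputFormat = 'json-array' | 'json-lines';

function inferInputFormat(filePath: string): InputFormat {
	// Assumption: extensions are compared case-insensitively.
	const lower = filePath.toLowerCase();
	if (lower.endsWith('.jsonl') || lower.endsWith('.ndjson')) {
		return 'json-lines';
	}
	// Everything else, including .json, is treated as a single JSON array.
	return 'json-array';
}
```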

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* nes-datagen: validate streamed JSON arrays for truncation and malformed input

The new streaming parser previously accepted any prefix of a JSON array
silently: a truncated file (no closing ']'), a missing element between
commas, a trailing comma, or trailing data after the array all produced
zero records or an incomplete set of records rather than an error. That
is especially dangerous for the multi-GB inputs this parser was
introduced for, because the underlying file is much more likely to be
incomplete.

Tighten the state machine to surface these as errors, matching what the
old whole-file JSON.parse would have done, and add tests for each case.
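
A state machine with these checks might be sketched as follows. This is
not the parser from this change: JsonArraySplitter is a hypothetical
name, elements are assumed to be JSON objects, and the character-by-
character buffering favours clarity over throughput. It feeds on string
chunks, emits each completed element, and rejects the four cases listed
above.

```typescript
type State = 'start' | 'first' | 'value' | 'inValue' | 'afterValue' | 'done';

// Hypothetical sketch: splits a chunked JSON array of objects into elements.
class JsonArraySplitter {
	private state: State = 'start';
	private buf = '';
	private depth = 0;
	private inString = false;
	private escaped = false;

	// Consumes one chunk and returns the elements completed within it.
	push(chunk: string): unknown[] {
		const out: unknown[] = [];
		for (let i = 0; i < chunk.length; i++) {
			const ch = chunk.charAt(i);
			if (this.state === 'inValue') {
				this.buf += ch;
				if (this.inString) {
					// Brackets inside strings must not affect nesting depth.
					if (this.escaped) { this.escaped = false; }
					else if (ch === '\\') { this.escaped = true; }
					else if (ch === '"') { this.inString = false; }
				} else if (ch === '"') {
					this.inString = true;
				} else if (ch === '{' || ch === '[') {
					this.depth++;
				} else if (ch === '}' || ch === ']') {
					if (--this.depth === 0) {
						out.push(JSON.parse(this.buf)); // throws on a malformed element
						this.buf = '';
						this.state = 'afterValue';
					}
				}
				continue;
			}
			if (ch === ' ' || ch === '\n' || ch === '\r' || ch === '\t') { continue; }
			switch (this.state) {
				case 'start':
					if (ch !== '[') { throw new Error(`Expected '[', got '${ch}'`); }
					this.state = 'first';
					break;
				case 'first':
				case 'value':
					if (ch === '{') {
						this.buf = ch;
						this.depth = 1;
						this.state = 'inValue';
					} else if (ch === ']' && this.state === 'first') {
						this.state = 'done'; // empty array
					} else {
						// Covers a missing element (',,') and a trailing comma (',]').
						throw new Error(`Expected an object, got '${ch}'`);
					}
					break;
				case 'afterValue':
					if (ch === ',') { this.state = 'value'; }
					else if (ch === ']') { this.state = 'done'; }
					else { throw new Error(`Expected ',' or ']', got '${ch}'`); }
					break;
				case 'done':
					throw new Error(`Unexpected data after closing ']': '${ch}'`);
			}
		}
		return out;
	}

	// Must be called at end of input; detects truncation.
	end(): void {
		if (this.state !== 'done') {
			throw new Error('Truncated JSON array: missing closing \']\'');
		}
	}
}
```

Calling end() is what turns a silently accepted prefix into an error:
without it, a file that stops after a complete element looks the same as
a file that simply has more data coming.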

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* nes-datagen: stream worker-result merging and final output write

For multi-GB inputs the parent process was still hitting V8's ~512 MiB
max-string-length limit in two places after the input-side fix:

1. Merging worker result files used fs.promises.readFile + JSON.parse on
   each result, but with a 5+ GB input split N ways each per-worker file
   is hundreds of MB of similar-shaped data and easily exceeds the
   string limit.
2. writeSamples serialized the entire validSamples array via a single
   JSON.stringify(arr, null, 2) before writing, which has the same
   problem on output.

Switch both to stream over individual records:

- A new shared openWriteStream(filePath) helper wraps fs.createWriteStream,
  attaches an 'error' listener immediately (so async write failures don't
  surface as uncaughtException and skip cleanup), awaits backpressure via
  the per-write callback, and exposes an idempotent close().
- writeChunkFiles uses the helper inside a try/finally so any mid-stream
  ENOSPC/EIO bubbles up cleanly and the tmp dir is still removed.
- The merge step now uses streamJsonRecords<ISample>(resultPath), so the
  parent never materializes a single worker's output as one string.
- writeSamples emits the output JSON array incrementally: per-sample
  JSON.stringify(..., null, 2) (indented two spaces to match the previous
  layout) joined with ',\n'. Byte size is accumulated for the existing
  IWriteResult.fileSize.
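
A helper with the properties listed in the first bullet might look like
the sketch below. It is written from the description only and is not the
code from this change: the IRecordWriter interface and the internal
bookkeeping are assumptions, while fs.createWriteStream, the 'error' and
'close' events, and the per-write callback are standard Node.js APIs.

```typescript
import * as fs from 'node:fs';

// Hypothetical return shape; the commit only names write-with-backpressure
// and an idempotent close().
interface IRecordWriter {
	write(chunk: string): Promise<void>;
	close(): Promise<void>;
}

function openWriteStream(filePath: string): IRecordWriter {
	const stream = fs.createWriteStream(filePath, { encoding: 'utf8' });
	let failure: Error | undefined;
	// Attach the listener right away so a late asynchronous failure is
	// recorded here instead of surfacing as an uncaught exception.
	stream.on('error', err => { failure = failure ?? err; });
	let closing: Promise<void> | undefined;

	return {
		write(chunk: string): Promise<void> {
			return new Promise<void>((resolve, reject) => {
				if (failure) { reject(failure); return; }
				// The callback fires once the chunk has been flushed or has
				// failed, so awaiting it applies backpressure to the caller.
				stream.write(chunk, err => {
					if (err) {
						failure = failure ?? err;
						reject(err);
					} else {
						resolve();
					}
				});
			});
		},
		close(): Promise<void> {
			// Idempotent: repeated calls share the same promise.
			if (!closing) {
				closing = new Promise<void>((resolve, reject) => {
					const settle = () => (failure ? reject(failure) : resolve());
					if (stream.destroyed) { settle(); return; }
					stream.once('close', settle);
					stream.end();
				});
			}
			return closing;
		},
	};
}
```

A caller would typically await every write() inside a try block and call
close() in the matching finally, so that a mid-stream failure still
reaches the cleanup code.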

Also documents that single-process loadAndParseInput still buffers the
full row set in memory and that --parallelism is required for very large
inputs (workers each only load their slice).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* nes-datagen: switch output to JSON Lines

Both the final user-facing output and the per-worker intermediate result
files now use JSON Lines (one record per line) instead of a
pretty-printed JSON array. JSONL is dramatically simpler to write and
read incrementally: no surrounding brackets/commas to track, no
multi-line per-element indentation, just JSON.stringify + '\n' per
record on the write side and split-on-newline + JSON.parse per non-empty
line on the read side (this is what streamJsonRecords already does when
it detects the .jsonl extension).

Changes:
- writeSamples emits one JSON.stringify(sample) + '\n' per validated
  sample via openWriteStream: no array wrapper, no pretty-printing.
- resolveOutputPath defaults the implicit output to <input>_output.jsonl
  (was <input>_output.json).
- Per-worker result files in runInputPipelineParallel are now
  result_${w}.jsonl, so the merge step's streamJsonRecords auto-picks
  the JSONL parser from the extension.
- E2e tests updated to read JSONL (split on newline, JSON.parse per
  line) and to use .jsonl output paths.
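
The write and read sides described above reduce to something like the
following sketch. writeJsonLines, readJsonLines and the Sink type are
hypothetical names; in the tool the sink role is played by a file
stream, and the reader here works on an in-memory string only to keep
the example short.

```typescript
// Hypothetical sink type standing in for a file stream's write method.
type Sink = (chunk: string) => void;

// Write side: one JSON.stringify + '\n' per record. Returns the record count.
function writeJsonLines<T>(records: Iterable<T>, sink: Sink): number {
	let count = 0;
	for (const record of records) {
		// Without an indent argument JSON.stringify emits no raw newlines
		// (they are escaped inside strings), so each record stays on one line.
		sink(JSON.stringify(record) + '\n');
		count++;
	}
	return count;
}

// Read side: split on newline, JSON.parse each non-empty line.
function readJsonLines<T>(text: string): T[] {
	return text
		.split('\n')
		.filter(line => line.trim().length > 0)
		.map(line => JSON.parse(line) as T);
}
```

Because no record depends on its neighbours, an output file can be
appended to, split, or concatenated without re-parsing it as a whole.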

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* nes-datagen: surface worker-result parse errors and update --out help text

Two review follow-ups:

1. The merge step that streams each worker's result file used to wrap
   the iteration in a try/catch that downgraded any parse error to a
   console.error warning. With the new streaming reader that is unsafe:
   streamJsonRecords yields N valid records first and then throws on a
   malformed/truncated tail, leaving an incomplete set of N records
   already in allSamples. The pipeline would then quietly emit a
   truncated training-data output. Drop the swallowing try/catch so a
   corrupt worker result aborts the run with a non-zero exit code.

2. The --out help text in simulationOptions.ts still advertised the old
   JSON-array default (<input>_output.json). Update it to reflect the
   JSONL output, and also note in --input that the format is inferred
   from the .jsonl/.ndjson extension.
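
The hazard in item 1 can be reproduced with a small sketch. Everything
here is illustrative: ISample is reduced to a placeholder shape,
readCorruptResult stands in for a streaming reader over a damaged worker
file, and the two merge functions are hypothetical, not the pipeline's
own code.

```typescript
// Placeholder shape; the real ISample has other fields.
interface ISample { id: number }

// Stand-in for a streaming reader: yields valid records, then hits a bad tail.
function* readCorruptResult(): Generator<ISample> {
	yield { id: 1 };
	yield { id: 2 };
	throw new Error('Truncated worker result');
}

// Old behaviour: the error becomes a warning, but the records yielded
// before it have already been pushed.
function mergeSwallowing(results: Array<() => Iterable<ISample>>): ISample[] {
	const allSamples: ISample[] = [];
	for (const open of results) {
		try {
			for (const sample of open()) {
				allSamples.push(sample);
			}
		} catch (err) {
			console.error(`warning: ${(err as Error).message}`);
		}
	}
	return allSamples;
}

// New behaviour: no try/catch, so a corrupt result aborts the merge.
function mergeStrict(results: Array<() => Iterable<ISample>>): ISample[] {
	const allSamples: ISample[] = [];
	for (const open of results) {
		for (const sample of open()) {
			allSamples.push(sample);
		}
	}
	return allSamples;
}
```

With the swallowing variant the caller receives two samples and no
indication that more were expected; the strict variant propagates the
error so the caller can exit with a failure status.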

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-05 12:51:10 +00:00