# LLM Codex Readiness Integration Test

Run the Codex Readiness integration test. Use when you need an end-to-end agentic loop with build/test scoring.

This skill runs a multi-stage integration test to validate agentic execution quality. It always runs in execute mode (there is no read-only mode).
## Outputs

Each run writes to `.codex-readiness-integration-test/<timestamp>/` and updates `.codex-readiness-integration-test/latest.json`.
New outputs per run:
- `agentic_summary.json` and `logs/agentic.log` (agentic loop execution)
- `llm_results.json` (automatic LLM evaluation)
- `summary.txt` (human-readable summary)
## Pre-conditions (Required)
- Authenticate with the Codex CLI using the repo-local HOME before running the test.
- Run these commands in your own terminal (not via the integration test):

  ```shell
  HOME=$PWD/.codex-home XDG_CACHE_HOME=$PWD/.codex-home/.cache codex login
  HOME=$PWD/.codex-home XDG_CACHE_HOME=$PWD/.codex-home/.cache codex login status
  ```

- The integration test creates `{repo_root}/.codex-home` and `{repo_root}/.codex-home/.cache/codex` as its first step.
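A pre-flight check for this pre-condition can be scripted. The sketch below is a hypothetical helper (not part of the skill) that only verifies the repo-local HOME exists and that `codex login status` exits 0 under it:

```python
import os
import subprocess
from pathlib import Path


def codex_auth_ok(repo_root: str, runner=subprocess.run) -> bool:
    """Hypothetical pre-flight check for repo-local Codex auth.

    Returns False if the repo-local HOME is missing, the `codex` binary
    cannot be found, or `codex login status` exits non-zero.
    """
    home = Path(repo_root) / ".codex-home"
    if not home.is_dir():
        return False
    # Point HOME and XDG_CACHE_HOME at the repo-local locations.
    env = dict(os.environ, HOME=str(home), XDG_CACHE_HOME=str(home / ".cache"))
    try:
        result = runner(["codex", "login", "status"], env=env, capture_output=True)
    except FileNotFoundError:
        return False
    return result.returncode == 0
```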
## Workflow
- Ask the user how to source the task.
- Offer two explicit options: (a) user provides a custom task/prompt, or (b) auto-generate a task.
- Do not run the entry point until the user chooses one option.
- Generate or load `{out_dir}/prompt.pending.json`.
  - Use the integration test's expected prompt path, not `prompt.json` at the repo root.
  - With the default out dir, this path is `.codex-readiness-integration-test/prompt.pending.json`.
  - If `--seed-task` is provided, it is used as the starting task.
  - If it is not provided, generate a task with `skills/codex-readiness-integration-test/references/generate_prompt.md` and save the JSON to `{out_dir}/prompt.pending.json`.
  - The user must approve the prompt before execution (there is no auto-approve mode). Output a summary of the prompt when asking the user to approve.
- Execute the agentic loop via the Codex CLI (uses `AGENTS.md` and `change_prompt`).
- Run build/test commands from the prompt plan via `skills/codex-readiness-integration-test/scripts/run_plan.py`.
- Collect evidence (`evidence.json`), run deterministic checks, and run automatic LLM evals via the Codex CLI.
- Score and write the report plus summary output.
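The approval-then-execute portion of the steps above can be sketched as a small driver. Everything here is illustrative: the `change_prompt` field name and the `codex exec` invocation are assumptions, and the real skill adds flags, logging, evidence collection, and scoring:

```python
import json
import subprocess
from pathlib import Path


def run_workflow(out_dir: Path, approved: bool, runner=subprocess.run) -> int:
    """Illustrative driver for the workflow above.

    Loads the pending prompt, refuses to proceed without explicit user
    approval, then hands the change prompt to the Codex CLI.
    """
    prompt = json.loads((out_dir / "prompt.pending.json").read_text())
    if not approved:
        # No auto-approve mode: stop before any execution.
        return 1
    result = runner(["codex", "exec", prompt["change_prompt"]])
    return result.returncode
```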
## Configuration

Optional fields in `{out_dir}/prompt.pending.json`:

- `agentic_loop`: configures the Codex CLI invocation for the agentic loop.
- `llm_eval`: configures the Codex CLI invocation for automatic evals.
If these fields are omitted, defaults are used.
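One way to apply those defaults is a shallow merge. A sketch under the assumption that each field is a JSON object; only the top-level field names come from this document, and the inner `command` key is purely illustrative:

```python
import json
from pathlib import Path

# Only the top-level field names come from the docs; the inner
# "command" key is an illustrative assumption.
DEFAULTS = {
    "agentic_loop": {"command": ["codex"]},
    "llm_eval": {"command": ["codex"]},
}


def load_invocation_config(prompt_path: Path) -> dict:
    """Shallow-merge optional agentic_loop / llm_eval fields over defaults."""
    prompt = json.loads(prompt_path.read_text())
    return {field: {**default, **prompt.get(field, {})}
            for field, default in DEFAULTS.items()}
```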
## Requirements

- The LLM evaluator must fail if the evidence mentions the phrase `Context compaction enabled`.
- Use qualitative context-usage evaluation (no strict thresholds).
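The phrase check itself is deterministic and easy to mirror outside the LLM evaluator. A minimal sketch (case-sensitive matching is an assumption):

```python
BANNED_PHRASE = "Context compaction enabled"


def evidence_fails_eval(evidence_text: str) -> bool:
    """True when the evidence contains the exact banned phrase,
    mirroring the evaluator rule above (case-sensitive by assumption)."""
    return BANNED_PHRASE in evidence_text
```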
## What this test covers well
- Runs Codex CLI against the real repo root, producing real filesystem edits and git diffs.
- Executes the approved change prompt and then runs the build/test plan in-repo.
- Captures evidence, deterministic checks, and LLM eval artifacts for review.
## What this test does not represent
- The agentic loop may use non-default flags (e.g., bypass approvals/sandbox), so interactive guardrails differ.
- Uses a dedicated HOME (`.codex-home`), which can change auth/config/cache behavior versus normal CLI use.
- Auto-generated prompts and one-shot execution do not simulate interactive guidance.
- MCP servers/tools are not exercised unless explicitly configured.
## Notes

- The prompts in `skills/codex-readiness-integration-test/references/` expect strict JSON.
- Use `skills/codex-readiness-integration-test/references/json_fix.md` to repair invalid JSON output.
- This skill calls the `codex` CLI. Ensure it is installed and available on `PATH`, or override the command in `{out_dir}/prompt.pending.json`.
- If the agentic loop detects sandbox-blocked tool access, it writes `requires_escalation: true` to `{run_dir}/agentic_summary.json` and exits with code `3`. Re-run the integration test with escalated permissions in that case.
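Callers can detect the escalation case from both signals described above. A hedged sketch; the summary schema beyond the `requires_escalation` key is assumed:

```python
import json
from pathlib import Path

ESCALATION_EXIT_CODE = 3


def needs_escalation(run_dir: Path, exit_code: int) -> bool:
    """True when both escalation signals are present: exit code 3 and
    `requires_escalation: true` in agentic_summary.json."""
    if exit_code != ESCALATION_EXIT_CODE:
        return False
    summary_path = run_dir / "agentic_summary.json"
    if not summary_path.is_file():
        return False
    summary = json.loads(summary_path.read_text())
    return summary.get("requires_escalation") is True
```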