Enhance local AI review with full-file context, delta-diff re-review, and cost visibility #232
Conversation
… and cost visibility

Add 9 improvements to /ai-review-local:
- full changed file contents (standard context)
- import-graph expansion (deep context)
- configurable --context levels
- delta-diff re-review mode
- sticky finding tracking via review-state.json
- parameter propagation and semantic contract anti-patterns in pr_review.md
- token cost visibility
- --include-files for selective context
- --token-budget for controlling context size

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…en budget

Address all P1/P2 findings from local AI review:
- P1: merge_findings() now matches by (severity, file_path, summary_fingerprint) instead of (location, severity) to prevent false "addressed" on line shifts
- P1: parse_review_findings() requires bold severity (**P1**) format, skips table rows, assessment lines, instructional prose, and multi-severity refs
- P2: Use list-per-key in merge to handle duplicate findings without overwrite
- P2: Refactor apply_token_budget() to take typed params; source files are always included (sticky); only import-context governed by budget
- P2: Add --base-ref argument instead of brittle branch_info parsing
- P2: Add 8 new tests covering parser rejection and merge edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
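The sticky-source budget behavior described in that refactor might look roughly like the sketch below. The real signature of apply_token_budget() is not shown in this thread, so the parameter names, the (path, text) pair shape, and the 4-characters-per-token estimate are all assumptions:

```python
def apply_token_budget(source_parts, import_parts, budget, est=lambda s: len(s) // 4):
    """Illustrative sketch: source files are sticky (always included);
    only import-context files compete for the remaining token budget."""
    used = sum(est(text) for _, text in source_parts)  # mandatory, never dropped
    kept_imports = []
    for path, text in import_parts:
        cost = est(text)
        if used + cost > budget:
            continue  # drop this import-context file, keep trying smaller ones
        kept_imports.append((path, text))
        used += cost
    return source_parts + kept_imports
```

The key property is that exhausting the budget can only ever shrink import context, never the changed source files themselves.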
- P1: Simplify _finding_key() to (severity, summary_fingerprint) — file path excluded from primary key to prevent false "addressed" when location shifts or disappears between review rounds
- P1: Fix parser regex to reliably match **Severity:** P1 format (4 capture groups covering **P1**, **Severity:** P1, **Severity: P1**, Severity: P1)
- P2: Fix _finding_key() docstring to match actual implementation
- P2: Fix test_matching_with_missing_location to assert correct behavior (no false addressed) instead of accepting the bug
- Add 3 positive parser format tests (bold, bold-label, plain-label)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
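A minimal sketch of the simplified key, assuming findings are dicts with severity and summary fields (the normalization details here are guesses, not the actual implementation):

```python
def finding_key(finding: dict) -> tuple[str, str]:
    # Key on severity plus a normalized summary fingerprint only --
    # deliberately NOT the file path or line number, so a finding whose
    # location shifts or disappears between rounds still matches itself.
    fingerprint = " ".join(finding["summary"].lower().split())
    return (finding["severity"], fingerprint)
```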
Overall Assessment: ⛔ Blocker
Executive Summary
Methodology: No findings. The diff is limited to local review tooling, prompt text, and tests; it does not change any estimator, weighting, variance/SE, identification assumption, or default behavior under
Code Quality
Performance
Maintainability
Tech Debt: No findings. I did not find an existing
Security
Documentation/Tests
I could not run
Path to Approval
…g, pricing

P0: Expand secret scan (Step 3c) to cover full source files, import-expanded files, and --include-files — not just diff hunks
P1: Rewrite parse_review_findings() as block-based parser supporting multi-line Severity/Impact/Concrete-fix format. Add fail-safe: when parsing yields zero findings but severity markers exist, preserve prior findings instead of marking all as addressed. Return type changed to tuple[list[dict], bool].
P1: Implement tiered finding matching — primary key uses (severity, file_basename, summary[:50]) when file path available; fallback uses (severity, summary[:50]) only for findings without file paths and only with unique candidates. Prevents cross-file misattribution while still handling missing locations.
P1: Fix estimate_cost() to match longest model prefix first, so gpt-4.1-mini and o3-mini snapshots get correct mini-tier pricing instead of parent.
P1: Add branch/base validation to review-state.json reuse in skill Step 4. Discard stale state when stored branch or base_ref doesn't match current.
P2: In delta mode, derive source/import context from delta changed-files list instead of full branch changed-files.
P2: Add 7 regression tests: multi-line parser, parse uncertainty flag, same-summary-different-files, mini pricing, o3-mini pricing, delta context derivation, branch/base validation support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
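The longest-prefix pricing lookup can be sketched as below. The price table is illustrative only, not the project's actual rates, and the function name is hypothetical:

```python
MODEL_PRICING = {  # illustrative $/1M-token (input, output) rates, NOT real prices
    "gpt-4.1": (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "o3": (2.00, 8.00),
    "o3-mini": (1.10, 4.40),
}

def lookup_pricing(model: str):
    # Match the LONGEST known prefix so a dated snapshot like
    # "gpt-4.1-mini-2025-04-14" resolves to the mini tier rather than
    # to its shorter parent prefix "gpt-4.1".
    best = None
    for prefix, rates in MODEL_PRICING.items():
        if model.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, rates)
    return best[1] if best else None
```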
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment: The prior P0 secret-scan issue appears addressed, and the earlier cost-prefix bug also looks fixed, but the new re-review state machinery still has three unmitigated P1 correctness gaps that can silently misreport whether earlier findings were actually addressed.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
…esh, ancestry

P1: Accept plain Impact:/Location: labels in multiline blocks (not just bold). Broaden uncertainty detector with non-anchored regex catching Severity: markers anywhere in review text.
P1: Use full relative path in finding matching instead of basename to prevent __init__.py collisions across directories. Make fallback matching symmetric: when previous finding lacks location, reverse-match against current findings.
P1: Wire --force-fresh to actually delete all prior state (review-state.json, local-review-latest.md, local-review-previous.md) and omit --review-state, --previous-review, --delta-diff args. Replace git cat-file -t with git merge-base --is-ancestor for proper rebase detection.
P2: Add 4 regression tests: plain multiline blocks, plain severity uncertainty, previous-missing-location symmetric matching, same-basename-different-dirs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment: No methodology-facing code changed, but two unmitigated
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Verification note:
…ew gating

P1: Replace broad_sev regex with line-by-line _BLOCK_START scan in uncertainty detector, so **Severity:** P1 and **Severity: P1** formats correctly trigger parse_uncertain=True when the block parser can't extract findings.
P1: Move previous-review preservation inside the validated-state path in Step 4. When state is invalidated (branch mismatch, non-ancestor, rebase), delete local-review-previous.md to prevent stale findings leaking via --previous-review.
P2: Add 2 regression tests for bold-label uncertainty detection, fix 2 existing test assertions for mid-line severity markers (correctly non-detected).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment: The prior re-review P1s appear resolved, but one new P1 remains in the new deep-context import expansion path.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
P1: parse_imports() no longer truncates to 2 components. Full module paths like diff_diff.visualization._common are preserved. For 'from . import foo' style relative imports, each alias is appended to the resolved base package (e.g., diff_diff.visualization._event_study instead of diff_diff.visualization).
P2: Add 3 regression tests: submodule import truncation check, relative import alias expansion, visualization __init__.py expansion includes submodule files.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
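Untruncated import collection could look roughly like this sketch built on Python's ast module. It assumes single-level relative imports and that the caller supplies current_package (the dotted package of the file being parsed); the actual parse_imports() implementation is not shown in the thread:

```python
import ast

def parse_imports(source: str, current_package: str) -> set[str]:
    """Sketch: collect fully qualified module names without truncation."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name)  # keep the full dotted path, untruncated
        elif isinstance(node, ast.ImportFrom):
            if node.level:  # relative import, e.g. "from . import _event_study"
                base = current_package
                if node.module:  # "from .sub import name"
                    base += "." + node.module
                for alias in node.names:
                    modules.add(f"{base}.{alias.name}")
            elif node.module:
                modules.add(node.module)
    return modules
```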
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment: No estimator, weighting, variance/SE, identification, or default-behavior code changed, so the methodology registry is not implicated here. The prior deep-context import-resolution defect appears fixed, but the new sticky finding-tracking path still has two unmitigated P1 correctness issues.
Executive Summary
Methodology: No findings. The changed files are tooling and prompt/test assets, not estimator implementations, so there is nothing to cross-check against the academic-method sections of docs/methodology/REGISTRY.md.
Code Quality
Performance: No findings.
Maintainability: No findings beyond the two code-quality defects above.
Tech Debt: No findings. TODO.md does not currently track the open P1s above, so they are not mitigated for assessment purposes.
Security: No findings.
Documentation/Tests
Path to Approval
P1: Strip inline file.py:Lnn references from summaries before fingerprinting in _finding_keys(). Uses lowercase regex since summaries are already lowercased. Prevents false addressed+open churn when the same finding shifts line numbers.
P1: Skip write_review_state() entirely when parse_uncertain fires. This prevents advancing the delta baseline past unparsed code, so the next re-review correctly covers the unreviewed changes.
P2: Add 2 end-to-end regression tests: parse-then-merge pipeline verifying line-shift matching produces 1 open / 0 addressed, and parse_uncertain state preservation verifying the state file is not modified.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment: No estimator, weighting, variance/SE, identification, or default-behavior code changed, so I found no methodology issues. One prior P1 is fixed, but one prior P1 remains partially open, and there is one additional P1 in the new re-review workflow.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Verification note:
Path to Approval
…cing

P1: parse_uncertain now unconditionally skips all write_review_state() calls regardless of whether prior findings exist. Prevents advancing delta baseline after any parse failure, not just when structured_findings is truthy.
P1: Branch/base mismatch block now also deletes local-review-previous.md to prevent stale findings from leaking via --previous-review after branch switch.
P2: Update gpt-5.4 pricing to $2.50/$15.00 per 1M tokens (was $2.00/$8.00).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment: No estimator, weighting, variance/SE, identification, or default-behavior methodology changed in this diff, so I found no registry/paper adherence issues. The two prior P1s are addressed, but one unmitigated P1 remains in the new sticky-finding logic.
Executive Summary
Methodology: No findings. The changed files do not alter any estimator implementation or any documented methodology deviation in docs/methodology/REGISTRY.md.
Code Quality
Performance
Maintainability: No findings.
Tech Debt: No findings. TODO.md#L44 does not currently track the open issues above, so they remain unmitigated for assessment purposes.
Security: No findings.
Documentation/Tests: No standalone findings beyond the missing regressions implied above. The current additions cover Python-file line shifts and basic token-budget behavior, but not non-
Path to Approval
P1: Broaden _LOCATION_PATTERN and summary normalization in _finding_keys() from .py-only to any file extension (\.\w+). Findings on .md, .yml, etc. now correctly merge across line shifts without false addressed/open churn.
P2: Include structured_findings table and delta_changed_files_text in mandatory_est for token budget calculation, so deep-context import files are correctly dropped when rerun prompts exceed the budget. Add regression test for .md file line-shift merge stability.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
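A sketch of the broadened normalization. The real _LOCATION_PATTERN is not quoted anywhere in this thread, so the regex below is only an assumption about its shape; the point it illustrates is \.\w+ accepting any extension, plus the lowercase-only match since summaries are lowercased first:

```python
import re

# \.\w+ instead of \.py, so .md, .yml, .toml locations normalize too;
# summaries are already lowercased, hence the lowercase "l" in ":l42".
_LOCATION_PATTERN = re.compile(r"\b[\w./-]+\.\w+(?::l?\d+(?:-l?\d+)?)?")

def normalize_summary(summary: str) -> str:
    # Strip file:line references so a finding that merely shifts lines
    # fingerprints identically across review rounds.
    return " ".join(_LOCATION_PATTERN.sub("", summary.lower()).split())
```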
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment: ⛔ Blocker — one unmitigated
Executive Summary
Methodology: No findings. I checked docs/methodology/REGISTRY.md; this PR only changes local review tooling, prompt criteria, and tests, not any estimator implementation or documented methodology deviation.
Code Quality
Performance: No unmitigated findings. The earlier rerun token-budget omission looks addressed by the added
Maintainability: No separate findings beyond the state-parsing duplication already called out under Code Quality.
Tech Debt: No findings.
Security
Documentation/Tests
Path to Approval
P0: Confine --include-files to repo root. Reject absolute paths upfront, normalize with realpath, verify candidate stays under repo_root + os.sep. Prevents ../ traversal and arbitrary local file exfiltration to OpenAI.
P1: --force-fresh now still passes --review-state/--commit-sha/--base-ref so the fresh run seeds a new delta baseline. Only --previous-review and delta args are suppressed.
P1: Harden parse_review_state() with type validation: non-dict root, non-list findings, non-int review_round all degrade to ([], 0) instead of aborting. Skill-side JSON reads wrapped in try/except.
P2: Add 5 regression tests: path traversal detection, non-dict root, non-list findings, non-int round, absolute path rejection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
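The confinement check might look like this sketch (the function name and error handling are illustrative; the steps mirror the commit description: reject absolute paths, realpath-normalize, then require the result to stay under the root plus separator):

```python
import os

def resolve_include_file(arg: str, repo_root: str) -> str:
    """Sketch: confine a user-supplied --include-files path to the repo root."""
    if os.path.isabs(arg):
        raise ValueError(f"absolute paths not allowed: {arg}")
    # realpath resolves symlinks and ".." components before the check
    candidate = os.path.realpath(os.path.join(repo_root, arg))
    root = os.path.realpath(repo_root)
    # require root + separator so a sibling like "/repo-evil" cannot pass
    if not candidate.startswith(root + os.sep):
        raise ValueError(f"path escapes repository root: {arg}")
    return candidate
```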
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
P1: Filter non-dict elements from findings list in parse_review_state(). Entries like strings or integers are silently dropped instead of crashing downstream code that calls f.get(...).
P2: Track consumed current candidates in merge_findings() pass 2b so two prior no-location findings can't both match the same current finding. Ensures one-to-one matching in the reverse fallback pass.
P2: Add 2 regression tests: non-dict findings filtering, duplicate no-location findings one-to-one matching.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
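The consumed-candidate tracking can be sketched as below, assuming findings carry id, severity, and summary fields (the function name and pairing criteria are illustrative, not the actual pass-2b code):

```python
def match_no_location(previous: list, current: list) -> dict:
    """Sketch of one-to-one fallback matching: each current finding may
    satisfy at most one prior finding, so consumed indices are tracked."""
    consumed = set()
    matches = {}
    for prev in previous:
        for i, cur in enumerate(current):
            if i in consumed:
                continue  # already claimed by an earlier prior finding
            if prev["severity"] == cur["severity"] and prev["summary"] == cur["summary"]:
                matches[prev["id"]] = cur["id"]
                consumed.add(i)
                break
    return matches
```

Without the consumed set, two identical prior findings would both match the same current finding and one open issue would silently vanish.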
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Executive Summary
Methodology: No findings. The diff is confined to local review tooling, prompt criteria, and tests, and does not touch estimator implementations or any documented methodology deviations in
Code Quality
Performance: No findings.
Maintainability: No separate findings beyond the duplicated/partial state-validation flow called out above.
Tech Debt: No mitigating TODO entry exists for the remaining state-validation issue under
Security: No unmitigated findings. The prior local-file inclusion issue appears fixed by the
Documentation/Tests
Path to Approval
… validation

P1: Require id/severity/summary/status keys in finding dicts — malformed entries are filtered in parse_review_state() before reaching merge/compile.
P1: Add validate_review_state() function that checks schema version, branch/base match, and required fields in a single call. Returns (findings, round, commit, is_valid) tuple.
P1: Replace shell JSON parsing in skill Step 4 with Python validator call. Delta mode and --previous-review are now gated on the is_valid flag from the centralized validator, not raw file existence.
P2: Add 5 regression tests: missing-key filtering, validate_review_state valid/branch-mismatch/schema-mismatch/missing-file cases. Fix existing tests to include all required finding keys.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Executive Summary
Methodology: No findings. The diff is confined to local review tooling, prompt criteria, and tests; it does not modify estimator implementations, weighting, variance/SE code, identification checks, or defaults in
Code Quality
Performance: No findings.
Maintainability: No separate findings beyond the state-validation/gating issues above.
Tech Debt: No mitigating
Security: No unmitigated findings. The prior local-file inclusion concern appears fixed by the path confinement logic in .claude/scripts/openai_review.py:L1389-L1427.
Documentation/Tests
Path to Approval
…ing test

P1: validate_review_state() now fails closed on ANY malformed finding — validates raw findings for required keys BEFORE returning is_valid=True. Malformed entries invalidate delta mode entirely instead of being silently filtered.
P1: Step 4 now deletes local-review-previous.md whenever delta mode is NOT active (including when review-state.json is missing entirely). Prevents stale previous-review from leaking into fresh reviews.
P2: Add regression test asserting malformed findings return is_valid=False.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
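Combining the centralized validator introduced earlier with the fail-closed rule above might look roughly like this sketch. Field names follow the thread; the schema_version check and the exact tuple shape are assumptions:

```python
import json
import os

REQUIRED_KEYS = {"id", "severity", "summary", "status"}

def validate_review_state(path, branch, base_ref, schema_version=1):
    """Sketch of a single fail-closed validator returning
    (findings, review_round, commit_sha, is_valid)."""
    invalid = ([], 0, None, False)
    if not os.path.exists(path):
        return invalid
    try:
        with open(path) as fh:
            state = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return invalid
    if not isinstance(state, dict):
        return invalid
    if state.get("schema_version") != schema_version:
        return invalid
    if state.get("branch") != branch or state.get("base_ref") != base_ref:
        return invalid
    findings = state.get("findings")
    if not isinstance(findings, list):
        return invalid
    for f in findings:
        # fail closed: ANY malformed entry invalidates delta mode entirely
        if not isinstance(f, dict) or not REQUIRED_KEYS <= f.keys():
            return invalid
    return findings, state.get("review_round", 0), state.get("commit_sha"), True
```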
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Executive Summary
Methodology: No findings. Methods affected: none. The PR does not touch
Code Quality
Performance: No findings.
Maintainability: No separate findings beyond the parser/state-coupling issue above.
Tech Debt: No mitigating
Security
Documentation/Tests
Path to Approval
P1: _BLOCK_START now accepts *, +, and numbered list prefixes (1., 2.) in addition to - bullets. Prevents parse_uncertain=False on common Markdown list formats like "1. Severity: P1" or "* **Severity:** P1".
P2: Step 3c --include-files scan now mirrors the script's absolute-path rejection and realpath repo-root containment check before scanning.
P2: Add 3 regression tests: numbered-list severity parsing, starred-bold severity parsing, numbered-bold uncertainty detection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
…tion

P1: Add bold-colon (**P1:**) and bare-colon (P1:) severity formats to _BLOCK_START. These are common review output patterns that were previously missed, causing mixed-format reviews to falsely mark prior findings addressed.
P1: parse_uncertain now detects partial parses (some findings parsed but severity lines remain unparsed) by comparing severity-line count vs findings count. Prevents state advancement on mixed-format output.
P2: Step 3c --include-files now mirrors script's exact branching: only bare filenames resolve under diff_diff/; slash-containing paths resolve repo-root relative only.
P2: Add 3 regression tests: bold-colon format, bare-colon format, and mixed-format review where both formats should parse.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
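One way to sketch the broadened block-start pattern together with the partial-parse detection. The project's actual _BLOCK_START regex is never quoted in this thread, so this union of the named formats (bullets, numbered lists, bold, bold-colon, bare-colon, labeled) is a guess:

```python
import re

# Hypothetical union of the severity block-start formats named above:
# "- **Severity:** P1", "1. Severity: P1", "* P1 ...", "**P1:** ...", "P1: ..."
_BLOCK_START = re.compile(
    r"^\s*(?:[-*+]|\d+\.)?\s*(?:\*\*)?(?:Severity:?\s*)?(?:\*\*)?\s*(P[0-3])\b"
)

def parse_uncertain(review_text: str, findings: list) -> bool:
    # Partial-parse detection: if more lines look like severity blocks than
    # findings were actually parsed, a block format was missed -- do not
    # advance the review state.
    severity_lines = sum(
        1 for line in review_text.splitlines() if _BLOCK_START.match(line)
    )
    return severity_lines > len(findings)
```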
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Executive Summary
Methodology: No findings. The diff is confined to local review tooling, prompt text, and tests; no estimator implementation or methodology-facing behavior in
Code Quality
Performance: No findings.
Maintainability: No separate findings beyond the parser/state-coupling issue above.
Tech Debt: No findings.
Security: No findings. The earlier slash-path
Documentation/Tests
Path to Approval
P1: The plain Severity: pattern now accepts optional ** around the severity
value via \*{0,2} quantifier. Handles Severity: P1, Severity: **P1**,
and Severity: `P1` uniformly in a single regex branch.
P2: Add regression test for Severity: **P1** format.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Executive Summary
Methodology: No findings. I checked docs/methodology/REGISTRY.md; this PR does not touch methodology-facing library code.
Code Quality
Performance
Maintainability: No findings.
Tech Debt: No separate findings.
Security: No findings. The new
Documentation/Tests
Path to Approval
P1: Remove _should_skip_line() from block-start matching. If a line matches _BLOCK_START, it's a finding — trust the severity pattern regardless of summary content. Skip heuristics remain in the uncertainty scanner only. Prevents findings containing "Path to Approval", "Looks good", etc. in their summaries from being silently dropped.
P2: Fix delta-mode token budget to not count changed_files_text (which is NOT included in delta-mode prompt, only in fresh reviews). Prevents unnecessary import-context dropping in re-reviews.
P2: Add 2 regression tests for findings with skip markers in summaries.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment: The prior parser and delta-budget issues look fixed, but there is still an unmitigated P1 in the new sticky-finding merge logic.
Executive Summary
Methodology: No findings. Methods affected: none. The diff is confined to review tooling/docs/tests rather than estimator, weighting, variance/SE, identification, or default-behavior code, so there is no methodology deviation to cross-check against docs/methodology/REGISTRY.md:L1-L40.
Code Quality
Performance: No findings. The prior delta-budget double-count appears fixed in .claude/scripts/openai_review.py:L1472-L1491.
Maintainability
Tech Debt: No separate findings.
Security: No findings. The new
Documentation/Tests
Path to Approval
P1: Remove 50-char truncation from _finding_keys() summary fingerprint. Uses full normalized summary so distinct findings with same prefix don't collapse into one state entry. Normalization (lowercase + strip file:line references) already handles wording stability.
P2: Escape | and newline chars in structured findings table cells to prevent table corruption when summaries contain pipe chars (e.g., "str | None").
P2: Add 2 regression tests: long summary collision prevention and pipe char escaping in findings table.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
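A minimal sketch of the cell escaping described above (the helper name is illustrative):

```python
def escape_cell(text: str) -> str:
    # Markdown table cells cannot contain raw pipes or newlines; escape the
    # pipe and flatten newlines so summaries like "str | None" render intact.
    return text.replace("|", "\\|").replace("\n", " ")
```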
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment: ✅ Looks good
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
I could not execute
Summary
- --context minimal|standard|deep levels (default: standard)
- review-state.json — findings are matched across rounds by severity + summary fingerprint, enabling structured status tracking (open/addressed)
- pr_review.md — shared between local and CI review
- --include-files for selective extra context and --token-budget for controlling import-context inclusion
- --base-ref and --commit-sha arguments for robust review state tracking

Methodology references
openai_review.py, ai-review-local.md), review criteria (pr_review.md), and tests.

Validation
- tests/test_openai_review.py (78 tests, all passing)
- Test classes: TestResolveChangedSourceFiles, TestReadSourceFiles, TestParseImports, TestExpandImportGraph, TestEstimateCost, TestTokenBudget, TestParseReviewState, TestWriteReviewState, TestParseReviewFindings, TestMergeFindings, TestCompilePromptWithContext
- pr_review.md changes are strictly additive (anti-patterns 4-5 appended), all existing anti-patterns intact, substitution drift test passes

Security / privacy
Generated with Claude Code