Code review is essential and expensive. An LLM reviewer will never replace a senior engineer's judgment, but it can catch the boring stuff first: missing error handling, obvious bugs, inconsistent naming, and untested edge cases. That frees your humans to focus on design. This article builds an automated reviewer on Model Database that comments on pull requests.
We will fetch a diff, send it to a capable model with a focused rubric, and post structured comments back to the PR.
Why a diff-first design
Sending an entire repository to a model is wasteful and dilutes attention. Reviewers care about what changed, so the unit of work is the diff. A diff is compact, contains the relevant context lines, and maps cleanly to line-level comments.
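To illustrate how a diff maps to line-level comments, here is a minimal sketch that walks a unified diff and records the new-file line number of every added line. The helper name `added_lines` is hypothetical, and the parsing assumes standard `git diff` output with `diff --git` file headers; it is not a full diff parser.

```python
import re

# Hunk header, e.g. "@@ -10,6 +12,8 @@"; group 1 is the new-side start line.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(diff_text):
    """Map each changed file to the new-file line numbers of its added lines."""
    result = {}
    path = None
    new_line = None  # None while we are in a file header, outside any hunk
    for raw in diff_text.splitlines():
        if raw.startswith("diff --git "):
            path, new_line = None, None
            continue
        hunk = HUNK_RE.match(raw)
        if hunk:
            new_line = int(hunk.group(1))
            continue
        if new_line is None:
            # File header: "+++ b/path" names the new side of the file.
            if raw.startswith("+++ ") and raw[4:] != "/dev/null":
                path = raw[4:]
                if path.startswith("b/"):
                    path = path[2:]
            continue
        if raw.startswith("+"):
            if path is not None:
                result.setdefault(path, []).append(new_line)
            new_line += 1
        elif raw.startswith(" "):
            new_line += 1  # context line exists on both sides
        # "-" lines and "\ No newline" markers do not advance the new side
    return result
```

A map like this can also be used to discard model comments that point at lines the PR did not touch.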
For code reasoning, choose a strong model. anthropic/claude-opus-4-8 is a good fit when correctness matters; anthropic/claude-sonnet-4-6 is a lighter option for routine PRs. Both are one model string away through the same API.
Fetching the diff
import subprocess

def get_diff(base, head):
    # Three-dot range: changes on head since it diverged from base.
    return subprocess.check_output(
        ["git", "diff", f"{base}...{head}", "--unified=3"],
        text=True,
    )
In CI you would pull the diff from your platform's API instead, but the local git diff is perfect for prototyping.
The review call
Give the model a tight rubric and ask for structured output so you can act on it programmatically. Asking for line references lets you post inline comments.
from openai import OpenAI
import json

client = OpenAI(
    base_url="https://modeldatabase.com/v1",
    api_key="mdb_live_...",
)

RUBRIC = """You are a senior code reviewer. Review the diff for:
- correctness bugs and logic errors
- missing error/edge-case handling
- security issues (injection, secrets, unsafe input)
- unclear naming or dead code
Ignore style nits a formatter would fix.
Return JSON: {"comments": [{"file","line","severity","message"}],
"summary": "..."}. severity is info|warning|critical."""

def review(diff):
    resp = client.chat.completions.create(
        model="anthropic/claude-opus-4-8",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Diff:\n{diff}"},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    # JSON mode returns a single JSON object as the message content.
    return json.loads(resp.choices[0].message.content)
Setting temperature to zero makes reviews more consistent across reruns, though not strictly identical, which matters when developers expect the same feedback on the same code.
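Even in JSON mode, the model may return a comment with a missing field or an unexpected severity. As a sketch, a small validation pass can drop malformed entries before anything is posted. The function name `clean_review` is hypothetical; the field names and severity values follow the rubric above.

```python
ALLOWED_SEVERITIES = {"info", "warning", "critical"}


def clean_review(raw):
    """Keep only well-formed comments; tolerate odd model output without raising."""
    items = raw.get("comments") if isinstance(raw, dict) else None
    comments = []
    for c in items if isinstance(items, list) else []:
        if not isinstance(c, dict):
            continue
        path = c.get("file")
        line = c.get("line")
        severity = str(c.get("severity", "")).lower()
        message = c.get("message")
        ok = (
            isinstance(path, str) and path
            # bool is a subclass of int, so exclude it explicitly
            and isinstance(line, int) and not isinstance(line, bool) and line > 0
            and severity in ALLOWED_SEVERITIES
            and isinstance(message, str) and message.strip()
        )
        if ok:
            comments.append(
                {"file": path, "line": line, "severity": severity, "message": message}
            )
    summary = raw.get("summary", "") if isinstance(raw, dict) else ""
    return {"comments": comments, "summary": summary if isinstance(summary, str) else ""}
```

Silently dropping bad entries is one design choice among several; logging them instead would make rubric problems easier to spot.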
Posting comments back
With structured output, posting to GitHub or GitLab is mechanical. Map each comment to a file and line, and post the summary as a top-level comment on the PR.
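def to_github(review, pr, commit):
    # Assumes pr is a PyGithub PullRequest and commit is the PR's head Commit;
    # PyGithub's create_review_comment expects the commit the comment applies to.
    for c in review["comments"]:
        pr.create_review_comment(
            body=f"**{c['severity'].upper()}**: {c['message']}",
            commit=commit,
            path=c["file"],
            line=c["line"],
        )
    # The summary goes on the PR as a regular top-level comment.
    pr.create_issue_comment(review["summary"])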
Gate your CI on critical findings only. Treating every suggestion as blocking creates noise and trains developers to ignore the bot.
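One way to gate on critical findings only is to turn the review into a process exit code. This is a minimal sketch: the function names are hypothetical, and the review dict is assumed to have the shape requested in the rubric.

```python
def blocking_findings(review, blocking=("critical",)):
    """Return only the comments whose severity should fail the build."""
    return [c for c in review.get("comments", []) if c.get("severity") in blocking]


def gate_exit_code(review):
    """Print blocking findings and return 1 if there are any, else 0."""
    found = blocking_findings(review)
    for c in found:
        print(f"{c['file']}:{c['line']}: {c['message']}")
    return 1 if found else 0
```

In a CI script, the result would be passed to `sys.exit`, so warnings and info comments still appear on the PR without failing the job.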
Keeping it useful, not annoying
- Scope the rubric: tell the model what to ignore. Overlapping with your linter just produces duplicate noise.
- Chunk large diffs: split by file when a PR is big, review each chunk, then merge the comment lists. This keeps each request focused and within context limits.
- Add repo context selectively: include the contents of a changed function's file if the diff alone is ambiguous, but resist sending everything.
- Severity discipline: only critical should block a merge. Warnings inform; info is optional.
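The chunking advice above can be sketched as follows, assuming `git diff` output where every file section starts with a `diff --git` line. Both function names are hypothetical, and `review_fn` stands in for the `review` function defined earlier, passed as a parameter so the sketch runs without a network call.

```python
def split_by_file(diff_text):
    """Split a git diff into one chunk per file section."""
    chunks, current = [], []
    for line in diff_text.splitlines(keepends=True):
        # Lines inside hunks start with " ", "+" or "-", so this only
        # matches real file headers.
        if line.startswith("diff --git ") and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


def review_in_chunks(diff_text, review_fn):
    """Review each file chunk separately, then merge the results."""
    comments, summaries = [], []
    for chunk in split_by_file(diff_text):
        result = review_fn(chunk)
        comments.extend(result.get("comments", []))
        if result.get("summary"):
            summaries.append(result["summary"])
    return {"comments": comments, "summary": "\n".join(summaries)}
```

Per-file chunks lose cross-file context, so a change that only makes sense alongside another file may need the two reviewed together.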
Measuring whether it helps
Log every review with the model ID and token usage. After a few weeks, sample the comments and label them useful or not. That gives you a concrete signal for whether a cheaper model like anthropic/claude-sonnet-4-6 is good enough, or whether the larger model's findings justify the extra tokens. Because Model Database is pay-as-you-go, you can run both in shadow mode on the same PRs and compare before committing.
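As a rough sketch of that measurement loop, the helpers below build one JSON log line per review and compute the share of comments labeled useful for each model. The function names and record fields are hypothetical. The token counts are assumed to come from the response's `usage` object (`prompt_tokens`, `completion_tokens`) as exposed by OpenAI-compatible APIs; check that your provider returns it.

```python
import json
from collections import defaultdict


def review_log_line(model, prompt_tokens, completion_tokens, n_comments):
    """Serialize one review as a JSON line for an append-only log."""
    return json.dumps({
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "comments": n_comments,
    })


def useful_rate(labels):
    """labels: iterable of {"model": str, "useful": bool}, one per sampled comment."""
    totals = defaultdict(lambda: [0, 0])  # model -> [useful count, total count]
    for row in labels:
        counts = totals[row["model"]]
        counts[0] += 1 if row["useful"] else 0
        counts[1] += 1
    return {model: useful / total for model, (useful, total) in totals.items()}
```

Comparing the useful rate against tokens spent per model gives a simple basis for the shadow-mode comparison described above.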
A note on trust
Be transparent that comments come from an automated reviewer, and make it trivial for developers to dismiss a thread. The bot is an assistant, not an authority. Used this way, it shortens review cycles without eroding the human judgment that good engineering depends on.
Create an API key and add credit in your dashboard, then check the docs for request limits and JSON mode support.