AccuracyEvaluatorSkill-ClaudeImproved

---
name: accuracy-evaluator
description: Evaluates text, articles, claims, or content for factual accuracy against the current state of the world in 2026. Use this skill whenever the user asks to evaluate, fact-check, assess accuracy, or review content for what it gets right and what needs improvement — especially when they ask what's accurate "in 2026", "currently", or "today". Also use when asked to evaluate a critique, review, or evaluation of another piece of content (meta-evaluation mode). Always use this skill when the prompt includes phrases like "evaluate this", "what does it get right", "what needs improvement to be accurate", "fact-check this", "is this still true", or "use your skills evaluator". Produces a structured, consistent report with labeled sections so outputs can be compared across multiple AI services.

---

# Accuracy Evaluator

Evaluates content for factual accuracy in 2026, producing a structured report that clearly separates what is correct, what is outdated, what is missing, and what cannot be verified — and why.


---

## Core Objective

Produce a structured, honest accuracy report. The output format is intentionally consistent so it can be compared side-by-side with reports from other AI services.

---

## Pre-Flight: Determine Evaluation Mode

Before starting, determine what kind of content was submitted:

**Standard Mode** — The content makes claims about the world (an article, tutorial, guide, argument, description). Evaluate the claims directly.

**Meta-Evaluation Mode** — The content is itself a critique, review, evaluation, or analysis of another piece of content. In this case:

1. Evaluate the **critique's own claims** using the standard process below (not just the underlying content it references).
2. Check whether the critique's praise and objections are accurate, well-supported, and complete.
3. Note what the critique missed, overclaimed, or got wrong.
4. If both the source content and the critique are available, evaluate both documents in sequence, labeling each clearly.

Do not skip meta-evaluation mode. If someone submits a document that says "here is what the AI got right and wrong," evaluate *that document* as rigorously as you would evaluate the original.

---

## Evaluation Process

### Pre-Step: Resolve Knowledge Gaps with Web Search

Before evaluating, identify any claims involving **named products, tools, models, APIs, companies, or technologies** that may have emerged or changed after August 2025 — Claude's knowledge cutoff. For any such claim where currency matters and uncertainty exists, **run a web search before rendering a verdict**. Do not guess or extrapolate from pre-cutoff knowledge when current data is retrievable. This applies especially to AI model comparisons, software version claims, company status, product availability, and pricing.
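
The pre-step can be read as a gate applied to each claim before any verdict is given. The sketch below is one hypothetical way to express that gate in Python. The `Claim` fields and the `needs_web_search` helper are illustrative assumptions, not part of the skill, and the cutoff constant simply mirrors the August 2025 date stated above (the day of the month is an arbitrary choice).

```python
from dataclasses import dataclass
from datetime import date

# Assumption: mirrors the "August 2025" cutoff named in the text; the day is arbitrary.
KNOWLEDGE_CUTOFF = date(2025, 8, 31)


@dataclass
class Claim:
    text: str
    names_specific_entity: bool  # product, tool, model, API, company, or technology
    currency_matters: bool       # does the verdict depend on the present state of things?
    uncertain: bool              # could this have emerged or changed since the cutoff?


def needs_web_search(claim: Claim, today: date) -> bool:
    """Return True when a search should run before a verdict is rendered."""
    past_cutoff = today > KNOWLEDGE_CUTOFF
    return (
        past_cutoff
        and claim.names_specific_entity
        and claim.currency_matters
        and claim.uncertain
    )
```

The three boolean fields are judgments the evaluator makes while reading; the function only combines them, so a claim that fails any one condition is evaluated from existing knowledge.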


### Step 1: Identify Claim Types

Before evaluating, classify the claims in the content:

- **Timeless facts** — historical events, scientific laws, math, definitions (low staleness risk)
- **Slow-changing facts** — established science, geography, institutional structures (moderate risk)
- **Fast-changing facts** — AI capabilities, software versions, company status, market conditions, geopolitics, regulations, prices, personnel (high staleness risk)
- **Opinions or predictions** — assess whether they were reasonable given what was known and whether they have proven out
- **Procedural/how-to claims** — may be outdated if tools or APIs have changed
- **Domain best-practice claims** — recommendations about how something *should* be done; flag these for the omission check in Step 2b
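
The classification above can be sketched as a small lookup. This is a hypothetical illustration: the `ClaimType` enum and both helper functions are assumed names. The risk labels are copied from the list, and claim types that the list does not rate are deliberately left unrated rather than guessed.

```python
from enum import Enum
from typing import Optional


class ClaimType(Enum):
    TIMELESS = "timeless fact"
    SLOW_CHANGING = "slow-changing fact"
    FAST_CHANGING = "fast-changing fact"
    OPINION_OR_PREDICTION = "opinion or prediction"
    PROCEDURAL = "procedural/how-to claim"
    BEST_PRACTICE = "domain best-practice claim"


# Only the three types that the list rates explicitly appear here.
STALENESS_RISK = {
    ClaimType.TIMELESS: "low",
    ClaimType.SLOW_CHANGING: "moderate",
    ClaimType.FAST_CHANGING: "high",
}


def staleness_risk(claim_type: ClaimType) -> Optional[str]:
    """Return the stated risk label, or None when the list gives no rating."""
    return STALENESS_RISK.get(claim_type)


def needs_omission_check(claim_type: ClaimType) -> bool:
    """Best-practice claims are the ones flagged for the Step 2b omission check."""
    return claim_type is ClaimType.BEST_PRACTICE
```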


### Step 2a: Evaluate Each Significant Claim


For each meaningful claim, assess:

1. Is it factually correct as of early 2026?
2. Was it correct when written but has since become outdated?
3. Is it misleading in framing even if technically accurate?
4. Is it unverifiable from available knowledge?

### Step 2b: Check for Significant Omissions (Domain Best-Practice Gap Check)

This step is separate from staleness checking. Ask:

**"Are there well-established best practices, standard approaches, or critical warnings for this domain that the content fails to mention, and whose absence would lead a reader toward a worse outcome?"**

This catches errors of omission that are not staleness issues — they were gaps when the content was written too. Apply this check especially to:

- Technical tutorials and how-to guides
- Architectural or design recommendations
- Safety-critical or high-stakes domains
- Content that presents a workflow as complete when it omits a foundational step

For each significant omission, ask: Would a practitioner in this field consider the omission a meaningful gap? If yes, it belongs in the report.
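
One way to make the omission test concrete is to record each candidate gap with the fields the report asks for and to apply the two-part test from this step. The sketch below is a hypothetical illustration in Python; the `Omission` record, the `GapType` enum, and the `belongs_in_report` helper are assumed names, not part of the skill itself.

```python
from dataclasses import dataclass
from enum import Enum


class GapType(Enum):
    DOMAIN_BEST_PRACTICE = "domain best-practice gap"  # already missing when written
    TEMPORAL = "temporal gap"                          # practice became standard later


@dataclass
class Omission:
    missing: str          # what the content fails to mention
    why_it_matters: str   # what goes wrong without it
    gap_type: GapType
    confidence: str       # "High", "Medium", or "Low"


def belongs_in_report(practitioner_sees_gap: bool, leads_to_worse_outcome: bool) -> bool:
    """Combine the evaluator's two judgments; both must hold for the gap to be reported."""
    return practitioner_sees_gap and leads_to_worse_outcome
```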

### Step 3: Assign Confidence

For each finding, note confidence level:

- **High** — well-established, multiple corroborating sources
- **Medium** — likely correct but not certain; caveats apply
- **Low** — uncertain; recommend independent verification
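
Taken together, the per-claim questions and the confidence levels suggest a simple record for each finding. The following is a minimal sketch under stated assumptions: `Verdict`, `Confidence`, `Finding`, and both helpers are hypothetical names, and the routing to report sections follows the section headers defined in the Output Format of this document.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ACCURATE = "accurate"
    WRONG_WHEN_WRITTEN = "wrong when written"
    OUTDATED = "has become wrong since"
    MISLEADING = "misleading framing"
    UNVERIFIABLE = "unverifiable"


class Confidence(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass
class Finding:
    claim: str
    verdict: Verdict
    confidence: Confidence
    note: str = ""


def report_section(finding: Finding) -> str:
    """Map a finding to the report section header it is listed under."""
    if finding.verdict is Verdict.ACCURATE:
        return "WHAT IT GETS RIGHT"
    if finding.verdict is Verdict.UNVERIFIABLE:
        return "CANNOT VERIFY"
    return "WHAT NEEDS IMPROVEMENT"


def recommend_independent_verification(finding: Finding) -> bool:
    """Low-confidence findings carry a recommendation to verify independently."""
    return finding.confidence is Confidence.LOW
```

Keeping the verdict and the confidence as separate fields reflects the process above: a claim can be judged outdated with high confidence, or judged accurate with only medium confidence.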


---

## Output Format

Always structure your response exactly as follows. Use these exact section headers so outputs are comparable across AI services. If evaluating multiple documents (e.g., meta-evaluation mode), repeat the full report block for each document, clearly labeled.


---

### ACCURACY EVALUATION REPORT

**Content evaluated:** [1-sentence description of what was submitted]
**Evaluation mode:** [Standard | Meta-Evaluation | Multi-Document]
**Evaluation date context:** Early 2026

---

#### WHAT IT GETS RIGHT

List claims that are accurate as of 2026. For each, briefly explain why it holds up. Include confidence level.

---

#### WHAT NEEDS IMPROVEMENT

List claims that are inaccurate, outdated, or misleading as of 2026. For each:

- State the original claim
- State what is accurate in 2026
- Note whether this was **wrong when written** or **has become wrong since**
- Include confidence level

---

#### SIGNIFICANT OMISSIONS

List well-established best practices, standard approaches, or critical warnings that are absent from the content and whose absence would lead a reader toward a worse outcome. For each:

- Describe what is missing
- Explain why it matters (what goes wrong without it)
- Note whether this is a **domain best-practice gap** (was missing when written) or a **temporal gap** (the practice became standard later)
- Include confidence level

If nothing significant is missing, state: "None identified."

---

#### WHAT HAS CHANGED SINCE THIS WAS WRITTEN

(Skip this section if the content appears to be recent or if no temporal shift is detectable.)
Summarize developments that have materially changed the picture — things the author couldn't have known or that have evolved. This is distinct from errors; these are legitimate updates.

---

#### CANNOT VERIFY
List claims you cannot assess with confidence, and be specific about why.

Split into two sub-categories:

**Missing Context** — Claims that cannot be assessed because required context is absent from the submitted material (e.g., references to images, prior conversations, proprietary data, or named individuals not described). For each, specify what context is needed.

**Knowledge Limit** — Claims that cannot be assessed due to Claude's knowledge cutoff, domain depth limits, or a rapidly changing situation that search could not resolve. For each, specify why the limit applies and suggest how the reader could verify independently.

If neither category has entries, state: "None identified."


---

#### OVERALL ASSESSMENT
One paragraph. Summarize the overall accuracy quality of the content, note the most significant issues, and give a directional rating:

- **Strong** — mostly accurate, minor gaps or omissions
- **Mixed** — significant accurate content alongside notable errors, outdated material, or important omissions
- **Weak** — substantial inaccuracies, outdated framing, or missing foundations that undermine the content's usefulness
- **Indeterminate** — cannot assess meaningfully without more context
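
As an illustration of why fixed headers make reports comparable, the sketch below assembles a report skeleton from per-section findings. It is a simplified, hypothetical renderer: `render_report` and its parameters are assumed names, the horizontal rules between sections are omitted for brevity, and the only behavior taken from this document is the header text, the section order, the "None identified." fallback, and the option to skip the "what has changed" section when it is empty.

```python
from typing import Dict, List

SECTION_ORDER = [
    "WHAT IT GETS RIGHT",
    "WHAT NEEDS IMPROVEMENT",
    "SIGNIFICANT OMISSIONS",
    "WHAT HAS CHANGED SINCE THIS WAS WRITTEN",
    "CANNOT VERIFY",
]
# The only section the format allows to be skipped when there is nothing to say.
SKIPPABLE = {"WHAT HAS CHANGED SINCE THIS WAS WRITTEN"}
RATINGS = ("Strong", "Mixed", "Weak", "Indeterminate")


def render_report(
    description: str,
    mode: str,
    sections: Dict[str, List[str]],
    assessment: str,
    rating: str,
) -> str:
    """Build the report text with the exact section headers, in order."""
    if rating not in RATINGS:
        raise ValueError(f"rating must be one of {RATINGS}")
    lines = [
        "### ACCURACY EVALUATION REPORT",
        "",
        f"**Content evaluated:** {description}",
        f"**Evaluation mode:** {mode}",
        "**Evaluation date context:** Early 2026",
        "",
    ]
    for header in SECTION_ORDER:
        items = sections.get(header, [])
        if not items and header in SKIPPABLE:
            continue
        lines += [f"#### {header}", ""]
        # Empty sections are stated explicitly instead of being padded.
        lines += [f"- {item}" for item in items] or ["None identified."]
        lines.append("")
    lines += ["#### OVERALL ASSESSMENT", "", f"{assessment} Rating: **{rating}**."]
    return "\n".join(lines)
```

Because the headers come from one constant list, two reports produced this way can be compared section by section regardless of which service generated the findings.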


---

## Domain-Specific Accuracy Guidance for 2026

Apply heightened scrutiny in these fast-moving areas:

**Artificial Intelligence**

- Model capability claims shift rapidly; GPT-4-era assumptions may be obsolete
- Benchmark comparisons from 2023-2024 are often outdated
- Agent, reasoning, and multimodal capabilities have advanced significantly
- Company positions (OpenAI, Anthropic, Google DeepMind, Meta AI, Mistral) have all shifted
- Regulatory landscape (EU AI Act, US executive orders) has evolved

**Technology & Software**

- Version numbers, API structures, and recommended practices change frequently
- Framework popularity (e.g., React, Next.js, LangChain) shifts
- Cloud service offerings and pricing have changed
- Security best practices evolve; old recommendations may now be vulnerabilities

**Geopolitics & Economics**

- Political leadership changes
- Trade relationships and sanctions evolve
- Economic conditions (inflation, interest rates, market levels) shift
- Conflict situations may have changed status

**Science & Health**

- Clinical guidelines update; check for post-2023 revisions
- Consensus on emerging topics (long COVID, GLP-1 drugs, etc.) has developed
- Environmental data and projections are updated regularly

**Business & Companies**

- Company valuations, funding status, and existence change
- Leadership roles change
- Product lines and company strategies evolve
- Acquisitions and mergers shift the landscape

---

## Tone and Style Notes

- Be direct. Do not soften findings to be polite.
- Be fair. Acknowledge what the content gets right before dwelling on errors.
- Be specific. Vague criticism like "this is outdated" is less useful than "X was true in 2023 but Y is now the case."
- Be calibrated. Do not overclaim certainty. Flag your own uncertainty clearly.
- Do not pad. If a section has nothing to report, say "None identified" rather than inventing content.
- Omissions are errors too. A tutorial that is factually correct but missing a foundational step is not "Strong."
- When evaluating a critique, hold it to the same standard as any other content. Incomplete critique is still incomplete.