How We Rate Foods

This page documents exactly how every score on FoodRef.ai is generated, validated, and corrected. It is intentionally specific — the goal is to make our methodology auditable rather than aspirational.

1. AI-generated ratings

Each of the 1,000 foods in our catalogue is evaluated against 11 dietary frameworks: Keto, Paleo, Mediterranean, Vegan, Carnivore, Whole30, DASH, Zone, Low-FODMAP, Anti-Inflammatory, and GLP-1 Friendly. Each evaluation produces five fields: a 1–10 score, a verdict (approve / caution / avoid), a confidence level (high / medium / low), reasoning, and where applicable a dissenting view from within that diet's own community.
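As a rough illustration, the five-field output described above could be modelled as a small validated record. This is a sketch only: the class name `DietRating`, its field names, and the `food` and `diet` keys are assumptions made for illustration, not FoodRef.ai's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# The 11 frameworks named in this section.
DIETS = (
    "Keto", "Paleo", "Mediterranean", "Vegan", "Carnivore", "Whole30",
    "DASH", "Zone", "Low-FODMAP", "Anti-Inflammatory", "GLP-1 Friendly",
)
VERDICTS = ("approve", "caution", "avoid")
CONFIDENCE_LEVELS = ("high", "medium", "low")


@dataclass(frozen=True)
class DietRating:
    """One food×diet rating (hypothetical schema, for illustration only)."""

    food: str
    diet: str
    score: int                              # 1-10
    verdict: str                            # approve / caution / avoid
    confidence: str                         # high / medium / low
    reasoning: str
    dissenting_view: Optional[str] = None   # present only where applicable

    def __post_init__(self) -> None:
        # Reject records that fall outside the ranges the page describes.
        if self.diet not in DIETS:
            raise ValueError(f"unknown diet: {self.diet!r}")
        if not 1 <= self.score <= 10:
            raise ValueError(f"score out of range: {self.score}")
        if self.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict!r}")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence: {self.confidence!r}")


# 1,000 foods rated against 11 frameworks gives 11,000 records.
TOTAL_RATINGS = 1_000 * len(DIETS)
```

Validating at construction time is one possible design choice: a record with an out-of-range score or an unrecognised verdict fails immediately and never reaches a page.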

This produces 11,000 individual food×diet ratings — not one "is it healthy" opinion. The composite score on each food page aggregates these into a single number, and the controversy index measures the spread across the 11 frameworks. A food every framework agrees on (e.g., spinach) has a low controversy index. A food that some frameworks endorse and others reject (e.g., legumes) has a high one.

Ratings are generated by an Anthropic Claude model using diet-specific system prompts that encode each framework's published rules, permitted food lists, and nutritional priorities. The prompts themselves — not just the outputs — are the artifact our reviewing dietitian validates.

2. Expert validation

Our methodology is validated by Liz Cook, MS, RD. Her review covers three layers, each with different scope and a different evidentiary trail.

System prompts. Liz reviewed each of the 11 diet-specific system prompts that drive AI rating generation, line-by-line, for clinical accuracy. Her tracked-changes corrections were incorporated into the live prompts, and the affected diet columns were regenerated against the corrected prompts. Both pre-review and post-review versions of all 11 prompts are preserved in our audit corpus.

Personally-reviewed ratings. A methodologically-sampled subset of generated ratings is personally reviewed for clinical soundness. Liz's sampling protocol: rotate through all 11 diet frameworks evenly, plus all entries flagged as low-confidence by the model, plus all medium-confidence entries in diets where she identified systematic concerns. To date, 283 individual food×diet ratings have been personally vetted under this protocol, about 2.6% of the catalogue's 11,000 ratings.
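The sampling protocol above can be sketched in a few lines. This is one plausible reading, not the procedure Liz actually follows: the function name `select_for_review`, the dictionary keys, the `per_diet_quota` parameter, and the use of a seeded random draw to implement the even rotation are all assumptions.

```python
import random
from collections import defaultdict


def select_for_review(ratings, per_diet_quota, concern_diets, seed=0):
    """Pick ratings for personal review (illustrative sketch).

    ratings: list of dicts with "food", "diet", and "confidence" keys.
    per_diet_quota: extra ratings drawn from each diet (assumed parameter).
    concern_diets: diets where systematic concerns were identified.
    """
    rng = random.Random(seed)
    selected = []
    remaining_by_diet = defaultdict(list)

    for rating in ratings:
        is_low = rating["confidence"] == "low"
        is_flagged_medium = (
            rating["confidence"] == "medium" and rating["diet"] in concern_diets
        )
        if is_low or is_flagged_medium:
            selected.append(rating)          # always reviewed
        else:
            remaining_by_diet[rating["diet"]].append(rating)

    # Even rotation: the same quota from every diet (random draw is assumed).
    for diet in sorted(remaining_by_diet):
        pool = remaining_by_diet[diet]
        selected.extend(rng.sample(pool, min(per_diet_quota, len(pool))))
    return selected


# Coverage figure quoted above: 283 reviewed out of 11,000 ratings.
coverage_pct = round(100 * 283 / 11_000, 1)  # 2.6
```

Because the mandatory entries are set aside before the per-diet draw, no rating is selected twice.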

The 2.6% number deserves context. Methodologically-sampled spot review is the standard quality-assurance method across regulated industries — clinical trials, food safety inspection, financial auditing — precisely because reviewing 100% of a large population is rarely the most informative use of expert time. The right comparison is not "what fraction of our ratings has Liz seen" but "did Liz's sample catch the cases where the AI was wrong." Of the 283 ratings she personally reviewed, she recommended changes to 46. Every one of those 46 recommended changes was adopted into the live database.

Site-wide attestation. Liz attests that the system prompts she reviewed produce ratings consistent with established dietary science across the catalogue, and that her sampling protocol was sufficient to identify any systematic concerns. The full attestation document is available on the Advisory Board page.

"The system's structured approach—applying clearly defined dietary rules across multiple frameworks and generating compatibility scores, verdicts, confidence levels, and supporting rationale—aligns with established nutrition guidance from credible sources and industry experts."

— Liz Cook, MS, RD · Read the full attestation

Personally-reviewed ratings display a "Verified by Liz Cook, MS, RD" badge with the date of review. Other ratings are AI-generated under the reviewed and corrected methodology — they carry a "Reviewed methodology" indicator rather than the verified badge. We expand the personally-reviewed sample over time, prioritizing high-traffic and high-controversy foods.

3. Recipe ratings

FoodRef rates recipes as well as foods. Recipes are scored against the same 11 dietary frameworks, using the same diet-specific system prompts that drive food ratings. Each recipe receives the same five-field output as a food: a 1–10 score per diet, a verdict (approve / caution / avoid), a confidence level, reasoning, and where applicable a dissenting view.

The methodological difference between food and recipe ratings is in the input, not the prompt. A food rating evaluates a single ingredient against a diet's rules. A recipe rating evaluates the dish as a whole — both the full recipe text (ingredient list, preparation, and any user-provided notes) and a structured, normalized ingredient breakdown that we extract during the scan. The model sees both: the dish as written and the dish reduced to its constituent ingredients. Scores reflect the overall dish, including how cooking methods, portion sizes, and ingredient ratios shape the result — a recipe is more than the sum of its parts, and the scoring is designed to reflect that.

The structured-ingredient layer matters for auditability. Before recipe scans extracted ingredients into their own structured records, recipe scoring was effectively a black box: ingredient mentions appeared only inside the model's free-text rationale, where they couldn't be reliably queried or compared across recipes. With structured extraction, the exact ingredient list the model evaluated is persistent, queryable, and visible for any recipe on the site. Disagreements about a score can be traced to the underlying input rather than lost in the model's internal reasoning.

Recipe scoring inherits the prompt-validation layer described in section 2. The 11 diet-specific prompts that drive recipe ratings are the same prompts Liz reviewed line-by-line for clinical accuracy, with her corrections incorporated and the affected diet columns regenerated. The methodological logic carries through: because the rules being applied are the same, prompts that are clinically sound for foods are clinically sound for recipes.

Recipe ratings have not been individually spot-reviewed against Liz's sampling protocol the way food×diet ratings have. The 283 personally-vetted ratings in section 2 are all food entries, not recipes. Expansion of personal spot-review to cover recipe outputs is part of our next prompt-review cycle. In the meantime, the same drift-correction layer described in the next section applies to recipe ratings: if a score is materially wrong on inspection, it is corrected through the same audit trail used for foods.

4. Drift correction

When the AI model regenerates ratings against an updated prompt, a small fraction of outputs disagree with the prompt's clinical intent — usually because the model misses a hidden ingredient in a composite food (Hollandaise sauce contains butter; imitation crab contains starch binders; French toast contains both bread and added sugar). We catch these via diff analysis after each prompt-revision cycle and apply manual corrections.
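A minimal sketch of what such a diff analysis might look like follows. The function names, the dictionary shapes, and the idea of supplying an expected direction per diet are assumptions for illustration; the page does not describe the actual tooling. The sample values at the end are made up and are not site data.

```python
# Ordering used to tell whether a verdict became stricter or looser.
VERDICT_RANK = {"avoid": 0, "caution": 1, "approve": 2}


def verdict_shifts(before, after):
    """List every (food, diet) whose verdict changed between two runs.

    before / after: dicts mapping (food, diet) -> verdict string.
    """
    shifts = []
    for key in sorted(before.keys() & after.keys()):
        old, new = before[key], after[key]
        if old != new:
            looser = VERDICT_RANK[new] > VERDICT_RANK[old]
            shifts.append({
                "food": key[0],
                "diet": key[1],
                "old": old,
                "new": new,
                "direction": "looser" if looser else "stricter",
            })
    return shifts


def needs_manual_check(shifts, expected_direction):
    """Keep shifts that moved against the intent of the prompt revision.

    expected_direction: dict mapping diet -> "stricter" or "looser"
    (a hypothetical input summarising the reviewer's corrections).
    """
    return [
        s for s in shifts
        if s["diet"] in expected_direction
        and s["direction"] != expected_direction[s["diet"]]
    ]


# Made-up example: a revision meant to tighten one diet's rules.
before = {("hollandaise sauce", "Vegan"): "avoid", ("spinach", "Vegan"): "approve"}
after = {("hollandaise sauce", "Vegan"): "caution", ("spinach", "Vegan"): "approve"}
flagged = needs_manual_check(verdict_shifts(before, after), {"Vegan": "stricter"})
```

In this made-up run `flagged` holds the single hollandaise entry, because its verdict loosened while the revision was meant to tighten the rules.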

These corrections are tracked separately in our audit trail. Personally-reviewed ratings carry Liz Cook's name as the reviewer; manual drift corrections carry a distinct reviewer marker so the two layers don't conflate. We've applied 8 such corrections across 1,132 regenerated ratings — a 0.71% drift rate, well within the range we consider acceptable for AI-generated nutritional content. The cumulative ledger of drift corrections is part of our public audit corpus.
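The drift rate quoted above is simple arithmetic on the two counts given. The helper name `drift_rate_pct` is ours, not part of any FoodRef.ai tooling.

```python
def drift_rate_pct(corrections, regenerated):
    """Manual corrections as a percentage of regenerated ratings."""
    return round(100 * corrections / regenerated, 2)


# Figures from the paragraph above: 8 corrections, 1,132 regenerated ratings.
print(f"{drift_rate_pct(8, 1_132)}%")  # 0.71%
```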

5. Re-validation cycle

When Liz identifies updates needed in a system prompt — typically because new clinical guidance has been published or because she's identified a previously-overlooked food category — we incorporate her tracked changes, regenerate the affected diet column against the updated prompt, and re-validate. The most recent cycle (May 2026) regenerated four of the 11 diet columns against substantially-revised prompts, producing 1,132 verdict shifts that were then individually reviewed for directional consistency with Liz's corrections. This loop is documented in our audit corpus and is not a one-time event.

6. Data sources

Nutrition data (calories, macronutrients, micronutrients) is sourced from the USDA FoodData Central database. Diet-specific rules are derived from:

7. Scoring system

Each diet rates a food on a 1–10 scale. Verdicts map to score ranges deterministically: avoid = 1–3, caution = 4–6, approve = 7–10. The composite score on each food page is the mean across all 11 frameworks. The controversy index is the standard deviation of those 11 scores, normalized to a 0–10 scale. Agreement levels are categorized as consensus (controversy < 1.5), majority (1.5–2.5), divided (2.5–3.5), or controversial (> 3.5).
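The rules above can be sketched as follows, with several details filled in by assumption because the page does not state them. The sketch assumes integer scores, a population (not sample) standard deviation, normalization by dividing by 4.5 (the largest standard deviation possible for values bounded between 1 and 10) and scaling to 10, and lower-inclusive category boundaries so that a value of exactly 2.5 or 3.5 counts as "divided". All function names are illustrative.

```python
from statistics import mean, pstdev

# ASSUMPTION: normalize by the maximum possible population standard
# deviation for values in [1, 10], which is (10 - 1) / 2 = 4.5.
MAX_STDEV = 4.5


def verdict_for(score):
    """Map a 1-10 integer score to its verdict: 1-3, 4-6, 7-10."""
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    if score <= 3:
        return "avoid"
    if score <= 6:
        return "caution"
    return "approve"


def composite_score(scores):
    """Mean of the per-diet scores."""
    return mean(scores)


def controversy_index(scores):
    """Standard deviation rescaled to 0-10 (normalization is assumed)."""
    return 10 * pstdev(scores) / MAX_STDEV


def agreement_level(index):
    """Categorize a controversy index (boundary handling is assumed)."""
    if index < 1.5:
        return "consensus"
    if index < 2.5:
        return "majority"
    if index <= 3.5:
        return "divided"
    return "controversial"
```

Under these assumptions a food scored identically by all 11 frameworks has an index of 0, and a food split between 1s and 10s approaches the top of the scale.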

8. Updates and corrections

Ratings are regenerated when (a) the underlying system prompts are updated by Liz's review, (b) new clinical guidance changes a diet's published rules, or (c) a user-submitted correction is validated. Every food page shows a "Last reviewed" date. If you believe a rating is incorrect, tell us — we investigate all reported inaccuracies and update ratings within 48 hours when a correction is warranted.

What this page does not claim

We do not claim that every rating on FoodRef.ai has been individually reviewed by a registered dietitian. 2.6% of the catalogue has been; the remainder is AI-generated under a reviewed methodology and corrected for drift. We do not claim that AI-generated nutritional ratings are equivalent to personalized nutrition counseling. We do not claim that any single dietary framework is correct for any specific person — that's a question for you and a qualified clinician.

Important Disclaimer

FoodRef.ai provides informational content about how different dietary frameworks evaluate foods. It is not medical advice. Always consult a qualified healthcare professional before making significant changes to your diet, especially if you have medical conditions, allergies, or are taking medication.