How We Rank
Every Calorie Rankings review and ranking is scored on the same 100-point rubric. The protocol is published below in enough detail that an outside party could replicate it.
The 100-point rubric
| Criterion | Weight | What we measure |
|---|---|---|
| Accuracy & Database | 25% | Per-entry verification, coverage, freshness, noise resilience |
| Logging Ease | 20% | Median time-to-log across a 20-task battery; friction; recall efficiency |
| AI Photo Recognition | 15% | Top-1/top-3 identification, portion MAPE, plate segmentation |
| Macro & Goal Tracking | 15% | Macro depth, target flexibility, adaptive coaching algorithms |
| Insights & Reports | 10% | Trend analysis, exportability, biometric/lab data integration |
| Value & Price | 10% | Real 12-month cost vs feature delivery; free-tier usefulness |
| Privacy & Transparency | 5% | Data handling, disclosure clarity, cancellation friction |
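As a rough illustration of how the weights in the table combine, the sketch below assumes each criterion is first scored on its own 0-100 sub-scale and then weighted by the percentages above. The function name `composite_score` and the 0-100 sub-scale are assumptions made for this example, not the published scoring code.

```python
# Weights copied from the rubric table, in percent of the 100-point total.
WEIGHTS = {
    "Accuracy & Database": 25,
    "Logging Ease": 20,
    "AI Photo Recognition": 15,
    "Macro & Goal Tracking": 15,
    "Insights & Reports": 10,
    "Value & Price": 10,
    "Privacy & Transparency": 5,
}


def composite_score(subscores: dict[str, float]) -> float:
    """Combine per-criterion sub-scores (assumed 0-100 each) into a 0-100 total."""
    if set(subscores) != set(WEIGHTS):
        raise ValueError("sub-scores must cover exactly the seven rubric criteria")
    # Each criterion contributes weight% of its sub-score to the total.
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS) / 100
```

Under these assumptions, an app that scores 80 on Accuracy & Database and zero everywhere else would total 20 points, since that criterion carries 25% of the weight.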
How we measure accuracy
The accuracy criterion (25% of the 100-point total) is anchored to Mean Absolute Percentage Error (MAPE) against weighed reference meals. Each reference meal is built from USDA FoodData Central composition values, with every ingredient weighed on a calibrated kitchen scale (0.1 g precision). For each app, we compute the MAPE of its predicted kcal against the reference value across the full battery of reference meals.
Scoring anchor: accuracy_points = clamp(100 − MAPE × 4, 0, 100), with MAPE expressed in percentage points.
A 5% MAPE earns 80 points; 15% MAPE earns 40; 25%+ earns zero. The slope was chosen so an app at the boundary of clinical usefulness (~5% MAPE per Schoeller 1995) gets a strong but not perfect score.
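The scoring anchor can be sketched in a few lines. This is a minimal illustration, not the scoring code promised for GitHub; the function names `mape` and `accuracy_points` and the list-based inputs are assumptions made for the example.

```python
def mape(predicted_kcal: list[float], reference_kcal: list[float]) -> float:
    """Mean Absolute Percentage Error across the meal battery, in percent."""
    if not reference_kcal or len(predicted_kcal) != len(reference_kcal):
        raise ValueError("need one prediction per reference meal")
    # Absolute error of each meal, relative to its weighed reference value.
    errors = [abs(p - r) / r for p, r in zip(predicted_kcal, reference_kcal)]
    return 100 * sum(errors) / len(errors)


def accuracy_points(mape_percent: float) -> float:
    """clamp(100 - MAPE * 4, 0, 100), with MAPE in percentage points."""
    return max(0.0, min(100.0, 100 - 4 * mape_percent))
```

With this sketch, `accuracy_points(5)` gives 80, `accuracy_points(15)` gives 40, and any MAPE of 25 or more is clamped to zero, matching the anchor points quoted above.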
Sample size, equipment model numbers, and the full reference-meal list will be published as a downloadable CSV alongside the first batch of benchmark reviews. The scoring code will be on GitHub.
How we score database quality
Database quality is measured on three sub-dimensions: coverage (a sampled 50-item probe across single ingredients, composed plates, and regional dishes), verification (4-tier grading: USDA / manufacturer label / verified user / unverified user), and noise resilience (how often a common-foods search surfaces a usable result in the top three hits).
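The noise-resilience measure described above is a top-three hit rate. A minimal sketch follows, assuming each search is recorded as a ranked list of booleans marking whether each result was judged usable; that recording format and the name `noise_resilience` are hypothetical.

```python
def noise_resilience(ranked_usable_flags: list[list[bool]]) -> float:
    """Fraction of searches with at least one usable result in the top three hits."""
    if not ranked_usable_flags:
        raise ValueError("need at least one search")
    # A search counts as a hit if any of its first three results is usable.
    hits = sum(1 for flags in ranked_usable_flags if any(flags[:3]))
    return hits / len(ranked_usable_flags)
```

A search whose first usable result sits in fourth place or lower counts as a miss, however good that result is.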
How we score logging ease
Logging Ease (20% weight) is measured as the median time-to-log across a standardized 20-task battery covering five input modes:
- Barcode scan → logged: target ≤ 10 seconds
- Search common food → logged: target ≤ 20 seconds
- Photo AI → logged (where supported): target ≤ 15 seconds
- Custom food entry (first-time): target ≤ 60 seconds
- Re-log a recent meal: target ≤ 5 seconds
An entry logged incorrectly (wrong food, wrong portion) counts as infinite time for that task, so it pushes the median upward rather than rewarding a fast but wrong log: speed without accuracy doesn't earn points.
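The infinite-time rule fits naturally with a median. The sketch below assumes each task is recorded as a `(seconds, logged_correctly)` pair; that record shape and the function name `median_time_to_log` are hypothetical, and the conversion from median time to rubric points is not shown because it is not specified here.

```python
import math
from statistics import median


def median_time_to_log(tasks: list[tuple[float, bool]]) -> float:
    """Median time-to-log in seconds, with incorrect logs counted as infinite time."""
    # A wrong food or wrong portion turns the task's time into infinity.
    times = [seconds if correct else math.inf for seconds, correct in tasks]
    return median(times)
```

One consequence of using a median: a single mis-logged task only shifts the result toward the slower tasks, but once about half of the battery is logged incorrectly the median itself becomes infinite.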
How we score AI photo recognition
For each AI-photo-capable app we run a 30-plate photo battery across three lighting conditions, three angles, and three plate sizes. Sub-scoring:
- Top-1 identification correctness (40 of 100 AI-subscore points)
- Top-3 identification correctness (20)
- Portion-size MAPE (30)
- Plate segmentation accuracy on multi-item plates (10)
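The four sub-score weights above can be combined as follows. This sketch assumes each component has already been normalized to a 0-1 fraction; how portion-size MAPE is mapped onto that range is not specified here, so it is left as an input. The names `AI_WEIGHTS` and `ai_subscore` are hypothetical.

```python
# Points available per component, from the list above (sums to 100).
AI_WEIGHTS = {
    "top1": 40,
    "top3": 20,
    "portion": 30,
    "segmentation": 10,
}


def ai_subscore(components: dict[str, float]) -> float:
    """Combine normalized component scores (assumed 0-1 each) into a 0-100 AI sub-score."""
    if set(components) != set(AI_WEIGHTS):
        raise ValueError("components must be top1, top3, portion, segmentation")
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction between 0 and 1")
    return sum(AI_WEIGHTS[name] * components[name] for name in AI_WEIGHTS)
```

For example, an app that identifies the top-1 item correctly on half the plates and is perfect on the other three components would earn 80 of the 100 AI sub-score points under these assumptions.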
How we score macro & goal tracking
Macro tracking (15% weight) covers four sub-dimensions: macro display depth (calories, protein/carbs/fat, net carbs, and fiber as first-class metrics), target-setting flexibility (custom per-macro targets, time-windowed targets), adaptive coaching algorithms (TDEE estimation, weekly target adjustment), and recipe builder quality.
How we score value & price
Value (10% weight) is computed as feature density per dollar of real 12-month cost. A useful free tier raises the value sub-score; aggressive trial-conversion pricing lowers it.
How we score privacy & transparency
Privacy (5% weight) is graded on data handling disclosure clarity, retention policy transparency, ease of data export and deletion, cancellation friction, and whether the product's monetization model creates conflicts of interest with user advice quality.
Test cadence
Top-tier apps are re-tested quarterly. Mid-tier apps are re-tested semi-annually. A vendor release that changes core methodology, database source, or photo-AI model triggers a 30-day re-test window.
Quality control
Until we publish named contributor bios, all writing and scoring is done by the editorial group and reviewed against the test data before publication. Substantive corrections are logged with date and reason (see our corrections policy).
How we use AI
We use AI tools for research summarization and copy editing. AI does not write reviews, does not generate scores, and is never the source of a factual claim. Full disclosure: how we use AI.
Why we don't take affiliate money
We don't maintain affiliate accounts with any of the apps we cover. Our reasoning is documented in our no-affiliate disclosure.