
Review-Trust Pipeline: how we make reviews reliable
Reliable review analysis requires transparency. At Collected.reviews, we use our own method: the Review-Trust Pipeline. It filters out noise, detects manipulation, and weights each review by its reliability, so that every theme score truly means something. Below, you can read how it works – with concrete data.
Dataset
For this analysis, we used the dataset EU Retail Reviews v1.3, containing a total of 182,450 reviews (169,732 unique after deduplication). The period covers January 1 through September 30, 2025, with data from the Netherlands, Germany, Belgium, and Austria, in the languages NL, DE, and EN. The analysis was conducted using pipeline version 2.4.0.
Why this is necessary
Not all reviews are equally valuable. We identify three structural issues:
- Manipulation – spikes in short periods, copied texts, or reward campaigns.
- Noise – incomplete sentences, duplicate submissions, non-experiential opinions.
- Bias – mostly extreme experiences are shared, or platforms moderate selectively.
To correct for such distortions, we evaluate each review based on six signals.
The five steps of our pipeline
- Intake & normalization – All reviews are converted into a uniform schema (text, date, star rating, metadata). Exact duplicates are removed.
- Identity & behavior – Account age, posting frequency, device patterns, and timing clusters (where the source allows).
- Text signals – Semantic repetition, template phrases, and extreme sentiment without details.
- Incentive detection – Language indicating benefit (discount, cashback, gift card) → label “incentivized.”
- Weighting & normalization – Each review receives a trust score (0–1). Theme scores are weighted and time-corrected (recent > old). (A code sketch of these steps follows the note below.)
Important: We never delete anything arbitrarily; we evaluate it. Transparency over censorship.
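To make step 1 concrete, here is a minimal sketch of intake and exact-duplicate removal, with placeholders for the signal-extraction steps. The Review schema, the field names, and the duplicate key (normalized text plus stars and day) are illustrative assumptions, not our production code.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical uniform schema for step 1; field names are illustrative.
@dataclass
class Review:
    text: str
    date: datetime
    stars: int
    metadata: dict = field(default_factory=dict)
    signals: dict = field(default_factory=dict)   # filled in steps 2-4
    trust: Optional[float] = None                 # filled in step 5

def intake(raw_reviews: list[Review]) -> list[Review]:
    """Step 1: normalize text and drop exact duplicates.
    The duplicate key (whitespace-normalized text + stars + day) is an assumption."""
    seen, unique = set(), []
    for r in raw_reviews:
        key = (" ".join(r.text.lower().split()), r.stars, r.date.date())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def extract_signals(review: Review) -> Review:
    """Steps 2-4: placeholder signal extraction (identity, text, incentive)."""
    review.signals = {
        "near_duplicate": 0.0,  # semantic overlap with other reviews
        "timing_spike": 0.0,    # part of a submission cluster vs. baseline
        "incentive": 0.0,       # benefit language detected
        "template": 0.0,        # repetition score
        "no_detail": 0.0,       # extreme sentiment without facts
        "account": 0.0,         # young account + high output
    }
    return review
```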
Key signals and thresholds
Signal | Threshold | Effect
Duplicate / near-duplicate | ≥ 0.88 semantic overlap | lower trust
Timing spike | peak within 12 hours vs. baseline | lower weighting
Incentive language | word list + context | label “incentivized”
Template phrases | repetition score > 0.75 | lower trust
Lack of detail | extreme sentiment without facts | lower trust
Account signals | young account + high output | lower trust
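As an illustration of how these thresholds could be applied, the sketch below turns raw scores into boolean flags. The helper inputs (semantic overlap, repetition score, spike detection) and the account cutoffs are assumptions; only the 0.88 and 0.75 thresholds and the word-list terms come from the table above.

```python
# Thresholds from the table above; all other cutoffs are assumptions.
DUP_THRESHOLD = 0.88         # semantic overlap for near-duplicates
TEMPLATE_THRESHOLD = 0.75    # repetition score for template phrases
INCENTIVE_TERMS = {"discount", "cashback", "gift card"}  # simplified word list

def flag_signals(semantic_overlap: float, repetition: float, in_spike: bool,
                 text: str, extreme_sentiment: bool, has_facts: bool,
                 account_age_days: int, reviews_last_day: int) -> dict:
    """Map raw scores onto the six signals as boolean flags."""
    lowered = text.lower()
    return {
        "near_duplicate": semantic_overlap >= DUP_THRESHOLD,
        "timing_spike": in_spike,                        # peak within 12 h vs. baseline
        "incentivized": any(t in lowered for t in INCENTIVE_TERMS),
        "template": repetition > TEMPLATE_THRESHOLD,
        "no_detail": extreme_sentiment and not has_facts,
        "account": account_age_days < 30 and reviews_last_day > 10,  # assumed cutoffs
    }
```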
Weighting model
Each component receives a weight; the formula in short:
trust = 1 − (0.35D + 0.20S + 0.20I + 0.10T + 0.10P + 0.05A)

Component | Symbol | Weight
Duplicate / near-dup | D | 0.35
Timing spike | S | 0.20
Incentive language | I | 0.20
Template phrases | T | 0.10
Lack of detail | P | 0.10
Account signals | A | 0.05
Time decay | λ | 0.015
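Written out as code, the weighting looks roughly like this. Treating each component as a value in [0, 1], reading λ = 0.015 as a per-day exponential decay rate, and combining trust with the time weight in the theme average are interpretive assumptions, not a published specification.

```python
import math

WEIGHTS = {"D": 0.35, "S": 0.20, "I": 0.20, "T": 0.10, "P": 0.10, "A": 0.05}
LAMBDA = 0.015  # time decay; read here as a per-day rate (assumption)

def trust_score(D: float, S: float, I: float, T: float, P: float, A: float) -> float:
    """trust = 1 - (0.35*D + 0.20*S + 0.20*I + 0.10*T + 0.10*P + 0.05*A),
    with each component assumed to lie in [0, 1]."""
    penalty = (WEIGHTS["D"] * D + WEIGHTS["S"] * S + WEIGHTS["I"] * I
               + WEIGHTS["T"] * T + WEIGHTS["P"] * P + WEIGHTS["A"] * A)
    return max(0.0, 1.0 - penalty)

def time_weight(age_days: float) -> float:
    """Recent > old: exponential decay with lambda = 0.015 (interpretation assumed)."""
    return math.exp(-LAMBDA * age_days)

def weighted_theme_score(reviews: list[tuple[float, float, float]]) -> float:
    """Weighted average of theme scores, each review weighted by trust x time weight.
    Each tuple is (theme_score, trust, age_days)."""
    num = sum(score * trust * time_weight(age) for score, trust, age in reviews)
    den = sum(trust * time_weight(age) for _, trust, age in reviews)
    return num / den if den else 0.0
```

For example, a review flagged only as a full near-duplicate (D = 1, all other components 0) would end up at 1 − 0.35 = 0.65 in this sketch.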
Mini results (Q1–Q3 2025)
Metric | Value
Share of near-duplicates | 6.8%
Share of incentivized reviews | 12.4%
Median trust score | 0.73
Average theme score correction | +4.6 points
Detected spike events | 89
This correction ensures more representative theme scores. A sector with many promotions is no longer artificially positive.
Example cases
Case | Signal | Effect on trust
C-1274 | 35 identical sentence parts within 2 hours | −0.22
C-2091 | Coupon mention + referral link | −0.18
C-3310 | 40 reviews from a new account within 24 hours | −0.26
Normalization and reporting
After weighting, we first normalize per platform (to compensate for moderation differences) and then cross-platform via z-score, so that all results appear on a single scale (0–100). On the company page, we display:
- weighted theme scores,
- sentiment distribution,
- reliability band (CI),
- share of incentivized reviews.
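The two-stage normalization described above can be sketched as follows. Centering per platform, pooling into a single z-score, and mapping to 0–100 via a clipped linear transform are assumptions about the exact transforms, which the text does not spell out.

```python
import statistics

def center_per_platform(scores: list[float]) -> list[float]:
    """Per-platform step: center on the platform mean to offset moderation
    differences (the exact per-platform transform is an assumption)."""
    mean = statistics.fmean(scores)
    return [s - mean for s in scores]

def zscore(pooled: list[float]) -> list[float]:
    """Cross-platform z-score over the pooled, platform-centered scores."""
    mean = statistics.fmean(pooled)
    sd = statistics.pstdev(pooled) or 1.0
    return [(s - mean) / sd for s in pooled]

def to_0_100(z: float, clip: float = 3.0) -> float:
    """Map a z-score onto the reported 0-100 scale (clipping choice assumed)."""
    z = max(-clip, min(clip, z))
    return (z + clip) / (2 * clip) * 100

# Example: center two platforms separately, then pool, z-score, and rescale.
platform_a = center_per_platform([62, 71, 68, 80])
platform_b = center_per_platform([45, 50, 58, 49])
scaled = [round(to_0_100(z), 1) for z in zscore(platform_a + platform_b)]
```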
Limitations
- Not every platform provides device or account data.
- Short reviews remain difficult to evaluate.
- Source bias: audience per source may differ from the actual customer base.
- Irony or sarcasm is not always accurately detected.
That’s why we report with margins and definitions instead of absolute truths.
What this means for you
For consumers
Trust patterns, not outliers. Check labels like “incentivized” and “low repetition.”
For companies
Address themes with high impact & low trust (e.g., billing or delivery time) for quick improvements.