About
news-please is an open-source news crawler used to build journalism and media datasets for NLP and LLM training. It is used extensively in academic research and commercial AI projects to collect article text from news sites.
Purpose
News and media dataset collection for NLP/LLM training
User Agent String
news-please/1.5
How to Control in robots.txt
🚫 Block NewsPlease
User-agent: NewsPlease
Disallow: /
✅ Allow NewsPlease
User-agent: NewsPlease
Allow: /
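To sanity-check rules like these before deploying them, a minimal Python sketch using the standard library's urllib.robotparser is shown here. The helper name is_allowed, the constants BLOCK_RULES and ALLOW_RULES, and the example.com URL are illustrative assumptions, not part of news-please or of any scanning tool. The result reflects how Python's parser matches user-agent tokens; other crawlers may match differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies mirroring the block and allow rules.
BLOCK_RULES = """\
User-agent: NewsPlease
Disallow: /
"""

ALLOW_RULES = """\
User-agent: NewsPlease
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    print("block rules:", is_allowed(BLOCK_RULES, "NewsPlease", url))
    print("allow rules:", is_allowed(ALLOW_RULES, "NewsPlease", url))
```

Because the rules are parsed from a string, the check can run locally against a draft robots.txt before it is published. Crawlers not named in the file are unaffected by these rules.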
Is NewsPlease crawling your site?
Run a free scan to check whether the NewsPlease crawler is accessing your website.