About
img2dataset is an open-source tool for downloading and resizing large image datasets for AI training. It is widely used in the ML community to build CLIP training sets and image-generation datasets. When run against a site, it downloads images in bulk and does not identify itself as a standard browser.
Purpose
Bulk image dataset collection for AI training
User Agent String
img2dataset
How to Control in robots.txt
🚫 Block img2dataset
User-agent: img2dataset
Disallow: /
✅ Allow img2dataset
User-agent: img2dataset
Allow: /
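Before deploying rules like these, it can help to check how a standards-following parser reads them. The sketch below uses Python's standard-library urllib.robotparser. The helper name is_allowed and the example.com URL are illustrative assumptions, and the result only shows what a compliant client would conclude, not what img2dataset itself does.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets from this page, written one directive per line.
BLOCK_RULES = """\
User-agent: img2dataset
Disallow: /
"""

ALLOW_RULES = """\
User-agent: img2dataset
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant parser would let user_agent fetch url.

    is_allowed is a hypothetical helper written for this example.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/images/cat.jpg"  # placeholder URL
    print("block rules:", is_allowed(BLOCK_RULES, "img2dataset", url))
    print("allow rules:", is_allowed(ALLOW_RULES, "img2dataset", url))
```

Agents that are not named in the file fall through to the parser's default, which is to allow the request.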
⚠️ img2dataset has been observed ignoring robots.txt directives. You may need server-level blocking (e.g., firewall rules or user-agent filtering) to effectively prevent access.
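As a rough illustration of user-agent filtering, the sketch below wraps a Python WSGI application and answers 403 Forbidden whenever the User-Agent header contains a blocked token. The names block_user_agents and BLOCKED_TOKENS are assumptions made for this example, and the case-insensitive substring match is a design choice of the sketch rather than documented img2dataset behavior. The same idea is usually applied at the web server or firewall layer.

```python
# Tokens to reject. The single entry is the user agent string listed on this page.
BLOCKED_TOKENS = ("img2dataset",)


def block_user_agents(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests with a blocked User-Agent get a 403.

    block_user_agents is a hypothetical helper, not part of any library.
    """
    lowered = tuple(token.lower() for token in blocked_tokens)

    def middleware(environ, start_response):
        # WSGI exposes the User-Agent request header as HTTP_USER_AGENT.
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            body = b"Forbidden"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware
```

A filter like this only catches requests that actually carry the token. A client that sends a different User-Agent header will pass through, so it is best treated as one layer alongside rate limiting or firewall rules.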