Information for Website Administrators
MapTheNet operates a web crawler as part of an open research project to understand and document the structure of the internet. Our bot systematically explores publicly accessible websites to build a comprehensive map of how the web is organized and interconnected.
We aim to create a public resource for researchers, educators, and internet enthusiasts to understand the web's topology, thematic organization, and how information flows across different domains and regions.
The crawler is designed to be respectful, well-behaved, and transparent in its operations. We strictly adhere to web standards including robots.txt, implement rate limiting, and collect only publicly accessible metadata.
All collected data is aggregated and only the metadata is stored. We do not publish individual page content or create searchable copies of websites.
Our crawler identifies itself with the following User-Agent string:
MapTheNetBot/0.1 (+https://mapthenet.org/crawler)
| Parameter | Value |
|---|---|
| Request Rate | 1 request per second per domain (default) |
| Pages per Domain | Maximum 250 pages |
| Redirects | Follows up to 5 redirects |
| Timeout | 30 seconds per request |
While we hope you'll allow our research crawler to visit your site, we respect your right to control access. Here are several methods to block our bot:
Add the following to your robots.txt file to block the crawler entirely:
User-agent: MapTheNetBot
Disallow: /
Or to allow crawling but request a slower rate:
User-agent: MapTheNetBot
Crawl-delay: 10
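You can verify how these rules will be interpreted using Python's standard-library robots.txt parser (the small `parse_rules` helper is ours, for brevity):

```python
from urllib import robotparser


def parse_rules(text: str) -> robotparser.RobotFileParser:
    rp = robotparser.RobotFileParser()
    rp.parse(text.splitlines())
    return rp

# Full block: MapTheNetBot is denied; other agents are unaffected.
block = parse_rules("User-agent: MapTheNetBot\nDisallow: /\n")
print(block.can_fetch("MapTheNetBot/0.1", "https://example.com/page"))  # False
print(block.can_fetch("SomeOtherBot/1.0", "https://example.com/page"))  # True

# Slower rate: crawling stays allowed, with a 10-second delay requested.
slow = parse_rules("User-agent: MapTheNetBot\nCrawl-delay: 10\n")
print(slow.crawl_delay("MapTheNetBot/0.1"))  # 10
```

Note that the `User-agent` rule matches on the bot name before the `/`, so it applies to any MapTheNetBot version.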
Add to your .htaccess file:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} MapTheNetBot [NC]
RewriteRule .* - [F,L]
Add to your Nginx server configuration:
if ($http_user_agent ~* "MapTheNetBot") {
    return 403;
}
Add to the <head> section of your HTML:
<meta name="robots" content="noindex, nofollow">
Your website becomes part of documenting internet history and structure. Here's why it matters:
By allowing our crawler, you're contributing to a broader understanding of how the web connects and evolves over time.
No. Our crawler is rate-limited to 1 request per second by default and respects any crawl-delay you specify in robots.txt. We also limit ourselves to 250 pages maximum per domain.
We primarily crawl standard HTML content. JavaScript-rendered content may not be fully indexed as our focus is on traditional link relationships and metadata.
Re-crawling is planned for the future but is not currently active. Once the bot has collected a maximum of 250 pages from your domain, it stops visiting further pages.
Not yet. However, we plan to release both a map of domain interconnectivity and a metadata-only version of our database. Individual page content is not published or made searchable.
Contact us at the email address below and we'll remove your domain from our crawl queue and existing database.
Yes, we respect all standard meta robots tags and link attributes including rel="nofollow".
No, MapTheNet is a non-commercial research project. We do not sell data or provide commercial services.
Once our data explorer is available, you'll be able to search for your domain and view the metadata we've collected. Until then, contact us directly.
Questions, concerns, or want to report an issue with our crawler?
Email: [email protected]
GitHub: github.com/mapthenet
Main Website: mapthenet.org
We typically respond to inquiries within 48 hours.