MapTheNet Crawler

Information for Website Administrators

About Our Crawler

MapTheNet operates a web crawler as part of an open research project to understand and document the structure of the internet. Our bot systematically explores publicly accessible websites to build a comprehensive map of how the web is organized and interconnected.

Our Mission

We aim to create a public resource for researchers, educators, and internet enthusiasts to understand the web's topology, thematic organization, and how information flows across different domains and regions.

The crawler is designed to be respectful, well-behaved, and transparent in its operations. We strictly adhere to web standards including robots.txt, implement rate limiting, and collect only publicly accessible metadata.

What Data We Collect

Information We Collect

Information We Do NOT Collect

All collected data is aggregated, and only metadata is stored. We do not publish individual page content or create searchable copies of websites.

Technical Specifications

Bot Identification

Our crawler identifies itself with the following User-Agent string:

MapTheNetBot/0.1 (+https://mapthenet.org/crawler)
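For administrators who want to spot the crawler in access logs, the following is a minimal Python sketch of matching the User-Agent string above. The helper name parse_mapthenet_version and the regular expression are illustrative assumptions, not part of the MapTheNet project. Keep in mind that any client can send this string, so a match alone does not prove a request came from our crawler.

```python
import re
from typing import Optional

# Assumption: the product token keeps the form "MapTheNetBot/<version>".
BOT_PATTERN = re.compile(r"\bMapTheNetBot/(\d+(?:\.\d+)*)", re.IGNORECASE)


def parse_mapthenet_version(user_agent: str) -> Optional[str]:
    """Return the crawler version if the User-Agent names MapTheNetBot, else None."""
    match = BOT_PATTERN.search(user_agent)
    return match.group(1) if match else None


print(parse_mapthenet_version("MapTheNetBot/0.1 (+https://mapthenet.org/crawler)"))  # 0.1
print(parse_mapthenet_version("Mozilla/5.0 (X11; Linux x86_64)"))                    # None
```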

Crawler Behavior

Parameter          Value
Request Rate       1 request per second per domain (default)
Pages per Domain   Maximum 250 pages
Redirects          Follows up to 5 redirects
Timeout            30 seconds per request
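To make the request rate and page cap concrete, here is a small Python sketch of how such per-domain limits could be tracked. It is an illustration under stated assumptions, not the crawler's actual code: the class DomainBudget and its method names are hypothetical, and timestamps are passed in explicitly so the logic can be checked without waiting.

```python
from dataclasses import dataclass


@dataclass
class DomainBudget:
    """Politeness limits for one domain (hypothetical illustration)."""
    min_interval: float = 1.0          # seconds between requests (1 request/second)
    max_pages: int = 250               # page cap per domain
    pages_fetched: int = 0
    last_request: float = float("-inf")

    def may_fetch(self, now: float) -> bool:
        """True if the page cap is not reached and enough time has passed."""
        if self.pages_fetched >= self.max_pages:
            return False
        return now - self.last_request >= self.min_interval

    def record_fetch(self, now: float) -> None:
        """Register one completed request at time `now` (in seconds)."""
        self.pages_fetched += 1
        self.last_request = now


budget = DomainBudget()
budget.record_fetch(0.0)
print(budget.may_fetch(0.5))  # False: less than 1 second since the last request
print(budget.may_fetch(1.0))  # True
```

Under these defaults, 250 requests are separated by 249 one-second intervals, so visiting the maximum number of pages on one domain takes a little over four minutes at the earliest.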

Standards Compliance

How to Block Our Crawler

While we hope you'll allow our research crawler to visit your site, we respect your right to control access. Here are several methods to block our bot:

Method 1: robots.txt (Recommended)

Add the following to your robots.txt file to block the crawler entirely:

User-agent: MapTheNetBot
Disallow: /

Or, to allow crawling but request a slower rate (here, at most one request every 10 seconds):

User-agent: MapTheNetBot
Crawl-delay: 10
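Before deploying either rule set, you can check how a standards-following parser reads it. The sketch below uses Python's standard urllib.robotparser module. The helper check_rules is a hypothetical name, and the result shows how Python's parser interprets the rules, which we assume (but cannot guarantee) matches the crawler's own parser.

```python
from urllib.robotparser import RobotFileParser


def check_rules(robots_txt: str, path: str = "/"):
    """Return (allowed, crawl_delay) for MapTheNetBot under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch("MapTheNetBot", path)
    delay = parser.crawl_delay("MapTheNetBot")  # None when no Crawl-delay is set
    return allowed, delay


block_all = "User-agent: MapTheNetBot\nDisallow: /\n"
slow_down = "User-agent: MapTheNetBot\nCrawl-delay: 10\n"

print(check_rules(block_all))  # (False, None)
print(check_rules(slow_down))  # (True, 10)
```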

Method 2: Apache Configuration

Add to your .htaccess file (requires mod_rewrite to be enabled):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} MapTheNetBot [NC]
RewriteRule .* - [F,L]

Method 3: Nginx Configuration

Add inside the relevant server block of your Nginx configuration:

if ($http_user_agent ~* "MapTheNetBot") {
    return 403;
}

Method 4: Meta Robots Tags

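Add to the <head> section of each page you want excluded. Note that name="robots" addresses every crawler that honors the tag, including search engines, not only MapTheNetBot: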

<meta name="robots" content="noindex, nofollow">

Why Allow Our Crawler

Please Consider Allowing Our Bot

Your website becomes part of documenting internet history and structure. Here's why it matters:

By allowing our crawler, you're contributing to a broader understanding of how the web connects and evolves over time.

Frequently Asked Questions

Will this bot slow down my website?

No. Our crawler is rate-limited to 1 request per second per domain by default and respects any Crawl-delay you specify in robots.txt. We also fetch a maximum of 250 pages per domain.

Do you crawl dynamic or JavaScript-rendered content?

We primarily crawl standard HTML content. JavaScript-rendered content may not be fully indexed as our focus is on traditional link relationships and metadata.

How often do you re-crawl websites?

Re-crawling is planned for the future, but it is not currently active.

Therefore, after collecting a maximum of 250 pages from your domain, the bot stops visiting it.

Is the collected data publicly available?

Not yet. However, we plan to release both a map of the interconnectivity of domains and our database, which contains metadata only. Individual page content is not published or made searchable.

What if I want my site removed from your database?

Contact us at the email address below and we'll remove your domain from our crawl queue and existing database.

Do you honor nofollow links?

Yes, we respect all standard meta robots tags and link attributes including rel="nofollow".
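As an illustration of what honoring these signals can look like, here is a minimal Python sketch that extracts only the links a polite crawler would follow. It is not our production code: the class FollowableLinkParser and the function followable_links are hypothetical names, and the sketch handles only the robots meta tag and the rel attribute on anchor elements.

```python
from html.parser import HTMLParser


class FollowableLinkParser(HTMLParser):
    """Collects hrefs, skipping links marked rel="nofollow" (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # set by <meta name="robots" content="...nofollow...">
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            directives = {d.strip().lower() for d in content.split(",")}
            if "nofollow" in directives or "none" in directives:
                self.page_nofollow = True
        elif tag == "a" and attr.get("href"):
            rel_tokens = (attr.get("rel") or "").lower().split()
            if "nofollow" not in rel_tokens:
                self.links.append(attr["href"])


def followable_links(html: str) -> list:
    """Return the links that may be followed; empty if the page says nofollow."""
    parser = FollowableLinkParser()
    parser.feed(html)
    parser.close()
    return [] if parser.page_nofollow else parser.links


page = '<a href="/a">A</a> <a rel="nofollow" href="/b">B</a>'
print(followable_links(page))  # ['/a']
```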

Are you affiliated with any commercial entity?

No, MapTheNet is a non-commercial research project. We do not sell data or provide commercial services.

Can I see what data you have about my domain?

Once our data explorer is available, you'll be able to search for your domain and view the metadata we've collected. Until then, contact us directly.

Contact Information

Do you have questions or concerns, or want to report an issue with our crawler?

Email: [email protected]
GitHub: github.com/mapthenet
Main Website: mapthenet.org

We typically respond to inquiries within 48 hours.