MapTheNet Crawler Information

About Our Crawler

MapTheNet operates a web crawler as part of an open research project to understand and document the structure of the internet. Our bot systematically explores publicly accessible websites to build a comprehensive map of how the web is organized and interconnected.

Our Mission

We aim to create a public resource for researchers, educators, and internet enthusiasts to understand the web's topology, thematic organization, and how information flows across different domains and regions.

The crawler is designed to be respectful, well-behaved, and transparent in its operations. We strictly adhere to web standards including robots.txt, implement rate limiting, and collect only publicly accessible metadata.

What Data We Collect

Information We Collect

Domain names and public URLs
Link relationships between websites
Page titles and meta descriptions
Domain categories and classifications
Language codes (detected from content)
Geographic information (derived from TLD or GeoIP)

Information We Do NOT Collect

Personal information or user data
Form submissions or POST data
Cookies or session information
Content behind login pages or authentication
Private or password-protected areas
Full page content (only metadata for classification)
Email addresses or contact information

All collected data is aggregated, and only metadata is stored. We do not publish individual page content or create searchable copies of websites.

Technical Specifications

Bot Identification

Our crawler identifies itself with the following User-Agent string:

MapTheNetBot/0.1 (+https://mapthenet.org/crawler)
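If you want to recognize the crawler in your own logs or application code, a case-insensitive match on the product token is usually enough. The Python sketch below is illustrative only: the helper name `is_mapthenet_bot` is hypothetical, and because User-Agent strings can be spoofed, a match should be treated as a hint rather than proof.

```python
import re

# Product token taken from the User-Agent string above. The version
# number is deliberately not matched, since it may change over time.
_BOT_TOKEN = re.compile(r"\bMapTheNetBot\b", re.IGNORECASE)


def is_mapthenet_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header names MapTheNetBot (hypothetical helper)."""
    return bool(_BOT_TOKEN.search(user_agent or ""))


print(is_mapthenet_bot("MapTheNetBot/0.1 (+https://mapthenet.org/crawler)"))  # True
print(is_mapthenet_bot("Mozilla/5.0 (X11; Linux x86_64)"))                    # False
```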

Crawler Behavior

Parameter	Value
Request Rate	1 request per second per domain (default)
Pages per Domain	Maximum 250 pages
Redirects	Follows up to 5 redirects
Timeout	30 seconds per request
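
To get a feel for what these limits mean for a single site, the following Python sketch estimates the shortest time a full crawl of one domain could take. It is a back-of-the-envelope calculation, not the crawler's actual scheduler: the function name is hypothetical, and it assumes every page costs exactly one request, ignoring redirects and response time.

```python
# Values copied from the table above.
DEFAULT_DELAY_SECONDS = 1.0   # 1 request per second per domain
MAX_PAGES_PER_DOMAIN = 250


def min_crawl_seconds(pages: int = MAX_PAGES_PER_DOMAIN,
                      delay_seconds: float = DEFAULT_DELAY_SECONDS) -> float:
    """Rough lower bound on crawl duration (hypothetical helper).

    Assumes one request per page and a fixed delay between requests.
    """
    # The crawler never fetches more than the per-domain page limit.
    capped_pages = min(max(pages, 0), MAX_PAGES_PER_DOMAIN)
    return float(capped_pages * delay_seconds)


print(min_crawl_seconds())                  # 250.0 seconds, a little over 4 minutes
print(min_crawl_seconds(delay_seconds=10))  # 2500.0 seconds with "Crawl-delay: 10"
```

Under these assumptions, a site at the page limit sees at most 250 requests spread over a little more than four minutes at the default rate.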

Standards Compliance

robots.txt: Full compliance with all directives including User-agent, Disallow, Allow, and Crawl-delay
Meta robots tags: Honors noindex, nofollow, and related directives
Rate limiting: Defaults to 1 request per second per domain and slows down further when robots.txt specifies a Crawl-delay
HTTP status codes: Properly handles 4xx and 5xx responses

How to Block Our Crawler

While we hope you'll allow our research crawler to visit your site, we respect your right to control access. Here are several methods to block our bot:

Method 1: robots.txt (Recommended)

Add the following to your robots.txt file to block the crawler entirely:

User-agent: MapTheNetBot
Disallow: /

Or to allow crawling but request a slower rate:

User-agent: MapTheNetBot
Crawl-delay: 10
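
Before deploying either variant, you can check how a standards-following parser reads your rules. The sketch below uses `urllib.robotparser` from Python's standard library; it shows how that module interprets the rules, which is not necessarily identical to the crawler's own parser. The helper name `check_robots` and the `example.com` URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

BLOCK_ALL = """\
User-agent: MapTheNetBot
Disallow: /
"""

SLOW_DOWN = """\
User-agent: MapTheNetBot
Crawl-delay: 10
"""


def check_robots(robots_txt: str, url: str, agent: str = "MapTheNetBot"):
    """Return (allowed, crawl_delay) for `agent` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


print(check_robots(BLOCK_ALL, "https://example.com/page"))  # (False, None)
print(check_robots(SLOW_DOWN, "https://example.com/page"))  # (True, 10)
```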

Method 2: Apache Configuration

Add to your .htaccess file:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} MapTheNetBot [NC]
RewriteRule .* - [F,L]

Method 3: Nginx Configuration

Add to your server configuration:

if ($http_user_agent ~* "MapTheNetBot") {
    return 403;
}

Method 4: Meta Robots Tags

Add to the <head> section of your HTML. Note that this tag addresses every crawler that honors it, not only MapTheNetBot:

<meta name="robots" content="noindex, nofollow">
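
To confirm that a page actually carries the tag, you can parse its HTML and list the directives found. The following is a minimal sketch using Python's standard `html.parser` module; the class and function names are hypothetical, and it only inspects `<meta name="robots">` tags, not HTTP headers.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect directives from <meta name="robots"> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for bare attributes, so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def meta_robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the given HTML."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(sorted(meta_robots_directives(page)))  # ['nofollow', 'noindex']
```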

Why Allow Our Crawler

Please Consider Allowing Our Bot

Your website becomes part of documenting internet history and structure. Here's why it matters:

Open Data: Contribute to public datasets that anyone can use for analysis and research
Educational Value: Help everyone understand how the web is structured and interconnected
Low Impact: Our crawler is lightweight, respectful, and strictly rate-limited
Privacy Conscious: We don't collect or sell personal data or private content
Transparent: Open-source code and clear documentation of our methods

By allowing our crawler, you're contributing to a broader understanding of how the web connects and evolves over time.

Frequently Asked Questions

Will this bot slow down my website?

No. Our crawler is rate-limited to 1 request per second by default and respects any crawl-delay you specify in robots.txt. We also limit ourselves to 250 pages maximum per domain.

Do you crawl dynamic or JavaScript-rendered content?

We primarily crawl standard HTML content. JavaScript-rendered content may not be fully indexed as our focus is on traditional link relationships and metadata.

How often do you re-crawl websites?

Re-crawling is planned for the future but is not currently active.

Therefore, after collecting a maximum of 250 pages from your domain, the bot stops visiting it.

Is the collected data publicly available?

Not yet. However, we plan to release both a map of the interconnectivity of domains and our database, which contains metadata only. Individual page content is not published or made searchable.

What if I want my site removed from your database?

Contact us at the email address below and we'll remove your domain from our crawl queue and existing database.

Do you honor nofollow links?

Yes, we respect all standard meta robots tags and link attributes including rel="nofollow".

Are you affiliated with any commercial entity?

No, MapTheNet is a non-commercial research project. We do not sell data or provide commercial services.

Can I see what data you have about my domain?

Once our data explorer is available, you'll be able to search for your domain and view the metadata we've collected. Until then, contact us directly.

Contact Information

Have questions or concerns, or want to report an issue with our crawler?

Email: [email protected]
GitHub: github.com/mapthenet
Main Website: mapthenet.org

We typically respond to inquiries within 48 hours.