MapTheNet

Mapping the Structure and Diversity of the Internet

Project Status: Development & Data Collection in Progress

We are actively developing the crawler, the website, and other components, and data collection is under way. An interactive data explorer and visualization tools will be made available once we have sufficient coverage and working tools. All collected data will be released as an open dataset for research purposes.

About the Project

MapTheNet is an open research initiative dedicated to understanding and documenting the structure of the internet. Through systematic web crawling and analysis, we are building a comprehensive dataset that captures how websites interconnect, what categories they represent, and how information flows across the global network.

This project aims to provide researchers, educators, and the public with insights into the internet's topology, thematic organization, linguistic diversity, and geographic distribution. All data and methodologies are transparent and will be freely available for non-commercial purposes.

Research Objectives

Our research focuses on documenting and analyzing several key aspects of internet structure:

Network Topology

Understanding how websites link to each other reveals the underlying structure of the web. We map link relationships to identify communities, hubs, and information pathways that form the backbone of the internet.
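As an illustration only, the sketch below collapses page-level links into a host-level directed graph and counts inbound links per host, which is one simple way to spot candidate hubs. The function names and sample URLs are hypothetical, and grouping by hostname rather than by registered domain is a simplifying assumption, not a description of the project's actual pipeline.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def build_domain_graph(links):
    """Collapse page-level links into a host-level directed graph.

    `links` is an iterable of (source_url, target_url) pairs.
    Returns a dict mapping each source host to the set of hosts it links to.
    Links that stay within the same host are ignored.
    """
    graph = defaultdict(set)
    for source_url, target_url in links:
        src, dst = host_of(source_url), host_of(target_url)
        if src and dst and src != dst:
            graph[src].add(dst)
    return dict(graph)


def in_degrees(graph):
    """Count how many distinct hosts link to each host."""
    counts = defaultdict(int)
    for targets in graph.values():
        for dst in targets:
            counts[dst] += 1
    return dict(counts)


# Hypothetical sample data.
sample_links = [
    ("https://a.example/page1", "https://b.example/"),
    ("https://a.example/page2", "https://c.example/docs"),
    ("https://b.example/", "https://c.example/"),
    ("https://c.example/", "https://c.example/about"),  # same host, ignored
]
graph = build_domain_graph(sample_links)
hubs = in_degrees(graph)
```

Hosts with a high in-degree are linked to by many other hosts. Community detection would need a graph library or a dedicated algorithm on top of this structure.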

Thematic Classification

We categorize domains into 19 distinct types, including education, government, technology, news, entertainment, and commerce. This classification helps us understand the diversity and balance of content across the web.

Linguistic Diversity

By detecting and cataloging the languages used across domains, we document which languages are represented online and their relative prevalence in different categories and regions.
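One inexpensive language signal is the `lang` attribute that a page declares on its `<html>` element. The sketch below reads that attribute with Python's standard-library HTML parser. It is an assumption made for illustration: the class and function names are hypothetical, a declared language can be missing or wrong, and the project may use a different detection method.

```python
from html.parser import HTMLParser


class LangAttributeParser(HTMLParser):
    """Record the `lang` attribute of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value.strip().lower()


def declared_language(html_text: str):
    """Return the primary language subtag declared by a page, or None."""
    parser = LangAttributeParser()
    parser.feed(html_text)
    if not parser.lang:
        return None
    # Keep only the primary subtag, e.g. "en-us" becomes "en".
    return parser.lang.split("-", 1)[0]
```

For example, `declared_language('<html lang="en-US"></html>')` returns `"en"`, while a page without a `lang` attribute yields `None` and would need content-based detection instead.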

Geographic Distribution

Using GeoIP lookups and TLD analysis, we track where content originates geographically and how different regions interconnect through web links.
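The TLD half of this analysis can be sketched with the standard library alone. The example below treats a two-letter ASCII TLD as a country-code hint, relying on the convention that two-letter TLDs are reserved for country codes. The function names and hostnames are hypothetical. GeoIP lookups are omitted because they depend on an external database, and some country-code TLDs (for example `.io` or `.tv`) are widely used without any tie to the country, so the result is a hint rather than a location.

```python
from collections import Counter


def tld_of(host: str) -> str:
    """Return the last DNS label of a hostname, lowercased ('' if none)."""
    host = host.strip().rstrip(".").lower()
    if "." not in host:
        return ""
    return host.rsplit(".", 1)[-1]


def cctld_hint(host: str):
    """Return the TLD if it looks like a country-code TLD, else None."""
    tld = tld_of(host)
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        return tld
    return None


def count_cctlds(hosts):
    """Tally country-code TLD hints across an iterable of hostnames."""
    return Counter(hint for hint in map(cctld_hint, hosts) if hint is not None)


# Hypothetical sample data.
sample_hosts = [
    "www.example.de",
    "example.com",
    "shop.example.de",
    "example.co.jp",
    "localhost",
]
cctld_counts = count_cctlds(sample_hosts)
```

Generic TLDs such as `.com` and bare hostnames produce no hint, so they would have to be located by other means, such as the GeoIP lookup mentioned above.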

(Temporal Evolution)

Through periodic re-crawling, we plan to track how the web's structure changes over time, including new domains, shifting link patterns, and evolving thematic distributions. This feature is not yet implemented.
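Since this part of the project does not exist yet, the following is only a sketch of what comparing two crawl snapshots could look like. It assumes each snapshot is a collection of (source host, target host) link pairs. The function name, the snapshot format, and the sample data are all hypothetical.

```python
def diff_snapshots(old_edges, new_edges):
    """Compare two crawl snapshots given as (source_host, target_host) pairs.

    Returns the hosts and links that appeared or disappeared between them.
    """
    old_edges, new_edges = set(old_edges), set(new_edges)
    old_hosts = {host for edge in old_edges for host in edge}
    new_hosts = {host for edge in new_edges for host in edge}
    return {
        "added_hosts": new_hosts - old_hosts,
        "removed_hosts": old_hosts - new_hosts,
        "added_links": new_edges - old_edges,
        "removed_links": old_edges - new_edges,
    }


# Hypothetical snapshots from two crawl runs.
first_crawl = [("a.example", "b.example"), ("b.example", "c.example")]
second_crawl = [("a.example", "b.example"), ("a.example", "d.example")]
changes = diff_snapshots(first_crawl, second_crawl)
```

Set differences keep the comparison simple. Tracking shifts in thematic distribution would additionally require per-host category labels in each snapshot.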

Open Research Principles

This project adheres to principles of open research. Our crawling methodologies are transparent, our code is open source, and collected data will be released publicly for research use. We believe the internet's structure should be documented and understood by all.

Data Collection

Current Statistics

Metric                 Status
Domains Discovered     Collection in progress
URLs Crawled           Collection in progress
Link Relationships     Collection in progress
Languages Detected     Collection in progress

What We Collect

What We Don't Collect

Technical Implementation

Crawler Behavior

For detailed information about our crawler, including how to block it if necessary, please visit the crawler information page.
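Site operators who block crawlers through robots.txt can check a rule locally before deploying it. The sketch below uses Python's standard `urllib.robotparser` module and assumes, without confirmation from this page, that the crawler honors robots.txt. `ExampleBot` is a placeholder user-agent token, not the crawler's real name. The authoritative token and blocking instructions are on the crawler information page.

```python
from urllib.robotparser import RobotFileParser

# Placeholder token; replace it with the one from the crawler information page.
PLACEHOLDER_AGENT = "ExampleBot"

# A robots.txt that blocks the placeholder agent from the whole site.
robots_txt = f"""\
User-agent: {PLACEHOLDER_AGENT}
Disallow: /
"""


def is_allowed(robots_body: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = not is_allowed(robots_txt, PLACEHOLDER_AGENT, "https://site.example/page")
others_allowed = is_allowed(robots_txt, "OtherBot", "https://site.example/page")
```

With this rule in place, the placeholder agent is denied every path while agents that are not named in the file remain unaffected.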

Open Source

The crawler code is open source and available for review. This transparency ensures accountability and allows others to understand exactly what data we collect and how we collect it.