The Open Syllabus collection contains WARC files from a mid-2021 crawl of about 50 million unique seed URLs extracted from the Open Syllabus version 2.6 dataset and their page requisites. The bulk of the seed URLs are from ".com", ".org", ".edu", and ".uk" TLDs.
Crawl Summary
- Crawl start: 2021-04-12
- Crawl end: 2021-09-05
- Seed URLs: 49,735,419
- Archived URLs: 338,690,414
- Collection Size: 25 TB
- Crawler: Heritrix/3.3.0-hq1-SNAPSHOT-2015-03-16T18:09:23Z
- Crawl depth: maxHops=0
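
A few derived figures may help put the crawl summary in perspective. The sketch below uses only the numbers listed above; the variable names are illustrative, and treating "25 TB" as a rounded decimal value (10**12 bytes per TB) is an assumption, so the per-URL size is a rough estimate at best.

```python
from datetime import date

# Figures copied from the Crawl Summary above.
crawl_start = date(2021, 4, 12)
crawl_end = date(2021, 9, 5)
seed_urls = 49_735_419
archived_urls = 338_690_414

# Assumption: "25 TB" is a rounded figure in decimal terabytes (10**12 bytes).
collection_bytes = 25 * 10**12

# Elapsed days between the listed start and end dates.
crawl_days = (crawl_end - crawl_start).days

# Archived URLs per seed; with maxHops=0 the extra URLs are page requisites.
archived_per_seed = archived_urls / seed_urls

# Rough average stored size per archived URL, under the 25 TB assumption.
avg_bytes_per_archived_url = collection_bytes / archived_urls

print(f"Crawl duration: {crawl_days} days")
print(f"Archived URLs per seed: {archived_per_seed:.2f}")
print(f"Approx. bytes per archived URL: {avg_bytes_per_archived_url:,.0f}")
```

The crawl ran for 146 days and captured roughly seven archived URLs for every seed URL.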
Seed Summary
- Unique URLs: 49,735,419
- Unique Canonical URLs: 48,956,395
- Unique Hosts: 984,223
- IPv4 Addresses: 3,328
- Unique TLDs: 21,761
- Unique IANA Valid TLDs: 739
- Wayback Machine URLs*: 6,568,213
* NOTE: More than 13% of the URLs in the dataset point to the Wayback Machine!
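
The proportions behind the seed summary can be checked with a short calculation. This is a minimal sketch using only the counts listed above; the variable names are illustrative, and it assumes every figure is measured against the full set of unique seed URLs.

```python
# Figures copied from the Seed Summary above.
unique_urls = 49_735_419
unique_canonical_urls = 48_956_395
unique_tlds = 21_761
iana_valid_tlds = 739
wayback_urls = 6_568_213

# Share of seed URLs that point to the Wayback Machine.
wayback_pct = 100 * wayback_urls / unique_urls

# Seed URLs that collapse into an already-seen URL after canonicalization.
collapsed_by_canonicalization = unique_urls - unique_canonical_urls

# TLD strings found in the seeds that are not IANA-valid TLDs.
non_iana_tlds = unique_tlds - iana_valid_tlds

print(f"Wayback Machine share: {wayback_pct:.1f}%")
print(f"URLs merged by canonicalization: {collapsed_by_canonicalization:,}")
print(f"TLD strings not valid per IANA: {non_iana_tlds:,}")
```

The large gap between unique TLDs and IANA-valid TLDs suggests that many seed URLs extracted from the dataset are malformed; the summary itself does not say how those seeds were handled.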