iBet uBet web content aggregator. Adding the entire web to your favor.



Link to original content: https://api.crossref.org/works/10.1002/SPE.577
{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,9,12]],"date-time":"2024-09-12T14:38:23Z","timestamp":1726151903249},"reference-count":13,"publisher":"Wiley","issue":"2","license":[{"start":{"date-parts":[[2004,1,22]],"date-time":"2004-01-22T00:00:00Z","timestamp":1074729600000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Softw Pract Exp"],"published-print":{"date-parts":[[2004,2]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>How fast does the Web change? Does most of the content remain unchanged once it has been authored, or are the documents continuously updated? Do pages change a little or a lot? Is the extent of change correlated to any other property of the page? All of these questions are of interest to those who mine the Web, including all the popular search engines, but few studies have been performed to date to answer them.<\/jats:p><jats:p>One notable exception is a study by Cho and Garcia\u2010Molina, who crawled a set of 720\u2009000 pages on a daily basis over 4 months, and counted pages as having changed if their MD5 checksum changed. They found that 40% of all Web pages in their set changed within a week, and 23% of those pages that fell into the .com domain changed daily.<\/jats:p><jats:p>This paper expands on Cho and Garcia\u2010Molina's study, both in terms of coverage and in terms of sensitivity to change. We crawled a set of 150\u2009836\u2009209 HTML pages once every week, over a span of 11 weeks. For each page, we recorded a checksum of the page, and a feature vector of the words on the page, plus various other data such as the page length, the HTTP status code, etc. Moreover, we pseudo\u2010randomly selected 0.1% of all of our URLs, and saved the full text of each download of the corresponding pages.<\/jats:p><jats:p>After completion of the crawl, we analyzed the degree of change of each page, and investigated which factors are correlated with change intensity. We found that the average degree of change varies widely across top\u2010level domains, and that larger pages change more often and more severely than smaller ones.<\/jats:p><jats:p>This paper describes the crawl and the data transformations we performed on the logs, and presents some statistical observations on the degree of change of different classes of pages. Copyright \u00a9 2004 John Wiley & Sons, Ltd.<\/jats:p>","DOI":"10.1002\/spe.577","type":"journal-article","created":{"date-parts":[[2004,1,28]],"date-time":"2004-01-28T08:02:21Z","timestamp":1075276941000},"page":"213-237","source":"Crossref","is-referenced-by-count":87,"title":["A large\u2010scale study of the evolution of Web pages"],"prefix":"10.1002","volume":"34","author":[{"given":"Dennis","family":"Fetterly","sequence":"first","affiliation":[]},{"given":"Mark","family":"Manasse","sequence":"additional","affiliation":[]},{"given":"Marc","family":"Najork","sequence":"additional","affiliation":[]},{"given":"Janet L.","family":"Wiener","sequence":"additional","affiliation":[]}],"member":"311","published-online":{"date-parts":[[2004,1,22]]},"reference":[{"key":"e_1_2_1_2_2","unstructured":"Google Information for Webmasters.http:\/\/www.google.com\/webmasters\/2.html[16 October2003]."},{"key":"e_1_2_1_3_2","doi-asserted-by":"crossref","first-page":"669","DOI":"10.1145\/775152.775246","volume-title":"Proceedings of the 12th International World Wide Web Conference","author":"Fetterly D","year":"2003"},{"key":"e_1_2_1_4_2","first-page":"200","volume-title":"Proceedings of the 26th International Conference on Very Large Databases","author":"Cho J","year":"2000"},{"key":"e_1_2_1_5_2","first-page":"19","volume-title":"Proceedings IEEE Symposium on Security and Privacy","author":"Sun Q","year":"2002"},{"key":"e_1_2_1_6_2","first-page":"147","volume-title":"USENIX Symposium on Internetworking Technologies and Systems","author":"Douglis F","year":"1997"},{"key":"e_1_2_1_7_2","first-page":"257","volume-title":"Proceedings of the 9th International World Wide Web Conference","author":"Brewington B","year":"2000"},{"key":"e_1_2_1_8_2","unstructured":"BroderA GlassmanS ManasseM ZweigG.Syntactic clustering of the Web.Proceedings of the 6th International World Wide Web Conference April1997;391\u2013404."},{"key":"e_1_2_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/371920.371965"},{"key":"e_1_2_1_10_2","unstructured":"NajorkM HeydonA.High\u2010performance Web crawling.SRC Research Report 173 Compaq Systems Research Center Palo Alto CA September2001."},{"key":"e_1_2_1_11_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4613-9323-8_11"},{"key":"e_1_2_1_12_2","unstructured":"RabinM.Fingerprinting by random polynomials.Report TR\u201015\u201081 Center for Research in Computing Technology Harvard University 1981."},{"key":"e_1_2_1_13_2","first-page":"1579","volume-title":"Proceedings of the 8th International World Wide Web Conference","author":"Bharat K","year":"1999"},{"key":"e_1_2_1_14_2","unstructured":"PageL BrinS MotwaniR WinogradT.The PageRank citation ranking: Bringing order to the Web.Technical Report 1999\u201066 Database Group Stanford University 1998."}],"container-title":["Software: Practice and Experience"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fspe.577","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/spe.577","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,13]],"date-time":"2023-09-13T04:10:30Z","timestamp":1694578230000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/spe.577"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2004,1,22]]},"references-count":13,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2004,2]]}},"alternative-id":["10.1002\/spe.577"],"URL":"http:\/\/dx.doi.org\/10.1002\/spe.577","archive":["Portico"],"relation":{},"ISSN":["0038-0644","1097-024X"],"issn-type":[{"value":"0038-0644","type":"print"},{"value":"1097-024X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2004,1,22]]}}}