
Optimal-Freshness-Crawl-Scheduling

Dataset and code for three Web crawling-related papers from SIGIR-2019, NeurIPS-2019, and ICML-2020.

Introduction

This repository contains the dataset, different parts of which were used in the empirical evaluation for the following papers:

[1] A. Kolobov, E. Lubetzky, Y. Peres, E. Horvitz, "Optimal Freshness Crawl Under Politeness Constraints", SIGIR-2019.
[2] A. Kolobov, Y. Peres, C. Lu, E. Horvitz, "Staying up to Date with Online Content Changes Using Reinforcement Learning for Scheduling", NeurIPS-2019. 

It also contains code for reproducing the experiments from [2] as well as from the following paper, which doesn’t use the dataset:

[3] A. Kolobov, S. Bubeck, J. Zimmert, "Online Learning for Active Cache Synchronization", ICML-2020.

When using the dataset in other works, please cite [1] if you mainly experiment with its data on host constraints, and [2] if you mainly focus on the rest of the dataset.

We thank Microsoft Bing for help in collecting this data. For any questions regarding the dataset, please contact Andrey Kolobov (akolobov@microsoft.com, https://www.microsoft.com/en-us/research/people/akolobov/).

The Dataset can be downloaded here

Data Collection Details

The dataset was gathered by crawling a large collection of URLs for approximately 14 weeks in 2017 using Microsoft Bing’s production web crawler, and upon every crawl recording whether the corresponding web page has changed since its previous crawl. These URLs were used as sources of structured information, e.g., event times, for Microsoft’s Satori knowledge base. For this purpose, information of interest was extracted from page content using templates. Accordingly, we considered a URL as changed across two crawls if and only if:

The crawler was scheduled to visit each URL from this collection approximately once a day. However, factors ranging from spikes in the crawler’s production workload to temporary host unavailability caused some URL crawl requests on some days to be dropped or otherwise fail. We didn’t record these crawl requests in the dataset, so for a given URL the number of recorded crawls can differ from 98, the expected number for 14 weeks’ worth of daily crawls. In fact, a small number of URLs were crawled far more frequently than once a day due to production crawl requests.

After completing the 14-week crawl, we dropped the URLs that:

For the remaining URLs, we report their change detection history in file urlid_offset_history.txt.

In addition to crawling the URLs themselves, we also crawled the sitemaps of those sites whose sitemaps we considered reliable. A sitemap is a file that lists URLs on the corresponding site and, optionally, their change frequencies and last-modified dates. It is the latter two types of information that we were interested in extracting from sitemaps. We then used this data to estimate the Poisson change rates of these URLs more precisely than those of the rest. Unfortunately, change frequencies and last-modified dates in most sitemaps are missing or inaccurate. Moreover, a sitemap maintainer’s notion of reportable web page modification can differ from ours, which, as mentioned above, is motivated by the need to extract specific data for a knowledge base. All this meant that we could use sitemaps for only ~4% of the URLs in the dataset. These URLs are used in the NeurIPS-2019 paper [2] as the web pages with complete change observation history. We report their change rates in file urlid_chrate_compl_obs_hist.txt.
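As a rough illustration of what a change rate estimate from a complete observation history can look like, the sketch below uses the standard maximum-likelihood estimate for a Poisson process: the number of observed changes divided by the length of the observation window. The function name and the input layout (change timestamps in days) are assumptions made for this example; the source does not state how the rates in urlid_chrate_compl_obs_hist.txt were actually computed.

```python
def estimate_rate_complete(change_times, window_days):
    """Maximum-likelihood Poisson rate (changes per day) from a complete history.

    change_times: timestamps (in days) of every change seen in the window.
                  This layout is an assumption made for illustration.
    window_days:  total length of the observation window, in days.
    """
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    # For a homogeneous Poisson process the MLE is simply count / duration.
    return len(change_times) / window_days


# Hypothetical example: three changes observed during a 14-day window.
rate = estimate_rate_complete([0.5, 3.0, 9.5], 14.0)
print(f"estimated change rate: {rate:.4f} changes/day")
```

This estimate only makes sense when every change in the window is known, which is why it applies to the URLs with reliable sitemaps and not to the rest of the dataset.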

Bing assigns “importance scores” to the URLs it crawls. They are a combination of Bing’s PageRank-like measure and a click-based URL popularity value. For each URL in the dataset, we recorded the importance score Bing assigned to it as of the start of the 14-week crawl. The higher the score, the more important the URL is to Bing. The URLs’ importance scores are reported in file urlid_imp.txt.

Dataset Format

The resulting dataset contains importance scores, host information and 14-week crawl history for ~18M URLs, broken down across several TSV files:

– url_urlid_hostid.txt. Its columns are

– host_hostid_isconstrained.txt. Its columns are

– urlid_imp.txt. Its columns are

– urlid_offset_history.txt. Its columns are

5 5.5143055555555556 [[1.10396990740741, 0], [1.47311342592593, 1], …

indicates that URL with ID=5 in url_urlid_hostid.txt was first crawled ~5.5 days since data collection started. ~1.1 days later it was crawled again but didn’t change compared to the first crawl. ~1.5 days after the second crawl it was crawled yet again, and was discovered to have changed compared to the second crawl, etc.
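The following minimal sketch parses one line of urlid_offset_history.txt. It assumes, based only on the example line above, that each line holds three whitespace- or tab-separated fields: the URL ID, the offset of the first crawl in days, and a JSON-style list of [days since previous crawl, change flag] pairs. The function names are hypothetical.

```python
import json


def parse_history_line(line):
    """Split one line into (url_id, first_offset_days, history).

    history is a list of (interval_days, changed) tuples, where changed is
    True if the page differed from its previous crawl. The three-field
    layout is an assumption inferred from the example line in the text.
    """
    url_id_str, offset_str, history_str = line.strip().split(None, 2)
    pairs = json.loads(history_str)
    history = [(float(interval), bool(flag)) for interval, flag in pairs]
    return int(url_id_str), float(offset_str), history


def absolute_crawl_times(first_offset, history):
    """Convert per-crawl intervals into offsets from the start of collection."""
    times = [first_offset]
    for interval, _changed in history:
        times.append(times[-1] + interval)
    return times


sample = "5\t5.5143055555555556\t[[1.10396990740741, 0], [1.47311342592593, 1]]"
url_id, first_offset, history = parse_history_line(sample)
print(url_id, absolute_crawl_times(first_offset, history))
```

Under this reading, the number of recorded crawls for a URL is one more than the length of its history list, since the first crawl has no previous crawl to compare against.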

– urlid_chrate_compl_obs_hist.txt. Its columns are

Instructions for reproducing the NeurIPS-2019 paper’s experiments

These instructions assume that you have launched Python 3 or higher from a directory containing LambdaCrawlExps.py and a Dataset subdirectory, which in turn contains the unpacked .txt files of the dataset.

First, load LambdaCrawlExps:

> from LambdaCrawlExps import *

Next, process the raw dataset by running

> ProcessRawData("Dataset/urlid_imp.txt", "Dataset/urlid_offset_history.txt", "Dataset/urlid_chrate_compl_obs_hist.txt")

Doing so should generate files imps_and_chrates_incompl.txt and imps_and_chrates_compl.txt, which contain importance score-Poisson change rate pairs for all URLs in the dataset with incomplete and with complete change observations, respectively. Reproducing the three experiments in the NeurIPS-2019 paper then amounts to running

> Experiment1()
> Experiment2()
> Experiment3()
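For intuition about change rates under incomplete observations, the sketch below shows a standard maximum-likelihood estimator for a Poisson process that is observed only through change/no-change flags at crawl times. It is an illustration under stated assumptions, not a description of what ProcessRawData does; the function names and the (interval, changed) input layout are hypothetical, and all intervals are assumed to be positive.

```python
import math


def _score(rate, observations):
    """Derivative of the log-likelihood with respect to the rate."""
    total = 0.0
    for interval, changed in observations:
        if changed:
            x = rate * interval
            # Beyond x ~ 700 the term is negligible and expm1 would overflow.
            if x < 700.0:
                total += interval / math.expm1(x)
        else:
            total -= interval
    return total


def estimate_rate_incomplete(observations, iterations=200):
    """MLE of a Poisson change rate from (interval_days, changed) pairs.

    A change flag only says that at least one change happened in the
    interval, so P(changed) = 1 - exp(-rate * interval).
    """
    n_changed = sum(1 for _, changed in observations if changed)
    if n_changed == 0:
        return 0.0          # no change ever seen: the likelihood peaks at 0
    if n_changed == len(observations):
        return math.inf     # a change in every interval: no finite maximum
    lo, hi = 0.0, 1.0
    while _score(hi, observations) > 0:   # expand until the root is bracketed
        hi *= 2.0
    for _ in range(iterations):           # the score is decreasing, so bisect
        mid = 0.5 * (lo + hi)
        if _score(mid, observations) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


print(estimate_rate_incomplete([(1.0, True), (1.0, False), (2.0, False)]))
```

With equal intervals of one day and changes detected in k out of n of them, this estimator reduces to -ln(1 - k/n), which is a convenient sanity check.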

Instructions for reproducing the ICML-2020 paper’s experiments

These instructions assume that you have launched Python 3 or higher from a directory containing sync_bandits.py.

First, load sync_bandits.py:

> from sync_bandits import *

You may need to install some dependencies such as numpy and scipy if they are missing.
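One way to check for the two dependencies named above before importing sync_bandits is sketched here. The helper name is hypothetical, and the suggested pip command assumes a standard pip-based Python installation.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["numpy", "scipy"])
if missing:
    print("Missing packages; try: python -m pip install " + " ".join(missing))
else:
    print("numpy and scipy are available")
```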

The results in Figures 1 and 2 in the main text and Figures 3 and 4 in the Appendix can be reproduced by running

> exp1()
> exp2()
> exp1a()
> exp2a() 

respectively.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.