Identifying OneSpot Scraping Traffic

OneSpot makes automated requests to your site to make sure that we have the most up-to-date versions of your content.

Sometimes it’s useful to be able to identify this traffic, either to keep it from being tracked by your website analytics or to make sure it doesn’t get blocked by your firewall. This page lists the OneSpot systems that make HTTP requests to your site, the frequency and volume of their traffic, and how to identify each one.

All HTTP traffic from OneSpot can be identified by searching for OneSpot in the User-Agent request header. Every OneSpot system that makes a request to your site identifies itself using this string. Match it case-insensitively, because the example strings below spell it Onespot.

Most systems, such as firewalls and website analytics products, enable you to create rules based on the user agent sent in the HTTP request headers. If you want to create one blanket rule that will identify all OneSpot scraping traffic you can look for OneSpot in the user agent and that will catch all of our automated traffic.
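The blanket rule described above can be sketched in a few lines. This is a minimal illustration rather than a OneSpot-supplied tool: the function name is_onespot_traffic is an assumption made for the example, and the comparison is case-insensitive because the example strings on this page spell the name Onespot.

```python
def is_onespot_traffic(user_agent):
    """Return True when the User-Agent header mentions OneSpot in any casing."""
    # A missing header may arrive as None or ""; treat both as "not OneSpot".
    return "onespot" in (user_agent or "").lower()


# Example string taken from the Sitemap Scanner entry on this page.
sitemap_ua = (
    "Mozilla/5.0 (compatible; Onespot-SitemapBot/1.0; "
    "+https://www.onespot.com/identifying-traffic.html)"
)

print(is_onespot_traffic(sitemap_ua))                                   # True
print(is_onespot_traffic("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```

The same substring test can usually be expressed directly as a user-agent rule in a firewall or analytics product; the exact syntax depends on the product.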

If you would like to identify the individual subsystems that make site requests, each subsystem is listed below along with details on what it does and how to identify it.

Purpose: The sitemap scanner reads a site's robots.txt and sitemap.xml file(s) in order to (a) detect newly published pages, (b) detect when pages have been taken down, and (c) detect when pages have changed.

HTTP Request: HTTP GET

Request Frequency: Each sitemap that is configured to be scanned is requested once an hour.

User Agent String Contains: Onespot-SitemapBot

Example User Agent String: Sitemap Scanner
Mozilla/5.0 (compatible; Onespot-SitemapBot/1.0; +https://www.onespot.com/identifying-traffic.html)

Purpose: The content scraper retrieves the content of a web page. It is used to get the headline, images, text, etc., as described in Scraping Pages.

HTTP Request: HTTP GET

Request Frequency: Each page is scraped at least once, when the page is first discovered. If we detect that the page has changed, we re-scrape it. You should expect to see a high volume of scrapes when we first start scraping your site, because we need to get all of your previously published content. Once we are current, the scraper should only request pages at the rate that new pages are published or existing pages are updated.

Request Rate: The maximum request rate you should see from our scraper is 5 requests/second, but you would only see that volume if there was a large backlog of pages on your site that we need to scrape.

User Agent String Contains: Onespot-ScraperBot or Cloudinary

Example User Agent Strings: Content Scraper
Mozilla/5.0 (compatible; Onespot-ScraperBot/1.0; +https://www.onespot.com/identifying-traffic.html)

Mozilla/5.0 (compatible; Cloudinary/1.0)
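The maximum content scraper rate stated above puts a lower bound on how long an initial backlog takes to request. The sketch below works through that arithmetic; the backlog size of 90,000 pages is a hypothetical figure chosen for the example, and the actual duration may be longer because the stated rate is a maximum, not a guarantee.

```python
MAX_SCRAPER_REQUESTS_PER_SECOND = 5  # maximum content scraper rate stated on this page


def min_backlog_seconds(backlog_pages, rate=MAX_SCRAPER_REQUESTS_PER_SECOND):
    """Shortest time, in seconds, to request each backlog page once at the given rate."""
    return backlog_pages / rate


# Hypothetical backlog of 90,000 previously published pages.
seconds = min_backlog_seconds(90_000)
print(f"{seconds / 3600:.1f} hours")  # 90,000 / 5 = 18,000 s, printed as "5.0 hours"
```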

Purpose: To determine whether a page has changed and whether it is still available and accessible, the meta tag scanner looks at the HTTP status code and checks for changes in the values of the page’s meta tags. If a change in meta tag values is detected, we mark the page for a full rescrape. An HTTP 200 (OK) response code indicates that the page is still available; a 404 (Not Found) error code indicates that the page has been taken down.

HTTP Request: HTTP GET

Request Frequency: By default each page is scanned once every 24 hours. If there is a need to spread these requests out over a longer period of time, contact us and we can increase the scanning time period; however, increasing the time interval increases the risk that the content we recommend will be out of date with changes you have made.

Request Rate: The requests are spread out evenly throughout the day (or over a custom configured interval).

User Agent String Contains: Onespot-MetaBot

Example User Agent String: Meta Tag Scanner
Mozilla/5.0 (compatible; Onespot-MetaBot/1.0; +https://www.onespot.com/identifying-traffic.html)
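Because the meta tag scanner spreads its requests evenly over the scanning period, the average gap between requests follows from the number of pages. The sketch below shows the arithmetic; the page count of 10,000 and the 48-hour custom interval are hypothetical values, and the real scheduling may differ in detail.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # default scanning period: every page once per 24 hours


def average_gap_seconds(page_count, period_seconds=SECONDS_PER_DAY):
    """Average time between requests when page_count pages are scanned once per period."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return period_seconds / page_count


# Hypothetical site with 10,000 scanned pages.
print(average_gap_seconds(10_000))                       # 8.64 seconds between requests
print(average_gap_seconds(10_000, 2 * SECONDS_PER_DAY))  # 17.28 with a 48-hour interval
```

Doubling the interval halves the request rate, at the cost of slower detection of page changes.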

Purpose: The OnSite checker monitors for site changes that prevent the expected OnSite recommendation units from appearing on the page. Every hour, we request a sample page for every OnSite unit and check that the unit is still rendering on the page. The checker is essentially a headless browser that renders the sample pages and verifies that the expected OnSite unit is still being properly injected. If for any reason the unit is no longer on the page, we send internal notifications and escalations to minimize any disruption in service.

HTTP Request: HTTP GET

URL Query Parameters: Each request from the OnSite checker contains an onsite_is_test parameter, which tells our script that (a) this is test traffic and (b) we will be specifying the variant that we want to render on the page. The name of the variant being tested is specified in the onsite_variant_name parameter.

Request Frequency: One page is requested for each OnSite unit each hour.

Request Rate: These are not rate limited, but since there is only one request per unique unit on the site, the overall throughput of requests should be quite low.

User Agent String Contains: Onespot-OnsiteChecker

Example User Agent String: OnSite Checker
Mozilla/5.0 (compatible; Onespot-OnsiteChecker/1.0; +https://www.onespot.com/identifying-traffic.html)
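OnSite checker requests can also be recognized by the query parameters described above. The following sketch uses Python's standard urllib.parse module; the function name, the example URL, and the parameter values ("1" and "sidebar") are assumptions for illustration, since this page names the parameters but does not specify their values.

```python
from urllib.parse import parse_qs, urlsplit


def onsite_test_variant(url):
    """Return (is_test, variant_name) for a requested URL.

    is_test is True when the onsite_is_test parameter is present, whatever
    its value; variant_name is None when onsite_variant_name is absent.
    """
    # keep_blank_values=True so a bare "onsite_is_test=" still counts as present.
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    is_test = "onsite_is_test" in params
    variant = params.get("onsite_variant_name", [None])[0]
    return is_test, variant


# Hypothetical request URL; the host, path, and values are made up.
url = "https://www.example.com/article?onsite_is_test=1&onsite_variant_name=sidebar"
print(onsite_test_variant(url))                                # (True, 'sidebar')
print(onsite_test_variant("https://www.example.com/article"))  # (False, None)
```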

Purpose: The page status checker verifies that a page is available prior to scraping its content.

HTTP Request: HTTP GET

Request Frequency: See Content Scraper.

Request Rate: The maximum request rate you should see from the page status checker is 25 requests/second, but you would only see that volume if there was a large backlog of pages on your site that we need to scrape.

User Agent String Contains: Onespot-StatusBot

Example User Agent String: Page Status Checker
Mozilla/5.0 (compatible; Onespot-StatusBot/1.0; +https://www.onespot.com/identifying-traffic.html)
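To tell the individual subsystems apart, the identifying tokens listed on this page can be collected into a lookup table. The sketch below is one possible approach, not an official OneSpot utility: the table and function names are assumptions, the subsystem labels are taken from the descriptions above, and tokens are compared in lowercase so that differences in capitalization do not matter. Whether other services also send a Cloudinary user agent is not covered on this page.

```python
# Lowercased identifying tokens from this page, mapped to subsystem labels.
SUBSYSTEM_TOKENS = {
    "onespot-sitemapbot": "Sitemap Scanner",
    "onespot-scraperbot": "Content Scraper",
    "cloudinary": "Content Scraper",
    "onespot-metabot": "Meta Tag Scanner",
    "onespot-onsitechecker": "OnSite Checker",
    "onespot-statusbot": "Page Status Checker",
}


def classify_user_agent(user_agent):
    """Return the subsystem label for a User-Agent string, or None if none matches."""
    ua = (user_agent or "").lower()
    for token, subsystem in SUBSYSTEM_TOKENS.items():
        if token in ua:
            return subsystem
    return None


status_ua = (
    "Mozilla/5.0 (compatible; Onespot-StatusBot/1.0; "
    "+https://www.onespot.com/identifying-traffic.html)"
)
print(classify_user_agent(status_ua))                           # Page Status Checker
print(classify_user_agent("Mozilla/5.0 (X11; Linux x86_64)"))   # None
```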