Scraping Pages

This guide explains how OneSpot scrapes individual content pages and describes the markup that clients can add to their pages to improve our ability to filter them.

OneSpot takes the “raw” HTML of the page and extracts “clean” versions of the following page text:
  • Title
  • Publish Date
  • Author
  • Description
  • Body Text
“Clean text” means that we strip all markup and any non-content text (such as headers, sidebars, and promotions). Our goal is to get the pure copy of the article text.

When the publish date appears on the page, our scraper will automatically extract it as well. If the publish date does not appear on the page but you would like to filter on it or display it as part of recommendations, you can pass it to us as custom metadata using the onespot:publish-date tag (see Custom Meta Tags below).
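As an illustration, a page whose publish date is not visible in its text might declare the date as follows. This is a minimal sketch: the ISO 8601 date format and the date itself are assumptions made for the example, since this guide does not specify the accepted format.

```html
<!-- Illustrative only: the ISO 8601 format is an assumption, not a documented requirement -->
<meta property="onespot:publish-date" content="2024-01-15T09:00:00Z" />
```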

Once the clean text has been extracted from the article, OneSpot will tag the language it is written in (English, Spanish, German, etc.) and then run natural language processing on the text to categorize the article by topic.

OneSpot will scrape and process up to three images for each page:

  • Page image
  • Social image
  • Custom image

The “Page image” is the image that is featured most prominently on the page. Our AI-based processing identifies the primary image within the page content, as opposed to other images on the page (header images, ads, etc.).

The “Social image” is the Facebook or Twitter image specified in the meta tags of the HTML markup. These are the og:image and twitter:image meta tags, which are typically used for social sharing.
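For reference, these two tags typically look like the sketch below. The image URL is a placeholder. Note that Open Graph tags conventionally use the property attribute, while Twitter card tags use the name attribute.

```html
<!-- Open Graph image, used by Facebook and other platforms that read Open Graph tags -->
<meta property="og:image" content="http://sample.com/social-image.png" />
<!-- Twitter card image -->
<meta name="twitter:image" content="http://sample.com/social-image.png" />
```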

The “Custom image” is an optional meta tag that you can include if you want your recommendations to use an image other than the page image or social image. When you add the onespot:image meta tag to your page, we will scrape and process the image at that URL.

Example: adding a custom image meta tag
<meta property="onespot:image" content="http://sample.com/image.png" />

Images are scraped and indexed with metadata such as width, height, format, and aspect ratio. They are then uploaded to OneSpot’s digital asset management system so that they can be optimized for display in recommendations. When OneSpot serves images in recommendations, we dynamically crop and scale each image for the unit it will appear in, optimize the image format for the browser, and optimize the image resolution so that images appear crisp on retina displays.

Images are also analyzed using AI-based image processing to:
  1. Extract image tags that capture the subject matter of the image.
  2. Determine the color profile of each image.

If the page we are scraping is a Product Detail Page (PDP), the OneSpot scraper will extract key product details from the page. These product details include:
  • price
  • availability
  • brand
  • sku
  • category
  • specs
These attributes can be used for filtering recommendations and in recommendation display. Note that these data fields need to exist somewhere on the page in order for us to be able to scrape them.
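This guide does not specify which markup the scraper reads product details from. One common way to expose such fields on a page is schema.org Product microdata, sketched below. Treat it as an illustration under that assumption rather than a required format. All product names and values are hypothetical.

```html
<!-- Hypothetical product page fragment using schema.org Product microdata -->
<div itemscope itemtype="https://schema.org/Product">
  <h1 itemprop="name">Organic Tomato Paste</h1>
  <span itemprop="brand">Sample Brand</span>
  <span itemprop="sku">TP-0001</span>
  <span itemprop="category">Pantry</span>
  <!-- Price and availability live on a nested Offer -->
  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
    <meta itemprop="priceCurrency" content="USD" />
    $<span itemprop="price">2.49</span>
    <link itemprop="availability" href="https://schema.org/InStock" />
  </div>
</div>
```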

If there are one or more embedded videos on the page, OneSpot will tag the page as having a video. This value can be used for filtering recommendations (for instance, if you want a recommendation unit that only contains videos). It can also be used for recommendation display rules. For instance, OneSpot recommendations could overlay a play button on top of images for pages that contain videos.

OneSpot will also scrape and index every meta tag on your page, including any schema.org microdata tags that are included in your markup. These meta tags can be used for filtering recommendations.

Canonical Link Meta Tag
One very important meta tag that OneSpot will scrape for a given piece of content is the canonical URL of the page. In situations where content may be replicated across a site in several locations, the canonical URL meta tag tells OneSpot exactly which page to recommend and prevents duplicate recommendations from occurring. If you are not familiar with the uses of the canonical URL tag, see this article for an introduction: https://yoast.com/rel-canonical/

For convenience, OneSpot will look for a few common versions of this tag which may already exist on a site. Below are examples of the traditional HTML and Facebook Open Graph tags that can be used:

Example canonical link tag
<link rel="canonical" href="http://example.com/pagename.html" />
or
Example Open Graph URL tag
<meta property="og:url" content="http://example.com/pagename.html" />

Before OneSpot scraping starts, it is worth checking that all of your content contains canonical URL meta tags. Because these tags are so important to search engine optimization and social sharing, most content management systems include them by default.

Custom Meta Tags
While OneSpot will use pre-existing meta tags, you may need to provide additional or different information to OneSpot to achieve the filtering that your business requirements call for. To accommodate this, OneSpot has created a series of optional meta tags that can be used to pass this information. A value supplied in one of OneSpot’s proprietary tags overrides the corresponding value from the standard tags shown in the examples above.

| Property | Purpose |
| --- | --- |
| onespot:image | URL of the image to associate with a piece of content |
| onespot:blacklist | If the value is set to “true”, this page will not be recommended |
| onespot:category | Associates a particular category with a given page |
| onespot:sub-category | Associates a particular sub-category with a given page |
| onespot:publish-date | The date/time that the article was published. This overrides whatever we scrape from the page text |
| onespot:title | The headline/title to use for recommendations. This overrides what we scrape from the page text |
| onespot:metadata | An open-ended meta tag that provides additional data for organization or association. Metadata values are in the format <key>:<value>; multiple key-value pairs can be passed in by separating entries with commas |
| onespot:author-name | The name of the author of the article. This overrides what we scrape from the page text |
| onespot:author-image | The URL of an author image to be used in recommendations |
| onespot:author-profile-link | The URL of the author’s page, used to link to the author in recommendations |
| onespot:primary-display-category | A page may have multiple categories and tags. If we display a category in recommendations, this is the category that will be displayed |
| onespot:page-type | The type of content, such as slideshow, video, article, or product. Used for displaying and filtering recommendations |
| onespot:canonical-url | The URL to use for normalizing and linking to this page. This value overrides whatever exists in either the canonical link or og:url tags on the page |
| onespot:video-preview | URL of the video to associate with a piece of content |
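The sketch below shows how several of the tags that do not appear in the use cases that follow might look on a page. All names, URLs, categories, and metadata keys are hypothetical and chosen only for illustration.

```html
<!-- Article page with author details (hypothetical values) -->
<meta property="onespot:page-type" content="article" />
<meta property="onespot:author-name" content="Jane Doe" />
<meta property="onespot:author-image" content="http://sample.com/authors/jane-doe.png" />
<meta property="onespot:author-profile-link" content="http://sample.com/authors/jane-doe" />
<meta property="onespot:primary-display-category" content="Recipes" />
<!-- Comma-separated key:value pairs; the keys here are made up for the example -->
<meta property="onespot:metadata" content="diet:vegan,season:summer" />

<!-- On a separate page that should never be recommended -->
<meta property="onespot:blacklist" content="true" />
```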

Simple Use Case

A grocery store has a website with two sections. One section contains recipes and the other contains product information pages. In this case, we want OneSpot’s OnSite product to populate an “other recipes” section at the bottom of all recipe pages. To achieve this, we configure OnSite to recommend only recipe pages, which are identified using the onespot:category meta tag as shown below:

Simple use of custom meta tags
<meta property="onespot:category" content="recipe" />
<meta property="onespot:title" content="20 Minute Vegan Spaghetti" />
<meta property="onespot:image" content="http://sample.com/image.png" />
<meta property="onespot:canonical-url" content="http://sample.com/page.html" />

Advanced Use Case

The same grocery store would like to place recipe recommendations on product pages, where the recommended recipes use the product featured on that page. In this case, we associate products with the recipe page by using OneSpot’s onespot:metadata meta tag. The example below shows two product page URLs being associated with a recipe using this tag:

Advanced use case of custom meta tags
<meta property="onespot:category" content="recipe" />
<meta property="onespot:title" content="20 Minute Vegan Spaghetti" />
<meta property="onespot:image" content="http://sample.com/image.png" />
<meta property="onespot:canonical-url" content="http://sample.com/page.html" />
<meta property="onespot:metadata" content="product:/products/tomato_paste,product:/products/spaghetti" />