Scraping Pages
This guide describes how OneSpot scrapes individual content pages, as well as the markup that clients can add to their pages to improve our ability to filter them.
- Title
- Publish Date
- Author
- Description
- Body Text
<TITLE> tag as the title of the page. <TITLE> HTML tags can contain extraneous text that is not actually part of the article title, and OneSpot's goal is to get the clean title of the article. Depending on how the title is formatted on the page, our scraper may pick up the wrong headline as the title of the page. This is rare, but when it happens you can reach out to your customer success manager and we can implement custom scraping rules to ensure we grab the correct title.
When the publish date appears on the page, our scraper will automatically extract that as well. If the publish date does not appear on the page but you would like to either filter on or display the publish date as part of recommendations, you can pass us the publish date as custom metadata (see below).
Once the clean text has been extracted from the article, OneSpot will tag the human language (English, Spanish, German, etc.) and then run natural language processing on the text to categorize the article by topic.
OneSpot will scrape and process up to three images for each page:
- Page image
- Social image
- Custom image
The “Page image” is the image that is featured most prominently on the page. Our AI processes identify the primary image within the page content, as opposed to other images on the page (header images, ads, etc.).
The “Social image” is the Facebook or Twitter image that is in the meta tags of the HTML markup. These are the og:image and twitter:image meta tags and are typically used for social sharing.
The “Custom image” is an optional meta tag that you can include if there is an image that you want used in your recommendations that is neither the page image nor the social image. When you add the onespot:image meta tag to your page, we will scrape and process the image at this URL.
<meta property="onespot:image" content="http://sample.com/image.png" />
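If you want to confirm which of these image tags a page already exposes, a short script can list them. The following is a minimal sketch using Python's standard `html.parser` module. `collect_image_meta` is a hypothetical helper for auditing your own markup; it does not represent how OneSpot's scraper works, and the sample URLs are placeholders.

```python
from html.parser import HTMLParser

# Meta tag names discussed in this section.
IMAGE_TAGS = ("og:image", "twitter:image", "onespot:image")


class ImageMetaCollector(HTMLParser):
    """Collects the first value seen for each image-related meta tag."""

    def __init__(self):
        super().__init__()
        self.images = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph tags use property=; Twitter card tags often use name=.
        key = attributes.get("property") or attributes.get("name")
        if key in IMAGE_TAGS and attributes.get("content"):
            self.images.setdefault(key, attributes["content"])


def collect_image_meta(html):
    """Return a dict of image meta tag name -> URL found in the markup."""
    collector = ImageMetaCollector()
    collector.feed(html)
    return collector.images


sample_page = """
<head>
  <meta property="og:image" content="http://sample.com/social.png" />
  <meta name="twitter:image" content="http://sample.com/twitter.png" />
  <meta property="onespot:image" content="http://sample.com/image.png" />
</head>
"""
print(collect_image_meta(sample_page))
```

A tag missing from the returned dictionary means the page does not currently declare that image.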
Images are scraped and indexed with metadata such as width, height, format, and aspect ratio. They are then uploaded to OneSpot's digital asset management system so that they can be optimized for display in recommendations. When OneSpot serves images in recommendations, we dynamically crop and scale each image for the unit it will appear in, optimize the image format for the browser, and optimize the image resolution so that images appear crisp on retina displays.
For example, if we scrape an image that is only 100x150 and need to display the image in a 200x200 container, the visual quality of the image would be less than if we had scraped a higher resolution image.
If you have high-resolution versions of your images available, consider adding these as a custom image meta tag (see above). OneSpot will always optimize the actual display resolution for the container that the image is displayed in, so you do not need to worry about having very large images in recommendations. It is much better for OneSpot to have scraped a high-resolution image than a lower-resolution one.
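To make the 100x150 example concrete, the sketch below computes how far a source image must be enlarged to fill a container. It assumes the image is scaled to cover the container before cropping, and it models retina displays with a `pixel_ratio` multiplier. Both are illustrative assumptions, and `upscale_factor` is a hypothetical helper rather than part of OneSpot's system.

```python
def upscale_factor(src_w, src_h, box_w, box_h, pixel_ratio=1.0):
    """Return how much the source must be enlarged to cover the container.

    A result above 1.0 means the image has to be stretched, which costs
    visual quality. A result at or below 1.0 means the source has enough
    pixels and only needs to be scaled down.
    """
    needed_w = box_w * pixel_ratio
    needed_h = box_h * pixel_ratio
    # Covering the container requires satisfying the more demanding axis.
    return max(needed_w / src_w, needed_h / src_h)


# The 100x150 image from the example, shown in a 200x200 container.
print(upscale_factor(100, 150, 200, 200))       # 2.0
# The same image on a display with twice the pixel density.
print(upscale_factor(100, 150, 200, 200, 2.0))  # 4.0
# A hypothetical 1200x1800 source has pixels to spare (result below 1.0).
print(upscale_factor(1200, 1800, 200, 200, 2.0) < 1.0)  # True
```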
- Extract image-tags that capture the subject matter of the image.
- Determine the color profile of each image
- price
- availability
- brand
- sku
- category
- specs
If there are one or more embedded videos on the page, OneSpot will tag the page as having a video. This value can be used for filtering recommendations (for instance, if you want a recommendation unit that only contains videos). It can also be used for recommendation display rules. For instance, OneSpot recommendations could overlay a play button on top of images for pages that contain videos.
OneSpot will also scrape and index every meta-tag on your page, including any schema.org microdata tags that are included in your markup. These meta tags can be used for filtering recommendations.
Canonical Link Meta Tag
One very important meta tag that OneSpot will scrape for a given piece of content is the canonical URL of the page. In situations where content may be replicated across a site in several locations, the canonical URL meta tag tells OneSpot exactly which page to recommend and prevents duplicate recommendations from occurring. If you are not familiar with the uses of the canonical URL tag, see this article for an introduction: https://yoast.com/rel-canonical/
For convenience, OneSpot will look for a few common versions of this tag which may already exist on a site. Below are examples of the traditional HTML and Facebook Open Graph tags that can be used:
<link rel="canonical" href="http://example.com/pagename.html" />
<meta property="og:url" content="http://example.com/pagename.html" />
Before OneSpot scraping starts, it is worth checking that all of your content contains canonical URL meta tags. Since these are so important to search engine optimization and social sharing, most content management systems will include them by default.
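One way to run that check is to parse a page's markup and report which canonical tags are present. The sketch below uses only Python's standard library. `find_canonical_tags` is a hypothetical helper for auditing your own pages, and it looks only for the two tag forms shown above.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the canonical link and og:url values found in a page."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.found.setdefault("canonical", attributes.get("href"))
        elif tag == "meta" and attributes.get("property") == "og:url":
            self.found.setdefault("og:url", attributes.get("content"))


def find_canonical_tags(html):
    """Return a dict of the canonical tags found; empty if none are present."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.found


page = """
<head>
  <link rel="canonical" href="http://example.com/pagename.html" />
  <meta property="og:url" content="http://example.com/pagename.html" />
</head>
"""
print(find_canonical_tags(page))
```

An empty result for a page is a signal that it needs a canonical tag added before scraping begins.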
Custom Meta Tags
While OneSpot will utilize pre-existing meta tags, there may be a need to provide additional or different information to OneSpot in order to achieve desired filtering based on business requirements. To accommodate this functionality, OneSpot has created a series of optional meta tags that can be used to pass this information. Use of any one of OneSpot’s proprietary tags will override values used in the examples referenced above.
Property | Purpose |
---|---|
onespot:image | URL for the image that is to be associated with a piece of content |
onespot:blacklist | If the value is set to “true”, this page will not be recommended |
onespot:category | Associates a particular category with a given page |
onespot:sub-category | Associates a particular sub-category with a given page |
onespot:publish-date | The date/time that the article was published. This overrides whatever we scrape from the page text |
onespot:title | The headline/title to use for recommendations. This overrides what we scrape from the page text |
onespot:metadata | An open-ended meta tag that provides additional data for organization or association. Metadata values are in the format <key>:<value>; multiple key-value pairs can be passed in by separating entries with commas |
onespot:author-name | The name of the author of the article. This overrides what we scrape from the page text |
onespot:author-image | The URL of an author image to be used in recommendations |
onespot:author-profile-link | The URL of the author's page, used for linking to that page in recommendations |
onespot:primary-display-category | A page may have multiple categories and tags. If we display a category on the recommendations, this is the category that will be displayed |
onespot:page-type | A ‘type’ of content, like slideshow, video, article, product, etc. Used for displaying and filtering recommendations |
onespot:canonical-url | A URL to use for normalizing and linking to this page. This value will override whatever exists in either the canonical link or og:url tags on the page |
onespot:video-preview | URL for the video that is to be associated with a piece of content |
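If your pages are generated by a content management system, these tags are usually rendered from templates or code. The following is a rough sketch of that step. `render_onespot_tags` is a hypothetical helper, the property names come from the table above, and the escaping is standard practice for HTML attribute values, not a OneSpot requirement.

```python
from html import escape


def render_onespot_tags(values, metadata=None):
    """Render onespot:* meta tags as HTML.

    values:   dict mapping a property suffix (e.g. "category") to its content.
    metadata: optional list of (key, value) tuples, serialized into the
              comma-separated <key>:<value> format of onespot:metadata.
    """
    tags = dict(values)
    if metadata:
        tags["metadata"] = ",".join(f"{key}:{value}" for key, value in metadata)
    return "\n".join(
        '<meta property="onespot:{}" content="{}" />'.format(
            escape(name, quote=True), escape(str(content), quote=True)
        )
        for name, content in tags.items()
    )


print(render_onespot_tags(
    {"category": "recipe", "title": "20 Minute Vegan Spaghetti"},
    metadata=[("product", "/products/tomato_paste"),
              ("product", "/products/spaghetti")],
))
```

Passing metadata as a list of tuples, not a dictionary, lets the same key (such as `product`) appear more than once.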
Simple Use Case
A grocery store has a website with two sections. One section is recipes and the other is product information pages. In this case, we want OneSpot’s OnSite product to populate an “other recipes” section at the bottom of all recipe pages. In order to achieve this, we configure OnSite to only recommend recipe pages which are identified using the onespot:category meta tag as seen below:
<meta property="onespot:category" content="recipe" />
<meta property="onespot:title" content="20 Minute Vegan Spaghetti" />
<meta property="onespot:image" content="http://sample.com/image.png" />
<meta property="onespot:canonical-url" content="http://sample.com/page.html" />
Advanced Use Case
The same grocery store would like to place recipe recommendations on product pages where the recommended recipes use the product featured on a particular page. In this case, we will associate products with the recipe page by using OneSpot’s onespot:metadata meta tag. The example below shows two product page URLs being associated with a recipe using this tag:
<meta property="onespot:category" content="recipe" />
<meta property="onespot:title" content="20 Minute Vegan Spaghetti" />
<meta property="onespot:image" content="http://sample.com/image.png" />
<meta property="onespot:canonical-url" content="http://sample.com/page.html" />
<meta property="onespot:metadata" content="product:/products/tomato_paste,product:/products/spaghetti" />
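To illustrate how this association could drive filtering, the sketch below selects the recipes that reference a given product path. It is a simplified model under stated assumptions, not OneSpot's actual filtering logic. Scraped pages are represented as plain dictionaries, `metadata_pairs` and `recipes_for_product` are hypothetical helpers, and the second recipe is invented for the example.

```python
def metadata_pairs(content):
    """Split 'key:value,key:value' into a set of (key, value) tuples."""
    pairs = set()
    for entry in content.split(","):
        # Split on the first colon only, so values may contain colons.
        key, separator, value = entry.strip().partition(":")
        if separator:
            pairs.add((key, value))
    return pairs


def recipes_for_product(pages, product_path):
    """Return titles of recipe pages whose metadata references product_path."""
    return [
        page["onespot:title"]
        for page in pages
        if page.get("onespot:category") == "recipe"
        and ("product", product_path)
        in metadata_pairs(page.get("onespot:metadata", ""))
    ]


pages = [
    {
        "onespot:category": "recipe",
        "onespot:title": "20 Minute Vegan Spaghetti",
        "onespot:metadata": "product:/products/tomato_paste,product:/products/spaghetti",
    },
    {
        # Hypothetical second recipe, included only to show a non-match.
        "onespot:category": "recipe",
        "onespot:title": "Garlic Bread",
        "onespot:metadata": "product:/products/baguette",
    },
]

print(recipes_for_product(pages, "/products/spaghetti"))
```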