Latency and Uptime

  • All of OneSpot’s products and services are built to be fault tolerant.
  • In the event of a system failure, end users should not notice any disruption; they will continue to see content recommendations displayed as usual.
  • The only disaster scenario that would cause a service disruption is an entire AWS region going down, which is extremely rare.
  • Script deployment is done via CDN, so no downtime is required.
  • API releases are done as rolling releases, so no downtime is required.
  • If, for whatever reason, our customer-side script cannot connect to the API, or the API takes an unreasonable amount of time to respond, a set of static recommendations is served from our global CDN (AWS CloudFront); a minimal sketch of this fallback appears after this list.
  • Static recommendations conform to all business rules/filters for the given unit.
  • Static recommendations are chosen from the top 50 most popular pieces of content in the last 5 days and refreshed nightly.
  • In this way, if there were an unexpected OnSite API outage, the end user would still receive content recommendations that conform to the business rules and display template.
  • InBox API releases are likewise done as rolling releases, so no downtime is required.
  • If, for whatever reason, the InBox API takes an unreasonable amount of time to respond, a set of static recommendations is served instead; these are stored in memory on the API server and refreshed nightly.
  • Static recommendations conform to all business rules/filters for the given unit.
  • Static recommendations are chosen from the top 50 most popular pieces of content in the last 5 days and refreshed nightly.
  • In this way, if there were an unexpected InBox back-end outage, the end user would still receive content recommendations in their email that conform to the business rules and display template.
  • For click tracking, all recommendation clicks are routed through our redirect service.
  • The redirect service is deployed as a separate service from OnSite and InBox.
  • The redirect service is deployed on the AWS Elastic Beanstalk platform across multiple AWS Availability Zones.
  • Each availability zone has a cluster of servers serving click redirect requests.
  • Through this redundancy, the OneSpot click redirect service should be resilient against system-level or even availability-zone-level failures (a minimal sketch of the redirect flow appears after this list).
  • Buggy releases are another potential source of service disruption. We take several steps to reduce the risk of bugs being deployed:
    • OneSpot uses test-driven development (TDD), so every function has associated automated tests.
    • Code test-coverage is a code quality metric tracked by OneSpot engineering.
    • OneSpot uses a continuous integration server where all code is built and all unit and integration tests are run for every build. If any test fails, the build will not deploy.
  • For OnSite there is also an automated OnSite-Checker process that requests one sample page for every region of every customer site each hour and looks for the presence of the expected OnSite units. This will catch any failures that may have been introduced by customer changes to their website (a simplified sketch of such a check appears after this list).
  • Any OnSite-Checker failures trigger PagerDuty alerts with escalation paths that go up to the VP of Product and VP of Customer Services.
  • Any OnSite-Checker failure incidents are expected to be resolved in less than 24 hours.
  • System failures are automatically injected (chaos testing) to verify resiliency. These tests are designed to ensure that multiple back-end systems can go down while the overall system continues to operate.
  • Nightly backups/snapshots of the RDS, Redshift, and Elasticsearch databases are taken into AWS S3 storage, so that data can be quickly restored in the event of data corruption or system failure (see the snapshot sketch after this list).
  • AWS keeps S3 data backed up across data centers at a 99.999999999% (11 9’s) durability level.
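
As a minimal illustration of the client-side fallback described in the list above, the sketch below races the recommendations API against a short timeout and falls back to a static file on the CDN. The endpoint URLs, timeout value, and Recommendation shape are assumptions for illustration, not OneSpot's actual script.

```typescript
// Hypothetical shape of a recommendation item (illustrative only).
interface Recommendation {
  title: string;
  url: string;
  imageUrl: string;
}

// Illustrative endpoints -- not OneSpot's real URLs.
const API_URL = "https://api.example.com/recommendations?unit=homepage";
const STATIC_URL = "https://d1234.cloudfront.net/static-recs/homepage.json";

// Fetch dynamic recommendations, but give up after a short timeout and fall
// back to the nightly-refreshed static list served from the CDN.
async function getRecommendations(timeoutMs = 800): Promise<Recommendation[]> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(API_URL, { signal: controller.signal });
    if (!res.ok) throw new Error(`API responded with ${res.status}`);
    return (await res.json()) as Recommendation[];
  } catch {
    // API unreachable or too slow: serve the static recommendations instead.
    const res = await fetch(STATIC_URL);
    return (await res.json()) as Recommendation[];
  } finally {
    clearTimeout(timer);
  }
}
```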
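
The click-redirect flow can be pictured with a small HTTP handler like the sketch below: record the click, then send a 302 to the destination. The route shape, query parameter, and logging are assumptions, not the actual OneSpot redirect service.

```typescript
import { createServer } from "node:http";
import { URL } from "node:url";

// Minimal sketch of a click-redirect handler: log the click, then send the
// reader on to the recommended article with a 302 redirect.
const server = createServer((req, res) => {
  const url = new URL(req.url ?? "/", "http://localhost");
  const destination = url.searchParams.get("dest"); // assumed parameter name

  if (!destination) {
    res.writeHead(400).end("Missing dest parameter");
    return;
  }

  // In production this would be an asynchronous write to a click-tracking
  // store; here we simply log to stdout.
  console.log(JSON.stringify({ event: "click", destination, ts: Date.now() }));

  res.writeHead(302, { Location: destination }).end();
});

// Each Availability Zone would run a cluster of instances like this one
// behind a load balancer (e.g. the one Elastic Beanstalk provisions).
server.listen(8080);
```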
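
A monitor in the spirit of the OnSite-Checker could look like the sketch below: fetch one sample page per site each hour and verify that the expected unit markup is present, alerting if it is missing. The page list, marker attribute, and alerting hook are illustrative assumptions.

```typescript
// Hypothetical sample pages; the real checker covers one page for every
// region of every customer site.
const SAMPLE_PAGES = [
  "https://customer-a.example.com/",
  "https://customer-b.example.com/news/",
];

// Marker we expect to find when the OnSite unit has rendered; the attribute
// name here is an assumption for illustration.
const UNIT_MARKER = "data-onespot-unit";

async function checkPage(pageUrl: string): Promise<boolean> {
  const res = await fetch(pageUrl);
  if (!res.ok) return false;
  const html = await res.text();
  return html.includes(UNIT_MARKER);
}

async function runChecks(): Promise<void> {
  for (const page of SAMPLE_PAGES) {
    const ok = await checkPage(page).catch(() => false);
    if (!ok) {
      // In production this would open a PagerDuty incident; here we log.
      console.error(`OnSite unit missing or page unreachable: ${page}`);
    }
  }
}

// Run once per hour, mirroring the hourly cadence described above.
setInterval(runChecks, 60 * 60 * 1000);
runChecks();
```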
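
The nightly snapshot step could be scripted along the lines below with the AWS SDK for JavaScript (v3). The instance identifier and snapshot naming scheme are assumptions, and RDS also offers native automated backups, so this is only a sketch of the idea.

```typescript
import { RDSClient, CreateDBSnapshotCommand } from "@aws-sdk/client-rds";

// Illustrative nightly RDS snapshot job; the instance name and naming scheme
// are assumptions, not OneSpot's actual configuration.
const rds = new RDSClient({ region: "us-east-1" });

async function snapshotNightly(instanceId: string): Promise<void> {
  const stamp = new Date().toISOString().slice(0, 10); // e.g. "2024-01-31"
  await rds.send(
    new CreateDBSnapshotCommand({
      DBInstanceIdentifier: instanceId,
      DBSnapshotIdentifier: `${instanceId}-nightly-${stamp}`,
    })
  );
}

snapshotNightly("recommendations-db").catch((err) => {
  // A failed snapshot would page on-call in production; here we just log it.
  console.error("Nightly snapshot failed:", err);
});
```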

In the event that (a) our API servers are failing and (b) the global CDN (CloudFront) fails to serve static content, the end user would not receive any content recommendations through OnSite or InBox.

The most likely cause of this scenario would be an entire AWS region going offline. These events are extremely rare and would be considered a disaster. In past occurrences of this kind, once the region came back online there was no data loss and systems could be brought back online.

In this case, assuming other customer systems (i.e. web servers, CMS, ESP) continued to function through the AWS outage, OnSite recommendations would not appear on the customer site, but there should be no other impact to the site or page. For InBox, if an AWS outage occurred during an email send, connections would time out at the network level: image-based recommendations would appear as broken images, and for send-time recommendations it is up to the ESP to decide how to handle the failure to connect.

In the most common cases of unexpected downtime, end users should not notice any difference in the user interface. Static recommendations will be served in place of dynamic content recommendations, and dynamic recommendations will resume once service is restored.

In the very rare disaster scenario of an entire AWS region going offline, if other customer-critical systems are up, end users will see no recommendations where they are supposed to appear. However, this should not impact the availability or performance of the site, and emails from the customer's ESP should continue to be sent.