Multi-Instance Environments on Oracle
Posted 09/18/2013 by vconcepcion
If you've ever setup a multi-instance environment, you're probably familiar with how Sitecore facilitates publishing and cache synchronization across instances through the use of the Event Queue. Setting up publishing from one or more content management (CM) servers to one or more content delivery (CD) servers is often as simple as enabling the ScalabilitySettings.config file, as documented in the Sitecore Scaling Guide.
We recently ran into some issues, however, in a multi-instance Oracle production environment where Scalability Settings were enabled but changes that were published from the CM server would not always appear on the front-end CD servers. When this happened, we would manually clear all caches on the CD servers via the /sitecore/admin/cache.aspx page, and the published changes would then appear on the live site. Since this behavior was observed even on sublayouts on which HTML/output caching was NOT enabled, we suspected that one of the lower level caches on the CD servers (i.e., item, data, prefetch) was holding onto an older version of the published content.
As we dug deeper into how the Event Queue works, we saw that publishing an item on the CM server raises the Sitecore.Data.Eventing.Remote.SavedItemRemoteEvent and writes a corresponding record to the EventQueue table of the web database/schema. Remote events are then replayed by the CD servers (and any other CM servers in the environment) to ensure that local caches are updated. To keep track of which events a server has processed, each EventQueue record has a numerical "Stamp" whose value is determined by an auto-incrementing sequence created in the database. In Oracle, the following statement is used to create this sequence as part of the standard Sitecore database schema:
CREATE SEQUENCE EVENTQUEUE_SEQ INCREMENT BY 1 START WITH
1 MAXVALUE 1.0E28 MINVALUE 1 NOCYCLE
CACHE 20 NOORDER
Keeping this sequence and the Event Queue process in mind, we reproduced the issue and ran the following query against the web schema to gather some more information.
SELECT * FROM "EventQueue" ORDER BY "Created" DESC
Within the data that was returned by this query, we noticed that our Stamp column was not always increasing with the "Created" timestamp of the events.
This is problematic, as each Sitecore instance will attempt to process events sequentially. That is, an instance will not process events with a Stamp value lower than the last processed event, which can be identified for all servers by running the following query:
SELECT * FROM "Properties" WHERE "Key" LIKE 'EQStamp%'
So our question then was why was the EventQueue Stamp value not consistently increasing with the Created timestamp?
One potential issue we looked into is that the Sitecore installation script used to create the Oracle database schema sets the "Created" timestamp using local time (sysdate) instead of UTC time. We adjusted the schema as follows to ensure that times are stored in UTC:
ALTER TABLE "EventQueue" MODIFY("Created" DEFAULT SYS_EXTRACT_UTC(SYSTIMESTAMP))
Even with this adjustment, however, we were still able to reproduce the issue. Finally, with help from Lilya Markovska from Sitecore, we were able to identify the cause, which is that when running Oracle in Parallel Server mode, the NOORDER flag does not guarantee that sequences are generated in the order in which they are committed to the database (see http://docs.oracle.com/cd/A58617_01/server.804/a58238/ch6_dbar.htm).
The fix was to run the following against the core, master, and web schemas of the database:
ALTER SEQUENCE EVENTQUEUE_SEQ ORDER
After running this, we were able to consistently see EventQueue Stamps incrementing and caches being synchronized on our CD servers. I'm guessing that the use of Oracle in Parallel Server mode is not uncommon for large enterprise customers requiring a scaled database environment, so for those out there who plan to implement a multi-server setup, hope this can save you some headache.