The Black Art of Sitecore Custom Data Providers

Mon Jan 03 2011

Having coded a few Sitecore data providers, and with the release of Hanson Dodge Creative’s data provider for the Bits on the Run Video CDN, I thought I would take the time to gather my thoughts and lessons on implementing custom data providers for Sitecore. There are many ways of integrating external data with Sitecore, but a data provider is one of the most powerful. You can get external data to display and behave as native Sitecore data, so that it can be browsed, rendered, and related to other content within Sitecore. How cool is that? Developing custom data providers is a bit of a black art, though: documentation is lacking, and the side effects of such a low-level integration are often unexpected. I’ve tried to capture various tips, HOWTOs, and gotchas here. I’ll also be keeping this entry updated as I come across other useful information.

This article assumes a basic knowledge of data providers and how to implement them. If you’re just learning about them, be sure to review the (somewhat outdated) SDN Article on them, and perhaps a simple data provider such as the shared source YouTubeDataProvider.

Essential Methods for Read-Only Data Providers

In case it’s not totally clear from SDN, the following methods are essential to implement for a basic, read-only data provider:

GetItemDefinition
GetItemVersions
GetItemFields
GetChildIDs
GetParentID

Provide Values for Standard Sitecore Fields

There are some fields in the Standard Template that you should always populate within GetItemFields. This is not only helpful for users, but also for any third-party APIs that expect the fields to be populated. For example, we ran into trouble with the Coveo Enterprise Search connector for Sitecore when our data provider did not populate these fields:

Created, Created By
Updated, Updated By
Owner
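
As a rough sketch of what those values might look like, the helper below builds them as plain strings. It assumes the usual Sitecore conventions: the underlying field names (__Created, __Created by, and so on) and the compact ISO date format yyyyMMddTHHmmss. Check both against your Sitecore version. StandardFieldValues is a hypothetical helper, not part of the Sitecore API. In a real GetItemFields you would add each value to the returned field list under the matching field ID.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: builds raw values for the standard fields listed above.
public static class StandardFieldValues
{
    // Assumption: Sitecore stores dates in this compact ISO format.
    private const string IsoFormat = "yyyyMMdd'T'HHmmss";

    public static Dictionary<string, string> Build(DateTime created, DateTime updated, string userName)
    {
        return new Dictionary<string, string>
        {
            { "__Created", created.ToString(IsoFormat, CultureInfo.InvariantCulture) },
            { "__Created by", userName },
            { "__Updated", updated.ToString(IsoFormat, CultureInfo.InvariantCulture) },
            { "__Updated by", userName },
            { "__Owner", userName }
        };
    }
}
```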

Make Use of those Standard Values

Your data source likely can’t provide everything you need for utilizing your external data in Sitecore. For example, can it really provide your Presentation Details? You don’t want to hardcode the presentation XML into your data provider either. Standard values will still work great with your data provider’s items, so make use of them where appropriate.

Utilize a Data Provider to Disconnect Source Data in Production

You may be tempted to run your data provider on both the Master and Web databases, thus running it in “live mode.” Though this has the benefit of not requiring publishing, you lose a huge benefit of the publishing process: disconnection of the source data from production load. After publishing, items from your data provider will be “native” Sitecore items in the web database, and thus you don’t need to be concerned about your data source handling live traffic. The disadvantage? You’ll need either an automated publishing process or manual publishes by your content editors to ensure that changes in the data source are replicated to the live website.

Utilize the Revision Field to Enable Smart Publishing

Unless you are running your Data Provider in “live mode,” Sitecore will need to publish items from your data provider, and you want this publishing to perform well. Though Sitecore publishing will attempt a Smart Publish anyway using a field-by-field comparison, to get the full benefit of Smart Publishing you really need to fill in the Revision field. Options for populating this field include:

A revision field in the source data, if available
A version number in the source data, if available
An MD5 hash or other checksum of the values populated in the item. If the source data doesn’t provide one, this isn’t difficult to generate.

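// "values" holds the concatenated field values of the item (e.g. a StringBuilder).
byte[] valueBytes = System.Text.Encoding.Unicode.GetBytes(values.ToString());
byte[] hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
{
    hash = md5.ComputeHash(valueBytes);
}
string base64 = System.Convert.ToBase64String(hash); // use this as the Revision value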

Caching #1: Implement Your Own Data Caching

Unless accessing and querying your data source is really fast, you can get huge performance gains from implementing your own cache in your data provider. Sitecore’s own data and item caches will help, but considering the low-level granularity of the DataProvider calls, there are very likely efficiencies you can find from having your own caching layer in your data provider. Luckily, Sitecore makes this very easy to do with the Sitecore.Caching namespace. Be sure to check my tips on using that API.

As an example, consider a web service that you might be integrating with. True, you could make separate round trips to get a record’s base data, get its children, get its fields, and get its parent ID. But good luck getting this to perform at any acceptable level. It is better to immediately query and cache all data for the record, or better yet (for small data sets) all the data available in the web service.

Caching #2: Prefetch Children, Because you Know Sitecore Will Want Them

If your data set is too large to prefetch and cache as a whole, you should at least prefetch child data during a call to GetChildIDs. You can bet that immediately after your call to GetChildIDs, Sitecore will be calling GetItemDefinition on each child ID, and potentially GetItemVersions and GetItemFields as well. Better to retrieve and cache all the needed child data at once, especially if your data source provides a means of efficiently fetching all of it in one round trip.
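
A minimal sketch of such a prefetching cache follows. Everything in it is hypothetical: SourceRecord, PrefetchingRecordCache, and the two fetch delegates stand in for your own data source client, and a plain Dictionary stands in for a Sitecore.Caching cache (so there is no size limit or expiration here). The idea is that GetChildIDs would call GetChildKeys, and the follow-up calls for each child would be served by GetRecord without another round trip.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one record from the external source.
public sealed class SourceRecord
{
    public string Key { get; set; }
    public string ParentKey { get; set; }
    public Dictionary<string, string> Fields { get; set; }
}

public sealed class PrefetchingRecordCache
{
    private readonly Func<string, IList<SourceRecord>> _fetchChildren;
    private readonly Func<string, SourceRecord> _fetchRecord;
    private readonly object _sync = new object();
    private readonly Dictionary<string, SourceRecord> _records = new Dictionary<string, SourceRecord>();
    private readonly Dictionary<string, List<string>> _childKeys = new Dictionary<string, List<string>>();

    public PrefetchingRecordCache(
        Func<string, IList<SourceRecord>> fetchChildren,
        Func<string, SourceRecord> fetchRecord)
    {
        _fetchChildren = fetchChildren;
        _fetchRecord = fetchRecord;
    }

    // For GetChildIDs: one round trip fetches and caches every child record.
    public IList<string> GetChildKeys(string parentKey)
    {
        lock (_sync) // simple but coarse: the fetch runs while holding the lock
        {
            List<string> keys;
            if (_childKeys.TryGetValue(parentKey, out keys)) return keys;

            IList<SourceRecord> children = _fetchChildren(parentKey);
            foreach (SourceRecord child in children) _records[child.Key] = child;
            keys = children.Select(c => c.Key).ToList();
            _childKeys[parentKey] = keys;
            return keys;
        }
    }

    // For GetItemDefinition, GetItemFields and GetParentID: served from cache when possible.
    public SourceRecord GetRecord(string key)
    {
        lock (_sync)
        {
            SourceRecord record;
            if (_records.TryGetValue(key, out record)) return record;
            record = _fetchRecord(key);
            if (record != null) _records[key] = record;
            return record;
        }
    }

    // Hook for a "refresh" command that empties the internal cache.
    public void Clear()
    {
        lock (_sync)
        {
            _records.Clear();
            _childKeys.Clear();
        }
    }
}
```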

Caching #3: Give your Users a Cache Clearing Shortcut

If your data source can be modified outside of Sitecore (very likely), even with a conservative expiration on your cache data, you will frustrate your content editors if they have no means of “refreshing” the data inside of Sitecore. Add a button to the ribbon and associated command which clears your internal data cache, as well as appropriate items from the data and item caches. For example, see the DataRefreshCommand in the Bits on the Run data provider.

Implementing Reference Fields isn’t Difficult

Are there relationships between items in your data source? You can implement reference fields, such as a Droplink or Multilist, without too much difficulty: populate the field value with a single GUID or with pipe-delimited GUIDs. (A General Link is slightly different, as it stores the target GUID inside a small XML value.) When resolving the reference, utilize the IDTable to either create or find the mapped GUID for the related data record.
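
Here is a small sketch of building such a field value. ReferenceFieldBuilder is a hypothetical class, and its Dictionary is only a stand-in for the IDTable lookup (find the GUID mapped to an external key, or create one). The braced, upper-case GUID format is how Sitecore normally renders IDs, but treat it as an assumption to verify.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ReferenceFieldBuilder
{
    // Stand-in for the IDTable: external record key -> Sitecore item GUID.
    private readonly Dictionary<string, Guid> _idMap = new Dictionary<string, Guid>();

    // Find the GUID already mapped to the external key, or create and remember one.
    public Guid GetOrCreateId(string externalKey)
    {
        Guid id;
        if (!_idMap.TryGetValue(externalKey, out id))
        {
            id = Guid.NewGuid();
            _idMap[externalKey] = id;
        }
        return id;
    }

    // Multilist-style value: braced, upper-case GUIDs separated by pipes.
    // With a single related key this is also a valid Droplink-style value.
    public string BuildFieldValue(IEnumerable<string> relatedKeys)
    {
        return string.Join("|", relatedKeys
            .Select(k => GetOrCreateId(k).ToString("B").ToUpperInvariant())
            .ToArray());
    }
}
```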

Utilize CallContext.Abort() Where Appropriate

You can help the performance of your data provider a bit by calling CallContext.Abort() whenever your data provider method can fully service the call sent to it. Remember, data providers are chained: if your data provider is configured to run before the main provider and you do not abort, the main provider will still attempt to service the call as well.

Then again, there may be some cases where you intentionally don’t want to abort the context. The chain of responsibility pattern used by the data provider API provides for some interesting possibilities:

You want Sitecore-native items to coexist with items from your data source as children of a Sitecore-native item
You want to supplement the Sitecore-native field values of an item with values from your data source
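
To make the chaining behaviour concrete, here is a toy model of it. None of these types are Sitecore's: ToyCallContext, ToyProvider, and ToyProviderChain are hypothetical stand-ins that model only the abort flag and the order of calls, and the real API may combine results differently.

```csharp
using System;
using System.Collections.Generic;

// Toy stand-in for CallContext: only the abort flag is modelled.
public sealed class ToyCallContext
{
    public bool IsAborted { get; private set; }
    public void Abort() { IsAborted = true; }
}

// A provider adds the child keys it knows about and may abort the chain.
public delegate void ToyProvider(string parentKey, List<string> children, ToyCallContext context);

public static class ToyProviderChain
{
    // Calls each provider in configured order and stops once one aborts.
    public static List<string> GetChildKeys(string parentKey, IEnumerable<ToyProvider> providers)
    {
        var context = new ToyCallContext();
        var children = new List<string>();
        foreach (ToyProvider provider in providers)
        {
            provider(parentKey, children, context);
            if (context.IsAborted) break; // later providers never see the call
        }
        return children;
    }
}
```

In this model, a custom provider that adds its children and then aborts keeps the main provider from doing any work for that parent. One that adds children without aborting lets the main provider contribute its own, which gives you the coexistence case from the list above.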

How to Implement Blob Fields

As far as I can tell, there is no existing documentation on how to implement a Blob field in a data provider. This is useful if, for example, you want to create a data provider that integrates media items that appear within the Media Library. There are four steps to implementing a Blob field:

Place a Blob GUID in the Blob field in GetItemFields. This GUID can be arbitrary, so long as in later calls you can recognize it. For this reason, in my implementation I’ve used the item’s ID as the Blob GUID. This allows use of your existing IDTable entry to recognize the Blob GUID in BlobStreamExists and GetBlobStream.
Override the BlobStreamExists method. All you need to do here is return “true” if you recognize the Blob GUID that’s being requested.
Override the GetBlobStream method. If you recognize the GUID being requested, return a System.IO.Stream containing the Blob. For example, the bytes of the image from a database Blob field or from a URL.
Ensure you have enabled BlobStreamExists and GetBlobStream in your dataProvider configuration in Web.config.
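
The first three steps can be sketched without any Sitecore types. BlobSource below is a hypothetical class whose Dictionary stands in for the IDTable lookup. In a real data provider, BlobStreamExists and GetBlobStream would be overrides that also receive a CallContext argument, and the loader delegate would read from your database or URL.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class BlobSource
{
    // Stand-in for the IDTable: blob GUID (the item's own ID) -> function that loads the bytes.
    private readonly Dictionary<Guid, Func<byte[]>> _loaders = new Dictionary<Guid, Func<byte[]>>();

    public void Register(Guid itemId, Func<byte[]> loadBytes)
    {
        _loaders[itemId] = loadBytes;
    }

    // Step 1: the value to place in the Blob field, here simply the item's ID.
    public string GetBlobFieldValue(Guid itemId)
    {
        return itemId.ToString("B").ToUpperInvariant();
    }

    // Step 2: answer true only for Blob GUIDs this provider recognizes.
    public bool BlobStreamExists(Guid blobId)
    {
        return _loaders.ContainsKey(blobId);
    }

    // Step 3: return a stream over the bytes, or null for an unknown GUID.
    public Stream GetBlobStream(Guid blobId)
    {
        Func<byte[]> load;
        if (!_loaders.TryGetValue(blobId, out load)) return null;
        return new MemoryStream(load());
    }
}
```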

If you’d like an example, check out the BotrDataProvider in the shared source repository. One tip to avoid some frustration while developing Blob fields on image Media items: turn off caching in your browser.

Don’t Access the Database Object in your Constructor

More in the realm of quick tips. Keep in mind, your data provider will be constructed at the same time as the Database in which it is hosted. If you attempt to access the Database in the constructor of your data provider, you will create an infinite loop. Trust me from experience. :)

Be Careful Implementing DeleteItem

Though it sounds like a great convenience for your content editors, be very careful implementing DeleteItem, as there are likely deletion scenarios you are not thinking of that could get you in trouble. For example:

The Sitecore-native parent item of your data source items is deleted. Sitecore automatically iterates through all its children and deletes them, thus deleting all data from your source system.
A Sitecore package is installed that includes the Sitecore-native parent item of your data source items. The “overwrite” option is selected during installation. Sitecore automatically iterates through all the item’s children and deletes them, thus deleting all data from your source system.

Trust me again on this one from experience. :) If you really want deletes, you should ensure that your data source has a means of recovering deleted data. Alternatively, instead of a true deletion you could implement a “flag” in the data source that indicates the item should be hidden from Sitecore.
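
A minimal sketch of the “flag” approach, assuming a hypothetical SoftDeleteStore in place of your real data source: a DeleteItem implementation would call MarkHidden instead of removing the row, GetChildIDs would only return visible rows, and Restore gives you a way back after an accidental delete.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class SoftDeleteStore
{
    private sealed class Row
    {
        public string ParentKey;
        public bool Hidden;
    }

    private readonly Dictionary<string, Row> _rows = new Dictionary<string, Row>();

    public void Add(string key, string parentKey)
    {
        _rows[key] = new Row { ParentKey = parentKey };
    }

    // What DeleteItem would call: flag the row, never remove it.
    public bool MarkHidden(string key)
    {
        Row row;
        if (!_rows.TryGetValue(key, out row)) return false;
        row.Hidden = true;
        return true;
    }

    // Recovery path for deletes that were not intended.
    public bool Restore(string key)
    {
        Row row;
        if (!_rows.TryGetValue(key, out row)) return false;
        row.Hidden = false;
        return true;
    }

    // What GetChildIDs would use: hidden rows are skipped.
    public IList<string> GetVisibleChildKeys(string parentKey)
    {
        return _rows
            .Where(p => p.Value.ParentKey == parentKey && !p.Value.Hidden)
            .Select(p => p.Key)
            .ToList();
    }
}
```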

You Have no History Engine

At least in the master database, and in the web database if you are running in “live” mode, the items from your data provider will not be available to the History Engine. Depending on your requirements, this could cause complications for functionality that depends on the Links Database or on Lucene indexes, which rely on the History Engine to know when to update. This is something I haven’t looked into deeply, so maybe there is a solution, but read-only providers are definitely limited by this.

Be Careful with Item “Exports”

Especially if you have a read-only data source whose data is maintained outside of Sitecore, keep in mind that the various “export” functions of Sitecore will treat your data as any other Item in the content tree. This can lead to unexpected data in those exports. What exports am I referring to? Offhand, they include:

Generating installation packages
Item serialization
Globalization exports

For Globalization exports, I have gone so far as to disable a data provider during the one-time export process.

Performance is Key

As you may have figured already from the above hints, performance is pretty key when implementing a data provider. You are at the lowest tier of the Sitecore data access APIs, so many calls into your data provider may be necessary for a single user action in the Sitecore interface. If they don’t perform well, you will likely be getting calls about the endless “spinning snake” that your users are dealing with.

If All Else Fails, Check how Sitecore Does It

As with many APIs in Sitecore, if you’re not sure how to implement something in your data provider, crack open Reflector and open Sitecore.Kernel.dll. Likely you’ll want to look at Sitecore.Data.DataProviders.Sql.SqlDataProvider and Sitecore.Data.SqlServer.SqlServerDataProvider.