Generate a Google sitemap for a site with over a million Sitecore items

Posted 10/07/2013 by asura

Generating a Google sitemap is not difficult, especially when you are dealing with regular websites. In one of our recent projects we had to generate a Google sitemap for over a million products stored in Sitecore.

This poses the following challenges:

  • Generate sitemap every week 
  • Each sitemap file cannot exceed 50MB (uncompressed) or 50,000 URLs
  • While the sitemap is generated, it should not severely affect the server memory and processor

Now do the math: 1 million URLs at 50,000 per file means at least 20 sitemap files, and then consider the logistics of generating them. The trick is to use as little memory as possible and push the output to the file as quickly as possible without loading too many objects in memory.
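
To make the arithmetic concrete, here is a small sketch of the file-count calculation. SitemapMath and FilesNeeded are illustrative names, not part of the project.

```csharp
using System;

public static class SitemapMath
{
    // Number of sitemap files needed for a given URL count (ceiling division).
    public static int FilesNeeded(int urlCount, int maxUrlsPerFile)
    {
        if (urlCount < 0) throw new ArgumentOutOfRangeException("urlCount");
        if (maxUrlsPerFile <= 0) throw new ArgumentOutOfRangeException("maxUrlsPerFile");

        // Use long for the intermediate sum so very large counts cannot overflow.
        return (int)(((long)urlCount + maxUrlsPerFile - 1) / maxUrlsPerFile);
    }
}
```

With the round figures above, SitemapMath.FilesNeeded(1000000, 50000) comes to 20. With the 49,999 cap that MaxItems uses in the code, the same million URLs need 21 files.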

Here is how we did it. The following is the config:

        
    
            
      
    

As you can see, we are defining two different sitemaps, one for the basic pages and one for products. We also defined the file name, the item root for each sitemap along with the base template ID.

We are also defining the virtual path at which the sitemap is accessed, the staging folder path, the final folder path and the physical path of the site root.
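
A config of that shape might look like the following sketch. The element and attribute names are the ones the code reads (sitemaps, sitemap, virtualpath, folderpath, finalfolderpath, siteroot, filename, itemroot, basetemplate); every value is a placeholder assumption, not the project's real settings. Where the section lives depends on the project's own XGet.GetConfigXElement helper, which is not shown here.

```xml
<!-- Hypothetical values throughout; only the element and attribute names come from the code. -->
<sitemaps virtualpath="/sitemaps"
          folderpath="D:\sitemaps\staging\"
          finalfolderpath="D:\website\sitemaps\"
          siteroot="D:\website\">
  <!-- One entry per sitemap: basic pages and products. Replace the GUID placeholders with real item and template IDs. -->
  <sitemap filename="pages" itemroot="{PAGES-ROOT-ITEM-GUID}" basetemplate="{BASE-PAGE-TEMPLATE-GUID}" />
  <sitemap filename="products" itemroot="{PRODUCTS-ROOT-ITEM-GUID}" basetemplate="{PRODUCT-TEMPLATE-GUID}" />
</sitemaps>
```

The folder paths end with a separator because the code concatenates them directly with the file name, and virtualpath has no trailing slash because the code appends "/" itself when building the index.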

Now that we have the config, let's look at some code. Here is the Location class, which defines the attributes for each sitemap entry:

        
    public class Location
    {
        public enum eChangeFrequency
        {
            always,
            hourly,
            daily,
            weekly,
            monthly,
            yearly,
            never
        }

        public string Url { get; set; }
        public eChangeFrequency? ChangeFrequency { get; set; }
        public DateTime? LastModified { get; set; }
        public double? Priority { get; set; }
    }

Once defined, generation of the sitemap can be configured as an Agent or called through code in a webpage. The whole process took about 3 to 4 hours; processor usage never peaked above 23%, and memory consumption was barely noticeable compared with the site running normally.

We start by defining a public class, XmlSitemap. Here is the initial code:

        
    public class XmlSitemap
    {
        const int MaxItems = 49999;
        const string ZipFileExtension = ".xml.gz";
        const string SitemapFileExtension = ".xml";
        
        Database SitecoreDB;
        string SiteRoot;
        string FolderPath;
        string FinalFolderPath;
        string SitemapWebsitePath;

        Stream fs;
        XmlTextWriter writer;
        int fileCounter = 1;
        int itemCounter = 0;
        string xmlFileName = "";

        public void GenerateAllSitemaps()
        {
            Log.Info("XmlSitemap - Start", this);

            XElement config = XGet.GetConfigXElement("sitemaps");
            SitecoreDB = Sitecore.Configuration.Factory.GetDatabase("master");

            SitemapWebsitePath = config.Attribute("virtualpath").Value;
            FolderPath = config.Attribute("folderpath").Value;
            FinalFolderPath = config.Attribute("finalfolderpath").Value;
            SiteRoot = config.Attribute("siteroot").Value;
            // clear existing .xml and gzip files
            DeleteExistingFiles(FolderPath);

            var data = from t in config.Descendants("sitemap")
                       select t;

            foreach (XElement tConfig in data)
            {
                string filename = tConfig.Attribute("filename").Value;
                string basetemplate = tConfig.Attribute("basetemplate").Value;
                string itemroot = tConfig.Attribute("itemroot").Value;

                Log.Info("XmlSitemap - Start " + filename, this);
                Item items = SitecoreDB.GetItem(new ID(itemroot)); 
                GenerateMap(items, basetemplate, filename);
                Log.Info("XmlSitemap - End " + filename, this);
            }

            //compress all files in sitemap folder to gz
            CompressFiles();

            // create sitemap index file and include all gz files
            CreateSitemapIndex();
            DeleteExistingFiles(FinalFolderPath);
            CopyFromStagingToMain();
        }

Here are the steps:

  1. Load configuration
  2. Delete existing files in the staging folder
  3. For each sitemap definition, we generate the sitemap files
  4. GZip generated files
  5. Create a sitemap index file (list of all sitemaps generated)
  6. Delete files in final folder
  7. Copy newly generated files to the final folder

After steps 1 and 2, the call goes to GenerateMap, where we pass it the root item, the base template and the filename. The following code shows the flow of how we traverse through the tree and push the output to an XmlTextWriter:

        
        public void GenerateMap(Item root, string templateID, string fileName)
        {
            try
            {
                Log.Info("XmlSitemap - GenerateMap Start", this);
                fileCounter = 1;
                itemCounter = 0;
                xmlFileName = fileName;

                fs = new FileStream(string.Format("{0}{1}{2}", FolderPath, xmlFileName + fileCounter.ToString(), SitemapFileExtension), FileMode.Create);
                writer = new XmlTextWriter(fs, Encoding.UTF8);

                writer.WriteStartDocument();
                writer.WriteStartElement("urlset");
                writer.WriteAttributeString("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9");

                GetItems(root, templateID);

                writer.WriteEndElement();
                writer.WriteEndDocument();
                writer.Flush();

                fs.Close();

                Log.Info("XmlSitemap - GenerateMap End", this);
            }
            catch (Exception ex)
            {
                Log.Error("XmlSitemap - GenerateMap Error", ex, this);
            }
        }

        /// <summary>
        /// Get a list of items matching the base template provided.
        /// </summary>
        /// <param name="i">Item used to traverse through.</param>
        /// <param name="templateID">Base template the items have to inherit.</param>
        /// <returns>none.</returns>
        public void GetItems(Item i, string templateID)
        {
            if (i.HasBaseTemplateOf(templateID))
            {
                var opts = Sitecore.Links.LinkManager.GetDefaultUrlOptions();
                opts.AlwaysIncludeServerUrl = true;
                opts.LanguageLocation = Sitecore.Links.LanguageLocation.FilePath;
                opts.LanguageEmbedding = Sitecore.Links.LanguageEmbedding.Always;

                if (GetLanguage(i) == "en")
                    opts.LanguageEmbedding = Sitecore.Links.LanguageEmbedding.Never;

                if (templateID == YOURNAMESPACE.Model.Sitecore.Pages.Base.BasePage.TemplateID)
                {
                    if (i.Fields[Model.Sitecore.Pages.Base.NavigationInformation.ShowInSitemap_FID].Value == "1")
                        AppendToXML(new Location { Url = GetPageUrl(i), ChangeFrequency = Location.eChangeFrequency.weekly, Priority = 0.75, LastModified = i.Statistics.Updated });
                }
                else
                    AppendToXML(new Location { Url = Sitecore.Links.LinkManager.GetItemUrl(i, opts), ChangeFrequency = Location.eChangeFrequency.weekly, Priority = 0.75, LastModified = i.Statistics.Updated });
            }

            foreach (Item child in i.Children)
            {
                GetItems(child, templateID);
            }
        }

        /// <summary>
        /// Appends a location xml to the xml file.
        /// </summary>
        /// <param name="addLocation">Location object to generate xml for.</param>
        /// <returns>none.</returns>
        private void AppendToXML(Location addLocation)
        {
            writer.WriteStartElement("url");
            writer.WriteElementString("loc", addLocation.Url);
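            // Skip optional elements that have no value instead of writing empty tags.
            if (addLocation.LastModified.HasValue)
                writer.WriteElementString("lastmod", addLocation.LastModified.Value.ToString("yyyy-MM-dd", System.Globalization.CultureInfo.InvariantCulture));
            if (addLocation.ChangeFrequency.HasValue)
                writer.WriteElementString("changefreq", addLocation.ChangeFrequency.Value.ToString());
            // Invariant culture keeps "0.75" from becoming "0,75" on servers with a comma-decimal locale.
            if (addLocation.Priority.HasValue)
                writer.WriteElementString("priority", addLocation.Priority.Value.ToString(System.Globalization.CultureInfo.InvariantCulture));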
            writer.WriteEndElement();

            itemCounter += 1;

            if (itemCounter == MaxItems)
            {
                Log.Info("XmlSitemap - AppendToXml - Reached max size, creating new file.", this);
                writer.WriteEndElement();
                writer.WriteEndDocument();
                writer.Flush();
                fs.Close();

                fileCounter++;
                fs = new FileStream(string.Format("{0}{1}{2}", FolderPath, xmlFileName + fileCounter.ToString(), SitemapFileExtension), FileMode.Create);
                writer = new XmlTextWriter(fs, Encoding.UTF8);

                writer.WriteStartDocument();
                writer.WriteStartElement("urlset");
                writer.WriteAttributeString("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9");

                itemCounter = 0;
            }
        }

        public string GetPageUrl(Item page)
        {
            SiteInfo site = Factory.GetSiteInfo("website");

            string path = page.GetItemUrl().ToLower().Replace("/en/sitecore/content/home", "");

            UrlString url = new UrlString(path);
            url.HostName = site.TargetHostName;
            return url.ToString();
        }

        /// <summary>
        /// Return item language
        /// </summary>
        /// <param name="item">Item passed to check the language.</param>
        /// <returns>item iso language.</returns>
        public string GetLanguage(Item item)
        {
            Sitecore.Data.ID contextLanguageId = Sitecore.Data.Managers.LanguageManager.GetLanguageItemId(item.Language, item.Database);
            Item contextLanguage = SitecoreDB.GetItem(contextLanguageId);
            string iso = contextLanguage["Regional Iso Code"];
            if (string.IsNullOrEmpty(iso))
            {
                iso = contextLanguage["Iso"];
            }

            return iso.ToLower();
        }
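
The rollover idea can also be sketched without any Sitecore dependency. The version below is an illustration under assumptions, not the project's code: RollingSitemapWriter and its members are hypothetical names, it uses XmlWriter rather than XmlTextWriter, and it opens the next file lazily, only when another URL actually arrives, so a URL count that is an exact multiple of the cap does not leave an empty trailing file.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;
using System.Xml;

// Writes <url> entries into numbered sitemap files (baseName1.xml, baseName2.xml, ...).
public sealed class RollingSitemapWriter : IDisposable
{
    private const string Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
    private readonly string folder;
    private readonly string baseName;
    private readonly int maxUrlsPerFile;
    private XmlWriter writer;
    private int urlsInFile;

    // Number of files opened so far; stays 0 if Add is never called.
    public int FileCount { get; private set; }

    public RollingSitemapWriter(string folder, string baseName, int maxUrlsPerFile)
    {
        if (maxUrlsPerFile <= 0) throw new ArgumentOutOfRangeException("maxUrlsPerFile");
        this.folder = folder;
        this.baseName = baseName;
        this.maxUrlsPerFile = maxUrlsPerFile;
    }

    public void Add(string url, DateTime? lastModified)
    {
        // Open the first file, or roll over, only when there is an entry to write.
        if (writer == null || urlsInFile == maxUrlsPerFile)
        {
            CloseCurrent();
            OpenNext();
        }

        writer.WriteStartElement("url", Ns);
        writer.WriteElementString("loc", Ns, url);
        if (lastModified.HasValue)
        {
            writer.WriteElementString("lastmod", Ns,
                lastModified.Value.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture));
        }
        writer.WriteEndElement();
        urlsInFile++;
    }

    private void OpenNext()
    {
        FileCount++;
        string path = Path.Combine(folder, baseName + FileCount + ".xml");
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false), Indent = true };
        writer = XmlWriter.Create(path, settings);
        writer.WriteStartDocument();
        writer.WriteStartElement("urlset", Ns);
        urlsInFile = 0;
    }

    private void CloseCurrent()
    {
        if (writer == null) return;
        writer.WriteEndDocument(); // closes any open elements, including <urlset>
        writer.Close();            // also closes the underlying file
        writer = null;
    }

    public void Dispose()
    {
        CloseCurrent();
    }
}
```

Wrapping the writer in a using block means the current file is closed even if the tree traversal throws part way through. Only one entry is ever held in memory at a time, which is the same property the code above relies on.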

To compress the generated files we use System.IO.Compression.GZipStream; the goal was to avoid any third-party components. Here is the CompressFiles function code:

        
        /// <summary>
        /// Compress all the files in the sitemap folder into gz (gzip)
        /// </summary>
        /// <returns>none.</returns>
        public void CompressFiles()
        {
            DirectoryInfo directorySelected = new DirectoryInfo(FolderPath);
            foreach (FileInfo fileToCompress in directorySelected.GetFiles())
            {
                // Skip hidden files and anything that is already compressed, before opening any stream.
                if ((fileToCompress.Attributes & FileAttributes.Hidden) == FileAttributes.Hidden || fileToCompress.Extension == ".gz")
                    continue;

                string compressedPath = fileToCompress.FullName + ".gz";
                using (FileStream originalFileStream = fileToCompress.OpenRead())
                using (FileStream compressedFileStream = File.Create(compressedPath))
                using (GZipStream compressionStream = new GZipStream(compressedFileStream, CompressionMode.Compress))
                {
                    originalFileStream.CopyTo(compressionStream);
                }

                // Measure the compressed file only after the gzip stream is disposed, so all buffered data has been written.
                Log.Info(string.Format("XmlSitemap - CompressFiles - Compressed from {0} to {1} bytes.", fileToCompress.Length, new FileInfo(compressedPath).Length), this);
            }
        }
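
A standalone sketch of the same gzip step, with a helper for reading a compressed file back, can be handy for spot-checking output outside Sitecore. GzipHelper and its method names are hypothetical; only GZipStream and the System.IO types are real framework APIs. One detail worth knowing when logging sizes: GZipStream writes its final block on dispose, so the compressed size is best measured afterwards.

```csharp
using System.IO;
using System.IO.Compression;

public static class GzipHelper
{
    // Compresses sourcePath to sourcePath + ".gz" and returns the compressed size in bytes.
    public static long CompressFile(string sourcePath)
    {
        string targetPath = sourcePath + ".gz";
        using (FileStream source = File.OpenRead(sourcePath))
        using (FileStream target = File.Create(targetPath))
        using (var gzip = new GZipStream(target, CompressionMode.Compress))
        {
            // CopyTo streams through a small buffer; the file is never fully in memory.
            source.CopyTo(gzip);
        }

        // The streams are closed at this point, so the size on disk is final.
        return new FileInfo(targetPath).Length;
    }

    // Reads a .gz file back into a string.
    public static string ReadCompressedText(string gzPath)
    {
        using (FileStream source = File.OpenRead(gzPath))
        using (var gzip = new GZipStream(source, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling GzipHelper.CompressFile on a generated sitemap and then GzipHelper.ReadCompressedText on the result should give back the original XML text.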

Once the files are compressed, it's time to create the sitemap index file. Here is the CreateSitemapIndex function code:

        
        /// <summary>
        /// Create Sitemap Index file - this should include all the gzip files
        /// </summary>
        /// <returns>none.</returns>
        public void CreateSitemapIndex()
        {
            Stream fs = new FileStream(string.Format("{0}{1}{2}", FolderPath, "sitemap", SitemapFileExtension), FileMode.Create);
            SiteInfo site = Factory.GetSiteInfo("website");
            UrlString url;

            using (XmlTextWriter writer = new XmlTextWriter(fs, Encoding.UTF8))
            {
                writer.WriteStartDocument();
                // A sitemap index uses <sitemapindex>/<sitemap>, not <urlset>/<url>.
                writer.WriteStartElement("sitemapindex");
                writer.WriteAttributeString("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9");

                DirectoryInfo directorySelected = new DirectoryInfo(FolderPath);
                foreach (FileInfo file in directorySelected.GetFiles())
                {
                    if ((File.GetAttributes(file.FullName) & FileAttributes.Hidden) != FileAttributes.Hidden && file.Extension == ".gz")
                    {
                        url = new UrlString(SitemapWebsitePath + "/" + file.Name);
                        url.HostName = site.TargetHostName;

                        writer.WriteStartElement("sitemap");
                        writer.WriteElementString("loc", url.ToString());
                        // lastmod must be a W3C datetime; a date-only value is valid.
                        writer.WriteElementString("lastmod", file.LastWriteTime.ToString("yyyy-MM-dd", System.Globalization.CultureInfo.InvariantCulture));
                        writer.WriteEndElement();
                    }
                }

                writer.WriteEndElement();
                writer.WriteEndDocument();
                writer.Flush();
            }
            fs.Close();
        }

At this point, it's just a matter of cleaning up the folders and moving the newly generated files into the main sitemap folder. Here is the sample config for adding the sitemap generation as an agent:
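
For reference, a Sitecore agent registration generally takes the form sketched below. The type and method names come from the class in this post; the assembly name YOURASSEMBLY and the weekly interval are assumptions to adjust for your own project.

```xml
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <scheduling>
      <!-- Assumed values: replace YOURASSEMBLY with the real assembly name; interval is days.hours:minutes:seconds. -->
      <agent type="YOURNAMESPACE.Business.Sitemap.XmlSitemap, YOURASSEMBLY" method="GenerateAllSitemaps" interval="7.00:00:00" />
    </scheduling>
  </sitecore>
</configuration>
```

The agent method has to be public and take no parameters, which GenerateAllSitemaps does.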

        
    
      
    


The code for the entire class is given below:

        
namespace YOURNAMESPACE.Business.Sitemap
{
    public class XmlSitemap
    {
        const int MaxItems = 49999;
        const string ZipFileExtension = ".xml.gz";
        const string SitemapFileExtension = ".xml";
        
        Database SitecoreDB;
        string SiteRoot;
        string FolderPath;
        string FinalFolderPath;
        string SitemapWebsitePath;

        Stream fs;
        XmlTextWriter writer;
        int fileCounter = 1;
        int itemCounter = 0;
        string xmlFileName = "";

        public void GenerateAllSitemaps()
        {
            Log.Info("XmlSitemap - Start", this);

            XElement config = XGet.GetConfigXElement("sitemaps");
            SitecoreDB = Sitecore.Configuration.Factory.GetDatabase("master");

            SitemapWebsitePath = config.Attribute("virtualpath").Value;
            FolderPath = config.Attribute("folderpath").Value;
            FinalFolderPath = config.Attribute("finalfolderpath").Value;
            SiteRoot = config.Attribute("siteroot").Value;
            // clear existing .xml and gzip files
            DeleteExistingFiles(FolderPath);

            var data = from t in config.Descendants("sitemap")
                       select t;

            foreach (XElement tConfig in data)
            {
                string filename = tConfig.Attribute("filename").Value;
                string basetemplate = tConfig.Attribute("basetemplate").Value;
                string itemroot = tConfig.Attribute("itemroot").Value;

                Log.Info("XmlSitemap - Start " + filename, this);
                Item items = SitecoreDB.GetItem(new ID(itemroot)); //{CBCC60AC-4458-435C-B706-16ABF4F404B4} //{24DB6C5E-EFFD-46F6-94DA-088A872A2C63}
                GenerateMap(items, basetemplate, filename);
                Log.Info("XmlSitemap - End " + filename, this);
            }

            //compress all files in sitemap folder to gz
            CompressFiles();

            // create sitemap index file and include all gz files
            CreateSitemapIndex();
            DeleteExistingFiles(FinalFolderPath);
            CopyFromStagingToMain();
        }

        /// <summary>
        /// Generate one or more sitemap files (gzip compressed) based on the criteria provided.
        /// The goal is to keep each sitemap file under 50MB and under 50,000 URLs.
        /// </summary>
        /// <param name="root">The root item from which the list is generated.</param>
        /// <param name="templateID">Base template the items have to inherit.</param>
        /// <param name="fileName">Base file name (products, faq) to which we append numbers 1 -> x depending on the number of sitemap files.</param>
        /// <returns>none.</returns>
        public void GenerateMap(Item root, string templateID, string fileName)
        {
            try
            {
                Log.Info("XmlSitemap - GenerateMap Start", this);
                fileCounter = 1;
                itemCounter = 0;
                xmlFileName = fileName;

                fs = new FileStream(string.Format("{0}{1}{2}", FolderPath, xmlFileName + fileCounter.ToString(), SitemapFileExtension), FileMode.Create);
                writer = new XmlTextWriter(fs, Encoding.UTF8);

                writer.WriteStartDocument();
                writer.WriteStartElement("urlset");
                writer.WriteAttributeString("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9");

                GetItems(root, templateID);

                writer.WriteEndElement();
                writer.WriteEndDocument();
                writer.Flush();

                fs.Close();

                Log.Info("XmlSitemap - GenerateMap End", this);
            }
            catch (Exception ex)
            {
                Log.Error("XmlSitemap - GenerateMap Error", ex, this);
            }
        }

        /// <summary>
        /// Get a list of items matching the base template provided.
        /// </summary>
        /// <param name="i">Item used to traverse through.</param>
        /// <param name="templateID">Base template the items have to inherit.</param>
        /// <returns>none.</returns>
        public void GetItems(Item i, string templateID)
        {
            if (i.HasBaseTemplateOf(templateID))
            {
                var opts = Sitecore.Links.LinkManager.GetDefaultUrlOptions();
                opts.AlwaysIncludeServerUrl = true;
                opts.LanguageLocation = Sitecore.Links.LanguageLocation.FilePath;
                opts.LanguageEmbedding = Sitecore.Links.LanguageEmbedding.Always;

                if (GetLanguage(i) == "en")
                    opts.LanguageEmbedding = Sitecore.Links.LanguageEmbedding.Never;

                if (templateID == YOURNAMESPACE.Model.Sitecore.Pages.Base.BasePage.TemplateID)
                {
                    if (i.Fields[Model.Sitecore.Pages.Base.NavigationInformation.ShowInSitemap_FID].Value == "1")
                        AppendToXML(new Location { Url = GetPageUrl(i), ChangeFrequency = Location.eChangeFrequency.weekly, Priority = 0.75, LastModified = i.Statistics.Updated });
                }
                else
                    AppendToXML(new Location { Url = Sitecore.Links.LinkManager.GetItemUrl(i, opts), ChangeFrequency = Location.eChangeFrequency.weekly, Priority = 0.75, LastModified = i.Statistics.Updated });
            }

            foreach (Item child in i.Children)
            {
                GetItems(child, templateID);
            }
        }

        /// <summary>
        /// Appends a location xml to the xml file.
        /// </summary>
        /// <param name="addLocation">Location object to generate xml for.</param>
        /// <returns>none.</returns>
        private void AppendToXML(Location addLocation)
        {
            writer.WriteStartElement("url");
            writer.WriteElementString("loc", addLocation.Url);
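            // Skip optional elements that have no value instead of writing empty tags.
            if (addLocation.LastModified.HasValue)
                writer.WriteElementString("lastmod", addLocation.LastModified.Value.ToString("yyyy-MM-dd", System.Globalization.CultureInfo.InvariantCulture));
            if (addLocation.ChangeFrequency.HasValue)
                writer.WriteElementString("changefreq", addLocation.ChangeFrequency.Value.ToString());
            // Invariant culture keeps "0.75" from becoming "0,75" on servers with a comma-decimal locale.
            if (addLocation.Priority.HasValue)
                writer.WriteElementString("priority", addLocation.Priority.Value.ToString(System.Globalization.CultureInfo.InvariantCulture));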
            writer.WriteEndElement();

            itemCounter += 1;

            if (itemCounter == MaxItems)
            {
                Log.Info("XmlSitemap - AppendToXml - Reached max size, creating new file.", this);
                writer.WriteEndElement();
                writer.WriteEndDocument();
                writer.Flush();
                fs.Close();

                fileCounter++;
                fs = new FileStream(string.Format("{0}{1}{2}", FolderPath, xmlFileName + fileCounter.ToString(), SitemapFileExtension), FileMode.Create);
                writer = new XmlTextWriter(fs, Encoding.UTF8);

                writer.WriteStartDocument();
                writer.WriteStartElement("urlset");
                writer.WriteAttributeString("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9");

                itemCounter = 0;
            }
        }

        public string GetPageUrl(Item page)
        {
            SiteInfo site = Factory.GetSiteInfo("website");

            string path = page.GetItemUrl().ToLower().Replace("/en/sitecore/content/home", "");

            UrlString url = new UrlString(path);
            url.HostName = site.TargetHostName;
            return url.ToString();
        }

        /// <summary>
        /// Return item language
        /// </summary>
        /// <param name="item">Item passed to check the language.</param>
        /// <returns>item iso language.</returns>
        public string GetLanguage(Item item)
        {
            Sitecore.Data.ID contextLanguageId = Sitecore.Data.Managers.LanguageManager.GetLanguageItemId(item.Language, item.Database);
            Item contextLanguage = SitecoreDB.GetItem(contextLanguageId);
            string iso = contextLanguage["Regional Iso Code"];
            if (string.IsNullOrEmpty(iso))
            {
                iso = contextLanguage["Iso"];
            }

            return iso.ToLower();
        }

        /// <summary>
        /// Create Sitemap Index file - this should include all the gzip files
        /// </summary>
        /// <returns>none.</returns>
        public void CreateSitemapIndex()
        {
            Stream fs = new FileStream(string.Format("{0}{1}{2}", FolderPath, "sitemap", SitemapFileExtension), FileMode.Create);
            SiteInfo site = Factory.GetSiteInfo("website");
            UrlString url;

            using (XmlTextWriter writer = new XmlTextWriter(fs, Encoding.UTF8))
            {
                writer.WriteStartDocument();
                // A sitemap index uses <sitemapindex>/<sitemap>, not <urlset>/<url>.
                writer.WriteStartElement("sitemapindex");
                writer.WriteAttributeString("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9");

                DirectoryInfo directorySelected = new DirectoryInfo(FolderPath);
                foreach (FileInfo file in directorySelected.GetFiles())
                {
                    if ((File.GetAttributes(file.FullName) & FileAttributes.Hidden) != FileAttributes.Hidden && file.Extension == ".gz")
                    {
                        url = new UrlString(SitemapWebsitePath + "/" + file.Name);
                        url.HostName = site.TargetHostName;

                        writer.WriteStartElement("sitemap");
                        writer.WriteElementString("loc", url.ToString());
                        // lastmod must be a W3C datetime; a date-only value is valid.
                        writer.WriteElementString("lastmod", file.LastWriteTime.ToString("yyyy-MM-dd", System.Globalization.CultureInfo.InvariantCulture));
                        writer.WriteEndElement();
                    }
                }

                writer.WriteEndElement();
                writer.WriteEndDocument();
                writer.Flush();
            }
            fs.Close();
        }

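        /// <summary>
        /// Delete existing .xml and .gz files in the given folder
        /// </summary>
        /// <param name="folderPath">Folder whose .xml and .gz files are deleted.</param>
        /// <returns>none.</returns>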
        public void DeleteExistingFiles(string folderPath)
        {
            DirectoryInfo directorySelected = new DirectoryInfo(folderPath);
            foreach (FileInfo fileToDelete in directorySelected.GetFiles())
            {
                if ((File.GetAttributes(fileToDelete.FullName) & FileAttributes.Hidden) != FileAttributes.Hidden & (fileToDelete.Extension == ".gz" || fileToDelete.Extension == ".xml"))
                    fileToDelete.Delete();
            }
        }

        /// <summary>
        /// Compress all the files in the sitemap folder into gz (gzip)
        /// </summary>
        /// <returns>none.</returns>
        public void CompressFiles()
        {
            DirectoryInfo directorySelected = new DirectoryInfo(FolderPath);
            foreach (FileInfo fileToCompress in directorySelected.GetFiles())
            {
                // Skip hidden files and anything that is already compressed, before opening any stream.
                if ((fileToCompress.Attributes & FileAttributes.Hidden) == FileAttributes.Hidden || fileToCompress.Extension == ".gz")
                    continue;

                string compressedPath = fileToCompress.FullName + ".gz";
                using (FileStream originalFileStream = fileToCompress.OpenRead())
                using (FileStream compressedFileStream = File.Create(compressedPath))
                using (GZipStream compressionStream = new GZipStream(compressedFileStream, CompressionMode.Compress))
                {
                    originalFileStream.CopyTo(compressionStream);
                }

                // Measure the compressed file only after the gzip stream is disposed, so all buffered data has been written.
                Log.Info(string.Format("XmlSitemap - CompressFiles - Compressed from {0} to {1} bytes.", fileToCompress.Length, new FileInfo(compressedPath).Length), this);
            }
        }

        public void CopyFromStagingToMain()
        {
            DirectoryInfo directorySelected = new DirectoryInfo(FolderPath);
            foreach (FileInfo fileToCopy in directorySelected.GetFiles())
            {
                if ((File.GetAttributes(fileToCopy.FullName) & FileAttributes.Hidden) != FileAttributes.Hidden & fileToCopy.Extension == ".gz")
                {
                    File.Copy(fileToCopy.FullName, FinalFolderPath + fileToCopy.Name, true);
                }
                else if ((File.GetAttributes(fileToCopy.FullName) & FileAttributes.Hidden) != FileAttributes.Hidden & fileToCopy.Name == "sitemap.xml")
                {
                    File.Copy(fileToCopy.FullName, SiteRoot + fileToCopy.Name, true);
                }
            }
        }
    }

    public class Location
    {
        public enum eChangeFrequency
        {
            always,
            hourly,
            daily,
            weekly,
            monthly,
            yearly,
            never
        }

        public string Url { get; set; }
        public eChangeFrequency? ChangeFrequency { get; set; }
        public DateTime? LastModified { get; set; }
        public double? Priority { get; set; }
    }
}

My colleague Matt Schultz has modularized this code and made it usable in any project. His module allows for customizations to item link generation, sitemaps for multiple websites, and security-based sitemap generation. It is going to be posted to the marketplace shortly.
