Table of Contents

  1. What are sitemaps?
  2. Why do we want sitemaps?
  3. What is a microservice?
  4. Why do we want sitemaps as a microservice?
  5. Great, show us the way, O’ Benevolent voice on the internet!

Sitemaps as a Microservice

Oh hi there, sorry, I didn’t see you behind all my beautiful sitemaps! Well since you’re here let’s have a chat about sitemaps.

What are sitemaps?

Shamelessly torn from Wikipedia’s grubby sitemaps page:

“A site map (or sitemap) is a list of pages of a web site accessible to crawlers or users. It can be either a document in any form used as a planning tool for Web design, or a Web page that lists the pages on a Web site, typically organized in hierarchical fashion.”

If the block quote scared you and you zoned out trying to calculate the most apparently reasonable time to go for your 27th coffee break of the day then I’ll shamelessly tear the simple definition from my simple brain:

“It is literally a map of the website, i.e. These are the pages on our site and how to get there.”

Sitemaps come in a variety of flavours and serve various purposes. We’ll be discussing sitemaps for SEO purposes (that means sitemaps for lifeless robot eyes, not beady human eyes). The XML format is best suited to our needs as it allows us to set additional properties on each URL, detailed later.

XML you say?

XML?! WHY??

I know. I tried to find out why XML is supported and a more lightweight language is not, but there appears to be no rhyme or reason. XML is semantically rich, elements can contain mixed content, and there are probably other benefits of using XML over something simple like JSON, but none of these attributes seem to be taken advantage of in the sitemap protocol! Oh well, best get in line.

For the non-believers amongst you here is a real live (example, not real or live) sitemap with annotations by yours truly on the enigmatically named properties:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Using the industry standard XML namespace for sitemap definition, we define a urlset (read as an array or collection of urls). This is our sitemap. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Within the urlset we define the urls on our site -->
  <url>
    <!-- REQUIRED: set the loc to be the fully qualified URL of the page in question. Must be less than 2048 characters. -->
    <loc>http://www.tinypuppies.co.uk/pup-of-the-week</loc>
    <!-- OPTIONAL: set the lastmod on links to signify the last time this page was modified. The date should be in W3C Datetime format. *Some* crawlers / search engines will use this property to determine if your page shows up in results when the searcher has time filters set, e.g. only show results from within the last week. -->
    <lastmod>2016-01-23</lastmod>
    <!-- OPTIONAL: set the changefreq on links to let crawlers know how often this content is likely to change. This will affect how often *some* crawlers revisit the page. Changefreq can be always, hourly, daily, weekly, monthly, yearly, or never. -->
    <changefreq>weekly</changefreq>
    <!-- OPTIONAL: set the priority on links to let crawlers know how important this page is in relation to the rest of your site. This will affect how *some* search engines prioritise your pages in results, but only against your own pages. Priority ranges from 0.0 (low) to 1.0 (high), the default being 0.5. -->
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.tinypuppies.co.uk/breeds</loc>
    <lastmod>2016-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>

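URLs listed in an XML sitemap should normally live on the same host as the sitemap itself; listing URLs from other subdomains or domains only works if you have proven to the search engine that you own those hosts too. Each XML sitemap file must be UTF-8 encoded, is limited to 50,000 URLs, and must be no larger than 50MB uncompressed (the limit was 10MB in earlier versions of the protocol). XML sitemaps can be compressed using gzip.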
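If hand-typing angle brackets isn’t your idea of fun, here is a minimal sketch of building the same kind of file with Python’s standard library. The function name `build_urlset` and the page dictionaries are my own invention for illustration, and it only checks the URL count and `loc` length limits mentioned above, not the file size.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000       # per-file URL limit quoted above
MAX_LOC_LENGTH = 2048   # loc must be shorter than this


def build_urlset(pages):
    """Return UTF-8 encoded sitemap XML (bytes) for a list of page dicts.

    Each dict needs a "loc" key; "lastmod", "changefreq" and "priority"
    are optional and are only written when present.
    """
    if len(pages) > MAX_URLS:
        raise ValueError("too many URLs for one sitemap; use a sitemapindex")

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        if len(page["loc"]) >= MAX_LOC_LENGTH:
            raise ValueError("loc too long: " + page["loc"])
        url = ET.SubElement(urlset, "url")
        # Write the child elements in the same order as the example above.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])

    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


xml_bytes = build_urlset([
    {"loc": "http://www.tinypuppies.co.uk/pup-of-the-week",
     "lastmod": "2016-01-23", "changefreq": "weekly", "priority": 0.7},
    {"loc": "http://www.tinypuppies.co.uk/breeds", "priority": 0.5},
])
print(xml_bytes.decode("utf-8"))
```

ElementTree escapes characters such as `&` in the URLs for you, which is one less thing to get wrong by hand.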

If your sitemap is exceeding the size limits, it’s time to invest in a sitemapindex. In the same way an XML sitemap lists URLs, a sitemapindex lists XML sitemaps. If your mind is wonderfully advanced and you’re still reading this rather than ctrl-tabbing to Facebook to see if that special someone has liked your photo yet, then you can probably see beautiful sitemap hierarchies stretching off into the sunset. Well, stop that. This is the real world, where nobody gets to be beautiful up close (yes, I am hinting at your photo). A sitemapindex can NOT contain links to other sitemapindex files. You can submit multiple sitemapindex files to search engines, but never the twain shall meet.

Here’s an example of a sitemapindex file containing two XML sitemaps (no annotations required, I’m not holding your hand through all of this and besides, it’s not rocket science):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<sitemap>
		<loc>http://www.tinypuppies.co.uk/sitemap-breeds-a1.xml</loc>
		<lastmod>2016-01-01</lastmod>
	</sitemap>
	<sitemap>
		<loc>http://www.tinypuppies.co.uk/sitemap-breeds-a2.xml</loc>
		<lastmod>2016-01-05</lastmod>
	</sitemap>
</sitemapindex>
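For the curious, splitting an oversized URL list and pointing a sitemapindex at the pieces could be sketched roughly like this. The helper names `chunk` and `build_sitemapindex`, and the `sitemap-breeds-N.xml` file naming, are assumptions of mine and not part of the protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit from the section above


def chunk(urls, size=MAX_URLS):
    """Split a list of URLs into lists of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemapindex(sitemap_locs, lastmod=None):
    """Return sitemapindex XML (bytes) pointing at each sitemap file URL."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_locs:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        if lastmod is not None:
            ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


# 120,000 made-up URLs need three sitemap files.
urls = ["http://www.tinypuppies.co.uk/breeds/%d" % n for n in range(120_000)]
parts = chunk(urls)
locs = ["http://www.tinypuppies.co.uk/sitemap-breeds-%d.xml" % n
        for n in range(1, len(parts) + 1)]
index_xml = build_sitemapindex(locs, lastmod="2016-01-05")
```

Each list in `parts` would then be written out as its own XML sitemap at the matching location.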

Search engine crawlers and the occasional strange user will look for the sitemap on your site at http://www.yoursite.com/sitemap.xml as is the convention, but nothing stops you from storing your sitemaps in a different location and submitting them to search engines manually or programmatically.

As we stumble to the end of this section and the start of your next coffee break which you can hopefully drag out to lunch or home time, here are some common XML sitemap myths debunked:

  1. “Including a URL in the XML sitemap guarantees it will be indexed.”
    • No. It’s important to note that XML sitemaps are only recommendations. The XML sitemap will not guarantee indexation of the URLs included.
  2. “If I leave a URL out of the XML sitemap it will get deindexed.”
    • No. The XML sitemap will not exclude indexation of URLs not included on the XML sitemap. It’s merely a set of recommended URLs that, if the recommendations agree with the signals the rest of the site is sending, will lend a bit of extra importance to the URLs included above and beyond the other URLs on the site.
  3. “XML sitemaps are difficult to create and maintain.”
    • No. In the simplest cases, small sites can easily create and post their own XML sitemaps manually using the examples above as formatting guides. For larger sites and sites that change more frequently, plugins or modules available for most ecommerce platforms can automate the creation and posting of XML sitemaps.
  4. “Posting an XML sitemap is like asking to get scraped and spammed.”
    • No. An XML sitemap is nothing more than a list of URLs. Scrapers and spammers can easily crawl any public site they wish to generate a list of URLs and content from which to steal a site’s content for their own nefarious purposes. They certainly don’t need an XML sitemap to do it, and not posting an XML sitemap won’t keep the scrapers and spammers away.

Why do we want sitemaps?

Sitemaps are the cure for world hunger, and they are your best friend. Sitemaps like to give you money for nothing, and they always help you move house. Almost none of that is true; it was just a clever ploy to get you interested again.

Using sitemaps has many benefits. If only there were a sensible way of listing points along with brief descriptions...

  • I know - I’ll use a table!

| Benefit | Description |
| --- | --- |
| Content Modification | It’s easier to maintain higher ranks in search engines if you keep modifying content on your site, keeping it fresh and useful to the needs of your visitors. Effective use of sitemaps means Google will be alerted whenever your site’s content is modified. |
| Efficient Crawling | All content on a site should be crawled, which can take a long time. If it takes too long, crawlers are likely to give up. With a trusty sitemap in their lifeless robot hands, however, the website can be efficiently and effectively crawled, which leads to better indexing of your site. |
| Content Prioritisation | Sitemaps let you prioritise your pages. This means that the pages carrying your most important content will be crawled and indexed faster than those with a lower priority value. |
| Get Discovered | The main reason you invest time and money creating new content for your site is that you expect to be found on the web. Using sitemaps means your new content will be discovered by search engines a lot faster. Highly recommended for new websites or web pages and other pieces of content. |
| Free Service | The best things in life are free! Except for spam mail, cut that out. Yes, submitting your sitemap to search engines doesn’t cost a penny; it only costs a poor engineering schmuck some of his greasy time. No loss there, just do it! |
| Learn About Your Visitors | You can learn a lot by monitoring your sitemap reports on the various search engine webmaster tools. Errors will be displayed so that you can fix them, along with traffic sources and even keyword searches. Using this information can help you improve your content and attract more attractive traffic, which is important because nobody wants ugly traffic. |

What is a microservice?

In the real world your face experiences, a service is a system that supplies utilities or commodities as required. For example, a telephone service supplies customers with the telephone communications utility (most of the time). Another example is a bus service supplying the dirty general public with the transportation utility.

In terms of software engineering, the meaning of “service” doesn’t actually change! We add the word “Micro” because ideally these services should be small, simple, and be concerned with a single high level responsibility. Look at us, learning stuff that we already knew but wrapped in a shiny new context, how admirable. A software microservice example is a DateTime microservice. If you want today’s date, you ask the DateTime microservice, easy.

One thing to keep in mind when designing, thinking about, implementing, or even smelling software microservices is the responsibility of the microservice. The microservice should be built for one high level purpose, and the interface to the microservice should reflect that purpose in a cohesive manner. For example, the DateTime microservice could have functions for getting tomorrow’s date, and getting the date in different formats. The DateTime microservice should not have functions for converting currency.
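To make the cohesion point a little more concrete, here is a toy sketch of what the interface of that imaginary DateTime microservice might look like. The class and method names are entirely hypothetical; the point is only that every method is about dates.

```python
from datetime import date, timedelta


class DateTimeService:
    """Hypothetical DateTime microservice: every method is about dates."""

    def today(self):
        """Return today's date."""
        return date.today()

    def tomorrow(self):
        """Return tomorrow's date."""
        return self.today() + timedelta(days=1)

    def formatted(self, day, pattern="%Y-%m-%d"):
        """Return `day` rendered with a strftime pattern."""
        return day.strftime(pattern)

    # Deliberately absent: convert_currency().
    # That job belongs to some other service.


service = DateTimeService()
print(service.formatted(service.tomorrow(), "%d/%m/%Y"))
```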

The cohesive microservice principle still applies to real world services. Using the previous examples, you wouldn’t expect your phone to only work if you have bought a bus ticket.

Hopefully that wraps up microservices. If you want to read further, a simple Google search will not be beyond your technical grasp, and should return results such as Martin Fowler’s definition of Microservice Architecture.

Why do we want sitemaps as a microservice?

If you’re as swayable as a drunk toddler and are ready to implement all the things as microservices already then you can probably skip this section.

MICROSERVICES ARE FOR ME, I READ THE INTERNET

Rather than answer the question with sitemaps specifically on the chopping board, I think it’s more useful to answer the question for any potential microservice. This way when a terrifying new microservice is proposed, you will be proud and fearless rather than shaking in your booties.

To answer “Why microservices?” I feel you would be better served by looking up one of the many, many articles already arguing for or against the architecture, such as SmartBear’s What Are Microservices.

In my own simple words, Microservice Architecture means you can develop and deploy individual components without worrying about the larger system. Fixing a bug in the billing microservice means deploying the billing module, not everything else.

Implementing a Sitemap microservice means all our products don’t need individual implementations of sitemap generation and submission. They simply make use of the utility supplied by the sitemap microservice. This means less code duplication, and more consistency in our public-facing websites.

Great, show us the way, O’ Benevolent voice on the internet!

Ok, on to the delicious meaty reason for this great wall of text: how I would go about implementing the sitemaps microservice. At this point we aren’t too clear on the inputs to the microservice, but we know the desired outputs: sitemaps!

Given this scenario,

“I am an FMP website, and I need a sitemap for all my lovely content pages.”

What does the sitemap microservice need? It needs to know the common structure of the URLs to be built, a template if you will, e.g. http://www.fmp4life.com/inspect/transcriptions/{transcription_id}, and it needs to either be given, or be told how to get all the data necessary to build all the variations of that URL.
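Filling in such a template is about as simple as it sounds. A rough sketch, assuming each template carries exactly one `{placeholder}` and that the value should be percent-encoded before it goes into the URL (both assumptions of mine; `expand` is a made-up helper name):

```python
import re
from urllib.parse import quote

PLACEHOLDER = re.compile(r"\{[^{}]+\}")


def expand(template, url_part):
    """Replace the single {placeholder} in `template` with `url_part`."""
    if len(PLACEHOLDER.findall(template)) != 1:
        raise ValueError("expected exactly one {placeholder} in " + template)
    # Percent-encode the value so spaces or slashes cannot break the URL.
    return PLACEHOLDER.sub(lambda _match: quote(url_part, safe=""), template)


template = "http://www.fmp4life.com/inspect/transcriptions/{transcription_id}"
print(expand(template, "gbprs_2_india_mars"))
```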

Let’s briefly explore the implementation of a sitemap microservice which accepts the page template and a link to, or an identifier for, a data source. This design would mean the sitemap microservice has to trawl through the data itself to gather the page names and the other properties. That means less complexity in the requesting sites, but it couples the sitemap microservice to the various data sources. For example, a newly built site can’t use the sitemap microservice until work has been done to register its data source with the microservice. It also means that if any of the data sources move or change structure, work will need to be done in the sitemap microservice as well as in the requesting sites.

Now let’s explore, with an example, a sitemap microservice which accepts the page template and all the associated data. A full example of a sitemap request being given all the data necessary to build all variations of the URLs would be:

POST: http://sitemap-service.dun.fh/sitemap

BODY:

{
  "url_sets" : [
    {
      "template" : "http://www.findmypast.co.uk/surnames/{surname}",
      "pages" : [
        {
          "url_part" : "aardvark",
          "lastmod" : "2004-12-23",
          "changefreq" : "weekly"
        },
        {
          "url_part" : "aastra"
        },
        {
          "url_part" : "behemoth",
          "priority" : "0.7"
        }
      ]
    },
    {
      "template" : "http://www.findmypast.co.uk/transcriptions/{transcription_id}",
      "pages" : [
        {
          "url_part" : "gbprs_2_india_mars",
          "priority" : "0.7"
        },
        {
          "url_part" : "prs_all_detail_209_3_d_nwkfhs_47",
          "lastmod" : "2004-10-19"
        },
        {
          "url_part" : "tna_ccc_2c_multi",
          "changefreq" : "yearly",
          "lastmod" : "2003-11-09"
        }
      ]
    }
  ]
}

RESPONSE:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.findmypast.co.uk/surnames/aardvark</loc>
    <lastmod>2004-12-23</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.findmypast.co.uk/surnames/aastra</loc>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.findmypast.co.uk/surnames/behemoth</loc>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.findmypast.co.uk/transcriptions/gbprs_2_india_mars</loc>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.findmypast.co.uk/transcriptions/prs_all_detail_209_3_d_nwkfhs_47</loc>
    <lastmod>2004-10-19</lastmod>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.findmypast.co.uk/transcriptions/tna_ccc_2c_multi</loc>
    <lastmod>2003-11-09</lastmod>
    <changefreq>yearly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
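The guts of such a service need not be much more than a loop. Below is a hedged sketch of the request-to-sitemap step only (no HTTP layer), assuming the request body has already been parsed from JSON, that each template has a single placeholder, that all property values arrive as strings, and that `lastmod` and `changefreq` are written only when the caller sends them while `priority` falls back to the protocol default of 0.5. The name `build_sitemap` is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
DEFAULT_PRIORITY = "0.5"  # protocol default, used when a page sends none


def build_sitemap(request_body):
    """Turn a parsed request body (dict) into sitemap XML bytes."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url_set in request_body["url_sets"]:
        # Assumption: one {placeholder} per template; split around it.
        prefix, _, rest = url_set["template"].partition("{")
        suffix = rest.partition("}")[2]
        for page in url_set["pages"]:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = prefix + page["url_part"] + suffix
            # Optional properties are written only when the caller sent them.
            for tag in ("lastmod", "changefreq"):
                if tag in page:
                    ET.SubElement(url, tag).text = page[tag]
            priority = page.get("priority", DEFAULT_PRIORITY)
            ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


body = json.loads("""
{
  "url_sets": [
    {
      "template": "http://www.findmypast.co.uk/surnames/{surname}",
      "pages": [
        {"url_part": "aardvark", "lastmod": "2004-12-23", "changefreq": "weekly"},
        {"url_part": "aastra"},
        {"url_part": "behemoth", "priority": "0.7"}
      ]
    }
  ]
}
""")
print(build_sitemap(body).decode("utf-8"))
```

Because the function only turns a dictionary into bytes and touches nothing else, it can be tested without any data source in sight.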

Pros of this design

  • Clear single high level responsibility for the sitemap microservice: it is only concerned with building sitemaps, and it doesn’t know or care where the data originally came from.
  • Individual properties are settable for each page. Note the optional page properties being optional.
  • No tight coupling to other services or data sources. This means any site can make use of this microservice with no extra work required.

Cons of this design

  • A lot of the complexity (gathering all the page names, and their details like last modified, and change frequency) is pushed to the requesting site, complexity which is likely to be quite expensive depending on the data source.

I don’t see the con listed above as an issue. The requesting site is probably the best place for that complexity, as otherwise the sitemap microservice would need to know far too much about all sorts of data sources. Caching can be put in place to save on the expensive operations, or the requesting site could simply generate its sitemap once a day.
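The “once a day” idea could be as small as the sketch below: a wrapper that remembers the last generated sitemap and only calls the expensive generator again once the copy is a day old. `DailyCache` and its parameters are invented for illustration; a real site might lean on whatever caching layer it already has.

```python
import time

ONE_DAY = 24 * 60 * 60  # seconds


class DailyCache:
    """Keep one generated sitemap; rebuild it once it is older than max_age."""

    def __init__(self, generate, max_age=ONE_DAY, clock=time.monotonic):
        self._generate = generate  # the expensive call that builds the sitemap
        self._max_age = max_age
        self._clock = clock        # injectable so tests can fake time
        self._value = None
        self._built_at = None

    def get(self):
        """Return the cached sitemap, regenerating it if it has expired."""
        now = self._clock()
        if self._built_at is None or now - self._built_at >= self._max_age:
            self._value = self._generate()
            self._built_at = now
        return self._value
```

A site would wrap its own gather-the-data-and-request-a-sitemap function in `DailyCache` and serve `get()` whenever a crawler comes knocking.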

The winner is the “Give the Sitemap Microservice all the things!” approach. Using this design we should end up with a highly cohesive (you won’t need a bus ticket to request a sitemap), easily testable (due to simple functional design), fast (no external dependencies, simple operations, caching) sitemap microservice.

Thanks for reading, or at least gazing zombie-like at the harsh screen with the occasional flick of the mouse wheel to make it look like you’re soaking it in.

Fin.