I have recently noticed that a lot of the relevant search queries that reference this site find things like the categories and archives instead of the specific posts that contain the relevant content.
This makes the search results look disorganized, and it means the same content is duplicated across different pages, which is what confuses the search engines.
A bit of digging turned up a few ways to direct the search engines to index the content I want them to, instead of whatever they choose.
One quick way of preventing Google and other search engines from indexing parts of a site is by adding a robots.txt file to the root directory of the site. This file contains instructions for “well behaved” search engine crawlers.
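A robots.txt along these lines does the job. The exact directory and feed paths below are assumptions based on a standard WordPress install, and the sitemap URL is a placeholder; adjust them to your own setup.

```
# Section 1: all crawlers - block private WordPress directories
# and virtual paths we don't want indexed (feeds, categories)
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /feed/
Disallow: /category/

# Section 2: give the AdSense crawler full access
User-agent: Mediapartners-Google
Disallow:

# Point crawlers at the sitemap generated by the XML-Sitemap plugin
Sitemap: http://example.com/sitemap.xml
```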
The first section is defined for all robots agents and blocks access to private WordPress directories as well as virtual paths that we don’t want indexed, such as the RSS feeds and categories.
The second section allows the Mediapartners-Google robot full access to the site. This is the robot used by AdSense, so any page serving ads will get crawled for keyword context matching. Without this, AdSense would not be able to review the contents of the page to help match ads.
The last line “Sitemap:” identifies the sitemap built by the XML-Sitemap plugin.
The <meta> robots tag in the head of a page can be used to tell robots, dynamically, what to do with that page. I use this, rather than robots.txt, for the archives, since the format of the archive page names is somewhat dynamic (if I were to change it, I would also have to update robots.txt).
Instead of just blocking archives, I chose to block anything that is not a page, a post, or the homepage with the following code in my theme's header.php.
Insert the following in the <head> tag in header.php.
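A sketch of that conditional, using the standard WordPress conditional tags; the exact combination of tags and the `content` values are assumptions reconstructed from the description that follows.

```php
<?php
// If this is NOT a single post, a page, or the homepage,
// OR it is a paged listing (e.g. /page/2), tell robots not
// to index or archive it, but still follow its links.
if ( ( !is_single() && !is_page() && !is_home() ) || is_paged() ) { ?>
<meta name="robots" content="noindex,noarchive,follow" />
<?php } ?>
```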
If the page is NOT a single post, a page, or the homepage, OR it is a paged listing, then block it from being indexed and archived by search engines, but allow them to follow the links to other pages.
I chose to block the is_paged() pages (things like the previous pages from the homepage, e.g. /page/2) this way instead of through robots.txt so that their links would still get “followed”. Anything excluded in robots.txt is, in theory, never loaded by a search engine robot at all, so the robot cannot follow any links on that page. I'm not sure this is strictly necessary, since those links should all be reachable by following the links through the posts.
This <meta> tag will also block category pages, so the /category exclusion in the robots.txt is not strictly necessary.
I’m not clear on exactly how the AdSense robot treats the <meta> tag, but it seems like it might be blocked too. We will have to see how this plays out.
Another tweak I discovered, and like, is to reverse the title of the pages. By default, my theme was building hierarchical titles starting with “Notions” on the left and the post name on the right. I switched them so the article is on the left and the blog name is on the right. It makes the most significant thing, the page subject, the first thing you read.
So now this entry is titled
“WordPress Blog Search Engine Optimization « Notions”
instead of
“Notions » WordPress Blog Search Engine Optimization”
The following was inserted into the <head> tag in header.php and replaces any other reference to <title>
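A sketch of what that replacement looks like, using the standard wp_title() and bloginfo() template tags to print the post title first and the blog name second; the separator character and exact arguments are assumptions.

```php
<title><?php
    // Print the post/page title first, with the separator on its right
    // (prints nothing on the homepage, leaving just the blog name)
    wp_title( '«', true, 'right' );
    // Then the blog name, e.g. "Notions"
    bloginfo( 'name' );
?></title>
```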
To block categories or not to block categories?
The reason to block real pages such as the feeds and categories is to prevent duplicate content from being indexed. Category and archive pages contain copies of the original posts (as they should), but this confuses search engines, which see the same content in several places on your site. Blocking the extra copies makes it more obvious to the search engine where the real content is and what to index, and sends users to the real pages (with comments, etc.) rather than an archive page.
I debated for a while whether I should block the category pages, since they do provide a service for users searching for things related to those categories, and they group related posts together. In the end I decided it was still not worth the confusion of the extra pages. An alternative would be to block the regular posts and only allow the categories to be indexed, as they may provide more keywords to search and contain more relevant content.