Search Engine Optimization
Written by Michael Salsbury   
Wednesday, 14 September 2005

This article concludes my series on search engines.  The initial article introduced search engines and explained what they did at a very high level.  The second and third articles provided more detail about how the search engines accomplish what they do.  This fourth and final article describes how to make your content more accessible to search engines and how to avoid being classified as a "search engine spammer."

Helping Search Engines Find Your Content

If you do nothing at all but add content to your site and submit it to the search engines, you’ll no doubt find that they manage to index some or all of it over time.  There are, however, things you can do as a webmaster that will make it much easier for the search engines to locate your content on your site.  In fact, you may find that the search engines NEVER index certain content on your site if you don’t take some of these steps.

Consider my own site’s content management system, called “Mambo Open Source”.  Mambo meets my needs as a content provider, but it isn’t the friendliest tool for search engine spiders to work with.  For example, one of the URLs on my site looks like this:

http://mikesalsbury.com/mambo/index.php?option=content&task=view&id=234

If you’re a search engine spider, you might look only at the part of this URL that actually refers to a file on my web server.  In that case, you’d shorten the above URL to something like this:

 http://mikesalsbury.com/mambo/index.php

If you don’t think that makes a difference, try clicking both of those URLs and seeing where you wind up.  You’ll see that the first URL takes you directly to a page about my experiences with virtual machine software, while the second one just brings you to my home page.  If the search engine spider visiting my site happens to ignore the part of my URLs after the question mark (“?”), then it’s not going to find any of my content and it’s only ever going to index the home page.  (The reason a spider might ignore the part of the URL after the question mark is that the question mark normally introduces parameters passed to the server software, like your name, login ID, account number, etc.  Such information normally isn’t needed to get to a web page and can cause a search engine trouble in retrieving it, such as when a password is needed in order to access the site with an account.)

In this case, my content management system also supports another kind of URL it calls “Search Engine Friendly” (SEF) URLs.  The same article linked to that first URL looks like this when it’s “Search Engine Friendly”:

 http://mikesalsbury.com/mambo/content/view/234/

If a search engine sees this URL, what it thinks it sees (hopefully) is that there is a directory on my server called “mambo”, with a subdirectory of “content”, containing a subdirectory of “view”, and in there a subdirectory of “234”.  It should therefore leave this URL alone and index whatever page it finds at that location.  Thus, this particular article would now get indexed even if the search engine in question normally discarded the things after the question mark in the URL.

Thus, if you’re using a content management system, you should take the time to look at the URLs it works with and see if they look more like my first example above, or the third example.  If they have lots of parameters like the first example, you might want to spend some time in the system’s configuration details looking to see if there is some kind of alternative for using URLs that are more search engine friendly.
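
To make the difference between the two URL styles concrete, here is a minimal Python sketch.  It is not how Mambo itself builds its SEF URLs; the function names strip_query and to_sef are made up for illustration, and the option/task/id mapping is simply read off the example URLs above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs


def strip_query(url: str) -> str:
    """Return the URL as seen by a spider that discards everything after '?'."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def to_sef(url: str) -> str:
    """Rewrite .../index.php?option=X&task=Y&id=Z as .../X/Y/Z/ (assumed mapping)."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    # Directory that holds index.php, e.g. "/mambo/"
    base = parts.path.rsplit("/", 1)[0] + "/"
    segments = [params[name][0] for name in ("option", "task", "id")]
    path = base + "/".join(segments) + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


original = "http://mikesalsbury.com/mambo/index.php?option=content&task=view&id=234"
print(strip_query(original))  # what a query-ignoring spider would request
print(to_sef(original))       # the parameter-free equivalent
```

The first call shows why every article collapses to the home page for a spider that drops parameters; the second shows that the same three values survive when they are carried in the path instead.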

 But Wait… There’s More!

Search engine spiders have limitations.  They might be programmed to only go a certain number of levels deep through your site.  They might be programmed to spend only a few seconds indexing each site.  They might do any of a number of things that prevent them from getting to everything in your site.  In terms of making your content more accessible to users, this is a bad thing.  Fortunately, there are some things you can do to help the spiders find the content on your site.

One of the easiest and most universal of these is to create a “site index” file.  It is probably most effective to make this an HTML file that is up near the “root” (highest level) directory on your site.  For example, on my site the site index file is contained at:

http://mikesalsbury.com/siteindex.html

The main reason to use such a file is to make your content easier for search engines to locate.  If you look at my site index, you’ll generally find a link to everything (except maybe the most recently-added articles) that exists here.  Why is this page a good thing?  Think like a search engine spider.

A search engine spider hits my home page, for example.  There, it finds a link to the site index file.  Since that link is near the top level of my site’s content, it will load up the site index page.  It will see that this page is chock full of interesting links to other things on my site.  It will start dutifully following those links one at a time until one of the following things happens:

  • It finds that it’s already indexed everything here and goes on to another page, or
  • It’s reached its time limit for indexing my site.

In the first case, that means it should theoretically have hit every piece of content on this site that I’d like it to find, which makes me very happy because it means that people will be able to find it by making a search on that particular search engine. 

In the second case, it has run out of time before indexing everything I have to offer.  While this is bad news for me, the next time the spider comes to my site and to this page, it will recognize all the URLs it’s scanned that haven’t changed.  It won’t bother with those URLs this time around, and will start looking at the ones it hasn’t seen.  This also makes me happy because it means that, given enough visits to my site, the spider is going to find everything here I have to offer (or at least everything I wanted it to find).  Again, that means people are going to find my content when they do a search on that search engine. 
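
The two cases above can be pictured with a toy simulation.  This is only a sketch of the behavior described here, not how any real spider is implemented: the function crawl_visit is hypothetical, the time limit is modeled as a simple page budget, and skipped pages are assumed to cost nothing.

```python
def crawl_visit(site_index_links, seen, changed, budget):
    """Simulate one spider visit that starts from the site index page.

    site_index_links: URLs listed on the site index, in page order.
    seen: set of URLs indexed on earlier visits (updated in place).
    changed: set of URLs whose content changed since they were indexed.
    budget: most pages the spider will fetch this visit (stand-in for a time limit).
    Returns the list of URLs fetched on this visit.
    """
    fetched = []
    for url in site_index_links:
        if len(fetched) >= budget:
            break  # second case: out of time before reaching everything
        if url in seen and url not in changed:
            continue  # already indexed and unchanged, so skip it
        fetched.append(url)
        seen.add(url)
    return fetched


links = [f"/content/view/{n}/" for n in range(1, 6)]
seen = set()
first_visit = crawl_visit(links, seen, changed=set(), budget=3)
second_visit = crawl_visit(links, seen, changed=set(), budget=3)
print(first_visit)
print(second_visit)
```

With five pages and a budget of three, the first visit stops early and the second visit picks up only the two pages it missed, which is the "given enough visits" effect described above.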

Some content management systems include the ability to generate a site index file like this.  If yours does, that’s great news because it means you need only turn on the feature, check to see that it works, and move on to other things.  If yours, like mine, doesn’t do that, you’ll need to do it yourself like I do.  In my case, that means downloading a copy of the current site index, copying and pasting in the URLs to the newest items of content, saving the updated site index, and pushing it back to the server.  It’s a pain, but it pays off in indexing.  Since I implemented this technique, Google has managed to index pretty much every page I’ve got.  That’s better results than I saw in the first 2-3 months after the site was first registered on Google.  Other search engines like MSN, Yahoo, and the like, have also picked up my pages a lot more quickly. 
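
If you maintain the file by hand, a small script can take over the copy-and-paste step.  The following Python sketch is one possible approach, not the process I actually use: build_site_index is a made-up helper, and the page label in the example is a placeholder.

```python
from html import escape


def build_site_index(pages, title="Site Index"):
    """Return an HTML page containing one plain link per (url, label) pair."""
    items = "\n".join(
        f'    <li><a href="{escape(url)}">{escape(label)}</a></li>'
        for url, label in pages
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        f"  <h1>{escape(title)}</h1>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


# Hypothetical page list; in practice this would come from your CMS or a text file.
pages = [
    ("http://mikesalsbury.com/mambo/content/view/234/", "Virtual machine software"),
]
site_index_html = build_site_index(pages)
print(site_index_html)
```

Saving the returned text as siteindex.html and uploading it is still up to you, but adding a new article becomes a one-line change to the page list.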

Google Sitemaps Beta

Google is currently testing a technique that is very similar to the above, called “Google Sitemaps”.  The idea is to prepare a specially-formatted XML file that lists all the content pages on your site that you want to see indexed.  For each page, you specify how often you expect to update it (for example, daily/weekly/monthly), how important that page is relative to the others on your site (i.e., which ones the spider should hit first, second, third, etc.), and when the page was last modified.  When Google’s spider comes to your site, it will look for this Sitemap file.  If it finds one, the spider will prioritize what’s in it based on the information you provided, then visit each page that it hasn’t indexed or which has changed since the last visit.  In a sense, this is your chance to tell the spider where you’d like it to spend its limited time at your site.  If you create the file properly and keep it up to date, you should find more and more of your content showing up on Google.
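
As a rough illustration of what such a file holds, here is a Python sketch that writes one entry per page with the three pieces of information described above.  The element names (urlset, url, loc, lastmod, changefreq, priority) and the namespace come from the Sitemap protocol as published at sitemaps.org; the beta may expect a different schema URL, so check Google’s current instructions before relying on it.  The helper build_sitemap and the sample values are made up.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org 0.9 protocol (an assumption for the beta).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a Sitemap XML string.

    entries: iterable of dicts with the keys loc, lastmod, changefreq, priority.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical values for a single page.
sitemap_xml = build_sitemap([
    {
        "loc": "http://mikesalsbury.com/mambo/content/view/234/",
        "lastmod": "2005-09-14",
        "changefreq": "monthly",
        "priority": 0.8,
    },
])
print(sitemap_xml)
```

Priority is a value between 0.0 and 1.0 that only ranks your own pages against each other, so setting every page to 1.0 tells the spider nothing.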

What Have They Got on Your Site?

If you aren’t sure what a search engine has collected from your site, one of the easiest ways to find out is to ask it.  For example, if you go to Google’s home page and enter “site:mikesalsbury.com” you will find a list (in no particular order I can discern) of every page Google has found on my site… which ought to be almost all of them.  There are similar facilities within most of the other search engines out there, though it may take some time to sort them all out.  Using this technique, you can determine if your content management system is causing the search engine any trouble (i.e., very few of your pages are showing up), if the spiders have found the content you consider most important (i.e., is “Page X” in the list), and get a feel for when a spider last visited your site (by looking at the “cached” home page and seeing what date or content it contains). 

Don’t Become a Spammer!

There are a number of things you could do on your site that might generate more hits from search engine visitors.  For example, if a musician currently has a number 1 single, you could shove lots of keywords into a page’s meta tags that would give a searcher the impression that your page contains information about the song, the musician, the song’s lyrics, or maybe even a free download of it.  When someone found this page in their search and clicked on it, they’d find whatever’s on your page instead of what they expected.  While this would generate a good bit of traffic on your site, it would also make you a “search engine spammer” because you lied to people in order to get them to come to your web page.  Even if such behavior is ethical and morally acceptable to you, search engines take a very dim view of it and they’ll very quickly drop your page from their index and perhaps even permanently ban you.

There are a number of other things you could do that, while they might raise your ranking in the search engine results, will likely get you classified as a search engine spammer.  These include: 

  • Creating a page that contains content that looks like a valid web page on a popular topic, but when visited immediately directs people to a page of your choice that isn’t about that content (e.g., they visit your page about a recent hurricane and before they can even read it, you’ve redirected them to a page about your new online casino).  This is also known as a “bait and switch”.
  • Stuffing your page content or meta tags with lots of keywords so that it appears to contain more information, or more relevant information, than it really does.  For example, if you have one paragraph about a car show you attended and three paragraphs with nothing but words and phrases (such as repeating the name of that car show 100 times in a row), that’s considered spamming.
  • Duplicating your content across several pages.  Since search engines analyze the text content of your site to try to determine its overall theme (e.g., the word “cartoon” appears 1000 times and the word “business” appears 600, so the search engine figures your site focuses on business cartoons and therefore moves you up in the ranking when someone searches for “business cartoons”), duplicating the same content on several pages would give the impression that your site (which has, let’s say, 10 unique pages about cars and 90 copies of those pages) has more content than it does and that this content is more focused on a particular subject than it really is (e.g., you have 10 pages about cars and 30 about computers, but your 10 car pages are duplicated 10x each, so it looks like a huge percentage of your site is about cars when it’s really 75% about computers).
  • Using tiny text on the page to increase the keyword count, or to include keywords that don’t really relate to the content, on the assumption that visitors won’t read this small print but search engines will.  While this may work with some search engines, I’m pretty sure most of them will pick up on it and drop you in the rankings of search results.  Similarly, using hidden text, hidden links, and gibberish will get you into trouble.
  • Putting doorway pages between users and their desired content.  For example, before I can get to your page about the latest legal snafu in my state’s government, I have to click through a page advertising some overseas pharmacy.
  • Creating a “link farm”, or a part of your site that does nothing but accumulate links to other sites that may or may not have anything to do with yours, or a page that implies that it’s about a particular subject but really just contains a list of links and a bunch of ads that pay you money.
  • “Cloaking” your page, or rather, serving text pages to search engine spiders and graphic pages when users click on those links in the search engine results… which means that what the user gets and what the search engine thinks they’re showing the user are two different things.
  • “Domain Spam” or creating a large number of web sites that do nothing more than link to your main web site.  This would cause your page to rank higher among search results because it would appear to spiders that your site is linked to by a large number of other sites (and as we discussed earlier, the number of inbound links to your site is one way your pages move up in the search result rankings).
  • “Typo Spam” or registering a URL that is a typo for a commonly-visited one.  For example, www.microsoft.com is commonly visited, so you might register a “typo spam” URL that is similar to a keyboarding mistake, such as “mcirosoft.com” simply to drive traffic to you when people mistype the name Microsoft.
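
The arithmetic in the duplicate-content item above is easy to check.  This short Python sketch assumes that “duplicated 10x each” means ten copies of each car page in total (100 car pages), and it measures a site’s apparent focus purely by page counts, which is a simplification of whatever a real search engine does.  The function topic_share is made up for the example.

```python
def topic_share(page_counts):
    """Return each topic's fraction of the total page count."""
    total = sum(page_counts.values())
    return {topic: count / total for topic, count in page_counts.items()}


# The site as it really is: 10 car pages and 30 computer pages.
honest = topic_share({"cars": 10, "computers": 30})

# The same site after every car page is copied until there are 10 of each.
padded = topic_share({"cars": 10 * 10, "computers": 30})

print(f"Really about computers:    {honest['computers']:.0%}")
print(f"Appears to be about cars:  {padded['cars']:.0%}")
```

Under those assumptions the padded site looks like it is roughly three-quarters about cars, even though three-quarters of its unique pages are about computers.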

If you avoid these techniques, you should manage to keep yourself on the good side of those maintaining the search engines of the world.

That’s All, Folks!

At this point, you pretty much know everything I do when it comes to search engines.  Now it’s time for you to go out there and optimize your own site for search engine results!


Last Updated (Thursday, 30 March 2006)