How Search Engines Work, Part 1
Written by Michael Salsbury   
Thursday, 08 September 2005

In the first article in this series, we saw how search engines work. They have “spiders” which crawl the web looking for pages to store in a database, a “search page” that people use to find web pages of interest, and a “search engine” that helps translate what people ask for into a list of pages that seem to be most relevant. The process looks like this:




But there is a lot more to the “search engine” block in this diagram than there is to any other part of the overall search engine system. As a webmaster, you should be asking yourself several questions about the search engine at this point:

  • How does the search engine find a site to spider? (“How do I get my site listed?”)

  • What does my web site look like to a spider? (“Is the design of my site preventing spiders from finding my content?”)

  • Once the search engine has found a site, how does it know what the pages on that site are about? (“How can I help the search engine figure out what my site is all about?”)

  • How does the search engine determine which pages to place at the top of the list and why? (“How can I move my site farther up the list?”)

  • What is “Search Engine Spam”? (“How do I avoid looking like Search Engine Spam to a spider?”)

We'll be discussing the answers to some of these questions in this article. The rest will be covered in future articles in this series.

How do I get my site listed with the search engines?

If your site has been around for a while, the major search engines may already have found it. But if it's a new site, or one that isn't widely publicized, the search engines may not know you exist. The first thing you need to do is make sure they can find your site. If they can't find your site, neither can your intended audience.

According to SearchEngineWatch.com, the following search sites accounted for the vast majority of all search traffic on the Internet during July 2005:

Add those together and you'll find that they represent 99.4% of the search engine traffic you're likely to receive at your site. If you want to get listed on each of those sites, click their names to go (as directly as I can get you) to their submission page. One thing you may find as you read information on the other sites is that they draw from Google in one way or another. AOL's search, for example, is provided by Google. InfoSpace draws information from Google, Yahoo, and others, but charges to submit your site to their engine. Since they're providing less than 1% of the traffic we're likely to get, I don't think that paying for their service is a smart idea (unless maybe you've surveyed your potential audience and found that a lot of them use it).

Once you've submitted your site to one of these search engines, it will be added to the spider's list of sites to crawl. The spider will (in most cases) first look for a file called “robots.txt” at the top level of your site, which tells it which parts of your site you DON'T want included in the database. It will then fetch your homepage and scan that page for links to other pages on your site. It will then visit those pages and scan them for links, and so on. It may index all of your site in one visit, or it may take several visits over a period of months to pick everything up. Rest assured that eventually the search engines will find you and include you in their system.
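To make that crawl loop concrete, here is a minimal sketch in Python. It is not how any particular search engine works. The site contents, the robots.txt rules, and the spider name “ExampleSpider” are all made up for illustration, and the pages live in a dictionary instead of being fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical site: URL -> raw HTML the server would return.
SITE = {
    "http://example.com/": '<a href="/about.html">About</a> '
                           '<a href="/private/notes.html">Notes</a>',
    "http://example.com/about.html": '<a href="/products.html">Products</a>',
    "http://example.com/products.html": "<p>No further links.</p>",
    "http://example.com/private/notes.html": "<p>Not for indexing.</p>",
}

# Hypothetical robots.txt asking every spider to stay out of /private/.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, site, robots_txt, agent="ExampleSpider"):
    """Follow links breadth-first, skipping anything robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    seen = {start_url}
    queue = deque([start_url])
    indexed = []
    while queue:
        url = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue  # the site owner asked spiders to stay out
        html = site.get(url)
        if html is None:
            continue  # broken link: nothing to index
        indexed.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed


print(crawl("http://example.com/", SITE, ROBOTS_TXT))
```

The home page, the About page, and the Products page end up in the index. The page under /private/ is linked from the home page, but the spider never indexes it because robots.txt asks it not to.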

What does my site look like to a spider?

If you thought a spider sees your web page the same way you do in Firefox, Internet Explorer, or Netscape, think again. If you want to see what your site looks like to a search engine spider, here's a simple test. Launch your web browser and go to your site's home page. Right-click the page and select the “View Source” or “View Page Source” option from the pop-up menu. What you see in the resulting window is just what the spider will see. Got a fancy Macromedia Flash menu? Surprise! The spider can't see that. Got lots of complicated frames and graphics content? The spider can't see that either. All it sees is the raw HTML output of your server. No graphics, no Macromedia content, no sounds, nothing but HTML.
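If you'd rather let a program do the “View Source” test for you, here is a rough sketch using Python's standard HTML parser. The sample page and its file names (menu.swf, banner.gif, catalog.html) are invented for this example, and a real spider is certainly more sophisticated than this. It simply pulls out the two things a basic spider can use: readable text and links.

```python
from html.parser import HTMLParser


class SpiderView(HTMLParser):
    """Collects readable text and link targets; skips script and style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


# Hypothetical home page: a Flash menu, a banner image, a heading, one link.
PAGE = """
<html><body>
  <object data="menu.swf" type="application/x-shockwave-flash"></object>
  <img src="banner.gif">
  <h1>Model T Ford parts</h1>
  <a href="/catalog.html">Catalog</a>
</body></html>
"""


def spider_view(html):
    """Return (text fragments, link targets) found in the raw HTML."""
    parser = SpiderView()
    parser.feed(html)
    return parser.text, parser.links


text, links = spider_view(PAGE)
print(text)
print(links)
```

The Flash menu and the banner contribute nothing to either list. Only the heading, the link text, and the one plain HTML link survive, which is the point of the test above.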

Put yourself in the shoes of the spider. Are there links here that a spider can follow to your content? If not, you've got a real problem. There should be at least one HTML link here that the spider can take to delve further into your site. From there, it should be able to find more... and more... until it finds everything you'd consider “relevant” on your site. The more levels deep the spider has to go, the longer it takes to reach the content at that level and the less likely it is that the content will ever be fully indexed. This is because most of the spiders are coded with “limits” to keep them from spending too much crawl time at any one site. If the spider is allotted 1 minute per site, for example, and it can only get about 3 levels deep into your 7-level site in that time, the content you've got at level 4 and deeper will probably never show up in a search engine.

One way to help the search engine spiders help you is to provide them with an index to your site. If you can do it, I recommend creating a single page that lives very close to the top level of your site's hierarchy that contains a link to every important piece of content you have. This way, the spider will find that content early on in its search and there is a better chance it will get everything you have to offer.
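Here is a small sketch of why depth matters and why an index page helps. It assumes a spider that simply gives up after a fixed number of levels, which is a simplification of the time limits described above. The page names and the limit of 2 levels are hypothetical.

```python
from collections import deque


def crawl_to_depth(links, start, max_depth):
    """Breadth-first crawl; returns {page: depth} for every page reached."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # limit reached: do not follow links any deeper
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical site where each page links only one level further down.
DEEP_SITE = {
    "home": ["section"],
    "section": ["category"],
    "category": ["subcategory"],
    "subcategory": ["article"],
    "article": [],
}

# The same site with a site index linked from the home page.
INDEXED_SITE = dict(DEEP_SITE)
INDEXED_SITE["home"] = ["site-index", "section"]
INDEXED_SITE["site-index"] = ["section", "category", "subcategory", "article"]

without_index = crawl_to_depth(DEEP_SITE, "home", max_depth=2)
with_index = crawl_to_depth(INDEXED_SITE, "home", max_depth=2)

print("article" in without_index)
print("article" in with_index)
```

Without the index page the spider stops at “category” and never sees “subcategory” or “article”. With the index page one click from home, every page sits at most two levels down and gets picked up.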

With respect to Google, there is a feature called “Sitemaps” that they're working on. You can help them work on it by submitting your site to them and including a “sitemap.xml” file for their spider to pick up. This file (like the index we discussed above) gives them a list of every relevant page on your site, your own ranking of how important it is to you that each page be indexed, and an indicator of how often you expect that page to change. With this file, Google's spider is better able to judge what content is available at your site and in what order it should crawl through it. There is a good chance that this will improve your overall coverage in that search engine. Looking long-term, I expect the other search engines to pick up on this file eventually and incorporate it into their processes, too. Thus, it may help you with sites other than Google.
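As a rough illustration, the sketch below builds a small sitemap file with Python's standard XML library. The element names (urlset, url, loc, changefreq, priority) and the namespace follow the sitemaps.org protocol as I understand it, and the exact format Google accepts may differ, so check their documentation before relying on it. The URLs and values are invented.

```python
import xml.etree.ElementTree as ET

# Namespace from the sitemaps.org protocol (an assumption; verify before use).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML as a string for a list of page dictionaries."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        # How often you expect the page to change.
        ET.SubElement(url, "changefreq").text = page["changefreq"]
        # Your own ranking of the page, from 0.0 to 1.0.
        ET.SubElement(url, "priority").text = f'{page["priority"]:.1f}'
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages for illustration only.
PAGES = [
    {"loc": "http://example.com/", "changefreq": "weekly", "priority": 1.0},
    {"loc": "http://example.com/parts.html", "changefreq": "monthly", "priority": 0.5},
]

print(build_sitemap(PAGES))
```

You would save the resulting text as “sitemap.xml” on your site. Note that the priority only ranks your pages against each other. It does not claim anything about how your site compares to someone else's.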

How does the search engine classify the pages the spiders find?

When a spider crawls through a page on your site, it will place a copy of that page in a “cache” in the search engine's database. The search engine will scan through that cached page to determine what words and phrases it finds there. It will then “link” your page to those words and phrases, known as “keywords” and “keyphrases”. The more times it finds a particular keyword or keyphrase mentioned on your page, the more relevant it thinks your page is to searches for that particular keyword or keyphrase. Thus, it's important to make sure that the keywords people are likely to use to find your site appear on it as often as possible without making it sound ridiculous.
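The “linking” of pages to keywords can be sketched as a simple lookup table that records how many times each word appears on each page. This is a deliberately naive model, not the actual method of any search engine, and the pages and their text are made up.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to {url: number of times the word appears on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][url] = count
    return index


def search(index, word):
    """Return the pages containing word, most mentions first."""
    hits = index.get(word.lower(), {})
    return sorted(hits, key=lambda url: (-hits[url], url))


# Hypothetical cached pages: URL -> text the spider extracted.
PAGES = {
    "/history.html": "The Model T was built by Ford. The Model T changed driving.",
    "/parts.html": "Parts for the Ford Model T, Model T tires, Model T lamps.",
    "/cigars.html": "Notes on cigars.",
}

index = build_index(PAGES)
print(search(index, "model"))
```

In this toy model the parts page comes out ahead of the history page for the word “model” simply because it mentions the word three times instead of two.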

For instance, if I create a page about the Model T Ford automobile, it might be the best and most useful page about that car on the entire Internet. But if I only mention the phrase “Model T Ford” 2 or 3 times on that page, while someone else mentions it 20 or 30 times on a page of the same size, the search engines (generally speaking) are going to think that other page is “more relevant” than mine and place it higher up in the list of results. Chances are that people will visit the pages higher in the list before getting to mine, and they might find their answers before they get to my site. In this example, that may be disappointing to me (if I'm a Model T collector, for instance, and I want to share what I know) or it could be catastrophic (if I'm selling Model T accessories and no one ever comes to my site to buy them).

This can be carried too far. If you simply filled the page with the phrase “Model T Ford” over and over again, you could theoretically go way up in the search engine rankings. However, your page would read like gibberish to a human being who visited it, and they'd make a mental note never to go to your ridiculous site again. Similarly, if you mentioned that phrase in every sentence, your reader would get very annoyed with you. So the key is to use the keywords and keyphrases as often as you can without making the page appear silly or garbled. Besides, people called “search engine spammers” have used this tactic (filling a page up with commonly-searched words and phrases) in the past, and most of the search engines contain safeguards that keep such a page from ever getting very high in the results. In fact, if they suspect you of doing it, they may very well delete all the pages on your site from their database and never visit your site again.
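One way to check your own pages before a search engine does is to measure how much of the text is taken up by the phrase you're targeting. The sketch below does that. The 25% cutoff is a number I made up for illustration. The search engines don't publish their safeguards, so treat it as a sanity check and nothing more.

```python
import re


def words_of(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_density(text, phrase):
    """Fraction of the page's words that belong to occurrences of phrase."""
    words = words_of(text)
    target = words_of(phrase)
    if not words or not target:
        return 0.0
    n = len(target)
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
    return hits * n / len(words)


MAX_DENSITY = 0.25  # hypothetical cutoff; real engines do not publish theirs


def looks_stuffed(text, phrase, limit=MAX_DENSITY):
    """True when the phrase takes up more of the page than the cutoff allows."""
    return phrase_density(text, phrase) > limit


# Made-up sample text for illustration.
normal = "Our shop sells lamps and tires for the Model T Ford and ships them worldwide."
stuffed = "Model T Ford Model T Ford Model T Ford buy Model T Ford"

print(looks_stuffed(normal, "Model T Ford"))
print(looks_stuffed(stuffed, "Model T Ford"))
```

The first sample mentions the phrase once in a full sentence and passes. The second is almost nothing but the phrase and gets flagged, which is roughly what a human visitor would conclude, too.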

That's all for now...

In the next installment of this series, we'll take a closer look at how search engines rank the pages they display in their search results to users. In doing so, we'll learn ways we can improve our site's ranking in the results and thus get more traffic to our site.


Last Updated ( Thursday, 30 March 2006 )
