15 August 2008

Search Engine: The Pool of Information

A search engine is a database of resources extracted from the Internet through an automated “crawling” process. This database is searchable through user queries.

Words or phrases you enter in the search box are matched to resources in the search engine’s database that contain your terms. These are then automatically sorted by their probable relevance and presented with the most “relevant” sites appearing first.

Once a search engine has used your search terms to gather “hits” from its database, it lists or “ranks” the resulting sites in order of its own estimation of their relevance. The procedures and factors used to create this ranking are often company secrets, so understanding exactly why one hit is listed higher than another is difficult.

To build its database, a search engine takes an entire page, along with the pages it links to, and collects all of the words into a database. It checks how frequently each word appears and determines the important keywords within the page. The final database is then created by combining this with some additional information. Each search engine uses different additional information, such as:

o The content of the title element.
o The keywords specified in the meta element.
o The description specified in the meta element.
o The number of other pages that link to this page.
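
To make the indexing step above more concrete, here is a minimal Python sketch of how a program might pull the title, the meta keywords and description, and the word counts out of a single page. The names PageSummary and summarize_page are made up for this illustration, and the code assumes the page is already available as an HTML string. It also counts text inside script and style tags, which a real engine would skip, and it ignores incoming links entirely.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Hypothetical helper: collects title, meta fields, and word counts for one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}          # e.g. {"keywords": "...", "description": "..."}
        self.words = Counter()  # word -> number of occurrences (title text included)
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        # Count every run of letters or digits as one lowercase word.
        self.words.update(re.findall(r"[a-z0-9]+", data.lower()))


def summarize_page(html):
    """Parse one HTML string and return the filled-in PageSummary."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return parser


sample = (
    "<html><head><title>Search Engines</title>"
    '<meta name="keywords" content="search, ranking"></head>'
    "<body><p>Search engines rank pages. Search!</p></body></html>"
)
page = summarize_page(sample)
print(page.title, page.meta, page.words.most_common(3))
```

A real engine would store these summaries for millions of pages and combine them with link counts; the sketch only shows the per-page bookkeeping.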


Search engines use automated software programs known as spiders or bots to survey the Web and build their databases. Web documents are retrieved by these programs and analyzed. Data collected from each web page are then added to the search engine index. When you enter a query at a search engine site, your input is checked against the search engine's index of all the web pages it has analyzed. The best URLs are then returned to you as hits, ranked in order with the best results at the top.
Keyword searching is the most common form of text search on the Web. Most search engines do their text query and retrieval using keywords.
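
The idea of checking a query against an index can be sketched with a small "inverted index," a table that maps each word to the pages containing it. This is only an illustration under simple assumptions: the function names build_index and search, and the example.com URLs, are invented here, and the pages are given as plain text rather than fetched by a spider.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """pages maps URL -> page text. Returns a dict mapping word -> set of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the set of URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


pages = {
    "http://example.com/a": "How search engines rank web pages",
    "http://example.com/b": "Steam engines and their history",
    "http://example.com/c": "Search tips for students",
}
index = build_index(pages)
print(search(index, "search engines"))
```

The lookup never rereads the pages themselves. It only consults the index, which is why a query can be answered quickly even when the index covers a very large number of documents.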

What is a keyword, exactly? It can simply be any word on a webpage. For example, I used the word "simply" in the previous sentence, making it one of the keywords for this particular webpage in some search engine's index. However, since the word "simply" has nothing to do with the subject of this webpage (i.e., how search engines work), it is not a very useful keyword. Useful keywords and key phrases for this page would be "search," "search engines," "search engine methods," "how search engines work," "ranking," "relevancy," "search engine tutorials," etc. Those keywords would actually tell a user something about the subject and content of this page.

Most sites offer two different types of searches--"basic" and "refined" or "advanced." In a "basic" search, you just enter a keyword without sifting through any pulldown menus of additional options. Depending on the engine, though, "basic" searches can be quite complex.

Advanced search refining options differ from one search engine to another, but some of the possibilities include the ability to search on more than one word, to give more weight to one search term than you give to another, and to exclude words that might be likely to muddy the results. You might also be able to search on proper names, on phrases, and on words that are found within a certain proximity to other search terms.
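
A few of these refining options (requiring several words, excluding words, matching an exact phrase, and matching words near each other) can be sketched in Python as below. The function matches and its parameters include, exclude, phrase, and near are hypothetical names chosen for this example; no particular search engine's syntax is implied, and term weighting is left out.

```python
import re


def tokenize(text):
    """Split text into lowercase words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens, phrase):
    """True if the words of the phrase appear consecutively, in order."""
    target = tokenize(phrase)
    n = len(target)
    return n > 0 and any(
        tokens[i:i + n] == target for i in range(len(tokens) - n + 1)
    )


def within_proximity(tokens, first, second, max_distance):
    """True if the two words occur at most max_distance word positions apart."""
    pos_a = [i for i, t in enumerate(tokens) if t == first.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)


def matches(text, include=(), exclude=(), phrase=None, near=None):
    """Check one document against the refinements; near is (word, word, distance)."""
    tokens = tokenize(text)
    present = set(tokens)
    if any(word.lower() not in present for word in include):
        return False
    if any(word.lower() in present for word in exclude):
        return False
    if phrase and not contains_phrase(tokens, phrase):
        return False
    if near and not within_proximity(tokens, *near):
        return False
    return True


doc = "The jaguar is a big cat found in South America"
print(matches(doc, include=["jaguar"], exclude=["car"]))
print(matches(doc, phrase="big cat"))
print(matches(doc, near=("jaguar", "cat", 4)))
```

Excluding "car" here is the kind of refinement that keeps pages about the automobile from muddying a search about the animal.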

Relevancy Rankings:
Most search engines return results with confidence or relevancy rankings. In other words, they list the hits according to how closely they think the results match the query. However, these lists often leave users shaking their heads in confusion, since, to the user, the results may seem completely irrelevant.
Why does this happen? Basically it's because search engine technology has not yet reached the point where humans and computers understand each other well enough to communicate clearly.

Most search engines use search term frequency as a primary way of determining whether a document is relevant. If you're researching diabetes and the word "diabetes" appears multiple times in a Web document, it's reasonable to assume that the document will contain useful information. Therefore, a document that repeats the word "diabetes" over and over is likely to turn up near the top of your list.

If your keyword is a common one, or if it has multiple other meanings, you could end up with a lot of irrelevant hits. And if your keyword is a subject about which you desire information, you don't need to see it repeated over and over--it's the information about that word that you're interested in, not the word itself.
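
Ranking by search term frequency, and the weakness just described, can be shown with a short Python sketch. This is a deliberately naive scoring scheme written for illustration, not the formula of any actual search engine; the function names and the sample pages are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequency_score(text, query):
    """Score a page by how many times the query words occur in it."""
    counts = Counter(tokenize(text))
    return sum(counts[word] for word in tokenize(query))


def rank(pages, query):
    """Return URLs with a non-zero score, highest score first (ties sorted by URL)."""
    scored = [(term_frequency_score(text, query), url) for url, text in pages.items()]
    scored = [(score, url) for score, url in scored if score > 0]
    return [url for score, url in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


pages = {
    "a.example": "Diabetes diabetes diabetes diabetes",
    "b.example": "Managing diabetes with diet and exercise",
    "c.example": "Gardening tips for spring",
}
print(rank(pages, "diabetes"))  # ['a.example', 'b.example']
```

The page that merely repeats the word comes out ahead of the page that actually says something about the subject, which is exactly why frequency alone can produce hits that look irrelevant to the user.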

Search engine capabilities:
Search engines are often rated by the size of their index. Large engines such as AltaVista and Google are good tools to use when searching for obscure information, but one drawback to an extensive index is the overwhelming number of results on more general topics. If this is the case, it might be better to use a search engine with a small- to medium-sized index, such as Excite or WebCrawler. The directory structure of engines such as Yahoo! or Lycos is also helpful for categorizing the hits.

Many search engines provide directory-listing search tools such as yellow pages, white pages, and email addresses. In addition, many allow you to personalize their site to your needs. For example, you might want to set the attributes of the page to show educational news headlines and your favorite teacher resource links. In the preferences of your web browser, you can then set this page as your home start-up page.