Detailed Report On INTERNET SEARCH ENGINES AND HOW THEY WORK


Because the web contains over 50 trillion pages of information, searching is an art. There are many approaches to search, but the goal of searching is always to provide quality results efficiently. Internet searching is the process of finding required information (such as text documents, images and videos) on the Internet using different search tools. To extract quality results, we need to know which search tools to choose: explore the strengths and weaknesses of the different tools so we can pick the right one(s) for the job, learn to use them appropriately and re-phrase questions in a way the tools understand, and evaluate the information they provide so we can efficiently cream off the most relevant results.

The history of Internet search engines spans nearly 20 years, from 1990 to the present. The first search engines patrolled public directories, which contained only text and could not be searched using such techniques as adding "AND" or "OR" to the search string. As search engines became more advanced, graphics and keywords (including detailed phrases) became an important part of the way we find things on the Internet.

Different types of search tools are:-
·       Search engines
·       Meta-search Engines
·       Subject Directories
·       Information Gateways
·       Specialist Databases

                   Search engines
Most people find what they're looking for on the World Wide Web by using search engines like Yahoo!, AltaVista, or Google. According to InformationWeek, aside from checking e-mail, searching for information with search engines was the second most popular Internet activity in the early 2000s. Because of this, companies develop and implement strategies to make sure people are able to consistently find their sites during a search. These strategies are often part of a much broader Web site or Internet marketing plan. Different companies have different objectives, but the main goal is to obtain good placement in search results.


                  Search engines are programs that allow the user to enter key words that are used to search a database of web pages. Each search engine searches in a different way and searches different sites. That is why doing a search in one search engine will produce different results than doing the same search in another engine. An example would be www.google.com.


    Types of search engines:-
  
·      Spider-based search engines
·      Directory-based search engines
·      Link-based search engines

       Spider-based search engines
Many leading search engines use a form of software program called spiders or crawlers to find information on the Internet and store it for search results in giant databases or indexes. Some spiders record every word on a Web site for their respective indexes, while others only report certain keywords listed in title tags or meta tags.
Although they usually aren't visible to someone using a Web browser, meta tags are special codes that provide keywords or Web site descriptions to spiders. Keywords and how they are placed, either within actual Web site content or in meta tags, are very important to online marketers. The majority of consumers reach e-commerce sites through search engines, and the right keywords increase the odds a company's site will be included in search results.
Companies need to choose the keywords that describe their sites to spider-based search engines carefully, and continually monitor their effectiveness. Search engines often change their criteria for listing different sites, and keywords that cause a site to be listed first in a search one day may not work at all the next. Companies often monitor search engine results to see what keywords cause top listings in categories that are important to them.
In addition to carefully choosing keywords, companies also monitor keyword density, or the number of times a keyword is used on a particular page. Keyword spamming, in which keywords are overused in an attempt to guarantee top placement, can be dangerous. Some search engines will not list pages that overuse keywords. Marketing News explained that a keyword density of three to seven percent was normally acceptable to search engines in the early 2000s. Corporate Web masters often try to figure out the techniques used by different search engines to elude spammers, creating a never-ending game of cat-and-mouse.
Sometimes, information listed in meta tags is incorrect or misleading, which causes spiders to deliver inaccurate descriptions of Web sites to indexes. Companies have been known to deliberately misuse keywords in a tactic called cyber-stuffing. In this approach, a company includes trademarks or brand names from its competitors within the keywords used to describe its site to search engines. This is a sneaky way for one company to direct traffic away from a competitor's site and to its own. In the early 2000s, this was a hot legal topic involving the infringement of trademark laws.
Because spiders are unable to index pictures or read text that is contained within graphics, relying too heavily on such elements was a consideration for online marketers. Home pages containing only a large graphic risked being passed by. A content description language called extensible markup language (XML), similar in some respects to hypertext markup language (HTML), was emerging in the early 2000s. An XML standard known as synchronized multimedia integration language (SMIL) was expected to allow spiders to recognize multimedia elements on Web sites, like pictures and streaming video.

           Directory-based search engines
While some sites use spiders to provide results to searchers, others—like Yahoo!—use human editors. This means that a company cannot rely on technology and keywords to obtain excellent placement, but must provide content the editors will find appealing and valuable to searchers. Some directory-based engines charge a fee for a site to be reviewed for potential listing. In the early 2000s, more leading search engines were relying on human editors in combination with findings obtained with spiders. LookSmart, Lycos, AltaVista, MSN, Excite and AOL Search relied on providers of directory data to make their search results more meaningful.

              Link-based search engines
One other kind of search engine provides results based on hypertext links between sites. Rather than basing results on keywords or the preferences of human editors, sites are ranked based on the quality and quantity of other Web sites linked to them. In this case, links serve as referrals. The emergence of this kind of search engine called for companies to develop link-building strategies. By finding out which sites are listed in results for a certain product category in a link-based engine, a company could then contact the sites' owners (assuming they aren't competitors) and ask them for a link. This often involves reciprocal linking, where each company agrees to include links to the other's site.

Besides focusing on keywords, providing compelling content and monitoring links, online marketers rely on other ways of getting noticed. In late 2000, some used special software programs or third-party search engine specialists to maximize results for them. Search engine specialists handle the tedious, never-ending tasks of staying current with the requirements of different search engines and tracking a company's placement. This trend was expected to take off in the early 2000s, according to research from IDC and Netbooster, which found that 70 percent of site owners had plans to use a specialist by 2002. Additionally, some companies pay for special or enhanced listings in different search engines.


   A search engine operates in the following order:-
·       Crawling the Web, following links to find pages.
·       Indexing the pages to create an index from every word to every place it occurs.
·       Ranking the pages so the best ones show up first.
·       Displaying the results in a way that is easy for the user to understand.

 Crawling the web :-
Crawling is conceptually quite simple: starting at some well-known sites on the web, recursively follow every hypertext link, recording the pages encountered along the way. In computer science this is called the transitive closure of the link relation. However, the conceptual simplicity hides a large number of practical complications: sites may be busy or down at one point, and come back to life later; pages may be duplicated at multiple sites (or with different URLs at the same site) and must be dealt with accordingly; many pages have text that does not conform to the standards for HTML, HTTP redirection, robot exclusion, or other protocols; some information is hard to access because it is hidden behind a form, Flash animation or Javascript program. Finally, the necessity of crawling 100 million pages a day means that building a crawler is an exercise in distributed computing, requiring many computers that must work together and schedule their actions so as to get to all the pages without overwhelming any one site with too many requests at once.
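As a rough illustration of this process, here is a minimal crawling sketch in Python (an assumption for illustration, not any real engine's crawler): it starts from a seed URL, fetches pages, extracts links, queues the ones not yet seen, and pauses between requests so no single site is overwhelmed.

```python
# Minimal illustrative crawler sketch (not production code). The seed URL,
# page limit and politeness delay are assumptions chosen for illustration;
# a real crawler is distributed over many machines and also handles
# robots.txt, redirects, duplicates and malformed pages.
import re
import time
import urllib.request
from collections import deque
from urllib.parse import urljoin

seeds = ["https://example.org/"]          # hypothetical starting point
queue = deque(seeds)                      # URLs waiting to be fetched
seen = set(seeds)                         # avoid fetching the same URL twice
pages = {}                                # URL -> raw HTML

while queue and len(pages) < 100:         # small limit for the sketch
    url = queue.popleft()
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    except Exception:
        continue                          # skip busy, down or malformed pages
    pages[url] = html
    # Extract links; a real crawler would use a proper HTML parser.
    for link in re.findall(r'href="([^"]+)"', html):
        absolute = urljoin(url, link)
        if absolute.startswith("http") and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
    time.sleep(1)                         # crude politeness delay
```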

    Indexing :-
A search engine’s index is similar to the index in the back of a book: it is used to find the pages on which a word occurs. There are two main differences: the search engine’s index lists every occurrence of every word, not just the important concepts, and the number of pages is in the billions, not hundreds. Various techniques of compression and clever representation are used to keep the index “small,” but it is still measured in terabytes (millions of megabytes), which again means that distributed computing is required. Most modern search engines index link data as well as word data. It is useful to know how many pages link to a given page, and what the quality of those pages is. This kind of analysis is similar to citation analysis in bibliographic work, and helps establish which pages are authoritative. Algorithms such as PageRank and HITS are used to assign a numeric measure of authority to each page. For example, the PageRank algorithm says that the rank of a page is a function of the sum of the ranks of the pages that link to the page. If we let PR(p) be the PageRank of page p, Out(p) be the number of outgoing links from page p, Links(p) be the set of pages that link to page p and N be the total number of pages in the index, then we can define PageRank by

PR(p) = r/N + (1 - r) · Σ_{i ∈ Links(p)} PR(i)/Out(i)

where r is a parameter that indicates the probability that a user will choose not to follow a link, but will instead restart at some other page. The r/N term means that each of the N pages is equally likely to be the restart point, although it is also possible to use a smaller subset of well-known pages as the restart candidates. Note that the formula for PageRank is recursive – PR appears on both the right- and left-hand sides of the equation. The equation can be solved by iterating several times, or by standard linear algebra techniques for computing the eigenvalues of a (3-billion-by-3-billion) matrix.
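To make the recursion concrete, here is a small Python sketch that solves the PageRank equation by simple iteration over a toy three-page link graph; the graph and the value r = 0.15 are assumptions chosen only for illustration.

```python
# Iterative PageRank sketch over a toy link graph (an assumption for
# illustration). links[p] is the set of pages that p links to; r is the
# restart probability from the formula above.
def pagerank(links, r=0.15, iterations=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}              # start from a uniform rank
    out = {p: len(links[p]) or 1 for p in pages}  # outgoing link counts (crude dangling-node fix)
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = sum(pr[q] / out[q] for q in pages if p in links[q])
            new[p] = r / n + (1 - r) * incoming
        pr = new
    return pr

toy_graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
print(pagerank(toy_graph))   # page C, with two incoming links, ends up ranked highest
```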
The two steps above are query independent: they do not depend on the user’s query, and thus can be done before a query is issued, with the cost shared among all users. This is why a search takes a second or less, rather than the days it would take if a search engine had to crawl the web anew for each query. We now consider what happens when a user types a query.

Consider the query [“National Academies” computer science], where the square brackets denote the beginning and end of the query, and the quotation marks indicate that the enclosed words must be found as an exact phrase match. The first step in responding to this query is to look in the index for the hit lists corresponding to each of the four words “National,” “Academies,” “computer” and “science.” These four lists are then intersected to yield the set of pages that mention all four words. Because “National Academies” was entered as a phrase, only hits where these two words appear adjacent and in that order are counted. The result is a list of 19,000 or so pages.
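A hypothetical sketch of this lookup in Python, using a tiny positional index as an assumption: the four hit lists are intersected, and the phrase requirement keeps only documents where "academies" appears immediately after "national".

```python
# Sketch of answering ["National Academies" computer science]: intersect the
# hit lists of all four words, then keep only documents where "national" and
# "academies" occur at adjacent positions. The tiny index is an assumption
# for illustration; positions are word offsets within each page.
index = {
    "national":  {1: [0, 17], 2: [4]},
    "academies": {1: [1],     2: [9]},
    "computer":  {1: [5],     2: [5]},
    "science":   {1: [6],     2: [6]},
}

# Pages containing all four words.
candidates = set.intersection(*(set(index[w]) for w in
                                ("national", "academies", "computer", "science")))

# Phrase filter: "academies" must appear immediately after "national".
results = [doc for doc in candidates
           if any(p + 1 in index["academies"][doc] for p in index["national"][doc])]
print(results)   # -> [1]; in doc 2 the two words are not adjacent
```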

 Ranking:-
The next step is ranking these 19,000 pages to decide which ones are most relevant. In traditional information retrieval this is done by counting the number of occurrences of each word, weighing rare words more heavily than frequent words, and normalizing for the length of the page. A number of refinements on this scheme have been developed, so it is common to give more credit for pages where the words occur near each other, where the words are in bold or large font, or in a title, or where the words occur in the anchor text of a link that points to the page. In addition the query-independent authority of each page is factored in. The result is a numeric score for each page that can be used to sort them best-first. For our four-word query, most search engines agree that the Computer Science and Telecommunications Board home page at www7.nationalacademies.org/cstb/ is the best result, although one preferred the National Academies news page at www.nas.edu/topnews/ and one inexplicably chose a year-old news story that mentioned the Academies.
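The following Python sketch illustrates the traditional scoring idea described above (term frequency, extra weight for rare words, length normalization); the sample documents and the exact weighting are assumptions, not any engine's actual formula.

```python
# Sketch of a classic relevance score: term frequency, weighted by rarity of
# the word across the collection, normalized by page length. Real engines add
# many refinements (proximity, bold/title text, anchor text, link authority).
import math

docs = {
    "cstb":  "national academies computer science telecommunications board",
    "news":  "national academies news computer announcements",
    "other": "computer hardware sales",
}

def score(query_words, text, collection):
    words = text.split()
    total = 0.0
    for w in query_words:
        tf = words.count(w) / len(words)                       # frequency, length-normalized
        df = sum(1 for d in collection.values() if w in d.split())
        idf = math.log(len(collection) / df) if df else 0.0    # rare words weigh more
        total += tf * idf
    return total

query = ["national", "academies", "computer", "science"]
ranked = sorted(docs, key=lambda d: score(query, docs[d], docs), reverse=True)
print(ranked)   # the CSTB-like page comes out first
```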

Displaying the result:-
The final step is displaying the results. Traditionally this is done by listing a short description of each result in rank-sorted order. The description will include the title of the page and may include additional information such as a short abstract or excerpt from the page. Some search engines generate query-independent abstracts while others customize each excerpt to show at least some of the words from the query. Displaying this kind of query-dependent excerpt means that the search engine must keep a copy of the full text of the pages (in addition to the index) at a cost of several more terabytes. Some search engines attempt to cluster the result pages into coherent categories or folders, although this technology is not yet mature.
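A small sketch of a query-dependent excerpt, under the assumption that the full page text is stored: find the first query word in the page and show a short window of words around it.

```python
# Sketch of a query-dependent excerpt: pick a short window of the stored page
# text around the first query word that appears, so the searcher sees the
# terms in context. The page text is an assumption for illustration.
def excerpt(text, query_words, window=8):
    words = text.split()
    for i, w in enumerate(words):
        if w.lower() in query_words:
            start = max(0, i - window // 2)
            return "... " + " ".join(words[start:start + window]) + " ..."
    return " ".join(words[:window]) + " ..."   # fall back to the page opening

page = ("The Computer Science and Telecommunications Board of the National "
        "Academies studies questions of national importance.")
print(excerpt(page, {"national", "academies"}))
```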

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of webpages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve.

         ADVANTAGES OF SEARCH ENGINES
There are three very compelling advantages of most search engines:-
1.    The indexes of search engines are usually vast, representing significant portions of the Internet, offering a wide variety and quantity of information resources.
 
2.    The growing sophistication of search engine software enables us to precisely describe the information that we seek.
 
3.    The large number and variety of search engines enriches the Internet, making it at least appear to be organized.

  DISADVANTAGES OF SEARCH ENGINES
1) They do not crawl the web in “real time”.
2) If a site is not linked to or submitted, it may not be accessible.
3) Not every page of a site is searchable.
4) Special tools are needed for the Invisible/Deep Web.
5) Few search engines search the full text of Web pages.

WE CAN IMPROVE OUR USE OF SEARCH ENGINES IN THE FOLLOWING WAYS:-

You can get better results from an Internet search engine if you know how to use wildcards and "Boolean operators." Wildcards allow you to search simultaneously for several words with the same stem. For example, entering the single term "educat*" will allow you to conduct a search for "educator", "educators", "education" and "educational" all at the same time.
Boolean operators were named after George Boole (1815-1864) who combined the study of logic with that of algebra. Using the boolean operator "and", it is possible to narrow a search so that you get quite a limited set of results. Another common operator is "not" which acts to limit a search as well. The boolean operator "or" has the opposite effect of expanding a search. Using boolean terms, you can have the search engine look for more than one word at a time. Here are three examples of such search terms.
Eg- endangered and species
       insecticides not ddt
       university or college

WILD CARD:-
A wild card is a special character which can be appended to the root of a word so that you can search for all possible endings to that root. For instance, you may be looking for information on the harmful effects of smoking. Documents which contain the following words may all be useful to your search: smoke, smoking, smokers, smoked, and smokes. If your search engine allowed wild cards, you would enter "smok*". In this case, the asterisk is the wild card and documents which contained words that started with "smok" would be returned.
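A minimal sketch of this idea: the wildcard "smok*" is treated as a prefix, and every indexed word that starts with the prefix is included in the search. The word list is an assumption for illustration.

```python
# Sketch of wildcard expansion: "smok*" is treated as a prefix, and every
# indexed word starting with that prefix is included in the search.
vocabulary = ["smoke", "smoking", "smokers", "smoked", "smokes", "smock", "soak"]

def expand(wildcard, words):
    prefix = wildcard.rstrip("*")
    return [w for w in words if w.startswith(prefix)]

print(expand("smok*", vocabulary))
# -> ['smoke', 'smoking', 'smokers', 'smoked', 'smokes']
```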

BOOLEAN OPERATORS:-
The boolean operator "and" is the most common way to narrow a search to a manageable number of hits. For example, with "heart and disease" as the search term, an engine will provide links to sites which have both of these words present in a document. It will ignore documents which have just the word "heart" in it (e.g., heart transplant) and it will ignore documents which have just the word "disease" in it (e.g., lung disease, disease prevention). It will only make a link if both of the words are present - although these do not necessarily have to be located beside each other in the document.
For even more narrow searches, you can use "and" more than once. For example, "heart and disease and prevention" would limit your search even more since all three terms would have to be present before a link would be made to the document.
The boolean operator "not" narrows the search by telling the engine to exclude certain words. For example, the search term "insecticides not DDT" would give you links to information on insecticides but not if the term "DDT" was present.
It is possible to combine two different operators. For example, the term "endangered and species not owl" would give you information on various kinds of endangered species - both of the words "endangered" and "species" would have to be present for there to be a hit. However, you would not get information on any owls that are endangered since the "not" term specifically excludes that word.
The boolean operator "or" will broaden your search. You might use "or" if there were several words that could be used interchangeably. For example, if you were looking for information on drama resources, using just that one search term might not give you all that you wanted. However, by entering "drama or theater", the search engine would provide a link to any site that had either of those words present. For even wider searches, you can use "or" more than once. For example, "drama or theater or acting or stage" would provide a very broad search indeed.
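Internally, these operators can be thought of as set operations on the documents that contain each word: "and" is an intersection, "not" a difference, and "or" a union. The sketch below uses tiny, made-up document sets to illustrate this.

```python
# Sketch of boolean operators as set operations on the documents containing
# each word: AND is intersection, NOT is difference, OR is union. The tiny
# document sets are assumptions for illustration.
hits = {
    "heart":      {1, 2, 3, 5},
    "disease":    {2, 4, 5},
    "prevention": {5, 6},
    "drama":      {7},
    "theater":    {8, 9},
}

print(hits["heart"] & hits["disease"])                        # heart AND disease -> {2, 5}
print(hits["heart"] & hits["disease"] & hits["prevention"])   # all three -> {5}
print(hits["heart"] - hits["prevention"])                     # heart NOT prevention -> {1, 2, 3}
print(hits["drama"] | hits["theater"])                        # drama OR theater -> {7, 8, 9}
```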


                HOW GOOGLE WORKS
Here we will know how Google creates the index and the database of documents that it accesses when processing a query.

Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. Google has three distinct parts:
  • Googlebot, a web crawler that finds and fetches web pages.
  • The indexer that sorts every word on every page and stores the resulting index of words in a huge database.
  • The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.

1. Googlebot, Google’s Web Crawler

Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.
Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.


Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.
Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it displays some squiggly letters designed to fool automated “letter-guessers” and asks you to enter the letters you see, something like an eye-chart test to stop spambots.

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page in the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.
Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Googlebot must determine how often to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.
To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily; pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.
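A simple sketch of this scheduling idea, with made-up change rates: the revisit interval for each page is set roughly inversely proportional to how often the page has been observed to change, within sensible bounds.

```python
# Sketch of keeping an index fresh: recrawl each page at an interval roughly
# proportional to how often it changes. The observed change rates below are
# assumptions for illustration.
observed_changes_per_day = {
    "https://example-news-site.org/front-page": 24.0,   # changes about hourly
    "https://example-quotes.org/ticker": 96.0,          # changes every ~15 minutes
    "https://example.org/about": 1 / 30,                # changes about monthly
}

def revisit_interval_hours(changes_per_day, minimum=0.25, maximum=24 * 30):
    # Visit more often when the page changes more often, within sane bounds.
    interval = 24.0 / max(changes_per_day, 1e-9)
    return min(max(interval, minimum), maximum)

for url, rate in observed_changes_per_day.items():
    print(url, round(revisit_interval_hours(rate), 2), "hours")
```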

2. Google’s Indexer

Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.
To improve search performance, Google ignores (doesn’t index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces, as well as converting all letters to lowercase, to improve Google’s performance.
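A small sketch of this indexing step, with two made-up documents: the text is lowercased, stop words are skipped, and for each remaining word the index records the documents it appears in and its positions within them.

```python
# Sketch of building the kind of index described above: lowercase the text,
# skip stop words, and record for each remaining word the documents it
# appears in and the positions within them. Documents are assumptions.
from collections import defaultdict

STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why", "a", "an"}

documents = {
    1: "How the crawler finds pages on the web",
    2: "The indexer sorts every word on every page",
}

index = defaultdict(lambda: defaultdict(list))   # word -> doc id -> positions
for doc_id, text in documents.items():
    for position, word in enumerate(text.lower().split()):
        if word not in STOP_WORDS:
            index[word][doc_id].append(position)

print(dict(index["every"]))   # -> {2: [3, 6]}
```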

3. Google’s Query Processor

The query processor has several parts, including the user interface (search box), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.
PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank.
Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page.

 Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance; they’re tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.

Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, and in links to the page; these options are offered by Google’s Advanced Search Form and its advanced search operators.


META SEARCH ENGINES:-
A meta search engine is a search tool that doesn't create its own database of information, but instead searches those of other engines. "Metacrawler", for instance, searches the databases of each of the following engines: Lycos, WebCrawler, Excite, AltaVista, and Yahoo. Using multiple databases will mean that the search results are more comprehensive.
 Eg- SurfWax: http://www.surfwax.com
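A hypothetical sketch of the meta-search idea: the same query is fanned out to several underlying engines (represented here by stub functions, since every real engine has its own interface) and the result lists are merged with duplicates removed.

```python
# Sketch of a meta-search: send one query to several underlying engines and
# merge their result lists, dropping duplicates. The engine functions are
# stand-ins (assumptions); a real meta-search engine would call each
# engine's own search interface.
def engine_a(query):
    return ["http://example.org/a1", "http://example.org/shared"]

def engine_b(query):
    return ["http://example.org/shared", "http://example.org/b1"]

def metasearch(query, engines):
    merged, seen = [], set()
    for engine in engines:
        for url in engine(query):
            if url not in seen:          # keep the first occurrence of each URL
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("endangered species", [engine_a, engine_b]))
# -> ['http://example.org/a1', 'http://example.org/shared', 'http://example.org/b1']
```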


SUBJECT DIRECTORIES:-
Subject directories organize Internet sites by subject, allowing users to choose a subject of interest and then browse the list of resources in that category. Users conduct their searches by selecting a series of progressively more narrow search terms from a number of lists of descriptors provided in the directory. In this fashion, users "tunnel" their way through progressively more specific layers of descriptors until they reach a list of resources which meet all of the descriptors they had chosen.
For example, if you were using the Yahoo subject directory to find math lesson plans, you would start at the top level of the directory where there are approximately 15 general categories, including "arts and humanities", "government" and "education." Selecting "education" would lead to a list of about 35 descriptors, including "higher education", "magazines", and "teaching." Selecting "teaching" would lead to another page of resources all about teaching - including "English", "K-12", and "Math." This last choice would reveal a number of actual resources for the math teacher.
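One way to picture this tunnelling is as a walk down a nested category tree; the sketch below uses a made-up directory structure, not Yahoo's actual categories.

```python
# Sketch of "tunnelling" through a subject directory, modelled as a nested
# dictionary of categories ending in lists of resources. The categories and
# resources are made up for illustration.
directory = {
    "education": {
        "teaching": {
            "math":    ["Math lesson plans", "Geometry activities"],
            "english": ["Grammar exercises"],
        },
        "higher education": {},
    },
    "government": {},
}

def tunnel(tree, path):
    """Follow a list of progressively narrower descriptors."""
    node = tree
    for descriptor in path:
        node = node[descriptor]
    return node

print(tunnel(directory, ["education", "teaching", "math"]))
# -> ['Math lesson plans', 'Geometry activities']
```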
It's important to understand that a subject directory will not have links to every piece of information on the Internet. Since they are built by humans (rather than by computer programs), they are much smaller than search engine databases. Moreover, every directory is different and their value will depend on how widely the company searches for information, their method of categorizing the resources, how well information is kept current, etc.


INFORMATION GATEWAYS:- These include Internet catalogues, subject directories, virtual libraries and gateways.
Eg- Development Gateway
http://www.developmentgateway.org/

Good For: topics that fall into a thematic area that has a subject directory; guided browsing in your subject area.
Not Good For: quickly finding information from widely varying themes.

SPECIALIST DATABASE:-
Specialized databases are indexes that can be searched, much like the search engines. The main difference is that specialized databases are collections on particular subjects, such as medical journal article abstracts and citations, company financial data, United States Supreme Court decisions, census data, patents, and so forth. You can find information in specialized databases that you often would not locate by using a global WWW search engine. If you know there is a specialized database on the subject you are researching, using that database can save you time and give you reliable, up-to-date information.

                         CONCLUSION

Search engines are designed to be scalable search tools. The primary goal is to provide high-quality search results over a rapidly growing World Wide Web. Search tools like the Google search engine employ a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. A search engine is thus a complete architecture for gathering web pages, indexing them, and performing search queries over them.
·       Use site search for all pages.
·       Follow standards for search forms.
·       Balance information and clarity in results.
·       Index everything; hide obscure stuff.
·       Use search query features wisely.
·       Adjust results to fit your situation.
·       Track search use with logs.
