Study Of Search Engines And Their Algorithms


Search engines are huge databases of web-page files that have been assembled automatically by machine. A web search engine is software designed to search for information on the World Wide Web. The results are generally presented in a list often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines maintain real-time information by running an algorithm on a web crawler.
There are two types of search engines: 
Individual.  Individual search engines compile their own searchable databases on the web.
Meta.   Metasearchers do not compile databases of their own. Instead, they search the databases of several individual engines simultaneously (see Lesson 2).

Search engines compile their databases by employing "spiders" or "robots" ("bots") to crawl through web space from link to link, identifying and perusing pages. Sites with no links to other pages may be missed by spiders altogether. Web page owners may submit their URLs to search engines for "crawling" and eventual inclusion in their databases. 
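The crawling process described above can be sketched as a breadth-first walk over links. Below is a minimal, hypothetical sketch in Python: the in-memory WEB dictionary and its URLs are invented stand-ins for real HTTP fetching, which would also need to honor robots.txt.

```python
from collections import deque

# Toy in-memory "web": URL -> (page text, outgoing links). These pages
# and URLs are made up for illustration; a real spider fetches over HTTP.
WEB = {
    "a.html": ("welcome page", ["b.html", "c.html"]),
    "b.html": ("search engines index pages", ["c.html"]),
    "c.html": ("spiders follow links", []),
    "d.html": ("orphan page, linked from nowhere", []),
}

def crawl(seed):
    """Breadth-first crawl from a seed URL, following every discovered link."""
    seen, queue, pages = set(), deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url in seen or url not in WEB:
            continue  # skip pages already visited or outside our toy web
        seen.add(url)
        text, links = WEB[url]
        pages[url] = text      # hand the page text over for indexing
        queue.extend(links)    # schedule outgoing links for crawling
    return pages

fetched = crawl("a.html")
```

Note that d.html, which no other page links to, is never reached: exactly the "sites with no links may be missed by spiders" point made above.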
Whenever you search the web using a search engine, you're asking the engine to scan its index of sites and match your keywords and phrases with those in the texts of documents within the engine's database. 

Search engines use their own software to search their indexes for matching keywords and phrases, presenting the findings in some kind of relevance ranking. Although the programs may be similar, no two search engines are exactly the same in terms of size, speed and content; no two use exactly the same ranking scheme, and not every engine offers the same search options. Your search will therefore be different on every engine you use. The difference may not be large, but it can be significant: recent estimates put search-engine overlap at approximately 60 percent and unique content at around 40 percent.

In ranking web pages, search engines follow a certain set of rules. These may vary from one engine to another. Their goal, of course, is to return the most relevant pages at the top of their lists. To do this, they look for the location and frequency of keywords and phrases in the web page document and, sometimes, in the HTML META tags. 
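As a toy illustration of ranking by keyword location and frequency, the sketch below scores a page higher when query terms appear in the title or META keywords. The weights (1.0, 5.0, 2.0) are invented for illustration and are not any real engine's scheme.

```python
def score_page(title, body, meta_keywords, query_terms):
    """Toy relevance score based on keyword frequency and location.
    The location weights used here are illustrative assumptions."""
    body_words = body.lower().split()
    title_words = set(title.lower().split())
    meta = {k.lower() for k in meta_keywords}
    score = 0.0
    for term in (t.lower() for t in query_terms):
        score += body_words.count(term)    # frequency in the page text
        if term in title_words:
            score += 5.0                   # keyword location: title
        if term in meta:
            score += 2.0                   # keyword location: META tag
    return score

hit = score_page("Search Engines", "search engines rank pages",
                 ["search"], ["search"])
miss = score_page("Home", "welcome to my pages", [], ["search"])
```

A page with the query term in its title, META tags, and body (hit) outranks a page without it (miss), mirroring how location and frequency drive the ordering.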

Search engines are best at finding unique keywords, phrases, quotes, and information buried in the full text of web pages. Because they index word by word, search engines are also useful for retrieving large numbers of documents.
NOTE: Today, the line between search engines and subject directories (see Lesson 3) is blurring. Search engines no longer limit themselves to a search mechanism alone. 


Working Of Web Search Engines:-
A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Searching
Web search engines work by storing information about many web pages, which they retrieve from the HTML itself. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated Web browser which follows every link on the site. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words can be extracted from the titles, page content, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries; a query can be as simple as a single word. The index helps information to be found as quickly as possible.

Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. The cached page always holds the actual text that was indexed, so it can be very useful when the content of the current page has been updated and the search terms no longer appear in it. This problem might be considered a mild form of linkrot, and serving the cached copy satisfies the principle of least astonishment: the user normally expects the search terms to appear on the returned page. Beyond that relevance benefit, cached pages may contain data that is no longer available elsewhere.
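The index-and-cache step above can be sketched as building an inverted index (word to set of pages) while keeping a cached copy of each page as it was indexed. The pages and their wording below are invented for illustration.

```python
def build_index(pages):
    """pages: {url: text}. Returns (inverted index, cache of raw pages)."""
    index, cache = {}, {}
    for url, text in pages.items():
        cache[url] = text                          # cached copy, as indexed
        for word in set(text.lower().split()):     # index word by word
            index.setdefault(word, set()).add(url)
    return index, cache

pages = {
    "a.html": "search engines index pages",
    "b.html": "spiders crawl pages",
}
index, cache = build_index(pages)
```

Even if a.html later changes, the cache still holds the text that produced its index entries, which is why a cached page always matches the indexed search terms.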

[Figure: High-level architecture of a standard Web crawler]
When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. The index is built from the information stored with the data and the method by which the information is indexed.[11] As early as 2007, search engines have allowed users to search by date by clicking 'Show search tools' in the leftmost column of the initial search results page and then selecting the desired date range. Most search engines support the Boolean operators AND, OR and NOT to further specify the query. Boolean operators are for literal searches: the engine looks for the words or phrases exactly as entered, and the operators let the user refine and extend the terms of the search. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. There is also concept-based searching, where the research involves statistical analysis of the pages containing the words or phrases you search for. Natural-language queries, in turn, allow the user to type a question in the same form one would ask it of a human.
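The Boolean operators map directly onto set operations over the index's posting sets: AND is intersection, OR is union, and NOT is set difference. The tiny three-term index below is invented for illustration.

```python
# Posting sets for three made-up index terms across pages a-d.
index = {
    "web":     {"a", "b", "c"},
    "spider":  {"b", "c"},
    "ranking": {"c", "d"},
}

and_result = index["web"] & index["spider"]      # web AND spider
or_result  = index["spider"] | index["ranking"]  # spider OR ranking
not_result = index["web"] - index["ranking"]     # web NOT ranking
```

Because every operation works on precomputed posting sets, the engine never has to rescan page text at query time; it only combines small sets.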
The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results so as to provide the "best" results first. How a search engine decides which pages are the best matches, and in what order the results should be shown, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve. Two main types of search engine have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively; the other is a system that generates an "inverted index" by analyzing the texts it locates. The second form relies much more heavily on the computer itself to do the bulk of the work.
Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the practice of allowing advertisers to pay to have their listings ranked higher in search results. Search engines that do not accept money for their search results make money by running search-related ads alongside the regular results, earning revenue every time someone clicks on one of these ads.



Algorithm :-1
TDAssign(D̂, H): the generic top-down assignment algorithm.

1: Input:
The set D̂.
The function H used to select the initial documents that form the centers of mass of the two partitions.
2: Output:
An ordered list representing an assignment function π for D̂.
3: (D̂′, D̂″) = H(D̂);
4: c1 = center of mass of D̂′;
5: c2 = center of mass of D̂″;
6: for all not previously selected d ∈ D̂ \ (D̂′ ∪ D̂″) do
7: if |D̂′| ≥ |D̂|/2 ∨ |D̂″| ≥ |D̂|/2 then
8: assign d to the smallest partition;
9: else
10: dist1 = distance(c1, d);
11: dist2 = distance(c2, d);
12: if dist1 < dist2 then
13: D̂′ = D̂′ ∪ {d};
14: else
15: D̂″ = D̂″ ∪ {d};
16: end if
17: end if
18: end for
19: D̂′ord = TDAssign(D̂′, H);
20: D̂″ord = TDAssign(D̂″, H);
21: if D̂′ord ⪯ D̂″ord then
22: D̂ord = D̂′ord ⊕ D̂″ord;
23: else
24: D̂ord = D̂″ord ⊕ D̂′ord;
25: end if
26: return D̂ord;
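Under the assumptions that documents are sets of terms, that distance to a partition's "center of mass" can be approximated by the average Jaccard distance to its current members, and that H picks the two most dissimilar documents as seeds, the top-down assignment can be sketched as follows. All of these are simplifications of this sketch, not the listing's exact definitions.

```python
def jaccard_dist(a, b):
    """Jaccard distance between two term-sets."""
    union = a | b
    return (1.0 - len(a & b) / len(union)) if union else 0.0

def farthest_pair(docs):
    """A possible H: choose the two most dissimilar documents as seeds."""
    best, pair = -1.0, (0, 1)
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            d = jaccard_dist(docs[i], docs[j])
            if d > best:
                best, pair = d, (i, j)
    return pair

def td_assign(docs, pick_seeds=farthest_pair):
    """Recursive top-down assignment: split docs around two seeds, keeping
    the partitions balanced, then order each half recursively."""
    if len(docs) <= 2:
        return list(docs)
    i, j = pick_seeds(docs)
    left, right = [docs[i]], [docs[j]]
    half = len(docs) / 2
    for k, d in enumerate(docs):
        if k in (i, j):
            continue
        if len(left) >= half or len(right) >= half:
            # lines 7-8: once a partition is half full, balance the rest
            (left if len(left) <= len(right) else right).append(d)
        else:
            # approximate distance to each partition's center of mass
            d1 = sum(jaccard_dist(d, m) for m in left) / len(left)
            d2 = sum(jaccard_dist(d, m) for m in right) / len(right)
            (left if d1 < d2 else right).append(d)
    # the two ordered halves are concatenated in a fixed order here,
    # rather than compared as in lines 21-25 of the listing
    return td_assign(left, pick_seeds) + td_assign(right, pick_seeds)

docs = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
order = td_assign(docs)
```

The point of the recursion is that similar documents end up adjacent in the final ordering, which is what makes the resulting document identifiers compress well.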

Algorithm :-2
KScan(D̂, k): the k-scan assignment algorithm.

1: Input:
The set D̂.
The number k of sequences to create.
2: Output:
k ordered sequences representing an assignment π of D̂.
3: sort D̂ by descending lengths of its members;
4: ci = ∅, i = 1, . . . , k;
5: for i = 1, . . . , k do
6: current_center = longest member of D̂;
7: D̂ = D̂ \ {current_center};
8: for all d̂j ∈ D̂ do
9: sim[j] = compute_jaccard(current_center, d̂j);
10: end for
11: M = select_members(sim);
12: ci = ci ⊕ M;
13: D̂ = D̂ \ M;
14: ci = ci ⊕ current_center;
15: dump(ci);
16: end for
17: return c1 ⊕ c2 ⊕ . . . ⊕ ck;
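With the same term-set representation, the k-scan idea can be sketched in Python as below. The even run size (about n/k documents per sequence) and the stable tie-breaking are assumptions of this sketch; `select_members` in the listing is realized here as "take the most similar documents up to the run size".

```python
def jaccard(a, b):
    """Jaccard similarity between two term-sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def k_scan(docs, k):
    """Split docs into k ordered runs: each pass takes the longest
    unassigned document as a center, then its most similar neighbours."""
    remaining = sorted(docs, key=len, reverse=True)  # line 3: sort by length
    run_size = -(-len(docs) // k)                    # ceil(n/k), an assumption
    runs = []
    for i in range(k):
        center = remaining.pop(0)                    # line 6: longest member
        ranked = sorted(remaining,                   # lines 8-10: similarities
                        key=lambda d: jaccard(center, d), reverse=True)
        # line 11: the last run absorbs everything still unassigned
        members = ranked if i == k - 1 else ranked[:run_size - 1]
        for m in members:
            remaining.remove(m)                      # line 13
        runs.append(members + [center])              # lines 12 and 14
    return runs

docs = [{"a", "b", "c"}, {"a", "b"}, {"x", "y", "z"}, {"x", "y"}]
runs = k_scan(docs, 2)
```

Unlike TDAssign, k-scan makes a single greedy pass per sequence rather than recursing, so it trades some clustering quality for speed.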
