PAGE RANK TECHNIQUES IN SEARCH ENGINE



PageRank is the technique used by Google to determine importance of page on the web. It considers all incoming links to a page as votes for PageRank. But if these votes from different sources weighted equally, this will lead to wrong result. Thus, votes from different sources are weighted according to PageRank of voting page and number of links that voting page contains. Founders of Google, Sergey Brin and Lawrence Page have defined PageRank by a formula. Researchers have done lot of experiments and came up with some conclusions. There are conflicts among opinions of different researchers. This paper starts with formula for PageRank calculation. It describes how to use the formula. Then, paper goes through some examples and derives some observations from it. Paper also contains basic ideas to improve PageRank of web site.


1. Introduction:

Today search engines are becoming best friends of most of the people for navigation on Internet. User needs to enter keyword or combination of keywords to trigger the search. Search engine searches the web pages available on Internet and returns the result as number of ordered pages. This order should depend on relevancy of pages and importance of pages.
Different search engines use different techniques to decide importance of page on the web. Alta Vista uses HITS (Hyperlink Induced Topic Search) algorithm to rank pages while Google uses PageRank algorithm to rank the pages.
 Yahoo also uses some variation of PageRank algorithm that Google uses. There are many variations of Page Ranking algorithm. Like confidence based page ranking, weighted page rank algorithm, query sensitive self-adaptable web page ranking algorithm etc. But, this paper covers only one algorithm that is based on formula given by Bin and Page, the founders of Google.

2. What is PageRank?


            Most of the people start their web navigation by search engine. Google is the most famous search engine used now days. While presenting search results, they should be ordered by their relevancy and importance on the web. User cannot go through all the pages presented as output of search. Thus all the pages in the collection should be weighted and represented in the order of their weights.
            One of the most important factors that Google uses is PageRank. PageRank is a numeric value that represents how important a page is on the web. Off course PageRank is not the only factor, which decides importance of page, but still it is one of them. PageRank is described by one mathematical formula that seems very difficult at first, but actually it is not.

2.1 Formula:

            The citation graph of the web is main resource for calculation of PageRank. In the paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine” founders of Google, Sergey Brin and Lawrence Page defined PageRank as:

“We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor, which can be set between 0 and 1. We usually set d to 0.85 .…….  C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))  “

It's the original formula that was published when PageRank was being developed, and it is probable that Google uses a variation of it but they aren't telling us what it is. Thus, when one page links to other page, first page votes some PageRank to the second page. PageRank of a page is the addition of one constant (1-d=0.15) and the damped value of addition of votes by all pages pointing to it. Value of vote by a particular page depends on PageRank and total number of out links of that page. Thus, higher the PageRank, higher is the value of vote and higher the number of out links, lower is the value of vote.  
So every page distributes 85% of its original PageRank evenly among all pages to which it points. e.g. if page A is pointing to four other pages then factor ‘d * PR(A)/4’ will come in PageRank equation of all those four pages. That four factors will add up to d * PR(A) , i.e. 85% of PageRank of A.( From here onwards I’ll refer PageRank as PR)  Note that when page votes for another page, it doesn’t give anything from its own PR. Its just voting, only the difference is that weight of vote of a page depends on its own PR. It is same as shareholders meeting where weight of vote of shareholder depends on the shares held.     

2.2 How to use formula?

If we look at the equation, it’s obvious that PR of a page depends on PR of other pages, which are pointing to it, which may depend on PR of original page if it has back link to that page. e.g. If there are only two pages A and B which points to each other, then PR of each page will depend on PR of other, which is yet uncalculated. So where to start from?  Surprisingly, we can start with any assumed values of PR. And repeat the calculation of PR values of all pages iteratively till they become stable.
            Lets see an example: consider there are only two pages A and B, which points to each other. We don’t know what their PR should be to begin with, so let’s take a guess at 1 and do some calculations.

PR(A)  =  (1-d) + d * PR(B)
=  (1-0.85) + 0.85 * 1
=   1
PR(B)  =  (1-d) + d * PR(A)
=  (1-0.85) + 0.85 * 1
=   1
Since they are not changing we can stop here. Now lets do the calculations with initial guess at 0:

PR(A) = 0.15 + 0.85 * 0
                                               = 0.15
PR(B) = 0.15 + 0.85 * 0.15
           =  0.2775
After second iteration:           
PR(A) = 0.15 + 0.85 * 0.2775
                       = 0.385875
            PR(B) = 0.15 + 0.85 + 0.385875
                       = 0.47799375

After third iteration:
           
                                    PR(A) = 0.15 + 0.85 * 0.47799375
                                               = 0.5562946875
                                    PR(B) = 0.15 + 0.85 * 0.5562946875
                                               = 0.622850484375

and so on. They will keep changing till they reach value of 1. If we start with values greater than 1, then also values will degrade iteration by iteration and will settle down to 1. If we start calculation with different values of PR(A) and PR(B), then also we will end up with PR(1) = PR(B) = 1. Here we considered symmetrical link structure between A and B, thus with any value of PR(A) and PR(B) to start, we ends up with PR(A) = PR(B) = 1. If we take asymmetrical structure e.g. only A is pointing to B and B is not pointing to A, then we will end up with consistent values of PR(A) and PR(B), whatever the values we start with. 
For more practice with different numbers of pages with different configurations of citation graphs, visit:


3. How is PageRank used?

Google uses PR as one of the important factor in search process to fine the relevancy of page. So while searching, web pages are accessed in the decreasing order of PR. One can find PR of its own webpage by using Google toolbar (http://toolbar.google.com/). But output of this toolbar is somewhat unexpected. As our normal PR calculation can yield PR ranging from 0.15 to unlimited. But toolbar gives PR of any page in the range 1 to 10. So actual PR values are divided into intervals and are represented as one of the value ranging from 1 to 10. But the question is whether these intervals are equidistant? Means is the scale linear? No one outside Google knows it. For this question, there are different answers from different researchers. According to Ian Rogers, intervals cannot be equidistant. Means the scale used by Google must be logarithmic. This yields a result that efforts needed to increase PR from 2 to 3 are very less as compared to increase PR from 8 to 9. What these efforts are? At the end of this paper anyone can answer this question.
           
4. Examples and Observations:

            Google does not provide any information about methods to improve PR values of page or site. But we can solve different examples and come up with different observations. Before starting examples I would like to explain difference between PR of page and PR of site. Google assigns PR for each page on the web. But from the site point of view, total PR of all pages of site is important. Thus PR of site is nothing but sum of PRs of all pages of site.

 In this example A and B don’t have any inbound and outbound link. They come up with PR values equal to 0.15. Thus we can conclude that: “Minimum value of PR of any page is 0.15.”


Actually this is not the correct result. There are two concepts: orphan pages and dangling links. Orphan pages are those, which don’t have any inbound link. For a page to be indexed by Google it must have at least one page linking to it. If a page is not in the Google index, it and its links don't exist as far as the calculations are concerned. Therefore, it can't have a PR value and it can't share PR with other pages. Page B is orphan page and thus will be eliminated during PR calculation. And thus page A will not receive any vote from page B.

Dangling links are links that go to pages that don't have any outbound links. These links are dropped for the duration of the calculations. Link from B to A is a dangling link and will be eliminated during PR calculation. But here onwards we will consider only small part of whole web. Thus even there are no inbound links shown for a particular page, they are assumed to be present there. Thus we will not consider these two concepts anymore in our examples.      

From here onwards I’ll not show all calculations. Instead I’ll show only final settled values of PR after large number of iterations.



This is hierarchical structure. In general homepage of website should have maximum PR. We need our PR to be concentrated at homepage. Thus we can use hierarchical structure and can channel large proportion of PR of site to where we want.

Compare example 5 with example 6. Page A, B and C are pages of same site. External site 1 is pointing to page A and page C is pointing to page C.  In both of these examples, PR of external site 1 is assumed to be equal to 1. And then PR values of remaining pages are calculated. Can we infer something from these examples? Definitely. There is PR leak in example 5, which is avoided in example 6. In example 5 page is C gives its entire vote to external site. While in example 6 page C gives part of it to external site and part of it to page A. This part of vote from C increases PR value of A and thus increases weight of vote of A. In turn PR of B and C increase which, in turn again increases weight of vote of C. This is iterative process, which finally results into increased value of PR of all A, B and C. This is definitely in the favor of PR of whole site.
            From these examples we can conclude that if any page of site is pointing to external page, then we can reduce the PR leak by increasing citation network.
Can we decrease PR leak by introducing reciprocating links between B and C? Lets try this. 

That’s great!!! Average PR of site increased as we expected.
   
5. How to increase PageRank?

            There are different ways to increase PR of your site. From examples illustrated above we can come up with some ideas to increase PR of web site. Some standard ideas are given below:

5.1 Add spam pages: 
           
As shown in figure, we can increase total PR of website by adding more pages into site. No matter, what are the contents of these pages. Total PR of site increases, but we can’t increase average PR of site (i.e. PR per page). Upper limit on average PR of site is 1.   

5.2 Join forums:


            Forums are a great way to achieve links to your website. In most forums you are allowed to have a signature and in your signature you can put a link to your website. But another important note to look on is making sure the forum is somewhat related to your website. You will still get credit if it's not, but if it's related to your website then you will be accomplishing two tasks at once. From this people will come to know about your site and this will help to increase popularity of your site.

5.3 Submit to search engine directories.

Search engine directories are a good way to get a free link to your website. They also increase your chances at being listed higher on popular search engines like Google, and others. Most search engine directories allow you to submit to their website for free. This will allow you to increase your web presence by being listed on another search engine, and it will also be a free link. Remember the more links you have the higher your PR will be.


5.4 Reciprocating Links:

You can search for sites related to same topic as that of your website. If PR of that site is higher than your site, then you can pay that site and can create reciprocating links among two. There are link-farming sites, which exchanges links with other sites. Actually this is illegal. Google is banning the sites participating in link farming. Thus be aware of link farming!!

5.5 Contents:

Last and the most important way to increase PR of your site is to keep solid contents on your sites. Other sites will automatically link your site if you are having good contents. Really there is no substitute for good contents!!  

6. Conclusion:

            Even though formula for calculating PageRank seems to be difficult, it is easy to understand. But when a simple calculation is applied hundreds of times over the results can seem complicated. And we cannot predict the result of these iterations. Surely, more practice can yield more observations.
            PageRank is important factor considered in Google ranking, but it is only one of the important factors considered. e.g. now a days Google is paying a lot of attention to the link’s anchor text while deciding relevancy of target page.
            But as PageRank is also one of the important factor, one should be well aware of PageRank while designing the website. 


7.References:


http://www.iprcom.com/papers/pagerank/



http://www-db.stanford.edu/~backrub/google.html










No comments:

Post a Comment

leave your opinion