A story that ran at Search Engine Land a few days ago reported a possible new algorithm at Google: “Unconfirmed Google algorithm update may be better at discounting links and spam.” Just before reading that post, I had read a newly granted Google patent, and the post reminded me of it. The patent was granted on January 31, 2017, and what it describes may be what people were experiencing in the update reported at Search Engine Land.
The algorithm behind the patent is based upon rankings that involve how many resources might link to a resource that may be ranked (like Stanford’s PageRank Patent). Historically, at Google, a page that has a large number of resources linking to it may rank higher than pages that have a smaller number of resources linking to them. But what if Google decided to look closer at those resources and demote some of the ranking weight passed along by them? We have seen indications that Google may do something like that in the Reasonable Surfer Patent, which had links passing along different amounts of PageRank. Another way to change how much PageRank is passed along by a link might be to base it upon the amount of traffic a resource receives from links, and the dwell times of traffic from those links, whether they are short clicks, medium clicks, or long clicks.
This linking approach may also consider other aspects of links, such as the anchor text of a link pointing to a source resource, which it will treat as an n-gram, assigning a source score to the anchor text used to link to a page.
This was an interesting statement I ran across the first time I read through the newly granted patent:
Search result rankings can be adjusted based on a search query’s propensity to surface spam-related search results. The weighting of resource link counts in a ranking process can be reduced for search queries that have a high propensity for surfacing spam-related search results to reduce the skew on resource rankings caused by some resources having disproportionately large number of links compared to the number of selections of the links.
The patent tells us that the process has a number of advantages that can make it worth using, including the discounting of some links when ranking the pages they point to.
ADVANTAGES OF THIS PATENTED PROCESS
1) Search results for resources can be more accurately ranked using data regarding links to the resources and selections of those links.
2) A seed score can be determined for a resource based on the number of links to the resource contained in other resources and a number of selections of those links.
3) Source resources that include links to resources that have a disproportionate number of links relative to the number of selections, as indicated by the seed scores for those resources, can be identified.
4) The links from these identified source resources can be discounted in a ranking process that ranks resources based on the number of links to the resource.
5) Resources for which data regarding links are unavailable or insufficient can be scored using data regarding resources that include a link to the resource.
The patent I am writing about can be found here, and is worth spending some time with:
Determining a quality measure for a resource
Inventors: Hyung-Jin Kim, Paul Haahr, Kien Ng, Chung Tin Kwok, Moustafa A. Hammad, and Sushrut Karanjkar
United States Patent: 9,558,233
Granted: January 31, 2017
Filed: December 31, 2012
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining a measure of quality for a resource. In one aspect, a method includes determining a seed score for each seed resource in a set. The seed score for a seed resource can be based on a number of resources that include a link to the seed resource and a number of selections of the links. A set of source resources is identified. A source score is determined for each source resource. The source score for a source resource is based on the seed score for each seed resource linked to by the source resource. Source-referenced resources are identified. A resource score is determined for each source-referenced resource. The resource score for a source-referenced resource can be based on the source score for each source resource that includes a link to the source-referenced resource.
DEMOTION BASED UPON A HIGH NUMBER OF LINKS THAT DON’T PRODUCE MUCH TRAFFIC
This was another passage from the patent that struck me because it pointed at potentially harmful results for links that didn’t match up to expectations that might be held for them:
A system can determine a measure of quality for a particular web resource based on the number of other resources that link to the particular web resource and the amount of traffic the resource receives. For example, a ranking process may rank a first web page that has a large number of other web pages that link to the first web page higher than a web page having a smaller number of linking web pages. However, some a resource may be linked to by a large number of other resources, while receiving little traffic from the links. For example, an entity may attempt to game the ranking process by including a link to the resource on another web page. This large number of links can skew the ranking of the resources. To prevent such skew, the system can evaluate the “mismatch” between the number of linking resources and the traffic generated to the resource from the linking resources. If a resource is linked to by a number of resources that is disproportionate with respect to the traffic received by use of those links, that resource may be demoted in the ranking process.
How might the traffic that results from a link be determined?
The evaluation of resources can be performed by a “pull-push” process. In an example pull-push process, a seed score is determined for each of a set of seed resources for which sufficient link and traffic data is available. The seed score for a particular seed resource is based on the number of source resources that link to the seed resource and the amount of traffic generated to the resource from the source resources. In some implementations, the seed score for a particular resource is the ratio between the number of selections of links to the particular resource and the number of source resources that link to the particular resource.
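The ratio described in that last sentence can be sketched in a few lines of Python. This is only an illustration of the one implementation the patent mentions (selections divided by linking resources); the function name, the parameter names, and the numbers are my own hypothetical choices and do not come from the patent.

```python
def seed_score(selection_count: int, linking_resource_count: int) -> float:
    """Ratio of link selections to the number of resources that link to a seed.

    Assumption: this follows the "in some implementations" ratio quoted above.
    The patent does not publish actual values or thresholds.
    """
    if linking_resource_count <= 0:
        raise ValueError("a seed resource needs at least one linking resource")
    return selection_count / linking_resource_count


# Hypothetical numbers, for illustration only.
well_used = seed_score(selection_count=5000, linking_resource_count=100)
rarely_clicked = seed_score(selection_count=50, linking_resource_count=10000)

print(well_used > rarely_clicked)
```

A page with many links that almost nobody selects ends up with a very low ratio, which is the “mismatch” the patent is looking for.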
These seed scores are “pulled” up to the source resources and used to determine a source score for each source resource. In some implementations, the source score for a source resource is based on the seed score for each seed resource to which the source resource links. These source scores can be used to classify each source resource as being a “qualified source” or an “unqualified source.”
Links from sources that might be determined to be unqualified might then be discounted.
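Here is a rough Python sketch of how the “pull” and the discounting might fit together. The patent only says that a source score is “based on” the seed scores of the pages a source links to, so the averaging, the threshold, the discount value, and every name and domain below are assumptions made for illustration.

```python
from statistics import mean


def source_scores(links, seed_scores):
    """'Pull' seed scores up to each source that links to at least one seed.

    Assumption: a source score is the mean of the seed scores it links to.
    """
    scores = {}
    for source, targets in links.items():
        pulled = [seed_scores[t] for t in targets if t in seed_scores]
        if pulled:
            scores[source] = mean(pulled)
    return scores


def unqualified_sources(scores, threshold):
    """Sources whose score falls below a (hypothetical) threshold."""
    return {source for source, score in scores.items() if score < threshold}


def discounted_link_count(resource, links, unqualified, discount=0.0):
    """Count links to a resource, down-weighting links from unqualified sources.

    Sources that could not be scored keep full weight in this sketch.
    """
    total = 0.0
    for source, targets in links.items():
        if resource in targets:
            total += discount if source in unqualified else 1.0
    return total


# Hypothetical data: two seeds with known scores and two linking sources.
seed_scores = {"seed-a": 50.0, "seed-b": 0.005}
links = {
    "blog.example": ["seed-a", "new-page"],
    "linkfarm.example": ["seed-b", "new-page", "spam-page"],
}

scores = source_scores(links, seed_scores)
unqualified = unqualified_sources(scores, threshold=1.0)
print(discounted_link_count("new-page", links, unqualified))
print(discounted_link_count("spam-page", links, unqualified))
```

In this toy data, the page linked to only by the low-scoring source ends up with no counted links, while the page that also has a link from the better source keeps one.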
Some queries tend to produce more spam than others. The patent points at one group in particular:
For example, publishers of many video sharing web sites attempt to manipulate rankings by creating links to the sites, resulting in a disproportionately large number of links compared to the number of selections, while national news web sites typically do not attempt such manipulation.
For queries that tend to produce more spam, selections of links may be given more weight than link counts in this calculation:
For queries that have a high propensity for surfacing spam-related web pages, the system can put a higher weight on selection counts for the search results and a lower weight on resource link counts for the search results when ranking the search results. Thus, the system can be said to “trust” the click counts more than the resource link counts for search queries that have a propensity for surfacing spam-related web pages.
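To make that shift in “trust” concrete, here is a small Python sketch. The patent does not give a formula, so the linear blend, the 0-to-1 spam propensity value, and the function and parameter names are all assumptions of mine rather than anything Google has described.

```python
def blended_signal(link_count: float, selection_count: float,
                   spam_propensity: float) -> float:
    """Blend link and selection counts for one search result.

    Assumption: spam_propensity is a number from 0.0 to 1.0 for the query, and
    weight moves linearly from link counts to selection counts as it rises.
    """
    if not 0.0 <= spam_propensity <= 1.0:
        raise ValueError("spam_propensity must be between 0.0 and 1.0")
    link_weight = 0.5 * (1.0 - spam_propensity)
    selection_weight = 0.5 * (1.0 + spam_propensity)
    return link_weight * link_count + selection_weight * selection_count


# Hypothetical page with many links but few selections of those links.
ordinary_query = blended_signal(1000, 10, spam_propensity=0.0)
spam_prone_query = blended_signal(1000, 10, spam_propensity=0.9)

print(spam_prone_query < ordinary_query)
```

The same heavily linked but rarely selected page gets far less credit when the query is one that tends to surface spam.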
THE SELECTION QUALITY SCORE MAY BE BASED UPON DWELL TIME
Part of the process involved in calculating a quality score for resources involves determining a seed score for a seed resource. This can start with identifying a link resource count for the seed resource. That can be done by looking at the number of resources that include a link to the seed resource.
The next aspect of that involves identifying a selection count for the seed resource. This selection count for the seed resource may be based on a number of times the link(s) to the seed resource that are included in other resources have been selected.
A selection quality score is determined for at least a portion of the selections of the links to the seed resource. The selection quality score for a selection is a measure of quality for the selection and can be used to discount low quality selections when determining the seed score for the seed resource.
This brings back memories of the book by Steven Levy, called In the Plex, in which he stated that one metric that was often treated with a positive outlook by people at Google was one they referred to as “The Long Click.”
The patent tells us:
The selection quality score may be higher for a selection that results in a long dwell time (e.g., greater than a threshold time period) than the selection quality score for a selection that results in a short dwell time (e.g., less than a threshold time period). As automatically generated link selections are often of a short duration, considering the dwell time in determining the seed score can account for these false link selections.
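A dwell-time-based selection quality score could look something like the Python sketch below. The patent only speaks of “a threshold time period,” so the two thresholds, the three quality values, and the names are hypothetical.

```python
def selection_quality(dwell_seconds: float,
                      short_threshold: float = 10.0,
                      long_threshold: float = 60.0) -> float:
    """Score one link selection by how long the visitor stayed.

    Assumption: thresholds and score values are made up for illustration.
    """
    if dwell_seconds >= long_threshold:
        return 1.0   # a "long click"
    if dwell_seconds <= short_threshold:
        return 0.1   # a short click, possibly automated
    return 0.5       # a medium click


def quality_weighted_selections(dwell_times) -> float:
    """Sum of selection quality scores, used in place of a raw selection count."""
    return sum(selection_quality(d) for d in dwell_times)


# Hypothetical dwell times in seconds for four selections each.
human_like = quality_weighted_selections([120, 45, 300, 90])
bot_like = quality_weighted_selections([2, 1, 3, 2])

print(human_like > bot_like)
```

Both pages received four selections, but the short, automated-looking ones contribute far less once dwell time is taken into account.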
The patent also tells us that some historic selection behavior might indicate that selections were made by real users rather than some automated process.
Resources with relatively low resource scores may be demoted in rankings, and resources with high resource scores may be boosted in rankings.
The patent provides much more detail than I have in this post, and it is highly recommended reading. It is the first I can recall that has attempted to set up some kind of quality score for links that point to pages on the web, and to determine how much weight those links should pass along. The Reasonable Surfer Patent was different in that it determined how much weight a link might pass along based upon a probability that it was important, judged from features of how (and where) it was presented on a page.
I mentioned on Twitter that I would be writing about the Search Engine Land post from the start of this post, and that I had a guess as to what may have been implemented to produce the algorithmic change at Google that a number of people had noticed. Jonathan Hochman suggested that I consider referring to it as the Groundhog Update, considering the timing, since it seemed to take effect at the beginning of February. This patent was granted on the last day of January, and while it could have been implemented before then, it is also possible that it was put into place at the start of February.
Was what took place algorithmically at Google a weighting of linking resources based upon the traffic associated with them, or upon whether or not they were associated with spammy results?