Saturday, June 27, 2009

Challenges in Parallelism

A major challenge in the design of search engines is the management of parallel computing processes. There are many opportunities for race conditions and coherent faults.
For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache or corpus. The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully-synchronized, distributed, parallel architecture.

Sunday, March 1, 2009

The History

Page and Brin founded Google in 1998. Google attracted a loyal following among the growing number of Internet users, who liked its simple design. Off-page factors (such as PageRank and hyperlink analysis) were considered as well as on-page factors (such as keyword frequency, meta tags, headings, links and site structure) to enable Google to avoid the kind of manipulation seen in search engines that only considered on-page factors for their rankings. Although PageRank was more difficult to game, webmasters had already developed link building tools and schemes to influence the Inktomi search engine, and these methods proved similarly applicable to gaming PageRank. Many sites focused on exchanging, buying, and selling links, often on a massive scale. Some of these schemes, or link farms, involved the creation of thousands of sites for the sole purpose of link spamming. In recent years major search engines have begun to rely more heavily on off-web factors such as the age, sex, location, and search history of people conducting searches in order to further refine results.

By 2007, search engines had incorporated a wide range of undisclosed factors in their ranking algorithms to reduce the impact of link manipulation. Google says it ranks sites using more than 200 different signals. The three leading search engines, Google, Yahoo and Microsoft's Live Search, do not disclose the algorithms they use to rank pages. Notable SEOs, such as Rand Fishkin, Barry Schwartz, Aaron Wall and Jill Whalen, have studied different approaches to search engine optimization, and have published their opinions in online forums and blogs. SEO practitioners may also study patents held by various search engines to gain insight into the algorithms.

Legal precedents

On October 17, 2002, SearchKing filed suit in the United States District Court, Western District of Oklahoma, against the search engine Google. SearchKing's claim was that Google's tactics to prevent spamdexing constituted a tortuous interference with contractual relations. On May 27, 2003, the court granted Google's motion to dismiss the complaint because SearchKing "failed to state a claim upon which relief may be granted.
In March 2006, KinderStart filed a lawsuit against Google over search engine rankings. Kinderstart's web site was removed from Google's index prior to the lawsuit and the amount of traffic to the site dropped by 70%. On March 16, 2007 the United States District Court for the Northern District of California (San Jose Division) dismissed KinderStart's complaint without leave to amend, and partially granted Google's motion for Rule 11 sanctions against KinderStart's attorney, requiring him to pay part of Google's legal expenses.

Saturday, February 21, 2009

International markets

The search engines' market shares vary from market to market, as does competition. In 2003, Danny Sullivan stated that Google represented about 75% of all searches. In markets outside the United States, Google's share is often larger, and Google remains the dominant search engine worldwide as of 2007. As of 2006, Google held about 40% of the market in the United States, but Google had an 85-90% market share in Germany. While there were hundreds of SEO firms in the US at that time, there were only about five in Germany. In Russia the situation is reversed. Local search engine Yandex controls 50% of the paid advertising revenue, while Google has less than 9%. In China, Baidu continues to lead in market share, although Google has been gaining share as of 2007. Successful search optimization for international markets may require professional translation of web pages, registration of a domain name with a top level domain in the target market, and web hosting that provides a local IP address. Otherwise, the fundamental elements of search optimization are essentially the same, regardless of language.

Friday, March 7, 2008

White hat versus black hat

SEO techniques are classified by some into two broad categories: techniques that search engines recommend as part of good design and those techniques that search engines do not approve of and attempt to minimize the effect of, referred to as spamdexing. Some industry commentators classify these methods, and the practitioners who employ them, as either white hat SEO, or black hat SEO. White hats tend to produce results that last a long time, whereas black hats anticipate that their sites may eventually be banned either temporarily or permanently once the search engines discover what they are doing.
An SEO technique is considered white hat if it conforms to the search engines' guidelines and involves no deception. As the search engine guidelines are not written as a series of rules or commandments, this is an important distinction to note. White hat SEO is not just about following guidelines, but is about ensuring that the content a search engine indexes and subsequently ranks is the same content a user will see.

Friday, February 15, 2008

Webmasters and search engines

By 1997 search engines recognized that some webmasters were making efforts to rank well intheir search engines, and even manipulating the page rankings in search results. Early search engines, such as Infoseek, adjusted their algorithms to prevent webmasters from manipulating rankings by stuffing pages with excessive or irrelevant keywords.

Due to the high marketing value of targeted search results, there is potential for an adversarial relationship between search engines and SEOs. In 2005, an annual conference, AIRWeb, Adversarial Information Retrieval on the Web, was created to discuss and minimize thedamaging effects of aggressive web content providers.

SEO companies that employ overly aggressive techniques can get their client websites banned from the search results. In 2005, the Wall Street Journal profiled a company, Traffic Power, that allegedly used high-risk techniques and failed to disclose those risks to its clients. Wired magazine reported that the same company sued blogger Aaron Wall for writing about the ban. Google's Matt Cutts later confirmed that Google did in fact ban Traffic Power and some of its clients.

Some search engines have also reached out to the SEO industry, and are frequent sponsors and guests at SEO conferences and seminars. In fact, with the advent of paid inclusion, some search engines now have a vested interest in the health of the optimization community. Major search engines provide information and guidelines to help with site optimization. Google has a Sitemaps program to help webmasters learn if Google is having any problems indexing their website and also provides data on Google traffic to the website.Yahoo! Site Explorer provides a way for webmasters to submit URLs, determine how many pages are in the Yahoo!index and view link
information.