SEARCH ENGINE OPTIMIZATION: TO MAKE A NEW SEARCH ENGINE IN THE SERVICE TO THE INFORMATION TECHNOLOGY SECTOR
CHAPTER 1
1.1 INTRODUCTION TO SEARCH ENGINE OPTIMIZATION
Search engines are one of the primary ways that Internet users find Web sites. That's why a Web
site with good search engine listings may see a dramatic increase in traffic. Everyone wants those
good listings. Unfortunately, many Web sites appear poorly in search engine rankings or may not
be listed at all because they fail to consider how search engines work. In particular, submitting to
search engines (as covered in the Essentials section) is only part of the challenge of getting good
search engine positioning. It's also important to prepare a Web site through "search engine
optimization." Search engine optimization means ensuring that your Web pages are accessible to
search engines and are focused in ways that help improve the chances they will be found.
Search for anything using your favorite crawler-based search engine. Nearly instantly, the search
engine will sort through the millions of pages it knows about and present you with ones that
match your topic. The matches will even be ranked, so that the most relevant ones come first.
Of course, the search engines don't always get it right. Non-relevant pages make it through, and
sometimes it may take a little more digging to find what you are looking for. But, by and large,
search engines do an amazing job. As WebCrawler founder Brian Pinkerton puts it, "Imagine
walking up to a librarian and saying, 'travel.' They're going to look at you with a blank face." OK, a librarian's not really going to stare at you with a vacant expression. Instead, they're going to ask you questions to better understand what you are looking for.
Unfortunately, search engines don't have the ability to ask a few questions to focus your search,
as a librarian can. They also can't rely on judgment and past experience to rank web pages, in the
way humans can.
So, how do crawler-based search engines go about determining relevancy, when confronted with
hundreds of millions of web pages to sort through? They follow a set of rules, known as an
algorithm. Exactly how a particular search engine's algorithm works is a closely-kept trade
secret. However, all major search engines follow the general rules below.
1.1.1 Location, Location, Location...and Frequency
One of the main rules in a ranking algorithm involves the location and frequency of keywords on
a web page. Call it the location/frequency method, for short.
Remember the librarian mentioned above? They need to find books to match your request of
"travel," so it makes sense that they first look at books with travel in the title. Search engines
operate the same way. Pages with the search terms appearing in the HTML title tag are often
assumed to be more relevant than others to the topic.
Search engines will also check to see if the search keywords appear near the top of a web page,
such as in the headline or in the first few paragraphs of text. They assume that any page relevant
to the topic will mention those words right from the beginning.
Frequency is the other major factor in how search engines determine relevancy. A search engine
will analyze how often keywords appear in relation to other words in a web page. Those with a
higher frequency are often deemed more relevant than other web pages.
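As a toy illustration of the location/frequency method, the hypothetical scorer below rewards query terms that appear in the title, near the top of the page, and frequently in the body. The function name and the weights are our own invention for the sketch, not any real engine's algorithm:

```python
import re

def location_frequency_score(page_title: str, page_body: str, query: str) -> float:
    """Score a page for a query with the location/frequency heuristic:
    terms in the title weigh most, terms near the top of the body weigh
    more than terms further down, and overall frequency adds relevance."""
    terms = query.lower().split()
    words = re.findall(r"[a-z0-9]+", page_body.lower())
    title_words = re.findall(r"[a-z0-9]+", page_title.lower())
    score = 0.0
    for term in terms:
        if term in title_words:
            score += 10.0              # location: title match (hypothetical weight)
        if term in words[:50]:
            score += 3.0               # location: mentioned near the top of the page
        if words:
            # frequency: occurrences relative to document length
            score += 5.0 * words.count(term) / len(words)
    return score
```

A page titled "Stamp Collecting Guide" that mentions the terms early and often will outscore an unrelated page for the query "stamp collecting".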
1.1.2 Spice in the Recipe
Now it's time to qualify the location/frequency method described above. All the major search
engines follow it to some degree, in the same way cooks may follow a standard chili recipe. But
cooks like to add their own secret ingredients. In the same way, search engines add spice to the
location/frequency method. Nobody does it exactly the same, which is one reason why the same
search on different search engines produces different results.
To begin with, some search engines index more web pages than others. Some search engines also
index web pages more often than others. The result is that no search engine has the exact same
collection of web pages to search through. That naturally produces differences, when comparing
their results.
Search engines may also penalize pages or exclude them from the index, if they detect search
engine "spamming." An example is when a word is repeated hundreds of times on a page, to
increase the frequency and propel the page higher in the listings. Search engines watch for
common spamming methods in a variety of ways, including following up on complaints from
their users.
1.1.3 Off The Page Factors
Crawler-based search engines have plenty of experience now with webmasters who constantly
rewrite their web pages in an attempt to gain better rankings. Some sophisticated webmasters
may even go to great lengths to "reverse engineer" the location/frequency systems used by a
particular search engine. Because of this, all major search engines now also make use of "off the
page" ranking criteria.
Off the page factors are those that a webmaster cannot easily influence. Chief among these is
link analysis. By analyzing how pages link to each other, a search engine can both determine
what a page is about and whether that page is deemed to be "important" and thus deserving of a
ranking boost. In addition, sophisticated techniques are used to screen out attempts by
webmasters to build "artificial" links designed to boost their rankings.
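Link analysis can be sketched with a simplified, PageRank-style iteration: a page is "important" if important pages link to it. The function below is a toy model under our own assumptions (damping factor, iteration count, dangling-page handling), not the formula any particular engine uses:

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iters: int = 50) -> dict[str, float]:
    """Toy link-analysis score: `links` maps each page to the pages it
    links to; the returned dict maps each page to its importance."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                # each outgoing link passes on an equal share of this page's rank
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # dangling page: spread its rank evenly over all pages
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
        rank = new
    return rank
```

Because rank flows along links, a page that merely repeats keywords gains nothing here; it must attract links from pages that themselves have rank.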
Another off the page factor is click through measurement. In short, this means that a search
engine may watch what results someone selects for a particular search, then eventually drop
high-ranking pages that aren't attracting clicks, while promoting lower-ranking pages that do pull
in visitors. As with link analysis, systems are used to compensate for artificial links generated by
eager webmasters.
The Search Engine Features Chart has a section that summarizes key areas of how crawler-based search engines rank web pages. The Search Engine Placement Tips page also summarizes key skills that will help to improve the relevancy of pages with crawler-based search engines.
Search Engine Watch members have access to the How Search Engines Work section. This section provides detailed information about how each major search engine gathers its listings, and additional advice on enhancing your position in their results.
A query on a crawler-based search engine often turns up thousands or even millions of matching
web pages. In many cases, only the ten most "relevant" matches are displayed on the first page.
Naturally, anyone who runs a web site wants to be in the "top ten" results. This is because most
users will find a result they like in the top ten. Being listed 11th or beyond means that many people
may miss your web site.
1.1.4 How to Pick Target Keywords
How do we think people will search for our web page? The words we imagine them typing into
the search box are our target keywords. For example, say we have a page devoted to stamp
collecting. Anytime someone types "stamp collecting," we want our page to be in the top ten
results. Accordingly, these are our target keywords for that page.
Each page in our web site will have different target keywords that reflect the page's content. For
example, say we have another page about the history of stamps. Then "stamp history" might be
our target keywords for that page.
Our target keywords should always be at least two or more words long. Usually, too many sites
will be relevant for a single word, such as "stamps." This "competition" means our odds of
success are lower. No need to waste our time fighting the odds. Pick phrases of two or more
words, and we'll have a better shot at success.
The Researching Keywords article provides additional information about selecting key terms.
It's worth taking the time to make our site more search engine friendly because some simple
changes may pay off with big results. Even if we don't come up in the top ten for our target
keywords, we may find an improvement for target keywords we aren't anticipating. The addition
of just one extra word can suddenly make a site appear more relevant, and it can be impossible to
guess what that word will be.
Also, remember that while search engines are a primary way people look for web sites, they are
not the only way. People also find sites through word-of-mouth, traditional advertising,
traditional media, blog posts, web directories, and links from other sites. Since the advent of
Web 2.0 applications, people are finding sites through feeds, blogs, podcasts, vlogs and many
other means. Sometimes, these alternative forms can be more effective draws than search
engines. The most effective marketing strategy is to combine search marketing with other online
and offline media.
We should know when it's time to call it quits. A few changes may be enough to achieve top
rankings in one or two search engines. But that's not enough for some people, and they will
invest days creating special pages and changing their sites to try and do better. This time could
usually be put to better use pursuing non-search engine publicity methods.
Let's not obsess over our ranking. Even if we follow every tip and find no improvement, we still have gained something. We will know that search engines are not the way we'll be attracting traffic. We can concentrate our efforts in more productive areas, rather than wasting our valuable time.
Manipulating text on a webpage was an early form of SEO. The classic document-ranking technique involves viewing the text on a website and determining its value to a search query by using a set of so-called "on page" parameters. For reasons that will be made obvious, a simple, text-only information retrieval system produces poor search results. In the past, several text-only search engines relied upon on-page ranking factors. One of the early web crawlers was Wandex, created in 1993 at MIT by Matthew Gray. WebCrawler, released in 1994, is considered the first web crawler to look at the entire text of a web document.

When ranking a document, the early companies (and most that followed) focused on what are now called "on-page factors": parameters a webpage author can control directly. As you will see, these parameters are of little use in generating relevant search results. If we were to write a crude ranking algorithm, we could create combinations of HTML parameters appearing on a webpage to generate ranking factors. By using on-page HTML parameters, a simple ranking algorithm could generate a list of documents relevant to a given search query. This approach has the built-in assumption that the authors of the web pages we are indexing are honest about the content they are authoring.

An algorithm is simply a set of instructions, usually mathematical, used to calculate a certain parameter and perform some type of data processing. It is the search engine developer's job to generate a set of highly relevant documents for any search query, using the available parameters on the web. The task is challenging because the parameters usable by the algorithm are not necessarily the same as the ones web users see when deciding if a webpage is relevant to their search.
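The crude on-page ranking idea above starts with extracting those HTML parameters. A minimal sketch using Python's standard `html.parser` (the class and field names are our own, for illustration) might collect the title, meta keywords, and word frequencies that such an algorithm would combine:

```python
from html.parser import HTMLParser

class OnPageFactors(HTMLParser):
    """Collect the on-page parameters a crude ranker might combine:
    the <title> text, the meta keywords, and per-word counts."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta_keywords = []
        self.word_counts = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "keywords":
            # the author-supplied keyword list: easy to manipulate dishonestly
            self.meta_keywords += [k.strip().lower()
                                   for k in attrs.get("content", "").split(",")]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        for word in data.lower().split():
            self.word_counts[word] = self.word_counts.get(word, 0) + 1
```

Every parameter collected here is under the page author's direct control, which is exactly why an engine relying on them alone is easy to fool.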
A Web search engine is a tool designed to search for information on the World Wide Web. The
search results are usually presented in a list and are commonly called hits. The information may
consist of web pages, images, information and other types of files. Some search engines also mine data available in news sources, books, databases, or open directories. Unlike Web directories,
which are maintained by human editors, search engines operate algorithmically or are a mixture
of algorithmic and human input.
Search engines are the key to finding specific information on the vast expanse of the World Wide
Web. Without sophisticated search engines, it would be virtually impossible to locate anything
on the Web without knowing a specific URL.
When people use the term search engine in relation to the Web, they are usually referring to the
actual search forms that search through databases of HTML documents, initially gathered by
a robot.
There are basically three types of search engines: those that are powered by robots (called crawlers, ants, or spiders), those that are powered by human submissions, and those that are a hybrid of the two.
Crawler-based search engines are those that use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site's meta tags, and follow the links that the site connects to, performing indexing on all linked Web sites as well. The crawler returns all that information back to a central repository, where the data is indexed. The
crawler will periodically return to the sites to check for any information that has changed. The
frequency with which this happens is determined by the administrators of the search engine.
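The visit-read-follow cycle described above can be sketched as a breadth-first traversal. In this toy version (our own sketch, not any real crawler), `fetch` is a stand-in for an HTTP request so the example stays self-contained:

```python
import re
from collections import deque

def crawl(start_url: str, fetch, max_pages: int = 100) -> dict[str, str]:
    """Breadth-first crawl: fetch a page, store its HTML in the
    repository, and queue its links. `fetch(url)` returns the page's
    HTML, or None for a dead link (in a real crawler this would be an
    HTTP request)."""
    repository = {}
    queue = deque([start_url])
    while queue and len(repository) < max_pages:
        url = queue.popleft()
        if url in repository:
            continue  # already visited
        html = fetch(url)
        if html is None:
            continue  # dead link
        repository[url] = html
        # follow the href links found on the page
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in repository:
                queue.append(link)
    return repository
```

Re-running `crawl` periodically against the live web is what lets the engine notice changed or vanished pages, at whatever frequency its administrators choose.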
Human-powered search engines rely on humans to submit information that is subsequently
indexed and catalogued. Only information that is submitted is put into the index.
In both cases, when you query a search engine to locate information, you're actually searching
through the index that the search engine has created —you are not actually searching the Web.
These indices are giant databases of information that is collected and stored and subsequently
searched. This explains why sometimes a search on a commercial search engine, such as Yahoo!
or Google, will return results that are, in fact, dead links. Since the search results are based on the index, if the index hasn't been updated since a Web page became invalid, the search engine treats the page as still an active link even though it no longer is. It will remain that way until the index is updated.
A search engine can be divided into four different modules and components, as described below:
Fig. 1: SEO Lifecycle with a Search Engine
1.1.5 Crawler module
The crawler module consists of software that collects and categorizes relevant objects from web
documents.
This module creates a program, called a spider, that crawls over the web pages on the WWW and then returns to the search engine with the collected information, which is stored in the page repository. Some popular pages that are frequently queried by users may remain in the page repository for an indefinite amount of time. It is estimated that, of the 20 billion existing web pages, search engines have crawled 8-10 million of them.
1.1.6 Indexing module
The indexing module retrieves pages stored in the page repository and then extracts only their "vital descriptors". The results of this extraction process are then compressed and stored in three types of indexes that differ in the information they keep. The first type of index, the content index, is used for keeping content-based information such as the keywords (meta-tags), title and anchor text used in a web page. The second type of index, the structure index, is used for storing valuable information regarding the hyperlink structure of a web page. Information such as the number and sources of in-links coming to a web page is therefore stored in the structure index.
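A content index of this kind is commonly built as an inverted index: a map from each term to the set of pages containing it, so a query term can be resolved without rescanning every document. A minimal sketch (the function name is illustrative):

```python
import re
from collections import defaultdict

def build_content_index(repository: dict[str, str]) -> dict[str, set[str]]:
    """Toy content index: map each word to the set of page URLs whose
    text contains it. `repository` maps URL -> page text."""
    index = defaultdict(set)
    for url, text in repository.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index
```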
1.1.7 Ranking (Search Algorithm) module
The ranking module takes the relevant web pages obtained from both the query module and the structure index and then ranks them based on the mathematical algorithm used by the search engine. The result of this process is a set of ordered web pages, listed by their relevance. Therefore, in theory, pages that appear at the top of the list are those considered the most desirable by users.
The search engine modules described in figure are categorized in two different categories based
on the type of dependencies they have.
Both crawling and indexing are done continuously on the web; therefore, these modules are not triggered by user queries and are grouped under the query-independent category.
On the other hand, the query and ranking processes are triggered by queries made by users; therefore, these modules are grouped under the query-dependent category.
1.1.8 Query module
The query module processes all queries made by users by retrieving web information stored in the indexes. Some popular web pages whose information is not stored in the indexes may be retrieved directly from the page repository. The results displayed on the user's computer screen are filtered by the ranking module described in the previous subsection.
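Putting the query and ranking modules together, a toy query handler might intersect the index entries for each query term and then order the surviving pages by a precomputed rank score. The function and parameter names below are our own, a sketch rather than any engine's actual pipeline:

```python
def answer_query(query: str, index: dict[str, set[str]],
                 rank_score: dict[str, float]) -> list[str]:
    """Query-module sketch: intersect the index postings for every
    query term, then order the candidates by their rank score."""
    terms = query.lower().split()
    if not terms:
        return []
    candidates = None
    for term in terms:
        postings = index.get(term, set())
        # a page must contain every query term to remain a candidate
        candidates = postings if candidates is None else candidates & postings
    return sorted(candidates, key=lambda url: rank_score.get(url, 0.0),
                  reverse=True)
```

Here the index lookup plays the role of the query module and the final sort plays the role of the ranking module.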