SEARCH ENGINE OPTIMIZATION: TO MAKE A NEW SEARCH ENGINE IN THE SERVICE TO THE INFORMATION TECHNOLOGY SECTOR
CHAPTER 1
1.1 INTRODUCTION TO SEARCH ENGINE OPTIMIZATION
Search engines are one of the primary ways that Internet users find Web sites. That's why a Web
site with good search engine listings may see a dramatic increase in traffic. Everyone wants those
good listings. Unfortunately, many Web sites appear poorly in search engine rankings or may not
be listed at all because they fail to consider how search engines work. In particular, submitting to
search engines (as covered in the Essentials section) is only part of the challenge of getting good
search engine positioning. It's also important to prepare a Web site through "search engine
optimization." Search engine optimization means ensuring that your Web pages are accessible to
search engines and are focused in ways that help improve the chances they will be found.
Search for anything using your favorite crawler-based search engine. Nearly instantly, the search
engine will sort through the millions of pages it knows about and present you with ones that
match your topic. The matches will even be ranked, so that the most relevant ones come first.
Of course, the search engines don't always get it right. Non-relevant pages make it through, and
sometimes it may take a little more digging to find what you are looking for. But, by and large,
search engines do an amazing job. As WebCrawler founder Brian Pinkerton puts it, "Imagine
walking up to a librarian and saying, 'travel.' They're going to look at you with a blank face." OK, a librarian's not really going to stare at you with a vacant expression. Instead, they're going to ask you questions to better understand what you are looking for.
Unfortunately, search engines don't have the ability to ask a few questions to focus your search,
as a librarian can. They also can't rely on judgment and past experience to rank web pages, in the
way humans can.
So, how do crawler-based search engines go about determining relevancy, when confronted with
hundreds of millions of web pages to sort through? They follow a set of rules, known as an
algorithm. Exactly how a particular search engine's algorithm works is a closely-kept trade
secret. However, all major search engines follow the general rules below.
1.1.1 Location, Location, Location...and Frequency
One of the main rules in a ranking algorithm involves the location and frequency of keywords on
a web page. Call it the location/frequency method, for short.
Remember the librarian mentioned above? They need to find books to match your request of
"travel," so it makes sense that they first look at books with travel in the title. Search engines
operate the same way. Pages with the search terms appearing in the HTML title tag are often
assumed to be more relevant than others to the topic.
Search engines will also check to see if the search keywords appear near the top of a web page,
such as in the headline or in the first few paragraphs of text. They assume that any page relevant
to the topic will mention those words right from the beginning.
Frequency is the other major factor in how search engines determine relevancy. A search engine
will analyze how often keywords appear in relation to other words in a web page. Those with a
higher frequency are often deemed more relevant than other web pages.
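As a toy illustration of the location/frequency method, the hypothetical scorer below rewards query terms that appear in the title, near the top of the page, and frequently in the body. The function name and the weights are our own invention for the sketch, not any real engine's algorithm:

```python
import re

def location_frequency_score(page_title: str, page_body: str, query: str) -> float:
    """Score a page for a query with the location/frequency heuristic:
    terms in the title weigh most, terms near the top of the body weigh
    more than terms further down, and overall frequency adds relevance."""
    terms = query.lower().split()
    words = re.findall(r"[a-z0-9]+", page_body.lower())
    title_words = re.findall(r"[a-z0-9]+", page_title.lower())
    score = 0.0
    for term in terms:
        if term in title_words:
            score += 10.0              # location: title match (hypothetical weight)
        if term in words[:50]:
            score += 3.0               # location: mentioned near the top of the page
        if words:
            # frequency: occurrences relative to document length
            score += 5.0 * words.count(term) / len(words)
    return score
```

A page titled "Stamp Collecting Guide" that mentions the terms early and often will outscore an unrelated page for the query "stamp collecting".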
1.1.2 Spice in the Recipe
Now it's time to qualify the location/frequency method described above. All the major search
engines follow it to some degree, in the same way cooks may follow a standard chili recipe. But
cooks like to add their own secret ingredients. In the same way, search engines add spice to the
location/frequency method. Nobody does it exactly the same, which is one reason why the same
search on different search engines produces different results.
To begin with, some search engines index more web pages than others. Some search engines also
index web pages more often than others. The result is that no search engine has the exact same
collection of web pages to search through. That naturally produces differences, when comparing
their results.
Search engines may also penalize pages or exclude them from the index, if they detect search
engine "spamming." An example is when a word is repeated hundreds of times on a page, to
increase the frequency and propel the page higher in the listings. Search engines watch for
common spamming methods in a variety of ways, including following up on complaints from
their users.
1.1.3 Off The Page Factors
Crawler-based search engines have plenty of experience now with webmasters who constantly
rewrite their web pages in an attempt to gain better rankings. Some sophisticated webmasters
may even go to great lengths to "reverse engineer" the location/frequency systems used by a
particular search engine. Because of this, all major search engines now also make use of "off the
page" ranking criteria.
Off the page factors are those that a webmaster cannot easily influence. Chief among these is
link analysis. By analyzing how pages link to each other, a search engine can both determine
what a page is about and whether that page is deemed to be "important" and thus deserving of a
ranking boost. In addition, sophisticated techniques are used to screen out attempts by
webmasters to build "artificial" links designed to boost their rankings.
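Link analysis can be sketched with a simplified, PageRank-style iteration: a page is "important" if important pages link to it. The function below is a toy model under our own assumptions (damping factor, iteration count, dangling-page handling), not the formula any particular engine uses:

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iters: int = 50) -> dict[str, float]:
    """Toy link-analysis score: `links` maps each page to the pages it
    links to; the returned dict maps each page to its importance."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                # each outgoing link passes on an equal share of this page's rank
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # dangling page: spread its rank evenly over all pages
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
        rank = new
    return rank
```

Because rank flows along links, a page that merely repeats keywords gains nothing here; it must attract links from pages that themselves have rank.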
Another off the page factor is click through measurement. In short, this means that a search
engine may watch what results someone selects for a particular search, then eventually drop
high-ranking pages that aren't attracting clicks, while promoting lower-ranking pages that do pull
in visitors. As with link analysis, systems are used to compensate for artificial links generated by
eager webmasters.
The Search Engine Features Chart has a section that summarizes key areas of how crawler-based search engines rank web pages. The Search Engine Placement Tips page also summarizes key skills that will help to improve the relevancy of pages with crawler-based search engines.
Search Engine Watch members have access to the How Search Engines Work section. This section provides detailed information about how each major search engine gathers its listings, and additional advice on enhancing your position in their results.
A query on a crawler-based search engine often turns up thousands or even millions of matching
web pages. In many cases, only the ten most "relevant" matches are displayed on the first page.
Naturally, anyone who runs a web site wants to be in the "top ten" results. This is because most
users will find a result they like in the top ten. Being listed 11th or beyond means that many people
may miss your web site.
1.1.4 How to Pick Target Keywords
How do we think people will search for our web page? The words we imagine them typing into
the search box are our target keywords. For example, say we have a page devoted to stamp
collecting. Anytime someone types "stamp collecting," we want our page to be in the top ten
results. Accordingly, these are our target keywords for that page.
Each page in our web site will have different target keywords that reflect the page's content. For
example, say we have another page about the history of stamps. Then "stamp history" might be
our target keywords for that page.
Our target keywords should always be at least two or more words long. Usually, too many sites
will be relevant for a single word, such as "stamps." This "competition" means our odds of
success are lower. No need to waste our time fighting the odds. Pick phrases of two or more
words, and we'll have a better shot at success.
The Researching Keywords article provides additional information about selecting key terms.
It's worth taking the time to make our site more search engine friendly because some simple
changes may pay off with big results. Even if we don't come up in the top ten for our target
keywords, we may find an improvement for target keywords we aren't anticipating. The addition
of just one extra word can suddenly make a site appear more relevant, and it can be impossible to
guess what that word will be.
Also, remember that while search engines are a primary way people look for web sites, they are
not the only way. People also find sites through word-of-mouth, traditional advertising,
traditional media, blog posts, web directories, and links from other sites. Since the advent of
Web 2.0 applications, people are finding sites through feeds, blogs, podcasts, vlogs and many
other means. Sometimes, these alternative forms can be more effective draws than search
engines. The most effective marketing strategy is to combine search marketing with other online
and offline media.
We should know when it's time to call it quits. A few changes may be enough to achieve top
rankings in one or two search engines. But that's not enough for some people, and they will
invest days creating special pages and changing their sites to try and do better. This time could
usually be put to better use pursuing non-search engine publicity methods.
Let's not obsess over our ranking. Even if we follow every tip and find no improvement, we still have gained something. We will know that search engines are not the way we'll be attracting traffic. We can concentrate our efforts in more productive areas, rather than wasting our valuable time.
Manipulating text on a webpage was an early form of SEO. The classic document-ranking technique involves viewing the text on a website and determining its value to a search query by using a set of so-called "on page" parameters. For reasons that will be made obvious, a simple, text-only information retrieval system produces poor search results. In the past, several text-only search engines relied upon on-page ranking factors. One of the early web crawlers was Wandex, created in 1993 at MIT by Matthew Gray. WebCrawler, released in 1994, is considered the first web crawler to look at the entire text of a web document.

When ranking a document, the early companies (and most that followed) focused on what are now called "on-page factors": parameters a webpage author can control directly. As you will see, these parameters are of little use in generating relevant search results. If we were to write a crude ranking algorithm, we could create combinations of HTML parameters appearing on a webpage to generate ranking factors. By using on-page HTML parameters, a simple ranking algorithm could generate a list of documents relevant to a given search query. This approach has the built-in assumption that the authors of the web pages we are indexing are honest about the content they are authoring.

An algorithm is simply a set of instructions, usually mathematical, used to calculate a certain parameter and perform some type of data processing. It is the search engine developer's job to generate a set of highly relevant documents for any search query, using the available parameters on the web. The task is challenging because the parameters usable by the algorithm are not necessarily the same as the ones web users see when deciding if a webpage is relevant to their search.
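The crude on-page ranking idea above starts with extracting those HTML parameters. A minimal sketch using Python's standard `html.parser` (the class and field names are our own, for illustration) might collect the title, meta keywords, and word frequencies that such an algorithm would combine:

```python
from html.parser import HTMLParser

class OnPageFactors(HTMLParser):
    """Collect the on-page parameters a crude ranker might combine:
    the <title> text, the meta keywords, and per-word counts."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta_keywords = []
        self.word_counts = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "keywords":
            # the author-supplied keyword list: easy to manipulate dishonestly
            self.meta_keywords += [k.strip().lower()
                                   for k in attrs.get("content", "").split(",")]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        for word in data.lower().split():
            self.word_counts[word] = self.word_counts.get(word, 0) + 1
```

Every parameter collected here is under the page author's direct control, which is exactly why an engine relying on them alone is easy to fool.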
A Web search engine is a tool designed to search for information on the World Wide Web. The
search results are usually presented in a list and are commonly called hits. The information may
consist of web pages, images, information and other types of files. Some search engines also mine data available in news sources, books, databases, or open directories. Unlike Web directories,
which are maintained by human editors, search engines operate algorithmically or are a mixture
of algorithmic and human input.
Search engines are the key to finding specific information on the vast expanse of the World Wide
Web. Without sophisticated search engines, it would be virtually impossible to locate anything
on the Web without knowing a specific URL.
When people use the term search engine in relation to the Web, they are usually referring to the
actual search forms that search through databases of HTML documents, initially gathered by
a robot.
There are basically three types of search engines: those that are powered by robots (called crawlers, ants, or spiders), those that are powered by human submissions, and those that are a hybrid of the two.
Crawler-based search engines are those that use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site's meta tags, and follow the links that the site connects to, performing indexing on all linked Web sites as well. The crawler returns all that information back to a central repository, where the data is indexed. The
crawler will periodically return to the sites to check for any information that has changed. The
frequency with which this happens is determined by the administrators of the search engine.
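The visit-read-follow cycle described above can be sketched as a breadth-first traversal. In this toy version (our own sketch, not any real crawler), `fetch` is a stand-in for an HTTP request so the example stays self-contained:

```python
import re
from collections import deque

def crawl(start_url: str, fetch, max_pages: int = 100) -> dict[str, str]:
    """Breadth-first crawl: fetch a page, store its HTML in the
    repository, and queue its links. `fetch(url)` returns the page's
    HTML, or None for a dead link (in a real crawler this would be an
    HTTP request)."""
    repository = {}
    queue = deque([start_url])
    while queue and len(repository) < max_pages:
        url = queue.popleft()
        if url in repository:
            continue  # already visited
        html = fetch(url)
        if html is None:
            continue  # dead link
        repository[url] = html
        # follow the href links found on the page
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in repository:
                queue.append(link)
    return repository
```

Re-running `crawl` periodically against the live web is what lets the engine notice changed or vanished pages, at whatever frequency its administrators choose.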
Human-powered search engines rely on humans to submit information that is subsequently
indexed and catalogued. Only information that is submitted is put into the index.
In both cases, when you query a search engine to locate information, you're actually searching
through the index that the search engine has created —you are not actually searching the Web.
These indices are giant databases of information that is collected and stored and subsequently
searched. This explains why sometimes a search on a commercial search engine, such as Yahoo!
or Google, will return results that are, in fact, dead links. Since the search results are based on the index, if the index hasn't been updated since a Web page became invalid, the search engine treats the page as still an active link even though it no longer is. It will remain that way until the index is updated.
A search engine can be divided into four different modules and components, as described below:
Fig. 1: SEO Lifecycle with a Search Engine
1.1.5 Crawler module
The crawler module consists of software that collects and categorizes relevant objects from web
documents.
This module creates a program, called a spider, that crawls over the web pages on the WWW and then returns to the search engine with the collected information, which is stored in the page repository. Some popular pages that are frequently queried by users may remain in the page repository for an indefinite amount of time. It is estimated that, of the 20 billion existing web pages, search engines have crawled 8-10 million of them.
1.1.6 Indexing module
The indexing module retrieves pages stored in the page repository and then extracts only their "vital descriptors". The results of this extraction process are then compressed and stored in three types of indexes that differ in the information they keep. The first type of index, the content index, is used for keeping content-based information such as the keywords (meta-tags), title and anchor text used in a web page. The second type of index, the structure index, is used for storing valuable information regarding the hyperlink structure of a web page. Information such as the number and sources of in-links coming to a web page is therefore stored in the structure index.
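A content index of this kind is commonly built as an inverted index: a map from each term to the set of pages containing it, so a query term can be resolved without rescanning every document. A minimal sketch (the function name is illustrative):

```python
import re
from collections import defaultdict

def build_content_index(repository: dict[str, str]) -> dict[str, set[str]]:
    """Toy content index: map each word to the set of page URLs whose
    text contains it. `repository` maps URL -> page text."""
    index = defaultdict(set)
    for url, text in repository.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index
```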
1.1.7 Ranking (Search Algorithm) module
The ranking module takes the relevant web pages obtained from both the query module and the structure index and then ranks them based on the mathematical algorithm used by the search engine. The result of this process is a set of ordered web pages, listed by their relevance. Therefore, in theory, pages that appear at the top of the list are those considered the most desirable by users.
The search engine modules described in figure are categorized in two different categories based
on the type of dependencies they have.
Both crawling and indexing are done continuously on the web; therefore, these modules are not triggered by user queries and are grouped under the query-independent category.
On the other hand, the query and ranking processes are triggered by queries made by users; therefore, these modules are grouped under the query-dependent category.
1.1.8 Query module
The query module processes all queries made by users by retrieving web information stored in the indexes. Some popular web pages whose information is not stored in the indexes may be retrieved directly from the page repository. The results displayed on the user's computer screen are filtered by the ranking module described in the previous subsection.
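Putting the query and ranking modules together, a toy query handler might intersect the index entries for each query term and then order the surviving pages by a precomputed rank score. The function and parameter names below are our own, a sketch rather than any engine's actual pipeline:

```python
def answer_query(query: str, index: dict[str, set[str]],
                 rank_score: dict[str, float]) -> list[str]:
    """Query-module sketch: intersect the index postings for every
    query term, then order the candidates by their rank score."""
    terms = query.lower().split()
    if not terms:
        return []
    candidates = None
    for term in terms:
        postings = index.get(term, set())
        # a page must contain every query term to remain a candidate
        candidates = postings if candidates is None else candidates & postings
    return sorted(candidates, key=lambda url: rank_score.get(url, 0.0),
                  reverse=True)
```

Here the index lookup plays the role of the query module and the final sort plays the role of the ranking module.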