Page 1: Genesis of the Open Directory Project Rich Skrenta skrenta@rt.com January 21, 2003

Genesis of the Open Directory Project

Rich Skrenta

skrenta@rt.com

January 21, 2003

March 1998

• Work project was winding down

• Going up and down Sand Hill Road trying to get a web-calendar startup funded

• Read Danny Sullivan’s report on Yahoo’s listing problems on Search Engine Watch

http://www.searchenginewatch.com/sereport/97/09-yahoo.html

http://www.wired.com/news/print/0,1294,10236,00.html

Idea for GnuHoo

• Yahoo seemed to be ignoring their core asset - the directory

• How could we build a competitor?

• Didn't want to pay an editorial staff

  – even a cheap one

• Tequila + Brainstorming = GnuHoo

Idea for GnuHoo

• Use volunteer editors to build a web directory like Yahoo’s

• Volunteers would do a better job than paid generalists, since they would be experts about their area & have a personal interest

• Restrict editors to sub-branches of the directory, to limit the harm they could do
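
One way to picture the sub-branch restriction is a path-prefix check. The sketch below is only an illustration in Python; the function name `can_edit` and the sample branch are assumptions, not taken from the original system.

```python
def can_edit(editor_branches, category):
    """Return True if `category` lies inside one of the editor's branches."""
    for branch in editor_branches:
        # Match the branch itself or anything below it. Adding "/" keeps
        # "Computers/AI" from matching a sibling such as "Computers/AIX".
        if category == branch or category.startswith(branch + "/"):
            return True
    return False


# Hypothetical editor assigned to a single sub-branch.
branches = ["Computers/AI"]
print(can_edit(branches, "Computers/AI/Fuzzy"))   # True
print(can_edit(branches, "Computers/Security"))   # False
```

An editor who makes a mess can then only affect the categories under their own branch.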

Original Goals

• Thought if we could reach 1,000 editors the directory would be successful

• Bootstrap problem was key - how to get the first 10,000 sites. The directory had to look “real” from Day 1

• Figured we needed 1M sites for a competitive directory

• Original get-off-the-couch motivational goal: We told ourselves that if we could get a story in Wired out of the effort, it would be worth doing

“Seed” Problem

• Needed a hierarchy & 10,000 sites to launch the directory

• Briefly considered Dewey Decimal

  – good thing we didn’t, it’s not free

  – didn’t seem to fit the web

• Original GnuHoo hierarchy mirrored Usenet

alt.2600               Computers/Hacking
alt.3d                 Computers/Graphics/3D
alt.food               Recreation/Food
alt.internet           Computers/Internet
alt.mud                Games/MUDs
alt.online-service     Computers/Internet/ISPs
alt.rock-n-roll        Music/Rock-n-Roll
alt.rock-n-roll.metal  Music/Heavy_Metal
alt.security           Computers/Security
alt.sources            Computers/Software
alt.tv.simpsons        Television/Simpsons
alt.tv.x-files         Television/X-Files
comp.ai                Computers/AI
comp.ai.alife          Computers/AI/Artificial_Life
comp.ai.fuzzy          Computers/AI/Fuzzy
comp.ai.games          Computers/AI/Games
comp.ai.nat-lang       Computers/AI/Natural_Language

Original Homepage Mock-up

ARTS        Movies Television Books ...
RECREATION  Travel Food Outdoors Humor ...

BUSINESS    Jobs Companies Investing ...
REFERENCE   Education Libraries Taxes ...

COMPUTERS   Internet Software Hardware ...
REGIONAL    US Canada UK Australia Belgium ...

GAMES       Video MUDs Gambling ...
SCIENCE     Engineering Psychology Physics ...

HEALTH      Fitness Medicine Diseases ...
SHOPPING    Autos Clothing Directories ...

HOME        Kids Houses Consumers ...
SOCIETY     People Religion Issues ...

NEWS        Online Media Newspapers ...
SPORTS      Baseball Football Skiing ...

Category Bootstrapping

• Scanned URLs mentioned in newsgroups to find seed sites for the corresponding directory category

• This yielded something that looked pretty good at a casual glance

• …but a lot of the original seed URLs were bad sites or placed in the wrong category

• The first editor in a category simply had to delete or move the bad entries, which left behind a good category
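
The seeding step can be sketched roughly as follows. This is a guess at the general shape, written in Python rather than the project's Perl; the regular expression, the function name `seed_categories`, and the example.com URLs are illustrative assumptions, and only the three group-to-category pairs come from the mapping slide.

```python
import re
from collections import defaultdict

# A few newsgroup-to-category pairs in the style of the mapping slide.
GROUP_TO_CATEGORY = {
    "alt.security": "Computers/Security",
    "comp.ai": "Computers/AI",
    "comp.ai.fuzzy": "Computers/AI/Fuzzy",
}

# Deliberately loose URL pattern; an assumption, not the original matcher.
URL_RE = re.compile(r"https?://[^\s<>\"')]+")


def seed_categories(posts):
    """posts: iterable of (newsgroup, body) pairs. Returns {category: [urls]}."""
    seeds = defaultdict(list)
    for group, body in posts:
        category = GROUP_TO_CATEGORY.get(group)
        if category is None:
            continue  # no mapped category for this newsgroup
        for url in URL_RE.findall(body):
            url = url.rstrip(".,;")  # drop trailing sentence punctuation
            if url not in seeds[category]:
                seeds[category].append(url)
    return dict(seeds)


posts = [
    ("comp.ai.fuzzy", "Good intro at http://example.com/fuzzy. Enjoy"),
    ("alt.security", "Mirror: http://example.com/sec and http://example.com/sec"),
    ("rec.unmapped", "http://example.org/ignored"),
]
for category, urls in seed_categories(posts).items():
    print(category, urls)
```

Nothing here judges site quality, which matches the slide: the output only has to look plausible until the first editor cleans the category up.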

Coding & Launch

• Coded from April-June, 1998

• Perl CGI and flat files

• Simple HTML forms to add/edit/delete websites in the directory

• Web pages served from static HTML files in a directory tree

• HTML files regenerated whenever an edit was made
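
The regenerate-on-edit approach might look something like the sketch below. It is a Python illustration, not the original Perl; the function name, the `index.html` file name, and the page markup are all assumptions.

```python
import html
import os
import tempfile


def regenerate_category_page(root, category, sites):
    """Rewrite <root>/<category>/index.html from the category's site list.

    sites: list of dicts with "url", "title" and "description" keys.
    """
    page_dir = os.path.join(root, *category.split("/"))
    os.makedirs(page_dir, exist_ok=True)
    items = "\n".join(
        '<li><a href="{}">{}</a> - {}</li>'.format(
            html.escape(s["url"], quote=True),
            html.escape(s["title"]),
            html.escape(s["description"]),
        )
        for s in sites
    )
    page = "<html><body><h1>{}</h1>\n<ul>\n{}\n</ul></body></html>\n".format(
        html.escape(category), items
    )
    path = os.path.join(page_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)
    return path


# Demo in a throwaway directory; an edit handler would call this after saving.
with tempfile.TemporaryDirectory() as root:
    sites = [{"url": "http://www.newhoo.com/", "title": "NewHoo!",
              "description": "The largest human-edited directory of the web"}]
    path = regenerate_category_page(root, "Computers/Internet/Web_Directories", sites)
    print(open(path, encoding="utf-8").read())
```

Because readers only ever fetch static files, the CGI code runs on edits rather than on every page view.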

Simple Flat File Format

u: http://www.newhoo.com/
t: NewHoo!
d: The largest human-edited directory of the web
c: Computers/Internet/Web_Directories
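
A record in this format is easy to parse. The Python sketch below assumes one `key: value` field per line and reads `u`, `t`, `d`, `c` as URL, title, description, and category, which is inferred from the sample rather than stated in the slides; the long field names are illustrative.

```python
# Assumed meaning of the one-letter keys, inferred from the sample record.
FIELDS = {"u": "url", "t": "title", "d": "description", "c": "category"}


def parse_record(text):
    """Parse one "key: value" record into a dict with descriptive names."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so the colon in "http://" survives.
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            record[FIELDS[key]] = value.strip()
    return record


sample = """\
u: http://www.newhoo.com/
t: NewHoo!
d: The largest human-edited directory of the web
c: Computers/Internet/Web_Directories
"""
record = parse_record(sample)
print(record["url"])
print(record["category"])
```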

Minimalist Design

• Minimal locking, last-writer-wins semantics

  – flock() only used for category counts

• Write-with-append and rename() were the only operations considered safe

• No big database

• A few DBM files for minor stuff
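
The three primitives named above (append, rename(), and flock() for counts) can be sketched in Python as follows. The original was Perl, so this is an illustration of the pattern rather than the project's code; the helper names are assumptions, and `fcntl` is Unix-only.

```python
import fcntl
import os
import tempfile


def replace_file(path, data):
    """Write data to a temp file, then rename it over `path`.

    A rename within one filesystem is atomic on POSIX, so readers see either
    the old file or the new one; concurrent writers are last-writer-wins.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(data)
    os.replace(tmp, path)


def append_line(path, line):
    """Append one line; append mode positions each write at end of file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line.rstrip("\n") + "\n")


def bump_count(path, delta=1):
    """Read-modify-write a counter file under an exclusive flock()."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            count = int(f.read().strip() or 0) + delta
            f.seek(0)
            f.truncate()
            f.write(str(count))
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return count
```

The counter is the one place where a lost update would be visibly wrong, which is presumably why it alone gets a lock; everywhere else, losing a race just means the later edit wins.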

Coding & Launch

• Used publicly-available software for keyword search of the directory: Originally Glimpse, later Isearch

• First ran on BSDI, later moved to Linux

  – filesystem progression: ufs, ext2, vxfs

• Launched June 5, 1998

• Acquired by Netscape in October, 1998

http://www.wired.com/news/print/0,1294,13625,00.html

Early Press was Key to Growth

• About 1% of the visitors to NewHoo applied to become editors

• Some fraction of those would be accepted

• The more traffic we got, the more editors we would get

• We grubbed around for any hits we could in the beginning

• Initial Slashdot, Netly, Wired, Red Herring stories were vital traffic sources

• No matter what the story said, “Just spell our URL right”

Social Design of NewHoo

• Not a free-for-all links page - every editor had to apply & be approved

• Every edit logged and possible to undo

• Hierarchy of editors, with senior ones keeping an eye on the new ones

• Emergent editing guidelines, enforced with peer review
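
Logging every edit so that it can be undone amounts to recording the old value alongside the new one. The Python sketch below is a minimal illustration of that idea; the class name `EditLog`, its in-memory storage, and the editor names are assumptions, and the real log format is not described in the slides.

```python
class EditLog:
    """Append-only log of edits; each entry stores the value it replaced."""

    def __init__(self):
        self.entries = []

    def apply(self, directory, editor, url, field, new_value):
        """Set directory[url][field], logging who changed what and from what."""
        site = directory.setdefault(url, {})
        self.entries.append(
            {"editor": editor, "url": url, "field": field,
             "old": site.get(field), "new": new_value}
        )
        if new_value is None:
            site.pop(field, None)  # None means "field absent"
        else:
            site[field] = new_value

    def undo(self, directory, index, editor="undo"):
        """Revert entry `index` by applying (and logging) the reverse edit."""
        entry = self.entries[index]
        self.apply(directory, editor, entry["url"], entry["field"], entry["old"])


directory = {}
log = EditLog()
log.apply(directory, "new_editor", "http://example.com/", "title", "Exmaple Site")
log.apply(directory, "new_editor", "http://example.com/", "title", "Spam Spam Spam")
log.undo(directory, 1, editor="senior_editor")
print(directory["http://example.com/"]["title"])  # Exmaple Site
```

Recording the undo as a new entry keeps the history complete, so a senior editor's revert is itself reviewable.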

Why Did You Apply to be a NewHoo Editor?

“There is a link to my old warwick uni account that has been dead for two years. As editor I could change it.”

Why Did You Apply to be a NewHoo Editor?

I’m already building Linux indexes and sites, better to have them all nicely integrated in computers/software/linux

Why Did You Apply to be a NewHoo Editor?

We already maintain a site called CoinLink which lists over 800 coin related sites. We know the coin industry and could easily assist in building and maintaining this section of the index.

Why Did You Apply to be a NewHoo Editor?

You have no category in Recreation/Collecting that focuses on Christmas ornament collecting. Ornament collecting is one of the fastest growing hobbies. I've collected ornaments for 25 years and feel I know many of the "best" web sites dealing with this subject.

Motivations to Edit

• Same urge that makes you straighten a crooked picture you see on the wall

• People were maintaining link lists on their own manually; they could do so more easily with NewHoo’s web forms

• Didn’t need to see the whole directory finished to have their category be useful

• …but knowing they were helping to build the pyramid was a warm fuzzy

Directory Editing is Amenable to Incremental Effort

• First editor finds a good site and adds it

• Second fixes a typo in the description

• Third editor moves it to a more appropriate category

• Fourth editor later notices the site moved and fixes the URL

• Not as hard as writing device drivers; many can help

• If you ask too much, results fall off quickly

The Free Use License

• Netscape offered the data from the ODP under a free-use license

• Directory data was adopted by Lycos, AltaVista, Google and other search engines

• Only requirement was that the Add URL link point back to dmoz.org

  – helped keep dmoz authoritative & prevent forks

GnuHoo -> NewHoo -> ODP

• FSF objected to the “Gnu”

• Yahoo objected to the “Hoo”

• Netscape renamed it to the Open Directory Project and hosted it on directory.mozilla.org

• directory.mozilla.org was too long to type, so we shortened it to dmoz.org

Robozilla

• Lloyd Tabb wrote a crawler to visit every site in the ODP to see if it was 404/301/302

• Didn’t take action on its own, but alerted editors to potentially bad or moved sites

• Brought bad sites in the ODP down to 0.25%

• Our crawl of Yahoo showed 8% bad links
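
The flag-but-don't-act behavior described above can be sketched like this. It is a Python illustration, not Robozilla's actual code; the function names, the flag labels, and the canned status table are assumptions, and the HTTP fetch is passed in as a callable so the example runs without network access.

```python
def classify(status):
    """Map an HTTP status code to a flag for editors, or None if no flag."""
    if status == 404:
        return "dead"
    if status in (301, 302):
        return "moved"
    return None


def check_links(urls, fetch_status):
    """Return (url, flag) pairs for editor review; never edits the directory.

    fetch_status: callable taking a URL and returning its HTTP status code.
    """
    report = []
    for url in urls:
        flag = classify(fetch_status(url))
        if flag is not None:
            report.append((url, flag))
    return report


# Canned responses stand in for real HTTP requests in this demo.
fake_statuses = {
    "http://example.com/": 200,
    "http://example.com/old": 301,
    "http://example.com/gone": 404,
}
for url, flag in check_links(list(fake_statuses), fake_statuses.get):
    print(flag, url)
```

Leaving the decision to a human editor matters because a redirect or a temporary error does not always mean the listing is wrong.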

“That’s a Problem We Want to Have”

• Design decisions were made in the interest of expediency. Why invest more time in the infrastructure if the site never takes off?

• Still running much of the 1.0 code today, over 4 years later

• Zillions of flat files in a gigantic VXFS filesystem

• Were we wrong? No, I don’t think so.

The ODP Won

• 55,000 total editors, probably 10,000 active

• 3.4M sites, 460K categories

• Largest human-created taxonomy ever

• Several times larger than competitors

• Cited in 83 academic research papers (source: citeseer.nj.nec.com)

The ODP “Won”

Everyone uses :-)

…but directories no longer scale to the web for users:

  – small web: use a directory

  – big web: use keywords

“Lost Ark” Ending?

• The traffic & validation provided by Netscape was key to the ODP’s success

• Possible future: lost server in an ops farm

• What new idea can take the ODP to the next level?