Page 1: Internet and WWW  Services

© Copyright 1997, The University of New Mexico M-1

Internet and WWW Services

• Security
• Types of Services
• Vended versus Internally Provided
• Costs and Benefits
• Servers and Clients
• Potential Problems
• Stats


General Network Security

• Isolated Servers
• Restricted Subnets
• Firewalls
• Proxy Servers


WWW Application Security

• OS Level
• Server Level
• Program Level


Types of WWW Services

• Static Data
• Server Search Engines
• Dynamic Data
• Server Applications
• Java Enabled


Vended

• Which Vendor

• How Much Do They Do

– HTML

– Graphics

– Design & Layout

– Programming

• Bandwidth

– Total

– Dedicated


Internally Provided WWW Server

• For whom?
• How many services, how much traffic?
• For what use (scope the server)?


Cost of a WWW Service

• Server Usage
• Disk Space
• Network Bandwidth
• Router or LAN Load
• Application Development with Limited Capabilities
• Application Development with Limited Standardization


Benefits

• High-touch, High Impact Narrow-casting
• Kiosks
• Fast, Simple Apps From Central Server
• Built-in Protocols
• Potentially Large Installed Client Base


Shopping List

• Server Machine and O/S
• Network Access
• WWW Server
• WWW Client
• Server Programming Tools
• Data and/or Databases


Which Server Platform?

• Unix
• NT


Which Server?

• CERN
• Microsoft
• Netscape - Communications or Commerce
• O’Reilly
• WebForce
• Oracle WebServer


Client Compliance Level

• HTML 2.0
• HTML 3.0
• Netscape Enhancements
• Java
• Lynx (Text Browser)


CGI-BIN Risks

• Dangerous Programs or Scripts
• User-supplied Programs or Scripts
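One common way a script becomes dangerous is by passing user-supplied form input to a shell or a file path without checking it. The sketch below shows an allowlist check in Java; the class name `CgiInput`, the permitted character set, and the 64-character limit are illustrative assumptions, not something the slides specify.

```java
import java.util.regex.Pattern;

// Hypothetical helper: validates a single form value before a
// server-side program uses it in a file name or command argument.
class CgiInput {
    // Allowlist (an assumption): letters, digits, dot, underscore, hyphen; 1 to 64 characters.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z0-9._-]{1,64}");

    // True only if the whole value matches the allowlist and contains no ".." sequence.
    static boolean isSafeToken(String value) {
        return value != null
                && SAFE.matcher(value).matches()
                && !value.contains("..");
    }
}
```

Rejecting everything outside a small known-good set is usually easier to reason about than trying to list every dangerous character.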


Robots and Other Network Creatures

• Problems with “Automated Agents”
• Deterring Robots
• Reacting to Robots


WWW Server Stats

[Charts: WWW Accesses per Week for January, February, and March, split into UNM and Outside accesses. Pie chart: UNM 35%, Outside 65%.]



Web Mining

Web based information extraction


Why the Web (web = web browser)

• Ubiquitous
– Web browsers are on every desktop, every PC, Mac, workstation, and terminal.
• Platform independence
– Use of Java and server-side programs means clicking on a button does the same thing everywhere.


[Diagram of web mining concepts: Natural Language, News Services, Multidimensional Vectors, Markov Objects, ID3, Word Frequency, Data Warehouse, Data Cleansing, Data Compression, Text Mining, Factorial Analysis, Keyword Search, Decision Trees, Tri-Grams, Tri-Letter Sets, Hidden Information, Hypothesis Verification]


[Flow diagram with the stages: Data, Cleansed Data, Extracted Data, Display Results; one branch is labeled "N"]


What Kind of Data?

• Usenet News
– Most places have multiple gigabytes of news
• System accounting files
– Can tell who is doing what, when
• Misc. Web pages
– A variety of interesting information
• Listserver or public system email
– We keep email concerning system problems


Cleansing Data

• News article
– NNTP fields
– signatures
• Web Page
– HTML codes
– descriptions of links to other sites
– pattern fields (headers and trailers that appear on every page at the site)
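The cleansing steps listed above can be sketched in Java roughly as follows. This is a minimal illustration, not the presenters' implementation: the class and method names are hypothetical. It assumes the header block ends at the first blank line and that signatures begin at a conventional "-- " delimiter line. It does not attempt to detect link descriptions or site-wide pattern fields.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class DataCleanser {
    // Matches anything that looks like an HTML tag, e.g. <TABLE> or </A>.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Drop the header block (everything up to the first blank line)
    // and the signature (everything from a "-- " delimiter line on).
    static String cleanNewsArticle(String article) {
        StringBuilder body = new StringBuilder();
        boolean inHeaders = true;
        for (String line : article.split("\n", -1)) {
            if (inHeaders) {
                if (line.trim().isEmpty()) {
                    inHeaders = false; // blank line ends the headers
                }
                continue;
            }
            if (line.equals("-- ") || line.equals("--")) {
                break; // signature delimiter: ignore the rest
            }
            body.append(line).append('\n');
        }
        return body.toString().trim();
    }

    // Replace HTML tags with spaces, then collapse runs of whitespace.
    static String stripHtml(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }
}
```

A regular expression is a crude way to strip markup and will mishandle comments or a ">" inside an attribute value; a real HTML parser would be the more robust choice.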


Mining for data

• Test hypothesis
• Look for hidden information
• Find other similar information


Display of Information

• Graphical
• Text Listing
– Directories: human-maintained categories (e.g., recreation, computers, finances, arts)
– Computer-generated list
• Customized
– User-defined defaults
– Cookie-defined defaults


Data and Services

[Flow diagram with the stages: Learning to use services, Learning to extract data from the answer, Compile and clean data, Display Results; one branch is labeled "N"]


What Services?

• Search Engines
• Internet White Pages
– (information on individuals)
• Internet Yellow Pages
– (information on corporations)
• Usenet News repositories
• Online libraries
• Online periodicals


Learning to use Services

• Sample sets of data
– can derive a format if taught to.
• Machine learning (same as in Data Mining)
– look at every interpretation, find the one that conveys the most information.


Learning to interpret answers

• What format is information given in?
• What do the fields mean?
– Can identify unknown fields by matching the data with known information.
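As a rough illustration of identifying unknown fields by matching against known information, the Java sketch below labels each field of a service's answer with the attribute of an already-known record whose value it equals. The class name `FieldMatcher`, the exact-match rule, and the sample attributes are assumptions made for the example. Real agents would need fuzzier matching and many sample records.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: guesses the meaning of fields in a service's answer.
class FieldMatcher {
    // For each answer field, find the attribute of a known record whose value
    // it equals (ignoring case and surrounding whitespace).
    // Returns field index -> attribute name; unmatched fields are omitted.
    static Map<Integer, String> labelFields(String[] answerFields,
                                            Map<String, String> knownRecord) {
        Map<Integer, String> labels = new HashMap<>();
        for (int i = 0; i < answerFields.length; i++) {
            String value = answerFields[i].trim();
            for (Map.Entry<String, String> known : knownRecord.entrySet()) {
                if (known.getValue().equalsIgnoreCase(value)) {
                    labels.put(i, known.getKey());
                    break;
                }
            }
        }
        return labels;
    }
}
```

For instance, if a white-pages lookup for a person whose phone number is already known returns that number in its first field, the first field can be labeled "phone" for later queries.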


Compile and Clean Data

• Redundancies
• Duplicates
• Newer information has precedence
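One way to read "newer information has precedence" is to keep, for each subject, only the most recently gathered fact. The Java sketch below does this. The `Fact` and `FactMerger` classes, and the use of a numeric timestamp to decide which fact is newer, are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record of one extracted piece of information.
class Fact {
    final String key;      // what the fact is about, e.g. a person's name
    final String value;    // the extracted information
    final long timestamp;  // when it was gathered; larger means newer

    Fact(String key, String value, long timestamp) {
        this.key = key;
        this.value = value;
        this.timestamp = timestamp;
    }
}

class FactMerger {
    // Collapse duplicates and redundancies: keep one fact per key,
    // preferring the newest. Keys keep their first-seen order.
    static List<Fact> merge(List<Fact> facts) {
        Map<String, Fact> newest = new LinkedHashMap<>();
        for (Fact f : facts) {
            Fact seen = newest.get(f.key);
            if (seen == null || f.timestamp > seen.timestamp) {
                newest.put(f.key, f);
            }
        }
        return new ArrayList<>(newest.values());
    }
}
```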


Security

• Server environment
– Use trusted CGI scripts and server side includes
• Client environment
– Restrict access by IP number or domain
– Restrict access by password
• Internet
– encrypt data (PGP)
– Certification authority
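Restricting access by IP number or domain comes down to comparing the client's address or host name against an allowed list. The Java sketch below is a minimal illustration with hypothetical names; the prefixes and domains in the usage note are documentation placeholders, not real UNM values. In practice this check is normally configured in the web server rather than written by hand.

```java
// Hypothetical access check, for illustration only.
class AccessRule {
    // Allow a client if its dotted-quad address starts with an allowed
    // network prefix, or its host name is in (or under) an allowed domain.
    // Prefixes should end with "." so that "192.0.2." does not match "192.0.20.7".
    static boolean isAllowed(String ip, String host,
                             String[] ipPrefixes, String[] domains) {
        for (String prefix : ipPrefixes) {
            if (ip.startsWith(prefix)) {
                return true;
            }
        }
        String h = host.toLowerCase();
        for (String domain : domains) {
            String d = domain.toLowerCase();
            if (h.equals(d) || h.endsWith("." + d)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `AccessRule.isAllowed("192.0.2.15", "pc.example.edu", new String[] {"192.0.2."}, new String[] {"example.edu"})` accepts the client, while a host named `evilexample.edu` on another network is refused because the domain comparison requires a leading dot.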


[Flow diagram: a decision "Data is in database?" with Y and N branches, connected to "Checking for hidden information" and "Machine Learning"]


Article: 52151 of comp.lang.perl.misc
Path: lynx.unm.edu!pr1.plk.af.mil!tesuque.cs.sandia.gov!sloth.swcp.com!news.ironhorse.com!op.net!news.mathworks.com!enews.sgi.com!news.sgi.com!mr.net!news.mid.net!sbctri.tri.sbc.com!newspump.wustl.edu!newsfeed.rice.edu!rice!add
From: [email protected] (Arthur Darren Dunham)
Newsgroups: comp.lang.perl.misc,comp.infosystems.www.authoring.html
Subject: Re: WWW: web site "pre-processor" in perl ?
Date: 31 Oct 1996 00:20:06 GMT
Organization: Rice University
Lines: 23
Message-ID: <[email protected]>
References: <[email protected]> <[email protected]> <[email protected]> <[email protected]>
NNTP-Posting-Host: pecos.is.rice.edu
Xref: lynx.unm.edu comp.lang.perl.misc:52151 comp.infosystems.www.authoring.html:111886

In article <[email protected]>, Clay Shirky <[email protected]> wrote:
>
>Au contraire. HTML _is_ broken, relative to, say, SGML, but if you are
>careful with your tags and comment carefully, your data can be derived
>from your HTML files, not v-v.
>
>find . -name '*html' -exec perl -p -i.bak -e
> 's#(<body[^>]*bgcolor="?)oatmeal("?[^>]*>)#$1skyblue$2#i;' {} \;

or if you wanted perl to do all the work, rather than have find(1)
launch N perl executables for each .html files, you could do this....

find . -type f -name '*html' -print | xargs perl -p -i.bak -e 's#(<body[^>]*bgcolor="?)oatmeal("?[^>]*>)#$1skyblue$2#i;'

That way, perl happily iterates through all the lines in all the files
since we don't care which file we're in when we do the substitution.

-- 
Darren Dunham [email protected] Sysadmin Rice University
(This line currently in revision) Houston, TX
Any resemblance between real opinions and my post is coincidental


<HTML>
<HEAD><TITLE>Information gathering</TITLE></HEAD>
<BODY>
<TABLE><TR>
<TH><IMG SRC="info.gif"></TH>
<TH><font size="+3">Information Gathering</font><BR>
Just some sample text which might or might not be worthless.
You'd want to sort out which of this was just HTML tags and other
worthless junk and which was meaningful.</TH>
</TR></TABLE>
<P><CENTER><H2>Links to</H2>
<A HREF="/sameplace/otherinfo"> A link to something on this site </A>

<A HREF="/otherplace/otherinfo"> A link to something on another site </A>

</BODY></HTML>


Re: Scots and English Gregory J Dalley, 30 May 1995, Lines: 18.
Re: Dutch and English accents Phil Rose, 15 Jun 1995, Lines: 28.
Re: ANY SIL'rs out there? A.K.A. Summer Institute of Linguistics. yomomma, 16 Jun 1995, Lines: 6.
Re: ANY SIL'rs out there? A.K.A. Summer Institute of Linguistics. yomomma, 16 Jun 1995, Lines: 6.
Conferences, Seminars-info wanted chris bowen, Mon, 03 Jul 1995, Lines: 7.
AIGH? Coby (Jacob) Lubliner, 8 Jul 1995, Lines: 8.
"Shall" and "Will" in Welsh English [email protected], Wed, 19 Jul 95, Lines: 14.
careers in linguistics scharle, 10 Sep 1995, Lines: 8.
job opportunities in computational linguistics? Sonny Xuan Vu, 30 Sep 1995, Lines: 14.
Re: job opportunities in computational linguistics? Miss Sarah Tiller, Wed, 4 Oct 1995, Lines: 27.
Re: What Is Singapore English? Zhong Qiyao, 11 Dec 1995, Lines: 28.
Re: What Is Singapore English? Chew Kim Swee Andrew, 14 Dec 1995, Lines: 41.
Re: What Is Singapore English? Pota alok Ashwin, 16 Dec 1995, Lines: 45.
Re: How to write in English ... Ann Weiner, Tue, 2 Jan 1996, Lines: 13.
Re: What Is Singapore English? Wing Luk, 7 Jan 1996, Lines: 27.
Linguistics Careers lebitz,stacey b, 23 Jan 1996, Lines: 14.
English Teaching Offering in China - offer2.doc [1/1] XIAOJUN ZHANG, 24 Jan 1996, Lines: 240.
TRYING TO PROTECT YOUR WORK? prepaid, Sun, 04 Feb 1996, Lines: 1.
Give me, please, one program for learn to speak english!! Please!! "Eugen I. Ivanov", 20 Feb 1996, Lines: 1.
Re: The English "R" for Germans Joerg Settemeyer, 8 Mar 1996, Lines: 5.
English Tutor Needed. Mua Tran, 23 Mar 1996, Lines: 20.
Re: old form of shorthand Fido, 1 Apr 1996, Lines: 9.
Re: Math as pornography Gordon Fitch, 17 May 1996, Lines: 7.
Re: Chain Shift Charles Lieberman, 26 Jul 1996, Lines: 10.
Re: Tendency of Inflections to Disappear - Why? Terrence Griffin, 28 Jul 96, Lines: 1.
Re: Concerning the number of esperantists Marc Bonnaud, Fri, 09 Aug 1996, Lines: 14.
Re: Concerning the number of esperantists Cheradenine Zakalwe, Fri, 9 Aug 1996, Lines: 16.
Re: Concerning the number of esperantists Alan Gould, Sat, 10 Aug 1996, Lines: 22.
Re: Concerning the number of esperantists Don HARLOW, Sun, 11 Aug 1996, Lines: 21.
Re: Kiom da E-istoj *ne* regas la anglan? Andrew McConnell, Fri, 30 Aug 1996, Lines: 19.
cohesion in CMC Per-Mikael Jansson ENGE, 22 Oct 1996, Lines: 10.

Articles from sci.lang selected through webSOM


Limitations of the Web

• Some functionality/specialization was given up for ubiquity

• Transfer time
– Mass data transfer prohibitive

• External to machine
– Reliance on network

• Not inherently as secure as staying home


Why Data Mining

• There is a lot of data of unknown worth and purity
• Data mining uses the same underlying procedures as other knowledge discovery/data extraction systems


Automatic Customization to user preferences

• Web pages
– Hotwired autoconfigs based on what you surf to
• News services
– usenet service custom.roy-corey.1
• Information display paradigm
– industry report style
– collegiate style
– Microsoft style


Methods for gathering data

• Extraction from documents
– data mining
– keyword searches
– similarity searches
• Extraction from services
– ILA: internet learning agents
– Softbots
– Metacrawler
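Keyword and similarity searches over documents can both be built on word-frequency vectors. The Java sketch below uses cosine similarity between word counts as one common choice of measure. The slides do not say which measure was used, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for keyword and similarity searches.
class Similarity {
    // Count how often each lowercase word occurs in the text.
    static Map<String, Integer> wordFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine of the angle between two word-frequency vectors: close to 1.0
    // for identical word distributions, 0.0 when no words are shared.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // an empty document is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Keyword search is the simplest case: does the text contain the word?
    static boolean containsKeyword(String text, String keyword) {
        return wordFrequencies(text).containsKey(keyword.toLowerCase());
    }
}
```

Ranking a collection by `cosine` against a sample document gives a basic "find other similar information" search; a keyword search only tests membership.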


Data mining on the web?

• Transfer rate too slow to transfer most databases whenever you want

• Computation too intensive to let others mine your database whenever they want

• So: Use pre-collected data or pre-indexed database


Java -- What is it?

• Programming Language
• Java Compiler
• Java Interpreter (Java Virtual Machine)
• For creating applets which run inside a browser
• For creating applications (stand alone programs)


Java Application Source Code

//
// Sample HelloWorld application
//
class HelloWorldApp {
    public static void main(String args[]) {
        System.out.println("Hello World!");
    }
}


Java Applet Source Code

//
// Sample HelloWorld applet
//
import java.awt.Graphics;
import java.applet.Applet;

public class HelloWorld extends Applet {
    public void paint(Graphics g) {
        g.drawString("Hello world!", 25, 25);
    }
}


How could you use it?

• Client applets or applications
• Server code
• Portable code
• Create via Developer Tools


Developer Tools

• Visual C++ (Visual Java?)
• Symantec
• Sun
• SGI - Cosmo Code


Developer Tools

• SourceCraft
• Powersoft - Fusion
• Quintessential Objects - Diva for Java (Javaside)
• Roguewave - JFactory


Advantages

• Object Oriented and event-driven
• Portable* bytecode
• Multi-threaded
• Integrated Network Abilities
• Built-in Multimedia Capabilities
• “Robust and Secure”


Drawbacks

• Few deployed clients
• Very C++-like
• Not yet stabilized
• Very few Developer Tools
• Not all the class libraries exist (yet)


Class Structure

Class java.applet.Applet

java.lang.Object
   |
   +----java.awt.Component
           |
           +----java.awt.Container
                   |
                   +----java.awt.Panel
                           |
                           +----java.applet.Applet


Security

• OS security in applications
• “No Pointers” and no user memory management
• Compile-time and Run-time checking
• Client Data Security
– No access to disk from Netscape
– Directory-based security in HotJava


Security

• Network Security
– No Applets
– No Access
– Applet Host
– Firewall
– Any Host


Security Problems

• CERT 96.05 - Firewall Security
– ftp://info.cert.org/pub/cert_advisories/CA-96.05.java_applet_security_mgr
• CERT 96.07 - Bytecode Verifier
– ftp://info.cert.org/pub/cert_advisories/CA-96.07.java_bytecode_verifier


Alternative Options

• Visual Basic and browsers
• Visual Basic separate from WWW
• Web Server without Java


Books About Java

• Teach Yourself Java in 21 Days
• Java!
• Hooked On Java
• Presenting Java
• O’Reilly


Java WWW Sites

• Sun
– http://java.sun.com/
• The Internet Programming Page
– http://www.apexsc.com/vb/internet.html
• Rogue Wave Home Page
– http://www.roguewave.com/
• Symantec Café
– http://cafe.symantec.com/cafe/index.html


Java WWW Sites

• JavaSoft
– http://www.javasoft.com/
• The Java Directory (Gamelan)
– http://www.gamelan.com/
• IBM: Centre for Java Technology
– http://www.hursley.ibm.com/javainfo/
• News: comp.lang.java