LIS650 lecture 1

Major HTML

Thomas Krichel

2004-10-02

structure

• It's not just about HTML
  – web
  – web server
  – markup
  – XML
  – HTML
• Fairly general but abstract
• Probably the toughest lecture in the course

literature

• I work from the text of the official standard at http://www.w3.org/TR/html4/

• To work with it faster, I made a copy at http://wotan.liu.edu/~krichel/html4/

• You can work from any HTML book.
• The W3C is the standard-making body for the Web. Anything that they say is the standard.
• But some people don't behave according to the standard.

The world wide web

The World Wide Web (Web) is a network of information resources. The Web relies on three mechanisms to make these resources readily available to the widest possible audience:
– A uniform naming scheme for locating resources on the Web (i.e. URIs).
– Protocols, for access to named resources over the Internet (e.g., http).
– Hypertext, for easy navigation among resources (e.g., HTML).

URI introduction

• Every resource available on the Web -- HTML document, image, video clip, program, etc. -- has an address that may be encoded by a Uniform Resource Identifier, or "URI".

• URIs typically consist of three pieces:
  – The name of the mechanism used to access the resource, or to otherwise “resolve” it.
  – The name of the machine hosting the resource.
  – The name of the resource itself, given as a path.

example URI

• http://openlib.org/home/krichel

This URI may be read as follows: There is a document available via the HTTP protocol, residing on the site openlib.org, accessible via the path "/home/krichel".

• mailto:krichel@openlib.org

This URI may be read as follows: There is email user krichel in a domain openlib.org to whom email may be sent.
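
The two readings above can also be done by a program. What follows is a minimal sketch in Python, assuming the standard library module urllib.parse; the function name describe and the dictionary keys are made up for illustration.

```python
from urllib.parse import urlsplit

def describe(uri):
    """Split a URI into the three pieces discussed above."""
    parts = urlsplit(uri)
    return {
        "mechanism": parts.scheme,  # how to access or resolve the resource
        "machine": parts.netloc,    # host name; empty if the scheme has none
        "path": parts.path,         # the name of the resource itself
    }

web = describe("http://openlib.org/home/krichel")
mail = describe("mailto:krichel@openlib.org")
print(web)
print(mail)
```

The mailto URI has no machine piece in this split: everything after the colon is treated as the path.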

Internet application protocols

• On the Internet, machines use different application-level protocols to do things.
• Common protocols include
  – http
  – dns
  – telnet
  – smtp
  – ssh
  – ftp
• All of the ones cited are client/server protocols
  – client issues a request
  – server gives a response
• Each of them uses a different default port. A port is a number that tells the machine which service should handle the incoming stream of data.

http

• The web operates mostly on http, the hypertext transfer protocol.
• The client software runs on the local PC that you are using. It is called
  – a web browser (not politically correct)
  – a user agent (that's better)
• Our server is a piece of hardware called wotan.liu.edu, “wotan” for short.
  – It runs the Debian GNU/Linux operating system on an Intel architecture.
  – It provides http daemon software that serves http requests. The particular software is called Apache.

main features of http

• http is insecure. The contents of http transactions (requests/responses) can be observed.
• http is stateless. Each transaction is self-contained and has no relationship to the previous one.
• http has a limited vocabulary of requests and responses. It is no good, say, for operating a machine remotely.
• We therefore cannot use it to work on the server.
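
To make "stateless" and "limited vocabulary" more concrete, here is a small sketch in Python. It only builds the text of a request and looks up two standard response phrases; nothing is sent over the network. The helper name build_get_request and the path /~user/tryout.html are assumptions for illustration; http.client.responses is part of the Python standard library.

```python
from http.client import responses

def build_get_request(host, path):
    """Return the text of a minimal HTTP/1.1 GET request (nothing is sent)."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # the request names its host every single time
        "Connection: close",
        "",                      # an empty line ends the request headers
        "",
    ]
    return "\r\n".join(lines)

request_text = build_get_request("wotan.liu.edu", "/~user/tryout.html")

# The vocabulary of responses is a fixed list of numeric codes with phrases.
status_phrases = {code: responses[code] for code in (200, 404)}

print(request_text)
print(status_phrases)
```

Because each request carries everything the server needs, the server does not have to remember the previous one. That is what stateless means here.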

working with a remote machine

• There are two traditional ways to work with a remote machine
  – issue commands to it
    • used to be done with “telnet”
  – transfer files to and from it
    • used to be done with “ftp”
• Telnet and ftp servers are not available on wotan.liu.edu. Telnet and ftp do not encrypt the communication stream. Therefore they are not secure.

communication with wotan

• The protocol that we use for communicating with the server is the secure shell, ssh for short. It is based on public-key cryptography.
• There are two PC programs commonly used as ssh clients
  – putty for issuing commands
  – winscp for file transfer
• winscp is the one we will use. It offers a range of other facilities besides file transfer.
• Mac users should investigate a program called “fugu”.

registration time

• As part of the course, you are being provided with web space on the server wotan.liu.edu, at the URL

http://wotan.liu.edu/~username

where username is a user name that you will choose now.

• It is my intention to maintain this web space for you into the foreseeable future.
• You should also choose a password, now.
• I will now register you.

free software

• I maintain the wotan.liu.edu server, but you can build your own server if
  – you have Internet access
  – you have an old PC to spare
• All the server software, as well as putty and winscp, is free and open-source.
• It is one of my fundamental beliefs that free information should run on free software.
• The library community can learn a hell of a lot from the free software community.
• See my talk at http://openlib.org/home/krichel/presentations/new_york_2003-11-07.ppt

installing winscp

• http://winscp.sourceforge.net/eng/download.php has
  – “installation package”, for use if you have administrator rights on the machine where you are installing
  – “application”, for use otherwise, i.e. to just download and run the application
• At installation time, when/if asked about the default interface, I suggest you use “Windows explorer style” rather than the default “Norton commander style”. You can change that later, so no panic.

other stuff: installing “user agents”

• Download and install a recent version of at least two browsers. I suggest
  – Mozilla Firefox at http://www.mozilla.org/products/firefox/
  – Netscape Navigator at http://channels.netscape.com/ns/browsers/download.jsp
  – Opera at http://www.opera.com

open a wotan session

• start winscp
• the host name is “wotan.liu.edu”
• give your user name
• click on “save”; after “ok”, this will save the session
• you will be led to the list of saved sessions
• double click to open the session
• Note:
  – you can save the password as part of the session
  – it is risky to do that in a public classroom

initial remote files on wotan

• a set of files starting with a dot.
  – These are places where Linux Masters exert their black magic.
  – Leave them alone.
• a directory called public_html
  – This is the place where web masters exert their magic. You can go into that directory to see the files that you have on your web site at the moment.
  – There should be two files
    • empty.html
    • validated.html

public_html

• Imagine you are user user and you have a file file in public_html.
• The web server will map requests to http://wotan.liu.edu/~user/file to show the file public_html/file.
• Here user stands for your user id, file is the file name, and “/” is the directory separator.
• If file ends with “.html” or “.htm” the web browser will be told that the file is an HTML file. It will be rendered accordingly by the browser.

index.html

• The web server on wotan will map requests to http://wotan.liu.edu/~user to show the file public_html/index.html

• If this file is not there, the server will prepare an HTML document from the list of files that it finds in the directory and send it to the user agent.
• Once you have a file index.html, the web user no longer gets that list of the files in your directory. Each file can still be requested by its own URL.
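
The mapping rules for public_html and index.html can be written down as a small function. This is only a sketch of the behaviour described on the slides, not of how Apache is implemented. The names map_request and is_html are made up, and a Python set stands in for the real directory.

```python
def map_request(url_path, files_in_public_html):
    """Map a request path such as '/~user/file' to what the server shows.

    files_in_public_html is a set of file names standing in for the
    contents of the user's public_html directory.
    """
    if not url_path.startswith("/~"):
        raise ValueError("not a request for a user's web space")
    user, _, name = url_path[2:].partition("/")
    if name:
        # A specific file was requested: serve public_html/file.
        return user, "public_html/" + name
    if "index.html" in files_in_public_html:
        # A request for the directory itself shows index.html if present.
        return user, "public_html/index.html"
    # Otherwise the server generates a listing of the directory.
    return user, "generated listing: " + ", ".join(sorted(files_in_public_html))

def is_html(name):
    """Files ending in .html or .htm are announced as HTML to the browser."""
    return name.endswith((".html", ".htm"))

initial = {"empty.html", "validated.html"}
print(map_request("/~user/tryout.html", initial))
print(map_request("/~user", initial))
print(map_request("/~user", initial | {"index.html"}))
```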

HTML and XHTML

• HTML is the hypertext markup language.
• HTML is a markup language that is widely used on the Web.
• The latest, and probably last, version of HTML is at http://www.w3.org/TR/html4/
• The W3C, the standard-making body for the Web, has issued XHTML, a replacement of HTML that is compatible with XML.

• We will work with XHTML.

SGML HTML XML

• You will probably have come across these terms.
• SGML was developed first. HTML and XML are developed from SGML in different ways.
  – HTML is defined by an SGML DTD
  – XML is a restricted form (a profile) of SGML

• One common thing here is the ML. It stands for Markup Language.

• Markup is everything in a document that is not content.

• (something to scratch your head about)

procedural/descriptive

• Markup can be given in two ways
• 1: Procedural
  – Codes identify point size, style, font, etc.
  – Usually only understood by the tool that defines them
  – Example: Microsoft Word
• 2: Descriptive
  – Describes purpose of text within the document
  – Chapter head, Paragraph, Section Head, TOC
  – Structure and Style are kept separate
  – Example: LaTeX, SGML

SGML

• Standard Generalized Markup Language
• Descriptive approach with three separate layers
  – structure: types of information in document
  – content: the information itself
  – style: defines how to typeset the document

• Developed for the publishing industry by a group around Goldfarb.

• So complicated that no software implements it fully.

• But an important idea that remains of it is the document type definition.

Document Type Definition (DTD)

• A DTD is written in its own declaration syntax, which looks different from the tagged text of a document. It describes a class of SGML documents.

• Describes information the document handles, e.g.
  – title
  – chapter
• Relationships between fields, e.g.
  – a chapter contains sections

• Consistency and logical structure

XML

• Since SGML is so complicated, it is not good for use on the Web.

• So the W3C has issued XML, the eXtensible markup language.

• Every XML document is SGML, but not the opposite.

• Thus XML is like SGML but with many features removed.

XML elements

• XML is based on elements. There are basically three ways of writing an element.
• The first way is to write <name/>
• Here name is the name of the element.
• Example:
  – <bang/>

• Such an element is called an empty element. Here its name is “bang”.

non-empty elements

• If name is the name of the element, you can give the element some contents, here called contents, by writing <name>contents</name>.
• Examples:
  – <greeting>bonjour</greeting>
  – <greeting>здравствуйте</greeting>
  – <sentence>She says <greeting>hello</greeting> to you.</sentence>

• In fact <name/> is just a shortcut for <name></name>.
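
One way to check these claims is to hand the examples to an XML parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The empty element, written in the short and in the long way.
empty_short = ET.fromstring("<bang/>")
empty_long = ET.fromstring("<bang></bang>")

# An element with contents, one of which is another element.
sentence = ET.fromstring(
    "<sentence>She says <greeting>hello</greeting> to you.</sentence>"
)
greeting = sentence.find("greeting")

# To the parser, both spellings of the empty element look the same:
# same name, no text, no children.
same_empty_element = (
    empty_short.tag == empty_long.tag
    and empty_short.text is None
    and empty_long.text is None
    and len(empty_short) == len(empty_long) == 0
)

print(same_empty_element)
print(repr(sentence.text), repr(greeting.text), repr(greeting.tail))
```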

attributes to elements

• Elements can have attributes. Here is an element with two attributes

• <name attribute_name_one="value_one" attribute_name_two="value_two"/>

• Here attribute_name_one and attribute_name_two are attribute names and value_one and value_two are attribute values. The element itself is empty.

• Example: <greeting language="french">bonjour</greeting>

more on attributes

• An element cannot have two attributes with the same name.
• Attribute values are simple strings. You can not have an element inside an attribute value.

• Attribute names are separated from their values by the = sign.

• Attribute values can be enclosed in single or double quotes. It does not matter. Double quotes are more common, so I suggest you use those.
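
The rules on attributes can be tried out with a parser as well. This is a sketch assuming Python's standard xml.etree.ElementTree module; the helper parses and the repeated language attribute are made up for the test.

```python
import xml.etree.ElementTree as ET

# Single and double quotes around the value give the same result.
double_quoted = ET.fromstring('<greeting language="french">bonjour</greeting>')
single_quoted = ET.fromstring("<greeting language='french'>bonjour</greeting>")

def parses(text):
    """Return True if the parser accepts text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Two attributes with the same name on one element are rejected.
duplicate_ok = parses('<greeting language="french" language="english"/>')

print(double_quoted.attrib, single_quoted.attrib, duplicate_ok)
```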

XML document

• An XML document is a piece of data that is written in XML.

• But sometimes the author of a document makes a mistake, and, in fact the XML is wrong in some ways.

• If there is no mistake, the document is called well-formed.

• If a document is not well-formed, it really is not an XML document.

some rules for well-formedness

• There must be one single element at the top level of the document.
  – It is called the root element.
  – It may be preceded by a prolog (stuff before the root element).
  – All other elements are nested inside the root; those directly inside it are its children.
  – Whitespace that surrounds the root element is ignored.
• All elements must be properly nested. You can only close the outer element after all inner elements are closed. Examples
  – <a><b></a></b> not well-formed
  – <a><b></b></a> well-formed
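
A parser refuses documents that break these rules, so it makes a handy well-formedness checker. This is a minimal sketch assuming Python's standard xml.etree.ElementTree module; the function name is_well_formed is made up.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, i.e. is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

samples = [
    "<a><b></a></b>",  # inner element closed after the outer one
    "<a><b></b></a>",  # properly nested
    "<a/><b/>",        # two top-level elements, so no single root
    "  <a/>\n",        # whitespace around the root element
]
for sample in samples:
    print(repr(sample), is_well_formed(sample))
```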

other stuff: comments

• In an XML document, you can make comments about your code. These are notes to yourself.

• Comments start with <!--
• Comments end with -->
• Example: <!-- this is a comment -->
• Comments can not be nested.
• They can appear anywhere in the document outside other markup, i.e. not inside a tag.

other stuff: XML declaration

• The XML declaration is a special line that says that what follows is XML and gives some very basic information about that XML. It is trendy to use it.

• It is optional, but if it is there it has to be on the first line.

• You will need to have an XML declaration if your character encoding is not UTF-8. We will come back to this point later.

other stuff: XML declaration

• Normally the XML declaration looks like
• <?xml version="1.0" encoding="encoding"?>
• where encoding is the character encoding. By default, the character encoding is UTF-8, so if you use that, you do not need to mention it.
• There is now a version “1.1” of XML around, but
  – it is not widely deployed
  – it is not much different from version 1.0
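
The effect of the encoding part of the declaration can be shown with a document saved in ISO-8859-1. This is a sketch assuming Python's standard xml.etree.ElementTree module; the sample word "café" and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# "café" saved as ISO-8859-1: the é becomes the single byte 0xE9.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")

declared = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<greeting>" + latin1_bytes + b"</greeting>"
)
undeclared = b"<greeting>" + latin1_bytes + b"</greeting>"

# With the declaration, the parser knows how to read the bytes.
text_with_declaration = ET.fromstring(declared).text

# Without it, the parser assumes UTF-8, and a lone 0xE9 is not valid UTF-8.
try:
    ET.fromstring(undeclared)
    undeclared_parses = True
except ET.ParseError:
    undeclared_parses = False

print(text_with_declaration, undeclared_parses)
```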

other stuff: document type declaration

• XML documents, like any SGML documents, accept document type declarations.

• A document type declaration tells us something about the vocabulary of elements and attributes used in the document.

• It should appear before the root element, after the XML declaration, if you have one.

• It takes the form <!DOCTYPE mumbojumbo >
• We will come back to the document type declaration later.

HTML

• HyperText Markup Language
• HTML is defined by an SGML DTD. It has elements for
  – Head, Title, Body, Paragraph, etc.
  – Headings, Bold, Italic, etc.
  – Table, List, Image, etc.
  – Links to other documents
  – Forms
  – and many others

HTML history

• HTML was a very bare-bones language when first invented by Tim Berners-Lee. It did not describe pages with much visual appeal.
• In the 90s, successful browsers invented “extensions” that aimed to stretch the visual boundaries of HTML.
• Some of these extensions found their way into the official HTML spec issued by the W3C.
• Later the W3C developed style sheets as a way to accommodate display requirements without having to extend HTML.

HTML versions

• HTML 4.01 is the last version of HTML. This version comes with several DTDs, including
  – the loose DTD
  – the strict DTD
• I only cover the elements of the strict DTD.
• The loose DTD has more elements, but all the functionality of these elements is best done with style sheets.

• Thus, the pages created with HTML only will look rather boring.

• But we do cover style sheets later.

XHTML

• XHTML is HTML written in an XML syntax.
• Every XHTML document has to be well-formed XML.
• Non-XHTML HTML documents can violate some well-formedness constraints, for example
  – HTML element names are not case sensitive
  – some HTML elements do not need closing
  – there is no need for a single root element in an HTML document

XHTML: pain without gain?

• In this course we study XHTML.
• When I say HTML in the following, I mean XHTML.
• Reasons to study XHTML rather than HTML
  – syntactic rules of XML are easier to understand.
  – any tool that can work with XML can be applied to XHTML, but can not be applied to HTML.
  – in general XML documents are more computer understandable. This is crucial in the age of the search engine.

Example HTML snippet

<a href="http://openlib.org/home/krichel" title="homepage of Thomas Krichel">Thomas Krichel</a>

– the whole thing is an <a> element. It creates an anchor. (I use < and > to surround element names.)
– “href” is an attribute name
– “http://openlib.org/home/krichel” is the value of the "href" attribute (I surround attribute names with straight quotes)
– 'Thomas Krichel' is character data.

Characters: concept

• A character set combines two things
  – Character repertoire: a set of characters, e.g. "A", "ض", "‼", "₣"
  – Character code positions: defines a number for each character in the repertoire.

• Character encoding is a way to encode the code positions in bytes

• To correctly display a document, the user agent needs to know both!
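
The difference between a code position and an encoding can be seen with one of the sample characters. This is a sketch in Python; the variable names are made up, and the choice of UTF-8, UTF-16 and ISO-8859-1 as encodings to compare is an assumption for illustration.

```python
char = "\u0636"  # the Arabic letter from the repertoire examples above

# The code position is a number, independent of any encoding.
code_position = ord(char)

# The same code position becomes different bytes in different encodings.
utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-be")

# Reading the bytes back needs the right encoding ...
right = utf8_bytes.decode("utf-8")
# ... a user agent that guesses wrong shows rubbish instead.
wrong = utf8_bytes.decode("iso-8859-1")

print(hex(code_position), utf8_bytes, utf16_bytes, right, wrong)
```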

playing safe with characters

• Only use the characters on the US keyboard, don't insert symbols.

• Save as ASCII or UTF-8. All ASCII files are also UTF-8 files.

• Never save as "Unicode" within MS Notepad.
• If you encounter a character that is not on your keyboard, use an SGML entity.
• The SGML entity is the last special SGML thing that we have to study.

SGML entities

• SGML entities are a way to represent non-ASCII characters when only ASCII input is possible.
• Codes can be of the form &code;
  – Ex. &eacute;
    • Inserts an e with acute accent.
  – this is called a character entity
  – Codes are often abbreviations of the character names
• Codes can also be numeric, in decimal or hex form
  – Ex. &#38; (decimal) or &#x26; (hex) to insert an ampersand
  – this is called a numeric entity
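
To see what an entity stands for, you can let a program resolve it. This is a sketch assuming the unescape function of Python's standard html module. The spelling &#x26; is the hexadecimal form of the same code position as &#38; and is added here for comparison.

```python
from html import unescape

named = unescape("&eacute;")     # character entity, by name
decimal = unescape("&#38;")      # numeric entity, decimal form
hexadecimal = unescape("&#x26;") # numeric entity, hex form
phrase = unescape("je suis Fran&ccedil;ais")

print(named, decimal, hexadecimal, phrase)
```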

XHTML entities

• They are officially defined in three files that are maintained by the W3C
  – http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
  – http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
  – http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
• A sample line is

<!ENTITY ccedil "&#231;"> <!-- latin small letter c with cedilla, U+00E7 ISOlat1 -->

• <!ENTITY is DTD speak for defining an entity
• it is followed by the character form and the numeric form of the entity
• the rest of the line is a comment, of course

entities used in XML

• There are three that you need to know and use.
  – &lt; stands for <
  – &gt; stands for >
  – &amp; stands for &
• Every time you want to insert <, > or & in the documents, you have to use the entities instead.
• Examples:
  – krichel&#64;openlib.org
  – je suis Fran&ccedil;ais
  – Marks &amp; Spencers
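
If your text comes out of a program, the replacement of <, > and & need not be done by hand. This is a sketch assuming the escape and unescape functions of Python's standard xml.sax.saxutils module; the second sample string is made up.

```python
from xml.sax.saxutils import escape, unescape

# escape() replaces the three special characters with their entities.
shop = escape("Marks & Spencers")
comparison = escape("1 < 2 & 3 > 2")

# unescape() goes the other way.
round_trip = unescape(comparison)

print(shop)
print(comparison)
print(round_trip)
```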

another look at empty.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
 <head>
  <title></title>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
 </head>
 <body></body>
</html>

empty.html dissected

• the <!DOCTYPE ... > is an SGML document type declaration. It says that the document contains XHTML of the “strict” flavor.

• The document type declaration is the only thing that we have in the prolog. We could have placed an XML declaration before it but chose not to do so.

• <html> is the root element. It contains some other elements. Some of these we discuss now, others later.
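
Since empty.html is well-formed XML, an XML parser can take it apart the same way. This is a sketch assuming Python's standard xml.etree.ElementTree module; the text of empty.html is copied into a string, and the variable names are made up.

```python
import xml.etree.ElementTree as ET

EMPTY_HTML = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
 <head>
  <title></title>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
 </head>
 <body></body>
</html>"""

# The document type declaration is the prolog; the parser skips over it
# and returns the root element.
root = ET.fromstring(EMPTY_HTML)

children = [child.tag for child in root]
head_children = [child.tag for child in root.find("head")]

print(root.tag, children, head_children)
```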

the <html> element

• It is the root element of an XHTML document.
• It has required children <head> and <body>.
• It has two optional attributes
  – the "dir" attribute says in which direction the contents are rendered. The classic value is "ltr"; "rtl" is also valid.
  – the "lang" attribute says in which language the contents are written. Use ISO 639 language codes, optionally followed by a country code, e.g. lang="en-us"
  – these two attributes are known as the internationalization (i18n) attributes.

• Example: <html lang="en-us"> … </html>

the <title> element

• This is a required child of <head>.
• It defines the title of the document.
• It takes the i18n attributes.
• Example

<html><head lang="en-us"><title>A fine limerick</title></head><body><div>There was a young friar named Tuck</div></body></html>

• It must not contain other HTML tags.

usability concerns with <title>

• The title is used by the user agent in a special manner
  – as bookmark default title
  – as the title for a window in which the user agent runs
• Google uses the title as anchor text to your web page.
  – It is a crucial ad for your page
  – Google may truncate the title.
• Bad ideas for titles
  – section 1
  – home page

the <body> element

• This encloses the contents of the page as opposed to its header.

• Validation requires one and only one body.
• It takes the i18n attributes, as well as some others. These fall into another group of attributes we call “core attributes”.

• We will study those core attributes now.

core attributes: "id"

• This attribute assigns a name to an element.
• This name must be unique in a document. In the <body> element, this requirement is superfluous, of course.
• The "id" attribute has several roles in HTML, including
  – As a style sheet selector
  – As a target anchor for hypertext links

core attributes: "class"

• The "class" attribute is a friend of the "id" attribute.
• It assigns one or more class names to an element. Class names are separated by white space. The element may be said to belong to these classes. A class name may be shared by several elements.

• The "class" attribute has several roles in HTML, but it is most useful as a style sheet selector, when you want to assign style information to a set of elements.

Example for "class" and "id"

<p class="limerick" id="limerick_1">

There was a young man from Peru<br/>

Whose limericks stopped at line two.</p>

<p>OK, that's a stupid limerick. Let us look at another</p>

<p class="limerick" id="limerick_2">

There was a young man from Japan<br/>

Whose limericks would never scan<br/>

And when they asked why<br/>

He said it is because I<br/>

Try to put as many words into the last line as I possibly can.</p>
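
The sketch below shows how a program can pick out elements by "class" and by "id", which is roughly what a style sheet selector does. It assumes Python's standard xml.etree.ElementTree module. The wrapping <div>, the shortened limerick text, the second class name "special" and the helper names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the example, wrapped in a <div> to get a single root.
# The class name "special" is hypothetical; it shows two classes on one element.
SNIPPET = """<div>
<p class="limerick" id="limerick_1">There was a young man from Peru<br/>
Whose limericks stopped at line two.</p>
<p>OK, that's a stupid limerick. Let us look at another</p>
<p class="limerick special" id="limerick_2">There was a young man from Japan<br/>
Whose limericks would never scan</p>
</div>"""

root = ET.fromstring(SNIPPET)

def by_class(tree, name):
    """All elements whose white-space separated class list contains name."""
    return [el for el in tree.iter() if name in el.get("class", "").split()]

def by_id(tree, ident):
    """The element with the given id, or None if there is none."""
    for el in tree.iter():
        if el.get("id") == ident:
            return el
    return None

limerick_ids = [el.get("id") for el in by_class(root, "limerick")]
special_ids = [el.get("id") for el in by_class(root, "special")]
second = by_id(root, "limerick_2")

print(limerick_ids, special_ids, second.tag)
```

A class can match many elements, an id at most one.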

core attributes: "title"

• The "title" attribute sets a title for the element.
• There is no prescribed way in which the title is rendered by a user agent.
• Sometimes it is shown as a tool tip, i.e. something that flashes up when the mouse is rolled over it.

• Example:

<a href="http://wotan.liu.edu/home/krichel" title="Thomas Krichel's homepage at wotan">Thomas Krichel</a>

core attributes: style

• Use the "style" attribute to give style information to a particular element.

• This will be discussed in more detail when we do the style sheets.
• Usually there are better ways to attach style information than writing it onto every element. It is better to place the elements into a class by giving them the same "class" attribute, and then give style sheet information for the class.

• See validated.html for an example.

summary: core attributes

• To summarize, we have a group of core attributes.

• These attributes can be used with almost all elements.

• There are other attributes that can be almost universally used, called "event attributes". They have to do with scripting and are therefore not studied in this course.

block-level vs text-level elements

• Block-level elements contain data that is aligned vertically by visual user agents.
• Text-level elements are aligned horizontally by visual user agents.
• The reasons behind this distinction
  – Block-level elements can contain other block-level elements and text-level elements.
  – Text-level elements can not contain block-level elements.
  – Visual user agents start a new line at the beginning of block-level elements.
  – Multidirectional text would be impossible without it.

the <div> and <p> elements

• The <div> element allows you to set arbitrary block-level divisions in your document.
• It takes the core attributes.
• RULE: put all your contents that is vertically aligned into a <div>.
• The <p> element is like <div> but it signals the start and end of a paragraph.

the <br/> element

• is used to create a line break.
• Note its emptiness!
• It has the "clear" attribute that can take the values "none", "left", "right" and "all". This prevents textual contents from flowing around other, floating content. The attribute is deprecated, i.e. allowed in the loose DTD but not the strict one.

The <span> element

• This is another element for arbitrary divisions, but it operates on inline content. This is content that is laid out horizontally in lines, rather than block-level content, which is stacked vertically.
• Admits core attributes.
• Put things in a <span> that belong together in a line.

<span> example

A worse poet however was J<span class="r">enny</span>.<br/>
Her limericks weren’t worth a P<span class="r">enny</span><br/>
Though the invention was s<span class="r">ound</span><br/>
She always f<span class="r">ound</span><br/>
That, whenever she tried to write <span class="r">any</span><br/>
She always had one line too m<span class="r">any</span>.<br/>

abstraction ends here

• Up until now, we have covered a lot of abstract elements and attributes that do not achieve much visual impact.
• Instead, they
  – point the style sheet to where things are
  – create a semantic design

• We will now turn to more physical descriptions.

try it out

• right-click empty.html in your winscp window.
• you will see the option to duplicate the file.
• duplicate it, say, to “tryout.html” by entering the new name.
• right-click tryout.html and choose edit.
• open a user agent at http://wotan.liu.edu/~user/tryout.html where user is your user name. You should be able to see your changes, as last saved.

the <a> element I

• opens a hyperlink. The contents of the element is the anchor text. It is limited to text-level content, which may include an image.
• "href" attribute has the target URL
• "hreflang" has the language of the target
• "type" attribute gives the MIME-type of the target
• Some other attributes for which we have no use
  – coords
  – shape
  – accesskey
  – tabindex
• and of course, <a> takes the core attributes

the <a> element II

• It takes the "rel" attribute to specify the relationship between the current document and the link target, as well as the "rev" attribute to specify the reverse.
  – This is not currently well supported by the browsers.
  – I will come back to these relational attributes when discussing the <link> element.
• Ex: <a href="http://openlib.org/home/krichel">a nice man</a>.

linking within a document

• If the "id" attribute of an element in a document at a URL URL is set to id, you can make the element the target of a link.
• You use the URL URL#id for this purpose.
• If the document linked to is the current document, you don’t need to reference its URL; #id alone is enough.
• example: <a href="http://openlib.org/home/krichel#joke">joke</a> links to the element with id "joke" in Thomas Krichel's homepage.

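How a URL with an id is taken apart, and how a bare #id is resolved against the current document, can be sketched in Python. The sketch assumes the standard library module urllib.parse; the variable names are made up.

```python
from urllib.parse import urldefrag, urljoin

# Split URL#id into the document URL and the id of the target element.
document_url, fragment = urldefrag("http://openlib.org/home/krichel#joke")

# Inside the document itself, href="#joke" is enough: the user agent
# resolves it against the URL of the current document.
resolved = urljoin("http://openlib.org/home/krichel", "#joke")

print(document_url, fragment, resolved)
```
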
the <img> element I

• makes an image.
• "src" attribute says where the image is
• "alt" attribute gives a text to show for user agents that do not display images. It may be shown by the user agents as the user highlights the image. It is limited to 1024 characters.
• "longdesc" attribute gives the URL of a longer description of the image, for cases where "alt" is too short.
• Example: <img src="thomas_krichel.jpg" alt="picture of Thomas Krichel"/>

the <img> element II

• "width" attribute gives the user agent a suggestion for the width of the image.

• "height" attribute gives the user agent a suggestion for the height of the image

• both can be expressed
  – in pixels, as a number
  – as a percentage of the current display width

• of course <img> supports the core attributes.

HTML checking

• validated.html has some additional code (as compared to empty.html) that we can now understand.

<p>

<a href="http://validator.w3.org/check?uri=referer">

<img style="border: 0pt"

src="http://wotan.liu.edu/valid-xhtml10.png"

alt="Valid XHTML 1.0!" height="31"

width="88" />

</a></p>

• click on the icon to check your code. That's cool!

header elements

• Headers <h1> to <h6>
• Simple form of text formatting
• Vary text size based on the header’s level.
• Actual size of text of header element is selected by browser.
• Results can vary significantly between user agents.
• All take the core attributes.

<hr/> element

• creates a horizontal rule
• admits the core attributes
• other attributes have been deprecated, i.e. are allowed in the loose DTD but not the strict one.

contents-based style elements

• <abbr> encloses abbreviations
• <acronym> encloses acronyms
• <cite> encloses citations
• <code> encloses computer code snippets
• <dfn> encloses things being defined
• <em> encloses emphasized text
• <kbd> encloses text typed on a keyboard
• <samp> encloses literal samples
• <strong> encloses strong text
• <var> encloses variables

all admit the core attributes

physical style elements

• <b> encloses bold contents
• <big> encloses big contents
• <small> encloses small contents
• <i> encloses italics contents
• <sub> encloses subscripted contents
• <sup> encloses superscripted contents
• <tt> encloses typewriter-style contents

all admit the core attributes

the <pre> element

• encloses contents that is to be rendered with the characters and line breaks just like in the source text. Markup is still allowed, but elements that do spacing should not be used, obviously.

• It takes the core attributes and a "width" attribute setting the number of characters per line.

<blockquote> and <q> elements

• <blockquote> quotes a paragraph
• <q> makes a short quote inside a paragraph
• both take a "cite" attribute whose value is the URL of the source of the quote.
• They also take the core attributes.

list elements

• <ol> creates an ordered list.
  – <li> encloses each item
• <ul> creates an unordered list.
  – <li> encloses each item
• <dl> encloses a definition list
  – <dt> encloses the term that is being defined
  – <dd> encloses the definition

• All take the core attributes and the i18n attributes.
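
The nesting of the list elements is easy to see when a program builds them. This is a sketch assuming Python's standard xml.etree.ElementTree module; the helper names make_list and make_definition_list and the sample items are made up for illustration.

```python
import xml.etree.ElementTree as ET

def make_list(kind, items):
    """Build an <ol> or <ul> element with one <li> child per item."""
    if kind not in ("ol", "ul"):
        raise ValueError("kind must be 'ol' or 'ul'")
    parent = ET.Element(kind)
    for item in items:
        ET.SubElement(parent, "li").text = item
    return parent

def make_definition_list(pairs):
    """Build a <dl> element with a <dt>/<dd> pair per (term, definition)."""
    dl = ET.Element("dl")
    for term, definition in pairs:
        ET.SubElement(dl, "dt").text = term
        ET.SubElement(dl, "dd").text = definition
    return dl

unordered = ET.tostring(make_list("ul", ["web", "markup"]), encoding="unicode")
definitions = ET.tostring(
    make_definition_list([("DTD", "document type definition")]),
    encoding="unicode",
)

print(unordered)
print(definitions)
```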

http://openlib.org/home/krichel

Thank you for your attention!