Text Retrieval and Spreadsheets Session 4 LBSC 690 Information Technology

Page 1

Text Retrieval and Spreadsheets

Session 4
LBSC 690

Information Technology

Page 2

Agenda

• Questions

• Text retrieval

• Spreadsheets

Page 3

Document Retrieval

• Lots of applications
  – Chasing down citations in papers you read
  – Web search engines
  – Managing your personal files

• Two basic approaches
  – Explicit queries (“information retrieval”)
  – “Watch what I do” (“adaptive filtering”)

Page 4

Ways of Finding Text

• Searching metadata
  – Using controlled or uncontrolled vocabularies

• Free text
  – Characterize documents by the words they contain

• Social filtering
  – Exchange and interpret personal ratings

Page 5

“Exact Match” Retrieval

• Find all documents with some characteristic
  – Indexed as “Presidents -- United States”
  – Containing the words “Clinton” and “Peso”
  – Read by my boss

• A set of documents is returned
  – Hopefully, not too many or too few
  – Usually listed in date or alphabetical order
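A minimal sketch of exact-match retrieval over free text, assuming a toy in-memory collection. The three document strings and the names `tokenize` and `exact_match` are hypothetical and do not come from the course materials. The function returns the set of documents that contain every required word, with no ranking.

```python
import re

def tokenize(text):
    """Lowercase a string and return the set of alphabetic words in it."""
    return set(re.findall(r"[a-z]+", text.lower()))

def exact_match(docs, required_words):
    """Return the ids of documents that contain ALL of the required words."""
    required = {w.lower() for w in required_words}
    return sorted(doc_id for doc_id, text in docs.items()
                  if required <= tokenize(text))

# Hypothetical toy collection, for illustration only.
docs = {
    1: "Clinton discussed the peso crisis with advisers.",
    2: "Clinton visited Maryland.",
    3: "The peso fell sharply.",
}
matches = exact_match(docs, ["Clinton", "Peso"])
print(matches)
```

A document either satisfies the condition or it does not, so the result is an unordered set; sorting by id here stands in for the date or alphabetical listing mentioned above.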

Page 6

Ranked Retrieval

• Put most useful documents near top of a list
  – Put possibly useful documents lower in the list

• No need to exclude any documents
  – Just list those least likely to be useful last

• Two basic techniques
  – Similarity-based
  – Probability-based

Page 7

Similarity-Based Retrieval

• Assume “most useful” = most similar to query

• Weight terms based on two criteria:
  – Repeated words are good cues to meaning
  – Rarely used words make searches more selective

• Compare weights with query
  – Add up the weights for each query term
  – Put the documents with the highest total first
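The two weighting criteria can be sketched with a tf*idf weight, which is one common choice and an assumption here, since the slides do not give a formula. Term frequency rewards repeated words and the logarithm of inverse document frequency rewards rare ones. The toy documents and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and return its alphabetic words, with repeats."""
    return re.findall(r"[a-z]+", text.lower())

def rank(docs, query):
    """Rank document ids by the summed tf*idf weight of the query terms."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, tf in counts.items():
        total = 0.0
        for term in set(tokenize(query)):
            # Document frequency: how many documents use this term at all.
            df = sum(1 for c in counts.values() if term in c)
            if df:
                # Repeats raise the weight; widely used terms lower it.
                total += tf[term] * math.log(n_docs / df)
        scores[doc_id] = total
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical toy collection, for illustration only.
docs = {
    1: "retrieval retrieval of text",
    2: "text about spreadsheets",
    3: "spreadsheets and budgets",
}
ranking = rank(docs, "text retrieval")
print(ranking)
```

No document is excluded: document 3 shares no query terms, so it simply lands at the bottom of the list.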

Page 8

Example: Coordination Measure

Documents:

1: Nuclear fallout contaminated Montana.
2: Information retrieval is interesting.
3: Information retrieval is complicated.

Term           1  2  3
nuclear        1  0  0
fallout        1  0  0
siberia        1  0  0
contaminated   1  0  0
interesting    0  1  0
complicated    0  0  1
information    0  1  1
retrieval      0  1  1

Query: recall and fallout measures for information retrieval
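As a sketch, the coordination measure can be computed by counting how many distinct query terms appear in each document. The code below uses the three documents and the query exactly as written on this slide; the function names are illustrative.

```python
import re

def tokenize(text):
    """Lowercase a string and return the set of alphabetic words in it."""
    return set(re.findall(r"[a-z]+", text.lower()))

def coordination(doc_text, query):
    """Count the distinct query terms that occur in the document."""
    return len(tokenize(doc_text) & tokenize(query))

docs = {
    1: "Nuclear fallout contaminated Montana.",
    2: "Information retrieval is interesting.",
    3: "Information retrieval is complicated.",
}
query = "recall and fallout measures for information retrieval"

scores = {doc_id: coordination(text, query) for doc_id, text in docs.items()}
print(scores)  # {1: 1, 2: 2, 3: 2}
```

Document 1 matches only “fallout”, while documents 2 and 3 each match “information” and “retrieval”, so they tie ahead of document 1.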

Page 9

Some Search Engines to Try

• Images
  – http://altavista.com (select images)

• Audio
  – http://www.musclefish.com (select demos)

Page 10

What’s a Spreadsheet?

• Large table containing numbers
  – May also contain labels to aid interpretation
  – Columns are named with LETTERS
  – Rows are named with NUMBERS
  – Cells are named like A4, C1, ...

• Some cells are automatically calculated
  – Formula specified when spreadsheet is created
  – Values are recalculated continuously
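A toy model of automatically calculated cells, offered as a sketch rather than a description of how Excel works internally. The cell names A1 to A3 and their contents are hypothetical. A formula is stored as a function that reads other cells, so its value reflects the current inputs each time it is read; real spreadsheets track dependencies instead of recomputing on every read.

```python
# Values are plain numbers; formulas are functions that read other cells.
sheet = {
    "A1": 100,
    "A2": 250,
    "A3": lambda get: get("A1") + get("A2"),  # plays the role of =A1+A2
}

def get(name):
    """Return a cell's value, evaluating its formula if it has one."""
    cell = sheet[name]
    return cell(get) if callable(cell) else cell

first = get("A3")    # computed from the original inputs
sheet["A1"] = 400    # change an input cell
second = get("A3")   # the formula cell follows the change
print(first, second)
```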

Page 11

How Spreadsheets are Used

• Record keeping (checkbook)

• Calculation (income tax)

• What-if analysis (cash flow)
  – Sensitivity analysis (exchange rate)

• Goal seeking (retirement planning)
  – Uses continuous recalculation (“iteration”)
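Goal seeking can be sketched as repeated recalculation: try an input, recompute the result, and adjust until the result reaches the target. The version below uses bisection, which is an assumption chosen for simplicity and not a claim about the method Excel uses. The retirement figures (30 years, 5% growth, a 1,000,000 target) are hypothetical.

```python
def future_value(annual_saving, years=30, rate=0.05):
    """Balance after depositing a fixed amount at the end of each year."""
    balance = 0.0
    for _ in range(years):
        balance = balance * (1 + rate) + annual_saving
    return balance

def goal_seek(f, target, low, high, tolerance=0.01):
    """Find x in [low, high] with f(x) near target, assuming f rises with x."""
    while high - low > tolerance:
        mid = (low + high) / 2
        if f(mid) < target:
            low = mid    # result too small: raise the input
        else:
            high = mid   # result large enough: lower the input
    return (low + high) / 2

needed = goal_seek(future_value, target=1_000_000, low=0, high=100_000)
print(round(needed, 2))
```

Each pass through the loop corresponds to one recalculation of the sheet with a new trial value in the input cell.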

Page 12

Spreadsheet Applications

• Originally designed for financial records

• Library applications
  – Budget
  – Collection development
  – Shelving capacity

• Educational applications
  – Grade records
  – Equipment inventory

Page 13

Excel Demo

• Start Excel
  – Microsoft Office folder

• Open N:\SHARE\CLASS\POSTCARD.XLS
  – File menu
  – N: is the volume labeled lbsc690c in Windows

• Enter your 1999 (desired) income in cell B3
  – Tax due is displayed in cell B4

Page 14

Excel Demo

• Change the tax due
  – Place the cursor over B4
  – Type “=B3*0.x”

• “=” tells Excel this is a formula

• “B3” refers to the number in cell B3

• The “x” in “0.x” should reflect your political views
  – 0.5 would take away half your money
  – Try different values in cell C3

• What kind of spreadsheet use is this?

Page 15

Excel Demo

• Add itemized deductions
  – Highlight row 4 (click on 4)
  – Select “Rows” in “Insert” menu twice
  – Label A4 as “Deduction amount”
  – Label A5 as “Taxable income”
  – Put the appropriate formula in B5
  – Change the formula in B6 as needed

• Note how it was copied from B4 with changes
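The slide leaves the formulas to the reader. One plausible reading, stated here as an assumption, is that B5 holds =B3-B4 (income minus deduction) and that B6 applies the tax rate to B5 instead of B3. The Python sketch below mirrors that reading; the function names and the sample figures are hypothetical.

```python
def taxable_income(income, deduction):
    """Assumed B5 formula: income (B3) minus the deduction amount (B4)."""
    return income - deduction

def tax_due(income, deduction, rate):
    """Assumed B6 formula: the rate applied to taxable income (B5)."""
    return taxable_income(income, deduction) * rate

# Hypothetical figures: 40,000 income, 5,000 deduction, 25% rate.
print(tax_due(40_000, 5_000, 0.25))  # 8750.0
```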

Page 16

Excel Demo

• Limit the deduction
  – Maximum of 50% of income or 10,000

• Search for help on “maximum” and “minimum”

• Replace the formula in B5 with a more complicated one
  – You can use another cell to show a partial result
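The wording of the limit is open to more than one reading. The sketch below assumes that the allowed deduction may exceed neither 50% of income nor 10,000, so the smaller of the two acts as the ceiling. That interpretation, the function name, and the sample figures are assumptions. Excel's MIN and MAX functions play the role of Python's `min` and `max` here.

```python
def allowed_deduction(income, claimed, cap=10_000, share=0.5):
    """Limit the claimed deduction by both the cap and the share of income."""
    limit = min(share * income, cap)    # the partial result: the ceiling
    return max(0, min(claimed, limit))  # never negative, never above the ceiling

print(allowed_deduction(40_000, 15_000))  # 10000: the 10,000 cap binds
print(allowed_deduction(12_000, 8_000))   # 6000.0: 50% of income binds
print(allowed_deduction(40_000, 3_000))   # 3000: below both limits
```

Computing `limit` on its own line corresponds to using another cell to show a partial result.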

Page 17

When Style is Important

• Too complex to visualize at once
  – Size
  – Relationships between formulas

• Used by more than one person
  – Includes use in presentations and papers

• Used for a long time
  – Essentially communicating to yourself

Page 18

Style Guidelines

• Organization
  – Depict the solution approach visually
  – Group things where possible (e.g., parameters)
  – Build in cross-checks to discover input errors

• Readability
  – Describe the computation
  – Meaningful labels help a lot
  – Minimize clutter

Page 19

Building Complex Applications

• Computers keep track of detail well
  – But people don’t

• Adopt meaningful abstractions
  – Organize a calculation the way you think

• Use a structured process
  – Examples: waterfall and spiral models

Page 20

Waterfall Model

• Five steps
  – Identify requirements
  – Develop a detailed specification
  – Design the spreadsheet
  – Implement the spreadsheet
  – Test the spreadsheet

• Team project is based on a waterfall model
  – Specification, Test Plan, and User Manual

Page 21

Spiral Model

• Build a prototype to solve part of the problem
  – Don’t worry about efficiency at this point

• Use what you learn to build another prototype
  – Either more complete or more efficient

• Repeat until the prototype does what you want

Page 22

Lessons Learned

• Large projects need both models
  – Waterfall model helps identify subtasks
  – The first try is usually not right

• Most common mistake is not starting over
  – It seems easier to keep refining a prototype
  – But that won’t ever fix design-level problems

• Rule of thumb: double every estimate!

Page 23

Summary

• Retrieval exploits human-machine synergy
  – Machines are fast, but simple
  – Humans are sophisticated, but slow

• Spreadsheets can make calculation easy
  – Easily modified to add new calculations
  – Need to design complex spreadsheets carefully