

Context discovery with SHY

(Song Huiyao – 宋會要 )

Jieh Hsiang ( 項潔 )

National Taiwan University and

Academia Sinica

2012/12/07 PNC 2012, Berkeley

Joint work with

• Hsieh-Chang Tu ( 杜協昌 ), NTU

• Shih-Pei Chen ( 陳詩沛 ), Harvard

With special thanks to

• Cheng-yun Liu ( 劉錚雲 ) of IHP, Academia Sinica

• Peter Bol of Harvard University



Songhuiyao《宋會要》

• Huiyao ( 會要 ):

– Decrees and laws, usually collected throughout a dynasty

• Songhuiyao ( 宋會要 ): the huiyao of the Song Dynasty (960 – 1279 AD) and its most important government record

• The current version is only a remnant, extracted by Xu Song ( 清,徐松 ) around 1800 from the Yong-le Dadian ( 永樂大典 )



Songhuiyao《宋會要》

• 35,000,000 words in full text, 17 categories

• Full text done by the Institute of History and Philology (IHP) of the Academia Sinica and the China Biographical Database (CBDB) project of Harvard University

• Included in the Scripta Sinica of IHP

• Why another system?

– Songhuiyao is fragmented and very difficult to use

– Need a better way to re-contextualize the material



Introducing THDL

• Originally designed as a system for a Chinese corpus of full text historical documents related to Taiwan (thus the name THDL: Taiwan History Digital Library)

• Tailored for scholarly use with many special features



Key design philosophy of THDL


• Assumes that documents are related

• Treats a query result as a sub-collection of inter-related documents

• Provides ways to discover the collective meanings of a sub-collection

• Contexts (preserving old), contexts (creating new), contexts (observing different)



Features in THDL

• Main goal: provide ways to show collective meanings (contexts) of documents

– Multi-level classification of query results

– Term co-occurrence analysis

– GIS/time distributions

• Term extraction tools

• Text mining tools

• Annotation/correction tools


THDL as a shell

• THDL

– Taiwanese land deeds

– Ming Qing court documents

– Dan-Xin archives

• KMT (Nationalist Party) archives

• Taiwanese democratic magazines

• Songhuiyao ( 宋會要 ) (this talk)

• Qingshilu – Veritable Records of Qing ( 清實錄 ) (IP)

• Gujin tushu jicheng ( 古今圖書集成 ) and other leishu ( 類書 ) (IP), other smaller books

• Over 400,000,000 Chinese words, 1,000,000 metadata records, 2,000,000 images

XMLize the data

CBDB processed the data into 80,396 entries in Excel form, each with 7 fields: category, emperor, dates (4 fields), and full text
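
As a rough illustration of this step, the sketch below turns one spreadsheet row into an XML record with Python's standard library. The element names (`doc`, `category`, `emperor`, `date`, `text`), the names of the four date fields, and the sample row are all assumptions made for the example; the slides do not give the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical field layout: the slides say only "category, emperor,
# dates (4 fields), and full text"; the four date names are assumed here.
DATE_FIELDS = ("era", "year", "month", "day")

def row_to_xml(row: dict) -> ET.Element:
    """Build one <doc> element from a spreadsheet row given as a dict."""
    doc = ET.Element("doc")
    ET.SubElement(doc, "category").text = row["category"]
    ET.SubElement(doc, "emperor").text = row["emperor"]
    date = ET.SubElement(doc, "date")
    for name in DATE_FIELDS:
        # Missing date parts become empty attributes, so partial dates survive.
        date.set(name, row.get(name, ""))
    ET.SubElement(doc, "text").text = row["text"]
    return doc

# Made-up sample row, for illustration only.
sample = {
    "category": "瑞異", "emperor": "太宗",
    "era": "太平興國", "year": "2", "month": "6", "day": "",
    "text": "(full text of the entry)",
}
xml_string = ET.tostring(row_to_xml(sample), encoding="unicode")
print(xml_string)
```

Keeping the date parts as attributes of a single element is only one possible design; separate child elements would work equally well.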


XMLize the data

• Dates: use DDBC from Dharma Drum to convert the dates to the Western calendar (61,002 documents)

• Extract names for SHY

– 9,470 person names from CBDB (CBDB has 35,632 Song names)

– 3,366 official titles from CBDB

– 4,010 locations from CBDB

– Text-mined 11,901 additional potential names (estimated correctness: 33%)
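
One simple way to mark known names in the full text is dictionary lookup with longest match first. The sketch below assumes that approach; the slides do not say how SHY actually matches names. The function `tag_terms`, the tiny lexicon, and the sample sentence are invented for illustration.

```python
def tag_terms(text: str, lexicon: dict[str, str]) -> list[tuple[int, str, str]]:
    """Return (position, term, kind) for every lexicon term found in text,
    scanning left to right and preferring the longest match at each position."""
    max_len = max(map(len, lexicon), default=0)
    hits = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in lexicon:
                hits.append((i, candidate, lexicon[candidate]))
                i += n          # continue after the matched term
                break
        else:
            i += 1              # no term starts here; move one character on
    return hits

# Hypothetical lexicon standing in for the CBDB name lists.
lexicon = {"蘇軾": "person", "杭州": "location", "知州": "office"}
hits = tag_terms("以蘇軾知杭州", lexicon)
print(hits)
```

Longest match first matters when one listed term is a prefix of another; it does not resolve genuinely ambiguous strings, which is one reason text-mined candidates still need human checking.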


Features of SHY (1)

• Finding documents

– Full text search, plus logical operations

• Multiple contextual presentations of query results

• Term frequency and co-occurrence (contextual) analysis of people, locations, and offices

• Biography of people (from CBDB)
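
The first feature in this list, full-text search with logical operations, can be pictured as set operations over the documents that contain each term. This is a minimal sketch under that assumption, using plain substring matching; the `search` helper and the three sample documents are hypothetical and say nothing about how SHY indexes its text.

```python
def search(docs: dict[str, str], term: str) -> set[str]:
    """IDs of documents whose full text contains the term (substring match)."""
    return {doc_id for doc_id, text in docs.items() if term in text}

# Made-up documents: 蝗 = locust, 旱 = drought.
docs = {
    "d1": "六月蝗旱",
    "d2": "七月大旱",
    "d3": "八月蝗",
}
locust, drought = search(docs, "蝗"), search(docs, "旱")

both = locust & drought          # locust AND drought
either = locust | drought        # locust OR drought
locust_only = locust - drought   # locust AND NOT drought
```

Because every query returns a set of document IDs, the result can be handed directly to the other analyses as a sub-collection.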



Features of SHY (2)

• Chronological distribution of query results

• Geographic distribution of query results

• Self-defined document sets (with all the features above)

• Chronological comparison of two query result sets

• User-feedback mechanism (especially useful for Song research community)

• Appositional term analysis


Full text search in SHY

Query term “locust”



Multi-contextual classification

• Years

• Era (of emperors)

• Categories

• Subcategories

• Error detection
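
Classifying a query result along these dimensions amounts to counting documents per metadata value. The sketch below shows the idea with `collections.Counter`; the record layout (`year`, `era`, `category` keys) and the three sample records are assumptions for illustration, not the SHY data model.

```python
from collections import Counter

def facet_counts(results: list[dict], field: str) -> Counter:
    """Count how many documents in a result set fall under each value of field."""
    return Counter(doc.get(field, "(unknown)") for doc in results)

# Made-up query result with minimal metadata.
results = [
    {"id": "d1", "year": 977, "era": "太平興國", "category": "瑞異"},
    {"id": "d2", "year": 977, "era": "太平興國", "category": "食貨"},
    {"id": "d3", "year": 1075, "era": "熙寧", "category": "瑞異"},
]
by_category = facet_counts(results, "category")
by_year = facet_counts(results, "year")

# Nested facet: restrict to one category first, then count by year.
ruiyi = [d for d in results if d["category"] == "瑞異"]
ruiyi_by_year = facet_counts(ruiyi, "year")
```

Facet values that look out of place, such as a year with an unexpected count, are also a cheap way to spot metadata errors.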



Error detection using facets

• Dates that are not supposed to exist (e.g., the 2nd month of the first year of Xingguo)
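
A check of this kind can be sketched as a lookup against a table of era boundaries. Everything in the table below is an assumed, illustrative value (an era proclaimed in the 12th month and lasting 9 years), and the check is deliberately simplified: it ignores intercalary months and eras that end mid-year.

```python
# Illustrative era table (assumed values, not taken from the slides):
# era name -> (month in which year 1 of the era began, number of years).
ERAS = {"太平興國": (12, 9)}

def is_impossible(era: str, year: int, month: int) -> bool:
    """True if the (era, year, month) combination cannot occur."""
    if era not in ERAS:
        return True                      # unknown era name: flag for review
    first_month, n_years = ERAS[era]
    if not 1 <= year <= n_years or not 1 <= month <= 12:
        return True
    # Months before the era was proclaimed do not exist in its first year.
    return year == 1 and month < first_month

suspicious = is_impossible("太平興國", 1, 2)   # 2nd month of the first year
plausible = is_impossible("太平興國", 2, 2)    # 2nd month of the second year
```

Documents flagged this way are candidates for review, not proven errors.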



Facets within a facet

• Distribution of the results of the query “locust” within the category Ruiyi ( 瑞異, strange phenomena)



Biography of people from CBDB

• Click “biography” ( 生平 ) next to any name to get that person's information from CBDB



Term frequency analysis

• Common names and locations in the query result

• df: document frequency; tf: term frequency

• Example: df(A)=4, tf(A)=6; df(B)=3, tf(B)=4; df(C)=2, tf(C)=3; df(D)=2, tf(D)=2

[Figure: sample documents containing occurrences of the terms A, B, C, and D]
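
The difference between the two counts is easy to see in code. The four documents below are hypothetical; they were chosen only so that the counts match the figures quoted on this slide, since the original diagram is not recoverable.

```python
from collections import Counter

def df_tf(docs: list[list[str]]) -> tuple[Counter, Counter]:
    """Return (df, tf): documents containing each term, and total occurrences."""
    df, tf = Counter(), Counter()
    for terms in docs:
        tf.update(terms)        # every occurrence counts
        df.update(set(terms))   # each document counts once per term
    return df, tf

# Hypothetical result set of four documents, each a list of extracted terms.
docs = [
    ["A", "B", "A"],
    ["A", "C", "A", "B", "B"],
    ["A", "D", "C", "C", "B"],
    ["A", "D"],
]
df, tf = df_tf(docs)
print(df["A"], tf["A"])
```

A term mentioned many times in one document raises tf but not df, which is why the two are reported side by side.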







Term frequency analysis

df(t): given a query q, the number of documents in the query result in which term t appears

tq(t): df(t) as a percentage of the total number of documents in the whole collection in which t appears

(the higher it is, the more relevant t is to q)
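
Read this way, tq is the share of a term's documents that fall inside the query result. The sketch below computes it as a fraction between 0 and 1; the `postings` mapping (term to the IDs of all documents containing it) and the sample IDs are hypothetical.

```python
def tq(term: str, result_ids: set[str], postings: dict[str, set[str]]) -> float:
    """Share of the documents containing `term` that fall inside the query result."""
    containing = postings.get(term, set())
    if not containing:
        return 0.0
    return len(containing & result_ids) / len(containing)

# Hypothetical postings: term -> IDs of all documents in the corpus containing it.
postings = {
    "A": {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"},
    "B": {"d1", "d2", "d3", "d9"},
}
result_ids = {"d1", "d2", "d3", "d4"}   # documents returned by the query

tq_a = tq("A", result_ids, postings)    # df = 4 of 8 documents
tq_b = tq("B", result_ids, postings)    # df = 3 of 4 documents
```

In this made-up case A has the higher df, yet B has the higher tq, because B rarely appears outside the query result.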




Chronological distribution of documents

• Chronological distribution of documents is often useful

• Among the 80,396 documents in Songhuiyao, 61,002 have dates that were extracted automatically
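
Since roughly a quarter of the documents carry no extracted date, a timeline has to count dated documents per year and report the undated remainder separately. A minimal sketch, assuming each record stores the converted Western year under a hypothetical `year` key (absent or `None` when no date was extracted):

```python
from collections import Counter

def timeline(results: list[dict]) -> tuple[Counter, int]:
    """Count dated documents per Western year; also report how many lack a date."""
    per_year = Counter(d["year"] for d in results if d.get("year") is not None)
    undated = sum(1 for d in results if d.get("year") is None)
    return per_year, undated

# Made-up query result.
results = [
    {"id": "d1", "year": 1075},
    {"id": "d2", "year": 1075},
    {"id": "d3", "year": 1130},
    {"id": "d4", "year": None},
]
per_year, undated = timeline(results)
for year in sorted(per_year):
    print(year, "#" * per_year[year])
```

Showing the undated count next to the chart keeps readers from mistaking missing dates for quiet years.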



Comparing timelines of two queries

• q1 vs. q2

• Ex: Wenzhou vs. Raozhou

Grey: with Raozhou

Red: with Wenzhou
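
Comparing two queries comes down to aligning their per-year counts on a shared axis. The sketch below does that for two lists of document years; the function name and the years themselves are invented for illustration and are not taken from the actual Wenzhou and Raozhou results.

```python
from collections import Counter

def compare_timelines(years_q1: list[int], years_q2: list[int]) -> list[tuple[int, int, int]]:
    """Align two per-year document counts as rows of (year, count_q1, count_q2)."""
    c1, c2 = Counter(years_q1), Counter(years_q2)
    return [(y, c1[y], c2[y]) for y in sorted(c1.keys() | c2.keys())]

# Made-up document years for the two result sets.
wenzhou_years = [1130, 1130, 1131, 1160]
raozhou_years = [1130, 1160, 1160]
rows = compare_timelines(wenzhou_years, raozhou_years)
for year, n_wenzhou, n_raozhou in rows:
    print(year, n_wenzhou, n_raozhou)
```

Years present in only one result set get a zero for the other, so gaps stay visible when the two series are drawn together.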



Geographic distribution

• Locations (with df) plotted on map.

• Location names obtained from CBDB

Query “locust”
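
Plotting requires joining each location's document frequency with coordinates. The sketch below assumes a simple name-to-coordinates gazetteer; the `map_points` helper, the df values, and the approximate coordinates are placeholders for illustration, not CBDB data.

```python
def map_points(location_df: dict[str, int], gazetteer: dict[str, tuple[float, float]]):
    """Join per-location document frequencies with coordinates for plotting."""
    points, unplaced = [], []
    for name, df in location_df.items():
        if name in gazetteer:
            lat, lon = gazetteer[name]
            points.append({"name": name, "lat": lat, "lon": lon, "df": df})
        else:
            unplaced.append(name)   # known name without usable coordinates
    return points, unplaced

# Approximate modern coordinates, used here only as placeholders.
gazetteer = {"溫州": (28.0, 120.7), "饒州": (29.0, 116.7)}
location_df = {"溫州": 5, "饒州": 2, "某州": 1}   # made-up df values

points, unplaced = map_points(location_df, gazetteer)
```

Each point can then be drawn with a marker sized by its df; names left in `unplaced` show where the gazetteer needs work.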



Self-defined folders

• Users can define their own folders of documents for later use

• All the features described above apply to all self-defined folders (i.e., any sets of documents, not only query results)



Self-defined folders

• Light green means that the document has been saved in a folder



User feedback mechanism

• Simple way for users to report errors in metadata or full-text

• Also used effectively by SHY users to judge the correctness of new names found through term extraction



User feedback mechanism

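• In the term frequency analysis, there is currently an “error report” ( 錯誤回報 ) link to the right of every term listed under “other” ( 其他 )

• At the lower right of the full text, there is a “correct full-text errors” ( 更正全文錯誤 ) link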




User feedback mechanism

Feedback on terms

Feedback on full text



Ask the user community to check for correctness



So far: 966 names confirmed from 2,390 candidates



Appositional term analysis

• Given a set of documents, which terms (and with what frequency) appear before or after a certain word

– Example: which words appear before “tax” (which also gives an indication of what types of taxes there were)

• Simple interface: type the keyword and the number of words to look at before or after it
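
The analysis can be sketched as counting the fixed-length strings next to every occurrence of the keyword. The sketch below treats each Chinese character as one "word", which is an assumption; the `neighbours` function and the three sample phrases (稅 = tax) are invented for illustration.

```python
from collections import Counter

def neighbours(texts: list[str], keyword: str, n: int, before: bool = True) -> Counter:
    """Count the n-character strings immediately before (or after) each keyword hit."""
    counts = Counter()
    for text in texts:
        start = text.find(keyword)
        while start != -1:
            if before:
                piece = text[max(0, start - n):start]
            else:
                end = start + len(keyword)
                piece = text[end:end + n]
            if len(piece) == n:          # skip hits too close to the text boundary
                counts[piece] += 1
            start = text.find(keyword, start + 1)
    return counts

# Made-up phrases containing 稅 (tax).
texts = ["免商稅一年", "增鹽稅", "商稅如舊"]
before_tax = neighbours(texts, "稅", 1)
after_tax = neighbours(texts, "稅", 2, before=False)
print(before_tax.most_common())
```

Sorting the counts by frequency gives the kind of ranked list of "x tax" terms that the interface reports.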



Statistics of x tax



Can directly read the text


Discussion

• SHY is an example of a new methodology for search systems

• It can analyze the contexts of the documents resulting from a query

• The first prototype of SHY was completed within a week (fine-tuning took longer)

– Critical input from CBDB, especially on terms, locations, calendar, biography

• THDL as a shell is very effective for quickly prototyping such systems

Thank you

