Towards Universalisation of Creativity

Dr. Om Vikas
Department of Information Technology
Ministry of Communications and Information Technology
Government of India
E-mail: [email protected]

ICDL-2004


Is there gain in knowledge or loss of Knowledge?

• From an estimated 10,000 world languages in 1900, about 6,700 languages survived in 2000. Two percent of the world's languages are becoming extinct every year.

• There is worldwide, unquantifiable erosion of cultural participation, knowledge and innovation.

• With the loss of a language, we lose art and ideas, scientific information and technological innovation capacity.

• World-level literacy is improving. More people can read than ever before, but fewer people create stories.

• There is a shift from being creators to being consumers, at a time when technology could have amplified our creative capacities.

• UNESCO study (1999) of 65 languages: 49 of the languages (75%) had experienced a real decline in the number of works translated from these languages into other languages.

• The proportion for English rose from 43 percent in 1980 to over 57 percent in 1994.

• The share held by the top four translated languages (English, Spanish, French and German) rose from 65 percent in 1980 to 81 percent in 1994.

• According to a UNESCO study involving the world's 140 most published authors, 90 out of 140 were English writers in 1994, compared to 64 out of 140 in 1980.

• There is a collapse in authorship, translation and quality in other languages.

Erosion of Language and Culture !!


Is the technology to divide or to unite ?

• Latin alphabet users, 39% of the global population, enjoy 84% of access to the Internet

• Hanzi users (CJK), 22% of the global population, enjoy 13% of Internet access

• Arabic script users, 9% of the population, have 1.2% of Internet access

• Users of Brahmi-origin scripts in South-east Asia and of Indic scripts, 22% of the world population, have just 0.3% of Internet access.

• More than 80% content on Internet is in English.

• ICT penetration in India and other developing countries is lower.
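The imbalance in these figures can be made concrete by dividing each group's share of Internet access by its share of world population. The following is a minimal sketch in Python that uses only the percentages quoted above; the names `SHARES` and `access_ratio` and the ratio itself are illustrative assumptions, not part of the original slides.

```python
# Share of world population and share of Internet access (percent),
# as quoted on the slide above.
SHARES = {
    "Latin alphabet": (39.0, 84.0),
    "Hanzi (CJK)": (22.0, 13.0),
    "Arabic script": (9.0, 1.2),
    "Brahmi-origin / Indic scripts": (22.0, 0.3),
}


def access_ratio(population_pct: float, access_pct: float) -> float:
    """Access share per unit of population share (1.0 would mean parity)."""
    return access_pct / population_pct


ratios = {name: access_ratio(pop, acc) for name, (pop, acc) in SHARES.items()}

# Print the groups from most to least over-represented on the Internet.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:32s} {ratio:6.3f}")
```

On this measure Latin alphabet users sit above parity and users of Indic scripts far below it, a gap of more than a hundredfold between the two.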


ICT Indicators

Indicator            Advanced Nations   Developing Nations   Underdeveloped Nations
Teledensity          50-70 %            20-30 %              -
Cellphone density    30-75 %            4-7 %                -
PC penetration       30-60 %            0.5-2 %              -

Sprawling Digital Divide !!


Digital Divide as They Behold

Perception | Developed Countries | Developing Countries
Why discussed? | Desire to capture larger markets | Fear of lagging behind in economic race
Policy | Information explosion | Localization
Results | Increasing use of English and thrust of western culture | Preservation of local language and culture
Consumer nature | "Substitute the old" [Consumerism-centric] | "Upgrade the old"
Technology | IPR-centric | Open source technology development
Low-cost PC | $400 | less than $40
Focus | Digital Divide; access to information; wider control | Digital Unite; share the knowledge; small is beautiful

Reason (low-cost PC): PPP (15:1): 34260 (USA), 2400 (India); GNP (75:1): 24260, 460

Low affordability means low ICT penetration and a sprawling Digital Divide.


e-Content & Universal Access

UNESCO identifies challenges in multilingualism and universal access to information:

• General affordable worldwide access

• Hardware and Software, Web and Internet Features.

• Availability of Accessible websites and Internet Access devices.

• Accessibility of multiple languages

• Development of content in Native languages, and its placement on Internet.

• Appropriate design of software for users


Potential use of non-English languages on the Internet will increase drastically by 2010.

[Chart: Internet users (0 to 500 Mn) by language, 2003 vs 2010: English, Japanese, Chinese, Spanish, German, French, Indian languages]

65% of information on the Internet is in English

Source: IBM's Web Fountain


New Order of Knowledge based Society :

• Universalization of Creativity

• Rise, Raise & Race


Raise to Rise & Race to Limits

Liberalisation is the advice of advanced nations to the rest for creating a conducive environment for technology acquisition and absorption, and thus for expanding their market. The mindset needs to change to help the underdeveloped nations catch up in technology absorption and participate in knowledge generation.

The following is an example of providing a high-tech solution in a low-tech environment. A group of volunteer engineers in the USA designed and built a rugged, low-cost, bicycle-powered computer and wireless network for the villagers of Phon Kham in Laos, which had no electricity or phone service. There was no way to call relatives living abroad or even in the next town. This is a project to bridge the digital divide.

Innovation follows from stretching our imagination to its limits. As we noticed, the constrained environment of a village in Laos led to the development of a new operating system, a cycle-powered PC, etc. Heterogeneity of communities opens up new opportunities for innovation and integration skills. Time is a critical factor in the context of ICT. Let all communities the world over catch up to the basic technology absorption capability and use it to improve the quality of life of people at large.


Digital Knowledge Resources:

• Electronic Information is being created in many forms and formats and stored in many repositories

• Ever-improving Information Technology makes sharing of Knowledge Resources economical and universally accessible


World Scenario of Digital Library Initiatives

Digital libraries are a form of information technology in which social impact matters as much as technological advancements.

DLI in USA

Six major projects were launched during 1994-1998 under the DLI (Digital Library Initiative), funded by the NSF, DARPA and NASA in the USA. Digital Libraries Initiative Phase 2 (DLI-2) is an NSF-led initiative that builds on the successes of DLI-1. DLI-2 is supported by many funding agencies, including NSF, DARPA, the National Library of Medicine, the Library of Congress and the National Endowment for the Humanities. DLI-2 will investigate digital libraries as human-centered systems.


DARPA's Information Management program (www.dapra.mil/ito/research/in) addresses core digital library issues requiring revolutionary research technology:

Federated repositories. The organisation of distributed repositories into a coherent virtual collection is fundamental.

Scalability. Managing billions of digital objects and millions of sources poses challenges in identifying, categorizing, indexing, summarizing and extracting content.

Interoperability. Digital libraries require semantic interoperability among heterogeneous repositories distributed across the network.

Collaboration. Analysts work in distributed teams, building on each other's knowledge, experience and resources.

Communication. Timely dissemination of research results is the focus of D-Lib.
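The idea of federated repositories, where distributed collections behave as one virtual collection, can be sketched in a few lines. The example below is a minimal illustration in Python under stated assumptions: the `Repository` class, its in-memory documents and the plain substring matching are all hypothetical stand-ins, and real systems would query remote services and rank the merged results.

```python
from dataclasses import dataclass


@dataclass
class Repository:
    """A stand-in for one remote repository (hypothetical, in-memory)."""
    name: str
    documents: dict  # document id -> text

    def search(self, term: str) -> list:
        """Return (repository name, document id) for every matching document."""
        term = term.lower()
        return [
            (self.name, doc_id)
            for doc_id, text in self.documents.items()
            if term in text.lower()
        ]


def federated_search(repositories: list, term: str) -> list:
    """Send the same query to every repository and merge the hits."""
    hits = []
    for repo in repositories:
        hits.extend(repo.search(term))
    return sorted(hits)


repos = [
    Repository("physics", {"p1": "Digital library architecture", "p2": "Optics"}),
    Repository("history", {"h1": "Library of Alexandria", "h2": "Silk Road"}),
]
results = federated_search(repos, "library")
print(results)
```

The caller sees a single result list and does not need to know which repository held which document, which is the sense in which the collection is "virtual".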


The Illinois D-Lib project (http://dli.grainger.uiuc.edu) takes SGML directly from the publisher's collections, converts it into a canonical format for federated searching and transforms tags into a standard set.
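Transforming tags into a standard set can be illustrated with a small mapping table. The sketch below is not the Illinois project's code: the tag names in `CANONICAL_TAGS` are invented for illustration, and it handles only bare tags without attributes, which real SGML processing would have to support.

```python
import re

# Hypothetical publisher-specific tag names mapped to a canonical set.
CANONICAL_TAGS = {
    "ARTTITLE": "title",
    "TI": "title",
    "AU": "author",
    "AUTHOR": "author",
    "ABS": "abstract",
}

# Matches simple opening and closing tags such as <TI> or </TI>.
TAG_PATTERN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def normalize_tags(markup: str) -> str:
    """Rewrite known tag names to their canonical form; leave others as-is."""
    def replace(match):
        slash, name = match.groups()
        canonical = CANONICAL_TAGS.get(name.upper(), name)
        return f"<{slash}{canonical}>"
    return TAG_PATTERN.sub(replace, markup)


publisher_a = "<ARTTITLE>Scalable Semantics</ARTTITLE><AU>A. Author</AU>"
publisher_b = "<TI>Scalable Semantics</TI><AUTHOR>A. Author</AUTHOR>"

print(normalize_tags(publisher_a))
print(normalize_tags(publisher_b))
```

Once both publishers' records carry the same tag names, a single field query (for example, on `title`) can be run across both collections.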

Federating the search at a semantic level is an area of active research in the digital library community. Statistical approaches lead toward scalable semantics: indexing deeper than text word search that is computable on large real collections.

The Journal Storage (JSTOR) project started at the University of Michigan with a grant from the Andrew W. Mellon Foundation. The JSTOR database totals 450,000 articles and 2.7 million pages, created via a combination of page images and full text at a rate of 100,000 pages. The www.jstor.org URL links to three server machines: two at the University of Michigan and a third at Princeton University. Distributed mirrors offer increased reliability, accessibility, and capacity.


The Informedia Project at Carnegie Mellon University has created a terabyte digital video library in which automatically derived descriptors for the video are used for indexing, segmenting, and accessing the library contents. Artificial Intelligence techniques have been used to create metadata - the data that describes video content. Powerful browsing capabilities are essential in a multimedia information retrieval system.

The Carnegie Mellon DLI project searched multimedia, particularly video segments, by generating text indexes using speech understanding. The Stanford DLI project searched across different engines using multiprotocol gateways. Other even harder issues remain untouched, such as multicultural search across context and meaning.
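Searching video through text derived from speech amounts to building a text index whose entries point back to positions in the video. The following Python sketch assumes the transcripts already exist; the `SEGMENTS` data is invented, and it does not represent the Informedia system's actual design.

```python
from collections import defaultdict

# Hypothetical transcript segments: (video id, start time in seconds, text).
SEGMENTS = [
    ("news-01", 0, "Monsoon rains reach the coast"),
    ("news-01", 45, "Farmers welcome the rains"),
    ("doc-07", 120, "The coast guard trains new recruits"),
]


def build_index(segments):
    """Map each lowercase word to the segments whose transcript contains it."""
    index = defaultdict(set)
    for video_id, start, text in segments:
        for word in text.lower().split():
            index[word].add((video_id, start))
    return index


def find_segments(index, word):
    """Return matching (video id, start time) pairs in sorted order."""
    return sorted(index.get(word.lower(), set()))


index = build_index(SEGMENTS)
print(find_segments(index, "rains"))
print(find_segments(index, "coast"))
```

Each hit names a video and a start time, so a player can jump directly to the segment in which the word was spoken.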


DLI in Europe

The importance of D-Lib research is spreading beyond the US. European research in digital libraries is funded by the European Union as well as national sources. DL projects have been supported by the Information Engineering (www.echo.lu/ie), Language Engineering (www.echo.lu/langeng/en/lehome.html), and Esprit (www.cordis.lu/esprit) programs in Europe.

Under NSF-EU collaboration, five working groups have been formed in the key technical areas of interoperability, metadata, IPR, resource indexing and discovery, and multilingual information access.


DLI in Asia

Since 1995, D-Lib research has become a national grand challenge in several countries in Asia. Most projects can be classified into the following categories:

Nationwide D-Lib initiatives and special-purpose digital libraries: for example, the Library 2000 Project in Singapore (to link all library resources) and the Financial Digital Library at the University of Hong Kong (to serve the needs of the HK stock market and users).

Digital museums and historical document digitization: for example, the Digital Museum Project of the National Taiwan University and the digitization of the art collection of the Palace Museum in Taipei by IBM.

Local language processing and historical cultural content could be the most immediate Asian contribution to the international DL community. An Asia Digital Library consortium is fostering long-term collaboration and projects in DL-related topics in Asia (www.cyberlib.net/adl).


Local language and multilingual information retrieval: for example, the Net Compass Project of Tsinghua University in China, Chinese Information Retrieval at the Academia Sinica, Taiwan, and New Zealand's multilingual project.

The New Zealand D-Lib (http://www.nzdl.org) currently offers about 20 collections, varying in size from a few documents up to 10 million documents and several gigabytes of text. The documents are written in many different languages, including English, French, German, Arabic, Maori, Portuguese and Swahili. The D-Lib provides interfaces to the collections in several languages. To accommodate blind users (with speech synthesizers) and partially sighted users (with large-font displays), the NZ D-Lib provides a text-only version of the interface for each language.


iv. Digital Library of India Initiative

Broad Objectives :

• To digitize and index the heritage knowledge.

• To promote life long learning in the society (a necessity of the Knowledge-based society).

• To promote collaborative creativity and building up knowledge teams across borders.

• Participation in World initiatives on Digital Library such as UDL.

[It is to be noted that India has multiple languages, multiple scripts, manuscripts in different forms, books using various fonts, a vast tacit knowledge resource of vanishing scholars, and multiple commentaries on a text. This forms a vast treasure of heritage knowledge.]


• Mobile Digital Library – Knowledge at doorsteps

To let users surf, access, print, and take away a book of their choice, anywhere and anytime

• 20 DL Centers with 106 high-resolution scanners

• 4 Megacenters (to be set up)


Multilingual Issues

• Character Sets (UNICODE?)

• Representations

• Multilingual Navigation

• Translation Assistance
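The character-set question can be made concrete by looking at how one Indian-language word is represented in Unicode. The sketch below uses Python's standard `unicodedata` module; the sample word is an illustrative choice, not taken from the slides.

```python
import unicodedata

# "Bharat" in Devanagari, written with escapes so the code points are explicit:
# BHA, vowel sign AA, RA, TA.
word = "\u092D\u093E\u0930\u0924"

# Every character has a unique code point and a standard name.
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The same text occupies more bytes than characters once encoded.
utf8_length = len(word.encode("utf-8"))
print(len(word), "code points,", utf8_length, "bytes in UTF-8")
```

The gap between code points and bytes is one reason "Representations" appears as a separate issue: storage, indexing and display each have to agree on which unit they count.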

Policy Challenges

• Convenient quality displays

• What to digitize first?

• Use of copyrighted material

• Economics (Who pays? Who gets?)

• Privacy

• Reliability of information

• Authentication of text from multiple versions

• Digital Library Act

• Issues pertaining to digitization


Need for Indian Digital Library Act.

Issues to tackle may include compulsory licensing; digital pack book (incentive: 10% tax deduction on lifetime revenue); deemed out of print (donate electronic rights); concept shift in royalty from per copy to per preview; public lending rights (as in Japan); 4Cs (Consortium for Compensation for Creative Content); a formula to respect the content creator and pay compensation (min. Rs. 100/- to max. Rs. 1 lakh); and inclusion of books, music and movies with higher & higher privacy value.


Linguistic Scenario in India

• Eighteen constitutional Indian languages are listed below, with their scripts in parentheses: Hindi (Devanagari), Konkani (Devanagari), Marathi (Devanagari), Nepali (Devanagari), Sanskrit (Devanagari), Sindhi (Devanagari/Urdu), Kashmiri (Devanagari/Urdu); Assamese (Assamese), Manipuri (Manipuri), Bangla (Bengali), Oriya (Oriya), Gujarati (Gujarati), Punjabi (Gurumukhi), Telugu (Telugu), Kannada (Kannada), Tamil (Tamil), Malayalam (Malayalam) and Urdu (Urdu). There are 10 Indic scripts in vogue.

• Interestingly, Indian languages owe their origin to Sanskrit, hence they have in common a rich cultural heritage and treasure of knowledge. Indic scripts have originated from the Brahmi script. Less than 5 percent of people can either read or write English. Over 95 percent of the population is normally deprived of the benefits of English-based Information Technology.

Characteristics of Indian Languages

• What You Speak Is What You Write (WYSIWYW)

• Script grammar - transformation rules

• Relatively word order free

• Common phonetic based alphabet

• Common concept terms (from Sanskrit)
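The common phonetic alphabet has a practical consequence in Unicode: the blocks for the major Indic scripts are laid out in parallel, so many letters sit at the same relative position in each block. The sketch below exploits that layout for a rough script-to-script conversion. It is an assumption-laden illustration only: the function name `shift_script` is invented, and positional correspondence does not guarantee correct spelling or conjunct behaviour in the target script.

```python
import unicodedata

# Start of each script's 128-code-point block in Unicode.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "telugu": 0x0C00,
}


def shift_script(text: str, source: str, target: str) -> str:
    """Move characters to the same relative position in the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        code = ord(ch)
        if src <= code < src + 0x80:
            candidate = chr(code - src + dst)
            # Keep the original character if the target slot is unassigned.
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)
    return "".join(out)


hindi = "\u0915\u092E\u0932"  # KA, MA, LA ("kamal", lotus)
print(shift_script(hindi, "devanagari", "bengali"))
print(shift_script(hindi, "devanagari", "gujarati"))
```

A production transliterator would add script-specific rules on top of this offset, since each script has letters and signs the others lack.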


Indian Language Technology Map

[Map labels: CoILTech; IETE, New Delhi; G.G. Univ., Bilaspur]


Major Achievements in ILT

Translation Support Systems

Human Machine Interface systems

Knowledge Resources

Knowledge Tools

Standardization

Localization of LINUX

Information Dissemination


Translation Support Systems (MAT)

• English to Hindi (Angla-Bharati), http://anglahindi.iitk.ac.in (very satisfactory; above 85% consistently okay)

• Indian Languages to Hindi (in the process of development)

• Hindi to English (in the process of development)

Human Machine interface Systems

Optical Character Recognition (OCR) (accuracy above 97% for 7 ILs, viz. Hindi, Marathi, Bangla, Tamil, Telugu, Gurumukhi and Malayalam; OCRs in other ILs are in the process of development)

Text to Speech system (TTS) (Hindi, Bangla)

Continuous Speech Recognition CSR (Hindi)


Knowledge Resources

Bilingual dictionaries (over 30,000 words):

• English - Hindi

• English - Telugu - Hindi

• English - Tamil - Hindi

• English - Kannada - Hindi

• English - Bangla - Hindi

• English - Punjabi - Hindi

• English - Oriya - Hindi

• English - Malayalam - Hindi

• English - Sanskrit - Hindi

Parallel Corpora – A one-million-page parallel corpus is under development [600 thousand pages ready]. The development of the parallel corpora is one of the unique achievements of the TDIL programme and is appreciated worldwide.

Major Achievements in ILT…..


Standardization UNICODE

DIT is a voting member of the Unicode Consortium.

Proposed changes to the Unicode Standard were finalized in consultation with the respective State Governments and the Indian IT industry, and presented at the Unicode Technical Committee meeting. Some of the proposed changes have been incorporated in Unicode version 4.0.

INdian Scripts FOnt Code (INSFOC) Standards have been developed.

Indian Script to Romanization Tables (INSROT) are ready.
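How a romanization table is applied can be shown with a very small example. The table below is hypothetical and covers only a handful of Devanagari letters; it is not the INSROT table, whose contents are not given here. The one rule it illustrates is that a consonant carries an inherent "a" unless a vowel sign or a virama follows it.

```python
# A small, hypothetical Devanagari-to-Roman table (not the INSROT tables).
CONSONANTS = {
    "\u0915": "k", "\u0924": "t", "\u0928": "n", "\u092E": "m", "\u092F": "y",
    "\u0930": "r", "\u0932": "l", "\u0935": "v", "\u0938": "s",
}
VOWEL_SIGNS = {
    "\u093E": "aa", "\u093F": "i", "\u0940": "ii", "\u0947": "e", "\u094B": "o",
}
VIRAMA = "\u094D"  # suppresses the inherent vowel of the preceding consonant


def romanize(text: str) -> str:
    """Romanize text letter by letter; unknown characters pass through."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])  # explicit vowel replaces "a"
                i += 1
            elif nxt == VIRAMA:
                i += 1                        # no vowel at all
            else:
                out.append("a")               # inherent vowel
        else:
            out.append(ch)
        i += 1
    return "".join(out)


sample = "\u0938\u0930\u0932 \u0935\u093E\u0915\u094D\u092F"  # "simple sentence"
print(romanize(sample))
```

Because the scheme always writes the inherent vowel, it is reversible, at the cost of not reflecting how Hindi speakers drop that vowel in speech.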

Knowledge Tools

Morph Analyzer, Syntactic Analyzer, Spell checker, Messaging system, Authoring Systems, Word processors and code conversion utilities have been developed.

Major Achievements in ILT…..


Localization of LINUX systems

INDIX system: Localized INDIX-2 supports 5 ILs, viz. Hindi, Marathi, Gujarati, Tamil and Bangla. A LINUX operating system with support for other Indian languages is in the process of development.

Information Dissemination:

TDIL Web-site http://tdil.mit.gov.in

This Web site contains information on various TDIL activities and achievements, and provides access to a variety of content and downloads in Hindi and other Indian languages.

– Free Downloads: Indian language keyboard drivers & fonts and other tools, corpus, content, conversion utilities, Machine-aided Translation systems.

Quarterly Language Technology Flash : Vishwabharat@tdil

Major Achievements in ILT…..


• Language Technology HRD

• Post Graduate Programs in the domains of Computational Linguistics & Knowledge Engineering.

• All the Bachelors and Masters Programmes in Computer Science Engineering will cover the Multilingual Computing aspect also.

• School curricula include basics of multilingual computing.


Typical illustration of Indian Language OCRs

[Figure: Hindi OCR input and OCR output]

Efficiency 96.8%, working for font sizes from 12 to 36
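The slide does not say how the efficiency figure was measured. One common way to score OCR output, shown here purely as an assumed illustration, is character accuracy: one minus the edit distance between the reference text and the OCR output, divided by the reference length. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Percentage of the reference recovered, based on edit distance."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return 100.0 * (1 - errors / len(reference))


reference = "digital library"
ocr_output = "digitel librory"  # two substituted characters
print(round(character_accuracy(reference, ocr_output), 1))
```

For Indic scripts the choice of unit matters: counting code points, as this sketch does, gives a different figure from counting whole syllable clusters.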


Gurmukhi to Shahmukhi Transliteration

[Figure: sample text in Gurmukhi and its Shahmukhi transliteration]


Illustration of online MAT system

• Machine Translation (MAT) – English to Hindi (http://anglahindi.iitk.ac.in)

English: Simple Sentences.
Hindi (romanized): sarala vaakya.
Hindi: सरल वाक्य।

English: Welcome to London.
Hindi (romanized): landana men aapakaa svaagata hai.
Hindi: लन्दन में आपका स्वागत है।

English: There are some cases which are still pending.
Hindi (romanized): vahaan kuc'ha kesa hain jo abhii bhii nilamibata hain.
Hindi: वहाँ कुछ केस हैं जो अभी भी निलम्बित हैं।


• Machine Translation (MAT) – Hindi to English


"Researchers always want to go for that last 2% of performance. But it's better to get a sufficient solution out fast and then continue to enhance it."

Mark Dean, IBM

(Source: Harvard Business Review, Aug 2002)

Hence the TDIL Program emphasizes collaborative development of language technology, and taking language technology products out to market rapidly for feedback and refinement.

Innovating to Innovate


Media Lab Asia : another initiative

World Computer (Low-cost PC)

Rural Operating Systems; Speech Interfaces For Local Dialects; Visual Language; Interfaces for All; Interlingua Web; Multi-Literate Interface; Literacy Learning Through Pictures

Bits for All (Universal Connectivity)

Rural WiFi, DakNet, Digital Gangetic Plain, Off-Line Internet Access, Rural VoIP

Tomorrow's Tools (Language Interfaces)

Mapping For the Masses, Community Access to Sustainable Health (Ca:sh), Building Robots Creating Science (BRICS), Digital Craft Revival, Digital Human Body, Digital Music, InfoSculpture, Suchik, Polysensors, Complex RF Impedance Analyzers, UV-VIS Spectrometer, Power Sensors, Think Cycle

Digital Village (Consolidation in delivering value to the masses)

Sustainable Access in Rural India, Community Connection, Digital Mandi, InfoThela


• Intelligent Human Computer Interaction

To support more sophisticated and natural input and output that promise knowledge or agent-based dialogue in which the interface gracefully handles errors and interruptions and dynamically adapts to the current context.

Typical properties :

Multimodal input - They process potentially ambiguous, imprecise combinations of mixed input such as written text, spoken language, gestures (e.g., mouse, pen, dataglove) and gaze.

Multimodal output - They design coordinated presentation of, e.g., text, speech, graphics, and gestures, which may be presented via conventional displays or animated, life-like agents.

Interaction management - mixed initiative interactions that are context-dependent based on system models of the discourse, user, and task

Trends in Language Technology


1970s: Narrow domain, rule-based approach

1980s: Practical MT systems; example-based approach; interlingua and transfer methods

1990s: Multilingual MT, simultaneous interpretation, example-based approach revisited, corpus-based and statistics-based approaches

2000s: MT through NL understanding; language resources

• Machine Translation


• Speech technology is a field of Interactive Technologies. There is an ongoing shift from speech component research to research on integrated speech systems. Together with speech, the other modalities that constitute full natural human-human communication (e.g. gesture, lip movements, facial expression, gaze, bodily posture) are leading towards multimodal interactive systems.

• 1970s: Speech synthesis systems used rule-based formant synthesis. (Formants are the resonant frequencies of the vocal tract.)

• 1990s: Concatenated speech synthesis systems use small pieces of pre-recorded speech.

• There is trend towards cross-project collaboration, synergy, critical mass, and deployable & scalable technologies

• Speech Technology Development:


• Trends in Digital Library Technologies

Multi-modal Input: Scanning, Smartizing (Value Addition), Content, Multi-lingual, Multi-media

Standardization: Character Code, Font Code, Semantic Indexing, DOI, XML, SCORM

Navigation: Browsing, Finding, Searching, Zooming, Hyperbolic Tree, Virtual Reality, Aboutness, Searching Mathematics, Multilingual Navigation, Translation Assistance

Architecture: Interoperability, Multi-lingual Information Access, Metadata, Resource Indexing & Discovery in Globally Distributed Digital Library

IPR Issues: 4Cs (Consortium for Compensation for Creative Content)

Knowledge Generation Capacity

Focus In 20th Century

Capitalistic & Monopolistic Trend In Publication & Dissemination.

Focus In 21st Century

Universalization Of Creativity.


The Interspace represents the third wave in the ongoing evolution of the Global Information Infrastructure, driven by rapid advances in computing and Information Technology during .

Future Knowledge Networks

The wave pattern roughly describes four distinct phases of functionality: fundamental research (trough), development of prototype systems (ascent), emergence of commercial systems (crest), and mass propagation (descent)


Scalable Semantics

Future knowledge networks will rely on scalable semantics, on automatically indexing the community collections so The knowledge networks of the Interspace will be connected via switching machines that switch concepts. Connectivity and training continue to be the principal barriers to integrating the global network of libraries.

Interspace focuses on scalable technologies for semantic indexing that work generally across all subject domains. We can use concept spaces (collections of abstract concepts generated from concrete objects) to boost searches by interactively suggesting alternative terms. We can use category maps to boost navigation by interactively browsing clusters of related documents. Scalable semantics is used to index the semantics of document contents on large collections. Concept spaces use text documents as the objects and noun phrases as the concepts.
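A concept space of this kind can be approximated by counting which noun phrases appear together in the same document and suggesting the most frequent companions of a search term. The Python sketch below assumes the noun phrases have already been extracted; the `DOCUMENTS` data is invented, and raw counts stand in for the weighted statistics a real system would use.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical documents, each already reduced to its noun phrases.
DOCUMENTS = [
    {"digital library", "metadata", "semantic indexing"},
    {"digital library", "metadata", "interoperability"},
    {"machine translation", "parallel corpus"},
    {"digital library", "semantic indexing"},
]


def build_concept_space(documents):
    """Count how often each pair of phrases appears in the same document."""
    space = defaultdict(Counter)
    for phrases in documents:
        for a, b in permutations(sorted(phrases), 2):
            space[a][b] += 1
    return space


def suggest(space, phrase, limit=3):
    """Suggest co-occurring phrases, most frequent first (ties alphabetical)."""
    neighbours = space.get(phrase, Counter())
    ranked = sorted(neighbours.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:limit]]


space = build_concept_space(DOCUMENTS)
print(suggest(space, "digital library"))
print(suggest(space, "parallel corpus"))
```

Since the suggestions come from the collection itself, the same procedure works in any subject domain without a hand-built thesaurus.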


Summing up the Challenges Ahead

• ML Open Source Software
- Shareable software
- Standards database and updating
- Support service & help line
- Consortium approach
- GPL with performance, else Garbage In Garbage Out

• Benchmarking & Standards
- Testing against international standards
- Active participation in evolving standards

• Information Technology Culture
- Awareness: IT clinic, workshops, media
- BIPK (Basic Information Processing Kit) with user-friendly, easy-to-use, affordable, scalable, interoperable and re-usable tools. BIPK may consist of a StarOffice-like processing facility, fonts, KB driver, spell checker, dictionary and conversion utility.
- Entrepreneurship: Gyanaudyog workshops


.... Challenges Ahead

• Cross-lingual Information Access
- Search engine, Web crawler, on-line machine translation

• Localization
- Localization of software and content into local languages
- Enlarging share in localization outsourcing ($8 Bn by 2006: IDC)

• International Collaboration in Language Informatics
- Industry-academia cooperation in joint research & technology development projects
- Exchange of faculty and students
- HRD programs in Knowledge Engineering & Computational Linguistics

• Rise, Raise & Race
- Possess basic language technologies
- Promote collectivistic culture
- Think globally & act locally
- Collaborate for innovation


Digital Library is a means to meet the end :

Objective of Universalization of Creativity


Nothing is so pious as knowledge.

न हि ज्ञानेन सदृशं पवित्रमिह विद्यते।

(Bhagwadgita: 4.38)

शांतिः (Shaantih)