Dr.Om vikas ICDL-2004
Dr. Om Vikas
Department of Information Technology
Ministry of Communications and Information Technology
Government of India
E-mail: [email protected]
Towards Universalisation of Creativity
Is there gain in knowledge or loss of Knowledge?
• From an estimated 10,000 world languages in 1900, about 6,700 languages survived in 2000. Two percent of the world's languages are becoming extinct every year.
• There is worldwide, unquantifiable erosion of cultural participation, knowledge and innovation.
• With the loss of a language, we lose art and ideas, scientific information and technological innovation capacity.
• World-level literacy is improving. More people can read than ever before, but fewer people create stories.
• There is a tendency to shift from being creators to being consumers, at a time when technology could have amplified our creative capacities.
• UNESCO study (1999) of 65 languages: 49 of the languages (75%) had experienced a real decline in the number of works translated from these languages into other languages.
• The proportion for English rose from 43 percent in 1980 to over 57 percent in 1994.
• The share held by the top four translated languages (English, Spanish, French and German) rose from 65 percent in 1980 to 81 percent in 1994.
• According to a UNESCO study involving the world's 140 most published authors, 90 out of 140 were English writers in 1994, compared to 64 out of 140 in 1980.
• There is a collapse in authorship, translation and quality in other languages.
Erosion of Language and Culture !!
Is the technology to divide or to unite ?
• Latin alphabet users, 39% of the global population, enjoy 84% of access to the Internet.
• Hanzi users (CJK), 22% of the global population, enjoy 13% of Internet access.
• Arabic script users, 9% of the population, have 1.2% of Internet access.
• Users of Brahmi-origin scripts in South-East Asia and of Indic scripts, 22% of the world population, have just 0.3% of Internet access.
• More than 80% content on Internet is in English.
• ICT penetration in India and other developing countries is lower.
ICT Indicators (Advanced, Developing and Underdeveloped Nations)

Teledensity    Cellphone Density    PC penetration
50-70%         30-75%               30-60%
        <<< Sprawling Digital Divide >>>
20-30%         4-7%                 0.5-2%
Digital Divide as They Behold

Perception      | Developed Countries                                      | Developing Countries
Why discussed?  | Desire to capture larger markets                         | Fear of lagging behind in economic race
Policy          | Information explosion                                    | Localization
Results         | Increasing use of English and thrust of western culture  | Preservation of local language and culture
Consumer nature | "Substitute the old" [Consumerism-centric]               | "Upgrade the old"
Technology      | IPR-centric                                              | Open source technology development
Low cost PC     | $400                                                     | less than $40
Reason: PPP (15:1) | 34260 (USA)                                           | 2400 (India)
Reason: GNP (75:1) | 24260                                                 | 460
Focus           | Digital divide; Access to information; Wider control     | Digital Unite; Share the knowledge; Small is beautiful
Low affordability means low ICT penetration & sprawling Digital Divide
e-Content & Universal Access
UNESCO identifies challenges in multilingualism and universal access to information:
• General affordable worldwide access
• Hardware and Software, Web and Internet Features.
• Availability of Accessible websites and Internet Access devices.
• Accessibility of multiple languages
• Development of content in Native languages, and its placement on Internet.
• Appropriate design of software for users
Potential use of non-English languages on the Internet will increase drastically by 2010.
[Chart: Internet users (0 to 500 Mn) in 2003 and 2010 for English, Japanese, Chinese, Spanish, German, French and Indian languages]
65% of information on the Internet is in English.
Source: IBM's Web Fountain
New Order of Knowledge based Society :
• Universalization of Creativity
• Rise, Raise & Race
Raise to Rise & Race to Limits
Liberalisation is the advice of advanced nations to the rest for creating a conducive environment for technology acquisition and absorption, and thus for expanding their own market. The mindset needs to change so that the underdeveloped nations are helped to catch up in technology absorption and to participate in knowledge generation.
The following is an example of providing a high-tech solution in a low-tech environment. A group of volunteer engineers in the USA designed and built a rugged, low-cost, bicycle-powered computer and wireless network for the villagers of Phon Kham in Laos, which had no electricity or phone service. There was no way to call relatives living abroad or even in the next town. This is a project to bridge the digital divide.
Innovation follows from stretching our imagination to its limits. As we noticed, the constrained environment of a village in Laos led to the development of a new operating system, a cycle-powered PC, etc. Heterogeneity of communities opens up new opportunities for innovation and integration skills. Time is a critical factor in the context of ICT. Let all communities the world over catch up to a basic technology absorption capability and use it to improve the quality of life of the people at large.
Digital Knowledge Resources:
• Electronic Information is being created in many forms and formats and stored in many repositories
• Ever-improving Information Technology makes sharing of Knowledge Resources economical and universally accessible
World Scenario of Digital Library Initiatives
Digital libraries are a form of information technology in which social
impact matters as much as technological advancements.
DLI in USA
Six major projects were launched during 1994-1998 under DLI (Digital
Library Initiative) funded by the NSF, DARPA and NASA in the USA.
Digital Libraries Initiative Phase 2 (DLI-2) is an NSF-led initiative that
builds on the successes of DLI-1. DLI-2 is supported by many funding
agencies, including NSF, DARPA, the National Library of Medicine, the Library of
Congress and the National Endowment for the Humanities. DLI-2 will
investigate digital libraries as human-centered systems.
DARPA's Information Management program (www.dapra.mil/ito/research/in) addresses core digital library issues requiring revolutionary research technology:
Federated repositories. The organisation of distributed repositories into a coherent virtual collection is fundamental.
Scalability. Managing billions of digital objects and millions of sources poses challenges in identifying, categorizing, indexing, summarizing and extracting content.
Interoperability. Digital libraries require semantic interoperability among heterogeneous repositories distributed across the network.
Collaboration. Analysts work in distributed teams, building on each other's knowledge, experience and resources.
Communication. Timely dissemination of research results is the focus of D-Lib.
The Illinois D-Lib project (http://dli.grainger.uiuc.edu) takes SGML directly from the publishers' collections, converts it into a canonical format for federated searching, and transforms the tags into a standard set.
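The tag-normalization step can be pictured with a small sketch. It is only an illustration under stated assumptions, not the Illinois project's actual converter: the publisher tag names (`atl`, `au`, `abs`, and so on) and the canonical names they map to are hypothetical, and real SGML would need a proper parser rather than a regular expression.

```python
import re

# Hypothetical publisher-specific tag names mapped to one canonical set.
# These names are assumptions made for illustration only.
TAG_MAP = {
    "atl": "title",
    "arttitle": "title",
    "au": "author",
    "aut": "author",
    "abs": "abstract",
}

# Matches a start or end tag: optional slash, tag name, remaining attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w.-]*)([^>]*)>")


def canonicalize(markup: str) -> str:
    """Rewrite start and end tags to lower-case canonical names."""

    def rename(match):
        slash, name, rest = match.groups()
        # Unknown tags are kept, but lower-cased for consistency.
        canonical = TAG_MAP.get(name.lower(), name.lower())
        return f"<{slash}{canonical}{rest}>"

    return TAG_RE.sub(rename, markup)


sample = "<ATL>Digital Libraries</ATL><AU>A. Author</AU>"
print(canonicalize(sample))
```

Once every publisher's markup uses the same tag names, a single query such as "search the title field" can be sent to all collections, which is what makes the search federated.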
Federating the search at a semantic level is an area of active research in the digital library community. Statistical approaches lead toward scalable semantics: indexing deeper than text word search that is computable on large real collections.
The Journal Storage project (JSTOR) started at the University of Michigan with a grant from the Andrew W Mellon Foundation. The JSTOR database totals 450,000 articles and 2.7 million pages, created via a combination of page images and full text at a rate of 100,000 pages. The www.jstor.org URL links to three server machines: two at the University of Michigan and a third at Princeton University. Distributed mirrors offer increased reliability, accessibility, and capacity.
The Informedia Project at Carnegie Mellon University has created a terabyte digital video library in which automatically derived descriptors for the video are used for indexing, segmenting, and accessing the library contents. Artificial Intelligence techniques have been used to create metadata - the data that describes video content. Powerful browsing capabilities are essential in a multimedia information retrieval system.
The Carnegie Mellon DLI project searched multimedia, particularly video segments, by generating text indexes using speech understanding. The Stanford DLI project searched across different engines using multiprotocol gateways. Other even harder issues remain untouched, such as multicultural search across context and meaning.
DLI in Europe
The importance of D-Lib research is spreading beyond the US. European research in Digital Libraries is funded by the European Union as well as national sources. DL projects have been supported by the Information Engineering (www.echo.lu/ie), Language Engineering (www.echo.lu/langeng/en/lehome.html), and Esprit (www.cordis.lu/esprit) programs in Europe.
Under NSF-EU collaboration, five working groups have been formed in the key technical areas of Interoperability, Metadata, IPR, Resource indexing and discovery, and Multilingual information access.
DLI in Asia
Since 1995, D-Lib research has become a national grand challenge in several countries in Asia. Most projects can be classified into the following categories:
Nationwide D-Lib initiatives and special-purpose digital libraries, for example the Library 2000 Project in Singapore (to link all library resources) and the Financial Digital Library at the University of Hong Kong (to serve the needs of the HK stock market and its users).
Digital museums and historical document digitization, for example the Digital Museum Project of the National Taiwan University and the digitization of the art collection of the Palace Museum in Taipei by IBM.
Local language processing and historical cultural content could be the most immediate Asian contribution to the international DL community. An Asia Digital Library consortium is fostering long-term collaboration and projects in DL-related topics in Asia (www.cyberlib.net/adl).
Local language and multilingual information retrieval, for example the Net Compass Project of Tsinghua University in China, Chinese Information Retrieval at the Academia Sinica, Taiwan, and New Zealand's multilingual project.
The New Zealand D-Lib (http://www.nzdl.org) currently offers about 20 collections, varying in size from a few documents up to 10 million documents and several gigabytes of text. The documents are written in many different languages, including English, French, German, Arabic, Maori, Portuguese and Swahili. The D-Lib provides interfaces to the collections in several languages. To accommodate blind users (with speech synthesizers) and partially sighted users (with large-font displays), the NZ D-Lib provides a text-only version of the interface for each language.
Digital Library of India Initiative
Broad Objectives :
• To digitize and index the heritage knowledge.
• To promote life long learning in the society (a necessity of the Knowledge-based society).
• To promote collaborative creativity and building up knowledge teams across borders.
• Participation in World initiatives on Digital Library such as UDL.
[Note that India has multiple languages, multiple scripts, manuscripts in
different forms, books using various fonts, the vast tacit knowledge of
vanishing scholars, and multiple commentaries on a text. Together these form
a vast treasure of heritage knowledge.]
• Mobile Digital Library – Knowledge at doorsteps
To let users surf, access, print, and take away a book of choice anywhere and anytime
• 20 DL Centers with 106 high resolution Scanners
• 4 Megacenters (to be set up)
Multilingual Issues
• Character Sets (UNICODE?)
• Representations
• Multilingual Navigation
• Translation Assistance
Policy Challenges
• Convenient quality displays
• What to digitize first?
• Use of copyrighted material
• Economics (Who pays? Who gets?)
• Privacy
• Reliability of information
• Authentication of text from multiple versions
• Digital Library Act
• Issues pertaining to digitization
Need for Indian Digital Library Act.
Issues to tackle may include compulsory Licensing, digital pack
book (incentive: 10% tax deduction on lifetime revenue); deemed
out of print (donate electronic rights); concept shift in Royalty
per copy to per preview; public lending rights (as in Japan); 4Cs
(Consortium for Compensation for Creative Content), formula to
respect content creator and pay compensation, (min. Rs. 100/- to
max Rs. 1 lakh), inclusion of books, music and movie with
higher & higher privacy value.
• Linguistic Scenario in India
• The eighteen constitutional Indian languages are listed below with their scripts in parentheses: Hindi (Devanagari), Konkani (Devanagari), Marathi (Devanagari), Nepali (Devanagari), Sanskrit (Devanagari), Sindhi (Devanagari/Urdu), Kashmiri (Devanagari/Urdu); Assamese (Assamese), Manipuri (Manipuri), Bangla (Bengali), Oriya (Oriya), Gujarati (Gujarati), Punjabi (Gurumukhi), Telugu (Telugu), Kannada (Kannada), Tamil (Tamil), Malayalam (Malayalam) and Urdu (Urdu). There are 10 Indic scripts in vogue.
• Interestingly, Indian languages owe their origin to Sanskrit, hence they have in common a rich cultural heritage and treasure of knowledge. Indic scripts originated from the Brahmi script. Less than 5 percent of people can read or write English. Over 95 percent of the population is normally deprived of the benefits of English-based Information Technology.
Characteristics of Indian Languages
• What You Speak Is What You Write (WYSIWYW)
• Script grammar - transformation rules
• Relatively word order free
• Common phonetic based alphabet
• Common concept terms (from Sanskrit)
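The common phonetic alphabet has a practical consequence for code conversion. Unicode lays out the major Indic script blocks in parallel, so many letters sit at the same offset in each block. The sketch below uses that property to convert text between scripts by shifting code points. It is a minimal illustration under stated assumptions, not the TDIL conversion utilities: the function name `convert_script` is hypothetical, and the simple offset rule ignores positions that are unassigned in the target script and letters with no counterpart there.

```python
# Start of each script's block in Unicode (taken from the Unicode code charts).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

# Danda and double danda are punctuation shared by all scripts; keep them as-is.
SHARED = {0x0964, 0x0965}


def convert_script(text: str, source: str, target: str) -> str:
    """Shift characters of the source block to the same offset in the target block.

    Assumption: a plain offset shift is good enough for illustration. Some
    offsets are unassigned in the target script and would need a lookup table.
    """
    src = BLOCK_START[source]
    shift = BLOCK_START[target] - src
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE and cp not in SHARED:
            out.append(chr(cp + shift))
        else:
            out.append(ch)  # leave Latin text, digits, spaces, dandas untouched
    return "".join(out)


# Devanagari KA, MA, LA mapped to the Bengali block.
print(convert_script("\u0915\u092e\u0932", "devanagari", "bengali"))
```

A production converter would add per-script exception tables on top of the offset rule, which is the kind of detail a standard romanization or code-conversion table is meant to capture.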
Indian Language Technology Map
[Map labels: CoILTech; IETE, New Delhi; G.G. Univ., Bilaspur]
Major Achievements in ILT
Translation Support Systems
Human Machine Interface systems
Knowledge Resources
Knowledge Tools
Standardization
Localization of LINUX
Information Dissemination
Translation Support Systems (MAT)
• English to Hindi (Angla-Bharati) http://anglahindi.iitk.ac.in (very satisfactory; above 85% consistently okay)
• Indian Languages to Hindi (under development)
• Hindi to English (under development)
Human Machine Interface Systems
Optical Character Recognition (OCR): accuracy above 97% for 7 ILs, viz. Hindi, Marathi, Bangla, Tamil, Telugu, Gurumukhi and Malayalam. OCRs in other ILs are under development.
Text to Speech system (TTS) (Hindi, Bangla)
Continuous Speech Recognition (CSR) (Hindi)
Knowledge Resources
Bilingual dictionaries (over 30,000 words)
• English - Hindi
• English - Telugu - Hindi
• English - Tamil - Hindi
• English - Kannada - Hindi
• English - Bangla - Hindi
• English - Punjabi - Hindi
• English - Oriya - Hindi
• English - Malayalam - Hindi
• English - Sanskrit - Hindi
Parallel Corpora – A one-million-page parallel corpus is under development [600 thousand pages ready]. Its development is one of the unique achievements of the TDIL programme and is appreciated worldwide.
Standardization UNICODE
DIT is the voting member of the Unicode Consortium.
Proposed changes to the Unicode Standard were finalized in consultation with the respective State Governments and the Indian IT industry, and presented at the Unicode Technical Committee meeting. Some of the proposed changes have been incorporated in Unicode version 4.0.
INdian Scripts FOnt Code (INSFOC) Standards have been developed.
Indian Script to Romanization Tables (INSROT) are ready.
Knowledge Tools
Morph Analyzer, Syntactic Analyzer, Spell checker, Messaging system, Authoring Systems, Word processors and code conversion utilities have been developed.
Localization of LINUX systems
INDIX system: Localized INDIX-2 supports 5 ILs, viz. Hindi, Marathi, Gujarati, Tamil and Bangla. A LINUX operating system with support for other Indian languages is under development.
Information Dissemination:
TDIL Web-site http://tdil.mit.gov.in
This web site contains information on various TDIL activities and achievements, and provides access to a variety of content and downloadables in Hindi and other Indian languages.
– Free Downloads
Indian language keyboard driver & fonts and other tools, corpus, content, conversion utilities, Machine Aided Translation systems.
Quarterly Language Technology Flash : Vishwabharat@tdil
• Language Technology HRD
• Post Graduate Programs in the Domains of
Computational Linguistics &
Knowledge Engineering.
• All the Bachelors and Masters Programmes in Computer Science Engineering will cover the Multilingual Computing aspect also.
• School curricula include basics of multilingual computing.
Typical illustration of Indian Language OCRs
[Figure: Hindi OCR input and OCR output]
Efficiency 96.8%, working for font size from 12-36
Illustration of online MAT system
• Machine Translation (MAT) – English to Hindi
http://anglahindi.iitk.ac.in
Simple Sentences.
sarala vaakya .
Welcome to London.
landana men aapakaa svaagata hai.
There are some cases which are still pending.
vahaan kuc'ha kesa hain jo abhii bhii nilamibata hain .
Researchers always want to go for that last 2% of performance. But it’s better to get a sufficient solution out fast and then continue to enhance it.
Mark Dean, IBM
(Source: Harvard Business Review, Aug 2002)
Hence the TDIL Program emphasizes collaborative development of language technology, and taking language technology products out to market rapidly for feedback and refinement.
Innovating to Innovate
Media Lab Asia : another initiative
World Computer (Lowcost PC)
Rural Operating Systems; Speech Interfaces For Local Dialects; Visual Language; Interfaces for All; Interlingua Web; Multi-Literate Interface; Literacy Learning Through Pictures
Bits for All (Universal Connectivity)
Rural WiFi, DakNet, Digital Gangetic Plain, Off-Line Internet Access, Rural VoIP
Tomorrow's Tools (Language Interfaces)
Mapping For the Masses, Community Access to Sustainable Health (Ca:sh), Building Robots Creating Science (BRICS), Digital Craft Revival, Digital Human Body, Digital Music, InfoSculpture, Suchik, Polysensors, Complex RF Impedance Analyzers, UV-VIS Spectrometer, Power Sensors, Think Cycle
Digital Village (Consolidation in delivering value to the masses)
Sustainable Access in Rural India, Community Connection, Digital Mandi, InfoThela
• Intelligent Human Computer Interaction
To support more sophisticated and natural input and output that promise knowledge or agent-based dialogue in which the interface gracefully handles errors and interruptions and dynamically adapts to the current context.
Typical properties :
Multimodal input - They process potentially ambiguous, imprecise combinations of mixed input such as written text, spoken language, gestures (e.g., mouse, pen, dataglove) and gaze.
Multimodal output - They design coordinated presentation of, e.g., text, speech, graphics, and gestures, which may be presented via conventional displays or animated, life-like agents.
Interaction management - mixed initiative interactions that are context-dependent based on system models of the discourse, user, and task
Trends in Language Technology
• Machine Translation
1970s: Narrow domain, rules-based approach
1980s: Practical MT systems, example-based approach, Interlingua and Transfer method.
1990s: Multilingual MT, simultaneous interpretation, example-based approach revisited, corpus-based and statistics-based approaches.
2000s: MT through NL understanding, language resources
• Speech Technology Development:
• Speech technology belongs to the field of Interactive Technologies. There is an ongoing shift from research on speech components to research on integrated speech systems. Together with speech come the other modalities that constitute full natural human-human communication (e.g. gesture, lip movements, facial expression, gaze, bodily posture), leading towards multimodal interactive systems.
• 1970s: Speech synthesis systems used rule-based formant synthesis. (Formants are the resonant frequencies of the vocal tract transfer function.)
• 1990s: Concatenative speech synthesis systems use small pieces of pre-recorded speech.
• There is a trend towards cross-project collaboration, synergy, critical mass, and deployable & scalable technologies.
• Trends in Digital Library Technologies
Multi-modal Input: Scanning, Smartizing (Value Addition), Content, Multi-lingual, Multi-media
Standardization: Character Code, Font Code, Semantic Indexing, DOI, XML, SCORM
Navigation: Browsing, Finding, Searching, Zooming, Hyperbolic Tree, Virtual Reality, Aboutness, Searching Mathematics, Multilingual Navigation, Translation Assistance
Architecture: Interoperability, Multi-lingual Information Access, Metadata, Resource Indexing & Discovery in Globally Distributed Digital Library
IPR Issues: 4Cs (Consortium for Compensation for Creative Content)
Knowledge Generation Capacity
Focus In 20th Century
Capitalistic & Monopolistic Trend In Publication & Dissemination.
Focus In 21st Century
Universalization Of Creativity.
The Interspace represents the third wave in the ongoing evolution of the Global Information Infrastructure, driven by rapid advances in computing and Information Technology during .
Future Knowledge Networks
The wave pattern roughly describes four distinct phases of functionality: fundamental research (trough), development of prototype systems (ascent), emergence of commercial systems (crest), and mass propagation (descent)
Scalable Semantics
Future knowledge networks will rely on scalable semantics, on automatically indexing the community collections so The knowledge networks of the Interspace will be connected via switching machines that switch concepts. Connectivity and training continue to be the principal barriers to integrating the global network of libraries.
Interspace focuses on scalable technologies for semantic indexing that work
generally across all subject domains. We can use concept spaces
(collections of abstract concepts generated from concrete objects) to boost
searches by interactively suggesting alternative terms. We can use category
maps to boost navigation by interactively browsing clusters of related
documents. Scalable semantics is used to index the semantics of document
contents on large collections. Concept spaces use text documents as the
objects and noun phrases as the concepts.
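As a rough illustration of how a concept space can suggest alternative terms, the sketch below counts how often noun phrases occur together in the same document and ranks the companions of a query term. It is a simplified stand-in, not the Interspace implementation: the function names are hypothetical, the noun phrases are assumed to be extracted already, and raw co-occurrence counts replace whatever statistical weighting a real system would use.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_concept_space(docs):
    """Count document-level co-occurrence between noun phrases.

    docs: a list of documents, each given as a list of noun phrases
    (phrase extraction is assumed to have happened elsewhere).
    """
    cooc = defaultdict(Counter)
    for phrases in docs:
        # Each unordered pair of distinct phrases in a document counts once.
        for a, b in combinations(sorted(set(phrases)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def suggest(cooc, term, k=3):
    """Return up to k phrases that co-occur most often with term."""
    neighbours = cooc.get(term, Counter())
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(neighbours.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:k]]


docs = [
    ["digital library", "semantic indexing", "metadata"],
    ["digital library", "semantic indexing"],
    ["digital library", "video retrieval"],
]
space = build_concept_space(docs)
print(suggest(space, "digital library", k=2))
```

Because the counting is purely statistical, the same code works for any subject domain or language once phrases are available, which is the sense in which such semantics is scalable.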
Summing up the Challenges Ahead
• ML Open Source Software
- Shareable Software
- Standards database and updating
- Support service & Help line
- Consortium approach
- GPL with performance, else Garbage In Garbage Out
• Benchmarking & Standards
- testing against international standards
- active participation in evolving standards
• Information Technology Culture
- Awareness: IT Clinic, Workshops, media
- BIPK (Basic Information Processing Kit) with user-friendly, easy-to-use, affordable, scalable, interoperable and re-usable tools. BIPK may consist of a Star Office-like processing facility, fonts, KB driver, spell checker, dictionary and conversion utility.
- Entrepreneurship: Gyanaudyog workshops.
.... Challenges Ahead
• Cross-lingual Information Access
- Search engine, Web Crawler, on-line machine translation.
• Localization
- Localization of software and content into local languages
- Enlarging share in localization outsourcing ($8 Bn by 2006: IDC)
• International Collaboration in Language Informatics
- Industry-academia cooperation in joint research & technology development projects.
- Exchange of faculty and students
- HRD programs in Knowledge Engineering & Computational Linguistics
• Rise, Raise & Race
- Possess basic language technologies
- Promote Collectivistic Culture
- Think globally & act locally
- Collaborate for innovation
Digital Library is a means to meet the end :
Objective of Universalization of Creativity