3RD INTERNATIONAL CONFERENCE ON LINGUISTIC AND CULTURAL DIVERSITY IN CYBERSPACE - 28 June - 3 July, 2014 Yakutsk, Russia

  • 3RD INTERNATIONAL CONFERENCE ON LINGUISTIC AND CULTURAL DIVERSITY IN CYBERSPACE - 28 June - 3 July, 2014 Yakutsk, Russia

  • Daniel Pimienta
    pimienta@funredes.org

    Networks & Development Foundation
    http://funredes.org

    Observatory of languages & cultures in the Internet
    http://funredes.org/lc

    Executive Committee Member of
    http://maaya.org

  • A methodology for exploring the situation of French & languages of France on the Internet, which could apply to other groups of languages.

    Daniel Pimienta and Daniel Prado MAAYA, May 2014

    Mayotte

  • CREDITS

    The methodology merges the results of two independent studies carried out in 2013 by the team D. Prado / D. Pimienta on behalf of MAAYA:

    A study mandated by the OIF on the space of French on the Internet

    A study mandated by the General Delegation for the French Language and the Languages of France (DGLFLF) of the Ministry of Culture on the space of the languages of France on the Internet

  • TWO COMPLEMENTARY APPROACHES

    FRENCH, a language ranked 8th by number of speakers (L1+L2)

    OTHER MINORITY LANGUAGES spoken in the territories of France

  • ANTECEDENTS

    DIFFICULTIES IN PRODUCTION OF INDICATORS

    DILINET PROJECT

    DILINET PROJECT STATUS

    MEANWHILE

  • LINGUISTIC DIVERSITY INDICATORS PARADOX

    [Figure: INTEREST and CAPACITY plotted on a timeline from 1988 to 2014]

  • LINGUISTIC DIVERSITY INDICATORS PARADOX

    [Figure: INTEREST and CAPACITY plotted on a timeline from 1988 to 2010, labeled with FUNREDES/UL, LOP, ALIS/ISOC, OCLC, FUNREDES, XEROX, IDESCAT]

  • WHAT INDICATORS DO WE HAVE?

    Internet users per language (source: InternetWorldStats), until 2011

    Web pages per language (not all!), until 2008

    Other indicators per country (FUNREDES/UL), until 2008

  • WHERE IS THE BOTTLENECK?

    The two main indicator-building activities rely:

    on crawling ccTLDs for languages of Asia, Africa and the Caribbean and applying language recognition algorithms (LOP).

    on using the counting capacity of search engines and their large percentage of web coverage (FUNREDES/UNION LATINA).

  • WHERE IS THE BOTTLENECK?

    But

    The web is getting too large for traditional crawling (close to infinite!).

    Search engines no longer index a substantial part of it (80% → 5%).

    Search engine counts have become unreliable.

    And in any case, all we get is static data, mostly focused on the number of web pages per language.

  • A RESEARCH PROJECT

    Collaboration between UNESCO, OIF and UNION LATINA, with the participation of ITU.

    High-profile partners: ERCIM, MAAYA, UNESCO, OIF, FUNREDES, EXALEAD, UPC, DIALOGIC, CNRS/LIMSI, FRAUNHOFER, CWI, VOCAPIA, NIELSEN

    Substantial investment (estimated at 300,000 euros, direct and indirect)

  • PROCESS

    Proposals submitted to 2 EU FP7 calls:

    Jan. 2012: Integrated Project of 7 million euros for ICT-2011.4.4 Intelligent Information Management

    Jan. 2013: Specific Targeted Research Project of 3 million euros for ICT-2013.4.1 Content analytics and language technologies - Cross-media content analytics

    2 near misses, reflecting low EU interest in the theme

    New attempt in progress with partners from Qatar, with LOP on board

  • MEANWHILE

    InternetWorldStats stopped updating 3 years ago

    An interesting new player, W3Techs, but limited to the top 10 million sites (2% of all sites)

    The web is evolving towards dynamic pages, video and social networks

    The context calls for alternative approaches

  • PART 1: MEASURING FRENCH

    Defining a large set of spaces and applications to get data from.

    Searching for a large number of Internet sites which offer linguistic or country data for those spaces/applications.

    Applying appropriate selection criteria to this set of sites.

    Collecting, compiling and organizing the data.

    Crossing Internet data with reliable demo-linguistic data.

    Putting the results in perspective.

  • P1: SPACES & APPLICATIONS

    UP TO 100, split into the following categories:

    Spaces: Infrastructure, Online library, Smartphones, VOIP/Chat, Operating systems, Browsers

    Applications: Office applications, Web 2.0, Search engines, Email, P2P

  • P1: SOURCES

    Traditional sources (UN, UNESCO, ITU, OECD, EU) have little linguistic data but plenty of country data

    Most non-traditional sources are either:

    Marketing companies offering free glimpses of expensive data

    Experts demonstrating their capacity through reports

    The lifespan of non-traditional sources is often short.

  • SOURCE SELECTION CRITERIA

    Too small scope

    Too biased

    Not recently updated

    Methodology not reliable

  • SELECTED SOURCES

    more than 200 sources

    fewer than 100 sources

    10 = excellent

    < 5 = Not used but kept for future check

  • SOURCES PARAMETERS

    Title

    URL

    Publication year

    Rating (0 to 10)

    Focus (worldwide, Europe, France, USA, OECD)

    Frequently updated (y/n)

    Type of source (meta, general, space, application, book, report, paper, webpage)

    Application or space concerned

    Language specific (y/n)

    Comments
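
    The source parameters and the rating rule ("< 5 = Not used but kept for future check") could be captured in a small record type. The sketch below is only an illustration: the class name, field names, helper function and sample entries are assumptions, not part of the original study.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """One catalogued source, with the parameters listed on the slide."""
    title: str
    url: str
    publication_year: int
    rating: int                # 0 to 10, 10 = excellent
    focus: str                 # worldwide, Europe, France, USA, OECD
    frequently_updated: bool
    source_type: str           # meta, general, space, application, book, report, paper, webpage
    application_or_space: str
    language_specific: bool
    comments: str = ""

    def __post_init__(self):
        if not 0 <= self.rating <= 10:
            raise ValueError("rating must be between 0 and 10")


def split_by_rating(sources, threshold=5):
    """Return (used, parked): sources rated below the threshold are set aside, not deleted."""
    used = [s for s in sources if s.rating >= threshold]
    parked = [s for s in sources if s.rating < threshold]
    return used, parked


# Hypothetical entries, for illustration only.
catalogue = [
    Source("Example browser statistics", "http://example.org/browsers", 2013, 8,
           "worldwide", True, "application", "Browsers", False),
    Source("Example blog survey", "http://example.org/blogs", 2009, 3,
           "France", False, "report", "Blogs", True, "not recently updated"),
]

used, parked = split_by_rating(catalogue)
print([s.title for s in used])
print([s.title for s in parked])
```

    Keeping the low-rated sources in a separate list, rather than dropping them, mirrors the "kept for future check" rule.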

  • DEMO-LINGUISTIC DATA

    No institutional support → low data quality

    Large and diverse geography → divergent data

    Main demo-linguistic sources → divergent data

    Language typology → boundary dilemma

    L2 counting

  • DEMO-LINGUISTIC CHOICES

    ETHNOLOGUE FOR L1 (for homogeneity)

    DIVERSE SOURCES FOR L2 (for reliability)

    WIKIPEDIA FOR COUNTRY DEMOGRAPHIC

    INTERVAL DATA FOR SOME SPACE/APPLICATION

  • PUT IN PERSPECTIVE

    I = A x B x C x D / 1000

    A = Level of world relevance (0 to 10)

    B = Level of reliability of source (0 to 10)

    C = Level of trust for French (0 to 10)

    D = Level of relevance for French (0 to 10)

    P = Direct weighting

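    | ELEMENT | A | B | C | D | I | L1 | L1+L2 (L12) | P | L1xI | L12xI | L1xP | L12xP | TYPE |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | Viadeo | 2 | 5 | 7 | 10 | 7 | | 1 | 6 | 0 | 7 | 0 | 6 | RS |
    | Tumblr | 6 | 6 | 7 | 6 | 15 | 4 | 2 | 4 | 60 | 30 | 16 | 8 | RS |
    | Hotmail | 5 | 5 | 6 | 6 | 9 | | 2 | 4 | 0 | 18 | 0 | 8 | APP |
    | Open office | 9 | 9 | 9 | 8 | 58 | | 2 | 5 | 0 | 117 | 0 | 10 | APP |
    | Blogs.com | 6 | 7 | 7 | 5 | 15 | | 2 | 5 | 0 | 29 | 0 | 10 | BLOG |
    | Ning | 7 | 7 | 7 | 8 | 27 | 6 | | 5 | 165 | 0 | 30 | 0 | RS |
    | Msn | 7 | 7 | 7 | 6 | 21 | 6 | | 5 | 123 | 0 | 30 | 0 | APP |
    | Wordpress | 8 | 7 | 7 | 7 | 27 | 7 | | 5 | 192 | 0 | 35 | 0 | BLOG |
    | AVERAGE | | | | | | 6.8 | 4.2 | | 7.4 | 4.3 | 7.2 | 4.2 | |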
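
    The weighting can be illustrated with a short sketch. Two points are assumptions rather than statements from the slides. First, the slide gives the divisor as 1000, while the I values shown in the table (7 for Viadeo, 15 for Tumblr) match a divisor of 100, so the divisor is left as a parameter. Second, the weighted mean position is assumed to be sum(position x I) / sum(I); the slide does not spell this out. Function and variable names are illustrative.

```python
def importance(a, b, c, d, divisor=1000):
    """I = A x B x C x D / divisor, with each level on a 0 to 10 scale."""
    for level in (a, b, c, d):
        if not 0 <= level <= 10:
            raise ValueError("levels must be between 0 and 10")
    return a * b * c * d / divisor


def weighted_position(rows, divisor=1000):
    """Assumed weighted mean position: sum(position * I) / sum(I)."""
    weights = [importance(*levels, divisor=divisor) for levels, _ in rows]
    total = sum(weights)
    return sum(w * pos for w, (_, pos) in zip(weights, rows)) / total


# (A, B, C, D) levels and position of French as L1, taken from the
# table rows that report an L1 position.
l1_rows = [
    ((6, 6, 7, 6), 4),  # Tumblr
    ((7, 7, 7, 8), 6),  # Ning
    ((7, 7, 7, 6), 6),  # Msn
    ((8, 7, 7, 7), 7),  # Wordpress
]

print(importance(2, 5, 7, 10, divisor=100))  # Viadeo: 7.0
print(round(weighted_position(l1_rows), 2))
```

    Because the divisor scales every weight equally, it cancels out of the weighted mean: only the I column itself depends on whether 100 or 1000 is used.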

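  • ANALYZE PER TYPE

    * = Only one source

    | Type of space | L1 | L1+L2 |
    |---|---|---|
    | BLOGS | 6.5 | 3.3 |
    | APPLICATIONS | 6.7 | 3.6 |
    | SOCIAL NETWORKS | 7 | 4 |
    | INFRASTRUCTURES | 7.9 | 4 |
    | USERS | 9 | 4* |
    | CONTENTS | 8 | 4.1 |
    | VIDEO | 7 | 6* |

    Rows with a single value (column not recoverable from the slide): BOOKS 3*, P2P 6.3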

  • CONCLUSION P1

    French, as a first language, can be placed just above, but close to, position 7 on the Internet, all elements combined.

    French, as a first and second language, can be placed just above, but very close to, position 4.

  • CONCLUSION P1

    French, in spite of its lower demographic strength, is in close competition on the Internet, depending on the space/application, with Spanish, German, Japanese and Portuguese, and to some extent with Russian and Arabic.

  • CONCLUSION P1: TRENDS

    Strongly emerging languages (competing with English): Chinese (will overtake English), Spanish

    Emerging languages (competing with French): Hindi, Bengali, Russian, Arabic

    New players: Urdu, Indonesian

  • CONCLUSION P1

    Most elements of the methodology should work for other languages of wide international scope, such as Arabic, Portuguese, Spanish or Russian.

  • PART 2: LANGUAGES OF FRANCE

    Mayotte

  • SELECTION OF LANGUAGES OF FRANCE FOR THAT STUDY

    Alsatian, Basque, Breton, Catalan, Corsican, Creole (*), Flemish, Frankish, Franco-Provençal, Futunan, Languages of Mayotte (*), Oïl languages (*), Kanak languages (*), Occitan (*), Tahitian, Wallisian

    (*): family of languages

  • SELECTION CRITERIA

    Territory-based languages (no immigration languages)

    Subset with a higher probability of Internet presence:

    more than 50,000 speakers, or

    used in official teaching
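
    Read as a rule, the criteria above amount to: territory-based AND (more than 50,000 speakers OR used in official teaching). A minimal sketch follows; the record type, field names and sample figures are hypothetical and do not come from the study.

```python
from dataclasses import dataclass


@dataclass
class Language:
    name: str
    speakers: int            # estimated number of speakers
    territorial: bool        # False for immigration languages
    taught_officially: bool  # used in official teaching


def is_selected(lang, min_speakers=50_000):
    """Territory-based, and either above the speaker threshold or officially taught."""
    return lang.territorial and (lang.speakers > min_speakers or lang.taught_officially)


# Placeholder languages with made-up figures, for illustration only.
candidates = [
    Language("Language A", 200_000, True, False),   # enough speakers
    Language("Language B", 10_000, True, True),     # few speakers, but taught
    Language("Language C", 10_000, True, False),    # neither criterion met
    Language("Language D", 500_000, False, True),   # immigration language
]

selected = [lang.name for lang in candidates if is_selected(lang)]
print(selected)
```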

  • Language families

    Creole: Martinique, Guadeloupe, Guyane, La Réunion

    Occitan: auvergnat, gascon, languedocien, limousin, provençal, vivaro-alpin

    Kanak: ajië, drehu, nengone, paicî, xârâcùù (+ 24 more not studied)

    Languages of Mayotte: kibushi and shimaoré

  • Languages terminology

    Alsatian: alemannic, alemannisch, alsacien, elsaessisch, elsässisch, etc.

    Basque: biscayan, gipuzkera, gipuzkoan, guipuzcoan, guipuzcoano, euskera, euskara, roncalese, vasco, vascuense, vizcaino, etc.

    Catalan: Aiguavivan, Algherese, Aragonais oriental, Balear, Català, Catalán, Catalan-Valencian-Balear, Eivissenc, Mallorqui, Menorqui, Menorquin, Lleidat, Pallarese, Ribagorçan, Valencià, Valenciano, etc.

    Corsican: corsu, corsican, corsi, corso, sartenais, venaco, vico-ajaccio, etc.

  • Languages terminology

    Francique mosellan: lothrnger ditsch, lothringer deutsch, lothringer plattm, lothrnger deitsch, lothrnger deitsch, lothrnger platt, francique luxembourgeois, francique mosellan, platt, etc.

    Futunian: fakafutuna

  • Languages terminology

    Franco-Provençal: arpetan, arpian, arpitan, arpitano, brass, burgondan, burgonds, dauphinois, delfinese, dialetto, faetar, francoprovenl, friborgs, fribourgeois, genevois, harpitan, lyon, lyonnais, mcons, neuchatelais, neuchâtelois, patois, patoua, patous, romand, romand, savoiardo, savoyard, savoyrd, tot-parier, valaisan, valdostano, valdôtain, valdtn, valsan, vaudois, vdous

  • Languages terminology

    Langue d'oïl: angevin, berrichon, bourbonnais, bourguignon-morvandiau, brionnais-charolais, champenois, frain-comtou, franc-comtois, gallo, langue comtoise, lorrain, mâconnais, manceau, maraîchin, mayennais, normand, normand méridional, picard, poitevin, poitevin-saintongeais, saintongeais, wallon, etc.

  • Languages terminology

    Occitan: béarnais, aspois, girondin, lemozin, limousin, médocain, mondin, monegasque, neugue, niçois, nissard, nissart, occitanien, occitanique, parler d'oc, romans, patois, proensal, raimondin, rouergat, etc.

    Shibushi: malgache de Mayotte, kibushi kimaore, kibushi kiantalaoutsi, kibushi, kibuki, bushi

    Tahitian: reo tahiti

    Wallisian: fakauvea, faka uvea, ouva
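
    One way such terminology lists could be put to use is as a lookup from any name found in a source to a single language label. The sketch below is an assumption about usage, not a description of the authors' tooling. It uses a handful of names from the lists above, and the dictionary and function names are illustrative.

```python
# A few alternative names taken from the terminology lists; not exhaustive.
ALIASES = {
    "Basque": ["euskara", "euskera", "vasco"],
    "Corsican": ["corsu", "corso"],
    "Franco-Provençal": ["arpitan", "savoyard"],
    "Futunan": ["fakafutuna"],
    "Tahitian": ["reo tahiti"],
    "Wallisian": ["fakauvea", "faka uvea"],
}

# Reverse index: case-folded alternative name -> language label.
LOOKUP = {
    alias.casefold(): language
    for language, names in ALIASES.items()
    for alias in names
}


def canonical(term):
    """Return the language label for a name, or None if the name is unknown."""
    normalized = " ".join(term.split()).casefold()  # collapse whitespace, ignore case
    return LOOKUP.get(normalized)


print(canonical("Euskara"))
print(canonical("  REO   tahiti "))
print(canonical("klingon"))
```

    A name such as "patois", which appears under both Franco-Provençal and Occitan, cannot be resolved by a one-to-one index like this one; it would need a list-valued lookup or manual review.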

  • DIFFERENT METHODOLOGY

    The