Big & Open Data Challenges for Smartcity (PIC 2014, Shanghai)

Big and Open Data Challenges for Smartcity

Victoria López, Grupo G-TeC, www.tecnologiaUCM.es, Universidad Complutense de Madrid

Big and Open Data. Challenges for Smartcity
- Introduction
- Fighting with Big Data: Genome Data
- Big Data. Big Projects
- Open Data. Technology Transfer Opportunities
- Smartcity. Big and Open Systems
- Madrid as Smartcity
- Conclusions

Introduction

Our goal: to transfer technology and knowledge
- Mobile technologies applied to the environment
- Intelligent agents
- Optimization and forecasting from data
- Bioinformatics, biostatistics
G-TeC group: statisticians, physicists, mathematicians, economists and several computer scientists.
www.tecnologiaUCM.es

Motivation & Evolution

GRASIA: Intelligent agents and software engineering

Fighting with Big Data
- Every day we need to deal with more and more data.
- For many years, new computers with more memory and higher speed seemed to be the solution to data growth (the "elephant" vendors).
- Many research areas were already fighting with big data. In bioinformatics, genome data (DNA, RNA, proteins) and biological data in general have been collected by computer and stored in large databases in laboratories and research centers around the world.

The future of genomics rests on the foundation of the Human Genome Project

Fighting with Big Data
- Each time an organization or an individual is not able to deal with its data, it is facing a big data problem.
- The Human Genome Project was managed with the same philosophy as modern Big Data: large databases distributed around the world, with parallel processing when available and suitable.
- Our experience: sequence alignment and its optimization with dynamic programming and heuristics.
- The body of biological data is itself a big database.
- Adding new sequences, searching and forecasting are tasks very similar to those we face in every Big Data problem.
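As an illustration of the dynamic-programming approach to sequence alignment mentioned above, here is a minimal sketch of the Needleman-Wunsch global alignment score. The scoring values (match, mismatch, gap) and the function name are assumptions chosen for the example; this is not the group's implementation.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Score of the best global alignment of a and b (Needleman-Wunsch).

    The scoring parameters are illustrative assumptions.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = prev[j - 1] + pair  # align a[i-1] with b[j-1]
            up = prev[j] + gap             # a[i-1] against a gap
            left = curr[j - 1] + gap       # b[j-1] against a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("gattaca", "gattaca"))
    print(global_alignment_score("ac", "c"))
```

Only two rows of the table are kept in memory, which is enough when just the score is needed. Recovering the alignment itself would require the full table or a traceback structure.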

20/05/2014
Vineyards in La Geria, Lanzarote

Use Case: Looking for a Fungus
- Application to infections in agricultural crops when it is not possible to identify the actual fungus.
- The person responsible needs to decide what to do, which treatment to apply, or which procedure is better.
- A fragment of the fungus DNA must be sequenced in the lab.
- The scientist then looks for it in molecular databases by means of sequence searching (DB homology search).
- Alignment algorithms (BLAST, FASTA) are executed to return the best matches.
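The database search step can be pictured with a heavily simplified sketch of word-based (k-tuple) matching, the idea FASTA uses to find candidate sequences quickly. Here library entries are ranked only by the number of k-tuples they share with the query. Real FASTA goes further: it locates matching diagonals, joins regions and rescores the best candidates with dynamic programming. The function names and the toy library are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ktuples(sequence: str, k: int) -> Counter:
    """Count every overlapping word of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def rank_by_shared_ktuples(query: str,
                           library: Dict[str, str],
                           k: int = 6) -> List[Tuple[str, int]]:
    """Rank library entries by shared k-tuples with the query, best first."""
    query_words = ktuples(query, k)
    scores = []
    for name, sequence in library.items():
        # Counter intersection keeps the minimum count of each shared word.
        shared = query_words & ktuples(sequence, k)
        scores.append((name, sum(shared.values())))
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy library with made-up entries, for illustration only.
    toy_library = {"far": "ggggcccc", "close": "ttacgtacgaa"}
    print(rank_by_shared_ktuples("acgtacgt", toy_library, k=3))
```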

The sequence:
gtttacgctctacaaccctttgtgaacatacctacaactgttgcttcggcgggtagggtctccgcgaccctcccggcctcccgcctccgggcgggtcggcgcccgccggaggataaccaaactctgatttaacgacgtttcttctgagtggtacaagcaaataatcaaaacttttaacaaccggatctcttggttctggcatcgatgaagaacgcagcgaaatgcgataagtaatgtgaat

EBI: European Bioinformatics Institute
- Choose among the tools available on the web site: Fasta3.
- Select DATABASE: NUCLEIC ACIDS, FUNGI.
- Enter the sequence and run the query.
- A sorted (but not complete) list, from best to worst similarity, is returned.

Database and Algorithm Selection
PIC 2014, Shanghai

Use Case: screenshots of the EBI workflow
- EBI web site
- Web toolbox in EBI
- Algorithm: Fasta 3
- Databases: Nucleic acids, Fungi
- Enter the sequence and run FASTA 3

The output

FASTA searches a protein or DNA sequence data bank
version 3.3t09 May 18, 2001
Please cite: W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

@:1-: 241 nt
vs EMBL Fungi library
searching /ebi/services/idata/v225/fastadb/em_fun library

104701680 residues in 66478 sequences
statistics extrapolated from 60000 to 61164 sequences
Expectation_n fit: rho(ln(x))= -1.2290+/-0.000361; mu= 72.1313+/- 0.026
mean_var=907.6270+/-295.007, 0's: 68 Z-trim: 4246 B-trim: 15652 in 3/79
Lambda= 0.0426

FASTA (3.39 May 2001) function [optimized, +5/-4 matrix (5:-4)] ktup: 6
join: 48, opt: 33, gap-pen: -16/ -4, width: 16
Scan time: 3.180

The best scores are: opt bits E(61164)
EM_FUN:CGL301988 AJ301988.1 Colletotrichum glo (1484) [f] 1184 88 5.7e-17
EM_FUN:AF090855 AF090855.1 Colletotrichum gloe ( 500) [f] 1205 88 7.3e-17
EM_FUN:CGL301986 AJ301986.1 Colletotrichum glo (1484) [f] 1166 87 1.2e-16
EM_FUN:CGL301908 AJ301908.1 Colletotrichum glo (2868) [f] 1148 87 1.3e-16
EM_FUN:CGL301909 AJ301909.1 Colletotrichum glo (2868) [f] 1148 87 1.3e-16
EM_FUN:CGL301907 AJ301907.1 Colletotrichum glo (2867) [f] 1148 87 1.3e-16
EM_FUN:CGL301919 AJ301919.1 Colletotrichum glo (1171) [f] 1166 87 1.6e-16
EM_FUN:CGL301977 AJ301977.1 Colletotrichum glo (1876) [f] 1148 86 2e-16
EM_FUN:CFR301912 AJ301912.1 Colletotrichum fra (2870) [f] 1137 86 2.1e-16
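The "best scores" lines of this output are regular enough to be read by a program. The following sketch parses one hit line into its fields: entry, accession, truncated description, sequence length, strand, opt score, bit score and E-value. The regular expression and the field names are assumptions derived only from the lines shown in this output, not an official FASTA parser.

```python
import re
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    entry: str
    accession: str
    description: str
    length: int
    strand: str
    opt: int
    bits: float
    evalue: float


# Pattern inferred from the hit lines shown in the slide (an assumption).
HIT_LINE = re.compile(
    r"^(?P<entry>\S+)\s+(?P<accession>\S+)\s+(?P<description>.*?)\s+"
    r"\(\s*(?P<length>\d+)\)\s+\[(?P<strand>[fr])\]\s+"
    r"(?P<opt>\d+)\s+(?P<bits>\d+(?:\.\d+)?)\s+(?P<evalue>\d[\d.]*(?:e[+-]?\d+)?)$"
)


def parse_hit(line: str) -> Optional[Hit]:
    """Return the parsed hit, or None if the line is not a hit line."""
    m = HIT_LINE.match(line.strip())
    if m is None:
        return None
    return Hit(m["entry"], m["accession"], m["description"],
               int(m["length"]), m["strand"], int(m["opt"]),
               float(m["bits"]), float(m["evalue"]))


if __name__ == "__main__":
    sample = "EM_FUN:AF090855 AF090855.1 Colletotrichum gloe ( 500) [f] 1205 88 7.3e-17"
    print(parse_hit(sample))
```

Sorting the parsed hits by `evalue` gives the ranking from best to worst similarity that the web tool reports.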

Our Background in Bioinformatics
- Bioinformatics (Master's in Research in Informatics, UCM)
- Several Master's theses and publications
- Alignment of sequences with R and RHadoop*
- Analysis and visualization with the R language and Chernoff faces
- Others

Big Data

From Data Warehouse to Big Data (large databases)
- 1970: relational model invented
- RDBMS declared mainstream until the 90s
- One size fits all: "elephant" vendors, heavily encoded, even indexing by B-trees.

Alex 'Sandy' Pentland, director of the Media Lab at the Massachusetts Institute of Technology (MIT): "The big data revolution", Campus Party Europe 2013
- Nowadays business needs high availability of data, so new techniques must be developed: complex analytics, graph databases.
- Data volume is increasing exponentially: a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB.
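The growth figures on this slide can be checked with a few lines of arithmetic. The sketch below derives the overall factor, the implied yearly growth rate and the doubling time. It assumes constant exponential growth over the eleven years from 2009 to 2020, which is a simplification added here, and the function name is hypothetical.

```python
import math


def growth_summary(start_zb: float, end_zb: float,
                   start_year: int, end_year: int) -> dict:
    """Overall factor, yearly factor and doubling time for steady exponential growth."""
    years = end_year - start_year
    factor = end_zb / start_zb
    yearly = factor ** (1 / years)              # constant yearly multiplier
    doubling = math.log(2) / math.log(yearly)   # years needed to double
    return {"factor": factor, "yearly": yearly, "doubling_years": doubling}


if __name__ == "__main__":
    summary = growth_summary(0.8, 35.0, 2009, 2020)
    print(f"overall factor: {summary['factor']:.2f}x")
    print(f"yearly growth: {(summary['yearly'] - 1) * 100:.0f}%")
    print(f"doubling time: {summary['doubling_years']:.1f} years")
```

Under that assumption the volume grows by roughly 41% per year, which means it doubles about every two years.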

Big Data

Unstructured data

Who generates Big Data?

Progress and innovation are no longer hampered by the ability to collect data, but by the ability to manage, analyze, synthesize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable way.
- Social networks (public profiles)
- Scientific/mobile devices
- Smartphones
- Sensors everywhere

Big Data

Big Data: 3+1+1 Vs

Value

From Data to Value
- Big Data Collection: monitoring; data cleaning and integration; hosted data platforms and the cloud
- Big Data Storage: modern databases; distributed computing platforms; NoSQL, NewSQL
- Big Data Systems: security; multicore scalability; visualization and user interfaces
- Big Data Analytics: fast algorithms; data compression; machine learning tools; visualization and reporting

The list of stages proposed by MIT for dealing with Big Data

Big Data in Use
- High availability is now a requirement
- Hosting (not only in-house) and cloud computing
- Running in parallel
- Data aggregation process
- Analytics on data
- Graph DBMSs, similarities
- Not only SQL: Cassandra* and MongoDB**

*The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance.
**Document-oriented storage

MONGO

Main feature: scalability to many nodes
- Scan of 100 TB on 1 node at 50 MB/s = 23 days
- Scan on a cluster of 1000 nodes = 33 minutes

MapReduce
- Parallel programming model
- Simple, smart concept, suitable for multiple applications
- Big datasets, multi-node on multiprocessors
- Sets of nodes: clusters or grids (distributed programming)
- By Google (2004)
- Able to process 20 PB per day
- Based on map and reduce, classical methods in functional programming related to the classic divide and conquer
- Comes from numerical analysis (big matrix products)
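The scan-time figures quoted above follow from simple arithmetic. The sketch below reproduces them, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly even parallelism with no coordination overhead. Both assumptions are added here for the example.

```python
TB = 10 ** 12  # bytes, decimal units (assumption)
MB = 10 ** 6   # bytes, decimal units (assumption)


def scan_seconds(data_bytes: float, bytes_per_second_per_node: float,
                 nodes: int = 1) -> float:
    """Time to read the data once when it is split evenly across the nodes."""
    return data_bytes / (bytes_per_second_per_node * nodes)


if __name__ == "__main__":
    single = scan_seconds(100 * TB, 50 * MB)
    cluster = scan_seconds(100 * TB, 50 * MB, nodes=1000)
    print(f"1 node: {single / 86400:.1f} days")
    print(f"1000 nodes: {cluster / 60:.1f} minutes")
```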

Big Data: MapReduce

Friendly for non-technical users
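The map and reduce phases can be sketched on a single machine with the classic word-count example. This is a plain Python illustration of the programming model only. It is not Hadoop code, and the function names are hypothetical. In a real cluster, the map calls and the per-key reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "open data", "big big city"]))
```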

- Hadoop is an open-source implementation of the MapReduce computational model.
- Used by Yahoo!, Facebook, Twitter, Amazon, eBay.
- Can be used on different architectures: both clusters (in-house) and grids (cloud computing).
- Storm and Spark follow a related model, working in memory instead of on disk.
- http://hadoop.apache.org/

Big Data: Hadoop

More technical information: http://www.slideshare.net/vlopezlo

www.hortonworks.com www.coursera.com www.Bigdatauniversity.com www.mit.edu

Technology Transfer Opportunities

A great opportunity for researchers working on technology transfer, who can increase their efforts in developing new techniques for optimizing:
- Monitoring data (sensors, smartphones, ...)
- Storing data (cloud computing, Amazon S3, EC2, Google BigQuery, Tableau, ...)
- Cleaning, integrating and processing data ("Data Curation at Scale: The Data Tamer System", M. Stonebraker et al., CIDR 2013)
- Analysing data (R, SAS, but also Google, Amazon, eBay, ...)
- Encryption and searching on encrypted data
- Data mining techniques (machine learning, data clustering, predictive models, ...) that are compatible with big data through complex analytics

Big Data. Big Projects.
- Google
- eBay
- Amazon
- Twitter
They develop big projects with their own big data, but many other businesses also obtain their data for analysis.
Government data. Public data.

Working with Big Data in G-TeC group

Academia & Industry Working Together

Open Data

"Open data is data that can be freely used, reused and redistributed by anyone, subject only, at most, to the requirement to attribute and share alike." (OpenDefinition.org)
- Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.
- Reuse and Redistribution: the data must be provided under terms that permit reuse and redistribution, including the intermixing with other datasets. The data must be machine-readable.
- Universal Participation: everyone must be able to use, reuse and redistribute; there should be no discrimination against fields of endeavour or against persons or groups. For example, "non-commercial" restrictions that would prevent "commercial" use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.

Open Data

Why Open Data, by the Open Knowledge Foundation

Open Data for Smartcity

What can a citizen expect when living in a city?
- Internet of Things
- Libraries
- Public transportation, traffic monitoring
- Pets, devices, cars, even people
- Intelligent agents
- Interacting without our control
- Credit card control (BBVA use case)

CKAN

The Comprehensive Knowledge Archive Network (CKAN) is a web-based open source data management system for the storage and distribution of data, such as spreadsheets and the contents of databases. It is inspired by the package management capabilities common to open source operating systems like Linux.

Its code base is maintained by the Open Knowledge Foundation. The system is used both as a public platform on Datahub and in various government data catalogues (the UK's data.gov.uk, the Dutch National Data Register, the United States government's Data.gov and the Australian government's "Gov 2.0").
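CKAN catalogues expose an HTTP "Action API", so datasets can be searched from a program. The sketch below builds a `package_search` request and extracts dataset titles from the JSON reply. The catalogue address `https://catalog.example.org` is a placeholder and the helper names are hypothetical. The endpoint path and the `success` / `result` / `results` fields follow the CKAN Action API as commonly documented, but should be checked against the version run by the catalogue in question.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen


def package_search_url(base_url: str, query: str, rows: int = 5) -> str:
    """Build the URL of a CKAN package_search call."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response: dict) -> List[str]:
    """Extract dataset titles from a decoded package_search reply."""
    if not response.get("success"):
        return []
    return [d.get("title") or d.get("name", "")
            for d in response["result"]["results"]]


def search_catalogue(base_url: str, query: str, rows: int = 5) -> List[str]:
    """Query a live catalogue (needs network access)."""
    with urlopen(package_search_url(base_url, query, rows), timeout=10) as reply:
        return dataset_titles(json.load(reply))


if __name__ == "__main__":
    # Placeholder address: replace with a real CKAN catalogue to run a query.
    print(package_search_url("https://catalog.example.org", "recycling"))
```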

Open Data: Basic Structure

Client/server pattern (diagram labels): PUBLIC DATA, WEB SERVICE, SERVER, CLIENT, WEB SERVER
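The client/server structure in the diagram can be sketched with the Python standard library alone: a small web server publishes a public dataset as JSON, and a client requests it over HTTP. The dataset, the `/data` path and all names are hypothetical and stand in for the real open data services.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical public dataset used only for this illustration.
PUBLIC_DATA = [
    {"id": 1, "type": "recycling point"},
    {"id": 2, "type": "bike parking"},
]


class OpenDataHandler(BaseHTTPRequestHandler):
    """Server side: answers GET /data with the public dataset as JSON."""

    def do_GET(self):
        if self.path != "/data":
            self.send_error(404)
            return
        body = json.dumps(PUBLIC_DATA).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_public_data():
    """Start the server on a free local port, query it as a client, stop it."""
    server = HTTPServer(("127.0.0.1", 0), OpenDataHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/data")
        data = json.loads(connection.getresponse().read())
        connection.close()
        return data
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_public_data())
```

In a deployed system the server and the client would of course run on different machines. They share one process here only to keep the sketch self-contained.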

Smartcity Concept
- Large numbers of people. Big cities.
- Search 7 thousand differences Smartcity business.
- The role of technology in the city: efficiency and security
- Standardization of the Smartcity concept (May 2014)
- Better quality of life. Security. Sustainability.
- Innovation opportunities
- Multidisciplinary: social researchers, engineers, architects, ...
- Relationships are changing. Based on mobile technologies (smartphones, tablets, Internet of Things, ...)
- Cross-cutting development projects: sensors and monitoring devices, connectivity, platform, services in the cloud.

Smartcity Concept
- Large amount of unstructured information
- Machine learning, big data technologies, the Internet of Things and intelligent systems are needed.
- Technology development as a service in all areas:
  - Structure: environment, infrastructure (water, energy, material, mobility, nature), built domain
  - Society: public space, functions, people
  - Data: information flows, performance

Our Experience in Developing Systems for Madrid Open Data

Mariam Saucedo, Pilar Torralbo, Daniel Sanz

Recycla.me

Ana Alfaro, Sergio Ballesteros, Lidia Sesma

Héctor Martos, Álvaro Bustillo, Arturo Callejo

Belén Abellanas, Jaime Ramos, Ignacio P. de Ziriza

Victor Torres, Alberto Segovia, Miguel Bueno

Mar Octavio de Toledo, Antonio Sanmartín, Carlos Fernández

RESOURCE MAP: RECYCLA.TE

- Parks and gardens
- Parking for: cars, motorbikes, bikes
- Recycling points: fixed, mobile, clothes
- Stations: bioethanol, gas oil, electric
- Routes for bikes: cycle lanes, safe streets
- Residential priority areas

Madrid Smart City

RMap Demonstration

Open Data

Diagram labels: NEW DATA IS COLLECTED. A SERVICE IS GIVEN. QUERY. DATA TRANSFER.

Recycla.me

Data Analytics, Data Scientist

Value: FROM (UNSTRUCTURED) DATA TO VALUE

PIC 2014: MyConference

Be ready at PIC 2014 with MyConference
1. Main menu
2. Access to committees
3. Venue and location
4. Extra information

https://play.google.com/store/apps/details?id=es.ucm.myconference

Conclusions: Big Data, Open Data and Smartcity

A great opportunity for researchers working on technology transfer, who can increase their efforts in developing new techniques for optimizing:
- Monitoring data
- Storing data
- Cleaning, integrating and processing data
- Analysing data
- Encryption and searching on encrypted data
- Data mining techniques
There is also a great deal of future work in the development of new smart cities, in the areas of environment, security and infrastructure.

Thank you very much!
