

FF501 Projektkatalog

Project 1: Algorithms to identify clusters of similar objects in biomedical data
Supervisor: Tobias Frisch, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: Programming skills recommended
Group location: IMADA
Group size: At least 3 and at most 5 participants. Two groups can work on the project.
Comment: The project is particularly well suited to the computer science curriculum.
Intended for: The project is offered to all programmes except pharmacy students.
Keywords: Bioinformatics, Algorithms, Clustering

Description
Technological advances in many scientific areas lead to an increasing number of data sets and to larger individual data sets. This trend is also present in the biomedical area, where new biotechnological discoveries lead to more sensitive ways of analyzing complex biological systems such as the human cell. No scientist today can analyze these amounts of data manually, which is why they rely heavily on software specifically designed to analyze biomedical data in various ways. Often, such software is based on algorithms for formalized computer science problems, which can be applied to, but are not limited to, biomedical data. One such class of algorithms is clustering methods, which belong to the field of unsupervised learning. They help scientists get an overview of their data and, based on this overview, suggest reasonable (and expensive) follow-up experiments.

Assuming that a data set is a collection of data for a set of objects, clustering approaches identify groups (called clusters) of similar objects (e.g. genes, proteins, patients, ...) in such a data set. Clustering approaches are used in many different scientific fields, which all define the "similar" in "groups of similar objects" slightly differently. Thus, there is no unique definition of the Clustering Problem, and many clustering tools exist that identify different clusterings for the same data set. The same holds for the validation of the identified clusterings: various mathematically defined indices exist to measure how well an identified clustering separates the objects of a data set into distinct groups.

Tasks:

- Study the mathematical models and algorithms behind clustering
- Become familiar with the available clustering software tools
- Analyse ways of assessing the quality of clustering results
- Apply the acquired knowledge to analyze biomedical data sets
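As an illustration of what a validation index can look like, the sketch below computes the mean silhouette coefficient, one commonly used index, for two different clusterings of the same data. The catalogue does not say which indices the project will use, so the choice of index, the function name `silhouette` and the toy points are all assumptions made for illustration.

```python
from math import dist
from statistics import mean

def silhouette(points, labels):
    """Mean silhouette coefficient of a clustering (closer to 1 = better separated)."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = mean(own)                                  # mean distance within own cluster
        b = min(mean(d) for d in by_cluster.values())  # mean distance to nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)

# Toy data (made up): two visibly separate groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
good = [0, 0, 0, 1, 1, 1]   # labels that follow the two groups
bad = [0, 1, 0, 1, 0, 1]    # labels that mix the two groups

print(silhouette(points, good), silhouette(points, bad))
```

The clustering that follows the visible groups scores higher than the one that mixes them, which is the kind of comparison such an index is meant to support.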

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
- T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, Second Edition, 2009. https://web.stanford.edu/~hastie/Papers/ESLII.pdf
- M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
- Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.


Project 12: Concurrency Detectives
Supervisor: Saverio Giallorenzo, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: Programming
Group location: IMADA
Group size: At least 3 and at most 5 participants. One group can work on the project.
Comment: The project is particularly well suited to the computer science curriculum.
Intended for: The project is offered to all programmes except pharmacy students.
Keywords: Concurrent Programming, Bug Detection Tools

Description
Infer is a static program analyser developed and used by Facebook (but also used by Mozilla, Spotify, Uber, etc.) to check that each new release of its Java and C/Objective-C apps respects desirable properties such as the absence of null pointer dereferences and of resource and memory leaks, which cause some of the nastiest bugs to find and correct.

Infer relies on a kind of mathematical logic, called "separation logic", which facilitates reasoning about mutations to computer memory. It enables scalability by breaking reasoning into chunks corresponding to local operations on memory, and then composing the reasoning chunks together.

The initial step of this project will be to acquire a good understanding of the theory behind Infer and of the programming issues that are modelled and detected by the tool. Then, the students will deepen their knowledge of the tool by studying its plugins, each of which tracks and exposes a specific type of bug. Finally, they will put their knowledge of both Infer and its theory into practice on a real-world project: the Jolie language interpreter, an open-source project written in Java. The main plugin the students will be asked to master is called RacerD and concerns the detection of potential concurrency bugs called "race conditions". A race condition occurs when there are two concurrent accesses to a class member variable that are not separated by mutual exclusion, and at least one of the accesses is a write. Mutual exclusion can be ensured by synchronisation primitives such as locks, or by knowledge that both accesses occur on the same thread. RacerD does not attempt to prove the absence of concurrency issues; rather, it searches for a high-confidence class of data races. RacerD concentrates on race conditions between methods in a class that is itself intended to be thread safe.
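The race-condition definition above can be made concrete with a small sketch. Infer and RacerD analyse Java, C and Objective-C code; the example below is written in Python purely for illustration, and the `Counter` class and `run` helper are hypothetical names, not part of Infer or Jolie.

```python
import threading

class Counter:
    """Hypothetical shared object with a single member variable, `value`."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        # Read-modify-write with no mutual exclusion: two threads may read the
        # same old value, so an update can be lost (a data race).
        self.value += 1

    def increment_safe(self):
        # The lock turns the read-modify-write into a critical section.
        with self._lock:
            self.value += 1

def run(method_name, threads=4, per_thread=10_000):
    """Call the chosen method concurrently and return the final counter value."""
    counter = Counter()
    method = getattr(counter, method_name)

    def work():
        for _ in range(per_thread):
            method()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value

print(run("increment_safe"))    # always threads * per_thread
print(run("increment_unsafe"))  # may be lower if updates are lost
```

Whether the unsafe variant actually loses updates depends on the interpreter and on scheduling, which is exactly why such bugs are hard to find by testing and why a static analyser is attractive.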

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
- Calcagno, Cristiano, Dino Distefano, Jérémy Dubreil, Dominik Gabi, Pieter Hooimeijer, Martino Luca, Peter W. O'Hearn, Irene Papakonstantinou, Jim Purbrick, and Dulma Rodriguez. "Moving Fast with Software Verification." NFM 15 (2015): 3-11.
- Infer http://fbinfer.com/
- Jolie language repository https://github.com/jolie/jolie


Project 14: Data Science Competition on kaggle.com
Supervisor: Jonas Herskind Sejr, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: Programming
Group location: IMADA
Group size: At least 3 and at most 5 participants. Two groups can work on the project.
Comment: The project is particularly well suited to the computer science curriculum.
Intended for: The project is offered to all programmes except pharmacy students.
Keywords: data science, machine learning, statistics

Description
Real-life data science does not start with a machine learning (ML) algorithm. Before applying any algorithm, you must first understand the problem you want to solve, decide how to evaluate your solution, acquire and prepare the data, and then analyze, explore and extract the features to which you want to apply your ML algorithms.

Predictive data science competitions have become a popular way to get practical experience with real-life data science problems. In addition, predictive competitions are a great place to compare algorithms on different types of problems.

"Kaggle is a platform for predictive modeling and analytics competitions in which statisticians and data miners compete to produce the best models for predicting and describing the datasets uploaded by companies and users." (Wikipedia.)

Apart from hosting competitions, Kaggle offers tutorials, tools, a social platform etc. to support learning and knowledge sharing among data scientists at all levels. This makes it very easy to get started with predictive analytics.

Teams choosing this project will compete in a Kaggle competition. Starting from the simplest solution, participants will gradually improve it as they learn about the dataset and predictive algorithms.

Participants will learn about:

- Cleaning and preparing data
- Analysing and visualizing data
- Extracting features
- Competitive ML algorithms

Check out kaggle.com to see if you find it interesting.
Competitive data science can be a great testbed for future studies or a career in data science, deep learning etc.

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
- https://www.kaggle.com/
- https://www.kaggle.com/startupsci/titanic-data-science-solutions
- https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about
- http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf


Project 15: Deep learning for DNA-sequence analysis
Supervisor: Alexander G. B. Grønning, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: This project will include programming in Python.
Group location: IMADA
Group size: At least 3 and at most 5 participants. Two groups can work on the project.
Comment: The project is particularly well suited to the computer science curriculum.
Intended for: The project is offered to all students.
Keywords: Deep learning, Python, computational biology, word embedding

Description
Recent technical advances have made it possible to generate massive amounts of biological data that contain information about the intra- and extracellular environments of eukaryotes and prokaryotes. Advances in sequencing technologies in particular have made it possible for many research groups around the world to produce DNA and RNA sequences that describe the binding sites of DNA- and RNA-binding proteins (DBPs and RBPs), which exert important regulatory effects. Knowledge of the activity of DBPs and RBPs and of their binding affinities for different sequence patterns can help us understand intracellular mechanisms and pathways and help us invent and discover treatments for various diseases. With the massive amounts of sequence data available, we need effective algorithms that can help us understand the information present in DNA and RNA sequences.

In this project, deep learning techniques will be used to implement a binary classifier that should be able to classify seemingly random DNA sequences into two distinct groups. Deep learning has produced state-of-the-art results in various fields such as face recognition, object detection and text generation. In general, the accuracy of deep learning architectures increases with the number of training examples, which is why it is an obvious choice when analyzing DNA sequences. Deep learning has already been applied with much success to the discovery of binding sites of DBPs and RBPs, so this project will serve only as an introduction to deep learning techniques and to how they can be implemented in Python and used for DNA-sequence analysis. Hopefully, you will be able to use what you learn during this project in the future, to perfect our current deep learning algorithms and aid us in the fight against diseases.

An overview of what this project could include:

- An introduction to the central dogma of molecular biology.
- An introduction to binding motifs.
- An introduction to the basic principles of deep learning and the math behind it.
- A short discussion of how to represent the bases of DNA strings as values or vectors.
- Basic Python programming, including how to implement functions and use deep learning libraries.
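One common way to represent the bases of a DNA string as vectors is one-hot encoding, sketched below. The catalogue does not specify which representation the project will use (its keywords also mention word embedding), so the base order `ACGT`, the function name `one_hot` and the all-zero treatment of unknown symbols are assumptions for illustration.

```python
BASES = "ACGT"  # assumed base order; any fixed order works

def one_hot(sequence):
    """Encode a DNA string as a list of 4-element 0/1 vectors, one per base."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in sequence.upper():
        vector = [0, 0, 0, 0]
        if base in index:  # assumption: unknown symbols such as N stay all-zero
            vector[index[base]] = 1
        encoded.append(vector)
    return encoded

print(one_hot("ACGT"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

A sequence of length n becomes an n x 4 matrix, which is a form that deep learning libraries can take as input.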

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
- http://neuralnetworksanddeeplearning.com/
- http://www.deeplearningbook.org/
- D'haeseleer, P. What are DNA sequence motifs? Nat. Biotechnol. 24, 423–425 (2006).


Project 17: DevOps, as Microservices DevOps
Supervisor: Saverio Giallorenzo, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: Programming
Group location: IMADA
Group size: At least 3 and at most 5 participants. One group can work on the project.
Comment: The project is particularly well suited to the computer science curriculum.
Intended for: The project is offered to all programmes except pharmacy students.
Keywords: culture, software development, DevOps, microservices

Description
DevOps is a portmanteau of "development" and "operations" and represents a growing engineering practice that unifies software development (programmers, testers, and quality assurance personnel) and operations (system, database, and network administrators).

First and foremost, DevOps is a cultural movement that affects a whole organisation. Its principles advocate a working environment based on strong communication and trust among its members.

Among the technical characteristics of DevOps is a strong drive towards automating all stages of the software life-cycle: construction, integration, testing, packaging, release, configuration, and monitoring.

The set of software tools used at each stage is an important decision for any DevOps team. These tools become part of a toolchain that comprises IDEs, software version systems, test frameworks, configuration managers, and monitors.

While DevOps does not promote a particular software architectural style, the high degree of flexibility and independence of microservices has made microservice architectures the standard style for building systems that follow the DevOps principles. In this context, Jolie is a new programming language specifically suited to the development of microservices.

The initial step of this project will be to acquire a solid understanding of the DevOps principles. Then, building on these principles, the students will design a development plan and devise a DevOps toolchain to support the development of a DevOps tool in Jolie: for example, a program in which developers define tasks that are executed when a given event is triggered on a code repository (e.g., by using GitHub Webhooks).

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
- Farcic, Viktor. The DevOps 2.0 Toolkit. Packt Publishing Ltd, 2016.
- Kim, Gene, Kevin Behr, and George Spafford. The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win. IT Revolution, 2014.
- GitHub Webhooks Documentation https://developer.github.com/webhooks/


Project 26: Finding consensus in distributed systems
Supervisor: Larisa Safina, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: Analysis, Programming
Group location: IMADA
Group size: At least 3 and at most 5 participants. Two groups can work on the project.
Comment: The project is particularly well suited to the computer science curriculum.
Intended for: The project is offered to all programmes except pharmacy students.
Keywords: distributed systems, consensus algorithms

Description
Consensus is one of the fundamental problems in distributed systems. Its aim is to reach agreement on a specific value among several parties. The parties must vote for a value (proposed earlier by one or more of them) and communicate their votes to each other until they reach agreement. Note that some of the parties may be faulty or unavailable; for example, imagine a system of clocks in which some nodes run fast or slow and some do not work at all.

The objective of this project is to implement one of the consensus algorithms for a real-world problem and to learn how to justify the choice of algorithm and programming language from a software engineering point of view.

You can start the project by selecting a real-world problem in which the consensus problem may arise (e.g. time synchronization, load balancing, banking transactions etc.). Then you can:

- Choose one of the well-known consensus algorithms and apply it to your problem. Argue which algorithm works better for your problem (e.g. Raft versus Paxos);
- Implement the chosen algorithm in different programming languages and argue which implementation was most suitable.
- Work on your own cool idea.
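To give a feel for the voting idea in the description, the sketch below decides on a value only when a strict majority of all nodes, including unavailable ones, voted for it. This is a deliberately simplified single round of voting and not Raft or Paxos, which additionally have to cope with issues such as message loss and repeated rounds; the function name `majority_decision` and the clock readings are hypothetical.

```python
from collections import Counter

def majority_decision(votes):
    """Return the value backed by a strict majority of all nodes, else None.

    `votes` maps a node name to its vote; None marks a node that is unavailable.
    """
    total = len(votes)
    cast = Counter(v for v in votes.values() if v is not None)
    if not cast:
        return None  # nobody voted
    value, count = cast.most_common(1)[0]
    # Strict majority of ALL nodes, so unavailable nodes count against agreement.
    return value if count > total // 2 else None

# Made-up clock system: one node runs fast and one is down.
clocks = {"a": "12:00", "b": "12:00", "c": "12:00", "d": "12:05", "e": None}
print(majority_decision(clocks))  # 12:00
```

Requiring a majority of all nodes, rather than of the votes received, means that two disjoint groups of nodes can never both reach a decision.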

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
- Andrew S. Tanenbaum and Maarten van Steen. 2006. Distributed Systems: Principles and Paradigms (2nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.


Project 40: How well is your code covered?
Supervisor: Larisa Safina, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: Testing, Programming
Group location: IMADA
Group size: At least 3 and at most 5 participants. Two groups can work on the project.
Comment: The project is particularly well suited to the computer science curriculum.
Intended for: The project is offered to all programmes except pharmacy students.
Keywords: code coverage

Description
Code coverage is a measure of the extent to which the code of a program was executed during testing. In other words, it gives an estimate of the chances of having undetected software bugs in uncovered areas. There are many different metrics for calculating code coverage, with different levels of rigor. However, being more rigorous is not always better, since there are tradeoffs between accuracy and, for example, performance.

The objective of this project is to evaluate an arbitrary project from the point of view of code coverage, using different approaches and tools.

What is expected from you? First of all, you will need to find a project with some level of test coverage (you can use your own or take something from GitHub or SourceForge). Then, depending on your goal, you can:

- Compare how differently coverage tools behave on the project in terms of performance
- Compare the efficiency of different types of code coverage, for example by checking whether any bugs are missed (e.g. using mutation testing)
- Implement your own code coverage tool
- Work on your own cool idea
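As a rough idea of what the simplest coverage metric, line coverage, involves, the sketch below records which lines of a function are executed for a given input, using Python's standard `sys.settrace` hook. It is a toy under stated assumptions, not how any particular coverage tool is implemented; `line_coverage` and the sample function `sign` are hypothetical names.

```python
import sys

def line_coverage(func, *args):
    """Return the line offsets (relative to the `def` line) executed inside `func`."""
    code = func.__code__
    executed = set()

    def tracer(frame, event, arg):
        if frame.f_code is code:
            if event == "line":
                executed.add(frame.f_lineno - code.co_firstlineno)
            return tracer  # keep receiving line events for this frame
        return None        # ignore every other function

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return executed

def sign(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"

# A single test input leaves some lines unexecuted; more inputs cover more lines.
only_positive = line_coverage(sign, 5)
all_cases = only_positive | line_coverage(sign, -1) | line_coverage(sign, 0)
print(sorted(only_positive), sorted(all_cases))
```

Tracing every executed line slows the program down, which is one concrete form of the accuracy versus performance tradeoff mentioned in the description.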

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
No literature given.


Project 60: Logic Puzzles and Boolean Satisfiability
Supervisor: Luís Cruz-Filipe, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: Programming experience is NOT a prerequisite, but it is an advantage if you have worked with Maple/Matlab/Python or similar.
Group location: IMADA
Group size: At least 3 and at most 5 participants. Two groups can work on the project.
Comment: The project is suitable for all students with some interest in mathematical or computer science topics.
Intended for: The project is offered to all programmes except pharmacy students.
Keywords: Boolean Logic, Satisfiability, Constraint Solving

Description
Logic puzzles are both great recreational fun (sudoku, kakuro, hitori, sokoban etc.) and at the core of many industrial and research applications (scheduling, optimization, routing, verification, etc.). Common to these problems is that they are considered to be hard problems (NP-complete), for which we do not expect to be able to come up with fast algorithms in the general case.

Boolean Satisfiability is a logic puzzle in which, given a propositional formula (Danish: udsagnslogisk formel), one has to determine whether it can be made true by assigning each of its variables the value true or false. This logic puzzle is also hard, but researchers have worked for nearly 60 years on algorithms for solving this particular problem.

Many other logic puzzles can be solved by representing their elements using Boolean variables that can be either true or false (think of them as bits) and expressing the puzzles as Boolean satisfiability problems. Small but complex and highly efficient computer programs (freely available and usually open source) can then be used to solve these problems, in turn giving solutions to the original logic puzzles.

The goal of this project is to learn how to use Boolean Satisfiability to solve other logic puzzles. The puzzle can be one of those mentioned above, but there is a virtually infinite number of other interesting choices available. For small logic puzzles (like the 5x5 hitori puzzle in the picture), expressing them can be done without programming. Larger logic puzzles require (very basic) programming skills to automate some of the tedious work.
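To illustrate what "expressing a puzzle as Boolean satisfiability" means, the sketch below encodes a tiny made-up constraint, "exactly one of two cells is black", as clauses and finds a satisfying assignment by trying every combination. Brute force takes exponential time and is only meant to show the problem statement; the freely available solvers mentioned above use far more sophisticated algorithms. The function name `solve_cnf` and the integer encoding of literals are illustrative choices.

```python
from itertools import product

def solve_cnf(clauses, num_vars):
    """Brute-force SAT: return a satisfying assignment {var: bool}, or None.

    Each clause is a list of non-zero ints: k means variable k, -k its negation.
    The formula is the AND of all clauses; each clause is the OR of its literals.
    """
    for values in product([False, True], repeat=num_vars):
        assignment = {var: values[var - 1] for var in range(1, num_vars + 1)}
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None

# Variable 1: "cell A is black", variable 2: "cell B is black" (made-up puzzle).
exactly_one = [[1, 2],     # at least one of the two cells is black
               [-1, -2]]   # not both cells are black
print(solve_cnf(exactly_one, 2))   # {1: False, 2: True}
print(solve_cnf([[1], [-1]], 1))   # None: x and not-x cannot both hold
```

Reading the returned assignment back as "cell B is black, cell A is not" is the step that turns a solution of the formula into a solution of the puzzle.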

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
- https://en.wikipedia.org/wiki/Boolean_satisfiability_problem
- https://en.wikipedia.org/wiki/Hitori


Project 77: Route planning for aircraft
Supervisor: Anders Nicolai Knudsen, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: Programming
Group location: IMADA
Group size: At least 3 and at most 5 participants. Two groups can work on the project.
Comment: The project is particularly well suited to the computer science curriculum.
Intended for: The project is offered to all programmes except pharmacy students.
Keywords: Flight routes, Algorithms, Graphs, Shortest path

Description
This project is about finding optimal routes for flights and thereby minimizing both the time and the fuel used. Air travel is one of today's most widely used forms of transport. Every day more than 100,000 aircraft fly around the world for many different purposes, such as tourism, business and military operations. More than 250 billion litres of fuel are used, which is both expensive and a considerable share of the world's total CO2 emissions. Besides fuel, airlines also have many other costs that depend on how long the aircraft is in the air. It is therefore in the aviation industry's interest to minimize the total flight time of its flights.

Flight routes are not as easy to construct as one might think. If, for example, you need to fly from Copenhagen to Paris, you are not allowed simply to fly the direct path, since collisions would then be hard to prevent. Instead, airspace around the world has been divided into networks of points, also called waypoints. These waypoints form a 3D graph through which one must find the cheapest path. The network is subject to a number of rules and fees that complicate route generation. In addition, an aircraft's performance varies with the conditions in the surrounding air, which complicates the problem further.

The goal of the project is to develop a program that, given a network of waypoints and data on an aircraft's performance, can generate an optimal route from one airport to another. The problem is of the well-known shortest-path type, but the unique conditions imposed by the network and the aircraft's performance mean that the students will also become acquainted with heuristics, constraints and the use of data structures.
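The basic shortest-path problem underlying the project can be sketched with Dijkstra's algorithm on a small waypoint graph. The sketch ignores everything that makes the real problem hard (airspace rules, fees, performance that depends on conditions in the air); the waypoint names, the costs and the function name `cheapest_route` are made up for illustration.

```python
import heapq

def cheapest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total_cost, [waypoints]), or (inf, []) if unreachable.

    `graph` maps a waypoint to a dict {neighbour: non-negative cost of that leg}.
    """
    queue = [(0.0, start, [start])]  # priority queue ordered by cost so far
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue  # a cheaper path to this waypoint was already handled
        settled.add(node)
        for neighbour, step in graph.get(node, {}).items():
            if neighbour not in settled:
                heapq.heappush(queue, (cost + step, neighbour, path + [neighbour]))
    return float("inf"), []

# Hypothetical network: two airports and two intermediate waypoints.
network = {
    "CPH": {"W1": 2.0, "W2": 5.0},
    "W1": {"W2": 1.0, "CDG": 7.0},
    "W2": {"CDG": 3.0},
    "CDG": {},
}
print(cheapest_route(network, "CPH", "CDG"))  # (6.0, ['CPH', 'W1', 'W2', 'CDG'])
```

The priority queue (a binary heap here) is one example of the data structures the description refers to: it keeps the cheapest unexplored waypoint available without re-sorting all candidates.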

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
- Anders Knudsen (2015). Introduction to Flight Route Optimization. URL: www.imada.sdu.dk/~andersnk/FF501_Introduction.pdf


Project 78: Sequence Structure Prediction in Java
Supervisor: Philipp Weber, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: Programming, no need for biological background
Group location: IMADA
Group size: At least 3 and at most 5 participants. Two groups can work on the project.
Comment: The project is particularly well suited to the computer science curriculum.
Intended for: The project is offered to all students.
Keywords: Bioinformatics, RNA folding, User Interface Design

Description
RNA, DNA and proteins are the basic molecules of life on Earth. DNA is used to store and replicate information, proteins are the basic building blocks and machinery of a cell, and RNA plays a number of important roles in the production of proteins and in signaling within the cell. One of the fundamental problems in bioinformatics is to predict the structure of a molecule from a given DNA or RNA sequence alone. This structure, in combination with the chemical properties of the exposed regions, is responsible for the function of the molecule. We can therefore say that the nucleotide sequences in our cells are truly the code of life.

In this project you will build a tool that reads in a nucleotide sequence, predicts a 2D or 3D structure using a folding algorithm of your choice, and visualizes the resulting molecule.

The goal of this project is to gain more insight into the intricate details of RNA and DNA structure prediction, to compare algorithms, and to work with tools used by thousands of Java developers to build user interfaces (decide between Swing and JavaFX). The group will need to discuss and decide on the features and functionality needed in the project and work together to achieve their goal.
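As one example of a folding algorithm, the sketch below follows the dynamic-programming idea of the Nussinov paper listed in the literature: it computes the maximum number of nested base pairs in an RNA sequence. It is written in Python for brevity although the project itself targets Java, it returns only the pair count rather than the structure, and the restriction to Watson-Crick pairs and the minimum hairpin loop of 3 unpaired bases are assumptions of this sketch.

```python
# Assumption: only Watson-Crick pairs are allowed (no G-U wobble pairs).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def max_base_pairs(seq, min_loop=3):
    """Nussinov-style dynamic programme: maximum number of nested base pairs.

    `min_loop` is the minimum number of unpaired bases enclosed by any pair.
    best[i][j] holds the optimum for the subsequence seq[i..j].
    """
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):       # j - i, shortest subsequences first
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]            # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0

print(max_base_pairs("GGGAAACCC"))  # 3: a stem of three G-C pairs around the AAA loop
```

Filling the table takes cubic time in the sequence length, which is a useful baseline when comparing folding algorithms.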

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
- R Nussinov, G Pieczenik, JG Griggs, DJ Kleitman (1978) Algorithms for Loop Matching. SIAM J Appl Math 35, 68-81.
- JavaFX, http://www.oracle.com/technetwork/java/javase/overview/javafx-overview-2158620.html, Accessed December 2017
- Swing, https://docs.oracle.com/javase/tutorial/uiswing/start/index.html, Accessed December 2017


Project 81: Game theory
Supervisor: Christian Kudahl, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: The project is expected to include programming.
Group location: IMADA
Group size: At least 3 and at most 5 participants. Three groups can work on the project.
Comment: The project is particularly well suited to the computer science curriculum.
Intended for: The project is offered to all programmes except pharmacy students.
Keywords: Game theory, discrete mathematics, proofs, programming

Description
Games are fun to play, and they can also be exciting to analyze. Since many real-world problems can be viewed as games, the analysis is also highly relevant in practice. In this project the students will get the opportunity to work with a range of games. The games are analyzed theoretically with the aim of finding strategies and proving that they are optimal. There will be ample opportunity to apply a broad spectrum of proof techniques such as induction and contradiction. In some cases it will be possible to give non-constructive proofs of the existence of winning strategies.

The project is also expected to include a programming part. Here it will be possible to implement the winning strategies found in the theoretical analysis. Alternatively, the computer can be used as a tool for finding winning strategies in slightly more complex games.

Each group has great freedom to take the project in an exciting and unique direction.
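As a small example of using the computer to find winning strategies, the sketch below analyses a simple subtraction game by backward induction. The catalogue does not name the games that will be studied, so the game itself (one pile of stones, remove 1, 2 or 3 per turn, whoever takes the last stone wins) and the function names are hypothetical.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # hypothetical rule: remove 1, 2 or 3 stones; taking the last stone wins

@lru_cache(maxsize=None)
def is_winning(stones):
    """True if the player to move can force a win from a pile of `stones`."""
    # A position is winning if some move leaves the opponent in a losing position.
    # With 0 stones there is no legal move, so the player to move has lost.
    return any(not is_winning(stones - m) for m in MOVES if m <= stones)

def winning_move(stones):
    """Return a move that leaves the opponent in a losing position, or None."""
    for m in MOVES:
        if m <= stones and not is_winning(stones - m):
            return m
    return None

print([n for n in range(13) if not is_winning(n)])  # [0, 4, 8, 12]
print(winning_move(10))                             # 2 (leaves 8 stones)
```

The output suggests that the losing positions are exactly the multiples of 4, a conjecture that can then be proved by induction, which is the interplay between programming and proof that the description has in mind.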

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
- Winning Ways for your Mathematical Plays, Berlekamp et al.


Project 95: Web Technologies and Visualization of Breath Data
Supervisor: Philipp Weber, [email protected]
Department: Department of Mathematics and Computer Science
Practical part: Programming, no need for biological background
Group location: IMADA
Group size: At least 3 and at most 5 participants. Two groups can work on the project.
Comment: The project is particularly well suited to the computer science curriculum.
Intended for: The project is offered to all students.
Keywords: Data Visualization, Data-Driven Documents, Bioinformatics

Description
The world wide web is not only a place for cat pictures and memes, but is also filled with tremendous amounts of data. This data can be used to guide decisions in public, scientific and personal life. But what good is data if we cannot comprehend it? Here visualization plays a pivotal role. The process of visualization can be seen as the meeting point of several different fields, namely engineering, statistics and graphic design. When applied to large datasets, automatic generation of visualizations and plots becomes a necessity. Here, the understanding of the data can be facilitated by adding interactive views that allow for a better overview while also providing details of the underlying data on demand.

In this project you will build web-based visualizations from breath measurements and their processed analysis files. The original data is acquired with an MCC-IMS (MultiCapillary Column – Ion Mobility Spectrometry) device. This very sensitive technology is commonly used in airports to detect explosives and by doctors to detect the substances exhaled by a patient. Clinicians try to find patterns in their patients' breath and correlate them with diseases such as lung cancer or COPD.

The task of the group will be to create, compose and connect different visualizations to enable easy exploration of the data. In order to succeed in this task, the group will learn about classical visualization techniques and recent technologies used to create such plots. Furthermore, they will be introduced to the Python-based Flask framework and the basics of web development. Then they will build their own representations in HTML, CSS and JavaScript, which should be made available in web browsers by serving them with Flask and D3 (Data-Driven Documents, a JavaScript library used to create interactive and scalable vector graphics).

Mini-courses
Mandatory: Written communication and report writing (online), Report writing with LaTeX, Poster production

Literature
- J. Heer and B. Shneiderman. "Interactive Dynamics for Visual Analysis." Commun. ACM, 55(4), 2012.
- Armin Ronacher, Flask, a microframework for Python, http://flask.pocoo.org/, 2017
- Mike Bostock, Data-Driven Documents, https://d3js.org/, 2017
