Scaffolding Problems
Gao Song
2010/04/27

Outline
  Concepts
  Problem definition
  Non-error Case
  Edge-error Case
  Disconnected Components
  Simulated Data
  Future Work

Concepts
  Contig:
  Edge (PET): library size
  Scaffolding: a sequence of contigs
  Happy Edge:
    Real distance <= expected distance
    Orientations of both contigs are correct

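The happy-edge test above can be sketched in Python. This is a minimal illustration, not the presenter's implementation: the `Placement` and `Edge` records, their field names, and the choice to measure the real distance as the span from the start of the left contig to the end of the right contig are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical placement of one contig inside a scaffold."""
    start: int      # left coordinate of the contig in the scaffold
    length: int     # contig length in bases
    forward: bool   # True if the contig is laid out in forward orientation


@dataclass(frozen=True)
class Edge:
    """Hypothetical PET edge between contigs `a` and `b`."""
    a: str
    b: str
    a_forward: bool    # orientation of `a` implied by the PET
    b_forward: bool    # orientation of `b` implied by the PET
    library_size: int  # expected (maximum) distance for this library


def is_happy(edge, layout):
    """Return True if both orientations match and the span fits the library size."""
    pa, pb = layout[edge.a], layout[edge.b]
    if pa.forward != edge.a_forward or pb.forward != edge.b_forward:
        return False
    left, right = (pa, pb) if pa.start <= pb.start else (pb, pa)
    # Assumption: "real distance" is the span covered by the two contigs.
    real_distance = right.start + right.length - left.start
    return real_distance <= edge.library_size


layout = {"A": Placement(0, 1000, True), "B": Placement(1500, 800, True)}
print(is_happy(Edge("A", "B", True, True, 3000), layout))
```

Counting the edges for which `is_happy` returns False gives the number of unhappy edges that the problem definition bounds by p.
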
Problem Definition
  Version 1: Given a set of contigs and a set of edges, find a scaffold which has at most p unhappy edges
  Version 2: Given a set of contigs and a set of edges, find a scaffold which has at most p unhappy edges and is also the optimal solution

Non-error Case
  Connected graph
  Partial Layout:
  Dangling Edge: only one end in partial layout
  Active region: the sequence from the first contig having dangling edges to the end of partial layout; less than library size
  Domain of a partial layout: all nodes in partial layout

Non-error Case
  Theorem: if two partial layouts l1 and l2 have the same active region and dangling set, then
    (1) they have the same domain
    (2) both or neither of them can be extended to a solution
  Proof:

Procedure
  Find the unassigned nodes
  Select the nearest node as the next assigned node
  Update current partial layout
    Remove all dangling edges incident to the new node
    Add new dangling edges of the new node
    Remove contigs from active region

Main Procedure
  Find all nodes which have no ancestors and select one to start
  From an active region, get all unassigned nodes, and update the partial layout
  Remember all visited partial layouts
  If dangling edge set is empty, output the results

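The main procedure and the theorem can be illustrated with a small memoised search. This is a sketch under strong simplifying assumptions, not the presenter's code: edges are directed pairs `(u, v)` meaning `u` precedes `v`, contigs are laid end to end with no gaps, an edge is satisfied when the span from the start of `u` to the end of `v` is at most `library_size`, and the graph is connected. The function name `find_layout` and all parameter names are hypothetical.

```python
def find_layout(lengths, edges, library_size):
    """Search partial layouts, visiting each (active region, dangling set) once.

    lengths: dict mapping contig name -> length
    edges: set of (u, v) pairs, u must precede v (assumed connected graph)
    """
    preds = {n: set() for n in lengths}
    for u, v in edges:
        preds[v].add(u)

    def state_key(layout):
        placed = set(layout)
        # Dangling edge: exactly one endpoint is in the partial layout.
        dangling = frozenset(
            (u, v) for u, v in edges if (u in placed) != (v in placed)
        )
        touched = {n for e in dangling for n in e}
        # Active region: from the first contig with a dangling edge to the end.
        first = next(
            (i for i, n in enumerate(layout) if n in touched), len(layout)
        )
        return tuple(layout[first:]), dangling

    def can_append(layout, node):
        if not preds[node] <= set(layout):
            return False  # a predecessor would end up after `node`
        starts, pos = {}, 0
        for n in layout:
            starts[n] = pos
            pos += lengths[n]
        end = pos + lengths[node]
        return all(end - starts[u] <= library_size for u in preds[node])

    visited = set()
    # Start from nodes that have no ancestors.
    stack = [[n] for n in sorted(lengths) if not preds[n]]
    while stack:
        layout = stack.pop()
        key = state_key(layout)
        if key in visited:
            continue  # an equivalent partial layout was already explored
        visited.add(key)
        if len(layout) == len(lengths) and not key[1]:
            return layout  # dangling edge set is empty: output the result
        placed = set(layout)
        for node in sorted(lengths):
            if node not in placed and can_append(layout, node):
                stack.append(layout + [node])
    return None


print(find_layout({"A": 100, "B": 200, "C": 150},
                  {("A", "B"), ("B", "C")}, 400))
```

The `visited` set is what the slide calls remembering all visited partial layouts; keying it on the active region and dangling set alone relies on the connected-graph assumption.
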
Time and space complexity
  Two possibilities
    k vertices in active region: one possible next node
    Fewer than k vertices in active region: n possible next nodes
  Complexity
    O(n^k) * O(1)
    O(n^(k-1)) * O(n)
  Total time complexity: O(n^k)
  Total space complexity: store all visited partial orders

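One way to read the two cases, assuming each term is the number of distinct active regions multiplied by the number of candidate next nodes, is the sum below. The symbol T(n, k) is introduced here only for illustration.

```latex
\[
T(n,k)
  = \underbrace{O(n^{k}) \cdot O(1)}_{\text{active region has } k \text{ vertices}}
  + \underbrace{O(n^{k-1}) \cdot O(n)}_{\text{active region has fewer than } k \text{ vertices}}
  = O(n^{k})
\]
```
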
Introduce Edge Error
  Types of edge error
    Chimeric PETs: Mapping error
    Misassembled contigs
  Solution
    Filtering: filter chimeric PETs
      Select x% of PETs
      Shuffle them to get chimeric PETs
      Cluster them to find threshold
    Local threshold

Introduce Edge Error
  There are p unhappy edges in final scaffolding
  Partial layout
    Dangling edges: real dangling edges; wrong edges

Equivalence Class
  Active region, set of dangling edges, count of current wrong edges
  Same domain
  Assumption: the partial order is a connected graph

Get Unassigned Nodes
  Sort the unassigned nodes
  Properties of nodes:
    Steps to reach this node
    Distance to the end of active region
    Unhappy edges introduced due to this node

Sort Unassigned Nodes
  Breadth-first search
    Select the smallest possible distance: > threshold
  Sort nodes:
    Less than 5 steps, compare with distance; same distance, compare with unhappy edges

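The sorting rule above can be expressed as a sort key. The sketch below is one possible reading, not the presenter's code: the `Candidate` record and its fields are hypothetical, and placing nodes that need 5 or more steps after the nearer ones (ordered by step count) is an assumption, since the slide does not say how those are ranked.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one unassigned node."""
    name: str
    steps: int      # BFS steps needed to reach the node
    distance: int   # distance to the end of the active region
    unhappy: int    # unhappy edges introduced by placing this node


MAX_STEPS = 5  # cut-off taken from the slide


def sort_key(c):
    if c.steps < MAX_STEPS:
        # Near nodes: compare by distance, then by unhappy edges.
        return (0, c.distance, c.unhappy, c.name)
    # Assumption: far nodes come last, ordered by steps and then distance.
    return (1, c.steps, c.distance, c.name)


def sort_candidates(candidates):
    return sorted(candidates, key=sort_key)


nodes = [
    Candidate("a", 2, 500, 1),
    Candidate("b", 1, 300, 2),
    Candidate("c", 3, 300, 0),
    Candidate("d", 7, 100, 0),
]
print([c.name for c in sort_candidates(nodes)])
```
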
Update Partial Layout
  Check if all incident real (not wrong) dangling edges are happy
    If yes, just remove all those edges and add the new node
    If no, check if marking all unhappy edges as omitted will result in a disconnected graph
      If no, just add the new node and remove the dangling edges
      If yes, discard the current partial layout, to avoid inserting a disconnected component into the sequence
  Add new dangling edges
  Remove all dangling edges which are not happy (check connectedness)

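The connectedness check in this step can be sketched as a breadth-first search over the graph with the omitted edges removed. This is an illustration only: the function name `stays_connected`, the undirected edge representation, and the assumption that every edge endpoint appears in `nodes` are choices made for the example, and the slide does not say which subgraph the check is run on.

```python
from collections import deque


def stays_connected(nodes, edges, omitted):
    """Return True if the graph is still connected once `omitted` edges are dropped.

    nodes: iterable of node names
    edges: iterable of (u, v) pairs, treated as undirected
    omitted: iterable of (u, v) pairs to ignore (the unhappy edges)
    """
    nodes = set(nodes)
    if not nodes:
        return True
    dropped = {frozenset(e) for e in omitted}
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        if frozenset((u, v)) not in dropped:
            adjacency[u].add(v)
            adjacency[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        for neighbour in adjacency[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == nodes


triangle = [("A", "B"), ("B", "C"), ("A", "C")]
print(stays_connected("ABC", triangle, [("A", "B")]))
print(stays_connected("ABC", triangle, [("A", "B"), ("A", "C")]))
```

When the function returns False, the rule on the slide says to discard the current partial layout rather than insert a disconnected component into the sequence.
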
Main Procedure
  If active region is empty
    Current connected component is finished
    Check if dangling edge set is empty
      If yes, output the result
      If no, use the dangling edges to find a new node and start another scaffolding

Disconnected Components
  First find all the connected components and sort them according to the number of nodes
  From the first component, find a solution, which omits p_1 edges
  For the ith component, if there is no solution within the remaining budget of p - sum(p_1, ..., p_(i-1)) omitted edges, remember all the stop points, return to the (i-1)th component, and see if it can find a solution which omits fewer than p_(i-1) edges. If yes, continue from the stop point of the ith component.
  If the ith component finishes the whole search and finds more than one solution, remember only the solution with minimum p_i. Later, whenever the search comes to this component again, just use this solution as part of the partial results.

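The first step, finding the connected components and ordering them by size, can be sketched as follows. The function name is hypothetical, and sorting largest-first is an assumption: the slide only says the components are sorted by number of nodes, not in which direction.

```python
def connected_components(nodes, edges):
    """Return the connected components as sorted lists, largest first."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for node in nodes:
        if node in seen:
            continue
        # Depth-first traversal collecting everything reachable from `node`.
        component = []
        stack = [node]
        seen.add(node)
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))

    # Assumption: process the largest component first.
    components.sort(key=len, reverse=True)
    return components


print(connected_components(
    ["a", "b", "c", "d", "e", "f"],
    [("a", "b"), ("b", "c"), ("d", "e")],
))
```

Each component would then be scaffolded in turn, with the omitted-edge budget p shared across components as described above.
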
Simulated Data Result
  Node Num: 1522 nodes
  Contig length: 600 - 10,000

  Wrong edges   p    Time (ms)
  0             0    2765
  1             1    2984
  2             2    4984
  3             3    6562
  4             4    7000
  5             5    7328
  6             6    7281
  7             7    7343
  8             8    7406
  9             9    51813
  10            10   216984

Future Work
  Find the optimal solution
  Wrong contigs
  Repeats
  How to deal with large p
  Find a good way to sort the unassigned nodes