(5) collections algorithms


Nico Ludwig (@ersatzteilchen)

Collections – Part V


● Collections – Part V

– Implementation Strategies of Map/Dictionary-like associative Collections

● BST-based Implementation

● Hashtable-based Implementation

– A 20K Miles Perspective on Trees in Computer Science

– Equality

– Multimaps

– Set-like associative Collections

TOC


● In the last lecture we learned about associative collections.

– Associative collections allow associating keys with values.

● Example: In white pages names are associated to phone numbers.

– Associative collections allow looking up a value for a certain key, e.g. getting the phone number of a specific person.

– Associative collections could be implemented with two arrays as shown before, but this would be very inefficient for common cases.

● In this lecture we're going to understand how associative collections are implemented.

● We'll start with SortedDictionary, which was introduced in the last lecture. We already know about SortedDictionary that

– it allows only one value per key,

– that it organizes its keys by equivalence,

– that a .Net type provides equivalence by implementing IComparable or delegating equivalence to objects implementing IComparer.

● Ok, the organization of keys is based on equivalence, but when a key is in a SortedDictionary, where does it have to "be"?

– I.e. where and how will it be stored?

– We're going to understand this in this lecture!

Implementation of Associative Collections


● In contrast to indexed/sequential collections, items in associative collections are got and set by analyzing their relationships.

● The idea of SortedDictionary is to keep the contained items sorted by key whilst adding items!

– So, its way of key organization is the analysis of the sort order of keys of the items.

● To make this happen, SortedDictionary is implemented with a tree-like storage organization for the keys.

– When we analyzed algorithms we learned that the most efficient sorting algorithms can be visualized and analyzed with trees.

● I.e. those implementing "divide and conquer", like mergesort and quicksort.

● Before we discuss SortedDictionary's tree-based implementation, we're going to talk about alternative implementations.

● Sidebar: Java's implementation of a sorted associative collection is called TreeMap. It implements the interface Map.

Associative Collections – What's in a SortedDictionary?


● SortedDictionary could be implemented with a sorted list.

– Finding would be fast (O(log n)), but inserting would be costly (O(n), because items have to be shifted and the list may have to be resized).

– It is simple to code.

● SortedDictionary could be implemented with a sorted linked list.

– Inserting would be fast if the location is known (O(1)), but finding would be costly (O(n), finding involves the iteration of a linked list).

– More memory per item is required (one item plus a few references), whereas (sorted) lists store just the item itself, due to contiguous memory.

● Problems of these approaches:

– For lists contiguous memory is the problem. → It makes the data structure rigid. → Resizing required.

– For linked lists finding items is a problem due to the loose structure. → "Linked iteration"

● Let's finally discuss how we come from sorted lists and sorted linked lists to the real implementation of SortedDictionary.

[Figure: inserting "Helen" into the sorted list and into the sorted linked list containing "Clark", "Ed", "Gil" and "James".]

SortedDictionary – two alternative Implementations
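● The trade-off of the sorted list alternative can be sketched in Java as follows. This is only an illustration under assumptions: the class SortedListSketch and the method insertSorted() are hypothetical names, and an ArrayList stands in for the sorted list. Finding the position is a binary search (O(log n)), but the insertion itself has to shift items (O(n)).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortedListSketch { // Hypothetical name, for illustration only.
    // Inserts name into the already sorted list and keeps the list sorted.
    static void insertSorted(List<String> sortedNames, String name) {
        // Binary search: O(log n). A negative result encodes the insertion point.
        int index = Collections.binarySearch(sortedNames, name);
        int insertionPoint = index >= 0 ? index : -(index + 1);
        // Insertion: O(n), because all items behind the insertion point are shifted.
        sortedNames.add(insertionPoint, name);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("Clark", "Ed", "Gil", "James"));
        insertSorted(names, "Helen");
        System.out.println(names); // [Clark, Ed, Gil, Helen, James]
    }
}
```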


● When we examine a sorted linked list we'll find an interesting way of thinking about finding items.

– If we had a reference to the middle item, we could determine the "direction" or half to insert a new item to keep the linked list sorted.

– Therefore it would be nice to allow moving in both directions, so we should use a sorted doubly linked list (3rd alternative)!

– The idea of referencing only the middle item of an already sorted doubly linked list can be recursively applied to the sorted halves.

– The resulting structure is a tree, on which the real implementation of SortedDictionary is built!

[Figure: the sorted doubly linked list "Bud", "Clark", "Ed", "Gil", "Helen", "James", "Will" is referenced at its middle item "Gil"; applying this recursively to the halves yields a tree with the root "Gil", into which the new item "Kyle" is inserted.]

SortedDictionary – Coming to the real Implementation
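● The idea of "referencing the middle item, then doing the same with both halves" can be sketched in Java. This is a minimal sketch, not the real implementation of SortedDictionary: the types MiddleNode and MiddleTreeSketch are hypothetical, and a sorted array stands in for the sorted doubly linked list.

```java
class MiddleNode { // Hypothetical node type, for illustration only.
    final String value;
    MiddleNode left;  // Middle of the left (smaller) half, or null.
    MiddleNode right; // Middle of the right (greater) half, or null.

    MiddleNode(String value) {
        this.value = value;
    }
}

class MiddleTreeSketch {
    // Builds a tree from sorted[from..to) by referencing the middle item
    // and applying the same idea recursively to both halves.
    static MiddleNode build(String[] sorted, int from, int to) {
        if (from >= to) {
            return null; // An empty half ends the recursion.
        }
        int middle = (from + to) / 2;
        MiddleNode node = new MiddleNode(sorted[middle]);
        node.left = build(sorted, from, middle);
        node.right = build(sorted, middle + 1, to);
        return node;
    }

    public static void main(String[] args) {
        String[] names = { "Bud", "Clark", "Ed", "Gil", "Helen", "James", "Will" };
        MiddleNode root = build(names, 0, names.length);
        System.out.println(root.value);       // Gil
        System.out.println(root.left.value);  // Clark
        System.out.println(root.right.value); // James
    }
}
```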


● Trees represent one of the many data structures available in cs (others are graphs, relational, heap, stack etc.)

– Similar to collections data structures can be categorized in different ways, e.g. a tree is a special form of a graph.

● Basic features of trees:

– Trees are recursive branching structures consisting of linked nodes.

– Trees have only one root node.

● In cs the root of a tree is drawn at the top in graphical representations.

– A node can be reached by exactly one path from the root.

● This is only true for linear paths, in xpath there exist various expressions to select any node from the root.

● Trees represent data that has a hierarchy. This is very common in computer business.

– In the last example the hierarchy was the order: left => less than root, right => greater than root.

– Directory and file structures.

– A function call hierarchy: a call stack.

– "Has a" and "is a" associations in object-oriented systems.

– The structure of an XML file.

Trees in Computer Science (cs)


● The final solution we found to handle sorted collections is a special form of a tree, a binary search tree (BST).

– The implementation of SortedDictionary is based on a BST (BST-based).

– The core strategy of BSTs is to maintain pointers/references to the middle of subtrees.

● Here some general terminology on trees (seen from "James"):

– "James" is a node or cell. The arrows can be called edges.

– "James" has two children.

● Below "James" there is a subtree of two children.

– "James"' parent is "Gil".

– "Gil" is the root of the tree.

● In computer science trees grow upside down, having the root at the top.

– "Will" is a leaf, i.e. it has no children at all.

● A node having only one child is sometimes called half-leaf.

– "Clark" and "James" are siblings.

● The BST is called "binary", because each node has at most two children.

– A binary search can be issued on a BST in a very simple manner.

[Figure: BST with the root "Gil"; "Gil" has the children "Clark" and "James", "Clark" has the children "Bud" and "Ed", "James" has the children "Helen" and "Will".]

Trees – Terminology

● There also exist ternary, n-ary etc. trees.


● Basically all operations on trees involve recursion!

● The iteration of a tree is called traversal. It means to visit every node in the tree from a certain starting node.

– (1) Do something with the current node.

– (2) If the left (or right) node is not null, make it the current node and continue with (1). → Recurse!

● There are three classical ways to traverse (binary) trees:

– Inorder: visit the left node, then the root and finally the right node.

– Postorder: visit the left node, then the right node and finally the root.

– Preorder: visit the root, then the left node and finally the right node. (All three ways are depth-first traversals.)

● The most important traversal for BSTs is inorder, because it visits the nodes in sorted order.

– Notice, how the way of traversal could be encapsulated behind the iterator design pattern!

public class Node { // Mind the recursive definition!
    public string Value { get; set; }
    public Node Left { get; set; }
    public Node Right { get; set; }
}

public static void PrintTreePostorder(Node root) {
    if (null != root) {
        PrintTreePostorder(root.Left);  // Recurse!
        PrintTreePostorder(root.Right); // Recurse!
        Console.WriteLine(root.Value);
    }
}

public static void PrintTreeInorder(Node root) {
    if (null != root) {
        PrintTreeInorder(root.Left);  // Recurse!
        Console.WriteLine(root.Value);
        PrintTreeInorder(root.Right); // Recurse!
    }
}

public static void PrintTreePreorder(Node root) {
    if (null != root) {
        Console.WriteLine(root.Value);
        PrintTreePreorder(root.Left);  // Recurse!
        PrintTreePreorder(root.Right); // Recurse!
    }
}

Traversal of Trees

● The deletion of all nodes in a BST needs to be done postorder.

● Among others there also exists "level-order", which is also called "breadth-first": it iterates all nodes beginning at the root, level by level, each level from left to right.
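● A level-order traversal can be sketched in Java like so. The types LevelNode and LevelOrderSketch are hypothetical names for this illustration. Mind that this sketch uses a queue of pending nodes instead of recursion, which is one common way to visit a tree level by level.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class LevelNode { // Hypothetical node type, for illustration only.
    final String value;
    final LevelNode left;
    final LevelNode right;

    LevelNode(String value, LevelNode left, LevelNode right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }
}

class LevelOrderSketch {
    // Visits all nodes level by level, each level from left to right.
    static List<String> levelOrder(LevelNode root) {
        List<String> visited = new ArrayList<>();
        Queue<LevelNode> pending = new ArrayDeque<>();
        if (root != null) {
            pending.add(root);
        }
        while (!pending.isEmpty()) {
            LevelNode current = pending.remove();
            visited.add(current.value);
            // The children are visited after all nodes of the current level.
            if (current.left != null) {
                pending.add(current.left);
            }
            if (current.right != null) {
                pending.add(current.right);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        LevelNode root = new LevelNode("Gil",
            new LevelNode("Clark", new LevelNode("Bud", null, null), new LevelNode("Ed", null, null)),
            new LevelNode("James", new LevelNode("Helen", null, null), new LevelNode("Will", null, null)));
        System.out.println(levelOrder(root)); // [Gil, Clark, James, Bud, Ed, Helen, Will]
    }
}
```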


● So SortedDictionary is implemented with a BST.

– The string that was used in each node in former examples, plays the role of SortedDictionary's search key.

– The nodes of SortedDictionary have one more field: the value that is associated with the search key.

● Using a BST in the implementation allows fast implementations to find items.

– E.g. the methods Add() and ContainsKey() apply a binary search on SortedDictionary's BST to locate items.

– The involved search algorithms are implemented recursively.

● Complexity.

– The cost for finding one item only depends on the height of the tree.

– We've already discussed that the height of a tree with n nodes is about log n.

– This is valid for perfectly balanced trees.

– Other costs like swapping pointers will not be taken into consideration, as always.

– This makes the complexity for finding (and related operations) O(log n) for BSTs.

● Max. ten comparisons to find an item in 1000 items.

[Figure: the BST with the root "Gil", into which "Kyle" is inserted; each Node holds a key (e.g. "Gil"), a value (e.g. 8758) and references to its left and right child.]

SortedDictionary – final Words

● Looking at the key we know on which half an item has to be inserted or found. (left => less than root; right => greater than root, this makes divide and conquer work).

● BSTs can be kept balanced internally to maintain logarithmic behavior; this yields an additional cost factor.
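● How a BST-based associative collection finds and inserts keys can be sketched in Java. This is a simplified illustration, not the real implementation of SortedDictionary or TreeMap: the class BstMapSketch is hypothetical, keys are fixed to String, values to int, and the tree is not rebalanced (so the O(log n) cost only holds as long as the tree happens to stay balanced).

```java
class BstMapSketch { // Hypothetical name, for illustration only.
    private static final class Node {
        final String key;
        int value;
        Node left;  // Subtree with keys less than key.
        Node right; // Subtree with keys greater than key.

        Node(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    private Node root;

    void put(String key, int value) {
        root = put(root, key, value);
    }

    private Node put(Node node, String key, int value) {
        if (node == null) {
            return new Node(key, value); // Free spot found: insert here.
        }
        int order = key.compareTo(node.key);
        if (order < 0) {
            node.left = put(node.left, key, value);   // Recurse into the left half.
        } else if (order > 0) {
            node.right = put(node.right, key, value); // Recurse into the right half.
        } else {
            node.value = value; // Only one value per key: overwrite.
        }
        return node;
    }

    // Returns the value for key, or null if the key is not contained.
    Integer get(String key) {
        return get(root, key);
    }

    private Integer get(Node node, String key) {
        if (node == null) {
            return null;
        }
        int order = key.compareTo(node.key);
        if (order == 0) {
            return node.value;
        }
        return get(order < 0 ? node.left : node.right, key); // Recurse!
    }

    public static void main(String[] args) {
        BstMapSketch phoneNumbers = new BstMapSketch();
        phoneNumbers.put("Gil", 8758);
        phoneNumbers.put("Clark", 1111); // Hypothetical value.
        phoneNumbers.put("James", 2222); // Hypothetical value.
        System.out.println(phoneNumbers.get("Gil")); // 8758
    }
}
```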


Associative Collections – A completely different Strategy

● Up to now we discussed associative collections using equivalence to manage keys.

– We used a BST and got logarithmic cost for inserting and finding elements. - This is pretty fast!

– An interesting point is that the organization of BSTs is concretely imaginable/understandable/drawable.

● Now we're going to discuss associative collections using equality to manage keys, which is really cool!

– Instead of keeping the associative collection sorted, we manage a set of collections that collect items having something in common.

– These individual collections are then called buckets.

● E.g. an associative collection telephoneDictionary: for each bucket the contained strings will have the same first character.

– (This is exactly how "real" dictionaries work, where there are tabs collecting words starting with the same character. The tabs play the role of the buckets.)

– This type of associative collection is called hashtable.

– A so called hash-function maps a key to a hash-code.

● In this case the hash-function maps a string (= the key) to its first letter (= the hash-code).

– The hash-function determines, in which bucket to find or insert a key (or key-value-pair).

● When the bucket is determined, a specific operation only takes place on the items in that bucket.

[Figure: telephoneDictionary with three Buckets: Bucket "B" holds "Bud" (3821) and "Ben" (9427), Bucket "G" holds "Gil" (7764), Bucket "W" holds "Will" (4689) and "Walt" (1797).]

public class Bucket {
    public string HashCode { get; set; }
    public IList<KeyValuePair<string, int>> Items { get; set; }
}


Associative Collections – Hashtable-based Implementations - Keys

● OK, but how will an implementation using a hashtable help? - To understand it, let's dissect hashtable starting with the keys.

● The idea of a hashtable is to have a hash-function that calculates a position in a table from the input item.

– How can we have a function that is able to do that for an item?

– The idea is to have a method of an item-object, more specifically the key, that calculates this position.

● In the frameworks Java and .Net each UDT inherits a base type (Object in both frameworks), forming a cosmic hierarchy.

– In Java a UDT can override Object's method hashCode() in order to calculate the hash-code.

– In .Net a UDT can override Object's method GetHashCode() in order to calculate the hash-code.

– Now we have hash-functions/methods to "hash" a name string to the int value of the first letter (incl. some error handling).

– We can use instances of NameKey as keys for hashtables!

// C#
public class NameKey {
    public string Name { get; set; }

    public override int GetHashCode() {
        // Return int-value of first letter.
        return string.IsNullOrEmpty(Name) ? 0 : Name[0];
    }
}

// Java
public class NameKey { // (members hidden)
    public String name; // (getter/setter hidden)

    @Override
    public int hashCode() {
        // Return int-value of first letter.
        return (null == getName() || getName().isEmpty()) ? 0 : getName().charAt(0);
    }
}

● Java's and .Net's inherited equals()/Equals() (inherited from Object esp. for .Net reference types) will just compare references. It means that equals()/Equals() will only return true, if the same object is compared to itself, so the inherited implementations do an identity comparison.

● Java's inherited hashCode() (inherited from Object) returns an identity-based number, which is typically (but not necessarily) derived from the object's address in memory. .Net's inherited GetHashCode() (inherited from Object, i.e. for .Net reference types) likewise returns an identity-based number; it is not guaranteed to be unique.


Associative Collections – Hashtable-based Implementations - .Net's Dictionary

● In .Net we can now use NameKey as type for the key in an IDictionary that is implemented in terms of a hashtable.

– The hashtable-based implementation of IDictionary is simply called Dictionary! It can (of course) be used like any other IDictionary:

– When GetHashCode() is overridden in the key's type, the hashtable will automatically work:

● When those items are added into the telephoneDictionary, following will happen:

– (1) For the key NameKey{ Name = "Bud" } the hash-code is retrieved. GetHashCode() returns 66.

– (2) In the hashtable-array (here an int[]) the entry on the index 66 is fetched:

● If there is an empty entry at 66, it will be set to the value 3821,

● otherwise the value at the existing entry at 66 will be overwritten with the value 3821.

– Another key (like NameKey{ Name = "Gil" }) yields another hash-code and another index in the hashtable.

– The operations in (2) are O(1) operations, because accessing indexed collections (like arrays) on indexes only costs O(1).

● Oversimplification! There's more and it's getting more contrived! And we have to understand it to get the whole picture!

IDictionary<NameKey, int> telephoneDictionary = new Dictionary<NameKey, int>();

telephoneDictionary[new NameKey{ Name = "Bud" }] = 3821; // GetHashCode() returns 66.
telephoneDictionary[new NameKey{ Name = "Gil" }] = 7764; // GetHashCode() returns 71.

[Figure: telephoneDictionary with hashTable : int[]; adding the objects NameKey "Bud" with 3821 and NameKey "Gil" with 7764 stores 3821 at index 66 and 7764 at index 71, all other slots remain empty.]

● What we discuss here is a sparse array implementation of a hashtable, in which many slots of the table remain empty, whilst only few slots of the table with the indexes matching the hash-code are filled. Real-world implementations of hashtable are much more compact (usually the index is calculated as "index = hash-code % hashtable-length"), but we're not going to discuss them in this course, because the mere idea is the same.

● We didn't discuss the implementation of hashCode()/GetHashCode() in great depth, esp. if many fields are to be taken into consideration in the hash-code. Many suggestions involve the combination of the hash-codes with xor operations in order to get a good distribution, including the multiplication with magic constants (often prime numbers), which may result in a yet better distribution, or retrieving hash-codes of different field-types in different ways. That's all fine, but many of these tips rely on special assumptions about the platform and the implementation. Using a simple xor combination is OK for most cases and magic should only be taken into consideration if performance (i.e. lookup) problems are present. Don't be too clever!
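● Combining the hash-codes of several fields can be sketched in Java like so. The type PersonKey and its method xorHash() are hypothetical and only serve as illustration. The sketch contrasts a simple xor combination with Objects.hash(), which combines the fields using the multiplier 31; one visible difference is that xor ignores the order of the fields.

```java
import java.util.Objects;

final class PersonKey { // Hypothetical key type with two fields.
    private final String firstName;
    private final String lastName;

    PersonKey(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    // Simple xor combination of the fields' hash-codes (null yields 0).
    int xorHash() {
        return Objects.hashCode(firstName) ^ Objects.hashCode(lastName);
    }

    @Override
    public int hashCode() {
        // Objects.hash() combines the fields with the multiplier 31,
        // so the order of the fields influences the result.
        return Objects.hash(firstName, lastName);
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (other == null || getClass() != other.getClass()) {
            return false;
        }
        PersonKey that = (PersonKey) other;
        return Objects.equals(firstName, that.firstName)
            && Objects.equals(lastName, that.lastName);
    }

    public static void main(String[] args) {
        PersonKey a = new PersonKey("Bud", "Gil");
        PersonKey b = new PersonKey("Gil", "Bud");
        // xor is symmetric: swapping the fields yields the same hash-code.
        System.out.println(a.xorHash() == b.xorHash()); // true
    }
}
```

● Both variants fulfill the rule that equal objects have equal hash-codes; they only differ in how well unequal objects are spread.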


Associative Collections – Hashtable-based Implementations – Hash Collisions and Buckets

● But there is a problem: what if we try adding keys having the same hash-code, e.g. names with the same first letter?

– The thing that happens here is called hash collision. To handle hash collisions in a hashtable we have to review the idea of Buckets.

● Actually a hashtable is an array of Bucket objects (Not just an int[]!).

– For each hash-code there exists one Bucket.

– The hash-code is the index of the hashtable, where that hash-code's Bucket resides.

– Within a Bucket all inserted items with a key having the same hash-code are collected as key-value-pairs.

● Hm... wait! If a hash collision leads to putting items into the same Bucket, how can the associated value be retrieved from the hashtable in a definite way?

– In other words: the hashtable story is even more complex!

telephoneDictionary[new NameKey{ Name = "Bud" }] = 3821; // GetHashCode() returns 66.
telephoneDictionary[new NameKey{ Name = "Ben" }] = 9427; // GetHashCode() also returns 66!

[Figure: telephoneDictionary with hashTable : Bucket[]; adding the objects NameKey "Bud" with 3821 and NameKey "Ben" with 9427 puts both key-value-pairs into the Bucket at index 66 (hashCode 66); the Bucket at index 71 (hashCode 71) holds "Gil" with 7764.]


Associative Collections – Hashtable-based Implementations – Equality

● As a matter of fact hash collisions can not be avoided, because any objects could have the same hash-code!

– E.g. two names could have the same first letter. This fact is not under our control!

● The answer is that all the key-value-pairs of a hash-code-matching Bucket need to be searched for a certain key.

– The keys of a hash-code-matching Bucket are linearly searched and compared for equality to find the exactly matching key.

– But what does "exactly matching key" mean in contrast to a (just) hash-code-matching key?

– We have to prepare our key-type to provide another method to check for the exact equality of two keys!

– Similar to objects providing hash-codes, Java and .Net want us to override the method equals()/Equals() inherited from Object:

● Let's discuss how this works!

// C#
public class NameKey { // (members hidden)
    public override int GetHashCode() { /* pass */ }

    public override bool Equals(object other) {
        // Simple and naïve implementation of Equals():
        NameKey otherNameKey = (NameKey)other;
        return Name == otherNameKey.Name;
    }
}

// Java
public class NameKey { // (members hidden)
    @Override
    public int hashCode() { /* pass */ }

    @Override
    public boolean equals(Object other) {
        // Simple and naïve implementation of equals():
        NameKey otherNameKey = (NameKey) other;
        return getName().equals(otherNameKey.getName());
    }
}


Associative Collections – Hashtable Implementations – Implementing Equality

● .Net: Mechanics of comparing two NameKey objects with the method Equals():

– Equals() has a parameter that is of the very base type Object (static type).

– We assume the dynamic type behind the passed argument is NameKey (only NameKeys can be equality-compared to each other).

● We can cast the parameter other to NameKey blindly. (And this can be a problem, which we'll discuss shortly.)

– When we've the NameKey behind the parameter, we equality-compare the argument's Name property against this' Name property.

● Mind that we used the operator== to compare the Name properties (well, we could have used Equals() as well); those are of type string.

● Equals() makes a deeper (more expensive) comparison than GetHashCode(), the latter only deals with Name's first letter:

public class NameKey { // (members hidden)
    public override bool Equals(object other) {
        NameKey otherNameKey = (NameKey)other;
        return this.Name == otherNameKey.Name;
    }

    public override int GetHashCode() {
        return string.IsNullOrEmpty(this.Name) ? 0 : this.Name[0];
    }
}

// Both hash-codes evaluate to 66, but these objects are not equal!
NameKey key1 = new NameKey{ Name = "Bud" };
NameKey key2 = new NameKey{ Name = "Ben" };

bool hashCodesAreTheSame = key1.GetHashCode() == key2.GetHashCode(); // evaluates to true
bool areEqual = key1.Equals(key2); // evaluates to false


Associative Collections – Hashtable-based Implementations – Rules of Equality

● We'll not discuss all facets of the implementation of GetHashCode() and Equals(), but here are the most important rules:

– Two objects having the same hash-code need not return true for calling Equals()!

– But, if two objects having the same dynamic type return true for calling Equals(), then those need to have the same hash-code!

– GetHashCode() and Equals() have to return the same results for the "structurally" same objects unless one of them is modified.

– GetHashCode() and Equals() should be implemented to work very fast.

– Neither GetHashCode() nor Equals() is allowed to throw exceptions!

● If null is passed to Equals() the result has to be false.

● Finally GetHashCode() and Equals() for NameKey could be implemented like so to fulfill these rules:

public class NameKey { // (members hidden)
    public override int GetHashCode() {
        return string.IsNullOrEmpty(Name) ? 0 : Name[0];
    }

    public override bool Equals(object other) { // Stable implementation of Equals()
        // Checks for identity.
        if (this == other) {
            return true;
        }
        // Checks for nullity (to avoid exceptions) and checks the dynamic type of this and the other object.
        if (null != other && GetType() == other.GetType()) {
            // The cast is type safe; the Name properties of both objects are equality-compared.
            return Name == ((NameKey)other).Name;
        }
        return false;
    }
}
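● In Java the same rules could be fulfilled like so. This is only a sketch of a Java counterpart: the constructor and the final field are assumptions for this example, and Objects.equals() is used so that a null name does not lead to an exception.

```java
import java.util.Objects;

class NameKey {
    private final String name; // Assumption: set once via the constructor.

    NameKey(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    @Override
    public int hashCode() {
        // Return int-value of first letter.
        return (name == null || name.isEmpty()) ? 0 : name.charAt(0);
    }

    @Override
    public boolean equals(Object other) {
        // Checks for identity.
        if (this == other) {
            return true;
        }
        // Checks for nullity and the dynamic type of this and the other object.
        if (other == null || getClass() != other.getClass()) {
            return false;
        }
        // The cast is type safe now; the names are compared null-safely.
        return Objects.equals(name, ((NameKey) other).name);
    }

    public static void main(String[] args) {
        NameKey bud = new NameKey("Bud");
        System.out.println(bud.equals(new NameKey("Bud"))); // true
        System.out.println(bud.equals(new NameKey("Ben"))); // false
        System.out.println(bud.equals(null));               // false
    }
}
```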


Associative Collections – Hashtable-based Implementations – The Lookup Algorithm

● All right! Now we have the methods GetHashCode() and Equals() in place. But how do hashtables use these tools to work?

● Let's assume following content in the hashtable-based Dictionary telephoneDictionary:

– Then we'll lookup/search the phone number of "Bud":

● The lookup will initiate following algorithm basically:

– Dictionary will call GetHashCode() on the indexer's (i.e. operator[]) argument 'new NameKey{ Name = "Bud" }' and the result is 66.

– Dictionary will get the Bucket at the hashtable's index 66. This Bucket has two entries!

– Dictionary will call 'Equals(new NameKey{ Name = "Bud" })' against each key of the key-value-pairs in the just returned Bucket.

– The value of the key-value-pair for which the key was equal to 'new NameKey{ Name = "Bud" }' will be returned.

● Read these steps several times and make sure you understand them! Now we have to discuss some details...

– The algorithms to insert or update a key-value-pair work the same way as for lookups!

[Figure: searching the key NameKey "Bud" in telephoneDictionary; the Bucket at index 66 (hashCode 66) holds the key-value-pairs "Bud" with 3821 and "Ben" with 9427.]

IDictionary<NameKey, int> telephoneDictionary = new Dictionary<NameKey, int> {
    { new NameKey { Name = "Bud" }, 3821 },
    { new NameKey { Name = "Ben" }, 9427 }
};

int no = telephoneDictionary[new NameKey { Name = "Bud" }];
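● The lookup algorithm can be sketched as a tiny hashtable in Java. This is a simplified illustration, not how Dictionary or HashMap are really implemented: the class ChainedTableSketch is hypothetical, the number of Buckets is fixed, null keys are not supported, and the index is calculated as "hash-code modulo number of Buckets" instead of using the hash-code directly as index.

```java
import java.util.ArrayList;
import java.util.List;

class ChainedTableSketch<K, V> { // Hypothetical name, for illustration only.
    private static final class Entry<K, V> {
        final K key;
        V value;

        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    // Each Bucket collects the key-value-pairs whose keys map to its index.
    private final List<List<Entry<K, V>>> buckets = new ArrayList<>();

    ChainedTableSketch(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // Steps 1 and 2: get the hash-code, derive the index, fetch the Bucket.
    private List<Entry<K, V>> bucketFor(K key) {
        int index = Math.floorMod(key.hashCode(), buckets.size());
        return buckets.get(index);
    }

    // Inserting and updating work the same way as the lookup.
    void put(K key, V value) {
        List<Entry<K, V>> bucket = bucketFor(key);
        for (Entry<K, V> entry : bucket) {
            if (entry.key.equals(key)) {
                entry.value = value; // Key already present: overwrite.
                return;
            }
        }
        bucket.add(new Entry<>(key, value));
    }

    // Steps 3 and 4: search the Bucket linearly with equals().
    V get(K key) {
        for (Entry<K, V> entry : bucketFor(key)) {
            if (entry.key.equals(key)) {
                return entry.value;
            }
        }
        return null; // No equal key in the Bucket.
    }

    public static void main(String[] args) {
        // Only one Bucket: every key collides, the lookup still works.
        ChainedTableSketch<String, Integer> telephoneDictionary = new ChainedTableSketch<>(1);
        telephoneDictionary.put("Bud", 3821);
        telephoneDictionary.put("Ben", 9427);
        System.out.println(telephoneDictionary.get("Bud")); // 3821
    }
}
```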


Associative Collections – Hashtable-based Implementation – best Case Complexity

● Although the algorithm to lookup keys is complex, the analysis of a hashtable's complexity is really simple!

● The best case yields a complexity of O(1)! - Wow!

– If every item can be associated to exactly one distinct Bucket each, we have one item in every Bucket: a 1 : 1 association between items and Buckets.

– This means that every item has a distinct hash-code and thus a distinct index in the hashtable.

– As the hashtable is an array and arrays can access their items by index with O(1) complexity, we have the best case for hashtable!

● In the best case, accessing a hashtable means accessing an array by index with constant complexity.

– This is better than O(log n) for tree-based associative collections!

– This is the best we can get for collections!

● But there is also a worst case!

[Figure: best case; telephoneDictionary with one item per Bucket: the Bucket at index 66 holds "Bud" with 3821, the Bucket at index 71 holds "Gil" with 7764, the Bucket at index 87 holds "Will" with 4689.]


Associative Collections – Hashtable-based Implementations – worst Case Complexity

● The worst case yields a complexity of O(n)! - Oh no!

– If every item is associated to the same Bucket, we have all items in only one single Bucket: an n : 1 association between items and Buckets.

– This means that every item has the same hash-code and thus the same index in the hashtable.

– As the only Bucket needs to be searched linearly to find the equal key, the worst complexity boils down to the linear complexity O(n)!

● In the worst case the workload is moved to a single Bucket that must be searched in a linear manner.

[Figure: worst case; all items reside in the single Bucket at index 66 (hashCode 66): "Bud" with 3821, "Ben" with 9427 and "Betty" with 8585.]

● Java's java.util.concurrent.ConcurrentHashMap is able to switch the storage of the buckets from a List-implementation to a sorted tree implementation, if the buckets grow too large (i.e. too many keys with the same hashcode). The benefit is that searching a bucket will be more efficient then (O(n) for List, but O(log n) for sorted trees).


Associative Collections – Hashtable-based Implementations – Controlling Performance – Part I

● An interesting point when working with .Net's Dictionary is how we can influence the performance as developers:

– We can override GetHashCode() and Equals() for our own UDTs being used as keys and this is what we are going to discuss now.

● Concretely we have a problem with the UDT NameKey: we'll get the same hash-code for keys having the same first letter!

– However, we can override GetHashCode(), so that a "deeper" or in other words "more distinct" hash-code will be produced.

– .Net's string is able to produce its own hash-code that is calculated concerning the whole string-value and not only the first letter:

● We already know that each .Net type inherits Object and can override GetHashCode() and Equals().

– A type implementing equality in the .Net framework needs overriding GetHashCode() and Equals()!

● Think: GetHashCode() => level one equality, Equals() => level two equality.

– Many types of the .Net framework provide useful overrides of GetHashCode() and Equals() to implement equality.

– (GetHashCode() is not only used with Dictionary's but also in other places of the .Net framework.)

– (A type implementing equivalence in the .Net framework needs implementing IComparable or another type implementing IComparer.)

public class NameKey {
    public string Name { get; set; }

    public override int GetHashCode() {
        // Return int-value of first letter.
        return string.IsNullOrEmpty(Name) ? 0 : Name[0];
    }
}

public class NameKey {
    public string Name { get; set; }

    public override int GetHashCode() {
        // Return the hash-code of the string Name.
        return string.IsNullOrEmpty(Name) ? 0 : Name.GetHashCode();
    }
}

● switch-case with strings in C# and Java uses the strings' hash-code to do the comparisons.

● It should be mentioned that some methods of list-types also use equality to function, e.g. methods like Contains(), Remove() or IndexOf().
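● The effect of a "deeper" hash-code can be made visible with a small Java sketch. The class HashSpreadSketch and its methods are hypothetical helpers for this illustration; the sketch just counts how many distinct hash-codes a list of names yields with the first-letter hash-function and with String's own hashCode().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HashSpreadSketch { // Hypothetical name, for illustration only.
    // Hash-function of the first NameKey variant: int-value of the first letter.
    static int firstLetterHash(String name) {
        return (name == null || name.isEmpty()) ? 0 : name.charAt(0);
    }

    // Hash-function of the second NameKey variant: hash-code of the whole string.
    static int wholeStringHash(String name) {
        return (name == null || name.isEmpty()) ? 0 : name.hashCode();
    }

    // Counts the distinct hash-codes, i.e. the number of Buckets that would be used
    // if each hash-code had its own Bucket.
    static int countDistinctHashCodes(List<String> names, boolean useWholeString) {
        Set<Integer> hashCodes = new HashSet<>();
        for (String name : names) {
            hashCodes.add(useWholeString ? wholeStringHash(name) : firstLetterHash(name));
        }
        return hashCodes.size();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Bud", "Ben", "Betty");
        System.out.println(countDistinctHashCodes(names, false)); // 1
        System.out.println(countDistinctHashCodes(names, true));  // 3
    }
}
```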


NO! Stop it! We're not going to discuss it here! As a matter of fact implementing equality correctly in .Net and Java is not simple and it is often done downright wrong and potentially dangerous!

I have one simple tip: Don't be too clever and follow the rules!

Ok, we'll discuss it in depth in a future lecture.

***Under Construction – Equality – Under Construction***

● An interesting point is that each object in Java/.Net implements equals()/Equals() and hashCode()/GetHashCode().

– But overrides of those methods need to follow certain rules. Let's discuss those.


Associative Collections – Hashtable-based Implementations – Controlling Performance – Part II

● The last implementation of NameKey does basically delegate equality fully to its property Name, which is of type string.

– Hm! - To drive this point home: we can get rid of the type NameKey and use string instead as key-type! Hey!

● As a matter of fact, it is often not required to program extra UDTs as key types!

– The present .Net types are often sufficient to be used as keys for hashtables. Most often int and string are used as key types.

– Nevertheless, .Net (and Java) developers have to know, how equality needs to be implemented correctly!

● In principle the implementation pattern and even the idiom for equality works the same way in Java!

// A dictionary that associates strings (names) to ints (phone numbers):
IDictionary<string, int> telephoneDictionary = new Dictionary<string, int>();
// Adding two string-int key-value-pairs:
telephoneDictionary["Bud"] = 3821;
telephoneDictionary["Gil"] = 7764;
// Looking up the phone number of "Bud":
int no = telephoneDictionary["Bud"];
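● For comparison, a Java counterpart with String as key-type could look like so. The class TelephoneMapSketch and its factory method are hypothetical; HashMap is Java's hashtable-based implementation of Map.

```java
import java.util.HashMap;
import java.util.Map;

class TelephoneMapSketch { // Hypothetical name, for illustration only.
    // Creates a map that associates Strings (names) to Integers (phone numbers).
    static Map<String, Integer> createTelephoneDictionary() {
        Map<String, Integer> telephoneDictionary = new HashMap<>();
        // Adding two String-Integer key-value-pairs:
        telephoneDictionary.put("Bud", 3821);
        telephoneDictionary.put("Gil", 7764);
        return telephoneDictionary;
    }

    public static void main(String[] args) {
        Map<String, Integer> telephoneDictionary = createTelephoneDictionary();
        // Looking up the phone number of "Bud":
        int no = telephoneDictionary.get("Bud");
        System.out.println(no); // 3821
    }
}
```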


● Without any further explanation it should be clear that maintaining a collection having only unique items is very painful.

– E.g. collecting only the surnames from a deck of invitations. It could be done by filling a list w/o duplicates:

● There is a variant of associative collections that just avoids the presence of duplicates. Those are often called sets.

– Sets are associative collections like dictionaries, in which the keys are the values!

– In Java/.Net we can use BST-based (TreeSet/SortedSet) and hashtable-based (HashSet/HashSet) implementations of sets.

– In C++ we can use either the BST-based std::set (<set>) or the hashtable-based std::unordered_set (C++11: <unordered_set>).

// Java
List<String> allSurnames = Arrays.asList("Taylor", "Miller", "Taylor", "Miller");

List<String> uniqueSurnames = new ArrayList<>();
for (String surName : allSurnames) {
    if (!uniqueSurnames.contains(surName)) { // Filters unique items out.
        uniqueSurnames.add(surName);
    }
}

for (String surName : uniqueSurnames) {
    System.out.println(surName);
}
// > Taylor
// > Miller

// Implementation using a Set:
Set<String> uniqueSurnames2 = new HashSet<>();
for (String surName : allSurnames) {
    uniqueSurnames2.add(surName);
}

for (String surName : uniqueSurnames2) {
    System.out.println(surName);
}

Associative Collections – Sets

● As for dictionaries (and all associative collections) the reason for using sets is to exploit their self-organization (keeping items unique), instead of organizing the items ourselves.


● Set collections can really represent mathematical sets; this also includes operations on sets (e.g. via .Net's ISet interface):

● Subsets

● Union

// C#/.Net
ISet<int> A = new HashSet<int>{ 1, 2, 3, 4, 5, 6, 7, 8, 9 };
ISet<int> B = new HashSet<int>{ 3, 6, 8 };

bool BSubSetOfA = B.IsSubsetOf(A);
// >true
bool BProperSubSetOfA = B.IsProperSubsetOf(A);
// >true

Associative Collections – Operations on Sets – Part I

// Java
Set<Integer> A = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9));
Set<Integer> B = new HashSet<>(Arrays.asList(3, 6, 8));

boolean BSubSetOfA = A.containsAll(B);
// >true
// Java doesn't provide a test for _proper_ subsets.

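Subset (⊆) and proper subset (⊂): with A = {1, 2, 3, 4, 5, 6, 7, 8, 9} and B = {3, 6, 8}, both B ⊆ A and B ⊂ A hold.

Union (A ∪ B): A = {1, 2, 3, 4, 5, 6}; B = {4, 5, 6, 7, 8, 9}; {1, 2, 3, 4, 5, 6} ∪ {4, 5, 6, 7, 8, 9} = {1, 2, 3, 4, 5, 6, 7, 8, 9}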

// C#/.Net
ISet<int> A = new SortedSet<int>{ 1, 2, 3, 4, 5, 6 };
ISet<int> B = new SortedSet<int>{ 4, 5, 6, 7, 8, 9 };

A.UnionWith(B); // The set A will be _modified_!
// >{1, 2, 3, 4, 5, 6, 7, 8, 9}

// LINQ's Union() extension method will create a _new_ sequence:
ISet<int> A2 = new SortedSet<int>{ 1, 2, 3, 4, 5, 6 };
ISet<int> B2 = new SortedSet<int>{ 4, 5, 6, 7, 8, 9 };
IEnumerable<int> A2UnionB2 = A2.Union(B2);
// >{1, 2, 3, 4, 5, 6, 7, 8, 9}

// Java
Set<Integer> A = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
Set<Integer> B = new TreeSet<>(Arrays.asList(4, 5, 6, 7, 8, 9));

boolean AWasModified = A.addAll(B); // The set A will be _modified_!
// >true
System.out.println(A);
// >[1, 2, 3, 4, 5, 6, 7, 8, 9]


● A proper subset is a set that is a subset of another set, while both sets are not equal!

● Subsets can also be expressed with LINQ, but usually the type ISet and its implementors should be used, because it leads to more expressive code:

// C#/.Net/LINQ
ISet<int> A = new HashSet<int>{ 1, 2, 3, 4, 5, 6, 7, 8, 9 };
ISet<int> B = new HashSet<int>{ 1, 2, 3, 4, 5, 6, 7, 8, 9 };
ISet<int> C = new HashSet<int>{ 3, 6, 8 };

// Subset:
bool BIsSubsetOfA = !B.Except(A).Any();
// >true
bool CIsSubsetOfA = !C.Except(A).Any();
// >true

// Proper Subset (a subset that is not equal to the other set):
bool BIsProperSubSetOfA = !B.Except(A).Any() && A.Except(B).Any();
// > false
bool CIsProperSubSetOfA = !C.Except(A).Any() && A.Except(C).Any();
// > true
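Because Java provides no built-in test for proper subsets, such a test could be sketched on top of containsAll(). The helper class SetRelations and its method names are hypothetical, and the size comparison assumes that both sets use the same notion of equality (e.g. two HashSets or two TreeSets with the same ordering).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SetRelations {
    // candidate ⊆ superset: every item of candidate is contained in superset.
    static boolean isSubset(Set<?> candidate, Set<?> superset) {
        return superset.containsAll(candidate);
    }

    // candidate ⊂ superset: a subset that has fewer items than superset.
    static boolean isProperSubset(Set<?> candidate, Set<?> superset) {
        return candidate.size() < superset.size() && superset.containsAll(candidate);
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9));
        Set<Integer> b = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9));
        Set<Integer> c = new HashSet<>(Arrays.asList(3, 6, 8));

        System.out.println(isSubset(b, a));       // > true
        System.out.println(isProperSubset(b, a)); // > false
        System.out.println(isProperSubset(c, a)); // > true
    }
}
```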


● Difference

● Symmetric difference

// C#/.Net
ISet<int> A = new SortedSet<int>{ 1, 2, 3, 4, 5, 6 };
ISet<int> B = new SortedSet<int>{ 4, 5, 6, 7, 8, 9 };

A.ExceptWith(B); // The set A will be _modified_!
// >{1, 2, 3}

// LINQ's Except() method will create a _new_ sequence:
ISet<int> A2 = new SortedSet<int>{ 1, 2, 3, 4, 5, 6 };
ISet<int> B2 = new SortedSet<int>{ 4, 5, 6, 7, 8, 9 };
IEnumerable<int> A2ExceptB2 = A2.Except(B2);
// >{1, 2, 3}

Associative Collections – Operations on Sets – Part II


// Java
Set<Integer> A = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
Set<Integer> B = new TreeSet<>(Arrays.asList(4, 5, 6, 7, 8, 9));

A.removeAll(B); // The set A will be _modified_!
System.out.println(A);
// >[1, 2, 3]

A∖B: A={1,2,3,4,5,6}; B={4,5,6,7,8,9}; {1,2,3,4,5,6}∖{4,5,6,7,8,9}={1,2,3}

A▵B := (A∖B)∪(B∖A): A={1,2,3,4,5,6}; B={4,5,6,7,8,9}; {1,2,3,4,5,6}▵{4,5,6,7,8,9}={1,2,3,7,8,9}

// C#/.Net
ISet<int> A = new SortedSet<int>{ 1, 2, 3, 4, 5, 6 };
ISet<int> B = new SortedSet<int>{ 4, 5, 6, 7, 8, 9 };

A.SymmetricExceptWith(B); // The set A will be _modified_!
// >{1, 2, 3, 7, 8, 9}

// Using LINQ we can create a _new_ sequence:
ISet<int> A2 = new SortedSet<int>{ 1, 2, 3, 4, 5, 6 };
ISet<int> B2 = new SortedSet<int>{ 4, 5, 6, 7, 8, 9 };
IEnumerable<int> A2SymmetricExceptB2 = A2.Except(B2).Union(B2.Except(A2));
// >{1, 2, 3, 7, 8, 9}

// Java
Set<Integer> A = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
Set<Integer> B = new TreeSet<>(Arrays.asList(4, 5, 6, 7, 8, 9));
Set<Integer> A2 = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5, 6)); // A copy of the original A.

A.removeAll(B); // The sets A and B will be _modified_!
B.removeAll(A2);
A.addAll(B);
System.out.println(A);
// >[1, 2, 3, 7, 8, 9]
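If the input sets must stay untouched, the Java variant can work on copies instead. The following is only a sketch following the definition A▵B := (A∖B)∪(B∖A); the class SymmetricDifference and its method are hypothetical names.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class SymmetricDifference {
    // Returns a _new_ sorted set; neither a nor b is modified.
    static <T extends Comparable<? super T>> Set<T> of(Set<T> a, Set<T> b) {
        Set<T> onlyInA = new TreeSet<>(a); // Copy of a.
        onlyInA.removeAll(b);              // A \ B
        Set<T> onlyInB = new TreeSet<>(b); // Copy of b.
        onlyInB.removeAll(a);              // B \ A
        onlyInA.addAll(onlyInB);           // (A \ B) ∪ (B \ A)
        return onlyInA;
    }

    public static void main(String[] args) {
        Set<Integer> a = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        Set<Integer> b = new TreeSet<>(Arrays.asList(4, 5, 6, 7, 8, 9));
        System.out.println(of(a, b)); // > [1, 2, 3, 7, 8, 9]
        System.out.println(a);        // > [1, 2, 3, 4, 5, 6]
    }
}
```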


● Intersection

Associative Collections – Operations on Sets – Part III

A∩B: A={1,2,3,4,5,6}; B={4,5,6,7,8,9}; {1,2,3,4,5,6}∩{4,5,6,7,8,9}={4,5,6}

// C#/.Net
ISet<int> A = new SortedSet<int>{ 1, 2, 3, 4, 5, 6 };
ISet<int> B = new SortedSet<int>{ 4, 5, 6, 7, 8, 9 };

A.IntersectWith(B); // The set A will be _modified_!
// >{4, 5, 6}

// LINQ's Intersect() method will create a _new_ sequence:
ISet<int> A2 = new SortedSet<int>{ 1, 2, 3, 4, 5, 6 };
ISet<int> B2 = new SortedSet<int>{ 4, 5, 6, 7, 8, 9 };
IEnumerable<int> A2IntersectionWithB2 = A2.Intersect(B2);
// >{4, 5, 6}

// Java
Set<Integer> A = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
Set<Integer> B = new TreeSet<>(Arrays.asList(4, 5, 6, 7, 8, 9));

A.retainAll(B); // The set A will be _modified_!
System.out.println(A);
// >[4, 5, 6]
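Comparable to LINQ's Intersect(), Java's stream API (Java 8 or newer) can create a _new_ set instead of modifying A. This is one possible sketch; the class name Intersection is hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class Intersection {
    // Keeps those items of a that are also contained in b; a and b stay unmodified.
    static <T extends Comparable<? super T>> Set<T> of(Set<T> a, Set<T> b) {
        return a.stream()
                .filter(b::contains)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        Set<Integer> a = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        Set<Integer> b = new TreeSet<>(Arrays.asList(4, 5, 6, 7, 8, 9));
        System.out.println(of(a, b)); // > [4, 5, 6]
    }
}
```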


● Inserting a value under a key that is already present in an associative collection overwrites or updates the present value.

– Sometimes this is not desired. E.g. consider a telephoneDictionary, in which a name can have more than one phone number!

● Some collection frameworks provide associative collections that can handle multiplicity.

– In C++ we can use std::multimap (<map>) and std::multiset (<set>).

● In other frameworks (Java, .Net etc.) multi-associative collections need to be implemented explicitly or taken from third-party libraries.

– Third-party sources: Apache Commons (Java)

// C++11
std::multimap<std::string, int> telephoneDictionary {
    { "Ben", 9427 }, // Mind: two values with the same key are added here.
    { "Ben", 4367 },
    { "Jody", 1781 }, // Mind: three values with the same key are added here.
    { "Jody", 9032 },
    { "Jody", 8038 }
};

// It is required to use STL iterators, because std::multimap provides no subscript operator:
for (auto item = telephoneDictionary.begin(); item != telephoneDictionary.end(); ++item) {
    std::cout<<"Name: "<<item->first<<", Phone number: "<<item->second<<std::endl;
}
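In Java, where no multimap is part of the standard collections, one common way to implement multiplicity explicitly is a map whose values are lists. The following sketch assumes this approach; the class TelephoneDictionary and its methods are hypothetical and not taken from any library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TelephoneDictionary {
    // Each name is associated with a list of phone numbers.
    private final Map<String, List<Integer>> entries = new TreeMap<>();

    // Adds a number; the list for a name is created on first use.
    void add(String name, int number) {
        entries.computeIfAbsent(name, key -> new ArrayList<>()).add(number);
    }

    // Returns all numbers of a name, or an empty list for unknown names.
    List<Integer> numbersOf(String name) {
        return entries.getOrDefault(name, Collections.emptyList());
    }

    public static void main(String[] args) {
        TelephoneDictionary telephoneDictionary = new TelephoneDictionary();
        telephoneDictionary.add("Ben", 9427);
        telephoneDictionary.add("Ben", 4367);
        telephoneDictionary.add("Jody", 1781);
        telephoneDictionary.add("Jody", 9032);
        telephoneDictionary.add("Jody", 8038);
        System.out.println(telephoneDictionary.numbersOf("Jody"));   // > [1781, 9032, 8038]
        System.out.println(telephoneDictionary.numbersOf("Nobody")); // > []
    }
}
```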

Special associative Collections – Multiplicity of Keys

[Figure: telephoneDictionary associates "Ben" with 9427 and 4367, and "Jody" with 1781, 9032 and 8038.]


● Key

– Refrain from modifying key objects managed in associative collections.
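Why modifying keys is dangerous can be sketched with a hashtable-based set in Java: after the mutation the hash-code no longer matches the bucket in which the object was stored. The classes MutableKeyPitfall and Key are hypothetical and only serve this illustration.

```java
import java.util.HashSet;
import java.util.Set;

final class MutableKeyPitfall {
    // A key type whose hash-code and equality depend on a mutable field.
    static final class Key {
        int id;

        Key(int id) { this.id = id; }

        @Override
        public boolean equals(Object other) {
            return other instanceof Key && ((Key) other).id == id;
        }

        @Override
        public int hashCode() { return Integer.hashCode(id); }
    }

    // Adds a key, modifies it afterwards and then tries to find it again.
    static boolean foundAfterMutation() {
        Set<Key> keys = new HashSet<>();
        Key key = new Key(1);
        keys.add(key);
        key.id = 2; // The key object is modified while the set manages it.
        return keys.contains(key);
    }

    public static void main(String[] args) {
        System.out.println(foundAfterMutation()); // > false
    }
}
```

The object is still stored in the set, but it can no longer be found via its (new) hash-code.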

● Sorted/equivalence-based associative collections:

– TreeMap/TreeSet (Java), SortedDictionary/SortedSet (.Net), std::set/std::map (C++).

– The internal organization of keys is equivalence-based:

● Equivalence-based means that key objects are organized according to their relative/natural order. This is often implemented with a BST.

● The key-type needs to implement Comparable (Java) or IComparable (.Net) or operator< (C++ STL) for a semantically correct "natural order".

● Or a Comparator (Java) or IComparer (.Net) or a comparison functor (C++ STL) that implements the ordering for the key-type needs to be specified for the associative collection.

– Searching/inserting/removing items can be done very fast (O(log n)), because binary searches are used (BST).

– Sorted associative collections do not preserve the insertion order!

● The iteration will yield the items in sorted order (BST in-order). The order in which items have been added/inserted/removed doesn't matter!

– => Use these associative collections, if sorting of keys or the control of key-comparison (e.g. reverse order) is needed!
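Controlling the key-comparison can be sketched in Java by passing a Comparator to a TreeMap. The class ReverseOrderedMap is a hypothetical name, and the phone numbers are just sample data.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class ReverseOrderedMap {
    // Copies the items into a BST-based map that sorts its keys in reverse natural order.
    static SortedMap<String, Integer> sortedDescending(Map<String, Integer> source) {
        SortedMap<String, Integer> result = new TreeMap<>(Comparator.reverseOrder());
        result.putAll(source);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> phoneNumbers = new HashMap<>();
        phoneNumbers.put("Ben", 9427);
        phoneNumbers.put("Jody", 1781);
        phoneNumbers.put("Ana", 5550);
        // The insertion order doesn't matter, only the comparison of the keys:
        System.out.println(sortedDescending(phoneNumbers).keySet()); // > [Jody, Ben, Ana]
    }
}
```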

Things to ponder about Associative Collections – Part I


● Equality-based associative collections:

– HashMap/HashSet (Java), Dictionary/HashSet (.Net), NSDictionary/NSSet (Cocoa), std::unordered_map/std::unordered_set (C++11), associative arrays (JavaScript).

– The internal organization of keys is equality-based:

● By equality, i.e. by hash-codes and the result of the equality check (methods hashCode()/equals() (Java) or GetHashCode()/Equals() (.Net)).

● The key-type needs to implement hashCode()/equals() (Java) or GetHashCode()/Equals() (.Net) for semantically correct equality.

● Or an IEqualityComparer (.Net; it supersedes the obsolete IHashCodeProvider) that implements equality for the key-type needs to be specified for the associative collection.

● In C++11 STL a hasher functor should be specified for the associative collection that provides hash-codes for the key-type.

– Searching/inserting/removing items can be done in O(1) on average: no extra search operation is required, because the hash-code is used as an index.

– These associative collections are unordered!

● There are no guarantees about the order in which items are yielded during iteration. It can change completely when items are added.

● In most cases equality-based associative collections should be our first choice, because they are potentially the most efficient.

– => Use these associative collections, if sorting of keys or the control of key-comparison doesn't matter!
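What implementing hashCode()/equals() for a key-type means in Java can be sketched as follows. The classes PersonNameDemo and PersonName are hypothetical; the essential point is that both methods are derived from the same fields.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class PersonNameDemo {
    // An immutable key-type with value-based equality.
    static final class PersonName {
        private final String first;
        private final String last;

        PersonName(String first, String last) {
            this.first = first;
            this.last = last;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof PersonName)) return false;
            PersonName that = (PersonName) other;
            return first.equals(that.first) && last.equals(that.last);
        }

        @Override
        public int hashCode() {
            return Objects.hash(first, last); // Equal objects yield equal hash-codes.
        }
    }

    // Looks up a value with a key object that is equal, but not identical to the stored key.
    static Integer lookup() {
        Map<PersonName, Integer> phoneNumbers = new HashMap<>();
        phoneNumbers.put(new PersonName("Jody", "Miller"), 1781);
        return phoneNumbers.get(new PersonName("Jody", "Miller"));
    }

    public static void main(String[] args) {
        System.out.println(lookup()); // > 1781
    }
}
```

Without the overridden methods the lookup would compare object identities and yield null.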

● Ordered associative collections:

– LinkedHashMap (Java) and OrderedDictionary (.Net) will iterate in the order in which the items were put into the collection.

● LinkedHashMap is the type Groovy uses to implement map literals.
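The difference between ordered and sorted iteration can be sketched by filling a LinkedHashMap and a TreeMap with the same items. The class IterationOrder and the sample data are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IterationOrder {
    // Puts three items into the passed map and returns its keys in iteration order.
    static List<String> keysAfterInserting(Map<String, Integer> map) {
        map.put("Jody", 1781);
        map.put("Ana", 5550);
        map.put("Ben", 9427);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // Ordered: the insertion order is retained.
        System.out.println(keysAfterInserting(new LinkedHashMap<>())); // > [Jody, Ana, Ben]
        // Sorted: the natural order of the keys is used.
        System.out.println(keysAfterInserting(new TreeMap<>()));       // > [Ana, Ben, Jody]
    }
}
```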

Things to ponder about Associative Collections – Part II


● Old and new collections:

– Java:

● The old Hashtable is not null-aware on keys: adding a null key will throw a NullPointerException. Better use HashMap/HashSet (Java 1.2 or newer).

● Hashtable and HashMap return null, if a requested key has no value (i.e. key is not present).

– .Net:

● The old object-based Hashtable (.Net 1) should not be used for new code. Better use Dictionary/SortedDictionary (.Net 2 or newer).

● Be aware that the indexer of Hashtable returns null, if a key has no value (i.e. key is not present). The indexer of Dictionary/SortedDictionary will throw a KeyNotFoundException.

● Keys with the value null:

– Java:

● TreeMap/TreeSet (with natural ordering) don't allow keys having the value null: adding such a key will throw a NullPointerException.

● HashMap/HashSet can digest keys having the value null.

– .Net:

● Dictionary/SortedDictionary don't allow keys having the value null: adding such a key will throw an ArgumentNullException.

● HashSet/SortedSet can digest keys having the value null.

● It was never mentioned explicitly: There are basically no constraints on the types of the values in associative collections!
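The Java behavior for null keys and for missing keys described above can be checked with a small sketch. The class NullKeys and its methods are hypothetical helpers; the TreeMap here uses natural ordering.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;
import java.util.TreeMap;

final class NullKeys {
    // Tries to add a null key and reports whether the map accepted it.
    static boolean acceptsNullKey(Map<String, Integer> map) {
        try {
            map.put(null, 1);
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    // Requests a key that is not present; Hashtable and HashMap return null.
    static Integer valueOfMissingKey(Map<String, Integer> map) {
        return map.get("missing");
    }

    public static void main(String[] args) {
        System.out.println(acceptsNullKey(new HashMap<>()));      // > true
        System.out.println(acceptsNullKey(new Hashtable<>()));    // > false
        System.out.println(acceptsNullKey(new TreeMap<>()));      // > false
        System.out.println(valueOfMissingKey(new HashMap<>()));   // > null
        System.out.println(valueOfMissingKey(new Hashtable<>())); // > null
    }
}
```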

Things to ponder about Associative Collections – Part III


Thank you!
