The Next Big Thing in Big Data

<ol><li> 1. Because I have a lot to cover, there won't be time for questions at the end, and I'm guessing some of the questions won't have simple answers. So you can go to my blog, where each of the sections is a separate post that you can comment on or ask questions about. </li><li> 2. First, let's look at the data explosion that everyone is talking about at the moment. </li><li> 3. This is a quote from a paper about the importance of data compression because of the data explosion. It seems reasonable: store information as efficiently as possible so that the effects of the explosion are manageable. [TRANSITION] This was written in 1969. So the data explosion is not a new phenomenon. It has been going on since the mid-60s. </li><li> 4. This is another quote, much more recent, that you might see online. It says that the amount of data being created and stored is multiplying by a factor of 10 every two years. I have not found any numerical data to back this up, so I will drill into this in a few minutes. </li><li> 5. So consider this graph of data quantities. It looks like it might qualify as a data explosion. But this is actually just the underlying trend of data growth, with no explosion happening. This graph just shows hard drive sizes for home computers, starting with the first PCs with 10MB drives in 1983 and going up to a 512GB drive today. Some of you might recognize that this exponential growth is the storage equivalent of Moore's law, which states that computing power doubles every 2 years. And we can see from these charts that hard drives have followed along at the same rate. This exponential growth in storage combines with a second principle. </li><li> 6. This statement is not just an ironic observation. This effect is due to the fact that the amount of data stored is also affected by Moore's law. With twice the computing power, you can process images that are twice as big. You can run applications with twice the logic. 
You can watch movies with twice the resolution. You can play games that are twice as detailed. All of these things require twice the space to store them. Today an HD movie can be 3 or 4 gigabytes. In 2001 that was your entire hard drive. </li><li> 7. With processing power doubling at the same rate that storage is increasing, what does this say about any gap between the data explosion and the CPU power required to process it? </li><li> 8. This is the growth in data. </li><li> 9. This is the growth in processing power. </li><li> 10. If we divide the amount of data by the amount of processing power, we get a constant. We get a straight line. If this holds true, then we will never drown in our own data. </li><li> 11. Can we really call it an explosion if it is just a natural trend? We don't talk about the explosion of processing power; it's just Moore's law. Is there a new explosion that is over and above the underlying trend? If so, how big is it, and will it continue? We are going to find the answers to all of these questions. Before we do, there are some things to understand. </li><li> 12. Firstly, there is a point, for any one kind of data, where the explosion stops or slows down. It is the point at which the data reaches its natural maximum granularity, beyond which there is little practical value in increasing the granularity. I'm going to demonstrate this natural maximum using some well-known data types. </li><li> 13. Let's start with color. Back in the early 80s we went from black and white computers to 16-color computers. The 16-color palette was a milestone because each color needed to have a name, and most computer programmers at the time couldn't name that many colors. So we had to learn teal, and fuchsia, and cyan, and magenta. Then 256 colors arrived a few years later. Which was great, because it was too many colors to name, so we didn't have to. Then 4,000 colors. And within the decade we were up to 24-bit color with 16 million colors. 
Since then color growth has slowed down. 30-bit color came 10 years later, followed by 48-bit color a decade ago, with its 280 trillion colors. But in reality most image and video editing software, and most images and </li><li> 14. Because we see a similar thing with video resolutions. They have increased, but not exponentially. The current standard is 4K, which has 4 times the resolution of 1080p. With 20/20 vision, 4K images exceed the power of the human eyeball when you view them from a typical distance. The Retina displays on Apple products are called that because they have a resolution designed to match the power of human vision. So images and video are just reaching their natural maximum, but these files will continue to grow in size as we gradually reduce the compression and increase the quality of the content. </li><li> 15. In terms of ability, the human hearing system lies in between 16-bit sound and 24-bit sound. So again we have hit the natural limit of this data type. If you still don't believe in the natural granularity, I have one further example. </li><li> 16. Dates. In the 60s and 70s we stored dates in COBOL as 6 digits. This gave rise to the Y2K issue. We managed to avoid that apocalypse. With 32-bit dates we extended the date range by 38 years. But since the creation of 64-bit systems and 64-bit dates, the next crisis for dates is? Everyone should have this in their diary. It's a Sunday afternoon. December 4th. But what year? Anyone? It's the year 292 billion blah blah blah. </li><li> 17. So this is the graph showing the natural granularity of dates for the next 290 billion years. [TRANSITION] For reference, the green line shows the current age of the universe, which is 14 billion years. </li><li> 18. So now that we understand that different data types have a natural maximum granularity, how does it relate to big data and the data explosion? </li><li> 19. 
Look at the example of a utility company that used to record your power consumption once a month and now does it every 10 seconds. Your household appliances (the dishwasher, fridge, oven, heating and air conditioning, TVs, computers) don't turn on and off that often. The microwave has the shortest duration, but usually 20 seconds is the shortest time it is on for. [TRANSITION] So this seems like a reasonable natural maximum. </li><li> 20. Now let's take a cardiologist who, instead of seeing a patient once a month to record some data, can now get data recorded once a second, 24 hours a day. Your heart rate, core temperature, and blood pressure don't change on a sub-second interval. [TRANSITION] So again this seems like a reasonable natural maximum. </li><li> 21. As companies create a production big data system, the amount of data stored will increase dramatically until they have enough history stored, anywhere from 2 to 10 years of data. Then the growth will reduce again. So the amount of data will explode, or pop, over a period of a few years. </li><li> 22. If this is your data before it pops [TRANSITION] then this is your data after it pops. </li><li> 23. There are millions of companies in the world. If you only talk to the top 1000 companies in the USA, you only get a very small view of the whole picture. </li><li> 24. This brings us back to this claim, which aligns with the hype. How can we really assess the growth in data? </li><li> 25. My thought is that if the data explosion is really going at a rate of 10x every two years, then HP, Dell, Cisco, and IBM must be doing really well, as these manufacturers account for 97% of the blade server market in North America. And Seagate, and Sandisk, and Fujitsu, and Hitachi must be doing really well too, as they make the storage. And Intel and AMD must be doing really well because they make the processors. Let's look at HP, which has 43% of the worldwide market in blade servers. </li><li> 26. 
From graphs of stock prices we can see that IBM, Cisco, Intel, EMC, and HP don't have growth rates that substantiate a data explosion. </li><li> 27. When we look at memory and drive manufacturers, the best of all of these are Seagate and Micron, with about 200-300% growth over 5 years. That is a multiplier of about 1.7 year over year. </li><li> 28. If we apply that multiplier of 1.7 to the underlying data growth trend, we see that the effect is noticeable but not really that significant. And that represents the maximum growth of any vendor, so the actual growth will be less than this. </li><li> 29. When we look at the computing industry from a high level, we see a shift in value: from hardware in the 60s and 70s, with IBM as the king, to software with Microsoft, then to tech products and solutions from companies like Google and Apple, and finally to products that are based purely on data, like Facebook and LinkedIn. Over the same time periods we have seen statistics [TRANSITION] be augmented with machine learning [TRANSITION] and more recently with deep learning. [TRANSITION] The emergence of deep learning is interesting because it provides unsupervised or semi-supervised data manipulation for creating predictive models. It is interesting because </li><li> 30. It's like the difference between mining for gold when you can just hammer lumps of it out of the ground, and panning for tiny gold flakes in a huge pile of sand and stones. </li><li> 31. The number of data scientists is not increasing at the same rate as the amount of data and the number of data analysis projects. We are not doubling the number of data scientists every two years. This is why deep learning is a big topic at the moment: it automates part of the data science process. The problem is that the tools and techniques are very complicated. </li><li> 32. For an example of complexity, here is a classical problem known as the German Tank Problem. </li><li> 33. 
In the Second World War, leading up to the D-Day invasions, the Allies wanted to know how many tanks the Germans were making. </li><li> 34. So statistics was applied to the serial numbers found on captured and destroyed tanks. </li><li> 35. As you can see, the formulas are not simple. </li><li> 36. And this problem deals with a very small data set. </li><li> 37. The results were very accurate. [TRANSITION] When intelligence reports estimated that 1,400 tanks were being produced per month, [TRANSITION] the statistics estimated 273. [TRANSITION] The actual figure was later found to be 274. </li><li> 38. This next example is one of the greatest early works in the field of operations research. This is interesting for several reasons. Firstly, with the creation of Storm and Spark Streaming and other real-time technologies, we are seeing a dramatic increase in the number of real-time systems that include advanced analytics, machine learning, and model scoring. But this field is not new. The other reason this is interesting is that it shows that correctly interpreting the analysis is not always obvious and is more important than crunching the data. </li><li> 39. In an effort to make bombers more effective, each plane returning from a mission was examined for bullet holes, and a map showing the density of bullet holes over the planes was generated. "Tally ho, chaps," said the bomber command commanders, "slap some lovely armor on these beauties wherever you see the bullet holes." [TRANSITION] "Hold on a minute," said one bloke. "I do not believe you want to do that." "Well, who are you?" said the bomber command commanders. "My name is Abraham Wald. I am a Hungarian statistician who sounds a lot like Michael Caine for some reason." Wald's reasoning was that they should put the armor where there are no bullet holes, because that's where the planes that don't make it back must be getting hit. Which happened to be places like the cockpit and the engines. </li><li> 40. 
I deliberately chose two examples from 70 years ago to show that the problems of analysis and interpretation are not new, and they are not easy. In 70 years we have managed to make tools more capable but not much easier. But this has to change. </li><li> 41. So these are my conclusions on data science. We have more and more data, but not enough human power to handle it, so something has to change. </li><li> 42. Let's move on to technology. </li><li> 43. Google Trends shows us that interest in Hadoop is not dropping off. </li><li> 44. And that R now has as much interest as SAS and SPSS combined. </li><li> 45. Up until recently there was more interest in MapReduce than Spark, and so today we see mainly MapReduce in production. But as we can see from the chart, this is likely to change soon. </li><li> 46. The job market shows us similar data, with the core Hadoop technologies currently providing more than 3/4 of the job opportunities. </li><li> 47. And also that Java, the language of choice for big data technologies, has the largest slice of the open job positions. </li><li> 48. One issue that is not really solved well today is SQL on big data. </li><li> 49. On the job market HBase is the most sought-after skill set. But you can see that Phoenix, which is the SQL interface for HBase, is not represented in terms of jobs. This chart also shows that the many proprietary big data SQL solutions are not sought-after skills at the moment. We don't have a good solution for SQL on big data yet. </li><li> 50. Today, aspects of an application that relate to the value of the data are typically a version 2 afterthought for application developers. [TRANSITION] This affects the design of both applications and data analysis projects. </li><li> 51. For software applications, the value of the data is not factored in, the natural granularity is not considered, and the data analysis is not part of the architecture. So we see architectures like this. 
With a database, business logic, and a web user interface. </li><li> 52. The data analysis has to be built as a separate system, which is created and integrated after the fact. At a high level, it will be something like this for a big data project, given the charts and trends we saw earlier. We commonly see Hadoop, MapReduce, HBase, and R. </li><li> 53. So here are the summary points for today's technology stack. </li><li> 54. Now let's look into the future a little. </li><li> 55. If data is more valuable than software, [TRANSITION] we should design the software to maximize the value of the data, and not the other way around. </li><li> 56. We should design applications with the purpose of storing and getting value from the natural maximum granularity. </li><li> 57. We should provide access to the granular data for the purpose of analysis and operational support. </li><li> 58. If data is where the value is, then the use and treatment of the data should be factored into an application from the start. [TRANSITION] It should be a priority of version 1. </li><li> 59. [TRANSITION] Valuing the data more than the software is a new requirement. [TRANSITION] Which demands new applications. [TRANSITION] Which need new architectures. </li><li> 60. To illustrate this, let's take the example of Blockbuster as an old architecture. Hollywood studios would create content that was packaged and loaded in batches to Blockbuster stores. The consumer would then jump in their...</li></ol>
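
A note on the German Tank Problem from slides 32 to 37: the formulas shown on the slide are not reproduced in these notes, so the following is only a minimal sketch of the standard textbook frequentist estimator, not necessarily the one on the slide. It assumes serial numbers run from 1 to N with no gaps and that the observed tanks are a random sample drawn without replacement. The function name `estimate_total` and the sample serial numbers are made up for illustration.

```python
def estimate_total(serials):
    """Estimate the total number produced from observed serial numbers.

    Uses the classic estimator N ~ m + m/k - 1, where m is the largest
    serial number seen and k is the number of serial numbers observed.
    Assumes serials are numbered 1..N without gaps (an assumption of
    this sketch, not a claim about the historical data).
    """
    if not serials:
        raise ValueError("need at least one observed serial number")
    k = len(serials)   # sample size
    m = max(serials)   # largest serial number observed
    return m + m / k - 1


# Hypothetical sample of four captured serial numbers.
sample = [19, 40, 42, 60]
print(estimate_total(sample))  # 60 + 60/4 - 1 = 74.0
```

The intuition is that the largest serial number seen is a lower bound on the total, and the average gap between observed serials (roughly m/k) estimates how far beyond that maximum the true total is likely to lie.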