Critical Thinking and Argumentation in Software Engineering-CT60A7000 Big Data Chapter3-Messy Behnaz Norouzi Francis Matheri

Critical Thinking and Argumentation in Software Engineering-CT60A7000

Big Data

Chapter 3: Messy

Behnaz Norouzi, Francis Matheri

Increasing the volume opens the door to inexactitude.

One of the fundamental shifts in going from small data to big data is accepting inexactitude as unavoidable and learning to live with it, instead of treating it as a problem and trying to get rid of it.

In the world of small data, reducing errors and ensuring high-quality data are essential.

In the world of sampling, the obsession with exactitude was even more critical.

In the middle of the thirteenth century, the quest for exactitude began in Europe.

The implicit belief was that if one could measure a phenomenon, one could understand it.

Later, measurement was tied to the scientific method of observation and explanation:

Lord Kelvin: “to measure is to know”

Francis Bacon: “Knowledge is power”

By the nineteenth century, France had developed a precise system of units to capture space, time, and more.

Half a century later, the discovery of quantum mechanics shattered forever the dream of comprehensive and perfect measurement.

In many new situations, allowing for messiness may be a positive feature, not a shortcoming.

A tradeoff: in return for allowing errors, one can get ahold of much more data.

It isn’t just that “More trumps some”, but sometimes “More trumps better”.

The likelihood of errors increases as you add more data points.

Messiness itself is messy. It can arise when we extract or process the data, since in doing so we are transforming it, turning it into something else, such as when we perform sentiment analysis on Twitter messages to predict Hollywood box office receipts.

Example: measuring the temperature in a vineyard

If we have only one temperature sensor for the whole plot of land, we must make sure it is accurate: no messiness allowed.

If we have a sensor for every one of the hundreds of vines, we can use cheaper sensors, and messiness is allowed.

It was again a tradeoff: we sacrificed the accuracy of each data point for breadth, and in return we received detail that we otherwise could not have seen.
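The vineyard tradeoff can be illustrated with a small simulation. This is only a sketch under invented assumptions: the two temperature zones, the noise levels, and the function name `simulate_vineyard` are hypothetical and do not come from the slides.

```python
import random
import statistics


def simulate_vineyard(n_vines=500, cheap_sd=2.0, precise_sd=0.1, seed=0):
    """Compare one accurate sensor with many cheap, noisy ones."""
    rng = random.Random(seed)
    half = n_vines // 2
    # Hypothetical ground truth: the lower half of the plot is cooler.
    true_temps = [18.0 if i < half else 22.0 for i in range(n_vines)]

    # Option A: a single accurate sensor, placed at the first vine.
    precise_reading = rng.gauss(true_temps[0], precise_sd)

    # Option B: one cheap sensor per vine, each with a large random error.
    cheap_readings = [rng.gauss(t, cheap_sd) for t in true_temps]

    return {
        "precise_reading": precise_reading,
        "cheap_lower_mean": statistics.fmean(cheap_readings[:half]),
        "cheap_upper_mean": statistics.fmean(cheap_readings[half:]),
        "worst_single_cheap_error": max(
            abs(r - t) for r, t in zip(cheap_readings, true_temps)
        ),
    }


result = simulate_vineyard()
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

In this toy setup any single cheap reading may be off by several degrees, yet the per-zone averages stay close to the true values and reveal a temperature difference across the plot that the single accurate sensor cannot show.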

Big data transforms figures into something more probabilistic than precise.

More data drives improvements in computing.

Example: chess algorithms, using N=all.

Banko and Brill: “We want to reconsider the tradeoff between spending time and money on algorithm development versus spending it on corpus development.”

What is the story behind this statement?

The result: more data, better performance.

Google’s idea: language translation

The result: an IBM computer translated sixty Russian phrases into English in 1954.

The problem, as posed by a committee of machine-translation grandees: translation is not just about memorization and recall; it is about choosing the right words from many alternatives.

A novel idea of IBM researchers: instead of feeding the computer explicit linguistic rules, let the computer use statistical probabilities to calculate which word or phrase in one language is the most appropriate one in another language.
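The statistical idea can be sketched in a few lines. This is a deliberately simplified illustration, not the IBM or Google method: it works on single words, ignores context, and assumes the word pairs are already aligned. The corpus and the names `train`, `probability`, and `translate` are hypothetical.

```python
from collections import Counter, defaultdict


def train(aligned_pairs):
    """Count how often each source word was paired with each target word."""
    counts = defaultdict(Counter)
    for source_word, target_word in aligned_pairs:
        counts[source_word][target_word] += 1
    return counts


def probability(counts, source_word, target_word):
    """Estimate P(target | source) as a relative frequency."""
    candidates = counts.get(source_word)
    if not candidates:
        return 0.0
    return candidates[target_word] / sum(candidates.values())


def translate(counts, source_word):
    """Pick the most frequently observed translation, or None if unseen."""
    candidates = counts.get(source_word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical hand-made corpus; a real system learns from millions of pairs.
corpus = [
    ("avocat", "lawyer"),
    ("avocat", "lawyer"),
    ("avocat", "lawyer"),
    ("avocat", "avocado"),
    ("pomme", "apple"),
]
model = train(corpus)
print(translate(model, "avocat"))
print(probability(model, "avocat", "lawyer"))
```

No grammar rule appears anywhere in the sketch: the choice among alternatives is driven only by counts, so adding more pairs, even noisy ones, changes the estimates without any change to the code.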

Google’s mission in 2006: organize the world’s information and make it universally accessible and useful.

The result: despite the messiness of the input, Google’s service works the best, and it is far, far richer.

Why does it work well? It was fed more data (not just high-quality data).

Conventional sampling analysts

Accepting messiness is difficult for them.

They use multiple error-reducing strategies.

The problem

Such strategies are costly

Exacting standards of collection are unlikely to be achieved consistently at such scale.

More Trumps Better

Moving into a world of big data will require us to change our thinking about the merits of exactitude.

In dealing with even more comprehensive datasets, we no longer need to worry so much about individual data points biasing the overall analysis.

Take the way sensors are making their way into factories.

Example: at a factory in Washington, wireless sensors are installed throughout the plant, forming an invisible mesh that produces vast amounts of data in real time.

Moving to a large scale changes:

Expectations of precision

Practical ability to achieve exactitude

Technology is imperfect; messiness is a practical reality we must deal with.

To produce the inflation number, the Bureau of Labor Statistics employs hundreds of staff, and the process costs around $250 million a year.

The problem: by the time the numbers come out, they are already a few weeks old.

The solution: quicker access to inflation numbers, which cannot be achieved with conventional methods focused on sampling.

Two economists at the Massachusetts Institute of Technology used big data in the shape of software that crawls the web, collecting half a million prices of products sold in the U.S. every single day.

The benefit: combining big-data collection with clever analysis led to the detection of a deflationary swing in prices immediately after Lehman Brothers filed for bankruptcy in September 2008.
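How messy scraped prices can still yield a usable daily signal can be sketched as follows. The slides do not describe the economists' actual method, so this is an assumed approach: a median of price ratios against a base day, skipping records that are missing or obviously broken. The data and the name `daily_index` are hypothetical.

```python
import statistics


def daily_index(prices_by_day, base_day):
    """Return a price index per day (base day = 100) from scraped prices.

    prices_by_day maps a day label to a {product: price} dictionary.
    Products missing on either day, or with non-positive prices,
    are simply skipped instead of being cleaned up by hand.
    """
    base = prices_by_day[base_day]
    index = {}
    for day, prices in prices_by_day.items():
        ratios = [
            price / base[product]
            for product, price in prices.items()
            if product in base and base[product] > 0 and price > 0
        ]
        if ratios:
            index[day] = 100.0 * statistics.median(ratios)
    return index


# Hypothetical scraped data; day3 contains a broken record (price 0.0).
scraped = {
    "day1": {"milk": 2.0, "bread": 3.0, "tv": 400.0},
    "day2": {"milk": 1.9, "bread": 2.85, "tv": 380.0, "radio": 50.0},
    "day3": {"milk": 1.8, "bread": 2.7, "tv": 0.0},
}
for day, value in daily_index(scraped, "day1").items():
    print(day, round(value, 1))
```

The median is used here because a handful of wrong prices barely moves it, which fits the idea of tolerating errors in individual data points while watching the overall trend day by day.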

More and messy OVER fewer and exact

Categorizing content: hierarchical systems such as taxonomies and indexes are imperfect.

Photo-sharing site Flickr

In 2011 it held more than six billion photos from more than 75 million users.

Trying to label each photo according to preset categories would have been useless.

The preset categories were replaced by mechanisms that are messier but more flexible, such as tagging.

The imprecision inherent in tagging is about accepting the natural messiness of the world.

Messiness In Action

Database design

Traditional databases require highly structured and precise data.

Traditional databases are good for a world in which data is sparse, and thus can be curated carefully.

This view of storage is at odds with reality.

The big shift: NoSQL databases.

They accept data of varying type and size and allow it to be searched successfully.

In exchange for permitting structural messiness, they require more processing and storage resources.

Pat Helland: “It is OK if we have ‘lossy’ answers. That’s frequently what business needs.”
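The contrast with a fixed schema can be shown with a minimal in-memory sketch. It does not use any real NoSQL product or API; the records and the helper names `find` and `find_tagged` are hypothetical, and plain Python dictionaries stand in for schemaless documents.

```python
def find(documents, **criteria):
    """Return documents whose fields match all criteria.

    Documents lacking a requested field are skipped, not rejected at
    insert time as a fixed-schema table would do.
    """
    return [
        doc
        for doc in documents
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


def find_tagged(documents, tag):
    """Return documents carrying the given free-form tag, if they have tags."""
    return [doc for doc in documents if tag in doc.get("tags", [])]


# Hypothetical records of varying shape stored side by side.
photos = [
    {"id": 1, "owner": "ana", "tags": ["cat", "sofa"]},
    {"id": 2, "owner": "ben", "camera": "X100", "tags": ["sunset"]},
    {"id": 3, "title": "untitled"},  # no owner and no tags at all
    {"id": 4, "owner": "ana", "tags": ["cat"]},
]
print([doc["id"] for doc in find(photos, owner="ana")])
print([doc["id"] for doc in find_tagged(photos, "cat")])
```

The cost mentioned above shows up even in this toy: every query has to inspect each record's shape at read time, work that a fixed schema would have done once at write time.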

Hadoop: an open-source rival to Google’s MapReduce system.

Why is Hadoop very good at processing large quantities of data?

It takes for granted that the quantity of data is so breathtakingly enormous that it can’t be moved and must be analyzed where it is.
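The idea of analyzing data where it sits can be sketched with the map and reduce pattern. This is plain Python, not Hadoop's API: the lists below stand in for data partitions stored on different machines, and the names `map_partition`, `merge_counts`, and `word_count` are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Runs where the partition lives: summarize local lines into counts."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two small summaries into one."""
    return left + right


def word_count(partitions):
    # Only the compact per-partition summaries are combined;
    # the raw lines never leave their partition.
    partial_counts = [map_partition(partition) for partition in partitions]
    return reduce(merge_counts, partial_counts, Counter())


# Hypothetical partitions, each imagined to sit on a different node.
partitions = [
    ["big data is messy"],
    ["messy data", "more data"],
]
print(word_count(partitions).most_common(2))
```

In a real cluster the map step would run in parallel on the machines holding each partition; here it runs in a simple loop, which is enough to show that only small summaries need to travel.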

By allowing for imprecision, we open a window into an untapped universe of insights.

In return for living with messiness, we get tremendously valuable services that would be impossible at their scope and scale with traditional methods and tools.

As big data techniques become a regular part of everyday life, we as a society may begin to strive to understand the world from a far larger, more comprehensive perspective than before, a sort of N=all of the mind.

Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy.

Big data may require us to change, to become more comfortable with disorder and uncertainty.