Innovative minds are creating brand new ways of analyzing data and extracting knowledge from it.
Big Data In Government
3 Big Data Is A Big Deal
4 Big Ideas For Big Data
6 A Few Minutes With NSF's Dr. Suzi Iacono
8 Big Data Demystified
10 Making Your Big Data Move
12 Executive Interview With NIST's Ashit Talukder
14 Let's Talk Tech...
SME One-On-Ones
15 Brocade: Scott Pearson & Casey Miles
16 CommVault: Shawn Smucker
17 EMC Isilon: Audie Hittle
18 HP: Diana Zavala
19 Informatica: Todd Goldman
20 Bearing Big Data Fruit
22 Calling All GWACs
Volume 5 Number 2 March 2013
Inside Big Data
Published by
Download Your Digital Edition at www.OnTheFrontLines.net.
Courtesy National Science Foundation: Fuqing Zhang and Yonghui Weng, Pennsylvania State University; Frank Marks, NOAA; Gregory P. Johnson, Romy Schneider, John Cazes, Karl Schulz, Bill Barth, The University of Texas at Austin
BIG DATA: IT'S ON ISILON
• Store & Analyze Your Data
• Protect Your Data
• Secure Your Network
www.emc.com/federal
Game-Changing Technologies for ISR & FMV
By Jeff Erlichman, Editor, On The FrontLines
It Occurs To Me...
Big Data Is A Big Deal!
After collecting volumes of data from a variety of formats and sources and then analyzing them using my human processor, I've come up with my Big Data "Top Ten".
1. Big Data needs a consensus definition.
Go to any tech event and Big Data talk is on the lips of attendees. While most say they "know Big Data when they see it", when asked to give a specific definition, answers vary widely.
And that’s for good reason. So far there is no one accepted gov-
ernment definition for Big Data like there is for cloud computing.
In its 2012 “Demystifying Big Data” report the TechAmerica
Foundation defined Big Data as a term that describes:
“large volumes of high velocity, complex and variable data
that require advanced techniques and technologies to en-
able the capture, storage, distribution, management, and
analysis of the information.”
Currently NIST is leading the effort to define Big Data for gov-
ernment, along with building common languages and reference
models. Stay tuned.
2. Big Data is, well, big!
In 1978, I sold Big Data storage to the Naval Research Lab. My format was an 8" floppy disk storing a robust 8 kilobytes. Hard multiple-platter Disk Packs stored maybe 1 megabyte. Today hard disk capacities of 1.5 terabytes are commonplace.
According to the TechAmerica report, in 2009, the government produced 848 petabytes of data and healthcare data alone reached 150 exabytes. Five exabytes (an exabyte is 10^18 bytes) would contain all words ever spoken by human beings on earth.
3. Big Data is new — Not true
While the term "Big Data" with initial caps is new, big data itself is not new.
For example, NOAA/National Weather Service has been processing it since the 1950s. Today NOAA manages over 30 petabytes of new data per year. (How many 8K floppies is that?)
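For the curious, here is a quick back-of-the-envelope answer to that question, assuming decimal (SI) units and the 8-kilobyte floppy capacity cited above:

```python
# Rough floppy arithmetic, assuming decimal units (1 petabyte = 10**15 bytes)
# and the 8-kilobyte capacity of the 8" floppy mentioned above.
annual_data_bytes = 30 * 10**15      # ~30 petabytes of new NOAA data per year
floppy_bytes = 8 * 10**3             # one 8 KB floppy

floppies_per_year = annual_data_bytes / floppy_bytes
print(f"{floppies_per_year:,.0f}")   # 3,750,000,000,000 — about 3.75 trillion floppies
```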
4. Big Data characteristics: The 4Vs
Dealing with Big Data means dealing with:
(a) Volume — The sheer amount of data generated or data intensity that must be ingested, analyzed, and managed.
(b) Velocity — The speed data is being produced and changed, and the speed with which data must be received, understood, and processed.
(c) Variety — Structured and unstructured data in a variety of formats creates integration, management, governance, and architectural pressures on IT.
(d) Veracity — The quality and provenance of received data must be verified.
5. Why Big Data now!
Technology finally has the power to handle the 4Vs. We now have the tools to really ask 'what if' and to explore data sets that weren't available before or didn't exist before. Now it is possible to really think about 'the art of the possible'. We are witnessing a true democratization of data.
6. Big Data is now affordable.
You don't need to start from scratch with new IT investments. You can use your existing capabilities, technologies and infrastructure to begin pilots. No single technology is required; there are no 'must haves'.
7. The Big Data path is a circle not a straight line.
Define. Assess. Plan. Execute. Review. TechAmerica recommends a five-step cyclical approach that is iterative versus serial in nature, with a constant closed feedback loop that informs ongoing efforts.
8. The "Help Wanted" sign is up!
We have to grow the next generation of data scientists. The McKinsey Global Institute Analysis report predicts a shortfall of 200,000 data scientists over the next five or so years and a shortfall of 1 million managers in organizations where data will be critical to success — e.g. government.
9. Government is funding foundational Big Data R&D.
Projects are moving forward via the $200 million investment announced by the administration in March 2012. In October 2012, NSF/NIH announced 8 awards covering three areas: data management, data analytics and collaboration/data sharing. New solicitations, competitions and prizes are in the offing with opportunities for anyone who has a good idea.
10. The Big Data Market is growing bigger.
Deltek forecasts demand for big data solutions by the U.S. government will increase from $4.9 billion in FY 2012 to $7.2 billion in FY 2017, a compound annual growth rate (CAGR) of 8.2%.
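As a sanity check on that forecast, here is a minimal sketch of the standard CAGR calculation; the dollar figures are Deltek's, and the formula is the usual one:

```python
# CAGR sanity check on the Deltek forecast: $4.9B (FY 2012) to $7.2B (FY 2017),
# treated as five annual growth periods.
start, end, periods = 4.9, 7.2, 5
cagr = (end / start) ** (1 / periods) - 1
print(f"CAGR is about {cagr:.1%}")  # roughly 8%, in line with the cited 8.2%
```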
Big Data means big opportunities. That’s truly a big deal. n
© Copyright 2013 Trezza Media Group, Public Sector Communications, LLC
Big Ideas For Big Data
Big Data inspires visions where we imagine 'what if'. In reality, it poses the real-life question of how we are going to get from today to 'what if'?
The answer: “We have a lot of science to do,” declared NSF’s
Dr. Suzi Iacono in a recent interview with OTFL.
“We truly believe there is a chance for trans-
formation in science and engineering to help us
with our biggest national challenges in energy,
health and climate change.”
Dr. Iacono is the co-chair of the interagency
Big Data Senior Steering Group, a part of the Net-
working and Information Technology Research
and Development (NITRD) program and a senior science advisor
at NSF.
The NITRD program chartered the Big Data Senior Steering
Group in 2011. NSF co-chairs the committee with NIH and mem-
bers include DARPA, DOD OSD, DHS, DOE-Science, HHS, NARA,
NASA, NIST, NOAA, NSA, and USGS.
Dr. Iacono explained this is a long-term national Big Data R&D
effort with four major components:
1. Foundational research to develop new techniques and tech-
nologies to derive knowledge from data.
2. New cyberinfrastructure to manage, curate, and serve data
to research communities.
3. New approaches for education and workforce development.
4. Challenges and competitions to create new data analytic
ideas, approaches, and tools from a more diverse stake-
holder population.
At NSF, Basic Research
Dr. Iacono said NSF is a basic science agency focusing more on the long-term; the kinds of foundational research the private sector cannot fund or would not have dedicated researchers working on.
“When people ask me the question ’what is research doing for
me today in terms of Big Data?’, I tell them ‘if you think you have
problems today, just wait five or ten years from now and you’ll be
glad we are doing this basic research we are doing today’.”
She notes that the issues and challenges of dealing with vol-
umes, velocity and variety are only going to increase. “The scien-
tists and engineers will want to ask bigger and more important
research questions so we have bigger and better discoveries.”
Dr. Iacono explained NSF is working on three core Big Data
issues:
• Collection, storage and management
• Data analytics
• Data sharing and collaboration.
“We really have to figure out how to scale from the kind of
small data sets that we have today to very large and very hetero-
geneous data sets,” said Dr. Iacono.
Heterogeneity is one of the biggest challenges. It includes
large, diverse, complex, longitudinal, and/or distributed data
sets generated from instruments, sensors, Internet transactions,
email, video, click streams, and/or all other digital sources avail-
able today and in the future according to Dr. Iacono.
In October 2012 NSF/NIH announced awards for eight projects
that cover data management, data analytics and collaboration/
data sharing. (Details on page 20)
With the Administration’s $200 million Big Data R&D Initiative, government is taking tangible steps to address the challenges and opportunities of Big Data.
Dr. Suzi Iacono, NSF
That’s just the beginning.
"Those were just the mid-scale awards. We are about to announce our small projects and there will be more of those," Iacono said, adding that 'small' means just smaller projects with smaller budgets and fewer researchers working on them than the mid-scale awards.
"Then a new solicitation will be coming out for competitions and prizes," she added, "because we want to reach out beyond those who traditionally propose. Anyone who has good ideas can submit ideas for competition."
NIST Big Data/Cloud Workshop
Dr. Iacono presented her Big Data message to the community at the NIST Big Data/Cloud Workshop held in January 2013. Before an SRO audience she spoke about the four thrusts and everyone's common challenges.
"All the agencies understand what the challenges are, whether we are NASA, NOAA, USGS, NIST or DARPA," she said. "The issue is we share common challenges; and how are we going to solve them in new, meaningful ways that are really going to make a huge difference down the line."
Ashit Talukder, Chief, Information Access Division, ITL,
NIST said NIST, other federal agencies, the private sector and
academia are working together to come up with consensus
standards, guidelines and ways to measure Big Data efforts. And
there is that Big Data definition issue.
“We had one breakout dedicated to coming up with a consen-
sus definition and common language for Big Data technologies.
We are working to develop a shared technology roadmap that
allows the community and all stakeholders to work together.”
Talukder said NIST is taking a similar approach as it did for
cloud computing, developing a common definition and roadmap
that all can follow.
"We have a working group trying to come up with a consensus definition for Big Data. We need to start with the fundamental questions," he said.
"Then we can have a definition so we can speak a common language; have a shared set of reference architectures that is well recognized and used throughout the community. That is key so that we have a more uniform way of approaching programs."
After all, said Talukder, "the end game for Big Data is knowledge extraction; turning data into information into knowledge; then taking action based on that knowledge." n
During a recent Federal Executive Forum broadcast on Federal
News Radio, leaders from NASA and NGA spoke about their Big
Data strategies and efforts.
Dr. Robert Norris, Chief Architect, National Geospatial-Intelligence Agency (NGA)
Myra Bambacus, Geospatial Information Officer (GIO), NASA Goddard Spaceflight Center
Big Data Imaging at NASA
At NASA Johnson Space Center, NASA's imagery collection of still photography and video spans more than half a century: from the early Gemini and Apollo missions to the Space Station. This imagery collection currently consists of over 4 million still images, 9.5 million feet of 16mm motion picture film, over 85,000 video tapes and files representing 81,616 hours of video in analog and digital formats.
Eight buildings at JSC house these enormous collections and
the imagery systems that collect, process, analyze, transcode,
distribute and archive these historical artifacts.
NASA’s imagery collection is growing exponentially, and its
sheer volume of unstructured information is the essence of
Big Data.
NASA has developed best practices through technologies and processes to comply with NASA records retention schedules and archiving guidance; migrate imagery to an appropriate storage medium and format destined for the National Archives; and develop technology to digitally store down-linked images and video directly to tape libraries.
OTFL @ The Federal Executive Forum
“ The end game for Big Data is knowledge extraction; turning data into information into knowledge; then taking action based on that knowledge.”
— Ashit Talukder, Chief, Information Access Division, ITL, NIST
Source: TechAmerica Foundation Demystifying Big Data Report, Oct. 2012
Graphic courtesy National Science Foundation: Leonid A. Mirny and Erez Lieberman-Aiden
A Few Minutes With...
Dr. Suzi Iacono, Senior Science Advisor for the Directorate for Computer and Information Science and Engineering (CISE), National Science Foundation (NSF)
Dr. Suzi Iacono co-chairs the interagency Big Data Senior
Steering Group, a part of the Networking and Information
Technology Research and Development (NITRD) program.
Dr. Iacono is passionate about her work at NSF and NSF’s role
in tackling basic scientific research.
"It is precisely our mission to not think about the end game. We see science beyond the frontiers, actually imagining what we should do to tackle some of our biggest problems."
NSF funded the basic research for "recommender" software that Amazon and other online retailers use. They also funded the basic technologies that allow companies like eBay to do online auctions without cheating.
"At NSF, that's what we do, it's 'where discoveries begin'," Dr. Iacono declared. She was kind enough to talk with OTFL recently about Big Data. What follows is a lightly edited transcript.
OTFL: Every agency has a unique mission. What about NSF?
Dr. Iacono: We are a basic science agency, so while there are some practical aspects to the research we fund, it is really long-term. (Because private sector R&D budgets are not large) we fund the kinds of things companies would not invest in or would not have internal researchers working on.
It is precisely our mission to not think about the end game, to see science beyond the frontiers, actually imagining what we should do to tackle some of our biggest problems.
It’s often true that scientists come up with brand new ways of
doing things that were never thought of before; and that’s what
we are trying to do with our research programs at NSF — to think
outside the box, be creative, innovative and come up with brand
new ways of analyzing data and extracting knowledge from it.
OTFL: Can you give examples?
Dr. Iacono: One example of trying to bring communities together is a project called EarthCube. Our geoscience directorate has three divisions: Ocean, Earth and Atmospheric.
Some smart people in that directorate asked the question:
Wouldn’t it be good if geoscientists could use all that data to ask
really big questions about climate change?
If you are a geoscientist who works only with oceans, you are using only the Ocean community database. But what if you could get information about the Earth and atmosphere?
Then you could really pose some much more interesting
grand challenge research questions and we could find out a lot
more about things like climate change.
Another example is using real time data to evacuate people
in a big storm; being able to take the data that FEMA, NOAA and
USGS have and be able to actually integrate it in real time; and get
real actionable knowledge of what to do.
This will change the course of how we handle public safety in
this country, because being able to get people going down the
right road rather than the wrong one will save lives.
OTFL: What about new tools?
Dr. Iacono: We need brand new ways to scale small databases up so they can handle very, very large data sets. What that means is new algorithms, new statistical approaches, new kinds of mathematics have to be brought to bear on the problem; so we bring together multi-disciplinary teams from across these disciplines to work on these problems.
Take education for example... education researchers in the US are funded by NSF. They are interested in the science of learning. To do that they go to a local elementary school and they compare how kids learn in two classes of 30 kids. They observe, they take tests, they survey, but the data points are really small.
Compare that with a really large online learning community like Stanford's. They have 200,000 students taking a course. So, as students click to move around the website, you could capture every interaction with every student. Find out how long they rest on a page and how long they read things.
The amount of data is so much different, orders of magnitude
larger. We could really make a big difference in how we teach
our young people. We can understand when someone does not
understand a concept. We can intervene before someone drops
out. We have so much data along the whole process rather than
waiting to take a test at the end.
(In the future) those mining their data will have that much more data to mine. What kind of new analytical tools will they have at hand to actually come up with new knowledge? The issue for us is what can we do today to really give people new tools, new approaches, new technologies they can use five or ten years down the road. n
In the 1990s doing business online was doing ‘eCommerce or
eBusiness’. Today, it’s just ‘commerce or business’.
That transformation occurred because the underlying In-
ternet technologies matured and the public gained trust in doing
business online.
That same transformation will hopefully occur with Big Data.
In the end, once the technologies mature and using Big Data appli-
cations in real-time becomes common-
place, Big Data will become just data.
“Collection, use cases, analysis,
sharing will all be par for the course,”
explained Chris Wilson, Vice President
and Counsel, Communications, Privacy,
and Internet Policy at TechAmerica
Foundation.
“Big Data terminology will become
less important than the actual results,”
Wilson told OTFL in a recent interview.
“Government will ask ‘how can Big Data
help?’ Results and outcomes, not termi-
nology will dominate the conversation.”
Big Data Commission: Bringing Clarity
A few short years ago, defining — and then migrating and using the cloud — consumed the IT conversation. Now,
the technological, policy and business
infrastructures are in place and cloud
computing is quickly becoming main-
stream computing. What got cloud go-
ing was NIST defining the cloud so everyone had a reference point.
A few short years ago, Big Data (though it has always been
with us) wasn’t even in the discussion. Now, of course, it is. Now
Big Data is in a similar position as cloud’s early years. Everyone
knows Big Data when they see it, but so far, there is no consensus
definition. And with no starting point, there can be no clarity.
“We felt it was necessary to provide some clarity to the gen-
eral discussion within industry and government,” Wilson said,
describing the formation and goals of its Big Data Commission,
consisting of 25 of America’s leading IT firms.
The result is its report: “Demystifying Big Data: A Practical
Guide To Transforming The Business of Government”.
Wilson said the report’s goal was to talk about Big Data in plain
language, develop a working definition, describe the technology
underpinnings and provide practical advice on getting started.
(More on these on pages 10 and 14)
“We have tangible case studies on how Big Data has been
implemented in real world scenarios,” he said. These include
implementations by NASA, NOAA/National Weather Service and
the National Archives and Records Administration (NARA). How
these agencies solved their Big Data requirements provide guid-
ance for new endeavors.
The report breaks the Big Data dis-
cussion down into five parts:
1. Big Data Definition & Business/
Mission Value
2. Big Data Case Studies
3. Technical Underpinnings
4. The Path Forward: Getting Started
5. Public Policy
According to the report, “the Com-
mission based its findings on the practi-
cal experiences of those government
leaders who have established early
successes in leveraging Big Data, and
the academics and industry leaders who
have supported them.
The intent is to ground report rec-
ommendations in these best practices
and lessons learned, in an effort to cut
through the hype, shorten the adoption
curve, and provide a pragmatic road map
for adoption.”
Positive Reaction
Since its release in October 2012, Wilson said the feedback has been positive.
"The feedback from the administration has been positive, especially from OSTP who thinks it has a lot of value," said Wilson, who also said there was a positive reaction from the Hill.
Looking to the future, Wilson said the TechAmerica plan is for its Big Data subcommittee to use the report as a template for advocacy and engagement with government and look for ways to collaborate on Big Data messaging.
Wilson says the report shows that with all the amazing stuff
going on we are just at the tip of the iceberg.
“While I think data analysis is a huge part of Big Data, a Big
Data definition encompasses more. Data collection, use and shar-
ing are all part of the Big Data world wrapped around the data
analysis axle,” he said. “Our working definition includes all of
these things.” n
Big Data Demystified
In its "Demystifying Big Data" report, the TechAmerica Foundation presents a practical guide to how Big Data can transform how government does business and delivers services.
“Big Data is a term that describes large volumes of high velocity, complex and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information.”
TechAmerica Foundation Report: Demystifying Big Data; October 2012
Report cover: "Demystifying Big Data: A Practical Guide To Transforming The Business of Government," prepared by TechAmerica Foundation's Federal Big Data Commission.
The path is a circle, not a straight line.
You know the data you steward could do more to im-
prove the lives of citizens, or improve social and welfare
services or save the government money. You know you are a
prime candidate for Big Data. You also know that unless you are
getting part of the $200 million in announced grants, new money
is not an option.
But that doesn’t mean you can’t get started.
NIST's Ashit Talukder told OTFL "first you must start with fundamental questions including: 'What is Big Data?' Having a consensus definition allows people to speak in a common language, use a shared set of reference architectures and is the key to a more uniform way of approaching Big Data."
At the moment, there is no consensus definition and NIST and others are working hard to formulate one. But in the meantime, Big Data projects need to go forward. So, how do you proceed?
TechAmerica’s Big Data Commission in its “Demystifying Big
Data” report presents a proven “five step cyclical approach to
take advantage of the Big Data opportunity”.
Review each step continually along the way. To think Big Data
is to think in cyclical not, serial terms.
“These steps are iterative versus serial in nature, with a con-
stant closed feedback loop that informs ongoing efforts,” the Big
Data Commission writes.
Start from "a simple definition of the business and operational imperatives" you want to address and a "set of specific business requirements and use cases that each phase of deployment will support."
“At each phase, review progress, the value of the investment,
the key lessons learned, and the potential impacts on gover-
nance, privacy, and security policies. In this way, the organization
can move tactically to address near term business challenges, but
operate within the strategic context of building a robust Big Data
capability.”
How To Succeed
Specifically, the Big Data Commission says successful Big Data implementations:
1. Define business requirements: Start with a set of specific
and well defined mission requirements, versus a plan to deploy a
universal and unproven technical platform to support perceived
future requirements. The approach is not “build it and they will
come,” but “fit for purpose.”
2. Plan to augment and iterate: Augment current IT invest-
ments rather than building entirely new enterprise scale sys-
tems. The new integrated capabilities should be focused on
initial requirements but be part of a larger architectural vision
that can include far wider use cases in subsequent phases of
deployment.
3. Big Data entry point: Successful deployments are characterized by three "patterns of deployment" underpinned by the selection of one Big Data "entry point" that corresponds to one of the key characteristics of Big Data (a simple sketch of this selection logic follows the list).
• Velocity: Use cases requiring both a high degree of velocity
in data processing and real time decision making, tend to require
Streams as an entry point.
• Volume: Those struggling with the sheer volume in the data
they seek to manage, often select a database or warehouse archi-
tecture that can scale out without pre-defined bounds.
• Variety: Those use cases requiring an ability to explore, un-
derstand and analyze a variety of data sources, across a mixture
of structured, semi-structured and unstructured formats, hori-
zontally scaled for high performance while maintaining low cost,
imply Hadoop or Hadoop-like technologies as the entry point.
4. Identify gaps: Once an initial set of business requirements
have been identified and defined, government IT leaders assess
their technical requirements and ensure consistency with their
long term architecture. Leaders should identify gaps and then
plan the investments to close the gaps.
5. Iterate: From Phase I deployments you can then expand to
adjacent use cases, building out a more robust and unified Big
Data platform. This platform begins to provide capabilities that
cut across the expanding list of use cases, and provide a set of
common services to support an ongoing initiative. n
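To make the entry-point guidance in step 3 concrete, here is a minimal sketch of that selection logic, assuming a simplified model in which one dominant characteristic drives the choice; the mapping paraphrases the Commission's guidance and is illustrative, not prescriptive.

```python
# Illustrative mapping of a dominant Big Data characteristic to the entry point
# suggested by the Commission's three patterns of deployment.
ENTRY_POINTS = {
    "velocity": "stream-processing platform for real-time decision making",
    "volume": "scale-out database or warehouse architecture",
    "variety": "Hadoop or Hadoop-like technologies",
}

def pick_entry_point(dominant_characteristic: str) -> str:
    """Return a suggested entry point for the dominant characteristic."""
    return ENTRY_POINTS.get(
        dominant_characteristic.lower(),
        "clarify business requirements before choosing a platform",
    )

print(pick_entry_point("variety"))  # Hadoop or Hadoop-like technologies
```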
Making Your Big Data Move
To think Big Data is to think in cyclical, not serial, terms.
Chart: 1. Define. 2. Assess. 3. Plan. 4. Execute. 5. Review. (Courtesy of TechAmerica Foundation.)
© 2012 Brocade Communications Systems, Inc. All Rights Reserved.
Brocade is helping federal agencies deliver data center-class reliability and scalability to the edges of the network and into the cloud.
Brocade is, quite simply, the leader in cloud-optimized networking for the federal government. With the largest breadth of federally certified products, Brocade is committed to achieving the highest standards of interoperability and reliability required for all federal solutions and the Cloud First mandate. Brocade builds network foundations that ensure federal data center consolidations enable cutting-edge cloud services, seamlessly.
When the mission is critical, the network is Brocade. Learn more at brocade.com/everywhere
Brocade. Unlock the full potential of the cloud.
OTFL Executive Interview
Ashit Talukder, Chief, Information Access Division, Information Technology Laboratory (ITL), National Institute of Standards and Technology (NIST)
More than 600 gathered at the 2013 NIST joint Cloud and
Big Data Workshop to explore the “what if” possibilities
at the intersection of cloud computing and Big Data.
"Cloud is a multiplier when it's combined with other technologies, and it frees the users from the constraints of location and catalyzes data-driven inspiration for discovery," Dr. Patrick Gallagher, Under Secretary of Commerce for Standards and Technology and NIST Director, told attendees.
"Now, Big Data, unlike cloud, doesn't have a common definition yet. We haven't yet agreed as a community what exactly we mean by Big Data. But whatever it is, it's here.
"And like cloud, Big Data is going to change everything. We are really looking at a new paradigm, a place of data primacy where everything starts with consideration of the data rather than consideration of the technology."
At the workshop, the NIST Information Technology Labora-
tory Big Data team led sessions that focused on some of Big
Data’s most pressing concerns. Speakers included:
• Christopher L. Greer, Acting Senior Advisor for Cloud
Computing, Associate Director, Program Implementation Office
• John Garofolo, Senior Advisor, Information Access Division
• Ashit Talukder, Chief, Information Access Division
• Mary Brady, Leader, Information Systems Group, Software
and Systems Division
But to make this all happen, there are many challenges. In
this OTFL interview, NIST’s Ashit Talukder, a key member of the
NIST Big Data team, spoke about some of the important issues
facing the Big Data community.
On Lifecycle Management: Measurement Science, Benchmarking and Evaluation Challenges
OTFL: What can people do to meet those challenges?
Ashit Talukder, NIST: The key to meeting these important chal-
lenges is to have plans that cover all phases of the life cycle, from
initial capture to long term archiving; including steps that may not
often be considered.
Dispensing with data that are not needed or for which the
cost of preservation exceeds the value is an example. Underly-
ing this example is the principle that not all data may need to be
preserved.
For example, the raw data output of a computer simulation
may be massive and expensive to preserve. But if the simulation
itself — and the ability to run the simulation — is compact and can
be efficiently preserved, the data can be regenerated on demand.
In contrast, historical observational data cannot be regener-
ated and these typically require different evaluation and preser-
vation strategies.
These examples point out that thoughtful data curation — in-
cluding continuously evaluating the preservation value of data
— is a key part of life cycle management.
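A tiny sketch of the regenerate-on-demand idea above, assuming a toy simulation whose inputs (a seed and a step count) are far cheaper to preserve than its output:

```python
import random

# Preserve the compact generator (seed + parameters), not the massive output:
# the raw data can be rebuilt on demand, as described above.
def run_simulation(seed: int, steps: int) -> list[float]:
    rng = random.Random(seed)        # the preserved seed makes the run repeatable
    position, path = 0.0, []
    for _ in range(steps):
        position += rng.uniform(-1, 1)
        path.append(position)
    return path

preserved_inputs = {"seed": 42, "steps": 1_000_000}   # a few bytes to archive
regenerated = run_simulation(**preserved_inputs)      # large output, rebuilt on demand
assert regenerated == run_simulation(**preserved_inputs)
```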
Data lifecycle management also includes adding data identi-
fiers, data descriptors such as metadata, ontologies, data com-
pression (using lossless or lossy techniques), and methodologies
to search and access data.
Properly evaluating the preservation value of data is certainly
a challenge. A key factor lies in assessing the current and poten-
tial uses of the data.
The NIST Scientific Data Committee, composed of data sci-
entists and technologists from fields across NIST, is currently
working on the challenge of developing effective metrics and
measures for data impact.
The group has started its work by focusing on software tools
and resources that NIST makes available for downloading and use
by others. This initial work will be extended to NIST Standard Ref-
erence Data as well as the other data resources developed across
the NIST labs.
On Big Data Infrastructure
OTFL: Do we have the infrastructure in place currently? What is needed and what is not?
Ashit Talukder, NIST: There have been significant advances in
Big Data infrastructure over the past few years that have enabled
deployment of scalable Big Data solutions on platforms.
Hadoop on cluster-based environments is currently the most
commonly used infrastructure, but others that employ HPCC-
based solutions may be used as alternatives, depending on the
types of applications.
Research on different hardware-based solutions including GPUs, multicores and cluster computing is underway and may offer specialized solutions in different problem domains.
Networking or the movement of large volumes of data is a
challenge. Solutions that enable faster movement of data from
storage to memory, or between nodes in a cluster environment
are being investigated and will potentially assist in faster Big Data
computations.
On Analytics, Processing and Interaction: Measurement Science, Benchmarking and Evaluation Challenges
OTFL: What can people do to meet those challenges?
Ashit Talukder, NIST: One challenge in processing Big Data is
being able to effectively extract knowledge and present relevant
information to users in a timely manner. The heterogeneity of
data types in Big Data includes unstructured information such as
text, speech, video, and images.
Many of the existing analytics solutions cannot handle unstructured data or assimilate a variety of data types, may not be suitable for handling distributed data, or may not be able to extract information from real-time data streams (compared to traditional solutions that do batch processing on pre-stored data). Visualization, interaction, and usability of Big Data interfaces also are areas that need significant improvements.
While promising solutions for analytics on Big Data have been
offered recently, there is a dearth of techniques to measure the
efficacy of different analytics solutions.
Measurement and benchmarking of analytics solutions could include many metrics, such as accuracy, false/true classification rates, computation speed, generalization, data management efficiency, and many others.
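As a small illustration of the first two metrics mentioned, here is a sketch that scores a binary classifier against ground-truth labels; the sample labels and metric selection are illustrative, not NIST's benchmark definitions.

```python
# Toy scoring of a binary classifier: accuracy plus true/false positive rates.
def classification_metrics(predicted: list[int], actual: list[int]) -> dict[str, float]:
    tp = sum(p == a == 1 for p, a in zip(predicted, actual))   # true positives
    tn = sum(p == a == 0 for p, a in zip(predicted, actual))   # true negatives
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    return {
        "accuracy": (tp + tn) / len(actual),
        "true_positive_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

print(classification_metrics([1, 0, 1, 1], [1, 0, 0, 1]))
```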
We're working with the defense and intelligence communities and others on integrating, analyzing and interpreting streaming video, audio, text, and sensor data to address important security needs and to advance the state of the art in analytics, search and retrieval across many domains.
NIST has been organizing a measurement and evaluation challenge series (TREC) on large-scale text document search and retrieval for the past 22 years, and the latest TREC conference includes measurement of search accuracy on 1 billion web pages, search analysis on 100 million real-time tweets, and legal e-discovery on 7 million documents. Plans are underway to potentially extend such measurement and evaluations to more data types and domains/applications in Big Data.
On The TechAmerica Foundation Report on "Demystifying Big Data"
OTFL: What are your thoughts about the recent TechAmerica Report on Big Data?
Ashit Talukder, NIST: The TechAmerica report is thoughtful and
well done, and I encourage your readers to review it.
We are very interested in gathering a range of perspectives on
the Big Data landscape, from private sector stakeholders like Te-
chAmerica, academic researchers, and our colleagues through-
out the government sector.
To this end, we recently held a very well-received workshop on
the intersection of Cloud Computing and Big Data.
The webcast is archived and available for viewing on our NIST
Cloud Computing web site. The message that emerged from the
more than 600 individuals who attended in person or participat-
ed remotely was that the time to move forward on Big Data is now.
There is a need to focus the conversation through a consen-
sus definition for Big Data (there are many different definitions
today), a common language, and a shared technology roadmap
that allows us to work together effectively.
We are anxious to work with the community to pursue
these goals. n
Let's Talk Tech...
SMEs from Brocade, CommVault, EMC, HP Enterprise Services and Informatica have big things to say about Big Data.
In putting together this issue of OTFL, I had my own “Big Data”
moment.
I suddenly realized that in trying to take in all of the information I researched in print, online, via audio, via video and through conversations with Big Data experts, I was dealing first-hand with the same issues business managers, technologists and data scientists were dealing with when it comes to Big Data.
I was dealing with:
1. More data being collected than I could use.
2. Many data sets were too large “to download” into usable
memory.
3. Many data sets were too poorly organized to be usable.
4. Many data sets are heterogeneous in type, structure, se-
mantics, organization, granularity, accessibility.
5. My use of the data was limited by my ability to interpret and
use it.
For example, take the contents from an NSF Big Data presentation from June 2012 explaining Big Data. It said Big Data is a paradigm shift from hypothesis-driven to data-driven discovery; and that:
• Data-driven discovery is revolutionizing scientific explora-
tion and engineering innovations
• Automatic extraction of new knowledge about the physical,
biological and cyber world continues to accelerate
• Multi-cores, concurrent and parallel algorithms, virtualiza-
tion and advanced server architectures will enable data min-
ing and machine learning, and discovery and visualization of
Big Data.
At that point I knew I needed a Hadoop-type technology in my brain to make sense of things, to turn this "dirty data into information into action"!
Fortunately for me I was lucky enough to communicate one-
on-one with some of the leading industry experts from Brocade,
CommVault, EMC, HP Enterprise Services and Informatica about
Big Data. They were my Hadoop.
Those conversations appear on the following 5 pages (15-19). n
By Jeff Erlichman, Editor, On The FrontLines
Notional Information Flow — The Information Supply Chain (chart courtesy of TechAmerica Foundation): source data and applications (streaming data, text data, multi-dimensional, time series, geospatial, video and image, relational, social network) move through data acquisition; filtering, cleansing, and validation; storage, Hadoop, and warehousing; data preparation, data transformation, and a metadata repository; and core analytics (data mining and statistics, optimization and simulation, semantic analysis, fuzzy matching, network algorithms, new algorithms) before reaching users — analysts, industry domain experts, analytics solution end users, and other analysts and users — for business intelligence and decision support, with security, governance, privacy, and risk management spanning the whole chain.
OTFL SME One-On-One
Scott Pearson, Director, Big Data Solutions, Brocade
Casey Miles, Technical Marketing Manager (Big Data), Brocade
The IT Infrastructure Itself Is The Mission
Supporting Big Data environments and analytics initiatives requires networks to be integrated into solutions.
"There is a cultural and business model shift in motion," Scott Pearson, Brocade's Director for Big Data Solutions, told OTFL.
In Big Data, 40% of enterprises have developed their own applications, which they consider their competitive advantage, he added. This requires a business and technology model which is flexible and adaptable in architecting and delivering integrated customized solutions.
“Brocade, being at the intersection of integrated Big Data
solutions, is able to be at the core of this comprehensive, pack-
aged, supportable solution; or go the route of prototyping vari-
ous design ideas around a tested Ethernet fabric,” Pearson said.
Everyone Contributes
Integrating a Big Data Analytics solution into your enterprise can be a very daunting process, filled with risk. Success ultimately depends upon breaking up the silo mentality where server, applications and network people only talk among themselves.
"This simply cannot work in an area of IT that is primarily business driven," explained Casey Miles, Brocade's Technical Marketing Manager for Big Data. "In Big Data, the IT infrastructure itself is the mission."
This changes ownership, goals, and requirements drastically in a culture that just isn't configured to support that kind of pressure, Miles noted. Brocade has stepped up and pulled industry together to support the business decision maker.
“We got a lot of inquisitive looks when we held the first
workshop to benchmark Big Data at Brocade headquarters
from the various software engineers, database managers, and
analytics professors on the committee,” Miles said.
The first question was, “Why is this at a network hardware
company?”
That's exactly the paradigm that needed to be broken, Miles declared.
“Over the past year, Brocade has operated as a central
hub, bringing together software, database, server, academia,
standards bodies, and enterprise companies, forming strong
partnerships, and exposing the very problems that needed to
be overcome in order to turn IT from a back office into a busi-
ness community.”
Now, Brocade hardware engineers talk about finding value
in large data sets, application designers talk about hashing
algorithms to improve cluster performance, and everyone
contributes to the standards and benchmarks that will soon be
announced for Big Data.
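As one small illustration of the kind of hashing conversation mentioned above, here is a sketch of hash-based key placement across cluster nodes; the node names and key format are hypothetical.

```python
import hashlib

# Hashing record keys to spread them evenly and deterministically across nodes.
NODES = ["node-0", "node-1", "node-2", "node-3"]

def assign_node(record_key: str) -> str:
    digest = hashlib.md5(record_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]   # stable, evenly spread assignment

for key in ("sensor-17", "sensor-18", "sensor-19"):
    print(key, "->", assign_node(key))
```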
“Now, our new culture around Big Data strives to under-
stand what value each player can bring to a solution, optimize
it, integrate it, and support it as if it was part of their widget,”
Miles said. “This is the ideal culture that drives standardization,
adoption, and trust that will overcome the barriers of security
concerns and data migration.”
Pilots Please
"Just as important in this process is building a small scale prototype. Use this to prove that you can meet objectives and deliver results, then scale up to full production size," Miles counseled.
Brocade matches the best practice of starting small and
scaling up with unique Ethernet fabric products that take the
risk out of unpredictable performance at scale.
“As far as applications themselves, we are seeing a lot of
ideas around the graphing functions, complex embedded SQL
queries, and very creative data visualization techniques on mo-
bile devices. The industry is still getting its legs underneath it.”
Miles said there hasn’t been a “killer app” produced as of
yet and “until we know enough about the capabilities that data
presents, we won’t know where the curve stops and starts to
turn downward.”
Pearson and Miles both agree that the first step in Big Data
is to identify and articulate the problems you are attempting to
solve and/or purpose of your Big Data or Analytic Initiatives.
Step 2 is to get involved and join or participate in collab-
orative industry and academic forums such as the San Diego
Supercomputing Center sponsored Workshop for Big Data
Benchmarking (WBDB) or Center for Large Scale Data Systems
Research (CLDS).
"One thing to keep in mind is that this Big Data integration process is an iterative one. It's important not to be too rigid in the development of your program," noted Miles.
“Big Data is only limited by an organization’s ambition.” n
OTFL SME One-On-One
Shawn Smucker, District Manager, Federal Systems Engineering, CommVault
Understand, Then Create
Managing the storage costs for Big Data is a big deal. Organizations are faced with not only an explosive growth of data, but the time value of data (such as machine generated data) is increasingly contracting.
“Storage vendors typically provide capacity based reporting
that show how much space is being consumed on an array, a
drive, or even an individual platter,” Shawn Smucker, District
Manager, Federal Systems Engineering told OTFL in a recent
interview.
“However this level of reporting does not provide insight
into the data that is using that capacity. A storage administra-
tor may know 90% of an array’s capacity is being used, but they
cannot determine who owns those files, when the files were cre-
ated, when they were last accessed, or even if any of the files
are a prohibited type.”
Understand, Then Create
CommVault Simpana Primary Storage Reporting and File System Archiving helps organizations first understand what data they have, and then create intelligent policies to migrate the data to the most cost effective storage tier.
It provides granular, host based analysis of storage utilization, so storage and application administrators can quickly determine the number of files, file ownership, create and access times, as well as duplicate files for a given system.
“Once the administrators have an understanding of all of
the data consuming storage resources, they can create intel-
ligent, automated policies that ensure data only resides in cost-
effective storage tiers,” Smucker said.
“No longer do they find themselves simply reacting to stor-
age shortages. Now they can strategically plan their storage
investments and manage the migration policies to meet those
plans.”
CommVault’s File System Archiving manages the automated
migration across storage tiers.
“As soon as data is created, sent, or received CommVault
will begin monitoring the data and, when the defined criteria
are met, the data will be migrated to the next tier of storage,”
he explained.
CommVault’s heterogeneous storage support and native
cloud storage integration allows the data to migrate across as
many tiers of storage or in as many different locations as an
organization requires.
File System Archiving can also be leveraged to manage pro-
hibited file types on primary storage. “If for instance MP3 files
are not allowed, the archiving agent can delete or move those
as soon as they are discovered on the system,” Smucker said.
“Taken together, Primary Storage Reporting and File Sys-
tem Archiving allow organizations to fully understand their
data, plan the full life cycle of the data in their environment, and
make intelligent storage investment decisions. This leads to
direct cost savings in storage infrastructure and management.”
Reuse and Recoup
Federal IT organizations purchased nearly $1 billion of new electronic data storage in FY 2011. Before buying new storage, Smucker urges them to reuse or recoup capacity on their existing storage.
“CommVault’s combination of data life cycle management,
de-duplication and heterogeneous storage support provides IT
organizations the means to better manage their existing invest-
ments,” he said.
“Migrating data from the expensive tier 1 storage would free
capacity and allow for more strategic investments in produc-
tion storage. De-duplication of data managed by CommVault
ensures only unique data is being stored on tier 2 and tier 3
storage.
“Heterogeneous storage support allows organizations to
make the most cost effective investments across all tiers of
storage.”
One future aspect of managing Big Data will be deciding what we retain and how long we retain it. Not all data is created equally, but most organizations treat it all the same, keeping it for the same amount of time regardless of who created it or what type it is.
“Object based retention will change the way IT organizations
manage and retain data. Characteristics of individual data ob-
jects will determine where and for how long data is retained,”
Smucker said.
"For example, HR documents will be retained for years while system generated logs will be retained for days. The key to making object based retention manageable is eliminating the need to know where the data originated from."
Smucker explained an administrator will need to know the
characteristics of the data they need to manage, create a policy
around those characteristics, and then apply the policy to all
data being managed.
"The policies would span hosts and storage without the administrator needing to apply the policies on a granular, system by system basis." n
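Here is a vendor-neutral sketch of the object-based retention idea Smucker describes; the policy names, attributes, and default are hypothetical and not CommVault's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical object-based retention: policies match on data characteristics,
# not on which host or system the data came from.
@dataclass
class RetentionPolicy:
    name: str
    matches: Callable[[dict], bool]   # predicate over an object's characteristics
    retention_days: int

POLICIES = [
    RetentionPolicy("hr-documents", lambda obj: obj.get("category") == "hr", 365 * 7),
    RetentionPolicy("system-logs", lambda obj: obj.get("category") == "log", 30),
]

def retention_for(obj: dict) -> int:
    """Apply the first matching policy to any managed object."""
    for policy in POLICIES:
        if policy.matches(obj):
            return policy.retention_days
    return 90  # hypothetical default retention

print(retention_for({"category": "log", "host": "web01"}))  # 30
```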
OTFL SME One-On-One
Audie Hittle, CTO Federal, EMC Isilon Storage Division
Agencies are being driven to transform their data storage and
information management operations to achieve levels of effi-
ciency and effectiveness previously not thought possible.
How is this being achieved?
Audie Hittle, CTO Federal, told OTFL that "increasingly it's
through the analysis of Big Data enabling insights and decision
support only recently made possible through technological ad-
vances like Isilon’s OneFS.”
As CTO Federal, Hittle embraces a mission focused on the transformations underway from the introduction of Big Data. He observes market trends, translates user requirements and needs into solutions or future product roadmap considerations, and translates technical jargon into capabilities customers can appreciate.
"With the EMC Isilon portfolio of industry-leading scale-out network attached storage (NAS) solutions, everyone involved, from the IT professional through end users, has more time to think and focus on how to achieve their organization's transformational goals," Hittle observed.
The Information Culture
"I believe we are at that stage of our 'information culture' which is primed and ready for the type of transformation the EMC Isilon scale-out data storage solution can deliver. While it innovatively addresses other issues, such as security, data migration and standardization, the way it solves the culture challenge is transformational."
The intelligence and automation in the Isilon operating system, OneFS — which features cluster Auto-balancing, SyncIQ for remote synchronization, SmartPools next-generation tiering, and SmartLock for data protection and retention — are fundamentally changing the way data storage is managed.
“While I’ve heard the Isilon transformation referred to as
‘giving IT professionals their weekends back’, it goes way be-
yond that,” Hittle explained.
“It enables the team to focus more of their time and energy
on higher priority, more challenging and gratifying efforts
contributing directly to more cost efficient and effective op-
erations.”
A program, group or company implementing the EMC Isilon
data storage solution can achieve dramatic operational and cul-
tural changes. How dramatic?
"After implementing an Isilon solution, one government organization recently reported a decrease from 10-12 dedicated full-time data storage professionals to one or two full-time equivalent (FTE) team members, an 80-90 percent reduction," said Hittle.
This enabled the reallocation of those personnel to higher priority missions and provided the flexibility to absorb pending federal budget cuts.
Reliability and Availability
Hittle recently met with an industry systems integration team that was considering options for a Big Data project in the Intelligence, Surveillance and Reconnaissance (ISR) market.
“Like many organizations they were interested in the latest
market trends and technologies,” he recounted. “But interest-
ingly they were more concerned about the reliability and avail-
ability of their planned data storage project.”
Hittle advises government buyers to ask about system reliability and availability, "because it doesn't matter how good a 'deal' you got on it (read: how little you paid) if it doesn't work right after the first week, month or year on the job."
The #1 concern of smart federal buyers today is "efficiencies" — translation: the bottom line.
“Fortunately, more buyers are asking about the Total Cost of
Ownership or TCO for a Big Data solution,” he said.
“This includes everything from initial procurement costs to
energy to full life-cycle operations and management costs. So,
perhaps a second important question I would recommend is for
government buyers to look under the covers at the TCO — not
just the up-front, lowest cost per terabyte.”
When asked about the future of Big Data, Hittle observed that "the question isn't where the Big Data opportunities are, but rather where are there not Big Data opportunities?"
“The insights and enhanced decision support generated
from the access to and analysis of Big Data are relevant to al-
most every market and program,” he noted.
“Just as the ‘open’ movement, mobility, social media, and Big
Data have captured our collective imagination and mind share,
I think the next big thing to happen with Big Data will be the
innovation that occurs.”
"That innovation will stem from the application of technology, like EMC Isilon's scale-out data storage and information management solutions," Hittle explained. n
Driving Organizational Transformation
OTFL SME One-On-One
Diana Zavala, Director, Information Management and Analytics, HP Enterprise Services, U.S. Public Sector
Enabling Meaning Based Computing
How do you harness the volumes of different data at the speed required? How do you manage and derive insights from the "human" information, which is a complex task?
“We’ve moved beyond just the structured data. You must
tackle unstructured data to get the whole picture,” Diana Zavala,
Director, Information Management and Analytics told OTFL.
HP is uniquely positioned to help bridge the gap between the challenges of Big Data (Velocity, Volume, Variety, and Time to Value) and the needs of agencies to support their mission, said Zavala.
HP’s Next Generation Information Platform is anchored by
HP’s Autonomy Intelligent Data Operating Layer (IDOL) engine.
Autonomy enables Meaning Based Computing (MBC), which
forms an understanding of information and recognizes the re-
lationships that exist within it.
“IDOL can derive meaning from complex data sources, such
as e-mail, video, audio, social media, blogs, and streaming me-
dia, as well as traditional sources, to provide valuable insights,”
Zavala explained.
“This is a unique capability that stretches beyond keyword
matching, or ‘crowd-sourcing’ of topics, and other traditional
search algorithms.”
Autonomy technology was used during the 2012 Summer
Olympics to monitor social media and determine if there was
social activity regarding organized protests that would disrupt
the games or pose a potential security risk to participants or
spectators.
“Consider the benefit of this ability in a situation where you
want to gauge the effectiveness of or reaction to government
programs or legislation,” she said. “By creating ‘social intelli-
gence’, the government can adapt public policies in response
to citizen feedback.”
IDOL is at the core of the HP FUSION solution for situational
awareness. FUSION provides analytic capabilities to automate
information flows and the ability to interrogate data elements
to discover patterns and anomalies and bring this information
forward for easy consumption by decision makers.
“The goal is to exploit information when, where, and how it’s
needed in a way that makes it relevant and has a positive im-
pact,” Zavala explained.
Finding Insights
Another key component of the HP Next Generation Information Platform is Vertica, a massively parallel database, as well as an extensible analytics framework, optimized for real-time analysis.
“Vertica finds insights hidden in massive amounts of data
and has been used for fraud detection by uncovering patterns
of misuse,” Zavala said.
Further, HP's end-to-end security service provides a 360-degree view of the organization to protect data and covers the range of security technology, security consulting services, and managed security services.
“Today, we protect the U.S. Navy and Marine Corps from
multiple intrusions every month and we perform constant se-
curity research in order to discover more vulnerabilities than
anyone else in the world,” noted Zavala.
Pointed Questions
Decision makers seeking to optimize the information needed to support their objectives would be wise to ask the following questions, she said:
• Do I have access to 100% of my structured and unstruc-
tured data? And, can it be readily queried and analyzed?
• Can my information partner provide a complete, trusted
end-to-end solution for my Big Data problems?
Zavala counsels managers to consider buying decisions
within the bigger picture of how to optimize information across
their organization.
"Are you buying solutions to support an enterprise information strategy that enables an 'agile' approach to information? The focus should be on building a consistent information structure that provides value ranging from information security and compliance to analytics and agility."
Zavala says agencies must evolve their approach to more efficiently extract value from their data, because increased velocity and volume of data will drive new mission requirements and opportunities.
“Consider at least three intersecting areas: data volume,
speed, and cost. How do you filter to find the most relevant
information? New ways of analyzing the explosion of data are
needed to meet the expectation that decisions will be made in
more real time at the moment of risk.”
Moving towards information optimization requires an ability to apply science to business and a comprehensive information optimization strategy and technology, Zavala said. n
OTFL SME One-On-One
Todd Goldman, General Manager for Data Integration, Informatica
Harnessing Hadoop Power
Todd Goldman is responsible for Informatica's data integration and data quality product lines, which include the firm's Big Data initiative and in turn what's going on with Hadoop.
According to PC Magazine Hadoop is an “open source proj-
ect from the Apache Software Foundation that provides a soft-
ware framework for distributing and running applications on
clusters of servers. Inspired by Google’s MapReduce program-
ming model as well as its file system, Hadoop was originally
written for the Nutch search engine project.”
Goldman explained that Hadoop can take raw data (he called it dirty) in a variety of structured and unstructured formats and convert it into finished goods; that is, information that can be used to make decisions.
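For readers new to the MapReduce model that Hadoop implements, a minimal word-count job written for Hadoop Streaming gives a feel for how that raw-data-to-finished-goods work is split across a cluster. This is an illustrative sketch only, not Informatica’s tooling; the script names and data paths are hypothetical.

```python
#!/usr/bin/env python
# mapper.py -- reads raw text from stdin, emits "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop sorts mapper output by key, so all counts for the
# same word arrive together and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

The same pair of scripts can be tested locally with a pipeline such as `cat sample.txt | python mapper.py | sort | python reducer.py`, and on a cluster would typically be submitted through the Hadoop Streaming jar with the mapper, reducer, input and output paths supplied as arguments.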
“What has happened is technologies like Hadoop provide
a different fabric on which to do this data integration,” Gold-
man told OTFL, explaining that Hadoop is in the ‘early adopter’
phase.
Goldman says the next step for Hadoop is to move to the
‘early majority’ phase where people buy Hadoop because they
have seen successful deployments.
“The challenge is to effectively use Hadoop. For that you need people who are specialists in Hadoop, and these data scientists who can manage a 100-node Hadoop cluster are not easy to find,” explained Goldman.
PowerCenter Developer = Hadoop Developer
“Our customers use PowerCenter, which is a developer interface and engine that is used to extract, transform and load (ETL) data,” said Goldman. “Now we are giving them another engine that scales to bigger data and adds real-time data characteristics to it.”
Goldman said if you use PowerCenter as your factory for
converting data into information, you can now use PowerCenter
on top of Hadoop.
“The same skills you spent 10 years learning are still good,”
he said. “If you are a PowerCenter developer, presto-chango
you are now a Hadoop developer with no training.”
Goldman said customers tell him the effort to extract data is
roughly 80% of any kind of data analysis project. The last 20%
is done by the domain expert.
As an example, Goldman said to imagine two garages that
have tools, garden equipment and a lawn mower, but they are
not organized in the same way.
“Multiply that by a million for data when combining them
from different sources,” he said. “First you have to understand
what is in each source, how each is organized; then you have to
extract data from the source; then clean it up and then make it
consistent and combine it.”
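To make the garage analogy concrete, here is a minimal, hand-rolled sketch of those steps: understand how each source is organized, extract its records, clean them up, make them consistent, and combine them. The source layouts and field names are invented for illustration; this is not Informatica’s PowerCenter interface, which is meant to generate this kind of logic for you.

```python
# Two hypothetical inventory "garages", each organized differently.
source_a = [{"item": "Lawn Mower", "loc": "Garage 1"},
            {"item": "Hedge Trimmer", "loc": "Garage 1"}]
source_b = [("GARAGE 2", "LAWN MOWER"), ("GARAGE 2", "rake")]

def clean(name):
    # Normalize naming so records from different sources become comparable.
    return name.strip().lower()

# Extract from each source, map both layouts onto one consistent record
# shape, then combine the results into a single collection.
combined = (
    [{"location": clean(r["loc"]), "item": clean(r["item"])} for r in source_a] +
    [{"location": clean(loc), "item": clean(item)} for loc, item in source_b]
)

for record in combined:
    print(record)
```

Multiplied across millions of records and many sources, it is this kind of preparation work that Goldman says consumes roughly 80% of a data analysis project.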
Extracting data is what Informatica has been doing for years, but not on Hadoop, said Goldman, noting customers can use the same GUI with Hadoop that they were using with PowerCenter.
Hadoop Power
In the Hadoop world you can write code manually as you would program any application, declared Goldman, who said that Informatica is creating tool sets that are democratizing the use of Hadoop “so more people can do more work more quickly at a higher level of quality than they can do by hand.”
“We are creating a set of tools that make it easier to use Hadoop and automate processes so they don’t have to do 80% of the background work. This gives them the information in a format they can use for whatever their endeavor might be. We won’t need more Hadoop experts, but will give people the tools and training so they can use it.”
Goldman predicts in the next two years “you are going to
come to the time when you imagine the problem you want to
solve and you won’t need a team of Hadoop experts to solve it.”
Tools make the progression to the ‘early majority’ phase super-fast.
“Now I get comfortable. I can iterate so much faster, 5 times
faster than by hand. I can do 5 times more projects; I can do 5
projects in the time it would take me to do one.”
As an example Goldman pointed to drug company clinical
trials. Just two short years ago a smart scientist would examine
10,000 trials with 1,000 attributes and only study 100 attributes
because it was expensive and took too much time.
“Now with Hadoop you can dump all the data, use the tools and do an analysis on all 1,000 attributes. Now I can find new insights from this old data and find maybe this drug could be used for something new.”
The result is a democratization of the use of large scale data
analysis.
“We will see breakthroughs from those who don’t know data integration, but can ask questions and leverage information in new ways we never thought possible.” n
Bearing Big Data Fruit!
The National Science Foundation is investing in projects to accelerate and use knowledge from Big Data.
In October 2012, The National Science Foundation (NSF),
with support from the National Institutes of Health (NIH),
announced nearly $15 million in new Big Data fundamental
research projects.
“These awards aim to develop new tools and methods to ex-
tract and use knowledge from collections of large data sets to
accelerate progress in science and engineering research and in-
novation,” said the NSF press release.
These grants were made in response to a joint NSF-NIH call
for proposals issued in conjunction with the March 2012 Big Data
Research and Development Initiative launch: NSF Leads Federal
Efforts in Big Data.
NSF said the aim is to “enable new types of collaborations —
multi-disciplinary teams and communities.” The awards will also
help NIH get more value from their huge biological data sets by
addressing the technological challenges for extracting impor-
tant, biomedically relevant information from large amounts of
complex data.
The eight projects announced range from scientific tech-
niques for Big Data management to new data analytic approach-
es to e-science collaboration environments with possible future
applications in a variety of fields, such as physics, economics and
medicine.
“Big Data is characterized not only by the enormous volume
or the velocity of its generation, but also by the heterogeneity, di-
versity, and complexity of the data,” said Dr. Suzi Iacono, co-chair
of the interagency Big Data Senior Steering Group, a part of the
Networking and Information Technology Research and Develop-
ment program and senior science advisor at NSF.
“There are enormous opportunities to extract knowledge
from these large-scale, diverse data sets, and to provide powerful
new approaches to drive discovery and decision-making, and to
make increasingly accurate predictions. We’re excited about the
awards we are making today and to see what the idea generation
competition will yield.”
Big Data Awards
DCM: Collaborative Research: Eliminating the Data Ingestion Bottleneck in Big-Data Applications
Awarded to: Rutgers University, Stony Brook University
Big Data practice suggests that there is a tradeoff between
the speed of data ingestion, the ability to answer queries quickly
(e.g., via indexing), and the freshness of data. This tradeoff has
manifestations in the design of all types of storage systems. In
this project the principal investigators show that this is not a fun-
damental tradeoff, but rather a tradeoff imposed by the choice of
data structure. They depart from the use of traditional indexing
methodologies to build storage systems that maintain indexing
200 times faster in databases with billions of entries.
ESCE: DCM: Collaborative Research: DataBridge — A Sociometric System for Long-Tail Science Data Collections
Awarded to: University of North Carolina at Chapel Hill, Harvard University, North Carolina Agricultural and Technical State University
The sheer volume and diversity of data present a new set of
challenges in locating all of the data relevant to a particular line of
scientific research. Taking full advantage of the unique data in the
“long-tail of science” requires new tools specifically created to
assist scientists in their search for relevant data sets. DataBridge
supports advances in science and engineering by directly en-
abling and improving discovery of relevant scientific data across
large, distributed and diverse collections using socio-metric net-
works. The system will also provide an easy means of publishing
data through the DataBridge, and incentivize data producers to
do so by enhancing collaboration and data-oriented networking.
DCM: A Formal Foundation for Big Data Management
Awarded to: University of Washington
This project explores the foundations of Big Data manage-
ment with the ultimate goal of significantly improving the pro-
ductivity in Big Data analytics by accelerating data exploration.
It will develop open source software to express and optimize ad
hoc data analytics. The results of this project will make it easier
for domain experts to conduct complex data analysis on Big Data
and on large computer clusters.
DA: Analytical Approaches to Massive Data Computation with Applications to Genomics
Awarded to: Brown University
The goal of this project is to design and test mathematically
well-founded algorithmic and statistical techniques for analyzing
large-scale, heterogeneous and so-called noisy data. This project
is motivated by the challenges in analyzing molecular biology
data. The work will be tested on extensive cancer genome data,
contributing to better health and new health information tech-
nologies, areas of national priority.
DA: Distribution-based Machine Learning for High-dimensional Datasets
Awarded to: Carnegie Mellon University
The project aims to develop new statistical and algorithmic
approaches to natural generalizations of a class of standard ma-
chine learning problems. The resulting novel machine learning
approaches are expected to benefit other scientific fields in which
data points can be naturally modeled by sets of distributions,
such as physics, psychology, economics, epidemiology, medicine
and social network analysis.
DA: Collaborative Research: Genomes Galore — Core Techniques, Libraries, and Domain Specific Languages for High-Throughput DNA Sequencing
Awarded to: Iowa State University, Stanford University, Virginia Tech
The goal of the project is to develop core techniques and
software libraries to enable scalable, efficient, high-performance
computing solutions for high-throughput DNA sequencing, also
known as next-generation sequencing. The research will be con-
ducted in the context of challenging problems in human genetics
and metagenomics, in collaboration with domain specialists.
DA: Collaborative Research: Big Tensor Mining: Theory, Scalable Algorithms and Applications
Awarded to: Carnegie Mellon University, University of Minnesota Twin Cities
The objective of this project is to develop theory and algo-
rithms to tackle the complexity of language processing, and to
develop methods that approximate how the human brain works
in processing language. The research also promises better al-
gorithms for search engines, new approaches to understanding
brain activity, and better recommendation systems for retailers.
ESCE: Collaborative Research: Discovery and Social Analytics for Large-Scale Scientific Literature
Awarded to: Rutgers University, Cornell University, Princeton University
This project will focus on the problem of bringing massive
amounts of data down to the human scale by investigating the
individual and social patterns that relate to how text repositories
are actually accessed and used. It will improve the accuracy and
relevance of complex scientific literature searches. n
More NSF Big Data Explorations!
NSF is currently leading a variety of programs in the Big Data space. Click here for a complete list.
Core Techniques and Technologies for Advancing Big Data Science & Engineering (Big Data)
To advance the core scientific and technological means of
managing, analyzing, visualizing and extracting useful infor-
mation from large, diverse, distributed and heterogeneous
data sets.
Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21)
Program develops, consolidates, coordinates, and lever-
ages a set of advanced cyberinfrastructure programs and
efforts across NSF to create meaningful cyberinfrastructure,
as well as develop a level of integration and interoperability of
data and tools to support science and education.
CIF21 Track for IGERT
Establishes a new CIF21 track as part of NSF’s Integrative Gradu-
ate Education and Research Traineeship (IGERT) program to
educate and support a new generation of researchers able to
address fundamental Big Data challenges across disciplines.
Data Citation
Program provides transparency and increased opportunities for the use and analysis of data sets; it was encouraged in a
dear colleague letter initiated by NSF’s Geosciences director-
ate, demonstrating NSF’s commitment to responsible steward-
ship and sustainability of data resulting from federally funded
research.
Digging into Data Challenge
Explores how Big Data changes the research landscape for the hu-
manities and social sciences, in which new, computationally-
based research methods are needed to search, analyze, and
understand massive databases of materials such as digitized
books and newspapers, and transactional data from web
searches, sensors and cell phone records.
EarthCube
Program supports the development of community-guided
cyberinfrastructure to integrate data into a framework that will
expedite the delivery of geoscience knowledge.
The Open Science Grid (OSG)
This enables over 8,000 scientists worldwide to collabo-
rate on discoveries, including the search for the Higgs boson.
High-speed networks distribute over 15 petabytes of data
each year in real-time from the Large Hadron Collider (LHC)
at CERN in Switzerland to more than 100 computing facilities.
Partnerships provide the advanced fabric of services for data
transfer and analysis, job specification and execution, security
and administration, shared across disciplines including phys-
ics, biology, nanotechnology, and astrophysics.
Source: NSF
Image courtesy: National Science Foundation: Carrasco-Gonzalez et al., Curran et al., Bill Saxton, NRAO/AUI/NSF, NASA
By Jeff Erlichman, Editor, On The FrontLines
It Occurs To Me – Take Two
Calling All GWACs!
Eventually you have to buy Big Data. GWACs can provide the procurement flexibility you need.
Right now the National Science Foundation (NSF) is fund-
ing foundational research in Big Data through grants,
competitions and prizes. They have $200 million at their
disposal. Their goal is open-ended research that will allow scien-
tists to ask and answer questions that are still
in their imaginations.
Right now data scientists at NOAA, NASA, the National Archives and Records Administration, DHS as well as DOD are already well into doing the hard work needed to turn their Big Data to knowledge to action.
These organizations have already bud-
geted for Big Data.
But what about a program manager
at Transportation or Commerce who
sees the benefits of Big Data? Do they
have the funding? And if they do, what
procurement vehicles can they use to
actually buy Big Data capabilities and
services?
Eventually, there has to be a State-
ment of Work (SOW). Eventually a
contracting officer has to get involved.
If Big Data is to spread, if data is truly
to be democratized, agencies are going
to find ways to fund these efforts. And
then buy what they need.
Because my mind focuses on these
practical concerns and TechAmerica
included procurement as part of its
Demystifying Big Data report, I was
interested to discuss the issue in my
recent interview with TechAmerica’s
Chris Wilson.
Removing Barriers
First of all, Wilson said, there is no
real need for new procurement vehicles
specifically for Big Data.
“We haven’t asked for a specific Big
Data vehicle; we don’t really need a new
one,” Wilson explained.
In fact the TechAmerica report states that industry already
has “to participate through numerous, duplicative contracting
channels to maximize the ability to win government business.”
Further, the report cites a Bloomberg study saying the number of multiple-award contracts (MACs) has more than doubled since 2007, inflating costs without
adding value. This could actually hamper
Big Data efforts.
Wilson said instead of new MACs, exist-
ing contracts — especially the GWACs —
need to be cognizant that with Big Data,
you can’t put specifics in an RFP. The
SOW needs to be flexible.
“It’s an inherent Big Data issue; you are not going to prove a thesis,” he noted. A CIO may sense that there is more that can be done with the data they have, but they need the latitude to ask “what questions are out there that I need to discover and do a deeper dive?”
The idea is “to create pilot pro-
grams. The SOW is not as specific as
you would want, but well enough to
get the thing going.”
GWACs Are In Position
In its report, TechAmerica says the
government should avoid new contracts
and promote the use of existing GWACs
and the Federal Supply Schedules.
“Channeling Big Data solutions
through the minimum number of con-
tract vehicles necessary would allow
maximum integration, strategic sourc-
ing, governance, and standardization.”
Because of the nature of Big Data, Wilson urged contract vehicles “to make them sales friendly to acquisitions that may not have a clearly defined end game.”
Are you listening, NITAAC, SEWP and Alliant? n