Innovative minds are creating brand new ways of analyzing data and extracting knowledge from it.
Big Data In Government
3 Big Data Is A Big Deal
4 Big Ideas For Big Data
6 A Few Minutes With NSF's Dr. Suzi Iacono
8 Big Data Demystified
10 Making Your Big Data Move
12 Executive Interview With NIST's Ashit Talukder
14 Let's Talk Tech...
SME One-On-Ones
15 Brocade: Scott Pearson & Casey Miles
16 CommVault: Shawn Smucker
17 EMC Isilon: Audie Hittle
18 HP: Diana Zavala
19 Informatica: Todd Goldman
20 Bearing Big Data Fruit
22 Calling All GWACs
Volume 5 Number 2 March 2013
Inside Big Data
Published by
Download Your Digital Edition at www.OnTheFrontLines.net.
Courtesy National Science Foundation: Fuqing Zhang and Yonghui Weng, Pennsylvania State University; Frank Marks, NOAA; Gregory P. Johnson, Romy Schneider, John Cazes, Karl Schulz, Bill Barth, The University of Texas at Austin
BIG DATA: IT'S ON ISILON
• Store & Analyze Your Data
• Protect Your Data
• Secure Your Network
www.emc.com/federal
Game-Changing Technologies for ISR & FMV
By Jeff Erlichman, Editor, On The FrontLines
It Occurs To Me...
Big Data Is A Big Deal!
After collecting volumes of data from a variety of formats and sources and then analyzing them using my human processor, I've come up with my Big Data "Top Ten".
1. Big Data needs a consensus definition.
Go to any tech event and Big Data talk is on the lips of attendees. While most say they "know Big Data when they see it", when asked to give a specific definition, answers vary widely.
And that’s for good reason. So far there is no one accepted gov-
ernment definition for Big Data like there is for cloud computing.
In its 2012 “Demystifying Big Data” report the TechAmerica
Foundation defined Big Data as a term that describes:
“large volumes of high velocity, complex and variable data
that require advanced techniques and technologies to en-
able the capture, storage, distribution, management, and
analysis of the information.”
Currently NIST is leading the effort to define Big Data for gov-
ernment, along with building common languages and reference
models. Stay tuned.
2. Big Data is, well, big!
In 1978, I sold Big Data storage to the Naval Research Lab. My format was an 8" floppy disk storing a robust 8 kilobytes. Hard multiple-platter Disk Packs stored maybe 1 megabyte. Today hard disk capacities of 1.5 terabytes are commonplace.
According to the TechAmerica report, in 2009, the government produced 848 petabytes of data and healthcare data alone reached 150 exabytes. Five exabytes (an exabyte is 10^18 bytes) would contain all words ever spoken by human beings on earth.
3. Big Data is new — Not true
While the term "Big Data" with initial caps is new, big data itself is not new.
For example, NOAA/National Weather Service has been processing it since the 1950s. Today NOAA manages over 30 petabytes of new data per year. (How many 8K floppies is that?)
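For the curious, here is a quick back-of-the-envelope answer to that question, assuming decimal (SI) units and the 8-kilobyte floppy capacity cited above:

```python
# Rough floppy arithmetic, assuming decimal units (1 petabyte = 10**15 bytes)
# and the 8-kilobyte capacity of the 8" floppy mentioned above.
annual_data_bytes = 30 * 10**15      # ~30 petabytes of new NOAA data per year
floppy_bytes = 8 * 10**3             # one 8 KB floppy

floppies_per_year = annual_data_bytes / floppy_bytes
print(f"{floppies_per_year:,.0f}")   # 3,750,000,000,000 — about 3.75 trillion floppies
```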
4. Big Data characteristics: The 4Vs
Dealing with Big Data means dealing with:
(a) Volume — The sheer amount of data generated or data intensity that must be ingested, analyzed, and managed.
(b) Velocity — The speed data is being produced and changed, and the speed with which data must be received, understood, and processed.
(c) Variety — Structured and unstructured data in a variety of formats creates integration, management, governance, and architectural pressures on IT.
(d) Veracity — The quality and provenance of received data must be verified.
5. Why Big Data now!
Technology finally has the power to handle the 4Vs. We now have the tools to really ask 'what if' and to explore data sets that weren't available before or didn't exist before. Now it is possible to really think about 'the art of the possible'. We are witnessing a true democratization of data.
6. Big Data is now affordable.
You don't need to start from scratch with new IT investments. You can use your existing capabilities, technologies and infrastructure to begin pilots. No single technology is required; there are no 'must haves'.
7. The Big Data path is a circle not a straight line.
Define. Assess. Plan. Execute. Review. TechAmerica recommends a five-step cyclical approach that is iterative versus serial in nature, with a constant closed feedback loop that informs ongoing efforts.
8. The "Help Wanted" sign is up!
We have to grow the next generation of data scientists. The McKinsey Global Institute Analysis report predicts a shortfall of 200,000 data scientists over the next five or so years and a shortfall of 1 million managers in organizations where data will be critical to success — e.g. government.
9. Government is funding foundational Big Data R&D.
Projects are moving forward via the $200 million investment announced by the administration in March 2012. In October 2012, NSF/NIH announced 8 awards covering three areas: data management, data analytics and collaboration/data sharing. New solicitations, competitions and prizes are in the offing with opportunities for anyone who has a good idea.
10. The Big Data Market is growing bigger.
Deltek forecasts demand for big data solutions by the U.S. government will increase from $4.9 billion in FY 2012 to $7.2 billion in FY 2017, a compound annual growth rate (CAGR) of 8.2%.
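As a sanity check on that forecast, here is a minimal sketch of the standard CAGR calculation; the dollar figures are Deltek's, and the formula is the usual one:

```python
# CAGR sanity check on the Deltek forecast: $4.9B (FY 2012) to $7.2B (FY 2017),
# treated as five annual growth periods.
start, end, periods = 4.9, 7.2, 5
cagr = (end / start) ** (1 / periods) - 1
print(f"CAGR is about {cagr:.1%}")  # roughly 8%, in line with the cited 8.2%
```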
Big Data means big opportunities. That’s truly a big deal. n
© Copyright 2013 Trezza Media Group, Public Sector Communications, LLC
Big Ideas For Big Data
Big Data inspires visions where we imagine 'what if'. In reality, it poses the real-life question of how we are going to get from today to 'what if'?
The answer: “We have a lot of science to do,” declared NSF’s
Dr. Suzi Iacono in a recent interview with OTFL.
“We truly believe there is a chance for trans-
formation in science and engineering to help us
with our biggest national challenges in energy,
health and climate change.”
Dr. Iacono is the co-chair of the interagency
Big Data Senior Steering Group, a part of the Net-
working and Information Technology Research
and Development (NITRD) program and a senior science advisor
at NSF.
The NITRD program chartered the Big Data Senior Steering
Group in 2011. NSF co-chairs the committee with NIH and mem-
bers include DARPA, DOD OSD, DHS, DOE-Science, HHS, NARA,
NASA, NIST, NOAA, NSA, and USGS.
Dr. Iacono explained this is a long-term national Big Data R&D
effort with four major components:
1. Foundational research to develop new techniques and tech-
nologies to derive knowledge from data.
2. New cyberinfrastructure to manage, curate, and serve data
to research communities.
3. New approaches for education and workforce development.
4. Challenges and competitions to create new data analytic
ideas, approaches, and tools from a more diverse stake-
holder population.
At NSF, Basic Research
Dr. Iacono said NSF is a basic science agency focusing more on the long-term; the kinds of foundational research the private sector cannot fund or would not have dedicated researchers working on.
“When people ask me the question ’what is research doing for
me today in terms of Big Data?’, I tell them ‘if you think you have
problems today, just wait five or ten years from now and you’ll be
glad we are doing this basic research we are doing today’.”
She notes that the issues and challenges of dealing with vol-
umes, velocity and variety are only going to increase. “The scien-
tists and engineers will want to ask bigger and more important
research questions so we have bigger and better discoveries.”
Dr. Iacono explained NSF is working on three core Big Data
issues:
• Collection, storage and management
• Data analytics
• Data sharing and collaboration.
“We really have to figure out how to scale from the kind of
small data sets that we have today to very large and very hetero-
geneous data sets,” said Dr. Iacono.
Heterogeneity is one of the biggest challenges. It includes
large, diverse, complex, longitudinal, and/or distributed data
sets generated from instruments, sensors, Internet transactions,
email, video, click streams, and/or all other digital sources avail-
able today and in the future according to Dr. Iacono.
In October 2012 NSF/NIH announced awards for eight projects
that cover data management, data analytics and collaboration/
data sharing. (Details on page 20)
With the Administration’s $200 million Big Data R&D Initiative, government is taking tangible steps to address the challenges and opportunities of Big Data.
Dr. Suzi Iacono, NSF
That’s just the beginning.
"Those were just the mid-scale awards. We are about to announce our small projects and there will be more of those," Iacono said, adding that 'small' means just smaller projects with smaller budgets and fewer researchers working on them than the mid-scale awards.
"Then a new solicitation will be coming out for competitions and prizes," she added, "because we want to reach out beyond those who traditionally propose. Anyone who has good ideas can submit ideas for competition."
NIST Big Data/Cloud Workshop
Dr. Iacono presented her Big Data message to the community at the NIST Big Data/Cloud Workshop held in January 2013. Before an SRO audience she spoke about the four thrusts and everyone's common challenges.
"All the agencies understand what the challenges are, whether we are NASA, NOAA, USGS, NIST or DARPA," she said. "The issue is we share common challenges; and how are we going to solve them in new, meaningful ways that are really going to make a huge difference down the line."
Ashit Talukder, Chief, Information Access Division, ITL,
NIST said NIST, other federal agencies, the private sector and
academia are working together to come up with consensus
standards, guidelines and ways to measure Big Data efforts. And
there is that Big Data definition issue.
“We had one breakout dedicated to coming up with a consen-
sus definition and common language for Big Data technologies.
We are working to develop a shared technology roadmap that
allows the community and all stakeholders to work together.”
Talukder said NIST is taking a similar approach as it did for
cloud computing, developing a common definition and roadmap
that all can follow.
"We have a working group trying to come up with a consensus definition for Big Data. We need to start with the fundamental questions," he said.
"Then we can have a definition so we can speak a common language; have a shared set of reference architectures that is well recognized and used throughout the community. That is key so that we have a more uniform way of approaching programs."
After all, said Talukder, "the end game for Big Data is knowledge extraction; turning data into information into knowledge; then taking action based on that knowledge." n
During a recent Federal Executive Forum broadcast on Federal
News Radio, leaders from NASA and NGA spoke about their Big
Data strategies and efforts.
Dr. Robert Norris, Chief Architect, National Geospatial-Intelligence Agency (NGA)
Myra Bambacus, Geospatial Information Officer (GIO), NASA Goddard Spaceflight Center
Big Data Imaging at NASA
At NASA Johnson Space Center, NASA's imagery collection of still photography and video spans more than half a century: from the early Gemini and Apollo missions to the Space Station. This imagery collection currently consists of over 4 million still images, 9.5 million feet of 16mm motion picture film, over 85,000 video tapes and files representing 81,616 hours of video in analog and digital formats.
Eight buildings at JSC house these enormous collections and
the imagery systems that collect, process, analyze, transcode,
distribute and archive these historical artifacts.
NASA’s imagery collection is growing exponentially, and its
sheer volume of unstructured information is the essence of
Big Data.
NASA has developed best practices through technologies and processes to comply with NASA records retention schedules and archiving guidance; migrate imagery to an appropriate storage medium and format destined for the National Archives; and develop technology to digitally store down-linked images and video directly to tape libraries.
OTFL @ The Federal Executive Forum
“ The end game for Big Data is knowledge extraction; turning data into information into knowledge; then taking action based on that knowledge.”
— Ashit Talukder, Chief, Information Access Division, ITL, NIST
Source: TechAmerica Foundation Demystifying Big Data Report, Oct. 2012
Graphic courtesy National Science Foundation: Leonid A. Mirny and Erez Lieberman-Aiden
A Few Minutes With...
Dr. Suzi Iacono, Senior Science Advisor for the Directorate for Computer and Information Science and Engineering (CISE), National Science Foundation (NSF)
Dr. Suzi Iacono co-chairs the interagency Big Data Senior
Steering Group, a part of the Networking and Information
Technology Research and Development (NITRD) program.
Dr. Iacono is passionate about her work at NSF and NSF’s role
in tackling basic scientific research.
"It is precisely our mission to not think about the end game. We see science beyond the frontiers, actually imagining what we should do to tackle some of our biggest problems."
NSF funded the basic research for "recommender" software that Amazon and other online retailers use. They also funded the basic technologies that allow companies like eBay to do online auctions without cheating.
"At NSF, that's what we do, it's 'where discoveries begin'," Dr. Iacono declared. She was kind enough to talk with OTFL recently about Big Data. What follows is a lightly edited transcript.
OTFL: Every agency has a unique mission. What about NSF?
Dr. Iacono: We are a basic science agency, so while there are some practical aspects to the research we fund, it is really long-term. (Because private sector R&D budgets are not large) we fund the kinds of things companies would not invest in or would not have internal researchers working on.
It is precisely our mission to not think about the end game, to see science beyond the frontiers, actually imagining what we should do to tackle some of our biggest problems.
It’s often true that scientists come up with brand new ways of
doing things that were never thought of before; and that’s what
we are trying to do with our research programs at NSF — to think
outside the box, be creative, innovative and come up with brand
new ways of analyzing data and extracting knowledge from it.
OTFL: Can you give examples?
Dr. Iacono: One example of trying to bring communities together is a project called EarthCube. Our geoscience directorate has three divisions: Ocean, Earth and Atmospheric.
Some smart people in that directorate asked the question:
Wouldn’t it be good if geoscientists could use all that data to ask
really big questions about climate change?
If you are a geoscientist who works only with oceans, you are using only the Ocean community database. But what if you could get information about the Earth and atmosphere?
Then you could really pose some much more interesting
grand challenge research questions and we could find out a lot
more about things like climate change.
Another example is using real time data to evacuate people
in a big storm; being able to take the data that FEMA, NOAA and
USGS have and be able to actually integrate it in real time; and get
real actionable knowledge of what to do.
This will change the course of how we handle public safety in
this country, because being able to get people going down the
right road rather than the wrong one will save lives.
OTFL: What about new tools?
Dr. Iacono: We need brand new ways to scale small databases up so they can handle very, very large data sets. What that means is new algorithms, new statistical approaches, new kinds of mathematics have to be brought to bear on the problem; so we bring together multi-disciplinary teams from across these disciplines to work on these problems.
Take education for example... education researchers in the US are funded by NSF. They are interested in the science of learning. To do that they go to a local elementary school and they compare how kids learn in two classes of 30 kids. They observe, they take tests, they survey, but the data points are really small.
Compare that with a really large online learning community like Stanford's. They have 200,000 students taking a course. So, as students click to move around the website, you could capture every interaction with every student. Find out how long they rest on a page and how long they read things.
The amount of data is so much different, orders of magnitude
larger. We could really make a big difference in how we teach
our young people. We can understand when someone does not
understand a concept. We can intervene before someone drops
out. We have so much data along the whole process rather than
waiting to take a test at the end.
(In the future) those mining their data will have that much more data to mine. What kind of new analytical tools will they have at hand to actually come up with new knowledge? The issue for us is what can we do today to really give people new tools, new approaches, new technologies they can use five or ten years down the road. n
In the 1990s doing business online was doing ‘eCommerce or
eBusiness’. Today, it’s just ‘commerce or business’.
That transformation occurred because the underlying In-
ternet technologies matured and the public gained trust in doing
business online.
That same transformation will hopefully occur with Big Data.
In the end, once the technologies mature and using Big Data appli-
cations in real-time becomes common-
place, Big Data will become just data.
“Collection, use cases, analysis,
sharing will all be par for the course,”
explained Chris Wilson, Vice President
and Counsel, Communications, Privacy,
and Internet Policy at TechAmerica
Foundation.
“Big Data terminology will become
less important than the actual results,”
Wilson told OTFL in a recent interview.
“Government will ask ‘how can Big Data
help?’ Results and outcomes, not termi-
nology will dominate the conversation.”
Big Data Commission: Bringing Clarity
A few short years ago, defining — and then migrating and using the cloud — consumed the IT conversation. Now,
the technological, policy and business
infrastructures are in place and cloud
computing is quickly becoming main-
stream computing. What got cloud go-
ing was NIST defining the cloud so everyone had a reference point.
A few short years ago, Big Data (though it has always been
with us) wasn’t even in the discussion. Now, of course, it is. Now
Big Data is in a similar position as cloud’s early years. Everyone
knows Big Data when they see it, but so far, there is no consensus
definition. And with no starting point, there can be no clarity.
“We felt it was necessary to provide some clarity to the gen-
eral discussion within industry and government,” Wilson said,
describing the formation and goals of its Big Data Commission,
consisting of 25 of America’s leading IT firms.
The result is its report: “Demystifying Big Data: A Practical
Guide To Transforming The Business of Government”.
Wilson said the report’s goal was to talk about Big Data in plain
language, develop a working definition, describe the technology
underpinnings and provide practical advice on getting started.
(More on these on pages 10 and 14)
“We have tangible case studies on how Big Data has been
implemented in real world scenarios,” he said. These include
implementations by NASA, NOAA/National Weather Service and
the National Archives and Records Administration (NARA). How
these agencies solved their Big Data requirements provide guid-
ance for new endeavors.
The report breaks the Big Data dis-
cussion down into five parts:
1. Big Data Definition & Business/
Mission Value
2. Big Data Case Studies
3. Technical Underpinnings
4. The Path Forward: Getting Started
5. Public Policy
According to the report, “the Com-
mission based its findings on the practi-
cal experiences of those government
leaders who have established early
successes in leveraging Big Data, and
the academics and industry leaders who
have supported them.
The intent is to ground report rec-
ommendations in these best practices
and lessons learned, in an effort to cut
through the hype, shorten the adoption
curve, and provide a pragmatic road map
for adoption.”
Positive Reaction
Since its release in October 2012, Wilson said the feedback has been positive.
"The feedback from the administration has been positive, especially from OSTP who thinks it has a lot of value," said Wilson, who also said there was a positive reaction from the Hill.
Looking to the future, Wilson said the TechAmerica plan is for its Big Data subcommittee to use the report as a template for advocacy and engagement with government and look for ways to collaborate on Big Data messaging.
Wilson says the report shows that with all the amazing stuff
going on we are just at the tip of the iceberg.
“While I think data analysis is a huge part of Big Data, a Big
Data definition encompasses more. Data collection, use and shar-
ing are all part of the Big Data world wrapped around the data
analysis axle,” he said. “Our working definition includes all of
these things.” n
Big Data Demystified
In its "Demystifying Big Data" report, the TechAmerica Foundation presents a practical guide to how Big Data can transform how government does business and delivers services.
“Big Data is a term that describes large volumes of high velocity, complex and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information.”
TechAmerica Foundation Report: Demystifying Big Data; October 2012
Report cover: "Demystifying Big Data: A Practical Guide To Transforming The Business of Government," prepared by TechAmerica Foundation's Federal Big Data Commission.
The path is a circle, not a straight line.
You know the data you steward could do more to im-
prove the lives of citizens, or improve social and welfare
services or save the government money. You know you are a
prime candidate for Big Data. You also know that unless you are
getting part of the $200 million in announced grants, new money
is not an option.
But that doesn’t mean you can’t get started.
NIST's Ashit Talukder told OTFL "first you must start with fundamental questions including: 'What is Big Data?' Having a consensus definition allows people to speak in a common language, use a shared set of reference architectures and is the key to a more uniform way of approaching Big Data."
At the moment, there is no consensus definition and NIST and others are working hard to formulate one. But in the meantime, Big Data projects need to go forward. So, how do you proceed?
TechAmerica’s Big Data Commission in its “Demystifying Big
Data” report presents a proven “five step cyclical approach to
take advantage of the Big Data opportunity”.
Review each step continually along the way. To think Big Data
is to think in cyclical not, serial terms.
“These steps are iterative versus serial in nature, with a con-
stant closed feedback loop that informs ongoing efforts,” the Big
Data Commission writes.
Start from "a simple definition of the business and operational imperatives" you want to address and a "set of specific business requirements and use cases that each phase of deployment will support."
“At each phase, review progress, the value of the investment,
the key lessons learned, and the potential impacts on gover-
nance, privacy, and security policies. In this way, the organization
can move tactically to address near term business challenges, but
operate within the strategic context of building a robust Big Data
capability.”
How To Succeed
Specifically, the Big Data Commission says successful Big Data implementations:
1. Define business requirements: Start with a set of specific
and well defined mission requirements, versus a plan to deploy a
universal and unproven technical platform to support perceived
future requirements. The approach is not “build it and they will
come,” but “fit for purpose.”
2. Plan to augment and iterate: Augment current IT invest-
ments rather than building entirely new enterprise scale sys-
tems. The new integrated capabilities should be focused on
initial requirements but be part of a larger architectural vision
that can include far wider use cases in subsequent phases of
deployment.
3. Big Data entry point: Successful deployments are characterized by three "patterns of deployment" underpinned by the selection of one Big Data "entry point" that corresponds to one of the key characteristics of Big Data (a simple sketch of this selection logic follows the list).
• Velocity: Use cases requiring both a high degree of velocity
in data processing and real time decision making, tend to require
Streams as an entry point.
• Volume: Those struggling with the sheer volume in the data
they seek to manage, often select a database or warehouse archi-
tecture that can scale out without pre-defined bounds.
• Variety: Those use cases requiring an ability to explore, un-
derstand and analyze a variety of data sources, across a mixture
of structured, semi-structured and unstructured formats, hori-
zontally scaled for high performance while maintaining low cost,
imply Hadoop or Hadoop-like technologies as the entry point.
4. Identify gaps: Once an initial set of business requirements
have been identified and defined, government IT leaders assess
their technical requirements and ensure consistency with their
long term architecture. Leaders should identify gaps and then
plan the investments to close the gaps.
5. Iterate: From Phase I deployments you can then expand to
adjacent use cases, building out a more robust and unified Big
Data platform. This platform begins to provide capabilities that
cut across the expanding list of use cases, and provide a set of
common services to support an ongoing initiative. n
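To make the entry-point guidance in step 3 concrete, here is a minimal sketch of that selection logic, assuming a simplified model in which one dominant characteristic drives the choice; the mapping paraphrases the Commission's guidance and is illustrative, not prescriptive.

```python
# Illustrative mapping of a dominant Big Data characteristic to the entry point
# suggested by the Commission's three patterns of deployment.
ENTRY_POINTS = {
    "velocity": "stream-processing platform for real-time decision making",
    "volume": "scale-out database or warehouse architecture",
    "variety": "Hadoop or Hadoop-like technologies",
}

def pick_entry_point(dominant_characteristic: str) -> str:
    """Return a suggested entry point for the dominant characteristic."""
    return ENTRY_POINTS.get(
        dominant_characteristic.lower(),
        "clarify business requirements before choosing a platform",
    )

print(pick_entry_point("variety"))  # Hadoop or Hadoop-like technologies
```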
Making Your Big Data Move
To think Big Data is to think in cyclical, not serial, terms.
Chart: 1. Define. 2. Assess. 3. Plan. 4. Execute. 5. Review. (Courtesy of TechAmerica Foundation.)
© 2012 Brocade Communications Systems, Inc. All Rights Reserved.
Brocade is helping federal agencies deliver data center-class reliability and scalability to the edges of the network and into the cloud.
Brocade is, quite simply, the leader in cloud-optimized networking for the federal government. With the largest breadth of federally certified products, Brocade is committed to achieving the highest standards of interoperability and reliability required for all federal solutions and the Cloud First mandate. Brocade builds network foundations that ensure federal data center consolidations enable cutting-edge cloud services, seamlessly.
When the mission is critical, the network is Brocade. Learn more at brocade.com/everywhere
Brocade. Unlock the full potential of the cloud.
OTFL Executive Interview
Ashit Talukder, Chief, Information Access Division, Information Technology Laboratory (ITL), National Institute of Standards and Technology (NIST)
More than 600 gathered at the 2013 NIST joint Cloud and
Big Data Workshop to explore the “what if” possibilities
at the intersection of cloud computing and Big Data.
"Cloud is a multiplier when it's combined with other technologies, and it frees the users from the constraints of location and catalyzes data-driven inspiration for discovery," Dr. Patrick Gallagher, Under Secretary of Commerce for Standards and Technology and NIST Director, told attendees.
"Now, Big Data, unlike cloud, doesn't have a common definition yet. We haven't yet agreed as a community what exactly we mean by Big Data. But whatever it is, it's here.
"And like cloud, Big Data is going to change everything. We are really looking at a new paradigm, a place of data primacy where everything starts with consideration of the data rather than consideration of the technology."
At the workshop, the NIST Information Technology Labora-
tory Big Data team led sessions that focused on some of Big
Data’s most pressing concerns. Speakers included:
• Christopher L. Greer, Acting Senior Advisor for Cloud
Computing, Associate Director, Program Implementation Office
• John Garofolo, Senior Advisor, Information Access Division
• Ashit Talukder, Chief, Information Access Division
• Mary Brady, Leader, Information Systems Group, Software
and Systems Division
But to make this all happen, there are many challenges. In
this OTFL interview, NIST’s Ashit Talukder, a key member of the
NIST Big Data team, spoke about some of the important issues
facing the Big Data community.
On Lifecycle Management: Measurement Science, Benchmarking and Evaluation Challenges
OTFL: What can people do to meet those challenges?
Ashit Talukder, NIST: The key to meeting these important chal-
lenges is to have plans that cover all phases of the life cycle, from
initial capture to long term archiving; including steps that may not
often be considered.
Dispensing with data that are not needed or for which the
cost of preservation exceeds the value is an example. Underly-
ing this example is the principle that not all data may need to be
preserved.
For example, the raw data output of a computer simulation
may be massive and expensive to preserve. But if the simulation
itself — and the ability to run the simulation — is compact and can
be efficiently preserved, the data can be regenerated on demand.
In contrast, historical observational data cannot be regener-
ated and these typically require different evaluation and preser-
vation strategies.
These examples point out that thoughtful data curation — in-
cluding continuously evaluating the preservation value of data
— is a key part of life cycle management.
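A tiny sketch of the regenerate-on-demand idea above, assuming a toy simulation whose inputs (a seed and a step count) are far cheaper to preserve than its output:

```python
import random

# Preserve the compact generator (seed + parameters), not the massive output:
# the raw data can be rebuilt on demand, as described above.
def run_simulation(seed: int, steps: int) -> list[float]:
    rng = random.Random(seed)        # the preserved seed makes the run repeatable
    position, path = 0.0, []
    for _ in range(steps):
        position += rng.uniform(-1, 1)
        path.append(position)
    return path

preserved_inputs = {"seed": 42, "steps": 1_000_000}   # a few bytes to archive
regenerated = run_simulation(**preserved_inputs)      # large output, rebuilt on demand
assert regenerated == run_simulation(**preserved_inputs)
```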
Data lifecycle management also includes adding data identi-
fiers, data descriptors such as metadata, ontologies, data com-
pression (using lossless or lossy techniques), and methodologies
to search and access data.
Properly evaluating the preservation value of data is certainly
a challenge. A key factor lies in assessing the current and poten-
tial uses of the data.
The NIST Scientific Data Committee, composed of data sci-
entists and technologists from fields across NIST, is currently
working on the challenge of developing effective metrics and
measures for data impact.
The group has started its work by focusing on software tools
and resources that NIST makes available for downloading and use
by others. This initial work will be extended to NIST Standard Ref-
erence Data as well as the other data resources developed across
the NIST labs.
On Big Data Infrastructure
OTFL: Do we have the infrastructure in place currently? What is needed and what is not?
Ashit Talukder, NIST: There have been significant advances in
Big Data infrastructure over the past few years that have enabled
deployment of scalable Big Data solutions on platforms.
Hadoop on cluster-based environments is currently the most
commonly used infrastructure, but others that employ HPCC-
based solutions may be used as alternatives, depending on the
types of applications.
Research on different hardware-based solutions including GPUs, multicores and cluster computing is underway and may offer specialized solutions in different problem domains.
Networking or the movement of large volumes of data is a
challenge. Solutions that enable faster movement of data from
storage to memory, or between nodes in a cluster environment
are being investigated and will potentially assist in faster Big Data
computations.
On Analytics, Processing and Interaction: Measurement Science, Benchmarking and Evaluation Challenges
OTFL: What can people do to meet those challenges?
Ashit Talukder, NIST: One challenge in processing Big Data is
being able to effectively extract knowledge and present relevant
information to users in a timely manner. The heterogeneity of
data types in Big Data includes unstructured information such as
text, speech, video, and images.
Many of the existing analytics solutions cannot handle unstructured data or assimilate a variety of data types, may not be suitable for handling distributed data, or may not be able to extract information from real-time data streams (compared to traditional solutions that do batch processing on pre-stored data). Visualization, interaction, and usability of Big Data interfaces also are areas that need significant improvements.
While promising solutions for analytics on Big Data have been
offered recently, there is a dearth of techniques to measure the
efficacy of different analytics solutions.
Measurement and benchmarking of analytics solutions could include many metrics, such as accuracy, false/true classification rates, computation speed, generalization, data management efficiency, and many others.
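As a small illustration of the first two metrics mentioned, here is a sketch that scores a binary classifier against ground-truth labels; the sample labels and metric selection are illustrative, not NIST's benchmark definitions.

```python
# Toy scoring of a binary classifier: accuracy plus true/false positive rates.
def classification_metrics(predicted: list[int], actual: list[int]) -> dict[str, float]:
    tp = sum(p == a == 1 for p, a in zip(predicted, actual))   # true positives
    tn = sum(p == a == 0 for p, a in zip(predicted, actual))   # true negatives
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    return {
        "accuracy": (tp + tn) / len(actual),
        "true_positive_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

print(classification_metrics([1, 0, 1, 1], [1, 0, 0, 1]))
```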
We're working with the defense and intelligence communities and others on integrating, analyzing and interpreting streaming video, audio, text, and sensor data to address important security needs and to advance the state of the art in analytics, search and retrieval across many domains.
NIST has been organizing a measurement and evaluation challenge series (TREC) on large-scale text document search and retrieval for the past 22 years, and the latest TREC conference includes measurement of search accuracy on 1 billion web pages, search analysis on 100 million real-time tweets, and legal e-discovery on 7 million documents. Plans are underway to potentially extend such measurement and evaluations to more data types and domains/applications in Big Data.
On The TechAmerica Foundation Report on "Demystifying Big Data"
OTFL: What are your thoughts about the recent TechAmerica Report on Big Data?
Ashit Talukder, NIST: The TechAmerica report is thoughtful and
well done, and I encourage your readers to review it.
We are very interested in gathering a range of perspectives on
the Big Data landscape, from private sector stakeholders like Te-
chAmerica, academic researchers, and our colleagues through-
out the government sector.
To this end, we recently held a very well-received workshop on
the intersection of Cloud Computing and Big Data.
The webcast is archived and available for viewing on our NIST
Cloud Computing web site. The message that emerged from the
more than 600 individuals who attended in person or participat-
ed remotely was that the time to move forward on Big Data is now.
There is a need to focus the conversation through a consen-
sus definition for Big Data (there are many different definitions
today), a common language, and a shared technology roadmap
that allows us to work together effectively.
We are anxious to work with the community to pursue
these goals. n
Let's Talk Tech...
SMEs from Brocade, CommVault, EMC, HP Enterprise Services and Informatica have big things to say about Big Data.
In putting together this issue of OTFL, I had my own “Big Data”
moment.
I suddenly realized that in trying to take in all of the information I researched in print, online, via audio, via video and through conversations with Big Data experts, I was dealing first-hand with the same issues business managers, technologists and data scientists were dealing with when it comes to Big Data.
I was dealing with:
1. More data being collected than I could use.
2. Many data sets were too large “to download” into usable
memory.
3. Many data sets were too poorly organized to be usable.
4. Many data sets are heterogeneous in type, structure, se-
mantics, organization, granularity, accessibility.
5. My use of the data was limited by my ability to interpret and
use it.
For example, take the contents from an NSF Big Data presentation from June 2012 explaining Big Data. It said Big Data is a paradigm shift from hypothesis-driven to data-driven discovery; and that:
• Data-driven discovery is revolutionizing scientific explora-
tion and engineering innovations
• Automatic extraction of new knowledge about the physical,
biological and cyber world continues to accelerate
• Multi-cores, concurrent and parallel algorithms, virtualiza-
tion and advanced server architectures will enable data min-
ing and machine learning, and discovery and visualization of
Big Data.
At that point I knew I needed a Hadoop-type technology in my brain to make sense of things, to turn this "dirty data into information into action"!
Fortunately for me I was lucky enough to communicate one-
on-one with some of the leading industry experts from Brocade,
CommVault, EMC, HP Enterprise Services and Informatica about
Big Data. They were my Hadoop.
Those conversations appear on the following 5 pages (15-19). n
By Jeff Erlichman, Editor, On The FrontLines
Notional Information Flow — The Information Supply Chain (chart courtesy of TechAmerica Foundation): source data and applications (streaming data, text data, multi-dimensional, time series, geospatial, video and image, relational, social network) move through data acquisition; filtering, cleansing, and validation; storage, Hadoop, and warehousing; data preparation, data transformation, and a metadata repository; and core analytics (data mining and statistics, optimization and simulation, semantic analysis, fuzzy matching, network algorithms, new algorithms) before reaching users — analysts, industry domain experts, analytics solution end users, and other analysts and users — for business intelligence and decision support, with security, governance, privacy, and risk management spanning the whole chain.
OTFL SME One-On-One
Scott Pearson, Director, Big Data Solutions, Brocade
Casey Miles, Technical Marketing Manager (Big Data), Brocade
The IT Infrastructure Itself Is The Mission
Supporting Big Data environments and analytics initiatives requires networks to be integrated into solutions.
"There is a cultural and business model shift in motion," Scott Pearson, Brocade's Director for Big Data Solutions, told OTFL.
In Big Data, 40% of enterprises have developed their own applications, which they consider their competitive advantage, he added. This requires a business and technology model which is flexible and adaptable in architecting and delivering integrated customized solutions.
“Brocade, being at the intersection of integrated Big Data
solutions, is able to be at the core of this comprehensive, pack-
aged, supportable solution; or go the route of prototyping vari-
ous design ideas around a tested Ethernet fabric,” Pearson said.
Everyone Contributes
Integrating a Big Data Analytics solution into your enterprise can be a very daunting process, filled with risk. Success ultimately depends upon breaking up the silo mentality where server, applications and network people only talk among themselves.
"This simply cannot work in an area of IT that is primarily business driven," explained Casey Miles, Brocade's Technical Marketing Manager for Big Data. "In Big Data, the IT infrastructure itself is the mission."
This changes ownership, goals, and requirements drastically in a culture that just isn't configured to support that kind of pressure, Miles noted. Brocade has stepped up and pulled industry together to support the business decision maker.
“We got a lot of inquisitive looks when we held the first
workshop to benchmark Big Data at Brocade headquarters
from the various software engineers, database managers, and
analytics professors on the committee,” Miles said.
The first question was, “Why is this at a network hardware
company?”
That's exactly the paradigm that needed to be broken, Miles declared.
“Over the past year, Brocade has operated as a central
hub, bringing together software, database, server, academia,
standards bodies, and enterprise companies, forming strong
partnerships, and exposing the very problems that needed to
be overcome in order to turn IT from a back office into a busi-
ness community.”
Now, Brocade hardware engineers talk about finding value
in large data sets, application designers talk about hashing
algorithms to improve cluster performance, and everyone
contributes to the standards and benchmarks that will soon be
announced for Big Data.
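As one small illustration of the kind of hashing conversation mentioned above, here is a sketch of hash-based key placement across cluster nodes; the node names and key format are hypothetical.

```python
import hashlib

# Hashing record keys to spread them evenly and deterministically across nodes.
NODES = ["node-0", "node-1", "node-2", "node-3"]

def assign_node(record_key: str) -> str:
    digest = hashlib.md5(record_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]   # stable, evenly spread assignment

for key in ("sensor-17", "sensor-18", "sensor-19"):
    print(key, "->", assign_node(key))
```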
“Now, our new culture around Big Data strives to under-
stand what value each player can bring to a solution, optimize
it, integrate it, and support it as if it was part of their widget,”
Miles said. “This is the ideal culture that drives standardization,
adoption, and trust that will overcome the barriers of security
concerns and data migration.”
Pilots Please
"Just as important in this process is building a small scale prototype. Use this to prove that you can meet objectives and deliver results, then scale up to full production size," Miles counseled.
Brocade matches the best practice of starting small and
scaling up with unique Ethernet fabric products that take the
risk out of unpredictable performance at scale.
“As far as applications themselves, we are seeing a lot of
ideas around the graphing functions, complex embedded SQL
queries, and very creative data visualization techniques on mo-
bile devices. The industry is still getting its legs underneath it.”
Miles said there hasn’t been a “killer app” produced as of
yet and “until we know enough about the capabilities that data
presents, we won’t know where the curve stops and starts to
turn downward.”
Pearson and Miles both agree that the first step in Big Data
is to identify and articulate the problems you are attempting to
solve and/or purpose of your Big Data or Analytic Initiatives.
Step 2 is to get involved and join or participate in collab-
orative industry and academic forums such as the San Diego
Supercomputing Center sponsored Workshop for Big Data
Benchmarking (WBDB) or Center for Large Scale Data Systems
Research (CLDS).
"One thing to keep in mind is that this Big Data integration process is an iterative one. It's important not to be too rigid in the development of your program," noted Miles.
“Big Data is only limited by an organization’s ambition.” n
OTFL SME One-On-One
Shawn Smucker, District Manager, Federal Systems Engineering, CommVault
Understand, Then Create
Managing the storage costs for Big Data is a big deal. Organizations are faced with not only an explosive growth of data, but the time value of data (such as machine generated data) is increasingly contracting.
“Storage vendors typically provide capacity based reporting
that show how much space is being consumed on an array, a
drive, or even an individual platter,” Shawn Smucker, District
Manager, Federal Systems Engineering told OTFL in a recent
interview.
“However this level of reporting does not provide insight
into the data that is using that capacity. A storage administra-
tor may know 90% of an array’s capacity is being used, but they
cannot determine who owns those files, when the files were cre-
ated, when they were last accessed, or even if any of the files
are a prohibited type.”
Understand, Then Create
CommVault Simpana Primary Storage Reporting and File System Archiving helps organizations first understand what data they have, and then create intelligent policies to migrate the data to the most cost effective storage tier.
It provides granular, host based analysis of storage utilization, so storage and application administrators can quickly determine the number of files, file ownership, create and access times, as well as duplicate files for a given system.
“Once the administrators have an understanding of all of
the data consuming storage resources, they can create intel-
ligent, automated policies that ensure data only resides in cost-
effective storage tiers,” Smucker said.
“No longer do they find themselves simply reacting to stor-
age shortages. Now they can strategically plan their storage
investments and manage the migration policies to meet those
plans.”
CommVault’s File System Archiving manages the automated
migration across storage tiers.
“As soon as data is created, sent, or received CommVault
will begin monitoring the data and, when the defined criteria
are met, the data will be migrated to the next tier of storage,”
he explained.
CommVault’s heterogeneous storage support and native
cloud storage integration allows the data to migrate across as
many tiers of storage or in as many different locations as an
organization requires.
File System Archiving can also be leveraged to manage pro-
hibited file types on primary storage. “If for instance MP3 files
are not allowed, the archiving agent can delete or move those
as soon as they are discovered on the system,” Smucker said.
“Taken together, Primary Storage Reporting and File Sys-
tem Archiving allow organizations to fully understand their
data, plan the full life cycle of the data in their environment, and
make intelligent storage investment decisions. This leads to
direct cost savings in storage infrastructure and management.”
Reuse and Recoup
Federal IT organizations purchased nearly $1 billion of new electronic data storage in FY 2011. Before buying new storage, Smucker urges them to reuse or recoup capacity on their existing storage.
“CommVault’s combination of data life cycle management,
de-duplication and heterogeneous storage support provides IT
organizations the means to better manage their existing invest-
ments,” he said.
“Migrating data from the expensive tier 1 storage would free
capacity and allow for more strategic investments in produc-
tion storage. De-duplication of data managed by CommVault
ensures only unique data is being stored on tier 2 and tier 3
storage.
“Heterogeneous storage support allows organizations to
make the most cost effective investments across all tiers of
storage.”
One future aspect of managing Big Data will be deciding what we retain and how long we retain it. Not all data is created equally, but most organizations treat it all the same, keeping it for the same amount of time regardless of who created it or what type it is.
“Object based retention will change the way IT organizations
manage and retain data. Characteristics of individual data ob-
jects will determine where and for how long data is retained,”
Smucker said.
"For example, HR documents will be retained for years while system generated logs will be retained for days. The key to making object based retention manageable is eliminating the need to know where the data originated from."
Smucker explained an administrator will need to know the
characteristics of the data they need to manage, create a policy
around those characteristics, and then apply the policy to all
data being managed.
"The policies would span hosts and storage without the administrator needing to apply the policies on a granular, system by system basis." n
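Here is a vendor-neutral sketch of the object-based retention idea Smucker describes; the policy names, attributes, and default are hypothetical and not CommVault's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical object-based retention: policies match on data characteristics,
# not on which host or system the data came from.
@dataclass
class RetentionPolicy:
    name: str
    matches: Callable[[dict], bool]   # predicate over an object's characteristics
    retention_days: int

POLICIES = [
    RetentionPolicy("hr-documents", lambda obj: obj.get("category") == "hr", 365 * 7),
    RetentionPolicy("system-logs", lambda obj: obj.get("category") == "log", 30),
]

def retention_for(obj: dict) -> int:
    """Apply the first matching policy to any managed object."""
    for policy in POLICIES:
        if policy.matches(obj):
            return policy.retention_days
    return 90  # hypothetical default retention

print(retention_for({"category": "log", "host": "web01"}))  # 30
```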
OTFL SME One-On-One
Audie Hittle, CTO Federal, EMC Isilon Storage Division
Agencies are being driven to transform their data storage and
information management operations to achieve levels of effi-
ciency and effectiveness previously not thought possible.
How is this being achieved?
Audie Hittle, CTO Federal, told OTFL that "increasingly it's
through the analysis of Big Data enabling insights and decision
support only recently made possible through technological ad-
vances like Isilon’s OneFS.”
As CTO Federal, Hittle embraces a mission focused on the transformations underway from the introduction of Big Data. He observes market trends, translates user requirements and needs into solutions or future product roadmap considerations, and translates technical jargon into capabilities customers can appreciate.
"With the EMC Isilon portfolio of industry-leading scale-out network attached storage (NAS) solutions, everyone involved, from the IT professional through end users, has more time to think and focus on how to achieve their organization's transformational goals," Hittle observed.
The Information Culture
"I believe we are at that stage of our 'information culture' which is primed and ready for the type of transformation the EMC Isilon scale-out data storage solution can deliver. While it innovatively addresses other issues, such as security, data migration and standardization, the way it solves the culture challenge is transformational."
The intelligence and automation in the Isilon operating system, OneFS — which features cluster Auto-balancing, SyncIQ for remote synchronization, SmartPools next-generation tiering, and SmartLock for data protection and retention — are fundamentally changing the way data storage is managed.
“While I’ve heard the Isilon transformation referred to as
‘giving IT professionals their weekends back’, it goes way be-
yond that,” Hittle explained.
“It enables the team to focus more of their time and energy
on higher priority, more challenging and gratifying efforts
contributing directly to more cost efficient and effective op-
erations.”
A program, group or company implementing the EMC Isilon
data storage solution can achieve dramatic operational and cul-
tural changes. How dramatic?
"After implementing an Isilon solution, one government organization recently reported a decrease from 10-12 dedicated full-time data storage professionals to one or two full-time equivalent (FTE) team members, an 80-90 percent reduction," said Hittle.
This enabled the reallocation of those personnel to higher priority missions and provided the flexibility to absorb pending federal budget cuts.
Reliability and Availability
Hittle recently met with an industry systems integration team that was considering options for a Big Data project in the Intelligence, Surveillance and Reconnaissance (ISR) market.
“Like many organizations they were interested in the latest
market trends and technologies,” he recounted. “But interest-
ingly they were more concerned about the reliability and avail-
ability of their planned data storage project.”
Hittle advises government buyers to ask about system reliability and availability, "because it doesn't matter how good a 'deal' you got on it (read: how little you paid) if it doesn't work right after the first week, month or year on the job."
The #1 concern of smart federal buyers today is "efficiencies" — translation: the bottom line.
“Fortunately, more buyers are asking about the Total Cost of
Ownership or TCO for a Big Data solution,” he said.
“This includes everything from initial procurement costs to
energy to full life-cycle operations and management costs. So,
perhaps a second important question I would recommend is for
government buyers to look under the covers at the TCO — not
just the up-front, lowest cost per terabyte.”
When asked about the future of Big Data, Hittle observed that "the question isn't where the Big Data opportunities are, but rather where are there not Big Data opportunities?"
“The insights and enhanced decision support generated
from the access to and analysis of Big Data are relevant to al-
most every market and program,” he noted.
“Just as the ‘open’ movement, mobility, social media, and Big
Data have captured our collective imagination and mind share,
I think the next big thing to happen with Big Data will be the
innovation that occurs.”
"That innovation will stem from the application of technology, like EMC Isilon's scale-out data storage and information management solutions," Hittle explained. n
Driving Organizational Transformation
OTFL SME One-On-One
Diana Zavala, Director, Information Management and Analytics, HP Enterprise Services, U.S. Public Sector
Enabling Meaning Based Computing
How do you harness the volumes of different data at the speed required? How do you manage and derive insights from the "human" information, which is a complex task?
“We’ve moved beyond just the structured data. You must
tackle unstructured data to get the whole picture,” Diana Zavala,
Director, Information Management and Analytics told OTFL.
HP is uniquely positioned to help bridge the gap between the challenges of Big Data (Velocity, Volume, Variety, and Time to Value) and the needs of agencies to support their mission, said Zavala.
HP’s Next Generation Information Platform is anchored by
HP’s Autonomy Intelligent Data Operating Layer (IDOL) engine.
Autonomy enables Meaning Based Computing (MBC), which
forms an understanding of information and recognizes the re-
lationships that exist within it.
“IDOL can derive meaning from complex data sources, such
as e-mail, video, audio, social media, blogs, and streaming me-
dia, as well as traditional sources, to provide valuable insights,”
Zavala explained.
“This is a unique capability that stretches beyond keyword
matching, or ‘crowd-sourcing’ of topics, and other traditional
search algorithms.”
Autonomy technology was used during the 2012 Summer
Olympics to monitor social media and determine if there was
social activity regarding organized protests that would disrupt
the games or pose a potential security risk to participants or
spectators.
“Consider the benefit of this ability in a situation where you
want to gauge the effectiveness of or reaction to government
programs or legislation,” she said. “By creating ‘social intelli-
gence’, the government can adapt public policies in response
to citizen feedback.”
IDOL is at the core of the HP FUSION solution for situational
awareness. FUSION provides analytic capabilities to automate
information flows and the ability to interrogate data elements
to discover patterns and anomalies and bring this information
forward for easy consumption by decision makers.
“The goal is to exploit information when, where, and how it’s
needed in a way that makes it relevant and has a positive im-
pact,” Zavala explained.
Finding Insights
Another key component of the HP Next Generation Information Platform is Vertica, a massively parallel database, as well as an extensible analytics framework, optimized for real-time analysis.
“Vertica finds insights hidden in massive amounts of data
and has been used for fraud detection by uncovering patterns
of misuse,” Zavala said.
Further, HP's end-to-end security service provides a 360-degree view of the organization to protect data and covers the range of security technology, security consulting services, and managed security services.
“Today, we protect the U.S. Navy and Marine Corps from
multiple intrusions every month and we perform constant se-
curity research in order to discover more vulnerabilities than
anyone else in the world,” noted Zavala.
Pointed Questions
Decision makers seeking to optimize the information needed to support their objectives would be wise to ask the following questions, she said:
• Do I have access to 100% of my structured and unstruc-
tured data? And, can it be readily queried and analyzed?
• Can my information partner provide a complete, trusted
end-to-end solution for my Big Data problems?
Zavala counsels managers to consider buying decisions
within the bigger picture of how to optimize information across
their organization.
"Are you buying solutions to support an enterprise information strategy that enables an 'agile' approach to information? The focus should be on building a consistent information structure that provides value ranging from information security and compliance to analytics and agility."
Zavala says agencies must evolve their approach to more efficiently extract value from their data, because increased velocity and volume of data will drive new mission requirements and opportunities.
“Consider at least three intersecting areas: data volume,
speed, and cost. How do you filter to find the most relevant
information? New ways of analyzing the explosion of data are
needed to meet the expectation that decisions will be made in
more real time at the moment of risk.”
Moving towards information optimization requires an ability to apply science to business and a comprehensive information optimization strategy and technology, Zavala said. n
OTFL SME One-On-One
Todd Goldman, General Manager for Data Integration, Informatica
Harnessing Hadoop Power
Todd Goldman is responsible for Informatica's data integration and data quality product lines, which include the firm's Big Data initiative and in turn what's going on with Hadoop.
According to PC Magazine Hadoop is an “open source proj-
ect from the Apache Software Foundation that provides a soft-
ware framework for distributing and running applications on
clusters of servers. Inspired by Google’s MapReduce program-
ming model as well as its file system, Hadoop was originally
written for the Nutch search engine project.”
Goldman explained that Hadoop can take raw data (he called it dirty) in a variety of structured and unstructured formats and convert it into finished goods; that is, information that can be used to make decisions.
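For readers new to the MapReduce model that Hadoop implements, a minimal word-count job written for Hadoop Streaming gives a feel for how that raw-data-to-finished-goods work is split across a cluster. This is an illustrative sketch only, not Informatica’s tooling; the script names and data paths are hypothetical.

```python
#!/usr/bin/env python
# mapper.py -- reads raw text from stdin, emits "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop sorts mapper output by key, so all counts for the
# same word arrive together and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

The same pair of scripts can be tested locally with a pipeline such as `cat sample.txt | python mapper.py | sort | python reducer.py`, and on a cluster would typically be submitted through the Hadoop Streaming jar with the mapper, reducer, input and output paths supplied as arguments.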
“What has happened is technologies like Hadoop provide
a different fabric on which to do this data integration,” Gold-
man told OTFL, explaining that Hadoop is in the ‘early adopter’
phase.
Goldman says the next step for Hadoop is to move to the
‘early majority’ phase where people buy Hadoop because they
have seen successful deployments.
“The challenge is to effectively use Hadoop. For that you need people who are specialists in Hadoop, and these data scientists who can manage a 100-node Hadoop cluster are not easy to find,” explained Goldman.
PowerCenter Developer = Hadoop Developer
“Our customers use PowerCenter, which is a developer interface and engine that is used to extract, transform and load (ETL) data,” said Goldman. “Now we are giving them another engine that scales to bigger data and adds real-time data characteristics to it.”
Goldman said if you use PowerCenter as your factory for
converting data into information, you can now use PowerCenter
on top of Hadoop.
“The same skills you spent 10 years learning are still good,”
he said. “If you are a PowerCenter developer, presto-chango
you are now a Hadoop developer with no training.”
Goldman said customers tell him the effort to extract data is
roughly 80% of any kind of data analysis project. The last 20%
is done by the domain expert.
As an example, Goldman said to imagine two garages that
have tools, garden equipment and a lawn mower, but they are
not organized in the same way.
“Multiply that by a million for data when combining them
from different sources,” he said. “First you have to understand
what is in each source, how each is organized; then you have to
extract data from the source; then clean it up and then make it
consistent and combine it.”
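To make the garage analogy concrete, here is a minimal, hand-rolled sketch of those steps: understand how each source is organized, extract its records, clean them up, make them consistent, and combine them. The source layouts and field names are invented for illustration; this is not Informatica’s PowerCenter interface, which is meant to generate this kind of logic for you.

```python
# Two hypothetical inventory "garages", each organized differently.
source_a = [{"item": "Lawn Mower", "loc": "Garage 1"},
            {"item": "Hedge Trimmer", "loc": "Garage 1"}]
source_b = [("GARAGE 2", "LAWN MOWER"), ("GARAGE 2", "rake")]

def clean(name):
    # Normalize naming so records from different sources become comparable.
    return name.strip().lower()

# Extract from each source, map both layouts onto one consistent record
# shape, then combine the results into a single collection.
combined = (
    [{"location": clean(r["loc"]), "item": clean(r["item"])} for r in source_a] +
    [{"location": clean(loc), "item": clean(item)} for loc, item in source_b]
)

for record in combined:
    print(record)
```

Multiplied across millions of records and many sources, it is this kind of preparation work that Goldman says consumes roughly 80% of a data analysis project.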
Extracting data is what Informatica has been doing for years, but not on Hadoop, said Goldman, noting customers can use the same GUI with Hadoop that they were using with PowerCenter.
Hadoop Power
In the Hadoop world you can write code manually as you would program any application, declared Goldman, who said that Informatica is creating tool sets that are democratizing the use of Hadoop “so more people can do more work more quickly at a higher level of quality than they can do by hand.”
“We are creating a set of tools that make it easier to use Hadoop and automate processes so they don’t have to do 80% of the background work. This gives them the information in a format they can use for whatever their endeavor might be. We won’t need more Hadoop experts, but will give people the tools and training so they can use it.”
Goldman predicts in the next two years “you are going to
come to the time when you imagine the problem you want to
solve and you won’t need a team of Hadoop experts to solve it.”
Tools make the progression to the ‘early majority’ phase super-fast.
“Now I get comfortable. I can iterate so much faster, 5 times
faster than by hand. I can do 5 times more projects; I can do 5
projects in the time it would take me to do one.”
As an example Goldman pointed to drug company clinical
trials. Just two short years ago a smart scientist would examine
10,000 trials with 1,000 attributes and only study 100 attributes
because it was expensive and took too much time.
“Now with Hadoop you can dump all the data, use the tools and do an analysis on all 1,000 attributes. Now I can find new insights from this old data and find maybe this drug could be used for something new.”
The result is a democratization of the use of large scale data
analysis.
“We will see breakthroughs from those who don’t know data integration, but can ask questions and leverage information in new ways we never thought possible.” n
Bearing Big Data Fruit!
The National Science Foundation is investing in projects to accelerate and use knowledge from Big Data.
In October 2012, The National Science Foundation (NSF),
with support from the National Institutes of Health (NIH),
announced nearly $15 million in new Big Data fundamental
research projects.
“These awards aim to develop new tools and methods to ex-
tract and use knowledge from collections of large data sets to
accelerate progress in science and engineering research and in-
novation,” said the NSF press release.
These grants were made in response to a joint NSF-NIH call
for proposals issued in conjunction with the March 2012 Big Data
Research and Development Initiative launch: NSF Leads Federal
Efforts in Big Data.
NSF said the aim is to “enable new types of collaborations —
multi-disciplinary teams and communities.” The awards will also
help NIH get more value from their huge biological data sets by
addressing the technological challenges for extracting impor-
tant, biomedically relevant information from large amounts of
complex data.
The eight projects announced range from scientific tech-
niques for Big Data management to new data analytic approach-
es to e-science collaboration environments with possible future
applications in a variety of fields, such as physics, economics and
medicine.
“Big Data is characterized not only by the enormous volume
or the velocity of its generation, but also by the heterogeneity, di-
versity, and complexity of the data,” said Dr. Suzi Iacono, co-chair
of the interagency Big Data Senior Steering Group, a part of the
Networking and Information Technology Research and Develop-
ment program and senior science advisor at NSF.
“There are enormous opportunities to extract knowledge
from these large-scale, diverse data sets, and to provide powerful
new approaches to drive discovery and decision-making, and to
make increasingly accurate predictions. We’re excited about the
awards we are making today and to see what the idea generation
competition will yield.”
Big Data Awards
DCM: Collaborative Research: Eliminating the Data Ingestion Bottleneck in Big-Data Applications
Awarded to: Rutgers University, Stony Brook University
Big Data practice suggests that there is a tradeoff between
the speed of data ingestion, the ability to answer queries quickly
(e.g., via indexing), and the freshness of data. This tradeoff has
manifestations in the design of all types of storage systems. In
this project the principal investigators show that this is not a fun-
damental tradeoff, but rather a tradeoff imposed by the choice of
data structure. They depart from the use of traditional indexing
methodologies to build storage systems that maintain indexing
200 times faster in databases with billions of entries.
ESCE: DCM: Collaborative Research: DataBridge — A Sociometric System for Long-Tail Science Data Collections
Awarded to: University of North Carolina at Chapel Hill, Harvard University, North Carolina Agricultural and Technical State University
The sheer volume and diversity of data present a new set of
challenges in locating all of the data relevant to a particular line of
scientific research. Taking full advantage of the unique data in the
“long-tail of science” requires new tools specifically created to
assist scientists in their search for relevant data sets. DataBridge
supports advances in science and engineering by directly en-
abling and improving discovery of relevant scientific data across
large, distributed and diverse collections using socio-metric net-
works. The system will also provide an easy means of publishing
data through the DataBridge, and incentivize data producers to
do so by enhancing collaboration and data-oriented networking.
DCM: A Formal Foundation for Big Data Management
Awarded to: University of Washington
This project explores the foundations of Big Data manage-
ment with the ultimate goal of significantly improving the pro-
ductivity in Big Data analytics by accelerating data exploration.
It will develop open source software to express and optimize ad
hoc data analytics. The results of this project will make it easier
for domain experts to conduct complex data analysis on Big Data
and on large computer clusters.
DA: Analytical Approaches to Massive Data Computation with Applications to Genomics
Awarded to: Brown University
The goal of this project is to design and test mathematically
well-founded algorithmic and statistical techniques for analyzing
large-scale, heterogeneous and so-called noisy data. This project
is motivated by the challenges in analyzing molecular biology
data. The work will be tested on extensive cancer genome data,
contributing to better health and new health information tech-
nologies, areas of national priority.
DA: Distribution-based Machine Learning for High-dimensional Datasets
Awarded to: Carnegie Mellon University
The project aims to develop new statistical and algorithmic
approaches to natural generalizations of a class of standard ma-
chine learning problems. The resulting novel machine learning
approaches are expected to benefit other scientific fields in which
data points can be naturally modeled by sets of distributions,
such as physics, psychology, economics, epidemiology, medicine
and social network analysis.
DA: Collaborative Research: Genomes Galore — Core Techniques, Libraries, and Domain Specific Languages for High-Throughput DNA Sequencing
Awarded to: Iowa State University, Stanford University, Virginia Tech
The goal of the project is to develop core techniques and
software libraries to enable scalable, efficient, high-performance
computing solutions for high-throughput DNA sequencing, also
known as next-generation sequencing. The research will be con-
ducted in the context of challenging problems in human genetics
and metagenomics, in collaboration with domain specialists.
DA: Collaborative Research: Big Tensor Mining: Theory, Scalable Algorithms and Applications
Awarded to: Carnegie Mellon University, University of Minnesota Twin Cities
The objective of this project is to develop theory and algo-
rithms to tackle the complexity of language processing, and to
develop methods that approximate how the human brain works
in processing language. The research also promises better al-
gorithms for search engines, new approaches to understanding
brain activity, and better recommendation systems for retailers.
ESCE: Collaborative Research: Discovery and Social Analytics for Large-Scale Scientific Literature
Awarded to: Rutgers University, Cornell University, Princeton University
This project will focus on the problem of bringing massive
amounts of data down to the human scale by investigating the
individual and social patterns that relate to how text repositories
are actually accessed and used. It will improve the accuracy and
relevance of complex scientific literature searches. n
More NSF Big Data Explorations!
NSF is currently leading a variety of programs in the Big Data space. Click here for a complete list.
Core Techniques and Technologies for Advancing Big Data Science & Engineering (Big Data)
To advance the core scientific and technological means of
managing, analyzing, visualizing and extracting useful infor-
mation from large, diverse, distributed and heterogeneous
data sets.
Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21)
Program develops, consolidates, coordinates, and lever-
ages a set of advanced cyberinfrastructure programs and
efforts across NSF to create meaningful cyberinfrastructure,
as well as develop a level of integration and interoperability of
data and tools to support science and education.
CIF21 Track for IGERT
Establishes a new CIF21 track as part of NSF’s Integrative Gradu-
ate Education and Research Traineeship (IGERT) program to
educate and support a new generation of researchers able to
address fundamental Big Data challenges across disciplines.
Data Citation
Program provides transparency and increased opportunities for the use and analysis of data sets; it was encouraged in a
dear colleague letter initiated by NSF’s Geosciences director-
ate, demonstrating NSF’s commitment to responsible steward-
ship and sustainability of data resulting from federally funded
research.
Digging into Data Challenge
Explores how Big Data changes the research landscape for the hu-
manities and social sciences, in which new, computationally-
based research methods are needed to search, analyze, and
understand massive databases of materials such as digitized
books and newspapers, and transactional data from web
searches, sensors and cell phone records.
EarthCube
Program supports the development of community-guided
cyberinfrastructure to integrate data into a framework that will
expedite the delivery of geoscience knowledge.
The Open Science Grid (OSG)
This enables over 8,000 scientists worldwide to collabo-
rate on discoveries, including the search for the Higgs boson.
High-speed networks distribute over 15 petabytes of data
each year in real-time from the Large Hadron Collider (LHC)
at CERN in Switzerland to more than 100 computing facilities.
Partnerships provide the advanced fabric of services for data
transfer and analysis, job specification and execution, security
and administration, shared across disciplines including phys-
ics, biology, nanotechnology, and astrophysics.
Source: NSF
Image courtesy: National Science Foundation: Carrasco-Gonzalez et al., Curran et al., Bill Saxton, NRAO/AUI/NSF, NASA
By Jeff Erlichman, Editor, On The FrontLines
It Occurs To Me – Take Two
Calling All GWACs!
Eventually you have to buy Big Data. GWACs can provide the procurement flexibility you need.
Right now the National Science Foundation (NSF) is fund-
ing foundational research in Big Data through grants,
competitions and prizes. They have $200 million at their
disposal. Their goal is open-ended research that will allow scien-
tists to ask and answer questions that are still
in their imaginations.
Right now data scientists at NOAA, NASA, the National Archives and Records Administration, DHS as well as DOD are already well into doing the hard work needed to turn their Big Data to knowledge to action.
These organizations have already bud-
geted for Big Data.
But what about a program manager
at Transportation or Commerce who
sees the benefits of Big Data? Do they
have the funding? And if they do, what
procurement vehicles can they use to
actually buy Big Data capabilities and
services?
Eventually, there has to be a State-
ment of Work (SOW). Eventually a
contracting officer has to get involved.
If Big Data is to spread, if data is truly
to be democratized, agencies are going
to find ways to fund these efforts. And
then buy what they need.
Because my mind focuses on these
practical concerns and TechAmerica
included procurement as part of its
Demystifying Big Data report, I was
interested to discuss the issue in my
recent interview with TechAmerica’s
Chris Wilson.
Removing Barriers
First of all, Wilson said, there is no
real need for new procurement vehicles
specifically for Big Data.
“We haven’t asked for a specific Big
Data vehicle; we don’t really need a new
one,” Wilson explained.
In fact the TechAmerica report states that industry already
has “to participate through numerous, duplicative contracting
channels to maximize the ability to win government business.”
Further, the report cites a Bloomberg study saying the number of multiple-award contracts (MACs) has more than doubled since 2007, inflating costs without
adding value. This could actually hamper
Big Data efforts.
Wilson said instead of new MACs, exist-
ing contracts — especially the GWACs —
need to be cognizant that with Big Data,
you can’t put specifics in an RFP. The
SOW needs to be flexible.
“It’s an inherent Big Data issue; you are not going to prove a thesis,” he noted. A CIO may sense that there is more that can be done with the data they have, but they need the latitude to ask “what questions are out there that I need to discover and do a deeper dive?”
The idea is “to create pilot pro-
grams. The SOW is not as specific as
you would want, but well enough to
get the thing going.”
GWACs Are In Position
In its report, TechAmerica says the
government should avoid new contracts
and promote the use of existing GWACs
and the Federal Supply Schedules.
“Channeling Big Data solutions
through the minimum number of con-
tract vehicles necessary would allow
maximum integration, strategic sourc-
ing, governance, and standardization.”
Because of the nature of Big Data, Wilson urged contract vehicles “to make them sales friendly to acquisitions that may not have a clearly defined end game.”
Are you listening, NITAAC, SEWP and Alliant? n