National Library of Medicine Data Science Coordinating Unit Workforce Excellence Team Report to the NLM Director The State of Data Science Workforce Development January 8, 2018


National Library of Medicine

Data Science Coordinating Unit Workforce Excellence Team

Report to the NLM Director The State of Data Science Workforce Development

January 8, 2018


Data Science Coordinating Unit Report to the NLM Director


Data Science Coordinating Unit, Workforce Excellence Team

Lisa Federer, MLIS, Data Science Training Coordinator
Maryam Zaringhalam, PhD, AAAS Science and Technology Policy Fellow
Michael F. Huerta, PhD, Associate Director of NLM for Program Development and NLM Coordinator of Data Science and Open Science Initiatives

Acknowledgements

The authors gratefully acknowledge the following extramural staff who generously provided their input and expertise in interviews:

National Cancer Institute: Elizabeth Hsu, Ming Lei, Jonathan Wiest
National Heart, Lung, and Blood Institute: Giuseppe Pintucci, Jane Scott
National Human Genome Research Institute: Tina Gatlin, Bettie Graham
National Institute on Aging: Robin Barr
National Institute of Allergy and Infectious Diseases: Shawn Gaillard, Diana Lawrence, Rosemary McKaig
National Institute of Environmental Health Sciences: Jennifer Collins, Carol Shreffler
National Institute of General Medical Sciences: Susan Gregurick, Shiva Singh
National Institute of Mental Health: Nancy Desmond, Jamie Driscoll, Nicole North, Ashlee Van’t Veer
National Institute of Neurological Disorders and Stroke: Stephen Korn, Letitia Weigand
National Library of Medicine: Valerie Florance

The authors also thank Jim Corrigan (NCI) and Jennifer Sutton (OER) for their advice and guidance, and Chris Belter (NIH Library), Ben Busby (NLM), Doug Joubert (NIH Library), Alicia Livinski (NIH Library), Anand Merchant (NCI), Krisztina Miner (FAES), Adam Thomas (NIMH), and Burke Squires (NIAID) for providing statistics on intramural training activities.


TABLE OF CONTENTS

Executive summary
    Working definition of data science
    Approach and scope
    Key findings
    Summary of recommendations

1 Background and previous reports
    1.1 2011 report on F and T data science funding
    1.2 NIH ACD Data and Informatics Working Group Report
    1.3 NIH ACD Biomedical Research Workforce Working Group Report
    1.4 NLM RFI on Educational Resources
    1.5 NIH ACD NLM Working Group Report
    1.6 NLM Strategic Planning Process
    1.7 NLM RFI on Next-Generation Data Science Challenges
    1.8 NASEM Envisioning the Data Science Discipline: The Undergraduate Perspective: Interim Report

2 The state of extramural data science workforce development
    2.1 Trends in NIH-Supported Doctoral Fields of Study
    2.2 Update to the 2011 report on data science funding
    2.3 Data science-related Funding Opportunity Announcements
    2.4 Analyzing data science Training, Fellowship, and Career Development Awards
        2.4.1 Training (T) awards
        2.4.2 Fellowship (F) awards
        2.4.3 Career Development (K) awards
        2.4.4 Overall trends in data science training funding, FY2010-2017
    2.5 BD2K Awards
        2.5.1 BD2K Investments in Data Science Training
        2.5.2 BD2K Training Coordination Center
    2.6 Qualitative analysis of data science workforce development
        2.6.1 Data science remains a nebulous term
        2.6.2 Data science training is relevant to the broader biomedical community
        2.6.3 Data science transcends IC-specific domains
        2.6.4 Data science is an interdisciplinary team science
    2.7 Conclusions

3 The state of data science training for NIH staff
    3.1 On-campus data science instruction
        3.1.1 Center for Information Technology
        3.1.2 Foundation for Advanced Education in the Sciences
        3.1.3 NCI Bioinformatics Training and Education Program
        3.1.4 NIH Library
        3.1.5 NIMH Data Science and Sharing Team
    3.2 NIH Data Science SIG
        3.2.1 SIG Events
        3.2.2 Data Science Mentoring
    3.3 Conclusions

4 Recommendations

5 Appendices
    Appendix A – Qualitative Interview Guide
    Appendix B – R Code Used for Text Mining and Topic Mapping Analysis
    Appendix C – Previous Report to the NLM Director


Executive summary

Data science holds the promise of transforming and advancing biomedical research by providing new ways to analyze, visualize, understand, and gain insight from large, complex sets of genomic, connectomic, image, health record, behavioral, and other kinds of data. Such transformation requires training aimed at producing three levels of expertise in, and understanding of, data science. These range from training those who will shape the leading edge of data science in the context of the NIH mission, to those who will implement and tailor solutions to address particular scientific challenges, to those who deeply understand the principles of data science and will add value to research investment by assuring good data management practices across the data life cycle.

This analysis was undertaken to better understand training activities across NIH, past and present, that are pertinent to the emerging area of biomedical data science. This report details activities in support of data science workforce development across the NIH, within both the extramural and intramural communities, focusing on activities conducted between FY2010 and FY2017. The findings detailed in this report, in concert with findings and recommendations from previous reports from working groups of the Advisory Committee to the Director of NIH, a report from the National Academies of Sciences, Engineering, and Medicine, responses to requests for information, and the NLM strategic planning process, form the basis of several recommendations for training a diverse workforce that enables biomedical research to realize the transformative promise of data science.

Working definition of data science

Data science is a relatively new field, encompassing a variety of disciplines and methodologies. For the purposes of this report, data science is defined as the discipline that sits at the intersection of subject matter knowledge, mathematical and statistical expertise, and computer science skills. In other words, data science research develops and uses computational methods that apply statistics-based models to analyze large, complex datasets and, through visualization, inferential statistics, and other methods, help extract insight from them. Training a workforce that can apply these methods to many types of data, including genomic, connectomic, image, health record, behavioral, and other kinds, will be essential to advancing biomedical research, including current initiatives around precision medicine, cancer research, and neuroscience.

Approach and scope

NIH funds a broad range of training activities aimed at researchers across their career span. A variety of extramural training mechanisms, including Ts, Fs, and Ks, are all within the scope of this report. Additionally, R25s funded through the Big Data to Knowledge (BD2K) initiative are included. Though primarily focused on extramural activities, this report also addresses activities aimed at developing workforce capacity among intramural researchers and other NIH staff. Many of the recommendations from this report apply to data science broadly and could be implemented in either extramural or intramural settings.

To gain as complete an understanding as possible of the current state of data science training, this study used both qualitative and quantitative methods. Qualitative interviews were conducted with extramural staff from ten institutes and centers (ICs) that have been significantly involved in funding data science training. These interviews focused on the ICs’ current and planned activities to support data science workforce development, and also explored challenges and opportunities.

In addition, a review of the NIH portfolio of extramural and intramural data science training activities was performed. This review considered training programs that IC staff had specifically indicated were relevant to data science, as well as additional programs located by searching the NIH Query, View, and Report System (QVR). To ensure retrieval of all relevant projects, a targeted search strategy was developed to identify documents even if they do not include the term “data science.” The target results for the search were awards similar to those supported via BD2K, which most clearly represent what NIH implicitly considers data science. Search strategies and terms were tuned to produce such results. Statistics were also collected on training for intramural researchers at NIH, compiling data from various groups offering relevant training. These findings validate and build upon data presented in a previous report to the NLM Director, which is synthesized in this report and included in its entirety as Appendix C.
Several major reports, reviews, and strategic planning activities related to data science, training, and the future of the NLM were considered carefully to provide background and context for the study and to inform development of the recommendations of this report.
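The tuning of search strategies against BD2K-like awards described above can be pictured as scoring each candidate strategy against a reference set. The following is a minimal sketch in Python under stated assumptions, not the procedure used for this report: the award identifiers, the reference set, and the function name `score_search` are hypothetical, and QVR itself is not queried. Recall corresponds to sensitivity; precision is used here as a practical stand-in for specificity, since the full set of irrelevant awards is not enumerated.

```python
def score_search(retrieved, reference):
    """Return (recall, precision) of a search result against a reference set.

    retrieved: award IDs returned by a candidate search strategy
    reference: award IDs regarded as relevant (for example, BD2K-like awards)
    """
    retrieved, reference = set(retrieved), set(reference)
    hits = retrieved & reference
    recall = len(hits) / len(reference) if reference else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical identifiers, used only to make the example runnable.
reference_awards = {"A1", "A2", "A3", "A4"}
narrow_search = {"A1", "A2"}  # precise, but misses relevant awards
broad_search = {"A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"}  # complete, but noisy

narrow_scores = score_search(narrow_search, reference_awards)
broad_scores = score_search(broad_search, reference_awards)
print("narrow (recall, precision):", narrow_scores)
print("broad (recall, precision):", broad_scores)
```

In this toy case the narrow strategy retrieves only relevant awards but finds half of them, while the broad strategy finds all of them at the cost of unrelated results; tuning means choosing terms that move both numbers toward 1.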

Key findings

The findings in this report indicate that the NIH values the development of data science skills in the biomedical research workforce. NIH has acknowledged the importance of data science to the future of biomedical discovery and understanding and made significant investments in data science and data science-related workforce development during the eight-year period from FY2010 through FY2017. Using search terms and strategies described in detail elsewhere in this report, the analysis identified the following extramural training awards relevant to data science:

● 666 unique data science training (T) awards were funded for a total of 2,907 award years, representing 13-22% of all NIH T awards funded per year (mean = 17.5%), for a total of nearly $930 million invested across 25 ICs.

● 772 unique data science fellowship (F) awards were funded for a total of 1,597 award years, representing 5-9% of all NIH F awards funded per year (mean = 6%), for a total of over $66 million invested across 22 ICs.


● 898 unique data science career development (K) awards were funded for a total of 2,375 award years, representing 4-11% of all NIH K awards funded per year (mean = 7%), for a total of nearly $356 million invested across 22 ICs.

Across these T, F, and K mechanisms, NIH has invested about $1.35 billion in extramural data science workforce development over eight years. Adding in the $15.6 million invested in BD2K R25s and the $7.6 million for the BD2K TCC, that total rises to over $1.37 billion, or an average of nearly $172 million per year. Though funding for all mechanisms combined steadily decreased by 20% from FY2010 to FY2015, it increased by 15% between FY2015 and FY2016, reaching an overall maximum of $187.5 million in FY2017. Interviews with extramural staff who administer workforce development activities for ten different ICs similarly reflected the importance of data science expertise to the biomedical research enterprise. Data science workforce development has also been a focus of efforts on the NIH campus, with a variety of groups and organizations providing training and other activities for NIH intramural researchers and other NIH staff. Attendance and enrollment data were collected for 264 data science classes provided for NIH staff between January 2016 and December 2017, for an average of 11 classes per month. With over 6,600 attendees in this time, over 275 NIH staff attended data science classes in an average month. However, even with so many classes, demand often exceeded availability, with many classes receiving more applicants than could be accommodated.
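The totals and averages quoted above follow from simple arithmetic on the rounded figures in this summary. The short Python sketch below reproduces them; it uses only numbers stated in the text, and the variable names are illustrative. Because the inputs are rounded ("nearly", "over"), the results are approximations rather than exact program totals.

```python
# Figures quoted in the Key findings, in millions of dollars (rounded as in the text).
t_awards = 930.0   # "nearly $930 million"
f_awards = 66.0    # "over $66 million"
k_awards = 356.0   # "nearly $356 million"
bd2k_r25 = 15.6
bd2k_tcc = 7.6
fiscal_years = 8   # FY2010 through FY2017

tfk_total = t_awards + f_awards + k_awards
grand_total = tfk_total + bd2k_r25 + bd2k_tcc
per_year = grand_total / fiscal_years

# Intramural class statistics, January 2016 through December 2017.
classes = 264
attendees = 6600   # "over 6,600", so the monthly figure is a lower bound
months = 24

print(f"T + F + K total: about ${tfk_total / 1000:.2f} billion")
print(f"Including BD2K: about ${grand_total:.1f} million, or ${per_year:.1f} million per year")
print(f"Classes per month: {classes / months:.0f}")
print(f"Attendees per month: at least {attendees / months:.0f}")
```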

Summary of recommendations

Based on the findings in this study, five broad recommendations are suggested to help create a diverse workforce prepared to respond to the challenges and seize the opportunities of an increasingly open and data-intensive biomedical research enterprise. To the extent possible and appropriate, implementation of these recommendations should align with recommendations of the NIH ACD Biomedical Research Workforce Working Group Report.

Training at three levels of expertise will be needed to realize the promise of an increasingly open and digital biomedical research environment. First is training of pure data scientists who work in the context of biomedical science. These biomedical data scientists would generate next-generation analytics, novel ways to visualize and otherwise present data, new artificial intelligence approaches such as deep learning, at-scale curation solutions and provenance tracking through distributed ledger technologies, and other means of accelerating and transforming discovery and biomedical progress. The research conducted by these biomedical data scientists would include work on the methods and approaches themselves (e.g., validation, comparison), as well as work addressing biomedical research problems.

The second level of training is for expertise conferred by having biomedical scientists cross-trained in data science, and data scientists cross-trained in biomedical science. The former would be conversant in data science and its tools and would be well poised as early adopters and adaptors of the cutting-edge approaches and methods developed by the biomedical data scientists described above, providing new capabilities to analyze, visualize, and otherwise gain insight from their biomedical data. Data scientists cross-trained in biomedical science would be conversant in a defined area of biomedical subject matter and would be able to expand their research horizons into those knowledge niches. Such cross-trained data scientists could apply data science approaches to produce better analysis, visualization, and understanding of specified biomedical data, and would serve important roles as leaders in bringing biomedical digital research objects (including, but not limited to, data and software tools) in line with the FAIR principles (i.e., making such objects findable, accessible, interoperable, and reusable). This second level of expertise also includes cross-training librarians and information scientists in data science to lead activities that will grow in importance as biomedical research becomes more data-centric and open, including assuring that digital research objects are FAIR and that best practices in data management are applied throughout the research data life cycle.

The third level of training would promulgate data science literacy across the biomedical workforce and beyond. This level includes training on the nature, power, and limitations of data science, as well as on good data management practices and the importance and means of making digital research objects FAIR. Those to be trained at this level include not only biomedical scientists (ideally, all), but also NIH extramural program, review, and policy staff, as well as medical and health science librarians and other information professionals. Training in biomedical data science literacy also extends to those who are not yet in the biomedical workforce, but who might be drawn to it through such training, including undergraduates and K-12 students. Importantly, this training would best be conducted through a variety of didactic methodologies, including innovative and non-traditional modes like webinars, hackathons, and curriculum modules.
Beyond these three levels of expertise, reflected in Recommendations 2, 3, and 4, recommendations are also made to develop a common understanding of data science across NIH (Recommendation 1) and assure a coherent perspective on data science training both intramurally and extramurally (Recommendation 5).

Recommendation 1. Develop a common programmatic understanding of what constitutes biomedical data science and its practice (both of which will evolve).

    Recommendation 1a. Work across NIH toward a unified sense of biomedical data science for programmatic consistency across NIH.

    Recommendation 1b. Work across NIH to identify core competencies for biomedical data scientists.

    Recommendation 1c. Work across NIH to identify core competencies for data science literacy for all biomedical scientists.

Recommendation 2. Expand and enhance training of data science experts.

    Recommendation 2a. Expand and enhance training of pure data scientists in the context of biomedical science.

Recommendation 3. Provide training across data science, biomedical science, and information science.

    Recommendation 3a. Train data scientists in biomedical science, providing an on-ramp to extend their research horizons to biomedicine and lead efforts to make biomedical digital research objects FAIR.

    Recommendation 3b. Train biomedical scientists in data science methods and approaches, providing them new capabilities to analyze, visualize, and better understand their data.

    Recommendation 3c. Train librarians and information scientists in data science, providing them with the knowledge and tools to lead crucial activities such as assuring digital research objects abide by FAIR principles, and implementing best practices in data management, including curation and preservation.

Recommendation 4. Promote a data science-literate biomedical workforce.

    Recommendation 4a. Work across NIH to identify and use mechanisms to broadly train biomedical investigators on the nature, power, and limitations of data science, as well as on good data management practices and the importance and means of making digital research objects FAIR.

    Recommendation 4b. Work across NIH to identify and use mechanisms to broadly train NIH program, review, and policy staff on the nature, power, and limitations of data science, as well as on good data management practices and the importance and means of making digital research objects FAIR.

    Recommendation 4c. Identify and use mechanisms to broadly train information professionals on the nature, power, and limitations of data science, as well as on good data management practices and the importance and means of making digital research objects FAIR.

    Recommendation 4d. Encourage the next generation of biomedical data scientists by engaging the broader public, especially students younger than college age and populations not well represented in the current cohort of data scientists.

    Recommendation 4e. Explore non-traditional training approaches to promote data science literacy across diverse audiences, including hackathons, boot camps, Carpentry sessions, MOOCs, etc.

Recommendation 5. Promote programmatic coherence for biomedical data science training and workforce development across NIH.

    Recommendation 5a. Establish a trans-NIH committee to facilitate communication, collaboration, and coordination of extramural biomedical data science training and workforce development.

    Recommendation 5b. Establish a trans-NIH committee to facilitate communication, collaboration, and coordination of biomedical data science training and workforce development of NIH staff.


1 Background and previous reports

Several previous reports and activities have gathered information about the state of funding for, and needs in, data science and training broadly, as well as data science training specifically, at the NIH. These reports drew on various methodologies to assess data science training, including portfolio analysis, Requests for Information (RFIs), and the convening of experts from around the world in focused groups. These reports and activities have been reviewed closely for this study and are briefly described here.

1.1 2011 report on F and T data science funding

In preparation for the launch of the NIH Big Data to Knowledge (BD2K) initiative, the Office of Research Information Systems (ORIS) created a list of F (individual fellowship) and T (institutional training) awards between FY2005 and FY2011 that included the words “computational,” “biostatistics,” “informatics,” “bioinformatics,” or “pharmacoinformatics” in the title. Table 1-1 shows the distribution of awards and funding among the topics from the ORIS study of FY2005 to FY2011, representing a total investment of over $208 million over seven years. Table 1-1 includes counts for unique awards overall, and for F and T mechanisms. Many of these awards are funded for multiple years; therefore Table 1-1 includes a count for “unique awards” and for “award years” (which counts each award once for each year it is funded). Note that the sums of columns do not necessarily add up to the totals because some awards contain multiple keywords in the title (e.g., “Bioinformatics and Computational Biology Training Program”) and therefore appear in more than one topic.
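The title-keyword grouping described above, and the reason a single award can be counted under more than one topic, can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the ORIS method: the function name `topics_for_title` is hypothetical, matching is done on whole words, the first title is the example quoted in the text, and the second title is invented for the demonstration.

```python
import re

# Topic groups and keywords as listed in the text.
TOPICS = {
    "Computational": {"computational"},
    "Biostatistics": {"biostatistics"},
    "Informatics, bioinformatics, or pharmacoinformatics": {
        "informatics", "bioinformatics", "pharmacoinformatics",
    },
}


def topics_for_title(title):
    """Return the topic groups whose keywords appear as whole words in a title."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return [topic for topic, keywords in TOPICS.items() if words & keywords]


# The first title is quoted in the text; the second is hypothetical.
titles = [
    "Bioinformatics and Computational Biology Training Program",
    "Training Program in Biostatistics",
]

per_topic = {topic: 0 for topic in TOPICS}
for title in titles:
    for topic in topics_for_title(title):
        per_topic[topic] += 1

print(per_topic)
print("sum of topic counts:", sum(per_topic.values()), "| unique awards:", len(titles))
```

Two awards produce three topic counts here, because the first title matches two topic groups. The same effect explains why per-topic counts of unique awards can exceed the overall total.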

Topic                                                 Total awards        F awards            T awards            Total funding
                                                      Unique    Award     Unique    Award     Unique    Award     amount
                                                      awards    years     awards    years     awards    years
Computational                                         57        183       22        43        35        140       $32,824,202
Biostatistics                                         46        232       0         0         46        232       $53,450,861
Informatics, bioinformatics, or pharmacoinformatics   47        261       7         12        40        249       $122,167,086
Totals                                                146       676       29        55        117       621       $208,442,149

Table 1-1. Awards and funding with data science-related terms in the title, FY2005-FY2011, including F awards (F30, F31, F32, F33, F37, and F38) and T awards (T15, T32, and T90).

Figure 1-1 shows funding for each of the three topics over time.
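The internal consistency of Table 1-1 can be checked with a few lines of Python. The sketch below uses only the values printed in the table; the variable names are illustrative. It confirms that F and T counts add up within each row, and it compares column sums with the Totals row. As transcribed here, only the two unique-award columns exceed their totals (by four awards each), which is consistent with awards whose titles match more than one topic.

```python
# Rows of Table 1-1: (total unique, total award years, F unique, F award years,
#                     T unique, T award years, funding in dollars)
rows = {
    "Computational": (57, 183, 22, 43, 35, 140, 32_824_202),
    "Biostatistics": (46, 232, 0, 0, 46, 232, 53_450_861),
    "Informatics, bioinformatics, or pharmacoinformatics": (47, 261, 7, 12, 40, 249, 122_167_086),
}
totals = (146, 676, 29, 55, 117, 621, 208_442_149)

# Within each row, F and T counts should add up to the row total.
for topic, (uniq, years, f_u, f_y, t_u, t_y, _) in rows.items():
    assert uniq == f_u + t_u and years == f_y + t_y, topic

# Column sums compared with the Totals row.
column_sums = tuple(sum(col) for col in zip(*rows.values()))
labels = ("unique awards", "award years", "F unique", "F award years",
          "T unique", "T award years", "funding")
for label, col_sum, total in zip(labels, column_sums, totals):
    print(f"{label}: column sum {col_sum}, Totals row {total}, difference {col_sum - total}")
```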


Figure 1-1. Funding amount, in millions of dollars, by topic and year, for F and T data science-related training awards from FY2005 to FY2011.

Figure 1-2 shows the distribution of total investment by administering IC. NLM is highly represented in these areas, administering awards amounting to $104 million. The grants administered by NLM accounted for 27% of the total count of awards.

Figure 1-2. Funding amount, in millions of dollars, by administering IC for F and T data science-related training awards from FY2005 to FY2011.


1.2 NIH ACD Data and Informatics Working Group Report

In response to the increasing importance of digital data and associated tools and approaches to biomedical research, the Advisory Committee to the Director of NIH (ACD) was charged with forming a Data and Informatics Working Group (DIWG) in 2011. The DIWG’s report provided recommendations in four broad areas, including Recommendation 3, “Build Capacity by Training the Workforce in the Relevant Quantitative Sciences such as Bioinformatics, Biomathematics, Biostatistics, and Clinical Informatics.” Specifically, the DIWG recommended that NIH should analyze the demand for computational and quantitative experts in biomedical research and increase funding for fellowships and training programs accordingly. The report also recognized the importance of ensuring a pool of reviewers who are sufficiently knowledgeable to assess training grants that focus on quantitative methods. Finally, the report acknowledged that a basic proficiency in computational and quantitative skills is important to all scientists, even those who do not go on to become data scientists as such. Therefore, the DIWG recommended that the NIH identify core competencies in these areas, to be included in all fellowship and training grants.

1.3 NIH ACD Biomedical Research Workforce Working Group Report

In 2011, the NIH convened a working group of the Advisory Committee to the Director of NIH to develop a sustainable model for training a diverse biomedical workforce in numbers and with expertise appropriate for various biomedical research communities. While the working group’s report did not specifically address data science, its findings are applicable to data science workforce development. The final report, issued in July 2012, includes recommendations for biomedical workforce development at the graduate level and above. Of particular note, the working group recommended strengthening the role of staff scientists and differentiating them from scientists who rely on competitive grants for their employment status. The report notes that staff scientists can bring stability and continuity to the research enterprise, which is especially important for sustainable data and software management practices. Extending the group’s findings to data science, research groups will likely need technical staff, such as bachelor’s- and master’s-level data scientists, programmers, and data science librarians, who, while crucial, would not likely be competing for extramural grants.

1.4 NLM RFI on Educational Resources

In 2014, NLM issued an RFI (NOT-LM-15-001) under the auspices of BD2K, asking respondents to submit existing educational resources on data management and data science. Sixteen responses were received, providing information about 205 online and in-person courses. The analysis of the responses included a suggestion that these resources be included in the BD2K-funded ERuDIte index of educational resources, which was in development at that time.


1.5 NIH ACD NLM Working Group Report

In 2015, the Advisory Committee to the Director of NIH convened a Working Group on NLM that issued an RFI (NOT-OD-15-067) seeking information related to a strategic vision for NLM. The RFI garnered over 600 responses, which were analyzed in depth by NLM staff. Based on this input and other findings, the group’s report made several recommendations, including that NLM provide intellectual and programmatic leadership to data science for NIH (Recommendation 3) and that NLM strengthen its leadership in data science and related training. Specifically, Recommendation 4b stated that:

NLM should be the center for nurturing the core science and methodologies of biomedical informatics, data science, and library science through research and training programs; it should also nurture partnerships with other NIH programs, other Federal agencies, and outside organizations in which informatics and biostatistics are a core component.

The report further noted the importance of building upon the expertise of the workforce broadly, recommending that NLM develop programs that span from high school to post-doctoral training and ensure that a diverse workforce is developed.

1.6 NLM Strategic Planning Process

In 2016, the NLM commenced an 18-month strategic planning process, which included the issuance of an RFI (NOT-LM-17-002) seeking information about the future role of NLM around four themes, one of which was data science, open science, and biomedical informatics. More than 100 responses from a broad array of respondents were received and analyzed. The planning process also included the convening of five panels of experts from around the world who met for two days to discuss strategic directions and vision for NLM; many of those discussions centered on data science. From this process, there was a clear call to NLM to expand and enhance research training in biomedical data science, per se, as well as to assure a broad understanding and appreciation of data science in the biomedical workforce so that non-experts can leverage the power of data science to accelerate discovery and ultimately improve health.

1.7 NLM RFI on Next-Generation Data Science Challenges

NLM issued an RFI (NOT-LM-17-006) in September 2017 seeking input on data science research directions that could address key challenges in biomedical, social/behavioral, and health-related research. Workforce development and diversity was one of three main focal areas of the RFI. Fifty-four respondents spanning academia, government, industry, publishing, professional societies, and the nonprofit sector offered comment, and many of the responses converged on themes relevant to this report. Several respondents suggested specific core competencies required of a data scientist, ranging from concrete programming and computational skills to softer skills such as effective communication and collaboration practices. Respondents also noted that data science is inherently a team science, which requires incentivizing collaborative research practices and creating alternative reward structures to drive data science forward. Finally, as data science continues to grow and evolve as a field, respondents suggested creating training opportunities outside the pre- and postdoctoral levels.

1.8 NASEM Envisioning the Data Science Discipline: The Undergraduate Perspective: Interim Report

In September 2017, the National Academies of Sciences, Engineering, and Medicine published an interim report from the Committee on Envisioning the Data Science Discipline: The Undergraduate Perspective. The report describes initial observations and findings gathered from the first half of a larger study aimed at laying out a vision for the future of data science education. The report encourages the use of innovative training mechanisms, such as hybrid courses, hackathons, and modular courses, and encourages engagement with current research problems from diverse disciplines. The importance of engaging underrepresented student populations and developing programs designed to reach K-12 students is also highlighted. The Committee points to the real-world relevance of data science as a critical component in attracting a wider variety of students to the field. The report concludes with outstanding questions for public input informed by these preliminary observations and themes. The questions solicited suggestions on a range of topics, from approaches to embed broad participation, diversity, and inclusion in data science education to which data science skills and concepts to teach in different educational contexts (e.g., a two-year versus a four-year institution).

2 The state of extramural data science workforce development

The previous reports described in Section 1 provided this study with valuable background, context, and input for developing the recommendations of this report. Nevertheless, some of the analyses reported in those documents were conducted five years ago or more, prior to the funding for data science that arose from the BD2K initiative. Therefore, in this report, some of these analyses were updated and additional information was gathered to better understand the current state of data science training and identify emerging needs and potential directions for workforce development. This report combines qualitative and quantitative methods to gain a more complete understanding of current efforts and future needs.

Quantitative data were collected to characterize the types of data science training that NIH funds. A comprehensive search strategy, designed to balance sensitivity with specificity to identify programs relevant to data science, was utilized in the NIH Query, View, and Report System (QVR), and the results were analyzed. This analysis expands upon a preliminary report to the NLM Director, included here as Appendix C, which focused on a narrower time frame (FY2014-2017) and utilized only data publicly available through the NIH Research Portfolio Online Reporting Tools (RePORT). The results presented in the previous report were validated by comparing RePORT and QVR results for FY2014-2017. RePORT and QVR returned identical results; however, QVR results contain additional information that is not publicly available, allowing for a more robust analysis to be conducted here. This report also draws on additional data sources, such as the NIH Data Book.
To gather further information about ICs’ activities and plans for data science workforce development, semi-structured interviews were conducted with extramural staff from ten ICs, selected because of their involvement in previous data science funding efforts as well as the size of their training programs. The full list of interview questions is included as Appendix A. In addition to identifying relevant programs currently funded by each IC, the interviews helped provide a better understanding of how these ICs define data science and some of the challenges they face in preparing their workforce for data-intensive research. Qualitative analysis of the interview data utilized an inductive content analysis methodology, focusing on broad themes that arose as issues for multiple ICs. These themes are described in Section 2.6. Although the term “data science” began to appear in the biomedical literature as early as the mid-2000s, the term was not commonly used in NIH training programs until 2014, coinciding with the launch of BD2K (Figure 2-1). This report focuses on activities between FY2010 and FY2017 to assess the evolution and scope of extramural data science training programs. Since the term “data science” was not in common use during the entire period of study, this report also utilizes a variety of strategies to locate training programs that fall within the scope of the definition of data science as described in this report.

Figure 2-1. Number of applications with “data science” in the title, abstract, or specific aims. The dotted line indicates the launch of BD2K in FY2014.

2.1 Trends in NIH-Supported Doctoral Fields of Study

The NIH Data Book reports the fields of study for PhD recipients who have received NIH support through training awards (T15, T32, T35, T90, TL1, TU2, F30, F31, or F32). The trend in self-reported fields of study that fall under this report’s working definition of data science is shown in Figure 2-2 as a percentage of the total number of NIH-supported PhD recipients in each year between FY2010 and FY2015 (the most recent set of available data). Overall, the number of PhD students concentrating in data science-related fields of study has remained steady over these five years. Because the Data Book only reports on students who have completed the PhD, it may not accurately reflect the fields of study chosen by current PhD students. Given that BD2K awards funded a number of data science T programs, the number of PhD recipients in data science-related fields may increase starting in 2019, when the first cohort of BD2K trainees graduates.

Figure 2-2. Percentage of NIH-supported PhD recipients in data science-related fields between FY2010 and FY2015.

2.2 Update to the 2011 report on data science funding

To provide a comparison to the findings of the ORIS 2011 report described in Section 1.1 of this report, the search was run again to add awards funded between FY2012 and FY2017 to the original set of awards from FY2005-FY2011. The search was also expanded to include awards with “data science” or “big data” in the title, in addition to the original set of terms, since these terms had come into common use since the original report. Table 2-1 shows the distribution of awards and funding among the topics from FY2005 to FY2017, representing a total investment of over $406 million over 13 years. Table 2-1 includes counts for unique awards overall, and for F and T mechanisms. Many of these awards are funded for multiple years; therefore Table 2-1 includes a count for “unique awards” and for “award years” (which counts each award once for each year it is funded). Note that the sums of columns do not necessarily add up to the totals because some awards contain multiple keywords in the title (e.g., “Bioinformatics and Computational Biology Training Program”) and therefore appear in more than one topic.
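The difference between "unique awards" and "award years", and the reason per-topic counts can exceed the totals, can be illustrated with a short sketch. The award IDs, titles, funded years, and the `TOPIC_KEYWORDS` mapping below are hypothetical and are not drawn from the report's data; the sketch assumes a title is counted under every topic whose keyword it contains.

```python
from collections import defaultdict

# Hypothetical keyword lists; the report's actual title search terms may differ.
TOPIC_KEYWORDS = {
    "Computational": ["computational"],
    "Biostatistics": ["biostatistics"],
    "Informatics": ["informatics"],
    "Data Science or Big Data": ["data science", "big data"],
}

# Hypothetical awards: (award_id, title, fiscal years funded).
AWARDS = [
    ("T32-A", "Bioinformatics and Computational Biology Training Program", [2015, 2016, 2017]),
    ("T32-B", "Biostatistics Training in Cancer Research", [2016, 2017]),
    ("F31-C", "Computational Models of Protein Folding", [2017]),
]


def topics_for(title):
    """Return every topic whose keyword appears in the title (case-insensitive)."""
    lowered = title.lower()
    return [t for t, kws in TOPIC_KEYWORDS.items() if any(k in lowered for k in kws)]


def tally(awards):
    """Count unique awards and award years per topic, plus de-duplicated totals."""
    unique = defaultdict(int)
    years = defaultdict(int)
    for _award_id, title, fiscal_years in awards:
        for topic in topics_for(title):
            unique[topic] += 1
            years[topic] += len(fiscal_years)
    # Totals count each matching award once, however many topics it matches.
    total_unique = sum(1 for _, title, _ in awards if topics_for(title))
    total_years = sum(len(fy) for _, title, fy in awards if topics_for(title))
    return dict(unique), dict(years), total_unique, total_years


unique, years, total_unique, total_years = tally(AWARDS)
```

In this toy data the first award matches both "Computational" and "Informatics", so the per-topic unique counts sum to 4 although only 3 distinct awards exist.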

Table 2-1. Awards and funding containing data science-related terms in their title, FY2005-FY2017, including F (F30, F31, F32, F33, F37, and F38) and T awards (T15, T32, and T90).

Topic | Total unique awards | Total award years | F unique awards | F award years | T unique awards | T award years | Total funding amount

Computational 97 397 54 120 43 277 $65,097,208

Biostatistics 56 411 0 0 56 411 $103,636,661

Informatics, bioinformatics, or pharmacoinformatics 61 514 11 29 59 494 $227,016,096

Data Science or Big Data 19 35 0 0 19 35 $10,481,957

Totals 221 1357 65 140 156 1217 $406,231,922

Figure 2-3 shows funding for each of the four topics over time. Awards using the term “data science” or “big data” in the title were first awarded in FY2015, likely owing to the announcement of the BD2K initiative. These findings suggest that “data science,” as defined in the 2011 ORIS analysis, has represented an important target for training over the last decade, with $30.8 million invested per year on average.
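The arithmetic behind the total and the per-year average can be reproduced from the funding column of Table 2-1. The sketch below is only a check under a stated assumption: it takes a simple unweighted mean over the 13 fiscal years from FY2005 to FY2017. That mean comes to roughly $31.2 million, close to but not identical to the $30.8 million quoted above; the report does not say how its average was derived, so the gap may reflect a different basis.

```python
# Total funding by topic, copied from Table 2-1 (FY2005-FY2017).
FUNDING_BY_TOPIC = {
    "Computational": 65_097_208,
    "Biostatistics": 103_636_661,
    "Informatics, bioinformatics, or pharmacoinformatics": 227_016_096,
    "Data Science or Big Data": 10_481_957,
}

# FY2005 through FY2017, inclusive.
FISCAL_YEARS = 2017 - 2005 + 1

total = sum(FUNDING_BY_TOPIC.values())
# Assumption: a simple unweighted mean across fiscal years.
mean_per_year = total / FISCAL_YEARS

print(f"Total: ${total:,}")
print(f"Simple mean per fiscal year: ${mean_per_year / 1e6:.1f} million")
```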

Figure 2-3. Funding amount, in millions of dollars, by topic and year, for F and T data science training-related awards from FY2005 to FY2017. Note that a search over the same years using richer search terms and a wider search space indicates an increase, rather than the decrease shown here, in data science-related training awards (see Figure 2-12).

Figure 2-4 shows the distribution of total investment by administering IC. As with the previous study that ended in 2011, NLM is highly represented in these areas, administering 26% of all the awards, including 89% of the awards with “data science” or “big data” in the title, for a total of $189 million over the period FY2005-2017. As will be discussed in Section 2.5, some of these awards were funded by BD2K, with administration handled by various ICs.

Figure 2-4. Funding amount, in millions of dollars, by topic and administering IC, for F and T data science-related training awards from FY2005 to FY2017.

2.3 Data science-related Funding Opportunity Announcements

The NIH Guide to Grants and Contracts was searched to identify programs funding various activities in support of data science workforce development. Funding Opportunity Announcements (FOAs), both active and inactive, were searched for data science-related keywords (“data science,” “big data,” “computational,” “quantitative science,” or “informatics”), and results were filtered to include T, F, K, and R25 mechanisms, posted on or after January 1, 2010. As of January 1, 2018, only 10 active FOAs for these funding mechanisms include these data science-related terms. Figure 2-5 shows the number of FOAs identified in the search, by IC, status (active or inactive) and activity code. FOAs for workforce development mechanisms that fund data science-related activities are a relatively small proportion of all workforce development FOAs issued by NIH ICs, except among Ts, where 40% of all active FOAs are related to data science. About 8% of all active FOAs for F, T, K, and R25 mechanisms are related to data science. Table 2-2 shows data science-related FOAs as a percent of all F, T, K, and R25 mechanisms, active and inactive.
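The filtering logic described above (keyword match, mechanism filter, posting-date cutoff) can be sketched as follows. The FOA records, field names, and helper functions are hypothetical, and matching keywords as case-insensitive substrings of a single text field is an assumption; the report does not specify how the NIH Guide search matched terms.

```python
from datetime import date

# Keywords quoted in the text.
KEYWORDS = ("data science", "big data", "computational", "quantitative science", "informatics")
# T, F, and K mechanisms are matched by first letter; R25 is matched exactly.
MECHANISM_PREFIXES = ("T", "F", "K")
CUTOFF = date(2010, 1, 1)

# Hypothetical FOA records; field names are illustrative only.
FOAS = [
    {"id": "FOA-1", "activity_code": "T32", "posted": date(2016, 5, 1), "active": True,
     "text": "Institutional Training in Biomedical Informatics"},
    {"id": "FOA-2", "activity_code": "K01", "posted": date(2012, 3, 15), "active": False,
     "text": "Mentored Research Scientist Development Award"},
    {"id": "FOA-3", "activity_code": "R25", "posted": date(2009, 11, 30), "active": False,
     "text": "Big Data Education"},
    {"id": "FOA-4", "activity_code": "R01", "posted": date(2017, 1, 10), "active": True,
     "text": "Computational Approaches to Disease"},
    {"id": "FOA-5", "activity_code": "F31", "posted": date(2017, 6, 1), "active": True,
     "text": "Predoctoral Fellowship"},
]


def is_workforce_mechanism(activity_code):
    """True for T, F, and K activity codes and for R25."""
    return activity_code == "R25" or activity_code[:1] in MECHANISM_PREFIXES


def mentions_data_science(text):
    """True if any keyword occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(k in lowered for k in KEYWORDS)


def in_scope(foa):
    """Workforce development mechanism posted on or after the cutoff date."""
    return is_workforce_mechanism(foa["activity_code"]) and foa["posted"] >= CUTOFF


def select(foas):
    """In-scope FOAs that mention a data science keyword."""
    return [f for f in foas if in_scope(f) and mentions_data_science(f["text"])]


def percent_related(foas, active):
    """Data science-related FOAs as a percent of in-scope FOAs with the given status."""
    pool = [f for f in foas if in_scope(f) and f["active"] == active]
    hits = [f for f in pool if mentions_data_science(f["text"])]
    return 100 * len(hits) / len(pool) if pool else 0.0


selected_ids = [f["id"] for f in select(FOAS)]
```

The same `percent_related` helper, applied per mechanism, is the kind of calculation that would produce the percentages reported in Table 2-2.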

Figure 2-5. Active and inactive data science-related Funding Opportunity Announcements posted between January 2010 and December 2017, by IC.

Table 2-2. Data science-related FOAs as a percent of all active and inactive FOAs for workforce development mechanisms and overall.

Mechanism | Percent of active FOAs related to data science | Percent of inactive FOAs related to data science

All F mechanisms 9.1% 3.1%

All K mechanisms 7.1% 5.2%

All T mechanisms 40.0% 27.1%

R25 0.0% 12.1%

All workforce development mechanisms (F, K, T, and R25) 7.8% 10.3%

2.4 Analyzing data science Training, Fellowship, and Career Development Awards

While the original 2011 ORIS report and the expanded 2005-2017 data discussed above are informative, the number of awards they retrieved was limited because the search looked for keywords only in the title of the award. To gain a more complete view of all data science workforce development activities funded by the NIH, this study expands upon those findings by searching not only the title, but also the abstract and specific aims of awards. In addition, Career Development (K) awards are included to cover the range of workforce development activities across scientific career stages. Finally, the set of keywords searched was updated to reflect terms in current use to describe data science and its methodologies.

Search strategies for T, F, and K awards, described in the following sections, were designed to balance sensitivity and specificity. Given the imprecise definition of data science and the many different terms used to describe this type of work, designing a search strategy that identified all relevant awards without also including many non-related awards would be difficult. It is important to note, however, that the strategies used here were validated by comparing results to a set of known data science awards (e.g., BD2K awards, awards mentioned by IC staff during interviews). All but one of the known awards were returned using the search strategies, suggesting they are adequate to identify relevant awards.

Because the number of awards retrieved using these strategies was large, a text mining approach was utilized for analysis. To categorize the types of data science training that the NIH funded, topic modeling was used to extract key areas of data science training in each of the mechanisms (T, F, and K) using the ‘tm’ (v. 0.7-1) and ‘topicmodels’ (v. 0.2-7) packages in R (v. 3.4.2). Because not all proposals in this dataset included specific aims, analysis was restricted to project abstracts for each of the awards. To prepare the abstracts for modeling, a series of text cleaning tasks was performed, including removing stopwords from a list of common English stopwords, as well as a custom stopword list consisting of words that appeared in most of the abstracts but were not meaningful to the analysis (such as “fellow,” “postdoctoral,” “program,” “student,” and “training”).
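The stopword-removal step can be sketched as follows. This is an illustration in Python rather than the R code in Appendix B: the English stopword set is a small hypothetical subset, lowercasing and stripping punctuation are assumed cleaning steps, and only the five custom stopwords quoted in the text are included.

```python
import re

# Small illustrative subset of common English stopwords (an assumption; the
# report used a fuller standard list that is not reproduced here).
ENGLISH_STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "will", "with"}

# Custom stopwords quoted in the text.
CUSTOM_STOPWORDS = {"fellow", "postdoctoral", "program", "student", "training"}


def clean_abstract(text):
    """Lowercase the text, keep alphabetic tokens, and drop both stopword lists."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stop = ENGLISH_STOPWORDS | CUSTOM_STOPWORDS
    return [t for t in tokens if t not in stop]


# Hypothetical abstract sentence.
example = "The training program will prepare a postdoctoral fellow in genomics and machine learning."
tokens = clean_abstract(example)
# tokens is ["prepare", "genomics", "machine", "learning"]
```

Matching is exact, so inflected forms such as "students" would survive unless the list is extended or stemming is added; the report does not state which of these, if either, was done.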
For each of the three award mechanisms, topic modeling was performed on the prepared text using the latent Dirichlet allocation (LDA) model with Gibbs sampling, tested over three runs using three, five, or seven topics. For each run, the top ten keywords listed per topic were evaluated to determine whether the analysis converged on unique, discernible categories for data science training. Five was determined to be the optimal number of categories. The full R code for this analysis is available as Appendix B. With this topic modeling approach, patterns and clusters are identified among texts based on similarities in the data. Topics are not determined a priori by human analysts, but identified by the algorithm based on groupings, which can then be characterized and named by subject experts. Given that all awards broadly deal with the same topics, some similarities arose in the analysis of the various award mechanisms, but because the T, F, and K awards were analyzed in separate groups, slightly different topic categories were identified by the algorithm. For example, a category best described as “Mechanistic and Computational Biology” appeared in the F awards, while “Cancer Biology” arose as a topic in the K awards. It should be kept in mind that these categories are subcategories of data science that were defined by search strategies and validated as described above. The findings for topics for each award mechanism are described below.
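To make the procedure concrete, the following is a minimal from-scratch sketch of LDA fitted by collapsed Gibbs sampling, run with three, five, and seven topics and reporting up to ten top keywords per topic. It is not the Appendix B code and does not use the ‘topicmodels’ package; the toy corpus, the hyperparameters `alpha` and `beta`, the iteration count, and the seed are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return topic-word and doc-topic counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens in doc d assigned to topic k
    topic_word = [Counter() for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                       # total tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


def top_keywords(topic_word, n=10):
    """Return up to n highest-count words for each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Hypothetical, already-cleaned abstracts (lists of tokens).
DOCS = [
    ["genome", "sequencing", "variant", "genome", "bioinformatics"],
    ["neuron", "imaging", "behavior", "neuron", "cortex"],
    ["clinical", "trial", "patient", "outcome", "clinical"],
    ["variant", "bioinformatics", "sequencing", "genome"],
    ["cortex", "behavior", "imaging", "neuron"],
    ["patient", "outcome", "trial", "clinical"],
]

# One run per candidate number of topics, as described in the text.
results = {}
for k in (3, 5, 7):
    topic_word, doc_topic = lda_gibbs(DOCS, n_topics=k)
    results[k] = top_keywords(topic_word)
```

The keyword lists in `results` are what an analyst would then inspect by eye to judge whether a given number of topics yields distinct, nameable categories; that judgment step is manual and is not automated here.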

2.4.1 Training (T) awards

To identify key themes across NIH’s extramurally awarded institutional training grants for data science, training awards (T32s, T34s, T35s, T90s, and T15s) funded between FY2010 and FY2017 were analyzed. To locate the relevant awards, QVR was queried using the iSearch interface for the terms “data science,” “big data,” “computational,” “quantitative science,” or “informatics” in the award title, abstract, or specific aims. The search returned 666 unique awards, accounting for 2,907 award years, funded across 25 ICs.

Funding by year

Overall, funding for data science training decreased between FY2010 and FY2015, then began to increase again in FY2016 and FY2017. Figure 2-6 shows data science training awards as a percent of all T awards funded by each IC by year; the top row shows funding aggregated across all NIH ICs. Data science T awards represented 13-22% of all NIH training awards funded between FY2010 and FY2017 (mean = 17.5%), accounting for a total investment of over $930 million. For some ICs, such as NLM and NHGRI, data science training funding formed a significant portion of their T awards across the entire period. Others gradually increased or decreased their support for data science as a percent of their total training portfolio over time. Table 2-3 shows the same data as is graphically depicted in Figure 2-6.

Table 2-3. Data science-related T awards (T32s, T34s, T35s, T90s, and T15s) as a percent of total IC T awards, by year, FY 2010-2017.

2010 2011 2012 2013 2014 2015 2016 2017 Total Funding

All ICs 22.23% 19.91% 18.81% 18.04% 15.28% 13.44% 15.83% 16.17% $930,062,766

FIC 100.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% $125,000

NCATS 0.00% 0.00% 0.00% 50.00% 50.00% 50.00% 0.00% 0.00% $750,000

NCCIH 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 11.11% 0.00% $84,000

NCI 13.85% 10.71% 11.18% 10.46% 7.89% 6.54% 8.00% 9.66% $43,900,544

NCRR 6.67% 4.44% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% $1,546,414

NEI 38.64% 40.00% 37.50% 35.00% 34.15% 31.71% 32.50% 33.96% $23,026,194

NHGRI 80.00% 78.26% 76.19% 75.00% 78.57% 70.59% 89.47% 95.45% $53,427,665

NHLBI 16.72% 15.19% 14.12% 14.63% 14.48% 9.48% 10.31% 11.88% $102,000,000

NIA 11.84% 9.59% 9.72% 10.61% 10.61% 7.04% 9.59% 7.96% $13,390,319

NIAAA 6.67% 10.34% 9.09% 6.45% 5.56% 6.67% 3.45% 4.88% $4,493,839

NIAID 15.14% 15.24% 14.14% 14.61% 8.38% 5.08% 5.67% 5.98% $43,697,213

NIAMS 7.41% 9.62% 10.00% 8.00% 6.12% 6.38% 5.77% 8.00% $6,950,272

NIBIB 40.00% 37.50% 35.56% 40.48% 27.50% 34.88% 34.04% 31.91% $26,700,680

NICHD 17.22% 17.36% 16.55% 19.42% 17.69% 13.45% 13.89% 14.29% $35,175,323

NIDA 30.88% 30.88% 29.03% 30.16% 16.87% 19.05% 19.64% 16.85% $28,763,659

NIDCD 20.00% 18.42% 20.00% 15.63% 3.23% 6.06% 5.88% 4.84% $11,509,217

NIDCR 18.52% 14.81% 24.00% 20.83% 22.73% 18.18% 25.00% 10.26% $9,193,992

NIDDK 11.95% 8.84% 6.05% 6.42% 3.69% 1.83% 5.09% 5.93% $27,894,128

NIEHS 32.69% 32.69% 27.45% 17.39% 17.02% 13.04% 11.11% 23.29% $31,823,866

NIGMS 34.10% 30.73% 28.50% 24.54% 21.84% 19.65% 22.14% 21.37% $296,000,000

NIMH 18.75% 16.42% 16.00% 16.53% 14.81% 15.00% 18.81% 19.01% $32,201,789

NINDS 23.81% 28.57% 25.81% 22.83% 22.37% 15.07% 18.18% 20.00% $26,638,914

NINR 11.54% 8.70% 8.00% 7.69% 12.50% 9.09% 22.73% 22.86% $4,600,835

NLM 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% $93,834,532

OD 40.00% 0.00% 4.17% 7.41% 6.12% 14.58% 35.09% 29.63% $12,334,371

Figure 2-6. Data science-related T awards (T32s, T34s, T35s, T90s, and T15s) as a percent of total IC T awards, by year, FY 2010-2017.
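The summary figures quoted for T awards (a range of 13-22% and a mean of 17.5%) can be checked against the "All ICs" row of Table 2-3. The sketch below assumes the reported mean is a simple unweighted average of the eight yearly percentages; the variable names are illustrative.

```python
from statistics import mean

# "All ICs" row of Table 2-3: data science T awards as a percent of all T awards.
ALL_ICS_T_PERCENT = {
    2010: 22.23, 2011: 19.91, 2012: 18.81, 2013: 18.04,
    2014: 15.28, 2015: 13.44, 2016: 15.83, 2017: 16.17,
}

# Assumption: an unweighted mean across fiscal years.
average = mean(ALL_ICS_T_PERCENT.values())

# Fiscal year with the lowest share, and the recovery from that low to FY2017.
low_year = min(ALL_ICS_T_PERCENT, key=ALL_ICS_T_PERCENT.get)
rebound = ALL_ICS_T_PERCENT[2017] - ALL_ICS_T_PERCENT[low_year]

print(f"Mean share: {average:.1f}%")
print(f"Lowest share in FY{low_year}; FY2017 is {rebound:.2f} points higher")
```

Under that assumption the mean rounds to 17.5%, and the low point falls in FY2015, consistent with the decline and partial recovery described in the text.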

Training topics

Based on the keywords associated with the various categories identified by topic modeling in the set of T awards, the five categories of T-supported data science training are best described as: (1) Clinical and Translational Science; (2) Cell and Molecular Biology; (3) Biomedical Informatics; (4) Genomics and Bioinformatics; (5) Neuroscience and Behavioral Science. Figure 2-7 shows awards in these categories as a percent of all T awards funded by each IC for FY2010-FY2017, as well as for all ICs. The top histogram shows the total number of awards per category; the right histogram shows total funding amount per Institute across all five categories. ICs are listed in order of total funding amount in the dataset. Table 2-4 shows the same data as Figure 2-7. For example, NLM funded 176 T award years total, all of which were related to data science training, with most of them in the Biomedical Informatics or Genomics and Bioinformatics categories. On the other hand, 25% of all of NIGMS’s 3,136 T award years are related to data science, but they are spread more across categories, with at least some training awards in each category. Across all of NIH, data science training awards were about evenly distributed across all topics, with 17.6% of all ICs’ T awards addressing data science topics.

Table 2-4. Data science-related T awards (T32s, T34s, T35s, T90s, and T15s) as a percent of total IC T awards, by topic and total, FY 2010-2017.

IC | Clinical and Translational Science | Cell and Molecular Biology | Biomedical Informatics | Genomics and Bioinformatics | Neuroscience and Behavioral Science | All Data Science Topics

All ICs 3.9% 3.5% 3% 3.3% 3.9% 17.6%

FIC 0% 0% 0% 100% 0% 100%

NCATS 0% 0% 0% 21.40% 0% 21.40%

NCCIH 0% 0% 0% 0% 1.30% 1.30%

NCI 4.90% 1% 1.70% 2.20% 0.10% 10%

NCRR 0% 0% 3.30% 2.20% 0% 5.60%

NEI 2.70% 2.40% 0.90% 2.40% 27.10% 35.40%

NHGRI 2.50% 11.50% 10.20% 56.70% 0% 80.90%

NHLBI 8.30% 0.90% 1.90% 1.90% 0.50% 13.40%

NIA 1.60% 0.20% 0.70% 0.20% 6.90% 9.50%

NIAAA 0% 0% 1.50% 0% 5% 6.60%

NIAID 7.20% 1.10% 1.90% 0.10% 0.40% 10.60%

NIAMS 4.20% 1.70% 0% 1.70% 0% 7.70%

NIBIB 3% 12.40% 3.90% 14.10% 1.90% 35.40%

NICHD 6.10% 2.10% 0.50% 1.10% 6.60% 16.40%

NIDA 2.50% 2% 1.10% 3.60% 14.50% 23.70%

NIDCD 1.30% 0.30% 0% 0.30% 9.50% 11.50%

NIDCR 2.40% 0% 13.10% 2.90% 0% 18.40%

NIDDK 4.90% 0.40% 0.70% 0.20% 0.20% 6.40%

NIEHS 1.90% 1.70% 5.60% 13.10% 0% 22.30%

NIGMS 0.90% 13.20% 5.20% 3.60% 2.50% 25.30%

NIMH 1.80% 0% 1.60% 0.50% 13% 17%

NINDS 2.90% 1% 1.90% 0.70% 15.80% 22.30%

NINR 2% 0% 0% 11.30% 0% 13.30%

NLM 4.50% 0% 57.40% 38.10% 0% 100%

OD 0.60% 0.60% 4.10% 10.90% 0.60% 16.90%

Figure 2-7. Data science-related T awards (T32s, T34s, T35s, T90s, and T15s) as a percent of total IC T awards, by topic and total, FY 2010-2017.
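Because each row of Table 2-4 reports five topic shares and an overall share, the rows can be sanity-checked by summing the topics and finding the dominant one. The sketch below copies four rows from the table; the tolerance for rounding differences is an assumption, as are the helper names.

```python
# Column order of the five topic shares in Table 2-4.
TOPICS = [
    "Clinical and Translational Science",
    "Cell and Molecular Biology",
    "Biomedical Informatics",
    "Genomics and Bioinformatics",
    "Neuroscience and Behavioral Science",
]

# Selected rows of Table 2-4: five topic shares, then the reported total (percent).
ROWS = {
    "All ICs": ([3.9, 3.5, 3.0, 3.3, 3.9], 17.6),
    "NHGRI": ([2.5, 11.5, 10.2, 56.7, 0.0], 80.9),
    "NIGMS": ([0.9, 13.2, 5.2, 3.6, 2.5], 25.3),
    "NLM": ([4.5, 0.0, 57.4, 38.1, 0.0], 100.0),
}

# Percentage points allowed for rounding in the published figures (an assumption).
TOLERANCE = 0.25


def check_row(shares, reported):
    """True if the topic shares sum to the reported total within the tolerance."""
    return abs(sum(shares) - reported) <= TOLERANCE


def dominant_topic(shares):
    """Name of the topic with the largest share in a row."""
    return TOPICS[shares.index(max(shares))]


# Difference between summed topic shares and the reported total, per IC.
discrepancies = {ic: round(sum(s) - total, 2) for ic, (s, total) in ROWS.items()}
dominant = {ic: dominant_topic(s) for ic, (s, _total) in ROWS.items()}
```

For these rows the sums agree with the reported totals to within a tenth of a percentage point, and the dominant topics for NLM and NHGRI are Biomedical Informatics and Genomics and Bioinformatics respectively, in line with the discussion above.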

2.4.2 Fellowship (F) awards

To identify key themes across NIH’s extramurally awarded fellowships for data science, fellowship awards (F30, F31, F32, and F33) funded between FY2010 and FY2017 were analyzed. To locate the relevant awards, QVR was queried using the iSearch interface for the terms “data science,” “big data,” “computational,” “quantitative science,” “informatics,” “machine learning,” “forecasting,” “modelling,” or “deep learning” in the award title, abstract, or specific aims. Additional keywords not used in the T award search were added to capture awards that used data science methodologies, even if they did not explicitly refer to their work as data science. The search returned 772 unique awards, accounting for 1,597 award years, funded across 22 ICs.

Fellowship funding by year

Overall, funding for data science fellowships remained about the same from FY2010-2017, although the number of awards steadily increased. Figure 2-8 shows data science fellowships as a percent of all F awards funded by each Institute by year; the top row shows funding aggregated across all NIH ICs. Data science F awards represented about 5-9% of all NIH F awards per year, accounting for a total investment of just over $66 million. NLM does not award a significant number of F awards; none were awarded between FY2010 and FY2014, and only seven were awarded from FY2015-2017. Of these, 71% were identified using the search described above. Data science F awards formed a sizeable portion of NHGRI’s F awards, but for most ICs, data science-related awards represented a small portion of overall F awards. Table 2-5 shows the same data as Figure 2-8.

Table 2-5. Data science-related F awards (F05s, F30s, F31s, F32s, F33s, and F99s) as a percent of total IC F awards by year, FY 2010-2017.

2010 2011 2012 2013 2014 2015 2016 2017 Total Funding

All ICs 4.54% 4.98% 5.49% 5.14% 5.86% 6.69% 7.12% 8.71% $66,028,327

NCI 5.56% 4.86% 4.17% 4.87% 4.28% 5.65% 5.21% 6.33% $6,576,549

NEI 7.25% 4.76% 7.14% 9.09% 7.46% 13.58% 17.20% 15.79% $2,917,092

NHGRI 16.67% 30.00% 23.08% 44.44% 83.33% 83.33% 88.89% 54.55% $1,287,051

NHLBI 3.91% 5.60% 5.83% 5.69% 5.63% 4.27% 5.95% 6.61% $4,923,683

NIA 3.51% 3.65% 4.66% 3.47% 4.64% 2.92% 5.56% 6.16% $2,275,135

NIAAA 0.93% 0.91% 0.00% 0.00% 1.03% 1.89% 1.80% 4.42% $481,074

NIAID 0.78% 0.00% 1.41% 2.56% 2.12% 2.34% 2.83% 5.70% $1,428,608

NIAMS 0.00% 1.54% 6.25% 3.45% 4.62% 3.08% 2.90% 5.48% $759,378

NIBIB 10.53% 11.76% 19.05% 14.29% 21.05% 14.29% 20.00% 20.00% $1,112,470

NICHD 2.33% 2.97% 10.71% 10.00% 10.00% 8.42% 8.70% 9.09% $2,811,002

NIDA 2.19% 2.45% 4.19% 3.21% 4.93% 8.09% 7.20% 6.11% $2,147,106

NIDCD 8.59% 7.69% 10.81% 11.30% 13.16% 10.53% 10.69% 11.11% $3,855,928

NIDCR 4.41% 3.49% 2.20% 0.00% 1.92% 3.00% 5.68% 6.48% $991,433

NIDDK 1.58% 1.49% 2.29% 3.97% 5.06% 6.25% 3.59% 4.73% $3,590,514

NIEHS 1.82% 0.00% 4.55% 6.98% 6.25% 8.82% 13.89% 8.70% $894,781

NIGMS 10.70% 12.53% 9.95% 10.42% 11.30% 14.93% 12.56% 13.72% $17,174,403

NIMH 4.82% 5.86% 5.69% 4.14% 5.10% 6.41% 7.04% 12.16% $5,257,049

NIMHD 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 11.11% $61,194

NINDS 4.84% 4.43% 5.26% 3.35% 4.91% 4.35% 7.23% 9.26% $7,048,506

NINR 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 1.96% 7.69% $190,617

NLM 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 100.00% 60.00% $195,710

OD 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 14.29% $49,044

Figure 2-8. Data science-related F awards (F05s, F30s, F31s, F32s, F33s, and F99s) as a percent of total IC F awards by year, FY 2010-2017.

Fellowship topics

Based on the keywords associated with the various categories in the set of F awards, the five categories of F-supported data science training as identified in topic modeling are best described as: (1) Neuroscience and Behavioral Science; (2) Clinical and Translational Science; (3) Genetics and Genomics; (4) Mechanistic and Computational Biology; and (5) Cell and Cancer Biology. Figure 2-9 shows awards in these categories as a percent of all F awards funded by each IC for FY2010-FY2017, as well as aggregated across all ICs. The top histogram shows the total number of grants per category; the right histogram shows total funding amount per IC across all five categories. ICs are listed in order of total funding amount in the dataset. Table 2-6 shows the same data as Figure 2-9. Most ICs had fellowships spread relatively evenly across several different categories, while a few ICs had more fellowships focused in one category. For example, data science F awards in the Genetics and Genomics category represented 43% of NHGRI’s overall F awards. Across all of NIH, data science fellowship awards were about evenly distributed across all topics, about 1% per topic, with a total of 6.14% of all ICs’ F awards addressing data science topics.

Table 2-6. Data science-related F awards (F05s, F30s, F31s, F32s, F33s, and F99s) as a percent of total IC F awards, by topic and total, FY 2010-2017.

IC | Neuroscience and Behavioral Science | Clinical and Translational Science | Genetics and Genomics | Mechanistic and Computational Biology | Cell and Cancer Biology | All Data Science Topics

All ICs 1.64% 0.94% 0.91% 1.25% 1.40% 6.14%

NCI 0.00% 1.34% 0.59% 0.90% 2.36% 5.19%

NEI 10.10% 0.50% 0.00% 0.33% 0.00% 10.93%

NHGRI 0.00% 0.00% 43.42% 0.00% 3.95% 47.37%

NHLBI 0.09% 3.15% 0.46% 0.50% 1.28% 5.48%

NIA 0.78% 1.16% 1.16% 0.39% 0.78% 4.27%

NIAAA 0.00% 0.12% 0.35% 0.23% 0.70% 1.39%

NIAID 0.00% 0.06% 1.02% 0.57% 0.70% 2.36%

NIAMS 0.00% 1.69% 0.57% 0.38% 0.75% 3.39%

NIBIB 1.28% 8.33% 0.64% 2.56% 3.85% 16.67%

NICHD 3.40% 0.63% 0.88% 0.00% 2.90% 7.81%

NIDA 1.83% 1.00% 0.50% 0.83% 0.42% 4.57%

NIDCD 9.10% 1.12% 0.00% 0.00% 0.20% 10.43%

NIDCR 0.00% 0.40% 1.07% 0.67% 1.20% 3.33%

NIDDK 0.00% 0.37% 0.50% 0.37% 2.42% 3.66%

NIEHS 0.00% 1.17% 0.59% 0.00% 4.11% 5.87%

NIGMS 0.06% 0.39% 2.47% 6.63% 2.50% 12.06%

NIMH 5.41% 0.19% 0.24% 0.34% 0.14% 6.33%

NIMHD 6.25% 0.00% 0.00% 0.00% 0.00% 6.25%

NINDS 3.08% 0.79% 0.35% 0.31% 0.91% 5.44%

NINR 0.00% 0.97% 0.00% 0.00% 0.00% 0.97%

NLM 0.00% 14.29% 42.86% 14.29% 0.00% 71.43%

OD 0.00% 0.00% 3.85% 0.00% 0.00% 3.85%

Figure 2-9. Data science-related F awards (F05s, F30s, F31s, F32s, F33s, and F99s) as a percent of total IC F awards, by topic and total, FY 2010-2017.

2.4.3 Career Development (K) awards

Career development awards (all K mechanisms) funded between FY2010 and FY2017 were analyzed to characterize trends in data science workforce development for early- to mid-career scientists. To locate relevant awards, QVR was queried using the iSearch interface for the terms “data science,” “big data,” “computational,” “quantitative science,” “informatics,” “machine learning,” “forecasting,” “modelling,” or “deep learning” in the award title, abstract, or specific aims. Additional keywords not used in the T award search were added to capture awards that used data science methodologies, even if they did not explicitly refer to their work as data science. The search returned 898 unique grants, accounting for 2,375 award years, funded across 22 ICs.

Career development funding by year

Overall, funding for data science K awards remained a small percentage of all K awards funded from FY2010-2017, although the number of awards steadily increased. Figure 2-10 shows data science career development awards as a percent of all K awards funded by each Institute by year; the top row shows funding aggregated across all NIH ICs. Table 2-7 shows the same data as Figure 2-10. Data science K awards represented 4-11% of K awards funded across all ICs in FY2010-2017 (mean = 7%), accounting for a total investment of just over $355 million. NLM does not award a significant number of K awards; an average of 10 were awarded per FY between FY2010 and FY2017. Of the 78 K awards funded by NLM during this time, 90% were data science-related. Data science K awards also formed a sizeable portion of NHGRI’s and a moderate portion of NIBIB’s K awards, but for most ICs, these Ks represented a small portion of overall K awards.

Table 2-7. Data science-related K awards (K01s, K02s, K05s, K07s, K08s, K12s, K18s, K22s, K23s, K24s, K25s, K26s, K43s, K76s, and K99s) as a percent of total IC K awards by year, FY 2010-2017.

IC           2010     2011     2012     2013     2014     2015     2016     2017   Total funding
All ICs     4.46%    5.33%    6.02%    6.80%    8.14%    9.18%   10.31%   11.00%    $356,000,000
FIC         0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    2.17%         $57,663
NCATS       0.00%    0.00%   11.11%    0.00%    0.00%    0.00%    0.00%    0.00%        $290,815
NCCIH       0.00%    0.00%    2.13%    2.63%    0.00%    0.00%    0.00%    2.86%        $343,259
NCI         5.88%    5.23%    4.98%    7.75%    9.07%    8.45%   11.46%   12.72%     $44,586,477
NCRR        4.62%    7.27%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%      $1,124,031
NEI         7.35%    6.85%    8.82%    6.41%    5.26%    7.32%    9.64%    4.76%      $6,235,896
NHGRI      75.00%   75.00%   33.33%   36.36%   60.00%   50.00%   40.91%   64.71%      $5,254,366
NHLBI       4.83%    6.19%    7.27%    8.88%    9.92%   10.00%    9.11%   11.78%     $69,356,714
NIA         4.86%    6.30%    9.29%    8.78%    8.25%    8.93%   11.07%    7.27%     $19,207,179
NIAAA       5.94%    4.21%    5.00%    3.00%    2.04%    5.49%    6.93%    4.59%      $6,112,826
NIAID       2.49%    3.79%    4.98%    5.64%    6.57%    7.27%    8.51%    8.88%     $18,210,848
NIAMS       3.82%    4.29%    5.59%    6.29%    7.65%   10.43%   10.39%   12.41%     $12,054,629
NIBIB      12.12%   22.58%   22.22%   24.14%   22.86%   25.00%   19.23%   22.73%      $7,128,664
NICHD       3.54%    5.21%    4.07%    3.89%    4.58%    3.97%    6.64%    8.80%     $15,185,079
NIDA        3.16%    2.19%    2.68%    4.29%    4.98%    6.19%    7.73%    8.48%     $14,939,658
NIDCD       3.23%    9.38%    3.45%    2.56%    9.52%   10.87%   13.33%   18.18%      $4,261,732
NIDCR       7.46%    5.17%    5.17%    6.25%    4.55%    6.52%    9.52%   16.00%      $3,849,038
NIDDK       2.79%    4.05%    4.13%    5.26%    7.20%    7.72%    8.17%    8.41%     $35,121,023
NIEHS       9.30%    4.35%    4.35%    4.00%    7.69%   10.00%   13.73%   16.67%      $4,650,605
NIGMS       8.65%    9.38%   11.83%   13.98%   14.44%   15.00%   15.00%   15.63%     $11,866,719
NIMH        3.61%    5.12%    6.92%    6.69%    6.67%    7.76%    8.81%   10.28%     $29,790,979
NINDS       5.44%    6.14%    6.33%    7.11%    8.41%    9.91%   12.57%   12.56%     $23,316,957
NINR        0.00%    0.00%    3.23%    6.90%    9.68%   12.90%   20.00%   24.00%      $2,619,484
NLM       100.00%  100.00%  100.00%   75.00%   85.71%   86.67%   92.86%   87.50%      $8,335,557
OD          2.35%    3.13%    3.37%    3.96%   13.33%   22.12%   21.78%   20.83%     $11,975,730

Figure 2-10. Data science-related K awards (K01s, K02s, K05s, K07s, K08s, K12s, K18s, K22s, K23s, K24s, K25s, K26s, K43s, K76s, and K99s) as a percent of total IC K awards by year, FY 2010-2017.

Career development award topics

Based on the keywords associated with the various categories in the set of K awards, the five categories of K-supported data science training identified in topic modeling are best described as: (1) Cancer Biology; (2) Neuroscience and Behavioral Science; (3) Clinical Research; (4) Genetics and Genomics; and (5) Translational Research. Figure 2-11 shows awards in these categories as a percent of all K awards funded by each IC for FY2010-2017, as well as aggregated across all ICs. The top histogram shows the total number of grants per category; the right histogram shows total funding amount per IC across all five categories. ICs are listed in order of total funding amount in the dataset. Table 2-8 shows the same data as Figure 2-11.

For most ICs, career development awards in each of the five topics were a small percentage of their overall K awards. However, some topics made up a significant portion of several ICs’ K awards. For example, Cancer Biology and Genetics and Genomics each account for about 19% of NHGRI’s K awards. Across all of NIH, data science K awards were fairly evenly distributed across topics, ranging from about 1.0% to 2.2% per topic (about 1.5% on average), with a total of 7.56% of all ICs’ K awards addressing data science topics.

Table 2-8. Data science-related K awards (K01s, K02s, K05s, K07s, K08s, K12s, K18s, K22s, K23s, K24s, K25s, K26s, K43s, K76s, and K99s) as a percent of total IC K awards by topic and total, FY 2010-2017.

IC          Cancer    Neuroscience and  Clinical  Genetics and  Translational  All Data Science
           Biology  Behavioral Science  Research      Genomics       Research            Topics
All ICs      2.17%               1.57%     1.38%         1.41%          1.02%             7.56%
FIC          0.00%               0.00%     0.00%         0.45%          0.00%             0.45%
NCATS        0.00%               0.00%     0.00%         0.00%          5.26%             5.26%
NCCIH        0.00%               0.32%     0.00%         0.63%          0.00%             0.95%
NCI          4.93%               0.70%     0.87%         0.87%          0.70%             8.07%
NCRR         1.67%               0.00%     1.67%         0.00%          2.50%             5.83%
NEI          0.49%               4.41%     0.16%         1.47%          0.49%             7.03%
NHGRI       18.95%               4.21%     7.37%        18.95%          1.05%            50.53%
NHLBI        2.45%               0.52%     1.79%         1.95%          1.72%             8.43%
NIA          1.65%               2.13%     1.97%         1.33%          0.96%             8.06%
NIAAA        0.50%               2.39%     0.13%         1.64%          0.00%             4.65%
NIAID        1.33%               0.17%     0.77%         2.66%          0.90%             5.82%
NIAMS        3.49%               0.54%     0.08%         2.25%          1.16%             7.53%
NIBIB        2.55%              14.47%     0.00%         2.98%          1.28%            21.28%
NICHD        0.74%               1.35%     1.35%         0.09%          1.40%             4.93%
NIDA         1.24%               0.96%     1.35%         0.34%          1.01%             4.90%
NIDCD        0.97%               7.14%     0.00%         0.32%          0.97%             9.42%
NIDCR        2.42%               0.24%     2.18%         2.66%          0.00%             7.51%
NIDDK        1.92%               0.31%     1.75%         1.20%          0.67%             5.85%
NIEHS        4.47%               0.53%     0.79%         1.84%          1.05%             8.68%
NIGMS        6.35%               1.30%     0.52%         1.68%          3.11%            12.95%
NIMH         0.48%               3.91%     1.21%         0.83%          0.38%             6.82%
NINDS        1.53%               4.89%     0.00%         1.00%          0.94%             8.36%
NINR         0.00%               0.00%     9.05%         0.00%          0.00%             9.05%
NLM         15.38%              10.26%    43.59%        11.54%          8.97%            89.74%
OD           2.15%               1.75%     3.22%         2.95%          2.01%            12.08%

Figure 2-11. Data science-related K awards (K01s, K02s, K05s, K07s, K08s, K12s, K18s, K22s, K23s, K24s, K25s, K26s, K43s, K76s, and K99s) as a percent of total IC K awards by topic and total, FY 2010-2017.

2.4.4 Overall trends in data science training funding, FY2010-2017

Trends in funding for the T, F, and K awards described above are summarized in Figure 2-12. While funding for T awards decreased 29%, from $152.5 million in 2010 to $107.6 million in 2017, investment in individual F and K awards increased over the same period, by 132% and 141%, respectively. Funding for all mechanisms combined steadily decreased by 20% from 2010 to 2015, but increased by 15% between 2015 and 2016, reaching an overall maximum of $187.5 million in 2017.

Figure 2-12. Funding trends for data science-related T, F, and K awards by year, between FY2010 and FY2017.
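The percent changes quoted in Section 2.4.4 follow the usual relative-change formula. A minimal sketch, using only the T award dollar figures stated in the text (the helper names are illustrative, not part of the report's methodology):

```python
def percent_change(start: float, end: float) -> float:
    """Relative change from start to end, in percent."""
    if start == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (end - start) / start * 100.0


def growth_factor(percent_increase: float) -> float:
    """Multiplier implied by a percent increase (e.g., 100% -> 2.0x)."""
    return 1.0 + percent_increase / 100.0


# T award funding in millions of dollars, as reported in the text.
t_2010, t_2017 = 152.5, 107.6
t_change = percent_change(t_2010, t_2017)
print(f"T awards, FY2010 to FY2017: {t_change:.1f}%")

# Reported increases for F and K awards, expressed as multipliers.
f_factor = growth_factor(132.0)
k_factor = growth_factor(141.0)
```

The T award result is a decline of roughly 29%, matching the text. Read as multipliers, the reported 132% and 141% increases mean that F and K funding in 2017 was roughly 2.3 and 2.4 times its 2010 level.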

2.5 BD2K Awards

2.5.1 BD2K Investments in Data Science Training

A major goal of the BD2K initiative centers on training in the development and use of biomedical big data science methods and tools. Over the course of BD2K’s Phase I (FY2014-FY2017), over $35 million (17.5% of the $200 million investment) has been awarded in mentored career development grants (K01, $10.4 million), predoctoral training grants (T15/T32, $9.1 million), and research education grants (R25, $15.6 million). Figure 2-13 illustrates the breakdown of BD2K training investments by administering IC.

Figure 2-13. Phase I BD2K funding for training awards broken down by administering IC and award mechanism.

The set of T and K awards described in Section 2.4 contained all but one BD2K K01 award, suggesting that the search methodology used was adequate for identifying data science training awards. While BD2K did represent an explicit NIH-wide commitment to enhancing data science training, its investments in T and K awards make up a small percentage of the total investment in data science T and K awards identified in Section 2.4: 2.3% (or 2.4% for T15/T32 awards alone) and 4.7% (or 19.4% for K01 awards alone), respectively. This finding suggests that data science has been an ongoing and significant focus of training activities across the NIH, not just through BD2K.

As a trans-NIH initiative, BD2K likely served a coordinating function across ICs that have also been supporting their own data science training activities. Furthermore, BD2K’s high profile has in turn raised the profile of big data and data science in biomedical workforce development. “Big data” and “data science” have increasingly been used as key terms in grant applications, as illustrated in Figure 2-1 and Figure 2-3. BD2K has thus solidified the commitment of NIH as a whole to data science as crucial to building the biomedical workforce of the future.

2.5.2 BD2K Training Coordination Center

In addition to formal training opportunities outlined in Section 2.5.1, the BD2K initiative also funded other activities related to data science workforce development. The BD2K Training Coordination Center (TCC) was created with a $7.6 million investment to coordinate and compile all the outputs of BD2K-funded projects into a common resource to improve data science skills. The TCC aims to target a large audience spanning the scientific community and the public at large, at all ages and stages of career development. The TCC is also meant to coordinate and support innovative collaborations for data science training and innovation. The TCC has organized a variety of activities and resources since launching in 2015, including:

● Education Resource Discovery Index (ERuDIte): To ensure BD2K-supported resources meet FAIR standards (i.e., they are findable, accessible, interoperable, and reusable), ERuDIte aggregates educational resources from both BD2K-funded projects and other available online offerings (such as MOOCs) by scraping the web and then assigning metadata from the Data Science Education Ontology. A major premise of ERuDIte is that it will connect users with the best resources to meet their educational goals, acting as an adaptive, personalized online educational platform that recommends training resources based on user data and behavior. However, this functionality is not operational as of December 2017.

● BD2K Guide to the Fundamentals of Data Science: This weekly webinar series features presenters from around the country, speaking on a different data science topic each week. The original series ran from September 9, 2016 to May 19, 2017, and videos from each of those weekly presentations are archived on YouTube. In the series’ second year, held in fall 2017, only four sessions were hosted, rather than a full year-long curriculum as in the first year.

● Big Data: Biomedicine: In collaboration with the USC School of Cinematic Arts, the TCC created a 22-minute “mini-movie” entitled Big Data: Biomedicine. The short film provides a brief overview of the role of big data in medicine and biomedical research and appears to be intended for a lay audience. It has had about 9,800 views as of December 2017.

● Data Science Innovation Labs: The TCC received supplemental funding to participate in organizing two five-day innovation labs, in collaboration with NIH and NSF. Each Lab brought together about 30 investigators from different fields to tackle a given data science challenge. The first, in June 2016, focused on wearable and/or ambient sensors, and the second, in June 2017, centered on understanding the microbiome.

● Imagining Tomorrow’s University: The TCC also helped support a one-day, invitation-only workshop in March 2017, aiming to bring together leaders in open science and reproducibility to envision how universities might transform to better address the educational needs of tomorrow’s researchers. A resulting paper was published in F1000Research.

● RoAD-Trip Intensive Data Science Residency Program: This program partners junior biomedical researchers with data scientists to collaborate on a joint project over a minimum of two weeks. The program is in its second year, with a second round of applications received in September 2017.

2.6 Qualitative analysis of data science workforce development

While the analyses above can provide perspective on what has already been funded, these quantitative data cannot capture what ICs are planning for the future or what extramural staff perceive as gaps or needs in addressing data science workforce development. To capture these perspectives, semi-structured interviews were conducted with extramural staff familiar with their IC’s data science training efforts. Ten ICs were selected, based on the size of their training programs and their experience with data science workforce development:

● National Cancer Institute (NCI)
● National Human Genome Research Institute (NHGRI)
● National Heart, Lung, and Blood Institute (NHLBI)
● National Institute on Aging (NIA)
● National Institute of Allergy and Infectious Diseases (NIAID)
● National Institute of Environmental Health Sciences (NIEHS)
● National Institute of General Medical Sciences (NIGMS)
● National Institute of Mental Health (NIMH)
● National Institute of Neurological Disorders and Stroke (NINDS)
● National Library of Medicine (NLM)

Analysis of these interviews revealed several key themes and challenges in data science workforce development. These themes, described in depth here, provide additional context and background for the recommendations in Section 4.

2.6.1 Data science remains a nebulous term

A lack of clarity around what constitutes data science was a recurring theme in discussions with Institute staff. Most ICs did not have a formal definition of what does (and does not) constitute data science. For some, data science is about working with large, complex, and varied datasets, or what might be considered “Big Data.” Others pointed to disciplines that could be thought to fall “under the umbrella” of data science: bioinformatics, computer science, biostatistics, epidemiology, computational biology, genomics (and the other ‘omics). Still others mentioned methodological approaches that they considered the domain of data science, such as machine learning and predictive modelling. The varied language used in FOAs, proposals, and other funding documents similarly reflects this lack of consistency in the ways that different research communities and even different individuals think about and discuss data science.

Similarly unclear is what type of work makes one a data scientist. For example, is someone who develops the computational tools used to conduct biomedical data science a data scientist? Conversely, is someone who uses such tools but is not involved in their development a data scientist? What about someone who collects the data underlying the analysis, or someone who coordinates and curates the data? Such questions are foundational to developing programs that will prepare researchers to become biomedical data scientists.

Most ICs agreed that a data scientist must have knowledge and expertise in three broad areas: computer science, statistics and mathematics, and biological science (or other subject matter expertise). However, the exact formula for how much knowledge a data scientist would need in each area remains unclear. Most ICs did not consider it feasible for a trainee to be expected to develop deep expertise in so many different types of science. As one interviewee put it, looking for a data scientist with expertise in computer science and statistics as well as several biological fields was like “trying to find a unicorn.” A more realistic goal might be to train a workforce of data scientists who have a depth of expertise in a specific area (for example, computer science) and are at least conversant in the others (for example, biological subject matter knowledge and statistics). Such an approach would have implications for how training programs are funded and which ICs take leadership in providing training, as well as how data scientists function within the broader context of biomedical research.

2.6.2 Data science training is relevant to the broader biomedical community

As a significant body of literature and NIH reports have previously demonstrated, data science methodologies and techniques have become increasingly important to the biomedical research community, and this trend is likely to continue for the foreseeable future. However, concerns about the current level of expertise in the existing workforce arose in interviews with several ICs. Given that data science is a relatively new discipline, are there adequate faculty and mentors with the expertise to support the development of the biomedical data science workforce? In addition, some ICs wondered whether a sufficient pool of reviewers with the relevant data science expertise currently exists to evaluate data science-related proposals, not only for training mechanisms, but also for R01s and other research grants. Providing reviewers training on the foundations of data science could help raise the overall expertise of review panels, as could making an effort to include a diverse range of reviewers, such as data scientists from industry or non-biomedical data scientists.

Many ICs also considered certain data-related skills and knowledge relevant to all biomedical researchers, not just data scientists. One IC even mentioned that data science is specifically included in their new strategic plan. While most biomedical researchers will not need the breadth of multi-disciplinary expertise associated with data science, they do need skills that will enable them to work effectively in a research environment that is increasingly driven by large and complex datasets. Statistical proficiency, data management and visualization skills, and computational science literacy are all foundational to good science in general.
Some ICs discussed the value of developing a core curriculum for data that would be required in all NIH-funded programs, similar to the current requirement for training on Responsible Conduct of Research. Such an approach would ensure that biomedical researchers have a basic understanding of best practices for working with data. This type of required training could also help address recent concerns about research rigor and reproducibility, both of which were also raised in discussions with ICs.

Several ICs spoke about the importance of providing training opportunities through a variety of means and for researchers at various stages in their careers. Some ICs mentioned the value of short courses and online training opportunities, particularly for established researchers who may not have a significant amount of time to dedicate to training.

2.6.3 Data science transcends IC-specific domains

Data science is organ- and disease-agnostic, cutting across categorical domains that form the mandate for most NIH ICs. Nevertheless, training grants centered on data science are typically expected to have an explicit application to a given IC’s organ or disease-related mission. Forty percent of active FOAs for institutional training grants are related to data science, suggesting that data science concepts have become an integral component of many disciplinary training programs. NINDS, for example, recently issued an FOA for a T32 Award (PAR-17-096) around quantitative literacy. The announcement notes that “a key component will be a curriculum that provides a strong foundation in experimental design, statistical methodology, and quantitative reasoning.” These elements also figure prominently in data science.

Still, several IC representatives noted that there are few programs that support biomedical data science training more broadly. Rather, training programs supported by categorical ICs produce domain-specialized data scientists. This specialization is especially evident in K and F awards, which tend to focus on highly specialized applications of data science methodologies. As indicated in Section 2.3, a small fraction of K and F FOAs are specifically related to data science. IC representatives noted that many of the projects identified in Section 2.4 were funded based on the merit of the science, rather than explicitly weighing the applicant’s interest in gaining data science expertise. Individual workforce development awards with an explicit data science focus therefore remain a potential opportunity to develop more data scientists with a broad range of skills that transcend a specific domain or disease. An FOA for an F or K award could, for instance, require a focus on a particular set of core competencies required of a data scientist or emphasize specific methodologies, such as machine learning, natural language processing, or computer vision.

Several IC representatives noted that NLM is well poised to play a role in shaping more broadly focused opportunities for data science training programs. The science of NLM encompasses biomedical data science as well as the highly related areas of biomedical informatics and information science. In line with this focus, 100% of NLM’s T awards (Table 2-4), 71.4% of F awards (Table 2-6), and 89.7% of K awards (Table 2-8) are data science-related, for a collective investment of over $102 million between FY2010 and FY2017. Several IC representatives also noted that NLM's T15 awards could present an opportunity for inter-IC collaboration on future data science training FOAs.

2.6.4 Data science is an interdisciplinary team science

While there are examples of biomedical data scientists, that is, individual investigators with significant expertise in biomedical science, quantitative science, and computer science, and while training programs for biomedical data science will surely produce more such examples, data science is often practiced in the context of team science, leveraging the strengths and expertise of a number of individuals throughout the data life cycle (from data generation to analysis to conclusions) to investigate a particular research question. When presented with this report’s working definition of data science, several IC representatives noted that it may be unrealistic to expect a single scientist to be an expert in statistics and mathematics, computer science, and a particular biological domain (as noted in Section 2.6.1). Through team science approaches, even data scientists lacking deep expertise in biomedical science could make significant contributions as a Principal Investigator (PI) on a grant with multiple PIs or as a senior scientist working on another PI’s grant.

Many IC representatives noted that the standard metric of success for a training award is whether a particular trainee goes on to receive an R01 award. In other words, training programs are evaluated based on the number of independent researchers they produce. Such a metric may not be a useful way to evaluate the success of data scientists who work in these team-based environments as a member, rather than as a leader, of the team. Training the next generation of data scientists therefore requires an alternative evaluation and incentive structure to recognize and reward team science practices. Some IC representatives wondered how NIH could begin to evaluate what success might look like for a team scientist versus an independent researcher.

In addition, some ICs are beginning to experiment with training programs that leverage multi-disciplinary expertise to train a cohort of data scientists. NCI, for instance, recently launched BD-STEP (Big Data-Scientist Training Enhancement Program) in collaboration with the Department of Veterans Affairs (VA). BD-STEP takes a team science-based approach to training a new generation of clinical scientists equipped with data science expertise to improve patient care. The program matches graduate students in the physical sciences with VA medical centers for a year-long research and training opportunity with clinician scientists.
Programs like BD-STEP that use partnerships and explore ways of providing trainees with hands-on experience in a team setting may be a useful way to prepare data scientists with the type of expertise and skills they will need to succeed in both basic and clinical research settings.

Data science training could also emphasize building collaboration and communities of practice. IC representatives noted that the current scientific culture often rewards competition rather than collaborative methods of evidence generation and analysis. Competitive incentive structures may tend to discourage the best practices that characterize good data science. Some IC representatives therefore wondered how NIH can better incentivize collaborative research practices to motivate training on emerging collaborative research tools, such as Jupyter Notebooks and the Open Science Framework (OSF).

Collaboration also requires training on effective communication across disciplines. In other words, how can scientists be trained to ask the right questions of their collaborators, with the right vocabulary? Not only do data scientists need to know enough biomedical science to effectively interpret results, but scientists in more traditional disciplines must also be adequately conversant in computational and statistical methods. These issues are echoed in the broader scientific community’s responses to NLM’s RFI on Data Science Challenges in Health and Biomedicine (NOT-LM-18-001).

2.7 Conclusions

The findings of this report suggest that data science has been an important target for workforce development at the NIH between FY2010 and FY2017, with about $1.35 billion in data science-related T, F, and K awards over eight years. Adding in the $15.6 million invested in BD2K R25s and the $7.6 million for the BD2K TCC, that total rises to over $1.37 billion, or an average of nearly $172 million per year. IC staff interviewed for this study also indicated that data science remains an important target for workforce development and a focus for new programs and awards.

The $930 million invested in T awards between FY2010 and FY2017 accounts for nearly 70% of the total investment and 40% of the award years in all of the workforce development mechanisms (Ts, Ks, and Fs) considered in this study. Data science T awards also make up a significant portion of all NIH T awards, accounting for just over 17% of all T awards funded during the period of study. Even when additional search terms beyond those used in the search for T awards are included to capture as complete a view as possible, F and K awards related to data science make up a much smaller percentage of overall awards within their mechanism, just 6.14% and 7.56%, respectively. Data science F and K award FOAs also make up a much smaller percentage of the overall FOAs in their mechanisms, 9.1% and 7.1%, respectively, compared to data science Ts, which make up 40% of all T FOAs.

As trainees complete their doctoral programs in data science funded by these T mechanisms, they will need opportunities to continue their career development. As extramural staff mentioned, many data scientists will not be independent researchers who go on to receive R01 awards, but team scientists who work on interdisciplinary projects. Ensuring that graduates have access to F and K awards will help prepare them to take the next steps in their careers.
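The headline totals in this section can be checked with simple arithmetic. The sketch below uses only the rounded dollar figures stated in the text (in millions), so its results are approximate; the variable names are illustrative.

```python
# Figures in millions of dollars, as stated in the text (rounded values).
T_F_K_TOTAL = 1350.0   # "about $1.35 billion" in T, F, and K awards
BD2K_R25 = 15.6        # BD2K research education (R25) grants
BD2K_TCC = 7.6         # BD2K Training Coordination Center
T_AWARDS = 930.0       # T awards alone
YEARS = 8              # FY2010 through FY2017

grand_total = T_F_K_TOTAL + BD2K_R25 + BD2K_TCC
per_year = grand_total / YEARS
t_share = T_AWARDS / T_F_K_TOTAL * 100.0

print(f"Total investment: ${grand_total:,.1f}M")
print(f"Average per year: ${per_year:,.1f}M")
print(f"T awards as a share of T/F/K total: {t_share:.1f}%")
```

With these inputs, the total comes to roughly $1,373 million, the annual average to a little under $172 million, and the T award share to about 69%, consistent with the "over $1.37 billion," "nearly $172 million per year," and "nearly 70%" stated above.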

3 The state of data science training for NIH staff

NIH staff, including intramural researchers, program, review, and policy staff, and others working at the main NIH campus and elsewhere, have many of the same needs for data science training as the extramural community. Some of the activities designed for and by the extramural community are also relevant and available to NIH staff. For example, training modules such as those developed through R25 grants, and online activities such as the BD2K Guide to the Fundamentals of Data Science webinar series developed by the BD2K Training Coordination Center, are all potentially useful for NIH staff. However, NIH staff also benefit from access to hands-on training customized to their needs and interests, as well as interaction with a community of like-minded researchers.

A variety of different data science training activities are available to NIH staff. Some of these activities are offered by NIH organizations that principally provide training and support for NIH staff, such as the NIH Library, the Center for Information Technology (CIT), and the Foundation for Advanced Education in the Sciences (FAES), while others have been developed by groups who do not have training as their primary mission. Most of these groups, such as the NIMH Data Science and Sharing Team and the NCI Bioinformatics Training & Education Program, were developed to serve a specific IC, but may also open their activities up to other staff. Some activities are provided on a volunteer basis, such as the NIH Data Science Mentoring program, which is administered by NLM and OD staff who lead the NIH Data Science SIG, with scientists from around NIH volunteering as mentors. Table 3-1 provides an overview of major data science workforce development activities for NIH staff.

Table 3-1. Overview of training and workforce development activities available for NIH staff.

Activity: CodeNIH
Primary contact person(s): None
Description: This program is an “incubator” for postbacs, interns, and postdocs who are interested in learning how to code and do data science research. The group meets weekly to help motivate individuals in their learning and provide a space for networking with others interested in coding. They also maintain a GitHub repository of tutorials covering a variety of skill ranges, from basic to advanced, and several different programming languages, including Python, JavaScript, and MATLAB.

Activity: Data Science Mentoring Program
Primary contact person(s): Lisa Federer and Ben Busby
Description: To complement classroom training and provide learners the opportunity to gain real-world experience with data science, the Data Science Mentoring Program pairs researchers interested in learning about data science with more experienced mentors. Interested researchers complete a brief application form to allow the organizers to pair learners and mentors who have similar interests and goals.

Activity: FAES Bioinformatics and Data Science Department
Primary contact person(s): Ben Busby
Description: FAES provides graduate-level scientific classes for credit at NIH. The Bioinformatics and Data Science Department provides semester-long classes on a variety of topics, including more theoretical or science-based classes, like calculus and statistics, as well as classes on computational tools, like R and Python. Classes are open to all, including non-NIH students, but are fee-based.

Activity: High Performance Computing (HPC) Training
Primary contact person(s): Susan Chacko
Description: NIH’s High Performance Computing (HPC) group administers several computing resources for the NIH community, including Biowulf (a 90,000 processor Linux cluster) and Helixweb (a set of web-based scientific tools). HPC staff offer Linux classes to help users learn how to use their Linux-based resources as well as classes on cluster-based computing.

Activity: NCI CCR Bioinformatics Training and Education Program (BTEP)
Primary contact person(s): Peter Fitzgerald
Description: This group teaches hands-on courses that are open to the NIH intramural community, but space is limited and preference is given to CCR staff. Most of their classes are fairly bioinformatics-intensive, related to specific vendor tools or specific bioinformatics research approaches, but they also teach some workshops relevant to data science more broadly, such as statistics and R.

Activity: NIAID Bioinformatics and Computational Biosciences Branch (BCBB)/CIT Workshops
Primary contact person(s): Burke Squires and Karlynn Noble
Description: CIT hosts Seminars for Scientists in collaboration with instructors from NIAID’s BCBB. Workshops range from general topics like Becoming a Reproducible Scientist to more specific ones like Homology Searching and Sequencing Alignment. CIT also hosts workshops on data science-related technology including Python, UNIX, SQL/noSQL, etc.

Activity: NIMH Data Science and Sharing Team
Primary contact person(s): Adam Thomas
Description: While this group is primarily a scientific research group, they sponsor workshops and other training events for the NIH community. Recent topics have included citizen science, machine learning, and other data science-related topics. They have also sponsored Software Carpentry.

Activity: NIH Library Bioinformatics Support Program
Primary contact person(s): Lynn Young
Description: The Library’s Bioinformatics Support Program features training by Library staff as well as visiting vendors who provide instruction on bioinformatics resources that the Library licenses. The program also provides one-on-one and online tutorials and consultations, and hosts special events like a Bioinformatics Symposium and Bioinformatics Game Tournament.

Activity: NIH Library Data Services Program
Primary contact person(s): Lisa Federer
Description: The Library’s Data Services program provides classroom- and webinar-based training on data management and data science, including classes on R, as well as one-on-one and group consultations. The program also coordinates Software Carpentry and Data Carpentry courses and provides training space for relevant courses in its laptop-equipped Training Room. In addition, the program hosts and sponsors various data science events, such as NIH Pi Day, Hour of Code, and Love Your Data Week.

Activity: NLM Partnership with the National Endowment for the Humanities (NEH) Office of Digital Humanities
Primary contact person(s): Jeffrey S. Reznick
Description: NLM’s History of Medicine Division has co-sponsored and hosted a number of NEH-funded programs and related initiatives since 2012 to advance and support data-focused interdisciplinary research and training using tools of the digital humanities. Partners in this work have also included the Maryland Institute for Technology in the Humanities, Research Councils UK, Virginia Tech, and the Wellcome Trust.

3.1 On-campus data science instruction

Given the decentralized nature of data science training at NIH, collecting statistics about training usage is difficult, requiring staff from multiple organizations to voluntarily share their data. Whereas extramural funding is carefully tracked in a central system designed specifically for this purpose, most statistics on activities conducted on campus are collected in distributed, ad hoc systems, if at all. Representatives from several groups were contacted in order to collect the most thorough statistics possible for this report, but the following information should be viewed as a snapshot of on-campus data science training, rather than a comprehensive summary.

Attendance and enrollment data from calendar years 2016-2017 were collected from the Center for Information Technology (CIT), the Foundation for Advanced Education in the Sciences (FAES), NCI CCR’s Bioinformatics Training and Education Program (BTEP), the NIH Library, and the NIMH Data Science and Sharing Team (DSST). Figure 3-1 shows total attendance for all providers. Since FAES offers semester-long courses, enrollment, rather than actual attendance, is shown in the figure, and is represented on the chart during the first month of the course for each semester. Not all training providers were able to provide data for the full two-year period; further details on the period of measurement for each provider are included in the following sections.

Table 3-2 shows statistics about class numbers and attendance. Data were provided for 264 classes over the two-year period, for an average of 11 classes per month. Median total attendance for all classes was 18 attendees per class (mean = 25.5). Median in person attendance was 13 attendees (mean = 17.3). For the 75 classes with a webinar attendance option, median webinar attendance was 24 (mean = 29.4). A total of 6,652 attendances were recorded, for an average of about 277 attendances per month. From the available data, it cannot be determined how many of this count are unique attendees (in other words, one individual may have attended multiple classes).

Figure 3-1. Overall attendance at courses for various NIH campus organizations.

Table 3-2. Statistics for on-campus data science training activities, 2016-2017.

Measure                        Total    Mean
Total classes                  264      11 per month
Classes with webinar option    75       3.1 per month
Total attendance               6,652    25.5 per class; 277.2 per month
In person attendance           4,505    17.3 per class; 187.7 per month
Webinar attendance             2,147    29.4 per class; 89.5 per month
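
The per-month means in Table 3-2 follow from dividing each total by the 24 months of the two-year period. The following is a minimal Python sketch of that arithmetic, using only the totals reported in the table; the variable names are illustrative and do not come from the report.

```python
# Totals reported in Table 3-2 for calendar years 2016-2017.
MONTHS = 24  # two-year reporting period

totals = {
    "classes": 264,
    "classes_with_webinar": 75,
    "total_attendance": 6652,
    "in_person_attendance": 4505,
    "webinar_attendance": 2147,
}

# Per-month mean for each measure, rounded to one decimal place as in the table.
per_month = {name: round(value / MONTHS, 1) for name, value in totals.items()}

# In person and webinar attendance together should account for total attendance.
attendance_adds_up = (
    totals["in_person_attendance"] + totals["webinar_attendance"]
    == totals["total_attendance"]
)

for name, value in per_month.items():
    print(f"{name}: {value} per month")
print("in person + webinar == total:", attendance_adds_up)
```

The per-class means are not recomputed here because they depend on class-level records that the report does not list.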

3.1.1 Center for Information Technology

The NIH Center for Information Technology (CIT) teaches a variety of courses related to computation and networking services available at NIH, including a series of Seminars for Scientists. Though hosted by CIT in their training space, this seminar series is taught by staff from NIAID’s Bioinformatics and Computational Biosciences Branch. Classes focus on programming techniques relevant to bioinformatics and biomedical science, particularly with Python.

The most recent data available cover the period between September 2016 and April 2017, during which time CIT offered 26 seminars, both in person and by webinar. Figure 3-2 shows in person and webinar attendance for CIT Seminars for Scientists. Median total attendance was 40 attendees per class (mean = 38). Median in person attendance was 11 attendees (mean = 10); for classes with a webinar option, median webinar attendance was 49 attendees (mean = 45). A total of 983 attendances were recorded during the period of data collection.

Figure 3-2. In person and webinar attendance at Seminars for Scientists hosted by CIT.

3.1.2 Foundation for Advanced Education in the Sciences

The Foundation for Advanced Education in the Sciences (FAES) provides advanced education and training in a variety of scientific topics, including short biotechnology workshops as well as semester-long courses. The Bioinformatics and Data Science Department offers classes on a variety of topics, including programming, statistics, and bioinformatics. In addition, FAES offers a 14-credit curriculum in Advanced Studies in Bioinformatics and Data Science designed for participants who already hold an advanced degree in life sciences or STEM fields.

Enrollment figures were available for the Spring 2016 through Fall 2017 semesters. During this time, 29 classes were taught in the Bioinformatics and Data Science Department, including 10 bioinformatics and 19 data science courses. FAES staff noted that classes in the Bioinformatics and Data Science Department are the most “oversubscribed” of any department.

Figure 3-3 shows enrollment for classes in the FAES Bioinformatics and Data Science Department in the Spring 2016 through Fall 2017 semesters. Median enrollment across all classes was 20 enrollees per class (mean = 22). Median enrollment for bioinformatics courses was 14.5 students (mean = 14.8). Median enrollment for data science courses was 23 students (mean = 25.7). A total of 636 students enrolled in classes during the four semesters reported, 488 in data science courses and 148 in bioinformatics courses.

Figure 3-3. Enrollment in classes in the Bioinformatics and Data Science Department at FAES.

3.1.3 NCI Bioinformatics Training and Education Program

NCI Center for Cancer Research’s (CCR) Bioinformatics Training and Education Program (BTEP) was established by the Office of Science and Technology Resources in March 2012 to increase awareness and understanding of bioinformatics techniques and applications. Most hands-on classes are limited to 25 registrants; preference is given to CCR staff, though others may attend as space permits. Over 40 individual labs or branches within CCR have been represented in attendance at BTEP classes over the last two years.

Figure 3-4 shows attendance for classes offered between January and December 2017. During this time, 16 classes were offered. Median attendance was 23 attendees per class (mean = 29.9). A total of 479 attendances were recorded during the period of data collection.

Figure 3-4. Attendance at BTEP bioinformatics courses.

3.1.4 NIH Library

In addition to topics that might traditionally be associated with libraries, the NIH Library offers classes taught by specialists in the Library’s Bioinformatics Support and Data Services programs, established in 2009 and 2013, respectively. Bioinformatics classes typically focus on instruction in a particular bioinformatics tool or technique, while data services classes include topics such as data visualization, data wrangling, statistics, and R programming.

Data were provided for classes offered between March 2016 and December 2017. During this time, the NIH Library offered 417 classes, of which 185 (44%) were related to bioinformatics (n = 89, 21%) or data science (n = 96, 23%). Both bioinformatics and data science classes were generally popular, with over half having a waitlist because in person registrations had exceeded the NIH Library’s Training Room capacity of 22 attendees. By comparison, only 6% of the NIH Library’s other classes had a waitlist. Given the popularity of these classes, attendance by webinar was also offered when feasible; about a third of the bioinformatics and data science courses offered webinar attendance as an option.

Figure 3-5 shows in person and webinar attendance at NIH Library bioinformatics and data science classes. Median total attendance was 16 attendees per class (mean = 24). Median in person attendance was 12 attendees (mean = 16); for classes with a webinar option, median webinar attendance was 19 attendees (mean = 25). Library staff also taught classes at some large events that drew many attendees, such as a Bioinformatics Symposium in June 2016, a Big Data Bootcamp in July 2016, and Hour of Code in December 2016 and 2017. A total of 4,330 attendances were recorded during the period of data collection.

Figure 3-5. In person and webinar attendance at bioinformatics and data science classes hosted by the NIH Library.

3.1.5 NIMH Data Science and Sharing Team

The NIMH Data Science and Sharing Team (DSST) was formed in 2016 to support investigators within the NIMH intramural program in creating, distributing, and leveraging large, open datasets to accelerate discovery. In addition to providing support for NIMH groups, the DSST teaches courses and organizes lectures open to all NIH staff.

Figure 3-6 shows attendance for classes offered between June 2016 and December 2017. During this time, 8 classes were offered. Median attendance was 25 attendees per class (mean = 28). A total of 224 attendances were recorded during the period of data collection.

Figure 3-6. Attendance at NIMH Data Science and Sharing Team courses.
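
The provider-level figures quoted in Sections 3.1.1 through 3.1.5 can be cross-checked against the overall totals in Table 3-2. The sketch below is only an arithmetic check on numbers stated in this report; the dictionary layout and names are illustrative assumptions, not part of any NIH system.

```python
# (number of classes, recorded attendances or enrollments) per provider,
# as quoted in Sections 3.1.1-3.1.5. FAES figures are enrollments.
providers = {
    "CIT Seminars for Scientists": (26, 983),
    "FAES Bioinformatics and Data Science": (29, 636),
    "NCI CCR BTEP": (16, 479),
    "NIH Library (bioinformatics and data science)": (185, 4330),
    "NIMH DSST": (8, 224),
}

# Sum class counts and attendance counts across all providers.
total_classes = sum(classes for classes, _ in providers.values())
total_attendances = sum(count for _, count in providers.values())

print("classes:", total_classes)
print("attendances:", total_attendances)

# Simple quotient of total attendances over class count for each provider.
for name, (classes, count) in providers.items():
    print(f"{name}: {count / classes:.1f} per class")
```

Both sums match the totals in Table 3-2 (264 classes and 6,652 attendances). The simple per-class quotients need not equal the means reported in each subsection, which may have been computed from class-level records, and the reporting periods differ between providers.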

3.2 NIH Data Science SIG

The NIH Data Science SIG was formed in early 2016 with the purpose of disseminating data science information, resources, and activities; providing the community opportunities to discuss and facilitate collaboration; and encouraging sharing of data, findings, and methodologies. The Data Science SIG also hoped to provide a means for coordinating with the many other SIGs and groups at NIH that conduct programming of relevance to data science. For example, groups like the Bioinformatics, Deep Learning in Medical Imaging and Behavior, and Statistics Interest Groups differ in subject matter focus, but all share an interest in topics that could be considered data science. While one group’s events or activities may be of interest to other groups’ members, other groups were unlikely to be aware of such events. One goal of the Data Science SIG was providing a centralized location for announcing such events, as well as publicizing the data science training classes described above.

While the SIG does hold in person meetings and events, the primary means of communication with members is the Data Science SIG listserv. As of December 2017, the listserv had 441 members, with a mean of 19 new members joining each month. Figure 3-7 shows membership in the SIG from February 2016 (when the list was created) to December 2017. Significant increases in membership during July 2016 and December 2017 correspond to invitations to join the SIG sent to related listservs (such as the Bioinformatics SIG and Fellows listservs).

Figure 3-7. Cumulative count of members of the NIH Data Science SIG email listserv.

3.2.1 SIG Events

The Data Science SIG has organized various events on topics related to data science, often offering attendance both in person and by webinar, for NIH staff as well as other interested individuals. Events held in 2017 include:

● Global Perspectives on Biobanking and Access to Samples, January 2017: A seminar featuring six experts from around the world speaking on their specific activities regarding specimen collection, access to samples, and overcoming obstacles and challenges, including the unique challenges inherent in rare disease specimens.

● Machine Learning Applications, January 2017: Presentations from NIH staff on how they are using machine learning applications in their own research and work.

● Computer Architecture for Data Scientists, April 2017: A talk on system implementations and methodologies for data science, including the use of cloud applications.

● Data Science Week, September 2017: A series of events held daily throughout the week, with a focus on various data science topics, including containerization, use of NLM resources for data science, data science for precision medicine, and text mining. Presenters included both guest speakers and NIH staff.

Figure 3-8 shows attendance at each of the 2017 sessions.

Figure 3-8. Combined in person and webinar attendance at NIH Data Science SIG events.

3.2.2 Data Science Mentoring

While classroom-style and webinar-based training activities are useful, one-on-one mentoring can also be beneficial for researchers who are trying to enhance their skills. Such a program also draws upon the significant data science expertise that already exists within the NIH community. The NIH Data Science SIG established a Data Science Mentoring program in early 2017, taking applications from mentors and mentees. The program was announced on various data science-related listservs, including the Data Science SIG listserv, the Data Science Trainers listserv, the NIH Fellows’ listserv, and the Data@NIH newsletter. The first cohort was paired in May 2017, and a second cohort was formed in September 2017.

Pairings were made based on the mentee’s interests and the mentor’s expertise. As Figure 3-9 shows, significantly more mentees applied than mentors, so not every applicant could be paired with a mentor. Overall, more than four times as many mentees applied as mentors, suggesting that interest in learning data science at NIH outpaces the existing level of expertise. A total of 73 pairings have been made in the first year of the program, but 82% of mentees (n = 353) could not be paired due to a lack of qualified mentors.

Figure 3-9. Mentor and mentee applicants to the NIH Data Science Mentoring program in 2017. The total number of pairings for each cohort is less than the number of mentor applicants because not all were eligible for pairing (non-NIH applicants or those in a geographic location where no mentees applied were not paired).

3.3 Conclusions

The findings presented here suggest that data science training and workforce development activities are valuable to and highly utilized by NIH staff. Classes are offered so frequently that on average, half the calendar days in a month would have a data science class. Even with so many classes, many of the activities described here received far more applicants than could be accommodated, suggesting that demand still exceeds availability.

Classes also typically focus on intramural research staff, rather than other staff who may also need to learn data science skills. For example, few classes are targeted toward program staff, who will likely need to gain some expertise in data science in order to effectively assess and manage a portfolio that may increasingly include data science-related awards and employ data science approaches.

The highly distributed nature of data science training on the NIH campus also suggests a need for greater coordination. Classes often overlap in terms of topics (for example, several different groups teach Python courses, and several others teach R courses) as well as dates. While the Data Science SIG was formed with the vision of providing coordination, the SIG is an entirely volunteer effort, conducted by organizers who have no official mandate to coordinate efforts. A formal body that could coordinate and provide direction for on-campus data science training efforts could benefit both NIH staff interested in taking classes and the groups that offer these training sessions.

4 Recommendations

Based on the findings in this study, five broad recommendations are suggested to help create a diverse workforce prepared to respond to the challenges and seize the opportunities of an increasingly data-intensive biomedical research enterprise. NLM’s mission to enable biomedical research broadly, without focus on specific disease or organ systems, means it is well-positioned to act as a central resource for training biomedical data scientists, so many of these recommendations could be implemented at the NLM level, as well as more broadly across NIH. To the extent possible and appropriate, implementation of these recommendations should align with recommendations of the NIH ACD Biomedical Research Workforce Working Group Report.

Data science holds the promise of transforming and advancing biomedical research by providing new ways to analyze, visualize, understand, and gain insight from large, complex sets of genomic, connectomic, image, health record, behavioral, and other kinds of data. Fundamental to realizing this promise is the broad adoption of good data management practices, and assurance that digital research objects, such as data sets, publication citations, software tools, etc., are findable, accessible, interoperable, and reusable (i.e., in accord with the FAIR principles). This transformation will be hastened by training aimed at producing three levels of expertise in, and understanding of, data science.

First is training of pure data scientists who work in the context of biomedical science. These biomedical data scientists would generate next generation analytics, novel ways to visualize and otherwise present data, new artificial intelligence approaches such as deep learning, at-scale curation solutions and provenance-tracking through distributed ledger technologies, and other means of accelerating and transforming discovery and biomedical progress. The research conducted by these biomedical data scientists would include work related to the methods and approaches (e.g., validation, comparison), as well as addressing biomedical research problems.

The second level of training is for expertise conferred by having biomedical scientists cross-trained in data science, and data scientists cross-trained in biomedical science. The former would be conversant in data science and its tools and would be well-poised as early adopters and adaptors of the cutting-edge approaches and methods developed by the biomedical data scientists described above, providing new capabilities to analyze, visualize, and otherwise gain insight from their biomedical data. Data scientists cross-trained in biomedical science would be conversant in a defined area of biomedical subject matter and would be able to expand their research horizons into those knowledge niches. Such cross-trained data scientists could apply data science approaches to produce better analysis, visualization, and understanding of specified biomedical data, and would serve important roles as leaders in bringing biomedical digital research objects (including, but not limited to data, software tools, etc.) in line with the FAIR principles (i.e., making such objects findable, accessible, interoperable, and re-usable). This second level of expertise also includes cross-training librarians and information scientists in data science to lead activities that will grow in importance as biomedical research becomes more data-centric and open, including assuring that digital research objects are FAIR and that best practices in data management are applied throughout the research data life cycle.

The third level of training would promulgate data science literacy across the biomedical workforce and beyond. This level includes training on the nature, power, and limitations of data science, as well as on good data management practices and the importance and means of making digital research objects FAIR. Those to be trained at this level include not only biomedical scientists (ideally, all), but also NIH extramural program, review, and policy staff, as well as medical and health science librarians and other information professionals. Training in biomedical data science literacy also extends to those who are not yet in the biomedical workforce, but who might be drawn to it through such training, including undergraduates and K-12 students. Importantly, this training would best be conducted through a variety of didactic methodologies, including innovative and non-traditional modes like webinars, hackathons, and curriculum modules.

Recommendation 1. Develop a common programmatic understanding of what constitutes biomedical data science and its practice (both of which will evolve)

Data science is an emerging field, encompassing a variety of disciplines and methodologies; what is and what is not data science is not always clear. Many of the IC representatives interviewed for this report agreed that, within their IC or research community, the definition of data science remained unclear. While developing and conforming to a highly detailed definition of data science would be both difficult and ill-advised, it would be useful to have an NIH-wide definitional sense of what data science comprises to allow for consistent and coherent programmatic activities, including planning, implementation, analysis, and evaluation.

■ Recommendation 1a. Work across NIH toward a unified sense of biomedical data science for programmatic consistency across NIH.

■ Recommendation 1b. Work across NIH to identify core competencies for biomedical data scientists.

■ Recommendation 1c. Work across NIH to identify core competencies for data science literacy for all biomedical scientists.

Recommendation 2. Expand and enhance training of data science experts

Data science professionals, including those competing for NIH funding to support breakthrough research, are urgently needed for biomedical research. Programs are needed to train students, fellows, and professionals to meet this demand. The longstanding experience and expertise of NLM in providing training in biomedical information science, informatics, and data science, combined with its organ- and disease-agnostic mission, position NLM well to fill an important role in this training.

■ Recommendation 2a. Expand and enhance training of pure data scientists in the context of biomedical science.

Recommendation 3. Provide training across data science, biomedical science, and information science

Not all biomedical scientists are experts in genomics or molecular biology, nor will all biomedical scientists become experts in data science, yet the importance of these fields to much of biomedical research is profound, and even non-expert investigators are well-versed in these fields. As biomedical data become more accessible and re-usable, data science will become a similarly important field in which non-expert biomedical scientists will need to be well-versed to excel in a research environment characterized by increasingly large and complex datasets. While such training would not produce investigators capable of developing fundamentally new analytic methods or deep learning algorithms, it would provide them a sophisticated understanding of data science that may transform the nature and trajectory of their research interests, identifying new scientific leads to follow.

With the elevated importance of data science in biomedicine, there will also be an increased need for data scientists who understand biomedical science, even if their research goals are not limited to biomedical science. Data scientists with a solid understanding of particular aspects of biomedicine would be powerful partners on team science projects, bringing the promise of fresh new perspectives to bear on difficult research problems.

Finally, as expectations and policies proliferate that encourage, recommend, or require good data management practices, including making digital research objects findable, accessible, interoperable, and reusable, those who understand the policies, requirements, and appropriate responses to such policies across the data life cycle will be ever more in demand. In accord with these needs are three recommendations.

■ Recommendation 3a. Train data scientists in biomedical science, providing an on-ramp to extend their research horizons to biomedicine and lead efforts to make biomedical digital research objects FAIR.

■ Recommendation 3b. Train biomedical scientists in data science methods and approaches, providing them new capabilities to analyze, visualize, and better understand their data.

■ Recommendation 3c. Train librarians and information scientists in data science, providing them with the knowledge and tools to lead crucial activities such as assuring digital research objects abide by FAIR principles, and implementing best practices in data management, including curation and preservation.

Recommendation 4. Promote a data science-literate biomedical workforce

As pervasive as genomics and molecular biology are across biomedical science, the pervasiveness of data science and related areas will surely eclipse them in the near future. For example, application of principles making data and other digital research objects findable, accessible, interoperable, and re-usable (FAIR) is widely endorsed by funders, publishers, patient advocacy groups, and more. This enthusiasm is not limited to particular (even large) domains of biomedical research. Similarly, the significance of good data management practices, including selection of data to share, curate, and preserve, is being widely acknowledged and will soon become an essential component of any successful research grant application, contract proposal, or other request for research funding. Finally, the tools, methods, and perspectives of data science will likely suffuse much of biomedical research, regardless of the topic. For these reasons and more, the entire biomedical research workforce will need to have a basic level of understanding about data science and related activities.

Although Recommendations 4a, b, and c refer to categories of the workforce, they are not meant to be limiting, and it is likely that general and basic training activities for one category might be easily adapted to serve another. General and basic training materials and approaches are also adaptable to other populations not currently part of the biomedical workforce, including students and those not well represented in data science. Finally, a wide range of approaches providing basic information about data science and related activities should be considered to best fit the purpose and audience, whether through webinars, hackathons, or any of many other possibilities.

■ Recommendation 4a. Work across NIH to identify and use mechanisms to broadly train biomedical investigators about the nature, power, and limitations of data science.

■ Recommendation 4b. Work across NIH to identify and use mechanisms to broadly train NIH program, review, and policy staff about the nature, power, and limitations of data science.

■ Recommendation 4c. Identify and use mechanisms to broadly train information professionals about the nature, power, and limitations of data science.

■ Recommendation 4d. Encourage the next generation of biomedical data scientists by engaging the broader public, especially students younger than college-age and populations not well represented in the current cohort of data scientists.

■ Recommendation 4e. Explore non-traditional training approaches to promote data science literacy across diverse audiences, including hackathons, boot camps, carpentry sessions, MOOCs, etc.

Recommendation 5. Promote programmatic coherence for biomedical data science training and workforce development across NIH

While the substantive focus of each IC varies, data science cuts across these different domains. As with other cross-cutting topics, such as the use of clinical common data elements, data science training would benefit from a trans-NIH forum that could provide awareness of ongoing and planned initiatives, as well as opportunities for collaboration or coordination across NIH. As this report makes clear, many kinds of training activities are currently underway, addressing very different audiences in different contexts, with the two largest categories being extramurally funded training and training of NIH staff. Establishing a trans-NIH forum for each of these data science training domains is recommended, with an explicit charge that each forum remain in communication with the other.

■ Recommendation 5a. Establish a trans-NIH committee to facilitate communication, collaboration, and coordination of extramural biomedical data science training and workforce development.

■ Recommendation 5b. Establish a trans-NIH committee to facilitate communication, collaboration, and coordination of biomedical data science training and workforce development of NIH staff.

5 Appendices

Appendix A - Qualitative Interview Guide

Below are the questions that guided the semi-structured interviews with the nine ICs’ extramural staff, described in section 2.3 of the report.

1. How does your IC define data science? What sorts of research and activities are within the scope of that definition? What is not within scope?

2. Our working definition of data science is the discipline that combines subject matter knowledge, mathematical and statistical expertise, and computer science and programming skills. Put another way, we’re interested in training programs that prepare researchers to use computational methods to apply statistical-based models to extract knowledge from datasets. Does that seem to fit with how your IC is scoping data science, or is it too broad or narrow?

3. Can you tell us how your FOAs and funded projects have worked out? What were the initial outcomes or findings that you can share with us?

4. Besides the FOAs and funded projects we’ve identified, is there anything related to data science training that your IC is involved with that we’ve missed?

5. How are you defining what “success” would look like in these programs? What are some of the goals and outcomes your IC is expecting will come out of these activities? Are there any sort of formal evaluation measures you’ve defined?

6. What are your IC’s plans going forward to support data science training? Is data science training something your IC considers a priority? What other data science efforts are needed to help prepare the workforce in your research areas, either within your IC or throughout the rest of NIH?

Appendix B – R Code Used for Text Mining and Topic Mapping Analysis

    library(topicmodels)
    library(tm)
    library(slam)
    library(dplyr)  # provides inner_join(), used in the final step

    setwd('~/Desktop/Text Mining for T Grants/')

    # Read in custom stopwords list
    mystopwords <- scan("stopwords.txt", what="varchar", skip=1)

    grants <- read.table('T_grants_2010-2017.txt', sep='\t', header=TRUE, stringsAsFactors = FALSE, quote='"')
    narrowed.grants <- grants[which(grants$Award.Status=="Awarded"),]
    docs <- narrowed.grants$Abstract
    id <- narrowed.grants$Full.Grant.Number

    # Perform text cleaning on corpus
    # (VectorSource is used here because DataframeSource in tm >= 0.7-2
    # expects a data frame with doc_id and text columns)
    corpDocs <- VCorpus(VectorSource(docs))
    corpDocs <- tm_map(corpDocs, removePunctuation)
    corpDocs <- tm_map(corpDocs, content_transformer(tolower))
    corpDocs <- tm_map(corpDocs, removeWords, stopwords("english"))
    corpDocs <- tm_map(corpDocs, stemDocument)
    corpDocs <- tm_map(corpDocs, stripWhitespace)
    corpDocs <- tm_map(corpDocs, removeWords, mystopwords)

    # Build the document-term matrix, drop empty documents and very sparse terms
    dtm <- DocumentTermMatrix(corpDocs)
    rownames(dtm) <- id
    dtm <- dtm[row_sums(dtm) > 0,]
    dtm <- removeSparseTerms(dtm, 0.99)

    # View most frequent terms to ensure stopword list was adequate to filter out
    # words not meaningful for analysis
    tDocs <- findMostFreqTerms(dtm)
    tDocs
    theTerms <- col_sums(dtm)
    theTerms <- sort(theTerms)
    tail(theTerms, 50)

    # Perform topic modeling analysis on corpus
    burnin <- 2000
    iter <- 1000
    thin <- 250
    seed <- list(2003,5,63)
    nstart <- 3
    best <- TRUE
    k <- 5
    ldaOutAimsAbstract <- LDA(dtm, k, method='Gibbs', control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))

    # Write out top 10 keywords associated with each category extracted from
    # topic modelling analysis
    ldaOut.terms <- as.matrix(terms(ldaOutAimsAbstract,10))
    write.csv(ldaOut.terms, file=paste('LDAGibbs_aims_only_better',k,'TopicsToTerms.csv'))

    # Create data frame with category number matched to the full project number
    # and information for visualization
    # (the join assumes narrowed.grants has a Grant.Number column holding the
    # same identifiers that were used as row names of the matrix above)
    ldaOut.topics <- as.data.frame(topics(ldaOutAimsAbstract))
    ldaOut.topics$Grant.Number <- rownames(ldaOut.topics)
    colnames(ldaOut.topics) <- c('Category', 'Grant.Number')
    topics.to.map <- inner_join(narrowed.grants, ldaOut.topics, by='Grant.Number')
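The final step of the code above produces topics.to.map, a table with one row per grant and its assigned topic category. As a minimal sketch of how such a table might be summarized, the base R fragment below counts grants per category and cross-tabulates categories by IC. The toy data, the grant identifiers, and the IC column are assumptions made for illustration only; they are not taken from the analysis in this report.

```r
# Toy stand-in for topics.to.map; all values are invented for illustration,
# and the IC column is a hypothetical name
topics.to.map <- data.frame(
  Grant.Number = c("G1", "G2", "G3", "G4", "G5", "G6"),
  IC           = c("NLM", "NLM", "NIGMS", "NHGRI", "NIGMS", "NLM"),
  Category     = c(1, 3, 1, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Number of grants assigned to each topic category
grants.per.topic <- table(topics.to.map$Category)

# Cross-tabulate IC by topic category to see where each IC's grants fall
ic.by.topic <- table(topics.to.map$IC, topics.to.map$Category)

print(grants.per.topic)
print(ic.by.topic)
```

A cross-tabulation of this kind is one simple way to prepare topic assignments for visualization; other summaries may suit the data better.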

Appendix C – Previous Report to the NLM Director

Data Science Training at NIH: Extramural and Intramural Activities, 2014 - 2017

Executive Summary

This report provides a summary of major NIH activities in support of data science training, including those directed toward the extramural community and those conducted at NIH for intramural researchers. For the purposes of this report, data science is defined as the discipline that sits at the intersection of subject matter knowledge, mathematical and statistical expertise, and computer science (i.e. programming) skills, and that utilizes computational methods to apply statistical-based models in order to extract knowledge from datasets. Based on an analysis of Funding Opportunity Announcements and funded grant proposals since FY2014, we provide an overview of the existing data science training landscape. Most existing grants were funded through NIH’s Big Data to Knowledge (BD2K) initiative; NHGRI and NLM are also major funders of data science training at NIH. Within the intramural community, a range of activities are available to researchers, most for free. Many of these activities are taught on a volunteer basis by researchers whose main duty is not teaching. Through this analysis, we have identified several gaps and opportunities in data science training at NIH:

● Compared to the broad scope of extramural training grants, relatively few projects that center on data science training are currently funded. Furthermore, all of the BD2K funding opportunities have expired, and BD2K-funded projects are soon to expire.

● Current awards focus on training undergraduate students and beyond. There are opportunities to develop training programs at the K-12 level, as well as programs training later career researchers who want to gain competency in data science.

● While intramural data science training opportunities do exist, they are often siloed activities with little coordination, oversight, or effort to increase awareness across the entire NIH intramural community.

1. Scope of this Report: Defining “Data Science”

Data science is a relatively new field, encompassing a variety of disciplines and methodologies. Given the rapid growth of scientific research data, many researchers utilize increasingly large datasets to answer their research questions, but not all science that relies on data, big or small, can accurately be considered data science. In addition, as researchers across many scientific fields begin to learn and utilize advanced computational tools in their work, the term “data science” is sometimes applied too broadly to a range of research activities that do not truly use data science methodologies.

Further complicating the issue within biomedical research, the fields of bioinformatics and data science overlap in many ways. Both utilize computational methodologies to draw knowledge from large datasets and require varying levels of expertise in science and statistics. Most bioinformatics research does incorporate data science methods to answer questions about certain types of biological data. Data science, on the other hand, uses these methods on a wider range of data, in terms of both data types and subject matter, and therefore generally requires a broader skillset than bioinformatics.

For the purposes of this report, data science is defined as the discipline that sits at the intersection of subject matter knowledge, mathematical and statistical expertise, and computer science (i.e. programming) skills. In other words, data science research utilizes computational methods to apply statistical-based models in order to extract knowledge from datasets. We have intentionally excluded activities that focus on developing skills for bioinformatics only, and instead focus on activities that aim to provide the skills that researchers will need to apply data science to a broad range of scientific questions. Training a workforce that can apply these methods to many types of biomedical data will be essential to initiatives around precision medicine, cancer research, and neuroscience.

NIH funds a broad range of activities aimed at researchers across their careers, from their start as students through to when they are established scientists. In this report, we include in the definition of training a full range of educational activities supported by various NIH funding mechanisms. These may include extramural funding for institutions to establish formal training programs in data science, such as those typically funded by T32. More informal activities, such as web-based resources like Massive Open Online Courses (MOOCs) and online tutorials that have been funded by R25s, are also considered relevant. Individual, rather than institutional, awards are also included, such as K01 fellowships for individual researchers interested in developing their skills in data science. Finally, we also include intramural activities aimed at training researchers at the NIH in data science skills.

2. Extramural Data Science Training Programs

The Common Fund’s Big Data to Knowledge (BD2K) initiative marked a major, concerted funding investment on behalf of NIH geared in part towards training a data science workforce. Extramurally funded projects include development of education programs at universities and online, as well as individual training awards for researchers to emerge as leaders in biomedical data science. However, all of BD2K’s funding opportunities have expired, with most of the resulting project awards expiring in the near future. NIH ICs have also supported data science training outside of BD2K. Below is a summary of extramural funding opportunities and awarded projects.

2.1 Funding Opportunity Announcements Related to Data Science Training

To assess the landscape of data science training programs supported by the NIH through extramural awards, relevant FOAs were first located and compiled by:

● Searching for the key terms: “data science”, “big data”, “computational”, “quantitative science”, and “informatics”.

● Restricting the search to announcements made between FY2014 and the present.

● Restricting the search to F, K, R25, and T awards.

Of the resulting FOAs, 22 fit the working definition of data science. 15 (68%) of those programs were related to the Big Data to Knowledge (BD2K) initiative. The breakdowns of activity codes and sponsoring ICs are detailed below (Figure 1.1).

Figure 1.1. Data Science Training FOAs broken down by activity code (left) and funding IC (right).

2.2 Data Science Training Awardees

To determine which data science training-related grants were ultimately funded by NIH, grants were located through NIH RePORTER using two strategies:

● Searching for grants that were awarded in response to the 22 FOAs located above.

● Searching for the key terms: “data science”, “big data”, “computational”, “quantitative science”, and “informatics”. The search was restricted to awards funded between FY2014 and the present through either a F, K, R25, or T mechanism. From the results, we selected awards that fit the working definition for data science based on the project’s abstract. This search accounted for awards that could still be considered data science training awards, but were not explicitly in response to the 22 FOAs.
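The key-term search described above can be sketched in code. The R fragment below is a minimal illustration, assuming a small hypothetical table of awards; the column names (Activity.Code, Fiscal.Year, Abstract) and the sample rows are invented for the example and do not reflect the actual NIH RePORTER export. It applies the three filters described: the key terms, the FY2014 onward window, and the F, K, R25, and T mechanisms.

```r
# Hypothetical award records; column names and values are illustrative only
awards <- data.frame(
  Activity.Code = c("T32", "R01", "K01", "R25", "F31"),
  Fiscal.Year   = c(2015, 2016, 2013, 2017, 2016),
  Abstract      = c(
    "Predoctoral training in biomedical data science",
    "Big data approaches to cardiac imaging",
    "Career development in computational genomics",
    "Short course on wet-lab microscopy techniques",
    "Fellowship in clinical informatics methods"
  ),
  stringsAsFactors = FALSE
)

# Key terms quoted in the search strategy
key.terms <- c("data science", "big data", "computational",
               "quantitative science", "informatics")

# One pattern that matches any of the key terms
term.pattern <- paste(key.terms, collapse = "|")

# F, K, and T awards are any activity code starting with that letter;
# R25 is matched exactly
is.training <- grepl("^[FKT]|^R25$", awards$Activity.Code)
has.term    <- grepl(term.pattern, awards$Abstract, ignore.case = TRUE)
in.window   <- awards$Fiscal.Year >= 2014

# Candidate awards that pass all three filters
candidates <- awards[is.training & has.term & in.window, ]
print(candidates$Activity.Code)
```

A filter like this only yields candidates. The final selection described in this section, judging each abstract against the working definition of data science, was a manual review and is not reproduced in the sketch.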

172 funded grants were located in total; of those, 83 (41.5%) were awarded through the 22 FOAs. Awards were granted through 19 of the 27 ICs and administered through 20 ICs, underscoring an NIH-wide commitment to data science training. 77 (44.8%) of the awards were funded by NIH OD, and all 63 BD2K training awards were represented in the collected set of award results. NLM, NIEHS, and NIGMS collectively oversee just over half (52.9%) of the projects. A breakdown of funding and administering ICs is shown below (Figure 2.1).

Figure 2.1. Data Science training awards broken down by funding IC (left) and administering IC (right).

No one IC has taken charge of extramural data science training, as evidenced by the distribution of awards and funding opportunities across NIH. Because NIGMS funds and manages a number of data science training projects, the Director of the Division of Biomedical Technology, Bioinformatics, and Computational Biology (BBCB), Susan Gregurick, was interviewed. She noted that NIGMS makes a concerted effort not to target any one particular area of science, including data science. Their interest in data science training is guided by the community, and not mandated by the IC director. Furthermore, the Institute does not fund training programs specific to clinical data.

T32s (30.8%), R25s (21.5%), and K01s (20.3%) make up the bulk of data science training projects funded by NIH, with an emphasis on programs from the undergraduate research level and beyond. K-12 data science training, as well as training for later career scientists, therefore remain open areas for funding opportunities. The activity codes for funded projects are detailed below (Figure 2.2).

Figure 2.2. Data Science Training Awards broken down by Activity Code.

2.3 Activities from the BD2K Training Coordination Center (TCC)

The BD2K TCC is meant to coordinate and compile all the outputs of BD2K-funded projects into a common resource to improve data science skills and develop the biomedical data science workforce. The TCC aims to target a large audience spanning the scientific community and the public at large, at all ages and stages of career development. The TCC is also meant to coordinate and support innovative collaborations for data science training and innovation. The BD2K TCC grant was awarded to the University of Southern California for their Big Data U project, which coordinates BD2K training activities and hosts training and resources through BigDataU.org, with an investment of $6,952,055 awarded over three years, beginning in 2015.

The primary resource on offer is the Education Resource Discovery Index (ERuDIte). To ensure BD2K-supported resources meet FAIR standards, ERuDIte aggregates educational resources produced both by BD2K-funded projects and other available online offerings (i.e. MOOCs) by scraping the web, then assigning the appropriate metadata. In the future, ERuDIte is expected to perform as an adaptive, personalized online educational platform, recommending training resources based on user data and behavior. To do so, ERuDIte requires a high volume of users to provide the relevant data to train machine learning algorithms. However, ERuDIte usership remains low, with 179 registered users as of August 2017. Without a sufficiently engaged usership, the platform cannot deliver responsive, personalized training as intended. Furthermore, many of the resources, including several of the most highly viewed resources, are not related to biomedical data science; for example, “Sabermetrics 101: Introduction to Baseball Analytics” is the second most viewed resource. While general data science resources may be helpful, ERuDIte seems to contain quite a few resources related to specific but non-biomedical topics that may not be relevant if the target audience is users interested in biomedical data science.

The TCC also offered a weekly webinar series, the BD2K Guide to the Fundamentals of Data Science. The original series ran from September 9, 2016 to May 19, 2017, but there do not appear to be plans for another series in 2017/2018. The videos from each of the weekly presentations are archived on YouTube. Viewership of the archived BD2K Guide to the Fundamentals of Data Science on YouTube has declined over time, from 3,800 views per video for early sessions to approximately 250-500 views for later videos. The decline in viewership further underscores the need for effective outreach if the TCC is to become an effective and widely used resource.

The TCC also supports in-person training activities, including two five-day Data Science Innovation Labs with support from NIH and NSF. The Labs brought together about 30 investigators from different fields to tackle a given data science challenge. Additionally, the RoAD-Trip Intensive Data Science Residency Program partners junior biomedical researchers with data scientists to collaborate on a joint project over the span of a minimum of two weeks. The program is in its second year, with a second round of applications due September 2017.

3. Intramural Data Science Activities

In addition to the data science activities NIH funds within the extramural community, several ICs and groups have developed efforts intended to enhance the data science skills of the NIH intramural workforce. Most of these activities are open to all NIH staff, including contractors, and many are available as webinars for remote participation by individuals who are not able to come to the main campus in Bethesda.

3.1 Existing Training Efforts

The following activities and groups, listed alphabetically, provide training or support for data science. Except where noted, all are free and open to the NIH intramural community.

● CodeNIH (primary contact: various - see full contact list) This program is an “incubator” for postbacs, interns, and postdocs who are interested in learning how to code and do data science research. The group meets weekly to help motivate individuals in their learning and provide a space for networking with others interested in coding. They also maintain a GitHub repository of tutorials covering a variety of skill ranges, from basic to advanced, and several different programming languages, including Python, JavaScript, and MATLAB.

● Data Science Mentoring Program (primary contact: Lisa Federer and Ben Busby) To complement classroom training and provide learners the opportunity to gain real-world experience with data science, the Data Science Mentoring Program pairs researchers interested in learning about data science with more experienced mentors. Interested researchers complete a brief application form to allow the organizers to pair learners and mentors who have similar interests and goals. Since the start of the program in May, almost 75 mentor/learner pairs have been formed.

● FAES Bioinformatics and Data Science Department (primary contact: Ben Busby) FAES provides graduate-level scientific classes for credit at NIH. The Bioinformatics and Data Science Department provides semester-long classes on a variety of topics, including more theoretical or science-based classes, like calculus and statistics, as well as classes on computational tools, like R and Python. Classes are open to all, including non-NIH students, but are fee-based.

● High Performance Computing (HPC) Training (primary contact: Susan Chacko) NIH’s High Performance Computing (HPC) group administers several computing resources for the NIH community, including Biowulf (a 90,000-processor Linux cluster) and Helixweb (a set of web-based scientific tools). HPC staff offer Linux classes to help users learn how to use their Linux-based resources, as well as classes on cluster-based computing.

● Library Carpentry (primary contact: Kate Masterton) Based on the popular Software Carpentry and Data Carpentry workshops, Library Carpentry provides data science training taught by and for librarians from NLM and the NIH Library. Topics of instruction include an introduction to data science, OpenRefine, and data processing and visualization with R.

● NCI Bioinformatics Training and Education Program (BTEP) (primary contact: Anand Merchant) This group teaches hands-on courses that are open to the NIH intramural community, but space is limited and preference is given to CCR staff. Most of their classes are fairly bioinformatics-intensive, related to specific vendor tools or specific bioinformatics research approaches, but they also teach some workshops relevant to data science more broadly, such as statistics and R.

● NIAID Bioinformatics and Computational Biosciences Branch (BCBB)/CIT Workshops (primary contact: Burke Squires and Karlynn Noble) CIT hosts Seminars for Scientists in collaboration with instructors from NIAID’s BCBB. Workshops range from general topics, like Becoming a Reproducible Scientist, to more specific ones, like Homology Searching and Sequencing Alignment. CIT also hosts workshops on data science-related technology, including Python, UNIX, SQL/NoSQL, etc.

● NIMH Center for Multimodal Neuroimaging Workshops (primary contact: Adam Thomas) While this group is primarily a scientific research group, they sponsor workshops and other training events for the NIH community. Recent topics have included citizen science, machine learning, and other data science-related topics. They have also sponsored Software Carpentry.

● NIH Library Bioinformatics Support Program (primary contact: Lynn Young) The Library’s Bioinformatics Support Program features training by Library staff as well as visiting vendors who provide instruction on bioinformatics resources that the Library licenses. The program also provides one-on-one and online tutorials and consultations, as well as hosting special events like a Bioinformatics Symposium and Bioinformatics Game Tournament.

● NIH Library Data Services Program (primary contact: Lisa Federer) The Library’s Data Services program provides classroom- and webinar-based training on data management and data science, including classes on R, as well as one-on-one and group consultations. The program also coordinates Software Carpentry and Data Carpentry courses and provides training space for relevant courses in its laptop-equipped Training Room. In addition, the program hosts and sponsors various data science events, such as NIH Pi Day, Hour of Code, and Love Your Data Week.

● NINR Big Data Bootcamp (primary contact: Pamela Tamez) NINR has sponsored an annual summer “Big Data Bootcamp” since 2015. The one-week intensive training course is open to both intramural researchers and outside attendees. Lectures and workshops introduce topic-based subjects, like nutrigenomics and microbiomics; data science-specific subjects, like R and machine learning; and ethical and legal topics.

3.2 Coordination of Intramural Activities

Although a number of activities exist that address a variety of skills at varying levels of expertise, coordination of efforts and awareness within the NIH community remain significant challenges. Without a single coordinating unit to assist with planning activities, events and classes may overlap. Events are often announced via various email listservs or on the website of the sponsoring group, but NIH staff may not hear about relevant events if they don’t already know where to look for them. In addition, many of these activities are taught on a volunteer basis by researchers whose main duty is not teaching.

A few efforts have been made to help provide coordination and more centralized publicity. The NIH Data Science Scientific Interest Group (SIG) was formed in 2016 to help develop a community of practice for NIH staff interested in data science, as well as provide coordination for some of the many other SIGs whose focus is related to data science (such as Bioinformatics SIG, Single Cell Genomics SIG, Statistics SIG, etc.). The SIG’s email listserv has 348 subscribers as of September 2017. The SIG has also sponsored some of its own events, including talks on machine learning and biobanking and the first NIH Data Science Week, and it administers the NIH Data Science Mentoring Program.

In an effort to help plan data science-related training and coordinate instructors, an NIH Data Science Trainers Listserv was created after the first Software Carpentry “Train the Trainer” workshop in 2015. This listserv has 48 subscribers as of September 2017, including those who have completed Software/Data Carpentry training and others with an interest in teaching data science. While this listserv has primarily been used to coordinate Carpentry trainings, it could also serve in the future as a source of qualified instructors who are interested in providing training.

Another potentially useful resource is the NCI Bioinformatics Training & Education Program (BTEP) webpage, which is expected to contain an NIH-wide bioinformatics training calendar in the near future, listing BTEP’s own events as well as events from other groups, like the NIH Library and the Bioinformatics SIG. While there is some overlap between bioinformatics and data science, the DataScience@NIH webpage or another NLM resource could also provide a coordinating calendar similar to BTEP’s, but with a focus on data science-related training and events.