04/18/23 DRAFT August 13, 2012 1
DRAFT: Building Global Science Collaboratories
VIVO 2012 Conference Workshop
August 22, 1:00 pm – 4:30 pm
Workshop Faculty
• Anil Srivastava, President, Open Health Systems Laboratory (OHSL), co-located at Johns Hopkins University Montgomery County Campus, Shady Grove, MD, USA
• Paul Courtney, Project Manager, Dana-Farber Cancer Institute, Boston, MA, USA
• Ajai Kumar / Hemant Darbari / Swati Mehta / Vivek Koul, Centre for Development of Advanced Computing (C-DAC), Pune, India
• Rubayi Srivastava, Project Manager, Open Health Systems Laboratory (OHSL), CA, USA
• Juliusz Pukacki, Poznan Supercomputing and Networking Center (PSNC), Poznan, Poland
DRAFT Agenda
1:05 Anil: Introductions
1:15 Anil: Background and Overview
1:50 Paul: Bootstrapping the global collaboratory (Methods)
2:30 – 2:45 Break
2:45 CDAC: Techniques and experiences in extracting data & transforming it for VIVO (Results)
3:30 Juliusz: Role of VIVO, Semantic Web and Linked Open Data in advancing global science collaboration & enabling collaboratories (Discussion)
4:15 Anil: Future work/discussant
4:30 Workshop Ends
Faculty “assignments”
Anil
• Provide context, history, mission, and vision of OHSL
• Describe the programs and projects that concern OHSL and how VIVO fits into the portfolio
Paul
• Provide the vision for developing the Global Cancer Collaboratory and where it is going
• How this effort is connected with other informatics initiatives; historical context of caBIG and the NCI-NCRI informatics collaborations
• How is this different from simply putting up a VIVO instance at the OHSL campus in Shady Grove, MD? Incubating and nurturing connectivity across international boundaries requires a different “business model” than putting up an institutional VIVO site.
• Aggregating information (early web model) and the role of imperfect data (and its relationship to Tim Berners-Lee’s Linked Data model)
Rubayi
• Challenges of providing project management support for an international program spanning 12+ time zones, including logistical and knowledge-management support across multiple platforms and differing levels of technological infrastructure
CDAC
• Technical challenges of obtaining the same required information from multiple sites
• Examples of what was easily available at some sites, what was difficult, and how the challenges were addressed
Juliusz
• Semantic Web and Linked Data
Anil
• Background
• Current Indo-US collaboration projects underway
• Indo-US Cancer Research Grid
Research Networking Systems (RNS)
“…support individual researchers’ efforts to form and maintain optimal collaborative relationships for conducting productive research within a specific context.”1
Criteria:
• Involve shared 2-way interests
• Ongoing, sporadic interaction
• Creation of joint work products
1 Schleyer T, Butler BS, Song M, and Spallek H. 2012. Conceptualizing and advancing research networking systems. ACM Trans. Comput.-Hum. Interact. 19, 1, Article 2 (March 2012), 26 pages.
Research Networking Systems (RNS)1
Within institutions:
• VIVO
• Harvard Catalyst
• Stanford CAP
Across institutions:
• Distributed Interoperable Research Experts Collaboration Tool (DIRECT), a federated search tool that leverages the within-institution tools
• Research Gate
• Epernicus
• Academia.edu
• BioMed Experts (Elsevier)
• Elsevier SciVal® Experts
• Nature Network
1 Schleyer T, Butler BS, Song M, and Spallek H. 2012. Conceptualizing and advancing research networking systems. ACM Trans. Comput.-Hum. Interact. 19, 1, Article 2 (March 2012), 26 pages.
Research Networking System Models
Global Cancer Collaboratory (GCC) as RNS
1. Support individual researchers’ efforts to form and maintain optimal collaborative relationships: GCC will use VIVO as a tool to capture and store researcher information aggregated from cancer centers in India and the United States.
2. For conducting productive research: GCC will be a repository for papers written and for presentations and workshops produced.
3. Within a specific context: GCC focuses on supporting international collaborations in cancer research.
Global Cancer Collaboratory (GCC) as RNS
Information Aggregation in India
GCC Framework
Socio-technical approach
• Bootstrap by starting as an information aggregator (VIVO)
• Use a combination of manual and automated methods, as a matter of necessity, to pull in information from Indian cancer centers as well as from US cancer centers
• Imperfect data and missing data are expected
• OHSL, in partnership with CDAC, has established a VIVO environment [http://cdac-ohsl-vivo.cdac.in/vivo] as a core piece of a Research Networking System serving both countries, with a view to fostering the creation of team science consortia
• Discovered and developed tools to ease the process of information extraction from existing web sites
• Confluence wiki for document and mind sharing
• Other logistical efforts (Rubayi, later)
• Awareness of cultural, organizational, and working-style differences is critical
Model: Early Internet Portals
GCC Goals
Demonstrate efficacy of VIVO to provide an efficient means of discovering potential international collaboration partners.
Develop criteria & roadmap for researcher information to encourage institutional websites to be semantically compliant using shared ontologies.
Establish metrics to assess the effectiveness of our methods
GCC Activities
Tasks:
• Standardization of data and terminology across cancer centers
• Explore sources of researcher data, following a lowest-hanging-fruit model
• Explore sources of publication data, including the IndMED and medIND repositories
To date:
• Sent SugarCRM profiles to CDAC for ingestion into VIVO
• Semi-automatically and manually extracted data from cancer center sites in India and the US
• Addressed legal concerns raised by our partners in India about web-scraping information from cancer center websites and repackaging it for this project
Explicit steps
Unstructured & inconsistent data
Use CSS tags to provide structure; consistent provision of data
Use RDFa tagging using VIVO ontology namespace
Implement VIVO, Catalyst or CAP across institution
Perl, Python, DEiXTo
Manual extraction
Perl, Python, DEiXTo
How to link RNSs together?
National network using Direct2Experts
What about international networks?
GCC Future Work
To be done:
• Add publications: a PubMed search with cancer as MeSH Major Topic and [PL] India over the last decade returns 4844 articles
• Investigate use of the IndMED and medIND databases of publications in India
• Establish metrics to assess the effectiveness of our methods to:
  • Increase awareness of the potential for international collaboration
  • Increase awareness of the role of institutions in exposing researcher data that will benefit funding and research opportunities
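The PubMed search mentioned above could be scripted against the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and makes no network call. The date range (2002 to 2012), the literal topic word "cancer", and the function name are assumptions for illustration; the slide does not state the exact query, so the count of 4844 may not be reproduced.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_count_url(topic="cancer", place="India",
                           first_year=2002, last_year=2012):
    """Build an ESearch URL asking for a hit count only.

    The default years are an assumed reading of "last decade";
    the slide does not give the exact dates.
    """
    # [MeSH Major Topic] and [PL] are PubMed search field tags.
    term = f"{topic}[MeSH Major Topic] AND {place}[PL]"
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",      # filter on publication date
        "mindate": str(first_year),
        "maxdate": str(last_year),
        "rettype": "count",      # return only the number of matches
    }
    return ESEARCH + "?" + urlencode(params)
```

Fetching the returned URL (for example with `urllib.request.urlopen`) would give the current count, which can then be compared against the figure quoted on the slide.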
Rubayi
• Logistical challenges
• Communication
• Collaboration tools
CDAC Technical Challenges & Lessons Learned
• Data extraction and conditioning
• Ontology for each cancer center
DFCI
Full Name, Specialization, Department, Interests
Fred Hutch
Full Name, Designation/Appointment, Division, Interests, Phone, email, Fax
HCGOncology Cancer Center
Only one profile can be accessed at a time
Doctor Profile in HCGOncology Cancer Center
Fig: Doctor’s profile in HCGOncology
<table>
  <tr>
    <td><span class="txtblue">Name:</span></td>
    <td><span class="txtcont"> Dr Sanjay Mishra </span></td>
  </tr>
  <tr>
    <td><span class="txtblue">Qualification:</span></td>
    <td><span class="txtcont">M.D. (RT) </span></td>
  </tr>
  <tr>
    <td><span class="txtblue">Specialisation:</span></td>
    <td><span class="txtcont"> Radiation Oncology </span></td>
  </tr>
  <tr>
    <td><span class="txtblue">Location:</span></td>
    <td><span class="txtcont"> Hubli </span></td>
  </tr>
</table>
Data Structure: class="txtblue" is the label; class="txtcont" is the content
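The label/content convention described above lends itself to a simple pairing rule: remember the most recent "txtblue" text and attach the next "txtcont" text to it. The following is a minimal sketch using only the Python standard library. It is not the extraction tool CDAC built, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Pair each span.txtblue label with the span.txtcont text that follows it."""

    def __init__(self):
        super().__init__()
        self.profile = {}
        self._current_class = None   # class of the <span> we are inside, if any
        self._pending_label = None   # last label seen, waiting for its content

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "span":
            self._current_class = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current_class == "txtblue":
            self._pending_label = text.rstrip(":")
        elif self._current_class == "txtcont" and self._pending_label:
            self.profile[self._pending_label] = text
            self._pending_label = None


def parse_profile(html):
    """Return a dict of label -> content for one profile page fragment."""
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    return parser.profile
```

Because each profile lives on its own page, a real run would call `parse_profile` once per fetched page and collect the resulting dictionaries into rows for CSV export.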
Doctor Profile in HCGOncology Cancer Center
Name: Dr. N.K.Vinod; Qualification: AD, PDCCA; Specialization: Anesthesiologist; Location: Bangalore
Name: Dr.Prabha Seshachar; Qualification: MBBS, DA; Specialization: Anesthesiologist; Location: Bangalore
Name: Dr. H.C.Rajesh; Qualification: MD; Specialization: Anesthesiologist; Years of Experience: 16 yrs
Name: Dr. Gaurav Dwivedi; Qualification: MBBS, MD; Specialization: Anesthesiologist; Location: Delhi
Name: Dr. Kshirod Kumar Acharya; Qualification: MBBS, MS; Specialization: Anesthesiologist; Location: Cuttack
Name: Dr. Ganesh Nayak; Qualification: MS; Specialization: Cardio Thoracic Surgery; Location: Bangalore
Name: Dr. B C Bommaiah; Qualification: MD; Specialization: Cardiologist; Location: Bangalore
Name: Dr. Kshitish Ch. Mishra; Qualification: MBBS, MD; Specialization: Clinical Oncology; Location: Cuttack
<table>
  <tr>
    <td>
      <table>
        <tr>
          <td><img src="phpThumb.php?src=uploads/doctors_images/4f840e5181aa2.png&"/></td>
          <td>
            <table>
              <tr>
                <td><span class="txtblue">Name:</span></td>
                <td><span class="txtcont"> Dr Sanjay Mishra </span></td>
              </tr>
              <tr>
                <td><span class="txtblue">Qualification:</span></td>
                <td><span class="txtcont">M.D. (RT) </span></td>
              </tr>
              <tr>
                <td><span class="txtblue">Specialisation:</span></td>
                <td><span class="txtcont"> Radiation Oncology </span></td>
              </tr>
              <tr>
                <td><span class="txtblue">Location:</span></td>
                <td><span class="txtcont"> Hubli </span></td>
              </tr>
              ……..
</table>
Structure of Data for Profile in HCGOncology Cancer Center
• Data on the HCG Oncology site is presented in nested (embedded) tables.
• Every profile is on a separate page, so the data is difficult to retrieve using DEiXTo.
• CDAC has developed an extraction tool to get the data from this site.
Researcher Profile In Dana Farber Cancer Institute
Fig: Researcher’s profile in Dana Farber
<div class="abcGroup">
  <h2>A</h2>
  <ul>
    <li class="fLeft title">
      <a href="/directory/profile.asp?pgt=Gregory+A%2E+Abel%2C+MD%2C+MPH">Gregory A. Abel, MD, MPH</a>
      <em>Medical Oncologist</em>, Hematologic Oncology
    </li>
    <li class="oHide">
      <strong>Clinical Interest</strong>Leukemia, Myelodysplastic syndromes, Myeloproliferative disorders
    </li>
    <li class="clear">&nbsp;</li>
  </ul>
<div class="abcGroup"> <ul> <li class="fLeft title"> <a href="/directory/profile.asp?pgt=Gregory+A%2E+Abel%2C+MD%2C+MPH ">
Gregory A. Abel, MD, MPH</a> <em>Medical Oncologist</em>, Hematologic Oncology </li> <li class="oHide"> <strong>Clinical Interest</strong> Leukemia, Myelodysplastic syndromes, Myeloproliferative disorders </li> <li class="clear"> </li></ul>....</div>
Structure of Data for Profile in Dana Farber Cancer Center
<div class="abcGroup">
  <h2>A</h2>
  <ul>
    <li class="fLeft title">
      <a href="/directory/profile.asp?pgt=Gregory+A%2E+Abel%2C+MD%2C+MPH&">Gregory A. Abel, MD, MPH</a>
      <em>Medical Oncologist</em>, Hematologic Oncology
    </li>
    <li class="oHide">
      <strong>Clinical Interest</strong> Leukemia, Myelodysplastic syndromes, Myeloproliferative disorders
    </li>
    <li class="clear">&nbsp;</li>
  </ul>
  <ul>
    <li class="fLeft title">
      <a href="/directory/profile.asp?dbase=main&setsize=16&last_name=A&pgt=Janet+L%2E+Abrahm%2C+MD&grouptype_typeid_data=2&gs=r&nxtfmt=r&display=Y&pict_id=0000312">Janet L. Abrahm, MD</a>
      <em>Palliative Medicine Physician</em>, Palliative Care (Adult)
    </li>
    <li class="oHide">
      <strong>Clinical Interests</strong> Palliative medicine, Symptom management, End-of-life care
    </li>
    <li class="clear">&nbsp;</li>
  </ul>
Structure of Data for Profile in Dana Farber Cancer Center
Observation on Structure of Profile Data Present in Dana Farber Cancer Center
• Profile data on the Dana Farber Cancer Institute site is present in a structured, repeating form.
• Because the data is organized in a structured manner, it can be extracted using the DEiXTo tool.
DEiXTo (or ΔEiXTo) is a powerful web data extraction tool that is based on the W3C Document Object Model (DOM). It allows users to create highly accurate “extraction rules” (wrappers) that describe what pieces of data to scrape from a website.
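To illustrate why the repeating Dana Farber markup is easy to extract, here is a minimal sketch that walks the same list structure with Python's standard-library HTML parser. It does not use DEiXTo and is only a stand-in for the kind of rule a DEiXTo wrapper would express. The class name, function name, and output field names are assumptions chosen to match the fields listed for DFCI.

```python
from html.parser import HTMLParser


class DirectoryParser(HTMLParser):
    """Collect one record per <ul> in a listing shaped like the DFCI directory."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._record = None     # record for the <ul> currently being read
        self._li_class = None   # class attribute of the current <li>
        self._tag = None        # most recently opened tag still open

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self._record = {}
        elif tag == "li":
            self._li_class = dict(attrs).get("class", "")
        self._tag = tag

    def handle_endtag(self, tag):
        if tag == "ul":
            if self._record:
                self.records.append(self._record)
            self._record = None
        elif tag == "li":
            self._li_class = None
        self._tag = None

    def handle_data(self, data):
        # Drop surrounding whitespace, commas, and non-breaking spaces.
        text = data.strip(" ,\n\t\xa0")
        if not text or self._record is None or self._li_class is None:
            return
        if self._li_class == "fLeft title":
            if self._tag == "a":
                self._record["FullName"] = text
            elif self._tag == "em":
                self._record["Specialization"] = text
            else:
                self._record["Department"] = text
        elif self._li_class == "oHide" and self._tag != "strong":
            self._record["Interest"] = text


def parse_directory(html):
    """Return a list of profile dicts extracted from a directory page."""
    parser = DirectoryParser()
    parser.feed(html)
    parser.close()
    return parser.records
```

Each returned dictionary maps directly onto one CSV row, which is the intermediate format used before conversion to RDF.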
<tr>
  <td><font size="1"><b><font size="2">Kailash S. Sharma </font> </b>M.D., D.A. (Anesthesiology)</font></td>
  <td><img src="../../images/anaesthesia/drsharma.jpg"></td>
</tr>
<tr>
  <td><font size="1" face="Verdana, Arial"><b>Designation:</b> </font></td>
  <td><font size="1">Director Academics TMC </font></td>
</tr>
<tr>
  <td><font size="1" face="Verdana, Arial"><b>Area of Work:</b> </font></td>
  <td><font size="1">Anaesthesia</font></td>
</tr>
<tr>
  <td valign="top"><font size="1" face="Verdana, Arial"><b>Special Interests:</b></font></td>
  <td><font size="1">Difficult Airway<br> Monitoring<br> Cancer Pain</font></td>
</tr>
<tr>
  <td valign="top"><font size="1" face="Verdana, Arial"><b>Email :</b></font> </td>
  <td valign="top"><font size="1"><a href="mailto:[email protected]">[email protected]</a>
    <br> Phone No.</font> (+9122) 24177044
  </td>
</tr>
Structure of Profile Data Present in TATA Memorial Hospital
Observation on Structure of Profile Data Present in TATA Memorial Hospital
• Profile page data structure is not uniform; the format varies from profile to profile.
• Profiles contain insufficient data.
• Data was extracted manually.
References
1. http://www.dana-farber.org/
2. http://www.hcgoncology.com/
3. http://tmc.gov.in
4. http://en.wikipedia.org/wiki/Web_scraping
5. http://deixto.com/
Indo-US Cancer Collaboratory: A VIVO Pilot
Data Extraction from Website
Fig: Dana Farber profiles
Data extracted from DFCI and represented in CSV format
Ontology Creation
Fig: Create new ontology
Ontology Creation Success
Fig: New ontology created successfully
Class Creation
Fig: Create a new class inside an ontology
Create New Link to Super Class
Fig: Add a superclass link to this class
Selection of Superclass
Create New Link to Super class
Fig: Select a class as superclass from the dropdown list
Create New Link to Super Class
Fig: Identify the superclass link at cursor position
Create Data Property
Fig: Create a data property inside a class
Data Property Created
Fig: Data property created successfully
Create Object Property
Fig: Create a new object property
Model Creation
Fig: Creation of models to get the URIs
After Model Creation
Fig: Models created successfully
Convert CSV to RDF
Convert CSV to RDF
Fig: Convert CSV file to RDF
<http://192.168.81.96/vivo/individual/n1834679>
    a <http://192.168.81.96/vivo/cdac:DanaFarberProfiles> ;
    <http://192.168.81.96/vivo/cdac:danafarberprofiles_Department> "Hematologic Oncology" ;
    <http://192.168.81.96/vivo/cdac:danafarberprofiles_FirstName> "Edwin" ;
    <http://192.168.81.96/vivo/cdac:danafarberprofiles_FullName> "Edwin P. Alyea III ,MD" ;
    <http://192.168.81.96/vivo/cdac:danafarberprofiles_Interest> "Stem cell/ bone marrow transplant, Leukemia" ;
    <http://192.168.81.96/vivo/cdac:danafarberprofiles_LastName> "Alyea" ;
    <http://192.168.81.96/vivo/cdac:danafarberprofiles_MiddleName> "P." ;
    <http://192.168.81.96/vivo/cdac:danafarberprofiles_Organization> "Dana Farber" ;
    <http://192.168.81.96/vivo/cdac:danafarberprofiles_Qualification> "III,MD" ;
    <http://192.168.81.96/vivo/cdac:danafarberprofiles_Specialization> "Medical Oncologist" .
Diagram labels: Subject, Predicate, Object (the subject appears once, followed by repeated predicate-object pairs)
Ingested Data URIs
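The subject, predicate, object pattern in the ingested data can be reproduced from a CSV row with a few lines of code. The workshop performed this step through the VIVO ingest screens; the sketch below is an independent illustration only. The `http://example.org/vivo/` namespace, the sequential `n1, n2, ...` identifiers, and the function names are assumptions, and the predicate naming simply mirrors the pattern visible in the ingested triples.

```python
import csv
import io

BASE = "http://example.org/vivo/"  # hypothetical namespace, not the real server


def escape_literal(value):
    """Escape backslashes, double quotes, and newlines for a Turtle string."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def row_to_turtle(row, subject_uri, class_name="DanaFarberProfiles"):
    """Render one CSV row as one subject followed by predicate-object pairs."""
    prefix = BASE + "cdac:" + class_name.lower() + "_"
    lines = [f"<{subject_uri}> a <{BASE}cdac:{class_name}>"]
    for column, value in row.items():
        text = (value or "").strip()
        if text:  # skip empty cells; imperfect and missing data are expected
            lines.append(f'    <{prefix}{column}> "{escape_literal(text)}"')
    return " ;\n".join(lines) + " .\n"


def csv_to_turtle(csv_text):
    """Convert every CSV row, minting a sequential subject URI per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blocks = []
    for index, row in enumerate(reader, start=1):
        blocks.append(row_to_turtle(row, f"{BASE}individual/n{index}"))
    return "\n".join(blocks)
```

Every value is emitted as a plain string literal here, which matches the ingested data shown; linking people to shared department or organization resources would require object properties instead.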
SPARQL Query
Fig: SPARQL query to construct data
Execute SPARQL Query
Fig: Execution of the constructed SPARQL query
Execute SPARQL Query
Fig: SPARQL query executed successfully
Upload RDF
Fig: Upload RDF file
Fig: RDF successfully uploaded
View Uploaded Profiles
Fig: Uploaded profiles (click to view a profile)
Juliusz: Poznan Supercomputing and Networking Center
Semantic web and interoperability
Future work
Incorporate medIND and IndMED biomedical journal databases as well as PubMED into VIVO.