A REST API for The IUPAC Solubility Data Series: A ‘Skunkworks’ Project Stuart J. Chalk Department of Chemistry University of North Florida [email protected] 2014 Fall ACS Meeting

ACS 248th Paper 108 NIST-IUPAC Solubility Data

DESCRIPTION

A 'skunkworks' project to re-purpose data from the published NIST-IUPAC Solubility Data volumes via a REST interface using CakePHP and MySQL

Page 1: A REST API for The IUPAC Solubility Data Series: A ‘Skunkworks’ Project

Stuart J. Chalk
Department of Chemistry
University of North Florida
[email protected]

2014 Fall ACS Meeting

Page 2: Outline

- Motivation
- What is Website ‘Scraping’?
- What are REST and API?
- Project Process
- NIST Website Analysis
- Database Definition
- Data Ingestion
- Project Website Design
- Using the Website
- Future Plans
- Conclusion

Page 3: Motivation

- Linked Open Data (LOD) is important for science
- Defining a process for grabbing high quality science data and making it semantically available is useful
- Providing a REST API makes information easy to find
- Providing unique REST URLs for data allows linking
- A semantic description of data makes it more useful
- Increase value added -> link data to other available data
- SDS data is fundamentally important to chemistry

(1) http://en.wikipedia.org/wiki/Linked_data

Page 4: What is Website Scraping?

- Data in web pages is available for users to copy/paste
- When the amount of available data is large, automation using scripts is necessary
- ‘Scraping’ is the processing of web page data using a scripting language
- Data can be captured and stored in any format
- It is most useful to capture data in a relational database so that it can be repurposed at another website
- This is usually done without the permission of the authors of the ‘scraped’ web page(s)

Page 5: What are REST and API?

- Representational State Transfer (REST) is “a software architectural style consisting of a coordinated set of architectural constraints applied to components, connectors, and data elements, within a distributed hypermedia system” (2)
- REST is applied to websites as a style for providing URL access to information in a structured, human-readable way
- An Application Programming Interface (API) is a standardized way for one computer/software system to talk to another. For REST this is a set of remote (HTTP) calls to pre-defined URLs

(2) http://en.wikipedia.org/wiki/Representational_state_transfer
(3) http://en.wikipedia.org/wiki/API

Page 6: Project Process

- Analysis of current NIST Solubility Database website
- Definition of database tables needed
- Code generation to automate data scraping
- Data cleanup
- REST API definition and description
- REST API development
- Output file format generation
- Addition of bells and whistles (if there’s time)

Page 7: NIST Website Analysis

- http://srdata.nist.gov/solubility/dataSeries.aspx contains links to all the volumes that are available => volID
- http://srdata.nist.gov/solubility/sys_category.aspx contains all the system types as part of a select list => typeID
- http://srdata.nist.gov/solubility/sol_sys_lst.aspx?sysID=<typeID>&FROM=SSN contains the different datasets for a specific system type => sysID
- http://srdata.nist.gov/solubility/sol_detail.aspx?sysID=<sysID> contains details of the system: citation, data tables, refs, preparer etc.
- http://srdata.nist.gov/solubility/sol_2casno.aspx?STR1=<CASRN1>&STR2=<CASRN2>&OPTION=CASNO allows searching by chemical CASRN (also by name (OPTION=CHEM) or formula (OPTION=MOL))
- http://srdata.nist.gov/solubility/citation_detail.aspx?REF_NO=<?REFNO?> allows searching system data by paper
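The URL patterns listed on this slide can be captured in a few small helpers. The following is a minimal sketch in Python (the project itself used PHP scripts, so this is an illustration and not the author's code). The function names are hypothetical; only the base URL, page names, and query parameters come from the slide.

```python
from urllib.parse import urlencode

# Base URL of the NIST solubility site, as given on the slide.
BASE = "http://srdata.nist.gov/solubility/"


def system_list_url(type_id):
    """URL of the page listing datasets for one system type.

    Note that the site passes the typeID in a parameter named sysID.
    """
    return BASE + "sol_sys_lst.aspx?" + urlencode({"sysID": type_id, "FROM": "SSN"})


def system_detail_url(sys_id):
    """URL of the detail page for a single system."""
    return BASE + "sol_detail.aspx?" + urlencode({"sysID": sys_id})


def casrn_search_url(casrn1, casrn2=""):
    """URL for a search by one or two CAS Registry Numbers."""
    query = {"STR1": casrn1, "STR2": casrn2, "OPTION": "CASNO"}
    return BASE + "sol_2casno.aspx?" + urlencode(query)
```

Using `urlencode` rather than string concatenation keeps any special characters in an identifier properly escaped.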

Page 8: Database Definition

- What types of data are available and how should they be organized?
  - By Volume => volID
  - By System Type => typeID
  - By System => sysID
  - By Chemical => CASRN, name, formula
  - By Citation => refNO
  - By Author (new)
- Also added Tables and Variables during development
- Note: the actual site uses sysID both for the system type and for a particular set of data about a system type

Page 9: Data Ingestion

- Data was imported into MySQL either from a tab-delimited text file or by insertion via PHP scripts
- The volume IDs were scraped from the HTML of http://srdata.nist.gov/solubility/dataSeries.aspx and cleaned up to generate a tab-delimited text file => 18 rows
- Similarly, the system types were scraped from http://srdata.nist.gov/solubility/sys_category.aspx into a tab-delimited text file => 2564 rows
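Loading a tab-delimited file into a database table can be sketched as follows. This is an illustrative Python example, not the original import: the in-memory `sqlite3` database stands in for MySQL, and the table name, column names, and sample rows are all invented for the example.

```python
import csv
import io
import sqlite3


def load_tsv(conn, table, columns, tsv_text):
    """Insert each tab-delimited row of tsv_text into table; return the row count."""
    reader = csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t")
    rows = [tuple(row) for row in reader if row]  # skip blank lines
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    conn.executemany(sql, rows)
    return len(rows)


# sqlite3 is used here only so the sketch runs without a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE volumes (vol_id TEXT PRIMARY KEY, title TEXT)")

# Hypothetical sample data in the same tab-delimited shape.
sample = "v1\tExample volume A\nv2\tExample volume B\n"
inserted = load_tsv(conn, "volumes", ["vol_id", "title"], sample)
```

The table and column names are interpolated into the SQL string, so they should only ever come from trusted code, while the row values go through `?` placeholders.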

Page 10: Data Ingestion

- Individual systems with data were scraped using a PHP script, which involved
  - Lookup of system type and retrieval of typeID
  - Construction of system type page URL: http://srdata.nist.gov/solubility/sol_sys_lst.aspx?sysID=<typeID>&FROM=SSN
  - Retrieval of the page content (HTML) into a PHP variable
  - PCRE regex match for the sysID of each system
  - Creation of a new entry in the system database table
- => 4817 rows
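The regex step above can be sketched like this. The example is in Python, whose `re` syntax is close to PCRE for a pattern this simple; the original used PHP. The page markup is an assumption: the sketch supposes each system appears as a link to `sol_detail.aspx?sysID=<sysID>` and that IDs consist of letters, digits, and underscores. The sample HTML is invented.

```python
import re

# Assumed markup: each system is linked as sol_detail.aspx?sysID=<sysID>.
SYSID_PATTERN = re.compile(r"sol_detail\.aspx\?sysID=(\w+)", re.IGNORECASE)


def extract_sys_ids(html):
    """Return the unique sysIDs found in html, in order of first appearance."""
    seen = []
    for match in SYSID_PATTERN.finditer(html):
        sys_id = match.group(1)
        if sys_id not in seen:
            seen.append(sys_id)
    return seen


# Invented sample page fragment for illustration only.
sample_html = """
<a href="sol_detail.aspx?sysID=20_135">System A</a>
<a href="sol_detail.aspx?sysID=20_136">System B</a>
<a href="sol_detail.aspx?sysID=20_135">System A (repeated link)</a>
"""
sys_ids = extract_sys_ids(sample_html)
```

Each extracted ID would then become one new row in the systems table.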

Page 11: Data Ingestion

- System details were scraped using a PHP script by
  - Lookup of system and retrieval of sysID
  - Construction of system detail page URL: http://srdata.nist.gov/solubility/sol_detail.aspx?sysID=<sysID>
  - Retrieval of the page content (HTML) into a PHP variable
  - Processing of HTML to retrieve citation, variables, data analysis and tables, method, source, errors, references
  - Saving of details to systems table and related tables

Page 12: Data Ingestion

- In addition to data extraction
  - Chemical InChI strings were retrieved from NIH CIR (1)
  - Citation DOI’s were retrieved from CrossRef (2) and saved (article titles and full author names were also added)
  - Data tables were converted to JSON format for storage and reproduction
  - Table notes, sources, and additional refs were converted to JSON for storage

(1) http://cactus.nci.nih.gov/chemical/structure
(2) http://www.crossref.org
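Storing a data table as JSON and reproducing it later can be sketched as follows. The slides do not describe the JSON layout actually used, so the `headers`/`rows` structure here is an assumption, and the column labels and values are made up for illustration.

```python
import json


def table_to_json(headers, rows):
    """Serialize a table (column headers plus rows) to a JSON string for storage."""
    return json.dumps({"headers": headers, "rows": rows}, ensure_ascii=False)


def json_to_table(text):
    """Rebuild (headers, rows) from a stored JSON string."""
    data = json.loads(text)
    return data["headers"], data["rows"]


# Made-up example values, not taken from the Solubility Data Series.
headers = ["T/K", "x1"]
rows = [[298.15, 0.0012], [308.15, 0.0015]]

stored = table_to_json(headers, rows)  # string suitable for a text column
restored_headers, restored_rows = json_to_table(stored)
```

Keeping the whole table in one text field makes reproduction on a web page simple, at the cost of the values not being individually searchable in SQL.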

Page 13: Database

Page 14: Database

Page 15: Project Website Design

- Constructed using the CakePHP framework (PHP)
- Index (listing) and view pages for each of
  - Authors
  - Chemicals
  - Citations
  - Systems
  - System Types
  - Volumes
- Search functionality provided via the homepage
- Example URL: http://chalk.coas.unf.edu/solubility/systems/view/20_135
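The example URL follows the conventional CakePHP pattern of `/controller/action/parameter`. The sketch below, in Python, splits such a URL into those parts. The function name is hypothetical, and treating `/solubility` as the application's base path is an assumption read off the example URL.

```python
from urllib.parse import urlparse


def parse_rest_url(url, base_path="/solubility"):
    """Split a URL into controller, action, and parameters (CakePHP-style)."""
    path = urlparse(url).path
    if not path.startswith(base_path + "/"):
        raise ValueError(f"URL is not under {base_path}: {url}")
    parts = [p for p in path[len(base_path):].split("/") if p]
    return {
        "controller": parts[0] if parts else None,
        # By CakePHP convention the action defaults to "index" when omitted.
        "action": parts[1] if len(parts) > 1 else "index",
        "params": parts[2:],
    }


route = parse_rest_url("http://chalk.coas.unf.edu/solubility/systems/view/20_135")
```

Here `route` identifies the `systems` controller, the `view` action, and the single parameter `20_135`, which is what gives each record its own linkable URL.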

Page 16: Project Website Design

Page 17: Project Website Design

Page 18: Future Plans

- Get this project funded
- Clean up references and link to DOI’s
- Clean up authors and link to ORCIDs
- Add procedural references
- Convert table data into searchable/linked format
- Add measurement type, unit, error, and variables
- Provide searching and plotting of data
- Automated calculation of additional parameters, e.g. solubility in different units, mole ratio
- Create solubility ontology => add RDF + searching
- Add microdata (1) to each web page
- Next phase? => Add the other volumes

(1) http://www.w3.org/TR/microdata/
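As an illustration of the planned automated calculations, the sketch below converts a solute mole fraction into a mole ratio and a molality. It assumes a binary system (one solute, one solvent); the function names are hypothetical and nothing here is taken from the project's code. For solute mole fraction x, the mole ratio is x / (1 - x), and the molality in mol/kg is 1000 x / ((1 - x) M), where M is the solvent molar mass in g/mol.

```python
def mole_fraction_to_mole_ratio(x_solute):
    """Moles of solute per mole of solvent in a binary system."""
    if not 0.0 <= x_solute < 1.0:
        raise ValueError("mole fraction must satisfy 0 <= x < 1")
    return x_solute / (1.0 - x_solute)


def mole_fraction_to_molality(x_solute, solvent_molar_mass_g_mol):
    """Moles of solute per kilogram of solvent in a binary system."""
    if solvent_molar_mass_g_mol <= 0:
        raise ValueError("solvent molar mass must be positive")
    ratio = mole_fraction_to_mole_ratio(x_solute)
    # 1 mol of solvent weighs M grams, i.e. M / 1000 kilograms.
    return 1000.0 * ratio / solvent_molar_mass_g_mol
```

For example, a mole fraction of 0.2 corresponds to a mole ratio of 0.25, since 0.2 / 0.8 = 0.25.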

Page 19: Conclusion

- A RESTful version of the IUPAC-NIST Solubility Series Database was successfully created and made available
- Metrics
  - 20 Volumes
  - 2564 System Types
  - 4817 Systems
  - 1484 Chemicals
  - 1247 References
  - 1968 Authors
  - 11 MB size of database
- One week’s worth of work