ANALYSIS AND NEWS

THE RISE OF THE MACHINES

Haralambos ‘Babis’ Marmanis describes some challenges and opportunities in text and data mining

16 Research Information OCTOBER/NOVEMBER 2015 @researchinfo www.researchinformation.info

Image: Sarah Holmlund/Shutterstock.com

From 2001: A Space Odyssey, The Terminator, and The Matrix trilogy to the recent Ex Machina, sci-fi movies have captured our imagination with the notion that machines will soon take over. The basic premise is that advances in artificial intelligence (AI) will reach a stage in which human control over machines will be challenged, leading to our ultimate annihilation.

Recent progress in AI and the related fields of machine learning and data mining has prompted living legends of business (e.g., Bill Gates, Elon Musk), science (e.g., Stephen Hawking), and technology (e.g., Steve Wozniak) to note that serious consideration must be given before the adoption of AI, especially as it relates to weaponry and its military applications.

While the rise of the machines in the above sense may not be imminent, it is undeniable that machine learning and data mining have already been adopted across all segments of industry and have had a tremendous impact on the way we live our lives and conduct our business. In particular, great strides have been made in life sciences through the use of text and data mining software systems to support drug repurposing, drug safety, and biomarker discovery.

Many other fields of life sciences have benefited from the application of text mining techniques as well. Beyond life sciences, the application of text and data mining has shown great promise in finance, marketing, and energy, as well as the chemical and food industries.

There are no theoretical limits to the applicability of text and data mining, and there are significant rewards for those who pursue its implementation in all businesses.

However, practical challenges surface as soon as a project passes from the inception and elaboration phase to its actual implementation.

The first decision that text and data miners have to make is the identification of an ‘orebody’. As with mining precious metals from an orebody, text and data mining is most effective and efficient when applied to information-rich and relatively high-density material. For example, in the case of life sciences there are at least three large bodies of work that a text and data miner would like to have access to:

• Clinical trial data;
• Patents related to life sciences; and
• Scientific literature.

Until recently, it was very difficult to obtain full-text scientific articles in a format convenient for mining and, more often than not, researchers had to be content with mining abstracts instead.

The difficulties are practical and fall into a variety of categories:

• Multiple sources: There are many sources of potentially valuable content – to reach out to each individually is extremely labor- and time-intensive. Moreover, the long tail cannot be ignored in text and data mining. Sometimes a few specialised articles can be the most valuable in forming the appropriate connections between two (otherwise disconnected) larger bodies of work.
• Format: Even when gathering content from a single source (e.g., a specific publisher), the format of the full-text articles can differ for articles that appear in different journals. Furthermore, it is possible that a format may not be machine-readable without preprocessing.
• Agreements and feeds: Obtaining agreements and feeds is complicated and expensive. There are high costs for companies to negotiate directly with content providers, obtain feeds, and deal with the above-mentioned differences in formats. In most cases, this process must be maintained over time and cannot be viewed as a ‘one-off’ effort.
• Approved content only: Researchers will typically reach out to sources that have been approved by their information procurement office and may miss out on other channels which, for reasons unrelated to text and data mining, have not been included in their list.

The publishing industry has been concerned with the above challenges for a long time and some solutions have begun to emerge in the market. In particular, Copyright Clearance Center, in collaboration with many leading publishers, is now offering RightFind XML for Mining. This service ameliorates and, in some cases, completely eliminates a number of the challenges cited above by offering a single source for the ‘orebody’ – that is, access to top peer-reviewed full-text scientific articles from many sources, all in a single machine-readable format.

Solutions like CCC’s RightFind XML for Mining will accelerate research and play a catalytic role in innovation and new discoveries because this type of service eliminates many of the practical challenges involved in the research workflow, allowing researchers to focus on the opportunities that lie ahead.

Haralambos ‘Babis’ Marmanis is Copyright Clearance Center’s chief technology officer and vice president for engineering and product.
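To make the format challenge described above more concrete, the following is a minimal Python sketch of what normalising full-text articles from differently structured XML feeds into one common record might look like. Everything in it is an assumption made for illustration: the source names (`publisher_a`, `publisher_b`), their tag layouts, the `Article` record, and the `normalise` helper are hypothetical and are not taken from RightFind XML for Mining or from any real publisher schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Article:
    """Common record that every source is normalised into (hypothetical)."""
    source: str
    title: str
    paragraphs: list[str]


# Hypothetical per-source layouts: where the title and body paragraphs live.
# Real publisher schemas differ and would each need their own entry.
LAYOUTS = {
    "publisher_a": {"title": ".//article-title", "paragraph": ".//body//p"},
    "publisher_b": {"title": "./heading", "paragraph": "./text/para"},
}


def clean(element: ET.Element) -> str:
    """Flatten an element to plain text and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


def normalise(source: str, xml_text: str) -> Article:
    """Parse one article using the layout registered for its source."""
    layout = LAYOUTS[source]
    root = ET.fromstring(xml_text)
    title = root.find(layout["title"])
    if title is None:
        raise ValueError(f"no title found for source {source!r}")
    paragraphs = [clean(p) for p in root.findall(layout["paragraph"])]
    return Article(source, clean(title), paragraphs)


# Two invented documents carrying the same kind of content in different shapes.
SAMPLE_A = """<article><front><article-meta><title-group>
<article-title>Drug  repurposing</article-title></title-group></article-meta></front>
<body><sec><p>First <italic>finding</italic>.</p></sec></body></article>"""

SAMPLE_B = (
    "<doc><heading>Biomarker discovery</heading>"
    "<text><para>Second finding.</para></text></doc>"
)

corpus = [normalise("publisher_a", SAMPLE_A), normalise("publisher_b", SAMPLE_B)]
for article in corpus:
    print(article.source, "|", article.title, "|", article.paragraphs)
```

Even in this toy form, each additional source needs its own layout entry and ongoing maintenance, which is the recurring cost the article attributes to handling many sources and formats directly.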
