AI-Powered Digital Innovation
Prof. Byoung-Tak Zhang / Department of Computer Science and Engineering, Seoul National University
Contents
1. AI Technology and Industry Trends - Intelligence Explosion
2. Digital Innovation Based on AI + Big Data - Smart Machines
3. The Future of AI - Cognitive AI
1. AI Technology and Industry Trends
• Artificial Intelligence (AI): "machines (computers, software, robots) that think like humans and act like humans"
• Research on enabling machines to do things that humans currently do better than machines
• Research on enabling machines to do tasks that require intelligence
• 1950: Turing Test; 1956: the term "Artificial Intelligence (AI)" is coined
• Weak AI, Strong AI, Artificial General Intelligence (AGI)
• Machine Learning (ML): techniques for automatically building AI systems through learning
• Deep Learning (DL): a class of neural-network-based machine learning models; complex neural networks with many layers
The Concept of Artificial Intelligence (AI)
1970s-1980s: Boom
Expert/knowledge-based systems
1982-1992:
Fifth Generation Computer Systems (FGCS) project
1990s: AI winter
Neural nets, genetic algorithms, fuzzy logic
Late 1990s:
Internet, Web, e-commerce
Information retrieval, data mining
Amazon, eBay, Yahoo, Google
2010s: Revival
Intelligent agents
Machine learning / deep learning
ⓒ 2005-2014 SNU Biointelligence Laboratory, http://bi.snu.ac.kr/
History of Artificial Intelligence (AI)
IBM “Deep Blue” Chess Machine Beats Human Champion (1997)
AI Grand Challenge: Thinking Machines
Deep Blue (1997), Watson (2011), AlphaGo (2016)
Self-driving Cars
RHINO Museum Tour Guide (1997), DARPA Grand Challenge (2005), Google Self-driving Car (2010)
• Conventional AI
  - Injects human knowledge into the machine through programming
  - Relies on optimization backed by the machine's raw computing power and memory capacity
• AlphaGo
  - Learns even game strategy on its own through deep learning and reinforcement learning
  - Continuously improves its own performance through self-learning
  - A case demonstrating the possibility of an intelligence explosion that overcomes human weaknesses and builds on machine strengths
AlphaGo and New AI
Intelligence Explosion and Singularity
Enabling Technology for the Intelligence Explosion
• Multiple decision boundaries are needed (e.g., the XOR problem) → multiple units
• More complex regions are needed (e.g., polygons) → multiple layers
• Big Data + Deep Learning (see the sketch below)
Automatic Programming
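A minimal sketch of the point above, not from the slides: a single linear unit cannot carve out the two decision boundaries XOR needs, but one hidden layer of a few units learns it. Plain NumPy with full-batch gradient descent; the seed, layer width, and learning rate are illustrative choices.

```python
# XOR needs multiple units/layers: a 2-4-1 network trained with plain
# gradient descent on squared error (illustrative hyperparameters).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])          # XOR truth table

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)  # hidden layer (4 tanh units)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)  # output unit (sigmoid)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(10000):
    h = np.tanh(X @ W1 + b1)                    # hidden activations
    out = sigmoid(h @ W2 + b2)                  # network output
    delta2 = (out - y) * out * (1 - out)        # backprop through the output
    delta1 = (delta2 @ W2.T) * (1 - h ** 2)     # backprop through the hidden layer
    W2 -= lr * (h.T @ delta2); b2 -= lr * delta2.sum(0)
    W1 -= lr * (X.T @ delta1); b1 -= lr * delta1.sum(0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```

A single unit trained the same way cannot fit all four points, which is exactly the XOR limitation the first bullet refers to.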
Why Deep Learning?
• ImageNet Large-Scale Visual Recognition Challenge
• Image classification / localization
• 1.2M labeled images, 1,000 classes
• Convolutional neural networks (CNNs) have dominated the contest since 2012 (top-5 error; scoring illustrated below):
  - 2012 (best non-CNN): 26.2%
  - 2012 (Hinton, AlexNet): 15.3%
  - 2013 (Clarifai): 11.2%
  - 2014 (Google, GoogLeNet): 6.7%
  - pre-2015 (Google): 4.9%
• Beyond human-level performance
Deep Learning – Object Recognition (Big Data)
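A side note on how the percentages above are scored: "top-5 error" counts an image as correct when the true class appears among the model's five highest-scoring classes. A small illustration with synthetic scores, not real model outputs:

```python
# Top-5 error: fraction of images whose true class is NOT among the
# 5 highest-scoring predictions (synthetic scores for illustration).
import numpy as np

def top5_error(scores, labels):
    """scores: (N, C) class scores; labels: (N,) true class indices."""
    top5 = np.argsort(scores, axis=1)[:, -5:]        # 5 best classes per image
    hits = (top5 == labels[:, None]).any(axis=1)     # true class among them?
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 1000))               # 1000 images, 1000 classes
labels = rng.integers(0, 1000, size=1000)
print(f"random classifier top-5 error: {top5_error(scores, labels):.3f}")
# ~0.995 for random guessing, which puts AlexNet's 15.3% in perspective
```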
• ~2010: GMM-HMM (Dynamic Bayesian Models)
• ~2013: DNN-HMM (Deep Neural Networks)
• Current: LSTM-RNN (Recurrent Neural Networks; one cell step is sketched below)
Deep Learning – Speech Recognition (Big Data)
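To make the LSTM-RNN stage concrete, here is a minimal NumPy sketch of a single LSTM cell step using the standard gate equations. The 13-dimensional input (an MFCC-like frame size) and the state width are illustrative assumptions, not a trained acoustic model.

```python
# One LSTM time step: the gated cell state is what lets recurrent networks
# track long speech contexts better than frame-by-frame GMM/DNN-HMM models.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """x: input frame (d,); h, c: hidden/cell state (n,).
    W (4n, d), U (4n, n), b (4n,) stack the 4 gate parameter blocks."""
    z = W @ x + U @ h + b
    n = h.shape[0]
    i, f, o = (sigmoid(z[k * n:(k + 1) * n]) for k in range(3))  # input/forget/output gates
    g = np.tanh(z[3 * n:])                                       # candidate update
    c_new = f * c + i * g            # gated memory carries information over time
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# run a toy sequence of 50 frames with 13 features each through the cell
rng = np.random.default_rng(0)
d, n = 13, 32
W = rng.normal(scale=0.1, size=(4 * n, d))
U = rng.normal(scale=0.1, size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(50, d)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (32,): the state an acoustic model would classify
```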
• Use 3D CNNs to model the temporal patterns as well as the spatial patterns
Deep Learning – Video Analysis (Big Data)
3D Convolutional Neural Networks for Human Action Recognition (excerpt from S. Ji et al., PAMI 2013)
Figure 1. Comparison of 2D (a) and 3D (b) convolutions. In (b) the size of the convolution kernel in the temporal dimension is 3, and the sets of connections are color-coded so that the shared weights are in the same color. In 3D convolution, the same 3D kernel is applied to overlapping 3D cubes in the input video to extract motion features.
previous layer, thereby capturing motion information. Formally, the value at position (x, y, z) on the j-th feature map in the i-th layer is given by
$$v_{ij}^{xyz} = \tanh\!\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} \sum_{r=0}^{R_i - 1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \right) \tag{2}$$

where $R_i$ is the size of the 3D kernel along the temporal dimension, and $w_{ijm}^{pqr}$ is the $(p, q, r)$-th value of the kernel connected to the $m$-th feature map in the previous layer. A comparison of 2D and 3D convolutions is given in Figure 1.
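Eq. (2) transcribed directly into NumPy as a naive triple loop, to make the index bookkeeping concrete; the input sizes below are illustrative, and a real implementation would use an optimized convolution routine.

```python
# Naive implementation of Eq. (2): one output feature map from a set of
# previous-layer maps, with one 3D kernel per previous map and a tanh.
import numpy as np

def conv3d_unit(prev, kernels, bias):
    """prev: (M, X, Y, Z) maps of layer i-1 (Z = temporal dimension).
    kernels: (M, P, Q, R) weights w_ijm; bias: scalar b_ij."""
    M, X, Y, Z = prev.shape
    _, P, Q, R = kernels.shape
    out = np.zeros((X - P + 1, Y - Q + 1, Z - R + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # sum over m, p, q, r exactly as in Eq. (2)
                patch = prev[:, x:x + P, y:y + Q, z:z + R]
                out[x, y, z] = np.tanh(bias + np.sum(kernels * patch))
    return out

# 7 gray frames of 60x40 with a 7x7x3 kernel, as in the architecture below
rng = np.random.default_rng(0)
frames = rng.normal(size=(1, 60, 40, 7))
kernel = rng.normal(scale=0.1, size=(1, 7, 7, 3))
print(conv3d_unit(frames, kernel, 0.0).shape)  # (54, 34, 5): matches C2's 54x34
```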
Note that a 3D convolutional kernel can only extract one type of features from the frame cube, since the kernel weights are replicated across the entire cube. A general design principle of CNNs is that the number of feature maps should be increased in late layers by generating multiple types of features from the same set of lower-level feature maps. Similar to the case of 2D convolution, this can be achieved by applying multiple 3D convolutions with distinct kernels to the same location in the previous layer (Figure 2).

Figure 2. Extraction of multiple features from contiguous frames. Multiple 3D convolutions can be applied to contiguous frames to extract multiple features. As in Figure 1, the sets of connections are color-coded so that the shared weights are in the same color. Note that all the 6 sets of connections do not share weights, resulting in two different feature maps on the right.
2.2. A 3D CNN Architecture
Based on the 3D convolution described above, a variety of CNN architectures can be devised. In the following, we describe a 3D CNN architecture that we have developed for human action recognition on the TRECVID data set. In this architecture, shown in Figure 3, we consider 7 frames of size 60×40 centered on the current frame as inputs to the 3D CNN model. We first apply a set of hardwired kernels to generate multiple channels of information from the input frames. This results in 33 feature maps in the second layer in 5 different channels known as gray, gradient-x, gradient-y, optflow-x, and optflow-y. The gray channel contains the gray pixel values of the 7 input frames. The feature maps in the gradient-x and gradient-y channels are obtained by computing gradients along the horizontal and vertical directions, respectively, on each of the 7 input frames, and the optflow-x and optflow-y channels contain the optical flow fields, along the horizontal and vertical directions, respectively, computed from adjacent input frames. This hardwired layer is used to encode our prior knowledge on features, and this scheme usually leads to better performance as compared to random initialization.
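A hedged sketch of the hardwired layer just described: 7 gray frames of 60×40 expanded into 33 feature maps across the 5 channels. The optical-flow channels are stubbed with simple frame differences here; the paper computes actual optical flow fields.

```python
# Hardwired feature channels: gray (7) + gradient-x (7) + gradient-y (7)
# + optflow-x (6) + optflow-y (6) = 33 maps, as in layer H1.
import numpy as np

def hardwired_channels(frames):
    """frames: (7, 60, 40) gray frames -> list of 33 feature maps."""
    gray = list(frames)                                    # raw pixel values
    grad_x = [np.gradient(f, axis=1) for f in frames]      # horizontal gradients
    grad_y = [np.gradient(f, axis=0) for f in frames]      # vertical gradients
    # stand-in for optical flow: differences of the 6 adjacent frame pairs;
    # a faithful version would compute real flow fields per direction
    flow_x = [frames[t + 1] - frames[t] for t in range(6)]
    flow_y = [frames[t + 1] - frames[t] for t in range(6)]
    return gray + grad_x + grad_y + flow_x + flow_y

maps = hardwired_channels(np.random.default_rng(0).normal(size=(7, 60, 40)))
print(len(maps))  # 33, matching the 33 feature maps of the second layer
```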
Figure 3. A 3D CNN architecture for human action recognition: input 7@60×40 → (hardwired) H1: 33@60×40 → (7×7×3 3D convolution) C2: 23*2@54×34 → (2×2 subsampling) S3: 23*2@27×17 → (7×6×3 3D convolution) C4: 13*6@21×12 → (3×3 subsampling) S5: 13*6@7×4 → (7×4 convolution) C6: 128@1×1 → full connection. This architecture consists of 1 hardwired layer, 3 convolution layers, 2 subsampling layers, and 1 full connection layer. Detailed descriptions are given in the text.
We then apply 3D convolutions with a kernel size of 7×7×3 (7×7 in the spatial dimension and 3 in the temporal dimension) on each of the 5 channels separately. To increase the number of feature maps, two sets of different convolutions are applied at each location, resulting in 2 sets of feature maps in the C2 layer, each consisting of 23 feature maps. This layer contains 1,480 trainable parameters. In the subsequent subsampling layer S3, we apply 2×2 subsampling on each of the feature maps in the C2 layer, which leads to the same number of feature maps with reduced spatial resolution. The number of trainable parameters in this layer is 92. The next convolution layer C4 is obtained by applying 3D convolution with a kernel size of 7×6×3 on each of the 5 channels in the two sets of feature maps separately. To increase the number of feature maps, we apply 3 convolutions with different kernels at each location, leading to 6 distinct sets of feature maps in the C4 layer, each containing 13 feature maps. This layer contains 3,810 trainable parameters. The next layer S5 is obtained by applying 3×3 subsampling on each feature map in the C4 layer, which leads to the same number of feature maps with reduced spatial resolution. The number of trainable parameters in this layer is 156. At this stage, the size of the temporal dimension is already relatively small (3 for gray, gradient-x, gradient-y and 2 for optflow-x and optflow-y), so we perform convolution only in the spatial dimension at this layer. The size of the convolution kernel used is 7×4 so that the sizes of the output feature maps are reduced to 1×1. The C6 layer consists of 128 feature maps of size 1×1, and each of them is connected to all the 78 feature maps in the S5 layer, leading to 289,536 trainable parameters.
By the multiple layers of convolution and subsampling, the 7 input frames have been converted into a 128D feature vector capturing the motion information in the input frames. The output layer consists of the same number of units as the number of actions, and each unit is fully connected to each of the 128 units in the C6 layer. In this design we essentially apply a linear classifier on the 128D feature vector for action classification. For an action recognition problem with 3 classes, the number of trainable parameters at the output layer is 384. The total number of trainable parameters in this 3D CNN model is 295,458, and all of them are initialized randomly and trained by the online error back-propagation algorithm as described in (LeCun et al., 1998). We have designed and evaluated other 3D CNN architectures that combine multiple channels of information at different stages, and our results show that this architecture gives the best performance.
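The quoted parameter counts can be checked with a few lines of arithmetic. The accounting below (one bias per kernel in C2/C4, a coefficient and bias per map in S3/S5, and a per-connection bias term in C6) is inferred so as to reproduce the paper's totals:

```python
# Reproducing the per-layer trainable-parameter counts quoted above.
c2  = 2 * 5 * (7 * 7 * 3 + 1)      # 10 kernels of 7x7x3 + biases          -> 1,480
s3  = 23 * 2 * 2                   # 46 maps x (coefficient + bias)        -> 92
c4  = 3 * 2 * 5 * (7 * 6 * 3 + 1)  # 30 kernels of 7x6x3 + biases          -> 3,810
s5  = 13 * 6 * 2                   # 78 maps x (coefficient + bias)        -> 156
c6  = 128 * 78 * (7 * 4 + 1)       # 128 units x 78 maps x (28 weights + 1) -> 289,536
out = 3 * 128                      # 3 action classes, fully connected      -> 384
print(c2, s3, c4, s5, c6, out, c2 + s3 + c4 + s5 + c6 + out)
# -> 1480 92 3810 156 289536 384 295458, matching the paper's 295,458
```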
3. Related Work
CNNs belong to the class of biologically inspired models for visual recognition, and some other variants have also been developed within this family. Motivated by the organization of visual cortex, a similar model, called HMAX (Serre et al., 2005), has been developed for visual object recognition. In the HMAX model, a hierarchy of increasingly complex features are constructed by the alternating applications of template matching and max pooling. In particular, at the S1 layer a still input image is first analyzed by an array of Gabor filters at multiple orientations and scales. The C1 layer is then obtained by pooling local neighborhoods on the S1 maps, leading to increased invariance to distortions on the input. The S2 maps are obtained …
A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale Video Classification with Convolutional Neural Networks," CVPR 2014.
S. Ji, W. Xu, M. Yang, and K. Yu, "3D Convolutional Neural Networks for Human Action Recognition," IEEE PAMI, 2013.
AI / ML Cloud Services
Google TensorFlow
Microsoft Oxford API
IBM Watson API
AI / Deep Learning Growth
Personal Assistants / Chatbots
AI Devices / Smart Speakers
SKT NUGU
Google Home
Amazon Echo
PR2 (Willow Garage)
Care-O-bot 4 (Fraunhofer)
Pepper (SoftBank)
Buddy (Blue Frog)
Personal Robots
Life-Like Robots
Sophia (Hanson Robotics)
Atlas (Boston Dynamics)
2. Digital Innovation Based on AI + Big Data
The Fourth Industrial Revolution and AI
Smart Machines – New Industries for the Fourth Industrial Revolution Era
Mind (SW, Data)
Body (HW, Device)
“Cognitive” Smart Machines
Digital Innovation @ Manufacturing: IoT + Big Data + AI
FANUC Preferred Networks
Siemens MindSphere
GE Digital Predix
Digital Innovation @ Logistics: IoT + Big Data + AI
Amazon Robotics
UPS Chatbot
Digital Innovation @ Commerce: IoT + Big Data + AI
Salesforce Einstein
SAP Chatbots
Digital Innovation @ Home: IoT + Big Data + AI
Google Nest
Apple HomeKit
Digital Innovation @ Healthcare: IoT + Big Data + AI
Digital Health
Digital Medicine
Digital Doctor
Digital Innovation @ Service: IoT + Big Data + AI
Pepper Service Robot
Relay Hotel Delivery Robot
3. The Future of AI
Humans (NI) and Machines (AI)
Embodied Mind | Mind Machine ( = Smart Machine)
• Introspectionism ( –1920): Psyche
• Behaviorism / Cybernetics (1920–1950): Mind (= Computer)
• Cognitivism / Symbolic AI (1950–1980): Brain
• Connectionism / Neural Nets (ML) (1980–2010): Body
• Action Science / Autonomous Robots (2010– ): Environment
AI and Life in the Future
Robot & Frank (2012) Her (2013)
Bicentennial Man (1999) Ex Machina (2015)
Future of AI
(Chart: Agency vs. Time, modified from Eliezer Yudkowsky & David Wood)
Stages along the time axis (1980–2050): Narrow AI → AI with Deep Learning (~2010) → Cognitive AI (~2020–2030) → Human-Level AI → Superhuman AI (~2050)
Agency rises from "follows given goals and methods" to "works out own methods, follows given goals" to "works out own goals" (Free Will Technology)
Thank You