AI-Powered Digital Innovation
Prof. Byoung-Tak Zhang / Department of Computer Science and Engineering, Seoul National University
Contents
1. AI Technology and Industry Trends - Intelligence Explosion
2. Digital Innovation Based on AI + Big Data - Smart Machines
3. The Future of AI - Cognitive AI
1. AI Technology and Industry Trends
• Artificial Intelligence (AI): "machines (computers, software, robots) that think like humans and act like humans"
• Research on enabling machines to do things that humans currently do better than machines
• Research on enabling machines to do tasks that require intelligence
• 1950: Turing Test; 1956: the term "Artificial Intelligence (AI)" is coined
• Weak AI, Strong AI, Artificial General Intelligence (AGI)
• Machine Learning (ML): techniques for automatically building AI systems through learning
• Deep Learning (DL): a class of neural-network-based machine learning models; complex neural networks with many layers
The Concept of Artificial Intelligence (AI)
1970s-1980s: Boom
Expert/knowledge-based systems
1982-1992:
Fifth Generation Computer Systems (FGCS) project
1990s: AI winter
Neural nets, genetic algorithms, fuzzy logic
Late 1990s:
Internet, Web, e-commerce
Information retrieval, data mining
Amazon, eBay, Yahoo, Google
2010s: Revival
Intelligent agents
Machine learning / deep learning
ⓒ 2005-2014 SNU Biointelligence Laboratory, http://bi.snu.ac.kr/
History of Artificial Intelligence (AI)
IBM “Deep Blue” Chess Machine Beats Human Champion (1997)
AI Grand Challenge: Thinking Machines
Deep Blue (1997), Watson (2011), AlphaGo (2016)
Self-driving Cars
RHINO Museum Tour Guide (1997), DARPA Grand Challenge (2005), Google Self-driving Car (2010)
• Conventional AI
  - Injects human knowledge into the machine through programming
  - Relies on optimization backed by the machine's raw computing power and memory capacity
• AlphaGo
  - Learns even game strategy on its own through deep learning and reinforcement learning
  - Continuously improves its own performance through self-learning
  - A case demonstrating the possibility of an intelligence explosion that overcomes human weaknesses and builds on machine strengths
AlphaGo and New AI
Intelligence Explosion and Singularity
Enabling Technology for the Intelligence Explosion
• Multiple decision boundaries are needed (e.g., the XOR problem) → multiple units
• More complex regions are needed (e.g., polygons) → multiple layers
• Big Data + Deep Learning (see the sketch below)
Automatic Programming
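A minimal sketch of the point above, not from the slides: a single linear unit cannot carve out the two decision boundaries XOR needs, but one hidden layer of a few units learns it. Plain NumPy with full-batch gradient descent; the seed, layer width, and learning rate are illustrative choices.

```python
# XOR needs multiple units/layers: a 2-4-1 network trained with plain
# gradient descent on squared error (illustrative hyperparameters).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])          # XOR truth table

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)  # hidden layer (4 tanh units)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)  # output unit (sigmoid)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(10000):
    h = np.tanh(X @ W1 + b1)                    # hidden activations
    out = sigmoid(h @ W2 + b2)                  # network output
    delta2 = (out - y) * out * (1 - out)        # backprop through the output
    delta1 = (delta2 @ W2.T) * (1 - h ** 2)     # backprop through the hidden layer
    W2 -= lr * (h.T @ delta2); b2 -= lr * delta2.sum(0)
    W1 -= lr * (X.T @ delta1); b1 -= lr * delta1.sum(0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```

A single unit trained the same way cannot fit all four points, which is exactly the XOR limitation the first bullet refers to.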
Why Deep Learning?
• ImageNet Large-Scale Visual Recognition Challenge
• Image classification / localization
• 1.2M labeled images, 1,000 classes
• Convolutional neural networks (CNNs) have dominated the contest since 2012 (top-5 error; scoring illustrated below):
  - 2012 (best non-CNN): 26.2%
  - 2012 (Hinton, AlexNet): 15.3%
  - 2013 (Clarifai): 11.2%
  - 2014 (Google, GoogLeNet): 6.7%
  - pre-2015 (Google): 4.9%
• Beyond human-level performance
Deep Learning – Object Recognition (Big Data)
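A side note on how the percentages above are scored: "top-5 error" counts an image as correct when the true class appears among the model's five highest-scoring classes. A small illustration with synthetic scores, not real model outputs:

```python
# Top-5 error: fraction of images whose true class is NOT among the
# 5 highest-scoring predictions (synthetic scores for illustration).
import numpy as np

def top5_error(scores, labels):
    """scores: (N, C) class scores; labels: (N,) true class indices."""
    top5 = np.argsort(scores, axis=1)[:, -5:]        # 5 best classes per image
    hits = (top5 == labels[:, None]).any(axis=1)     # true class among them?
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 1000))               # 1000 images, 1000 classes
labels = rng.integers(0, 1000, size=1000)
print(f"random classifier top-5 error: {top5_error(scores, labels):.3f}")
# ~0.995 for random guessing, which puts AlexNet's 15.3% in perspective
```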
• ~2010: GMM-HMM (Dynamic Bayesian Models)
• ~2013: DNN-HMM (Deep Neural Networks)
• Current: LSTM-RNN (Recurrent Neural Networks; one cell step is sketched below)
Deep Learning – Speech Recognition (Big Data)
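To make the LSTM-RNN stage concrete, here is a minimal NumPy sketch of a single LSTM cell step using the standard gate equations. The 13-dimensional input (an MFCC-like frame size) and the state width are illustrative assumptions, not a trained acoustic model.

```python
# One LSTM time step: the gated cell state is what lets recurrent networks
# track long speech contexts better than frame-by-frame GMM/DNN-HMM models.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """x: input frame (d,); h, c: hidden/cell state (n,).
    W (4n, d), U (4n, n), b (4n,) stack the 4 gate parameter blocks."""
    z = W @ x + U @ h + b
    n = h.shape[0]
    i, f, o = (sigmoid(z[k * n:(k + 1) * n]) for k in range(3))  # input/forget/output gates
    g = np.tanh(z[3 * n:])                                       # candidate update
    c_new = f * c + i * g            # gated memory carries information over time
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# run a toy sequence of 50 frames with 13 features each through the cell
rng = np.random.default_rng(0)
d, n = 13, 32
W = rng.normal(scale=0.1, size=(4 * n, d))
U = rng.normal(scale=0.1, size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(50, d)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (32,): the state an acoustic model would classify
```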
• Use 3D CNNs to model the temporal patterns as well as the spatial patterns
Deep Learning – Video Analysis (Big Data)
3D Convolutional Neural Networks for Human Action Recognition (excerpt from S. Ji et al., PAMI 2013)
Figure 1. Comparison of 2D (a) and 3D (b) convolutions. In (b) the size of the convolution kernel in the temporal dimension is 3, and the sets of connections are color-coded so that the shared weights are in the same color. In 3D convolution, the same 3D kernel is applied to overlapping 3D cubes in the input video to extract motion features.
previous layer, thereby capturing motion information. Formally, the value at position (x, y, z) on the j-th feature map in the i-th layer is given by
$$v_{ij}^{xyz} = \tanh\!\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} \sum_{r=0}^{R_i - 1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \right) \tag{2}$$

where $R_i$ is the size of the 3D kernel along the temporal dimension, and $w_{ijm}^{pqr}$ is the $(p, q, r)$-th value of the kernel connected to the $m$-th feature map in the previous layer. A comparison of 2D and 3D convolutions is given in Figure 1.
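Eq. (2) transcribed directly into NumPy as a naive triple loop, to make the index bookkeeping concrete; the input sizes below are illustrative, and a real implementation would use an optimized convolution routine.

```python
# Naive implementation of Eq. (2): one output feature map from a set of
# previous-layer maps, with one 3D kernel per previous map and a tanh.
import numpy as np

def conv3d_unit(prev, kernels, bias):
    """prev: (M, X, Y, Z) maps of layer i-1 (Z = temporal dimension).
    kernels: (M, P, Q, R) weights w_ijm; bias: scalar b_ij."""
    M, X, Y, Z = prev.shape
    _, P, Q, R = kernels.shape
    out = np.zeros((X - P + 1, Y - Q + 1, Z - R + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # sum over m, p, q, r exactly as in Eq. (2)
                patch = prev[:, x:x + P, y:y + Q, z:z + R]
                out[x, y, z] = np.tanh(bias + np.sum(kernels * patch))
    return out

# 7 gray frames of 60x40 with a 7x7x3 kernel, as in the architecture below
rng = np.random.default_rng(0)
frames = rng.normal(size=(1, 60, 40, 7))
kernel = rng.normal(scale=0.1, size=(1, 7, 7, 3))
print(conv3d_unit(frames, kernel, 0.0).shape)  # (54, 34, 5): matches C2's 54x34
```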
Note that a 3D convolutional kernel can only extract one type of features from the frame cube, since the kernel weights are replicated across the entire cube. A general design principle of CNNs is that the number of feature maps should be increased in late layers by generating multiple types of features from the same set of lower-level feature maps. Similar to the case of 2D convolution, this can be achieved by applying multiple 3D convolutions with distinct kernels to the same location in the previous layer (Figure 2).

Figure 2. Extraction of multiple features from contiguous frames. Multiple 3D convolutions can be applied to contiguous frames to extract multiple features. As in Figure 1, the sets of connections are color-coded so that the shared weights are in the same color. Note that all the 6 sets of connections do not share weights, resulting in two different feature maps on the right.
2.2. A 3D CNN Architecture
Based on the 3D convolution described above, a variety of CNN architectures can be devised. In the following, we describe a 3D CNN architecture that we have developed for human action recognition on the TRECVID data set. In this architecture, shown in Figure 3, we consider 7 frames of size 60×40 centered on the current frame as inputs to the 3D CNN model. We first apply a set of hardwired kernels to generate multiple channels of information from the input frames. This results in 33 feature maps in the second layer in 5 different channels known as gray, gradient-x, gradient-y, optflow-x, and optflow-y. The gray channel contains the gray pixel values of the 7 input frames. The feature maps in the gradient-x and gradient-y channels are obtained by computing gradients along the horizontal and vertical directions, respectively, on each of the 7 input frames, and the optflow-x and optflow-y channels contain the optical flow fields, along the horizontal and vertical directions, respectively, computed from adjacent input frames. This hardwired layer is used to encode our prior knowledge on features, and this scheme usually leads to better performance as compared to random initialization.
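A hedged sketch of the hardwired layer just described: 7 gray frames of 60×40 expanded into 33 feature maps across the 5 channels. The optical-flow channels are stubbed with simple frame differences here; the paper computes actual optical flow fields.

```python
# Hardwired feature channels: gray (7) + gradient-x (7) + gradient-y (7)
# + optflow-x (6) + optflow-y (6) = 33 maps, as in layer H1.
import numpy as np

def hardwired_channels(frames):
    """frames: (7, 60, 40) gray frames -> list of 33 feature maps."""
    gray = list(frames)                                    # raw pixel values
    grad_x = [np.gradient(f, axis=1) for f in frames]      # horizontal gradients
    grad_y = [np.gradient(f, axis=0) for f in frames]      # vertical gradients
    # stand-in for optical flow: differences of the 6 adjacent frame pairs;
    # a faithful version would compute real flow fields per direction
    flow_x = [frames[t + 1] - frames[t] for t in range(6)]
    flow_y = [frames[t + 1] - frames[t] for t in range(6)]
    return gray + grad_x + grad_y + flow_x + flow_y

maps = hardwired_channels(np.random.default_rng(0).normal(size=(7, 60, 40)))
print(len(maps))  # 33, matching the 33 feature maps of the second layer
```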
Figure 3. A 3D CNN architecture for human action recognition: input 7@60×40 → (hardwired) H1: 33@60×40 → (7×7×3 3D convolution) C2: 23*2@54×34 → (2×2 subsampling) S3: 23*2@27×17 → (7×6×3 3D convolution) C4: 13*6@21×12 → (3×3 subsampling) S5: 13*6@7×4 → (7×4 convolution) C6: 128@1×1 → full connection. This architecture consists of 1 hardwired layer, 3 convolution layers, 2 subsampling layers, and 1 full connection layer. Detailed descriptions are given in the text.
We then apply 3D convolutions with a kernel size of 7×7×3 (7×7 in the spatial dimension and 3 in the temporal dimension) on each of the 5 channels separately. To increase the number of feature maps, two sets of different convolutions are applied at each location, resulting in 2 sets of feature maps in the C2 layer, each consisting of 23 feature maps. This layer contains 1,480 trainable parameters. In the subsequent subsampling layer S3, we apply 2×2 subsampling on each of the feature maps in the C2 layer, which leads to the same number of feature maps with reduced spatial resolution. The number of trainable parameters in this layer is 92. The next convolution layer C4 is obtained by applying 3D convolution with a kernel size of 7×6×3 on each of the 5 channels in the two sets of feature maps separately. To increase the number of feature maps, we apply 3 convolutions with different kernels at each location, leading to 6 distinct sets of feature maps in the C4 layer, each containing 13 feature maps. This layer contains 3,810 trainable parameters. The next layer S5 is obtained by applying 3×3 subsampling on each feature map in the C4 layer, which leads to the same number of feature maps with reduced spatial resolution. The number of trainable parameters in this layer is 156. At this stage, the size of the temporal dimension is already relatively small (3 for gray, gradient-x, gradient-y and 2 for optflow-x and optflow-y), so we perform convolution only in the spatial dimension at this layer. The size of the convolution kernel used is 7×4 so that the sizes of the output feature maps are reduced to 1×1. The C6 layer consists of 128 feature maps of size 1×1, and each of them is connected to all the 78 feature maps in the S5 layer, leading to 289,536 trainable parameters.
By the multiple layers of convolution and subsampling, the 7 input frames have been converted into a 128D feature vector capturing the motion information in the input frames. The output layer consists of the same number of units as the number of actions, and each unit is fully connected to each of the 128 units in the C6 layer. In this design we essentially apply a linear classifier on the 128D feature vector for action classification. For an action recognition problem with 3 classes, the number of trainable parameters at the output layer is 384. The total number of trainable parameters in this 3D CNN model is 295,458, and all of them are initialized randomly and trained by the online error back-propagation algorithm as described in (LeCun et al., 1998). We have designed and evaluated other 3D CNN architectures that combine multiple channels of information at different stages, and our results show that this architecture gives the best performance.
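The quoted parameter counts can be checked with a few lines of arithmetic. The accounting below (one bias per kernel in C2/C4, a coefficient and bias per map in S3/S5, and a per-connection bias term in C6) is inferred so as to reproduce the paper's totals:

```python
# Reproducing the per-layer trainable-parameter counts quoted above.
c2  = 2 * 5 * (7 * 7 * 3 + 1)      # 10 kernels of 7x7x3 + biases          -> 1,480
s3  = 23 * 2 * 2                   # 46 maps x (coefficient + bias)        -> 92
c4  = 3 * 2 * 5 * (7 * 6 * 3 + 1)  # 30 kernels of 7x6x3 + biases          -> 3,810
s5  = 13 * 6 * 2                   # 78 maps x (coefficient + bias)        -> 156
c6  = 128 * 78 * (7 * 4 + 1)       # 128 units x 78 maps x (28 weights + 1) -> 289,536
out = 3 * 128                      # 3 action classes, fully connected      -> 384
print(c2, s3, c4, s5, c6, out, c2 + s3 + c4 + s5 + c6 + out)
# -> 1480 92 3810 156 289536 384 295458, matching the paper's 295,458
```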
3. Related Work
CNNs belong to the class of biologically inspired models for visual recognition, and some other variants have also been developed within this family. Motivated by the organization of visual cortex, a similar model, called HMAX (Serre et al., 2005), has been developed for visual object recognition. In the HMAX model, a hierarchy of increasingly complex features are constructed by the alternating applications of template matching and max pooling. In particular, at the S1 layer a still input image is first analyzed by an array of Gabor filters at multiple orientations and scales. The C1 layer is then obtained by pooling local neighborhoods on the S1 maps, leading to increased invariance to distortions on the input. The S2 maps are obtained …
A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale Video Classification with Convolutional Neural Networks," CVPR 2014.
S. Ji, W. Xu, M. Yang, and K. Yu, "3D Convolutional Neural Networks for Human Action Recognition," IEEE PAMI, 2013.
AI / ML Cloud Services
Google TensorFlow
Microsoft Oxford API
IBM Watson API
AI / Deep Learning Growth
Personal Assistants / Chatbots
AI Devices / Smart Speakers
SKT NUGU
Google Home
Amazon Echo
PR2 (Willow Garage)
Care-O-bot 4 (Fraunhofer)
Pepper (SoftBank)
Buddy (Blue Frog)
Personal Robots
Life-Like Robots
Sophia (Hanson Robotics)
Atlas (Boston Dynamics)
2. Digital Innovation Based on AI + Big Data
The Fourth Industrial Revolution and AI
Smart Machines – New Industries for the Fourth Industrial Revolution Era
Mind (SW, Data)
Body (HW, Device)
“Cognitive” Smart Machines
Digital Innovation @ Manufacturing: IoT + Big Data + AI
FANUC Preferred Networks
Siemens MindSphere
GE Digital Predix
Digital Innovation @ Logistics: IoT + Big Data + AI
Amazon Robotics
UPS Chatbot
Digital Innovation @ Commerce: IoT + Big Data + AI
Salesforce Einstein
SAP Chatbots
Digital Innovation @ Home: IoT + Big Data + AI
Google Nest
Apple HomeKit
Digital Innovation @ Healthcare: IoT + Big Data + AI
Digital Health
Digital Medicine
Digital Doctor
Digital Innovation @ Service: IoT + Big Data + AI
Pepper Service Robot
Relay Hotel Delivery Robot
3. The Future of AI
Humans (NI) and Machines (AI)
Embodied Mind | Mind Machine ( = Smart Machine)
• Introspectionism ( –1920): Psyche
• Behaviorism / Cybernetics (1920–1950): Mind (= Computer)
• Cognitivism / Symbolic AI (1950–1980): Brain
• Connectionism / Neural Nets (ML) (1980–2010): Body
• Action Science / Autonomous Robots (2010– ): Environment
AI and Life in the Future
Robot & Frank (2012) Her (2013)
Bicentennial Man (1999) Ex Machina (2015)
Future of AI
(Chart: Agency vs. Time, modified from Eliezer Yudkowsky & David Wood)
Stages along the time axis (1980–2050): Narrow AI → AI with Deep Learning (~2010) → Cognitive AI (~2020–2030) → Human-Level AI → Superhuman AI (~2050)
Agency rises from "follows given goals and methods" to "works out own methods, follows given goals" to "works out own goals" (Free Will Technology)
Thank You