DEEP LEARNING FOR RECOGNITION OF OBJECTS IN AUGMENTED REALITY
Henrik Pedersen, PhD
Senior Computer Vision Engineer
VISUAL COMPUTING LAB
Visual computing is a generic term for all computer science disciplines handling images and 3D models, i.e. computer graphics, image processing, visualization, computer vision, virtual and augmented reality, and video processing, but it also includes aspects of pattern recognition, human-computer interaction, machine learning and digital libraries. The core challenges are the acquisition, processing, analysis and rendering of visual information (mainly images and video). Application areas include industrial quality control, medical image processing and visualization, surveying, robotics, multimedia systems, virtual heritage, special effects in movies and television, and computer games.
[https://en.wikipedia.org/wiki/Visual_computing]
Computer Graphics & Visualization
Computer Vision & Image Analysis
Physics Simulations
High Performance Computing
PEOPLE
[Table: competency matrix mapping team members in Aarhus and Copenhagen to the areas Graphics, Vision, Physics and HPC]
https://alexandra.dk/dk/om_os/labs/visual-computing-lab
PROJECT EXAMPLES
• Real-time terrain visualization in a web browser
– Point cloud generated using LiDAR (17.6 billion points)
– 100 terabytes of data
– 40x40 cm resolution
– 1 cm in height
– All of Denmark
– Overlays from satellite photos and OpenStreetMap
DENMARK’S ELEVATION MODEL VISUALIZED IN WebGL
https://denmark3d.alexandra.dk
• LEGO Digital Designer, LEGO Universe, LEGO House (Fish Tank)
HIGH GRAPHICS QUALITY AND IMAGE RECOGNITION
https://www.legohouse.com/da-dk/explore/yellow-zone
VISIBLE EAR SIMULATOR
Virtual Reality for surgical training and pre-operative planning
• See and feel the inner ear.
• Simulation of bone drilling with haptic feedback.
• Realistic visualization based on real anatomy.
https://ves.alexandra.dk/
AUGMENTED REALITY
Strategic focus: industrial training
3D lungs
Tracking of book front pages
COMPUTER VISION / IMAGE ANALYSIS
Medical image registration (from hours to seconds)
Field segmentation from satellite images
DEEP LEARNING
DEEP LEARNING IN AUGMENTED REALITY
Camera pose estimation
Object detection (where?)
Object classification (what?)
Markerless tracking
WHAT IS DEEP LEARNING?
• Neural networks are machine learning algorithms inspired by the structure and function of the brain.
• Interest in Deep Neural Networks has sky-rocketed within the past 5 years.
• Big data + GPUs + algorithmic progress
WHAT IS DEEP LEARNING?
Colorize gray-scale images
Turn horses into zebras
Turn images into Van Gogh paintings
”Dream” images of fake celebrities
Image captioning
Detect human body pose
WHAT IS DEEP LEARNING?
• All you need is lots of training data and computing power.
WHAT IS DEEP LEARNING?
• All you need is lots of training data and computing power.
A car has four wheels, which are placed approximately …
WHAT IS DEEP LEARNING?
• All you need is lots of training data and computing power.
Database of Cars and ”Not cars”
WHAT IS DEEP LEARNING?
Andrej Karpathy, Director of AI at Tesla
Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we write software. They are Software 2.0.
The “classical stack” of Software 1.0 is what we’re all familiar with […] It consists of explicit instructions to the computer written by a programmer.
In contrast, Software 2.0 is written in neural network weights. No human is involved in writing this code […] Instead, we specify some constraints on the behavior of a desirable program (e.g., a dataset of input output pairs of examples) and use the computational resources at our disposal to search the program space for a program that satisfies the constraints.
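The "search the program space" idea can be illustrated with a toy example. Here a one-parameter "program" y = w·x is found by gradient descent from a dataset of input-output pairs, instead of writing the rule by hand (the dataset and learning rate are made up for illustration):

```python
# Toy "Software 2.0": instead of hand-writing y = 2*x, search for the
# parameter w that satisfies a dataset of input-output examples.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # examples generated by the unknown program y = 2*x

w = 0.0    # initial guess for the "program"
lr = 0.01  # learning rate
for _ in range(1000):
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

Real neural networks do exactly this, only with millions of weights instead of one.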
Software 2.0, Nov 11, 2017
CONVOLUTIONAL NEURAL NETWORKS
• Look for parts and check if their relative positions in the image are consistentwith the type of object you are looking for.
Simple model of a car with three parts
CONVOLUTIONAL NEURAL NETWORKS
• But computers don’t ”see” the way humans do.
CONVOLUTIONAL NEURAL NETWORKS
0 0 0 0 0 0 0 0
0 0 0 1 1 0 0 0
0 0 1 0 0 1 0 0
0 0 1 0 0 1 0 0
0 0 1 1 1 1 0 0
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 0 0 0 0 0
Input image
Convolution (locating parts)
0 1 0
1 0 0
0 0 0

0 1 0
0 0 1
0 0 0

1 0 1
0 0 0
1 0 1

Compression
This configuration is consistent with the letter ”A”
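The part-locating convolution can be sketched in a few lines of NumPy. The image is the 8x8 letter "A" from the slide; the kernel is one hypothetical part detector that fires on a small piece of the left diagonal stroke:

```python
import numpy as np

# The 8x8 binary image of the letter "A" from the slide.
img = np.array([
    [0,0,0,0,0,0,0,0],
    [0,0,0,1,1,0,0,0],
    [0,0,1,0,0,1,0,0],
    [0,0,1,0,0,1,0,0],
    [0,0,1,1,1,1,0,0],
    [0,1,0,0,0,0,1,0],
    [0,1,0,0,0,0,1,0],
    [0,0,0,0,0,0,0,0],
])

# 3x3 kernel: a pixel above-centre plus a pixel centre-left,
# i.e. a tiny fragment of a diagonal stroke.
kernel = np.array([
    [0,1,0],
    [1,0,0],
    [0,0,0],
])

# Valid cross-correlation: slide the kernel over the image, sum the overlap.
kh, kw = kernel.shape
response = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
for i in range(response.shape[0]):
    for j in range(response.shape[1]):
        response[i, j] = (img[i:i+kh, j:j+kw] * kernel).sum()

# The strongest responses (value 2) mark where both kernel pixels match,
# i.e. along the diagonal strokes of the "A".
print(response.max())  # 2.0
```

The response map is what the slides call a feature map: high values tell the network where this part was found.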
Training
• Look for parts and check if their relative positions in the image are consistentwith the type of object you are looking for.
CONVOLUTIONAL NEURAL NETWORKS
Input image → Convolutional layer (Convolution → Activation → Pooling) → Feature maps

1st layer feature map: tells the network where to find simple features like edges and blobs.
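The three stages of a convolutional layer can be sketched directly in NumPy; the edge-detector kernel and random input below are just placeholders:

```python
import numpy as np

def conv2d(img, kernel):
    """Convolution stage: valid 2D cross-correlation."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i+kh, j:j+kw] * kernel).sum()
    return out

def relu(x):
    """Activation stage: keep positive responses, zero out the rest."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Pooling stage: keep the strongest response in each size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.random.rand(8, 8)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])  # vertical edge detector
feature_map = max_pool(relu(conv2d(img, kernel)))
print(feature_map.shape)  # (3, 3)
```

Stacking several such layers, each fed the previous layer's feature maps, gives the deep network described on the following slides.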
CONVOLUTIONAL NEURAL NETWORKS
Input image → Convolutional layer → Convolutional layer → Feature maps
(each convolutional layer performs Convolution → Activation → Pooling)
CONVOLUTIONAL NEURAL NETWORKS
2nd layer feature map: tells the network where to find more complex features like eyes, nose, etc.
Input image → Convolutional layer → Convolutional layer → Feature maps → Fully connected layer(s) → Output
CONVOLUTIONAL NEURAL NETWORKS
Fully connected layers: a trainable ”program” that does ”something” with the features, like checking if their positions are consistent with a face.
ENCODER/DECODER PERSPECTIVE
Encoder
Decoder
Fully connected layer(s)
Output
ENCODER/DECODER PERSPECTIVE
Decoder
Classifier: uses features to distinguish between two or more classes.
Output: discrete labels (”Dog” or ”Cat”)

Regressor: uses features to predict some functional relationship.
Output: real numbers (”Age” of person in image)
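The two decoder types differ only in their output head. A minimal sketch with made-up feature values and weights (a real network would learn these):

```python
import numpy as np

def softmax(z):
    """Classifier head: turns raw scores into class probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Shared features produced by the encoder (hypothetical values).
features = np.array([0.5, -1.2, 3.0])

# Classifier decoder: linear layer + softmax -> discrete label.
W_cls = np.array([[1.0, 0.0, 0.5],    # "Dog" scores
                  [-0.5, 1.0, 0.0]])  # "Cat" scores
probs = softmax(W_cls @ features)
label = ["Dog", "Cat"][int(np.argmax(probs))]

# Regressor decoder: linear layer, raw real-valued output (e.g. age).
w_reg = np.array([2.0, 0.1, 5.0])
age = float(w_reg @ features)

print(label, round(age, 2))  # Dog 15.88
```

The same encoder features can feed both heads; only the loss and output format change.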
OUR USE OF DEEP LEARNING
OUR APPROACH
• Deep learning needs labeled training data – and lots of it
– Annotated images
– Very time consuming
• We specialize in rendering training data
– Drastically reduces the time spent on acquiring and annotating images.
– If CAD data and material descriptions are available, much can be automated.
OUR APPROACH
• Given an image, tell which of a number of classes it belongs to
– What type of object
– Quality control: OK or needs manual inspection
– View direction
OBJECT CLASSIFICATION
OBJECT CLASSIFICATION
• Train on synthetic photorealistic images
– Render images of objects from CAD files
– Random viewing angle and distance to camera
– Random lighting conditions
Tetris controller
Motor
OBJECT CLASSIFICATION
• Image augmentation
– Add a “Background” class consisting of natural images.
– Put rendered objects on top of backgrounds.
– Add a little noise to the colors.
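The compositing step can be sketched in NumPy; the shapes, mask and noise level below are placeholder values, not the pipeline's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(rendered, mask, background, noise_std=0.02):
    """Composite a rendered object onto a natural background, jitter colors.

    rendered, background: HxWx3 float images in [0, 1]
    mask: HxW boolean, True where the rendered object covers the pixel
    """
    out = background.copy()
    out[mask] = rendered[mask]                    # paste object over background
    out += rng.normal(0.0, noise_std, out.shape)  # add a little color noise
    return np.clip(out, 0.0, 1.0)

# Hypothetical 64x64 inputs standing in for a render and a natural photo.
rendered = np.full((64, 64, 3), 0.8)
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True                         # object occupies the centre
background = rng.random((64, 64, 3))

sample = augment(rendered, mask, background)
print(sample.shape)  # (64, 64, 3)
```

Each rendered object can be pasted onto many different backgrounds, multiplying the effective size of the training set.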
Background
Tetris controller
Motor
OBJECT CLASSIFICATION
• AlexNet (A. Krizhevsky 2012)
– First deep CNN to win the ImageNet challenge (originally trained for 2 weeks!).
– Transfer learning (network weights pre-trained on ImageNet).
– Training time: 1-2 hours
• Results
– Robustly distinguishes between
• Tetris controller
• Motor
• Background (anything else)
– Only recognizes what it has seen during training
• When the motor is too small, it is classified as background.
OBJECT CLASSIFICATION
DOES IT SCALE?
• ImageNet has 1000 object classes.
• State-of-the-art deep learning algorithms perform better than humans on this benchmark!
• We successfully tested this approach with up to 150 object classes.
VIEW CLASSIFICATION
• Train a neural network to roughly estimate camera pose.
• Formulate it as a classification problem.
Input image → ConvNet → Which view is it?
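Turning a continuous camera pose into one of 48 classes amounts to binning the viewing angles. The 8 x 6 azimuth/elevation split below is an assumed example of how 48 views could be laid out, not necessarily the split used in the project:

```python
def view_class(azimuth_deg, elevation_deg, n_az=8, n_el=6):
    """Map a camera pose (azimuth 0-360, elevation 0-180) to one of
    n_az * n_el discrete view classes (8 x 6 = 48 here)."""
    az_bin = int(azimuth_deg % 360 // (360 / n_az))
    el_bin = min(int(elevation_deg // (180 / n_el)), n_el - 1)
    return el_bin * n_az + az_bin

print(view_class(0, 0))      # 0
print(view_class(350, 175))  # 47
```

Each rendered training image is labeled with the class of the pose it was rendered from, and the ConvNet predicts that class at test time.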
VIEW CLASSIFICATION
• Again, we train on synthetic photorealistic images.
• 48 different views/classes
• Training image examples
VIEW CLASSIFICATION
View 1
View 2
• Results
– Works fine for rough view classification like up/down or left/right.
• Use case in AR
– Instructional assistance
– “Flip object left-to-right”
– “Turn object upside-down”
VIEW CLASSIFICATION
Input image Closest view
• Can we use Deep Learning to detect objects in images?
– Input: image
– Output: bounding boxes + labels for each object in the image
OBJECT DETECTION
OBJECT DETECTION
• How it used to be done
– Train a classifier such as a deep neural network.
– Run a sliding window over the image at multiple scales.
• Disadvantages
– (Hand-crafted features)
– Computationally expensive
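The classical sliding-window scheme can be sketched as follows; the brightness-based toy classifier stands in for the trained network, which is where the computational cost comes from (one network evaluation per window per scale):

```python
import numpy as np

def sliding_window_detect(image, classifier, win=32, stride=16):
    """Run a window classifier over every position; return boxes that fire.

    classifier: function patch -> score in [0, 1]
    (a stand-in here for a real CNN classifier).
    """
    boxes = []
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = classifier(image[y:y+win, x:x+win])
            if score > 0.5:
                boxes.append((x, y, win, win, score))
    return boxes

# Toy image: a single bright square as the "object".
img = np.zeros((96, 96))
img[32:64, 32:64] = 1.0
boxes = sliding_window_detect(img, lambda p: p.mean())
print(boxes)  # one detection at (32, 32)
```

To handle objects of different sizes, the same loop has to be repeated over a pyramid of rescaled images, multiplying the number of classifier evaluations.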
OBJECT DETECTION
• Region-based Fully Convolutional Net (R-FCN)
– Run a region proposal network (RPN) to generate regions of interest.
– For each region of interest, look for object parts (3x3 grid).
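Proposal-based detectors like R-FCN score candidate regions against ground-truth boxes using intersection-over-union (IoU), both during training and when merging overlapping detections. A minimal sketch:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # 0.143
```

A proposal is typically counted as a positive training example when its IoU with a ground-truth box exceeds a threshold such as 0.5.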
OBJECT DETECTION
• Train on synthetic photorealistic images
– Store bounding boxes of objects + class labels.
– Allow multiple objects in the same image.
– Partially occlude objects with a gray square.
OBJECT DETECTION
OBJECT DETECTION
• Results
– Works well on both synthetic and real images.
– Detects multiple objects in the same image.
– Real-time performance.
• Example: Segmentation
– Given an image, divide pixels into classes
IMAGE-TO-IMAGE TRANSFORMATION
IMAGE-TO-IMAGE TRANSFORMATION
• Fully Convolutional Networks for semantic segmentation
– Input: image
– Output: image with class labels
• Image-to-image translation
– Add color to an image
– CT to histology
– Building geometry
IMAGE-TO-IMAGE TRANSFORMATION
IMAGE-TO-IMAGE TRANSLATION
• Tracking object keypoints
Select 3D keypoints in the CAD model that you want to track
Input image Output image
IMAGE-TO-IMAGE TRANSLATION
• Tracking object keypoints
Select a 3D keypoint in the CAD model that you want to track
IMAGE-TO-IMAGE TRANSLATION
• Tracking of 48 object keypoints
• Given an image, compute one or more numbers that describe what you see.
– The size of an object
– Age of a person
– Positions of facial landmarks
– How many millimeters of tread depth are left on a tire?
REGRESSION NETWORKS
• Predict object keypoint positions using a regression network.
• Use the keypoints to estimate object pose.
KEYPOINT REGRESSION
Predicted keypoints Estimated object pose
POSE ESTIMATION
• Where is the camera relative to the object?
• Mapping between 3D object coordinates and 2D image coordinates.
Projection of 3D points (cube) onto an image.
Camera
Object
Scene
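The 3D-to-2D mapping is the pinhole camera model. A minimal sketch projecting the cube's corners; the pose and intrinsics (f, cx, cy) are arbitrary illustration values:

```python
import numpy as np

def project(points_3d, R, t, f=500.0, cx=320.0, cy=240.0):
    """Project 3D object points into the image with a pinhole camera.

    R, t: camera pose (rotation matrix, translation)
    f, cx, cy: intrinsics (focal length and principal point)
    """
    cam = (R @ points_3d.T).T + t       # object -> camera coordinates
    uv = cam[:, :2] / cam[:, 2:3]       # perspective divide by depth
    return uv * f + np.array([cx, cy])  # scale by focal length, shift to centre

# Unit cube corners in object coordinates.
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                dtype=float)
R = np.eye(3)                  # camera looking straight at the object
t = np.array([0.0, 0.0, 5.0])  # object 5 units in front of the camera

pixels = project(cube, R, t)
print(pixels[0])  # corner (0,0,0) lands at the principal point (320, 240)
```

Pose estimation inverts this mapping: given predicted 2D keypoints and their known 3D positions in the CAD model, solve for the R and t that best explain the projections.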
• Work in progress
FINE-TUNING CAMERA POSE
DEEP LEARNING IN AUGMENTED REALITY
• Our approach
– Use synthetic images to train Deep Neural Networks.
– Works well for object recognition, detection and markerless tracking.
Markerless tracking Detection and recognition
WHERE TO GO NEXT?
• Using synthetic images for training doesn’t always work that well…• We need to close the ”simulation-to-reality gap”.
Thank you for your attention!