DEEP LEARNING FOR RECOGNITION OF OBJECTS IN AUGMENTED REALITY
Henrik Pedersen, PhD
Senior Computer Vision Engineer
VISUAL COMPUTING LAB
Visual computing is a generic term for all computer science disciplines handling images and 3D models, i.e. computer graphics, image processing, visualization, computer vision, virtual and augmented reality, and video processing, but it also includes aspects of pattern recognition, human-computer interaction, machine learning and digital libraries. The core challenges are the acquisition, processing, analysis and rendering of visual information (mainly images and video). Application areas include industrial quality control, medical image processing and visualization, surveying, robotics, multimedia systems, virtual heritage, special effects in movies and television, and computer games.
[https://en.wikipedia.org/wiki/Visual_computing]
Computer Graphics & Visualization
Computer Vision & Image Analysis
Physics Simulations
High Performance Computing
PEOPLE
[Table: competency matrix mapping team members in Aarhus and Copenhagen to the areas Graphics, Vision, Physics and HPC]
https://alexandra.dk/dk/om_os/labs/visual-computing-lab
PROJECT EXAMPLES
• Real-time terrain visualization in a web browser
– Point cloud generated using LiDAR (17.6 billion points)
– 100 terabytes of data
– 40x40 cm resolution
– 1 cm in height
– All of Denmark
– Overlays from satellite photos and OpenStreetMap
DENMARK’S ELEVATION MODEL VISUALIZED IN WebGL
https://denmark3d.alexandra.dk
• LEGO Digital Designer, LEGO Universe, LEGO House (Fish Tank)
HIGH GRAPHICS QUALITY AND IMAGE RECOGNITION
https://www.legohouse.com/da-dk/explore/yellow-zone
VISIBLE EAR SIMULATOR
Virtual Reality for surgical training and pre-operative planning
• See and feel the inner ear.
• Simulation of bone drilling with haptic feedback.
• Realistic visualization based on real anatomy.
https://ves.alexandra.dk/
AUGMENTED REALITY
Strategic focus: industrial training
3D lungs
Tracking of book front pages
COMPUTER VISION / IMAGE ANALYSIS
Medical image registration (from hours to seconds)
Field segmentation from satellite images
DEEP LEARNING
DEEP LEARNING IN AUGMENTED REALITY
Camera pose estimation
Object detection (where?)
Object classification (what?)
Markerless tracking
WHAT IS DEEP LEARNING?
• Neural networks are machine learning algorithms inspired by the structure and function of the brain.
• Interest in Deep Neural Networks has sky-rocketed within the past 5 years.
• Big data + GPUs + algorithmic progress
WHAT IS DEEP LEARNING?
Colorize gray-scale images
Turn horses into zebras
Turn images into Van Gogh paintings
”Dream” images of fake celebrities
Image captioning
Detect human body pose
WHAT IS DEEP LEARNING?
• All you need is lots of training data and computing power.
WHAT IS DEEP LEARNING?
• All you need is lots of training data and computing power.
A car has four wheels, which are placed approximately …
WHAT IS DEEP LEARNING?
• All you need is lots of training data and computing power.
Database of Cars and ”Not cars”
WHAT IS DEEP LEARNING?
Andrej Karpathy, Director of AI at Tesla
Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we write software. They are Software 2.0.
The “classical stack” of Software 1.0 is what we’re all familiar with […] It consists of explicit instructions to the computer written by a programmer.
In contrast, Software 2.0 is written in neural network weights. No human is involved in writing this code […] Instead, we specify some constraints on the behavior of a desirable program (e.g., a dataset of input output pairs of examples) and use the computational resources at our disposal to search the program space for a program that satisfies the constraints.
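The "search the program space" idea can be illustrated with a toy example. Here a one-parameter "program" y = w·x is found by gradient descent from a dataset of input-output pairs, instead of writing the rule by hand (the dataset and learning rate are made up for illustration):

```python
# Toy "Software 2.0": instead of hand-writing y = 2*x, search for the
# parameter w that satisfies a dataset of input-output examples.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # examples generated by the unknown program y = 2*x

w = 0.0    # initial guess for the "program"
lr = 0.01  # learning rate
for _ in range(1000):
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

Real neural networks do exactly this, only with millions of weights instead of one.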
Software 2.0, Nov 11, 2017
CONVOLUTIONAL NEURAL NETWORKS
• Look for parts and check if their relative positions in the image are consistentwith the type of object you are looking for.
Simple model of a car with three parts
CONVOLUTIONAL NEURAL NETWORKS
• But computers don’t ”see” the way humans do.
CONVOLUTIONAL NEURAL NETWORKS
0 0 0 0 0 0 0 0
0 0 0 1 1 0 0 0
0 0 1 0 0 1 0 0
0 0 1 0 0 1 0 0
0 0 1 1 1 1 0 0
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 0 0 0 0 0
Input image
Convolution (locating parts)
0 1 0
1 0 0
0 0 0

0 1 0
0 0 1
0 0 0

1 0 1
0 0 0
1 0 1

Compression
This configuration is consistent with the letter ”A”
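The part-locating convolution can be sketched in a few lines of NumPy. The image is the 8x8 letter "A" from the slide; the kernel is one hypothetical part detector that fires on a small piece of the left diagonal stroke:

```python
import numpy as np

# The 8x8 binary image of the letter "A" from the slide.
img = np.array([
    [0,0,0,0,0,0,0,0],
    [0,0,0,1,1,0,0,0],
    [0,0,1,0,0,1,0,0],
    [0,0,1,0,0,1,0,0],
    [0,0,1,1,1,1,0,0],
    [0,1,0,0,0,0,1,0],
    [0,1,0,0,0,0,1,0],
    [0,0,0,0,0,0,0,0],
])

# 3x3 kernel: a pixel above-centre plus a pixel centre-left,
# i.e. a tiny fragment of a diagonal stroke.
kernel = np.array([
    [0,1,0],
    [1,0,0],
    [0,0,0],
])

# Valid cross-correlation: slide the kernel over the image, sum the overlap.
kh, kw = kernel.shape
response = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
for i in range(response.shape[0]):
    for j in range(response.shape[1]):
        response[i, j] = (img[i:i+kh, j:j+kw] * kernel).sum()

# The strongest responses (value 2) mark where both kernel pixels match,
# i.e. along the diagonal strokes of the "A".
print(response.max())  # 2.0
```

The response map is what the slides call a feature map: high values tell the network where this part was found.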
Training
• Look for parts and check if their relative positions in the image are consistentwith the type of object you are looking for.
CONVOLUTIONAL NEURAL NETWORKS
Input image → Convolutional layer (Convolution → Activation → Pooling) → Feature maps

1st layer feature map: tells the network where to find simple features like edges and blobs.
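The three stages of a convolutional layer can be sketched directly in NumPy; the edge-detector kernel and random input below are just placeholders:

```python
import numpy as np

def conv2d(img, kernel):
    """Convolution stage: valid 2D cross-correlation."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i+kh, j:j+kw] * kernel).sum()
    return out

def relu(x):
    """Activation stage: keep positive responses, zero out the rest."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Pooling stage: keep the strongest response in each size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.random.rand(8, 8)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])  # vertical edge detector
feature_map = max_pool(relu(conv2d(img, kernel)))
print(feature_map.shape)  # (3, 3)
```

Stacking several such layers, each fed the previous layer's feature maps, gives the deep network described on the following slides.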
CONVOLUTIONAL NEURAL NETWORKS
Input image → Convolutional layer → Convolutional layer → Feature maps
(each convolutional layer performs Convolution → Activation → Pooling)
CONVOLUTIONAL NEURAL NETWORKS
2nd layer feature map: tells the network where to find more complex features like eyes, nose, etc.
Input image → Convolutional layer → Convolutional layer → Feature maps → Fully connected layer(s) → Output
CONVOLUTIONAL NEURAL NETWORKS
Fully connected layers: a trainable ”program” that does ”something” with the features, like checking if their positions are consistent with a face.
ENCODER/DECODER PERSPECTIVE
Encoder
Decoder
Fully connected layer(s)
Output
ENCODER/DECODER PERSPECTIVE
Decoder
Classifier: uses features to distinguish between two or more classes.
Output: discrete labels (”Dog” or ”Cat”)

Regressor: uses features to predict some functional relationship.
Output: real numbers (”Age” of person in image)
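The two decoder types differ only in their output head. A minimal sketch with made-up feature values and weights (a real network would learn these):

```python
import numpy as np

def softmax(z):
    """Classifier head: turns raw scores into class probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Shared features produced by the encoder (hypothetical values).
features = np.array([0.5, -1.2, 3.0])

# Classifier decoder: linear layer + softmax -> discrete label.
W_cls = np.array([[1.0, 0.0, 0.5],    # "Dog" scores
                  [-0.5, 1.0, 0.0]])  # "Cat" scores
probs = softmax(W_cls @ features)
label = ["Dog", "Cat"][int(np.argmax(probs))]

# Regressor decoder: linear layer, raw real-valued output (e.g. age).
w_reg = np.array([2.0, 0.1, 5.0])
age = float(w_reg @ features)

print(label, round(age, 2))  # Dog 15.88
```

The same encoder features can feed both heads; only the loss and output format change.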
OUR USE OF DEEP LEARNING
OUR APPROACH
• Deep learning needs labeled training data – and lots of it
– Annotated images
– Very time consuming
• We specialize in rendering training data
– Drastically reduces the time spent on acquiring and annotating images.
– If CAD data and material descriptions are available, much can be automated.
OUR APPROACH
• Given an image, tell which of a number of classes it belongs to
– What type of object
– Quality control: OK or needs manual inspection
– View direction
OBJECT CLASSIFICATION
OBJECT CLASSIFICATION
• Train on synthetic photorealistic images
– Render images of objects from CAD files
– Random viewing angle and distance to camera
– Random lighting conditions
Tetris controller
Motor
OBJECT CLASSIFICATION
• Image augmentation
– Add a “Background” class consisting of natural images.
– Put rendered objects on top of backgrounds.
– Add a little noise to the colors.
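The compositing step can be sketched in NumPy; the shapes, mask and noise level below are placeholder values, not the pipeline's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(rendered, mask, background, noise_std=0.02):
    """Composite a rendered object onto a natural background, jitter colors.

    rendered, background: HxWx3 float images in [0, 1]
    mask: HxW boolean, True where the rendered object covers the pixel
    """
    out = background.copy()
    out[mask] = rendered[mask]                    # paste object over background
    out += rng.normal(0.0, noise_std, out.shape)  # add a little color noise
    return np.clip(out, 0.0, 1.0)

# Hypothetical 64x64 inputs standing in for a render and a natural photo.
rendered = np.full((64, 64, 3), 0.8)
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True                         # object occupies the centre
background = rng.random((64, 64, 3))

sample = augment(rendered, mask, background)
print(sample.shape)  # (64, 64, 3)
```

Each rendered object can be pasted onto many different backgrounds, multiplying the effective size of the training set.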
Background
Tetris controller
Motor
OBJECT CLASSIFICATION
• AlexNet (A. Krizhevsky 2012)
– First deep CNN to win the ImageNet challenge (originally trained for 2 weeks!).
– Transfer learning (network weights pre-trained on ImageNet).
– Training time: 1-2 hours
• Results
– Robustly distinguishes between
• Tetris controller
• Motor
• Background (anything else)
– Only recognizes what it has seen during training
• When the motor is too small, it is classified as background.
OBJECT CLASSIFICATION
DOES IT SCALE?
• ImageNet has 1000 object classes.
• State-of-the-art deep learning algorithms perform better than humans on this benchmark!
• We successfully tested this approach with up to 150 object classes.
VIEW CLASSIFICATION
• Train a neural network to roughly estimate camera pose.
• Formulate it as a classification problem.
Input image → ConvNet → Which view is it?
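Turning a continuous camera pose into one of 48 classes amounts to binning the viewing angles. The 8 x 6 azimuth/elevation split below is an assumed example of how 48 views could be laid out, not necessarily the split used in the project:

```python
def view_class(azimuth_deg, elevation_deg, n_az=8, n_el=6):
    """Map a camera pose (azimuth 0-360, elevation 0-180) to one of
    n_az * n_el discrete view classes (8 x 6 = 48 here)."""
    az_bin = int(azimuth_deg % 360 // (360 / n_az))
    el_bin = min(int(elevation_deg // (180 / n_el)), n_el - 1)
    return el_bin * n_az + az_bin

print(view_class(0, 0))      # 0
print(view_class(350, 175))  # 47
```

Each rendered training image is labeled with the class of the pose it was rendered from, and the ConvNet predicts that class at test time.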
VIEW CLASSIFICATION
• Again, we train on synthetic photorealistic images.
• 48 different views/classes
• Training image examples
VIEW CLASSIFICATION
View 1
View 2
• Results
– Works fine for rough view classification like up/down or left/right.
• Use case in AR
– Instructional assistance
– “Flip object left-to-right”
– “Turn object upside-down”
VIEW CLASSIFICATION
Input image Closest view
• Can we use Deep Learning to detect objects in images?
– Input: image
– Output: bounding boxes + labels for each object in the image
OBJECT DETECTION
OBJECT DETECTION
• How it used to be done
– Train a classifier such as a deep neural network.
– Run a sliding window over the image at multiple scales.
• Disadvantages
– (Hand-crafted features)
– Computationally expensive
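The classical sliding-window scheme can be sketched as follows; the brightness-based toy classifier stands in for the trained network, which is where the computational cost comes from (one network evaluation per window per scale):

```python
import numpy as np

def sliding_window_detect(image, classifier, win=32, stride=16):
    """Run a window classifier over every position; return boxes that fire.

    classifier: function patch -> score in [0, 1]
    (a stand-in here for a real CNN classifier).
    """
    boxes = []
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = classifier(image[y:y+win, x:x+win])
            if score > 0.5:
                boxes.append((x, y, win, win, score))
    return boxes

# Toy image: a single bright square as the "object".
img = np.zeros((96, 96))
img[32:64, 32:64] = 1.0
boxes = sliding_window_detect(img, lambda p: p.mean())
print(boxes)  # one detection at (32, 32)
```

To handle objects of different sizes, the same loop has to be repeated over a pyramid of rescaled images, multiplying the number of classifier evaluations.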
OBJECT DETECTION
• Region-based Fully Convolutional Net (R-FCN)
– Run a region proposal network (RPN) to generate regions of interest.
– For each region of interest, look for object parts (3x3 grid).
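Proposal-based detectors like R-FCN score candidate regions against ground-truth boxes using intersection-over-union (IoU), both during training and when merging overlapping detections. A minimal sketch:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # 0.143
```

A proposal is typically counted as a positive training example when its IoU with a ground-truth box exceeds a threshold such as 0.5.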
OBJECT DETECTION
• Train on synthetic photorealistic images
– Store bounding boxes of objects + class labels.
– Allow multiple objects in the same image.
– Partially occlude objects with a gray square.
OBJECT DETECTION
OBJECT DETECTION
• Results
– Works well on both synthetic and real images.
– Detects multiple objects in the same image.
– Real-time performance.
• Example: Segmentation
– Given an image, divide pixels into classes
IMAGE-TO-IMAGE TRANSFORMATION
IMAGE-TO-IMAGE TRANSFORMATION
• Fully Convolutional Networks for semantic segmentation
– Input: image
– Output: image with class labels
• Image-to-image translation
– Add color to an image
– CT to histology
– Building geometry
IMAGE-TO-IMAGE TRANSFORMATION
IMAGE-TO-IMAGE TRANSLATION
• Tracking object keypoints
Select 3D keypoints in the CAD model that you want to track
Input image Output image
IMAGE-TO-IMAGE TRANSLATION
• Tracking object keypoints
Select a 3D keypoint in the CAD model that you want to track
IMAGE-TO-IMAGE TRANSLATION
• Tracking of 48 object keypoints
• Given an image, compute one or more numbers that describe what you see.
– The size of an object
– Age of a person
– Positions of facial landmarks
– How many millimeters of tread depth are left on a tire?
REGRESSION NETWORKS
• Predict object keypoint positions using a regression network.
• Use the keypoints to estimate object pose.
KEYPOINT REGRESSION
Predicted keypoints Estimated object pose
POSE ESTIMATION
• Where is the camera relative to the object?
• Mapping between 3D object coordinates and 2D image coordinates.
Projection of 3D points (cube) onto an image.
Camera
Object
Scene
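The 3D-to-2D mapping is the pinhole camera model. A minimal sketch projecting the cube's corners; the pose and intrinsics (f, cx, cy) are arbitrary illustration values:

```python
import numpy as np

def project(points_3d, R, t, f=500.0, cx=320.0, cy=240.0):
    """Project 3D object points into the image with a pinhole camera.

    R, t: camera pose (rotation matrix, translation)
    f, cx, cy: intrinsics (focal length and principal point)
    """
    cam = (R @ points_3d.T).T + t       # object -> camera coordinates
    uv = cam[:, :2] / cam[:, 2:3]       # perspective divide by depth
    return uv * f + np.array([cx, cy])  # scale by focal length, shift to centre

# Unit cube corners in object coordinates.
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                dtype=float)
R = np.eye(3)                  # camera looking straight at the object
t = np.array([0.0, 0.0, 5.0])  # object 5 units in front of the camera

pixels = project(cube, R, t)
print(pixels[0])  # corner (0,0,0) lands at the principal point (320, 240)
```

Pose estimation inverts this mapping: given predicted 2D keypoints and their known 3D positions in the CAD model, solve for the R and t that best explain the projections.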
• Work in progress
FINE-TUNING CAMERA POSE
DEEP LEARNING IN AUGMENTED REALITY
• Our approach
– Use synthetic images to train Deep Neural Networks.
– Works well for object recognition, detection and markerless tracking.
Markerless tracking Detection and recognition
WHERE TO GO NEXT?
• Using synthetic images for training doesn’t always work that well…• We need to close the ”simulation-to-reality gap”.
Thank you for your attention!