Optimizing Hardware for Distributed Imagery Processing: A Case Study
R. David Day, CP, GISP
Keystone Aerial Surveys, Inc.
Abstract
This case study details the procedures, tools and necessity of a thorough examination of a firm’s entire
processing infrastructure. When Keystone Aerial Surveys, Inc. needed to upgrade its processing
hardware, the company first set out to determine how the software it used interacted with its enterprise
and workstation hardware. As a primary data acquisition company using Microsoft/Vexcel UltraCams,
Keystone uses the UltraMap suite to process the hundreds of thousands of raw images it collects every
year. This case study demonstrates how organizations running UltraMap or similar software can
understand and measure how the software uses system resources and take steps to evaluate their
software/hardware implementation.
Many primary acquisition vendors creating downstream products can benefit from implementing the
test structure, tools and techniques demonstrated here. The topics of IOPS, latency and response
times of storage systems are crucial when discussing system performance and must be addressed by a
qualified storage or systems professional. Using an outside vendor is often the ideal solution for a small
to mid-sized company; it is therefore vital to correctly describe the unique technology needs of the
geospatial industry to these professionals and to create a proper system stress plan that determines how
the software uses network connectivity, Random Access Memory (RAM), hard disk, Central Processing
Unit (CPU) and Graphics Processing Unit (GPU) capabilities.
Background
Over the past 25 years the photogrammetric industry has moved from the film-based tasks of stereo
plotting to softcopy and now to all-digital workflows. As the industry has taken to using digital imagery
in digital workflows, the software designs and algorithms have grown in their ability to produce more
accurate results. Simultaneously, hardware has grown at a staggering rate in accordance with Moore’s
Law. Moore’s Law, named after Gordon Moore, a founder of Intel, originally stated that
computer processor capacity would double every two years. The law has morphed over the years to
doubling every 18 months, but the description of exponential growth has generally held true (Li, 2013).
In October of 1990 Intel released the 80386SL, the first chip made specifically for the burgeoning
portable computer market. The chip was a 32-bit processor with a 20 MHz clock speed, 4 GB of maximum
memory and 855,000 transistors (Intel Corp, 2015). In 2012, Intel released the 64-bit Xeon E5-2667
with a 3.2 GHz clock speed, 786 GB of maximum addressable memory and 2.6 billion transistors. This
represents growth factors of roughly 160x, 196x and 3,040x respectively!
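The growth factors quoted above follow directly from the cited specifications; a quick arithmetic check using the clock speeds, memory ceilings and transistor counts given in the text:

```python
# Growth factors between the 1990 Intel 80386SL and the 2012 Xeon E5-2667,
# using the figures cited in the text.
old = {"clock_mhz": 20, "max_mem_gb": 4, "transistors": 855_000}
new = {"clock_mhz": 3200, "max_mem_gb": 786, "transistors": 2_600_000_000}

factors = {k: new[k] / old[k] for k in old}
print(factors)  # clock: 160x, memory: ~196x, transistors: ~3040x
```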
With this explosion of hardware capabilities, software has been able to advance in its attempts to
automate nearly all photogrammetric, image processing and other geospatial activities.
Parallel programming (multi-threading), CUDA programming for GPUs and other technologies
are available to software designers, but use of the technology varies by vendor. Geospatial software
providers have tackled, in various ways, the performance issues presented by semi-global matching and
other matching algorithms, automated tie point extraction, 3D point extraction, orthophoto production,
automatic seamline generation and other compute-intensive processes unique to the industry.
Software providers have attempted to use new algorithms for optimized use of hardware
resources such as GPUs, distributed processing and proprietary formats.
However, most manufacturers have limited access to massive hardware arrays from multiple
manufacturers and, due to the relatively small nature of the geospatial industry, the hardware vendors
do not test geospatial algorithms or software as is done with other types of common business software
tasks. Keystone set out to do this itself by testing and quantifying the resources used by its most
commonly run software package in order to optimize the company processing workflow. Some details
are given below about hardware and software, but generally the lessons to be learned are generic to
any software or situation.
Software
UltraMap is a software package produced by Microsoft/Vexcel of Austria to support the UltraCam line of
large format imagery sensors. The software has many modules, including imagery processing, Aerial
Triangulation (AT), Radiometry, Digital Surface Model (DSM) and Orthophoto generation. For this
study, only the Raw Data Center (RDC), AT and Radiometry modules were considered. The UltraMap
software is based on a distributed processing model where one controller computer doles out tasks to
worker nodes. Both are generally server hardware, but can be workstations as well. The software must
be run on Windows 64-bit operating systems and supports nearly any hardware configuration; however,
the manufacturer recommends latest-generation multi-core processors with at least 2 GB of RAM per
processor core. GPUs may be used for Dense Matcher module processing, but neither that software
nor the GPU hardware was tested in this case study. For storage, the manufacturer recommends either
Network Attached Storage (NAS) or Storage Area Network (SAN) architecture with 10 Gb Ethernet (Vexcel
Imaging, 2014). As noted earlier, this was a test to customize the hardware environment for the
software, not a validation of the software’s speed or the manufacturer’s recommendations.
Processing Workflow
UltraCam sensors use multiple calibrated cameras to generate their final, full frame images. The raw data
recorded by the sensor consists of multiple pieces of imagery and metadata that must be assembled for
use in geospatial products. This state of the imagery is called Level0 (L0). The transfer of the L0 data
from the sensor to network storage is handled by the RDC module within UltraMap. The initial phase of
post processing primarily involves the seaming of the panchromatic images into one single virtual image.
RDC is again used to do this in a distributed fashion. The controller server receives the job and instructs
the processing nodes to retrieve the L0 imagery from the storage location, process it to the next level
and store it at a shared location. The imagery and associated metadata are called Level02 (L2) at this
stage in the process.
The L2 imagery is then run through an initial automatic point extraction using the AT module. Further
bundle adjustments, control point collection and point extractions can be done at this stage as needed.
Finally, color balance across the project is performed using the Radiometry module and a delivery data,
or Level03 (L3), job is run. This is distributed among the processing nodes in the same manner as the L0
to L2 processing. This entire workflow was the focus of the testing of the Keystone systems.
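The controller/worker dispatch pattern described above can be sketched with Python’s standard library. This is an illustrative model only, not UltraMap’s actual mechanism; the function name and the L2 path format are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a processing-node task: read an L0 image
# from shared storage, process it, and write the L2 output back.
def process_to_l2(image_id: int) -> str:
    # (real work would seam the panchromatic sub-images here)
    return f"L2/image_{image_id:05d}.tif"

# The controller doles out image IDs; each worker thread plays the role
# of a processing node pulling a job and writing results to shared storage.
with ThreadPoolExecutor(max_workers=4) as controller:
    results = list(controller.map(process_to_l2, range(8)))

print(results[0])  # L2/image_00000.tif
```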
Hardware
Keystone’s existing hardware infrastructure consisted of a NetApp Network Attached Storage device
using Serial Advanced Technology Attachment (SATA) hard disks with 10 Gb Ethernet and Fiber Channel
connectivity. Ethernet is the standard protocol for communication between networked computers and
requires a hub or switch topology for communications. 10 G Ethernet extends the Ethernet protocol to
10 Gbps transfer speeds, but it is not backward compatible with slower Ethernet hardware.
Consequently, separate network adapter cards, cabling and switching are necessary for true
10 G Ethernet networks. Cabling itself can be flexible, as both copper and fiber cabling are supported.
Fiber Channel is a communication technology designed for data centers and SANs that is capable of
using direct connection, switching or arbitrated loop topologies. Fiber Channel is capable of supporting
various communication protocols, but does not support Ethernet natively (Norman, 2001).
The existing servers were a combination of Intel and AMD processing cores running on Dell servers of
various ages with at least the recommended minimum of RAM (though at various speeds). Most of the
connectivity between the servers and the storage was 10 Gb Ethernet, but some was 1 Gb.
Partner
Keystone engaged the services of a technology vendor (Razor Technology) early in the process in order
to provide the professional knowledge and skills necessary to perform an exhaustive system test. It is
important for a geospatial organization to educate the vendor on the unique nature of their work and
software requirements so that the testing can accurately reflect workflow. The partner provides
expertise in the implementation and interpretation of the test, while the organization must drive the
needs of the testing.
Methods
The system test was designed to mimic a real workflow event where the entire processing architecture
and storage architecture are stressed. This included massive movement of data during all stages of the
usage of UltraMap software (RDC, AT and Radiometry). A small-footprint software package from EMC
Corporation was installed on each processing server to monitor system Input/Output (I/O),
RAM, processor usage, network traffic, etc. for an entire 24-hour period. The storage system was also
monitored for a complete 24-hour period using software designed for it. This captured a period of
heavy usage, but also periods of limited or no requests to the system, which are crucial for establishing a
baseline. The results for throughput, latency, etc. on the existing system were established and used as
benchmarks and insight into a possible replacement. The test was then simulated on a prospective
system and, finally, the full system test was repeated on the installed solution after the purchase.
A schedule was developed that included specific processing of Level0 (raw imagery) to Level 02, AT,
Radiometry and final imagery processing. Each processing test performed was unique due to workflow
needs of a production system and the data available.
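The per-server monitoring described above boils down to sampling cumulative OS counters at intervals and converting the deltas to per-second rates. The sketch below is a generic stand-in for the EMC package, not its actual interface; the counter field names and the callable-based design are assumptions for illustration:

```python
import time

def sample_rates(read_counters, interval_s=1.0, samples=3):
    """Derive per-second rates (e.g. IOPS) from cumulative counters.

    read_counters: a callable returning a dict of cumulative values,
    e.g. {"io_ops": ..., "net_bytes": ...} (hypothetical field names).
    """
    history = []
    prev = read_counters()
    for _ in range(samples):
        time.sleep(interval_s)
        cur = read_counters()
        # rate = (counter delta) / (elapsed time)
        history.append({k: (cur[k] - prev[k]) / interval_s for k in cur})
        prev = cur
    return history

# Usage with a fake counter source; a real deployment would read the
# operating system's performance counters instead.
state = {"io_ops": 0, "net_bytes": 0}
def fake_counters():
    state["io_ops"] += 500           # 500 I/O operations per tick
    state["net_bytes"] += 1_000_000  # ~1 MB per tick
    return dict(state)

rates = sample_rates(fake_counters, interval_s=0.01, samples=2)
print(rates[0]["io_ops"])  # I/O operations per second
```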
Assumptions
The general assumptions going into the test were:
- Increasing processor cores per processing computer will increase throughput
- Increasing RAM per processor core will result in increased throughput
- Network bandwidth of 1 Gb is a bottleneck
- Processing computer local machine disk speed is a bottleneck
- All I/O is sequential
- Solid state disk on the SAN will increase speed of I/O and overall throughput
Results: Initial Test of Processing Servers
For the initial testing there were several key times:
- Monitoring started at 9:00 am
- Small L0 to L2 process shortly after 9:00 am
- Large L2 to L3 process beginning at approximately 11:20 am
- Large copy to NAS storage through the processing servers from 11:45 am to 12:40 pm
By examining the results of one of the processors used heavily in the testing, it is quickly seen that many
of the assumptions were incorrect. The software performed differently than expected. The initial
processing of raw imagery actually requires only moderate disk I/O (first spike in the top of Figure 1) and
moderate memory and CPU usage, because the server spends a long period (up to 20 minutes) on each
image due to slower CPU speeds. The tasks of copying data and image delivery processing (L3), by
contrast, are highly disk I/O intensive and overwhelm the system.
As seen in Figure 1 below, the I/O operations per second (IOPS) were extremely high: despite the large
images being copied in sequential order, the threading of the software generates nearly random I/O
operations under a heavy workload. This points to the necessity for fast read and write events and solid
state disk. Solid state drives would increase the read/write speed roughly three times and reduce latency
(UserBenchmark, 2015).
Figure 1: Processing Server Disk, CPU and RAM performance metrics
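The effect described above, where several threads each copying a large file sequentially combine into a nearly random request stream at the disk, can be shown with a toy model (thread counts and block sizes here are arbitrary illustrative values):

```python
import itertools

def merged_offsets(n_threads=4, blocks_per_file=100, file_size=1000):
    """Interleave the block offsets of n_threads sequential copies,
    as a round-robin scheduler might present them to the disk."""
    streams = [
        iter(range(t * file_size, t * file_size + blocks_per_file))
        for t in range(n_threads)
    ]
    merged = []
    for group in itertools.zip_longest(*streams):
        merged.extend(o for o in group if o is not None)
    return merged

def sequential_fraction(offsets):
    """Fraction of requests contiguous with the previous request."""
    seq = sum(1 for a, b in zip(offsets, offsets[1:]) if b == a + 1)
    return seq / (len(offsets) - 1)

one_stream = list(range(100))
print(sequential_fraction(one_stream))        # 1.0 - fully sequential
print(sequential_fraction(merged_offsets()))  # 0.0 - looks random to the disk
```

Each thread on its own is perfectly sequential, but the merged stream the disk actually sees jumps between distant offsets on every request, which is why spinning disks suffer and SSDs help.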
The charts in Figure 1 also point to constant high memory usage during the periods of maximum
processing of imagery. With a total of 1600 MB on the machine, memory usage stayed high through the
initial hours of the processing load and became erratic during the primarily image-movement tasks.
With 50% of the processing capacity in use, the CPU usage appears adequate, but because only 8 of the
possible 12 cores were used, Figure 1 is misleading. Figure 2 below instead illustrates the constant
utilization spread across the CPU threads actually in use. This confirms that CPU speed is more critical
to optimizing UltraMap than core count alone.
Figure 2: Processing server thread utilization percentage
Table 1 below contains the latency statistics for the processing server in milliseconds. Both the read and
write response times are high due to the randomness created by many simultaneous sequential disk
reads and writes. This greatly affects the effectiveness of traditional spinning disk. Furthermore, the
industry rule of thumb for workloads states that a queue length greater than the spindle count is beyond
capacity. With a single hard disk system (non-RAID), any average queue length approaching 2 is cause
for concern (Davis, 2014).
Average Read Response Time (ms)            2.4
95th Percentile Read Response Time (ms)    7.51
Average Write Response Time (ms)           36.32
95th Percentile Write Response Time (ms)   98.92
Average Queue Length                       1.81
95th Percentile Queue Length               8
Table 1: Latency data for typical processing server
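The queue-length and latency figures in Table 1 are linked by Little’s Law: average queue length ≈ arrival rate × average response time. The sketch below uses the measured 36.32 ms write response time with a hypothetical round write rate (50 ops/s was not measured; it is chosen for illustration):

```python
def avg_queue_length(iops: float, avg_response_ms: float) -> float:
    """Little's Law: L = arrival rate (ops/s) * response time (s)."""
    return iops * (avg_response_ms / 1000.0)

# A hypothetical 50 writes/s at the measured 36.32 ms average write
# response time would alone sustain a queue length of about 1.8 -
# right at the "cause for concern" threshold for a single spindle.
print(avg_queue_length(50, 36.32))
```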
Finally, the test revealed that network utilization was below 1% of the 10 gigabit Ethernet capacity,
indicating that network connectivity is not as crucial as suspected.
Results: Initial Test of Storage Device
The storage system is divided between two controllers with 10 Gb Ethernet connections. During the
peak processing and image transfer periods, latency (response times) was high. This suggests that the
storage hardware is not optimized for the workflow. Additionally, having two controllers with
separately assigned hard disk volumes created uneven disk utilization across the device.
Figure 3: Controller 1 IOPS
Figure 4: Controller 2 IOPS
Figure 5: Controller 1 Response Time (Latency)
Figure 6: Controller 2 Response Time (Latency)
The CPU usage and network utilization suggested minimal overhead on the device and adequate
network infrastructure (Figures 7 and 8). The types of processes created by geospatial software are not
generally CPU intensive for large storage devices. It is clear that IOPS and disk speed are much more
important in this workflow.
Figure 7: CPU utilization per controller
Figure 8: Network utilization per controller
Recommendation
Based on the results above, it was evident that the UltraMap software has unique requirements for each
of the modules being used. In order to create a machine that would serve all of the software modules’
needs, it was necessary to use SSD disks and select a CPU for speed (as opposed to core count). Table 2
below illustrates how replacing systems with more powerful (faster) CPUs is more efficient (by reducing
the number of machines) than simply adding cores.
System Group     GHz    Cores  Chips  Cores/Chip  Score/Core  Servers  Total  Comment
Current A        2.9    8      2      4           28.3        12       2712   Quad-Core AMD Opteron 2389
Current B        2.4    8      2      4           24.9        5        997    Quad-Core AMD Opteron 2379
Current C        2.66   12     2      6           49.6        4        2380   Intel Xeon X5650 @ 2.67GHz
Current D        2.66   8      2      4           49.5        2        792    Intel Xeon E5640 @ 2.67GHz
Total                                                         23       6881
Proposed System  3.5    12     2      6           94.167      7        7910   Intel Xeon E5-2643 v2 @ 3.50 GHz
Table 2: Comparison of Current System with Optimal System
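The Total column in Table 2 is the per-core benchmark score scaled across all cores and servers in each group; the arithmetic can be reproduced directly from the table (small differences from the listed totals are rounding in the source):

```python
def group_total(score_per_core: float, cores: int, servers: int) -> float:
    """Aggregate throughput score for a group of identical servers."""
    return score_per_core * cores * servers

current_c = group_total(49.6, 12, 4)    # Intel Xeon X5650 group
proposed  = group_total(94.167, 12, 7)  # Intel Xeon E5-2643 v2 group

print(round(current_c))  # close to the 2380 listed in Table 2
print(round(proposed))   # ~7910: 7 new servers outscore all 23 old ones
```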
Using the specifications in Table 2 and the test results, systems were easily designed with the specific
CPU, RAM and SSD drives needed to meet the needs of the software. No further testing was necessary
for the correct specification of these servers.
Results: Pre-Purchase Test
Two options were determined to be possible candidates for the network storage piece. One system
(System “A”) was optimized using SSD arrays backed by standard hard disks, with latency reduced by
connectivity devices that converted 10 Gb Ethernet to Fiber Channel. System “B” used multiple disk
arrays connected via 10 Gb Ethernet and standard SATA drives with a unique software and hardware
architecture. System “A” was benchmarked on a sample system by the manufacturer using Keystone-
supplied imagery. While this did not accurately simulate the full UltraMap workflow, it did generate the
interesting results below:
Figure 9: System “A” Controller IO
Figure 10: System "A" Controller Latency
Figure 11: System "A" Disk IO
Figure 12: System "A" Disk Latency
As can be seen in Figures 9 through 12 above, during an intense write session to a SAN that uses SSD
drives for its initial storage and SATA drives for the bulk of the storage, the initial I/O of the system is
intense and performance is excellent. However, as the SSDs begin to fill, the latency of both the
controller and the disks increases to unacceptable levels. The overhead of transitioning data from SSD to
SATA while continuing to write to the SATA disks causes unacceptably high delays (latency).
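The cache-fill behavior observed in System “A” can be captured in a toy model: writes land in the fast SSD tier until it fills, after which they are bottlenecked by the slower destage path to SATA. All rates, capacities and latencies below are illustrative assumptions, not measured values:

```python
def simulate_tiered_writes(seconds, write_mb_s=2000, destage_mb_s=500,
                           ssd_capacity_mb=100_000,
                           ssd_latency_ms=0.3, sata_latency_ms=8.0):
    """Per-second effective write latency as the SSD tier fills."""
    filled, latencies = 0.0, []
    for _ in range(seconds):
        filled = max(filled + write_mb_s - destage_mb_s, 0.0)  # net fill
        if filled < ssd_capacity_mb:
            latencies.append(ssd_latency_ms)   # absorbed by the SSD tier
        else:
            filled = ssd_capacity_mb
            latencies.append(sata_latency_ms)  # stalled on destage to SATA
    return latencies

lat = simulate_tiered_writes(120)
print(lat[0], lat[-1])  # fast at first, slow once the SSD tier is full
```

With these numbers the SSD tier fills after roughly a minute of sustained writing, at which point effective latency jumps by more than an order of magnitude, mirroring the pattern in Figures 10 and 12.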
Due to time constraints and availability, System “B” could not be tested, forcing Keystone to make a
decision based on manufacturer benchmarks rather than real data. The solution, as configured, would
support 4,650 MB/s read and 2,230 MB/s write with a total of 90,000 IOPS. Latency would be
significantly reduced due to the network configuration, system design and high IOPS. Based on the
superior design of System “B”, it was selected for purchase.
Results: Post-Installation Test
Once the System “B” storage, the new processing servers and all software, firmware and other
optimizations were installed and completed, a system test was performed to mimic the previous tests.
The charts below show the processor utilization of one of the new processing servers. While 11 of the
12 possible CPU cores are being used for processing, only 50% of the total CPU capacity is being used.
This confirms that additional optimizations in both software and hardware will not overly tax the server.
Additionally, the overhead could be used to run other processes on the server without danger of
overtasking the system. The amount of free memory does not fluctuate significantly and never
approaches zero, in contrast to the older processing servers (Figure 1).
Figure 13: New Processing Server Test Results for CPU usage (Top) and available memory (Bottom)
The use of solid state disk in the servers produced significant speed increases. The latency figures below
show the reduction in disk read/write response times. Particularly interesting are the reduced average
queue length and the 95% reduction in write latency!
                                           Original System   New System
Average Read Response Time (ms)            2.4               0.35
95th Percentile Read Response Time (ms)    7.51              1.1
Average Write Response Time (ms)           36.32             3.15
95th Percentile Write Response Time (ms)   98.92             5.1
Average Queue Length                       1.81              0.35
95th Percentile Queue Length               8                 3
Table 3: Latency comparison between old and new processing servers
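The improvements in Table 3 can be expressed as percentage reductions; a quick computation using the table’s values:

```python
def pct_reduction(old: float, new: float) -> float:
    """Percentage reduction from old to new."""
    return 100.0 * (old - new) / old

# Old vs. new processing-server metrics from Table 3
print(round(pct_reduction(2.4, 0.35)))    # avg read latency: ~85% lower
print(round(pct_reduction(36.32, 3.15)))  # avg write latency: ~91% lower
print(round(pct_reduction(98.92, 5.1)))   # 95th-pct write: ~95% lower
print(round(pct_reduction(1.81, 0.35)))   # avg queue length: ~81% lower
```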
The storage area network also showed remarkable improvements over the previous system. Figures 14
and 15 below show a peak of nearly 17,000 IOPS while maintaining response times of half a millisecond
or less. Compare this to Figures 3 through 6, which show maximum IOPS of 5,000 and response times
consistently above 1 millisecond and peaking at 8.
Figure 14: New SAN throughput during system testing
Figure 15: New SAN latency during system testing
Finally, the most useful measure of the system’s value to Keystone is the increase in throughput of
imagery. While easily assessed anecdotally, quantifying the change is difficult for two reasons: first,
the variability of simultaneous processes can affect individual image processing speeds and, second, the
software does not provide a mechanism to sort or export the processing times. The times used in the
tables below are based on a random sampling of larger jobs performed before and after the new system
installs. The actual times can vary, but these times represent an estimated median for the jobs
referenced.
Level 02 Processing (minutes per image)
         ------------ Old Storage System ------------   ------------ New Storage System ------------
Sensor   Group "A"  Group "C"  Group "D"  Wtd Ave       New    Group "C"  Group "D"  Wtd Ave   Gain %
Eagle    16.6       15.8       10.8       15.8          9.3    14.4       10.7       11.0      30%
Falcon   11.5       11.4       10.3       11.3          6.6    8.2        7.9        7.2       36%
UCX      11.1       10.3       7.8        10.5          4.5    5.2        4.8        4.7       55%
UCO      21.6       19.1       27.1       21.3          16.4   21.4       26.5       18.9      11%
Table 4: Single image processing times to Level 02 by server type and sensor type
Level 03 Processing (minutes per image)
         ------------ Old Storage System ------------   ------------ New Storage System ------------
Sensor   Group "A"  Group "C"  Group "D"  Wtd Ave       New    Group "C"  Group "D"  Wtd Ave   Gain %
Eagle    12.0       7.1        6.3        9.9           5.7    6.5        6.3        6.0       39%
Falcon   8.3        4.8        4.5        6.8           4.1    4.6        4.6        4.3       37%
UCX      11.5       6.9        8.5        9.7           3.3    4.0        3.5        3.5       64%
Table 5: Single image processing times to Level 3 by server type and sensor type
The left side of each table shows the times, in minutes, for a single image process by server group and
camera type. To repeat, the times vary based on the size of the processing job and other processes
being run on the system. The weighted average is based on the number of machines and the cores on
each machine used in the processing stack. The right side of each table shows the new processing
servers alongside the older systems remaining in place. Interestingly, the older systems also showed
improvement because of the improved storage system, but the new processing servers vastly outproduce
the existing infrastructure. Another way to express the processing gains is to note that on a 60 core
distributed system, 1000 images will be processed between 60 and 100 minutes faster while allowing
other much-needed processes to run unaffected. This can translate to hundreds of hours of time
savings in a single year.
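The cluster-level claim above can be reproduced from the weighted-average per-image times in Table 4, assuming perfect parallelism across a 60 core distributed system (an idealization that ignores scheduling overhead):

```python
def batch_minutes(per_image_min: float, images=1000, cores=60) -> float:
    """Idealized wall-clock minutes for a batch with perfect parallelism."""
    return per_image_min * images / cores

# Weighted-average L2 times from Table 4, old vs. new storage system
for sensor, old, new in [("Eagle", 15.8, 11.0),
                         ("Falcon", 11.3, 7.2),
                         ("UCX", 10.5, 4.7)]:
    saved = batch_minutes(old) - batch_minutes(new)
    print(f"{sensor}: ~{saved:.0f} minutes saved per 1000 images")
```

The savings come out between roughly 68 minutes (Falcon) and 97 minutes (UCX), consistent with the 60-to-100-minute range quoted above.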
Conclusions
- The initial assumption that increased CPU core count would generate increased throughput
proved false. CPU processor speed proved a more vital factor in processing speed.
- Imagery processing with UltraMap creates high IOPS: despite large images being copied in
sequential order, the threading of the software generates nearly random I/O operations under a
heavy workload.
- The server testing revealed that network utilization was below 1% of the 10 gigabit Ethernet
network card, indicating that network connectivity to the processing servers is not as crucial as
suspected in the assumptions.
- Based on the testing, it was evident that the UltraMap software has unique requirements for
each processing module. In order to create a machine that would serve all of the software
modules’ needs, it was necessary to use SSD disks and to select a CPU for clock speed.
- Unlike other business workflow scenarios, SSD usage on large storage systems does not produce
good results for geospatial imaging workflows. During intense usage, as the SSDs begin to fill,
the latency of both the controller and the disks increases to unacceptable levels. The overhead of
transitioning data from SSD to SATA while continuing to write to the SATA disks causes
unacceptably high delays.
- The use of solid state disk in the processing servers produced significant speed increases. The
results show reductions in disk read and write latency of 90% and 95% respectively. Particularly
interesting is the average queue length, reduced by 80% to 0.35.
- The storage area network also showed remarkable improvements over the previous system, with
a peak of nearly 17,000 IOPS while maintaining response times of half a millisecond or less,
compared to the previous system’s maximum of 5,000 IOPS and response times consistently
above 1 millisecond and peaking at 8.
- Processing gains for individual images range from 11% to 64%. On a 60 core distributed
system, 1000 images will be processed between 60 and 100 minutes faster than with the
previous system while allowing other much-needed processes to run unaffected. This will
translate to hundreds of hours of time savings in a single year.
As practitioners of photogrammetry, remote sensing and GIS, industry professionals do not need
expertise in all or even many of the deeper aspects of Information Technology (IT). It is, however,
important to understand enough of the technologies behind the screens to ask the right questions of
hardware vendors and industry software providers. It is equally important to have a trustworthy and
knowledgeable partner or staff member who can fully understand the latest hardware innovations and
apply the correct technologies to each software task. Because the geospatial industry is now almost
completely integrated into the high technology sector in its application and usage, more than a cursory
knowledge of IT must be cultivated. Wasting money and, more importantly, time on systems not
optimized for the intended software is not acceptable in an industry where margins are often tight.
This case study shows that a thorough analysis and testing regimen is invaluable for system
optimization.
References
Davis, R. L. (2014, August 22). Monitoring Disk Queue Length. Retrieved from Idera: http://blog.idera.com/sql-server/performance-and-monitoring/monitoring-disk-queue-length/
Intel Corp. (2015, April 1). Microprocessor Quick Reference Guide. Retrieved from Intel.com: http://www.intel.com/pressroom/kits/quickrefyr.htm
Li, M. (2013, May 29). Keeping Up with Moore's Law. Retrieved from Dartmouth Undergraduate Journal of Science: http://dujs.dartmouth.edu/spring-2013-15th-anniversary-edition/keeping-up-with-moores-law#.VR0xN_nF9Ap
Norman, D. (2001, December 2). Fibre Channel Technology for Storage Area Networks. Retrieved from Rivier University Department of Math and Computer Science: https://www.rivier.edu/faculty/vriabov/CS553a_Paper_DNorman.pdf
UserBenchmark. (2015). SSD and HDD Benchmark. Retrieved from UserBenchmark.com: http://www.userbenchmark.com/
Vexcel Imaging. (2014). UltraMap v3.2 Overview. Graz, Austria: Vexcel Imaging GmbH.