Optimizing Hardware for Distributed Imagery Processing: A Case Study
R. David Day, CP, GISP
Keystone Aerial Surveys, Inc.
Abstract
This case study details the procedures, tools and necessity of a thorough examination of a firm’s entire
processing infrastructure. When Keystone Aerial Surveys, Inc. needed to upgrade its processing
hardware, the company first set out to determine how the software it used interacted with its enterprise
and workstation hardware. As a primary data acquisition company using Microsoft/Vexcel UltraCams,
Keystone uses the UltraMap suite to process the hundreds of thousands of raw images it collects every
year. This case study demonstrates how organizations running UltraMap or similar software can
understand and measure how the software uses system resources and take steps to evaluate their
software/hardware implementation.
Many primary acquisition vendors creating downstream products can benefit from implementing the
test structure, tools and techniques demonstrated here. The topics of IOPS, latency and response
times of storage systems are crucial when discussing system performance and must be addressed by a
qualified storage or systems professional. Using an outside vendor is often the ideal solution for a small
to mid-sized company; it is therefore vital to correctly describe the unique technology needs of the
geospatial industry to these professionals and to create a proper system stress plan that determines how
the software uses network connectivity, Random Access Memory (RAM), hard disk, Central Processing
Unit (CPU) and Graphics Processing Unit (GPU) capabilities.
Background
Over the past 25 years the photogrammetric industry has moved from the film-based tasks of stereo
plotting to softcopy and now to all-digital workflows. As the industry has taken to using digital imagery
in digital workflows, the software designs and algorithms have grown in their ability to produce more
accurate results. Simultaneously, hardware has grown at a staggering rate in accordance with Moore’s
Law. Moore’s Law, named after Gordon Moore, a founder of Intel, originally stated that
computer processor capacity would double every two years. The law has morphed over the years to
doubling every 18 months, but the description of exponential growth has generally held true (Li, 2013).
In October of 1990 Intel released the 80386SL, the first chip made specifically for the burgeoning
portable computer market. The chip was a 32-bit processor with a 20 MHz clock speed, 4 GB of maximum
memory and 855,000 transistors (Intel Corp, 2015). In 2012, Intel released the 64-bit Xeon E5-2667
with a 3.2 GHz clock speed, 786 GB of maximum addressable memory and 2.6 billion transistors. This
represents growth factors of roughly 160x, 196x and 3,040x respectively!
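The growth factors quoted above follow directly from the cited specifications; a quick arithmetic check using the clock speeds, memory ceilings and transistor counts given in the text:

```python
# Growth factors between the 1990 Intel 80386SL and the 2012 Xeon E5-2667,
# using the figures cited in the text.
old = {"clock_mhz": 20, "max_mem_gb": 4, "transistors": 855_000}
new = {"clock_mhz": 3200, "max_mem_gb": 786, "transistors": 2_600_000_000}

factors = {k: new[k] / old[k] for k in old}
print(factors)  # clock: 160x, memory: ~196x, transistors: ~3040x
```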
With this explosion of hardware capabilities, software has been able to advance in its attempts to
automate nearly all photogrammetric, image processing and other geospatial activities.
Parallel programming (multi-threading), CUDA programming for GPUs and other technologies
are available to software designers, but use of the technology varies by vendor. Geospatial software
providers have tackled, in various ways, the performance issues presented by semi-global matching and
other matching algorithms, automated tie point extraction, 3D point extraction, orthophoto production,
automatic seamline generation and other compute-intensive processes unique to the industry.
Software providers have attempted to use new algorithms for optimized use of hardware
resources such as GPUs, distributed processing and proprietary formats.
However, most manufacturers have limited access to massive hardware arrays from multiple
manufacturers and, due to the relatively small nature of the geospatial industry, the hardware vendors
do not test geospatial algorithms or software as is done with other types of common business software
tasks. Keystone set out to do this itself by testing and quantifying the resources used by its most
commonly run software package in order to optimize the company processing workflow. Some details
are given below about hardware and software, but generally the lessons to be learned are generic to
any software or situation.
Software
UltraMap is a software package produced by Microsoft/Vexcel of Austria to support the UltraCam line of
large format imagery sensors. The software has many modules, including imagery processing, Aerial
Triangulation (AT), Radiometry, Digital Surface Model (DSM) and Orthophoto generation. For this
study, only the Raw Data Center (RDC), AT and Radiometry modules were considered. The UltraMap
software is based on a distributed processing model where one controller computer doles out tasks to
worker nodes. Both are generally server hardware, but can be workstations as well. The software must
be run on Windows 64-bit operating systems and supports nearly any hardware configuration; however,
the manufacturer recommends latest-generation multi-core processors with at least 2 GB of RAM per
processor core. GPUs may be used for Dense Matcher module processing, but neither that software
nor the GPU hardware was tested in this case study. For storage, the manufacturer recommends either
Network Attached Storage (NAS) or Storage Area Network (SAN) architecture with 10 Gb Ethernet (Vexcel
Imaging, 2014). As noted earlier, this was a test to customize the hardware environment for the
software, not a validation of the software’s speed or the manufacturer’s recommendations.
Processing Workflow
UltraCam sensors use multiple calibrated cameras to generate their final, full frame images. The raw data
recorded by the sensor consists of multiple pieces of imagery and metadata that must be assembled for
use in geospatial products. This state of the imagery is called Level0 (L0). The transfer of the L0 data
from the sensor to network storage is handled by the RDC module within UltraMap. The initial phase of
post processing primarily involves the seaming of the panchromatic images into one single virtual image.
RDC is again used to do this in a distributed fashion. The controller server receives the job and instructs
the processing nodes to retrieve the L0 imagery from the storage location, process it to the next level
and store it at a shared location. The imagery and associated metadata are called Level02 (L2) at this
stage in the process.
The L2 imagery is then run through an initial automatic point extraction using the AT module. Further
bundle adjustments, control point collection and point extractions can be done at this stage as needed.
Finally, color balance across the project is performed using the Radiometry module and a delivery data,
or Level03 (L3), job is run. This is distributed among the processing nodes in the same manner as the L0
to L2 processing. This entire workflow was the focus of the testing of the Keystone systems.
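The controller/worker dispatch pattern described above can be sketched with Python’s standard library. This is an illustrative model only, not UltraMap’s actual mechanism; the function name and the L2 path format are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a processing-node task: read an L0 image
# from shared storage, process it, and write the L2 output back.
def process_to_l2(image_id: int) -> str:
    # (real work would seam the panchromatic sub-images here)
    return f"L2/image_{image_id:05d}.tif"

# The controller doles out image IDs; each worker thread plays the role
# of a processing node pulling a job and writing results to shared storage.
with ThreadPoolExecutor(max_workers=4) as controller:
    results = list(controller.map(process_to_l2, range(8)))

print(results[0])  # L2/image_00000.tif
```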
Hardware
Keystone’s existing hardware infrastructure consisted of a NetApp Network Attached Storage device
using Serial Advanced Technology Attachment (SATA) hard disks with 10 Gb Ethernet and Fiber Channel
connectivity. Ethernet is the standard protocol for communication between networked computers and
requires a hub or switch topology for communications. 10 G Ethernet extends the Ethernet protocol to
10 Gbps transfer speeds, but it is not backward compatible with slower Ethernet hardware.
Consequently, separate network adapter cards, cabling and switching are necessary for true
10 G Ethernet networks. Cabling itself can be flexible, as both copper and fiber cabling are supported.
Fiber Channel is a communication technology designed for data centers and SANs that is capable of
using direct connection, switching or arbitrated loop topologies. Fiber Channel is capable of supporting
various communication protocols, but does not support Ethernet natively (Norman, 2001).
The existing servers were a combination of Intel and AMD processing cores running on Dell servers of
various ages with at least the recommended minimum of RAM (though at various speeds). Most of the
connectivity between the servers and the storage was 10 Gb Ethernet, but some was 1 Gb.
Partner
Keystone engaged the services of a technology vendor (Razor Technology) early in the process in order
to provide the professional knowledge and skills necessary to perform an exhaustive system test. It is
important for a geospatial organization to educate the vendor on the unique nature of their work and
software requirements so that the testing can accurately reflect workflow. The partner provides
expertise in the implementation and interpretation of the test, while the organization must drive the
needs of the testing.
Methods
The system test was designed to mimic a real workflow event where the entire processing architecture
and storage architecture are stressed. This included massive movement of data during all stages of the
usage of UltraMap software (RDC, AT and Radiometry). A small-footprint software package from EMC
Corporation was installed on each processing server to monitor system Input/Output (I/O),
RAM, processor usage, network traffic, etc. for an entire 24-hour period. The storage system was also
monitored for a complete 24-hour period using software designed for it. This captured a period of
heavy usage, but also periods of limited or no requests to the system, which are crucial for establishing a
baseline. The results for throughput, latency, etc. on the existing system were established and used as
benchmarks and insight into a possible replacement. The test was then simulated on a prospective
system and, finally, the full system test was repeated on the installed solution after the purchase.
A schedule was developed that included specific processing of Level0 (raw imagery) to Level 02, AT,
Radiometry and final imagery processing. Each processing test performed was unique due to workflow
needs of a production system and the data available.
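The per-server monitoring described above boils down to sampling cumulative OS counters at intervals and converting the deltas to per-second rates. The sketch below is a generic stand-in for the EMC package, not its actual interface; the counter field names and the callable-based design are assumptions for illustration:

```python
import time

def sample_rates(read_counters, interval_s=1.0, samples=3):
    """Derive per-second rates (e.g. IOPS) from cumulative counters.

    read_counters: a callable returning a dict of cumulative values,
    e.g. {"io_ops": ..., "net_bytes": ...} (hypothetical field names).
    """
    history = []
    prev = read_counters()
    for _ in range(samples):
        time.sleep(interval_s)
        cur = read_counters()
        # rate = (counter delta) / (elapsed time)
        history.append({k: (cur[k] - prev[k]) / interval_s for k in cur})
        prev = cur
    return history

# Usage with a fake counter source; a real deployment would read the
# operating system's performance counters instead.
state = {"io_ops": 0, "net_bytes": 0}
def fake_counters():
    state["io_ops"] += 500           # 500 I/O operations per tick
    state["net_bytes"] += 1_000_000  # ~1 MB per tick
    return dict(state)

rates = sample_rates(fake_counters, interval_s=0.01, samples=2)
print(rates[0]["io_ops"])  # I/O operations per second
```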
Assumptions
The general assumptions going into the test were:
- Increasing processor cores per processing computer will increase throughput
- Increasing RAM per processor core will result in increased throughput
- Network bandwidth of 1 Gb is a bottleneck
- Processing computer local machine disk speed is a bottleneck
- All I/O is sequential
- Solid state disk on the SAN will increase speed of I/O and overall throughput
Results: Initial Test of Processing Servers
For the initial testing there were several key times:
- Monitoring started at 9:00 am
- Small L0 to L2 process shortly after 9:00 am
- Large L2 to L3 process beginning at approximately 11:20 am
- Large copy to NAS storage through the processing servers from 11:45 am to 12:40 pm
By examining the results of one of the processors used heavily in the testing, it is quickly seen that many
of the assumptions were incorrect. The software performed differently than expected. The initial
processing of raw imagery actually requires only moderate disk I/O (first spike in the top of Figure 1) and
moderate memory and CPU usage, because the server spends a long period (up to 20 minutes) on each
image due to slower CPU speeds. The tasks of copying data and image delivery processing (L3), by
contrast, are highly disk I/O intensive and overwhelm the system.
As seen in Figure 1 below, the I/O operations per second (IOPS) were extremely high: despite the large
images being copied in sequential order, the threading of the software generates nearly random I/O
operations under a heavy workload. This points to the necessity for fast read and write events and solid
state disk. Solid state drives would increase the read/write speed roughly three times and reduce latency
(UserBenchmark, 2015).
Figure 1: Processing Server Disk, CPU and RAM performance metrics
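The effect described above, where several threads each copying a large file sequentially combine into a nearly random request stream at the disk, can be shown with a toy model (thread counts and block sizes here are arbitrary illustrative values):

```python
import itertools

def merged_offsets(n_threads=4, blocks_per_file=100, file_size=1000):
    """Interleave the block offsets of n_threads sequential copies,
    as a round-robin scheduler might present them to the disk."""
    streams = [
        iter(range(t * file_size, t * file_size + blocks_per_file))
        for t in range(n_threads)
    ]
    merged = []
    for group in itertools.zip_longest(*streams):
        merged.extend(o for o in group if o is not None)
    return merged

def sequential_fraction(offsets):
    """Fraction of requests contiguous with the previous request."""
    seq = sum(1 for a, b in zip(offsets, offsets[1:]) if b == a + 1)
    return seq / (len(offsets) - 1)

one_stream = list(range(100))
print(sequential_fraction(one_stream))        # 1.0 - fully sequential
print(sequential_fraction(merged_offsets()))  # 0.0 - looks random to the disk
```

Each thread on its own is perfectly sequential, but the merged stream the disk actually sees jumps between distant offsets on every request, which is why spinning disks suffer and SSDs help.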
The charts in Figure 1 also point to constant high memory usage during the periods of maximum
processing of imagery. With a total of 1600 MB on the machine, memory usage stayed high through the
initial hours of the processing load and became erratic during the primarily image-movement tasks.
With 50% of the processing capacity in use, the CPU usage appears adequate, but because only 8 of the
possible 12 cores were used, Figure 1 is misleading. Figure 2 below instead illustrates the constant
utilization spread across the CPU threads actually in use. This confirms that CPU speed is more critical
to optimizing UltraMap than core count alone.
Figure 2: Processing server thread utilization percentage
Table 1 below contains the latency statistics for the processing server in milliseconds. Both the read and
write response times are high due to the randomness created by many simultaneous sequential disk
reads and writes. This greatly affects the effectiveness of traditional spinning disk. Furthermore, the
industry rule of thumb for workloads states that a queue length greater than the spindle count is beyond
capacity. With a single hard disk system (non-RAID), any average queue length approaching 2 is cause
for concern (Davis, 2014).
Average Read Response Time (ms)            2.4
95th Percentile Read Response Time (ms)    7.51
Average Write Response Time (ms)           36.32
95th Percentile Write Response Time (ms)   98.92
Average Queue Length                       1.81
95th Percentile Queue Length               8
Table 1: Latency data for typical processing server
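The queue-length and latency figures in Table 1 are linked by Little’s Law: average queue length ≈ arrival rate × average response time. The sketch below uses the measured 36.32 ms write response time with a hypothetical round write rate (50 ops/s was not measured; it is chosen for illustration):

```python
def avg_queue_length(iops: float, avg_response_ms: float) -> float:
    """Little's Law: L = arrival rate (ops/s) * response time (s)."""
    return iops * (avg_response_ms / 1000.0)

# A hypothetical 50 writes/s at the measured 36.32 ms average write
# response time would alone sustain a queue length of about 1.8 -
# right at the "cause for concern" threshold for a single spindle.
print(avg_queue_length(50, 36.32))
```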
Finally, the test revealed that network utilization was below 1% of the 10 gigabit Ethernet capacity,
indicating that network connectivity is not as crucial as suspected.
Results: Initial Test of Storage Device
The storage system is divided between two controllers with 10 Gb Ethernet connections. During the
peak processing and image transfer periods, latency (response times) was high. This suggests that the
storage hardware is not optimized for the workflow. Additionally, having two controllers with
separately assigned hard disk volumes created uneven disk utilization across the device.
Figure 3: Controller 1 IOPS
Figure 4: Controller 2 IOPS
Figure 5: Controller 1 Response Time (Latency)
Figure 6: Controller 2 Response Time (Latency)
The CPU usage and network utilization suggested minimal overhead on the device and adequate
network infrastructure (Figures 7 and 8). The types of processes created by geospatial software are not
generally CPU intensive for large storage devices. It is clear that IOPS and disk speed are much more
important in this workflow.
Figure 7: CPU utilization per controller
Figure 8: Network utilization per controller
Recommendation
Based on the results above, it was evident that the UltraMap software has unique requirements for each
of the modules being used. In order to create a machine that would serve all of the software modules’
needs, it was necessary to use SSD disks and select a CPU for speed (as opposed to core count). Table 2
below illustrates how replacing systems with more powerful (faster) CPUs is more efficient (by reducing
the number of machines) than simply adding cores.
System Group     GHz    Cores  Chips  Cores/Chip  Score/Core  Servers  Total  Comment
Current A        2.9    8      2      4           28.3        12       2712   Quad-Core AMD Opteron 2389
Current B        2.4    8      2      4           24.9        5        997    Quad-Core AMD Opteron 2379
Current C        2.66   12     2      6           49.6        4        2380   Intel Xeon X5650 @ 2.67GHz
Current D        2.66   8      2      4           49.5        2        792    Intel Xeon E5640 @ 2.67GHz
Total                                                         23       6881
Proposed System  3.5    12     2      6           94.167      7        7910   Intel Xeon E5-2643 v2 @ 3.50 GHz
Table 2: Comparison of Current System with Optimal System
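The Total column in Table 2 is the per-core benchmark score scaled across all cores and servers in each group; the arithmetic can be reproduced directly from the table (small differences from the listed totals are rounding in the source):

```python
def group_total(score_per_core: float, cores: int, servers: int) -> float:
    """Aggregate throughput score for a group of identical servers."""
    return score_per_core * cores * servers

current_c = group_total(49.6, 12, 4)    # Intel Xeon X5650 group
proposed  = group_total(94.167, 12, 7)  # Intel Xeon E5-2643 v2 group

print(round(current_c))  # close to the 2380 listed in Table 2
print(round(proposed))   # ~7910: 7 new servers outscore all 23 old ones
```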
Using the specifications in Table 2 and the test results, systems were easily designed with the specific
CPU, RAM and SSD drives needed to meet the needs of the software. No further testing was necessary
for the correct specification of these servers.
Results: Pre-Purchase Test
Two options were determined to be possible candidates for the network storage piece. One system
(System “A”) was optimized using SSD arrays backed by standard hard disks, with latency reduced by
connectivity devices that converted 10 Gb Ethernet to Fiber Channel. System “B” used multiple disk
arrays connected via 10 Gb Ethernet and standard SATA drives with a unique software and hardware
architecture. System “A” was benchmarked on a sample system by the manufacturer using Keystone-
supplied imagery. While this did not accurately simulate the full UltraMap workflow, it did generate the
interesting results below:
Figure 9: System “A” Controller IO
Figure 10: System "A" Controller Latency
Figure 11: System "A" Disk IO
Figure 12: System "A" Disk Latency
As can be seen in Figures 9 through 12 above, during an intense write session to a SAN that uses SSD
drives for its initial storage and SATA drives for the bulk of the storage, the initial I/O of the system is
intense and performance is excellent. However, as the SSDs begin to fill, the latency of both the
controller and the disks increases to unacceptable levels. The overhead of transitioning data from SSD to
SATA while continuing to write to the SATA disks causes unacceptably high delays (latency).
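The cache-fill behavior observed in System “A” can be captured in a toy model: writes land in the fast SSD tier until it fills, after which they are bottlenecked by the slower destage path to SATA. All rates, capacities and latencies below are illustrative assumptions, not measured values:

```python
def simulate_tiered_writes(seconds, write_mb_s=2000, destage_mb_s=500,
                           ssd_capacity_mb=100_000,
                           ssd_latency_ms=0.3, sata_latency_ms=8.0):
    """Per-second effective write latency as the SSD tier fills."""
    filled, latencies = 0.0, []
    for _ in range(seconds):
        filled = max(filled + write_mb_s - destage_mb_s, 0.0)  # net fill
        if filled < ssd_capacity_mb:
            latencies.append(ssd_latency_ms)   # absorbed by the SSD tier
        else:
            filled = ssd_capacity_mb
            latencies.append(sata_latency_ms)  # stalled on destage to SATA
    return latencies

lat = simulate_tiered_writes(120)
print(lat[0], lat[-1])  # fast at first, slow once the SSD tier is full
```

With these numbers the SSD tier fills after roughly a minute of sustained writing, at which point effective latency jumps by more than an order of magnitude, mirroring the pattern in Figures 10 and 12.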
Due to time constraints and availability, System “B” could not be tested, forcing Keystone to make a
decision based on manufacturer benchmarks rather than real data. The solution, as configured, would
support 4,650 MB/s read and 2,230 MB/s write with a total of 90,000 IOPS. Latency would be
significantly reduced due to the network configuration, system design and high IOPS. Based on the
superior design of System “B”, it was selected for purchase.
Results: Post-Installation Test
Once the System “B” storage, the new processing servers and all software, firmware and other
optimizations were installed and completed, a system test was performed to mimic the previous tests.
The charts below show the processor utilization of one of the new processing servers. While 11 of the
12 possible CPU cores are being used for processing, only 50% of the total CPU capacity is being used.
This confirms that additional optimizations in both software and hardware will not overly tax the server.
Additionally, the overhead could be used to run other processes on the server without danger of
overtasking the system. The amount of free memory does not fluctuate significantly and never
approaches zero, in contrast to the older processing servers (Figure 1).
Figure 13: New Processing Server Test Results for CPU usage (Top) and available memory (Bottom)
The use of solid state disk in the servers produced significant speed increases. The latency figures below
show the reduction in disk read/write response times. Particularly interesting are the reduced average
queue length and the 95% reduction in write latency!
                                           Original System   New System
Average Read Response Time (ms)            2.4               0.35
95th Percentile Read Response Time (ms)    7.51              1.1
Average Write Response Time (ms)           36.32             3.15
95th Percentile Write Response Time (ms)   98.92             5.1
Average Queue Length                       1.81              0.35
95th Percentile Queue Length               8                 3
Table 3: Latency comparison between old and new processing servers
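The improvements in Table 3 can be expressed as percentage reductions; a quick computation using the table’s values:

```python
def pct_reduction(old: float, new: float) -> float:
    """Percentage reduction from old to new."""
    return 100.0 * (old - new) / old

# Old vs. new processing-server metrics from Table 3
print(round(pct_reduction(2.4, 0.35)))    # avg read latency: ~85% lower
print(round(pct_reduction(36.32, 3.15)))  # avg write latency: ~91% lower
print(round(pct_reduction(98.92, 5.1)))   # 95th-pct write: ~95% lower
print(round(pct_reduction(1.81, 0.35)))   # avg queue length: ~81% lower
```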
The storage area network also showed remarkable improvements over the previous system. Figures 14
and 15 below show a peak of nearly 17,000 IOPS while maintaining response times of half a millisecond
or less. Compare this to Figures 3 through 6, which show maximum IOPS of 5,000 and response times
consistently above 1 millisecond and peaking at 8.
Figure 14: New SAN throughput during system testing
Figure 15: New SAN latency during system testing
Finally, the most useful measure of the system’s value to Keystone is the increase in throughput of
imagery. While easily assessed anecdotally, quantifying the change is difficult for two reasons: first,
the variability of simultaneous processes can affect individual image processing speeds and, second, the
software does not provide a mechanism to sort or export the processing times. The times used in the
tables below are based on a random sampling of larger jobs performed before and after the new system
installs. The actual times can vary, but these times represent an estimated median for the jobs
referenced.
Level 02 Processing (minutes per image)
         ------------ Old Storage System ------------   ------------ New Storage System ------------
Sensor   Group "A"  Group "C"  Group "D"  Wtd Ave       New    Group "C"  Group "D"  Wtd Ave   Gain %
Eagle    16.6       15.8       10.8       15.8          9.3    14.4       10.7       11.0      30%
Falcon   11.5       11.4       10.3       11.3          6.6    8.2        7.9        7.2       36%
UCX      11.1       10.3       7.8        10.5          4.5    5.2        4.8        4.7       55%
UCO      21.6       19.1       27.1       21.3          16.4   21.4       26.5       18.9      11%
Table 4: Single image processing times to Level 02 by server type and sensor type
Level 03 Processing (minutes per image)
         ------------ Old Storage System ------------   ------------ New Storage System ------------
Sensor   Group "A"  Group "C"  Group "D"  Wtd Ave       New    Group "C"  Group "D"  Wtd Ave   Gain %
Eagle    12.0       7.1        6.3        9.9           5.7    6.5        6.3        6.0       39%
Falcon   8.3        4.8        4.5        6.8           4.1    4.6        4.6        4.3       37%
UCX      11.5       6.9        8.5        9.7           3.3    4.0        3.5        3.5       64%
Table 5: Single image processing times to Level 3 by server type and sensor type
The left side of each table shows the times, in minutes, for a single image process by server group and
camera type. To repeat, the times vary based on the size of the processing job and other processes
being run on the system. The weighted average is based on the number of machines and the cores on
each machine used in the processing stack. The right side of each table shows the new processing
servers alongside the older systems remaining in place. Interestingly, the older systems also showed
improvement because of the improved storage system, but the new processing servers vastly outproduce
the existing infrastructure. Another way to express the processing gains is to note that on a 60 core
distributed system, 1000 images will be processed between 60 and 100 minutes faster while allowing
other much-needed processes to run unaffected. This can translate to hundreds of hours of time
savings in a single year.
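The cluster-level claim above can be reproduced from the weighted-average per-image times in Table 4, assuming perfect parallelism across a 60 core distributed system (an idealization that ignores scheduling overhead):

```python
def batch_minutes(per_image_min: float, images=1000, cores=60) -> float:
    """Idealized wall-clock minutes for a batch with perfect parallelism."""
    return per_image_min * images / cores

# Weighted-average L2 times from Table 4, old vs. new storage system
for sensor, old, new in [("Eagle", 15.8, 11.0),
                         ("Falcon", 11.3, 7.2),
                         ("UCX", 10.5, 4.7)]:
    saved = batch_minutes(old) - batch_minutes(new)
    print(f"{sensor}: ~{saved:.0f} minutes saved per 1000 images")
```

The savings come out between roughly 68 minutes (Falcon) and 97 minutes (UCX), consistent with the 60-to-100-minute range quoted above.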
Conclusions
- The initial assumption that increased CPU core count would generate increased throughput
proved false. CPU processor speed proved a more vital factor in processing speed.
- Imagery processing with UltraMap creates high IOPS: despite large images being copied in
sequential order, the threading of the software generates nearly random I/O operations under a
heavy workload.
- The server testing revealed that network utilization was below 1% of the 10 gigabit Ethernet
network card, indicating that network connectivity to the processing servers is not as crucial as
suspected in the assumptions.
- Based on the testing, it was evident that the UltraMap software has unique requirements for
each processing module. In order to create a machine that would serve all of the software
modules’ needs, it was necessary to use SSD disks and to select a CPU for clock speed.
- Unlike other business workflow scenarios, SSD usage on large storage systems does not produce
good results for geospatial imaging workflows. During intense usage, as the SSDs begin to fill,
the latency of both the controller and the disks increases to unacceptable levels. The overhead of
transitioning data from SSD to SATA while continuing to write to the SATA disks causes
unacceptably high delays.
- The use of solid state disk in the processing servers produced significant speed increases. The
results show reductions in disk read and write latency of 90% and 95% respectively. Particularly
interesting is the average queue length, reduced by 80% to 0.35.
- The storage area network also showed remarkable improvements over the previous system, with
a peak of nearly 17,000 IOPS while maintaining response times of half a millisecond or less,
compared to the previous system’s maximum of 5,000 IOPS and response times consistently
above 1 millisecond and peaking at 8.
- Processing gains for individual images range from 11% to 64%. On a 60 core distributed
system, 1000 images will be processed between 60 and 100 minutes faster than with the
previous system while allowing other much-needed processes to run unaffected. This will
translate to hundreds of hours of time savings in a single year.
As practitioners of photogrammetry, remote sensing and GIS, industry professionals do not need
expertise in all or even many of the deeper aspects of Information Technology (IT). It is, however,
important to understand enough of the technologies behind the screens to ask the right questions of
hardware vendors and industry software providers. It is equally important to have a trustworthy and
knowledgeable partner or staff member who can fully understand the latest hardware innovations and
apply the correct technologies to each software task. Because the geospatial industry is now almost
completely integrated into the high technology sector in its application and usage, more than a cursory
knowledge of IT must be cultivated. Wasting money and, more importantly, time on systems not
optimized for the intended software is not acceptable in an industry where margins are often tight.
This case study shows that a thorough analysis and testing regimen is invaluable for system
optimization.
References
Davis, R. L. (2014, August 22). Monitoring Disk Queue Length. Retrieved from Idera: http://blog.idera.com/sql-server/performance-and-monitoring/monitoring-disk-queue-length/
Intel Corp. (2015, April 1). Microprocessor Quick Reference Guide. Retrieved from Intel.com: http://www.intel.com/pressroom/kits/quickrefyr.htm
Li, M. (2013, May 29). Keeping Up with Moore's Law. Retrieved from Dartmouth Undergraduate Journal of Science: http://dujs.dartmouth.edu/spring-2013-15th-anniversary-edition/keeping-up-with-moores-law#.VR0xN_nF9Ap
Norman, D. (2001, December 2). Fibre Channel Technology for Storage Area Networks. Retrieved from Rivier University Department of Math and Computer Science: https://www.rivier.edu/faculty/vriabov/CS553a_Paper_DNorman.pdf
UserBenchmark. (2015). SSD and HDD Benchmark. Retrieved from UserBenchmark.com: http://www.userbenchmark.com/
Vexcel Imaging. (2014). UltraMap v3.2 Overview. Graz, Austria: Vexcel Imaging GmbH.