Stereo Vision with Texture Learning for Fault-tolerant Automatic Baling

Morten Rufus Blas (a,b,*), Mogens Blanke (a,c)

(a) Technical University of Denmark, Department of Electrical Engineering, Automation and Control Group, Elektrovej build. 326, DK-2800 Kgs. Lyngby, Denmark
(b) CLAAS Agrosystems, Bøgeskovvej 6, 3490 Kvistgaard, Denmark
(c) CeSOS, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway
Abstract

This paper presents advances in automated baling using stereo vision. A robust classification scheme is developed for learning and classifying based on texture and shape. Using a state-of-the-art texton approach, a fast classifier is suggested that can handle non-linearities and artifacts in data. Shape information is employed to make the classifier robust to large variations in lighting conditions and to greatly reduce the likelihood that artifacts in signals from the stereo vision system lead to gross errors in estimated object positions. The classifier is tested on data from a stereo-vision guidance system on a tractor. The system is shown to be able to classify cut plant material (called swath) by learning its appearance. A 3D classifier is successfully used to train the texture classifier. Field tests demonstrate how fault-tolerant fusion of steering reference data is obtained for an automated baling vehicle.
*Corresponding author. Email addresses: rufus.blas@claas.com (Morten Rufus Blas), mb@elektro.dtu.dk (Mogens Blanke)
Preprint submitted to Computers and Electronics in Agriculture October 16, 2010
Keywords: texture classification, field navigation, fault-tolerance, stereo
vision, robotics
1. Introduction

Dependable navigation in outdoor, unstructured environments is crucial if autonomous machines are to become a reality in agriculture. Navigating blindly on GPS information is not sufficient, due to the presence of non-mapped objects, uncertainties, and limited GPS availability. When a machine must make navigation decisions without human interference, an ability to recognise specific structures in the environment is needed. With sensors for outdoor use, typically stereo cameras and laser range finders, the algorithms that process their data need to be robust against artifacts in signals and natural variations of the environment.
Recent efforts towards following field structures without relying on GPS were presented in [1], [2] and [3]. General findings were that texture methods had limited success, due to large variations in conditions, and that colour methods were restricted to mainly identifying green plants [4]. A mapping approach using visual odometry was described in [5], showing advantages over GPS in terms of robustness and accuracy over short distances. In robotics, recent uses of classification based on colour and texture include [6], where terrain for a Mars rover was classified. A combination of visual and geometric features was used in [7] for outdoor classification, where texture was used in a bag-of-words approach and off-line training was needed; a very large set of filter banks was required to handle all variation. The approach described in [8] was more efficient and suggested a texture representation that is also used in this paper. It has the advantage that the texture can be learnt online, allowing the classifier to be optimised for individual scenes without the need for large filter banks.
Combining a 3D classifier and a texture classifier could be a solution to the general problem of following structures in the field, for example to plough, seed, spray or harvest. Focusing on baling as a specific harvesting task, this requires following rows of cut straw or grass (swath) in order to pick up the material and process it into bales. This is a labour-intensive and repetitive task which is of interest to automate, and the difficulties pertaining to automating it are similar to those in automating a large range of other agricultural tasks. The ability to track such structure using 3D shape information from a stereo camera and/or GPS information was previously demonstrated [9], but issues remained with respect to robustness.
This paper considers how existing 3D classification methods in agriculture could be made more robust and tolerant to faults. A combination of texture classification, mapping, and supervision is suggested to achieve this goal. A novel texton-based classifier is presented that uses online methods to learn texture information about the swath and its surroundings. It is then analysed how this information can be integrated in a mapping system to keep track of measured swath positions, and an implementation is presented where the map is used to guide a vehicle along a swath by steering the tractor's front wheels while a driver controls the throttle and brakes. Results from field tests finally demonstrate how supervision and fusion of 3D and texture information in a map were successfully applied in an automatic baling system.
2. System Overview

The automatic baling system is composed of two parts: a vision system and a control system, as shown in Fig. 1. In the vision system, a left and a right image are acquired from a stereo camera. A stereo algorithm extracts 3D information from the images, which is then fed into a 3D tracking algorithm that tracks the swath field structure. A learning algorithm teaches a texture tracking algorithm to track the swath based on the 3D tracking, and supervision allows the learning process to be fault-tolerant. Mapping provides fusion of the tracking information; to enable mapping, a fault-tolerant positioning module fuses and supervises visual odometry and GPS information. The output from the mapping system is fed to a track controller that in turn steers the wheels of the tractor.
2.1. Hardware

In the proof-of-concept configuration used for the present paper, image processing was done on a laptop which interfaces to the vehicle controller ECU through the tractor's CAN bus. The driver interfaces to the control system through a terminal to change settings, and engages/disengages the automatic steering system through a switch. The baler has integrated pressure sensors which are used to measure the bale diameter; this information is used in the controller to assure an even filling of the bale chamber. A wheel angle sensor provides feedback about the angle of the front wheels of the tractor, and hydraulics allow actuation of the front wheels. A stereo camera is mounted in front of the tractor, and a Real Time Kinematic (RTK) GPS on the roof provides position information.
3. Stereo processing

Stereo vision perceives depth using triangulation: the distance to a point is determined by the triangle formed between the point and where it appears in each of two images. To do this, the two images must be aligned. Alignment is part of a calibration step; given a calibrated stereo camera, the images are aligned by warping them, a process known as rectification. This yields two cameras with parallel optical axes and horizontal epipolar lines. A dense estimation of range is then performed at each pixel by matching along the epipolar lines [10], using a correlation window with a typical size of 10x10 pixels that matches texture in the two images. The output of the matching process is a disparity image (see Fig. 11), which gives the difference in pixels between the positions of an object in the two camera images. With dl denoting the horizontal distance from the image centre to the object in the left image and dr the corresponding distance in the right image, the disparity value d is given by:
d = dl − dr (1)
A pixel in the disparity image can then be projected to a 3D coordinate using triangulation. For a stereo camera, a point in the depth map is defined by (u, v, d): the column, row, and disparity value in the disparity image. This can be projected to and from a 3D coordinate (x, y, z) in the camera coordinate system by:
xl = (x, y, z)ᵀ = ( (u − u0)b/d, (v − v0)b/d, fb/d )ᵀ  (2)
where u0 and v0 are the column and row coordinates of the optical centre of the image in pixels, b is the camera baseline and f is the focal length for the rectified image.
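As an illustration, the projection of Eq. (2) can be written in a few lines of code. The following minimal Python sketch is not part of the original system; the function name and units are assumptions.

```python
import numpy as np

def disparity_to_3d(u, v, d, u0, v0, f, b):
    """Project a disparity-image point (u, v, d) to camera coordinates
    (x, y, z) via Eq. (2). u0, v0: optical centre (pixels); f: focal
    length (pixels); b: baseline (metres). Assumes d > 0."""
    x = (u - u0) * b / d
    y = (v - v0) * b / d
    z = f * b / d
    return np.array([x, y, z])
```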
4. Height Classification

In order to identify the swath in 3D, the ground plane is first estimated (see the work by Konolige [11]). The distance from each pixel in the disparity image to the plane is then calculated, using the 3D coordinates and the point-to-plane distance, and pixels below the plane are set to zero distance. The height values are then normalised based on a median value of a subset of the largest height values. The output is a 2D image scaled from 0 to 1, with a higher value indicating a higher likelihood of pertaining to the swath (under the assumption that only the swath is higher than the ground plane). In practice this performs well even though the ground may be uneven and/or hilly: in the local area used for height classification, the ground plane approximation holds.
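A minimal sketch of this height classification follows, assuming per-pixel 3D points and a ground plane given as (n, d0) with unit normal n. The percentile used to pick "a subset of the largest height values" is an illustrative assumption; the paper does not specify it.

```python
import numpy as np

def height_image(points, plane, percentile=95):
    """Normalised height-above-ground image as in Section 4.
    points: HxWx3 camera-frame coordinates per pixel; plane: (n, d0)
    with unit normal n, so signed distance = n . p + d0."""
    n, d0 = plane
    h = points @ n + d0                # signed point-to-plane distance
    h = np.maximum(h, 0.0)             # pixels below the plane -> 0
    top = h[h >= np.percentile(h, percentile)]
    scale = np.median(top) if top.size else 1.0
    return np.clip(h / scale, 0.0, 1.0)  # 0-1, higher = more swath-like
```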
5. Texture Classification

5.1. General Aspects

In a seminal paper, Leung and Malik [12] showed that many textures can be represented and re-created using a small number of basis vectors extracted from local descriptors; they called these basis vectors textons. While Leung and Malik used a filter bank, Varma and Zisserman [13] showed that small local texture neighbourhoods may be better than large filter banks. In addition, a small local neighbourhood vector can be much faster to compute than multichannel filtering such as Gabor filters over large neighbourhoods.
5.2. Texton Labelling

Given a colour image as input, pixel neighbourhoods in the image are each assigned to one of 23 texton types, producing a texton image. This number was chosen as a compromise between quality and speed. The textons are learnt from the training image by first extracting a descriptor, in the form of a vector, from each pixel location in the image. For each location pi the vector is:
pi = ( W1·Lc, W2·ac, W2·bc, W3·(L1 − Lc), ..., W3·(L8 − Lc) )ᵀ  (3)
where Lc, ac, bc is the colour of the pixel at this location in CIE-LAB colour space, and (L1 − Lc), ..., (L8 − Lc) are the intensity differences between the pixel at this location and the 8 surrounding pixels in a 3×3 neighbourhood. The vector elements are weighted using {W1 = 0.5, W2 = 1, W3 = 0.5}. A K-means algorithm is then run on all these descriptors to extract cluster centres, which we refer to as textons. The K-means algorithm finds the set of textons µj that partitions the descriptors into k sets S = {S1, ..., Sk} by trying to minimise [14]:

S* = argmin_S Σ_{j=1}^{k} Σ_{pi ∈ Sj} ‖pi − µj‖²  (4)
where pi are the descriptors in the training image(s) and Sj is the set of descriptors belonging to cluster j out of k clusters.
Each pixel location in the image is then labelled as belonging to a texton by finding the nearest texton in Euclidean space. An example of such a classification is shown in Fig. 2.
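The descriptor of Eq. (3) and the texton labelling can be sketched as follows. This is an illustrative Python/NumPy rendering, with SciPy's K-means standing in for whichever implementation was used; border pixels are skipped for brevity.

```python
import numpy as np
from scipy.cluster.vq import kmeans2  # stand-in K-means implementation

W1, W2, W3 = 0.5, 1.0, 0.5  # weights from Eq. (3)

def descriptors(lab):
    """Per-pixel 11-D descriptor of Eq. (3) from a CIE-LAB image."""
    L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]
    Lc = L[1:-1, 1:-1]
    feats = [W1 * Lc, W2 * a[1:-1, 1:-1], W2 * b[1:-1, 1:-1]]
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # intensity difference to one of the 8 neighbours
            feats.append(W3 * (L[1 + dy:L.shape[0] - 1 + dy,
                                 1 + dx:L.shape[1] - 1 + dx] - Lc))
    return np.stack(feats, axis=-1).reshape(-1, 11)

# Learn 23 textons from a training image, then label every pixel
# with its nearest texton (Euclidean distance):
# textons, labels = kmeans2(descriptors(train_lab), 23, minit='++')
```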
5.3. Texture Training

The texton image is used as the basic input to train the texture classifier. Textons were only calculated for the left image. Extra redundancy could be achieved by using both the left and right images; however, as they view the same scene, they will often have similar failure modes, so this redundancy was traded for faster performance. Given a texton image and a training mask representing the location of the swath in the image (as well as the surroundings, "not swath"), the average histogram of texton occurrences over 32×32 image patches can be constructed for the "swath" and "not swath" cases (with 32×32 pixels corresponding to 1-2% of the total image size).
As is apparent from the swath images in Fig. 4, there are multiple objects with unique textures present both in the swath and in the area surrounding it. For example, stubble from harvested plants may be present in part of the image, or tire tracks from other machines may form separate textures. The solution presented here is to identify and model each texture independently, representing each of them with a mean histogram. This also handles how texture changes with distance from the camera, because objects further away have a smaller pixel resolution. The different textures are identified by taking the lists of histograms for the "swath" and "not swath" cases and clustering them (similarly to what was done for the textons) independently of each other using K-means [14]. This was done using a single training image; for the results presented here, the number of clusters was 3 for each case. The reason for using K-means was simply speed. Other probabilistic learning approaches could be used, including Support Vector Machines (SVM) [15] and Gaussian Mixture Models (GMM) [16].
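A sketch of this training step is given below, under the assumption that class histograms are sampled on a regular grid of 32×32 patches; the exact sampling scheme is not stated in the paper.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

K = 23  # number of textons

def patch_histograms(texton_img, mask, patch=32):
    """Normalised texton histograms of non-overlapping patch x patch
    windows whose centres fall inside `mask` (True = class member).
    texton_img holds integer texton labels in [0, K)."""
    H, W = texton_img.shape
    hists = []
    for r in range(0, H - patch, patch):
        for c in range(0, W - patch, patch):
            if mask[r + patch // 2, c + patch // 2]:
                win = texton_img[r:r + patch, c:c + patch]
                h = np.bincount(win.ravel(), minlength=K).astype(float)
                hists.append(h / h.sum())
    return np.array(hists)

# Three mean histograms per class, as in Section 5.3:
# Hs, _ = kmeans2(patch_histograms(textons, swath_mask), 3, minit='++')
# Hn, _ = kmeans2(patch_histograms(textons, ~swath_mask), 3, minit='++')
```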
5.4. Texture Classifier

Classification is done on a texton image by identifying a likelihood that a histogram centred around a pixel is either "swath" or "not swath". This is represented by the two hypotheses: Hs for "swath" and Hn for "not swath". A suitable distance measure must be used to compare the histograms. Having considered several distance measures, the sum-of-absolute distances (SAD) was chosen, and the ratio of the distances from an observed distribution to either of the hypotheses was used for classification. A main reason for using this distance measure is that it can be implemented to run very fast, since it involves neither square roots nor exponential functions. Given two histograms h1 and h2, the distance between them is (defined by the ⊕ operator):
h1 ⊕ h2 = Σ_{j=1}^{k} |h1(j) − h2(j)|  (5)
In [16] it was demonstrated that GMMs could be used to model the distribution of colours of a dirt road. Each Gaussian could be made to recognise different sets of colours, allowing multi-modal colour distributions to be modelled by mixing the Gaussians. RGB colour data is 3-dimensional, however, which is much smaller than a texton histogram with 64 dimensions. Modelling texture with even just one Gaussian was found too slow on current hardware, and a number of Gaussians seem needed to represent the distributions well. A compromise was achieved by modelling the texture histograms as a number of K-means clusters. A major penalty in doing this is that the density of the distributions is not modelled.

The output of the texture training allows the histograms at each pixel in an image to be labelled as belonging to either Hs or Hn. These histograms are then clustered separately for each hypothesis using K-means with m = 3 clusters.
Then Hs = {hs,1, ..., hs,m} and Hn = {hn,1, ..., hn,m}. The classifier is then formulated as a distance ratio to the nearest cluster under each hypothesis:

d = min(h ⊕ hn,1, ..., h ⊕ hn,m) / min(h ⊕ hs,1, ..., h ⊕ hs,m)  (6)
The distance ratio is computed for each pixel in the image. A geometrical analysis is then used to detect whether a field structure is present in the image. An example of classifying an image is given in Fig. 2.
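Eqs. (5) and (6) amount to a few lines of code. The following sketch assumes normalised texton histograms as NumPy arrays; the small guard against division by zero is not part of the original formulation.

```python
import numpy as np

def sad(h1, h2):
    """Sum-of-absolute-distances histogram distance, Eq. (5)."""
    return np.abs(h1 - h2).sum()

def distance_ratio(h, Hs, Hn):
    """Classifier of Eq. (6): SAD distance to the nearest 'not swath'
    cluster over distance to the nearest 'swath' cluster. Values > 1
    favour the swath hypothesis."""
    dn = min(sad(h, hn) for hn in Hn)
    ds = min(sad(h, hs) for hs in Hs)
    return dn / max(ds, 1e-9)  # guard against a zero distance
```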
5.5. Swath Detection

The swath in the image is parameterised by a width, position and orientation. A mask can then be constructed for any feasible swath parameterisation (see Fig. 3 for an example). An exhaustive search is performed within a quantised set of the parameter space to maximise a match score. In Fig. 3 the dark region has a value of −1 and the white region a value of 1. This mask is multiplied pixel-wise with the classified images, for both the height and the texture, and then summed to produce a scalar value indicating the goodness of a match.
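The exhaustive search can be sketched as below, where the pre-computed candidate masks and their (width, position, orientation) parameterisations are assumed inputs:

```python
import numpy as np

def find_swath(classified, masks):
    """Exhaustive search of Section 5.5: `classified` is a combined
    height/texture classification image, `masks` an iterable of
    (params, mask) pairs with +1 inside the hypothesised swath and
    -1 outside. Returns the best-scoring parameterisation."""
    best_score, best_params = -np.inf, None
    for params, mask in masks:        # e.g. params = (width, pos, angle)
        score = float((classified * mask).sum())
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```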
5.6. Updating Texture Model

The texture training produces a model of the texture that the classifier uses. In the current implementation, training is performed at a rate of about 1 Hz. Training is only done if the 3D classifier match score is above a threshold. The newly trained texture model is then validated by testing the match score of the old model against the new one; if the new model is better, the old model is discarded. It was considered whether it could pay off to use an incremental texture model, where the old model is not completely discarded, but this approach had difficulties with local minima. In effect, as new models are constantly being tested, the learning system does have an incremental attribute, in that the method often converges to a single training image that best represents the field variations. Nevertheless, it retains the ability to quickly switch model if a new type of environment is encountered.
Attempts were initially made to implement an offline model instead of learning the textures online. However, as can be seen in Fig. 4, the variation that needs to be captured by such a model is quite large, and for natural environments it is difficult to get training data for generating offline models that can handle all the necessary variation. In practice, the above training could be done offline without any changes. From an implementation point of view, the training is done concurrently with texture tracking on a separate processor/core, so it does not slow down the system.
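The update rule reduces to a simple guarded swap. In the sketch below, `train` and `score` are hypothetical stand-ins for the training and match-score routines described above:

```python
def maybe_update_model(old_model, frame, score_3d, threshold, train, score):
    """Update rule of Section 5.6, sketched: train a candidate texture
    model only when the 3D match score is trusted, and keep it only if
    it outscores the current model on the same frame."""
    if score_3d < threshold:
        return old_model              # 3D evidence too weak to train on
    candidate = train(frame)
    if score(candidate, frame) > score(old_model, frame):
        return candidate              # discard the old model
    return old_model
```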
6. Classification Results

In order to evaluate the performance of the classifier, a simple scoring scheme was designed. The classification rate was defined as the number of correct pixel classifications Dc normalised by the maximum number of correct pixel classifications Dc,max; higher is better. The false alarm rate was defined as the number of false pixel classifications (false alarms) Df normalised by the maximum number of possible false pixel classifications Df,max; lower is better. A hand-labelled image was used as ground truth.

Classification rate = Dc / Dc,max  (7)

False alarm rate = Df / Df,max  (8)
A set of 20 consecutive images (sampled at about 1 Hz) was selected at random from each of 5 different data-sets of swaths (see Fig. 4). The data-sets had few shadows, and all images in a data-set had the sun coming from the same angle (these parameters can cause the texture to change). A hand-labelled set of ground truth images was made for all 100 images. For each sequence, the classifier was learnt from one image and then applied to the others. For this image, the position of the swath was calculated using the 3D classifier, which extracts the width and position of the swath based on the 3D profile in the stereo images.

The results (see Fig. 5) are very good, with an average detection rate of around 90% relative to the ground truth. The false alarm rate is around 4% and is thus also good. Some images in data-set #3 have problems getting the heading of the swath correct, as in Fig. 4 where 2(c) has a slightly wrong heading. This error was attributed to the limited resolution at that distance, meaning that there was not enough texture information to classify reliably on this specific swath. There was a spike on image #17 in data-set #4, due to an ambiguity in the image where the classifier chose a solution slightly to the left of the hand-labelled image, to compensate for a large lump of swath lying off centre relative to the rest of the swath. This is shown in Fig. 6. Such artifacts are unavoidable in a natural environment, and fault-tolerant techniques need to be employed to avoid undesired behaviours of the steering control. The fault-tolerance aspects are discussed in Section 8.
7. Mapping

The mapping system relies on recent research in high-precision positioning using visual odometry (VO) fused with GPS (see [11]). By tracking features across images, the change in pose of the camera can be estimated; this is done by computing changes in position between image frames. GPS provides a global correction so that drift in the VO subsystem is avoided. The fused position estimate is used to maintain a map of the position of the swath. The swath map is modelled as a Taylor series expansion of a clothoid [17]:
y(x) = y0 + tan(φ)·x + C0·x²/2  (9)
where y0 is the lateral offset between the vehicle and the swath centre, φ is the angle of the swath relative to the vehicle, C0 is the curvature of the swath and x is the distance ahead of the vehicle. Estimating the curvature from a single image can be difficult, due to the natural variance of the swath position and width. Estimating the curvature requires a larger "accumulated field of view", which is obtained as follows.

For each tracked image, the centre point of the swath is extracted for each horizontal line in the image, and the centre points are stored in a list. Old centre points are deleted when they represent positions behind the baler, and new centre points are added as new images are tracked. Clothoid parameters are estimated by least-squares fitting of the parameters in Eq. 9 to the centre points in the list, hence representing a range from the camera field of view back to the pickup on the baler. The baler picks up the swath, so the dynamic map naturally ends at the baler.
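The clothoid fit of Eq. (9) is linear in (y0, tan φ, C0) and can therefore be solved with ordinary least squares, for example:

```python
import numpy as np

def fit_clothoid(x, y):
    """Least-squares fit of the clothoid approximation of Eq. (9),
    y(x) = y0 + tan(phi) x + C0 x^2 / 2, to accumulated swath centre
    points (vehicle frame, x ahead of the vehicle)."""
    A = np.column_stack([np.ones_like(x), x, 0.5 * x ** 2])
    (y0, t, C0), *_ = np.linalg.lstsq(A, y, rcond=None)
    return y0, np.arctan(t), C0   # lateral offset, angle, curvature
```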
8. Supervision and fault-tolerance

Two supervision systems run concurrently in the software. The first operates solely on position sensor information (Positioning Supervision in Fig. 1); it is based on work described in [9] and is not described further. The second is concerned with supervising the trained texture classifier (Texture Supervision in Fig. 1).

To make sure the trained texture classifier is not faulty, a supervision system has been set up for validation. The principle is to evaluate the texture classifier relative to the 3D classifier over a number of frames, and change detection theory is applied to determine whether the texture classifier output is similar to the 3D tracking. Two residuals are formed: one for lateral and another for angular deviation. The residuals are simply the differences between the outputs of the 3D and texture-based classifiers.

When no prior information can be assumed, statistical change detection is often based on a log likelihood ratio si:
si = ln( pθ1(ri) / pθ0(ri) )  (10)
where the probability densities from the observed distributions are used, pθ1 for the case of a fault and pθ0 for the normal case, respectively, and the value of a residual at sample instant i is denoted ri.

A decision on whether a change has taken place needs averaging over a number of values of si in order to get adequate confidence. In the implemented change detection algorithm, a standard cumulative sum (CUSUM) algorithm is used, see e.g. [18] and references therein, to detect a change in mean:
S(k) = Σ_{i=1}^{k} si = Σ_{i=1}^{k} ln( pθ1(ri) / pθ0(ri) )  (11)
To obtain estimates of the probability density functions for the normal and faulty cases, histograms of the residuals were analysed for cases with and without faults; these are shown in Fig. 7. By inspection, faults appear as a shift in the mean µ of the residuals ri, while the standard deviation σ remains unchanged. Therefore, the distributions for the normal and faulty cases are approximated by scalar Gaussian distributions:
Normal: N(µ0, σ): pθ0(ri) = (1/(σ√(2π))) exp( −(ri − µ0)²/(2σ²) )  (12)

Fault: N(µ1, σ): pθ1(ri) = (1/(σ√(2π))) exp( −(ri − µ1)²/(2σ²) )  (13)
A scalar test function g(k) is formed as an approximation to S(k) over a window of N samples, giving the well-known

g(k) = Σ_{i=k−N}^{k} ln( pθ1(ri) / pθ0(ri) ) = ((µ1 − µ0)/σ²) Σ_{i=k−N}^{k} ( ri − (µ1 + µ0)/2 )  (14)
Changes in mean of 20 cm and 5 deg are used to reject a trained texture classifier. The variances are estimated to be σ² = 88.0 cm² and σ² = 16.5 deg², respectively. A sample window of N = 20 (roughly 2 seconds) was used to calculate the cumulative sum, with tests for both a positive and a negative change in mean. A fault flag was triggered if this value exceeded γ = 5 for the lateral fault and γ = 2 for the angular fault.
If a test is to be made for a known change magnitude, the standard CUSUM test can be used; if the magnitude of the change must be assumed unknown, a Generalised Likelihood Ratio (GLR) test should be employed [19]. In this paper, a CUSUM detector is used because, in a fault-tolerant guidance system, a position deviation of known (small) magnitude can be tolerated.
For a change in mean of A = 20 cm over the sample window, this corresponds to detecting a change in mean between N(0, σ²/N) and N(A, σ²/N). The probability of a false alarm PFA is the right-tail area beyond the threshold under the no-fault distribution, and the probability of detection PD is the corresponding area under the fault distribution [19]:

PFA = Q( γ / √(σ²/N) ) = 0.0086  (15)

PD = Q( (γ − A) / √(σ²/N) ) = 0.9999  (16)
where Q is the right-tail probability of the standard Gaussian distribution. For the angular deviation, similar results are obtained: PFA = 0.0138, PD = 0.9995. This gives a good detection rate while also assuring a reasonably low false alarm rate.
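A sketch of the windowed CUSUM test of Eq. (14) with µ0 = 0, using the constants above, is given below; the batch formulation over a stored residual sequence is an illustrative choice.

```python
import numpy as np

def cusum_window(r, mu1, sigma2, N=20, gamma=5.0):
    """Windowed CUSUM test of Eq. (14) for a change in mean from 0 to
    mu1 with known variance sigma2; flags a fault when the statistic
    exceeds gamma. Run twice (with +mu1 and -mu1) to catch both a
    positive and a negative change in mean."""
    r = np.asarray(r, dtype=float)
    flags = np.zeros(r.size, dtype=bool)
    for k in range(N, r.size):
        g = (mu1 / sigma2) * np.sum(r[k - N:k + 1] - mu1 / 2.0)
        flags[k] = g > gamma
    return flags

# Lateral residual (cm):  mu1 = 20, sigma2 = 88.0, gamma = 5
# Angular residual (deg): mu1 = 5,  sigma2 = 16.5, gamma = 2
```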
The residual checks run concurrently with the rest of the system; Fig. 8 shows them for the run in Fig. 13. The maximum value over both positive and negative changes in mean (denoted g(k)) is shown at each time step. In this run two faults are detected; only the second fault is detected by both residuals.
Individual match scores for the two classifiers are first thresholded. If both are accepted, they are passed to the supervision system. If no faults are detected, the results are fused by taking a weighted mean of the lateral and angular deviations (weighted by the match scores). If a fault is detected, the texture classifier information is not added to the map. Mapping provides intelligent filtering of the classifier outputs, so jumps are not experienced when switching between the information sources. The approach outlined gave good fault-tolerance against artifacts in the image processing.
9. Control

The tracking control system used to collect the swath is shown in the block diagram in Fig. 1. The control system remains active as long as the match score is above a predefined threshold and there is map information ahead of the vehicle, such that a reference track can be computed. The control system has a variable offset that is calculated to ensure the bale chamber is filled evenly; if the bale chamber is unevenly filled, the bale becomes cone-shaped. Pressure sensors inside the baler provide a measure of how evenly it is filled, but the bale must reach a certain size before the pressure sensors give usable feedback. To compensate for this lack of feedback, the controller has two states. In the initial state, where pressure has not yet built up, an open-loop steering pattern is followed in which the vehicle alternates between being positioned on the left edge of the swath and on the right edge. This motion is parameterised by how far out this edge is relative to the centre of the swath, how long the vehicle should follow an edge, and how sharply it should change sides. The steering system changes mode when the bale size reaches a minimal level and the sensor feedback can be used. The sensor feedback provides a signal in the range −1 to 1 indicating how cone-shaped the bale is; an adequately amplified version of this signal is added to the measured lateral offset. The swath parameters y0, φ, C0 are finally fed to the tracking controller.
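The two controller states amount to a switched offset reference. The sketch below is illustrative only; the square-wave timing of the open-loop zig-zag and all parameter names are assumptions, as the paper gives no implementation details.

```python
def steering_offset(t, bale_big_enough, cone, amplitude, period, gain):
    """Reference lateral offset of Section 9, sketched: an open-loop
    zig-zag between the swath edges until the bale is large enough for
    pressure feedback, then a feedback offset from the cone-shape
    signal `cone` (in [-1, 1])."""
    if not bale_big_enough:
        side = 1 if (t // period) % 2 == 0 else -1  # alternate edges
        return side * amplitude
    return gain * cone
```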
10. Results

10.1. Texture Versus 3D

A comparison of the measured lateral and angular deviations of the swath position relative to the vehicle for the 3D and texture algorithms is shown in Fig. 10. Clearly, the two follow each other closely. It is interesting to note that the two algorithms complement each other, in that sudden drops in the match score seldom occur in both at the same time. The match score was on average higher for the texture method than for the 3D method in this sequence; this may differ from swath to swath, depending on how high the swath is and how slanted its edges are. Towards the end of the sequence (around frame 1500) the 3D method started to have problems detecting a height difference on a very flat section of swath, but the texture method continued tracking to the end. An example of the normalised images for texture and 3D classification can be seen in Fig. 11, and a comparison between the two classification methods and the fused approach using mapping is given in Fig. 12. Specifically, the texture method was found to be more sensitive to changes in lighting than the 3D method, for example due to shadows. The 3D method can potentially mistake standing crop for the field structure, to which the texture method is less sensitive. By combining texture and 3D, a more robust system is achieved. However, the system is still dependent on training using 3D, so it requires some initial 3D structure for the first training. If the 3D tracking detects a wrong 3D structure, this can potentially be detected by comparing it to the texture classification, which is less sensitive to this fault.
10.2. Field Tests

Results for the system operating live are provided in Fig. 13. The results span a period of about 800 s and involve creating 7 bales; the system steers autonomously. There were 3 drops in the match score, which occurred where the swath ended in the headland and manual steering was required to bring the vehicle to the next length of swath. During these turning periods the lateral deviation measurements were ignored. The match scores shown have been normalised to 0-1. The vehicle needed to stop when ejecting a finished bale, which can be observed in the velocity plot. The driver kept the velocity at 2 m/s but increased it to close to 3 m/s for a certain part of the run.
11. Conclusion

There has generally been little effort at improving redundancy in vision-based guidance systems. The system presented here improved redundancy by combining 3D measurements, texture, and mapping. The results in this paper showed that textures present in outdoor agricultural environments can be learnt and tracked robustly; 3D data provided by the stereo camera facilitated such learning. Geometric shape constraints allowed an initial sorting of false positives from a swath by analysis of width, position, and orientation. Supervision provided further detection of faults by comparing the learnt model to the 3D tracking. Future work on the texture classifier will involve eliminating the need for online learning based on the mask from 3D tracking, making it capable of performing independently of 3D data. More field tests will also be needed to see how the texture classifier handles moving shadows and lighting changes.

The mapping component allowed additional parameters, such as the swath curvature, to be tracked and increased the robustness of the overall system. Extensive tests demonstrated the system to work online on an autonomously steered vehicle on a variety of swaths.
12. References

[1] J. Jin, L. Tang, Corn plant sensing using real-time stereo vision, J. Field Robot. 26 (6-7) (2009) 591–608. doi:10.1002/rob.v26:6/7.

[2] F. Rovira-Mas, S. Han, J. Wei, J. F. Reid, Autonomous guidance of a corn harvester using stereo vision, Agricultural Engineering International: the CIGR Ejournal 9.

[3] T. Coen, A. Vanrenterghem, W. Saeys, J. De Baerdemaeker, Autopilot for a combine harvester, Comput. Electron. Agric. 63 (1) (2008) 57–64. doi:10.1016/j.compag.2008.01.014.

[4] T. Bakker, H. Wouters, K. van Asselt, J. Bontsema, L. Tang, J. Muller, G. van Straten, A vision based row detection system for sugar beet, Comput. Electron. Agric. 60 (1) (2008) 87–95. doi:10.1016/j.compag.2007.07.006.

[5] M. Kise, Q. Zhang, Creating a panoramic field image using multi-spectral stereovision system, Comput. Electron. Agric. 60 (1) (2008) 67–75. doi:10.1016/j.compag.2007.07.002.

[6] A. Angelova, L. Matthies, D. Helmick, P. Perona, Fast terrain classification using variable-length representation for autonomous navigation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[7] I. Posner, M. Cummins, P. Newman, A generative framework for fast urban labeling using spatial and temporal context, Auton. Robots 26 (2-3) (2009) 153–170. doi:10.1007/s10514-009-9110-6.

[8] M. R. Blas, M. Agrawal, A. Sundaresan, K. Konolige, Fast color/texture segmentation for outdoor robots, in: IROS, 2008, pp. 4078–4085.

[9] M. R. Blas, M. Blanke, Natural environment modeling and fault-diagnosis for automated agricultural vehicle, in: Proceedings of the 17th IFAC World Congress, Seoul, Korea, 2008, pp. 1590–1595.

[10] K. Konolige, Small vision systems: hardware and implementation, in: Intl. Symp. on Robotics Research, 1997, pp. 111–116.

[11] K. Konolige, M. Agrawal, M. R. Blas, R. C. Bolles, B. Gerkey, J. Sola, A. Sundaresan, Mapping, navigation, and learning for off-road traversal, J. Field Robot. 26 (1) (2009) 88–113. doi:10.1002/rob.v26:1.

[12] T. Leung, J. Malik, Representing and recognizing the visual appearance of materials using three-dimensional textons, International Journal of Computer Vision 43 (1) (2001) 29–44.

[13] M. Varma, A. Zisserman, Texture classification: Are filter banks necessary?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 2003, pp. 691–696.

[14] A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1988.

[15] B. Boser, I. Guyon, V. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, 1992, pp. 144–152.

[16] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L.-E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, P. Mahoney, Stanley: The robot that won the DARPA Grand Challenge, Journal of Field Robotics 23 (9) (2006) 661–692.

[17] B. Southall, C. J. Taylor, Stochastic road shape estimation, in: Proceedings of the IEEE International Conference on Computer Vision, Vol. 1, 2001, p. 205. doi:10.1109/ICCV.2001.10022.

[18] M. Blanke, M. Kinnaert, J. Lunze, M. Staroswiecki, Diagnosis and Fault-Tolerant Control, 2nd Edition, Springer, 2006.

[19] S. M. Kay, Fundamentals of Statistical Signal Processing, Volume 2, Prentice Hall, 1998.
Figure 1: Block diagram of the vision system and tracking control for baling. Supervised classification and positioning provide mapping of field structures, which is fed to the steering controller.
Figure 2: (a) Left image from the stereo camera. (b) Texton classification; each pixel colour represents a different texton (using a set of random colours). (c) The stereo camera image with a transparency mask based on the swath classification. (d) Swath classification based on texture, with intensity representing the strength of classification. (e) Visualization of the texture classification using 3D stereo information.
Figure 3: (a) The original image with a swath in the middle. (b) Mask illustrating the width, position, and orientation of the swath extracted from the stereo algorithm.

Figure 4: Images 1(a-e) illustrate the training images used. Image #10 in each dataset is classified in 2(a-e), with the blue lines representing the swath bounds.
Figure 5: Classification (detection) rate and false alarm rate for the tested sequences (Sets 1-5, images #1-20).
Figure 6: The worst match in the used datasets according to the scoring scheme; the algorithm follows a lump of swath slightly to the left of the rest of the swath. Image #17 in dataset #4.
Figure 7: Normalised histograms for residuals with and without faults (angular deviation in degrees, lateral deviation in cm). The histograms are shown with fitted Gaussian probability density functions.
Figure 8: Fault detection in angular and lateral deviations using a CUSUM change detector, with the test statistic g(k) plotted per frame. The thick line defines the threshold for a fault.
Figure 9: The open-loop zig-zag motion of the vehicle, clearly illustrated by the red line. The vehicle alternately drives on the left edge of the swath and then on the right edge.
Figure 10: Results for 1800 frames of video. The measured lateral deviation (cm) and angular deviation (degrees) of the swath relative to the vehicle are plotted for both the 3D and texture methods, together with their match scores. There was no swath inside the field of view between frames 950-1100, 1270-1380, and 1600-1800.
Figure 11: An example of classification results from an image pair; only the left image is shown for clarity. In the disparity image, warmer colours indicate shorter range. The whiter a pixel in the classified images, the stronger the detected field structure.
                                                          3D     Texture   3D and texture with mapping
Sensitivity to lighting                                   Low    High      Low
Sensitivity to other 3D structures                        High   Low       High (but can detect fault)
Requires 3D field structure                               Yes    No        No (after training)
Requires field structure with texture
  different from surroundings                             No     Yes       No
Can reliably identify field structure curvature           No     No        Yes
Can detect classification faults                          No     No        Yes
Can fuse multiple image classifications (reduce noise)    No     No        Yes

Figure 12: Comparison of classification methods.
Figure 13: Results for a 13 minute run where 7 bales are created; the controller uses the feedback signal from the baler. Panels: swath lateral deviation vs. scaled bale pressure difference (cm), bale diameter (cm), match score, and velocity (m/s), over time (s).