Slides from "Searching 35 Million Images by Color Using Solr" presented by Chris Becker at Solr Lucene Revolution 2014 in Washington D.C.
Searching Images by Color
Chris Becker
Search Engineering @ Shutterstock
What is Shutterstock?
• Shutterstock sells stock images, videos & music.
• Crowdsourced from artists around the world
• Shutterstock reviews and indexes them for search
• Customers buy a subscription and download them
Why search by color?
Color is one of many visual attributes that you can use to create an engaging image search experience
Shutterstock Labs
Spectrum
Palette
Diving into Color Data
Calculating Distances Between Colors
• Euclidean distance works reasonably well in any color space
distRGB = sqrt((r1-r2)^2 + (g1-g2)^2 + (b1-b2)^2)
distHSL = sqrt((h1-h2)^2 + (s1-s2)^2 + (l1-l2)^2)
distLCH = sqrt((L1-L2)^2 + (C1-C2)^2 + (H1-H2)^2)
distLAB = sqrt((L1-L2)^2 + (a1-a2)^2 + (b1-b2)^2)
• More sophisticated equations that better account for human
perception can be found at
http://en.wikipedia.org/wiki/Color_difference
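As a minimal sketch of the Euclidean formula above, the following Python computes the distance between two colors given as tuples. The helper names `euclidean_distance` and `hex_to_rgb` are illustrative and do not come from the slides.

```python
import math

def euclidean_distance(c1, c2):
    """Straight-line distance between two colors given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def hex_to_rgb(hex_color):
    """Convert a hex string such as 'ff0000' into an (r, g, b) tuple of ints."""
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))

red = hex_to_rgb("ff0000")
dark_red = hex_to_rgb("cc0000")
blue = hex_to_rgb("0000ff")

# Colors that look alike should come out closer than colors that do not.
print(euclidean_distance(red, dark_red))
print(euclidean_distance(red, blue))
```

The same function works for any three-component color space. One caveat: hue in HSL and LCH is an angle, so a plain subtraction treats hues on either side of the wrap-around point as far apart. This sketch does not handle that.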
Images are just numbers
[
[[054,087,058], [054,116,206], [017,226,194], [234,203,215], [188,205,000], [229,156,182]],
[[214,238,109], [064,190,104], [191,024,161], [104,071,036], [222,081,005], [204,012,113]],
[[197,100,189], [159,204,024], [228,214,054], [250,098,125], [050,144,093], [021,122,101]],
[[255,146,010], [115,156,002], [174,023,137], [161,141,077], [154,189,005], [242,170,074]],
[[113,146,064], [196,057,200], [123,203,160], [066,090,234], [200,186,103], [099,074,037]],
[[194,022,018], [226,045,008], [123,023,087], [171,029,021], [040,001,143], [255,083,194]],
[[115,186,246], [025,064,109], [029,071,001], [140,031,002], [248,170,244], [134,112,252]],
[[116,179,059], [217,205,159], [157,060,251], [151,205,058], [036,214,075], [107,103,130]],
[[052,003,227], [184,037,078], [161,155,181], [051,070,186], [082,235,108], [129,233,211]],
[[047,212,209], [250,236,085], [038,128,148], [115,171,113], [186,092,227], [198,130,024]],
[[225,210,064], [123,049,199], [173,207,164], [161,069,220], [002,228,184], [170,248,075]],
[[234,157,201], [168,027,113], [117,080,236], [168,131,247], [028,177,060], [187,147,084]],
[[184,166,096], [107,117,037], [154,208,093], [237,090,188], [007,076,086], [224,239,210]],
[[105,230,058], [002,122,240], [036,151,107], [101,023,149], [048,010,225], [109,102,195]],
[[050,019,169], [219,235,027], [061,064,133], [218,221,113], [009,032,125], [109,151,137]],
[[010,037,189], [216,010,101], [000,037,084], [166,225,127], [203,067,214], [110,020,245]],
[[180,147,130], [045,251,177], [127,175,215], [237,161,084], [208,027,218], [244,194,034]],
[[089,235,226], [106,219,220], [010,040,006], [094,138,058], [148,081,166], [249,216,177]],
[[121,110,034], [007,232,255], [214,052,035], [086,100,020], [191,064,105], [129,254,207]],
]
• getting histograms
• computing median values
• standard deviations / variance
• other statistics
Any operation you can do on a set of
numbers, you can do on an image
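To make the point concrete, here is a small sketch that treats a handful of pixels as plain numbers and computes a histogram, a median, and a standard deviation for one channel. It uses only the Python standard library. The six pixels are the first three values from each of the first two rows of the matrix above.

```python
from collections import Counter
from statistics import median, pstdev

# A tiny "image": two rows of three RGB pixels.
pixels = [
    [(54, 87, 58), (54, 116, 206), (17, 226, 194)],
    [(214, 238, 109), (64, 190, 104), (191, 24, 161)],
]

# Flatten the rows, then pull out the red channel as a list of numbers.
flat = [px for row in pixels for px in row]
reds = [r for r, g, b in flat]

red_histogram = Counter(reds)   # value -> number of pixels with that value
red_median = median(reds)
red_stdev = pstdev(reds)        # population standard deviation

print(red_histogram, red_median, red_stdev)
```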
Extracting Color Data
Tools & Libraries
• ImageMagick
• Python Imaging Library (PIL)
• ImageJ
# Python example to get a color histogram from an image
from pprint import pprint
from PIL import Image

# Convert to RGB so every color is an (r, g, b) tuple
image = Image.open('./samplephoto.jpg').convert('RGB')
width, height = image.size
# getcolors() returns (count, color) pairs; maxcolors must cover every pixel
colors = image.getcolors(width * height)
hist = {}
for count, (r, g, b) in colors:
    hex_color = '%02x%02x%02x' % (r, g, b)
    hist[hex_color] = count
pprint(hist)
Indexing & Searching in Solr
Indexing color histograms
color_txt = "cfebc2
cfebc2 cfebc2 cfebc2
cfebc2 cfebc2 cfebc2
cfebc2 cfebc2 cfebc2
95bf40 95bf40 95bf40
95bf40 95bf40 95bf40
2e6b2e 2e6b2e 2e6b2e
ff0000 …"
• index colors just like you would index text
• amount of color = frequency of the term
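One way to turn a histogram into such a text field is to repeat each hex color in proportion to its share of the pixels. The sketch below assumes a fixed budget of terms per image (`total_terms`). That parameter and the function name are illustrative choices, not something the slides specify.

```python
def histogram_to_color_text(hist, total_terms=100):
    """Repeat each hex color so its term count is proportional to its pixel share."""
    total_pixels = sum(hist.values())
    terms = []
    # Most frequent colors first, purely for readability of the output.
    for hex_color, count in sorted(hist.items(), key=lambda kv: -kv[1]):
        repeats = round(total_terms * count / total_pixels)
        terms.extend([hex_color] * repeats)
    return " ".join(terms)

# Hypothetical pixel counts for a four-color image.
hist = {"cfebc2": 500, "95bf40": 300, "2e6b2e": 150, "ff0000": 50}
color_txt = histogram_to_color_text(hist, total_terms=20)
print(color_txt)
```

Because rounding is applied per color, very rare colors can drop to zero terms, which also keeps the field short.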
Solr Schema & Queries
• Can use Solr's default ranking effectively:
/solr/select?q=ff0000 e2c2d2&qf=color&defType=edismax…
• or use term frequencies directly for specific sort functions:
sort=product(tf(color,"ff0000"),tf(color,"e2c2d2")) desc
<field name="color" type="text_ws" …>
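The request parameters above can be assembled programmatically. This is a rough sketch using only the standard library. The function names are hypothetical, and the parameters are limited to the ones shown on the slide.

```python
from urllib.parse import urlencode

def color_query_params(colors, field="color"):
    """Parameters for an edismax query over the color field."""
    return {
        "q": " ".join(colors),
        "qf": field,
        "defType": "edismax",
    }

def product_tf_sort(colors, field="color"):
    """Build a sort clause that multiplies the term frequency of each color."""
    tfs = ",".join('tf({},"{}")'.format(field, c) for c in colors)
    return "product({}) desc".format(tfs)

params = color_query_params(["ff0000", "e2c2d2"])
params["sort"] = product_tf_sort(["ff0000", "e2c2d2"])

# urlencode handles the escaping of spaces, quotes, and parentheses.
query_string = urlencode(params)
print("/solr/select?" + query_string)
```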
Indexing color statistics
lightness:
median: 2
standard dev: 1
largest bin: 0
largest bin size: 50
saturation
median: 0
standard dev: 0
largest bin: 0
largest bin size: 100
…
Represent aggregate statistics of each image
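The statistics listed above could be computed along these lines. This sketch assumes each channel is scaled to the range 0 to 1, split into 10 bins, and that "largest bin size" means the percentage of pixels in the fullest bin. The slides do not state the bin count or units, so treat both as guesses.

```python
import colorsys
from collections import Counter
from statistics import median, pstdev

def channel_stats(values, bins=10):
    """Bin values in [0, 1], then report median, stdev, and the fullest bin."""
    binned = [min(int(v * bins), bins - 1) for v in values]
    largest_bin, largest_count = Counter(binned).most_common(1)[0]
    return {
        "median": median(binned),
        "standard_dev": round(pstdev(binned), 2),
        "largest_bin": largest_bin,
        "largest_bin_size": round(100 * largest_count / len(binned)),
    }

def image_stats(rgb_pixels):
    """Aggregate lightness and saturation statistics for a list of RGB pixels."""
    hls = [colorsys.rgb_to_hls(r / 255, g / 255, b / 255) for r, g, b in rgb_pixels]
    return {
        "lightness": channel_stats([l for h, l, s in hls]),
        "saturation": channel_stats([s for h, l, s in hls]),
    }

# Hypothetical grayscale image: two black pixels, one white, one mid-gray.
stats = image_stats([(0, 0, 0), (0, 0, 0), (255, 255, 255), (128, 128, 128)])
print(stats)
```

Each resulting number can then be indexed as its own numeric field, one per statistic.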
Solr Fields & Queries
• Sort by the distance between the input parameter and the median value for each image
/solr/select?q=*:*&sort=abs(sub($query,hue_median)) asc
<field name="hue_median" type="int" …>
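To see what this sort does, the following sketch reproduces the same ordering in plain Python over a few made-up documents. It only illustrates the ranking logic and does not talk to Solr.

```python
# Hypothetical documents, each carrying a precomputed hue_median value.
docs = [
    {"id": "a", "hue_median": 200},
    {"id": "b", "hue_median": 35},
    {"id": "c", "hue_median": 48},
]

def sort_by_hue_distance(docs, query_hue):
    """Order documents by |query_hue - hue_median|, closest first."""
    return sorted(docs, key=lambda d: abs(query_hue - d["hue_median"]))

ranked = [d["id"] for d in sort_by_hue_distance(docs, query_hue=40)]
print(ranked)
```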
Ranking & Relevance
Which image is more relevant if I search for a particular color?
image from www.shutterstock.com
How do we account for these factors?
How much of the image contains the selected color?
• Score each color by the number of pixels
sort=tf(color,"cfebc2") desc
Balance Precision and Recall
• Reduce your color space enough to balance:
• color accuracy
• index size
• query complexity
• result counts
• only need 100-200 colors for a good UX
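One simple way to reduce the color space is to snap each RGB channel to a few evenly spaced levels. The sketch below assumes 5 levels per channel, which gives 5 × 5 × 5 = 125 possible colors, inside the 100-200 range mentioned above. The slides do not say which reduction method was actually used.

```python
def quantize_channel(value, levels=5):
    """Snap a 0-255 channel value to the nearest of `levels` evenly spaced values."""
    step = 255 / (levels - 1)
    return round(round(value / step) * step)

def quantize_color(rgb, levels=5):
    """Map an (r, g, b) tuple to the hex code of its nearest reduced-palette color."""
    return "%02x%02x%02x" % tuple(quantize_channel(v, levels) for v in rgb)

# An orange-ish pixel collapses onto one of the 125 palette entries.
print(quantize_color((250, 100, 10)))
```

Fewer levels shrink the index and raise result counts at the cost of color accuracy, and more levels do the opposite.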
Weighing Multiple Colors Together
• If you search for 2 or more colors, the top result should have
the most even distribution of those colors
• simple option: sort=product(tf(color,"ff9900"),tf(color,"2280e2")) desc
• more complex: compute the standard deviation or variance
of the term frequencies of matching color values for each
image, and sort the results with the lowest variance first.
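The two options can be compared offline with a short sketch. The image names and term frequencies below are invented for illustration, and the ranking runs in plain Python rather than in Solr.

```python
from statistics import pvariance

# Hypothetical term frequencies per image for the two queried colors.
images = {
    "even":   {"ff9900": 50, "2280e2": 50},
    "skewed": {"ff9900": 95, "2280e2": 5},
    "sparse": {"ff9900": 10, "2280e2": 10},
}
query = ["ff9900", "2280e2"]

def product_score(tfs, colors):
    """Multiply the term frequencies of the queried colors (higher is better)."""
    score = 1
    for c in colors:
        score *= tfs.get(c, 0)
    return score

def tf_variance(tfs, colors):
    """Variance of the queried colors' term frequencies (lower is more even)."""
    return pvariance([tfs.get(c, 0) for c in colors])

by_product = sorted(images, key=lambda n: product_score(images[n], query), reverse=True)
by_variance = sorted(images, key=lambda n: tf_variance(images[n], query))
print(by_product, by_variance)
```

In this toy data, variance alone cannot separate "even" from "sparse" because both are perfectly balanced, so in practice it would likely need to be combined with the total amount of matching color.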
Weighing Similar & Different Colors
• The score for one color should reflect all the colors in the image.
• At indexing time, increase the score based on similar colors;
decrease it based on differing colors.
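The slides do not give a formula for this adjustment, so the following is only one possible sketch. It assumes RGB Euclidean distance as the similarity measure, and the thresholds (`near`, `far`) and weights (`boost`, `penalty`) are arbitrary illustrative values.

```python
import math

def hex_to_rgb(hex_color):
    """Convert a hex string such as 'ff0000' into an (r, g, b) tuple of ints."""
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))

def rgb_distance(c1, c2):
    """Euclidean distance between two RGB tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def adjusted_scores(hist, near=60.0, far=200.0, boost=0.5, penalty=0.25):
    """hist maps hex color -> pixel count; thresholds and weights are guesses."""
    adjusted = {}
    for color, count in hist.items():
        score = float(count)
        for other, other_count in hist.items():
            if other == color:
                continue
            d = rgb_distance(hex_to_rgb(color), hex_to_rgb(other))
            if d <= near:
                score += boost * other_count      # similar color reinforces
            elif d >= far:
                score -= penalty * other_count    # very different color dilutes
        adjusted[color] = max(score, 0.0)         # never index a negative score
    return adjusted

# Two near-identical reds reinforce each other; the lone blue is pushed down.
scores = adjusted_scores({"ff0000": 40, "ee1111": 40, "0000ff": 20})
print(scores)
```

The adjusted scores, rather than raw pixel counts, would then determine how many times each color term is written to the indexed field.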
Conclusion
• Steps for building color search in Solr:
• Extract colors using a tool like the Python Imaging Library
• Score colors based on the number of pixels
• Adjust scores based on similar / different colors
• Index colors into Solr as text
• In your query, sort by the term frequency values for each color
One more demo…