3. What is Shutterstock?
• Shutterstock sells stock images, videos & music.
• Crowdsourced from artists around the world
• Shutterstock reviews and indexes them for search
• Customers buy a subscription and download them
13. Any operation you can do on a set of numbers, you can do on an image
• getting histograms
• computing median values
• standard deviations / variance
• other statistics
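A minimal sketch of the idea above, treating an image as a plain list of (r, g, b) tuples. The helper names `luminance` and `image_statistics` are hypothetical, and reducing each pixel to one brightness number with the Rec. 601 luma weights is an assumption made only for illustration. With PIL, the pixel list could come from something like `list(image.getdata())`.

```python
import statistics
from collections import Counter

def luminance(rgb):
    """Approximate brightness of an (r, g, b) pixel as an integer 0-255."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def image_statistics(pixels):
    """Treat the image as a set of numbers and summarize it."""
    values = [luminance(p) for p in pixels]
    return {
        "histogram": Counter(values),            # brightness value -> pixel count
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),      # population standard deviation
        "variance": statistics.pvariance(values),
    }

# A tiny hypothetical 2x2 "image": two black pixels and two white pixels.
pixels = [(0, 0, 0), (0, 0, 0), (255, 255, 255), (255, 255, 255)]
stats = image_statistics(pixels)
```

The same pattern works per channel (red, green, blue) or on hue, saturation and lightness instead of brightness.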
17. Python example: get a histogram from an image
from pprint import pprint
from PIL import Image

# Convert to RGB so every color is an (r, g, b) tuple
image = Image.open('./samplephoto.jpg').convert('RGB')
width, height = image.size
# Allow up to width*height colors so getcolors() never returns None
colors = image.getcolors(width * height)
hist = {}
for count, (r, g, b) in colors:
    hex_color = '%02x%02x%02x' % (r, g, b)
    hist[hex_color] = count  # hex color -> number of pixels
pprint(hist)
19. Indexing color histograms
• index colors just like you would index text
• amount of color = frequency of the term
color_txt = "cfebc2
cfebc2 cfebc2 cfebc2
cfebc2 cfebc2 cfebc2
cfebc2 cfebc2 cfebc2
95bf40 95bf40 95bf40
95bf40 95bf40 95bf40
2e6b2e 2e6b2e 2e6b2e
ff0000 …"
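One way to produce such a text field from a histogram is to repeat each hex color in proportion to its pixel count. This is only a sketch: `histogram_to_text` and the `total_tokens` scaling are assumptions, not something the slides specify, and the sample counts are invented.

```python
def histogram_to_text(hist, total_tokens=100):
    """Turn {hex_color: pixel_count} into a whitespace-separated string.

    Each color is repeated in proportion to its share of the pixels, so the
    term frequency in the index approximates the amount of that color.
    """
    total = sum(hist.values())
    tokens = []
    # Most common colors first, purely for readability of the output.
    for color, count in sorted(hist.items(), key=lambda kv: -kv[1]):
        repeats = round(count / total * total_tokens)
        tokens.extend([color] * repeats)
    return " ".join(tokens)

# Hypothetical pixel counts for a 1000-pixel image.
hist = {"cfebc2": 500, "95bf40": 300, "2e6b2e": 150, "ff0000": 50}
color_txt = histogram_to_text(hist, total_tokens=20)
```

Scaling to a fixed token budget keeps the field size independent of image resolution. Colors whose share rounds to zero are dropped, which may or may not be desirable.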
20. Solr Schema & Queries
<field name="color" type="text_ws" …>
• Can use Solr's default ranking effectively:
/solr/select?q=ff0000 e2c2d2&qf=color&defType=edismax…
• or use term frequencies directly for specific sort functions:
sort=product(tf(color,"ff0000"),tf(color,"e2c2d2")) desc
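The two query styles above could be assembled in Python roughly as follows. The host, port and core name in `SOLR_SELECT` are hypothetical, as is the helper `color_query_url`. Only the `q`, `qf`, `defType` and `sort` parameters come from the slide.

```python
import re
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port and core name as needed.
SOLR_SELECT = "http://localhost:8983/solr/images/select"

def color_query_url(colors, use_tf_sort=False):
    """Build a select URL that searches the `color` field for hex colors."""
    for c in colors:
        # Accept only 6-digit lowercase hex so nothing unexpected lands in sort.
        if not re.fullmatch(r"[0-9a-f]{6}", c):
            raise ValueError("not a hex color: %r" % c)
    params = {"q": " ".join(colors), "qf": "color", "defType": "edismax"}
    if use_tf_sort:
        tf_terms = ['tf(color,"%s")' % c for c in colors]
        if len(tf_terms) == 1:
            params["sort"] = tf_terms[0] + " desc"
        else:
            params["sort"] = "product(%s) desc" % ",".join(tf_terms)
    return SOLR_SELECT + "?" + urlencode(params)

url = color_query_url(["ff0000", "e2c2d2"], use_tf_sort=True)
```

`urlencode` takes care of escaping the spaces, quotes and parentheses, which are easy to get wrong when building the URL by hand.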
21. Indexing color statistics
Represent aggregate statistics of each image:
lightness:
  median: 2
  standard dev: 1
  largest bin: 0
  largest bin size: 50
saturation:
  median: 0
  standard dev: 0
  largest bin: 0
  largest bin size: 100
…
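Statistics of this shape could be computed along the following lines. The slides do not say how the values are binned, so this sketch assumes each channel is scaled to 0.0-1.0, split into 10 equal bins, and that "largest bin size" is a percentage of pixels. The function names are hypothetical.

```python
import colorsys
import statistics
from collections import Counter

def binned_channel_stats(values, bins=10):
    """Summarize channel values in the range 0.0-1.0 after binning them."""
    binned = [min(int(v * bins), bins - 1) for v in values]
    largest_bin, largest_count = Counter(binned).most_common(1)[0]
    return {
        "median": statistics.median(binned),
        "standard_dev": statistics.pstdev(binned),
        "largest_bin": largest_bin,
        # Share of pixels in the most populated bin, as a percentage.
        "largest_bin_size": round(100 * largest_count / len(binned)),
    }

def color_statistics(pixels, bins=10):
    """Per-channel statistics for a list of (r, g, b) tuples (0-255 each)."""
    hls = [colorsys.rgb_to_hls(r / 255, g / 255, b / 255) for r, g, b in pixels]
    return {
        "hue": binned_channel_stats([h for h, _, _ in hls], bins),
        "lightness": binned_channel_stats([l for _, l, _ in hls], bins),
        "saturation": binned_channel_stats([s for _, _, s in hls], bins),
    }

# Hypothetical image: half black, half white, so no saturation at all.
pixels = [(0, 0, 0), (0, 0, 0), (255, 255, 255), (255, 255, 255)]
stats = color_statistics(pixels)
```

Each resulting number can then be stored in its own integer field, one per channel and statistic.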
22. Solr Fields & Queries
<field name="hue_median" type="int" …>
• Sort by the distance between the input parameter and the median value for each image
/solr/select?q=*&sort=abs(sub($query,hue_median)) asc
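In the sort expression, `$query` refers to a request parameter named `query`, so the function evaluates to the absolute difference between the requested hue and each image's `hue_median`. The ordering it produces can be emulated locally as below. The documents and the helper `rank_by_hue_distance` are made up for illustration.

```python
def rank_by_hue_distance(docs, query_hue):
    """Order documents as abs(sub($query,hue_median)) asc would."""
    return sorted(docs, key=lambda doc: abs(query_hue - doc["hue_median"]))

# Hypothetical indexed documents with a binned hue median.
docs = [
    {"id": "a", "hue_median": 7},
    {"id": "b", "hue_median": 3},
    {"id": "c", "hue_median": 4},
]
ranked_ids = [doc["id"] for doc in rank_by_hue_distance(docs, query_hue=4)]
```

Images whose median hue equals the requested value come first, followed by progressively more distant ones.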
29. How much of the image contains the selected color?
• Score each color by the number of pixels
sort=tf(color,"cfebc2") desc
30. Balance Precision and Recall
• Reduce your color space enough to balance:
  • color accuracy
  • index size
  • query complexity
  • result counts
• Only 100-200 colors are needed for a good UX
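One simple way to shrink the color space is to snap each RGB channel to a few evenly spaced levels. This is a sketch under the assumption of uniform quantization with 5 levels per channel (5**3 = 125 colors, inside the 100-200 range). The slides do not say which reduction method was used, and the function names are hypothetical.

```python
LEVELS = 5  # assumed: 5 levels per channel gives 5**3 = 125 colors

def quantize_channel(value, levels=LEVELS):
    """Snap a 0-255 channel value to the nearest of `levels` even steps."""
    step = 255 / (levels - 1)
    return round(round(value / step) * step)

def quantize_hex(hex_color, levels=LEVELS):
    """Map a 6-digit hex color to its nearest color in the reduced palette."""
    channels = [int(hex_color[i:i + 2], 16) for i in (0, 2, 4)]
    return "%02x%02x%02x" % tuple(quantize_channel(c, levels) for c in channels)

def quantize_histogram(hist, levels=LEVELS):
    """Merge pixel counts of colors that fall on the same reduced color."""
    reduced = {}
    for color, count in hist.items():
        key = quantize_hex(color, levels)
        reduced[key] = reduced.get(key, 0) + count
    return reduced

# Two nearly identical reds collapse into a single indexed term.
reduced = quantize_histogram({"ff0000": 3, "fe0101": 2})
```

Query colors need to go through the same quantization so that they match the indexed terms.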
31. Weighing Multiple Colors Together
• If you search for 2 or more colors, the top result should have
the most even distribution of those colors
• simple option:
sort=product(tf(color,"ff9900"),tf(color,"2280e2")) desc
• more complex: compute the standard deviation or variance
of the term frequencies of matching color values for each
image, and sort the results with the lowest variance first.
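Both options can be emulated outside Solr to see how they rank images. The sketch below uses invented per-image term frequencies. Restricting the variance ranking to images that contain every requested color is an assumption added here, since otherwise an image with none of the colors would have zero variance and rank first.

```python
import statistics

def rank_by_product(images, query_colors):
    """Simple option: sort by the product of the term frequencies, descending."""
    def score(hist):
        result = 1
        for color in query_colors:
            result *= hist.get(color, 0)
        return result
    return sorted(images, key=lambda name: score(images[name]), reverse=True)

def rank_by_variance(images, query_colors):
    """More complex option: lowest variance of the matching frequencies first."""
    candidates = {
        name: [hist[c] for c in query_colors]
        for name, hist in images.items()
        if all(hist.get(c, 0) > 0 for c in query_colors)
    }
    return sorted(candidates, key=lambda name: statistics.pvariance(candidates[name]))

# Hypothetical term frequencies per image.
images = {
    "even": {"ff9900": 50, "2280e2": 50},
    "skewed": {"ff9900": 90, "2280e2": 10},
    "single": {"ff9900": 100},
}
by_product = rank_by_product(images, ["ff9900", "2280e2"])
by_variance = rank_by_variance(images, ["ff9900", "2280e2"])
```

The product rewards images that have a lot of both colors, while the variance only measures how evenly they are split, so the two can disagree on images with small but equal amounts of each color.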
32. Weighing Similar & Different Colors
• The score for one color should reflect all the colors in the image.
• At indexing time, increase the score based on similar colors;
decrease it based on differing colors.
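The slides do not give a formula for this adjustment, so the following is purely an illustrative sketch. It assumes Euclidean distance in RGB as the similarity measure, and the `near`, `boost` and `penalty` parameters are invented tuning knobs.

```python
import math

MAX_DISTANCE = math.sqrt(3 * 255 ** 2)  # distance between black and white

def to_rgb(hex_color):
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))

def normalized_distance(a, b):
    """Euclidean RGB distance between two hex colors, scaled to 0.0-1.0."""
    return math.dist(to_rgb(a), to_rgb(b)) / MAX_DISTANCE

def adjust_scores(hist, near=0.25, boost=0.5, penalty=0.1):
    """Raise a color's score for similar neighbors, lower it for different ones."""
    adjusted = {}
    for color, count in hist.items():
        score = float(count)
        for other, other_count in hist.items():
            if other == color:
                continue
            d = normalized_distance(color, other)
            if d <= near:
                score += boost * other_count * (1 - d / near)  # similar color
            else:
                score -= penalty * other_count * d             # differing color
        adjusted[color] = max(score, 0.0)
    return adjusted

# Red next to an almost identical red vs. red next to blue (invented counts).
similar = adjust_scores({"ff0000": 50, "fe0000": 50})
different = adjust_scores({"ff0000": 50, "0000ff": 50})
```

The adjusted scores, rather than raw pixel counts, would then decide how many times each color is repeated in the indexed text.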
34. Conclusion
• Steps for building color search in Solr:
  • Extract colors using a tool like the Python Imaging Library (PIL)
  • Score colors based on the number of pixels
  • Adjust scores based on similar / different colors
  • Index colors into Solr as a text field
  • In your query, sort by the term frequency values for each color