SlideShare a Scribd company logo
1 of 28
Rapid
Solr Schema
Development
Alexandre Rafalovitch (@arafalov)
Apache Solr Committer
Montreal Solr/ML meetup May 2018
Phone directory - content
Names, often from multiple cultures
Addresses
Phone numbers
Company/Group
Locations
Other fun data
I use https://www.fakenamegenerator.com/ for demos
 Can generate bulk entries in csv, tab-separated, sql, etc
 Many fields, languages, regions
 Warning: comes with an – invisible – byte order mark
Slide 2
Today's exploration
Solr 7.3 (latest)
The smallest learning schema/configuration required
Rapid schema evolution workflow
Free-form and fielded user entry
Dealing with multiple languages
Dealing with alternative name spellings
Searching phone numbers by any-length suffix
Configuring Solr to simplify API interface
(Bonus points) Fit into 40 minutes presentation!
Slide 3
Today's dataset
http://www.fakenamegenerator.com/ - Bulk request (20000 identities) – Free and configurable!
Name sets: American, Arabic, Australian, Chinese, French, Hispanic, Polish, Russian, Russian
(Cyrillic), Thai
Countries: Australia, Canada, France, Poland, Spain, United Kingdom, United States
Age range: 19 - 85 years old
Gender: 50% male, 50% female
Fields:
id,Gender,NameSet,Title,GivenName,MiddleInitial,Surname,StreetAddress,City,StateFull,ZipCod
e,CountryFull,EmailAddress,Username,TelephoneNumber,TelephoneCountryCode,Birthday,Age,T
ropicalZodiac,Color,Occupation,Company,BloodType,Kilograms,Centimeters,GUID,Latitude,Longi
tude
Renamed first field (Number) to id to fit Solr's naming convention
Removed BOM (in Vim, :set nobomb)
Slide 4
First try – Solr's built in schema
bin/solr start – standalone (non-clustered) server with no initial collections
bin/solr create -c demo1 – uses default configset, with 'schemaless' mode, not for production
Starts with 4 fields (id, _text_, _version_, _root_)
Auto-creates the rest on first occurance
bin/post -c demo1 ../dataset.csv
auto-detect content type from extension
can bulk upload files
see techproducts shipped example
bin/solr start –e techproducts
For one file, can also do via Admin UI
DEMO
Slide 5
Schemaless schema – lessons learned
Imported 1 record
Failed on the second one, because ZipCode was detected as a number
Can fix that by explicit configuration and rebuilding – see films example
(example/films/README.txt)
Other issues
Dual fields for text and string
Everything multivalued – because "just in case" – No sorting, API is messier, etc
Many large files
managed-schema: 546 lines (without comments)
solrconfig.xml: 1364 lines (with comments)
Plus another 42 configuration files, mostly language stopwords
Home work to get this working – not enough time today
Slide 6
Learning schema
managed-schema: start from nearly nothing – add as needed
solrconfig.xml: start from nearly all defaults – Most definitely NOT production ready
Not SolrCloud ready – add those as you scale
No extra field types – add as you need them
How small can we go?!?
Based on exploration done for my presentation at Lucene/Solr Revolution 2016
https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-
2016 (slides and video)
https://github.com/arafalov/solr-deconstructing-films-example - repo
A bit out of date – schemaless mode was tuned since
Today's version uses latest Solr feature
https://github.com/arafalov/solr-presentation-2018-may/commits/master (changes commit-
by-commit)
Slide 7
Learning schema – managed-schema
<?xml version="1.0" encoding="UTF-8"?>
<schema name="smallest-config" version="1.6">
<field name="id" type="string" required="true" indexed="true" stored="true" />
<field name="_text_" type="text_basic" multiValued="true" indexed="true"
stored="false" docValues="false"/>
<dynamicField name="*" type="text_basic" indexed="true" stored="true"/>
<copyField source="*" dest="_text_"/>
<uniqueKey>id</uniqueKey>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
<fieldType name="text_basic" class="solr.SortableTextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
</schema>
Slide 8
Learning schema – solrconfig.xml
<?xml version="1.0" encoding="UTF-8" ?>
<config>
<luceneMatchVersion>7.3.0</luceneMatchVersion>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">_text_</str>
<str name="echoParams">all</str>
</lst>
</requestHandler>
</config>
Slide 9
2 files, 33 lines combined, including blanks – but Will It Blend Search?
bin/solr create -c tinydir -d ../configs/smallest/ - provide custom config files to the collection
bin/post -c tinydir ../dataset.csv – Remember the BOM and renaming column Number->id
Does it search?
General search?
Case-insensitive search?
Range search: Centimeters:[* TO 99]
Fielded search?
Facet?
Sort?
Are ids preserved?
Are individual fields easy to work with (fl, etc)?
DEMO
Learning schema – create and index
Slide 10
It works! And ready to start being used from other parts of the project
Do NOT expose Solr directly to the Internet. Not until you are a Solr Wizard, the Gray.
managed-schema file has NOT changed – because of dynamicField
Still 21 lines
Would still keep the comments
Would still preserve field/type definitions
Will change on first AdminUI/API modification – gets rewritten
What else? Actual search-engine tuning!
Special cases
Numerics – e.g. for Range search
Spatial search – e.g. for Mapping/distance ranking
Multivalued fields
Dates
Special parsing (e.g. names/surnames)
Useful telephone number search
Relevancy tuning!
Learning schema - conclusion
Slide 11
Several possibilities
Admin UI
Delete schema field
Add schema field with new definition
Reindex
Sometimes causes docValue-related exception, have to rebuild collection from scratch
Schema API (Admin UI uses a subset of it)
See: https://lucene.apache.org/solr/guide/7_3/schema-api.html
Also has Replace a Field
Also has Add/Delete Field Type
Great to use programmatically or with something like Postman (https://www.getpostman.com/)
Edit schema/solrconfig.xml directly and reload the collection
Not recommended for production, but OK with a single server/single developer
Remember to edit actual scheme not the original config one
◦ Check "Instance" location in Admin UI, in collections' Overview screen
Remember that in SolrCloud mode, the config files are NOT on disk (they are in ZooKeeper).
Evolving schema
Slide 12
Numeric fields
 Age – int
 Centimeters (height?) – int
 Kilograms – float
Copy missing field types (pint, pfloat) from solr-7.3.0/server/solr/configsets/_default/conf/managed-schema
Map numeric fields explicitly
Delete content due to radical storage needs change
 bin/post -c tinydir -format solr -d "<delete><query>*:*</query></delete>"
Reload the core in Admin UI's Core Admin (menu is different in SolrCloud mode)
Index again
 bin/post -c tinydir ../dataset.csv
New queries
 facet=true&facet.range=Age&facet.range.start=0&facet.range.end=200&facet.range.gap=10
 Centimeters:[* TO 99] (again)
DEMO
Evolving schema – add numeric fields
Slide 13
Solr supports extensive spatial search
https://lucene.apache.org/solr/guide/7_3/spatial-search.html
bounding-box with different shapes (circles, polygons, etc)
distance limiting or boosting
different options with different functionalities
LatLonPointSpatialField
SpatialRecursivePrefixTreeFieldType
BBoxField
All require combined Lat Lon coordinates (lat,lon)
We are providing separate Latitude and Longitude fields – need to merge them with a comma
Let's copy a field type and create a field:
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" geo="true"
distErrPct="0.025" maxDistErr="0.001" distanceUnits="kilometers" />
<field name="location" type="location_rpt" indexed="true" stored="true" />
Remember to reload – no need to delete, as it is a new field
Next, need to also give merge instructions with an Update Request Processor
Evolving schema – spatial search
Slide 14
Update Request Processors
Deal with the data before it touches the schema
Can do pre-processing magic with many, many processors
See: https://lucene.apache.org/solr/guide/7_3/update-request-processors.html
See: http://www.solr-start.com/info/update-request-processors/ (mine)
Some are more magical then others and have shortcuts, e.g. TemplateUpdateProcessorFactory
All can be configured with chains in solrconfig.xml and apply explicitly or by default
That's how the schemaless mode works (default chain in solrconfig.xml of _default configset)
Also check the way dates are parsed in it, search for parse-date – can be used standalone
IgnoreFieldUpdateProcessorFactory could be useful to drop fields we don't want Solr to process at all
(including in collect-all _text_ field)
Let's reindex everything using the template to populate the new field:
bin/post -c tinydir -params "processor=template&template.field=location:{Latitude},{Longitude}" ../dataset.csv
Query:
q=*:*&rows=1&
fq={!geofilt sfield=location}&
pt=45.493444, -73.558154&d=100&
facet=on&facet.field=City&facet.mincount=1
DEMO
Evolving schema – URPs
Slide 15
Search for John and look at the phone numbers (q=John&fl=TelephoneNumber):
03.99.56.91.63
(08) 9435 3911
79 196 65 43
306-724-3986
Can we search that?
TelephoneNumber:3911 – yes
TelephoneNumber:"65 43" – sort of (need to quote or know these are together)
TelephoneNumber:3986 – sort of: some at the end, some at middle
Use Case: Just search the last digits (suffix) regardless of formatting
We have MANY analyzers, tokenizers, and character and token filters to help us with it
https://lucene.apache.org/solr/guide/7_3/understanding-analyzers-tokenizers-and-filters.html
http://www.solr-start.com/info/analyzers/ (mine)
Evolving schema – phone numbers
Slide 16
Let's define a super-custom field type:
<fieldType name="phone" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])"
replacement="" replace="all"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])"
replacement="" replace="all"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
</fieldType>
Notice
Asymmetric analyzers
Reversing the string to make it end-digits starts digit (make sure that's symmetric!)
Edge n-grams (3-30 character substrings) - makes the index larger, but the search very fast
Evolving schema – digits-only type
Slide 17
Remap TelephoneNumber to it
<field name="TelephoneNumber" type="phone"
indexed="true" stored="true" />
And reindex (don't forget our speed hack' for now):
bin/post -c tinydir -params
"processor=template&template.field=location:{Latitude},{Longitude
}" ../dataset.csv
Check terms in Admin UI Schema screen and do our test searches
TelephoneNumber:3911
TelephoneNumber:"65 43"
TelephoneNumber:6543
TelephoneNumber:3986
DEMO
Evolving schema – digits-only type - cont
Slide 18
Many languages have accents on letters
Frédéric, Thérèse, Jérôme
Many users can't be bothered to type them
Sometimes, they don't even know how to type them
Łódź, Kędzierzyn-Koźle
Can we just ignore the accents when we search?
Several ways, but let's use the simplest by insert a filter into the text_basic type definition
<filter class="solr.ASCIIFoldingFilterFactory" />
Before the LowerCaseFilterFactory
Reload the collection and reindex – because the filter is symmetric (affects indexing)
Search without accents, general or fielded
Lodz, Frederic, Therese, GivenName:jerome
DEMO
Evolving schema – collapsing accents
Slide 19
What are similar names to 'Alexandre':
q=GivenName:Alexandre~2&
facet=on&facet.field=GivenName&facet.mincount=1
Alexander, Alexandra, Alexandrin, Leixandre, Alexandre, Alexandrie
We can't ask the user to enter arcane Solr syntax
Let's do a phonetic search instead
Bunch of different ways, each with its own tradeoffs
PhoneticFilterFactory, BeiderMorseFilterFactory, DaitchMokotoffSoundexFilterFactory,
DoubleMetaphoneFilterFactory,....
https://lucene.apache.org/solr/guide/7_3/phonetic-matching.html
Best to have one - or several - separate Field Type definitions with a copy field
Allows to experiment
Allows to trigger them at different times (e.g. in advanced search, but not general one)
Allows to tune them for relevancy by assign different weights
Evolving schema – Names and Surnames
Slide 20
How do we actually search multiple fields at once?
We've been using the default 'lucene' query parser so far on either _text_ or specific field
Solr has MANY parsers
General: "lucene", DisMax, Extended DisMax (edismax)
Specialized: Block Join, Boolean, Boost, Collapsing, Complex Phrase, Field, Filters, Function, Function Range,
Graph, Join, Learning to Rank, .....
 https://lucene.apache.org/solr/guide/7_3/other-parsers.html
We already used Spatial geofilt query parser: fq={!geofilt sfield=location}
edismax allows to search against multiple fields, with different weights, boosts, ties, minimum-
match specifications, etc
Choose with defType=edismax or {edismax param=value param=value}search_string
Let's search for "George Brown" against (qf) "GivenName Surname Company StreetAddress City"
and display same fields only
DEMO
Try using http://splainer.io/ to review the results
Try with qf=GivenName^5 Surname^5 Company StreetAddress City
Side-trip into eDisMax and query parsers
Slide 21
Result: 149 records, but all over the field values
Enter RELEVANCY
Recall – did we find all documents?
Precision – did we find just the documents we needed
Recall and Precision – fight. Perfect recall is q=*:* ......
Ranking – First hit is very important, ones after that less so (not always)
Side note: Field sorting destroys ranking.
We were optimizing Recall
Dump everything into _text_ and let search sort it out
Optimizing for Precision may seem easy too
Under eDisMax, set mm=100%
DEMO
eDisMax exploration continues
Slide 22
It is a business decision what Precision and Recall mean for your use case
Often "find more just in case" and focus on "ranking better" is the right approach
Try
qf=GivenName^5 Surname^5 Company StreetAddress City (no mm)
qf=GivenName^5 Surname^5 Company StreetAddress City and mm=100%
qf=GivenName^5 Surname^5 _text_ and mm=100%
DEMO in Splainer
Relevancy business case for our names (GivenName, Surname)
UPPER/lower case does not matter
Exact spelling (with accents) matches best – new Field Type needed (actually original text_basic...)
Accent-free spelling matches next – existing text_basic and therefore dynamic field match is fine
Phonetic spelling matches lowest (but higher than fallback _text_ field) – new Field Type needed
eDisMax for ranking
Slide 23
<fieldType name="text_exact" class="solr.SortableTextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
</analyzer>
</fieldType>
<field name="GivenName_exact" type="text_exact" indexed="true" stored="false"/>
<field name="Surname_exact" type="text_exact" indexed="true" stored="false"/>
<field name="GivenName_ph" type="text_phonetic" indexed="true" stored="false"/>
<field name="Surname_ph" type="text_phonetic" indexed="true" stored="false"/>
<copyField source="GivenName" dest="GivenName_exact"/>
<copyField source="GivenName" dest="GivenName_ph"/>
<copyField source="Surname" dest="Surname_exact"/>
<copyField source="Surname" dest="Surname_ph"/>
Multiple fields for same content
Slide 24
Our test cases
Frédéric, Thérèse, Jérôme
Check different analysis in Admin UI's Analysis screen
Can choose fields or field types from drop-down, use types as we have dynamic fields
Can also test analysis vs search and highlight the matches
Test search with Admin UI and Splainer with eDisMax enabled and Thérèse against different set
of Query Fields (qf)
Default search (qf=_text_)
GivenName
GivenName _text_
GivenName^10 _text_
GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_
DEMO
Testing multiple representations
Slide 25
Original search URL: http://...:8983/solr/tinydir/select?defType=edismax&fl=.....
The good parameter set:
defType=edismax
qf=GivenName_exact^15 GivenName^10 GivenName_ph^5% _text_
fl=GivenName Surname Company StreetAddress City CountryFull
Lock it in a dedicated request handler in solrconfig.xml
<requestHandler name="/namesearch" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">_text_</str>
<str name="echoParams">all</str>
<str name="defType">edismax</str>
<str name="qf">GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_</str>
<str name="fl">GivenName Surname Company StreetAddress City CountryFull</str>
</lst>
</requestHandler>
Now: http://...:8983/solr/tinydir/namesearch?q=Thérèse
DEMO
Simplify API usage
Slide 26
Based on previous work with Thai language: https://github.com/arafalov/solr-thai-test
Needs ICU libraries in solrconfig.xml
 <lib path="../../../contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-7.3.0.jar" />
<lib path="../../../contrib/analysis-extras/lib/icu4j-59.1.jar" />
Field, type, and copyField definition in managed-schema:
<fieldType name="text_ru_en" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUTransformFilterFactory" id="ru-en" />
<filter class="solr.BeiderMorseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.BeiderMorseFilterFactory" />
</analyzer>
</fieldType>
<field name="GivenName_ruen" type="text_ru_en" indexed="true" stored="false"/>
<copyField source="GivenName" dest="GivenName_ruen"/>
Reload, reindex
Search
 GivenName:Zahar
 GivenName_ruen:Zahar
And BOOM!
Bonus magic
Slide 27
Rapid
Solr Schema
Development
Alexandre Rafalovitch (@arafalov)
Apache Solr Committer
Montreal Solr/ML meetup May 2018

More Related Content

What's hot

Solr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksSolr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksErik Hatcher
 
Apache Solr + ajax solr
Apache Solr + ajax solrApache Solr + ajax solr
Apache Solr + ajax solrNet7
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conferenceErik Hatcher
 
An Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache SolrAn Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache SolrLucidworks (Archived)
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrJayesh Bhoyar
 
Using Apache Solr
Using Apache SolrUsing Apache Solr
Using Apache Solrpittaya
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature PreviewYonik Seeley
 
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEnterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEcommerce Solution Provider SysIQ
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered LuceneErik Hatcher
 
Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query ParsingErik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 

What's hot (20)

Solr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksSolr Indexing and Analysis Tricks
Solr Indexing and Analysis Tricks
 
Apache Solr + ajax solr
Apache Solr + ajax solrApache Solr + ajax solr
Apache Solr + ajax solr
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conference
 
An Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache SolrAn Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache Solr
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Using Apache Solr
Using Apache SolrUsing Apache Solr
Using Apache Solr
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Solr workshop
Solr workshopSolr workshop
Solr workshop
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature Preview
 
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEnterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered Lucene
 
Solr Flair
Solr FlairSolr Flair
Solr Flair
 
Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query Parsing
 
Apache Solr
Apache SolrApache Solr
Apache Solr
 
Solr Presentation
Solr PresentationSolr Presentation
Solr Presentation
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 

Similar to Rapid Solr Schema Development (Phone directory)

New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache SolrEdureka!
 
Tuning and optimizing webcenter spaces application white paper
Tuning and optimizing webcenter spaces application white paperTuning and optimizing webcenter spaces application white paper
Tuning and optimizing webcenter spaces application white paperVinay Kumar
 
Drupal Efficiency - Coding, Deployment, Scaling
Drupal Efficiency - Coding, Deployment, ScalingDrupal Efficiency - Coding, Deployment, Scaling
Drupal Efficiency - Coding, Deployment, Scalingsmattoon
 
Open Source Content Management Systems
Open Source Content Management SystemsOpen Source Content Management Systems
Open Source Content Management SystemsMatthew Turland
 
Drupal Efficiency using open source technologies from Sun
Drupal Efficiency using open source technologies from SunDrupal Efficiency using open source technologies from Sun
Drupal Efficiency using open source technologies from Sunsmattoon
 
Simplify your professional web development with symfony
Simplify your professional web development with symfonySimplify your professional web development with symfony
Simplify your professional web development with symfonyFrancois Zaninotto
 
Intro to-html-backbone
Intro to-html-backboneIntro to-html-backbone
Intro to-html-backbonezonathen
 
Using and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted WatersUsing and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted WatersVladimir Pavlov
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache SolrEdureka!
 
Crash Course HTML/Rails Slides
Crash Course HTML/Rails SlidesCrash Course HTML/Rails Slides
Crash Course HTML/Rails SlidesUdita Plaha
 
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013Mack Hardy
 
Enterprise search in_drupal_pub
Enterprise search in_drupal_pubEnterprise search in_drupal_pub
Enterprise search in_drupal_pubdstuartnz
 
Red5workshop 090619073420-phpapp02
Red5workshop 090619073420-phpapp02Red5workshop 090619073420-phpapp02
Red5workshop 090619073420-phpapp02arghya007
 
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache CassandraMovile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache CassandraDataStax Academy
 
Cassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of SeasonsCassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of SeasonsEiti Kimura
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 

Similar to Rapid Solr Schema Development (Phone directory) (20)

New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache Solr
 
Tuning and optimizing webcenter spaces application white paper
Tuning and optimizing webcenter spaces application white paperTuning and optimizing webcenter spaces application white paper
Tuning and optimizing webcenter spaces application white paper
 
Drupal Efficiency - Coding, Deployment, Scaling
Drupal Efficiency - Coding, Deployment, ScalingDrupal Efficiency - Coding, Deployment, Scaling
Drupal Efficiency - Coding, Deployment, Scaling
 
Open Source Content Management Systems
Open Source Content Management SystemsOpen Source Content Management Systems
Open Source Content Management Systems
 
Drupal Efficiency using open source technologies from Sun
Drupal Efficiency using open source technologies from SunDrupal Efficiency using open source technologies from Sun
Drupal Efficiency using open source technologies from Sun
 
Lightweight web frameworks
Lightweight web frameworksLightweight web frameworks
Lightweight web frameworks
 
Ruby On Rails
Ruby On RailsRuby On Rails
Ruby On Rails
 
Simplify your professional web development with symfony
Simplify your professional web development with symfonySimplify your professional web development with symfony
Simplify your professional web development with symfony
 
Intro to-html-backbone
Intro to-html-backboneIntro to-html-backbone
Intro to-html-backbone
 
Using and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted WatersUsing and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted Waters
 
Dn D Custom 1
Dn D Custom 1Dn D Custom 1
Dn D Custom 1
 
Dn D Custom 1
Dn D Custom 1Dn D Custom 1
Dn D Custom 1
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache Solr
 
Crash Course HTML/Rails Slides
Crash Course HTML/Rails SlidesCrash Course HTML/Rails Slides
Crash Course HTML/Rails Slides
 
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013
 
Enterprise search in_drupal_pub
Enterprise search in_drupal_pubEnterprise search in_drupal_pub
Enterprise search in_drupal_pub
 
Red5workshop 090619073420-phpapp02
Red5workshop 090619073420-phpapp02Red5workshop 090619073420-phpapp02
Red5workshop 090619073420-phpapp02
 
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache CassandraMovile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
 
Cassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of SeasonsCassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of Seasons
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 

Recently uploaded

JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profileakrivarotava
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Anthony Dahanne
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 

Recently uploaded (20)

JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 

Rapid Solr Schema Development (Phone directory)

  • 1. Rapid Solr Schema Development Alexandre Rafalovitch (@arafalov) Apache Solr Committer Montreal Solr/ML meetup May 2018
  • 2. Phone directory - content Names, often from multiple cultures Addresses Phone numbers Company/Group Locations Other fun data I use https://www.fakenamegenerator.com/ for demos  Can generate bulk entries in csv, tab-separated, sql, etc  Many fields, languages, regions  Warning: comes with an – invisible – byte order mark Slide 2
  • 3. Today's exploration Solr 7.3 (latest) The smallest learning schema/configuration required Rapid schema evolution workflow Free-form and fielded user entry Dealing with multiple languages Dealing with alternative name spellings Searching phone numbers by any-length suffix Configuring Solr to simplify API interface (Bonus points) Fit into 40 minutes presentation! Slide 3
  • 4. Today's dataset http://www.fakenamegenerator.com/ - Bulk request (20000 identities) – Free and configurable! Name sets: American, Arabic, Australian, Chinese, French, Hispanic, Polish, Russian, Russian (Cyrillic), Thai Countries: Australia, Canada, France, Poland, Spain, United Kingdom, United States Age range: 19 - 85 years old Gender: 50% male, 50% female Fields: id,Gender,NameSet,Title,GivenName,MiddleInitial,Surname,StreetAddress,City,StateFull,ZipCod e,CountryFull,EmailAddress,Username,TelephoneNumber,TelephoneCountryCode,Birthday,Age,T ropicalZodiac,Color,Occupation,Company,BloodType,Kilograms,Centimeters,GUID,Latitude,Longi tude Renamed first field (Number) to id to fit Solr's naming convention Removed BOM (in Vim, :set nobomb) Slide 4
  • 5. First try – Solr's built in schema bin/solr start – standalone (non-clustered) server with no initial collections bin/solr create -c demo1 – uses default configset, with 'schemaless' mode, not for production Starts with 4 fields (id, _text_, _version_, _root_) Auto-creates the rest on first occurance bin/post -c demo1 ../dataset.csv auto-detect content type from extension can bulk upload files see techproducts shipped example bin/solr start –e techproducts For one file, can also do via Admin UI DEMO Slide 5
  • 6. Schemaless schema – lessons learned Imported 1 record Failed on the second one, because ZipCode was detected as a number Can fix that by explicit configuration and rebuilding – see films example (example/films/README.txt) Other issues Dual fields for text and string Everything multivalued – because "just in case" – No sorting, API is messier, etc Many large files managed-schema: 546 lines (without comments) solrconfig.xml: 1364 lines (with comments) Plus another 42 configuration files, mostly language stopwords Home work to get this working – not enough time today Slide 6
  • 7. Learning schema managed-schema: start from nearly nothing – add as needed solrconfig.xml: start from nearly all defaults – Most definitely NOT production ready Not SolrCloud ready – add those as you scale No extra field types – add as you need them How small can we go?!? Based on exploration done for my presentation at Lucene/Solr Revolution 2016 https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution- 2016 (slides and video) https://github.com/arafalov/solr-deconstructing-films-example - repo A bit out of date – schemaless mode was tuned since Today's version uses latest Solr feature https://github.com/arafalov/solr-presentation-2018-may/commits/master (changes commit- by-commit) Slide 7
  • 8. Learning schema – managed-schema <?xml version="1.0" encoding="UTF-8"?> <schema name="smallest-config" version="1.6"> <field name="id" type="string" required="true" indexed="true" stored="true" /> <field name="_text_" type="text_basic" multiValued="true" indexed="true" stored="false" docValues="false"/> <dynamicField name="*" type="text_basic" indexed="true" stored="true"/> <copyField source="*" dest="_text_"/> <uniqueKey>id</uniqueKey> <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/> <fieldType name="text_basic" class="solr.SortableTextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> </schema> Slide 8
  • 9. Learning schema – solrconfig.xml <?xml version="1.0" encoding="UTF-8" ?> <config> <luceneMatchVersion>7.3.0</luceneMatchVersion> <requestHandler name="/select" class="solr.SearchHandler"> <lst name="defaults"> <str name="df">_text_</str> <str name="echoParams">all</str> </lst> </requestHandler> </config> Slide 9
  • 10. 2 files, 33 lines combined, including blanks – but Will It Blend Search? bin/solr create -c tinydir -d ../configs/smallest/ - provide custom config files to the collection bin/post -c tinydir ../dataset.csv – Remember the BOM and renaming column Number->id Does it search? General search? Case-insensitive search? Range search: Centimeters:[* TO 99] Fielded search? Facet? Sort? Are ids preserved? Are individual fields easy to work with (fl, etc)? DEMO Learning schema – create and index Slide 10
  • 11. It works! And ready to start being used from other parts of the project Do NOT expose Solr directly to the Internet. Not until you are a Solr Wizard, the Gray. managed-schema file has NOT changed – because of dynamicField Still 21 lines Would still keep the comments Would still preserve field/type definitions Will change on first AdminUI/API modification – gets rewritten What else? Actual search-engine tuning! Special cases Numerics – e.g. for Range search Spatial search – e.g. for Mapping/distance ranking Multivalued fields Dates Special parsing (e.g. names/surnames) Useful telephone number search Relevancy tuning! Learning schema - conclusion Slide 11
  • 12. Several possibilities Admin UI Delete schema field Add schema field with new definition Reindex Sometimes causes docValue-related exception, have to rebuild collection from scratch Schema API (Admin UI uses a subset of it) See: https://lucene.apache.org/solr/guide/7_3/schema-api.html Also has Replace a Field Also has Add/Delete Field Type Great to use programmatically or with something like Postman (https://www.getpostman.com/) Edit schema/solrconfig.xml directly and reload the collection Not recommended for production, but OK with a single server/single developer Remember to edit actual scheme not the original config one ◦ Check "Instance" location in Admin UI, in collections' Overview screen Remember that in SolrCloud mode, the config files are NOT on disk (they are in ZooKeeper). Evolving schema Slide 12
  • 13. Numeric fields  Age – int  Centimeters (height?) – int  Kilograms – float Copy missing field types (pint, pfloat) from solr-7.3.0/server/solr/configsets/_default/conf/managed-schema Map numeric fields explicitly Delete content due to radical storage needs change  bin/post -c tinydir -format solr -d "<delete><query>*:*</query></delete>" Reload the core in Admin UI's Core Admin (menu is different in SolrCloud mode) Index again  bin/post -c tinydir ../dataset.csv New queries  facet=true&facet.range=Age&facet.range.start=0&facet.range.end=200&facet.range.gap=10  Centimeters:[* TO 99] (again) DEMO Evolving schema – add numeric fields Slide 13
  • 14. Solr supports extensive spatial search https://lucene.apache.org/solr/guide/7_3/spatial-search.html bounding-box with different shapes (circles, polygons, etc) distance limiting or boosting different options with different functionalities LatLonPointSpatialField SpatialRecursivePrefixTreeFieldType BBoxField All require combined Lat Lon coordinates (lat,lon) We are providing separate Latitude and Longitude fields – need to merge them with a comma Let's copy a field type and create a field: <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" geo="true" distErrPct="0.025" maxDistErr="0.001" distanceUnits="kilometers" /> <field name="location" type="location_rpt" indexed="true" stored="true" /> Remember to reload – no need to delete, as it is a new field Next, need to also give merge instructions with an Update Request Processor Evolving schema – spatial search Slide 14
  • 15. Update Request Processors Deal with the data before it touches the schema Can do pre-processing magic with many, many processors See: https://lucene.apache.org/solr/guide/7_3/update-request-processors.html See: http://www.solr-start.com/info/update-request-processors/ (mine) Some are more magical then others and have shortcuts, e.g. TemplateUpdateProcessorFactory All can be configured with chains in solrconfig.xml and apply explicitly or by default That's how the schemaless mode works (default chain in solrconfig.xml of _default configset) Also check the way dates are parsed in it, search for parse-date – can be used standalone IgnoreFieldUpdateProcessorFactory could be useful to drop fields we don't want Solr to process at all (including in collect-all _text_ field) Let's reindex everything using the template to populate the new field: bin/post -c tinydir -params "processor=template&template.field=location:{Latitude},{Longitude}" ../dataset.csv Query: q=*:*&rows=1& fq={!geofilt sfield=location}& pt=45.493444, -73.558154&d=100& facet=on&facet.field=City&facet.mincount=1 DEMO Evolving schema – URPs Slide 15
  • 16. Search for John and look at the phone numbers (q=John&fl=TelephoneNumber): 03.99.56.91.63 (08) 9435 3911 79 196 65 43 306-724-3986 Can we search that? TelephoneNumber:3911 – yes TelephoneNumber:"65 43" – sort of (need to quote or know these are together) TelephoneNumber:3986 – sort of: some at the end, some at middle Use Case: Just search the last digits (suffix) regardless of formatting We have MANY analyzers, tokenizers, and character and token filters to help us with it https://lucene.apache.org/solr/guide/7_3/understanding-analyzers-tokenizers-and-filters.html http://www.solr-start.com/info/analyzers/ (mine) Evolving schema – phone numbers Slide 16
  • 17. Let's define a super-custom field type: <fieldType name="phone" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.KeywordTokenizerFactory" /> <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])" replacement="" replace="all"/> <filter class="solr.ReverseStringFilterFactory"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.KeywordTokenizerFactory" /> <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])" replacement="" replace="all"/> <filter class="solr.ReverseStringFilterFactory"/> </analyzer> </fieldType> Notice Asymmetric analyzers Reversing the string to make it end-digits starts digit (make sure that's symmetric!) Edge n-grams (3-30 character substrings) - makes the index larger, but the search very fast Evolving schema – digits-only type Slide 17
  • 18. Remap TelephoneNumber to it <field name="TelephoneNumber" type="phone" indexed="true" stored="true" /> And reindex (don't forget our speed hack' for now): bin/post -c tinydir -params "processor=template&template.field=location:{Latitude},{Longitude }" ../dataset.csv Check terms in Admin UI Schema screen and do our test searches TelephoneNumber:3911 TelephoneNumber:"65 43" TelephoneNumber:6543 TelephoneNumber:3986 DEMO Evolving schema – digits-only type - cont Slide 18
  • 19. Many languages have accents on letters Frédéric, Thérèse, Jérôme Many users can't be bothered to type them Sometimes, they don't even know how to type them Łódź, Kędzierzyn-Koźle Can we just ignore the accents when we search? Several ways, but let's use the simplest by insert a filter into the text_basic type definition <filter class="solr.ASCIIFoldingFilterFactory" /> Before the LowerCaseFilterFactory Reload the collection and reindex – because the filter is symmetric (affects indexing) Search without accents, general or fielded Lodz, Frederic, Therese, GivenName:jerome DEMO Evolving schema – collapsing accents Slide 19
  • 20. What are similar names to 'Alexandre': q=GivenName:Alexandre~2& facet=on&facet.field=GivenName&facet.mincount=1 Alexander, Alexandra, Alexandrin, Leixandre, Alexandre, Alexandrie We can't ask the user to enter arcane Solr syntax Let's do a phonetic search instead Bunch of different ways, each with its own tradeoffs PhoneticFilterFactory, BeiderMorseFilterFactory, DaitchMokotoffSoundexFilterFactory, DoubleMetaphoneFilterFactory,.... https://lucene.apache.org/solr/guide/7_3/phonetic-matching.html Best to have one - or several - separate Field Type definitions with a copy field Allows to experiment Allows to trigger them at different times (e.g. in advanced search, but not general one) Allows to tune them for relevancy by assign different weights Evolving schema – Names and Surnames Slide 20
  • 21. How do we actually search multiple fields at once? We've been using the default 'lucene' query parser so far on either _text_ or specific field Solr has MANY parsers General: "lucene", DisMax, Extended DisMax (edismax) Specialized: Block Join, Boolean, Boost, Collapsing, Complex Phrase, Field, Filters, Function, Function Range, Graph, Join, Learning to Rank, .....  https://lucene.apache.org/solr/guide/7_3/other-parsers.html We already used Spatial geofilt query parser: fq={!geofilt sfield=location} edismax allows to search against multiple fields, with different weights, boosts, ties, minimum- match specifications, etc Choose with defType=edismax or {edismax param=value param=value}search_string Let's search for "George Brown" against (qf) "GivenName Surname Company StreetAddress City" and display same fields only DEMO Try using http://splainer.io/ to review the results Try with qf=GivenName^5 Surname^5 Company StreetAddress City Side-trip into eDisMax and query parsers Slide 21
  • 22. Result: 149 records, but all over the field values Enter RELEVANCY Recall – did we find all documents? Precision – did we find just the documents we needed Recall and Precision – fight. Perfect recall is q=*:* ...... Ranking – First hit is very important, ones after that less so (not always) Side note: Field sorting destroys ranking. We were optimizing Recall Dump everything into _text_ and let search sort it out Optimizing for Precision may seem easy too Under eDisMax, set mm=100% DEMO eDisMax exploration continues Slide 22
  • 23. It is a business decision what Precision and Recall mean for your use case Often "find more just in case" and focus on "ranking better" is the right approach Try qf=GivenName^5 Surname^5 Company StreetAddress City (no mm) qf=GivenName^5 Surname^5 Company StreetAddress City and mm=100% qf=GivenName^5 Surname^5 _text_ and mm=100% DEMO in Splainer Relevancy business case for our names (GivenName, Surname) UPPER/lower case does not matter Exact spelling (with accents) matches best – new Field Type needed (actually original text_basic...) Accent-free spelling matches next – existing text_basic and therefore dynamic field match is fine Phonetic spelling matches lowest (but higher than fallback _text_ field) – new Field Type needed eDisMax for ranking Slide 23
  • 24. <fieldType name="text_exact" class="solr.SortableTextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> <fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/> </analyzer> </fieldType> <field name="GivenName_exact" type="text_exact" indexed="true" stored="false"/> <field name="Surname_exact" type="text_exact" indexed="true" stored="false"/> <field name="GivenName_ph" type="text_phonetic" indexed="true" stored="false"/> <field name="Surname_ph" type="text_phonetic" indexed="true" stored="false"/> <copyField source="GivenName" dest="GivenName_exact"/> <copyField source="GivenName" dest="GivenName_ph"/> <copyField source="Surname" dest="Surname_exact"/> <copyField source="Surname" dest="Surname_ph"/> Multiple fields for same content Slide 24
  • 25. Our test cases Frédéric, Thérèse, Jérôme Check different analysis in Admin UI's Analysis screen Can choose fields or field types from drop-down, use types as we have dynamic fields Can also test analysis vs search and highlight the matches Test search with Admin UI and Splainer with eDisMax enabled and Thérèse against different set of Query Fields (qf) Default search (qf=_text_) GivenName GivenName _text_ GivenName^10 _text_ GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_ DEMO Testing multiple representations Slide 25
  • 26. Original search URL: http://...:8983/solr/tinydir/select?defType=edismax&fl=..... The good parameter set: defType=edismax qf=GivenName_exact^15 GivenName^10 GivenName_ph^5% _text_ fl=GivenName Surname Company StreetAddress City CountryFull Lock it in a dedicated request handler in solrconfig.xml <requestHandler name="/namesearch" class="solr.SearchHandler"> <lst name="defaults"> <str name="df">_text_</str> <str name="echoParams">all</str> <str name="defType">edismax</str> <str name="qf">GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_</str> <str name="fl">GivenName Surname Company StreetAddress City CountryFull</str> </lst> </requestHandler> Now: http://...:8983/solr/tinydir/namesearch?q=Thérèse DEMO Simplify API usage Slide 26
  • 27. Based on previous work with Thai language: https://github.com/arafalov/solr-thai-test Needs ICU libraries in solrconfig.xml  <lib path="../../../contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-7.3.0.jar" /> <lib path="../../../contrib/analysis-extras/lib/icu4j-59.1.jar" /> Field, type, and copyField definition in managed-schema: <fieldType name="text_ru_en" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.ICUTokenizerFactory"/> <filter class="solr.ICUTransformFilterFactory" id="ru-en" /> <filter class="solr.BeiderMorseFilterFactory" /> </analyzer> <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory" /> <filter class="solr.LowerCaseFilterFactory" /> <filter class="solr.BeiderMorseFilterFactory" /> </analyzer> </fieldType> <field name="GivenName_ruen" type="text_ru_en" indexed="true" stored="false"/> <copyField source="GivenName" dest="GivenName_ruen"/> Reload, reindex Search  GivenName:Zahar  GivenName_ruen:Zahar And BOOM! Bonus magic Slide 27
  • 28. Rapid Solr Schema Development Alexandre Rafalovitch (@arafalov) Apache Solr Committer Montreal Solr/ML meetup May 2018

Editor's Notes

  1. Line 205-206 facet=true&facet.range=Age&facet.range.start=0&facet.range.end=200&facet.range.gap=10
  2. http://localhost:8983/solr/tinydir/select?rows=1&d=100&facet.field=City&facet=on&fq={!geofilt%20sfield=location}&pt=45.493444,%20-73.558154&q=*:*&facet.mincount=1
  3. TelephoneNumber:3911 – yes TelephoneNumber:"65 43" – sort of (need to quote or know these are together) TelephoneNumber:3986
  4. Frédéric, Thérèse, Jérôme Łódź, Kędzierzyn-Koźle
  5. Thérèse