The document outlines an incremental indexing framework to index database tables using Apache Solr. It proposes using database views to collate relevant data from multiple tables and batch processing to periodically fetch updated records and convert them to XML documents to post to Solr. The key components are database views, a data fetcher, an XML converter, an indexer controller class, and a job scheduler that coordinates periodic indexing based on configured triggers.
5. Incremental Indexing Process (the need for Database Views) The following slides discuss an incremental indexing approach that we thought would work well for our requirements. In this approach the data relevant to the search index is collated using Database Views, and indexing is done as a Batch Process rather than in real time. First we need to understand the need for the Database Views. When a term is searched for in the index, the result page shows some details and a summary of each result. For instant results these details need to be stored in the index itself, so that we don’t have to hit the database just to display collated results on the results page. When creating the Solr index it therefore doesn't make much sense to index all the tables individually, because each table has its own dependencies on child and parent tables. We would either have to recreate those dependencies in the index or design the index around the search needs. The latter involves creating appropriate joins across tables to fetch all the data relevant to a search result in one shot. A database view can do this job of collating data from the parent and child tables in a representation that exactly matches the requirements of the search index. This keeps the application layer simple: it just picks everything from the view and indexes it as is.
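A minimal sketch of the idea, using Python's built-in sqlite3 module so that it runs without a database server. The tables (customers, orders), the view name (order_search_view) and all column names are hypothetical and are not taken from the slides; they only illustrate how a view can present joined parent and child data as one flat row per search result.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    item TEXT,
    total REAL
);
INSERT INTO customers VALUES (1, 'Asha');
INSERT INTO orders VALUES (10, 1, 'Keyboard', 49.5);

-- Hypothetical view: the join is done once, in the database,
-- and each row already has the shape of one search result.
CREATE VIEW order_search_view AS
SELECT o.id AS order_id,
       c.name AS customer_name,
       o.item AS item,
       o.total AS total
FROM orders o
JOIN customers c ON c.id = o.customer_id;
""")

# The application layer reads the view as if it were a plain table.
rows = conn.execute(
    "SELECT order_id, customer_name, item, total FROM order_search_view"
).fetchall()
print(rows)  # [(10, 'Asha', 'Keyboard', 49.5)]
```

The application code contains no join logic; if the search result needs another detail, only the view definition changes.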
6. Incremental Indexing Process (the need for Batch indexing) Next we need to understand why the Batch Indexing process can work well for us. Most of our search requirements involve searching historic data. Only rarely would we search for data that has just been entered, and even these cases can be handled by setting the Batch Process interval to a very short time. Real-time indexing can become expensive when a large amount of data is entered at short intervals. The batch process also gives us the flexibility of working on a copy of the database, which makes the whole indexing process an offline one.
7. Incremental Indexing Batch Process (the flow)
[Diagram] Components: Indexing Job Scheduler, Database Indexer (the controller class), Data Fetcher, Result Set to XML Converter, Database, Index Manager, SOLR.
Flow labels in the diagram: (1) Indexing Job Name, (2) Database View Name, (3) Query, (4) Result Set, (5) Result Set, (6) Solr XML, (7) Solr XML, (8) Solr XML, (9) Solr XML.
Components in green are explained in detail in the next slide.
Indexing Job - Trigger Config file (Indexing Job Schedules):
 - Trigger Time 1 - Indexing Job 1
 - Trigger Time 2 - Indexing Job 2
 - Trigger Time 3 - Indexing Job 3
Indexing Job – Database View Mapping file: More than one DB view might need to be indexed at the same time, so these can be grouped as an Indexing Job.
 - Indexing Job 1 – Database View 1, Database View 2, Database View 3, Database View 4
 - Indexing Job 2 – Database View 5, Database View 6
DB View Column name to Solr field mapping:
 - Database View 1: Column 1 - Solr Field 1, Column 2 - Solr Field 2, Column 3 - Solr Field 3
 - Database View 2: Column 1 - Solr Field 3, Column 2 - Solr Field 2
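To make the three configuration files on this slide concrete, here is one possible in-memory representation as plain Python dictionaries. The slides do not specify a file format, so the structure, the key names and the two helper functions below are assumptions made only for illustration.

```python
# Hypothetical equivalent of the "Indexing Job - Trigger Config file":
# each trigger is associated with one indexing job.
TRIGGER_JOBS = {
    "trigger_time_1": "indexing_job_1",
    "trigger_time_2": "indexing_job_2",
}

# Hypothetical equivalent of the "Indexing Job - Database View Mapping file":
# views that must be indexed together belong to the same job.
JOB_VIEWS = {
    "indexing_job_1": ["database_view_1", "database_view_2"],
    "indexing_job_2": ["database_view_5", "database_view_6"],
}

# Hypothetical equivalent of the "DB View Column name to Solr field mapping".
VIEW_FIELD_MAP = {
    "database_view_1": {"column_1": "solr_field_1", "column_2": "solr_field_2"},
    "database_view_2": {"column_1": "solr_field_3", "column_2": "solr_field_2"},
}


def views_for_trigger(trigger_name):
    """Resolve a trigger to the list of views its job has to index."""
    return JOB_VIEWS[TRIGGER_JOBS[trigger_name]]


def solr_field_for(view_name, column_name):
    """Look up the Solr field that a view column is indexed into."""
    return VIEW_FIELD_MAP[view_name][column_name]


print(views_for_trigger("trigger_time_1"))
print(solr_field_for("database_view_2", "column_1"))
```

Keeping the three mappings separate mirrors the slide: a view can move to a different job, or a job to a different trigger, without touching the field mapping.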
8. Incremental Indexing Batch Process (the components)
An Indexing Job is defined as the indexing of the set of Database Views that need to be indexed at the same time and at the same time interval.
A Trigger holds the timing information: the start time, the time interval and other such time-related details. When an Indexing Job is associated with a trigger, the job runs according to the start time and time interval given in the trigger.
The Indexing Job - Trigger Config file holds all Indexing Job Schedules. It maps triggers to indexing jobs.
The Indexing Job – Database View Mapping file defines the Indexing Jobs. It associates Database Views with each Indexing Job. If a database view, such as the one for the messages module, needs to be picked up at a shorter time interval than the one for the shopping module, the two views will be part of different indexing jobs with different Triggers.
The Database Indexer acts as the controller of the database indexing process. It calls the Data Fetcher to get database records in XML format and sends that XML to the Index Manager, which posts it to Solr.
The Data Fetcher communicates with the database to get all the new and updated records for a given database view, along with the records that have been marked for deletion. It then feeds this data to the Result Set to XML converter to get it converted to the XML format that Solr recognizes.
The Result Set to XML converter is a utility class which converts database records to XML format. If a record is new or updated, it is put in the <add> tag. If it is marked for deletion, it is put in the <delete> tag. The converter picks up the Solr field names corresponding to the DB View column names from the DB View Column name to Solr field mapping file.
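The converter described above could look roughly like the following sketch, written with Python's standard xml.etree.ElementTree module. The function name, the dictionary-per-row input, and the id and is_deleted column names are assumptions for illustration; the slides do not say how a view marks a record for deletion. The output uses Solr's XML update message shapes, with documents under <add> and ids under <delete>.

```python
import xml.etree.ElementTree as ET


def rows_to_solr_xml(rows, field_map, id_column="id", deleted_column="is_deleted"):
    """Convert view rows (dicts) into Solr XML update messages.

    field_map maps view column names to Solr field names.
    id_column and deleted_column are hypothetical column names.
    """
    add = ET.Element("add")
    delete = ET.Element("delete")
    for row in rows:
        if row.get(deleted_column):
            # Records marked for deletion only need their id.
            ET.SubElement(delete, "id").text = str(row[id_column])
            continue
        doc = ET.SubElement(add, "doc")
        for column, solr_name in field_map.items():
            value = row.get(column)
            if value is None:
                continue  # skip missing values rather than index "None"
            field = ET.SubElement(doc, "field", name=solr_name)
            field.text = str(value)
    # Emit only the messages that actually contain something.
    messages = []
    if len(add):
        messages.append(ET.tostring(add, encoding="unicode"))
    if len(delete):
        messages.append(ET.tostring(delete, encoding="unicode"))
    return messages


rows = [
    {"id": 1, "title": "Hello", "is_deleted": 0},
    {"id": 2, "title": "Old", "is_deleted": 1},
]
messages = rows_to_solr_xml(rows, {"id": "id", "title": "title_t"})
for message in messages:
    print(message)
# <add><doc><field name="id">1</field><field name="title_t">Hello</field></doc></add>
# <delete><id>2</id></delete>
```

Returning the add and delete messages separately is a design choice of this sketch; it lets the caller post each one to Solr as its own update request.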
9. Incremental Indexing Batch Process (the flow) The indexing process is started by the Indexing Job Scheduler. An indexing job is triggered from the Indexing Job Scheduler based on the settings of the trigger with which it is associated in the Indexing Job - Trigger Config file. The Indexing Job Scheduler calls the Database Indexer, passing the name of the job to be done as an argument. The Database Indexer acts as the controller for this whole process. From the Indexing Job – Database View Mapping file it picks up the names of the Database Views to be indexed for the Indexing Job sent by the Indexing Job Scheduler. The Database Indexer loops over the set of Database Views and calls the Data Fetcher for each View. The Data Fetcher hits the database with a query to get all the latest records from the View. The result set is sent to the Result Set to XML Converter, which returns the Solr XML. This Solr XML is sent back to the Database Indexer, which in turn sends it to the Index Manager for posting to Solr.
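The control flow described on this slide can be sketched as two small classes. The class and method names, and the use of plain callables for the database query, the converter and the Index Manager, are assumptions chosen to keep the example self-contained; a real implementation would query the database and post to Solr over HTTP.

```python
class DataFetcher:
    """Queries one view and returns its rows as Solr XML messages."""

    def __init__(self, query_view, convert):
        self.query_view = query_view  # callable: view name -> list of rows
        self.convert = convert        # callable: (view name, rows) -> list of XML strings

    def fetch_as_solr_xml(self, view_name):
        rows = self.query_view(view_name)
        return self.convert(view_name, rows)


class DatabaseIndexer:
    """Controller: resolves a job to its views and drives the fetch and post steps."""

    def __init__(self, job_views, data_fetcher, index_manager):
        self.job_views = job_views          # job name -> list of view names
        self.data_fetcher = data_fetcher
        self.index_manager = index_manager  # callable that posts one XML message

    def run_job(self, job_name):
        posted = 0
        for view_name in self.job_views[job_name]:
            for message in self.data_fetcher.fetch_as_solr_xml(view_name):
                self.index_manager(message)
                posted += 1
        return posted


# Stand-ins for the database, the converter and the Index Manager.
sent = []
fetcher = DataFetcher(
    query_view=lambda view: [{"id": 1}],
    convert=lambda view, rows: [f"<add><!-- {len(rows)} row(s) from {view} --></add>"],
)
indexer = DatabaseIndexer(
    {"indexing_job_1": ["view_a", "view_b"]}, fetcher, sent.append
)

# This call is what the Indexing Job Scheduler would make when a trigger fires.
count = indexer.run_job("indexing_job_1")
print(count, sent)
```

Because the collaborators are passed in, each piece can be replaced or tested on its own, which matches the separation of components described in the slides.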