Crowdsourcing solutions can help extract information from disaster-related data during crisis management. However, certain information can only be obtained through similarity operations, and some of it also depends on additional data stored in a Relational Database Management System (RDBMS). In this context, several works focus on data-supported crisis management; nevertheless, none of them provides a methodology for employing a similarity-enabled RDBMS in disaster-relief tasks. To fill this gap, we introduce a similarity-enabled methodology together with a supporting architecture named Data-Centric Crisis Management (DCCM), which employs our methods over an RDBMS. We evaluate our proposal through three tasks: classification of incoming data regarding current events, identifying relevant information to guide rescue teams; filtering of incoming data, enhancing decision support by removing near-duplicate data; and similarity retrieval of historical data, supporting analytical comprehension of the crisis context. To make this possible, similarity-based operations were implemented within a popular, open-source RDBMS. Results using real data from Flickr show that the proposed methodology over DCCM is feasible for real-time applications. In addition to high performance, accurate results were obtained with a proper combination of techniques for each task. Finally, given its accuracy and efficiency, we expect our work to provide a framework for further developments in crisis management solutions.
On the Support of a Similarity-Enabled Relational Database Management System in Civilian Crisis Situations
1. Luiz Olmes (speaker)
Paulo H. Oliveira, Antonio C. Fraideinberze, Natan A. Laverde,
Hugo Gualdron, Andre S. Gonzaga, Lucas D. Ferreira,
Willian D. Oliveira, Jose F. Rodrigues Jr., Robson L. F. Cordeiro,
Caetano Traina Jr., Agma J. M. Traina, Elaine P. M. Sousa
pholiveira@usp.br
3. Introduction – RESCUER project
The RESCUER project, a partnership between the
European Union and Brazil, aims at developing
solutions to improve decision-making in crises
Further details: http://www.rescuer-project.org/
4. Introduction
Multimedia data
Support decision-making during crises
Automatic analysis on multimedia data
Concepts related to similarity search
Gap
No well-defined methodology for applying similarity search to crisis situations
5. Introduction
Contributions
1. Methodology for employing a similarity-enabled RDBMS in disaster-relief tasks
2. Data-Centric Crisis Management (DCCM)
architecture, which employs our methodology
over a similarity-enabled RDBMS
6. Introduction
Evaluation of the contributions
Task 1. Objects Classification
Task 2. Redundant Objects Filtering
Task 3. Retrieval of Historical Data
31. Case Study – Dataset
Flickr-Fire (Bedo et al., 2015)
1000 images containing fire
1000 images not containing fire
80 images for the Filtering task
32. Case Study
Similarity support on PostgreSQL
Our own implementation: Kiara
Recalling the methodology tasks
1. Objects Classification
2. Redundant Objects Filtering
3. Retrieval of Historical Data
36. Results – Redundant Objects Filtering
Buffer: 80 images
43 of one class and 37 of another class
Range queries
Range = 10
Feature Extractor: Perceptual Hash
http://www.phash.org/
Evaluation Function: Hamming Distance
38. Results – Retrieval of Historical Data
Color Structure Descriptor and
Manhattan Distance (Bedo et al., 2015)
Range queries
Range = 7.2
K-NN queries
k = 1000
42. Results – Performance (Tasks)
Task                                        Average Time (s)
1) Objects Classification                   0.851
2) Redundant Objects Filtering              0.057
3a) Retrieval of Historical Data – Range    1.147
3b) Retrieval of Historical Data – kNN      0.849
45. Conclusions
Methodology for employing similarity-enabled
RDBMS on crisis management
The Data-Centric Crisis Management (DCCM)
architecture, based on such methodology
Our methodology follows 3 tasks
Objects Classification
Redundant Objects Filtering
Retrieval of Historical Data
46. Conclusions
By employing proper similarity techniques
(e.g. Feature Extractors, Evaluation Functions)
to the crisis context
Accurate response
Efficient response
The DCCM architecture enables the use of
well-known cutting-edge methods and
technologies to aid in a critical scenario
47. On the Support of a Similarity-Enabled
Relational Database Management System
in Civilian Crisis Situations
Thank you for your attention!
Luiz Olmes (speaker)
Paulo H. Oliveira, Antonio C. Fraideinberze, Natan A. Laverde,
Hugo Gualdron, Andre S. Gonzaga, Lucas D. Ferreira,
Willian D. Oliveira, Jose F. Rodrigues Jr., Robson L. F. Cordeiro,
Caetano Traina Jr., Agma J. M. Traina, Elaine P. M. Sousa
pholiveira@usp.br
Editor's Notes
I am going to present the work entitled “On the Support of a Similarity-enabled Relational Database Management System in Civilian Crisis Situations”, developed by my colleagues from the University of São Paulo.
Here are the topics to be presented.
Starting with the introduction.
This work reports on the RESCUER project, a partnership between the European Union and Brazil, which aims at developing solutions to improve decision-making in crises.
This project involves analyzing multimedia data to support control centers in decision-making during crises.
To perform automatic analysis tasks on multimedia data, concepts related to similarity search must be employed.
There is a gap in this context, which is the lack of a well-defined methodology for applying similarity search to crisis situations.
Our work aims at filling that gap.
Here are our contributions.
The first one: a methodology for employing a similarity-enabled RDBMS in disaster-relief tasks.
The second one: the Data-Centric Crisis Management (DCCM) architecture, which employs our methodology over a similarity-enabled RDBMS.
We evaluate our contributions through 3 tasks.
Task 1: Objects Classification, identifying relevant information from disaster-related data to guide rescue teams.
Task 2: Redundant Objects Filtering, enhancing the decision support by removing near-duplicate data.
Task 3: Retrieval of Historical Data, supporting analytical comprehension of the crisis context.
Moving on to the background.
Here are the topics covered in the background.
Similarity Search is part of a process known as Content-Based Retrieval, depicted in this figure.
The User sends in a query, which contains a sample of a Complex Object of any multimedia type in its raw form.
Then, a Feature Extractor transforms the raw data into a meaningful representation, called Feature Vector.
Next, the Similarity Query accesses the database and compares the sample Feature Vector to the data stored.
It does so by means of an Evaluation Function, which could be, for instance, the Euclidean Distance.
The return value of an Evaluation Function represents the dissimilarity degree of a pair of Complex Objects.
Based on the values returned by the Evaluation Function, the Similarity Query builds the Retrieval Results, which are the most similar elements to the sample provided.
Finally, the Retrieval Results are returned to the User.
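The retrieval pipeline just described can be sketched in a few lines. This is only a minimal illustration, not the actual RESCUER/DCCM implementation: the 4-bin histogram extractor stands in for a real Feature Extractor such as the Color Structure Descriptor, and all names here are hypothetical.

```python
import math

# Toy Feature Extractor: a 4-bin gray-level histogram stands in for a
# real extractor (e.g. the Color Structure Descriptor).
def extract_features(pixels):
    histogram = [0.0] * 4
    for value in pixels:            # pixels: iterable of gray levels 0..255
        histogram[value // 64] += 1.0
    total = sum(histogram) or 1.0
    return [h / total for h in histogram]

# Evaluation Function: Euclidean distance; the return value represents
# the dissimilarity degree of a pair of Complex Objects.
def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Similarity Query: rank the stored objects by distance to the sample
# Feature Vector and return them as the Retrieval Results.
def retrieve(sample_vector, database):
    return sorted(database, key=lambda obj: euclidean(obj["vector"], sample_vector))
```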
Here are the most common types of similarity query.
The k-NN query retrieves the k nearest neighbors to the sample element, represented by "s_q".
The Range query retrieves all elements whose distance from "s_q" is less than or equal to a given range value, represented by "xi" in the figure.
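Both query types reduce to distance computations over the stored elements. A minimal sketch, where the distance function is a placeholder for whichever Evaluation Function the application uses:

```python
def knn_query(sq, database, distance, k):
    # k-NN query: the k elements nearest to the sample element s_q.
    return sorted(database, key=lambda s: distance(s, sq))[:k]

def range_query(sq, database, distance, xi):
    # Range query: every element whose distance to s_q is <= the range xi.
    return [s for s in database if distance(s, sq) <= xi]
```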
Instance-Based Learning comprehends a class of supervised learning algorithms.
One of the main IBL algorithms is the k-NN Classifier.
It retrieves the k most similar instances regarding the element being classified.
Then, the algorithm predicts the label based on the retrieved instances, assigning the label of the prevailing class.
In this example, the element would be labeled as red.
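The majority vote just described can be sketched as follows; representing the training data as (feature vector, label) pairs is an illustrative assumption, not the system's actual data layout:

```python
from collections import Counter

def knn_classify(sample, labeled_data, distance, k):
    # labeled_data: list of (feature_vector, label) pairs.
    neighbors = sorted(labeled_data, key=lambda pair: distance(pair[0], sample))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]   # label of the prevailing class
```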
Moving on to our proposal, the DCCM architecture. DCCM stands for Data-Centric Crisis Management.
This figure presents the scenario of a crisis situation supported by DCCM.
In a Crisis Situation, eyewitnesses can collect data regarding the event, which can be, for instance, pictures.
The data are sent to a Crowdsourcing System and redirected as a stream to DCCM.
Then, the control center can query DCCM for the Decision-Making process.
Here, we have the submodules that compose the DCCM architecture.
I draw your attention to the Historical Records submodule, which is a database containing disaster-related information from past events.
In a crisis situation, we assume a crowdsourcing system that receives data and submits them to DCCM (arrow A1).
In this example, the data are images.
Each object of the stream is placed in a Buffer to be analyzed by the Filtering Engine.
Such analysis occurs over the feature vectors of the data, which are extracted by the Similarity Engine (arrow A2).
The current object can be either marked as a near duplicate or classified.
In this case, it is classified (arrows A3, A4 and A5) because it is the first arriving object.
It receives the yellow color to indicate a class it belongs to, which represents a type of crisis, such as conflagration.
The arrow A5 indicates a notification to the control center, which receives only classified elements, not near duplicates.
Here, another image arrives (arrow A1).
It has its feature vector generated by the Similarity Engine (arrow A2).
Then, it is compared to the image currently in the Buffer.
It turns out that the arriving image is a near duplicate, so it is not submitted to the Classification Engine (arrows A3, A4 and A5).
To determine whether the arriving object is a near duplicate of any object in the Buffer, DCCM makes use of the Range query.
The range value is predefined according to the type of crisis and to the Feature Extractor employed by the Similarity Engine.
DCCM poses a Range query over the elements in the Buffer, the arriving one being the query element, and checks the results.
If at least one element in the Buffer is retrieved by this query, then the arriving element is a near duplicate.
Otherwise, the arriving element can be classified.
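The filtering decision above can be outlined in a few lines. This is an illustrative sketch rather than the DCCM code: `classify` stands for the Classification Engine and `xi` for the predefined range value.

```python
def is_near_duplicate(arriving, buffer, distance, xi):
    # Range query over the Buffer: any hit marks the object a near duplicate.
    return any(distance(obj, arriving) <= xi for obj in buffer)

def process(arriving, buffer, distance, xi, classify):
    # Near duplicates stay in the Buffer but skip classification and
    # notification; non-duplicate objects are classified first.
    if is_near_duplicate(arriving, buffer, distance, xi):
        buffer.append(arriving)
        return None
    label = classify(arriving)
    buffer.append(arriving)
    return label
```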
One more image arrives (arrow A1).
In this case, the arriving image is not a near duplicate of the images currently in the Buffer.
Therefore, it is submitted to the Classification Engine (arrows A3, A4 and A5).
It receives the green color, representing the class of another type of crisis.
After a while...
We have a couple of images in the Buffer.
Note that only one image of each class is a classified element.
The remaining ones are near duplicates.
It is time to clean our buffer.
The moment to do so is predefined according to either a limit size, e.g. number of elements, or a time window.
Now, the data is pulled out of the Buffer to be submitted to the Representative Selector (arrow B1).
This submodule is responsible for updating the representative object of each class, i.e. the object that best depicts that type of crisis.
In the example, the current representative objects are the ones in the second row.
The classified ones are the default representatives.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Suggestion: save this comment for the Q&A instead of mentioning it on this slide already.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The criteria to select the representatives are also predefined by specialists.
For example, it could be the image with better resolution or the image that best depicts the type of crisis according to some clustering algorithm.
A better option could be to enhance the DCCM architecture so it becomes able to tune the criteria automatically, based on the current types of crisis.
After the execution of the Representative Selector, the representative of the yellow class remains the same.
However, the representative of the green class has changed.
Finally, all the elements are inserted into the Historical Records database, providing an improved dataset for the Classification and Historical Retrieval engines.
Moreover, the control center is notified of the updated representatives, providing it with the information that best represents the current crises.
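One way to realize the Representative Selector, under the clustering-style criterion mentioned above, is to pick the medoid of each class, i.e. the object minimizing the total distance to its classmates. This is just one possible criterion for illustration, not the one fixed by the architecture:

```python
def select_representative(class_objects, distance):
    # Medoid: the object with the smallest summed distance to the
    # other objects of the same class.
    return min(class_objects,
               key=lambda o: sum(distance(o, other) for other in class_objects))
```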
Finally, another possible use case for the DCCM architecture corresponds to the Retrieval of Historical Data (arrows C1, C2 and C3).
The Decision-Making Team might be interested in information from past crises similar to the current ones.
In this case, they can pose Similarity Queries to retrieve such information from the Historical Records.
Moving on to the case study.
The Flickr-Fire dataset used in this work is composed of 2,000 images, half containing fire and half not.
For the Filtering task, we used 80 images with redundant fire characteristics.
We included similarity support in the RDBMS PostgreSQL.
For performance reasons, we used our own implementation of a similarity-enabled RDBMS.
We also recall the tasks our methodology follows.
Starting with the quality results.
To classify the incoming images, we used the knowledge presented in the work of Bedo et al. (2015).
The Color Structure Descriptor and the Manhattan Distance, as Feature Extractor and Evaluation Function, are suitable for fire detection.
We applied the k-NN Classifier with k equal to 10 and performed 10-fold cross-validation.
The accuracy of the classifier was 86%.
As previously presented, the Filtering Engine includes a Buffer; in our experiments, this Buffer held 80 images.
For this task, we ran Range queries with a range value of 10.
The Perceptual Hash and the Hamming Distance were employed.
This choice was made based on existing results on near-duplicate filtering, as presented in the link at the bottom.
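A toy stand-in for this combination may help fix the idea. A real Perceptual Hash operates on scaled, frequency-transformed images; the average-threshold hash below only illustrates the bit-string-plus-Hamming-Distance scheme and is not the pHash algorithm:

```python
def average_hash(gray_values):
    # Threshold each gray value at the mean to obtain a bit string.
    mean = sum(gray_values) / len(gray_values)
    return tuple(1 if v > mean else 0 for v in gray_values)

def hamming(h1, h2):
    # Hamming Distance: number of differing bits between two hashes.
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))
```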
This is the precision-recall graph of the Redundant Objects Filtering task.
The curve’s late fall-off represents a highly effective retrieval method.
Note, also, that the precision is above 90% until 50% of recall.
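For reference, the points of such a precision-recall curve are computed by walking down the ranked results; a short sketch (the ids and the relevant set are illustrative):

```python
def precision_recall(ranked_results, relevant):
    # ranked_results: ids in retrieval order; relevant: set of relevant ids.
    hits, points = 0, []
    for rank, obj_id in enumerate(ranked_results, start=1):
        if obj_id in relevant:
            hits += 1
        points.append((hits / rank, hits / len(relevant)))  # (precision, recall)
    return points
```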
The third task is the Retrieval of Historical Data.
Again, we used the same setup as in the work of Bedo et al. (2015).
To retrieve half of the dataset and check the quality of the response, we set these parameters.
Recall that the dataset contains 2000 images.
This is the precision-recall graph of the Retrieval of Historical Data task.
One can see that the precision is around 80% when retrieving 10% of the desired class (100 images) and around 90% when retrieving 5% (50 images).
From the control-center point of view, retrieving small, precise sets of images is essential in order to take advantage of previously acquired knowledge.
Moving on to the overall performance.
This is the setup for the performance experiments.
This table presents the average time to run each of the tasks once.
The Classification of one object took about 0.851s.
The Filtering of Redundant Objects took, on average, 0.057 s.
The Retrieval of Historical Data took around 1.147 s and 0.849 s for the Range and k-NN versions, respectively.
The Range queries took a bit longer because sometimes more than half of the dataset was retrieved, depending on the query element being used.
This graph shows the time spent to store one image.
One can see that feature extraction accounts for most of the time spent.
On the other hand, the insertion time does not change with the growth in the number of instances in the database.
This shows that DCCM does not create any bottleneck in the RDBMS.
Finally, the conclusions.
We proposed a methodology for employing a similarity-enabled RDBMS on crisis management tasks, along with an architecture based on such methodology.
Our methodology follows these 3 tasks: Objects Classification, Redundant Objects Filtering and Retrieval of Historical Data.
By employing proper similarity techniques to the crisis context, one can obtain both accurate and efficient responses through our architecture.
Our architecture enables the use of well-known cutting-edge methods and technologies to aid in a critical scenario.