
The Protein Regulatory Networks of COVID-19 - A Knowledge Graph Created by Elsevier

Shobhna Srivastava discusses Elsevier's research citation network. She describes how the effort to simplify an existing data processing pipeline, optimise costs, and choose the right solution to the problem opened the door to other potential use cases and innovation. Graph technology has been applied to the scientific research domain to enhance content discovery.


  1. July 2020, Shobhna Srivastava: Enhancing Search Results with Graph (Neo4j/Elsevier)
  2. Context
     ■ Elsevier is a global information & analytics business specializing in science & health
     ■ Scopus – “Expertly curated abstract & citations database”
  3. In product
  4. Problem definition
     ■ Problems
       – Doesn't allow changes or enriching documents with new data points
       – The processing is fragile
       – Costly solution
     ■ Hardware used
       – 90-node Solr indexing cluster (separate from the live search cluster)
       – Redshift
       – EC2 instances for processing
     ■ Old document enrichment pipeline
       – Index is created in Solr
       – Redshift is updated from Solr
       – New counts are calculated and diffed against the old Solr index
       – The updates are applied to the Solr index
       – Finally, the live Solr cluster is updated
  5. Bounded context
     ■ Runtime system – performance is important
     ■ Aware of starting node or nodes
     ■ Depth-first or breadth-first traversal
     ■ Metrics generation
  6. Why graph?
     ■ Classic multi-level graph traversals
     ■ Many-to-many relations on input data
     ■ Non-trivial & multi-level joins
     ■ Most enrichment is done on relationships and how data are connected to each other
  7. Technology choice: Neo4j vs Neptune
     ■ Meets QPS
       – Neo4j: ✓
       – Neptune: ⚠ Much slower with queries that require longer traversals (e.g. "rolled up" queries per organisation count: 7 ms on Neo4j vs 7 seconds on Neptune)
     ■ Scalability
       – Neo4j: ⚠ Tested with a graph size that fits into cache; with a larger graph some smarter caching should be implemented
       – Neptune: ⚠ Works fast on larger instances (supposedly because of the cache size), so with a larger graph some application-level optimisations might be required. A bit trickier than Neo4j because cache settings are not visible/configurable
     ■ Indexing
       – Neo4j: ✓
       – Neptune: ⚠ Indexes are not configurable
     ■ Transaction management
       – Neo4j: ✓
       – Neptune: ⚠ Every traversal is a single transaction; manual commit/rollback is not supported
     ■ Ease of cluster management
       – Neo4j: ✓ Out-of-the-box clustering with an enterprise licence. Unless an enterprise licence is purchased, clustering and data replication have to be handled by us
       – Neptune: ✓ Easy out-of-the-box data replication, immediate consistency
     ■ Cost
       – Neo4j: 2 r4.4xlarge instances + LB, ~1800 USD/month
       – Neptune: 2 r4.4xlarge instances + 250 GB storage (estimated based on test data), ~2015 USD/month + 0.2 USD per 1 million I/O requests (1,600 million requests made during testing alone)
  9. Relations update example
  10. Result
     ■ ~300,000,000 nodes
       – Work (article, book, chapter) – 268,419,884
       – Person (author) – 40,633,203
       – Organisation – 13,044,870
       – Journal – 227,747
     ■ ~1,000,000,000 relations
     ■ ~1,000,000 updates a day
     ■ Hardware used (from ~90+ to ~9 nodes)
       – 3 nodes (r4.4xlarge)
       – 3 nodes for data processing
       – 3 nodes for API
  11. Future work Weighted ranking Guided navigation Related entities Suggestion New links Associations
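
The slides describe metrics generation as a traversal that starts from a known node and walks several levels of relations, with a "rolled up" per-organisation count given as the example query. The sketch below illustrates that idea on a tiny in-memory graph in plain Python. It is not Elsevier's implementation and does not use Neo4j: the relation names (`HAS_MEMBER`, `AUTHORED`, `CITED_BY`), the node identifiers, and both function names are hypothetical, chosen only to mirror the Organisation, Person and Work node types listed in the results.

```python
from collections import defaultdict

# Hypothetical toy data: (source, relation, target) triples.
# Relation names are assumptions made for this illustration only.
EDGES = [
    ("org:1", "HAS_MEMBER", "person:a"),
    ("org:1", "HAS_MEMBER", "person:b"),
    ("person:a", "AUTHORED", "work:1"),
    ("person:b", "AUTHORED", "work:1"),
    ("person:b", "AUTHORED", "work:2"),
    ("work:1", "CITED_BY", "work:3"),
    ("work:1", "CITED_BY", "work:4"),
    ("work:2", "CITED_BY", "work:4"),
]


def build_adjacency(edges):
    """Index outgoing edges by source node."""
    adjacency = defaultdict(list)
    for source, relation, target in edges:
        adjacency[source].append((relation, target))
    return adjacency


def rolled_up_counts(adjacency, organisation):
    """Level-by-level (breadth-first) walk: organisation -> people -> works -> citing works."""

    def step(frontier, relation):
        # Follow one relation type from every node in the current frontier.
        return {
            target
            for node in frontier
            for rel, target in adjacency.get(node, [])
            if rel == relation
        }

    people = step({organisation}, "HAS_MEMBER")
    works = step(people, "AUTHORED")  # a set, so co-authored works count once
    citations = sum(
        1
        for work in works
        for rel, _ in adjacency.get(work, [])
        if rel == "CITED_BY"
    )
    return {"people": len(people), "works": len(works), "citations": citations}


adjacency = build_adjacency(EDGES)
print(rolled_up_counts(adjacency, "org:1"))
```

Each level uses a set as its frontier, so a work co-authored by two people in the same organisation is counted once. In a graph database the same three-level walk would be expressed as a single query rather than as application code.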