The Protein Regulatory Networks of COVID-19 - A Knowledge Graph Created by Elsevier

Shobhna Srivastava discusses Elsevier's research citation network. She describes how the effort to simplify the existing data processing pipeline, optimise costs, and choose the right solution to the problem opened the door to other potential use cases and innovation. Graph technology has been applied to the scientific research domain to enhance content discovery.

July 2020
Shobhna Srivastava
Enhancing Search results with Graph
Neo4j/Elsevier

Context
■ Elsevier is a global information & analytics business specializing in science & health
■ Scopus – “Expertly curated abstract & citations database”
■ https://www.scopus.com/

Problem definition
Doesn’t enable changes or enriching documents with new data points
The processing is fragile
Costly solution
Hardware used
• 90-node Solr indexing cluster (separate from the live search cluster)
• Redshift
• EC2 instances for processing
Old document enrichment pipeline
• Index is created in Solr
• Redshift is updated from Solr
• New counts are calculated and diffed against the old Solr index
• The updates are applied to the Solr index
• Finally, the live Solr cluster is updated
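
The "calculate new counts, then diff against the old index" step of the old pipeline can be illustrated with a minimal sketch. This is not Elsevier's code: the function names, the `(citing_id, cited_id)` pair layout, and the sample document ids are assumptions made only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_citations(citations: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count how often each document is cited.

    Each pair is (citing_id, cited_id); this layout is hypothetical.
    """
    return dict(Counter(cited for _citing, cited in citations))


def diff_counts(old: Dict[str, int], new: Dict[str, int]) -> Dict[str, int]:
    """Keep only documents whose count is new or differs from the old index."""
    return {doc_id: n for doc_id, n in new.items() if old.get(doc_id) != n}


# Hypothetical sample data standing in for the old index and fresh input.
old_index = {"b": 1, "c": 1}
citations = [("a", "b"), ("c", "b"), ("a", "c")]

new_counts = count_citations(citations)
updates = diff_counts(old_index, new_counts)
print(updates)  # {'b': 2}
```

Only the changed entries would then need to be written back, which is the part the slide describes as applying updates to the Solr index.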

Bounded context
Runtime system – performance is important
Aware of starting node or nodes
Depth first or breadth first traversal
Metrics generation
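
A traversal that starts from known nodes and is bounded in depth can be sketched as a breadth-first search. This is a generic in-memory illustration, not the production system: the adjacency-dict representation, the function name `reachable_within`, and the sample node ids are all assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List


def reachable_within(
    graph: Dict[str, List[str]], starts: Iterable[str], max_depth: int
) -> Dict[str, int]:
    """Breadth-first traversal from the start nodes, returning node -> depth.

    Nodes further than max_depth hops from every start node are not visited.
    """
    depths: Dict[str, int] = {}
    queue: deque = deque()
    for start in starts:
        if start not in depths:
            depths[start] = 0
            queue.append(start)
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the depth bound
        for neighbour in graph.get(node, []):
            if neighbour not in depths:
                depths[neighbour] = depths[node] + 1
                queue.append(neighbour)
    return depths


# Hypothetical graph: w1 links to w2 and w3, both link to w4, w4 links to w5.
graph = {"w1": ["w2", "w3"], "w2": ["w4"], "w3": ["w4"], "w4": ["w5"]}
print(reachable_within(graph, ["w1"], max_depth=2))
# {'w1': 0, 'w2': 1, 'w3': 1, 'w4': 2}
```

Knowing the starting nodes keeps the visited set small, which matters when the traversal runs inside a latency-sensitive runtime system.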

Why graph?
Classic multi-level graph traversals
Many-to-many relations on input data
Non-trivial & multi-level joins
Most enrichment is done on relationships and how data are connected to each other
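
To see why many-to-many relations make multi-level joins non-trivial, consider a rolled-up citation count per organisation. The sketch below assumes a hypothetical Organisation → Person → Work shape held in plain dicts; the names and numbers are illustrative, not taken from the deck.

```python
from typing import Dict, List, Set


def org_citation_count(
    org: str,
    affiliated: Dict[str, List[str]],
    authored: Dict[str, List[str]],
    citations: Dict[str, int],
) -> int:
    """Sum citation counts over the distinct works of an organisation's people."""
    works: Set[str] = set()
    for person in affiliated.get(org, []):
        works.update(authored.get(person, []))  # the set removes co-authored duplicates
    return sum(citations.get(work, 0) for work in works)


# Hypothetical data: w2 is co-authored by two people from the same organisation.
affiliated = {"org-1": ["p1", "p2"]}
authored = {"p1": ["w1", "w2"], "p2": ["w2", "w3"]}
citations = {"w1": 5, "w2": 10, "w3": 1}

# A naive join over (person, work) pairs counts w2 twice.
naive = sum(citations[w] for p in affiliated["org-1"] for w in authored[p])

print(org_citation_count("org-1", affiliated, authored, citations))  # 16
print(naive)  # 26
```

The difference between the two results comes entirely from the shared work, which is the kind of detail a traversal over relationships has to handle explicitly.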

Technology choice: Neo4j vs Neptune
■ Meets QPS
– Neo4j: ✓
– Neptune: ⚠ Much slower with queries that require longer traversals (for example, "rolled up" queries per organisation count: 7 ms on Neo4j vs 7 seconds on Neptune)
■ Scalability
– Neo4j: ⚠ Tested with a graph size that fits into cache; with a larger graph some smarter caching should be implemented
– Neptune: ⚠ Works fast on larger instances (supposedly because of the cache size), so with a larger graph some application-level optimisations might be required. A bit trickier than Neo4j because cache settings are not visible/configurable
■ Indexing
– Neo4j: ✓
– Neptune: ⚠ Indexes are not configurable
■ Transaction management
– Neo4j: ✓
– Neptune: ⚠ Every traversal is a single transaction; manual commit/rollback is not supported
■ Ease of cluster management
– Neo4j: ✓ Out-of-the-box clustering with an enterprise licence. Unless an enterprise licence is purchased, clustering and data replication should be handled by us
– Neptune: ✓ Easy out-of-the-box data replication, immediate consistency
■ Cost
– Neo4j: 2 r4.4xlarge instances + LB ~ 1800 USD/month
– Neptune: 2 r4.4xlarge instances + 250 GB storage (estimated based on test data) ~ 2015 USD/month + 0.2 USD/1 million I/O requests (1,600 million requests made only during testing)
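
The cost and latency figures above can be put side by side with a little arithmetic. The constants are the numbers quoted on the slide; the function name and the idea of adding the I/O charge for the test workload to one month's base price are assumptions made for illustration, since the slide does not say how many I/O requests a production month would generate.

```python
NEO4J_MONTHLY_USD = 1800           # 2 r4.4xlarge instances + LB (from the slide)
NEPTUNE_BASE_MONTHLY_USD = 2015    # 2 r4.4xlarge instances + 250 GB storage
NEPTUNE_IO_USD_PER_MILLION = 0.2   # per 1 million I/O requests


def neptune_io_cost(million_requests: float) -> float:
    """I/O charge in USD for the given number of requests, in millions."""
    return round(million_requests * NEPTUNE_IO_USD_PER_MILLION, 2)


# 1,600 million requests were made during testing alone.
test_io_cost = neptune_io_cost(1600)
print(test_io_cost)                              # 320.0
print(NEPTUNE_BASE_MONTHLY_USD + test_io_cost)   # 2335.0

# Latency of the "rolled up" query: 7 seconds on Neptune vs 7 ms on Neo4j.
slowdown = 7_000 // 7
print(slowdown)  # 1000
```

On these figures the test workload alone adds 320 USD of I/O charges to Neptune's base price, and the rolled-up query is three orders of magnitude slower.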

Result
■ ~300,000,000 nodes
– Work (articles, books, chapters) – 268,419,884
– Person (Author) – 40,633,203
– Organisation – 13,044,870
– Journal – 227,747
■ ~1,000,000,000 relations
■ ~1,000,000 updates a day
■ Hardware used (From ~90+ to ~9 nodes)
– 3 nodes (r4.4xlarge)
– 3 nodes data processing
– 3 nodes for API
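
As a quick consistency check on the figures above, the per-type node counts can be summed and the daily update volume converted to an average rate. The numbers come from the slide; the variable names are illustrative, and the per-second rate assumes updates are spread evenly over the day, which the deck does not state.

```python
node_counts = {
    "Work": 268_419_884,
    "Person": 40_633_203,
    "Organisation": 13_044_870,
    "Journal": 227_747,
}

total_nodes = sum(node_counts.values())
print(f"{total_nodes:,}")  # 322,325,704

# ~1,000,000 updates a day, assumed evenly distributed over 86,400 seconds.
updates_per_second = round(1_000_000 / 86_400, 1)
print(updates_per_second)  # 11.6
```

The total of roughly 322 million is in line with the rounded "~300,000,000 nodes" headline figure.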

Future work
Weighted ranking
Guided navigation
Related entities Suggestion
New links Associations

Slides with no extracted text (image only)
3. IN PRODUCT
8. ARCHITECTURE COMPONENTS
9. Relations update example