SlideShare a Scribd company logo
1 of 52
Download to read offline
Wikimedia Content API:
A Cassandra Use-case
Eric Evans <eevans@wikimedia.org>
@jericevans
Strangeloop | September 17, 2016
Our Vision:
A world in which every single
human can freely share in the sum
of all knowledge.
About:
● Global movement
● Largest collection of free, collaborative knowledge in human history
● 16 projects
● 16.5 billion total page views per month
● 58.9 million unique devices per day
● More than 13k new editors each month
● More than 75k active editors month-to-month
About: Wikipedia
● More than 38 million articles in 290 languages
● Over 10k new articles added per day
● 13 million edits per month
● Ranked #6 globally in web traffic
Wikimedia
Architecture
LAMP
THE ARCHITECTURE
ALL OF IT
HTML
<h1>
Star Wars: The Force Awakens
</h1>
<p>
Star Wars: The Force Awakens is a 2015 American epic space opera
film directed, co-produced, and co-written by
<a href="/wiki/J._J._Abrams" title="J. J. Abrams">
J. J. Abrams
</a>
</p>
Wikitext
= Star Wars: The Force Awakens =
Star Wars: The Force Awakens is a 2015 American epic space opera
film directed, co-produced, and co-written by [[J. J. Abrams]].
Conversion
wikitext html
WYSIWYG
Conversion
wikitext html
Character-based diffs
Metadata
[[Foo|{{echo|bar}}]]
<a rel="mw:WikiLink" href="./Foo">
<span about="#mwt1" typeof="mw:Object/Template"
data-parsoid="{...}" >bar</span>
</a>
Parsoid
● Node.js service
● Converts wikitext to HTML/RDFa
● Converts HTML/RDFa to wikitext
● Semantics, and syntax (avoid dirty diffs)!
● Expensive (slow)
● Resulting output is large
RESTBase
● Services aggregator / proxy (REST)
● Durable cache (Cassandra)
● Wikimedia’s content API (e.g. https://en.wikipedia.org/api/rest_v1?doc)
Cassandra
RESTBase
RESTBase
Parsoid ... ...
Other use-cases
● Mobile content service
● Math formula rendering service
● Dumps
● ...
Cassandra
Environment
● Cassandra 2.2.6
● 2 datacenters
● 3 racks rows per datacenter
● 6 replicas (3x3)
● 18 hosts (16 core, 128G, SSDs)
● 54 Cassandra nodes
● Deflate compression (~14-18%)
● Read-heavy workload (5:1)
Data model
key-revision-value
● Key
○ Site domain + document title
○ Used as partition key (determines cluster distribution)
● Revision
○ Monotonically increasing integer assigned by MediaWiki
○ Many-to-one relationship with key
● Render
○ TimeUUID assigned by RESTBase
○ Many-to-one relationship with revision
● Value
○ Blob
As CQL DDL
CREATE TABLE data (
domain text,
title text,
rev int,
render timeuuid,
value blob,
PRIMARY KEY ((domain, title), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC)
Data model
(en.wikipedia.org, Star_Wars:_The_Force_Awakens)
717862573 717873822
...97466b12...7c7a913d3d8a1f2dd66c...7c7a913d3d8a
...
09877568...7c7a913d3d8a
bdebc9a6...7c7a913d3d8a827e2ec2...7c7a913d3d8a
Compression
Compression
chunk_length_kb
Compression
chunk_length_kb
Brotli compression
● Brought to you by the folks at Google; Successor to deflate
● Cassandra implementation (https://github.com/eevans/cassandra-brotli)
● Initial results very promising
● Better compression, lower cost (apples-apples)
● And, wider windows are possible (apples-oranges)
○ GC/memory permitting
○ Example: level=1, lgblock=4096, chunk_length_kb=4096, yields 1.73% compressed size!
○ https://phabricator.wikimedia.org/T122028
● Stay tuned!
Compaction
Compaction
● The cost of having log-structured storage
● Asynchronously optimize data on disk for later reads
● At a minimum, reorganize into fewer files
● If possible, reorganize to position query results nearer each other
● Garbage collect
○ Remove overwritten values
○ Expiring TTLs
○ Removing deleted (aka tombstoned) data (after a fashion
Compaction strategies
● Size-tiered
○ Combines tables of similar size
○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes
○ Minimal IO
● Leveled
○ Small, fixed size files in levels of exponentially increasing size
○ Files have non-overlapping ranges within a level
○ Very efficient reads, but also quite IO intensive
Compaction strategies
● Size-tiered
○ Combines tables of similar size
○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes
○ Minimal IO
● Leveled
○ Small, fixed size files in levels of exponentially increasing size
○ Files have non-overlapping ranges within a level
○ Very efficient reads, but also quite IO intensive
OMG, THIS!
Nope.
● Too much disk IO
● Too much GC throughput
● Way too much disk IO
Compaction strategies
● Size-tiered
○ Combines tables of similar size
○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes
○ Minimal IO
● Leveled
○ Small, fixed size files in levels of exponentially increasing size
○ Files have non-overlapping ranges within a level
○ Very efficient reads, but also quite IO intensive
● Date-tiered
○ For append only, total ordered data
○ Avoids mixing old data with new
○ Cold data eventually ceases to be compacted NO, THIS!
Nope.
● Too difficult to reason about
● Optimizations easy defeated
Compaction strategies
● Size-tiered
○ Combines tables of similar size
○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes
○ Minimal IO
● Leveled
○ Small, fixed size files in levels of exponentially increasing size
○ Files have non-overlapping ranges within a level
○ Very efficient reads, but also quite IO intensive
● Date-tiered
○ For append only, total ordered data
○ Avoids mixing old data with new
○ Cold data eventually ceases to be compacted
this?
But remember: Compaction is also for...
● Removing overwritten values
● Expiring TTLs
● Removing deleted (aka tombstoned) data (after a fashion)
The Overlapping Tables Problem
Tables being compacted Tables not being compacted
Garbage
Collection (JVM)
G1GC
● Early adopters of G1 (aka “Garbage 1st”)
● Successor to Concurrent Mark-sweep (CMS)
● Incremental parallel compacting collector
● More predictable performance than CMS
Humongous objects
● Anything >= ½ region size is classified as Humongous
● Humongous objects are allocated into Humongous Regions
● Only one object for a region (wastes space, creates fragmentation)
● Until 1.8u40, humongous regions collected only during full collections (Bad)
● Since 1.8u40, end of the marking cycle, during the cleanup phase (Better)
● Treated as exceptions, so should be exceptional
○ For us, that means 8MB regions
● Enable GC logging and have a look!
Node density
Motivation
● Compaction
● GC
● Failure scenarios
● ...
What we do
● Processes (yup)
● Puppetized configuration
○ /etc/cassandra-a/
○ /etc/cassandra-b/
○ systemd units
○ Etc
● Shared RAID-0
What we should have done
● Virtualization
● Containers
● Blades
● Not processes
The Good
● Fault tolerance
● Availability
● Datacenter / rack awareness
● Visibility
● Ubiquity
● Community
The Bad
● Usability
○ Compaction
○ Streaming
○ JMX
● Vertical Scaling
The Ugly
● Upgrades
● Release process
Wikimedia Content API (Strangeloop)

More Related Content

What's hot

Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...
yaevents
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
Salil Navgire
 
Open stack @ iiit hyderabad
Open stack @ iiit hyderabad Open stack @ iiit hyderabad
Open stack @ iiit hyderabad
openstackindia
 

What's hot (20)

Ceph Day Chicago: Using Ceph for Large Hadron Collider Data
Ceph Day Chicago: Using Ceph for Large Hadron Collider Data Ceph Day Chicago: Using Ceph for Large Hadron Collider Data
Ceph Day Chicago: Using Ceph for Large Hadron Collider Data
 
Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...Распределенные системы хранения данных, особенности реализации DHT в проекте ...
Распределенные системы хранения данных, особенности реализации DHT в проекте ...
 
KDB database (EPAM tech talks, Sofia, April, 2015)
KDB database (EPAM tech talks, Sofia, April, 2015)KDB database (EPAM tech talks, Sofia, April, 2015)
KDB database (EPAM tech talks, Sofia, April, 2015)
 
TriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in ProductionTriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in Production
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
 
Austin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_dataAustin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_data
 
Введение в Apache Cassandra
Введение в Apache CassandraВведение в Apache Cassandra
Введение в Apache Cassandra
 
KDB+ Lite
KDB+ LiteKDB+ Lite
KDB+ Lite
 
Introduction to Bizur
Introduction to BizurIntroduction to Bizur
Introduction to Bizur
 
SOLR Power FTW: short version
SOLR Power FTW: short versionSOLR Power FTW: short version
SOLR Power FTW: short version
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 
Elasticsearch avoiding hotspots
Elasticsearch  avoiding hotspotsElasticsearch  avoiding hotspots
Elasticsearch avoiding hotspots
 
Xephon K A Time series database with multiple backends
Xephon K A Time series database with multiple backendsXephon K A Time series database with multiple backends
Xephon K A Time series database with multiple backends
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax AstraApache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
 
Open stack @ iiit hyderabad
Open stack @ iiit hyderabad Open stack @ iiit hyderabad
Open stack @ iiit hyderabad
 
Using Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider DataUsing Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider Data
 
Redis Overview
Redis OverviewRedis Overview
Redis Overview
 
Redis as a Main Database, Scaling and HA
Redis as a Main Database, Scaling and HARedis as a Main Database, Scaling and HA
Redis as a Main Database, Scaling and HA
 

Viewers also liked

Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Eric Evans
 
Cassandra by Example: Data Modelling with CQL3
Cassandra by Example:  Data Modelling with CQL3Cassandra by Example:  Data Modelling with CQL3
Cassandra by Example: Data Modelling with CQL3
Eric Evans
 
Virtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in Cassandra
Eric Evans
 
Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)
Eric Evans
 
Time Series Data with Apache Cassandra
Time Series Data with Apache CassandraTime Series Data with Apache Cassandra
Time Series Data with Apache Cassandra
Eric Evans
 

Viewers also liked (20)

Domain driven design and model driven development
Domain driven design and model driven developmentDomain driven design and model driven development
Domain driven design and model driven development
 
Wikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-caseWikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-case
 
Wikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-caseWikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-case
 
Castle enhanced Cassandra
Castle enhanced CassandraCastle enhanced Cassandra
Castle enhanced Cassandra
 
Webinar Degetel DataStax
Webinar Degetel DataStaxWebinar Degetel DataStax
Webinar Degetel DataStax
 
Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)
 
Webinaire Business&Decision - Trifacta
Webinaire  Business&Decision - TrifactaWebinaire  Business&Decision - Trifacta
Webinaire Business&Decision - Trifacta
 
DataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoTDataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoT
 
CQL In Cassandra 1.0 (and beyond)
CQL In Cassandra 1.0 (and beyond)CQL In Cassandra 1.0 (and beyond)
CQL In Cassandra 1.0 (and beyond)
 
DataStax Enterprise BBL
DataStax Enterprise BBLDataStax Enterprise BBL
DataStax Enterprise BBL
 
Cassandra by Example: Data Modelling with CQL3
Cassandra by Example:  Data Modelling with CQL3Cassandra by Example:  Data Modelling with CQL3
Cassandra by Example: Data Modelling with CQL3
 
Virtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in Cassandra
 
Virtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in Cassandra
 
CQL: SQL In Cassandra
CQL: SQL In CassandraCQL: SQL In Cassandra
CQL: SQL In Cassandra
 
Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
 
Time Series Data with Apache Cassandra
Time Series Data with Apache CassandraTime Series Data with Apache Cassandra
Time Series Data with Apache Cassandra
 
Cassandra 2.2 & 3.0
Cassandra 2.2 & 3.0Cassandra 2.2 & 3.0
Cassandra 2.2 & 3.0
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
 
Préparation de Données Hadoop avec Trifacta
Préparation de Données Hadoop avec TrifactaPréparation de Données Hadoop avec Trifacta
Préparation de Données Hadoop avec Trifacta
 

Similar to Wikimedia Content API (Strangeloop)

Similar to Wikimedia Content API (Strangeloop) (20)

Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Cassandra in production
Cassandra in productionCassandra in production
Cassandra in production
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokes
 
An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache Cassandra
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
NUMA and Java Databases
NUMA and Java DatabasesNUMA and Java Databases
NUMA and Java Databases
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
 
Scaling MySQL in Amazon Web Services
Scaling MySQL in Amazon Web ServicesScaling MySQL in Amazon Web Services
Scaling MySQL in Amazon Web Services
 
Running MySQL in AWS
Running MySQL in AWSRunning MySQL in AWS
Running MySQL in AWS
 
Consistent hashing algorithmic tradeoffs
Consistent hashing  algorithmic tradeoffsConsistent hashing  algorithmic tradeoffs
Consistent hashing algorithmic tradeoffs
 
Monitoring Cassandra With An EYE
Monitoring Cassandra With An EYEMonitoring Cassandra With An EYE
Monitoring Cassandra With An EYE
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Raft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdfRaft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdf
 
Logs @ OVHcloud
Logs @ OVHcloudLogs @ OVHcloud
Logs @ OVHcloud
 
Long Term Road Test of C*
Long Term Road Test of C*Long Term Road Test of C*
Long Term Road Test of C*
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 

More from Eric Evans (9)

Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3
 
Cassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQLCassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQL
 
NoSQL Yes, But YesCQL, No?
NoSQL Yes, But YesCQL, No?NoSQL Yes, But YesCQL, No?
NoSQL Yes, But YesCQL, No?
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
 
Outside The Box With Apache Cassnadra
Outside The Box With Apache CassnadraOutside The Box With Apache Cassnadra
Outside The Box With Apache Cassnadra
 
The Cassandra Distributed Database
The Cassandra Distributed DatabaseThe Cassandra Distributed Database
The Cassandra Distributed Database
 
An Introduction To Cassandra
An Introduction To CassandraAn Introduction To Cassandra
An Introduction To Cassandra
 
Cassandra In A Nutshell
Cassandra In A NutshellCassandra In A Nutshell
Cassandra In A Nutshell
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 

Wikimedia Content API (Strangeloop)

  • 1. Wikimedia Content API: A Cassandra Use-case Eric Evans <eevans@wikimedia.org> @jericevans Strangeloop | September 17, 2016
  • 2.
  • 3. Our Vision: A world in which every single human can freely share in the sum of all knowledge.
  • 4. About: ● Global movement ● Largest collection of free, collaborative knowledge in human history ● 16 projects ● 16.5 billion total page views per month ● 58.9 million unique devices per day ● More than 13k new editors each month ● More than 75k active editors month-to-month
  • 5. About: Wikipedia ● More than 38 million articles in 290 languages ● Over 10k new articles added per day ● 13 million edits per month ● Ranked #6 globally in web traffic
  • 9.
  • 10.
  • 11. HTML <h1> Star Wars: The Force Awakens </h1> <p> Star Wars: The Force Awakens is a 2015 American epic space opera film directed, co-produced, and co-written by <a href="/wiki/J._J._Abrams" title="J. J. Abrams"> J. J. Abrams </a> </p>
  • 12. Wikitext = Star Wars: The Force Awakens = Star Wars: The Force Awakens is a 2015 American epic space opera film directed, co-produced, and co-written by [[J. J. Abrams]].
  • 17. Metadata [[Foo|{{echo|bar}}]] <a rel="mw:WikiLink" href="./Foo"> <span about="#mwt1" typeof="mw:Object/Template" data-parsoid="{...}" >bar</span> </a>
  • 18. Parsoid ● Node.js service ● Converts wikitext to HTML/RDFa ● Converts HTML/RDFa to wikitext ● Semantics, and syntax (avoid dirty diffs)! ● Expensive (slow) ● Resulting output is large
  • 19. RESTBase ● Services aggregator / proxy (REST) ● Durable cache (Cassandra) ● Wikimedia’s content API (e.g. https://en.wikipedia.org/api/rest_v1?doc)
  • 21. Other use-cases ● Mobile content service ● Math formula rendering service ● Dumps ● ...
  • 23. Environment ● Cassandra 2.2.6 ● 2 datacenters ● 3 racks rows per datacenter ● 6 replicas (3x3) ● 18 hosts (16 core, 128G, SSDs) ● 54 Cassandra nodes ● Deflate compression (~14-18%) ● Read-heavy workload (5:1)
  • 25. key-revision-value ● Key ○ Site domain + document title ○ Used as partition key (determines cluster distribution) ● Revision ○ Monotonically increasing integer assigned by MediaWiki ○ Many-to-one relationship with key ● Render ○ TimeUUID assigned by RESTBase ○ Many-to-one relationship with revision ● Value ○ Blob
  • 26. As CQL DDL CREATE TABLE data ( domain text, title text, rev int, render timeuuid, value blob, PRIMARY KEY ((domain, title), rev, tid) ) WITH CLUSTERING ORDER BY (rev DESC, tid DESC)
  • 27. Data model (en.wikipedia.org, Star_Wars:_The_Force_Awakens) 717862573 717873822 ...97466b12...7c7a913d3d8a1f2dd66c...7c7a913d3d8a ... 09877568...7c7a913d3d8a bdebc9a6...7c7a913d3d8a827e2ec2...7c7a913d3d8a
  • 31. Brotli compression ● Brought to you by the folks at Google; Successor to deflate ● Cassandra implementation (https://github.com/eevans/cassandra-brotli) ● Initial results very promising ● Better compression, lower cost (apples-apples) ● And, wider windows are possible (apples-oranges) ○ GC/memory permitting ○ Example: level=1, lgblock=4096, chunk_length_kb=4096, yields 1.73% compressed size! ○ https://phabricator.wikimedia.org/T122028 ● Stay tuned!
  • 33. Compaction ● The cost of having log-structured storage ● Asynchronously optimize data on disk for later reads ● At a minimum, reorganize into fewer files ● If possible, reorganize to position query results nearer each other ● Garbage collect ○ Remove overwritten values ○ Expiring TTLs ○ Removing deleted (aka tombstoned) data (after a fashion
  • 34. Compaction strategies ● Size-tiered ○ Combines tables of similar size ○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes ○ Minimal IO ● Leveled ○ Small, fixed size files in levels of exponentially increasing size ○ Files have non-overlapping ranges within a level ○ Very efficient reads, but also quite IO intensive
  • 35. Compaction strategies ● Size-tiered ○ Combines tables of similar size ○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes ○ Minimal IO ● Leveled ○ Small, fixed size files in levels of exponentially increasing size ○ Files have non-overlapping ranges within a level ○ Very efficient reads, but also quite IO intensive OMG, THIS!
  • 36. Nope. ● Too much disk IO ● Too much GC throughput ● Way too much disk IO
  • 37. Compaction strategies ● Size-tiered ○ Combines tables of similar size ○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes ○ Minimal IO ● Leveled ○ Small, fixed size files in levels of exponentially increasing size ○ Files have non-overlapping ranges within a level ○ Very efficient reads, but also quite IO intensive ● Date-tiered ○ For append only, total ordered data ○ Avoids mixing old data with new ○ Cold data eventually ceases to be compacted NO, THIS!
  • 38. Nope. ● Too difficult to reason about ● Optimizations easy defeated
  • 39. Compaction strategies ● Size-tiered ○ Combines tables of similar size ○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes ○ Minimal IO ● Leveled ○ Small, fixed size files in levels of exponentially increasing size ○ Files have non-overlapping ranges within a level ○ Very efficient reads, but also quite IO intensive ● Date-tiered ○ For append only, total ordered data ○ Avoids mixing old data with new ○ Cold data eventually ceases to be compacted this?
  • 40. But remember: Compaction is also for... ● Removing overwritten values ● Expiring TTLs ● Removing deleted (aka tombstoned) data (after a fashion)
  • 41. The Overlapping Tables Problem Tables being compacted Tables not being compacted
  • 43. G1GC ● Early adopters of G1 (aka “Garbage 1st”) ● Successor to Concurrent Mark-sweep (CMS) ● Incremental parallel compacting collector ● More predictable performance than CMS
  • 44. Humongous objects ● Anything >= ½ region size is classified as Humongous ● Humongous objects are allocated into Humongous Regions ● Only one object for a region (wastes space, creates fragmentation) ● Until 1.8u40, humongous regions collected only during full collections (Bad) ● Since 1.8u40, end of the marking cycle, during the cleanup phase (Better) ● Treated as exceptions, so should be exceptional ○ For us, that means 8MB regions ● Enable GC logging and have a look!
  • 46. Motivation ● Compaction ● GC ● Failure scenarios ● ...
  • 47. What we do ● Processes (yup) ● Puppetized configuration ○ /etc/cassandra-a/ ○ /etc/cassandra-b/ ○ systemd units ○ Etc ● Shared RAID-0
  • 48. What we should have done ● Virtualization ● Containers ● Blades ● Not processes
  • 49. The Good ● Fault tolerance ● Availability ● Datacenter / rack awareness ● Visibility ● Ubiquity ● Community
  • 50. The Bad ● Usability ○ Compaction ○ Streaming ○ JMX ● Vertical Scaling
  • 51. The Ugly ● Upgrades ● Release process