This talk was given by Jun Rao (Staff Software Engineer at LinkedIn) and Sam Shah (Senior Engineering Manager at LinkedIn) at the Analytics@Webscale Technical Conference (June 2013).
9. LinkedIn Data Infrastructure: Sample Stack
Infra challenges in 3-phase ecosystem are diverse, complex and specific
Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms
21. Application View
Hierarchical data model
Rich functionality on resources
– Conditional updates
– Partial updates
– Atomic counters
Rich functionality within resource groups
– Transactions
– Secondary index
– Text search
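The three per-resource operations on this slide can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: the `ResourceStore` class, its method names, the version-number scheme, and the `views` field are all hypothetical and are not the API of any system described in the talk. The key and field values are borrowed from the sample record on slide 32.

```python
import threading
from typing import Any, Dict, Optional


class ResourceStore:
    """Hypothetical in-memory store; not the API of any real system."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._docs: Dict[str, Dict[str, Any]] = {}
        self._versions: Dict[str, int] = {}

    def put(self, key: str, doc: Dict[str, Any],
            if_version: Optional[int] = None) -> int:
        """Conditional update: write only if the stored version matches."""
        with self._lock:
            current = self._versions.get(key, 0)
            if if_version is not None and if_version != current:
                raise ValueError(
                    f"version mismatch: expected {if_version}, found {current}")
            self._docs[key] = dict(doc)
            self._versions[key] = current + 1
            return self._versions[key]

    def patch(self, key: str, fields: Dict[str, Any]) -> int:
        """Partial update: change only the given fields."""
        with self._lock:
            self._docs.setdefault(key, {}).update(fields)
            self._versions[key] = self._versions.get(key, 0) + 1
            return self._versions[key]

    def increment(self, key: str, field: str, delta: int = 1) -> int:
        """Atomic counter: read-modify-write under a single lock."""
        with self._lock:
            doc = self._docs.setdefault(key, {})
            doc[field] = doc.get(field, 0) + delta
            self._versions[key] = self._versions.get(key, 0) + 1
            return doc[field]

    def get(self, key: str) -> Dict[str, Any]:
        with self._lock:
            return dict(self._docs[key])


store = ResourceStore()
key = "articles/5560874437395353942"  # hierarchical key: collection/resource id
v1 = store.put(key, {"title": "Five Good Reasons to Hire the Unemployed"})
v2 = store.patch(key, {"language": "en_US"})
views = store.increment(key, "views")

# A writer holding the stale version v1 is rejected instead of overwriting.
try:
    store.put(key, {"title": "stale write"}, if_version=v1)
    stale_rejected = False
except ValueError:
    stale_rejected = True
```

A real distributed store would enforce the version check on the server rather than with a process-local lock; the lock here only stands in for that guarantee.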
25. Top complaints from data scientists
1 Getting the data in (Ingress ETL)
2 Getting the data out (Egress)
3 Workflow management
4 Model of computation
5 …
32. Data model (cont'd)
{
article_id=5560874437395353942,
title=Five Good Reasons to Hire the Unemployed,
language=en_US,
article_source=bit.ly,
url=aHR0cDovL3d3dy5vbmV0aGluZ25ldy5jb20vaW5kZXgucGhwL3dvcmsvMTAyLWZpdmUtZ29vZC1yZWFzb25zLXRvLWhpcmUtdGhlLXVuZW1wbG95ZWQK,
...
}
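The `url` value in the record above appears to be Base64-encoded (it wraps across two lines on the slide). A minimal sketch of recovering the readable URL with the Python standard library follows; the variable names are illustrative, and the slide itself does not say how or why the field was encoded.

```python
import base64

# The url field from the sample record, with the slide's line wrap removed.
encoded_url = (
    "aHR0cDovL3d3dy5vbmV0aGluZ25ldy5jb20vaW5kZXgucGhwL3dvcmsvMTAyLWZpdmUtZ29v"
    "ZC1yZWFzb25zLXRvLWhpcmUtdGhlLXVuZW1wbG95ZWQK"
)

# Decode to bytes, then to text; strip() drops the trailing newline that was
# part of the encoded payload.
decoded_url = base64.b64decode(encoded_url).decode("utf-8").strip()
print(decoded_url)
```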
39. Data model
DDL for data definition and schema
Central versioned registry of all schemas
Schema review
Programmatic compatibility model
– Schema changes handled transparently
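A "programmatic compatibility model" can be sketched as a function that compares two schema versions before the new one is accepted into the registry. The rule below is an assumption chosen for illustration, since the slide does not spell out the actual rules: a newly added field must carry a default, and an existing field may not change type. The dictionary-based schema representation and the function name are hypothetical; the field names come from the sample record on slide 32.

```python
from typing import Any, Dict, List

# Hypothetical representation: field name -> {"type": ..., optional "default": ...}
Schema = Dict[str, Dict[str, Any]]


def compatibility_errors(old: Schema, new: Schema) -> List[str]:
    """Return the reasons `new` cannot be used to read data written with `old`."""
    errors = []
    for name, spec in new.items():
        if name not in old:
            # Old records lack this field, so a reader needs a default for it.
            if "default" not in spec:
                errors.append(f"new field '{name}' has no default")
        elif old[name]["type"] != spec["type"]:
            errors.append(
                f"field '{name}' changed type from "
                f"{old[name]['type']} to {spec['type']}")
    return errors


v1 = {"article_id": {"type": "long"}, "title": {"type": "string"}}
v2_ok = {**v1, "language": {"type": "string", "default": "en_US"}}
v2_bad = {**v1, "language": {"type": "string"}}

print(compatibility_errors(v1, v2_ok))
print(compatibility_errors(v1, v2_bad))
```

Running such a check at check-in time is one way a registry could reject a breaking change before it reaches downstream systems.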
40. Workflow
1 Check in schema
2 Code review
3 Ship
Seamless data load into downstream systems
54. Top complaints from data scientists
1 Getting the data in (Ingress ETL)
2 Getting the data out (Egress)
3 Workflow management
4 Model of computation
5 …
55. Model of computation
• Alternating Direction Method of Multipliers (ADMM)
• Distributed Conjugate Gradient Descent (DCGD)
• Distributed L-BFGS
• Bayesian Distributed Learning (BDL)
Graphs
Distributed learning
Near-line processing
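To give a feel for the first algorithm listed on this slide, here is a toy consensus-form ADMM in which each "worker" holds a single number and the workers agree on the value minimizing the sum of squared distances, which is the mean. This is a sketch on an assumed toy problem, not the implementation discussed in the talk; the function name, the penalty parameter `rho`, and the iteration count are illustrative choices.

```python
from typing import List


def consensus_admm_mean(values: List[float], rho: float = 1.0,
                        iterations: int = 60) -> float:
    """Minimize sum_i 0.5 * (x - a_i)^2 with consensus ADMM (scaled form)."""
    n = len(values)
    x = [0.0] * n   # local estimates, one per worker
    u = [0.0] * n   # scaled dual variables, one per worker
    z = 0.0         # shared consensus variable
    for _ in range(iterations):
        # Local step (parallel across workers):
        # argmin_x 0.5*(x - a_i)^2 + (rho/2)*(x - z + u_i)^2
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(values, u)]
        # Global step: average the local estimates plus duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each worker's disagreement with the consensus.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z


estimate = consensus_admm_mean([1.0, 3.0, 8.0])
print(estimate)
```

Only the global averaging step needs communication between workers, which is what makes this family of methods attractive for distributed learning.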
Transition needs to be good.
Products => data infrastructure requirements in previous slide.
Not all products place the same latency and freshness requirements on our data infrastructure.
The way we bucketize this is…
News and recommendations show up in both near-line and offline.
Not part of Kafka
- Others: Oozie
Data integration is hard.
Have sane and consistent metadata across systems.
Have a schema that works across the 3 phases.
Want rich, evolving schemas; push as much of the data cleaning and conforming as possible to the source and upstream, so that both near-line and offline systems benefit.
Sessionization logic is in the WH, which makes it hard for near-line systems to use.
Build an extensible system where changing a schema in one phase does not break downstream systems.
Don't build over-specialized systems, e.g. a monitoring system for PYMK; build Azkaban instead.