Hadoop and other big data tools, such as Voldemort, Azkaban, and Kafka, drive many data-driven products at LinkedIn, such as “People You May Know” and recommendation products like “Jobs You May Be Interested In”. Each of these products can be viewed as a large-scale social recommendation problem, which analyzes billions of possible options and suggests appropriate recommendations.
Since these products analyze billions of edges and terabytes of data daily, they can be built only on a large-scale distributed compute infrastructure. The Kafka publish-subscribe messaging system is used to get the data into the Hadoop file system. Hadoop MapReduce is used as the basic building block to analyze billions of potential options and predict recommendations. Over a hundred MapReduce tasks are combined into a workflow using Azkaban, a Hadoop workflow management tool. The output of the Hadoop jobs is finally stored in the Voldemort key-value store to serve the data efficiently at run-time.
During this talk the audience will get a basic understanding of the link prediction problem behind the “People You May Know” feature, which is a large-scale social recommendation problem. An overview of the solution using Hadoop MapReduce, the Azkaban workflow management tool, and the Voldemort key-value store will be presented. I will also describe how to efficiently compute the number of common connections (triangle closing) using Hadoop MapReduce, which is one of the many signals in link prediction.
Overall, people interested in building applications using Hadoop MapReduce will benefit greatly from this talk.
18. People You May Know Alice Bob Carol How do people know each other?
20. People You May Know Alice Bob Carol Triangle closing How do people know each other?
21. People You May Know Alice Bob Carol Triangle closing Prob(Bob knows Carol) ~ the # of common connections How do people know each other?
22. Triangle Closing in Pig

-- connections in (source_id, dest_id) format in both directions
connections = LOAD 'connections' USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) AS (id1, id2);
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE FLATTEN(group) AS (source_id, dest_id), COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();
Hi, I am Mitul Tiwari. Today I am going to talk about building data-driven products using Hadoop at LinkedIn.
I am part of the Search, Network, and Analytics team at LinkedIn, and I work on data-driven products such as People You May Know.
Let me illustrate through a few examples of data products at LinkedIn.
LinkedIn is the second largest social network for professionals with more than 100 million members. PYMK is a large scale recommendation system that helps you connect with others. Basically, PYMK is a link prediction problem, where we analyze billions of edges to recommend possible connections to you. A big big-data problem!
Another example of a data product at LinkedIn is “Profile Stats”, or “Who Viewed My Profile”. Profile Stats provides analytics about your profile on LinkedIn: who viewed your profile, the top search queries leading to your profile, the number of profile views per day and week, the location of each visitor, and so on. With billions of page views per month, Profile Stats is another big data problem.
Another example of a data product at LinkedIn is “Viewers of this profile also viewed these profiles” — a collaborative filtering way of suggesting profiles.
Topic pages for skills.
Visualize your connections: cluster your connections based on the connection density among them.
The key ideas behind these data products are Recommendations, Analytics, Insight, and Visualization.
Some challenges behind building data-driven products at LinkedIn: a naive implementation of PYMK may result in generating 120M x 120M pairs, which is 14,400 trillion pairs, so you have to be smart about it. So which data product would you like me to build during this talk?
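To make the scale concrete, here is the back-of-the-envelope arithmetic behind that claim (the 120 million member count is the figure from the talk):

```python
# Naive all-pairs PYMK: score every member against every other member.
members = 120_000_000

naive_pairs = members * members
print(f"{naive_pairs:,}")  # 14,400,000,000,000,000 -- 14,400 trillion pairs
```

This is why PYMK only considers friends-of-friends (triangle closing) rather than all pairs.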
Here is a Pig script to do triangle closing, that is, to find the number of common connections between any pair of members.
So how many of you are familiar with Pig? Let me refresh some Pig constructs for those who are not very familiar with Pig.
First you load the connections data, which is in bidirectional pair format, representing each direction of an edge by a pair of member ids.
Then we group the connection pairs by source_id to aggregate all connections for each member.
From the aggregated connections we generate pairs of members (id1, id2) that are friends-of-friends through a source_id.
Now we group by (id1, id2) to aggregate all common connections, and count to find the number of common connections.
Finally, we store common connections data in HDFS.
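The four steps above can be sketched in plain Python as a toy, in-memory version (the real job runs as MapReduce over billions of edges; the member names are illustrative):

```python
from collections import defaultdict
from itertools import combinations

# Bidirectional edge list: each connection appears in both directions,
# matching the (source_id, dest_id) input format of the Pig script.
edges = [("alice", "bob"), ("bob", "alice"),
         ("alice", "carol"), ("carol", "alice")]

# Steps 1-2: group connections by source_id.
conns = defaultdict(list)
for src, dst in edges:
    conns[src].append(dst)

# Step 3: generate friend-of-friend pairs through each source.
# Step 4: group by (id1, id2) and count -> number of common connections.
common = defaultdict(int)
for src, dests in conns.items():
    for id1, id2 in combinations(sorted(dests), 2):
        common[(id1, id2)] += 1
        common[(id2, id1)] += 1  # emit both directions, like the Pig output

print(common[("bob", "carol")])  # bob and carol share one connection: alice
```

Bob and Carol are not connected, but both connect to Alice, so triangle closing counts one common connection between them.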
Let me illustrate triangle closing through our running example. First, we load each directed edge, represented by a pair.
Then we group the connection pairs by source_id to aggregate all connections for each member.
From the aggregated connections we generate pairs of members (id1, id2) that are friends-of-friends through a source_id.
Finally we group by (id1, id2) to aggregate all common connections, and count to find the number of common connections.
After we are done with triangle closing, we can list each member's friends-of-friends ordered by the number of common connections.
Since there might be too many people who are your friends-of-friends, you might want to select the top n from that list. For example, there are more than a hundred and fifty thousand people who are my friends-of-friends.
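The top-n selection can be sketched like this (a hedged illustration with made-up candidates and counts; in practice this runs as another MapReduce/Pig step over 150,000+ candidates per member):

```python
import heapq

# Hypothetical friend-of-friend candidates for one member, as
# (candidate_id, common_connection_count) tuples.
candidates = [("bob", 12), ("carol", 3), ("dave", 40), ("eve", 7)]

# Keep only the top-n by common connections instead of the full list.
top2 = heapq.nlargest(2, candidates, key=lambda c: c[1])
print(top2)  # [('dave', 40), ('bob', 12)]
```

Using a heap keeps the selection at O(k log n) per member rather than fully sorting every candidate list.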
Next you need to push this data out to production. So that's a simple workflow.
I just described how you can build People You May Know by doing triangle closing, finding friends-of-friends, and counting the number of common connections between them. As you just saw, there might be multiple jobs dependent on each other that you have to run in order. So how do we manage our workflow?
We need to ensure that we are showing good quality data to our members. First, we verify that the data transfer between HDFS and the production system is done properly. Second, we push data to a QA store with a viewer to check for any blatant mistakes. Third, we can explain any PYMK recommendation — how and why that recommendation is appearing. Fourth, we have ways to roll back in case something goes wrong. And finally, we have unit tests in place to check that things are processed as we desire.
First, we can improve performance by 50% by exploiting symmetry in our triangle closing: if Bob is a friend-of-friend of Carol, then Carol is a friend-of-friend of Bob. Second, there are supernodes in our social graph. For example, Barack Obama has more than 10,000 connections on LinkedIn; if we generate friend-of-friend pairs from his connections, there will be more than 100 million pairs. Third, we can sample a certain number of connections to decrease the number of friend-of-friend pairs, and randomize the sampling so that we generate different pairs every day.
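These two optimizations — symmetry and supernode sampling — can be sketched together as a pair generator (a toy illustration; the cap value and function name are my own, not from the talk):

```python
import random
from itertools import combinations

MAX_FANOUT = 5000  # hypothetical cap on connections used per supernode


def fof_pairs(dests, seed):
    """Generate friend-of-friend pairs through one member's connections.

    Two optimizations: (1) emit each unordered pair once and rely on
    symmetry to halve the work; (2) sample a supernode's connections,
    reseeding each day so different pairs are generated per run."""
    rng = random.Random(seed)
    if len(dests) > MAX_FANOUT:
        dests = rng.sample(dests, MAX_FANOUT)
    for id1, id2 in combinations(sorted(dests), 2):
        yield (id1, id2)  # id1 < id2; the (id2, id1) pair is implied


print(list(fof_pairs(["carol", "bob"], seed=20110629)))
# [('bob', 'carol')]
```

Emitting only the ordered (id1 < id2) half of each pair is what gives the 50% saving; the symmetric pair can be materialized at the very end, after counting.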