
The Future of Data Science


Data Science Big Data Technologies

Published in: Technology


  1. 1. The Future of Data Science SARITH DIVAKAR M | LBS COLLEGE OF ENGINEERING, KASARAGOD www.sarithdivakar.info sarith@cusat.ac.in
  2. 2. Agenda • DATA SCIENCE • BIG DATA • TECHNOLOGIES
  3. 3. Data Scientist “A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician”
  4. 4. An Interview with Lisa Qian, Airbnb  WHICH SKILLS OR PROGRAMMING LANGUAGES DO YOU MOST FREQUENTLY USE IN YOUR WORK, AND WHY? “At Airbnb, we all use Hive to query data and build derived tables. I use R to do analysis and build models. I use Hive and R every day of the job. A lot of data scientists use Python instead of R – it’s just a matter of what we were familiar with when we came in. There have also been recent efforts to use Spark to build large-scale machine learning models.”
  5. 5. Reference: Mathrubhumi, “http://digitalpaper.mathrubhumi.com/943320/kochi/21-Sept-2016#page/6/2”
  6. 6. Data Scientist Salaries  Average Salary (2015): $118,709 per year  Minimum: $76,000  Maximum: $148,000  Median Salary (2015): $93,991 per year  Total Pay Range: $63,524 – $138,123
  7. 7. Data Scientist Qualifications  Master’s degree: 80%  PhD: 46%  Math and statistics: 32%  Computer Science: 19%  Engineering: 16% Reference: The Burtch Works Study, “http://www.burtchworks.com/big-data-analyst-salary/big-data-career-tips/”
  8. 8. Data Scientist Job Outlook  McKinsey reported that by 2018 the U.S. could face a shortage of 140,000 to 190,000 “people with deep analytic skills” Reference: Report of McKinsey Global Institute, “http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation”
  9. 9. Big Data Initiative http://www.dst.gov.in/big-data-initiative-1
  10. 10. What Kind of Skills Will I Need?
  11. 11. Past and Future of Data Science  Descriptive analytics  Describing what has already taken place  Predictive analytics and real-time analytics in pursuit of business goals  Improving the customer experience  Improving products and services  Reducing costs
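The difference between descriptive and predictive analytics can be illustrated with a small sketch. It assumes a hypothetical list of monthly sales figures and uses a plain least-squares trend line as a stand-in for a predictive model; the function names and the data are illustrative only and do not come from the slides.

```python
from statistics import mean

def describe(history):
    """Descriptive analytics: summarize what has already happened."""
    return {"mean": mean(history), "min": min(history), "max": max(history)}

def fit_trend(history):
    """Fit y = intercept + slope * t by ordinary least squares, with t = 0, 1, 2, ...

    Needs at least two observations, otherwise the slope is undefined.
    """
    n = len(history)
    t_mean = (n - 1) / 2
    y_mean = mean(history)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope

def forecast(history, steps_ahead=1):
    """Predictive analytics: extrapolate the fitted trend into the future."""
    intercept, slope = fit_trend(history)
    return intercept + slope * (len(history) - 1 + steps_ahead)

# Hypothetical monthly sales figures, used only for illustration.
monthly_sales = [10, 12, 14, 16]
summary = describe(monthly_sales)       # what happened
next_month = forecast(monthly_sales)    # what is likely to happen next
```

A real predictive pipeline would validate the model on held-out data; the trend line here only shows the change in question, from "what happened" to "what comes next".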
  12. 12. Where Should Data Scientists Prioritize Their Focus?  Amazon, Google and Netflix.  Python  Variety of tools, perspectives and approaches  Identify methods and models most appropriate in a particular use case. Reference: Devavrat Shah, Professor, Department of Electrical Engineering and Computer Science, MIT, “http://blog.edx.org/future-data-science-qa-mit-professional-educations-devavrat-shah”
  13. 13. Popular Applications  Internet Search  Digital Advertisements  Gaming
  14. 14. Data Science to refine the “Crude Oil”  Volume  Variety  Velocity  Veracity  Value  (add your own V here…..)
  15. 15. Where does big data come from?  A huge amount of data is created every day!  It comes from us!  Non-digitized processes become digitized  Digital India  Programme to transform India into a digitally empowered society and knowledge economy
  16. 16. Excavating Hidden Treasures from Big Data  Insights into data can provide business advantage  Some key early indications can mean fortunes to business  More precise analysis with more data  Integrate Big Data with traditional data: Enhance business intelligence analysis
  17. 17. Challenges in big data  Heterogeneity and incompleteness  Scale  Timeliness  Privacy  Human collaboration
  18. 18. RDBMS: Why not for Big Data?  Limitations of RDBMS  Traditional RDBMS struggle to scale to petabytes of data  Seek time of disk drives is improving more slowly than transfer rate of data  RDBMS are not built to handle unstructured or semi-structured data  Normalization of data makes it difficult to handle large data sets  Example: web logs
  19. 19. Distributed computing  Dividing large problems into smaller ones and solving them concurrently ("in parallel")  Connecting multiple machines together for  Storing big files  Parallel processing  Data locality  Redundancy
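The idea of dividing a large problem into smaller ones and solving them concurrently can be sketched on a single machine. The example below is a minimal illustration, not a distributed system: it assumes a hypothetical web-log format in which the HTTP status code appears as a space-separated field, and it uses threads from the Python standard library in place of separate machines. The helper names `split`, `count_errors` and `parallel_count` are made up for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, parts):
    """Divide data into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def count_errors(lines):
    """Solve the small problem: count lines that contain a 500 status field."""
    return sum(1 for line in lines if " 500 " in line)

def parallel_count(lines, workers=4):
    """Split the work, process the chunks concurrently, then combine the partial results."""
    chunks = split(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_errors, chunks))
    return sum(partials)

# Hypothetical log lines: method, path, status, response size.
sample_log = [
    "GET /index.html 200 512",
    "GET /api/users 500 0",
    "POST /api/orders 201 64",
    "GET /api/items 500 0",
    "GET /about.html 200 256",
]
errors = parallel_count(sample_log, workers=2)
```

In a real cluster each chunk would live on a different machine and the counting would run where the data is stored (data locality); only the small partial counts would travel over the network.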
  20. 20. Challenges in distributed computing Distributed computing came with challenges that kept organizations from depending on it:  Concurrency control  Data synchronization  Atomic commit  Transaction split into small tasks  Leader election
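To make the atomic commit challenge concrete, here is a toy, in-memory sketch of the two-phase commit idea: a coordinator first asks every participant to prepare, and commits only if all of them vote yes. The `Participant` class and `two_phase_commit` function are hypothetical names for this illustration; the sketch ignores timeouts, crashes and network failures, which are exactly what make the problem hard in practice.

```python
class Participant:
    """A node holding one piece of a transaction (illustrative only)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        """Phase 1: vote on whether the local piece of work can be committed."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        """Phase 2, success path: make the change permanent."""
        self.state = "committed"

    def abort(self):
        """Phase 2, failure path: roll the change back."""
        self.state = "aborted"

def two_phase_commit(participants):
    """Commit on every participant or on none of them."""
    votes = [p.prepare() for p in participants]  # phase 1: collect all votes
    if all(votes):
        for p in participants:                   # phase 2: everyone commits
            p.commit()
        return True
    for p in participants:                       # phase 2: everyone aborts
        p.abort()
    return False

# One participant refuses, so the whole transaction is rolled back.
nodes = [Participant("a"), Participant("b", can_commit=False), Participant("c")]
committed = two_phase_commit(nodes)
```

The point of the protocol is the all-or-nothing outcome: no node ends up committed while another one has aborted.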
  21. 21. Big data and cloud: converging technologies  Big data: Extracting value out of the “variety, velocity and volume” of the unstructured information available  Cloud: On-demand, elastic, scalable, pay-per-use, self-service model
  22. 22. Answer these before moving to big data analysis  Do you have an effective big data problem?  Can the business benefit from using Big Data?  Do your data volumes really require these distributed mechanisms?
  23. 23. Technology to handle big data  Google was among the first companies to use big data effectively  Engineers at Google created massively distributed systems  They collected and analyzed massive collections of web pages and the relationships between them, and created the “Google Search Engine”, capable of querying billions of pages
  24. 24. First generation of Distributed systems  Proprietary  Custom Hardware and software  Centralized data  Hardware based fault recovery  Eg: Teradata, Netezza etc
  25. 25. Second generation of Distributed systems  Open source  Commodity hardware  Distributed data  Software based fault recovery  Eg : Hadoop, HPCC
  26. 26. Why do we need a new generation?  A lot has changed since 2000  Both hardware and software have gone through changes  Big data has become a necessity  Let’s look at what changed over the decade
  27. 27. Changes in Hardware
      State of hardware in 2000 | State of hardware now
      Disk was cheap, so disk was the primary source of data | RAM is king
      Network was costly, so data locality mattered | RAM is the primary source of data, with disk as the fallback
      RAM was very costly | Networks are faster
      Single-core machines were dominant | Multi-core machines are commonplace
  28. 28. Shortcomings of Second generation  Batch processing is the primary objective  Not designed to change depending upon use cases  Tight coupling between API and run time  Do not exploit new hardware capabilities  Too complex
  29. 29. Third generation distributed systems  Handle both batch processing and real time  Exploit RAM as much as disk  Multi-core aware  Do not reinvent the wheel  They use  HDFS for storage  Apache Mesos / YARN for distribution  Play well with Hadoop
  30. 30. Hadoop vs Spark
      Hadoop | Spark
      Stores data on disk | Stores data in memory (RAM)
      Commodity hardware can be utilized | Needs high-end systems with more RAM
      Uses replication to achieve fault tolerance | Uses different data storage models to achieve fault tolerance (e.g. RDD)
      Processing is slower because of disk reads and writes | Up to 100x faster than Hadoop MapReduce for in-memory workloads
      Native MapReduce API is Java; other languages (e.g. R, Python) run through Hadoop Streaming | Supports Java, Python, R, Scala etc.
      Ease of programming is high: everything is just Map and Reduce | Supports Map, Reduce, SQL, Streaming etc.
      Data should be in HDFS | Data can be in HDFS, Cassandra, HBase or S3; runs on Hadoop, cloud, Mesos or standalone
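The "everything is just Map and Reduce" model can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the programming model only; it does not use the Hadoop or Spark APIs, and the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are chosen for this sketch.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values that share a key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map over every document, shuffle by word, then reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data", "Big ideas"])
```

In a real framework the map calls run in parallel on the machines that hold each block of input, and the shuffle moves data across the network so that every key reaches one reducer. A disk-based engine writes intermediate results to disk between stages, while an in-memory engine tries to keep them in RAM, which is the contrast the comparison above draws.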
  31. 31. Spark Open Source Ecosystem
  32. 32. Who are using Spark
  33. 33. Get Your Hands Dirty With Data
  34. 34. References
      1. The Burtch Works Study, http://www.burtchworks.com/big-data-analyst-salary/big-data-career-tips/
      2. Mathrubhumi, http://digitalpaper.mathrubhumi.com/943320/kochi/21-Sept-2016#page/6/2
      3. Report of McKinsey Global Institute, http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
      4. Devavrat Shah, Professor, Department of Electrical Engineering and Computer Science, MIT, http://blog.edx.org/future-data-science-qa-mit-professional-educations-devavrat-shah
      5. “Data Mining and Data Warehousing”, M. Sudheep Elayidom, SOE, CUSAT
      6. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012, April 2012. Best Paper Award.
      7. “What is Big Data”, https://www-01.ibm.com/software/in/data/bigdata/
      8. “Apache Hadoop”, https://hadoop.apache.org/
      9. “Apache Spark”, http://spark.apache.org/
