
Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j [Neo4j Online Meetup]



Let’s tackle problems in software development in an automated, data-driven and reproducible way!

As developers, we often feel that there might be something wrong with the way we develop software. Unfortunately, a gut feeling alone isn’t sufficient for the complex, interconnected problems in software systems.

We need solid, understandable arguments to win budgets for improvement projects or to defend ourselves against political decisions. But we can help ourselves: every step in the development or use of software leaves valuable digital traces. With clever analysis, this data can show us the root causes of problems in our software and deliver new insights that everybody can understand.

If concrete problems and their impact are known, developers and managers can create solutions and take sustainable actions aligned to existing business goals.

In this meetup, I talk about the analysis of software data by using a digital notebook approach. This allows you to express your gut feelings explicitly with the help of hypotheses, explorations and visualizations step by step.

I show how open source analysis tools (Jupyter, Pandas, jQAssistant and, of course, Neo4j) work together to inspect problems in Java applications and their environment. We have a look at performance hotspots, knowledge loss and worthless code parts, completely automated from raw data up to visualizations for management.

Participants learn how data analysis can turn their uncertain gut feelings into solid evidence for obtaining budgets for dedicated improvement projects.

Published in: Software


  1. Software Analytics with Jupyter, Pandas, jQAssistant and Neo4j: Identifying Problems in Software Development with Data Analysis. Markus Harrer (@feststelltaste), Neo4j Online Meetup, 23rd November 2017
  2. About me: Markus Harrer (@feststelltaste), Software Development Analyst. Key Activities: Java Development, Data Analysis in Software Development. Areas of Interest: Clean Code, Agile, Software Archeology, Software Revival, Epistemology, Cognitive Psychology
  3. Agenda: 1. Motivation 2. Software Analytics 3. My implementation of Software Analytics 4. Examples & Demos 5. Summary 6. Q&A
  4. Motivation: Everything wrong with Software Development
  5. Meanwhile in the pub…
  6. Symptom Fixing
  7. Lack of Communication
  8. Politics
  9. Why is software development still so crazy?
  10. WALL OF IGNORANCE (Janelle Klein: IDEAFLOW - How to Measure the PAIN in Software Development. Leanpub)
  11. WALL OF IGNORANCE, RISK, VISIBILITY (Janelle Klein: IDEAFLOW - How to Measure the PAIN in Software Development. Leanpub)
  14. Software Analytics: Sober Problem Solving with Data Analysis based on Software Data
  15. Software Analytics is... “... analytics on software data for managers and software engineers with the aim of empowering software development individuals and teams to gain and share insight from their data to make better decisions.” (Tim Menzies, Thomas Zimmermann: Software Analytics - So What?. IEEE Software Magazine)
  16. Right Insights for better Decisions. Chart of Questions by Frequency against Risk/Value: use standard tools for everyday questions; use Software Analytics to tackle high-risk problems. (Adapted from Tim Menzies, Thomas Zimmermann: Software Analytics - So What?. IEEE Software Magazine)
  17. Types of Software Data: Community, chronological, Runtime, static => Problems are interconnected, and so should be the data sources!
  18. My Guideline: Tackling problems in an automated, data-driven and reproducible way. Software Analytics = Data Science on Software Data
  19. Why does it work now? • Domain-Driven Design brings business language into code • Data Science enables problem analysis for developers • New Tools can create high-level concepts. Diagram: Code, Problems and Business Language on a scale from detailed to abstract. Problems can be connected to concepts in business terms!
  20. My implementation of Software Analytics: How can Developers use the Power of Data Analysis in their Daily Work?
  21. What can you do today? • Visualize developer contributions over time • Identify unused, error-prone or abandoned code • Create a code and problem inventory for legacy systems • Find performance bottlenecks by analyzing call trees • Visualize unwanted dependencies between modules. Make specific problems in your software system visible, e.g. Race Conditions, Architecture Smells, Build Breakers, Programming Errors
  22. Choose known tools or tools for plan B* (Python; Neo4j, Pandas, Spark) on a suitable platform (Jupyter, Zeppelin). * Tools you want to learn / profit from in the near future. => Tools shouldn't stand in the way!
  23. Notebook: an open dialog with data. Context, Idea, Analysis, Conclusion, Problem. Context documented; Ideas, assumptions and heuristics communicated; Preprocessing justified; Calculations understandable; Summaries conclusive; Everything automated
  24. Notebook-Driven Data Analysis
  25. Basic Tooling. Python (Data Scientist's Best Friend): easy, effective, fast programming language. Pandas (Pragmatic Data Analysis Framework): great data structures & integrations with machine learning libraries. D3 (Visualization Library for Data-Driven Documents): just beautiful, interactive graphics! Jupyter (Interactive Notebook): central hub for data analysis and documentation
  26. Advanced Tooling: jQAssistant + Neo4j = scan, document, validate
  27. Advanced Tooling: jQAssistant & Neo4j. Main Ideas: • Scan software structures • Store data in Neo4j database • Execute queries • Examine relationships • Add high-level concepts • Validate rules via constraints • Generate reports
  28. jQAssistant Use Cases: Living, self-validating architecture documentation
  29. jQAssistant Use Cases: Living, self-validating architecture documentation + Find design & code smells + Add business perspectives (Java Class, Business Subdomain)
  30. Neo4j Schema for Software Data. Node Labels: File, Class, Method, Commit. Relationship Types: CONTAINS, DEPENDS_ON, INVOKES, CONTAINS_CHANGE. Properties: name, fqn, signature, message. Example node (labels File, Java, Type) with key/value properties: name “Pet”, fileName “”, fqn “”
  31. Cypher Query Example (Spring PetClinic): “Give me all database objects” MATCH (t:Type)-[:ANNOTATED_BY]->()-[:OF_TYPE]->(a:Type) WHERE a.fqn="javax.persistence.Entity" RETURN t AS JpaEntity
  32. Toolchain (Python, Jupyter). Input: XML/Graph, Tables and Text data via jQAssistant and Pandas. Analysis: Pandas, Neo4j. Output: matplotlib, D3, xlsx, pptx
  33. Examples: The complete Toolchain in Action
  34. Example: JaCoCo → Pandas → D3. Production Coverage: 1. Measure code coverage in production 2. Calculate ratio of covered lines to all lines 3. Visualize “usage hotspots” with hierarchical bubble chart
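Step 2 of the production coverage example can be sketched in Pandas roughly as follows. This is a minimal illustration, not the code from the talk: the inline CSV is made-up sample data that reuses a few column names of a JaCoCo CSV report (`PACKAGE`, `CLASS`, `LINE_MISSED`, `LINE_COVERED`), and the package and class names are hypothetical.

```python
import io

import pandas as pd

# Made-up stand-in for a JaCoCo CSV report (hypothetical classes and numbers).
JACOCO_CSV = """GROUP,PACKAGE,CLASS,LINE_MISSED,LINE_COVERED
app,org.example.owner,Owner,10,30
app,org.example.owner,OwnerController,40,0
app,org.example.vet,Vet,5,15
"""


def coverage_ratio(csv_text):
    """Add the total line count and the covered-to-all-lines ratio per class."""
    df = pd.read_csv(io.StringIO(csv_text))
    df["lines"] = df["LINE_MISSED"] + df["LINE_COVERED"]
    df["ratio"] = df["LINE_COVERED"] / df["lines"]
    return df


coverage = coverage_ratio(JACOCO_CSV)

# Roll the line counts up to package level before recomputing the ratio,
# so that large classes weigh more than small ones.
per_package = coverage.groupby("PACKAGE")[["LINE_COVERED", "lines"]].sum()
per_package["ratio"] = per_package["LINE_COVERED"] / per_package["lines"]
print(per_package)
```

The per-class `lines` and `ratio` columns are the kind of values a hierarchical bubble chart could use for bubble size and color.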
  35. Example: Git → Pandas → D3. Knowledge Island*: 1. Take Git log with numstats 2. Calculate proportional contributions for each source code file per author 3. Visualize “ownership” with hierarchical bubble chart. * heavily inspired by Adam Tornhill
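Steps 1 and 2 of the knowledge island example could look roughly like this in Pandas. The sketch assumes a log produced by something like `git log --numstat --pretty=format:"--%an"` (author lines prefixed with `--`, followed by tab-separated added, deleted and path columns). The sample log, the author names and the choice of added lines as the contribution measure are assumptions for illustration.

```python
import pandas as pd

# Made-up excerpt in the shape of `git log --numstat --pretty=format:"--%an"`.
GIT_LOG = """--Alice
10\t2\tsrc/Owner.java
5\t0\tsrc/Vet.java

--Bob
30\t1\tsrc/Owner.java

--Alice
0\t5\tsrc/Vet.java
"""


def parse_numstat(log_text):
    """Turn the raw log into one row per (commit author, changed file)."""
    rows, author = [], None
    for line in log_text.splitlines():
        if line.startswith("--"):
            author = line[2:]
        elif line.strip():
            added, _deleted, path = line.split("\t")
            if added == "-":  # numstat prints "-" for binary files
                continue
            rows.append({"author": author, "file": path, "added": int(added)})
    return pd.DataFrame(rows)


log = parse_numstat(GIT_LOG)

# Added lines per file and author, divided by all added lines of that file.
added = log.groupby(["file", "author"])["added"].sum()
ownership = (added / added.groupby(level="file").transform("sum")).rename("ownership")
print(ownership)
```

A file where a single author holds a share close to 1.0 is a candidate knowledge island.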
  36. Example: jQAssistant → Neo4j → Pandas → D3. Dependency Analysis between Bounded Contexts
  37. Example: jQAssistant → Neo4j → Pandas → D3. Dependency Analysis between Bounded Contexts: MATCH (s1:Subdomain)<-[:BELONGS_TO]- (type:Type)-[r:DEPENDS_ON*0..1]-> (dependency:Type)-[:BELONGS_TO]->(s2:Subdomain) RETURN as type, as dep, COUNT(r) as number. Subdomains => Bounded Contexts that have meaning to business!
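Once a query like the one above has returned its rows, Pandas can reshape them into a dependency matrix between subdomains. The sketch below assumes the result has already been fetched into plain Python dictionaries. The column names `type`, `dep` and `number` follow the aliases in the query, while the subdomain names and counts are invented.

```python
import pandas as pd

# Hypothetical query result: source subdomain, target subdomain, dependency count.
rows = [
    {"type": "owner", "dep": "owner", "number": 12},
    {"type": "owner", "dep": "pet", "number": 4},
    {"type": "visit", "dep": "pet", "number": 3},
    {"type": "pet", "dep": "owner", "number": 1},
]
deps = pd.DataFrame(rows)

# Rows are the depending subdomain, columns the subdomain depended upon.
matrix = deps.pivot_table(
    index="type", columns="dep", values="number", aggfunc="sum", fill_value=0
)

# Dependencies that cross a subdomain boundary are the interesting ones.
cross = deps[deps["type"] != deps["dep"]]
print(matrix)
print(cross)
```

The `matrix` frame is in a shape that a chord diagram or heat map in D3 or matplotlib could consume.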
  38. Example: JProfiler → jQAssistant → Neo4j → Pandas. Mining performance hotspots: 1. Record Call Trees 2. Identify which parts of the application code are responsible for most of the DB operations 3. Trace problems back to the root causes. Diagram: Incoming Requests, Outgoing SQL Calls
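The idea behind step 2 can be sketched in plain Python on a tiny in-memory call tree. In the talk the call trees live in Neo4j, so this only illustrates the aggregation. The tree structure, the `kind` field and all request, method and SQL names are hypothetical.

```python
def count_sql_calls(node):
    """Count the SQL statements in this call and in everything it invokes."""
    own = 1 if node["kind"] == "sql" else 0
    return own + sum(count_sql_calls(child) for child in node.get("children", []))


# Made-up recorded call trees: one tree per incoming request.
requests = [
    {"name": "GET /owners", "kind": "request", "children": [
        {"name": "OwnerController.list", "kind": "method", "children": [
            {"name": "SELECT owners", "kind": "sql"},
            {"name": "OwnerRepository.findPets", "kind": "method", "children": [
                {"name": "SELECT pets", "kind": "sql"},
                {"name": "SELECT pets", "kind": "sql"},
            ]},
        ]},
    ]},
    {"name": "GET /vets", "kind": "request", "children": [
        {"name": "VetController.list", "kind": "method", "children": [
            {"name": "SELECT vets", "kind": "sql"},
        ]},
    ]},
]

# Rank incoming requests by the number of outgoing SQL calls they trigger.
hotspots = sorted(
    ((request["name"], count_sql_calls(request)) for request in requests),
    key=lambda item: item[1],
    reverse=True,
)
print(hotspots)
```

Applying the same function to inner nodes shows which method below a request causes most of the SQL calls, which is the starting point for tracing a problem back to its root cause.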
  39. Example: jQAssistant → Neo4j → Pandas. Recursive Method Calls: MATCH (m:Method)-[:INVOKES*]->(m) RETURN m
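To make clear what the variable-length pattern `(m)-[:INVOKES*]->(m)` is looking for, here is a plain Python sketch of the same idea on a small in-memory call graph: it returns every method that can reach itself through one or more calls. The graph and method names are invented, and this is not how Neo4j evaluates the query internally.

```python
def methods_on_cycles(invokes):
    """Return all methods that can reach themselves via one or more calls."""

    def reachable_from(start):
        # Iterative depth-first search over the INVOKES edges.
        seen, stack = set(), list(invokes.get(start, ()))
        while stack:
            method = stack.pop()
            if method not in seen:
                seen.add(method)
                stack.extend(invokes.get(method, ()))
        return seen

    return {method for method in invokes if method in reachable_from(method)}


# Hypothetical call graph: caller -> list of invoked methods.
calls = {
    "A.run": ["B.load"],
    "B.load": ["C.save", "A.run"],
    "C.save": [],
    "D.count": ["D.count"],
}

recursive = methods_on_cycles(calls)
print(sorted(recursive))
```

Both direct recursion (`D.count` calling itself) and indirect recursion (`A.run` and `B.load` calling each other) are found, which matches the intent of the Cypher pattern.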
  40. Example: jQAssistant → Neo4j → Pandas. Recursive Method Calls to Database: MATCH (m:Method)-[:INVOKES*]->(m) -[:INVOKES]->(dbMethod:Method) <-[:DECLARES]-(dbClass:Class) WHERE = "Database" RETURN m, dbMethod, dbClass
  41. Example: jQAssistant → Neo4j → Pandas. Identify possible Race Conditions: public class OwnerController { ... private static int ownersIndexes; MATCH (c:Class)-[:DECLARES]->(f:Field)<-[w:WRITES]-(m:Method) WHERE EXISTS(f.static) AND NOT EXISTS( RETURN,, w.lineNumber, (static = same field for all instances of that class)
  42. Summary
  43. Summary • Tooling for data analysis in software development is here! • First analyses are easy to do using tools you already know • Specific in-depth analyses are powerful and worthwhile • Connection between business and developers is possible! • Problems can be attached to code that is business-related • Making the impact of risk-taking visible is a must-have to improve! • Jupyter/Pandas & jQAssistant/Neo4j are my favorites • Provide many ways for identifying problems • Help to figure out solutions as well!
  44. Links. Markus Harrer • Blog: • Twitter: • SlideShare: • Consulting: jQAssistant/Neo4j • Demos: • Guide: • Talk by Dirk Mahler:
  45. Q&A: Questions and Answers