Investigating the Panama Papers Connections with Neo4j - Stefan Kolmar, Neo4j

Neo4j PartnerTag München 2017
Stefan Kolmar, Neo4j

  1. 1. Investigating the #PanamaPapers Connections with Neo4j. PartnerDay Munich. Stefan Kolmar, Director Field Engineering
  2. 2. Source Material taken from • the ICIJ presentation • the Reddit AMA • online publications (SZ, Guardian, TNW et al.) • the ICIJ website • https://panamapapers.icij.org/ • The Power Players • Key Numbers & Figures
  3. 3. 190+ journalists in more than 65 countries; 12 staff members (USA, Costa Rica, Venezuela, Germany, France, Spain); 50% of the team = Data & Research Unit
  4. 4. Diagram labels: raw files; metadata (author, sender, ...); raw text; database; search and discovery
  5. 5. 3 million files x 10 seconds per file = 347 days. Investigators used Nuix’s optical character recognition to make millions of scanned documents text-searchable. They used Nuix’s named entity extraction and other analytical tools to identify and cross-reference the names of Mossack Fonseca clients through millions of documents.
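The 347-day figure follows from simple arithmetic. A minimal sketch, assuming the files are processed one after another; the `workers` parameter is a hypothetical addition for illustration and is not from the talk:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def processing_days(files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Days needed to process `files` documents at a fixed cost per file.

    `workers` is a hypothetical knob: it assumes the work splits evenly
    across identical parallel workers.
    """
    total_seconds = files * seconds_per_file
    return total_seconds / workers / SECONDS_PER_DAY


# 3,000,000 files * 10 s = 30,000,000 s, a little over 347 days on one worker.
print(round(processing_days(3_000_000, 10)))
```

Under the even-split assumption, doubling `workers` halves the estimate, which is one way to read why a sequential pass over the documents was not practical.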
  6. 6. Lucene syntax queries with proximity matching! 400 users
  7. 7. Stack. Unstructured data extraction ● Nuix professional OCR service ● ICIJ Extract (open source, Java: https://github.com/ICIJ/extract), leverages Apache Tika, Tesseract OCR and JBIG2-ImageIO. Structured data extraction ● A bunch of Python. Database ● Apache Solr (open source, Java) ● Redis (open source, C) ● Neo4j (open source, Java). App ● Blacklight (open source, Rails) ● Linkurious (closed source, JS)
  8. 8. Context is King. Node labels: PERSON (x4). Node properties: name: "John", last: "Miller", role: "Negotiator"; name: "Maria", last: "Osara"; name: "Some Media Ltd", value: "$70M"; name: "Jose", last: "Pereia", position: "Governor"; name: "Alice", last: "Smith", role: "Advisor"
  9. 9. Context is King. Relationship types: SENT, SUPPORTS, CREATED, MENTIONS, WROTE. Node labels: PERSON (x4). Properties: name: "John", last: "Miller", role: "Negotiator"; name: "Maria", last: "Osara"; since: Jan 10, 2011; name: "Some Media Ltd", value: "$70M"; name: "Jose", last: "Pereia", position: "Governor"; name: "Alice", last: "Smith", role: "Advisor"
  10. 10. The world is a graph – everything is connected • people, places, events • companies, markets • countries, history, politics • sciences, art, teaching • technology, networks, machines, applications, users • software, code, dependencies, architecture, deployments • criminals, fraudsters and their behavior
  11. 11. Property Graph Model. Nodes • The entities in the graph • Can have name-value properties • Can be labeled. Relationships • Relate nodes by type and direction • Can have name-value properties. Diagram: NODE and RELATIONSHIP elements, each annotated with key: "value" properties
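The property graph model described on this slide can be sketched with two small data classes. This is an illustrative in-memory model, not how Neo4j stores data; the class names, the example names and the `since` value are assumptions chosen for the example:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class Node:
    """An entity: zero or more labels plus name-value properties."""
    labels: Set[str] = field(default_factory=set)
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """Connects two nodes with a type and a direction (start -> end)."""
    start: Node
    type: str
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical example data.
dan = Node({"Person"}, {"name": "Dan"})
ann = Node({"Person"}, {"name": "Ann"})
knows = Relationship(dan, "KNOWS", ann, {"since": 2011})

print(knows.start.properties["name"], knows.type, knows.end.properties["name"])
```

Both nodes and relationships carry properties, which is the point the slide makes twice: context can live on the connection itself, not only on the entities.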
  12. 12. Your friend Neo4j. An open-source graph database • Manage and store your connected data as a graph • Query relationships easily and quickly • Evolve model and applications to support new requirements and insights • Built to solve relational pains
  13. 13. Value from Data Relationships: Common Use Cases. Internal Applications: Master Data Management, Network and IT Operations, Fraud Detection. Customer-Facing Applications: Real-Time Recommendations, Graph-Based Search, Identity and Access Management. http://neo4j.com/use-cases
  14. 14. Whiteboard to Graph
  15. 15. Neo4j: All about Patterns. (:Person {name:"Dan"}) -[:KNOWS]-> (:Person {name:"Ann"}) Diagram: Dan KNOWS Ann, annotated with NODE, LABEL and PROPERTY. http://neo4j.com/developer/cypher
  16. 16. Cypher: Find Patterns. MATCH (:Person {name:"Dan"}) -[:KNOWS]-> (who:Person) RETURN who Diagram: Dan KNOWS ???, annotated with NODE, LABEL, PROPERTY and ALIAS. http://neo4j.com/developer/cypher
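To make the matching semantics of the query above concrete, here is a rough Python sketch of what such a pattern asks for: start nodes with a given label and property, an outgoing relationship of a given type, and end nodes with a given label. The in-memory graph, the `Acme` node and the function name `match_outgoing` are hypothetical; a real database would use indexes rather than a linear scan:

```python
# Each node id maps to (labels, properties); relationships are (start, type, end).
nodes = {
    1: ({"Person"}, {"name": "Dan"}),
    2: ({"Person"}, {"name": "Ann"}),
    3: ({"Company"}, {"name": "Acme"}),  # hypothetical extra node
}
relationships = [(1, "KNOWS", 2), (1, "WORKS_AT", 3)]


def match_outgoing(start_label, start_props, rel_type, end_label):
    """Return the properties of every end node that completes the pattern."""
    results = []
    for start_id, r_type, end_id in relationships:
        s_labels, s_props = nodes[start_id]
        e_labels, e_props = nodes[end_id]
        if (
            r_type == rel_type
            and start_label in s_labels
            and all(s_props.get(k) == v for k, v in start_props.items())
            and end_label in e_labels
        ):
            results.append(e_props)
    return results


# Roughly: MATCH (:Person {name:"Dan"})-[:KNOWS]->(who:Person) RETURN who
who = match_outgoing("Person", {"name": "Dan"}, "KNOWS", "Person")
print(who)  # [{'name': 'Ann'}]
```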
  17. 17. Getting Data into Neo4j: Cypher-Based “LOAD CSV” • Transactional (ACID) writes • Initial and incremental loads of up to 10 million nodes and relationships. LOAD CSV WITH HEADERS FROM "url" AS row MERGE (:Person {name:row.name, age:toInt(row.age)});
  18. 18. Getting Data into Neo4j: Load JSON with Cypher • Load JSON via procedure • Deconstruct the document • Into a non-duplicated graph model. CALL apoc.load.json("url") YIELD value AS doc UNWIND doc.items AS item MERGE (:Contract {title:item.title, amount:toFloat(item.amount)});
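The `MERGE` in the statement above is what makes repeated or incremental loads safe: a node is created only if no node with the same label and properties already exists. A minimal Python sketch of that get-or-create behaviour, using an in-memory dictionary in place of the database; the sample rows and the function name `load_people` are made up for illustration:

```python
import csv
import io


def load_people(csv_text, people):
    """Add one entry per distinct (name, age) pair; existing entries are kept.

    This mimics MERGE on a node pattern with all properties specified:
    match if present, create otherwise.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["name"], int(row["age"]))  # int() plays the role of toInt()
        people.setdefault(key, {"name": key[0], "age": key[1]})
    return people


# Hypothetical sample data with a duplicate row.
sample = "name,age\nDan,42\nAnn,37\nDan,42\n"

people = load_people(sample, {})
load_people(sample, people)  # loading the same file again adds nothing
print(len(people))  # 2
```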
  19. 19. Getting Data into Neo4j: CSV Bulk Loader neo4j-import • For initial database population • For loads with 10B+ records • Up to 1M records per second. bin/neo4j-import --into people.db --nodes:Person people.csv --nodes:Company companies.csv --relationships:STAKEHOLDER stakeholders.csv
  20. 20. The Steps Involved in the Document Analysis 1. Acquire documents 2. Classify documents • Scan / OCR • Extract document metadata 3. Whiteboard domain and questions, determine • entities and their relationships • potential entity and relationship properties • sources for those entities and their properties
  21. 21. The Steps Involved in the Document Analysis 4. Develop analyzers, rules, parsers and named entity recognition 5. Parse and store metadata, document and entity relationships • Parse by author, named entities, dates, sources and classifications 6. Infer entity relationships 7. Compute similarities, transitive cover and triangles 8. Analyze data using graph queries and visualizations
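Step 7 mentions computing triangles, that is, groups of three entities that are all connected to each other. The talk does not say how this was done; the following is a minimal sketch of one common approach, treating relationships as undirected, with made-up node names:

```python
from itertools import combinations


def count_triangles(edges):
    """Count distinct triangles in an undirected graph given as (a, b) pairs."""
    adjacency = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    triangles = set()
    for node, neighbours in adjacency.items():
        # Two neighbours of `node` that are also linked to each other close a triangle.
        for u, v in combinations(sorted(neighbours), 2):
            if v in adjacency[u]:
                triangles.add(frozenset((node, u, v)))
    return len(triangles)


# Hypothetical data: A, B and C are mutually connected; D hangs off C.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(count_triangles(edges))  # 1
```

Storing each triangle as a `frozenset` avoids counting the same three nodes once per corner.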
  22. 22. We need a Data Model. Meta Data Entities • Document, Email, Contract, DB-Record • Meta: Author, Date, Source, Keywords • Conversation: Sender, Receiver, Topic • Money Flows. Actual Entities • Person • Representative (Officer) • Address • Client • Company • Account. Based either on our use cases & questions, or on the entities present in our meta-data and data.
  23. 23. Data Model – Relationships. Meta-Data • sent, received, cc'ed • mentioned, topic-of • created, signed • attached • roles • family relationships. Activities • open account • manage • has shares • registered address • money flow
  24. 24. The ICIJ Data Model
  25. 25. The ICIJ Data Model • Simplistic data model with 4 entities and 5 relationships • We only know the published model • Missing • Documents, Metadata • Family Relationships • Connections to Public Record Databases • Contains duplicates • Relationship information stored on entities • Could use richer labeling
  26. 26. Example Dataset - Azerbaijan’s President Ilham Aliyev • was already previously investigated • whole family involved • different shell companies & involvements http://neo4j.com/graphgist/ec65c2fa-9d83-4894-bc1e-98c475c7b57a
  27. 27. Based On: http://neo4j.com/blog/analyzing-panama-papers-neo4j/
