3. “… any illegal act characterized by deceit, concealment, or
violation of trust. These acts are not dependent upon the
threat of violence or physical force. Frauds are perpetrated
by parties and organizations to obtain money, property, or
services; to avoid payment or loss of services; or to secure
personal or business advantage.”
International Professional
Practices Framework (IPPF)
DEFINITION
5. DOMAIN OF APPLICATION
WHERE CAN FRAUD BE FOUND?
▸ Healthcare systems
▸ Credit card systems
▸ Social networks
▸ Satellite and military control systems
▸ …
7. DIFFERENCES
WHAT ARE THE CHARACTERISTICS OF HEALTHCARE DOMAIN DATA?
▸ The complexity and number of fields in this kind of data are
tremendous.
▸ People and organizations attempt to profit at the expense of others.
▸ Data is really BIG and sometimes arrives as a stream.
▸ Many kinds of data: images, raw text, sound, …
▸ Data are unlabeled and hard to classify.
▸ Concept drift.
9. TIPS
ROLE OF BIG DATA IN HEALTHCARE
▸ DNA: one of the most important public datasets on Amazon.
▸ Stanford's Big Data conference is all about healthcare.
▸ Microsoft has established an academic division to work on
healthcare.
▸ Many countries lose money to FRAUD in healthcare (up to
10% of US annual health care expenditure).
10. EXAMPLES
SOME FRAUDS THAT TRADITIONAL HEALTHCARE SYSTEMS HAVE FACED
▸ Changing a patient's insurance identification document
▸ A doctor prescribing only certain fixed brands of drugs
▸ Prescribing more expensive drugs than is usual for the same
disease
▸ A patient obtaining certain kinds of drugs more often than usual
▸ And many more…
14. STATISTICAL METHODS
STATISTICAL METHODS…
▸ Based on a set of rules
▸ Rules are described by a domain expert
▸ An application computes initial statistical parameters, e.g.:
▸ Average number of drugs per prescription
▸ Total cost per disease
▸ These baselines are then compared with new data; if a large
difference is found, an ALARM GOES OFF
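The rule-and-threshold comparison above can be sketched as follows. This is a minimal illustration, assuming a single parameter (drugs per prescription) and a standard-deviation threshold; the function names and data are invented for the example.

```python
# Rule-based statistical check: build a baseline from historical data,
# then raise an alarm when a new value deviates too far from it.
def build_baseline(drug_counts):
    """Mean and standard deviation of drugs per prescription."""
    n = len(drug_counts)
    mean = sum(drug_counts) / n
    var = sum((x - mean) ** 2 for x in drug_counts) / n
    return mean, var ** 0.5

def alarm(new_count, mean, std, k=3.0):
    """Alarm goes off if the new value is more than k std devs away."""
    return abs(new_count - mean) > k * std

baseline = [2, 3, 2, 4, 3, 2, 3, 3, 2, 4]  # historical drugs per prescription
mean, std = build_baseline(baseline)
print(alarm(3, mean, std))   # typical prescription -> False
print(alarm(15, mean, std))  # suspiciously many drugs -> True
```

The threshold `k` is exactly the kind of parameter the slides say a domain expert must set, which is where the low flexibility comes from.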
15. STATISTICAL METHODS
CONS AND PROS
▸ Very simple and easy to implement
▸ Low computation overhead
▸ Very easy to use for streaming data
▸ Low flexibility
▸ Cannot handle concept drift
▸ Adding rules is hard
▸ Everything is based on domain-expert knowledge
▸ The defined rule set may not be complete
17. MACHINE LEARNING AND DATA MINING ALGORITHMS
MACHINE LEARNING ALGORITHMS
▸ Choose one or more machine learning algorithms based
on the data
▸ Use them to learn from the data and detect frauds
▸ If the data are labeled, classification is a perfect fit
▸ Otherwise, use clustering
▸ Or use clustering to label the data, then apply classification
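The cluster-then-classify idea in the last bullet can be sketched like this. It is a toy illustration with one invented feature (claim amount) and a hand-rolled one-dimensional 2-means; it is not the paper's method.

```python
# Cluster-then-classify: cluster unlabeled claims, treat cluster ids as
# labels, then classify new claims by the nearest cluster centroid.
def two_means(xs, iters=20):
    """Simple 1-D 2-means clustering; returns the two centroids."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0 = sum(g0) / len(g0)
        c1 = sum(g1) / len(g1)
    return c0, c1

def classify(x, c0, c1):
    """Label a new claim with the nearest centroid's cluster id."""
    return 0 if abs(x - c0) <= abs(x - c1) else 1

claims = [100, 120, 110, 105, 900, 950, 980]  # two obvious groups
c0, c1 = two_means(claims)
print(classify(115, c0, c1))  # -> 0 (normal-looking amount)
print(classify(940, c0, c1))  # -> 1 (high-cost cluster)
```

A real pipeline would use many features per claim; the one-number feature here only shows the flow from unlabeled data to a usable classifier.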
19. GRAPH BASED FRAUD DETECTION
GRAPH ANALYSIS
▸ It has been gaining popularity since 2015
▸ It is still just an assistant technique used alongside machine
learning algorithms
▸ It cannot consider all aspects of fraud
▸ But it is handy
26. ENVIRONMENT OF PAPER
ENVIRONMENT OF PAPER
▸ Dataset: CMS Medicare Part-B
▸ Used Apache Hadoop and Apache Pig
▸ 8 nodes
▸ 4 cores for each node
▸ 64 GB of memory for each node
▸ Total time of execution: 3 hours
28. STEP 1
COMPUTE THE SIMILARITY BETWEEN PROVIDERS
▸ Computing similarities between providers based on
shared procedures
▸ If the similarity of two providers is above a threshold, an
edge connects them
▸ Locality-Sensitive Hashing and DIMSUM could help, but were not used
▸ 880K providers => 774 billion similarity computations
▸ My dataset: ~140 providers => ~20K similarity computations
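A minimal sketch of step 1, assuming each provider is represented as a vector of per-procedure counts and cosine similarity is the measure. The slides do not spell out the exact measure or threshold, so both are assumptions, as are the toy provider vectors.

```python
from math import sqrt

# Step 1 sketch: providers as per-procedure count vectors; connect two
# providers with an edge when their cosine similarity exceeds a threshold.
def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_edges(providers, threshold=0.8):
    """All-pairs comparison: O(n^2), which is why 880K providers blow up."""
    ids = list(providers)
    edges = []
    for i, p in enumerate(ids):
        for q in ids[i + 1:]:
            if cosine(providers[p], providers[q]) > threshold:
                edges.append((p, q))
    return edges

providers = {
    "A": [10, 0, 5],  # counts of three procedures per provider
    "B": [9, 1, 4],
    "C": [0, 8, 0],
}
print(build_edges(providers))  # -> [('A', 'B')]
```

The nested loop makes the quadratic cost of the all-pairs step explicit; LSH or DIMSUM would prune most of these comparisons at the price of approximation.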
30. STEP 2
COMPUTING PERSONALIZED PAGE RANK FOR EACH SPECIALITY
▸ Loop over all specialities
▸ For each speciality, apply Personalized PageRank to the
graph
▸ Identify anomalous providers: nodes whose PR_speciality score
is high but whose speciality is not the one used for the
PageRank calculation
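The loop above can be sketched with a plain power-iteration Personalized PageRank restricted to one speciality's seed set. The tiny graph, the speciality labels, and the flagging threshold are all invented for illustration.

```python
# Step 2 sketch: Personalized PageRank seeded on one speciality.
# Nodes outside that speciality which still score high are anomalous.
def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Power iteration with restarts into the seed set."""
    nodes = list(adj)
    restart = {n: (1 / len(seeds) if n in seeds else 0.0) for n in nodes}
    pr = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            share = alpha * pr[n] / len(adj[n]) if adj[n] else 0.0
            for m in adj[n]:
                nxt[m] += share
        pr = nxt
    return pr

adj = {  # similarity edges from step 1 (undirected)
    "d1": ["d2", "s1"], "d2": ["d1", "s1"],
    "s1": ["d1", "d2"], "s2": [],
}
speciality = {"d1": "derm", "d2": "derm", "s1": "surgeon", "s2": "surgeon"}
pr = personalized_pagerank(adj, seeds={"d1", "d2"})
# A surgeon strongly connected to dermatologists ranks high -> anomaly
flagged = [n for n, s in pr.items() if speciality[n] != "derm" and s > 0.1]
print(flagged)  # -> ['s1']
```

Surgeon `s1` is flagged because it sits inside the dermatologists' similarity neighborhood, while the unconnected surgeon `s2` scores zero; this is exactly the "high PR, wrong speciality" signal the slides describe.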
31. EXAMPLE
PERSONALIZED PAGE RANK ON DERMATOLOGIST SPECIALITY
[Graph figure: Personalized PageRank scores (24.1%, 24.1%, 18.7%, 15.4%, 7.9%, 4.1%, 2.9%, 2.9%) over a provider graph with nodes labeled Dermatologist, Surgeon, and Internist]
33. SPARK IMPLEMENTATION
WHAT WE DID IN SPARK
▸ Implemented from scratch
▸ Changed the PageRank algorithm in Spark GraphX
▸ Every Personalized PageRank runs for 100 iterations
▸ The dataset contains 20,000 raw records
▸ It took 20 minutes to run the algorithm on a 4-core Core i7
MacBook Pro with 4 GB of memory (the main part of which
was occupied by the OS)
39. SOLUTION ANALYSIS
CONS AND PROS
▸ The algorithm needs to compute the similarity for all pairs of providers
▸ It considers only one aspect of fraud; it is not complete
▸ Low speed, and it needs a huge amount of memory (because all
similarities are computed first): 2 GB of data needs 512 GB of RAM
▸ Hard to add new data and update the graph
▸ High cost of step 2
▸ Needs defined rules to use graph analysis (in other papers)
40. SOLUTION ANALYSIS
CONS AND PROS
▸ Step 1 needs a shuffle => reduced performance
▸ Modeling as a graph => easy to understand and analyze
▸ A new, still-progressing approach to fraud detection
▸ Capable of using LSH, but 100% accuracy was wanted
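The LSH alternative mentioned above can be sketched with MinHash, which estimates the Jaccard similarity of two providers' procedure sets from short signatures instead of exact all-pairs computation; the small, tunable error is the accuracy loss the authors did not accept. The procedure codes and hash count here are illustrative assumptions.

```python
import random
from zlib import crc32

# MinHash sketch: per-hash minima over a set form a signature; the
# fraction of agreeing slots estimates Jaccard similarity.
def make_hashers(k, seed=42):
    """k affine hash functions over a stable base hash (crc32)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, 2**31), rng.randrange(2**31))
              for _ in range(k)]
    return [lambda x, a=a, b=b: (a * crc32(x.encode()) + b) % (2**31 - 1)
            for a, b in params]

def minhash_signature(items, hashers):
    """Signature = minimum hash value per function over the set."""
    return [min(h(x) for x in items) for h in hashers]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

hashers = make_hashers(128)
a = {"cpt_99213", "cpt_11100", "cpt_17000"}  # provider A's procedures
b = {"cpt_99213", "cpt_11100", "cpt_17110"}  # provider B's procedures
sig_a = minhash_signature(a, hashers)
sig_b = minhash_signature(b, hashers)
print(estimated_jaccard(sig_a, sig_b))  # close to the true Jaccard of 0.5
```

With 128 hashes the estimate sits near the true value of 0.5; more hashes tighten it, which is the speed-versus-accuracy knob the slides allude to.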
41. SUGGESTIONS
FUTURE
▸ Use other centrality algorithms
▸ Use algorithms like community detection instead of
clustering
▸ If we inject patient data, we can do more (in a bipartite
graph we can detect frauds of more popular providers).