Learning from 6,000 projects mining specifications in the large

Learning from
6,000 Projects
Mining Models in the Large

Andreas Zeller
Saarland University

Saarbrücken

Visual Computing Institute

Some numbers
• ~70 PhD advisors in computer science
• ≥ 300 PhD students in computer science
• ~60 new PhD graduates per year
• ~60 new MSc graduates per year
• 800–1400 € per month as a PhD stipend
(+ laptop & office • starting right after BSc • all courses in English)

Two Graduates

Michael Backes (TR35 in 2009): secure protocols
Andrej Rybalchenko (TR35 in 2010): loop termination


hard to verify

information flow • liveness • buffer overflow • resource leaks

easy to specify
hard to verify

∀i ∈ {0, …, |x′|} : x′[i] < x′[i + 1]
|x| = |x′|
∀i ∈ {0, …, |x|} : ∃i′ ∈ {0, …, |x′|} : x[i] = x′[i′]
∀i′ ∈ {0, …, |x′|} : ∃i ∈ {0, …, |x|} : x′[i′] = x[i]

hard to specify

easy to verify
hard to specify

is-sorted(x′) ∧ is-permutation(x, x′)

still hard to specify
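
The sorting specification above can also be read as an executable postcondition. The following is a minimal sketch, not taken from the talk: the function names mirror is-sorted and is-permutation from the slide, while `satisfies_sort_spec` is a hypothetical helper. Unlike the slide's formula, the sketch assumes non-strict ordering (≤) and compares elements with multiplicity, so inputs with duplicates are handled.

```python
from collections import Counter


def is_sorted(xs):
    """True if every element is <= its successor (non-strict, an assumption)."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def is_permutation(xs, ys):
    """True if ys holds exactly the elements of xs, counted with multiplicity."""
    return Counter(xs) == Counter(ys)


def satisfies_sort_spec(x, x_prime):
    """Postcondition of sorting: x_prime is a sorted permutation of x."""
    return is_sorted(x_prime) and is_permutation(x, x_prime)
```

Even for this small example, the specification is about as long as a simple sorting routine, which is the point the slide makes.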

microsoft word • travel booking • airplane control • mobile phones • operating systems • banking systems

easy to verify
hard to specify

hard to specify
new language • duplicate effort • can’t abstract from details

mine specifications
from 6,000 projects

Specifications

∀i ∈ {0, …, |x′|} : x′[i] < x′[i + 1]
|x| = |x′|
∀i ∈ {0, …, |x|} : ∃i′ ∈ {0, …, |x′|} : x[i] = x′[i′]
∀i′ ∈ {0, …, |x′|} : ∃i ∈ {0, …, |x|} : x′[i′] = x[i]

pre- and postconditions

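Specifications

[Figure: state machine over the fields socket and state.
States: (socket: null, state: NOT_CON) • (socket: ¬null, state: PLAIN) • (socket: ¬null, state: AUTH).
Transition labels: <init>(), openPort(), auth(), auth()!, quit().]

finite state models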

OP-Miner

Program → Usage Models → Temporal Properties → Patterns → Anomalies

Usage model: iter.hasNext () • iter.next ()
Temporal properties: hasNext ≺ next • hasNext ≺ hasNext • next ≺ hasNext • next ≺ next
Patterns: hasNext ≺ next • hasNext ≺ hasNext
Anomalies: usages checked against the patterns (✓ / ✗)
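
The step from a usage model to temporal properties can be sketched as a reachability computation. This is an illustrative sketch under assumptions, not OP-Miner's implementation: a usage model is taken to be a list of hypothetical `(source_state, call, target_state)` transitions, and `a ≺ b` is taken to hold when some path through the model takes an `a` transition and later a `b` transition. The state numbers and both example models are made up for illustration.

```python
def temporal_properties(transitions):
    """Return all pairs (a, b) such that call a can be followed by call b."""
    # Successor states of each state.
    succ = {}
    for src, _, dst in transitions:
        succ.setdefault(src, set()).add(dst)

    def reachable(start):
        """All states reachable from start, including start itself."""
        seen, stack = {start}, [start]
        while stack:
            state = stack.pop()
            for nxt in succ.get(state, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    props = set()
    for _, a, dst in transitions:
        after_a = reachable(dst)
        for src, b, _ in transitions:
            if src in after_a:
                props.add((a, b))
    return props


# Hypothetical usage model of a well-behaved iterator loop.
iterator_model = [(0, "hasNext", 1), (1, "next", 0)]
# Hypothetical model of a loop that tests hasNext() but never calls next().
buggy_model = [(0, "hasNext", 0)]

patterns = {("hasNext", "next"), ("hasNext", "hasNext")}
missing = patterns - temporal_properties(buggy_model)
```

For `iterator_model` the sketch yields the four properties listed on the slide; for `buggy_model`, `missing` contains `("hasNext", "next")`, the kind of violated pattern that is reported as an anomaly.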

public Stack createStack () {
    Random r = new Random ();
    int n = r.nextInt ();
    Stack s = new Stack ();
    int i = 0;
    while (i < n) {
        s.push (rand (r));
        i++;
    }
    s.push (-1);
    return s;
}

[Figure: the method is turned into a graph of its statements (int i = 0; • i < n • s.push (rand (r)); • i++; • s.push (-1);), which is then reduced to one usage model per object.]

Usage model for s: s.<init>() • s.push (_) • s.push (_)
Usage model for r: r.<init> () • r.nextInt () • Utils.rand (r)

Methods vs. Properties

[Figure: cross table.
Columns (temporal properties): start ≺ stop • lock ≺ unlock • eof ≺ close
Rows (methods): get() • open() • hello() • parse()
A block of marked cells is labeled "Pattern" along the property axis and "Support" along the method axis.]

Discovering Anomalies

[Figure: the same cross table; the row get() is marked ✘ and labeled "Anomaly".]
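
The cross table lends itself to a small sketch of pattern support and anomaly detection. Everything concrete here is an assumption made for illustration: the cell contents of `table` are invented (the slide's marks did not survive extraction), the threshold `min_support` is hypothetical, and the search over candidate patterns is simplified to the property sets that actually occur, rather than the concept analysis a real tool might use.

```python
def supporters(table, pattern):
    """Methods whose property set contains every property of the pattern."""
    return [m for m, props in table.items() if pattern <= props]


def find_anomalies(table, min_support=3):
    """Report methods that miss exactly one property of a well-supported pattern."""
    # Simplification: candidate patterns are the property sets seen in the table.
    candidates = {frozenset(p) for p in table.values() if len(p) >= 2}
    anomalies = []
    for pattern in candidates:
        support = len(supporters(table, pattern))
        if support < min_support:
            continue
        for method, props in table.items():
            missing = pattern - props
            if len(missing) == 1:
                anomalies.append((method, next(iter(missing)), support))
    return sorted(anomalies)


# Hypothetical table contents; only the row and column names come from the slide.
table = {
    "open()":  {"start ≺ stop", "lock ≺ unlock", "eof ≺ close"},
    "hello()": {"start ≺ stop", "lock ≺ unlock", "eof ≺ close"},
    "parse()": {"start ≺ stop", "lock ≺ unlock", "eof ≺ close"},
    "get()":   {"start ≺ stop", "lock ≺ unlock"},
}
```

With this made-up data, `find_anomalies(table)` reports `get()` as missing `eof ≺ close`, a pattern supported by the three other methods.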

for (Iterator iter = itdFields.iterator();
     iter.hasNext();) {
    ...
    for (Iterator iter2 = worthRetrying.iterator();
         iter.hasNext();) {   // should be iter2
        ...
    }
}

public void visitNEWARRAY (NEWARRAY o) {
    byte t = o.getTypecode ();
    if (!((t == Constants.T_BOOLEAN) ||
          (t == Constants.T_CHAR) ||
          ...
          (t == Constants.T_LONG))) {
        // should be double quotes
        constraintViolated (o, "(...) '+t+' (...)");
    }
}

Name internalNewName (String[] identifiers) {
    ...
    // should stay as is
    for (int i = 1; i < count; i++) {
        SimpleName name = new SimpleName(this);
        name.internalSetIdentifier(identifiers[i]);
        ...
    }
    ...
}

public String getRetentionPolicy ()
{
    ...
    // should be fixed
    for (Iterator it = ...; it.hasNext();)
    {
        ... = it.next();
        ...
        return retentionPolicy;
    }
    ...
}

44% of violations
are defects or code smells

mine specifications
across thousands of projects

Wisdom of the crowds

Francis Galton
No, not on the left either

Target Languages
Java • C++ • C • PHP • JavaScript

Similar syntax
{...} ; foo()

Similar keywords
while if switch return

Lightweight Parser

Source Code → Abstract Representation → Temporal Properties

language-independent
lightweight parsing
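
A language-independent abstraction of this kind might be approximated with a few regular expressions. The sketch below is an assumption about how such a step could look, not the parser used in the talk: it recognizes statements of the form `target = func(args);` or `func(args);`, keeps identifier arguments, and replaces every other argument by `CONST`. The names `CALL`, `KEYWORDS`, `abstract_arg`, and `abstract_line` are hypothetical, and the comma splitting would break on string literals that contain commas.

```python
import re

# Optional "target =", then a function name, a flat argument list, and ";".
CALL = re.compile(r'(?:(\w+)\s*=\s*)?(\w+)\s*\(([^()]*)\)\s*;')
KEYWORDS = {"while", "if", "switch", "return", "for"}
IDENTIFIER = re.compile(r'[A-Za-z_]\w*')


def abstract_arg(arg):
    """Keep plain identifiers; abstract literals and expressions to CONST."""
    arg = arg.strip()
    return arg if IDENTIFIER.fullmatch(arg) else "CONST"


def abstract_line(line):
    """Return the abstract form of a call statement, or None if there is none."""
    match = CALL.search(line)
    if not match or match.group(2) in KEYWORDS:
        return None
    target, func, args = match.groups()
    if args.strip():
        arg_list = ", ".join(abstract_arg(a) for a in args.split(","))
    else:
        arg_list = ""
    call = f"{func}({arg_list})"
    return f"{target}: {call}" if target else call
```

For instance, `abstract_line('int fB = open("newFile");')` gives `fB: open(CONST)`, while declarations, assignments without calls, and loop headers give `None`. Because only braces, semicolons, and call syntax matter, the same sketch applies to any of the target languages with C-like syntax.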

Source Code

    int j;
    int fA;
    int fB = open("newFile");
    fA = open("myFile");
    j = 7;
    while (j > 3) {
        read(fA);
        write(fB, "Hello");
        j--;
    }

    close(fA);
    close(fB);

Abstract Representation

    fB: open(CONST)
    fA: open(CONST)
    Loop:
        read(fA)
        write(fB, CONST)
    close(fA)
    close(fB)

Per variable:

    fA: open(CONST) • read(fA) • close(fA)
    fB: open(CONST) • write(fB, CONST) • close(fB)

Temporal Properties

    fA: open() < read() • open() < close() • read() < read() • read() < close()
    fB: open() < write() • open() < close() • write() < write() • write() < close()
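
The last step, from per-variable call sequences to "f < g" properties, can be sketched in a few lines. This is a simplified illustration, not the tool's algorithm: the `events` list is a hand-written trace of the example with the loop body unrolled twice (an assumption that makes `read() < read()` and `write() < write()` appear), and `per_variable` and `properties` are hypothetical helper names.

```python
from collections import defaultdict


def per_variable(events):
    """Group (variable, function) events into one call sequence per variable."""
    sequences = defaultdict(list)
    for var, func in events:
        sequences[var].append(func)
    return dict(sequences)


def properties(calls):
    """All 'f() < g()' pairs where f occurs before g in the call sequence."""
    return {f"{f}() < {g}()"
            for i, f in enumerate(calls)
            for g in calls[i + 1:]}


# Example trace; the loop body appears twice to model repeated iterations.
events = [
    ("fB", "open"), ("fA", "open"),
    ("fA", "read"), ("fB", "write"),
    ("fA", "read"), ("fB", "write"),
    ("fA", "close"), ("fB", "close"),
]

mined = {var: properties(calls) for var, calls in per_variable(events).items()}
```

Here `mined["fA"]` and `mined["fB"]` each contain the four properties shown for that variable on the slide.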

[Figure: bar chart of the analyzed corpus: 6,097 C projects • 201,321,237 lines of code]

15,803,766 properties (“f < g”)

18 hours analysis time
single core

11 million lines of code per hour

static int dcc_listen_init (…) {
    dcc->sok = socket(…);
    if (…) {
        while (…) {
            … = bind (dcc->sok, …);
        }
        /* with a small port range, reUseAddr is needed */
        setsockopt (dcc->sok, …, SO_REUSEADDR, …);   /* should be called before bind() */
    }
    listen (dcc->sok, …);
}

static int find_file (…)
{
    DIR *dirp;
    struct dirent *dirinfo;
    …
    dirp = opendir(".");
    if (dirp == NULL)
    {
        …
    }
    while ((dirinfo = readdir(dirp)) != NULL)
    {
        …
    }
    rewinddir(dirp);   /* should call closedir() instead */
    return 1;
}

Check my Code

• Check your code against the wisdom of Linux
• Builds on millions of mined specifications
• Detects problems no other tool can detect

Database available for download

www.checkmycode.org




easy to mine

Challenges

• Mining complete specifications
• Finding relevant abstractions
• Producing readable specifications
• Integrating specification mining and programming

Andrzej Wasylkowski • Christian Lindig • Natalie Gruska
