An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data

Talk given at the 2009 International Conference on Software Maintenance in Edmonton, Alberta, Canada.


  1. An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data. Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan (Queen’s University, Canada)
  2. Development Repositories: source code, communication archives, bug databases
  4. The Importance of Mailing List Archives
     • Email: popular form of communication
     • Mailing lists distribute messages
     • Messages contain valuable information
     • Discussions of source code
     • Development decisions
     • Error reports
     • User support requests
  5. Mining the Mailing Lists of 23 Open-Source Projects
     • Summarizing developer mailing lists
     • Using off-the-shelf tools
     • Data from around 500,000 emails
     • Unexpected results from experiments
  6. [Figure: cloud of tokens extracted from raw mailing list text, mixing prose words with source code fragments (palloc(bufsize);, +extern, #include), diff markers (+++, ---), PGP signature lines (-----END, SIGNATURE), file paths (pg_hba.conf, /etc.), URLs, dates, and phone numbers]
  7. [Figure: the same token cloud, denser, with further tokens from the discussion (PGDATA, '/etc/pgsql/pg_hba.conf', Debian, symlinking)]
  9. While Mining Mailing Lists of 23 Open-Source Projects
     • Don’t treat mail archives as textual data
     • Changing technologies
     • Up to 98% of messages contain noise
     Additional processing and cleaning needed!
  10. Example message from a mailing list archive:

         From geek+@cmu.edu Wed Jan 21 08:11:26 1998
         Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
         From: "Brian E. Gallew" <geek+@cmu.edu>
         Subject: Re: [HACKERS] configure

         - ---559023410-851401618-854387445=:824
         Content-Type: TEXT/PLAIN; CHARSET=US-ASCII

         > If you can grab a copy and run it on your machine, and send me
         > the output, that would help alot.

         Here is a gzip'ed tar of the results.

         =====================================================================
         | Please do not shoot at the thermonuclear weapons! -- Deacon |
         =====================================================================
         | Finger geek@andrew.cmu.edu for my public key. |
         =====================================================================

         - ---559023410-851401618-854387445=:824
         Content-Type: APPLICATION/x-gzip
         Content-Transfer-Encoding: BASE64
         Content-Description: m88k-dg-dgux5.4R3.10.tar.gz

         H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8W
         UgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/Qx
         ffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/
         gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn
  12. Resolving Multiple Sender Identities
     • Participants send mail from different addresses
     • Up to 21% of addresses are aliases
     • Such aliases bias identity-based analyses
     • Manual inspection and correction is tedious
     • No fully automated approach to resolve identities
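
One simple way to surface alias candidates is to group addresses by a normalized display name. The sketch below is only an illustration of that heuristic, not the approach used in the study; the function names `normalize_name` and `group_aliases` are hypothetical, and using the address from the example message's signature as a second `From` header is an assumption made for the demo. Since no fully automated approach exists, the output is meant as a shortlist for manual review.

```python
from collections import defaultdict
from email.utils import parseaddr


def normalize_name(display_name: str) -> str:
    """Lowercase a display name and drop punctuation and extra spaces."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in display_name)
    return " ".join(kept.lower().split())


def group_aliases(from_headers):
    """Map each normalized display name to the set of addresses seen with it."""
    groups = defaultdict(set)
    for header in from_headers:
        name, address = parseaddr(header)
        if not address:
            continue  # unparsable header: leave it for manual inspection
        # Fall back to the address itself when no display name is present.
        key = normalize_name(name) or address.lower()
        groups[key].add(address.lower())
    return dict(groups)


# Illustrative input; the second header is an assumed alias, not archive data.
headers = [
    '"Brian E. Gallew" <geek+@cmu.edu>',
    '"Brian E. Gallew" <geek@andrew.cmu.edu>',
    '"Someone Else" <someone@example.org>',
]
aliases = group_aliases(headers)
# Names that appear with more than one address are alias candidates.
candidates = {name: addrs for name, addrs in aliases.items() if len(addrs) > 1}
```

Name matching of this kind both misses aliases (different spellings of one name) and merges distinct people who share a name, which is why the candidates still need checking by hand.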
  13. Reconstructing Discussion Threads
     • Mail stored sequentially in archives
     • Logical grouping: discussion topics
     • Required information erroneous or missing
     • Essential for social network and topic analysis
     [Figure: messages A, B, C, D shown as a linear sequence and as a thread hierarchy]
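
A minimal sketch of turning the linear sequence into a hierarchy, assuming each message carries a `Message-ID` header and that replies name their parent in `In-Reply-To` or `References`. The helper names `build_threads` and `make` and the sample IDs are hypothetical, and this is not the reconstruction procedure from the paper.

```python
from collections import defaultdict
from email import message_from_string


def build_threads(raw_messages):
    """Return (roots, children) linking Message-IDs via reply headers."""
    parsed = [message_from_string(raw) for raw in raw_messages]
    known_ids = {m["Message-ID"].strip() for m in parsed if m["Message-ID"]}
    roots, children = [], defaultdict(list)
    for m in parsed:
        msg_id = (m["Message-ID"] or "").strip()
        if not msg_id:
            continue  # cannot be placed without an identifier
        # Try In-Reply-To first, then References from newest to oldest.
        in_reply_to = (m["In-Reply-To"] or "").split()
        references = (m["References"] or "").split()[::-1]
        parent = next(
            (c for c in in_reply_to + references if c in known_ids and c != msg_id),
            None,
        )
        if parent is None:
            roots.append(msg_id)  # thread start, or a reply with lost headers
        else:
            children[parent].append(msg_id)
    return roots, dict(children)


def make(msg_id, parent=None):
    """Build a tiny raw message for the demo."""
    headers = [f"Message-ID: {msg_id}", "Subject: configure"]
    if parent:
        headers.append(f"In-Reply-To: {parent}")
    return "\n".join(headers) + "\n\nbody\n"


archive = [make("<A@x>"), make("<B@x>", "<A@x>"), make("<C@x>", "<B@x>"), make("<D@x>", "<A@x>")]
roots, children = build_threads(archive)
```

A reply whose headers are missing or point at a message outside the archive ends up as a root here, which is exactly the "erroneous or missing" case the slide warns about; a subject-based fallback could recover some of those at the cost of false merges.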
  16. Attachments
     • MIME standard defines extensions to email
     • Binary data encoded as text
     • Around 10% of messages have attachments
     • Extract attachments and store separately
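
Separating attachments from the text body can be sketched with Python's standard `email` package. The function name `split_message` and the demo message are hypothetical, and the sketch assumes well-formed MIME in which attachments carry a filename or an attachment disposition.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_message(raw: bytes):
    """Return (plain_text_body, [(filename, decoded_bytes), ...])."""
    msg = message_from_bytes(raw, policy=policy.default)
    text_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no content of their own
        if part.get_content_disposition() == "attachment" or part.get_filename():
            # decode=True reverses the base64 / quoted-printable encoding.
            attachments.append((part.get_filename(), part.get_payload(decode=True)))
        elif part.get_content_type() == "text/plain":
            text_parts.append(part.get_content())
    return "".join(text_parts), attachments


# Build a small two-part message to exercise the function.
demo = EmailMessage()
demo["Subject"] = "Re: [HACKERS] configure"
demo.set_content("Here is a gzip'ed tar of the results.\n")
demo.add_attachment(
    b"\x1f\x8b fake archive bytes",
    maintype="application",
    subtype="x-gzip",
    filename="results.tar.gz",
)
body, files = split_message(demo.as_bytes())
```

Older archive messages, such as the example on slide 10, may label the binary part only with `Content-Description` and wrap boundaries in PGP dash-escaping, so a rule keyed on filename or disposition would likely miss them; checking for non-text content types is one possible extra rule.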
  19. Quotes and Signatures
     • Duplicate information
     • Unrelated to actual message
     • Removing signatures is challenging
     • Quoted text may or may not be desirable
     • Signatures impact text mining approaches
     • No perfect method for signature removal
     [Figure: the signature block from the example message]
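
A common heuristic, sketched here under stated assumptions, treats lines starting with `>` as quoted text and everything after a line consisting of `-- ` as the signature. The function name `strip_quotes_and_signature` is hypothetical, and the `-- ` delimiter in the sample is added for illustration; the original example message does not contain one.

```python
def strip_quotes_and_signature(body: str):
    """Split a plain-text body into (author_text, quoted_text, signature)."""
    lines = body.splitlines()
    signature = []
    for i, line in enumerate(lines):
        # Conventional signature delimiter: two dashes, optionally a space.
        if line in ("-- ", "--"):
            lines, signature = lines[:i], lines[i + 1:]
            break
    quoted = [l for l in lines if l.lstrip().startswith(">")]
    own = [l for l in lines if not l.lstrip().startswith(">")]
    return "\n".join(own).strip(), "\n".join(quoted), "\n".join(signature)


# Sample based on the example message, with an assumed "-- " delimiter.
sample = (
    "> If you can grab a copy and run it on your machine, and send me\n"
    "> the output, that would help alot.\n"
    "\n"
    "Here is a gzip'ed tar of the results.\n"
    "-- \n"
    "Finger geek@andrew.cmu.edu for my public key.\n"
)
own, quoted, signature = strip_quotes_and_signature(sample)
```

The boxed signature in the example message, drawn with rows of `=` and no delimiter line, would pass through this heuristic untouched, which illustrates why no signature removal method is perfect.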
  20. More Risks Presented in the Paper
  21. Conclusions
     (1) Mailing lists contain valuable information on a project.
     (2) Data needs pre-processing before applying traditional tools.
     (3) Manual data processing is often not feasible or requires much effort.
     (4) Off-the-shelf tools were not designed to prepare data for mining.
