Who's who in Gnome : using LSA to merge software repository identities

E.T.M. Kouters, B.N. Vasilescu, A. Serebrenik, M.G.J. van den Brand

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review


Abstract

Understanding an individual’s contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising that different contributions belong to the same contributor is challenging, since developers may use different aliases. It is known that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. To assess the scale of the problem for a large software ecosystem, we study all GNOME Git repositories, classify the differences in aliases, and discuss robustness of existing algorithms with respect to these types of differences. We then propose a new identity merging algorithm based on Latent Semantic Analysis (LSA), designed to be robust against more types of differences in aliases, and evaluate it empirically by means of cross-validation on GNOME Git authors. Our results show a clear improvement over existing algorithms in terms of precision and recall on worst-case input data.
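
The abstract names Latent Semantic Analysis as the basis of the proposed algorithm but this record gives no further detail. As a rough illustration only, the general LSA idea can be sketched as follows: build a term-by-alias matrix, reduce it with a truncated SVD, and compare aliases by cosine similarity in the reduced space. This is not the authors' algorithm. The function names, the tokenisation, the rank `k`, the similarity `threshold`, the greedy grouping, and the sample aliases are all assumptions made for this sketch.

```python
import re
import numpy as np

def tokenize(alias):
    """Split a 'Name <email>' alias into lowercase alphabetic terms (assumed scheme)."""
    return re.findall(r"[a-z]+", alias.lower())

def lsa_similarity(aliases, k=2):
    """Cosine similarity between aliases in a rank-k latent space."""
    docs = [tokenize(a) for a in aliases]
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    # Term-document matrix: rows are terms, columns are aliases.
    td = np.zeros((len(vocab), len(aliases)))
    for j, d in enumerate(docs):
        for t in d:
            td[index[t], j] += 1.0
    # Truncated SVD: keep only the k largest singular values.
    _, s, vt = np.linalg.svd(td, full_matrices=False)
    k = min(k, len(s))
    doc_vecs = (np.diag(s[:k]) @ vt[:k]).T  # one row per alias
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty projections
    unit = doc_vecs / norms
    return unit @ unit.T

def merge_identities(aliases, k=2, threshold=0.9):
    """Greedily group aliases whose similarity to a group's first member meets the threshold."""
    sim = lsa_similarity(aliases, k)
    groups = []
    for i in range(len(aliases)):
        for g in groups:
            if sim[i, g[0]] >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return [[aliases[i] for i in g] for g in groups]

if __name__ == "__main__":
    # Hypothetical aliases, not taken from the GNOME data set.
    sample = [
        "John Smith <john.smith@mail.test>",
        "jsmith <john.smith@mail.test>",
        "Alice Jones <ajones@corp.invalid>",
    ]
    for group in merge_identities(sample):
        print(group)
```

On this toy input the first two aliases share most of their terms and end up in one group, while the third shares none and stays separate. Real alias data would need a more careful tokenisation and a tuned rank and threshold.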
Original language: English
Title of host publication: Proceedings of the Early Research Achievements (ERA) track of the 28th IEEE International Conference on Software Maintenance (ICSM 2012, Trento, Italy, September 23-30, 2012)
Place of Publication: Piscataway
Publisher: Institute of Electrical and Electronics Engineers
Pages: 592-595
ISBN (Print): 978-1-4673-2312-3
DOIs: https://doi.org/10.1109/ICSM.2012.6405329
Publication status: Published - 2012
Event: 28th IEEE International Conference on Software Maintenance
Duration: 23 Sep 2012 → 30 Sep 2012

Conference

Conference: 28th IEEE International Conference on Software Maintenance
Period: 23/09/12 → 30/09/12
Other: 28th IEEE International Conference on Software Maintenance

Fingerprint

Semantics
Ecosystems
Merging
Control systems

Cite this

Kouters, E. T. M., Vasilescu, B. N., Serebrenik, A., & Brand, van den, M. G. J. (2012). Who's who in Gnome : using LSA to merge software repository identities. In Proceedings of the Early Research Achievements (ERA) track of the 28th IEEE International Conference on Software Maintenance (ICSM 2012, Trento, Italy, September 23-30, 2012) (pp. 592-595). Piscataway: Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/ICSM.2012.6405329
Kouters, E.T.M. ; Vasilescu, B.N. ; Serebrenik, A. ; Brand, van den, M.G.J. / Who's who in Gnome : using LSA to merge software repository identities. Proceedings of the Early Research Achievements (ERA) track of the 28th IEEE International Conference on Software Maintenance (ICSM 2012, Trento, Italy, September 23-30, 2012). Piscataway : Institute of Electrical and Electronics Engineers, 2012. pp. 592-595
@inproceedings{ccfff2cb55764e0097b3b7cbc9d5d414,
title = "Who's who in Gnome : using LSA to merge software repository identities",
abstract = "Understanding an individual’s contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising that different contributions belong to the same contributor is challenging, since developers may use different aliases. It is known that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. To assess the scale of the problem for a large software ecosystem, we study all GNOME Git repositories, classify the differences in aliases, and discuss robustness of existing algorithms with respect to these types of differences. We then propose a new identity merging algorithm based on Latent Semantic Analysis (LSA), designed to be robust against more types of differences in aliases, and evaluate it empirically by means of cross-validation on GNOME Git authors. Our results show a clear improvement over existing algorithms in terms of precision and recall on worst-case input data.",
author = "E.T.M. Kouters and B.N. Vasilescu and A. Serebrenik and van den Brand, M.G.J.",
year = "2012",
doi = "10.1109/ICSM.2012.6405329",
language = "English",
isbn = "978-1-4673-2312-3",
pages = "592--595",
booktitle = "Proceedings of the Early Research Achievements (ERA) track of the 28th IEEE International Conference on Software Maintenance (ICSM 2012, Trento, Italy, September 23-30, 2012)",
publisher = "Institute of Electrical and Electronics Engineers",
address = "United States",

}

Kouters, ETM, Vasilescu, BN, Serebrenik, A & Brand, van den, MGJ 2012, Who's who in Gnome : using LSA to merge software repository identities. in Proceedings of the Early Research Achievements (ERA) track of the 28th IEEE International Conference on Software Maintenance (ICSM 2012, Trento, Italy, September 23-30, 2012). Institute of Electrical and Electronics Engineers, Piscataway, pp. 592-595, conference; 28th IEEE International Conference on Software Maintenance; 2012-09-23; 2012-09-30, 23/09/12. https://doi.org/10.1109/ICSM.2012.6405329

Who's who in Gnome : using LSA to merge software repository identities. / Kouters, E.T.M.; Vasilescu, B.N.; Serebrenik, A.; Brand, van den, M.G.J.

Proceedings of the Early Research Achievements (ERA) track of the 28th IEEE International Conference on Software Maintenance (ICSM 2012, Trento, Italy, September 23-30, 2012). Piscataway : Institute of Electrical and Electronics Engineers, 2012. p. 592-595.

TY - GEN

T1 - Who's who in Gnome : using LSA to merge software repository identities

AU - Kouters, E.T.M.

AU - Vasilescu, B.N.

AU - Serebrenik, A.

AU - Brand, van den, M.G.J.

PY - 2012

Y1 - 2012

N2 - Understanding an individual’s contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising that different contributions belong to the same contributor is challenging, since developers may use different aliases. It is known that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. To assess the scale of the problem for a large software ecosystem, we study all GNOME Git repositories, classify the differences in aliases, and discuss robustness of existing algorithms with respect to these types of differences. We then propose a new identity merging algorithm based on Latent Semantic Analysis (LSA), designed to be robust against more types of differences in aliases, and evaluate it empirically by means of cross-validation on GNOME Git authors. Our results show a clear improvement over existing algorithms in terms of precision and recall on worst-case input data.

AB - Understanding an individual’s contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising that different contributions belong to the same contributor is challenging, since developers may use different aliases. It is known that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. To assess the scale of the problem for a large software ecosystem, we study all GNOME Git repositories, classify the differences in aliases, and discuss robustness of existing algorithms with respect to these types of differences. We then propose a new identity merging algorithm based on Latent Semantic Analysis (LSA), designed to be robust against more types of differences in aliases, and evaluate it empirically by means of cross-validation on GNOME Git authors. Our results show a clear improvement over existing algorithms in terms of precision and recall on worst-case input data.

U2 - 10.1109/ICSM.2012.6405329

DO - 10.1109/ICSM.2012.6405329

M3 - Conference contribution

SN - 978-1-4673-2312-3

SP - 592

EP - 595

BT - Proceedings of the Early Research Achievements (ERA) track of the 28th IEEE International Conference on Software Maintenance (ICSM 2012, Trento, Italy, September 23-30, 2012)

PB - Institute of Electrical and Electronics Engineers

CY - Piscataway

ER -

Kouters ETM, Vasilescu BN, Serebrenik A, Brand, van den MGJ. Who's who in Gnome : using LSA to merge software repository identities. In Proceedings of the Early Research Achievements (ERA) track of the 28th IEEE International Conference on Software Maintenance (ICSM 2012, Trento, Italy, September 23-30, 2012). Piscataway: Institute of Electrical and Electronics Engineers. 2012. p. 592-595 https://doi.org/10.1109/ICSM.2012.6405329