Datamining GitHub: Examining Time-to-solve for issues in relation to group-size

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Joaquin Bonino Quintana; Sebastian Fagerlind; [2020]

Abstract: There is a wealth of data on GitHub about how software development is done in practice. The data is accessible through an API but is rarely used outside the projects it belongs to. It could be an untapped resource that offers insight into software development in general and into how to organize projects more efficiently. This report takes a sample of 676 800 repositories from GitHub and tries to find a relation or underlying function between the number of members and the time it takes them to solve issues. It also examines the GitHub API and its viability as a tool for research. The data collection was done with two programs written in Golang and took 4 days to run. Because GitHub is used for many different purposes by many people, the data had to be filtered to remove repositories that were irrelevant or not usable for this study. The requirements were having more than 3 members, having made at least two pull requests, and having actually resolved issues. After filtering out projects that did not meet these requirements, 1517 repositories remained. No clear relationship or underlying function could be established because the variance was too high. However, the GitHub API proved to be a valuable tool for this research and holds great potential, although several limitations have to be considered in further studies.
