Thursday, April 23, 2015

Cohort analysis of Rust contributors

Cohort analysis is widely used in business analytics, but open source software projects do not seem to make use of it. The most interesting activity to analyze is code contribution, and all the data for that is readily available in the source code repository history. The data is also quite small: Rust is a big project, but it has fewer than 50,000 commits in total.

In 2014, Igor Steinmacher and colleagues published a paper titled "Attracting, Onboarding, and Retaining Newcomer Developers in Open Source Software Projects". This is the most interesting paper presenting a "developer joining model". Go read the paper, and other papers from the Software Engineering and Collaborative Systems Research Group at the Math and Statistics Institute, University of São Paulo.

According to the paper, "outsiders" are first "attracted" to become "newcomers". Many factors affect this step, including software license, development infrastructure, project size, project complexity, project age, and specific events such as releases. Next, "newcomers" go through "onboarding" to become regular "contributors". In this step, first impressions matter a lot: a timely response helps, and a rude response hinders. To "retain" contributors, they need to be able to understand the project's process, and the project needs to keep a favorable atmosphere.

This is surprisingly (or not?) similar to user acquisition, conversion, and retention in business. So I wrote a quick Python script to do the analysis. Contributors are segmented into cohorts by the month of their first commit. Contributors are considered converted when they make commits in 2 different months. Contributors are considered retained when they keep making commits.
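The segmentation and the conversion rule can be sketched roughly as follows. This is not the script behind the numbers in this post, just a minimal illustration. It assumes the commits have already been extracted as (author, date) pairs, for example from `git log --format='%aN|%ad' --date=short`. The function names and the sample data are made up.

```python
from collections import defaultdict
from datetime import date


def active_months(commits):
    """Return {author: sorted list of "YYYY-MM" months with at least one commit}."""
    months = defaultdict(set)
    for author, day in commits:
        months[author].add(day.strftime("%Y-%m"))
    return {author: sorted(ms) for author, ms in months.items()}


def cohorts(activity):
    """Group authors by the month of their first commit."""
    result = defaultdict(list)
    for author, ms in activity.items():
        result[ms[0]].append(author)
    return dict(result)


def converted(activity):
    """Authors with commits in at least 2 different months."""
    return {author for author, ms in activity.items() if len(ms) >= 2}


# Made-up sample data, not taken from the Rust repository.
commits = [
    ("alice", date(2014, 4, 3)),
    ("alice", date(2014, 4, 20)),
    ("alice", date(2014, 6, 1)),
    ("bob", date(2014, 4, 9)),
    ("carol", date(2014, 5, 2)),
    ("carol", date(2014, 6, 15)),
]
activity = active_months(commits)
print(cohorts(activity))            # {'2014-04': ['alice', 'bob'], '2014-05': ['carol']}
print(sorted(converted(activity)))  # ['alice', 'carol']
```

Note that alice's two commits in April count as one active month, so she converts only with her June commit.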

In the 2-year period starting from January 2013, Rust acquired 711 new contributors in 23 months. (I excluded the first month.)

Rust releases every 3 months, and a 3-month pattern is evident. Contributors are more likely to make their first commits in the month of a Rust release. Releases 0.9 and 0.10 are especially prominent.

Rust converted 252 new contributors to repeat contributors, for a conversion rate of 35%. Contributors took 2.3 months to convert on average. The following is a plot of 19 months, excluding the first 2 months (no new contributors can be converted there) and the last 3 months (not enough time to observe conversion).
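For reference, here is a small sketch of how the conversion rate and the average time to convert could be computed. It assumes each contributor's active months are stored as consecutive integers, and that time to convert means the gap between the first and the second active month. Both are assumptions made for illustration, and the sample activity is made up.

```python
def conversion_stats(activity):
    """activity: {author: sorted list of month numbers with commits}.

    Returns (conversion rate, mean months from first to second active month).
    """
    if not activity:
        return float("nan"), float("nan")
    # One gap per converted contributor: second active month minus first.
    gaps = [ms[1] - ms[0] for ms in activity.values() if len(ms) >= 2]
    rate = len(gaps) / len(activity)
    mean_gap = sum(gaps) / len(gaps) if gaps else float("nan")
    return rate, mean_gap


# Made-up activity; month numbers count from an arbitrary start.
activity = {
    "alice": [0, 2, 3],
    "bob": [0],
    "carol": [1, 2],
    "dave": [1],
}
rate, mean_gap = conversion_stats(activity)
print(f"{rate:.0%} converted, {mean_gap:.1f} months on average")

# The rate quoted above: 252 converted out of 711 new contributors.
print(f"{252 / 711:.0%}")  # 35%
```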

This is harder to interpret. From what I can tell, the absolute number of conversions is increasing, but the conversion rate is decreasing. This makes sense, since Rust had more self-selected contributors in its early life. Now there are more casual contributors.

Here is the retention table for the last 6 months of the above period:

Cohort    0  +1  +2  +3  +4  +5
2014-04  14   6   2   1   0   2
2014-05  13   7   8   7   7
2014-06  17   5   6   6
2014-07  10   3   1
2014-08  22  10
2014-09  98

The data is quite noisy: after 2 months, the retention rates are 62% (8/13) for May, 35% (6/17) for June, 14% (2/14) for April, and 10% (1/10) for July. My impression is that Rust is retaining contributors well once they contribute in 3 different months, not 2. Maybe I need to change the conversion criterion. The data is too noisy to say anything about a trend over time.
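A retention table of this shape can be built along the following lines. This is a sketch only. It assumes months are stored as consecutive integers and that a contributor counts as retained at +k when they commit in exactly the k-th month after their first one. The function names and the sample activity are made up.

```python
def retention_table(activity, last_month):
    """Count, per cohort, how many authors committed k months after joining.

    activity: {author: sorted list of month numbers with commits}.
    last_month: last observed month; anything later is ignored.
    """
    table = {}
    for ms in activity.values():
        first = ms[0]
        if first > last_month:
            continue  # joined after the observation window
        row = table.setdefault(first, [0] * (last_month - first + 1))
        for m in ms:
            if m <= last_month:
                row[m - first] += 1
    return table


def retention_rate(row, offset):
    """Share of the cohort (column 0) that is active at the given offset."""
    return row[offset] / row[0]


# Made-up activity, not Rust data.
activity = {
    "alice": [0, 1, 2],
    "bob": [0, 2],
    "carol": [0],
    "dave": [1, 2],
}
table = retention_table(activity, last_month=2)
print(table)  # {0: [3, 1, 2], 1: [1, 1]}
print(f"{retention_rate(table[0], 2):.0%}")  # 67%

# The +2 rates quoted above, recomputed from the counts.
for cohort, kept, size in [("May", 8, 13), ("June", 6, 17), ("April", 2, 14), ("July", 1, 10)]:
    print(cohort, f"{kept / size:.0%}")
```

Later cohorts get shorter rows because fewer months have been observed for them, which is why the table is triangular.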