Analyzing GitHub with the search API
The Net::GitHub module provides a perly interface to GitHub’s feature-rich API. You can do everything with it, from creating new repos to managing issues and initiating pull requests. Today I’m going to focus on search.
Setup
Grab yourself a copy of Net::GitHub (make sure it’s version 0.68 or higher). The CPAN Testers results show that it builds on all major platforms, including Windows. You can install it from CPAN at the command line:
$ cpan Net::GitHub
First steps
First we need to create a search object. You can search GitHub anonymously up to 5 times per minute, or up to 20 times per minute if you authenticate. The module documentation shows examples of how to authenticate, so we’ll proceed here unauthenticated.
use Net::GitHub::V3;
# unauthenticated
my $gh = Net::GitHub::V3->new;
my $search = $gh->search;
my %data = $search->repositories({ q => 'docker'});
The code above creates a $search object, and initiates a repo search for docker. The %data hash contains the search results. Let’s have a look at them:
{'incomplete_results' => bless( do{\(my $o = 0)}, 'JSON::XS::Boolean' ),
'total_count' => 12830,
'items' => [ {
'open_issues_count' => 771,
'url' => 'https://api.github.com/repos/docker/docker',
'has_downloads' => bless( do{\(my $o = 1)}, 'JSON::XS::Boolean' ),
'tags_url' => 'https://api.github.com/repos/docker/docker/tags',
'forks_count' => 2794,
'has_issues' => $VAR1->{'items'}[0]{'has_downloads'},
'clone_url' => 'https://github.com/docker/docker.git',
'name' => 'docker',
'private' => $VAR1->{'incomplete_results'},
'watchers_count' => 14846,
'pushed_at' => '2014-09-05T00:32:46Z',
'description' => 'Docker - the open-source application container engine',
'updated_at' => '2014-09-04T21:59:25Z',
'html_url' => 'https://github.com/docker/docker',
'stargazers_count' => 14846,
'size' => 135198,
'watchers' => 14846,
'created_at' => '2013-01-18T18:10:
'open_issues' => 771,
'language' => 'Go',
'git_url' => 'git://github.com/docker/docker.
'full_name' => 'docker/docker',
'homepage' => 'http://www.docker.com',
'forks' => 2794,
'score' => '89.950935',
...
},
]
};
I’ve truncated the results for the sake of brevity, to show the top level key values and one simplified repo:
incomplete_results is a boolean that is true if there are more matching results than the search found (GitHub sets it when a query times out)
total_count shows the total number of repos matched by the search
items is the interesting one: it’s an arrayref of repo hashes
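To make the structure concrete, here is a minimal sketch of reading those three keys. It uses a small hand-written %data hash as a stand-in for a real response (the values are copied from the dump above, with plain 0 in place of the JSON::XS::Boolean object), so it runs without network access. The summarize helper is hypothetical and not part of Net::GitHub.

```perl
use strict;
use warnings;

# Stand-in for a search response; values copied from the dump above.
my %data = (
    incomplete_results => 0,
    total_count        => 12830,
    items              => [
        {
            full_name        => 'docker/docker',
            language         => 'Go',
            stargazers_count => 14846,
        },
    ],
);

# Hypothetical helper: turn a response hash into printable lines.
sub summarize {
    my (%response) = @_;
    my @lines = ("Total matches: $response{total_count}");
    push @lines, 'Warning: results are incomplete'
        if $response{incomplete_results};
    # items is an arrayref of hashrefs, one per repo
    for my $repo (@{ $response{items} }) {
        push @lines, sprintf '%s (%s, %d stars)',
            $repo->{full_name}, $repo->{language} // 'unknown',
            $repo->{stargazers_count};
    }
    return @lines;
}

print "$_\n" for summarize(%data);
```

The defined-or operator (//) needs Perl 5.10 or later; it covers repos whose language is undef.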
Getting more results
Let’s update the code to pull more results. GitHub permits up to 100 results per API call and up to 1,000 results per search.
use Net::GitHub::V3;
my $gh = Net::GitHub::V3->new;
my $search = $gh->search;
my @data = @{ $search->repositories({ q        => 'docker',
                                      per_page => 100 })->{items} };
while ($search->has_next_page) {
    sleep 12; # 5 queries max per minute
    push @data, @{ $search->next_page->{items} };
}
The code above executes the same search as before, except now I’m passing the per_page parameter to get 100 results per call. I also extract the items arrayref directly into the @data array. The while loop will continue to call the search API until no further results are returned or we hit the 1,000 result limit.
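The arithmetic behind that loop can be sketched as follows. This is only an illustration using the limits quoted in this article (100 results per call, 1,000 per search, 5 unauthenticated calls per minute); the constants and the two helper functions are hypothetical names, not part of Net::GitHub, and GitHub may change its limits.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Limits as quoted in this article; check GitHub's documentation for current values.
use constant {
    PER_PAGE      => 100,    # maximum results per API call
    SEARCH_CAP    => 1_000,  # maximum results per search
    CALLS_PER_MIN => 5,      # unauthenticated search rate limit
};

# Hypothetical helper: API calls needed to fetch every reachable result.
sub calls_needed {
    my ($total_count) = @_;
    my $reachable = $total_count < SEARCH_CAP ? $total_count : SEARCH_CAP;
    return ceil($reachable / PER_PAGE);
}

# Hypothetical helper: seconds to wait between calls to stay within the rate limit.
sub delay_seconds {
    return ceil(60 / CALLS_PER_MIN);
}

printf "%d calls, %d seconds apart\n", calls_needed(12830), delay_seconds();
```

For the 12,830 matches reported earlier this prints "10 calls, 12 seconds apart", which is where the sleep 12 in the loop comes from.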
Analyzing the data
So now we have a full set of results in @data, what can we do with it? One analysis that could be interesting is a count by programming language. Every repo hash contains a language key value pair, so we can extract and count it. Let’s see which language most docker-related repos are written in.
use Net::GitHub::V3;
my $gh = Net::GitHub::V3->new;
my $search = $gh->search;
my @data = @{ $search->repositories({ q        => 'docker+created:>2014-09-01',
                                      per_page => 100 })->{items} };
while ($search->has_next_page) {
    sleep 12; # 5 queries max per minute
    push @data, @{ $search->next_page->{items} };
}
my %languages;
for my $repo (@data) {
    my $language = $repo->{language} ? $repo->{language} : 'Other';
    $languages{ $language }++;
}
for (sort { $languages{$b} <=> $languages{$a} } keys %languages) {
    printf "%10s: %5i\n", $_, $languages{$_};
}
Let’s walk through this code. First of all, I changed the search argument to limit results to repos created since September 2014 using the created qualifier. This was to ensure we didn’t hit the 1,000 result search limit. The GitHub search API supports a whole range of useful search qualifiers (although it’s not documented, created will take a full timestamp like 2014-09-01T00:00:00Z).
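If you find yourself combining several qualifiers, a small helper can assemble the query string. The sketch below is an assumption about how you might organise this: build_query is a hypothetical function, not something Net::GitHub provides. It joins the parts with the same '+' separator used in the search above, and language is a search qualifier that GitHub supports alongside created.

```perl
use strict;
use warnings;

# Hypothetical helper: join a search term and qualifiers into a query string,
# using the same '+' separator as the search above.
sub build_query {
    my ($term, %qualifiers) = @_;
    my @parts = ($term);
    # Sort the qualifier names so the output is stable between runs.
    push @parts, map { $_ . ':' . $qualifiers{$_} } sort keys %qualifiers;
    return join '+', @parts;
}

print build_query('docker', created => '>2014-09-01'), "\n";
print build_query('docker', created => '>2014-09-01T00:00:00Z', language => 'perl'), "\n";
```

The first call reproduces the query string used above; the result would then be passed as the q parameter to $search->repositories.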
Next I declared the %languages hash and iterated through the results, extracting each repo’s language. Where language was undef, I labelled the repo “Other”. Finally I sorted the results and printed them using printf to get a nicely formatted output. Here are the results:
Shell: 238
Other: 58
Python: 13
Ruby: 10
JavaScript: 8
Go: 6
Perl: 2
PHP: 2
Clojure: 1
Java: 1
Perhaps as is to be expected, the results show shell programs dominating the Docker space in September.
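If you want to experiment with the counting step without calling the API, here is a self-contained sketch. It is a variation on the code above rather than a copy: tally_languages is a hypothetical helper, the sample records are made up, and it uses the defined-or operator (Perl 5.10 or later) in place of the ternary. Note that // only replaces undef, whereas the ternary also treats an empty string as “Other”.

```perl
use strict;
use warnings;

# Hypothetical helper: count repos per language, labelling undef as 'Other'.
sub tally_languages {
    my (@repos) = @_;
    my %count;
    $count{ $_->{language} // 'Other' }++ for @repos;
    return %count;
}

# Made-up sample records, not real search results.
my @sample = (
    { language => 'Shell' },
    { language => 'Shell' },
    { language => undef   },
    { language => 'Perl'  },
);

my %languages = tally_languages(@sample);

# Sort by count (descending), then by name so ties print in a stable order.
for (sort { $languages{$b} <=> $languages{$a} || $a cmp $b } keys %languages) {
    printf "%10s: %5i\n", $_, $languages{$_};
}
```

The extra name comparison in the sort is a small design choice: without it, languages with equal counts appear in whatever order the hash happens to return them.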
Further Info
GitHub’s search API supports more than just repo search. You can search issues, code and users as well. Check out the official GitHub search API documentation for more examples.
Net::GitHub provides an interface for far more than just search though. It’s a full-featured API: you can literally manage your GitHub account via Perl code with Net::GitHub. The developer Fayland Lam has provided loads of documentation, and I found him helpful and responsive to enquiries. Thanks Fayland!
If you’re looking for more than just search, you may also want to look at Ingy döt Net’s awesome git-hub, which provides the full power of GitHub at the command line.
This article was originally posted on PerlTricks.com.
David Farrell
David is the editor of Perl.com. An organizer of the New York Perl Meetup, he works for ZipRecruiter as a software developer, and sometimes tweets about Perl and Open Source.