Commit f8adb52

replace outdated BQ lang table with GH Archive PR lang extract
The table [bigquery-public-data:github_repos.languages] was last updated in Nov 2022. This is a significant issue: without further updates, we can only count events for the repositories in that outdated list. Hence, we need a new way to obtain a large enough sample of primary-language metadata for repositories. Fortunately, we can extract the language directly from PullRequest events, because their payload includes a language field. So whenever there is a PullRequest event for a repo we want to include in our ranking, we can determine its language. These amount to many millions. The drawback is that we cannot include repositories that had no pull request in the current quarter. I think this is a fair trade-off for now, until there is a better solution.
1 parent 8ab2725 commit f8adb52

1 file changed: +7 -3 lines changed

scripts/query.js

@@ -62,9 +62,13 @@ const queryBuilder = (tables) => {
 FROM ${tables} WHERE NOT LOWER(actor.login) LIKE "%bot%") a
 JOIN ( SELECT repo_name as name, lang FROM ( SELECT * FROM (
 SELECT *, ROW_NUMBER() OVER (PARTITION BY repo_name ORDER BY lang) as num FROM (
-SELECT repo_name, FIRST_VALUE(language.name) OVER (
-partition by repo_name order by language.bytes DESC) AS lang
-FROM [bigquery-public-data:github_repos.languages]))
+SELECT
+JSON_EXTRACT_SCALAR(payload, "$.pull_request.base.repo.language") as lang,
+repo.name as repo_name
+FROM ${tables}
+WHERE
+JSON_EXTRACT_SCALAR(payload, "$.pull_request.base.repo.language") IS NOT NULL
+))
 WHERE num = 1 order by repo_name)
 WHERE lang != 'null') b ON a.name = b.name)
 GROUP by type, language, year, quarter, actor.login
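
To make the effect of the new subquery easier to reason about, here is a minimal in-memory sketch of roughly what it computes. It assumes events shaped like GH Archive rows, with `repo.name` and a JSON string in `payload`. The function name `extractRepoLanguages` and the sample events are hypothetical and not part of scripts/query.js; the real work happens in BigQuery, and string ordering there may differ from JavaScript's for non-ASCII names.

```javascript
// Hypothetical in-memory approximation of the language subquery in the diff.
// Returns a Map from repo name to one language per repo.
function extractRepoLanguages(events) {
  const byRepo = new Map();
  for (const event of events) {
    let payload;
    try {
      payload = JSON.parse(event.payload);
    } catch (err) {
      continue; // skip rows whose payload is not valid JSON
    }
    // Same path as "$.pull_request.base.repo.language" in the query.
    const lang = payload?.pull_request?.base?.repo?.language;
    // Roughly mirrors the IS NOT NULL filter: skip missing or null languages.
    if (typeof lang !== "string") continue;
    const name = event.repo.name;
    const current = byRepo.get(name);
    // ROW_NUMBER() ... ORDER BY lang with num = 1 keeps the language that
    // sorts first when a repo reports more than one.
    if (current === undefined || lang < current) byRepo.set(name, lang);
  }
  return byRepo;
}

// Made-up sample events, for illustration only.
const sampleEvents = [
  { repo: { name: "octo/app" }, payload: JSON.stringify({ pull_request: { base: { repo: { language: "TypeScript" } } } }) },
  { repo: { name: "octo/app" }, payload: JSON.stringify({ pull_request: { base: { repo: { language: "Go" } } } }) },
  { repo: { name: "octo/docs" }, payload: JSON.stringify({ pull_request: { base: { repo: { language: null } } } }) },
  { repo: { name: "octo/cli" }, payload: JSON.stringify({ action: "started" }) },
];

const languages = extractRepoLanguages(sampleEvents);
console.log(languages.get("octo/app")); // Go
console.log(languages.has("octo/cli")); // false
```

The sample also shows the trade-off described in the commit message: `octo/cli` has no pull request event in the input, so it gets no language and would drop out of the join.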
