Find all files with certain filetype on GitHub
We are currently building a small search engine for which we want to crawl GitHub for publicly available KiCad schematics. However, we are unsure how to get these using the GitHub API as efficiently as possible. This is our current approach:
- Search repositories that contain .kicad_pcb files using the query
.kicad_pcb in:path created:2015-01-01..2015-01-31
and then iterating over the months. We are using PyGithub so the actual code is:
repos = g.search_repositories(f'.kicad_pcb in:path created:{dt.strftime("%Y-%m")}-01..{dt.strftime("%Y-%m")}-{last_day[1]}')
- Search each repository with a code search query:
.kicad_pcb in:path repo:{repo.full_name}
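The month-by-month iteration described above could be sketched roughly as follows. This only builds the query strings and does not call the API; the function name `monthly_queries` and its parameters are made up for illustration, and `calendar.monthrange` is used to find the last day of each month.

```python
import calendar
from datetime import date


def monthly_queries(start_year, end_year, term=".kicad_pcb in:path"):
    """Yield one repository-search query per calendar month.

    `term` is the search text from the question; the created: range
    covers the first through the last day of each month.
    """
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            # monthrange returns (weekday of the 1st, number of days in month)
            days_in_month = calendar.monthrange(year, month)[1]
            first = date(year, month, 1).isoformat()
            last = date(year, month, days_in_month).isoformat()
            yield f"{term} created:{first}..{last}"
```

Each yielded string could then be passed to `g.search_repositories(...)` as in the snippet above.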
However, I am getting files in the results that do not have .kicad_pcb in their path, as far as I can see. Also, the whole search returns only 228 repositories with 462 KiCad files from 2015 to now, which seems very low. Can anyone spot a mistake that we made or suggest a better approach?
Additional question: Using the 'search code' function, we sometimes get an Exception because GitHub can only serve 'blobs' that are less than one MB. Is there any way to prevent this? We don't need the file to be included in the search results as long as we can get the url to download it at a later point.
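One possible way to keep only a download link is to build a raw.githubusercontent.com URL from the repository name, a branch or commit, and the file path, and fetch it later. This is a minimal sketch, not a confirmed fix for the exception: the helper name `raw_url` is hypothetical, and it assumes you already know the repository's full name, a valid ref (for example its default branch), and the path of the file.

```python
from urllib.parse import quote


def raw_url(full_name, ref, path):
    """Build a raw download URL for a file in a public repository.

    full_name: "owner/repo"; ref: branch name or commit SHA (assumed known);
    path: file path inside the repository.
    """
    # quote() keeps "/" by default and percent-encodes spaces etc.
    return (
        "https://raw.githubusercontent.com/"
        f"{full_name}/{quote(ref)}/{quote(path)}"
    )
```

The resulting string can be stored with the search result and downloaded with any HTTP client at a later point.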
Solution 1:[1]
Repository search cannot look into repository contents apart from the description and the README file. For what I want to achieve, githubarchive is the better option.
Solution 2:[2]
GitHub search offers the following qualifier to search for file extensions:
extension:kicad_pcb
https://github.com/search?q=extension%3Akicad_pcb
You can combine it with the repo: or org: modifiers. Maybe this is already your solution?
https://docs.github.com/en/search-github/searching-on-github/searching-code#search-by-file-extension
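Combining the extension qualifier with repo: or org: could look like the sketch below. It only assembles the query string; the helper name `extension_query` and its arguments are illustrative and not part of PyGithub or the GitHub API.

```python
def extension_query(extension, repo=None, org=None):
    """Assemble a code-search query such as 'extension:kicad_pcb repo:owner/name'."""
    # Accept both "kicad_pcb" and ".kicad_pcb"
    parts = [f"extension:{extension.lstrip('.')}"]
    if repo:
        parts.append(f"repo:{repo}")
    if org:
        parts.append(f"org:{org}")
    return " ".join(parts)
```

With PyGithub, the string could presumably be handed to `g.search_code(...)`, for example `g.search_code(extension_query("kicad_pcb", repo=repo.full_name))`.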
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Gasp0de |
Solution 2 | guerda |