get_items, we are using the
get_image_files to load the images. I was curious to see how
get_image_files worked under the hood to return all the image files in a dataset. As Jeremy always suggests, I started looking into the source code using the handy question-mark functionality in Jupyter Notebooks. The source code for
get_image_files can be found in the fastai repo here. The
get_image_files function simply calls
get_files(path, extensions=image_extensions, recurse=recurse, folders=folders), with extensions set to image_extensions.
What is image_extensions doing?
image_extensions is just a variable holding the set of image file extensions taken from mimetypes, a module in the Python standard library that maps filenames to MIME types. Let's look at the
image_extensions output to see the whole set of image type extensions.
mimetypes.types_map.items() returns the dictionary's items as (key, value) pairs, where each key is a file extension and each value is a MIME type. We keep the keys whose values start with
image/ to build the set of image extensions shown in the output above.
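To make the filtering concrete, here is a minimal sketch of how such a set can be built from the standard library alone. It follows the description above rather than quoting fastai's source, and the exact extensions you get depend on your Python version and platform.

```python
import mimetypes

# Keep the extension (key) whenever its MIME type (value) starts with "image/".
image_extensions = {
    ext for ext, mime in mimetypes.types_map.items()
    if mime.startswith("image/")
}

# Show a few entries; the full set varies between platforms.
print(sorted(image_extensions)[:5])
```

Because the result is a set, checking whether a file's extension is an image extension is a fast membership test.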
get_files function, which returns a list of files filtered by the extensions and folders passed. Let's look into the source code of the
get_files(path, extensions=None, recurse=True, folders=None, followlinks=True) function as well:
get_files(path, extensions=None, recurse=True, folders=None, followlinks=True) is doing line by line. Let's look at the first few lines of code.
fastcore library. The extensions are converted to a set, whether they are passed as a list, range, string, etc., using
setify. All the extensions are then converted to lowercase, in case any were passed in uppercase. To read more about the
setify function, check the fastcore docs.
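The normalization step described above can be sketched without fastcore. The helper as_lower_set below is a hypothetical stand-in that I wrote for illustration; fastcore's setify accepts more input types than this, so treat it as an approximation of the idea, not the library's implementation.

```python
def as_lower_set(extensions):
    """Hypothetical stand-in for setify followed by the lowercasing step."""
    if extensions is None:
        return set()
    if isinstance(extensions, str):
        # A bare string is one extension, not a sequence of characters.
        extensions = [extensions]
    return {e.lower() for e in extensions}


# Duplicates that differ only by case collapse into a single entry.
print(as_lower_set([".JPG", ".png", ".jpg"]))
```

Lowercasing up front means the later extension check only has to compare against one spelling of each extension.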
recurse=True. If it is, the function goes through all the files in the path we passed and also descends recursively into the folders inside it. Otherwise, if
recurse=False, we go through only the files directly in the path we passed, without going inside its folders.
.git directory, with the following file structure.
os.scandir returns an iterator of os.DirEntry objects. Python's
os module also has
os.listdir(path='.'), which lists the same directory entries as plain names, but
scandir gives better performance for most common use cases.
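Here is a small sketch of the non-recursive case using os.scandir. The folder layout (README, .gitignore, and a grizzly folder) is a made-up example created in a temporary directory, not the dataset from this post.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: two files at the top level and one inside a folder.
    os.mkdir(os.path.join(root, "grizzly"))
    for name in ("README", ".gitignore", os.path.join("grizzly", "bear.jpg")):
        with open(os.path.join(root, name), "w"):
            pass

    # Each entry is an os.DirEntry; is_file() filters out the folder.
    with os.scandir(root) as entries:
        top_level_files = sorted(e.name for e in entries if e.is_file())

print(top_level_files)  # ['.gitignore', 'README']
```

Note that scandir itself still reports the dot-file; skipping hidden files has to happen in a separate filtering step.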
is_file() returns True if the entry is a file and False if it points to a directory. With
recurse=True, it goes through all the directories and works on files recursively. Let's look at the source code and try to understand more.
os.walk by checking this article. You can see that on iterating through
os.walk(), we get each directory path along with a list of the file names inside it. These are passed to the
_get_files(p, f, extensions) function.
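To see what iterating over os.walk yields, here is a minimal sketch on a throwaway folder. The layout is hypothetical, and skipping folders whose names start with a dot is my assumption about the behavior described in this post, written here in the simplest form I know.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: README, grizzly/g1.jpg and .git/config
    for rel in ("README", "grizzly/g1.jpg", ".git/config"):
        full = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()

    visited = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk which folders to skip.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        visited[os.path.relpath(dirpath, root)] = sorted(filenames)

print(visited)  # {'.': ['README'], 'grizzly': ['g1.jpg']}
```

Each iteration gives one directory path plus the file names directly inside it, which is exactly the pair of values a per-directory filtering function needs.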
get_files function is working, it will be useful to look at the illustration below:
recurse=False, for path bears. It returns just the file README, excluding .gitignore and the directories.
recurse=True, for path bears. It returns all valid files inside the root directory as well as in folders such as grizzly, black, teddy, details, folder, etc.
_get_files function, which converts the list of filenames into a list of pathlib Path objects.
fs, the list of files returned. We skip files whose names start with a dot, such as
.gitignore or .env, as they are not usually useful as files in our dataset. It also keeps only the files whose extension is among the extensions passed, or every file if no extensions are given. Applying
p/f to each file in the list produces a list of paths, as shown in the result below. In other words, we are transforming a list of file names into a list of pathlib Path objects pointing to those files.
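The filtering and path-joining described above can be sketched in a few lines. The function keep_files is a hypothetical re-creation written for this explanation, not fastai's own code, and it uses Path.suffix to read the extension, which is my choice rather than something taken from the source.

```python
from pathlib import Path


def keep_files(p, fs, extensions=None):
    """Hypothetical re-creation of the filtering step described above."""
    p = Path(p)
    return [
        p / f  # joining a Path with a name gives the full Path to the file
        for f in fs
        if not f.startswith(".")
        and (not extensions or Path(f).suffix.lower() in extensions)
    ]


names = ["README", ".gitignore", "g1.JPG", "notes.txt"]
files = keep_files("bears", names, {".jpg"})
print(files)
```

With the extensions set to {".jpg"}, only g1.JPG survives: the dot-file is skipped, and README and notes.txt do not match the extension. Calling keep_files without extensions keeps every non-hidden file.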
get_image_files fetches the list of images under the hood, by looking into the source code.