Understanding get_image_files in fastai


In the fastai library, we use DataBlocks like the one below to load datasets and train models. This DataBlock loads a dataset of various types of bears, splits it into training and validation sets, and resizes the images to 128×128. For a detailed explanation, see the From Data to DataLoaders section in Chapter 2 of Fastbook.

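from fastai.vision.all import *  # provides DataBlock, ImageBlock, get_image_files, ...

bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock),               # inputs are images, targets are categories
    get_items=get_image_files,                        # how to list the items
    splitter=RandomSplitter(valid_pct=0.2, seed=42),  # hold out 20% for validation
    get_y=parent_label,                               # label = name of the parent folder
    item_tfms=Resize(128))                            # resize every image to 128 pixels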

In this DataBlock, get_items uses get_image_files to load the images. I was curious to see how get_image_files works under the hood to return all the image files in a dataset. As Jeremy always suggests, I started looking into the source code using the handy question mark functionality in Jupyter Notebooks. The source code for get_image_files can be found in the fastai repo here.

get_image_files expects the path to the folder that contains the images. Its signature also has two parameters with default values: recurse=True and folders=None.

Internally, get_image_files simply calls get_files(path, extensions=image_extensions, recurse=recurse, folders=folders), with extensions set to image_extensions.

What is image_extensions doing?

image_extensions is just a variable holding a set of image file extensions taken from the mimetypes module, which is part of the Python standard library and maps filenames to MIME types. Let's print image_extensions to see the whole set of image type extensions.

>>> image_extensions
{'.jpg', '.svg', '.pgm', '.png', '.xbm', '.jpe', '.pbm', '.pnm', '.rgb', '.tiff', '.xpm', '.jpeg', '.ras', '.ico', '.tif', '.ppm', '.xwd', '.gif', '.bmp', '.ief'}

mimetypes.types_map is a dictionary, and mimetypes.types_map.items() returns its key-value pairs. Each key is a file extension and each value is a MIME type. We select the keys whose values start with image/, which gives the set of image_extensions shown in the output above.
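As a minimal sketch of the selection described above, the same kind of set can be built with the standard library alone. The variable name mirrors fastai's, but this is an illustration and not a quote of the fastai source, and the exact contents of the set depend on your Python version and platform.

```python
import mimetypes

# types_map maps a file extension (key) to its MIME type (value),
# for example '.png' -> 'image/png'.
image_extensions = {
    ext
    for ext, mime in mimetypes.types_map.items()
    if mime.startswith("image/")
}

print(sorted(image_extensions))
```

Because the result is a set, checking whether an extension such as '.png' is an image extension is a single membership test.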

Now let's look at the get_files function, which returns a list of files filtered by the extensions and folders passed to it. Its full signature is get_files(path, extensions=None, recurse=True, folders=None, followlinks=True).

Let's understand what get_files is doing line by line, starting with the first few lines of code.

from fastcore.all import L, setify
from pathlib import Path

def get_files(path, extensions=None, recurse=True, folders=None, followlinks=True):
    "Get all the files in `path` with optional `extensions`, optionally with `recurse`, only in `folders`, if specified."
    path = Path(path)
    folders=L(folders)
    extensions = setify(extensions)
    extensions = {e.lower() for e in extensions}
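To see what these first lines do to their arguments, here is a small sketch that uses only the standard library. Note that as_set and normalise_args are hypothetical helpers I wrote for illustration: as_set only approximates what setify does for common inputs and is not fastcore's implementation, and fastcore's L is left out entirely.

```python
from pathlib import Path

def as_set(o):
    """Hypothetical stand-in for fastcore's setify (an approximation only)."""
    if o is None:
        return set()
    if isinstance(o, str):
        return {o}      # a single string is one item, not a set of characters
    return set(o)       # list, tuple, range, set, ...

def normalise_args(path, extensions=None):
    """Mimic the argument clean-up at the top of get_files."""
    path = Path(path)                                      # str -> Path
    extensions = {e.lower() for e in as_set(extensions)}   # lower-case every extension
    return path, extensions

path, exts = normalise_args("data/bears", [".JPG", ".png", ".Png"])
print(type(path).__name__)
print(exts)  # a set containing '.jpg' and '.png'
```

Lower-casing the extensions once here means later comparisons only need to lower-case the file name side.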

In the function definition, recurse=True is the default. When it is True, get_files goes through all the files in the path we passed and also descends recursively into the folders inside that path. When recurse=False, it only goes through the files directly inside the path, without going into any folders.

import os

def get_files(path, extensions=None, recurse=True, folders=None, followlinks=True):
    ...
    ...
    if recurse:
        ...
        ...
    else:
        f = [o.name for o in os.scandir(path) if o.is_file()]
        res = _get_files(path, f, extensions)

For the sake of understanding, let's take a .git directory as an example.

os.scandir returns an iterator of os.DirEntry objects. The os module also has os.listdir(path='.'), which lists the same entries but returns only their names. scandir gives better performance for most common use cases, because code that also needs to know whether an entry is a file or a directory can read that from the DirEntry object. [1]

f = [o.name for o in os.scandir(path) if o.is_file()]

With a list comprehension, this returns a list of file names as shown below. is_file() returns True if the entry is a file (or a symbolic link pointing to a file) and False if it is a directory.

>>> path=Path('.git')
>>> [o.name for o in os.scandir(path) if o.is_file()]
['index', 'HEAD', 'packed-refs', 'config', 'description']
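The difference between os.listdir and os.scandir can be tried out safely in a temporary directory. The file and folder names below are made up for this sketch and only loosely imitate a .git directory.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "HEAD").write_text("ref")      # a regular file
    (root / "config").write_text("cfg")    # another regular file
    (root / "hooks").mkdir()               # a sub-directory

    # os.listdir gives bare names, so files and directories look the same.
    names = sorted(os.listdir(root))

    # os.scandir gives os.DirEntry objects, so each entry can be asked
    # whether it is a file.
    with os.scandir(root) as entries:
        files = sorted(e.name for e in entries if e.is_file())

print(names)  # ['HEAD', 'config', 'hooks']
print(files)  # ['HEAD', 'config']
```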

If recurse=True, get_files goes through all the directories and works on the files recursively, using os.walk. Let's look at what os.walk returns to understand more.

>>> for i,(p,d,f) in enumerate(os.walk(path, followlinks=True)): # returns (dirpath, dirnames, filenames):
...     print(p, f)
...
. ['index', 'HEAD', 'packed-refs', 'config', 'description', '.env']
./.gitignore []
./branches []
./hooks ['fsmonitor-watchman.sample', 'update.sample', 'pre-applypatch.sample', 'pre-push.sample', 'pre-receive.sample', 'applypatch-msg.sample', 'pre-commit.sample', 'prepare-commit-msg.sample', 'commit-msg.sample', 'post-update.sample', 'pre-rebase.sample']
./objects []
./objects/pack ['pack-0ae71ead7e875289ae1c9a4b14ca65dbb7a9fc83.pack', 'pack-0ae71ead7e875289ae1c9a4b14ca65dbb7a9fc83.idx']
./objects/info []
./info ['exclude']
./refs []
./refs/tags []
./refs/remotes []
./refs/remotes/origin ['HEAD']
./refs/heads ['master']
./logs ['HEAD']
./logs/refs []
./logs/refs/remotes []
./logs/refs/remotes/origin ['HEAD']
./logs/refs/heads ['master']

To summarise how the get_files function works, take a dataset folder named bears as an example.

When recurse=False, for the path bears, it returns just the file README, excluding hidden files such as .gitignore as well as directories.

>>> get_files(path, recurse=False)
['README']

With recurse=True, for the path bears, it returns all valid files inside the root directory as well as in folders such as grizzly, black, teddy, details, folder etc.

After that, the file names are passed to the _get_files function, which turns the list of file names into a list of pathlib Path objects.

fs is the list of file names passed in. Files starting with . such as .gitignore or .env are skipped, as they are not usually useful as part of a dataset. A file is also skipped when extensions were passed and its lower-cased extension fails the check f'.{f.split(".")[-1].lower()}' in extensions.

res is built by joining p/f for each remaining file name, so it becomes a list of paths as shown in the result below. In other words, a list of file names is transformed into a list of pathlib Path objects pointing to those files.
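The filtering just described can be sketched as follows. This is my reconstruction from the description above, written so it runs on its own. Treat it as an approximation and check the fastai source for the exact code. The file names in the example are made up.

```python
from pathlib import Path

def _get_files(p, fs, extensions=None):
    """Join directory `p` with each name in `fs`, dropping hidden files and,
    when `extensions` is given, files whose extension is not in it."""
    p = Path(p)
    return [
        p / f
        for f in fs
        if not f.startswith(".")
        and (not extensions or f".{f.split('.')[-1].lower()}" in extensions)
    ]

names = ["grizzly.JPG", "notes.txt", ".gitignore", "README"]
only_jpg = _get_files("bears", names, {".jpg"})
no_filter = _get_files("bears", names)
print(only_jpg)
print(no_filter)
```

With the .jpg filter only bears/grizzly.JPG survives, because its extension is lower-cased before the comparison. Without a filter every file is kept except the hidden .gitignore.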

>>> get_files(Path('.'))
[Path('index'), Path('HEAD'), Path('packed-refs'), Path('config'), Path('description'), Path('hooks/fsmonitor-watchman.sample'), Path('hooks/update.sample'), Path('hooks/pre-applypatch.sample'), Path('hooks/pre-push.sample'), Path('hooks/pre-receive.sample'), Path('hooks/applypatch-msg.sample'), Path('hooks/pre-commit.sample'), Path('hooks/prepare-commit-msg.sample'), Path('hooks/commit-msg.sample'), Path('hooks/post-update.sample'), Path('hooks/pre-rebase.sample'), Path('objects/pack/pack-0ae71ead7e875289ae1c9a4b14ca65dbb7a9fc83.pack'), Path('objects/pack/pack-0ae71ead7e875289ae1c9a4b14ca65dbb7a9fc83.idx'), Path('info/exclude'), Path('refs/remotes/origin/HEAD'), Path('refs/heads/master'), Path('logs/HEAD'), Path('logs/refs/remotes/origin/HEAD'), Path('logs/refs/heads/master')]
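Putting the pieces together, here is a simplified, standard-library-only sketch of the whole flow. The names get_files_sketch, get_image_files_sketch and _keep are hypothetical and chosen for this post. The sketch leaves out the folders argument, the special handling of hidden directories and the L return type, so it is not a drop-in replacement for fastai's functions.

```python
import mimetypes
import os
import tempfile
from pathlib import Path

image_extensions = {e for e, m in mimetypes.types_map.items() if m.startswith("image/")}

def _keep(name, extensions):
    """True for non-hidden names whose extension is allowed (or when no filter is set)."""
    if name.startswith("."):
        return False
    return not extensions or f".{name.split('.')[-1].lower()}" in extensions

def get_files_sketch(path, extensions=None, recurse=True):
    """Simplified stand-in for get_files: no `folders`, returns a plain list."""
    path = Path(path)
    extensions = {e.lower() for e in (extensions or ())}
    if recurse:
        res = []
        # os.walk yields (dirpath, dirnames, filenames) for every directory.
        for dirpath, _dirnames, filenames in os.walk(path, followlinks=True):
            res += [Path(dirpath) / f for f in filenames if _keep(f, extensions)]
        return res
    # Top level only: keep the entries that are files.
    with os.scandir(path) as entries:
        names = [e.name for e in entries if e.is_file()]
    return [path / f for f in names if _keep(f, extensions)]

def get_image_files_sketch(path, recurse=True):
    """Same traversal, restricted to image extensions."""
    return get_files_sketch(path, extensions=image_extensions, recurse=recurse)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "grizzly").mkdir()
    (root / "README").write_text("bears")
    (root / ".gitignore").write_text("*.pyc")
    (root / "grizzly" / "a.JPG").write_text("placeholder, not a real image")
    (root / "grizzly" / "notes.txt").write_text("hi")

    top_level = sorted(p.name for p in get_files_sketch(root, recurse=False))
    everything = sorted(p.name for p in get_files_sketch(root))
    images = sorted(p.name for p in get_image_files_sketch(root))

print(top_level)   # ['README']
print(everything)  # ['README', 'a.JPG', 'notes.txt']
print(images)      # ['a.JPG']
```

The three printed lists line up with the behaviour described earlier: recurse=False sees only README, recurse=True also reaches the files in grizzly, and the image variant keeps only the file with an image extension.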

Going back to the first few lines of get_files: we convert the path provided to us into a pathlib Path object, and folders is converted to L, a special Python list-like type from the fastcore library. The extensions are converted to a set using setify if they are passed as a list, range, string etc. All the extensions are then converted to lower case, in case any extension is in upper case. To read more about the setify function, check the fastcore docs.

I would highly recommend reading about the functionality of os.walk in this article. On iterating through os.walk(), we get each directory path and the list of file names in that directory. These are passed to the _get_files(p, f, extensions) function.

This is how get_image_files returns an L object based on fastcore. For the BIWI Dataset, the output of get_image_files and get_files is as follows:

[Image: output of get_image_files and get_files on the BIWI Dataset]

Conclusion

I hope that with this blog post you now understand how get_image_files fetches the list of images under the hood, by looking into the source code.

In case I have missed something, or if you want to provide feedback, please feel free to reach out to me @kurianbenoy2.

References

[1] Fastai source code

[2] Python os module

[3] Deep Learning for Coders with Fastai and Pytorch: AI Applications Without a PhD