There is a book that covers how to apply software engineering best practices in data science and machine learning projects: Software Engineering for Data Scientists.
Good code
What constitutes good code:
- Simple
- Modular
- Readable
- Efficient
- Robust
Tests
Benefits of having tests:
- they act as a safety net against regressions
- they make refactoring safer
Tests follow a specific structure:
- Arrangement: Set up the conditions for the test.
- Action: Execute the functionality being tested.
- Assertion: Verify that the outcome is as expected.
This is known as the AAA pattern (Arrange, Act, Assert), and is a good rule of thumb to follow when writing tests.
Once tests are finished, it is important to clean up any resources that were created, modified, or used during the test, so that subsequent tests run in a clean environment. In some frameworks this is known as “teardown”.
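A minimal sketch of the AAA pattern plus teardown, using Python's built-in unittest framework. The `row_count` function is a hypothetical function under test, invented for illustration:

```python
import tempfile
import unittest
from pathlib import Path


def row_count(path):
    # Hypothetical function under test: count data rows in a CSV file
    with open(path) as f:
        return sum(1 for _ in f) - 1  # minus the header line


class TestRowCount(unittest.TestCase):
    def setUp(self):
        # Arrange: create a small temporary CSV the test can rely on
        self.tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False)
        self.tmp.write("id,value\n1,10\n2,20\n")
        self.tmp.close()

    def test_row_count(self):
        result = row_count(self.tmp.name)  # Act
        self.assertEqual(result, 2)        # Assert

    def tearDown(self):
        # Teardown: remove the temporary file so later tests start clean
        Path(self.tmp.name).unlink()
```

Run with `python -m unittest`; `setUp` and `tearDown` execute around every test method, so each test starts from a clean state.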
There are three main types of tests:
- Unit Tests: Test individual components or functions in isolation.
- Integration Tests: Test how different components work together.
- End-to-End Tests: Test the entire application flow from start to finish.
They should follow a pyramid structure: more unit tests than integration tests, and more integration tests than end-to-end tests.
Tests for Data Science and Machine Learning
Focus on:
- Using synthetic data to ensure tests are deterministic and reproducible. It can include outliers, missing values, or specific distributions.
- Tests need to be reproducible, so set random seeds where applicable.
- Checking model performance metrics to ensure they meet predefined thresholds.
NOTE
Focus on unit tests for a faster feedback loop.
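The points above can be combined in one test: a fixed seed, synthetic data with a known relationship, and an assertion on a performance threshold. Here `train_and_score` is a hypothetical stand-in for a real training pipeline, fitting a least-squares line with NumPy and returning R²:

```python
import numpy as np


def train_and_score(X, y):
    # Hypothetical stand-in for a training pipeline: fit a straight line
    # by least squares and return R^2 on the same data
    coef, intercept = np.polyfit(X, y, 1)
    pred = coef * X + intercept
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot


def test_model_meets_threshold():
    rng = np.random.default_rng(42)  # fixed seed: the test is reproducible
    X = rng.normal(size=200)
    # Synthetic data with a known linear relationship plus small noise
    y = 3.0 * X + rng.normal(scale=0.1, size=200)
    score = train_and_score(X, y)
    assert score > 0.95  # predefined performance threshold


test_model_meets_threshold()
```

Because both the data and the seed are fixed, the test either always passes or always fails, never flakes.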
Folder structure
There is no silver bullet when it comes to folder structure, although sometimes you can find recommendations based on the language or framework you are using.
For data projects, I would recommend separating:
- Source Code
- Tests
- Notebooks
- Scripts
- Data (raw, processed, external, interim), if applicable
- Models
- Documentation
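One possible layout following the list above (names are illustrative, not a standard):

```
project/
├── src/            # source code (importable package)
├── tests/
├── notebooks/
├── scripts/
├── data/
│   ├── raw/
│   ├── interim/
│   ├── processed/
│   └── external/
├── models/
└── docs/
```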
TIP
Use tools like Cookiecutter Data Science to create a standardized folder structure for your data science projects.
Configuration Management
Ideally, no values should be hardcoded in the codebase. Instead, use configuration files or environment variables to manage settings and parameters.
This is useful when training models or running experiments, as you can easily change hyperparameters or other settings without modifying the code.
YAML or JSON files are commonly used to store configurations.
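A minimal sketch using a JSON file (stdlib only; for YAML you would need an extra library such as PyYAML). The file name and the hyperparameter keys are made up for illustration:

```python
import json
from pathlib import Path

# Hypothetical config file contents; hyperparameters live here, not in the code
Path("params.json").write_text(
    '{"model": {"n_estimators": 100, "max_depth": 8},'
    ' "training": {"test_size": 0.2, "random_seed": 42}}'
)

config = json.loads(Path("params.json").read_text())
seed = config["training"]["random_seed"]  # read a setting instead of hardcoding it
print(seed)  # → 42
```

Changing a hyperparameter now means editing `params.json`, not the code, which keeps experiment runs easy to track and compare.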
Look for Code Smells
Code smells are signs that the code may need improvement, such as:
- Duplicated code or logic
- Bloated functions or classes
- Bad naming and lack of naming conventions
- Chained conditional blocks: Multiple if-else statements might indicate a poor logical structure.
- Magic Numbers: Hardcoded numbers that lack context or meaning.
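A small before/after sketch of the magic-numbers smell, with invented numbers for illustration:

```python
# Smell: magic numbers with no context
def monthly_payment(total):
    return total / 12 * 1.21


# Better: named constants make intent and units explicit
MONTHS_PER_YEAR = 12
VAT_RATE = 0.21


def monthly_payment_clear(total):
    return total / MONTHS_PER_YEAR * (1 + VAT_RATE)
```

Both functions compute the same value, but only the second tells the reader what 12 and 1.21 mean.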
Version Control Best Practices
- Branches should be short-lived and merged back to main frequently, to avoid conflicts.
- Keep commits small and focused on a single change or improvement.
- Write clear and descriptive commit messages to document the changes made; consider following the Conventional Commits guidelines.
Refactoring
- Avoid refactoring and adding new features at the same time.
- Avoid premature optimization: code needs to work and do what it is supposed to do before you improve its performance.
Python Performance
- Use built-in functions
- Use asynchronous programming for I/O-bound work
- Use generators and iterators to avoid holding large collections in memory
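A quick sketch of the generator point: a list comprehension materializes every element up front, while a generator expression produces them lazily, so the generator object itself stays tiny regardless of how many elements it will yield:

```python
import sys

# List: all 100,000 squares are stored in memory at once
squares_list = [i * i for i in range(100_000)]

# Generator: squares are computed one at a time, on demand
squares_gen = (i * i for i in range(100_000))

print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))  # → True
print(sum(squares_gen) == sum(squares_list))  # same result either way  → True
```

Note that a generator can only be consumed once; if you need to iterate multiple times, a list (or re-creating the generator) is the right choice.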
Find where the bottlenecks are by measuring time and memory consumption, using profiling tools.
Some examples of profiling tools are:
- cProfile
- line_profiler
- memory_profiler
For Jupyter notebooks, you can use magic commands:
- %timeit (built-in)
- %memit (provided by the memory_profiler extension)
These are ideal for single functions or expressions, but profilers are better for analyzing the whole codebase.
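A minimal cProfile sketch using only the standard library; `slow_sum` is a deliberately naive function invented to give the profiler something to find:

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Deliberately naive: building a fresh list on every iteration
    # dominates the runtime
    total = 0
    for i in range(n):
        total += sum(list(range(i % 100)))
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_sum(5_000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)  # top 5 entries by cumulative time
print(stream.getvalue())
```

The report shows where time is actually spent, which is exactly the measurement you want before attempting any optimization.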
Book: High Performance Python
Big O Notation
Big O notation is a way to describe the performance or complexity of an algorithm in terms of time or space as the input size grows. It provides an upper bound on the growth rate of an algorithm’s resource consumption.
Complexity in this case means how the time or space (memory) requirements of an algorithm change as the size of the input data increases.
Common Big O complexities include, from best to worst:
- O(1): Constant time complexity, where the execution time remains the same regardless of input size.
- O(log n): Logarithmic time complexity, where the execution time grows logarithmically with input size.
- O(n): Linear time complexity, where the execution time grows linearly with input size.
- O(n log n): Linearithmic time complexity, often seen in efficient sorting algorithms.
- O(n^2): Quadratic time complexity, where the execution time grows quadratically with input size, often seen in nested loops.
- O(2^n): Exponential time complexity, where the execution time doubles with each additional input element, often seen in recursive algorithms.
- O(n!): Factorial time complexity, where the execution time grows factorially with input size, often seen in algorithms that generate all permutations.
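The difference between O(n) and O(1) is easy to see in Python: membership testing in a list scans elements one by one, while a set uses a hash lookup. A small timing sketch:

```python
import timeit

n = 100_000
haystack_list = list(range(n))
haystack_set = set(haystack_list)
needle = n - 1  # worst case for the list: found only after a full scan

# O(n): list membership scans elements one by one
t_list = timeit.timeit(lambda: needle in haystack_list, number=100)
# O(1) on average: set membership is a hash lookup
t_set = timeit.timeit(lambda: needle in haystack_set, number=100)

print(t_set < t_list)  # → True
```

Picking the right data structure is often the cheapest "optimization" available: the code stays just as readable, but the complexity class changes.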