There is a book that covers how to apply software engineering best practices in data science and machine learning projects: Software Engineering for Data Scientists.
Good code
What constitutes good code:
- Simple
- Modular
- Readable
- Efficient
- Robust
Tests
Benefits of having tests:
- they act as a safety net against regressions
- they make refactoring safer
Tests follow a specific structure:
- Arrangement: Set up the conditions for the test.
- Action: Execute the functionality being tested.
- Assertion: Verify that the outcome is as expected.
This is known as the AAA pattern (Arrange, Act, Assert), and is a good rule of thumb to follow when writing tests.
Once tests are finished, it is important to clean up any resources that were created, modified, or used during the test, so that subsequent tests run in a clean environment. In some frameworks this is known as “teardown”.
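A minimal sketch of the AAA pattern plus teardown, using Python's built-in unittest framework. The `row_count` function is a hypothetical function under test, invented for illustration:

```python
import tempfile
import unittest
from pathlib import Path


def row_count(path):
    # Hypothetical function under test: count data rows in a CSV file
    with open(path) as f:
        return sum(1 for _ in f) - 1  # minus the header line


class TestRowCount(unittest.TestCase):
    def setUp(self):
        # Arrange: create a small temporary CSV the test can rely on
        self.tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False)
        self.tmp.write("id,value\n1,10\n2,20\n")
        self.tmp.close()

    def test_row_count(self):
        result = row_count(self.tmp.name)  # Act
        self.assertEqual(result, 2)        # Assert

    def tearDown(self):
        # Teardown: remove the temporary file so later tests start clean
        Path(self.tmp.name).unlink()
```

Run with `python -m unittest`; `setUp` and `tearDown` execute around every test method, so each test starts from a clean state.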
There are three main types of tests:
- Unit Tests: Test individual components or functions in isolation.
- Integration Tests: Test how different components work together.
- End-to-End Tests: Test the entire application flow from start to finish.
They should follow a pyramid structure: more unit tests than integration tests, and more integration tests than end-to-end tests.
Tests for Data Science and Machine Learning
Focus on:
- Using synthetic data to ensure tests are deterministic and reproducible. It can include outliers, missing values, or specific distributions.
- Tests need to be reproducible, so set random seeds where applicable.
- Checking model performance metrics to ensure they meet predefined thresholds.
NOTE
Focus on unit tests for a faster feedback loop.
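The points above can be combined in one test: a fixed seed, synthetic data with a known relationship, and an assertion on a performance threshold. Here `train_and_score` is a hypothetical stand-in for a real training pipeline, fitting a least-squares line with NumPy and returning R²:

```python
import numpy as np


def train_and_score(X, y):
    # Hypothetical stand-in for a training pipeline: fit a straight line
    # by least squares and return R^2 on the same data
    coef, intercept = np.polyfit(X, y, 1)
    pred = coef * X + intercept
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot


def test_model_meets_threshold():
    rng = np.random.default_rng(42)  # fixed seed: the test is reproducible
    X = rng.normal(size=200)
    # Synthetic data with a known linear relationship plus small noise
    y = 3.0 * X + rng.normal(scale=0.1, size=200)
    score = train_and_score(X, y)
    assert score > 0.95  # predefined performance threshold


test_model_meets_threshold()
```

Because both the data and the seed are fixed, the test either always passes or always fails, never flakes.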
Folder structure
There is no silver bullet when it comes to folder structure, although sometimes you can find recommendations based on the language or framework you are using.
For data projects, I would recommend separating:
- Source Code
- Tests
- Notebooks
- Scripts
- Data (raw, processed, external, interim), if applicable
- Models
- Documentation
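One possible layout following the list above (names are illustrative, not a standard):

```
project/
├── src/            # source code (importable package)
├── tests/
├── notebooks/
├── scripts/
├── data/
│   ├── raw/
│   ├── interim/
│   ├── processed/
│   └── external/
├── models/
└── docs/
```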
TIP
Use tools like Cookiecutter Data Science to create a standardized folder structure for your data science projects.
Configuration Management
Ideally, no values should be hardcoded in the codebase. Instead, use configuration files or environment variables to manage settings and parameters.
This is useful when training models or running experiments, as you can easily change hyperparameters or other settings without modifying the code.
YAML or JSON files are commonly used to store configurations.
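A minimal sketch using a JSON file (stdlib only; for YAML you would need an extra library such as PyYAML). The file name and the hyperparameter keys are made up for illustration:

```python
import json
from pathlib import Path

# Hypothetical config file contents; hyperparameters live here, not in the code
Path("params.json").write_text(
    '{"model": {"n_estimators": 100, "max_depth": 8},'
    ' "training": {"test_size": 0.2, "random_seed": 42}}'
)

config = json.loads(Path("params.json").read_text())
seed = config["training"]["random_seed"]  # read a setting instead of hardcoding it
print(seed)  # → 42
```

Changing a hyperparameter now means editing `params.json`, not the code, which keeps experiment runs easy to track and compare.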
Look for Code Smells
Code smells are signs that the code may need improvement, such as:
- Duplicated code or logic
- Bloated functions or classes
- Bad naming and lack of naming conventions
- Chained conditional blocks: Multiple if-else statements might indicate a poor logical structure.
- Magic Numbers: Hardcoded numbers that lack context or meaning.
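A small before/after sketch of the magic-numbers smell, with invented numbers for illustration:

```python
# Smell: magic numbers with no context
def monthly_payment(total):
    return total / 12 * 1.21


# Better: named constants make intent and units explicit
MONTHS_PER_YEAR = 12
VAT_RATE = 0.21


def monthly_payment_clear(total):
    return total / MONTHS_PER_YEAR * (1 + VAT_RATE)
```

Both functions compute the same value, but only the second tells the reader what 12 and 1.21 mean.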
Version Control Best Practices
- Branches should be short-lived and merged back to main frequently, to avoid conflicts.
- Keep commits small and focused on a single change or improvement.
- Write clear and descriptive commit messages to document the changes made; consider following the Conventional Commits guidelines.
Refactoring
- Avoid refactoring and adding new features at the same time.
- Avoid premature optimization: code needs to work and do what it is supposed to do before you improve its performance.
Python Performance
- Use built-in functions
- Use asynchronous programming for I/O-bound work
- Use generators and iterators to avoid holding large collections in memory
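A quick sketch of the generator point: a list comprehension materializes every element up front, while a generator expression produces them lazily, so the generator object itself stays tiny regardless of how many elements it will yield:

```python
import sys

# List: all 100,000 squares are stored in memory at once
squares_list = [i * i for i in range(100_000)]

# Generator: squares are computed one at a time, on demand
squares_gen = (i * i for i in range(100_000))

print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))  # → True
print(sum(squares_gen) == sum(squares_list))  # same result either way  → True
```

Note that a generator can only be consumed once; if you need to iterate multiple times, a list (or re-creating the generator) is the right choice.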
Find where the bottlenecks are by measuring time and memory consumption, using profiling tools.
Some examples of profiling tools are:
- cProfile
- line_profiler
- memory_profiler
For Jupyter notebooks, you can use magic commands:
- %timeit (built-in)
- %memit (provided by the memory_profiler extension)
These are ideal for single functions or expressions, but profilers are better for analyzing the whole codebase.
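A minimal cProfile sketch using only the standard library; `slow_sum` is a deliberately naive function invented to give the profiler something to find:

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Deliberately naive: building a fresh list on every iteration
    # dominates the runtime
    total = 0
    for i in range(n):
        total += sum(list(range(i % 100)))
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_sum(5_000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)  # top 5 entries by cumulative time
print(stream.getvalue())
```

The report shows where time is actually spent, which is exactly the measurement you want before attempting any optimization.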
Book: High Performance Python
Big O Notation
Big O notation is a way to describe the performance or complexity of an algorithm in terms of time or space as the input size grows. It provides an upper bound on the growth rate of an algorithm’s resource consumption.
Complexity in this case means how the time or space (memory) requirements of an algorithm change as the size of the input data increases.
Common Big O complexities include, from best to worst:
- O(1): Constant time complexity, where the execution time remains the same regardless of input size.
- O(log n): Logarithmic time complexity, where the execution time grows logarithmically with input size.
- O(n): Linear time complexity, where the execution time grows linearly with input size.
- O(n log n): Linearithmic time complexity, often seen in efficient sorting algorithms.
- O(n^2): Quadratic time complexity, where the execution time grows quadratically with input size, often seen in nested loops.
- O(2^n): Exponential time complexity, where the execution time doubles with each additional input element, often seen in recursive algorithms.
- O(n!): Factorial time complexity, where the execution time grows factorially with input size, often seen in algorithms that generate all permutations.
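The difference between O(n) and O(1) is easy to see in Python: membership testing in a list scans elements one by one, while a set uses a hash lookup. A small timing sketch:

```python
import timeit

n = 100_000
haystack_list = list(range(n))
haystack_set = set(haystack_list)
needle = n - 1  # worst case for the list: found only after a full scan

# O(n): list membership scans elements one by one
t_list = timeit.timeit(lambda: needle in haystack_list, number=100)
# O(1) on average: set membership is a hash lookup
t_set = timeit.timeit(lambda: needle in haystack_set, number=100)

print(t_set < t_list)  # → True
```

Picking the right data structure is often the cheapest "optimization" available: the code stays just as readable, but the complexity class changes.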