Project Autodidact Progress Report (S02-C01-M02-AllParts)

Project Autodidact

Project Details: https://insightsbyse.com/projectautodidact/

Scott Ernst Bio: https://insightsbyse.com/aboutscotternst/

Project Contact: InsightsBySE@protonmail.com

Progress Report Scope (S02-C01-M02-AllParts)

Stage 2 of 4: Programming, Data Science, and Machine Learning Fundamentals and Applications

Cluster 1 of 14: Fundamental Data Science Knowledge, Skills, and Tools

Module 2 of 3: Git, GitHub, and GitHub Integration With Google Colab and Databricks

Parts 1 through 3: See below

Summary Of Goals Achieved

  1. Learned process for initializing a local repository
  2. Learned process for connecting a local repository to GitHub
  3. Learned process for using Git commands for ensuring reproducibility and attribution: git blame command, creating lightweight and annotated Git tags, remotely sharing Git tags, git checkout command, and git switch command
  4. Learned process for creating public and private repositories, creating applicable IP licenses, controlling repository access, controlling repository security, and creating a .gitignore file
  5. Learned similarities and differences among Git configuration scopes: local, global, and system
  6. Learned core Git commands: git init, git clone, git status, git add, git commit -m "message", git push, and git pull
  7. Learned similarities and differences between applicability of GitHub local and remote branches
  8. Learned workflow for a GitHub local branch
  9. Learned workflow for a GitHub remote branch in a cloud environment (Google Colab or Databricks)
  10. Learned applicability of direct GitHub file editing and workflow for direct GitHub file editing
  11. Learned applicability of .gitignore file editing and workflow for .gitignore file editing
  12. Learned applicability of GitHub diffs and workflow for GitHub diffs
  13. Learned process for opening any public or private GitHub-hosted Google Colab notebook directly in Google Colab interface
  14. Learned process for adding “Open in Colab” badge to GitHub README.md file for instant execution by collaborators or users
  15. Learned process for saving a new Google Colab notebook or changes to an existing Google Colab notebook directly to a GitHub repository
  16. Learned process for using a Personal Access Token for pushing changes from Google Colab to a GitHub repository instead of a password
  17. Learned similarities and differences between external and internal command-line tools in Google Colab
  18. Learned applicability and execution of common Git shell commands in Google Colab
  19. Learned applicability, scope, and use of GitHub Dependabot Secrets
  20. Learned applicability, scope, and use of GitHub Actions Secrets
  21. Learned applicability, scope, and use of GitHub Codespaces Secrets
  22. Learned GitHub pull request workflow for collaboration and code review: (1) branching, isolation, and feature development initialization, (2) atomic committing across the MLOps lifecycle, (3) creating and submitting the pull request, (4) collaborative code review and iteration cycle, and (5) final merge, history management, and audit trail cleanup
  23. Learned MLOps CI/CD pipeline: (1) Continuous Integration (CI): the pre-merge code, data, and security gate, (2) Continuous Delivery (CD): model training, artifact management, and staging, and (3) Continuous Deployment (CD): production release strategy and rollback
  24. Learned GitHub Actions Workflows with YAML: (1) workflows, (2) jobs, (3) steps, (4) actions, (5) events, (6) runners, (7) contexts, (8) secrets, (9) inputs, (10) outputs, (11) concurrency, and (12) reusability
  25. Learned GitHub Issues: (1) issue creation, (2) labels, (3) assignees, (4) comments and discussions, (5) milestones, (6) projects, (7) linking to pull requests, (8) sub-issues, (9) issue templates, (10) notifications, and (11) administration
  26. Learned Git Tags and GitHub Releases for version control

Part 1 of 3: GitHub Basics and Foundational Git Commands

Goal 1 Statement: Learn process for initializing a local repository

Goal 1 Plan: (1) Read source materials and (2) Complete practice problem

Goal 1 Work Product: (1) Practice problem below and (2) List of best practices

Practice Problem:

Scenario: You’re a quant analyst setting up the foundation for a new stock price prediction model on your local workstation. To track code changes from day one, you must put the project under version control using the command line.

Problem to be Solved: Create a new project folder, initialize a local Git repository within it, and make the very first commit of a placeholder script.

Knowledge and Skills Developed: Understanding the git init command, the three-state architecture (working directory, staging area, local repository), and performing a fundamental git add/git commit workflow.
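The three states named above can be observed directly by asking Git for its status between steps. The following is a minimal sketch, assuming Git is installed and on the PATH; the script name predict.py, the commit message, and the inline identity values are hypothetical placeholders rather than details from the original exercise.

```python
import subprocess
from pathlib import Path


def git(repo, *args):
    """Run one Git command inside repo and return its output without the trailing newline."""
    # The identity is passed inline (hypothetical values) so no global Git configuration is required.
    cmd = ["git", "-c", "user.name=Example Analyst", "-c", "user.email=analyst@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout.rstrip()


def first_commit(project_dir, script_name="predict.py"):
    """Create the folder, initialize a repository, and commit a placeholder script."""
    project = Path(project_dir)
    project.mkdir(parents=True, exist_ok=True)
    git(project, "init")
    (project / script_name).write_text("# placeholder for the prediction model\n")
    untracked = git(project, "status", "--porcelain")  # file exists only in the working directory
    git(project, "add", script_name)
    staged = git(project, "status", "--porcelain")     # file is now in the staging area
    git(project, "commit", "-m", "Add placeholder prediction script")
    committed = git(project, "status", "--porcelain")  # empty once the snapshot is in the repository
    return untracked, staged, committed
```

Calling first_commit("stock-model") returns the three status snapshots in order: the file reported as untracked (??), then as added (A), then nothing left to report.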

Goal 1 Result: Practice problem completed (practice files deleted)

Goal 2 Statement: Learn process for connecting a local repository to GitHub

Goal 2 Plan: (1) Read source materials and (2) Complete practice problem

Goal 2 Work Product: (1) Practice problem below and (2) List of best practices

Practice Problem:

Scenario: You’re a data scientist who has just completed a successful local experiment on a new model feature for predicting NFL game outcomes. To back up the code and prepare it for peer review by a colleague, you must create a remote repository and push the local history.

Problem to be Solved: Create a corresponding repository on GitHub, link the local Git repository to this remote location, and transfer all local history to the cloud.

Knowledge and Skills Developed: Understanding the difference between local and remote repositories, using the git remote add command to establish a connection, and executing git push to upload the code and history to GitHub. This is the link between the tool (Git) and the platform (GitHub).
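The linking step can be sketched as follows, assuming Git is installed and the remote repository already exists. The remote name origin and the branch name main are conventions rather than requirements, and the helper names are hypothetical. For GitHub, remote_url would be the repository's HTTPS or SSH address; any reachable Git URL, including a local bare repository, behaves the same way.

```python
import subprocess


def git(repo, *args):
    """Run one Git command inside repo and return its trimmed output."""
    # Inline identity (hypothetical values) keeps the sketch independent of global configuration.
    cmd = ["git", "-c", "user.name=Example Scientist", "-c", "user.email=scientist@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def connect_and_push(local_repo, remote_url, branch="main"):
    """Link an existing local repository to a remote and upload its history."""
    git(local_repo, "remote", "add", "origin", remote_url)  # register the remote location
    git(local_repo, "branch", "-M", branch)                 # make sure the current branch has the expected name
    git(local_repo, "push", "-u", "origin", branch)         # upload commits and record the upstream branch
    return git(local_repo, "ls-remote", "--heads", "origin")  # list the branches the remote now holds
```

The -u flag records the upstream branch, so later git push and git pull calls in that repository need no extra arguments.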

Goal 2 Result: Practice problem completed (practice files deleted)

Goal 3 Statement: Learn process for using Git commands for ensuring reproducibility and attribution: git blame command, creating lightweight and annotated Git tags, remotely sharing Git tags, git checkout command, and git switch command

Goal 3 Plan: (1) Read source materials and (2) Complete practice problems

Goal 3 Work Product: (1) 3 practice problems below and (2) List of best practices

Practice Problem 1: Pinpointing a Regression Bug (Attribution and Blame)

Scenario: You’re a quantitative researcher whose stock prediction model output variance unexpectedly increased by 15% in the live production environment two days ago. To identify which recent change caused the model’s instability, you must inspect the code changes in the primary prediction script.

Problem to be Solved: Identify the specific commit and the responsible developer that last modified the crucial function responsible for feature normalization, which is suspected to be the source of the variance increase.

Knowledge and Skills Developed: Mastering the git blame command to analyze attribution line-by-line, linking a current operational problem back to a specific point in the project’s history.
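Line-level attribution can be scripted by restricting git blame to a single line and reading its machine-friendly output. The following is a minimal sketch, assuming Git is installed; the helper names and the name parameter are hypothetical and exist only so the example can simulate more than one developer.

```python
import subprocess


def git(repo, *args, name="Example Dev"):
    """Run one Git command inside repo as the given (hypothetical) developer."""
    cmd = ["git", "-c", f"user.name={name}", "-c", "user.email=dev@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def blame_line(repo, path, line_no):
    """Return (commit_sha, author_name) for the commit that last changed one line of a file."""
    # -L n,n limits blame to a single line; --porcelain emits a header line plus key/value metadata.
    out = git(repo, "blame", "--porcelain", "-L", f"{line_no},{line_no}", "--", path)
    lines = out.splitlines()
    sha = lines[0].split()[0]  # header line: <sha> <original line> <final line> <line count>
    author = next(line[len("author "):] for line in lines if line.startswith("author "))
    return sha, author
```

With the returned commit hash, git show <sha> displays the full change that introduced the suspect line.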

Practice Problem 2: Tagging a Successful Deployment (Reproducibility and Tagging)

Scenario: You’re a data scientist deploying an experimental A/B test for a new customer segmentation model named ‘Segment-V3’, which achieved 95% recall in testing and is now going live immediately after Continuous Integration (CI) tests pass. To create a permanent, easy-to-reference pointer to the exact code now running in production and ensure the results can be replicated later, you must use an annotated tag.

Problem to be Solved: Tag the current commit of the main branch with a unique, descriptive, and permanent tag, and then upload the tag to GitHub.

Knowledge and Skills Developed: Understanding the use of Git tags (git tag -a) to mark significant milestones, distinguishing between lightweight and annotated tags, and ensuring tags are shared remotely (git push --tags).
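One way to see the difference between the two tag types is to ask Git what kind of object a tag name points to. An annotated tag is its own object, with a tagger, date, and message, while a lightweight tag is only a name for a commit. The following is a minimal sketch, assuming Git is installed and a remote named origin is configured; the helper names are hypothetical.

```python
import subprocess


def git(repo, *args):
    """Run one Git command inside repo and return its trimmed output."""
    # Inline identity (hypothetical values); annotated tags record it as the tagger.
    cmd = ["git", "-c", "user.name=Example Scientist", "-c", "user.email=scientist@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def tag_release(repo, tag, message):
    """Create an annotated tag on the current commit and report the tag's object type."""
    git(repo, "tag", "-a", tag, "-m", message)
    # cat-file -t prints "tag" for an annotated tag and "commit" for a lightweight one.
    return git(repo, "cat-file", "-t", tag)


def share_tags(repo, remote="origin"):
    """Upload all local tags and return the tag references the remote now lists."""
    git(repo, "push", remote, "--tags")
    return git(repo, "ls-remote", "--tags", remote)
```

Tags are not sent by an ordinary git push, which is why the explicit --tags step (or pushing a single tag by name) is needed.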

Practice Problem 3: Recreating A Past Financial Forecast (Reproducibility and Checkout)

Scenario: You’re a portfolio manager who needs to audit the exact code, configuration, and dependency list used to generate a highly accurate interest rate forecast from six months ago tagged with the specific Git Tag forecast-2024-Q2. To check the input parameters and library versions to ensure regulatory compliance and debug why current forecasts are underperforming, you must check out the specific tag.

Problem to be Solved: Restore the entire local working directory and repository state to the exact point in time tagged with the specific Git Tag forecast-2024-Q2.

Knowledge and Skills Developed: Using git checkout (or git switch --detach) on a tag to enter a detached HEAD state, understanding that commits made in this state belong to no branch and can be lost unless a new branch is created from them, and locating the necessary requirements.txt file at that historical point.
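The audit can be approached in two ways: check out the tag so the whole working directory matches it, or read a single file from the tag without moving HEAD at all. The following is a minimal sketch of both, assuming Git is installed and the tag already exists; the helper names are hypothetical.

```python
import subprocess


def git(repo, *args):
    """Run one Git command inside repo and return its trimmed output."""
    # Inline identity (hypothetical values) keeps the sketch independent of global configuration.
    cmd = ["git", "-c", "user.name=Example Manager", "-c", "user.email=manager@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def checkout_tag(repo, tag):
    """Restore the working directory to a tag and report what HEAD now names."""
    git(repo, "checkout", tag)
    # rev-parse --abbrev-ref prints the branch name, or the literal "HEAD" when HEAD is detached.
    return git(repo, "rev-parse", "--abbrev-ref", "HEAD")


def file_at_tag(repo, tag, path):
    """Read one file as it existed at a tag, leaving the working directory untouched."""
    return git(repo, "show", f"{tag}:{path}")
```

For a quick look at requirements.txt, file_at_tag(repo, "forecast-2024-Q2", "requirements.txt") is enough. After a full checkout_tag, running git checkout main (or git switch main) returns to the branch.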

Goal 3 Result: Practice problems completed (practice files deleted)

Goal 4 Statement: Learn process for creating public and private repositories, creating applicable IP licenses, controlling repository access, controlling repository security, and creating a .gitignore file

Goal 4 Plan: (1) Read source materials and (2) Complete practice problems

Goal 4 Work Product: (1) 3 practice problems below and (2) List of best practices

Practice Problem 1: Securing A Proprietary Trading Model (Creating Private Repository)

Scenario: You’re a member of a new AI development team at a hedge fund developing a new proprietary, high-frequency trading prediction algorithm using non-public market data analysis scripts. To ensure commercial confidentiality and prevent proprietary methodology from leaking to competitors, you must create a private repository on GitHub before writing the first line of code.

Problem to be Solved: Create a new GitHub repository that ensures maximum confidentiality and restricts access to only the three team members.

Knowledge and Skills Developed: Understanding the distinction between Public and Private, the process of creating a secure repository, and the process of granting specific team access.

Practice Problem 2: Preparing A Sports Data Portfolio (Creating Public Repository)

Scenario: You’re a data science student preparing for a job interview. Having completed a predictive model on NFL game outcomes locally, you want it to serve as your primary portfolio piece. To maximize visibility, demonstrate coding skills, and inform potential users how they can use the code, you must create a public repository on GitHub with the necessary open-source elements.

Problem to be Solved: Create a public repository from a local Git project and ensure it is properly licensed and described for maximum professional impact.

Knowledge and Skills Developed: Understanding the purpose of public repositories, the importance of licensing for professional work, and the command to push a local repository to a new remote repository.

Practice Problem 3: Protecting Data Access Credentials (Security and .gitignore)

Scenario: You’re a data engineer working on a business intelligence project on your local workstation and writing a Python script that uses a unique API Key stored in a file named config.ini to pull sales data. To prevent the highly sensitive API key from being recorded in the repository history, even in a private repository, you must use the .gitignore file before committing the script for the first time.

Problem to be Solved: Implement the necessary Git configuration to ensure the config.ini file, which contains the API key, is never staged or committed, while still allowing the main Python script that reads the key to be committed.

Knowledge and Skills Developed: Understanding the purpose and syntax of the .gitignore file and the security principle of separating secrets from code.
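The protection can be verified rather than assumed: after writing the ignore rule, stage everything and confirm that the secret file is absent from the index. The following is a minimal sketch, assuming Git is installed and the repository is already initialized; the helper names are hypothetical, and a bare pattern such as config.ini matches a file of that name at any directory level.

```python
import subprocess
from pathlib import Path


def git(repo, *args):
    """Run one Git command inside repo and return its trimmed output."""
    done = subprocess.run(["git", *args], cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def ignore_secret(repo, secret_name="config.ini"):
    """Append an ignore rule, stage everything else, and return the staged file names."""
    with open(Path(repo) / ".gitignore", "a") as handle:
        handle.write(secret_name + "\n")
    git(repo, "add", ".")                     # "add ." skips anything matched by .gitignore
    return git(repo, "ls-files").splitlines()  # files currently in the index


def is_ignored(repo, path):
    """Return True when Git's ignore rules match the path."""
    # check-ignore exits with status 0 when the path is ignored and 1 when it is not.
    done = subprocess.run(["git", "check-ignore", "-q", path], cwd=repo)
    return done.returncode == 0
```

The rule only prevents future staging. A file that was already committed stays in history, which is why the scenario stresses adding the rule before the first commit.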

Goal 4 Result: Practice problems completed (practice files deleted)

Goal 5 Statement: Learn similarities and differences among Git configuration scopes: local, global, and system

Goal 5 Plan: (1) Read source materials and (2) Verify Git global configuration

Goal 5 Work Product: (1) Verified Git global configuration and (2) List of best practices

Goal 5 Result: Completed
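The practical difference among the scopes is precedence: a value set at the local (repository) scope overrides the same key at the global (user) scope, which in turn overrides the system scope. The following is a minimal sketch of that behavior, assuming Git is installed. It points HOME at a throwaway directory so the real global configuration is never touched; that isolation trick, the helper names, and the example values are assumptions made for illustration.

```python
import os
import subprocess
from pathlib import Path


def run_git(repo, home, *args):
    """Run one Git command in repo with HOME redirected to an isolated directory."""
    # Redirecting HOME makes --global read and write a scratch .gitconfig instead of the user's own.
    env = {**os.environ, "HOME": str(home), "XDG_CONFIG_HOME": str(Path(home) / ".config")}
    done = subprocess.run(["git", *args], cwd=repo, env=env, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def scope_demo(repo, home):
    """Set user.name at two scopes and return the effective value before and after the local override."""
    run_git(repo, home, "config", "--global", "user.name", "Global Name")
    before = run_git(repo, home, "config", "user.name")  # no local value yet, so global applies
    run_git(repo, home, "config", "--local", "user.name", "Project Name")
    after = run_git(repo, home, "config", "user.name")   # local value now takes precedence
    return before, after
```

Running git config --list --show-origin in a real repository shows which file each effective value comes from.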

Goal 6 Statement: Learn core Git commands: git init, git clone, git status, git add, git commit -m "message", git push, and git pull

Goal 6 Plan: (1) Read source materials and (2) Complete practice problems

Goal 6 Work Product: (1) 3 practice problems below and (2) List of best practices

Practice Problem 1: Initializing A New Business Forecasting Project (git init, git add, and git commit)

Scenario: You’re a business analyst starting a new project for quarterly sales forecasting. To enable version control from day one and commit the initial project structure and documentation, you must use the initialization and staging commands immediately after creating the project folder in the local terminal environment.

Problem to be Solved: Initialize a new Git repository, add a README.md file, stage the file, and create the first commit with a descriptive message.

Knowledge and Skills Developed: The sequence of creating a local repository (git init), preparing files (git add), and creating the first attributed snapshot (git commit).

Practice Problem 2: Synchronizing A Finance Model Update (git pull, git add, git push)

Scenario: You’re a quantitative developer working on a local terminal within a cloned repository and need to update your local copy of a credit risk model, add a new script, commit the script, and push the change to the shared remote repository. To ensure you start on the latest stable code (Reproducibility) and share your new work (Attribution), you must use the synchronization commands.

Problem to be Solved: Pull the latest changes from GitHub, add a new script (new_feature_script.py), and push the resulting commit to the remote main branch.

Knowledge and Skills Developed: The full cycle of synchronization (git pull) and contribution (git commit, git push).
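The synchronize-then-contribute cycle can be sketched as a single function, assuming Git is installed and the working copy is a clone whose remote is named origin. The --ff-only flag is a conservative choice made for this sketch, not something the exercise requires: it makes the pull stop instead of creating a merge commit if local and remote history have diverged. The helper names and placeholder file content are hypothetical.

```python
import subprocess
from pathlib import Path


def git(repo, *args):
    """Run one Git command inside repo and return its trimmed output."""
    # Inline identity (hypothetical values) keeps the sketch independent of global configuration.
    cmd = ["git", "-c", "user.name=Example Quant", "-c", "user.email=quant@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def sync_and_contribute(clone, filename, message, branch="main"):
    """Pull the latest remote commits, add one new file, commit it, and push the result."""
    git(clone, "pull", "--ff-only", "origin", branch)  # start from the latest shared code
    Path(clone, filename).write_text("# new feature placeholder\n")
    git(clone, "add", filename)
    git(clone, "commit", "-m", message)
    git(clone, "push", "origin", branch)               # share the new commit
    return git(clone, "log", "-1", "--format=%s")      # subject line of the newest commit
```

For the scenario above, the call would be sync_and_contribute(clone_dir, "new_feature_script.py", "Add new feature script").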

Practice Problem 3: Acquiring And Preparing A Sports Analytics Project (git clone, git status)

Scenario: You’re a new team member joining a pre-existing project for predicting player performance metrics that is hosted on GitHub. To get a full, linked copy of the project history and prepare a specific configuration file for local use, you must use the cloning command and check the working directory state.

Problem to be Solved: Clone the existing repository (https://github.com/team/player-metrics.git) and verify the link. Then, create a local configuration file that Git should ignore, and check the status.

Knowledge and Skills Developed: Using git clone to acquire a project and using git status to verify the state of the Working Directory and Staging Area.

Goal 6 Result: Practice problems completed (practice files deleted)

Goal 7 Statement: Learn similarities and differences between applicability of local and remote branches

Goal 7 Plan: Read source materials

Goal 7 Work Product: List of best practices

Goal 7 Result: Completed

Goal 8 Statement: Learn workflow for a GitHub local branch

Goal 8 Plan: Read source materials

Goal 8 Work Product: List of best practices

Goal 8 Result: Completed

Goal 9 Statement: Learn workflow for a GitHub remote branch in a cloud environment (Google Colab or Databricks)

Goal 9 Plan: Read source materials

Goal 9 Work Product: List of best practices

Goal 9 Result: Completed

Goal 10 Statement: Learn applicability of direct GitHub file editing and workflow for direct GitHub file editing

Goal 10 Plan: Read source materials

Goal 10 Work Product: List of best practices

Goal 10 Result: Completed

Goal 11 Statement: Learn applicability of .gitignore file editing and workflow for .gitignore file editing

Goal 11 Plan: Read source materials

Goal 11 Work Product: List of best practices

Goal 11 Result: Completed

Goal 12 Statement: Learn applicability of GitHub diffs and workflow for GitHub diffs

Goal 12 Plan: Read source materials

Goal 12 Work Product: List of best practices

Goal 12 Result: Completed

Part 2 of 3: Intermediate GitHub and Google Colab Integration

Goal 13 Statement: Learn process for opening any public or private GitHub-hosted Google Colab notebook directly in Google Colab interface

Goal 13 Plan: Read source materials

Goal 13 Work Product: List of best practices

Goal 13 Result: Completed

Goal 14 Statement: Learn process for adding “Open in Colab” badge to GitHub README.md file for instant execution by collaborators or users

Goal 14 Plan: Read source materials

Goal 14 Work Product: List of best practices

Goal 14 Result: Completed
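The badge is a Markdown image wrapped in a link: the image is Colab's hosted badge graphic, and the link is a Colab URL that mirrors the notebook's location on GitHub. The following is a minimal sketch that assembles the README.md line; the function name and the owner, repository, branch, and notebook path used in the example are hypothetical.

```python
from urllib.parse import quote

# Badge graphic hosted by Google Colab.
BADGE_IMAGE = "https://colab.research.google.com/assets/colab-badge.svg"


def colab_badge(owner, repo, branch, notebook_path):
    """Build the Markdown line that renders an "Open in Colab" badge for a GitHub-hosted notebook."""
    # quote() percent-encodes spaces and other unsafe characters but leaves "/" intact.
    target = (
        "https://colab.research.google.com/github/"
        f"{owner}/{repo}/blob/{branch}/{quote(notebook_path)}"
    )
    return f"[![Open In Colab]({BADGE_IMAGE})]({target})"


print(colab_badge("example-user", "nfl-model", "main", "notebooks/train.ipynb"))
```

Pasting the printed line into README.md gives collaborators a one-click link that opens the notebook in Colab.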

Goal 15 Statement: Learn process for saving a new Google Colab notebook or changes to an existing Google Colab notebook directly to a GitHub repository

Goal 15 Plan: Read source materials

Goal 15 Work Product: List of best practices

Goal 15 Result: Completed

Goal 16 Statement: Learn process for using a Personal Access Token for pushing changes from Google Colab to a GitHub repository instead of a password

Goal 16 Plan: Read source materials

Goal 16 Work Product: List of best practices

Goal 16 Result: Completed
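One common pattern for token-based pushes over HTTPS is to embed the token in the remote URL in place of a password. The following is a minimal sketch, assuming the token has already been placed in an environment variable; the variable name GITHUB_TOKEN, the function names, and the example owner and repository are hypothetical. The token is read at run time so it never appears in the notebook itself.

```python
import os


def authenticated_remote(owner, repo, token_env="GITHUB_TOKEN"):
    """Build an HTTPS remote URL that carries a Personal Access Token read from the environment."""
    token = os.environ.get(token_env)
    if not token:
        raise RuntimeError(f"Set the {token_env} environment variable before pushing")
    return f"https://{token}@github.com/{owner}/{repo}.git"


def redact(url, token):
    """Mask the token so the URL can be printed or logged safely."""
    return url.replace(token, "***")
```

The resulting URL would be passed to git remote set-url origin or used directly in a git push command. Because Git stores remote URLs in plain text in .git/config, printing only the redacted form and clearing the remote afterward are sensible precautions.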

Goal 17 Statement: Learn similarities and differences between external and internal command-line tools in Google Colab

Goal 17 Plan: Read source materials

Goal 17 Work Product: List of best practices

Goal 17 Result: Completed

Goal 18 Statement: Learn applicability and execution of common Git shell commands in Google Colab

Goal 18 Plan: Read source materials

Goal 18 Work Product: List of best practices

Goal 18 Result: Completed

Goal 19 Statement: Learn applicability, scope, and use of GitHub Dependabot Secrets

Goal 19 Plan: Read source materials

Goal 19 Work Product: List of best practices

Goal 19 Result: Completed

Goal 20 Statement: Learn applicability, scope, and use of GitHub Actions Secrets

Goal 20 Plan: Read source materials

Goal 20 Work Product: List of best practices

Goal 20 Result: Completed

Goal 21 Statement: Learn applicability, scope, and use of GitHub Codespaces Secrets

Goal 21 Plan: Read source materials

Goal 21 Work Product: List of best practices

Goal 21 Result: Completed

Part 3 of 3: Advanced GitHub for Production Data Science and MLOps Concepts

Goal 22 Statement: Learn GitHub pull request workflow for collaboration and code review: (1) branching, isolation, and feature development initialization, (2) atomic committing across the MLOps lifecycle, (3) creating and submitting the pull request, (4) collaborative code review and iteration cycle, and (5) final merge, history management, and audit trail cleanup

Goal 22 Plan: Read source materials

Goal 22 Work Product: List of best practices

Goal 22 Result: Completed

Goal 23 Statement: Learn MLOps CI/CD pipeline: (1) Continuous Integration (CI): the pre-merge code, data, and security gate, (2) Continuous Delivery (CD): model training, artifact management, and staging, and (3) Continuous Deployment (CD): production release strategy and rollback

Goal 23 Plan: Read source materials

Goal 23 Work Product: List of best practices

Goal 23 Result: Completed

Goal 24 Statement: Learn GitHub Actions Workflows with YAML: (1) workflows, (2) jobs, (3) steps, (4) actions, (5) events, (6) runners, (7) contexts, (8) secrets, (9) inputs, (10) outputs, (11) concurrency, and (12) reusability

Goal 24 Plan: Read source materials

Goal 24 Work Product: List of best practices

Goal 24 Result: Completed

Goal 25 Statement: Learn GitHub Issues: (1) issue creation, (2) labels, (3) assignees, (4) comments and discussions, (5) milestones, (6) projects, (7) linking to pull requests, (8) sub-issues, (9) issue templates, (10) notifications, and (11) administration

Goal 25 Plan: Read source materials

Goal 25 Work Product: List of best practices

Goal 25 Result: Completed

Goal 26 Statement: Learn Git Tags and GitHub Releases for version control

Goal 26 Plan: Read source materials

Goal 26 Work Product: List of best practices

Goal 26 Result: Completed