A short note on Git and GitHub...

Git

What is Git?

Git is a distributed, open-source version control system. It enables developers and data scientists to track changes to code, merge changes, and revert to older versions (AWS). It also allows you to sync changes with a remote server. Thanks to its flexibility and popularity, Git has become an industry standard, and it is supported by almost all development environments, command-line tools, and operating systems.

How does Git work?

Git stores your files and their development history in a local repository. Whenever you commit the changes you have made, Git records a snapshot of the current files. These commits are linked to each other, forming a development history graph, as shown below. This history lets you revert to a previous commit, compare changes, and view the progress of the development project (Azure DevOps). Each commit is identified by a unique hash, which is used to compare and revert changes.
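The commit-and-hash model described above can be tried out in a throwaway repository. This is a minimal sketch rather than a prescribed workflow: the file name notes.txt, the commit messages, and the user identity are made up for illustration, and everything happens in a temporary directory.

```shell
# Work in a temporary directory so nothing existing is touched.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"        # hypothetical identity, local to this repo
git config user.email "user@example.com"

# First snapshot.
echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
first=$(git rev-parse HEAD)                # hash that identifies the first commit

# Second snapshot (-a stages the already tracked file).
echo "version 2" > notes.txt
git commit -q -am "Update notes"
second=$(git rev-parse HEAD)

git log --oneline                          # the linked history, newest first
git diff "$first" "$second" -- notes.txt   # compare two commits by hash

# Undo the latest commit by recording a new commit that reverses it.
git revert --no-edit HEAD
cat notes.txt                              # prints "version 1"
```

Note that git revert does not erase history. It adds a third commit whose content matches the first one, so the record of what happened stays intact.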

Branches

[Figure: development history graph with branches]

Branches are independent lines of development that run in parallel to the main version of the source code. To bring the changes made on a branch into the project, you merge the branch into the main version. This feature helps teammates work without getting in each other's way. Each developer has their own task, and by using branches, they can work on a new feature without interference from other teammates. Once the task is finished, you can merge the new feature into the main version (the master branch, which is named main in many newer repositories).
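The branch-then-merge cycle can be sketched in a few commands. The branch name feature-login, the file names, and the user identity below are hypothetical, and the script reads the default branch name from Git instead of assuming main or master.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"          # hypothetical identity
git config user.email "user@example.com"
main_branch=$(git symbolic-ref --short HEAD) # main or master, depending on Git settings

echo "base" > app.txt
git add app.txt
git commit -q -m "Initial commit"

# Start a feature branch and commit on it.
git checkout -q -b feature-login
echo "login form" > login.txt
git add login.txt
git commit -q -m "Add login form"

# Back on the main version, login.txt is not present yet.
git checkout -q "$main_branch"

# Merge the finished feature, then delete the branch.
git merge -q feature-login
git branch -d feature-login
git log --oneline
```

Because the main version did not change while the feature was being written, Git can simply move the main branch forward (a fast-forward merge). When both sides have new commits, Git creates a merge commit instead and may ask you to resolve conflicts.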

Commits

[Figure: Git commit flow]

There are three states of files in Git: modified, staged, and committed. When you make changes in a file, the changes are saved in the working directory. They are not yet part of the Git development history. To create a commit, you first need to stage the changed files. You can add or remove changes in the staging area and then package these changes as a commit with a message describing them.
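The three states can be observed with git status. The sketch below uses a made-up file, report.txt, and a hypothetical user identity. The --short flag prints a two-character code in front of each path: the first character describes the staging area and the second describes the working directory.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"       # hypothetical identity
git config user.email "user@example.com"

echo "draft" > report.txt
git add report.txt
git commit -q -m "Add report"

# Modified: the change exists only in the working directory.
echo "more text" >> report.txt
modified=$(git status --short)            # " M report.txt"

# Staged: the change is queued for the next commit.
git add report.txt
staged=$(git status --short)              # "M  report.txt"

# Committed: the change is part of the history and the tree is clean.
git commit -q -m "Extend report"
clean=$(git status --short)               # empty

printf 'modified: [%s]\nstaged:   [%s]\nclean:    [%s]\n' "$modified" "$staged" "$clean"
```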

What are the benefits of Git?

  • Track changes: It allows developers to view historical changes. Development history makes it easy to identify and fix bugs.
  • IDE integration: Due to its popularity, Git integration is available in most development environments, for example VS Code and JupyterLab.
  • Team collaboration: A development team can view its progress, and by using branches, team members can work individually on a task and merge changes into the main version. Pull requests, merge conflict resolution, and code review promote team collaboration.
  • Distributed VCS: In a distributed system, every developer holds a full copy of the project and its history instead of relying on a single central store. As a result, there are multiple backups of the same project, and developers can work offline and commit changes locally.

GitHub

What is GitHub?

GitHub is a cloud-based software development platform. It is commonly used for storing files, tracking changes, and collaborating on development projects. In recent years, GitHub has become one of the most popular social platforms for software development communities. Individuals can contribute to open-source projects, file bug reports, discuss new projects, and discover new tools.

Data scientists and machine learning engineers are following the path of software developers and integrating their workflows with GitHub. By doing this, they can share their research work, allow community contributions, and collaborate with data teams. You can find all kinds of data science and machine learning projects, guides, tutorials, and resources on this platform. For students, the platform has become an opportunity to gain practical experience and eventually land a job at a prestigious company.

Portfolio

Many technical recruiters ask for portfolio projects or a GitHub profile. This helps them determine whether a candidate is a good fit for their company. It is highly recommended to create a GitHub profile and update it regularly. Hiring managers look for candidates who are experienced in software development and contribute to open-source projects. Reviewing a candidate's GitHub portfolio also helps them prepare questions for technical interviews.

Features

GitHub also provides various other features that are as important as showcasing a portfolio. It is necessary to learn about each feature so that you can incorporate them into your data science projects.

  • Open-source: GitHub provides a complete ecosystem for open-source projects. You can sponsor maintainers, contribute to a project, use the open-source tool in your existing project, and promote your work.
  • Community Collaboration: GitHub has become a community platform where issues, feature requests, code, and documentation contributions can be discussed.
  • Explore: The GitHub Explore tab helps you discover new projects, trending tools, and developer events.
  • GitHub Gists: You can share snippets of your code or embed them in a blog or website.
  • GitHub CLI: It allows you to create pull requests, review code, check issues, and monitor progress from the command line.
  • Free storage: unlimited public and private repositories.
  • Web hosting: You can publish your portfolio site or documentation. GitHub Pages provides an easy way to build and deploy a website.
  • Codespaces: a cloud development environment integrated with your GitHub repository.
  • Projects: a customizable, flexible tool for planning and tracking work on GitHub.
  • Automation: GitHub Actions automates development workflows such as build, test, publish, release, and deployment.
  • Sponsors: You can support your favorite open-source projects or developers with a monthly or one-time payment. It also allows developers to link third-party payment platforms such as Ko-fi.

Basic Commands

Before we jump into managing data science projects, let's learn about the most common Git commands that you will be using in every data science project. The basic commands include initializing the Git repository, saving changes, checking logs, pushing the changes to the remote server, and merging.

  • git init: create a Git repository in a local directory.
  • git clone: copy the entire repository from a remote server to a local directory. You can also use it to copy local repositories.
  • git add: add a single file or multiple files and folders to the staging area.
  • git commit -m "Message": create a snapshot of the staged changes and save it in the repository.
  • git config: set user-specific configuration such as email, username, and file format.
  • git status: show the list of changed files, including files that have yet to be staged or committed.
  • git push: send local commits to a remote branch of the repository.
  • git checkout -b: create a new branch and switch to it.
  • git remote -v: view all remote repositories.
  • git remote add: add a remote server to the local repository.
  • git branch -d: delete a branch.
  • git pull: fetch commits from a remote repository and merge them into the current local branch.
  • git merge: blend the selected branch into the current branch. If the two branches conflict, resolve the conflicts and commit to complete the merge.
  • git log: show a detailed list of commits for the current branch.
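Several of the commands above can be strung together into one small end-to-end run. The sketch below is only an illustration under stated assumptions: a bare repository in a temporary directory stands in for the remote server, and the directory names alice and bob, the file README.md, and the user identity are all hypothetical.

```shell
# Throwaway sandbox; every path and name below is hypothetical.
work=$(mktemp -d)

# A bare repository stands in for the remote server.
git init -q --bare "$work/remote.git"

# First working copy: create a repository and record one commit.
git init -q "$work/alice"
cd "$work/alice" || exit 1
git config user.name "Example User"
git config user.email "user@example.com"
branch=$(git symbolic-ref --short HEAD)    # default branch name, e.g. main or master
echo "# Demo project" > README.md
git add README.md
git commit -q -m "Add README"

# Connect the remote and publish the branch.
git remote add origin "$work/remote.git"
git remote -v
git push -q -u origin "$branch"

# Second working copy: clone from the remote.
git clone -q "$work/remote.git" "$work/bob"

# Back in the first copy: commit again and push.
echo "More documentation" >> README.md
git commit -q -am "Extend README"
git push -q

# The second copy picks up the new commit with a pull.
cd "$work/bob" || exit 1
git pull -q --ff-only
git log --oneline
```

With a real project, the bare repository would be replaced by the address of a repository hosted on a server such as GitHub, and the two working copies would usually belong to different people or machines.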

Conclusion

Git skills are crucial for data application development. They have become essential for all types of IT jobs; even academic researchers use Git to share experimental code with a wider audience. GitHub, in turn, plays a large role in promoting open-source projects by providing a free software development ecosystem for all.

In this tutorial, we have learned about Git and GitHub and why they are important for data science projects. The tutorial also introduces you to basic Git commands and provides hands-on experience on how to track changes in data, model, and code. If you are interested in learning more about Git, then take an Introduction to Git course on DataCamp.

Thank You for Reading...