Information technology has led us into an era in which producing, sharing and using information is part of everyday life, often without our being aware of it: it is now almost impossible not to leave a digital trail of many of the actions we perform every day, for example through digital content such as photos, videos, blog posts and everything that revolves around social networks (Facebook and Twitter in particular). In addition, with the "Internet of Things" we see a growing number of devices such as watches, bracelets, thermostats and many other items that can connect to the network and therefore generate large data streams. This explosion of data explains the birth of the term Big Data: it denotes data produced in large quantities, at remarkable speed and in different formats, whose processing requires technologies and resources that go far beyond conventional data management and storage systems.
It is immediately clear that 1) data storage models based on the relational model and 2) processing systems based on stored procedures and grid computing are not applicable in these contexts. Regarding point 1, RDBMSs, widely used for a great variety of applications, run into problems when the amount of data grows beyond certain limits. Scalability and implementation cost are only part of the disadvantages: very often, when dealing with big data, variability, that is, the lack of a fixed structure, is also a significant problem. This has given a boost to the development of NoSQL databases. The website NoSQL Databases defines NoSQL databases as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable."
These databases are distributed, open source, horizontally scalable, without a predefined schema (key-value, column-oriented, document-based and graph-based), easily replicable, do not guarantee ACID properties, and can handle large amounts of data. They are integrated with processing tools based on the MapReduce paradigm proposed by Google in 2004. MapReduce, together with the open source Hadoop framework, represents the new model for distributed processing of large amounts of data, supplanting techniques based on stored procedures and computational grids (point 2). The relational model taught in basic database design courses has many limitations compared to the demands posed by new applications based on Big Data, which use NoSQL databases to store data and MapReduce to process large amounts of data.
Course Website http://pbdmng.datatoknowledge.it/
Contact me to download the slides
1. An Introduction to GIT
Version Control with Git
Avoid being the repository manager
Dr. Fabio Fumarola
2. Agenda
• What is Version Control? (and why use it?)
• What is Git? (And why Git?)
• How git works
• Create a repository
• Branches
• Add remote
• How data is stored
3. History
• Created by Linus Torvalds for work on the Linux
kernel ~2005
• Some of the companies that use git:
12. What is Version Control?
• Version Control - A system for managing changes made to
documents and other computer files
• What kinds of files can we use it with?
– Source code
– Documentation
– Short stories
– Binary files (music and pictures)
• What should we use it for?
– Text files
– Projects that have lots of revisions (changes)
– Collaborating
13. VCS Systems and their Operations
• Lots of different choices available:
– CVS
– SVN
– Perforce
– Git
– Mercurial (Hg), Bazaar
– And more!
• Most follow a repository model (though differ in how
the repositories work)
14. So why do we need a VCS?
• Our goals
– Share code (or something else) easily
– Keep track of any changes we make (and undo them with
ease)
– Maintain multiple versions of the same project/code base
– Clearly communicate what changes have been made
• There are two types of VCS
– Centralized and
– Distributed
15. Distributed vs Centralized
Centralized VCS
•One central repository
•Must be capable of
connecting to repo
•Need to solve issues with
group members making
different changes on the same
files
Distributed VCS
•Everyone has a working repo
•Faster
•Connectionless
•Still need to resolve issues,
but it's not an argument
against DVCS
18. Creating our first repository
• Install git
• Establish our first repository:
– mkdir git-test
– cd git-test
– git init
• What do we have here?
– git status
19. Using our first repository
• Add a file
– touch file.txt
– git add file.txt
– git commit -m "add the first file"
20. Branching
• With Git we should embrace
the idea of branching
• Branching is the best way
to work with others on a
project
34. Branches Illustrated
[Diagram: master with commits A B C D E; bug456 with commits F’ G’]
> git checkout master
> git merge bug456
35. Branches Review
• Branches should be used to create “Features” in our
project
• Local branches are very powerful
• Rebase is not scary
36. Adding a remote
But how do we share our code with other collaborators?
[Diagram: My Local Repo, Tom’s Repo, Tracey’s Repo and Matt’s Repo, each holding commits A B C]
38. Setting up a Remote
• We can clone an existing repository
– git clone git@github.com:fabiofumarola/akka-tutorial.git
• We can push our changes
– git push
• We can pull our friends' changes
– git pull
• We can also add a remote to an existing repository
– git remote add origin git@github.com:fabiofumarola/akka-tutorial.git
55. Short vs. Long-Lived Branches
• We can use branches to:
– Solve bugs (hotfixes)
– Create features
– Make a release
• In order to simplify the management we can use:
– Git Flow: http://danielkummer.github.io/git-flow-cheatsheet/index.it_IT.html
78. How Git stores data
• Git stores the content of every tracked file in its history
• Each time we commit, Git records a snapshot of the tracked files; a file
whose content has not changed is not stored again.
• However, the content of a file may still need revising when a merge
produces conflicts.
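To make "Git stores the content" a bit more concrete, here is a minimal sketch in Python. It assumes a repository that uses SHA-1 object IDs (the default), where the ID of a file's content is the SHA-1 of the header `blob <size>\0` followed by the raw bytes. The helper name `git_blob_id` is made up for this illustration; it is not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID Git gives to a file with this content (SHA-1 repositories)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


first = git_blob_id(b"hello world\n")
second = git_blob_id(b"hello world\n")
changed = git_blob_id(b"hello world!\n")

# Identical content always maps to the same ID, so an unchanged file
# does not need to be stored again by a later commit.
print(first == second)
# Any change to the content produces a different ID.
print(first == changed)
```

For a file with the same bytes, `git hash-object <file>` should print the same ID as `git_blob_id`.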
79. Git best practices for code collaboration
• When to commit?
– Source of major arguments (big changes vs. small changes)
– Never put broken code on the master branch (test first!)
– Try not to break things (don't do two weeks' worth of work in one
commit)
– Always use a clear, concise commit message
– Put more details in lines below, but always make the first line short
– Describe the why; the what is clear in the change log
• When making giant changes, consider branches (we'll talk
about these in a few slides)
• Oh, and make sure your name and email are right
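The commit message advice above can be turned into a small check. This is only a sketch: the function name `check_commit_message` is hypothetical, and the 50-character limit for the first line is a common convention that we assume here, not a rule enforced by Git.

```python
from typing import List


def check_commit_message(message: str, max_subject_len: int = 50) -> List[str]:
    """Return the problems found in a commit message (empty list if none)."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["the first line (subject) is empty"]
    problems = []
    # Keep the first line short; the limit is an assumed convention.
    if len(lines[0]) > max_subject_len:
        problems.append(f"subject is longer than {max_subject_len} characters")
    # Details go in the lines below, separated from the subject by a blank line.
    if len(lines) > 1 and lines[1].strip():
        problems.append("leave a blank line between the subject and the details")
    return problems


good = "Fix login timeout\n\nThe session cookie expired too early."
print(check_commit_message(good))
```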
81. SSH
• Used to be the most common transport for git
• Pros
– Allows reads and writes
– Authenticated network protocol
– Secure: data transfer is encrypted and authenticated
– Efficient: makes data as compact as possible
• Cons
– No anonymous read-only access
82. Sidebar: What is SSH?
• SSH is a protocol used for secure network
communication
Getting files from github
• Generate public/private keys (ssh-keygen)
• Distribute public keys (add key to github)
• github uses your public key to verify that you hold the
matching private key (only you know it)
• The files are then sent over the encrypted connection
Putting files on github
• The same key pair authenticates you when sending files to github
• Your public key is the .pub file (see
github_rsa.pub, in Documents and
Settings/Cyndi/.ssh on my machine)
• The private key never leaves your machine
Editor's Notes
Git is a distributed version control system.
Or you can think of it as
A directory content management system
A tree based history storage system
Or, as Git is described on its man page, a "stupid content tracker".
Git is super cool.
Now let's see what this looks like visually.
On my first commit I have A.
The default branch that gets created with git is a branch named master.
This is just a default name. As I mentioned before, most everything in git is done by convention. Master does not mean anything special to git.
We make a set of commits, moving master and our current pointer (*) along
Suppose we want to work on a bug. We start by creating a local “story branch” for this work.
Notice that the new branch is really just a pointer to the same commit (C) but our current pointer (*) is moved.
Now we make commits and they move along, with the branch and current pointer following along.
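A tiny Python model of what the last few notes describe, just to show that a branch is a pointer. It is a sketch, not how Git is implemented: real commit IDs are hashes, and the names `parents`, `branches`, `head` and `commit` are made up for this illustration. The branch name bug123 is the story branch from these notes.

```python
# Commit history: each commit ID maps to its parent (None for the first commit).
parents = {"A": None, "B": "A", "C": "B"}

# A branch is only a name pointing at one commit; head names the current branch.
branches = {"master": "C"}
head = "master"

# Creating a branch copies a pointer; no commit is copied or changed.
branches["bug123"] = branches[head]
head = "bug123"  # the current pointer (*) moves to the new branch


def commit(commit_id):
    """Add a commit on the current branch and move that branch forward."""
    parents[commit_id] = branches[head]
    branches[head] = commit_id


commit("D")
commit("E")

# master has not moved; only the story branch followed the new commits.
print(branches)
```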
We can “checkout” to go back to the master branch.
This is where I was freaked out the first time I did this. My IDE removed the changes I just made. It can be pretty startling, but don’t worry you didn’t lose anything.
And then merge from the story branch, bringing those change histories together.
And since we’re done with the story branch, we can delete it. This all happened locally, without affecting anyone upstream.
Let’s consider another scenario. Here we created our bug story branch back off of (C). But some changes have happened in master (bug 123 which we just merged) since then.
And we made a couple of commits in bug 456.
Again, to merge, we checkout back to master which moves our (*) pointer.
And now we merge, connecting the new (H) to both (E) and (G). Note that this merge, especially if there are conflicts, can be unpleasant to perform.
Now we delete the branch pointer.
But notice the structure we have now. This is very non-linear. That will make it challenging to see the changes independently. And it can get very messy over time.
Rebase flow - Let’s go back in time and look at another approach that git enables. So here we are ready to merge.
Instead of merging, we “rebase”.
What this means is something like this:
1. Take the changes we had made against (C) and undo them, but remember what they were
2. Re-apply them on (E) instead
Now when we merge them, we get a nice linear flow.
Also, the actual changeset ordering in the repository mirrors what actually happened. (F’) and (G’) come after E rather than in parallel to it. Also, there is one fewer snapshot in the repository.
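The two rebase steps can be sketched with a toy history in Python. This model only tracks parent links: a real rebase re-applies the actual changes and the new commits get new hashes. The function names `ancestors` and `rebase` are made up for this illustration.

```python
# History before the rebase: master is A-B-C-D-E, bug456 branched at C with F and G.
parents = {"A": None, "B": "A", "C": "B", "D": "C", "E": "D", "F": "C", "G": "F"}
branches = {"master": "E", "bug456": "G"}


def ancestors(commit_id):
    """Return the commit and everything reachable from it through parent links."""
    seen = []
    while commit_id is not None:
        seen.append(commit_id)
        commit_id = parents[commit_id]
    return seen


def rebase(branch, onto):
    """Replay the commits that only `branch` has on top of the tip of `onto`."""
    base = set(ancestors(branches[onto]))
    # Step 1: set aside the branch's own commits, oldest first.
    own = [c for c in reversed(ancestors(branches[branch])) if c not in base]
    # Step 2: re-apply each one as a new commit (F', G') on the new base.
    tip = branches[onto]
    for old in own:
        new = old + "'"
        parents[new] = tip
        tip = new
    branches[branch] = tip


rebase("bug456", "master")

# The branch history is now linear and sits on top of E.
print(ancestors(branches["bug456"]))
```

After this, merging bug456 into master only has to move the master pointer forward, which is why the result looks like one straight line.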
Suppose we are here. We cloned master on (A) and have been fixing bug 123 in our story branch.
I’ll use the orange box to indicate where the master pointer is on the remote server.
Here we are as before with our local master branch and the remote master branch both pointing at (A)
The changes on the bug123 branch are known only to my local machine; the remote server does not have these changes, or the bug123 branch for that matter.
But in fact there are two versions of the orange master pointer. One is what we last know about the upstream master and the other is what is actually up there (which we don’t know about).
So if this is what we know, we can update our master to catch up.
First we checkout master which moves our current (*) to there. Note that we are actually on our master, not the upstream one. That is always true. But the tracking branch is also pointing to (A) at this point.
Now we can do a pull on the origin (our source remote) and move both along to their new place.
I have not talked about pull before. The pull command is a combination of a fetch from the remote server and a merge of the changes.
You can do these steps separately, but if you are not working on the branch you are pulling down, pull is just a nice way to get up to date.
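Here is a rough Python model of pull as "fetch, then merge". It is a sketch under simplifying assumptions: our master has no commits of its own, so the merge step only moves a pointer. All names (`fetch`, `merge_tracking_branch`, `pull`, the dictionaries) are made up for this illustration.

```python
# Commit ID -> parent, on the server and in our clone.
remote_parents = {"A": None, "B": "A", "C": "B", "D": "C", "E": "D"}
remote_branches = {"master": "E"}

local_parents = {"A": None}
# origin/master is the tracking branch: what we last knew about the upstream master.
local_branches = {"master": "A", "origin/master": "A"}


def fetch():
    """Download new commits and move the tracking branch; our master stays put."""
    local_parents.update(remote_parents)
    local_branches["origin/master"] = remote_branches["master"]


def merge_tracking_branch():
    """Bring master up to origin/master. Our master has no commits of its own
    in this sketch, so merging only moves the master pointer forward."""
    local_branches["master"] = local_branches["origin/master"]


def pull():
    """A pull is the two steps above, run one after the other."""
    fetch()
    merge_tracking_branch()


fetch()
after_fetch = dict(local_branches)  # tracking branch moved, master did not
merge_tracking_branch()
after_pull = dict(local_branches)   # both now point at the same commit

print(after_fetch)
print(after_pull)
```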
Returning to our bug fix now by checkout on that, we have a similar problem to what we saw before. B-C-D-E all come before F and G. Merging would create issues, right?
So we use rebase to rewind and replay B-C-D-E after G.
Then we checkout back to master
And merge. Note that the orange (upstream) pointer is still “back there”.
And finally, because we want to publish these changes, we push to the origin, moving the orange pointer along.
Push will update the remote server.
If you are out of date, Git will reject that push.
Git will require you to merge locally, then push the results.
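The rule behind a rejected push can be sketched in Python: the push is accepted only when the commit the remote branch points at is already part of our history. This is a simplified model with made-up names (`is_ancestor`, `push`); the real check and the real error message are Git's own.

```python
def is_ancestor(parents, older, newer):
    """True if `older` can be reached from `newer` by following parent links."""
    current = newer
    while current is not None:
        if current == older:
            return True
        current = parents.get(current)
    return False


def push(local_parents, local_tip, remote):
    """Update the remote branch only if that keeps every commit it already has."""
    if not is_ancestor(local_parents, remote["master"], local_tip):
        return "rejected: merge the remote changes locally, then push again"
    remote["master"] = local_tip
    return "accepted"


# Our clone: commit B built on top of the A we cloned.
local_parents = {"A": None, "B": "A"}

up_to_date_remote = {"master": "A"}
print(push(local_parents, "B", up_to_date_remote))

# Someone else pushed X in the meantime; we have never seen X.
moved_remote = {"master": "X"}
print(push(local_parents, "B", moved_remote))
```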
Delete the story branch and we’re good to go
Say we want to start working on the next version of our Cool Project. We will want to create a develop branch
To share this with the team we need to push it up to the remote
Let's say there are some changes on develop from other team members. Until we do a pull we won’t see these changes locally
Pull some changes from other team members… you should be doing this at least once a day.
Now we have an idea. We create a working branch off of develop
One of your teammates did a hotfix in production
Since we are keeping up to date, just doing a pull or fetch now and then is a good idea.
Merge Idea into Develop. First we want to checkout develop
Since there were no additional changes on develop, we could easily merge idea into develop. This is called a fast-forward merge, since git is really just moving the pointer to H
We can delete idea now
We now need to share develop with the rest of the team.
When we are ready to move the develop branch to production (master) we have two choices. We can go through the merge flow, or a rebase flow.
Move onto master
Rebase flow - Move onto master
Rebase flow – Get Origin up to date
As with most things with Git, there are multiple ways to do something. Using a merge flow versus a rebase flow is a matter of taste, but as you can see the rebase flow looks cleaner, and as you get into large projects the number of branches can get pretty messy.
This is a simple example of rewriting history in git.