episodes/03-overview.md
Lines changed: 10 additions & 14 deletions
@@ -28,15 +28,13 @@ This program will teach you best practices in data science project management an
 This material will help you to manage a research project that comprises some *online collaborative working*,
 has a relatively *big team, where people have complementary skills*,
 uses *coding or programming*, as well as the *reuse of code*,
-and last but not least, aims at producing a *reproducible analysis*.
+and last but not least, aims at producing a *reproducible analysis*, as is pictured below.

-Here we give a short overview of the topics that will be covered in this course.
-Note that the course episode split follows a different logic, and you will find training linked to each of these five data science specifics in most episodes.
-As pictured below, the specifics of data science projects can be grouped in five main topics: working online, working with a heterogenous and relative big team, writing code, reusing code, and aiming at a reproducible analysis.
-This latter point being central to data science practices.

 <img src="../fig/datasciencespecifics.jpg" alt="Specificity of data science project. Five blocks (working online, large teams whose members have specialised skills, writing code and re-using code) are placed around a central block where reproducible analysis is written. Data specifics by Julien Colomb CC-BY 4.0" width="500"/>

+Here we give a short overview of the topics that will be covered in this course.
+Note that the course episode split follows a different logic, and you will find training linked to each of these five data science specifics in most episodes.


 ## Online work
@@ -47,24 +45,22 @@ Discussions are also more difficult to organise and meetings are more complex to
 In this course, we will look at different elements that make this work easier.

 First, there should be one entry point for the project, where every team member can find the main documentation as well as links to other documents and data.
-This starts with setting one `main` URL when setting up the project, as well as using good readme files and templates. The information needs to be updated during the project and shared with the whole team.
+This starts with setting `one main URL` when setting up the project, as well as using good readme files and templates. The information needs to be updated during the project and shared with the whole team.

 The use of online project management tools (like kanban boards for to-do lists) can also help members of the team to coordinate their work, and follow their achievements.

 ## Team science

-Because teams can be big, and quite heterogenous in terms of skills (especially computer and programming skills), it is important to follow best practice of team bulding.
+Because teams can be big, and quite heterogeneous in terms of skills (especially computer and programming skills), it is important to follow best practice of team building.

 In particular, one should set reasonable goals and milestones for the project, and document them in the main documentation.
 It is also important that every team member knows what their part is, and that the work is well distributed.

-One should make sure every team member is able to use the communication tools set for the team, and take particular care of the organisation of meetings. Data and code should be documented, such that every team member can follow and reuse the work of the other team members.
-In this course, we will present ways to foster this documentation process.
-
+One should make sure every team member is able to use the communication tools set for the team, and take particular care of the organisation of meetings. Data and code should be documented (and this documentation work should be fostered), such that every team member can follow and reuse the work of the other team members.

 ## Involves coding

-When data analysis is done via a programming language, things become mostly easier, but this facility has some drawbacks as well as some effects on data management practice.
+When data analysis is done via a programming language, things become mostly easier, but this facility has some drawbacks, as well as some effects on data management practice.

 First, a data analysis workflow will now start with the computer reading the raw data.
 This means that the choice of the data format for the raw data may change, and that manually gathered data should be (easily) computer readable.
@@ -77,7 +73,7 @@ However, errors are easier to spot (doing code reviews and tests) and when the c

 ## Involves reuse of code

-Very soon in a research project, writing code consists mostly of taking code written by someone else and applying it (with some tweaks sometimes) to one own's data.
+Very soon in a research project, writing code consists mostly of taking code written by someone else and applying it (with some tweaks sometimes) to one's own data.
 We will look at ways to find relevant code, make sure it can be trusted, make sure you can legally use it, and ways to cite it (to give the initial software engineer the recognition they deserve).

 In addition, code written in the project will probably be reused, too.
@@ -86,11 +82,11 @@ We will look into best practices to make this reuse easier, both in how the code

 ## Reproducibility

-At the core of data science, the analysis reproducibility is both a goal and a huge advantage (in terms of research transparency, trustworthiness and in term of work efficiency).
+At the core of data science, the analysis reproducibility is both a goal and a huge advantage (in terms of research transparency, trustworthiness and work efficiency).
 The use of code is not enough to get a reproducible analysis: one needs to have access to both the code and the data used to produce the research result, a concept called provenance.
 This may not be trivial, especially if several versions of the code and of the data exist.

-In this course, we will have a strong emphasis on version control, while we will introduce the concepts (and some tools) of provenance, as well as literate programming (reproducible reports and executable papers).
+In this course, we will have a strong emphasis on version control, while we will introduce the concepts (and some tools) of provenance, as well as literate programming (reproducible reports and executable papers), where the code, the figures and the explanatory text are bound in the same file.