
Commit dfd8fce

Merge pull request #78 from carpentries-incubator/JC-reorganisation
new chapter organisation, ready for more work!
2 parents c2f0835 + a8940cf commit dfd8fce

23 files changed: +819 additions, -565 deletions

_extras/04-FAIRHowTo.md

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@
 title: FAIR How-Tos for data and software
 ---

-# FAIR How-To for data
+# FAIR How-to guide for data

 ## Findable:
 Your data should be findable, by you and others.
@@ -86,7 +86,7 @@ Finally build new habits:
 - Start the process early on!
 ![](https://i.imgur.com/pJ99MTf.png)

-# FAIR How-To for software
+# FAIR How-to guide for software

 The FAIR principles are intended to be applied to all digital objects. Both data and software are considered digital objects and can be products of research. Software generally performs a function upon something (e.g., software processes data), while data generally has a function performed upon it (e.g., data is processed by software); we are more interested in what software does than in what it is. Current conversations largely agree that software is a special kind of data, which allows it to be treated similarly to data for certain aspects of FAIR.

_extras/51-datawrangling.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
---
title: Data Wrangling and Exploration
---
#### Data wrangling steps
Each data project requires a unique approach to ensure its final dataset is reliable and accessible. That being said, several processes typically inform the approach. These are commonly referred to as data wrangling steps or activities.

##### Discovery
Discovery refers to the process of familiarizing yourself with data so you can conceptualize how you might use it. You can liken it to looking in your refrigerator before cooking a meal to see what ingredients you have at your disposal.

During discovery, you may identify trends or patterns in the data, along with obvious issues, such as missing or incomplete values, that need to be addressed. This is an important step, as it informs every activity that comes afterward.

##### Structuring
Data is typically unusable in its raw state because it is either incomplete or misformatted for its intended application. Data structuring is the process of transforming raw data so it can be more readily leveraged. The form your data takes will depend on the analytical model you use to interpret it.

##### Cleaning
Data cleaning is the process of removing inherent errors in data that might distort your analysis or render it less valuable. Cleaning can take different forms, including deleting empty cells or rows, removing outliers, and standardizing inputs. The goal of data cleaning is to ensure there are no errors (or as few as possible) that could influence your final analysis.
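
For example, a minimal `pandas` sketch of such cleaning operations (the file name `raw_data.csv` and the column `treatment` are hypothetical):

```python
import pandas as pd

df = pd.read_csv('raw_data.csv')  # hypothetical raw data file

# Delete rows that contain empty cells.
df = df.dropna()

# Standardize a text input, e.g. inconsistent capitalisation and spacing.
df['treatment'] = df['treatment'].str.strip().str.lower()
```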

##### Enriching
Once you understand your existing data and have transformed it into a more usable state, you must determine whether you have all of the data necessary for the project at hand. If not, you may choose to enrich or augment your data by incorporating values from other datasets. For this reason, it's important to understand what other data is available for use.

If you decide that enrichment is necessary, you need to repeat the steps above for any new data.

##### Validating
Data validation refers to the process of verifying that your data is both consistent and of high enough quality. During validation, you may discover issues you need to resolve, or conclude that your data is ready to be analyzed. Validation is typically achieved through various automated processes and requires programming, as in the sketch below.
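
For instance, a few automated checks in `pandas` might look like the following sketch (the column names `value` and `sample_id` are hypothetical):

```python
import pandas as pd

df = pd.read_csv('raw_data.csv')  # hypothetical cleaned data file

# Each assertion fails loudly if the data does not meet expectations.
assert df['value'].notna().all(), "missing measurements"
assert (df['value'] >= 0).all(), "negative measurements"
assert df['sample_id'].is_unique, "duplicate sample IDs"
```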

##### Publishing
Once your data has been validated, you can publish it. This involves making it available to others within your organization for analysis. The format you use to share the information, such as a written report or electronic file, will depend on your data and the organization's goals.

#### Importance of Data Wrangling
Any analyses you perform will ultimately be constrained by the data that informs them. If data is incomplete, unreliable, or faulty, then the analyses will be too, diminishing the value of any insights gleaned.

Data wrangling seeks to remove that risk by ensuring data is in a reliable state before it's analyzed and leveraged. This makes it a critical part of the analytical process.

It's important to note that data wrangling can be time-consuming and taxing on resources, particularly when done manually. This is why many organizations institute policies and best practices that help employees streamline the data cleanup process, for example by requiring that data include certain information or be in a specific format before it's uploaded to a database.

For this reason, it's vital to understand the steps of the data wrangling process and the negative outcomes associated with incorrect or faulty data.

> ***Case study example:***
>
> *A PhD student has recorded lab results in an Excel spreadsheet.
> She has a copy of the raw results, but for her analysis, she wants to remove some outliers.
> She manually deletes some rows and saves the data as .csv.
> There is no record of what she deleted or the rule she applied, and a manual check of the differences between her data and the original would be necessary.*
>
> *Instead, another PhD student writes an R script that reads the raw data file, uses a filter where the method and parameters are saved, and saves the output as a CSV file. Her process for removing outliers is reproducible.*
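
A scripted filter of this kind might look like the following sketch, written in Python to match the code used later in this episode (the file names, column name and threshold are hypothetical):

```python
import pandas as pd

df = pd.read_csv('lab_results_raw.csv')  # hypothetical raw results

# The rule and its parameter are recorded in the script:
# keep values within 3 standard deviations of the mean.
THRESHOLD = 3
zscore = (df['value'] - df['value'].mean()) / df['value'].std()
df[zscore.abs() <= THRESHOLD].to_csv('lab_results_no_outliers.csv', index=False)
```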

Within a computational project, these steps can easily become obscured, so specific effort is required to make sure no one is manually processing data in a way that can't be repeated and that all the steps are recorded.

```python
# genomeProject/processing/data_cleaning.py

import pandas as pd

# Read the raw data from a file.
df = pd.read_csv('genomeProject/data/220103_GenomicData.csv')

# Show the first 5 rows of the data set.
df.head(5)

# Remove unwanted columns by name.
df.drop(columns=['year', 'quasi_column_ID'], inplace=True)

# Rename a column (reassign, since rename returns a new DataFrame).
df = df.rename(columns={"batch_3_4_7": "batch_347"})

# Show the first 5 rows again, now with some columns removed and one renamed.
df.head(5)

# Change the index of the data to an ID.
df = df.set_index('ID')

# Save the processed data set.
df.to_csv('genomeProject/data/220103_GenomicData_processed.csv')
```

## Data exploration and insights

The first part of data analysis is exploration. It is usually helpful to look at the data directly, and this might involve (see the sketch below):

- Printing rows of a dataset in a notebook/IDE
- Printing the summary statistics for each column
- Producing a scatter plot or histogram for a column of data
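
A minimal sketch of these first looks with `pandas` and `matplotlib`, reusing the processed file from the script above (the numeric column `value` is hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('genomeProject/data/220103_GenomicData_processed.csv')

# Print the first rows and the summary statistics for each column.
print(df.head())
print(df.describe())

# Quick histogram of one column.
df['value'].hist(bins=30)
plt.show()
```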

Visualising processed data before any analysis is almost always helpful.
When working with data sets, the `ggplot` (in R) or `matplotlib`/`seaborn` (in Python) libraries provide attractive figures that can be produced very quickly. Visualising data is useful for exploring it from the start, and also illustrates the methodology or the steps of the analysis.
This is particularly valuable in Jupyter Notebooks. Code to produce figures should be literate, functional, and reusable in the same way as data cleaning and analysis code.
That way, future visualisations can easily be updated or reused.

Unlike data visualisations for release, communicating results, or publications, these figures may be more practical than aesthetic.
It is perfectly acceptable to open 25 histograms at once when exploring data, even though in an eventual paper you would only show one.
Data visualisation is a tool, and small functions or scripts can make sure exploring data is practical and easy to repeat.

The code to produce the following figure is below; here `x` and `y` stand for two numeric columns of a data set.
Libraries like `matplotlib` do a lot of the hard work for you, and there are countless tutorials for different kinds of visualisations.

![](https://www.oreilly.com/library/view/python-data-science/9781491912126/assets/pyds_04in39.png)

```python
import matplotlib.pyplot as plt

# x and y are two numeric data columns (defined elsewhere).
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')
```

_extras/guide.md

Lines changed: 25 additions & 1 deletion
@@ -1,6 +1,30 @@
 ---
 title: "Instructor Notes"
 ---
-FIXME
+
+The content of this course may be delivered on different schedules,
+in parts or in a different order.
+Here we list different possibilities.
+
+# Teaching schedules
+
+## Six one-hour sessions
+
+1. Introduction: what this course is, what the motivations are for including data science practices in research, and what we mean by data science practices.
+2. Reproducibility, provenance and version control.
+3. Setting up a project and its management tools.
+4. Research data management in a computational project.
+5. Code quality control.
+6. Publication and open science.
+
+## One-day workshop (8h)
+
+Following the blocks above, but adding ice-breaking and review exercises.
+
+# Advertise the course
+
+The content of the introduction part (chapters 1-3) may be reused to give a 10-minute overview of the objectives.
+
+- It might be best to start a workshop with chapter 2: motivations.

 {% include links.md %}

episodes/02-motivation.md

Lines changed: 0 additions & 2 deletions
@@ -5,9 +5,7 @@ exercises: 0
 questions:
 - "How does this training relate to your work?"
 - "What are the benefits of using data science skills?"
-- "What can go wrong with working on data/code?"
 - "What are the challenges for teams and management?"
-- "Are there procedures and protocols that can help?"
 objectives:
 - "Understand how this training material will help your research (and career)."
 keypoints:

episodes/03-overview.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
---
title: "What is special in a data science project?"
teaching: 10
exercises: 5
questions:
- Get an overview of the training material
- Understand how the different aspects of this material relate to one another
objectives:
- "Understand how this training material is organised."
keypoints:
- "content organisation"
---

# Overview of this training material

> ## Team and data science
> The principles taught in this course can be applied to any research project.
> It will help project managers work with online tools, with people with specialised skill sets,
> and with a project involving some code writing and code reuse.
> The goal is to produce a reproducible data analysis in a modern research context.
>
> This requires some specific and some non-specific project management actions.
> This course covers a broad range of actions to perform before, during and after the project is conducted.
>
{: .callout}

This program will teach you best practices in data science project management and how to apply them to research projects.
This material will help you to manage a research project that comprises some *online collaborative working*,
has a relatively *big team, where people have complementary skills*,
uses *coding or programming*, as well as the *reuse of code*,
and, last but not least, aims at producing a *reproducible analysis*, as pictured below.


<img src="../fig/datasciencespecifics.jpg" alt="Specificity of data science projects. Five blocks (working online, large teams whose members have specialised skills, writing code, re-using code) are placed around a central block where reproducible analysis is written. Data specifics by Julien Colomb CC-BY 4.0" width="500"/>

Here we give a short overview of the topics that will be covered in this course.
Note that the course episode split follows a different logic, and you will find training linked to each of these five data science specifics in most episodes.


## Online work

When part of the work happens online, it becomes very easy to lose track of what documents are where,
what tools were used (by whom), and who is doing what.
Discussions are also more difficult to organise, and meetings are more complex to run.
In this course, we will look at different elements that make this work easier.

At first, there should be one entry point for the project, where every team member can find the main documentation as well as links to other documents and data.
This starts with setting `one main URL` when setting up the project, as well as using good readme files and templates. The information needs to be updated during the project and shared with the whole team.

The use of online project management tools (like kanban boards for to-do lists) can also help members of the team coordinate their work and follow their achievements.

## Team science

Because teams can be big and quite heterogeneous in terms of skills (especially computer and programming skills), it is important to follow best practices of team building.

In particular, one should set reasonable goals and milestones for the project, and document them in the main documentation.
It is also important that every team member knows what their part is, and that the work is well distributed.

One should make sure every team member is able to use the communication tools set up for the team, and take particular care with the organisation of meetings. Data and code should be documented (and this documentation work should be fostered), such that every team member can follow and reuse the work of the other team members.

## Involves coding

When data analysis is done via a programming language, things mostly become easier, but this facility has some drawbacks, as well as some effects on data management practice.

First, a data analysis workflow will now start with the computer reading the raw data.
This means that the choice of the data format for the raw data may change, and that manually gathered data should be (easily) computer readable.
This is particularly important for spreadsheets, as a lot of time can be saved by designing the spreadsheet in a tidy format.
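
For example, a tidy sheet keeps one observation per row and one variable per column, so a script can read and summarise it directly; a minimal `pandas` sketch (the file `measurements.csv` and its columns are hypothetical):

```python
import pandas as pd

# A tidy sheet: one observation per row, one variable per column.
df = pd.read_csv('measurements.csv')  # columns: sample_id, treatment, value

# Because the layout is tidy, a summary is a single line of code.
print(df.groupby('treatment')['value'].mean())
```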

Second, statistical analysis and data representation in figures will now be much easier to perform, making p-hacking and HARKing practices very easy to fall into, even involuntarily. Researchers have to actively make sure their analysis is not flawed.

Third, making errors in the code may have larger effects than making errors in a manual analysis.
However, errors are easier to spot (through code reviews and tests), and when the code is corrected, the results are immediately corrected, too.

## Involves reuse of code

Very soon in a research project, writing code consists mostly of taking code written by someone else and applying it (sometimes with some tweaks) to one's own data.
We will look at ways to find relevant code, make sure it can be trusted, make sure you can legally use it, and ways to cite it (to give the recognition the initial software engineers deserve).

In addition, code written in the project will probably be reused, too.
We will look into best practices to make this reuse easier, both in how the code is written (modularity, documentation, tests) and in how the code is shared (license, repository, version control, releases, user instructions, developer instructions).


## Reproducibility

At the core of data science, the reproducibility of the analysis is both a goal and a huge advantage (in terms of research transparency, trustworthiness and work efficiency).
The use of code is not enough to get a reproducible analysis: one needs access to both the code and the data used to produce the research result, a concept called provenance.
This may not be trivial, especially if several versions of the code and of the data exist.

In this course, we will put a strong emphasis on version control, while introducing the concepts (and some tools) of provenance, as well as literate programming (reproducible reports and executable papers), where the code, the figures and the explanatory text are bound in the same file.


{% include links.md %}
