Data Stack

This repository provides training for my recommended data stack, a collection of tools for working on data analytics projects. The intended audience is students in my courses at the Jon M. Huntsman School of Business at Utah State University, students that I’m mentoring on projects at the Analytics Solutions Center (ASC), and collaborators on research projects.

My recommended data stack consists of the following:

  • Python for data wrangling, visualizations, and modeling
  • Positron as the integrated development environment (IDE)
  • GitHub for version control, project management, and collaboration
  • Quarto for communicating results with presentations, reports, dashboards, etc.
  • Copilot for assisting with code development and data analysis

Python

Installing Python can be challenging, even for advanced users, as immortalized by xkcd's comic on Python environments.

Notably, Python comes pre-installed on some operating systems (OS). This version should not be used by anyone except the OS itself. For this and other reasons, you’ll need the ability to maintain multiple versions of Python on the same computer. Python is a big tent, and there are many ways to install and maintain versions. I recommend using uv, a single unified tool for installing and managing both Python versions and project environments. Get started by installing uv via the command line.
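For reference, the uv documentation provides standalone installers you can run from the command line (these commands may change over time, so check the uv documentation for the current versions):

```
# macOS and Linux (Zsh or Bash).
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows (PowerShell).
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# Confirm the installation worked.
uv --version
```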

Note

The Command Line

If using the command line (i.e., terminal or shell) is new to you, be patient, take your time, and follow instructions from a trusted source closely. A few things that might help:

  • The command line is the programming interface into your OS itself. You don’t have to know everything about it to follow instructions.
  • Instructions can be different based on the type of command line. If you're on a Mac running macOS Catalina (10.15) or later, the default shell is Zsh. If you're using Linux, the shell is Bash (and you probably already know that). And if you're using Windows, you're working with PowerShell.

Once you have uv installed, it’s easy to install and manage Python versions.

  • To install the latest stable release of Python, on the command line, run uv python install. To see which versions of Python you already have installed, run uv python find; none of these will be the off-limits OS version.
  • You can also install specific versions of Python, such as uv python install 3.13.4 to install Python 3.13.4. To view Python versions that are available to install, run uv python list.
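Taken together, a quick reference for managing Python versions with uv (3.13.4 is just the example version from above):

```
uv python install         # install the latest stable release of Python
uv python install 3.13.4  # install a specific version
uv python find            # see which versions you already have installed
uv python list            # view versions that are available to install
```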

Tip

R and Julia

Python might be the most commonly used open source programming language for data wrangling, visualizations, and modeling, but it's not the only one. The three most popular languages for data analytics are Julia, Python, and R (the Jupyter project was named for and designed to support all three). Each language comes with its own tradeoffs, culture, and overall vibe.

  • Julia is the newest and fastest and was developed by mathematicians.
  • Python is the most popular and diverse in terms of libraries and applications and was developed by computer scientists.
  • R is the most narrowly focused on data analytics and culturally cohesive and was developed by statisticians.

If you want to learn more than one of the three languages, and you arguably should, I recommend focusing on becoming proficient in one language first and then transferring that understanding to picking up a second. For example, see how I learned Python coming from a background using R.

You can also use uv to manage project environments and make them reproducible. A project environment is composed of the language(s) and libraries (including their dependencies) used for a given project. What makes a project environment reproducible is keeping track of which versions of the language(s) and libraries you're using so that the environment can easily be recreated on another machine by you (including future you) or someone else. You can set up your own project environment or use an existing one, like the one included when you use my project template. If you aren't using my project template, it's still easy to set up and manage a project environment.

  • Navigate to a project working directory via the command line or using a code editor or IDE like Positron. This working directory should not be in a location on your local machine that is being synced to the cloud via OneDrive, iCloud, etc.
  • Run uv init to initialize a project environment. This creates a pyproject.toml file with metadata about the project and a hidden .python-version file that specifies the default version of Python for the project. (It also creates main.py and README.md files that you can use or delete.)
  • With the project environment initialized, you can install libraries. For example, to install the Polars library, run uv add polars. This installs Polars and its dependencies, and it creates both a uv.lock file that keeps track of the versions of the libraries you've installed and a hidden .venv folder, the reproducible (or virtual, hence the "v" in venv) environment that serves as the project library. (See the example after this list.)
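Here is what that setup might look like end to end for a hypothetical project folder named my-project:

```
# From inside the project working directory (e.g., my-project/).
uv init

# Add libraries to the project environment (Polars is just the example above).
uv add polars

# The project now contains pyproject.toml, .python-version, uv.lock, and .venv/.
```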

By default, Python libraries are installed in a single, global library on your computer known as the system library. The fact that we have a project library instead highlights an important feature of making project environments reproducible: each project has its own project library and is thus isolated. If two projects use different versions of the same library, they won't conflict with each other because they'll each have their own project library. (Well, not exactly: uv maintains a global cache so the same version of a given library doesn't have to be downloaded and installed more than once, and each project library references that cache.) Whenever you install new libraries, the uv.lock file is automatically updated.

Note

Library Preferences

While there are many Python libraries, I recommend the following for the three categories of data analytics tasks:

  • Data Wrangling: Polars is a fast, self-consistent library for data wrangling that is growing in popularity as an alternative to pandas.
  • Visualizations: seaborn.objects is a module built around the consistency of the grammar of graphics philosophy. Unlike its parent library Seaborn, and while still in development, it's designed to minimize the need to invoke the underlying matplotlib for fine-tuning. Totally separate from the matplotlib architecture, plotnine is a Python port of R's {ggplot2} package and also uses the grammar of graphics.
  • Modeling: scikit-learn is the most widely used library for machine learning, but it doesn’t do statistical inference. For that I recommend the statsmodels and Bambi libraries for frequentist and Bayesian modeling, respectively.
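As a minimal sketch of the first of these, here's what some Polars data wrangling looks like, using made-up data for illustration (after running uv add polars in your project):

```python
import polars as pl

# Hypothetical data for illustration.
df = pl.DataFrame({
    "store": ["A", "A", "B", "B"],
    "sales": [120, 135, 98, 110],
})

# Filter rows, derive a new column, and summarize by group.
summary = (
    df
    .filter(pl.col("sales") > 100)
    .with_columns((pl.col("sales") * 1.07).alias("sales_with_tax"))
    .group_by("store")
    .agg(pl.col("sales_with_tax").mean().alias("avg_sales_with_tax"))
)

print(summary)
```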

There is a lot more that uv can do. For example, if you're starting with an existing project, run uv run and the libraries specified in uv.lock will be installed automatically. And if someone is using another tool to install libraries instead of uv (e.g., pip), they will likely need a requirements.txt file or a pylock.toml file to reproduce the project environment, which you can create with uv export --format requirements.txt or uv export -o pylock.toml, respectively.

Positron

A code editor or integrated development environment (IDE), outside of an open source language, is arguably your most important tool as a data analyst or data engineer. There are many options, but I recommend Positron, a next-generation data science IDE. Built on VS Code’s open source core, Positron combines the multilingual extensibility of VS Code with essential data tools common to language-specific IDEs. After installing Python, download and install Positron.

The following highlights some of Positron's essential data-friendly functionality. You can also watch a quick tour of Positron. There is more that Positron can do, including connecting to databases and virtual machines. Since Positron is built on VS Code's open source core, VS Code's excellent documentation remains largely relevant. However, because Positron is built on the open source core rather than the proprietary VS Code product itself, certain proprietary VS Code extensions aren't available in Positron. See the Open VSX Registry to search all available extensions.

Console and Session

If you’ve used VS Code, Positron’s layout will look familiar. When selected from the activity bar, the explorer on the left shows the folder you have open, which also establishes your working directory, and the central pane is the editor where you type and run code. Two obvious differences are the integrated console (in the bottom pane by default) and the session information (in the right pane by default), which includes details of the variables and data that have been loaded.

The console is where your code executes. The terminal (i.e., the command line, also in the bottom pane by default) automatically uses the folder you have open in the explorer as its working directory, so you don't need to navigate there manually. The variables and data in the session information help you keep track of what you're working with.

Data Explorer

You can click on the data frame icon to the right of any data you have loaded (or even CSV or Parquet files in the explorer without importing first) to open the data explorer.

The data explorer is designed to facilitate coding, not replace it. It provides a summary of the data, including simple visualizations, and allows you to quickly sort and filter the data to inform how you go about wrangling it programmatically.

Plots

Along with variables and data, the session information has a dedicated location for visualizations, including a history gallery to click through and easily compare previous plots. Visualizations can also be opened as a separate tab in the editor pane. This also includes support for interactive plots.

Help

Including a question mark after most any command in the console will open the help (in the right pane by default). This serves as a built-in browser to allow you to reference online documentation, including parameter definitions and examples you can copy and use.

Positron Assistant

Positron Assistant is an AI tool integrated into Positron with contextual awareness of everything in your project. You can use Positron Assistant to ask questions, to edit or refactor code, or to have it act as an agent that accomplishes specific tasks.

Important

Preview Feature

Positron Assistant is currently a preview feature with only Claude available for chat and either Claude or Copilot for inline code completions.

Command Palette

The command palette is the primary way to manage options (e.g., pane layout views and themes) and is a mainstay of the shortcut-heavy VS Code. Open it with Cmd/Ctrl + Shift + P.

GitHub

Git is a powerful version control system. While it is the industry standard for software development, we can easily adopt this framework to provide structure for any kind of data project. GitHub is an online hosting service where each project lives in its own repository. Learning to use Git and GitHub not only aids in collaboration, it will ultimately allow you to develop an online portfolio of work.

To get started, you'll need to register a GitHub account and install Git using the command line (substituting R references with Python and RStudio with Positron) or by downloading the latest source. You'll also need to introduce yourself to Git (using the email associated with your GitHub account and again substituting R references with Python and RStudio with Positron) and then authorize Positron to use your GitHub credentials (which you should be prompted to do when first cloning your project repository). Note that when you get a prompt from Positron to use Git: Fetch automatically, go ahead and select yes. This will allow your local Git to be aware of updates, including new branches, on GitHub.
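For reference, introducing yourself to Git typically amounts to two commands in the terminal (substitute your own name and the email associated with your GitHub account):

```
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
```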

I recommend using my project template to create your project repository. Just click on “Use this template” and create a new repository with a short, lowercase, hyphenated slug as the repository name that’s consistent with the project (e.g., advanced-coursework). One person will maintain the repository and have it connected to their account (i.e., the mentor for ASC projects) while others working on the project can be added as collaborators. Anyone with access to the project repository can copy (i.e., fork) it to save, maintain, or contribute to if they aren’t a collaborator.

Note that there are certain limitations to the size and type of files that can be hosted (i.e., pushed to GitHub). There are also certain things that shouldn’t be accessible by the public (e.g., data we are under NDA to access). For these reasons, we have files and folders that are pushed to GitHub and those that are not. Here’s how the project repository is organized:

  • /code Scripts with prefixes (e.g., 01_import-data.py, 02_clean-data.py) and functions in /code/src.
  • /data Simulated and real data, the latter not pushed.
  • /figures PNG images and plots.
  • /output Output from model runs, not pushed.
  • /presentations Presentation slides.
  • /private A catch-all folder for miscellaneous files, not pushed.
  • /writing Paper, report, and case studies.
  • /.venv Hidden Python project library, not pushed.
  • .gitignore Hidden Git instructions file.
  • .python-version Hidden Python version file.
  • pyproject.toml Python project environment configuration file.
  • uv.lock Python project environment lockfile.

Any file or folder that begins with a period is hidden (i.e., you won't see it in your OS file explorer by default, but you will see it in the explorer in Positron). The .gitignore file is what controls which files and folders are pushed. Note that the project template also includes instructions for using the project environment as well as a link to this training.
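To make that concrete, entries along these lines in .gitignore would keep the not-pushed files and folders listed above off of GitHub (these are illustrative patterns, not necessarily the template's actual .gitignore):

```
# Model output, private files, and the project library are never pushed.
output/
private/
.venv/

# Real data files are ignored individually or by pattern (hypothetical name).
data/real-data.csv
```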

Issues

To stay organized, manage your project by keeping track of tasks using GitHub’s issues (see the tab at the top of the project repository on GitHub). There you can have an ongoing conversation and close tasks out when a given issue is completed or resolved. Be sure to tag collaborators you want to see a specific comment (e.g., @marcdotson). Think of this as an email thread or chat channel except all of the conversations are in one place, easily searchable, and automatically archived as part of the version control.

Clone

Once you are a collaborator on a project repository, you can clone it. Cloning simply means you're creating a local copy of the repository, though the clone isn't just a copy of the folder: Git still works in the background, keeping track of changes and managing the version control. Using the command palette in Positron, use the Git: Clone command and select the project repository you'd like to clone.
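If you prefer the terminal over the command palette, cloning is a single command using your project repository's URL (the URL below is a placeholder):

```
git clone https://github.com/owner/repository-name.git
```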

You only need to clone the project repository one time. Please note that any of the files and folders that aren’t pushed to GitHub can be created in your cloned repository without any impact on the repository hosted on GitHub. For example, I typically store PDFs and other related materials that I don’t want to (or can’t) share in the /private folder so that the resources I need are all within the same directory on my computer.

Branches

Remember that Git and GitHub are built for software development. Following that analogy, Git operates through the use of branches. Each branch in a repository is a separate version of the repository that exists in parallel and is focused on a specific issue. For example, we could have a branch called initial-model and another branch called data-cleaning. For version control to be useful, we need to be concise and descriptive with branch and other naming conventions (i.e., no final-final-draft-02 nonsense here).

Every repository has a main branch. If this were software, the main branch would be the branch that is being used in production. Never make changes directly to the main branch. You can see which branch you’re working in by looking at the bottom left corner in Positron. Assuming the branch you need to work on has already been created, the first thing you should do when starting to work is navigate to the branch you want to work in. Use Git: Checkout to... via the command palette to select the correct branch.

Commit, Push, and Pull

You've identified what you need to work on using issues, cloned the project repository, and made sure you're working on the correct branch. Finally, you can get to work! Then what? Once you've made a number of changes to your cloned project repository, how do you share them with your collaborators? If you click on the source control tab below the explorer (by default in the left pane in Positron) you'll see all of the files you've changed. You first need to stage these changes using the plus sign next to the files. You then need to provide a commit message. Like the branch names, these should be short and descriptive, such as "Created a function to parse text data" or "Cleaned up errors in the final model".

It is the branch names and these commit messages that provide a record of the work we have done. With a descriptive message, you are ready to commit. This is like saving a file, except we can save multiple files all at once associated with the commit message we’ve written. These changes are now archived as part of the version control on our cloned project repository. To share them with our collaborators, we need to push them to the repository on GitHub. Similarly, to get changes others have pushed, we need to pull them from the repository on GitHub. In Positron, this is often summarized as one step called sync, which is pulling and then pushing sequentially.

To summarize, your daily work in Positron will look like this:

  1. Open the cloned project repository in the explorer pane to set your working directory.
  2. Make sure you’ve checked out the correct branch to work in (the branch name you’ve checked out is in the bottom left corner).
  3. Pull changes that have been pushed since you last worked on the project using the sync button in the source control pane or next to the branch name in the bottom left corner.
  4. Once you’ve done some amount of work, stage and commit changes with a descriptive message in the source control pane.
  5. Push changes using the sync button in the source control pane or next to the branch name in the bottom left corner.
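Positron's source control buttons run standard Git commands behind the scenes. For reference, roughly the same cycle on the command line looks like this (the branch name and commit message are placeholders):

```
git switch data-cleaning   # check out the branch you want to work in
git pull                   # pull changes others have pushed

# ...do some amount of work...

git add .                  # stage your changes
git commit -m "Cleaned up errors in the final model"
git push                   # push your commits to GitHub
```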

Pull Requests

When you’ve completed work on the issue associated with the branch, create a pull request on GitHub. A pull request is exactly what it sounds like – a request submitted to the repository maintainer to pull the changes you’ve made. This allows the maintainer, or someone they assign, to review what you’ve done, have a conversation with you about it as part of the pull request itself (which looks a lot like an issue tied specifically to the pull request), and eventually pull what you’ve done into main. After the pull request is completed, the branch specific to that issue can be deleted and the associated issue can be closed out.

Note that when a branch is deleted on GitHub, it will still exist in your cloned repository. This isn’t necessarily a problem, though if you commit changes to a closed branch it will force the branch open again. Remember to make sure you’re working on the correct branch. Eventually you may want to clean up branches that have been merged into main and closed on GitHub by using Git: Delete Branch... via the command palette, followed by the branch name. You may also need to use Git: Fetch to prune tracking branches that are no longer on remote (i.e., on GitHub).
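If you prefer the terminal, the rough command-line equivalents are (the branch name is a placeholder):

```
git branch -d data-cleaning   # delete a local branch that has been merged
git fetch --prune             # prune tracking branches no longer on GitHub
```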

Quarto

Quarto is an open source publishing system where we can combine text along with code and its output. If you’ve used Jupyter notebooks, Quarto documents will be familiar. However, the most important difference is that the notebook format of a Quarto document is simply a means to an end. Quarto is built using a sophisticated tool called Pandoc that can take whatever we produce within the Quarto document and render it into a Word document, PowerPoint presentation, PDF, Revealjs slide deck, interactive dashboard, website, etc. Browse through the gallery to see what sort of things are possible.

Quarto is a command line tool that is also available as a VS Code extension that comes pre-installed with Positron. The project template has Quarto documents (e.g., README.qmd) used throughout. Whenever you make a change to a Quarto document, render the document into its specified format and a preview of the rendered document will appear in Positron’s viewer (in the right pane by default). If you are using Python within the Quarto document, Quarto will render the output using the Jupyter kernel in the background.
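You can also render from the terminal. For example, for the template's README.qmd:

```
# Render the document into its specified format.
quarto render README.qmd

# Or keep a live preview open while you edit.
quarto preview README.qmd
```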

Please note that a Jupyter notebook can also be rendered into any of these different outputs via Quarto and Pandoc. For example, we can render a Jupyter notebook called data-analysis.ipynb into a PDF using the command line with quarto render data-analysis.ipynb --to typst. However, just because we can doesn't mean we should. Unless we need to produce output in a format other than code, much of the code we write for a project can simply live in flat text Python .py scripts.

The Quarto documentation is comprehensive and highly recommended, especially as you adapt work for different formats. The following sections highlight some of the essential features of Quarto documents.

YAML

The header at the top of any Quarto document is written in YAML (i.e., YAML Ain't Markup Language), which follows a simple key: value syntax. For most Quarto documents in a project repository, you should set format: gfm. When you render your Quarto document, it will create a separate markdown document using "GitHub Flavored Markdown" that GitHub can parse. For example, the header for this document is:

---
title: "Data Stack"
format: gfm
---

If you want to render the document into a PDF, use format: typst instead. Typst is modern, fast typesetting software for creating PDFs. Typst comes pre-installed with Quarto. The alternative is to install and use a slower and more cumbersome typesetting distribution tied to format: pdf.

Markdown

Quarto documents use markdown, just like in Jupyter notebooks. Markdown is a simple, generic typesetting syntax. Note that GitHub recognizes this syntax, including in issues and pull requests.

Sometimes working with markdown alone can be challenging. Positron includes a visual mode you can access inside any Quarto document. The visual mode includes some point-and-click options to help you produce markdown syntax, which can be especially helpful for things like tables and citations.

Code

Quarto allows us to include code blocks and output as part of the document. Much like Jupyter notebooks, you can include Julia, Python, and R code as well as C++, Stan, and other code blocks and output. To run Python code only, specify jupyter: python3 in the header YAML.

There are a variety of options for each code block. In addition to specifying the language used within the code block, the code block can be given an identifier, can have warnings suppressed, can run without producing output, etc. These options are specified using YAML syntax following the hashpipe operator #| within the body of the code block. Any code block YAML that should apply to the document in its entirety can simply be moved into the header YAML.
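Putting the header and block options together, here is a minimal sketch of a Quarto document that runs Python and sets a few common code block options (the label and option values are just examples):

````
---
title: "Data Analysis"
format: gfm
jupyter: python3
---

```{python}
#| label: summary-table
#| echo: false
#| warning: false

import polars as pl

pl.DataFrame({"x": [1, 2, 3]}).describe()
```
````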

Equations

If you need to include any math, you shouldn't be surprised that there's a typesetting syntax for that. It's tied to LaTeX (pronounced "lah-tech" or "lay-tech"), and our primary interest is its math syntax. Use $ around any in-line LaTeX notation or $$ around equations set on their own line. For example, we can reference $p(\theta | X) \propto p(X | \theta) \ p(\theta)$ in-line as well as centered on its own:

$$ p(\theta | X) \propto p(X | \theta) \ p(\theta) $$

Copilot

Everyone has their preferred AI tools. All Utah State students have access to Copilot. Clearly AI can contribute to learning and productivity. AI tools are especially helpful for drafting and debugging code and explaining concepts in new ways. However, AI can also be a detriment to learning and productivity. AI is at its most dangerous when we use it to replace rather than supplement thinking.

I’ve found that there is a Goldilocks zone for using AI tools like Copilot. When you know very little about a topic, AI can be dangerous because it can provide answers that sound plausible but you don’t know enough to evaluate the information. When you know a lot about a topic, AI can be unreliable and slow you down because it can easily hallucinate. AI can be most helpful when you know enough about a topic to be able to evaluate the information while also needing the help.

However you use AI, be careful, thoughtful, and transparent about it, including questioning results and citing it as a source – but not the only source – for information.
