Prepare your data using Python and VS Code
Module Source
Manipulate and clean data in Python
Goals
In this workshop, you will learn how to use Python and popular libraries such as NumPy and pandas to manipulate and clean data in preparation for analysis.
Goal | Description |
---|---|
What will you learn | How to find information about, clean, and prepare data that’s stored in a pandas DataFrame. |
What you’ll need | A Visual Studio Code environment set up to run Python and Jupyter notebooks |
Duration | 1 hr 20 min |
Just want to see the solution? | Solution |
Slides | PowerPoint |
Video
🎥 Click this image to watch Ornella walk you through the workshop
Pre-Learning
Prerequisites
- Visual Studio Code
- Python
- Python extension for Visual Studio Code
- Jupyter extension for Visual Studio Code
- Activated Anaconda environment
- A data science environment in VS Code
What students will learn
Say you want to perform some analysis on a dataset that you find interesting – like the squirrel population of Central Park, or various types of French cheese. The first thing you’ll need to do with any dataset is to clean it up. Many datasets have missing information, or won’t be formatted in the exact way you’d like. In this workshop, you will learn how to use data science libraries to prepare your data for analysis and visualization.
Introduction
In this section, you’ll review an introduction and make sure that your data science environment is set up correctly before continuing on to the next part of the workshop.
Explore DataFrame information
Next, you will learn how to use Python libraries to explore an iconic dataset. You will see how a pandas DataFrame lets you quickly check the size, shape, and content of a dataset.
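As a rough illustration of this kind of exploration, the sketch below builds a tiny DataFrame and inspects it. The squirrel-themed column names and values are made up for the example and are not the workshop's dataset; the pandas calls (`shape`, `head`, `info`, `describe`) are standard.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "fur_color": ["gray", "gray", "cinnamon", "black"],
    "weight_g": [510.0, 540.5, 495.0, 530.0],
    "age_years": [2, 3, 1, 4],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"{n_rows} rows x {n_cols} columns")

# head(n) returns the first n rows, handy for a quick look at the content.
print(df.head(2))

# info() prints column names, dtypes, and non-null counts.
df.info()

# describe() summarizes the numeric columns (count, mean, std, quartiles).
print(df.describe())
```

In a notebook cell you can usually drop the `print` calls and let the last expression display itself.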
Work with missing data
Now that you know how to get an overall sense of the dataset you are working with, you will learn how to identify and deal with missing values.
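A minimal sketch of identifying and handling missing values might look like the following. The cheese data is invented for the example, and filling with the column mean is only one possible strategy, not necessarily the one the workshop uses.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing entries.
df = pd.DataFrame({
    "name": ["Brie", "Comté", None, "Roquefort"],
    "fat_pct": [28.0, np.nan, 31.0, 32.0],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill missing values, here with the column mean for numbers
# and a placeholder string for text (both choices are assumptions).
filled = df.fillna({"fat_pct": df["fat_pct"].mean(), "name": "unknown"})
print(filled)
```

Dropping rows is simple but discards data; filling keeps every row at the cost of introducing estimated values, so the right choice depends on the analysis.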
Remove duplicate data
Another common task with most datasets is removing duplicate data. In this section of the workshop, you will learn how to use pandas to detect and remove duplicate entries.
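Detecting and removing duplicates could be sketched as below, using `duplicated` and `drop_duplicates`. The data and the choice of `id` as the identifying column are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data: row 2 repeats row 1 exactly,
# and rows 3 and 4 share an id but differ in city.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 3],
    "city": ["Paris", "Lyon", "Lyon", "Nice", "Nantes"],
})

# duplicated() marks rows that exactly repeat an earlier row.
dup_mask = df.duplicated()
print(dup_mask)

# drop_duplicates() removes those exact repeats.
deduped = df.drop_duplicates()

# With subset=, only the listed columns are compared;
# keep="first" retains the first row seen for each id.
deduped_by_id = df.drop_duplicates(subset="id", keep="first")
print(deduped_by_id)
```

Comparing whole rows is the safer default; restricting the comparison to a subset of columns is useful when one column is meant to be a unique key.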
Combine datasets
Sometimes, you will need to combine datasets. pandas provides several methods for this, including merging and joining.
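The sketch below shows three common ways to combine DataFrames: `merge` on a shared column, `join` on the index, and `concat` to stack rows. The two small tables and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables that share an "id" column.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [0.5, 0.7, 0.9]})

# Inner merge: keep only ids present in both tables.
inner = left.merge(right, on="id", how="inner")

# Outer merge: keep every id; gaps are filled with NaN.
outer = left.merge(right, on="id", how="outer")

# join() aligns on the index and is a left join by default.
joined = left.set_index("id").join(right.set_index("id"))

# concat() stacks tables with the same columns on top of each other.
stacked = pd.concat([left, left], ignore_index=True)

print(inner)
print(outer)
print(joined)
```

The `how` argument controls which keys survive the merge, so it is worth checking the row count of the result against what you expect.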
Exploratory statistics and visualization
So far, you’ve learned how to use pandas methods to examine a DataFrame and to fill in, remove, and combine data. Finally, you will use summary statistics and visualizations to better understand your data.
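As one possible illustration, the sketch below computes a few summary statistics and draws a histogram. It assumes matplotlib is installed alongside pandas, and the data, column names, and bin count are made up for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "fur_color": ["gray", "gray", "cinnamon", "black", "gray", "cinnamon"],
    "weight_g": [510.0, 540.0, 495.0, 530.0, 505.0, 520.0],
})

# Summary statistics for one numeric column.
summary = df["weight_g"].describe()
print(summary)

# Mean weight within each group.
mean_by_color = df.groupby("fur_color")["weight_g"].mean()
print(mean_by_color)

# Histogram of the weights, drawn on an explicit Axes object.
fig, ax = plt.subplots()
df["weight_g"].plot.hist(bins=5, ax=ax)
ax.set_xlabel("weight (g)")
ax.set_title("Distribution of weights")
# In a notebook the figure is shown when the cell finishes;
# in a plain script, call plt.show() to display it.
```

A histogram is a reasonable first plot for a single numeric column; scatter plots or box plots may be more informative when comparing columns or groups.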
Next steps
- Explore and analyze data with Python
- Introduction to machine learning
- Discover the role of Python in space exploration
Practice
To test your knowledge, try downloading a free dataset from Kaggle that you find interesting. Use the techniques that you learned in this workshop to manipulate and clean your data!
Feedback
Be sure to give feedback about this workshop!