Chapter 3 Technical Path
3.1 Introduction
3.1.1 Opening Thoughts
There are many ways to learn technical material. Paid courses, YouTube videos, blog posts, and books can all be effective, depending on your learning style. The following sections try to strike a healthy balance across all content types.
By default, all of the recommended learning content is free. Some additional resources are also mentioned that you might have to pay for. These serve as extra avenues to strengthen the concepts and techniques learned in the initial free content.
Machine learning practitioners fall into two programming language camps: those who use R and those who use Python. Both languages have their strengths and weaknesses. All learning content will have an equal number of resources for both R and Python.
3.1.2 R or Python?
Short answer: Eventually you will have to learn both to become a successful data practitioner, but if you could only pick one, choose Python.
3.1.3 Getting the Most out of Learning
Deliberate practice is the best way to make learning stick and to rapidly evolve your skills.
Whenever you learn something new in the data and AI world, it’s usually best to apply it immediately to a real world project within your job or company. By using a real world problem to practice what you just learned, you reinforce the new knowledge in your long term memory while at the same time driving impact in your job by solving real problems. What a bonus! Be careful about only working on “toy data sets”, meaning public data that has been beaten to death by hundreds of blogs and courses. The real world of data is messy and unpredictable, so working on things related to your current job or company gets you comfortable with that uncertainty even faster.
Don’t feel bad about looking things up on Bing/Google. Every technical person who works with computers today most likely looks things up online every day. Software syntax takes time to learn, and some of the best engineers still don’t remember all the ins and outs of a language. When in doubt, look it up online! Sites like Stack Overflow will quickly become your best friend as you try to work through issues in your code.
3.2 Installing Software
Getting started with the right development environment can save tons of headaches further down the road. While there are many options for which Integrated Development Environment (IDE) to use, the ones below are quickly becoming the standard for each language.
3.3 Data Analysis and Manipulation
Learning how to manipulate data outside of existing tools like Excel or Power BI quickly gives you data super powers you never thought possible before. Breaking out of the four walls of Excel and into the data universe by leveraging languages like Python and R unlocks so much more potential for impact in whatever job you do. Even if you don’t plan to build your own machine learning models, knowing the basics of data manipulation is an important skill to have. It also builds the data foundation that machine learning rests on if you ever want to come back and start building models.
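As a small illustration of the kind of manipulation this section refers to, the sketch below filters and aggregates a handful of made-up expense records using only the Python standard library. The records, field names, and helper functions are hypothetical and exist only for illustration; in practice a dedicated library such as pandas is the more common tool for this work.

```python
from collections import defaultdict

# Hypothetical expense records, standing in for rows exported from a spreadsheet.
records = [
    {"department": "Sales", "amount": 1200.0},
    {"department": "Sales", "amount": 800.0},
    {"department": "Finance", "amount": 450.0},
    {"department": "Finance", "amount": 550.0},
    {"department": "IT", "amount": 300.0},
]


def total_by_department(rows):
    """Sum the 'amount' field for each department (a group-by aggregation)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["department"]] += row["amount"]
    return dict(totals)


def filter_rows(rows, minimum):
    """Keep only the rows whose amount is at least `minimum`."""
    return [row for row in rows if row["amount"] >= minimum]


totals = total_by_department(records)
large_expenses = filter_rows(records, 500.0)
print(totals)
print(len(large_expenses), "expenses of 500 or more")
```

The same two operations, filtering and grouping, are what a pivot table does in Excel. Writing them as code makes the steps repeatable on new data.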
3.3.3 Additional Resources
- Explore and Analyze Data in Python: Microsoft 📃 - Python
- Python for Excel 📕 - Python
- Python Crash Course 📕 - Python
- Advancing Into Analytics 📕 - Python/R
- Python and R for the Modern Data Scientist 📕 - Python/R
- R Programming Tutorial: Learn the Basics of Statistical Computing 📹 - R
- Data Transformation Cheat Sheet 📃 - R
- R Basics: Harvard 🏫 - R
3.4 Version Control
If you plan to work with others on any project that contains code, knowing version control, and specifically Git, is a must. Get up to speed with how to use Git and its most famous hosting service, GitHub. This skill opens up new opportunities to contribute to open source projects and even build your own open source software. It’s also required for working on any technical team that collaborates on projects.
3.5 Machine Learning Basics
Let’s get our feet wet on the introductory concepts of machine learning. Learn more of the terminology, build a few models, and start to understand how the data science life cycle starts to take shape. This section is by no means a comprehensive view of machine learning today, but it’s a good starting point.
Future sections will cover most of these topics again but in more depth. Having some repetition of terms and concepts will help reinforce the knowledge in your brain and help you understand that there are always different angles from which to attack data problems with machine learning.
3.5.4 Additional Resources
- PCA Main Ideas 📹
- Introduction to Machine Learning: Microsoft 🏫 - Python
- Introduction to Machine Learning: Udemy 🏫 - Python
- Machine Learning for Beginners 📕 - Python
- Best Python Machine Learning Libraries 📃 - Python
- Analytical Skills for AI & Data Science 📕 - Python
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow 📕 - Python
- Avoiding Machine Learning Mistakes: LinkedIn Learning 🏫 - Python
- Microsoft Approved Data Science Learning Resources 📃 - Python/R
- Introduction to Statistical Learning 📕 - R
- Companion Book to Introduction to Statistical Learning 📃 - R
- R Cheat Sheet 📃 - R
- Practical Data Science with R 📕 - R
- Modern Data Science with R 📕 - R
3.6 Regression
Regression deals with predicting numerical quantities. It will quickly become your bread and butter for leveraging machine learning in finance. Understanding how to use software packages to train models and how each model works are both crucial to leveraging regression techniques to the fullest. Most of the resources here deal with examples of regression in action. Take time to soak in how these tutorials and experts approach a regression problem, how they structure their code, and the way they communicate the outputs.
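To make "predicting numerical quantities" concrete, here is a minimal sketch of the simplest regression model, a straight line fitted by ordinary least squares, written in plain Python. The spend and revenue figures are invented for illustration, and the function names are hypothetical; the software packages covered in the resources wrap this same idea behind a higher level interface.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized,
    # because the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict a numerical quantity for a new input value."""
    return slope * x + intercept


# Hypothetical data: marketing spend vs. revenue, both in thousands.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(spend, revenue)
forecast = predict(6.0, slope, intercept)
print(slope, intercept, forecast)
```

The made-up data lies exactly on a line, so the fit recovers it perfectly. Real data will not, and judging how far off the predictions are is a large part of the work.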
3.7 Time Series
Time series forecasting is a subdomain of regression, where we are trying to forecast a numerical quantity over time. Prediction over time is a separate world in machine learning, and has deep roots in classical statistical methods.
While most regression models can be turned into a time series model by incorporating various date-based features, there are also traditional statistical models that have been used solely for time series forecasting for decades. An interesting component of time series forecasting is that it can use multivariate data as well as univariate data. For example, you could forecast sales revenue by using only previous historical values of sales revenue (univariate), or you could add external regressor information like country holidays and population size to help the forecast (multivariate). Knowing both types of models is a key component of being an expert time series practitioner.
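One way to picture how a regression model becomes a time series model is to look at the feature table it would be trained on. The sketch below builds such a table from a made-up monthly revenue series, first with only date-based and lagged values (univariate), then with an external holiday flag added (multivariate). The series, the `make_features` helper, and the choice of May as a holiday month are all hypothetical.

```python
from datetime import date

# Hypothetical monthly revenue series: (month start, revenue).
series = [
    (date(2023, 1, 1), 100.0),
    (date(2023, 2, 1), 110.0),
    (date(2023, 3, 1), 120.0),
    (date(2023, 4, 1), 130.0),
    (date(2023, 5, 1), 140.0),
]


def make_features(observations, n_lags=2, holiday_months=None):
    """Turn a time series into rows a regression model could train on.

    Each row holds the value to predict ("target"), a date-based feature
    ("month"), and the previous `n_lags` values of the series itself.
    If `holiday_months` is given, an external regressor is added too.
    """
    rows = []
    for i in range(n_lags, len(observations)):
        when, target = observations[i]
        row = {"target": target, "month": when.month}
        for lag in range(1, n_lags + 1):
            row[f"lag_{lag}"] = observations[i - lag][1]
        if holiday_months is not None:
            row["is_holiday_month"] = int(when.month in holiday_months)
        rows.append(row)
    return rows


univariate_rows = make_features(series)
multivariate_rows = make_features(series, holiday_months={5})
print(univariate_rows[0])
print(multivariate_rows[-1])
```

The first `n_lags` observations produce no rows, because they have no history to look back on. Any regression model could then be trained with "target" as the outcome and the remaining columns as inputs.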
3.7.3 R
- Forecasting: Principles and Practice 📕
- Introduction to Modeltime: Forecasting with Tidymodels 📹
- High Performance Time Series Forecasting 📹
- Arima Forecasting in R 📹
- Forecasting Multiple Time Series with Modeltime 📹
- Plotting Time Series in R 📹
- Microsoft Finance Time Series Forecast Framework: finnts 📃
3.7.4 How Various Models Work
- All regression models in the Regression chapter can be turned into time series models
- Arima 📕
- Exponential Smoothing 📕
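As a taste of how one of the traditional statistical models works, here is a minimal sketch of simple exponential smoothing, the most basic member of the exponential smoothing family. It assumes the level is initialized with the first observation, which is one common choice among several, and the history values and function name are made up for illustration.

```python
def simple_exponential_smoothing(values, alpha):
    """Return the smoothed level after each observation.

    Update rule: level_t = alpha * y_t + (1 - alpha) * level_{t-1}.
    Assumption: the initial level is set to the first observation.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in the interval (0, 1]")
    level = values[0]
    levels = [level]
    for y in values[1:]:
        level = alpha * y + (1 - alpha) * level
        levels.append(level)
    return levels


# Hypothetical history of a quantity we want to forecast.
history = [10.0, 20.0, 30.0]
levels = simple_exponential_smoothing(history, alpha=0.5)

# With this model, the forecast for the next period is the latest level.
next_forecast = levels[-1]
print(levels, next_forecast)
```

A larger `alpha` makes the forecast react faster to recent values, while a smaller one smooths out more of the noise.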
3.8 Classification
Classification models try to predict the outcome of an event, for example whether a credit card transaction is fraudulent or whether a self-driving car sees a stop sign next to the road. Usually the prediction is a binary yes or no, often accompanied by a probability score between 0 and 1, where 1 means a 100% probability of something occurring. Classification models can also predict an outcome across multiple categories or buckets, such as whether a picture of a fruit shows an apple, pear, orange, etc.
Classification models are some of the most widely used machine learning models across industries today. Within finance there are many important implementations that range from compliance to risk management.
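The relationship between a probability score and a yes or no answer can be sketched in a few lines. The example below assumes a logistic style model, which is only one of many kinds of classifier, and the weights, bias, and transaction fields are invented rather than learned from any real data.

```python
import math


def sigmoid(score):
    """Squash any real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def classify(probability, threshold=0.5):
    """Turn a probability into a binary yes/no decision."""
    return probability >= threshold


# Hypothetical weights for a fraud model; a real model would learn these.
WEIGHTS = {"amount_thousands": 0.8, "is_foreign": 1.5}
BIAS = -3.0


def fraud_probability(transaction):
    """Weighted sum of the features, passed through the sigmoid."""
    score = BIAS + sum(WEIGHTS[name] * transaction[name] for name in WEIGHTS)
    return sigmoid(score)


small_local = {"amount_thousands": 0.1, "is_foreign": 0}
large_foreign = {"amount_thousands": 4.0, "is_foreign": 1}

for transaction in (small_local, large_foreign):
    p = fraud_probability(transaction)
    print(round(p, 3), "fraud" if classify(p) else "not fraud")
```

The 0.5 threshold is a default, not a rule. In fraud detection it is common to lower it, accepting more false alarms in exchange for catching more real fraud.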
3.8.1 High Level Topics
- Classification in Machine Learning 📹 (skip tutorial at end)
- Confusion Matrix 📹
- Sensitivity and Specificity 📹
- ROC and AUC 📹
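The confusion matrix, sensitivity, and specificity named in the topics above are simple counts and ratios, as the following sketch shows. The labels and predictions are made up, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1  # predicted yes, really yes
        elif p:
            counts["fp"] += 1  # predicted yes, really no
        elif a:
            counts["fn"] += 1  # predicted no, really yes
        else:
            counts["tn"] += 1  # predicted no, really no
    return counts


def sensitivity(counts):
    """Share of actual positives the model caught (true positive rate)."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


def specificity(counts):
    """Share of actual negatives the model correctly left alone."""
    return counts["tn"] / (counts["tn"] + counts["fp"])


# Hypothetical ground truth and model predictions (1 = fraud, 0 = not fraud).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts, sensitivity(counts), specificity(counts))
```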
3.9 Unsupervised Learning
Unsupervised learning is an evolving field of machine learning, and many say it is the future of AI in general. Instead of relying on existing data with known outcomes, as supervised learning (regression and classification) does, unsupervised learning tries to discover patterns in a data set without needing to know the answer ahead of time. This can be a game changer in finance when trying to segment customers into specific groups based on their purchasing behavior or finding anomalies to flag for potential fraud or corruption.
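As a minimal sketch of the anomaly use case, the example below flags unusual transaction amounts using z-scores, meaning how many standard deviations a value sits from the average. No labels of "fraud" or "not fraud" are supplied, which is what makes it unsupervised. The amounts and the 2.5 threshold are made up, and z-scores are only one simple technique among many.

```python
import statistics


def flag_anomalies(values, threshold=2.5):
    """Return the values whose z-score exceeds `threshold` in absolute terms.

    The z-score is (value - mean) / standard deviation, computed here
    with the population standard deviation of the values themselves.
    """
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # every value is identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical transaction amounts; note that no outcome labels are given.
amounts = [50, 52, 49, 51, 48, 50, 53, 47, 50, 500]

anomalies = flag_anomalies(amounts)
print(anomalies)
```

One caveat: a large outlier inflates the mean and standard deviation it is measured against, so this approach can miss anomalies in small samples.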
3.10 Natural Language Processing
Natural language processing (NLP) is all about extracting insight from unstructured data in the form of text. Our world is drowning in openly available text from Twitter, blogs, and countless documents like PDFs that could be useful within our jobs in finance. Knowing how to extract insights out of a pile of documents is a super power worth learning about!
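A first step in most text analysis is simply counting which words show up. The sketch below does this for three made-up sentences using only the Python standard library. The documents, the short stop word list, and the function names are all hypothetical, and real NLP work goes far beyond word counts.

```python
import re
from collections import Counter

# A tiny, hypothetical list of common words to ignore.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(documents, n=2):
    """Return the `n` most frequent non-stop-words across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(t for t in tokenize(document) if t not in STOP_WORDS)
    return counts.most_common(n)


# Hypothetical snippets standing in for a pile of finance documents.
documents = [
    "The invoice for the audit is overdue.",
    "Audit findings: the invoice total is wrong.",
    "Send the audit report to finance.",
]

terms = top_terms(documents)
print(terms)
```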
3.11 Deep Learning
The most rapidly evolving area of AI is deep learning, which uses a modeling architecture called neural networks. Many of the most exciting advancements in AI over the last decade have come from training neural networks on huge data sets. Deep learning has the potential to totally change how we build any type of prediction across all types of machine learning.
3.11.4 Additional Resources
- Andrew Ng: Deep Learning, Education, and Real-World AI 📹
- Nuts and Bolts of Applying Deep Learning 📹
- History of Deep Learning 📹
- Visual Introduction to Deep Learning 📕
- Deep Learning for Coders with fastai and PyTorch 📕 - Python
- Deep Learning: Coursera 🏫 - Python
- Computer Vision Tutorial: Kaggle 🏫 - Python
- Deep Learning with Tensorflow 📕 - Python
- Deep Learning with R 📕 - R
3.12 Model Interpretability
You will often be asked to help explain how a particular machine learning model came up with its prediction. Knowing how to leverage various interpretability frameworks helps decode the black box of these models for better adoption by non-technical business partners and gives you a better understanding of which features have the most impact in your model.
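One widely used idea for measuring which features matter is permutation importance: shuffle one feature's values and see how much worse the predictions get. The sketch below is a simplified, assumption-laden version in plain Python. The "model" is a hypothetical stand-in for something already trained, the data is made up, and real interpretability frameworks typically repeat the shuffle several times and offer many other techniques.

```python
import random


def mean_absolute_error(model, rows, targets):
    """Average size of the prediction errors."""
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, feature_index, seed=0):
    """Increase in error after shuffling one feature's column.

    A large increase suggests the model relies on that feature; an
    increase near zero suggests the model barely uses it.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    baseline = mean_absolute_error(model, rows, targets)
    column = [r[feature_index] for r in rows]
    rng.shuffle(column)
    shuffled_rows = [
        r[:feature_index] + (value,) + r[feature_index + 1:]
        for r, value in zip(rows, column)
    ]
    return mean_absolute_error(model, shuffled_rows, targets) - baseline


def model(row):
    """Hypothetical trained model that only uses the first feature."""
    return 3.0 * row[0]


# Made-up data: two features per row, target driven only by the first one.
rows = [(float(i), float(i % 3)) for i in range(10)]
targets = [3.0 * r[0] for r in rows]

importance_used = permutation_importance(model, rows, targets, 0)
importance_ignored = permutation_importance(model, rows, targets, 1)
print(importance_used, importance_ignored)
```

Because the stand-in model ignores the second feature entirely, shuffling it changes nothing, while shuffling the first feature makes the errors grow.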
3.13 AI Ethics and Fairness
With great power comes great responsibility. As machine learning becomes more ingrained in our society, the ethical consequences of poorly deployed models will only increase. Make sure you are building models that help enrich a diverse and inclusive future by checking out the resources below.
3.14 Web Apps
Building user interfaces that bring machine learning models directly to the end user, with no code required on their side, can be a total game changer for your business partners. You don’t have to be a web developer to build applications that your users will love, thanks to some amazing packages within the data science community. Check them out below.
3.15 Production on Azure
One of the harder aspects of machine learning is getting your work into a production environment to run at scale. This involves loading models to run on a cloud platform like Microsoft Azure.
3.15.2 General Data Analytics
- Azure Synapse 📃
- Azure Databricks 📃
- Spark 📕 - Python
- Spark 📕 - R