How To Train Your Graphics Card (To Read)
- Wicklow Hall 2A
- 180 minutes
This tutorial introduces new users to modern NLP using the open-source HuggingFace Transformers library. We'll use massive, pretrained language models that you might have heard of, like BERT and GPT, to solve real-world tasks. By the end of this tutorial you'll be able to classify documents, get your computer to read a text and answer questions about it, and even translate between languages!
- Tutorial
- PyData: Deep Learning, NLP, CV
Note: Despite the title, no graphics card is needed! We will be using Colab notebooks for most of the tutorial. You can also run the notebooks on your local machine, but if you do that you'll need to install git-lfs to download some of the models we use.
Most practical machine learning these days is "supervised learning". In supervised learning, we show a model a collection of example inputs and outputs, and train it to give the right output for each input. For example, we might show it pictures of animals, combined with a "label" for each picture like "cat" or "dog", in order to train it to identify which animal is in each photo. Or we could show it samples of text from Twitter posts, and give the tweets "labels" like "toxic" or "not toxic", in order to train it to identify unwanted tweets and filter them out automatically. In effect, the model learns to predict the correct "label" for any input that it sees.
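To make the idea concrete, here is a deliberately tiny sketch of supervised learning in plain Python. It is not the approach used in the tutorial, and everything in it is an assumption made for illustration: the hand-written examples, the `train` and `predict` helper names, and the word-counting "model" itself. It only shows the basic loop of learning from labelled examples and then predicting a label for unseen input.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-written training set of (input text, label) pairs.
TRAINING_DATA = [
    ("you are an idiot", "toxic"),
    ("i hate you so much", "toxic"),
    ("what a stupid idea", "toxic"),
    ("have a lovely day", "not toxic"),
    ("thanks for the great talk", "not toxic"),
    ("what a lovely idea", "not toxic"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.lower().split())
    return word_counts


def predict(word_counts, text):
    """Return the label whose training examples share the most words with `text`."""
    words = text.lower().split()
    scores = {
        label: sum(counts[word] for word in words)
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)


model = train(TRAINING_DATA)

# Inputs the model has never seen as whole sentences.
prediction_1 = predict(model, "you are stupid")
prediction_2 = predict(model, "lovely talk thanks")
print(prediction_1)
print(prediction_2)
```

With only six examples, a model like this fails on any sentence whose words it has not seen before, which is exactly the data problem described next.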
The golden rule in supervised learning is that the more data you have, the better the model you can train. More data generally means more accuracy, whether the task is recognizing animals in images, classifying text, or even driving a self-driving car. This is a real problem, though, when data collection isn't free: without a huge dataset of inputs and labels, it might be hard or impossible to train a model that's accurate enough for what you want it to do.
Probably the single biggest revolution in machine learning in the last 5 years, particularly in NLP (natural language processing), has been the arrival of "foundation models": huge models trained for very long periods on vast amounts of text data. These models offer a solution to the problem of limited training data. By bringing a huge amount of linguistic prior knowledge with them, they greatly reduce the amount of data needed to learn a new task. In 2016, training a model to classify toxic comments might have required millions (or even tens of millions!) of examples and labels in order to achieve acceptable accuracy, but in 2022, we can start with a foundation model that already "knows" a lot about language, and achieve the same accuracy with a tiny fraction of that data, and in a much shorter time, too!
Foundation models can be intimidating, though. They're often created by industrial or academic research labs and published in papers that can be impenetrable for people without a strong research background. In this tutorial, we'll show you how to abstract away that complexity and load, train and use foundation models without needing a PhD, or even any prior experience in machine learning! By the end of this 3-hour session, you should have the knowledge and code samples you need to train a better machine learning model than someone at the cutting edge of the field in 2016 could have achieved even with an entire research team.
In this tutorial, we will use HuggingFace Transformers combined with the TensorFlow machine learning library. We will also use some of the most popular data science libraries in Python, like NumPy and pandas, when preparing our data. You don't have to be familiar with any of these before attending the tutorial, and I'll do my best to explain what we're doing with them as we go! I don't assume any specific background in machine learning, and we won't need any advanced mathematics. I will, however, assume that you're reasonably fluent in Python!