I had the pleasure of running a workshop on Weak Supervision at Crunch Conference in Budapest on 5/10/2022. Let me share a summary of the workshop here:
With simple and efficient out-of-the-box machine learning APIs, finetuning and deploying machine learning models has never been easier. For many companies, the larger challenges are understanding the goalposts of machine learning projects and coping with the lack of labelled data. Weak supervision can help with:
- labelling data more efficiently
- finetuning your models on noisy labelled data.
The workshop used skweak, a spaCy-based weak supervision library, to demonstrate how to use labelling functions to generate noisy labelled data.
Here’s an example skweak labelling function:
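from skweak.base import SpanAnnotator

class MoneyDetector(SpanAnnotator):
    def __init__(self):
        # Register the annotator under the name "money_detector"
        super(MoneyDetector, self).__init__("money_detector")

    def find_spans(self, doc):
        # Start at the second token so that tok.nbor(-1) always exists
        for tok in doc[1:]:
            if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
                # Span covers the currency token and the number (end is exclusive)
                yield tok.i-1, tok.i+1, "MONEY"

money_detector = MoneyDetector()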
This labelling function marks every token that starts with a digit and is preceded by a currency symbol, and labels the two tokens together as MONEY.
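To see what the rule fires on without installing spaCy or skweak, here is a minimal dependency-free sketch of the same heuristic. The regex tokenizer, the `CURRENCY_SYMBOLS` set and the `find_money_spans` helper are illustrative stand-ins (assumptions), not part of skweak, and spaCy's `is_currency` attribute recognises more symbols than the four listed here.

```python
import re

# Assumption: a small stand-in for spaCy's is_currency token attribute
CURRENCY_SYMBOLS = {"$", "€", "£", "¥"}

def tokenize(text):
    # Split currency symbols off numbers and keep numbers like 99.50 together
    return re.findall(r"[$€£¥]|\d+(?:[.,]\d+)*|\w+|[^\w\s]", text)

def find_money_spans(tokens):
    """Yield (start, end, label) with an exclusive end, like find_spans."""
    for i in range(1, len(tokens)):
        if tokens[i][0].isdigit() and tokens[i - 1] in CURRENCY_SYMBOLS:
            yield i - 1, i + 1, "MONEY"

tokens = tokenize("The ticket costs $ 250 but 3 people paid €99.50 each.")
spans = list(find_money_spans(tokens))
print(spans)
```

The bare "3" is skipped because the token before it is not a currency symbol, which is exactly the behaviour the labelling function relies on.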
skweak allows you to combine multiple labelling functions using spaCy attributes or other methods.
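The idea of combining several noisy labelling functions can be illustrated with a plain per-token majority vote. This is only a sketch: skweak ships its own aggregation models, and the `majority_vote` helper and the example label lists below are hypothetical and not taken from the library.

```python
from collections import Counter

ABSTAIN = None  # a labelling function abstains when it has no opinion

def majority_vote(label_sequences):
    """Pick the most frequent non-abstaining label for each token."""
    aggregated = []
    for votes in zip(*label_sequences):
        counts = Counter(v for v in votes if v is not ABSTAIN)
        if not counts:
            aggregated.append(ABSTAIN)
            continue
        # On a tie, the label voted first (earlier labelling function) wins
        aggregated.append(counts.most_common(1)[0][0])
    return aggregated

tokens = ["Pay", "$", "250", "to", "Acme"]
# Hypothetical outputs of four labelling functions, one label per token
lf_money = [None, "MONEY", "MONEY", None, None]
lf_numbers = [None, None, "CARDINAL", None, None]
lf_symbols = [None, "MONEY", "MONEY", None, None]
lf_orgs = [None, None, None, None, "ORG"]

labels = majority_vote([lf_money, lf_numbers, lf_symbols, lf_orgs])
print(list(zip(tokens, labels)))
```

A majority vote treats every labelling function as equally reliable; a learned aggregation model can instead weight the functions by how often they agree with each other.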
Using labelling functions has a number of advantages:
- 💪 larger coverage: a single labelling function can cover many samples
- 🤓 involving experts: domain expert annotation is expensive, while labelling functions written by domain experts are more economical because of their coverage
- 🌬️ adapting to changing domains: labelling functions and data assets can be updated as the domain changes
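Coverage, the first point in the list above, is easy to measure: it is the share of samples on which a labelling function does not abstain. The sketch below assumes a hypothetical document-level labelling function `mentions_money` and a handful of made-up samples.

```python
import re

def mentions_money(text):
    """Hypothetical labelling function: return a label, or None to abstain."""
    return "MONEY" if re.search(r"[$€£]\s?\d", text) else None

def coverage(labelling_function, samples):
    """Fraction of samples on which the labelling function does not abstain."""
    if not samples:
        return 0.0
    fired = sum(labelling_function(s) is not None for s in samples)
    return fired / len(samples)

samples = [
    "Invoice total: $1,200",
    "Meeting moved to Friday",
    "Refund of €45 issued",
    "Budget is 300 units",
]
money_coverage = coverage(mentions_money, samples)
print(f"coverage: {money_coverage:.0%}")
```

Tracking this number per labelling function shows quickly which rules are worth an expert's time and which ones fire too rarely to matter.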
Example Kaggle Notebook applying skweak to smoker vs. non-smoker detection. To run the notebook, you will need to verify your phone number and turn on the Internet connection setting on Kaggle.