Had the pleasure of attending the Future of Data-Centric AI from Snorkel today.
The core ideas of data-centric AI:
- Data curation is the bottleneck; compute and models are commodities.
- To solve this problem, re-weight and combine programmable weak supervision signals.
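To make "re-weight and combine" a bit more concrete, here is a minimal sketch of a weighted vote over noisy labels. The weights are hand-picked assumptions for illustration only; as I understand it, Snorkel's label model estimates how much to trust each source from the agreements and disagreements between sources rather than taking the weights as input. The names `weighted_vote` and `ABSTAIN` are my own, not Snorkel's API.

```python
from collections import defaultdict

ABSTAIN = -1  # convention assumed here: a source with no opinion returns -1


def weighted_vote(votes, weights):
    """Combine one example's noisy votes into a single label.

    votes   -- list of labels, one per weak source (ABSTAIN to skip)
    weights -- list of non-negative trust scores, one per source
    Returns the label with the largest total weight, or ABSTAIN when
    every source abstained or the top score is tied.
    """
    if len(votes) != len(weights):
        raise ValueError("need one weight per vote")
    scores = defaultdict(float)
    for label, weight in zip(votes, weights):
        if label != ABSTAIN:
            scores[label] += weight
    if not scores:
        return ABSTAIN
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie: refuse to guess
    return ranked[0][0]


# Three weak sources vote on four examples (rows = examples).
label_matrix = [
    [1, 1, 0],
    [0, ABSTAIN, 0],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [1, 0, ABSTAIN],
]
weights = [0.9, 0.6, 0.7]  # hypothetical trust in each source

combined = [weighted_vote(row, weights) for row in label_matrix]
print(combined)
```

With these made-up weights, the first source outvotes the second on the last example even though they disagree one against one, which is the whole point of re-weighting rather than counting votes equally.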
I came across Snorkel a couple of years ago. It demonstrated a couple of cool ideas around programmable labels:
- thinking about programmable labels as labelling functions
- using spaCy linguistic features in these functions
Thinking in terms of noisy functions really helps me conceptualise distant supervision, and spaCy features are great with text.
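Here is a rough sketch of what "labels as noisy functions" can look like. It is plain Python, not Snorkel's actual API, and the spam/ham task, the function names, and the thresholds are all hypothetical. The regex and string checks stand in for linguistic features; in a real setup I would expect spaCy token attributes to take their place.

```python
import re

ABSTAIN, HAM, SPAM = -1, 0, 1  # assumed label convention for this sketch


def lf_contains_link(text):
    """Messages with a URL are often spam."""
    return SPAM if re.search(r"https?://", text) else ABSTAIN


def lf_shouting(text):
    """Two or more all-caps words suggest spam."""
    caps = [w for w in text.split() if len(w) > 2 and w.isupper()]
    return SPAM if len(caps) >= 2 else ABSTAIN


def lf_short_reply(text):
    """Very short messages are usually genuine replies."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELLING_FUNCTIONS = [lf_contains_link, lf_shouting, lf_short_reply]


def apply_lfs(texts, lfs=LABELLING_FUNCTIONS):
    """Return one row per text and one column per labelling function."""
    return [[lf(text) for lf in lfs] for text in texts]


texts = [
    "WIN a FREE phone at http://example.com today",
    "thanks, see you then",
    "Could you send me the slides from the talk when you get a chance?",
]
for text, row in zip(texts, apply_lfs(texts)):
    print(row, text)
```

Each function is allowed to be wrong and allowed to abstain. The last message gets no vote at all, which is normal: coverage comes from writing many such functions, not from making any single one perfect.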
Originally, Snorkel was an open-source project for programmatic data labelling. Today the team is focusing more and more on Snorkel Flow, a commercial offering to reduce the cost of data labelling.
As weak supervision becomes more and more popular, there are some interesting open-source tools out there for leveraging noisy data:
- Interactive Model Iteration with Weak Supervision and Pre-Trained Embeddings
- Interactive Weak Supervision
Hoping to share more about using them in the next couple of months!