Today I will be looking at the paper Integrating Graph Contextualized Knowledge into Pre-trained Language Models a biomedical NLP paper about integrating free text and a knowledge graph for entity typing and relation extraction.
In the biomedical domain supervised text training data can be rare while large strutcured knowledge graphs are widely available. Using knowledge graphs to improve BioNLP tasks seemed like an interesting idea.
The paper describes BERT-Medical Knowledge combining
- BioBERT, BERT language model built with open acess Pubmed abstracts and articles
- BERT model for UMLS knowledge graph embeddings initialised with TransE graph embeddings
BERT-Medical Knowledge architecture
Transformer on the left is BioBERT, Transformer on the right is BERT model for knowledge graph embeddings initialised with TransE graph embeddings. I
How are knowledge graphs used with a Transformer?
- Create a node sequence from an arbitrary subgraph
- Transformer based encoder to encode node sequences
- Reconstruct triples using the translating based scoring function
- Combine node embeddings with Biobert embeddings
###Create a node sequence from an arbitrary subgraph
As you can see from the above figure you create a sequence where all nodes that have a directed edge
s1 -> r1 -> o1 will be translated into a sequence
(s1)(r1)(o1) as you can see relations are jointly learnt as node embeddings. Using the above approach if there are multiple inbound edges then all inbound sequences will be placed before the node it set. So if we also have
s2 -> r2 -> o1 then the combined output sequence will be
In addition to the node sequences the below are created:
- a node position index P where each row represents a triple
- adjacency matrix A where each value represents and edge between two nodes
Transformer based encoder for node sequences
Seems like a standard Transformer encoder setup with an addition of limiting attention score to in-degree nodes and current nodes.
Concatenate the attention weights
Which are the softmax of the attention score scaled by a constant:
Calculate attention score for degree-in nodes and current node:
Training objective: Triple Reconstruction
Based on the translation based scoring function reconstruct triples:
The Bernoulli trick penalises highly connected head or tail nodes so they are more likely to be corrupted. This is based on the assumption that corrupting one of these nodes is more likely to be a true negative rather than corrupting sparsesly connected nodes which could be unknown false negatives.
This is the formula used to decide the probability to corrupt a triple with the Bernoulli trick:
Average number of tails per head - tph
Average number of heads per tail - hpt
Head gets corrupted based on tph / (tph + hpt)
Tail gets corrupted based on hpt / (tph + hpt)
Combining node & BioBERT embeddings
It is not clear from the paper how node and BioBERT embeddings are combined. The architecture diagram just shows an
Information Fusion step combining
Multi headed attention steps based on the embedding input. Based on this my basic assumption would be the multi headed attention output of the embeddings is concatenated as that would be the most basic form of
Combining BioBERT and ERNIE finetuning the combined model achieves good results as seen below:
Would be interesting to see this technique applied to other BioNLP tasks in the future. We can surely better leverage the large structured knowledge graphs to improve BioNLP
All images are from the paper