TaBERT Code Analysis Notes

0. Foreword

classtransformers.BertTokenize

offical document: https://huggingface.co/transformers/v3.0.2/model_doc/bert.html

tokenize-demo:

1
2
3
4
5
6
7
8
9
10
11
12
from transformers import BertTokenizer


def test_tokenize(raw_str: str):
tokens = tokenizer.tokenize(raw_str)
print('raw_str: {}, tokenized: {}'.format(raw_str, tokens))


if __name__ == '__main__':
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
test_tokenize('United States')
test_tokenize('21,439,453')
1
2
raw_str: United States, tokenized: ['united', 'states']
raw_str: 21,439,453, tokenized: ['21', ',', '43', '##9', ',', '45', '##3']

1.

1
2
3
4
def tokenize(self, tokenizer: BertTokenizer):



1
2
3
tensor_dict:


1
2
def get_row_input()


TaBERT Code Analysis Notes
https://www.hardyhu.cn/2023/11/15/TaBERT-Code-Analysis-Notes/
Author
John Doe
Posted on
November 15, 2023
Licensed under