Transformers and BERT: downloading to your local machine
I am trying to replicate the code from this page.
At my workplace we have access to the transformers and PyTorch libraries, but we cannot connect to the internet from our Python environment. Could anyone help with how we could get the script working after manually downloading the files to my machine?
My specific questions are:
Should I go to the bert-base-uncased at main page and download all the files? Do I have to put them in a folder with a specific name?
How should I change the code below?
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)
How should I change the code below?
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
output_hidden_states = True, # Whether the model returns all hidden-states.
)
Please let me know if anyone has done this. Thanks!
###update1
I went to the link, manually downloaded all the files to a folder, and specified the path of that folder in my code. The tokenizer works, but this line fails:

model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

Any idea what I should do? I noticed that the 4 big files have very strange names when downloaded... should I rename them to the same names as shown on the above page? Do I need to download any other files?
The error message is: OSError: Unable to load weights from pytorch checkpoint file for 'bert-base-uncased2/' at 'bert-base-uncased/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
Solution 1:[1]
clone the model repo to download all the files:

git lfs install
git clone https://huggingface.co/bert-base-uncased

# if you want to clone without the large files – just their pointers –
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/bert-base-uncased
git usage:
download git from here: https://git-scm.com/downloads
paste these into your CLI (terminal):
a. git lfs install
b. git clone https://huggingface.co/bert-base-uncased
wait for the download; it will take time. You can monitor your network traffic if you want to check progress.
find the current directory by simply typing cd into your CLI (on Windows; use pwd on Linux/macOS) and get the file path (e.g. "C:/Users/........./bert-base-uncased")
use it as:
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("C:/Users/........./bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("C:/Users/........./bert-base-uncased")
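As an aside, recent versions of transformers also honor an environment variable that forbids any network access, so a missing local file fails fast instead of hanging on a connection attempt. A minimal sketch (confirm that your installed version supports this variable):

```python
import os

# Tell transformers/huggingface_hub to never hit the network.
# This must be set before the library is imported.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
```

With this set, from_pretrained will only look at local paths and the local cache.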
Manual download, without git:
Download all the files from here https://huggingface.co/bert-base-uncased/tree/main
Put them in a folder named "yourfoldername"
use it as:
model = BertModel.from_pretrained("C:/Users/........./yourfoldername")
tokenizer = BertTokenizer.from_pretrained("C:/Users/........./yourfoldername")
For only model(manual download, without git):
just click the download button here and download only the PyTorch pretrained model (about 420 MB): https://huggingface.co/bert-base-uncased/blob/main/pytorch_model.bin
download the config.json file from here: https://huggingface.co/bert-base-uncased/tree/main
put both of them in a folder named "yourfoldername"
use it as:

model = BertModel.from_pretrained("C:/Users/........./yourfoldername")
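On the rename question above: from_pretrained looks for the standard file names shown on the model page, so oddly named downloads do need to be renamed. A small standard-library sketch to sanity-check a folder before loading (check_model_folder and the folder path are illustrative, not part of transformers):

```python
import os

# File names BertModel.from_pretrained / BertTokenizer.from_pretrained expect
# to find in a local folder (names must match exactly; rename downloads if needed).
REQUIRED = ["config.json", "vocab.txt", "pytorch_model.bin"]

def check_model_folder(path):
    """Return the list of required files missing from `path`."""
    return [name for name in REQUIRED if not os.path.isfile(os.path.join(path, name))]

missing = check_model_folder("bert-base-uncased")
if missing:
    print("missing files:", missing)
else:
    print("folder looks complete")
```

If only the model (not the tokenizer) is needed, vocab.txt can be dropped from the list.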
Solution 2:[2]
Answering "###update1" for the error: OSError: Unable to load weights from pytorch checkpoint file for 'bert-base-uncased2/' at 'bert-base-uncased/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
Please try these methods from https://huggingface.co/transformers/model_doc/bert.html:
from transformers import BertTokenizer, BertForMaskedLM
import torch
tokenizer = BertTokenizer.from_pretrained("C:/Users/........./bert-base-uncased")
model = BertForMaskedLM.from_pretrained("C:/Users/........./bert-base-uncased")
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
If this works, we know there is nothing wrong with the filesystem or folder names.
If it works, next try to get the hidden state. (Note that BertModel already returns the hidden state, as the docs explain: "The bare Bert Model transformer outputting raw hidden-states without any specific head on top." So you don't need output_hidden_states = True.)
from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained("C:/Users/........./bert-base-uncased")
model = BertModel.from_pretrained("C:/Users/........./bert-base-uncased")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
If this does not work, try to load the PyTorch checkpoint directly with one of these methods:
# Load all tensors onto the CPU
torch.load("C:/Users/........./bert-base-uncased/pytorch_model.bin", map_location=torch.device('cpu'))
# Load all tensors onto GPU 1
torch.load("C:/Users/........./bert-base-uncased/pytorch_model.bin", map_location=lambda storage, loc: storage.cuda(1))
If the torch.load call also fails, there is likely a version compatibility problem between your installed PyTorch (e.g. 1.4.0) and the released BERT checkpoint, or your pytorch_model.bin file did not download completely (check its size; it should be about 420 MB). Also check which Python versions your PyTorch release supports.
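A common cause of this exact OSError is cloning without git-lfs installed: pytorch_model.bin is then a tiny git-lfs pointer file rather than the real ~420 MB weights, and torch.load cannot parse it. A standard-library sketch to detect that case (is_lfs_pointer and the path are illustrative):

```python
def is_lfs_pointer(path):
    """True if `path` is a git-lfs pointer file instead of the real weights."""
    try:
        with open(path, "rb") as f:
            head = f.read(100)
    except OSError:
        return False
    # LFS pointer files are tiny text files that start with this line;
    # a real checkpoint is a large binary file.
    return head.startswith(b"version https://git-lfs.github.com/spec/v1")

print(is_lfs_pointer("bert-base-uncased/pytorch_model.bin"))
```

If this returns True, re-clone with git-lfs installed (or run git lfs pull inside the repo) to fetch the actual weights.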
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Ynjxsjmh
Solution 2 |