Category "nlp"

spaCy: train NER using multiprocessing

I am trying to train a custom NER model using spaCy. Currently, I have more than 2k records for training, and each text consists of at least 100 words
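
spaCy's training loop itself runs in a single process (the n_process argument to nlp.pipe parallelizes inference, not training), so a plain v3 update loop is the usual starting point. A minimal sketch, assuming spaCy v3; the toy TRAIN_DATA stands in for the ~2k real records:

```python
import spacy
from spacy.training import Example

# Toy data standing in for the real ~2k records.
TRAIN_DATA = [
    ("Apple is looking at buying a U.K. startup", {"entities": [(0, 5, "ORG")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for start, end, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(10):
    losses = {}
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer, losses=losses)
    print(epoch, losses)
```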

Tokenizing an HTML document

I have an HTML document and I'd like to tokenize it using spaCy while keeping each HTML tag as a single token. Here's my code: import spacy from spacy.symbols impo
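
One way to keep each tag whole is to split the text on a tag regex yourself and build the Doc from the pieces. A sketch, assuming tags match a simple `</?name ...>` pattern:

```python
import re
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
tag_re = re.compile(r"</?[A-Za-z][^>]*>")

def tokenize_keeping_tags(text):
    words, last = [], 0
    for m in tag_re.finditer(text):
        # Tokenize the plain text between tags with the normal tokenizer.
        words.extend(t.text for t in nlp.tokenizer(text[last:m.start()]) if not t.is_space)
        words.append(m.group())  # the whole tag becomes one token
        last = m.end()
    words.extend(t.text for t in nlp.tokenizer(text[last:]) if not t.is_space)
    return Doc(nlp.vocab, words=words)

doc = tokenize_keeping_tags("<p>Hello <b>world</b>!</p>")
print([t.text for t in doc])
# ['<p>', 'Hello', '<b>', 'world', '</b>', '!', '</p>']
```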

Embedding 3D data in Pytorch

I want to implement character-level embedding. This is the usual word embedding. Word Embedding Input: [ ['who', 'is', 'this'
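
For reference, nn.Embedding already handles the 3D case: it accepts an index tensor of any shape and appends the embedding dimension, so a (batch, words, chars) tensor of character ids embeds directly. A minimal sketch with made-up sizes:

```python
import torch
import torch.nn as nn

# Character-id tensor: (batch, words_per_sentence, chars_per_word).
char_ids = torch.randint(1, 50, (2, 5, 8))

char_emb = nn.Embedding(num_embeddings=50, embedding_dim=16, padding_idx=0)
out = char_emb(char_ids)           # (2, 5, 8, 16): one vector per character
# Pool over the character axis to get one vector per word:
word_vecs = out.max(dim=2).values  # (2, 5, 16)
print(out.shape, word_vecs.shape)
```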

Tensorflow-addons seq2seq - start and end tokens in BaseDecoder or BasicDecoder

I am writing code inspired by https://www.tensorflow.org/addons/api_docs/python/tfa/seq2seq/BasicDecoder. In the translation/generation we instantiate a Basic
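
In the tfa.seq2seq API, start_tokens and end_token are not constructor arguments of BasicDecoder; they are passed when the decoder is called, and forwarded to the sampler's initialize. A rough sketch with made-up vocabulary sizes and token ids, assuming tensorflow-addons is installed:

```python
import tensorflow as tf
import tensorflow_addons as tfa

vocab_size, emb_dim, units, batch_size = 100, 8, 16, 2
embeddings = tf.random.normal([vocab_size, emb_dim])  # stand-in embedding matrix
cell = tf.keras.layers.LSTMCell(units)
sampler = tfa.seq2seq.GreedyEmbeddingSampler()        # feeds back its own predictions
decoder = tfa.seq2seq.BasicDecoder(
    cell, sampler, output_layer=tf.keras.layers.Dense(vocab_size),
    maximum_iterations=10)

start_tokens = tf.fill([batch_size], 1)  # assumed <start> id
end_token = 2                            # assumed <end> id
initial_state = cell.get_initial_state(batch_size=batch_size, dtype=tf.float32)

# start_tokens/end_token go to the decoder *call*, not the constructor.
outputs, state, lengths = decoder(
    embeddings, start_tokens=start_tokens, end_token=end_token,
    initial_state=initial_state)
print(outputs.sample_id.shape)  # (batch, <=10 decoded steps)
```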

assertion failed: [Condition x == y did not hold element-wise:]

I have built a BiLSTM model with an attention layer for a sentence classification task, but I am getting an error that my assertion has failed due to a mismatch in n

TFA BeamSearchDecoder Clarification Request

If the question seems too dumb, it is because I am new to TensorFlow. I was implementing a toy encoder-decoder problem using TensorFlow 2's TFA seq2seq imp

Read GloVe pre-trained embeddings into R, as a matrix

Working in R. I know the pre-trained GloVe embeddings (e.g., "glove.6B.50d.txt") can be found here: https://nlp.stanford.edu/projects/glove/. However, I've had
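
The file format is simply one word followed by its vector per line, so it parses with a plain line loop in any language; here is that logic as a Python sketch (the R version follows the same read-split-stack pattern), assuming the file sits in the working directory:

```python
import numpy as np

# Parse "glove.6B.50d.txt": each line is "<word> <50 floats>".
words, vecs = [], []
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        words.append(parts[0])
        vecs.append(np.asarray(parts[1:], dtype=np.float32))

mat = np.vstack(vecs)                        # (vocab_size, 50) matrix
index = {w: i for i, w in enumerate(words)}  # word -> row lookup
print(mat.shape, mat[index["the"]][:5])
```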

Huggingface distilbert-base-uncased-finetuned-sst-2-english runs out of RAM with only a few KB?

My dataset is only 10 thousand sentences. I run it in batches of 100, and clear the memory on each run. I manually slice the sentences to only 50 characters. Af
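
A common cause here is running inference without torch.no_grad(), which keeps the autograd graph of every batch alive. A sketch of memory-friendly batched inference with this checkpoint (the sentence data is made up):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

sentences = ["a great movie", "a terrible movie"] * 50  # stand-in data
preds = []
with torch.no_grad():  # don't retain activations/grad graphs across batches
    for i in range(0, len(sentences), 100):
        batch = tokenizer(sentences[i:i + 100], truncation=True,
                          padding=True, return_tensors="pt")
        logits = model(**batch).logits
        preds.extend(logits.argmax(dim=-1).tolist())
print(len(preds))
```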

ValueError: The first argument to `Layer.call` must always be passed

I was trying to build a model with the Sequential API (it has already worked for me with the Functional API). Here is the model that I'm trying to build in Sequ
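
One frequent trigger for this ValueError is placing a layer that needs several inputs into Sequential, which only supports a single-input chain; multi-input layers want the Functional API. A small sketch illustrating the contrast (shapes are arbitrary):

```python
import tensorflow as tf

# Sequential works for a single-input chain of layers:
seq = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(1),
])

# A layer that takes several inputs (e.g. Attention) needs the Functional API:
query = tf.keras.Input(shape=(10, 64))
value = tf.keras.Input(shape=(10, 64))
context = tf.keras.layers.Attention()([query, value])
model = tf.keras.Model(inputs=[query, value], outputs=context)
model.summary()
```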

AttributeError: 'ArabertPreprocessor' object has no attribute 'farasa_segmenter'

I had this error while using AraBERT: from arabert.preprocess import ArabertPreprocessor model_name = "bert-base-arabertv2" arabert_prep = ArabertPreprocessor(

The best and simplest way to convert labeled text classification data to spaCy v3 format

Let's suppose we have labeled data for text classification in a nice CSV file. We have 2 columns - "text" and "label". I am kind of struggling to understand spa
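
For text classification, the usual v3 route is to serialize Docs with doc.cats set into a DocBin and train from the resulting .spacy file. A minimal sketch, assuming a CSV with "text" and "label" columns and single-label data:

```python
import pandas as pd
import spacy
from spacy.tokens import DocBin

df = pd.read_csv("train.csv")  # assumed columns: "text", "label"
labels = sorted(df["label"].unique())

nlp = spacy.blank("en")
db = DocBin()
for text, label in zip(df["text"], df["label"]):
    doc = nlp.make_doc(str(text))
    # textcat expects a score for every label; 1.0 for the gold label.
    doc.cats = {l: float(l == label) for l in labels}
    db.add(doc)
db.to_disk("./train.spacy")
# then: python -m spacy train config.cfg --paths.train ./train.spacy
```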

How to extract relation between entities for stock prediction

I am trying to extract the relation between two entities (entity1 - relation - entity2) from news articles for stock prediction. I have used NER for entity extraction
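
As a baseline, the dependency parse can yield (subject, verb, object) triples between mentions; anything production-grade for stock prediction would need more (coreference, patterns, or a trained relation classifier). A naive sketch, assuming en_core_web_sm is installed, with an invented example sentence:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Microsoft acquired Nuance for $19.7 billion.")

# Naive pattern: subject - verb - object from the dependency parse.
for tok in doc:
    if tok.pos_ == "VERB":
        subj = [w for w in tok.lefts if w.dep_ in ("nsubj", "nsubjpass")]
        obj = [w for w in tok.rights if w.dep_ in ("dobj", "obj", "attr")]
        if subj and obj:
            print(subj[0].text, "-", tok.lemma_, "-", obj[0].text)
# e.g. Microsoft - acquire - Nuance
```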

Asking gpt-2 to finish sentence with huggingface transformers

I am currently generating text from left context using the example script run_generation.py of the huggingface transformers library with gpt-2: $ python transf
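
Outside run_generation.py, the same completion is a few lines with the text-generation pipeline; a sketch (the prompt and sampling settings are arbitrary, and max_new_tokens needs a reasonably recent transformers version):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("The meaning of life is",
                max_new_tokens=20, num_return_sequences=1,
                do_sample=True, top_k=50)
print(out[0]["generated_text"])
```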

Which algorithm does Google Keyboard use for automatic suggestions (personal vocab included)?

I am confused, since Google cannot train their text generation models on each individual's personal vocabulary. I was trying to develop something similar but

How to train a model in SageMaker Studio with .train and .test extension dataset files?

I'm trying to implement ML models with Amazon SageMaker Studio; the thing is that the model I want to implement is from Hugging Face and it uses a Dataset
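
If the Hugging Face datasets library is in play, its loaders do not care about the .train/.test extensions; the files can be mapped onto named splits explicitly. A sketch with assumed file paths:

```python
from datasets import load_dataset

# The "text" loader ignores the file extension, so .train/.test files
# can be assigned to splits directly:
dataset = load_dataset(
    "text",
    data_files={"train": "data/corpus.train", "test": "data/corpus.test"},
)
print(dataset)  # DatasetDict with "train" and "test" splits
```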

Counting number of co-occurrences of words for a specified vocabulary and within a specified radius?

I have a vocabulary V = ["anarchism", "originated", "term", "abuse"] and a list of words test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'fi
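
A direct way is to slide over the list and count pairs whose members are both in V; a sketch using the question's data and an assumed radius of 3:

```python
from collections import Counter

V = ["anarchism", "originated", "term", "abuse"]
test = ["anarchism", "originated", "as", "a", "term", "of", "abuse"]
radius = 3  # assumed window size

vocab = set(V)
counts = Counter()
for i, w in enumerate(test):
    if w not in vocab:
        continue
    # Words within `radius` positions on either side of w.
    window = test[max(0, i - radius):i] + test[i + 1:i + 1 + radius]
    counts.update((w, other) for other in window if other in vocab)

print(counts)  # e.g. ('originated', 'term'): 1, ...
```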

How to extract only English words from a big text corpus using nltk?

I want to remove all non-dictionary English words from a text corpus. I have removed stopwords, tokenized, and count-vectorized the data. I need to extract only the E
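
One common approach is membership testing against NLTK's words corpus; note the wordlist lacks many inflected forms, so lemmatizing first helps. A sketch with made-up tokens:

```python
import nltk
nltk.download("words", quiet=True)
from nltk.corpus import words

english = set(w.lower() for w in words.words())
tokens = ["the", "apple", "asdfgh", "table"]  # stand-in tokenized data
only_english = [t for t in tokens if t.lower() in english]
print(only_english)  # tokens not in the wordlist are dropped
```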

Multilingual NLTK for POS Tagging and Lemmatizer

I recently started working with NLP and tried NLTK and TextBlob for analyzing texts. I would like to develop an app that analyzes reviews made by traveler
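
NLTK's default tagger and WordNet lemmatizer are English-centric; one commonly suggested alternative is spaCy's pretrained pipelines, which cover Italian and many other languages. A sketch, assuming the Italian model has been downloaded (python -m spacy download it_core_news_sm):

```python
import spacy

nlp = spacy.load("it_core_news_sm")
doc = nlp("Le recensioni dei viaggiatori sono molto utili.")
for tok in doc:
    # POS tag and lemma per token, from the Italian pipeline.
    print(tok.text, tok.pos_, tok.lemma_)
```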

NLTK agreement with distance metric

I have a task to calculate inter-annotator agreement in multi-label classification, where for each example more than one label can be assigned. I found that NLT
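
NLTK's AnnotationTask accepts a custom distance function, and masi_distance over frozensets is the usual choice for multi-label agreement. A sketch with invented annotations from two coders:

```python
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics import masi_distance

# (coder, item, labels) triples; frozensets allow multi-label items.
data = [
    ("c1", "item1", frozenset(["sports", "politics"])),
    ("c2", "item1", frozenset(["sports"])),
    ("c1", "item2", frozenset(["tech"])),
    ("c2", "item2", frozenset(["tech"])),
]
task = AnnotationTask(data=data, distance=masi_distance)
print(task.alpha())  # Krippendorff's alpha with the MASI set distance
```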

HTTP Error 403 in Python 3 when web scraping publications

This is the traceback of the error that occurs when I try to fetch the URL of the publication. It works for regular websites such as Stack Overflo
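
Publisher sites often return 403 for the default Python user agent; sending a browser-like User-Agent header is the usual first fix (some sites still block scraping regardless, so check their terms). A sketch with a placeholder URL:

```python
import urllib.request

url = "https://example.com/some-paper"  # placeholder URL
req = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")
print(html[:200])
```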