In this notebook I will use all the knowledge acquired in the previous notebooks!
In this solution I will be using:
Transformers
HuggingFace
Preprocessing
Tensorflow
EarlyStopping
and more...
Remember that this belongs to an NLP notebook series where I am learning and testing different NLP approaches in this competition, like NN, Embeddings, RNN, Transformers, HuggingFace, etc.
To see the other notebooks visit: https://www.kaggle.com/code/diegomachado/seqclass-nn-embed-rnn-lstm-gru-bert-hf
Libraries
# A dependency of the preprocessing for BERT inputs
!pip install -q -U "tensorflow-text==2.8.*"
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import gc
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization, Lambda
from tensorflow.keras import layers
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.keras import losses
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard, ReduceLROnPlateau
# import tensorflow_hub as hub
# import tensorflow_text as text  # BERT preprocess uses this
from tensorflow.keras.optimizers import Adam
import re
import nltk
from nltk.corpus import stopwords
import string
from gensim.models import KeyedVectors

nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
HuggingFace Model
# Try with a large model
model_name = "bert-large-uncased"
from transformers import AutoTokenizer

# Proper tokenization for the chosen checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=True)
tokenizer_max_length = 161

def tokenize_dataset(data):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(data["text"], truncation=True, padding=True, max_length=tokenizer_max_length)
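As a quick, purely illustrative check (not part of the original notebook), the tokenizer can be called on a single string to see the fields that tokenize_dataset will add per example:

# Illustrative only: inspect what the tokenizer returns for one example
sample = tokenizer("example tweet about a wildfire", truncation=True, max_length=tokenizer_max_length)
print(sample.keys())        # input_ids, token_type_ids, attention_mask for a BERT tokenizer
print(sample["input_ids"])  # starts with the [CLS] id and ends with the [SEP] id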
Data Pre-processing
I will try a preprocessing approach that I found in another notebook. (I lost the link to that notebook, sorry! If someone finds it, please let me know in the comments!)
# Some preprocessing
train = pd.read_csv("/kaggle/input/df-split/df_split/df_train.csv")
test = pd.read_csv("/kaggle/input/df-split/df_split/df_test.csv")
import spacy
import re

nlp = spacy.load('en_core_web_sm')

def preprocessing(text):
    text = text.replace('#', '')
    text = decontracted(text)  # decontracted (not defined in this notebook) expands contractions; see the sketch below
    text = re.sub(r'\S*@\S*\s?', '', text)
    text = re.sub(r'http[s]?:(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    # token = []
    # result = ''
    # text = re.sub('[^A-z]', ' ', text.lower())
    # text = nlp(text)
    # for t in text:
    #     if not t.is_stop and len(t) > 2:
    #         token.append(t.lemma_)
    # result = ' '.join([i for i in token])
    return text.strip()
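The function above calls decontracted, which is never defined in this notebook (it came from the lost reference notebook). A minimal sketch of such a contraction-expansion helper, with an assumed set of replacements, might look like this:

import re

def decontracted(text):
    # Assumed implementation: expand common English contractions before further cleaning
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)
    return text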
train.text = train.text.apply(lambda x: preprocessing(x)).astype(str)
test.text = test.text.apply(lambda x: preprocessing(x)).astype(str)
# Save processed data to disk
NEW_TRAIN_PATH = "preprocessed_train.csv"
NEW_TEST_PATH = "preprocessed_test.csv"
train.to_csv(NEW_TRAIN_PATH, index=False)
test.to_csv(NEW_TEST_PATH, index=False)
del train
del test
gc.collect()
856
HF Dataset
# Now we can use HF Datasets to load our data
from datasets import load_dataset

data_files = {"train": NEW_TRAIN_PATH, "test": NEW_TEST_PATH}
dataset = load_dataset("csv", data_files=data_files, usecols=['text', 'target'])
Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-fc2dc1866b45d737/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-fc2dc1866b45d737/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.
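The cell that tokenizes this dataset and turns it into tf.data pipelines for Keras is not shown above (only its logs are). A sketch of how that step is typically done with Dataset.map and to_tf_dataset follows; the column list and batch size here are assumptions, not the notebook's exact settings:

# Tokenize every example; the returned keys become new columns in the dataset
tokenized = dataset.map(tokenize_dataset, batched=True)

# Convert to tf.data.Dataset objects that Keras can consume (batch size assumed)
tf_train = tokenized["train"].to_tf_dataset(
    columns=["input_ids", "token_type_ids", "attention_mask"],
    label_cols=["target"],
    shuffle=True,
    batch_size=16,
)
tf_test = tokenized["test"].to_tf_dataset(
    columns=["input_ids", "token_type_ids", "attention_mask"],
    label_cols=["target"],
    shuffle=False,
    batch_size=16,
)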
All model checkpoint layers were used when initializing TFBertForSequenceClassification.
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-e12e02ba714b9048/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-e12e02ba714b9048/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.
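The cells that build, compile, and fine-tune the model are also not included above, only their logs. Here is a sketch consistent with those logs and with the summary below: a TFBertForSequenceClassification head on bert-large-uncased, a reduced learning rate, and EarlyStopping that restores the best weights. The exact learning rate, epochs, patience, and the tf_train/tf_test names (from the sketch above) are assumptions:

from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# Pretrained encoder with a fresh classification head (hence the "newly initialized" log above)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A learning rate well below the Keras default; no loss is passed, so the model's
# internal loss is used (matching the "No loss specified in compile()" message)
model.compile(optimizer=Adam(learning_rate=2e-5), metrics=["accuracy"])

early_stop = EarlyStopping(
    monitor="val_loss",
    patience=2,
    restore_best_weights=True,  # restore the best epoch's weights, as noted in the summary
)

history = model.fit(
    tf_train,
    validation_data=tf_test,
    epochs=5,
    callbacks=[early_stop],
)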
Here is a summary of what I use and add:
- I pre-process the text
- I use a large model from HuggingFace
- I fine-tune the model "bert-large-uncased"
- I decrease the learning rate (I think that was actually a clever move!)
- I restore the best weights with Early Stopping
Nice, I think that was a long and entertaining study journey! We achieved a good score and, furthermore, we learned a lot!