Welcome to hashformers’s documentation!
✂️ hashformers
Hashtag segmentation is the task of automatically adding spaces between the words in a hashtag.
Hashformers is the current state of the art for hashtag segmentation. On average, hashformers is 10% more accurate than the second-best hashtag segmentation library (more details in the docs).
Hashformers is also language-agnostic: you can use it to segment hashtags not just in English, but also in any language with a GPT-2 model on the Hugging Face Model Hub.
✂️ Read the documentation
✂️ Segment hashtags on Google Colab
✂️ Follow the step-by-step tutorial
Basic usage
```python
from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    reranker_model_name_or_path="bert-base-uncased"
)

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold"
])

print(segmentations)

# [ 'we need a national park',
#   'ice cold' ]
```
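Because the segmenter is a causal language model loaded by name from the Hugging Face Model Hub, the same call works for other languages: simply swap in a GPT-2 checkpoint trained on the target language. The sketch below is illustrative rather than taken from the library's documentation; the Portuguese checkpoint name, the hashtag, and the expected output are assumptions, and the reranker is disabled here (see the Installation section below).

```python
from hashformers import TransformerWordSegmenter as WordSegmenter

# Any GPT-2 checkpoint on the Hugging Face Model Hub can act as the segmenter.
# "pierreguillou/gpt2-small-portuguese" is an illustrative Portuguese model.
ws = WordSegmenter(
    segmenter_model_name_or_path="pierreguillou/gpt2-small-portuguese",
    reranker_model_name_or_path=None  # reranker is optional; see Installation
)

segmentations = ws.segment(["#bomdiaatodos"])
print(segmentations)
# e.g. ['bom dia a todos']
```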
Installation
Hashformers is compatible with Python 3.7.
```
pip install hashformers
```
It is possible to use hashformers without a reranker:
```python
from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    reranker_model_name_or_path=None
)
```
If you want to use a BERT model as a reranker, you must install mxnet. Here we install hashformers together with mxnet-cu110, which is compatible with Google Colab. If you are installing in another environment, replace it with the mxnet package that matches your CUDA version.
```
pip install mxnet-cu110
pip install hashformers
```
Contributing
Pull requests are welcome! Read our paper for more details on the inner workings of our framework.
If you want to develop the library, you can install hashformers directly from this repository (or your fork):

```
git clone https://github.com/ruanchaves/hashformers.git
cd hashformers
pip install -e .
```
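After the editable install, a quick smoke test can confirm that the package imports and segments as expected. This is a minimal sketch (it assumes distilgpt2, the small model from the Speed table below, and no reranker), not an official test command:

```python
from hashformers import TransformerWordSegmenter as WordSegmenter

# Smoke test for a development install: a small segmenter model, no reranker.
ws = WordSegmenter(
    segmenter_model_name_or_path="distilgpt2",
    reranker_model_name_or_path=None
)

result = ws.segment(["#icecold"])
print(result)  # a list with one segmented hashtag, e.g. ['ice cold']
```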
Citation
```
@misc{rodrigues2021zeroshot,
      title={Zero-shot hashtag segmentation for multilingual sentiment analysis},
      author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},
      year={2021},
      eprint={2112.03213},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Evaluation
We provide a detailed evaluation of the accuracy and speed of the hashformers framework in comparison with alternative libraries.
Although models based on n-grams, such as ekphrasis, are orders of magnitude faster than hashformers, they are remarkably unstable across different domains. Research on word segmentation therefore usually tries to bring the best of both worlds together, combining deep learning with statistical methods to reach the best speed-accuracy trade-off.
Accuracy
In the table below, we compare hashformers with HashtagMaster (also known as "MPNR") and ekphrasis on five hashtag segmentation datasets.
HashSet-1 is a sample from the distantly supervised portion of the HashSet dataset. HashSet-2 is the lowercase version of HashSet-1, and HashSet-3 is the manually annotated portion of HashSet. More information on the datasets and their evaluation is available in the HashSet paper.
A script to reproduce the evaluation of ekphrasis is available at scripts/evaluate_ekphrasis.py.
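As a rough illustration of how such an accuracy number can be computed for ekphrasis, the sketch below scores exact matches against gold segmentations. It is not the repository script: the gold pairs are made up, exact-match scoring is assumed, and only ekphrasis's Twitter-corpus Segmenter is used.

```python
from ekphrasis.classes.segmenter import Segmenter

# Hypothetical gold data: (hashtag body, reference segmentation) pairs.
gold = [
    ("weneedanationalpark", "we need a national park"),
    ("icecold", "ice cold"),
]

segmenter = Segmenter(corpus="twitter")  # n-gram statistics computed on Twitter

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so formatting differences don't count as errors.
    return " ".join(text.lower().split())

hits = sum(
    normalize(segmenter.segment(hashtag)) == normalize(reference)
    for hashtag, reference in gold
)
print(f"exact-match accuracy: {hits / len(gold):.2%}")
```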
| dataset | library | accuracy |
|---|---|---|
| BOUN | HashtagMaster | 81.60 |
| BOUN | ekphrasis | 44.74 |
| BOUN | hashformers | 83.68 |
| HashSet-1 | HashtagMaster | 50.06 |
| HashSet-1 | ekphrasis | 0.00 |
| HashSet-1 | hashformers | 72.47 |
| HashSet-2 | HashtagMaster | 45.04 |
| HashSet-2 | ekphrasis | 55.73 |
| HashSet-2 | hashformers | 47.43 |
| HashSet-3 | HashtagMaster | 41.93 |
| HashSet-3 | ekphrasis | 56.44 |
| HashSet-3 | hashformers | 56.71 |
| Stanford-Dev | HashtagMaster | 73.12 |
| Stanford-Dev | ekphrasis | 51.38 |
| Stanford-Dev | hashformers | 80.04 |
| average (all) | HashtagMaster | 58.35 |
| average (all) | ekphrasis | 41.65 |
| average (all) | hashformers | 68.06 |
Speed
| model | hashtags/second | accuracy | topk | layers |
|---|---|---|---|---|
| ekphrasis | 4405.00 | 44.74 | - | - |
| gpt2-large | 12.04 | 63.86 | 2 | first |
| distilgpt2 | 29.32 | 64.56 | 2 | first |
| distilgpt2 | 15.00 | 80.48 | 2 | all |
| gpt2 | 11.36 | - | 2 | all |
| gpt2 | 3.48 | - | 20 | all |
| gpt2 + bert | 1.38 | 83.68 | 20 | all |
In this table we evaluate hashformers under different settings on the Dev-BOUN dataset and compare it with ekphrasis. As ekphrasis relies on n-grams, it is a few orders of magnitude faster than hashformers.
All experiments were performed on Google Colab while connected to a Tesla T4 GPU with 15GB of RAM. We highlight `distilgpt2` at `topk = 2`, which provides the best speed-accuracy trade-off.
- **model**: The name of the model. We evaluate ekphrasis under its default settings, and use the reranker only for the SOTA experiment in the bottom row.
- **hashtags/second**: How many hashtags the model can segment per second. All experiments on hashformers had the `batch_size` parameter adjusted to take up close to 100% of GPU RAM. A side note: even at 100% of GPU memory usage, we get about 60% of GPU utilization, so you may get better results by using GPUs with more than 16GB of memory.
- **accuracy**: Accuracy on the Dev-BOUN dataset. We don't evaluate the accuracy of `gpt2`, but we know from the literature that it is expected to fall between `distilgpt2` (at 80%) and `gpt2 + bert` (the SOTA, at 83%).
- **topk**: The `topk` parameter of the Beamsearch algorithm, passed as the `topk` argument to the `WordSegmenter.segment` method (see the sketch below). The `steps` Beamsearch parameter was fixed at its default value of 13 for all experiments with hashformers, as it doesn't have as significant an impact on performance as `topk`.
- **layers**: How many Transformer layers were utilized for language modeling: either all layers or just the first (bottom) layer.
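To tie the columns back to the API, the sketch below configures the highlighted `distilgpt2` setting and passes `topk` to `WordSegmenter.segment` as described above, leaving `steps` at its default of 13. Treat it as an assumption-laden example rather than the library's canonical recipe.

```python
from hashformers import TransformerWordSegmenter as WordSegmenter

# distilgpt2 without a reranker: the best speed-accuracy trade-off in the table above.
ws = WordSegmenter(
    segmenter_model_name_or_path="distilgpt2",
    reranker_model_name_or_path=None
)

# topk controls the Beamsearch width; steps stays at its default value of 13.
segmentations = ws.segment(
    ["#weneedanationalpark", "#icecold"],
    topk=2
)
print(segmentations)
```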