Getting data for LLMs

Examples of what data can be obtained for LLMs, and how to obtain it.
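# Newer LangChain releases expose this loader from langchain_community.document_loaders.
from langchain.document_loaders import ArxivLoader
# Query arXiv by paper ID and load at most one document.
docs = ArxivLoader(query="2306.07902", load_max_docs=1).load()
# Inspect the first (and only) Document: extracted text plus metadata.
docs[0]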
Document(page_content='Massively Multilingual Corpus of Sentiment Datasets\nand Multi-faceted Sentiment Classification Benchmark\nŁukasz Augustyniak\nWUST (Wrocław University of Science and Technology)\nlukasz.augustyniak@pwr.wroc.pl\nSzymon Wo´zniak\nBrand24 AI\nMarcin Gruza\nBrand24 AI, WUST\nPiotr Gramacki\nBrand24 AI, WUST\nKrzysztof Rajda\nBrand24 AI, WUST\nMikołaj Morzy\nPozna´n University of Technology\nTomasz Kajdanowicz\nWUST\nAbstract\nDespite impressive advancements in multilingual corpora collection and model\ntraining, developing large-scale deployments of multilingual models still presents\na significant challenge. This is particularly true for language tasks that are culture-\ndependent. One such example is the area of multilingual sentiment analysis,\nwhere affective markers can be subtle and deeply ensconced in culture. This work\npresents the most extensive open massively multilingual corpus of datasets for\ntraining sentiment models. The corpus consists of 79 manually selected datasets\nfrom over 350 datasets reported in the scientific literature based on strict quality\ncriteria. The corpus covers 27 languages representing 6 language families. Datasets\ncan be queried using several linguistic and functional features. In addition, we\npresent a multi-faceted sentiment classification benchmark summarizing hundreds\nof experiments conducted on different base models, training objectives, dataset\ncollections, and fine-tuning strategies.\n1\nIntroduction\nConsider a hotel booking service that allows its customers to post reviews. You have found just the\nperfect accommodation to stay for a couple of days with your family, but you browse through the\nreviews section of the website to check the experiences of former guests. Suddenly, you encounter\na review in Polish: "hotel jak hotel, mogło by´c gorzej." This review has the following sentiment\nscores1: sneg = 0.44, sneu = 0.44, spos = 0.12. 
Intrigued by the ambiguity of scores, you\ntranslate the review into English: "hotel like a hotel, all in all, it could have been worse," which\nis scored as sneg = 0.80, sneu = 0.16, spos = 0.04. Apparently, the stereotypically pessimistic\nPolish outlook on life gets lost in translation. The next review is written in Czech: "ok, ale nic\nzajímavého" with scores sneg = 0.32, sneu = 0.54, spos = 0.14. After translating into English\n("ok, but nothing interesting") the sentiment classification model scores the review as negative\n(sneg = 0.50, sneu = 0.37, spos = 0.13). After your stay, you decide to add a very positive review\nof the hotel ("it was a killer place to stay", sneg = 0.03, sneu = 0.05, spos = 0.92). You would be\nvery surprised to learn that the Czechs would be quite confused about your opinion ("to bylo vražedné\n1Sentiment\nscores\nin\nthis\nparagraph\nare\nproduced\nby\nthe\nmultilingual\ncardiffnlp/twitter-xlm-roberta-base-sentiment model [9]\nPreprint. Under review.\narXiv:2306.07902v1  [cs.CL]  13 Jun 2023\nmísto k pobytu", sneg = 0.51, sneu = 0.09, spos = 0.39), while the Poles would stay away from the\nhotel at all costs ("to było zabójcze miejsce na pobyt", sneg = 0.78, sneu = 0.09, spos = 0.13).\nmultilingual services become ubiquitous in the modern global economy. As more websites begin to\noffer automatic translation of content, users do not bother to express themselves in the lingua franca\nof the Web, writing instead in their native languages. Despite impressive advancements in automatic\ntranslation, many NLP tasks are still difficult in the multilingual setting. And sentiment classification\nis among the most challenging. The expression of sentiment is highly culture-dependent [28]. 
The\nemotional valence of individual words, the presence of specific phrasemes, and the expectations\naround sentiment values make sentiment classification across languages a demanding task.\nModels performing sentiment classifications have to cope with two independent sources of variance\nin the input data: cultural expressions of sentiment and errors in automatic translations. In addition,\nthe productizatio', metadata={'Published': '2023-06-13', 'Title': 'Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark', 'Authors': 'Łukasz Augustyniak, Szymon Woźniak, Marcin Gruza, Piotr Gramacki, Krzysztof Rajda, Mikołaj Morzy, Tomasz Kajdanowicz', 'Summary': 'Despite impressive advancements in multilingual corpora collection and model\ntraining, developing large-scale deployments of multilingual models still\npresents a significant challenge. This is particularly true for language tasks\nthat are culture-dependent. One such example is the area of multilingual\nsentiment analysis, where affective markers can be subtle and deeply ensconced\nin culture. This work presents the most extensive open massively multilingual\ncorpus of datasets for training sentiment models. The corpus consists of 79\nmanually selected datasets from over 350 datasets reported in the scientific\nliterature based on strict quality criteria. The corpus covers 27 languages\nrepresenting 6 language families. Datasets can be queried using several\nlinguistic and functional features. In addition, we present a multi-faceted\nsentiment classification benchmark summarizing hundreds of experiments\nconducted on different base models, training objectives, dataset collections,\nand fine-tuning strategies.'})
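The page_content in the output above still carries the line wrapping of the source PDF: a newline roughly every hundred characters, and words split after a hyphen at the end of a line. Below is a minimal sketch of flattening such text before passing it to an LLM. The helper name normalize_pdf_text is hypothetical and not part of LangChain. The hyphen handling assumes the hyphen belongs to the word, which holds for the sample here but not for every PDF.

```python
import re


def normalize_pdf_text(text: str) -> str:
    """Hypothetical helper: flatten PDF line wrapping in extracted text."""
    # Join a line that ends in a hyphen with the next one, keeping the hyphen.
    # Assumption: the hyphen is part of the word (as in "culture-dependent");
    # hyphens inserted by typesetting would need a dictionary check instead.
    text = re.sub(r"-\n", "-", text)
    # Collapse every remaining run of whitespace (newlines included) to one space.
    return re.sub(r"\s+", " ", text).strip()


# Sample copied from the page_content shown above.
sample = (
    "This is particularly true for language tasks that are culture-\n"
    "dependent. One such example is the area of multilingual sentiment analysis,\n"
    "where affective markers can be subtle"
)
cleaned = normalize_pdf_text(sample)
print(cleaned)
```

For a loaded document, the same call would be normalize_pdf_text(docs[0].page_content). The metadata dictionary is left untouched.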