ESSLLI 2009 – NAACL-HLT 2010 – ESSLLI '16 & '18 – ESSLLI 2021 – Software & data sets – Bibliography
Practical examples and exercises for these courses and tutorials are based on the user-friendly software package wordspace for the interactive statistical computing environment R. If you want to follow along, please bring your own laptop and set up the required software as follows:
sparsesvd (v0.2)
wordspace (v0.2-6)
e1071, rsparse, Rtsne, uwot
tm, quanteda, data.table, wordcloud, shiny, spacyr, udpipe, coreNLP (don't worry if some of these fail to install)
NMF (also install BiocManager, then run the command BiocManager::install("Biobase"))
wordspaceEval, from a password-protected Web page

Most of our hands-on examples work reasonably well in a standard R installation, even on a moderately powerful laptop computer. However, if you intend to work on real-life tasks and process large DSMs, it is important to enable multi-threaded computation in R. Since DSMs build on matrix operations, a multi-threaded linear algebra library (“BLAS”) is key.
On Debian or Ubuntu Linux, for example, the OpenBLAS library can be installed with:
sudo apt install libopenblas-dev
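The setup can be checked from within R along the following lines. This is only a sketch: the helper name missing_packages is made up for illustration, while system.file(), extSoftVersion() and La_library() are standard R functions (the latter two assume R 3.4 or newer). The timing at the end is a rough indication, not a benchmark.

```r
# Hypothetical helper (not part of wordspace): which of the given packages
# are not installed yet?
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)),
                      logical(1))
  unname(pkgs[!installed])
}

required <- c("sparsesvd", "wordspace")
todo <- missing_packages(required)
if (length(todo) > 0) {
  message("Still missing: ", paste(todo, collapse = ", "))
  # install.packages(todo)   # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}

# Report the linear algebra libraries used by this R session.
blas <- extSoftVersion()[["BLAS"]]
message("BLAS:   ", if (nzchar(blas)) blas else "(not reported on this platform)")
message("LAPACK: ", La_library())

# Rough timing of a dense matrix product. With a multi-threaded BLAS,
# "user" time is typically larger than "elapsed" time.
set.seed(42)
M <- matrix(rnorm(1000 * 1000), nrow = 1000)
print(system.time(crossprod(M)))
```

If the reported BLAS path points to R's built-in reference library, matrix operations probably run on a single thread only.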
verb_dep.txt.gz (21.6 MB)
adj_noun_tokens.txt.gz (8.3 MB)
delta_de_termdoc.txt.gz (18.4 MB)
potter_l2r2.txt.gz (51.3 MB)
potter_lemmas.txt.gz (1.1 MB)
VSS.txt (37 kB)
Pre-compiled DSMs for use with the wordspace package for R. Each model is contained in an .rda file, which can be loaded into R with the command load("model.rda") and creates an object with the same name (model).
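The behaviour of load() can be tried out without downloading anything. The sketch below saves a small toy matrix under the name model to a temporary file and loads it back; the toy matrix is purely illustrative and has nothing to do with the actual models.

```r
# Toy stand-in for a pre-compiled model: a small matrix named "model".
model <- matrix(1:6, nrow = 2,
                dimnames = list(c("cat", "dog"), c("f1", "f2", "f3")))
rda_file <- file.path(tempdir(), "model.rda")
save(model, file = rda_file)
rm(model)                      # the object is gone from the workspace ...

# ... and load() restores it under its original name.
# load() also returns the names of the restored objects (invisibly).
loaded <- load(rda_file)
print(loaded)                  # [1] "model"
print(dim(model))              # [1] 2 3

# With a downloaded model the pattern would be the same, assuming the file
# is in the current working directory:
#   load("WP500_Win2_Lemma_svd500.rda")
#   library(wordspace)
#   nearest.neighbours(WP500_Win2_Lemma_svd500, "cat", n = 10)
```

Because load() silently overwrites any existing object of the same name, it is worth checking the returned names when working with several models.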
These models were compiled from WP500, a 200-million-word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized.
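To illustrate what power scaling with $P = 0$ does, here is a minimal base-R sketch on a random toy matrix. It is not the code used to compile the models: the helper svd_project is hypothetical, and the toy matrix merely stands in for a scored co-occurrence matrix. In the wordspace package, the corresponding steps would presumably be carried out with dsm.score() and dsm.projection(), which has a power argument.

```r
set.seed(1)
# Toy stand-in for a scored co-occurrence matrix
# (rows = target words, columns = features).
M <- matrix(abs(rnorm(8 * 5)), nrow = 8,
            dimnames = list(paste0("w", 1:8), paste0("f", 1:5)))

# L2-normalize the row vectors before dimensionality reduction.
M <- M / sqrt(rowSums(M^2))

# Hypothetical helper: project into the first n latent SVD dimensions and
# scale them by the singular values raised to the power P, i.e. U * Sigma^P.
svd_project <- function(M, n, P = 1) {
  dec <- svd(M, nu = n, nv = 0)
  proj <- sweep(dec$u, 2, dec$d[1:n]^P, `*`)
  dimnames(proj) <- list(rownames(M), paste0("svd", 1:n))
  proj
}

plain     <- svd_project(M, n = 3, P = 1)  # standard SVD projection
equalized <- svd_project(M, n = 3, P = 0)  # Caron P = 0

# With P = 1, the norm of each latent dimension is its singular value;
# with P = 0, all latent dimensions have the same norm.
print(sqrt(colSums(plain^2)))
print(sqrt(colSums(equalized^2)))

# The reduced row vectors are not re-normalized, so their lengths differ.
print(sqrt(rowSums(equalized^2)))
```

Since the reduced vectors do not have unit length, cosine similarity and Euclidean distance are no longer equivalent for ranking neighbours; row vectors can be re-normalized first if Euclidean distances are wanted.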
WP500_DepFilter_Lemma.rda (31.1 MB) – 500 latent SVD dimensions: WP500_DepFilter_Lemma_svd500.rda (179.3 MB)
WP500_DepStruct_Lemma.rda (31.6 MB) – 500 latent SVD dimensions: WP500_DepStruct_Lemma_svd500.rda (180.3 MB)
WP500_Win2_Lemma.rda (51.8 MB) – 500 latent SVD dimensions: WP500_Win2_Lemma_svd500.rda (177.1 MB)
WP500_Win5_Lemma.rda (103.9 MB) – 500 latent SVD dimensions: WP500_Win5_Lemma_svd500.rda (179.9 MB)
WP500_Win30_Lemma.rda (311.4 MB) – 500 latent SVD dimensions: WP500_Win30_Lemma_svd500.rda (182.8 MB)
WP500_TermDoc_Lemma.rda (105.1 MB) – 500 latent SVD dimensions: WP500_TermDoc_Lemma_svd500.rda (162.5 MB)
WP500_Ctype_L1R1_Lemma.rda (55.8 MB) – 500 latent SVD dimensions: WP500_Ctype_L1R1_Lemma_svd500.rda (157.0 MB)
WP500_Ctype_L2R2_Lemma.rda (33.1 MB) – 500 latent SVD dimensions: WP500_Ctype_L2R2_Lemma_svd500.rda (64.3 MB)
WP500_Ctype_L2R2pos_Lemma.rda (56.1 MB) – 500 latent SVD dimensions: WP500_Ctype_L2R2pos_Lemma_svd500.rda (175.3 MB)
WP500_Win2_Word.rda (63.9 MB) – 500 latent SVD dimensions: WP500_Win2_Word_svd500.rda (185.5 MB)
WP500_Win2_Word_WF.rda (68.9 MB) – 500 latent SVD dimensions: WP500_Win2_Word_WF_svd500.rda (185.9 MB)
Some publicly available pre-trained neural embeddings, converted into .rda format for use with the wordspace package.
GoogleNews300_wf200k.rda (129.2 MiB)