Courses and Tutorials on DSM
Software for the course
Practical examples and exercises for these courses and tutorials are based on the user-friendly software package wordspace for the interactive statistical computing environment R. If you want to follow along, please bring your own laptop and set up the required software as follows:
- Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
  - sparsesvd (v0.2)
  - wordspace (v0.2-6)
  - recommended: e1071, rsparse, Rtsne, uwot
  - optional: tm, quanteda, data.table, wordcloud, shiny, spacyr, udpipe, coreNLP (don't worry if some of these fail to install)
  - optional: NMF (also install BiocManager, then run the command BiocManager::install("Biobase"))
- During the course, you will be asked to install a further package with additional evaluation tasks (wordspaceEval) from a password-protected Web page:
  - if you are stuck with R v3.x, please use the older package version 0.1: Source/Linux – MacOS – Windows (login required)
  - download a suitable version and select “Install from: Package Archive File” in RStudio
- Download the sample data files listed below
- Download one or more of the pre-compiled DSMs listed below
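The CRAN installation step can also be scripted from the R console instead of using the GUI installer. The following is a minimal sketch, assuming a configured CRAN mirror and network access; the helper `missing_packages` is a hypothetical name introduced here for illustration and is not part of wordspace. The actual `install.packages()` call is left commented out so that the snippet only reports what is missing.

```r
# Packages named in the installation list above
required    <- c("sparsesvd", "wordspace")
recommended <- c("e1071", "rsparse", "Rtsne", "uwot")

# Hypothetical helper: return those packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(c(required, recommended))
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # Uncomment to install from CRAN (needs network access):
  # install.packages(to_install)
} else {
  message("All required and recommended packages are installed.")
}
```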
Scaling R to large data sets
Most of our hands-on examples work reasonably well in a standard R installation, even on a moderately powerful laptop computer. However, if you intend to work on real-life tasks and process large DSMs, it is important to enable multi-threaded computation in R. Since DSMs build on matrix operations, a multi-threaded linear algebra library (“BLAS”) is key.
- In Linux, it should be sufficient to install the OpenBLAS package, e.g. in Ubuntu:
sudo apt install libopenblas-dev
- In MacOS, follow these instructions to enable the VecLib BLAS built into MacOS. You may also want to enable OpenMP for an additional speed boost on expensive distance metrics (but this is less important).
- In Windows, you can try installing Microsoft R Open or do a Web search for alternative solutions.
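To see whether an optimized BLAS is actually in use, it may help to check which library R is linked against and to time a dense matrix product. This is a rough sketch using base R only; it assumes a reasonably recent R version in which `extSoftVersion()` reports the BLAS and `La_library()` is available. The matrix size is an arbitrary choice for illustration.

```r
# Report the BLAS / LAPACK libraries this R session is linked against
blas_info <- c(BLAS = unname(extSoftVersion()["BLAS"]), LAPACK = La_library())
print(blas_info)

# Time a dense matrix product; an optimized multi-threaded BLAS
# should be noticeably faster than the reference implementation
set.seed(42)
n <- 500
A <- matrix(rnorm(n * n), n, n)
timing <- system.time(B <- crossprod(A))  # computes t(A) %*% A
print(timing["elapsed"])
```

Running the same snippet before and after switching the BLAS gives a simple comparison; increase `n` if the timing is too short to be informative.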
Example data sets
- verb_dep.txt.gz (21.6 MB)
- adj_noun_tokens.txt.gz (8.3 MB)
- delta_de_termdoc.txt.gz (18.4 MB)
- potter_l2r2.txt.gz (51.3 MB)
- potter_lemmas.txt.gz (1.1 MB)
- VSS.txt (37 kB)
Pre-compiled DSMs
Pre-compiled DSMs for use with the wordspace package for R. Each model is contained in an .rda file, which can be loaded into R with the command load("model.rda") and creates an object with the same name (model).
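The behaviour of `load()` can be tried out without downloading anything. The sketch below uses base R only and a small made-up matrix as a stand-in for a real model; the file is written to a temporary directory.

```r
# Stand-in for a downloaded model: save a small matrix named "model" to an .rda file
model <- matrix(1:6, nrow = 2,
                dimnames = list(c("cat", "dog"), c("f1", "f2", "f3")))
rda_file <- file.path(tempdir(), "model.rda")
save(model, file = rda_file)
rm(model)  # remove the object so that load() has to restore it

# load() restores the object under its saved name
# and (invisibly) returns the names of all restored objects
loaded_names <- load(rda_file)
print(loaded_names)
print(dim(model))
```

By the same logic, loading one of the files listed below, e.g. WP500_DepFilter_Lemma_svd500.rda, should create an object named WP500_DepFilter_Lemma_svd500, which can then be passed to wordspace functions such as `nearest.neighbours()`.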
DSMs based on the English Wikipedia
These models were compiled from WP500, a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized.
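The effect of power scaling can be illustrated on a toy matrix with base R's `svd()`. This sketch assumes the usual reading of Caron's $P$, in which the reduced vectors are $U_k \Sigma_k^P$ for the truncated SVD $M \approx U_k \Sigma_k V_k^T$; it does not reproduce the wordspace implementation, and `power_scale` is a hypothetical helper defined only for this example. The input values are made up.

```r
# Toy co-occurrence matrix (rows = target terms, columns = features); made-up values
set.seed(1)
M <- matrix(rpois(8 * 6, lambda = 3), nrow = 8, ncol = 6)

k <- 3                        # number of latent dimensions to keep
dec <- svd(M, nu = k, nv = k)
U <- dec$u                    # 8 x k matrix of left singular vectors
sigma <- dec$d[1:k]           # the k largest singular values

# Hypothetical helper: scale each latent dimension by sigma^P
power_scale <- function(U, sigma, P) sweep(U, 2, sigma^P, `*`)

M_p1 <- power_scale(U, sigma, P = 1)  # standard SVD projection
M_p0 <- power_scale(U, sigma, P = 0)  # equalized dimensions: just U

# With P = 0, all latent dimensions (columns) have the same Euclidean norm
print(round(sqrt(colSums(M_p0^2)), 6))
# The row vectors are not re-normalized, so their lengths generally differ
print(round(sqrt(rowSums(M_p0^2)), 3))
```

Because the row vectors are not re-normalized, cosine similarity and Euclidean distance can rank neighbours differently on such reduced vectors.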
- dependency-filtered: WP500_DepFilter_Lemma.rda (31.1 MB) – 500 latent SVD dimensions: WP500_DepFilter_Lemma_svd500.rda (179.3 MB)
- dependency-structured: WP500_DepStruct_Lemma.rda (31.6 MB) – 500 latent SVD dimensions: WP500_DepStruct_Lemma_svd500.rda (180.3 MB)
- L2/R2 surface span: WP500_Win2_Lemma.rda (51.8 MB) – 500 latent SVD dimensions: WP500_Win2_Lemma_svd500.rda (177.1 MB)
- L5/R5 surface span: WP500_Win5_Lemma.rda (103.9 MB) – 500 latent SVD dimensions: WP500_Win5_Lemma_svd500.rda (179.9 MB)
- L30/R30 surface span: WP500_Win30_Lemma.rda (311.4 MB) – 500 latent SVD dimensions: WP500_Win30_Lemma_svd500.rda (182.8 MB)
- term-document model: WP500_TermDoc_Lemma.rda (105.1 MB) – 500 latent SVD dimensions: WP500_TermDoc_Lemma_svd500.rda (162.5 MB)
- type contexts (L1+R1): WP500_Ctype_L1R1_Lemma.rda (55.8 MB) – 500 latent SVD dimensions: WP500_Ctype_L1R1_Lemma_svd500.rda (157.0 MB)
- type contexts (L2+R2): WP500_Ctype_L2R2_Lemma.rda (33.1 MB) – 500 latent SVD dimensions: WP500_Ctype_L2R2_Lemma_svd500.rda (64.3 MB)
- type contexts (L2+R2 POS tags): WP500_Ctype_L2R2pos_Lemma.rda (56.1 MB) – 500 latent SVD dimensions: WP500_Ctype_L2R2pos_Lemma_svd500.rda (175.3 MB)
- word forms L2/R2: WP500_Win2_Word.rda (63.9 MB) – 500 latent SVD dimensions: WP500_Win2_Word_svd500.rda (185.5 MB)
- word forms L2/R2 with non-lemmatized features: WP500_Win2_Word_WF.rda (68.9 MB) – 500 latent SVD dimensions: WP500_Win2_Word_WF_svd500.rda (185.9 MB)
Neural word embeddings
Some publicly available pre-trained neural embeddings, converted into .rda format for use with the wordspace package.
- word2vec: GoogleNews300_wf200k.rda (129.2 MiB)
Web interfaces
- Web interface for several pre-trained Infomap models (CIMeC, U Trento)
- Explore word2vec embeddings (FAU Erlangen-Nürnberg)
- Explore DSMs based on Wikipedia (FAU Erlangen-Nürnberg)