This package was developed to create teaching material for various statistics courses aimed at students.
The main function of this package is called `gh`. It can be used to perform various tasks on the files of a repository, such as opening, loading, or sourcing them.
With this feature, students can access various educational materials such as interactive apps, R code, data files, and other resources that can be helpful in learning statistical concepts. By providing easy access to these materials, the package aims to facilitate the learning process for students and make it more interactive and engaging.
GitHub allows you to download a repository as a ZIP file: in the repository, click the Code button and choose Download ZIP. `mmstat4` works with this ZIP file, but you can also use one of your own ZIP files.
In my courses I assume that all R programs run in a freshly started R, i.e. there are no path dependencies, all necessary libraries are loaded in the R program and so on. My repositories contain not only the example programs for the students, but also the programs I use to create images and tables, and also the Shiny Apps I show.
ghget
A ZIP file or repository can be stored locally or on the internet. A key-value approach can be used to determine the location of the source ZIP file. If no key is defined, then `ghget` uses the base name of the source ZIP file as the key.
ghget(dummy="https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")
Three repository keys are predefined: `hu.data`, `hu.stat` and `dummy`. You can retrieve them via:
ghget('dummy')
ghget('hu.stat')
ghget('hu.data')
If you do not use a key, the programme will create one and return it as the result.
ghget(system.file("zip/mmstat4.dummy.zip", package = "mmstat4"))
#> [1] "mmstat4.dummy"
ghget("https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")
#> [1] "main"
# tries https://github.com/my/github_repo/archive/refs/heads/[main|master].zip
ghget("my/github_repo") # will fail
#> my/github_repo
#> https://github.com/my/github_repo/archive/refs/heads/main.zip
#> https://github.com/my/github_repo/archive/refs/heads/master.zip
#> Error in ghget("my/github_repo"): None of the previously displayed possible ZIP files were found!
#
ghget() # uses 'hu.data'
#> [1] "hu.data"
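To illustrate the default key, the following sketch derives a key from the base name of a ZIP file location using base R only. The helper `default_key` is hypothetical and not part of mmstat4; how `ghget` computes its key internally may differ.

```r
# Hypothetical helper (not part of mmstat4): derive a key from the
# base name of a ZIP file path or URL by dropping the directory part
# and the final file extension.
default_key <- function(location) {
  tools::file_path_sans_ext(basename(location))
}

default_key("zip/mmstat4.dummy.zip")
default_key("https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")
```

Under these assumptions the two calls give `"mmstat4.dummy"` and `"main"`, the same keys as in the output shown above.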
`ghget` downloads the ZIP file, saves it to a temporary location and unpacks it. For non-temporary locations, see the FAQ.
After unpacking, unique short names are generated for the files in the ZIP file from their path components.
ghget('dummy')
#> [1] "dummy"
gd <- ghdecompose(ghlist(full.names=TRUE))
head(gd)
#> outpath inpath minpath filename
#> 1 /tmp/RtmpoL2VE9/mmstat4.dummy-main LICENSE
#> 2 /tmp/RtmpoL2VE9/mmstat4.dummy-main README.md
#> 3 /tmp/RtmpoL2VE9/mmstat4.dummy-main data 12411-0006.csv
#> 4 /tmp/RtmpoL2VE9/mmstat4.dummy-main data ArbeitsloseBerlin.csv
#> 5 /tmp/RtmpoL2VE9/mmstat4.dummy-main data BANK2.sav
#> 6 /tmp/RtmpoL2VE9/mmstat4.dummy-main data Preisindex.csv
#> source
#> 1 /tmp/RtmpoL2VE9/mmstat4.dummy-main/LICENSE
#> 2 /tmp/RtmpoL2VE9/mmstat4.dummy-main/README.md
#> 3 /tmp/RtmpoL2VE9/mmstat4.dummy-main/data/12411-0006.csv
#> 4 /tmp/RtmpoL2VE9/mmstat4.dummy-main/data/ArbeitsloseBerlin.csv
#> 5 /tmp/RtmpoL2VE9/mmstat4.dummy-main/data/BANK2.sav
#> 6 /tmp/RtmpoL2VE9/mmstat4.dummy-main/data/Preisindex.csv
The file name is split into four parts. The last two parts, `minpath` and `filename`, are used to create short names:

- The short name for `/tmp/RtmpXXXXXX/mmstat4.dummy-main/LICENSE` is `LICENSE`. There is no other file named `LICENSE` in the ZIP file, so the base name is sufficient to address this file in the ZIP file.
- The short name for `/tmp/RtmpXXXXXX/mmstat4.dummy-main/data/BANK2.sav` is `data/BANK2.sav`. There is another file called `BANK2.sav` in the ZIP file, but `data/BANK2.sav` is sufficient to address this file uniquely (the other is `dbscan/BANK2.sav`). Currently, no check is made whether two files with identical base names are also identical in content.

ghlist("BANK2", full.names=TRUE) # full names
#> [1] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/BANK2.sav"
#> [2] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/examples/data/cluster/dbscan/BANK2.sav"
ghlist("BANK2") # short names
#> [1] "data/BANK2.sav" "dbscan/BANK2.sav"
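One way to picture how such short names can arise is sketched below in base R. The function `short_names` is hypothetical and is not the package's implementation: it starts with the base name of each file and prepends parent directories only for names that are still ambiguous.

```r
# Hypothetical sketch (not mmstat4 code): build the shortest unique
# name for each path by extending ambiguous names with parent directories.
short_names <- function(paths) {
  parts <- strsplit(paths, "/", fixed = TRUE)
  depth <- rep(1L, length(paths))   # number of trailing path components used
  tail_name <- function(p, k) paste(tail(p, k), collapse = "/")
  repeat {
    nm   <- mapply(tail_name, parts, depth)
    dup  <- nm %in% nm[duplicated(nm)]    # names that are still ambiguous
    grow <- dup & depth < lengths(parts)  # only extend if components remain
    if (!any(grow)) break
    depth[grow] <- depth[grow] + 1L
  }
  unname(nm)
}

paths <- c("LICENSE",
           "data/BANK2.sav",
           "examples/data/cluster/dbscan/BANK2.sav")
short_names(paths)
```

For these three in-ZIP paths the sketch returns `LICENSE`, `data/BANK2.sav` and `dbscan/BANK2.sav`, the same short names as in the output above.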
ghopen, ghload, ghsource
The short names (or the full names) can be used to work with the files:
## x <- ghload("data/BANK2.sav") # load data via rio::import
## ghopen("univariate/example_ecdf.R") # open file in RStudio editor
## ghsource("univariate/example_ecdf.R") # execute file via source
ghlist("example_ecdf") # "univariate/" was unnecessary
#> [1] "example_ecdf.R"
ghlist, ghquery
With `ghlist` you can get a list of unique (short) names for all files in the repository, or for a subset based on a regular expression `pattern`:
str(ghlist()) # get all short names
#> chr [1:473] "LICENSE" "README.md" "12411-0006.csv" "ArbeitsloseBerlin.csv" ...
ghlist("\\.pdf$") # get all short names of PDF files
#> [1] "Aufgaben.pdf" "Formelsammlung.pdf" "Loesungen.pdf"
With `ghquery` you can query the list of unique (short) names for all files based on the overlap distance.
ghlist("bnk") # pattern = "bnk"
#> character(0)
ghquery("bnk") # nearest string matching to "bnk"
#> [1] "data/BANK2.sav" "dbscan/BANK2.sav" "AverageGroupLinkage.R"
#> [4] "AverageLinkage.R" "CentroidLinkage.R" "CompleteLinkage.R"
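The difference between pattern matching and nearest-string matching can be sketched with base R. Note the substitution: this example uses the generalized Levenshtein distance from `utils::adist` with partial matching, not the overlap distance mentioned above, and `nearest_names` is a hypothetical helper, not a function of mmstat4.

```r
# Hypothetical sketch: rank candidate names by approximate string distance
# to a query. utils::adist() stands in for the overlap distance used by
# ghquery, so rankings may differ from the package's results.
nearest_names <- function(query, names, n = 3) {
  d <- utils::adist(query, names, partial = TRUE, ignore.case = TRUE)
  names[order(d[1, ])][seq_len(min(n, length(names)))]
}

files <- c("example_ecdf.R", "data/BANK2.sav", "LICENSE")
grep("bnk", files, value = TRUE)     # regular expression: no match
nearest_names("bnk", files, n = 1)   # closest name despite the missing letter
```

The regular expression finds nothing because `bnk` does not occur literally in any name, while the distance-based ranking still puts `data/BANK2.sav` first.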
ghfile, ghpath, ghdecompose
`ghfile` tries to find a unique match for a given file and returns the full path. If there is no unique match, an error is raised and some possible matches are shown.
`ghdecompose` builds a data frame and decomposes the full names of the files into

- `outpath`: the path part which is the same for all files (basically the place where the ZIP file is extracted to),
- `inpath`: the path part that is in the ZIP file but not used in `minpath`,
- `minpath`: the minimum path part, so that all files are uniquely addressable,
- `filename`: the base name of the file, and
- `source`: the input for shortpath.

The short names for the files are built from the components `minpath` and `filename`.
`ghpath` builds up the short name with various path components from a `ghdecompose` object.
ghfile('data/BANK2.sav')
#> [1] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/BANK2.sav"
ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
#> [1] "local"
fnf <- ghlist(full.names=TRUE)
dfn <- ghdecompose(fnf)
head(dfn)
#> outpath inpath minpath filename
#> 1 /tmp/RtmpoL2VE9/mmstat4.dummy data hhberlin.csv
#> 2 /tmp/RtmpoL2VE9/mmstat4.dummy data Preisindex.csv
#> 3 /tmp/RtmpoL2VE9/mmstat4.dummy data BANK2.sav
#> 4 /tmp/RtmpoL2VE9/mmstat4.dummy data 12411-0006.csv
#> 5 /tmp/RtmpoL2VE9/mmstat4.dummy data child_data.sav
#> 6 /tmp/RtmpoL2VE9/mmstat4.dummy data hhD.rda
#> source
#> 1 /tmp/RtmpoL2VE9/mmstat4.dummy/data/hhberlin.csv
#> 2 /tmp/RtmpoL2VE9/mmstat4.dummy/data/Preisindex.csv
#> 3 /tmp/RtmpoL2VE9/mmstat4.dummy/data/BANK2.sav
#> 4 /tmp/RtmpoL2VE9/mmstat4.dummy/data/12411-0006.csv
#> 5 /tmp/RtmpoL2VE9/mmstat4.dummy/data/child_data.sav
#> 6 /tmp/RtmpoL2VE9/mmstat4.dummy/data/hhD.rda
head(ghpath(dfn))
#> 1
#> "/tmp/RtmpoL2VE9/mmstat4.dummy/data/hhberlin.csv"
#> 2
#> "/tmp/RtmpoL2VE9/mmstat4.dummy/data/Preisindex.csv"
#> 3
#> "/tmp/RtmpoL2VE9/mmstat4.dummy/data/BANK2.sav"
#> 4
#> "/tmp/RtmpoL2VE9/mmstat4.dummy/data/12411-0006.csv"
#> 5
#> "/tmp/RtmpoL2VE9/mmstat4.dummy/data/child_data.sav"
#> 6
#> "/tmp/RtmpoL2VE9/mmstat4.dummy/data/hhD.rda"
The package comes with two RStudio addins (see under Addins -> MMSTAT4):

- Open a file from a zip file (`ghopenAddin`), which gives access to the unzipped zip file and opens the selected file in an RStudio editor window.
- Execute a Shiny app from a zip file (`ghappAddin`), which extracts all directories containing Shiny apps and opens the selected app in the default web browser.
In order to use Python scripts, Python must be installed locally. The scripts are executed in a virtual environment called `mmstat4.xxxx`, which is created when a Python script is either executed or opened. During the creation of the virtual environment, users are prompted to approve the setup of this environment; this step is critical for the proper execution of Python scripts.
Once the virtual environment is set up, the script checks for the presence of a file called `init_py.R` in the ZIP file. If this file exists, it is extracted and executed with the `source` command. Normally, this file is used to install Python modules with `reticulate::py_install('module name')`.
Currently there are the following routines to support R code snippets:
`packages` or `modules`, which extract all `library`/`require`/`import` calls from code snippets and return a frequency table of the packages or modules called.
ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
#> [1] "local"
files <- ghlist(pattern="\\.R$", full.names = TRUE)
head(Rlibs(files), 30)
#> $R
#>
#> Amelia CHAID DescTools GGally Hmisc
#> 1 1 6 4 1
#> MASS MissingDataGUI NbClust QuantPsyc RColorBrewer
#> 130 1 1 6 2
#> TeachingDemos UsingR VIM additivityTests agricolae
#> 1 1 2 1 5
#> alphahull andrews ape aplpack ash
#> 1 4 1 3 2
#> boot car cluster coin dbscan
#> 4 13 17 1 3
#> deldir devtools e1071 effsize entropy
#> 1 3 5 1 3
#> flexclust foreign fpc gam ggmap
#> 2 4 1 4 1
#> ggplot2 glmnet gplots hexbin igraph
#> 20 2 3 1 2
#> lattice lawstat lmtest locfit mclust
#> 27 1 3 2 2
#> mgcv mice mitools mlbench mmstat4
#> 2 1 2 4 2
#> moments neuralnet nnet nortest np
#> 3 1 9 3 14
#> olsrr outliers paran perturb plot.3d
#> 6 3 2 1 14
#> plot.matrix plot3d plotrix proxy pscl
#> 1 1 5 2 2
#> psych qualityTools randomForest reshape2 rggobi
#> 14 1 1 1 1
#> rgl rio robustbase rpart rpart.plot
#> 2 26 2 15 2
#> ryouready sampling scagnostics scatterplot3d shiny
#> 1 3 13 13 6
#> shinyApp shinyExample shinyWidgets shinydashboard sm
#> 3 1 2 6 2
#> smvgraph spdep tabplot tibble vcd
#> 1 3 3 1 7
#> vcdExtra vioplot xlsx xtable
#> 1 2 1 4
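The idea behind such a frequency table can be sketched in base R. The function `count_libs` below is hypothetical and much cruder than whatever the package does: it matches `library(...)` and `require(...)` calls with a regular expression, including ones that appear in comments.

```r
# Hypothetical sketch (not mmstat4 code): count library()/require() calls
# in a set of R files using a simple regular expression.
count_libs <- function(files) {
  code <- unlist(lapply(files, readLines, warn = FALSE))
  call_rx <- "(library|require)\\([[:space:]]*[\"']?[A-Za-z0-9.]+"
  calls <- unlist(regmatches(code, gregexpr(call_rx, code)))
  # strip the function name, the bracket and an optional quote
  pkgs <- sub("^(library|require)\\([[:space:]]*[\"']?", "", calls)
  table(pkgs)
}

# small self-made example file
tmp <- tempfile(fileext = ".R")
writeLines(c("library(MASS)", "require('rio')", "library(MASS)", "x <- 1"), tmp)
count_libs(tmp)
```

A parser-based approach would be more robust than a regular expression, but the sketch is enough to show where the counts come from.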
You can add a file `init_R.R` or `init_py.R` to your ZIP file, which installs the necessary R packages or Python modules. A vector with the names of the R packages (`rlibs=...`) or Python modules (`pymods=...`) to be installed can be passed to the `install` command.
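As a rough idea of what an `init_R.R` file could contain, the sketch below determines which packages from a given vector are not yet installed. The helper `missing_packages` and the package names are assumptions for illustration only; the sketch does not use the package's own `install` mechanism.

```r
# Hypothetical content for an init_R.R file (names are illustrative).
# system.file() returns "" for packages that are not installed.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
  pkgs[!found]
}

needed <- c("stats", "utils")   # replace with the packages your scripts use
todo   <- missing_packages(needed)
# install only what is missing; commented out to avoid side effects here
# if (length(todo) > 0) install.packages(todo)
```

Installing only the missing packages keeps repeated runs of the file cheap.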
`checkFiles` checks whether each R code snippet runs smoothly in a freshly started R.
# just check the last files from the list
# Note that the R console will show more output (warnings etc.)
checkFile(files, start=435) # alternatively: Rsolo
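What such a check can involve for a single R file is sketched below with base R only: test that the file exists, that it parses, and optionally that it runs in a separate, freshly started R process. The function `check_script` is hypothetical and not the package's implementation.

```r
# Hypothetical sketch (not mmstat4 code): check one R script at
# increasing levels of strictness.
check_script <- function(file, mode = c("parse", "exist", "run")) {
  mode <- match.arg(mode)
  if (!file.exists(file)) return(FALSE)
  if (mode == "exist") return(TRUE)
  # parse() raises an error for syntactically invalid code
  ok <- !inherits(try(parse(file = file), silent = TRUE), "try-error")
  if (!ok || mode == "parse") return(ok)
  # run the script in a fresh R process so nothing leaks in from this session
  rscript <- file.path(R.home("bin"), "Rscript")
  status <- system2(rscript, c("--vanilla", shQuote(file)),
                    stdout = FALSE, stderr = FALSE)
  status == 0
}

good <- tempfile(fileext = ".R"); writeLines("x <- 1 + 1", good)
bad  <- tempfile(fileext = ".R"); writeLines("x <- (1 +", bad)
check_script(good)            # parses
check_script(bad)             # syntax error
check_script(bad, "exist")    # the file itself exists
```

Running in a separate process is what makes missing `library` calls or path dependencies visible, since the script cannot rely on objects or packages loaded in the current session.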
Three modes are available for checking a file:

- `exist`: Does the source file exist?
- `parse`: Is `parse(file)` or `python -m "file"` successful? (default)
- `run`: Is `Rscript "file"` or `python3 "file"` successful?

`dupFiles` uses checksums to check whether files exist twice.
files <- ghlist(full.names = TRUE)
head(dupFiles(files)) # alternatively: Rdups
#> $c300e8fe6f0bc562256e81670c23d8c0
#> [1] "/tmp/RtmpoL2VE9/mmstat4.dummy/data/BANK2.sav"
#> [2] "/tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/cluster/dbscan/BANK2.sav"
#>
#> $`4efddb6dc6c7ed743221295d55133817`
#> [1] "/tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/nnet/mincer_nnet3.R"
#> [2] "/tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/nnet/mincer_nnet5.R"
#>
#> $`9f9fe7603aa82f33bbc85a9d32e39d03`
#> [1] "/tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/cluster/dbscan/app.tmpl"
#> [2] "/tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/mgraphics/scagnostics/app.tmpl"
#>
#> $`0b74b824367df429803599708daf2e2e`
#> [1] "/tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/subgroup/example_mosaic.R"
#> [2] "/tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/mgraphics/example_mosaic.R"
#>
#> $`8eaa4f89e233ba69fcda053d238699aa`
#> [1] "/tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/subgroup/example_mosaic_cotabplot.R"
#> [2] "/tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/mgraphics/example_mosaic_cotabplot.R"
#>
#> $`8ed6128aab796148df5e71cbeab547da`
#> [1] "/tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/subgroup/example_mosaic_graphics.R"
#> [2] "/tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/mgraphics/example_mosaic_graphics.R"
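Checksum-based duplicate detection can be sketched in base R with `tools::md5sum`. The function `find_dups` is hypothetical; whether `dupFiles` uses MD5 or another checksum is an assumption here.

```r
# Hypothetical sketch (not mmstat4 code): group files by MD5 checksum and
# keep only the groups with more than one member.
find_dups <- function(files) {
  sums   <- tools::md5sum(files)      # named vector: file -> checksum
  groups <- split(names(sums), sums)  # list: checksum -> files
  groups[lengths(groups) > 1]
}

dir <- tempfile(); dir.create(dir)
writeLines("same content",  file.path(dir, "a.txt"))
writeLines("same content",  file.path(dir, "b.txt"))
writeLines("other content", file.path(dir, "c.txt"))
find_dups(list.files(dir, full.names = TRUE))
```

As in the output above, the result is a list named by checksum, with one vector of file names per group of identical files.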
Note: there is also an error message if the necessary libraries are not installed!
Once you have created your ZIP file, you need to know under which names a specific file can be accessed. In the example we use a ZIP file which comes with the package `mmstat4`:
ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
#> [1] "local"
ghnames <- ghdecompose(ghlist(full.names=TRUE))
ghnames[58,]
#> outpath inpath minpath filename
#> 58 /tmp/RtmpoL2VE9/mmstat4.dummy examples/data/cluster dbscan BANK2.sav
#> source
#> 58 /tmp/RtmpoL2VE9/mmstat4.dummy/examples/data/cluster/dbscan/BANK2.sav
The shortest possible name is determined by `minpath` and `filename`. But all other paths formed from trailing parts of `inpath` followed by `minpath` and `filename` should also work. For file number 58, the following access names are possible:

- `BANK2.sav` will not work since there is more than one file named `BANK2.sav` in the ZIP file.
- `dbscan/BANK2.sav` will work since this is the shortest possible name.
- `cluster/dbscan/BANK2.sav`, `data/cluster/dbscan/BANK2.sav`, and `examples/data/cluster/dbscan/BANK2.sav` will work.

x1 <- ghload("BANK2.sav")
#> Best matches:
#> ghload(x = "data/BANK2.sav")
#> ghload(x = "dbscan/BANK2.sav")
#> Error in ghfile(x, msg = msg): No (unique) file 'BANK2.sav' found, check matches!
x2 <- ghload("dbscan/BANK2.sav")
x3 <- ghload("cluster/dbscan/BANK2.sav")
x4 <- ghload("data/cluster/dbscan/BANK2.sav")
x5 <- ghload("examples/data/cluster/dbscan/BANK2.sav")
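The set of access names for a file can be enumerated from one row of the decomposition. The helper `access_names` below is hypothetical (not part of mmstat4) and assumes that every name from the short name up to the full in-ZIP path is valid, as described above.

```r
# Hypothetical sketch: list all access names for one file, starting with
# the short name (minpath/filename) and prepending the directories of
# inpath one by one, from the innermost outwards.
access_names <- function(inpath, minpath, filename) {
  short <- if (nzchar(minpath)) paste(minpath, filename, sep = "/") else filename
  dirs  <- if (nzchar(inpath)) strsplit(inpath, "/", fixed = TRUE)[[1]] else character(0)
  out <- short
  for (k in seq_along(dirs)) {
    out <- c(out, paste(c(tail(dirs, k), short), collapse = "/"))
  }
  out
}

# values taken from row 58 of the data frame shown above
access_names("examples/data/cluster", "dbscan", "BANK2.sav")
```

For row 58 this yields the four names used in the `ghload` calls `x2` to `x5`.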
Please email me at sigbert@hu-berlin.de. You can also try the current development version of the package from GitHub:
# install.packages("devtools")
devtools::install_github("sigbertklinke/mmstat4")
No, this is not supported.
ghget("dummy", .force=TRUE)
ghget("dummy", .tempdir=FALSE) # install non-temporarily
ghget("dummy", .tempdir="~/mmstat4") # install non-temporarily to ~/mmstat4
ghget("dummy", .tempdir=TRUE) # install again temporarily
Note: If a repository was installed permanently and you switch back to temporary storage, the downloaded files will not be deleted.
ghget("dummy", .tempdir=TRUE)
ghlist(pattern="/(app|server)\\.R$")
ghopen("dbscan") # open the app
csv data files?
ghget("dummy", .tempdir=TRUE)
#> [1] "dummy"
ghlist(pattern="\\.csv$", ignore.case=TRUE, full.names=TRUE)
#> [1] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/12411-0006.csv"
#> [2] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/ArbeitsloseBerlin.csv"
#> [3] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/Preisindex.csv"
#> [4] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/TelefonDaten.csv"
#> [5] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/haushalte.csv"
#> [6] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/haushalte_berlin.csv"
#> [7] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/hhberlin.csv"
#> [8] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/hhberlin_2017.csv"
#> [9] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/pechstein.csv"
#> [10] "/tmp/RtmpoL2VE9/mmstat4.dummy-main/data/rentcap.csv"
# use mmstat4::ghload for importing
ghlist(pattern="\\.csv$")
#> [1] "12411-0006.csv" "ArbeitsloseBerlin.csv" "Preisindex.csv"
#> [4] "TelefonDaten.csv" "haushalte.csv" "haushalte_berlin.csv"
#> [7] "hhberlin.csv" "hhberlin_2017.csv" "pechstein.csv"
#> [10] "rentcap.csv"
pechstein <- ghload("pechstein.csv")
str(pechstein)
#> 'data.frame': 29 obs. of 3 variables:
#> $ Datum : chr "04.02.00" "01.02.01" "10.11.01" "06.02.02" ...
#> $ Tag : int 34 397 679 767 771 783 1043 1160 1166 1421 ...
#> $ Retikulozyten: chr "2,3" "2,5" "2,45" "2,1" ...
For Ubuntu (Linux) install:
sudo apt-get install python3 python3-dev python3-pip python3-venv libbz2-dev
Note: `mmstat4` installs these Python modules by default: `numpy`, `scipy`, `statsmodels`, `pandas`, `scikit-learn`, `matplotlib`, and `seaborn`.
`init_py.R` is only called if the virtual environment is created. Can I force a new call?
Yes, delete the virtual environment and recreate it:
reticulate::virtualenv_remove('mmstat4')
ghinstall('py', force=TRUE)
The package recognises three standard repositories: `dummy`, `hu.stat`, and `hu.data`.
Repository | Size | ZIP file location |
---|---|---|
dummy | 3 MB | https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip |
hu.data | 29 MB | https://github.com/sigbertklinke/mmstat4.data/archive/refs/heads/main.zip |
hu.stat | 31 MB | https://github.com/sigbertklinke/mmstat4.stat/archive/refs/heads/main.zip |
`dummy` is a small subsample of `hu.stat` and `hu.data`, intended for examples and testing.
Mathematical foundations - Introduction - Basic concepts - Univariate distributions - Parameters of univariate distributions - Bivariate distributions - Parameters of bivariate distributions - Regression analysis - Time series analysis - Index numbers - Probability theory - Random variables - How to lie with statistics - Important distribution models - Sampling theory - Statistical estimation methods - Regression model - Confidence intervals - Statistical tests - Parametric tests - Nonparametric tests
ghget("hu.stat")
ghopen("Statistik.pdf")
ghopen("Aufgaben.pdf")
ghopen("Loesungen.pdf")
ghopen("Formelsammlung.pdf")
General - R - Basics and data generation - Test and estimation theory - Parameter of distributions - Distribution - Transformations - Robust statistics - Missing values - Subgroup analysis - Correlation and association - Multivariate graphics - Principal component analysis - Exploratory factor analysis - Reliability - Cluster analysis - Regression analysis - Linear regression - Nonparametric regression - Classification and regression trees - Neural networks
ghget("hu.data")
ghopen("dataanalysis.pdf")
Introduction - Detection and identification of outliers - Testing the distributional form of variables - Parameter comparisons for independent samples - Appendices A-D, bibliography, index
ghget("hu.data")
ghopen("cs1_roenz.pdf")
Preface - Testing relationships - Regression analysis - Reliability and homogeneity analysis of constructs - Appendices A-H, bibliography, subject index
ghget("hu.data")
ghopen("cs2_roenz.pdf")
Introduction - Generalized linear models (GLM) - Modelling binary data - The multinomial logit model - Modelling multinomial data (log-linear models) - Bibliography, index
ghget("hu.data")
ghopen("glm_roenz.pdf")