Generates a dataset object or a data.series object (a list of dataset objects) storing simulation parameters as well as aggregate daily buys and sells simulated following the assumptions of the MPIN model of Ersan (2016).
Usage
generatedata_mpin(series = 1, days = 60, layers = NULL,
parameters = NULL, ranges = list(), ...,
verbose = TRUE)
Arguments
- series
The number of datasets to generate.
- days
The number of trading days for which aggregated buys and sells are generated. Default value is 60.
- layers
The number of information layers to be included in the simulated data. Default value is NULL. If layers is omitted or set to NULL, the number of layers is uniformly selected from the set {1, ..., maxlayers}.
- parameters
A vector of model parameters of size 3J+2, where J is the number of information layers. It has the following form: {\(\alpha_1\), ..., \(\alpha_J\), \(\delta_1\), ..., \(\delta_J\), \(\mu_1\), ..., \(\mu_J\), \(\epsilon_b\), \(\epsilon_s\)}.
- ranges
A list of ranges for the different simulation parameters, with named elements for \(\alpha\), \(\delta\), \(\epsilon_b\), \(\epsilon_s\), and \(\mu\). The value of each element is a vector of two numbers: the first one is the minimal value min_v and the second one is the maximal value max_v. If the element corresponding to a given parameter is missing, the default range for that parameter is used. If the argument ranges is an empty list and parameters is NULL, the default ranges for the parameters are used. The simulation parameters are uniformly drawn from the interval (min_v, max_v) for the specified parameters. The default value is list().
- ...
Additional arguments passed on to the function generatedata_mpin(). The recognized arguments are confidence, maxlayers, eps_ratio, and mu_ratio.
confidence (numeric) denotes the range of the confidence interval associated with each layer, such that all observations within layer j lie in the theoretical confidence interval of the Skellam distribution centered on the mean order imbalance, at the level confidence. The default value is 0.99.
maxlayers (integer) denotes the upper limit on the number of layers for the generated datasets. If the argument layers is missing, the number of layers of each simulated dataset is uniformly drawn from {1, ..., maxlayers}. When missing, maxlayers takes the default value of 5.
eps_ratio (numeric) specifies the admissible range for the ratio \(\epsilon_s/\epsilon_b\). It can be a two-value vector or a single value. If eps_ratio is a vector of two values, the first one is the minimal value and the second one is the maximal value, and the function tries to generate \(\epsilon_s\) and \(\epsilon_b\) such that the ratio \(\epsilon_s/\epsilon_b\) lies within the interval eps_ratio. If eps_ratio is a single number, the function tries to generate \(\epsilon_s\) and \(\epsilon_b\) satisfying \(\epsilon_s = \epsilon_b \times\) eps_ratio. If this range conflicts with other arguments such as ranges, a warning is displayed. The default value is c(0.75, 1.25).
mu_ratio (numeric) is the minimal value of the ratio between two consecutive values of the vector mu. For example, if mu_ratio = 1.25, then \(\mu_{j+1}\) should be larger than 1.25 * \(\mu_j\) for all j = 1, ..., J-1. If mu_ratio conflicts with other arguments such as ranges or confidence, a warning is displayed. The default value is NULL.
- verbose
(logical) a binary variable that determines whether detailed information about the progress of the data generation is displayed. No output is produced when verbose is set to FALSE. The default value is TRUE.
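The layout of the parameters vector and the two ratio constraints can be checked by hand before calling the function. The sketch below uses base R only; describe_params() is a hypothetical helper that is not part of the package, and its mu_ratio default of 1.25 is an assumption taken from the default procedure in the Details, not from the function signature.

```r
# Hypothetical helper (not part of the package): recovers the number of
# layers J from a vector of length 3J + 2 and checks the ratio constraints.
describe_params <- function(parameters, eps_ratio = c(0.75, 1.25),
                            mu_ratio = 1.25) {
  J <- (length(parameters) - 2) / 3
  if (J < 1 || J != round(J)) stop("'parameters' must have length 3J + 2")
  # Layout: alpha_1..J, delta_1..J, mu_1..J, eps.b, eps.s
  mu    <- parameters[(2 * J + 1):(3 * J)]
  eps_b <- parameters[3 * J + 1]
  eps_s <- parameters[3 * J + 2]
  ratio <- eps_s / eps_b
  list(
    layers       = J,
    eps_ratio_ok = ratio >= eps_ratio[1] && ratio <= eps_ratio[2],
    # Consecutive mu values; trivially TRUE when there is a single layer
    mu_ratio_ok  = all(mu[-1] / mu[-J] >= mu_ratio)
  )
}

# Two-layer vector used in the Examples: eps.s / eps.b = 200 / 300
describe_params(c(0.4, 0.2, 0.9, 0.1, 400, 700, 300, 200))
```

For this vector the helper reports two layers, a \(\mu\) ratio of 1.75 that satisfies the 1.25 threshold, and an \(\epsilon_s/\epsilon_b\) ratio of about 0.67 that falls outside the default eps_ratio interval.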
Value
Returns an object of class dataset if series = 1, and an object of class data.series if series > 1.
Details
An information layer refers to a given type of information event existing in the data. The PIN model assumes a single type of information event, characterized by three parameters: \(\alpha\), \(\delta\), and \(\mu\). The MPIN model relaxes this assumption by lifting the restriction on the number of information event types. When layers = 1, the generated data fit the assumptions of the PIN model.
If the argument parameters is missing, then the simulation parameters are generated using the ranges specified in the argument ranges. If the argument ranges is list(), default ranges are used. Using the default ranges, the simulation parameters are obtained using the following procedure:
- \(\alpha\): a vector of length layers, where each \(\alpha_j\) is uniformly distributed on (0, 1), subject to the condition \(\sum_j \alpha_j < 1\).
- \(\delta\): a vector of length layers, where each \(\delta_j\) is uniformly distributed on (0, 1).
- \(\mu\): a vector of length layers, where each \(\mu_j\) is uniformly distributed on the interval (0.5 max(\(\epsilon_b\), \(\epsilon_s\)), 5 max(\(\epsilon_b\), \(\epsilon_s\))). The \(\mu\)'s are then sorted so that the excess trading increases with the information layer, subject to the condition that the ratio of two consecutive \(\mu\)'s should be at least 1.25.
- \(\epsilon_b\): an integer drawn uniformly from the interval (100, 10000) with step 50.
- \(\epsilon_s\): an integer drawn uniformly from ((3/4)\(\epsilon_b\), (5/4)\(\epsilon_b\)) with step 50.
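The default procedure above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: draw_default_params() is a hypothetical helper, the constraints on \(\alpha\) and \(\mu\) are enforced by simple redrawing, and interval endpoints are allowed when they fall on the step-50 grid.

```r
# Hypothetical helper (not part of the package): draws one parameter vector
# c(alpha, delta, mu, eps.b, eps.s) following the default procedure above.
draw_default_params <- function(layers, mu_ratio = 1.25) {
  # eps.b: one value from the grid 100, 150, ..., 10000
  grid_b <- seq(100, 10000, by = 50)
  eps_b <- grid_b[sample.int(length(grid_b), 1)]

  # eps.s: one multiple of 50 between (3/4) eps.b and (5/4) eps.b
  # (assumption: endpoints are allowed when they fall on the grid)
  grid_s <- seq(50 * ceiling(0.75 * eps_b / 50),
                50 * floor(1.25 * eps_b / 50), by = 50)
  eps_s <- grid_s[sample.int(length(grid_s), 1)]

  # alpha: uniform draws, redrawn until they sum to less than 1
  repeat {
    alpha <- runif(layers)
    if (sum(alpha) < 1) break
  }

  # delta: independent uniform draws on (0, 1)
  delta <- runif(layers)

  # mu: sorted uniform draws, redrawn until consecutive ratios >= mu_ratio
  top <- max(eps_b, eps_s)
  repeat {
    mu <- sort(runif(layers, 0.5 * top, 5 * top))
    if (all(mu[-1] / mu[-layers] >= mu_ratio)) break
  }

  c(alpha, delta, mu, eps_b, eps_s)
}

set.seed(1)
p <- draw_default_params(layers = 2)
length(p)  # 3 * 2 + 2 = 8
```

The returned vector uses the same layout as the argument parameters, so its length is always 3J+2.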
Based on the simulation parameters in parameters, daily buys and sells are generated under the assumption that buys and sells follow Poisson distributions with mean parameters (\(\epsilon_b\), \(\epsilon_s\)) on days with no information, (\(\epsilon_b + \mu_j\), \(\epsilon_s\)) on days with good information of layer \(j\), and (\(\epsilon_b\), \(\epsilon_s + \mu_j\)) on days with bad information of layer \(j\).
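The Poisson mechanism just described can be illustrated with a few lines of base R. The sketch makes two assumptions that the text above does not state: each day's layer is drawn independently with probabilities \(\alpha_j\), and \(\delta_j\) is read as the probability that an information event of layer \(j\) is bad news, as in the standard PIN setup. simulate_trades() is a hypothetical helper, not the package's generator.

```r
# Hypothetical helper (not part of the package): simulates daily buys and
# sells from c(alpha_1..J, delta_1..J, mu_1..J, eps.b, eps.s).
simulate_trades <- function(params, days = 60) {
  J <- (length(params) - 2) / 3
  stopifnot(J >= 1, J == round(J))
  alpha <- params[1:J]
  delta <- params[(J + 1):(2 * J)]
  mu    <- params[(2 * J + 1):(3 * J)]
  eps_b <- params[3 * J + 1]
  eps_s <- params[3 * J + 2]

  # Layer of each day: 0 = no information, j = information event of layer j
  layer <- sample(0:J, days, replace = TRUE, prob = c(1 - sum(alpha), alpha))

  # Assumption: on an information day of layer j, news is bad w.p. delta_j
  bad  <- runif(days) < c(0, delta)[layer + 1]
  good <- layer > 0 & !bad

  # Excess informed trading: added to buys on good days, to sells on bad days
  extra <- c(0, mu)[layer + 1]
  data.frame(
    layer = layer,
    b = rpois(days, eps_b + ifelse(good, extra, 0)),
    s = rpois(days, eps_s + ifelse(bad, extra, 0))
  )
}

set.seed(42)
head(simulate_trades(c(0.4, 0.1, 800, 300, 200), days = 60))
```

On days with layer equal to 0, both columns fluctuate around (\(\epsilon_b\), \(\epsilon_s\)); on information days, exactly one of the two columns is shifted by \(\mu_j\).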
Considerations for the ranges of simulation parameters: While the generatedata_mpin() function enables the user to simulate data series with any set of theoretical parameters, we strongly recommend the use of parameter sets satisfying the conditions below, which are in line with the nature of empirical data and the theoretical models used within this package. When parameter values are not assigned by the user, the function, by default, simulates data series that are in line with these criteria.
Consideration 1: every \(\mu\) value should be separable from the \(\epsilon_b\) and \(\epsilon_s\) values, as well as from the other \(\mu\) values. Otherwise, the PIN and MPIN estimation would not yield the expected results.
[x] Sharp example 1: \(\epsilon_b = 1000\); \(\mu = 1\). In this case, no information layer can be reliably captured by models that rely on Poisson distributions.
[x] Sharp example 2: \(\epsilon_s = 1000\), \(\mu_1 = 1000\), and \(\mu_2 = 1001\). Similarly, no distinction can be made between the two simulated layers of informed trading. In real life, this entails that there is only one type of information, which would also be the estimate of the MPIN model. However, the simulated data would nominally have 2 layers, which would lead the user to a wrong evaluation of model performance.
Consideration 2: \(\epsilon_b\) and \(\epsilon_s\) should be relatively close to each other. When they are far from each other, this indicates a substantial asymmetry between buyer-initiated and seller-initiated trades, which is a strong signal of informed trading. There is no theoretical evidence that uninformed trading on the buy and sell sides deviates much from each other in real life. Besides, numerous papers that work with the PIN model report uninformed intensities that are close to each other. When no parameter values are assigned by the user, the function generates data under the condition that the sell-side uninformed trading rate lies between (3/4) := 75% and (5/4) := 125% of the buy-side uninformed rate.
[x] Sharp example 3: \(\epsilon_b = 1000\), \(\epsilon_s = 10000\). In this case, the PIN and MPIN models would tend to consider some of the trading on the sell side to be informed (which should be the actual case). Again, the estimation results would deviate substantially from the simulation parameters, which is good news in itself but a misleading factor in model evaluation. See, for example, Cheng and Lai (2021) for a misinterpretation of comparative performance: the paper's findings rely heavily on simulations with extremely different \(\epsilon_b\) and \(\epsilon_s\) values (the pairs 813-8124 and 8126-812).
References
Cheng T, Lai H (2021). “Improvements in estimating the probability of informed trading models.” Quantitative Finance, 21(5), 771-796.
Ersan O (2016). “Multilayer Probability of Informed Trading.” Available at SSRN 2874420.
Examples
# ------------------------------------------------------------------------ #
# There are different scenarios of using the function generatedata_mpin() #
# ------------------------------------------------------------------------ #
# With no arguments, the function generates one dataset object spanning
# 60 days, containing a number of information layers uniformly selected
# from `{1, 2, 3, 4, 5}`, and where the parameters are chosen as
# described in the details.
sdata <- generatedata_mpin()
# The number of layers can be deduced from the simulation parameters, if
# fed directly to the function generatedata_mpin() through the argument
# 'parameters'. In this case, the output is a dataset object with one
# information layer.
givenpoint <- c(0.4, 0.1, 800, 300, 200)
sdata <- generatedata_mpin(parameters = givenpoint)
# The number of layers can alternatively be set directly through the
# argument 'layers'.
sdata <- generatedata_mpin(layers = 2)
# The simulation parameters can be randomly drawn from their corresponding
# ranges fed through the argument 'ranges'.
sdata <- generatedata_mpin(ranges = list(alpha = c(0.1, 0.7),
                                         delta = c(0.2, 0.7),
                                         mu = c(3000, 5000)))
# A given simulation parameter can be fixed at a specific value by setting
# its range to a single value instead of a pair of values.
sdata <- generatedata_mpin(ranges = list(alpha = 0.4, delta = c(0.2, 0.7),
                                         eps.b = c(100, 7000),
                                         mu = c(8000, 12000)))
#> [Warning] The maximum layers possible given that alpha >= 0.4 is: 2.
# If both arguments 'parameters', and 'layers' are simultaneously provided,
# and the number of layers detected from the length of the argument
# 'parameters' is different from the argument 'layers', the former is used
# and a warning is displayed.
sim.params <- c(0.4, 0.2, 0.9, 0.1, 400, 700, 300, 200)
sdata <- generatedata_mpin(days = 120, layers = 3, parameters = sim.params)
#> [Warning]
#> The number of layers derived from 'parameters' is not compatible with 'layers'.
#> The argument 'layers' will be ignored
# Display the details of the generated data
show(sdata)
#> ----------------------------------
#> Data series successfully generated
#> ----------------------------------
#> Simulation model : MPIN model
#> Number of layers : 2 layer(s)
#> Number of trading days : 120 days
#> ----------------------------------
#> Type object@data to get the simulated data
#>
#> Data simulation
#>
#> =========== ============== ================== =============
#> Variables Theoretical. Empirical. Aggregates.
#> =========== ============== ================== =============
#> alpha 0.4, 0.2 0.375000, 0.258333 0.633333
#> delta 0.9, 0.1 0.888889, 0.193548 0.605263
#> mu 400, 700 401.13, 708.48 526.5
#> eps.b 300 299 299
#> eps.s 200 201.7 201.7
#> ----
#> Likelihood - (1199.237) (1199.237)
#> mpin - 0.399745 0.399745
#> =========== ============== ================== =============
#>
#> -------
#> Running time: 0.006 seconds
# \donttest{
# ------------------------------------------------------------------------ #
# Use generatedata_mpin() to compare the accuracy of estimation methods #
# ------------------------------------------------------------------------ #
# The example below illustrates the use of the function 'generatedata_mpin()'
# to compare the accuracy of the functions 'mpin_ml()', and 'mpin_ecm()'.
# The example will depend on three variables:
# n: the number of datasets used
# l: the number of layers in each simulated dataset
# xc : the number of extra clusters used in initials_mpin
# For consideration of speed, we will set n = 2, l = 2, and xc = 2
# These numbers can change to fit the user's preferences
n <- l <- xc <- 2
# We start by generating n datasets simulated according to the
# assumptions of the MPIN model.
dataseries <- generatedata_mpin(series = n, layers = l, verbose = FALSE)
# Store the estimates in two different lists: 'mllist', and 'ecmlist'
mllist <- lapply(dataseries@datasets, function(x)
  mpin_ml(x@data, xtraclusters = xc, layers = l, verbose = FALSE))
ecmlist <- lapply(dataseries@datasets, function(x)
  mpin_ecm(x@data, xtraclusters = xc, layers = l, verbose = FALSE))
# For each estimate, we calculate the absolute difference between the
# estimated mpin and the empirical mpin computed using dataset parameters.
# The absolute differences are stored in 'mldpin' ('ecmdpin') for the
# ML (ECM) method.
mldpin <- sapply(1:n,
  function(x) abs(mllist[[x]]@mpin - dataseries@datasets[[x]]@emp.pin))
ecmdpin <- sapply(1:n,
  function(x) abs(ecmlist[[x]]@mpin - dataseries@datasets[[x]]@emp.pin))
# Similarly, we obtain vectors of running times for both estimation methods.
# They are stored in 'mltime' ('ecmtime') for the ML (ECM) method.
mltime <- sapply(mllist, function(x) x@runningtime)
ecmtime <- sapply(ecmlist, function(x) x@runningtime)
# Finally, we calculate the average absolute deviation from empirical PIN
# as well as the average running time for both methods. This allows us to
# compare them in terms of accuracy, and speed.
accuracy <- c(mean(mldpin), mean(ecmdpin))
timing <- c(mean(mltime), mean(ecmtime))
comparison <- as.data.frame(rbind(accuracy, timing))
colnames(comparison) <- c("ML", "ECM")
rownames(comparison) <- c("Accuracy", "Timing")
show(round(comparison, 6))
#> ML ECM
#> Accuracy 0.000092 9.5e-05
#> Timing 2.979500 9.2e-02
# }