
Working with Data and Metadata


Data Environments and Metadata Repositories are containers of data and metadata that the VTL E&E knows how to work with. Data environments often need to be coupled with a metadata repository (since VTL is a metadata-driven language), but in some cases they can work in a standalone way. Environments and repositories can be combined in various ways: for example, you could have datasets in R and in CSV and use them together, possibly leveraging metadata coming from an SDMX REST provider.

The data environments that are available out-of-the-box are:

  • The R Environment: it manages data.frames contained in the R Global Environment (available only if the engine is launched from R). This environment can work in a standalone way or paired with a JSON metadata repository.
  • The CSV Environment: it manages CSV files contained in a specific directory. It must be paired to a JSON metadata repository.
  • The SDMX Environment: it is a URL pointing to an SDMX REST web service. It must be paired to an SDMX metadata repository.
  • The Spark Environment: it manages datasets contained in a local or remote Spark installation. Refer to the Spark section of this wiki for details.

The metadata repositories that are available out-of-the-box are:

  • The JSON Repository: it's a JSON file that contains all the structural information about the datasets. It must be compiled according to the Technical Notes agreed upon by the VTL community.
  • The SDMX Repository: it is a URL pointing to an SDMX REST web service.
  • The SDMX & JSON combined Repository: it's a combination of the two repository types, useful to exploit the rich metadata contained in an SDMX provider together with the flexibility of the JSON repository.

Furthermore, the In-Memory Environment and In-Memory Repository are used by the engine to store the results (final and intermediate) of the transformation steps.


The R Environment

If you are working with the VTL E&E inside an R session (i.e. using the RVTL package), the engine will be able to use the data.frames that are stored in the R Global Environment by default. This environment can be used in two different ways:

  1. Standalone: the data.frames are seamlessly available in the transformations, provided that they carry an attribute named identifiers listing the identifier components and an attribute named measures listing the measure components. In standalone mode the data.frame variables (the VTL components) can be of type Integer, String, Numeric, Boolean or Date. Domains (codelists) are not supported.

  2. Coupled with the JSON metadata repository: in this case the data.frames have to be listed in the JSON metadata repository (see the section on metadata repositories for details). There are no limitations on the data types.

Standalone R env Example

Open RStudio and load the RVTL package. Then create a data.frame as in the following code:

library(RVTL)
ds_1 = data.frame(l = letters,                                        # string component
                  n = as.integer(1:length(letters)),                  # integer component
                  m = rnorm(n = length(letters), mean = 0),           # numeric component
                  b = rnorm(n = length(letters), mean = 0) > 0,       # boolean component
                  d = as.Date('2020-01-01') + 1:length(letters))      # date component
attr(ds_1, 'identifiers') <- c('l', 'n')  # tells the VTL engine which components are identifiers
attr(ds_1, 'measures') <- c('m')          # tells the VTL engine which components are measures
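
Before launching the editor you can optionally double-check, with plain base R (nothing here is specific to RVTL), that the attributes are in place and that the column classes correspond to the supported types listed above:

attributes(ds_1)[c('identifiers', 'measures')]   # the two attributes read by the VTL engine
sapply(ds_1, class)                              # character, integer, numeric, logical, Date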

Now you can launch vtlStudio and use the editor to write a transformation that uses the ds_1 dataset as input, as in the following example:

ds_2:=ds_1;
ds_3:=ds_1 + ds_2;

Just compile the transformation and check the output in the Dataset, Structure and Graph explorers.

R env with JSON metadata Example

In RStudio create a simple data.frame. In this case there is no need to specify the identifiers and measures attributes.

library(RVTL)
ds_1 = data.frame(Id_1 = c(10L, 10L, 11L, 11L),
                  Id_2 = c('A', 'B', 'A', 'B'),
                  Me_1 = c(5L, 2L, 3L, 4L),
                  Me_2 = c(5.0, 10.5, 12.2, 20.3))

To enable the JSON repository containing the structural information about the data.frames, select it in the Metadata Repository choice list and set the URL to the JSON file (note the forward slashes and the file:/// prefix).
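
For example, if the ex_1.json file shown below were saved in C:\vtl (a hypothetical location), the URL to enter would be:

file:///C:/vtl/ex_1.json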


The information about the structure of the dataset will be retrieved from this repository. This is an example of such a JSON file:

ex_1.json

{
  "datasets": [
    {
      "name": "ds_1",
      "structure": "DS_1"
    }
  ],
  "structures": [
    {
      "name": "DS_1",
      "components": [
        {
          "name": "Id_1",
          "role": "Identifier"
        },
        {
          "name": "Id_2",
          "role": "Identifier"
        },
        {
          "name": "Me_1",
          "role": "Measure"
        },
        {
          "name": "Me_2",
          "role": "Measure"
        }
      ]
    }
  ],
  "variables": [
    {
      "name": "Id_1",
      "domain": "integer"
    },
    {
      "name": "Id_2",
      "domain": "string"
    },
    {
      "name": "Me_1",
      "domain": "integer"
    },
    {
      "name": "Me_2",
      "domain": "number"
    }
  ]
}
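
If you prefer to generate this file from R instead of writing it by hand, a minimal sketch using the jsonlite package could look like the following (an assumption: jsonlite must be installed, and the output path is hypothetical):

library(jsonlite)

metadata <- list(
  datasets = list(list(name = "ds_1", structure = "DS_1")),
  structures = list(list(
    name = "DS_1",
    components = list(
      list(name = "Id_1", role = "Identifier"),
      list(name = "Id_2", role = "Identifier"),
      list(name = "Me_1", role = "Measure"),
      list(name = "Me_2", role = "Measure")
    )
  )),
  variables = list(
    list(name = "Id_1", domain = "integer"),
    list(name = "Id_2", domain = "string"),
    list(name = "Me_1", domain = "integer"),
    list(name = "Me_2", domain = "number")
  )
)

# auto_unbox turns length-one vectors into JSON scalars, matching the file shown above
write_json(metadata, "C:/vtl/ex_1.json", auto_unbox = TRUE, pretty = TRUE)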

In the editor write a simple transformation that uses ds_1 as input:

ds_2:=ds_1;
ds_3:=ds_1 + ds_2;

Just compile the transformation and check the output in the Dataset, Structure and Graph explorers.

The CSV Environment

The CSV environment lets users work, in their transformations, with very simple CSV files stored in a local directory. This environment works tightly coupled with the JSON repository (see the wiki section dedicated to metadata repositories for details).

CSV Env Example

The CSV environment is not available by default: you have to explicitly enable it in the Engine Settings panel, by moving its label from the Available to the Active box.


Once enabled, the environment directory must be configured in the Environment Properties box. Select the environment from the choice list and enter the directory where your CSV files are stored (note the back slashes).
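
For example, on Windows the value could be (a hypothetical path, adjust it to your machine):

C:\vtl\csv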


To enable the JSON repository containing all the structural information about the CSV datasets, select it in the Metadata Repository choice list and set the URL to the JSON file (note the forward slashes and the file:/// prefix).


Now you can write a transformation that uses the names of the CSV files as dataset names. As an example, you can use the following CSV and JSON files (just copy them into the directories you configured in your engine):

ds_1.csv

Id_1,Id_2,Me_1,Me_2
10,A,5,5.0
10,B,2,10.5
11,A,3,12.2
11,B,4,20.3
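
If you want to create this file from R rather than copy it by hand, a minimal sketch could be the following (the target path is hypothetical, and numeric formatting may differ slightly, e.g. 5.0 written as 5):

ds_1 <- data.frame(Id_1 = c(10L, 10L, 11L, 11L),
                   Id_2 = c('A', 'B', 'A', 'B'),
                   Me_1 = c(5L, 2L, 3L, 4L),
                   Me_2 = c(5.0, 10.5, 12.2, 20.3))
write.csv(ds_1, 'C:/vtl/csv/ds_1.csv', row.names = FALSE, quote = FALSE)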

ex_1.json

{
  "datasets": [
    {
      "name": "ds_1",
      "structure": "DS_1",
      "source": "csv:ds_1.csv"
    }
  ],
  "structures": [
    {
      "name": "DS_1",
      "components": [
        {
          "name": "Id_1",
          "role": "Identifier"
        },
        {
          "name": "Id_2",
          "role": "Identifier"
        },
        {
          "name": "Me_1",
          "role": "Measure"
        },
        {
          "name": "Me_2",
          "role": "Measure"
        }
      ]
    }
  ],
  "variables": [
    {
      "name": "Id_1",
      "domain": "integer"
    },
    {
      "name": "Id_2",
      "domain": "string"
    },
    {
      "name": "Me_1",
      "domain": "integer"
    },
    {
      "name": "Me_2",
      "domain": "number"
    }
  ]
}

The transformation could be:

ds_2 := ds_1;
ds_3 := ds_1 + ds_2;

Just compile the transformation and check the output in the Dataset, Structure and Graph explorers.

The SDMX Environment

The SDMX Environment and Repository make it possible to use data and metadata coming from an SDMX REST provider. Data and metadata can be configured to use different providers.

SDMX Env example

The SDMX environment is not available by default: you have to explicitly enable it in the Engine Settings panel, by moving its label from the Available to the Active box.


Once enabled, the SDMX data provider URL must be configured in the Environment Properties box. Select the environment from the choice list and enter the URL of the provider.


To enable the SDMX repository containing all the structural information, select it in the Metadata Repository choice list and set the URL to the SDMX metadata provider.


In this example we use the BIS public SDMX provider (https://stats.bis.org/api/v1). Always remember to configure the proxy in the Network Settings panel, if needed.

A simple example transformation using the Italian effective exchange rates from the BIS could be (the dataset is referenced with the agency:dataflow(version)/series-key naming pattern):

ds:='BIS:WS_EER(1.0)/D.N.B.IT';

Just compile the transformation and check the output in the Dataset, Structure and Graph explorers.