Access rejustify with Python


Without doubt, Python is great for data science! Rejustify supports Python with its custom-made module, which can be installed directly from PyPI. You can find the development version on our GitHub page. For stability reasons, we encourage you to use Python >= 3.6. While we still dub the current release a beta, it supports all communication with the rejustify API. And of course, we're working hard to add more functions to the module.

The quick and easy installation process requires an e-mail address and a token, which you can get by creating a free account. While the free account offers basic access to the rejustify API, the most practical functionality can be unlocked with premium or enterprise access.

The guide below demonstrates the basic package installation process and the most commonly used rejustify API functions. Every function is also documented in the help text of the module.

1. Installation

The rejustify Python module can be installed directly from PyPI. You can install it either with pip3

pip3 install rejustify

or with the pip command, if it is linked to a Python >= 3.6 release

pip install rejustify
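If you are unsure which interpreter a command is linked to, a quick version check can help. The snippet below is a minimal sketch using only the standard library; the helper name meets_minimum is illustrative and is not part of the rejustify module.

```python
import sys

MINIMUM = (3, 6)  # minimum version recommended in this guide


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report the version of the interpreter that runs this script.
status = "OK" if meets_minimum() else "older than recommended"
print("Python", sys.version.split()[0], "is", status)
```

Running the installer as python -m pip install rejustify with that same interpreter ensures the module ends up in the environment you just checked.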

The API authorization is achieved through e-mail and token authentication. The e-mail corresponds to the primary e-mail of the account. Each account is assigned a unique token, which can be seen after signing in.

Once installed, you can follow the standard Python syntax. As the access is designed mostly for DataFrame objects, it is also useful to load the pandas module alongside rejustify.

import rejustify
import pandas as pd
# configure the connection, then register your credentials
rejustify.setCurl()
rejustify.register(token = "YOUR_TOKEN", email = "YOUR_EMAIL")
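If you prefer not to hard-code the token in a script, one option is to read it from environment variables. This is only a sketch: the variable names REJUSTIFY_TOKEN and REJUSTIFY_EMAIL and the helper load_credentials are hypothetical and are not defined by the rejustify module.

```python
import os


def load_credentials(env=None):
    """Read the token and e-mail from environment variables.

    REJUSTIFY_TOKEN and REJUSTIFY_EMAIL are hypothetical names chosen for
    this example; set them in your shell before running the script.
    """
    env = os.environ if env is None else env
    names = ("REJUSTIFY_TOKEN", "REJUSTIFY_EMAIL")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {"token": env["REJUSTIFY_TOKEN"], "email": env["REJUSTIFY_EMAIL"]}


# Demonstration with an explicit mapping instead of the real environment.
creds = load_credentials({"REJUSTIFY_TOKEN": "YOUR_TOKEN",
                          "REJUSTIFY_EMAIL": "YOUR_EMAIL"})
print(sorted(creds))
```

The returned values can then be passed to register() as token and email, exactly as in the call shown above.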

All set, you're ready to access the rejustify engine!

If you're connecting through a proxy server, you may specify the details of the connection directly in the setCurl() function.

rejustify.setCurl(proxy_url = "PROXY_ADDRESS", proxy_port = 8080)

2. Analyze data

analyze() is one of the three core commands of the Python module and offers direct access to the analyze endpoint. Its goal is to describe the basic characteristics of the data set and to suggest data for empty dimensions, which can be used directly by the fill endpoint (see below). analyze() accepts any DataFrame representation of the data. By default, it expects vertical, i.e. column-oriented, data.

In the example below, we create a mock data set with three columns with the headers date, country and covid cases. The first two contain basic date and country dimensions, whereas the last column is empty: this is the column that will be rejustified.

# sample data set
df = pd.DataFrame()
df['date'] = pd.date_range('2020-06-01', periods=4).strftime("%Y-%m-%d")
df['country'] = ['Poland'] * 4
df['covid cases'] = ""
st = rejustify.analyze(df)

analyze() returns three important pieces of information:

descriptive details: dimension id, column id, dimension name and if a column is empty or not,
basic properties of the dimensions: classes, features, cleaners, formats and the corresponding classification accuracy,
resources which fit best into the empty dimensions of the data based on the data structure and history.

Currently, rejustify distinguishes between 6 classes: general, geography, unit, time, sector and number. They describe the basic characteristics of the values and are used to propose the best transformations and matching methods for data reconciliation in the fill endpoint. Classes are complemented by features, which describe these characteristics in greater detail; for example, class geography may be further specified by feature country. Cleaner contains the basic set of transformations applied to each value in a dimension to retrieve a machine-readable representation. For instance, the values y1999, y2000, ... clearly correspond to years, but they will be processed much faster if stripped of the initial y character, which the cleaner ^y matches. Cleaner allows basic regular expressions. Finally, format corresponds to the format of the values and is particularly useful for time-series operations. Format allows the standard date formats.
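To make the cleaner and format ideas concrete, here is a small local illustration with the standard library. It assumes that a cleaner removes whatever the regular expression matches and that formats follow the usual strptime conventions; how the engine applies them server-side is not shown here, and the helper clean is hypothetical.

```python
import re
from datetime import datetime


def clean(value, cleaner):
    """Remove every match of the `cleaner` regular expression from `value`."""
    return re.sub(cleaner, "", value)


# The cleaner ^y strips the leading y, leaving machine-readable years.
raw = ["y1999", "y2000", "y2001"]
cleaned = [clean(value, r"^y") for value in raw]
print(cleaned)  # ['1999', '2000', '2001']

# The format %Y-%m-%d describes values such as those in the date column.
parsed = datetime.strptime("2020-06-01", "%Y-%m-%d")
print(parsed.date())  # 2020-06-01
```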

For instance, the data from the example above can be assigned the following structure (in the form of a DataFrame object).

   id  column         name  empty      class  feature cleaner    format  p_class provider table  p_data
0   1       1         date      0       time      day      NA  %Y-%m-%d   1.0000       NA    NA      NA
1   2       2      country      0  geography  generic      NA        NA   0.4000       NA    NA      NA
2   3       3  covid cases      1  geography  country      NA        NA   0.3636      IMF   WEO  0.7664
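Since the structure is a DataFrame, ordinary pandas selection is enough to inspect it. The sketch below rebuilds part of the table above by hand as a stand-in for the result of analyze(), assuming the column names are exactly as printed; the 0.5 cut-off is an arbitrary choice for the example.

```python
import pandas as pd

# Hand-built stand-in for the structure returned by analyze().
st = pd.DataFrame({
    "id": [1, 2, 3],
    "column": [1, 2, 3],
    "name": ["date", "country", "covid cases"],
    "empty": [0, 0, 1],
    "class": ["time", "geography", "geography"],
    "feature": ["day", "generic", "country"],
    "p_class": [1.0, 0.4, 0.3636],
    "provider": [None, None, "IMF"],
    "table": [None, None, "WEO"],
    "p_data": [None, None, 0.7664],
})

# Dimensions that the fill endpoint would try to fill.
to_fill = st[st["empty"] == 1]
print(to_fill[["id", "name", "provider", "table", "p_data"]])

# Dimensions whose classification looks uncertain and may need adjusting.
uncertain = st.loc[st["p_class"] < 0.5, "name"].tolist()
print(uncertain)
```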

3. Adjust structure

It is possible that the rejustify engine will not provide the suggestions you had in mind initially. While each element of the structure can be adjusted manually, adjust() offers an intuitive and tidy way of changing single or multiple elements in one line. The function expects the dimensions to be changed, identified by either dimension id or column id (if both are given, the latter takes precedence), and the items to be replaced (in the form of a dictionary).

For the example above, class geography and feature country are not the most precise representation of the covid cases dimension. Let's make it more accurate by changing it to the somewhat wider class general with no defined feature.

st = rejustify.adjust(st, id = 3, items = {'class':'general', 'feature': None})

   id  column         name  empty      class  feature cleaner    format  p_class provider table  p_data
0   1       1         date      0       time      day      NA  %Y-%m-%d      1.0       NA    NA      NA
1   2       2      country      0  geography  generic      NA        NA      0.4       NA    NA      NA
2   3       3  covid cases      1    general     None      NA        NA     -1.0      IMF   WEO  0.7664

Similarly, one may want to switch from provider/table IMF/WEO to the ECDC data on COVID-19 (read more about our COVID-19 data support), given in the REJUSTIFY/COVID-19-ECDC table. The full list of our resources, including data providers and tables, can be found in our repository browser.

st = rejustify.adjust(st, column = 3, items = {'provider': 'REJUSTIFY', 'table': 'COVID-19-ECDC'})

   id  column         name  empty      class  feature cleaner    format  p_class   provider          table  p_data
0   1       1         date      0       time      day      NA  %Y-%m-%d      1.0         NA             NA      NA
1   2       2      country      0  geography  generic      NA        NA      0.4         NA             NA      NA
2   3       3  covid cases      1    general     None      NA        NA     -1.0  REJUSTIFY  COVID-19-ECDC    -1.0

When the structure changes, the corresponding p_class or p_data is set to -1. This informs the API that the original structure has changed and, if the learning option is enabled, the new values will be used to train the AI algorithms. If learn=False, the information will not be stored by the API, but the changes will still be recognized in the current API call.
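The bookkeeping described here can be mimicked locally to see what to expect. The function below is a plain-Python stand-in working on a list of dictionaries; it is not the module's implementation. It assumes that class, feature, cleaner and format reset p_class, and that provider and table reset p_data, which matches the tables above but is otherwise a guess.

```python
CLASS_ITEMS = {"class", "feature", "cleaner", "format"}  # assumed to reset p_class
DATA_ITEMS = {"provider", "table"}                       # assumed to reset p_data


def adjust_sketch(structure, items, id=None, column=None):
    """Return a copy of `structure` with `items` replaced in one dimension.

    `column` takes precedence over `id`, and the probability of every
    changed group is set to -1 to flag a manual change.
    """
    key, target = ("column", column) if column is not None else ("id", id)
    result = []
    for row in structure:
        row = dict(row)  # copy, so the input stays untouched
        if row[key] == target:
            row.update(items)
            if CLASS_ITEMS.intersection(items):
                row["p_class"] = -1
            if DATA_ITEMS.intersection(items):
                row["p_data"] = -1
        result.append(row)
    return result


st = [
    {"id": 2, "column": 2, "class": "geography", "feature": "generic",
     "p_class": 0.4, "provider": None, "table": None, "p_data": None},
    {"id": 3, "column": 3, "class": "geography", "feature": "country",
     "p_class": 0.3636, "provider": "IMF", "table": "WEO", "p_data": 0.7664},
]
st = adjust_sketch(st, {"class": "general", "feature": None}, id=3)
st = adjust_sketch(st, {"provider": "REJUSTIFY", "table": "COVID-19-ECDC"}, column=3)
print(st[-1]["p_class"], st[-1]["p_data"])  # -1 -1
```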

4. Fill the data

fill() aims to fill the missing data for each empty dimension in the original data set with data points delivered by provider/table. The rejustify engine refers to the submitted data set as x and to any server-side data set as y. The corresponding structures follow the same convention, for instance structure.x and structure.y. The principal rule of any data manipulation is never to change data x (except for missing values), but only to adjust y.

There are two main elements which need to be defined for this process:

matching keys - which elements from x and y are to be matched together and which matching method is the most appropriate,
default values for non-matched dimensions of y.

If not defined explicitly, the engine will suggest the best-fitting matching keys and default values.

To continue the example from above, fill() needs at least the original data set and structure.

rdf = rejustify.fill(df, st)

The rejustified data set is returned in the element

rdf['data']

In effect, fill() replaces the missing values in the covid cases column with ECDC COVID-19 figures for Poland for the first days of June 2020. The default specification determines which indicator is used (Concept Indicator: Newly reported cases).

Original data set

         date country  covid cases
0  2020-06-01  Poland
1  2020-06-02  Poland
2  2020-06-03  Poland
3  2020-06-04  Poland

Rejustified data set

         date country  covid cases
0  2020-06-01  Poland          215
1  2020-06-02  Poland          379
2  2020-06-03  Poland          230
3  2020-06-04  Poland          292
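The rule that data x is never changed except for missing values can be checked with plain pandas. In this sketch the rejustified frame is typed in from the table above as a stand-in for rdf['data'], assuming it keeps the same column names as the submitted data set.

```python
import pandas as pd

dates = pd.date_range("2020-06-01", periods=4).strftime("%Y-%m-%d")
original = pd.DataFrame({"date": dates,
                         "country": ["Poland"] * 4,
                         "covid cases": [""] * 4})

# Values copied from the rejustified table, standing in for rdf['data'].
rejustified = original.copy()
rejustified["covid cases"] = [215, 379, 230, 292]

# Non-empty columns of x must come back untouched.
unchanged = original[["date", "country"]].equals(rejustified[["date", "country"]])

# Previously empty cells should now hold values.
filled = (original["covid cases"] == "") & rejustified["covid cases"].notna()

print(unchanged, int(filled.sum()))  # True 4
```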

4.1. Matching keys

The elements in keys are determined for each empty column based on the information provided in data x and y. The details of both data structures can be inspected through the structure.x and structure.y elements of the object returned by fill(). In our example, the latter corresponds to the REJUSTIFY/COVID-19-ECDC table, which will be used to fill the missing values in column.id.x=3.

rdf['structure.y']

{'column.id.x': [3],
 'structure.y': [id  column               name  empty      class  feature    format  p_class
                  1       1     Time Dimension      0       time      day  %Y-%m-%d        1
                  2       2             Geo Id      0  geography  country        NA        1
                  3       3  Concept Indicator      0    general       NA        NA        1
                  4       4    Primary Measure      0    numeric  generic        %Y        1]}

Matching keys are given consecutively, i.e. the first elements in id.x and name.x correspond to the first elements in id.y and name.y. Dimension names are given for better readability of the results; they are not required by the engine.

rdf['keys']

[{'id.x': [1, 2],
  'name.x': ['date', 'country'],
  'id.y': [1, 2],
  'name.y': ['Time Dimension', 'Geo Id'],
  'class': ['time', 'geography'],
  'method': ['time-matching', 'synonym-matching'],
  'column.id.x': 3,
  'column.name.x': 'covid cases'}]
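Because the keys are positional, pairing them up is a matter of zipping the lists. A minimal sketch, with the keys object copied from the output above rather than fetched from the API; the helper describe_keys is illustrative only.

```python
# Copied from the rdf['keys'] output above.
keys = [{
    "id.x": [1, 2],
    "name.x": ["date", "country"],
    "id.y": [1, 2],
    "name.y": ["Time Dimension", "Geo Id"],
    "class": ["time", "geography"],
    "method": ["time-matching", "synonym-matching"],
    "column.id.x": 3,
    "column.name.x": "covid cases",
}]


def describe_keys(entry):
    """Pair the i-th x dimension with the i-th y dimension and its method."""
    return list(zip(entry["name.x"], entry["name.y"], entry["method"]))


for entry in keys:
    print("column", entry["column.id.x"], "-", entry["column.name.x"])
    for name_x, name_y, method in describe_keys(entry):
        print("  {} -> {} ({})".format(name_x, name_y, method))
```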

Currently, the rejustify engine supports 6 matching methods: synonym-proximity-matching, synonym-matching, proximity-matching, time-matching, exact-matching and value-selection, listed in decreasing order of complexity. synonym-proximity-matching uses the proximity between the values in data x and y and the corresponding values in the rejustify dictionary. If the proximity is above the accuracy threshold and there are values in x and y pointing to the same element in the dictionary, the values will be matched. synonym-matching and proximity-matching apply similar logic, each using one of the steps described for synonym-proximity-matching. time-matching aims to standardize the time values to the same format before matching. To work properly, it requires an accurate characterization of the date format in structure.x (structure.y is already classified by rejustify). exact-matching will match two values only if they are identical. value-selection is a quasi-matching method which, for a single-valued dimension x, will return a single value from y, as suggested by the default specification. It is the most efficient matching type for dimensions which do not show any variability.
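The following toy functions illustrate the idea behind three of these methods. They are not the engine's algorithms: proximity is approximated with difflib from the standard library, the 0.8 threshold is arbitrary, and no rejustify dictionary is involved, so the synonym-based methods are left out.

```python
from datetime import datetime
from difflib import SequenceMatcher


def exact_match(x, y):
    """exact-matching: values match only if they are identical."""
    return x == y


def time_match(x, fmt_x, y, fmt_y):
    """time-matching: bring both values to a common form, then compare."""
    return datetime.strptime(x, fmt_x) == datetime.strptime(y, fmt_y)


def proximity_match(x, y, threshold=0.8):
    """proximity-matching stand-in: string similarity above a threshold."""
    ratio = SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return ratio >= threshold


print(exact_match("Poland", "Poland"))                                  # True
print(time_match("2020-06-01", "%Y-%m-%d", "01/06/2020", "%d/%m/%Y"))   # True
print(proximity_match("Polnd", "Poland"))                               # True
print(proximity_match("Poland", "Portugal"))                            # False
```

The time_match example also shows why an accurate date format in structure.x matters: with the wrong format string, parsing fails or yields a different date.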

4.2. Default values

Default values are used to lock dimensions in y which will not be used for matching against x. Each empty column to be filled, identified by column.id.x, must have a description of its default values. The default values include both codes and labels; however, only the former are relevant for the rejustify engine. The object rdf['default'] includes all the default values; however, if a dimension is used for matching, its default values are ignored and all dimension values are used for matching instead.

rdf['default']

{'column.id.x': [3],
  'default': [  code_default         label_default
Time Dimension        latest   Label not available
Geo Id                    DE               Germany
Concept Indicator      cases  Newly reported cases
Primary Measure         None   Label not available]}
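To see which defaults actually take effect, one can drop the dimensions that are used for matching. The sketch below uses values copied from the outputs in this guide, stored in a plain dictionary rather than the DataFrame returned by the API; the helper effective_defaults is illustrative only.

```python
# Default codes and labels copied from the rdf['default'] output above.
defaults = {
    "Time Dimension": ("latest", "Label not available"),
    "Geo Id": ("DE", "Germany"),
    "Concept Indicator": ("cases", "Newly reported cases"),
    "Primary Measure": (None, "Label not available"),
}

# y dimensions used for matching (name.y in the matching keys).
matched = ["Time Dimension", "Geo Id"]


def effective_defaults(defaults, matched):
    """Keep default codes only for y dimensions not used for matching."""
    return {dim: code
            for dim, (code, _label) in defaults.items()
            if dim not in matched}


print(effective_defaults(defaults, matched))
# {'Concept Indicator': 'cases', 'Primary Measure': None}
```

This is why the default Geo Id of DE does not affect the example: the country dimension is matched against x, so Poland is used instead.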