Access rejustify with Python
Python is a natural choice for data science, and rejustify supports it with a dedicated module that can be installed directly from PyPI. The development version is available on our GitHub page. For stability reasons, we recommend Python >= 3.6. Although the current release is still labelled beta, it supports all communication with the rejustify API, and we are working to add more functions to the module.
Installation is quick and requires an e-mail address and a token, both of which you get by creating a free account. The free account offers basic access to the rejustify API; the most practical functionality is unlocked with premium or enterprise access.
The guide below demonstrates the basic installation process and the most commonly used rejustify API functions. Every function is documented in the module's built-in help.
1. Installation
The rejustify Python module can be installed directly from PyPI, either with pip3
pip3 install rejustify
or with the pip command, if it is linked to a Python >= 3.6 release
pip install rejustify
API authorization is based on e-mail and token authentication. The e-mail is the primary e-mail of the account. Each account is assigned a unique token, which is shown after signing in.
Once the module is installed, you can follow the standard Python syntax. As the access is designed mostly for DataFrame objects, it is useful to load the pandas module alongside rejustify.
import rejustify
import pandas as pd

# configure the connection, then register the account credentials
rejustify.setCurl()
rejustify.register(token = "YOUR_TOKEN", email = "YOUR_EMAIL")
All set, you're ready to access the rejustify engine!
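Hard-coding the token in a script makes it easy to leak. A minimal sketch, assuming the credentials are stored in two environment variables whose names (REJUSTIFY_EMAIL and REJUSTIFY_TOKEN) are our own choice and not part of the module, could look as follows. The helper load_credentials() is hypothetical; only the register() call with its token and email arguments comes from this guide.

```python
import os

def load_credentials(env=None):
    """Read the rejustify e-mail and token from environment variables.

    The variable names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    missing = [k for k in ("REJUSTIFY_EMAIL", "REJUSTIFY_TOKEN") if not env.get(k)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {"email": env["REJUSTIFY_EMAIL"], "token": env["REJUSTIFY_TOKEN"]}

# Usage (requires the installed module and a valid account):
# import rejustify
# rejustify.register(**load_credentials())
```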
If you are connecting through a proxy server, you can specify the details of the connection directly in the setCurl() function.
rejustify.setCurl(proxy_url = "PROXY_ADDRESS", proxy_port = 8080)
2. Analyze data
analyze() is one of the three core commands of the Python module and gives direct access to the analyze endpoint. It returns the basic characteristics of the data set and suggests data for empty dimensions, which can be used directly by the fill endpoint (see below). analyze() accepts any DataFrame representation of the data. By default it expects vertical, i.e. column-oriented, data.
In the example below we create a mock data set with three columns with the headers date, country and covid cases. The first two contain basic date and country dimensions, whereas the last column is empty: this is the column that will be rejustified.
# sample data set
df = pd.DataFrame()
df['date'] = pd.date_range('2020-06-01', periods=4).strftime("%Y-%m-%d")
df['country'] = ['Poland'] * 4
df['covid cases'] = ""
st = rejustify.analyze(df)
analyze() returns three important pieces of information:
- descriptive details: dimension id, column id, dimension name and if a column is empty or not,
- basic properties of the dimensions: classes, features, cleaners, formats and the corresponding classification accuracy,
- resources which fit best into the empty dimensions of the data based on the data structure and history.
Currently, rejustify distinguishes between 6 classes: general, geography, unit, time, sector and number. They describe the basic characteristics of the values and are used to propose the best transformations and matching methods for data reconciliation in the fill endpoint. Classes are refined by features, which describe these characteristics in greater detail; for example, class geography may be refined by feature country. Cleaner contains the basic set of transformations applied to each value in a dimension to obtain a machine-readable representation. For instance, the values y1999, y2000, ... clearly correspond to years, but they are processed much faster once the initial y character is stripped, which the cleaner ^y achieves. Cleaner accepts basic regular expressions. Finally, format describes the format of the values and is particularly useful for time-series operations. Format accepts the standard date formats.
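To make the roles of cleaner and format concrete, the sketch below reproduces them locally with Python's standard library: a regular expression plays the role of the cleaner ^y, and a strptime pattern plays the role of the format %Y. It only illustrates the idea; the actual processing happens inside the rejustify engine, and the helper clean_and_parse() is hypothetical.

```python
import re
from datetime import datetime

def clean_and_parse(values, cleaner, fmt):
    """Strip the cleaner pattern from each value, then parse it with fmt."""
    pattern = re.compile(cleaner)
    cleaned = [pattern.sub("", v) for v in values]
    return [datetime.strptime(v, fmt) for v in cleaned]

years = clean_and_parse(["y1999", "y2000"], cleaner=r"^y", fmt="%Y")
print([d.year for d in years])  # [1999, 2000]
```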
For instance, the data from the example above can be assigned the following structure (in the form of a DataFrame object).
   id  column         name  empty      class  feature  cleaner    format  p_class  provider  table  p_data
0   1       1         date      0       time      day       NA  %Y-%m-%d   1.0000        NA     NA      NA
1   2       2      country      0  geography  generic       NA        NA   0.4000        NA     NA      NA
2   3       3  covid cases      1  geography  country       NA        NA   0.3636       IMF    WEO  0.7664
3. Adjust structure
It is possible that the rejustify engine will not provide the suggestions the user had in mind. While each element of the structure can be adjusted manually, adjust() offers an intuitive and tidy way of changing one or several elements in one line. The function expects the dimensions to be changed, identified by either dimension id or column id (if both are given, the latter takes precedence), and the items to be replaced (in the form of a dictionary).
For the example above, class geography with feature country is not the most precise description of the covid cases dimension. Let's make it more accurate by changing it to the somewhat wider class general with no defined feature.
st = rejustify.adjust(st, id = 3, items = {'class':'general', 'feature': None})
   id  column         name  empty      class  feature  cleaner    format  p_class  provider  table  p_data
0   1       1         date      0       time      day       NA  %Y-%m-%d      1.0        NA     NA      NA
1   2       2      country      0  geography  generic       NA        NA      0.4        NA     NA      NA
2   3       3  covid cases      1    general     None       NA        NA     -1.0       IMF    WEO  0.7664
Similarly, one may want to switch from provider/table IMF/WEO to the ECDC data on COVID-19 (read more about our COVID-19 data support), available in the REJUSTIFY/COVID-19-ECDC table. The full list of our resources, including data providers and tables, can be found in our repository browser.
st = rejustify.adjust(st, column = 3, items = {'provider': 'REJUSTIFY', 'table': 'COVID-19-ECDC'})
   id  column         name  empty      class  feature  cleaner    format  p_class   provider          table  p_data
0   1       1         date      0       time      day       NA  %Y-%m-%d      1.0         NA             NA      NA
1   2       2      country      0  geography  generic       NA        NA      0.4         NA             NA      NA
2   3       3  covid cases      1    general     None       NA        NA     -1.0  REJUSTIFY  COVID-19-ECDC    -1.0
After a change in the structure, the corresponding p_class or p_data is set to -1. This informs the API that the original structure has changed; if the learning option is enabled, the new values will be used to train the AI algorithms. If learn=False, the information is not stored by the API, but the changes are still recognized in the current API call.
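The bookkeeping described above can be mimicked on a plain list of dictionaries. The sketch below is not the module's implementation: adjust_local() is a hypothetical helper, and the split of items into classification fields (resetting p_class) and resource fields (resetting p_data) is an assumption inferred from the two examples.

```python
CLASS_ITEMS = {"class", "feature", "cleaner", "format"}  # assumed to reset p_class
DATA_ITEMS = {"provider", "table"}                        # assumed to reset p_data

def adjust_local(structure, items, id=None, column=None):
    """Return a copy of structure with items replaced in the selected rows."""
    # As in the guide, the column id takes precedence when both are given.
    key, value = ("column", column) if column is not None else ("id", id)
    adjusted = []
    for row in structure:
        row = dict(row)  # work on a copy, leave the input untouched
        if row[key] == value:
            row.update(items)
            if CLASS_ITEMS.intersection(items):
                row["p_class"] = -1
            if DATA_ITEMS.intersection(items):
                row["p_data"] = -1
        adjusted.append(row)
    return adjusted

st_local = [
    {"id": 3, "column": 3, "name": "covid cases", "class": "geography",
     "feature": "country", "p_class": 0.3636,
     "provider": "IMF", "table": "WEO", "p_data": 0.7664},
]
st_local = adjust_local(st_local, {"class": "general", "feature": None}, id=3)
st_local = adjust_local(st_local, {"provider": "REJUSTIFY", "table": "COVID-19-ECDC"}, column=3)
print(st_local[0]["p_class"], st_local[0]["p_data"])  # -1 -1
```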
4. Fill the data
fill() fills the missing data in each empty dimension of the original data set with data points delivered by provider/table. The rejustify engine refers to the submitted data set as x and to any server-side data set as y. The corresponding structures follow the same convention, for instance structure.x and structure.y. The principal rule of any data manipulation is never to change data x (except for missing values), but only to adjust y.
There are two main elements which need to be defined for this process:
- matching keys - which elements from x and y are to be matched together and which matching method is the most appropriate,
- default values for non-matched dimensions of y.
If not defined explicitly, the engine will suggest the best-fitting matching keys and default values.
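The interplay of the two elements can be sketched on toy data. In the example below, which is our own illustration and not the engine's algorithm, the rows of y are first restricted to the default value of the non-matched dimension and then matched to x on the key dimensions with exact matching; only the empty cells of x are filled. The helper fill_local(), the field names of y and the placeholder row are hypothetical, and the numbers are toy values.

```python
def fill_local(x, y, keys, defaults, target, source):
    """Fill empty target cells of x with source values from matching rows of y."""
    # 1. Lock the non-matched dimensions of y to their default values.
    y_locked = [r for r in y if all(r[d] == v for d, v in defaults.items())]
    # 2. Index y by the matching keys (exact matching on pairs of x/y names).
    lookup = {tuple(r[ky] for _, ky in keys): r[source] for r in y_locked}
    filled = []
    for row in x:
        row = dict(row)  # never modify x in place
        if row[target] in ("", None):  # only missing values are replaced
            row[target] = lookup.get(tuple(row[kx] for kx, _ in keys), row[target])
        filled.append(row)
    return filled

x = [{"date": "2020-06-01", "country": "Poland", "covid cases": ""},
     {"date": "2020-06-02", "country": "Poland", "covid cases": ""}]
y = [{"time": "2020-06-01", "geo": "Poland", "indicator": "cases", "value": 215},
     {"time": "2020-06-02", "geo": "Poland", "indicator": "cases", "value": 379},
     {"time": "2020-06-01", "geo": "Poland", "indicator": "other", "value": 0}]  # placeholder row

result = fill_local(x, y,
                    keys=[("date", "time"), ("country", "geo")],
                    defaults={"indicator": "cases"},
                    target="covid cases", source="value")
print([r["covid cases"] for r in result])  # [215, 379]
```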
To continue the example from above, fill() needs at least the original data set and structure.
rdf = rejustify.fill(df, st)
The rejustified data set is returned in the element
rdf['data']
In effect, fill() replaces the missing values in the covid cases column with ECDC COVID-19 figures for Poland for the first days of June 2020. Which indicator is used is determined by the default specification (Concept Indicator: Newly reported cases).
Original data set
         date  country  covid cases
0  2020-06-01   Poland
1  2020-06-02   Poland
2  2020-06-03   Poland
3  2020-06-04   Poland
Rejustified data set
         date  country  covid cases
0  2020-06-01   Poland          215
1  2020-06-02   Poland          379
2  2020-06-03   Poland          230
3  2020-06-04   Poland          292
4.1. Matching keys
The elements in keys are determined for each empty column, based on the information in data x and y. The details of both data structures can be inspected through the structure.x and structure.y elements of the object returned by fill(). In our example the latter corresponds to the REJUSTIFY/COVID-19-ECDC table, which is used to fill the missing values in column.id.x=3.
rdf['structure.y']
{'column.id.x': [3],
'structure.y': [id column name empty class feature format p_class
1 1 Time Dimension 0 time day %Y-%m-%d 1
2 2 Geo Id 0 geography country NA 1
3 3 Concept Indicator 0 general NA NA 1
4 4 Primary Measure 0 numeric generic %Y 1]}
Matching keys are given consecutively, i.e. the first elements in id.x and name.x correspond to the first elements in id.y and name.y. Dimension names are included for better readability of the results; the engine does not need them.
rdf['keys']
[{'id.x': [1, 2],
'name.x': ['date', 'country'],
'id.y': [1, 2],
'name.y': ['Time Dimension', 'Geo Id'],
'class': ['time', 'geography'],
'method': ['time-matching', 'synonym-matching'],
'column.id.x': 3,
'column.name.x': 'covid cases'}]
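Because the keys are positional, pairing them up is a matter of zipping the lists. A short sketch, with the keys object printed above copied in as a literal so that it runs without an API call:

```python
keys = [{'id.x': [1, 2],
         'name.x': ['date', 'country'],
         'id.y': [1, 2],
         'name.y': ['Time Dimension', 'Geo Id'],
         'class': ['time', 'geography'],
         'method': ['time-matching', 'synonym-matching'],
         'column.id.x': 3,
         'column.name.x': 'covid cases'}]

pairs = []
for key in keys:
    # The i-th entries of name.x, name.y and method belong together.
    for name_x, name_y, method in zip(key['name.x'], key['name.y'], key['method']):
        pairs.append((key['column.name.x'], name_x, name_y, method))
        print(f"{key['column.name.x']}: {name_x} <-> {name_y} via {method}")
```

For the example this prints one line per matched dimension: date is paired with Time Dimension through time-matching, and country with Geo Id through synonym-matching.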
Currently, the rejustify engine supports 6 matching methods: synonym-proximity-matching, synonym-matching, proximity-matching, time-matching, exact-matching and value-selection, listed in decreasing order of complexity. synonym-proximity-matching measures the proximity between the values in data x and y and the corresponding values in the rejustify dictionary. If the proximity is above the accuracy threshold and values in x and y point to the same element in the dictionary, the values are matched. synonym-matching and proximity-matching follow similar logic, each using only one of the two steps described for synonym-proximity-matching. time-matching standardizes the time values to the same format before matching. To work properly, it requires an accurate characterization of the date format in structure.x (structure.y is already classified by rejustify). exact-matching matches two values only if they are identical. value-selection is a quasi-matching method which, for a single-valued dimension in x, returns a single value from y, as suggested by the default specification. It is the most efficient matching type for dimensions that show no variability.
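The ideas behind some of these methods can be illustrated locally. The sketch below is an approximation under our own assumptions and not the engine's implementation: time_match() parses both sides with their declared formats before comparing, and proximity_match() uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure with a hypothetical threshold of 0.8.

```python
from datetime import datetime
from difflib import SequenceMatcher

def time_match(value_x, fmt_x, value_y, fmt_y):
    """Match two time values after standardizing them with their formats."""
    return datetime.strptime(value_x, fmt_x) == datetime.strptime(value_y, fmt_y)

def proximity_match(value_x, value_y, threshold=0.8):
    """Match two strings if their similarity ratio reaches the threshold."""
    ratio = SequenceMatcher(None, value_x.lower(), value_y.lower()).ratio()
    return ratio >= threshold

print(time_match("2020-06-01", "%Y-%m-%d", "01/06/2020", "%d/%m/%Y"))  # True
print(proximity_match("Poland", "Polland"))  # True
print("Poland" == "Polland")  # False: exact-matching would reject this pair
```

The first call shows why the date format in structure.x matters: with a wrong format the two values could not be brought to a common representation.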
4.2. Default values
Default values are used to lock the dimensions in y that are not used for matching against x. Each empty column to be filled, identified by column.id.x, must come with a description of the default values. The default values include both codes and labels, but only the codes are relevant for the rejustify engine. The object rdf['default'] lists all default values; however, if a dimension is used for matching, its default value is ignored and all values of the dimension are used for matching instead.
rdf['default']
{'column.id.x': [3],
'default': [ code_default label_default
Time Dimension latest Label not available
Geo Id DE Germany
Concept Indicator cases Newly reported cases
Primary Measure None Label not available]}
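For the example at hand, the rule that matched dimensions ignore their defaults can be spelled out in a few lines. The sketch copies the default codes and the matched y dimensions from the outputs shown in this guide into plain Python objects (it does not read them from rdf) and keeps only the defaults that remain in force.

```python
# Default codes copied from the rdf['default'] printout above.
default_codes = {
    'Time Dimension': 'latest',
    'Geo Id': 'DE',
    'Concept Indicator': 'cases',
    'Primary Measure': None,
}
# Dimensions of y used for matching (name.y in the rdf['keys'] printout).
matched_y = ['Time Dimension', 'Geo Id']

# Defaults of matched dimensions are ignored; the rest lock their dimension.
effective = {dim: code for dim, code in default_codes.items() if dim not in matched_y}
print(effective)  # {'Concept Indicator': 'cases', 'Primary Measure': None}
```

This is why the default Geo Id of DE does not prevent the example from returning figures for Poland: the country dimension is matched against x, so its default is not applied.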