Training Dataset Generation Example¶

First we import all the relevant modules. The generator requires two inputs:¶

  • the number of rows required and
  • the path to the file that lists the expected distributions.
In [1]:
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
from trainingdatagenerator import TrainingDataGenerator
In [2]:
REQUIRED_ROWS = 100
In [3]:
distribution_file_path = "Trainings Datafields Structure percentages.xlsx"

After this, we initialise the generator with the number of required rows. The output is a dataframe with identical columns to the original database.¶

In [4]:
training_data_generator = TrainingDataGenerator(REQUIRED_ROWS)
In [5]:
df = training_data_generator.create_list_with_defined_distributions(
        distribution_file_path
)
In [6]:
df.head()
Out[6]:
unique training identification code target group training language related occupation accreditation quality assurance way of teaching supplementary services lowest educational attainment eqf levels ... 3. minimum competence level required for the training 3. completion of the training provides an average level of competence 4. Framework category 4. ID and name of existing competence 4. minimum competence level required for the training 4. completion of the training provides an average level of competence 5. Framework category 5. ID and name of existing competence 5. minimum competence level required for the training 5. completion of the training provides an average level of competence
0 MT7639 no target group can be selected Czech Non-commissioned Armed Forces Officers not accredited Internal QA systems personal attendance Group discussion and activities Primary education 4.0 ... advanced expert NaN NaN NaN NaN DigComp 4.1 Protecting devices (DigComp) 4 4
1 MT1750 civil servants English Health Associate Professionals accredited by a national institution Feedback from participants online at any time Hands-on training Upper secondary education NaN ... basic expert DigComp 2.4 Collaborating through digital technologies... 7 1 NaN NaN NaN NaN
2 MT9661 no target group can be selected English Protective Services Workers accredited by a national institution Internal QA systems online at any time Case studies or other required reading Primary education 4.0 ... B2 B1 NaN NaN NaN NaN NaN NaN NaN NaN
3 MT8889 change residence within the EU Available in any language Refuse Workers and Other Elementary Workers accredited by a national institution Feedback from participants online at any time Lectures Primary education 2.0 ... C1 B1 DigComp 2.2 Sharing through digital technologies (DigC... 6 3 NaN NaN NaN NaN
4 MT3726 women Available in any language Electrical and Electronic Trades Workers not accredited Internal QA systems online at any time Simulation Upper secondary education NaN ... expert expert DigComp 5.2 Identifying needs and technological respon... 1 3 NaN NaN NaN NaN

5 rows × 39 columns

In [7]:
df.shape
Out[7]:
(100, 39)

Below we compare the generated dataframe with original one.¶

In [8]:
original_df = pd.read_excel("MASTER database for matching P_v6 & T_v6__v3_dmlab.xlsx")
In [9]:
original_df.head()
Out[9]:
Matching_T_code name of training target group secondary target group tertiary target group training language country of training NUTS2 related occupation accreditation ... 3. minimum competence level required for the training 3. completion of the training provides an average level of competence 4. Framework category 4. ID and name of existing competence 4. minimum competence level required for the training 4. completion of the training provides an average level of competence 5. Framework category 5. ID and name of existing competence 5. minimum competence level required for the training 5. completion of the training provides an average level of competence
0 M_T001 50 hours online course in German language no target group can be selected NaN NaN NaN NaN NaN NaN not accredited ... A1 B1 CEFRL spoken production (CEFRL) A1 B1 CEFRL writing (CEFRL) A1 B1
1 M_T002 A2 Level Slovak Language Course for Effective ... immigrants from outside the EU NaN NaN Ukrainian Slovakia Slovensko - Bratislavský kraj unclassifiable occupation accredited by a national institution ... A1 A2 CEFRL reading (CEFRL) A1 A2 CEFRL spoken production (CEFRL) A1 A2
2 M_T003 Acquiring basic financial competences for ever... employed unemployed NaN NaN NaN NaN unclassifiable occupation not accredited ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 M_T004 Advanced Digital Marketing Strategies for Mana... employed NaN NaN NaN Poland NaN Managers not accredited ... advanced expert EntreComp Creativity (EntreComp) basic advanced LifeComp S2 communication (LifeComp) advanced advanced
4 M_T005 Advanced Digital Skills for the Digital Age self-employed workers in jobs at risk from digitalisation/au... non-native speakers English NaN NaN unclassifiable occupation accredited by a national institution ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 47 columns

In [10]:
original_df = original_df.rename(columns=lambda x: x.strip())
In [11]:
columns_to_plot = list(set(df.columns) & set(original_df.columns))
In [12]:
fixing_types = [". minimum competence level required for the training", ". completion of the training provides an average level of competence"]

for i in range(1, 6):
    for col in fixing_types:
        df[str(i) + col] = df[str(i) + col].astype(str)
In [13]:
for i in range(1, 6):
    for col in fixing_types:
        original_df[str(i) + col] = original_df[str(i) + col].astype(str)
In [14]:
exclude_from_plots = [str(i) + ". ID and name of existing competence" for i in range(1, 6)]
In [15]:
num_columns = len([i for i in columns_to_plot if i not in exclude_from_plots])
columns_per_row = 3
num_rows = (num_columns + columns_per_row - 1) // columns_per_row
fig, axes = plt.subplots(num_rows, columns_per_row, figsize=(15, 5 * num_rows), layout="constrained")
axes = axes.flatten()

for col, ax in zip([i for i in columns_to_plot if i not in exclude_from_plots], axes):
    new = df[col].tolist()
    og = original_df[col].tolist()
    data = pd.DataFrame({'new': new, 'original': og, 'column': col})
    data = data.melt(id_vars='column')


    p = sns.histplot(data=data, x='value', hue='variable', ax=ax)
    p.tick_params(axis='x', rotation=90)
    ax.set_title(col)
In [ ]:
 
In [ ]: