Generator

class nclustgen.Generator.Generator(n, dstype='NUMERIC', patterns=None, bktype='UNIFORM', clusterdistribution=None, contiguity=None, plaidcoherency='NO_OVERLAPPING', percofoverlappingclusters=0.0, maxclustsperoverlappedarea=0, maxpercofoverlappingelements=0.0, percofoverlappingrows=1.0, percofoverlappingcolumns=1.0, percofoverlappingcontexts=1.0, percmissingsonbackground=0.0, percmissingsonclusters=0.0, percnoiseonbackground=0.0, percnoiseonclusters=0.0, percnoisedeviation=0.0, percerroesonbackground=0.0, percerrorsonclusters=0.0, percerrorondeviation=0.0, silence=False, seed=None, *args, **kwargs)

Bases: object

Abstract base class from which dimension-specific subclasses inherit; it should not be instantiated directly. It abstracts dimensionality, providing implemented core methods alongside abstract methods that every n-clustering generator must implement.

Parameters
  • n (int, internal) – Determines dimensionality (e.g. Bi/Tri clustering). Should only be used by subclasses.

  • dstype ({'NUMERIC', 'SYMBOLIC'}, default 'NUMERIC') – Type of dataset to be generated, numeric or symbolic (categorical).

  • patterns (list or array, default [['CONSTANT', 'CONSTANT']]) –

    Defines the type of patterns that will be hidden in the data.

    Shape: (number of patterns, number of dimensions)

    Patterns_Set: {CONSTANT, ADDITIVE, MULTIPLICATIVE, ORDER_PRESERVING, NONE}

    Numeric_Patterns_Set: {CONSTANT, ADDITIVE, MULTIPLICATIVE, ORDER_PRESERVING, NONE}

    Symbolic_Patterns_Set: {CONSTANT, ORDER_PRESERVING, NONE}

    Pattern_Combinations:

    2D Numeric Patterns Possible Combinations

    index   pattern combination
    0       [‘Order_Preserving’, ‘None’]
    1       [‘None’, ‘Order_Preserving’]
    2       [‘Constant’, ‘Constant’]
    3       [‘None’, ‘Constant’]
    4       [‘Constant’, ‘None’]
    5       [‘Additive’, ‘Additive’]
    6       [‘Constant’, ‘Additive’]
    7       [‘Additive’, ‘Constant’]
    8       [‘Multiplicative’, ‘Multiplicative’]
    9       [‘Constant’, ‘Multiplicative’]
    10      [‘Multiplicative’, ‘Constant’]

    2D Symbolic Patterns Possible Combinations

    index   pattern combination
    0       [‘Order_Preserving’, ‘None’]
    1       [‘None’, ‘Order_Preserving’]
    2       [‘Constant’, ‘Constant’]
    3       [‘None’, ‘Constant’]
    4       [‘Constant’, ‘None’]

    3D Numeric Patterns Possible Combinations

    index   pattern combination
    0       [‘Order_Preserving’, ‘None’, ‘None’]
    1       [‘None’, ‘Order_Preserving’, ‘None’]
    2       [‘None’, ‘None’, ‘Order_Preserving’]
    3       [‘Constant’, ‘Constant’, ‘Constant’]
    4       [‘None’, ‘Constant’, ‘Constant’]
    5       [‘Constant’, ‘Constant’, ‘None’]
    6       [‘Constant’, ‘None’, ‘Constant’]
    7       [‘Constant’, ‘None’, ‘None’]
    8       [‘None’, ‘Constant’, ‘None’]
    9       [‘None’, ‘None’, ‘Constant’]
    10      [‘Additive’, ‘Additive’, ‘Additive’]
    11      [‘Additive’, ‘Additive’, ‘Constant’]
    12      [‘Constant’, ‘Additive’, ‘Additive’]
    13      [‘Additive’, ‘Constant’, ‘Additive’]
    14      [‘Additive’, ‘Constant’, ‘Constant’]
    15      [‘Constant’, ‘Additive’, ‘Constant’]
    16      [‘Constant’, ‘Constant’, ‘Additive’]
    17      [‘Multiplicative’, ‘Multiplicative’, ‘Multiplicative’]
    18      [‘Multiplicative’, ‘Multiplicative’, ‘Constant’]
    19      [‘Constant’, ‘Multiplicative’, ‘Multiplicative’]
    20      [‘Multiplicative’, ‘Constant’, ‘Multiplicative’]
    21      [‘Multiplicative’, ‘Constant’, ‘Constant’]
    22      [‘Constant’, ‘Multiplicative’, ‘Constant’]
    23      [‘Constant’, ‘Constant’, ‘Multiplicative’]

    3D Symbolic Patterns Possible Combinations

    index   pattern combination
    0       [‘Order_Preserving’, ‘None’, ‘None’]
    1       [‘None’, ‘Order_Preserving’, ‘None’]
    2       [‘None’, ‘None’, ‘Order_Preserving’]
    3       [‘Constant’, ‘Constant’, ‘Constant’]
    4       [‘None’, ‘Constant’, ‘Constant’]
    5       [‘Constant’, ‘Constant’, ‘None’]
    6       [‘Constant’, ‘None’, ‘Constant’]
    7       [‘Constant’, ‘None’, ‘None’]
    8       [‘None’, ‘Constant’, ‘None’]
    9       [‘None’, ‘None’, ‘Constant’]

  • bktype ({'NORMAL', 'UNIFORM', 'DISCRETE', 'MISSING'}, default 'UNIFORM') – Determines the distribution used to generate the background values.

  • clusterdistribution (list or array, default [['UNIFORM', 4.0, 4.0], ['UNIFORM', 4.0, 4.0]]) –

    Distribution used to calculate the size of a cluster.

    Shape: (number of dimensions, 3), where each row is [param1 (str), param2 (float), param3 (float)]

    The first parameter (param1) is always the type of distribution {‘NORMAL’, ‘UNIFORM’}. If param1 == ‘UNIFORM’, then param2 and param3 represent the min and max, respectively. If param1 == ‘NORMAL’, then param2 and param3 represent the mean and standard deviation, respectively.

  • contiguity ({'COLUMNS', 'CONTEXTS', 'NONE'}, default None) –

    Contiguity can occur on COLUMNS or CONTEXTS. To avoid contiguity use None.

    If dimensionality == 2 and contiguity == ‘CONTEXTS’ it defaults to None.

  • plaidcoherency ({'ADDITIVE', 'MULTIPLICATIVE', 'INTERPOLED', 'NONE', 'NO_OVERLAPPING'}, default 'NO_OVERLAPPING') – Enforces the type of plaid coherency. To avoid plaid coherency, use ‘NONE’; to avoid any overlapping, use ‘NO_OVERLAPPING’.

  • percofoverlappingclusters (float, default 0.0) –

    Percentage of overlapping clusters. Defines how many clusters are allowed to overlap.

    Not used if plaidcoherency == ‘NO_OVERLAPPING’.

    Range: [0,1]

  • maxclustsperoverlappedarea (int, default 0) –

    Maximum number of clusters that can overlap in the same area.

    Not used if plaidcoherency == ‘NO_OVERLAPPING’.

    Range: [0, nclusters]

  • maxpercofoverlappingelements (float, default 0.0) –

    Maximum percentage of values shared by overlapped clusters.

    Not used if plaidcoherency == ‘NO_OVERLAPPING’.

    Range: [0,1]

  • percofoverlappingrows (float, default 1.0) –

    Percentage of overlap allowed across cluster rows.

    Not used if plaidcoherency == ‘NO_OVERLAPPING’.

    Range: [0,1]

  • percofoverlappingcolumns (float, default 1.0) –

    Percentage of overlap allowed across cluster columns.

    Not used if plaidcoherency == ‘NO_OVERLAPPING’.

    Range: [0,1]

  • percofoverlappingcontexts (float, default 1.0) –

    Percentage of overlap allowed across cluster contexts.

    Not used if plaidcoherency == ‘NO_OVERLAPPING’ or n < 3.

    Range: [0,1]

  • percmissingsonbackground (float, default 0.0) –

    Percentage of missing values on the background, that is, values that do not belong to planted clusters.

    Range: [0,1]

  • percmissingsonclusters (float, default 0.0) –

    Maximum percentage of missing values on each cluster.

    Range: [0,1]

  • percnoiseonbackground (float, default 0.0) –

    Percentage of noisy values on the background, that is, values with added noise.

    Range: [0,1]

  • percnoiseonclusters (float, default 0.0) –

    Maximum percentage of noisy values on each cluster.

    Range: [0,1]

  • percnoisedeviation (int or float, default 0.0) –

    Maximum deviation of a noisy value, that is, the maximum difference between the current symbol in the matrix and the symbol that replaces it for the change to be considered noise.

    If dstype == ‘NUMERIC’ then percnoisedeviation is a float, else an int.

    Ex: Let Alphabet = [1,2,3,4,5] and CurrentSymbol = 3. If noiseDeviation is ‘1’, then CurrentSymbol will be randomly replaced by either ‘2’ or ‘4’. If noiseDeviation is ‘2’, CurrentSymbol can be replaced by ‘1’, ‘2’, ‘4’ or ‘5’.

  • percerroesonbackground (float, default 0.0) –

    Percentage of error values on the background. Similar to noise, except that a new value is considered an error if the difference between it and the current value in the matrix is greater than noiseDeviation.

    Ex: Alphabet = [1,2,3,4,5]. If currentValue = 2 and errorDeviation = 2, then to turn currentValue into an error its value must be replaced by ‘5’, the only possible value that satisfies abs(currentValue - newValue) > noiseDeviation.

    Range: [0,1]

  • percerrorsonclusters (float, default 0.0) –

    Percentage of error values on clusters. Similar to noise, except that a new value is considered an error if the difference between it and the current value in the matrix is greater than noiseDeviation.

    Ex: Alphabet = [1,2,3,4,5]. If currentValue = 2 and errorDeviation = 2, then to turn currentValue into an error its value must be replaced by ‘5’, the only possible value that satisfies abs(currentValue - newValue) > noiseDeviation.

    Range: [0,1]

  • percerrorondeviation (int or float, default 0.0) –

    Deviation threshold for error values, that is, the maximum difference between the current symbol in the matrix and the symbol that replaces it that is used to decide whether the change is considered an error.

    If dstype == ‘NUMERIC’ then percerrorondeviation is a float, else an int.

  • silence (bool, default False) – If True then the class does not print to the console.

  • seed (int, default -1) –

    Seed to initialize random objects.

    If seed is None or -1 then random objects are initialized without a seed.

  • timeprofile ({'RANDOM', 'MONONICALLY_INCREASING', 'MONONICALLY_DECREASING', None}, default None) –

    It determines a time profile for the ORDER_PRESERVING pattern. Only used if ORDER_PRESERVING in patterns.

    If None and ORDER_PRESERVING in patterns it defaults to ‘RANDOM’.

  • realval (bool, default True) – Indicates if the dataset is real valued. Only used when dstype == ‘NUMERIC’.

  • minval (int or float, default -10.0) – Dataset’s minimum value. Only used when dstype == ‘NUMERIC’.

  • maxval (int or float, default 10.0) – Dataset’s maximum value. Only used when dstype == ‘NUMERIC’.

  • symbols (list or array of strings, default None) –

    Dataset’s alphabet (list of possible values/symbols it can contain). Only used if dstype == ‘SYMBOLIC’.

    Shape: alphabets length

  • nsymbols (int, default 10) –

    Defines the length of the alphabet. Instead of defining specific symbols, this parameter can be passed, and a list of strings will be created with range(1, n), where n represents this parameter.

    Only used if dstype == ‘SYMBOLIC’ and symbols is None.

  • mean (int or float, default 14.0) – Mean for the background’s distribution. Only used when bktype == ‘NORMAL’.

  • stdev (int or float, default 7.0) – Standard deviation for the background’s distribution. Only used when bktype == ‘NORMAL’.

  • probs (list or array of floats) –

    Background weighted distribution probabilities. Only used when bktype == ‘DISCRETE’. There are no default probabilities: if probs is None and bktype == ‘DISCRETE’, bktype defaults to ‘UNIFORM’.

    Shape: Number of symbols or possible integers

    Range: [0,1]

    Math: sum(probs) == 1

  • in_memory (bool, default None) –

    Determines whether generated datasets are returned as a dense (True) or sparse (False) matrix.

    If None, the output defaults to sparse when the generated dataset’s size is larger than 10**5, and to dense otherwise.

    Note

    This parameter can be overwritten in the generate method.
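The noise and error deviation examples given for percnoisedeviation and percerroesonbackground above can be reproduced with a small standalone sketch. The helper names noise_candidates and error_candidates are hypothetical and are not part of nclustgen; the sketch only mirrors the rule stated in the parameter descriptions (noise stays within the deviation, errors exceed it) and is not the generator's actual implementation.

```python
def noise_candidates(alphabet, current, deviation):
    """Symbols that may replace `current` as noise.

    Assumption taken from the parameter description: a noisy value differs
    from the current symbol by at most `deviation`.
    """
    return [s for s in alphabet if s != current and abs(s - current) <= deviation]


def error_candidates(alphabet, current, deviation):
    """Symbols that may replace `current` as an error.

    Assumption taken from the parameter description: an error differs from
    the current symbol by more than `deviation`.
    """
    return [s for s in alphabet if abs(s - current) > deviation]


alphabet = [1, 2, 3, 4, 5]
print(noise_candidates(alphabet, current=3, deviation=1))  # [2, 4]
print(noise_candidates(alphabet, current=3, deviation=2))  # [1, 2, 4, 5]
print(error_candidates(alphabet, current=2, deviation=2))  # [5]
```

The three printed lists match the worked examples in the parameter descriptions.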

_n

Dimensionality.

Type

int

_stdout

default System.out

Type

System object (java)

dstype

Type of dataset to be generated, numeric or symbolic (categorical).

Type

{‘NUMERIC’, ‘SYMBOLIC’}

patterns

Type of patterns that will be hidden in the data.

Type

list

clusterdistribution

Distribution used to calculate the size of a cluster.

Type

list

contiguity

Data contiguity.

Type

{‘COLUMNS’, ‘CONTEXTS’, ‘NONE’}

time_profile

Time profile for the ORDER_PRESERVING pattern.

Type

{‘RANDOM’, ‘MONONICALLY_INCREASING’, ‘MONONICALLY_DECREASING’, None}

seed

Seed to initialize random objects.

Type

int

realval

If the dataset is real valued.

Type

bool

minval

Dataset’s minimum value.

Type

float

maxval

Dataset’s maximum value.

Type

float

noise

Dataset’s noise settings.

Type

tuple

errors

Dataset’s error settings.

Type

tuple

missing

Dataset’s missing settings.

Type

tuple

symbols

Dataset’s alphabet.

Type

list

nsymbols

Length of the alphabet.

Type

int

plaidcoherency

Type of plaid coherency.

Type

{‘ADDITIVE’, ‘MULTIPLICATIVE’, ‘INTERPOLED’, ‘NONE’, ‘NO_OVERLAPPING’}

percofoverlappingclusts

Percentage of overlapping clusters.

Type

float

maxclustsperoverlappedarea

Maximum number of clusters overlapped per area.

Type

int

maxpercofoverlappingelements

Maximum percentage of values shared by overlapped clusters.

Type

float

percofoverlappingrows

Percentage of overlap allowed across cluster rows.

Type

float

percofoverlappingcolumns

Percentage of overlap allowed across cluster columns.

Type

float

percofoverlappingcontexts

Percentage of overlap allowed across cluster contexts.

Type

float

background

Dataset’s background settings.

Type

list

generatedDataset

Generated dataset.

Type

Dataset object (java)

X

Generated dataset as tensor.

Type

dense or sparse tensor

Y

Hidden cluster labels.

Type

list

graph

N-partite graph

Type

Graph object

in_memory

Whether the dataset should be kept in memory (dense format).

Type

bool

silenced

Whether printing to the console is suppressed.

Type

bool

get_params()

Returns the class attributes.

Returns

Values of class attributes.

Return type

dict

Examples

>>> from nclustgen import BiclusterGenerator
>>> generator = BiclusterGenerator()
>>> generator.get_params()
{'X': None, 'Y': None, 'background': ['UNIFORM'], 'clusterdistribution': [['UNIFORM', 4.0, 4.0],
['UNIFORM', 4.0, 4.0]], 'contiguity': 'NONE', 'dstype': 'NUMERIC', 'errors': (0.0, 0.0, 0.0),
'generatedDataset': None, 'graph': None, 'in_memory': None, 'maxclustsperoverlappedarea': 0,
'maxpercofoverlappingelements': 0.0, 'maxval': 10.0, 'minval': -10.0, 'missing': (0.0, 0.0),
'noise': (0.0, 0.0, 0.0), 'patterns': [['CONSTANT', 'CONSTANT']], 'percofoverlappingclusts': 0.0,
'percofoverlappingcolumns': 1.0, 'percofoverlappingcontexts': 1.0, 'percofoverlappingrows': 1.0,
'plaidcoherency': 'NO_OVERLAPPING', 'realval': True, 'seed': -1, 'silenced': False, 'time_profile': None}

property cluster_info

Returns information about the hidden clusters.

Returns

Hidden cluster info.

Return type

dict

Examples

>>> generator = BiclusterGenerator(silence=True)
>>> generator.generate(no_return=True)
>>> generator.cluster_info
{'0': {'%Errors': '0', 'Type': 'Numeric', '%Missings': '0', '%Noise': '0', 'X': [15, 51, 63, 92],
'Y': [7, 29, 35, 94], 'RowPattern': 'Constant', 'ColumnPattern': 'Constant',
'Data': [['-8.61', '-8.61', '-8.61', '-8.61'], ['-8.61', '-8.61', '-8.61', '-8.61'],
['-8.61', '-8.61', '-8.61', '-8.61'], ['-8.61', '-8.61', '-8.61', '-8.61']],
'PlaidCoherency': 'No Overlapping', '#rows': 4, '#columns': 4}}

property coverage

Returns the share of the dataset covered by clusters.

Returns

Percentage of cluster coverage.

Return type

float

Examples

>>> generator = BiclusterGenerator(silence=True)
>>> generator.generate(no_return=True)
>>> generator.coverage
0.16
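
The value 0.16 above is consistent with one 4 x 4 cluster in a 100 x 100 dataset if coverage is read as the percentage of elements that belong to at least one cluster. That reading is an assumption, not taken from the library source, and coverage_percent below is a hypothetical helper that only illustrates the arithmetic.

```python
def coverage_percent(nrows, ncols, clusters):
    """Percentage of matrix cells that fall inside at least one cluster.

    `clusters` is a list of (row_indices, column_indices) pairs. Cells shared
    by overlapping clusters are counted once.
    """
    covered = set()
    for rows, cols in clusters:
        covered.update((r, c) for r in rows for c in cols)
    return 100.0 * len(covered) / (nrows * ncols)


# Row and column indices taken from the cluster_info example ('X' and 'Y').
clusters = [([15, 51, 63, 92], [7, 29, 35, 94])]
print(coverage_percent(100, 100, clusters))  # 0.16
```
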
abstract save(extension='default', file_name='example_dataset', path=None, single_file=None, **kwargs)

Saves data files to chosen path.

Parameters
  • extension ({'default', 'csv'}, default 'default') – Extension of saved data file.

  • file_name (str, default 'example_dataset') – Saved files prefix.

  • path (str, default None) – Path to save files. If None then files are saved in the current working directory.

  • single_file (bool, default None) – If False, the dataset is saved in multiple data files. If None, it defaults to False when the dataset’s size is larger than 10**5, and to True otherwise. Only used if extension == ’default’.

  • **kwargs (any, default None) – Additional keywords that are passed on.

to_tensor(generatedDataset=None, in_memory=None, keys=None)

Returns the generated dataset as some kind of tensor, together with the hidden cluster labels.

Parameters
  • generatedDataset (Dataset object) – Generated dataset (java object).

  • in_memory (bool, default None) –

    Determines whether generated datasets are returned as a dense (True) or sparse (False) matrix.

    If None, the output defaults to sparse when the generated dataset’s size is larger than 10**5, and to dense otherwise.

  • keys (list, default ['X', 'Y', 'Z']) – Axis keys. Do not overwrite, unless you are using a different dataset object.

Returns

  • dense or sparse tensor – Generated dataset as tensor.

    Shape: (ncontexts, nrows, ncols) or (nrows, ncols)

  • list – Hidden cluster labels.

    Shape: (nclusters, any)

Examples

>>> generator = BiclusterGenerator(silence=True)
>>> generator.generate(no_return=True)
>>> x, y = generator.to_tensor(generatedDataset=generator.generatedDataset, in_memory=True)
>>> x
array([[-4.15,  9.88,  7.69, ...,  3.68,  1.72, -6.95],
       [ 7.37,  2.63, -0.13, ..., -2.53,  2.03,  8.03],
       [ 4.28,  0.36,  8.66, ..., -1.11,  6.28, -1.03],
       ...,
       [-9.25, -9.15, -4.68, ...,  2.06, -6.19,  2.54],
       [ 2.63, -3.03,  3.8 , ...,  4.13, -4.17,  7.68],
       [-1.98,  8.02,  1.89, ...,  3.59,  4.27,  6.4 ]])
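
The default behaviour of in_memory described above can be summarised in a few lines. This is a sketch under stated assumptions: resolve_in_memory and DENSE_SIZE_LIMIT are hypothetical names, and "size" is taken here to mean the total number of elements, which the documentation does not spell out.

```python
from math import prod

# Threshold quoted in the parameter description.
DENSE_SIZE_LIMIT = 10**5


def resolve_in_memory(shape, in_memory=None):
    """Decide between dense (True) and sparse (False) output.

    An explicit True/False is respected; None falls back to the size rule.
    """
    if in_memory is not None:
        return in_memory
    # Assumption: dataset size is the product of its dimensions.
    return prod(shape) <= DENSE_SIZE_LIMIT


print(resolve_in_memory((100, 100)))                       # True: 10**4 elements
print(resolve_in_memory((3, 1000, 1000)))                  # False: 3 * 10**6 elements
print(resolve_in_memory((3, 1000, 1000), in_memory=True))  # True: explicit override
```
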
to_graph(x=None, framework='networkx', device='cpu', **kwargs)

Returns an n-partite graph, where n==dim.

Parameters
  • x (numpy array) – Data array.

  • framework ({networkx, dgl}, default 'networkx') – Backend to use to build graph.

  • device ({'cpu', 'gpu'}, default 'cpu') – Type of device for storing the tensor. Only used if framework==dgl.

  • **kwargs (any, default None) – Additional keywords that are passed on.

Returns

N-partite graph, where n==dim.

Shape: (nrows + ncols + ncontexts, nrows * ncols * ncontexts * 3(dim)) or (nrows + ncols, nrows * ncols)

Return type

Graph object

Examples

>>> generator = BiclusterGenerator(silence=True)
>>> X, y = generator.generate()
>>> g = generator.to_graph(X, framework='dgl')
>>> g
Graph(num_nodes={'col': 100, 'row': 100},
      num_edges={('row', 'elem', 'col'): 10000},
      metagraph=[('row', 'col', 'elem')])
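
The 2D shape stated above (nrows + ncols nodes, nrows * ncols edges) can be checked with a dependency-free sketch. It is only an illustration of the bipartite layout: bipartite_edges is a hypothetical helper, and the real method returns a networkx or dgl graph object rather than plain lists.

```python
def bipartite_edges(x):
    """Build row/column node lists and one weighted edge per matrix element.

    `x` is a 2D matrix given as a list of equal-length rows. Each edge links a
    row node to a column node and carries the element value as its weight.
    """
    nrows, ncols = len(x), len(x[0])
    nodes = {"row": list(range(nrows)), "col": list(range(ncols))}
    edges = [
        (("row", r), ("col", c), x[r][c])
        for r in range(nrows)
        for c in range(ncols)
    ]
    return nodes, edges


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
nodes, edges = bipartite_edges(x)
print(len(nodes["row"]) + len(nodes["col"]))  # 5 nodes: 2 rows + 3 columns
print(len(edges))                             # 6 edges: 2 * 3 elements
```

For the default 100 x 100 dataset the same counting gives 200 nodes and 10000 edges, matching the dgl output shown above.
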
generate(nrows=100, ncols=100, ncontexts=3, nclusters=1, no_return=False, **kwargs)

Generates the dataset, and may return some kind of tensor and the hidden cluster labels.

Parameters
  • nrows (int, default 100) – Number of rows in generated dataset.

  • ncols (int, default 100) – Number of columns in generated dataset.

  • ncontexts (int, default 3) – Number of contexts in generated dataset. Only used if dim >= 3.

  • nclusters (int, default 1) – Number of clusters in generated dataset.

  • no_return (bool, default False) – If True, the method returns None.

  • **kwargs (any, default None) – Additional keywords that are passed on.

Returns

  • dense or sparse tensor – Generated dataset as tensor.

    Shape: (ncontexts, nrows, ncols) or (nrows, ncols)

  • list – Hidden cluster labels.

    Shape: (nclusters, any)

  • None – If no_return==True.

Examples

>>> gen = BiclusterGenerator(silence=True)
>>> x, y = gen.generate(nrows=100, ncols=200, nclusters=20, in_memory=True)
>>> x
array([[-7.36,  4.88,  8.42, ..., -5.04, -4.93,  6.35],
       [-7.1 ,  0.47, -2.58, ..., -3.03,  0.42,  8.76],
       [-8.08,  4.19,  2.53, ..., -4.3 ,  7.54,  0.94],
       ...,
       [-0.52,  0.38,  6.98, ..., -7.6 ,  5.71,  9.24],
       [-1.28, -3.55, -3.13, ..., -4.17, -6.05, -9.87],
       [-5.79, -6.05, -2.24, ...,  1.88,  1.97,  6.05]])

static shutdownJVM()

Shuts down the JVM.

Caution

If the JVM is shut down, it cannot be restarted in the same session.