Generating Data

Generators

This tool provides data generators for bi-clustering and tri-clustering. Both generators are based on nclustgen.Generator.Generator, a dimension-agnostic abstract class. A complete explanation of the generator's parameters can be found in the API reference.

It can generate real-valued, integer, and categorical datasets, with different settings for cluster patterns, distributions, cluster overlapping, noise, missing values, and other parameters.

# Generate real-valued dataset

from nclustgen import BiclusterGenerator

# Initialize generator
generator = BiclusterGenerator(
    # Dataset type
    dstype='NUMERIC',
    # If real-valued
    realval=True,
    minval=1,
    maxval=10
)

x, y = generator.generate()
x

# Generate categorical dataset

from nclustgen import BiclusterGenerator

# Initialize generator
generator = BiclusterGenerator(
    # Dataset type
    dstype='SYMBOLIC',
    # Number of symbols
    nsymbols=10
)

x, y = generator.generate()
x

A seed argument can also be used to ensure reproducibility:

from nclustgen import BiclusterGenerator

generator = BiclusterGenerator(seed=3)

x, y = generator.generate()
x

To generate a dataset, call the nclustgen.Generator.Generator.generate() method. It takes the dataset’s shape and the number of hidden clusters as input:

# Generate bicluster dataset

from nclustgen import BiclusterGenerator
generator = BiclusterGenerator()

x, y = generator.generate(nrows=100, ncols=20, nclusters=20)
x

# Generate tricluster dataset

from nclustgen import TriclusterGenerator
generator = TriclusterGenerator()

x, y = generator.generate(nrows=100, ncols=20, ncontexts=3, nclusters=20)
x

Different patterns can be used for biclusters or triclusters:

# Generate bicluster dataset
from nclustgen import BiclusterGenerator
generator = BiclusterGenerator(
    patterns = [['Additive', 'Constant'], ['Constant', 'Multiplicative']]
)

x, y = generator.generate()
x

# Generate tricluster dataset
from nclustgen import TriclusterGenerator
generator = TriclusterGenerator(
    patterns = [['Order_Preserving', 'None', 'None'], ['Constant', 'Constant', 'Constant']]
)

x, y = generator.generate()
x

Biclustering Generator

The biclustering generator uses G-Bic, a Java library, as its backend. See that library if you prefer a graphical interface or want to work with Java directly. It also has more information if you wish to modify the generator itself.

Patterns

As mentioned earlier, the biclustering generator accepts several bicluster patterns. Here is the complete list:

2D Numeric Patterns Possible Combinations
index	pattern combination
0	[‘Order Preserving’, ‘None’]
1	[‘None’, ‘Order Preserving’]
2	[‘Constant’, ‘Constant’]
3	[‘None’, ‘Constant’]
4	[‘Constant’, ‘None’]
5	[‘Additive’, ‘Additive’]
6	[‘Constant’, ‘Additive’]
7	[‘Additive’, ‘Constant’]
8	[‘Multiplicative’, ‘Multiplicative’]
9	[‘Constant’, ‘Multiplicative’]
10	[‘Multiplicative’, ‘Constant’]
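As a rough illustration of what the Constant, Additive, and Multiplicative names usually mean in the biclustering literature, the sketch below builds small blocks by hand with numpy. It does not call nclustgen or G-Bic. The helper names are hypothetical, and which position of a pattern combination maps to which axis is not asserted here.

```python
import numpy as np

def constant_block(nrows, ncols, value):
    # Every cell of the block holds the same value.
    return np.full((nrows, ncols), value, dtype=float)

def additive_block(base_row, row_offsets):
    # Each row is the base row shifted by a per-row offset.
    base = np.asarray(base_row, dtype=float)[None, :]
    offsets = np.asarray(row_offsets, dtype=float)[:, None]
    return base + offsets

def multiplicative_block(base_row, row_factors):
    # Each row is the base row scaled by a per-row factor.
    base = np.asarray(base_row, dtype=float)[None, :]
    factors = np.asarray(row_factors, dtype=float)[:, None]
    return base * factors

base = [1.0, 2.0, 4.0]
const = constant_block(2, 3, 7.0)
add = additive_block(base, [0.0, 1.0, 3.0])
mul = multiplicative_block(base, [1.0, 2.0, 0.5])
```

In the additive block, the difference between any two rows is the same in every column. In the multiplicative block, the ratio between any two rows is the same in every column.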

2D Symbolic Patterns Possible Combinations
index	pattern combination
0	[‘Order Preserving’, ‘None’]
1	[‘None’, ‘Order Preserving’]
2	[‘Constant’, ‘Constant’]
3	[‘None’, ‘Constant’]
4	[‘Constant’, ‘None’]

Triclustering Generator

The triclustering generator similarly uses G-Tric, a Java library, as its backend.

Patterns

Like the biclustering generator, the triclustering generator accepts several different patterns:

3D Numeric Patterns Possible Combinations
index	pattern combination
0	[‘Order Preserving’, ‘None’, ‘None’]
1	[‘None’, ‘Order Preserving’, ‘None’]
2	[‘None’, ‘None’, ‘Order Preserving’]
3	[‘Constant’, ‘Constant’, ‘Constant’]
4	[‘None’, ‘Constant’, ‘Constant’]
5	[‘Constant’, ‘Constant’, ‘None’]
6	[‘Constant’, ‘None’, ‘Constant’]
7	[‘Constant’, ‘None’, ‘None’]
8	[‘None’, ‘Constant’, ‘None’]
9	[‘None’, ‘None’, ‘Constant’]
10	[‘Additive’, ‘Additive’, ‘Additive’]
11	[‘Additive’, ‘Additive’, ‘Constant’]
12	[‘Constant’, ‘Additive’, ‘Additive’]
13	[‘Additive’, ‘Constant’, ‘Additive’]
14	[‘Additive’, ‘Constant’, ‘Constant’]
15	[‘Constant’, ‘Additive’, ‘Constant’]
16	[‘Constant’, ‘Constant’, ‘Additive’]
17	[‘Multiplicative’, ‘Multiplicative’, ‘Multiplicative’]
18	[‘Multiplicative’, ‘Multiplicative’, ‘Constant’]
19	[‘Constant’, ‘Multiplicative’, ‘Multiplicative’]
20	[‘Multiplicative’, ‘Constant’, ‘Multiplicative’]
21	[‘Multiplicative’, ‘Constant’, ‘Constant’]
22	[‘Constant’, ‘Multiplicative’, ‘Constant’]
23	[‘Constant’, ‘Constant’, ‘Multiplicative’]

3D Symbolic Patterns Possible Combinations
index	pattern combination
0	[‘Order Preserving’, ‘None’, ‘None’]
1	[‘None’, ‘Order Preserving’, ‘None’]
2	[‘None’, ‘None’, ‘Order Preserving’]
3	[‘Constant’, ‘Constant’, ‘Constant’]
4	[‘None’, ‘Constant’, ‘Constant’]
5	[‘Constant’, ‘Constant’, ‘None’]
6	[‘Constant’, ‘None’, ‘Constant’]
7	[‘Constant’, ‘None’, ‘None’]
8	[‘None’, ‘Constant’, ‘None’]
9	[‘None’, ‘None’, ‘Constant’]

Dense Tensors

If nclustgen.Generator.Generator.in_memory is True, a dense tensor is generated using numpy. If you are not familiar with numpy, its quickstart guide is a good place to start: https://numpy.org/doc/stable/user/quickstart.html

>>> from nclustgen import BiclusterGenerator
>>> generator = BiclusterGenerator(in_memory=True)
>>> x, y = generator.generate()
>>> type(x)
<class 'numpy.ndarray'>

Matrix

When the generator’s output is a dense matrix, it has shape (nrows, ncols):

>>> from nclustgen import BiclusterGenerator
>>> generator = BiclusterGenerator(in_memory=True)
>>> x, y = generator.generate(nrows=100, ncols=50)
>>> x.shape
(100, 50)

Tensor

When the generator’s output is a dense tensor, on the other hand, it has shape (ncontexts, nrows, ncols):

>>> from nclustgen import TriclusterGenerator
>>> generator = TriclusterGenerator(in_memory=True)
>>> x, y = generator.generate(nrows=100, ncols=50, ncontexts=30)
>>> x.shape
(30, 100, 50)
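To make the (ncontexts, nrows, ncols) layout concrete, the following sketch indexes a stand-in numpy array of that shape. The array is filled with np.arange rather than produced by the generator, so it only illustrates the axis order, not the generated values.

```python
import numpy as np

ncontexts, nrows, ncols = 3, 4, 5

# Stand-in for a generated dense tensor: context axis first, then rows, then columns.
x = np.arange(ncontexts * nrows * ncols, dtype=float).reshape(ncontexts, nrows, ncols)

# One context is an ordinary (nrows, ncols) matrix.
context_matrix = x[1]

# One row observed across all contexts has shape (ncontexts, ncols).
row_across_contexts = x[:, 2, :]

# A single cell is addressed as x[context, row, column].
cell = x[1, 2, 3]
```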

Sparse Tensors

If the nclustgen.Generator.Generator.in_memory parameter is False, a sparse tensor is generated. Different packages are used depending on the dimensionality of the dataset, but the shape follows the same convention as the dense option.

Matrix

When the generator’s output is a sparse matrix, scipy’s csr_matrix will be used.

>>> from nclustgen import BiclusterGenerator
>>> generator = BiclusterGenerator(in_memory=False)
>>> x, y = generator.generate()
>>> type(x)
<class 'scipy.sparse.csr.csr_matrix'>
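If a downstream tool expects a dense array, a csr_matrix can be converted with toarray(). The sketch below uses a small hand-built matrix as a stand-in for the generator's output and relies only on standard scipy calls. It makes no claim about which entries nclustgen stores explicitly.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Stand-in for a generated sparse matrix of shape (nrows, ncols).
dense = np.array([
    [0.0, 1.5, 0.0],
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 3.0],
])
x = csr_matrix(dense)

# Shape follows the same (nrows, ncols) convention as the dense option.
nrows, ncols = x.shape

# Number of explicitly stored entries.
stored = x.nnz

# Row slicing stays sparse; toarray() materializes a numpy array.
last_row = x[2].toarray()
as_dense = x.toarray()
```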

Tensor

When the generator’s output is a sparse tensor, on the other hand, a COO object from the sparse package is returned.

>>> from nclustgen import TriclusterGenerator
>>> generator = TriclusterGenerator(in_memory=False)
>>> x, y = generator.generate()
>>> type(x)
<class 'sparse._coo.core.COO'>

Graphs

The nclustgen.Generator.Generator.to_graph() method generates either a bipartite or a tripartite graph, depending on the dataset’s dimensionality.

The dataset’s shape is transformed as follows:

number of nodes = nrows + ncols (+ ncontexts)

number of edges = nrows * ncols (* ncontexts * 3)
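The two formulas can be checked with a small helper before building a large graph. This is a minimal sketch with a hypothetical function name, assuming the parenthesised terms apply only to three-dimensional datasets and that the edge formula reads nrows * ncols * ncontexts * 3 in that case.

```python
def expected_graph_size(nrows, ncols, ncontexts=None):
    """Return (number of nodes, number of edges) for the given dataset shape."""
    if ncontexts is None:
        # Bipartite graph: one node per row and per column, one edge per cell.
        return nrows + ncols, nrows * ncols
    # Tripartite graph (assumed reading): three edges per cell of the tensor.
    return nrows + ncols + ncontexts, nrows * ncols * ncontexts * 3

bi_nodes, bi_edges = expected_graph_size(100, 50)
tri_nodes, tri_edges = expected_graph_size(100, 50, ncontexts=30)
```

For a 100 x 50 matrix this gives 150 nodes and 5000 edges, so the edge count grows with the product of the dimensions and can get large quickly for tensors.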

The graphs can be output in two formats: a NetworkX Multigraph, or a DGL heterograph with a PyTorch backend.

NetworkX is a well-known framework for working with graph data, while DGL is a more recent library aimed mainly at deep learning on graphs. If you intend to use this data with deep learning models, DGL is recommended. Otherwise, NetworkX will probably be the better option.

>>> from nclustgen import BiclusterGenerator
>>> generator = BiclusterGenerator()
>>> x, y = generator.generate(100, 50)
>>> g = generator.to_graph(framework='networkx')
>>> g
<networkx.classes.graph.Graph object at 0x10a011d60>
>>> len(g.nodes) == 100 + 50
True
>>> len(g.edges) == 100 * 50
True
>>> g = generator.to_graph(framework='dgl')
>>> g
Graph(num_nodes={'col': 50, 'row': 100},
      num_edges={('row', 'elem', 'col'): 5000},
      metagraph=[('row', 'col', 'elem')])
>>> g.num_nodes() == 100 + 50
True
>>> g.num_edges() == 100 * 50
True

When the dgl framework is used, the nclustgen.Generator.Generator.to_graph() method also accepts two additional parameters: device and cuda. The first determines whether the tensors are stored in CPU or GPU memory. The second is only used for GPU devices and sets the index of the GPU to use on multi-GPU machines. On single-GPU machines it can be ignored, as it defaults to 0.

>>> g = generator.to_graph(framework='dgl', device='gpu', cuda=0)
>>> g.device
device(type='gpu')