Data

DataPreprocess

class neuralkg_ind.data.DataPreprocess.KGData(args)[source]

Bases: object

Data preprocessing of KG data.

args

Some pre-set parameters, such as dataset path, etc.

ent2id

Encoding the entity in triples, type: dict.

rel2id

Encoding the relation in triples, type: dict.

id2ent

Decoding the entity in triples, type: dict.

id2rel

Decoding the relation in triples, type: dict.

train_triples

Record the triples for training, type: list.

valid_triples

Record the triples for validation, type: list.

test_triples

Record the triples for testing, type: list.

all_true_triples

Record all triples, including train, valid, and test, type: list.

TrainTriples
Relation2Tuple
RelSub2Obj
hr2t_train

Record the tail corresponding to the same head and relation, type: defaultdict(class:set).

rt2h_train

Record the head corresponding to the same tail and relation, type: defaultdict(class:set).

h2rt_train

Record the tail, relation corresponding to the same head, type: defaultdict(class:set).

t2rh_train

Record the head, relation corresponding to the same tail, type: defaultdict(class:set).

get_id()[source]

Getting entity/relation id, and entity/relation number.

Update:

self.ent2id: Entity to id.
self.rel2id: Relation to id.
self.id2ent: Id to entity.
self.id2rel: Id to relation.
self.args.num_ent: Entity number.
self.args.num_rel: Relation number.
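
A minimal sketch of how these mappings are typically built, assuming the dataset directory contains entities.dict and relations.dict files with one "<id>\t<name>" pair per line (the file layout is an assumption, not the library's guaranteed format):

import os

def get_id(self):
    # Read "<id>\t<name>" pairs; build both directions of each mapping.
    with open(os.path.join(self.args.data_path, "entities.dict")) as f:
        for line in f:
            eid, ent = line.strip().split("\t")
            self.ent2id[ent] = int(eid)
            self.id2ent[int(eid)] = ent
    with open(os.path.join(self.args.data_path, "relations.dict")) as f:
        for line in f:
            rid, rel = line.strip().split("\t")
            self.rel2id[rel] = int(rid)
            self.id2rel[int(rid)] = rel
    # Record the totals that downstream models need.
    self.args.num_ent = len(self.ent2id)
    self.args.num_rel = len(self.rel2id)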

get_triples_id()[source]

Getting triple ids, saved in the format (h, r, t).

Update:

self.train_triples: Train dataset triple ids.
self.valid_triples: Valid dataset triple ids.
self.test_triples: Test dataset triple ids.

get_hr2t_rt2h_from_train()[source]

Getting the sets hr2t and rt2h from the train dataset; the values are stored as numpy arrays.

Update:

self.hr2t_train: The set of hr2t. self.rt2h_train: The set of rt2h.
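
A sketch of the construction, assuming train_triples holds (h, r, t) id tuples; each set is frozen as a numpy array at the end, matching the note above:

from collections import defaultdict
import numpy as np

def get_hr2t_rt2h_from_train(self):
    self.hr2t_train = defaultdict(set)
    self.rt2h_train = defaultdict(set)
    for h, r, t in self.train_triples:
        self.hr2t_train[(h, r)].add(t)   # all true tails for (h, r)
        self.rt2h_train[(r, t)].add(h)   # all true heads for (r, t)
    # Freeze the sets as numpy arrays for fast masking later.
    for key in self.hr2t_train:
        self.hr2t_train[key] = np.array(list(self.hr2t_train[key]))
    for key in self.rt2h_train:
        self.rt2h_train[key] = np.array(list(self.rt2h_train[key]))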

static count_frequency(triples, start=4)[source]

Getting frequency of a partial triple like (head, relation) or (relation, tail).

The frequency will be used for subsampling like word2vec.

Parameters:
  • triples – Sampled triples.

  • start – Initial count number.

Returns:

The frequency of each (head, relation) and (relation, tail) pair.

Return type:

count
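
A sketch following the RotatE-style implementation of this helper; (relation, tail) pairs are encoded as (tail, -relation - 1) so both pattern types can share one dictionary:

@staticmethod
def count_frequency(triples, start=4):
    count = {}
    for head, relation, tail in triples:
        # start acts as a smoothing floor for rare patterns.
        if (head, relation) not in count:
            count[(head, relation)] = start
        else:
            count[(head, relation)] += 1
        if (tail, -relation - 1) not in count:
            count[(tail, -relation - 1)] = start
        else:
            count[(tail, -relation - 1)] += 1
    return count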

get_h2rt_t2hr_from_train()[source]

Getting the sets h2rt and t2hr from the train dataset; the values are stored as numpy arrays.

Update:

self.h2rt_train: The set of h2rt. self.t2rh_train: The set of t2hr.

get_hr_trian()[source]

Change the batch generation mode: merge triples that share the same head and relation for the 1vsN training mode.

Returns:

The list of (hr, t) tuples for training.

Return type:

self.train_triples
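
A sketch of the merging step (the method name keeps the library's spelling); grouping by (h, r) lets one batch row predict every known tail at once in the 1vsN mode. The exact return layout is an assumption:

from collections import defaultdict

def get_hr_trian(self):
    merged = defaultdict(list)
    for h, r, t in self.train_triples:
        merged[(h, r)].append(t)
    # One entry per (h, r) pair, carrying all of its true tails.
    self.train_triples = [(hr, ts) for hr, ts in merged.items()]
    return self.train_triples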

class neuralkg_ind.data.DataPreprocess.GRData(args, db_name_pos, db_name_neg)[source]

Bases: Dataset

Data preprocessing of subgraphs (DGL only).

args

Some pre-set parameters, such as dataset path, etc.

db_name_pos

Database name of positive sample, type: str.

db_name_neg

Database name of negative sample, type: str.

m_h2r

The matrix of head to rels, type: NDArray[signedinteger].

m_t2r

The matrix of tail to rels, type: NDArray[signedinteger].

ssp_graph

The collection of head-to-tail csc_matrix, type: list.

graph

Dgl graph of train or test, type: DGLHeteroGraph.

id2entity

Record the id to entity. type: dict.

id2relation

Record the id to relation. type: dict.

load_data_grail()[source]

Load train dataset, adj_list, ent2idx, etc.

Returns:

The collection of head-to-tail csc_matrix, type: list.
triplets: Triples of the train-train and train-validation splits.
train_ent2idx: Entity to idx of the train graph.
train_rel2idx: Relation to idx of the train graph.
train_idx2ent: Idx to entity of the train graph.
train_idx2rel: Idx to relation of the train graph.
h2r: Head to relation of train-train triples.
m_h2r: The matrix of head to rels.
t2r: Tail to relation of train-train triples.
m_t2r: The matrix of tail to rels.

Return type:

adj_list

load_ind_data_grail()[source]

Load test dataset, adj_list, ent2idx, etc.

Returns:

The collection of head-to-tail csc_matrix, type: list.
triplets: Triples of the test-train and test-test splits.
train_ent2idx: Entity to idx of the test graph.
train_rel2idx: Relation to idx of the test graph.
train_idx2ent: Idx to entity of the test graph.
train_idx2rel: Idx to relation of the test graph.
h2r: Head to relation of test-train triples.
m_h2r: The matrix of head to rels.
t2r: Tail to relation of test-train triples.
m_t2r: The matrix of tail to rels.

Return type:

adj_list

generate_train()[source]
generate_valid()[source]
prepare_subgraphs(nodes, r_label, n_labels)[source]

Initialize subgraph nodes and relation characteristics.

Parameters:
  • nodes – The nodes of subgraph.

  • r_label – The label of relation in subgraph corresponding triple.

  • n_labels – The label of node in subgraph.

Returns:

Subgraph after processing.

Return type:

subgraph

prepare_features_new(subgraph, n_labels, r_label=None)[source]

Prepare subgraph node features.

Parameters:
  • subgraph – Extract subgraph.

  • r_label – The label of relation in subgraph corresponding triple.

  • n_labels – The label of node in subgraph.

Returns:

Subgraph after initializing node labels.

Return type:

subgraph

class neuralkg_ind.data.DataPreprocess.MetaTrainGRData(args)[source]

Bases: Dataset

Data preprocessing of meta train task.

subgraphs_db

Database of training subgraphs.

class neuralkg_ind.data.DataPreprocess.MetaValidGRData(args)[source]

Bases: Dataset

Data preprocessing of meta valid task.

subgraphs_db

Database of validation subgraphs.

class neuralkg_ind.data.DataPreprocess.KGEEvalData(args, eval_triples, num_ent, hr2t, rt2h)[source]

Bases: Dataset

Data processing for KGE evaluation.

triples

Triples to evaluate, type: list.

num_ent

The number of entity. type: int.

hr2t

Head and relation to tails. type: dict.

rt2h

Relation and tail to heads. type: dict.

num_cand

The number of candidate entities. type: str or int.

get_label(true_tail, true_head)[source]

Filter head and tail entities.

Parameters:
  • true_tail – Existing tail entities in dataset.

  • true_head – Existing head entities in dataset.

Returns:

Label of tail entities. y_head: Label of head entities.

Return type:

y_tail
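
A sketch of the filtered-label construction, assuming true_tail and true_head are index lists of entities known to complete the triple:

import torch

def get_label(self, true_tail, true_head):
    # Multi-hot vectors over all entities: 1.0 marks a known positive,
    # which evaluation code can then mask out of the ranking.
    y_tail = torch.zeros(self.num_ent)
    y_tail[true_tail] = 1.0
    y_head = torch.zeros(self.num_ent)
    y_head[true_head] = 1.0
    return y_tail, y_head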

static collate_fn(data)[source]
class neuralkg_ind.data.DataPreprocess.BaseSampler(args)[source]

Bases: KGData

Traditional random sampling mode.

corrupt_head(t, r, num_max=1)[source]

Negative sampling of head entities.

Parameters:
  • t – Tail entity in triple.

  • r – Relation in triple.

  • num_max – The maximum number of negative samples generated.

Returns:

The negative samples of the head entity, with positive head entities filtered out.

Return type:

neg
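
A sketch of filtered head corruption under the attributes above, assuming rt2h_train[(r, t)] is a numpy array of true heads:

import numpy as np

def corrupt_head(self, t, r, num_max=1):
    # Draw candidate ids uniformly, then drop any candidate that already
    # appears as a true head for (?, r, t) in the training set.
    candidates = np.random.randint(low=0, high=self.args.num_ent, size=num_max)
    mask = np.isin(candidates, self.rt2h_train[(r, t)], invert=True)
    return candidates[mask]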

corrupt_tail(h, r, num_max=1)[source]

Negative sampling of tail entities.

Parameters:
  • h – Head entity in triple.

  • r – Relation in triple.

  • num_max – The maximum number of negative samples generated.

Returns:

The negative samples of the tail entity, with positive tail entities filtered out.

Return type:

neg

head_batch(h, r, t, neg_size=None)[source]

Negative sampling of head entities.

Parameters:
  • h – Head entity in triple.

  • t – Tail entity in triple.

  • r – Relation in triple.

  • neg_size – The size of negative samples.

Returns:

The negative samples of the head entity, with shape [neg_size].

tail_batch(h, r, t, neg_size=None)[source]

Negative sampling of tail entities.

Parameters:
  • h – Head entity in triple.

  • t – Tail entity in triple.

  • r – Relation in triple.

  • neg_size – The size of negative samples.

Returns:

The negative samples of the tail entity, with shape [neg_size].

get_train()[source]
get_valid()[source]
get_test()[source]
get_all_true_triples()[source]
class neuralkg_ind.data.DataPreprocess.RevSampler(args)[source]

Bases: KGData

Adding reverse triples in traditional random sampling mode.

For each triple (h, r, t), generate the reverse triple (t, r′, h), where r′ = r + num_rel.

hr2t_train

Record the tail corresponding to the same head and relation, type: defaultdict(class:set).

rt2h_train

Record the head corresponding to the same tail and relation, type: defaultdict(class:set).

add_reverse_relation()[source]

Getting entity/relation/reverse relation id, and entity/relation number.

Update:

self.ent2id: Entity to id.
self.rel2id: Relation to id.
self.args.num_ent: Entity number.
self.args.num_rel: Relation number.

add_reverse_triples()[source]

Generate reverse triples (t, r′, h).

Update:

self.train_triples: Triples for training.
self.valid_triples: Triples for validation.
self.test_triples: Triples for testing.
self.all_true_triples: All triples, including train, valid, and test.
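
A sketch of the reverse-triple step, assuming num_rel still counts only the original (non-reverse) relations when the reverses are added:

def add_reverse_triples(self):
    num_rel = self.args.num_rel
    def with_reverse(triples):
        # For every (h, r, t), also keep (t, r + num_rel, h).
        return triples + [(t, r + num_rel, h) for h, r, t in triples]
    self.train_triples = with_reverse(self.train_triples)
    self.valid_triples = with_reverse(self.valid_triples)
    self.test_triples = with_reverse(self.test_triples)
    self.all_true_triples = set(
        self.train_triples + self.valid_triples + self.test_triples
    )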

get_train()[source]
get_valid()[source]
get_test()[source]
get_all_true_triples()[source]
corrupt_head(t, r, num_max=1)[source]

Negative sampling of head entities.

Parameters:
  • t – Tail entity in triple.

  • r – Relation in triple.

  • num_max – The maximum number of negative samples generated.

Returns:

The negative samples of the head entity, with positive head entities filtered out.

Return type:

neg

corrupt_tail(h, r, num_max=1)[source]

Negative sampling of tail entities.

Parameters:
  • h – Head entity in triple.

  • r – Relation in triple.

  • num_max – The maximum number of negative samples generated.

Returns:

The negative samples of the tail entity, with positive tail entities filtered out.

Return type:

neg

head_batch(h, r, t, neg_size=None)[source]

Negative sampling of head entities.

Parameters:
  • h – Head entity in triple.

  • t – Tail entity in triple.

  • r – Relation in triple.

  • neg_size – The size of negative samples.

Returns:

The negative samples of the head entity, with shape [neg_size].

tail_batch(h, r, t, neg_size=None)[source]

Negative sampling of tail entities.

Parameters:
  • h – Head entity in triple.

  • t – Tail entity in triple.

  • r – Relation in triple.

  • neg_size – The size of negative samples.

Returns:

The negative samples of the tail entity, with shape [neg_size].

class neuralkg_ind.data.DataPreprocess.BaseGraph(args)[source]

Bases: object

Base subgraph class

Collects the train, valid, and test datasets for the inductive setting.

get_train()[source]
get_valid()[source]
get_test()[source]
generate_ind_test()[source]

Generate inductive test triples.

Returns:

Negative triplets.

Return type:

neg_triplets

load_data_grail_ind()[source]

Load train dataset, adj_list, ent2idx, etc.

Returns:

The collection of head-to-tail csc_matrix.
dgl_adj_list: The collection of undirected head-to-tail csc_matrix.
triplets: Triples of the test-train and test-test splits.
m_h2r: The matrix of head to rels.
m_t2r: The matrix of tail to rels.

Return type:

adj_list

get_neg_samples_replacing_head_tail(test_links, adj_list, num_samples=50)[source]

Sample negative triplets by replacing the head or tail.

Parameters:
  • test_links – test-test triplets.

  • adj_list – The collection of head-to-tail csc_matrix.

  • num_samples – The number of candidates.

Returns:

Sampled negative triplets.

Return type:

neg_triplets

class neuralkg_ind.data.DataPreprocess.BaseMeta(args)[source]

Bases: object

Base meta class

Collects the train, valid, and test datasets for meta tasks.

get_train()[source]
get_valid()[source]
get_test()[source]

Grounding

class neuralkg_ind.data.Grounding.GroundAllRules(args)[source]

Bases: object

PropositionalizeRule()[source]
readData(fnEntityIDMap, fnRelationIDMap, fnTrainingTriples)[source]
groundRule(fnRuleType, fnOutput)[source]

KGDataModule

Base DataModule class.

class neuralkg_ind.data.KGDataModule.KGDataModule(*args: Any, **kwargs: Any)[source]

Bases: BaseDataModule

Base DataModule. Learn more at https://pytorch-lightning.readthedocs.io/en/stable/datamodules.html

get_data_config()[source]

Return important settings of the dataset, which will be passed to instantiate models.

prepare_data()[source]

Use this method to do things that might write to disk or that need to be done only from a single GPU in distributed settings (so don’t set state self.x = y).

setup(stage=None)[source]

Split into train, val, test, and set dims. Should assign torch Dataset objects to self.data_train, self.data_val, and optionally self.data_test.

get_train_bs()[source]

Get batch size for training.

If num_batches is nonzero, the batch size is computed by dividing the size of data_train by num_batches. If the user provides no batch size and num_batches is 0, a ValueError is raised.

Returns:

The batch size for training.

Return type:

self.args.train_bs
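
A sketch of the rule described above (attribute names follow the surrounding docs; the defaults are assumptions):

def get_train_bs(self):
    if getattr(self.args, "num_batches", 0):
        # Derive the batch size so one epoch spans num_batches steps.
        self.args.train_bs = max(1, len(self.data_train) // self.args.num_batches)
    elif not getattr(self.args, "train_bs", None):
        raise ValueError("Either train_bs or a nonzero num_batches must be given.")
    return self.args.train_bs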

train_dataloader()[source]

Implement one or more PyTorch DataLoaders for training.

Returns:

A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this page.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern: download in prepare_data(), process and split in setup().

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloader, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}
val_dataloader()[source]

Implement one or multiple PyTorch DataLoaders for validation.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying validation samples.

Examples:

def val_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def val_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

Note

In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.

test_dataloader()[source]

Implement one or multiple PyTorch DataLoaders for testing.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern: download in prepare_data(), process and split in setup().

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying testing samples.

Example:

def test_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def test_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

Note

In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.

RuleDataLoader

class neuralkg_ind.data.RuleDataLoader.RuleDataset(args)[source]

Bases: Dataset

class neuralkg_ind.data.RuleDataLoader.RuleDataLoader(args)[source]

Bases: DataLoader

dataset: Dataset[T_co]
batch_size: int | None
num_workers: int
pin_memory: bool
drop_last: bool
timeout: float
sampler: Sampler
prefetch_factor: int

Sampler

class neuralkg_ind.data.Sampler.SubSampler(args)[source]

Bases: BaseGraph

Sampling subgraphs.

Prepare subgraphs and collect batches of subgraphs.

sampling(data)[source]

Sampling function to collect a batch of subgraphs for training.

Parameters:

data – List of train data.

Returns:

The batch of training data.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.RMPISampler(args)[source]

Bases: BaseGraph

Sampling subgraphs for RMPI training, which additionally includes the disclosing subgraph.

sampling(data)[source]

Sampling function to collect a batch of subgraphs for RMPI training.

Parameters:

data – List of RMPI train data.

Returns:

The batch of RMPI training data.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.UniSampler(args)[source]

Bases: BaseSampler

Random negative sampling: filtering out positive samples and randomly selecting some samples as negatives.

cross_sampling_flag

The flag of cross sampling head and tail negative samples.

sampling(data)[source]

Filtering out positive samples and selecting some samples randomly as negative samples.

Parameters:

data – The triples to be sampled.

Returns:

The training data.

Return type:

batch_data

uni_sampling(data)[source]
get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.BernSampler(args)[source]

Bases: BaseSampler

Using a Bernoulli distribution to select whether to replace the head entity or the tail entity.

lef_mean

Record the mean of head entity

rig_mean

Record the mean of tail entity

sampling(data)[source]

Using a Bernoulli distribution to select whether to replace the head entity or the tail entity.

Parameters:

data – The triples to be sampled.

Returns:

The training data.

Return type:

batch_data

calc_bern()[source]

Calculating the lef_mean and rig_mean.

Returns:

Record the mean of head entity. rig_mean: Record the mean of tail entity.

Return type:

lef_mean
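
For intuition, a sketch of the classic TransH "bern" rule that statistics like these feed; tph and hpt denote the per-relation tails-per-head and heads-per-tail averages (the names here are illustrative, not the library's):

import numpy as np

def bern_corrupt(h, r, t, tph, hpt, num_ent):
    # Replace the head with probability tph[r] / (tph[r] + hpt[r]),
    # otherwise replace the tail; this reduces false negatives for
    # one-to-many / many-to-one relations.
    if np.random.random() < tph[r] / (tph[r] + hpt[r]):
        return (np.random.randint(num_ent), r, t)  # corrupt head
    return (h, r, np.random.randint(num_ent))      # corrupt tail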

static sampling_keys()[source]
class neuralkg_ind.data.Sampler.AdvSampler(args)[source]

Bases: BaseSampler

Self-adversarial negative sampling, in math:

p\left(h_{j}^{\prime}, r, t_{j}^{\prime} \mid \left\{\left(h_{i}, r_{i}, t_{i}\right)\right\}\right) = \frac{\exp \left(\alpha f_{r}\left(\mathbf{h}_{j}^{\prime}, \mathbf{t}_{j}^{\prime}\right)\right)}{\sum_{i} \exp \left(\alpha f_{r}\left(\mathbf{h}_{i}^{\prime}, \mathbf{t}_{i}^{\prime}\right)\right)}

Attributes:

freq_hr: The count of (h, r) pairs. freq_tr: The count of (t, r) pairs.
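
A sketch of how such counts yield word2vec-style subsampling weights, following the RotatE recipe (count uses the same (tail, -relation - 1) encoding as count_frequency above; the names are illustrative):

import torch

def subsampling_weights(pos_triples, count):
    # Rarer (h, r) / (r, t) patterns receive larger training weights.
    freqs = [count[(h, r)] + count[(t, -r - 1)] for h, r, t in pos_triples]
    return torch.sqrt(1.0 / torch.tensor(freqs, dtype=torch.float))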

sampling(pos_sample)[source]

Self-adversarial negative sampling.

Parameters:

pos_sample – The triples to be sampled.

Returns:

The training data.

Return type:

batch_data

calc_freq()[source]

Calculating the freq_hr and freq_tr.

Returns:

The count of (h, r) pairs. freq_tr: The count of (t, r) pairs.

Return type:

freq_hr

class neuralkg_ind.data.Sampler.AllSampler(args)[source]

Bases: RevSampler

Merging triples that share the same head and relation; all false tail entities are taken as negative samples.

sampling(data)[source]

Randomly sampling from the merged triples.

Parameters:

data – The triples to be sampled.

Returns:

The training data.

Return type:

batch_data

sampling_keys()[source]
class neuralkg_ind.data.Sampler.CrossESampler(args)[source]

Bases: BaseSampler

sampling(data)[source]

Each sample performs head and tail prediction simultaneously.

init_label(row)[source]
sampling_keys()[source]
class neuralkg_ind.data.Sampler.ConvSampler(args)[source]

Bases: RevSampler

Merging triples that share the same head and relation; all false tail entities are taken as negative samples.

Triples that share the same head and relation are treated as one triple.

label

Masks the false tails as negative samples.

triples

The triples to be sampled.

sampling(pos_hr_t)[source]

Randomly sampling from the merged triples.

Parameters:

pos_hr_t – The (head, relation) pairs to be sampled.

Returns:

The training data.

Return type:

batch_data

sampling_keys()[source]
class neuralkg_ind.data.Sampler.XTransESampler(args)[source]

Bases: RevSampler

Random negative sampling and recording neighbor entities.

triples

The triples to be sampled.

neg_sample

The negative samples.

h_neighbor

The neighbors of the sampled entities.

h_mask

The tag of effective neighbors.

max_neighbor

The maximum number of neighbor entities.

sampling(data)[source]

Random negative sampling and recording neighbor entities.

Parameters:

data – The triples to be sampled.

Returns:

The training data.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.GraphSampler(args)[source]

Bases: RevSampler

Graph-based sampling for neural networks.

entity

The entities of sampled triples.

relation

The relation of sampled triples.

triples

The sampled triples.

graph

The graph built from the sampled triples via dgl.graph in DGL.

norm

The edge norm in graph.

label

Masks the false tails as negative samples.

sampling(pos_triples)[source]

Graph-based sampling for neural networks.

Parameters:

pos_triples – The triples to be sampled.

Returns:

The training data.

Return type:

batch_data

get_sampling_keys()[source]
sampling_negative(mode, pos_triples, num_neg)[source]

Random negative sampling without filtering.

Parameters:
  • mode – The mode of negative sampling.

  • pos_triples – The positive triples.

  • num_neg – The number of negative samples corresponding to each triple.

Results:

neg_samples: The negative triples.

build_graph(num_ent, triples, power)[source]

Using sampled triples to build a graph by dgl.graph in DGL.

Parameters:
  • num_ent – The number of entities.

  • triples – The positive sampled triples.

  • power – The power index for normalization.

Returns:

The relations of the sampled triples.
graph: The graph built from the sampled triples via dgl.graph in DGL.
edge_norm: The edge norm in the graph.

Return type:

rela

comp_deg_norm(graph, power=-1)[source]

Calculating the normalization node weight.

Parameters:
  • graph – The graph built from the sampled triples via dgl.graph in DGL.

  • power – The power index for normalization.

Returns:

The node weight of normalization.

Return type:

tensor

node_norm_to_edge_norm(graph, node_norm)[source]

Calculating the normalization edge weight.

Parameters:
  • graph – The graph built from the sampled triples via dgl.graph in DGL.

  • node_norm – The node weight of normalization.

Returns:

The edge weight of normalization.

Return type:

tensor
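
A sketch of the two normalization helpers, following the common RGCN recipe in DGL (a plausible reading of the methods above, not the library's exact code):

import torch

def comp_deg_norm(graph, power=-1):
    # Node weight = in_degree ** power; isolated nodes get 0 instead of inf.
    deg = graph.in_degrees().float()
    norm = deg.pow(power)
    norm[torch.isinf(norm)] = 0
    return norm

def node_norm_to_edge_norm(graph, node_norm):
    # Each edge inherits the normalization weight of its destination node.
    graph = graph.local_var()
    graph.ndata['norm'] = node_norm
    graph.apply_edges(lambda edges: {'norm': edges.dst['norm']})
    return graph.edata['norm']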

sampling_positive(positive_triples)[source]

Regenerate positive sampling.

Parameters:

positive_triples – The positive sampled triples.

Results:

The regenerated triples and entities, with invisible entities filtered out.

class neuralkg_ind.data.Sampler.KBATSampler(args)[source]

Bases: BaseSampler

Graph-based n-hop neighbour sampling for neural networks.

n_hop

The graph of n_hop neighbours.

graph

The adjacency graph.

neighbours

The neighbours of sampled triples.

adj_matrix

The adjacency matrix of the sampled triples.

triples

The sampled triples.

triples_GAT_pos

Positive triples.

triples_GAT_neg

Negative triples.

triples_Con

All triples including positive triples and negative triples.

label

Masks the false tails as negative samples.

sampling(pos_triples)[source]

Graph-based n-hop neighbour sampling for neural networks.

Parameters:

pos_triples – The triples to be sampled.

Returns:

The training data.

Return type:

batch_data

get_sampling_keys()[source]
bfs(graph, source, nbd_size=2)[source]

Using breadth-first search (BFS) to generate the n-hop neighbor graph.

Parameters:
  • graph – The adjacency graph.

  • source – Head node.

  • nbd_size – The number of hops.

Returns:

N_hop neighbor graph.

Return type:

neighbors

get_neighbors(nbd_size=2)[source]

Getting the relation and entity of the source in the n_hop neighborhood.

Parameters:

nbd_size – The number of hops.

Returns:

Record the relation and entity of the source in the n_hop neighborhood.

Return type:

self.neighbours

get_unique_entity(triples)[source]

Getting the set of entities.

Parameters:

triples – The sampled triples.

Returns:

The set of entities.

Return type:

numpy.array

get_batch_nhop_neighbors_all(nbd_size=2)[source]

Getting n_hop neighbors of all entities in batch.

Parameters:

nbd_size – The number of hops.

Returns:

The set of n_hop neighbors.

sampling_negative(mode, pos_triples, num_neg)[source]

Random negative sampling.

Parameters:
  • mode – The mode of negative sampling.

  • pos_triples – The positive triples.

  • num_neg – The number of negative samples corresponding to each triple.

Results:

neg_samples: The negative triples.

sam_negative(mode, pos_triples, num_neg)[source]

Random negative sampling without filtering.

Parameters:
  • mode – The mode of negative sampling.

  • pos_triples – The positive triples.

  • num_neg – The number of negative samples corresponding to each triple.

Results:

neg_samples: The negative triples.

class neuralkg_ind.data.Sampler.CompGCNSampler(args)[source]

Bases: GraphSampler

Graph-based sampling for neural networks.

relation

The relation of sampled triples.

triples

The sampled triples.

graph

The graph built from the sampled triples via dgl.graph in DGL.

norm

The edge norm in graph.

label

Masks the false tails as negative samples.

sampling(pos_hr_t)[source]

Graph-based n-hop neighbour sampling for neural networks.

Parameters:

pos_hr_t – The (hr, t) pairs to be sampled.

Returns:

The training data.

Return type:

batch_data

get_sampling_keys()[source]
node_norm_to_edge_norm(graph, node_norm)[source]

Calculating the normalization edge weight.

Parameters:
  • graph – The graph built from the sampled triples via dgl.graph in DGL.

  • node_norm – The node weight of normalization.

Returns:

The edge weight of normalization.

Return type:

norm

class neuralkg_ind.data.Sampler.TestSampler(sampler)[source]

Bases: object

Sampling triples and recording positive triples for testing.

sampler

The training sampler function.

hr2t_all

Record the tail corresponding to the same head and relation.

rt2h_all

Record the head corresponding to the same tail and relation.

num_ent

The count of entities.

get_hr2t_rt2h_from_all()[source]

Get the set of hr2t and rt2h from all datasets (train, valid, and test); the data type is tensor.

Update:

self.hr2t_all: The set of hr2t. self.rt2h_all: The set of rt2h.

sampling(data)[source]

Sampling triples and recording positive triples for testing.

Parameters:

data – The triples to be sampled.

Returns:

The data used to be evaluated.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.ValidSampler(sampler)[source]

Bases: object

Sampling subgraphs for validation.

sampler

The training sampler function.

args

Model configuration parameters.

sampling(data)[source]

Sampling function to collect a batch of subgraphs for validation.

Parameters:

data – List of subgraph data for validation.

Returns:

The batch of validating data.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.ValidRMPISampler(sampler)[source]

Bases: object

Sampling subgraphs for RMPI validation.

sampler

The training sampler function.

args

Model configuration parameters.

sampling(data)[source]

Sampling function to collect a batch of RMPI subgraphs (enclosing and disclosing) for validation.

Parameters:

data – List of subgraph data for RMPI validation.

Returns:

The batch of RMPI validating data.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.TestSampler_hit(sampler)[source]

Bases: object

Sampling subgraphs for testing link prediction.

sampler

The training sampler function.

args

Model configuration parameters.

m_h2r

The matrix of head to rels.

m_t2r

The matrix of tail to rels.

sampling(data)[source]

Sampling function to collect a batch of subgraphs for testing MRR and Hits@1,5,10.

Parameters:

data – List of subgraph data for testing.

Returns:

The batch of testing data.

Return type:

batch_data

get_sampling_keys()[source]
get_subgraphs(all_links, adj_list, dgl_adj_list, max_node_label_value, m_h2r, m_t2r)[source]

Extracting and labeling subgraphs.

Parameters:
  • all_links – All candidate links formed by replacing the head or tail of the corresponding triple.

  • adj_list – List of adjacency matrix.

  • dgl_adj_list – List of undirected head to tail matrix.

  • max_node_label_value – Max value of node label.

  • m_h2r – The matrix of head to rels.

  • m_t2r – The matrix of tail to rels.

Returns:

Subgraphs for testing. r_labels: Relation labels.

Return type:

subgraphs

prepare_features(subgraph, n_labels, max_n_label, n_feats=None)[source]

One-hot encode the node label features and concatenate them to n_feats.

Parameters:
  • subgraph – Subgraph for processing.

  • n_labels – Node labels.

  • max_n_label – Max value of node label.

  • n_feats – node features.

Returns:

Subgraph after processing.

Return type:

subgraph
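
A GraIL-style sketch of the one-hot step, assuming n_labels is an (N, 2) array of double-radius node labels and max_n_label holds its per-column maxima (illustrative, not the library's exact code):

import numpy as np

def one_hot_label_feats(n_labels, max_n_label, n_feats=None):
    # One-hot encode each label coordinate, then append existing features.
    n_nodes = n_labels.shape[0]
    width = max_n_label[0] + 1 + max_n_label[1] + 1
    label_feats = np.zeros((n_nodes, width))
    label_feats[np.arange(n_nodes), n_labels[:, 0]] = 1
    label_feats[np.arange(n_nodes), max_n_label[0] + 1 + n_labels[:, 1]] = 1
    if n_feats is not None:
        label_feats = np.concatenate((label_feats, n_feats), axis=1)
    return label_feats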

class neuralkg_ind.data.Sampler.TestRMPISampler_hit(sampler)[source]

Bases: object

Sampling subgraphs for RMPI testing link prediction.

sampler

The training sampler function.

args

Model configuration parameters.

sampling(data)[source]

Sampling function to collect a batch of subgraphs for RMPI testing.

Parameters:

data – List of subgraph data for RMPI testing.

Returns:

The batch of RMPI testing data.

Return type:

batch_data

get_sampling_keys()[source]
prepare_subgraph(dgl_adj_list, nodes, rel, node_labels, max_node_label_value)[source]

Prepare enclosing or disclosing subgraph.

Parameters:
  • dgl_adj_list – List of undirected head to tail matrix.

  • nodes – Nodes of subgraph.

  • rel – Relation idx.

  • node_labels – Node labels.

  • max_node_label_value – Max value of node label.

Returns:

Subgraph for testing.

Return type:

subgraph

get_subgraphs(all_links, adj_list, dgl_adj_list, max_node_label_value)[source]

Extracting and labeling subgraphs.

Parameters:
  • all_links – All candidate links formed by replacing the head or tail of the corresponding triple.

  • adj_list – List of adjacency matrix.

  • dgl_adj_list – List of undirected head to tail matrix.

  • max_node_label_value – Max value of node label.

Returns:

Subgraphs for testing. r_labels: Relation labels.

Return type:

subgraphs

prepare_features(subgraph, n_labels, max_n_label, n_feats=None)[source]

One-hot encode the node label features and concatenate them to n_feats for RMPI.

Parameters:
  • subgraph – Subgraph for processing.

  • n_labels – Node labels.

  • max_n_label – Max value of node label.

  • n_feats – node features.

Returns:

Subgraph after processing.

Return type:

subgraph

class neuralkg_ind.data.Sampler.TestSampler_auc(sampler)[source]

Bases: object

Sampling subgraphs for testing triple classification.

sampler

The training sampler function.

args

Model configuration parameters.

sampling(data)[source]

Sampling function to collect a batch of subgraphs for testing AUC and AUC-PR.

Parameters:

data – List of subgraph data for testing.

Returns:

The batch of testing data.

Return type:

batch_data

class neuralkg_ind.data.Sampler.TestRMPISampler_auc(sampler)[source]

Bases: object

Sampling subgraphs for testing RMPI triple classification.

sampler

The training sampler function.

args

Model configuration parameters.

sampling(data)[source]

Sampling function to collect a batch of subgraphs for RMPI testing of AUC and AUC-PR.

Parameters:

data – List of subgraph data for RMPI testing.

Returns:

The batch of RMPI testing data.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.MetaSampler(args)[source]

Bases: BaseMeta

Sampling meta tasks and collecting training data.

sampling(data)[source]

Sampling function to collect a batch of meta tasks for training, which is the default.

Parameters:

data – List of task for training.

Returns:

List of task for training.

Return type:

data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.ValidMetaSampler(sampler)[source]

Bases: object

Collecting tasks for validation.

sampling(data)[source]

Sampling function to collect a batch of meta tasks for validation, which is the default.

Parameters:

data – List of task for validating.

Returns:

List of task for validating.

Return type:

data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.TestMetaSampler_hit(sampler)[source]

Bases: object

Collecting tasks for testing.

sampling(data)[source]

Sampling function to collect a batch of meta tasks for testing MRR and Hits@1,5,10.

Parameters:

data – List of task for testing.

Returns:

Batch of tasks for testing MRR and Hits@1,5,10.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.TestMetaSampler_auc(sampler)[source]

Bases: object

Collecting tasks for testing.

sampling(data)[source]

Sampling function to collect a batch of meta tasks for testing AUC and AUC-PR.

Parameters:

data – List of task for testing.

Returns:

Batch of tasks for testing AUC and AUC-PR.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.GraphTestSampler(sampler)[source]

Bases: object

Sampling graph for testing.

sampler

The training sampler function.

hr2t_all

Record the tail corresponding to the same head and relation.

rt2h_all

Record the head corresponding to the same tail and relation.

num_ent

The count of entities.

triples

The training triples.

get_hr2t_rt2h_from_all()[source]

Get the set of hr2t and rt2h from all datasets (train, valid, and test); the data type is tensor.

Update:

self.hr2t_all: The set of hr2t. self.rt2h_all: The set of rt2h.

sampling(data)[source]

Sampling graph for testing.

Parameters:

data – The triples to be sampled.

Returns:

The data used to be evaluated.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.CompGCNTestSampler(sampler)[source]

Bases: object

Sampling graph for testing.

sampler

The training sampler function.

hr2t_all

Record the tail corresponding to the same head and relation.

rt2h_all

Record the head corresponding to the same tail and relation.

num_ent

The count of entities.

triples

The training triples.

get_hr2t_rt2h_from_all()[source]

Get the set of hr2t and rt2h from all datasets (train, valid, and test); the data type is tensor.

Update:

self.hr2t_all: The set of hr2t. self.rt2h_all: The set of rt2h.

sampling(data)[source]

Sampling graph for testing.

Parameters:

data – The triples to be sampled.

Returns:

The data used to be evaluated.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.SEGNNTrainProcess(args)[source]

Bases: RevSampler

get_h2rt_t2hr_from_train()[source]

Getting the sets h2rt and t2hr from the train dataset; the values are stored as numpy arrays.

Update:

self.h2rt_train: The set of h2rt. self.t2rh_train: The set of t2hr.

get_onehot_label(label)[source]
get_sampling()[source]
construct_kg(directed=False)[source]

Construct the KG.

Parameters:

directed – Whether to add an inverse version of each edge to make an undirected graph: False when training the SE-GNN model, True when computing SE metrics.

class neuralkg_ind.data.Sampler.SEGNNTrainSampler(args)[source]

Bases: object

get_train()[source]
get_valid()[source]
get_test()[source]
sampling(data)[source]
class neuralkg_ind.data.Sampler.SEGNNTestSampler(sampler)[source]

Bases: Dataset

get_hr2t_rt2h_from_all()[source]

Get the set of hr2t and rt2h from all datasets (train, valid, and test); the data type is tensor.

Update:

self.hr2t_all: The set of hr2t. self.rt2h_all: The set of rt2h.

sampling(data)[source]

Sampling triples and recording positive triples for testing.

Parameters:

data – The triples to be sampled.

Returns:

The data used to be evaluated.

Return type:

batch_data

get_sampling_keys()[source]
class neuralkg_ind.data.Sampler.KGDataset(triples)[source]

Bases: Dataset

base_data_module

Base DataModule class.

class neuralkg_ind.data.base_data_module.Config[source]

Bases: dict

class neuralkg_ind.data.base_data_module.BaseDataModule(*args: Any, **kwargs: Any)[source]

Bases: LightningDataModule

Base DataModule. Learn more at https://pytorch-lightning.readthedocs.io/en/stable/datamodules.html

static add_to_argparse(parser)[source]
prepare_data()[source]

Use this method to do things that might write to disk or that need to be done only from a single GPU in distributed settings (so don’t set state self.x = y).

setup(stage=None)[source]

Split into train, val, test, and set dims. Should assign torch Dataset objects to self.data_train, self.data_val, and optionally self.data_test.

train_dataloader()[source]

Implement one or more PyTorch DataLoaders for training.

Returns:

A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this page.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern: download in prepare_data(), process and split in setup().

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloader, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}
val_dataloader()[source]

Implement one or multiple PyTorch DataLoaders for validation.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying validation samples.

Examples:

def val_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def val_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

Note

In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.

test_dataloader()[source]

Implement one or multiple PyTorch DataLoaders for testing.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern: download in prepare_data(), process and split in setup().

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying testing samples.

Example:

def test_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def test_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

Note

In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.

get_config()[source]