BIGDATA: F: DeepWalking Graphs for Feature Extraction

NSF grant IIS-1546113

PI: Steven Skiena

Project Summary

The sparsity of large networks makes it difficult to efficiently extract features for machine learning algorithms. Recent work on network embeddings (DeepWalk) has revealed how neural language modeling can be applied to a very general class of graph analysis problems in data mining and information retrieval. This project will improve training algorithms and data representation for large-scale networks, creating better, more powerful graph embeddings for weighted and attributed networks. It will also enable meaningful comparison of the relative performance of network connectivity features vs. more text-oriented features. It is possible that there might be more usable information in links than in the readable content itself.

This project will develop these methods in several new directions, including extensions to new graph classes and speed/scale enhancements. The original DeepWalk induced latent representations only from unweighted, undirected, and connected graphs. But there is considerable interest in applying it to more general graphs arising in data analysis. Doing the right thing on such natural networks as bipartite and disconnected graphs presents surprisingly subtle issues of theoretical and practical significance. This project will also explore several ideas to increase training performance of network embeddings, including more efficient gradient updates and improved graph sampling methods and particularly the power of self-avoiding random walks to oversample otherwise rare nodes. This project seeks to extend the effective range of DeepWalk by several orders of magnitude, from the 10 million vertex graphs we routinely handle today to web-scale networks on billions of nodes. The broader impacts of this work are far reaching across data mining and information retrieval, including user profiling/demographic inference, online advertising, and fraud detection. The software and data resources developed under this research project will be released as open source. They will be directly applicable to the biomedical and social sciences, and serve as both an educational and scholarly resource. For further information, see the project website at http://www.cs.stonybrook.edu/~skiena/deepwalking.

Primary References