
Dr. Wang's research interests include Scalable Big Data Management and Analytics, Spatial and Temporal Data Management and Analytics, Medical Imaging Informatics, and Healthcare and Public Health Data Analytics. The work is sponsored by NSF, NIH, CDC, Pitney Bowes, Amazon, and Google.
Scalable Big Data Management and Analytics
My research goal in big data management and analytics is to address the challenges of delivering effective, scalable, high-performance software systems for managing, querying, and mining complex big data across multiple dimensions, including 2D and 3D spatial and imaging data, temporal data, spatio-temporal data, and sequencing data. This work is driven by emerging spatial big data problems from geospatial applications, location-based services, and social network applications, enabled by cost-effective ubiquitous positioning technologies and collaborative spatial data collection. Meanwhile, rapid improvements in data acquisition technologies have produced tremendous amounts of scientific data, such as high-resolution digital pathology images in both 2D and 3D and next-generation whole-genome sequencing data. Managing and analyzing such data poses several major challenges, including exploding data volume, high data complexity, and/or temporal dynamics. My research will ultimately create novel open source software systems to support challenging applications in multiple domains, by investigating architectures for such software that account for different forms of heterogeneity, processing patterns, massive parallelism, and storage layers with different characteristics.
Related projects:
Medical Imaging Informatics (2D and 3D)
Systematic analysis of large-scale whole slide image data can involve many interrelated analyses on large numbers of images, generating tremendous amounts of quantifications such as shape and texture, as well as classifications of the quantified features. Our research focuses on supporting the management, querying, and analytics of 2D and 3D whole slide image data at large scale.
Related projects:
GIS Based Healthcare Analytics
GIS-oriented public health research focuses strongly on the locations of patients and the agents of disease. It studies community- and region-level patterns and variations, and the impact of demographic, socio-economic, and environmental factors on diseases and human health. In the past, due to limited access to health outcome data, public health studies were often limited to macro scales such as the county level, which may not allow public health researchers and health officials to adequately identify, analyze, and monitor health problems at the community level. In this research, we take advantage of the New York State SPARCS open dataset, which collects patient-level detail on patient characteristics, diagnoses and treatments, services, and charges for each hospital inpatient stay and outpatient visit. The data also provides street-level location information for each patient and healthcare facility site. Through geocoding and geo-mapping, we will provide spatially oriented analysis of New York State health records at the community level. We study geospatial distributions of diseases in New York State at multiple spatial resolutions, and provide multi-dimensional analysis by grouping patients into different groups. We discover potential spatial clusters, hot spots, and anomalies in disease distributions. We will also study potential correlations between socio-economic determinants and diseases by integrating additional spatial datasets, including socio-economic data and environmental data (air quality indexes, pollen counts).
Related projects:
  • Coming soon...
Clinical Natural Language Processing
While electronic medical record (EMR) systems employ increasingly rich data models that offer a wide variety of options for structured data entry, a large amount of medical data remains in free-form, narrative text reports. Our research goal in clinical natural language processing is to provide convenient and intelligent information extraction and classification from medical reports by taking advantage of both individual human interventions and collective human intelligence, to ultimately improve diagnosis, reduce errors, and inform medical practice and decision making. One ongoing project is IDEAL-X (http://ideal-x.org), an interactive, incremental-learning-based information extraction system that facilitates information extraction and classification from narrative medical reports and transforms extracted data into normalized structured forms. The system takes an incremental learning approach that quickly learns from user feedback on a small set of reports and achieves high accuracy on data extraction with minimal effort from users. Extracted data can be further normalized through controlled vocabularies. IDEAL-X requires no special configuration or training sets and is not constrained to specific domains, so it is easy to use and highly portable. IDEAL-X is being used for cohort identification from tens of thousands of patients, and for automated classification of a massive number of radiology reports from the CDC.
Related projects:
Spatial and Temporal Data Management
Research and develop expressive and efficient spatial and temporal query languages and efficient, scalable spatial and temporal data management and processing, and support real-world spatial and temporal applications, including biomedical, geospatial, and mobile applications.
Related projects:
Semantics and Data Standardization
Research and develop biomedical ontologies, systems for data normalization with controlled vocabularies, semantic interoperability, and semantics-enabled queries.
Related projects: