CSE391 (Spring '04)
Web Queries: Methods and Tools

General Information

Course description: This course is for students interested in methods and tools for querying the Web as well as basic concepts and underlying principles. Topics include: (1) describing, transforming, and querying semi-structured data (XML, XSLT, XQuery, SQL/XML); (2) understanding and navigating in Web pages (HTML, Web search engines, targeted Web query tools); and (3) capturing and manipulating semantic data (RDF, ontologies, semantic Web services). Students will do a course project; there will also be weekly or bi-weekly homeworks centered around the project and a midterm exam. | Prerequisites: CSE305. | Credits: 3.

Instructor: Annie Liu | Email: liuATcsDOTsunysbDOTedu | Office: Computer Science 1433 | Phone: 632-8463.

TA: Chen Zhao | Email: chzhaoATcsDOTsunysbDOTedu.

Hours: Tue Thu 3:50-5:10PM, in Computer Science 2129/31. | Annie's office hours: Tue 2:00-3:20PM, Thu 10-11:20AM, in CS 1433. | Chen's office hours: Wed 4:00-5:20PM, Fri 12:50-2:10PM, in CS 2110.

Textbook: The first part of the course will use XML and Web Data, Chapter 18 of Databases and Transaction Processing: An Application-Oriented Approach in the upcoming new edition, Addison-Wesley, by Michael Kifer, Arthur Bernstein, and Philip Lewis. The department will provide copies of the chapter to students in the course for a small fee. The rest of the course will use introductory and tutorial materials provided throughout the course. You are asked to take good course notes yourself. Slides will be used for parts of the course and will be made available.

Grading: The midterm exam is worth 40% of the grade, the course project 30%, and homeworks together 30%. A bi-weekly homework is worth twice as much as a weekly homework, and all homeworks are also part of the course project. Bonus points may also be offered on the project and the exam where they fit. Late handins will not receive credit. Exceptions will be accommodated only when supported by official documents. The Pass/No Credit (P/NC) option is not available for this course.

Course homepage: http://www.cs.sunysb.edu/~liu/cse391/, containing all course-related information.

Lectures


Course overview
Lecture 1 (01/27/04): Overview. Course info; web; web queries, greatness and challenging problems. Reading: anything you can find about web query problems and web query tools. Assign 1.
Lecture 2 (01/29/04): Projects part I. More course info; assign 1; project overview; kinds of data and queries for part I; ideas for part II. Reading: anything you can find about web query methods and tools.
Describing, transforming, and querying semistructured data
Lecture 3 (02/03/04): XML data description. From relational model to object-oriented model to semistructured data and XML. Reading: Sec.16.1 (slides 4-13), Sec.18.1-18.2.1. Assign 2.
Lecture 4 (02/05/04): XML and DTD. XML elements and DB objects, attributes and ID/REF, well-formedness; XML namespace; DTD, validity, inadequacy. Reading: Sec.18.2.2-18.2.5.
Lecture 5 (02/10/04): XML Schema. Namespaces in schema and instance documents, include and import of schemas, simple and complex types. Reading: Sec.18.3-18.3.3.
Lecture 6 (02/12/04): XML Schema. Review, extension and restriction of base types, putting it all together, anonymous types, integrity constraints. Reading: Sec.18.3.4-18.3.5.
Lecture 7 (02/17/04): XML query languages: XPath. Document tree, navigation steps, axis, selector, wildcards, predicates for queries, XPointer. Reading: Sec.18.4-18.4.1. Assign 3.
Lecture 8 (02/19/04): XSLT. Basics of stylesheets, XSLT instructions, templates, matching patterns, applying templates recursively, defaults, limitations. Reading: Sec.18.4.2.
Lecture 9 (02/24/04): XQuery. Basics in SQL style, iterate-test-return, adding tags, variable binding, unique values, semantics. Reading: Sec.18.4.3.
Lecture 10 (02/26/04): XQuery. Semantics, user-defined functions, importing schemas, namespace, grouping and aggregation, quantifications. Reading: Sec.18.4.3.
Lecture 11 (03/02/04): SQL/XML. Publish relations as XML, create XML from queries; XML type, store XML in RDB, query XML stored in relations, modify. Reading: Sec.18.4.4.
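
A small illustration of the path navigation and predicates covered under XPath (Lecture 7), written with Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. This is a sketch for orientation, not course material: the sample document, its element and attribute names, and both function names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; all names and values are illustrative only.
XML_TEXT = """
<catalog>
  <book id="b1" year="2003">
    <title>XML Basics</title>
    <author>Smith</author>
  </book>
  <book id="b2" year="2004">
    <title>Querying the Web</title>
    <author>Jones</author>
  </book>
</catalog>
"""

def titles_by_year(xml_text, year):
    """Navigate catalog -> book -> title, filtering books by an attribute predicate."""
    root = ET.fromstring(xml_text)
    path = f".//book[@year='{year}']/title"
    return [title.text for title in root.findall(path)]

def book_ids_by_author(xml_text, author):
    """Select books whose <author> child has the given text (child-value predicate)."""
    root = ET.fromstring(xml_text)
    return [book.get("id") for book in root.findall(f".//book[author='{author}']")]

if __name__ == "__main__":
    print(titles_by_year(XML_TEXT, 2004))
    print(book_ids_by_author(XML_TEXT, "Smith"))
```

A full XPath or XQuery processor handles axes, functions, and joins that this subset does not.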
Midterm, review, and preview
Lecture 12 (03/04/04): Review for midterm exam. Review of SQL/XML; topics for the exam; what to bring to the exam; practice problems with solutions. Assign 3 due.
Midterm exam (03/09/04): In-class exam. Open to the XML chapter, our lecture slides, and your own handwritten notes; closed to everything else.
Lecture 14 (03/11/04): Projects part II. Preliminary exam results, assign 4, possible projects, amount of work, describing what+how, questionnaire. Reading: anything for choosing a project. Assign 4.
Understanding and navigating in web pages
Lecture 15 (03/16/04): Team projects (2-5, in CS1433): Jiande+Chao, Muhammad+Youssef, Ahmed+Kunal, Adewale+Stephanie, Ben+Matt, Ralph+Gory, Qiang+Kim, John Paul+Sofoklis, (3/17 2PM) Lunyin+Kwokbun.
Lecture 16 (03/18/04): Team projects (12:30-2, 3:30-5): Qiting+Guoqiang, Mark, Andrew+Robin, Chaitanya+Shika, David+Guo, Bin+Xiao, Jinchao+Changtai, Daniel+Wei, Ying+Brian, Shanen+Yash, Sungpill+Simon. Assign 5.
Lecture 17 (03/23/04): An intelligent web query tool, guest lecture by Prof. I.V. Ramakrishnan. WinAgent, introduction, demo, learning by examples. Reading: a winagent paper.
Lecture 18 (03/25/04): Web query assistants, simple&neat:-). Meta tools: WinAgent, MultiRunner, MultiQuery; neat ideas for why, what, how. Reading: any neat web query tools you can find (extra credit).
Lecture 19 (03/30/04): Web search engines. Architecture; crawling: page selection, page freshness; storage and indexing; ranking and link analysis: PageRank and HITS. Reading: a web search paper.
Lecture 20 (04/01/04): Web page generation. Static vs dynamic; easy, fast, and secure for users and servers; technologies and languages overview and comparison. Reading: see links up to Web Survey. Assign 6.
Spring break: April 5-9. Have a good one!
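
For the link-analysis topic in Lecture 19, here is a minimal sketch of PageRank computed by repeated iteration over a tiny link graph. It follows the commonly described formulation; the damping factor 0.85, the fixed iteration count, the handling of pages without outgoing links, and the three-page example graph are all assumptions chosen for illustration, not details taken from the lecture.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a dict page -> rank for a graph given as page -> list of linked pages."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
EXAMPLE_LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

if __name__ == "__main__":
    for page, score in sorted(pagerank(EXAMPLE_LINKS).items()):
        print(page, round(score, 4))
```

In this example page c ends up ranked above page b, since c is linked from both a and b while b is linked only from a.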
Capturing and manipulating semantic data
Lecture 21 (04/13/04): Semantic data, RDF triples, using URIs, namespace prefix, class and property ontologies; comparing formats. Reading: primer+formats in tutorial.
Lecture 22 (04/15/04): Shortcuts for paths and lists; OWL Web Ontology Language, same & equiv class, diff & disjoint, cardinality, class hierarchies, domain & range. Reading: shortcuts+ontologies in tutorial.
Lecture 23 (04/20/04): Rules. Converting and merging using cwm; rules, as facts, variables and forAll/forSome; inference, action, transformation. Reading: rules+processing in tutorial.
Lecture 24 (04/22/04): Applications: integration problem, reading legacy formats using Perl&Python/cwm, rules for converting between vocabularies, exporting data using cwm/Python. Project report and presentation requirements. Reading: application in tutorial. Assign 6 due.
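
To make the ideas of RDF triples and rules (Lectures 21 and 23) concrete, the following plain-Python sketch stores triples as tuples, matches them against a pattern, and applies one rule to derive new facts. It does not use cwm or N3 syntax; the example.org URIs, the parent/grandparent vocabulary, and the function names are hypothetical and chosen only for illustration.

```python
# Each triple is (subject, predicate, object); URIs are made up for this example.
EX = "http://example.org/"
TRIPLES = {
    (EX + "alice", EX + "parent", EX + "bob"),
    (EX + "bob", EX + "parent", EX + "carol"),
}

def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every given position; None is a wildcard."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

def infer_grandparents(triples):
    """Rule: for all x, y, z: (x parent y) and (y parent z) imply (x grandparent z)."""
    derived = set()
    for x, _, y in match(triples, p=EX + "parent"):
        for _, _, z in match(triples, s=y, p=EX + "parent"):
            derived.add((x, EX + "grandparent", z))
    # Return the original facts together with the derived ones.
    return triples | derived

if __name__ == "__main__":
    for triple in sorted(infer_grandparents(TRIPLES)):
        print(triple)
```

The rule is applied once here; a rule engine would repeat the process until no new triples can be derived.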
Project presentations
Lecture 25 (04/27/04): Stock data analysis: Andrew&Robin; China Index Analysis: LunYin&KwokBun; Company data analysis: Xiao&Bin; Product Category and Price Comparison: Jiande&Chaoying; Oscar Winners and History: Brian&Ying.
Lecture 26 (04/29/04): College Decision Maker: Simon&Sungpill; Course Data Querying System: Haris&Johnny; NSF award data analysis: Chaitanya&Shika; JinChao&ChangTai; Wei&Daniel.
Lecture 27 (05/04/04): NSF award data analysis: Sofoklis&JohnPaul; Qiting&Guoqiang; Ahmed&Kunal; Shanen&Yash; Muhammad&Youssef; David&GuoChao.
Lecture 28 (05/06/04): MetaSearch: Ben&Matthew; Intelligent Search Tool: Adewale&Stephanie; Image Scanner: Courtland&Ralph; Sharing Comments between Portal Sites: Mark Drago; Basketball game data analysis: Qiang&Kim. Project report due.

Handouts


Handout Q: Questionnaire

Handout Q2: Questionnaire 2

Slides Set 1: XML Part I

Slides Set 2: XML Part II

Slides with Texts: Semantic Web Tutorial Using N3 (all and only explanation part)
| primer (slides, all slides) + formats (slides, all slides)
| shortcuts (slides, all slides) + ontologies (slides, all slides)
| rules + processing (slides, all slides)
| application in travel tools (slides, all slides)

Handout A1: Assignment 1: A Web Query Tool

Handout A2: Assignment 2: XML Data Description

Handout A3: Assignment 3: XML Queries

Handout A4: Assignment 4: Project Description

Handout A5: Assignment 5: Project Design and Prototyping

Handout A6: Assignment 6: Project Implementation and Preliminary Experiments

Handout S2: Sample XML Schema Solution to Assignment 2

Handout S3: Sample XQuery Solution to Assignment 3

Handout D: Sample Data for Assignment 3 Demo

Handout MP: Midterm Practice Problems

Handout M: Midterm Exam

Handout MS: Solution to Midterm Exam

Handout P: Project Report, Presentation, and Demo

Other Pointers

XML Languages and Technologies

XML Schema

XQuery | W3C XML Query (XQuery) Project
XQuery 1.0 and XPath 2.0 Functions and Operators

QuiP: An XQuery implementation freeware | Downloading QuiP
eXist: An open source XML database

The XML Revolution, Technologies for the future web: An overview of XML technologies | Contents

Other Web Programming Languages and Technologies

HTML | W3C MarkUp Validation Service

PHP: Hypertext Preprocessor

A Tutorial on Java Servlets and Java Server Pages (JSP) | An overview of Servlets and JSP
Dynamic Page Languages: An overview pre Servlets and JSP

Interactive Web Services with Java: An overview, including their own work on JWIG | Contents

Web Survey and Internet Research Reports by SecuritySpace | Web Server Market Share | Apache Modules

Semantic Web Languages and Technologies

Semantic Web | Specifications
RDF | Developer tools
OWL Web Ontology Language Overview

Some General Tools (Independent of Application Domains)

HTTrack: A website copier

ixquick: A meta searcher

Xstrudel: An HTML generator | Tutorial

Some Specialized Tools (Specific to Application Domains)

Seeing negative feedback for eBay users (from Charles and Bin, 1/31/03)

Searching book prices (from Charles and Bin, 1/31/03)

Requirements


You should learn all information on the course homepage. Check the homepage periodically for Announcements.

Do all course work. The homeworks and projects are integral parts of the course as they provide concrete experiences with the basic ideas covered in the class.

Your handins, whether on paper or in electronic form, should include the following information at the top: your name, student id, course number, homework/project number, and due date.

Your work should be submitted in a neat and organized fashion; for handins on paper, if your handwriting is hard to read, then your work needs to be typed.

Your approach to solving problems is as important as your final solutions; you need to show how you arrived at your solutions and include appropriate explanations.

If you feel your grade was assigned incorrectly, please bring it up no later than two weeks after the assignment was returned to the class.

All work must be done individually unless collaboration is explicitly permitted. You are encouraged to discuss the material with others and look up references, but you should write up your own solutions independently and credit all sources that you used. Any plagiarism or other forms of cheating will result in an F or worse.

Computing facilities: You will be given an account in the Transaction Processing Lab. Never let anyone else use your account; it is against the rules. Please be conscious of security in the lab; theft or vandalism will be punished severely. If you have any problems with the hardware or software in the lab (other than with the requirements of the course work itself), please email ntadminATcsDOTsunysbDOTedu with a copy to me; neither the TA nor I can fix such problems.

Disability: If you have a physical, psychological, medical or learning disability that may have an impact on your ability to carry out assigned course work, please contact the staff in the Disabled Student Services office (DSS), Room 133 Humanities, 632-6748/TDD. DSS will review your concerns and determine with you what accommodations are necessary and appropriate. All information and documentation of disability are confidential.

Annie Liu