CSE2/CSE5ALG– Algorithms and Data Structures – 2017 Assignment Part 1
Assessment: This part 1 of the assignment is worth 15% of the final mark for this subject.
Delays caused by computer downtime cannot be accepted as a valid reason for a late submission without penalty. Students must plan their work to allow for both scheduled and unscheduled downtime. Penalties are applied to late assignments (accepted up to 5 days after the due date only). See the departmental Student Handbook for details.
Copying, Plagiarism: Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. The Department of Computer Science and Com-puter Engineering treats academic misconduct seriously. When it is detected, penalties are strictly imposed. Refer to the subject learning guide for further information and strategies you can use to avoid a charge of academic misconduct.
Submission Details: Submit all the Java files that are required for the tasks described in this handout. The code has to run under Unix on the latcs8 machine. You submit your files from your latcs8 account. Make sure you are in the same directory as the files yo u are submitting. Submit each file separately using the submit command. For example, for the file c alled (say) `SalesInfoMiner.java, use command:
submit ALG SalesInfoMiner.java
After submitting the files, you can run the following command that lists the files submitte d from your account:
verify
You can submit the same filename as many times as you like before the assignment d eadline; the previously submitted copy will be replaced by the latest one.
Testing Platform: While you are free to develop the code for this assignment on any operating system, your solution must run on the latcs8 system. We should be able to compile your classes with the simple command javac *.java, and execute your programs with a simple command, e.g. java SalesInfoMiner products.txt sales.txt report.txt.
Return of Assignment: Assignments will be returned within three weeks of the due date. Students will be notified by email and via the CSE2/CSE5ALG website when markin g sheets are available for collection.
Assignment Objectives
- To understand various data structures and searching and sorting techniques;
- To analyze these techniques in relation to a particular problem;
- To implement these techniques within a Java program.
Introduction
Mining information from sales data is a common task in both traditional and online stores. Past sales information can guide purchasing and enable the company to make recommendations to customers about what they may like to buy.
The aim of the assignment is to perform two data mining tasks. First, it determines the ranking of products according to how many of each product has been sold. Second, it also determines the five other products most commonly bought with each product.
Performing these tasks on large data sets is computationally expensive and the use of data struc-tures and algorithms that facilitate fast searching and sorting is desirable.
The assignment is divided into two parts. For part 1, you are required to write a simple version, where execution efficiency is not taken into account. For part 2, you ar e required to write an “optimized” version where efficiency, in addition to correctness, is the major concern.
With the exceptions of ArrayList and LinkedList, you are not permitted to use any of the classes in the Java Collections Framework, e.g. TreeMap, HashMap, ListIterator.
Requirements
Write a Java program called SalesInfoMiner that reads in a file of products, stored in text format (2 lines per entry). The program must then read in information about past sales transactions and output all the products to a new file, sorted by product description, along with their sales rank and the five other products with which they were most commonly purchased.
The program accepts four command line arguments and must not expect any user interaction from the keyboard. The first parameter to the program is the name of the produc t list to read in; the second parameter is the file containing information about past transactions. The third parameter is the name of the file to which the ordered product list will be written, and the fou rth parameter is the name of the file to which any error or warning messages will be output.
That is, the program will be run with the following command:
java SalesInfoMiner <products> <transactions> <output-file>
If an incorrect number of arguments is supplied, then the program must output an error message explaining the arguments required before terminating.
For Part 1, you can assume that both input files are in correct format.
Product File
This file contains data for one or more products
- Each product occupies two lines
- Line 1 contains the product id, which is a string of 13 digits
- Product IDs are unique
- Line 2 is the description of the product, which is not an empty string
Thus, the data for a product has the format shown below:
<ProductID> (A thirteen digit string) <Product Description> (A text string)
For example:
1234567890123
Milk – full cream 600ml 9832983102938
Bread – wholemeal loaf 1299128391283
Eggs, Extra Large, Barn laid
Sales Transaction File
This file contains data for one or more sale transactions
- Each sale transaction has 2 + n lines, where n is the number of items in this sale transaction (as specified on line 2 of the sale transaction)
- Line 1 has the format Receipt number: <receipt number> where <receipt number> is a positive integer
- Receipt numbers are unique
- Line 2 has the format Number of items: <number of items> where <number of items> is a positive integer
- Each sale item occupies a line, which contains the product id of the item
- All product IDs in a sale transaction are distinct and must exist in the product file
Thus, data for a sale has the format shown below:
Receipt number: <ReceiptNumber> (<ReceiptNumber> is an integer) Number of items: <NumberItems> (<NumberItems> is an integer) <ProductID>
…
For example:
Receipt number: 1 Number of items: 3 1234567890123 9832983102938 1299128391283 Receipt number: 2 Number of items: 4 9832983102938 1234567890123 1299128391283 1234567890123
Product Output File
- For each product, the program determines its sale transaction count, i.e. number of transac-tions that the product appears in.
- The program lists the products in an output file, sorted by sale transaction c ount, in decreasing order.
- When there is a tie in the sale transaction count, the products are listed in alphabetical order of their descriptions (see example below)
- For each product, the program also lists the top 5 products that are most frequently bought with this product. The number of time two products are bought together is known as their bought-together frequency. The top 5 list must be listed in decreasing order of the bought-together frequencies. Whenever there is a tie in frequency, the products are listed in alpha-betical order of their descriptions (see example below)
- The “top 5” list may have less than 5 products. It can also have for than 5 d ue to ties.
- The format is as shown below
Product: <Product Description>
ID: <ProductID>
Sales Count: <SalesCount>
This product was purchased most often with:
- (<ProductID>) – <Product Description>
- (<ProductID>) – <Product Description>
- (<ProductID>) – <Product Description>
- (<ProductID>) – <Product Description>
- (<ProductID>) – <Product Description>
For example:
Product: Milk full cream 600ml
ID: 1234567890123
Sales Count: 82
This product was purchased most often with:
- (1299128391283) Eggs, Extra Large, Barn laid [8]
- (9832983102938) Bread wholemeal loaf [5]
- (3289742398473) Apples Golden Delicious [4]
- (9348029348934) Chocolate, Dark, 200g [4]
- (3024823094839) Noodles, Udon Fresh 250g [4]
- (0002011106917) Nuts, almonds, dry roasted [4] Product: Pork – cured, ham, separable fat, boneless ID: 0000140956153
Sales Count: 82
This product was purchased most often with:
. . .
“Sales count” is the number of transactions that a product appears in. Th e numbers in square brackets are the number of transactions a product appears with another product (i.e. the bought-together frequency).
Sample Data Sets
For Part 1, three sample data sets are provided.
Set | Number of Prod- | Number of Sales | Maximum Number | Remarks | |||
ucts | of Item Per Sale | ||||||
P1, S1 | 5 | 10 | 4 | Tiny data set. You | |||
can inspect the data | |||||||
to verify your re- | |||||||
sults | |||||||
P2, S2 | 10 | 10 | 8 | Small | data | set. | |
Some | of the | top | |||||
five lists | have | less | |||||
than | 5 | products, | |||||
some have more. | |||||||
P3, S3 | 1000 | 10, 000 | 10 | ||||
For Part 2, a sample data set will be released that has 15,000 products and 1000,000 sales receipts.
Marking Scheme Overview
- The program (classes) will be marked on correctness, and coding style and comments.
- 10% of the total mark of part 1 of the assignment will be allocated to coding style and comments: Do the programs follow good design and programming practices? Do the indentation and code layout follow a good, consistent standard? Are the identifiers meaningful? Are comments being used effectively?
- File Format: .java
- No of files: 9