Project 5: Email Scraper


Assignment id: project5
Required Files: project5.py, rubric5.txt

Due Date

See the calendar for the due date.

Objectives:

Description:

Write a Python script (with comments) that finds all of the valid email addresses in a specified file and writes them to another file. An email address consists of three parts:
  1. username
  2. @
  3. domain name
For the purposes of this project, a valid email address has the following:
  1. The username has only the following characters: a-z, A-Z, 0-9, ".", "-", and "_".
  2. One and only one "@"
  3. The domain name:
    1. Has only the following characters: a-z, A-Z, 0-9, ".", and "-".
    2. Has at least one "."
    3. Does not have "." as the first or last character
    4. Does not have two "." next to each other
Before writing any code, work out an algorithm to accomplish this task. Write down the steps of your algorithm as comments at the beginning of your Python script.
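The validity rules above can be sketched as a single check on one candidate string. This is only one possible reading of the rules, not the reference solution: the function name is_valid_email and the two character-set constants are hypothetical, and rejecting an empty username is an assumption the rules do not state. How candidates are carved out of each line of the input file is left open here.

```python
import string

# Hypothetical helper names; the character sets come from rules 1 and 3.1.
USERNAME_CHARS = set(string.ascii_letters + string.digits + "._-")
DOMAIN_CHARS = set(string.ascii_letters + string.digits + ".-")


def is_valid_email(candidate):
    """Return True if candidate satisfies the validity rules listed above."""
    # Rule 2: one and only one "@".
    if candidate.count("@") != 1:
        return False
    username, domain = candidate.split("@")

    # Rule 1: only allowed characters (an empty username is rejected by assumption).
    if not username or not all(ch in USERNAME_CHARS for ch in username):
        return False

    # Rule 3.1: only allowed characters in the domain name.
    if not domain or not all(ch in DOMAIN_CHARS for ch in domain):
        return False
    # Rule 3.2: at least one ".".
    if "." not in domain:
        return False
    # Rule 3.3: "." is neither the first nor the last character.
    if domain.startswith(".") or domain.endswith("."):
        return False
    # Rule 3.4: no two "." next to each other.
    if ".." in domain:
        return False
    return True


print(is_valid_email("dst@asu.edu"))      # True
print(is_valid_email("dst@asu..edu"))     # False
```

Checking the rules one at a time, in the order they are listed, keeps each test easy to match against its comment.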

Examples

The following are examples of correct execution with the example input files (the filename typed after each prompt is the input from the user):
Please enter the input filename: project5-inputA.txt
You entered project5-inputA.txt
Please enter the output filename: out.txt
You entered out.txt
Found 63 valid email addresses
Contents of out.txt (for all of the contents, see project5-answerKeyA.txt):
mikkiflower@gmail.com
mikkiflower@gmail.com
f72991ad0702230813u7ca6c4ebm9c708c6dbe3e8a30@mail.gmail.com
Mikki_Rose@siggraph.org
mikkiflower@gmail.com
info@academart.edu
alvarolanchart@apanimationschool.com
dst@asu.edu
...
MikkiRose@siggraph.org
Mikki_Flower@siggraph.org
Please enter the input filename: project5-inputB.txt
You entered project5-inputB.txt
Please enter the output filename: outB.txt
You entered outB.txt
Found 5 valid email addresses
Contents of outB.txt:
validEmailAddress@email.com
username2@email.com
username4@valid.domain.com
username5@email.com
username9@valid.domain.com
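The prompts shown in the examples above, together with careful file handling, might be sketched as follows. The function names (prompt_for_filename, read_lines, write_lines) and the wording of the error messages are assumptions made for illustration; only the prompt text and the "You entered" echo come from the examples.

```python
def prompt_for_filename(prompt):
    """Ask for a filename and echo it back, as in the examples above."""
    filename = input(prompt)
    print("You entered " + filename)
    return filename


def read_lines(filename):
    """Return the lines of filename, or None if the file cannot be read."""
    try:
        # The with statement closes the file even if reading fails.
        with open(filename, "r") as infile:
            return infile.readlines()
    except OSError as error:
        # Error wording is an assumption, not part of the expected output.
        print("Unable to read " + filename + ": " + str(error))
        return None


def write_lines(filename, lines):
    """Write each string in lines to filename, one per line; return True on success."""
    try:
        with open(filename, "w") as outfile:
            for line in lines:
                outfile.write(line + "\n")
        return True
    except OSError as error:
        print("Unable to write " + filename + ": " + str(error))
        return False
```

A script could call prompt_for_filename("Please enter the input filename: ") and pass the result to read_lines, stopping early if it returns None.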

Submission

Submit your Python script and rubric using the handin program. For this project, type the following in a terminal window exactly as it appears:
handin  project5  project5.py  rubric5.txt
To verify your submission, type the following in a terminal window:
handin  project5

Rubric:

Points       Item
----------   --------------------------------------------------------------
_____ / 15   Algorithm outline (as comments)
_____ / 10   Meaningful comments
_____ / 18   Properly handles input and output files (e.g., handles IO exceptions, closes each file)
_____ / 55   Correctly finds email addresses
_____ /  2   Completed rubric (estimates for each line including hours spent)

_____ /100   Total


_____  Approximate number of hours spent

Helps

  1. Looking for a place to get started? Can you print out the index of each @ character for each line?
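As a starting point for the hint above, here is a minimal sketch that prints the index of each "@" character in a few lines of text. The function name at_indices and the sample lines are made up for illustration; a real script would loop over the lines of the input file instead.

```python
def at_indices(line):
    """Return the index of every "@" character in line."""
    return [index for index, ch in enumerate(line) if ch == "@"]


# Made-up sample lines standing in for the lines of an input file.
sample_lines = [
    "contact: dst@asu.edu",
    "no address here",
    "a@b.com and c@d.org",
]

for line in sample_lines:
    print(at_indices(line), line)
```

Each "@" index marks a spot where a username may end and a domain name may begin, so it is a natural anchor for scanning left and right.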

Notes

  1. Optionally, you can replace typing from the keyboard with the contents of a file. For example, try on ranger:
    python3 project5.py  <  /nfshome/hcarroll/public_html/1170/private/projects/project5-stdinA.txt
    If you want to match my output exactly, then run the following on ranger:
    python3 project5.py  <  /nfshome/hcarroll/public_html/1170/private/projects/project5-stdinA.txt
    diff /nfshome/hcarroll/public_html/1170/private/projects/project5-answerKeyA.txt  out.txt
    If the two files match exactly (which is what you want) then there should be NO output from diff. If diff shows one or more differences, fix them and run it again. To get side-by-side output (with the answer key on the left and your output on the right), replace the last line with:
    diff  --side-by-side  /nfshome/hcarroll/public_html/1170/private/projects/project5-answerKeyA.txt  out.txt
    For details about interpreting the output of diff, see the Using diff section on the Misc. webpage.