Build a search engine

Udacity CS101 (Building a Search Engine) – Unit 2

Procedures and Control (Introduction to Web Browsers)

In Unit 1, you wrote a program to extract the first link from a web page. The next step towards building your search engine is to extract all of the links from a web page. In order to write a program to extract all of the links, you need to know these two key concepts:

1. Procedures – a way to package code so it can be reused with different inputs.

2. Control – a way to have the computer execute different instructions depending on the data (instead of just executing instructions one after the other).

In this unit, you will learn three important programming constructs: procedures, if statements, and while loops.

Procedures, also known in Python as “functions,” enable you to abstract code from its inputs;

if statements allow you to write code that executes differently depending on the data; and

while loops provide a convenient way to repeat the same operations many times. You will combine these to solve the problem of finding all of the links on a web page.

Procedures

A procedure takes in inputs, does some processing, and produces outputs. Packaging code as a procedure allows you to use a few lines of code to do many different things.

Let’s consider how to turn the code for finding the first link into a get_next_target procedure that finds the next link target in the page contents. The input to this procedure is a string giving the rest of the contents of the web page; the outputs are the URL of the next link target and the position of the quote that closes it (url, end_quote).

def get_next_target(page):
    start_link = page.find('<a href=')
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote
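As a quick check, the procedure can be tried on a small piece of page text (the HTML snippet below is made up for illustration):

```python
def get_next_target(page):
    # Find the start of the next link tag, then the quoted URL after it.
    start_link = page.find('<a href=')
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

sample = 'Here is a link: <a href="http://example.com/page">click</a>'
url, end_quote = get_next_target(sample)
print(url)  # http://example.com/page
```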

Procedures are a very important concept: the core of programming is breaking problems into procedures and implementing those procedures.

Making Decisions

Now, let’s figure out a way to make code behave differently based on decisions. To do so, we need a way to make comparisons, which gives the program a way to test values and ultimately decide what to do.

*If statement – executes the code inside its body only if the condition is true. An else clause can be attached to an if statement to provide an alternative that runs when the condition is false.
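A minimal example (the values are made up for illustration):

```python
# Decide between two messages with if/else.
speed = 70
limit = 65
if speed > limit:
    message = 'slow down'
else:
    message = 'speed OK'
print(message)  # slow down
```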

Loops are used to execute code again and again as long as a certain condition holds.

*While loop – as long as the condition is true, the code inside the loop body is executed. If we write break inside the loop body, execution of the loop stops immediately.
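A small sketch of both ideas, counting down from a made-up starting value:

```python
# The loop condition is always True, so only break can end the loop.
n = 5
counted = []
while True:
    if n == 0:
        break            # stop once n reaches 0
    counted.append(n)
    n = n - 1
print(counted)  # [5, 4, 3, 2, 1]
```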

Now we know enough concepts to achieve the goal of this unit and write the code that extracts all the links in a web page.

Before using our get_next_target procedure, there is a small problem we didn’t think about: what if there are no links in the page? What is the expected output from the code?

The program returns the whole page content except the last character, because when the find operation does not find what it is looking for, it returns -1. When -1 is used as the end index of a slice, it excludes the last character of the string.

So, the corrected get_next_target function is as follows:

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

And the main code to find all the web-page links, using the previous function, is as follows:

NOTE: use the function get_page(url), which is provided by the Udacity course environment (it is not part of standard Python), to get the source code of a web page.

def print_all_links(page):
    while True:
        url, endpos = get_next_target(page)
        if url:
            print(url)
            page = page[endpos:]
        else:
            break
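Putting the two procedures together, here is a self-contained sketch run on a made-up page string:

```python
def get_next_target(page):
    # Return the next link URL and the position of its closing quote,
    # or (None, 0) if the page contains no more links.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def print_all_links(page):
    # Repeatedly find the next link, then continue from just past it.
    while True:
        url, endpos = get_next_target(page)
        if url:
            print(url)
            page = page[endpos:]
        else:
            break

sample = ('<a href="http://example.com/a">a</a> '
          '<a href="http://example.com/b">b</a>')
print_all_links(sample)
# http://example.com/a
# http://example.com/b
```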

Udacity CS101 (Building a Search Engine) – Unit 1

Introducing the Web Crawler
A web crawler is a program that collects content from the web. A web crawler finds web pages by starting from a seed page and following links to find other pages, and following links from the other pages it finds, and continuing to follow links until it has found many web pages.

Here is the process that a web crawler follows:

1. Start from one preselected page. We call the starting page the “seed” page.

2. Extract all the links on that page. (This is the part we will work on in this unit and Unit 2.)

3. Follow each of those links to find new pages.

4. Extract all the links from all of the new pages found.

5. Follow each of those links to find new pages.

6. Extract all the links from all of the new pages found.
This keeps going as long as there are new pages to find, or until it is stopped.
In this unit we will be writing a program to extract the first link from a given web page. In Unit 2, we will figure out how to extract all the links on a web page. In Unit 3, we will figure out how to keep the crawl going over many pages.

We will use Python to write the code that finds the first link in a web page.

Python String Functions
A string is a sequence of characters surrounded by quotes. The quotes can be either single or double
quotes, but the quotes at both ends of the string must be the same type. Here are some examples of
strings in Python:
"silly"
'string'
"I'm a valid string, even with a single quote in the middle!"

String Concatenation
We can use the + operator on strings, but it has a different meaning than when it is used on numbers.
string concatenation: <string> + <string>
outputs the concatenation of the two input strings (pasting the strings together with no space between them)

We can also use the multiplication operator on strings:
string multiplication: <string> * <number>
outputs a string that is <number> copies of <string> pasted together
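For example:

```python
# + pastes strings together; * repeats a string.
greeting = 'Hello' + ' ' + 'world'
print(greeting)   # Hello world
print('ab' * 3)   # ababab
```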

Indexing Strings
The indexing operator provides a way to extract subsequences of characters from a string.
string indexing: <string>[<index>]
outputs a single-character string containing the character at position <index> of the input <string>. Positions in the string are counted starting from 0, so s[1] outputs the second character in s. If the <index> is negative, positions are counted from the end of the string: s[-1] is the last character in s.

string extraction: <string>[<start>:<stop>]
outputs a string that is the subsequence of the input string starting from position <start> and ending just before position <stop>. If <start> is missing, the subsequence starts from the beginning of the input string; if <stop> is missing, it goes to the end of the input string.
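The indexing and extraction operations can be tried out on a sample string:

```python
word = 'python'
print(word[0])    # p   (first character)
print(word[-1])   # n   (last character)
print(word[1:4])  # yth (positions 1 up to, but not including, 4)
print(word[:3])   # pyt (start missing: from the beginning)
print(word[3:])   # hon (stop missing: to the end)
```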

Finding Subsequences in Strings
The find method provides a way to find sub-sequences of characters in strings.
find: <search string>.find(<target string>)
outputs a number giving the position in <search string> where <target string> first appears. If there is no occurrence of <target string> in <search string>, outputs -1.

To find later occurrences, we can also pass in a number to find:
find at or after: <search string>.find(<target string>, <start position>)
outputs a number giving the position in <search string> where <target string> first appears that is at or after the position given by <start position>. If there is no occurrence of <target string> in <search string> at or after <start position>, outputs -1.
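For example:

```python
text = 'banana'
print(text.find('an'))     # 1   (first occurrence)
print(text.find('an', 2))  # 3   (first occurrence at or after position 2)
print(text.find('z'))      # -1  (not found)
```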

Converting Numbers to Strings
str: str(<number>)
outputs a string that represents the input number. For example,
str(23) outputs the string '23'.
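This is useful for pasting a number into a string:

```python
age = 23
print('I am ' + str(age) + ' years old')  # I am 23 years old
```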

That covers the Python string functions we need. Now for the code that finds the first link in a web page, assuming we already know the source code of the page (we will learn how to get it later).