Lab4-3_BA

Berent Aldikacti

09/16/20

In [1]:
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
print(animals)
['lion', 'tiger', 'crocodile', 'vulture', 'hippo']

Challenge - Loops

In [2]:
for i in animals:
    print(i)
lion
tiger
crocodile
vulture
hippo
  1. What happens if we don’t include the pass statement?
In [3]:
for i in animals:  
  File "<ipython-input-3-0fa469c7c90f>", line 1
    for i in animals:
                       ^
SyntaxError: unexpected EOF while parsing
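The error above appears because a for statement needs at least one indented statement as its body. A minimal sketch of how pass fills that role (the final print is only there to show that the loop really ran):

```python
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']

# A for loop must have a body; pass is a statement that does nothing,
# so it satisfies the syntax without producing any output.
for animal in animals:
    pass

# The loop variable keeps the last value it was assigned.
print(animal)
```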
  1. Rewrite the loop so that the animals are separated by commas, not new lines (Hint: You can concatenate strings using a plus sign. For example, print(string1 + string2) outputs ‘string1string2’).
In [9]:
for i in animals:
    print(i,end=",")
lion,tiger,crocodile,vulture,hippo,
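The end="," approach leaves a trailing comma after the last animal. Two possible ways to avoid it, sketched below: concatenating with + as the hint suggests, or using str.join. The variable names line and joined are only illustrative.

```python
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']

# Option 1: build the string with +, adding a comma before every
# item except the first one.
line = ''
for animal in animals:
    if line:
        line = line + ','
    line = line + animal
print(line)

# Option 2: str.join places the separator only between items.
joined = ','.join(animals)
print(joined)
```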

Exercise

In [10]:
import os
os.mkdir('data/yearly_files') # equivalent to mkdir in shell
os.listdir('data') # equivalent to ls in shell
---------------------------------------------------------------------------
FileExistsError                           Traceback (most recent call last)
<ipython-input-10-cb450a0c2017> in <module>
      1 import os
----> 2 os.mkdir('data/yearly_files') # equivalent to mkdir in shell
      3 os.listdir('data') # equivalent to ls in shell

FileExistsError: [Errno 17] File exists: 'data/yearly_files'
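os.mkdir fails here because the directory was already created in an earlier run. One way to make the cell safe to re-run is os.makedirs with exist_ok=True. The sketch below uses a temporary directory as a stand-in for the lab's working directory, which is an assumption made only so it can run anywhere.

```python
import os
import tempfile

# A temporary directory stands in for the lab's working directory.
base_dir = tempfile.mkdtemp()
target = os.path.join(base_dir, 'data', 'yearly_files')

# exist_ok=True makes the call safe to repeat; makedirs also creates
# the intermediate 'data' directory if it is missing.
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)  # the second call does not raise

print(os.listdir(os.path.join(base_dir, 'data')))
```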
In [1]:
import pandas as pd
surveys_df = pd.read_csv('data/surveys.csv')
surveys_df['year'].unique()
Out[1]:
array([1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
       1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
       1999, 2000, 2001, 2002])
In [12]:
for year in surveys_df['year'].unique():
   filename='data/yearly_files/surveys' + str(year) + '.csv'
   print(filename)
data/yearly_files/surveys1977.csv
data/yearly_files/surveys1978.csv
data/yearly_files/surveys1979.csv
data/yearly_files/surveys1980.csv
data/yearly_files/surveys1981.csv
data/yearly_files/surveys1982.csv
data/yearly_files/surveys1983.csv
data/yearly_files/surveys1984.csv
data/yearly_files/surveys1985.csv
data/yearly_files/surveys1986.csv
data/yearly_files/surveys1987.csv
data/yearly_files/surveys1988.csv
data/yearly_files/surveys1989.csv
data/yearly_files/surveys1990.csv
data/yearly_files/surveys1991.csv
data/yearly_files/surveys1992.csv
data/yearly_files/surveys1993.csv
data/yearly_files/surveys1994.csv
data/yearly_files/surveys1995.csv
data/yearly_files/surveys1996.csv
data/yearly_files/surveys1997.csv
data/yearly_files/surveys1998.csv
data/yearly_files/surveys1999.csv
data/yearly_files/surveys2000.csv
data/yearly_files/surveys2001.csv
data/yearly_files/surveys2002.csv
In [13]:
for year in surveys_df['year'].unique():

    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year]

    # Write the new DataFrame to a CSV file
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)

Challenge - Modifying Loops

  1. Some of the surveys you saved are missing data (they have null values that show up as NaN - Not A Number - in the DataFrames and do not show up in the text files). Modify the for loop so that the entries with null values are not included in the yearly files.
In [14]:
# Drop every row that contains a null value (done once, before the loop)
surveys_df_noNA = surveys_df.dropna()

for year in surveys_df['year'].unique():

    # Select data for the year
    surveys_year = surveys_df_noNA[surveys_df_noNA.year == year]

    # Write the new DataFrame to a CSV file
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)
  1. Let’s say you only want to look at data from a given multiple of years. How would you modify your loop in order to generate a data file for only every 5th year, starting from 1977?
In [30]:
for year in surveys_df['year'].unique()[::5]:

    # Select data for the year
    surveys_df_noNA = surveys_df[~pd.isnull(surveys_df).any(axis=1)]
    surveys_year = surveys_df_noNA[surveys_df_noNA.year == year]

    # Build the file name (printed here rather than written to disk)
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    print(filename)
data/yearly_files/surveys1977.csv
data/yearly_files/surveys1982.csv
data/yearly_files/surveys1987.csv
data/yearly_files/surveys1992.csv
data/yearly_files/surveys1997.csv
data/yearly_files/surveys2002.csv
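Slicing with [::5] picks every 5th position, which matches every 5th year only while the years are sorted and have no gaps. A sketch of selecting by value instead, using a plain list as a stand-in for surveys_df['year'].unique() with 1980 removed on purpose (hypothetical data) to show the difference:

```python
first_year = 1977
step = 5

# Stand-in for surveys_df['year'].unique(); 1980 is left out to mimic
# a data set with a missing year.
years = [y for y in range(1977, 2003) if y != 1980]

# Positional slicing drifts once a year is missing.
by_position = years[::step]

# Selecting by value keeps the 5-year spacing from 1977.
by_value = [y for y in years if (y - first_year) % step == 0]

print(by_position)
print(by_value)
```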
  1. Instead of splitting out the data by years, a colleague wants to analyze each species separately. How would you write a unique CSV file for each species?
In [33]:
surveys_df.head()
Out[33]:
   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          1      7   16  1977        2         NL   M             32.0     NaN
1          2      7   16  1977        3         NL   M             33.0     NaN
2          3      7   16  1977        2         DM   F             37.0     NaN
3          4      7   16  1977        7         DM   M             36.0     NaN
4          5      7   16  1977        3         DM   M             35.0     NaN
In [37]:
import os
os.mkdir('data/species_files') # equivalent to mkdir in shell
os.listdir('data') # equivalent to ls in shell
Out[37]:
['surveys2001.csv',
 'weightbyyearsex.csv',
 'surveys2002.csv',
 'speciesSubset.csv',
 'bouldercreek_09_2013.txt',
 'species_files',
 'species.csv',
 'references.bib',
 'yearly_files',
 'surveys.csv',
 'portal_mammals.sqlite',
 'README.txt',
 'plots.csv']
In [41]:
for i in surveys_df['species_id'].unique():

    # Select data for the species
    surveys_df_noNA = surveys_df[~pd.isnull(surveys_df).any(axis=1)]
    surveys_species = surveys_df_noNA[surveys_df_noNA.species_id == i]

    # Write the new DataFrame to a CSV file
    filename = 'data/species_files/surveys' + str(i) + '.csv'
    surveys_species.to_csv(filename)
    print(filename)
data/species_files/surveysNL.csv
data/species_files/surveysDM.csv
data/species_files/surveysPF.csv
data/species_files/surveysPE.csv
data/species_files/surveysDS.csv
data/species_files/surveysPP.csv
data/species_files/surveysSH.csv
data/species_files/surveysOT.csv
data/species_files/surveysDO.csv
data/species_files/surveysOX.csv
data/species_files/surveysSS.csv
data/species_files/surveysOL.csv
data/species_files/surveysRM.csv
data/species_files/surveysnan.csv
data/species_files/surveysSA.csv
data/species_files/surveysPM.csv
data/species_files/surveysAH.csv
data/species_files/surveysDX.csv
data/species_files/surveysAB.csv
data/species_files/surveysCB.csv
data/species_files/surveysCM.csv
data/species_files/surveysCQ.csv
data/species_files/surveysRF.csv
data/species_files/surveysPC.csv
data/species_files/surveysPG.csv
data/species_files/surveysPH.csv
data/species_files/surveysPU.csv
data/species_files/surveysCV.csv
data/species_files/surveysUR.csv
data/species_files/surveysUP.csv
data/species_files/surveysZL.csv
data/species_files/surveysUL.csv
data/species_files/surveysCS.csv
data/species_files/surveysSC.csv
data/species_files/surveysBA.csv
data/species_files/surveysSF.csv
data/species_files/surveysRO.csv
data/species_files/surveysAS.csv
data/species_files/surveysSO.csv
data/species_files/surveysPI.csv
data/species_files/surveysST.csv
data/species_files/surveysCU.csv
data/species_files/surveysSU.csv
data/species_files/surveysRX.csv
data/species_files/surveysPB.csv
data/species_files/surveysPL.csv
data/species_files/surveysPX.csv
data/species_files/surveysCT.csv
data/species_files/surveysUS.csv
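The surveysnan.csv entry in the list above comes from rows with a missing species_id: unique() reports the missing value as one of its entries, and str() turns it into part of a file name. A sketch of skipping it with dropna(), using a tiny made-up DataFrame in place of surveys_df:

```python
import pandas as pd

# Tiny stand-in for surveys_df (hypothetical rows).
df = pd.DataFrame({
    'species_id': ['NL', 'DM', None, 'DM'],
    'weight': [40.0, 35.0, 12.0, 38.0],
})

# unique() keeps the missing value as one of its entries.
with_missing = list(df['species_id'].unique())

# Dropping missing values first leaves only real species codes.
species_ids = list(df['species_id'].dropna().unique())

for species in species_ids:
    print('data/species_files/surveys' + str(species) + '.csv')
```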

Challenge - Functions

In [42]:
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2
  1. Change the values of the arguments in the function and check its output
In [43]:
this_is_the_function_name(1,2)
The function arguments are: 1 2 (this is done inside the function!)
Out[43]:
2
In [44]:
this_is_the_function_name(56,25)
The function arguments are: 56 25 (this is done inside the function!)
Out[44]:
1400
  1. Try calling the function by giving it the wrong number of arguments (not 2) or not assigning the function call to a variable (no product_of_inputs =)
In [46]:
this_is_the_function_name(1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-e4fcc2d6795e> in <module>
----> 1 this_is_the_function_name(1)

TypeError: this_is_the_function_name() missing 1 required positional argument: 'input_argument2'
  1. Declare a variable inside the function and test to see where it exists (Hint: can you print it from outside the function?)
In [48]:
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')
    x = input_argument1 + input_argument2
    # And returns their product
    return input_argument1 * input_argument2

this_is_the_function_name(5,10)
The function arguments are: 5 10 (this is done inside the function!)
Out[48]:
50
In [49]:
x
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-49-6fcf9dfbd479> in <module>
----> 1 x

NameError: name 'x' is not defined
  1. Explore what happens when variables inside and outside the function have the same name. What happens to the global variable when you change the value of the local variable?
In [50]:
x = 5
this_is_the_function_name(5,10)
The function arguments are: 5 10 (this is done inside the function!)
Out[50]:
50
In [51]:
x
Out[51]:
5
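The global x stays at 5 because the assignment inside the function creates a separate local x. A compact sketch of the same behaviour, with a hypothetical helper add_inputs that returns its local variable so both values can be compared:

```python
x = 5  # global variable

def add_inputs(a, b):
    # This x is local: it exists only while the function runs and
    # does not replace the global x defined above.
    x = a + b
    return x

result = add_inputs(5, 10)
print(result)  # value of the local x
print(x)       # the global x is unchanged
```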

Challenge - More Functions

In [52]:
def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year -- year for which data is extracted
    all_data -- DataFrame with multi-year data
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = 'data/yearly_files/function_surveys' + str(this_year) + '.csv'
    surveys_year.to_csv(filename)

def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Writes separate CSV files for each year of data.

    start_year -- the first year of data we want
    end_year -- the last year of data we want
    all_data -- DataFrame with multi-year data
    """

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        one_year_csv_writer(year, all_data)
  1. Add two arguments to the functions we wrote that take the path of the directory where the files will be written and the root of the file name. Create a new set of files with a different name in a different directory.
In [53]:
def one_year_csv_writer(this_year, all_data, pathdir, rootid):
    """
    Writes a csv file for data from a given year.

    this_year -- year for which data is extracted
    all_data -- DataFrame with multi-year data
    pathdir -- Path to the export directory
    rootid -- Basename for the files to be exported
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    # (use the pathdir and rootid arguments, not the literal text 'pathdir/rootid')
    filename = pathdir + '/' + rootid + str(this_year) + '.csv'
    surveys_year.to_csv(filename)
  1. How could you use the function yearly_data_csv_writer to create a CSV file for only one year? (Hint: think about the syntax for range)
In [ ]:
def one_year_csv_writer(this_year, all_data, pathdir, rootid):
    """
    Writes a csv file for data from a given year.

    this_year -- year for which data is extracted
    all_data -- DataFrame with multi-year data
    pathdir -- Path to the export directory
    rootid -- Basename for the files to be exported
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    # (use the pathdir and rootid arguments, not the literal text 'pathdir/rootid')
    filename = pathdir + '/' + rootid + str(this_year) + '.csv'
    surveys_year.to_csv(filename)
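The hint about range points to the answer: yearly_data_csv_writer loops over range(start_year, end_year+1), so passing the same year for both arguments gives a single iteration. A sketch with a hypothetical helper, years_to_write, that only reproduces the loop bounds and writes no files:

```python
def years_to_write(start_year, end_year):
    # Mirrors the loop bounds used in yearly_data_csv_writer:
    # range stops before its second argument, hence the + 1.
    return list(range(start_year, end_year + 1))

single = years_to_write(1998, 1998)
several = years_to_write(1998, 2000)
print(single)
print(several)
```

With the original three-argument function, the equivalent call would be yearly_data_csv_writer(1998, 1998, surveys_df).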
  1. Make the functions return a list of the files they have written. There are many ways you can do this (and you should try them all!): either of the functions can print to screen, either can use a return statement to give back numbers or strings to their function call, or you can use some combination of the two. You could also try using the os library to list the contents of directories.
In [ ]:
def one_year_csv_writer(this_year, all_data, pathdir, rootid):
    """
    Writes a csv file for data from a given year.

    this_year -- year for which data is extracted
    all_data -- DataFrame with multi-year data
    pathdir -- Path to the export directory
    rootid -- Basename for the files to be exported
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = pathdir + '/' + rootid + str(this_year) + '.csv'
    surveys_year.to_csv(filename)
    return(filename)
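The cell above returns one file name. To get a list of every file written, the yearly function can collect those return values. A sketch under stated assumptions: demo_df is a made-up miniature data set, and a temporary directory replaces data/yearly_files so the example does not touch the lab's files.

```python
import os
import tempfile

import pandas as pd


def one_year_csv_writer(this_year, all_data, pathdir, rootid):
    """Write one year of data to CSV and return the file name."""
    surveys_year = all_data[all_data.year == this_year]
    filename = os.path.join(pathdir, rootid + str(this_year) + '.csv')
    surveys_year.to_csv(filename)
    return filename


def yearly_data_csv_writer(start_year, end_year, all_data, pathdir, rootid):
    """Write one CSV per year and return the list of file names."""
    written = []
    for year in range(start_year, end_year + 1):
        written.append(one_year_csv_writer(year, all_data, pathdir, rootid))
    return written


# Hypothetical miniature data set and a temporary output directory.
demo_df = pd.DataFrame({'year': [1977, 1978, 1978],
                        'weight': [40.0, 35.0, 38.0]})
out_dir = tempfile.mkdtemp()

files = yearly_data_csv_writer(1977, 1978, demo_df, out_dir, 'demo')
print(files)
print(sorted(os.listdir(out_dir)))
```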
  1. Explore what happens when variables are declared inside each of the functions versus in the main (non-indented) body of your code. What is the scope of the variables (where are they visible)? What happens when they have the same name but are given different values?
In [57]:
filename
Out[57]:
'data/species_files/surveysUS.csv'

Exercise

In [3]:
def yearly_data_arg_test(all_data, start_year=None, end_year=None):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    all_data -- DataFrame with multi-year data
    start_year -- the first year of data we want, Check all_data! (default None)
    end_year -- the last year of data we want; Check all_data! (default None)
    """

    if start_year is None:
        start_year = min(all_data.year)
    if end_year is None:
        end_year = max(all_data.year)

    return start_year, end_year


start, end = yearly_data_arg_test(surveys_df, 1988, 1993)
print('Both optional arguments:\t', start, end)

start, end = yearly_data_arg_test(surveys_df)
print('Default values:\t\t\t', start, end)
Both optional arguments:	 1988 1993
Default values:			 1977 2002

Challenge - Variables

  1. What type of object corresponds to a variable declared as None? (Hint: create a variable set to None and use the function type())
In [4]:
x = None
type(x)
Out[4]:
NoneType
  1. Compare the behavior of the function yearly_data_arg_test when the arguments have None as a default and when they do not have default values.
In [5]:
def yearly_data_arg_test(all_data, start_year, end_year):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    all_data -- DataFrame with multi-year data
    start_year -- the first year of data we want, Check all_data! (required, no default)
    end_year -- the last year of data we want; Check all_data! (required, no default)
    """

    if start_year is None:
        start_year = min(all_data.year)
    if end_year is None:
        end_year = max(all_data.year)

    return start_year, end_year


start, end = yearly_data_arg_test(surveys_df, 1988, 1993)
print('Both optional arguments:\t', start, end)

start, end = yearly_data_arg_test(surveys_df)
print('Default values:\t\t\t', start, end)
Both optional arguments:	 1988 1993
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-9eb962137e16> in <module>
     19 print('Both optional arguments:\t', start, end)
     20 
---> 21 start, end = yearly_data_arg_test(surveys_df)
     22 print('Default values:\t\t\t', start, end)

TypeError: yearly_data_arg_test() missing 2 required positional arguments: 'start_year' and 'end_year'
  1. What happens if you only include a value for start_year in the function call? Can you write the function call with only a value for end_year? (Hint: think about how the function must be assigning values to each of the arguments - this is related to the need to put the arguments without default values before those with default values in the function definition!)
In [8]:
def yearly_data_arg_test(all_data, start_year=1977, end_year=None):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    all_data -- DataFrame with multi-year data
    start_year -- the first year of data we want, Check all_data! (default 1977)
    end_year -- the last year of data we want; Check all_data! (default None)
    """

    if start_year is None:
        start_year = min(all_data.year)
    if end_year is None:
        end_year = max(all_data.year)

    return start_year, end_year


start, end = yearly_data_arg_test(surveys_df, end_year=1993)
print('Only end_year given:\t', start, end)
Only end_year given:	 1977 1993
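Positional values are assigned to parameters in order, so a lone extra value always lands in start_year; reaching end_year alone requires naming it. A sketch with a hypothetical function year_range that takes a plain list of years instead of a DataFrame:

```python
def year_range(all_years, start_year=None, end_year=None):
    # all_years is a plain list standing in for all_data.year
    if start_year is None:
        start_year = min(all_years)
    if end_year is None:
        end_year = max(all_years)
    return start_year, end_year


years = list(range(1977, 2003))

# The second positional value fills start_year; end_year falls back to max.
only_start = year_range(years, 1988)

# A keyword is needed to set end_year while skipping start_year.
only_end = year_range(years, end_year=1993)

print(only_start)
print(only_end)
```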

Challenge - Modifying functions

  1. Rewrite the one_year_csv_writer and yearly_data_csv_writer functions to have keyword arguments with default values
In [10]:
def one_year_csv_writer(this_year, all_data, pathdir='data/yearly_files', rootid='surveys'):
    """
    Writes a csv file for data from a given year.

    this_year -- year for which data is extracted
    all_data -- DataFrame with multi-year data
    pathdir -- Path to the export directory
    rootid -- Basename for the files to be exported
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = pathdir + '/' + rootid + str(this_year) + '.csv'
    surveys_year.to_csv(filename)
    return filename

one_year_csv_writer(1998, all_data=surveys_df)
Out[10]:
'data/yearly_files/surveys1998.csv'
  1. Modify the functions so that they don’t create yearly files if there is no data for a given year and display an alert to the user (Hint: use conditional statements to do this. For an extra challenge, use try statements!)
In [25]:
def one_year_csv_writer(this_year, all_data, pathdir='data/yearly_files', rootid='surveys'):
    """
    Writes a csv file for data from a given year.

    this_year -- year for which data is extracted
    all_data -- DataFrame with multi-year data
    pathdir -- Path to the export directory
    rootid -- Basename for the files to be exported
    """
    # Check the DataFrame passed in (all_data), not the global surveys_df
    if this_year in all_data['year'].unique():
        surveys_year = all_data[all_data.year == this_year]
        filename = pathdir + '/' + rootid + str(this_year) + '.csv'
        surveys_year.to_csv(filename)
        return filename
    else:
        print('There is no data available for', str(this_year))
    
one_year_csv_writer(2100, all_data=surveys_df)
There is no data available for 2100
  1. The code below checks to see whether a directory exists and creates one if it doesn’t. Add some code to your function that writes out the CSV files, to check for a directory to write to.
In [40]:
import os
def one_year_csv_writer(this_year, all_data, pathdir='yearly_files', rootid='surveys'):
    """
    Writes a csv file for data from a given year.

    this_year -- year for which data is extracted
    all_data -- DataFrame with multi-year data
    pathdir -- Path to the export directory
    rootid -- Basename for the files to be exported
    """
    # Build the output directory once so the check and the write agree
    out_dir = os.path.join('data', pathdir)
    if os.path.isdir(out_dir):
        print('Processed directory exists')
    else:
        os.makedirs(out_dir)
        print('Processed directory created')

    if this_year in all_data['year'].unique():
        surveys_year = all_data[all_data.year == this_year]
        filename = out_dir + '/' + rootid + str(this_year) + '.csv'
        surveys_year.to_csv(filename)
        print(filename)
    else:
        print('There is no data available for', str(this_year))
    
one_year_csv_writer(1977, all_data=surveys_df)
Processed directory exists
data/yearly_files/surveys1977.csv
  1. The code that you have written so far to loop through the years is good; however, it is not necessarily reproducible with different datasets. For instance, what happens to the code if we have additional years of data in our CSV files? Using the tools that you learned in the previous activities, make a list of all years represented in the data. Then create a loop to process your data that begins at the earliest year and ends at the latest year using that list.
In [41]:
# Sort the years so the loop runs from the earliest to the latest
year_list = sorted(surveys_df['year'].unique())
surveys_df_noNA = surveys_df.dropna()
for year in year_list:

    # Select data for the year
    surveys_year = surveys_df_noNA[surveys_df_noNA.year == year]

    # Build the file name for this year
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    print(filename)
data/yearly_files/surveys1977.csv
data/yearly_files/surveys1978.csv
data/yearly_files/surveys1979.csv
data/yearly_files/surveys1980.csv
data/yearly_files/surveys1981.csv
data/yearly_files/surveys1982.csv
data/yearly_files/surveys1983.csv
data/yearly_files/surveys1984.csv
data/yearly_files/surveys1985.csv
data/yearly_files/surveys1986.csv
data/yearly_files/surveys1987.csv
data/yearly_files/surveys1988.csv
data/yearly_files/surveys1989.csv
data/yearly_files/surveys1990.csv
data/yearly_files/surveys1991.csv
data/yearly_files/surveys1992.csv
data/yearly_files/surveys1993.csv
data/yearly_files/surveys1994.csv
data/yearly_files/surveys1995.csv
data/yearly_files/surveys1996.csv
data/yearly_files/surveys1997.csv
data/yearly_files/surveys1998.csv
data/yearly_files/surveys1999.csv
data/yearly_files/surveys2000.csv
data/yearly_files/surveys2001.csv
data/yearly_files/surveys2002.csv