Statistical Programming: Problem Set 1 — Solutions

Author

Affiliation

Philipp Bach, Jan Teichert-Kluge

University of Hamburg

Notes

Exercises

This is a very extensive problem set to give you ample opportunity to practice programming in Python. For the tutorials, please try to solve the following Exercises

Section 1
- Exercise 1 a., b. and d.
- Exercise 3 a.
- Exercise 4 a. and c.
- Exercise 5
Section 2
- Exercise 1 a., b., c., f.
- Exercise 2
Section 3
- Exercise 1, b.
- Exercise 2
Section 4
- Exercise 1, b.
Section 5
- Exercise 1

General Note

You will very likely find the solution to these exercises online. I, however, strongly encourage you to work on these exercises without doing so. Understanding someone else’s solution is very different from coming up with your own. Use the lecture notes and try to solve the exercises independently.

Traditionally, the first program you write in a new language is called “Hello World!”. Use the print() statement to display “Hello World!” on the screen.

Hint: help(print)

Section 1: Values and data types

Exercise 1: User Input and Strings

Write a Python program that accepts the user’s first and last name and prints them in reverse order with a space between them.

Hint: Use the Python function input() and f-strings.

# If input() does not work, you can replace it by a string of your choice
first_name = input("What is your first name? ")
last_name = input("What is your last name? ")
full_name_reverse = f"{last_name} {first_name}"
print(full_name_reverse)
print(last_name, first_name)

Write a program that accepts a string from the user and …
- … only shows the first three letters,
- … only shows the last three letters,
- … takes the 2nd letter and repeats it 4 times,
- … shows the length of the string,
- … shows whether the string contains another user-provided substring, for example "a",
- … shows the very first occurence of the substring "a",
- … shows the very first occurence of the substring "a" excluding the first and the last letter.

string_input = input("Please provide a string: ")
print(string_input[0:3])
print(string_input[-3:])
print(string_input[1]*4)
print(len(string_input))

string_input = input("Please provide a string: ")
substring_input = input("Please provide a substring: ")
contains_substring = substring_input in string_input
print(f"'{string_input}' contains substring '{substring_input}: {contains_substring}")

string_input = input("Please provide a string: ")
substring_input = input("Please provide a substring: ")
contains_substring = substring_input in string_input
print(contains_substring)

index = string_input.index(substring_input)
print(f"The first occurence of '{substring_input}' in '{string_input}' is: {index}")

# It's also possible to include an if-statement and executed conditionally
if contains_substring:
    index = string_input.index(substring_input)
    print(f"The first occurence of '{substring_input}' in '{string_input}' is: {index}")
else:
    print(f"'{substring_input}' is not contained in '{string_input}'")

string_input = input("Please provide a string: ")
substring_input = input("Please provide a substring: ")
contains_substring = substring_input in string_input[1:-1]
index = string_input.index(substring_input, 1, len(string_input)-1)
print(f"The first occurence of '{substring_input}' in '{string_input}' is: {index}")

# Again, conditional execution is possible here
if contains_substring:
    index = string_input.index(substring_input, 1, len(string_input)-1)
    print(f"The first occurence of '{substring_input}' in '{string_input}' is: {index}")
else:
    print(f"'{substring_input}' is not contained in '{string_input}'")

Define two strings ‘string1’ and ‘string2’ and describe what happens…,
- … if you compare them with ==, > and <,
- … if you test them with in and not,

string1 = "stringabc"
string2 = "stringabcd"
print(string1 > string2)
print(string1 < string2)
print(string1 <= string2)
print(string1 == string2)

False
True
True
False

string1 = "stringabc"
string2 = "stringabc"
print(string1 > string2)
print(string1 < string2)
print(string1 <= string2)
print(string1 == string2)

False
False
True
True

string1 = "stringabc"
string2 = "stri"
print(string1 in string2)
print(string1 not in string2)
print(string2 in string1)

False
True
True

Write a program that takes a string as a user input and only returns…
- …. the letters that are present at an even index, e.g., the input “test run” would result in displaying “t”, “s”, and “u”, (Hint: range())
- … the letters that are vowels.

string_input = input("Please provide a string: ")
for i in range(0,len(string_input)):
    if i%2 == 0:
        print(string_input[i])

string_input = input("Please provide a string: ")
print(string_input[::2])

string_input = input("Please provide a string: ")
vowels = ["a", "e", "i", "o", "u"]

for letter in string_input:
    if letter in vowels:
        print(letter)

Exercise 2: User Input and Integers

Accept two integer values from the user and return their product. If the product is greater than 1000, then return their sum.

Hint: isnumeric().

n1 = input("Please provide an integer number: ")
n2 = input("Please provide another integer number: ")
if n1.isnumeric() == False or n2.isnumeric() == False:
    print("Input must be integer numbers!")
else:
    n1 = int(n1)
    n2 = int(n2)
    if n1 * n2 > 1000:
        print(n1+n2)
    else:
        print(n1*n2)

a="4" # if you only want to accept integers as input
a.isnumeric()

True

Exercise 3: Tuples, Lists and Dictionaries

Consider the screenhot in Figure 1 taken from wikipedia.org.

Figure 1: List of heaviest land mammals.

Which Python data types could you use in principle to collect …
- … the names of the five heaviest mammals? Implement it in every possible data type. Write a program that outputs the first, the second and the fifth of the heaviest mammals.
- … the names and the (maximum) mass of the mammals? Again, implement it in every possible data type and write a program that outputs the name and the maximum mass of the first, the second and the fifth heaviest mammal

mammals_names1 = ("African bush elephant", "Asian elephant", "African forest elephant", "White rhinoceros", "Indian rhinoceros")

mammals_names2 = ["African bush elephant", "Asian elephant", "African forest elephant", "White rhinoceros", "Indian rhinoceros"]

mammals_names3 = {'names': ["African bush elephant", "Asian elephant", "African forest elephant", "White rhinoceros", "Indian rhinoceros"]}

mammals_names1[0]
mammals_names1[1]
mammals_names1[4]

mammals_names3['names'][0]


# Manual index provision
for i in [0,1,4]:
    print(mammals_names1[i])

# Manual index provision
for i in [0,1,4]:
    print(mammals_names2[i])

# Short hand loop (one-liner) ("list comprehension")
print([mammals_names1[i] for i in [0,1,4]])

African bush elephant
Asian elephant
Indian rhinoceros
African bush elephant
Asian elephant
Indian rhinoceros
['African bush elephant', 'Asian elephant', 'Indian rhinoceros']

mammals_mass1 = (8000, 5000, 4000, 3600, 3000)
mammals_mass2 = [8000, 5000, 4000, 3600, 3000]

# Nested tuple
mammals1 = (("African bush elephant",8000), ("Asian elephant", 5000), ("African forest elephant", 4000), ("White rhinoceros", 3600), ("Indian rhinoceros", 3000))

for i in [0,1,4]:
    print(mammals1[i])

# Nested list
mammals2 = [["African bush elephant",8000], ["Asian elephant", 5000], ["African forest elephant", 4000], ["White rhinoceros", 3600], ["Indian rhinoceros", 3000]]

for i in [0,1,4]:
    print(mammals2[i])

for i in range(0, len(mammals2), 2):
    print(mammals2[i])

# Dictionary (with lists)
mammals3 = {"Name": mammals_names1,
            "Mass": mammals_mass1}
mammals3.items()


mammals4 = {"African bush elephant" : 8000,
    "Asian elephant" : 5000,
    "African forest elephant" : 4000,
    "White rhinoceros" : 3600,
    "Indian rhinoceros": 3000
}
# doesn't work
# mammals4[0]
mammals4['African bush elephant']
list(mammals4.items())[0]

# List of dictionary
mammals_list_with_dict = []
for i in range(0, len(mammals2)):
    a_dict = {"Name": mammals2[i][0], "Mass": mammals2[i][1]}
    mammals_list_with_dict.append(a_dict)

print(mammals_list_with_dict)


# Dictionary from list
mammals_dict_from_list = {i : mammals2[i] for i in range(0, len(mammals2) ) }

for i in [0,1,4]:
    print(mammals_dict_from_list[i])

mammals_dict_from_list2 = {mammals2[i][0] : mammals2[i][1] for i in range(0, len(mammals2) ) }

# getting list from dict
list(mammals4.keys())

# using dict()
mammals_dict_from_list3 = dict.fromkeys(mammals_names2, 0)
mammals_dict_from_list3['African bush elephant'] = 123 # or in loop

mammals_dict_from_list2  = dict(zip(mammals_names2, mammals_mass2))

('African bush elephant', 8000)
('Asian elephant', 5000)
('Indian rhinoceros', 3000)
['African bush elephant', 8000]
['Asian elephant', 5000]
['Indian rhinoceros', 3000]
['African bush elephant', 8000]
['African forest elephant', 4000]
['Indian rhinoceros', 3000]
[{'Name': 'African bush elephant', 'Mass': 8000}, {'Name': 'Asian elephant', 'Mass': 5000}, {'Name': 'African forest elephant', 'Mass': 4000}, {'Name': 'White rhinoceros', 'Mass': 3600}, {'Name': 'Indian rhinoceros', 'Mass': 3000}]
['African bush elephant', 8000]
['Asian elephant', 5000]
['Indian rhinoceros', 3000]

Based on Part a), consider the following problems. Discuss/show whether and how you could solve them with the different data type implementations.
- Consider the names of the heaviest mammals only. A researcher wants to sort the names of the heaviest animals according to their alphabetical order. Create a new list called mammals_names_sorted. What happens do the original list of names, if you use Python’s .sort() method for lists?
- A new study reports the observation of a new species called “Giant Australian hamster” with a mass of 5000 kg. Insert this observation in your list. Hint: Use Python’s .insert() method for lists.
- Now, consider the list of the mammals’ name and mass. A new study reports that an African bush elephant with a mass of 10000 kg has been observed. Update the list of heaviest mammals in the solutions from Part a) (if possible).

mammals_names_sorted = mammals_names2.copy()
mammals_names_sorted.sort()
print(mammals_names_sorted)
print(mammals_names2)

['African bush elephant', 'African forest elephant', 'Asian elephant', 'Indian rhinoceros', 'White rhinoceros']
['African bush elephant', 'Asian elephant', 'African forest elephant', 'White rhinoceros', 'Indian rhinoceros']

# Sort lists and tuples with sorted()
mammals_names_sorted = sorted(mammals_names2)
print(mammals_names_sorted)

['African bush elephant', 'African forest elephant', 'Asian elephant', 'Indian rhinoceros', 'White rhinoceros']

mammals_names2.insert(1, "Giant Australian hamster")
print(mammals_names2)

['African bush elephant', 'Giant Australian hamster', 'Asian elephant', 'African forest elephant', 'White rhinoceros', 'Indian rhinoceros']

mammals2[0][1] = 10000
print(mammals2)

mammals2

# dictionary
mammals4['African bush elephant'] = 10000

[['African bush elephant', 10000], ['Asian elephant', 5000], ['African forest elephant', 4000], ['White rhinoceros', 3600], ['Indian rhinoceros', 3000]]

Exercise 4: Lists

Given a list of integers (which you define in your code), output “True” if the first and last number of the list are the same (else output “False”).

list1 = [1,23,4,123,2,3,5,6,2]
if (list1[0] == list1[-1]):
    print("True")
else:
    print("False")

False

Given a list of numbers (which you define in your code), print only those numbers that can be divided by 5 without residual.

list2 = [1,23,4,123,2,3,5,6,10]
for i in list2:
    if i%5 == 0:
        print(i)

5
10

Write a program that combines two lists by alternatingly taking elements, e.g. ["a","b","c"], [1,2,3] → ["a",1,"b",2,"c",3]

list4 = ['a','b','c']
list5 = [1,2,3]
list6 = []
for i in range(0,len(list5)):
    list6.append(list4[i])
    list6.append(list5[i])
print(len(list4))
print(len(list5))
print(len(list6))
print(list6)

3
3
6
['a', 1, 'b', 2, 'c', 3]

Let a small data set be \[5\ 2\ 11\ 19\ 6.\]

Enter these numbers into a list x_list.
Find the square of each number and enter these numbers into a list x_square.

x_list = [5, 2, 11, 19, 6]
# empty list
x_square = [0]*len(x_list)
x_square

x_list=[5, 2, 11, 19, 6]
x_square = [0]*len(x_list) 
for i in range(len(x_list)):
    x_square[i]=x_list[i]**2
print(x_square)

# Or short-hand notation
x_square2 = [i**2 for i in x_list]

[25, 4, 121, 361, 36]

Exercise 5: Functions

Write a function compare that returns the value
- 1 if $x>y$,
- 0 if $x=y$ and
- -1 if $x<y$.

The following program counts the number of times the letter “a” appears in a string

word="banana"
count=0
for letter in word:
    if letter == "a":
        count=count+1
print(count)

Encapsulate the code above in a function named count(w,l,s) that counts the number of times a letter l appears in a word w. Additionally, s gives the index in w where it should start the search.
Generate a random integer between 1 and 10. Write a guessing game where the user has to guess the secret number. After every guess, the program tells the user whether their number was too large or too small. At the end the number of tries needed should be printed.

Hint: Use the function randint() provided in the random module to randomly draw integers. Also use input().

def compare(x,y):
    if x>y:
        return 1
    elif x==y:
        return 0 
    else:
        return -1
    
compare(2,3)

-1

def count(w,l,s):
    count=0
    for letter in w[(s):len(w)]:
        if letter == l:
            count=count+1
    return(count)
 
  # Example
count('hello', 'l', 0)

# Example for random integer with random
import random
random.randint(1,10)

import random

guess = 0
r = random.randint(1,10)
attempts = 1

while guess != r:
    guess = input("Please guess!")
    guess = int(guess)
    if guess > r:
        print ("Smaller!")
        attempts+=1
    if guess < r:
        print("Bigger!")
        attempts+=1
print (f"Great, you found the number ({r}) in {attempts} tries!")

Section 2: Files, Modules and Classes

Exercise 1: Object Orientation

Mike and Jenny both have a car. Mike’s car is a Opel Manta, Jenny drives a Cadillac Escalade. Both collected data on their fuel stops in the last month.

Mike’s record:

Mileage¹	Amount of Gasoline
1200
2000	70
2250	22

Jenny’s record:

Mileage	Amount of Gasoline
8200
9400	80
10400	68
12000	110

Define a new object class Car which represents a car in terms of the owner’s name (attribute owner), the model of the car (model), the mileage (kilometers) and amount of gasoline (gas) at the recorded fuel stops. Generate two new instances for Mike’s and Jenny’s car, assign attributes and access them using pythons’s dot notation.
Implement a method print_car() that describes the object in a message saying something like: "This is Mike's car. It is a Opel Manta.".
Add a method avg_consumption() that returns the average consumption (in liter per 100 kilometers) between the recorded fuel stops.
Implement a new method add_stop() that allows to add data from a new fuel stop, i.e., enter a new entry for kilometers and gas. Add a new fuel stop and check if it is successfully listed in the output of the avg_consumption() call.
Generate a new class Carpool (without any functionality) in which you encapsulate Mike’s and Jenny’s cars. Add a new stop to Mike’s record and see if the encapsulated object has changed. Explain why (not) and provide an example how you can avoid that your change is effective for the encapsulated object.

Hint: Use import copy to use the call deepcopy for generating a deep copy of a list.

Consider the following objects and do the tasks below.

BobsCar = Car("Bob", "VW T1", [200000, 200100, 200300], [22, 40])
CarlsCar = Car("Carl", "Fiat Multipla", [179400, 183120, 183998], [89, 68])
MikesCar = Car("Mike", "Opel Manta", [1200, 2000, 2250], [70, 22])
JennysCar = Car("Jenny", "Cadillac Escalade", [8200, 9400, 10400, 12000], [80, 68, 110])

car_list = [MikesCar, JennysCar,  BobsCar, CarlsCar]
mixed_list = [MikesCar, BobsCar, 23, "abc"]

Sort all cars in car_list according to the mileage (= maximum of entries in property .kilometers) starting with the largest value.
Based on car_list, create two different lists one for cars with an average consumption below 10l/kilometer and one with the cars with 10l/kilometers and higher.
Get all objects of class Car from mixed_list.
(Optional: Implement the first two tasks in a method for the class CarPool.)

# a)
## Object orientation: 
class Car:
    """Represents a car in terms of total kilometers and fuels stops.
    
    attributes: kilometers, stop
    """

## Initialization
class Car:
    """Represents a car in terms of its owner, model, mileage and amount of 
    gasoline at several fuel stops.
    
    attributes: owner, model, kilometers, stop
    """   
    # initialization
    def __init__(self, owner, model, kilometers, gas):
        """Create a new instance of a car based on the car's owner, model, mileage and gas consumption."""
        self.owner = owner
        self.model = model
        self.kilometers = kilometers
        self.gas = gas
        print("Congratulation. You created a new Car!")
        
# Initialize
MikesCar = Car("Mike", "Opel Manta", [1200, 2000, 2250], [70, 22])
JennysCar = Car("Jenny", "Cadillac Escalade", [8200, 9400, 10400, 12000], [80, 68, 110])

# Access attributes
MikesCar.owner
MikesCar.model
MikesCar.kilometers
MikesCar.gas

Congratulation. You created a new Car!
Congratulation. You created a new Car!

[70, 22]

# b)
# add print_car method 
class Car:
    """Represents a car in terms of its owner, model, mileage and amount of 
    gasoline at several fuel stops.
    
    attributes: owner, model, kilometers, stop
    """   
    # initialization
    def __init__(self, owner, model, kilometers, gas):
        """Create a new instance of a car based on the car's owner, model, mileage and gas consumption."""    
        self.owner = owner
        self.model = model
        self.kilometers = kilometers
        self.gas = gas
        print("Congratulation. You created a new Car!")
        
    def print_car(self):
        """Prints a description of a car."""
        print(f"This is {self.owner}'s car. It is a {self.model}.")

# Comment: It is common to refer to a method that describes an object in a string as a "*string representation". In Python, the method name `__str__` is reserved for this method.

# c)
# add avg_consumption method
class Car:
    """Represents a car in terms of its owner, model, mileage and amount of 
    gasoline at several fuel stops.
    
    attributes: owner, model, kilometers, stop
    """   
    # initialization
    def __init__(self, owner, model, kilometers, gas):
        """Create a new instance of a car based on the car's owner, model, mileage and gas consumption."""    
        self.owner = owner
        self.model = model
        self.kilometers = kilometers
        self.gas = gas
        print("Congratulation. You created a new Car!")
        
    def print_car(self):
        """Prints a description of a car."""
        print(f"This is {self.owner}'s car. It is a {self.model}.")
    
    def add_stop(self, newmileage, newgas):
        """Add a new fuel stop with an entry on mileage and gas consumption."""
        self.kilometers.append(newmileage)
        self.gas.append(newgas)
        print("A new fuel stop has been added.")
        
    def avg_consumption(self):
        """Calculates the average fuel consumption (l/km) based on a car's gas and mileage."""
        self.avg_l_per_km = []
        for i in range(len(self.gas)):
            this_l_per_km = self.gas[i]/(self.kilometers[i+1] - self.kilometers[i])*100
            self.avg_l_per_km.append(this_l_per_km)

        return(self.avg_l_per_km)

# d)
class Car:
    """Represents a car in terms of its owner, model, mileage and amount of 
    gasoline at several fuel stops.
    
    attributes: owner, model, kilometers, stop
    """   
    # initialization
    def __init__(self, owner, model, kilometers, gas):
        """Create a new instance of a car based on the car's owner, model, mileage and gas consumption."""
        self.owner = owner
        self.model = model
        self.kilometers = kilometers
        self.gas = gas
        print("Congratulation. You created a new Car!")
        
    def print_car(self):
        """Prints a description of a car."""
        print(f"This is {self.owner}'s car. It is a {self.model}.")

    def avg_consumption(self):
        """Calculates the average fuel consumption (l/km) based on a car's gas and mileage."""
        self.avg_l_per_km = []
        for i in range(len(self.gas)):
            this_l_per_km = self.gas[i]/(self.kilometers[i+1] - self.kilometers[i])*100
            self.avg_l_per_km.append(this_l_per_km)

        return(self.avg_l_per_km)

    def add_stop(self, newmileage, newgas):
        """Add a new fuel stop with an entry on mileage and gas consumption."""
        self.kilometers.append(newmileage)
        self.gas.append(newgas)
        print("A new fuel stop has been added.")

class Car2(Car):
    """Extends Car by providing a max_mileage property."""
    @property
    def max_mileage(self):
        """The highest odometer reading."""
        return max(self.kilometers)
    
    @property
    def avg_consumption(self):
        """Returns overall average consumption (L/100km) for the entire recorded mileage."""
        total_distance = self.kilometers[-1] - self.kilometers[0]
        total_gas = sum(self.gas)
        return (total_gas / total_distance) * 100


# Instantiate Car objects
BobsCar = Car2("Bob", "VW T1", [200000, 200100, 200300], [22, 40])
CarlsCar = Car2("Carl", "Fiat Multipla", [179400, 183120, 183998], [89, 68])
MikesCar = Car2("Mike", "Opel Manta", [1200, 2000, 2250], [70, 22])
JennysCar = Car2("Jenny", "Cadillac Escalade", [8200, 9400, 10400, 12000], [80, 68, 110])

car_list = [MikesCar, JennysCar, BobsCar, CarlsCar]
mixed_list = [MikesCar, BobsCar, 23, "abc"]

# Task 1: Sort cars by maximum mileage (descending)


sorted_by_mileage = sorted(
    car_list,
    key=lambda car: car.max_mileage,
    reverse=True
)

# Task 2: Partition cars by average consumption
low_consumption = [car for car in car_list if car.avg_consumption < 10]
high_consumption = [car for car in car_list if car.avg_consumption >= 10]

# Task 3: Filter only Car instances from mixed_list
only_cars = [item for item in mixed_list if isinstance(item, Car)]

# Optional: Implement CarPool
class CarPool:
    def __init__(self, cars=None):
        self.cars = cars or []

    def sort_by_mileage(self, descending=True):
        return sorted(
            self.cars,
            key=lambda car: car.max_mileage,
            reverse=descending
        )

    def partition_by_consumption(self, threshold=10):
        low = [car for car in self.cars if car.avg_consumption < threshold]
        high = [car for car in self.cars if car.avg_consumption >= threshold]
        return low, high

# Example usage
pool = CarPool(car_list)
print("Sorted by mileage:", [car.model for car in pool.sort_by_mileage()])
low, high = pool.partition_by_consumption()
print("Low consumption cars:", [car.model for car in low])
print("High consumption cars:", [car.model for car in high])
print("Filtered from mixed_list:", [car.model for car in only_cars])

Congratulation. You created a new Car!
Congratulation. You created a new Car!
Congratulation. You created a new Car!
Congratulation. You created a new Car!
Sorted by mileage: ['VW T1', 'Fiat Multipla', 'Cadillac Escalade', 'Opel Manta']
Low consumption cars: ['Opel Manta', 'Cadillac Escalade', 'Fiat Multipla']
High consumption cars: ['VW T1']
Filtered from mixed_list: ['Opel Manta', 'VW T1']

Exercise 2: Basic Calculations with the `math` module

Import the math module by

import math

Use Python as you would a calculator to find the numeric answers

$1+2(3+4)$
$4^3+3^{2+1}$
$\frac{1+2}{3+4}$
$\sqrt{(4+3)(2+1)}$
$\sin(\pi)+\cos(\pi)$
$\exp(0) \log(2)$

print(1+2*(3+4))
print(4**3+3**(2+1))
print((1+2)/(3+4))
print(math.sqrt((4+3)*(2+1)))
print(math.sin(math.pi)+math.cos(math.pi))
print(math.exp(0)*math.log(2))

15
91
0.42857142857142855
4.58257569495584
-0.9999999999999999
0.6931471805599453

Section 3: `NumPy`

Exercise 1: Working with `NumPy` Arrays

You recorded your car’s mileage at your last six fill-ups as

\[ 34332\ 34604\ 34828\ 35019\ 35273\ 35513\]

Enter these numbers into the array m_age. Calculate the amount of miles that was driven after each fill up and enter these numbers into the variable miles. How many values does this variable include? What is the average amount of miles?
- Hint: np.diff()

m_age = np.array([34332, 34604, 34828, 35019, 35273, 35513])

miles = np.diff(m_age)
print(miles)
print(len(miles))
print(np.mean(miles))

[272 224 191 254 240]
5
236.2

Your cell-phone bill varied the last year from month to month the following way

\[ 22\ 31\ 18\ 35\ 29 \ 32\ 19\ 23\ 23\ 25\ 20\ 33\]

What is the largest amount you spent in a month? What is the smallest?
Find the total amount and the average amount per month that you spent in the last year!
How many months was the amount greater than 30$?
Is it worth to enter into a flatrate (25$ per month) for the next year expecting on average the same bills?

bill = np.array([22,31,18,35,29,32,19,23,23,25,20,33])
print(np.max(bill))
print(np.min(bill))
print(np.sum(bill))
print(np.mean(bill))

print(bill > 30 )
print(np.sum(bill > 30))
np.sum(bill)-25*12

35
18
310
25.833333333333332
[False  True False  True False  True False False False False False  True]
4

np.int64(10)

Exercise 2: Linear Regression with `NumPy`

The linear regression (see lecture 2) is a basic concept in statistics. Consider the linear regression model \[ y_i = \beta_0 + \beta_1X_i + \varepsilon_i = \mathbf{X}_i^T \boldsymbol{\beta} + \varepsilon_i \] One can estimate the coefficients $\mathbf{\hat{\beta}}$ by solving the following matrix equation: \[ \boldsymbol{\hat{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \] Calculate $\boldsymbol{\hat{\beta}}$ and the residuals $\boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{\hat{y}}$ using NumPy for the following small dataset.

Shoe Size ($X$)	Height ($y$)
36	170
37	175
38	180
38	169
40	190

Hint: To have an intercept $\hat{\beta}_0$ in the model, just add a full column of $1$s to your independent variable $X$ to get your design matrix $\mathbf{X}$ with np.ones().

x_1 = np.array([36, 37, 38, 38, 40])
y = np.array([170, 175, 180, 169, 190])

X = np.vstack((np.ones(shape=x_1.shape), x_1)).T

beta_hat = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T, X)), X.T), y)
y_hat = np.matmul(X, beta_hat)
residuals = np.subtract(y, y_hat)

print(beta_hat[0]) # Intercept
print(beta_hat[1]) # Coefficient
print(residuals) # Residuals

-2.749999999993207
4.749999999999762
[ 1.75  2.    2.25 -8.75  2.75]

Section 4: `Pandas`

Exercise 1: Pandas DataFrames

M and M’s Example

A bag of the candy M and M’s has many different colors. The table in Figure 2 contains the targeted color distribution in a bag of M and M’s as percentages for various types of packaging.

Enter this table into a pandas.DataFrame called mandms. Use pandas to answer the following questions:

Which packaging is missing one of the six colours?
Confirm that the color shares sum up to 100% for each package type.
- Hint: DataFrame.sum()
Write a code that displays the colour that is most frequent!
- Hint: DataFrame.sum()

import pandas as pd
A = np.array([[10,30, 10, 10, 20, 20 ],[20,20, 10, 10, 20, 20 ],
    [20,20, 20, 0, 20, 20 ],[16.667, 16.667, 16.667, 16.667, 16.667, 16.667 ],
    [16.667,16.667, 16.667, 16.667, 16.667, 16.667 ]])

namesr = ['MilkChoc','Peanut','PeanutButter','Almond','KidMinis']
namesc = ['blue','brown','green','orange','red','yellow']
mandms = pd.DataFrame(A, index=namesr, columns=namesc)
mandms

	blue	brown	green	orange	red	yellow
MilkChoc	10.000	30.000	10.000	10.000	20.000	20.000
Peanut	20.000	20.000	10.000	10.000	20.000	20.000
PeanutButter	20.000	20.000	20.000	0.000	20.000	20.000
Almond	16.667	16.667	16.667	16.667	16.667	16.667
KidMinis	16.667	16.667	16.667	16.667	16.667	16.667

mandms = pd.DataFrame.from_dict({'MilkChoc': [10,30, 10, 10, 20, 20 ],
    'Peanut': [20,20, 10, 10, 20, 20 ],
    'PeanutButter': [20,20, 20, 0, 20, 20 ],
    'Almond': [16.667, 16.667, 16.667, 16.667, 16.667, 16.667 ],
    'KidMinis': [16.667,16.667, 16.667, 16.667, 16.667, 16.667 ]},
    orient='index', columns=['blue','brown','green','orange','red','yellow'])

# easy way
mandms == 0
missing = mandms.where(mandms==0)
print(missing)

mandms.where(mandms==0).dropna()

              blue  brown  green  orange  red  yellow
MilkChoc       NaN    NaN    NaN     NaN  NaN     NaN
Peanut         NaN    NaN    NaN     NaN  NaN     NaN
PeanutButter   NaN    NaN    NaN     0.0  NaN     NaN
Almond         NaN    NaN    NaN     NaN  NaN     NaN
KidMinis       NaN    NaN    NaN     NaN  NaN     NaN

	blue	brown	green	orange	red	yellow

missing = mandms[mandms==0]
print(missing)

              blue  brown  green  orange  red  yellow
MilkChoc       NaN    NaN    NaN     NaN  NaN     NaN
Peanut         NaN    NaN    NaN     NaN  NaN     NaN
PeanutButter   NaN    NaN    NaN     0.0  NaN     NaN
Almond         NaN    NaN    NaN     NaN  NaN     NaN
KidMinis       NaN    NaN    NaN     NaN  NaN     NaN

missing.dropna(axis = 0, how = 'all')

	blue	brown	green	orange	red	yellow
PeanutButter	NaN	NaN	NaN	0.0	NaN	NaN

missing.dropna(axis = 0, how = 'all').index[0]

'PeanutButter'

mandms.sum(axis = 1)

MilkChoc        100.000
Peanut          100.000
PeanutButter    100.000
Almond          100.002
KidMinis        100.002
dtype: float64

S = mandms.sum(axis=0)
print(S)

blue       83.334
brown     103.334
green      73.334
orange     53.334
red        93.334
yellow     93.334
dtype: float64

maxmandms = S[S == S.max()]
print(maxmandms.index[0])

brown

Brain size data set

Download and load the brain_size.csv data set from STiNE. The dataset contains a small sample of observations of features (like gender, weight, height and three different IQ-measures) for different individuals.

import pandas as pd
brain_size = pd.read_csv('https://raw.githubusercontent.com/JanTeichertKluge/data-statprog/refs/heads/main/data/brain_size.csv', sep=';', na_values=".")
brain_size.head()

	Unnamed: 0	Gender	FSIQ	VIQ	PIQ	Weight	Height	MRI_Count
0	1	Female	133	132	124	118.0	64.5	816932
1	2	Male	140	150	124	NaN	72.5	1001121
2	3	Male	139	123	150	143.0	73.3	1038437
3	4	Male	133	129	128	172.0	68.8	965353
4	5	Female	137	132	134	147.0	65.0	951545

Before we perform some analysis, we will do some preprocessing.

Load the data set and drop the column Unnamed: 0.
Use DataFrame.dropna() to clean your data from entries containing NaN values. How many observations are left after the data cleaning?

Calculate the following descriptive statistics.

How many observations/individuals do we have?
Calculate the means and standard deviations for VIQ, Weight and Height. What is the median of Weight?
What is the percentage of women in the data set?
As a primitive approach to determine whether there are differences based on gender, calculate the means of the features above, again conditioned on gender.

brain_size.drop(['Unnamed: 0'], axis = 1, inplace=True)

print(brain_size.shape)
brain_size.dropna(inplace = True)
print(brain_size.shape)

(40, 7)
(38, 7)

print(brain_size['Weight'].mean())
print(brain_size['VIQ'].mean())
print(brain_size.loc[:,'VIQ'].mean())

print(brain_size.loc[:,'VIQ'].mean())
print(brain_size['VIQ'].std())
print(brain_size['Weight'].mean())
print(brain_size['Weight'].std())
print(brain_size['Height'].mean())
print(brain_size['Height'].std())

151.05263157894737
112.13157894736842
112.13157894736842
112.13157894736842
22.939604667570226
151.05263157894737
23.478509286005146
68.42105263157895
3.993789631262101

brain_size.describe()

	FSIQ	VIQ	PIQ	Weight	Height	MRI_Count
count	38.000000	38.000000	38.000000	38.000000	38.000000	3.800000e+01
mean	113.552632	112.131579	111.342105	151.052632	68.421053	9.067542e+05
std	23.815391	22.939605	22.597867	23.478509	3.993790	7.256175e+04
min	77.000000	71.000000	72.000000	106.000000	62.000000	7.906190e+05
25%	90.250000	90.250000	89.250000	135.250000	66.000000	8.548115e+05
50%	116.500000	113.000000	115.000000	146.500000	68.000000	9.053990e+05
75%	135.000000	129.000000	128.000000	172.000000	70.375000	9.495405e+05
max	144.000000	150.000000	150.000000	192.000000	77.000000	1.079549e+06

print(brain_size['Weight'].median())

146.5

print(np.sum(brain_size["Gender"]=="Female"))
print(np.sum(brain_size["Gender"]=="Male"))
brain_size.groupby('Gender').mean()

brain_size.groupby('Gender').agg('mean')

20
18

	FSIQ	VIQ	PIQ	Weight	Height	MRI_Count
Gender
Female	111.900000	109.450000	110.450000	137.200000	65.765000	862654.600000
Male	115.388889	115.111111	112.333333	166.444444	71.372222	955753.722222

brain_size.groupby('Gender').mean().diff(axis = 0).dropna()

brain_size.groupby('Gender').agg('mean').diff(axis = 0).dropna()

	FSIQ	VIQ	PIQ	Weight	Height	MRI_Count
Gender
Male	3.488889	5.661111	1.883333	29.244444	5.607222	93099.122222

Section 5: Data Visualization

Exercise 1: Basic visualization with `matplotlib` and `seaborn`

In this exercise, we will visualize data from a cosinus function. To do so, set up data for a variable x with np.linspace, on which you apply the cosinus function.

Create a scatter plot with matplotlib.
Create a line plot with seaborn.
Add a sinus function to the graph in matplotlib.
Add a sinus function to the graph in seaborn via a dashed line.
Create a random variable that simulates a coin flip. Use the information according to the coin flip to provide a color argument to the scatter plot in matplotlib.
Put the four different figures as subfigures into one major figure. Save the figure in a file my_plot.png.

# Step 1: imports
import numpy as np
import matplotlib.pyplot as plt

# Step 2: Generate the data for variable x
x = np.linspace(0, 2*np.pi, 100)

# Step 3: Apply the cosine function to x
y = np.cos(x)

# Step 4: Create a scatter plot using matplotlib
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot')

# Step 5: Show the plots
plt.show()

import pandas as pd
import seaborn as sns

# data types for seaborn
# https://seaborn.pydata.org/tutorial/data_structure.html

# Step 1: Generate the data for variable x
x = np.linspace(0, 2*np.pi, 100)

# Step 2: Apply the cosine function to x
y = np.cos(x)

# Step 3: Create a line plot using seaborn
df_plot = pd.DataFrame({'x': x, 'y': y})
sns.lineplot(df_plot, x = 'x', y = 'y')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Line Plot')

# Step 4: Show the plot
plt.show()

# Step 1: imports
import numpy as np
import matplotlib.pyplot as plt

# Step 2: Generate the data for variable x
x = np.linspace(0, 2*np.pi, 100)

# Step 3: Apply the cosine function to x
y = np.cos(x)
z = np.sin(x)

# Step 4: Create a scatter plot using matplotlib
plt.scatter(x, y)
plt.scatter(x, z)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot')
plt.show()

# Step 5: Show the plots

np.random.seed(123)

coin_flip_color = np.random.choice(['red', 'blue'], 100)

# Step 1: imports
import numpy as np
import matplotlib.pyplot as plt

# Step 2: Generate the data for variable x
x = np.linspace(0, 2*np.pi, 100)

# Step 3: Apply the cosine function to x
y = np.cos(x)
z = np.sin(x)

# Step 4: Create a scatter plot using matplotlib
plt.scatter(x, y, color = coin_flip_color)
plt.scatter(x, z, color = coin_flip_color)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot')

# Step 5: Show the plots
plt.show()

import numpy as np
import matplotlib.pyplot as plt

# Step 2: Generate the data for variable x
x = np.linspace(0, 2*np.pi, 100)

# Step 3: Apply the cosine function to x
y = np.cos(x)
z = np.sin(x)

df = pd.DataFrame({'x': x, 'y': y, 'z': z, 'coin_flip': coin_flip_color})

# Step 4: Create a scatter plot using matplotlib
sns.scatterplot(df, x = 'x', y = 'y', hue = 'coin_flip')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot with coin flip')

# Step 5: Show the plots
plt.show()

import seaborn as sns

# Step 1: Generate the data for variable x
x = np.linspace(0, 2*np.pi, 100)

# Step 2: Apply the cosine function to x
y = np.cos(x)
z = np.sin(x)


# Step 3: Create a line plot using seaborn
df_plot = pd.DataFrame({'x': x, 'y': y, 'z': z, 'coin_flip': coin_flip_color})
sns.lineplot(df_plot, x = 'x', y = 'y')
sns.lineplot(df_plot, x = 'x', y = 'z', linestyle = '--')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Line Plot')

# Step 4: Show the plot
plt.show()

# %%
# Subfigures in matplotlib: https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subfigures.html
# https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html
import matplotlib.pyplot as plt

# Create the figure and subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# Set the title for each subplot
# Comment: You can do this in a loop too!
axs[0, 0].set_title("Scatter plot matplotlib")
axs[0, 1].set_title("Line plot seaborn")
axs[1, 0].set_title("Colored scatter plot matplotlib")
axs[1, 1].set_title("Cosine and Sine line plot seaborn")

# Scatter plot
axs[0, 0].scatter(x, y)
axs[0, 0].set_xlabel('X')
axs[0, 0].set_ylabel('Y')

# Lineplot plot seaborn
df = pd.DataFrame({'x' : x, 'cosinus' : y})
sns.lineplot(df, x = 'x', y = 'cosinus', ax=axs[0, 1])

# Colored scatter plot
axs[1,0].scatter(x, y, color = coin_flip_color)
axs[1,0].set_xlabel('x')
axs[1,0].set_ylabel('y')
axs[1,0].set_title('Scatter Plot')

# Cosine and sinus with seaborn
df = pd.DataFrame({'x' : x, 'cosinus' : y,
                   'sinus': z}).set_index("x")
sns.lineplot(data=df, ax=axs[1, 1])

# Adjust the spacing between subplots
plt.tight_layout()

# Show the figure
plt.show()

Exercise 2: Times University Ranking Data

Load the Times University Ranking Data, which collects data on a ranking of international universities. You can download a zip archive with the data from STiNE. Download and unpack the zip file and load the file timesData.csv.

file_path = "https://raw.githubusercontent.com/JanTeichertKluge/data-statprog/refs/heads/main/data/timesData.csv"
import pandas as pd
df_times = pd.read_csv(file_path)

Open the data set with pandas and have a look at the very first rows. What do you see?
What has been the best university from Germany in 2016? What is its rank?

df_times[(df_times['country'] == 'Germany') & (df_times['year'] == 2016)].sort_values(by = 'total_score', ascending = False).head(5)

	world_rank	university_name	country	teaching	international	research	citations	income	total_score	num_students	student_staff_ratio	international_students	female_male_ratio	year
1831	29	LMU Munich	Germany	70.5	62.8	77.4	85.7	100.0	77.3	35,691	15.5	13%	62 : 38	2016
1839	37	Heidelberg University	Germany	68.8	62.8	69.6	88.2	68.2	74.4	28,881	24.5	17%	55 : 45	2016
1851	49	Humboldt University of Berlin	Germany	63.7	62.6	77.0	73.6	36.1	69.9	29,987	52.5	16%	NaN	2016
1855	53	Technical University of Munich	Germany	61.0	63.8	66.0	80.1	99.2	69.4	35,565	31.5	20%	33 : 67	2016
1874	72	Free University of Berlin	Germany	57.9	69.2	72.2	60.2	35.1	63.2	33,062	39.3	20%	58 : 42	2016

How has the university’s ranking changed over time? Create a scatter plot that shows the development of the university’s ranking over time.
Create a graphic that shows the changes in terms of the variables international, research, citation, income and total_score for this university over time.
Modify the graphics in a way you like. Can you make them interactive with plotly or bokeh? Create your own visualization as based on the Times Data set.

df_lmu = df_times[df_times['university_name'] == "LMU Munich"]
df_lmu[['year', 'world_rank']].sort_values(by = 'year')
# Pandas built-in plotting
df_lmu.plot(x = 'year', y = 'world_rank', kind = 'scatter')

import seaborn as sns
sns.scatterplot(df_lmu, x = 'year', y = 'world_rank')

import matplotlib.pyplot as plt

df_lmu_details = df_lmu.loc[:, ['year', 'international', 'research', 'citations', 'income', 'total_score', 'world_rank']]

# Create a figure with subplots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10, 10))

# Generate scatter plots for each column
sns.scatterplot(data=df_lmu_details, x="year", y="international", ax=axes[0, 0])
sns.scatterplot(data=df_lmu_details, x="year", y="research", ax=axes[0, 1])
sns.scatterplot(data=df_lmu_details, x="year", y="citations", ax=axes[1, 0])
sns.scatterplot(data=df_lmu_details, x="year", y="income", ax=axes[1, 1])
sns.scatterplot(data=df_lmu_details, x="year", y="total_score", ax=axes[2, 0])
sns.scatterplot(data=df_lmu_details, x="year", y="world_rank", ax=axes[2, 1])

# Set titles and labels
axes[0, 0].set_title("International")
axes[0, 1].set_title("Research")
axes[1, 0].set_title("Citations")
axes[1, 1].set_title("Income")
axes[2, 0].set_title("Total Score")
axes[2, 1].set_title("World Rank")

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

Footnotes

Mileage denotes the total number of kilometers driven with a car. The first value is the number of kilometers at the last fuel stop before record.↩︎