Write two functions, get_rows and get_columns, that take a two-dimensional array as a parameter. They should return the list of rows and the list of columns of the array, respectively. The rows and columns should be one-dimensional arrays.
Test your solution in the main function. Example of usage:
a = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6],
              [8, 8, 1, 6]])
get_rows(a)
[array([5, 0, 3, 3]), array([7, 9, 3, 5]), array([2, 4, 7, 6]), array([8, 8, 1, 6])]
get_columns(a)
[array([5, 7, 2, 8]), array([0, 9, 4, 8]), array([3, 3, 7, 1]), array([3, 5, 6, 6])]
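One possible approach, given here only as a minimal sketch: iterating over a two-dimensional NumPy array yields its rows, and the transpose turns columns into rows. The function bodies below are an illustration, not a reference solution.

```python
import numpy as np

def get_rows(a):
    # Iterating over a 2-D array yields its rows as 1-D arrays.
    return list(a)

def get_columns(a):
    # The rows of the transpose are the columns of the original array.
    return list(a.T)

a = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6],
              [8, 8, 1, 6]])
print(get_rows(a))
print(get_columns(a))
```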
Create function get_row_vectors that returns a list of rows from the input array of shape (n,m), but this time the rows must have shape (1,m). Similarly, create function get_columns_vectors that returns a list of columns (each having shape (n,1)) of the input matrix.
Example: for a 2x3 input matrix
[[5 0 3]
[3 7 9]]
The result should be
Row vectors:
[array([[5, 0, 3]]), array([[3, 7, 9]])]
Column vectors:
[array([[5],
[3]]),
array([[0],
[7]]),
array([[3],
[9]])]
The above output is simply the returned lists printed with print.
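A sketch of one way to keep both dimensions: slicing with a range of length one (for example `a[i:i+1, :]`) preserves the axis that plain integer indexing would drop. This is an illustrative approach, assuming the input is a two-dimensional NumPy array.

```python
import numpy as np

def get_row_vectors(a):
    # a[i:i+1, :] keeps both axes, so each piece has shape (1, m).
    return [a[i:i+1, :] for i in range(a.shape[0])]

def get_columns_vectors(a):
    # a[:, j:j+1] keeps both axes, so each piece has shape (n, 1).
    return [a[:, j:j+1] for j in range(a.shape[1])]

a = np.array([[5, 0, 3],
              [3, 7, 9]])
print("Row vectors:")
print(get_row_vectors(a))
print("Column vectors:")
print(get_columns_vectors(a))
```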
Create a function diamond that returns a two-dimensional integer array in which the 1s form a diamond shape. The rest of the numbers are 0. The function should take a parameter that gives the length of a side of the diamond. Do this using the eye and concatenate functions of NumPy and array slicing.
Example of usage:
print(diamond(3))
[[0 0 1 0 0]
[0 1 0 1 0]
[1 0 0 0 1]
[0 1 0 1 0]
[0 0 1 0 0]]
print(diamond(1))
[[1]]
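The following sketch shows one way to combine eye, concatenate, and slicing. It builds the top half from a mirrored identity matrix next to an identity matrix without its first column, then appends the top half flipped vertically without the middle row. Other decompositions work equally well.

```python
import numpy as np

def diamond(n):
    eye = np.eye(n, dtype=int)
    # Top half: mirrored identity (left part) next to the identity
    # without its first column (right part); shape (n, 2n-1).
    top = np.concatenate([eye[:, ::-1], eye[:, 1:]], axis=1)
    # Bottom half: the top half upside down, skipping the middle row.
    return np.concatenate([top, top[-2::-1]], axis=0)

print(diamond(3))
print(diamond(1))
```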
Write function vector_lengths that takes a two-dimensional array of shape (n,m) as a parameter. Each row in this array corresponds to a vector. The function should return an array of shape (n,) that has the length of each vector in the input. The length is defined by the usual Euclidean norm. Don’t use loops at all in your solution. Instead, use vectorized operations. You must use at least the numpy.sum and the numpy.sqrt functions.
Let $x$ and $y$ be m-dimensional vectors. The angle $\alpha$ between the two vectors is defined by the equation $\cos(\alpha) = \frac{\langle x,y \rangle}{\|x\| \|y\|}$, where the angle brackets denote the inner product, and $\|x\| = \sqrt{\langle x,x \rangle}$.
Write function vector_angles that takes two arrays X and Y with the same shape (n,m) as parameters. Each row in the arrays corresponds to a vector. The function should return a vector of shape (n,) with the corresponding angles between the vectors of X and Y in degrees, not in radians. Again, don’t use loops, but use vectorized operations.
Write function column_comparison that takes a two-dimensional array as a parameter. The function should return a new array containing those rows from the input whose value in the second column is larger than the value in the second-to-last column. You may assume that the input contains at least two columns. Don’t use loops; use vectorized operations instead.
For array
[[8 9 3 8 8]
[0 5 3 9 9]
[5 7 6 0 4]
[7 8 1 6 2]
[2 1 3 5 8]]
the result would be
[[8 9 3 8 8]
[5 7 6 0 4]
[7 8 1 6 2]]
Write function first_half_second_half that takes a two-dimensional array of shape (n,2*m) as a parameter. The input array has 2*m columns. The output from the function should be a matrix with those rows from the input for which the sum of the first m elements is larger than the sum of the last m elements of the row. Your solution should call the np.sum function or the corresponding method exactly twice.
Example of usage:
a = np.array([[1, 3, 4, 2],
              [2, 2, 1, 2]])
first_half_second_half(a)
array([[2, 2, 1, 2]])
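One way to meet the "exactly two calls to np.sum" requirement is to sum each half once along axis 1 and compare the two results. The sketch below assumes the number of columns is even, as the exercise states.

```python
import numpy as np

def first_half_second_half(a):
    m = a.shape[1] // 2
    # One np.sum call per half, each producing one sum per row.
    mask = np.sum(a[:, :m], axis=1) > np.sum(a[:, m:], axis=1)
    return a[mask]

a = np.array([[1, 3, 4, 2],
              [2, 2, 1, 2]])
print(first_half_second_half(a))
```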
Write function most_frequent_first that takes a two-dimensional array and an index c of a column as parameters. The function should then return the array whose rows are sorted based on column c in the following way. Rows are ordered so that the rows with the most frequent element in column c come first, then come the rows with the second most frequent element in column c, and so on. Therefore, the values outside column c don’t affect the ordering in any way.
Example of usage:
a:
[[5 0 3 3 7 9 3 5 2 4]
[7 6 8 8 1 6 7 7 8 1]
[5 9 8 9 4 3 0 3 5 0]
[2 3 8 1 3 3 3 7 0 1]
[9 9 0 4 7 3 2 7 2 0]
[0 4 5 5 6 8 4 1 4 9]
[8 1 1 7 9 9 3 6 7 2]
[0 3 5 9 4 4 6 4 4 3]
[4 4 8 4 3 7 5 5 0 1]
[5 9 3 0 5 0 1 2 4 2]]
print(most_frequent_first(a, -1))
[[4 4 8 4 3 7 5 5 0 1]
[2 3 8 1 3 3 3 7 0 1]
[7 6 8 8 1 6 7 7 8 1]
[5 9 3 0 5 0 1 2 4 2]
[8 1 1 7 9 9 3 6 7 2]
[9 9 0 4 7 3 2 7 2 0]
[5 9 8 9 4 3 0 3 5 0]
[0 3 5 9 4 4 6 4 4 3]
[0 4 5 5 6 8 4 1 4 9]
[5 0 3 3 7 9 3 5 2 4]]
If we look at the last column, we see that the number 1 appears three times, then both numbers 2 and 0 appear twice, and lastly numbers 3, 9, and 4 appear only once. Note that, for example, among the rows whose value in column c appears twice in column c, the order can be arbitrary.
Hint: the function np.unique may be useful.
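A sketch following the hint: np.unique with return_inverse and return_counts gives, for every row, the frequency of its value in column c, and sorting by descending frequency yields the requested order. Because ties may be ordered arbitrarily, the output of this sketch can differ from the example above while still being valid.

```python
import numpy as np

def most_frequent_first(a, c):
    # inverse maps each row to the index of its value among the unique
    # values; counts holds how often each unique value occurs.
    _, inverse, counts = np.unique(a[:, c], return_inverse=True,
                                   return_counts=True)
    freq = counts[inverse]          # frequency of each row's value
    # Sort by descending frequency; a stable sort keeps the original
    # relative order among rows with equal frequency.
    order = np.argsort(-freq, kind="stable")
    return a[order]

a = np.array([[1, 9], [2, 7], [3, 9], [4, 8], [5, 9], [6, 7]])
print(most_frequent_first(a, -1))
```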
Write function matrix_power that takes as its first argument a square matrix a and as its second argument a non-negative integer n. The function should return the matrix a multiplied by itself n-1 times. Use Python’s reduce function and a generator expression.
Extend the matrix_power function. For negative powers, we define $a^{-1}$ to be equal to the multiplicative inverse of a. You can use NumPy’s function numpy.linalg.inv for this, and you may assume that the input matrix is invertible.
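A minimal sketch covering both parts. It passes the identity matrix to reduce as the initial value, so that n = 0 returns the identity; that convention is an assumption here, since the exercise text does not spell out the zero case. Negative powers are handled by inverting the matrix first.

```python
from functools import reduce
import numpy as np

def matrix_power(a, n):
    if n < 0:
        # a^(-n) is the inverse of a raised to the power n.
        return matrix_power(np.linalg.inv(a), -n)
    # Multiply n copies of a together, starting from the identity
    # (assumed convention: the zeroth power is the identity matrix).
    return reduce(np.matmul, (a for _ in range(n)), np.eye(a.shape[0]))

a = np.array([[1, 1],
              [0, 1]])
print(matrix_power(a, 3))
print(matrix_power(a, -2))
```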
Load the iris dataset via the following Python code:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
The four columns of the returned array correspond to sepal length, sepal width, petal length, and petal width (all in cm).
What are the correlations between all the variables? Write a function correlations that returns an array of shape (4,4) containing the correlations. Use the function np.corrcoef. Which pair of variables is the most highly correlated?
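One detail worth illustrating: np.corrcoef treats rows as variables by default, so for a data matrix with one observation per row it needs rowvar=False. The sketch below uses a small made-up array instead of the iris data, and the helper most_correlated_pair is a hypothetical addition that picks the off-diagonal entry with the largest absolute correlation.

```python
import numpy as np

def correlations(X):
    # rowvar=False: each column is a variable, each row an observation.
    return np.corrcoef(X, rowvar=False)

def most_correlated_pair(corr):
    # Hypothetical helper: ignore the diagonal (always 1) and return the
    # indices of the largest absolute correlation.
    c = np.abs(corr)
    np.fill_diagonal(c, 0)
    i, j = np.unravel_index(np.argmax(c), c.shape)
    return int(i), int(j)

# Made-up data with three variables; columns 0 and 1 are proportional.
X_demo = np.array([[1.0, 2.0, 5.0],
                   [2.0, 4.0, 1.0],
                   [3.0, 6.0, 7.0],
                   [4.0, 8.0, 2.0]])
corr = correlations(X_demo)
print(corr)
print(most_correlated_pair(corr))
```

With the iris array X from above, correlations(X) would have shape (4,4).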
Write function subfigures that creates a figure that has two subfigures (two axes in matplotlib parlance). The function takes a two-dimensional array a as a parameter. In the left subfigure, draw a graph using the plot method, with the x coordinates taken from the first column of a and the y coordinates from the second column of a. In the right subfigure, draw a set of points using the scatter method, again with the x coordinates taken from the first column of a and the y coordinates from the second column of a. Additionally, the points should get their color from the third column of a and their size from the fourth column of a. For this, use the c and s named parameters of scatter, respectively.
Test your function subfigures with one or two example data sets.
Write function read_series that reads input lines from the user and returns a Series. Each line should contain first the index and then the corresponding value, separated by whitespace. The index and values are strings (in this case dtype is object). An empty line signals the end of the Series. Malformed input should cause an exception. An input line is malformed if it is non-empty and, when split at whitespace, does not result in two parts.
Write function inverse_series that takes a Series as a parameter and returns a new Series whose indices and values have swapped roles.
Create the following DataFrame of top Chinese cities by population (columns: 城市 = city, 人口(万人) = population in units of 10,000 people, 总面积(平方公里) = total area in square kilometres):
城市 人口(万人) 总面积(平方公里)
重庆市 3101.79 667.5
上海市 2423.78 885.7
北京市 2154.20 1289.3
成都市 1633 408.66
天津市 1559.60 571.5
广州市 1490.44 785.44
Make function powers_of_series that takes a Series and a positive integer k as parameters and returns a DataFrame. The resulting DataFrame should have the same index as the input Series. The first column of the DataFrame should be the input Series, the second column should contain the Series raised to the power of two, the third column the Series raised to the power of three, and so on up to and including the power of k. The columns should have indices from 1 to k.
The values should be numbers, but the index can have any type. Test your function. Example of usage:
s = pd.Series([1,2,3,4], index=list("abcd"))
print(powers_of_series(s, 3))
Should print:
   1   2   3
a  1   1   1
b  2   4   8
c  3   9  27
d  4  16  64
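A compact sketch: building the DataFrame from a dictionary whose keys are the exponents 1 to k gives the required column labels directly, and each column inherits the index of the input Series.

```python
import pandas as pd

def powers_of_series(s, k):
    # Keys 1..k become the column labels; each value is the Series
    # raised elementwise to that power.
    return pd.DataFrame({i: s**i for i in range(1, k + 1)})

s = pd.Series([1, 2, 3, 4], index=list("abcd"))
print(powers_of_series(s, 3))
```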
Run the following commands to read in the data file coupon_nm.csv:
import pandas as pd
coupon_nm=pd.read_csv("coupon_nm.csv",encoding='gbk')
coupon_nm.head()
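1) List all the column names. Then compute summary statistics (such as the mean, standard deviation, and quantiles) for 购买人数 (number of buyers), 团购价 (group-buy price), and 市场价 (market price).
2) Produce a data frame that contains only the samples in coupon_nm for which 团购评价 (group-buy rating) equals 5.
3) Split the column named 到期时间 (expiry time) into three columns: 年 (year), 月 (month), and 日 (day), and return a data frame with five columns whose column names are 团购活动ID (group-buy activity ID), 年, 月, and 日. The map function can help with this.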
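Read in shops_nm.xlsx with the pandas.read_excel command.
1) Check whether the dimensions of the data are 699*9, and view the first 5 rows.
2) Find the shop with the largest number of group-buy reviews and print it. The output should have the form:
```
The shop with the most group-buy reviews is: ***
```
3) Find the top 10 shops by price per person and store the result in a new DataFrame shops_top10 whose columns are 商家店名 (shop name), 评分 (rating), 评价数 (number of reviews), and 人均 (price per person).
4) Find the shops whose number of reviews is missing and print them.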
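Apply the following data cleaning and preprocessing to the data from exercise 6:
1) Remove all text from the 人均 (price per person) column, for example replace ‘大概92左右’ ("roughly 92") with 92 and ‘人均:100’ ("per person: 100") with 100, and set its data type to float.
2) Based on each shop's rating, classify the shops into low-level shops (rating below 2), mid-level shops (2-3.5), and high-level shops (above 3.5), and add the result to the original DataFrame as a column named ‘商家等级’ (shop level).
3) Group by shop level and aggregate 人均 by its mean. The result should be
        人均
商家等级
0       63.335065
1       55.055556
2       52.230167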
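The three steps above can be sketched as follows. The small DataFrame is a made-up stand-in for the shops data, the regular expression assumes that each 人均 entry contains exactly one number, and the handling of the boundary values 2 and 3.5 (both counted as mid-level) is one reading of the ranges given in step 2.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the shops data (not the real file contents).
shops = pd.DataFrame({
    "人均": ["大概92左右", "人均:100", "45.5"],
    "评分": [1.5, 3.0, 4.2],
})

# Step 1: keep only the number in each string and convert to float.
shops["人均"] = (shops["人均"].astype(str)
                 .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
                 .astype(float))

# Step 2: 0 = low (< 2), 1 = mid (2 to 3.5 inclusive), 2 = high (> 3.5).
# Missing ratings would fall into the default and need separate handling.
rating = shops["评分"]
shops["商家等级"] = np.select([rating < 2, rating <= 3.5], [0, 1], default=2)

# Step 3: mean price per person within each shop level.
result = shops.groupby("商家等级")[["人均"]].mean()
print(result)
```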