Basic efficiency tips for beginners in Python3

2020-04-02 14:47:42
OfStack

Sometimes I ask myself why I didn't know there was an easier way to do a certain thing in Python 3. As I look for answers, over time I find cleaner, more efficient, and less buggy code. Overall (and not just for this article), there turned out to be more of these things than I expected. Here are the first few features that weren't obvious to me, and that led me to more efficient, simpler, and more maintainable code.

The dictionary

keys() and items() in the dictionary

You can do a lot of interesting things with a dictionary's keys and items, because the views they return behave like sets:


 
aa = {'mike': 'male', 'kathy': 'female', 'steve': 'male', 'hillary': 'female'}

bb = {'mike': 'male', 'ben': 'male', 'hillary': 'female'}

aa.keys() & bb.keys() # {'mike', 'hillary'} # these are set-like
aa.keys() - bb.keys() # {'kathy', 'steve'}
# If you want to get the common key-value pairs in the two dictionaries
aa.items() & bb.items() # {('mike', 'male'), ('hillary', 'female')}

Too simple!
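
As a further sketch of the same idea, the set-like views can be used to compare two versions of a dictionary. The names old, new, added, removed, and changed are illustrative and not part of the original example. Note that set operations on items() only work when the values are hashable.

```python
old = {'mike': 'male', 'kathy': 'female', 'steve': 'male'}
new = {'mike': 'male', 'kathy': 'unknown', 'ben': 'male'}

added = new.keys() - old.keys()      # keys that exist only in new
removed = old.keys() - new.keys()    # keys that exist only in old
# Keys present in both dictionaries whose values differ
changed = {k for k in old.keys() & new.keys() if old[k] != new[k]}

print(added, removed, changed)
```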

Verify the existence of a key in the dictionary

How many times have you written this code?


 
# ls is assumed to be a list of (key, value) pairs
dictionary = {}
for k, v in ls:
  if k not in dictionary:
    dictionary[k] = []
  dictionary[k].append(v)

This code isn't that bad, but why do you always need an if statement?


 
from collections import defaultdict
dictionary = defaultdict(list) # defaults to list
for k, v in ls:
  dictionary[k].append(v)

This is clearer, and the redundant if statement is gone.
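
Here is a small runnable sketch of the grouping pattern. The sample pairs in ls are made up for illustration. It also shows dict.setdefault, which gives the same result with a plain dictionary if you would rather not import anything.

```python
from collections import defaultdict

# Hypothetical sample data: a list of (key, value) pairs
ls = [('fruit', 'apple'), ('veg', 'kale'), ('fruit', 'pear')]

grouped = defaultdict(list)
for k, v in ls:
    grouped[k].append(v)  # a missing key starts out as an empty list

# Same grouping with a plain dict and setdefault
plain = {}
for k, v in ls:
    plain.setdefault(k, []).append(v)

print(dict(grouped))
```

One thing to keep in mind is that merely reading a missing key from a defaultdict creates it, so use `in` or `.get()` when you only want to test for membership.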

Update a dictionary with another dictionary


 
from itertools import chain
a = {'x': 1, 'y': 2, 'z': 3}
b = {'y': 5, 's': 10, 'x': 3, 'z': 6}

# Update a with b (values from b win for shared keys)
c = dict(chain(a.items(), b.items()))
c # {'x': 3, 'y': 5, 'z': 6, 's': 10}

This looks good, but it's not simple enough. See if we can do better:


 
c = a.copy()
c.update(b)

Clearer and more readable!
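
If you prefer a single expression, dictionary unpacking (available since Python 3.5) gives the same result without mutating either input. This is a sketch of an alternative, not something the original example used; the name merged is illustrative.

```python
a = {'x': 1, 'y': 2, 'z': 3}
b = {'y': 5, 's': 10, 'x': 3, 'z': 6}

# Later dictionaries win for duplicate keys, just like update()
merged = {**a, **b}

# On Python 3.9 and later, the merge operator does the same thing:
# merged = a | b

print(merged)
```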

Get the maximum value from a dictionary

If you want to get the maximum value in a dictionary, it might look something like this:


 
aa = {k: sum(range(k)) for k in range(10)}
aa # {0: 0, 1: 0, 2: 1, 3: 3, 4: 6, 5: 10, 6: 15, 7: 21, 8: 28, 9: 36}
max(aa.values()) #36

This works, but if you also need the key, you have to look it up separately. Instead, we can zip the values with the keys so that max returns a (value, key) pair:


 
max(zip(aa.values(), aa.keys()))
# (36, 9) => value, key pair

Similarly, if you want to traverse a dictionary from the largest value to the smallest, you can do this:


 
sorted(zip(aa.values(), aa.keys()), reverse=True)
# [(36, 9), (28, 8), (21, 7), (15, 6), (10, 5), (6, 4), (3, 3), (1, 2), (0, 1), (0, 0)]
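
An alternative sketch uses the key argument of max and sorted, which keeps the pairs in (key, value) order. Unlike the zip approach, it never compares the dictionary keys with each other, so it also works when the keys are not orderable. The names best_key, best_pair, and by_value are illustrative.

```python
aa = {k: sum(range(k)) for k in range(10)}

# Key with the largest value
best_key = max(aa, key=aa.get)

# (key, value) pair with the largest value
best_pair = max(aa.items(), key=lambda kv: kv[1])

# All pairs, largest value first; ties keep their original order
by_value = sorted(aa.items(), key=lambda kv: kv[1], reverse=True)

print(best_key, best_pair)
```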

Unpack any number of items in a list

We can use the magic of * to capture any number of items into a list:


 
def compute_average_salary(person_salary):
  person, *salary = person_salary
  # / is true division in Python 3, so no float() is needed
  return person, sum(salary) / len(salary)

person, average_salary = compute_average_salary(["mike", 40000, 50000, 60000])
person # 'mike'
average_salary # 50000.0

That's not all that exciting, but what if I told you it could also look like this:


 
def compute_average_salary(person_salary_age):
  person, *salary, age = person_salary_age
  return person, sum(salary) / len(salary), age

person, average_salary, age = compute_average_salary(["mike", 40000, 50000, 60000, 42])
age # 42

It looks so simple!

If you have a dictionary whose keys are strings and whose values are lists, then instead of walking through the dictionary and processing each value in turn, you can use a flatter representation (a list of lists) like this:


 
# Instead of doing this
for k, v in dictionary.items():
  process(v)
 
# we are separating head and the rest, and process the values
# as a list similar to the above. head becomes the key value
for head, *rest in ls:
  process(rest)
 
# if not very clear, consider the following example
aa = {k: list(range(k)) for k in range(5)} # range returns a lazy range object, so wrap it in list()
aa # {0: [], 1: [0], 2: [0, 1], 3: [0, 1, 2], 4: [0, 1, 2, 3]}
for k, v in aa.items():
  print(sum(v))
 
#0
#0
#1
#3
#6
 
# Instead
aa = [[ii] + list(range(jj)) for ii, jj in enumerate(range(5))]
for head, *rest in aa:
  print(sum(rest))
 
#0
#0
#1
#3
#6

You can unpack a list into head, *rest or into head, *rest, tail, and so on.
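
The common unpacking shapes can be sketched side by side. The record list and the variable names are illustrative; the starred name always receives a list, which may be empty.

```python
record = ['mike', 40000, 50000, 60000, 42]

head, *rest = record            # first item, then everything else
*init, last = record            # everything but the last item, then the last
first, *middle, tail = record   # both ends, with the middle collected

# The starred target is an empty list when nothing is left over
a, *nothing, b = ['x', 'y']

print(head, rest)
print(init, last)
print(first, middle, tail)
```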

Collections as counters

collections is one of my favorite libraries in Python. If you need any data structure beyond the built-in ones, you should look at it first.

Part of my day job is counting a lot of words that are not very important. You might say that you can use the words as dictionary keys and their counts as values, and I might have agreed with you before I came across Counter in collections (yes, Counter is the reason for this whole introduction).

Suppose you read the Wikipedia article on Python, convert it to a string, and split it into a list of words (tokens, in order):


 
import re
# python_string is assumed to hold the article text
word_list = list(map(lambda k: k.lower().strip(), re.split(r'[;,:(\.\s)]\s*', python_string)))
word_list[:10] # ['python', 'is', 'a', 'widely', 'used', 'general-purpose', 'high-level', 'programming', 'language', '[17][18][19]']

So far so good, but if you want to count the words in this list:


 
from collections import defaultdict # again, collections!
dictionary = defaultdict(int)
for word in word_list:
  dictionary[word] += 1

It's not that bad, but with Counter you can save your time for more meaningful things.


 
from collections import Counter
counter = Counter(word_list)
# Getting the most common 10 words
counter.most_common(10)
[('the', 164), ('and', 161), ('a', 138), ('python', 138),
('of', 131), ('is', 102), ('to', 91), ('in', 88), ('', 56)]
# Just like a dictionary; in Python 3 the keys view must be wrapped in list() before slicing
list(counter.keys())[:10]
['', 'limited', 'all', 'code', 'managed', 'multi-paradigm',
'exponentiation', 'fromosing', 'dynamic']
# (the key order shown is from an older Python; since 3.7, keys come back in insertion order)

Pretty neat, but if we look at the methods available in Counter:


 
dir(counter)
['__add__', '__and__', '__class__', '__cmp__', '__contains__', '__delattr__', '__delitem__', '__dict__',
'__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__',
'__init__', '__iter__', '__le__', '__len__', '__lt__', '__missing__', '__module__', '__ne__', '__new__',
'__or__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__',
'__str__', '__sub__', '__subclasshook__', '__weakref__', 'clear', 'copy', 'elements', 'fromkeys', 'get',
'has_key', 'items', 'iteritems', 'iterkeys', 'itervalues', 'keys', 'most_common', 'pop', 'popitem', 'setdefault',
'subtract', 'update', 'values', 'viewitems', 'viewkeys', 'viewvalues']
# (listing captured on Python 2; Python 3 drops __cmp__, has_key, iteritems/iterkeys/itervalues and viewitems/viewkeys/viewvalues)

Have you noticed the __add__ and __sub__ methods? Yes, Counter supports addition and subtraction. So if you have a lot of text to count words in, you don't need Hadoop: you can build a Counter for each chunk (the map step) and add them up (the reduce step). That gives you a MapReduce built on Counter, and you'll probably thank me later.
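
The map and reduce steps described above can be sketched in a few lines. The documents list is made-up sample data, and splitting on whitespace stands in for whatever tokenizer you actually use.

```python
from collections import Counter
from functools import reduce
from operator import add

# Hypothetical chunks of text, e.g. one per file
documents = ["the cat sat", "the dog sat", "the end"]

# Map: one Counter per chunk
partial_counts = [Counter(doc.split()) for doc in documents]

# Reduce: add the partial counts together
total = reduce(add, partial_counts, Counter())

# Subtraction removes counts and drops anything that falls to zero or below
without_first = total - partial_counts[0]

print(total.most_common(2))
```

For many large chunks, calling total.update(partial) in a loop avoids building a new Counter on every addition.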

Flatten nested lists

The chain function (which lives in itertools, not collections) can be used to flatten a nested list:


 
from itertools import chain
ls = [[kk] + list(range(kk)) for kk in range(5)]
flattened_list = list(chain(*ls))
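
A runnable sketch of the flattening step uses itertools.chain.from_iterable, which takes the outer list directly instead of unpacking it into arguments. Note that this flattens exactly one level of nesting; the name flat is illustrative.

```python
from itertools import chain

ls = [[kk] + list(range(kk)) for kk in range(5)]
# ls is [[0], [1, 0], [2, 0, 1], [3, 0, 1, 2], [4, 0, 1, 2, 3]]

# from_iterable consumes the outer list lazily, one inner list at a time
flat = list(chain.from_iterable(ls))

print(flat)
```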

Open two files at the same time

If you're working on one file (say, line by line) and want to write those lines to another file, you might be tempted to write something like this:


 
with open(input_file_path) as inputfile:
  with open(output_file_path, 'w') as outputfile:
    for line in inputfile:
      outputfile.write(process(line))

Instead, you can open multiple files in a single with statement, as follows:


 
with open(input_file_path) as inputfile, open(output_file_path, 'w') as outputfile:
  for line in inputfile:
    outputfile.write(process(line))

This is much more concise!
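
To try the single with statement end to end, here is a self-contained sketch that works in a temporary directory. The process function and the file names are placeholders invented for this example.

```python
import os
import tempfile

def process(line):
    # Hypothetical transformation: upper-case each line
    return line.upper()

with tempfile.TemporaryDirectory() as tmp:
    input_file_path = os.path.join(tmp, 'in.txt')
    output_file_path = os.path.join(tmp, 'out.txt')

    # Create some sample input
    with open(input_file_path, 'w') as f:
        f.write('first\nsecond\n')

    # Both files are opened together and closed together, even on errors
    with open(input_file_path) as inputfile, open(output_file_path, 'w') as outputfile:
        for line in inputfile:
            outputfile.write(process(line))

    with open(output_file_path) as f:
        result = f.read()

print(result)
```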

Find Monday in a pile of data

If you have dates that you want to normalize (for example, to the previous or the next Monday), you might do something like this:


 
import datetime
# some_date is assumed to be a datetime.date or datetime.datetime
previous_monday = some_date - datetime.timedelta(days=some_date.weekday())
# Similarly, you could map to the next Monday as well
next_monday = some_date + datetime.timedelta(days=-some_date.weekday(), weeks=1)

That's how it's done.
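
As a worked example with concrete dates (chosen here only for illustration), note what happens when the date is already a Monday: weekday() is 0, so the "previous" Monday is the date itself and the next Monday is seven days later.

```python
import datetime

def previous_monday(d):
    # weekday() is 0 for Monday through 6 for Sunday
    return d - datetime.timedelta(days=d.weekday())

def next_monday(d):
    return d + datetime.timedelta(days=-d.weekday(), weeks=1)

thursday = datetime.date(2020, 4, 2)
monday = datetime.date(2020, 4, 6)

print(previous_monday(thursday), next_monday(thursday))
print(previous_monday(monday), next_monday(monday))
```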

Handle HTML

If you're crawling a site, whether for fun or for profit, you'll run into HTML tags all the time. To parse the various HTML tags, you can use html.parser:


from html.parser import HTMLParser
 
class HTMLStrip(HTMLParser):
 
  def __init__(self):
    # Let HTMLParser set up its own state (it calls reset() itself)
    super().__init__()
    self.ls = []
 
  def handle_data(self, d):
    self.ls.append(d)
 
  def get_data(self):
    return ''.join(self.ls)
 
  @staticmethod
  def strip(snippet):
    html_strip = HTMLStrip()
    html_strip.feed(snippet)
    clean_text = html_strip.get_data()
    return clean_text
 
snippet = HTMLStrip.strip(html_snippet)

If you just want to escape HTML:


import html

escaped_snippet = html.escape(html_snippet)

# Back to the HTML snippet (html.unescape is new in Python 3.4)
html_snippet = html.unescape(escaped_snippet)
# and so forth ...
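
A quick round trip shows what escaping does. The sample markup is made up for illustration. By default html.escape also escapes quotes, which makes the result safe to place inside an HTML attribute value.

```python
import html

# Hypothetical snippet containing tags, quotes, and an ampersand
html_snippet = '<p class="note">Fish & chips</p>'

escaped_snippet = html.escape(html_snippet)
restored = html.unescape(escaped_snippet)

print(escaped_snippet)
print(restored == html_snippet)
```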