This article was written by Armin Ronacher (@mitsuhiko) on Sunday, August 24, 2014, in the wake of a discussion that took place in the Python community when the proposal to add type annotations to Python was made. This is a translation of that article.
If you are interested in the type annotations being considered for introduction in Python 3.5, this article should be a useful reference.
I am not deeply familiar with type systems or with the other languages discussed here, so there may be untranslated parts, misunderstandings, and mistranslations. If you find such an error, an edit request would be appreciated.
This is part 2 of "The Python I Would Like to See". Based on recent discussions, we'll explore Python's type system a bit. Part of this article refers to the previous article about slots. Like its predecessor, this article offers insights for future designers of the Python language and dives into the world of the CPython interpreter.
For Python programmers, types are a bit of a strange topic. Types do exist and behave differently from one another, but most of the time you only notice a type's existence when it does not behave as intended and an exception is raised or execution fails.
Python has long been proud of how it handles typing. I remember reading the language FAQ many years ago, which said that duck typing was awesome. Duck typing is an excellent solution in practical terms because it is forgiving: it basically doesn't fight the type system and doesn't limit what you want to do, which makes it possible to implement good APIs. The things you do most often are especially easy in Python.
Most APIs I designed for Python would not work in other languages. Even something very simple, like click's generic interface, would not work elsewhere. The main reason is that you would constantly be fighting the type system.
There was a recent debate about adding static typing to Python, and I fear that the train has already left the station, never to come back. Here are my thoughts on why I hope Python does not adopt explicit typing.
A type system is the set of rules for how types interact. There is even an entire field of computer science dedicated to types, and the subject itself is very impressive. Even if you are not particularly interested in theoretical computer science, type systems are hard to ignore.
I don't want to dive too deeply into type systems, for two reasons. The first is that I barely understand them. The second is that deep understanding is not that important for "grasping" the practical consequences of a type system. What matters to me is how types behave, because that affects how APIs are designed. So think of this article as a loose introduction to building better APIs, rather than a proper introduction to type theory.
Type systems have many characteristics, but the most important distinguishing feature of a type is the amount of information it provides when you try to reason about it.
Let's use Python as an example. Python has types. When you ask for the type of the number 42, Python replies that it is an integer. That carries many implications and lets the interpreter define rules for how integers interact with other integers.
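As a quick sketch of what that means in practice:

```python
# Asking Python for the type of a value:
print(type(42))        # <class 'int'> on Python 3

# Knowing both operands are integers lets the interpreter
# apply the rules for integer arithmetic:
print(42 + 1)          # 43
```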
But there is one thing Python does not have: composite types. All Python types are primitive, which basically means they work with one value at a time. The opposite of a primitive type is a composite type, and you do occasionally see composite-like constructs in Python in different contexts.
The simplest composite type that most programming languages have is the struct. Python does not have structs directly, but there are many situations where you end up defining struct-like classes on an ad-hoc basis. For example, Django and SQLAlchemy ORM models are essentially structs. Each database column is represented through a Python descriptor, which corresponds to a field of the struct. So when you declare a primary key as `id = IntegerField()`, you are defining your model as a composite type.
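A minimal sketch of that idea, using a hypothetical `IntegerField` descriptor rather than a real ORM field class:

```python
# A hypothetical descriptor standing in for an ORM field.
class IntegerField:
    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name, 0)

    def __set__(self, obj, value):
        # Coerce anything assigned to the field to an integer.
        obj.__dict__[self.name] = int(value)

# The class acts as a composite type: a named bundle of typed fields.
class User:
    id = IntegerField()
    age = IntegerField()

u = User()
u.id = "42"          # coerced by the descriptor
print(u.id)          # 42
```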
Composite types are not limited to structs. When you want to hold more than one integer, for example, you use a collection such as an array. Python provides lists, and the individual elements of a list can be of any type. This is in contrast to a list defined over a specific type (such as a list of integers).
A "list of integers" is simply not something Python can express as a type. You might argue that you can discover the type by iterating over the list, but that breaks down for an empty list: in Python, a list with no elements tells you nothing about its element type.
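A minimal sketch of the problem, using an illustrative helper:

```python
def guess_element_type(items):
    """Try to infer a homogeneous element type by inspection."""
    types = {type(item) for item in items}
    if len(types) == 1:
        return types.pop()
    return None  # empty or heterogeneous: nothing to report

print(guess_element_type([1, 2, 3]))   # <class 'int'>
print(guess_element_type([]))          # None: the type is unknowable
```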
The exact same problem exists in Python with null references (`None`). If you pass a user object to a function and that user object may be `None`, you have no guarantee that the argument is actually a user object.
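A small sketch with a hypothetical `User` class:

```python
class User:
    def __init__(self, username):
        self.username = username

def greet(user):
    # Nothing in the signature says `user` may be None; the check is
    # purely a matter of discipline, and forgetting it raises
    # AttributeError at runtime.
    if user is None:
        return "hello, anonymous"
    return "hello, " + user.username

print(greet(User("armin")))   # hello, armin
print(greet(None))            # hello, anonymous
```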
So is there a solution? Yes: explicit typing combined with the absence of null references. Haskell is, of course, the language everyone knows for this, but there are others that feel less alien. Rust, for example, is a language that looks largely C++-like but brings a very powerful type system to the table.
So how do you say "no user" if there are no null references? The answer in Rust, for example, is option types. `Option<User>` means either `Some(user)` or `None`. The former is a tagged enum variant that wraps a value (a specific user). Because the variable now either holds a value or explicitly doesn't, all code dealing with it will not compile unless it explicitly handles the `None` case.
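The idea can be approximated in Python with a small tagged wrapper; this is only a sketch, since Python cannot enforce the handling at compile time:

```python
class Option:
    """A tiny tagged wrapper approximating Rust's Option type."""
    def __init__(self, value=None, present=False):
        self._value = value
        self._present = present

    @classmethod
    def some(cls, value):
        return cls(value, present=True)

    @classmethod
    def none(cls):
        return cls()

    def unwrap_or(self, default):
        # The caller must state what happens in the "no value" case.
        return self._value if self._present else default

print(Option.some("armin").unwrap_or("anonymous"))  # armin
print(Option.none().unwrap_or("anonymous"))         # anonymous
```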
In the past, the world was neatly divided into dynamically typed interpreted languages and statically typed, ahead-of-time compiled languages. This is changing with the emergence of new trends.
The first sign that we were heading into unexplored territory was C#. It is a statically typed compiled language that initially resembled Java very closely. As the language improved, many type-system-related features were added. Most important was the introduction of generics, which brought strong typing to collections such as lists and dictionaries beyond what the compiler initially provided. C# then also went in the opposite direction from static typing, letting you opt out of static typing on a per-variable basis and make variables dynamically typed. This is ridiculously convenient, especially when working with data from web services (JSON, XML, etc.): with dynamic typing you can attempt an operation that may not be type-safe, catch the type error caused by bad input data, and report it to the user.
Today's C # type systems are very powerful in terms of covariance and contravariance specifications. Not only that, but many language-level support for dealing with nullable types has also been developed. For example, the null coalescing operator (??
) was introduced to provide default values for objects represented as null. C # has come too late to remove null
from the language, but it does allow you to control the harm caused by null
.
At the same time, languages that are traditionally statically typed and ahead-of-time compiled are also exploring new areas. C++ has always been statically typed, but it now embraces type inference at many levels. The days of `MyType<X, Y>::const_iterator iter` are gone; nowadays, in most situations, you can simply write `auto` and the compiler fills in the type for you.
Rust, likewise, supports type inference good enough to write statically typed programs largely free of explicit type annotations:
Rust
use std::collections::HashMap;

fn main() {
    let mut m = HashMap::new();
    m.insert("foo", vec!["some", "tags", "here"]);
    m.insert("bar", vec!["more", "here"]);
    for (key, values) in m.iter() {
        // `connect` in the original 2014 code was later renamed `join`.
        println!("{} = {}", key, values.join("; "));
    }
}
I believe we are heading toward a future of powerful type systems. I don't think this means the end of dynamic typing, but there does seem to be a clear trend toward strong static typing combined with local type inference.
So, not long ago, someone apparently convinced others at a conference that static typing would be a great language feature. I don't know exactly how the argument went, but the end result was the declaration that the combination of the mypy type module and Python 3's type annotation syntax would become the standard for typing in Python.
If you haven't seen the proposal yet, it looks something like this:
Python3
from typing import List

def print_all_usernames(users: List[User]) -> None:
    for user in users:
        print(user.username)
To be honest, I don't think this is a good decision, for many reasons. The main one is that Python simply does not have a good type system to put annotations on. The semantics of the language differ depending on how you look at it.
For static typing to make sense, the type system must be good: given two types, the system must let you know how they relate to each other. Python's does not.
If you read my earlier article about the slot system, you'll remember that Python types have different semantics depending on whether they are implemented in C or in Python. This is a fairly unique feature of the language and is usually not found elsewhere. It's true that many languages have types implemented at the interpreter level for bootstrapping purposes, but those are usually fundamental types that are treated specially.
Python doesn't really have "fundamental" types in that sense, yet it has quite a few types implemented on the C side. These are not at all limited to primitives or fundamental types; they appear everywhere, with no apparent logic. For example, `collections.OrderedDict` is a type implemented in Python, while `collections.defaultdict` in the same module is implemented in C.
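One way to observe the C/Python divide from within the language is the heap-type flag: classes defined in Python are heap types, while most C-implemented types are statically allocated. This is a sketch; the flag value is a CPython implementation detail:

```python
Py_TPFLAGS_HEAPTYPE = 1 << 9  # CPython implementation detail

class PurePython:
    pass

# Classes defined in Python source are heap types...
print(bool(PurePython.__flags__ & Py_TPFLAGS_HEAPTYPE))  # True

# ...while statically allocated C types like dict are not.
print(bool(dict.__flags__ & Py_TPFLAGS_HEAPTYPE))        # False
```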
This actually causes quite a few problems for PyPy, which has to mimic the original types as closely as possible so that these differences don't show through in the API. It is very important to understand what this murky distinction between C-level interpreter code and the rest of the language means.
As an example, let's look at the `re` module up to Python 2.7. (This behavior has since been changed in the `re` module, but the general problem of the interpreter working differently from the language remains.) The `re` module provides a function (`compile`) to compile a regular expression into a regular expression pattern. It takes a string and returns a pattern object. It looks like this:
Python2.7
>>> re.compile('foobar')
<_sre.SRE_Pattern object at 0x1089926b8>
As you can see, this pattern object comes from the `_sre` module, which is somewhat internal but apparently available:
Python2.7
>>> type(re.compile('foobar'))
<type '_sre.SRE_Pattern'>
Unfortunately, that's a bit of a lie: the `_sre` module does not actually contain that type.
Python2.7
>>> import _sre
>>> _sre.SRE_Pattern
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'SRE_Pattern'
Fair enough. It wouldn't be the first time a type lied about its location, and it's an internal type anyway, so let's move on. We now know that the type of the pattern object is `_sre.SRE_Pattern`, which is a subclass of `object`:
Python2.7
>>> isinstance(re.compile(''), object)
True
As we know, all objects implement some common methods. For example, every object implements `__repr__`:
Python2.7
>>> re.compile('').__repr__()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: __repr__
Huh. What happened here? Well, the answer is pretty bizarre. Internally, for reasons I don't know, the SRE pattern object had a custom `tp_getattr` slot up to Python 2.7. This slot hosted a custom attribute lookup which provided access to custom methods and attributes. If you actually inspect the object with `dir()`, you'll notice that a lot is missing:
Python2.7
>>> dir(re.compile(''))
['__copy__', '__deepcopy__', 'findall', 'finditer', 'match',
'scanner', 'search', 'split', 'sub', 'subn']
What's more, how this type actually works leads you on a truly bizarre adventure. Here is what's going on:
The type claims that it is a subclass of `object`. This is true in the world of the CPython interpreter, but not in the Python language. That the two are not the same is unfortunate, but quite common. The type does not correspond to the interface of `object` on the Python layer: calls that go through the interpreter work, while calls that go through the Python language fail. Thus `type(x)` succeeds, while `x.__class__` fails.
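A related mismatch can be reproduced in pure Python: `type()` and `__class__` are separate channels, and they can be made to disagree. This is a sketch; the class is purely illustrative:

```python
class Liar:
    # isinstance() consults __class__, so overriding it changes
    # what the object claims to be...
    @property
    def __class__(self):
        return int

x = Liar()
print(isinstance(x, int))   # True: __class__ says int
print(type(x) is Liar)      # True: the interpreter knows better
```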
The pattern object example shows that Python can have subclasses that do not match the behavior of their base class, which is especially problematic once you talk about static typing. In Python 3, for example, you cannot implement the interface of the `dict` type unless you write your type in C, because the type guarantees certain behavior of its view objects that cannot easily be provided from Python.
So when you statically annotate a function as receiving a dictionary of string keys and integer values, it is not at all clear whether it accepts only a dict, a dict-like object, or a subclass of dict.
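For example, a perfectly dict-like object built on the standard `collections.abc` machinery is not a `dict` as far as a type check is concerned (a sketch):

```python
from collections.abc import Mapping

class ConstMapping(Mapping):
    """A read-only mapping that behaves like a dict of str -> int."""
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

m = ConstMapping({"a": 1})
print(m["a"])                  # 1: quacks like a dict
print(isinstance(m, dict))     # False: but is not a dict
```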
The strange behavior of the pattern objects was changed in Python 2.7, but the underlying problem remains: like the dict behavior mentioned above, the language behaves differently depending on how the code is written, and it is impossible to fully grasp the rigorous semantics of the type system.
Another very strange interpreter-internal case is type comparison in Python 2. This particular case no longer exists in Python 3 because the interfaces changed, but the underlying problem can still be found at various levels.
Let's take comparing set types as an example. Python's sets are useful, but their comparison behavior is pretty weird. In Python 2 there is a function called `cmp()` which, given two values, returns a number indicating which is larger: a value less than 0 means the first argument is smaller, 0 means they are equal, and a positive number means the first argument is larger.
Here's what happens when you compare sets:
Python2.7
>>> cmp(set(), set())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: cannot compare sets using cmp()
Why? I honestly don't know for sure. Probably because sets implement their comparison operators as subset checks, and that does not work with `cmp()`. Empty frozensets, however, compare just fine:
Python2.7
>>> cmp(frozenset(), frozenset())
0
It fails, however, if either of the frozensets is non-empty. Why? The answer is that this is an optimization in the CPython interpreter, not a language feature: frozensets intern common values. The empty frozenset is always the same value (it is immutable and nothing can be added to it), so any two empty frozensets are the same object, and `cmp` generally returns `0` when two objects have the same pointer address. Due to the complex comparison logic in Python 2, I can't say immediately why this happens here, but there are multiple code paths in the comparison routines that could produce this result.
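The subset semantics mentioned above also mean that sets have no total order at all, which you can see directly (a minimal sketch, still valid in Python 3):

```python
a = {1, 2}
b = {2, 3}

# Comparison operators on sets are subset checks:
print({1} < a)                # True: {1} is a proper subset of {1, 2}

# a and b are neither subsets of each other nor equal,
# so no ordering between them holds:
print(a < b, a > b, a == b)   # False False False
```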
The point is not that this is a bug; it's that Python has no proper semantics for how types actually interact. For a very long time, the answer to how the type system behaves has been "whatever CPython does".
You will find countless changesets in PyPy that try to reconstruct CPython's behavior. Given that PyPy is written in Python, this is a very interesting problem for the language: had the Python language fully defined the Python-visible parts of its behavior, PyPy would have far fewer problems.
Now suppose we had a hypothetical Python that fixes all of the issues mentioned above. Even then, static typing would not fit Python well. The main reason is that at the Python language level, types have traditionally meant very little for how objects interact.
For example, datetime objects can generally be compared to other objects, but when compared to other datetime objects, the timezone settings must be compatible. Similarly, the result of many operations is unclear until you inspect the objects at hand: concatenating two strings in Python 2 produces either a unicode or a bytestring object, and the codec system's encoding and decoding APIs return essentially arbitrary objects.
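The codec point survives into Python 3, where `codecs.encode` and `codecs.decode` are not restricted to any one conversion direction (a sketch):

```python
import codecs

# bytes -> bytes: base64 "encoding" stays within bytes.
encoded = codecs.encode(b"hi", "base64")
print(encoded)                            # b'aGk=\n'

# str -> str: rot13 maps text to text.
print(codecs.decode("uryyb", "rot13"))    # hello
```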
Python as a language is too dynamic for annotations to work well. Just consider how important generators are to the language, and yet a generator may yield values of different types on each iteration.
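A trivially legal generator that no element-type annotation describes well (a sketch):

```python
def weird():
    # Each yield may produce a completely different type.
    yield 1
    yield "two"
    yield [3.0]

print([type(v).__name__ for v in weird()])   # ['int', 'str', 'list']
```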
Type annotations could be useful in places, but they risk harming API design. And unless the annotations are stripped at runtime, they will at the very least make things slower. They will never enable an efficient statically compiled implementation without turning Python into something that is no longer Python.
Personally, what I take from Python is that the language is ridiculously complex. Python suffers from complex interactions between its types without any language specification tying them together, and it seems this will never be sorted out. There are so many obscure corners and bits of weird behavior that any attempt at a language specification would end up being nothing more than a transcript of the CPython interpreter.
I think it makes little sense to put type annotations on this foundation.
If someone develops another dynamically typed language in the future, much more effort should go into clearly defining how types work. JavaScript does fairly well in this regard: all of its strange but built-in semantics are clearly defined, and I think that is generally a good thing. Once you have a clear definition of how the semantics work, you leave room for later optimization or optional static typing.
Keeping a language lean and well-defined is well worth the effort. Future language designers should not repeat the mistakes of PHP, Python, and Ruby, where the behavior of the language ends up being "whatever the interpreter does".
I doubt Python will change much at this point: the time and effort needed to clean up the language and interpreter would outweigh the benefit.
© Copyright 2014 by Armin Ronacher. Content licensed under the Creative Commons attribution-noncommercial-sharealike License.