Here's a setup to help you analyze your data in Python.
If you are interested in data analysis, please also see this article: "If you are interested in data scientists, take a look here first: a summary of literature and videos (added as needed)" (Qiita)
Jupyter http://jupyter.org/ An environment for interactive code execution. It is very well suited to data analysis; once you get used to it, you will not want to do analysis in any other IDE.
In addition to running arbitrarily divided code blocks one at a time and displaying each result as you go, it supports:
・ Inline display of graphs
・ Mathematical formulas (LaTeX)
・ Text written in Markdown
This makes it well suited to exploratory analysis and to sharing and archiving results. It is also widely used in the scientific community, because text and charts can be combined in an IPython notebook to produce a paper-like document.
There is also a product called JupyterHub for use by multiple people. https://github.com/jupyter/jupyterhub
Google Cloud Datalab https://cloud.google.com/datalab/?hl=ja A Jupyter-based front end for exploring data on Google Cloud. Reference: BigQuery integration for Python users (Qiita)
beaker notebook http://beakernotebook.com/
Apache Zeppelin https://zeppelin.incubator.apache.org/
NumPy http://www.numpy.org/ A library that provides array objects better suited than Python's built-in list to array-to-array operations and multidimensional arrays (matrix calculations). The pandas DataFrame objects introduced below are built from collections of NumPy arrays.
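As a minimal sketch of the difference from the built-in list (the sample values and variable names are made up for illustration), element-wise arithmetic and a matrix product look like this:

```python
import numpy as np

# Illustrative data only; any numeric values would do.
prices = [100, 250, 400]
quantities = [3, 1, 2]

# With built-in lists, element-wise multiplication needs an explicit loop.
totals_list = [p * q for p, q in zip(prices, quantities)]

# With NumPy arrays, the same operation is a single vectorized expression.
totals_array = np.array(prices) * np.array(quantities)

# Multidimensional arrays support matrix calculations directly.
matrix = np.array([[1, 2], [3, 4]])
identity = np.eye(2, dtype=int)
product = matrix @ identity  # matrix product with the identity matrix

print(totals_list)
print(totals_array.tolist())
print(product.tolist())
```

The loop and the vectorized version give the same totals; the array form is usually shorter and faster once the data gets large.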
You can learn more about using NumPy and pandas from this book:
Introduction to Data Analysis with Python: Data Processing Using NumPy and pandas http://www.oreilly.co.jp/books/9784873116556/
pandas http://pandas.pydata.org/ A library for handling data in an RDB-like tabular form (the data frame) in Python. It has become the standard for data analysis, and libraries such as scikit-learn and Matplotlib work smoothly with pandas objects.
Commentary article
A rudimentary summary of data manipulation in Python Pandas http://qiita.com/hik0107/items/d991cc44c2d1778bb82e
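To give a feel for the RDB-like handling, here is a small sketch that filters and aggregates a data frame. The column names and values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sales records, made up for illustration.
df = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west"],
        "product": ["a", "a", "b", "b"],
        "sales": [100, 150, 200, 50],
    }
)

# Filter rows, much like a SQL WHERE clause.
large_sales = df[df["sales"] >= 100]

# Aggregate by key, much like SQL GROUP BY.
sales_by_region = df.groupby("region")["sales"].sum()

print(large_sales)
print(sales_by_region)
```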
SciPy http://docs.scipy.org/doc/scipy/reference/ A library for scientific and technical computing. It includes a wide range of tools, such as special functions, optimization, and statistical routines (quite a lot of them).
Example of scipy.optimize for function approximation (qiita article)
Non-linear function modeling in Python http://qiita.com/hik0107/items/9bdc236600635a0e61e8
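The following is only a rough sketch of fitting a non-linear function with scipy.optimize, not the code from the linked article. The exponential model, the synthetic data, and the starting guess are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import curve_fit


def model(x, a, b):
    """Hypothetical model for illustration: the curve a * exp(b * x)."""
    return a * np.exp(b * x)


# Noise-free synthetic data generated from known parameters (a=2.0, b=0.5).
x_data = np.linspace(0.0, 4.0, 20)
y_data = model(x_data, 2.0, 0.5)

# curve_fit returns the fitted parameters and their covariance matrix.
# p0 is the starting guess for (a, b); it is an arbitrary choice here.
params, covariance = curve_fit(model, x_data, y_data, p0=(1.0, 0.3))
a_fit, b_fit = params

print(a_fit, b_fit)
```

Because the data contains no noise, the fitted values should land very close to the parameters used to generate it; with real data you would also inspect the covariance matrix.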
csv http://docs.python.jp/2/library/csv.html#module-csv A convenient standard-library module for reading, processing, and writing CSV data. It provides reader and writer objects for CSV files.
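A minimal sketch of the reader and writer in use, written for Python 3. It uses an in-memory string in place of a real file, and the column names and values are made up.

```python
import csv
import io

# An in-memory stand-in for a CSV file (made-up sample data).
source = io.StringIO("name,score\nalice,82\nbob,67\n")

# DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(source))

# Process: keep rows with a score of 70 or higher.
passed = [row for row in rows if int(row["score"]) >= 70]

# Write the result back out with DictWriter.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["name", "score"], lineterminator="\n")
writer.writeheader()
writer.writerows(passed)

print(output.getvalue())
```

With a real file you would pass the object returned by open(path, newline="") instead of the StringIO objects.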
There are libraries for connecting to various databases, such as MySQL, PostgreSQL, BigQuery, and SQLite.
MySQL : MySQL-Connector-Python https://pypi.python.org/pypi/mysql-connector-python/
PostgreSQL : Psycopg2 http://initd.org/psycopg/download/
BigQuery : BigQuery-Python https://github.com/tylertreat/BigQuery-Python
Alternatively, see this article on how to do it with pandas: http://qiita.com/hik0107/items/3944ccea04371331c3b4
SQLite: sqlite3 (no installation required, because it is part of the standard library) http://docs.python.jp/2/library/sqlite3.html
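Since sqlite3 ships with Python, it is the easiest of these connectors to try out. The sketch below uses an in-memory database; the table name, columns, and rows are hypothetical.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id TEXT, pages INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO visits (user_id, pages) VALUES (?, ?)",
    [("u1", 3), ("u2", 5), ("u1", 2)],
)
conn.commit()

# Aggregate with ordinary SQL and fetch the result as a list of tuples.
totals = conn.execute(
    "SELECT user_id, SUM(pages) FROM visits GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()

print(totals)
```

The other connectors listed above broadly follow the same connect, execute, fetch pattern, though their connection arguments differ.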
pivottablejs https://pypi.python.org/pypi/pivottablejs A library that accepts pandas objects and lets you work with them like an Excel PivotTable. Useful when you want to do simple tabulations or check your data.
collections http://docs.python.jp/2/library/collections.html A standard-library module containing tools such as "Counter", which can be used for Count Distinct-style tallies, and "namedtuple", which lets you define lightweight record objects, something like a simplified data frame row.
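A short sketch of both tools follows. The event list and the Sale record type are invented for illustration.

```python
from collections import Counter, namedtuple

# Made-up event log for illustration.
events = ["login", "view", "view", "purchase", "view", "login"]

# Counter tallies how many times each distinct value appears.
counts = Counter(events)
distinct_count = len(counts)          # number of distinct values
top_event = counts.most_common(1)[0]  # the most frequent value with its count

# namedtuple defines a lightweight record type with named fields.
Sale = namedtuple("Sale", ["region", "amount"])  # hypothetical record type
sale = Sale(region="east", amount=120)

print(counts)
print(distinct_count, top_event)
print(sale.region, sale.amount)
```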
scikit-learn http://scikit-learn.org/ A machine learning package packed with models for classification and prediction. It is also close to the de facto standard for data analysis in Python.
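As a rough sketch of a typical classification workflow: the bundled iris dataset, logistic regression, and the split settings below are choices made for this example, not something the article prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is scored on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a classifier and measure its accuracy on the held-out data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)

print(accuracy)
```

Most scikit-learn models share this fit, predict, score interface, so swapping in another estimator usually only changes the line that constructs clf.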
matplotlib (+ seaborn) http://matplotlib.org/ http://stanford.edu/~mwaskom/software/seaborn/ matplotlib is effectively the de facto tool for data visualization in Python. seaborn is a wrapper around it that makes it easier to draw beautiful graphs.
Various graph types are available, such as line graphs, bar graphs, histograms, and scatter plots.
Qiita article
Beautiful graph drawing with python -seaborn makes data analysis and visualization easier http://qiita.com/hik0107/items/3dc541158fceb3156ee0
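For reference, here is a minimal matplotlib-only sketch (seaborn is not used) that draws the four graph types mentioned above. The data is made up, and the figure is written to an in-memory buffer so the script runs without a display; in Jupyter the figure would normally appear inline instead.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

# One figure with a 2x2 grid of axes, one graph type per cell.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(x, y)
axes[0, 0].set_title("line graph")
axes[0, 1].bar(x, y)
axes[0, 1].set_title("bar graph")
axes[1, 0].hist(y, bins=3)
axes[1, 0].set_title("histogram")
axes[1, 1].scatter(x, y)
axes[1, 1].set_title("scatter plot")
fig.tight_layout()

# Render to PNG in memory; pass a file path instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```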
The following are all capable graphing tools. If you don't like matplotlib, aren't satisfied with it, or are a former R user, check them out.
・ Bokeh http://bokeh.pydata.org/en/latest/
・ ggplot (Python version of R's ggplot2 library) http://ggplot.yhathq.com/
・ Plotly https://plot.ly/
Cython http://cython.org/ Compiles parts of your Python code to C for faster execution. Useful when the amount of computation is large and speed becomes a bottleneck.
SymPy http://www.sympy.org/en/index.html A library for symbolic mathematics.
datetime http://docs.python.jp/2/library/datetime.html The standard-library module for working with dates and times.
It's time to seriously think about the definition and skill set of data scientists http://qiita.com/hik0107/items/f9bf14a7575d5c885a16