python-mysql-replication (or PyMySQL) would return fewer than 20 bytes
for info-hashes that had null bytes near the end, leaving incomplete
hashes in the ES index. Without delving too deep into the real issue
(be it a misunderstanding of how MySQL stores binary data or a bug in
the libraries), thankfully we can just pad the fixed-size info-hashes
to 20 bytes.
Padding in import_to_es.py may be erring on the side of caution, but
better safe than sorry.
(SQLAlchemy is unaffected by this bug.)
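A minimal sketch of the padding, assuming the truncation only drops
trailing null bytes, so padding on the right restores the original
value. The helper name pad_info_hash is hypothetical and not taken
from import_to_es.py.

```python
INFO_HASH_LEN = 20  # info-hashes are fixed-size, 20 bytes


def pad_info_hash(raw: bytes) -> bytes:
    """Right-pad a possibly truncated info-hash with null bytes.

    Assumption: the driver only drops trailing null bytes, so padding
    on the right restores the original value.
    """
    if len(raw) > INFO_HASH_LEN:
        raise ValueError(f"info-hash longer than {INFO_HASH_LEN} bytes")
    # bytes.ljust leaves values that are already 20 bytes untouched.
    return raw.ljust(INFO_HASH_LEN, b"\x00")
```

Full-length hashes pass through unchanged, so applying this to every
row is harmless.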
Instead of flushing every N seconds, it flushed N seconds after
the last change, which could drag out to N seconds * M batch size
if updates are sparse. In practice this doesn't change anything,
since something is always happening.
Also fix the save point not being written when nothing is happening.
This also does nothing in practice, but it's there for correctness.
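A rough sketch of the intended timing, not the actual code. The class
name PeriodicFlush and the injectable clock are assumptions made for
illustration. The point is that the timer is measured from the last
flush and is never reset by incoming changes, so a flush (and a save
point) comes due every N seconds even when nothing is happening.

```python
import time


class PeriodicFlush:
    """Report when a flush is due, measured from the last flush.

    Hypothetical helper: incoming changes never touch the timer, so a
    steady trickle of updates cannot keep postponing the flush.
    """

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_flush = clock()

    def due(self):
        # True once `interval` seconds have passed since the last flush,
        # whether or not any changes arrived in between.
        return self.clock() - self.last_flush >= self.interval

    def mark_flushed(self):
        self.last_flush = self.clock()
```

The old behavior corresponds to resetting last_flush on every change
instead of only in mark_flushed().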
Also fixed the save time loop and spaced it out
to 10k events instead of 100.
Notably, the number of rows per event caps out at around 5 by default
because of the default --binlog-row-event-max-size=8192 in MySQL; that's
how many (torrent) rows fit into a single event.
We could increase that, but instead I think it's finally time to
multithread this thing; neither the binlog read nor the ES POST should
hold the GIL, so it'll actually work.
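One possible shape for the threaded version, as a sketch only. Here
read_events and post_batch are hypothetical stand-ins for the binlog
stream and the ES POST, and the batch and queue sizes are arbitrary.
A bounded queue.Queue connects a reader thread to a poster thread, so
a slow poster throttles the reader instead of letting memory grow.

```python
import queue
import threading

_STOP = object()  # sentinel telling the consumer to exit


def run_pipeline(read_events, post_batch, batch_size=100, max_queued=1000):
    """Read events in one thread and post them in batches from another.

    read_events: hypothetical iterable of events (the binlog stream)
    post_batch:  hypothetical callable taking a list (the ES POST)
    """
    q = queue.Queue(maxsize=max_queued)  # bounded: applies backpressure

    def producer():
        try:
            for event in read_events:
                q.put(event)
        finally:
            q.put(_STOP)  # always signal the end of the stream

    def consumer():
        batch = []
        while True:
            item = q.get()
            if item is _STOP:
                break
            batch.append(item)
            if len(batch) >= batch_size:
                post_batch(batch)
                batch = []
        if batch:
            post_batch(batch)  # post whatever is left over

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Error handling is left out: if post_batch raises, the consumer thread
dies and the reader will eventually block on the full queue.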
Mainly helps with the stat updates, which come in
a single INSERT VALUES (...) ON DUPLICATE KEY UPDATE event,
which helpfully translates to a bulk index event.
It seems like Elasticsearch should still be buffering that up
internally, so maybe the refresh_interval: 30s change will help
more than this.
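For reference, a sketch of how several rows from one event could be
turned into a single bulk request body. The function name and the
shape of docs are assumptions, not the code in this commit. The _bulk
endpoint takes newline-delimited JSON: one action line followed by one
source line per document, and the body must end with a newline.

```python
import json


def bulk_index_body(index, docs):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint.

    docs: iterable of (doc_id, source_dict) pairs (hypothetical shape).
    """
    lines = []
    for doc_id, source in docs:
        # Action line, then the document itself on the following line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # Every line, including the last, is terminated by a newline.
    return "".join(line + "\n" for line in lines)
```

The result would be sent with Content-Type: application/x-ndjson.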