Introducing precise-lite-trainer

OpenVoiceOS precise trainer

WIP - open during construction, sharing for early feedback

You can now easily train Precise models, all updated to the latest TensorFlow. This repo is the recommended way to train and convert tflite models.

Several training strategies are available; each may give better results for different datasets and wake words. Some sounds are easier to learn than others, and the kind of data available for each word will differ.

  • train - select epochs, batch size and go!
  • train with replacement - use a different subset of the training data in each epoch; helps avoid overfitting
  • train incremental - after every epoch, test the model and move false positives into the training set; helps if you have an unbalanced dataset (a lot of not-ww samples)
  • train incremental with replacement - for an unbalanced dataset plus overfitting to specific voices
  • train optimized - use bbopt to search for the optimal hyperparameters (dropout and recurrent units), train several models and keep the best one
  • train optimized incremental
  • train optimized with replacement
from precise_trainer import PreciseTrainer

model_name = "hey_computer"
folder = f"/home/user/ww_datasets/{model_name}"  # dataset here
model_path = f"/home/user/trained_models/{model_name}"  # save here
log_dir = f"logs/fit/{model_name}"  # for tensorboard

# instantiate the trainer (constructor arguments assumed; check the repo for the exact signature)
trainer = PreciseTrainer(model_path, folder, log_dir=log_dir)

# pick one training method
model_file = trainer.train()
model_file = trainer.train_with_replacement(mini_epochs=10)
model_file = trainer.train_incremental(mini_epochs=20)
model_file = trainer.train_incremental_with_replacement(balanced=True, porportion=0.6)
model_file = trainer.train_optimized(cycles=20)
model_file = trainer.train_optimized_with_replacement(porportion=0.8)
model_file = trainer.train_optimized_incremental(cycles=50)

# convert a previous model
model_file = ".../my_model"
PreciseTrainer.convert(model_file, model_file + ".tflite")

# test a previous model
model_file = ".../my_model.tflite"
PreciseTrainer.test(model_file, folder)
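
For completeness, here is a minimal sketch of running the converted model with the generic TensorFlow Lite interpreter. The input shape and the MFCC feature pipeline are assumptions (not the trainer's actual runtime), so adapt them to your model:

import numpy as np
import tensorflow as tf

# load the converted model and query its tensors
interpreter = tf.lite.Interpreter(model_path="hey_computer.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# dummy features; a real pipeline would compute an MFCC window here (shape assumed)
features = np.zeros(inp["shape"], dtype=np.float32)
interpreter.set_tensor(inp["index"], features)
interpreter.invoke()
print("wake word probability:", interpreter.get_tensor(out["index"]))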


I created this early prototype with precise-trainer; we can use it for several audio classification tasks, not only wake word detection.

Data: <TrainData wake_words=87243 not_wake_words=26209 test_wake_words=62258 test_not_wake_words=17654>

datasets used:

test set (threshold of 0.5)

=== Counts ===
False Positives: 223
True Negatives: 17430
False Negatives: 6847
True Positives: 55411

=== Summary ===
72841 out of 79911
91.15%

1.26% false positives
11.00% false negatives

test set (threshold of 0.3)

=== Counts ===
False Positives: 745
True Negatives: 16908
False Negatives: 3429
True Positives: 58829

=== Summary ===
75737 out of 79911
94.78%

4.22% false positives
5.51% false negatives

train set (threshold of 0.5)

=== Counts ===
False Positives: 164
True Negatives: 26044
False Negatives: 9424
True Positives: 77819

=== Summary ===
103863 out of 113451
91.55%

0.63% false positives
10.80% false negatives

train set (threshold of 0.3)

=== Counts ===
False Positives: 932
True Negatives: 25276
False Negatives: 4467
True Positives: 82776

=== Summary ===
108052 out of 113451
95.24%

3.56% false positives
5.12% false negatives
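
For reference, the percentages in these summaries follow directly from the confusion counts. A quick sketch using the test-set numbers at threshold 0.5:

# reproduce the summary stats from the raw counts above
fp, tn, fn, tp = 223, 17430, 6847, 55411

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 72841 / 79911 -> 91.15%
fp_rate = fp / (fp + tn)                    # -> 1.26% false positives
fn_rate = fn / (fn + tp)                    # -> 11.00% false negatives
print(f"{accuracy:.2%} accuracy, {fp_rate:.2%} FP, {fn_rate:.2%} FN")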

Logs

Some quick logs taken with ovos-dinkum-listener; the threshold can be tuned for your microphone.

A USB PS3 Eye mic was used for testing.

With silence after the wake word (probability remains < 0.1):

2023-04-17 19:27:23.680 - DEBUG - Record begin
2023-04-17 19:27:23.778 - DEBUG - speech probability: 0.012165711194470092
2023-04-17 19:27:23.779 - DEBUG - speech probability: 0.007387953200404283
2023-04-17 19:27:23.779 - DEBUG - speech probability: 0.0074164837438903996
2023-04-17 19:27:23.916 - DEBUG - speech probability: 0.00963889665079164
2023-04-17 19:27:24.075 - DEBUG - speech probability: 0.0015273538706807768
2023-04-17 19:27:24.154 - DEBUG - speech probability: 0.0013048631504227427
2023-04-17 19:27:24.312 - DEBUG - speech probability: 0.004202510508154393
2023-04-17 19:27:24.392 - DEBUG - speech probability: 0.011264940791642497
2023-04-17 19:27:24.551 - DEBUG - speech probability: 0.029089080423976375
2023-04-17 19:27:24.709 - DEBUG - speech probability: 0.0330160392731354
2023-04-17 19:27:24.789 - DEBUG - speech probability: 0.04045075434080284
2023-04-17 19:27:24.940 - DEBUG - speech probability: 0.035880844467858636
2023-04-17 19:27:25.098 - DEBUG - speech probability: 0.0343127893222466
2023-04-17 19:27:25.177 - DEBUG - speech probability: 0.048825275993444835
2023-04-17 19:27:25.336 - DEBUG - speech probability: 0.06496073160893169
2023-04-17 19:27:25.415 - DEBUG - speech probability: 0.06533937089905799
2023-04-17 19:27:25.574 - DEBUG - speech probability: 0.020742919937412153
2023-04-17 19:27:25.733 - DEBUG - speech probability: 0.03015348096774408
2023-04-17 19:27:25.812 - DEBUG - speech probability: 0.03280407266906129
2023-04-17 19:27:25.970 - DEBUG - speech probability: 0.033551127034316175
2023-04-17 19:27:26.050 - DEBUG - speech probability: 0.02258507788816991
2023-04-17 19:27:26.208 - DEBUG - speech probability: 0.021683746631835112
2023-04-17 19:27:26.367 - DEBUG - speech probability: 0.014940036273039907
2023-04-17 19:27:26.441 - DEBUG - speech probability: 0.023203975309233977
2023-04-17 19:27:26.599 - DEBUG - speech probability: 0.010978107529107326
2023-04-17 19:27:26.758 - DEBUG - speech probability: 0.01085717516984829
2023-04-17 19:27:26.838 - DEBUG - speech probability: 0.00673192570027706
2023-04-17 19:27:26.996 - DEBUG - speech probability: 0.009145248970350673
2023-04-17 19:27:27.075 - DEBUG - speech probability: 0.014728564085050872
2023-04-17 19:27:27.234 - DEBUG - speech probability: 0.018967839075865807
2023-04-17 19:27:27.393 - DEBUG - speech probability: 0.021906081942161357
2023-04-17 19:27:27.472 - DEBUG - speech probability: 0.020956804970967465
2023-04-17 19:27:27.631 - DEBUG - speech probability: 0.0283355462028866
2023-04-17 19:27:27.710 - DEBUG - speech probability: 0.03398455944396449
2023-04-17 19:27:27.868 - DEBUG - speech probability: 0.04453903342935014
2023-04-17 19:27:28.021 - DEBUG - speech probability: 0.04238703692519086
2023-04-17 19:27:28.099 - DEBUG - speech probability: 0.04019838764322674
2023-04-17 19:27:28.259 - DEBUG - speech probability: 0.02769013747752503
2023-04-17 19:27:28.419 - DEBUG - speech probability: 0.01499332085241939
2023-04-17 19:27:28.497 - DEBUG - speech probability: 0.006008901477572134
2023-04-17 19:27:28.655 - DEBUG - speech probability: 0.0045410861370690616
2023-04-17 19:27:28.735 - DEBUG - speech probability: 0.0027176982604409645
2023-04-17 19:27:28.894 - DEBUG - speech probability: 0.003474307268425031
2023-04-17 19:27:29.052 - DEBUG - speech probability: 0.0036373366435647186
2023-04-17 19:27:29.132 - DEBUG - speech probability: 0.006373721983397192
2023-04-17 19:27:29.291 - DEBUG - speech probability: 0.009820855380996639
2023-04-17 19:27:29.443 - DEBUG - speech probability: 0.026175543470491943
2023-04-17 19:27:29.522 - DEBUG - speech probability: 0.0203906866834103
2023-04-17 19:27:29.681 - DEBUG - speech probability: 0.012525286747740462
2023-04-17 19:27:29.760 - DEBUG - speech probability: 0.008839902469956445
2023-04-17 19:27:29.919 - DEBUG - speech probability: 0.005125477038747509
2023-04-17 19:27:30.078 - DEBUG - speech probability: 0.03135289363471317
2023-04-17 19:27:30.157 - DEBUG - speech probability: 0.0501774431087121
2023-04-17 19:27:30.316 - DEBUG - speech probability: 0.04399249493571764
2023-04-17 19:27:30.474 - DEBUG - speech probability: 0.037273510113372876
2023-04-17 19:27:30.554 - DEBUG - speech probability: 0.02796516340765478
2023-04-17 19:27:30.713 - DEBUG - speech probability: 0.014262378958243262
2023-04-17 19:27:30.793 - DEBUG - speech probability: 0.006917675543123764
2023-04-17 19:27:30.944 - DEBUG - speech probability: 0.007190964077191803
2023-04-17 19:27:31.103 - DEBUG - speech probability: 0.0031543336871801423
2023-04-17 19:27:31.182 - DEBUG - speech probability: 0.004690789790075773
2023-04-17 19:27:31.341 - DEBUG - speech probability: 0.0030493663030975084
2023-04-17 19:27:31.500 - DEBUG - speech probability: 0.007445112471145236
2023-04-17 19:27:31.579 - DEBUG - speech probability: 0.010005859629393689
2023-04-17 19:27:31.738 - DEBUG - speech probability: 0.00699867247156005
2023-04-17 19:27:31.817 - DEBUG - speech probability: 0.010897353990519618
2023-04-17 19:27:31.975 - DEBUG - speech probability: 0.007736864654712764
2023-04-17 19:27:32.133 - DEBUG - speech probability: 0.008414858717085034
2023-04-17 19:27:32.212 - DEBUG - speech probability: 0.008974473819920836
2023-04-17 19:27:32.371 - DEBUG - speech probability: 0.006449014494575834
2023-04-17 19:27:32.523 - DEBUG - speech probability: 0.0062499767732257265
2023-04-17 19:27:32.602 - DEBUG - speech probability: 0.006837515874981506
2023-04-17 19:27:32.760 - DEBUG - speech probability: 0.0073311854962891726
2023-04-17 19:27:32.840 - DEBUG - speech probability: 0.009283943655403859
2023-04-17 19:27:32.998 - DEBUG - speech probability: 0.010737426081278957
2023-04-17 19:27:33.157 - DEBUG - speech probability: 0.012570888528013672
2023-04-17 19:27:33.236 - DEBUG - speech probability: 0.02473344466122114
2023-04-17 19:27:33.394 - DEBUG - speech probability: 0.024982791734738385
2023-04-17 19:27:33.631 - DEBUG - transformers metadata: {'client_name': 'ovos_dinkum_listener', 'source': 'audio', 'destination': ['skills']}

With speech after the wake word (during speech, probability stays between 0.5 and 0.99):

2023-04-17 19:27:36.716 - DEBUG - Record begin
2023-04-17 19:27:36.873 - DEBUG - speech probability: 0.0074738396751538125
2023-04-17 19:27:36.874 - DEBUG - speech probability: 0.005463857926285648
2023-04-17 19:27:36.874 - DEBUG - speech probability: 0.0030883582006644853
2023-04-17 19:27:36.945 - DEBUG - speech probability: 0.0037136731908849635
2023-04-17 19:27:37.104 - DEBUG - speech probability: 0.004728898006598107
2023-04-17 19:27:37.184 - DEBUG - speech probability: 0.0030493663030975084
2023-04-17 19:27:37.342 - DEBUG - speech probability: 0.0031410389137608414
2023-04-17 19:27:37.502 - DEBUG - speech probability: 0.025150235592824657
2023-04-17 19:27:37.580 - DEBUG - speech probability: 0.023047880477860005
2023-04-17 19:27:37.739 - DEBUG - speech probability: 0.03186467857525982
2023-04-17 19:27:37.897 - DEBUG - speech probability: 0.02548805399370427
2023-04-17 19:27:37.976 - DEBUG - speech probability: 0.01970189245134693
2023-04-17 19:27:38.134 - DEBUG - speech probability: 0.016381035309930944
2023-04-17 19:27:38.213 - DEBUG - speech probability: 0.019033565640216746
2023-04-17 19:27:38.372 - DEBUG - speech probability: 0.014313525991548993
2023-04-17 19:27:38.526 - DEBUG - speech probability: 0.01943213422131634
2023-04-17 19:27:38.605 - DEBUG - speech probability: 0.013177221757724856
2023-04-17 19:27:38.763 - DEBUG - speech probability: 0.01225473752907891
2023-04-17 19:27:38.922 - DEBUG - speech probability: 0.026262602221521738
2023-04-17 19:27:39.001 - DEBUG - speech probability: 0.027417460114405132
2023-04-17 19:27:39.160 - DEBUG - speech probability: 0.05125093034399201
2023-04-17 19:27:39.239 - DEBUG - speech probability: 0.0472147014568574
2023-04-17 19:27:39.398 - DEBUG - speech probability: 0.11362504696008084
2023-04-17 19:27:39.557 - DEBUG - speech probability: 0.5593268408157918
2023-04-17 19:27:39.636 - DEBUG - speech probability: 0.7194453226817796
2023-04-17 19:27:39.795 - DEBUG - speech probability: 0.8158054922282034
2023-04-17 19:27:39.874 - DEBUG - speech probability: 0.8774687703063826
2023-04-17 19:27:40.027 - DEBUG - speech probability: 0.9534068029700106
2023-04-17 19:27:40.186 - DEBUG - speech probability: 0.5817289460986372
2023-04-17 19:27:40.264 - DEBUG - speech probability: 0.799326634796633
2023-04-17 19:27:40.424 - DEBUG - speech probability: 0.9263158586794051
2023-04-17 19:27:40.582 - DEBUG - speech probability: 0.9649571924637619
2023-04-17 19:27:40.661 - DEBUG - speech probability: 0.9905739379712881
2023-04-17 19:27:40.820 - DEBUG - speech probability: 0.9645149239189981
2023-04-17 19:27:40.899 - DEBUG - speech probability: 0.8768197419467589
2023-04-17 19:27:41.057 - DEBUG - speech probability: 0.7557448514449752
2023-04-17 19:27:41.216 - DEBUG - speech probability: 0.7123238196389171
2023-04-17 19:27:41.295 - DEBUG - speech probability: 0.5744186314762657
2023-04-17 19:27:41.448 - DEBUG - speech probability: 0.5140038699014324
2023-04-17 19:27:41.606 - DEBUG - speech probability: 0.19568735916628419
2023-04-17 19:27:41.686 - DEBUG - speech probability: 0.04007271005723843
2023-04-17 19:27:41.845 - DEBUG - speech probability: 0.055409797254604584
2023-04-17 19:27:41.924 - DEBUG - speech probability: 0.023439845868977315
2023-04-17 19:27:42.083 - DEBUG - speech probability: 0.023360991166509474
2023-04-17 19:27:42.241 - DEBUG - speech probability: 0.039947368706159704
2023-04-17 19:27:42.320 - DEBUG - transformers metadata: {'client_name': 'ovos_dinkum_listener', 'source': 'audio', 'destination': ['skills']}

It's near impossible to stop overfitting and underfitting on a dataset of hugely varying vectors that is split into only two groups, KW and !KW.
Instantly you have a hugely sprawling !KW class, which is any sound that is not the KW and so gets underfitted, while in comparison the very tight grouping of KW gets overfitted.
It's been a while, but the Precise engine also uses an algorithm to get a similar effect to a softmax, and this causes further problems: the small probability from the KW classification is often amplified because the !KW probability is extremely small.
You get a KW hit not because the input matches the KW, but because it doesn't match !KW and so "must be" the KW, and this is due to the bad dataset choice of a vast and sprawling !KW subset.

I have mentioned this several times before, and to stop the pointless arguments this basic truth seems to create, I will just throw down the gauntlet: allow extra classes in the dataset so that !KW can be split, helping create a more balanced dataset.
Nearly all other KWS systems use a non-speech 'Noise' class, and often 'Unknown' is used for !KW speech.
KW stays as it is, since the main balance problem is with !KW; that simple change would have a huge effect on the bad false positives the community has been getting for years, as can be read in historical forum posts.
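
To make that concrete, here is a minimal Keras sketch of the kind of three-class head I mean (KW / unknown speech / noise). The layer sizes and the MFCC input shape are illustrative only, not Precise's actual architecture:

import numpy as np
import tensorflow as tf

# 3-class KWS head: [kw, unknown speech, noise]
model = tf.keras.Sequential([
    tf.keras.Input(shape=(29, 13)),          # assumed MFCC window
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# a binary answer still falls out at the end
probs = model.predict(np.zeros((1, 29, 13), dtype=np.float32))[0]
is_wake_word = probs.argmax() == 0 and probs[0] > 0.7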

When it comes to false positives, testing on the training dataset gives a very false indication and has little worth, since the false positives happen on input that is not in the dataset; more specifically, input that doesn't match !KW is boosting KW probability.

Picovoice do a much better benchmark by using a different dataset to test for false positives, in simple terms by just injecting LibriSpeech.
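
In code terms the idea is just to stream hours of unrelated speech through the detector and count activations; `detector` and `load_chunks` below are hypothetical stand-ins, not a real API:

def false_positives_per_hour(detector, load_chunks, hours, threshold=0.5):
    # every chunk is audio known to contain no wake word
    hits = sum(1 for chunk in load_chunks()
               if detector.predict(chunk) > threshold)
    return hits / hours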

I have never really worked out why the same misconceptions and easy-to-correct bad methods keep getting regurgitated; maybe it's just time and resources, but at times I have wondered if it's vested interest and they are pushed as a Trojan horse, as they are just simple bad methods.
Same with the PS3 Eye mic: its beamforming electronics were in the PS3, and on its own it has no advantages at all without applying at least some basic algorithms.

I agree with a lot of what you said!

I look at precise-trainer as a binary classifier: feed it a dataset and you get a yes/no answer, it's simple.

As is the norm with ML, garbage in means garbage out. If you train a wake word with 10000x more not-ww than ww samples, the model can just learn to never activate at all and still reach ~99.99% accuracy; that is indeed a meaningless way to measure a model.
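
The arithmetic behind that degenerate case is worth spelling out:

# an "always reject" model on a 10000:1 not-ww/ww dataset
not_ww, ww = 10_000, 1
accuracy = not_ww / (not_ww + ww)  # ~0.9999, while detecting nothing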

This is why I introduced some new training strategies, with more to come! Most of these strategies actually have more to do with the data than with the training.

A good illustration of how accuracy can mislead is the train-with-replacement strategy, which gives a much better overview of how the model is performing. It trains on different data each time, always balanced (50/50, configurable), for a few epochs at a time, but tests against the whole (test) dataset. In TensorBoard you see 99.99% test accuracy, but training accuracy (on the current subset of the train set) is much lower, sometimes around 60%. With this strategy we validate the model against much more data than we train it with, without overfitting to specific samples.
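
Roughly, the per-mini-epoch resampling looks like this; a sketch only, with illustrative names, not the trainer's actual code:

import random

def balanced_subset(ww, not_ww, proportion=0.5):
    # draw a fresh 50/50 subset of the training data for one mini-epoch
    n = int(min(len(ww), len(not_ww)) * proportion)
    return random.sample(ww, n) + random.sample(not_ww, n)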

I'm not saying the above is the best way to train precise models, just agreeing that precise training can be very misleading if you only look at accuracy!

I want to extend it to support multi-label classification too, but I am wondering if that should even be called Precise or just something totally new; after all, Precise is Mycroft IP. Loading and training Precise models is one thing, using the name for a totally different implementation would be crossing a (trademark usage) line.

As for the PS3 Eye mic, it is a very cheap USB mic since everyone seems to be selling theirs. It's nice for quick prototypes, since you just plug it into a Pi and test stuff, but in no way a recommended good mic… At the same time, because it's cheap and very available, it's something we should support, as it is representative of the typical low-quality scenarios we want to handle as well as possible by being noise resistant. But yeah, it is often oversold as a great option; it's not very good, just ubiquitous and cheap.

I hacked together a 2-channel delay-sum purely out of frustration that it didn't exist.

From memory, 4 channels is just a parameter and should be easy to change; it's been a while and I have forgotten.
Really it should be polished to use a NEON-accelerated FFT, but the C/C++ hack is at least more passable than Python.
So it is possible to recreate the algorithm in software if you use a DSP-performant language, and the above should be easy to hack and tidy.
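
For anyone curious, the core of an integer-lag two-channel delay-and-sum fits in a few lines of NumPy; a sketch only, since the real thing would use fractional delays and an accelerated FFT:

import numpy as np

def delay_and_sum(ch0, ch1, max_lag=32):
    # find the integer lag that best aligns ch1 to ch0
    corr = np.correlate(ch0, ch1, mode="full")
    zero = len(ch1) - 1                       # index of zero lag
    window = corr[zero - max_lag: zero + max_lag + 1]
    lag = int(np.argmax(window)) - max_lag
    aligned = np.roll(ch1, lag)               # real code would pad, not wrap
    return 0.5 * (ch0 + aligned)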

Getting a binary answer out of a model doesn't mean it needs to be a binary classification; in fact it's better to add classes, where each additional one gives more differentiation.
I only posted because at a quick glance this seemed to be once more a binary dataset of KW & !KW, which is, and historically has been, a bad oversimplification and a bad dataset, likely due to its age, when datasets were much rarer or nonexistent, whereas now numerous ones exist.

I have always presumed Mycroft believe Precise has IP, but really it's just a GRU running on TensorFlow, and I think in reality there is zero.
It's sort of strange really, as it's just a model runner; the real IP is in the model framework, and it's very easy to run and train those models with any smattering of Python.
Creating a dataset, and taking away some of the hugely boring and very tedious nature of making one, has often been overlooked, as it's the dataset that makes a good KWS irrespective of model and training.

UrbanSound8K - Urban Sound Datasets
FSD50K (eval) - FSD50K | Zenodo
https://pdsounds.tuxfamily.org/

These are all good resources, but they do contain a few samples with spoken words that reduce accuracy and need pruning.

The MLCommons Multilingual Spoken Words corpus is the biggest single-word dataset, built by trawling Common Voice with a forced aligner to extract words, and like all the datasets we have it does contain a significant number of bad samples.
As open source we have quite a range of datasets to choose from, but I doubt they come anywhere close to the gold-standard datasets big data has curated.
It's curated datasets we need, so that each user is not having to prune out the dross. GSC (Google Speech Commands) is not a good dataset and never has been; it is purely a benchmark dataset to commonly test models on, but its contents are of poor quality and often incorrectly labelled, which is likely a completely different story to the in-house datasets they do not release.

Precise was never precise and likely never contained any IP, just the semblance of it, likely to attract investors. So yeah, create your own and give it a name, but in reality all it is, is a model framework runner, and that IP is the framework's.
Dataset creation, augmentation and noise mixing never really got a look in, and that is what creates a good model. Those tools, community consensus, a model zoo, and curated versions of the datasets that are available need a common store.
So really it doesn't matter a jot about Precise or any named KWS; all that's needed is a model runner that selects the correct indexes of whatever framework is preferred.